Hi,

On Tue, 18 Aug 2020, Sebastian Pop wrote:
> Unrolling the outer loop in yuv2planeX by 4 reduces the number of cache
> accesses by 7.5%. The filter values loaded are reused across the 4 unrolled
> iterations, which avoids reloading the same values 3 times. The performance
> was measured on an Arm64 Neoverse-N1 Graviton2 c6g.metal instance with the
> following command:
>
> $ perf stat -e cache-references ./ffmpeg_g -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
>
> before: 1551591469 cache-references
> after:  1436140431 cache-references
>
> before: [bench @ 0xaaaac62b7d30] t:0.013226 avg:0.013219 max:0.013537 min:0.012975
> after:  [bench @ 0xaaaad84f3d30] t:0.012355 avg:0.012381 max:0.013164 min:0.012158
This function (like most in swscale) lacks a checkasm test; it'd be great to have one, to allow for proper microbenchmarks of different tunings on it. The existing tests in tests/checkasm/sw*.c serve as decent examples for how to write a test for it. This isn't a blocker, but checkasm is a great tool for testing asm, and I personally try to add tests for anything I'm writing asm for if there isn't already one.
When targeting Linux, by default checkasm ends up using the perf cycle counter API. This is a bit inexact for fine-tuning asm, so I personally prefer configuring with --disable-linux-perf and loading a kernel module that allows usermode access to the raw cycle counter register - e.g. the module from https://code.videolan.org/janne/arm64-cycle-cnt should work fine.
> diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
> index af71de6050..2cba810e16 100644
> --- a/libswscale/aarch64/output.S
> +++ b/libswscale/aarch64/output.S
> @@ -26,10 +26,13 @@ function ff_yuv2planeX_8_neon, export=1
>          ext             v0.8B, v0.8B, v0.8B, #3     // honor offsetting which can be 0 or 3 only
>  1:      uxtl            v0.8H, v0.8B                // extend dither to 16-bit
>          ushll           v1.4S, v0.4H, #12           // extend dither to 32-bit with left shift by 12 (part 1)
> -        ushll2          v2.4S, v0.8H, #12           // extend dither to 32-bit with left shift by 12 (part 2)
> +        ushll2          v0.4S, v0.8H, #12           // extend dither to 32-bit with left shift by 12 (part 2)
> +        tst             w4, #0x1f                   // check dstW % 32
>          mov             x7, #0                      // i = 0
> +        b.eq            4f                          // jump to outer loop unrolled by 4
> +
>  2:      mov             v3.16B, v1.16B              // initialize accumulator part 1 with dithering value
> -        mov             v4.16B, v2.16B              // initialize accumulator part 2 with dithering value
> +        mov             v4.16B, v0.16B              // initialize accumulator part 2 with dithering value
>          mov             w8, w1                      // tmpfilterSize = filterSize
>          mov             x9, x2                      // srcp = src
>          mov             x10, x0                     // filterp = filter
> @@ -55,4 +58,73 @@ function ff_yuv2planeX_8_neon, export=1
>          add             x7, x7, #8                  // i += 8
>          b.gt            2b                          // loop until width consumed
>          ret
> +
> +// Outer loop unrolled by 4 to maximize reuse of filter values.
> +4:      mov             v2.16b, v0.16b              // initialize accumulator with dithering value
> +        mov             v3.16b, v0.16b              // initialize accumulator with dithering value
> +        mov             v4.16b, v0.16b              // initialize accumulator with dithering value
> +        mov             v5.16b, v0.16b              // initialize accumulator with dithering value
> +        mov             v6.16b, v1.16b              // initialize accumulator with dithering value
> +        mov             v7.16b, v1.16b              // initialize accumulator with dithering value
> +        mov             v16.16b, v1.16b             // initialize accumulator with dithering value
> +        mov             v17.16b, v1.16b             // initialize accumulator with dithering value
For asm like this, I personally prefer to align the operand columns vertically. In this file, the existing code isn't already aligned, so there's not much of an example to follow, but I normally align columns that contain vector references like v16.16b according to the max width of those, and GPR references (and references to the vector registers without the layout appended, like q19) against the max width of those (that is, 3 chars). Columns with other constructs (memory operands etc) are aligned more ad hoc.
It makes the asm a bit more readable IMO.
> +        mov             w8, w1                      // tmpfilterSize = filterSize
> +        mov             x9, x2                      // srcp = src
> +        mov             x10, x0                     // filterp = filter
> +
> +5:      ldp             x15, x14, [x9], #16         // get 2 pointers: src[j] and src[j+1]
> +        ld1r            {v18.8h}, [x10], #2         // read 1x16-bit coeff X at filter[j  ] and duplicate across lanes
> +        ldrh            w12, [x10], #2              // read 1x16-bit coeff Y at filter[j+1]
Here there's a data dependency between the first load, which updates x10, and the second one. Moving the first load from x10 to before the "ldp x15, x14" would help on in-order cores.
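Concretely, the reordering I mean would look roughly like this (untested sketch):

```asm
5:      ld1r            {v18.8h}, [x10], #2         // coeff filter[j]; issued first, so the
                                                    // post-increment of x10 resolves early
        ldp             x15, x14, [x9], #16         // src[j] and src[j+1]; independent of x10
        ldrh            w12, [x10], #2              // coeff filter[j+1]; x10 already available
```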
> +        dup             v27.4h, w12                 // and duplicate across low lanes
> +        dup             v27.8h, w12                 // and duplicate across high lanes
Umm, what? A dup for v27.8h will fill the lower 4 elements as well. I don't see why this needs to be different from the ld1r for the filter[j] coefficients.
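I.e. the second coefficient could be loaded the same way as the first, something like this (untested), which drops both the ldrh and the redundant 4h dup:

```asm
        ld1r            {v18.8h}, [x10], #2         // coeff filter[j],   broadcast to all 8 lanes
        ld1r            {v27.8h}, [x10], #2         // coeff filter[j+1], broadcast to all 8 lanes
```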
> +
> +        add             x15, x15, x7, lsl #1        // &src[j  ][i]
> +        ldp             q19, q20, [x15]             // read 16x16-bit @ src[j  ][i + {0..15}]
> +        ldp             q23, q24, [x15, #32]        // read 16x16-bit @ src[j  ][i + {16..31}]
> +
> +        add             x14, x14, x7, lsl #1        // &src[j+1][i]
> +        ldp             q21, q22, [x14]             // read 16x16-bit @ src[j+1][i + {0..15}]
> +        ldp             q25, q26, [x14, #32]        // read 16x16-bit @ src[j+1][i + {16..31}]
Are the loads from x14/x15 guaranteed to be aligned here? (IIRC ldp does require the addresses to be aligned.) If this were written using ld1 (like the existing original code), alignment wouldn't matter. That disallows using the immediate-offset form though, but an ld1 can load 64 bytes in one go anyway. And doing both adds first, followed by two ld1's, avoids the data dependencies a bit.
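As a sketch (untested; note this renumbers the loaded registers compared to the patch, so the smlal operands below would need to change accordingly):

```asm
        add             x15, x15, x7, lsl #1        // &src[j  ][i]
        add             x14, x14, x7, lsl #1        // &src[j+1][i]
        ld1             {v19.8h-v22.8h}, [x15]      // 64 bytes @ src[j  ][i + {0..31}], no alignment requirement
        ld1             {v23.8h-v26.8h}, [x14]      // 64 bytes @ src[j+1][i + {0..31}]
```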
> +
> +        smlal           v17.4s, v19.4h, v18.4h      // val0 += {A,B,C,D} * X
> +        smlal           v16.4s, v20.4h, v18.4h      // ... (see comments in non unrolled code)
> +        smlal           v7.4s,  v23.4h, v18.4h
> +        smlal           v6.4s,  v24.4h, v18.4h
> +
> +        smlal2          v2.4s,  v19.8h, v18.8h
> +        smlal2          v3.4s,  v20.8h, v18.8h
> +        smlal2          v4.4s,  v23.8h, v18.8h
> +        smlal2          v5.4s,  v24.8h, v18.8h
> +
> +        smlal           v17.4s, v21.4h, v27.4h
> +        smlal           v16.4s, v22.4h, v27.4h
> +        smlal           v7.4s,  v25.4h, v27.4h
> +        smlal           v6.4s,  v26.4h, v27.4h
> +
> +        smlal2          v2.4s,  v21.8h, v27.8h
> +        smlal2          v3.4s,  v22.8h, v27.8h
> +        smlal2          v4.4s,  v25.8h, v27.8h
> +        smlal2          v5.4s,  v26.8h, v27.8h
> +
> +        subs            w8, w8, #2                  // tmpfilterSize -= 2
I'd suggest moving the subs earlier, to avoid the dependency between the subs and the b.gt below.
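E.g. something like this (untested), putting it before the last smlal2 group:

```asm
        subs            w8, w8, #2                  // tmpfilterSize -= 2; flags ready well before b.gt
        smlal2          v2.4s,  v21.8h, v27.8h
        smlal2          v3.4s,  v22.8h, v27.8h
        smlal2          v4.4s,  v25.8h, v27.8h
        smlal2          v5.4s,  v26.8h, v27.8h
        b.gt            5b                          // loop until filterSize consumed
```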
> +        b.gt            5b                          // loop until filterSize consumed
> +
> +        sqshrun         v17.4h, v17.4s, #16         // clip16(val0>>16)
> +        sqshrun         v16.4h, v16.4s, #16         // ... (see comments in non unrolled code)
> +        sqshrun         v7.4h,  v7.4s,  #16
> +        sqshrun         v6.4h,  v6.4s,  #16
> +        sqshrun2        v17.8h, v2.4s,  #16
> +        sqshrun2        v16.8h, v3.4s,  #16
> +        sqshrun2        v7.8h,  v4.4s,  #16
> +        sqshrun2        v6.8h,  v5.4s,  #16
> +        uqshrn          v2.8b,  v17.8h, #3
> +        uqshrn          v3.8b,  v16.8h, #3
> +        uqshrn          v4.8b,  v7.8h,  #3
> +        uqshrn          v5.8b,  v6.8h,  #3
> +        stp             d2, d3, [x3], #16           // write to destination
> +        stp             d4, d5, [x3], #16           // write to destination
> +        subs            w4, w4, #32                 // dstW -= 32
Maybe move the subs even further back here, between the uqshrn and the stp?
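I.e. something like this (untested):

```asm
        uqshrn          v4.8b,  v7.8h,  #3
        uqshrn          v5.8b,  v6.8h,  #3
        subs            w4, w4, #32                 // dstW -= 32; decoupled from the final b.gt
        stp             d2, d3, [x3], #16           // write to destination
        stp             d4, d5, [x3], #16
```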
And the stp could be converted into one single st1 here as well.

// Martin