Hi,

On Tue, 18 Aug 2020, Sebastian Pop wrote:

Unrolling the outer loop in yuv2planeX by 4 reduces the number of cache
accesses by 7.5%.
The filter values loaded are reused across the 4 unrolled iterations,
which avoids reloading the same values 3 times.
The performance was measured on an Arm64 Neoverse-N1 Graviton2 c6g.metal
instance with the following command:
$ perf stat -e cache-references ./ffmpeg_g -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -

before: 1551591469      cache-references
after:  1436140431      cache-references

before: [bench @ 0xaaaac62b7d30] t:0.013226 avg:0.013219 max:0.013537 min:0.012975
after:  [bench @ 0xaaaad84f3d30] t:0.012355 avg:0.012381 max:0.013164 min:0.012158

This function (like most in swscale) lacks a checkasm test; it'd be great to have one, to allow proper microbenchmarking of different tunings of it. The existing tests in tests/checkasm/sw*.c serve as decent examples of how to write one. This isn't a blocker, but checkasm is a great tool for testing asm, and I personally try to add tests for anything I'm writing asm for if there isn't already one.

When targeting Linux, checkasm ends up using the perf cycle counter API by default. This is a bit inexact for fine-tuning asm, so personally I also prefer configuring with --disable-linux-perf and loading a kernel module that allows user-mode access to the raw cycle counter register - e.g. the module from https://code.videolan.org/janne/arm64-cycle-cnt should work fine.


diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
index af71de6050..2cba810e16 100644
--- a/libswscale/aarch64/output.S
+++ b/libswscale/aarch64/output.S
@@ -26,10 +26,13 @@ function ff_yuv2planeX_8_neon, export=1
         ext                 v0.8B, v0.8B, v0.8B, #3         // honor offsetting which can be 0 or 3 only
 1:      uxtl                v0.8H, v0.8B                    // extend dither to 16-bit
         ushll               v1.4S, v0.4H, #12               // extend dither to 32-bit with left shift by 12 (part 1)
-        ushll2              v2.4S, v0.8H, #12               // extend dither to 32-bit with left shift by 12 (part 2)
+        ushll2              v0.4S, v0.8H, #12               // extend dither to 32-bit with left shift by 12 (part 2)
+        tst                 w4, #0x1f                       // check dstW % 32
         mov                 x7, #0                          // i = 0
+        b.eq                4f                              // jump to outer loop unrolled by 4
+
 2:      mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
-        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
+        mov                 v4.16B, v0.16B                  // initialize accumulator part 2 with dithering value
         mov                 w8, w1                          // tmpfilterSize = filterSize
         mov                 x9, x2                          // srcp    = src
         mov                 x10, x0                         // filterp = filter
@@ -55,4 +58,73 @@ function ff_yuv2planeX_8_neon, export=1
         add                 x7, x7, #8                      // i += 8
         b.gt                2b                              // loop until width consumed
         ret
+
+// Outer loop unrolled by 4 to maximize reuse of filter values.
+4:      mov                 v2.16b, v0.16b                  // initialize accumulator with dithering value
+        mov                 v3.16b, v0.16b                  // initialize accumulator with dithering value
+        mov                 v4.16b, v0.16b                  // initialize accumulator with dithering value
+        mov                 v5.16b, v0.16b                  // initialize accumulator with dithering value
+        mov                 v6.16b, v1.16b                  // initialize accumulator with dithering value
+        mov                 v7.16b, v1.16b                  // initialize accumulator with dithering value
+        mov                 v16.16b, v1.16b                 // initialize accumulator with dithering value
+        mov                 v17.16b, v1.16b                 // initialize accumulator with dithering value

For asm like this, I personally prefer to align the columns vertically. In this file, the existing code isn't already aligned, so there's not much of an example to follow, but I normally align columns that contain vector references like v16.16b according to the max width of those, and GPR references (and references to vector registers without the layout appended, like q19) according to the max size for those (that is, 3 chars). Columns with other constructs (memory operands etc) are aligned more ad hoc.

It makes the asm a bit more readable IMO.
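
As a rough illustration (the operands below are just picked from the patch, this isn't a suggested diff):

        mov                 v4.16b,  v0.16b
        mov                 v17.16b, v1.16b
        mov                 w8,  w1
        mov                 x10, x0

i.e. the vN.16b references padded to the widest one in their column, and the plain GPR references padded to 3 chars.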

+        mov                 w8, w1                          // tmpfilterSize = filterSize
+        mov                 x9, x2                          // srcp    = src
+        mov                 x10, x0                         // filterp = filter
+
+5:      ldp                 x15, x14, [x9], #16             // get 2 pointers: src[j] and src[j+1]
+        ld1r                {v18.8h}, [x10], #2             // read 1x16-bit coeff X at filter[j  ] and duplicate across lanes
+        ldrh                w12, [x10], #2                  // read 1x16-bit coeff Y at filter[j+1]

Here there's a data dependency between the first load, which updates x10, and the second one. Moving the first load from x10 to before the "ldp x15, x14" would help on in-order cores.
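
E.g. roughly like this (untested sketch, keeping the same registers as in the patch), so that the ldp separates the two loads that touch x10:

5:      ld1r                {v18.8h}, [x10], #2             // read 1x16-bit coeff X at filter[j  ] and duplicate across lanes
        ldp                 x15, x14, [x9], #16             // get 2 pointers: src[j] and src[j+1]
        ldrh                w12, [x10], #2                  // read 1x16-bit coeff Y at filter[j+1]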

+        dup                 v27.4h, w12                     // and duplicate across low lanes
+        dup                 v27.8h, w12                     // and duplicate across high lanes

Umm, what? A dup for v27.8h will fill the lower 4 elements as well. I don't see why this needs to be different from the ld1r for the filter[j] coefficient.
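
I.e. just (sketch):

        ld1r                {v27.8h}, [x10], #2             // read 1x16-bit coeff Y at filter[j+1] and duplicate across lanes

which also frees up w12.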

+
+        add                 x15, x15, x7, lsl #1            // &src[j  ][i]
+        ldp                 q19, q20, [x15]                 // read 16x16-bit @ src[j  ][i + {0..15}]
+        ldp                 q23, q24, [x15, #32]            // read 16x16-bit @ src[j  ][i + {16..31}]
+
+        add                 x14, x14, x7, lsl #1            // &src[j+1][i]
+        ldp                 q21, q22, [x14]                 // read 16x16-bit @ src[j+1][i + {0..15}]
+        ldp                 q25, q26, [x14, #32]            // read 16x16-bit @ src[j+1][i + {16..31}]

Are the loads from x14/x15 guaranteed to be aligned here? (IIRC ldp does require the addresses to be aligned.) If this were written using ld1 (like the existing original code), alignment wouldn't matter. That disallows using the pre-indexed form though, but an ld1 can load 64 bytes in one go anyway. And doing both adds first, followed by two ld1s, avoids the data dependencies a bit.
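
Roughly like this (sketch only - an ld1 with a 4-register list needs consecutive registers, so the source temporaries would have to be renumbered; the register choice below is just illustrative and the smlal operands would need adjusting to match):

        add                 x15, x15, x7, lsl #1            // &src[j  ][i]
        add                 x14, x14, x7, lsl #1            // &src[j+1][i]
        ld1                 {v19.8h-v22.8h}, [x15]          // read 32x16-bit @ src[j  ][i + {0..31}]
        ld1                 {v23.8h-v26.8h}, [x14]          // read 32x16-bit @ src[j+1][i + {0..31}]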

+
+        smlal               v17.4s, v19.4h, v18.4h          // val0 += {A,B,C,D} * X
+        smlal               v16.4s, v20.4h, v18.4h          // ... (see comments in non unrolled code)
+        smlal               v7.4s, v23.4h, v18.4h
+        smlal               v6.4s, v24.4h, v18.4h
+
+        smlal2              v2.4s, v19.8h, v18.8h
+        smlal2              v3.4s, v20.8h, v18.8h
+        smlal2              v4.4s, v23.8h, v18.8h
+        smlal2              v5.4s, v24.8h, v18.8h
+
+        smlal               v17.4s, v21.4h, v27.4h
+        smlal               v16.4s, v22.4h, v27.4h
+        smlal               v7.4s, v25.4h, v27.4h
+        smlal               v6.4s, v26.4h, v27.4h
+
+        smlal2              v2.4s, v21.8h, v27.8h
+        smlal2              v3.4s, v22.8h, v27.8h
+        smlal2              v4.4s, v25.8h, v27.8h
+        smlal2              v5.4s, v26.8h, v27.8h
+
+        subs                w8, w8, #2                      // tmpfilterSize -= 2

I'd suggest moving the subs earlier, to avoid the dependency between the subs and the b.gt below.
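
For instance (sketch), pulling the subs up past the last block of smlal2, so the flags are ready well before the branch:

        subs                w8, w8, #2                      // tmpfilterSize -= 2
        smlal2              v2.4s, v21.8h, v27.8h
        smlal2              v3.4s, v22.8h, v27.8h
        smlal2              v4.4s, v25.8h, v27.8h
        smlal2              v5.4s, v26.8h, v27.8h
        b.gt                5b                              // loop until filterSize consumed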

+        b.gt                5b                              // loop until filterSize consumed
+
+        sqshrun             v17.4h, v17.4s, #16             // clip16(val0>>16)
+        sqshrun             v16.4h, v16.4s, #16             // ... (see comments in non unrolled code)
+        sqshrun             v7.4h, v7.4s, #16
+        sqshrun             v6.4h, v6.4s, #16
+        sqshrun2            v17.8h, v2.4s, #16
+        sqshrun2            v16.8h, v3.4s, #16
+        sqshrun2            v7.8h, v4.4s, #16
+        sqshrun2            v6.8h, v5.4s, #16
+        uqshrn              v2.8b, v17.8h, #3
+        uqshrn              v3.8b, v16.8h, #3
+        uqshrn              v4.8b, v7.8h, #3
+        uqshrn              v5.8b, v6.8h, #3
+        stp                 d2, d3, [x3], #16               // write to destination
+        stp                 d4, d5, [x3], #16               // write to destination
+        subs                w4, w4, #32                     // dstW -= 32

Maybe move the subs even further back here, between the uqshrn and the stp?

And the two stp could be converted into one single st1 here as well.
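
I.e. the tail could end up roughly like this (untested sketch):

        uqshrn              v2.8b, v17.8h, #3
        uqshrn              v3.8b, v16.8h, #3
        uqshrn              v4.8b, v7.8h, #3
        uqshrn              v5.8b, v6.8h, #3
        subs                w4, w4, #32                     // dstW -= 32
        st1                 {v2.8b-v5.8b}, [x3], #32        // write 32 bytes to destination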


// Martin
