Hi,

On Tue, 18 Aug 2020, Sebastian Pop wrote:
> Unrolling the outer loop in yuv2planeX by 4 reduces the number of cache
> accesses by 7.5%. The filter values loaded are reused across the 4 unrolled
> iterations, which avoids reloading the same values 3 times. The performance
> was measured on an Arm64 Neoverse-N1 Graviton2 c6g.metal instance with the
> following command:
>
> $ perf stat -e cache-references ./ffmpeg_g -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
>
> before: 1551591469 cache-references
> after:  1436140431 cache-references
>
> before: [bench @ 0xaaaac62b7d30] t:0.013226 avg:0.013219 max:0.013537 min:0.012975
> after:  [bench @ 0xaaaad84f3d30] t:0.012355 avg:0.012381 max:0.013164 min:0.012158
This function (like most in swscale) lacks a checkasm test; it'd be great to have one, to allow for proper microbenchmarks of different tunings on it. The existing tests in tests/checkasm/sw*.c serve as decent examples for how to write a test for it. This isn't a blocker, but checkasm is a great tool for testing asm, and I personally try to add tests for anything I'm writing asm for if there isn't already one.
When targeting Linux, by default checkasm ends up using the perf cycle counter API. This is a bit inexact for fine-tuning asm, so I personally prefer configuring with --disable-linux-perf and loading a kernel module that allows usermode access to the raw cycle counter register - e.g. the module from https://code.videolan.org/janne/arm64-cycle-cnt should work fine.
> diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
> index af71de6050..2cba810e16 100644
> --- a/libswscale/aarch64/output.S
> +++ b/libswscale/aarch64/output.S
> @@ -26,10 +26,13 @@ function ff_yuv2planeX_8_neon, export=1
>          ext             v0.8B, v0.8B, v0.8B, #3     // honor offsetting which can be 0 or 3 only
>  1:      uxtl            v0.8H, v0.8B                // extend dither to 16-bit
>          ushll           v1.4S, v0.4H, #12           // extend dither to 32-bit with left shift by 12 (part 1)
> -        ushll2          v2.4S, v0.8H, #12           // extend dither to 32-bit with left shift by 12 (part 2)
> +        ushll2          v0.4S, v0.8H, #12           // extend dither to 32-bit with left shift by 12 (part 2)
> +        tst             w4, #0x1f                   // check dstW % 32
>          mov             x7, #0                      // i = 0
> +        b.eq            4f                          // jump to outer loop unrolled by 4
> +
>  2:      mov             v3.16B, v1.16B              // initialize accumulator part 1 with dithering value
> -        mov             v4.16B, v2.16B              // initialize accumulator part 2 with dithering value
> +        mov             v4.16B, v0.16B              // initialize accumulator part 2 with dithering value
>          mov             w8, w1                      // tmpfilterSize = filterSize
>          mov             x9, x2                      // srcp = src
>          mov             x10, x0                     // filterp = filter
> @@ -55,4 +58,73 @@ function ff_yuv2planeX_8_neon, export=1
>          add             x7, x7, #8                  // i += 8
>          b.gt            2b                          // loop until width consumed
>          ret
> +
> +// Outer loop unrolled by 4 to maximize reuse of filter values.
> +4:      mov             v2.16b, v0.16b              // initialize accumulator with dithering value
> +        mov             v3.16b, v0.16b              // initialize accumulator with dithering value
> +        mov             v4.16b, v0.16b              // initialize accumulator with dithering value
> +        mov             v5.16b, v0.16b              // initialize accumulator with dithering value
> +        mov             v6.16b, v1.16b              // initialize accumulator with dithering value
> +        mov             v7.16b, v1.16b              // initialize accumulator with dithering value
> +        mov             v16.16b, v1.16b             // initialize accumulator with dithering value
> +        mov             v17.16b, v1.16b             // initialize accumulator with dithering value
For asm like this, I personally prefer to align the operand columns vertically. In this file, the existing code isn't already aligned, so there's not much of an example to follow, but I normally align columns that contain vector references like v16.16b according to the max width of those, and GPR references (and references to the vector registers without the layout appended, like q19) against the max width of those (that is, 3 chars). Columns with other constructs (memory operands etc) are aligned more ad hoc.
It makes the asm a bit more readable IMO.
> +        mov             w8, w1                      // tmpfilterSize = filterSize
> +        mov             x9, x2                      // srcp = src
> +        mov             x10, x0                     // filterp = filter
> +
> +5:      ldp             x15, x14, [x9], #16         // get 2 pointers: src[j] and src[j+1]
> +        ld1r            {v18.8h}, [x10], #2         // read 1x16-bit coeff X at filter[j  ] and duplicate across lanes
> +        ldrh            w12, [x10], #2              // read 1x16-bit coeff Y at filter[j+1]
Here there's a data dependency between the first load, which updates x10, and the second one. Moving the first load from x10 to before the "ldp x15, x14" would help on in-order cores.
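Concretely, the reordering I mean would look roughly like this (untested sketch):

```asm
5:      ld1r            {v18.8h}, [x10], #2         // coeff filter[j]; issued first, so the
                                                    // post-increment of x10 resolves early
        ldp             x15, x14, [x9], #16         // src[j] and src[j+1]; independent of x10
        ldrh            w12, [x10], #2              // coeff filter[j+1]; x10 already available
```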
> +        dup             v27.4h, w12                 // and duplicate across low lanes
> +        dup             v27.8h, w12                 // and duplicate across high lanes
Umm, what? A dup for v27.8h will fill the lower 4 elements as well. I don't see why this needs to be different from the ld1r for the filter[j] coefficients.
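I.e. the second coefficient could be loaded the same way as the first, something like this (untested), which drops both the ldrh and the redundant 4h dup:

```asm
        ld1r            {v18.8h}, [x10], #2         // coeff filter[j],   broadcast to all 8 lanes
        ld1r            {v27.8h}, [x10], #2         // coeff filter[j+1], broadcast to all 8 lanes
```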
> +
> +        add             x15, x15, x7, lsl #1        // &src[j  ][i]
> +        ldp             q19, q20, [x15]             // read 16x16-bit @ src[j  ][i + {0..15}]
> +        ldp             q23, q24, [x15, #32]        // read 16x16-bit @ src[j  ][i + {16..31}]
> +
> +        add             x14, x14, x7, lsl #1        // &src[j+1][i]
> +        ldp             q21, q22, [x14]             // read 16x16-bit @ src[j+1][i + {0..15}]
> +        ldp             q25, q26, [x14, #32]        // read 16x16-bit @ src[j+1][i + {16..31}]
Are the loads from x14/x15 guaranteed to be aligned here? (IIRC ldp does require the addresses to be aligned.) If this were written using ld1 (like the existing original code), alignment wouldn't matter. That disallows using the immediate-offset form though, but an ld1 can load 64 bytes in one go anyway. And doing both adds first, followed by two ld1's, avoids the data dependencies a bit.
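As a sketch (untested; note this renumbers the loaded registers compared to the patch, so the smlal operands below would need to change accordingly):

```asm
        add             x15, x15, x7, lsl #1        // &src[j  ][i]
        add             x14, x14, x7, lsl #1        // &src[j+1][i]
        ld1             {v19.8h-v22.8h}, [x15]      // 64 bytes @ src[j  ][i + {0..31}], no alignment requirement
        ld1             {v23.8h-v26.8h}, [x14]      // 64 bytes @ src[j+1][i + {0..31}]
```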
> +
> +        smlal           v17.4s, v19.4h, v18.4h      // val0 += {A,B,C,D} * X
> +        smlal           v16.4s, v20.4h, v18.4h      // ... (see comments in non unrolled code)
> +        smlal           v7.4s,  v23.4h, v18.4h
> +        smlal           v6.4s,  v24.4h, v18.4h
> +
> +        smlal2          v2.4s,  v19.8h, v18.8h
> +        smlal2          v3.4s,  v20.8h, v18.8h
> +        smlal2          v4.4s,  v23.8h, v18.8h
> +        smlal2          v5.4s,  v24.8h, v18.8h
> +
> +        smlal           v17.4s, v21.4h, v27.4h
> +        smlal           v16.4s, v22.4h, v27.4h
> +        smlal           v7.4s,  v25.4h, v27.4h
> +        smlal           v6.4s,  v26.4h, v27.4h
> +
> +        smlal2          v2.4s,  v21.8h, v27.8h
> +        smlal2          v3.4s,  v22.8h, v27.8h
> +        smlal2          v4.4s,  v25.8h, v27.8h
> +        smlal2          v5.4s,  v26.8h, v27.8h
> +
> +        subs            w8, w8, #2                  // tmpfilterSize -= 2
I'd suggest moving the subs earlier, to avoid the dependency between the subs and the b.gt below.
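E.g. something like this (untested), putting it before the last smlal2 group:

```asm
        subs            w8, w8, #2                  // tmpfilterSize -= 2; flags ready well before b.gt
        smlal2          v2.4s,  v21.8h, v27.8h
        smlal2          v3.4s,  v22.8h, v27.8h
        smlal2          v4.4s,  v25.8h, v27.8h
        smlal2          v5.4s,  v26.8h, v27.8h
        b.gt            5b                          // loop until filterSize consumed
```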
> +        b.gt            5b                          // loop until filterSize consumed
> +
> +        sqshrun         v17.4h, v17.4s, #16         // clip16(val0>>16)
> +        sqshrun         v16.4h, v16.4s, #16         // ... (see comments in non unrolled code)
> +        sqshrun         v7.4h,  v7.4s,  #16
> +        sqshrun         v6.4h,  v6.4s,  #16
> +        sqshrun2        v17.8h, v2.4s,  #16
> +        sqshrun2        v16.8h, v3.4s,  #16
> +        sqshrun2        v7.8h,  v4.4s,  #16
> +        sqshrun2        v6.8h,  v5.4s,  #16
> +        uqshrn          v2.8b,  v17.8h, #3
> +        uqshrn          v3.8b,  v16.8h, #3
> +        uqshrn          v4.8b,  v7.8h,  #3
> +        uqshrn          v5.8b,  v6.8h,  #3
> +        stp             d2, d3, [x3], #16           // write to destination
> +        stp             d4, d5, [x3], #16           // write to destination
> +        subs            w4, w4, #32                 // dstW -= 32
Maybe move the subs even further back here, between the uqshrn and the stp?
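I.e. something like this (untested):

```asm
        uqshrn          v4.8b,  v7.8h,  #3
        uqshrn          v5.8b,  v6.8h,  #3
        subs            w4, w4, #32                 // dstW -= 32; decoupled from the final b.gt
        stp             d2, d3, [x3], #16           // write to destination
        stp             d4, d5, [x3], #16
```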
And the stp could be converted into one single st1 here as well.

// Martin