Hi George,
Looks good, thanks for the patch.

Regards,
Chen

At 2024-12-17 01:02:11, "George Steed" <[email protected]> wrote:
>The lane-indexed LD1 load instructions imply a dependency on the
>previous value of the vector register to maintain the values in lanes
>not loaded. On larger micro-architectures this introduces an unnecessary
>dependency chain which limits the ability of the core to execute
>out-of-order.
>
>To avoid this dependency being introduced, simply use the scalar LDR
>instructions to load the lowest lane of the vector, this has the effect
>of zeroing the top portion of the vector rather than trying to maintain
>the previous value of the upper lanes.
>
>On a Neoverse V2 machine this results in a 62% reduction in times
>reported for the SATD 4x4 benchmarks, and a 65% reduction for the SATD
>8x4 benchmarks.
>---
> source/common/aarch64/pixel-util.S | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
>diff --git a/source/common/aarch64/pixel-util.S b/source/common/aarch64/pixel-util.S
>index 5d8cc8c8e..d8b3f4365 100644
>--- a/source/common/aarch64/pixel-util.S
>+++ b/source/common/aarch64/pixel-util.S
>@@ -609,13 +609,18 @@ endfunc
>
> //******* satd *******
> .macro satd_4x4_neon
>-    ld1 {v0.s}[0], [x0], x1
>+    ldr s0, [x0]
>+    ldr s1, [x2]
>+    add x0, x0, x1
>+    add x2, x2, x3
>     ld1 {v0.s}[1], [x0], x1
>-    ld1 {v1.s}[0], [x2], x3
>     ld1 {v1.s}[1], [x2], x3
>-    ld1 {v2.s}[0], [x0], x1
>+
>+    ldr s2, [x0]
>+    ldr s3, [x2]
>+    add x0, x0, x1
>+    add x2, x2, x3
>     ld1 {v2.s}[1], [x0], x1
>-    ld1 {v3.s}[0], [x2], x3
>     ld1 {v3.s}[1], [x2], x3
>
>     usubl v4.8h, v0.8b, v1.8b
>--
>2.34.1
>
_______________________________________________
x265-devel mailing list
[email protected]
https://mailman.videolan.org/listinfo/x265-devel
