From dc074c2a4c827fc7c4c8b2674245c54f5e0b534b Mon Sep 17 00:00:00 2001
From: George Steed <[email protected]>
Date: Thu, 20 Jun 2024 11:04:52 +0100
Subject: [PATCH] aarch64/pixel-util.S: Improve satd_4x4_neon
The lane-indexed LD1 load instructions imply a dependency on the
previous value of the vector register, since the lanes not being loaded
must be preserved. On larger micro-architectures this introduces an
unnecessary dependency chain that limits the ability of the core to
execute out-of-order.
To avoid introducing this dependency, use a scalar LDR instruction to
load the lowest lane of the vector instead; this zeroes the upper
portion of the vector rather than preserving the previous value of the
upper lanes.
On a Neoverse V2 machine this results in a 62% reduction in times
reported for the SATD 4x4 benchmarks, and a 65% reduction for the SATD
8x4 benchmarks.
---
source/common/aarch64/pixel-util.S | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/source/common/aarch64/pixel-util.S b/source/common/aarch64/pixel-util.S
index 5d8cc8c8e..d8b3f4365 100644
--- a/source/common/aarch64/pixel-util.S
+++ b/source/common/aarch64/pixel-util.S
@@ -609,13 +609,18 @@ endfunc
//******* satd *******
.macro satd_4x4_neon
- ld1 {v0.s}[0], [x0], x1
+ ldr s0, [x0]
+ ldr s1, [x2]
+ add x0, x0, x1
+ add x2, x2, x3
ld1 {v0.s}[1], [x0], x1
- ld1 {v1.s}[0], [x2], x3
ld1 {v1.s}[1], [x2], x3
- ld1 {v2.s}[0], [x0], x1
+
+ ldr s2, [x0]
+ ldr s3, [x2]
+ add x0, x0, x1
+ add x2, x2, x3
ld1 {v2.s}[1], [x0], x1
- ld1 {v3.s}[0], [x2], x3
ld1 {v3.s}[1], [x2], x3
usubl v4.8h, v0.8b, v1.8b
--
2.34.1
_______________________________________________
x265-devel mailing list
[email protected]
https://mailman.videolan.org/listinfo/x265-devel