From dc074c2a4c827fc7c4c8b2674245c54f5e0b534b Mon Sep 17 00:00:00 2001
From: George Steed <[email protected]>
Date: Thu, 20 Jun 2024 11:04:52 +0100
Subject: [PATCH] aarch64/pixel-util.S: Improve satd_4x4_neon
The lane-indexed LD1 load instructions imply a dependency on the
previous value of the vector register, since the lanes not being loaded
must be preserved. On larger micro-architectures this introduces an
unnecessary dependency chain that limits the ability of the core to
execute out-of-order.
To avoid introducing this dependency, use a scalar LDR instruction to
load the lowest lane of the vector instead; this zeroes the upper
portion of the vector rather than preserving the previous value of the
upper lanes.
On a Neoverse V2 machine this results in a 62% reduction in times
reported for the SATD 4x4 benchmarks, and a 65% reduction for the SATD
8x4 benchmarks.
---
source/common/aarch64/pixel-util.S | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/source/common/aarch64/pixel-util.S b/source/common/aarch64/pixel-util.S
index 5d8cc8c8e..d8b3f4365 100644
--- a/source/common/aarch64/pixel-util.S
+++ b/source/common/aarch64/pixel-util.S
@@ -609,13 +609,18 @@ endfunc
//******* satd *******
.macro satd_4x4_neon
- ld1 {v0.s}[0], [x0], x1
+ ldr s0, [x0]
+ ldr s1, [x2]
+ add x0, x0, x1
+ add x2, x2, x3
ld1 {v0.s}[1], [x0], x1
- ld1 {v1.s}[0], [x2], x3
ld1 {v1.s}[1], [x2], x3
- ld1 {v2.s}[0], [x0], x1
+
+ ldr s2, [x0]
+ ldr s3, [x2]
+ add x0, x0, x1
+ add x2, x2, x3
ld1 {v2.s}[1], [x0], x1
- ld1 {v3.s}[0], [x2], x3
ld1 {v3.s}[1], [x2], x3
usubl v4.8h, v0.8b, v1.8b
--
2.34.1
_______________________________________________
x265-devel mailing list
[email protected]
https://mailman.videolan.org/listinfo/x265-devel