Hi Sebastian,
The code looks good.
I guess you don't see much performance improvement for two reasons:
1. Replacing 5 instructions with another 5 instructions may give similar
performance on an out-of-order (OOO) CPU, but different performance on an in-order CPU.
2. There is a potential pipeline stall that may affect performance.
Regards,
Min Chen
At 2021-07-20 12:45:03, "Pop, Sebastian" <s...@amazon.com> wrote:
Thanks Min Chen for your reviews.
I tried your suggestion to remove one of the FP->GPR transfers.
With the following patch I do not see any improvement for the 64x routines, and
the number of instructions remains the same:
--- a/source/common/aarch64/sad-a.S
+++ b/source/common/aarch64/sad-a.S
@@ -137,14 +137,14 @@
add v16.8h, v16.8h, v17.8h
add v17.8h, v18.8h, v19.8h
add v16.8h, v16.8h, v17.8h
- uaddlv s0, v16.8h
- fmov w0, s0
+ uaddlp v16.4s, v16.8h
Using v16 immediately after the preceding ADD may cause a pipeline stall.
add v18.8h, v20.8h, v21.8h
add v19.8h, v22.8h, v23.8h
add v17.8h, v18.8h, v19.8h
- uaddlv s1, v17.8h
- fmov w1, s1
- add w0, w0, w1
+ uaddlp v17.4s, v17.8h
+ add v16.4s, v16.4s, v17.4s
+ uaddlv d0, v16.4s
+ fmov x0, d0
ret
.endm
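As a sanity check on the hunk above, here is a minimal scalar sketch in C of the two reduction sequences, assuming each instruction behaves lane-wise as modelled here. The helper names (`uaddlv_8h`, `uaddlp_8h`, `uaddlv_4s`, `reduce_old`, `reduce_new`, `self_check`) are hypothetical and are not part of x265. It only illustrates that both sequences should yield the same sum; it says nothing about latency or stalls.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of UADDLV Sd, Vn.8H: sum eight u16 lanes into one u32. */
static uint32_t uaddlv_8h(const uint16_t v[8]) {
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++) sum += v[i];
    return sum;
}

/* Scalar model of UADDLP Vd.4S, Vn.8H: add adjacent u16 pairs into u32 lanes. */
static void uaddlp_8h(uint32_t out[4], const uint16_t v[8]) {
    for (int i = 0; i < 4; i++) out[i] = (uint32_t)v[2 * i] + v[2 * i + 1];
}

/* Scalar model of UADDLV Dd, Vn.4S: sum four u32 lanes into one u64. */
static uint64_t uaddlv_4s(const uint32_t v[4]) {
    uint64_t sum = 0;
    for (int i = 0; i < 4; i++) sum += v[i];
    return sum;
}

/* Removed lines: two UADDLV + FMOV pairs, then a 32-bit ADD in GPRs. */
static uint32_t reduce_old(const uint16_t a[8], const uint16_t b[8]) {
    return uaddlv_8h(a) + uaddlv_8h(b);
}

/* Added lines: two UADDLP, one vector ADD, then a single UADDLV + FMOV. */
static uint64_t reduce_new(const uint16_t a[8], const uint16_t b[8]) {
    uint32_t pa[4], pb[4], s[4];
    uaddlp_8h(pa, a);
    uaddlp_8h(pb, b);
    for (int i = 0; i < 4; i++) s[i] = pa[i] + pb[i]; /* add v16.4s, v16.4s, v17.4s */
    return uaddlv_4s(s);
}

/* Compare both sequences on the worst case, where every lane is 0xFFFF. */
static int self_check(void) {
    uint16_t a[8], b[8];
    for (int i = 0; i < 8; i++) { a[i] = 0xFFFF; b[i] = 0xFFFF; }
    assert((uint64_t)reduce_old(a, b) == reduce_new(a, b));
    return 1;
}
```

Even with all sixteen 16-bit lanes saturated, the total fits comfortably in 32 bits, so in this model the 64-bit result of the new sequence matches the 32-bit result of the old one.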
Please see the amended patch with your recommended change.
Thanks,
Sebastian
_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel