Hi Sebastian,

The code looks good.


I suspect you are not seeing much of a performance improvement for two reasons:
1. Replacing 5 instructions with another 5 instructions may give similar 
performance on an out-of-order (OOO) CPU, but different performance on an in-order CPU.
2. There is a potential pipeline stall that may affect performance.
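
To make point 1 concrete, here is a rough scalar C sketch of the two reduction sequences from the patch quoted below. The function names are my own stand-ins for the NEON instructions, not x265 code, and the model only captures the arithmetic, not the timing. Both sequences are five instructions long and produce the same sum; the new one simply has one FP->GPR transfer instead of two.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-ins for the NEON instructions in the patch.
 * All names here are hypothetical helpers for illustration only. */

/* uaddlv sN, vM.8h: sum eight unsigned 16-bit lanes into a 32-bit value. */
uint32_t uaddlv_8h(const uint16_t v[8])
{
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += v[i];
    return sum;
}

/* uaddlp vD.4s, vN.8h: add adjacent 16-bit lanes into four 32-bit lanes. */
void uaddlp_8h(uint32_t out[4], const uint16_t v[8])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)v[2 * i] + v[2 * i + 1];
}

/* uaddlv dN, vM.4s: sum four unsigned 32-bit lanes into a 64-bit value. */
uint64_t uaddlv_4s(const uint32_t v[4])
{
    uint64_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += v[i];
    return sum;
}

/* Original tail: uaddlv, fmov, uaddlv, fmov, add (two FP->GPR transfers). */
uint32_t reduce_old(const uint16_t v16[8], const uint16_t v17[8])
{
    uint32_t w0 = uaddlv_8h(v16);
    uint32_t w1 = uaddlv_8h(v17);
    return w0 + w1;
}

/* Patched tail: uaddlp, uaddlp, add, uaddlv, fmov (one FP->GPR transfer). */
uint64_t reduce_new(const uint16_t v16[8], const uint16_t v17[8])
{
    uint32_t a[4], b[4], s[4];
    uaddlp_8h(a, v16);
    uaddlp_8h(b, v17);
    for (int i = 0; i < 4; i++)
        s[i] = a[i] + b[i];           /* add v16.4s, v16.4s, v17.4s */
    return uaddlv_4s(s);
}

/* Self-check on sample data: both tails must give the same total. */
void check_sequences_agree(void)
{
    uint16_t v16[8], v17[8];
    for (int i = 0; i < 8; i++) {
        v16[i] = 0xFFFF;              /* every lane at its maximum value */
        v17[i] = (uint16_t)(1000 * i);
    }
    assert(reduce_old(v16, v17) == reduce_new(v16, v17));
}
```

Since the result and the instruction count are unchanged, any difference would have to come from latency and scheduling, which depends on the core.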


Regards,
Min Chen



At 2021-07-20 12:45:03, "Pop, Sebastian" <s...@amazon.com> wrote:

Thanks Min Chen for your reviews.

I tried your suggestion to remove one of the FP->GPR transfers.

With the following patch I do not see any improvement for the 64x routines, and 
the number of instructions remains the same:

 

--- a/source/common/aarch64/sad-a.S
+++ b/source/common/aarch64/sad-a.S
@@ -137,14 +137,14 @@
     add             v16.8h, v16.8h, v17.8h
     add             v17.8h, v18.8h, v19.8h
     add             v16.8h, v16.8h, v17.8h
-    uaddlv          s0,  v16.8h
-    fmov            w0,  s0
+    uaddlp          v16.4s, v16.8h

Using v16 immediately after the ADD instruction that produces it may cause a pipeline stall.

     add             v18.8h, v20.8h, v21.8h
     add             v19.8h, v22.8h, v23.8h
     add             v17.8h, v18.8h, v19.8h
-    uaddlv          s1,  v17.8h
-    fmov            w1,  s1
-    add             w0, w0, w1
+    uaddlp          v17.4s, v17.8h
+    add             v16.4s, v16.4s, v17.4s
+    uaddlv          d0, v16.4s
+    fmov            x0, d0
     ret
.endm

 

Please see the amended patch with your recommended change.

 

Thanks,

Sebastian
_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
