Hi Sebastian,


Thank you for your code.


First, sorry for the delay. I was very busy last week with my family and my toy
hardware codec, so I only had a little spare time over the weekend.
I have not looked at all of the functions yet, but I have some comments on
64x64.


In this function, unroll=8 (4*2) performs well on out-of-order (OOO) CPUs, but
it may hurt performance on low-end in-order CPUs such as the Cortex-A53 because
of cache misses and related issues. Of course, this is not a problem for this
version of the patch.


In the 64x64 function, the sum is calculated by the code below.
==========
+.macro SAD_END_64
+    add         v16.8h, v16.8h, v17.8h
+    add         v17.8h, v18.8h, v19.8h
+    add         v16.8h, v16.8h, v17.8h
+    uaddlv      s0,  v16.8h
+    fmov        w0,  s0
+    add         v18.8h, v20.8h, v21.8h
+    add         v19.8h, v22.8h, v23.8h
+    add         v17.8h, v18.8h, v19.8h
+    uaddlv      s1,  v17.8h
+    fmov        w1,  s1
+    add         w0, w0, w1
+    ret
+.endm
==========


You use two UADDLV instructions to avoid overflow. How about summing these
partial registers in the NEON domain instead, to save one UADDLV?
e.g.
UADDLP v16,v16
UADDLP v17,v17
ADD v16,v17
UADDLV s0,v16
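
To make the arithmetic concrete, here is a small scalar C model of both
reductions. It is only a sketch, not code from the patch: the function names
are hypothetical, and it assumes each 16-bit accumulator lane holds at most 64
absolute differences of 8-bit pixels (64 * 255 = 16320). Under that assumption
four registers added together still fit in 16 bits (65280), while all eight
would not (130560), which is why the quoted macro reduces in two halves. The
shorthand above omits arrangement specifiers; the model treats UADDLP as
.8h -> .4s and does the remaining adds in 32-bit lanes.

```c
#include <stdint.h>

/* Model of UADDLV s, v.8h: widening sum across all eight 16-bit lanes. */
static uint32_t uaddlv_8h(const uint16_t v[8])
{
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += v[i];
    return sum;
}

/* Model of UADDLP v.4s, v.8h: add adjacent 16-bit lanes into 32-bit lanes. */
static void uaddlp_8h(uint32_t out[4], const uint16_t v[8])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)v[2 * i] + v[2 * i + 1];
}

/* Reduction as in the quoted macro: two UADDLV, then a scalar add.
 * a and b stand for the two partial registers (v16 and v17). */
uint32_t reduce_two_uaddlv(const uint16_t a[8], const uint16_t b[8])
{
    return uaddlv_8h(a) + uaddlv_8h(b);
}

/* Suggested reduction: widen both partial registers pairwise, add them in
 * 32-bit lanes, then do a single across-lane sum at the end. */
uint32_t reduce_widen_first(const uint16_t a[8], const uint16_t b[8])
{
    uint32_t wa[4], wb[4], sum = 0;
    uaddlp_8h(wa, a);
    uaddlp_8h(wb, b);
    for (int i = 0; i < 4; i++)
        sum += wa[i] + wb[i];   /* vector ADD, then across-lane add */
    return sum;
}
```

Both functions give the same result for any input, because every intermediate
value fits in 32 bits. In real assembly the final across-lane step on 32-bit
lanes would probably be an ADDV on v16.4s rather than a second UADDLV, but I
have not tested that against the patch.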


Regards,
Min Chen

2021-07-17 04:44:05,"Pop, Sebastian" <s...@amazon.com> 

Hi,

the attached patch ports to arm64 the following kernels:

            sad[  4x4]  10.11x   6.50            65.72
            sad[  8x8]  28.95x   8.50            246.00
            sad[  8x4]  23.03x   5.45            125.43
            sad[  4x8]  12.09x   10.64           128.68
            sad[16x16]  53.37x   19.19           1024.05
            sad[ 16x8]  43.09x   11.62           500.84
            sad[ 8x16]  31.03x   16.87           523.44
            sad[ 16x4]  39.73x   6.27            249.10
            sad[16x12]  50.55x   15.10           763.44
            sad[ 4x16]  14.23x   19.39           275.91
            sad[12x16]  33.68x   22.95           772.81
            sad[32x32]  62.10x   64.84           4026.97
            sad[32x16]  59.82x   33.74           2018.56
            sad[16x32]  57.94x   35.01           2028.17
            sad[ 32x8]  53.98x   18.77           1013.48
            sad[32x24]  61.29x   49.36           3024.90
            sad[ 8x32]  31.84x   32.49           1034.56
            sad[24x32]  53.61x   56.39           3022.97
            sad[64x64]  65.24x   255.86          16692.29
            sad[64x32]  61.77x   131.16          8100.90
            sad[32x64]  62.31x   128.90          8031.79
            sad[64x16]  60.28x   67.35           4060.31
            sad[64x48]  62.53x   193.59          12104.64
            sad[16x64]  61.10x   66.13           4040.26
            sad[48x64]  61.75x   194.68          12022.14

Ok to commit?

Thanks,
Sebastian

 
_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
