Re: [FFmpeg-devel] [PATCH 2/2] lavc/aarch64: add hevc epel/qpel assembly

2021-04-30 Thread Martin Storsjö

On Wed, 28 Apr 2021, Josh Dekker wrote:


From: Rafal Dabrowa 



First a couple technical details:

The use of '.ifeqs "\op", "sshr"' needs to be changed into '.ifc \op, 
sshr', because gas-preprocessor doesn't implement '.ifeqs'.


The checkasm tests for hevc_pel that were added in 
9c513edb7999a35ddcc6e3a8d984a96c8fb492a3 aren't actually ever run when you 
run "make fate-checkasm", because they're not added in 
tests/fate/checkasm.mak. They're, IMO, separated way too finegrainedly 
into 10 test groups of hevc_epel, hevc_epel_uni, hevc_epel_uni_w, etc. I'd 
strongly suggest you'd merge them into one single group, hevc_pel, just 
like the file name. That makes it easier to hook up to fate.


(As a side note, the fact that new checkasm tests need to be hooked up in 
fate like this in tests/fate/checkasm.mak is easy to overlook, and the 
benefit from being able to mark one specific group of checkasm as 
known-failure to ignore, is not used very much. So maybe we should 
reconsider to just have one single "fate-checkasm" which runs the full 
suite?)


All the tests crash under checkasm on linux. The reason is that they use 
e.g. x3 for loop counter, while the variable is declared as int. Checkasm 
tests against this by making sure the upper half of such registers are 
filled with garbage, but the wrapper that does that can't be used on apple 
platforms, so that aspect goes untested when developing on a mac. So 
please test checkasm on something else than a mac too.


Then, there's the sheer size of the patch. You said you considered 
templating it - if possible that would be very welcome. I haven't looked 
closely enough to be able to say easily how to do it, if at all though.


Then finally, the code is very very stall prone. It might not be 
measurable on an M1, but is very real on other cores. While you can say "I 
haven't bothered tuning it yet", the thing is, you don't need to actually 
run it on such a core to roughly do much better scheduling. I did one 
initial test on one function and got it sped up by 4% by just adjusting it 
in an initial way. Example:


function ff_hevc_put_hevc_epel_h48_8_neon, export=1
[...]
1:  ld3{v26.16b, v27.16b, v28.16b}, [x1], x5
ushrv29.2d, v26.2d, #8// Uses v26 which was loaded 
right before
ushrv30.2d, v27.2d, #8
ushrv31.2d, v28.2d, #8
ld1{v24.s}[0], [x1], x5
ld1{v25.s}[0], [x1], x2 // Uses x1 which was updated in the 
previous instruction
mov v29.b[7], v24.b[0] // Uses v24 which was loaded almost 
right before
mov v30.b[7], v24.b[1]
mov v31.b[7], v24.b[2]
mov v29.b[15], v25.b[0]
mov v30.b[15], v25.b[1]
mov v31.b[15], v25.b[2]
moviv16.8h, #0 // This has no dependencies and could be 
done anytime, while e.g. waiting for results for the loads above!
moviv17.8h, #0
moviv18.8h, #0
moviv20.8h, #0
moviv21.8h, #0
moviv22.8h, #0
calc_epelb  v16, v26, v27, v28, v29
calc_epelb2 v20, v26, v27, v28, v29
calc_epelb  v17, v27, v28, v29, v30
calc_epelb2 v21, v27, v28, v29, v30
calc_epelb  v18, v28, v29, v30, v31
calc_epelb2 v22, v28, v29, v30, v31
st3{v16.8h, v17.8h, v18.8h}, [x0], #48
st3{v20.8h, v21.8h, v22.8h}, [x0], x10
subsx3, x3, #1   // This updates the condition flags 
right before the branch, while we could stick it anywhere else in the 
loop, where we have an instruction waiting for the result of the previous 
one

b.ne1b
ret


A trivial rescheduling of it looks like this:

1:  ld3{v26.16b, v27.16b, v28.16b}, [x1], x5
moviv16.8h, #0
moviv17.8h, #0
moviv18.8h, #0
ld1{v24.s}[0], [x1], x5
moviv20.8h, #0
moviv21.8h, #0
moviv22.8h, #0
ld1{v25.s}[0], [x1], x2
ushrv29.2d, v26.2d, #8
ushrv30.2d, v27.2d, #8
ushrv31.2d, v28.2d, #8
mov v29.b[7], v24.b[0]
mov v30.b[7], v24.b[1]
mov v31.b[7], v24.b[2]
mov v29.b[15], v25.b[0]
mov v30.b[15], v25.b[1]
mov v31.b[15], v25.b[2]
calc_epelb  v16, v26, v27, v28, v29
calc_epelb2 v20, v26, v27, v28, v29
calc_epelb  v17, v27, v28, v29, v30
calc_epelb2 v21, v27, v28, v29, v30
calc_epelb  v18, v28, v29, v30, v31
calc_epelb2 v22, v28, v29, v30, v31
st3{v16.8h, v17.8h, v18.8h}, [x0], #48
subs  

Re: [FFmpeg-devel] [PATCH 2/2] lavc/aarch64: add hevc epel/qpel assembly

2021-04-28 Thread chen
inline comment with prefix [MC]

At 2021-04-29 03:50:26, "Josh Dekker"  wrote:
>From: Rafal Dabrowa 
>
>Benchmarked on Apple M1:
>
>put_hevc_epel_bi_h4_8_c: 69.9
>put_hevc_epel_bi_h4_8_neon: 15.4
>put_hevc_epel_bi_h6_8_c: 137.1
>put_hevc_epel_bi_h6_8_neon: 31.9
>put_hevc_epel_bi_h8_8_c: 124.6
>put_hevc_epel_bi_h8_8_neon: 40.9
>put_hevc_epel_bi_h12_8_c: 331.9
>put_hevc_epel_bi_h12_8_neon: 72.4
>put_hevc_epel_bi_h16_8_c: 383.4
>put_hevc_epel_bi_h16_8_neon: 124.9
>put_hevc_epel_bi_h24_8_c: 771.6
>put_hevc_epel_bi_h24_8_neon: 209.6
>put_hevc_epel_bi_h32_8_c: 1324.4
>put_hevc_epel_bi_h32_8_neon: 389.4
>put_hevc_epel_bi_h48_8_c: 2869.6
>put_hevc_epel_bi_h48_8_neon: 730.1
>put_hevc_epel_bi_h64_8_c: 4992.6
>put_hevc_epel_bi_h64_8_neon: 1490.4
>put_hevc_epel_bi_hv4_8_c: 163.4
>put_hevc_epel_bi_hv4_8_neon: 38.4
>put_hevc_epel_bi_hv6_8_c: 292.4
>put_hevc_epel_bi_hv6_8_neon: 66.4
>put_hevc_epel_bi_hv8_8_c: 375.6
>put_hevc_epel_bi_hv8_8_neon: 62.4
>put_hevc_epel_bi_hv12_8_c: 831.6
>put_hevc_epel_bi_hv12_8_neon: 134.9
>put_hevc_epel_bi_hv16_8_c: 1257.9
>put_hevc_epel_bi_hv16_8_neon: 214.1
>put_hevc_epel_bi_hv24_8_c: 2666.6
>put_hevc_epel_bi_hv24_8_neon: 391.1
>put_hevc_epel_bi_hv32_8_c: 4722.4
>put_hevc_epel_bi_hv32_8_neon: 734.1
>put_hevc_epel_bi_hv48_8_c: 10100.4
>put_hevc_epel_bi_hv48_8_neon: 1570.4
>put_hevc_epel_bi_hv64_8_c: 17613.4
>put_hevc_epel_bi_hv64_8_neon: 2810.6
>put_hevc_epel_bi_v4_8_c: 77.4
>put_hevc_epel_bi_v4_8_neon: 18.6
>put_hevc_epel_bi_v6_8_c: 142.1
>put_hevc_epel_bi_v6_8_neon: 27.1
>put_hevc_epel_bi_v8_8_c: 192.9
>put_hevc_epel_bi_v8_8_neon: 9.1
>put_hevc_epel_bi_v12_8_c: 415.6
>put_hevc_epel_bi_v12_8_neon: 55.6
>put_hevc_epel_bi_v16_8_c: 487.6
>put_hevc_epel_bi_v16_8_neon: 61.9
>put_hevc_epel_bi_v24_8_c: 957.4
>put_hevc_epel_bi_v24_8_neon: 131.1
>put_hevc_epel_bi_v32_8_c: 1540.4
>put_hevc_epel_bi_v32_8_neon: 210.4
>put_hevc_epel_bi_v48_8_c: 3242.9
>put_hevc_epel_bi_v48_8_neon: 465.6
>put_hevc_epel_bi_v64_8_c: 5441.1
>put_hevc_epel_bi_v64_8_neon: 818.1
>put_hevc_epel_h4_8_c: 41.6
>put_hevc_epel_h4_8_neon: 8.4
>put_hevc_epel_h6_8_c: 110.1
>put_hevc_epel_h6_8_neon: 24.4
>put_hevc_epel_h8_8_c: 41.6
>put_hevc_epel_h8_8_neon: 17.6
>put_hevc_epel_h12_8_c: 183.1
>put_hevc_epel_h12_8_neon: 58.1
>put_hevc_epel_h16_8_c: 146.6
>put_hevc_epel_h16_8_neon: 83.4
>put_hevc_epel_h24_8_c: 240.4
>put_hevc_epel_h24_8_neon: 157.1
>put_hevc_epel_h32_8_c: 431.1
>put_hevc_epel_h32_8_neon: 292.1
>put_hevc_epel_h48_8_c: 858.6
>put_hevc_epel_h48_8_neon: 557.4
>put_hevc_epel_h64_8_c: 1536.6
>put_hevc_epel_h64_8_neon: 1116.6
>put_hevc_epel_hv4_8_c: 152.6
>put_hevc_epel_hv4_8_neon: 34.9
>put_hevc_epel_hv6_8_c: 269.6
>put_hevc_epel_hv6_8_neon: 61.6
>put_hevc_epel_hv8_8_c: 307.4
>put_hevc_epel_hv8_8_neon: 76.9
>put_hevc_epel_hv12_8_c: 702.6
>put_hevc_epel_hv12_8_neon: 113.1
>put_hevc_epel_hv16_8_c: 1081.4
>put_hevc_epel_hv16_8_neon: 190.6
>put_hevc_epel_hv24_8_c: 2276.1
>put_hevc_epel_hv24_8_neon: 345.1
>put_hevc_epel_hv32_8_c: 4068.6
>put_hevc_epel_hv32_8_neon: 780.4
>put_hevc_epel_hv48_8_c: 8754.1
>put_hevc_epel_hv48_8_neon: 1394.4
>put_hevc_epel_hv64_8_c: 15402.1
>put_hevc_epel_hv64_8_neon: 2616.6
>put_hevc_epel_uni_hv4_8_c: 142.1
>put_hevc_epel_uni_hv4_8_neon: 46.6
>put_hevc_epel_uni_hv6_8_c: 298.4
>put_hevc_epel_uni_hv6_8_neon: 72.4
>put_hevc_epel_uni_hv8_8_c: 352.9
>put_hevc_epel_uni_hv8_8_neon: 75.1
>put_hevc_epel_uni_hv12_8_c: 776.6
>put_hevc_epel_uni_hv12_8_neon: 125.9
>put_hevc_epel_uni_hv16_8_c: 1216.1
>put_hevc_epel_uni_hv16_8_neon: 199.1
>put_hevc_epel_uni_hv24_8_c: 2577.9
>put_hevc_epel_uni_hv24_8_neon: 386.6
>put_hevc_epel_uni_hv32_8_c: 4554.9
>put_hevc_epel_uni_hv32_8_neon: 710.9
>put_hevc_epel_uni_hv48_8_c: 9869.1
>put_hevc_epel_uni_hv48_8_neon: 1499.4
>put_hevc_epel_uni_hv64_8_c: 17307.1
>put_hevc_epel_uni_hv64_8_neon: 2750.6
>put_hevc_epel_uni_v4_8_c: 59.9
>put_hevc_epel_uni_v4_8_neon: 21.9
>put_hevc_epel_uni_v6_8_c: 136.1
>put_hevc_epel_uni_v6_8_neon: 19.6
>put_hevc_epel_uni_v8_8_c: 222.4
>put_hevc_epel_uni_v8_8_neon: 17.1
>put_hevc_epel_uni_v12_8_c: 481.6
>put_hevc_epel_uni_v12_8_neon: 42.4
>put_hevc_epel_uni_v16_8_c: 424.4
>put_hevc_epel_uni_v16_8_neon: 63.4
>put_hevc_epel_uni_v24_8_c: 1184.1
>put_hevc_epel_uni_v24_8_neon: 109.9
>put_hevc_epel_uni_v32_8_c: 1401.1
>put_hevc_epel_uni_v32_8_neon: 182.9
>put_hevc_epel_uni_v48_8_c: 2933.9
>put_hevc_epel_uni_v48_8_neon: 388.9
>put_hevc_epel_uni_v64_8_c: 5044.9
>put_hevc_epel_uni_v64_8_neon: 701.1
>put_hevc_epel_v4_8_c: 31.9
>put_hevc_epel_v4_8_neon: 13.4
>put_hevc_epel_v6_8_c: 95.1
>put_hevc_epel_v6_8_neon: 16.4
>put_hevc_epel_v8_8_c: 98.9
>put_hevc_epel_v8_8_neon: 26.1
>put_hevc_epel_v12_8_c: 283.9
>put_hevc_epel_v12_8_neon: 36.9
>put_hevc_epel_v16_8_c: 229.6
>put_hevc_epel_v16_8_neon: 41.9
>put_hevc_epel_v24_8_c: 376.4
>put_hevc_epel_v24_8_neon: 90.4
>put_hevc_epel_v32_8_c: 577.4
>put_hevc_epel_v32_8_neon: 188.4
>put_hevc_epel_v48_8_c: 1058.4
>put_hevc_epel_v48_8_neon: 350.6
>put_hevc_epel_v64_8_c: 1647.4
>put_hevc_epel_v64_8_neon: 647.9
>put_hevc_pel_bi_pixels4