Re: [FFmpeg-devel] [PATCH 07/10] avcodec/vc1: Arm 64-bit NEON inverse transform fast paths

2022-03-31 Thread Martin Storsjö

On Thu, 31 Mar 2022, Ben Avison wrote:


On 30/03/2022 14:49, Martin Storsjö wrote:
Looks generally reasonable. Is it possible to factorize out the individual 
transforms (so that you'd e.g. invoke the same macro twice in the 8x8 and 
4x4 functions) without too much loss?


There is a close analogy here with the vertical/horizontal deblocking 
filters, because while there are similarities between the two matrix 
multiplications within a transform, one of them follows a series of loads and 
the other follows a matrix transposition.


If you look for example at ff_vc1_inv_trans_8x8_neon, you'll see I was able 
to do a fair amount of overlap between sections of the function - 
particularly between the transpose and the second matrix multiplication, but 
to a lesser extent between the loads and the first matrix multiplication and 
between the second multiplication and the stores. This sort of overlapping is 
tricky to maintain when using macros. Also, it means the order of 
operations within each matrix multiply ended up quite different.


At first sight, you might think that the multiplies from the 8x8 function 
(which you might also view as a kind of 8-tap filter) would be reusable for 
the size-8 multiplies in the 8x4 or 4x8 function. Yes, the instructions are 
similar, save for using .4h elements rather than .8h elements, but that has 
significant impacts on scheduling. For example, the Cortex-A72, which is my 
primary target, can only issue NEON bit-shifts to one of its pipelines at a 
time, irrespective of whether the vectors are 64-bit or 128-bit, while other 
instructions don't have such restrictions.


So while in theory you could factor some of this code out more, I suspect any 
attempt to do so would have a detrimental effect on performance.


Ok, fair enough. Yes, it's always a trade-off between code simplicity and 
getting the optimal interleaving. As you've spent the effort on making it 
efficient in that respect, let's go with that then!


(FWIW, for future endeavours, having the checkasm tests in place while 
developing/tuning the implementation does allow getting good empirical 
data on how much you gain from different alternative scheduling choices. I 
usually don't follow the optimization guides for any specific core, but 
track the benchmark numbers for a couple different cores and try to pick a 
scheduling that is a decent compromise for all of them.)
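
(For the record, benchmarking a single DSP module that way is just 
something like the following, from a configured build tree - --bench 
takes an optional pattern to limit which tests get timed:

  make tests/checkasm/checkasm
  ./tests/checkasm/checkasm --bench=vc1dsp
)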


Also, for future work - if you have checkasm tests in place while working 
on the assembly, I usually amend the test with debug printouts that 
visualize the output of the reference and the tested function, and a map 
showing which elements differ - which makes tracking down issues a whole 
lot easier. None of the checkasm tests in ffmpeg have such printouts 
yet, but in e.g. the dav1d project, the checkasm tool has been extended 
with helpers for comparing and printing such debug aids.
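
As a rough sketch of what I mean (not actual code from either project, 
and the names are made up), something like this, called on a mismatch, 
makes the failure pattern obvious at a glance:

  #include <stdint.h>
  #include <stdio.h>

  // Dump reference vs. tested output side by side; differing elements
  // in the reference half are flagged with '!' so you can see at a
  // glance whether e.g. one element, a whole row or everything is off.
  static void dump_blocks(const int16_t *ref, const int16_t *tst,
                          int w, int h)
  {
      for (int y = 0; y < h; y++) {
          for (int x = 0; x < w; x++)
              printf("%6d%c", ref[y * w + x],
                     ref[y * w + x] != tst[y * w + x] ? '!' : ' ');
          printf("  |");
          for (int x = 0; x < w; x++)
              printf(" %6d", tst[y * w + x]);
          printf("\n");
      }
  }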


// Martin


Re: [FFmpeg-devel] [PATCH 07/10] avcodec/vc1: Arm 64-bit NEON inverse transform fast paths

2022-03-31 Thread Ben Avison

On 30/03/2022 14:49, Martin Storsjö wrote:
Looks generally reasonable. Is it possible to factorize out the 
individual transforms (so that you'd e.g. invoke the same macro twice in 
the 8x8 and 4x4 functions) without too much loss?


There is a close analogy here with the vertical/horizontal deblocking 
filters, because while there are similarities between the two matrix 
multiplications within a transform, one of them follows a series of 
loads and the other follows a matrix transposition.


If you look for example at ff_vc1_inv_trans_8x8_neon, you'll see I was 
able to do a fair amount of overlap between sections of the function - 
particularly between the transpose and the second matrix multiplication, 
but to a lesser extent between the loads and the first matrix 
multiplication and between the second multiplication and the stores. 
This sort of overlapping is tricky to maintain when using macros. Also, 
it means the order of operations within each matrix multiply ended up 
quite different.


At first sight, you might think that the multiplies from the 8x8 
function (which you might also view as a kind of 8-tap filter) would be 
reusable for the size-8 multiplies in the 8x4 or 4x8 function. Yes, the 
instructions are similar, save for using .4h elements rather than .8h 
elements, but that has significant impacts on scheduling. For example, 
the Cortex-A72, which is my primary target, can only issue NEON 
bit-shifts to one of its pipelines at a time, irrespective of whether 
the vectors are 64-bit or 128-bit, while other instructions don't have 
such restrictions.
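
To make the width point concrete in intrinsics terms (illustrative 
only - the patch itself is written directly in assembly, and these 
helper names are invented), the two forms of the same step are:

  #include <arm_neon.h>

  // The same shift at both vector widths. On Cortex-A72, both forms
  // are restricted to the single shift-capable ASIMD pipe, so halving
  // the vector width doesn't buy any extra issue slots for the
  // shift-heavy parts of the transform.
  int16x4_t shift_half(int16x4_t v) { return vshl_n_s16(v, 4); }  // .4h, 64-bit
  int16x8_t shift_full(int16x8_t v) { return vshlq_n_s16(v, 4); } // .8h, 128-bit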


So while in theory you could factor some of this code out more, I 
suspect any attempt to do so would have a detrimental effect on performance.


Ben


Re: [FFmpeg-devel] [PATCH 07/10] avcodec/vc1: Arm 64-bit NEON inverse transform fast paths

2022-03-30 Thread Martin Storsjö

On Wed, 30 Mar 2022, Martin Storsjö wrote:


On Fri, 25 Mar 2022, Ben Avison wrote:


checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

vc1dsp.vc1_inv_trans_4x4_c: 158.2
vc1dsp.vc1_inv_trans_4x4_neon: 65.7
vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5
vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5
vc1dsp.vc1_inv_trans_4x8_c: 335.2
vc1dsp.vc1_inv_trans_4x8_neon: 106.2
vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2
vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5
vc1dsp.vc1_inv_trans_8x4_c: 365.7
vc1dsp.vc1_inv_trans_8x4_neon: 97.2
vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7
vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5
vc1dsp.vc1_inv_trans_8x8_c: 547.7
vc1dsp.vc1_inv_trans_8x8_neon: 137.0
vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2
vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5

Signed-off-by: Ben Avison 
---
libavcodec/aarch64/vc1dsp_init_aarch64.c |  19 +
libavcodec/aarch64/vc1dsp_neon.S | 678 +++
2 files changed, 697 insertions(+)


Looks generally reasonable. Is it possible to factorize out the individual 
transforms (so that you'd e.g. invoke the same macro twice in the 8x8 and 4x4 
functions) without too much loss? The downshift which differs between the two 
could either be left outside of the macro, or the downshift amount could be 
made a macro parameter.


Another aspect I forgot: we have existing Arm assembly for the idct. In 
some cases, there's value in keeping the implementations 
similar if possible and relevant. But your implementation seems quite 
straightforward, and seems to get better benchmark numbers on the same 
cores, so I guess it's fine to diverge and add a new from-scratch 
implementation here.


// Martin


Re: [FFmpeg-devel] [PATCH 07/10] avcodec/vc1: Arm 64-bit NEON inverse transform fast paths

2022-03-30 Thread Martin Storsjö

On Fri, 25 Mar 2022, Ben Avison wrote:


checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

vc1dsp.vc1_inv_trans_4x4_c: 158.2
vc1dsp.vc1_inv_trans_4x4_neon: 65.7
vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5
vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5
vc1dsp.vc1_inv_trans_4x8_c: 335.2
vc1dsp.vc1_inv_trans_4x8_neon: 106.2
vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2
vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5
vc1dsp.vc1_inv_trans_8x4_c: 365.7
vc1dsp.vc1_inv_trans_8x4_neon: 97.2
vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7
vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5
vc1dsp.vc1_inv_trans_8x8_c: 547.7
vc1dsp.vc1_inv_trans_8x8_neon: 137.0
vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2
vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5

Signed-off-by: Ben Avison 
---
libavcodec/aarch64/vc1dsp_init_aarch64.c |  19 +
libavcodec/aarch64/vc1dsp_neon.S | 678 +++
2 files changed, 697 insertions(+)


Looks generally reasonable. Is it possible to factorize out the individual 
transforms (so that you'd e.g. invoke the same macro twice in the 8x8 and 
4x4 functions) without too much loss? The downshift which differs between 
the two could either be left outside of the macro, or the downshift amount 
could be made a macro parameter.
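
In C terms, the kind of factoring I mean is roughly the following 
(purely illustrative, with the coefficients as in the C version in 
vc1dsp.c, and a made-up function name; I'm glossing over the extra +1 
rounding the real second pass applies to its last four outputs):

  #include <stddef.h>
  #include <stdint.h>

  // One shared 8-point inverse-transform pass; the rounding bias and
  // the downshift are parameters rather than being baked into two
  // near-identical copies (first pass: bias 4, shift 3; second pass:
  // bias 64, shift 7).
  static void vc1_inv_trans_8_pass(const int16_t *src, int16_t *dst,
                                   ptrdiff_t step, int bias, int shift)
  {
      int t1 = 12 * (src[0 * step] + src[4 * step]) + bias;
      int t2 = 12 * (src[0 * step] - src[4 * step]) + bias;
      int t3 = 16 * src[2 * step] +  6 * src[6 * step];
      int t4 =  6 * src[2 * step] - 16 * src[6 * step];
      int t5 = t1 + t3, t6 = t2 + t4, t7 = t2 - t4, t8 = t1 - t3;

      t1 = 16 * src[1 * step] + 15 * src[3 * step] +  9 * src[5 * step] +  4 * src[7 * step];
      t2 = 15 * src[1 * step] -  4 * src[3 * step] - 16 * src[5 * step] -  9 * src[7 * step];
      t3 =  9 * src[1 * step] - 16 * src[3 * step] +  4 * src[5 * step] + 15 * src[7 * step];
      t4 =  4 * src[1 * step] -  9 * src[3 * step] + 15 * src[5 * step] - 16 * src[7 * step];

      dst[0 * step] = (t5 + t1) >> shift;
      dst[1 * step] = (t6 + t2) >> shift;
      dst[2 * step] = (t7 + t3) >> shift;
      dst[3 * step] = (t8 + t4) >> shift;
      dst[4 * step] = (t8 - t4) >> shift;
      dst[5 * step] = (t7 - t3) >> shift;
      dst[6 * step] = (t6 - t2) >> shift;
      dst[7 * step] = (t5 - t1) >> shift;
  }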


// Martin



[FFmpeg-devel] [PATCH 07/10] avcodec/vc1: Arm 64-bit NEON inverse transform fast paths

2022-03-25 Thread Ben Avison
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

vc1dsp.vc1_inv_trans_4x4_c: 158.2
vc1dsp.vc1_inv_trans_4x4_neon: 65.7
vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5
vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5
vc1dsp.vc1_inv_trans_4x8_c: 335.2
vc1dsp.vc1_inv_trans_4x8_neon: 106.2
vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2
vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5
vc1dsp.vc1_inv_trans_8x4_c: 365.7
vc1dsp.vc1_inv_trans_8x4_neon: 97.2
vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7
vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5
vc1dsp.vc1_inv_trans_8x8_c: 547.7
vc1dsp.vc1_inv_trans_8x8_neon: 137.0
vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2
vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5

Signed-off-by: Ben Avison 
---
 libavcodec/aarch64/vc1dsp_init_aarch64.c |  19 +
 libavcodec/aarch64/vc1dsp_neon.S | 678 +++
 2 files changed, 697 insertions(+)

diff --git a/libavcodec/aarch64/vc1dsp_init_aarch64.c b/libavcodec/aarch64/vc1dsp_init_aarch64.c
index edfb296b75..b672b2aa99 100644
--- a/libavcodec/aarch64/vc1dsp_init_aarch64.c
+++ b/libavcodec/aarch64/vc1dsp_init_aarch64.c
@@ -25,6 +25,16 @@
 
 #include "config.h"
 
+void ff_vc1_inv_trans_8x8_neon(int16_t *block);
+void ff_vc1_inv_trans_8x4_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+void ff_vc1_inv_trans_4x8_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+void ff_vc1_inv_trans_4x4_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+
+void ff_vc1_inv_trans_8x8_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+void ff_vc1_inv_trans_8x4_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+void ff_vc1_inv_trans_4x8_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+void ff_vc1_inv_trans_4x4_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+
 void ff_vc1_v_loop_filter4_neon(uint8_t *src, int stride, int pq);
 void ff_vc1_h_loop_filter4_neon(uint8_t *src, int stride, int pq);
 void ff_vc1_v_loop_filter8_neon(uint8_t *src, int stride, int pq);
@@ -46,6 +56,15 @@ av_cold void ff_vc1dsp_init_aarch64(VC1DSPContext *dsp)
     int cpu_flags = av_get_cpu_flags();
 
     if (have_neon(cpu_flags)) {
+        dsp->vc1_inv_trans_8x8 = ff_vc1_inv_trans_8x8_neon;
+        dsp->vc1_inv_trans_8x4 = ff_vc1_inv_trans_8x4_neon;
+        dsp->vc1_inv_trans_4x8 = ff_vc1_inv_trans_4x8_neon;
+        dsp->vc1_inv_trans_4x4 = ff_vc1_inv_trans_4x4_neon;
+        dsp->vc1_inv_trans_8x8_dc = ff_vc1_inv_trans_8x8_dc_neon;
+        dsp->vc1_inv_trans_8x4_dc = ff_vc1_inv_trans_8x4_dc_neon;
+        dsp->vc1_inv_trans_4x8_dc = ff_vc1_inv_trans_4x8_dc_neon;
+        dsp->vc1_inv_trans_4x4_dc = ff_vc1_inv_trans_4x4_dc_neon;
+
         dsp->vc1_v_loop_filter4  = ff_vc1_v_loop_filter4_neon;
         dsp->vc1_h_loop_filter4  = ff_vc1_h_loop_filter4_neon;
         dsp->vc1_v_loop_filter8  = ff_vc1_v_loop_filter8_neon;
diff --git a/libavcodec/aarch64/vc1dsp_neon.S b/libavcodec/aarch64/vc1dsp_neon.S
index 70391b4179..e68e0fce53 100644
--- a/libavcodec/aarch64/vc1dsp_neon.S
+++ b/libavcodec/aarch64/vc1dsp_neon.S
@@ -22,7 +22,685 @@
 
 #include "libavutil/aarch64/asm.S"
 
+// VC-1 8x8 inverse transform
+// On entry:
+//   x0 -> array of 16-bit inverse transform coefficients, in column-major order
+// On exit:
+//   array at x0 updated to hold transformed block; also now held in row-major order
+function ff_vc1_inv_trans_8x8_neon, export=1
+        ld1             {v1.16b, v2.16b}, [x0], #32
+        ld1             {v3.16b, v4.16b}, [x0], #32
+        ld1             {v5.16b, v6.16b}, [x0], #32
+        shl             v1.8h, v1.8h, #2        //  8/2 * src[0]
+        sub             x1, x0, #3*32
+        ld1             {v16.16b, v17.16b}, [x0]
+        shl             v7.8h, v2.8h, #4        //   16 * src[8]
+        shl             v18.8h, v2.8h, #2       //    4 * src[8]
+        shl             v19.8h, v4.8h, #4       //   16 * src[24]
+        ldr             d0, .Lcoeffs_it8
+        shl             v5.8h, v5.8h, #2        //  8/2 * src[32]
+        shl             v20.8h, v6.8h, #4       //   16 * src[40]
+        shl             v21.8h, v6.8h, #2       //    4 * src[40]
+        shl             v22.8h, v17.8h, #4      //   16 * src[56]
+        ssra            v20.8h, v19.8h, #2      //    4 * src[24] + 16 * src[40]
+        mul             v23.8h, v3.8h, v0.h[0]  //  6/2 * src[16]
+        sub             v19.8h, v19.8h, v21.8h  //   16 * src[24] -  4 * src[40]
+        ssra            v7.8h, v22.8h, #2       //   16 * src[8]  +  4 * src[56]
+        sub             v18.8h, v22.8h, v18.8h  // -  4 * src[8]  + 16 * src[56]
+        shl             v3.8h, v3.8h, #3        // 16/2 * src[16]
+