On Fri, 25 Mar 2022, Ben Avison wrote:
checkasm benchmarks on a 1.5 GHz Cortex-A72 are as follows.
vc1dsp.vc1_inv_trans_4x4_c: 158.2
vc1dsp.vc1_inv_trans_4x4_neon: 65.7
vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5
vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5
vc1dsp.vc1_inv_trans_4x8_c: 335.2
vc1dsp.vc1_inv_trans_4x8_neon: 106.2
vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2
vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5
vc1dsp.vc1_inv_trans_8x4_c: 365.7
vc1dsp.vc1_inv_trans_8x4_neon: 97.2
vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7
vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5
vc1dsp.vc1_inv_trans_8x8_c: 547.7
vc1dsp.vc1_inv_trans_8x8_neon: 137.0
vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2
vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5
Signed-off-by: Ben Avison <bavi...@riscosopen.org>
---
libavcodec/aarch64/vc1dsp_init_aarch64.c | 19 +
libavcodec/aarch64/vc1dsp_neon.S | 678 +++++++++++++++++++++++
2 files changed, 697 insertions(+)
Looks generally reasonable. Is it possible to factorize out the individual
transforms (so that you'd e.g. invoke the same macro twice in the 8x8 and
4x4 functions) without too much loss? The downshift, which differs between
the two, could either be left outside of the macro, or the downshift amount
could be made a macro parameter.
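A plain-C sketch of the suggested factorization, standing in for the assembler macros. Each 1-D pass is written once and called twice per 2-D transform, with the rounding bias and downshift passed as parameters (the second option above). It assumes the VC-1 inverse transform coefficients (12/16/6 even and 16/15/9/4 odd for 8-point, 17/22/10 for 4-point), a row pass rounding of +4 >> 3, and a column pass rounding of +64 >> 7, where the 8-point column pass also adds +1 to outputs 4..7. The names `vc1_idct8_1d`, `vc1_idct4_1d`, `vc1_inv_trans_8x8_sketch` and `vc1_inv_trans_4x4_sketch` are hypothetical. The 4x4 sketch also uses a packed 16-coefficient block for simplicity, so its layout should not be read as the patch's or FFmpeg's.

```c
#include <stdint.h>

/* Hypothetical helper: one VC-1 8-point inverse transform pass.
 * Elements are read/written with a stride so the same code serves
 * rows (stride 1) and columns (stride 8).
 * bias/shift: rounding applied to every output.
 * lower_round: extra +1 added to outputs 4..7 (column pass only). */
static void vc1_idct8_1d(const int16_t *in, int is, int16_t *out, int os,
                         int bias, int shift, int lower_round)
{
    int s0 = in[0 * is], s1 = in[1 * is], s2 = in[2 * is], s3 = in[3 * is];
    int s4 = in[4 * is], s5 = in[5 * is], s6 = in[6 * is], s7 = in[7 * is];

    /* Even part: coefficients 12, 16, 6 (bias folded in here). */
    int t1 = 12 * (s0 + s4) + bias;
    int t2 = 12 * (s0 - s4) + bias;
    int t3 = 16 * s2 + 6 * s6;
    int t4 = 6 * s2 - 16 * s6;
    int e0 = t1 + t3, e1 = t2 + t4, e2 = t2 - t4, e3 = t1 - t3;

    /* Odd part: coefficients 16, 15, 9, 4. */
    int o0 = 16 * s1 + 15 * s3 +  9 * s5 +  4 * s7;
    int o1 = 15 * s1 -  4 * s3 - 16 * s5 -  9 * s7;
    int o2 =  9 * s1 - 16 * s3 +  4 * s5 + 15 * s7;
    int o3 =  4 * s1 -  9 * s3 + 15 * s5 - 16 * s7;

    /* Assumes >> on negative ints is an arithmetic shift. */
    out[0 * os] = (int16_t)((e0 + o0) >> shift);
    out[1 * os] = (int16_t)((e1 + o1) >> shift);
    out[2 * os] = (int16_t)((e2 + o2) >> shift);
    out[3 * os] = (int16_t)((e3 + o3) >> shift);
    out[4 * os] = (int16_t)((e3 - o3 + lower_round) >> shift);
    out[5 * os] = (int16_t)((e2 - o2 + lower_round) >> shift);
    out[6 * os] = (int16_t)((e1 - o1 + lower_round) >> shift);
    out[7 * os] = (int16_t)((e0 - o0 + lower_round) >> shift);
}

/* Hypothetical helper: one VC-1 4-point inverse transform pass
 * (coefficients 17, 22, 10), same bias/shift convention. */
static void vc1_idct4_1d(const int16_t *in, int is, int16_t *out, int os,
                         int bias, int shift)
{
    int s0 = in[0 * is], s1 = in[1 * is], s2 = in[2 * is], s3 = in[3 * is];
    int t1 = 17 * (s0 + s2) + bias;
    int t2 = 17 * (s0 - s2) + bias;
    int t3 = 22 * s1 + 10 * s3;
    int t4 = 10 * s1 - 22 * s3;

    out[0 * os] = (int16_t)((t1 + t3) >> shift);
    out[1 * os] = (int16_t)((t2 + t4) >> shift);
    out[2 * os] = (int16_t)((t2 - t4) >> shift);
    out[3 * os] = (int16_t)((t1 - t3) >> shift);
}

/* 8x8: the same 1-D pass invoked twice; only the rounding differs. */
void vc1_inv_trans_8x8_sketch(int16_t block[64])
{
    int16_t tmp[64];
    for (int i = 0; i < 8; i++)   /* row pass: +4 >> 3 */
        vc1_idct8_1d(block + 8 * i, 1, tmp + 8 * i, 1, 4, 3, 0);
    for (int i = 0; i < 8; i++)   /* column pass: +64 >> 7, +1 on 4..7 */
        vc1_idct8_1d(tmp + i, 8, block + i, 8, 64, 7, 1);
}

/* 4x4 on a packed 16-coefficient block (a simplification). */
void vc1_inv_trans_4x4_sketch(int16_t block[16])
{
    int16_t tmp[16];
    for (int i = 0; i < 4; i++)   /* row pass: +4 >> 3 */
        vc1_idct4_1d(block + 4 * i, 1, tmp + 4 * i, 1, 4, 3);
    for (int i = 0; i < 4; i++)   /* column pass: +64 >> 7 */
        vc1_idct4_1d(tmp + i, 4, block + i, 4, 64, 7);
}
```

With the passes split out like this, the 4x8 and 8x4 cases would just pair one 8-point helper call with one 4-point helper call. Whether the same sharing costs anything in the NEON version depends on how much the two passes' register layouts and transposes differ, which this C sketch does not capture.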
// Martin