On Sat, 5 Nov 2016, Martin Storsjö wrote:
This work is sponsored by, and copyright, Google.
These are ported from the ARM version; it is essentially a 1:1
port with no extra added features, but with some hand tuning
(especially for the plain copy/avg functions). The ARM version
isn't very register starved to begin with, so there's not much
to be gained from having more spare registers here - we only
avoid having to clobber callee-saved registers.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                     ARM   AArch64
vp9_avg4_neon:                      27.2      23.7
vp9_avg8_neon:                      57.5      54.7
vp9_avg16_neon:                    170.3     165.4
vp9_avg32_neon:                    586.7     585.2
vp9_avg64_neon:                   2459.6    2322.5
vp9_avg_8tap_smooth_4h_neon:       132.7     125.0
vp9_avg_8tap_smooth_4hv_neon:      483.7     439.3
vp9_avg_8tap_smooth_4v_neon:       121.0      93.2
vp9_avg_8tap_smooth_8h_neon:       241.7     234.0
vp9_avg_8tap_smooth_8hv_neon:      695.4     643.4
vp9_avg_8tap_smooth_8v_neon:       240.0     206.0
vp9_avg_8tap_smooth_64h_neon:    11285.8   11265.7
vp9_avg_8tap_smooth_64hv_neon:   23016.3   22186.1
vp9_avg_8tap_smooth_64v_neon:    11556.9   10783.3
vp9_put4_neon:                      18.0      16.5
vp9_put8_neon:                      40.3      37.7
vp9_put16_neon:                     97.6      95.2
vp9_put32_neon/armv8:              347.9     307.7
vp9_put64_neon/armv8:             1319.5    1107.4
vp9_put_8tap_smooth_4h_neon:       126.7     118.3
vp9_put_8tap_smooth_4hv_neon:      470.7     431.3
vp9_put_8tap_smooth_4v_neon:       108.0      86.5
vp9_put_8tap_smooth_8h_neon:       229.7     221.2
vp9_put_8tap_smooth_8hv_neon:      665.4     619.2
vp9_put_8tap_smooth_8v_neon:       210.0     186.5
vp9_put_8tap_smooth_64h_neon:    10657.4   10625.1
vp9_put_8tap_smooth_64hv_neon:   21084.8   21032.5
vp9_put_8tap_smooth_64v_neon:     9637.6    9648.8
These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.
The speedup vs C code is pretty much the same as for the 32 bit
case; on the A53 it's around 6-13x for the larger 8tap filters.
The exact speedup varies a little, since the C versions generally
don't end up exactly as slow/fast as on 32 bit.
---
v3: Updated according to review comments.
v2: Updated according to the comments on the 32 bit version.
---
libavcodec/aarch64/Makefile | 2 +
libavcodec/aarch64/vp9dsp_init_aarch64.c | 153 +++++++
libavcodec/aarch64/vp9mc_neon.S | 679 +++++++++++++++++++++++++++++++
libavcodec/vp9.h | 1 +
libavcodec/vp9block.c | 4 +-
libavcodec/vp9dsp.c | 2 +
6 files changed, 839 insertions(+), 2 deletions(-)
create mode 100644 libavcodec/aarch64/vp9dsp_init_aarch64.c
create mode 100644 libavcodec/aarch64/vp9mc_neon.S
+.macro do_8tap_h_func type, filter, offset, size
+function ff_vp9_\type\()_\filter\()\size\()_h_neon, export=1
+ movrel x6, X(ff_vp9_subpel_filters) + 120*\offset - 8
+ cmp w5, #8
This fails to assemble on iOS/clang, with the following error:
<instantiation>:3:21: error: expected compatible register, symbol or integer in range [0, 4095]
add x6, x6, _ff_vp9_subpel_filters+120*0-8@PAGEOFF
^
<instantiation>:3:9: note: while in macro instantiation
movrel x6, _ff_vp9_subpel_filters + 120*0 - 8
^
This seems to be because the assembler fails to parse the expression
here; adding parentheses like this makes it assemble:
movrel x6, X(ff_vp9_subpel_filters) + (120*\offset - 8)
But then linking still fails:
ld: in libavcodec/libavcodec.a(vp9mc_neon.o), in section __TEXT,__text
reloc 0: symbol index out of range for architecture arm64
The previous version of the patch, where we just did "movrel x5,
\filter\()_filter-16", also failed in this way. The relocations seem to
fail as long as the symbol offset is negative.
I can do something like this, which works:
#ifndef __APPLE__
movrel x6, X(ff_vp9_subpel_filters) + 120*\offset - 8
#else
movrel x6, X(ff_vp9_subpel_filters)
add x6, x6, 120*\offset - 8
#endif
Or perhaps just keep only the Apple-compatible version, to be safe? It
shouldn't hurt performance significantly, and it would keep the code simpler.
// Martin