On Wed, 9 Nov 2016, Martin Storsjö wrote:

On Sat, 5 Nov 2016, Martin Storsjö wrote:

This work is sponsored by, and copyright, Google.

These are ported from the ARM version; it is essentially a 1:1
port with no extra added features, but with some hand tuning
(especially for the plain copy/avg functions). The ARM version
isn't very register starved to begin with, so there's not much
to be gained from having more spare registers here - we only
avoid having to clobber callee-saved registers.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                    ARM   AArch64
vp9_avg4_neon:                      27.2      23.7
vp9_avg8_neon:                      57.5      54.7
vp9_avg16_neon:                    170.3     165.4
vp9_avg32_neon:                    586.7     585.2
vp9_avg64_neon:                   2459.6    2322.5
vp9_avg_8tap_smooth_4h_neon:       132.7     125.0
vp9_avg_8tap_smooth_4hv_neon:      483.7     439.3
vp9_avg_8tap_smooth_4v_neon:       121.0      93.2
vp9_avg_8tap_smooth_8h_neon:       241.7     234.0
vp9_avg_8tap_smooth_8hv_neon:      695.4     643.4
vp9_avg_8tap_smooth_8v_neon:       240.0     206.0
vp9_avg_8tap_smooth_64h_neon:    11285.8   11265.7
vp9_avg_8tap_smooth_64hv_neon:   23016.3   22186.1
vp9_avg_8tap_smooth_64v_neon:    11556.9   10783.3
vp9_put4_neon:                      18.0      16.5
vp9_put8_neon:                      40.3      37.7
vp9_put16_neon:                     97.6      95.2
vp9_put32_neon/armv8:              347.9     307.7
vp9_put64_neon/armv8:             1319.5    1107.4
vp9_put_8tap_smooth_4h_neon:       126.7     118.3
vp9_put_8tap_smooth_4hv_neon:      470.7     431.3
vp9_put_8tap_smooth_4v_neon:       108.0      86.5
vp9_put_8tap_smooth_8h_neon:       229.7     221.2
vp9_put_8tap_smooth_8hv_neon:      665.4     619.2
vp9_put_8tap_smooth_8v_neon:       210.0     186.5
vp9_put_8tap_smooth_64h_neon:    10657.4   10625.1
vp9_put_8tap_smooth_64hv_neon:   21084.8   21032.5
vp9_put_8tap_smooth_64v_neon:     9637.6    9648.8

These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.

The speedup vs C code is pretty much the same as for the 32 bit
case; on the A53 it's around 6-13x for the larger 8tap filters.
The exact speedup varies a little, since the C versions generally
don't end up exactly as slow/fast as on 32 bit.
---
v3: Updated according to review comments.

v2: Updated according to the comments on the 32 bit version.
---
libavcodec/aarch64/Makefile              |   2 +
libavcodec/aarch64/vp9dsp_init_aarch64.c | 153 +++++++
libavcodec/aarch64/vp9mc_neon.S          | 679 +++++++++++++++++++++++++++++++
libavcodec/vp9.h                         |   1 +
libavcodec/vp9block.c                    |   4 +-
libavcodec/vp9dsp.c                      |   2 +
6 files changed, 839 insertions(+), 2 deletions(-)
create mode 100644 libavcodec/aarch64/vp9dsp_init_aarch64.c
create mode 100644 libavcodec/aarch64/vp9mc_neon.S

+.macro do_8tap_h_func type, filter, offset, size
+function ff_vp9_\type\()_\filter\()\size\()_h_neon, export=1
+        movrel          x6,  X(ff_vp9_subpel_filters) + 120*\offset - 8
+        cmp             w5,  #8

This fails on ios/clang, with the following error:

<instantiation>:3:21: error: expected compatible register, symbol or integer in range [0, 4095]
       add x6, x6, _ff_vp9_subpel_filters+120*0-8@PAGEOFF
                   ^
<instantiation>:3:9: note: while in macro instantiation
       movrel x6, _ff_vp9_subpel_filters + 120*0 - 8
       ^

This seems to be because it fails to parse an expression here - adding parentheses like this fixes assembling:

       movrel          x6,  X(ff_vp9_subpel_filters) + (120*\offset - 8)

But then linking still fails:

ld: in libavcodec/libavcodec.a(vp9mc_neon.o), in section __TEXT,__text reloc 0: symbol index out of range for architecture arm64

The previous version of the patch, where we just did "movrel x5, \filter\()_filter-16", also failed in this way. The relocations seem to fail as long as the symbol offset is negative.
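
To make the failing case concrete, here is a minimal C sketch of the addend arithmetic in the movrel expression above. The helper names subpel_addend and addend_is_negative are hypothetical, made up only for illustration; the constants 120 and 8 are the ones from the macro. For non-negative \offset values, only \offset = 0 gives a negative addend (-8, matching the 120*0-8 in the error output), which is consistent with the relocation failing whenever the symbol offset is negative.

```c
#include <stdbool.h>

/* Hypothetical helper: the addend applied to ff_vp9_subpel_filters
 * in the movrel expression, i.e. 120*offset - 8. */
static int subpel_addend(int offset)
{
    return 120 * offset - 8;
}

/* True when the addend is negative, which is the case that
 * appeared to break the relocation on darwin/arm64. */
static bool addend_is_negative(int offset)
{
    return subpel_addend(offset) < 0;
}
```

With these, subpel_addend(0) is -8, while subpel_addend(1) and subpel_addend(2) are 112 and 232, so only the first macro instantiation would hit the negative-offset case.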

I can do something like this, which works:

#ifndef __APPLE__
       movrel          x6,  X(ff_vp9_subpel_filters) + 120*\offset - 8
#else
       movrel          x6,  X(ff_vp9_subpel_filters)
       add             x6,  x6, 120*\offset - 8
#endif

Or perhaps just keep only the Apple-compatible version, to be safe? It shouldn't hurt performance significantly, and it would keep the code simpler.
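
As a sanity check that the two forms compute the same pointer, here is a small C sketch that models both on plain integers. The function names addr_folded and addr_split are hypothetical, and base stands in for the address of ff_vp9_subpel_filters; this only models the arithmetic, not the actual relocations.

```c
#include <stdint.h>

/* One-step form: symbol address plus the folded addend,
 * as in the non-Apple branch of the #ifndef. */
static uintptr_t addr_folded(uintptr_t base, int offset)
{
    return base + (intptr_t)(120 * offset - 8);
}

/* Two-step form: take the plain symbol address first,
 * then add the offset with a separate instruction. */
static uintptr_t addr_split(uintptr_t base, int offset)
{
    uintptr_t x6 = base;                 /* movrel x6, X(ff_vp9_subpel_filters) */
    x6 += (intptr_t)(120 * offset - 8);  /* add    x6, x6, 120*\offset - 8      */
    return x6;
}
```

Both return the same value for any base and offset; the split form just moves the (possibly negative) addend out of the relocation and into an ordinary add, at the cost of one extra instruction per function.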

For the record, --disable-pic builds for darwin-aarch64 also fail with current master. But that's a different matter...

// Martin
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel
