On Wed, 2 Nov 2016, Martin Storsjö wrote:
This work is sponsored by, and copyright, Google.
These are ported from the ARM version; it is essentially a 1:1
port with no extra added features, but with some hand tuning
(especially for the plain copy/avg functions). The ARM version
isn't very register starved to begin with, so there's not much
to be gained from having more spare registers here - we only
avoid having to clobber callee-saved registers.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                     ARM   AArch64
vp9_avg4_neon:                      32.2      23.7
vp9_avg8_neon:                      57.5      53.7
vp9_avg16_neon:                    168.6     165.4
vp9_avg32_neon:                    586.7     585.2
vp9_avg64_neon:                   2458.6    2325.9
vp9_avg_8tap_smooth_4h_neon:       130.7     124.0
vp9_avg_8tap_smooth_4hv_neon:      478.8     440.3
vp9_avg_8tap_smooth_4v_neon:       118.0      96.2
vp9_avg_8tap_smooth_8h_neon:       239.7     232.0
vp9_avg_8tap_smooth_8hv_neon:      691.3     649.9
vp9_avg_8tap_smooth_8v_neon:       238.0     214.5
vp9_avg_8tap_smooth_64h_neon:    11512.9   11492.8
vp9_avg_8tap_smooth_64hv_neon:   23322.1   23255.1
vp9_avg_8tap_smooth_64v_neon:    11556.2   11554.5
vp9_put4_neon:                      18.0      16.5
vp9_put8_neon:                      40.2      37.7
vp9_put16_neon:                     99.4      95.2
vp9_put32_neon:                    348.8     307.4
vp9_put64_neon:                   1321.3    1109.8
vp9_put_8tap_smooth_4h_neon:       124.7     117.3
vp9_put_8tap_smooth_4hv_neon:      465.8     425.3
vp9_put_8tap_smooth_4v_neon:       105.0      82.5
vp9_put_8tap_smooth_8h_neon:       227.7     218.2
vp9_put_8tap_smooth_8hv_neon:      661.4     620.1
vp9_put_8tap_smooth_8v_neon:       208.0     187.2
vp9_put_8tap_smooth_64h_neon:    10864.6   10873.9
vp9_put_8tap_smooth_64hv_neon:   21359.4   21295.7
vp9_put_8tap_smooth_64v_neon:     9629.1    9639.4
These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.
The speedup vs C code is pretty much the same as for the 32 bit
case; on the A53 it's around 6-13x for the larger 8tap filters.
The exact speedup varies a little, since the C versions generally
don't end up exactly as slow/fast as on 32 bit.
---
v2: Updated according to the comments on the 32 bit version.
---
libavcodec/aarch64/Makefile | 2 +
libavcodec/aarch64/vp9dsp_init_aarch64.c | 139 ++++++
libavcodec/aarch64/vp9mc_neon.S | 733 +++++++++++++++++++++++++++++++
libavcodec/vp9.h | 1 +
libavcodec/vp9dsp.c | 2 +
5 files changed, 877 insertions(+)
create mode 100644 libavcodec/aarch64/vp9dsp_init_aarch64.c
create mode 100644 libavcodec/aarch64/vp9mc_neon.S
+function ff_vp9_copy64_neon, export=1
+1:
+ ldp x5, x6, [x2]
+ stp x5, x6, [x0]
+ ldp x5, x6, [x2, #16]
+ stp x5, x6, [x0, #16]
+ subs w4, w4, #1
+ ldp x5, x6, [x2, #32]
+ stp x5, x6, [x0, #32]
+ ldp x5, x6, [x2, #48]
+ stp x5, x6, [x0, #48]
+ add x2, x2, x3
+ add x0, x0, x1
+ b.ne 1b
+ ret
+endfunc
I forgot to mention it anywhere, but the copy32 and copy64 functions don't
actually use any vector registers at all, only plain aarch64 ldp/stp.
When implemented with neon loads/stores, they ended up significantly
slower than the C version, on my dragonboard.
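
For readers less used to the assembly, the copy64 loop quoted above behaves
roughly like the C sketch below. The function name copy64_gpr_sketch is
hypothetical, and the argument order (dst, dst stride, src, src stride,
height) is only inferred from how x0-x4 are used in the assembly. Each
memcpy pair stands in for one ldp/stp of two 64 bit general purpose
registers; whether a compiler actually emits ldp/stp for it is not
guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C counterpart of the ldp/stp loop; not part of the patch.
 * Copies h rows of 64 bytes each, 16 bytes (two 64 bit words) at a time. */
static void copy64_gpr_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                              const uint8_t *src, ptrdiff_t src_stride,
                              int h)
{
    /* Like the assembly, the loop tests the counter only after a row
     * has been copied, so h must be at least 1. */
    assert(h > 0);
    do {
        for (int off = 0; off < 64; off += 16) {
            uint64_t lo, hi;
            /* ldp x5, x6, [x2, #off] */
            memcpy(&lo, src + off, 8);
            memcpy(&hi, src + off + 8, 8);
            /* stp x5, x6, [x0, #off] */
            memcpy(dst + off, &lo, 8);
            memcpy(dst + off + 8, &hi, 8);
        }
        src += src_stride;   /* add x2, x2, x3 */
        dst += dst_stride;   /* add x0, x0, x1 */
    } while (--h);           /* subs w4, w4, #1 ... b.ne 1b */
}
```

Going through memcpy rather than casting to uint64_t pointers keeps the
sketch free of alignment and aliasing assumptions about src and dst.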
Currently copy64 runs at around 1100 cycles, while a trivial neon version
(that loads all 64 bytes at once with a ld1 {v0,v1,v2,v3}) runs at around
1600 cycles. One could of course play with all different combinations of
loading 16, 32 or 64 bytes per ld1 and scheduling them differently (IIRC I
did try some of those combinations at least), but I never got down to what
the C version did unless I used ldp/stp.
Technically, having a _neon suffix for them is wrong, but anything else
(omitting these two while hooking up avg32/avg64 separately) is more
complication - although I'm open to suggestions on how to handle it best.
// Martin