On Fri, 11 Nov 2016, Janne Grunau wrote:

On 2016-10-16 23:18:58 +0300, Martin Storsjö wrote:
This work is sponsored by, and copyright, Google.

The implementation tries to have smart handling of cases
where no pixels need the full filtering for the 8/16 width
filters, skipping both calculation and writeback of the
unmodified pixels in those cases. The actual effect of this
is hard to test with checkasm though, since it tests the
full filtering, and the benefit depends on how many filtered
blocks use the shortcut.
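
A minimal C sketch of the shortcut idea described above, not the NEON code from the patch: build a per-line mask of which lines want the wider filter, and if the mask is empty, run only the narrow filter and write back fewer pixels. The helper names (line_is_flat, flat_mask, pixels_to_write) and the flatness threshold of 1 for 8-bit pixels are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* One line across the edge, stored as: p3 p2 p1 p0 | q0 q1 q2 q3. */
#define LINE_LEN 8

/* Assumed flatness test for 8-bit pixels: every pixel on each side of the
 * edge is within 1 of its edge pixel (p0 or q0). */
static int line_is_flat(const uint8_t *l)
{
    for (int i = 0; i < 3; i++)
        if (abs(l[i] - l[3]) > 1 || abs(l[7 - i] - l[4]) > 1)
            return 0;
    return 1;
}

/* Build a bitmask with bit i set if line i wants the wider filter. */
static unsigned flat_mask(const uint8_t *lines, int n)
{
    unsigned mask = 0;
    for (int i = 0; i < n; i++)
        if (line_is_flat(lines + i * LINE_LEN))
            mask |= 1u << i;
    return mask;
}

/* Hypothetical dispatcher for the 8-wide case: number of pixels per line
 * that must be written back. With an empty mask only the narrow filter
 * runs, which changes p1 p0 q0 q1; otherwise p2..q2 may change. */
static int pixels_to_write(unsigned mask)
{
    return mask ? 6 : 4;
}
```

When flat_mask() returns 0 for a whole block, both the wide filter arithmetic and the stores of the outer pixels can be skipped, which is the saving the paragraph above refers to.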

Did you benchmark it with START/STOP_TIMER in the decoder? It will still
depend on the amount of filtering required by the test stream.

No, but IIRC the overall speed of decoding a full clip was measurably improved when I added these shortcuts.
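
For reference, in-decoder timing of this kind can be sketched as below. This is a portable stand-in written for illustration and does not use the START_TIMER/STOP_TIMER macros themselves; TimerStats, now_ns, timer_add and timer_mean are hypothetical names.

```c
#include <stdint.h>
#include <time.h>

/* Accumulated timing for one call site (hypothetical helper type). */
typedef struct {
    uint64_t total_ns;
    uint64_t runs;
} TimerStats;

/* Current wall-clock time in nanoseconds, using C11 timespec_get(). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Record one measured interval. */
static void timer_add(TimerStats *s, uint64_t start, uint64_t stop)
{
    s->total_ns += stop - start;
    s->runs++;
}

/* Mean duration per run, or 0 if nothing was recorded. */
static uint64_t timer_mean(const TimerStats *s)
{
    return s->runs ? s->total_ns / s->runs : 0;
}
```

The idea is to take now_ns() before and after each loop filter call inside the decoder, feed both values to timer_add(), and compare timer_mean() between builds while decoding the same clip.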

Examples of relative speedup compared to the C version, from checkasm:
                          Cortex       A7     A8     A9    A53
vp9_loop_filter_h_4_8_neon:          2.81   2.59   2.29   3.33
vp9_loop_filter_h_8_8_neon:          2.17   2.09   1.81   2.67
vp9_loop_filter_h_16_8_neon:         2.72   2.29   2.20   3.10
vp9_loop_filter_h_16_16_neon:        2.49   2.38   2.23   2.82
vp9_loop_filter_mix2_h_44_16_neon:   2.88   2.71   2.30   3.31
vp9_loop_filter_mix2_h_48_16_neon:   2.59   2.40   2.05   3.00
vp9_loop_filter_mix2_h_84_16_neon:   2.58   2.33   2.04   3.05
vp9_loop_filter_mix2_h_88_16_neon:   2.32   2.25   1.91   2.81
vp9_loop_filter_mix2_v_44_16_neon:   4.28   4.13   3.49   5.64
vp9_loop_filter_mix2_v_48_16_neon:   4.32   4.15   3.54   5.96
vp9_loop_filter_mix2_v_84_16_neon:   4.13   3.71   3.31   5.33
vp9_loop_filter_mix2_v_88_16_neon:   4.64   4.50   3.94   6.74
vp9_loop_filter_v_4_8_neon:          4.10   3.78   3.12   5.44
vp9_loop_filter_v_8_8_neon:          4.50   4.24   3.66   6.45
vp9_loop_filter_v_16_8_neon:         5.21   4.42   4.12   7.00
vp9_loop_filter_v_16_16_neon:        3.76   3.57   3.23   5.25

The speedup vs C code is around 2-7x. The numbers are quite
inconclusive though, since the checkasm test runs multiple filterings
on top of each other, so later rounds might end up with different
codepaths (different decisions on which filter to apply, based
on input pixel differences). Disabling the early-exit in the asm
doesn't give a fair comparison either though, since the C code
only does the necessary calculations for each row.

The only way to get meaningful numbers is to time the use in the
decoder.

Indeed

This is pretty similar in runtime to the corresponding routines
in libvpx. (This is comparing vpx_lpf_vertical_16_neon,
vpx_lpf_horizontal_edge_8_neon and vpx_lpf_horizontal_edge_16_neon
to vp9_loop_filter_h_16_8_neon, vp9_loop_filter_v_16_8_neon
and vp9_loop_filter_v_16_16_neon - note that the naming of horizontal
and vertical is flipped between the libraries.)

In order to have stable, comparable numbers, the early exits in both
asm versions were disabled, forcing the full filtering codepath.

This is probably ok since the implementations are probably pretty similar.

                           Cortex           A7      A8      A9     A53
vp9_loop_filter_h_16_8_neon:             594.7   441.2   459.5   406.0
libvpx vpx_lpf_vertical_16_neon:         626.0   464.5   470.7   445.0
vp9_loop_filter_v_16_8_neon:             516.3   390.2   406.7   285.0
libvpx vpx_lpf_horizontal_edge_8_neon:   586.5   414.5   415.6   383.2
vp9_loop_filter_v_16_16_neon:            955.4   762.2   780.9   558.0
libvpx vpx_lpf_horizontal_edge_16_neon: 1060.2   751.7   743.5   685.2

Our version is consistently faster on A7 and A53, consistently
marginally slower on A9, and faster in some tests but slower in some,
on A7 and A8.
---
I did some more optimizations to the core filter macro, added
assembly frontends for the 16_16 functions to avoid doing vpush/vpop
twice, and added many more code comments. Assembly frontends for
the mix functions aren't necessary since none of them clobber any callee
saved registers.

Now the comparison to libvpx is much closer; we're rarely slower
at all, and even much faster in some cases.
---
 libavcodec/arm/Makefile          |   1 +
 libavcodec/arm/vp9dsp_init_arm.c |  57 +++
 libavcodec/arm/vp9lpf_neon.S     | 764 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 822 insertions(+)
 create mode 100644 libavcodec/arm/vp9lpf_neon.S


+function ff_vp9_loop_filter_h_4_8_neon, export=1
+        sub             r12, r0,  #4
+        add             r0,  r12, r1, lsl #2
+        vld1.8          {d20}, [r12], r1
+        vld1.8          {d24}, [r0],  r1
+        vld1.8          {d21}, [r12], r1
+        vld1.8          {d25}, [r0],  r1
+        vld1.8          {d22}, [r12], r1
+        vld1.8          {d26}, [r0],  r1
+        vld1.8          {d23}, [r12], r1
+        vld1.8          {d27}, [r0],  r1
+
+        sub             r12, r12, r1, lsl #2
+        sub             r0,  r0,  r1, lsl #2
+        @ Move r0/r12 forward by 2 pixels; we don't need to rewrite the
+        @ outermost 2 pixels since they aren't changed.
+        add             r12, r12, #2
+        add             r0,  r0,  #2
+
+        @ Transpose the 8x8 pixels, taking advantage of q registers, to get
+        @ one register per column.
+        transpose_q_8x8 q10, q11, q12, q13, d20, d21, d22, d23, d24, d25, d26, d27
+
+        loop_filter_4
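
The load, transpose and partial writeback in the quoted assembly can be modelled in plain C roughly as follows. This is only a scalar sketch of the data movement; load_transposed_8x8 and store_inner_4 are hypothetical names and nothing here is taken from vp9lpf_neon.S beyond the offsets visible in the quoted code (start 4 pixels left of the edge, write back starting 2 pixels left of it).

```c
#include <stddef.h>
#include <stdint.h>

/* Read 8 rows of 8 pixels starting 4 pixels to the left of the edge at dst
 * and store them transposed, so cols[c] holds column c for all 8 rows. */
static void load_transposed_8x8(uint8_t cols[8][8], const uint8_t *dst,
                                ptrdiff_t stride)
{
    const uint8_t *src = dst - 4;
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            cols[c][r] = src[r * stride + c];
}

/* Write back only the 4 inner columns (p1 p0 q0 q1); the outermost 2
 * columns on each side are left alone since the narrow filter does not
 * change them. */
static void store_inner_4(uint8_t *dst, ptrdiff_t stride, uint8_t cols[8][8])
{
    uint8_t *out = dst - 2;
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 4; c++)
            out[r * stride + c] = cols[c + 2][r];
}
```

After the transpose, each cols[c] plays the role of one register per column, so the filter can process all 8 rows of a column at once.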

Since the loop filter core is huge and doesn't differ between horizontal
and vertical filtering, it's probably beneficial to have a loop filter
core function and call it from both _h and _v.
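
In C terms, the suggestion amounts to something like the sketch below: one core routine parameterised by the step across the edge and the step along it, with thin _h and _v wrappers. The names filter_core, filter_h and filter_v are hypothetical, and the smoothing arithmetic is a placeholder, not the VP9 filter.

```c
#include <stddef.h>
#include <stdint.h>

/* Shared core: processes 8 lines along the edge. step_across moves from
 * q0 to p0 (negated) and step_along moves to the next line. The blend of
 * p0 and q0 is a placeholder for the real filter arithmetic. */
static void filter_core(uint8_t *dst, ptrdiff_t step_across,
                        ptrdiff_t step_along)
{
    for (int i = 0; i < 8; i++) {
        uint8_t *l = dst + i * step_along;
        int p0 = l[-step_across];
        int q0 = l[0];
        l[-step_across] = (uint8_t)((3 * p0 + q0 + 2) >> 2);
        l[0]            = (uint8_t)((p0 + 3 * q0 + 2) >> 2);
    }
}

/* Filter horizontally across a vertical edge: neighbours are 1 byte apart,
 * lines are one stride apart. */
static void filter_h(uint8_t *dst, ptrdiff_t stride)
{
    filter_core(dst, 1, stride);
}

/* Filter vertically across a horizontal edge: neighbours are one stride
 * apart, lines are 1 byte apart. */
static void filter_v(uint8_t *dst, ptrdiff_t stride)
{
    filter_core(dst, stride, 1);
}
```

In assembly the _h variant additionally needs the transposes around the call, and the call itself costs a branch and return per invocation, which is where a shared core trades speed for code size.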

Actually, it's not all that huge, surprisingly.

By doing that, I reduce the size of the text segment for this object file from 4664 bytes to 3204, but it gives the following slowdown:

pre:
vp9_loop_filter_v_4_8_neon:     147.0   129.2   114.5    89.0
vp9_loop_filter_v_8_8_neon:     242.0   198.2   174.5   140.0
vp9_loop_filter_v_16_8_neon:    500.0   438.7   379.7   293.0
vp9_loop_filter_v_16_16_neon:   971.0   818.2   731.2   578.2
post:
vp9_loop_filter_v_4_8_neon:     153.0   132.0   116.5    94.2
vp9_loop_filter_v_8_8_neon:     251.2   202.4   179.2   147.0
vp9_loop_filter_v_16_8_neon:    509.2   425.5   388.4   301.0
vp9_loop_filter_v_16_16_neon:   990.5   839.5   753.2   596.2

I'm a little undecided whether this size reduction is worth it or not.

For aarch64, the code size for all the filter combos is 16KB though, so there it's probably much more worthwhile.
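
To put the trade-off above in relative terms, the figures can be fed through a small helper like this one; percent_change is a hypothetical name used only for this illustration.

```c
/* Relative change from 'before' to 'after', in percent. Negative values
 * mean a reduction. Assumes before is nonzero. */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

With the numbers quoted above, percent_change(4664, 3204) is roughly -31 (text segment size) while percent_change(147.0, 153.0) is roughly +4 (first column of vp9_loop_filter_v_4_8_neon).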

// Martin
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel