vc1: Arm 64-bit NEON deblocking filter fast paths

Martin Storsjö Wed, 30 Mar 2022 05:36:07 -0700

On Fri, 25 Mar 2022, Ben Avison wrote:

checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C
version can still outperform the NEON version in specific cases. The balance
between different code paths is stream-dependent, but in practice the best
case happens about 5% of the time, the worst case happens about 40% of the
time, and the complexity of the remaining cases fall somewhere in between.
Therefore, taking the average of the best and worst case timings is
probably a conservative estimate of the degree by which the NEON code
improves performance.


vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7
vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5
vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5
vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7
vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2
vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2
vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2
vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2
vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0
vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7
vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7
vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5
vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7
vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0
vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7
vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0
vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2
vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7
vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0
vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2
vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0
vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2
vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2

Signed-off-by: Ben Avison <bavi...@riscosopen.org>
---
libavcodec/aarch64/Makefile              |   1 +
libavcodec/aarch64/vc1dsp_init_aarch64.c |  14 +
libavcodec/aarch64/vc1dsp_neon.S         | 698 +++++++++++++++++++++++
3 files changed, 713 insertions(+)
create mode 100644 libavcodec/aarch64/vc1dsp_neon.S

diff --git a/libavcodec/aarch64/vc1dsp_neon.S b/libavcodec/aarch64/vc1dsp_neon.S
new file mode 100644
index 0000000000..70391b4179
--- /dev/null
+++ b/libavcodec/aarch64/vc1dsp_neon.S
@@ -0,0 +1,698 @@
+/*
+ * VC1 AArch64 NEON optimisations
+ *
+ * Copyright (c) 2022 Ben Avison <bavi...@riscosopen.org>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+.align  5
+.Lcoeffs:
+.quad   0x00050002
+

This constant is problematic when building with MSVC/armasm64.exe (viagas-preprocessor). (gas-preprocessor, processing assembly for use witharmasm, can't handle any completely general gas assembly, but works fineas long as one sticks to the general code patterns/macros we use.)

The issue here is that this is a naked label before the firstfunction/endfunc/const/endconst block. In practice it works fine if thisfollows after an existing function though. (gas-preprocessor currentlyneeds an explicit .text directive, to set it up to emit code to the .textsection. I guess I could look into handling that implicitly too.)

In practice, this issue disappears further ahead in the patch stack whenother functions are added above this in the source file though, so it'snot really an issue - I just thought I'd mention it.

+// VC-1 in-loop deblocking filter for 4 pixel pairs at boundary of 
vertically-neighbouring blocks
+// On entry:
+//   x0 -> top-left pel of lower block
+//   w1 = row stride, bytes
+//   w2 = PQUANT bitstream parameter
+function ff_vc1_v_loop_filter4_neon, export=1
+        sub             x3, x0, w1, sxtw #2
+        sxtw            x1, w1                  // technically, stride is 
signed int
+        ldr             d0, .Lcoeffs
+        ld1             {v1.s}[0], [x0], x1     // P5
+        ld1             {v2.s}[0], [x3], x1     // P1
+        ld1             {v3.s}[0], [x3], x1     // P2
+        ld1             {v4.s}[0], [x0], x1     // P6
+        ld1             {v5.s}[0], [x3], x1     // P3
+        ld1             {v6.s}[0], [x0], x1     // P7
+        ld1             {v7.s}[0], [x3]         // P4
+        ld1             {v16.s}[0], [x0]        // P8
+        ushll           v17.8h, v1.8b, #1       // 2*P5
+        dup             v18.8h, w2              // pq
+        ushll           v2.8h, v2.8b, #1        // 2*P1
+        uxtl            v3.8h, v3.8b            // P2
+        uxtl            v4.8h, v4.8b            // P6
+        uxtl            v19.8h, v5.8b           // P3


Overall, the code looks sensible to me.

Would it make sense to share thecore of the filter between the horizontal/vertical cases with e.g. amacro? (I didn't check in detail if there's much differences in the coreof the filter. At most some differences in condition registers for partialwriteout in the horizontal forms?)

If it's shareable, I guess the core of the filter even could be factorizedto a separate sub-function to avoid duplicating it across the functions?(For the smaller filters it's probably no big deal, but for the biggerfilters it could maybe be worthwhile?) It probably costs a couple cyclesextra though, so if that's a too costly here, just macroing it to avoidduplication is fine too.


// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Re: [FFmpeg-devel] [PATCH 05/10] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths

Reply via email to