On Thu, 7 Jan 2021, Josh Dekker wrote:

Signed-off-by: Josh Dekker <j...@itanimul.li>
---
libavcodec/aarch64/Makefile            |  3 +-
libavcodec/aarch64/hevcdsp_idct_neon.S | 74 ++++++++++++++++++++++++++
libavcodec/aarch64/hevcdsp_init.c      | 19 +++++++
3 files changed, 95 insertions(+), 1 deletion(-)
create mode 100644 libavcodec/aarch64/hevcdsp_idct_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 4bdd554e7e..42d80bf74c 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -54,7 +54,8 @@ NEON-OBJS-$(CONFIG_VP8DSP)              += 
aarch64/vp8dsp_neon.o
# decoders/encoders
NEON-OBJS-$(CONFIG_AAC_DECODER)         += aarch64/aacpsdsp_neon.o
NEON-OBJS-$(CONFIG_DCA_DECODER)         += aarch64/synth_filter_neon.o
-NEON-OBJS-$(CONFIG_HEVC_DECODER)        += aarch64/hevcdsp_add_res_neon.o
+NEON-OBJS-$(CONFIG_HEVC_DECODER)        += aarch64/hevcdsp_add_res_neon.o      
\
+                                           aarch64/hevcdsp_idct_neon.o
NEON-OBJS-$(CONFIG_OPUS_DECODER)        += aarch64/opusdsp_neon.o
NEON-OBJS-$(CONFIG_VORBIS_DECODER)      += aarch64/vorbisdsp_neon.o
NEON-OBJS-$(CONFIG_VP9_DECODER)         += aarch64/vp9itxfm_16bpp_neon.o       \
diff --git a/libavcodec/aarch64/hevcdsp_idct_neon.S 
b/libavcodec/aarch64/hevcdsp_idct_neon.S
new file mode 100644
index 0000000000..cd886bb6dc
--- /dev/null
+++ b/libavcodec/aarch64/hevcdsp_idct_neon.S
@@ -0,0 +1,74 @@
+/* -*-arm64-*-
+ *
+ * AArch64 NEON optimised IDCT functions for HEVC decoding
+ *
+ * Copyright (c) 2020 Josh Dekker <j...@itanimul.li>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+.macro idct_dc size bitdepth
+function ff_hevc_idct_\size\()x\size\()_dc_\bitdepth\()_neon, export=1

As a bonus, it'd be nice to have the function prototype in a comment above here, as a quick reference for what the inputs are.

+    ldrsh w1, [x0]
+    mov   w2, #(1 << (13 - \bitdepth))
+    add   w1, w1, #1
+    asr   w1, w1, #1
+    add   w1, w1, w2

As commented on the other patch, please align things like in existing assembly, and add another extra space after the commas here, to keep things aligned even if you'd be using e.g. w10.

+    asr   w1, w1, #(14 - \bitdepth)
+    dup   v0.8h, w1
+    dup   v1.8h, w1
+.if \size > 4
+    dup   v2.8h, w1
+    dup   v3.8h, w1
+.if \size > 16 /* dc 32x32 */
+    mov x2, #4
+1:
+    subs x2, x2, #1
+.endif
+.if \size > 8 /* dc 16x16 */
+    st1   {v0.8h-v3.8h}, [x0], #64
+    st1   {v0.8h-v3.8h}, [x0], #64
+    st1   {v0.8h-v3.8h}, [x0], #64
+    st1   {v0.8h-v3.8h}, [x0], #64
+    st1   {v0.8h-v3.8h}, [x0], #64
+    st1   {v0.8h-v3.8h}, [x0], #64
+.endif /* dc 8x8 */
+    st1   {v0.8h-v3.8h}, [x0], #64
+    st1   {v0.8h-v3.8h}, [x0], #64

In a series of stores like this, each instruction updates the pointer x0, which means that the next instruction can't start executing until the previous one is done writing back the value to that register. So for a sequence like this, it's sometimes useful to have two registers for storing things interleavedly, e.g. like this:

add x12,  x0,  #64
mov x13,  #128
st1 {}, [x0],  x13
st1 {}, [x12], x13
st1 {}, [x0],  x13
st1 {}, [x12], x13
...

Depending on the context and so on, it may or may not be worth doing that. For such a small trivial function like this, I'd definitely try at least.

Btw, you didn't mention what core you had benchmarked your functions on. When doing micro tuning like this, it's usually good to benchmark on both big and small cores, as some optimizations can make things faster on one but slower on another one.

Functionally, the patch looks ok - but as this goes into the same file as Reimar's patches also add things into (and his patches do things that are fairly 1:1 with the existing arm assembly), it might be good to hold off of pushing this one until his patches are in, to rebase this one on top of those.

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Reply via email to