[FFmpeg-devel] [PATCH 1/2] lavu/checkasm: add (private) kperf timing for macOS

2021-04-28 Thread Josh Dekker
Signed-off-by: Josh Dekker 
---
 configure |   2 +
 libavutil/Makefile|   1 +
 libavutil/macos_kperf.c   | 140 ++
 libavutil/macos_kperf.h   |  23 +++
 libavutil/timer.h |  17 -
 tests/checkasm/checkasm.c |  14 +++-
 tests/checkasm/checkasm.h |   7 +-
 7 files changed, 200 insertions(+), 4 deletions(-)
 create mode 100644 libavutil/macos_kperf.c
 create mode 100644 libavutil/macos_kperf.h

diff --git a/configure b/configure
index 820f719a32..a79052ad28 100755
--- a/configure
+++ b/configure
@@ -489,6 +489,7 @@ Developer options (useful when working on FFmpeg itself):
   --ignore-tests=TESTS comma-separated list (without "fate-" prefix
in the name) of tests whose result is ignored
   --enable-linux-perf  enable Linux Performance Monitor API
+  --enable-macos-kperf enable macOS kperf (private) API
   --disable-large-tests disable tests that use a large amount of memory
 
 NOTE: Object files are built at the place where configure is launched.
@@ -1947,6 +1948,7 @@ CONFIG_LIST="
 fontconfig
 large_tests
 linux_perf
+macos_kperf
 memory_poisoning
 neon_clobber_test
 ossfuzz
diff --git a/libavutil/Makefile b/libavutil/Makefile
index 47efb718d2..18dc5f22d9 100644
--- a/libavutil/Makefile
+++ b/libavutil/Makefile
@@ -181,6 +181,7 @@ OBJS-$(CONFIG_D3D11VA)  += hwcontext_d3d11va.o
 OBJS-$(CONFIG_DXVA2)    += hwcontext_dxva2.o
 OBJS-$(CONFIG_LIBDRM)   += hwcontext_drm.o
 OBJS-$(CONFIG_LZO)      += lzo.o
+OBJS-$(CONFIG_MACOS_KPERF)  += macos_kperf.o
 OBJS-$(CONFIG_MEDIACODEC)   += hwcontext_mediacodec.o
 OBJS-$(CONFIG_OPENCL)   += hwcontext_opencl.o
 OBJS-$(CONFIG_QSV)  += hwcontext_qsv.o
diff --git a/libavutil/macos_kperf.c b/libavutil/macos_kperf.c
new file mode 100644
index 00..d5de491e12
--- /dev/null
+++ b/libavutil/macos_kperf.c
@@ -0,0 +1,140 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include "macos_kperf.h"
+#include 
+#include 
+#include 
+
+#define KPERF_LIST \
+F(int, kpc_get_counting, void) \
+F(int, kpc_force_all_ctrs_set, int)\
+F(int, kpc_set_counting, uint32_t) \
+F(int, kpc_set_thread_counting, uint32_t)  \
+F(int, kpc_set_config, uint32_t, void *)   \
+F(int, kpc_get_config, uint32_t, void *)   \
+F(int, kpc_set_period, uint32_t, void *)   \
+F(int, kpc_get_period, uint32_t, void *)   \
+F(uint32_t, kpc_get_counter_count, uint32_t)   \
+F(uint32_t, kpc_get_config_count, uint32_t)\
+F(int, kperf_sample_get, int *)\
+F(int, kpc_get_thread_counters, int, unsigned int, void *)
+
+#define F(ret, name, ...)  \
+typedef ret name##proc(__VA_ARGS__);   \
+static name##proc *name = NULL;
+KPERF_LIST
+#undef F
+
+#define CFGWORD_EL0A32EN_MASK (0x1)
+#define CFGWORD_EL0A64EN_MASK (0x2)
+#define CFGWORD_EL1EN_MASK    (0x4)
+#define CFGWORD_EL3EN_MASK    (0x8)
+#define CFGWORD_ALLMODES_MASK (0xf)
+
+#define CPMU_NONE 0
+#define CPMU_CORE_CYCLE 0x02
+#define CPMU_INST_A64 0x8c
+#define CPMU_INST_BRANCH 0x8d
+#define CPMU_SYNC_DC_LOAD_MISS 0xbf
+#define CPMU_SYNC_DC_STORE_MISS 0xc0
+#define CPMU_SYNC_DTLB_MISS 0xc1
+#define CPMU_SYNC_ST_HIT_YNGR_LD 0xc4
+#define CPMU_SYNC_BR_ANY_MISP 0xcb
+#define CPMU_FED_IC_MISS_DEM 0xd3
+#define CPMU_FED_ITLB_MISS 0xd4
+
+#define KPC_CLASS_FIXED_MASK        (1 << 0)
+#define KPC_CLASS_CONFIGURABLE_MASK (1 << 1)
+#define KPC_CLASS_POWER_MASK        (1 << 2)
+#define KPC_CLASS_RAWPMU_MASK       (1 << 3)
+
+#define COUNTERS_COUNT 10
+#define CONFIG_COUNT 8
+#define KPC_MASK (KPC_CLASS_CONFIGURABLE_MASK | KPC_CLASS_FIXED_MASK)
+
+int ff_kperf_setup()
+{
+uint64_t config[COUNTERS_COUNT] = {0};
+config[0] = CPMU_CORE_CYCLE | CFGWORD_EL0A64EN_MASK;
+// con
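
The KPERF_LIST / F() construction in the (truncated) patch above is an X-macro: one list that is expanded to produce a typedef and a NULL-initialised function pointer per symbol. A minimal, self-contained sketch of the same pattern follows. DEMO_LIST, demo_add, demo_neg, demo_resolve and demo_init are hypothetical names used only for illustration; the excerpt does not show how the kperf symbols are actually resolved, so the lookup below is an assumption standing in for whatever loader the patch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical functions standing in for symbols exported by a library. */
static int demo_add(int a, int b) { return a + b; }
static int demo_neg(int a)        { return -a; }

/* One list of (return type, name, argument types...) entries. */
#define DEMO_LIST              \
    F(int, demo_add, int, int) \
    F(int, demo_neg, int)

/* Expansion 1: a function typedef and a NULL pointer for every entry. */
#define F(ret, name, ...)                 \
    typedef ret name##_proc(__VA_ARGS__); \
    static name##_proc *p_##name = NULL;
DEMO_LIST
#undef F

typedef void generic_fn(void);

/* Hypothetical resolver: maps a symbol name to a function address. */
static generic_fn *demo_resolve(const char *sym)
{
    assert(sym != NULL);
    if (!strcmp(sym, "demo_add")) return (generic_fn *)demo_add;
    if (!strcmp(sym, "demo_neg")) return (generic_fn *)demo_neg;
    return NULL;
}

/* Expansion 2: resolve every entry, failing if any symbol is missing. */
static int demo_init(void)
{
#define F(ret, name, ...)                          \
    p_##name = (name##_proc *)demo_resolve(#name); \
    if (!p_##name)                                 \
        return -1;
    DEMO_LIST
#undef F
    return 0;
}
```

After demo_init() returns 0, p_demo_add(2, 3) calls through the resolved pointer. The appeal of the pattern is that adding a symbol means touching only the list.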

[FFmpeg-devel] [PATCH 0/2] ARM64 HEVC QPEL/EPEL

2021-04-28 Thread Josh Dekker
This is a patch originally submitted in 2017 (author/date info left
intact). At the time it didn't get much attention, I assume due to the
sheer size of it. I have split the patch down to only its QPEL/EPEL parts,
rebased it, and cleaned it up as much as is reasonable for a 9001 line
diff. I also have SAO band (non-working) and 32x32 IDCT (working, but
honestly in a worse state than these patches).

This patch gives a large overall speedup, roughly 30% in my testing. The
only problems are that (as previously stated) 1) it's a lot of code, as the
original author didn't make use of macros, and 2) it's only 8-bit. I will
be writing 10-bit assembly, and whilst I do that I will clean up/macro-ify
the current 8-bit assembly, though there is still lots to be done.

Our current IDCTs for HEVC aren't great either; I had a 40% speedup on
the 16x16 one in testing. The assembly is far from 'done', but we're
slowly getting closer at least.

There were some suggestions for smaller improvements in the previous
reviews, and I have not applied those. The first course of action is to
refactor it so that it is possible to work on the code without going
insane. I think it's fine to use it whilst I'm working on refactoring
it due to the large speedup: the code weight in the binary should be
relatively similar even after that anyway.

Also, updated kperf patch as per Lynne's request.

--
Josh


___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH v4 1/2] lavc/aarch64: change h264pred_init structure

2021-04-19 Thread Josh Dekker
Set applied.

-- 
Josh


[FFmpeg-devel] [PATCH] checkasm: add (private) kperf timing for macOS

2021-04-12 Thread Josh Dekker
Signed-off-by: Josh Dekker 
---
 configure|   2 +
 tests/checkasm/Makefile  |   1 +
 tests/checkasm/checkasm.c|  19 -
 tests/checkasm/checkasm.h|  10 ++-
 tests/checkasm/macos_kperf.c | 143 +++
 tests/checkasm/macos_kperf.h |  23 ++
 6 files changed, 195 insertions(+), 3 deletions(-)
 create mode 100644 tests/checkasm/macos_kperf.c
 create mode 100644 tests/checkasm/macos_kperf.h

diff --git a/configure b/configure
index d7a3f507e8..a47e3dea67 100755
--- a/configure
+++ b/configure
@@ -490,6 +490,7 @@ Developer options (useful when working on FFmpeg itself):
   --ignore-tests=TESTS comma-separated list (without "fate-" prefix
in the name) of tests whose result is ignored
   --enable-linux-perf  enable Linux Performance Monitor API
+  --enable-macos-kperf enable macOS kperf (private) API
   --disable-large-tests disable tests that use a large amount of memory
 
 NOTE: Object files are built at the place where configure is launched.
@@ -1949,6 +1950,7 @@ CONFIG_LIST="
 fontconfig
 large_tests
 linux_perf
+macos_kperf
 memory_poisoning
 neon_clobber_test
 ossfuzz
diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
index 1827a4e134..4abaef9c63 100644
--- a/tests/checkasm/Makefile
+++ b/tests/checkasm/Makefile
@@ -58,6 +58,7 @@ CHECKASMOBJS-$(CONFIG_AVUTIL)  += $(AVUTILOBJS)
 CHECKASMOBJS-$(ARCH_AARCH64)+= aarch64/checkasm.o
 CHECKASMOBJS-$(HAVE_ARMV5TE_EXTERNAL)   += arm/checkasm.o
 CHECKASMOBJS-$(HAVE_X86ASM) += x86/checkasm.o
+CHECKASMOBJS-$(CONFIG_MACOS_KPERF)  += macos_kperf.o
 
 CHECKASMOBJS += $(CHECKASMOBJS-yes) checkasm.o
 CHECKASMOBJS := $(sort $(CHECKASMOBJS:%=tests/checkasm/%))
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 8338e8ff58..4c42040244 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -26,6 +26,8 @@
 # ifndef _GNU_SOURCE
 #  define _GNU_SOURCE // for syscall (performance monitoring API)
 # endif
+#elif CONFIG_MACOS_KPERF
+#include "macos_kperf.h"
 #endif
 
 #include 
@@ -637,9 +639,20 @@ static int bench_init_linux(void)
 }
 return 0;
 }
-#endif
+#elif CONFIG_MACOS_KPERF
+static int bench_init_kperf(void)
+{
+if (ff_kperf_init() || ff_kperf_setup())
+return -1;
 
-#if !CONFIG_LINUX_PERF
+if (ff_kperf_cycles(NULL)) {
+fprintf(stderr, "checkasm must be run as root to use kperf on macOS\n");
+return -1;
+}
+
+return 0;
+}
+#else
 static int bench_init_ffmpeg(void)
 {
 #ifdef AV_READ_TIME
@@ -656,6 +669,8 @@ static int bench_init(void)
 {
 #if CONFIG_LINUX_PERF
 int ret = bench_init_linux();
+#elif CONFIG_MACOS_KPERF
+int ret = bench_init_kperf();
 #else
 int ret = bench_init_ffmpeg();
 #endif
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index ef6645e3a2..4127081d74 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -31,6 +31,8 @@
 #include 
 #include 
 #include 
+#elif CONFIG_MACOS_KPERF
+#include "macos_kperf.h"
 #endif
 
 #include "libavutil/avstring.h"
@@ -224,7 +226,7 @@ typedef struct CheckasmPerf {
 int iterations;
 } CheckasmPerf;
 
-#if defined(AV_READ_TIME) || CONFIG_LINUX_PERF
+#if defined(AV_READ_TIME) || CONFIG_LINUX_PERF || CONFIG_MACOS_KPERF
 
 #if CONFIG_LINUX_PERF
 #define PERF_START(t) do {  \
@@ -235,6 +237,12 @@ typedef struct CheckasmPerf {
 ioctl(sysfd, PERF_EVENT_IOC_DISABLE, 0);\
 read(sysfd, &t, sizeof(t)); \
 } while (0)
+#elif CONFIG_MACOS_KPERF
+#define PERF_START(t) do {  \
+t = 0;  \
+ff_kperf_cycles(&t);\
+} while (0)
+#define PERF_STOP(t) ff_kperf_cycles(&t)
 #else
 #define PERF_START(t) t = AV_READ_TIME()
 #define PERF_STOP(t)  t = AV_READ_TIME() - t
diff --git a/tests/checkasm/macos_kperf.c b/tests/checkasm/macos_kperf.c
new file mode 100644
index 00..e6ae316608
--- /dev/null
+++ b/tests/checkasm/macos_kperf.c
@@ -0,0 +1,143 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#

Re: [FFmpeg-devel] [PATCH v2 0/4] avcodec/aarch64/hevcdsp

2021-02-18 Thread Josh Dekker

Set pushed with all Martin's changes implemented. More NEON & updates soon.

--
Josh

On 2021-02-04 12:32, Josh Dekker wrote:

Hi,

Rebases the unpushed part of my patches on top of Reimar's set.  Also
implements Martin's suggestions, except 'unrolling the loop' for the SAO
band function; I will update the band function when I fix the non-8x8 cases.



[FFmpeg-devel] [PATCH v2 4/4] avcodec/aarch64/hevcdsp: add sao_band NEON

2021-02-04 Thread Josh Dekker
Only works for 8x8.

Signed-off-by: Josh Dekker 
---
 libavcodec/aarch64/Makefile   |  3 +-
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  7 ++
 libavcodec/aarch64/hevcdsp_sao_neon.S | 87 +++
 3 files changed, 96 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/aarch64/hevcdsp_sao_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 2ea1d74a38..954461f81d 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -62,4 +62,5 @@ NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o   \
                                        aarch64/vp9mc_16bpp_neon.o  \
                                        aarch64/vp9mc_neon.o
 NEON-OBJS-$(CONFIG_HEVC_DECODER)        += aarch64/hevcdsp_idct_neon.o \
-                                           aarch64/hevcdsp_init_aarch64.o
+                                           aarch64/hevcdsp_init_aarch64.o \
+                                           aarch64/hevcdsp_sao_neon.o
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index fe111bd1ac..c785e46f79 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -53,6 +53,12 @@ void ff_hevc_idct_4x4_dc_10_neon(int16_t *coeffs);
 void ff_hevc_idct_8x8_dc_10_neon(int16_t *coeffs);
 void ff_hevc_idct_16x16_dc_10_neon(int16_t *coeffs);
 void ff_hevc_idct_32x32_dc_10_neon(int16_t *coeffs);
+void ff_hevc_sao_band_filter_8x8_8_neon(uint8_t *_dst, uint8_t *_src,
+  ptrdiff_t stride_dst, ptrdiff_t stride_src,
+  int16_t *sao_offset_val, int sao_left_class,
+  int width, int height);
+
+
 
 av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 {
@@ -69,6 +75,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 c->idct_dc[1]  = ff_hevc_idct_8x8_dc_8_neon;
 c->idct_dc[2]  = ff_hevc_idct_16x16_dc_8_neon;
 c->idct_dc[3]  = ff_hevc_idct_32x32_dc_8_neon;
+c->sao_band_filter[0]  = ff_hevc_sao_band_filter_8x8_8_neon;
 }
 if (bit_depth == 10) {
 c->add_residual[0] = ff_hevc_add_residual_4x4_10_neon;
diff --git a/libavcodec/aarch64/hevcdsp_sao_neon.S b/libavcodec/aarch64/hevcdsp_sao_neon.S
new file mode 100644
index 00..f142c1e8c2
--- /dev/null
+++ b/libavcodec/aarch64/hevcdsp_sao_neon.S
@@ -0,0 +1,87 @@
+/* -*-arm64-*-
+ * vim: syntax=arm64asm
+ *
+ * AArch64 NEON optimised SAO functions for HEVC decoding
+ *
+ * Copyright (c) 2020 Josh Dekker 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+// void sao_band_filter(uint8_t *_dst, uint8_t *_src,
+//  ptrdiff_t stride_dst, ptrdiff_t stride_src,
+//  int16_t *sao_offset_val, int sao_left_class,
+//  int width, int height)
+function ff_hevc_sao_band_filter_8x8_8_neon, export=1
+sub sp, sp, #64
+stp xzr, xzr, [sp]
+stp xzr, xzr, [sp, #16]
+stp xzr, xzr, [sp, #32]
+stp xzr, xzr, [sp, #48]
+mov w8, #4
+0:
+ldrsh x9, [x4, x8, lsl #1] // x9 = sao_offset_val[k+1]
+subs w8, w8, #1
+add w10, w8, w5 // x10 = k + sao_left_class
+and w10, w10, #0x1F
+strh w9, [sp, x10, lsl #1]
+bne 0b
+ld1 {v16.16b-v19.16b}, [sp], #64
+movi v20.8h, #1
+1:  // beginning of line
+mov w8, w6
+2:
+// Simple layout for accessing 16bit values
+// with 8bit LUT.
+//
+//   00  01  02  03  04  05  06  07
+// +--->
+// |xDE#xAD|xCA#xFE|xBE#xEF|xFE#xED|
+// +--->
+//i-0 i-1 i-2 i-3
+// dst[x] = av_clip_pixel(src[x] + offset_table[src[x] >> shift]);
+ld1 {v2.8b}, [x1]
+// load src[x]
+uxtl v0.8h, v2.8b
+// >> shift
+ushr v2.8h, v0.8h, #3 // BIT_DEPTH - 5
+// x2 (access lower short)
+shl v1.8h, v2.8h, #1 // low (x2, accessing short)
+// +1 acces
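
For reference, the band filter that the assembly above implements can be written as a scalar C sketch. It follows the two steps visible in the excerpt: build a 32-entry offset table from sao_offset_val and sao_left_class, then apply dst[x] = clip(src[x] + offset_table[src[x] >> shift]). The names sao_band_filter_8bit_ref and clip_u8 are hypothetical, and the sketch assumes 8-bit pixels, so shift = 8 - 5 = 3, matching the ushr #3 above. Unlike the NEON version, it is not restricted to 8x8 blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar reference for an 8-bit SAO band filter (illustrative sketch). */
static void sao_band_filter_8bit_ref(uint8_t *dst, const uint8_t *src,
                                     ptrdiff_t stride_dst, ptrdiff_t stride_src,
                                     const int16_t *sao_offset_val,
                                     int sao_left_class, int width, int height)
{
    int16_t offset_table[32] = { 0 };
    const int shift = 8 - 5; /* BIT_DEPTH - 5: 32 bands of 8 values each */

    assert(width > 0 && height > 0);

    /* Four consecutive bands, starting at sao_left_class, get an offset;
     * the band index wraps around modulo 32. */
    for (int k = 0; k < 4; k++)
        offset_table[(k + sao_left_class) & 31] = sao_offset_val[k + 1];

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = clip_u8(src[x] + offset_table[src[x] >> shift]);
        dst += stride_dst;
        src += stride_src;
    }
}
```

A scalar version like this is mostly useful as a cross-check when extending the NEON routine beyond the 8x8 case.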

[FFmpeg-devel] [PATCH v2 1/4] avcodec/aarch64/hevcdsp: port SIMD idct functions

2021-02-04 Thread Josh Dekker
From: Reimar Döffinger 

Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth
available on aarch64.
For a UHD HDR (10 bit) sample video these were consuming the most time
and this optimization reduced overall decode time from 19.4s to 16.4s,
approximately 15% speedup.
Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts",
running on Apple M1.

Signed-off-by: Josh Dekker 
---
 libavcodec/aarch64/Makefile   |   2 +
 libavcodec/aarch64/hevcdsp_idct_neon.S| 380 ++
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  45 +++
 libavcodec/hevcdsp.c  |   2 +
 libavcodec/hevcdsp.h  |   1 +
 5 files changed, 430 insertions(+)
 create mode 100644 libavcodec/aarch64/hevcdsp_idct_neon.S
 create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index f6434e40da..2ea1d74a38 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -61,3 +61,5 @@ NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o   \
                                        aarch64/vp9lpf_neon.o       \
                                        aarch64/vp9mc_16bpp_neon.o  \
                                        aarch64/vp9mc_neon.o
+NEON-OBJS-$(CONFIG_HEVC_DECODER)        += aarch64/hevcdsp_idct_neon.o \
+                                           aarch64/hevcdsp_init_aarch64.o
diff --git a/libavcodec/aarch64/hevcdsp_idct_neon.S b/libavcodec/aarch64/hevcdsp_idct_neon.S
new file mode 100644
index 00..c70d6a906d
--- /dev/null
+++ b/libavcodec/aarch64/hevcdsp_idct_neon.S
@@ -0,0 +1,380 @@
+/*
+ * ARM NEON optimised IDCT functions for HEVC decoding
+ * Copyright (c) 2014 Seppo Tomperi 
+ * Copyright (c) 2017 Alexandra Hájková
+ *
+ * Ported from arm/hevcdsp_idct_neon.S by
+ * Copyright (c) 2020 Reimar Döffinger
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+const trans, align=4
+.short 64, 83, 64, 36
+.short 89, 75, 50, 18
+.short 90, 87, 80, 70
+.short 57, 43, 25, 9
+.short 90, 90, 88, 85
+.short 82, 78, 73, 67
+.short 61, 54, 46, 38
+.short 31, 22, 13, 4
+endconst
+
+.macro sum_sub out, in, c, op, p
+  .ifc \op, +
+smlal\p \out, \in, \c
+  .else
+smlsl\p \out, \in, \c
+  .endif
+.endm
+
+.macro fixsqrshrn d, dt, n, m
+  .ifc \dt, .8h
+sqrshrn2 \d\dt, \n\().4s, \m
+  .else
+sqrshrn \n\().4h, \n\().4s, \m
+mov \d\().d[0], \n\().d[0]
+  .endif
+.endm
+
+// uses and clobbers v28-v31 as temp registers
+.macro tr_4x4_8 in0, in1, in2, in3, out0, out1, out2, out3, p1, p2
+ sshll\p1   v28.4s, \in0, #6
+ mov v29.16b, v28.16b
+ smull\p1   v30.4s, \in1, v0.h[1]
+ smull\p1   v31.4s, \in1, v0.h[3]
+ smlal\p2   v28.4s, \in2, v0.h[0] //e0
+ smlsl\p2   v29.4s, \in2, v0.h[0] //e1
+ smlal\p2   v30.4s, \in3, v0.h[3] //o0
+ smlsl\p2   v31.4s, \in3, v0.h[1] //o1
+
+ add \out0, v28.4s, v30.4s
+ add \out1, v29.4s, v31.4s
+ sub \out2, v29.4s, v31.4s
+ sub \out3, v28.4s, v30.4s
+.endm
+
+.macro transpose8_4x4 r0, r1, r2, r3
+trn1 v2.8h, \r0\().8h, \r1\().8h
+trn2 v3.8h, \r0\().8h, \r1\().8h
+trn1 v4.8h, \r2\().8h, \r3\().8h
+trn2 v5.8h, \r2\().8h, \r3\().8h
+trn1 \r0\().4s, v2.4s, v4.4s
+trn2 \r2\().4s, v2.4s, v4.4s
+trn1 \r1\().4s, v3.4s, v5.4s
+trn2 \r3\().4s, v3.4s, v5.4s
+.endm
+
+.macro transpose_8x8 r0, r1, r2, r3, r4, r5, r6, r7
+transpose8_4x4  \r0, \r1, \r2, \r3
+transpose8_4x4  \r4, \r5, \r6, \r7
+.endm
+
+.macro tr_8x4 shift, in0,in0t, in1,in1t, in2,in2t, in3,in3t, in4,in4t, in5,in5t, in6,in6t, in7,in7t, p1, p2
+tr_4x4_8 \in0\in0t, \in2\in2t, \in4\in4t, \in6\in6t, v24.4s, v25.4s, v26.4s, v27.4s, \p1, \p2
+
+smull\p1 
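
The tr_4x4_8 macro above is the 4-point even/odd butterfly of the HEVC inverse transform. A scalar sketch of one column may make the register dance easier to follow. It assumes v0.h[0..3] holds the first row of the trans table (64, 83, 64, 36); the load is not visible in this excerpt. The function name tr_4x4_ref is hypothetical, and the rounding and narrowing done later by fixsqrshrn are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One column of the 4-point butterfly, before rounding and narrowing. */
static void tr_4x4_ref(const int16_t in[4], int32_t out[4])
{
    assert(in != NULL && out != NULL);

    /* Even part: sshll #6 (in0 * 64), then smlal/smlsl with v0.h[0] = 64. */
    const int32_t e0 = 64 * in[0] + 64 * in[2];
    const int32_t e1 = 64 * in[0] - 64 * in[2];
    /* Odd part: smull with h[1] = 83 and h[3] = 36, then smlal/smlsl. */
    const int32_t o0 = 83 * in[1] + 36 * in[3];
    const int32_t o1 = 36 * in[1] - 83 * in[3];

    out[0] = e0 + o0;
    out[1] = e1 + o1;
    out[2] = e1 - o1;
    out[3] = e0 - o0;
}
```

Feeding in a unit impulse in each input position reproduces the corresponding column of the 4x4 transform matrix, which is a convenient sanity check.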

[FFmpeg-devel] [PATCH v2 2/4] avcodec/aarch64/hevcdsp: port add_residual functions

2021-02-04 Thread Josh Dekker
From: Reimar Döffinger 

Speedup is fairly small, around 1.5%, but these are fairly simple.

Signed-off-by: Josh Dekker 
---
 libavcodec/aarch64/hevcdsp_idct_neon.S| 190 ++
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  24 +++
 2 files changed, 214 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_idct_neon.S b/libavcodec/aarch64/hevcdsp_idct_neon.S
index c70d6a906d..329038a958 100644
--- a/libavcodec/aarch64/hevcdsp_idct_neon.S
+++ b/libavcodec/aarch64/hevcdsp_idct_neon.S
@@ -36,6 +36,196 @@ const trans, align=4
 .short 31, 22, 13, 4
 endconst
 
+.macro clip10 in1, in2, c1, c2
+smax \in1, \in1, \c1
+smax \in2, \in2, \c1
+smin \in1, \in1, \c2
+smin \in2, \in2, \c2
+.endm
+
+function ff_hevc_add_residual_4x4_8_neon, export=1
+ld1 {v0.8h-v1.8h}, [x1]
+ld1 {v2.s}[0], [x0], x2
+ld1 {v2.s}[1], [x0], x2
+ld1 {v2.s}[2], [x0], x2
+ld1 {v2.s}[3], [x0], x2
+sub x0, x0, x2, lsl #2
+uxtl v6.8h,  v2.8B
+uxtl2   v7.8h,  v2.16B
+sqadd   v0.8h,  v0.8h, v6.8h
+sqadd   v1.8h,  v1.8h, v7.8h
+sqxtun  v0.8B,  v0.8h
+sqxtun2 v0.16B, v1.8h
+st1 {v0.s}[0], [x0], x2
+st1 {v0.s}[1], [x0], x2
+st1 {v0.s}[2], [x0], x2
+st1 {v0.s}[3], [x0], x2
+ret
+endfunc
+
+function ff_hevc_add_residual_4x4_10_neon, export=1
+mov x12, x0
+ld1 {v0.8h-v1.8h}, [x1]
+ld1 {v2.d}[0], [x12], x2
+ld1 {v2.d}[1], [x12], x2
+ld1 {v3.d}[0], [x12], x2
+sqadd   v0.8h, v0.8h, v2.8h
+ld1 {V3.d}[1], [x12], x2
+movi v4.8h, #0
+sqadd   v1.8h, v1.8h, v3.8h
+mvni v5.8h, #0xFC, LSL #8 // movi #0x3FF
+clip10  v0.8h, v1.8h, v4.8h, v5.8h
+st1 {v0.d}[0], [x0], x2
+st1 {v0.d}[1], [x0], x2
+st1 {v1.d}[0], [x0], x2
+st1 {v1.d}[1], [x0], x2
+ret
+endfunc
+
+function ff_hevc_add_residual_8x8_8_neon, export=1
+add x12, x0, x2
+add x2,  x2, x2
+mov x3,   #8
+1:  subs x3,   x3, #2
+ld1 {v2.d}[0],   [x0]
+ld1 {v2.d}[1],   [x12]
+uxtl v3.8h,   v2.8B
+ld1 {v0.8h-v1.8h}, [x1], #32
+uxtl2   v2.8h,   v2.16B
+sqadd   v0.8h,   v0.8h,   v3.8h
+sqadd   v1.8h,   v1.8h,   v2.8h
+sqxtun  v0.8B,   v0.8h
+sqxtun2 v0.16B,  v1.8h
+st1 {v0.d}[0],   [x0], x2
+st1 {v0.d}[1],   [x12], x2
+bne 1b
+ret
+endfunc
+
+function ff_hevc_add_residual_8x8_10_neon, export=1
+add x12, x0, x2
+add x2,  x2, x2
+mov x3,  #8
+movi v4.8h, #0
+mvni v5.8h, #0xFC, LSL #8 // movi #0x3FF
+1:  subs x3,  x3, #2
+ld1 {v0.8h-v1.8h}, [x1], #32
+ld1 {v2.8h},[x0]
+sqadd   v0.8h, v0.8h, v2.8h
+ld1 {v3.8h},[x12]
+sqadd   v1.8h, v1.8h, v3.8h
+clip10  v0.8h, v1.8h, v4.8h, v5.8h
+st1 {v0.8h}, [x0], x2
+st1 {v1.8h}, [x12], x2
+bne 1b
+ret
+endfunc
+
+function ff_hevc_add_residual_16x16_8_neon, export=1
+mov x3,  #16
+add x12, x0, x2
+add x2,  x2, x2
+1:  subs x3,  x3, #2
+ld1 {v16.16B}, [x0]
+ld1 {v0.8h-v3.8h}, [x1], #64
+ld1 {v19.16B},[x12]
+uxtl v17.8h, v16.8B
+uxtl2   v18.8h, v16.16B
+uxtl v20.8h, v19.8B
+uxtl2   v21.8h, v19.16B
+sqadd   v0.8h,  v0.8h, v17.8h
+sqadd   v1.8h,  v1.8h, v18.8h
+sqadd   v2.8h,  v2.8h, v20.8h
+sqadd   v3.8h,  v3.8h, v21.8h
+sqxtun  v0.8B,  v0.8h
+sqxtun2 v0.16B, v1.8h
+sqxtun  v1.8B,  v2.8h
+sqxtun2 v1.16B, v3.8h
+st1 {v0.16B}, [x0], x2
+st1 {v1.16B}, [x12], x2
+bne 1b
+ret
+endfunc
+
+function ff_hevc_add_residual_16x16_10_neon, export=1
+mov x3,  #16
+movi v20.8h, #0
+mvni v21.8h, #0xFC, LSL #8 // movi #0x3FF
+add x12, x0, x2
+add
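
As a plain-C point of comparison for the add_residual functions above: each one adds a block of int16_t residuals to the destination pixels and clamps to the valid pixel range (0..255 for 8-bit via sqxtun, 0..0x3FF for 10-bit via clip10). The sketch below uses hypothetical names (clip_int, add_residual_8_ref, add_residual_10_ref) and, as a simplifying assumption, takes the stride in pixels rather than bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int clip_int(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* 8-bit: residuals are stored contiguously, size values per row. */
static void add_residual_8_ref(uint8_t *dst, const int16_t *res,
                               ptrdiff_t stride, int size)
{
    assert(size == 4 || size == 8 || size == 16 || size == 32);
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            dst[x] = (uint8_t)clip_int(dst[x] + res[x], 0, 255);
        dst += stride;
        res += size;
    }
}

/* 10-bit: pixels are uint16_t and the clamp is to 0..0x3FF. */
static void add_residual_10_ref(uint16_t *dst, const int16_t *res,
                                ptrdiff_t stride, int size)
{
    assert(size == 4 || size == 8 || size == 16 || size == 32);
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            dst[x] = (uint16_t)clip_int(dst[x] + res[x], 0, 0x3FF);
        dst += stride;
        res += size;
    }
}
```

The NEON versions process two rows per iteration where they can, which is why they keep a second row pointer in x12 and double the stride.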

[FFmpeg-devel] [PATCH v2 3/4] avcodec/aarch64/hevcdsp: add idct_dc NEON

2021-02-04 Thread Josh Dekker
Signed-off-by: Josh Dekker 
---
 libavcodec/aarch64/hevcdsp_idct_neon.S| 54 +++
 libavcodec/aarch64/hevcdsp_init_aarch64.c | 16 +++
 2 files changed, 70 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_idct_neon.S b/libavcodec/aarch64/hevcdsp_idct_neon.S
index 329038a958..d3902a9e0f 100644
--- a/libavcodec/aarch64/hevcdsp_idct_neon.S
+++ b/libavcodec/aarch64/hevcdsp_idct_neon.S
@@ -5,6 +5,7 @@
  *
  * Ported from arm/hevcdsp_idct_neon.S by
  * Copyright (c) 2020 Reimar Döffinger
+ * Copyright (c) 2020 Josh Dekker
  *
  * This file is part of FFmpeg.
  *
@@ -568,3 +569,56 @@ tr_16x4 secondpass_10, 20 - 10, 512, 1
 
 idct_16x16 8
 idct_16x16 10
+
+// void ff_hevc_idct_NxN_dc_DEPTH_neon(int16_t *coeffs)
+.macro idct_dc size bitdepth
+function ff_hevc_idct_\size\()x\size\()_dc_\bitdepth\()_neon, export=1
+ldrsh   w1, [x0]
+mov w2,  #(1 << (13 - \bitdepth))
+add w1,  w1, #1
+asr w1,  w1, #1
+add w1,  w1, w2
+asr w1,  w1, #(14 - \bitdepth)
+dup  v0.8h,  w1
+dup  v1.8h,  w1
+.if \size > 4
+dup  v2.8h,  w1
+dup  v3.8h,  w1
+.if \size > 16 /* dc 32x32 */
+mov x2,  #4
+1:
+subs x2,  x2, #1
+.endif
+add x12,  x0,  #64
+mov x13,  #128
+.if \size > 8 /* dc 16x16 */
+st1   {v0.8h-v3.8h}, [ x0], x13
+st1   {v0.8h-v3.8h}, [x12], x13
+st1   {v0.8h-v3.8h}, [ x0], x13
+st1   {v0.8h-v3.8h}, [x12], x13
+st1   {v0.8h-v3.8h}, [ x0], x13
+st1   {v0.8h-v3.8h}, [x12], x13
+.endif /* dc 8x8 */
+st1   {v0.8h-v3.8h}, [ x0], x13
+st1   {v0.8h-v3.8h}, [x12], x13
+.if \size > 16 /* dc 32x32 */
+bne 1b
+.endif
+.else /* dc 4x4 */
+st1   {v0.8h-v1.8h}, [x0]
+.endif
+ret
+endfunc
+.endm
+
+idct_dc 4 8
+idct_dc 4 10
+
+idct_dc 8 8
+idct_dc 8 10
+
+idct_dc 16 8
+idct_dc 16 10
+
+idct_dc 32 8
+idct_dc 32 10
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 4c29daa6d5..fe111bd1ac 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -45,6 +45,14 @@ void ff_hevc_idct_8x8_8_neon(int16_t *coeffs, int col_limit);
 void ff_hevc_idct_8x8_10_neon(int16_t *coeffs, int col_limit);
 void ff_hevc_idct_16x16_8_neon(int16_t *coeffs, int col_limit);
 void ff_hevc_idct_16x16_10_neon(int16_t *coeffs, int col_limit);
+void ff_hevc_idct_4x4_dc_8_neon(int16_t *coeffs);
+void ff_hevc_idct_8x8_dc_8_neon(int16_t *coeffs);
+void ff_hevc_idct_16x16_dc_8_neon(int16_t *coeffs);
+void ff_hevc_idct_32x32_dc_8_neon(int16_t *coeffs);
+void ff_hevc_idct_4x4_dc_10_neon(int16_t *coeffs);
+void ff_hevc_idct_8x8_dc_10_neon(int16_t *coeffs);
+void ff_hevc_idct_16x16_dc_10_neon(int16_t *coeffs);
+void ff_hevc_idct_32x32_dc_10_neon(int16_t *coeffs);
 
 av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 {
@@ -57,6 +65,10 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 c->add_residual[3] = ff_hevc_add_residual_32x32_8_neon;
 c->idct[1] = ff_hevc_idct_8x8_8_neon;
 c->idct[2] = ff_hevc_idct_16x16_8_neon;
+c->idct_dc[0]  = ff_hevc_idct_4x4_dc_8_neon;
+c->idct_dc[1]  = ff_hevc_idct_8x8_dc_8_neon;
+c->idct_dc[2]  = ff_hevc_idct_16x16_dc_8_neon;
+c->idct_dc[3]  = ff_hevc_idct_32x32_dc_8_neon;
 }
 if (bit_depth == 10) {
 c->add_residual[0] = ff_hevc_add_residual_4x4_10_neon;
@@ -65,5 +77,9 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 c->add_residual[3] = ff_hevc_add_residual_32x32_10_neon;
 c->idct[1] = ff_hevc_idct_8x8_10_neon;
 c->idct[2] = ff_hevc_idct_16x16_10_neon;
+c->idct_dc[0]  = ff_hevc_idct_4x4_dc_10_neon;
+c->idct_dc[1]  = ff_hevc_idct_8x8_dc_10_neon;
+c->idct_dc[2]  = ff_hevc_idct_16x16_dc_10_neon;
+c->idct_dc[3]  = ff_hevc_idct_32x32_dc_10_neon;
 }
 }
-- 
2.24.3 (Apple Git-128)
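
The arithmetic in the idct_dc macro of this patch can be summarised as a short scalar sketch: take the DC coefficient, apply the two rounding shifts, and fill the whole block with the result. The function name idct_dc_ref is hypothetical; the expression mirrors the add/asr sequence in the assembly. Right shifts of negative values are assumed to be arithmetic, as they are on the targets in question.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference for the DC-only inverse transform (illustrative). */
static void idct_dc_ref(int16_t *coeffs, int size, int bit_depth)
{
    assert(size == 4 || size == 8 || size == 16 || size == 32);
    assert(bit_depth >= 8 && bit_depth < 14);

    const int shift = 14 - bit_depth;   /* asr w1, w1, #(14 - bitdepth)   */
    const int add   = 1 << (shift - 1); /* mov w2, #(1 << (13 - bitdepth)) */
    const int dc    = (((coeffs[0] + 1) >> 1) + add) >> shift;

    /* Every output coefficient of a DC-only block has the same value. */
    for (int i = 0; i < size * size; i++)
        coeffs[i] = (int16_t)dc;
}
```

For example, a DC coefficient of 64 gives 1 at 8-bit depth and 2 at 10-bit depth.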


[FFmpeg-devel] [PATCH v2 0/4] avcodec/aarch64/hevcdsp

2021-02-04 Thread Josh Dekker
Hi,

Rebases the unpushed part of my patches on top of Reimar's set.  Also
implements Martin's suggestions, except 'unrolling the loop' for the SAO
band function; I will update the band function when I fix the non-8x8 cases.

-- 
Josh



Re: [FFmpeg-devel] Patch for FFmpeg

2021-01-25 Thread Josh Dekker

On 2021-01-13 17:06, Robin Cooksey wrote:

I’ve attached a patch which makes avformat handle the 308 Permanent Redirect 
HTTP status code – which is more recently defined in 
https://tools.ietf.org/html/rfc7538

The change just treats 308 in the same way as the other 30x status codes.


Thanks. Applied with a slightly edited commit message to conform to our 
conventions & a small reference to the spec.
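
For readers unfamiliar with the change, "treating 308 like the other 30x codes" amounts to something like the sketch below. It assumes the redirect codes already handled are 301, 302, 303 and 307. is_http_redirect() is a hypothetical helper written for illustration, not a function from libavformat.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true for status codes that should trigger following Location. */
static bool is_http_redirect(int status)
{
    assert(status >= 100 && status <= 599); /* valid HTTP status range */

    switch (status) {
    case 301: /* Moved Permanently */
    case 302: /* Found */
    case 303: /* See Other */
    case 307: /* Temporary Redirect */
    case 308: /* Permanent Redirect (RFC 7538) */
        return true;
    default:
        return false;
    }
}
```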


--
Josh

Re: [FFmpeg-devel] [PATCH 4/4] checkasm: add hevc_pel tests

2021-01-25 Thread Josh Dekker

On 2021-01-07 13:10, Josh Dekker wrote:

Co-authored-by: Niklas Haas 
Signed-off-by: Josh Dekker 
---
  tests/checkasm/Makefile   |   2 +-
  tests/checkasm/checkasm.c |  10 +
  tests/checkasm/checkasm.h |  10 +
  tests/checkasm/hevc_pel.c | 523 ++
  4 files changed, 544 insertions(+), 1 deletion(-)
  create mode 100644 tests/checkasm/hevc_pel.c

[...]


Pushed (only this patch).

--
Josh

Re: [FFmpeg-devel] [PATCH] configure: add fallback to $arch in msvc assembler check.

2021-01-25 Thread Josh Dekker

On 2021-01-23 14:14, Martin Storsjö wrote:

On Sat, 23 Jan 2021, Reimar Döffinger wrote:


Setting the defaults for $arch happens only later, so
the current code would not set AS correctly if --arch
was not specified on the command-line.
Fix it by adding an explicit fallback to $arch_default.
---
configure | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/configure b/configure
index 54fbbd6b5f..df298b4b9b 100755
--- a/configure
+++ b/configure
@@ -4268,7 +4268,7 @@ case "$toolchain" in
    ld_default="$source_path/compat/windows/mslink"
    nm_default="dumpbin.exe -symbols"
    ar_default="lib.exe"
-    case "$arch" in
+    case "${arch:-$arch_default}" in
    aarch64|arm64)
    as_default="armasm64.exe"
    ;;
--
2.30.0


LGTM, thanks!

// Martin


Thanks, pushed.

--
Josh

Re: [FFmpeg-devel] [PATCH] libavcodec/hevcdsp: port SIMD idct functions from 32-bit.

2021-01-12 Thread Josh Dekker

Hi,

On 2021-01-08 21:36, reimar.doeffin...@gmx.de wrote:

From: Reimar Döffinger 

Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth
available on aarch64.
For a UHD HDR (10 bit) sample video these were consuming the most time
and this optimization reduced overall decode time from 19.4s to 16.4s,
approximately 15% speedup.
Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts",
running on Apple M1.
---
  libavcodec/aarch64/Makefile   |   2 +
  libavcodec/aarch64/hevcdsp_idct_neon.S| 426 ++
  libavcodec/aarch64/hevcdsp_init_aarch64.c |  45 +++
  libavcodec/hevcdsp.c  |   2 +
  libavcodec/hevcdsp.h  |   1 +
  5 files changed, 476 insertions(+)
  create mode 100644 libavcodec/aarch64/hevcdsp_idct_neon.S
  create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c

[...]


AS  libavcodec/aarch64/hevcdsp_idct_neon.o
libavcodec/aarch64/hevcdsp_idct_neon.S: Assembler messages:
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Error: operand mismatch -- 
`mov v29.4S,v28.4S'

libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:did you mean this?
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:   mov v29.8b, v28.8b
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:other valid variant(s):
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:   mov v29.16b, v28.16b
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Error: operand mismatch -- 
`mov v29.4S,v28.4S'

libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:did you mean this?
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:   mov v29.8b, v28.8b
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:other valid variant(s):
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:   mov v29.16b, v28.16b
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Error: operand mismatch -- `mov v29.4S,v28.4S'

libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:did you mean this?
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:   mov v29.8b, v28.8b
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:other valid variant(s):
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:   mov v29.16b, v28.16b
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Error: operand mismatch -- `mov v29.4S,v28.4S'

libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:did you mean this?
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:   mov v29.8b, v28.8b
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:other valid variant(s):
libavcodec/aarch64/hevcdsp_idct_neon.S:418: Info:   mov v29.16b, v28.16b

This doesn't build with the GNU assembler (GNU Binutils for Ubuntu) 2.34
(aarch64). Thanks for porting this; I was in the process of writing HEVC
assembly (see my set on the ML) and would be interested in rebasing this
on top of that set.


--
Josh

[FFmpeg-devel] [PATCH 4/4] checkasm: add hevc_pel tests

2021-01-07 Thread Josh Dekker
Co-authored-by: Niklas Haas 
Signed-off-by: Josh Dekker 
---
 tests/checkasm/Makefile   |   2 +-
 tests/checkasm/checkasm.c |  10 +
 tests/checkasm/checkasm.h |  10 +
 tests/checkasm/hevc_pel.c | 523 ++
 4 files changed, 544 insertions(+), 1 deletion(-)
 create mode 100644 tests/checkasm/hevc_pel.c

diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
index 9e9569777b..1827a4e134 100644
--- a/tests/checkasm/Makefile
+++ b/tests/checkasm/Makefile
@@ -24,7 +24,7 @@ AVCODECOBJS-$(CONFIG_HUFFYUV_DECODER)   += huffyuvdsp.o
 AVCODECOBJS-$(CONFIG_JPEG2000_DECODER)  += jpeg2000dsp.o
 AVCODECOBJS-$(CONFIG_OPUS_DECODER)  += opusdsp.o
 AVCODECOBJS-$(CONFIG_PIXBLOCKDSP)   += pixblockdsp.o
-AVCODECOBJS-$(CONFIG_HEVC_DECODER)  += hevc_add_res.o hevc_idct.o hevc_sao.o
+AVCODECOBJS-$(CONFIG_HEVC_DECODER)  += hevc_add_res.o hevc_idct.o hevc_sao.o hevc_pel.o
 AVCODECOBJS-$(CONFIG_UTVIDEO_DECODER)   += utvideodsp.o
 AVCODECOBJS-$(CONFIG_V210_DECODER)  += v210dec.o
 AVCODECOBJS-$(CONFIG_V210_ENCODER)  += v210enc.o
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index b3ac76c325..8338e8ff58 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -116,6 +116,16 @@ static const struct {
 #if CONFIG_HEVC_DECODER
 { "hevc_add_res", checkasm_check_hevc_add_res },
 { "hevc_idct", checkasm_check_hevc_idct },
+{ "hevc_qpel", checkasm_check_hevc_qpel },
+{ "hevc_qpel_uni", checkasm_check_hevc_qpel_uni },
+{ "hevc_qpel_uni_w", checkasm_check_hevc_qpel_uni_w },
+{ "hevc_qpel_bi", checkasm_check_hevc_qpel_bi },
+{ "hevc_qpel_bi_w", checkasm_check_hevc_qpel_bi_w },
+{ "hevc_epel", checkasm_check_hevc_epel },
+{ "hevc_epel_uni", checkasm_check_hevc_epel_uni },
+{ "hevc_epel_uni_w", checkasm_check_hevc_epel_uni_w },
+{ "hevc_epel_bi", checkasm_check_hevc_epel_bi },
+{ "hevc_epel_bi_w", checkasm_check_hevc_epel_bi_w },
 { "hevc_sao", checkasm_check_hevc_sao },
 #endif
 #if CONFIG_HUFFYUV_DECODER
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index 0190bc912c..ef6645e3a2 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -58,6 +58,16 @@ void checkasm_check_h264pred(void);
 void checkasm_check_h264qpel(void);
 void checkasm_check_hevc_add_res(void);
 void checkasm_check_hevc_idct(void);
+void checkasm_check_hevc_qpel(void);
+void checkasm_check_hevc_qpel_uni(void);
+void checkasm_check_hevc_qpel_uni_w(void);
+void checkasm_check_hevc_qpel_bi(void);
+void checkasm_check_hevc_qpel_bi_w(void);
+void checkasm_check_hevc_epel(void);
+void checkasm_check_hevc_epel_uni(void);
+void checkasm_check_hevc_epel_uni_w(void);
+void checkasm_check_hevc_epel_bi(void);
+void checkasm_check_hevc_epel_bi_w(void);
 void checkasm_check_hevc_sao(void);
 void checkasm_check_huffyuvdsp(void);
 void checkasm_check_jpeg2000dsp(void);
diff --git a/tests/checkasm/hevc_pel.c b/tests/checkasm/hevc_pel.c
new file mode 100644
index 00..236404f8ff
--- /dev/null
+++ b/tests/checkasm/hevc_pel.c
@@ -0,0 +1,523 @@
+/*
+ * Copyright (c) 2015 Henrik Gramner
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include 
+#include "checkasm.h"
+#include "libavcodec/hevcdsp.h"
+#include "libavutil/common.h"
+#include "libavutil/internal.h"
+#include "libavutil/intreadwrite.h"
+
+static const uint32_t pixel_mask[] = { 0x, 0x01ff01ff, 0x03ff03ff, 0x07ff07ff, 0x0fff0fff };
+static const uint32_t pixel_mask16[] = { 0x00ff00ff, 0x01ff01ff, 0x03ff03ff, 0x07ff07ff, 0x0fff0fff };
+static const int sizes[] = { -1, 4, 6, 8, 12, 16, 24, 32, 48, 64 };
+static const int weights[] = { 0, 128, 255, -1 };
+static const int denoms[] = {0, 7, 12, -1 };
+static const int offsets[] = {0, 255, -1 };
+
+#define SIZEOF_PIXEL ((bit_depth + 7) / 8)
+#define BUF_SIZE (2 * MAX_PB_SIZE * (2 * 4 + MAX_PB_SIZE))
+
+#define randomize_buffers()  \
+do { \
+uint32_t mask = p
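
Note on the mask tables above: each pixel_mask16 entry looks like the largest
sample value for a bit depth from 8 to 12, repeated in both 16-bit halves of a
32-bit word. The following is a minimal C sketch of that relationship only.
The helper names are hypothetical, and the idea that the tables are indexed by
bit_depth - 8 is an assumption, not something shown in the quoted patch.

```c
#include <stdint.h>

/* Hypothetical helper: bytes per sample, mirroring the SIZEOF_PIXEL macro
 * ((bit_depth + 7) / 8) from the patch. */
static int sizeof_pixel(int bit_depth)
{
    return (bit_depth + 7) / 8;
}

/* Hypothetical helper: largest sample value for bit_depth, replicated into
 * both 16-bit lanes of a 32-bit word, as in the pixel_mask16 entries.
 * Assumes 8 <= bit_depth <= 12. */
static uint32_t pixel_mask16_for(int bit_depth)
{
    uint32_t max_sample = (1u << bit_depth) - 1u;
    return max_sample * 0x00010001u;
}
```

ANDing a random 32-bit word with such a mask keeps two packed samples inside
the valid range for the bit depth, which is presumably what the (truncated)
randomize_buffers() macro relies on.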

[FFmpeg-devel] [PATCH 3/4] lavc/aarch64: add HEVC sao_band NEON

2021-01-07 Thread Josh Dekker
Only works for 8x8.

Signed-off-by: Josh Dekker 
---
 libavcodec/aarch64/Makefile   |  3 +-
 libavcodec/aarch64/hevcdsp_init.c |  7 +++
 libavcodec/aarch64/hevcdsp_sao_neon.S | 87 +++
 3 files changed, 96 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/aarch64/hevcdsp_sao_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 42d80bf74c..1f54fc31f4 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -55,7 +55,8 @@ NEON-OBJS-$(CONFIG_VP8DSP)  += aarch64/vp8dsp_neon.o
 NEON-OBJS-$(CONFIG_AAC_DECODER) += aarch64/aacpsdsp_neon.o
 NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o
 NEON-OBJS-$(CONFIG_HEVC_DECODER)+= aarch64/hevcdsp_add_res_neon.o \
-   aarch64/hevcdsp_idct_neon.o
+   aarch64/hevcdsp_idct_neon.o \
+   aarch64/hevcdsp_sao_neon.o
 NEON-OBJS-$(CONFIG_OPUS_DECODER)+= aarch64/opusdsp_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER)  += aarch64/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o \
diff --git a/libavcodec/aarch64/hevcdsp_init.c b/libavcodec/aarch64/hevcdsp_init.c
index 2cd7ef3a6c..8f0a923ab1 100644
--- a/libavcodec/aarch64/hevcdsp_init.c
+++ b/libavcodec/aarch64/hevcdsp_init.c
@@ -23,6 +23,11 @@
 #include "libavcodec/hevcdsp.h"
 #include "libavcodec/avcodec.h"
 
+void ff_hevc_sao_band_filter_8x8_8_neon(uint8_t *_dst, uint8_t *_src,
+  ptrdiff_t stride_dst, ptrdiff_t stride_src,
+  int16_t *sao_offset_val, int sao_left_class,
+  int width, int height);
+
 void ff_hevc_idct_4x4_dc_8_neon(int16_t *coeffs);
 void ff_hevc_idct_8x8_dc_8_neon(int16_t *coeffs);
 void ff_hevc_idct_16x16_dc_8_neon(int16_t *coeffs);
@@ -53,6 +58,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 {
 int cpu_flags = av_get_cpu_flags();
 if (have_neon(cpu_flags) && bit_depth == 8) {
+c->sao_band_filter[0]  = ff_hevc_sao_band_filter_8x8_8_neon;
+
 c->add_residual[0] = ff_hevc_add_residual_4x4_8_neon;
 c->add_residual[1] = ff_hevc_add_residual_8x8_8_neon;
 c->add_residual[2] = ff_hevc_add_residual_16x16_8_neon;
diff --git a/libavcodec/aarch64/hevcdsp_sao_neon.S b/libavcodec/aarch64/hevcdsp_sao_neon.S
new file mode 100644
index 00..25b6c25117
--- /dev/null
+++ b/libavcodec/aarch64/hevcdsp_sao_neon.S
@@ -0,0 +1,87 @@
+/* -*-arm64-*-
+ * vim: syntax=arm64asm
+ *
+ * AArch64 NEON optimised SAO functions for HEVC decoding
+ *
+ * Copyright (c) 2020 Josh Dekker 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+// void sao_band_filter(uint8_t *_dst, uint8_t *_src,
+//  ptrdiff_t stride_dst, ptrdiff_t stride_src,
+//  int16_t *sao_offset_val, int sao_left_class,
+//  int width, int height)
+function ff_hevc_sao_band_filter_8x8_8_neon, export=1
+sub sp, sp, #64
+stp xzr, xzr, [sp]
+stp xzr, xzr, [sp, #16]
+stp xzr, xzr, [sp, #32]
+stp xzr, xzr, [sp, #48]
+mov w8, #4
+.setup:
+ldrsh x9, [x4, x8, lsl #1] // x9 = sao_offset_val[k+1]
+subs w8, w8, #1
+add w10, w8, w5 // x10 = k + sao_left_class
+and w10, w10, #0x1F
+strh w9, [sp, x10, lsl #1]
+bne .setup
+ld1 {v16.16B-v19.16B}, [sp], #64
+movi v20.8H, #1
+0:  // beginning of line
+mov w8, w6
+8:
+// Simple layout for accessing 16bit values
+// with 8bit LUT.
+//
+//   00  01  02  03  04  05  06  07
+// +--->
+// |xDE#xAD|xCA#xFE|xBE#xEF|xFE#xED|
+// +--->
+//i-0 i-1 i-2 i-3
+// dst[x] = av_clip_pixel(src[x] + offset_table[src[x] >> shift]);
+ld1 {v2.8B}, [x1]
+// load src[x]
+ushll v0.8H, v2.8B, #0
+// >> shift
+ushr v2.8H, v0.8H, #3 // BIT_DEPTH 
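
For readers following the NEON code above, here is a scalar C sketch of what
the setup loop and the commented formula appear to compute for 8-bit samples:
four offsets are placed into a 32-entry band table starting at sao_left_class
(wrapping modulo 32), then each sample is offset by its band entry and clipped.
The function name sao_band_filter_ref and the clip_u8 helper are hypothetical;
this is an illustration under those assumptions, not the project's reference
code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: clip to the 8-bit sample range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar sketch of an 8-bit SAO band filter (name is illustrative). */
static void sao_band_filter_ref(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride_dst, ptrdiff_t stride_src,
                                const int16_t *sao_offset_val,
                                int sao_left_class, int width, int height)
{
    int16_t offset_table[32] = { 0 };
    const int shift = 8 - 5; /* 8-bit samples, 32 bands: matches ushr #3 */

    /* Same mapping as the .setup loop: band (k + left_class) & 31
     * receives sao_offset_val[k + 1]. */
    for (int k = 0; k < 4; k++)
        offset_table[(k + sao_left_class) & 31] = sao_offset_val[k + 1];

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = clip_u8(src[x] + offset_table[src[x] >> shift]);
        dst += stride_dst;
        src += stride_src;
    }
}
```

The NEON version keeps the same 32 x 16-bit table in v16-v19 (64 bytes), which
is why the stack scratch area above is 64 bytes.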

[FFmpeg-devel] [PATCH 2/4] lavc/aarch64: add HEVC idct_dc NEON

2021-01-07 Thread Josh Dekker
Signed-off-by: Josh Dekker 
---
 libavcodec/aarch64/Makefile|  3 +-
 libavcodec/aarch64/hevcdsp_idct_neon.S | 74 ++
 libavcodec/aarch64/hevcdsp_init.c  | 19 +++
 3 files changed, 95 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/aarch64/hevcdsp_idct_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 4bdd554e7e..42d80bf74c 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -54,7 +54,8 @@ NEON-OBJS-$(CONFIG_VP8DSP)  += aarch64/vp8dsp_neon.o
 # decoders/encoders
 NEON-OBJS-$(CONFIG_AAC_DECODER) += aarch64/aacpsdsp_neon.o
 NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o
-NEON-OBJS-$(CONFIG_HEVC_DECODER)+= aarch64/hevcdsp_add_res_neon.o
+NEON-OBJS-$(CONFIG_HEVC_DECODER)+= aarch64/hevcdsp_add_res_neon.o \
+   aarch64/hevcdsp_idct_neon.o
 NEON-OBJS-$(CONFIG_OPUS_DECODER)+= aarch64/opusdsp_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER)  += aarch64/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o \
diff --git a/libavcodec/aarch64/hevcdsp_idct_neon.S b/libavcodec/aarch64/hevcdsp_idct_neon.S
new file mode 100644
index 00..cd886bb6dc
--- /dev/null
+++ b/libavcodec/aarch64/hevcdsp_idct_neon.S
@@ -0,0 +1,74 @@
+/* -*-arm64-*-
+ *
+ * AArch64 NEON optimised IDCT functions for HEVC decoding
+ *
+ * Copyright (c) 2020 Josh Dekker 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+.macro idct_dc size bitdepth
+function ff_hevc_idct_\size\()x\size\()_dc_\bitdepth\()_neon, export=1
+ldrsh w1, [x0]
+mov   w2, #(1 << (13 - \bitdepth))
+add   w1, w1, #1
+asr   w1, w1, #1
+add   w1, w1, w2
+asr   w1, w1, #(14 - \bitdepth)
+dup   v0.8h, w1
+dup   v1.8h, w1
+.if \size > 4
+dup   v2.8h, w1
+dup   v3.8h, w1
+.if \size > 16 /* dc 32x32 */
+mov x2, #4
+1:
+subs x2, x2, #1
+.endif
+.if \size > 8 /* dc 16x16 */
+st1   {v0.8h-v3.8h}, [x0], #64
+st1   {v0.8h-v3.8h}, [x0], #64
+st1   {v0.8h-v3.8h}, [x0], #64
+st1   {v0.8h-v3.8h}, [x0], #64
+st1   {v0.8h-v3.8h}, [x0], #64
+st1   {v0.8h-v3.8h}, [x0], #64
+.endif /* dc 8x8 */
+st1   {v0.8h-v3.8h}, [x0], #64
+st1   {v0.8h-v3.8h}, [x0], #64
+.if \size > 16 /* dc 32x32 */
+bne 1b
+.endif
+.else /* dc 4x4 */
+st1   {v0.8h-v1.8h}, [x0]
+.endif
+ret
+endfunc
+.endm
+
+idct_dc 4 8
+idct_dc 4 10
+
+idct_dc 8 8
+idct_dc 8 10
+
+idct_dc 16 8
+idct_dc 16 10
+
+idct_dc 32 8
+idct_dc 32 10
diff --git a/libavcodec/aarch64/hevcdsp_init.c b/libavcodec/aarch64/hevcdsp_init.c
index f0a617ab39..2cd7ef3a6c 100644
--- a/libavcodec/aarch64/hevcdsp_init.c
+++ b/libavcodec/aarch64/hevcdsp_init.c
@@ -23,6 +23,15 @@
 #include "libavcodec/hevcdsp.h"
 #include "libavcodec/avcodec.h"
 
+void ff_hevc_idct_4x4_dc_8_neon(int16_t *coeffs);
+void ff_hevc_idct_8x8_dc_8_neon(int16_t *coeffs);
+void ff_hevc_idct_16x16_dc_8_neon(int16_t *coeffs);
+void ff_hevc_idct_32x32_dc_8_neon(int16_t *coeffs);
+void ff_hevc_idct_4x4_dc_10_neon(int16_t *coeffs);
+void ff_hevc_idct_8x8_dc_10_neon(int16_t *coeffs);
+void ff_hevc_idct_16x16_dc_10_neon(int16_t *coeffs);
+void ff_hevc_idct_32x32_dc_10_neon(int16_t *coeffs);
+
 void ff_hevc_add_residual_4x4_8_neon(uint8_t *_dst, int16_t *coeffs,
  ptrdiff_t stride);
 void ff_hevc_add_residual_4x4_10_neon(uint8_t *_dst, int16_t *coeffs,
@@ -48,6 +57,11 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 c->add_residual[1] = ff_hevc_add_residual_8x8_8_neon;
 c->add_residual[2] = ff_hevc_add_residual_16x16_8_neon;
 c->add_residual[3] = ff_hevc_add_residual_32x32_8_neon;
+
+c->idct_dc[0]  = ff_hevc_idct_4x4_dc_8_neon;
+c->idct_dc[1]  = ff_hevc_idct_8x8_dc_8_neon;
+c->idct_dc[2]  = ff_hevc_idct_16x16_dc_8_neon;
+c->idct_dc[3]  = ff_hevc_idct_32x32_dc_8_neon;
 }
 
 if (have_
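
As a reading aid for the idct_dc macro in this patch, the arithmetic it
performs on the DC coefficient can be sketched in scalar C as below. The
function name idct_dc_ref is hypothetical, and the sketch assumes arithmetic
right shift of negative values (as asr does), which C leaves
implementation-defined.

```c
#include <stdint.h>

/* Scalar sketch of the DC-only inverse transform (name is illustrative):
 * every coefficient of the size x size block is replaced by the rounded,
 * scaled DC value. */
static void idct_dc_ref(int16_t *coeffs, int size, int bit_depth)
{
    int shift = 14 - bit_depth;       /* asr w1, w1, #(14 - bitdepth)    */
    int add   = 1 << (shift - 1);     /* mov w2, #(1 << (13 - bitdepth)) */
    int coeff = (((coeffs[0] + 1) >> 1) + add) >> shift;

    for (int i = 0; i < size * size; i++)
        coeffs[i] = (int16_t)coeff;
}
```

The store counts in the macro line up with this: a 4x4 block is 32 bytes (one
two-register st1), 8x8 is 128 bytes (two 64-byte stores), 16x16 is 512 bytes
(eight stores), and 32x32 is 2048 bytes (four loop iterations of eight stores).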

[FFmpeg-devel] [PATCH 1/4] lavc/aarch64: add HEVC add_residual NEON

2021-01-07 Thread Josh Dekker
Signed-off-by: Josh Dekker 
---
 libavcodec/aarch64/Makefile   |   2 +
 libavcodec/aarch64/hevcdsp_add_res_neon.S | 298 ++
 libavcodec/aarch64/hevcdsp_init.c |  59 +
 libavcodec/hevcdsp.c  |   2 +
 libavcodec/hevcdsp.h  |   1 +
 5 files changed, 362 insertions(+)
 create mode 100644 libavcodec/aarch64/hevcdsp_add_res_neon.S
 create mode 100644 libavcodec/aarch64/hevcdsp_init.c

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index f6434e40da..4bdd554e7e 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -17,6 +17,7 @@ OBJS-$(CONFIG_VP8DSP)   += aarch64/vp8dsp_init_aarch64.o
 OBJS-$(CONFIG_AAC_DECODER)  += aarch64/aacpsdsp_init_aarch64.o \
aarch64/sbrdsp_init_aarch64.o
 OBJS-$(CONFIG_DCA_DECODER)  += aarch64/synth_filter_init.o
+OBJS-$(CONFIG_HEVC_DECODER) += aarch64/hevcdsp_init.o
 OBJS-$(CONFIG_OPUS_DECODER) += aarch64/opusdsp_init.o
 OBJS-$(CONFIG_RV40_DECODER) += aarch64/rv40dsp_init_aarch64.o
 OBJS-$(CONFIG_VC1DSP)   += aarch64/vc1dsp_init_aarch64.o
@@ -53,6 +54,7 @@ NEON-OBJS-$(CONFIG_VP8DSP)  += aarch64/vp8dsp_neon.o
 # decoders/encoders
 NEON-OBJS-$(CONFIG_AAC_DECODER) += aarch64/aacpsdsp_neon.o
 NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o
+NEON-OBJS-$(CONFIG_HEVC_DECODER)+= aarch64/hevcdsp_add_res_neon.o
 NEON-OBJS-$(CONFIG_OPUS_DECODER)+= aarch64/opusdsp_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER)  += aarch64/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o \
diff --git a/libavcodec/aarch64/hevcdsp_add_res_neon.S b/libavcodec/aarch64/hevcdsp_add_res_neon.S
new file mode 100644
index 00..4005366192
--- /dev/null
+++ b/libavcodec/aarch64/hevcdsp_add_res_neon.S
@@ -0,0 +1,298 @@
+/* -*-arm64-*-
+ *
+ * AArch64 NEON optimised add residual functions for HEVC decoding
+ *
+ * Copyright (c) 2020 Josh Dekker 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+.macro clip10 in1, in2, c1, c2
+smax \in1, \in1, \c1
+smax \in2, \in2, \c1
+smin \in1, \in1, \c2
+smin \in2, \in2, \c2
+.endm
+
+function ff_hevc_add_residual_4x4_8_neon, export=1
+mov x3, x0
+ld1 {v0.S}[0], [x3], x2
+ld1 {v0.S}[1], [x3], x2
+ld1 {v1.S}[0], [x3], x2
+ld1 {v1.S}[1], [x3], x2
+ld1 { v2.8H-v3.8H}, [x1]
+ushll v4.8H, v0.8B, #0
+ushll v5.8H, v1.8B, #0
+add v6.8H, v4.8H, v2.8H
+add v7.8H, v5.8H, v3.8H
+sqxtun v0.8B, v6.8H
+sqxtun v1.8B, v7.8H
+st1 {v0.S}[0], [x0], x2
+st1 {v0.S}[1], [x0], x2
+st1 {v1.S}[0], [x0], x2
+st1 {v1.S}[1], [x0], x2
+ret
+endfunc
+
+function ff_hevc_add_residual_4x4_10_neon, export=1
+mov x3, x0
+movi v4.8H, #0
+mvni v5.8H, #0xFC, lsl #8
+ld1 {v0.D}[0], [x3], x2
+ld1 {v0.D}[1], [x3], x2
+ld1 {v1.D}[0], [x3], x2
+ld1 {v1.D}[1], [x3], x2
+ld1 { v2.8H-v3.8H}, [x1]
+add v2.8H, v0.8H, v2.8H
+add v3.8H, v1.8H, v3.8H
+clip10 v2.8H, v3.8H, v4.8H, v5.8H
+st1 {v2.D}[0], [x0], x2
+st1 {v2.D}[1], [x0], x2
+st1 {v3.D}[0], [x0], x2
+st1 {v3.D}[1], [x0], x2
+ret
+endfunc
+
+function ff_hevc_add_residual_8x8_8_neon, export=1
+mov x3, x0
+ld1 {v0.8B}, [x3], x2
+ld1 {v1.8B}, [x3], x2
+ld1 {v2.8B}, [x3], x2
+ld1 {v3.8B}, [x3], x2
+ld1 {v4.8B}, [x3], x2
+ld1 {v5.8B}, [x3], x2
+ld1 {v6.8B}, [x3], x2
+ld1 {v7.8B}, [x3], x2
+ld1 { v16.8H-v19.8H}, [x1], #64
+ld1 { v20.8H-v23.8H}, [x1]
+ushll v24.8H, v0.8B, #0
+ushll v25.8H, v1.8B, #0
+ushll v26.8H, v2.8B, #0
+ushll v27.8H, v3.8B, #0
+ushll v28.8H, v4.8B, #0
+ushll v29.8H, v5.8B, #0
+ushll v30.8H, v6.8B, #0
+ushll v31.8H, v7.8B, #0
+add v0.8H, v24.8H, v16.8H
+add v1.8H, v25.8H, v17.8H
+add v2.8H, v26.8H, v18.8H
+add v3.8H, v27.8H, v19.8H
+add v4.8H, v28.8H, v20.8H
+add v5.8H, v29.8H, v21.8H
+add v6.8H, v30.8H, v22.8H
+add v7.8H, v31.8H, v23.8H
+sqxtun v24.8B, v0.8H
+sqxtun v25
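
For orientation, the add_residual functions above can be read as "add the
16-bit residual to each pixel, then clip to the pixel range". A scalar C
sketch follows. The function names are hypothetical, and for simplicity the
stride here is counted in pixels, whereas the NEON functions take a byte
stride. The 10-bit upper bound is derived the same way as the constant built
by `mvni v5.8H, #0xFC, lsl #8`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: clamp v to [lo, hi]. */
static int clip_int(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* Upper clip bound for 10-bit samples: ~(0xFC << 8) in 16 bits = 0x03FF. */
static int clip10_max(void)
{
    return (uint16_t)~(0xFCu << 8);
}

/* 8-bit sketch: widen, add, saturate to 0..255 (what ushll/add/sqxtun do). */
static void add_residual8_ref(uint8_t *dst, const int16_t *res,
                              ptrdiff_t stride, int size)
{
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            dst[x] = (uint8_t)clip_int(dst[x] + res[x], 0, 255);
        dst += stride;
        res += size;
    }
}

/* 10-bit sketch: add, then clamp to 0..1023 (what add/smax/smin do). */
static void add_residual10_ref(uint16_t *dst, const int16_t *res,
                               ptrdiff_t stride, int size)
{
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            dst[x] = (uint16_t)clip_int(dst[x] + res[x], 0, clip10_max());
        dst += stride;
        res += size;
    }
}
```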

[FFmpeg-devel] [PATCH 0/4] AArch64 NEON for HEVC

2021-01-07 Thread Josh Dekker
checkasm: all 657 tests passed
hevc_add_res_4x4_8_c: 49.7
hevc_add_res_4x4_8_neon: 20.5
hevc_add_res_4x4_10_c: 45.7
hevc_add_res_4x4_10_neon: 18.7
hevc_add_res_8x8_8_c: 211.0
hevc_add_res_8x8_8_neon: 24.5
hevc_add_res_8x8_10_c: 195.7
hevc_add_res_8x8_10_neon: 24.0
hevc_add_res_16x16_8_c: 787.2
hevc_add_res_16x16_8_neon: 79.0
hevc_add_res_16x16_10_c: 714.7
hevc_add_res_16x16_10_neon: 77.7
hevc_add_res_32x32_8_c: 3444.2
hevc_add_res_32x32_8_neon: 306.5
hevc_add_res_32x32_10_c: 3820.7
hevc_add_res_32x32_10_neon: 299.5
hevc_idct_4x4_dc_8_c: 16.2
hevc_idct_4x4_dc_8_neon: 13.7
hevc_idct_4x4_dc_10_c: 16.2
hevc_idct_4x4_dc_10_neon: 14.5
hevc_idct_8x8_dc_8_c: 40.7
hevc_idct_8x8_dc_8_neon: 18.5
hevc_idct_8x8_dc_10_c: 39.2
hevc_idct_8x8_dc_10_neon: 19.2
hevc_idct_16x16_dc_8_c: 136.7
hevc_idct_16x16_dc_8_neon: 35.7
hevc_idct_16x16_dc_10_c: 136.0
hevc_idct_16x16_dc_10_neon: 36.0
hevc_idct_32x32_dc_8_c: 1386.7
hevc_idct_32x32_dc_8_neon: 132.0
hevc_idct_32x32_dc_10_c: 1366.2
hevc_idct_32x32_dc_10_neon: 132.0
hevc_sao_band_8x8_8_c: 230.7
hevc_sao_band_8x8_8_neon: 92.7

Please disregard my previous email with subject 'lavc/aarch64: add HEVC
add_residual NEON'; the patch was split incorrectly.

IDCT (first) and QPEL functions in the works, then SAO edge, and
whatever is left for parity with ARM NEON.
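
To put the checkasm numbers above in perspective, the speedup of each NEON
function is simply the C time divided by the NEON time. A small C sketch using
a few rows copied from the listing is shown below; the struct and function
names are illustrative only.

```c
/* One benchmark row: C reference time and NEON time, in checkasm units. */
struct bench {
    const char *name;
    double c_time;
    double neon_time;
};

/* A few rows copied from the checkasm listing above. */
static const struct bench rows[] = {
    { "hevc_add_res_32x32_8", 3444.2, 306.5 },
    { "hevc_idct_32x32_dc_8", 1386.7, 132.0 },
    { "hevc_idct_4x4_dc_8",     16.2,  13.7 },
    { "hevc_sao_band_8x8_8",   230.7,  92.7 },
};

/* Speedup factor: how many times faster the NEON version is. */
static double speedup(const struct bench *b)
{
    return b->c_time / b->neon_time;
}
```

By this measure the large blocks gain the most (roughly 11x for 32x32
add_residual), while the 4x4 DC case gains only about 1.2x.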

-- 
Josh



[FFmpeg-devel] [PATCH] lavc/aarch64: add HEVC add_residual NEON

2021-01-07 Thread Josh Dekker
Signed-off-by: Josh Dekker 
---

checkasm: all 648 tests passed
hevc_add_res_4x4_8_c: 49.7
hevc_add_res_4x4_8_neon: 20.5
hevc_add_res_4x4_10_c: 46.0
hevc_add_res_4x4_10_neon: 19.0
hevc_add_res_8x8_8_c: 209.0
hevc_add_res_8x8_8_neon: 24.5
hevc_add_res_8x8_10_c: 192.7
hevc_add_res_8x8_10_neon: 27.0
hevc_add_res_16x16_8_c: 791.5
hevc_add_res_16x16_8_neon: 79.0
hevc_add_res_16x16_10_c: 711.0
hevc_add_res_16x16_10_neon: 77.7
hevc_add_res_32x32_8_c: 3431.2
hevc_add_res_32x32_8_neon: 306.5
hevc_add_res_32x32_10_c: 3825.0
hevc_add_res_32x32_10_neon: 299.5

 libavcodec/aarch64/Makefile   |   3 +
 libavcodec/aarch64/hevcdsp_add_res_neon.S | 298 ++
 libavcodec/aarch64/hevcdsp_idct_neon.S|  24 ++
 libavcodec/aarch64/hevcdsp_init.c |  59 +
 libavcodec/hevcdsp.c  |   2 +
 libavcodec/hevcdsp.h  |   1 +
 6 files changed, 387 insertions(+)
 create mode 100644 libavcodec/aarch64/hevcdsp_add_res_neon.S
 create mode 100644 libavcodec/aarch64/hevcdsp_idct_neon.S
 create mode 100644 libavcodec/aarch64/hevcdsp_init.c

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index f6434e40da..0eaafce74b 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -17,6 +17,7 @@ OBJS-$(CONFIG_VP8DSP)   += aarch64/vp8dsp_init_aarch64.o
 OBJS-$(CONFIG_AAC_DECODER)  += aarch64/aacpsdsp_init_aarch64.o \
aarch64/sbrdsp_init_aarch64.o
 OBJS-$(CONFIG_DCA_DECODER)  += aarch64/synth_filter_init.o
+OBJS-$(CONFIG_HEVC_DECODER) += aarch64/hevcdsp_init.o
 OBJS-$(CONFIG_OPUS_DECODER) += aarch64/opusdsp_init.o
 OBJS-$(CONFIG_RV40_DECODER) += aarch64/rv40dsp_init_aarch64.o
 OBJS-$(CONFIG_VC1DSP)   += aarch64/vc1dsp_init_aarch64.o
@@ -53,6 +54,8 @@ NEON-OBJS-$(CONFIG_VP8DSP)  += aarch64/vp8dsp_neon.o
 # decoders/encoders
 NEON-OBJS-$(CONFIG_AAC_DECODER) += aarch64/aacpsdsp_neon.o
 NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o
+NEON-OBJS-$(CONFIG_HEVC_DECODER)+= aarch64/hevcdsp_add_res_neon.o \
+   aarch64/hevcdsp_idct_neon.o
 NEON-OBJS-$(CONFIG_OPUS_DECODER)+= aarch64/opusdsp_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER)  += aarch64/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o \
diff --git a/libavcodec/aarch64/hevcdsp_add_res_neon.S b/libavcodec/aarch64/hevcdsp_add_res_neon.S
new file mode 100644
index 00..dc7e8127b9
--- /dev/null
+++ b/libavcodec/aarch64/hevcdsp_add_res_neon.S
@@ -0,0 +1,298 @@
+/* -*-armv8-*-
+ *
+ * AArch64 NEON optimised add residual functions for HEVC decoding
+ *
+ * Copyright (c) 2020 Josh Dekker 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+.macro clip10 in1, in2, c1, c2
+smax \in1, \in1, \c1
+smax \in2, \in2, \c1
+smin \in1, \in1, \c2
+smin \in2, \in2, \c2
+.endm
+
+function ff_hevc_add_residual_4x4_8_neon, export=1
+mov x3, x0
+ld1 {v0.S}[0], [x3], x2
+ld1 {v0.S}[1], [x3], x2
+ld1 {v1.S}[0], [x3], x2
+ld1 {v1.S}[1], [x3], x2
+ld1 { v2.8H-v3.8H}, [x1]
+ushll v4.8H, v0.8B, #0
+ushll v5.8H, v1.8B, #0
+add v6.8H, v4.8H, v2.8H
+add v7.8H, v5.8H, v3.8H
+sqxtun v0.8B, v6.8H
+sqxtun v1.8B, v7.8H
+st1 {v0.S}[0], [x0], x2
+st1 {v0.S}[1], [x0], x2
+st1 {v1.S}[0], [x0], x2
+st1 {v1.S}[1], [x0], x2
+ret
+endfunc
+
+function ff_hevc_add_residual_4x4_10_neon, export=1
+mov x3, x0
+movi v4.8H, #0
+mvni v5.8H, #0xFC, lsl #8
+ld1 {v0.D}[0], [x3], x2
+ld1 {v0.D}[1], [x3], x2
+ld1 {v1.D}[0], [x3], x2
+ld1 {v1.D}[1], [x3], x2
+ld1 { v2.8H-v3.8H}, [x1]
+add v2.8H, v0.8H, v2.8H
+add v3.8H, v1.8H, v3.8H
+clip10 v2.8H, v3.8H, v4.8H, v5.8H
+st1 {v2.D}[0], [x0], x2
+st1 {v2.D}[1], [x0], x2
+st1 {v3.D}[0], [x0], x2
+st1 {v3.D}[1], [x0], x2
+ret
+endfunc
+
+function ff_hevc_add_residual_8x8_8_neon, export=1
+mov x3, x0
+ld1 {v0.8B}, [x3], x2
+ld1 {v1.8B}, [x3], x2
+ld1 {v2.8B}, [x3], x2
+

Re: [FFmpeg-devel] FFmpeg buying an Apple M1 Mac Mini

2021-01-03 Thread Josh Dekker

On 2021/01/03 20:18, Michael Niedermayer wrote:

On Sun, Jan 03, 2021 at 06:32:11PM +0100, Kieran Kunhya wrote:

Hello,

As it's 2021 I would like to propose FFmpeg purchase one or more (e.g
FATE + development) Apple M1 Mac Minis and provide access to developers.
This is something I have done a few years ago when AVX2 was a new
instruction set.

I can host these in the UK 24/7 and provide access and label them as
belonging to the project and not me.

< To clarify these will be hosted in a proper datacentre, with proper
< connectivity, cooling etc.


I would propose buying and getting reimbursed one or more of:

- Apple M1 chip with 8‑core CPU, 8‑core GPU and 16‑core Neural Engine
- 16GB unified memory
- 1TB SSD storage
- Gigabit Ethernet

This is £1,299.00 in the UK right now on the Apple Site.


assuming noone has objections or better suggestions
LGTM

thx



Ok from me too. I would suggest getting 2x, one for only FATE and the
other for general access & development for FFmpeg developers.

--
Josh

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-10 Thread Josh Dekker

On 2020/12/09 11:19, Alan Kelly wrote:

---
  Activates avx2 version of yuv2yuvX
  Adds checkasm for yuv2yuvX
  Modifies ff_yuv2yuvX_* signature to match yuv2yuvX_*
  Replaces non-temporal stores with temporal stores
  libswscale/x86/Makefile |   1 +
  libswscale/x86/swscale.c| 106 +---
  libswscale/x86/yuv2yuvX.asm | 118 
  tests/checkasm/sw_scale.c   | 101 +-
  4 files changed, 249 insertions(+), 77 deletions(-)
  create mode 100644 libswscale/x86/yuv2yuvX.asm

[...]
diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index 9efa2b4def..7009169361 100644
--- a/tests/checkasm/sw_scale.c
+++ b/tests/checkasm/sw_scale.c

[...]

+static void check_yuv2yuvX(void)
+{
+struct SwsContext *ctx;
+int fsi, osi;
+#define LARGEST_FILTER 8
+#define FILTER_SIZES 4
+static const int filter_sizes[FILTER_SIZES] = {1, 4, 8, 16};
+
+declare_func_emms(AV_CPU_FLAG_MMX, void, const int16_t *filter,
+  int filterSize, const int16_t **src, uint8_t *dest,
+  int dstW, const uint8_t *dither, int offset);
+
+int dstW = SRC_PIXELS;
+const int16_t **src;
+LOCAL_ALIGNED_32(int16_t, filter_coeff, [LARGEST_FILTER]);
+LOCAL_ALIGNED_32(uint8_t, dst0, [SRC_PIXELS]);
+LOCAL_ALIGNED_32(uint8_t, dst1, [SRC_PIXELS]);
+LOCAL_ALIGNED_32(uint8_t, dither, [SRC_PIXELS]);
+union VFilterData{
+const int16_t *src;
+uint16_t coeff[8];
+} *vFilterData;
+uint8_t d_val = rnd();
+randomize_buffers(filter_coeff, LARGEST_FILTER);
+ctx = sws_alloc_context();
+if (sws_init_context(ctx, NULL, NULL) < 0)
+fail();
+
+ff_sws_init_swscale_x86(ctx);

This should be ff_getSwsFunc() instead.

+for(int i = 0; i < SRC_PIXELS; ++i){
+dither[i] = d_val;
+}
[...]

--
Josh

[FFmpeg-devel] [RFC] Machines & Platforms of interest for testing

2020-12-08 Thread Josh Dekker

Hi,

As discussed in the meeting, I'm starting an RFC for Machines & Platforms of
interest for testing, developer access and FATE. These would be funded by SPI.

The two platforms mentioned were a Mac Mini (M1 Apple Silicon platform) and a
TALOS II (POWER9 platform). My personal suggestion would be a machine with
both a modern nVidia GPU and AMD GPU for testing hardware acceleration
integration.

Kieran offered to host one Mac Mini, though I'm unsure what his capacity for
hosting is.

Any comments and suggestions welcome.

--
Josh

[FFmpeg-devel] [IMPORTANT] Meeting Notes - December 2020

2020-12-07 Thread Josh Dekker

Hi all,

Here are the notes from the FFmpeg developer meeting of the 5th of December
2020, 15:00 UTC:

# Notes

A recording of the call and IRC logs are available on YouTube and the Wiki
respectively:

- https://youtu.be/1EjIdYuWXEM
- https://trac.ffmpeg.org/wiki/FFmeeting/2020-12

Some extra topics were discussed in the meeting. The notes are cleaned up to
have actionable points.

## Proposed Agenda

https://ffmpeg.org/pipermail/ffmpeg-devel/2020-November/272272.html

- Tech / Comm committees, how-to in practice
- GSoC adaptations for upcoming years
- Splitting libraries (lavf->lavio)
- Deprecating libpostproc
- Writing down development rules
- Switching to a merge request-like system
- Propose FFmpeg/SPI purchase a Mac Mini ARM machine for development (and FATE
  if required).

## People Present (15)

- Jean-Baptiste Kempf
- Josh Dekker (Illya)
- Jan Ekström
- Michael Niedermayer
- Gyan Doshi
- Lynne
- Kieran Kunhya
- Mark Thompson
- Anton Khirnov
- Paul B Mahol
- Steven Liu
- Marvin Scholz
- Andriy Gelman
- Linjie Fu
- James Almer

## Topics discussed

### Topic 1.0: Technical Committee

Clarify Technical Process on the mailing list -> vote in one week, in a Yes/No 
fashion.

Question about the 100-word limit in the technical process:

* The question is limited to one hundred words, for example of the form
  "should we do X rather than Y?".
* The background to the question will likely be complex and can be explained in
  detail elsewhere.
* The intent of this restriction is to avoid an unclear question or any
  ambiguity in the answer.

#### Action points

- [Mark and Lynne] Submit an updated patch to clarify how the 100 word limit
  works.

### Topic 1.1: Community committees

A Community Code of Conduct was written by j-b and needs review. The CoC MUST
HAVE abuse limits of TC and CC.

#### Action points

 * [Josh, JEEB, Kieran and Michael] Pre-review
 * [j-b] Post CoC on mailing list for general review
 * [GA] Vote on CoC in the same format as technical process

**Goal: everything voted on by the end of Dec 2020**

### Topic 2: GSoC

GSoC was shortened to 5 weeks; a project should take around 150 hours, so the
overall time is halved. Small projects are generally more boring and less
useful.

Kieran and Lynne believe we should stop doing GSoC. Anton noted that GSoC is
not a burden on people who don't care about it.

Going forward, projects would have to be restructured compared with those of
previous years. They should be better integrated with the community and
specified in more detail. Suggestions included smaller optimisation projects
(there is still a lot of assembly unwritten).

#### Action points

- [Carl] Set up a wiki page for *small*, *self-contained*, *specific* GSoC
  project ideas.

### Topic 3: splitting and merging libraries

#### Topic 3.1 libavdevice

Libavdevice is very tightly coupled to libavformat, so it should be merged into
it. Nobody appears to be against this.

Anton is working on the merge.

#### Topic 3.2 merging all libraries into one

Steven mentions it is useful for his use case to build a single libffmpeg.so.

JB replies that this makes sense for some cases but not for others; multiple
separate libraries are preferable in many other cases.

Discussion of various open source libraries moving to meson.

##### Action points

- [Who?] More discussion on how to do a mega-library and if we should document
  it. Lynne suggests that this mega-library be static only.
- [Requires further discussion] Do we support libffmpeg.a|.so in the build
  system?

#### Topic 3.3 Splitting libavformat IO into libavio

Some API users would prefer this functionality to be separate, as it involves
network communication and such. Additionally, it would allow proper IO in
other libraries, such as lavc and lavfi.

The question was raised whether IO could be moved to libavutil. Conclusion ->
libavutil is big enough already.

##### Action points

- [Anton] Reflect on how to split IO out of lavf into libavio
- [Requires further discussion] Point raised whether hwcontext should be split
  off from libavutil

#### Topic 3.4 Splitting libavutil hwcontext into its own library

Mark suggests that hardware contexts could move to a separate library.

Anton says it is probably inconvenient for distros that lavu links to many
hwaccel libs -> libavhwcontext makes sense.

Nobody raises objections to this.

##### Action points

- [Who?] Reflect on whether this makes sense and, if so, split hardware
  contexts into a separate library

### Topic 4: deprecating libpostproc

Libpostproc does not have any external users (Kodi was thought to use it
directly but was confirmed as not a user). There is no need for it to be a
standalone library.

A few options were suggested:
- remove it altogether
- merge it into libavfilter
- move it to another repo

Kieran wants to move it to a different repo; Anton noted that libpostproc is
already in an external repo, the one from Libav. Michael wants to leave it as
it is or integrate it into libavfilter.

#### Action points

-