Re: [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation for sse4

2022-08-18 Thread Martin Storsjö
On Tue, 16 Aug 2022, Hubert Mazur wrote: Provide neon implementation for sse4 function. Performance comparison tests are shown below. - sse_2_c: 80.7 - sse_2_neon: 31.0 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

Re: [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for sse16

2022-08-18 Thread Martin Storsjö
On Tue, 16 Aug 2022, Hubert Mazur wrote: Provide neon implementation for sse16 function. Performance comparison tests are shown below. - sse_0_c: 268.2 - sse_0_neon: 43.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---
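For readers unfamiliar with the function family being benchmarked here, a sum of squared errors over a 16-pixel-wide block can be sketched in scalar C roughly as below. This is only an illustration: the name `sse16_ref` and its signature are hypothetical, not the actual FFmpeg C reference or the NEON code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: sum of squared differences between two
 * blocks that are 16 pixels wide and h rows tall. Both blocks are assumed
 * to share the same row stride (in bytes). */
static int sse16_ref(const uint8_t *pix1, const uint8_t *pix2,
                     ptrdiff_t stride, int h)
{
    int sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++) {
            int d = pix1[x] - pix2[x];
            sum += d * d;
        }
        pix1 += stride;
        pix2 += stride;
    }
    return sum;
}
```

A vectorized version would typically process a whole 16-byte row per iteration, which is presumably where the speedup quoted above comes from.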

Re: [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for sse8

2022-08-18 Thread Martin Storsjö
On Tue, 16 Aug 2022, Hubert Mazur wrote: Provide optimized implementation of sse8 function for arm64. Performance comparison tests are shown below. - sse_1_c: 130.7 - sse_1_neon: 29.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Add neon implementation for pix_abs8

2022-08-18 Thread Martin Storsjö
On Tue, 16 Aug 2022, Hubert Mazur wrote: Provide optimized implementation of pix_abs8 function for arm64. Performance comparison tests are shown below. - pix_abs_1_0_c: 101.2 - pix_abs_1_0_neon: 22.5 - sad_1_c: 101.2 - sad_1_neon: 22.5 Benchmarks and tests are run with checkasm tool on AWS

Re: [FFmpeg-devel] [PATCH 2/2] arm: rv40dsp: Change stride parameters to ptrdiff_t

2022-09-02 Thread Martin Storsjö
On Tue, 9 Aug 2022, Martin Storsjö wrote: These were missed when h264_chroma_mc_func was changed in e4a94d8b36c48d95a7d412c40d7b558422ff659c. Signed-off-by: Martin Storsjö --- libavcodec/arm/rv40dsp_init_arm.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) OK'd by Andreas

Re: [FFmpeg-devel] [PATCH v4] libavcodec: Set hidden visibility on global symbols accessed from AArch64 assembly

2022-09-02 Thread Martin Storsjö
On Sat, 27 Aug 2022, Martin Storsjö wrote: The AArch64 assembly accesses those symbols directly, without indirection via e.g. the GOT on ELF. In order for this not to require text relocations, those symbols need to be resolved fully at link time, i.e. those symbols can't be interposable
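As a rough sketch of what marking a symbol as hidden looks like in C, the following uses the GCC/Clang visibility attribute. The macro name `attr_visibility_hidden` and the table `example_coeff_table` are made up for illustration; the actual header and symbol names in the patch are not shown in this excerpt.

```c
#include <stdint.h>

/* Assumption: only apply the attribute on GCC-compatible compilers
 * targeting ELF or Mach-O, where symbol visibility is meaningful. */
#if defined(__GNUC__) && (defined(__ELF__) || defined(__MACH__))
#    define attr_visibility_hidden __attribute__((visibility("hidden")))
#else
#    define attr_visibility_hidden
#endif

/* Declaration, as it might appear in a header shared by C and assembly
 * users. A hidden symbol cannot be interposed from outside the shared
 * library, so references to it can be resolved fully at link time. */
extern attr_visibility_hidden const uint8_t example_coeff_table[4];

/* Definition in exactly one C file (hypothetical contents). */
const uint8_t example_coeff_table[4] = { 1, 2, 4, 8 };
```

The attribute goes on the declaration so that every translation unit seeing the header agrees on the visibility.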

Re: [FFmpeg-devel] [PATCH v2] arm: Check the build time constants in av_clip_*intp2

2022-09-02 Thread Martin Storsjö
On Fri, 26 Aug 2022, Martin Storsjö wrote: This fixes building for arm targets with optimizations disabled. --- libavutil/arm/intmath.h | 24 ++-- 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/libavutil/arm/intmath.h b/libavutil/arm/intmath.h index 5311a7d52b

Re: [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16

2022-09-06 Thread Martin Storsjö
On Tue, 6 Sep 2022, Lukas Fellechner wrote: There are really two separate issues here: 1. Running out of address space in 32-bit processes It probably makes sense to limit auto threads to 16, but it should only be done in 32-bit processes. FWIW, this was my first approach, until Andreas

Re: [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16

2022-09-02 Thread Martin Storsjö
On Mon, 22 Aug 2022, Hubert Mazur wrote: Provide optimized implementation of vsad16 function for arm64. Performance comparison tests are shown below. - vsad_0_c: 285.0 - vsad_0_neon: 42.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16

2022-09-02 Thread Martin Storsjö
On Mon, 22 Aug 2022, Hubert Mazur wrote: Add vectorized implementation of nsse16 function. Performance comparison tests are shown below. - nsse_0_c: 707.0 - nsse_0_neon: 120.0 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

Re: [FFmpeg-devel] [PATCH 2/2] arm: relax byte-swap assembler constraints

2022-09-03 Thread Martin Storsjö
On Sat, 3 Sep 2022, r...@remlab.net wrote: From: Rémi Denis-Courmont There are no particular reasons to force the compiler to use the same register as output and input operand. This forces an extra MOV instruction if the input value needs to be reused after the swap. In most cases, this

Re: [FFmpeg-devel] [PATCH] avcodec/mathops: Set hidden visibility where advantageous

2022-09-03 Thread Martin Storsjö
On Sat, 3 Sep 2022, Andreas Rheinhardt wrote: It is advantageous for ff_crop_tab, as the base pointer used to access this table is not the first element of it. But the real base pointer is still at a constant offset from the code/the GOT and can therefore be accessed relative to the instruction

Re: [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16

2022-09-07 Thread Martin Storsjö
On Tue, 6 Sep 2022, Hubert Mazur wrote: Provide optimized implementation of vsse16 for arm64. Performance comparison tests are shown below. - vsse_0_c: 254.4 - vsse_0_neon: 64.7 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

Re: [FFmpeg-devel] [PATCH 0/5] Provide optimized neon implementation

2022-09-07 Thread Martin Storsjö
On Tue, 6 Sep 2022, Hubert Mazur wrote: Provide optimized implementations for me_cmp functions. This set of patches fixes all issues addressed in previous review. Major changes: - Remove redundant loads since the data can be reused. - Improve style. - Fix issues with unrecognized symbols.

Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16

2022-09-07 Thread Martin Storsjö
On Tue, 6 Sep 2022, Hubert Mazur wrote: Add vectorized implementation of nsse16 function. Performance comparison tests are shown below. - nsse_0_c: 707.0 - nsse_0_neon: 120.0 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

Re: [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16

2022-09-05 Thread Martin Storsjö
On Mon, 5 Sep 2022, Martin Storsjö wrote: This matches a similar cap on the number of automatic threads in libavcodec/pthread_slice.c. On systems with lots of cores, this does speed things up in general (measurable on the level of the runtime of running "make fate"), and fixes a c

Re: [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16

2022-09-04 Thread Martin Storsjö
On Mon, 22 Aug 2022, Hubert Mazur wrote: Provide optimized implementation of vsse16 for arm64. Performance comparison tests are shown below. - vsse_0_c: 254.4 - vsse_0_neon: 64.7 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

Re: [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16

2022-09-04 Thread Martin Storsjö
On Mon, 22 Aug 2022, Hubert Mazur wrote: Provide optimized implementation for vsad_intra16 function for arm64. Performance comparison tests are shown below. - vsad_4_c: 177.2 - vsad_4_neon: 24.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur

Re: [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16

2022-09-04 Thread Martin Storsjö
On Mon, 22 Aug 2022, Hubert Mazur wrote: Provide optimized implementation for vsse_intra16 for arm64. Performance tests are shown below. - vsse_4_c: 153.7 - vsse_4_neon: 34.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16

2022-09-04 Thread Martin Storsjö
On Mon, 22 Aug 2022, Hubert Mazur wrote: Add vectorized implementation of nsse16 function. Performance comparison tests are shown below. - nsse_0_c: 707.0 - nsse_0_neon: 120.0 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

[FFmpeg-devel] [PATCH] x86/tx_float: Fix building for platforms with a symbol prefix

2022-09-06 Thread Martin Storsjö
This fixes building for e.g. i386 windows. --- libavutil/x86/tx_float.asm | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/libavutil/x86/tx_float.asm b/libavutil/x86/tx_float.asm index 1b9131e7fa..ace19788a6 100644 --- a/libavutil/x86/tx_float.asm +++

[FFmpeg-devel] [PATCH v2] x86/tx_float: Fix building for platforms with a symbol prefix

2022-09-06 Thread Martin Storsjö
This fixes building for x86 macOS (both i386 and x86_64) and i386 windows. --- v2: Add mangle() in a couple more places, that weren't noticed on i386 windows. --- libavutil/x86/tx_float.asm | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/libavutil/x86/tx_float.asm

Re: [FFmpeg-devel] [PATCH 1/2] x86/tx_float: add support for calling assembly functions from assembly

2022-09-06 Thread Martin Storsjö
On Tue, 6 Sep 2022, Mattias Wadman wrote: On Sat, Sep 3, 2022 at 3:41 AM Lynne wrote: Needed for the next patch. We get this for the extremely small cost of a branch on _ns functions, which wouldn't be used anyway with assembly. Patch attached. Hi, I have issues building on macOS

[FFmpeg-devel] [PATCH v4] libavcodec: Set hidden visibility on global symbols accessed from AArch64 assembly

2022-08-27 Thread Martin Storsjö
that are accessed from AArch64 assembly as hidden, so that they are resolved fully at link time even without the version script and -Wl,-Bsymbolic. Signed-off-by: Martin Storsjö --- v4: Moved the attribute definition to a new, standalone header (which only depends on libavutil/attributes.h

[FFmpeg-devel] [PATCH] arm: Skip certain inline assembly functions if built without optimizations

2022-08-26 Thread Martin Storsjö
These inline assembly functions rely on being inlined into the caller, so that the parameter "int p" can be a known assembly time constant, instead of a variable parameter. __OPTIMIZE__ is a built-in define which is set by both GCC and Clang (the two main compilers supporting our inline assembly)
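The idea described above can be sketched as follows: the inline assembly variant is only compiled when `__OPTIMIZE__` is defined, and a plain C fallback is used otherwise. This is a minimal sketch under assumptions, not the patch itself: the names `clip_intp2` and `clip_intp2_c` are illustrative, and the extra `__ARM_ARCH >= 6` check is my own guard for the `ssat` instruction.

```c
#include <stdint.h>

/* Plain C fallback: clip a to the signed range [-(1 << p), (1 << p) - 1].
 * Relies on arithmetic right shift of negative ints, as mainstream
 * compilers provide. */
static inline int clip_intp2_c(int a, int p)
{
    if (((unsigned)a + (1U << p)) & ~((2U << p) - 1))
        return (a >> 31) ^ ((1 << p) - 1);
    return a;
}

#if defined(__OPTIMIZE__) && defined(__arm__) && __ARM_ARCH >= 6
/* The "i" constraint needs p + 1 to be a constant once the function has
 * been inlined into its caller, which only happens reliably when
 * optimizations are enabled. */
static inline __attribute__((always_inline)) int clip_intp2(int a, int p)
{
    int x;
    __asm__ ("ssat %0, %2, %1" : "=r"(x) : "r"(a), "i"(p + 1));
    return x;
}
#else
/* Unoptimized or non-ARM build: skip the inline assembly entirely. */
#define clip_intp2 clip_intp2_c
#endif
```

With this layout a build at -O0 never sees the assembly version, so the constant-operand requirement cannot fail.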

Re: [FFmpeg-devel] [PATCH v2] checkasm: sw_scale: Produce more realistic test filter coefficients for yuv2yuvX

2022-08-19 Thread Martin Storsjö
On Thu, 18 Aug 2022, Alan Kelly wrote: Thanks Martin for doing this. On Thu, Aug 18, 2022 at 10:16 AM Martin Storsjö wrote: This avoids triggering overflows in the filters, and avoids stray test failures in the approximate functions on x86; due to rounding

Re: [FFmpeg-devel] [PATCH 2/4] lavc/aarch64: Provide neon implementation of nsse8

2022-09-28 Thread Martin Storsjö
On Mon, 26 Sep 2022, Grzegorz Bernacki wrote: Add vectorized implementation of nsse8 function. Performance comparison tests are shown below. - nsse_1_c: 256.0 - nsse_1_neon: 82.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Grzegorz Bernacki ---

[FFmpeg-devel] [PATCH 1/2] aarch64: me_cmp: Avoid redundant loads in ff_pix_abs16_y2_neon

2022-09-28 Thread Martin Storsjö
This avoids one redundant load per row; pix3 from the previous iteration can be used as pix2 in the next one.

Before:            Cortex A53    A72    A73
pix_abs_0_2_neon:       138.0   59.7   48.0
After:
pix_abs_0_2_neon:       109.7   50.2   39.5

Signed-off-by: Martin Storsjö --- libavcodec/aarch64
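In scalar C terms, the row reuse described in this patch could be sketched like this. The function name `pix_abs16_y2_ref` and its signature are hypothetical, and the rounding average `(a + b + 1) >> 1` is an assumption about how the vertical half-pel interpolation is defined; the actual change is in NEON assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar sketch: SAD between pix1 and the vertical average
 * of two consecutive rows of the reference block. The reference block
 * must have h + 1 readable rows. */
static int pix_abs16_y2_ref(const uint8_t *pix1, const uint8_t *pix2,
                            ptrdiff_t stride, int h)
{
    const uint8_t *pix3 = pix2 + stride;
    int sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[x] - ((pix2[x] + pix3[x] + 1) >> 1));
        pix1 += stride;
        /* The lower row of this iteration is the upper row of the next
         * one, so only one new row has to be fetched per iteration. */
        pix2  = pix3;
        pix3 += stride;
    }
    return sum;
}
```

In the assembly, the equivalent step is keeping the already loaded row in a register instead of loading it again.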

[FFmpeg-devel] [PATCH 2/2] aarch64: me_cmp: Avoid using the non-unrolled codepath for the minimum unroll size

2022-09-28 Thread Martin Storsjö
Signed-off-by: Martin Storsjö --- libavcodec/aarch64/me_cmp_neon.S | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S index 832a7cb22d..c710358ab7 100644 --- a/libavcodec/aarch64/me_cmp_neon.S +++ b

[FFmpeg-devel] [PATCH] riscv: Use the correct path for including asm.S

2022-09-28 Thread Martin Storsjö
Signed-off-by: Martin Storsjö --- This should hopefully fix the compile failures on fate, http://fate.ffmpeg.org/report.cgi?time=20220927222508=riscv64-linux-gnu-gcc-12 and http://fate.ffmpeg.org/report.cgi?time=20220927225014=riscv64-linux-gnu-clang-14. --- libavcodec/riscv/fmtconvert_rvv.S

Re: [FFmpeg-devel] [PATCH] riscv: Use the correct path for including asm.S

2022-09-28 Thread Martin Storsjö
On Wed, 28 Sep 2022, Rémi Denis-Courmont wrote: On 28 September 2022 10:13:57 GMT+03:00, "Martin Storsjö" wrote: Signed-off-by: Martin Storsjö --- This should hopefully fix the compile failures on fate, http://fate.ffmpeg.org/report.cgi?time=20220927222508=riscv64-linux-

Re: [FFmpeg-devel] [PATCH 1/4] lavc/aarch64: Add neon implementation for pix_abs8 functions.

2022-09-28 Thread Martin Storsjö
On Mon, 26 Sep 2022, Grzegorz Bernacki wrote: Provide optimized implementation of pix_abs8 function for arm64. Performance comparison tests are shown below: pix_abs_1_1_c: 162.5 pix_abs_1_1_neon: 27.0 pix_abs_1_2_c: 174.0 pix_abs_1_2_neon: 23.5 pix_abs_1_3_c: 203.2 pix_abs_1_3_neon: 34.7

Re: [FFmpeg-devel] [PATCH 4/4] lavc/aarch64: Add neon implementation for vsse_intra8

2022-09-28 Thread Martin Storsjö
On Mon, 26 Sep 2022, Grzegorz Bernacki wrote: Provide optimized implementation for vsse_intra8 for arm64. Performance tests are shown below. - vsse_5_c: 87.7 - vsse_5_neon: 26.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. --- libavcodec/aarch64/me_cmp_init_aarch64.c |

[FFmpeg-devel] [PATCH] arm: vc1dsp: Canonicalize the syntax for aligned NEON loads/stores

2022-09-28 Thread Martin Storsjö
This hopefully should fix building with older toolchains, hopefully fixing the fate failures on http://fate.ffmpeg.org/history.cgi?slot=armel5tej-qemu-debian-gcc4.4. Signed-off-by: Martin Storsjö --- libavcodec/arm/vc1dsp_neon.S | 40 ++-- 1 file changed, 20

Re: [FFmpeg-devel] [PATCH 3/4] lavc/aarch64: Provide optimized implementation of vsse8 for arm64.

2022-09-28 Thread Martin Storsjö
On Mon, 26 Sep 2022, Grzegorz Bernacki wrote: Provide optimized implementation of vsse8 for arm64. Performance comparison tests are shown below. - vsse_1_c: 141.5 - vsse_1_neon: 32.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Grzegorz Bernacki ---

[FFmpeg-devel] [PATCH] riscv: Fix linking without RVV; change #ifdef into #if

2022-09-28 Thread Martin Storsjö
--- This should hopefully fix the current build failures at http://fate.ffmpeg.org/history.cgi?slot=riscv64-linux-gnu-clang-14. --- libavcodec/riscv/fmtconvert_init.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/libavcodec/riscv/fmtconvert_init.c

Re: [FFmpeg-devel] [PATCH] arm: vc1dsp: Canonicalize the syntax for aligned NEON loads/stores

2022-09-29 Thread Martin Storsjö
On Wed, 28 Sep 2022, Martin Storsjö wrote: This hopefully should fix building with older toolchains, hopefully fixing the fate failures on http://fate.ffmpeg.org/history.cgi?slot=armel5tej-qemu-debian-gcc4.4. Signed-off-by: Martin Storsjö --- libavcodec/arm/vc1dsp_neon.S | 40

[FFmpeg-devel] [PATCH] swscale: aarch64: Fix yuv2rgb with negative strides

2022-10-25 Thread Martin Storsjö
operation, which would clamp the intermediates to 32 bit still). Fixes: https://trac.ffmpeg.org/ticket/9985 Signed-off-by: Martin Storsjö --- libswscale/aarch64/yuv2rgb_neon.S | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/libswscale/aarch64/yuv2rgb_neon.S b/libswscale

Re: [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19

2022-10-24 Thread Martin Storsjö
On Mon, 17 Oct 2022, Hubert Mazur wrote: Provide arm64 neon optimized implementations for hscale16To19 with filter sizes 4, 8 and X4. The tests and benchmarks run on AWS Graviton 2 instances. The results from a checkasm tool are shown below. hscale_16_to_19__fs_4_dstW_512_c: 6216.0

Re: [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale 8 to 19

2022-10-24 Thread Martin Storsjö
On Mon, 17 Oct 2022, Hubert Mazur wrote: Add arm64 neon implementations for hscale 8 to 19 with filter sizes 4, 4X and 8. Both implementations are based on very similar ones dedicated to hscale 8 to 15. The major changes refer to saving the data - instead of writing the result as int16_t it is

Re: [FFmpeg-devel] [PATCH v3] lavc/aarch64: add hevc horizontal qpel/uni/bi

2022-10-24 Thread Martin Storsjö
On Tue, 11 Oct 2022, J. Dekker wrote: checkasm benchmark on Ampere Altra (Neoverse N1): put_hevc_qpel_bi_h4_8_c: 170.7 put_hevc_qpel_bi_h4_8_neon: 64.5 put_hevc_qpel_bi_h6_8_c: 373.7 put_hevc_qpel_bi_h6_8_neon: 130.2 put_hevc_qpel_bi_h8_8_c: 662.0 put_hevc_qpel_bi_h8_8_neon: 138.5

Re: [FFmpeg-devel] [PATCH] configure: Remove a leftover comment about MSVC C99 support

2022-10-27 Thread Martin Storsjö
On Wed, 19 Oct 2022, Martin Storsjö wrote: Support for building with older versions of MSVC (with the c99wrap/c99conv frontend) was removed in ce943dd6acbfdfc40223c0fb24d4cad438e6499c. Signed-off-by: Martin Storsjö --- configure | 6 -- 1 file changed, 6 deletions(-) diff --git

Re: [FFmpeg-devel] [PATCH] swscale: aarch64: Fix yuv2rgb with negative strides

2022-10-27 Thread Martin Storsjö
On Tue, 25 Oct 2022, Martin Storsjö wrote: Treat the 32 bit stride registers as signed. Alternatively, we could make the stride arguments ptrdiff_t instead of int, and changing all of the assembly to operate on these registers with their full 64 bit width, but that's a much larger and more
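To illustrate why the signedness of a 32-bit stride matters once it is used in 64-bit address arithmetic, here is a small C sketch. Both helper names are hypothetical; they only model the two ways a 32-bit register value can be widened.

```c
#include <stdint.h>

/* Models treating the 32-bit stride as unsigned: the upper 32 bits are
 * zero-filled, so a negative stride turns into a large positive offset. */
static int64_t row_offset_zero_extended(int stride, int row)
{
    return (int64_t)(uint32_t)stride * row;
}

/* Models treating the 32-bit stride as signed: the sign bit is
 * replicated, so a negative stride still steps backwards in memory. */
static int64_t row_offset_sign_extended(int stride, int row)
{
    return (int64_t)stride * row;
}
```

For a stride of -64 and row 1, the zero-extended variant yields 4294967232 while the sign-extended one yields -64; only the latter points at the previous row.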

[FFmpeg-devel] [PATCH] configure: Remove a leftover comment about MSVC C99 support

2022-10-19 Thread Martin Storsjö
Support for building with older versions of MSVC (with the c99wrap/c99conv frontend) was removed in ce943dd6acbfdfc40223c0fb24d4cad438e6499c. Signed-off-by: Martin Storsjö --- configure | 6 -- 1 file changed, 6 deletions(-) diff --git a/configure b/configure index 6712d045d9..ed52212f93

Re: [FFmpeg-devel] [PATCH 0/3] Provide neon implementations

2022-09-21 Thread Martin Storsjö
On Tue, 20 Sep 2022, Hubert Mazur wrote: This fixes issues addressed in previous patchset: - move sub instruction in vsad8_intra, - remove unnecessary mov instructions, - remove single lane extraction in loop and place it at the end. Removing mov instructions from pix_median_abs functions

Re: [FFmpeg-devel] [PATCH v2] avcodec/arm/sbcenc: avoid callee preserved vfp registers

2022-09-12 Thread Martin Storsjö
On Sun, 25 Aug 2019, James Cowgill wrote: When compiling FFmpeg with GCC-9, some very random segfaults were observed in code which had previously called down into the SBC encoder NEON assembly routines. This was caused by these functions clobbering some of the vfp callee saved registers (d8 -

Re: [FFmpeg-devel] [PATCH 2/3] lavc/aarch64: Add neon implementation for vsad8_intra

2022-09-16 Thread Martin Storsjö
On Tue, 13 Sep 2022, Hubert Mazur wrote: Provide optimized implementation for pix_median_abs16 function. You forgot to update this part of the commit message. Performance comparison tests are shown below. - vsad_5_c: 94.7 - vsad_5_neon: 20.7 Benchmarks and tests run with checkasm tool

Re: [FFmpeg-devel] [PATCH 3/3] lavc/aarch64: Add neon implementation for pix_median_abs8

2022-09-16 Thread Martin Storsjö
On Tue, 13 Sep 2022, Hubert Mazur wrote: Provide optimized implementation for pix_median_abs16 function. Forgot to update this part of the commit message here too. Performance comparison tests are shown below. - median_sad_1_c: 273.7 - median_sad_1_neon: 98.2 Benchmarks and tests run with

Re: [FFmpeg-devel] [PATCH 1/3] lavc/aarch64: Add neon implementation for pix_median_abs16

2022-09-16 Thread Martin Storsjö
On Tue, 13 Sep 2022, Hubert Mazur wrote: Provide optimized implementation for pix_median_abs16 function. Performance comparison tests are shown below. - median_sad_0_c: 722.0 - median_sad_0_neon: 144.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur

Re: [FFmpeg-devel] [PATCH 1/6] opus: convert encoder and decoder to lavu/tx

2022-09-25 Thread Martin Storsjö
On Sat, 24 Sep 2022, Lynne wrote: What about ac3dsp then - that one seems like it's fairly optimized for arm? Haven't touched them, they're still being used. Unfortunately, for AC3, the full MDCT optimizations in lavc do make a difference and the overall decoder becomes 15% slower with this

Re: [FFmpeg-devel] Patchwork issues

2022-09-26 Thread Martin Storsjö
On Mon, 26 Sep 2022, Marvin Scholz wrote: As I am not sure who else to email about this, I'll just post it here. I tried to register for Patchwork, however I got an error when registering. I tried again and was told the account already exists, I tried to reset the password for the account but

Re: [FFmpeg-devel] [PATCH 1/6] opus: convert encoder and decoder to lavu/tx

2022-09-24 Thread Martin Storsjö
On Sat, 24 Sep 2022, Lynne wrote: This commit changes both the encoder and decoder to use the new lavu/tx code, which has faster C transforms and more assembly optimizations. What's the case of e.g. 32 bit arm - that does have a bunch of fft and mdct assembly, but is that something that ends

Re: [FFmpeg-devel] [PATCH 1/6] opus: convert encoder and decoder to lavu/tx

2022-09-24 Thread Martin Storsjö
On Sat, 24 Sep 2022, Hendrik Leppkes wrote: On Sat, Sep 24, 2022 at 9:26 PM Hendrik Leppkes wrote: On Sat, Sep 24, 2022 at 8:43 PM Martin Storsjö wrote: > > On Sat, 24 Sep 2022, Lynne wrote: > > > This commit changes both the encoder and decoder to use the new lavu/tx code

Re: [FFmpeg-devel] [PATCH 0/5] Provide optimized neon implementation

2022-09-09 Thread Martin Storsjö
On Thu, 8 Sep 2022, Hubert Mazur wrote: Fix minor issues in the patches. Regarding vsse16 I didn't change saba & umlal to sub & smlal. It doesn't affect the performance, so left it as it was. The majority of changes refer to nsse16: - fixed indentation (thanks for pointing out), - applied the

Re: [FFmpeg-devel] [PATCH v2 0/7] arm64 neon implementation for 8bits functions

2022-10-04 Thread Martin Storsjö
: Provide neon implementation of nsse8 lavc/aarch64: Provide optimized implementation of vsse8 for arm64. lavc/aarch64: Add neon implementation for vsse_intra8 Martin Storsjö (3): aarch64: me_cmp: Improve scheduling in ff_pix_abs8_y2_neon aarch64: me_cmp: Fix up the prologue

Re: [FFmpeg-devel] [PATCH] aarch64: Implement stack spilling in a consistent way.

2022-10-10 Thread Martin Storsjö
On Sun, 9 Oct 2022, reimar.doeffin...@gmx.de wrote: From: Reimar Döffinger Currently it is done in several different ways, which might cause needless dependencies or in case of tx_float_neon.S is incorrect. Signed-off-by: Reimar Döffinger --- libavcodec/aarch64/fft_neon.S | 3 +-

[FFmpeg-devel] [PATCH] libavcodec: Fix a comment typo

2022-10-03 Thread Martin Storsjö
Signed-off-by: Martin Storsjö --- libavcodec/packet.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/libavcodec/packet.h b/libavcodec/packet.h index 404d520071..f28e7e7011 100644 --- a/libavcodec/packet.h +++ b/libavcodec/packet.h @@ -161,7 +161,7 @@ enum

[FFmpeg-devel] [PATCH] cpu: Limit the number of auto threads in 32 bit builds

2022-09-05 Thread Martin Storsjö
Limit the returned value from av_cpu_count to sensible amounts in 32 bit builds. This chosen limit, 64, is somewhat arbitrary - a 32 bit process is capable of creating much more than 64 threads. But in many cases, multiple parts of the encoding pipeline (decoder, filters, encoders) all create a
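A minimal sketch of such a cap, assuming the limit of 64 mentioned above and using pointer size as the test for a 32-bit build. The function `cap_auto_threads` and its `ptr_size` parameter are hypothetical and exist only to make the logic easy to exercise; this is not the code from the patch.

```c
#include <stddef.h>

/* Hypothetical helper: limit an automatically detected CPU count in
 * builds with a 32-bit address space. ptr_size would normally be
 * sizeof(void *); it is a parameter here purely for illustration. */
static int cap_auto_threads(int detected, size_t ptr_size)
{
    const int limit_32bit = 64; /* somewhat arbitrary, per the message */

    if (ptr_size <= 4 && detected > limit_32bit)
        return limit_32bit;
    return detected;
}
```

A caller would use something like `cap_auto_threads(n, sizeof(void *))`, leaving 64-bit builds unaffected.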

[FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16

2022-09-05 Thread Martin Storsjö
This matches a similar cap on the number of automatic threads in libavcodec/pthread_slice.c. On systems with lots of cores, this does speed things up in general (measurable on the level of the runtime of running "make fate"), and fixes a couple fate failures in 32 bit mode on such machines (where

Re: [FFmpeg-devel] [PATCH] cpu: Limit the number of auto threads in 32 bit builds

2022-09-05 Thread Martin Storsjö
On Mon, 5 Sep 2022, Andreas Rheinhardt wrote: Martin Storsjö: Limit the returned value from av_cpu_count to sensible amounts in 32 bit builds. This chosen limit, 64, is somewhat arbitrary - a 32 bit process is capable of creating much more than 64 threads. But in many cases, multiple parts

Re: [FFmpeg-devel] [PATCH] lavc/aarch64: Add neon implementation for sse16

2022-08-03 Thread Martin Storsjö
On Mon, 25 Jul 2022, Hubert Mazur wrote: Provide neon implementation for sse16 function. Performance comparison tests are shown below. - sse_0_c: 273.0 - sse_0_neon: 48.2 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

Re: [FFmpeg-devel] [PATCH] checkasm: Silence warnings about unused return value from read()

2022-08-05 Thread Martin Storsjö
On Wed, 27 Jul 2022, Andreas Rheinhardt wrote: Swinney, Jonathan: This patch looks good to me. I would appreciate its merging. } while (0) #define PERF_STOP(t) do { \ +int ret;\ ioctl(sysfd,

Re: [FFmpeg-devel] [FFmpeg-cvslog] swscale/output: add VUYA output support

2022-08-08 Thread Martin Storsjö
On Sun, 7 Aug 2022, James Almer wrote: ffmpeg | branch: master | James Almer | Fri Aug 5 13:44:16 2022 -0300| [19748132613d1d13f5b6786051910e7375bb3df6] | committer: James Almer swscale/output: add VUYA output support Signed-off-by: James Almer

[FFmpeg-devel] [PATCH] lavc/libx264: support AV_CODEC_CAP_ENCODER_RECON_FRAME

2022-08-08 Thread Martin Storsjö
From: Anton Khirnov Bump the version requirement to 122, which introduced b_full_recon. --- configure| 2 +- libavcodec/libx264.c | 55 +++- 2 files changed, 55 insertions(+), 2 deletions(-) diff --git a/configure b/configure index

[FFmpeg-devel] [PATCH] tools: Make sure to create the tools directory before building decode_simple.o

2022-08-08 Thread Martin Storsjö
This directory dependency is normally added implicitly by rules in ffbuild/common.mak; for tools it's created by a rule for TOOLOBJS. TOOLOBJS is populated implicitly from TOOLS, and decode_simple.o doesn't end up there because it's an odd occurrence of a lone object file in the tools

Re: [FFmpeg-devel] [PATCH] lavc/libx264: support AV_CODEC_CAP_ENCODER_RECON_FRAME

2022-08-08 Thread Martin Storsjö
On Mon, 8 Aug 2022, Martin Storsjö wrote: From: Anton Khirnov Bump the version requirement to 122, which introduced b_full_recon. --- configure| 2 +- libavcodec/libx264.c | 55 +++- 2 files changed, 55 insertions(+), 2 deletions(-) Sorry

Re: [FFmpeg-devel] [PATCH v3 0/3] checkasm: updated tests for sw_scale

2022-08-16 Thread Martin Storsjö
On Sat, 13 Aug 2022, Swinney, Jonathan wrote: We don't generally use stdbool in ffmpeg, even if it's C99 - just use a plain int and 0/1. Updated this. Other than that, the checkasm changes look fine (I coauthored part of them - and your cleanup of my WIP patch looks good!). Yes, thank you

Re: [FFmpeg-devel] [PATCH v2] libswscale/aarch64: add another hscale specialization

2022-08-16 Thread Martin Storsjö
On Sat, 13 Aug 2022, Swinney, Jonathan wrote: This specialization handles the case where filtersize is 4 mod 8, e.g. 12, 20, etc. Aarch64 was previously using the c function for this case. This implementation speeds up that case significantly. hscale_8_to_15__fs_12_dstW_512_c: 6234.1
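The "4 mod 8" condition can be read as a dispatch rule on the filter size. The sketch below shows one way such a selection might look; the enum, the function name and the order of the checks are assumptions for illustration and are not taken from the patch.

```c
/* Hypothetical set of code paths a horizontal scaler might choose from. */
enum hscale_variant {
    HSCALE_GENERIC_C, /* plain C fallback */
    HSCALE_FS4,       /* filter size exactly 4 */
    HSCALE_X8,        /* filter size a multiple of 8 */
    HSCALE_X4         /* filter size congruent to 4 mod 8: 12, 20, ... */
};

/* Pick a variant for a positive filter size. */
static enum hscale_variant pick_hscale_variant(int filter_size)
{
    if (filter_size == 4)
        return HSCALE_FS4;
    if (filter_size % 8 == 0)
        return HSCALE_X8;
    if (filter_size % 8 == 4)
        return HSCALE_X4;
    return HSCALE_GENERIC_C;
}
```

Under this rule, sizes such as 12 and 20 no longer fall through to the C function.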

Re: [FFmpeg-devel] [PATCH 2/2] RFC: checkasm: motion: Test different h parameters

2022-08-16 Thread Martin Storsjö
On Thu, 4 Aug 2022, Martin Storsjö wrote: On Wed, 13 Jul 2022, Martin Storsjö wrote: Previously, the checkasm test always passed h=8, so no other cases were tested. Out of the me_cmp functions, in practice, some functions are hardcoded to always assume a 8x8 block (ignoring the h parameter

Re: [FFmpeg-devel] [PATCH v2] lavc/aarch64: hevc_add_res add 12bit variants

2022-08-16 Thread Martin Storsjö
On Tue, 16 Aug 2022, J. Dekker wrote: hevc_add_res_4x4_12_c: 46.0 hevc_add_res_4x4_12_neon: 18.7 hevc_add_res_8x8_12_c: 194.7 hevc_add_res_8x8_12_neon: 25.2 hevc_add_res_16x16_12_c: 716.0 hevc_add_res_16x16_12_neon: 69.7 hevc_add_res_32x32_12_c: 3820.7 hevc_add_res_32x32_12_neon: 261.0

Re: [FFmpeg-devel] [PATCH v3] lavc/aarch64: hevc_add_res add 12bit variants

2022-08-16 Thread Martin Storsjö
On Tue, 16 Aug 2022, J. Dekker wrote: hevc_add_res_4x4_12_c: 46.0 hevc_add_res_4x4_12_neon: 18.7 hevc_add_res_8x8_12_c: 194.7 hevc_add_res_8x8_12_neon: 25.2 hevc_add_res_16x16_12_c: 716.0 hevc_add_res_16x16_12_neon: 69.7 hevc_add_res_32x32_12_c: 3820.7 hevc_add_res_32x32_12_neon: 261.0

Re: [FFmpeg-devel] [PATCH 1/2] x86: Don't hardcode the height to 8 in sad8_xy2_mmx

2022-08-04 Thread Martin Storsjö
On Wed, 13 Jul 2022, Martin Storsjö wrote: The height is hardcoded in some of the me_cmp functions, but not in all of them. But in the case of all other functions, it's hardcoded in the same place in SIMD functions as in the C reference functions, while this one function differs from

Re: [FFmpeg-devel] [PATCH] lavc/aarch64: Add neon implementation for sse4

2022-08-04 Thread Martin Storsjö
On Mon, 25 Jul 2022, Hubert Mazur wrote: Provide neon implementation for sse4 function. Performance comparison tests are shown below. - sse_2_c: 74.0 - sse_2_neon: 24.0 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

Re: [FFmpeg-devel] [PATCH] lavc/aarch64: Add neon implementation for pix_abs8

2022-08-04 Thread Martin Storsjö
On Mon, 25 Jul 2022, Hubert Mazur wrote: Provide optimized implementation of pix_abs8 function for arm64. Performance comparison tests are shown below. - pix_abs_1_0_c: 105.2 - pix_abs_1_0_neon: 21.4 - sad_1_c: 107.2 - sad_1_neon: 20.9 Benchmarks and tests are run with checkasm tool on AWS

Re: [FFmpeg-devel] [PATCH] lavc/aarch64: Add neon implementation for pix_abs16_y2

2022-08-04 Thread Martin Storsjö
On Mon, 25 Jul 2022, Hubert Mazur wrote: Provide optimized implementation of pix_abs16_y2 function for arm64. Performance comparison tests are shown below. pix_abs_0_2_c: 308.5 pix_abs_0_2_neon: 39.2 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur

Re: [FFmpeg-devel] [PATCH 2/2] RFC: checkasm: motion: Test different h parameters

2022-08-04 Thread Martin Storsjö
On Wed, 13 Jul 2022, Martin Storsjö wrote: Previously, the checkasm test always passed h=8, so no other cases were tested. Out of the me_cmp functions, in practice, some functions are hardcoded to always assume a 8x8 block (ignoring the h parameter), while others do use the parameter

Re: [FFmpeg-devel] [PATCH] lavc/aarch64: Add neon implementation for sse8

2022-08-04 Thread Martin Storsjö
On Mon, 25 Jul 2022, Hubert Mazur wrote: Provide optimized implementation of sse8 function for arm64. Performance comparison tests are shown below. - sse_1_c: 133.0 - sse_1_neon: 36.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

Re: [FFmpeg-devel] [PATCH] lavc/aarch64: Add neon implementation for sse16

2022-08-04 Thread Martin Storsjö
On Mon, 25 Jul 2022, Hubert Mazur wrote: Provide neon implementation for sse16 function. Performance comparison tests are shown below. - sse_0_c: 273.0 - sse_0_neon: 48.2 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur ---

Re: [FFmpeg-devel] [PATCH] lavc/aarch64: Add neon implementation for pix_abs16_y2

2022-08-04 Thread Martin Storsjö
On Mon, 25 Jul 2022, Hubert Mazur wrote: Provide optimized implementation of pix_abs16_y2 function for arm64. Performance comparison tests are shown below. pix_abs_0_2_c: 308.5 pix_abs_0_2_neon: 39.2 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur

Re: [FFmpeg-devel] [PATCH 3/4] configure: Add msmpeg4(dec|enc) subsystems

2022-07-31 Thread Martin Storsjö
On Sun, 31 Jul 2022, Andreas Rheinhardt wrote: The msmpeg4 decoders/encoders share a common set of prerequisites, ergo it makes sense to use common subsystems for them. This also allows to remove the CONFIG_MSMPEG4_DECODER/ENCODER ad-hoc defines (which violated the CONFIG_ namespace).

Re: [FFmpeg-devel] [PATCH v3] libavcodec: Set hidden visibility on global symbols accessed from AArch64 assembly

2022-08-04 Thread Martin Storsjö
On Thu, 14 Jul 2022, Martin Storsjö wrote: The AArch64 assembly accesses those symbols directly, without indirection via e.g. the GOT on ELF. In order for this not to require text relocations, those symbols need to be resolved fully at link time, i.e. those symbols can't be interposable

Re: [FFmpeg-devel] [PATCH 1/1] libswscale/aarch64: add another hscale specialization

2022-08-04 Thread Martin Storsjö
On Fri, 22 Jul 2022, Swinney, Jonathan wrote: This specialization handles the case where filtersize is 4 mod 8, e.g. 12, 20, etc. Aarch64 was previously using the c function for this case. This implementation speeds up that case significantly. hscale_8_to_15__fs_12_dstW_512_c: 6234.1
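The condition "filtersize is 4 mod 8" can be written as a simple predicate. The sketch below is only an illustration of that arithmetic; the helper name is hypothetical and the real dispatch code in libswscale may look quite different.

```c
#include <stdbool.h>

/* Hypothetical helper: true when filter_size leaves remainder 4 when
 * divided by 8, e.g. 12 or 20. */
static bool filter_size_is_4_mod_8(int filter_size)
{
    return filter_size > 0 && filter_size % 8 == 4;
}
```

Note that a filter size of 4 also satisfies this predicate; whether that size goes through this specialization or a separate one is not stated in the snippet.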

Re: [FFmpeg-devel] [PATCH v2 1/2] checkasm: updated tests for sw_scale

2022-08-04 Thread Martin Storsjö
On Wed, 27 Jul 2022, Swinney, Jonathan wrote: - added a test for yuv2plane1 - fixed test for yuv2planeX for aarch64 which was previously not working at all - updated the test for yuv2planeX to check exact results or approximated results Signed-off-by: Jonathan Swinney ---

Re: [FFmpeg-devel] [PATCH 2/2] libavcodec: Set hidden visibility on global symbols accessed from x86_64 assembly

2022-08-04 Thread Martin Storsjö
On Wed, 27 Jul 2022, Thomas Guillem wrote: DECLARE_ALIGNED, DECLARE_ASM_ALIGNED, and DECLARE_ASM_CONST will include attribute_visibility_hidden. Hmm, I'm not entirely sure that we should do that - if we should add such extra meaning to those macros. How many symbols would it need to be

Re: [FFmpeg-devel] [PATCH v2 2/2] swscale/aarch64: add vscale specializations

2022-08-04 Thread Martin Storsjö
On Wed, 27 Jul 2022, Swinney, Jonathan wrote: This commit adds new code paths for vscale when filterSize is 2, 4, or 8. By using specialized code with unrolling to match the filterSize we can improve performance. This patch also corrects the behavior for filterSize 1 which was previously

Re: [FFmpeg-devel] [PATCH 1/2] x86: Don't hardcode the height to 8 in sad8_xy2_mmx

2022-08-04 Thread Martin Storsjö
On Thu, 4 Aug 2022, Michael Niedermayer wrote: On Thu, Aug 04, 2022 at 10:47:34AM +0300, Martin Storsjö wrote: On Wed, 13 Jul 2022, Martin Storsjö wrote: The height is hardcoded in some of the me_cmp functions, but not in all of them. But in the case of all other functions, it's hardcoded

Re: [FFmpeg-devel] [PATCH 2/3] lavc/aarch64: reformat add_res funcs

2022-08-09 Thread Martin Storsjö
On Thu, 23 Jun 2022, J. Dekker wrote: Signed-off-by: J. Dekker --- libavcodec/aarch64/hevcdsp_idct_neon.S | 216 - 1 file changed, 108 insertions(+), 108 deletions(-) LGTM, thanks! // Martin ___ ffmpeg-devel mailing list

Re: [FFmpeg-devel] [PATCH 1/2] lavc/aarch64: new 8-bit hevc 16x16 idct

2022-08-09 Thread Martin Storsjö
On Thu, 23 Jun 2022, J. Dekker wrote: old: hevc_idct_16x16_8_c: 5366.2 hevc_idct_16x16_8_neon: 1493.2 new: hevc_idct_16x16_8_c: 5363.2 hevc_idct_16x16_8_neon: 943.5 Co-developed-by: Rafal Dabrowa Signed-off-by: J. Dekker --- libavcodec/aarch64/hevcdsp_idct_neon.S| 666

Re: [FFmpeg-devel] [PATCH 3/3] lavc/aarch64: hevc_add_res add 12bit variants

2022-08-09 Thread Martin Storsjö
On Thu, 23 Jun 2022, J. Dekker wrote: hevc_add_res_4x4_12_c: 46.0 hevc_add_res_4x4_12_neon: 18.7 hevc_add_res_8x8_12_c: 194.7 hevc_add_res_8x8_12_neon: 25.2 hevc_add_res_16x16_12_c: 716.0 hevc_add_res_16x16_12_neon: 69.7 hevc_add_res_32x32_12_c: 3820.7 hevc_add_res_32x32_12_neon: 261.0

Re: [FFmpeg-devel] [PATCH 3/3] lavc/aarch64: hevc_add_res add 12bit variants

2022-08-09 Thread Martin Storsjö
On Tue, 9 Aug 2022, Martin Storsjö wrote: On Thu, 23 Jun 2022, J. Dekker wrote: hevc_add_res_4x4_12_c: 46.0 hevc_add_res_4x4_12_neon: 18.7 hevc_add_res_8x8_12_c: 194.7 hevc_add_res_8x8_12_neon: 25.2 hevc_add_res_16x16_12_c: 716.0 hevc_add_res_16x16_12_neon: 69.7 hevc_add_res_32x32_12_c

Re: [FFmpeg-devel] [PATCH 1/3] checkasm/hevc_add_res: add 12bit test

2022-08-09 Thread Martin Storsjö
On Thu, 23 Jun 2022, J. Dekker wrote: Signed-off-by: J. Dekker --- tests/checkasm/hevc_add_res.c | 15 --- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/tests/checkasm/hevc_add_res.c b/tests/checkasm/hevc_add_res.c index 0c896adaca..f17d121939 100644 ---

Re: [FFmpeg-devel] [PATCH] lavc/aarch64: add hevc chroma loop filter 8-12bit

2022-08-09 Thread Martin Storsjö
On Thu, 23 Jun 2022, J. Dekker wrote: Signed-off-by: J. Dekker --- libavcodec/aarch64/Makefile | 3 +- libavcodec/aarch64/hevcdsp_deblock_neon.S | 168 ++ libavcodec/aarch64/hevcdsp_init_aarch64.c | 14 ++ 3 files changed, 184 insertions(+), 1 deletion(-)

Re: [FFmpeg-devel] [PATCH] swscale/output: fix reading chroma values when generating vuya output

2022-08-08 Thread Martin Storsjö
On Mon, 8 Aug 2022, James Almer wrote: Signed-off-by: James Almer --- libswscale/output.c | 4 ++-- tests/ref/fate/filter-pixdesc-vuya | 2 +- tests/ref/fate/filter-pixfmts-copy | 2 +- tests/ref/fate/filter-pixfmts-crop | 2 +-

[FFmpeg-devel] [PATCH 2/2] arm: rv40dsp: Change stride parameters to ptrdiff_t

2022-08-09 Thread Martin Storsjö
These were missed when h264_chroma_mc_func was changed in e4a94d8b36c48d95a7d412c40d7b558422ff659c. Signed-off-by: Martin Storsjö --- libavcodec/arm/rv40dsp_init_arm.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/libavcodec/arm/rv40dsp_init_arm.c b/libavcodec/arm

[FFmpeg-devel] [PATCH 1/2] arm: vc1sdp: Change stride parameters to ptrdiff_t

2022-08-09 Thread Martin Storsjö
This was missed in db54426975e124e98e5130ad01316cb7afd60630. Signed-off-by: Martin Storsjö --- In practice, ptrdiff_t and int are the same type on arm, so these didn't cause any warnings and haven't been caught due to that. --- libavcodec/arm/vc1dsp_init_neon.c | 12 ++-- 1 file changed

Re: [FFmpeg-devel] [PATCH] avcodec/me_cmp: Remove now incorrect av_assert2()

2022-08-17 Thread Martin Storsjö
On Wed, 17 Aug 2022, Andreas Rheinhardt wrote: Since d69d12a5b9236b9d2f1fd247ea452f84cdd1aaf9 these av_assert2() (or more exactly, the ones in hadamard8_diff8x8_c() and hadamard8_intra8x8_c()) are hit. So just remove all of these asserts. (If the test were improved to know which functions

[FFmpeg-devel] [PATCH 1/2] checkasm: sw_scale: Fix the difference printing for approximate functions

2022-08-17 Thread Martin Storsjö
Don't stop directly at the first differing pixel, but find the one that differs by more than the expected accuracy. Also print the failing value in check_yuv2yuvX. Signed-off-by: Martin Storsjö --- tests/checkasm/sw_scale.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions
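The search described here, skipping small differences and reporting the first one beyond the allowed tolerance, might be sketched as follows. The function `first_diff_beyond` is a hypothetical helper written for illustration, not the code in tests/checkasm/sw_scale.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns the index of the first element whose
 * absolute difference exceeds `accuracy`, or -1 if every element is
 * within tolerance. */
static ptrdiff_t first_diff_beyond(const uint8_t *ref, const uint8_t *test,
                                   size_t n, int accuracy)
{
    for (size_t i = 0; i < n; i++) {
        if (abs(ref[i] - test[i]) > accuracy)
            return (ptrdiff_t)i;
    }
    return -1;
}
```

With `accuracy` set to 0 this degenerates to reporting the first differing pixel, which matches the behaviour the patch description says it replaces for approximate functions.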

[FFmpeg-devel] [PATCH 2/2] checkasm: sw_scale: Reduce range of test data in the yuv2yuvX test to get closer to real data

2022-08-17 Thread Martin Storsjö
more realistic output pixel values, instead of having essentially all pixels clipped to either 0 or 255. Signed-off-by: Martin Storsjö --- tests/checkasm/sw_scale.c | 8 1 file changed, 8 insertions(+) diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c index d72506ed86

Re: [FFmpeg-devel] [PATCH 2/2] checkasm: sw_scale: Reduce range of test data in the yuv2yuvX test to get closer to real data

2022-08-18 Thread Martin Storsjö
On Wed, 17 Aug 2022, Ronald S. Bultje wrote: On Wed, Aug 17, 2022 at 4:32 PM Martin Storsjö wrote: This avoids overflows on some inputs in the x86 case, where the assembly version would clip/overflow differently from the C reference function. This doesn't seem

[FFmpeg-devel] [PATCH v2] checkasm: sw_scale: Produce more realistic test filter coefficients for yuv2yuvX

2022-08-18 Thread Martin Storsjö
This avoids triggering overflows in the filters, and avoids stray test failures in the approximate functions on x86; due to rounding differences, one implementation might overflow while another one doesn't. Signed-off-by: Martin Storsjö --- FWIW, this modification runs successfully with over
