On Tue, 16 Aug 2022, Hubert Mazur wrote:
Provide neon implementation for sse4 function.
Performance comparison tests are shown below.
- sse_2_c: 80.7
- sse_2_neon: 31.0
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Tue, 16 Aug 2022, Hubert Mazur wrote:
Provide neon implementation for sse16 function.
Performance comparison tests are shown below.
- sse_0_c: 268.2
- sse_0_neon: 43.5
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Tue, 16 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation of sse8 function for arm64.
Performance comparison tests are shown below.
- sse_1_c: 130.7
- sse_1_neon: 29.7
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Tue, 16 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation of pix_abs8 function for arm64.
Performance comparison tests are shown below.
- pix_abs_1_0_c: 101.2
- pix_abs_1_0_neon: 22.5
- sad_1_c: 101.2
- sad_1_neon: 22.5
Benchmarks and tests are run with checkasm tool on AWS
On Tue, 9 Aug 2022, Martin Storsjö wrote:
These were missed when h264_chroma_mc_func was changed in
e4a94d8b36c48d95a7d412c40d7b558422ff659c.
Signed-off-by: Martin Storsjö
---
libavcodec/arm/rv40dsp_init_arm.c | 8
1 file changed, 4 insertions(+), 4 deletions(-)
OK'd by Andreas
On Sat, 27 Aug 2022, Martin Storsjö wrote:
The AArch64 assembly accesses those symbols directly, without
indirection via e.g. the GOT on ELF. In order for this not to
require text relocations, those symbols need to be resolved fully
at link time, i.e. those symbols can't be interposable
On Fri, 26 Aug 2022, Martin Storsjö wrote:
This fixes building for arm targets with optimizations disabled.
---
libavutil/arm/intmath.h | 24 ++--
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/libavutil/arm/intmath.h b/libavutil/arm/intmath.h
index 5311a7d52b
On Tue, 6 Sep 2022, Lukas Fellechner wrote:
There are really two separate issues here:
1. Running out of address space in 32-bit processes
It probably makes sense to limit auto threads to 16, but it should only
be done in 32-bit processes.
FWIW, this was my first approach, until Andreas
On Mon, 22 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation of vsad16 function for arm64.
Performance comparison tests are shown below.
- vsad_0_c: 285.0
- vsad_0_neon: 42.5
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Mon, 22 Aug 2022, Hubert Mazur wrote:
Add vectorized implementation of nsse16 function.
Performance comparison tests are shown below.
- nsse_0_c: 707.0
- nsse_0_neon: 120.0
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Sat, 3 Sep 2022, r...@remlab.net wrote:
From: Rémi Denis-Courmont
There are no particular reasons to force the compiler to use the same
register as output and input operand. This forces an extra MOV
instruction if the input value needs to be reused after the swap.
In most cases, this
On Sat, 3 Sep 2022, Andreas Rheinhardt wrote:
It is advantageous for ff_crop_tab, as the base pointer used to
access this table is not the first element of it. But the real
base pointer is still at a constant offset from the code/the GOT
and can therefore be accessed relative to the instruction
On Tue, 6 Sep 2022, Hubert Mazur wrote:
Provide optimized implementation of vsse16 for arm64.
Performance comparison tests are shown below.
- vsse_0_c: 254.4
- vsse_0_neon: 64.7
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Tue, 6 Sep 2022, Hubert Mazur wrote:
Provide optimized implementations for me_cmp functions.
This set of patches fixes all issues addressed in previous review.
Major changes:
- Remove redundant loads since the data can be reused.
- Improve style.
- Fix issues with unrecognized symbols.
On Tue, 6 Sep 2022, Hubert Mazur wrote:
Add vectorized implementation of nsse16 function.
Performance comparison tests are shown below.
- nsse_0_c: 707.0
- nsse_0_neon: 120.0
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Mon, 5 Sep 2022, Martin Storsjö wrote:
This matches a similar cap on the number of automatic threads
in libavcodec/pthread_slice.c.
On systems with lots of cores, this does speed things up in
general (measurable on the level of the runtime of running
"make fate"), and fixes a c
On Mon, 22 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation of vsse16 for arm64.
Performance comparison tests are shown below.
- vsse_0_c: 254.4
- vsse_0_neon: 64.7
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Mon, 22 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation for vsad_intra16 function for arm64.
Performance comparison tests are shown below.
- vsad_4_c: 177.2
- vsad_4_neon: 24.5
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
On Mon, 22 Aug 2022, Hubert Mazur wrote:
Provide optimized implementation for vsse_intra16 for arm64.
Performance tests are shown below.
- vsse_4_c: 153.7
- vsse_4_neon: 34.2
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Mon, 22 Aug 2022, Hubert Mazur wrote:
Add vectorized implementation of nsse16 function.
Performance comparison tests are shown below.
- nsse_0_c: 707.0
- nsse_0_neon: 120.0
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
This fixes building for e.g. i386 windows.
---
libavutil/x86/tx_float.asm | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/libavutil/x86/tx_float.asm b/libavutil/x86/tx_float.asm
index 1b9131e7fa..ace19788a6 100644
--- a/libavutil/x86/tx_float.asm
+++
This fixes building for x86 macOS (both i386 and x86_64) and
i386 windows.
---
v2: Add mangle() in a couple more places, that weren't noticed
on i386 windows.
---
libavutil/x86/tx_float.asm | 10 +-
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/libavutil/x86/tx_float.asm
On Tue, 6 Sep 2022, Mattias Wadman wrote:
On Sat, Sep 3, 2022 at 3:41 AM Lynne wrote:
Needed for the next patch.
We get this for the extremely small cost of a branch on _ns functions,
which wouldn't be used anyway with assembly.
Patch attached.
Hi, I have issues building on macOS
that are accessed from AArch64 assembly
as hidden, so that they are resolved fully at link time even without
the version script and -Wl,-Bsymbolic.
Signed-off-by: Martin Storsjö
---
v4: Moved the attribute definition to a new, standalone header (which
only depends on libavutil/attributes.h
These inline assembly functions rely on being inlined into the
caller, so that the parameter "int p" can be a known assembly time
constant, instead of a variable parameter.
__OPTIMIZE__ is a built-in define which is set by both GCC and Clang
(the two main compilers supporting our inline assembly)
On Thu, 18 Aug 2022, Alan Kelly wrote:
Thanks Martin for doing this.
On Thu, Aug 18, 2022 at 10:16 AM Martin Storsjö wrote:
This avoids triggering overflows in the filters, and avoids stray
test failures in the approximate functions on x86; due to rounding
On Mon, 26 Sep 2022, Grzegorz Bernacki wrote:
Add vectorized implementation of nsse8 function.
Performance comparison tests are shown below.
- nsse_1_c: 256.0
- nsse_1_neon: 82.7
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Grzegorz Bernacki
---
This avoids one redundant load per row; pix3 from the previous
iteration can be used as pix2 in the next one.
Before:          Cortex A53    A72    A73
pix_abs_0_2_neon:     138.0   59.7   48.0
After:
pix_abs_0_2_neon:     109.7   50.2   39.5
Signed-off-by: Martin Storsjö
---
libavcodec/aarch64
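The load reuse described in this message can be sketched in scalar C. This is only an illustration under stated assumptions, not the NEON code from the patch: the names sad16_y2_reload and sad16_y2_reuse and the explicit row buffers are hypothetical, and the rounded average (a + b + 1) >> 1 is assumed to match the reference behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Rounded average of two pixels (assumed to match the reference). */
static int avg2(int a, int b)
{
    return (a + b + 1) >> 1;
}

/* Straightforward version: the current row (pix2) and the row below it
 * (pix3) are both read from memory for every output row. */
static int sad16_y2_reload(const uint8_t *pix1, const uint8_t *pix2,
                           ptrdiff_t stride, int h)
{
    const uint8_t *pix3 = pix2 + stride;
    int sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[x] - avg2(pix2[x], pix3[x]));
        pix1 += stride;
        pix2 += stride;
        pix3 += stride;
    }
    return sum;
}

/* Reuse version: the row loaded as pix3 is kept and serves as pix2 in the
 * next iteration, so only one new row of the second block is loaded. */
static int sad16_y2_reuse(const uint8_t *pix1, const uint8_t *pix2,
                          ptrdiff_t stride, int h)
{
    uint8_t cur[16], next[16];
    int sum = 0;

    assert(h > 0);
    memcpy(cur, pix2, 16);          /* initial load of the first row */
    for (int y = 0; y < h; y++) {
        pix2 += stride;
        memcpy(next, pix2, 16);     /* the only new load of pix2 data */
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[x] - avg2(cur[x], next[x]));
        memcpy(cur, next, 16);      /* stands in for keeping a register */
        pix1 += stride;
    }
    return sum;
}
```

Both functions read h + 1 rows of the second block; the second one only changes how often each row is fetched, so the two must return the same sum.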
Signed-off-by: Martin Storsjö
---
libavcodec/aarch64/me_cmp_neon.S | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 832a7cb22d..c710358ab7 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b
Signed-off-by: Martin Storsjö
---
This should hopefully fix the compile failures on fate,
http://fate.ffmpeg.org/report.cgi?time=20220927222508&slot=riscv64-linux-gnu-gcc-12
and
http://fate.ffmpeg.org/report.cgi?time=20220927225014&slot=riscv64-linux-gnu-clang-14.
---
libavcodec/riscv/fmtconvert_rvv.S
On Wed, 28 Sep 2022, Rémi Denis-Courmont wrote:
On 28 September 2022 10:13:57 GMT+03:00, "Martin Storsjö" wrote:
Signed-off-by: Martin Storsjö
---
This should hopefully fix the compile failures on fate,
http://fate.ffmpeg.org/report.cgi?time=20220927222508&slot=riscv64-linux-
On Mon, 26 Sep 2022, Grzegorz Bernacki wrote:
Provide optimized implementation of pix_abs8 function for arm64.
Performance comparison tests are shown below:
pix_abs_1_1_c: 162.5
pix_abs_1_1_neon: 27.0
pix_abs_1_2_c: 174.0
pix_abs_1_2_neon: 23.5
pix_abs_1_3_c: 203.2
pix_abs_1_3_neon: 34.7
On Mon, 26 Sep 2022, Grzegorz Bernacki wrote:
Provide optimized implementation for vsse_intra8 for arm64.
Performance tests are shown below.
- vsse_5_c: 87.7
- vsse_5_neon: 26.2
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
---
libavcodec/aarch64/me_cmp_init_aarch64.c |
This hopefully should fix building with older toolchains, hopefully
fixing the fate failures on
http://fate.ffmpeg.org/history.cgi?slot=armel5tej-qemu-debian-gcc4.4.
Signed-off-by: Martin Storsjö
---
libavcodec/arm/vc1dsp_neon.S | 40 ++--
1 file changed, 20
On Mon, 26 Sep 2022, Grzegorz Bernacki wrote:
Provide optimized implementation of vsse8 for arm64.
Performance comparison tests are shown below.
- vsse_1_c: 141.5
- vsse_1_neon: 32.5
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Grzegorz Bernacki
---
---
This should hopefully fix the current build failures at
http://fate.ffmpeg.org/history.cgi?slot=riscv64-linux-gnu-clang-14.
---
libavcodec/riscv/fmtconvert_init.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/libavcodec/riscv/fmtconvert_init.c
On Wed, 28 Sep 2022, Martin Storsjö wrote:
This hopefully should fix building with older toolchains, hopefully
fixing the fate failures on
http://fate.ffmpeg.org/history.cgi?slot=armel5tej-qemu-debian-gcc4.4.
Signed-off-by: Martin Storsjö
---
libavcodec/arm/vc1dsp_neon.S | 40
operation, which
would clamp the intermediates to 32 bit still).
Fixes: https://trac.ffmpeg.org/ticket/9985
Signed-off-by: Martin Storsjö
---
libswscale/aarch64/yuv2rgb_neon.S | 8
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/libswscale/aarch64/yuv2rgb_neon.S
b/libswscale
On Mon, 17 Oct 2022, Hubert Mazur wrote:
Provide arm64 neon optimized implementations for hscale16To19 with
filter sizes 4, 8 and X4.
The tests and benchmarks run on AWS Graviton 2 instances.
The results from a checkasm tool are shown below.
hscale_16_to_19__fs_4_dstW_512_c: 6216.0
On Mon, 17 Oct 2022, Hubert Mazur wrote:
Add arm64 neon implementations for hscale 8 to 19 with filter
sizes 4, 4X and 8. Both implementations are based on very similar ones
dedicated to hscale 8 to 15. The major changes refer to saving
the data - instead of writing the result as int16_t it is
On Tue, 11 Oct 2022, J. Dekker wrote:
checkasm benchmark on Ampere Altra (Neoverse N1):
put_hevc_qpel_bi_h4_8_c: 170.7
put_hevc_qpel_bi_h4_8_neon: 64.5
put_hevc_qpel_bi_h6_8_c: 373.7
put_hevc_qpel_bi_h6_8_neon: 130.2
put_hevc_qpel_bi_h8_8_c: 662.0
put_hevc_qpel_bi_h8_8_neon: 138.5
On Wed, 19 Oct 2022, Martin Storsjö wrote:
Support for building with older versions of MSVC (with the
c99wrap/c99conv frontend) was removed in
ce943dd6acbfdfc40223c0fb24d4cad438e6499c.
Signed-off-by: Martin Storsjö
---
configure | 6 --
1 file changed, 6 deletions(-)
diff --git
On Tue, 25 Oct 2022, Martin Storsjö wrote:
Treat the 32 bit stride registers as signed.
Alternatively, we could make the stride arguments ptrdiff_t instead
of int, and changing all of the assembly to operate on these
registers with their full 64 bit width, but that's a much larger
and more
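Why the signedness of a 32 bit stride matters can be sketched in C. This is a minimal illustration, not the assembly from the patch: the helper names are hypothetical, and the excerpt does not show which instructions the actual change uses. Sign extension here corresponds to what an sxtw does on AArch64.

```c
#include <assert.h>
#include <stdint.h>

/* Zero extension: the low 32 bits are used as an unsigned 64-bit value. */
static int64_t widen_unsigned(uint32_t w)
{
    return (int64_t)(uint64_t)w;
}

/* Sign extension: the low 32 bits are interpreted as a signed int first.
 * The uint32_t -> int32_t conversion is implementation-defined before C23,
 * but wraps as expected on common compilers. */
static int64_t widen_signed(uint32_t w)
{
    return (int64_t)(int32_t)w;
}

/* Hypothetical helper: byte offset of row `row` for a stride whose low
 * 32 bits are `stride_bits`. */
static int64_t row_offset(uint32_t stride_bits, int row, int treat_as_signed)
{
    int64_t stride = treat_as_signed ? widen_signed(stride_bits)
                                     : widen_unsigned(stride_bits);

    assert(row >= 0);
    return stride * row;
}
```

With a stride of -64, the signed interpretation gives an offset of -128 for row 2, while the unsigned one points roughly 8 GB past the buffer. For positive strides the two agree, which is why such a bug can go unnoticed until a negative stride is passed.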
Support for building with older versions of MSVC (with the
c99wrap/c99conv frontend) was removed in
ce943dd6acbfdfc40223c0fb24d4cad438e6499c.
Signed-off-by: Martin Storsjö
---
configure | 6 --
1 file changed, 6 deletions(-)
diff --git a/configure b/configure
index 6712d045d9..ed52212f93
On Tue, 20 Sep 2022, Hubert Mazur wrote:
This fixes issues addressed in previous patchset:
- move sub instruction in vsad8_intra,
- remove unnecessary mov instructions,
- remove single lane extraction in loop and place it at the end.
Removing mov instructions from pix_median_abs functions
On Sun, 25 Aug 2019, James Cowgill wrote:
When compiling FFmpeg with GCC-9, some very random segfaults were
observed in code which had previously called down into the SBC encoder
NEON assembly routines. This was caused by these functions clobbering
some of the vfp callee saved registers (d8 -
On Tue, 13 Sep 2022, Hubert Mazur wrote:
Provide optimized implementation for pix_median_abs16 function.
You forgot to update this part of the commit message.
Performance comparison tests are shown below.
- vsad_5_c: 94.7
- vsad_5_neon: 20.7
Benchmarks and tests run with checkasm tool
On Tue, 13 Sep 2022, Hubert Mazur wrote:
Provide optimized implementation for pix_median_abs16 function.
Forgot to update this part of the commit message here too.
Performance comparison tests are shown below.
- median_sad_1_c: 273.7
- median_sad_1_neon: 98.2
Benchmarks and tests run with
On Tue, 13 Sep 2022, Hubert Mazur wrote:
Provide optimized implementation for pix_median_abs16 function.
Performance comparison tests are shown below.
- median_sad_0_c: 722.0
- median_sad_0_neon: 144.7
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
On Sat, 24 Sep 2022, Lynne wrote:
What about ac3dsp then - that one seems like it's fairly optimized for arm?
Haven't touched them, they're still being used. Unfortunately, for AC3,
the full MDCT optimizations in lavc do make a difference and the overall
decoder becomes 15% slower with this
On Mon, 26 Sep 2022, Marvin Scholz wrote:
As I am not sure who else to email about this, I'll just post it here.
I tried to register for Patchwork, however I got an error when registering.
I tried again and was told the account already exists, I tried to reset the
password for the account but
On Sat, 24 Sep 2022, Lynne wrote:
This commit changes both the encoder and decoder to use the new lavu/tx code,
which has faster C transforms and more assembly optimizations.
What's the case of e.g. 32 bit arm - that does have a bunch of fft and
mdct assembly, but is that something that ends
On Sat, 24 Sep 2022, Hendrik Leppkes wrote:
On Sat, Sep 24, 2022 at 9:26 PM Hendrik Leppkes wrote:
On Sat, Sep 24, 2022 at 8:43 PM Martin Storsjö wrote:
>
> On Sat, 24 Sep 2022, Lynne wrote:
>
> > This commit changes both the encoder and decoder to use the new lavu/tx code
On Thu, 8 Sep 2022, Hubert Mazur wrote:
Fix minor issues in the patches.
Regarding vsse16 I didn't change saba & umlal to sub & smlal.
It doesn't affect the performance, so left it as it was.
The majority of changes refer to nsse16:
- fixed indentation (thanks for pointing out),
- applied the
: Provide neon implementation of nsse8
lavc/aarch64: Provide optimized implementation of vsse8 for arm64.
lavc/aarch64: Add neon implementation for vsse_intra8
Martin Storsjö (3):
aarch64: me_cmp: Improve scheduling in ff_pix_abs8_y2_neon
aarch64: me_cmp: Fix up the prologue
On Sun, 9 Oct 2022, reimar.doeffin...@gmx.de wrote:
From: Reimar Döffinger
Currently it is done in several different ways, which
might cause needless dependencies or in case of
tx_float_neon.S is incorrect.
Signed-off-by: Reimar Döffinger
---
libavcodec/aarch64/fft_neon.S | 3 +-
Signed-off-by: Martin Storsjö
---
libavcodec/packet.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/libavcodec/packet.h b/libavcodec/packet.h
index 404d520071..f28e7e7011 100644
--- a/libavcodec/packet.h
+++ b/libavcodec/packet.h
@@ -161,7 +161,7 @@ enum
Limit the returned value from av_cpu_count to sensible amounts
in 32 bit builds.
This chosen limit, 64, is somewhat arbitrary - a 32 bit process
is capable of creating much more than 64 threads. But in many
cases, multiple parts of the encoding pipeline (decoder, filters,
encoders) all create a
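The cap described in this message could look roughly like the following. This is a sketch under assumptions, not the actual change to av_cpu_count: cap_cpu_count and MAX_AUTO_CPUS_32BIT are hypothetical names, and detecting a 32 bit build through the pointer size is one possible choice among several.

```c
#include <assert.h>
#include <stddef.h>

/* The limit named in the message; the constant name is hypothetical. */
#define MAX_AUTO_CPUS_32BIT 64

/* Hypothetical helper: clamp a detected CPU count when pointers are
 * 32 bits wide or narrower; 64-bit builds are left unchanged. */
static int cap_cpu_count(int detected, size_t pointer_size)
{
    assert(detected >= 0);
    if (pointer_size <= 4 && detected > MAX_AUTO_CPUS_32BIT)
        return MAX_AUTO_CPUS_32BIT;
    return detected;
}
```

A caller would pass sizeof(void *) as the second argument, for example cap_cpu_count(nb_cpus, sizeof(void *)), so that only address-space-constrained builds see the reduced value.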
This matches a similar cap on the number of automatic threads
in libavcodec/pthread_slice.c.
On systems with lots of cores, this does speed things up in
general (measurable on the level of the runtime of running
"make fate"), and fixes a couple fate failures in 32 bit mode on
such machines (where
On Mon, 5 Sep 2022, Andreas Rheinhardt wrote:
Martin Storsjö:
Limit the returned value from av_cpu_count to sensible amounts
in 32 bit builds.
This chosen limit, 64, is somewhat arbitrary - a 32 bit process
is capable of creating much more than 64 threads. But in many
cases, multiple parts
On Mon, 25 Jul 2022, Hubert Mazur wrote:
Provide neon implementation for sse16 function.
Performance comparison tests are shown below.
- sse_0_c: 273.0
- sse_0_neon: 48.2
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Wed, 27 Jul 2022, Andreas Rheinhardt wrote:
Swinney, Jonathan:
This patch looks good to me. I would appreciate its merging.
} while (0)
#define PERF_STOP(t) do { \
+int ret;\
ioctl(sysfd,
On Sun, 7 Aug 2022, James Almer wrote:
ffmpeg | branch: master | James Almer | Fri Aug 5 13:44:16
2022 -0300| [19748132613d1d13f5b6786051910e7375bb3df6] | committer: James Almer
swscale/output: add VUYA output support
Signed-off-by: James Almer
From: Anton Khirnov
Bump the version requirement to 122, which introduced b_full_recon.
---
configure| 2 +-
libavcodec/libx264.c | 55 +++-
2 files changed, 55 insertions(+), 2 deletions(-)
diff --git a/configure b/configure
index
This directory dependency is normally added implicitly by rules
in ffbuild/common.mak; for tools it's created by a rule for TOOLOBJS.
TOOLOBJS is populated implicitly from TOOLS, and decode_simple.o
doesn't end up there because it's an odd occurrence of a lone
object file in the tools
On Mon, 8 Aug 2022, Martin Storsjö wrote:
From: Anton Khirnov
Bump the version requirement to 122, which introduced b_full_recon.
---
configure| 2 +-
libavcodec/libx264.c | 55 +++-
2 files changed, 55 insertions(+), 2 deletions(-)
Sorry
On Sat, 13 Aug 2022, Swinney, Jonathan wrote:
We don't generally use stdbool in ffmpeg, even if it's C99 - just use a
plain int and 0/1.
Updated this.
Other than that, the checkasm changes look fine (I coauthored part of
them - and your cleanup of my WIP patch looks good!).
Yes, thank you
On Sat, 13 Aug 2022, Swinney, Jonathan wrote:
This specialization handles the case where filtersize is 4 mod 8, e.g.
12, 20, etc. Aarch64 was previously using the c function for this case.
This implementation speeds up that case significantly.
hscale_8_to_15__fs_12_dstW_512_c: 6234.1
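The "4 mod 8" condition can be made concrete with a small dispatch sketch. Only the 4 mod 8 case and its former C fallback come from the message; the enum, the function name, and the assumption that filter sizes of exactly 4 and multiples of 8 have their own routines are hypothetical.

```c
#include <assert.h>

/* Hypothetical kernel classes; only the "4 mod 8" class is described
 * in the message, the others are assumed for illustration. */
enum hscale_kernel {
    HSCALE_C_FALLBACK,  /* no specialised routine */
    HSCALE_FS_4,        /* filterSize == 4 */
    HSCALE_FS_X8,       /* filterSize is a multiple of 8 */
    HSCALE_FS_X4        /* filterSize is 4 mod 8: 12, 20, 28, ... */
};

static enum hscale_kernel pick_hscale_kernel(int filter_size)
{
    assert(filter_size > 0);
    if (filter_size == 4)
        return HSCALE_FS_4;
    if (filter_size % 8 == 0)
        return HSCALE_FS_X8;
    if (filter_size % 8 == 4)
        return HSCALE_FS_X4;
    return HSCALE_C_FALLBACK;
}
```

The order of the checks matters: 4 itself is also 4 mod 8, so it has to be tested first if it is meant to keep a dedicated routine.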
On Thu, 4 Aug 2022, Martin Storsjö wrote:
On Wed, 13 Jul 2022, Martin Storsjö wrote:
Previously, the checkasm test always passed h=8, so no other cases
were tested.
Out of the me_cmp functions, in practice, some functions are hardcoded
to always assume a 8x8 block (ignoring the h parameter
On Tue, 16 Aug 2022, J. Dekker wrote:
hevc_add_res_4x4_12_c: 46.0
hevc_add_res_4x4_12_neon: 18.7
hevc_add_res_8x8_12_c: 194.7
hevc_add_res_8x8_12_neon: 25.2
hevc_add_res_16x16_12_c: 716.0
hevc_add_res_16x16_12_neon: 69.7
hevc_add_res_32x32_12_c: 3820.7
hevc_add_res_32x32_12_neon: 261.0
On Tue, 16 Aug 2022, J. Dekker wrote:
hevc_add_res_4x4_12_c: 46.0
hevc_add_res_4x4_12_neon: 18.7
hevc_add_res_8x8_12_c: 194.7
hevc_add_res_8x8_12_neon: 25.2
hevc_add_res_16x16_12_c: 716.0
hevc_add_res_16x16_12_neon: 69.7
hevc_add_res_32x32_12_c: 3820.7
hevc_add_res_32x32_12_neon: 261.0
On Wed, 13 Jul 2022, Martin Storsjö wrote:
The height is hardcoded in some of the me_cmp functions, but not
in all of them. But in the case of all other functions, it's hardcoded
in the same place in SIMD functions as in the C reference functions,
while this one function differs from
On Mon, 25 Jul 2022, Hubert Mazur wrote:
Provide neon implementation for sse4 function.
Performance comparison tests are shown below.
- sse_2_c: 74.0
- sse_2_neon: 24.0
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Mon, 25 Jul 2022, Hubert Mazur wrote:
Provide optimized implementation of pix_abs8 function for arm64.
Performance comparison tests are shown below.
- pix_abs_1_0_c: 105.2
- pix_abs_1_0_neon: 21.4
- sad_1_c: 107.2
- sad_1_neon: 20.9
Benchmarks and tests are run with checkasm tool on AWS
On Mon, 25 Jul 2022, Hubert Mazur wrote:
Provide optimized implementation of pix_abs16_y2 function for arm64.
Performance comparison tests are shown below.
pix_abs_0_2_c: 308.5
pix_abs_0_2_neon: 39.2
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
On Wed, 13 Jul 2022, Martin Storsjö wrote:
Previously, the checkasm test always passed h=8, so no other cases
were tested.
Out of the me_cmp functions, in practice, some functions are hardcoded
to always assume a 8x8 block (ignoring the h parameter), while others
do use the parameter
On Mon, 25 Jul 2022, Hubert Mazur wrote:
Provide optimized implementation of sse8 function for arm64.
Performance comparison tests are shown below.
- sse_1_c: 133.0
- sse_1_neon: 36.7
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Mon, 25 Jul 2022, Hubert Mazur wrote:
Provide neon implementation for sse16 function.
Performance comparison tests are shown below.
- sse_0_c: 273.0
- sse_0_neon: 48.2
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
---
On Mon, 25 Jul 2022, Hubert Mazur wrote:
Provide optimized implementation of pix_abs16_y2 function for arm64.
Performance comparison tests are shown below.
pix_abs_0_2_c: 308.5
pix_abs_0_2_neon: 39.2
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur
On Sun, 31 Jul 2022, Andreas Rheinhardt wrote:
The msmpeg4 decoders/encoders share a common set of prerequisites,
ergo it makes sense to use common subsystems for them. This also
allows removing the CONFIG_MSMPEG4_DECODER/ENCODER ad-hoc defines
(which violated the CONFIG_ namespace).
On Thu, 14 Jul 2022, Martin Storsjö wrote:
The AArch64 assembly accesses those symbols directly, without
indirection via e.g. the GOT on ELF. In order for this not to
require text relocations, those symbols need to be resolved fully
at link time, i.e. those symbols can't be interposable
On Fri, 22 Jul 2022, Swinney, Jonathan wrote:
This specialization handles the case where filtersize is 4 mod 8, e.g.
12, 20, etc. Aarch64 was previously using the c function for this case.
This implementation speeds up that case significantly.
hscale_8_to_15__fs_12_dstW_512_c: 6234.1
On Wed, 27 Jul 2022, Swinney, Jonathan wrote:
- added a test for yuv2plane1
- fixed test for yuv2planeX for aarch64 which was previously not working
at all
- updated the test for yuv2planeX to check exact results or approximated
results
Signed-off-by: Jonathan Swinney
---
On Wed, 27 Jul 2022, Thomas Guillem wrote:
DECLARE_ALIGNED, DECLARE_ASM_ALIGNED, and DECLARE_ASM_CONST will include
attribute_visibility_hidden.
Hmm, I'm not entirely sure that we should do that - if we should add such
extra meaning to those macros.
How many symbols would it need to be
On Wed, 27 Jul 2022, Swinney, Jonathan wrote:
This commit adds new code paths for vscale when filterSize is 2, 4, or
8. By using specialized code with unrolling to match the filterSize we
can improve performance.
This patch also corrects the behavior for filterSize 1 which was previously
On Thu, 4 Aug 2022, Michael Niedermayer wrote:
On Thu, Aug 04, 2022 at 10:47:34AM +0300, Martin Storsjö wrote:
On Wed, 13 Jul 2022, Martin Storsjö wrote:
The height is hardcoded in some of the me_cmp functions, but not
in all of them. But in the case of all other functions, it's hardcoded
On Thu, 23 Jun 2022, J. Dekker wrote:
Signed-off-by: J. Dekker
---
libavcodec/aarch64/hevcdsp_idct_neon.S | 216 -
1 file changed, 108 insertions(+), 108 deletions(-)
LGTM, thanks!
// Martin
On Thu, 23 Jun 2022, J. Dekker wrote:
old:
hevc_idct_16x16_8_c: 5366.2
hevc_idct_16x16_8_neon: 1493.2
new:
hevc_idct_16x16_8_c: 5363.2
hevc_idct_16x16_8_neon: 943.5
Co-developed-by: Rafal Dabrowa
Signed-off-by: J. Dekker
---
libavcodec/aarch64/hevcdsp_idct_neon.S| 666
On Thu, 23 Jun 2022, J. Dekker wrote:
hevc_add_res_4x4_12_c: 46.0
hevc_add_res_4x4_12_neon: 18.7
hevc_add_res_8x8_12_c: 194.7
hevc_add_res_8x8_12_neon: 25.2
hevc_add_res_16x16_12_c: 716.0
hevc_add_res_16x16_12_neon: 69.7
hevc_add_res_32x32_12_c: 3820.7
hevc_add_res_32x32_12_neon: 261.0
On Tue, 9 Aug 2022, Martin Storsjö wrote:
On Thu, 23 Jun 2022, J. Dekker wrote:
hevc_add_res_4x4_12_c: 46.0
hevc_add_res_4x4_12_neon: 18.7
hevc_add_res_8x8_12_c: 194.7
hevc_add_res_8x8_12_neon: 25.2
hevc_add_res_16x16_12_c: 716.0
hevc_add_res_16x16_12_neon: 69.7
hevc_add_res_32x32_12_c
On Thu, 23 Jun 2022, J. Dekker wrote:
Signed-off-by: J. Dekker
---
tests/checkasm/hevc_add_res.c | 15 ---
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/tests/checkasm/hevc_add_res.c b/tests/checkasm/hevc_add_res.c
index 0c896adaca..f17d121939 100644
---
On Thu, 23 Jun 2022, J. Dekker wrote:
Signed-off-by: J. Dekker
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/hevcdsp_deblock_neon.S | 168 ++
libavcodec/aarch64/hevcdsp_init_aarch64.c | 14 ++
3 files changed, 184 insertions(+), 1 deletion(-)
On Mon, 8 Aug 2022, James Almer wrote:
Signed-off-by: James Almer
---
libswscale/output.c | 4 ++--
tests/ref/fate/filter-pixdesc-vuya | 2 +-
tests/ref/fate/filter-pixfmts-copy | 2 +-
tests/ref/fate/filter-pixfmts-crop | 2 +-
These were missed when h264_chroma_mc_func was changed in
e4a94d8b36c48d95a7d412c40d7b558422ff659c.
Signed-off-by: Martin Storsjö
---
libavcodec/arm/rv40dsp_init_arm.c | 8
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/libavcodec/arm/rv40dsp_init_arm.c
b/libavcodec/arm
This was missed in db54426975e124e98e5130ad01316cb7afd60630.
Signed-off-by: Martin Storsjö
---
In practice, ptrdiff_t and int are the same type on arm, so these
didn't cause any warnings and haven't been caught due to that.
---
libavcodec/arm/vc1dsp_init_neon.c | 12 ++--
1 file changed
On Wed, 17 Aug 2022, Andreas Rheinhardt wrote:
Since d69d12a5b9236b9d2f1fd247ea452f84cdd1aaf9 these av_assert2()
(or more exactly, the ones in hadamard8_diff8x8_c() and
hadamard8_intra8x8_c()) are hit. So just remove all of these asserts.
(If the test were improved to know which functions
Don't stop directly at the first differing pixel, but find the
one that differs by more than the expected accuracy.
Also print the failing value in check_yuv2yuvX.
Signed-off-by: Martin Storsjö
---
tests/checkasm/sw_scale.c | 14 ++
1 file changed, 10 insertions(+), 4 deletions
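The comparison strategy described in this message can be sketched as follows. This is a hypothetical helper, not the code in tests/checkasm/sw_scale.c: it skips elements that differ only within the allowed accuracy and reports the first one that really is out of tolerance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns the index of the first element whose absolute difference exceeds
 * `accuracy`, or -1 if every element is within tolerance. */
static ptrdiff_t first_real_mismatch(const uint8_t *ref, const uint8_t *test,
                                     size_t n, int accuracy)
{
    assert(accuracy >= 0);
    for (size_t i = 0; i < n; i++) {
        if (abs(ref[i] - test[i]) > accuracy)
            return (ptrdiff_t)i;
    }
    return -1;
}
```

Reporting this index, rather than the first position where the buffers differ at all, points a failure message at the pixel that actually violates the expected accuracy.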
more realistic output pixel
values, instead of having essentially all pixels clipped to either
0 or 255.
Signed-off-by: Martin Storsjö
---
tests/checkasm/sw_scale.c | 8
1 file changed, 8 insertions(+)
diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index d72506ed86
On Wed, 17 Aug 2022, Ronald S. Bultje wrote:
On Wed, Aug 17, 2022 at 4:32 PM Martin Storsjö wrote:
This avoids overflows on some inputs in the x86 case, where the
assembly version would clip/overflow differently from the
C reference function.
This doesn't seem
This avoids triggering overflows in the filters, and avoids stray
test failures in the approximate functions on x86; due to rounding
differences, one implementation might overflow while another one
doesn't.
Signed-off-by: Martin Storsjö
---
FWIW, this modification runs successfully with over