Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Thanks James for spotting this. I have sent two patches fixing the valgrind error from checkasm and the unchecked av_mallocs. I do not believe that the two remaining valgrind errors come from my patch, although I may be mistaken. Using git bisect, I have identified b94cd55155d8c061f1e1faca9076afe540149c27 as the problematic commit.

On Thu, Feb 18, 2021 at 11:23 PM James Almer wrote:
> On 2/17/2021 5:24 PM, Paul B Mahol wrote:
> > On Tue, Feb 16, 2021 at 6:31 PM Alan Kelly <alankelly-at-google@ffmpeg.org> wrote:
> >> Looks like there are no comments, is this OK to be applied? Thanks
> >
> > Applied, thanks for pinging.
>
> Valgrind complains about this change. The checkasm test specifically.
> http://fate.ffmpeg.org/report.cgi?time=20210218014903=x86_64-archlinux-gcc-valgrind
>
> I also noticed it has a bunch of unchecked av_mallocs().

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On 2/17/2021 5:24 PM, Paul B Mahol wrote:
> On Tue, Feb 16, 2021 at 6:31 PM Alan Kelly <alankelly-at-google@ffmpeg.org> wrote:
> > Looks like there are no comments, is this OK to be applied? Thanks
>
> Applied, thanks for pinging.

Valgrind complains about this change. The checkasm test specifically.
http://fate.ffmpeg.org/report.cgi?time=20210218014903=x86_64-archlinux-gcc-valgrind

I also noticed it has a bunch of unchecked av_mallocs().
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On Tue, Feb 16, 2021 at 6:31 PM Alan Kelly <alankelly-at-google@ffmpeg.org> wrote:
> Looks like there are no comments, is this OK to be applied? Thanks

Applied, thanks for pinging.

> On Tue, Feb 9, 2021 at 6:25 PM Paul B Mahol wrote:
> > Will apply if no comments.
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Looks like there are no comments, is this OK to be applied? Thanks

On Tue, Feb 9, 2021 at 6:25 PM Paul B Mahol wrote:
> Will apply if no comments.
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Will apply if no comments.
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Ping!

On Thu, Jan 14, 2021 at 3:47 PM Alan Kelly wrote:
> ---
> Replaces cpuflag(mmx) with notcpuflag(sse3) for store macro
> Tests for multiple sizes in checkasm-sw_scale
> checkasm-sw_scale aligns memory on 8 bytes instead of 32 to catch aligned loads
>
>  libswscale/x86/Makefile           |   1 +
>  libswscale/x86/swscale.c          | 130
>  libswscale/x86/swscale_template.c |  82 --
>  libswscale/x86/yuv2yuvX.asm       | 136 ++
>  tests/checkasm/sw_scale.c         | 103 ++
>  5 files changed, 294 insertions(+), 158 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm
>
> [...]
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
---
Replaces cpuflag(mmx) with notcpuflag(sse3) for store macro
Tests for multiple sizes in checkasm-sw_scale
checkasm-sw_scale aligns memory on 8 bytes instead of 32 to catch aligned loads

 libswscale/x86/Makefile           |   1 +
 libswscale/x86/swscale.c          | 130
 libswscale/x86/swscale_template.c |  82 --
 libswscale/x86/yuv2yuvX.asm       | 136 ++
 tests/checkasm/sw_scale.c         | 103 ++
 5 files changed, 294 insertions(+), 158 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \
                x86/scale.o \
                x86/rgb_2_rgb.o \
                x86/yuv_2_rgb.o \
+               x86/yuv2yuvX.o \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 15c0b22f20..3df193a067 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -63,6 +63,16 @@ DECLARE_ASM_ALIGNED(8, const uint64_t, ff_bgr2UVOffset) = 0x8080808080808080ULL;
 DECLARE_ASM_ALIGNED(8, const uint64_t, ff_w) = 0x0001000100010001ULL;

+#define YUV2YUVX_FUNC_DECL(opt) \
+static void yuv2yuvX_ ##opt(const int16_t *filter, int filterSize, const int16_t **src, \
+                            uint8_t *dest, int dstW, \
+                            const uint8_t *dither, int offset); \
+
+YUV2YUVX_FUNC_DECL(mmx)
+YUV2YUVX_FUNC_DECL(mmxext)
+YUV2YUVX_FUNC_DECL(sse3)
+YUV2YUVX_FUNC_DECL(avx2)
+
 //MMX versions
 #if HAVE_MMX_INLINE
 #undef RENAME
@@ -198,81 +208,44 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }

 #if HAVE_MMXEXT
-static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
-                          const int16_t **src, uint8_t *dest, int dstW,
-                          const uint8_t *dither, int offset)
-{
-    if(((uintptr_t)dest) & 15){
-        yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
-        return;
-    }
-    filterSize--;
-#define MAIN_FUNCTION \
-        "pxor %%xmm0, %%xmm0 \n\t" \
-        "punpcklbw %%xmm0, %%xmm3 \n\t" \
-        "movd %4, %%xmm1 \n\t" \
-        "punpcklwd %%xmm1, %%xmm1 \n\t" \
-        "punpckldq %%xmm1, %%xmm1 \n\t" \
-        "punpcklqdq %%xmm1, %%xmm1 \n\t" \
-        "psllw $3, %%xmm1 \n\t" \
-        "paddw %%xmm1, %%xmm3 \n\t" \
-        "psraw $4, %%xmm3 \n\t" \
-        "movdqa %%xmm3, %%xmm4 \n\t" \
-        "movdqa %%xmm3, %%xmm7 \n\t" \
-        "movl %3, %%ecx \n\t" \
-        "mov %0, %%"FF_REG_d" \n\t"\
-        "mov (%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-        ".p2align 4 \n\t" /* FIXME Unroll? */\
-        "1: \n\t"\
-        "movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\
-        "movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\
-        "movdqa 16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\
-        "add $16, %%"FF_REG_d" \n\t"\
-        "mov (%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-        "test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-        "pmulhw %%xmm0, %%xmm2 \n\t"\
-        "pmulhw %%xmm0, %%xmm5 \n\t"\
-        "paddw %%xmm2, %%xmm3 \n\t"\
-        "paddw %%xmm5, %%xmm4 \n\t"\
-        " jnz 1b \n\t"\
-        "psraw $3, %%xmm3 \n\t"\
-        "psraw $3, %%xmm4 \n\t"\
-        "packuswb %%xmm4, %%xmm3 \n\t"\
-        "movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-        "add $16, %%"FF_REG_c" \n\t"\
-        "cmp %2, %%"FF_REG_c" \n\t"\
-        "movdqa %%xmm7, %%xmm3 \n\t" \
-        "movdqa %%xmm7, %%xmm4 \n\t" \
-        "mov %0, %%"FF_REG_d" \n\t"\
-        "mov (%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-        "jb 1b \n\t"
-
-    if (offset) {
-        __asm__ volatile(
-            "movq %5, %%xmm3 \n\t"
-            "movdqa
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Apologies for this: when I added mmx to the yasm file, I added a macro for the stores, selecting mova for mmx and movdqu for the others. if cpuflag(mmx) evaluates to true for all architectures, so I replaced it with if notcpuflag(sse3). The alignment in the checkasm test has been changed from 32 to 8 so that the test catches problems with alignment.

On Thu, Jan 14, 2021 at 1:11 AM Michael Niedermayer wrote:
> On Mon, Jan 11, 2021 at 05:46:31PM +0100, Alan Kelly wrote:
> > ---
> > Fixes a bug where, if there is no offset and a tail which is not processed
> > by the sse3/avx2 version, the dither is modified
> > Deletes mmx/mmxext yuv2yuvX version from swscale_template and adds it
> > to yuv2yuvX.asm to reduce code duplication and so that it may be used
> > to process the tail from the larger cardinal simd versions.
> > src argument of yuv2yuvX_* is now srcOffset, so that tails and offsets
> > are accounted for correctly.
> > Changes input size in checkasm so that this corner case is tested.
> >
> >  libswscale/x86/Makefile           |   1 +
> >  libswscale/x86/swscale.c          | 130
> >  libswscale/x86/swscale_template.c |  82 --
> >  libswscale/x86/yuv2yuvX.asm       | 136 ++
> >  tests/checkasm/sw_scale.c         | 100 ++
> >  5 files changed, 291 insertions(+), 158 deletions(-)
> >  create mode 100644 libswscale/x86/yuv2yuvX.asm
>
> This seems to be crashing again unless i messed up testing
>
> (gdb) disassemble $rip-32,$rip+32
> Dump of assembler code from 0x55572f02 to 0x55572f42:
>    0x55572f02 : int    $0x71
>    0x55572f04 : out    %al,$0x3
>    0x55572f06 : vpsraw $0x3,%ymm1,%ymm1
>    0x55572f0b : vpackuswb %ymm4,%ymm3,%ymm3
>    0x55572f0f : vpackuswb %ymm1,%ymm6,%ymm6
>    0x55572f13 : mov    (%rdi),%rdx
>    0x55572f16 : vpermq $0xd8,%ymm3,%ymm3
>    0x55572f1c : vpermq $0xd8,%ymm6,%ymm6
> => 0x55572f22 : vmovdqa %ymm3,(%rcx,%rax,1)
>    0x55572f27 : vmovdqa %ymm6,0x20(%rcx,%rax,1)
>    0x55572f2d : add    $0x40,%rax
>    0x55572f31 : mov    %rdi,%rsi
>    0x55572f34 : cmp    %r8,%rax
>    0x55572f37 : jb     0x55572eae
>    0x55572f3d : vzeroupper
>    0x55572f40 : retq
>    0x55572f41 : nopw   %cs:0x0(%rax,%rax,1)
>
> rax 0x0 0
> rbx 0x30 48
> rcx 0x5583f470 93824995292272
> rdx 0x5585e500 93824995419392
>
> #0 0x55572f22 in ff_yuv2yuvX_avx2 ()
> #1 0x555724ee in yuv2yuvX_avx2 ()
> #2 0x5556b4f6 in chr_planar_vscale ()
> #3 0x55566d41 in swscale ()
> #4 0x55568284 in sws_scale ()
>
> [...]
> --
> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> What does censorship reveal? It reveals fear. -- Julian Assange
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On Mon, Jan 11, 2021 at 05:46:31PM +0100, Alan Kelly wrote:
> ---
> Fixes a bug where, if there is no offset and a tail which is not processed
> by the sse3/avx2 version, the dither is modified
> Deletes mmx/mmxext yuv2yuvX version from swscale_template and adds it
> to yuv2yuvX.asm to reduce code duplication and so that it may be used
> to process the tail from the larger cardinal simd versions.
> src argument of yuv2yuvX_* is now srcOffset, so that tails and offsets
> are accounted for correctly.
> Changes input size in checkasm so that this corner case is tested.
>
>  libswscale/x86/Makefile           |   1 +
>  libswscale/x86/swscale.c          | 130
>  libswscale/x86/swscale_template.c |  82 --
>  libswscale/x86/yuv2yuvX.asm       | 136 ++
>  tests/checkasm/sw_scale.c         | 100 ++
>  5 files changed, 291 insertions(+), 158 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm

This seems to be crashing again unless i messed up testing

(gdb) disassemble $rip-32,$rip+32
Dump of assembler code from 0x55572f02 to 0x55572f42:
   0x55572f02 : int    $0x71
   0x55572f04 : out    %al,$0x3
   0x55572f06 : vpsraw $0x3,%ymm1,%ymm1
   0x55572f0b : vpackuswb %ymm4,%ymm3,%ymm3
   0x55572f0f : vpackuswb %ymm1,%ymm6,%ymm6
   0x55572f13 : mov    (%rdi),%rdx
   0x55572f16 : vpermq $0xd8,%ymm3,%ymm3
   0x55572f1c : vpermq $0xd8,%ymm6,%ymm6
=> 0x55572f22 : vmovdqa %ymm3,(%rcx,%rax,1)
   0x55572f27 : vmovdqa %ymm6,0x20(%rcx,%rax,1)
   0x55572f2d : add    $0x40,%rax
   0x55572f31 : mov    %rdi,%rsi
   0x55572f34 : cmp    %r8,%rax
   0x55572f37 : jb     0x55572eae
   0x55572f3d : vzeroupper
   0x55572f40 : retq
   0x55572f41 : nopw   %cs:0x0(%rax,%rax,1)

rax 0x0 0
rbx 0x30 48
rcx 0x5583f470 93824995292272
rdx 0x5585e500 93824995419392

#0 0x55572f22 in ff_yuv2yuvX_avx2 ()
#1 0x555724ee in yuv2yuvX_avx2 ()
#2 0x5556b4f6 in chr_planar_vscale ()
#3 0x55566d41 in swscale ()
#4 0x55568284 in sws_scale ()

[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

What does censorship reveal? It reveals fear. -- Julian Assange
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
---
Fixes a bug where, if there is no offset and a tail which is not processed
by the sse3/avx2 version, the dither is modified
Deletes mmx/mmxext yuv2yuvX version from swscale_template and adds it
to yuv2yuvX.asm to reduce code duplication and so that it may be used
to process the tail from the larger cardinal simd versions.
src argument of yuv2yuvX_* is now srcOffset, so that tails and offsets
are accounted for correctly.
Changes input size in checkasm so that this corner case is tested.

 libswscale/x86/Makefile           |   1 +
 libswscale/x86/swscale.c          | 130
 libswscale/x86/swscale_template.c |  82 --
 libswscale/x86/yuv2yuvX.asm       | 136 ++
 tests/checkasm/sw_scale.c         | 100 ++
 5 files changed, 291 insertions(+), 158 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \
                x86/scale.o \
                x86/rgb_2_rgb.o \
                x86/yuv_2_rgb.o \
+               x86/yuv2yuvX.o \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 15c0b22f20..3df193a067 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -63,6 +63,16 @@ DECLARE_ASM_ALIGNED(8, const uint64_t, ff_bgr2UVOffset) = 0x8080808080808080ULL;
 DECLARE_ASM_ALIGNED(8, const uint64_t, ff_w) = 0x0001000100010001ULL;

+#define YUV2YUVX_FUNC_DECL(opt) \
+static void yuv2yuvX_ ##opt(const int16_t *filter, int filterSize, const int16_t **src, \
+                            uint8_t *dest, int dstW, \
+                            const uint8_t *dither, int offset); \
+
+YUV2YUVX_FUNC_DECL(mmx)
+YUV2YUVX_FUNC_DECL(mmxext)
+YUV2YUVX_FUNC_DECL(sse3)
+YUV2YUVX_FUNC_DECL(avx2)
+
 //MMX versions
 #if HAVE_MMX_INLINE
 #undef RENAME
@@ -198,81 +208,44 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }

 #if HAVE_MMXEXT
-static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
-                          const int16_t **src, uint8_t *dest, int dstW,
-                          const uint8_t *dither, int offset)
-{
-    if(((uintptr_t)dest) & 15){
-        yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
-        return;
-    }
-    filterSize--;
-#define MAIN_FUNCTION \
-        "pxor %%xmm0, %%xmm0 \n\t" \
-        "punpcklbw %%xmm0, %%xmm3 \n\t" \
-        "movd %4, %%xmm1 \n\t" \
-        "punpcklwd %%xmm1, %%xmm1 \n\t" \
-        "punpckldq %%xmm1, %%xmm1 \n\t" \
-        "punpcklqdq %%xmm1, %%xmm1 \n\t" \
-        "psllw $3, %%xmm1 \n\t" \
-        "paddw %%xmm1, %%xmm3 \n\t" \
-        "psraw $4, %%xmm3 \n\t" \
-        "movdqa %%xmm3, %%xmm4 \n\t" \
-        "movdqa %%xmm3, %%xmm7 \n\t" \
-        "movl %3, %%ecx \n\t" \
-        "mov %0, %%"FF_REG_d" \n\t"\
-        "mov (%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-        ".p2align 4 \n\t" /* FIXME Unroll? */\
-        "1: \n\t"\
-        "movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\
-        "movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\
-        "movdqa 16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\
-        "add $16, %%"FF_REG_d" \n\t"\
-        "mov (%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-        "test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-        "pmulhw %%xmm0, %%xmm2 \n\t"\
-        "pmulhw %%xmm0, %%xmm5 \n\t"\
-        "paddw %%xmm2, %%xmm3 \n\t"\
-        "paddw %%xmm5, %%xmm4 \n\t"\
-        " jnz 1b \n\t"\
-        "psraw $3, %%xmm3 \n\t"\
-        "psraw $3, %%xmm4 \n\t"\
-        "packuswb %%xmm4, %%xmm3 \n\t"\
-        "movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-        "add $16, %%"FF_REG_c" \n\t"\
-        "cmp %2, %%"FF_REG_c" \n\t"\
-        "movdqa %%xmm7, %%xmm3 \n\t" \
-        "movdqa %%xmm7, %%xmm4 \n\t" \
-        "mov
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
It's a bug in the patch. The tail not processed by the sse3/avx2 version is done by the mmx version. I used offset to account for the src pixels already processed; however, dither is modified if offset is not 0. In cases where there is a tail and offset is 0, this bug appears. I am working on a solution.

On Sun, Jan 10, 2021 at 4:26 PM Michael Niedermayer wrote:
> On Thu, Jan 07, 2021 at 10:41:19AM +0100, Alan Kelly wrote:
> > ---
> > Replaces mova with movdqu due to alignment issues
> >
> >  libswscale/x86/Makefile     |   1 +
> >  libswscale/x86/swscale.c    | 106 +---
> >  libswscale/x86/yuv2yuvX.asm | 117
> >  tests/checkasm/sw_scale.c   |  98 ++
> >  4 files changed, 246 insertions(+), 76 deletions(-)
> >  create mode 100644 libswscale/x86/yuv2yuvX.asm
>
> I have one / some ? cases where this changes output
> ./ffmpeg -i utvideo-yuv422p10le_UQY2_crc32-A431CD5F.avi -bitexact avi.avi
>
> i dont know if theres a decoder bug or bug in the patch or something else
>
> -rw-r- 1 michael michael 246218 Jan 10 16:23 avi.avi
> -rw-r- 1 michael michael 245824 Jan 10 16:23 avi-ref.avi
>
> file should be at:
> https://samples.ffmpeg.org/ffmpeg-bugs/trac/ticket4044/
>
> [...]
> --
> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> In a rich man's house there is no place to spit but his face.
> -- Diogenes of Sinope
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On Thu, Jan 07, 2021 at 10:41:19AM +0100, Alan Kelly wrote:
> ---
> Replaces mova with movdqu due to alignment issues
>
>  libswscale/x86/Makefile     |   1 +
>  libswscale/x86/swscale.c    | 106 +---
>  libswscale/x86/yuv2yuvX.asm | 117
>  tests/checkasm/sw_scale.c   |  98 ++
>  4 files changed, 246 insertions(+), 76 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm

I have one / some ? cases where this changes output
./ffmpeg -i utvideo-yuv422p10le_UQY2_crc32-A431CD5F.avi -bitexact avi.avi

i dont know if theres a decoder bug or bug in the patch or something else

-rw-r- 1 michael michael 246218 Jan 10 16:23 avi.avi
-rw-r- 1 michael michael 245824 Jan 10 16:23 avi-ref.avi

file should be at:
https://samples.ffmpeg.org/ffmpeg-bugs/trac/ticket4044/

[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

In a rich man's house there is no place to spit but his face.
-- Diogenes of Sinope
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On Thu, Jan 07, 2021 at 10:39:56AM +0100, Alan Kelly wrote:
> Thanks for your patience with this, I have replaced mova with movdqu - movu
> generated a compile error on ssse3. What system did this crash on?

AMD Ryzen 9 3950X on linux

[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Everything should be made as simple as possible, but not simpler. -- Albert Einstein
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Thanks for your patience with this, I have replaced mova with movdqu - movu generated a compile error on ssse3. What system did this crash on?

On Wed, Jan 6, 2021 at 9:10 PM Michael Niedermayer wrote:
> On Tue, Jan 05, 2021 at 01:31:25PM +0100, Alan Kelly wrote:
> > Ping!
>
> crashes (due to alignment i think)
>
> (gdb) disassemble $rip-32,$rip+32
> Dump of assembler code from 0x555730a1 to 0x555730e1:
>    0x555730a1 : int    $0x71
>    0x555730a3 : out    %al,$0x3
>    0x555730a5 : vpsraw $0x3,%ymm1,%ymm1
>    0x555730aa : vpackuswb %ymm4,%ymm3,%ymm3
>    0x555730ae : vpackuswb %ymm1,%ymm6,%ymm6
>    0x555730b2 : mov    (%rdi),%rdx
>    0x555730b5 : vpermq $0xd8,%ymm3,%ymm3
>    0x555730bb : vpermq $0xd8,%ymm6,%ymm6
> => 0x555730c1 : vmovdqa %ymm3,(%rcx,%rax,1)
>    0x555730c6 : vmovdqa %ymm6,0x20(%rcx,%rax,1)
>    0x555730cc : add    $0x40,%rax
>    0x555730d0 : mov    %rdi,%rsi
>    0x555730d3 : cmp    %r8,%rax
>    0x555730d6 : jb     0x5557304d
>    0x555730dc : vzeroupper
>    0x555730df : retq
>    0x555730e0 : push   %r15
> End of assembler dump.
>
> (gdb) info all-registers
> rax 0x0 0
> rbx 0x0 0
> rcx 0x5583f470 93824995292272
>
> [...]
> --
> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> Modern terrorism, a quick summary: Need oil, start war with country that
> has oil, kill hundread thousand in war. Let country fall into chaos,
> be surprised about raise of fundamantalists. Drop more bombs, kill more
> people, be surprised about them taking revenge and drop even more bombs
> and strip your own citizens of their rights and freedoms. to be continued
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
---
Replaces mova with movdqu due to alignment issues

 libswscale/x86/Makefile     |   1 +
 libswscale/x86/swscale.c    | 106 +---
 libswscale/x86/yuv2yuvX.asm | 117
 tests/checkasm/sw_scale.c   |  98 ++
 4 files changed, 246 insertions(+), 76 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \
                x86/scale.o \
                x86/rgb_2_rgb.o \
                x86/yuv_2_rgb.o \
+               x86/yuv2yuvX.o \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..8cd8713705 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,81 +197,30 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }

 #if HAVE_MMXEXT
-static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
-                          const int16_t **src, uint8_t *dest, int dstW,
-                          const uint8_t *dither, int offset)
-{
-    if(((uintptr_t)dest) & 15){
-        yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
-        return;
-    }
-    filterSize--;
-#define MAIN_FUNCTION \
-        "pxor %%xmm0, %%xmm0 \n\t" \
-        "punpcklbw %%xmm0, %%xmm3 \n\t" \
-        "movd %4, %%xmm1 \n\t" \
-        "punpcklwd %%xmm1, %%xmm1 \n\t" \
-        "punpckldq %%xmm1, %%xmm1 \n\t" \
-        "punpcklqdq %%xmm1, %%xmm1 \n\t" \
-        "psllw $3, %%xmm1 \n\t" \
-        "paddw %%xmm1, %%xmm3 \n\t" \
-        "psraw $4, %%xmm3 \n\t" \
-        "movdqa %%xmm3, %%xmm4 \n\t" \
-        "movdqa %%xmm3, %%xmm7 \n\t" \
-        "movl %3, %%ecx \n\t" \
-        "mov %0, %%"FF_REG_d" \n\t"\
-        "mov (%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-        ".p2align 4 \n\t" /* FIXME Unroll? */\
-        "1: \n\t"\
-        "movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\
-        "movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\
-        "movdqa 16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\
-        "add $16, %%"FF_REG_d" \n\t"\
-        "mov (%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-        "test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-        "pmulhw %%xmm0, %%xmm2 \n\t"\
-        "pmulhw %%xmm0, %%xmm5 \n\t"\
-        "paddw %%xmm2, %%xmm3 \n\t"\
-        "paddw %%xmm5, %%xmm4 \n\t"\
-        " jnz 1b \n\t"\
-        "psraw $3, %%xmm3 \n\t"\
-        "psraw $3, %%xmm4 \n\t"\
-        "packuswb %%xmm4, %%xmm3 \n\t"\
-        "movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-        "add $16, %%"FF_REG_c" \n\t"\
-        "cmp %2, %%"FF_REG_c" \n\t"\
-        "movdqa %%xmm7, %%xmm3 \n\t" \
-        "movdqa %%xmm7, %%xmm4 \n\t" \
-        "mov %0, %%"FF_REG_d" \n\t"\
-        "mov (%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-        "jb 1b \n\t"
-
-    if (offset) {
-        __asm__ volatile(
-            "movq %5, %%xmm3 \n\t"
-            "movdqa %%xmm3, %%xmm4 \n\t"
-            "psrlq $24, %%xmm3 \n\t"
-            "psllq $40, %%xmm4 \n\t"
-            "por %%xmm4, %%xmm3 \n\t"
-            MAIN_FUNCTION
-              :: "g" (filter),
-                 "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-                 "m"(filterSize), "m"(((uint64_t *) dither)[0])
-              : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , "%xmm5" , "%xmm7" ,)
-                "%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-              );
-    } else {
-        __asm__ volatile(
-            "movq %5, %%xmm3 \n\t"
-            MAIN_FUNCTION
-              :: "g" (filter),
-                 "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-                 "m"(filterSize), "m"(((uint64_t
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On Tue, Jan 05, 2021 at 01:31:25PM +0100, Alan Kelly wrote:
> Ping!

crashes (due to alignment i think)

(gdb) disassemble $rip-32,$rip+32
Dump of assembler code from 0x555730a1 to 0x555730e1:
   0x555730a1 : int    $0x71
   0x555730a3 : out    %al,$0x3
   0x555730a5 : vpsraw $0x3,%ymm1,%ymm1
   0x555730aa : vpackuswb %ymm4,%ymm3,%ymm3
   0x555730ae : vpackuswb %ymm1,%ymm6,%ymm6
   0x555730b2 : mov    (%rdi),%rdx
   0x555730b5 : vpermq $0xd8,%ymm3,%ymm3
   0x555730bb : vpermq $0xd8,%ymm6,%ymm6
=> 0x555730c1 : vmovdqa %ymm3,(%rcx,%rax,1)
   0x555730c6 : vmovdqa %ymm6,0x20(%rcx,%rax,1)
   0x555730cc : add    $0x40,%rax
   0x555730d0 : mov    %rdi,%rsi
   0x555730d3 : cmp    %r8,%rax
   0x555730d6 : jb     0x5557304d
   0x555730dc : vzeroupper
   0x555730df : retq
   0x555730e0 : push   %r15
End of assembler dump.

(gdb) info all-registers
rax 0x0 0
rbx 0x0 0
rcx 0x5583f470 93824995292272

[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Modern terrorism, a quick summary: Need oil, start war with country that
has oil, kill hundread thousand in war. Let country fall into chaos,
be surprised about raise of fundamantalists. Drop more bombs, kill more
people, be surprised about them taking revenge and drop even more bombs
and strip your own citizens of their rights and freedoms. to be continued
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Ping!

On Thu, Dec 17, 2020 at 11:42 AM Alan Kelly wrote:
> ---
> Fixes memory alignment problem in checkasm-sw_scale
> Tested on Linux 32 and 64 bit and mingw32
>
>  libswscale/x86/Makefile     |   1 +
>  libswscale/x86/swscale.c    | 106 +---
>  libswscale/x86/yuv2yuvX.asm | 117
>  tests/checkasm/sw_scale.c   |  98 ++
>  4 files changed, 246 insertions(+), 76 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm
>
> [...]
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
--- Fixes memory alignment problem in checkasm-sw_scale Tested on Linux 32 and 64 bit and mingw32 libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 106 +--- libswscale/x86/yuv2yuvX.asm | 117 tests/checkasm/sw_scale.c | 98 ++ 4 files changed, 246 insertions(+), 76 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile index 831d5359aa..bfe383364e 100644 --- a/libswscale/x86/Makefile +++ b/libswscale/x86/Makefile @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \ x86/scale.o \ x86/rgb_2_rgb.o \ x86/yuv_2_rgb.o \ + x86/yuv2yuvX.o \ diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c index 3160fedf04..8cd8713705 100644 --- a/libswscale/x86/swscale.c +++ b/libswscale/x86/swscale.c @@ -197,81 +197,30 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY) } #if HAVE_MMXEXT -static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, - const int16_t **src, uint8_t *dest, int dstW, - const uint8_t *dither, int offset) -{ -if(((uintptr_t)dest) & 15){ -yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset); -return; -} -filterSize--; -#define MAIN_FUNCTION \ -"pxor %%xmm0, %%xmm0 \n\t" \ -"punpcklbw %%xmm0, %%xmm3 \n\t" \ -"movd %4, %%xmm1 \n\t" \ -"punpcklwd %%xmm1, %%xmm1 \n\t" \ -"punpckldq %%xmm1, %%xmm1 \n\t" \ -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ -"psllw $3, %%xmm1 \n\t" \ -"paddw %%xmm1, %%xmm3 \n\t" \ -"psraw $4, %%xmm3 \n\t" \ -"movdqa %%xmm3, %%xmm4 \n\t" \ -"movdqa %%xmm3, %%xmm7 \n\t" \ -"movl %3, %%ecx \n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -".p2align 4 \n\t" /* FIXME Unroll? 
*/\ -"1: \n\t"\ -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\ -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\ -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\ -"add$16, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\ -"pmulhw %%xmm0, %%xmm2 \n\t"\ -"pmulhw %%xmm0, %%xmm5 \n\t"\ -"paddw%%xmm2, %%xmm3 \n\t"\ -"paddw%%xmm5, %%xmm4 \n\t"\ -" jnz1b \n\t"\ -"psraw $3, %%xmm3 \n\t"\ -"psraw $3, %%xmm4 \n\t"\ -"packuswb %%xmm4, %%xmm3 \n\t"\ -"movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\ -"add $16, %%"FF_REG_c"\n\t"\ -"cmp %2, %%"FF_REG_c"\n\t"\ -"movdqa %%xmm7, %%xmm3\n\t" \ -"movdqa %%xmm7, %%xmm4\n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"jb 1b \n\t" - -if (offset) { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -"movdqa%%xmm3, %%xmm4 \n\t" -"psrlq$24, %%xmm3 \n\t" -"psllq$40, %%xmm4 \n\t" -"por %%xmm4, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), - "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset), - "m"(filterSize), "m"(((uint64_t *) dither)[0]) - : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , "%xmm5" , "%xmm7" ,) -"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c - ); -} else { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), - "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset), -
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On Thu, Dec 10, 2020 at 04:46:26PM +0100, Alan Kelly wrote: > --- > Replaces ff_sws_init_swscale_x86 with ff_getSwsFunc > Load offset if not gprsize but 8 on both 32 and 64 bit > Removes sfence as NT store no longer used > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 106 +--- > libswscale/x86/yuv2yuvX.asm | 117 > tests/checkasm/sw_scale.c | 101 ++- > 4 files changed, 248 insertions(+), 77 deletions(-) > create mode 100644 libswscale/x86/yuv2yuvX.asm breaks fate on mingw32 make fate-checkasm-sw_scale TESTcheckasm-sw_scale Test checkasm-sw_scale failed. Look at tests/data/fate/checkasm-sw_scale.err for details. src/tests/Makefile:255: recipe for target 'fate-checkasm-sw_scale' failed make: *** [fate-checkasm-sw_scale] Error 5 [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB Take away the freedom of one citizen and you will be jailed, take away the freedom of all citizens and you will be congratulated by your peers in Parliament. signature.asc Description: PGP signature ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
--- Replaces ff_sws_init_swscale_x86 with ff_getSwsFunc Load offset if not gprsize but 8 on both 32 and 64 bit Removes sfence as NT store no longer used libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 106 +--- libswscale/x86/yuv2yuvX.asm | 117 tests/checkasm/sw_scale.c | 101 ++- 4 files changed, 248 insertions(+), 77 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile index 831d5359aa..bfe383364e 100644 --- a/libswscale/x86/Makefile +++ b/libswscale/x86/Makefile @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \ x86/scale.o \ x86/rgb_2_rgb.o \ x86/yuv_2_rgb.o \ + x86/yuv2yuvX.o \ diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c index 3160fedf04..8cd8713705 100644 --- a/libswscale/x86/swscale.c +++ b/libswscale/x86/swscale.c @@ -197,81 +197,30 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY) } #if HAVE_MMXEXT -static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, - const int16_t **src, uint8_t *dest, int dstW, - const uint8_t *dither, int offset) -{ -if(((uintptr_t)dest) & 15){ -yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset); -return; -} -filterSize--; -#define MAIN_FUNCTION \ -"pxor %%xmm0, %%xmm0 \n\t" \ -"punpcklbw %%xmm0, %%xmm3 \n\t" \ -"movd %4, %%xmm1 \n\t" \ -"punpcklwd %%xmm1, %%xmm1 \n\t" \ -"punpckldq %%xmm1, %%xmm1 \n\t" \ -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ -"psllw $3, %%xmm1 \n\t" \ -"paddw %%xmm1, %%xmm3 \n\t" \ -"psraw $4, %%xmm3 \n\t" \ -"movdqa %%xmm3, %%xmm4 \n\t" \ -"movdqa %%xmm3, %%xmm7 \n\t" \ -"movl %3, %%ecx \n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -".p2align 4 \n\t" /* FIXME Unroll? 
*/\ -"1: \n\t"\ -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\ -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\ -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\ -"add$16, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\ -"pmulhw %%xmm0, %%xmm2 \n\t"\ -"pmulhw %%xmm0, %%xmm5 \n\t"\ -"paddw%%xmm2, %%xmm3 \n\t"\ -"paddw%%xmm5, %%xmm4 \n\t"\ -" jnz1b \n\t"\ -"psraw $3, %%xmm3 \n\t"\ -"psraw $3, %%xmm4 \n\t"\ -"packuswb %%xmm4, %%xmm3 \n\t"\ -"movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\ -"add $16, %%"FF_REG_c"\n\t"\ -"cmp %2, %%"FF_REG_c"\n\t"\ -"movdqa %%xmm7, %%xmm3\n\t" \ -"movdqa %%xmm7, %%xmm4\n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"jb 1b \n\t" - -if (offset) { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -"movdqa%%xmm3, %%xmm4 \n\t" -"psrlq$24, %%xmm3 \n\t" -"psllq$40, %%xmm4 \n\t" -"por %%xmm4, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), - "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset), - "m"(filterSize), "m"(((uint64_t *) dither)[0]) - : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , "%xmm5" , "%xmm7" ,) -"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c - ); -} else { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), - "r"
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On 2020/12/09 11:19, Alan Kelly wrote: --- Activates avx2 version of yuv2yuvX Adds checkasm for yuv2yuvX Modifies ff_yuv2yuvX_* signature to match yuv2yuvX_* Replaces non-temporal stores with temporal stores libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 106 +--- libswscale/x86/yuv2yuvX.asm | 118 tests/checkasm/sw_scale.c | 101 +- 4 files changed, 249 insertions(+), 77 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm [...] diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c index 9efa2b4def..7009169361 100644 --- a/tests/checkasm/sw_scale.c +++ b/tests/checkasm/sw_scale.c [...] +static void check_yuv2yuvX(void) +{ +struct SwsContext *ctx; +int fsi, osi; +#define LARGEST_FILTER 8 +#define FILTER_SIZES 4 +static const int filter_sizes[FILTER_SIZES] = {1, 4, 8, 16}; + +declare_func_emms(AV_CPU_FLAG_MMX, void, const int16_t *filter, + int filterSize, const int16_t **src, uint8_t *dest, + int dstW, const uint8_t *dither, int offset); + +int dstW = SRC_PIXELS; +const int16_t **src; +LOCAL_ALIGNED_32(int16_t, filter_coeff, [LARGEST_FILTER]); +LOCAL_ALIGNED_32(uint8_t, dst0, [SRC_PIXELS]); +LOCAL_ALIGNED_32(uint8_t, dst1, [SRC_PIXELS]); +LOCAL_ALIGNED_32(uint8_t, dither, [SRC_PIXELS]); +union VFilterData{ +const int16_t *src; +uint16_t coeff[8]; +} *vFilterData; +uint8_t d_val = rnd(); +randomize_buffers(filter_coeff, LARGEST_FILTER); +ctx = sws_alloc_context(); +if (sws_init_context(ctx, NULL, NULL) < 0) +fail(); + +ff_sws_init_swscale_x86(ctx); This should be ff_getSwsFunc() instead. +for(int i = 0; i < SRC_PIXELS; ++i){ +dither[i] = d_val; +} [...] -- Josh ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
This function is tested by fate-filter-fps-r. I have also added a checkasm test and bench. I have done a lot more testing and benching of this code and I am now happy to activate the avx2 version because the performance is so good. On my machine I get the following results for filter size 4 and 0 offset. For all other sizes/offsets the results are similar: yuv2yuvX_4_0_mmx: 1567.2 1563.1 yuv2yuvX_4_0_mmxext: 1560.7 1560.1 yuv2yuvX_4_0_sse3: 780.7 572.1 -26.7% yuv2yuvX_4_0_avx2: n/a 341.1 -56.3% Interestingly I discovered that the non-temporal store movntdq results in a very large variability in the test results, in many cases it significantly increases the execution time. I have replaced these stores with aligned stores which stabilised the runtimes. However, I am aware that benchmarks often don't represent reality and these non-temporal stores were probably used for a good reason. If you think it better to use NT stores, I will replace them. On Fri, Dec 4, 2020 at 2:00 PM Anton Khirnov wrote: > Quoting Alan Kelly (2020-11-19 09:41:56) > > --- > > All of Henrik's suggestions have been implemented. Additionally, > > m3 and m6 are permuted in avx2 before storing to ensure bit by bit > > identical results in avx2. > > libswscale/x86/Makefile | 1 + > > libswscale/x86/swscale.c| 75 +++ > > libswscale/x86/yuv2yuvX.asm | 118 > > 3 files changed, 129 insertions(+), 65 deletions(-) > > create mode 100644 libswscale/x86/yuv2yuvX.asm > > Is this function tested by FATE? > I did some brief testing and apparently it gets called during > fate-filter-shuffleplanes-dup-luma, but the results do not change even > if I comment out the whole function. > > Also, it seems like you are adding an AVX2 version of the function, but > I don't see it being used. 
> > -- > Anton Khirnov > ___ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe". ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
--- Activates avx2 version of yuv2yuvX Adds checkasm for yuv2yuvX Modifies ff_yuv2yuvX_* signature to match yuv2yuvX_* Replaces non-temporal stores with temporal stores libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 106 +--- libswscale/x86/yuv2yuvX.asm | 118 tests/checkasm/sw_scale.c | 101 +- 4 files changed, 249 insertions(+), 77 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile index 831d5359aa..bfe383364e 100644 --- a/libswscale/x86/Makefile +++ b/libswscale/x86/Makefile @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \ x86/scale.o \ x86/rgb_2_rgb.o \ x86/yuv_2_rgb.o \ + x86/yuv2yuvX.o \ diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c index 3160fedf04..8cd8713705 100644 --- a/libswscale/x86/swscale.c +++ b/libswscale/x86/swscale.c @@ -197,81 +197,30 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY) } #if HAVE_MMXEXT -static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, - const int16_t **src, uint8_t *dest, int dstW, - const uint8_t *dither, int offset) -{ -if(((uintptr_t)dest) & 15){ -yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset); -return; -} -filterSize--; -#define MAIN_FUNCTION \ -"pxor %%xmm0, %%xmm0 \n\t" \ -"punpcklbw %%xmm0, %%xmm3 \n\t" \ -"movd %4, %%xmm1 \n\t" \ -"punpcklwd %%xmm1, %%xmm1 \n\t" \ -"punpckldq %%xmm1, %%xmm1 \n\t" \ -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ -"psllw $3, %%xmm1 \n\t" \ -"paddw %%xmm1, %%xmm3 \n\t" \ -"psraw $4, %%xmm3 \n\t" \ -"movdqa %%xmm3, %%xmm4 \n\t" \ -"movdqa %%xmm3, %%xmm7 \n\t" \ -"movl %3, %%ecx \n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -".p2align 4 \n\t" /* FIXME Unroll? 
*/\ -"1: \n\t"\ -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\ -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\ -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\ -"add$16, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\ -"pmulhw %%xmm0, %%xmm2 \n\t"\ -"pmulhw %%xmm0, %%xmm5 \n\t"\ -"paddw%%xmm2, %%xmm3 \n\t"\ -"paddw%%xmm5, %%xmm4 \n\t"\ -" jnz1b \n\t"\ -"psraw $3, %%xmm3 \n\t"\ -"psraw $3, %%xmm4 \n\t"\ -"packuswb %%xmm4, %%xmm3 \n\t"\ -"movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\ -"add $16, %%"FF_REG_c"\n\t"\ -"cmp %2, %%"FF_REG_c"\n\t"\ -"movdqa %%xmm7, %%xmm3\n\t" \ -"movdqa %%xmm7, %%xmm4\n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"jb 1b \n\t" - -if (offset) { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -"movdqa%%xmm3, %%xmm4 \n\t" -"psrlq$24, %%xmm3 \n\t" -"psllq$40, %%xmm4 \n\t" -"por %%xmm4, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), - "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset), - "m"(filterSize), "m"(((uint64_t *) dither)[0]) - : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , "%xmm5" , "%xmm7" ,) -"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c - ); -} else { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), -
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Quoting Alan Kelly (2020-11-19 09:41:56) > --- > All of Henrik's suggestions have been implemented. Additionally, > m3 and m6 are permuted in avx2 before storing to ensure bit by bit > identical results in avx2. > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 75 +++ > libswscale/x86/yuv2yuvX.asm | 118 > 3 files changed, 129 insertions(+), 65 deletions(-) > create mode 100644 libswscale/x86/yuv2yuvX.asm Is this function tested by FATE? I did some brief testing and apparently it gets called during fate-filter-shuffleplanes-dup-luma, but the results do not change even if I comment out the whole function. Also, it seems like you are adding an AVX2 version of the function, but I don't see it being used. -- Anton Khirnov ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Ping On Thu, Nov 19, 2020 at 9:42 AM Alan Kelly wrote: > --- > All of Henrik's suggestions have been implemented. Additionally, > m3 and m6 are permuted in avx2 before storing to ensure bit by bit > identical results in avx2. > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 75 +++ > libswscale/x86/yuv2yuvX.asm | 118 > 3 files changed, 129 insertions(+), 65 deletions(-) > create mode 100644 libswscale/x86/yuv2yuvX.asm > > diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile > index 831d5359aa..bfe383364e 100644 > --- a/libswscale/x86/Makefile > +++ b/libswscale/x86/Makefile > @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o > \ > x86/scale.o \ > x86/rgb_2_rgb.o \ > x86/yuv_2_rgb.o \ > + x86/yuv2yuvX.o \ > diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c > index 3160fedf04..758c8e540f 100644 > --- a/libswscale/x86/swscale.c > +++ b/libswscale/x86/swscale.c > @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int > dstY) > } > > #if HAVE_MMXEXT > +void ff_yuv2yuvX_sse3(const int16_t *filter, long filterSize, > + uint8_t *dest, int dstW, > + const uint8_t *dither, int offset); > + > static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, > const int16_t **src, uint8_t *dest, int dstW, > const uint8_t *dither, int offset) > { > +int remainder = (dstW % 32); > +int pixelsProcessed = dstW - remainder; > if(((uintptr_t)dest) & 15){ > yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, > offset); > return; > } > -filterSize--; > -#define MAIN_FUNCTION \ > -"pxor %%xmm0, %%xmm0 \n\t" \ > -"punpcklbw %%xmm0, %%xmm3 \n\t" \ > -"movd %4, %%xmm1 \n\t" \ > -"punpcklwd %%xmm1, %%xmm1 \n\t" \ > -"punpckldq %%xmm1, %%xmm1 \n\t" \ > -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ > -"psllw $3, %%xmm1 \n\t" \ > -"paddw %%xmm1, %%xmm3 \n\t" \ > -"psraw $4, %%xmm3 \n\t" \ > -"movdqa %%xmm3, %%xmm4 \n\t" \ > -"movdqa %%xmm3, %%xmm7 \n\t" \ > -"movl %3, %%ecx \n\t" \ > -"mov %0, %%"FF_REG_d" > \n\t"\ > -"mov(%%"FF_REG_d"), 
%%"FF_REG_S" > \n\t"\ > -".p2align 4 \n\t" /* > FIXME Unroll? */\ > -"1: \n\t"\ > -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* > filterCoeff */\ > -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 > \n\t" /* srcData */\ > -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 > \n\t" /* srcData */\ > -"add$16, %%"FF_REG_d" > \n\t"\ > -"mov(%%"FF_REG_d"), %%"FF_REG_S" > \n\t"\ > -"test %%"FF_REG_S", %%"FF_REG_S" > \n\t"\ > -"pmulhw %%xmm0, %%xmm2 \n\t"\ > -"pmulhw %%xmm0, %%xmm5 \n\t"\ > -"paddw%%xmm2, %%xmm3 \n\t"\ > -"paddw%%xmm5, %%xmm4 \n\t"\ > -" jnz1b \n\t"\ > -"psraw $3, %%xmm3 \n\t"\ > -"psraw $3, %%xmm4 \n\t"\ > -"packuswb %%xmm4, %%xmm3 \n\t"\ > -"movntdq %%xmm3, (%1, %%"FF_REG_c") > \n\t"\ > -"add $16, %%"FF_REG_c"\n\t"\ > -"cmp %2, %%"FF_REG_c"\n\t"\ > -"movdqa %%xmm7, %%xmm3\n\t" \ > -"movdqa %%xmm7, %%xmm4\n\t" \ > -"mov %0, %%"FF_REG_d" > \n\t"\ > -"mov(%%"FF_REG_d"), %%"FF_REG_S" > \n\t"\ > -"jb 1b \n\t" > - > -if (offset) { > -__asm__ volatile( > -"movq %5, %%xmm3 \n\t" > -"movdqa%%xmm3, %%xmm4 \n\t" > -"psrlq$24, %%xmm3 \n\t" > -"psllq$40, %%xmm4 \n\t" > -"por %%xmm4, %%xmm3 \n\t" > -MAIN_FUNCTION > - :: "g" (filter), > - "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" >
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
--- All of Henrik's suggestions have been implemented. Additionally, m3 and m6 are permuted in avx2 before storing to ensure bit by bit identical results in avx2. libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 +++ libswscale/x86/yuv2yuvX.asm | 118 3 files changed, 129 insertions(+), 65 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile index 831d5359aa..bfe383364e 100644 --- a/libswscale/x86/Makefile +++ b/libswscale/x86/Makefile @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \ x86/scale.o \ x86/rgb_2_rgb.o \ x86/yuv_2_rgb.o \ + x86/yuv2yuvX.o \ diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c index 3160fedf04..758c8e540f 100644 --- a/libswscale/x86/swscale.c +++ b/libswscale/x86/swscale.c @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY) } #if HAVE_MMXEXT +void ff_yuv2yuvX_sse3(const int16_t *filter, long filterSize, + uint8_t *dest, int dstW, + const uint8_t *dither, int offset); + static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, const int16_t **src, uint8_t *dest, int dstW, const uint8_t *dither, int offset) { +int remainder = (dstW % 32); +int pixelsProcessed = dstW - remainder; if(((uintptr_t)dest) & 15){ yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset); return; } -filterSize--; -#define MAIN_FUNCTION \ -"pxor %%xmm0, %%xmm0 \n\t" \ -"punpcklbw %%xmm0, %%xmm3 \n\t" \ -"movd %4, %%xmm1 \n\t" \ -"punpcklwd %%xmm1, %%xmm1 \n\t" \ -"punpckldq %%xmm1, %%xmm1 \n\t" \ -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ -"psllw $3, %%xmm1 \n\t" \ -"paddw %%xmm1, %%xmm3 \n\t" \ -"psraw $4, %%xmm3 \n\t" \ -"movdqa %%xmm3, %%xmm4 \n\t" \ -"movdqa %%xmm3, %%xmm7 \n\t" \ -"movl %3, %%ecx \n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -".p2align 4 \n\t" /* FIXME Unroll? 
*/\ -"1: \n\t"\ -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\ -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\ -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\ -"add$16, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\ -"pmulhw %%xmm0, %%xmm2 \n\t"\ -"pmulhw %%xmm0, %%xmm5 \n\t"\ -"paddw%%xmm2, %%xmm3 \n\t"\ -"paddw%%xmm5, %%xmm4 \n\t"\ -" jnz1b \n\t"\ -"psraw $3, %%xmm3 \n\t"\ -"psraw $3, %%xmm4 \n\t"\ -"packuswb %%xmm4, %%xmm3 \n\t"\ -"movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\ -"add $16, %%"FF_REG_c"\n\t"\ -"cmp %2, %%"FF_REG_c"\n\t"\ -"movdqa %%xmm7, %%xmm3\n\t" \ -"movdqa %%xmm7, %%xmm4\n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"jb 1b \n\t" - -if (offset) { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -"movdqa%%xmm3, %%xmm4 \n\t" -"psrlq$24, %%xmm3 \n\t" -"psllq$40, %%xmm4 \n\t" -"por %%xmm4, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), - "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset), - "m"(filterSize), "m"(((uint64_t *) dither)[0]) - : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , "%xmm5" , "%xmm7" ,) -"%"FF_REG_d, "%"FF_REG_S,
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On Mon, Nov 16, 2020 at 11:03 AM Alan Kelly wrote: > +cglobal yuv2yuvX, 6, 7, 16, filter, filterSize, dest, dstW, dither, offset, > src Only 8 xmm registers are used, so 8 should be used instead of 16 here. Otherwise it causes unnecessary spilling of registers on 64-bit Windows. > +%if ARCH_X86_64 > +%define ptr_size 8 [...] > +%else > +%define ptr_size 4 The predefined variable gprsize already exists for this purpose, so that can be used instead. > +movq xmm3, [ditherq] If vpbroadcastq m3, [ditherq] is used for AVX2 here, then the following > +vperm2i128 m3, m3, m3, 0 instruction can be eliminated. > +punpcklwdm1, m1 > +punpckldqm1, m1 Can be replaced with pshuflw m1, m1, q >+mov srcq, [filterSizeq] >+test srcd, srcd test srcq, srcq should be used here, since the lower 32 bits of a valid pointer could randomly happen to be zero on a 64-bit system. > +REP_RET Since non-temporal stores are being used, this should be replaced with sfence RET to guarantee proper memory ordering semantics in multi-threaded use cases. Things will usually work fine without it, but may potentially break in "fun to debug" ways. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
--- Fixes bug in sse3 path where m1 is not set correctly resulting in off by one errors. The results are now bit by bit identical. libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 libswscale/x86/yuv2yuvX.asm | 114 3 files changed, 125 insertions(+), 65 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile index 831d5359aa..bfe383364e 100644 --- a/libswscale/x86/Makefile +++ b/libswscale/x86/Makefile @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \ x86/scale.o \ x86/rgb_2_rgb.o \ x86/yuv_2_rgb.o \ + x86/yuv2yuvX.o \ diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c index 3160fedf04..758c8e540f 100644 --- a/libswscale/x86/swscale.c +++ b/libswscale/x86/swscale.c @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY) } #if HAVE_MMXEXT +void ff_yuv2yuvX_sse3(const int16_t *filter, long filterSize, + uint8_t *dest, int dstW, + const uint8_t *dither, int offset); + static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, const int16_t **src, uint8_t *dest, int dstW, const uint8_t *dither, int offset) { +int remainder = (dstW % 32); +int pixelsProcessed = dstW - remainder; if(((uintptr_t)dest) & 15){ yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset); return; } -filterSize--; -#define MAIN_FUNCTION \ -"pxor %%xmm0, %%xmm0 \n\t" \ -"punpcklbw %%xmm0, %%xmm3 \n\t" \ -"movd %4, %%xmm1 \n\t" \ -"punpcklwd %%xmm1, %%xmm1 \n\t" \ -"punpckldq %%xmm1, %%xmm1 \n\t" \ -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ -"psllw $3, %%xmm1 \n\t" \ -"paddw %%xmm1, %%xmm3 \n\t" \ -"psraw $4, %%xmm3 \n\t" \ -"movdqa %%xmm3, %%xmm4 \n\t" \ -"movdqa %%xmm3, %%xmm7 \n\t" \ -"movl %3, %%ecx \n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -".p2align 4 \n\t" /* FIXME Unroll? 
*/\ -"1: \n\t"\ -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\ -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\ -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\ -"add$16, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\ -"pmulhw %%xmm0, %%xmm2 \n\t"\ -"pmulhw %%xmm0, %%xmm5 \n\t"\ -"paddw%%xmm2, %%xmm3 \n\t"\ -"paddw%%xmm5, %%xmm4 \n\t"\ -" jnz1b \n\t"\ -"psraw $3, %%xmm3 \n\t"\ -"psraw $3, %%xmm4 \n\t"\ -"packuswb %%xmm4, %%xmm3 \n\t"\ -"movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\ -"add $16, %%"FF_REG_c"\n\t"\ -"cmp %2, %%"FF_REG_c"\n\t"\ -"movdqa %%xmm7, %%xmm3\n\t" \ -"movdqa %%xmm7, %%xmm4\n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"jb 1b \n\t" - -if (offset) { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -"movdqa%%xmm3, %%xmm4 \n\t" -"psrlq$24, %%xmm3 \n\t" -"psllq$40, %%xmm4 \n\t" -"por %%xmm4, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), - "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset), - "m"(filterSize), "m"(((uint64_t *) dither)[0]) - : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , "%xmm5" , "%xmm7" ,) -"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c - ); -}
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On Thu, Nov 12, 2020 at 09:33:18AM +0100, Alan Kelly wrote: > --- > It now works on x86-32 > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 75 > libswscale/x86/yuv2yuvX.asm | 110 > 3 files changed, 121 insertions(+), 65 deletions(-) > create mode 100644 libswscale/x86/yuv2yuvX.asm Is this intended to produce bit by bit identical output ? [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB The greatest way to live with honor in this world is to be what we pretend to be. -- Socrates signature.asc Description: PGP signature ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
--- It now works on x86-32 libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 libswscale/x86/yuv2yuvX.asm | 110 3 files changed, 121 insertions(+), 65 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile index 831d5359aa..bfe383364e 100644 --- a/libswscale/x86/Makefile +++ b/libswscale/x86/Makefile @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \ x86/scale.o \ x86/rgb_2_rgb.o \ x86/yuv_2_rgb.o \ + x86/yuv2yuvX.o \ diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c index 3160fedf04..758c8e540f 100644 --- a/libswscale/x86/swscale.c +++ b/libswscale/x86/swscale.c @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY) } #if HAVE_MMXEXT +void ff_yuv2yuvX_sse3(const int16_t *filter, long filterSize, + uint8_t *dest, int dstW, + const uint8_t *dither, int offset); + static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, const int16_t **src, uint8_t *dest, int dstW, const uint8_t *dither, int offset) { +int remainder = (dstW % 32); +int pixelsProcessed = dstW - remainder; if(((uintptr_t)dest) & 15){ yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset); return; } -filterSize--; -#define MAIN_FUNCTION \ -"pxor %%xmm0, %%xmm0 \n\t" \ -"punpcklbw %%xmm0, %%xmm3 \n\t" \ -"movd %4, %%xmm1 \n\t" \ -"punpcklwd %%xmm1, %%xmm1 \n\t" \ -"punpckldq %%xmm1, %%xmm1 \n\t" \ -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ -"psllw $3, %%xmm1 \n\t" \ -"paddw %%xmm1, %%xmm3 \n\t" \ -"psraw $4, %%xmm3 \n\t" \ -"movdqa %%xmm3, %%xmm4 \n\t" \ -"movdqa %%xmm3, %%xmm7 \n\t" \ -"movl %3, %%ecx \n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -".p2align 4 \n\t" /* FIXME Unroll? 
*/\ -"1: \n\t"\ -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\ -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\ -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\ -"add$16, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\ -"pmulhw %%xmm0, %%xmm2 \n\t"\ -"pmulhw %%xmm0, %%xmm5 \n\t"\ -"paddw%%xmm2, %%xmm3 \n\t"\ -"paddw%%xmm5, %%xmm4 \n\t"\ -" jnz1b \n\t"\ -"psraw $3, %%xmm3 \n\t"\ -"psraw $3, %%xmm4 \n\t"\ -"packuswb %%xmm4, %%xmm3 \n\t"\ -"movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\ -"add $16, %%"FF_REG_c"\n\t"\ -"cmp %2, %%"FF_REG_c"\n\t"\ -"movdqa %%xmm7, %%xmm3\n\t" \ -"movdqa %%xmm7, %%xmm4\n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"jb 1b \n\t" - -if (offset) { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -"movdqa%%xmm3, %%xmm4 \n\t" -"psrlq$24, %%xmm3 \n\t" -"psllq$40, %%xmm4 \n\t" -"por %%xmm4, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), - "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset), - "m"(filterSize), "m"(((uint64_t *) dither)[0]) - : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , "%xmm5" , "%xmm7" ,) -"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c - ); -} else { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On Fri, Nov 6, 2020 at 09:04, Alan Kelly wrote: > > The function was re-written in asm, this code is heavily derived from the > original code, the algorithm remains unchanged, the implementation is > optimized. Would you agree to adding the copyright from swscale.c: > * Copyright (C) 2001-2011 Michael Niedermayer > to this file, having both copyrights? Thank you. No real opinion here but your argumentation sounds solid. Thank you, Carl Eugen ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On Tue, Nov 10, 2020 at 09:43:47AM +0100, Alan Kelly wrote: > --- > yuv2yuvX.asm: Ports yuv2yuvX to asm, unrolls main loop and adds > other small optimizations for ~20% speed-up. Copyright updated to > include the original from swscale.c > swscale.c: Removes yuv2yuvX_sse3 and calls new function ff_yuv2yuvX_sse3. > Calls yuv2yuvX_mmxext on remaining elements if required. > Makefile: Compiles yuv2yuvX.asm > > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 75 > libswscale/x86/yuv2yuvX.asm | 110 > 3 files changed, 121 insertions(+), 65 deletions(-) > create mode 100644 libswscale/x86/yuv2yuvX.asm on x86-32 X86ASM libswscale/x86/yuv2yuvX.o src/libswscale/x86/yuv2yuvX.asm:110: error: invalid combination of opcode and operands src/libswscale/x86/yuv2yuvX.asm:55: ... from macro `YUV2YUVX_FUNC' defined here src//libavutil/x86/x86inc.asm:1395: ... from macro `movd' defined here src//libavutil/x86/x86inc.asm:1263: ... from macro `RUN_AVX_INSTR' defined here /home/michael/ffmpeg-git/ffmpeg/ffbuild/common.mak:89: recipe for target 'libswscale/x86/yuv2yuvX.o' failed make: *** [libswscale/x86/yuv2yuvX.o] Error 1 make: Target 'all' not remade because of errors. [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB "Nothing to hide" only works if the folks in power share the values of you and everyone you know entirely and always will -- Tom Scott
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
--- yuv2yuvX.asm: Ports yuv2yuvX to asm, unrolls main loop and adds other small optimizations for ~20% speed-up. Copyright updated to include the original from swscale.c swscale.c: Removes yuv2yuvX_sse3 and calls new function ff_yuv2yuvX_sse3. Calls yuv2yuvX_mmxext on remainining elements if required. Makefile: Compiles yuv2yuvX.asm libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 libswscale/x86/yuv2yuvX.asm | 110 3 files changed, 121 insertions(+), 65 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile index 831d5359aa..bfe383364e 100644 --- a/libswscale/x86/Makefile +++ b/libswscale/x86/Makefile @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \ x86/scale.o \ x86/rgb_2_rgb.o \ x86/yuv_2_rgb.o \ + x86/yuv2yuvX.o \ diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c index 3160fedf04..fec9fa22e0 100644 --- a/libswscale/x86/swscale.c +++ b/libswscale/x86/swscale.c @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY) } #if HAVE_MMXEXT +void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize, + uint8_t *dest, int dstW, + const uint8_t *dither, int offset); + static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, const int16_t **src, uint8_t *dest, int dstW, const uint8_t *dither, int offset) { +int remainder = (dstW % 32); +int pixelsProcessed = dstW - remainder; if(((uintptr_t)dest) & 15){ yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset); return; } -filterSize--; -#define MAIN_FUNCTION \ -"pxor %%xmm0, %%xmm0 \n\t" \ -"punpcklbw %%xmm0, %%xmm3 \n\t" \ -"movd %4, %%xmm1 \n\t" \ -"punpcklwd %%xmm1, %%xmm1 \n\t" \ -"punpckldq %%xmm1, %%xmm1 \n\t" \ -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ -"psllw $3, %%xmm1 \n\t" \ -"paddw %%xmm1, %%xmm3 \n\t" \ -"psraw $4, %%xmm3 \n\t" \ -"movdqa %%xmm3, %%xmm4 \n\t" \ -"movdqa %%xmm3, %%xmm7 \n\t" \ -"movl %3, %%ecx \n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ 
-".p2align 4 \n\t" /* FIXME Unroll? */\ -"1: \n\t"\ -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\ -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\ -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\ -"add$16, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\ -"pmulhw %%xmm0, %%xmm2 \n\t"\ -"pmulhw %%xmm0, %%xmm5 \n\t"\ -"paddw%%xmm2, %%xmm3 \n\t"\ -"paddw%%xmm5, %%xmm4 \n\t"\ -" jnz1b \n\t"\ -"psraw $3, %%xmm3 \n\t"\ -"psraw $3, %%xmm4 \n\t"\ -"packuswb %%xmm4, %%xmm3 \n\t"\ -"movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\ -"add $16, %%"FF_REG_c"\n\t"\ -"cmp %2, %%"FF_REG_c"\n\t"\ -"movdqa %%xmm7, %%xmm3\n\t" \ -"movdqa %%xmm7, %%xmm4\n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"jb 1b \n\t" - -if (offset) { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -"movdqa%%xmm3, %%xmm4 \n\t" -"psrlq$24, %%xmm3 \n\t" -"psllq$40, %%xmm4 \n\t" -"por %%xmm4, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), - "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset), - "m"(filterSize),
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
The function was re-written in asm, this code is heavily derived from the original code, the algorithm remains unchanged, the implementation is optimized. Would you agree to adding the copyright from swscale.c: * Copyright (C) 2001-2011 Michael Niedermayer to this file, having both copyrights? Thank you. On Sat, Oct 31, 2020 at 1:02 PM Carl Eugen Hoyos wrote: > On Tue, Oct 27, 2020 at 09:56, Alan Kelly wrote: > > > --- /dev/null > > +++ b/libswscale/x86/yuv2yuvX.asm > > @@ -0,0 +1,105 @@ > > > +;** > > +;* x86-optimized yuv2yuvX > > +;* Copyright 2020 Google LLC > > Either the commit message ("move a function") or this > copyright statement is wrong, please fix this. > > Please do not commit as-is... > > Carl Eugen
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
On Tue, Oct 27, 2020 at 09:56, Alan Kelly wrote: > --- /dev/null > +++ b/libswscale/x86/yuv2yuvX.asm > @@ -0,0 +1,105 @@ > +;** > +;* x86-optimized yuv2yuvX > +;* Copyright 2020 Google LLC Either the commit message ("move a function") or this copyright statement is wrong, please fix this. Please do not commit as-is... Carl Eugen
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Thanks for the feedback Anton. The second patch incorporates changes suggested by James Almer: avx2 instructions are wrapped in %if cpuflag(avx2) and movddup is restored; mm1 is replaced by m1 on x86_32. On Tue, Oct 27, 2020 at 10:40 AM Anton Khirnov wrote: > Hi, > Quoting Alan Kelly (2020-10-27 10:10:14) > > --- > > libswscale/x86/Makefile | 1 + > > libswscale/x86/swscale.c| 75 - > > libswscale/x86/yuv2yuvX.asm | 109 > > 3 files changed, 120 insertions(+), 65 deletions(-) > > create mode 100644 libswscale/x86/yuv2yuvX.asm > > > > No comments on the code itself (yet?), but as for your submission: > - when you send multiple iterations of the same patch, it is helpful to > mention what changed, e.g. with git send-email --annotate > - the commit message should follow the standard format of: > * swscale: short summary of the change > > Extended description of the commit, if needed. > > -- > Anton Khirnov
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Hi, Quoting Alan Kelly (2020-10-27 10:10:14) > --- > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 75 - > libswscale/x86/yuv2yuvX.asm | 109 > 3 files changed, 120 insertions(+), 65 deletions(-) > create mode 100644 libswscale/x86/yuv2yuvX.asm > No comments on the code itself (yet?), but as for your submission: - when you send multiple iterations of the same patch, it is helpful to mention what changed, e.g. with git send-email --annotate - the commit message should follow the standard format of: * swscale: short summary of the change Extended description of the commit, if needed. -- Anton Khirnov
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
--- libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 - libswscale/x86/yuv2yuvX.asm | 109 3 files changed, 120 insertions(+), 65 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile index 831d5359aa..bfe383364e 100644 --- a/libswscale/x86/Makefile +++ b/libswscale/x86/Makefile @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \ x86/scale.o \ x86/rgb_2_rgb.o \ x86/yuv_2_rgb.o \ + x86/yuv2yuvX.o \ diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c index 3160fedf04..fec9fa22e0 100644 --- a/libswscale/x86/swscale.c +++ b/libswscale/x86/swscale.c @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY) } #if HAVE_MMXEXT +void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize, + uint8_t *dest, int dstW, + const uint8_t *dither, int offset); + static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, const int16_t **src, uint8_t *dest, int dstW, const uint8_t *dither, int offset) { +int remainder = (dstW % 32); +int pixelsProcessed = dstW - remainder; if(((uintptr_t)dest) & 15){ yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset); return; } -filterSize--; -#define MAIN_FUNCTION \ -"pxor %%xmm0, %%xmm0 \n\t" \ -"punpcklbw %%xmm0, %%xmm3 \n\t" \ -"movd %4, %%xmm1 \n\t" \ -"punpcklwd %%xmm1, %%xmm1 \n\t" \ -"punpckldq %%xmm1, %%xmm1 \n\t" \ -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ -"psllw $3, %%xmm1 \n\t" \ -"paddw %%xmm1, %%xmm3 \n\t" \ -"psraw $4, %%xmm3 \n\t" \ -"movdqa %%xmm3, %%xmm4 \n\t" \ -"movdqa %%xmm3, %%xmm7 \n\t" \ -"movl %3, %%ecx \n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -".p2align 4 \n\t" /* FIXME Unroll? 
*/\ -"1: \n\t"\ -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\ -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\ -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\ -"add$16, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\ -"pmulhw %%xmm0, %%xmm2 \n\t"\ -"pmulhw %%xmm0, %%xmm5 \n\t"\ -"paddw%%xmm2, %%xmm3 \n\t"\ -"paddw%%xmm5, %%xmm4 \n\t"\ -" jnz1b \n\t"\ -"psraw $3, %%xmm3 \n\t"\ -"psraw $3, %%xmm4 \n\t"\ -"packuswb %%xmm4, %%xmm3 \n\t"\ -"movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\ -"add $16, %%"FF_REG_c"\n\t"\ -"cmp %2, %%"FF_REG_c"\n\t"\ -"movdqa %%xmm7, %%xmm3\n\t" \ -"movdqa %%xmm7, %%xmm4\n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"jb 1b \n\t" - -if (offset) { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -"movdqa%%xmm3, %%xmm4 \n\t" -"psrlq$24, %%xmm3 \n\t" -"psllq$40, %%xmm4 \n\t" -"por %%xmm4, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), - "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset), - "m"(filterSize), "m"(((uint64_t *) dither)[0]) - : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , "%xmm5" , "%xmm7" ,) -"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c - ); -} else { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g"
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Apologies for the multiple threads, my git send-email was wrongly configured. This has been fixed. This code has been tested on AVX2 giving a significant speedup, however, until the ff_hscale* functions are ported to avx2, this should not be enabled as it results in an overall slowdown of swscale probably due to cpu frequency scaling. checkasm will follow in a separate patch. On Tue, Oct 27, 2020 at 9:56 AM Alan Kelly wrote: > --- > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 75 -- > libswscale/x86/yuv2yuvX.asm | 105 > 3 files changed, 116 insertions(+), 65 deletions(-) > create mode 100644 libswscale/x86/yuv2yuvX.asm > > diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile > index 831d5359aa..bfe383364e 100644 > --- a/libswscale/x86/Makefile > +++ b/libswscale/x86/Makefile > @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o > \ > x86/scale.o \ > x86/rgb_2_rgb.o \ > x86/yuv_2_rgb.o \ > + x86/yuv2yuvX.o \ > diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c > index 3160fedf04..fec9fa22e0 100644 > --- a/libswscale/x86/swscale.c > +++ b/libswscale/x86/swscale.c > @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int > dstY) > } > > #if HAVE_MMXEXT > +void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize, > + uint8_t *dest, int dstW, > + const uint8_t *dither, int offset); > + > static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, > const int16_t **src, uint8_t *dest, int dstW, > const uint8_t *dither, int offset) > { > +int remainder = (dstW % 32); > +int pixelsProcessed = dstW - remainder; > if(((uintptr_t)dest) & 15){ > yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, > offset); > return; > } > -filterSize--; > -#define MAIN_FUNCTION \ > -"pxor %%xmm0, %%xmm0 \n\t" \ > -"punpcklbw %%xmm0, %%xmm3 \n\t" \ > -"movd %4, %%xmm1 \n\t" \ > -"punpcklwd %%xmm1, %%xmm1 \n\t" \ > -"punpckldq %%xmm1, %%xmm1 \n\t" \ > -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ > -"psllw $3, %%xmm1 \n\t" \ > -"paddw %%xmm1, 
%%xmm3 \n\t" \ > -"psraw $4, %%xmm3 \n\t" \ > -"movdqa %%xmm3, %%xmm4 \n\t" \ > -"movdqa %%xmm3, %%xmm7 \n\t" \ > -"movl %3, %%ecx \n\t" \ > -"mov %0, %%"FF_REG_d" > \n\t"\ > -"mov(%%"FF_REG_d"), %%"FF_REG_S" > \n\t"\ > -".p2align 4 \n\t" /* > FIXME Unroll? */\ > -"1: \n\t"\ > -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* > filterCoeff */\ > -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 > \n\t" /* srcData */\ > -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 > \n\t" /* srcData */\ > -"add$16, %%"FF_REG_d" > \n\t"\ > -"mov(%%"FF_REG_d"), %%"FF_REG_S" > \n\t"\ > -"test %%"FF_REG_S", %%"FF_REG_S" > \n\t"\ > -"pmulhw %%xmm0, %%xmm2 \n\t"\ > -"pmulhw %%xmm0, %%xmm5 \n\t"\ > -"paddw%%xmm2, %%xmm3 \n\t"\ > -"paddw%%xmm5, %%xmm4 \n\t"\ > -" jnz1b \n\t"\ > -"psraw $3, %%xmm3 \n\t"\ > -"psraw $3, %%xmm4 \n\t"\ > -"packuswb %%xmm4, %%xmm3 \n\t"\ > -"movntdq %%xmm3, (%1, %%"FF_REG_c") > \n\t"\ > -"add $16, %%"FF_REG_c"\n\t"\ > -"cmp %2, %%"FF_REG_c"\n\t"\ > -"movdqa %%xmm7, %%xmm3\n\t" \ > -"movdqa %%xmm7, %%xmm4\n\t" \ > -"mov %0, %%"FF_REG_d" > \n\t"\ > -"mov(%%"FF_REG_d"), %%"FF_REG_S" > \n\t"\ > -"jb 1b \n\t" > - > -if (offset) { > -__asm__ volatile( > -"movq %5, %%xmm3 \n\t" > -"movdqa%%xmm3, %%xmm4 \n\t" > -"psrlq$24, %%xmm3 \n\t" > -
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, however, although local tests show a significant speedup
Thanks for the review, I have made the required changes. As I have changed the subject the patch is in a new thread. On Fri, Oct 23, 2020 at 4:10 PM James Almer wrote: > On 10/23/2020 10:17 AM, Alan Kelly wrote: > > Fixed. The wrong step size was used causing a write passed the end of > > the buffer. yuv2yuvX_mmxext is now called if there are any remaining > pixels. > > Please fix the commit subject (It's too long and contains commentary), > and keep comments about fixes between versions outside of the commit > message body. You can manually place them after the --- below, or in a > separate reply. > > > --- > > libswscale/x86/Makefile | 1 + > > libswscale/x86/swscale.c| 75 -- > > libswscale/x86/yuv2yuvX.asm | 105 > > 3 files changed, 116 insertions(+), 65 deletions(-) > > create mode 100644 libswscale/x86/yuv2yuvX.asm > > > > diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile > > index 831d5359aa..bfe383364e 100644 > > --- a/libswscale/x86/Makefile > > +++ b/libswscale/x86/Makefile > > @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o > \ > > x86/scale.o > \ > > x86/rgb_2_rgb.o > \ > > x86/yuv_2_rgb.o > \ > > + x86/yuv2yuvX.o > \ > > diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c > > index 3160fedf04..fec9fa22e0 100644 > > --- a/libswscale/x86/swscale.c > > +++ b/libswscale/x86/swscale.c > > @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int > dstY) > > } > > > > #if HAVE_MMXEXT > > +void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize, > > + uint8_t *dest, int dstW, > > + const uint8_t *dither, int offset); > > + > > static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, > > const int16_t **src, uint8_t *dest, int dstW, > > const uint8_t *dither, int offset) > > { > > +int remainder = (dstW % 32); > > +int pixelsProcessed = dstW - remainder; > > if(((uintptr_t)dest) & 15){ > > yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, > offset); > > return; > > } > > -filterSize--; > > -#define 
MAIN_FUNCTION \ > > -"pxor %%xmm0, %%xmm0 \n\t" \ > > -"punpcklbw %%xmm0, %%xmm3 \n\t" \ > > -"movd %4, %%xmm1 \n\t" \ > > -"punpcklwd %%xmm1, %%xmm1 \n\t" \ > > -"punpckldq %%xmm1, %%xmm1 \n\t" \ > > -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ > > -"psllw $3, %%xmm1 \n\t" \ > > -"paddw %%xmm1, %%xmm3 \n\t" \ > > -"psraw $4, %%xmm3 \n\t" \ > > -"movdqa %%xmm3, %%xmm4 \n\t" \ > > -"movdqa %%xmm3, %%xmm7 \n\t" \ > > -"movl %3, %%ecx \n\t" \ > > -"mov %0, %%"FF_REG_d" > \n\t"\ > > -"mov(%%"FF_REG_d"), %%"FF_REG_S" > \n\t"\ > > -".p2align 4 \n\t" /* > FIXME Unroll? */\ > > -"1: \n\t"\ > > -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* > filterCoeff */\ > > -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 > \n\t" /* srcData */\ > > -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 > \n\t" /* srcData */\ > > -"add$16, %%"FF_REG_d" > \n\t"\ > > -"mov(%%"FF_REG_d"), %%"FF_REG_S" > \n\t"\ > > -"test %%"FF_REG_S", %%"FF_REG_S" > \n\t"\ > > -"pmulhw %%xmm0, %%xmm2 \n\t"\ > > -"pmulhw %%xmm0, %%xmm5 \n\t"\ > > -"paddw%%xmm2, %%xmm3 \n\t"\ > > -"paddw%%xmm5, %%xmm4 \n\t"\ > > -" jnz1b \n\t"\ > > -"psraw $3, %%xmm3 \n\t"\ > > -"psraw $3, %%xmm4 \n\t"\ > > -"packuswb %%xmm4, %%xmm3 \n\t"\ > > -"movntdq %%xmm3, (%1, %%"FF_REG_c") > \n\t"\ > > -"add $16, %%"FF_REG_c"\n\t"\ > > -"cmp %2, %%"FF_REG_c"\n\t"\ > > -"movdqa %%xmm7, %%xmm3\n\t" \ > > -"movdqa %%xmm7, %%xmm4\n\t" \ > > -"mov %0, %%"FF_REG_d" > \n\t"\ > > -"mov(%%"FF_REG_d"), %%"FF_REG_S" > \n\t"\ > > -"jb
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
--- libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 -- libswscale/x86/yuv2yuvX.asm | 105 3 files changed, 116 insertions(+), 65 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile index 831d5359aa..bfe383364e 100644 --- a/libswscale/x86/Makefile +++ b/libswscale/x86/Makefile @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \ x86/scale.o \ x86/rgb_2_rgb.o \ x86/yuv_2_rgb.o \ + x86/yuv2yuvX.o \ diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c index 3160fedf04..fec9fa22e0 100644 --- a/libswscale/x86/swscale.c +++ b/libswscale/x86/swscale.c @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY) } #if HAVE_MMXEXT +void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize, + uint8_t *dest, int dstW, + const uint8_t *dither, int offset); + static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, const int16_t **src, uint8_t *dest, int dstW, const uint8_t *dither, int offset) { +int remainder = (dstW % 32); +int pixelsProcessed = dstW - remainder; if(((uintptr_t)dest) & 15){ yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset); return; } -filterSize--; -#define MAIN_FUNCTION \ -"pxor %%xmm0, %%xmm0 \n\t" \ -"punpcklbw %%xmm0, %%xmm3 \n\t" \ -"movd %4, %%xmm1 \n\t" \ -"punpcklwd %%xmm1, %%xmm1 \n\t" \ -"punpckldq %%xmm1, %%xmm1 \n\t" \ -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ -"psllw $3, %%xmm1 \n\t" \ -"paddw %%xmm1, %%xmm3 \n\t" \ -"psraw $4, %%xmm3 \n\t" \ -"movdqa %%xmm3, %%xmm4 \n\t" \ -"movdqa %%xmm3, %%xmm7 \n\t" \ -"movl %3, %%ecx \n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -".p2align 4 \n\t" /* FIXME Unroll? 
*/\ -"1: \n\t"\ -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\ -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\ -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\ -"add$16, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\ -"pmulhw %%xmm0, %%xmm2 \n\t"\ -"pmulhw %%xmm0, %%xmm5 \n\t"\ -"paddw%%xmm2, %%xmm3 \n\t"\ -"paddw%%xmm5, %%xmm4 \n\t"\ -" jnz1b \n\t"\ -"psraw $3, %%xmm3 \n\t"\ -"psraw $3, %%xmm4 \n\t"\ -"packuswb %%xmm4, %%xmm3 \n\t"\ -"movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\ -"add $16, %%"FF_REG_c"\n\t"\ -"cmp %2, %%"FF_REG_c"\n\t"\ -"movdqa %%xmm7, %%xmm3\n\t" \ -"movdqa %%xmm7, %%xmm4\n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"jb 1b \n\t" - -if (offset) { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -"movdqa%%xmm3, %%xmm4 \n\t" -"psrlq$24, %%xmm3 \n\t" -"psllq$40, %%xmm4 \n\t" -"por %%xmm4, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), - "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset), - "m"(filterSize), "m"(((uint64_t *) dither)[0]) - : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , "%xmm5" , "%xmm7" ,) -"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c - ); -} else { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g"
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant speedup
On Fri, Oct 23, 2020 at 03:34:18PM +0200, Alan Kelly wrote: > Fixed. The wrong step size was used causing a write past the end of > the buffer. yuv2yuvX_mmxext is now called if there are any remaining > pixels. > > There is currently no checkasm for these functions. Is this required for > submission? > > (Apologies for the double mail, I used git send-email but it didn't > respond to the correct thread) > --- > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 75 -- > libswscale/x86/yuv2yuvX.asm | 105 > 3 files changed, 116 insertions(+), 65 deletions(-) > create mode 100644 libswscale/x86/yuv2yuvX.asm error: corrupt patch at line 18 [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB The worst form of inequality is to try to make unequal things equal. -- Aristotle
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, however, although local tests show a significant speedup
On 10/23/2020 10:17 AM, Alan Kelly wrote: > Fixed. The wrong step size was used causing a write passed the end of > the buffer. yuv2yuvX_mmxext is now called if there are any remaining pixels. Please fix the commit subject (It's too long and contains commentary), and keep comments about fixes between versions outside of the commit message body. You can manually place them after the --- below, or in a separate reply. > --- > libswscale/x86/Makefile | 1 + > libswscale/x86/swscale.c| 75 -- > libswscale/x86/yuv2yuvX.asm | 105 > 3 files changed, 116 insertions(+), 65 deletions(-) > create mode 100644 libswscale/x86/yuv2yuvX.asm > > diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile > index 831d5359aa..bfe383364e 100644 > --- a/libswscale/x86/Makefile > +++ b/libswscale/x86/Makefile > @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o > \ > x86/scale.o \ > x86/rgb_2_rgb.o \ > x86/yuv_2_rgb.o \ > + x86/yuv2yuvX.o \ > diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c > index 3160fedf04..fec9fa22e0 100644 > --- a/libswscale/x86/swscale.c > +++ b/libswscale/x86/swscale.c > @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY) > } > > #if HAVE_MMXEXT > +void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize, > + uint8_t *dest, int dstW, > + const uint8_t *dither, int offset); > + > static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, > const int16_t **src, uint8_t *dest, int dstW, > const uint8_t *dither, int offset) > { > +int remainder = (dstW % 32); > +int pixelsProcessed = dstW - remainder; > if(((uintptr_t)dest) & 15){ > yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset); > return; > } > -filterSize--; > -#define MAIN_FUNCTION \ > -"pxor %%xmm0, %%xmm0 \n\t" \ > -"punpcklbw %%xmm0, %%xmm3 \n\t" \ > -"movd %4, %%xmm1 \n\t" \ > -"punpcklwd %%xmm1, %%xmm1 \n\t" \ > -"punpckldq %%xmm1, %%xmm1 \n\t" \ > -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ > -"psllw $3, %%xmm1 \n\t" \ > -"paddw %%xmm1, %%xmm3 \n\t" 
\ > -"psraw $4, %%xmm3 \n\t" \ > -"movdqa %%xmm3, %%xmm4 \n\t" \ > -"movdqa %%xmm3, %%xmm7 \n\t" \ > -"movl %3, %%ecx \n\t" \ > -"mov %0, %%"FF_REG_d"\n\t"\ > -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ > -".p2align 4 \n\t" /* FIXME > Unroll? */\ > -"1: \n\t"\ > -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* > filterCoeff */\ > -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" > /* srcData */\ > -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" > /* srcData */\ > -"add$16, %%"FF_REG_d"\n\t"\ > -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ > -"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\ > -"pmulhw %%xmm0, %%xmm2 \n\t"\ > -"pmulhw %%xmm0, %%xmm5 \n\t"\ > -"paddw%%xmm2, %%xmm3 \n\t"\ > -"paddw%%xmm5, %%xmm4 \n\t"\ > -" jnz1b \n\t"\ > -"psraw $3, %%xmm3 \n\t"\ > -"psraw $3, %%xmm4 \n\t"\ > -"packuswb %%xmm4, %%xmm3 \n\t"\ > -"movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\ > -"add $16, %%"FF_REG_c"\n\t"\ > -"cmp %2, %%"FF_REG_c"\n\t"\ > -"movdqa %%xmm7, %%xmm3\n\t" \ > -"movdqa %%xmm7, %%xmm4\n\t" \ > -"mov %0, %%"FF_REG_d"\n\t"\ > -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ > -"jb 1b \n\t" > - > -if (offset) { > -__asm__ volatile( > -"movq %5, %%xmm3 \n\t" > -"movdqa%%xmm3, %%xmm4 \n\t" > -"psrlq$24, %%xmm3 \n\t"
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant speedup
Fixed. The wrong step size was used causing a write passed the end of the buffer. yuv2yuvX_mmxext is now called if there are any remaining pixels. There is currently no checkasm for these functions. Is this required for submission? (Apologies for the double mail, I used git send-email but it didn't respond to the correct thread) --- libswscale/x86/Makefile | 1 + libswscale/x86/swscale.c| 75 -- libswscale/x86/yuv2yuvX.asm | 105 3 files changed, 116 insertions(+), 65 deletions(-) create mode 100644 libswscale/x86/yuv2yuvX.asm diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile index 831d5359aa..bfe383364e 100644 --- a/libswscale/x86/Makefile +++ b/libswscale/x86/Makefile @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \ x86/scale.o \ x86/rgb_2_rgb.o \ x86/yuv_2_rgb.o \ + x86/yuv2yuvX.o \ diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c index 3160fedf04..fec9fa22e0 100644 --- a/libswscale/x86/swscale.c +++ b/libswscale/x86/swscale.c @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY) } #if HAVE_MMXEXT +void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize, + uint8_t *dest, int dstW, + const uint8_t *dither, int offset); + static void yuv2yuvX_sse3(const int16_t *filter, int filterSize, const int16_t **src, uint8_t *dest, int dstW, const uint8_t *dither, int offset) { +int remainder = (dstW % 32); +int pixelsProcessed = dstW - remainder; if(((uintptr_t)dest) & 15){ yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset); return; } -filterSize--; -#define MAIN_FUNCTION \ -"pxor %%xmm0, %%xmm0 \n\t" \ -"punpcklbw %%xmm0, %%xmm3 \n\t" \ -"movd %4, %%xmm1 \n\t" \ -"punpcklwd %%xmm1, %%xmm1 \n\t" \ -"punpckldq %%xmm1, %%xmm1 \n\t" \ -"punpcklqdq %%xmm1, %%xmm1 \n\t" \ -"psllw $3, %%xmm1 \n\t" \ -"paddw %%xmm1, %%xmm3 \n\t" \ -"psraw $4, %%xmm3 \n\t" \ -"movdqa %%xmm3, %%xmm4 \n\t" \ -"movdqa %%xmm3, %%xmm7 \n\t" \ -"movl %3, %%ecx \n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ 
-".p2align 4 \n\t" /* FIXME Unroll? */\ -"1: \n\t"\ -"movddup 8(%%"FF_REG_d"), %%xmm0 \n\t" /* filterCoeff */\ -"movdqa (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\ -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\ -"add$16, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\ -"pmulhw %%xmm0, %%xmm2 \n\t"\ -"pmulhw %%xmm0, %%xmm5 \n\t"\ -"paddw%%xmm2, %%xmm3 \n\t"\ -"paddw%%xmm5, %%xmm4 \n\t"\ -" jnz1b \n\t"\ -"psraw $3, %%xmm3 \n\t"\ -"psraw $3, %%xmm4 \n\t"\ -"packuswb %%xmm4, %%xmm3 \n\t"\ -"movntdq %%xmm3, (%1, %%"FF_REG_c") \n\t"\ -"add $16, %%"FF_REG_c"\n\t"\ -"cmp %2, %%"FF_REG_c"\n\t"\ -"movdqa %%xmm7, %%xmm3\n\t" \ -"movdqa %%xmm7, %%xmm4\n\t" \ -"mov %0, %%"FF_REG_d"\n\t"\ -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\ -"jb 1b \n\t" - -if (offset) { -__asm__ volatile( -"movq %5, %%xmm3 \n\t" -"movdqa%%xmm3, %%xmm4 \n\t" -"psrlq$24, %%xmm3 \n\t" -"psllq$40, %%xmm4 \n\t" -"por %%xmm4, %%xmm3 \n\t" -MAIN_FUNCTION - :: "g" (filter), - "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset), - "m"(filterSize), "m"(((uint64_t *)
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, however, although local tests show a significant speedup
Fixed. The wrong step size was used, causing a write past the end of the buffer.
yuv2yuvX_mmxext is now called if there are any remaining pixels.
---
 libswscale/x86/Makefile     |   1 +
 libswscale/x86/swscale.c    |  75 --
 libswscale/x86/yuv2yuvX.asm | 105 
 3 files changed, 116 insertions(+), 65 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \
                x86/scale.o \
                x86/rgb_2_rgb.o \
                x86/yuv_2_rgb.o \
+               x86/yuv2yuvX.o \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..fec9fa22e0 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
+void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize,
+                      uint8_t *dest, int dstW,
+                      const uint8_t *dither, int offset);
+
 static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
                           const int16_t **src, uint8_t *dest, int dstW,
                           const uint8_t *dither, int offset)
 {
+    int remainder = (dstW % 32);
+    int pixelsProcessed = dstW - remainder;
     if(((uintptr_t)dest) & 15){
         yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
         return;
     }
-    filterSize--;
-#define MAIN_FUNCTION \
-        "pxor       %%xmm0, %%xmm0 \n\t" \
-        "punpcklbw  %%xmm0, %%xmm3 \n\t" \
-        "movd           %4, %%xmm1 \n\t" \
-        "punpcklwd  %%xmm1, %%xmm1 \n\t" \
-        "punpckldq  %%xmm1, %%xmm1 \n\t" \
-        "punpcklqdq %%xmm1, %%xmm1 \n\t" \
-        "psllw          $3, %%xmm1 \n\t" \
-        "paddw      %%xmm1, %%xmm3 \n\t" \
-        "psraw          $4, %%xmm3 \n\t" \
-        "movdqa     %%xmm3, %%xmm4 \n\t" \
-        "movdqa     %%xmm3, %%xmm7 \n\t" \
-        "movl           %3, %%ecx  \n\t" \
-        "mov            %0, %%"FF_REG_d"       \n\t"\
-        "mov  (%%"FF_REG_d"), %%"FF_REG_S"     \n\t"\
-        ".p2align                4 \n\t" /* FIXME Unroll? */\
-        "1:                        \n\t"\
-        "movddup  8(%%"FF_REG_d"), %%xmm0      \n\t" /* filterCoeff */\
-        "movdqa    (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\
-        "movdqa  16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\
-        "add           $16, %%"FF_REG_d"       \n\t"\
-        "mov  (%%"FF_REG_d"), %%"FF_REG_S"     \n\t"\
-        "test  %%"FF_REG_S", %%"FF_REG_S"      \n\t"\
-        "pmulhw     %%xmm0, %%xmm2 \n\t"\
-        "pmulhw     %%xmm0, %%xmm5 \n\t"\
-        "paddw      %%xmm2, %%xmm3 \n\t"\
-        "paddw      %%xmm5, %%xmm4 \n\t"\
-        " jnz                   1b \n\t"\
-        "psraw          $3, %%xmm3 \n\t"\
-        "psraw          $3, %%xmm4 \n\t"\
-        "packuswb   %%xmm4, %%xmm3 \n\t"\
-        "movntdq    %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-        "add           $16, %%"FF_REG_c"       \n\t"\
-        "cmp            %2, %%"FF_REG_c"       \n\t"\
-        "movdqa     %%xmm7, %%xmm3 \n\t" \
-        "movdqa     %%xmm7, %%xmm4 \n\t" \
-        "mov            %0, %%"FF_REG_d"       \n\t"\
-        "mov  (%%"FF_REG_d"), %%"FF_REG_S"     \n\t"\
-        "jb                     1b \n\t"
-
-    if (offset) {
-        __asm__ volatile(
-                "movq          %5, %%xmm3  \n\t"
-                "movdqa    %%xmm3, %%xmm4  \n\t"
-                "psrlq        $24, %%xmm3  \n\t"
-                "psllq        $40, %%xmm4  \n\t"
-                "por       %%xmm4, %%xmm3  \n\t"
-                MAIN_FUNCTION
-                :: "g" (filter),
-                   "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-                   "m"(filterSize), "m"(((uint64_t *) dither)[0])
-                : XMM_CLOBBERS("%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm7",)
-                  "%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-
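For reference, the computation the deleted inline asm (and its yasm replacement) performs is a vertical FIR: each output pixel is a weighted sum of the same column across filterSize source rows, biased by a repeating 8-byte dither pattern, then scaled down and clipped to 8 bits. A scalar model along the lines of libswscale's generic C path — the 12- and 19-bit shift constants below are an assumption from that path, not taken verbatim from this patch:

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar model of yuv2yuvX: filter[] holds Q12 fixed-point weights,
 * src[j] are 15-bit intermediate rows, dither repeats every 8 bytes. */
static void yuv2yuvX_scalar(const int16_t *filter, int filterSize,
                            const int16_t **src, uint8_t *dest,
                            int dstW, const uint8_t *dither, int offset)
{
    for (int i = 0; i < dstW; i++) {
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];
        dest[i] = clip_uint8(val >> 19);
    }
}
```

With a single unit coefficient (4096 in Q12) the function reduces to `src >> 7` plus dither, which makes the round-trip easy to sanity-check.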
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant spee
On Thu, Oct 22, 2020 at 09:43:53AM +0200, Alan Kelly wrote:
> Other functions to be ported to avx2 have been identified and are on
> the todo list.
> ---
>  libswscale/x86/Makefile     |   1 +
>  libswscale/x86/swscale.c    |  72 +++--
>  libswscale/x86/yuv2yuvX.asm | 105 
>  3 files changed, 112 insertions(+), 66 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm

Breaks:

./ffmpeg -i ~/vlcticket/5887/Cruise\ 2012_07_29_19_02_16.wmv -an -vcodec mjpeg -vf scale=800:600:interl=1 -qscale 1 out.avi

(the output file has artifacts at the left side)

input is maybe here:
https://trac.videolan.org/vlc/attachment/ticket/7246/Cruise%202012_07_29_19_02_16.wmv

[...]

--
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Its not that you shouldnt use gotos but rather that you should write
readable code and code with gotos often but not always is less readable

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
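Edge artifacts like the ones reported here are a classic symptom of a vector loop storing a different number of pixels than the row actually contains. The follow-up fix in this thread splits the row into whole 32-pixel blocks for the SSE3 kernel and routes the remainder through yuv2yuvX_mmxext. A minimal sketch of that split — the two kernels below are stand-ins, not the real FFmpeg functions; only the dispatch logic mirrors the fix:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the wide SIMD path: assumed to handle only whole
 * 32-pixel blocks. */
static void wide_kernel(uint8_t *dest, int n)
{
    for (int i = 0; i < n; i++)
        dest[i] = 1;
}

/* Stand-in for the scalar/mmxext fallback that finishes the row. */
static void tail_fallback(uint8_t *dest, int n)
{
    for (int i = 0; i < n; i++)
        dest[i] = 2;
}

/* Split dstW into a multiple of the block size plus a tail, so the
 * wide kernel never stores past the end of dest. */
static void process_row(uint8_t *dest, int dstW)
{
    int remainder       = dstW % 32;
    int pixelsProcessed = dstW - remainder;

    wide_kernel(dest, pixelsProcessed);
    if (remainder)
        tail_fallback(dest + pixelsProcessed, remainder);
}
```

The invariant worth testing is that every pixel up to dstW is written by exactly one of the two paths and nothing beyond dstW is touched.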
Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant spee
Do we have checkasm for those functions?

On Thu, 22 Oct 2020, at 09:43, Alan Kelly wrote:
> Other functions to be ported to avx2 have been identified and are on
> the todo list.
> ---
>  libswscale/x86/Makefile     |   1 +
>  libswscale/x86/swscale.c    |  72 +++--
>  libswscale/x86/yuv2yuvX.asm | 105 
>  3 files changed, 112 insertions(+), 66 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm
[...]
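A checkasm test for this function does exist by the time of the later valgrind reports in this thread. Stripped of the checkasm framework, its core idea is a randomized reference-vs-candidate comparison that also checks guard bytes past the row end — exactly the kind of harness that catches the overwrite fixed in the follow-up patch. The two implementations below are trivial stand-ins for the C reference and the SIMD candidate:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAXW  512
#define GUARD  16

/* Stand-in "reference" and "optimized" implementations; in a real
 * checkasm test these would be two versions of the same function
 * pointer (C vs SSE3/AVX2). */
static void ref_impl(uint8_t *dest, const uint8_t *src, int w)
{
    for (int i = 0; i < w; i++)
        dest[i] = src[i] >> 1;
}

static void new_impl(uint8_t *dest, const uint8_t *src, int w)
{
    for (int i = 0; i < w; i++)
        dest[i] = src[i] / 2;
}

/* Run both on random input of width w (including widths that are not
 * a multiple of the SIMD block size), compare the pixels, and verify
 * the guard bytes after the row were left untouched. */
static int check_width(int w)
{
    uint8_t src[MAXW], ref[MAXW + GUARD], out[MAXW + GUARD];

    for (int i = 0; i < w; i++)
        src[i] = (uint8_t)rand();
    memset(ref, 0xAA, sizeof(ref));
    memset(out, 0xAA, sizeof(out));

    ref_impl(ref, src, w);
    new_impl(out, src, w);

    return memcmp(ref, out, w) == 0 &&            /* same result   */
           memcmp(out + w, ref + w, GUARD) == 0;  /* no overwrite  */
}
```

Widths that are not multiples of 16 or 32 are the interesting cases here, since they exercise the tail-handling path.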
[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant speed-up
Other functions to be ported to avx2 have been identified and are on
the todo list.
---
 libswscale/x86/Makefile     |   1 +
 libswscale/x86/swscale.c    |  72 +++--
 libswscale/x86/yuv2yuvX.asm | 105 
 3 files changed, 112 insertions(+), 66 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o \
                x86/scale.o \
                x86/rgb_2_rgb.o \
                x86/yuv_2_rgb.o \
+               x86/yuv2yuvX.o \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..ea83b097ca 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,6 +197,10 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
+void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize,
+                      uint8_t *dest, int dstW,
+                      const uint8_t *dither, int offset);
+
 static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
                           const int16_t **src, uint8_t *dest, int dstW,
                           const uint8_t *dither, int offset)
@@ -205,72 +209,8 @@ static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
         yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
         return;
     }
-    filterSize--;
-#define MAIN_FUNCTION \
-        "pxor       %%xmm0, %%xmm0 \n\t" \
-        "punpcklbw  %%xmm0, %%xmm3 \n\t" \
-        "movd           %4, %%xmm1 \n\t" \
-        "punpcklwd  %%xmm1, %%xmm1 \n\t" \
-        "punpckldq  %%xmm1, %%xmm1 \n\t" \
-        "punpcklqdq %%xmm1, %%xmm1 \n\t" \
-        "psllw          $3, %%xmm1 \n\t" \
-        "paddw      %%xmm1, %%xmm3 \n\t" \
-        "psraw          $4, %%xmm3 \n\t" \
-        "movdqa     %%xmm3, %%xmm4 \n\t" \
-        "movdqa     %%xmm3, %%xmm7 \n\t" \
-        "movl           %3, %%ecx  \n\t" \
-        "mov            %0, %%"FF_REG_d"       \n\t"\
-        "mov  (%%"FF_REG_d"), %%"FF_REG_S"     \n\t"\
-        ".p2align                4 \n\t" /* FIXME Unroll? */\
-        "1:                        \n\t"\
-        "movddup  8(%%"FF_REG_d"), %%xmm0      \n\t" /* filterCoeff */\
-        "movdqa    (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* srcData */\
-        "movdqa  16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\
-        "add           $16, %%"FF_REG_d"       \n\t"\
-        "mov  (%%"FF_REG_d"), %%"FF_REG_S"     \n\t"\
-        "test  %%"FF_REG_S", %%"FF_REG_S"      \n\t"\
-        "pmulhw     %%xmm0, %%xmm2 \n\t"\
-        "pmulhw     %%xmm0, %%xmm5 \n\t"\
-        "paddw      %%xmm2, %%xmm3 \n\t"\
-        "paddw      %%xmm5, %%xmm4 \n\t"\
-        " jnz                   1b \n\t"\
-        "psraw          $3, %%xmm3 \n\t"\
-        "psraw          $3, %%xmm4 \n\t"\
-        "packuswb   %%xmm4, %%xmm3 \n\t"\
-        "movntdq    %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-        "add           $16, %%"FF_REG_c"       \n\t"\
-        "cmp            %2, %%"FF_REG_c"       \n\t"\
-        "movdqa     %%xmm7, %%xmm3 \n\t" \
-        "movdqa     %%xmm7, %%xmm4 \n\t" \
-        "mov            %0, %%"FF_REG_d"       \n\t"\
-        "mov  (%%"FF_REG_d"), %%"FF_REG_S"     \n\t"\
-        "jb                     1b \n\t"
-
-    if (offset) {
-        __asm__ volatile(
-                "movq          %5, %%xmm3  \n\t"
-                "movdqa    %%xmm3, %%xmm4  \n\t"
-                "psrlq        $24, %%xmm3  \n\t"
-                "psllq        $40, %%xmm4  \n\t"
-                "por       %%xmm4, %%xmm3  \n\t"
-                MAIN_FUNCTION
-                :: "g" (filter),
-                   "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-                   "m"(filterSize), "m"(((uint64_t *) dither)[0])
-                : XMM_CLOBBERS("%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm7",)
-                  "%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-        );
-    } else {
-        __asm__ volatile(
-            "movq %5, %%xmm3
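The loop epilogue in the asm above ends with `psraw $3` followed by `packuswb`: arithmetic-shift the accumulated 16-bit words down by 3, then pack them to bytes with unsigned saturation. A scalar equivalent of that final step, for readers following the fixed-point flow:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar equivalent of the psraw $3 + packuswb epilogue: arithmetic
 * shift right by 3, then clamp each word into [0, 255] the way
 * packuswb does (negatives to 0, overflow to 255). */
static uint8_t pack_pixel(int16_t acc)
{
    int v = acc >> 3;        /* psraw $3 (arithmetic shift) */
    if (v < 0)
        v = 0;               /* packuswb saturates negatives to 0 */
    if (v > 255)
        v = 255;             /* ...and large values to 255 */
    return (uint8_t)v;
}
```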