Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-02-19 Thread Alan Kelly
Thanks, James, for spotting this. I have sent two patches fixing the valgrind
error from checkasm and the unchecked av_mallocs.

I do not believe that the two remaining valgrind errors come from my patch,
although I may be mistaken. Using git bisect, I have
identified b94cd55155d8c061f1e1faca9076afe540149c27 as the problematic
commit.

On Thu, Feb 18, 2021 at 11:23 PM James Almer  wrote:

> On 2/17/2021 5:24 PM, Paul B Mahol wrote:
> > On Tue, Feb 16, 2021 at 6:31 PM Alan Kelly <
> > alankelly-at-google@ffmpeg.org> wrote:
> >
> >> Looks like there are no comments; is this OK to be applied? Thanks
> >>
> >
> > Applied, thanks for pinging.
>
> Valgrind complains about this change. The checkasm test specifically.
>
>
> http://fate.ffmpeg.org/report.cgi?time=20210218014903=x86_64-archlinux-gcc-valgrind
>
> I also noticed it has a bunch of unchecked av_mallocs().
> ___
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-02-18 Thread James Almer

On 2/17/2021 5:24 PM, Paul B Mahol wrote:

On Tue, Feb 16, 2021 at 6:31 PM Alan Kelly <
alankelly-at-google@ffmpeg.org> wrote:


Looks like there are no comments; is this OK to be applied? Thanks



Applied, thanks for pinging.


Valgrind complains about this change. The checkasm test specifically.

http://fate.ffmpeg.org/report.cgi?time=20210218014903=x86_64-archlinux-gcc-valgrind

I also noticed it has a bunch of unchecked av_mallocs().

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-02-17 Thread Paul B Mahol
On Tue, Feb 16, 2021 at 6:31 PM Alan Kelly <
alankelly-at-google@ffmpeg.org> wrote:

> Looks like there are no comments; is this OK to be applied? Thanks
>

Applied, thanks for pinging.


>
> On Tue, Feb 9, 2021 at 6:25 PM Paul B Mahol  wrote:
>
> > Will apply if no comments.

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-02-16 Thread Alan Kelly
Looks like there are no comments; is this OK to be applied? Thanks

On Tue, Feb 9, 2021 at 6:25 PM Paul B Mahol  wrote:

> Will apply if no comments.

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-02-09 Thread Paul B Mahol
Will apply if no comments.

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-02-09 Thread Alan Kelly
Ping!

On Thu, Jan 14, 2021 at 3:47 PM Alan Kelly  wrote:

> ---
>  Replaces cpuflag(mmx) with notcpuflag(sse3) for store macro
>  Tests for multiple sizes in checkasm-sw_scale
>  checkasm-sw_scale aligns memory on 8 bytes instead of 32 to catch aligned
> loads
>  libswscale/x86/Makefile   |   1 +
>  libswscale/x86/swscale.c  | 130 
>  libswscale/x86/swscale_template.c |  82 --
>  libswscale/x86/yuv2yuvX.asm   | 136 ++
>  tests/checkasm/sw_scale.c | 103 ++
>  5 files changed, 294 insertions(+), 158 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm
>
> diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
> index 831d5359aa..bfe383364e 100644
> --- a/libswscale/x86/Makefile
> +++ b/libswscale/x86/Makefile
> @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
> \
> x86/scale.o  \
> x86/rgb_2_rgb.o  \
> x86/yuv_2_rgb.o  \
> +   x86/yuv2yuvX.o   \
> diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
> index 15c0b22f20..3df193a067 100644
> --- a/libswscale/x86/swscale.c
> +++ b/libswscale/x86/swscale.c
> @@ -63,6 +63,16 @@ DECLARE_ASM_ALIGNED(8, const uint64_t, ff_bgr2UVOffset)
> = 0x8080808080808080ULL;
>  DECLARE_ASM_ALIGNED(8, const uint64_t, ff_w)=
> 0x0001000100010001ULL;
>
>
> +#define YUV2YUVX_FUNC_DECL(opt)  \
> +static void yuv2yuvX_ ##opt(const int16_t *filter, int filterSize, const
> int16_t **src, \
> +   uint8_t *dest, int dstW, \
> +   const uint8_t *dither, int offset); \
> +
> +YUV2YUVX_FUNC_DECL(mmx)
> +YUV2YUVX_FUNC_DECL(mmxext)
> +YUV2YUVX_FUNC_DECL(sse3)
> +YUV2YUVX_FUNC_DECL(avx2)
> +
>  //MMX versions
>  #if HAVE_MMX_INLINE
>  #undef RENAME
> @@ -198,81 +208,44 @@ void ff_updateMMXDitherTables(SwsContext *c, int
> dstY)
>  }
>
>  #if HAVE_MMXEXT
> -static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
> -   const int16_t **src, uint8_t *dest, int dstW,
> -   const uint8_t *dither, int offset)
> -{
> -if(((uintptr_t)dest) & 15){
> -yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither,
> offset);
> -return;
> -}
> -filterSize--;
> -#define MAIN_FUNCTION \
> -"pxor   %%xmm0, %%xmm0 \n\t" \
> -"punpcklbw  %%xmm0, %%xmm3 \n\t" \
> -"movd   %4, %%xmm1 \n\t" \
> -"punpcklwd  %%xmm1, %%xmm1 \n\t" \
> -"punpckldq  %%xmm1, %%xmm1 \n\t" \
> -"punpcklqdq %%xmm1, %%xmm1 \n\t" \
> -"psllw  $3, %%xmm1 \n\t" \
> -"paddw  %%xmm1, %%xmm3 \n\t" \
> -"psraw  $4, %%xmm3 \n\t" \
> -"movdqa %%xmm3, %%xmm4 \n\t" \
> -"movdqa %%xmm3, %%xmm7 \n\t" \
> -"movl   %3, %%ecx  \n\t" \
> -"mov %0, %%"FF_REG_d"
> \n\t"\
> -"mov(%%"FF_REG_d"), %%"FF_REG_S"
>  \n\t"\
> -".p2align 4 \n\t" /*
> FIXME Unroll? */\
> -"1: \n\t"\
> -"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /*
> filterCoeff */\
> -"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2
> \n\t" /* srcData */\
> -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5
> \n\t" /* srcData */\
> -"add$16, %%"FF_REG_d"
> \n\t"\
> -"mov(%%"FF_REG_d"), %%"FF_REG_S"
>  \n\t"\
> -"test %%"FF_REG_S", %%"FF_REG_S"
>  \n\t"\
> -"pmulhw   %%xmm0, %%xmm2  \n\t"\
> -"pmulhw   %%xmm0, %%xmm5  \n\t"\
> -"paddw%%xmm2, %%xmm3  \n\t"\
> -"paddw%%xmm5, %%xmm4  \n\t"\
> -" jnz1b \n\t"\
> -"psraw   $3, %%xmm3  \n\t"\
> -"psraw   $3, %%xmm4  \n\t"\
> -"packuswb %%xmm4, %%xmm3  \n\t"\
> -"movntdq  %%xmm3, (%1, %%"FF_REG_c")
> \n\t"\
> -"add $16, %%"FF_REG_c"\n\t"\
> -"cmp  %2, %%"FF_REG_c"\n\t"\
> -"movdqa   %%xmm7, %%xmm3\n\t" \
> -"movdqa   %%xmm7, %%xmm4\n\t" \
> -"mov %0, %%"FF_REG_d"
> \n\t"\
> -"mov

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-14 Thread Alan Kelly
---
 Replaces cpuflag(mmx) with notcpuflag(sse3) for store macro
 Tests for multiple sizes in checkasm-sw_scale
 checkasm-sw_scale aligns memory on 8 bytes instead of 32 to catch aligned loads
 libswscale/x86/Makefile   |   1 +
 libswscale/x86/swscale.c  | 130 
 libswscale/x86/swscale_template.c |  82 --
 libswscale/x86/yuv2yuvX.asm   | 136 ++
 tests/checkasm/sw_scale.c | 103 ++
 5 files changed, 294 insertions(+), 158 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 15c0b22f20..3df193a067 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -63,6 +63,16 @@ DECLARE_ASM_ALIGNED(8, const uint64_t, ff_bgr2UVOffset) = 
0x8080808080808080ULL;
 DECLARE_ASM_ALIGNED(8, const uint64_t, ff_w)= 
0x0001000100010001ULL;
 
 
+#define YUV2YUVX_FUNC_DECL(opt)  \
+static void yuv2yuvX_ ##opt(const int16_t *filter, int filterSize, const 
int16_t **src, \
+   uint8_t *dest, int dstW, \
+   const uint8_t *dither, int offset); \
+
+YUV2YUVX_FUNC_DECL(mmx)
+YUV2YUVX_FUNC_DECL(mmxext)
+YUV2YUVX_FUNC_DECL(sse3)
+YUV2YUVX_FUNC_DECL(avx2)
+
 //MMX versions
 #if HAVE_MMX_INLINE
 #undef RENAME
@@ -198,81 +208,44 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
-static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
-   const int16_t **src, uint8_t *dest, int dstW,
-   const uint8_t *dither, int offset)
-{
-if(((uintptr_t)dest) & 15){
-yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
-return;
-}
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-14 Thread Alan Kelly
Apologies for this: when I added mmx to the yasm file, I added a macro for
the stores, selecting mova for mmx and movdqu for the others. if
cpuflag(mmx) evaluates to true for every instruction set, so I replaced it
with if notcpuflag(sse3).

The alignment in the checkasm test has been changed to 8 from 32 so that
the test catches problems with alignment.

On Thu, Jan 14, 2021 at 1:11 AM Michael Niedermayer 
wrote:

> On Mon, Jan 11, 2021 at 05:46:31PM +0100, Alan Kelly wrote:
> > ---
> >  Fixes a bug where if there is no offset and a tail which is not
> processed by the
> >  sse3/avx2 version the dither is modified
> >  Deletes mmx/mmxext yuv2yuvX version from swscale_template and adds it
> >  to yuv2yuvX.asm to reduce code duplication and so that it may be used
> >  to process the tail from the larger cardinal simd versions.
> >  src argument of yuv2yuvX_* is now srcOffset, so that tails and offsets
> >  are accounted for correctly.
> >  Changes input size in checkasm so that this corner case is tested.
> >
> >  libswscale/x86/Makefile   |   1 +
> >  libswscale/x86/swscale.c  | 130 
> >  libswscale/x86/swscale_template.c |  82 --
> >  libswscale/x86/yuv2yuvX.asm   | 136 ++
> >  tests/checkasm/sw_scale.c | 100 ++
> >  5 files changed, 291 insertions(+), 158 deletions(-)
> >  create mode 100644 libswscale/x86/yuv2yuvX.asm
>
> This seems to be crashing again, unless I messed up testing:
>
> (gdb) disassemble $rip-32,$rip+32
> Dump of assembler code from 0x55572f02 to 0x55572f42:
>0x55572f02 :   int$0x71
>0x55572f04 :   out%al,$0x3
>0x55572f06 :   vpsraw $0x3,%ymm1,%ymm1
>0x55572f0b :   vpackuswb %ymm4,%ymm3,%ymm3
>0x55572f0f :   vpackuswb %ymm1,%ymm6,%ymm6
>0x55572f13 :   mov(%rdi),%rdx
>0x55572f16 :   vpermq $0xd8,%ymm3,%ymm3
>0x55572f1c :   vpermq $0xd8,%ymm6,%ymm6
> => 0x55572f22 :   vmovdqa %ymm3,(%rcx,%rax,1)
>0x55572f27 :   vmovdqa
> %ymm6,0x20(%rcx,%rax,1)
>0x55572f2d :   add$0x40,%rax
>0x55572f31 :   mov%rdi,%rsi
>0x55572f34 :   cmp%r8,%rax
>0x55572f37 :   jb 0x55572eae
> 
>0x55572f3d :   vzeroupper
>0x55572f40 :   retq
>0x55572f41 :   nopw   %cs:0x0(%rax,%rax,1)
>
> rax0x0  0
> rbx0x30 48
> rcx0x5583f470   93824995292272
> rdx0x5585e500   93824995419392
>
> #0  0x55572f22 in ff_yuv2yuvX_avx2 ()
> #1  0x555724ee in yuv2yuvX_avx2 ()
> #2  0x5556b4f6 in chr_planar_vscale ()
> #3  0x55566d41 in swscale ()
> #4  0x55568284 in sws_scale ()
>
>
>
> [...]
> --
> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> What does censorship reveal? It reveals fear. -- Julian Assange

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-13 Thread Michael Niedermayer
On Mon, Jan 11, 2021 at 05:46:31PM +0100, Alan Kelly wrote:
> ---
>  Fixes a bug where if there is no offset and a tail which is not processed by 
> the
>  sse3/avx2 version the dither is modified
>  Deletes mmx/mmxext yuv2yuvX version from swscale_template and adds it
>  to yuv2yuvX.asm to reduce code duplication and so that it may be used
>  to process the tail from the larger cardinal simd versions.
>  src argument of yuv2yuvX_* is now srcOffset, so that tails and offsets
>  are accounted for correctly.
>  Changes input size in checkasm so that this corner case is tested.
> 
>  libswscale/x86/Makefile   |   1 +
>  libswscale/x86/swscale.c  | 130 
>  libswscale/x86/swscale_template.c |  82 --
>  libswscale/x86/yuv2yuvX.asm   | 136 ++
>  tests/checkasm/sw_scale.c | 100 ++
>  5 files changed, 291 insertions(+), 158 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm

This seems to be crashing again, unless I messed up testing:

(gdb) disassemble $rip-32,$rip+32
Dump of assembler code from 0x55572f02 to 0x55572f42:
   0x55572f02 :   int$0x71
   0x55572f04 :   out%al,$0x3
   0x55572f06 :   vpsraw $0x3,%ymm1,%ymm1
   0x55572f0b :   vpackuswb %ymm4,%ymm3,%ymm3
   0x55572f0f :   vpackuswb %ymm1,%ymm6,%ymm6
   0x55572f13 :   mov(%rdi),%rdx
   0x55572f16 :   vpermq $0xd8,%ymm3,%ymm3
   0x55572f1c :   vpermq $0xd8,%ymm6,%ymm6
=> 0x55572f22 :   vmovdqa %ymm3,(%rcx,%rax,1)
   0x55572f27 :   vmovdqa %ymm6,0x20(%rcx,%rax,1)
   0x55572f2d :   add$0x40,%rax
   0x55572f31 :   mov%rdi,%rsi
   0x55572f34 :   cmp%r8,%rax
   0x55572f37 :   jb 0x55572eae 

   0x55572f3d :   vzeroupper 
   0x55572f40 :   retq   
   0x55572f41 :   nopw   %cs:0x0(%rax,%rax,1)
   
rax0x0  0
rbx0x30 48
rcx0x5583f470   93824995292272
rdx0x5585e500   93824995419392

#0  0x55572f22 in ff_yuv2yuvX_avx2 ()
#1  0x555724ee in yuv2yuvX_avx2 ()
#2  0x5556b4f6 in chr_planar_vscale ()
#3  0x55566d41 in swscale ()
#4  0x55568284 in sws_scale ()



[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

What does censorship reveal? It reveals fear. -- Julian Assange



[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-11 Thread Alan Kelly
---
 Fixes a bug where if there is no offset and a tail which is not processed by 
the
 sse3/avx2 version the dither is modified
 Deletes mmx/mmxext yuv2yuvX version from swscale_template and adds it
 to yuv2yuvX.asm to reduce code duplication and so that it may be used
 to process the tail from the larger cardinal simd versions.
 src argument of yuv2yuvX_* is now srcOffset, so that tails and offsets
 are accounted for correctly.
 Changes input size in checkasm so that this corner case is tested.

 libswscale/x86/Makefile   |   1 +
 libswscale/x86/swscale.c  | 130 
 libswscale/x86/swscale_template.c |  82 --
 libswscale/x86/yuv2yuvX.asm   | 136 ++
 tests/checkasm/sw_scale.c | 100 ++
 5 files changed, 291 insertions(+), 158 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 15c0b22f20..3df193a067 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -63,6 +63,16 @@ DECLARE_ASM_ALIGNED(8, const uint64_t, ff_bgr2UVOffset) = 
0x8080808080808080ULL;
 DECLARE_ASM_ALIGNED(8, const uint64_t, ff_w)= 
0x0001000100010001ULL;
 
 
+#define YUV2YUVX_FUNC_DECL(opt)  \
+static void yuv2yuvX_ ##opt(const int16_t *filter, int filterSize, const 
int16_t **src, \
+   uint8_t *dest, int dstW, \
+   const uint8_t *dither, int offset); \
+
+YUV2YUVX_FUNC_DECL(mmx)
+YUV2YUVX_FUNC_DECL(mmxext)
+YUV2YUVX_FUNC_DECL(sse3)
+YUV2YUVX_FUNC_DECL(avx2)
+
 //MMX versions
 #if HAVE_MMX_INLINE
 #undef RENAME
@@ -198,81 +208,44 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
-static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
-   const int16_t **src, uint8_t *dest, int dstW,
-   const uint8_t *dither, int offset)
-{
-if(((uintptr_t)dest) & 15){
-yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
-return;
-}
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov 

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-11 Thread Alan Kelly
It's a bug in the patch. The tail not processed by the sse3/avx2 version is
handled by the mmx version. I used offset to account for the src pixels
already processed; however, dither is modified whenever offset is not 0. The
bug appears in cases where there is a tail and the original offset is 0. I
am working on a solution.

On Sun, Jan 10, 2021 at 4:26 PM Michael Niedermayer 
wrote:

> On Thu, Jan 07, 2021 at 10:41:19AM +0100, Alan Kelly wrote:
> > ---
> >  Replaces mova with movdqu due to alignment issues
> >  libswscale/x86/Makefile |   1 +
> >  libswscale/x86/swscale.c| 106 +---
> >  libswscale/x86/yuv2yuvX.asm | 117 
> >  tests/checkasm/sw_scale.c   |  98 ++
> >  4 files changed, 246 insertions(+), 76 deletions(-)
> >  create mode 100644 libswscale/x86/yuv2yuvX.asm
>
> I have one (or possibly more) cases where this changes output:
>  ./ffmpeg -i utvideo-yuv422p10le_UQY2_crc32-A431CD5F.avi -bitexact avi.avi
>
>  I don't know if there's a decoder bug, a bug in the patch, or something else
>
> -rw-r- 1 michael michael 246218 Jan 10 16:23 avi.avi
> -rw-r- 1 michael michael 245824 Jan 10 16:23 avi-ref.avi
>
> file should be at:
> https://samples.ffmpeg.org/ffmpeg-bugs/trac/ticket4044/
>
> [...]
> --
> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> In a rich man's house there is no place to spit but his face.
> -- Diogenes of Sinope

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-10 Thread Michael Niedermayer
On Thu, Jan 07, 2021 at 10:41:19AM +0100, Alan Kelly wrote:
> ---
>  Replaces mova with movdqu due to alignment issues
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c| 106 +---
>  libswscale/x86/yuv2yuvX.asm | 117 
>  tests/checkasm/sw_scale.c   |  98 ++
>  4 files changed, 246 insertions(+), 76 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm

I have one (or possibly more) cases where this changes output:
 ./ffmpeg -i utvideo-yuv422p10le_UQY2_crc32-A431CD5F.avi -bitexact avi.avi
 
 I don't know if there's a decoder bug, a bug in the patch, or something else
 
-rw-r- 1 michael michael 246218 Jan 10 16:23 avi.avi
-rw-r- 1 michael michael 245824 Jan 10 16:23 avi-ref.avi

file should be at:
https://samples.ffmpeg.org/ffmpeg-bugs/trac/ticket4044/

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

In a rich man's house there is no place to spit but his face.
-- Diogenes of Sinope



Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-10 Thread Michael Niedermayer
On Thu, Jan 07, 2021 at 10:39:56AM +0100, Alan Kelly wrote:
Thanks for your patience with this. I have replaced mova with movdqu; movu
generated a compile error on ssse3. What system did this crash on?

AMD Ryzen 9 3950X on linux

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Everything should be made as simple as possible, but not simpler.
-- Albert Einstein



Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-07 Thread Alan Kelly
Thanks for your patience with this. I have replaced mova with movdqu; movu
generated a compile error on ssse3. What system did this crash on?

On Wed, Jan 6, 2021 at 9:10 PM Michael Niedermayer 
wrote:

> On Tue, Jan 05, 2021 at 01:31:25PM +0100, Alan Kelly wrote:
> > Ping!
>
> crashes (due to alignment, I think):
>
> (gdb) disassemble $rip-32,$rip+32
> Dump of assembler code from 0x555730a1 to 0x555730e1:
>0x555730a1 :   int$0x71
>0x555730a3 :   out%al,$0x3
>0x555730a5 :   vpsraw $0x3,%ymm1,%ymm1
>0x555730aa :   vpackuswb %ymm4,%ymm3,%ymm3
>0x555730ae :   vpackuswb %ymm1,%ymm6,%ymm6
>0x555730b2 :   mov(%rdi),%rdx
>0x555730b5 :   vpermq $0xd8,%ymm3,%ymm3
>0x555730bb :   vpermq $0xd8,%ymm6,%ymm6
> => 0x555730c1 :   vmovdqa %ymm3,(%rcx,%rax,1)
>0x555730c6 :   vmovdqa
> %ymm6,0x20(%rcx,%rax,1)
>0x555730cc :   add$0x40,%rax
>0x555730d0 :   mov%rdi,%rsi
>0x555730d3 :   cmp%r8,%rax
>0x555730d6 :   jb 0x5557304d
> 
>0x555730dc :   vzeroupper
>0x555730df :   retq
>0x555730e0 : push   %r15
> End of assembler dump.
> (gdb) info all-registers
> rax0x0  0
> rbx0x0  0
> rcx0x5583f470   93824995292272
>
>
> [...]
> --
> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> Modern terrorism, a quick summary: Need oil, start war with country that
> has oil, kill hundread thousand in war. Let country fall into chaos,
> be surprised about raise of fundamantalists. Drop more bombs, kill more
> people, be surprised about them taking revenge and drop even more bombs
> and strip your own citizens of their rights and freedoms. to be continued

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-07 Thread Alan Kelly
---
 Replaces mova with movdqu due to alignment issues
 libswscale/x86/Makefile |   1 +
 libswscale/x86/swscale.c| 106 +---
 libswscale/x86/yuv2yuvX.asm | 117 
 tests/checkasm/sw_scale.c   |  98 ++
 4 files changed, 246 insertions(+), 76 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..8cd8713705 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,81 +197,30 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
-static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
-   const int16_t **src, uint8_t *dest, int dstW,
-   const uint8_t *dither, int offset)
-{
-if(((uintptr_t)dest) & 15){
-yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
-return;
-}
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa%%xmm3, %%xmm4  \n\t"
-"psrlq$24, %%xmm3  \n\t"
-"psllq$40, %%xmm4  \n\t"
-"por   %%xmm4, %%xmm3  \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-  "m"(filterSize), "m"(((uint64_t *) dither)[0])
-  : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , 
"%xmm5" , "%xmm7" ,)
-"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-  );
-} else {
-__asm__ volatile(
-"movq  %5, %%xmm3   \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-  "m"(filterSize), "m"(((uint64_t 

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-06 Thread Michael Niedermayer
On Tue, Jan 05, 2021 at 01:31:25PM +0100, Alan Kelly wrote:
> Ping!

crashes (due to alignment, I think):

(gdb) disassemble $rip-32,$rip+32
Dump of assembler code from 0x555730a1 to 0x555730e1:
   0x555730a1 :   int$0x71
   0x555730a3 :   out%al,$0x3
   0x555730a5 :   vpsraw $0x3,%ymm1,%ymm1
   0x555730aa :   vpackuswb %ymm4,%ymm3,%ymm3
   0x555730ae :   vpackuswb %ymm1,%ymm6,%ymm6
   0x555730b2 :   mov(%rdi),%rdx
   0x555730b5 :   vpermq $0xd8,%ymm3,%ymm3
   0x555730bb :   vpermq $0xd8,%ymm6,%ymm6
=> 0x555730c1 :   vmovdqa %ymm3,(%rcx,%rax,1)
   0x555730c6 :   vmovdqa %ymm6,0x20(%rcx,%rax,1)
   0x555730cc :   add$0x40,%rax
   0x555730d0 :   mov%rdi,%rsi
   0x555730d3 :   cmp%r8,%rax
   0x555730d6 :   jb 0x5557304d 

   0x555730dc :   vzeroupper 
   0x555730df :   retq   
   0x555730e0 : push   %r15
End of assembler dump.
(gdb) info all-registers 
rax0x0  0
rbx0x0  0
rcx0x5583f470   93824995292272


[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Modern terrorism, a quick summary: Need oil, start war with country that
has oil, kill hundread thousand in war. Let country fall into chaos,
be surprised about raise of fundamantalists. Drop more bombs, kill more
people, be surprised about them taking revenge and drop even more bombs
and strip your own citizens of their rights and freedoms. to be continued



Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2021-01-05 Thread Alan Kelly
Ping!

On Thu, Dec 17, 2020 at 11:42 AM Alan Kelly  wrote:

> ---
>  Fixes memory alignment problem in checkasm-sw_scale
>  Tested on Linux 32 and 64 bit and mingw32
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c| 106 +---
>  libswscale/x86/yuv2yuvX.asm | 117 
>  tests/checkasm/sw_scale.c   |  98 ++
>  4 files changed, 246 insertions(+), 76 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm
>
> [...]

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-17 Thread Alan Kelly
---
 Fixes memory alignment problem in checkasm-sw_scale
 Tested on Linux 32 and 64 bit and mingw32
 libswscale/x86/Makefile |   1 +
 libswscale/x86/swscale.c| 106 +---
 libswscale/x86/yuv2yuvX.asm | 117 
 tests/checkasm/sw_scale.c   |  98 ++
 4 files changed, 246 insertions(+), 76 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..8cd8713705 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,81 +197,30 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
-static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
-   const int16_t **src, uint8_t *dest, int dstW,
-   const uint8_t *dither, int offset)
-{
-if(((uintptr_t)dest) & 15){
-yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
-return;
-}
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa%%xmm3, %%xmm4  \n\t"
-"psrlq$24, %%xmm3  \n\t"
-"psllq$40, %%xmm4  \n\t"
-"por   %%xmm4, %%xmm3  \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-  "m"(filterSize), "m"(((uint64_t *) dither)[0])
-  : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , 
"%xmm5" , "%xmm7" ,)
-"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-  );
-} else {
-__asm__ volatile(
-"movq  %5, %%xmm3   \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-   

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-11 Thread Michael Niedermayer
On Thu, Dec 10, 2020 at 04:46:26PM +0100, Alan Kelly wrote:
> ---
>  Replaces ff_sws_init_swscale_x86 with ff_getSwsFunc
>  Load offset if not gprsize but 8 on both 32 and 64 bit
>  Removes sfence as NT store no longer used
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c| 106 +---
>  libswscale/x86/yuv2yuvX.asm | 117 
>  tests/checkasm/sw_scale.c   | 101 ++-
>  4 files changed, 248 insertions(+), 77 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm

breaks fate on mingw32

make fate-checkasm-sw_scale
TESTcheckasm-sw_scale
Test checkasm-sw_scale failed. Look at tests/data/fate/checkasm-sw_scale.err 
for details.
src/tests/Makefile:255: recipe for target 'fate-checkasm-sw_scale' failed
make: *** [fate-checkasm-sw_scale] Error 5


[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB




[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-10 Thread Alan Kelly
---
 Replaces ff_sws_init_swscale_x86 with ff_getSwsFunc
 Load offset if not gprsize but 8 on both 32 and 64 bit
 Removes sfence as NT store no longer used
 libswscale/x86/Makefile |   1 +
 libswscale/x86/swscale.c| 106 +---
 libswscale/x86/yuv2yuvX.asm | 117 
 tests/checkasm/sw_scale.c   | 101 ++-
 4 files changed, 248 insertions(+), 77 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..8cd8713705 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,81 +197,30 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
-static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
-   const int16_t **src, uint8_t *dest, int dstW,
-   const uint8_t *dither, int offset)
-{
-if(((uintptr_t)dest) & 15){
-yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
-return;
-}
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa%%xmm3, %%xmm4  \n\t"
-"psrlq$24, %%xmm3  \n\t"
-"psllq$40, %%xmm4  \n\t"
-"por   %%xmm4, %%xmm3  \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-  "m"(filterSize), "m"(((uint64_t *) dither)[0])
-  : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , 
"%xmm5" , "%xmm7" ,)
-"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-  );
-} else {
-__asm__ volatile(
-"movq  %5, %%xmm3   \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" 

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-10 Thread Josh Dekker

On 2020/12/09 11:19, Alan Kelly wrote:

---
  Activates avx2 version of yuv2yuvX
  Adds checkasm for yuv2yuvX
  Modifies ff_yuv2yuvX_* signature to match yuv2yuvX_*
  Replaces non-temporal stores with temporal stores
  libswscale/x86/Makefile |   1 +
  libswscale/x86/swscale.c| 106 +---
  libswscale/x86/yuv2yuvX.asm | 118 
  tests/checkasm/sw_scale.c   | 101 +-
  4 files changed, 249 insertions(+), 77 deletions(-)
  create mode 100644 libswscale/x86/yuv2yuvX.asm

[...]
diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index 9efa2b4def..7009169361 100644
--- a/tests/checkasm/sw_scale.c
+++ b/tests/checkasm/sw_scale.c

[...]

+static void check_yuv2yuvX(void)
+{
+struct SwsContext *ctx;
+int fsi, osi;
+#define LARGEST_FILTER 8
+#define FILTER_SIZES 4
+static const int filter_sizes[FILTER_SIZES] = {1, 4, 8, 16};
+
+declare_func_emms(AV_CPU_FLAG_MMX, void, const int16_t *filter,
+  int filterSize, const int16_t **src, uint8_t *dest,
+  int dstW, const uint8_t *dither, int offset);
+
+int dstW = SRC_PIXELS;
+const int16_t **src;
+LOCAL_ALIGNED_32(int16_t, filter_coeff, [LARGEST_FILTER]);
+LOCAL_ALIGNED_32(uint8_t, dst0, [SRC_PIXELS]);
+LOCAL_ALIGNED_32(uint8_t, dst1, [SRC_PIXELS]);
+LOCAL_ALIGNED_32(uint8_t, dither, [SRC_PIXELS]);
+union VFilterData{
+const int16_t *src;
+uint16_t coeff[8];
+} *vFilterData;
+uint8_t d_val = rnd();
+randomize_buffers(filter_coeff, LARGEST_FILTER);
+ctx = sws_alloc_context();
+if (sws_init_context(ctx, NULL, NULL) < 0)
+fail();
+
+ff_sws_init_swscale_x86(ctx);

This should be ff_getSwsFunc() instead.

+for(int i = 0; i < SRC_PIXELS; ++i){
+dither[i] = d_val;
+}
[...]

--
Josh

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-09 Thread Alan Kelly
This function is tested by fate-filter-fps-r. I have also added a checkasm
test and bench.

I have done a lot more testing and benching of this code and I am now happy
to activate the avx2 version because the performance is so good. On my
machine I get the following results for filter size 4 and 0 offset. For all
other sizes/offsets the results are similar:

yuv2yuvX_4_0_mmx:
1567.2 1563.1

yuv2yuvX_4_0_mmxext:
1560.7 1560.1

yuv2yuvX_4_0_sse3:
780.7 572.1 -26.7%

yuv2yuvX_4_0_avx2:
n/a 341.1 -56.3%

Interestingly I discovered that the non-temporal store movntdq results in a
very large variability in the test results, in many cases it significantly
increases the execution time. I have replaced these stores with aligned
stores which stabilised the runtimes. However, I am aware that
benchmarks often don't represent reality and these non-temporal stores were
probably used for a good reason. If you think it better to use NT stores, I
will replace them.


On Fri, Dec 4, 2020 at 2:00 PM Anton Khirnov  wrote:

> Quoting Alan Kelly (2020-11-19 09:41:56)
> > ---
> >  All of Henrik's suggestions have been implemented. Additionally,
> >  m3 and m6 are permuted in avx2 before storing to ensure bit by bit
> >  identical results in avx2.
> >  libswscale/x86/Makefile |   1 +
> >  libswscale/x86/swscale.c|  75 +++
> >  libswscale/x86/yuv2yuvX.asm | 118 
> >  3 files changed, 129 insertions(+), 65 deletions(-)
> >  create mode 100644 libswscale/x86/yuv2yuvX.asm
>
> Is this function tested by FATE?
> I did some brief testing and apparently it gets called during
> fate-filter-shuffleplanes-dup-luma, but the results do not change even
> if I comment out the whole function.
>
> Also, it seems like you are adding an AVX2 version of the function, but
> I don't see it being used.
>
> --
> Anton Khirnov

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-09 Thread Alan Kelly
---
 Activates avx2 version of yuv2yuvX
 Adds checkasm for yuv2yuvX
 Modifies ff_yuv2yuvX_* signature to match yuv2yuvX_*
 Replaces non-temporal stores with temporal stores
 libswscale/x86/Makefile |   1 +
 libswscale/x86/swscale.c| 106 +---
 libswscale/x86/yuv2yuvX.asm | 118 
 tests/checkasm/sw_scale.c   | 101 +-
 4 files changed, 249 insertions(+), 77 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..8cd8713705 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,81 +197,30 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
-static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
-   const int16_t **src, uint8_t *dest, int dstW,
-   const uint8_t *dither, int offset)
-{
-if(((uintptr_t)dest) & 15){
-yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
-return;
-}
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa%%xmm3, %%xmm4  \n\t"
-"psrlq$24, %%xmm3  \n\t"
-"psllq$40, %%xmm4  \n\t"
-"por   %%xmm4, %%xmm3  \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-  "m"(filterSize), "m"(((uint64_t *) dither)[0])
-  : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , 
"%xmm5" , "%xmm7" ,)
-"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-  );
-} else {
-__asm__ volatile(
-"movq  %5, %%xmm3   \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-04 Thread Anton Khirnov
Quoting Alan Kelly (2020-11-19 09:41:56)
> ---
>  All of Henrik's suggestions have been implemented. Additionally,
>  m3 and m6 are permuted in avx2 before storing to ensure bit by bit
>  identical results in avx2.
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c|  75 +++
>  libswscale/x86/yuv2yuvX.asm | 118 
>  3 files changed, 129 insertions(+), 65 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm

Is this function tested by FATE?
I did some brief testing and apparently it gets called during
fate-filter-shuffleplanes-dup-luma, but the results do not change even
if I comment out the whole function.

Also, it seems like you are adding an AVX2 version of the function, but
I don't see it being used.

-- 
Anton Khirnov

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-12-01 Thread Alan Kelly
Ping

On Thu, Nov 19, 2020 at 9:42 AM Alan Kelly  wrote:

> ---
>  All of Henrik's suggestions have been implemented. Additionally,
>  m3 and m6 are permuted in avx2 before storing to ensure bit by bit
>  identical results in avx2.
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c|  75 +++
>  libswscale/x86/yuv2yuvX.asm | 118 
>  3 files changed, 129 insertions(+), 65 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm
>
> [...]

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-19 Thread Alan Kelly
---
 All of Henrik's suggestions have been implemented. Additionally,
 m3 and m6 are permuted in avx2 before storing to ensure bit by bit
 identical results in avx2.
 libswscale/x86/Makefile |   1 +
 libswscale/x86/swscale.c|  75 +++
 libswscale/x86/yuv2yuvX.asm | 118 
 3 files changed, 129 insertions(+), 65 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..758c8e540f 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
+void ff_yuv2yuvX_sse3(const int16_t *filter, long filterSize,
+   uint8_t *dest, int dstW,
+   const uint8_t *dither, int offset);
+
 static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
const int16_t **src, uint8_t *dest, int dstW,
const uint8_t *dither, int offset)
 {
+int remainder = (dstW % 32);
+int pixelsProcessed = dstW - remainder;
 if(((uintptr_t)dest) & 15){
 yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
 return;
 }
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa%%xmm3, %%xmm4  \n\t"
-"psrlq$24, %%xmm3  \n\t"
-"psllq$40, %%xmm4  \n\t"
-"por   %%xmm4, %%xmm3  \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-  "m"(filterSize), "m"(((uint64_t *) dither)[0])
-  : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , 
"%xmm5" , "%xmm7" ,)
-"%"FF_REG_d, "%"FF_REG_S, 

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-17 Thread Henrik Gramner
On Mon, Nov 16, 2020 at 11:03 AM Alan Kelly
 wrote:
> +cglobal yuv2yuvX, 6, 7, 16, filter, filterSize, dest, dstW, dither, offset, src
Only 8 xmm registers are used, so 8 should be used instead of 16 here.
Otherwise it causes unnecessary spilling of registers on 64-bit
Windows.

> +%if ARCH_X86_64
> +%define ptr_size 8
[...]
> +%else
> +%define ptr_size 4
The predefined variable gprsize already exists for this purpose, so
that can be used instead.

> +movq xmm3, [ditherq]
If vpbroadcastq m3, [ditherq] is used for AVX2 here, then the following
> +vperm2i128   m3, m3, m3, 0
instruction can be eliminated.

> +punpcklwd m1, m1
> +punpckldq m1, m1
Can be replaced with pshuflw m1, m1, q0000

> +mov  srcq, [filterSizeq]
> +test srcd, srcd
test srcq, srcq should be used here, since the lower 32 bits of a
valid pointer could randomly happen to be zero on a 64-bit system.

> +REP_RET
Since non-temporal stores are being used, this should be replaced with
sfence
RET
to guarantee proper memory ordering semantics in multi-threaded use
cases. Things will usually work fine without it, but may potentially
break in "fun to debug" ways.

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-16 Thread Alan Kelly
---
 Fixes bug in sse3 path where m1 is not set correctly resulting in off
 by one errors. The results are now bit by bit identical.
 libswscale/x86/Makefile |   1 +
 libswscale/x86/swscale.c|  75 
 libswscale/x86/yuv2yuvX.asm | 114 
 3 files changed, 125 insertions(+), 65 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..758c8e540f 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
+void ff_yuv2yuvX_sse3(const int16_t *filter, long filterSize,
+   uint8_t *dest, int dstW,
+   const uint8_t *dither, int offset);
+
 static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
const int16_t **src, uint8_t *dest, int dstW,
const uint8_t *dither, int offset)
 {
+int remainder = (dstW % 32);
+int pixelsProcessed = dstW - remainder;
 if(((uintptr_t)dest) & 15){
 yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
 return;
 }
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa%%xmm3, %%xmm4  \n\t"
-"psrlq$24, %%xmm3  \n\t"
-"psllq$40, %%xmm4  \n\t"
-"por   %%xmm4, %%xmm3  \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-  "m"(filterSize), "m"(((uint64_t *) dither)[0])
-  : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , 
"%xmm5" , "%xmm7" ,)
-"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-  );
-} 

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-13 Thread Michael Niedermayer
On Thu, Nov 12, 2020 at 09:33:18AM +0100, Alan Kelly wrote:
> ---
>  It now works on x86-32
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c|  75 
>  libswscale/x86/yuv2yuvX.asm | 110 
>  3 files changed, 121 insertions(+), 65 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm

Is this intended to produce bit by bit identical output ?

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The greatest way to live with honor in this world is to be what we pretend
to be. -- Socrates


___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
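Bit-exactness questions like the one above are usually settled by comparing the optimized path against the C reference byte for byte, which is what a checkasm test automates. A minimal self-contained sketch of that comparison follows; note the filter arithmetic here (rounding bias, shift amount) is a toy stand-in, not swscale's actual yuv2yuvX semantics.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy per-pixel kernel: weighted sum over filterSize source rows,
 * rounded, shifted and clamped to 8 bits.  A stand-in for swscale's
 * real arithmetic. */
static uint8_t pixel(const int16_t *filter, int filterSize,
                     const int16_t **src, int i)
{
    int val = 1 << 18;                      /* rounding bias (toy value) */
    for (int j = 0; j < filterSize; j++)
        val += src[j][i] * filter[j];
    val >>= 19;
    return val < 0 ? 0 : val > 255 ? 255 : (uint8_t)val;
}

static void yuv2yuvX_ref(const int16_t *filter, int filterSize,
                         const int16_t **src, uint8_t *dest, int dstW)
{
    for (int i = 0; i < dstW; i++)
        dest[i] = pixel(filter, filterSize, src, i);
}

/* "Optimized" variant: same arithmetic, 4x unrolled main loop plus a
 * scalar tail.  Bit-exact means memcmp() of the two outputs is 0. */
static void yuv2yuvX_unrolled(const int16_t *filter, int filterSize,
                              const int16_t **src, uint8_t *dest, int dstW)
{
    int i = 0;
    for (; i + 4 <= dstW; i += 4) {
        dest[i]     = pixel(filter, filterSize, src, i);
        dest[i + 1] = pixel(filter, filterSize, src, i + 1);
        dest[i + 2] = pixel(filter, filterSize, src, i + 2);
        dest[i + 3] = pixel(filter, filterSize, src, i + 3);
    }
    for (; i < dstW; i++)
        dest[i] = pixel(filter, filterSize, src, i);
}

/* Run both paths over deterministic input; return 1 iff bit-identical. */
static int bit_exact(int dstW)            /* dstW <= 64 in this sketch */
{
    int16_t row0[64], row1[64];
    const int16_t *src[2] = { row0, row1 };
    const int16_t filter[2] = { 2048, 2048 };
    uint8_t ref[64], opt[64];

    for (int i = 0; i < 64; i++) {
        row0[i] = (int16_t)(i * 37);
        row1[i] = (int16_t)(i * 91 - 500);
    }
    yuv2yuvX_ref(filter, 2, src, ref, dstW);
    yuv2yuvX_unrolled(filter, 2, src, opt, dstW);
    return memcmp(ref, opt, (size_t)dstW) == 0;
}
```

The same shape — reference call, optimized call, memcmp — is what a checkasm test performs across randomized widths and inputs.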

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-12 Thread Alan Kelly
---
 It now works on x86-32
 libswscale/x86/Makefile |   1 +
 libswscale/x86/swscale.c|  75 
 libswscale/x86/yuv2yuvX.asm | 110 
 3 files changed, 121 insertions(+), 65 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..758c8e540f 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
+void ff_yuv2yuvX_sse3(const int16_t *filter, long filterSize,
+   uint8_t *dest, int dstW,
+   const uint8_t *dither, int offset);
+
 static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
const int16_t **src, uint8_t *dest, int dstW,
const uint8_t *dither, int offset)
 {
+int remainder = (dstW % 32);
+int pixelsProcessed = dstW - remainder;
 if(((uintptr_t)dest) & 15){
 yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
 return;
 }
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa%%xmm3, %%xmm4  \n\t"
-"psrlq$24, %%xmm3  \n\t"
-"psllq$40, %%xmm4  \n\t"
-"por   %%xmm4, %%xmm3  \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-  "m"(filterSize), "m"(((uint64_t *) dither)[0])
-  : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , 
"%xmm5" , "%xmm7" ,)
-"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-  );
-} else {
-__asm__ volatile(
-"movq  %5, %%xmm3   \n\t"
-

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-10 Thread Carl Eugen Hoyos
Am Fr., 6. Nov. 2020 um 09:04 Uhr schrieb Alan Kelly
:
>
> The function was re-written in asm, this code is heavily derived from the
> original code, the algorithm remains unchanged, the implementation is
> optimized. Would you agree to adding the copyright from swscale.c:
> * Copyright (C) 2001-2011 Michael Niedermayer 
> to this file, having both copyrights?  Thank you.

No real opinion here but your argumentation sounds solid.

Thank you, Carl Eugen

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-10 Thread Michael Niedermayer
On Tue, Nov 10, 2020 at 09:43:47AM +0100, Alan Kelly wrote:
> ---
>  yuv2yuvX.asm: Ports yuv2yuvX to asm, unrolls main loop and adds
>  other small optimizations for ~20% speed-up. Copyright updated to
>  include the original from swscale.c
>  swscale.c: Removes yuv2yuvX_sse3 and calls new function ff_yuv2yuvX_sse3.
>  Calls yuv2yuvX_mmxext on remaining elements if required.
>  Makefile: Compiles yuv2yuvX.asm
> 
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c|  75 
>  libswscale/x86/yuv2yuvX.asm | 110 
>  3 files changed, 121 insertions(+), 65 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm

on x86-32

X86ASM  libswscale/x86/yuv2yuvX.o
src/libswscale/x86/yuv2yuvX.asm:110: error: invalid combination of opcode and operands
src/libswscale/x86/yuv2yuvX.asm:55: ... from macro `YUV2YUVX_FUNC' defined here
src//libavutil/x86/x86inc.asm:1395: ... from macro `movd' defined here
src//libavutil/x86/x86inc.asm:1263: ... from macro `RUN_AVX_INSTR' defined here
/home/michael/ffmpeg-git/ffmpeg/ffbuild/common.mak:89: recipe for target 'libswscale/x86/yuv2yuvX.o' failed
make: *** [libswscale/x86/yuv2yuvX.o] Error 1
make: Target 'all' not remade because of errors.

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

"Nothing to hide" only works if the folks in power share the values of
you and everyone you know entirely and always will -- Tom Scott




[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-10 Thread Alan Kelly
---
 yuv2yuvX.asm: Ports yuv2yuvX to asm, unrolls main loop and adds
 other small optimizations for ~20% speed-up. Copyright updated to
 include the original from swscale.c
 swscale.c: Removes yuv2yuvX_sse3 and calls new function ff_yuv2yuvX_sse3.
 Calls yuv2yuvX_mmxext on remaining elements if required.
 Makefile: Compiles yuv2yuvX.asm

 libswscale/x86/Makefile |   1 +
 libswscale/x86/swscale.c|  75 
 libswscale/x86/yuv2yuvX.asm | 110 
 3 files changed, 121 insertions(+), 65 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..fec9fa22e0 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
+void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize,
+   uint8_t *dest, int dstW,
+   const uint8_t *dither, int offset);
+
 static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
const int16_t **src, uint8_t *dest, int dstW,
const uint8_t *dither, int offset)
 {
+int remainder = (dstW % 32);
+int pixelsProcessed = dstW - remainder;
 if(((uintptr_t)dest) & 15){
 yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
 return;
 }
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa%%xmm3, %%xmm4  \n\t"
-"psrlq$24, %%xmm3  \n\t"
-"psllq$40, %%xmm4  \n\t"
-"por   %%xmm4, %%xmm3  \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-  "m"(filterSize), 

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-11-06 Thread Alan Kelly
The function was re-written in asm, this code is heavily derived from the
original code, the algorithm remains unchanged, the implementation is
optimized. Would you agree to adding the copyright from swscale.c:
* Copyright (C) 2001-2011 Michael Niedermayer 
to this file, having both copyrights?  Thank you.


On Sat, Oct 31, 2020 at 1:02 PM Carl Eugen Hoyos  wrote:

> Am Di., 27. Okt. 2020 um 09:56 Uhr schrieb Alan Kelly
> :
>
> > --- /dev/null
> > +++ b/libswscale/x86/yuv2yuvX.asm
> > @@ -0,0 +1,105 @@
> >
> +;**
> > +;* x86-optimized yuv2yuvX
> > +;* Copyright 2020 Google LLC
>
> Either the commit message ("move a function") or this
> copyright statement is wrong, please fix this.
>
> Please do not commit as-is...
>
> Carl Eugen

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-10-31 Thread Carl Eugen Hoyos
Am Di., 27. Okt. 2020 um 09:56 Uhr schrieb Alan Kelly
:

> --- /dev/null
> +++ b/libswscale/x86/yuv2yuvX.asm
> @@ -0,0 +1,105 @@
> +;**
> +;* x86-optimized yuv2yuvX
> +;* Copyright 2020 Google LLC

Either the commit message ("move a function") or this
copyright statement is wrong, please fix this.

Please do not commit as-is...

Carl Eugen

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-10-27 Thread Alan Kelly
Thanks for the feedback Anton.

The second patch incorporates changes suggested by James Almer:
avx2 instructions are wrapped in if cpuflag(avx2) and movddup restored
mm1 is replaced by m1 on x86_32



On Tue, Oct 27, 2020 at 10:40 AM Anton Khirnov  wrote:

> Hi,
> Quoting Alan Kelly (2020-10-27 10:10:14)
> > ---
> >  libswscale/x86/Makefile |   1 +
> >  libswscale/x86/swscale.c|  75 -
> >  libswscale/x86/yuv2yuvX.asm | 109 
> >  3 files changed, 120 insertions(+), 65 deletions(-)
> >  create mode 100644 libswscale/x86/yuv2yuvX.asm
> >
>
> No comments on the code itself (yet?), but as for your submission:
> - when you send multiple iterations of the same patch, it is helpful to
>   mention what changed, e.g. with git send-email --annotate
> - the commit message should follow the standard format of:
> * swscale: short summary of the change
>
>   Extended description of the commit, if needed.
>
> --
> Anton Khirnov

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-10-27 Thread Anton Khirnov
Hi,
Quoting Alan Kelly (2020-10-27 10:10:14)
> ---
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c|  75 -
>  libswscale/x86/yuv2yuvX.asm | 109 
>  3 files changed, 120 insertions(+), 65 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm
>

No comments on the code itself (yet?), but as for your submission:
- when you send multiple iterations of the same patch, it is helpful to
  mention what changed, e.g. with git send-email --annotate
- the commit message should follow the standard format of:
* swscale: short summary of the change

  Extended description of the commit, if needed.

-- 
Anton Khirnov

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-10-27 Thread Alan Kelly
---
 libswscale/x86/Makefile |   1 +
 libswscale/x86/swscale.c|  75 -
 libswscale/x86/yuv2yuvX.asm | 109 
 3 files changed, 120 insertions(+), 65 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..fec9fa22e0 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
+void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize,
+   uint8_t *dest, int dstW,
+   const uint8_t *dither, int offset);
+
 static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
const int16_t **src, uint8_t *dest, int dstW,
const uint8_t *dither, int offset)
 {
+int remainder = (dstW % 32);
+int pixelsProcessed = dstW - remainder;
 if(((uintptr_t)dest) & 15){
 yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
 return;
 }
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa%%xmm3, %%xmm4  \n\t"
-"psrlq$24, %%xmm3  \n\t"
-"psllq$40, %%xmm4  \n\t"
-"por   %%xmm4, %%xmm3  \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-  "m"(filterSize), "m"(((uint64_t *) dither)[0])
-  : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , 
"%xmm5" , "%xmm7" ,)
-"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-  );
-} else {
-__asm__ volatile(
-"movq  %5, %%xmm3   \n\t"
-MAIN_FUNCTION
-  :: "g" 

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-10-27 Thread Alan Kelly
Apologies for the multiple threads, my git send-email was wrongly
configured. This has been fixed.

This code has been tested on AVX2 giving a significant speedup, however,
until the ff_hscale* functions are ported to avx2, this should not be
enabled as it results in an overall slowdown of swscale probably due to cpu
frequency scaling.

checkasm will follow in a separate patch.

On Tue, Oct 27, 2020 at 9:56 AM Alan Kelly  wrote:

> ---
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c|  75 --
>  libswscale/x86/yuv2yuvX.asm | 105 
>  3 files changed, 116 insertions(+), 65 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm
>
> diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
> index 831d5359aa..bfe383364e 100644
> --- a/libswscale/x86/Makefile
> +++ b/libswscale/x86/Makefile
> @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
> \
> x86/scale.o  \
> x86/rgb_2_rgb.o  \
> x86/yuv_2_rgb.o  \
> +   x86/yuv2yuvX.o   \
> diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
> index 3160fedf04..fec9fa22e0 100644
> --- a/libswscale/x86/swscale.c
> +++ b/libswscale/x86/swscale.c
> @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int
> dstY)
>  }
>
>  #if HAVE_MMXEXT
> +void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize,
> +   uint8_t *dest, int dstW,
> +   const uint8_t *dither, int offset);
> +
>  static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
> const int16_t **src, uint8_t *dest, int dstW,
> const uint8_t *dither, int offset)
>  {
> +int remainder = (dstW % 32);
> +int pixelsProcessed = dstW - remainder;
>  if(((uintptr_t)dest) & 15){
>  yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither,
> offset);
>  return;
>  }
> -filterSize--;
> -#define MAIN_FUNCTION \
> -"pxor   %%xmm0, %%xmm0 \n\t" \
> -"punpcklbw  %%xmm0, %%xmm3 \n\t" \
> -"movd   %4, %%xmm1 \n\t" \
> -"punpcklwd  %%xmm1, %%xmm1 \n\t" \
> -"punpckldq  %%xmm1, %%xmm1 \n\t" \
> -"punpcklqdq %%xmm1, %%xmm1 \n\t" \
> -"psllw  $3, %%xmm1 \n\t" \
> -"paddw  %%xmm1, %%xmm3 \n\t" \
> -"psraw  $4, %%xmm3 \n\t" \
> -"movdqa %%xmm3, %%xmm4 \n\t" \
> -"movdqa %%xmm3, %%xmm7 \n\t" \
> -"movl   %3, %%ecx  \n\t" \
> -"mov %0, %%"FF_REG_d"
> \n\t"\
> -"mov(%%"FF_REG_d"), %%"FF_REG_S"
>  \n\t"\
> -".p2align 4 \n\t" /*
> FIXME Unroll? */\
> -"1: \n\t"\
> -"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /*
> filterCoeff */\
> -"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2
> \n\t" /* srcData */\
> -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5
> \n\t" /* srcData */\
> -"add$16, %%"FF_REG_d"
> \n\t"\
> -"mov(%%"FF_REG_d"), %%"FF_REG_S"
>  \n\t"\
> -"test %%"FF_REG_S", %%"FF_REG_S"
>  \n\t"\
> -"pmulhw   %%xmm0, %%xmm2  \n\t"\
> -"pmulhw   %%xmm0, %%xmm5  \n\t"\
> -"paddw%%xmm2, %%xmm3  \n\t"\
> -"paddw%%xmm5, %%xmm4  \n\t"\
> -" jnz1b \n\t"\
> -"psraw   $3, %%xmm3  \n\t"\
> -"psraw   $3, %%xmm4  \n\t"\
> -"packuswb %%xmm4, %%xmm3  \n\t"\
> -"movntdq  %%xmm3, (%1, %%"FF_REG_c")
> \n\t"\
> -"add $16, %%"FF_REG_c"\n\t"\
> -"cmp  %2, %%"FF_REG_c"\n\t"\
> -"movdqa   %%xmm7, %%xmm3\n\t" \
> -"movdqa   %%xmm7, %%xmm4\n\t" \
> -"mov %0, %%"FF_REG_d"
> \n\t"\
> -"mov(%%"FF_REG_d"), %%"FF_REG_S"
>  \n\t"\
> -"jb  1b \n\t"
> -
> -if (offset) {
> -__asm__ volatile(
> -"movq  %5, %%xmm3  \n\t"
> -"movdqa%%xmm3, %%xmm4  \n\t"
> -"psrlq$24, %%xmm3  \n\t"
> -

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, however, although local tests show a signifi

2020-10-27 Thread Alan Kelly
Thanks for the review, I have made the required changes. As I have changed
the subject the patch is in a new thread.

On Fri, Oct 23, 2020 at 4:10 PM James Almer  wrote:

> On 10/23/2020 10:17 AM, Alan Kelly wrote:
> >  Fixed. The wrong step size was used causing a write past the end of
> >  the buffer. yuv2yuvX_mmxext is now called if there are any remaining
> pixels.
>
> Please fix the commit subject (It's too long and contains commentary),
> and keep comments about fixes between versions outside of the commit
> message body. You can manually place them after the --- below, or in a
> separate reply.
>
> > ---
> >  libswscale/x86/Makefile |   1 +
> >  libswscale/x86/swscale.c|  75 --
> >  libswscale/x86/yuv2yuvX.asm | 105 
> >  3 files changed, 116 insertions(+), 65 deletions(-)
> >  create mode 100644 libswscale/x86/yuv2yuvX.asm
> >
> > diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
> > index 831d5359aa..bfe383364e 100644
> > --- a/libswscale/x86/Makefile
> > +++ b/libswscale/x86/Makefile
> > @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
>   \
> > x86/scale.o
> \
> > x86/rgb_2_rgb.o
> \
> > x86/yuv_2_rgb.o
> \
> > +   x86/yuv2yuvX.o
>  \
> > diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
> > index 3160fedf04..fec9fa22e0 100644
> > --- a/libswscale/x86/swscale.c
> > +++ b/libswscale/x86/swscale.c
> > @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int
> dstY)
> >  }
> >
> >  #if HAVE_MMXEXT
> > +void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize,
> > +   uint8_t *dest, int dstW,
> > +   const uint8_t *dither, int offset);
> > +
> >  static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
> > const int16_t **src, uint8_t *dest, int dstW,
> > const uint8_t *dither, int offset)
> >  {
> > +int remainder = (dstW % 32);
> > +int pixelsProcessed = dstW - remainder;
> >  if(((uintptr_t)dest) & 15){
> >  yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither,
> offset);
> >  return;
> >  }
> > -filterSize--;
> > -#define MAIN_FUNCTION \
> > -"pxor   %%xmm0, %%xmm0 \n\t" \
> > -"punpcklbw  %%xmm0, %%xmm3 \n\t" \
> > -"movd   %4, %%xmm1 \n\t" \
> > -"punpcklwd  %%xmm1, %%xmm1 \n\t" \
> > -"punpckldq  %%xmm1, %%xmm1 \n\t" \
> > -"punpcklqdq %%xmm1, %%xmm1 \n\t" \
> > -"psllw  $3, %%xmm1 \n\t" \
> > -"paddw  %%xmm1, %%xmm3 \n\t" \
> > -"psraw  $4, %%xmm3 \n\t" \
> > -"movdqa %%xmm3, %%xmm4 \n\t" \
> > -"movdqa %%xmm3, %%xmm7 \n\t" \
> > -"movl   %3, %%ecx  \n\t" \
> > -"mov %0, %%"FF_REG_d"
> \n\t"\
> > -"mov(%%"FF_REG_d"), %%"FF_REG_S"
>  \n\t"\
> > -".p2align 4 \n\t" /*
> FIXME Unroll? */\
> > -"1: \n\t"\
> > -"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /*
> filterCoeff */\
> > -"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2
> \n\t" /* srcData */\
> > -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5
> \n\t" /* srcData */\
> > -"add$16, %%"FF_REG_d"
> \n\t"\
> > -"mov(%%"FF_REG_d"), %%"FF_REG_S"
>  \n\t"\
> > -"test %%"FF_REG_S", %%"FF_REG_S"
>  \n\t"\
> > -"pmulhw   %%xmm0, %%xmm2  \n\t"\
> > -"pmulhw   %%xmm0, %%xmm5  \n\t"\
> > -"paddw%%xmm2, %%xmm3  \n\t"\
> > -"paddw%%xmm5, %%xmm4  \n\t"\
> > -" jnz1b \n\t"\
> > -"psraw   $3, %%xmm3  \n\t"\
> > -"psraw   $3, %%xmm4  \n\t"\
> > -"packuswb %%xmm4, %%xmm3  \n\t"\
> > -"movntdq  %%xmm3, (%1, %%"FF_REG_c")
> \n\t"\
> > -"add $16, %%"FF_REG_c"\n\t"\
> > -"cmp  %2, %%"FF_REG_c"\n\t"\
> > -"movdqa   %%xmm7, %%xmm3\n\t" \
> > -"movdqa   %%xmm7, %%xmm4\n\t" \
> > -"mov %0, %%"FF_REG_d"
> \n\t"\
> > -"mov(%%"FF_REG_d"), %%"FF_REG_S"
>  \n\t"\
> > -"jb 

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

2020-10-27 Thread Alan Kelly
---
 libswscale/x86/Makefile |   1 +
 libswscale/x86/swscale.c|  75 --
 libswscale/x86/yuv2yuvX.asm | 105 
 3 files changed, 116 insertions(+), 65 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..fec9fa22e0 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
+void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize,
+   uint8_t *dest, int dstW,
+   const uint8_t *dither, int offset);
+
 static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
const int16_t **src, uint8_t *dest, int dstW,
const uint8_t *dither, int offset)
 {
+int remainder = (dstW % 32);
+int pixelsProcessed = dstW - remainder;
 if(((uintptr_t)dest) & 15){
 yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
 return;
 }
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa%%xmm3, %%xmm4  \n\t"
-"psrlq$24, %%xmm3  \n\t"
-"psllq$40, %%xmm4  \n\t"
-"por   %%xmm4, %%xmm3  \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-  "m"(filterSize), "m"(((uint64_t *) dither)[0])
-  : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , 
"%xmm5" , "%xmm7" ,)
-"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-  );
-} else {
-__asm__ volatile(
-"movq  %5, %%xmm3   \n\t"
-MAIN_FUNCTION
-  :: "g" 

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant spee

2020-10-24 Thread Michael Niedermayer
On Fri, Oct 23, 2020 at 03:34:18PM +0200, Alan Kelly wrote:
>  Fixed. The wrong step size was used causing a write past the end of
>  the buffer. yuv2yuvX_mmxext is now called if there are any remaining
> pixels.
> 
>  There is currently no checkasm for these functions. Is this required for
> submission?
> 
>  (Apologies for the double mail, I used git send-email but it didn't
> respond to the correct thread)
> ---
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c|  75 --
>  libswscale/x86/yuv2yuvX.asm | 105 
>  3 files changed, 116 insertions(+), 65 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm

error: corrupt patch at line 18

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The worst form of inequality is to try to make unequal things equal.
-- Aristotle



Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, however, although local tests show a signifi

2020-10-23 Thread James Almer
On 10/23/2020 10:17 AM, Alan Kelly wrote:
>  Fixed. The wrong step size was used causing a write past the end of
>  the buffer. yuv2yuvX_mmxext is now called if there are any remaining pixels.

Please fix the commit subject (It's too long and contains commentary),
and keep comments about fixes between versions outside of the commit
message body. You can manually place them after the --- below, or in a
separate reply.

> ---
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c|  75 --
>  libswscale/x86/yuv2yuvX.asm | 105 
>  3 files changed, 116 insertions(+), 65 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm
> 
> diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
> index 831d5359aa..bfe383364e 100644
> --- a/libswscale/x86/Makefile
> +++ b/libswscale/x86/Makefile
> @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o  
> \
> x86/scale.o  \
> x86/rgb_2_rgb.o  \
> x86/yuv_2_rgb.o  \
> +   x86/yuv2yuvX.o   \
> diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
> index 3160fedf04..fec9fa22e0 100644
> --- a/libswscale/x86/swscale.c
> +++ b/libswscale/x86/swscale.c
> @@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
>  }
>  
>  #if HAVE_MMXEXT
> +void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize,
> +   uint8_t *dest, int dstW,
> +   const uint8_t *dither, int offset);
> +
>  static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
> const int16_t **src, uint8_t *dest, int dstW,
> const uint8_t *dither, int offset)
>  {
> +int remainder = (dstW % 32);
> +int pixelsProcessed = dstW - remainder;
>  if(((uintptr_t)dest) & 15){
>  yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
>  return;
>  }
> -filterSize--;
> -#define MAIN_FUNCTION \
> -"pxor   %%xmm0, %%xmm0 \n\t" \
> -"punpcklbw  %%xmm0, %%xmm3 \n\t" \
> -"movd   %4, %%xmm1 \n\t" \
> -"punpcklwd  %%xmm1, %%xmm1 \n\t" \
> -"punpckldq  %%xmm1, %%xmm1 \n\t" \
> -"punpcklqdq %%xmm1, %%xmm1 \n\t" \
> -"psllw  $3, %%xmm1 \n\t" \
> -"paddw  %%xmm1, %%xmm3 \n\t" \
> -"psraw  $4, %%xmm3 \n\t" \
> -"movdqa %%xmm3, %%xmm4 \n\t" \
> -"movdqa %%xmm3, %%xmm7 \n\t" \
> -"movl   %3, %%ecx  \n\t" \
> -"mov %0, %%"FF_REG_d"\n\t"\
> -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
> -".p2align 4 \n\t" /* FIXME 
> Unroll? */\
> -"1: \n\t"\
> -"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
> filterCoeff */\
> -"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" 
> /* srcData */\
> -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" 
> /* srcData */\
> -"add$16, %%"FF_REG_d"\n\t"\
> -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
> -"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
> -"pmulhw   %%xmm0, %%xmm2  \n\t"\
> -"pmulhw   %%xmm0, %%xmm5  \n\t"\
> -"paddw%%xmm2, %%xmm3  \n\t"\
> -"paddw%%xmm5, %%xmm4  \n\t"\
> -" jnz1b \n\t"\
> -"psraw   $3, %%xmm3  \n\t"\
> -"psraw   $3, %%xmm4  \n\t"\
> -"packuswb %%xmm4, %%xmm3  \n\t"\
> -"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
> -"add $16, %%"FF_REG_c"\n\t"\
> -"cmp  %2, %%"FF_REG_c"\n\t"\
> -"movdqa   %%xmm7, %%xmm3\n\t" \
> -"movdqa   %%xmm7, %%xmm4\n\t" \
> -"mov %0, %%"FF_REG_d"\n\t"\
> -"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
> -"jb  1b \n\t"
> -
> -if (offset) {
> -__asm__ volatile(
> -"movq  %5, %%xmm3  \n\t"
> -"movdqa%%xmm3, %%xmm4  \n\t"
> -"psrlq$24, %%xmm3  \n\t"

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant speed-up

2020-10-23 Thread Alan Kelly
 Fixed. The wrong step size was used, causing a write past the end of
 the buffer. yuv2yuvX_mmxext is now called if there are any remaining
pixels.
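
The dispatch shape described here (a vector kernel that only handles multiples of 32 pixels, with the existing scalar path picking up the tail) can be sketched as below. All names are illustrative stand-ins, not the patch's actual symbols:

```c
#include <stdint.h>

/* Stand-in for the SIMD kernel: the caller guarantees n is a multiple of 32. */
static void scale_block32(const uint8_t *src, uint8_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] >> 1;
}

/* Stand-in for the existing scalar (mmxext) fallback path. */
static void scale_scalar(const uint8_t *src, uint8_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] >> 1;
}

/* Wrapper: peel dstW down to a multiple of 32 for the fast path,
 * then hand any leftover pixels to the scalar path. */
static void scale(const uint8_t *src, uint8_t *dst, int dstW)
{
    int remainder       = dstW % 32;
    int pixelsProcessed = dstW - remainder;
    if (pixelsProcessed)
        scale_block32(src, dst, pixelsProcessed);
    if (remainder)
        scale_scalar(src + pixelsProcessed, dst + pixelsProcessed, remainder);
}
```

Getting this split wrong (e.g. using the wrong step size) is exactly what produced the out-of-bounds write mentioned above.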

 There is currently no checkasm for these functions. Is this required for
submission?
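
For context, a checkasm test essentially runs the reference C implementation and the optimized one on the same randomized input and demands bit-identical output. A rough standalone sketch of that idea (illustrative names only, not checkasm's actual `declare_func`/`call_ref`/`call_new` API):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef void (*scale_fn)(const uint8_t *src, uint8_t *dst, int n);

/* Reference implementation. */
static void scale_ref(const uint8_t *src, uint8_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)((src[i] * 3) >> 2);
}

/* Stand-in for the SIMD version under test (here: manually unrolled by 4,
 * with a scalar tail, so odd sizes are also exercised). */
static void scale_opt(const uint8_t *src, uint8_t *dst, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = (uint8_t)((src[i + 0] * 3) >> 2);
        dst[i + 1] = (uint8_t)((src[i + 1] * 3) >> 2);
        dst[i + 2] = (uint8_t)((src[i + 2] * 3) >> 2);
        dst[i + 3] = (uint8_t)((src[i + 3] * 3) >> 2);
    }
    for (; i < n; i++)
        dst[i] = (uint8_t)((src[i] * 3) >> 2);
}

/* Run both on the same random input; returns 1 if outputs match exactly. */
static int check_impl(scale_fn ref, scale_fn opt, int n)
{
    uint8_t src[1024], out_ref[1024], out_opt[1024];
    for (int i = 0; i < n; i++)
        src[i] = (uint8_t)rand();
    ref(src, out_ref, n);
    opt(src, out_opt, n);
    return memcmp(out_ref, out_opt, n) == 0;
}
```

A real checkasm test would additionally vary alignment and buffer sizes, and check for writes past the end of the destination.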

 (Apologies for the double mail; I used git send-email but it didn't
reply to the correct thread.)
---
 libswscale/x86/Makefile |   1 +
 libswscale/x86/swscale.c|  75 --
 libswscale/x86/yuv2yuvX.asm | 105 
 3 files changed, 116 insertions(+), 65 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
   \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..fec9fa22e0 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,80 +197,25 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }

 #if HAVE_MMXEXT
+void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize,
+   uint8_t *dest, int dstW,
+   const uint8_t *dither, int offset);
+
 static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
const int16_t **src, uint8_t *dest, int dstW,
const uint8_t *dither, int offset)
 {
+int remainder = (dstW % 32);
+int pixelsProcessed = dstW - remainder;
 if(((uintptr_t)dest) & 15){
 yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither,
offset);
 return;
 }
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /*
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t"
/* srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t"
/* srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa%%xmm3, %%xmm4  \n\t"
-"psrlq$24, %%xmm3  \n\t"
-"psllq$40, %%xmm4  \n\t"
-"por   %%xmm4, %%xmm3  \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m"
(offset),
-  "m"(filterSize), "m"(((uint64_t *) 

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant speed-up

2020-10-22 Thread Michael Niedermayer
On Thu, Oct 22, 2020 at 09:43:53AM +0200, Alan Kelly wrote:
> Other functions to be ported to avx2 have been identified and are on
> the todo list.
> ---
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c|  72 +++--
>  libswscale/x86/yuv2yuvX.asm | 105 
>  3 files changed, 112 insertions(+), 66 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm

Breaks:

./ffmpeg -i ~/vlcticket/5887/Cruise\ 2012_07_29_19_02_16.wmv -an -vcodec mjpeg -vf scale=800:600:interl=1 -qscale 1 out.avi

(the output file has artifacts at the left side)

The input may be found here:
https://trac.videolan.org/vlc/attachment/ticket/7246/Cruise%202012_07_29_19_02_16.wmv

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

It's not that you shouldn't use gotos, but rather that you should write
readable code; code with gotos often, but not always, is less readable

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Re: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant speed-up

2020-10-22 Thread Jean-Baptiste Kempf
Do we have checkasm for those functions?

On Thu, 22 Oct 2020, at 09:43, Alan Kelly wrote:
> Other functions to be ported to avx2 have been identified and are on
> the todo list.
> ---
>  libswscale/x86/Makefile |   1 +
>  libswscale/x86/swscale.c|  72 +++--
>  libswscale/x86/yuv2yuvX.asm | 105 
>  3 files changed, 112 insertions(+), 66 deletions(-)
>  create mode 100644 libswscale/x86/yuv2yuvX.asm
> 
> diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
> index 831d5359aa..bfe383364e 100644
> --- a/libswscale/x86/Makefile
> +++ b/libswscale/x86/Makefile
> @@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
>   \
> x86/scale.o 
>  \
> x86/rgb_2_rgb.o 
>  \
> x86/yuv_2_rgb.o 
>  \
> +   x86/yuv2yuvX.o  
>  \
> diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
> index 3160fedf04..ea83b097ca 100644
> --- a/libswscale/x86/swscale.c
> +++ b/libswscale/x86/swscale.c
> @@ -197,6 +197,10 @@ void ff_updateMMXDitherTables(SwsContext *c, int 
> dstY)
>  }
>  
>  #if HAVE_MMXEXT
> +void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize,
> +   uint8_t *dest, int dstW,
> +   const uint8_t *dither, int offset);
> +
>  static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
> const int16_t **src, uint8_t *dest, int 
> dstW,
> const uint8_t *dither, int offset)
> @@ -205,72 +209,8 @@ static void yuv2yuvX_sse3(const int16_t *filter, 
> int filterSize,
>  yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, 
> offset);
>  return;
>  }
> -filterSize--;
> -#define MAIN_FUNCTION \
> -"pxor   %%xmm0, %%xmm0 \n\t" \
> -"punpcklbw  %%xmm0, %%xmm3 \n\t" \
> -"movd   %4, %%xmm1 \n\t" \
> -"punpcklwd  %%xmm1, %%xmm1 \n\t" \
> -"punpckldq  %%xmm1, %%xmm1 \n\t" \
> -"punpcklqdq %%xmm1, %%xmm1 \n\t" \
> -"psllw  $3, %%xmm1 \n\t" \
> -"paddw  %%xmm1, %%xmm3 \n\t" \
> -"psraw  $4, %%xmm3 \n\t" \
> -"movdqa %%xmm3, %%xmm4 \n\t" \
> -"movdqa %%xmm3, %%xmm7 \n\t" \
> -"movl   %3, %%ecx  \n\t" \
> -"mov %0, %%"FF_REG_d"
> \n\t"\
> -"mov(%%"FF_REG_d"), %%"FF_REG_S" 
> \n\t"\
> -".p2align 4 \n\t" /* 
> FIXME Unroll? */\
> -"1: \n\t"\
> -"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
> filterCoeff */\
> -"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 
> \n\t" /* srcData */\
> -"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 
> \n\t" /* srcData */\
> -"add$16, %%"FF_REG_d"
> \n\t"\
> -"mov(%%"FF_REG_d"), %%"FF_REG_S" 
> \n\t"\
> -"test %%"FF_REG_S", %%"FF_REG_S" 
> \n\t"\
> -"pmulhw   %%xmm0, %%xmm2  \n\t"\
> -"pmulhw   %%xmm0, %%xmm5  \n\t"\
> -"paddw%%xmm2, %%xmm3  \n\t"\
> -"paddw%%xmm5, %%xmm4  \n\t"\
> -" jnz1b \n\t"\
> -"psraw   $3, %%xmm3  \n\t"\
> -"psraw   $3, %%xmm4  \n\t"\
> -"packuswb %%xmm4, %%xmm3  \n\t"\
> -"movntdq  %%xmm3, (%1, %%"FF_REG_c") 
> \n\t"\
> -"add $16, %%"FF_REG_c"\n\t"\
> -"cmp  %2, %%"FF_REG_c"\n\t"\
> -"movdqa   %%xmm7, %%xmm3\n\t" \
> -"movdqa   %%xmm7, %%xmm4\n\t" \
> -"mov %0, %%"FF_REG_d"
> \n\t"\
> -"mov(%%"FF_REG_d"), %%"FF_REG_S" 
> \n\t"\
> -"jb  1b \n\t"
> -
> -if (offset) {
> -__asm__ volatile(
> -"movq  %5, %%xmm3  \n\t"
> -"movdqa%%xmm3, %%xmm4  \n\t"
> -"psrlq$24, %%xmm3  \n\t"
> -"psllq$40, %%xmm4  \n\t"
> -"por   %%xmm4, %%xmm3  \n\t"
> -MAIN_FUNCTION
> -  :: "g" (filter),
> -  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" 
> 

[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup. AVX2 version is ready and tested, although local tests show a significant speed-up

2020-10-22 Thread Alan Kelly
Other functions to be ported to avx2 have been identified and are on
the todo list.
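
Per output pixel, the loop being ported is a multiply-accumulate over the vertical filter taps, followed by dithering, shifting, and clipping. A hedged scalar sketch, modelled on libswscale's C reference for this function (helper names here are illustrative, not FFmpeg's):

```c
#include <stdint.h>

/* Clamp an int to the 0..255 range of an 8-bit plane. */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar sketch of yuv2yuvX: for each destination pixel, accumulate
 * src[j][i] * filter[j] over all filterSize taps (16-bit samples and
 * coefficients), seed with the dither value scaled into the accumulator's
 * fixed-point domain, then shift back down to 8 bits and clip. */
static void yuv2yuvX_ref(const int16_t *filter, int filterSize,
                         const int16_t **src, uint8_t *dest, int dstW,
                         const uint8_t *dither, int offset)
{
    for (int i = 0; i < dstW; i++) {
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];
        dest[i] = clip_uint8(val >> 19);
    }
}
```

The SIMD versions compute the same thing in a slightly rearranged fixed-point form (pmulhw keeps only the high 16 bits of each product, with the remaining shifts split across the setup and the final store).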
---
 libswscale/x86/Makefile |   1 +
 libswscale/x86/swscale.c|  72 +++--
 libswscale/x86/yuv2yuvX.asm | 105 
 3 files changed, 112 insertions(+), 66 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS += x86/input.o
  \
x86/scale.o  \
x86/rgb_2_rgb.o  \
x86/yuv_2_rgb.o  \
+   x86/yuv2yuvX.o   \
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 3160fedf04..ea83b097ca 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -197,6 +197,10 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
+void ff_yuv2yuvX_sse3(const int16_t *filter, int filterSize,
+   uint8_t *dest, int dstW,
+   const uint8_t *dither, int offset);
+
 static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
const int16_t **src, uint8_t *dest, int dstW,
const uint8_t *dither, int offset)
@@ -205,72 +209,8 @@ static void yuv2yuvX_sse3(const int16_t *filter, int 
filterSize,
 yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
 return;
 }
-filterSize--;
-#define MAIN_FUNCTION \
-"pxor   %%xmm0, %%xmm0 \n\t" \
-"punpcklbw  %%xmm0, %%xmm3 \n\t" \
-"movd   %4, %%xmm1 \n\t" \
-"punpcklwd  %%xmm1, %%xmm1 \n\t" \
-"punpckldq  %%xmm1, %%xmm1 \n\t" \
-"punpcklqdq %%xmm1, %%xmm1 \n\t" \
-"psllw  $3, %%xmm1 \n\t" \
-"paddw  %%xmm1, %%xmm3 \n\t" \
-"psraw  $4, %%xmm3 \n\t" \
-"movdqa %%xmm3, %%xmm4 \n\t" \
-"movdqa %%xmm3, %%xmm7 \n\t" \
-"movl   %3, %%ecx  \n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-".p2align 4 \n\t" /* FIXME 
Unroll? */\
-"1: \n\t"\
-"movddup  8(%%"FF_REG_d"), %%xmm0   \n\t" /* 
filterCoeff */\
-"movdqa  (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2 \n\t" /* 
srcData */\
-"movdqa16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* 
srcData */\
-"add$16, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"test %%"FF_REG_S", %%"FF_REG_S" \n\t"\
-"pmulhw   %%xmm0, %%xmm2  \n\t"\
-"pmulhw   %%xmm0, %%xmm5  \n\t"\
-"paddw%%xmm2, %%xmm3  \n\t"\
-"paddw%%xmm5, %%xmm4  \n\t"\
-" jnz1b \n\t"\
-"psraw   $3, %%xmm3  \n\t"\
-"psraw   $3, %%xmm4  \n\t"\
-"packuswb %%xmm4, %%xmm3  \n\t"\
-"movntdq  %%xmm3, (%1, %%"FF_REG_c") \n\t"\
-"add $16, %%"FF_REG_c"\n\t"\
-"cmp  %2, %%"FF_REG_c"\n\t"\
-"movdqa   %%xmm7, %%xmm3\n\t" \
-"movdqa   %%xmm7, %%xmm4\n\t" \
-"mov %0, %%"FF_REG_d"\n\t"\
-"mov(%%"FF_REG_d"), %%"FF_REG_S" \n\t"\
-"jb  1b \n\t"
-
-if (offset) {
-__asm__ volatile(
-"movq  %5, %%xmm3  \n\t"
-"movdqa%%xmm3, %%xmm4  \n\t"
-"psrlq$24, %%xmm3  \n\t"
-"psllq$40, %%xmm4  \n\t"
-"por   %%xmm4, %%xmm3  \n\t"
-MAIN_FUNCTION
-  :: "g" (filter),
-  "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-  "m"(filterSize), "m"(((uint64_t *) dither)[0])
-  : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , 
"%xmm5" , "%xmm7" ,)
-"%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-  );
-} else {
-__asm__ volatile(
-"movq  %5, %%xmm3