[FFmpeg-devel] [PATCH] RFC: v210enc optimisations and initial AVX-512

2022-10-20 Thread Kieran Kunhya
Hi,

Please see attached an attempt to optimise the 8-bit input to v210enc to
reduce the number of shuffles.
This comes at the cost of having to extract the middle element, perform
a DWORD shift on it and then reinsert it.
I have added a few comments but any other ideas are welcome.
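
For readers following along, here is a minimal scalar sketch in C of what the 8-bit v210 pack computes. It is not code from the attached patch. v210 stores three 10-bit samples per 32-bit little-endian word in the order Cb Y Cr, Y Cb Y, Cr Y Cb, Y Cr Y, and 8-bit input is scaled up by a left shift of 2. The helper names (clip8, v210_word, v210_pack6_8bit) are hypothetical, and the clamp range of 1..254 is an assumption.

```c
#include <stdint.h>

/* Assumed clamp range: keeps samples away from the extreme code values. */
static uint32_t clip8(uint32_t x)
{
    return x < 1 ? 1 : (x > 254 ? 254 : x);
}

/* Pack three 8-bit samples into one v210 dword. Each sample is scaled
 * to 10 bits (<< 2) and placed at bit offsets 0, 10 and 20, so the
 * combined shifts are 2, 12 and 22. */
static uint32_t v210_word(uint8_t a, uint8_t b, uint8_t c)
{
    return (clip8(a) << 2) | (clip8(b) << 12) | (clip8(c) << 22);
}

/* Hypothetical helper: 6 pixels (6 luma, 3 Cb, 3 Cr) become 4 dwords. */
static void v210_pack6_8bit(const uint8_t *y, const uint8_t *u,
                            const uint8_t *v, uint32_t dst[4])
{
    dst[0] = v210_word(u[0], y[0], v[0]);
    dst[1] = v210_word(y[1], u[1], y[2]);
    dst[2] = v210_word(v[1], y[3], u[2]);
    dst[3] = v210_word(y[4], v[2], y[5]);
}
```

The middle sample of every dword is the one that needs a shift different from its neighbours, which is why the SIMD versions treat it separately.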

Crude benchmarks on Intel(R) Xeon(R) D-2123IT:

Before:

v210_planar_pack_8_ssse3: 316.5
v210_planar_pack_8_avx: 319.0
v210_planar_pack_8_avx2: 223.0

After:

v210_planar_pack_8_ssse3: 321.0
v210_planar_pack_8_avx: 326.0
v210_planar_pack_8_avx2: 217.0
v210_planar_pack_8_avx512: 211.0

Regards,
Kieran Kunhya


0001-RFC-v210enc-optimisations-and-initial-AVX-512.patch
Description: Binary data
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH] RFC: v210enc optimisations and initial AVX-512

2022-10-21 Thread Henrik Gramner
On Fri, Oct 21, 2022 at 5:41 AM Kieran Kunhya wrote:
>
> Hi,
>
> Please see attached an attempt to optimise the 8-bit input to v210enc to
> reduce the number of shuffles.
> This comes at the cost of having to extract the middle element, perform
> a DWORD shift on it and then reinsert it.
> I have added a few comments but any other ideas are welcome.

Random untested idea:

A: db 32,  0, 48, -1,  1, 33,  2, -1, 49,  3, 34, -1,  4, 50,  5, -1
   db 35,  6, 51, -1,  7, 36,  8, -1, 52,  9, 37, -1, 10, 53, 11, -1
   db 38, 12, 54, -1, 13, 39, 14, -1, 55, 15, 40, -1, 16, 56, 17, -1
   db 41, 18, 57, -1, 19, 42, 20, -1, 58, 21, 43, -1, 22, 59, 23, -1
B: db  1,  0, 16,  0
C: dd 0x0003fc00
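
A small C model of table A may help when reading it. It assumes the source register holds 32 bytes of luma at bytes 0..31, Cb at 32..47 and Cr at 48..63, which matches the loads in the loop of this post. The function names are hypothetical. vpermb uses only the low 6 bits of each index for a 64-byte register, so unlike pshufb a -1 entry does not zero the byte: it selects byte 63. In this sketch that slot is simply a don't-care.

```c
#include <stdint.h>
#include <string.h>

/* Model of a 64-byte vpermb: dst[i] = src[idx[i] & 63]. */
static void permb64(uint8_t dst[64], const uint8_t idx[64],
                    const uint8_t src[64])
{
    uint8_t tmp[64];
    for (int i = 0; i < 64; i++)
        tmp[i] = src[idx[i] & 63];
    memcpy(dst, tmp, 64); /* tmp allows dst and src to alias */
}

/* Rebuild table A from its pattern: u y v x, y u y x, v y u x, y v y x. */
static void build_table_a(uint8_t idx[64])
{
    int y = 0, u = 32, v = 48;
    for (int i = 0; i < 64; i += 16) {
        uint8_t *p = idx + i;
        p[0]  = u++; p[1]  = y++; p[2]  = v++; p[3]  = 0xff;
        p[4]  = y++; p[5]  = u++; p[6]  = y++; p[7]  = 0xff;
        p[8]  = v++; p[9]  = y++; p[10] = u++; p[11] = 0xff;
        p[12] = y++; p[13] = v++; p[14] = y++; p[15] = 0xff;
    }
}
```

One 64-byte permute therefore consumes 24 luma, 12 Cb and 12 Cr bytes and yields 16 dwords with the samples already in v210 order.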

[...]

mova  m2, [A]
vpbroadcastd  m3, [B]
vpbroadcastd  m6, [C]

[...]

.loop:
movu          ym1, [yq]
vinserti32x4  m1, [uq], 2
vinserti32x4  m1, [vq], 3
CLIPUB        m1, m4, m5
vpermb        m1, m2, m1
pmaddubsw     m0, m1, m3
pslld         m1, 2
vpternlogd    m0, m1, m6, 0xca
movu          [dstq], m0

I guess it could also be scaled to ymm if you're a big Skylake fan :P
(in which case you'd probably want to reorder the shuffle indices so
that chroma comes first, i.e. movq [u] + movhps [v] + vinserti32x4
[y])
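
Since the idea above is explicitly untested, a per-dword C model of the arithmetic can be used to check the constants. This is a sketch with hypothetical names, not FFmpeg code. ternlog follows the documented vpternlogd rule: each result bit is bit ((a<<2)|(b<<1)|c) of the immediate, where a is the destination operand. pack_dword models pmaddubsw with multipliers (mul_lo, 0, mul_hi, 0), then pslld, then the ternary merge.

```c
#include <stdint.h>

/* One dword of vpternlogd: a = destination, b = second, c = third operand. */
static uint32_t ternlog(uint32_t a, uint32_t b, uint32_t c, uint8_t imm)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned idx = ((a >> i & 1) << 2) | ((b >> i & 1) << 1) | (c >> i & 1);
        r |= (uint32_t)(imm >> idx & 1) << i;
    }
    return r;
}

/* bytes[] is one permuted dword; bytes[3] is the don't-care slot. */
static uint32_t pack_dword(const uint8_t bytes[4], int8_t mul_lo, int8_t mul_hi,
                           unsigned shift, uint32_t mask, uint8_t imm)
{
    /* pmaddubsw: two 16-bit products (no saturation for these small
     * positive multipliers) */
    uint16_t w0 = (uint16_t)(bytes[0] * mul_lo);
    uint16_t w1 = (uint16_t)(bytes[2] * mul_hi);
    uint32_t m0 = w0 | (uint32_t)w1 << 16;
    /* pslld on the whole permuted dword moves the middle sample */
    uint32_t m1 = (bytes[0] | (uint32_t)bytes[1] << 8 |
                   (uint32_t)bytes[2] << 16 | (uint32_t)bytes[3] << 24) << shift;
    /* merge: mask bits should come from m1, everything else from m0 */
    return ternlog(m0, m1, mask, imm);
}
```

Two things fall out of this model and are worth verifying on real hardware. First, immediate 0xca evaluates to "a ? b : c", while a merge driven by the mask in the third operand ("c ? b : a") corresponds to 0xd8. Second, with multipliers (1, 16), shift 2 and mask 0x0003fc00 the three samples land at bit offsets 0, 10 and 20 without the 8-to-10-bit scaling. Multipliers (4, 64), shift 4 and mask 0xff << 12 would place them at 2, 12 and 22.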


Re: [FFmpeg-devel] [PATCH] RFC: v210enc optimisations and initial AVX-512

2022-10-26 Thread James Darnley
Henrik Gramner wrote:
> I guess it could also be scaled to ymm if you're a big Skylake fan :P
> (in which case you'd probably want to reorder the shuffle indices so
> that chroma comes first, i.e. movq [u] + movhps [v] + vinserti32x4
> [y])


What shuffle or permute did you have in mind when you suggested this for 
Skylake?  Without the permute I'm not sure how the change in ordering 
helps.  Aren't we stuck with data in separate lanes?  I'm probably 
missing something though.
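
For context on the lane question, the sketch below models the two byte shuffles available at 256 bits. It makes no claim about which one was intended in the suggestion. AVX2 vpshufb shuffles each 128-bit lane independently, while vpermb on ymm (which needs AVX-512 VBMI together with VL) indexes all 32 bytes. The function names are hypothetical, and the layout of Cb at bytes 0..7, Cr at 8..15 and luma at 16..31 is an assumption read off the quoted load order.

```c
#include <stdint.h>

/* AVX2 vpshufb on ymm: index bit 7 zeroes the byte, bits 0-3 select a
 * byte from the same 16-byte lane as the output. dst must not alias src. */
static void pshufb256(uint8_t dst[32], const uint8_t src[32],
                      const uint8_t idx[32])
{
    for (int i = 0; i < 32; i++) {
        int lane = i & 16; /* 0 for the low lane, 16 for the high lane */
        dst[i] = (idx[i] & 0x80) ? 0 : src[lane + (idx[i] & 15)];
    }
}

/* vpermb on ymm: bits 0-4 of each index select from all 32 bytes.
 * dst must not alias src. */
static void permb256(uint8_t dst[32], const uint8_t idx[32],
                     const uint8_t src[32])
{
    for (int i = 0; i < 32; i++)
        dst[i] = src[idx[i] & 31];
}
```

With the assumed layout, the first output dword needs Cb, luma and Cr together, so an index such as 16 in the low lane only reaches the luma byte under the vpermb model. Under the vpshufb model the same index wraps to byte 0 of the low lane.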
