On Sun, 18 Sep 2011, Kieran Kunhya wrote:
> This adds SSE4 ASM for the (easy) case of lumFilterSize=1 and for 10-bit. The
> naming scheme and function pointers don't yet match swscale's scheme.
> Assistance to get this up to scratch appreciated.
Isn't there already a special case for that, named yuv2yuv1, as distinct from
yuv2yuvX?
>lum_10_upper: times 4 dd 0x400
0x3ff, unless your "10-bit" means something different than mine.
>cglobal lum_10_filter1_%1, 4, 6
> xor r4d, r4d
> sub r0, r1
> movsx r5d, r2w
movsx r2d, r2w
> movd m3, r5d
> pshufd m3, m3, 0
> mova m2, [lum_10_start]
>.loop
> mova m0, [r1]
> pmovsxwd m1, m0
> movhlps m0, m0
> pmovsxwd m0, m0
> pmulld m1, m3
> pmulld m0, m3
> psrad m1, 1
> psrad m0, 1
> paddd m1, m2
> paddd m0, m2
> psrad m1, 16
> psrad m0, 16
punpcklwd m0, {4}
pmaddwd m0, {filter, 1<<14}
psrad m0, 17
Or if the lsbs are 0:
pmulhrsw m0, {filter>>2}
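
A scalar C sketch of the two suggestions above, one lane at a time. It reads the second pmaddwd operand as 1<<14 (an assumption: 4 * (1<<14) gives the 1<<16 rounding bias that matches a shift by 17). The function names are made up for illustration and are not swscale symbols.

```c
#include <assert.h>
#include <stdint.h>

/* One 32-bit lane of: punpcklwd with {4}; pmaddwd with {filter, 1<<14};
 * psrad 17. pmaddwd yields in*filter + 4*(1<<14), i.e. the product plus a
 * 1<<16 rounding bias. The right shift of a negative value is assumed to be
 * arithmetic, as psrad is. */
static int32_t scale_pmaddwd(int16_t in, int16_t filter)
{
    int32_t sum = (int32_t)in * filter + 4 * (1 << 14);
    return sum >> 17;
}

/* One 16-bit lane of pmulhrsw: (a*b + 0x4000) >> 15. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b + 0x4000) >> 15);
}

/* The shorter variant, valid only when the two low bits of filter are 0,
 * so that filter>>2 loses nothing. */
static int16_t scale_pmulhrsw(int16_t in, int16_t filter)
{
    assert((filter & 3) == 0);
    return pmulhrsw_lane(in, (int16_t)(filter >> 2));
}
```

With filter = 4k both forms reduce to (in*k + (1<<14)) >> 15, which is why the single pmulhrsw can stand in for the three-instruction sequence under that condition.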
> CLIPD_SSE41 m1, [lum_10_lower], [lum_10_upper]
> CLIPD_SSE41 m0, [lum_10_lower], [lum_10_upper]
> packusdw m1, m0
packusdw m1, m0
pminuw m1, [lum_10_upper]
Or if the inputs can't be more than 32x out of range:
(Same number of uops, but more flexibility on where they execute)
packusdw m1, m0
pminsw m1, [lum_10_upper]
Or if the inputs can't be out of range:
packssdw m1, m0
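
A scalar C sketch of why the pack-then-min forms above can replace the two dword clips, assuming the lower bound is 0 and the upper bound is 0x3ff. The helper names are hypothetical. The signed variant relies on 32 * 1023 = 32736 still fitting in a signed 16-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* packusdw lane: saturate a signed 32-bit value to unsigned 16 bits. */
static uint16_t packusdw_lane(int32_t v)
{
    return v < 0 ? 0 : v > 0xFFFF ? 0xFFFF : (uint16_t)v;
}

/* pminuw lane: unsigned 16-bit minimum. */
static uint16_t pminuw_lane(uint16_t a, uint16_t b) { return a < b ? a : b; }

/* pminsw lane: signed 16-bit minimum. */
static int16_t pminsw_lane(int16_t a, int16_t b) { return a < b ? a : b; }

/* Reference: clip to [0, 0x3ff] on the full 32-bit value. */
static uint16_t clip10_ref(int32_t v)
{
    return v < 0 ? 0 : v > 0x3ff ? 0x3ff : (uint16_t)v;
}

/* packusdw + pminuw: the pack clamps the low side, the min clamps the high side. */
static uint16_t clip10_unsigned(int32_t v)
{
    return pminuw_lane(packusdw_lane(v), 0x3ff);
}

/* packusdw + pminsw: same result only while the packed word stays <= 32767,
 * because a larger word would look negative to the signed compare. */
static uint16_t clip10_signed(int32_t v)
{
    assert(v <= 32767);
    return (uint16_t)pminsw_lane((int16_t)packusdw_lane(v), 0x3ff);
}
```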
> add r1, mmsize
> add r4d, mmsize/2
> cmp r4d, r3d
> jl .loop
add r1, mmsize
sub r3d, mmsize/2
jg .loop
Or do the standard pointer munging trick to reduce it to a single add.
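
One common reading of that trick, sketched in C under the assumption that it means pointing past the end of the buffers and running a negative index up to zero, so the index increment also serves as the loop test. The function name and the plain copy in the loop body are placeholders, and a C compiler may generate this form on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n samples using a single induction variable. */
static void copy_row_negidx(int16_t *dst, const int16_t *src, ptrdiff_t n)
{
    assert(n > 0);          /* the do/while form runs at least once */
    src += n;               /* point both buffers one past the end */
    dst += n;
    ptrdiff_t i = -n;       /* negative offset from the end */
    do {
        dst[i] = src[i];    /* in asm: load/store at [base + index] */
    } while (++i < 0);      /* in asm: add index, step; jl .loop */
}
```

In the asm loop this would replace the separate pointer add and counter update with one add whose flags feed the branch directly.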
--Loren Merritt