On Sun, 18 Sep 2011, Kieran Kunhya wrote:

> This adds SSE4 ASM for the (easy) case of lumFilterSize=1 and for 10-bit. The
> naming scheme and function pointers don't yet match swscale's scheme.
> Assistance to get this up to scratch appreciated.

Isn't there already a special case for that, named yuv2yuv1, as distinct from 
yuv2yuvX?

>lum_10_upper:  times 4 dd 0x400

0x3ff, unless your "10-bit" means something different than mine.

>cglobal lum_10_filter1_%1, 4, 6
>    xor  r4d, r4d
>    sub  r0, r1
>    movsx  r5d, r2w

movsx r2d, r2w

>    movd   m3, r5d
>    pshufd m3, m3, 0
>    mova m2, [lum_10_start]
>.loop
>    mova m0, [r1]
>    pmovsxwd m1, m0
>    movhlps  m0, m0
>    pmovsxwd m0, m0
>    pmulld   m1, m3
>    pmulld   m0, m3
>    psrad    m1, 1
>    psrad    m0, 1
>    paddd    m1, m2
>    paddd    m0, m2
>    psrad    m1, 16
>    psrad    m0, 16

punpcklwd m0, {4}
pmaddwd   m0, {filter, 1<<14}
psrad     m0, 17

Or if the 2 lsbs of the filter are 0:
pmulhrsw  m0, {filter>>2}
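
A scalar C sketch of one lane of the two sequences above, to show where the
rounding comes from: interleaving the pixel with 4 and multiplying by 1<<14
adds 4*(1<<14) = 1<<16, which is half of the final shift by 17, and both
constants fit in a signed 16-bit lane. The function names are made up for
illustration, and the code assumes an arithmetic right shift of negative
values, as on the usual compilers.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of: punpcklwd m0, {4} / pmaddwd m0, {filter, 1<<14} / psrad m0, 17.
 * pmaddwd yields x*filter + 4*(1<<14), i.e. the product plus a 1<<16 bias. */
static int32_t scale_pmaddwd(int16_t x, int16_t filter)
{
    int32_t sum = (int32_t)x * filter + 4 * (1 << 14);
    return sum >> 17;
}

/* One lane of: pmulhrsw m0, {filter>>2}.
 * pmulhrsw computes (((a*b) >> 14) + 1) >> 1 and keeps 16 bits. */
static int16_t scale_pmulhrsw(int16_t x, int16_t filter)
{
    assert((filter & 3) == 0);          /* only exact when the 2 lsbs are 0 */
    int32_t prod = (int32_t)x * (filter >> 2);
    return (int16_t)(((prod >> 14) + 1) >> 1);
}

/* Count inputs where the two variants disagree, over all 16-bit pixels. */
static long count_mismatches(int16_t filter)
{
    long bad = 0;
    for (int x = -32768; x <= 32767; x++)
        if (scale_pmaddwd((int16_t)x, filter) != scale_pmulhrsw((int16_t)x, filter))
            bad++;
    return bad;
}
```

For a filter coefficient with its low two bits clear, count_mismatches()
should report no differences between the two variants.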

>    CLIPD_SSE41 m1, [lum_10_lower], [lum_10_upper]
>    CLIPD_SSE41 m0, [lum_10_lower], [lum_10_upper]
>    packusdw  m1, m0

packusdw m1, m0
pminuw   m1, [lum_10_upper]

Or if the inputs can't be more than 32x out of range:
(Same number of uops, but more flexibility on where they execute)
packusdw m1, m0
pminsw   m1, [lum_10_upper]

Or if the inputs can't be out of range:
packssdw m1, m0
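
A scalar C sketch of the three clipping variants above, one lane each. It
assumes the lower bound is 0 (the contents of lum_10_lower are not shown in
the patch) and uses 0x3ff as the upper bound; the helper names only mirror the
instructions they model.

```c
#include <assert.h>
#include <stdint.h>

/* packusdw: saturate a signed 32-bit lane to unsigned 16 bits. */
static uint16_t packusdw(int32_t v) { return v < 0 ? 0 : v > 65535 ? 65535 : (uint16_t)v; }

/* packssdw: saturate a signed 32-bit lane to signed 16 bits. */
static int16_t packssdw(int32_t v) { return v < -32768 ? -32768 : v > 32767 ? 32767 : (int16_t)v; }

/* pminuw: unsigned 16-bit minimum. */
static uint16_t pminuw(uint16_t a, uint16_t b) { return a < b ? a : b; }

/* pminsw: signed 16-bit minimum, so lanes >= 0x8000 compare as negative. */
static uint16_t pminsw(uint16_t a, uint16_t b)
{
    int sa = a >= 0x8000 ? (int)a - 0x10000 : a;
    int sb = b >= 0x8000 ? (int)b - 0x10000 : b;
    return (uint16_t)(sa < sb ? sa : sb);
}

/* Reference clip to [0, 0x3ff] (lower bound of 0 is an assumption). */
static uint16_t clip10(int32_t v) { return v < 0 ? 0 : v > 0x3ff ? 0x3ff : (uint16_t)v; }

static void check_variants(void)
{
    for (int32_t v = -70000; v <= 70000; v++) {
        assert(pminuw(packusdw(v), 0x3ff) == clip10(v));     /* always valid */
        if (v < 32768)                                        /* < 32x range */
            assert(pminsw(packusdw(v), 0x3ff) == clip10(v));
        if (v >= 0 && v <= 0x3ff)                             /* in range */
            assert(packssdw(v) == v);
    }
}

/* Smallest non-negative input where the pminsw variant stops clipping. */
static int32_t first_pminsw_failure(void)
{
    for (int32_t v = 0; v <= 65535; v++)
        if (pminsw(packusdw(v), 0x3ff) != clip10(v))
            return v;
    return -1;
}
```

In this model the pminsw variant first goes wrong at 32768, which is 32 times
the 10-bit range.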

>    add  r1, mmsize
>    add  r4d, mmsize/2
>    cmp  r4d, r3d
>    jl   .loop

add  r1, mmsize
sub  r3d, mmsize/2
jg   .loop

Or do the standard pointer munging trick to reduce it to a single add.
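
A rough C sketch of that trick, assuming the usual form: point both buffers at
their ends and run one negative index up to zero, so the index update is also
the loop test. The function name and the per-element shift are placeholders,
not the real filter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Process n elements using a single induction variable. */
static void process(uint16_t *dst, const int16_t *src, ptrdiff_t n)
{
    assert(n >= 0);
    dst += n;                       /* both pointers now point past the end */
    src += n;
    for (ptrdiff_t i = -n; i < 0; i++)      /* one add, then branch on sign */
        dst[i] = (uint16_t)(src[i] >> 5);   /* placeholder per-element work */
}
```

In asm this corresponds to a loop body that addresses [src+i] and [dst+i] and
ends in a single add followed by jl.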

--Loren Merritt