Hi,

On Thu, Aug 23, 2012 at 12:37 PM, Justin Ruggles
<justin.rugg...@gmail.com> wrote:
> On 08/21/2012 03:53 PM, Ronald S. Bultje wrote:
>> On Mon, Aug 6, 2012 at 9:22 AM, Justin Ruggles <justin.rugg...@gmail.com> wrote:
>>> +%if cpuflag(sse2)
>>> +    mulps         m0, m6, [srcq      ]
>>> +    mulps         m1, m6, [srcq+src1q]
>>> +    mulps         m2, m6, [srcq+src2q]
>>> +    mulps         m3, m6, [srcq+src3q]
>>> +    mulps         m4, m6, [srcq+src4q]
>>> +    mulps         m5, m6, [srcq+src5q]
>>> +    cvtps2dq      m0, m0
>>> +    cvtps2dq      m1, m1
>>> +    cvtps2dq      m2, m2
>>> +    cvtps2dq      m3, m3
>>> +    cvtps2dq      m4, m4
>>> +    cvtps2dq      m5, m5
>>> +    packssdw      m0, m3            ; m0 =  0,  6, 12, 18,  3,  9, 15, 21
>>> +    packssdw      m1, m4            ; m1 =  1,  7, 13, 19,  4, 10, 16, 22
>>> +    packssdw      m2, m5            ; m2 =  2,  8, 14, 20,  5, 11, 17, 23
>>> +                                    ; unpack words:
>>> +    movhlps       m3, m0            ; m3 =  3,  9, 15, 21,  x,  x,  x,  x
>>> +    punpcklwd     m0, m1            ; m0 =  0,  1,  6,  7, 12, 13, 18, 19
>>> +    punpckhwd     m1, m2            ; m1 =  4,  5, 10, 11, 16, 17, 22, 23
>>> +    punpcklwd     m2, m3            ; m2 =  2,  3,  8,  9, 14, 15, 20, 21
>>> +                                    ; blend dwords:
>>> +    shufps        m3, m0, m2, q2020 ; m3 =  0,  1, 12, 13,  2,  3, 14, 15
>>> +    shufps        m0, m1, q2031     ; m0 =  6,  7, 18, 19,  4,  5, 16, 17
>>> +    shufps        m2, m1, q3131     ; m2 =  8,  9, 20, 21, 10, 11, 22, 23
>>> +                                    ; shuffle dwords:
>>> +    shufps        m1, m2, m3, q3120 ; m1 =  8,  9, 10, 11, 12, 13, 14, 15
>>> +    shufps        m3, m0,     q0220 ; m3 =  0,  1,  2,  3,  4,  5,  6,  7
>>> +    shufps        m0, m2,     q3113 ; m0 = 16, 17, 18, 19, 20, 21, 22, 23
>>> +    mova  [dstq+0*mmsize], m3
>>> +    mova  [dstq+1*mmsize], m1
>>> +    mova  [dstq+2*mmsize], m0
>>> +%else ; sse
>>
>> For sse4+:
>>
>> packssdw:
>> a 0-6-12-18-3-9-15-21
>> b 1-7-13-19-4-10-16-22
>> c 2-8-14-20-5-11-17-23
>>
>> pshufb:
>> a' 0-9-18-3-12-21-6-15
>> b' 16-1-10-19-4-13-22-7
>> c' 8-17-2-11-20-5-14-23
>>
>> pblendw:
>> ab' x-9-10-x-12-13-x-15
>> ac' x-17-18-x-20-21-x-23
>> bc' x-1-2-x-4-5-x-7
>>
>> and then another 3 blends to fill in abc for 0-7, 8-15 and 16-23: 12
>> instructions instead of 13. Thought out together with Christian here
>> (CC'ed). There's probably a way to get the blends to be more efficient
>> but I can't think of one very quickly.
>
> Great idea. It certainly works, but I didn't measure any difference in
> speed on Sandy Bridge. Here is what I used:
>
>                                 ; shuffle words:
> pshufb    m0, m7                ; m0 =  0,  9, 18,  3, 12, 21,  6, 15
> pshufb    m1, m8                ; m1 = 16,  1, 10, 19,  4, 13, 22,  7
> pshufb    m2, m9                ; m2 =  8, 17,  2, 11, 20,  5, 14, 23
>                                 ; blend words:
> pblendw   m3, m0, m1, 10010010b ; m3 =  0,  1,  x,  3,  4,  x,  6,  7
> pblendw   m3, m2,     00100100b ; m3 =  0,  1,  2,  3,  4,  5,  6,  7
> pblendw   m4, m0, m1, 00100100b ; m4 =  x,  9, 10,  x, 12, 13,  x, 15
> pblendw   m4, m2,     01001001b ; m4 =  8,  9, 10, 11, 12, 13, 14, 15
> pblendw   m2, m0,     00100100b ; m2 =  x, 17, 18,  x, 20, 21,  x, 23
> pblendw   m2, m1,     01001001b ; m2 = 16, 17, 18, 19, 20, 21, 22, 23
> mova  [dstq+0*mmsize], m3
> mova  [dstq+1*mmsize], m4
> mova  [dstq+2*mmsize], m2
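
The lane comments in the quoted sequence can be checked with a small Python
model that works on 16-bit word lanes instead of real registers. This is only
a sketch: the contents of m7/m8/m9 are not shown in the thread, so the
mod8_order() helper below is a hypothetical stand-in that derives an
equivalent word order from the comments (each word v moves to lane v % 8).

```python
def pshufb_words(src, idx):
    """Word-granular stand-in for pshufb: lane i takes src[idx[i]]."""
    return [src[i] for i in idx]

def pblendw(dst, src, mask):
    """Model of pblendw: lane i comes from src if bit i of mask is set."""
    return [s if (mask >> i) & 1 else d
            for i, (d, s) in enumerate(zip(dst, src))]

def mod8_order(reg):
    """Hypothetical shuffle control: move each word v to lane v % 8."""
    idx = [0] * 8
    for lane, v in enumerate(reg):
        idx[v % 8] = lane
    return idx

# Lane contents after packssdw, labelled by output sample index.
a = [0, 6, 12, 18, 3, 9, 15, 21]
b = [1, 7, 13, 19, 4, 10, 16, 22]
c = [2, 8, 14, 20, 5, 11, 17, 23]

# shuffle words
m0 = pshufb_words(a, mod8_order(a))
m1 = pshufb_words(b, mod8_order(b))
m2 = pshufb_words(c, mod8_order(c))

# blend words, same masks as in the quoted code
m3 = pblendw(m0, m1, 0b10010010)
m3 = pblendw(m3, m2, 0b00100100)
m4 = pblendw(m0, m1, 0b00100100)
m4 = pblendw(m4, m2, 0b01001001)
m2 = pblendw(m2, m0, 0b00100100)
m2 = pblendw(m2, m1, 0b01001001)

out = m3 + m4 + m2
```

In this model, out comes back as the samples 0 through 23 in order, which is
what the three mova stores expect.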

Hm... What if you use vpperm (XOP) to combine each pshufb with the first
of the two pblendw steps?
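
One possible reading of this, again as a Python model of the word lanes and
not real XOP code: a two-source permute picks the words that come from a and
b straight out of the packssdw results, and a single pblendw then fills in
the words from a pshufb'd c. The perm2_words() and selector() helpers are
hypothetical, and the selector numbering (0-7 from the first source, 8-15
from the second) is an assumption of the model. The real vpperm selects
bytes, and its operand order and selector encoding would need to be checked
against the AMD manual.

```python
def perm2_words(src1, src2, sel):
    """Word-level stand-in for a two-source permute such as vpperm.
    Assumed numbering: 0-7 pick from src1, 8-15 pick from src2."""
    both = src1 + src2
    return [both[s] for s in sel]

def pblendw(dst, src, mask):
    """Model of pblendw: lane i comes from src if bit i of mask is set."""
    return [s if (mask >> i) & 1 else d
            for i, (d, s) in enumerate(zip(dst, src))]

def selector(src1, src2, base):
    """Hypothetical selector: lane i asks for sample base+i if it lives in
    src1/src2; other lanes get a dummy 0 and are overwritten by the blend."""
    both = src1 + src2
    return [both.index(base + i) if base + i in both else 0
            for i in range(8)]

# Lane contents after packssdw, labelled by output sample index.
a = [0, 6, 12, 18, 3, 9, 15, 21]
b = [1, 7, 13, 19, 4, 10, 16, 22]
# c after one pshufb, same order as m2 in the quoted code.
c_shuf = [8, 17, 2, 11, 20, 5, 14, 23]

out = []
for base, mask in ((0, 0b00100100), (8, 0b01001001), (16, 0b10010010)):
    ab = perm2_words(a, b, selector(a, b, base))  # pshufb + first blend
    out += pblendw(ab, c_shuf, mask)              # second blend
```

Under these assumptions that would be 3 packssdw + 1 pshufb + 3 vpperm +
3 pblendw = 10 instructions, at the cost of three selector constants and an
XOP-only code path.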

Ronald
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel
