2013/4/9 Justin Ruggles <[email protected]>:
> On 04/07/2013 04:29 PM, Christophe Gisquet wrote:
>> +    mova       m0, [src0q+cq]
>> +    mova       m1, [src1q]
>> +    mova       m4, [src0q+cq+mmsize]
>> +    mova       m5, [src1q+mmsize]
>> +%if cpuflag(sse2)
>> +    pshufd     m2, m0, q0123
>> +    pshufd     m3, m1, q0123
>> +    pshufd     m6, m4, q0123
>> +    pshufd     m7, m5, q0123
>> +%else
>> +    shufps     m2, m0, m0, q0123
>> +    shufps     m3, m1, m1, q0123
>> +    shufps     m6, m4, m4, q0123
>> +    shufps     m7, m5, m5, q0123
>> +%endif
>
> You can use memory args for the pshufd.

Because of the subps, it's not strictly commutative here, and I ended
up with this using 6 xmm regs:
    mova       m0, [src0q+cq]
    mova       m2, [src0q+cq+mmsize]
    pshufd     m4, [src1q], q0123
    pshufd     m5, [src1q+mmsize], q0123
    pshufd     m3, m0, q0123
    pshufd     m1, m2, q0123
    addps      m3, [src1q+mmsize]
    subps      m0, m5
    addps      m1, [src1q]
    subps      m2, m4
This is 79 cycles, compared to 68 for the original version, and not
something that better scheduling could fix.
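
For reference, a rough scalar C sketch of what the 6-register sequence
above works out to. It assumes q0123 expands to the immediate 0x1B
(full reversal of the four lanes), and that a0/a1 stand for the two
vectors loaded from src0 and b0/b1 for the two from src1. The function
and parameter names are made up for illustration and are not from the
patch.

```c
/* Scalar model of pshufd (or shufps with the same register as both
 * sources): dst[i] = src[(imm >> (2*i)) & 3]. With imm = 0x1B this
 * reverses the four lanes. */
static void shuffle4(float dst[4], const float src[4], unsigned imm)
{
    float tmp[4];
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm >> (2 * i)) & 3];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}

/* Hypothetical helper mirroring the ten instructions above.
 * a0, a1: [src0q+cq] and [src0q+cq+mmsize]
 * b0, b1: [src1q]    and [src1q+mmsize]
 * m0..m3: register contents after the last subps. */
static void butterfly_step(const float a0[4], const float a1[4],
                           const float b0[4], const float b1[4],
                           float m0[4], float m1[4],
                           float m2[4], float m3[4])
{
    float m4[4], m5[4];

    shuffle4(m4, b0, 0x1B);          /* pshufd m4, [src1q], q0123        */
    shuffle4(m5, b1, 0x1B);          /* pshufd m5, [src1q+mmsize], q0123 */
    shuffle4(m3, a0, 0x1B);          /* m3 = reversed a0                 */
    shuffle4(m1, a1, 0x1B);          /* m1 = reversed a1                 */

    for (int i = 0; i < 4; i++) {
        m3[i] = m3[i] + b1[i];       /* addps m3, [src1q+mmsize]         */
        m0[i] = a0[i] - m5[i];       /* subps m0, m5                     */
        m1[i] = m1[i] + b0[i];       /* addps m1, [src1q]                */
        m2[i] = a1[i] - m4[i];       /* subps m2, m4                     */
    }
}
```

The adds can take the unshuffled src1 data straight from memory, while
each sub needs the reversed src1 vector as its second operand, so that
one has to go through a register first.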

--
Christophe
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
