2013/4/9 Justin Ruggles <[email protected]>:
> On 04/07/2013 04:29 PM, Christophe Gisquet wrote:
>> + mova m0, [src0q+cq]
>> + mova m1, [src1q]
>> + mova m4, [src0q+cq+mmsize]
>> + mova m5, [src1q+mmsize]
>> +%if cpuflag(sse2)
>> + pshufd m2, m0, q0123
>> + pshufd m3, m1, q0123
>> + pshufd m6, m4, q0123
>> + pshufd m7, m5, q0123
>> +%else
>> + shufps m2, m0, m0, q0123
>> + shufps m3, m1, m1, q0123
>> + shufps m6, m4, m4, q0123
>> + shufps m7, m5, m5, q0123
>> +%endif
>
> You can use memory args for the pshufd.
Because of the subps, it's not strictly commutative here, and I ended
up with this using 6 xmm regs:
mova m0, [src0q+cq]
mova m2, [src0q+cq+mmsize]
pshufd m4, [src1q], q0123
pshufd m5, [src1q+mmsize], q0123
pshufd m3, m0, q0123
pshufd m1, m2, q0123
addps m3, [src1q+mmsize]
subps m0, m5
addps m1, [src1q]
subps m2, m4
This takes 79 cycles, compared to 68 for the original version, and it is
not something that better scheduling could fix.
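
For reference, a rough scalar C sketch of what the ten instructions above
compute on one block of 8 floats. It is only a model under assumptions: the
names rev4, add4, sub4 and bfly_model are made up for illustration, q0123 is
taken to reverse the four lanes, the cq offset is taken as zero, and the
stores that follow in the real loop are left out.

```c
#include <string.h>

#define LANES 4 /* floats per xmm register */

/* Models pshufd/shufps with q0123: dst gets the lanes of src reversed. */
static void rev4(float *dst, const float *src)
{
    float tmp[LANES];
    for (int i = 0; i < LANES; i++)
        tmp[i] = src[LANES - 1 - i];
    memcpy(dst, tmp, sizeof(tmp));
}

/* Models addps dst, src. */
static void add4(float *dst, const float *src)
{
    for (int i = 0; i < LANES; i++)
        dst[i] += src[i];
}

/* Models subps dst, src: dst is always the minuend, so the two
 * operands cannot be swapped the way they can for addps. */
static void sub4(float *dst, const float *src)
{
    for (int i = 0; i < LANES; i++)
        dst[i] -= src[i];
}

/* Hypothetical model of the 6-register sequence; m[k] stands for mk. */
static void bfly_model(const float *src0, const float *src1,
                       float m[6][LANES])
{
    memcpy(m[0], src0, sizeof(m[0]));         /* mova   m0, [src0q+cq]        */
    memcpy(m[2], src0 + LANES, sizeof(m[2])); /* mova   m2, [src0q+cq+mmsize] */
    rev4(m[4], src1);                         /* pshufd m4, [src1q], q0123    */
    rev4(m[5], src1 + LANES);                 /* pshufd m5, [src1q+mmsize]    */
    rev4(m[3], m[0]);                         /* reversed copy of m0          */
    rev4(m[1], m[2]);                         /* reversed copy of m2          */
    add4(m[3], src1 + LANES);                 /* addps  m3, [src1q+mmsize]    */
    sub4(m[0], m[5]);                         /* subps  m0, m5                */
    add4(m[1], src1);                         /* addps  m1, [src1q]           */
    sub4(m[2], m[4]);                         /* subps  m2, m4                */
}
```

In this model the differences land in m0 and m2 and the sums in m3 and m1.
The subtraction needs the unshuffled src0 data as its destination, which is
why src0 is still loaded with mova and only src1 is shuffled straight from
memory.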
--
Christophe