Hi,

On Sat, Jul 21, 2012 at 12:12 PM, Justin Ruggles
<justin.rugg...@gmail.com> wrote:
> +%if cpuflag(ssse3)
> +    pshufb  m3, m0, unpack_odd   ; m3 = 12, 13, 14, 15
> +    pshufb  m0, unpack_even      ; m0 =  0,  1,  2,  3
> +    pshufb  m4, m1, unpack_odd   ; m4 = 16, 17, 18, 19
> +    pshufb  m1, unpack_even      ; m1 =  4,  5,  6,  7
> +    pshufb  m5, m2, unpack_odd   ; m5 = 20, 21, 22, 23
> +    pshufb  m2, unpack_even      ; m2 =  8,  9, 10, 11
> +%else
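For reference, the SSSE3 pshufb used above can be modelled per 16-byte lane as the scalar sketch below; the actual unpack_even/unpack_odd mask constants are not visible in the quoted hunk, so only the instruction's semantics are shown:

```c
#include <stdint.h>

/* Scalar model of SSSE3 pshufb on one 16-byte lane: each output byte
 * is selected by the low 4 bits of the corresponding mask byte, and a
 * set high bit in the mask byte zeroes that output byte instead.
 * unpack_even/unpack_odd in the patch are constant masks of this form. */
static void pshufb16(uint8_t dst[16], const uint8_t src[16],
                     const uint8_t mask[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0f];
}
```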
I'm going to assume you tested vpperm and it was not faster?

> +    mova  [dstq         ], m0
> +    mova  [dstq+  mmsize], m1
> +    mova  [dstq+2*mmsize], m2
> +    mova  [dstq+3*mmsize], m3
> +    mova  [dstq+4*mmsize], m4
> +    mova  [dstq+5*mmsize], m5
> +    add     srcq, mmsize/2
> +    add     dstq, mmsize*6
> +    sub     lend, mmsize/4

Can you try the pointer munging trick here too (i.e. sign-extend lend;
imul lenq, x; add dstq, lenq; neg lenq)? That way the add dstq, mmsize*6
and sub lend, mmsize/4 can be merged and we remove one instruction from
the inner loop.

Ronald
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel
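A scalar sketch of the suggested trick, under assumed names (the trivial
widening body stands in for the real stores, which are not what is being
illustrated):

```c
#include <stddef.h>

/* Baseline form, like the quoted asm: three loop-carried updates per
 * iteration (dst advance, src advance, counter decrement). */
static void widen_plain(short *dst, const signed char *src, int len)
{
    while (len > 0) {
        *dst++ = *src++;   /* stand-in for the real per-block work */
        len--;
    }
}

/* Pointer-munging form Ronald suggests: advance the pointers to their
 * end, negate the sign-extended length, and use it as a negative index.
 * The single i++ now advances the addressing *and* terminates the loop,
 * so separate pointer adds and the counter sub are merged. */
static void widen_munged(short *dst, const signed char *src, int len)
{
    ptrdiff_t i = len;   /* sign-extend lend -> lenq */
    dst += i;            /* add dstq, lenq */
    src += i;
    i = -i;              /* neg lenq */
    while (i < 0) {
        dst[i] = src[i];
        i++;             /* one update instead of add+add+sub */
    }
}
```

In the asm, dst and src advance at different rates (mmsize*6 vs. mmsize/2 per mmsize/4 samples), which is why the length first has to be scaled with imul; factors of 1, 2, 4 or 8 can instead be absorbed by the addressing-mode scale.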