Hi,

On Sat, Jul 21, 2012 at 12:12 PM, Justin Ruggles
<justin.rugg...@gmail.com> wrote:
> +%if cpuflag(ssse3)
> +    pshufb     m3, m0, unpack_odd   ; m3 =  12,     13,     14,     15
> +    pshufb         m0, unpack_even  ; m0 =   0,      1,      2,      3
> +    pshufb     m4, m1, unpack_odd   ; m4 =  16,     17,     18,     19
> +    pshufb         m1, unpack_even  ; m1 =   4,      5,      6,      7
> +    pshufb     m5, m2, unpack_odd   ; m5 =  20,     21,     22,     23
> +    pshufb         m2, unpack_even  ; m2 =   8,      9,     10,     11
> +%else

I assume you tested vpperm and found that it was not faster?

> +    mova  [dstq         ], m0
> +    mova  [dstq+  mmsize], m1
> +    mova  [dstq+2*mmsize], m2
> +    mova  [dstq+3*mmsize], m3
> +    mova  [dstq+4*mmsize], m4
> +    mova  [dstq+5*mmsize], m5
> +    add      srcq, mmsize/2
> +    add      dstq, mmsize*6
> +    sub      lend, mmsize/4

Can you try the pointer-munging trick here too (i.e. sign-extend lend;
imul lenq, x; add dstq, lenq; neg lenq), so that "add dstq, mmsize*6"
and "sub lend, mmsize/4" can be merged and we remove one instruction
from the inner loop?
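
Something like this untested sketch (register names follow the patch;
the scale of 24 bytes per sample assumes mmsize == 16, i.e. 6*mmsize
output bytes written per mmsize/4 samples consumed):

    movsxdifnidn lenq, lend      ; sign-extend the 32-bit length
    imul     lenq, 24            ; bytes written to dst per sample
    add      dstq, lenq          ; dstq now points to the end of dst
    neg      lenq                ; negative offset counting up to zero
.loop:
    ; ... loads and shuffles as before ...
    mova  [dstq+lenq         ], m0
    mova  [dstq+lenq+  mmsize], m1
    mova  [dstq+lenq+2*mmsize], m2
    mova  [dstq+lenq+3*mmsize], m3
    mova  [dstq+lenq+4*mmsize], m4
    mova  [dstq+lenq+5*mmsize], m5
    add      srcq, mmsize/2
    add      lenq, mmsize*6      ; replaces both add dstq and sub lend
    jl .loop

That way the inner loop only updates srcq and a single counter.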

Ronald
