Hi,

On Thu, Aug 23, 2012 at 12:37 PM, Justin Ruggles
<justin.rugg...@gmail.com> wrote:
> On 08/21/2012 03:53 PM, Ronald S. Bultje wrote:
>> On Mon, Aug 6, 2012 at 9:22 AM, Justin Ruggles <justin.rugg...@gmail.com>
>> wrote:
>>> +%if cpuflag(sse2)
>>> +    mulps       m0, m6, [srcq      ]
>>> +    mulps       m1, m6, [srcq+src1q]
>>> +    mulps       m2, m6, [srcq+src2q]
>>> +    mulps       m3, m6, [srcq+src3q]
>>> +    mulps       m4, m6, [srcq+src4q]
>>> +    mulps       m5, m6, [srcq+src5q]
>>> +    cvtps2dq    m0, m0
>>> +    cvtps2dq    m1, m1
>>> +    cvtps2dq    m2, m2
>>> +    cvtps2dq    m3, m3
>>> +    cvtps2dq    m4, m4
>>> +    cvtps2dq    m5, m5
>>> +    packssdw    m0, m3              ; m0 =  0,  6, 12, 18,  3,  9, 15, 21
>>> +    packssdw    m1, m4              ; m1 =  1,  7, 13, 19,  4, 10, 16, 22
>>> +    packssdw    m2, m5              ; m2 =  2,  8, 14, 20,  5, 11, 17, 23
>>> +    ; unpack words:
>>> +    movhlps     m3, m0              ; m3 =  3,  9, 15, 21,  x,  x,  x,  x
>>> +    punpcklwd   m0, m1              ; m0 =  0,  1,  6,  7, 12, 13, 18, 19
>>> +    punpckhwd   m1, m2              ; m1 =  4,  5, 10, 11, 16, 17, 22, 23
>>> +    punpcklwd   m2, m3              ; m2 =  2,  3,  8,  9, 14, 15, 20, 21
>>> +    ; blend dwords:
>>> +    shufps      m3, m0, m2, q2020   ; m3 =  0,  1, 12, 13,  2,  3, 14, 15
>>> +    shufps      m0, m1, q2031       ; m0 =  6,  7, 18, 19,  4,  5, 16, 17
>>> +    shufps      m2, m1, q3131       ; m2 =  8,  9, 20, 21, 10, 11, 22, 23
>>> +    ; shuffle dwords:
>>> +    shufps      m1, m2, m3, q3120   ; m1 =  8,  9, 10, 11, 12, 13, 14, 15
>>> +    shufps      m3, m0, q0220       ; m3 =  0,  1,  2,  3,  4,  5,  6,  7
>>> +    shufps      m0, m2, q3113       ; m0 = 16, 17, 18, 19, 20, 21, 22, 23
>>> +    mova   [dstq+0*mmsize], m3
>>> +    mova   [dstq+1*mmsize], m1
>>> +    mova   [dstq+2*mmsize], m0
>>> +%else ; sse
>>
>> For sse4+:
>>
>> packssdw:
>> a   0-6-12-18-3-9-15-21
>> b   1-7-13-19-4-10-16-22
>> c   2-8-14-20-5-11-17-23
>>
>> pshufb:
>> a'  0-9-18-3-12-21-6-15
>> b'  16-1-10-19-4-13-22-7
>> c'  8-17-2-11-20-5-14-23
>>
>> pblendw:
>> ab' x-9-10-x-12-13-x-15
>> ac' x-17-18-x-20-21-x-23
>> bc' x-1-2-x-4-5-x-7
>>
>> and then another 3 blends to fill in abc for 0-7, 8-15 and 16-23: 12
>> instructions instead of 13. Thought out together with Christian here
>> (CC'ed). There's probably a way to get the blends to be more efficient
>> but I can't think of one very quickly.
>
> Great idea. It certainly works, but I didn't measure any difference in
> speed on Sandy Bridge. Here is what I used:
>
>     ; shuffle words:
>     pshufb      m0, m7              ; m0 =  0,  9, 18,  3, 12, 21,  6, 15
>     pshufb      m1, m8              ; m1 = 16,  1, 10, 19,  4, 13, 22,  7
>     pshufb      m2, m9              ; m2 =  8, 17,  2, 11, 20,  5, 14, 23
>     ; blend words:
>     pblendw     m3, m0, m1, 10010010b ; m3 =  0,  1,  x,  3,  4,  x,  6,  7
>     pblendw     m3, m2, 00100100b     ; m3 =  0,  1,  2,  3,  4,  5,  6,  7
>     pblendw     m4, m0, m1, 00100100b ; m4 =  x,  9, 10,  x, 12, 13,  x, 15
>     pblendw     m4, m2, 01001001b     ; m4 =  8,  9, 10, 11, 12, 13, 14, 15
>     pblendw     m2, m0, 00100100b     ; m2 =  x, 17, 18,  x, 20, 21,  x, 23
>     pblendw     m2, m1, 01001001b     ; m2 = 16, 17, 18, 19, 20, 21, 22, 23
>     mova   [dstq+0*mmsize], m3
>     mova   [dstq+1*mmsize], m4
>     mova   [dstq+2*mmsize], m2
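
The pshufb/pblendw sequence quoted above can be sanity-checked with a small scalar model. The sketch below is not the assembly itself: it treats each XMM register as a list of eight words, and the word orders passed to the hypothetical `pshufb_words` helper are stand-ins for the byte masks held in m7/m8/m9, which are not shown in the thread. The blend immediates are the ones from the quoted code.

```python
# Scalar model of the word shuffles; each "register" is a list of 8 words.
# Helper names and the word-order lists are assumptions made for illustration.

def pshufb_words(src, order):
    """Model pshufb at word granularity: result[i] = src[order[i]]."""
    return [src[j] for j in order]

def pblendw(dst, src, imm):
    """Model pblendw: word i comes from src if bit i of imm is set, else dst."""
    return [src[i] if (imm >> i) & 1 else dst[i] for i in range(8)]

def interleave_words(a, b, c):
    """Turn the three packssdw results into three registers of interleaved words."""
    # shuffle words (orders stand in for the m7/m8/m9 masks)
    m0 = pshufb_words(a, [0, 5, 3, 4, 2, 7, 1, 6])  # 0, 9, 18, 3, 12, 21, 6, 15
    m1 = pshufb_words(b, [6, 0, 5, 3, 4, 2, 7, 1])  # 16, 1, 10, 19, 4, 13, 22, 7
    m2 = pshufb_words(c, [1, 6, 0, 5, 3, 4, 2, 7])  # 8, 17, 2, 11, 20, 5, 14, 23
    # blend words, same immediates as in the quoted code
    m3 = pblendw(m0, m1, 0b10010010)
    m3 = pblendw(m3, m2, 0b00100100)
    m4 = pblendw(m0, m1, 0b00100100)
    m4 = pblendw(m4, m2, 0b01001001)
    m2 = pblendw(m2, m0, 0b00100100)
    m2 = pblendw(m2, m1, 0b01001001)
    return m3, m4, m2

# Word positions after packssdw, as given in the comments of the patch.
A = [0, 6, 12, 18, 3, 9, 15, 21]
B = [1, 7, 13, 19, 4, 10, 16, 22]
C = [2, 8, 14, 20, 5, 11, 17, 23]

if __name__ == "__main__":
    for reg in interleave_words(A, B, C):
        print(reg)
```

Feeding in the post-packssdw layouts from the patch comments yields three registers holding words 0-7, 8-15 and 16-23, which agrees with the comments on the final pblendw lines.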

Hm... What if you use vpperm as a combination of pshufb and the first
of two pblendw steps?

Ronald
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel
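
One possible reading of the vpperm suggestion, sketched as a scalar model: since the XOP instruction vpperm selects from two source registers at once, it could pick words straight out of the packssdw results a and b, leaving a single pshufb on c and one pblendw per output register. This is an interpretation, not something confirmed in the thread. The model works on words, whereas the real instruction selects bytes and its selector encoding is not modelled here; the helper names and selector lists are hypothetical, and "don't care" slots simply pick word 0.

```python
# Word-granularity model of a two-source permute followed by one blend.
# All helper names and selector values are assumptions for illustration.

def vpperm_words(src1, src2, sel):
    """Two-source permute: sel[i] in 0..7 picks src1, 8..15 picks src2."""
    both = src1 + src2
    return [both[s] for s in sel]

def pshufb_words(src, order):
    """Model pshufb at word granularity: result[i] = src[order[i]]."""
    return [src[j] for j in order]

def pblendw(dst, src, imm):
    """Model pblendw: word i comes from src if bit i of imm is set, else dst."""
    return [src[i] if (imm >> i) & 1 else dst[i] for i in range(8)]

def interleave_words_vpperm(a, b, c):
    """Interleave using three permutes of (a, b), one shuffle of c, three blends."""
    c_shuf = pshufb_words(c, [1, 6, 0, 5, 3, 4, 2, 7])   # 8, 17, 2, 11, 20, 5, 14, 23
    # each permute does the work of two pshufb results plus the first blend
    out0 = vpperm_words(a, b, [0, 8, 0, 4, 12, 0, 1, 9])    # 0, 1, x, 3, 4, x, 6, 7
    out1 = vpperm_words(a, b, [0, 5, 13, 0, 2, 10, 0, 6])   # x, 9, 10, x, 12, 13, x, 15
    out2 = vpperm_words(a, b, [14, 0, 3, 11, 0, 7, 15, 0])  # 16, x, 18, 19, x, 21, 22, x
    # one remaining blend per output fills the slots that come from c
    out0 = pblendw(out0, c_shuf, 0b00100100)
    out1 = pblendw(out1, c_shuf, 0b01001001)
    out2 = pblendw(out2, c_shuf, 0b10010010)
    return out0, out1, out2

# Word positions after packssdw, as given in the comments of the patch.
A = [0, 6, 12, 18, 3, 9, 15, 21]
B = [1, 7, 13, 19, 4, 10, 16, 22]
C = [2, 8, 14, 20, 5, 11, 17, 23]

if __name__ == "__main__":
    for reg in interleave_words_vpperm(A, B, C):
        print(reg)
```

Under this reading the shuffle stage would need seven instructions (three vpperm, one pshufb, three pblendw) rather than nine, at the cost of requiring XOP. Whether that is faster in practice would have to be measured.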