Hi,

On Thu, Feb 23, 2012 at 9:47 AM, Ronald S. Bultje <[email protected]> wrote:
> Hi,
>
> On Thu, Feb 23, 2012 at 9:28 AM, Christophe Gisquet
> <[email protected]> wrote:
>> again, a simple function which takes approximately 2% of the decoding
>> time. The timings go down from 117 (32bits) / 109 (64 bits) cycles to
>> 68 cycles. See patch comments for some further experiments.
> [..]
>> +    movsxd        r4, r4d
>
> Let's change the argument to intptr_t, the caller can likely do this for free.
>
>> +.loop4:
>> +    movq          m0, [r2 + 0]
>> +    movq          m1, [r2 + 8]
>> +    movq          m2, [r1 + 0*STEP]
>> +    movq          m3, [r1 + 2*STEP]
>> +    movhps        m2, [r1 + 1*STEP]
>> +    movhps        m3, [r1 + 3*STEP]
>> +    punpckldq     m0, m0
>> +    punpckldq     m1, m1
>> +    mulps         m0, m2
>> +    mulps         m1, m3
>> +    movu          [r0 +  0], m0
>> +    movu          [r0 + 16], m1
>> +    add           r1, 4*STEP
>> +    add           r2, 4*1*4
>> +    add           r0, 4*2*4
>> +    dec           r3
>> +    jnz .loop4
>
> shl r3, 2
> lea r2, [r2+r3*4]
> lea r0, [r0+r3*8]
Forgot a "neg r3" here.

> .loop4:
> movq ..., [r2+r3*4+0/8]
> ...
> movu [r0+r3*8+0/16], ...
> add r1, 4*STEP
> add r3, r4
> jnz .loop4

Ronald
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
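
For readers following the loop rewrite suggested above, here is a minimal C sketch of the same idea. It is an illustration under stated assumptions, not the patch itself. The function names `scale_pairs_ptr` and `scale_pairs_negidx` are hypothetical, `step` is assumed to be the row stride counted in floats (the quoted snippet does not show how STEP is defined), and the sketch handles one output pair per iteration where the asm handles four. The first function mirrors the original loop, with three pointers advancing and a counter counting down. The second mirrors the suggestion: point `dst` and `gain` at the end of their arrays, negate the count, and index with a negative offset that counts up to zero, so a single add both advances the index and sets the flag that ends the loop. The quoted `add r3, r4` is read here as advancing the index by the number of elements handled per iteration, which is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward form: dst, src and gain all advance, len counts down
 * (corresponds to the three "add" instructions plus "dec r3 / jnz"). */
static void scale_pairs_ptr(float *dst, const float *src, const float *gain,
                            size_t len, ptrdiff_t step)
{
    assert(dst && src && gain);
    for (; len > 0; len--) {
        dst[0] = src[0] * gain[0];   /* first float of the pair  */
        dst[1] = src[1] * gain[0];   /* second float of the pair */
        dst  += 2;
        src  += step;                /* strided input row        */
        gain += 1;
    }
}

/* Negative-index form: dst and gain point one past their last element
 * and are addressed through an index that runs from -len up to 0.
 * Only src still needs its own pointer update because of the stride.
 * step is a pointer-sized signed type, in the spirit of the intptr_t
 * suggestion, so no sign extension is needed inside the function. */
static void scale_pairs_negidx(float *dst, const float *src, const float *gain,
                               size_t len, ptrdiff_t step)
{
    ptrdiff_t i;

    assert(dst && src && gain);
    dst  += 2 * len;                 /* like: lea r0, [r0+r3*8] */
    gain += len;                     /* like: lea r2, [r2+r3*4] */
    for (i = -(ptrdiff_t)len; i != 0; i++) {   /* like: neg r3 ... add / jnz */
        dst[2 * i + 0] = src[0] * gain[i];
        dst[2 * i + 1] = src[1] * gain[i];
        src += step;
    }
}
```

Both functions write the same values for the same inputs. The second form trades two pointer increments and a separate counter for one index update per iteration, which is the saving the suggested asm is after.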
