Hi,

On Thu, Feb 23, 2012 at 9:28 AM, Christophe Gisquet <[email protected]> wrote:
> again, a simple function which takes approximately 2% of the decoding
> time. The timings go down from 117 (32bits) / 109 (64 bits) cycles to
> 68 cycles. See patch comments for some further experiments.
[..]
> +    movsxd    r4, r4d

Let's change the argument to intptr_t, the caller can likely do this for free.

> +.loop4:
> +    movq      m0, [r2 + 0]
> +    movq      m1, [r2 + 8]
> +    movq      m2, [r1 + 0*STEP]
> +    movq      m3, [r1 + 2*STEP]
> +    movhps    m2, [r1 + 1*STEP]
> +    movhps    m3, [r1 + 3*STEP]
> +    punpckldq m0, m0
> +    punpckldq m1, m1
> +    mulps     m0, m2
> +    mulps     m1, m3
> +    movu      [r0 +  0], m0
> +    movu      [r0 + 16], m1
> +    add       r1, 4*STEP
> +    add       r2, 4*1*4
> +    add       r0, 4*2*4
> +    dec       r3
> +    jnz       .loop4

    shl       r3, 2
    lea       r2, [r2+r3*4]
    lea       r0, [r0+r3*8]
.loop4:
    movq      ..., [r2+r3*4+0/8]
    ...
    movu      [r0+r3*8+0/16], ...
    add       r1, 4*STEP
    add       r3, r4
    jnz       .loop4

Is that faster? It saves one add and one sub in the loop, at the cost of
more complex movh/movus. Also, can the movu be changed to a mova, i.e. can
we somehow guarantee alignment?

Ronald
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
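
For illustration only, the loop restructuring suggested in the review can be sketched in scalar C. This is not the patch's code: the names `scale_pairs_ptr`, `scale_pairs_idx` and the `STRIDE` value are hypothetical, and the sketch assumes the index is negated after the two pointers are moved past the end, a step the asm outline leaves implicit. The first function mirrors the posted loop (three pointer increments plus a down counter); the second uses one negative index that counts up to zero, so its increment also serves as the loop test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stride: number of floats between consecutive source pairs. */
#define STRIDE 4

/* Posted form: dst, gain and src all advance, and a counter runs down. */
static void scale_pairs_ptr(float *dst, const float *src,
                            const float *gain, ptrdiff_t n)
{
    assert(n >= 0);
    while (n-- > 0) {
        dst[0] = src[0] * gain[0];   /* real part      */
        dst[1] = src[1] * gain[0];   /* imaginary part */
        dst  += 2;
        gain += 1;
        src  += STRIDE;
    }
}

/* Suggested form: dst and gain point past the end, and a single negative
 * index runs up to zero. Only src still needs its own increment. */
static void scale_pairs_idx(float *dst, const float *src,
                            const float *gain, ptrdiff_t n)
{
    assert(n >= 0);
    dst  += 2 * n;
    gain += n;
    for (ptrdiff_t i = -n; i != 0; i++) {
        dst[2 * i + 0] = src[0] * gain[i];
        dst[2 * i + 1] = src[1] * gain[i];
        src += STRIDE;
    }
}
```

Both functions write the same output; whether the indexed form is actually faster in asm depends on how the more complex addressing modes are handled, which is the question raised above.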
