Hi,

On Thu, Feb 23, 2012 at 9:28 AM, Christophe Gisquet
<[email protected]> wrote:
> again, a simple function which takes approximately 2% of the decoding
> time. The timings go down from 117 (32bits) / 109 (64 bits) cycles to
> 68 cycles. See patch comments for some further experiments.
[..]
> +    movsxd      r4, r4d

Let's change the argument to intptr_t; the caller can likely do the sign
extension for free, and the movsxd goes away.
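
To illustrate the idea on the C side, here is a minimal sketch. The names
row_ptr, idx and row_len are made up for this mail and are not taken from the
patch. It only shows that an int passed to an intptr_t parameter is widened by
the compiler at the call site, so the callee can use it directly in pointer
arithmetic:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the assembly routine. Because idx is declared
 * as intptr_t, it arrives already widened to pointer size, so no explicit
 * sign extension is needed before using it in address computations. */
static const float *row_ptr(const float *base, intptr_t idx, size_t row_len)
{
    return base + idx * (intptr_t)row_len;
}

/* Hypothetical caller: the int index is converted to intptr_t implicitly
 * by the compiler as part of the call. */
static const float *row_ptr_from_int(const float *base, int idx)
{
    return row_ptr(base, idx, 4);
}
```

Whether the conversion is really free depends on how the caller computes the
index, so this is worth checking in the generated code.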

> +.loop4:
> +    movq        m0, [r2 + 0]
> +    movq        m1, [r2 + 8]
> +    movq        m2, [r1 + 0*STEP]
> +    movq        m3, [r1 + 2*STEP]
> +    movhps      m2, [r1 + 1*STEP]
> +    movhps      m3, [r1 + 3*STEP]
> +    punpckldq   m0, m0
> +    punpckldq   m1, m1
> +    mulps       m0, m2
> +    mulps       m1, m3
> +    movu [r0 +  0], m0
> +    movu [r0 + 16], m1
> +    add         r1, 4*STEP
> +    add         r2, 4*1*4
> +    add         r0, 4*2*4
> +    dec         r3
> +    jnz .loop4

shl r3, 2
lea r2, [r2+r3*4]
lea r0, [r0+r3*8]
neg r3
.loop4:
movq ..., [r2+r3*4+0/8]
...
movu [r0+r3*8+0/16], ...
add r1, 4*STEP
add r3, 4
jnz .loop4

Is that faster? It drops two instructions from the loop (one add and the
dec), at the cost of more complex addressing in the movq/movu
instructions. Also, can the movu be changed to a mova, i.e. can we somehow
guarantee alignment?
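
For reference, the same indexing trick written as a scalar C sketch. The
function and parameter names (scale_pairs_negidx, src_stride, is_aligned16)
are invented for this mail, and the strided source layout is my assumption
about what r1/STEP walk over. The counter runs from -n up to 0 and doubles as
the index for both the gain and the destination arrays, so only the source
pointer needs its own increment:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the suggested loop (hypothetical names).
 * gain has n entries, dst has 2*n entries, src is read as one pair of
 * floats every src_stride floats. */
static void scale_pairs_negidx(float *dst, const float *src,
                               ptrdiff_t src_stride,
                               const float *gain, ptrdiff_t n)
{
    /* Point at the ends of the arrays, like the two lea instructions. */
    const float *gain_end = gain + n;
    float *dst_end = dst + 2 * n;

    /* i counts up from -n to 0; reaching zero ends the loop, so no
     * separate compare or down-counter is needed. */
    for (ptrdiff_t i = -n; i != 0; i++) {
        float g = gain_end[i];
        dst_end[2 * i]     = src[0] * g;
        dst_end[2 * i + 1] = src[1] * g;
        src += src_stride; /* the one pointer that still advances */
    }
}

/* mova needs a 16-byte aligned address; this is the condition the caller
 * would have to guarantee for the destination. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}
```

If the destination cannot be guaranteed to pass a check like is_aligned16,
the movu has to stay.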

Ronald