From: Tom Herbert
> Sent: 03 February 2016 19:19
...
> +     /* Main loop */
> +50:  adcq    0*8(%rdi),%rax
> +     adcq    1*8(%rdi),%rax
> +     adcq    2*8(%rdi),%rax
> +     adcq    3*8(%rdi),%rax
> +     adcq    4*8(%rdi),%rax
> +     adcq    5*8(%rdi),%rax
> +     adcq    6*8(%rdi),%rax
> +     adcq    7*8(%rdi),%rax
> +     adcq    8*8(%rdi),%rax
> +     adcq    9*8(%rdi),%rax
> +     adcq    10*8(%rdi),%rax
> +     adcq    11*8(%rdi),%rax
> +     adcq    12*8(%rdi),%rax
> +     adcq    13*8(%rdi),%rax
> +     adcq    14*8(%rdi),%rax
> +     adcq    15*8(%rdi),%rax
> +     lea     128(%rdi), %rdi
> +     loop    50b

I'd need convincing that unrolling the loop like that gives any significant
gain.
You have a dependency chain on the carry flag, so each 'adcq' has to wait for
the one before it; those delays may be more significant than the memory reads
from the L1 cache.

I also don't remember the 'loop' instruction being fast (I might be wrong).
If 'loop' is fast then you will probably find that:

10:     adcq 0(%rdi),%rax
        lea  8(%rdi),%rdi
        loop 10b

is just as fast since the three instructions could all be executed in parallel.
But I suspect that 'dec %rcx; jnz 10b' is actually better (and might execute as
a single micro-op).
'dec' leaves the carry flag alone, so the 'adc' chain would still be correct.
IIRC 'adc' and 'dec' will both have dependencies on the flags register
so cannot execute together (which is a shame here).

It is also possible that breaking the carry-chain dependency by doing 32-bit
adds (possibly after 64-bit reads) could be faster.
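
A minimal sketch of that idea in C, to make the comparison concrete. The
function names (fold64, csum_adc_chain, csum_32bit_adds) are made up for
illustration and are not taken from the patch. The first routine emulates the
64-bit add-with-carry chain; the second reads 64-bit words but adds the two
32-bit halves into separate 64-bit accumulators, so no add depends on a carry
from the previous one. It assumes the buffer is a whole number of 8-byte words
and is short enough that the accumulators cannot overflow. Whether it is
actually faster would have to be measured.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold a 64-bit sum down to a 16-bit
 * ones-complement sum, wrapping carries back in at each step. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Emulates the 'adcq' chain: every add needs the carry out of the
 * previous add, so the adds form one serial dependency chain.
 * (The real code defers the carry into the next adc; adding it
 * immediately, as here, gives the same result.) */
static uint16_t csum_adc_chain(const void *buf, size_t nwords)
{
	const unsigned char *p = buf;
	uint64_t sum = 0;

	for (size_t i = 0; i < nwords; i++) {
		uint64_t w;

		memcpy(&w, p + 8 * i, sizeof(w));
		sum += w;
		if (sum < w)	/* carry out: wrap it back in */
			sum++;
	}
	return fold64(sum);
}

/* 64-bit reads, 32-bit adds: the halves go into two 64-bit
 * accumulators, so carries collect in the upper bits and are only
 * folded once at the end.
 * Assumes nwords < 2^32 so neither accumulator can overflow. */
static uint16_t csum_32bit_adds(const void *buf, size_t nwords)
{
	const unsigned char *p = buf;
	uint64_t lo = 0, hi = 0;

	for (size_t i = 0; i < nwords; i++) {
		uint64_t w;

		memcpy(&w, p + 8 * i, sizeof(w));
		lo += (uint32_t)w;	/* low 32 bits */
		hi += w >> 32;		/* high 32 bits */
	}
	return fold64(lo + hi);
}
```

Both routines return the same folded 16-bit value for the same buffer, because
2^32 and 2^64 are both congruent to 1 modulo 2^16 - 1. The two accumulators in
the second routine are independent of each other, which is what gives the CPU
room to overlap the adds.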

        David
