https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #21 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Mason from comment #20)
> Doh! You're right.
> I come from a background where overlapping/aliasing inputs are heresy,
> thus got blindsided :(
> 
> This would be the optimal code, right?
> 
> add4i:
> # rdi = dst, rsi = a, rdx = b
>       movq     0(%rdx), %r8
>       movq     8(%rdx), %rax
>       movq    16(%rdx), %rcx
>       movq    24(%rdx), %rdx
>       addq     0(%rsi), %r8
>       adcq     8(%rsi), %rax
>       adcq    16(%rsi), %rcx
>       adcq    24(%rsi), %rdx
>       movq    %r8,   0(%rdi)
>       movq    %rax,  8(%rdi)
>       movq    %rcx, 16(%rdi)
>       movq    %rdx, 24(%rdi)
>       ret
> 

If one does not care deeply about latency (which is likely for a function that
stores its result to memory), then that looks good enough.
But if one does care deeply, then I'd expect interleaved loads, as in the first
8 lines of the code generated by trunk, to produce slightly lower latency on
the majority of modern CPUs.
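
For reference, here is a rough portable C sketch of what the quoted routine
computes. It is not the test case from this bug. The limb layout (four 64-bit
limbs, least significant first), the parameter types and the manual carry
detection are my assumptions for illustration. The point it shows is the
aliasing constraint discussed above: every load of a and b has to happen before
the first store to dst, because dst may overlap either input.

```c
#include <stdint.h>

/* Illustrative sketch only: dst = a + b over four 64-bit limbs,
   least significant limb first. Name and signature mirror the quoted
   assembly, but the C types are an assumption. */
void add4i(uint64_t dst[4], const uint64_t a[4], const uint64_t b[4])
{
    uint64_t r[4];          /* results stay in locals until all inputs are read */
    unsigned carry = 0;

    for (int i = 0; i < 4; i++) {
        uint64_t s = a[i] + carry;
        unsigned c1 = s < carry;    /* wrapped while adding the incoming carry */
        r[i] = s + b[i];
        unsigned c2 = r[i] < s;     /* wrapped while adding b[i] */
        carry = c1 | c2;            /* at most one of the two can be set */
    }

    /* Stores come last, so dst may alias a or b safely. */
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}
```

Calling add4i(a, a, b) updates a in place, which is exactly the overlapping case
that forbids storing any limb of dst before the inputs have been read.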
