
already5chosen at yahoo dot com via Gcc-bugs Fri, 16 Jun 2023 07:56:15 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617


--- Comment #21 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Mason from comment #20)
> Doh! You're right.
> I come from a background where overlapping/aliasing inputs are heresy,
> thus got blindsided :(
> 
> This would be the optimal code, right?
> 
> add4i:
> # rdi = dst, rsi = a, rdx = b
>       movq     0(%rdx), %r8
>       movq     8(%rdx), %rax
>       movq    16(%rdx), %rcx
>       movq    24(%rdx), %rdx
>       addq     0(%rsi), %r8
>       adcq     8(%rsi), %rax
>       adcq    16(%rsi), %rcx
>       adcq    24(%rsi), %rdx
>       movq    %r8,   0(%rdi)
>       movq    %rax,  8(%rdi)
>       movq    %rcx, 16(%rdi)
>       movq    %rdx, 24(%rdi)
>       ret
> 

If one does not care deeply about latency (which is likely for a function that
stores its result into memory), then that looks good enough.
But if one does care deeply, then I'd expect interleaved loads, as in the first 8
lines of the code generated by trunk, to give slightly lower latency on the majority
of modern CPUs.
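
For readers following the aliasing point from comment #20, here is a minimal C sketch of what the quoted add4i listing computes. It assumes add4i adds two 256-bit integers stored as four 64-bit limbs, least significant first. The names add4_alias_safe and add_with_carry and the signature are hypothetical, since the bug's actual source is not part of this comment. The property that matters is that every input limb is read before the first store, which is what keeps the quoted assembly correct when dst overlaps a or b.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: one limb of a multi-word add, i.e. what a
   single add/adc instruction computes. carry_in must be 0 or 1. */
static uint64_t add_with_carry(uint64_t x, uint64_t y, uint64_t carry_in,
                               uint64_t *carry_out)
{
    assert(carry_in <= 1);
    uint64_t t = x + y;          /* wraps modulo 2^64 */
    uint64_t c1 = t < x;         /* 1 if x + y wrapped */
    uint64_t r = t + carry_in;
    uint64_t c2 = r < t;         /* 1 if adding the carry wrapped */
    *carry_out = c1 | c2;        /* at most one of c1, c2 can be 1 */
    return r;
}

/* Hypothetical sketch of the add4i computation: dst = a + b for
   256-bit integers held as four 64-bit limbs, least significant
   limb first. The final carry is dropped, as the assembly drops it. */
static void add4_alias_safe(uint64_t dst[4], const uint64_t a[4],
                            const uint64_t b[4])
{
    /* Load every input limb before the first store, so the result is
       still correct when dst overlaps a or b. */
    uint64_t a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3];
    uint64_t b0 = b[0], b1 = b[1], b2 = b[2], b3 = b[3];

    uint64_t c;
    uint64_t r0 = add_with_carry(a0, b0, 0, &c); /* like addq */
    uint64_t r1 = add_with_carry(a1, b1, c, &c); /* like adcq */
    uint64_t r2 = add_with_carry(a2, b2, c, &c);
    uint64_t r3 = add_with_carry(a3, b3, c, &c);

    /* Stores happen only after all loads, mirroring the listing. */
    dst[0] = r0;
    dst[1] = r1;
    dst[2] = r2;
    dst[3] = r3;
}
```

Calling add4_alias_safe(x, x, y) with dst aliasing a gives the same result as writing to a separate buffer. The portable carry test (t < x) does not guarantee that a particular GCC version emits a single add/adc chain; on x86-64 the _addcarry_u64 intrinsic from <immintrin.h> is a common way to request adc more directly.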

[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
