https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #21 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Mason from comment #20)
> Doh! You're right.
> I come from a background where overlapping/aliasing inputs are heresy,
> thus got blindsided :(
>
> This would be the optimal code, right?
>
> add4i:
> # rdi = dst, rsi = a, rdx = b
>       movq     0(%rdx), %r8
>       movq     8(%rdx), %rax
>       movq    16(%rdx), %rcx
>       movq    24(%rdx), %rdx
>       addq     0(%rsi), %r8
>       adcq     8(%rsi), %rax
>       adcq    16(%rsi), %rcx
>       adcq    24(%rsi), %rdx
>       movq    %r8,   0(%rdi)
>       movq    %rax,  8(%rdi)
>       movq    %rcx, 16(%rdi)
>       movq    %rdx, 24(%rdi)
>       ret
>

If one does not care deeply about latency (which is likely for a function that
stores its result into memory), then that looks good enough.
But if one does care deeply, then I'd expect interleaved loads, as in the first
8 lines of the code generated by trunk, to produce slightly lower latency on
the majority of modern CPUs.
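
For reference, the semantics of the quoted assembly can be sketched in portable C. This is only an illustration, not the source from the bug report: the `uint64_t` limb type, the `void` return and the explicit carry arithmetic are assumptions, while the names `dst`, `a` and `b` follow the register comment in the quoted listing. Like the assembly, it reads all eight input limbs before storing anything, so `dst` may overlap `a` or `b`.

```c
#include <stdint.h>

/* Sketch (assumed signature): dst = a + b over four 64-bit limbs,
   least significant limb first, carry out of the top limb discarded. */
void add4i(uint64_t dst[4], const uint64_t a[4], const uint64_t b[4])
{
    uint64_t r[4];
    uint64_t carry = 0;

    for (int i = 0; i < 4; i++) {
        uint64_t s  = a[i] + b[i];      /* may wrap around */
        uint64_t c1 = s < a[i];         /* 1 if a[i] + b[i] wrapped */
        r[i] = s + carry;               /* add carry from previous limb */
        uint64_t c2 = r[i] < s;         /* 1 if adding the carry wrapped */
        carry = c1 | c2;                /* both cannot be set at once */
    }

    /* Store only after every input limb has been read, so that
       overlapping dst/a/b cannot corrupt the operands. */
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}
```

The sketch says nothing about instruction scheduling. Whether the loads from `a` and `b` end up grouped or interleaved is up to the compiler, which is the latency question discussed above.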