https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What                |Removed                    |Added
----------------------------------------------------------------------------
             Status            |UNCONFIRMED                |NEW
   Last reconfirmed            |                           |2019-04-23
                 CC            |                           |hjl.tools at gmail dot com,
                               |                           |jakub at gcc dot gnu.org,
                               |                           |uros at gcc dot gnu.org
          Component            |c                          |target
   Target Milestone            |---                        |8.4
            Summary            |[8 Regression] C code is   |[8/9 Regression] C code is
                               |optimized worse than C++   |optimized worse than C++
     Ever confirmed            |0                          |1

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Started with r257505.  A smaller regression happened already earlier with
r254855.  Before the latter, we emitted:

        pushq   %rbp
        movq    %rdi, %rax
        movq    %rsp, %rbp
        andq    $-64, %rsp
        vmovdqu32       16(%rbp), %zmm1
        vpaddd  80(%rbp), %zmm1, %zmm0
        vmovdqa64       %zmm0, -64(%rsp)
        vmovdqa64       -64(%rsp), %xmm2
        vmovdqa64       -48(%rsp), %xmm3
        vmovdqa64       -32(%rsp), %xmm4
        vmovdqa64       -16(%rsp), %xmm5
        vmovups %xmm2, (%rdi)
        vmovups %xmm3, 16(%rdi)
        vmovups %xmm4, 32(%rdi)
        vmovups %xmm5, 48(%rdi)
        vzeroupper
        leave
        ret

r254855 then changed it into:

        pushq   %rbp
        movq    %rsp, %rbp
        andq    $-32, %rsp
        movq    %rdi, %rax
        vmovdqu32       16(%rbp), %ymm2
        vpaddd  80(%rbp), %ymm2, %ymm0
        vmovq   %xmm0, %rdx
        vmovdqa64       %ymm0, -64(%rsp)
        vmovdqu32       48(%rbp), %ymm3
        vpaddd  112(%rbp), %ymm3, %ymm0
        vmovdqa64       %ymm0, -32(%rsp)
        movq    %rdx, (%rdi)
        movq    -56(%rsp), %rdx
        movq    %rdx, 8(%rdi)
        movq    -48(%rsp), %rdx
        movq    %rdx, 16(%rdi)
        movq    -40(%rsp), %rdx
        movq    %rdx, 24(%rdi)
        vmovq   %xmm0, 32(%rax)
        movq    -24(%rsp), %rdx
        movq    %rdx, 40(%rdi)
        movq    -16(%rsp), %rdx
        movq    %rdx, 48(%rdi)
        movq    -8(%rsp), %rdx
        movq    %rdx, 56(%rdi)
        vzeroupper
        leave
        ret

After r257505 we seem to be versioning the loop for alignment or similar, which
can't be right for a loop with just 16 iterations.