https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70976
Bug ID: 70976
Summary: Useless vectorization leads to degradation of performance
Product: gcc
Version: 6.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: b7.10110111 at gmail dot com
Target Milestone: ---

See the following code:

#include <stdio.h>

int main()
{
    unsigned long u = 13;
    for(unsigned long i = 0; i < 1UL<<30; i++)
        u += 23442*u;
    if (u == 0) printf("0\n");
}

Compiling it on an AMD64 system with -O2, I get normal assembly for the loop:

.L2:
        imul    rdx, rdx, 23443
        sub     rax, 1
        jne     .L2

But if I use -O3, the loop looks like this:

.L2:
        movdqa  xmm3, xmm1
        add     eax, 1
        movdqa  xmm0, xmm1
        pmuludq xmm1, xmm4
        cmp     eax, 536870912
        pmuludq xmm3, xmm2
        psrlq   xmm0, 32
        pmuludq xmm0, xmm2
        paddq   xmm0, xmm1
        movdqa  xmm1, xmm3
        psllq   xmm0, 32
        paddq   xmm1, xmm0
        jne     .L2

Not only does the loop become longer, it also needlessly performs the calculation on pairs of identical numbers. On my CPU (Intel(R) Xeon(R) CPU E3-1226 v3 @ 3.30GHz) the -O2 version is almost two times faster than the -O3 one.

This happens with gcc 4.7.3 and newer, but doesn't with 4.6.4 and older.
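For readers wondering why the -O3 loop needs three pmuludq instructions plus shifts and adds: pmuludq only multiplies the low 32 bits of each 64-bit lane, so a full 64-bit product apparently has to be assembled from 32-bit pieces. The following is a minimal scalar sketch of that decomposition, not a transcription of what GCC emits. The function names are hypothetical, and uint64_t stands in for unsigned long on the assumption that it is 64 bits wide, as on AMD64 Linux.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of a*b, built only from 32x32 -> 64 multiplies,
   which is all a single pmuludq lane provides. */
uint64_t mul64_via_32bit_parts(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xffffffffu, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;                /* low halves          */
    uint64_t cross = a_hi * b_lo + a_lo * b_hi;  /* may wrap; only its
                                                    low 32 bits matter  */
    return lo_lo + (cross << 32);                /* hi*hi falls off the
                                                    top entirely        */
}

/* The loop body from the report: u += 23442*u, i.e. u *= 23443. */
uint64_t reported_loop(uint64_t u, uint64_t iterations)
{
    for (uint64_t i = 0; i < iterations; i++)
        u += 23442 * u;
    return u;
}

/* The same recurrence using the split multiply. */
uint64_t split_loop(uint64_t u, uint64_t iterations)
{
    for (uint64_t i = 0; i < iterations; i++)
        u = mul64_via_32bit_parts(u, 23443);
    return u;
}

/* Sanity check: both formulations agree modulo 2^64. */
void self_check(void)
{
    assert(reported_loop(13, 1000) == split_loop(13, 1000));
}
```

Because each iteration depends on the previous value of u, there is only one chain of multiplications to perform. The split form replaces a single imul with several multiplies, shifts and adds, which fits the slowdown reported above.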