https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68483
            Bug ID: 68483
           Summary: gcc 5.2: suboptimal code compared to 4.9
           Product: gcc
           Version: 5.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lvqcl.mail at gmail dot com
  Target Milestone: ---

#include <stdint.h>

void test(int32_t* input, int32_t* out, unsigned x1, unsigned x2)
{
    unsigned i, j;
    unsigned end = x1;

    for(i = j = 0; i < 1000; i++) {
        int32_t sum = 0;
        end += x2;
        for( ; j < end; j++)
            sum += input[j];
        out[i] = sum;
    }
}

options used: -S -O2 -ftree-vectorize -msse2

GCC 5.2 generates the following code:

    ...
    movdqa      %xmm0, %xmm1
    movl        8(%esp), %ebx
    psrldq      $8, %xmm1
    paddd       %xmm1, %xmm0
    movdqa      %xmm0, %xmm3
    pshufd      $255, %xmm0, %xmm2
    addl        %ebx, %eax
    cmpl        %ebx, %esi
    pshufd      $85, %xmm0, %xmm1
    punpckhdq   %xmm0, %xmm3
    movd        %xmm2, %ecx
    punpckldq   %xmm3, %xmm1
    movd        %ecx, %xmm2
    punpcklqdq  %xmm2, %xmm1
    paddd       %xmm1, %xmm0
    movd        %xmm0, %ecx
    ...

while GCC 4.9.2 generates this:

    ...
    movdqa      %xmm0, %xmm1
    movl        8(%esp), %ebx
    psrldq      $8, %xmm1
    paddd       %xmm1, %xmm0
    movdqa      %xmm0, %xmm1
    addl        %ebx, %eax
    cmpl        %ebx, %esi
    psrldq      $4, %xmm1
    paddd       %xmm1, %xmm0
    movd        %xmm0, %ecx
    ...

For the second step of the reduction (shifting by one 32-bit lane before the
final paddd):

GCC 4.9.2: 1 psrldq instruction
GCC 5.2.0: 2 pshufd, 2 movd, 1 punpckhdq, 1 punpckldq, 1 punpcklqdq
           instructions.

Also, GCC 5.2.0 can generate the same code as GCC 4.9.2, but only when the
-mssse3 option is added. It's strange that -mssse3 is necessary to generate
more efficient SSE2 code.
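
For reference, the shorter GCC 4.9.2 sequence quoted above is the usual SSE2 horizontal add of four 32-bit lanes: shift by 8 bytes and add, shift by 4 bytes and add, then move the low lane to a general register. The sketch below spells that out with intrinsics from <emmintrin.h>. The helper names (hsum_epi32_sse2, hsum4_sse2, hsum4_scalar) are hypothetical and not part of the report, and it assumes an x86 target with SSE2 enabled. It is only meant to illustrate the expected instruction pattern, not to reproduce the vectorizer's output.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: sum the four 32-bit lanes of v using SSE2 only. */
static int32_t hsum_epi32_sse2(__m128i v)
{
    /* psrldq $8 + paddd: lane 0 = v0+v2, lane 1 = v1+v3 */
    __m128i t = _mm_add_epi32(v, _mm_srli_si128(v, 8));
    /* psrldq $4 + paddd: lane 0 = v0+v1+v2+v3 */
    t = _mm_add_epi32(t, _mm_srli_si128(t, 4));
    /* movd: extract lane 0 */
    return _mm_cvtsi128_si32(t);
}

/* Hypothetical wrapper: load four int32_t values (unaligned) and reduce. */
static int32_t hsum4_sse2(const int32_t *p)
{
    return hsum_epi32_sse2(_mm_loadu_si128((const __m128i *)p));
}

/* Scalar reference for comparison. */
static int32_t hsum4_scalar(const int32_t *p)
{
    return p[0] + p[1] + p[2] + p[3];
}
```

Both shifts are plain psrldq, an SSE2 instruction, so nothing in this pattern needs SSSE3.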