https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68483

            Bug ID: 68483
           Summary: gcc 5.2: suboptimal code compared to 4.9
           Product: gcc
           Version: 5.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lvqcl.mail at gmail dot com
  Target Milestone: ---

#include <stdint.h>

void test(int32_t* input, int32_t* out, unsigned x1, unsigned x2)
{
        unsigned i, j;
        unsigned end = x1;

        for(i = j = 0; i < 1000; i++) {
                int32_t sum = 0;
                end += x2;
                for( ; j < end; j++)
                        sum += input[j];
                out[i] = sum;
        }
}

Options used: -S -O2 -ftree-vectorize -msse2
GCC 5.2.0 generates the following code:
...
        movdqa  %xmm0, %xmm1
        movl    8(%esp), %ebx
        psrldq  $8, %xmm1
        paddd   %xmm1, %xmm0
        movdqa  %xmm0, %xmm3
        pshufd  $255, %xmm0, %xmm2
        addl    %ebx, %eax
        cmpl    %ebx, %esi
        pshufd  $85, %xmm0, %xmm1
        punpckhdq       %xmm0, %xmm3
        movd    %xmm2, %ecx
        punpckldq       %xmm3, %xmm1
        movd    %ecx, %xmm2
        punpcklqdq      %xmm2, %xmm1
        paddd   %xmm1, %xmm0
        movd    %xmm0, %ecx
...

while GCC 4.9.2 generates this:
...
        movdqa  %xmm0, %xmm1
        movl    8(%esp), %ebx
        psrldq  $8, %xmm1
        paddd   %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        addl    %ebx, %eax
        cmpl    %ebx, %esi
        psrldq  $4, %xmm1
        paddd   %xmm1, %xmm0
        movd    %xmm0, %ecx
...

GCC 4.9.2: 1 psrldq instruction
GCC 5.2.0: 2 pshufd, 2 movd, 1 punpckhdq, 1 punpckldq, 1 punpcklqdq
instructions in its place.
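
For reference, here is a rough sketch of the two sequences written with SSE2
intrinsics. The function names hsum_shift and hsum_shuffle are made up for
illustration, and the mapping from the listings above to intrinsics is my own
reading of the assembly, not something taken from the compiler's internals. It
suggests that the longer GCC 5.2.0 sequence rebuilds the vector { v1, v2, v3, 0 },
which is the same value a single "psrldq $4" produces.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Reduction as in the GCC 4.9.2 listing: two byte shifts and two adds. */
static int32_t hsum_shift(__m128i v)
{
        v = _mm_add_epi32(v, _mm_srli_si128(v, 8));   /* psrldq $8 + paddd */
        v = _mm_add_epi32(v, _mm_srli_si128(v, 4));   /* psrldq $4 + paddd */
        return _mm_cvtsi128_si32(v);                  /* movd */
}

/* Same reduction with the second shift built as in the GCC 5.2.0 listing.
 * Lane comments list elements from lowest to highest. */
static int32_t hsum_shuffle(__m128i v)
{
        v = _mm_add_epi32(v, _mm_srli_si128(v, 8));   /* psrldq $8 + paddd */

        __m128i b3  = _mm_shuffle_epi32(v, 255);      /* pshufd $255: v3 v3 v3 v3 */
        __m128i b1  = _mm_shuffle_epi32(v, 85);       /* pshufd $85:  v1 v1 v1 v1 */
        __m128i hi  = _mm_unpackhi_epi32(v, v);       /* punpckhdq:   v2 v2 v3 v3 */
        __m128i lo  = _mm_unpacklo_epi32(b1, hi);     /* punpckldq:   v1 v2 v1 v2 */
        /* movd to a general register and back: v3 0 0 0 */
        __m128i top = _mm_cvtsi32_si128(_mm_cvtsi128_si32(b3));
        __m128i sh  = _mm_unpacklo_epi64(lo, top);    /* punpcklqdq:  v1 v2 v3 0 */

        v = _mm_add_epi32(v, sh);                     /* paddd */
        return _mm_cvtsi128_si32(v);                  /* movd */
}
```

Both functions should return the sum of the four 32-bit lanes for any input,
for example when called on _mm_set_epi32(4, 3, 2, 1).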

Also, GCC 5.2.0 can generate the same code as GCC 4.9.2, but only when the
-mssse3 option is given. It is strange that -mssse3 is necessary to generate
more efficient SSE2 code.
