https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70976

            Bug ID: 70976
           Summary: Useless vectorization leads to degradation of
                    performance
           Product: gcc
           Version: 6.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: b7.10110111 at gmail dot com
  Target Milestone: ---

See the following code:

#include <stdio.h>
int main()
{
    unsigned long u = 13;
    for(unsigned long i = 0; i < 1UL<<30; i++)
        u += 23442*u;
    if (u == 0) printf("0\n");
}

Compiling it on an AMD64 system with -O2, I get the expected scalar assembly
for the loop:

.L2:
        imul    rdx, rdx, 23443
        sub     rax, 1
        jne     .L2

But if I use -O3, the loop looks like this:

.L2:
        movdqa  xmm3, xmm1
        add     eax, 1
        movdqa  xmm0, xmm1
        pmuludq xmm1, xmm4
        cmp     eax, 536870912
        pmuludq xmm3, xmm2
        psrlq   xmm0, 32
        pmuludq xmm0, xmm2
        paddq   xmm0, xmm1
        movdqa  xmm1, xmm3
        psllq   xmm0, 32
        paddq   xmm1, xmm0
        jne     .L2

Not only is the loop longer, it also needlessly performs the computation on
pairs of identical values. On my CPU (Intel(R) Xeon(R) CPU E3-1226 v3 @
3.30GHz) the -O2 version is almost twice as fast as the -O3 version.
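
For readers unfamiliar with the -O3 listing above, here is a rough C sketch of
what each 64-bit lane appears to compute. pmuludq only multiplies the low 32
bits of each lane, so a full 64-bit product has to be assembled from three
32x32->64 multiplies plus shifts and adds. The function names are made up for
illustration, and the mapping of registers to the low and high halves of the
constant is my reading of the listing, not something confirmed by the compiler
dumps. The multiplier 23443 is used here only to mirror one iteration of the
source loop; the constant actually loaded into the vector registers is not
visible in the listing.

```c
#include <stdint.h>

/* Low 64 bits of a*b using only 32x32->64 multiplies, which is what
   pmuludq provides per 64-bit lane. Hypothetical helper for illustration. */
static uint64_t mul64_via_32(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;   /* psrlq ..., 32 gives a_hi */
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;   /* assumed: two constant registers */

    uint64_t low   = a_lo * b_lo;                  /* first pmuludq */
    uint64_t cross = a_hi * b_lo + a_lo * b_hi;    /* two more pmuludq, then paddq */

    /* Only the low 32 bits of the cross terms survive the shift. */
    return low + (cross << 32);                    /* psllq ..., 32 then paddq */
}

/* One iteration of the source loop: u += 23442*u equals u *= 23443
   (mod 2^64), which is the single imul in the -O2 listing. */
static uint64_t step_scalar(uint64_t u)
{
    return u * UINT64_C(23443);
}

/* The same step expressed through the three-multiply emulation. */
static uint64_t step_emulated(uint64_t u)
{
    return mul64_via_32(u, UINT64_C(23443));
}
```

Both step functions return the same value for any input, which illustrates why
the emulated form costs more per iteration: three multiplies, two shifts and
two adds where the scalar loop needs a single imul.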

This happens with gcc 4.7.3 and newer, but doesn't with 4.6.4 and older.
