[Bug tree-optimization/54939] New: Very poor vectorization of loops with complex arithmetic

ysrumyan at gmail dot com Tue, 16 Oct 2012 07:22:31 -0700


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939




             Bug #: 54939
           Summary: Very poor vectorization of loops with complex arithmetic
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: ysrum...@gmail.com





While analyzing a performance anomaly in SPEC2000, I found that 168.wupwise
is slower with vectorization than without it on x86. The main problem is that
gcc does not recognize certain idioms of complex addition and multiplication
during loop vectorization. For example, for a simple zaxpy loop icc generates
code that is 1.6X faster than the code gcc generates. Here is the assembly
for the zaxpy loop produced by icc:



..B1.4:                         # Preds ..B1.2 ..B1.4
        movups    (%rsi,%rdx), %xmm2                            #7.28
        movups    16(%rsi,%rdx), %xmm5                          #7.28
        movups    (%rsi,%rcx), %xmm4                            #7.17
        movups    16(%rsi,%rcx), %xmm7                          #7.17
        movddup   (%rsi,%rdx), %xmm3                            #7.27
        incq      %r8                                           #6.10
        movddup   16(%rsi,%rdx), %xmm6                          #7.27
        unpckhpd  %xmm2, %xmm2                                  #7.27
        unpckhpd  %xmm5, %xmm5                                  #7.27
        mulpd     %xmm1, %xmm3                                  #7.27
        mulpd     %xmm0, %xmm2                                  #7.27
        mulpd     %xmm1, %xmm6                                  #7.27
        mulpd     %xmm0, %xmm5                                  #7.27
        addsubpd  %xmm2, %xmm3                                  #7.27
        addsubpd  %xmm5, %xmm6                                  #7.27
        addpd     %xmm3, %xmm4                                  #7.9
        addpd     %xmm6, %xmm7                                  #7.9
        movups    %xmm4, (%rsi,%rcx)                            #7.9
        movups    %xmm7, 16(%rsi,%rcx)                          #7.9
        addq      $32, %rsi                                     #6.10
        cmpq      %rdi, %r8                                     #6.10
        jb        ..B1.4        # Prob 64%                      #6.10

(I got it with the -xSSE4.2 -O3 options.) For the gcc compiler the following
options were used: -m64 -mfpmath=sse -march=corei7 -O3 -ffast-math.
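
For readers who want to see the idiom in source form, here is a minimal C
sketch of a zaxpy loop (y[i] += a * x[i] on double-precision complex values).
It is not the test case from this report: the benchmark code is Fortran, and
the function names zaxpy_ref and zaxpy_addsub are hypothetical. The second
function spells out in scalar form the steps the icc listing appears to take
for each complex element; it assumes that %xmm1 holds (ar, ai) and %xmm0 holds
(ai, ar), which the listing itself does not show.

```c
#include <complex.h>
#include <stddef.h>

/* Reference zaxpy: y[i] += a * x[i] for double-precision complex values. */
void zaxpy_ref(size_t n, double complex a,
               const double complex *x, double complex *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* The same computation on interleaved (re, im) doubles, written so that each
 * statement corresponds to one step of the icc listing. The register
 * contents named in the comments are an assumption, not taken from the
 * report. */
void zaxpy_addsub(size_t n, double ar, double ai,
                  const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double xr = x[2 * i];       /* movddup:  broadcast real part (xr, xr) */
        double xi = x[2 * i + 1];   /* unpckhpd: broadcast imag part (xi, xi) */

        /* mulpd: (xr, xr) * (ar, ai) and (xi, xi) * (ai, ar) */
        double t0_lo = xr * ar, t0_hi = xr * ai;
        double t1_lo = xi * ai, t1_hi = xi * ar;

        /* addsubpd: subtract in the low lane, add in the high lane;
         * addpd then accumulates the product into y. */
        y[2 * i]     += t0_lo - t1_lo;   /* real: ar*xr - ai*xi */
        y[2 * i + 1] += t0_hi + t1_hi;   /* imag: ai*xr + ar*xi */
    }
}
```

Each complex product thus costs two multiplies and one addsub per vector, with
no shuffles beyond the two broadcasts, which is the pattern the vectorizer
would need to recognize.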
