https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69041

            Bug ID: 69041
           Summary: Unnecessary push/pop of caller-save register (ecx) on
                    32bit with vector intrinsics.  Sometimes without the
                    pop, clobbering ebp (callee-save)
           Product: gcc
           Version: 5.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, wrong-code
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: i386-linux-gnu

gcc sometimes generates a totally unneeded push/pop of ecx in functions with
vector args.

One manifestation of this omitted the pop, leading to a clobber of the caller's
ebp.  IDK if that's possible for non-empty functions, so it might still just be
a performance bug in practice.

I've boiled this down to a tiny testcase:

    #include <immintrin.h>

    __m256 add_pixdiff(__m256 c[2], __m256i a, __m256i b)
    {
        c[0] = _mm256_setzero_ps();
        return c[0];
     }
    void dummy(__m256 c[2], __m256i a, __m256i b) { }  // clobbers ebp!!
    int dummy2(__m256 c[2], __m256i a, __m256i b) { return 0; }

Compile (on godbolt) with gcc 5.3 -x c -O2 -Wall -mavx2 -m32:
http://goo.gl/CkjU5y

Also reproduced with Ubuntu 15.10 5.2.1 20151010 (gcc /tmp/foo.c -O2 -m32
-mavx2 -S -o- makes near-identical code.)

add_pixdiff:
        pushl   %ebp
        vxorps  %xmm0, %xmm0, %xmm0
        movl    %esp, %ebp
        pushl   %ecx
        leal    8(%ebp), %ecx   # other versions don't have this LEA
        movl    (%ecx), %eax
        vmovaps %ymm0, (%eax)
        popl    %ecx
        popl    %ebp
        ret
dummy:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        vzeroupper    # why is this here?  we didn't do anything!
                       #### AND NO MATCHING POP
        leave          #### clobbers the caller's ebp with the pushed value of
ecx, but the esp=ebp part of leave cleans up after the mismatched push/pop
        ret
dummy2:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        vzeroupper
        xorl    %eax, %eax
        popl    %ecx          # the return 0 version does pop ecx
        popl    %ebp
        ret

When playing around with this, I saw some dependence on the order within the
file (http://goo.gl/wwH1r8):

    #include <immintrin.h>
    __m256 dummy1(__m256 c[2], __m256i a, __m256i b) { }  
        vxorps  %xmm0, %xmm0, %xmm0
        ret

    __m256 dummy2(__m256 c[2], __m256i a, __m256i b) { }
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        popl    %ecx
        popl    %ebp
        ret


With more vector intrinsics inside the function (e.g. http://goo.gl/CWiQu9 semi
cut-down), you don't get a stack frame, but still a push/pop and lea. 
Obviously something is very wrong for 32bit targets.

Reply via email to