https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

            Bug ID: 81274
           Summary: x86 optimizer emits unnecessary LEA instruction when
                    using AVX intrinsics
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cody at codygray dot com
  Target Milestone: ---
            Target: i?86-*-*

When AVX intrinsics are used in a function, the x86-32 optimizer emits
unnecessary LEA instructions that clobber a register, forcing it to be
preserved at additional expense.


Test Code:
----------

        #include <immintrin.h>

        __m256 foo(const float *x)
        {
   __m256 ymmX = _mm256_load_ps(&x[0]);   /* 32-byte aligned load of 8 floats */
   return _mm256_addsub_ps(ymmX, ymmX);   /* alternating subtract/add with itself */
        }


Compile with: "-m32 -mtune=generic -mavx -O2"

The bug also reproduces at -O1 and -O3, and when tuning for any architecture
that supports AVX (it is not specific to "generic" tuning).

It also does not matter whether the code is compiled in C or C++ mode.
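
For example, the generated assembly can be inspected with a command along
these lines (assuming the test case above is saved as foo.c):

        gcc -m32 -mtune=generic -mavx -O2 -S -o - foo.c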

This behavior is exhibited by *all* versions of GCC that support AVX targeting,
from at least 4.9.0 through 8.0.0 (snapshot 20170701).

The code compiles warning-free, of course.

See it live on Godbolt: https://godbolt.org/g/NDDgsA


Actual Disassembly:
-------------------

foo:                                    # -O2 or -O3
        pushl      %ecx                 # save ECX, needed only because of the LEA below
        movl       8(%esp), %eax        # load the pointer parameter 'x'
        leal       8(%esp), %ecx        # redundant: address of the parameter slot, never used
        vmovaps    (%eax), %ymm0
        popl       %ecx                 # restore ECX
        vaddsubps  %ymm0, %ymm0, %ymm0
        ret

The LEA instruction redundantly computes the address of the parameter's stack
slot in ECX, and that address is then promptly discarded without ever being
used. Clobbering ECX also has spill-over effects: additional code must be
emitted to preserve the register's original value (the PUSH+POP pair).

The same bug is observed at -O1, but the instructions are ordered slightly
differently and the address in ECX is actually used to load EAX, funneling the
parameter load through an extra register and lengthening the dependency chain
for no benefit whatsoever.

foo:                                    # -O1
        pushl      %ecx                 # save ECX
        leal       8(%esp), %ecx        # address of the parameter slot
        movl       (%ecx), %eax         # load the pointer parameter through ECX
        vmovaps    (%eax), %ymm0
        vaddsubps  %ymm0, %ymm0, %ymm0
        popl       %ecx                 # restore ECX
        ret


Expected Disassembly:
---------------------

foo:
        movl       4(%esp), %eax        # with no PUSH, the parameter is at 4(%esp)
        vmovaps    (%eax), %ymm0
        vaddsubps  %ymm0, %ymm0, %ymm0
        ret


A shorter sequence that avoids the intermediate GPR load is not possible for
this signature: x arrives as a pointer on the stack, so the pointer itself
must be loaded into a register before the vector can be loaded through it.


The correct code shown above is already generated for x86-64 builds (-m64), so
this optimization deficiency affects only x86-32 builds (-m32).
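
For reference, the -m64 output is essentially the following (a sketch of the
typical codegen rather than a verbatim dump; under the SysV AMD64 ABI the
pointer arrives in RDI, so no stack access is needed at all):

foo:
        vmovaps    (%rdi), %ymm0
        vaddsubps  %ymm0, %ymm0, %ymm0
        ret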
