https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274
Bug ID: 81274
Summary: x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: cody at codygray dot com
Target Milestone: ---
Target: i?86-*-*

When AVX intrinsics are used in a function, the x86-32 optimizer emits
unnecessary LEA instructions that clobber a register, forcing it to be
preserved at additional expense.

Test Code:
----------

#include <immintrin.h>

__m256 foo(const float *x)
{
    __m256 ymmX = _mm256_load_ps(&x[0]);
    return _mm256_addsub_ps(ymmX, ymmX);
}

Compile with: "-m32 -mtune=generic -mavx -O2"

The problem also reproduces at -O1 and -O3, and when tuning for any
architecture that supports AVX (it is not specific to the "generic"
target). It also does not matter whether the code is compiled in C or
C++ mode.

This behavior is exhibited by *all* versions of GCC that support targeting
AVX, from at least 4.9.0 through the 8.0.0 snapshot (20170701). The code
compiles warning-free, of course.

See it live on Godbolt: https://godbolt.org/g/NDDgsA

Actual Disassembly:
-------------------

foo:            # -O2 or -O3
        pushl   %ecx
        movl    8(%esp), %eax
        leal    8(%esp), %ecx
        vmovaps (%eax), %ymm0
        popl    %ecx
        vaddsubps %ymm0, %ymm0, %ymm0
        ret

The LEA instruction redundantly computes the address of the parameter's
stack slot into ECX, and that value is then promptly discarded. Clobbering
ECX also has spill-over effects, requiring that additional code be emitted
to preserve the register's original value (the PUSH+POP pair).

The same bug is observed at -O1, but the ordering of the instructions is
slightly different, and the address computed in ECX is actually used to
load EAX, further lengthening the dependency chain for no benefit
whatsoever.

foo:            # -O1
        pushl   %ecx
        leal    8(%esp), %ecx
        movl    (%ecx), %eax
        vmovaps (%eax), %ymm0
        vaddsubps %ymm0, %ymm0, %ymm0
        popl    %ecx
        ret

Expected Disassembly:
---------------------

foo:
        movl    4(%esp), %eax
        vmovaps (%eax), %ymm0
        vaddsubps %ymm0, %ymm0, %ymm0
        ret

(Without the PUSH, the incoming pointer is at 4(%esp) rather than 8(%esp).)

Equivalent correct code is already generated for x86-64 builds (-m64),
where the pointer argument arrives in a register, so this optimization
deficiency affects only x86-32 builds (-m32).
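
A minimal reproduction sketch, for convenience. This assumes the test code
above is saved as test.c (the file name is arbitrary) and that the installed
GCC has 32-bit multilib support; -S emits assembly and "-o -" writes it to
standard output:

        gcc -m32 -mtune=generic -mavx -O2 -S -o - test.c

Replacing -m32 with -m64 in the same invocation shows the 64-bit code
generation for comparison, which does not exhibit the problem.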