https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69041
Bug ID: 69041 Summary: Unnecessary push/pop of caller-save register (ecx) on 32bit with vector intrinsics. Sometimes without the pop, clobbering ebp (callee-save) Product: gcc Version: 5.3.0 Status: UNCONFIRMED Keywords: missed-optimization, wrong-code Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: i386-linux-gnu gcc sometimes generates a totally unneeded push/pop of ecx in functions with vector args. One manifestation of this omitted the pop, leading to a clobber of the caller's ebp. IDK if that's possible for non-empty functions, so it might still just be a performance bug in practice. I've boiled this down to a tiny testcase: #include <immintrin.h> __m256 add_pixdiff(__m256 c[2], __m256i a, __m256i b) { c[0] = _mm256_setzero_ps(); return c[0]; } void dummy(__m256 c[2], __m256i a, __m256i b) { } // clobbers ebp!! int dummy2(__m256 c[2], __m256i a, __m256i b) { return 0; } Compile (on godbolt) with gcc 5.3 -x c -O2 -Wall -mavx2 -m32: http://goo.gl/CkjU5y Also reproduced with Ubuntu 15.10 5.2.1 20151010 (gcc /tmp/foo.c -O2 -m32 -mavx2 -S -o- makes near-identical code.) add_pixdiff: pushl %ebp vxorps %xmm0, %xmm0, %xmm0 movl %esp, %ebp pushl %ecx leal 8(%ebp), %ecx # other versions don't have this LEA movl (%ecx), %eax vmovaps %ymm0, (%eax) popl %ecx popl %ebp ret dummy: pushl %ebp movl %esp, %ebp pushl %ecx vzeroupper # why is this here? we didn't do anything! #### AND NO MATCHING POP leave #### clobbers the caller's ebp with the pushed value of ecx, but the esp=ebp part of leave cleans up after the mismatched push/pop ret dummy2: pushl %ebp movl %esp, %ebp pushl %ecx vzeroupper xorl %eax, %eax popl %ecx # the return 0 version does pop ecx popl %ebp ret When playing around with this, I saw some dependence on the order within the file (http://goo.gl/wwH1r8): #include <immintrin.h> __m256 dummy1(__m256 c[2], __m256i a, __m256i b) { } vxorps %xmm0, %xmm0, %xmm0 ret __m256 dummy2(__m256 c[2], __m256i a, __m256i b) { } pushl %ebp movl %esp, %ebp pushl %ecx popl %ecx popl %ebp ret With more vector intrinsics inside the function (e.g. http://goo.gl/CWiQu9 semi cut-down), you don't get a stack frame, but still a push/pop and lea. Obviously something is very wrong for 32bit targets.