https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97286
            Bug ID: 97286
           Summary: GCC sometimes uses an extra xmm register for the
                    destination of _mm_blend_ps
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: shlomo at fastmail dot com
  Target Milestone: ---

GCC sometimes uses an extra xmm register for the destination of
_mm_blend_ps, instead of doing the blend in place.

Example program:

// gcc -Wall -O3 -march=znver1 -S
#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

void foo(const __m128i *in, __m128i *out, size_t count, __m128i a)
{
    while (count--) {
        __m128 b = (__m128) _mm_loadu_si128(in++);
        a = (__m128i) _mm_blend_ps((__m128) a, b, 0x5);
        _mm_storeu_si128(out++, a);
    }
}

GCC output for the loop:

.L3:
        vblendps $5, (%rdi,%rax), %xmm1, %xmm0
        decq    %rdx
        vmovdqu %xmm0, (%rsi,%rax)
        vmovdqa %xmm0, %xmm1
        addq    $16, %rax
        cmpq    $-1, %rdx
        jne     .L3

Note that the destination of vblendps is xmm0 instead of xmm1, which
increases register pressure and requires an extra mov (vmovdqa %xmm0,
%xmm1) to carry the result into the next iteration.

clang does the vblendps in place:

.LBB0_2:                               # =>This Inner Loop Header: Depth=1
        vblendps $5, (%rdi,%rax), %xmm0, %xmm0 # xmm0 = mem[0],xmm0[1],mem[2],xmm0[3]
        decq    %rdx
        vmovups %xmm0, (%rsi,%rax)
        leaq    16(%rax), %rax
        jne     .LBB0_2

This missed optimization causes register spills in a more complex loop I
have, though I didn't measure a significant performance difference for my
use case.
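
In case it is useful for triage, here is an equivalent formulation of the
loop body that swaps the blend operands and inverts the mask (0xA instead
of 0x5 selects the same lanes from the same sources). This is only a
sketch: foo_swapped is a hypothetical name, and I have not verified
whether it actually changes GCC's register allocation.

// Hypothetical workaround sketch, untested; semantically identical to foo.
#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

void foo_swapped(const __m128i *in, __m128i *out, size_t count, __m128i a)
{
    while (count--) {
        __m128 b = (__m128) _mm_loadu_si128(in++);
        /* Mask 0xA takes lanes 1 and 3 from a and lanes 0 and 2 from b,
           producing the same result as _mm_blend_ps((__m128) a, b, 0x5). */
        a = (__m128i) _mm_blend_ps(b, (__m128) a, 0xA);
        _mm_storeu_si128(out++, a);
    }
}

One trade-off: with the operands swapped, the load of b can no longer be
folded into vblendps (only the second source of the instruction may be a
memory operand), so this would exchange the extra register copy for a
separate vmovups.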