https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105079
            Bug ID: 105079
           Summary: _mm_storeu_si16 inefficiently uses pextrw to an
                    integer reg (without SSE4.1)
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

With PR105066 fixed, we do _mm_loadu_si16 with pinsrw from memory, because
that's available with just SSE2.  (And the cause wasn't tuning choices; it
was a typo in which insns GCC thought were available.)  Related: PR105072,
re: folding such 16-bit loads into memory source operands for PMOVZX/SXBQ.

But the famously non-orthogonal SSE2 only includes pextrw $imm, %xmm, reg;
the reg/mem form doesn't exist until SSE4.1 (with a longer opcode, for no
apparent reason, instead of just allowing memory addressing modes for the
existing one.  But it's the same mnemonic, so the assembler takes care of
it: https://www.felixcloutier.com/x86/pextrw).  So we do need to care about
tuning for _mm_storeu_si16(p, v) without the option of PEXTRW to memory.

Currently we do this, which is obviously bad:

        pextrw  $0, %xmm0, %eax    # 2 uops
        movw    %ax, (%rdi)

We should be doing this:

        movd    %xmm0, %eax        # 1 uop
        movw    %ax, (%rdi)

https://godbolt.org/z/Ee3Ez174M

This is especially true if we don't need the integer value zero-extended
into EAX.  If we *did* also want the value zero-extended in an integer
register: the extra uop in PEXTRW (in addition to the port-0 uop it has in
common with MOVD) is a port-5 shuffle to extract an arbitrary 16-bit
element, vs. a separate integer movzwl %cx, %eax that could run on any
integer ALU port (including port 6 on HSW/SKL, which doesn't compete with
any vector ALUs).  Mov-elimination for movzwl doesn't work on any current
CPUs; only movzbl gets eliminated on Intel, and movl / movq on both Intel
and AMD.
So currently there's no benefit to picking a different register like %ecx
instead of just using movzwl %ax, %eax.

When we both store and use the integer value:

        int store16_and_use(void *p, __m128i v){
            _mm_storeu_si16( p, v );
            return 123 + *(unsigned short*)p;
        }

https://godbolt.org/z/zq6TMo1oE

Current trunk GCC does this, which is not bad:

        # -O3 with or without -msse4.1
        pextrw  $0, %xmm0, %eax
        movw    %ax, (%rdi)
        addl    $123, %eax
        ret

Clang 13 uses MOVD + MOVZX like I was suggesting, even though it costs more
code size.  That's not necessarily better:

        movd    %xmm0, %eax
        movw    %ax, (%rdi)
        movzwl  %ax, %eax
        addl    $123, %eax
        retq

In this case it's not obviously wrong to use PEXTRW to an integer reg, but
it's also fine to do it clang's way.  So however that corner case shakes
out in the process of fixing the main bug (using movd / movw without
SSE4.1 when we don't reload) is fine.

If SSE4.1 is available, the no-reload case should probably use PEXTRW to
memory instead of movd + movw.  On some CPUs, the ALU uop that's part of
PEXTRW has more choice of ALU port than xmm->gp_int operations do.