https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105079

            Bug ID: 105079
           Summary: _mm_storeu_si16 inefficiently uses pextrw to an
                    integer reg (without SSE4.1)
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

With PR105066 fixed, we do _mm_loadu_si16 with pinsrw from memory, because
that's available with just SSE2.  (And the cause wasn't tuning choices, it was
a typo in what insns GCC thought were available.)  Related: PR105072 re:
folding such 16-bit loads into memory source operands for PMOVZX/SXBQ.

But the famously non-orthogonal SSE2 only includes pextrw $imm, %xmm, reg.  The
reg/mem form didn't arrive until SSE4.1 (with a longer opcode for no apparent
reason, instead of just allowing mem addressing modes for the existing one; but
it's the same mnemonic, so the assembler takes care of it.
https://www.felixcloutier.com/x86/pextrw)

So we do need to care about tuning for _mm_storeu_si16(p, v) without the option
of PEXTRW to memory.  Currently we do this, which is obviously bad:

    pextrw  $0, %xmm0, %eax      # 2 uops
    movw    %ax, (%rdi)

we should be doing this:

    movd    %xmm0, %eax          # 1 uop
    movw    %ax, (%rdi)

https://godbolt.org/z/Ee3Ez174M

This is especially true if we don't need the integer value zero-extended into
EAX.
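The suggested movd + movw sequence can be sketched with SSE2-only intrinsics
(the helper name here is illustrative, not part of any header); it stores
exactly the low 16 bits of the vector, matching _mm_storeu_si16's semantics:

```c
#include <emmintrin.h>   // SSE2: _mm_cvtsi128_si32
#include <stdint.h>
#include <string.h>

// Illustrative SSE2-only stand-in for _mm_storeu_si16:
//   movd %xmm0, %eax   ->  _mm_cvtsi128_si32 (low 32 bits of the vector)
//   movw %ax, (%rdi)   ->  2-byte memcpy of the low half
static inline void store16_via_movd(void *p, __m128i v)
{
    uint16_t lo16 = (uint16_t)_mm_cvtsi128_si32(v);
    memcpy(p, &lo16, 2);   // a 2-byte memcpy normally lowers to a single movw
}
```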

If we *did* also want the value zero-extended in an integer register, the extra
uop in PEXTRW (in addition to the port-0 uop it has in common with MOVD) is a
port-5 shuffle to extract an arbitrary 16-bit element, whereas a separate
integer movzwl %cx, %eax could run on any integer ALU port.  (Including port 6
on HSW/SKL, which doesn't compete with any vector ALUs.)

Mov-elimination for movzwl doesn't work on any current CPUs, only movzbl on
Intel, and movl / movq on both Intel and AMD.  So currently there's no benefit
to picking a different register like %ecx instead of just using movzwl %ax,
%eax.

When we both store and use the integer value:

int store16_and_use(void *p, __m128i v){
    _mm_storeu_si16( p, v );
    return 123 + *(unsigned short*)p;
}

https://godbolt.org/z/zq6TMo1oE
Current trunk GCC does this, which is not bad:

# -O3 with or without -msse4.1
        pextrw  $0, %xmm0, %eax
        movw    %ax, (%rdi)
        addl    $123, %eax
        ret

Clang13 uses MOVD + MOVZX like I was suggesting, even though it costs more code
size.  That's not necessarily better:

        movd    %xmm0, %eax
        movw    %ax, (%rdi)
        movzwl  %ax, %eax
        addl    $123, %eax
        retq
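In scalar terms, clang's sequence computes the following (a sketch; the
function name is made up, and only SSE2 intrinsics are used):

```c
#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

// Scalar mirror of clang's code for store16_and_use:
static int store16_and_use_ref(void *p, __m128i v)
{
    uint32_t lo32 = (uint32_t)_mm_cvtsi128_si32(v);  // movd   %xmm0, %eax
    uint16_t lo16 = (uint16_t)lo32;
    memcpy(p, &lo16, 2);                             // movw   %ax, (%rdi)
    return 123 + lo16;                               // movzwl %ax, %eax
}                                                    // addl   $123, %eax
```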

In this case it's not obviously wrong to use PEXTRW to an integer reg, but it's
also fine to do it clang's way.  So whichever way that corner case shakes out
in the process of fixing the main bug (using movd / movw without SSE4.1 when we
don't reload) is fine.

If SSE4.1 is available, the no-reload case should probably use PEXTRW to memory
instead of movd + movw.  On some CPUs, the ALU op that's part of PEXTRW has
more choice of ALU port than xmm->gp_int operations.
