https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97286

            Bug ID: 97286
           Summary: GCC sometimes uses an extra xmm register for the
                    destination of _mm_blend_ps
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: shlomo at fastmail dot com
  Target Milestone: ---

GCC sometimes uses an extra xmm register for the destination of _mm_blend_ps,
instead of doing the blend in-place.

Example program:

// gcc -Wall -O3 -march=znver1 -S
#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>
void foo(const __m128i *in, __m128i *out, size_t count, __m128i a) {
    while (count--) {
        __m128 b = (__m128) _mm_loadu_si128(in++);
        /* keep lanes 1 and 3 of a, take lanes 0 and 2 from b; a is loop-carried */
        a = (__m128i)_mm_blend_ps((__m128)a, b, 0x5);
        _mm_storeu_si128(out++, a);
    }
}

GCC output for the loop:

.L3:
        vblendps        $5, (%rdi,%rax), %xmm1, %xmm0
        decq    %rdx
        vmovdqu %xmm0, (%rsi,%rax)
        vmovdqa %xmm0, %xmm1
        addq    $16, %rax
        cmpq    $-1, %rdx
        jne     .L3

Note that the destination of vblendps is xmm0 instead of xmm1, which increases
register pressure and requires an extra register-to-register move
(vmovdqa %xmm0, %xmm1) to carry a into the next iteration.
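
The copy could be avoided by blending into xmm1 directly. A hand-edited
version of the same loop, shown here only for illustration (expected output,
not actual compiler output):

.L3:
        vblendps        $5, (%rdi,%rax), %xmm1, %xmm1
        decq    %rdx
        vmovdqu %xmm1, (%rsi,%rax)
        addq    $16, %rax
        cmpq    $-1, %rdx
        jne     .L3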

clang does the vblendps in-place:

.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vblendps        $5, (%rdi,%rax), %xmm0, %xmm0   # xmm0 = mem[0],xmm0[1],mem[2],xmm0[3]
        decq    %rdx
        vmovups %xmm0, (%rsi,%rax)
        leaq    16(%rax), %rax
        jne     .LBB0_2

This missed optimization causes register spills in a more complex loop I have,
though I didn't measure a significant performance difference for my use case.
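
For reference, the reproducer can also be written with the explicit cast
intrinsics instead of the C-style vector casts. I would expect the same
assembly for it (an assumption, not verified), which would suggest the casts
themselves are not what confuses the register allocator:

// gcc -Wall -O3 -march=znver1 -S
#include <stddef.h>
#include <immintrin.h>
/* hypothetical variant of foo() above, using _mm_castsi128_ps/_mm_castps_si128 */
void foo2(const __m128i *in, __m128i *out, size_t count, __m128i a) {
    while (count--) {
        __m128 b = _mm_castsi128_ps(_mm_loadu_si128(in++));
        a = _mm_castps_si128(_mm_blend_ps(_mm_castsi128_ps(a), b, 0x5));
        _mm_storeu_si128(out++, a);
    }
}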
