https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366
            Bug ID: 97366
           Summary: [8/9/10/11 Regression] Redundant load with SSE/AVX
                    vector intrinsics
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

When you use the same _mm_load_si128 or _mm256_load_si256 result twice,
sometimes GCC loads it *and* also uses it as a memory source operand.

I'm not certain this is specific to x86 back-ends; please adjust the bug tags
if it happens elsewhere.  (But it probably doesn't on 3-operand load/store
RISC machines; here, one use of the value picks the already-loaded register
while the other picks the original memory location as a source operand.)

#include <stdint.h>
#include <immintrin.h>

void gcc_double_load_128(int8_t *__restrict out, const int8_t *__restrict input)
{
    for (unsigned i = 0 ; i < 1024 ; i += 16) {
        __m128i in = _mm_load_si128((const __m128i*)&input[i]);
        __m128i high = _mm_srli_epi32(in, 4);
        _mm_store_si128((__m128i*)&out[i], _mm_or_si128(in, high));
    }
}

GCC 8 and later at -O3 -mavx2, including 11.0.0 20200920, compile this as:

gcc_double_load_128(signed char*, signed char const*):
        xorl    %eax, %eax
.L6:
        vmovdqa (%rsi,%rax), %xmm1          # load
        vpsrld  $4, %xmm1, %xmm0
        vpor    (%rsi,%rax), %xmm0, %xmm0   # reload as a memory operand
        vmovdqa %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L6
        ret

GCC 7.5 and earlier use  vpor %xmm1, %xmm0, %xmm0  to reuse the copy of the
original that was already loaded.

`-march=haswell` happens to fix this for GCC trunk, for this 128-bit version
but not for a __m256i version.  restrict doesn't make a difference, and there
is no overlap anyway: both loads of the same address happen with no
intervening store, so there's no aliasing reason to reload.

Using a memory source operand for vpsrld wasn't an option: the form with a
memory source takes the *count* from memory, not the data.
https://www.felixcloutier.com/x86/psllw:pslld:psllq

----

Note that *without* AVX, the redundant load is a possible win for code
running on Haswell and later Intel (and AMD) CPUs.  Possibly some heuristic
is saving instructions for the legacy-SSE case (in a way that's probably
worse overall) and hurting the AVX case.

GCC 7.5, -O3 without any -m options:

gcc_double_load_128(signed char*, signed char const*):
        xorl    %eax, %eax
.L2:
        movdqa  (%rsi,%rax), %xmm0
        movdqa  %xmm0, %xmm1    # this instruction avoided
        psrld   $4, %xmm1
        por     %xmm1, %xmm0    # with a memory-source reload, in GCC 8 and later
        movaps  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L2
        rep ret

Using a memory-source POR saves 1 front-end uop by avoiding a register copy,
as long as the indexed addressing mode can stay micro-fused on Intel (that
requires Haswell or later; any AMD is fine).  But in practice it's probably
worse: load-port pressure, space in the out-of-order scheduler, and code size
are all costs of the extra memory-source operand in the SSE version, with the
only upside being 1 fewer uop for the front-end (and thus in the ROB).  And
mov-elimination on modern CPUs (ivybridge and bdver1) means the movdqa
register copy costs no back-end execution resources.

I don't know if GCC trunk is using  por (%rsi,%rax), %xmm0  on purpose for
that reason, or if it's just a coincidence.  I don't think it's a good idea
on most CPUs, even when alignment is guaranteed.

This is of course 100% a loss with AVX: we have to do the `vmovdqa/u` load
anyway for the shift, and the non-destructive shift leaves the original value
in a register, so a memory-source vpor doesn't save us a vmovdqa/u.
And it's an even bigger loss because indexed memory-source operands
un-laminate from 3-operand instructions even on Haswell/Skylake
(https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes/31027695#31027695),
so it hurts the front-end as well as wasting cycles on load ports and taking
up space in the RS (scheduler).

The fact that -mtune=haswell fixes this for 128-bit vectors is interesting,
but it's clearly still a loss in the AVX version for all AVX CPUs: 2 memory
ops / cycle on Zen could become a bottleneck, and it's larger code size.  And
-mtune=haswell *doesn't* fix it for the -mavx2 __m256i version (sketched
below).

There is a possible real advantage in the SSE case, but it's very minor and
outweighed by the disadvantages, especially on older CPUs like Nehalem that
can only do 1 load and 1 store per clock.  (Although this loop has so many
uops that it barely bottlenecks on that.)
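The __m256i version isn't quoted above; a minimal 256-bit analogue of the
test case (a sketch reconstructed from the 128-bit pattern, not necessarily
the exact source behind the report) would be:

#include <stdint.h>
#include <immintrin.h>

/* Hypothetical 256-bit analogue of gcc_double_load_128: the same
   shift-and-OR pattern, widened to 32-byte vectors.  */
void gcc_double_load_256(int8_t *__restrict out, const int8_t *__restrict input)
{
    for (unsigned i = 0 ; i < 1024 ; i += 32) {
        __m256i in = _mm256_load_si256((const __m256i*)&input[i]);
        __m256i high = _mm256_srli_epi32(in, 4);   // shift needs `in` in a register
        _mm256_store_si256((__m256i*)&out[i], _mm256_or_si256(in, high));
    }
}

Per the report above, -O3 -mavx2 (even with -mtune=haswell) still produces
the redundant memory-source vpor for this pattern instead of reusing the
ymm register that was already loaded.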