https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366
            Bug ID: 97366
           Summary: [8/9/10/11 Regression] Redundant load with SSE/AVX
                    vector intrinsics
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

When you use the same _mm_load_si128 or _mm256_load_si256 result twice,
sometimes GCC loads it *and* also uses it as a memory source operand.

I'm not certain this is specific to x86 back-ends; please adjust the bug tags
if it happens elsewhere.  (But it probably doesn't on 3-operand load/store
RISC machines; here, one use of the value picks the already-loaded register
while the other picks the original memory location as a source operand.)

#include <stdint.h>
#include <immintrin.h>

void gcc_double_load_128(int8_t *__restrict out, const int8_t *__restrict input)
{
    for (unsigned i = 0 ; i < 1024 ; i += 16) {
        __m128i in = _mm_load_si128((const __m128i*)&input[i]);
        __m128i high = _mm_srli_epi32(in, 4);
        _mm_store_si128((__m128i*)&out[i], _mm_or_si128(in, high));
    }
}

GCC 8 and later at -O3 -mavx2, including 11.0.0 20200920, compile this as:

gcc_double_load_128(signed char*, signed char const*):
        xorl    %eax, %eax
.L6:
        vmovdqa (%rsi,%rax), %xmm1          # load
        vpsrld  $4, %xmm1, %xmm0
        vpor    (%rsi,%rax), %xmm0, %xmm0   # reload as a memory operand
        vmovdqa %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L6
        ret

GCC 7.5 and earlier use  vpor %xmm1, %xmm0, %xmm0  to reuse the copy of the
original that was already loaded.

`-march=haswell` happens to fix this for GCC trunk, for this 128-bit version
but not for a __m256i version.  restrict doesn't make a difference, and there
is no overlap anyway: both loads of the same address happen with no
intervening store, so there's no aliasing reason to reload.

Using a memory source operand for vpsrld wasn't an option: the form with a
memory source takes the *count* from memory, not the data.
https://www.felixcloutier.com/x86/psllw:pslld:psllq

----

Note that *without* AVX, the redundant load is a possible win for code
running on Haswell and later Intel (and AMD) CPUs.  Possibly some heuristic
is saving instructions for the legacy-SSE case (in a way that's probably
worse overall) and hurting the AVX case.

GCC 7.5, -O3 without any -m options:

gcc_double_load_128(signed char*, signed char const*):
        xorl    %eax, %eax
.L2:
        movdqa  (%rsi,%rax), %xmm0
        movdqa  %xmm0, %xmm1    # this instruction avoided
        psrld   $4, %xmm1
        por     %xmm1, %xmm0    # with a memory-source reload, in GCC 8 and later
        movaps  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L2
        rep ret

Using a memory-source POR saves 1 front-end uop by avoiding a register copy,
as long as the indexed addressing mode can stay micro-fused on Intel (that
requires Haswell or later; any AMD is fine).  But in practice it's probably
worse: load-port pressure, space in the out-of-order scheduler, and code size
are all costs of the extra memory-source operand in the SSE version, with the
only upside being 1 fewer uop for the front-end (and thus in the ROB).  And
mov-elimination on modern CPUs (ivybridge and bdver1) means the movdqa
register copy costs no back-end execution resources.

I don't know if GCC trunk is using  por (%rsi,%rax), %xmm0  on purpose for
that reason, or if it's just a coincidence.  I don't think it's a good idea
on most CPUs, even when alignment is guaranteed.

This is of course 100% a loss with AVX: we have to do the `vmovdqa/u` load
anyway for the shift, and the non-destructive shift leaves the original value
in a register, so a memory-source vpor doesn't save us a vmovdqa/u.
And it's an even bigger loss because indexed memory-source operands
un-laminate from 3-operand instructions even on Haswell/Skylake
(https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes/31027695#31027695),
so it hurts the front-end as well as wasting cycles on load ports and taking
up space in the RS (scheduler).

The fact that -mtune=haswell fixes this for 128-bit vectors is interesting,
but it's clearly still a loss in the AVX version for all AVX CPUs: 2 memory
ops / cycle on Zen could become a bottleneck, and it's larger code size.  And
-mtune=haswell *doesn't* fix it for the -mavx2 __m256i version (sketched
below).

There is a possible real advantage in the SSE case, but it's very minor and
outweighed by the disadvantages, especially on older CPUs like Nehalem that
can only do 1 load and 1 store per clock.  (Although this loop has so many
uops that it barely bottlenecks on that.)
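The __m256i version isn't quoted above; a minimal 256-bit analogue of the
test case (a sketch reconstructed from the 128-bit pattern, not necessarily
the exact source behind the report) would be:

#include <stdint.h>
#include <immintrin.h>

/* Hypothetical 256-bit analogue of gcc_double_load_128: the same
   shift-and-OR pattern, widened to 32-byte vectors.  */
void gcc_double_load_256(int8_t *__restrict out, const int8_t *__restrict input)
{
    for (unsigned i = 0 ; i < 1024 ; i += 32) {
        __m256i in = _mm256_load_si256((const __m256i*)&input[i]);
        __m256i high = _mm256_srli_epi32(in, 4);   // shift needs `in` in a register
        _mm256_store_si256((__m256i*)&out[i], _mm256_or_si256(in, high));
    }
}

Per the report above, -O3 -mavx2 (even with -mtune=haswell) still produces
the redundant memory-source vpor for this pattern instead of reusing the
ymm register that was already loaded.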