https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

            Bug ID: 108401
           Summary: gcc defeats vector constant generation with intrinsics
           Product: gcc
           Version: 11.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andysem at mail dot ru
  Target Milestone: ---

Consider the following code:

#include <immintrin.h>

__m256i load_00FF()
{
    __m256i mm = _mm256_setzero_si256();
    return _mm256_srli_epi16(_mm256_cmpeq_epi64(mm, mm), 8);
}

This function generates a vector constant of alternating 0xFF and 0x00 bytes.
The code is written this way to avoid a load from memory, which may cause a
cache miss. The expected generated code is this:

        vpcmpeqq        ymm0, ymm0, ymm0
        vpsrlw  ymm0, ymm0, 8
        ret

which is almost exactly what gcc 8 generates (it uses vpcmpeqd instead of
vpcmpeqq, which is fine). However, gcc 9 through 11 generate a memory load
instead, defeating the attempt to avoid it:

        vmovdqa ymm0, YMMWORD PTR .LC0[rip]
        ret

and gcc 12 generates even worse code:

        movabs  rax, 71777214294589695
        vmovq   xmm1, rax
        vpbroadcastq    ymm0, xmm1
        ret
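
For reference, the movabs immediate is the decimal spelling of
0x00FF00FF00FF00FF, so gcc 12 builds the same constant in a general-purpose
register and broadcasts it instead of using the two-instruction sequence. A
minimal scalar sketch of the arithmetic follows; the helper names are made up
for illustration and the code does not need AVX2:

```c
#include <assert.h>
#include <stdint.h>

/* One 16-bit lane of the intended sequence: comparing a register with
   itself for equality sets every bit, and a logical right shift by 8
   then clears the high byte of the lane. */
uint16_t lane_00ff(void)
{
    uint16_t all_ones = UINT16_MAX;     /* lane value after vpcmpeqq x, x */
    return (uint16_t)(all_ones >> 8);   /* lane value after vpsrlw ..., 8 */
}

/* Four consecutive lanes packed into one 64-bit element, which is the
   unit that vpbroadcastq replicates in the gcc 12 output. */
uint64_t qword_00ff(void)
{
    uint64_t q = 0;
    for (int i = 0; i < 4; ++i)
        q |= (uint64_t)lane_00ff() << (16 * i);
    return q;
}

/* Hypothetical self-check, not part of the original test case. */
void check_00ff(void)
{
    assert(lane_00ff() == 0x00FF);
    assert(qword_00ff() == UINT64_C(0x00FF00FF00FF00FF));
    assert(qword_00ff() == UINT64_C(71777214294589695)); /* movabs immediate */
}
```

Calling check_00ff() from any main() passes silently, which confirms that all
three code sequences produce the same value and differ only in how they
materialize it.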

In all cases, the compiler flags are: -O3 -march=haswell

Code on godbolt.org: https://gcc.godbolt.org/z/sfT787PY9

I think the compiler should follow the intrinsics in the source more closely,
since despite the apparent equivalence, the choice of instructions can have
performance implications. The code as written by the developer is better in
this case, so it's not clear why the compiler replaces it with a different
sequence.
