https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401
Bug ID: 108401
Summary: gcc defeats vector constant generation with intrinsics
Product: gcc
Version: 11.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: andysem at mail dot ru
Target Milestone: ---

Consider the following code:

    #include <immintrin.h>

    __m256i load_00FF()
    {
        __m256i mm = _mm256_setzero_si256();
        return _mm256_srli_epi16(_mm256_cmpeq_epi64(mm, mm), 8);
    }

This function generates a vector constant of alternating 0xFF and 0x00 bytes. The code is written this way to avoid a load from memory, which may cause a cache miss.

The expected generated code is this:

    vpcmpeqq ymm0, ymm0, ymm0
    vpsrlw ymm0, ymm0, 8
    ret

which is almost exactly what gcc 8 generates (it uses vpcmpeqd instead of vpcmpeqq, which is fine).

However, gcc 9 through 11 generate a memory load instead, defeating the attempt to avoid it:

    vmovdqa ymm0, YMMWORD PTR .LC0[rip]
    ret

and gcc 12 generates worse code:

    movabs rax, 71777214294589695
    vmovq xmm1, rax
    vpbroadcastq ymm0, xmm1
    ret

In all cases, the compiler flags are: -O3 -march=haswell

Code on godbolt.org: https://gcc.godbolt.org/z/sfT787PY9

I think the compiler should follow the code in intrinsics more closely since, despite the apparent equivalence, the choice of instructions can have performance implications. The original code written by the developer is better anyway, so it's not clear why the compiler is being so creative in this case.
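
For reference, the decimal immediate in the gcc 12 output is the same constant expressed as one 64-bit element. The following is a minimal sketch in portable scalar C (no AVX2 needed) that emulates the per-lane arithmetic of the intrinsics and checks it against that immediate. The function names are hypothetical and exist only for this illustration; they are not part of gcc or of the reported code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: emulate one 64-bit element of
   _mm256_srli_epi16(all_ones, shift) in scalar code.
   Each 16-bit lane starts as 0xFFFF and is shifted right logically. */
static uint64_t srli_epi16_all_ones_qword(unsigned shift)
{
    /* Shift counts above 15 zero each lane, as vpsrlw does. */
    uint16_t lane = (shift > 15) ? 0 : (uint16_t)(0xFFFFu >> shift);
    uint64_t q = 0;

    /* Pack four identical 16-bit lanes into one 64-bit element. */
    for (int i = 0; i < 4; ++i) {
        q |= (uint64_t)lane << (16 * i);
    }
    return q;
}

/* Hypothetical check: the immediate that gcc 12 loads with movabs
   equals the emulated result for a shift of 8. */
static void check_gcc12_immediate(void)
{
    assert(srli_epi16_all_ones_qword(8) == 71777214294589695ULL);
    assert(srli_epi16_all_ones_qword(8) == 0x00FF00FF00FF00FFULL);
}
```

On a little-endian x86 target each 16-bit lane 0x00FF is stored as the byte pair FF 00, which gives the alternating byte pattern described in the report.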