[Bug target/108401] gcc defeats vector constant generation with intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #8 from Hongtao.liu ---
> But, if you're going to improve constant generation, please make it so that
> it can recognize not only the particular pattern described in this bug. More
> importantly, it should recognize the all-ones case (as a single pcmpeq) as a
> starting point. Then it can apply shifts to achieve the final result from
> the all-ones vector - shifts of any width, length or direction, including
> psrldq/pslldq. This would improve generated code in a wider range of cases.

Yes, we will try to do that. Generally, folding intrinsics into compiler IR
helps performance, and for this case we need to optimize codegen for special
immediate broadcasts (all-ones + shift).
[Bug target/108401] gcc defeats vector constant generation with intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #7 from andysem at mail dot ru ---
To be clear, I'm not asking the compiler to recognize the particular pattern
of alternating 0x00 and 0xFF bytes, because hardcoding this particular
pattern won't improve generated code in other cases. Rather, I'm asking to
tone down code transformations for intrinsics. If the developer wrote a
sequence of intrinsics to generate a constant, then he probably wanted that
sequence instead of a simple _mm_set1_epi32 or a load from memory.

But, if you're going to improve constant generation, please make it so that
it can recognize not only the particular pattern described in this bug. More
importantly, it should recognize the all-ones case (as a single pcmpeq) as a
starting point. Then it can apply shifts to achieve the final result from
the all-ones vector - shifts of any width, length or direction, including
psrldq/pslldq. This would improve generated code in a wider range of cases.
[Bug target/108401] gcc defeats vector constant generation with intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #6 from andysem at mail dot ru ---
(In reply to Andrew Pinski from comment #1)
> > and gcc 12 generates worse code:
>
> It is not really worse; it depends on how fast moving between the register
> sets is.

I meant "worse" compared to the vpcmpeq+vpsrlw pair.

(Side note about the broadcast version: it could have been smaller if it used
a 32-bit constant and vpbroadcastd. vpcmpeq+vpsrlw would still be better in
this particular case, but if a broadcast is needed, the smaller code
footprint is preferred.)
[Bug target/108401] gcc defeats vector constant generation with intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

Richard Biener changed:

           What            |Removed     |Added
----------------------------------------------------
           Last reconfirmed|            |2023-01-16
           Status          |UNCONFIRMED |NEW
           Ever confirmed  |0           |1

--- Comment #5 from Richard Biener ---
Confirmed. We expand from

  return { 71777214294589695, 71777214294589695,
           71777214294589695, 71777214294589695 };

where we could reduce the DImode broadcast to a HImode one (if that exists).
But sure, the x86 backend could implement the suggested way of generating
this particular pattern. I'll also note that -O0 produces quite bad code
here.
[Bug target/108401] gcc defeats vector constant generation with intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

Alexander Monakov changed:

           What|Removed |Added
----------------------------------------------
           CC  |        |amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov ---
(In reply to Hongtao.liu from comment #3)
> > and gcc 12 generates worse code:
> >
> >     movabs  rax, 71777214294589695
> >     vmovq   xmm1, rax
> >     vpbroadcastq ymm0, xmm1
> >     ret
>
> It's on purpose by edafb35bdadf309ebb9d1eddc5549f9e1ad49c09, since a
> microbenchmark shows that moving from an immediate is faster than a load
> from memory.

But the bug is not asking you to reinstate loading from memory. The bug is
asking you to notice that the result can be constructed via cmpeq+psrlw,
which is even better than a broadcast (cmpeq with dst the same as src is
usually a dependency-breaking instruction that does not occupy an execution
port).
[Bug target/108401] gcc defeats vector constant generation with intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

Hongtao.liu changed:

           What|Removed |Added
----------------------------------------------
           CC  |        |crazylht at gmail dot com

--- Comment #3 from Hongtao.liu ---
> and gcc 12 generates worse code:
>
>     movabs  rax, 71777214294589695
>     vmovq   xmm1, rax
>     vpbroadcastq ymm0, xmm1
>     ret

It's on purpose by edafb35bdadf309ebb9d1eddc5549f9e1ad49c09, since a
microbenchmark shows that moving from an immediate is faster than a load
from memory.

> In all cases, the compiler flags are: -O3 -march=haswell
>
> Code on godbolt.org: https://gcc.godbolt.org/z/sfT787PY9
>
> I think the compiler should follow the code in the intrinsics more closely,
> since despite the apparent equivalence, the choice of instructions can have
> performance implications. The original code written by the developer is
> better anyway, so it's not clear why the compiler is being so creative in
> this case.
[Bug target/108401] gcc defeats vector constant generation with intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #2 from Andrew Pinski ---
r12-1958-gedafb35bdadf30 changed the behavior in GCC 12 to be better (see the
commit message, which shows it is better than doing a memory load).
[Bug target/108401] gcc defeats vector constant generation with intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #1 from Andrew Pinski ---
> and gcc 12 generates worse code:

It is not really worse; it depends on how fast moving between the register
sets is.