https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111378
--- Comment #5 from Gabriel Ravier <gabravier at gmail dot com> ---
It does seem as though this transformation is not particularly favorable on most platforms. In fact, it seems as though the opposite transformation (which Clang does on many targets, along with MSVC) would be useful on most targets, with some exceptions, including:

- PowerPC, on which llvm-mca appears to consider `srdi.` to be faster than `cmplwi`
- MIPS16, though I am unsure of this: GCC code generation is messy there and I have trouble getting llvm-mca to parse GCC's output, but it seems to consider loading the constant from memory to be far slower than even doing the shift in two steps (which MIPS16 apparently requires, given GCC emits two `srl $4, $4, 8` instructions to do the shift)
- LoongArch, which seems to give code for `x < 0x10000` that I would have a hard time imagining being faster than a single shift, given that it outputs this:

      lu12i.w $r12,61440>>12 # 0xf000
      ori $r5,$r12,4095
      sltu $r4,$r5,$r4
      xori $r6,$r4,1
      andi $r4,$r6,1

  whereas a shift outputs this:

      bstrpick.d $r4,$r4,31,16
      sltui $r4,$r4,1

(Note: I am not too certain about some of these, but it also seems like Alpha, C6x, FR-V, RISC-V 64 and Sparc emit much smaller code sequences (i.e. 2-3 times smaller) for the shifting version than for the comparing version, and they look faster at first glance.)

(PS: Given I do not have a server farm containing every single target GCC supports for the purposes of benchmarking this, I'm mostly assuming this from manually peeking at assembly output to guess which would be better, and from looking at what llvm-mca considers to be the faster instruction sequence on the targets it supports. So llvm-mca and I could potentially just be wrong, though I would hope LLVM correctly models the performance of the chips it targets...)
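For reference, here is a minimal sketch of the two source forms being compared. It assumes a 32-bit unsigned operand, which is my reading of the LoongArch output above (it extracts bits 31..16); the function names are hypothetical and do not come from the bug report's test case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Comparing form: true when x is below 2^16. */
bool below_64k_compare(uint32_t x)
{
    return x < 0x10000u;
}

/* Shifting form: true when bits 31..16 of x are all zero. */
bool below_64k_shift(uint32_t x)
{
    return (x >> 16) == 0;
}

/* Asserts that both forms agree on a few values around the 16-bit boundary. */
void check_forms_agree(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0xFFFFu, 0x10000u, 0x10001u, 0x7FFFFFFFu, 0xFFFFFFFFu
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(below_64k_compare(samples[i]) == below_64k_shift(samples[i]));
}
```

Since the two functions compute the same result for every `uint32_t` input, a compiler is free to emit either sequence for either one; which is preferable is the per-target question discussed above.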