https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111378

--- Comment #5 from Gabriel Ravier <gabravier at gmail dot com> ---
It does seem as though this transformation is not particularly favorable on
most platforms. In fact, it seems as though the opposite transformation (which
Clang does on many targets, along with MSVC) would be useful on most target,
with some exceptions, including:

- PowerPC, on which llvm-mca appears to consider `srdi.` to be faster than
`cmplwi`

- MIPS16, though I am unsure of this - GCC code generation is messy on there
and I have trouble getting llvm-mca to parse GCC's output, but it seems to
consider loading the constant from memory to be far slower than even doing the
shift in two steps (which MIPS16 apparently requires, given GCC emits two `srl
$4, $4, 8` instructions to do the shift)

- Loongarch, which seems to give code for `x < 0x10000` that I would have a
hard time imagining being faster than a single shift given that it outputs
this:
  lu12i.w $r12,61440>>12 # 0xf000
  ori $r5,$r12,4095
  sltu $r4,$r5,$r4
  xori $r6,$r4,1
  andi $r4,$r6,1
whereas a shift outputs this:
  bstrpick.d $r4,$r4,31,16
  sltui $r4,$r4,1
(note: I am not too certain for some of these, but it also seems like Alpha,
C6x, FR-V, RISC-V 64 and Sparc emit much smaller code sequences (i.e. 2-3 times
smaller) that look faster at first glance for the shifting version as compared
to the comparing version)


(PS: Given I do not have a server farm containing every single target GCC
supports for the purposes of benchmarking this, I'm mostly assuming this from
manually peeking at assembly output to try and guess which would be better and
what from looking at what llvm-mca considers to be the faster instruction
sequence on the targets it supports, so potentially llvm-mca and me could just
be wrong, though I would hope LLVM correctly models the performance of the
chips it targets...)

Reply via email to