https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704
Segher Boessenkool <segher at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
                 CC|                            |segher at gcc dot gnu.org
   Last reconfirmed|                            |2020-06-17

--- Comment #2 from Segher Boessenkool <segher at gcc dot gnu.org> ---
(In reply to Jens Seifert from comment #0)
> PowerPC processors don't like branches and branch mispredicts lead to large
> overhead.

While that is of course true, the situation isn't worse than on other CPUs.

The situation here is exactly analogous to 64-bit shifts with -m32.

Fixed distance shifts (and rotates) generate pretty much ideal code already
(sometimes it could save a "mr" insn, by reordering more -- that is because
the rl*imi insns use a register as both input and output).

> shift left/right unsigned __int128 can be implemented in 8 instructions which
> can be processed on 2 pipelines almost in parallel leading to ~5 cycle
> latency on Power 7 and 8.
> shift right algebraic __int128 can be implemented in 10 instructions.
> Overall comparable in latency of the branching code.

This can be done better, using the fact that shifts over 64..127 bits are
defined just fine for 64-bit power shift insns.

> The unnecessary rldicl 8,5,0,32 at the beginning of the routines are also
> not necessary.

I see no rldicl?

Confirmed.
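
For illustration only, here is a rough C model of the point about shift
counts of 64..127. It assumes the Power ISA behaviour of sld/srd (the count
is taken from the low 7 bits of the register, and counts 64..127 give zero).
The helper names sld, srd, u128 and shl128 are made up for this sketch; they
are not GCC internals, and this is not the code GCC generates. The ?: in the
helpers only models what the hardware insn does by itself, it does not stand
for a branch in the resulting sequence.

```c
#include <stdint.h>

/* Hypothetical model of the Power "sld" insn: the count is the low 7 bits
   of n, and counts 64..127 produce zero.  */
static uint64_t sld(uint64_t x, uint64_t n)
{
    n &= 127;
    return n < 64 ? x << n : 0;
}

/* Hypothetical model of the Power "srd" insn, same count rules.  */
static uint64_t srd(uint64_t x, uint64_t n)
{
    n &= 127;
    return n < 64 ? x >> n : 0;
}

/* Illustrative 128-bit value as two 64-bit halves.  */
typedef struct { uint64_t hi, lo; } u128;

/* Branchless 128-bit shift left for n in 0..127.  Each line maps to one
   insn in this sketch: subfic, addi, 3x sld, srd, 2x or.  */
static u128 shl128(u128 v, unsigned n)
{
    uint64_t c1 = 64 - (uint64_t)n;   /* wraps to 65..127 mod 128 when n > 64 */
    uint64_t c2 = (uint64_t)n - 64;   /* wraps to 64..127 mod 128 when n < 64 */
    uint64_t a = sld(v.hi, n);        /* high part shifted, zero if n >= 64 */
    uint64_t b = srd(v.lo, c1);       /* bits carried from lo, for n in 1..64 */
    uint64_t c = sld(v.lo, c2);       /* lo moved into hi, for n >= 64 */
    u128 r;
    r.hi = a | b | c;
    r.lo = sld(v.lo, n);              /* zero if n >= 64 */
    return r;
}
```

At most two of the three terms feeding r.hi are nonzero for any given n (for
n = 64 the two nonzero terms are both equal to v.lo), so OR-ing them needs no
compare or branch. The unsigned right shift would be the mirror image; the
algebraic one would additionally need srad for the high half.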