https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
GCC will emit SHLD / SHRD as part of shifting an integer that's two registers
wide.
Hironori Bono proposed the following functions as a workaround for this missed
optimization (https://stackoverflow.com/a/71805063/224132):

#include <stdint.h>

#ifdef __SIZEOF_INT128__
uint64_t shldq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)(((((unsigned __int128)high << 64) | (unsigned __int128)low)
                     << (count & 63)) >> 64);
}

uint64_t shrdq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)((((unsigned __int128)high << 64) | (unsigned __int128)low)
                    >> (count & 63));
}
#endif

uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31))
                    >> 32);
}

uint32_t shrd_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}
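
As a sanity check that the wide-integer formulation matches what SHLD / SHRD
compute per register, here is a minimal sketch comparing the two 32-bit
functions against a straightforward two-register version. The names
ref_shld32, ref_shrd32 and count_mismatches are hypothetical helpers for this
sketch, not part of the proposed workaround, and the sample values are
arbitrary.

```c
#include <stdint.h>

/* The two 32-bit functions from above, repeated so this compiles alone. */
uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31))
                    >> 32);
}

uint32_t shrd_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}

/* Hypothetical reference: SHLD written per register.  A masked count of 0
   is special-cased because shifting a uint32_t by 32 is undefined in C. */
static uint32_t ref_shld32(uint32_t low, uint32_t high, uint32_t count) {
  uint32_t c = count & 31;
  if (c == 0) return high;
  return (high << c) | (low >> (32 - c));
}

/* Hypothetical reference: SHRD written per register, same special case. */
static uint32_t ref_shrd32(uint32_t low, uint32_t high, uint32_t count) {
  uint32_t c = count & 31;
  if (c == 0) return low;
  return (low >> c) | (high << (32 - c));
}

/* Returns how many (low, high, count) combinations disagree. */
int count_mismatches(void) {
  static const uint32_t samples[] = {0u, 1u, 0x80000000u, 0xdeadbeefu,
                                     0xffffffffu};
  const unsigned n = sizeof samples / sizeof samples[0];
  int bad = 0;
  for (unsigned i = 0; i < n; ++i)
    for (unsigned j = 0; j < n; ++j)
      for (uint32_t count = 0; count < 64; ++count) {
        /* counts 32..63 exercise the "& 31" masking */
        if (shld_x86(samples[i], samples[j], count) !=
            ref_shld32(samples[i], samples[j], count))
          ++bad;
        if (shrd_x86(samples[i], samples[j], count) !=
            ref_shrd32(samples[i], samples[j], count))
          ++bad;
      }
  return bad;
}
```

This only checks the C-level results; it says nothing about which
instructions a given compiler picks.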

---

The uint64_t functions (using __int128) compile cleanly in 64-bit mode
(https://godbolt.org/z/1j94Gcb4o), using 64-bit operand-size shld/shrd.

But the uint32_t functions compile to a total mess in 32-bit mode (GCC 11.2 -O3
-m32 -mregparm=3) before eventually using shld, including a totally insane
    or      dh, 0

GCC trunk with -O3 -mregparm=3 compiles them cleanly, but without regparm it's
also a mess, just a slightly different one.

Ironically, the uint32_t functions compile to quite a few instructions in
64-bit mode, actually doing the operations as written with shifts and ORs, and
having to manually mask the shift count to &31 because it uses a 64-bit
operand-size shift which masks with &63.  32-bit operand-size SHLD would be a
win here, at least for -mtune=intel or a specific Intel uarch.

I haven't looked at whether they still compile ok after inlining into
surrounding code, or whether operations would tend to combine with other things
in preference to becoming an SHLD.
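
One possible test case for the inlining question is a multi-word left shift,
where the helper is called in a loop on adjacent array elements. This is only
a sketch: shl_limbs is a hypothetical caller made up for illustration, and
whether a given GCC version still emits SHLD for it has not been checked here.

```c
#include <stddef.h>
#include <stdint.h>

/* Same helper as above, marked static inline so it can be inlined. */
static inline uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31))
                    >> 32);
}

/* Hypothetical caller: shift an n-word little-endian integer left by
   (count & 31) bits.  Each output word takes its high bits from src[i] and
   its low bits from the word below it, which is the SHLD pattern. */
void shl_limbs(uint32_t *dst, const uint32_t *src, size_t n, uint32_t count) {
  if (n == 0) return;
  /* Walk from the top word down so dst == src also works in place. */
  for (size_t i = n - 1; i > 0; --i)
    dst[i] = shld_x86(src[i - 1], src[i], count);
  dst[0] = src[0] << (count & 31);  /* lowest word shifts in zeros */
}
```

Compiling this with -O3 in both -m32 and 64-bit mode and looking at the loop
body would show whether the pattern survives inlining or gets combined with
the surrounding loads and stores instead.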
