https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261
--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
GCC will emit SHLD / SHRD as part of shifting an integer that's two registers wide. Hironori Bono proposed the following functions as a workaround for this missed optimization (https://stackoverflow.com/a/71805063/224132):

#include <stdint.h>

#ifdef __SIZEOF_INT128__
uint64_t shldq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)(((((unsigned __int128)high << 64) | (unsigned __int128)low) << (count & 63)) >> 64);
}

uint64_t shrdq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)((((unsigned __int128)high << 64) | (unsigned __int128)low) >> (count & 63));
}
#endif

uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31)) >> 32);
}

uint32_t shrd_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}

---

The uint64_t functions (using __int128) compile cleanly in 64-bit mode (https://godbolt.org/z/1j94Gcb4o), using 64-bit operand-size shld/shrd.

The uint32_t functions, however, compile to a total mess in 32-bit mode (GCC11.2 -O3 -m32 -mregparm=3) before eventually using shld, including a totally insane "or dh, 0". GCC trunk with -O3 -mregparm=3 compiles them cleanly, but without regparm the result is a slightly different mess.

Ironically, the uint32_t functions compile to quite a few instructions in 64-bit mode: GCC actually does the operations as written, with shifts and ORs, and has to manually mask the shift count with &31 because it uses a 64-bit operand-size shift, which masks with &63. 32-bit operand-size SHLD would be a win here, at least for -mtune=intel or a specific Intel uarch.

I haven't looked at whether they still compile OK after inlining into surrounding code, or whether the operations would tend to combine with other things in preference to becoming an SHLD.
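
For reference, here is a minimal sketch (not part of the original comment) of why the double-width formulation in the 32-bit functions above is something a compiler could in principle turn into a single 32-bit SHLD / SHRD. It compares the widened-shift expressions against a per-register formulation of the same result. The names shld_wide, shrd_wide, shld_ref, shrd_ref and check_equivalence are hypothetical and chosen only for this illustration. The reference versions assume the usual 32-bit operand-size behaviour: the count is masked to 5 bits, and a masked count of 0 leaves the destination unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Double-width formulation, same expressions as shld_x86 / shrd_x86 above. */
static uint32_t shld_wide(uint32_t low, uint32_t high, uint32_t count)
{
    return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31)) >> 32);
}

static uint32_t shrd_wide(uint32_t low, uint32_t high, uint32_t count)
{
    return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}

/* Hypothetical reference: per-register view of a 32-bit SHLD.
 * The destination (high) is shifted left and filled from the top of low.
 * count == 0 is special-cased because low >> 32 would be undefined in C. */
static uint32_t shld_ref(uint32_t low, uint32_t high, uint32_t count)
{
    count &= 31;
    if (count == 0)
        return high;
    return (high << count) | (low >> (32 - count));
}

/* Hypothetical reference: per-register view of a 32-bit SHRD.
 * The destination (low) is shifted right and filled from the bottom of high. */
static uint32_t shrd_ref(uint32_t low, uint32_t high, uint32_t count)
{
    count &= 31;
    if (count == 0)
        return low;
    return (low >> count) | (high << (32 - count));
}

/* Checks both formulations against each other on a few sample values and
 * every count in 0..63 (counts >= 32 exercise the & 31 masking).
 * Returns the number of (low, high, count) combinations checked. */
static unsigned check_equivalence(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0x80000000u, 0xdeadbeefu, 0x12345678u, 0xffffffffu
    };
    const unsigned n = sizeof samples / sizeof samples[0];
    unsigned checked = 0;

    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < n; j++) {
            for (uint32_t c = 0; c < 64; c++) {
                assert(shld_wide(samples[i], samples[j], c) ==
                       shld_ref(samples[i], samples[j], c));
                assert(shrd_wide(samples[i], samples[j], c) ==
                       shrd_ref(samples[i], samples[j], c));
                checked++;
            }
        }
    }
    return checked;
}
```

Calling check_equivalence() from a test main aborts on the first mismatch, provided assertions are enabled (NDEBUG not defined). The explicit count & 31 in the wide versions is the masking mentioned above: it is needed in C to keep the 32-bit count semantics even when the shift itself is done at 64-bit width.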