https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51837
--- Comment #4 from Peter Cordes <pcordes at gmail dot com> ---
It's far from optimal for 32x32 -> 64 in 64-bit mode, on old CPUs with slow
64x64-bit multiply like -mtune=atom or early Silvermont-family or
-mtune=bdver1.  These CPUs are mostly irrelevant now, but code-gen for foo
hasn't changed for -m32 or -m64 since GCC6.1.

On those CPUs, it can save total uops and give better throughput to use MULL
(32-bit operand-size), or especially MULX if available, so the high half is
already in a register by itself when that's what the code wants.  (On AMD
Bulldozer, imul r64, r64 has 6 cycle latency and 4c throughput, but mul r32
has 4c latency and 2c throughput, and is still a single uop, unlike on Intel.
The difference is bigger on early Atom.)

There are GCC6.5 vs. trunk code-gen differences when returning a struct of 2
values from the high halves of two multiplies, as in the Godbolt link in my
previous comment.

Anyway, we currently compile foo(), which computes z ^ (z >> 32) on a 64-bit
product, into this asm with GCC -O2 -m64 -march=bdver1
(https://gcc.godbolt.org/z/oqr8q4Ybh):

foo:
        movl    %esi, %esi
        movl    %edi, %edi      # zero-extend both inputs
        imulq   %rsi, %rdi
        movq    %rdi, %rax
        shrq    $32, %rax
        xorl    %edi, %eax
        ret

Not counting the RET, that's 6 instructions, 6 uops on most modern cores.
Critical path latency from either input is 1 + 3 + 0 + 1 + 1 = 6 cycles,
assuming a 3-cycle imul latency like modern CPUs, and mov-elimination for
integer regs like modern CPUs other than Ice Lake.  (The first 2 mov
instructions could be zero latency and no back-end uop if they weren't
mov same,same.)

A pure improvement with no downsides would shorten the critical path to 5
cycles (except on Ice Lake) by letting mov-elim work on every MOV, and by
taking the later MOV off the critical path:

## tweaked compiler output, changing only register choices
        movl    %esi, %eax      # not same,same so mov-elim can work
        movl    %edi, %ecx      # or use ESI as a destination if we don't mind destroying it
        imulq   %rcx, %rax
        movq    %rax, %rcx      # off the critical path even without mov-elim
        shrq    $32, %rcx
        xorl    %ecx, %eax
        ret

#### We could instead be doing:
#### hand-written function, good if inputs aren't already zero-extended,
#### or on old CPUs with slow 64-bit multiply
        movl    %esi, %eax
        mull    %edi            # 3 uops on modern Intel
        xorl    %edx, %eax
        ret

This also has a critical-path latency of 5 cycles on CPUs like Skylake, where
MUL takes an extra cycle to write the high half.  (https://uops.info/ has
latency breakdowns per input -> output pair, but it's down at the moment.)
But it's only 5 total uops (not counting the RET) instead of 6 on Intel
P-cores, and 4 on Zen-family.  (32-bit widening MUL on Intel P-cores is 3
uops; 64-bit widening MUL is 2 uops.)  And this is much better on old CPUs
with slow 64-bit multiply: on Bulldozer-family, MULL is only 1 uop for the
front-end, so it's extra good there.

This case is fairly artificial; it's rare that you need to combine the low
and high halves of a 64-bit multiply.  If we were just storing them, we could
destroy the low half with a shift and save 1 MOV in the current code-gen,
making it break-even for front-end cost.  So this is about equal on modern
CPUs, probably not worse on any, and better on old CPUs.  It's also better
for code-size, so the main benefit would be with -Os.
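For reference, the foo() I'm talking about is assumed to be roughly the
following; the exact source is behind the Godbolt link in my previous
comment, so the exact signature is an assumption:

#include <stdint.h>

/* Assumed reconstruction of foo(): 32x32 -> 64-bit widening multiply,
   then fold the high half of the product into the low half. */
uint32_t foo(uint32_t a, uint32_t b)
{
    uint64_t z = (uint64_t)a * b;   /* both inputs zero-extended to 64-bit */
    return z ^ (z >> 32);           /* truncated back to 32 bits on return */
}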
Also, this is artificial in that the IMULQ version needs to zero-extend both
inputs, unlike if it had been inlined into code that loaded or produced those
values.  Then the IMUL version would only be 4 uops.  (Or 5 if neither
multiply input is dead, so we'd need to save one with a MOV and don't have
APX for non-destructive IMUL.)  So most of the time, using 1-operand 32-bit
MULL is only barely worth it; it's only a big win on old CPUs like
-march=bdver3, atom, or silvermont.
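And for completeness, the struct-return case mentioned near the top is
assumed to have roughly this shape (again, the real test case is in the
Godbolt link from my previous comment, and the names here are made up):

#include <stdint.h>

/* Assumed shape of the struct-of-two-high-halves case: only the high halves
   of two widening multiplies are kept, so a MUL or MULX that leaves the high
   half in its own register is exactly what the code wants. */
struct two_highs { uint32_t hi0, hi1; };

struct two_highs high_halves(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    struct two_highs r = {
        (uint32_t)(((uint64_t)a * b) >> 32),
        (uint32_t)(((uint64_t)c * d) >> 32)
    };
    return r;
}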
