https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596
            Bug ID: 105596
           Summary: Loop counter widened to 128-bit unnecessarily
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

For total *= i with a u128 total and a u32 loop counter, GCC pessimizes by
widening i, doing a full 128x128 => 128-bit multiply, and having to do a
128-bit increment and compare.  Making i a uint64_t so it's a full register
width doesn't help.

unsigned __int128 fact(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=2 ; i < n ; i++)
        total *= i;
    return total;
}  // 0! = 0 isn't mathematically correct, but that's not the point

https://godbolt.org/z/W4MW9b6T3 (gcc trunk 13.0.0 20220508 (experimental),
and clang 14, which makes efficient asm for all of these.)

# gcc -O3
fact:
        movl    %edi, %r9d
        xorl    %r11d, %r11d
        movq    %r9, %r10       # total = n zext into R11:R10
        cmpl    $2, %edi
        jbe     .L7             # if n<=2 return r11:r10
        movl    $2, %esi        # i = 2 in RDI:RSI
        xorl    %edi, %edi
.L9:                            # do{
        movq    %r11, %rcx
        movq    %rdi, %rdx
        movq    %r10, %rax
        movq    %r9, %r8        # copy original n to destroy later
        imulq   %r10, %rdx      # 128x128 multiply with 2x imul, 1x widening mul
        imulq   %rsi, %rcx
        addq    %rdx, %rcx
        mulq    %rsi
        movq    %rdx, %r11      # update total in r11:r10
        movq    %rax, %r10
        addq    %rcx, %r11      # last partial product
        addq    $1, %rsi        # i++ as a 128-bit integer
        adcq    $0, %rdi
        xorq    %rsi, %r8       # r8 = n^i
        movq    %rdi, %rcx      # useless copy, we're already destroying r8
        orq     %r8, %rcx       # hi(i^n) | lo(i^n)
        jne     .L9             # }while(i != n);
.L7:
        movq    %r10, %rax
        movq    %r11, %rdx
        ret

So as well as creating extra work, GCC isn't even doing that work very
efficiently, with multiple unnecessary mov instructions.

This doesn't seem to be x86-64 specific; it also compiles similarly for
AArch64 and MIPS64.  For some ISAs I'm not sure whether potentially-infinite
loops are making a difference, e.g. PowerPC asm is hard for me to read.
RV64 has three multiply instructions in both versions.

I haven't tested a 32-bit equivalent with uint64_t total and uint32_t i.
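A sketch of that untested 32-bit analog (fact32 is just an illustrative
name, not from any testcase I've run); the interesting case would be a
32-bit target, where uint64_t takes a register pair:

#include <stdint.h>

// Same pattern one size down: double-width accumulator, single-width counter.
uint64_t fact32(uint32_t n){
    uint64_t total = n;
    for (uint32_t i = 2; i < n; i++)
        total *= i;     // would i get widened to a 64-bit pair here too?
    return total;
}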
This anti-optimization goes back to GCC 4.6.  With GCC 4.5 and earlier, the
above C compiles to a tight loop with the expected mul reg + imul reg,reg
and one register for the loop counter: https://godbolt.org/z/6KheaqTx4
(using __uint128_t, since unsigned __int128 wasn't supported in GCC 4.4 or
4.1).

GCC 4.1 does an inefficient multiply, but one of the chunks is a freshly
xor-zeroed register.  It's still just incrementing and comparing a 32-bit
loop counter, merely widening it for a 128x128-bit multiply recipe.  GCC 4.4
optimizes away the parts of that recipe that are dead because the high 64
bits of (u128)i are zero.

-----

A different version compiles efficiently with GCC 6 and earlier, only
becoming slow like the above with GCC 7 and later:

unsigned __int128 fact_downcount(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=n-1 ; i > 1 ; i--)
        total *= i;
    return total;  // 0! = 0 isn't mathematically correct
}

-----

When the loop condition is possibly always-true, GCC can't prove the loop is
finite, and as usual can't widen the loop counter.  In this case, that's a
good thing:

unsigned __int128 fact_gcc_handhold(unsigned n){
    unsigned __int128 total = 1;        // loop does do final n
    for (unsigned i=2 ; i <= n ; i++)   // potentially infinite loop defeats this pessimization
        total *= i;
    return total;  // fun fact: 0! = 1 is mathematically correct
}

fact_gcc_handhold:
        cmpl    $1, %edi
        jbe     .L4
        movl    $2, %ecx        # i = 2 in ECX
        movl    $1, %eax        # total = 1 in RDX:RAX
        xorl    %edx, %edx
.L3:                            # do{
        movl    %ecx, %esi      # copy i instead of just incrementing it later :/
        movq    %rdx, %r8       # save high half of total
        addl    $1, %ecx        # i++
        imulq   %rsi, %r8       # lo x hi cross product
        mulq    %rsi            # lo x lo widening
        addq    %r8, %rdx       # 128x64-bit multiply
        cmpl    %ecx, %edi
        jnb     .L3             # }while(i < n)
        ret

Allocating total in RDX:RAX is nice, putting the lo part where we need it
for mulq anyway.
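To spell out the per-iteration cost difference, here's a rough C sketch of
what the two loop bodies compute (the u128 struct and the mul_128x128 /
mul_128x64 names are purely illustrative, not anything GCC generates):

#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;   // illustrative only

// The pessimized loop body: full 128x128 -> 128 multiply.
// Three multiplies: one widening mul for lo*lo, plus two imul for the
// cross products.  (a.hi * b.hi would only affect bits >= 128.)
static u128 mul_128x128(u128 a, u128 b) {
    unsigned __int128 p = (unsigned __int128)a.lo * b.lo;    // mulq
    u128 r = { (uint64_t)p,
               (uint64_t)(p>>64) + a.hi*b.lo + a.lo*b.hi };  // 2x imulq
    return r;
}

// What fact_gcc_handhold's loop body gets: 128x64 -> 128 multiply.
// Only two multiplies, because the multiplier's high half is known zero.
static u128 mul_128x64(u128 a, uint64_t m) {
    unsigned __int128 p = (unsigned __int128)a.lo * m;       // mulq
    u128 r = { (uint64_t)p, (uint64_t)(p>>64) + a.hi*m };    // 1x imulq
    return r;
}

So widening i costs an extra imul every iteration, on top of the add/adc
increment and the two-register loop-exit compare.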