[Bug c/124838] __builtin_clzl(0) overoptimization even on targets with native lzcnt instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838
Richard Biener changed:
What|Removed |Added
CC||pinskia at gcc dot gnu.org
Resolution|WONTFIX |---
Keywords||missed-optimization
Status|RESOLVED|NEW
Last reconfirmed||2026-04-10
Ever confirmed|0 |1
--- Comment #9 from Richard Biener ---
int bar3(unsigned long x) { return x ? __builtin_clzl(x) : 64; }
works, the pattern detection is confused by the conversion to unsigned long
which is moved inside the conditional early:
return x != 0 ? (long unsigned int) __builtin_clzl (x) : 64;
without that phiopt1 could
phiopt match-simplify trying:
x_2(D) != 0 ? _4 : 64
Applying pattern match.pd:10864, gimple-match-9.cc:6415
phiopt match-simplify back:
_3 = .CLZ (x_2(D), 64);
result: _3
rejected because early
but phiopt2 then performs the transform. Not with the conversion though
as we match
(for func (CLZ)
(simplify
(cond (ne @0 integer_zerop@1) (func (convert?@3 @0)) INTEGER_CST@2)
I think we want to fix both: not reject early, and add another
convert? around func, adjusting phiopt to then consider the folding,
which it currently does not.
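For reference, the two source forms discussed in this comment can be put side by side. This is only a sketch: the names bar3_int, bar3_ulong and check_forms_agree are made up for illustration, and ULONG_BITS stands in for the literal 64 used in the thread (they are equal on LP64 targets). It shows that both forms compute the same value and differ only in the return type. It makes no claim about which GCC version folds which form into a single lzcnt.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Width of unsigned long in bits; 64 on the LP64 targets assumed in the thread. */
#define ULONG_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Form that is reported to fold: the conditional and the function share type int. */
int bar3_int(unsigned long x)
{
    return x ? __builtin_clzl(x) : ULONG_BITS;
}

/* Form from comment #5: the int value of the conditional is converted to
   unsigned long on return, which is the conversion discussed above. */
unsigned long bar3_ulong(unsigned long x)
{
    return x ? __builtin_clzl(x) : ULONG_BITS;
}

/* Hypothetical self-check: both forms agree on a few sample inputs. */
void check_forms_agree(void)
{
    const unsigned long samples[] = { 0UL, 1UL, 2UL, 255UL, ~0UL };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(bar3_ulong(samples[i]) == (unsigned long)bar3_int(samples[i]));
}
```

Compiling both functions with -O2 -mlzcnt and comparing the assembly is a quick way to see whether a given compiler still emits the test/cmov pair for the unsigned long variant.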
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
--- Comment #8 from Alexander Monakov ---
bsr has a _true_ dependency on the old value of the output operand (when input
is zero). popcnt/lzcnt/tzcnt have a _false_ dependency on some Intel CPUs
(out-of-order execution treating them as if they read from the output operand),
fixed in different generations for popcnt and lzcnt/tzcnt.
Clang not avoiding the false dependency is a known LLVM bug:
https://github.com/llvm/llvm-project/issues/33216
bar3 from comment #5 would be optimized to the same code as bar2 if it returned
'int' instead of 'unsigned long'. GCC's pattern-matching fails when types
differ (please file a new bug if you'd like that to be improved).
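The true/false dependency distinction can be illustrated with a small software model. This is a sketch of the behaviour as described in this comment, not a statement of the architectural specification: bsr_model, lzcnt_model and highest_set_bit are hypothetical names, and the "destination keeps its old value on zero input" rule for bsr is taken from the comment above. The point is that the bsr result can genuinely depend on the old destination, while the lzcnt result is a function of the source alone, so any dependency on the old destination there is purely a scheduling artifact.

```c
#include <assert.h>
#include <limits.h>

/* Width of unsigned long in bits; 64 on the LP64 targets assumed in the thread. */
#define ULONG_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Index of the highest set bit; only defined for nonzero input. */
int highest_set_bit(unsigned long src)
{
    assert(src != 0);  /* __builtin_clzl(0) is undefined */
    return ULONG_BITS - 1 - __builtin_clzl(src);
}

/* Model of bsr as described above: on zero input the destination keeps its
   old value, so the result really depends on old_dest (a true dependency). */
unsigned long bsr_model(unsigned long old_dest, unsigned long src)
{
    return src ? (unsigned long)highest_set_bit(src) : old_dest;
}

/* Model of lzcnt: the result is determined by src alone, so the old
   destination value never contributes to it. */
unsigned long lzcnt_model(unsigned long src)
{
    return src ? (unsigned long)__builtin_clzl(src) : (unsigned long)ULONG_BITS;
}
```

In this model the xorl %eax, %eax that GCC emits before lzcntq changes no result; per the comment above, it only matters on CPUs whose out-of-order engine treats the output register as an input.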
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838
--- Comment #7 from Zoltan Hidvegi ---
clang generates a single lzcnt for both. And it seems that clearing EAX before
lzcnt is unnecessary; I thought that, unlike bsr, lzcnt does not have the false
dependency on the old value of EAX. clang skips clearing eax.
The testcase in #0 returns a different value at 0, so the cmov cannot be
optimized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838
--- Comment #6 from Xi Ruoyao ---
Then this is a bug. Hmm, why does replacing ctzg with ctzl and ?: seem to work
for the test case in comment 0...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838
--- Comment #5 from Zoltan Hidvegi ---
I've compiled this:
unsigned long bar2(unsigned long x) { return __builtin_clzg(x, 64); }
unsigned long bar3(unsigned long x) { return x ? __builtin_clzl(x) : 64; }
gcc -march=skylake -mlzcnt -mpopcnt -O2 -S clztest.c
gives this:
bar2:
        xorl    %eax, %eax
        lzcntq  %rdi, %rax
        ret
bar3:
        xorl    %eax, %eax
        movl    $64, %edx
        lzcntq  %rdi, %rax
        testq   %rdi, %rdi
        cmove   %rdx, %rax
        ret
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838
--- Comment #4 from Xi Ruoyao ---
At -O2
(In reply to Zoltan Hidvegi from comment #3)
> Sorry, I haven't realized that I can use __builtin_clzg for that, it works
> great. The ? operator though expands to lzcnt/test/cmov instead of the
> single lzcnt instruction, so the two are not the same.
No, the produced assembly is exactly the same at -O2 -mlzcnt. GCC knows lzcnt
outputs 64 when the input is 0, so it optimizes the test/cmov away. See
CLZ_DEFINED_VALUE_AT_ZERO in the GCC code and the Internals documentation for
how the target maintainers have already told GCC this fact.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838
--- Comment #3 from Zoltan Hidvegi ---
Sorry, I haven't realized that I can use __builtin_clzg for that, it works
great. The ? operator though expands to lzcnt/test/cmov instead of the
single lzcnt instruction, so the two are not the same.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838
Xi Ruoyao changed:
What|Removed |Added
Resolution|--- |WONTFIX
Status|UNCONFIRMED |RESOLVED
CC||xry111 at gcc dot gnu.org
--- Comment #2 from Xi Ruoyao ---
Even without clzg you can simply write "max ? __builtin_clz(max) : 64". clzg
just expands to the same IR as that. GCC is able to optimize away the ?:
operation in case it's redundant.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838
--- Comment #1 from Richard Biener ---
You can use __builtin_clzg with a second argument, which is the result when
the first argument x is zero.
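A minimal sketch of that suggestion follows. The wrapper name clz_or_width and the macros ULONG_BITS and HAVE_BUILTIN_CLZG are made up for illustration. Since __builtin_clzg is only available in recent compilers, the sketch checks for it with __has_builtin and otherwise falls back to the explicit conditional mentioned elsewhere in this thread; both branches are meant to return the same value.

```c
#include <assert.h>
#include <limits.h>

/* Width of unsigned long in bits; 64 on the LP64 targets assumed in the thread. */
#define ULONG_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Detect the type-generic builtin where the compiler can report it. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_clzg)
#    define HAVE_BUILTIN_CLZG 1
#  endif
#endif

/* Leading-zero count that is well defined for x == 0 (returns ULONG_BITS). */
int clz_or_width(unsigned long x)
{
#ifdef HAVE_BUILTIN_CLZG
    /* The second argument is the value returned when x is zero. */
    return __builtin_clzg(x, ULONG_BITS);
#else
    /* Fallback: guard the undefined __builtin_clzl(0) case by hand. */
    return x ? __builtin_clzl(x) : ULONG_BITS;
#endif
}

/* Hypothetical self-check covering the zero case and both extremes. */
void check_clz_or_width(void)
{
    assert(clz_or_width(0UL) == ULONG_BITS);
    assert(clz_or_width(1UL) == ULONG_BITS - 1);
    assert(clz_or_width(~0UL) == 0);
}
```

Passing any other int as the second argument, for example -1, gives a different sentinel for zero input.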
