[Bug c/124838] __builtin_clzl(0) overoptimization even on targets with native lzcnt instruction

2026-04-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu.org
         Resolution|WONTFIX                     |---
           Keywords|                            |missed-optimization
             Status|RESOLVED                    |NEW
   Last reconfirmed|                            |2026-04-10
     Ever confirmed|0                           |1

--- Comment #9 from Richard Biener  ---
int bar3(unsigned long x) { return x ? __builtin_clzl(x) : 64; }

works; the pattern detection is confused by the conversion to unsigned long,
which is moved inside the conditional early:

  return x != 0 ? (long unsigned int) __builtin_clzl (x) : 64;

Without that conversion, phiopt1 could already match:

phiopt match-simplify trying:
x_2(D) != 0 ? _4 : 64
Applying pattern match.pd:10864, gimple-match-9.cc:6415

phiopt match-simplify back:
_3 = .CLZ (x_2(D), 64);
result: _3
rejected because early

but phiopt2 then performs the transform.  Not with the conversion, though,
as we match

(for func (CLZ) 
 (simplify
  (cond (ne @0 integer_zerop@1) (func (convert?@3 @0)) INTEGER_CST@2)


I think we want to fix both: do not reject early, and add another
convert? around func, adjusting phiopt so that it then considers the
folding, which it currently does not.
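
For illustration, a minimal sketch of the source shapes involved. The
function names are hypothetical, and the third variant is only an assumed
source-level variation that keeps the conditional in int; the thread does
not confirm what code GCC generates for it, so check the assembly. All
three return the same values.

```c
/* Returns int: the form reported above as working. */
int clz_int(unsigned long x)
{
    return x ? __builtin_clzl(x) : 64;
}

/* Returns unsigned long: the implicit conversion ends up on the
   __builtin_clzl arm of the conditional, the shape the pattern misses. */
unsigned long clz_ulong(unsigned long x)
{
    return x ? __builtin_clzl(x) : 64;
}

/* Hypothetical variation: evaluate the conditional in int, widen after. */
unsigned long clz_ulong_via_int(unsigned long x)
{
    int n = x ? __builtin_clzl(x) : 64;
    return (unsigned long)n;
}
```

Comparing the output of gcc -O2 -mlzcnt -S for these functions shows
whether a given compiler version still emits the test/cmov sequence.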

2026-04-10 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838

Alexander Monakov  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov  ---
bsr has a _true_ dependency on the old value of the output operand (when input
is zero). popcnt/lzcnt/tzcnt have a _false_ dependency on some Intel CPUs
(out-of-order execution treating them as if they read from the output operand),
fixed in different generations for popcnt and lzcnt/tzcnt. Clang not avoiding
the false dependency is a known LLVM bug:
https://github.com/llvm/llvm-project/issues/33216

bar3 from comment #5 would be optimized to the same code as bar2 if it
returned 'int' instead of 'unsigned long'. GCC's pattern-matching fails when
types differ (please file a new bug if you'd like that to be improved).

2026-04-10 Thread zoltan at hidvegi dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838

--- Comment #7 from Zoltan Hidvegi  ---
clang generates a single lzcnt for both. It also seems that clearing EAX
before lzcnt is unnecessary: I thought that, unlike bsr, lzcnt does not have
the false dependency on the old value of EAX, and clang skips clearing it.
The testcase in comment #0 returns a different value at 0, so the cmov
cannot be optimized away there.

2026-04-10 Thread xry111 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838

--- Comment #6 from Xi Ruoyao  ---
Then this is a bug.  Hmm, why does replacing ctzg with ctzl and ?: seem to
work for the test case in comment 0...

2026-04-10 Thread zoltan at hidvegi dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838

--- Comment #5 from Zoltan Hidvegi  ---
I've compiled this:

unsigned long bar2(unsigned long x) { return __builtin_clzg(x, 64); }
unsigned long bar3(unsigned long x) { return x ? __builtin_clzl(x) : 64; }

gcc -march=skylake -mlzcnt -mpopcnt -O2 -S clztest.c
gives this:

bar2:
        xorl    %eax, %eax
        lzcntq  %rdi, %rax
        ret
bar3:
        xorl    %eax, %eax
        movl    $64, %edx
        lzcntq  %rdi, %rax
        testq   %rdi, %rdi
        cmove   %rdx, %rax
        ret

2026-04-10 Thread xry111 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838

--- Comment #4 from Xi Ruoyao  ---
(In reply to Zoltan Hidvegi from comment #3)
> Sorry, I haven't realized that I can use __builtin_clzg for that, it works
> great. The ? operator though expands to lzcnt/test/cmov instead of the
> single lzcnt instruction, so the two are not the same.

No, the produced assembly is exactly the same at -O2 -mlzcnt.  GCC knows
lzcnt outputs 64 when the input is 0, so it optimizes the test/cmov away.
See CLZ_DEFINED_VALUE_AT_ZERO in the GCC code and the internals
documentation for how the target maintainers have already told GCC this
fact.
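
As a rough sketch of why the guard can be redundant: the hypothetical
lzcnt_model below is a software model of a leading-zero count whose value
at zero is the operand width (64 for the LP64 targets in this thread),
which is the behaviour attributed to lzcnt above. It is not GCC's
implementation. The guarded builtin agrees with it for every input, so
when the target instruction already behaves like the model the ?: adds
nothing.

```c
#include <limits.h>

/* Width of unsigned long in bits; 64 on LP64 targets. */
#define ULONG_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical reference model: count zero bits from the top until a
   set bit is found.  For x == 0 the loop visits every bit position, so
   the result is the operand width. */
static int lzcnt_model(unsigned long x)
{
    int n = 0;
    for (unsigned long bit = 1UL << (ULONG_BITS - 1);
         bit != 0 && !(x & bit);
         bit >>= 1)
        n++;
    return n;
}

/* Guarded builtin: x == 0 never reaches __builtin_clzl, whose result
   is undefined for a zero argument. */
static int clz_guarded(unsigned long x)
{
    return x ? __builtin_clzl(x) : ULONG_BITS;
}
```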

2026-04-10 Thread zoltan at hidvegi dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838

--- Comment #3 from Zoltan Hidvegi  ---
Sorry, I haven't realized that I can use __builtin_clzg for that, it works
great. The ? operator though expands to lzcnt/test/cmov instead of the single
lzcnt instruction, so the two are not the same.

2026-04-10 Thread xry111 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838

Xi Ruoyao  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |WONTFIX
             Status|UNCONFIRMED                 |RESOLVED
                 CC|                            |xry111 at gcc dot gnu.org

--- Comment #2 from Xi Ruoyao  ---
Even without clzg you can simply write "max ? __builtin_clz(max) : 64".  clzg
just expands to the same IR as that.  GCC is able to optimize away the ?:
operation in case it's redundant.

2026-04-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124838

--- Comment #1 from Richard Biener  ---
You can use __builtin_clzg (x, N), where N is the result when x is zero.
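
A minimal sketch of that suggestion. The wrapper name clz_or_width and the
HAVE_CLZG macro are hypothetical, and the fallback branch is an assumption
added so the snippet also builds with compilers that lack __builtin_clzg
(it appeared in GCC 14).

```c
#include <limits.h>

/* Width of unsigned long in bits; 64 on the LP64 targets in this thread. */
#define ULONG_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Feature test; compilers without __has_builtin take the fallback. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_clzg)
#    define HAVE_CLZG 1
#  endif
#endif

/* Leading-zero count that is well defined for x == 0. */
static int clz_or_width(unsigned long x)
{
#ifdef HAVE_CLZG
    /* The second argument is the value returned when x is zero. */
    return __builtin_clzg(x, ULONG_BITS);
#else
    /* Fallback: guard the undefined zero case by hand. */
    return x ? __builtin_clzl(x) : ULONG_BITS;
#endif
}
```

With this, clz_or_width(0) yields the operand width instead of invoking
the undefined behaviour of __builtin_clzl(0).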