https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838
--- Comment #17 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #16)
> (In reply to Wilco from comment #15)
> > It would make more sense to move x86 backends to CTZ_DEFINED_VALUE_AT_ZERO
> > == 2 so that you always get the same result even when you don't have tzcnt.
> > A conditional move would be possible, so it adds an extra 2 instructions at
> > worst (ie. still significantly faster than doing the table lookup, multiply
> > etc). And it could be optimized when you know CLZ/CTZ input is non-zero.
>
> Conditional moves are a lottery on x86, in many cases very bad idea. And
> when people actually use __builtin_clz*, they state that they don't care
> about the 0 value, so emitting terribly performing code for it just in case
> would be wrong.
> If forwprop emits the conditional in separate blocks for the CTZ_DVAZ!=2
> case, on targets where conditional moves are beneficial for it it can also
> emit them, or emit the jump which say on x86 will be most likely faster than
> cmov.

Well GCC emits a cmov for this (-O2 -march=x86-64-v2):

int ctz(long a)
{
  return (a == 0) ? 64 : __builtin_ctzl (a);
}

ctz:
        xor     edx, edx
        mov     eax, 64
        rep bsf rdx, rdi
        test    rdi, rdi
        cmovne  eax, edx
        ret

Note the extra 'test' seems redundant since IIRC bsf sets Z=1 if the input is
zero.

On Zen 2 this has identical performance as the plain builtin when you loop it
as res = ctz (res) + 1; (ie. measuring latency of non-zero case). So I find it
hard to believe cmov is expensive on modern cores.
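
For reference, the latency loop described above could look roughly like the sketch below. This is not the exact benchmark that was run: the names ctz_or_64 and chain are made up for illustration, the timing harness is left out, and it assumes an LP64 target where long is 64 bits.

```c
/* Wrapper as in the comment: a defined result of 64 for a zero input.
   Assumes unsigned long is 64 bits wide (LP64). */
int ctz_or_64(unsigned long a)
{
    return (a == 0) ? 64 : __builtin_ctzl(a);
}

/* Hypothetical helper: a dependent chain in which every call consumes the
   previous result, so the loop is bound by latency, not throughput.
   After the first iteration res is always >= 1, so only the non-zero
   path is exercised. */
unsigned long chain(unsigned long seed, long iterations)
{
    unsigned long res = seed;
    for (long i = 0; i < iterations; i++)
        res = (unsigned long)ctz_or_64(res) + 1;
    return res;
}
```

To compare, one would time chain() for a large iteration count, then repeat with the body replaced by the plain __builtin_ctzl (res) + 1. The seed should come from a runtime value, or the compiler may fold the whole loop away.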