https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759
--- Comment #14 from Peter Cordes <peter at cordes dot ca> ---
Agreed with the idea of doing the popcount(x) == 1 peephole replacement in the
compiler, rather than forcing the header to figure out whether we have
efficient popcount or not.

If we have BMI1, BLSR (reset lowest set bit) sets CF=1 if the input was zero,
and ZF=1 if the output is zero.  Unfortunately none of the standard jcc/setcc
conditions check ZF=1 && CF=0, even with CMC to invert CF first.  (If Intel
had designed it to produce ZF and CF inverted, that would be non-intuitive for
all other uses, but it would have allowed blsr / jcc to implement
if(has_single_bit(x)).)

  CF=1, ZF=0   impossible: input was zero but output was non-zero
  CF=1, ZF=1   input was zero
  CF=0, ZF=0   input had multiple bits set
  CF=0, ZF=1   input had a single bit set

If we're going to branch on it anyway after inlining, a branchy strategy is
probably good:

singlebit_bmi1_branchy:
    xor   %eax, %eax
    blsr  %edi, %edi      # ZF=1 means we cleared the last bit, or the input was zero
    jc    .Linput_zero    # input was zero: return 0 regardless of ZF
    setz  %al
.Linput_zero:
    ret

And when we want a boolean in a register, a combination of setz and cmovc can
materialize one.  With a clever choice of registers, we can even avoid giving
setcc a false dependency on a register that isn't already part of its dep
chain:

singlebit_bmi1_cmov:
    blsr  %edi, %eax
    setz  %al             # false dep, but it's ready if FLAGS are ready because BLSR wrote it
    cmovc %edi, %eax      # return 1 only if ZF=1 (setz produces 1) and CF=0 (cmovc doesn't overwrite it with the input, 0)
    ret

With xor-zeroing first, we could produce the boolean zero-extended to 32-bit,
instead of here where only the low 8 bits are actually 0 / 1.
(Which is fine for returning a bool in all the mainstream calling
conventions.)

(This is the same kind of idea as AArch64 sub/tst / ccmp / cset, where ccmp
can conditionally update flags.)

An evil variation on this uses setnz / dec to invert ZF without affecting CF,
allowing JA:

    blsr  %edi, %eax
    setnz %al             # AL = !ZF
    dec   %al             # 1-1 -> ZF=1, 0-1 -> ZF=0.  ZF=!ZF without affecting CF
  # seta  %al             # set on CF=0 and ZF=0
    ja    was_single_bit  # only actually useful for branching after inlining

dec/ja can't macro-fuse into a single uop, but on Skylake and later Intel it
doesn't cost any extra partial-FLAGS merging uops, because JA simply has both
parts of FLAGS as separate inputs.  (This is why CMOVA / CMOVBE are still 2
uops on Skylake, unlike all other forms: they need 2 integer inputs and 2
parts of FLAGS, while the others need either CF or the SPAZO group, not both.
Interestingly, Zen1/2 have that effect but Zen3 doesn't.)  I don't know how
AMD handles the dec / ja partial-flags shenanigans.  Intel Haswell would, I
think, have a flags-merging uop; older Intel doesn't support BMI1, so
P6-family is irrelevant.  https://stackoverflow.com/a/49868149/224132

I haven't benchmarked these because they have different use-cases
(materializing a boolean vs. creating a FLAGS condition to branch on while
being branchless itself), so any single benchmark would make one of them look
good.  If your data almost never (or always) has an all-zero input, the JC in
the first version will predict well.  After inlining, if the caller branches
on the bool result, you might want to just branch on both conditions
separately.

I don't think this setnz/dec/ja version is ever useful.  Unlike branching
separately on ZF and CF, it's not bad if both zero and multi-bit inputs are
common while single-bit inputs are rare.  But blsr/setz/cmovc + test/jnz is
only 4 uops on Skylake, the same as this (test can macro-fuse with jnz).
The uops are all dependent on each other, so it also has the same latency (to
detect a branch miss) as popcnt / macro-fused cmp/je, which is 2 uops.  The
only thing this has going for it is avoiding a port-1-only uop, I think.

It's also possible to blsr / lahf / and ah, (1<<6)|(1<<0) / cmp ah, 1<<6 to
directly check that ZF=1 and CF=0.  I doubt that's useful.  Or hmm, can we
branch directly on PF after ANDing with that 2-bit mask?  CF=1, ZF=0 is
impossible, so the only other odd-parity case is CF=0, ZF=1.  AMD and Intel
can macro-fuse test/jp.

    blsr %edi, %eax
    lahf
    test $(1<<6) | (1<<0), %ah   # check ZF and CF
    jpo  was_single_bit          # ZF != CF means CF=0, ZF=1, because the other way is impossible

Also possible, of course, is the straightforward 2x setcc and AND to
materialize a boolean in the bottom byte of EAX.  Good ILP, only 3-cycle
latency from input to result on Intel, but that's the same as setz/cmovc,
which is fewer uops and can avoid false dependencies without destroying the
input.

    blsr  %edi, %eax
    setnc %al
    setz  %dil
    and   %edi, %eax      # or test %dil, %al

----

If BLSR is available, POPCNT is also definitely available, so these need to be
compared against POPCNT / cmp / je (or sete).  With that in mind, BLSR / 2x
JCC is actually worse for the front-end: 3 uops, because it can't macro-fuse.
It has slightly lower latency before a branch miss could be discovered, but it
has 2 separate branches instead of just 1.  Potentially we could take one of
the JCCs off the fast path with BLSR/JCC/JCC, at the expense of
branch-prediction throughput on some CPUs when one branch jumps right to
another.

Low-power cores like Silvermont-family don't do macro-fusion, and Alder Lake
suddenly made Gracemont significantly more relevant for tune=generic than
Intel's low-power chips were before.  Those cores have single-uop, 3c-latency
popcnt, and 1 uop with 3c latency for BMI1 instructions like blsr.
---

Also note that AMD Zen has 1-cycle-latency popcnt as a single uop (down from
4c in Bulldozer-family), but BLSR is a 2-uop instruction on AMD, even
including Zen3.  (Perhaps due to setting CF from the input instead of the
output?  Or they just decode it to sub/and uops.)  So with a -mtune that
indicates AMD, we should favour popcnt/cmp/setz to materialize a boolean.
(And write the setz target with popcnt to break *its* false dependency,
because unlike Intel before Ice Lake, popcnt itself doesn't have a false
output dependency on AMD.)

But with Intel tuning and BMI1 available, we should consider blsr/setz/cmovc
if not branching on the result.  That might be too niche to be worth
adding/maintaining code to look for it; probably most use-cases of single-bit
checks *do* branch on it instead of assigning a boolean result somewhere.