https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #14 from Peter Cordes <peter at cordes dot ca> ---
Agreed with the idea of doing the  popcount(x) == 1  peephole replacement in
the compiler, not forcing the header to figure out whether we have efficient
popcount or not.

If we have BMI2, BLSR (Bit Lowest-Set Reset) sets CF=1 if the input was zero,
and ZF=1 if the output is zero.  Unfortunately none of the standard jcc/setcc
conditions check ZF=1 & CF=0, even with CMC to invert CF first.
(If Intel had designed it to produce ZF and CF inverted, it would be
non-intuitive for all other uses, but it would have allowed blsr / jcc to
implement if(has_single_bit(x)).)

CF=1, ZF=0  impossible: input was zero, output was non-zero
CF=1, ZF=1  input was zero
CF=0, ZF=0  input had multiple bits set
CF=0, ZF=1  input had a single bit set.

If we're going to branch on it anyway after inlining, a branchy strategy is
probably good:

singlebit_bmi2_branchy:
   xor    %eax, %eax
   blsr   %edi, %edi    # ZF=1 means we cleared the last bit, or the input was zero
   jc     .Linput_zero  # input was zero, return 0 regardless of ZF
   setz   %al
 .Linput_zero:
   ret


And when we want a boolean in a register, a combination of setz and cmovc can
materialize one.  With a clever choice of registers, we can even avoid giving
setcc a false dependency on a register that isn't already part of its dep
chain.

singlebit_bmi2_cmov:
   blsr    %edi, %eax
   setz    %al         # false dep, but it's ready as soon as FLAGS is, because BLSR just wrote EAX
   cmovc   %edi, %eax  # return 1 only if ZF=1 (setz produced 1) and CF=0 (if CF=1, the input was zero and we overwrite with that 0)
   ret

With xor-zeroing first, we could produce the boolean zero-extended to 32-bit,
instead of here, where only the low 8 bits are actually 0 / 1.  (That's fine
for returning a bool in all the mainstream calling conventions.)
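
A sketch of that zero-extended variant (the label name and the choice of ECX
as a scratch register are just illustrative), at the cost of one extra uop:

singlebit_bmi2_cmov_zext:
   xor     %eax, %eax    # zero the full register before setz writes only AL
   blsr    %edi, %ecx    # CF=1 if the input was zero; ZF=1 if the result is zero
   setz    %al           # EAX = 1 if 0 or 1 bits were set (upper bits already zero)
   cmovc   %ecx, %eax    # input was zero: ECX is also zero, so return 0
   ret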

(This is the same kind of idea as ARM64 sub/tst / ccmp / cset, where ccmp can
conditionally update flags.)

An evil variation on this uses setnz / dec to invert ZF without affecting CF,
allowing JA:

   blsr   %edi,%eax
   setnz  %al             # AL = !ZF
   dec    %al             # 1-1 -> ZF=1,  0-1 -> ZF=0.  ZF=!ZF without affecting CF
   # seta   %al             # set on CF=0 and ZF=0
   ja     was_single_bit    # only actually useful for branching after inlining

dec/ja can't macro-fuse into a single uop, but on Skylake and later Intel it
doesn't cost any extra partial-FLAGS merging uops, because JA simply has both
parts of FLAGS as separate inputs.  (This is why CMOVA / CMOVBE are still 2
uops on Skylake, unlike all other forms: they need 2 integer inputs plus 2
parts of FLAGS, while the others need either CF or the SPAZO group, not both.
Interestingly, Zen 1/2 also make CMOVA / CMOVBE 2 uops, but Zen 3 doesn't.)

I don't know how AMD handles dec / ja partial-flags shenanigans.  Intel Haswell
would, I think, have a flags-merging uop; older Intel doesn't support BMI1, so
P6-family is irrelevant.  https://stackoverflow.com/a/49868149/224132

I haven't benchmarked them because they have different use-cases (materializing
a boolean vs. creating a FLAGS condition to branch on while itself being
branchless), so any single benchmark would make one of them look good.  If your
data almost never (or always) has an all-zero input, the JC in the first
version will predict well.  After inlining, if the caller branches on the bool
result, you might want to just branch on both conditions separately.

I don't think this setnz/dec/ja version is ever useful.  Unlike branching
separately on ZF and CF, it's not bad if both zero and multi-bit inputs are
common while single-bit inputs are rare.  But blsr/setz/cmovc + test/jnz is
only 4 uops on Skylake, the same as this (test can macro-fuse with jnz), so
even that best case doesn't come out ahead.
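
For reference, that 4-uop blsr/setz/cmovc + test/jnz sequence would look
something like this (the branch target name is illustrative):

   blsr    %edi, %eax
   setz    %al           # AL = 1 if 0 or 1 bits were set
   cmovc   %edi, %eax    # input was zero: overwrite with EDI = 0
   test    %al, %al      # test/jnz macro-fuse on Intel and AMD big cores
   jnz     .Lwas_single_bit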

The uops are all dependent on each other, so it also has the same latency (to
detect a branch miss) as popcnt / macro-fused cmp/je, which is only 2 uops.
The only thing this has going for it is avoiding a port-1-only uop (popcnt on
Intel), I think.

It's also possible to blsr / lahf / and  ah, (1<<6) | (1<<0) / cmp  ah, 1<<6   
to directly check that ZF=1 and CF=0.  I doubt that's useful.  Or hmm, can we
branch directly on PF after AND with that 2-bit mask?  CF=1 ZF=0 is impossible,
so the only other odd-parity case is CF=0 ZF=1.  AMD and Intel can macro-fuse
test/jp.

   blsr  %edi, %eax
   lahf
   test  $(1<<6) | (1<<0), %ah        # check ZF and CF.
   jpo   was_single_bit               # ZF != CF means CF=0, ZF=1 because the other way is impossible.
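
And a sketch of the direct ZF=1 / CF=0 check mentioned above, without relying
on parity (probably not worth it, as noted):

   blsr  %edi, %eax
   lahf                               # AH = low byte of FLAGS: bit 6 = ZF, bit 0 = CF
   and   $(1<<6) | (1<<0), %ah        # keep only ZF and CF
   cmp   $(1<<6), %ah                 # equal only when ZF=1 and CF=0
   sete  %al                          # or je was_single_bit to branch instead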

Also possible, of course, is the straightforward 2x setcc and AND to materialize
a boolean in the bottom byte of EAX.  Good ILP, only 3 cycle latency from input
to result on Intel, but that's the same as setz/cmovc, which is fewer uops and
can avoid false dependencies without destroying the input.

  blsr   %edi, %eax
  setnc  %al            # AL = 1 if the input was non-zero (CF=0)
  setz   %dil           # DIL = 1 if blsr's result was zero (0 or 1 bits set)
  and    %edi, %eax     # or test %dil, %al

----

If BLSR is available, POPCNT is also definitely available, so these need to be
compared against POPCNT / cmp / je (or sete).  With that in mind, BLSR / 2x JCC
is actually worse for the front-end, 3 uops because it can't macro-fuse.  It
has slightly lower latency before a branch miss could be discovered, but it has
2 separate branches instead of just 1.
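
Concretely, the two branching forms being compared (labels are illustrative):

   # popcnt version: 2 uops on big cores (cmp/je macro-fuse)
   popcnt %edi, %eax
   cmp    $1, %eax
   je     .Lwas_single_bit

   # blsr version: 3 uops; BLSR can't fuse with a JCC
   blsr   %edi, %edi
   jc     .Lnot_single_bit    # CF=1: input was zero
   jnz    .Lnot_single_bit    # ZF=0: more than one bit set
   # fall through: exactly one bit was set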

Potentially we could take one of the JCCs off the fast-path with BLSR/JCC/JCC,
at the expense of branch-prediction throughput when jumping to another branch
on some CPUs.

Low-power cores like Silvermont-family don't do macro-fusion, and Alder Lake
suddenly made Gracemont significantly more relevant for tune=generic than Intel
low-power chips were before.  They have single-uop 3c latency popcnt, and 1 uop
3c latency for BMI1 instructions like blsr.


---

Also note that AMD Zen has 1-cycle latency popcnt as a single uop (down from 4c
in Bulldozer-family), but BLSR is a 2-uop instruction on AMD even including
Zen3.  (Perhaps due to setting CF from the input instead of output?  Or they
just decode it to sub/and uops.)

So with a -mtune that indicates AMD, we should favour popcnt/cmp/setz to
materialize a boolean.  (And have setz write the same register popcnt wrote, to
break *its* false dependency, because unlike Intel before IceLake, popcnt
itself doesn't have a false output dependency on AMD.)
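
A sketch of that AMD-tuned boolean materialization (the function name is
illustrative):

singlebit_amd_popcnt:
   popcnt %edi, %eax    # 1 uop / 1c latency on Zen, no false output dependency
   cmp    $1, %eax
   setz   %al           # writes a register popcnt already wrote: no new false dep
   ret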

But with Intel tuning and BMI2 available, we should consider blsr/setz/cmovc if
not branching on the result.  That might be too niche to be worth
adding/maintaining code to look for it; probably most use-cases of single-bit
checks *do* branch on it instead of assigning a boolean result somewhere.
