On Thu, Feb 5, 2026 at 4:43 AM Nathan Bossart <[email protected]> wrote:
> Sure.  I'm tempted to suggest that we only use the plain C version here,
> too.  The SSE4.2 bms_num_members() test I did yesterday used it and showed
> improvement at one word.  If we do that, we can rip out even more code
> since we no longer need the popcount built-ins.
>
> * tests plain C version on an Apple M3 *
>
> Yeah, the plain C version might be marginally slower than the built-in
> version for that test, but it still seems quite a bit faster than HEAD.
>
>     HEAD  v8  v10
>       40  25   29

(for the following, numbers are nanoseconds per call from
drive_bms_num_members())

Seems similar on S390X / gcc 13.3 (last week I only tested a single
bitmapword and don't feel like repeating):

master (older): 4.1083 (call builtin)
v8:     2.8889 (inline builtin)
v10:    2.7961 (inline pure C)

On ppc64le / gcc 8.5, without native popcount it suffers:

words  master    v14
    1     4.5    6.5
    2     5.8    9.7
   64    67.9  101
  128   143    190

So one up, one down among obscure platforms. The case for the builtin
now seems fairly thin, although it's not zero.
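
For anyone following along without the patch at hand, here is a rough
sketch of what a "plain C" popcount over an array of 64-bit words can look
like. It is not the code from v8/v10/v14: the names popcount64_plain() and
num_members_plain() are made up for illustration, and it assumes a
bitmapword is a uint64_t. It uses the well-known SWAR bit-counting trick
in place of a compiler built-in or a popcount instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: count the set bits in one 64-bit word using only
 * portable C (no __builtin_popcountll, no POPCNT instruction).
 */
static inline int
popcount64_plain(uint64_t x)
{
    /* Sum adjacent bits into 2-bit fields. */
    x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
    /* Sum adjacent 2-bit fields into 4-bit fields. */
    x = (x & UINT64_C(0x3333333333333333)) +
        ((x >> 2) & UINT64_C(0x3333333333333333));
    /* Sum adjacent 4-bit fields into per-byte counts. */
    x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
    /* The multiply adds all byte counts into the top byte. */
    return (int) ((x * UINT64_C(0x0101010101010101)) >> 56);
}

/*
 * Hypothetical stand-in for a bms_num_members()-style loop: total the
 * set bits across nwords words.
 */
static inline int
num_members_plain(const uint64_t *words, size_t nwords)
{
    int result = 0;

    assert(words != NULL || nwords == 0);

    for (size_t i = 0; i < nwords; i++)
        result += popcount64_plain(words[i]);

    return result;
}
```

Because everything is a short inline function with no call into a
separately compiled routine, the compiler is free to fold it into the
caller, which is presumably where the single-word improvement comes from.
On a platform with a fast native popcount, the built-in could still win
for longer arrays.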

--
John Naylor
Amazon Web Services

