On Thu, Feb 5, 2026 at 4:43 AM Nathan Bossart <[email protected]> wrote:
> Sure.  I'm tempted to suggest that we only use the plain C version here,
> too.  The SSE4.2 bms_num_members() test I did yesterday used it and showed
> improvement at one word.  If we do that, we can rip out even more code
> since we no longer need the popcount built-ins.
>
> * tests plain C version on an Apple M3 *
>
> Yeah, the plain C version might be marginally slower than the built-in
> version for that test, but it still seems quite a bit faster than HEAD.
>
>     HEAD    v8    v10
>       40    25     29
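
As a rough illustration of what "plain C version" means here: a popcount can be written portably with no compiler built-in at all. The following is only a sketch under that assumption, not the code from the patch. The names sketch_popcount64() and sketch_num_members() are hypothetical, and the bit-twiddling (SWAR) formulation is just one common way to do it:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): count the set bits in one
 * 64-bit word using only shifts, masks, adds and one multiply.
 */
static inline int
sketch_popcount64(uint64_t w)
{
    /* Each 2-bit field now holds the number of bits set in that field. */
    w = w - ((w >> 1) & UINT64_C(0x5555555555555555));
    /* Sum adjacent 2-bit fields into 4-bit fields. */
    w = (w & UINT64_C(0x3333333333333333)) +
        ((w >> 2) & UINT64_C(0x3333333333333333));
    /* Sum adjacent 4-bit fields into per-byte counts. */
    w = (w + (w >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
    /* The multiply adds all byte counts into the top byte. */
    return (int) ((w * UINT64_C(0x0101010101010101)) >> 56);
}

/*
 * Hypothetical stand-in for a bms_num_members()-style loop: sum the
 * popcounts of an array of words.  The real function operates on a
 * Bitmapset, which this sketch does not model.
 */
static inline int
sketch_num_members(const uint64_t *words, size_t nwords)
{
    int         result = 0;

    for (size_t i = 0; i < nwords; i++)
        result += sketch_popcount64(words[i]);
    return result;
}
```

Since both functions are small and inlinable, a compiler can fold the per-word work into the caller's loop, which is presumably part of why the inline variants do well at one word.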

(For the following, numbers are nanoseconds per call from
drive_bms_num_members().)

Seems similar on S390X / gcc 13.3 (last week I only tested a single
bitmapword and don't feel like repeating):

master (older): 4.1083 (call builtin)
v8:             2.8889 (inline builtin)
v10:            2.7961 (inline pure C)

On ppc64le / gcc 8.5, without native popcount it suffers:

words  master  v14
1      4.5     6.5
2      5.8     9.7
64     67.9    101
128    143     190

So one up, one down among obscure platforms. There seems to be only a
fairly thin case for the builtin anymore, although it's not zero.

--
John Naylor
Amazon Web Services
