On Fri, Feb 20, 2026 at 03:21:05PM +0700, John Naylor wrote:
> On Thu, Feb 5, 2026 at 4:43 AM Nathan Bossart <[email protected]> 
> wrote:
>> Yeah, the plain C version might be marginally slower than the built-in
>> version for that test, but it still seems quite a bit faster than HEAD.
>>
>>     HEAD  v8  v10
>>       40  25   29
> 
> (for the following, numbers are nanoseconds per call from
> drive_bms_num_members())
> 
> Seems similar on S390X / gcc 13.3 (last week I only tested a single
> bitmapword and don't feel like repeating):
> 
> master (older): 4.1083 (call builtin)
> v8:     2.8889 (inline builtin)
> v10:    2.7961 (inline pure C)

Thanks for testing it.

> On ppc64le / gcc 8.5, without native popcount it suffers:
> 
> words  master    v14
>     1     4.5    6.5
>     2     5.8    9.7
>    64    67.9  101
>   128   143    190
> 
> So one up, one down among obscure platforms. The case for the builtin
> now seems fairly thin, although it's not zero.

I spent some time looking at how clang/gcc compiled the plain-C version on
various architectures [0], and I was pleasantly surprised to discover that
at some point in recent history they started automatically converting it to
special popcount instructions.  I suspect that you'd see better results on
ppc64le if you upgraded the compiler...
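
For illustration, here is a minimal sketch of the kind of plain-C loop that
newer gcc and clang releases are reported to turn into a popcount instruction
when optimizing for a target that has one.  This is an assumption made for the
example and not necessarily the code in the patch; popcount64_plain is a
hypothetical name.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: count the set bits in a
 * 64-bit word by repeatedly clearing the lowest set bit.  The loop runs
 * once per set bit unless the compiler replaces it with a single
 * instruction.
 */
static inline int
popcount64_plain(uint64_t word)
{
	int			result = 0;

	while (word != 0)
	{
		word &= word - 1;		/* clear the lowest set bit */
		result++;
	}

	return result;
}
```

Whether a given compiler actually performs the substitution depends on its
version, the optimization level, and the target flags, so inspecting the
generated assembly is the only way to be sure.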

[0] https://godbolt.org/z/v9vvx7E89

-- 
nathan

