> On Apr 3, 2026, at 19:42, David Rowley <[email protected]> wrote:
> 
> On Fri, 3 Apr 2026 at 16:24, Chao Li <[email protected]> wrote:
>> I also did a load test with a standalone c program with 4 versions:
> 
> I think you'd be better off picking a more realistic Bitmapset. The
> code is doing:
> 
>    int words_to_alloc = 20000; // Large set to bypass CPU cache slightly
>    Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc *
> sizeof(bitmapword));
>    bms->nwords = words_to_alloc;
>    memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));
> 
>    /* Set a bit far into the set to force a long scan */
>    int target_bit = (words_to_alloc - 1) * 64 + 10;
>    bms->words[words_to_alloc - 1] |= (1ULL << 10);
> 
> A set with 20k words! Then you're doing:
> 
>    for (int i = 0; i < iterations; i++) sink = bms_next_member(bms, 0);
> 
> So you're not looping correctly over the set, as looping over a set
> with 1 member requires 2 calls to the function.
> 
> I'd say this unrealistically favours the unrolled loop version, as
> most of the effort is in the loop looping over empty words and you can
> do that without the extra bit-wise AND that the other versions have.
> That version also seems to apply both the 64-bit fix and the INT_MAX
> fix. Why both?  Because you're only calling the function once per
> iteration, it also means your || INT_MAX is only executed once, rather
> than n_members + 1 as it would normally be executed. That'll make your
> version seem better than it is.
> 
> I fixed all that and changed the Bitmapset to something more realistic
> with how it's more likely to be seen in PostgreSQL and ran the
> benchmark with a proper bms_next_member loop. File attached and
> results below:
> 
> Zen2 Linux
> 
> $ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
> 
> Original:  0.62113 seconds
> David's:          0.70257 seconds
> Chao's version:   1.70691 seconds
> Unrolled loop: 1.56239 seconds
> 2800000000
> 
> $ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
> 
> Original:  1.11263 seconds
> David's:          0.87399 seconds
> Chao's version:   0.52962 seconds
> Unrolled loop: 1.07799 seconds
> 2800000000
> 
> Apple M2:
> 
> Benchmarking 100000000 iterations...
> 
> Original:  0.64023 seconds
> David's:          0.41087 seconds
> Chao's version:   1.21113 seconds
> Unrolled loop: 0.55924 seconds
> 2800000000
> 
>> I’m curious that, when something performs differently across platforms, 
>> which platform should take priority?
> 
> I don't think it matters too much for this case. If we were actually
> working on a function that was a bottleneck, then it would be worth
> looking deeper and seeing if the compilers are producing suboptimal
> code and then try to coax them into making better code. As a last
> resort, have two versions and some preprocessor to decide which one.
> However, this just isn't performance-critical enough to warrant such
> an effort. I think we're already going far too far with this. I just
> want to fix the bug. If performance increases in some way as a result
> of that, then good.
> 
> If I sum up the times from each version of the above results, I get:
> 
> Original: 2.37399
> David's: 1.98743
> Chao's: 3.44766
> Unrolled: 3.19962
> 
> I'm happy to go with the fastest one. I expect the one with the fewest
> instructions is likely more important when running it in the planner,
> as that's less to decode. I didn't look, but I expect your loop
> unrolled version and your || INT_MAX version to have more
> instructions.
> 
> David
> <test_bms_next2.c>

In test_bms_next2.c, you set words_to_alloc = 1, which disables the 
optimization in the unrolled version. If I change words_to_alloc = 2, then on 
my MacBook M4, the unrolled version seems to win:
```
chaol@ChaodeMacBook-Air test % ./test_bms_next2
Benchmarking 100000000 iterations...

Original:  0.61559 seconds
David's:          0.61651 seconds
Chao's version:   1.06033 seconds
Unrolled loop: 0.60561 seconds
2800000000
```

(I also checked on Godbolt, from the instruction-count perspective, the 
unrolled version actually looks the worst.)

I guess one-word Bitmapset are probably the most common case in practice. So 
overall, I agree that your version is the best choice. Please go ahead with it 
as we have already gained both a bug fix and a performance improvement. If we 
really want to study the performance more deeply, I think that would be a 
separate topic.

It was fun working on this patch, and thank you for your patience.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/






Reply via email to