> On Apr 4, 2026, at 11:30, David Rowley <[email protected]> wrote:
>
> On Sat, 4 Apr 2026 at 02:28, Chao Li <[email protected]> wrote:
>> In test_bms_next2.c, you set words_to_alloc = 1, which disables the
>> optimization in the unrolled version. If I change words_to_alloc = 2, then
>> on my MacBook M4, the unrolled version seems to win:
>> ```
>> chaol@ChaodeMacBook-Air test % ./test_bms_next2
>> Benchmarking 100000000 iterations...
>>
>> Original: 0.61559 seconds
>> David's: 0.61651 seconds
>> Chao's version: 1.06033 seconds
>> Unrolled loop: 0.60561 seconds
>> 2800000000
>> ```
>
> Doing the same locally, here's what I get on my Zen2 machine:
>
> drowley@amd3990x:~$ gcc test_bms_next.c -O2 -o test_bms_next &&
> ./test_bms_next
> Benchmarking 100000000 iterations...
>
> Original: 1.21125 seconds
> David's: 1.11997 seconds
> Chao's version: 1.72662 seconds
> Unrolled loop: 1.63090 seconds
> 2800000000
> drowley@amd3990x:~$ clang test_bms_next.c -O2 -o test_bms_next &&
> ./test_bms_next
> Benchmarking 100000000 iterations...
>
> Original: 1.10780 seconds
> David's: 1.05968 seconds
> Chao's version: 1.87123 seconds
> Unrolled loop: 1.11002 seconds
> 2800000000
I just tried my Windows laptop with both MSYS2 and WSL Ubuntu, and on both of
them the original version and David’s version consistently performed better
than the other versions.
I have a new finding from Godbolt: adding likely(w != 0) seems to reduce the
instruction count by one. bms_next_member_fast has 26 instructions, while after
adding "likely", bms_next_member_fast2 has 25. It seems that this may avoid one
jump and perhaps be slightly better, but I’m not sure. I’m really not an expert
in assembly.
Then I tested. “likely” doesn’t seem to help on Mac:
```
chaol@ChaodeMacBook-Air test % ./test_bms_next2
Benchmarking 100000000 iterations...
Original: 0.54994 seconds
David's: 0.78218 seconds
David's likely: 0.78990 seconds
Chao's version: 1.11530 seconds
Unrolled loop: 0.47660 seconds
3500000000
```
But it helps slightly on Windows:
```
$ ./test_bms_next2
Benchmarking 100000000 iterations...
Original: 1.45312 seconds
David's: 1.48438 seconds
David's likely: 1.40625 seconds
Chao's version: 3.12500 seconds
Unrolled loop: 2.95312 seconds
-794967296
```
I’m not suggesting adding likely() to your version, I’m just sharing the
information for your reference and leaving the decision to you.
>
>> I guess one-word Bitmapset are probably the most common case in practice. So
>> overall, I agree that your version is the best choice. Please go ahead with
>> it as we have already gained both a bug fix and a performance improvement.
>> If we really want to study the performance more deeply, I think that would
>> be a separate topic.
>
> Yeah, but it's better to find bottlenecks and focus there than to
> optimise blindly without knowing if it'll make any actual difference.
> It's still important to test for regressions when modifying code and
> try to avoid them, as it's sometimes not obvious if the code being
> modified is critical for performance.
>
> I also don't think we should be doing anything to optimise for
> multi-word sets at the expense of slowing down single-word sets.
Fully agreed.
> Not
> that it's very representative of the real world, but I added some
> telemetry to bitmapset.c and I see 43 multi-word sets being created
> and 1.45 million single-word sets created during make check, or 1 in
> 33785, if you prefer.
>
> diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
> index 786f343b3c9..21b5053e9a2 100644
> --- a/src/backend/nodes/bitmapset.c
> +++ b/src/backend/nodes/bitmapset.c
> @@ -219,6 +219,10 @@ bms_make_singleton(int x)
> int wordnum,
> bitnum;
>
> + if (x >= 64)
> + elog(NOTICE, "multi-word set bms_make_singleton");
> + else
> + elog(NOTICE, "single-word set bms_make_singleton");
> if (x < 0)
> elog(ERROR, "negative bitmapset member not allowed");
> wordnum = WORDNUM(x);
> @@ -811,6 +815,9 @@ bms_add_member(Bitmapset *a, int x)
> wordnum = WORDNUM(x);
> bitnum = BITNUM(x);
>
> + if (x >= 64 && a->nwords == 1)
> + elog(NOTICE, "multi-set bms_add_member");
> +
> /* enlarge the set if necessary */
> if (wordnum >= a->nwords)
> {
>
> Not quite perfect as a set made first as a single word set that later
> becomes a multi-word set will be double counted. The number of
> operations on the sets is likely more important anyway, not the number
> of sets being created. The point is, multi-word sets are rare for most
> workloads.
>
What tests did you run after adding the logs to collect the data? This is a
method I might borrow in the future for similar investigations.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/