> On Apr 3, 2026, at 10:08, David Rowley <[email protected]> wrote:
>
> On Fri, 3 Apr 2026 at 11:12, David Rowley <[email protected]> wrote:
>>
>> On Thu, 2 Apr 2026 at 17:22, Tom Lane <[email protected]> wrote:
>>> I don't think we should add cycles here for this purpose.
>>
>> I'm not keen on slowing things down for this either. I did do some
>> experiments in [1] that sees fewer instructions from using 64-bit
>> maths. I might go off and see if there are any wins there that also
>> give us the INT_MAX fix. It's not great effort to reward ratio
>> though...
>
> The reduction in instructions with the patched version got me curious
> to see if it would translate into a performance increase. I tested on
> an AMD Zen2 machine, and it's a decent amount faster than master. I
> tested with gcc and clang.
>
> I also scanned over the remaining parts of bitmapset.c and didn't find
> anywhere else that has overflow risk aside from what you pointed out
> in bms_prev_member().
>
> The attached patch contains the benchmark function I added to the
> test_bitmapset module. It should apply to master with a bit of noise.
>
> CREATE EXTENSION test_bitmapset;
> SELECT
> generate_series(1,3) AS run,
> bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS
> bms_next_member_us,
> bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS
> bms_prev_member_us;
>
> master (gcc)
>
>  run | bms_next_member_us | bms_prev_member_us
> -----+--------------------+--------------------
>    1 |              26473 |              40404
>    2 |              26218 |              40413
>    3 |              26209 |              40387
>
> patched (gcc)
>
>  run | bms_next_member_us | bms_prev_member_us
> -----+--------------------+--------------------
>    1 |              25409 |              29705
>    2 |              24905 |              29693
>    3 |              24870 |              29707
>
> Times are in microseconds to do 1 million bms_*_member() loops over
> the entire set.
>
> I've also attached the full results I got. I've also included the
> results from Chao's version, which does slow things down decently on
> clang.
>
> IMO, if we can make bitmapset.c work with INT_MAX members and get a
> performance increase, then we should do it.
>
> David
>
>> [1] https://godbolt.org/z/Eh1vzssq7
> <benchmark_results.txt><bms_fixes.patch>

I also ran a load test with a standalone C program, comparing 4 versions:
* The original bms_next_member (Original)
* The fast version from [1], which uses 64-bit maths (Fast)
* The original version + INT32_MAX check + 64-bit maths (Original2)
* Another approach that pulls the first iteration up out of the loop, which
removes "mask = (~(bitmapword) 0);" from the loop (PullUp)
Note: all tests used -O2 to build the executable.
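
To make the difference between the variants concrete, here is a minimal sketch of the 64-bit idea and of the pull-up idea. This is not the code from [1] or from the attached test program; DemoSet, demo_set, and the demo_* function names are hypothetical stand-ins, the bitmapword is assumed to be 64 bits wide, and __builtin_ctzll assumes gcc or clang.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

typedef uint64_t bitmapword;
#define BITS_PER_BITMAPWORD 64

/* Hypothetical stand-in for PostgreSQL's Bitmapset; not the real struct. */
typedef struct DemoSet
{
    int         nwords;
    bitmapword  words[2];
} DemoSet;

/* Members 1..8 and 64, the same set used in the quoted benchmark query. */
const DemoSet demo_set = {2, {0x1FE, 0x1}};

/*
 * Smallest member greater than prevbit, or -2 if there is none.
 * prevbit is widened to 64 bits before the increment, so passing
 * INT_MAX cannot overflow.
 */
int
demo_next_member_64(const DemoSet *a, int prevbit)
{
    int64_t     bit;
    int64_t     wordnum;
    bitmapword  mask;

    assert(prevbit >= -1);
    bit = (int64_t) prevbit + 1;
    /* In the first word, ignore bits at or below prevbit. */
    mask = (~(bitmapword) 0) << (bit % BITS_PER_BITMAPWORD);

    for (wordnum = bit / BITS_PER_BITMAPWORD; wordnum < a->nwords; wordnum++)
    {
        bitmapword  w = a->words[wordnum] & mask;

        if (w != 0)
            return (int) (wordnum * BITS_PER_BITMAPWORD + __builtin_ctzll(w));
        mask = ~(bitmapword) 0;     /* later words are scanned in full */
    }
    return -2;
}

/*
 * Same contract, but the first (partially masked) word is handled before
 * the loop, so the loop body no longer resets the mask.
 */
int
demo_next_member_pullup(const DemoSet *a, int prevbit)
{
    int64_t     bit;
    int64_t     wordnum;
    bitmapword  w;

    assert(prevbit >= -1);
    bit = (int64_t) prevbit + 1;
    wordnum = bit / BITS_PER_BITMAPWORD;
    if (wordnum >= a->nwords)
        return -2;

    w = a->words[wordnum] & ((~(bitmapword) 0) << (bit % BITS_PER_BITMAPWORD));
    if (w != 0)
        return (int) (wordnum * BITS_PER_BITMAPWORD + __builtin_ctzll(w));

    for (wordnum++; wordnum < a->nwords; wordnum++)
    {
        w = a->words[wordnum];
        if (w != 0)
            return (int) (wordnum * BITS_PER_BITMAPWORD + __builtin_ctzll(w));
    }
    return -2;
}
```

Starting from prevbit = -1 and feeding each result back in walks the set in ascending order until -2 is returned. Both functions should return the same value for every prevbit; they differ only in where the mask handling sits, which is the kind of change a compiler may schedule differently on different targets.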
On my MacBook M4, the Fast version consistently won, and the PullUp version
performed badly.
```
% gcc --version
Apple clang version 17.0.0 (clang-1700.6.4.2)
Target: arm64-apple-darwin25.3.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
```
A typical test run:
```
Benchmarking 100000 iterations...
Original: 0.48893 seconds
Fast: 0.46979 seconds
Original2: 0.47740 seconds
PullUp: 0.48029 seconds
```
On my Windows laptop (Intel(R) Core Ultra 5) with WSL-based Ubuntu, Original2
won in most runs, and the PullUp version was faster than the Fast version.
```
chaol@lichao-highgo:~$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```
A typical test run:
```
Original: 0.99849 seconds
Fast: 0.74722 seconds
Original2: 0.59407 seconds
PullUp: 0.62746 seconds
```
Then I also ran the test on Windows directly. Here, the PullUp version
performed the best.
```
$ gcc --version
gcc.exe (Rev13, Built by MSYS2 project) 15.2.0
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```
A typical test run:
```
Original: 0.32931 seconds
Fast: 0.32740 seconds
Original2: 0.32378 seconds
PullUp: 0.30795 seconds
```
I’m curious: when something performs differently across platforms, which
platform should take priority?
Please see the attached test program. It’s possible I did something wrong.
[1] https://godbolt.org/z/Eh1vzssq7
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
test_bms_next.c
