On Thu, 12 Feb 2026 at 12:31, Ants Aasma <[email protected]> wrote:
> I meant this:
>
> @@ -287,6 +287,7 @@ pgstat_io_flush_cb(bool nowait)
>
>      bktype_shstats->times[io_object][io_context][io_op] +=
>          INSTR_TIME_GET_MICROSEC(time);
>
> +    if (PendingIOStats.counts[io_object][io_context][io_op] > 0)
>          for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
>              bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
>                  PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
>
> Most object/context/op combinations will have a 0 count, so no point in
> actually looking at the histogram.
For this to be of any use, the memset that comes after the lock release
would also have to be made conditional.
I was also able to convince clang and gcc to vectorize these loops. I
had to split the innermost loop so that the time calculation with the
/1000 conversion to microseconds and the conditional histogram loop are
done separately, mark all the loops with nounroll pragmas, and tag the
innermost loop for vectorization in clang. But looking at the
benchmark results below, it's probably not worth the effort.
>> .. but the main problem, even if I do all of that I won't be able to
>> reliably measure the impact, probably the best I could say is
>> "runs good as well as master, +/- 3%".
>>
>> Could you somehow help me with this? I mean should we reduce the scope(remove
>> context) and add that "if"?
>
>
> I think if we only aggregate histograms conditionally, then having a ton of
> different histograms is less of a problem. Only the histograms that have any
> data will get accessed. The overhead is limited to the memory usage which I
> think is acceptable.
>
> I'll run a few benchmarks on what I have available here to see if I can tease
> out anything more than the no effect with a 3% error margin we have today.
I benchmarked with gcc 15.2 with "-O2 -march=x86-64
-fno-omit-frame-pointer" to match what most users have. The CPU is a
Ryzen 9 9900X. I concentrated on two code paths, pgstat_io_flush_cb and
pgstat_count_io_op_time. The configuration is the default except for
the following:
track_io_timing = 'on'
io_method = 'io_uring'
io_combine_limit = '1'
effective_io_concurrency = '1'
To check whether aggregating the statistics has an effect, I used
pgbench at scale 100 in read-only mode, 10 runs of 60 s each. It does
1-2 reads from the page cache per select.
bin | concurrency | avg | stddev_fraction | diff
-----------+-------------+----------+-----------------+---------
pg-iohist | 1 | 92927.2 | 0.00254 | 1.00735
pg-master | 1 | 92248.9 | 0.00214 |
pg-iohist | 12 | 618342.3 | 0.00828 | 1.00035
pg-master | 12 | 618127.9 | 0.00819 |
pg-iohist | 24 | 591228.1 | 0.00889 | 0.98858
pg-master | 24 | 598058.5 | 0.00846 |
perf measurement shows 0.00% of samples in pgstat_io_flush_cb, and 0.07%
in pgstat_count_io_op_time. After checking the logic in
pgstat_report_stat(), these numbers make sense - stats are aggregated at
most once per second per backend.
For the I/O collection path, I tried using pg_prewarm, but got really
noisy results from it. So instead I created a table with 100k rows, one
row per page, vacuumed it, and benchmarked select count(*) over it.
Interestingly, setting effective_io_concurrency = 1 made the results
both more consistent and faster.
bin | avg | stddev_fraction | diff
-----------+-------+-----------------+---------
pg-iohist | 7.526 | 0.01012 | 0.99396
pg-master | 7.572 | 0.01186 |
perf measurement shows 0.40% of time spent in pgstat_count_io_op_time.
With a -march=native build it goes up to 0.62%, mostly because tps
goes up by 30%. Checksum calculations really love AVX-512.
Performance-wise I think the patch is fine as is; there is negligible
overhead even under the most adverse conditions.
I still want to look at the memory overhead more closely. The 30 kB per
backend seems tolerable to me, but I think having it in
PgStat_BktypeIO is not great. It makes PgStat_IO
30 kB * BACKEND_NUM_TYPES bigger, or ~0.5 MB. Having a stats snapshot be
half a megabyte bigger for no reason seems too wasteful.
Regards,
Ants Aasma