On Thu, 19 Feb 2026 at 01:12, Andres Freund <[email protected]> wrote:
> I'd probably, just out of paranoia, also test without checksums enabled (to
> avoid the memory bandwidth hit) and see if the overhead increases if you
> change the query to not need to evaluate expressions (e.g. by using SELECT *
> FROM tbl OFFSET large_number, or using pg_prewarm with
> maintenance_io_concurrency=1).
Tried it, and disabling checksums made the performance of march=x86-64 match
march=native. I didn't run enough iterations to draw statistically
significant conclusions, but curiously perf now shows only 0.23% in
pgstat_count_io_op_time, compared to 0.60% before with march=native.
Probably less CPU cache thrashing going on.

> One thing to be aware of is that with the rdtsc[p] patch (to substantially
> reduce timing overhead), it'll become a tad more expensive to convert an
> instr_time to nanoseconds (due to having to convert cycles to nanoseconds).
> It may be worth testing the combination.
>
> On that note, why is this measuring things in nanoseconds, given that we
> already convert instr_time to microseconds nearby, that it's quite unlikely
> you'd ever have IO times below a microsecond, and that
> MIN_PG_STAT_IO_HIST_LATENCY is already in the microsecond domain and we
> display it as microseconds?

I agree that just using microseconds here would be better.

> > I still want to look at the memory overhead more closely. The 30kB per
> > backend seems tolerable to me
>
> One thing worth thinking about here is that we probably could stand to
> increase the number of IO types further; we e.g. have been talking about
> tracking IO that bypasses shared buffers separately. And a few more context
> types (e.g. index inner/leaf) could also make sense.
>
> Without that change that'd be a somewhat moderate increase in memory usage,
> but with this change it'd increase a lot more.
>
> > but I think having it in PgStat_BktypeIO is not great. This makes
> > PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~0.5MB. Having a stats snapshot
> > be half a megabyte bigger for no reason seems too wasteful.
>
> Yea, that's not awesome.
>
> I guess we could count IO as 4 byte integers, and shift all bucket counts
> down in the rare case of an overflow. It's just a 2x improvement, but ...
>
> I think we might need to reduce the number of buckets somewhat.
> Right now the lowest bucket is for 0-8 ms, the second for 8-16, the third
> for 16-32. I.e. the first bucket is the same width as the second. Is that
> intentional?

If the boundaries are not on a power of 2, calculating the correct bucket
takes a bit longer. For reducing the number of buckets, one option is to use
log base-4 buckets instead of base-2. But if we are worried about size,
then reducing the number of histograms kept would be better. Many of the
combinations are not used at all, and for normal use being able to
distinguish latency profiles between so many different categories is not
that useful.

Regards,
Ants Aasma
