On Thu, Feb 19, 2026 at 12:12 AM Andres Freund <[email protected]> wrote:
Hi Andres,
> One thing to be aware of is that with the rdtsc[p] patch (to substantially
> reduce timing overhead), it'll become a tad more expensive to convert an
> instr_time to nanoseconds (due to having to convert cycles to nanoseconds).
> It may be worth testing the combination.
I took a quick look at the latest v7-0002 from there [1], and to sum up it does:
-#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) (t).ticks)
+static inline int64
+pg_ticks_to_ns(int64 ticks)
+{
+#if defined(__x86_64__) || defined(WIN32)
[..]
+ ns += ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ return ns;
+#else
+ return ticks;
+#endif
+}
[..but!]
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (pg_ticks_to_ns((t).ticks))
+
So at least to my eyes, it looks pretty cheap, doesn't it?
> On that note, why is this measuring things in nanoseconds, given that we
> already conver instr_time to microseconds nearby and that its quite unlikely
> that you'd ever have IO times below a microsecond and that
> MIN_PG_STAT_IO_HIST_LATENCY already is in the microsecond domain and we
> display it as microseconds?
Hmm, in an earlier reply [2] you recommended getting away from the conversion to
microseconds, so I did, because the microseconds required really costly
integer divisions:
"It's annoying to have to convert to microseconds here, that's not free :("
So I stuck with INSTR_TIME_GET_NANOSEC(), because it is still cheap and just fetches "ticks".
> > I still want to look at the memory overhead more closely. The 30kB per
> > backend seems tolerable to me
>
> One thing worth thinking about here is that we probably could stand to
> increase the number of IO types further, we e.g. have been talking about
> tracking IO that bypasses shared buffers separately. And a few more context
> types (e.g. index inner/leaf) could also make sense.
>
> Without that change that'd be a somewhat moderate increase in memory usage,
> but with this change it'd increase a lot more.
OK, point taken, it can grow even further, but..:
> > but I think having it in PgStat_BktypeIO is not great. This makes
> > PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats snapshot
> > be half a megabyte bigger for no reason seems too wasteful.
>
> Yea, that's not awesome.
Guys, a question: could you please explain to me the drawbacks of having
this semi-big (internal-only) stats snapshot of 0.5MB? I'm struggling to
understand two things:
a) 0.5MB is not a lot these days (OK, my 286 had 1MB back in the day ;))
b) how does it affect anything, given that testing shows it doesn't?
My understanding is that it only affects the file size on startup/shutdown
of $PGDATA/pg_stat/pgstat.stat, correct? My worry is that we introduce
more code (and bugs) for no real gain (?)
> I guess we could count IO as 4 byte integers, and shift all bucket counts down
> in the rare case of an overflow. It's just a 2x improvement, but ...
[..I'll reply to that in the next follow-up]
> I think we might need to reduce the number of buckets somewhat.
I'm kind of skeptical about lowering the bucket count; even Ants wanted to
increase it, so that we would gain good visibility into sometimes
problematic hardware issues (I would also swear there is something magical
about I/Os stuck for 60 seconds), so we both would want to cover that range,
but we cannot squeeze in more buckets due to performance concerns...
There's also the area where we want to tell whether an I/O was served from
the page cache or from some fast I/O device, and that's how I arrived at the
first edge of ~8us. If we go one bucket further (that is, make the first
bucket 16us), I was afraid we might lose the ability to differentiate
page cache vs. devices, won't we? (Optane seems to be gone, but it started
at ~20us? You said in [3] that it could be even as low as 10, so I thought
8 was a good bet.)
Right now the final bucket tracks >128ms (==> bad stuff), and I would love
to extend that to >512ms, but we cannot, as it would need more than 16
buckets (and 16 * 8 bytes_due_to_uint64 = 128 bytes already).
> Right now the lowest bucket is for 0-8 us, the second for 8-16, the third for
> 16-32. I.e. the first bucket is the same width as the second. Is that
> intentional?
Yes, it's intentionally flat at the beginning, to be able to differentiate
those fast accesses.
-J.
[1] -
https://www.postgresql.org/message-id/CAP53PkxNJ2Y6G8PEpQn1zKa6ODE6k1-oP9DNqWjkTj%3DdC8_KiA%40mail.gmail.com
[2] -
https://www.postgresql.org/message-id/vhzkeogzrrfzjwo3xrnq4xsjh6i37ou6xsbz7yby3lbb3rnxzz%406fpysnkjyldi