On Mon, Feb 23, 2026 at 1:35 PM Jakub Wartak
<[email protected]> wrote:
>
> On Thu, Feb 19, 2026 at 7:12 PM Andres Freund <[email protected]> wrote:
> >
> > Hi,
> >
> > On 2026-02-19 19:55:06 +0200, Ants Aasma wrote:
> > > > Right now the lowest bucket is for 0-8 ms, the second for 8-16, the
> > > > third for
> > > > 16-32. I.e. the first bucket is the same width as the second. Is that
> > > > intentional?
> > >
> > > If the boundaries are not on power-of-2 calculating the correct bucket
> > > would take a bit longer.
> >
> > Powers of two make sense, my point was that the lowest bucket and the next
> > smallest one are *not* sized in a powers of two fashion, unless I miss
> > something?
>
> Yes, as stated earlier it's intentionally made flat at the beginning to be
> able to differentiate those fast accesses.
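
To keep us on the same page about the scheme being discussed: here is a minimal standalone sketch (illustrative names only, not the patch's actual helpers) of a flat-lowest-bucket index computed via CLZ, where bucket 0 covers [0,8) ms and every later bucket doubles in width (8-16, 16-32, ...):

```c
#include <assert.h>
#include <stdint.h>

#define HIST_BUCKETS 16

/*
 * Map an elapsed time in milliseconds to a histogram bucket index.
 * Bucket 0 is the flat [0, 8) ms bucket; from there on, 63 -
 * __builtin_clzll(ms) is floor(log2(ms)), so [8, 16) -> log2 3 ->
 * bucket 1, [16, 32) -> bucket 2, etc.  Everything past the last
 * boundary is clamped into the final bucket.  The ms < 8 guard also
 * keeps __builtin_clzll away from 0, where it is undefined.
 */
static inline int
hist_bucket(uint64_t ms)
{
    int         bucket;

    if (ms < 8)
        return 0;
    bucket = 63 - __builtin_clzll(ms) - 2;
    return bucket < HIST_BUCKETS ? bucket : HIST_BUCKETS - 1;
}
```

The point of the flat start is visible in the math: a pure power-of-2 scheme would split [0,8) into several sub-millisecond buckets, while here all fast accesses land together in bucket 0 and the CLZ path stays a handful of instructions.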
>
> > > For reducing the number of buckets one option is to use log base-4 buckets
> > > instead of base-2.
> >
> > Yea, that could make sense, although it'd be somewhat sad to lose that much
> > precision.
>
> Same here, as stated earlier I wouldn't like to lose this precision.
>
> > > But if we are worried about the size, then reducing the number of
> > > histograms
> > > kept would be better.
> >
> > I think we may want both.
>
> +1.
>
> > > Many of the combinations are not used at all
>
> This!
>
> > Yea, and for many of the operations we will never measure time and thus will
> > never have anything to fill the histogram with.
> >
> > Perhaps we need to do something like have an array of histogram IDs and
> > then a
> > smaller number of histograms without the same indexing. That implies more
> > indirection, but I think that may be acceptable - the overhead of reading a
> > page is high enough that it's probably fine, whereas a lot more indirection
> > for something like a buffer hit is a different story.
>
> OK so the previous options from the thread are:
> a) we might use uint32 instead of uint64 and deal with overflows
> b) we might filter some out in order to save some memory. Trouble would be
> which ones to eliminate... and would e.g. a 2x saving be enough?
> c) we leave it as it is (accept the simple code/optimal code and waste
> this ~0.5MB in pgstat.stat)
> d) the above - but I hardly understood how it would look at all
> e) eliminate some precision (via log4?) or a column (like context) - IMHO we
> would lose too much precision or sacrifice the original goals with this.
>
> So I'm kind of lost on how to progress this because - as previously stated -
> I do not understand this challenge with memory saving and do not know the aim
> or where to stop this optimization, thus I'm mostly +1 for "c", unless
> somebody enlightens me, please ;)
>
> > > and for normal use being able to distinguish latency profiles between so
> > > many different categories is not that useful.
> >
> > I'm not that convinced by that though. It's pretty useful to separate out
> > the
> > IO latency for something like vacuuming, COPY and normal use of a
> > relation. They will often have very different latency profiles.
>
> +1
>
> --
>
> Anyway, I'm attaching v6 - no serious changes, just cleaning:
>
> 1. Removed dead ifdefed code (finding most significant bits) as testing by
> Ants showed that CLZ has literally zero overhead.
> 2. Rebased and fixed a missing include of the ports/bits header for
> pg_leading_zero_bits64(); dunno why it didn't complain earlier.
> 3. Added Ants as reviewer.
> 4. Fixed one comment referring to the wrong function (nearby enum
> hist_io_stat_col).
> 5. Added one typedef to src/tools/pgindent/typedefs.list.
>
I think I have found another way to minimize the weight of that memory
allocation: simply remapping sparse backend type IDs to contiguous ones.
0. So the original patch weighs in as below according to pahole:
struct PgStat_BktypeIO {
[..]
uint64 hist_time_buckets[3][5][8][16]; /* 2880 15360 */
/* size: 18240, cachelines: 285, members: 4 */
};
struct PgStat_IO {
[..]
PgStat_BktypeIO stats[18]; /* 8 328320 */
/* size: 328328, cachelines: 5131, members: 2 */
/* last cacheline: 8 bytes */
};
so 320kB total and not 0.5MB for a start.
1. I've noticed that we were already skipping 4 out of 17 (~23%) backend
types (thanks to pgstat_tracks_io_bktype()), and with a simple array
condensation of backend types (attached dirty PoC) I can get this down to:
struct PgStat_IO {
[..]
PgStat_BktypeIO stats[14]; /* 8 255360 */
/* size: 255368, cachelines: 3991, members: 2 */
/* last cacheline: 8 bytes */
};
so the attached crude patch is mainly about remapping using
pgstat_remap_condensed_bktype(). The patch needs lots of work, but it
demonstrates the point.
2. We could reduce this slightly further if necessary, by also ignoring
B_AUTOVAC_LAUNCHER and B_STANDALONE_BACKEND for pg_stat_io. I mean, those
seem not to generate any I/O and yet pgstat_tracks_io_bktype says
yes to them.
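
As a sanity check on those pahole numbers, the sizes fall out of plain arithmetic (the 2880 bytes of plain counters are read off the pahole output above, not derived):

```c
#include <assert.h>
#include <stdint.h>

/* Histogram payload per backend type: [3][5][8][16] uint64 buckets */
#define HIST_BYTES   (3 * 5 * 8 * 16 * (int) sizeof(uint64_t))  /* 15360 */
/* pahole: PgStat_BktypeIO = 2880 bytes of counters + the histograms */
#define BKTYPE_BYTES (2880 + HIST_BYTES)                        /* 18240 */
```

So stats[18] is 18 * 18240 = 328320 bytes (~320kB) and the condensed stats[14] is 14 * 18240 = 255360 bytes, i.e. the remapping shaves off ~71kB.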
Thoughts? Is that a good direction? Would 1 or 2 be enough?
-J.
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 148a2a9c7d5..da1f3103ee7 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -222,13 +222,14 @@ pgstat_io_flush_cb(bool nowait)
 {
 	LWLock	   *bktype_lock;
 	PgStat_BktypeIO *bktype_shstats;
+	BackendType condensedBkType = pgstat_remap_condensed_bktype(MyBackendType);
 
 	if (!have_iostats)
 		return false;
 
 	bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
 	bktype_shstats =
-		&pgStatLocal.shmem->io.stats.stats[MyBackendType];
+		&pgStatLocal.shmem->io.stats.stats[condensedBkType];
 
 	if (!nowait)
 		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
@@ -352,7 +353,12 @@ pgstat_io_reset_all_cb(TimestampTz ts)
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
-		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+		BackendType bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_shstats;
+
+		if (bktype == -1)
+			continue;
+		bktype_shstats = &pgStatLocal.shmem->io.stats.stats[bktype];
 
 		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
@@ -374,8 +378,14 @@ pgstat_io_snapshot_cb(void)
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
-		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
-		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+		BackendType bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_shstats;
+		PgStat_BktypeIO *bktype_snap;
+
+		if (bktype == -1)
+			continue;
+		bktype_shstats = &pgStatLocal.shmem->io.stats.stats[bktype];
+		bktype_snap = &pgStatLocal.snapshot.io.stats[bktype];
 
 		LWLockAcquire(bktype_lock, LW_SHARED);
@@ -445,6 +452,42 @@ pgstat_tracks_io_bktype(BackendType bktype)
return false;
}
+
+/*
+ * Remap sparse backend type IDs to contiguous ones. Keep in sync with enum
+ * BackendType.
+ *
+ * Returns -1 if the input ID is invalid or unused.
+ */
+int
+pgstat_remap_condensed_bktype(BackendType bktype)
+{
+ /* -1 here means it should not be used */
+ static const int mapping_table[BACKEND_NUM_TYPES] = {
+ -1, /* B_INVALID */
+ 0,
+ -1, /* B_DEAD_END_BACKEND */
+ 1,
+ 2,
+ 3,
+ 4,
+ 5,
+ 6,
+ -1, /* B_ARCHIVER */
+ 7,
+ 8,
+ 9,
+ 10,
+ 11,
+ 12,
+ 13,
+ -1 /* B_LOGGER */
+ };
+
+	if (bktype < 0 || bktype >= BACKEND_NUM_TYPES)
+ return -1;
+ return mapping_table[bktype];
+}
+
/*
* Some BackendTypes do not perform IO on certain IOObjects or in certain
* IOContexts. Some IOObjects are never operated on in some IOContexts. Check
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ac08ab14195..72296720286 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1578,9 +1578,14 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 	backends_io_stats = pgstat_fetch_stat_io();
 
-	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
-		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+		BackendType bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_stats;
+
+		if (bktype == -1)
+			continue;
+		bktype_stats = &backends_io_stats->stats[bktype];
 
 		/*
 		 * In Assert builds, we can afford an extra loop through all of the
@@ -1757,9 +1760,14 @@ pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
 	backends_io_stats = pgstat_fetch_stat_io();
 
-	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
-		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+		BackendType bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_stats;
+
+		if (bktype == -1)
+			continue;
+		bktype_stats = &backends_io_stats->stats[bktype];
 
 		/*
 		 * In Assert builds, we can afford an extra loop through all of the
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f16f35659b9..d0c62d3248e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -332,7 +332,7 @@ extern void SwitchBackToLocalLatch(void);
* MyBackendType indicates what kind of a backend this is.
*
* If you add entries, please also update the child_process_kinds array in
- * launch_backend.c.
+ * launch_backend.c and PGSTAT_USED_BACKEND_NUM_TYPES in pgstat.h.
*/
typedef enum BackendType
{
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 816d261e80d..a8e1f88e4c6 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -348,10 +348,13 @@ typedef struct PgStat_PendingIO
 	uint64		pending_hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_PendingIO;
 
+/* This needs to stay in sync with pgstat_tracks_io_bktype() */
+#define PGSTAT_USED_BACKEND_NUM_TYPES (BACKEND_NUM_TYPES - 4)
+
 typedef struct PgStat_IO
 {
 	TimestampTz stat_reset_timestamp;
-	PgStat_BktypeIO stats[BACKEND_NUM_TYPES];
+	PgStat_BktypeIO stats[PGSTAT_USED_BACKEND_NUM_TYPES];
 } PgStat_IO;
typedef struct PgStat_StatDBEntry
@@ -620,6 +623,7 @@ extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
 extern const char *pgstat_get_io_op_name(IOOp io_op);
+extern int	pgstat_remap_condensed_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
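
To make the invariant behind the remapping explicit, here is a toy, self-contained version (the sizes and the table below are made up for illustration, not the real BackendType enum): untracked types map to -1, and all tracked types land on a dense 0..used-1 range, which is exactly what lets the stats array shrink from the full type count to the used count.

```c
#include <assert.h>

/*
 * Toy remap: 6 "backend types", of which 2 (marked -1) never produce
 * I/O stats.  Tracked types get contiguous condensed IDs 0..3, so a
 * stats array only needs TOY_USED_TYPES slots instead of TOY_NUM_TYPES.
 */
enum { TOY_NUM_TYPES = 6, TOY_USED_TYPES = 4 };

static const int toy_mapping[TOY_NUM_TYPES] = {
	-1,	/* untracked, cf. B_INVALID */
	0,
	1,
	-1,	/* untracked, cf. B_LOGGER */
	2,
	3,
};

static int
toy_remap(int bktype)
{
	/* out-of-range IDs are treated like untracked ones */
	if (bktype < 0 || bktype >= TOY_NUM_TYPES)
		return -1;
	return toy_mapping[bktype];
}
```

Callers then follow the same pattern as in the patch: compute the condensed ID first, skip the iteration on -1, and only then index the (smaller) stats array.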