On Mon, Feb 23, 2026 at 1:35 PM Jakub Wartak
<[email protected]> wrote:
>
> On Thu, Feb 19, 2026 at 7:12 PM Andres Freund <[email protected]> wrote:
> >
> > Hi,
> >
> > On 2026-02-19 19:55:06 +0200, Ants Aasma wrote:
> > > > Right now the lowest bucket is for 0-8 ms, the second for 8-16, the
> > > > third for
> > > > 16-32. I.e. the first bucket is the same width as the second. Is that
> > > > intentional?
> > >
> > > If the boundaries are not on power-of-2 calculating the correct bucket
> > > would take a bit longer.
> >
> > Powers of two make sense, my point was that the lowest bucket and the next
> > smallest one are *not* sized in a powers of two fashion, unless I miss
> > something?
>
> Yes, as stated earlier it's intentionally made flat at the beginning to be
> able to differentiate those fast accesses.
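
To keep us on the same page about the scheme being discussed: here is a minimal standalone sketch (illustrative names only, not the patch's actual helpers) of a flat-lowest-bucket index computed via CLZ, where bucket 0 covers [0,8) ms and every later bucket doubles in width (8-16, 16-32, ...):

```c
#include <assert.h>
#include <stdint.h>

#define HIST_BUCKETS 16

/*
 * Map an elapsed time in milliseconds to a histogram bucket index.
 * Bucket 0 is the flat [0, 8) ms bucket; from there on, 63 -
 * __builtin_clzll(ms) is floor(log2(ms)), so [8, 16) -> log2 3 ->
 * bucket 1, [16, 32) -> bucket 2, etc.  Everything past the last
 * boundary is clamped into the final bucket.  The ms < 8 guard also
 * keeps __builtin_clzll away from 0, where it is undefined.
 */
static inline int
hist_bucket(uint64_t ms)
{
    int         bucket;

    if (ms < 8)
        return 0;
    bucket = 63 - __builtin_clzll(ms) - 2;
    return bucket < HIST_BUCKETS ? bucket : HIST_BUCKETS - 1;
}
```

The point of the flat start is visible in the math: a pure power-of-2 scheme would split [0,8) into several sub-millisecond buckets, while here all fast accesses land together in bucket 0 and the CLZ path stays a handful of instructions.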
>
> > > For reducing the number of buckets one option is to use log base-4 buckets
> > > instead of base-2.
> >
> > Yea, that could make sense, although it'd be somewhat sad to lose that much
> > precision.
>
> Same here, as stated earlier I wouldn't like to lose this precision.
>
> > > But if we are worried about the size, then reducing the number of
> > > histograms
> > > kept would be better.
> >
> > I think we may want both.
>
> +1.
>
> > > Many of the combinations are not used at all
>
> This!
>
> > Yea, and for many of the operations we will never measure time and thus will
> > never have anything to fill the histogram with.
> >
> > Perhaps we need to do something like have an array of histogram IDs and
> > then a
> > smaller number of histograms without the same indexing. That implies more
> > indirection, but I think that may be acceptable - the overhead of reading a
> > page is high enough that it's probably fine, whereas a lot more indirection
> > for something like a buffer hit is a different story.
>
> OK so the previous options from the thread are:
> a) we might use uint32 instead of uint64 and deal with overflows
> b) we might filter some out in order to save some memory. Trouble would be
> which ones to eliminate... and would e.g. a 2x saving be enough?
> c) we leave it as it is (accept the simple code/optimal code and waste
> this ~0.5MB in pgstat.stat)
> d) the above - but I hardly understood how it would look at all
> e) eliminate some precision (via log4?) or a column (like context) - IMHO we
> would lose too much precision or sacrifice the original goals with this.
>
> So I'm kind of lost on how to progress this because - as previously stated -
> I do not understand this challenge with memory saving and do not know the aim
> or where to stop this optimization, thus I'm mostly +1 for "c", unless
> somebody enlightens me, please ;)
>
> > > and for normal use being able to distinguish latency profiles between so
> > > many different categories is not that useful.
> >
> > I'm not that convinced by that though. It's pretty useful to separate out
> > the
> > IO latency for something like vacuuming, COPY and normal use of a
> > relation. They will often have very different latency profiles.
>
> +1
>
> --
>
> Anyway, I'm attaching v6 - no serious changes, just cleaning:
>
> 1. Removed dead ifdefed code (finding most significant bits) as testing by
> Ants showed that CLZ has literally zero overhead.
> 2. Rebased and fixed a missing include of the ports/bits header for
> pg_leading_zero_bits64(); dunno why it didn't complain earlier.
> 3. Added Ants as reviewer.
> 4. Fixed one comment referring to the wrong function (nearby enum
> hist_io_stat_col).
> 5. Added one typedef to src/tools/pgindent/typedefs.list.
>
I think I have found another way to minimize the weight of that memory
allocation: simply remapping sparse backend type IDs to contiguous ones.
0. So the original patch weighs in as below according to pahole:
struct PgStat_BktypeIO {
[..]
uint64 hist_time_buckets[3][5][8][16]; /* 2880 15360 */
/* size: 18240, cachelines: 285, members: 4 */
};
struct PgStat_IO {
[..]
PgStat_BktypeIO stats[18]; /* 8 328320 */
/* size: 328328, cachelines: 5131, members: 2 */
/* last cacheline: 8 bytes */
};
so 320kB total and not 0.5MB for a start.
1. I've noticed that we were already skipping 4 out of 17 (~23%) backend
types (thanks to pgstat_tracks_io_bktype()), and with a simple array
condensation of backend types (attached dirty PoC) I can get this down to:
struct PgStat_IO {
[..]
PgStat_BktypeIO stats[14]; /* 8 255360 */
/* size: 255368, cachelines: 3991, members: 2 */
/* last cacheline: 8 bytes */
};
so the attached crude patch is mainly about remapping using
pgstat_remap_condensed_bktype(). The patch needs lots of work, but it
demonstrates the point.
2. We could reduce this slightly further if necessary, by also ignoring
B_AUTOVAC_LAUNCHER and B_STANDALONE_BACKEND for pg_stat_io. I mean, those
seem not to generate any I/O and yet pgstat_tracks_io_bktype says
yes to them.
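
As a sanity check on those pahole numbers, the sizes fall out of plain arithmetic (the 2880 bytes of plain counters are read off the pahole output above, not derived):

```c
#include <assert.h>
#include <stdint.h>

/* Histogram payload per backend type: [3][5][8][16] uint64 buckets */
#define HIST_BYTES   (3 * 5 * 8 * 16 * (int) sizeof(uint64_t))  /* 15360 */
/* pahole: PgStat_BktypeIO = 2880 bytes of counters + the histograms */
#define BKTYPE_BYTES (2880 + HIST_BYTES)                        /* 18240 */
```

So stats[18] is 18 * 18240 = 328320 bytes (~320kB) and the condensed stats[14] is 14 * 18240 = 255360 bytes, i.e. the remapping shaves off ~71kB.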
Thoughts? Is that a good direction? Would 1 or 2 be enough?
-J.
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 148a2a9c7d5..da1f3103ee7 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -222,13 +222,14 @@ pgstat_io_flush_cb(bool nowait)
 {
 	LWLock	   *bktype_lock;
 	PgStat_BktypeIO *bktype_shstats;
+	BackendType condensedBkType = pgstat_remap_condensed_bktype(MyBackendType);
 
 	if (!have_iostats)
 		return false;
 
 	bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
 	bktype_shstats =
-		&pgStatLocal.shmem->io.stats.stats[MyBackendType];
+		&pgStatLocal.shmem->io.stats.stats[condensedBkType];
 
 	if (!nowait)
 		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
@@ -352,7 +353,12 @@ pgstat_io_reset_all_cb(TimestampTz ts)
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
-		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+		BackendType bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_shstats;
+
+		if (bktype == -1)
+			continue;
+		bktype_shstats = &pgStatLocal.shmem->io.stats.stats[bktype];
 
 		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
@@ -374,8 +378,14 @@ pgstat_io_snapshot_cb(void)
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
-		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
-		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+		BackendType bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_shstats;
+		PgStat_BktypeIO *bktype_snap;
+
+		if (bktype == -1)
+			continue;
+		bktype_shstats = &pgStatLocal.shmem->io.stats.stats[bktype];
+		bktype_snap = &pgStatLocal.snapshot.io.stats[bktype];
 
 		LWLockAcquire(bktype_lock, LW_SHARED);
@@ -445,6 +452,42 @@ pgstat_tracks_io_bktype(BackendType bktype)
return false;
}
+
+/*
+ * Remap sparse backend type IDs to contiguous ones. Keep in sync with enum
+ * BackendType.
+ *
+ * Returns -1 if the input ID is invalid or unused.
+ */
+int
+pgstat_remap_condensed_bktype(BackendType bktype)
+{
+ /* -1 here means it should not be used */
+ static const int mapping_table[BACKEND_NUM_TYPES] = {
+ -1, /* B_INVALID */
+ 0,
+ -1, /* B_DEAD_END_BACKEND */
+ 1,
+ 2,
+ 3,
+ 4,
+ 5,
+ 6,
+ -1, /* B_ARCHIVER */
+ 7,
+ 8,
+ 9,
+ 10,
+ 11,
+ 12,
+ 13,
+ -1 /* B_LOGGER */
+ };
+
+	if (bktype < 0 || bktype >= BACKEND_NUM_TYPES)
+ return -1;
+ return mapping_table[bktype];
+}
+
/*
* Some BackendTypes do not perform IO on certain IOObjects or in certain
* IOContexts. Some IOObjects are never operated on in some IOContexts. Check
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ac08ab14195..72296720286 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1578,9 +1578,14 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 	backends_io_stats = pgstat_fetch_stat_io();
 
-	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
-		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+		BackendType bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_stats;
+
+		if (bktype == -1)
+			continue;
+		bktype_stats = &backends_io_stats->stats[bktype];
 
 		/*
 		 * In Assert builds, we can afford an extra loop through all of the
@@ -1757,9 +1760,14 @@ pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
 	backends_io_stats = pgstat_fetch_stat_io();
 
-	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
-		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+		BackendType bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_stats;
+
+		if (bktype == -1)
+			continue;
+		bktype_stats = &backends_io_stats->stats[bktype];
 
 		/*
 		 * In Assert builds, we can afford an extra loop through all of the
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f16f35659b9..d0c62d3248e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -332,7 +332,7 @@ extern void SwitchBackToLocalLatch(void);
* MyBackendType indicates what kind of a backend this is.
*
* If you add entries, please also update the child_process_kinds array in
- * launch_backend.c.
+ * launch_backend.c and PGSTAT_USED_BACKEND_NUM_TYPES in pgstat.h.
*/
typedef enum BackendType
{
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 816d261e80d..a8e1f88e4c6 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -348,10 +348,13 @@ typedef struct PgStat_PendingIO
 	uint64		pending_hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_PendingIO;
 
+/* This needs to stay in sync with pgstat_tracks_io_bktype() */
+#define PGSTAT_USED_BACKEND_NUM_TYPES (BACKEND_NUM_TYPES - 4)
+
 typedef struct PgStat_IO
 {
 	TimestampTz stat_reset_timestamp;
-	PgStat_BktypeIO stats[BACKEND_NUM_TYPES];
+	PgStat_BktypeIO stats[PGSTAT_USED_BACKEND_NUM_TYPES];
 } PgStat_IO;
typedef struct PgStat_StatDBEntry
@@ -620,6 +623,7 @@ extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
 extern const char *pgstat_get_io_op_name(IOOp io_op);
+extern int	pgstat_remap_condensed_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
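
To make the invariant behind the remapping explicit, here is a toy, self-contained version (the sizes and the table below are made up for illustration, not the real BackendType enum): untracked types map to -1, and all tracked types land on a dense 0..used-1 range, which is exactly what lets the stats array shrink from the full type count to the used count.

```c
#include <assert.h>

/*
 * Toy remap: 6 "backend types", of which 2 (marked -1) never produce
 * I/O stats.  Tracked types get contiguous condensed IDs 0..3, so a
 * stats array only needs TOY_USED_TYPES slots instead of TOY_NUM_TYPES.
 */
enum { TOY_NUM_TYPES = 6, TOY_USED_TYPES = 4 };

static const int toy_mapping[TOY_NUM_TYPES] = {
	-1,	/* untracked, cf. B_INVALID */
	0,
	1,
	-1,	/* untracked, cf. B_LOGGER */
	2,
	3,
};

static int
toy_remap(int bktype)
{
	/* out-of-range IDs are treated like untracked ones */
	if (bktype < 0 || bktype >= TOY_NUM_TYPES)
		return -1;
	return toy_mapping[bktype];
}
```

Callers then follow the same pattern as in the patch: compute the condensed ID first, skip the iteration on -1, and only then index the (smaller) stats array.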