On Thu, Mar 19, 2026 at 11:16 AM Jakub Wartak
<[email protected]> wrote:
>
> On Wed, Mar 18, 2026 at 2:29 PM Jakub Wartak
> <[email protected]> wrote:
> >
> > On Tue, Mar 17, 2026 at 3:17 PM Andres Freund <[email protected]> wrote:
> > > On 2026-03-17 13:13:59 +0100, Jakub Wartak wrote:
> > > > 1. Concerns about memory use. With v7 I had couple of ideas, and with 
> > > > those
> > > > the memory use is really minimized as long as the code is still simple
> > > > (so nothing fancy, just some ideas to trim stuff and dynamically 
> > > > allocate
> > > > memory). I hope those reduce memory footprint to acceptable levels, see 
> > > > my
> > > > earlier description for v7.
> > >
> > > Personally I unfortunately continue to think that storing lots of values 
> > > that
> > > are never anything but zero isn't a good idea once you have more than a
> > > handful of kB. Storing pointless data is something different than 
> > > increasing
> > > memory usage with actual information.
> > >
> > > I still think you should just count the number of histograms needed, have 
> > > an
> > > array [object][context][op] with the associated histogram "offset" and 
> > > then
> > > increment the associated offset.  It'll add an indirection at count time, 
> > > but
> > > no additional branches.
> >
> > Great idea, thanks, I haven't thought about that! Attached v9 attempts to do
> > that for pending backend I/O struct, which minimizes the (backend) memory
> > footprint for client backends to just about ~5kB.
> >
> > I have been pulling my hair trying to achieve the same for shared-memory, 
> > but I
> > have failed to do that w/o sinking into complexity [..]
>
> OK, I've made  it done too with indirect offset on shared memory, it wasn't 
> easy
> at least for me, but now we have two approaches/patchsets:
>
[..]
> v9b: with more code and build complexity but that should address concern of 
> not
>      used memory
>
> 'Shared Memory Stats' allocated size:
> master - uses ~308kB for shm
> v9a-000[12]: 578kB shm
> v9a-000[123]: 507kB shm
> v9a-000[1234]: 471kB shm (+~163kB more)
>
> v9b-000[123]: 361kB shm
>
> v9a-000[12] are identical to v9b-00[12], but included just for
> patchset completeness.
>
> In v9b meson/autoconf (for adding pgstat_io_genstats) build most of
> the time what
> they need, but probably that needs some cleanups and better dependency
> tracking. I'm
> not sure about correctnes of those changes as especially
> autoconf/Makefile is a lot
> like brainf**k to me and that area would need some help...
>
> I think now we could even increase max resolution of buckets to cover
> max those maximum
> of 32s+ (at the cost of one extra 64-byte cacheline for pending IO
> stats, so go with
> PGSTAT_IO_HIST_BUCKETS from 16 to 24)

Good morning all,

Ok here comes v10, which is bit like earlier v9b (so has reduced shared memory
footprint using Yours idea about indirect offsets idea), but now with shm memory
sized and allocated on startup by postmaster. There are 3 patches:
- 0001, one to introduce view and bucketting, no changes since quite some time
- 0002, saves some private (backend) memory
- 0003, main meat, saving shared memory (main problem raised earlier),
now switched
  to simply dynamically size shared memory based on those pgstat_track_io*()
  logic

The problem with the 0003 earlier was that I wanted to absolutley avoid further
complexiy/alterations in struct PgStat_IO related to dynamic shared memory
allocation for hist_time_buckets_slots[PGSTAT_IO_HIST_BUCKET_SLOTS]
[PGSTAT_IO_HIST_BUCKETS] (I was afraid to touch that shm code, it
looks complex),
so I had to come out with something that would tell us how many slots
(PGSTAT_IO_HIST_BUCKET_SLOTS) we need, I wish we had C++'s `constexpr` that
would do all of that. I've tried three aproaches (like in v9b but that hit
some serious cross-compiling obstacles, also had perl doing that, but that
had lots of code duplication), so in the end I had to alter the pgstat_io
shm allocation which is now in 0003.

Summary of changes in 0003 since v9b / earlier post:
- Fixed potential race condition (touch via memset/memcpy() only histogram
  slots under LWLock)
- Fixed/removed the PGSTAT_IO_HIST_BUCKET_SLOTS macro
- Removed pgstat_io_genslots.c (first idea, above) and abandonded attempt to
  fixup some cross compilation woes on MSVC/mingw
- Bumped PGSTAT_FILE_FORMAT_ID
- Move/optimize pending_off in pgstat_io_flush_cb out of hot loop
- Document that hist_time_buckets_offsets should be the last member of
PgStat_BktypeIO
- Be defensive - added some asserts()
- Adjust _bucket_offsets from uint64 to just int to save memory (offsets are low
  numbers)
- and finally moved to dynamic shm allocation of PgStat_IO stuff during
  startup

At the end of the day, I'll squeze 000[123] into just one, but wanted
to ease the
review first a bit. Of course this is material for PG20.

-J.
From d6783896069de828b13c554c4c21ce439a76d2bc Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Wed, 18 Mar 2026 07:24:14 +0100
Subject: [PATCH v10f 2/3] Lower pg_stat_io_histogram private (backend) memory
 in pending_hist_time_buckets by using array with indirect offsets.

---
 src/backend/utils/activity/pgstat.c    |  9 +--
 src/backend/utils/activity/pgstat_io.c | 90 ++++++++++++++++++++++++--
 src/include/pgstat.h                   | 19 ++++--
 src/include/utils/pgstat_internal.h    |  1 +
 4 files changed, 102 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 9feb2f1370b..7c597932671 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -445,6 +445,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
 		.shared_data_off = offsetof(PgStatShared_IO, stats),
 		.shared_data_len = sizeof(((PgStatShared_IO *) 0)->stats),
 
+		.init_backend_cb = pgstat_io_init_backend_cb,
 		.flush_static_cb = pgstat_io_flush_cb,
 		.init_shmem_cb = pgstat_io_init_shmem_cb,
 		.reset_all_cb = pgstat_io_reset_all_cb,
@@ -691,14 +692,6 @@ pgstat_initialize(void)
 	/* Set up a process-exit hook to clean up */
 	before_shmem_exit(pgstat_shutdown_hook, 0);
 
-	/* Allocate I/O latency buckets only if we are going to populate it */
-	if (track_io_timing || track_wal_io_timing)
-		PendingIOStats.pending_hist_time_buckets = MemoryContextAllocZero(TopMemoryContext,
-																		  IOOBJECT_NUM_TYPES * IOCONTEXT_NUM_TYPES * IOOP_NUM_TYPES *
-																		  PGSTAT_IO_HIST_BUCKETS * sizeof(uint64));
-	else
-		PendingIOStats.pending_hist_time_buckets = NULL;
-
 #ifdef USE_ASSERT_CHECKING
 	pgstat_is_initialized = true;
 #endif
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index bbf910ac4bb..1696f278a77 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,6 +16,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
@@ -66,6 +67,27 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 	return true;
 }
 
+int
+pgstat_bktype_count_potentially_used(BackendType bktype)
+{
+	int			cnt = 0;
+
+	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+	{
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+					cnt++;
+			}
+		}
+	}
+
+	return cnt;
+}
+
 void
 pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op,
 				   uint32 cnt, uint64 bytes)
@@ -186,12 +208,16 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 
 		if (PendingIOStats.pending_hist_time_buckets != NULL)
 		{
+			int			offset;
+
 			/*
 			 * calculate the bucket_index based on latency in nanoseconds
 			 * (uint64)
 			 */
 			bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
-			PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+
+			offset = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op];
+			PendingIOStats.pending_hist_time_buckets[offset][bucket_index]++;
 		}
 
 		/* Add the per-backend count */
@@ -264,10 +290,23 @@ pgstat_io_flush_cb(bool nowait)
 				bktype_shstats->times[io_object][io_context][io_op] +=
 					INSTR_TIME_GET_MICROSEC(time);
 
+				/*
+				 * If tracking I/O stats, save I/O histograms from backend
+				 * local's PendingIOStats by using indirect offsets from the
+				 * pending_hist_time_buckets dynamic array (accessed with
+				 * offsets to save memory) into shared memory.
+				 */
 				if (PendingIOStats.pending_hist_time_buckets != NULL)
 					for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
-						bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
-							PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
+					{
+						int			pending_off = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op];
+
+						if (pending_off != -1)
+						{
+							bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+								PendingIOStats.pending_hist_time_buckets[pending_off][b];
+						}
+					}
 			}
 		}
 	}
@@ -276,8 +315,14 @@ pgstat_io_flush_cb(bool nowait)
 
 	LWLockRelease(bktype_lock);
 
-	/* Avoid overwriting latency buckets array pointer */
+	/*
+	 * Avoid overwriting histogram latency array (with offsets) and pointer to
+	 * dynamically allocated memory
+	 */
 	memset(&PendingIOStats, 0, offsetof(PgStat_PendingIO, pending_hist_time_buckets));
+	if (PendingIOStats.pending_hist_time_buckets != NULL)
+		memset(PendingIOStats.pending_hist_time_buckets, 0,
+			   PendingIOStats.pending_hist_time_buckets_size * sizeof(*PendingIOStats.pending_hist_time_buckets));
 
 	have_iostats = false;
 
@@ -349,6 +394,43 @@ pgstat_get_io_op_name(IOOp io_op)
 	pg_unreachable();
 }
 
+void
+pgstat_io_init_backend_cb(void)
+{
+	/* Allocate I/O latency buckets only if we are going to populate it */
+	if (track_io_timing || track_wal_io_timing)
+	{
+		int			alloc_sz,
+					io_histograms_used = 0;
+
+		PendingIOStats.pending_hist_time_buckets_size = pgstat_bktype_count_potentially_used(MyBackendType);
+		alloc_sz = PendingIOStats.pending_hist_time_buckets_size * sizeof(*PendingIOStats.pending_hist_time_buckets);
+		PendingIOStats.pending_hist_time_buckets = MemoryContextAllocZero(TopMemoryContext, alloc_sz);
+
+		for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+		{
+			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+			{
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					if (pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op))
+					{
+						Assert(io_histograms_used <= PendingIOStats.pending_hist_time_buckets_size);
+
+						PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op] =
+							io_histograms_used++;
+					}
+					else
+						PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op] = -1;
+				}
+			}
+		}
+	}
+	else
+		PendingIOStats.pending_hist_time_buckets = NULL;
+
+}
+
 void
 pgstat_io_init_shmem_cb(void *stats)
 {
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 34fd93f86dc..984914e69b8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -352,12 +352,20 @@ typedef struct PgStat_PendingIO
 	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 
 	/*
-	 * Dynamically allocated array of
-	 * [IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES]
-	 * [IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS] only with track_io_timings
-	 * true.
+	 * Dynamically allocated array for pg_stat_io_histograms only when
+	 * track_io_timings is true. pending_hist_time_buckets_offsets is just an
+	 * offset within pending_hist_time_buckets to avoid using unnecessary
+	 * memory.
 	 */
-	uint64		(*pending_hist_time_buckets)[IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
+	uint64		(*pending_hist_time_buckets)[PGSTAT_IO_HIST_BUCKETS];
+	uint64		pending_hist_time_buckets_offsets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+
+	/*
+	 * Cache how much histograms we have allocated to avoid repetably calling
+	 * pgstat_bktype_count_potentially_used(MyBackendType) from
+	 * pgstat_io_flush_cb()
+	 */
+	int			pending_hist_time_buckets_size;
 } PgStat_PendingIO;
 
 extern PgStat_PendingIO PendingIOStats;
@@ -645,6 +653,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 										 BackendType bktype);
+extern int	pgstat_bktype_count_potentially_used(BackendType bktype);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context,
 							   IOOp io_op, uint32 cnt, uint64 bytes);
 extern instr_time pgstat_prepare_io_time(bool track_io_guc);
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index a3ce8b04723..fcaf21db574 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -759,6 +759,7 @@ extern void pgstat_function_reset_timestamp_cb(PgStatShared_Common *header, Time
 extern void pgstat_flush_io(bool nowait);
 
 extern bool pgstat_io_flush_cb(bool nowait);
+extern void pgstat_io_init_backend_cb(void);
 extern void pgstat_io_init_shmem_cb(void *stats);
 extern void pgstat_io_reset_all_cb(TimestampTz ts);
 extern void pgstat_io_snapshot_cb(void);
-- 
2.43.0

From de74801e8e784f5126adf7029a2d3e112c985fde Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 23 Jan 2026 08:10:09 +0100
Subject: [PATCH v10f 1/3] Add pg_stat_io_histogram view to provide more
 detailed insight into IO profile

pg_stat_io_histogram displays a histogram of IO latencies for specific
backend_type, object, context and io_type. The histogram has buckets that allow
faster identification of I/O latency outliers due to faulty hardware and/or
misbehaving I/O stack. Such I/O outliers e.g. slow fsyncs could sometimes
cause intermittent issues e.g. for COMMIT or affect the synchronous standbys
performance.

Author: Jakub Wartak <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Ants Aasma <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAKZiRmwvE4uJLKTgPXeBA4m%2Bd4tTghayoefcaM9%3Dz3_S7i72GA%40mail.gmail.com
---
 configure                                   |  38 +++
 configure.ac                                |   1 +
 doc/src/sgml/config.sgml                    |  12 +-
 doc/src/sgml/monitoring.sgml                | 290 ++++++++++++++++++++
 doc/src/sgml/wal.sgml                       |   5 +-
 meson.build                                 |   1 +
 src/backend/catalog/system_views.sql        |  11 +
 src/backend/utils/activity/pgstat.c         |  19 +-
 src/backend/utils/activity/pgstat_backend.c |   4 +-
 src/backend/utils/activity/pgstat_io.c      |  92 ++++++-
 src/backend/utils/adt/pgstatfuncs.c         | 148 ++++++++++
 src/include/catalog/pg_proc.dat             |   9 +
 src/include/pgstat.h                        |  38 ++-
 src/include/port/pg_bitutils.h              |  38 ++-
 src/include/utils/pgstat_internal.h         |   2 +-
 src/test/recovery/t/029_stats_restart.pl    |  29 ++
 src/test/regress/expected/rules.out         |   8 +
 src/tools/pgindent/typedefs.list            |   1 +
 18 files changed, 727 insertions(+), 19 deletions(-)

diff --git a/configure b/configure
index f66c1054a7a..c09329240be 100755
--- a/configure
+++ b/configure
@@ -16054,6 +16054,44 @@ cat >>confdefs.h <<_ACEOF
 #define HAVE__BUILTIN_CLZ 1
 _ACEOF
 
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clzl" >&5
+$as_echo_n "checking for __builtin_clzl... " >&6; }
+if ${pgac_cv__builtin_clzl+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+call__builtin_clzl(unsigned long x)
+{
+    return __builtin_clzl(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_clzl=yes
+else
+  pgac_cv__builtin_clzl=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_clzl" >&5
+$as_echo "$pgac_cv__builtin_clzl" >&6; }
+if test x"${pgac_cv__builtin_clzl}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_CLZL 1
+_ACEOF
+
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctz" >&5
 $as_echo_n "checking for __builtin_ctz... " >&6; }
diff --git a/configure.ac b/configure.ac
index 8d176bd3468..8f804464bc5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1881,6 +1881,7 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_bswap32], [int x])
 PGAC_CHECK_BUILTIN_FUNC([__builtin_bswap64], [long int x])
 # We assume that we needn't test all widths of these explicitly:
 PGAC_CHECK_BUILTIN_FUNC([__builtin_clz], [unsigned int x])
+PGAC_CHECK_BUILTIN_FUNC([__builtin_clzl], [unsigned long x])
 PGAC_CHECK_BUILTIN_FUNC([__builtin_ctz], [unsigned int x])
 # __builtin_frame_address may draw a diagnostic for non-constant argument,
 # so it needs a different test function.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 73cc0412330..91b1fd7e635 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9067,9 +9067,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         displayed in <link linkend="monitoring-pg-stat-database-view">
         <structname>pg_stat_database</structname></link>,
         <link linkend="monitoring-pg-stat-io-view">
-        <structname>pg_stat_io</structname></link> (if <varname>object</varname>
-        is not <literal>wal</literal>), in the output of the
-        <link linkend="pg-stat-get-backend-io">
+        <structname>pg_stat_io</structname></link> and
+        <link linkend="monitoring-pg-stat-io-histogram-view">
+        <structname>pg_stat_io_histogram</structname></link>
+        (if <varname>object</varname> is not <literal>wal</literal>),
+        in the output of the <link linkend="pg-stat-get-backend-io">
         <function>pg_stat_get_backend_io()</function></link> function (if
         <varname>object</varname> is not <literal>wal</literal>), in the
         output of <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal>
@@ -9099,7 +9101,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         measure the overhead of timing on your system.
         I/O timing information is displayed in
         <link linkend="monitoring-pg-stat-io-view">
-        <structname>pg_stat_io</structname></link> for the
+        <structname>pg_stat_io</structname></link> and
+        <link linkend="monitoring-pg-stat-io-histogram-view">
+        <structname>pg_stat_io_histogram</structname></link> for the
         <varname>object</varname> <literal>wal</literal> and in the output of
         the <link linkend="pg-stat-get-backend-io">
         <function>pg_stat_get_backend_io()</function></link> function for the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 08d5b824552..e8c5f391841 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -509,6 +509,17 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
      </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_io_histogram</structname><indexterm><primary>pg_stat_io_histogram</primary></indexterm></entry>
+      <entry>
+       One row for each combination of backend type, context, target object,
+       IO operation type and latency bucket (in microseconds) containing
+       cluster-wide I/O statistics.
+       See <link linkend="monitoring-pg-stat-io-histogram-view">
+       <structname>pg_stat_io_histogram</structname></link> for details.
+     </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_lock</structname><indexterm><primary>pg_stat_lock</primary></indexterm></entry>
       <entry>
@@ -734,6 +745,8 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    Users are advised to use the <productname>PostgreSQL</productname>
    statistics views in combination with operating system utilities for a more
    complete picture of their database's I/O performance.
+   Furthermore the <structname>pg_stat_io_histogram</structname> view can be helpful
+   identifying latency outliers for specific I/O operations.
   </para>
 
  </sect2>
@@ -3302,6 +3315,283 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 
  </sect2>
 
+ <sect2 id="monitoring-pg-stat-io-histogram-view">
+  <title><structname>pg_stat_io_histogram</structname></title>
+
+  <indexterm>
+   <primary>pg_stat_io_histogram</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_stat_io_histogram</structname> view will contain one row for each
+   combination of backend type, target I/O object, and I/O context, IO operation
+   type, bucket latency cluster-wide I/O statistics. Combinations which do not make sense
+   are omitted.
+  </para>
+
+  <para>
+   The view shows measured perceived I/O latency by the backend, not the kernel or device
+   one. This is important distinction when troubleshooting, as the I/O latency observed by
+   the backend might get affected by:
+   <itemizedlist>
+     <listitem>
+        <para>OS scheduler decisions and available CPU resources.</para>
+        <para>With AIO, it might include time to service other IOs from the queue. That will often inflate IO latency.</para>
+        <para>In case of writing, additional filesystem journaling operations.</para>
+     </listitem>
+  </itemizedlist>
+  </para>
+
+  <para>
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
+  </para>
+
+  <table id="pg-stat-io-histogram-view" xreflabel="pg_stat_io_histogram">
+   <title><structname>pg_stat_io_histogram</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        Column Type
+       </para>
+       <para>
+        Description
+       </para>
+      </entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>backend_type</structfield> <type>text</type>
+       </para>
+       <para>
+        Type of backend (e.g. background worker, autovacuum worker). See <link
+        linkend="monitoring-pg-stat-activity-view">
+        <structname>pg_stat_activity</structname></link> for more information
+        on <varname>backend_type</varname>s. Some
+        <varname>backend_type</varname>s do not accumulate I/O operation
+        statistics and will not be included in the view.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>object</structfield> <type>text</type>
+       </para>
+       <para>
+        Target object of an I/O operation. Possible values are:
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>relation</literal>: Permanent relations.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>temp relation</literal>: Temporary relations.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
+       </itemizedlist>
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>context</structfield> <type>text</type>
+       </para>
+       <para>
+        The context of an I/O operation. Possible values are:
+       </para>
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>normal</literal>: The default or standard
+          <varname>context</varname> for a type of I/O operation. For
+          example, by default, relation data is read into and written out from
+          shared buffers. Thus, reads and writes of relation data to and from
+          shared buffers are tracked in <varname>context</varname>
+          <literal>normal</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>vacuum</literal>: I/O operations performed outside of shared
+          buffers while vacuuming and analyzing permanent relations. Temporary
+          table vacuums use the same local buffer pool as other temporary table
+          I/O operations and are tracked in <varname>context</varname>
+          <literal>normal</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>bulkread</literal>: Certain large read I/O operations
+          done outside of shared buffers, for example, a sequential scan of a
+          large table.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>bulkwrite</literal>: Certain large write I/O operations
+          done outside of shared buffers, such as <command>COPY</command>.
+         </para>
+        </listitem>
+       </itemizedlist>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>io_type</structfield> <type>text</type>
+       </para>
+       <para>
+        The type of I/O operation. Possible values are:
+       </para>
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>evict</literal>: eviction from shared buffers cache.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>fsync</literal>: synchronization of modified kernel's
+          filesystem page cache with storage device.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>hit</literal>: shared buffers cache lookup hit.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>reuse</literal>: reuse of existing buffer in case of
+          reusing limited-space ring buffer (applies to <literal>bulkread</literal>,
+          <literal>bulkwrite</literal>, or <literal>vacuum</literal> contexts).
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>writeback</literal>: advise kernel that the described dirty
+          data should be flushed to disk preferably asynchronously.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>extend</literal>: add new zeroed blocks to the end of file.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>read</literal>: self explanatory.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>write</literal>: self explanatory.
+         </para>
+        </listitem>
+       </itemizedlist>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>bucket_latency_us</structfield> <type>int4range</type>
+       </para>
+       <para>
+        The latency bucket (in microseconds).
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>bucket_count</structfield> <type>bigint</type>
+       </para>
+       <para>
+        Number of times latency of the I/O operation hit this specific bucket (with
+        up to <varname>bucket_latency_us</varname> microseconds).
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+       </para>
+       <para>
+        Time at which these statistics were last reset.
+       </para>
+      </entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   Some backend types never perform I/O operations on some I/O objects and/or
+   in some I/O contexts. These rows might display zero bucket counts for such
+   specific operations.
+  </para>
+
+  <para>
+   <structname>pg_stat_io_histogram</structname> can be used to identify
+   I/O storage issues
+   For example:
+   <itemizedlist>
+    <listitem>
+     <para>
+      Presence of abnormally high latency for <varname>fsyncs</varname> might
+      indicate I/O saturation, oversubscription or hardware connectivity issues.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Unusually high latency for <varname>fsyncs</varname> on standby's startup
+      backend type, might be responsible for high duration of commits in
+      synchronous replication setups.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <note>
+   <para>
+    Columns tracking I/O wait time will only be non-zero when
+    <xref linkend="guc-track-io-timing"/> is enabled. The user should be
+    careful when referencing these columns in combination with their
+    corresponding I/O operations in case <varname>track_io_timing</varname>
+    was not enabled for the entire time since the last stats reset.
+   </para>
+  </note>
+ </sect2>
 
  <sect2 id="monitoring-pg-stat-lock-view">
   <title><structname>pg_stat_lock</structname></title>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index c32931edde3..531245935da 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -950,8 +950,9 @@
    of times <function>XLogWrite</function> writes and
    <function>issue_xlog_fsync</function> syncs WAL data to disk are also
    counted as <varname>writes</varname> and <varname>fsyncs</varname>
-   in <structname>pg_stat_io</structname> for the <varname>object</varname>
-   <literal>wal</literal>, respectively.
+   in <structname>pg_stat_io</structname> and
+   <structname>pg_stat_io_histogram</structname> for the
+   <varname>object</varname> <literal>wal</literal>, respectively.
   </para>
 
   <para>
diff --git a/meson.build b/meson.build
index 20b887f1a1b..51058165742 100644
--- a/meson.build
+++ b/meson.build
@@ -2048,6 +2048,7 @@ builtins = [
   'bswap32',
   'bswap64',
   'clz',
+  'clzl',
   'ctz',
   'constant_p',
   'frame_address',
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 73a1c1c4670..a752ab157ba 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1282,6 +1282,17 @@ SELECT
        b.stats_reset
 FROM pg_stat_get_io() b;
 
+CREATE VIEW pg_stat_io_histogram AS
+SELECT
+       b.backend_type,
+       b.object,
+       b.context,
+       b.io_type,
+       b.bucket_latency_us,
+       b.bucket_count,
+       b.stats_reset
+FROM pg_stat_get_io_histogram() b;
+
 CREATE VIEW pg_stat_wal AS
     SELECT
         w.wal_records,
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index b67da88c7dc..9feb2f1370b 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -105,8 +105,10 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "access/xlog.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
+#include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -689,6 +691,14 @@ pgstat_initialize(void)
 	/* Set up a process-exit hook to clean up */
 	before_shmem_exit(pgstat_shutdown_hook, 0);
 
+	/* Allocate I/O latency buckets only if we are going to populate it */
+	if (track_io_timing || track_wal_io_timing)
+		PendingIOStats.pending_hist_time_buckets = MemoryContextAllocZero(TopMemoryContext,
+																		  IOOBJECT_NUM_TYPES * IOCONTEXT_NUM_TYPES * IOOP_NUM_TYPES *
+																		  PGSTAT_IO_HIST_BUCKETS * sizeof(uint64));
+	else
+		PendingIOStats.pending_hist_time_buckets = NULL;
+
 #ifdef USE_ASSERT_CHECKING
 	pgstat_is_initialized = true;
 #endif
@@ -1668,10 +1678,17 @@ pgstat_write_statsfile(void)
 
 		pgstat_build_snapshot_fixed(kind);
 		if (pgstat_is_kind_builtin(kind))
-			ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off;
+		{
+			if (kind == PGSTAT_KIND_IO)
+				ptr = (char *) pgStatLocal.snapshot.io;
+			else
+				ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off;
+		}
 		else
 			ptr = pgStatLocal.snapshot.custom_data[kind - PGSTAT_KIND_CUSTOM_MIN];
 
+		Assert(ptr != NULL);
+
 		fputc(PGSTAT_FILE_ENTRY_FIXED, fpout);
 		pgstat_write_chunk_s(fpout, &kind);
 		pgstat_write_chunk(fpout, ptr, info->shared_data_len);
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index 73461c9bca5..fc1bf824a31 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -168,7 +168,7 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 {
 	PgStatShared_Backend *shbackendent;
 	PgStat_BktypeIO *bktype_shstats;
-	PgStat_PendingIO pending_io;
+	PgStat_BackendPendingIO pending_io;
 
 	/*
 	 * This function can be called even if nothing at all has happened for IO
@@ -205,7 +205,7 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
-	MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_PendingIO));
+	MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_BackendPendingIO));
 
 	backend_has_iostats = false;
 }
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 2be26e92283..bbf910ac4bb 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -17,10 +17,12 @@
 #include "postgres.h"
 
 #include "executor/instrument.h"
+#include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
+#include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
 
-static PgStat_PendingIO PendingIOStats;
+PgStat_PendingIO PendingIOStats;
 static bool have_iostats = false;
 
 /*
@@ -107,6 +109,35 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+#define MIN_PG_STAT_IO_HIST_LATENCY 8191
+static inline int
+get_bucket_index(uint64_t ns)
+{
+	const uint32_t max_index = PGSTAT_IO_HIST_BUCKETS - 1;
+
+	/*
+	 * hopefully pre-calculated by the compiler: clzl(8191) =
+	 * clz(01111111111111b on uint64)
+	 */
+	const uint32_t min_latency_leading_zeros =
+		pg_leading_zero_bits64(MIN_PG_STAT_IO_HIST_LATENCY);
+
+	/*
+	 * make sure the tmp value has at least 8191 (our minimum bucket size) as
+	 * __builtin_clzl might return undefined behavior when operating on 0
+	 */
+	uint64_t	tmp = ns | MIN_PG_STAT_IO_HIST_LATENCY;
+
+	/* count leading zeros */
+	int			leading_zeros = pg_leading_zero_bits64(tmp);
+
+	/* normalize the index */
+	uint32_t	index = min_latency_leading_zeros - leading_zeros;
+
+	/* clamp it to the maximum */
+	return (index > max_index) ? max_index : index;
+}
+
 /*
  * Like pgstat_count_io_op() except it also accumulates time.
  *
@@ -125,6 +156,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 	if (!INSTR_TIME_IS_ZERO(start_time))
 	{
 		instr_time	io_time;
+		int			bucket_index;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
@@ -152,6 +184,16 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
 
+		if (PendingIOStats.pending_hist_time_buckets != NULL)
+		{
+			/*
+			 * calculate the bucket_index based on latency in nanoseconds
+			 * (uint64)
+			 */
+			bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
+			PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+		}
+
 		/* Add the per-backend count */
 		pgstat_count_backend_io_op_time(io_object, io_context, io_op,
 										io_time);
@@ -165,7 +207,7 @@ pgstat_fetch_stat_io(void)
 {
 	pgstat_snapshot_fixed(PGSTAT_KIND_IO);
 
-	return &pgStatLocal.snapshot.io;
+	return pgStatLocal.snapshot.io;
 }
 
 /*
@@ -221,6 +263,11 @@ pgstat_io_flush_cb(bool nowait)
 
 				bktype_shstats->times[io_object][io_context][io_op] +=
 					INSTR_TIME_GET_MICROSEC(time);
+
+				if (PendingIOStats.pending_hist_time_buckets != NULL)
+					for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
+						bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+							PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
 			}
 		}
 	}
@@ -229,7 +276,8 @@ pgstat_io_flush_cb(bool nowait)
 
 	LWLockRelease(bktype_lock);
 
-	memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+	/* Avoid overwriting latency buckets array pointer */
+	memset(&PendingIOStats, 0, offsetof(PgStat_PendingIO, pending_hist_time_buckets));
 
 	have_iostats = false;
 
@@ -274,6 +322,33 @@ pgstat_get_io_object_name(IOObject io_object)
 	pg_unreachable();
 }
 
+const char *
+pgstat_get_io_op_name(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_EVICT:
+			return "evict";
+		case IOOP_FSYNC:
+			return "fsync";
+		case IOOP_HIT:
+			return "hit";
+		case IOOP_REUSE:
+			return "reuse";
+		case IOOP_WRITEBACK:
+			return "writeback";
+		case IOOP_EXTEND:
+			return "extend";
+		case IOOP_READ:
+			return "read";
+		case IOOP_WRITE:
+			return "write";
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
+
 void
 pgstat_io_init_shmem_cb(void *stats)
 {
@@ -281,6 +356,9 @@ pgstat_io_init_shmem_cb(void *stats)
 
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 		LWLockInitialize(&stat_shmem->locks[i], LWTRANCHE_PGSTATS_DATA);
+
+	/* this might end up being lazily allocated in pgstat_io_snapshot_cb() */
+	pgStatLocal.snapshot.io = NULL;
 }
 
 void
@@ -308,11 +386,15 @@ pgstat_io_reset_all_cb(TimestampTz ts)
 void
 pgstat_io_snapshot_cb(void)
 {
+	if (unlikely(pgStatLocal.snapshot.io == NULL))
+		pgStatLocal.snapshot.io = MemoryContextAllocZero(TopMemoryContext,
+														 sizeof(PgStat_IO));
+
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
 		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
-		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io->stats[i];
 
 		LWLockAcquire(bktype_lock, LW_SHARED);
 
@@ -321,7 +403,7 @@ pgstat_io_snapshot_cb(void)
 		 * the reset timestamp as well.
 		 */
 		if (i == 0)
-			pgStatLocal.snapshot.io.stat_reset_timestamp =
+			pgStatLocal.snapshot.io->stat_reset_timestamp =
 				pgStatLocal.shmem->io.stats.stat_reset_timestamp;
 
 		/* using struct assignment due to better type safety */
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 7a9dfa9ba3b..04eab7b3bcd 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -18,6 +18,7 @@
 #include "access/xlog.h"
 #include "access/xlogprefetcher.h"
 #include "catalog/catalog.h"
+#include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -30,6 +31,7 @@
 #include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/rangetypes.h"
 #include "utils/timestamp.h"
 #include "utils/tuplestore.h"
 #include "utils/wait_event.h"
@@ -1638,6 +1640,152 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS)
 	return (Datum) 0;
 }
 
+/*
+* When adding a new column to the pg_stat_io_histogram view and the
+* pg_stat_get_io_histogram() function, add a new enum value here above
+* HIST_IO_NUM_COLUMNS.
+*/
+typedef enum hist_io_stat_col
+{
+	HIST_IO_COL_INVALID = -1,
+	HIST_IO_COL_BACKEND_TYPE,
+	HIST_IO_COL_OBJECT,
+	HIST_IO_COL_CONTEXT,
+	HIST_IO_COL_IOTYPE,
+	HIST_IO_COL_BUCKET_US,
+	HIST_IO_COL_COUNT,
+	HIST_IO_COL_RESET_TIME,
+	HIST_IO_NUM_COLUMNS
+} histogram_io_stat_col;
+
+/*
+ * pg_stat_io_histogram_build_tuples
+ *
+ * Helper routine for pg_stat_get_io_histogram() and pg_stat_get_backend_io()
+ * filling a result tuplestore with one tuple for each object and each
+ * context supported by the caller, based on the contents of bktype_stats.
+ */
+static void
+pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
+								  PgStat_BktypeIO *bktype_stats,
+								  BackendType bktype,
+								  TimestampTz stat_reset_timestamp)
+{
+	/* Get OID for int4range type */
+	Datum		bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+	Oid			range_typid = TypenameGetTypid("int4range");
+	TypeCacheEntry *typcache = lookup_type_cache(range_typid, TYPECACHE_RANGE_INFO);
+
+	for (int io_obj = 0; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+	{
+		const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			const char *context_name = pgstat_get_io_context_name(io_context);
+
+			/*
+			 * Some combinations of BackendType, IOObject, and IOContext are
+			 * not valid for any type of IOOp. In such cases, omit the entire
+			 * row from the view.
+			 */
+			if (!pgstat_tracks_io_object(bktype, io_obj, io_context))
+				continue;
+
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				const char *op_name = pgstat_get_io_op_name(io_op);
+
+				for (int bucket = 0; bucket < PGSTAT_IO_HIST_BUCKETS; bucket++)
+				{
+					Datum		values[HIST_IO_NUM_COLUMNS] = {0};
+					bool		nulls[HIST_IO_NUM_COLUMNS] = {0};
+					RangeBound	lower,
+								upper;
+					RangeType  *range;
+
+					values[HIST_IO_COL_BACKEND_TYPE] = bktype_desc;
+					values[HIST_IO_COL_OBJECT] = CStringGetTextDatum(obj_name);
+					values[HIST_IO_COL_CONTEXT] = CStringGetTextDatum(context_name);
+					values[HIST_IO_COL_IOTYPE] = CStringGetTextDatum(op_name);
+
+					/* bucket's maximum latency as range in microseconds */
+					if (bucket == 0)
+						lower.val = Int32GetDatum(0);
+					else
+						lower.val = Int32GetDatum(1 << (2 + bucket));
+					lower.infinite = false;
+					lower.inclusive = true;
+					lower.lower = true;
+
+					if (bucket == PGSTAT_IO_HIST_BUCKETS - 1)
+						upper.infinite = true;
+					else
+					{
+						upper.val = Int32GetDatum(1 << (2 + bucket + 1));
+						upper.infinite = false;
+					}
+					upper.inclusive = false;
+					upper.lower = false;
+
+					range = make_range(typcache, &lower, &upper, false, NULL);
+					values[HIST_IO_COL_BUCKET_US] = RangeTypePGetDatum(range);
+
+					/* bucket count */
+					values[HIST_IO_COL_COUNT] = Int64GetDatum(
+															  bktype_stats->hist_time_buckets[io_obj][io_context][io_op][bucket]);
+
+					if (stat_reset_timestamp != 0)
+						values[HIST_IO_COL_RESET_TIME] = TimestampTzGetDatum(stat_reset_timestamp);
+					else
+						nulls[HIST_IO_COL_RESET_TIME] = true;
+
+					tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+										 values, nulls);
+				}
+			}
+		}
+	}
+}
+
+Datum
+pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo;
+	PgStat_IO  *backends_io_stats;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	backends_io_stats = pgstat_fetch_stat_io();
+
+	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+		/*
+		 * In Assert builds, we can afford an extra loop through all of the
+		 * counters (in pg_stat_io_build_tuples()), checking that only
+		 * expected stats are non-zero, since it keeps the non-Assert code
+		 * cleaner.
+		 */
+		Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+
+		/*
+		 * For those BackendTypes without IO Operation stats, skip
+		 * representing them in the view altogether.
+		 */
+		if (!pgstat_tracks_io_bktype(bktype))
+			continue;
+
+		/* save tuples with data from this PgStat_BktypeIO */
+		pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, bktype,
+										  backends_io_stats->stat_reset_timestamp);
+	}
+
+	return (Datum) 0;
+}
+
 /*
  * pg_stat_wal_build_tuple
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fa9ae79082b..4b50f80140c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6061,6 +6061,15 @@
   proargnames => '{backend_type,object,context,reads,read_bytes,read_time,writes,write_bytes,write_time,writebacks,writeback_time,extends,extend_bytes,extend_time,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
+{ oid => '6149', descr => 'statistics: per backend type IO latency histogram',
+  proname => 'pg_stat_get_io_histogram', prorows => '30', proretset => 't',
+  provolatile => 'v', proparallel => 'r', prorettype => 'record',
+  proargtypes => '',
+  proallargtypes => '{text,text,text,text,int4range,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,object,context,io_type,bucket_latency_us,bucket_count,stats_reset}',
+  prosrc => 'pg_stat_get_io_histogram' },
+
 { oid => '9375', descr => 'statistics: per lock type statistics',
   proname => 'pg_stat_get_lock', prorows => '10', proretset => 't',
   provolatile => 'v', proparallel => 'r', prorettype => 'record',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index dfa2e837638..34fd93f86dc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -326,11 +326,23 @@ typedef enum IOOp
 	(((unsigned int) (io_op)) < IOOP_NUM_TYPES && \
 	 ((unsigned int) (io_op)) >= IOOP_EXTEND)
 
+/*
+ * This should represent balance between being fast and providing value
+ * to the users:
+ * 1. We want to cover various fast and slow device types (0.01ms - 15ms)
+ * 2. We want to also cover sporadic long tail latencies (hardware issues,
+ *    delayed fsyncs, stuck I/O)
+ * 3. We want to be as small as possible here in terms of size:
+ *    16 * sizeof(uint64) = which should be less than two cachelines.
+ */
+#define PGSTAT_IO_HIST_BUCKETS 16
+
 typedef struct PgStat_BktypeIO
 {
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	uint64		hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_PendingIO
@@ -338,8 +350,18 @@ typedef struct PgStat_PendingIO
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+
+	/*
+	 * Dynamically allocated array of
+	 * [IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES]
+	 * [IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS] only with track_io_timings
+	 * true.
+	 */
+	uint64		(*pending_hist_time_buckets)[IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_PendingIO;
 
+extern PgStat_PendingIO PendingIOStats;
+
 typedef struct PgStat_IO
 {
 	TimestampTz stat_reset_timestamp;
@@ -526,15 +548,24 @@ typedef struct PgStat_Backend
 } PgStat_Backend;
 
 /* ---------
- * PgStat_BackendPending	Non-flushed backend stats.
+ * PgStat_BackendPending(IO)	Non-flushed backend stats.
  * ---------
  */
+typedef struct PgStat_BackendPendingIO
+{
+	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+}			PgStat_BackendPendingIO;
+
 typedef struct PgStat_BackendPending
 {
 	/*
-	 * Backend statistics store the same amount of IO data as PGSTAT_KIND_IO.
+	 * Backend statistics store almost the same amount of IO data as
+	 * PGSTAT_KIND_IO. The only difference between PgStat_BackendPendingIO and
+	 * PgStat_PendingIO is that the latter also track IO latency histograms.
 	 */
-	PgStat_PendingIO pending_io;
+	PgStat_BackendPendingIO pending_io;
 } PgStat_BackendPending;
 
 /*
@@ -624,6 +655,7 @@ extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
+extern const char *pgstat_get_io_op_name(IOOp io_op);
 
 extern bool pgstat_tracks_io_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_object(BackendType bktype,
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 7a00d197013..b27913a2ad8 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -32,6 +32,42 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
 
+
+/*
+ * pg_leading_zero_bits64
+ *		Returns the number of leading 0-bits in x, starting at the most significant bit position.
+ *		Word must not be 0 (as it is undefined behavior).
+ */
+static inline int
+pg_leading_zero_bits64(uint64 word)
+{
+#ifdef HAVE__BUILTIN_CLZL
+	Assert(word != 0);
+
+#if SIZEOF_LONG == 8
+	return __builtin_clzl(word);
+#elif SIZEOF_LONG_LONG == 8
+	return __builtin_clzll(word);
+#else
+#error "cannot find integer type of the same size as uint64_t"
+#endif
+
+#else
+	uint64 y;
+	int n = 64;
+	if (word == 0)
+		return 64;
+
+	y = word >> 32; if (y != 0) { n -= 32; word = y; }
+	y = word >> 16; if (y != 0) { n -= 16; word = y; }
+	y = word >> 8;  if (y != 0) { n -= 8;  word = y; }
+	y = word >> 4;  if (y != 0) { n -= 4;  word = y; }
+	y = word >> 2;  if (y != 0) { n -= 2;  word = y; }
+	y = word >> 1;  if (y != 0) { return n - 2; }
+	return n - 1;
+#endif
+}
+
 /*
  * pg_leftmost_one_pos32
  *		Returns the position of the most significant set bit in "word",
@@ -71,7 +107,7 @@ pg_leftmost_one_pos32(uint32 word)
 static inline int
 pg_leftmost_one_pos64(uint64 word)
 {
-#ifdef HAVE__BUILTIN_CLZ
+#ifdef HAVE__BUILTIN_CLZL
 	Assert(word != 0);
 
 #if SIZEOF_LONG == 8
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index fe463faaf63..a3ce8b04723 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -608,7 +608,7 @@ typedef struct PgStat_Snapshot
 
 	PgStat_CheckpointerStats checkpointer;
 
-	PgStat_IO	io;
+	PgStat_IO  *io;
 
 	PgStat_Lock lock;
 
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index cdc427dbc78..33939c8701a 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -293,7 +293,36 @@ cmp_ok(
 	$wal_restart_immediate->{reset},
 	"$sect: reset timestamp is new");
 
+
+## Test pg_stat_io_histogram that is becoming active due to dynamic memory
+## allocation only for new backends with globally set track_[io|wal_io]_timing
+$sect = "pg_stat_io_histogram";
+$node->append_conf('postgresql.conf', "track_io_timing = 'on'");
+$node->append_conf('postgresql.conf', "track_wal_io_timing = 'on'");
+$node->restart;
+
+
+## Check that pg_stat_io_histograms sees some growing counts in buckets
+## We could also try with checkpointer, but it often runs with fsync=off
+## during test.
+my $countbefore = $node->safe_psql('postgres',
+	"SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram() " .
+	"WHERE backend_type='client backend' AND object='relation' AND context='normal'");
+
+$node->safe_psql('postgres', "CREATE TABLE test_io_hist(id bigint);");
+$node->safe_psql('postgres', "INSERT INTO test_io_hist SELECT generate_series(1, 100) s;");
+$node->safe_psql('postgres', "SELECT pg_stat_force_next_flush();");
+
+my $countafter = $node->safe_psql('postgres',
+	"SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram() " .
+	"WHERE backend_type='client backend' AND object='relation' AND context='normal'");
+
+cmp_ok(
+	$countafter, '>', $countbefore,
+	"pg_stat_io_histogram: latency buckets growing");
+
 $node->stop;
+
 done_testing();
 
 sub trigger_funcrel_stat
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a65a5bf0c4f..c0067cb653b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1967,6 +1967,14 @@ pg_stat_io| SELECT backend_type,
     fsync_time,
     stats_reset
    FROM pg_stat_get_io() b(backend_type, object, context, reads, read_bytes, read_time, writes, write_bytes, write_time, writebacks, writeback_time, extends, extend_bytes, extend_time, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
+pg_stat_io_histogram| SELECT backend_type,
+    object,
+    context,
+    io_type,
+    bucket_latency_us,
+    bucket_count,
+    stats_reset
+   FROM pg_stat_get_io_histogram() b(backend_type, object, context, io_type, bucket_latency_us, bucket_count, stats_reset);
 pg_stat_lock| SELECT locktype,
     waits,
     wait_time,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3ade7c08b6d..7a6b604ce5e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3838,6 +3838,7 @@ gzFile
 having_collation_ctx
 heap_page_items_state
 help_handler
+histogram_io_stat_col
 hlCheck
 hstoreCheckKeyLen_t
 hstoreCheckValLen_t
-- 
2.43.0

From 6db9eb6f11629e3610eaaa2901f847f3238bfb3f Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 8 May 2026 09:19:49 +0200
Subject: [PATCH v10f 3/3] Lower pg_stat_io_histogram shared memory use by
 using array with indirect offsets.

We use pgstat_track_io_*() family of functions to derive the length of static
array that is allocated in shared memory region during startup. As the number
of valid combinations of backend types vs I/O object/context/operations is
coming from semi-runtime pgstat_io_get_sum_tracked() function, it cannot be
preprocessed, so we would need to come up with #define PGSTAT_IO_HIST_BUCKET_SLOTS
somehow. In order to do that - and avoid that C limitations (lack of
constexpr) - we could precalculate (in the build system) the size of
static array and generate .h include that would be included by pgstat.h,
however it appears that would be it hardly cross-portable and hardly
cross-compilable. Instead of doing that, we dynamically allocate shared memory
for IO historgrams during startup.
---
 src/backend/utils/activity/pgstat.c       |  42 +++++-
 src/backend/utils/activity/pgstat_io.c    | 165 +++++++++++++++++++---
 src/backend/utils/activity/pgstat_shmem.c |  15 ++
 src/backend/utils/adt/pgstatfuncs.c       |  20 ++-
 src/include/pgstat.h                      |  26 +++-
 5 files changed, 241 insertions(+), 27 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 7c597932671..0bd59992f4e 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -443,7 +443,13 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
 		.snapshot_ctl_off = offsetof(PgStat_Snapshot, io),
 		.shared_ctl_off = offsetof(PgStat_ShmemControl, io),
 		.shared_data_off = offsetof(PgStatShared_IO, stats),
-		.shared_data_len = sizeof(((PgStatShared_IO *) 0)->stats),
+
+		/*
+		 * Do not write everything using this .shared_data_len, as the IO
+		 * histogram backing store is handled by special-case (as it is
+		 * dynamic) in pgstat_write_statsfile() / pgstat_read_statsfile().
+		 */
+		.shared_data_len = offsetof(PgStat_IO, hist_time_buckets_slot_count),
 
 		.init_backend_cb = pgstat_io_init_backend_cb,
 		.flush_static_cb = pgstat_io_flush_cb,
@@ -1685,6 +1691,21 @@ pgstat_write_statsfile(void)
 		fputc(PGSTAT_FILE_ENTRY_FIXED, fpout);
 		pgstat_write_chunk_s(fpout, &kind);
 		pgstat_write_chunk(fpout, ptr, info->shared_data_len);
+
+		/*
+		 * PGSTAT_KIND_IO has a dynamically-sized histogram that lives outside
+		 * the shared_data_len region. This assumes that PGSTAT_FILE_FORMAT_ID
+		 * would be bumped each time that pgstat_track_io*() logic is altered.
+		 */
+		if (kind == PGSTAT_KIND_IO)
+		{
+			PgStat_IO  *io = pgStatLocal.snapshot.io;
+
+			pgstat_write_chunk(fpout, io->hist_time_buckets_slots,
+							   (size_t) io->hist_time_buckets_slot_count *
+							   PGSTAT_IO_HIST_BUCKETS *
+							   sizeof(uint64));
+		}
 	}
 
 	/*
@@ -1930,6 +1951,25 @@ pgstat_read_statsfile(void)
 						goto error;
 					}
 
+					/*
+					 * PGSTAT_KIND_IO has also semi-dynamic histogram array
+					 * appended after the main chunk. By now, the
+					 * StatsShmemInit() prepared the memory.
+					 */
+					if (kind == PGSTAT_KIND_IO)
+					{
+						PgStat_IO  *io = &shmem->io.stats;
+
+						if (!pgstat_read_chunk(fpin, io->hist_time_buckets_slots,
+											   (size_t) io->hist_time_buckets_slot_count *
+											   PGSTAT_IO_HIST_BUCKETS *
+											   sizeof(uint64)))
+						{
+							elog(WARNING, "could not read pgstat_io histogram backing store");
+							goto error;
+						}
+					}
+
 					break;
 				}
 			case PGSTAT_FILE_ENTRY_HASH:
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 1696f278a77..3cc965638e8 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -20,6 +20,7 @@
 #include "executor/instrument.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
+#include "storage/shmem.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
 
@@ -210,6 +211,8 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		{
 			int			offset;
 
+			Assert(track_io_timing || track_wal_io_timing);
+
 			/*
 			 * calculate the bucket_index based on latency in nanoseconds
 			 * (uint64)
@@ -217,6 +220,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
 
 			offset = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op];
+
+			/* does offset points to valid slot? */
+			Assert(offset >= 0 && offset < PendingIOStats.pending_hist_time_buckets_size);
+
 			PendingIOStats.pending_hist_time_buckets[offset][bucket_index]++;
 		}
 
@@ -258,6 +265,7 @@ pgstat_io_flush_cb(bool nowait)
 {
 	LWLock	   *bktype_lock;
 	PgStat_BktypeIO *bktype_shstats;
+	PgStat_IO  *bk_io;
 
 	if (!have_iostats)
 		return false;
@@ -265,6 +273,7 @@ pgstat_io_flush_cb(bool nowait)
 	bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
 	bktype_shstats =
 		&pgStatLocal.shmem->io.stats.stats[MyBackendType];
+	bk_io = &pgStatLocal.shmem->io.stats;
 
 	if (!nowait)
 		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
@@ -297,16 +306,23 @@ pgstat_io_flush_cb(bool nowait)
 				 * offsets to save memory) into shared memory.
 				 */
 				if (PendingIOStats.pending_hist_time_buckets != NULL)
-					for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
-					{
-						int			pending_off = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op];
+				{
+					int			bktype_shstats_off = bktype_shstats->hist_time_buckets_offsets[io_object][io_context][io_op];
+					int			pending_off = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op];
 
-						if (pending_off != -1)
-						{
-							bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
-								PendingIOStats.pending_hist_time_buckets[pending_off][b];
-						}
-					}
+					Assert(track_io_timing || track_wal_io_timing);
+
+					/*
+					 * -1 means here that such mapping doesn't have a slot
+					 * (based on pgstat_track_io_*()).
+					 */
+					if (bktype_shstats_off == -1 || pending_off == -1)
+						continue;
+
+					for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
+						bk_io->hist_time_buckets_slots[bktype_shstats_off][b] +=
+							PendingIOStats.pending_hist_time_buckets[pending_off][b];
+				}
 			}
 		}
 	}
@@ -415,7 +431,7 @@ pgstat_io_init_backend_cb(void)
 				{
 					if (pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op))
 					{
-						Assert(io_histograms_used <= PendingIOStats.pending_hist_time_buckets_size);
+						Assert(io_histograms_used < PendingIOStats.pending_hist_time_buckets_size);
 
 						PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op] =
 							io_histograms_used++;
@@ -428,12 +444,12 @@ pgstat_io_init_backend_cb(void)
 	}
 	else
 		PendingIOStats.pending_hist_time_buckets = NULL;
-
 }
 
 void
 pgstat_io_init_shmem_cb(void *stats)
 {
+	int			histogram_slots = 0;
 	PgStatShared_IO *stat_shmem = (PgStatShared_IO *) stats;
 
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
@@ -441,26 +457,79 @@ pgstat_io_init_shmem_cb(void *stats)
 
 	/* this might end up being lazily allocated in pgstat_io_snapshot_cb() */
 	pgStatLocal.snapshot.io = NULL;
+
+	/*
+	 * Establish indirect mapping from
+	 * PgStat_BktypeIO.hist_time_buckets_offsets[][][] to
+	 * PgStat_IO.hist_time_buckets_slots[x]
+	 */
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+		{
+			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+			{
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					if (pgstat_tracks_io_op(i, io_object, io_context, io_op))
+					{
+						stat_shmem->stats.stats[i].hist_time_buckets_offsets[io_object][io_context][io_op] =
+							histogram_slots++;
+					}
+					else
+						stat_shmem->stats.stats[i].hist_time_buckets_offsets[io_object][io_context][io_op] =
+							-1;
+				}
+			}
+		}
+	}
+
+	/*
+	 * Sanity check: the offset table we just produced must use exactly the
+	 * number of slots StatsShmemInit() reserved.  Both come from the same
+	 * pgstat_tracks_io_*() rules, so a mismatch would indicate a bug.
+	 */
+	Assert(histogram_slots == stat_shmem->stats.hist_time_buckets_slot_count);
 }
 
 void
 pgstat_io_reset_all_cb(TimestampTz ts)
 {
+	PgStat_IO  *io_stats = &pgStatLocal.shmem->io.stats;
+
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
-		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+		PgStat_BktypeIO *bktype_shstats = &io_stats->stats[i];
 
 		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
 
 		/*
 		 * Use the lock in the first BackendType's PgStat_BktypeIO to protect
-		 * the reset timestamp as well.
+		 * the reset timestamp.
 		 */
 		if (i == 0)
-			pgStatLocal.shmem->io.stats.stat_reset_timestamp = ts;
+			io_stats->stat_reset_timestamp = ts;
+
+		/* Reset this BackendType's histogram slots */
+		for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+		{
+			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+			{
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					int			off = bktype_shstats->hist_time_buckets_offsets[io_object][io_context][io_op];
+
+					if (off == -1)
+						continue;
+					memset(io_stats->hist_time_buckets_slots[off], 0,
+						   sizeof(io_stats->hist_time_buckets_slots[off]));
+				}
+			}
+		}
 
-		memset(bktype_shstats, 0, sizeof(*bktype_shstats));
+		/* Avoid resetting our indirect mapping offsets */
+		memset(bktype_shstats, 0, offsetof(PgStat_BktypeIO, hist_time_buckets_offsets));
 		LWLockRelease(bktype_lock);
 	}
 }
@@ -468,14 +537,30 @@ pgstat_io_reset_all_cb(TimestampTz ts)
 void
 pgstat_io_snapshot_cb(void)
 {
+	PgStat_IO  *shmem_io = &pgStatLocal.shmem->io.stats;
+
 	if (unlikely(pgStatLocal.snapshot.io == NULL))
+	{
+		int			n = shmem_io->hist_time_buckets_slot_count;
+
 		pgStatLocal.snapshot.io = MemoryContextAllocZero(TopMemoryContext,
 														 sizeof(PgStat_IO));
 
+		/*
+		 * Allocated on demand in private (TopMemoryContext) memory and points
+		 * to the same indirect offsets.
+		 */
+		pgStatLocal.snapshot.io->hist_time_buckets_slot_count = n;
+		pgStatLocal.snapshot.io->hist_time_buckets_slots =
+			MemoryContextAllocZero(TopMemoryContext,
+								   (size_t) n * PGSTAT_IO_HIST_BUCKETS *
+								   sizeof(uint64));
+	}
+
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
-		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+		PgStat_BktypeIO *bktype_shstats = &shmem_io->stats[i];
 		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io->stats[i];
 
 		LWLockAcquire(bktype_lock, LW_SHARED);
@@ -486,17 +571,35 @@ pgstat_io_snapshot_cb(void)
 		 */
 		if (i == 0)
 			pgStatLocal.snapshot.io->stat_reset_timestamp =
-				pgStatLocal.shmem->io.stats.stat_reset_timestamp;
+				shmem_io->stat_reset_timestamp;
 
 		/* using struct assignment due to better type safety */
 		*bktype_snap = *bktype_shstats;
+
+		/* Copy this BackendType's histogram slots */
+		for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+		{
+			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+			{
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					int			off = bktype_shstats->hist_time_buckets_offsets[io_object][io_context][io_op];
+
+					if (off == -1)
+						continue;
+					memcpy(pgStatLocal.snapshot.io->hist_time_buckets_slots[off],
+						   shmem_io->hist_time_buckets_slots[off],
+						   sizeof(shmem_io->hist_time_buckets_slots[off]));
+				}
+			}
+		}
+
 		LWLockRelease(bktype_lock);
 	}
 }
 
 /*
 * IO statistics are not collected for all BackendTypes.
-*
 * The following BackendTypes do not participate in the cumulative stats
 * subsystem or do not perform IO on which we currently track:
 * - Dead-end backend because it is not connected to shared memory and
@@ -720,3 +823,29 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 
 	return true;
 }
+
+/*
+ * Total number of tuple of really usable combinations (BackendType, IOObject,
+ * IOContext, IOOp) that we consider trackable.
+ */
+int
+pgstat_io_get_sum_tracked(void)
+{
+	int			sum = 0;
+
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+		sum += pgstat_bktype_count_potentially_used(i);
+
+	return sum;
+}
+
+/*
+ * Returns number of bytes for shared memory required by
+ * PgStat_IO.hist_time_buckets_slots,
+ */
+Size
+pgstat_io_histogram_shmem_size(void)
+{
+	return mul_size(pgstat_io_get_sum_tracked(),
+					PGSTAT_IO_HIST_BUCKETS * sizeof(uint64));
+}
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index b8f354c818a..bb25be106be 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -139,6 +139,12 @@ StatsShmemSize(void)
 	sz = MAXALIGN(sizeof(PgStat_ShmemControl));
 	sz = add_size(sz, pgstat_dsa_init_size());
 
+	/*
+	 * Dynamic allocation for PgStat_IO.hist_time_buckets_slots. Sized from
+	 * the rules in pgstat_tracks_io_*()
+	 */
+	sz = add_size(sz, MAXALIGN(pgstat_io_histogram_shmem_size()));
+
 	/* Add shared memory for all the custom fixed-numbered statistics */
 	for (PgStat_Kind kind = PGSTAT_KIND_CUSTOM_MIN; kind <= PGSTAT_KIND_CUSTOM_MAX; kind++)
 	{
@@ -194,6 +200,15 @@ StatsShmemInit(void *arg)
 							  LWTRANCHE_PGSTATS_DSA, NULL);
 	dsa_pin(dsa);
 
+	/*
+	 * Prepare PgStat_IO.hist_time_buckets_slot* stuff before calling
+	 * pgstat_io_init_shmem_cb(). The additional memory for this was requested
+	 * in the StatsShmemSize() above.
+	 */
+	ctl->io.stats.hist_time_buckets_slot_count = pgstat_io_get_sum_tracked();
+	ctl->io.stats.hist_time_buckets_slots = (uint64 (*)[PGSTAT_IO_HIST_BUCKETS]) p;
+	p += MAXALIGN(pgstat_io_histogram_shmem_size());
+
 	/*
 	 * To ensure dshash is created in "plain" shared memory, temporarily limit
 	 * size of dsa to the initial size of the dsa.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 04eab7b3bcd..f106ca9cd3e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1667,6 +1667,7 @@ typedef enum hist_io_stat_col
  */
 static void
 pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
+								  PgStat_IO *backends_io_stats,
 								  PgStat_BktypeIO *bktype_stats,
 								  BackendType bktype,
 								  TimestampTz stat_reset_timestamp)
@@ -1695,6 +1696,16 @@ pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
 				const char *op_name = pgstat_get_io_op_name(io_op);
+				int			bktype_hist_time_bucket_off;
+
+				/*
+				 * The offset is the same for every histogram bucket of this
+				 * io_obj/io_context/io_op combination.
+				 */
+				bktype_hist_time_bucket_off = bktype_stats->hist_time_buckets_offsets[io_obj][io_context][io_op];
+				if (bktype_hist_time_bucket_off == -1)
+					continue;
+				Assert(bktype_hist_time_bucket_off < backends_io_stats->hist_time_buckets_slot_count);
 
 				for (int bucket = 0; bucket < PGSTAT_IO_HIST_BUCKETS; bucket++)
 				{
@@ -1703,6 +1714,7 @@ pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
 					RangeBound	lower,
 								upper;
 					RangeType  *range;
+					uint64		bktype_bucket;
 
 					values[HIST_IO_COL_BACKEND_TYPE] = bktype_desc;
 					values[HIST_IO_COL_OBJECT] = CStringGetTextDatum(obj_name);
@@ -1731,9 +1743,9 @@ pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
 					range = make_range(typcache, &lower, &upper, false, NULL);
 					values[HIST_IO_COL_BUCKET_US] = RangeTypePGetDatum(range);
 
-					/* bucket count */
-					values[HIST_IO_COL_COUNT] = Int64GetDatum(
-															  bktype_stats->hist_time_buckets[io_obj][io_context][io_op][bucket]);
+					/* get bucket count, access indirectly */
+					bktype_bucket = backends_io_stats->hist_time_buckets_slots[bktype_hist_time_bucket_off][bucket];
+					values[HIST_IO_COL_COUNT] = Int64GetDatum(bktype_bucket);
 
 					if (stat_reset_timestamp != 0)
 						values[HIST_IO_COL_RESET_TIME] = TimestampTzGetDatum(stat_reset_timestamp);
@@ -1779,7 +1791,7 @@ pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
 			continue;
 
 		/* save tuples with data from this PgStat_BktypeIO */
-		pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, bktype,
+		pg_stat_io_histogram_build_tuples(rsinfo, backends_io_stats, bktype_stats, bktype,
 										  backends_io_stats->stat_reset_timestamp);
 	}
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 984914e69b8..de90f1fb5b0 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -20,7 +20,6 @@
 #include "utils/backend_status.h"	/* for backward compatibility */	/* IWYU pragma: export */
 #include "utils/pgstat_kind.h"
 
-
 /* avoid including access/transam.h */
 typedef struct FullTransactionId FullTransactionId;
 
@@ -218,7 +217,7 @@ typedef struct PgStat_TableXactStatus
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BCBC
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BCBD
 
 typedef struct PgStat_ArchiverStats
 {
@@ -342,7 +341,14 @@ typedef struct PgStat_BktypeIO
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
-	uint64		hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
+
+	/*
+	 * Indirect offset to PgStat_IO (parent
+	 * structure).hist_time_buckets_slots. This needs to be the last field due
+	 * to the use of memset(.., offsetof(hist_time_buckets_offsets)) in
+	 * pgstat_io_reset_all_cb().
+	 */
+	int			hist_time_buckets_offsets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_PendingIO
@@ -358,7 +364,7 @@ typedef struct PgStat_PendingIO
 	 * memory.
 	 */
 	uint64		(*pending_hist_time_buckets)[PGSTAT_IO_HIST_BUCKETS];
-	uint64		pending_hist_time_buckets_offsets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	int			pending_hist_time_buckets_offsets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 
 	/*
 	 * Cache how much histograms we have allocated to avoid repetably calling
@@ -374,6 +380,16 @@ typedef struct PgStat_IO
 {
 	TimestampTz stat_reset_timestamp;
 	PgStat_BktypeIO stats[BACKEND_NUM_TYPES];
+
+	/*
+	 * The IO histogram memory is sized at postmaster start from the rules in
+	 * pgstat_tracks_io_*() and persisted by additinal code to handle this
+	 * dynamic (shared) memory pointer in pgstat_write_statsfile() /
+	 * pgstat_read_statsfile(), so they nes are not part of the serialization
+	 * to disk by common code.
+	 */
+	int			hist_time_buckets_slot_count;
+	uint64		(*hist_time_buckets_slots)[PGSTAT_IO_HIST_BUCKETS];
 } PgStat_IO;
 
 typedef struct PgStat_LockEntry
@@ -654,6 +670,8 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 										 BackendType bktype);
 extern int	pgstat_bktype_count_potentially_used(BackendType bktype);
+extern int	pgstat_io_get_sum_tracked(void);
+extern Size pgstat_io_histogram_shmem_size(void);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context,
 							   IOOp io_op, uint32 cnt, uint64 bytes);
 extern instr_time pgstat_prepare_io_time(bool track_io_guc);
-- 
2.43.0

Reply via email to