On Thu, Feb 26, 2026 at 5:13 PM Andres Freund <[email protected]> wrote:
>
> Hi,
Hi Andres,

> On 2026-02-23 13:30:44 +0100, Jakub Wartak wrote:
> > > > but I think having it in PgStat_BktypeIO is not great. This makes
> > > > PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats
> > > > snapshot be half a megabyte bigger for no reason seems too wasteful.
> > >
> > > Yea, that's not awesome.
> >
> > Guys, a question: could you please explain to me what the drawbacks are
> > of having this semi-big (internal-only) stat snapshot of 0.5MB? I'm
> > struggling to understand two things:
> > a) 0.5MB is not a lot these days (ok, my 286 had 1MB back in the day ;))
>
> I don't really agree with that, I guess. And even if I did, it's one thing
> to use 0.5MB when you actually use it, it's quite another when most of
> that memory is never used.
>
> With the patch, *every* backend ends up with a substantially larger
> pgStatLocal. Before:
>
> nm -t d --size-sort -r -S src/backend/postgres|head -n20|less
> (the second column is the decimal size, the third the type of the symbol)
>
> 0000000004131808 0000000000297456 r yy_transition
> ...
> 0000000003916352 0000000000054744 r UnicodeDecompMain
> 0000000021004896 0000000000052824 B pgStatLocal
> 0000000003850592 0000000000040416 r unicode_categories
> ...
>
> after:
> 0000000023220512 0000000000329304 B pgStatLocal
> 0000000018531648 0000000000297456 r yy_transition
> ...
>
> And because pgStatLocal is zero-initialized data, it'll be
> on-demand-allocated in every single backend (whereas e.g. yy_transition is
> read-only shared). So you're not talking about a one-time increase, you're
> multiplying it by the number of active connections.
>
> Now, it's true that most backends won't ever touch pgStatLocal. However,
> most backends will touch Pending[Backend]IOStats, which also increased
> noticeably:
>
> before:
> 0000000021060960 0000000000002880 b PendingIOStats
> 0000000021057792 0000000000002880 b PendingBackendStats
>
> after:
> 0000000023568416 0000000000018240 b PendingIOStats
> 0000000023549888 0000000000018240 b PendingBackendStats
>
> Again, I think some increase here doesn't have to be fatal, but increasing
> it with mostly impossible-to-use memory seems just too much waste to me.
>
> This also increases the shared-memory usage of pgstats: before, it used
> ~300kB on a small system. That nearly doubles with this patch. But that's
> perhaps less concerning, given it's per-system, rather than per-backend,
> memory usage.
>
> > b) how does it affect anything? Because testing shows it doesn't.
>
> Which of your tests would conceivably show the effect? The concern here
> isn't really performance, it's that it increases our memory usage, which
> you'd only see having an effect if you are tight on memory or have a
> workload that is cache sensitive.

OK, now I understand the problem with pgStatLocal properly, thanks for the
detailed explanation! (Though I'm still a little lost in the woods of the
pgstat infrastructure.)

Anyway, I agree that PgStat_IO has become way too big, especially when the
pg_stat_io(_histogram) views aren't actually accessed. How about the
attached v6-0002? It now dynamically allocates the PgStat_IO memory to
avoid that cost (it is only allocated if pgstat_io_snapshot_cb() is used).
Is that the right path? And if so, should it perhaps allocate from the
pgStatLocal.snapshot.context memory context instead?

-J.
From ce2bdfc9789cc69f16b79c45a719bb12d9a4284b Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 23 Jan 2026 08:10:09 +0100
Subject: [PATCH v6 1/2] Add pg_stat_io_histogram view to provide more
 detailed insight into the I/O profile

pg_stat_io_histogram displays a histogram of I/O latencies for each
backend_type, object, context and io_type. The histogram's buckets allow
faster identification of I/O latency outliers caused by faulty hardware
and/or a misbehaving I/O stack. Such I/O outliers, e.g. slow fsyncs, can
cause intermittent issues, e.g. for COMMIT, or affect the performance of
synchronous standbys.

Author: Jakub Wartak <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Ants Aasma <[email protected]>
Discussion: https://postgr.es/m/CAKZiRmwvE4uJLKTgPXeBA4m%2Bd4tTghayoefcaM9%3Dz3_S7i72GA%40mail.gmail.com
---
 configure                              |  38 ++++
 configure.ac                           |   1 +
 doc/src/sgml/config.sgml               |  12 +-
 doc/src/sgml/monitoring.sgml           | 293 ++++++++++++++++++++++++-
 doc/src/sgml/wal.sgml                  |   5 +-
 meson.build                            |   1 +
 src/backend/catalog/system_views.sql   |  11 +
 src/backend/utils/activity/pgstat_io.c |  63 ++++++
 src/backend/utils/adt/pgstatfuncs.c    | 145 ++++++++++++
 src/include/catalog/pg_proc.dat        |   9 +
 src/include/pgstat.h                   |  14 ++
 src/include/port/pg_bitutils.h         |  31 ++-
 src/test/regress/expected/rules.out    |   8 +
 src/test/regress/expected/stats.out    |  23 ++
 src/test/regress/sql/stats.sql         |  15 ++
 src/tools/pgindent/typedefs.list       |   1 +
 16 files changed, 662 insertions(+), 8 deletions(-)

diff --git a/configure b/configure
index a285a6ec3d7..60c460393f7 100755
--- a/configure
+++ b/configure
@@ -15908,6 +15908,44 @@ cat >>confdefs.h <<_ACEOF
 #define HAVE__BUILTIN_CLZ 1
 _ACEOF
 
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clzl" >&5
+$as_echo_n "checking for __builtin_clzl... " >&6; }
+if ${pgac_cv__builtin_clzl+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+call__builtin_clzl(unsigned long x)
+{
+    return __builtin_clzl(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_clzl=yes
+else
+  pgac_cv__builtin_clzl=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_clzl" >&5
+$as_echo "$pgac_cv__builtin_clzl" >&6; }
+if test x"${pgac_cv__builtin_clzl}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_CLZL 1
+_ACEOF
+
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctz" >&5
 $as_echo_n "checking for __builtin_ctz... " >&6; }
diff --git a/configure.ac b/configure.ac
index 476a76c7991..33117f23999 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1866,6 +1866,7 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_bswap32], [int x])
 PGAC_CHECK_BUILTIN_FUNC([__builtin_bswap64], [long int x])
 # We assume that we needn't test all widths of these explicitly:
 PGAC_CHECK_BUILTIN_FUNC([__builtin_clz], [unsigned int x])
+PGAC_CHECK_BUILTIN_FUNC([__builtin_clzl], [unsigned long x])
 PGAC_CHECK_BUILTIN_FUNC([__builtin_ctz], [unsigned int x])
 # __builtin_frame_address may draw a diagnostic for non-constant argument,
 # so it needs a different test function.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f670e2d4c31..04f89362029 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8825,9 +8825,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         displayed in <link linkend="monitoring-pg-stat-database-view">
         <structname>pg_stat_database</structname></link>,
         <link linkend="monitoring-pg-stat-io-view">
-        <structname>pg_stat_io</structname></link> (if <varname>object</varname>
-        is not <literal>wal</literal>), in the output of the
-        <link linkend="pg-stat-get-backend-io">
+        <structname>pg_stat_io</structname></link> and
+        <link linkend="monitoring-pg-stat-io-histogram-view">
+        <structname>pg_stat_io_histogram</structname></link>
+        (if <varname>object</varname> is not <literal>wal</literal>),
+        in the output of the <link linkend="pg-stat-get-backend-io">
         <function>pg_stat_get_backend_io()</function></link> function (if
         <varname>object</varname> is not <literal>wal</literal>), in the output
         of <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal>
@@ -8857,7 +8859,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         measure the overhead of timing on your system.
         I/O timing information is
         displayed in <link linkend="monitoring-pg-stat-io-view">
-        <structname>pg_stat_io</structname></link> for the
+        <structname>pg_stat_io</structname></link> and
+        <link linkend="monitoring-pg-stat-io-histogram-view">
+        <structname>pg_stat_io_histogram</structname></link> for the
         <varname>object</varname> <literal>wal</literal> and in the output of
         the <link linkend="pg-stat-get-backend-io">
         <function>pg_stat_get_backend_io()</function></link> function for the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b77d189a500..d1d699cc693 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -493,6 +493,17 @@ postgres  27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
      </entry>
     </row>
 
+    <row>
+     <entry><structname>pg_stat_io_histogram</structname><indexterm><primary>pg_stat_io_histogram</primary></indexterm></entry>
+     <entry>
+      One row for each combination of backend type, context, target object,
+      I/O operation type and latency bucket (in microseconds), containing
+      cluster-wide I/O statistics.
+      See <link linkend="monitoring-pg-stat-io-histogram-view">
+      <structname>pg_stat_io_histogram</structname></link> for details.
+     </entry>
+    </row>
+
     <row>
      <entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
      <entry>One row per replication slot, showing statistics about the
@@ -690,7 +701,7 @@ postgres  27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The <structname>pg_stat_io</structname> and
-   <structname>pg_statio_</structname> set of views are useful for determining
+   <structname>pg_stat_io_histogram</structname> set of views are useful for determining
    the effectiveness of the buffer cache. They can be used to calculate a cache
    hit ratio. Note that while <productname>PostgreSQL</productname>'s I/O
    statistics capture most instances in which the kernel was invoked in order
@@ -699,6 +710,8 @@ postgres  27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    Users are advised to use the <productname>PostgreSQL</productname>
    statistics views in combination with operating system utilities for a more
    complete picture of their database's I/O performance.
+   Furthermore, the <structname>pg_stat_io_histogram</structname> view can be
+   helpful in identifying latency outliers for specific I/O operations.
   </para>
  </sect2>
 
@@ -3124,6 +3137,284 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 
  </sect2>
 
+ <sect2 id="monitoring-pg-stat-io-histogram-view">
+  <title><structname>pg_stat_io_histogram</structname></title>
+
+  <indexterm>
+   <primary>pg_stat_io_histogram</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_stat_io_histogram</structname> view contains one row for
+   each combination of backend type, target I/O object, I/O context, I/O
+   operation type, and latency bucket, showing cluster-wide I/O statistics.
+   Combinations that do not make sense are omitted.
+  </para>
+
+  <para>
+   The view shows the I/O latency as perceived by the backend, not the latency
+   at the kernel or device level. This is an important distinction when
+   troubleshooting, as the I/O latency observed by the backend may be affected
+   by:
+   <itemizedlist>
+    <listitem>
+     <para>OS scheduler decisions and available CPU resources.</para>
+    </listitem>
+    <listitem>
+     <para>With AIO, the time needed to service other I/Os from the queue,
+     which will often inflate the reported I/O latency.</para>
+    </listitem>
+    <listitem>
+     <para>In the case of writes, additional filesystem journaling
+     operations.</para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <para>
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
+  </para>
+
+  <table id="pg-stat-io-histogram-view" xreflabel="pg_stat_io_histogram">
+   <title><structname>pg_stat_io_histogram</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        Column Type
+       </para>
+       <para>
+        Description
+       </para>
+      </entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>backend_type</structfield> <type>text</type>
+       </para>
+       <para>
+        Type of backend (e.g. background worker, autovacuum worker). See <link
+        linkend="monitoring-pg-stat-activity-view">
+        <structname>pg_stat_activity</structname></link> for more information
+        on <varname>backend_type</varname>s. Some
+        <varname>backend_type</varname>s do not accumulate I/O operation
+        statistics and will not be included in the view.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>object</structfield> <type>text</type>
+       </para>
+       <para>
+        Target object of an I/O operation. Possible values are:
+        <itemizedlist>
+         <listitem>
+          <para>
+           <literal>relation</literal>: Permanent relations.
+          </para>
+         </listitem>
+         <listitem>
+          <para>
+           <literal>temp relation</literal>: Temporary relations.
+          </para>
+         </listitem>
+         <listitem>
+          <para>
+           <literal>wal</literal>: Write Ahead Logs.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>context</structfield> <type>text</type>
+       </para>
+       <para>
+        The context of an I/O operation. Possible values are:
+       </para>
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>normal</literal>: The default or standard
+          <varname>context</varname> for a type of I/O operation. For
+          example, by default, relation data is read into and written out from
+          shared buffers. Thus, reads and writes of relation data to and from
+          shared buffers are tracked in <varname>context</varname>
+          <literal>normal</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>vacuum</literal>: I/O operations performed outside of shared
+          buffers while vacuuming and analyzing permanent relations. Temporary
+          table vacuums use the same local buffer pool as other temporary table
+          I/O operations and are tracked in <varname>context</varname>
+          <literal>normal</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>bulkread</literal>: Certain large read I/O operations
+          done outside of shared buffers, for example, a sequential scan of a
+          large table.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>bulkwrite</literal>: Certain large write I/O operations
+          done outside of shared buffers, such as <command>COPY</command>.
+         </para>
+        </listitem>
+       </itemizedlist>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>io_type</structfield> <type>text</type>
+       </para>
+       <para>
+        The type of I/O operation. Possible values are:
+       </para>
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>evict</literal>: eviction from the shared buffers cache.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>fsync</literal>: synchronization of the kernel's modified
+          filesystem page cache with the storage device.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>hit</literal>: a shared buffers cache lookup hit.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>reuse</literal>: reuse of an existing buffer in a
+          limited-size ring buffer (applies to the <literal>bulkread</literal>,
+          <literal>bulkwrite</literal>, and <literal>vacuum</literal> contexts).
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>writeback</literal>: advising the kernel that the described
+          dirty data should be flushed to disk, preferably asynchronously.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>extend</literal>: adding new zeroed blocks to the end of a
+          file.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>read</literal>: self-explanatory.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>write</literal>: self-explanatory.
+         </para>
+        </listitem>
+       </itemizedlist>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>bucket_latency_us</structfield> <type>int4range</type>
+       </para>
+       <para>
+        The latency bucket (in microseconds).
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>bucket_count</structfield> <type>bigint</type>
+       </para>
+       <para>
+        Number of times the latency of the I/O operation fell into this
+        specific bucket (within the <varname>bucket_latency_us</varname>
+        range).
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+       </para>
+       <para>
+        Time at which these statistics were last reset.
+       </para>
+      </entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   Some backend types never perform I/O operations on some I/O objects and/or
+   in some I/O contexts. Such rows will display zero bucket counts for those
+   specific operations.
+  </para>
+
+  <para>
+   <structname>pg_stat_io_histogram</structname> can be used to identify
+   storage I/O issues. For example:
+   <itemizedlist>
+    <listitem>
+     <para>
+      Abnormally high latency for <literal>fsync</literal> operations might
+      indicate I/O saturation, oversubscription, or hardware connectivity
+      issues.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Unusually high <literal>fsync</literal> latency for a standby's startup
+      backend type might be responsible for long commit durations in
+      synchronous replication setups.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <note>
+   <para>
+    The histogram buckets are only populated when
+    <xref linkend="guc-track-io-timing"/> is enabled. The user should be
+    careful when comparing the bucket counts with the corresponding I/O
+    operation counts in case <varname>track_io_timing</varname> was not
+    enabled for the entire time since the last stats reset.
+   </para>
+  </note>
+ </sect2>
+
  <sect2 id="monitoring-pg-stat-bgwriter-view">
   <title><structname>pg_stat_bgwriter</structname></title>
 
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f3b86b26be9..8b8c407e69f 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -832,8 +832,9 @@
 of times <function>XLogWrite</function> writes and
 <function>issue_xlog_fsync</function> syncs WAL data to disk are also
 counted as <varname>writes</varname> and <varname>fsyncs</varname>
-in <structname>pg_stat_io</structname> for the <varname>object</varname>
-<literal>wal</literal>, respectively.
+in <structname>pg_stat_io</structname> and
+<structname>pg_stat_io_histogram</structname> for the
+<varname>object</varname> <literal>wal</literal>, respectively.
 </para>
 
 <para>
diff --git a/meson.build b/meson.build
index 5122706477d..f906198bb36 100644
--- a/meson.build
+++ b/meson.build
@@ -2058,6 +2058,7 @@ builtins = [
   'bswap32',
   'bswap64',
   'clz',
+  'clzl',
   'ctz',
   'constant_p',
   'frame_address',
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 1ea8f1faa9e..a47adeec5a3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,6 +1240,17 @@ SELECT
        b.stats_reset
 FROM pg_stat_get_io() b;
 
+CREATE VIEW pg_stat_io_histogram AS
+SELECT
+       b.backend_type,
+       b.object,
+       b.context,
+       b.io_type,
+       b.bucket_latency_us,
+       b.bucket_count,
+       b.stats_reset
+FROM pg_stat_get_io_histogram() b;
+
 CREATE VIEW pg_stat_wal AS
 SELECT
        w.wal_records,
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 28de24538dc..148a2a9c7d5 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -17,6 +17,7 @@
 #include "postgres.h"
 
 #include "executor/instrument.h"
+#include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
 
@@ -107,6 +108,32 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+#define MIN_PG_STAT_IO_HIST_LATENCY 8191
+
+static inline int
+get_bucket_index(uint64 ns)
+{
+	const uint32 max_index = PGSTAT_IO_HIST_BUCKETS - 1;
+
+	/*
+	 * Hopefully pre-calculated by the compiler:
+	 * clzl(8191) = clz of 0b1111111111111 in a uint64
+	 */
+	const uint32 min_latency_leading_zeros =
+		pg_leading_zero_bits64(MIN_PG_STAT_IO_HIST_LATENCY);
+
+	/*
+	 * Make sure the tmp value is at least 8191 (our minimum bucket size),
+	 * as __builtin_clzl is undefined when operating on 0.
+	 */
+	uint64		tmp = ns | MIN_PG_STAT_IO_HIST_LATENCY;
+
+	/* count leading zeros */
+	int			leading_zeros = pg_leading_zero_bits64(tmp);
+
+	/* normalize the index */
+	uint32		index = min_latency_leading_zeros - leading_zeros;
+
+	/* clamp it to the maximum */
+	return (index > max_index) ? max_index : index;
+}
+
 /*
  * Like pgstat_count_io_op() except it also accumulates time.
  *
@@ -125,6 +152,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 	if (!INSTR_TIME_IS_ZERO(start_time))
 	{
 		instr_time	io_time;
+		int			bucket_index;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
@@ -152,6 +180,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
 
+		/* calculate the bucket_index based on the latency in nanoseconds (uint64) */
+		bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
+		PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+
 		/* Add the per-backend count */
 		pgstat_count_backend_io_op_time(io_object, io_context, io_op,
 										io_time);
@@ -221,6 +253,10 @@ pgstat_io_flush_cb(bool nowait)
 
 				bktype_shstats->times[io_object][io_context][io_op] +=
 					INSTR_TIME_GET_MICROSEC(time);
+
+				for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
+					bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+						PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
 			}
 		}
 	}
@@ -274,6 +310,33 @@ pgstat_get_io_object_name(IOObject io_object)
 	pg_unreachable();
 }
 
+const char *
+pgstat_get_io_op_name(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_EVICT:
+			return "evict";
+		case IOOP_FSYNC:
+			return "fsync";
+		case IOOP_HIT:
+			return "hit";
+		case IOOP_REUSE:
+			return "reuse";
+		case IOOP_WRITEBACK:
+			return "writeback";
+		case IOOP_EXTEND:
+			return "extend";
+		case IOOP_READ:
+			return "read";
+		case IOOP_WRITE:
+			return "write";
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
+
 void
 pgstat_io_init_shmem_cb(void *stats)
 {
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b1df96e7b0b..ac08ab14195 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -18,6 +18,7 @@
 #include "access/xlog.h"
 #include "access/xlogprefetcher.h"
 #include "catalog/catalog.h"
+#include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -30,6 +31,7 @@
 #include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/rangetypes.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var))))
@@ -1639,6 +1641,149 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS)
 	return (Datum) 0;
 }
 
+/*
+ * When adding a new column to the pg_stat_io_histogram view and the
+ * pg_stat_get_io_histogram() function, add a new enum value here above
+ * HIST_IO_NUM_COLUMNS.
+ */
+typedef enum hist_io_stat_col
+{
+	HIST_IO_COL_INVALID = -1,
+	HIST_IO_COL_BACKEND_TYPE,
+	HIST_IO_COL_OBJECT,
+	HIST_IO_COL_CONTEXT,
+	HIST_IO_COL_IOTYPE,
+	HIST_IO_COL_BUCKET_US,
+	HIST_IO_COL_COUNT,
+	HIST_IO_COL_RESET_TIME,
+	HIST_IO_NUM_COLUMNS
+} histogram_io_stat_col;
+
+/*
+ * pg_stat_io_histogram_build_tuples
+ *
+ * Helper routine for pg_stat_get_io_histogram(), filling a result
+ * tuplestore with one tuple for each object and each context supported by
+ * the caller, based on the contents of bktype_stats.
+ */
+static void
+pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
+								  PgStat_BktypeIO *bktype_stats,
+								  BackendType bktype,
+								  TimestampTz stat_reset_timestamp)
+{
+	Datum		bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+	/* Get the OID of the int4range type, for building the bucket ranges */
+	Oid			range_typid = TypenameGetTypid("int4range");
+	TypeCacheEntry *typcache = lookup_type_cache(range_typid, TYPECACHE_RANGE_INFO);
+
+	for (int io_obj = 0; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+	{
+		const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			const char *context_name = pgstat_get_io_context_name(io_context);
+
+			/*
+			 * Some combinations of BackendType, IOObject, and IOContext are
+			 * not valid for any type of IOOp. In such cases, omit the entire
+			 * row from the view.
+			 */
+			if (!pgstat_tracks_io_object(bktype, io_obj, io_context))
+				continue;
+
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				const char *op_name = pgstat_get_io_op_name(io_op);
+
+				for (int bucket = 0; bucket < PGSTAT_IO_HIST_BUCKETS; bucket++)
+				{
+					Datum		values[HIST_IO_NUM_COLUMNS] = {0};
+					bool		nulls[HIST_IO_NUM_COLUMNS] = {0};
+					RangeBound	lower,
+								upper;
+					RangeType  *range;
+
+					values[HIST_IO_COL_BACKEND_TYPE] = bktype_desc;
+					values[HIST_IO_COL_OBJECT] = CStringGetTextDatum(obj_name);
+					values[HIST_IO_COL_CONTEXT] = CStringGetTextDatum(context_name);
+					values[HIST_IO_COL_IOTYPE] = CStringGetTextDatum(op_name);
+
+					/* the bucket's latency boundaries as a range in microseconds */
+					if (bucket == 0)
+						lower.val = Int32GetDatum(0);
+					else
+						lower.val = Int32GetDatum(1 << (2 + bucket));
+					lower.infinite = false;
+					lower.inclusive = true;
+					lower.lower = true;
+
+					if (bucket == PGSTAT_IO_HIST_BUCKETS - 1)
+						upper.infinite = true;
+					else
+					{
+						upper.val = Int32GetDatum(1 << (2 + bucket + 1));
+						upper.infinite = false;
+					}
+					upper.inclusive = true;
+					upper.lower = false;
+
+					range = make_range(typcache, &lower, &upper, false, NULL);
+					values[HIST_IO_COL_BUCKET_US] = RangeTypePGetDatum(range);
+
+					/* bucket count */
+					values[HIST_IO_COL_COUNT] = Int64GetDatum(
+						bktype_stats->hist_time_buckets[io_obj][io_context][io_op][bucket]);
+
+					if (stat_reset_timestamp != 0)
+						values[HIST_IO_COL_RESET_TIME] = TimestampTzGetDatum(stat_reset_timestamp);
+					else
+						nulls[HIST_IO_COL_RESET_TIME] = true;
+
+					tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+										 values, nulls);
+				}
+			}
+		}
+	}
+}
+
+Datum
+pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo;
+	PgStat_IO  *backends_io_stats;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	backends_io_stats = pgstat_fetch_stat_io();
+
+	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+		/*
+		 * In Assert builds, we can afford an extra loop through all of the
+		 * counters, checking that only expected stats are non-zero, since it
+		 * keeps the non-Assert code cleaner.
+		 */
+		Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+
+		/*
+		 * For those BackendTypes without I/O operation stats, skip
+		 * representing them in the view altogether.
+		 */
+		if (!pgstat_tracks_io_bktype(bktype))
+			continue;
+
+		/* save tuples with data from this PgStat_BktypeIO */
+		pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, bktype,
+										  backends_io_stats->stat_reset_timestamp);
+	}
+
+	return (Datum) 0;
+}
+
 /*
  * pg_stat_wal_build_tuple
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index dac40992cbc..9b417911472 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6027,6 +6027,15 @@
   proargnames => '{backend_type,object,context,reads,read_bytes,read_time,writes,write_bytes,write_time,writebacks,writeback_time,extends,extend_bytes,extend_time,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
+{ oid => '6149', descr => 'statistics: per backend type IO latency histogram',
+  proname => 'pg_stat_get_io_histogram', prorows => '30', proretset => 't',
+  provolatile => 'v', proparallel => 'r', prorettype => 'record',
+  proargtypes => '',
+  proallargtypes => '{text,text,text,text,int4range,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,object,context,io_type,bucket_latency_us,bucket_count,stats_reset}',
+  prosrc => 'pg_stat_get_io_histogram' },
+
 { oid => '6386', descr => 'statistics: backend IO statistics',
   proname => 'pg_stat_get_backend_io', prorows => '5', proretset => 't',
   provolatile => 'v', proparallel => 'r', prorettype => 'record',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0e9d2b4c623..8e06f3e05d2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -326,11 +326,23 @@ typedef enum IOOp
 	(((unsigned int) (io_op)) < IOOP_NUM_TYPES && \
 	 ((unsigned int) (io_op)) >= IOOP_EXTEND)
 
+/*
+ * The number of buckets should strike a balance between being fast and
+ * providing value to users:
+ * 1. We want to cover various fast and slow device types (0.01ms - 15ms).
+ * 2. We also want to cover sporadic long-tail latencies (hardware issues,
+ *    delayed fsyncs, stuck I/O).
+ * 3. We want to be as small as possible here in terms of size:
+ *    16 * sizeof(uint64), which fits in two cachelines.
+ */
+#define PGSTAT_IO_HIST_BUCKETS 16
+
 typedef struct PgStat_BktypeIO
 {
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	uint64		hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_PendingIO
@@ -338,6 +350,7 @@ typedef struct PgStat_PendingIO
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	uint64		pending_hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_PendingIO;
 
 typedef struct PgStat_IO
@@ -610,6 +623,7 @@ extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
+extern const char *pgstat_get_io_op_name(IOOp io_op);
 
 extern bool pgstat_tracks_io_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_object(BackendType bktype,
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 0bca559caaa..4e94257682b 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -32,6 +32,35 @@
 extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
 
+/*
+ * pg_leading_zero_bits64
+ *		Returns the number of leading 0-bits in "word", starting at the most
+ *		significant bit position. "word" must not be 0 (the result of
+ *		__builtin_clzl(0) is undefined).
+ */
+static inline int
+pg_leading_zero_bits64(uint64 word)
+{
+#ifdef HAVE__BUILTIN_CLZL
+	Assert(word != 0);
+
+	return __builtin_clzl(word);
+#else
+	int			n = 64;
+	uint64		y;
+
+	if (word == 0)
+		return 64;
+
+	y = word >> 32; if (y != 0) { n -= 32; word = y; }
+	y = word >> 16; if (y != 0) { n -= 16; word = y; }
+	y = word >> 8;  if (y != 0) { n -= 8;  word = y; }
+	y = word >> 4;  if (y != 0) { n -= 4;  word = y; }
+	y = word >> 2;  if (y != 0) { n -= 2;  word = y; }
+	y = word >> 1;  if (y != 0) return n - 2;
+	return n - 1;
+#endif
+}
+
 /*
  * pg_leftmost_one_pos32
  *		Returns the position of the most significant set bit in "word",
@@ -71,7 +100,7 @@ pg_leftmost_one_pos32(uint32 word)
 static inline int
 pg_leftmost_one_pos64(uint64 word)
 {
-#ifdef HAVE__BUILTIN_CLZ
+#ifdef HAVE__BUILTIN_CLZL
 	Assert(word != 0);
 
 #if SIZEOF_LONG == 8
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 78a37d9fc8f..4288ff5a8fd 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1951,6 +1951,14 @@ pg_stat_io| SELECT backend_type,
     fsync_time,
     stats_reset
    FROM pg_stat_get_io() b(backend_type, object, context, reads, read_bytes, read_time, writes, write_bytes, write_time, writebacks, writeback_time, extends, extend_bytes, extend_time, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
+pg_stat_io_histogram| SELECT backend_type,
+    object,
+    context,
+    io_type,
+    bucket_latency_us,
+    bucket_count,
+    stats_reset
+   FROM pg_stat_get_io_histogram() b(backend_type, object, context, io_type, bucket_latency_us, bucket_count, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index cd00f35bf7a..4c95f09d651 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1765,6 +1765,29 @@ SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
  t
(1 row)
 
+-- Check that pg_stat_io_histogram sees some growing counts in the buckets.
+-- We could also try with the checkpointer, but it often runs with fsync=off
+-- during tests.
+SET track_io_timing TO 'on';
+SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+CREATE TABLE test_io_hist(id bigint);
+INSERT INTO test_io_hist SELECT generate_series(1, 100) s;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT sum(bucket_count) AS hist_bucket_count_sum2 FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+SELECT :hist_bucket_count_sum2 > :hist_bucket_count_sum;
+ ?column? 
+----------
+ t
+(1 row)
+
+RESET track_io_timing;
 -- Check invalid input for pg_stat_get_backend_io()
 SELECT pg_stat_get_backend_io(NULL);
  pg_stat_get_backend_io 
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 8768e0f27fd..063b1011d7e 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -841,6 +841,21 @@ SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) +
   FROM pg_stat_get_backend_io(pg_backend_pid()) \gset
 SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
 
+-- Check that pg_stat_io_histogram sees some growing counts in the buckets.
+-- We could also try with the checkpointer, but it often runs with fsync=off
+-- during tests.
+SET track_io_timing TO 'on';
+SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+CREATE TABLE test_io_hist(id bigint);
+INSERT INTO test_io_hist SELECT generate_series(1, 100) s;
+SELECT pg_stat_force_next_flush();
+SELECT sum(bucket_count) AS hist_bucket_count_sum2 FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+SELECT :hist_bucket_count_sum2 > :hist_bucket_count_sum;
+RESET track_io_timing;
+
 -- Check invalid input for pg_stat_get_backend_io()
 SELECT pg_stat_get_backend_io(NULL);
 SELECT pg_stat_get_backend_io(0);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 77e3c04144e..15c16db8793 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3758,6 +3758,7 @@ gtrgm_consistent_cache
 gzFile
 heap_page_items_state
 help_handler
+histogram_io_stat_col
 hlCheck
 hstoreCheckKeyLen_t
 hstoreCheckValLen_t
-- 
2.43.0
From 50517c03c8f118e14bb31d06e4cdf7627866a668 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Mon, 2 Mar 2026 08:20:09 +0100
Subject: [PATCH v6 2/2] Convert PgStat_IO to pointer to avoid huge static
 memory allocation if not used

---
 src/backend/utils/activity/pgstat.c    |  9 ++++++++-
 src/backend/utils/activity/pgstat_io.c | 15 ++++++++++++---
 src/include/utils/pgstat_internal.h    |  2 +-
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 11bb71cad5a..62a950987de 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1634,10 +1634,17 @@ pgstat_write_statsfile(void)
 		pgstat_build_snapshot_fixed(kind);
 
 		if (pgstat_is_kind_builtin(kind))
-			ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off;
+		{
+			if (kind == PGSTAT_KIND_IO)
+				ptr = (char *) pgStatLocal.snapshot.io;
+			else
+				ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off;
+		}
 		else
 			ptr = pgStatLocal.snapshot.custom_data[kind - PGSTAT_KIND_CUSTOM_MIN];
 
+		Assert(ptr != NULL);
+
 		fputc(PGSTAT_FILE_ENTRY_FIXED, fpout);
 		pgstat_write_chunk_s(fpout, &kind);
 		pgstat_write_chunk(fpout, ptr, info->shared_data_len);
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 148a2a9c7d5..b5c75e67781 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -19,6 +19,7 @@
 #include "executor/instrument.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
+#include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
 
 static PgStat_PendingIO PendingIOStats;
@@ -197,7 +198,7 @@ pgstat_fetch_stat_io(void)
 {
 	pgstat_snapshot_fixed(PGSTAT_KIND_IO);
 
-	return &pgStatLocal.snapshot.io;
+	return pgStatLocal.snapshot.io;
 }
 
 /*
@@ -344,6 +345,9 @@ pgstat_io_init_shmem_cb(void *stats)
 
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 		LWLockInitialize(&stat_shmem->locks[i], LWTRANCHE_PGSTATS_DATA);
+
+	/* this might end up being lazily allocated in pgstat_io_snapshot_cb() */
+	pgStatLocal.snapshot.io = NULL;
 }
 
 void
@@ -371,11 +375,16 @@ pgstat_io_reset_all_cb(TimestampTz ts)
 void
 pgstat_io_snapshot_cb(void)
 {
+	/* allocate memory only when necessary, on first use */
+	if (unlikely(pgStatLocal.snapshot.io == NULL))
+		pgStatLocal.snapshot.io = MemoryContextAllocZero(TopMemoryContext,
+														 sizeof(PgStat_IO));
+
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
 		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
-		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io->stats[i];
 
 		LWLockAcquire(bktype_lock, LW_SHARED);
 
@@ -384,7 +393,7 @@ pgstat_io_snapshot_cb(void)
 		 * the reset timestamp as well.
 		 */
 		if (i == 0)
-			pgStatLocal.snapshot.io.stat_reset_timestamp =
+			pgStatLocal.snapshot.io->stat_reset_timestamp =
 				pgStatLocal.shmem->io.stats.stat_reset_timestamp;
 
 		/* using struct assignment due to better type safety */
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9b8fbae00ed..407657e060c 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -600,7 +600,7 @@ typedef struct PgStat_Snapshot
 
 	PgStat_CheckpointerStats checkpointer;
 
-	PgStat_IO	io;
+	PgStat_IO  *io;
 
 	PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
 
-- 
2.43.0
