On Fri, May 8, 2026 at 11:57 AM Jakub Wartak <[email protected]> wrote: > > On Thu, Mar 19, 2026 at 11:16 AM Jakub Wartak > <[email protected]> wrote: > > > > On Wed, Mar 18, 2026 at 2:29 PM Jakub Wartak > > <[email protected]> wrote: > > > > > > On Tue, Mar 17, 2026 at 3:17 PM Andres Freund <[email protected]> wrote: > > > > On 2026-03-17 13:13:59 +0100, Jakub Wartak wrote: > > > > > 1. Concerns about memory use. With v7 I had couple of ideas, and with > > > > > those > > > > > the memory use is really minimized as long as the code is still simple > > > > > (so nothing fancy, just some ideas to trim stuff and dynamically > > > > > allocate > > > > > memory). I hope those reduce memory footprint to acceptable levels, > > > > > see my > > > > > earlier description for v7. > > > > > > > > Personally I unfortunately continue to think that storing lots of > > > > values that > > > > are never anything but zero isn't a good idea once you have more than a > > > > handful of kB. Storing pointless data is something different than > > > > increasing > > > > memory usage with actual information. > > > > > > > > I still think you should just count the number of histograms needed, > > > > have an > > > > array [object][context][op] with the associated histogram "offset" and > > > > then > > > > increment the associated offset. It'll add an indirection at count > > > > time, but > > > > no additional branches. > > > > > > Great idea, thanks, I haven't thought about that! Attached v9 attempts to > > > do > > > that for pending backend I/O struct, which minimizes the (backend) memory > > > footprint for client backends to just about ~5kB. > > > > > > I have been pulling my hair trying to achieve the same for shared-memory, > > > but I > > > have failed to do that w/o sinking into complexity [..] > > > > OK, I've made it done too with indirect offset on shared memory, it wasn't > > easy > > at least for me, but now we have two approaches/patchsets: > > > [..] > > v9b: with more code and build complexity but that should address concern of > > not > > used memory > > > > 'Shared Memory Stats' allocated size: > > master - uses ~308kB for shm > > v9a-000[12]: 578kB shm > > v9a-000[123]: 507kB shm > > v9a-000[1234]: 471kB shm (+~163kB more) > > > > v9b-000[123]: 361kB shm > > > > v9a-000[12] are identical to v9b-00[12], but included just for > > patchset completeness. > > > > In v9b meson/autoconf (for adding pgstat_io_genstats) build most of > > the time what > > they need, but probably that needs some cleanups and better dependency > > tracking. I'm > > not sure about correctnes of those changes as especially > > autoconf/Makefile is a lot > > like brainf**k to me and that area would need some help... > > > > I think now we could even increase max resolution of buckets to cover > > max those maximum > > of 32s+ (at the cost of one extra 64-byte cacheline for pending IO > > stats, so go with > > PGSTAT_IO_HIST_BUCKETS from 16 to 24) > > Good morning all, > > Ok here comes v10, which is bit like earlier v9b (so has reduced shared memory > footprint using Yours idea about indirect offsets idea), but now with shm > memory > sized and allocated on startup by postmaster. There are 3 patches: > - 0001, one to introduce view and bucketting, no changes since quite some time > - 0002, saves some private (backend) memory > - 0003, main meat, saving shared memory (main problem raised earlier), > now switched > to simply dynamically size shared memory based on those pgstat_track_io*() > logic > > The problem with the 0003 earlier was that I wanted to absolutley avoid > further > complexiy/alterations in struct PgStat_IO related to dynamic shared memory > allocation for hist_time_buckets_slots[PGSTAT_IO_HIST_BUCKET_SLOTS] > [PGSTAT_IO_HIST_BUCKETS] (I was afraid to touch that shm code, it > looks complex), > so I had to come out with something that would tell us how many slots > (PGSTAT_IO_HIST_BUCKET_SLOTS) we need, I wish we had C++'s `constexpr` that > would do all of that. I've tried three aproaches (like in v9b but that hit > some serious cross-compiling obstacles, also had perl doing that, but that > had lots of code duplication), so in the end I had to alter the pgstat_io > shm allocation which is now in 0003. > > Summary of changes in 0003 since v9b / earlier post: > - Fixed potential race condition (touch via memset/memcpy() only histogram > slots under LWLock) > - Fixed/removed the PGSTAT_IO_HIST_BUCKET_SLOTS macro > - Removed pgstat_io_genslots.c (first idea, above) and abandonded attempt to > fixup some cross compilation woes on MSVC/mingw > - Bumped PGSTAT_FILE_FORMAT_ID > - Move/optimize pending_off in pgstat_io_flush_cb out of hot loop > - Document that hist_time_buckets_offsets should be the last member of > PgStat_BktypeIO > - Be defensive - added some asserts() > - Adjust _bucket_offsets from uint64 to just int to save memory (offsets are > low > numbers) > - and finally moved to dynamic shm allocation of PgStat_IO stuff during > startup > > At the end of the day, I'll squeze 000[123] into just one, but wanted > to ease the > review first a bit. Of course this is material for PG20.
Just noticed it needed a rebase (due to c7cb8e5b73c6; renumber_oids.pl), so v11 attached before I forget. -J.
From cb29b625be435f5fab3c8f2f19ab81ae170f3bfc Mon Sep 17 00:00:00 2001 From: Jakub Wartak <[email protected]> Date: Fri, 23 Jan 2026 08:10:09 +0100 Subject: [PATCH v11 1/3] Add pg_stat_io_histogram view to provide more detailed insight into IO profile pg_stat_io_histogram displays a histogram of IO latencies for specific backend_type, object, context and io_type. The histogram has buckets that allow faster identification of I/O latency outliers due to faulty hardware and/or misbehaving I/O stack. Such I/O outliers e.g. slow fsyncs could sometimes cause intermittent issues e.g. for COMMIT or affect the synchronous standbys performance. Author: Jakub Wartak <[email protected]> Reviewed-by: Andres Freund <[email protected]> Reviewed-by: Ants Aasma <[email protected]> Reviewed-by: Tomas Vondra <[email protected]> Discussion: https://postgr.es/m/CAKZiRmwvE4uJLKTgPXeBA4m%2Bd4tTghayoefcaM9%3Dz3_S7i72GA%40mail.gmail.com --- configure | 38 +++ configure.ac | 1 + doc/src/sgml/config.sgml | 12 +- doc/src/sgml/monitoring.sgml | 290 ++++++++++++++++++++ doc/src/sgml/wal.sgml | 5 +- meson.build | 1 + src/backend/catalog/system_views.sql | 11 + src/backend/utils/activity/pgstat.c | 19 +- src/backend/utils/activity/pgstat_backend.c | 4 +- src/backend/utils/activity/pgstat_io.c | 92 ++++++- src/backend/utils/adt/pgstatfuncs.c | 148 ++++++++++ src/include/catalog/pg_proc.dat | 9 + src/include/pgstat.h | 38 ++- src/include/port/pg_bitutils.h | 38 ++- src/include/utils/pgstat_internal.h | 2 +- src/test/recovery/t/029_stats_restart.pl | 29 ++ src/test/regress/expected/rules.out | 8 + src/tools/pgindent/typedefs.list | 1 + 18 files changed, 727 insertions(+), 19 deletions(-) diff --git a/configure b/configure index f66c1054a7a..c09329240be 100755 --- a/configure +++ b/configure @@ -16054,6 +16054,44 @@ cat >>confdefs.h <<_ACEOF #define HAVE__BUILTIN_CLZ 1 _ACEOF +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clzl" >&5 +$as_echo_n "checking for __builtin_clzl... " >&6; } +if ${pgac_cv__builtin_clzl+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +int +call__builtin_clzl(unsigned long x) +{ + return __builtin_clzl(x); +} +int +main () +{ + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv__builtin_clzl=yes +else + pgac_cv__builtin_clzl=no +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_clzl" >&5 +$as_echo "$pgac_cv__builtin_clzl" >&6; } +if test x"${pgac_cv__builtin_clzl}" = xyes ; then + +cat >>confdefs.h <<_ACEOF +#define HAVE__BUILTIN_CLZL 1 +_ACEOF + fi { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctz" >&5 $as_echo_n "checking for __builtin_ctz... " >&6; } diff --git a/configure.ac b/configure.ac index 8d176bd3468..8f804464bc5 100644 --- a/configure.ac +++ b/configure.ac @@ -1881,6 +1881,7 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_bswap32], [int x]) PGAC_CHECK_BUILTIN_FUNC([__builtin_bswap64], [long int x]) # We assume that we needn't test all widths of these explicitly: PGAC_CHECK_BUILTIN_FUNC([__builtin_clz], [unsigned int x]) +PGAC_CHECK_BUILTIN_FUNC([__builtin_clzl], [unsigned long x]) PGAC_CHECK_BUILTIN_FUNC([__builtin_ctz], [unsigned int x]) # __builtin_frame_address may draw a diagnostic for non-constant argument, # so it needs a different test function. diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 73cc0412330..91b1fd7e635 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -9067,9 +9067,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; displayed in <link linkend="monitoring-pg-stat-database-view"> <structname>pg_stat_database</structname></link>, <link linkend="monitoring-pg-stat-io-view"> - <structname>pg_stat_io</structname></link> (if <varname>object</varname> - is not <literal>wal</literal>), in the output of the - <link linkend="pg-stat-get-backend-io"> + <structname>pg_stat_io</structname></link> and + <link linkend="monitoring-pg-stat-io-histogram-view"> + <structname>pg_stat_io_histogram</structname></link> + (if <varname>object</varname> is not <literal>wal</literal>), + in the output of the <link linkend="pg-stat-get-backend-io"> <function>pg_stat_get_backend_io()</function></link> function (if <varname>object</varname> is not <literal>wal</literal>), in the output of <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal> @@ -9099,7 +9101,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; measure the overhead of timing on your system. I/O timing information is displayed in <link linkend="monitoring-pg-stat-io-view"> - <structname>pg_stat_io</structname></link> for the + <structname>pg_stat_io</structname></link> and + <link linkend="monitoring-pg-stat-io-histogram-view"> + <structname>pg_stat_io_histogram</structname></link> for the <varname>object</varname> <literal>wal</literal> and in the output of the <link linkend="pg-stat-get-backend-io"> <function>pg_stat_get_backend_io()</function></link> function for the diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 08d5b824552..e8c5f391841 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -509,6 +509,17 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser </entry> </row> + <row> + <entry><structname>pg_stat_io_histogram</structname><indexterm><primary>pg_stat_io_histogram</primary></indexterm></entry> + <entry> + One row for each combination of backend type, context, target object, + IO operation type and latency bucket (in microseconds) containing + cluster-wide I/O statistics. + See <link linkend="monitoring-pg-stat-io-histogram-view"> + <structname>pg_stat_io_histogram</structname></link> for details. + </entry> + </row> + <row> <entry><structname>pg_stat_lock</structname><indexterm><primary>pg_stat_lock</primary></indexterm></entry> <entry> @@ -734,6 +745,8 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser Users are advised to use the <productname>PostgreSQL</productname> statistics views in combination with operating system utilities for a more complete picture of their database's I/O performance. + Furthermore the <structname>pg_stat_io_histogram</structname> view can be helpful + identifying latency outliers for specific I/O operations. </para> </sect2> @@ -3302,6 +3315,283 @@ description | Waiting for a newly initialized WAL file to reach durable storage </sect2> + <sect2 id="monitoring-pg-stat-io-histogram-view"> + <title><structname>pg_stat_io_histogram</structname></title> + + <indexterm> + <primary>pg_stat_io_histogram</primary> + </indexterm> + + <para> + The <structname>pg_stat_io_histogram</structname> view will contain one row for each + combination of backend type, target I/O object, and I/O context, IO operation + type, bucket latency cluster-wide I/O statistics. Combinations which do not make sense + are omitted. + </para> + + <para> + The view shows measured perceived I/O latency by the backend, not the kernel or device + one. This is important distinction when troubleshooting, as the I/O latency observed by + the backend might get affected by: + <itemizedlist> + <listitem> + <para>OS scheduler decisions and available CPU resources.</para> + <para>With AIO, it might include time to service other IOs from the queue. That will often inflate IO latency.</para> + <para>In case of writing, additional filesystem journaling operations.</para> + </listitem> + </itemizedlist> + </para> + + <para> + Currently, I/O on relations (e.g. tables, indexes) and WAL activity are + tracked. However, relation I/O which bypasses shared buffers + (e.g. when moving a table from one tablespace to another) is currently + not tracked. + </para> + + <table id="pg-stat-io-histogram-view" xreflabel="pg_stat_io_histogram"> + <title><structname>pg_stat_io_histogram</structname> View</title> + <tgroup cols="1"> + <thead> + <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + Column Type + </para> + <para> + Description + </para> + </entry> + </row> + </thead> + <tbody> + <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + <structfield>backend_type</structfield> <type>text</type> + </para> + <para> + Type of backend (e.g. background worker, autovacuum worker). See <link + linkend="monitoring-pg-stat-activity-view"> + <structname>pg_stat_activity</structname></link> for more information + on <varname>backend_type</varname>s. Some + <varname>backend_type</varname>s do not accumulate I/O operation + statistics and will not be included in the view. + </para> + </entry> + </row> + + <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + <structfield>object</structfield> <type>text</type> + </para> + <para> + Target object of an I/O operation. Possible values are: + <itemizedlist> + <listitem> + <para> + <literal>relation</literal>: Permanent relations. + </para> + </listitem> + <listitem> + <para> + <literal>temp relation</literal>: Temporary relations. + </para> + </listitem> + <listitem> + <para> + <literal>wal</literal>: Write Ahead Logs. + </para> + </listitem> + </itemizedlist> + </para> + </entry> + </row> + + <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + <structfield>context</structfield> <type>text</type> + </para> + <para> + The context of an I/O operation. Possible values are: + </para> + <itemizedlist> + <listitem> + <para> + <literal>normal</literal>: The default or standard + <varname>context</varname> for a type of I/O operation. For + example, by default, relation data is read into and written out from + shared buffers. Thus, reads and writes of relation data to and from + shared buffers are tracked in <varname>context</varname> + <literal>normal</literal>. + </para> + </listitem> + <listitem> + <para> + <literal>init</literal>: I/O operations performed while creating the + WAL segments are tracked in <varname>context</varname> + <literal>init</literal>. + </para> + </listitem> + <listitem> + <para> + <literal>vacuum</literal>: I/O operations performed outside of shared + buffers while vacuuming and analyzing permanent relations. Temporary + table vacuums use the same local buffer pool as other temporary table + I/O operations and are tracked in <varname>context</varname> + <literal>normal</literal>. + </para> + </listitem> + <listitem> + <para> + <literal>bulkread</literal>: Certain large read I/O operations + done outside of shared buffers, for example, a sequential scan of a + large table. + </para> + </listitem> + <listitem> + <para> + <literal>bulkwrite</literal>: Certain large write I/O operations + done outside of shared buffers, such as <command>COPY</command>. + </para> + </listitem> + </itemizedlist> + </entry> + </row> + + <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + <structfield>io_type</structfield> <type>text</type> + </para> + <para> + The type of I/O operation. Possible values are: + </para> + <itemizedlist> + <listitem> + <para> + <literal>evict</literal>: eviction from shared buffers cache. + </para> + </listitem> + <listitem> + <para> + <literal>fsync</literal>: synchronization of modified kernel's + filesystem page cache with storage device. + </para> + </listitem> + <listitem> + <para> + <literal>hit</literal>: shared buffers cache lookup hit. + </para> + </listitem> + <listitem> + <para> + <literal>reuse</literal>: reuse of existing buffer in case of + reusing limited-space ring buffer (applies to <literal>bulkread</literal>, + <literal>bulkwrite</literal>, or <literal>vacuum</literal> contexts). + </para> + </listitem> + <listitem> + <para> + <literal>writeback</literal>: advise kernel that the described dirty + data should be flushed to disk preferably asynchronously. + </para> + </listitem> + <listitem> + <para> + <literal>extend</literal>: add new zeroed blocks to the end of file. + </para> + </listitem> + <listitem> + <para> + <literal>read</literal>: self explanatory. + </para> + </listitem> + <listitem> + <para> + <literal>write</literal>: self explanatory. + </para> + </listitem> + </itemizedlist> + </entry> + </row> + + <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + <structfield>bucket_latency_us</structfield> <type>int4range</type> + </para> + <para> + The latency bucket (in microseconds). + </para> + </entry> + </row> + + <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + <structfield>bucket_count</structfield> <type>bigint</type> + </para> + <para> + Number of times latency of the I/O operation hit this specific bucket (with + up to <varname>bucket_latency_us</varname> microseconds). + </para> + </entry> + </row> + + <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + <structfield>stats_reset</structfield> <type>timestamp with time zone</type> + </para> + <para> + Time at which these statistics were last reset. + </para> + </entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + Some backend types never perform I/O operations on some I/O objects and/or + in some I/O contexts. These rows might display zero bucket counts for such + specific operations. + </para> + + <para> + <structname>pg_stat_io_histogram</structname> can be used to identify + I/O storage issues + For example: + <itemizedlist> + <listitem> + <para> + Presence of abnormally high latency for <varname>fsyncs</varname> might + indicate I/O saturation, oversubscription or hardware connectivity issues. + </para> + </listitem> + <listitem> + <para> + Unusually high latency for <varname>fsyncs</varname> on standby's startup + backend type, might be responsible for high duration of commits in + synchronous replication setups. + </para> + </listitem> + </itemizedlist> + </para> + + <note> + <para> + Columns tracking I/O wait time will only be non-zero when + <xref linkend="guc-track-io-timing"/> is enabled. The user should be + careful when referencing these columns in combination with their + corresponding I/O operations in case <varname>track_io_timing</varname> + was not enabled for the entire time since the last stats reset. + </para> + </note> + </sect2> <sect2 id="monitoring-pg-stat-lock-view"> <title><structname>pg_stat_lock</structname></title> diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml index c32931edde3..531245935da 100644 --- a/doc/src/sgml/wal.sgml +++ b/doc/src/sgml/wal.sgml @@ -950,8 +950,9 @@ of times <function>XLogWrite</function> writes and <function>issue_xlog_fsync</function> syncs WAL data to disk are also counted as <varname>writes</varname> and <varname>fsyncs</varname> - in <structname>pg_stat_io</structname> for the <varname>object</varname> - <literal>wal</literal>, respectively. + in <structname>pg_stat_io</structname> and + <structname>pg_stat_io_histogram</structname> for the + <varname>object</varname> <literal>wal</literal>, respectively. </para> <para> diff --git a/meson.build b/meson.build index 20b887f1a1b..51058165742 100644 --- a/meson.build +++ b/meson.build @@ -2048,6 +2048,7 @@ builtins = [ 'bswap32', 'bswap64', 'clz', + 'clzl', 'ctz', 'constant_p', 'frame_address', diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 73a1c1c4670..a752ab157ba 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1282,6 +1282,17 @@ SELECT b.stats_reset FROM pg_stat_get_io() b; +CREATE VIEW pg_stat_io_histogram AS +SELECT + b.backend_type, + b.object, + b.context, + b.io_type, + b.bucket_latency_us, + b.bucket_count, + b.stats_reset +FROM pg_stat_get_io_histogram() b; + CREATE VIEW pg_stat_wal AS SELECT w.wal_records, diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index b67da88c7dc..9feb2f1370b 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -105,8 +105,10 @@ #include <unistd.h> #include "access/xact.h" +#include "access/xlog.h" #include "lib/dshash.h" #include "pgstat.h" +#include "storage/bufmgr.h" #include "storage/fd.h" #include "storage/ipc.h" #include "storage/lwlock.h" @@ -689,6 +691,14 @@ pgstat_initialize(void) /* Set up a process-exit hook to clean up */ before_shmem_exit(pgstat_shutdown_hook, 0); + /* Allocate I/O latency buckets only if we are going to populate it */ + if (track_io_timing || track_wal_io_timing) + PendingIOStats.pending_hist_time_buckets = MemoryContextAllocZero(TopMemoryContext, + IOOBJECT_NUM_TYPES * IOCONTEXT_NUM_TYPES * IOOP_NUM_TYPES * + PGSTAT_IO_HIST_BUCKETS * sizeof(uint64)); + else + PendingIOStats.pending_hist_time_buckets = NULL; + #ifdef USE_ASSERT_CHECKING pgstat_is_initialized = true; #endif @@ -1668,10 +1678,17 @@ pgstat_write_statsfile(void) pgstat_build_snapshot_fixed(kind); if (pgstat_is_kind_builtin(kind)) - ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off; + { + if (kind == PGSTAT_KIND_IO) + ptr = (char *) pgStatLocal.snapshot.io; + else + ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off; + } else ptr = pgStatLocal.snapshot.custom_data[kind - PGSTAT_KIND_CUSTOM_MIN]; + Assert(ptr != NULL); + fputc(PGSTAT_FILE_ENTRY_FIXED, fpout); pgstat_write_chunk_s(fpout, &kind); pgstat_write_chunk(fpout, ptr, info->shared_data_len); diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c index 73461c9bca5..fc1bf824a31 100644 --- a/src/backend/utils/activity/pgstat_backend.c +++ b/src/backend/utils/activity/pgstat_backend.c @@ -168,7 +168,7 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref) { PgStatShared_Backend *shbackendent; PgStat_BktypeIO *bktype_shstats; - PgStat_PendingIO pending_io; + PgStat_BackendPendingIO pending_io; /* * This function can be called even if nothing at all has happened for IO @@ -205,7 +205,7 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref) /* * Clear out the statistics buffer, so it can be re-used. */ - MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_PendingIO)); + MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_BackendPendingIO)); backend_has_iostats = false; } diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c index 13a5d8e6440..c2faada6487 100644 --- a/src/backend/utils/activity/pgstat_io.c +++ b/src/backend/utils/activity/pgstat_io.c @@ -17,10 +17,12 @@ #include "postgres.h" #include "executor/instrument.h" +#include "port/pg_bitutils.h" #include "storage/bufmgr.h" +#include "utils/memutils.h" #include "utils/pgstat_internal.h" -static PgStat_PendingIO PendingIOStats; +PgStat_PendingIO PendingIOStats; static bool have_iostats = false; /* @@ -107,6 +109,35 @@ pgstat_prepare_io_time(bool track_io_guc) return io_start; } +#define MIN_PG_STAT_IO_HIST_LATENCY 8191 +static inline int +get_bucket_index(uint64_t ns) +{ + const uint32_t max_index = PGSTAT_IO_HIST_BUCKETS - 1; + + /* + * hopefully pre-calculated by the compiler: clzl(8191) = + * clz(01111111111111b on uint64) + */ + const uint32_t min_latency_leading_zeros = + pg_leading_zero_bits64(MIN_PG_STAT_IO_HIST_LATENCY); + + /* + * make sure the tmp value has at least 8191 (our minimum bucket size) as + * __builtin_clzl might return undefined behavior when operating on 0 + */ + uint64_t tmp = ns | MIN_PG_STAT_IO_HIST_LATENCY; + + /* count leading zeros */ + int leading_zeros = pg_leading_zero_bits64(tmp); + + /* normalize the index */ + uint32_t index = min_latency_leading_zeros - leading_zeros; + + /* clamp it to the maximum */ + return (index > max_index) ? max_index : index; +} + /* * Like pgstat_count_io_op() except it also accumulates time. * @@ -125,6 +156,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, if (!INSTR_TIME_IS_ZERO(start_time)) { instr_time io_time; + int bucket_index; INSTR_TIME_SET_CURRENT(io_time); INSTR_TIME_SUBTRACT(io_time, start_time); @@ -152,6 +184,16 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op], io_time); + if (PendingIOStats.pending_hist_time_buckets != NULL) + { + /* + * calculate the bucket_index based on latency in nanoseconds + * (uint64) + */ + bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time)); + PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++; + } + /* Add the per-backend count */ pgstat_count_backend_io_op_time(io_object, io_context, io_op, io_time); @@ -165,7 +207,7 @@ pgstat_fetch_stat_io(void) { pgstat_snapshot_fixed(PGSTAT_KIND_IO); - return &pgStatLocal.snapshot.io; + return pgStatLocal.snapshot.io; } /* @@ -221,6 +263,11 @@ pgstat_io_flush_cb(bool nowait) bktype_shstats->times[io_object][io_context][io_op] += INSTR_TIME_GET_MICROSEC(time); + + if (PendingIOStats.pending_hist_time_buckets != NULL) + for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++) + bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] += + PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b]; } } } @@ -229,7 +276,8 @@ pgstat_io_flush_cb(bool nowait) LWLockRelease(bktype_lock); - memset(&PendingIOStats, 0, sizeof(PendingIOStats)); + /* Avoid overwriting latency buckets array pointer */ + memset(&PendingIOStats, 0, offsetof(PgStat_PendingIO, pending_hist_time_buckets)); have_iostats = false; @@ -274,6 +322,33 @@ pgstat_get_io_object_name(IOObject io_object) pg_unreachable(); } +const char * +pgstat_get_io_op_name(IOOp io_op) +{ + switch (io_op) + { + case IOOP_EVICT: + return "evict"; + case IOOP_FSYNC: + return "fsync"; + case IOOP_HIT: + return "hit"; + case IOOP_REUSE: + return "reuse"; + case IOOP_WRITEBACK: + return "writeback"; + case IOOP_EXTEND: + return "extend"; + case IOOP_READ: + return "read"; + case IOOP_WRITE: + return "write"; + } + + elog(ERROR, "unrecognized IOOp value: %d", io_op); + pg_unreachable(); +} + void pgstat_io_init_shmem_cb(void *stats) { @@ -281,6 +356,9 @@ pgstat_io_init_shmem_cb(void *stats) for (int i = 0; i < BACKEND_NUM_TYPES; i++) LWLockInitialize(&stat_shmem->locks[i], LWTRANCHE_PGSTATS_DATA); + + /* this might end up being lazily allocated in pgstat_io_snapshot_cb() */ + pgStatLocal.snapshot.io = NULL; } void @@ -308,11 +386,15 @@ pgstat_io_reset_all_cb(TimestampTz ts) void pgstat_io_snapshot_cb(void) { + if (unlikely(pgStatLocal.snapshot.io == NULL)) + pgStatLocal.snapshot.io = MemoryContextAllocZero(TopMemoryContext, + sizeof(PgStat_IO)); + for (int i = 0; i < BACKEND_NUM_TYPES; i++) { LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i]; PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i]; - PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i]; + PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io->stats[i]; LWLockAcquire(bktype_lock, LW_SHARED); @@ -321,7 +403,7 @@ pgstat_io_snapshot_cb(void) * the reset timestamp as well. */ if (i == 0) - pgStatLocal.snapshot.io.stat_reset_timestamp = + pgStatLocal.snapshot.io->stat_reset_timestamp = pgStatLocal.shmem->io.stats.stat_reset_timestamp; /* using struct assignment due to better type safety */ diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index 6f9c9c72de5..e16c65d45e9 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -18,6 +18,7 @@ #include "access/xlog.h" #include "access/xlogprefetcher.h" #include "catalog/catalog.h" +#include "catalog/namespace.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" #include "common/ip.h" @@ -30,6 +31,7 @@ #include "storage/procarray.h" #include "utils/acl.h" #include "utils/builtins.h" +#include "utils/rangetypes.h" #include "utils/timestamp.h" #include "utils/tuplestore.h" #include "utils/wait_event.h" @@ -1638,6 +1640,152 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS) return (Datum) 0; } +/* +* When adding a new column to the pg_stat_io_histogram view and the +* pg_stat_get_io_histogram() function, add a new enum value here above +* HIST_IO_NUM_COLUMNS. +*/ +typedef enum hist_io_stat_col +{ + HIST_IO_COL_INVALID = -1, + HIST_IO_COL_BACKEND_TYPE, + HIST_IO_COL_OBJECT, + HIST_IO_COL_CONTEXT, + HIST_IO_COL_IOTYPE, + HIST_IO_COL_BUCKET_US, + HIST_IO_COL_COUNT, + HIST_IO_COL_RESET_TIME, + HIST_IO_NUM_COLUMNS +} histogram_io_stat_col; + +/* + * pg_stat_io_histogram_build_tuples + * + * Helper routine for pg_stat_get_io_histogram() and pg_stat_get_backend_io() + * filling a result tuplestore with one tuple for each object and each + * context supported by the caller, based on the contents of bktype_stats. + */ +static void +pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo, + PgStat_BktypeIO *bktype_stats, + BackendType bktype, + TimestampTz stat_reset_timestamp) +{ + /* Get OID for int4range type */ + Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype)); + Oid range_typid = TypenameGetTypid("int4range"); + TypeCacheEntry *typcache = lookup_type_cache(range_typid, TYPECACHE_RANGE_INFO); + + for (int io_obj = 0; io_obj < IOOBJECT_NUM_TYPES; io_obj++) + { + const char *obj_name = pgstat_get_io_object_name(io_obj); + + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) + { + const char *context_name = pgstat_get_io_context_name(io_context); + + /* + * Some combinations of BackendType, IOObject, and IOContext are + * not valid for any type of IOOp. In such cases, omit the entire + * row from the view. + */ + if (!pgstat_tracks_io_object(bktype, io_obj, io_context)) + continue; + + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) + { + const char *op_name = pgstat_get_io_op_name(io_op); + + for (int bucket = 0; bucket < PGSTAT_IO_HIST_BUCKETS; bucket++) + { + Datum values[HIST_IO_NUM_COLUMNS] = {0}; + bool nulls[HIST_IO_NUM_COLUMNS] = {0}; + RangeBound lower, + upper; + RangeType *range; + + values[HIST_IO_COL_BACKEND_TYPE] = bktype_desc; + values[HIST_IO_COL_OBJECT] = CStringGetTextDatum(obj_name); + values[HIST_IO_COL_CONTEXT] = CStringGetTextDatum(context_name); + values[HIST_IO_COL_IOTYPE] = CStringGetTextDatum(op_name); + + /* bucket's maximum latency as range in microseconds */ + if (bucket == 0) + lower.val = Int32GetDatum(0); + else + lower.val = Int32GetDatum(1 << (2 + bucket)); + lower.infinite = false; + lower.inclusive = true; + lower.lower = true; + + if (bucket == PGSTAT_IO_HIST_BUCKETS - 1) + upper.infinite = true; + else + { + upper.val = Int32GetDatum(1 << (2 + bucket + 1)); + upper.infinite = false; + } + upper.inclusive = false; + upper.lower = false; + + range = make_range(typcache, &lower, &upper, false, NULL); + values[HIST_IO_COL_BUCKET_US] = RangeTypePGetDatum(range); + + /* bucket count */ + values[HIST_IO_COL_COUNT] = Int64GetDatum( + bktype_stats->hist_time_buckets[io_obj][io_context][io_op][bucket]); + + if (stat_reset_timestamp != 0) + values[HIST_IO_COL_RESET_TIME] = TimestampTzGetDatum(stat_reset_timestamp); + else + nulls[HIST_IO_COL_RESET_TIME] = true; + + tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, + values, nulls); + } + } + } + } +} + +Datum +pg_stat_get_io_histogram(PG_FUNCTION_ARGS) +{ + ReturnSetInfo *rsinfo; + PgStat_IO *backends_io_stats; + + InitMaterializedSRF(fcinfo, 0); + rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + + backends_io_stats = pgstat_fetch_stat_io(); + + for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++) + { + PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype]; + + /* + * In Assert builds, we can afford an extra loop through all of the + * counters (in pg_stat_io_build_tuples()), checking that only + * expected stats are non-zero, since it keeps the non-Assert code + * cleaner. + */ + Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype)); + + /* + * For those BackendTypes without IO Operation stats, skip + * representing them in the view altogether. + */ + if (!pgstat_tracks_io_bktype(bktype)) + continue; + + /* save tuples with data from this PgStat_BktypeIO */ + pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, bktype, + backends_io_stats->stat_reset_timestamp); + } + + return (Datum) 0; +} + /* * pg_stat_wal_build_tuple * diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index be157a5fbe9..159d912515c 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -6061,6 +6061,15 @@ proargnames => '{backend_type,object,context,reads,read_bytes,read_time,writes,write_bytes,write_time,writebacks,writeback_time,extends,extend_bytes,extend_time,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}', prosrc => 'pg_stat_get_io' }, +{ oid => '6149', descr => 'statistics: per backend type IO latency histogram', + proname => 'pg_stat_get_io_histogram', prorows => '30', proretset => 't', + provolatile => 'v', proparallel => 'r', prorettype => 'record', + proargtypes => '', + proallargtypes => '{text,text,text,text,int4range,int8,timestamptz}', + proargmodes => '{o,o,o,o,o,o,o}', + proargnames => '{backend_type,object,context,io_type,bucket_latency_us,bucket_count,stats_reset}', + prosrc => 'pg_stat_get_io_histogram' }, + { oid => '6509', descr => 'statistics: per lock type statistics', proname => 'pg_stat_get_lock', prorows => '10', proretset => 't', provolatile => 'v', proparallel => 'r', prorettype => 'record', diff --git a/src/include/pgstat.h b/src/include/pgstat.h index dfa2e837638..34fd93f86dc 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -326,11 +326,23 @@ typedef enum IOOp (((unsigned int) (io_op)) < IOOP_NUM_TYPES && \ ((unsigned int) (io_op)) >= IOOP_EXTEND) +/* + * This should represent balance between being fast and providing value + * to the users: + * 1. We want to cover various fast and slow device types (0.01ms - 15ms) + * 2. We want to also cover sporadic long tail latencies (hardware issues, + * delayed fsyncs, stuck I/O) + * 3. We want to be as small as possible here in terms of size: + * 16 * sizeof(uint64) = which should be less than two cachelines. + */ +#define PGSTAT_IO_HIST_BUCKETS 16 + typedef struct PgStat_BktypeIO { uint64 bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; + uint64 hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS]; } PgStat_BktypeIO; typedef struct PgStat_PendingIO @@ -338,8 +350,18 @@ typedef struct PgStat_PendingIO uint64 bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; instr_time pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; + + /* + * Dynamically allocated array of + * [IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES] + * [IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS] only with track_io_timings + * true. + */ + uint64 (*pending_hist_time_buckets)[IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS]; } PgStat_PendingIO; +extern PgStat_PendingIO PendingIOStats; + typedef struct PgStat_IO { TimestampTz stat_reset_timestamp; @@ -526,15 +548,24 @@ typedef struct PgStat_Backend } PgStat_Backend; /* --------- - * PgStat_BackendPending Non-flushed backend stats. + * PgStat_BackendPending(IO) Non-flushed backend stats. * --------- */ +typedef struct PgStat_BackendPendingIO +{ + uint64 bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; + PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; + instr_time pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; +} PgStat_BackendPendingIO; + typedef struct PgStat_BackendPending { /* - * Backend statistics store the same amount of IO data as PGSTAT_KIND_IO. + * Backend statistics store almost the same amount of IO data as + * PGSTAT_KIND_IO. The only difference between PgStat_BackendPendingIO and + * PgStat_PendingIO is that the latter also track IO latency histograms. */ - PgStat_PendingIO pending_io; + PgStat_BackendPendingIO pending_io; } PgStat_BackendPending; /* @@ -624,6 +655,7 @@ extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context, extern PgStat_IO *pgstat_fetch_stat_io(void); extern const char *pgstat_get_io_context_name(IOContext io_context); extern const char *pgstat_get_io_object_name(IOObject io_object); +extern const char *pgstat_get_io_op_name(IOOp io_op); extern bool pgstat_tracks_io_bktype(BackendType bktype); extern bool pgstat_tracks_io_object(BackendType bktype, diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 7a00d197013..b27913a2ad8 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -32,6 +32,42 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256]; extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256]; extern PGDLLIMPORT const uint8 pg_number_of_ones[256]; + +/* + * pg_leading_zero_bits64 + * Returns the number of leading 0-bits in x, starting at the most significant bit position. + * Word must not be 0 (as it is undefined behavior). + */ +static inline int +pg_leading_zero_bits64(uint64 word) +{ +#ifdef HAVE__BUILTIN_CLZL + Assert(word != 0); + +#if SIZEOF_LONG == 8 + return __builtin_clzl(word); +#elif SIZEOF_LONG_LONG == 8 + return __builtin_clzll(word); +#else +#error "cannot find integer type of the same size as uint64_t" +#endif + +#else + uint64 y; + int n = 64; + if (word == 0) + return 64; + + y = word >> 32; if (y != 0) { n -= 32; word = y; } + y = word >> 16; if (y != 0) { n -= 16; word = y; } + y = word >> 8; if (y != 0) { n -= 8; word = y; } + y = word >> 4; if (y != 0) { n -= 4; word = y; } + y = word >> 2; if (y != 0) { n -= 2; word = y; } + y = word >> 1; if (y != 0) { return n - 2; } + return n - 1; +#endif +} + /* * pg_leftmost_one_pos32 * Returns the position of the most significant set bit in "word", @@ -71,7 +107,7 @@ pg_leftmost_one_pos32(uint32 word) static inline int pg_leftmost_one_pos64(uint64 word) { -#ifdef HAVE__BUILTIN_CLZ +#ifdef HAVE__BUILTIN_CLZL Assert(word != 0); #if SIZEOF_LONG == 8 diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h index fe463faaf63..a3ce8b04723 100644 --- a/src/include/utils/pgstat_internal.h +++ b/src/include/utils/pgstat_internal.h @@ -608,7 +608,7 @@ typedef struct PgStat_Snapshot PgStat_CheckpointerStats checkpointer; - PgStat_IO io; + PgStat_IO *io; PgStat_Lock lock; diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl index cdc427dbc78..33939c8701a 100644 --- a/src/test/recovery/t/029_stats_restart.pl +++ b/src/test/recovery/t/029_stats_restart.pl @@ -293,7 +293,36 @@ cmp_ok( $wal_restart_immediate->{reset}, "$sect: reset timestamp is new"); + +## Test pg_stat_io_histogram that is becoming active due to dynamic memory +## allocation only for new backends with globally set track_[io|wal_io]_timing +$sect = "pg_stat_io_histogram"; +$node->append_conf('postgresql.conf', "track_io_timing = 'on'"); +$node->append_conf('postgresql.conf', "track_wal_io_timing = 'on'"); +$node->restart; + + +## Check that pg_stat_io_histograms sees some growing counts in buckets +## We could also try with checkpointer, but it often runs with fsync=off +## during test. +my $countbefore = $node->safe_psql('postgres', + "SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram() " . + "WHERE backend_type='client backend' AND object='relation' AND context='normal'"); + +$node->safe_psql('postgres', "CREATE TABLE test_io_hist(id bigint);"); +$node->safe_psql('postgres', "INSERT INTO test_io_hist SELECT generate_series(1, 100) s;"); +$node->safe_psql('postgres', "SELECT pg_stat_force_next_flush();"); + +my $countafter = $node->safe_psql('postgres', + "SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram() " . + "WHERE backend_type='client backend' AND object='relation' AND context='normal'"); + +cmp_ok( + $countafter, '>', $countbefore, + "pg_stat_io_histogram: latency buckets growing"); + $node->stop; + done_testing(); sub trigger_funcrel_stat diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index a65a5bf0c4f..c0067cb653b 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1967,6 +1967,14 @@ pg_stat_io| SELECT backend_type, fsync_time, stats_reset FROM pg_stat_get_io() b(backend_type, object, context, reads, read_bytes, read_time, writes, write_bytes, write_time, writebacks, writeback_time, extends, extend_bytes, extend_time, hits, evictions, reuses, fsyncs, fsync_time, stats_reset); +pg_stat_io_histogram| SELECT backend_type, + object, + context, + io_type, + bucket_latency_us, + bucket_count, + stats_reset + FROM pg_stat_get_io_histogram() b(backend_type, object, context, io_type, bucket_latency_us, bucket_count, stats_reset); pg_stat_lock| SELECT locktype, waits, wait_time, diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list index 8cf40c87043..ce52e7619fd 100644 --- a/src/tools/pgindent/typedefs.list +++ b/src/tools/pgindent/typedefs.list @@ -3858,6 +3858,7 @@ gzFile having_collation_ctx heap_page_items_state help_handler +histogram_io_stat_col hlCheck host_cache_hash hstoreCheckKeyLen_t -- 2.43.0
From e4ebec91ac9b9a9984afeacc62d6f216569c2a29 Mon Sep 17 00:00:00 2001 From: Jakub Wartak <[email protected]> Date: Wed, 18 Mar 2026 07:24:14 +0100 Subject: [PATCH v11 2/3] Lower pg_stat_io_histogram private (backend) memory in pending_hist_time_buckets by using array with indirect offsets. --- src/backend/utils/activity/pgstat.c | 9 +-- src/backend/utils/activity/pgstat_io.c | 90 ++++++++++++++++++++++++-- src/include/pgstat.h | 19 ++++-- src/include/utils/pgstat_internal.h | 1 + 4 files changed, 102 insertions(+), 17 deletions(-) diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index 9feb2f1370b..7c597932671 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -445,6 +445,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE] .shared_data_off = offsetof(PgStatShared_IO, stats), .shared_data_len = sizeof(((PgStatShared_IO *) 0)->stats), + .init_backend_cb = pgstat_io_init_backend_cb, .flush_static_cb = pgstat_io_flush_cb, .init_shmem_cb = pgstat_io_init_shmem_cb, .reset_all_cb = pgstat_io_reset_all_cb, @@ -691,14 +692,6 @@ pgstat_initialize(void) /* Set up a process-exit hook to clean up */ before_shmem_exit(pgstat_shutdown_hook, 0); - /* Allocate I/O latency buckets only if we are going to populate it */ - if (track_io_timing || track_wal_io_timing) - PendingIOStats.pending_hist_time_buckets = MemoryContextAllocZero(TopMemoryContext, - IOOBJECT_NUM_TYPES * IOCONTEXT_NUM_TYPES * IOOP_NUM_TYPES * - PGSTAT_IO_HIST_BUCKETS * sizeof(uint64)); - else - PendingIOStats.pending_hist_time_buckets = NULL; - #ifdef USE_ASSERT_CHECKING pgstat_is_initialized = true; #endif diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c index c2faada6487..4c655d38b97 100644 --- a/src/backend/utils/activity/pgstat_io.c +++ b/src/backend/utils/activity/pgstat_io.c @@ -16,6 +16,7 @@ #include "postgres.h" +#include "access/xlog.h" #include "executor/instrument.h" #include "port/pg_bitutils.h" #include "storage/bufmgr.h" @@ -66,6 +67,27 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io, return true; } +int +pgstat_bktype_count_potentially_used(BackendType bktype) +{ + int cnt = 0; + + for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) + { + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) + { + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) + { + /* we do track it */ + if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op)) + cnt++; + } + } + } + + return cnt; +} + void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt, uint64 bytes) @@ -186,12 +208,16 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, if (PendingIOStats.pending_hist_time_buckets != NULL) { + int offset; + /* * calculate the bucket_index based on latency in nanoseconds * (uint64) */ bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time)); - PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++; + + offset = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op]; + PendingIOStats.pending_hist_time_buckets[offset][bucket_index]++; } /* Add the per-backend count */ @@ -264,10 +290,23 @@ pgstat_io_flush_cb(bool nowait) bktype_shstats->times[io_object][io_context][io_op] += INSTR_TIME_GET_MICROSEC(time); + /* + * If tracking I/O stats, save I/O histograms from backend + * local's PendingIOStats by using indirect offsets from the + * pending_hist_time_buckets dynamic array (accessed with + * offsets to save memory) into shared memory. + */ if (PendingIOStats.pending_hist_time_buckets != NULL) for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++) - bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] += - PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b]; + { + int pending_off = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op]; + + if (pending_off != -1) + { + bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] += + PendingIOStats.pending_hist_time_buckets[pending_off][b]; + } + } } } } @@ -276,8 +315,14 @@ pgstat_io_flush_cb(bool nowait) LWLockRelease(bktype_lock); - /* Avoid overwriting latency buckets array pointer */ + /* + * Avoid overwriting histogram latency array (with offsets) and pointer to + * dynamically allocated memory + */ memset(&PendingIOStats, 0, offsetof(PgStat_PendingIO, pending_hist_time_buckets)); + if (PendingIOStats.pending_hist_time_buckets != NULL) + memset(PendingIOStats.pending_hist_time_buckets, 0, + PendingIOStats.pending_hist_time_buckets_size * sizeof(*PendingIOStats.pending_hist_time_buckets)); have_iostats = false; @@ -349,6 +394,43 @@ pgstat_get_io_op_name(IOOp io_op) pg_unreachable(); } +void +pgstat_io_init_backend_cb(void) +{ + /* Allocate I/O latency buckets only if we are going to populate it */ + if (track_io_timing || track_wal_io_timing) + { + int alloc_sz, + io_histograms_used = 0; + + PendingIOStats.pending_hist_time_buckets_size = pgstat_bktype_count_potentially_used(MyBackendType); + alloc_sz = PendingIOStats.pending_hist_time_buckets_size * sizeof(*PendingIOStats.pending_hist_time_buckets); + PendingIOStats.pending_hist_time_buckets = MemoryContextAllocZero(TopMemoryContext, alloc_sz); + + for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) + { + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) + { + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) + { + if (pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op)) + { + Assert(io_histograms_used <= PendingIOStats.pending_hist_time_buckets_size); + + PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op] = + io_histograms_used++; + } + else + PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op] = -1; + } + } + } + } + else + PendingIOStats.pending_hist_time_buckets = NULL; + +} + void pgstat_io_init_shmem_cb(void *stats) { diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 34fd93f86dc..984914e69b8 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -352,12 +352,20 @@ typedef struct PgStat_PendingIO instr_time pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; /* - * Dynamically allocated array of - * [IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES] - * [IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS] only with track_io_timings - * true. + * Dynamically allocated array for pg_stat_io_histograms only when + * track_io_timings is true. pending_hist_time_buckets_offsets is just an + * offset within pending_hist_time_buckets to avoid using unnecessary + * memory. */ - uint64 (*pending_hist_time_buckets)[IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS]; + uint64 (*pending_hist_time_buckets)[PGSTAT_IO_HIST_BUCKETS]; + uint64 pending_hist_time_buckets_offsets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; + + /* + * Cache how much histograms we have allocated to avoid repetably calling + * pgstat_bktype_count_potentially_used(MyBackendType) from + * pgstat_io_flush_cb() + */ + int pending_hist_time_buckets_size; } PgStat_PendingIO; extern PgStat_PendingIO PendingIOStats; @@ -645,6 +653,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void); extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io, BackendType bktype); +extern int pgstat_bktype_count_potentially_used(BackendType bktype); extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt, uint64 bytes); extern instr_time pgstat_prepare_io_time(bool track_io_guc); diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h index a3ce8b04723..fcaf21db574 100644 --- a/src/include/utils/pgstat_internal.h +++ b/src/include/utils/pgstat_internal.h @@ -759,6 +759,7 @@ extern void pgstat_function_reset_timestamp_cb(PgStatShared_Common *header, Time extern void pgstat_flush_io(bool nowait); extern bool pgstat_io_flush_cb(bool nowait); +extern void pgstat_io_init_backend_cb(void); extern void pgstat_io_init_shmem_cb(void *stats); extern void pgstat_io_reset_all_cb(TimestampTz ts); extern void pgstat_io_snapshot_cb(void); -- 2.43.0
From 3dd96ae7164e28f802fa44c311b123fdf31e223b Mon Sep 17 00:00:00 2001 From: Jakub Wartak <[email protected]> Date: Fri, 8 May 2026 09:19:49 +0200 Subject: [PATCH v11 3/3] Lower pg_stat_io_histogram shared memory use by using array with indirect offsets. We use pgstat_track_io_*() family of functions to derive the length of static array that is allocated in shared memory region during startup. As the number of valid combinations of backend types vs I/O object/context/operations is coming from semi-runtime pgstat_io_get_sum_tracked() function, it cannot be preprocessed, so we would need to come up with #define PGSTAT_IO_HIST_BUCKET_SLOTS somehow. In order to do that - and avoid that C limitations (lack of constexpr) - we could precalculate (in the build system) the size of static array and generate .h include that would be included by pgstat.h, however it appears that would be it hardly cross-portable and hardly cross-compilable. Instead of doing that, we dynamically allocate shared memory for IO historgrams during startup. --- src/backend/utils/activity/pgstat.c | 42 +++++- src/backend/utils/activity/pgstat_io.c | 164 +++++++++++++++++++--- src/backend/utils/activity/pgstat_shmem.c | 15 ++ src/backend/utils/adt/pgstatfuncs.c | 20 ++- src/include/pgstat.h | 26 +++- 5 files changed, 241 insertions(+), 26 deletions(-) diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index 7c597932671..0bd59992f4e 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -443,7 +443,13 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE] .snapshot_ctl_off = offsetof(PgStat_Snapshot, io), .shared_ctl_off = offsetof(PgStat_ShmemControl, io), .shared_data_off = offsetof(PgStatShared_IO, stats), - .shared_data_len = sizeof(((PgStatShared_IO *) 0)->stats), + + /* + * Do not write everything using this .shared_data_len, as the IO + * histogram backing store is handled by special-case (as it is + * dynamic) in pgstat_write_statsfile() / pgstat_read_statsfile(). + */ + .shared_data_len = offsetof(PgStat_IO, hist_time_buckets_slot_count), .init_backend_cb = pgstat_io_init_backend_cb, .flush_static_cb = pgstat_io_flush_cb, @@ -1685,6 +1691,21 @@ pgstat_write_statsfile(void) fputc(PGSTAT_FILE_ENTRY_FIXED, fpout); pgstat_write_chunk_s(fpout, &kind); pgstat_write_chunk(fpout, ptr, info->shared_data_len); + + /* + * PGSTAT_KIND_IO has a dynamically-sized histogram that lives outside + * the shared_data_len region. This assumes that PGSTAT_FILE_FORMAT_ID + * would be bumped each time that pgstat_track_io*() logic is altered. + */ + if (kind == PGSTAT_KIND_IO) + { + PgStat_IO *io = pgStatLocal.snapshot.io; + + pgstat_write_chunk(fpout, io->hist_time_buckets_slots, + (size_t) io->hist_time_buckets_slot_count * + PGSTAT_IO_HIST_BUCKETS * + sizeof(uint64)); + } } /* @@ -1930,6 +1951,25 @@ pgstat_read_statsfile(void) goto error; } + /* + * PGSTAT_KIND_IO has also semi-dynamic histogram array + * appended after the main chunk. By now, the + * StatsShmemInit() prepared the memory. + */ + if (kind == PGSTAT_KIND_IO) + { + PgStat_IO *io = &shmem->io.stats; + + if (!pgstat_read_chunk(fpin, io->hist_time_buckets_slots, + (size_t) io->hist_time_buckets_slot_count * + PGSTAT_IO_HIST_BUCKETS * + sizeof(uint64))) + { + elog(WARNING, "could not read pgstat_io histogram backing store"); + goto error; + } + } + break; } case PGSTAT_FILE_ENTRY_HASH: diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c index 4c655d38b97..ad8093420ed 100644 --- a/src/backend/utils/activity/pgstat_io.c +++ b/src/backend/utils/activity/pgstat_io.c @@ -20,6 +20,7 @@ #include "executor/instrument.h" #include "port/pg_bitutils.h" #include "storage/bufmgr.h" +#include "storage/shmem.h" #include "utils/memutils.h" #include "utils/pgstat_internal.h" @@ -210,6 +211,8 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, { int offset; + Assert(track_io_timing || track_wal_io_timing); + /* * calculate the bucket_index based on latency in nanoseconds * (uint64) @@ -217,6 +220,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time)); offset = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op]; + + /* does offset points to valid slot? */ + Assert(offset >= 0 && offset < PendingIOStats.pending_hist_time_buckets_size); + PendingIOStats.pending_hist_time_buckets[offset][bucket_index]++; } @@ -258,6 +265,7 @@ pgstat_io_flush_cb(bool nowait) { LWLock *bktype_lock; PgStat_BktypeIO *bktype_shstats; + PgStat_IO *bk_io; if (!have_iostats) return false; @@ -265,6 +273,7 @@ pgstat_io_flush_cb(bool nowait) bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType]; bktype_shstats = &pgStatLocal.shmem->io.stats.stats[MyBackendType]; + bk_io = &pgStatLocal.shmem->io.stats; if (!nowait) LWLockAcquire(bktype_lock, LW_EXCLUSIVE); @@ -297,16 +306,23 @@ pgstat_io_flush_cb(bool nowait) * offsets to save memory) into shared memory. */ if (PendingIOStats.pending_hist_time_buckets != NULL) - for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++) - { - int pending_off = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op]; + { + int bktype_shstats_off = bktype_shstats->hist_time_buckets_offsets[io_object][io_context][io_op]; + int pending_off = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op]; - if (pending_off != -1) - { - bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] += - PendingIOStats.pending_hist_time_buckets[pending_off][b]; - } - } + Assert(track_io_timing || track_wal_io_timing); + + /* + * -1 means here that such mapping doesn't have a slot + * (based on pgstat_track_io_*()). + */ + if (bktype_shstats_off == -1 || pending_off == -1) + continue; + + for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++) + bk_io->hist_time_buckets_slots[bktype_shstats_off][b] += + PendingIOStats.pending_hist_time_buckets[pending_off][b]; + } } } } @@ -415,7 +431,7 @@ pgstat_io_init_backend_cb(void) { if (pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op)) { - Assert(io_histograms_used <= PendingIOStats.pending_hist_time_buckets_size); + Assert(io_histograms_used < PendingIOStats.pending_hist_time_buckets_size); PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op] = io_histograms_used++; @@ -428,12 +444,12 @@ pgstat_io_init_backend_cb(void) } else PendingIOStats.pending_hist_time_buckets = NULL; - } void pgstat_io_init_shmem_cb(void *stats) { + int histogram_slots = 0; PgStatShared_IO *stat_shmem = (PgStatShared_IO *) stats; for (int i = 0; i < BACKEND_NUM_TYPES; i++) @@ -441,26 +457,79 @@ pgstat_io_init_shmem_cb(void *stats) /* this might end up being lazily allocated in pgstat_io_snapshot_cb() */ pgStatLocal.snapshot.io = NULL; + + /* + * Establish indirect mapping from + * PgStat_BktypeIO.hist_time_buckets_offsets[][][] to + * PgStat_IO.hist_time_buckets_slots[x] + */ + for (int i = 0; i < BACKEND_NUM_TYPES; i++) + { + for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) + { + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) + { + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) + { + if (pgstat_tracks_io_op(i, io_object, io_context, io_op)) + { + stat_shmem->stats.stats[i].hist_time_buckets_offsets[io_object][io_context][io_op] = + histogram_slots++; + } + else + stat_shmem->stats.stats[i].hist_time_buckets_offsets[io_object][io_context][io_op] = + -1; + } + } + } + } + + /* + * Sanity check: the offset table we just produced must use exactly the + * number of slots StatsShmemInit() reserved. Both come from the same + * pgstat_tracks_io_*() rules, so a mismatch would indicate a bug. + */ + Assert(histogram_slots == stat_shmem->stats.hist_time_buckets_slot_count); } void pgstat_io_reset_all_cb(TimestampTz ts) { + PgStat_IO *io_stats = &pgStatLocal.shmem->io.stats; + for (int i = 0; i < BACKEND_NUM_TYPES; i++) { LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i]; - PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i]; + PgStat_BktypeIO *bktype_shstats = &io_stats->stats[i]; LWLockAcquire(bktype_lock, LW_EXCLUSIVE); /* * Use the lock in the first BackendType's PgStat_BktypeIO to protect - * the reset timestamp as well. + * the reset timestamp. */ if (i == 0) - pgStatLocal.shmem->io.stats.stat_reset_timestamp = ts; + io_stats->stat_reset_timestamp = ts; - memset(bktype_shstats, 0, sizeof(*bktype_shstats)); + /* Reset this BackendType's histogram slots */ + for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) + { + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) + { + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) + { + int off = bktype_shstats->hist_time_buckets_offsets[io_object][io_context][io_op]; + + if (off == -1) + continue; + memset(io_stats->hist_time_buckets_slots[off], 0, + sizeof(io_stats->hist_time_buckets_slots[off])); + } + } + } + + /* Avoid resetting our indirect mapping offsets */ + memset(bktype_shstats, 0, offsetof(PgStat_BktypeIO, hist_time_buckets_offsets)); LWLockRelease(bktype_lock); } } @@ -468,14 +537,30 @@ pgstat_io_reset_all_cb(TimestampTz ts) void pgstat_io_snapshot_cb(void) { + PgStat_IO *shmem_io = &pgStatLocal.shmem->io.stats; + if (unlikely(pgStatLocal.snapshot.io == NULL)) + { + int n = shmem_io->hist_time_buckets_slot_count; + pgStatLocal.snapshot.io = MemoryContextAllocZero(TopMemoryContext, sizeof(PgStat_IO)); + /* + * Allocated on demand in private (TopMemoryContext) memory and points + * to the same indirect offsets. + */ + pgStatLocal.snapshot.io->hist_time_buckets_slot_count = n; + pgStatLocal.snapshot.io->hist_time_buckets_slots = + MemoryContextAllocZero(TopMemoryContext, + (size_t) n * PGSTAT_IO_HIST_BUCKETS * + sizeof(uint64)); + } + for (int i = 0; i < BACKEND_NUM_TYPES; i++) { LWLock *bktype_lock = &pgStatLocal.shmem->io.locks[i]; - PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i]; + PgStat_BktypeIO *bktype_shstats = &shmem_io->stats[i]; PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io->stats[i]; LWLockAcquire(bktype_lock, LW_SHARED); @@ -486,10 +571,29 @@ pgstat_io_snapshot_cb(void) */ if (i == 0) pgStatLocal.snapshot.io->stat_reset_timestamp = - pgStatLocal.shmem->io.stats.stat_reset_timestamp; + shmem_io->stat_reset_timestamp; /* using struct assignment due to better type safety */ *bktype_snap = *bktype_shstats; + + /* Copy this BackendType's histogram slots */ + for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) + { + for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) + { + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) + { + int off = bktype_shstats->hist_time_buckets_offsets[io_object][io_context][io_op]; + + if (off == -1) + continue; + memcpy(pgStatLocal.snapshot.io->hist_time_buckets_slots[off], + shmem_io->hist_time_buckets_slots[off], + sizeof(shmem_io->hist_time_buckets_slots[off])); + } + } + } + LWLockRelease(bktype_lock); } } @@ -720,3 +824,29 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object, return true; } + +/* + * Total number of tuple of really usable combinations (BackendType, IOObject, + * IOContext, IOOp) that we consider trackable. + */ +int +pgstat_io_get_sum_tracked(void) +{ + int sum = 0; + + for (int i = 0; i < BACKEND_NUM_TYPES; i++) + sum += pgstat_bktype_count_potentially_used(i); + + return sum; +} + +/* + * Returns number of bytes for shared memory required by + * PgStat_IO.hist_time_buckets_slots, + */ +Size +pgstat_io_histogram_shmem_size(void) +{ + return mul_size(pgstat_io_get_sum_tracked(), + PGSTAT_IO_HIST_BUCKETS * sizeof(uint64)); +} diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c index b8f354c818a..bb25be106be 100644 --- a/src/backend/utils/activity/pgstat_shmem.c +++ b/src/backend/utils/activity/pgstat_shmem.c @@ -139,6 +139,12 @@ StatsShmemSize(void) sz = MAXALIGN(sizeof(PgStat_ShmemControl)); sz = add_size(sz, pgstat_dsa_init_size()); + /* + * Dynamic allocation for PgStat_IO.hist_time_buckets_slots. Sized from + * the rules in pgstat_tracks_io_*() + */ + sz = add_size(sz, MAXALIGN(pgstat_io_histogram_shmem_size())); + /* Add shared memory for all the custom fixed-numbered statistics */ for (PgStat_Kind kind = PGSTAT_KIND_CUSTOM_MIN; kind <= PGSTAT_KIND_CUSTOM_MAX; kind++) { @@ -194,6 +200,15 @@ StatsShmemInit(void *arg) LWTRANCHE_PGSTATS_DSA, NULL); dsa_pin(dsa); + /* + * Prepare PgStat_IO.hist_time_buckets_slot* stuff before calling + * pgstat_io_init_shmem_cb(). The additional memory for this was requested + * in the StatsShmemSize() above. + */ + ctl->io.stats.hist_time_buckets_slot_count = pgstat_io_get_sum_tracked(); + ctl->io.stats.hist_time_buckets_slots = (uint64 (*)[PGSTAT_IO_HIST_BUCKETS]) p; + p += MAXALIGN(pgstat_io_histogram_shmem_size()); + /* * To ensure dshash is created in "plain" shared memory, temporarily limit * size of dsa to the initial size of the dsa. diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index e16c65d45e9..da0e309600a 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -1667,6 +1667,7 @@ typedef enum hist_io_stat_col */ static void pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo, + PgStat_IO *backends_io_stats, PgStat_BktypeIO *bktype_stats, BackendType bktype, TimestampTz stat_reset_timestamp) @@ -1695,6 +1696,16 @@ pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo, for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) { const char *op_name = pgstat_get_io_op_name(io_op); + int bktype_hist_time_bucket_off; + + /* + * The offset is the same for every histogram bucket of this + * io_obj/io_context/io_op combination. + */ + bktype_hist_time_bucket_off = bktype_stats->hist_time_buckets_offsets[io_obj][io_context][io_op]; + if (bktype_hist_time_bucket_off == -1) + continue; + Assert(bktype_hist_time_bucket_off < backends_io_stats->hist_time_buckets_slot_count); for (int bucket = 0; bucket < PGSTAT_IO_HIST_BUCKETS; bucket++) { @@ -1703,6 +1714,7 @@ pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo, RangeBound lower, upper; RangeType *range; + uint64 bktype_bucket; values[HIST_IO_COL_BACKEND_TYPE] = bktype_desc; values[HIST_IO_COL_OBJECT] = CStringGetTextDatum(obj_name); @@ -1731,9 +1743,9 @@ pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo, range = make_range(typcache, &lower, &upper, false, NULL); values[HIST_IO_COL_BUCKET_US] = RangeTypePGetDatum(range); - /* bucket count */ - values[HIST_IO_COL_COUNT] = Int64GetDatum( - bktype_stats->hist_time_buckets[io_obj][io_context][io_op][bucket]); + /* get bucket count, access indirectly */ + bktype_bucket = backends_io_stats->hist_time_buckets_slots[bktype_hist_time_bucket_off][bucket]; + values[HIST_IO_COL_COUNT] = Int64GetDatum(bktype_bucket); if (stat_reset_timestamp != 0) values[HIST_IO_COL_RESET_TIME] = TimestampTzGetDatum(stat_reset_timestamp); @@ -1779,7 +1791,7 @@ pg_stat_get_io_histogram(PG_FUNCTION_ARGS) continue; /* save tuples with data from this PgStat_BktypeIO */ - pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, bktype, + pg_stat_io_histogram_build_tuples(rsinfo, backends_io_stats, bktype_stats, bktype, backends_io_stats->stat_reset_timestamp); } diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 984914e69b8..de90f1fb5b0 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -20,7 +20,6 @@ #include "utils/backend_status.h" /* for backward compatibility */ /* IWYU pragma: export */ #include "utils/pgstat_kind.h" - /* avoid including access/transam.h */ typedef struct FullTransactionId FullTransactionId; @@ -218,7 +217,7 @@ typedef struct PgStat_TableXactStatus * ------------------------------------------------------------ */ -#define PGSTAT_FILE_FORMAT_ID 0x01A5BCBC +#define PGSTAT_FILE_FORMAT_ID 0x01A5BCBD typedef struct PgStat_ArchiverStats { @@ -342,7 +341,14 @@ typedef struct PgStat_BktypeIO uint64 bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; - uint64 hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS]; + + /* + * Indirect offset to PgStat_IO (parent + * structure).hist_time_buckets_slots. This needs to be the last field due + * to the use of memset(.., offsetof(hist_time_buckets_offsets)) in + * pgstat_io_reset_all_cb(). + */ + int hist_time_buckets_offsets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; } PgStat_BktypeIO; typedef struct PgStat_PendingIO @@ -358,7 +364,7 @@ typedef struct PgStat_PendingIO * memory. */ uint64 (*pending_hist_time_buckets)[PGSTAT_IO_HIST_BUCKETS]; - uint64 pending_hist_time_buckets_offsets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; + int pending_hist_time_buckets_offsets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; /* * Cache how much histograms we have allocated to avoid repetably calling @@ -374,6 +380,16 @@ typedef struct PgStat_IO { TimestampTz stat_reset_timestamp; PgStat_BktypeIO stats[BACKEND_NUM_TYPES]; + + /* + * The IO histogram memory is sized at postmaster start from the rules in + * pgstat_tracks_io_*() and persisted by additinal code to handle this + * dynamic (shared) memory pointer in pgstat_write_statsfile() / + * pgstat_read_statsfile(), so they nes are not part of the serialization + * to disk by common code. + */ + int hist_time_buckets_slot_count; + uint64 (*hist_time_buckets_slots)[PGSTAT_IO_HIST_BUCKETS]; } PgStat_IO; typedef struct PgStat_LockEntry @@ -654,6 +670,8 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void); extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io, BackendType bktype); extern int pgstat_bktype_count_potentially_used(BackendType bktype); +extern int pgstat_io_get_sum_tracked(void); +extern Size pgstat_io_histogram_shmem_size(void); extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt, uint64 bytes); extern instr_time pgstat_prepare_io_time(bool track_io_guc); -- 2.43.0
