Hi,
Thanks for all the feedback. I am sharing the new version of the patchset.
Current status of the patchset is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write, fsync stats and their
tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats and their tests are added.
- Documentation is updated.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time /
PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() /
pgstat_count_io_op_time() respectively.
Updates & Discussion items:
- Try to set op_bytes for BackendType / IOContext: I think we don't
need this now, we will need this when we add streaming replication WAL
IOs.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked:
-- IOOBJECT_WAL / IOCONTEXT_INIT + IOCONTEXT_NORMAL / write and fsync
IOs can be done on every backend that tracks IO statistics. Because of
that and since we have a pgstat_tracks_io_bktype(bktype) check, I
didn't add another check for this.
-- I found that only the standalone backend and startup backend do
IOOBJECT_WAL / IOCONTEXT_NORMAL / read IOs. So, I added a check for
that but I am not sure if there are more backends that do WAL reads on
WAL recovery.
- For the IOOBJECT_WAL / IOCONTEXT_INIT and IOOBJECT_WAL /
IOCONTEXT_NORMAL / read tests, I used initial WAL IOs to check these
stats. I am not sure if that is the correct way or enough to test
these stats.
- To not calculate WAL timings on pg_stat_wal and pg_stat_io view,
pg_stat_wal view's WAL timings are fetched from pg_stat_io. Since
these timings are fetched from pg_stat_io, pg_stat_reset_shared('io')
will reset pg_stat_wal's timings too.
- I didn't move 'PendingWalStats.wal_sync' out from the
'pgstat_count_io_op_n' function because they count the same thing
(block vs system calls) but I agree that this doesn't look good.
Any kind of feedback would be appreciated.
--
Regards,
Nazir Bilal Yavuz
Microsoft
From 4ad85b11d418ae78237ed70eced6e3b46d086ef5 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <[email protected]>
Date: Fri, 1 Dec 2023 10:03:21 +0300
Subject: [PATCH v5 3/4] Add IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync
tests
---
src/test/regress/expected/stats.out | 19 +++++++++++++++++++
src/test/regress/sql/stats.sql | 10 ++++++++++
2 files changed, 29 insertions(+)
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 4d3a515bdd..4adda9e479 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,25 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
t
(1 row)
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync.
+-- When the servers starts, the initial WAL file must be created,
+-- so check these stats before stats get resetted.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_init_fsyncs > 0;
+ ?column?
+----------
+ t
+(1 row)
+
-----
-- Test that resetting stats works for reset timestamp
-----
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index aa48e65dc8..72e864a0d2 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,16 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
FROM pg_stat_get_backend_idset() beid
WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync.
+-- When the servers starts, the initial WAL file must be created,
+-- so check these stats before stats get resetted.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_init_fsyncs > 0;
+
-----
-- Test that resetting stats works for reset timestamp
-----
--
2.43.0
From c7ae6c12cd02806d9d8201d738920179985cee7a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <[email protected]>
Date: Fri, 10 Nov 2023 14:52:22 +0300
Subject: [PATCH v5 2/4] Add IOOBJECT_WAL / IOCONTEXT_NORMAL / write and fsync
tests
---
src/test/regress/expected/stats.out | 26 ++++++++++++++++++++++++++
src/test/regress/sql/stats.sql | 11 +++++++++++
2 files changed, 37 insertions(+)
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 346e10a3d2..4d3a515bdd 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1255,6 +1255,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
-- - extends of relations using shared buffers
-- - fsyncs done to ensure the durability of data dirtying shared buffers
-- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
-- There is no test for blocks evicted from shared buffers, because we cannot
-- be sure of the state of shared buffers at the point the test is run.
-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1264,6 +1265,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
FROM pg_stat_io
WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
CREATE TABLE test_io_shared(a int);
INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
SELECT pg_stat_force_next_flush();
@@ -1301,6 +1305,28 @@ SELECT current_setting('fsync') = 'off'
t
(1 row)
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column?
+----------
+ t
+(1 row)
+
-- Change the tablespace so that the table is rewritten directly, then SELECT
-- from it to cause it to be read back into shared buffers.
SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index e3b4ca96e8..aa48e65dc8 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -601,6 +601,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
-- - extends of relations using shared buffers
-- - fsyncs done to ensure the durability of data dirtying shared buffers
-- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
-- There is no test for blocks evicted from shared buffers, because we cannot
-- be sure of the state of shared buffers at the point the test is run.
@@ -612,6 +613,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
FROM pg_stat_io
WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
CREATE TABLE test_io_shared(a int);
INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
SELECT pg_stat_force_next_flush();
@@ -630,6 +634,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
SELECT current_setting('fsync') = 'off'
OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
-- Change the tablespace so that the table is rewritten directly, then SELECT
-- from it to cause it to be read back into shared buffers.
--
2.43.0
From 5538140043a975f8d3a469fe83c0dec7919f8dfa Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <[email protected]>
Date: Thu, 26 Oct 2023 12:12:32 +0300
Subject: [PATCH v5 1/4] Show WAL stats on pg_stat_io (except streaming
replication)
This patch aims to showing WAL stats per backend on pg_stat_io view.
With this patch, it can be seen how many WAL operations it makes, their
context, types and total timings per backend in pg_stat_io view.
For the not calculating WAL timings on pg_stat_wal and pg_stat_io view,
pg_stat_wal view's WAL timings are fetched from pg_stat_io. Since these
timings are fetched from pg_stat_io, pg_stat_reset_shared('io') will
reset pg_stat_wal's timings too.
In this path new IOContext IOCONTEXT_INIT is introduced, it is for IO
operations done while creating the things. Currently, it is used only
together with IOObject IOOBJECT_WAL.
IOOBJECT_WAL means IO operations related to WAL.
IOOBJECT_WAL / IOCONTEXT_NORMAL means IO operations done on already
created WAL segments.
IOOBJECT_WAL / IOCONTEXT_INIT means IO operations done while creating
the WAL segments.
This patch currently covers:
- Documentation
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write and fsync stats on
pg_stat_io.
- IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync stats on pg_stat_io.
doesn't cover:
- Streaming replication WAL IO.
---
doc/src/sgml/monitoring.sgml | 28 ++++-
src/backend/access/transam/xlog.c | 60 ++++------
src/backend/access/transam/xlogrecovery.c | 10 ++
src/backend/utils/activity/pgstat_io.c | 136 +++++++++++++++++++++-
src/backend/utils/adt/pgstatfuncs.c | 48 ++++++--
src/include/pgstat.h | 8 +-
6 files changed, 232 insertions(+), 58 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 42509042ad..679c527f46 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2499,9 +2499,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</para>
<para>
- Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
- relation I/O which bypasses shared buffers (e.g. when moving a table from one
- tablespace to another) is currently not tracked.
+ Currently, I/O on relations (e.g. tables, indexes) and WAL activities are
+ tracked. However, relation I/O which bypasses shared buffers
+ (e.g. when moving a table from one tablespace to another) is currently
+ not tracked.
</para>
<table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2554,6 +2555,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
<literal>temp relation</literal>: Temporary relations.
</para>
</listitem>
+ <listitem>
+ <para>
+ <literal>wal</literal>: Write Ahead Logs.
+ </para>
+ </listitem>
</itemizedlist>
</para>
</entry>
@@ -2578,6 +2584,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
<literal>normal</literal>.
</para>
</listitem>
+ <listitem>
+ <para>
+ <literal>init</literal>: I/O operations performed while creating the
+ WAL segments are tracked in <varname>context</varname>
+ <literal>init</literal>.
+ </para>
+ </listitem>
<listitem>
<para>
<literal>vacuum</literal>: I/O operations performed outside of shared
@@ -4740,7 +4753,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
<listitem>
<para>
<literal>io</literal>: Reset all the counters shown in the
- <structname>pg_stat_io</structname> view.
+ <structname>pg_stat_io</structname> view. Note that, this will
+ reset <structname>pg_stat_wal</structname> view's timing counters
+ too.
</para>
</listitem>
<listitem>
@@ -4757,8 +4772,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</listitem>
<listitem>
<para>
- <literal>wal</literal>: Reset all the counters shown in the
- <structname>pg_stat_wal</structname> view.
+ <literal>wal</literal>: Reset all the counters except timings shown
+ in the <structname>pg_stat_wal</structname> view. These timing
+ counters can be reset by calling pg_stat_reset_shared with 'io'.
</para>
</listitem>
<listitem>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6526bd4f43..55da2d7812 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2275,38 +2275,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
Size nbytes;
Size nleft;
int written;
- instr_time start;
+ instr_time io_start;
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
nbytes = npages * (Size) XLOG_BLCKSZ;
nleft = nbytes;
+
+ io_start = pgstat_prepare_io_time();
do
{
errno = 0;
- /* Measure I/O timing to write WAL data */
- if (track_wal_io_timing)
- INSTR_TIME_SET_CURRENT(start);
- else
- INSTR_TIME_SET_ZERO(start);
-
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
written = pg_pwrite(openLogFile, from, nleft, startoffset);
pgstat_report_wait_end();
- /*
- * Increment the I/O timing and the number of times WAL data
- * were written out to disk.
- */
- if (track_wal_io_timing)
- {
- instr_time end;
-
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
- }
-
PendingWalStats.wal_write++;
if (written <= 0)
@@ -2331,6 +2315,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
startoffset += written;
} while (nleft > 0);
+ pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+ IOOP_WRITE, io_start, npages);
+
npages = 0;
/*
@@ -3037,6 +3024,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
int fd;
int save_errno;
int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+ instr_time io_start;
Assert(logtli != 0);
@@ -3080,6 +3068,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
(errcode_for_file_access(),
errmsg("could not create file \"%s\": %m", tmppath)));
+ /* start timing writes for stats */
+ io_start = pgstat_prepare_io_time();
+
pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
save_errno = 0;
if (wal_init_zero)
@@ -3115,6 +3106,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
}
pgstat_report_wait_end();
+ pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+ io_start, 1);
+
if (save_errno)
{
/*
@@ -3131,6 +3125,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
errmsg("could not write to file \"%s\": %m", tmppath)));
}
+ /* start timing fsyncs for stats */
+ io_start = pgstat_prepare_io_time();
+
pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
if (pg_fsync(fd) != 0)
{
@@ -3143,6 +3140,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
}
pgstat_report_wait_end();
+ pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+ IOOP_FSYNC, io_start, 1);
+
if (close(fd) != 0)
ereport(ERROR,
(errcode_for_file_access(),
@@ -8315,7 +8315,7 @@ void
issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
{
char *msg = NULL;
- instr_time start;
+ instr_time io_start;
Assert(tli != 0);
@@ -8328,11 +8328,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
return;
- /* Measure I/O timing to sync the WAL file */
- if (track_wal_io_timing)
- INSTR_TIME_SET_CURRENT(start);
- else
- INSTR_TIME_SET_ZERO(start);
+ io_start = pgstat_prepare_io_time();
pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
switch (wal_sync_method)
@@ -8376,18 +8372,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
pgstat_report_wait_end();
- /*
- * Increment the I/O timing and the number of times WAL files were synced.
- */
- if (track_wal_io_timing)
- {
- instr_time end;
-
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
- }
-
- PendingWalStats.wal_sync++;
+ pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+ io_start, 1);
}
/*
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..5557f846a0 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
#include "utils/builtins.h"
#include "utils/datetime.h"
#include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
#include "utils/pg_lsn.h"
#include "utils/ps_status.h"
#include "utils/pg_rusage.h"
@@ -1773,6 +1774,9 @@ PerformWalRecovery(void)
*/
ApplyWalRecord(xlogreader, record, &replayTLI);
+ /* Report pending statistics to the cumulative stats system */
+ pgstat_flush_io(false);
+
/* Exit loop if we reached inclusive recovery target */
if (recoveryStopsAfter(xlogreader))
{
@@ -3248,6 +3252,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
uint32 targetPageOff;
XLogSegNo targetSegNo PG_USED_FOR_ASSERTS_ONLY;
int r;
+ instr_time io_start;
XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3340,6 +3345,8 @@ retry:
/* Read the requested page */
readOff = targetPageOff;
+ io_start = pgstat_prepare_io_time();
+
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
if (r != XLOG_BLCKSZ)
@@ -3368,6 +3375,9 @@ retry:
}
pgstat_report_wait_end();
+ pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+ io_start, 1);
+
Assert(targetSegNo == readSegNo);
Assert(targetPageOff == readOff);
Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..f2f544bb73 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -18,6 +18,7 @@
#include "executor/instrument.h"
#include "storage/bufmgr.h"
+#include "access/xlog.h"
#include "utils/pgstat_internal.h"
@@ -87,24 +88,59 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
Assert((unsigned int) io_op < IOOP_NUM_TYPES);
Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
+ /*
+ * Only count PendingWalStats.wal_sync in pg_stat_io because
+ * PendingWalStats.wal_write could count different than what pg_stat_io
+ * counts e.g. system calls. Also, PendingWalStats doesn't count WAL init
+ * or WAL reads. So, they are not included too.
+ */
+ if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+ io_op == IOOP_FSYNC)
+ PendingWalStats.wal_sync += cnt;
+
PendingIOStats.counts[io_object][io_context][io_op] += cnt;
have_iostats = true;
}
+/*
+ * Prepares the io_time for pgstat_count_io_op_time() function. It needs to
+ * return current time if there is a chance that any 'time' can be tracked.
+ */
instr_time
pgstat_prepare_io_time(void)
{
instr_time io_start;
- if (track_io_timing)
+ if (track_io_timing || track_wal_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
else
+
+ /*
+ * If the time won't be tracked, there is no need to set io_start but
+ * compiler complains about uninitialized use. So, set it to zero.
+ */
INSTR_TIME_SET_ZERO(io_start);
return io_start;
}
+/*
+ * Decide if the io timing needs be tracked.
+ */
+bool
+pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
+{
+ /*
+ * io times of IOOBJECT_WAL IOObject needs to be tracked when
+ * 'track_wal_io_timing' is set regardless of 'track_io_timing'.
+ */
+ if (io_object == IOOBJECT_WAL)
+ return track_wal_io_timing;
+
+ return track_io_timing;
+}
+
/*
* Like pgstat_count_io_op_n() except it also accumulates time.
*/
@@ -112,14 +148,35 @@ void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
instr_time start_time, uint32 cnt)
{
- if (track_io_timing)
+ /*
+ * Accumulate different type of times here. We want to eventually
+ * deduplicate these counters, so we are consolidating them first. This
+ * also makes it easy to compare what is tracked for which stats or
+ * instrumentation purpose.
+ *
+ * Some of the IO counters didn't moved here because they track at a
+ * different level of granularity or at a different point in the call
+ * stack. Also, some of the IO counters are moved to pgstat_count_io_op_n
+ * function since they track only numbers not timings.
+ *
+ * pgstat_count_buffer is for pgstat_database. Since pg_stat_database only
+ * counts blk_read_time and blk_write_time, it is set for IOOP_READ and
+ * IOOP_WRITE.
+ *
+ * pgBufferUsage is for EXPLAIN. pgBufferUsage has only write and read
+ * stats for shared/local and temporary blocks. Only shared/local blocks
+ * are counted here.
+ *
+ * At the end of the if case, accumulate time for the pg_stat_io.
+ */
+ if (pgstat_should_track_io_time(io_object, io_context))
{
instr_time io_time;
INSTR_TIME_SET_CURRENT(io_time);
INSTR_TIME_SUBTRACT(io_time, start_time);
- if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+ if (io_op == IOOP_EXTEND || io_op == IOOP_WRITE)
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
@@ -140,6 +197,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
io_time);
}
+ /* The IO timings are counted, now count the IO numbers */
pgstat_count_io_op_n(io_object, io_context, io_op, cnt);
}
@@ -220,12 +278,33 @@ pgstat_get_io_context_name(IOContext io_context)
return "normal";
case IOCONTEXT_VACUUM:
return "vacuum";
+ case IOCONTEXT_INIT:
+ return "init";
}
elog(ERROR, "unrecognized IOContext value: %d", io_context);
pg_unreachable();
}
+/*
+ * op_bytes can change according to IOObject and IOContext.
+ * Return BLCKSZ as default because most of the
+ * IOObject / IOContext uses BLCKSZ.
+ */
+int
+pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context)
+{
+ if (io_object == IOOBJECT_WAL)
+ {
+ if (io_context == IOCONTEXT_NORMAL)
+ return XLOG_BLCKSZ;
+ else if (io_context == IOCONTEXT_INIT)
+ return wal_segment_size;
+ }
+
+ return BLCKSZ;
+}
+
const char *
pgstat_get_io_object_name(IOObject io_object)
{
@@ -235,6 +314,8 @@ pgstat_get_io_object_name(IOObject io_object)
return "relation";
case IOOBJECT_TEMP_RELATION:
return "temp relation";
+ case IOOBJECT_WAL:
+ return "wal";
}
elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -316,10 +397,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_INVALID:
case B_ARCHIVER:
case B_LOGGER:
- case B_WAL_RECEIVER:
- case B_WAL_WRITER:
return false;
+ case B_WAL_RECEIVER:
+ case B_WAL_WRITER:
case B_AUTOVAC_LAUNCHER:
case B_AUTOVAC_WORKER:
case B_BACKEND:
@@ -354,6 +435,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
if (!pgstat_tracks_io_bktype(bktype))
return false;
+ /*
+ * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
+ * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
+ */
+ if (io_object == IOOBJECT_WAL &&
+ (io_context != IOCONTEXT_NORMAL &&
+ io_context != IOCONTEXT_INIT))
+ return false;
+
/*
* Currently, IO on temporary relations can only occur in the
* IOCONTEXT_NORMAL IOContext.
@@ -412,6 +502,7 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
IOContext io_context, IOOp io_op)
{
bool strategy_io_context;
+ bool no_wal_normal_read;
/* if (io_context, io_object) will never collect stats, we're done */
if (!pgstat_tracks_io_object(bktype, io_object, io_context))
@@ -428,6 +519,20 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
return false;
+ /*
+ * Some BackendTypes / IOObjects will not do certain IOOps.
+ */
+ no_wal_normal_read = bktype == B_AUTOVAC_LAUNCHER ||
+ bktype == B_AUTOVAC_WORKER || bktype == B_BACKEND ||
+ bktype == B_BG_WORKER || bktype == B_BG_WRITER ||
+ bktype == B_CHECKPOINTER || bktype == B_WAL_RECEIVER ||
+ bktype == B_WAL_SENDER || bktype == B_WAL_WRITER;
+
+ if (no_wal_normal_read &&
+ (io_object == IOOBJECT_WAL &&
+ io_op == IOOP_READ))
+ return false;
+
/*
* Temporary tables are not logged and thus do not require fsync'ing.
* Writeback is not requested for temporary tables.
@@ -452,6 +557,27 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
if (!strategy_io_context && io_op == IOOP_REUSE)
return false;
+ /*
+ * Some IOOps are not valid in certain IOContexts / IOObjects and some
+ * IOOps are only valid in certain IOContexts / IOObjects.
+ */
+
+ /*
+ * In IOOBJECT_WAL io_object, IOCONTEXT_INIT io_context means operations
+ * done while creating new WAL segments.
+ */
+ if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+ !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+ return false;
+
+ /*
+ * In IOOBJECT_WAL io_object, IOCONTEXT_NORMAL io_context means operations
+ * done on already created WAL segments.
+ */
+ if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+ !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+ return false;
+
/*
* IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
* counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0cea320c00..36cba20856 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1377,6 +1377,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
{
const char *context_name = pgstat_get_io_context_name(io_context);
+ int op_bytes;
Datum values[IO_NUM_COLUMNS] = {0};
bool nulls[IO_NUM_COLUMNS] = {0};
@@ -1395,12 +1396,11 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
/*
- * Hard-code this to the value of BLCKSZ for now. Future
- * values could include XLOG_BLCKSZ, once WAL IO is tracked,
- * and constant multipliers, once non-block-oriented IO (e.g.
- * temporary file IO) is tracked.
+ * op_bytes can change according to IOObject and IOContext.
+ * Get the correct op_bytes.
*/
- values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+ op_bytes = pgstat_get_io_op_bytes(io_obj, io_context);
+ values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
{
@@ -1446,6 +1446,27 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+/*
+ * Return total IOOp time by IOObject and IOContext.
+ */
+static double
+pg_stat_get_io_time(IOObject io_obj, IOContext io_context, IOOp io_op)
+{
+ double sum_time = 0;
+ PgStat_IO *backends_io_stats = pgstat_fetch_stat_io();
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+ if (!pgstat_tracks_io_bktype(bktype) || !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+ continue;
+
+ sum_time += bktype_stats->times[io_obj][io_context][io_op];
+ }
+ return pg_stat_us_to_ms(sum_time);
+}
+
/*
* Returns statistics of WAL activity
*/
@@ -1500,9 +1521,20 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
values[4] = Int64GetDatum(wal_stats->wal_write);
values[5] = Int64GetDatum(wal_stats->wal_sync);
- /* Convert counters from microsec to millisec for display */
- values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
- values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
+ /*
+ * There is no need to calculate timings for both pg_stat_wal and
+ * pg_stat_io. So, fetch timings from pg_stat_io to make stats gathering
+ * cheaper. Note that, since timings are fetched from pg_stat_io;
+ * pg_stat_reset_shared('io') will reset pg_stat_wal's timings too.
+ *
+ * Convert counters from microsec to millisec for display
+ */
+ values[6] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
+ IOCONTEXT_NORMAL,
+ IOOP_WRITE));
+ values[7] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
+ IOCONTEXT_NORMAL,
+ IOOP_FSYNC));
values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f95d8db0c4..1d9428b3a8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -276,9 +276,10 @@ typedef enum IOObject
{
IOOBJECT_RELATION,
IOOBJECT_TEMP_RELATION,
+ IOOBJECT_WAL,
} IOObject;
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
typedef enum IOContext
{
@@ -286,9 +287,10 @@ typedef enum IOContext
IOCONTEXT_BULKWRITE,
IOCONTEXT_NORMAL,
IOCONTEXT_VACUUM,
+ IOCONTEXT_INIT,
} IOContext;
-#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_INIT + 1)
typedef enum IOOp
{
@@ -520,10 +522,12 @@ extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt);
extern instr_time pgstat_prepare_io_time(void);
+extern bool pgstat_should_track_io_time(IOObject io_object, IOContext io_context);
extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
IOOp io_op, instr_time start_time, uint32 cnt);
extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern int pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context);
extern const char *pgstat_get_io_context_name(IOContext io_context);
extern const char *pgstat_get_io_object_name(IOObject io_object);
--
2.43.0
From b0a54f44e4b5699d62c071d87619f19084e5475a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <[email protected]>
Date: Wed, 29 Nov 2023 15:30:03 +0300
Subject: [PATCH v5 4/4] Add IOOBJECT_WAL / IOCONTEXT_NORMAL / read tests
---
src/test/regress/expected/stats.out | 12 ++++++++++++
src/test/regress/sql/stats.sql | 8 ++++++++
2 files changed, 20 insertions(+)
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 4adda9e479..7f5340cd7e 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -881,6 +881,18 @@ SELECT current_setting('fsync') = 'off'
t
(1 row)
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_NORMAL / read.
+-- When the servers starts, StartupXLOG function must be called by postmaster
+-- or standalone-backend startup and WAL read must be done.
+-- So, check these before stats get resetted.
+SELECT SUM(reads) > 0
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal';
+ ?column?
+----------
+ t
+(1 row)
+
-----
-- Test that resetting stats works for reset timestamp
-----
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 72e864a0d2..b13f17f4d2 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -452,6 +452,14 @@ SELECT :io_sum_wal_init_writes > 0;
SELECT current_setting('fsync') = 'off'
OR :io_sum_wal_init_fsyncs > 0;
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_NORMAL / read.
+-- When the servers starts, StartupXLOG function must be called by postmaster
+-- or standalone-backend startup and WAL read must be done.
+-- So, check these before stats get resetted.
+SELECT SUM(reads) > 0
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal';
+
-----
-- Test that resetting stats works for reset timestamp
-----
--
2.43.0