Hi,
On Sun, 31 Dec 2023 at 03:58, Michael Paquier <[email protected]> wrote:
>
> On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:
> > On Tue, 26 Dec 2023 at 13:10, Michael Paquier <[email protected]> wrote:
> >> I am not sure while the whole point of the exercise is to have all the
> >> I/O related data in a single view. Something that I've also found a
> >> bit disturbing yesterday while looking at your patch is the fact that
> >> the operation size is guessed from the context and object type when
> >> querying the view because now everything is tied to BLCKSZ. This
> >> patch extends it with two more operation sizes, and there are even
> >> cases where it may be a variable. Could it be a better option to
> >> extend pgstat_count_io_op_time() so as callers can themselves give the
> >> size of the operation?
> >
> > Do you mean removing the op_bytes column and tracking the number of
> > bytes in reads, writes, and extends? If so, that makes sense to me but
> > I don't want to remove the number of operations; I believe that has a
> > value too. We can extend the pgstat_count_io_op_time() so it can both
> > track the number of bytes and the number of operations.
>
> Apologies if my previous wording sounded confusing. The idea I had in
> mind was to keep op_bytes in pg_stat_io, and extend it so as a value
> of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
> as a number of bytes.
Oh, I understand it now. Yes, that makes sense.
I thought removing op_bytes completely ( as you said "This patch
extends it with two more operation sizes, and there are even cases
where it may be a variable" ) from pg_stat_io view then adding
something like {read | write | extend}_bytes and {read | write |
extend}_calls could be better, so that we don't lose any information.
> > Also, it is not directly related to this patch but vectored IO [1] is
> > coming soon; so the number of operations could be wrong since vectored
> > IO could merge a couple of operations.
>
> Hmm. I have not checked this patch series so I cannot say for sure,
> but we'd likely just want to track the number of bytes if a single
> operation has a non-equal size rather than registering in pg_stat_io N
> rows with different op_bytes, no?
Yes, that is correct.
> I am looping in Thomas Munro in CC for comments.
Thanks for doing that.
> > If we want to entirely disable it, we can add
> >
> > if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> > return;
> >
> > to the top of the pgstat_count_io_op_time() since all IOOBJECT_WAL
> > calls are done by this function, then we can disable it at
> > pgstat_tracks_io_bktype().
>
> Yeah, a limitation like that may be acceptable for now. Tracking the
> WAL writer and WAL sender activities can be relevant in a lot of cases
> even if we don't have the full picture for the WAL receiver yet.
I added that and disabled B_WAL_RECEIVER backend with comments
explaining why. v8 is attached.
--
Regards,
Nazir Bilal Yavuz
Microsoft
From dc25cccc8a5b168f15c9771f174924261b151193 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <[email protected]>
Date: Wed, 3 Jan 2024 15:36:19 +0300
Subject: [PATCH v8] Show WAL stats on pg_stat_io (except streaming
replication)
This patch aims to showing WAL stats per backend on pg_stat_io view.
With this patch, it can be seen how many WAL operations it makes, their
context, types and total timings per backend in pg_stat_io view.
In this path new IOContext IOCONTEXT_INIT is introduced, it is for IO
operations done while creating the things. Currently, it is used only
together with IOObject IOOBJECT_WAL.
IOOBJECT_WAL means IO operations related to WAL.
IOOBJECT_WAL / IOCONTEXT_NORMAL means IO operations done on already
created WAL segments.
IOOBJECT_WAL / IOCONTEXT_INIT means IO operations done while creating
the WAL segments.
This patch currently covers:
- Documentation
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write and fsync stats on
pg_stat_io.
- IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync stats on pg_stat_io.
doesn't cover:
- Streaming replication WAL IO.
---
src/include/catalog/pg_proc.dat | 6 +-
src/include/pgstat.h | 6 +-
src/backend/access/transam/xlog.c | 58 +++++------
src/backend/access/transam/xlogrecovery.c | 10 ++
src/backend/catalog/system_views.sql | 15 ++-
src/backend/utils/activity/pgstat_io.c | 119 ++++++++++++++++++++--
src/backend/utils/adt/pgstatfuncs.c | 24 ++---
src/test/regress/expected/rules.out | 27 +++--
src/test/regress/expected/stats.out | 53 ++++++++++
src/test/regress/sql/stats.sql | 25 +++++
doc/src/sgml/monitoring.sgml | 29 ++++--
11 files changed, 292 insertions(+), 80 deletions(-)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5b67784731a..26197dbb817 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5785,9 +5785,9 @@
{ oid => '1136', descr => 'statistics: information about WAL activity',
proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
proparallel => 'r', prorettype => 'record', proargtypes => '',
- proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
- proargmodes => '{o,o,o,o,o,o,o,o,o}',
- proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
+ proallargtypes => '{int8,int8,numeric,int8,int8,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,stats_reset}',
prosrc => 'pg_stat_get_wal' },
{ oid => '6248', descr => 'statistics: information about WAL prefetching',
proname => 'pg_stat_get_recovery_prefetch', prorows => '1', proretset => 't',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ab91b3b367d..c23970da49f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -279,14 +279,16 @@ typedef enum IOObject
{
IOOBJECT_RELATION,
IOOBJECT_TEMP_RELATION,
+ IOOBJECT_WAL,
} IOObject;
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
typedef enum IOContext
{
IOCONTEXT_BULKREAD,
IOCONTEXT_BULKWRITE,
+ IOCONTEXT_INIT,
IOCONTEXT_NORMAL,
IOCONTEXT_VACUUM,
} IOContext;
@@ -527,6 +529,8 @@ extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
IOOp io_op, instr_time start_time, uint32 cnt);
extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern int pgstat_get_io_op_bytes(IOObject io_object,
+ IOContext io_context);
extern const char *pgstat_get_io_context_name(IOContext io_context);
extern const char *pgstat_get_io_object_name(IOObject io_object);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1264849883b..84b6ca9d1d8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2281,38 +2281,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
Size nbytes;
Size nleft;
ssize_t written;
- instr_time start;
+ instr_time io_start;
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
nbytes = npages * (Size) XLOG_BLCKSZ;
nleft = nbytes;
+
+ io_start = pgstat_prepare_io_time(track_wal_io_timing);
do
{
errno = 0;
- /* Measure I/O timing to write WAL data */
- if (track_wal_io_timing)
- INSTR_TIME_SET_CURRENT(start);
- else
- INSTR_TIME_SET_ZERO(start);
-
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
written = pg_pwrite(openLogFile, from, nleft, startoffset);
pgstat_report_wait_end();
- /*
- * Increment the I/O timing and the number of times WAL data
- * were written out to disk.
- */
- if (track_wal_io_timing)
- {
- instr_time end;
-
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
- }
-
PendingWalStats.wal_write++;
if (written <= 0)
@@ -2337,6 +2321,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
startoffset += written;
} while (nleft > 0);
+ pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+ IOOP_WRITE, io_start, npages);
+
npages = 0;
/*
@@ -3043,6 +3030,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
int fd;
int save_errno;
int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+ instr_time io_start;
Assert(logtli != 0);
@@ -3086,6 +3074,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
(errcode_for_file_access(),
errmsg("could not create file \"%s\": %m", tmppath)));
+ /* start timing writes for stats */
+ io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
save_errno = 0;
if (wal_init_zero)
@@ -3121,6 +3112,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
}
pgstat_report_wait_end();
+ pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+ io_start, 1);
+
if (save_errno)
{
/*
@@ -3137,6 +3131,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
errmsg("could not write to file \"%s\": %m", tmppath)));
}
+ /* start timing fsyncs for stats */
+ io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
if (pg_fsync(fd) != 0)
{
@@ -3149,6 +3146,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
}
pgstat_report_wait_end();
+ pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+ IOOP_FSYNC, io_start, 1);
+
if (close(fd) != 0)
ereport(ERROR,
(errcode_for_file_access(),
@@ -8414,7 +8414,7 @@ void
issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
{
char *msg = NULL;
- instr_time start;
+ instr_time io_start;
Assert(tli != 0);
@@ -8427,11 +8427,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
return;
- /* Measure I/O timing to sync the WAL file */
- if (track_wal_io_timing)
- INSTR_TIME_SET_CURRENT(start);
- else
- INSTR_TIME_SET_ZERO(start);
+ io_start = pgstat_prepare_io_time(track_wal_io_timing);
pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
switch (wal_sync_method)
@@ -8475,16 +8471,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
pgstat_report_wait_end();
- /*
- * Increment the I/O timing and the number of times WAL files were synced.
- */
- if (track_wal_io_timing)
- {
- instr_time end;
-
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
- }
+ pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+ io_start, 1);
PendingWalStats.wal_sync++;
}
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6f4f81f9927..bb42263f2da 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
#include "utils/builtins.h"
#include "utils/datetime.h"
#include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
#include "utils/pg_lsn.h"
#include "utils/ps_status.h"
#include "utils/pg_rusage.h"
@@ -1777,6 +1778,9 @@ PerformWalRecovery(void)
*/
ApplyWalRecord(xlogreader, record, &replayTLI);
+ /* Report pending statistics to the cumulative stats system */
+ pgstat_flush_io(false);
+
/* Exit loop if we reached inclusive recovery target */
if (recoveryStopsAfter(xlogreader))
{
@@ -3252,6 +3256,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
uint32 targetPageOff;
XLogSegNo targetSegNo PG_USED_FOR_ASSERTS_ONLY;
int r;
+ instr_time io_start;
XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3344,6 +3349,8 @@ retry:
/* Read the requested page */
readOff = targetPageOff;
+ io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
if (r != XLOG_BLCKSZ)
@@ -3372,6 +3379,9 @@ retry:
}
pgstat_report_wait_end();
+ pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+ io_start, 1);
+
Assert(targetSegNo == readSegNo);
Assert(targetPageOff == readOff);
Assert(reqLen <= readLen);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 058fc47c919..c57b5428ae5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1172,6 +1172,13 @@ SELECT
FROM pg_stat_get_io() b;
CREATE VIEW pg_stat_wal AS
+ WITH pgsio_sum_write_fsync_time AS (
+ SELECT
+ sum(write_time) as sum_write_time,
+ sum(fsync_time) as sum_fsync_time
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal'
+ )
SELECT
w.wal_records,
w.wal_fpi,
@@ -1179,10 +1186,12 @@ CREATE VIEW pg_stat_wal AS
w.wal_buffers_full,
w.wal_write,
w.wal_sync,
- w.wal_write_time,
- w.wal_sync_time,
+ p.sum_write_time as write_time,
+ p.sum_fsync_time as fsync_time,
w.stats_reset
- FROM pg_stat_get_wal() w;
+ FROM pg_stat_get_wal() w
+ CROSS JOIN pgsio_sum_write_fsync_time p;
+
CREATE VIEW pg_stat_progress_analyze AS
SELECT
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 0dd9b98b3e4..8953a705615 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,6 +16,7 @@
#include "postgres.h"
+#include "access/xlog.h"
#include "executor/instrument.h"
#include "storage/bufmgr.h"
#include "utils/pgstat_internal.h"
@@ -115,6 +116,20 @@ pgstat_prepare_io_time(bool track_io_guc)
return io_start;
}
+/*
+ * Decide if IO timings need to be tracked. Timings associated to
+ * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
+ * else rely on track_io_timing.
+ */
+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+ if (io_object == IOOBJECT_WAL)
+ return track_wal_io_timing;
+
+ return track_io_timing;
+}
+
/*
* Like pgstat_count_io_op_n() except it also accumulates time.
*/
@@ -122,7 +137,28 @@ void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
instr_time start_time, uint32 cnt)
{
- if (track_io_timing)
+ /*
+ * B_WAL_RECEIVER backend does IOOBJECT_WAL IOObject & IOOP_READ IOOp IOs
+ * but these IOs are not countable for now because IOOP_READ IOs' op_bytes
+ * (number of bytes per unit of I/O) might not be the same all the time.
+ * The current implementation requires that the op_bytes must be the same
+ * for the same IOObject, IOContext and IOOp. To avoid confusion, the
+ * B_WAL_RECEIVER backend & IOOBJECT_WAL IOObject IOs are disabled for
+ * now.
+ */
+ if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
+ return;
+
+ /*
+ * Accumulate timing data. pgstat_count_buffer is for pgstat_database. As
+ * pg_stat_database only counts blk_read_time and blk_write_time, it is
+ * set for IOOP_READ and IOOP_WRITE.
+ *
+ * pgBufferUsage is for EXPLAIN. pgBufferUsage has write and read stats
+ * for shared, local and temporary blocks. Temporary blocks are ignored
+ * here.
+ */
+ if (pgstat_should_track_io_time(io_object))
{
instr_time io_time;
@@ -131,7 +167,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
{
- pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+ if (io_object != IOOBJECT_WAL)
+ pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+
if (io_object == IOOBJECT_RELATION)
INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
@@ -139,7 +177,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
}
else if (io_op == IOOP_READ)
{
- pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+ if (io_object != IOOBJECT_WAL)
+ pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+
if (io_object == IOOBJECT_RELATION)
INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
@@ -226,6 +266,8 @@ pgstat_get_io_context_name(IOContext io_context)
return "bulkread";
case IOCONTEXT_BULKWRITE:
return "bulkwrite";
+ case IOCONTEXT_INIT:
+ return "init";
case IOCONTEXT_NORMAL:
return "normal";
case IOCONTEXT_VACUUM:
@@ -236,6 +278,30 @@ pgstat_get_io_context_name(IOContext io_context)
pg_unreachable();
}
+/*
+ * Get the number of bytes associated to an operation depending on
+ * an IOObject and an IOContext.
+ *
+ * Returns BLCKSZ by default because that is what most of the
+ * IOObject/IOContext combinations use. The number of bytes for an
+ * IOOBJECT_WAL depends on the context.
+ */
+int
+pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context)
+{
+ if (io_object == IOOBJECT_WAL)
+ {
+ if (io_context == IOCONTEXT_NORMAL)
+ return XLOG_BLCKSZ;
+ else if (io_context == IOCONTEXT_INIT)
+ return wal_segment_size;
+
+ Assert(false); /* not reachable */
+ }
+
+ return BLCKSZ;
+}
+
const char *
pgstat_get_io_object_name(IOObject io_object)
{
@@ -245,6 +311,8 @@ pgstat_get_io_object_name(IOObject io_object)
return "relation";
case IOOBJECT_TEMP_RELATION:
return "temp relation";
+ case IOOBJECT_WAL:
+ return "wal";
}
elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -306,8 +374,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
-* pg_stat_io for now
+* - WAL Receiver and WAL Summarizer IO are not tracked in pg_stat_io for
+* now.
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -328,7 +396,13 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_ARCHIVER:
case B_LOGGER:
case B_WAL_RECEIVER:
- case B_WAL_WRITER:
+
+ /*
+ * B_WAL_RECEIVER can do IOs but it is disabled for now to avoid
+ * confusion. See comment at the top of the
+ * pgstat_count_io_op_time() function.
+ */
+
case B_WAL_SUMMARIZER:
return false;
@@ -341,6 +415,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_STANDALONE_BACKEND:
case B_STARTUP:
case B_WAL_SENDER:
+ case B_WAL_WRITER:
return true;
}
@@ -366,6 +441,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
if (!pgstat_tracks_io_bktype(bktype))
return false;
+ /*
+ * Currently, IO on IOOBJECT_WAL objects can only occur in the
+ * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContexts.
+ */
+ if (io_object == IOOBJECT_WAL &&
+ (io_context != IOCONTEXT_NORMAL &&
+ io_context != IOCONTEXT_INIT))
+ return false;
+
/*
* Currently, IO on temporary relations can only occur in the
* IOCONTEXT_NORMAL IOContext.
@@ -440,6 +524,13 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
return false;
+ /*
+ * Most BackendTypes don't do reads with IOOBJECT_WAL.
+ */
+ if (io_object == IOOBJECT_WAL && io_op == IOOP_READ &&
+ !(bktype == B_STANDALONE_BACKEND || bktype == B_STARTUP))
+ return false;
+
/*
* Temporary tables are not logged and thus do not require fsync'ing.
* Writeback is not requested for temporary tables.
@@ -464,6 +555,22 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
if (!strategy_io_context && io_op == IOOP_REUSE)
return false;
+ /*
+ * An IOCONTEXT_INIT done for an IOOBJECT_WAL io_object does writes and
+ * syncs.
+ */
+ if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+ !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+ return false;
+
+ /*
+ * An IOCONTEXT_NORMAL done for an IOOBJECT_WAL io_object does writes,
+ * reads and syncs.
+ */
+ if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+ !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+ return false;
+
/*
* IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
* counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e65cbf41e9f..a77c2ce0622 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1395,6 +1395,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
{
const char *context_name = pgstat_get_io_context_name(io_context);
+ int op_bytes;
Datum values[IO_NUM_COLUMNS] = {0};
bool nulls[IO_NUM_COLUMNS] = {0};
@@ -1413,12 +1414,11 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
/*
- * Hard-code this to the value of BLCKSZ for now. Future
- * values could include XLOG_BLCKSZ, once WAL IO is tracked,
- * and constant multipliers, once non-block-oriented IO (e.g.
- * temporary file IO) is tracked.
+ * op_bytes can change according to IOObject and IOContext.
+ * Get the correct op_bytes.
*/
- values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+ op_bytes = pgstat_get_io_op_bytes(io_obj, io_context);
+ values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
{
@@ -1470,7 +1470,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
Datum
pg_stat_get_wal(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_COLS 9
+#define PG_STAT_GET_WAL_COLS 7
TupleDesc tupdesc;
Datum values[PG_STAT_GET_WAL_COLS] = {0};
bool nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1491,11 +1491,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
INT8OID, -1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
INT8OID, -1, 0);
- TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
- FLOAT8OID, -1, 0);
- TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
- FLOAT8OID, -1, 0);
- TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
+ TupleDescInitEntry(tupdesc, (AttrNumber) 7, "stats_reset",
TIMESTAMPTZOID, -1, 0);
BlessTupleDesc(tupdesc);
@@ -1518,11 +1514,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
values[4] = Int64GetDatum(wal_stats->wal_write);
values[5] = Int64GetDatum(wal_stats->wal_sync);
- /* Convert counters from microsec to millisec for display */
- values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
- values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
-
- values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+ values[6] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
/* Returns the record as Datum */
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f645e8486bf..1b9db4e45af 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2225,16 +2225,23 @@ pg_stat_user_tables| SELECT relid,
autoanalyze_count
FROM pg_stat_all_tables
WHERE ((schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (schemaname !~ '^pg_toast'::text));
-pg_stat_wal| SELECT wal_records,
- wal_fpi,
- wal_bytes,
- wal_buffers_full,
- wal_write,
- wal_sync,
- wal_write_time,
- wal_sync_time,
- stats_reset
- FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
+pg_stat_wal| WITH pgsio_sum_write_fsync_time AS (
+ SELECT sum(pg_stat_io.write_time) AS sum_write_time,
+ sum(pg_stat_io.fsync_time) AS sum_fsync_time
+ FROM pg_stat_io
+ WHERE ((pg_stat_io.context = 'normal'::text) AND (pg_stat_io.object = 'wal'::text))
+ )
+ SELECT w.wal_records,
+ w.wal_fpi,
+ w.wal_bytes,
+ w.wal_buffers_full,
+ w.wal_write,
+ w.wal_sync,
+ p.sum_write_time AS write_time,
+ p.sum_fsync_time AS fsync_time,
+ w.stats_reset
+ FROM (pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, stats_reset)
+ CROSS JOIN pgsio_sum_write_fsync_time p);
pg_stat_wal_receiver| SELECT pid,
status,
receive_start_lsn,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 346e10a3d2b..c566ddbdd42 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,33 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
t
(1 row)
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_init_fsyncs > 0;
+ ?column?
+----------
+ t
+(1 row)
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal';
+ ?column?
+----------
+ t
+(1 row)
+
-----
-- Test that resetting stats works for reset timestamp
-----
@@ -1255,6 +1282,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
-- - extends of relations using shared buffers
-- - fsyncs done to ensure the durability of data dirtying shared buffers
-- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
-- There is no test for blocks evicted from shared buffers, because we cannot
-- be sure of the state of shared buffers at the point the test is run.
-- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1264,6 +1292,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
FROM pg_stat_io
WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
CREATE TABLE test_io_shared(a int);
INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
SELECT pg_stat_force_next_flush();
@@ -1301,6 +1332,28 @@ SELECT current_setting('fsync') = 'off'
t
(1 row)
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column?
+----------
+ t
+(1 row)
+
-- Change the tablespace so that the table is rewritten directly, then SELECT
-- from it to cause it to be read back into shared buffers.
SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index e3b4ca96e89..aa6caf7fd7c 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,20 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
FROM pg_stat_get_backend_idset() beid
WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_init_fsyncs > 0;
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal';
+
-----
-- Test that resetting stats works for reset timestamp
-----
@@ -601,6 +615,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
-- - extends of relations using shared buffers
-- - fsyncs done to ensure the durability of data dirtying shared buffers
-- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
-- There is no test for blocks evicted from shared buffers, because we cannot
-- be sure of the state of shared buffers at the point the test is run.
@@ -612,6 +627,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
FROM pg_stat_io
WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
CREATE TABLE test_io_shared(a int);
INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
SELECT pg_stat_force_next_flush();
@@ -630,6 +648,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
SELECT current_setting('fsync') = 'off'
OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+ FROM pg_stat_io
+ WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
-- Change the tablespace so that the table is rewritten directly, then SELECT
-- from it to cause it to be read back into shared buffers.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b804eb8b5ef..a721f7b3459 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2499,9 +2499,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</para>
<para>
- Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
- relation I/O which bypasses shared buffers (e.g. when moving a table from one
- tablespace to another) is currently not tracked.
+ Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+ tracked. However, relation I/O which bypasses shared buffers
+ (e.g. when moving a table from one tablespace to another) is currently
+ not tracked.
</para>
<table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2554,6 +2555,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
<literal>temp relation</literal>: Temporary relations.
</para>
</listitem>
+ <listitem>
+ <para>
+ <literal>wal</literal>: Write Ahead Logs.
+ </para>
+ </listitem>
</itemizedlist>
</para>
</entry>
@@ -2578,6 +2584,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
<literal>normal</literal>.
</para>
</listitem>
+ <listitem>
+ <para>
+ <literal>init</literal>: I/O operations performed while creating the
+ WAL segments are tracked in <varname>context</varname>
+ <literal>init</literal>.
+ </para>
+ </listitem>
<listitem>
<para>
<literal>vacuum</literal>: I/O operations performed outside of shared
@@ -4767,7 +4780,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
<listitem>
<para>
<literal>io</literal>: Reset all the counters shown in the
- <structname>pg_stat_io</structname> view.
+ <structname>pg_stat_io</structname> view. This resets the timing
+ counter stored in the <structname>pg_stat_wal</structname> view
+ too.
</para>
</listitem>
<listitem>
@@ -4784,8 +4799,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</listitem>
<listitem>
<para>
- <literal>wal</literal>: Reset all the counters shown in the
- <structname>pg_stat_wal</structname> view.
+ <literal>wal</literal>: Reset all the counters except timings shown
+ in the <structname>pg_stat_wal</structname> view. The timing
+ counters can be reset by calling pg_stat_reset_shared set to
+ <literal>io</literal>.
</para>
</listitem>
<listitem>
--
2.43.0