Hi,

Thanks for the review!

Current status of the patch is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added.
- IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats are added.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned in
pg_stat_io.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time /
PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() /
pgstat_count_io_op_time() respectively.

TODOs:
- Documentation.
- Try to set op_bytes for BackendType / IOContext.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests.
- Add IOOBJECT_WAL / IOCONTEXT_INIT tests.

I am adding tracking of BackendType / IOContext / IOOp as tables, empty
cell means it is not decided yet:

IOCONTEXT_NORMAL / Backend / IOOp table:

╔═════════════════════╦═══════╦═══════╦═══════╗
║ IOCONTEXT_NORMAL    ║ read  ║ write ║ fsync ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ autovacuum launcher ║ FALSE ║ FALSE ║ FALSE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ autovacuum worker   ║ FALSE ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ client backend      ║       ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ background worker   ║       ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ background writer   ║       ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ checkpointer        ║       ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ standalone backend  ║  TRUE ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ startup             ║  TRUE ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walreceiver         ║       ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walsender           ║       ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walwriter           ║       ║  TRUE ║  TRUE ║
╚═════════════════════╩═══════╩═══════╩═══════╝


IOCONTEXT_WAL_INIT / Backend / IOOp table:

╔═════════════════════╦═══════╦═══════╗
║ IOCONTEXT_WAL_INIT  ║ write ║ fsync ║
╠═════════════════════╬═══════╬═══════╣
║ autovacuum launcher ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ autovacuum worker   ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ client backend      ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╣
║ background worker   ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ background writer   ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ checkpointer        ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ standalone backend  ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╣
║ startup             ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ walreceiver         ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ walsender           ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ walwriter           ║       ║       ║
╚═════════════════════╩═══════╩═══════╝


On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplage...@gmail.com>
wrote:
>
> > On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
> > <melanieplage...@gmail.com> wrote:
> > > I think it would be good to count WAL reads even though they are not
> > > currently represented in pg_stat_wal. Here is a thread discussing this
> > > [1].
> >
> > I used the same implementation in the thread link [1]. I added 'WAL
> > read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
> > xlogreader.c and walsender.c because they cause some failures on:
> > '!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
> > these. Also, I added Bharath to CC. I have a question about 'WAL
> > read':
> > 1. There are two places where 'WAL read' happens.
> > a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
> > time count is equal to XLOG_BLCKSZ but there are some cases it is not.
> > For example
> > - in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
> > - in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
> > These nbytes and count variables could be different from XLOG_BLCKSZ.
> >
> > b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ
bytes:
> > pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
> >
> > So, what should op_bytes be set to for 'WAL read' operations?
>
> If there is any combination of BackendType and IOContext which will
> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> tracking reads and write IOOps in number of bytes (instead of number of
> pages). I don't actually know if there is a clear separation by
> BackendType for these different cases.

I agree. I will edit that later, added to TODOs.

>
> The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
> treat op_bytes * number of reads as an approximation of the number of
> bytes read. I don't actually know what makes more sense. I don't think I
> would like having a number for bytes that is not accurate.

Yes, the prior one makes more sense to me.

>
> > Should I use IsBootstrapProcessingMode() to enable WAL io timings at
> > the initdb or are they not that much important?
>
> I don't have an opinion about this. I can see an argument for doing it
> either way. We do track other IO during initdb in pg_stat_io.

I didn't add it for now. It is an easy change, it could be added later.

>
> As an additional TODO, I would explore adding some tests to prevent
> accidental removal of the pg_stat_io WAL tracking.
>
> I think we can easily test IOCONTEXT_NORMAL WAL writes in
> src/test/regress/sql/stats.sql (perhaps it is worth checking that
> synchronous_commit is on in the test). IOCONTEXT_NORMAL WAL fsyncs
> should again be easy to test if synchronous_commit is on and fsync is
> on.
>
> I'm not sure how to reliably test WAL reads (given timing). Logically,
> you can sum WAL reads before a crash is initiated in one of the tests in
> the recovery suite, and then sum them after the db has restarted and
> there should definitely be an increase in WAL reads, but I don't know if
> we need to do something to guarantee that there will have been WAL reads
> (to avoid test flakes).
>
> I'm also not sure how to reliably test any IOCONTEXT_INIT operations. We
> need a before and after and I can't think of a cheap operation to ensure
> a new WAL segment is written to or fsyncd in between a before and after
> for the purposes of testing.

IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
For the IOCONTEXT_NORMAL reads and IOCONTEXT_INIT tests, I couldn't find a
way to avoid test flakes. I am open to suggestions. I added these to TODOs.

>
> > ---
> > diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
> > index 60c0b7ec3af..ee7b85e18ca 100644
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -2245,6 +2229,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli,
bool flexible)
> >                               startoffset += written;
> >                       } while (nleft > 0);
> >
>
> I'm not sure if the right location is here or in
> pgstat_count_io_op_time(), but I would explain why you did not move
> PendingWalStats.wal_writes counter into pg_stat_io code (and why you did
> move the other PendingWalStats counters there.
>
> > +                     pgstat_count_io_op_time(IOOBJECT_WAL,
IOCONTEXT_NORMAL,
> > +
IOOP_WRITE, io_start, npages);
> > +
> >                       npages = 0;
> >
> >                       /*
> > @@ -2938,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno,
TimeLineID logtli,
> >       int                     fd;
> >       int                     save_errno;
> >       int                     open_flags = O_RDWR | O_CREAT | O_EXCL |
PG_BINARY;
> > +     instr_time      io_start;
> >
> >       Assert(logtli != 0);
> >
> > @@ -2981,6 +2969,8 @@ XLogFileInitInternal(XLogSegNo logsegno,
TimeLineID logtli,
> >                               (errcode_for_file_access(),
> >                                errmsg("could not create file \"%s\":
%m", tmppath)));
> >
>
> Since you have two calls to pgstat_prepare_io_time() in this function, I
> think it would be nice to have a comment above each to the effect of
> "start timing writes for stats" and "start timing fsyncs for stats"

Done.

>
> > +     io_start = pgstat_prepare_io_time();
> > +
> >       pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
>
> > diff --git a/src/backend/access/transam/xlogrecovery.c
b/src/backend/access/transam/xlogrecovery.c
> > index becc2bda62e..ee850af5514 100644
> > --- a/src/backend/access/transam/xlogrecovery.c
> > +++ b/src/backend/access/transam/xlogrecovery.c
> > @@ -1587,6 +1587,7 @@ PerformWalRecovery(void)
> >       XLogRecord *record;
> >       bool            reachedRecoveryTarget = false;
> >       TimeLineID      replayTLI;
> > +     uint32          pgstat_report_wal_frequency = 0;
> >
> >       /*
> >        * Initialize shared variables for tracking progress of WAL
replay, as if
> > @@ -1745,6 +1746,16 @@ PerformWalRecovery(void)
> >                        */
> >                       ApplyWalRecord(xlogreader, record, &replayTLI);
> >
> > +                     /*
> > +                      * Report pending statistics to the cumulative
stats system once
> > +                      * every PGSTAT_REPORT_FREQUENCY times to not
hinder performance.
> > +                      */
> > +                     if (pgstat_report_wal_frequency++ ==
PGSTAT_REPORT_FREQUENCY)
> > +                     {
> > +                             pgstat_report_wal(false);
> > +                             pgstat_report_wal_frequency = 0;
> > +                     }
> > +
>
> Is the above needed for your patch to work? What does it do? It should
> probably be in a separate commit and should definitely have an
> explanation.

Done, I omit that part.

>
> > --- a/src/backend/utils/activity/pgstat_io.c
> > +++ b/src/backend/utils/activity/pgstat_io.c
> > @@ -87,17 +87,25 @@ pgstat_count_io_op_n(IOObject io_object, IOContext
io_context, IOOp io_op, uint3
> >       Assert((unsigned int) io_op < IOOP_NUM_TYPES);
> >       Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context,
io_op));
>
> I would add a comment here explaining that pg_stat_wal doesn't count WAL
> init or WAL reads.

Done.

>
> > +     if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> > +        io_op == IOOP_FSYNC)
> > +             PendingWalStats.wal_sync += cnt;
> > +
> >       PendingIOStats.counts[io_object][io_context][io_op] += cnt;
> >
> >       have_iostats = true;
> >  }
>
> > +/*
> > + * Prepares io_time for pgstat_count_io_op_time() function. It needs
to return
> > + * current time if there is a chance that any 'time' can be tracked.
> > + */
> >  instr_time
> >  pgstat_prepare_io_time(void)
> >  {
> >       instr_time      io_start;
> >
> > -     if (track_io_timing)
> > +     if(track_io_timing || track_wal_io_timing)
> >               INSTR_TIME_SET_CURRENT(io_start);
> >       else
> >               INSTR_TIME_SET_ZERO(io_start);
>
> Since you asked me off-list why we had to do INSTR_TIME_SET_ZERO() and I
> couldn't remember, it is probably worth a comment.

Done.

>
> >  pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp
io_op,
> >                                               instr_time start_time,
uint32 cnt)
> >  {
> > -     if (track_io_timing)
> > +     if (pgstat_should_track_io_time(io_object, io_context))
> >       {
> >               instr_time      io_time;
> >
> > @@ -124,6 +148,9 @@ pgstat_count_io_op_time(IOObject io_object,
IOContext io_context, IOOp io_op,
> >
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
>
> Now that we are adding more if statements to this function, I think we
> should start adding more comments.
>
> We should explain what the different counters here are for e.g.
> pgBufferUsage for EXPLAIN, PendingWalStats for pg_stat_wal.
>
> We should also explain what is tracked for each and why it differs --
> e.g. some track time and some don't, some track only reads or writes,
> etc.
>
> Also we should mention why we are consolidating them here. That is, we
> want to eventually deduplicate these counters, so we are consolidating
> them first. This also makes it easy to compare what is tracked for which
> stats or instrumentation purpose.
>
> And for those IO counters that we haven't moved here, we should mention
> it is because they track at a different level of granularity or at a
> different point in the call stack.

Done.

>
> >                       if (io_object == IOOBJECT_RELATION)
> >
INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> > +                     /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on
PendingWalStats */
> > +                     else if (io_object == IOOBJECT_WAL && io_context
== IOCONTEXT_NORMAL)
> > +
INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time);
> >               }
>
>
> Also, I would reorder the if statements to be in order of the enum
> values (e.g. FSYNC, READ, WRITE).

Done.

>
> >               else if (io_op == IOOP_READ)
> >               {
> > @@ -131,6 +158,12 @@ pgstat_count_io_op_time(IOObject io_object,
IOContext io_context, IOOp io_op,
> >                       if (io_object == IOOBJECT_RELATION)
> >
INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
> >               }
> > +             else if (io_op == IOOP_FSYNC)
> > +             {
> > +                     /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on
PendingWalStats */
>
> I wouldn't squeeze this comment here like this. It is hard to read

Done.

>
> > +                     if (io_object == IOOBJECT_WAL && io_context ==
IOCONTEXT_NORMAL)
> > +
INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time);
>
>
> > + * op_bytes can change according to IOObject and IOContext.
> > + * Return BLCKSZ as default.
> > + */
> > +int
> > +pgstat_get_io_op_btyes(IOObject io_object, IOContext io_context)
> > +{
>
> Small typo in function name:
> pgstat_get_io_op_btyes -> pgstat_get_io_op_bytes
> I'd also mention why BLCKSZ is the default

Done.

>
> > +     if (io_object == IOOBJECT_WAL)
> > +     {
> > +             if (io_context == IOCONTEXT_NORMAL)
> > +                     return XLOG_BLCKSZ;
> > +             else if (io_context == IOCONTEXT_INIT)
> > +                     return wal_segment_size;
> > +     }
> > +
> > +     return BLCKSZ;
> > +}
>
> > @@ -350,6 +405,15 @@ pgstat_tracks_io_object(BackendType bktype,
IOObject io_object,
> >       if (!pgstat_tracks_io_bktype(bktype))
> >               return false;
> >
> > +     /*
> > +      * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
> > +      * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
> > +      */
> > +     if (io_object == IOOBJECT_WAL &&
> > +             (io_context != IOCONTEXT_NORMAL &&
>
> Little bit of errant whitespace here.

Done.

>
> >       /*
> >        * Currently, IO on temporary relations can only occur in the
> >        * IOCONTEXT_NORMAL IOContext.
> > @@ -439,6 +503,14 @@ pgstat_tracks_io_op(BackendType bktype, IOObject
io_object,
> >       if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
> >               return false;
>
> I would expand on the comment to explain what NORMAL is for WAL -- what
> we consider normal to be and why. And why it is different than INIT.

Done.

>
> >
> > +     if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
> > +        !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
> > +        return false;
> > +
> > +     if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> > +        !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op ==
IOOP_FSYNC))
> > +        return false;
>
> These are the first "bans" that we have for an IOOp for a specific
> combination of io_context and io_object. We should add a new comment for
> this and perhaps consider what ordering makes most sense. I tried to
> organize the bans from most broad to most specific at the bottom.

Done.

>
> >
> > --- a/src/backend/utils/adt/pgstatfuncs.c
> > +++ b/src/backend/utils/adt/pgstatfuncs.c
> > @@ -1409,7 +1410,8 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
> >                                * and constant multipliers, once
non-block-oriented IO (e.g.
> >                                * temporary file IO) is tracked.
> >                                */
> > -                             values[IO_COL_CONVERSION] =
Int64GetDatum(BLCKSZ);
>
> There's a comment above this in the code that says this is hard-coded to
> BLCKSZ. That comment needs to be updated or removed (in lieu of the
> comment in your pgstat_get_io_op_bytes() function).

Done.

>
>
> > +                             op_bytes = pgstat_get_io_op_btyes(io_obj,
io_context);
> > +                             values[IO_COL_CONVERSION] =
Int64GetDatum(op_bytes);
> >
>
> > +extern PGDLLIMPORT bool track_wal_io_timing;
> > +extern PGDLLIMPORT int wal_segment_size;
>
> These shouldn't be in two places (i.e. they are already in xlog.h and
> you added them in pgstat.h. pg_stat_io.c includes bufmgr.h for
> track_io_timing, so you can probably justify including xlog.h.

Done.

Any kind of feedback would be appreciated.

Regards,
Nazir Bilal Yavuz
Microsoft
From 8a06deb704b51010f36a448edf37dfdc1c829d43 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavu...@gmail.com>
Date: Thu, 14 Sep 2023 12:44:29 +0300
Subject: [PATCH v3 2/2] Add IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync
 tests

---
 src/test/regress/expected/stats.out | 25 +++++++++++++++++++++++++
 src/test/regress/sql/stats.sql      | 10 ++++++++++
 2 files changed, 35 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 94187e59cfb..98f94b189ea 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1246,6 +1246,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1283,6 +1286,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 1e21e55c6d9..c91ec43f039 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -603,6 +603,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_io
   WHERE object = 'relation' \gset io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -621,6 +624,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+ OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
-- 
2.40.1

From fbbd64e8bf53c0ab149bac7970df9b4c1ac92589 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavu...@gmail.com>
Date: Tue, 19 Sep 2023 17:56:56 +0300
Subject: [PATCH v3 1/2] Show WAL stats (except streaming replication WAL) on
 pg_stat_io

This patch aims to showing WAL stats per backend on pg_stat_io view.

With this patch, it can be seen how many WAL operations it makes, their
context, types and total timings per backend in pg_stat_io view.

This patchset currently covers:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync.
- IOOBJECT_WAL / IOCONTEXT_INIT write and fsync.

doesn't cover:
- Streaming replication WAL IO.
---
 src/backend/access/transam/xlog.c         |  60 ++++------
 src/backend/access/transam/xlogrecovery.c |   9 ++
 src/backend/utils/activity/pgstat_io.c    | 137 ++++++++++++++++++++--
 src/backend/utils/adt/pgstatfuncs.c       |  10 +-
 src/include/pgstat.h                      |   8 +-
 5 files changed, 172 insertions(+), 52 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529b..5c0da34bc45 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2200,38 +2200,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time();
 			do
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
-
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-				}
-
 				PendingWalStats.wal_write++;
 
 				if (written <= 0)
@@ -2256,6 +2240,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+			                        IOOP_WRITE, io_start, npages);
+
 			npages = 0;
 
 			/*
@@ -2949,6 +2936,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -2992,6 +2980,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	/* start timing writes for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3027,6 +3018,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+	                        io_start, 1);
+
 	if (save_errno)
 	{
 		/*
@@ -3043,6 +3037,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	/* start timing fsyncs for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3055,6 +3052,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+	                        IOOP_FSYNC, io_start, 1);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8186,7 +8186,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8199,11 +8199,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		sync_method == SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
@@ -8247,18 +8243,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
-	}
-
-	PendingWalStats.wal_sync++;
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+	                        io_start, 1);
 }
 
 /*
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62e..ca1041c5fd7 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1745,6 +1745,9 @@ PerformWalRecovery(void)
 			 */
 			ApplyWalRecord(xlogreader, record, &replayTLI);
 
+			/*  Report pending statistics to the cumulative stats system  */
+			pgstat_report_wal(false);
+
 			/* Exit loop if we reached inclusive recovery target */
 			if (recoveryStopsAfter(xlogreader))
 			{
@@ -3220,6 +3223,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3312,6 +3316,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3340,6 +3346,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+	                        io_start, 1);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d4225..d5345ba2536 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -18,6 +18,7 @@
 
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
+#include "access/xlog.h"
 #include "utils/pgstat_internal.h"
 
 
@@ -87,24 +88,58 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
+	/*
+	 * Only count PendingWalStats.wal_sync in pg_stat_io because
+	 * PendingWalStats.wal_write could count different than what pg_stat_io
+	 * counts e.g. system calls. Also, PendingWalStats doesn't count WAL init
+	 * or WAL reads. So, they are not included too.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+	    io_op == IOOP_FSYNC)
+		PendingWalStats.wal_sync += cnt;
+
 	PendingIOStats.counts[io_object][io_context][io_op] += cnt;
 
 	have_iostats = true;
 }
 
+/*
+ * Prepares io_time for pgstat_count_io_op_time() function. It needs to return
+ * current time if there is a chance that any 'time' can be tracked.
+ */
 instr_time
 pgstat_prepare_io_time(void)
 {
 	instr_time	io_start;
 
-	if (track_io_timing)
+	if (track_io_timing || track_wal_io_timing)
 		INSTR_TIME_SET_CURRENT(io_start);
 	else
+		/*
+		 * If time won't be tracked, there is no need to set io_start but
+		 * compiler complains about uninitialized use. So, set it to zero.
+		 */
 		INSTR_TIME_SET_ZERO(io_start);
 
 	return io_start;
 }
 
+/*
+ * Decide if the io timing needs be tracked
+ */
+bool
+pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
+{
+	/*
+	 * io times of IOOBJECT_WAL IOObject needs to be tracked when
+	 * 'track_wal_io_timing' is set regardless of 'track_io_timing'.
+	 */
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op_n() except it also accumulates time.
  */
@@ -112,18 +147,40 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt)
 {
-	if (track_io_timing)
+	/*
+	 * Accumulate different type of times here. We want to eventually
+	 * deduplicate these counters, so we are consolidating them first.
+	 * This also makes it easy to compare what is tracked for which
+	 * stats or instrumentation purpose.
+	 *
+	 * Some IO counters didn't moved here because they track at a different
+	 * level of granularity or at a different point in the call stack.
+	 *
+	 * pgstat_count_buffer is for pgstat_database. Since pg_stat_database only
+	 * counts blk_read_time and blk_write_time, it is set for IOOP_READ and
+	 * IOOP_WRITE.
+	 *
+	 * pgBufferUsage is for EXPLAIN. pgBufferUsage has only write and read stats
+	 * for shared/local and temporary blocks. Only shared/local blocks are counted
+	 * here.
+	 *
+	 * PendingWalStats is for counting WAL stats on pg_stat_wal. It
+	 * only counts IOOBJECT_WAL / IOCONTEXT_NORMAL.
+	 *
+	 * At the end of the if case, accumulate time for pg_stat_io.
+	 */
+	if (pgstat_should_track_io_time(io_object, io_context))
 	{
 		instr_time	io_time;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-		if (io_op == IOOP_WRITE)
+		if (io_op == IOOP_FSYNC)
 		{
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-			if (io_object == IOOBJECT_RELATION)
-				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+			/* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
+			if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
+				INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time);
 		}
 		else if (io_op == IOOP_READ)
 		{
@@ -131,11 +188,22 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			if (io_object == IOOBJECT_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
 		}
+		else if (io_op == IOOP_WRITE)
+		{
+			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+
+			/* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
+			else if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
+				INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time);
+		}
 
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
 	}
 
+	/* IO OP times are counted, now count IO OPs */
 	pgstat_count_io_op_n(io_object, io_context, io_op, cnt);
 }
 
@@ -216,12 +284,33 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "normal";
 		case IOCONTEXT_VACUUM:
 			return "vacuum";
+		case IOCONTEXT_INIT:
+			return "init";
 	}
 
 	elog(ERROR, "unrecognized IOContext value: %d", io_context);
 	pg_unreachable();
 }
 
+/*
+ * op_bytes can change according to IOObject and IOContext.
+ * Return BLCKSZ as default because most of the
+ * IOObject / IOContext uses BLCKSZ.
+ */
+int
+pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context)
+{
+	if (io_object == IOOBJECT_WAL)
+	{
+		if (io_context == IOCONTEXT_NORMAL)
+			return XLOG_BLCKSZ;
+		else if (io_context == IOCONTEXT_INIT)
+			return wal_segment_size;
+	}
+
+	return BLCKSZ;
+}
+
 const char *
 pgstat_get_io_object_name(IOObject io_object)
 {
@@ -231,6 +320,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -312,10 +403,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_INVALID:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
 			return false;
 
+		case B_WAL_RECEIVER:
+		case B_WAL_WRITER:
 		case B_AUTOVAC_LAUNCHER:
 		case B_AUTOVAC_WORKER:
 		case B_BACKEND:
@@ -350,6 +441,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+	    (io_context != IOCONTEXT_NORMAL &&
+	     io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -448,6 +548,27 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * Some IOOps are not valid in certain IOContexts & IOObjects and some
+	 * IOOps are only valid in certain IOContexts & IOObjects.
+	 */
+
+	/*
+	 * In IOOBJECT_WAL io_object, IOCONTEXT_INIT io_context means
+	 * operations done while creating new WAL segments.
+	 */
+	if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+	   !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+	   return false;
+
+	/*
+	 * In IOOBJECT_WAL io_object, IOCONTEXT_NORMAL io_context means
+	 * operations done on already created WAL segments.
+	 */
+	if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+	   !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+	   return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 3b44af80066..18bff382363 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1383,6 +1383,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 			{
 				const char *context_name = pgstat_get_io_context_name(io_context);
+				int			op_bytes;
 
 				Datum		values[IO_NUM_COLUMNS] = {0};
 				bool		nulls[IO_NUM_COLUMNS] = {0};
@@ -1401,12 +1402,11 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 				values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time);
 
 				/*
-				 * Hard-code this to the value of BLCKSZ for now. Future
-				 * values could include XLOG_BLCKSZ, once WAL IO is tracked,
-				 * and constant multipliers, once non-block-oriented IO (e.g.
-				 * temporary file IO) is tracked.
+				 * op_bytes can change according to IOObject and IOContext.
+				 * Get the correct op_bytes.
 				 */
-				values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
+				op_bytes = pgstat_get_io_op_bytes(io_obj, io_context);
+				values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 57a2c0866a2..21b6b8ca31c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -277,9 +277,10 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
@@ -287,9 +288,10 @@ typedef enum IOContext
 	IOCONTEXT_BULKWRITE,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
+	IOCONTEXT_INIT,
 } IOContext;
 
-#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
+#define IOCONTEXT_NUM_TYPES (IOCONTEXT_INIT + 1)
 
 typedef enum IOOp
 {
@@ -521,10 +523,12 @@ extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
 extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt);
 extern instr_time pgstat_prepare_io_time(void);
+extern bool pgstat_should_track_io_time(IOObject io_object, IOContext io_context);
 extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 									IOOp io_op, instr_time start_time, uint32 cnt);
 
 extern PgStat_IO *pgstat_fetch_stat_io(void);
+extern int pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
 
-- 
2.40.1

Reply via email to