Hi, Thanks for the review!
Current status of the patch is: - IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added. - IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added. - IOOBJECT_WAL / IOCONTEXT_INIT stats are added. - pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations. - Working on which 'BackendType / IOContext / IOOp' should be banned in pg_stat_io. - PendingWalStats.wal_sync and PendingWalStats.wal_write_time / PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() / pgstat_count_io_op_time() respectively. TODOs: - Documentation. - Try to set op_bytes for BackendType / IOContext. - Decide which 'BackendType / IOContext / IOOp' should not be tracked. - Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests. - Add IOOBJECT_WAL / IOCONTEXT_INIT tests. I am adding tracking of BackendType / IOContext / IOOp as tables, empty cell means it is not decided yet: IOCONTEXT_NORMAL / Backend / IOOp table: ╔═════════════════════╦═══════╦═══════╦═══════╗ ║ IOCONTEXT_NORMAL ║ read ║ write ║ fsync ║ ╠═════════════════════╬═══════╬═══════╬═══════╣ ║ autovacuum launcher ║ FALSE ║ FALSE ║ FALSE ║ ╠═════════════════════╬═══════╬═══════╬═══════╣ ║ autovacuum worker ║ FALSE ║ TRUE ║ TRUE ║ ╠═════════════════════╬═══════╬═══════╬═══════╣ ║ client backend ║ ║ TRUE ║ TRUE ║ ╠═════════════════════╬═══════╬═══════╬═══════╣ ║ background worker ║ ║ ║ ║ ╠═════════════════════╬═══════╬═══════╬═══════╣ ║ background writer ║ ║ TRUE ║ TRUE ║ ╠═════════════════════╬═══════╬═══════╬═══════╣ ║ checkpointer ║ ║ TRUE ║ TRUE ║ ╠═════════════════════╬═══════╬═══════╬═══════╣ ║ standalone backend ║ TRUE ║ TRUE ║ TRUE ║ ╠═════════════════════╬═══════╬═══════╬═══════╣ ║ startup ║ TRUE ║ ║ ║ ╠═════════════════════╬═══════╬═══════╬═══════╣ ║ walreceiver ║ ║ ║ ║ ╠═════════════════════╬═══════╬═══════╬═══════╣ ║ walsender ║ ║ ║ ║ ╠═════════════════════╬═══════╬═══════╬═══════╣ ║ walwriter ║ ║ TRUE ║ TRUE ║ ╚═════════════════════╩═══════╩═══════╩═══════╝ IOCONTEXT_WAL_INIT / Backend / IOOp table: ╔═════════════════════╦═══════╦═══════╗ ║ IOCONTEXT_WAL_INIT ║ write ║ fsync ║ ╠═════════════════════╬═══════╬═══════╣ ║ autovacuum launcher ║ ║ ║ ╠═════════════════════╬═══════╬═══════╣ ║ autovacuum worker ║ ║ ║ ╠═════════════════════╬═══════╬═══════╣ ║ client backend ║ TRUE ║ TRUE ║ ╠═════════════════════╬═══════╬═══════╣ ║ background worker ║ ║ ║ ╠═════════════════════╬═══════╬═══════╣ ║ background writer ║ ║ ║ ╠═════════════════════╬═══════╬═══════╣ ║ checkpointer ║ ║ ║ ╠═════════════════════╬═══════╬═══════╣ ║ standalone backend ║ TRUE ║ TRUE ║ ╠═════════════════════╬═══════╬═══════╣ ║ startup ║ ║ ║ ╠═════════════════════╬═══════╬═══════╣ ║ walreceiver ║ ║ ║ ╠═════════════════════╬═══════╬═══════╣ ║ walsender ║ ║ ║ ╠═════════════════════╬═══════╬═══════╣ ║ walwriter ║ ║ ║ ╚═════════════════════╩═══════╩═══════╝ On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplage...@gmail.com> wrote: > > > On Sat, 22 Jul 2023 at 01:30, Melanie Plageman > > <melanieplage...@gmail.com> wrote: > > > I think it would be good to count WAL reads even though they are not > > > currently represented in pg_stat_wal. Here is a thread discussing this > > > [1]. > > > > I used the same implementation in the thread link [1]. I added 'WAL > > read' to only xlogrecovery.c for now. I didn't add 'WAL read' to > > xlogreader.c and walsender.c because they cause some failures on: > > '!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on > > these. Also, I added Bharath to CC. I have a question about 'WAL > > read': > > 1. There are two places where 'WAL read' happens. > > a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the > > time count is equal to XLOG_BLCKSZ but there are some cases it is not. > > For example > > - in XLogSendPhysical() in walsender.c WALRead() is called by nbytes > > - in WALDumpReadPage() in pg_waldump.c WALRead() is called by count > > These nbytes and count variables could be different from XLOG_BLCKSZ. > > > > b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ bytes: > > pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff); > > > > So, what should op_bytes be set to for 'WAL read' operations? > > If there is any combination of BackendType and IOContext which will > always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's > op_bytes. For other cases, we may have to consider using op_bytes 1 and > tracking reads and write IOOps in number of bytes (instead of number of > pages). I don't actually know if there is a clear separation by > BackendType for these different cases. I agree. I will edit that later, added to TODOs. > > The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and > treat op_bytes * number of reads as an approximation of the number of > bytes read. I don't actually know what makes more sense. I don't think I > would like having a number for bytes that is not accurate. Yes, the prior one makes more sense to me. > > > Should I use IsBootstrapProcessingMode() to enable WAL io timings at > > the initdb or are they not that much important? > > I don't have an opinion about this. I can see an argument for doing it > either way. We do track other IO during initdb in pg_stat_io. I didn't add it for now. It is an easy change, it could be added later. > > As an additional TODO, I would explore adding some tests to prevent > accidental removal of the pg_stat_io WAL tracking. > > I think we can easily test IOCONTEXT_NORMAL WAL writes in > src/test/regress/sql/stats.sql (perhaps it is worth checking that > synchronous_commit is on in the test). IOCONTEXT_NORMAL WAL fsyncs > should again be easy to test if synchronous_commit is on and fsync is > on. > > I'm not sure how to reliably test WAL reads (given timing). Logically, > you can sum WAL reads before a crash is initiated in one of the tests in > the recovery suite, and then sum them after the db has restarted and > there should definitely be an increase in WAL reads, but I don't know if > we need to do something to guarantee that there will have been WAL reads > (to avoid test flakes). > > I'm also not sure how to reliably test any IOCONTEXT_INIT operations. We > need a before and after and I can't think of a cheap operation to ensure > a new WAL segment is written to or fsyncd in between a before and after > for the purposes of testing. IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added. For the IOCONTEXT_NORMAL reads and IOCONTEXT_INIT tests, I couldn't find a way to avoid test flakes. I am open to suggestions. I added these to TODOs. > > > --- > > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c > > index 60c0b7ec3af..ee7b85e18ca 100644 > > --- a/src/backend/access/transam/xlog.c > > +++ b/src/backend/access/transam/xlog.c > > @@ -2245,6 +2229,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) > > startoffset += written; > > } while (nleft > 0); > > > > I'm not sure if the right location is here or in > pgstat_count_io_op_time(), but I would explain why you did not move > PendingWalStats.wal_writes counter into pg_stat_io code (and why you did > move the other PendingWalStats counters there. > > > + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, > > + IOOP_WRITE, io_start, npages); > > + > > npages = 0; > > > > /* > > @@ -2938,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, > > int fd; > > int save_errno; > > int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY; > > + instr_time io_start; > > > > Assert(logtli != 0); > > > > @@ -2981,6 +2969,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, > > (errcode_for_file_access(), > > errmsg("could not create file \"%s\": %m", tmppath))); > > > > Since you have two calls to pgstat_prepare_io_time() in this function, I > think it would be nice to have a comment above each to the effect of > "start timing writes for stats" and "start timing fsyncs for stats" Done. > > > + io_start = pgstat_prepare_io_time(); > > + > > pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE); > > > diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c > > index becc2bda62e..ee850af5514 100644 > > --- a/src/backend/access/transam/xlogrecovery.c > > +++ b/src/backend/access/transam/xlogrecovery.c > > @@ -1587,6 +1587,7 @@ PerformWalRecovery(void) > > XLogRecord *record; > > bool reachedRecoveryTarget = false; > > TimeLineID replayTLI; > > + uint32 pgstat_report_wal_frequency = 0; > > > > /* > > * Initialize shared variables for tracking progress of WAL replay, as if > > @@ -1745,6 +1746,16 @@ PerformWalRecovery(void) > > */ > > ApplyWalRecord(xlogreader, record, &replayTLI); > > > > + /* > > + * Report pending statistics to the cumulative stats system once > > + * every PGSTAT_REPORT_FREQUENCY times to not hinder performance. > > + */ > > + if (pgstat_report_wal_frequency++ == PGSTAT_REPORT_FREQUENCY) > > + { > > + pgstat_report_wal(false); > > + pgstat_report_wal_frequency = 0; > > + } > > + > > Is the above needed for your patch to work? What does it do? It should > probably be in a separate commit and should definitely have an > explanation. Done, I omit that part. > > > --- a/src/backend/utils/activity/pgstat_io.c > > +++ b/src/backend/utils/activity/pgstat_io.c > > @@ -87,17 +87,25 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3 > > Assert((unsigned int) io_op < IOOP_NUM_TYPES); > > Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op)); > > I would add a comment here explaining that pg_stat_wal doesn't count WAL > init or WAL reads. Done. > > > + if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL && > > + io_op == IOOP_FSYNC) > > + PendingWalStats.wal_sync += cnt; > > + > > PendingIOStats.counts[io_object][io_context][io_op] += cnt; > > > > have_iostats = true; > > } > > > +/* > > + * Prepares io_time for pgstat_count_io_op_time() function. It needs to return > > + * current time if there is a chance that any 'time' can be tracked. > > + */ > > instr_time > > pgstat_prepare_io_time(void) > > { > > instr_time io_start; > > > > - if (track_io_timing) > > + if(track_io_timing || track_wal_io_timing) > > INSTR_TIME_SET_CURRENT(io_start); > > else > > INSTR_TIME_SET_ZERO(io_start); > > Since you asked me off-list why we had to do INSTR_TIME_SET_ZERO() and I > couldn't remember, it is probably worth a comment. Done. > > > pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, > > instr_time start_time, uint32 cnt) > > { > > - if (track_io_timing) > > + if (pgstat_should_track_io_time(io_object, io_context)) > > { > > instr_time io_time; > > > > @@ -124,6 +148,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, > > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time)); > > Now that we are adding more if statements to this function, I think we > should start adding more comments. > > We should explain what the different counters here are for e.g. > pgBufferUsage for EXPLAIN, PendingWalStats for pg_stat_wal. > > We should also explain what is tracked for each and why it differs -- > e.g. some track time and some don't, some track only reads or writes, > etc. > > Also we should mention why we are consolidating them here. That is, we > want to eventually deduplicate these counters, so we are consolidating > them first. This also makes it easy to compare what is tracked for which > stats or instrumentation purpose. > > And for those IO counters that we haven't moved here, we should mention > it is because they track at a different level of granularity or at a > different point in the call stack. Done. > > > if (io_object == IOOBJECT_RELATION) > > INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time); > > + /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */ > > + else if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL) > > + INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time); > > } > > > Also, I would reorder the if statements to be in order of the enum > values (e.g. FSYNC, READ, WRITE). Done. > > > else if (io_op == IOOP_READ) > > { > > @@ -131,6 +158,12 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, > > if (io_object == IOOBJECT_RELATION) > > INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time); > > } > > + else if (io_op == IOOP_FSYNC) > > + { > > + /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */ > > I wouldn't squeeze this comment here like this. It is hard to read Done. > > > + if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL) > > + INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time); > > > > + * op_bytes can change according to IOObject and IOContext. > > + * Return BLCKSZ as default. > > + */ > > +int > > +pgstat_get_io_op_btyes(IOObject io_object, IOContext io_context) > > +{ > > Small typo in function name: > pgstat_get_io_op_btyes -> pgstat_get_io_op_bytes > I'd also mention why BLCKSZ is the default Done. > > > + if (io_object == IOOBJECT_WAL) > > + { > > + if (io_context == IOCONTEXT_NORMAL) > > + return XLOG_BLCKSZ; > > + else if (io_context == IOCONTEXT_INIT) > > + return wal_segment_size; > > + } > > + > > + return BLCKSZ; > > +} > > > @@ -350,6 +405,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object, > > if (!pgstat_tracks_io_bktype(bktype)) > > return false; > > > > + /* > > + * Currently, IO on IOOBJECT_WAL IOObject can only occur in the > > + * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext. > > + */ > > + if (io_object == IOOBJECT_WAL && > > + (io_context != IOCONTEXT_NORMAL && > > Little bit of errant whitespace here. Done. > > > /* > > * Currently, IO on temporary relations can only occur in the > > * IOCONTEXT_NORMAL IOContext. > > @@ -439,6 +503,14 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object, > > if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND) > > return false; > > I would expand on the comment to explain what NORMAL is for WAL -- what > we consider normal to be and why. And why it is different than INIT. Done. > > > > > + if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT && > > + !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC)) > > + return false; > > + > > + if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL && > > + !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC)) > > + return false; > > These are the first "bans" that we have for an IOOp for a specific > combination of io_context and io_object. We should add a new comment for > this and perhaps consider what ordering makes most sense. I tried to > organize the bans from most broad to most specific at the bottom. Done. > > > > > --- a/src/backend/utils/adt/pgstatfuncs.c > > +++ b/src/backend/utils/adt/pgstatfuncs.c > > @@ -1409,7 +1410,8 @@ pg_stat_get_io(PG_FUNCTION_ARGS) > > * and constant multipliers, once non-block-oriented IO (e.g. > > * temporary file IO) is tracked. > > */ > > - values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ); > > There's a comment above this in the code that says this is hard-coded to > BLCKSZ. That comment needs to be updated or removed (in lieu of the > comment in your pgstat_get_io_op_bytes() function). Done. > > > > + op_bytes = pgstat_get_io_op_btyes(io_obj, io_context); > > + values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes); > > > > > +extern PGDLLIMPORT bool track_wal_io_timing; > > +extern PGDLLIMPORT int wal_segment_size; > > These shouldn't be in two places (i.e. they are already in xlog.h and > you added them in pgstat.h. pg_stat_io.c includes bufmgr.h for > track_io_timing, so you can probably justify including xlog.h. Done. Any kind of feedback would be appreciated. Regards, Nazir Bilal Yavuz Microsoft
From 8a06deb704b51010f36a448edf37dfdc1c829d43 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz <byavu...@gmail.com> Date: Thu, 14 Sep 2023 12:44:29 +0300 Subject: [PATCH v3 2/2] Add IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests --- src/test/regress/expected/stats.out | 25 +++++++++++++++++++++++++ src/test/regress/sql/stats.sql | 10 ++++++++++ 2 files changed, 35 insertions(+) diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out index 94187e59cfb..98f94b189ea 100644 --- a/src/test/regress/expected/stats.out +++ b/src/test/regress/expected/stats.out @@ -1246,6 +1246,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs FROM pg_stat_io WHERE object = 'relation' \gset io_sum_shared_before_ +SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs + FROM pg_stat_io + WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_ CREATE TABLE test_io_shared(a int); INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i; SELECT pg_stat_force_next_flush(); @@ -1283,6 +1286,28 @@ SELECT current_setting('fsync') = 'off' t (1 row) +SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs + FROM pg_stat_io + WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_ +SELECT current_setting('synchronous_commit') = 'on'; + ?column? +---------- + t +(1 row) + +SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes; + ?column? +---------- + t +(1 row) + +SELECT current_setting('fsync') = 'off' + OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs; + ?column? +---------- + t +(1 row) + -- Change the tablespace so that the table is rewritten directly, then SELECT -- from it to cause it to be read back into shared buffers. SELECT sum(reads) AS io_sum_shared_before_reads diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql index 1e21e55c6d9..c91ec43f039 100644 --- a/src/test/regress/sql/stats.sql +++ b/src/test/regress/sql/stats.sql @@ -603,6 +603,9 @@ SELECT sum(extends) AS io_sum_shared_before_extends SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs FROM pg_stat_io WHERE object = 'relation' \gset io_sum_shared_before_ +SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs + FROM pg_stat_io + WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_ CREATE TABLE test_io_shared(a int); INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i; SELECT pg_stat_force_next_flush(); @@ -621,6 +624,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes; SELECT current_setting('fsync') = 'off' OR :io_sum_shared_after_fsyncs > :io_sum_shared_before_fsyncs; +SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs + FROM pg_stat_io + WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_ +SELECT current_setting('synchronous_commit') = 'on'; +SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes; +SELECT current_setting('fsync') = 'off' + OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs; -- Change the tablespace so that the table is rewritten directly, then SELECT -- from it to cause it to be read back into shared buffers. -- 2.40.1
From fbbd64e8bf53c0ab149bac7970df9b4c1ac92589 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz <byavu...@gmail.com> Date: Tue, 19 Sep 2023 17:56:56 +0300 Subject: [PATCH v3 1/2] Show WAL stats (except streaming replication WAL) on pg_stat_io This patch aims to showing WAL stats per backend on pg_stat_io view. With this patch, it can be seen how many WAL operations it makes, their context, types and total timings per backend in pg_stat_io view. This patchset currently covers: - IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync. - IOOBJECT_WAL / IOCONTEXT_INIT write and fsync. doesn't cover: - Streaming replication WAL IO. --- src/backend/access/transam/xlog.c | 60 ++++------ src/backend/access/transam/xlogrecovery.c | 9 ++ src/backend/utils/activity/pgstat_io.c | 137 ++++++++++++++++++++-- src/backend/utils/adt/pgstatfuncs.c | 10 +- src/include/pgstat.h | 8 +- 5 files changed, 172 insertions(+), 52 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index fcbde10529b..5c0da34bc45 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2200,38 +2200,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) Size nbytes; Size nleft; int written; - instr_time start; + instr_time io_start; /* OK to write the page(s) */ from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ; nbytes = npages * (Size) XLOG_BLCKSZ; nleft = nbytes; + + io_start = pgstat_prepare_io_time(); do { errno = 0; - /* Measure I/O timing to write WAL data */ - if (track_wal_io_timing) - INSTR_TIME_SET_CURRENT(start); - else - INSTR_TIME_SET_ZERO(start); - pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE); written = pg_pwrite(openLogFile, from, nleft, startoffset); pgstat_report_wait_end(); - /* - * Increment the I/O timing and the number of times WAL data - * were written out to disk. - */ - if (track_wal_io_timing) - { - instr_time end; - - INSTR_TIME_SET_CURRENT(end); - INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start); - } - PendingWalStats.wal_write++; if (written <= 0) @@ -2256,6 +2240,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) startoffset += written; } while (nleft > 0); + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, + IOOP_WRITE, io_start, npages); + npages = 0; /* @@ -2949,6 +2936,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, int fd; int save_errno; int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY; + instr_time io_start; Assert(logtli != 0); @@ -2992,6 +2980,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, (errcode_for_file_access(), errmsg("could not create file \"%s\": %m", tmppath))); + /* start timing writes for stats */ + io_start = pgstat_prepare_io_time(); + pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE); save_errno = 0; if (wal_init_zero) @@ -3027,6 +3018,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, } pgstat_report_wait_end(); + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE, + io_start, 1); + if (save_errno) { /* @@ -3043,6 +3037,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, errmsg("could not write to file \"%s\": %m", tmppath))); } + /* start timing fsyncs for stats */ + io_start = pgstat_prepare_io_time(); + pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC); if (pg_fsync(fd) != 0) { @@ -3055,6 +3052,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, } pgstat_report_wait_end(); + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, + IOOP_FSYNC, io_start, 1); + if (close(fd) != 0) ereport(ERROR, (errcode_for_file_access(), @@ -8186,7 +8186,7 @@ void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) { char *msg = NULL; - instr_time start; + instr_time io_start; Assert(tli != 0); @@ -8199,11 +8199,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) sync_method == SYNC_METHOD_OPEN_DSYNC) return; - /* Measure I/O timing to sync the WAL file */ - if (track_wal_io_timing) - INSTR_TIME_SET_CURRENT(start); - else - INSTR_TIME_SET_ZERO(start); + io_start = pgstat_prepare_io_time(); pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC); switch (sync_method) @@ -8247,18 +8243,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) pgstat_report_wait_end(); - /* - * Increment the I/O timing and the number of times WAL files were synced. - */ - if (track_wal_io_timing) - { - instr_time end; - - INSTR_TIME_SET_CURRENT(end); - INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start); - } - - PendingWalStats.wal_sync++; + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC, + io_start, 1); } /* diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index becc2bda62e..ca1041c5fd7 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -1745,6 +1745,9 @@ PerformWalRecovery(void) */ ApplyWalRecord(xlogreader, record, &replayTLI); + /* Report pending statistics to the cumulative stats system */ + pgstat_report_wal(false); + /* Exit loop if we reached inclusive recovery target */ if (recoveryStopsAfter(xlogreader)) { @@ -3220,6 +3223,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen, uint32 targetPageOff; XLogSegNo targetSegNo PG_USED_FOR_ASSERTS_ONLY; int r; + instr_time io_start; XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size); targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size); @@ -3312,6 +3316,8 @@ retry: /* Read the requested page */ readOff = targetPageOff; + io_start = pgstat_prepare_io_time(); + pgstat_report_wait_start(WAIT_EVENT_WAL_READ); r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff); if (r != XLOG_BLCKSZ) @@ -3340,6 +3346,9 @@ retry: } pgstat_report_wait_end(); + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ, + io_start, 1); + Assert(targetSegNo == readSegNo); Assert(targetPageOff == readOff); Assert(reqLen <= readLen); diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c index eb7d35d4225..d5345ba2536 100644 --- a/src/backend/utils/activity/pgstat_io.c +++ b/src/backend/utils/activity/pgstat_io.c @@ -18,6 +18,7 @@ #include "executor/instrument.h" #include "storage/bufmgr.h" +#include "access/xlog.h" #include "utils/pgstat_internal.h" @@ -87,24 +88,58 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3 Assert((unsigned int) io_op < IOOP_NUM_TYPES); Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op)); + /* + * Only count PendingWalStats.wal_sync in pg_stat_io because + * PendingWalStats.wal_write could count different than what pg_stat_io + * counts e.g. system calls. Also, PendingWalStats doesn't count WAL init + * or WAL reads. So, they are not included too. + */ + if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL && + io_op == IOOP_FSYNC) + PendingWalStats.wal_sync += cnt; + PendingIOStats.counts[io_object][io_context][io_op] += cnt; have_iostats = true; } +/* + * Prepares io_time for pgstat_count_io_op_time() function. It needs to return + * current time if there is a chance that any 'time' can be tracked. + */ instr_time pgstat_prepare_io_time(void) { instr_time io_start; - if (track_io_timing) + if (track_io_timing || track_wal_io_timing) INSTR_TIME_SET_CURRENT(io_start); else + /* + * If time won't be tracked, there is no need to set io_start but + * compiler complains about uninitialized use. So, set it to zero. + */ INSTR_TIME_SET_ZERO(io_start); return io_start; } +/* + * Decide if the io timing needs be tracked + */ +bool +pgstat_should_track_io_time(IOObject io_object, IOContext io_context) +{ + /* + * io times of IOOBJECT_WAL IOObject needs to be tracked when + * 'track_wal_io_timing' is set regardless of 'track_io_timing'. + */ + if (io_object == IOOBJECT_WAL) + return track_wal_io_timing; + + return track_io_timing; +} + /* * Like pgstat_count_io_op_n() except it also accumulates time. */ @@ -112,18 +147,40 @@ void pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time start_time, uint32 cnt) { - if (track_io_timing) + /* + * Accumulate different type of times here. We want to eventually + * deduplicate these counters, so we are consolidating them first. + * This also makes it easy to compare what is tracked for which + * stats or instrumentation purpose. + * + * Some IO counters didn't moved here because they track at a different + * level of granularity or at a different point in the call stack. + * + * pgstat_count_buffer is for pgstat_database. Since pg_stat_database only + * counts blk_read_time and blk_write_time, it is set for IOOP_READ and + * IOOP_WRITE. + * + * pgBufferUsage is for EXPLAIN. pgBufferUsage has only write and read stats + * for shared/local and temporary blocks. Only shared/local blocks are counted + * here. + * + * PendingWalStats is for counting WAL stats on pg_stat_wal. It + * only counts IOOBJECT_WAL / IOCONTEXT_NORMAL. + * + * At the end of the if case, accumulate time for pg_stat_io. + */ + if (pgstat_should_track_io_time(io_object, io_context)) { instr_time io_time; INSTR_TIME_SET_CURRENT(io_time); INSTR_TIME_SUBTRACT(io_time, start_time); - if (io_op == IOOP_WRITE) + if (io_op == IOOP_FSYNC) { - pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time)); - if (io_object == IOOBJECT_RELATION) - INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time); + /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */ + if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL) + INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time); } else if (io_op == IOOP_READ) { @@ -131,11 +188,22 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, if (io_object == IOOBJECT_RELATION) INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time); } + else if (io_op == IOOP_WRITE) + { + pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time)); + if (io_object == IOOBJECT_RELATION) + INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time); + + /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */ + else if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL) + INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time); + } INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op], io_time); } + /* IO OP times are counted, now count IO OPs */ pgstat_count_io_op_n(io_object, io_context, io_op, cnt); } @@ -216,12 +284,33 @@ pgstat_get_io_context_name(IOContext io_context) return "normal"; case IOCONTEXT_VACUUM: return "vacuum"; + case IOCONTEXT_INIT: + return "init"; } elog(ERROR, "unrecognized IOContext value: %d", io_context); pg_unreachable(); } +/* + * op_bytes can change according to IOObject and IOContext. + * Return BLCKSZ as default because most of the + * IOObject / IOContext uses BLCKSZ. + */ +int +pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context) +{ + if (io_object == IOOBJECT_WAL) + { + if (io_context == IOCONTEXT_NORMAL) + return XLOG_BLCKSZ; + else if (io_context == IOCONTEXT_INIT) + return wal_segment_size; + } + + return BLCKSZ; +} + const char * pgstat_get_io_object_name(IOObject io_object) { @@ -231,6 +320,8 @@ pgstat_get_io_object_name(IOObject io_object) return "relation"; case IOOBJECT_TEMP_RELATION: return "temp relation"; + case IOOBJECT_WAL: + return "wal"; } elog(ERROR, "unrecognized IOObject value: %d", io_object); @@ -312,10 +403,10 @@ pgstat_tracks_io_bktype(BackendType bktype) case B_INVALID: case B_ARCHIVER: case B_LOGGER: - case B_WAL_RECEIVER: - case B_WAL_WRITER: return false; + case B_WAL_RECEIVER: + case B_WAL_WRITER: case B_AUTOVAC_LAUNCHER: case B_AUTOVAC_WORKER: case B_BACKEND: @@ -350,6 +441,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object, if (!pgstat_tracks_io_bktype(bktype)) return false; + /* + * Currently, IO on IOOBJECT_WAL IOObject can only occur in the + * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext. + */ + if (io_object == IOOBJECT_WAL && + (io_context != IOCONTEXT_NORMAL && + io_context != IOCONTEXT_INIT)) + return false; + /* * Currently, IO on temporary relations can only occur in the * IOCONTEXT_NORMAL IOContext. @@ -448,6 +548,27 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object, if (!strategy_io_context && io_op == IOOP_REUSE) return false; + /* + * Some IOOps are not valid in certain IOContexts & IOObjects and some + * IOOps are only valid in certain IOContexts & IOObjects. + */ + + /* + * In IOOBJECT_WAL io_object, IOCONTEXT_INIT io_context means + * operations done while creating new WAL segments. + */ + if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT && + !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC)) + return false; + + /* + * In IOOBJECT_WAL io_object, IOCONTEXT_NORMAL io_context means + * operations done on already created WAL segments. + */ + if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL && + !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC)) + return false; + /* * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are * counted in the IOCONTEXT_NORMAL IOContext. See comment in diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index 3b44af80066..18bff382363 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -1383,6 +1383,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS) for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) { const char *context_name = pgstat_get_io_context_name(io_context); + int op_bytes; Datum values[IO_NUM_COLUMNS] = {0}; bool nulls[IO_NUM_COLUMNS] = {0}; @@ -1401,12 +1402,11 @@ pg_stat_get_io(PG_FUNCTION_ARGS) values[IO_COL_RESET_TIME] = TimestampTzGetDatum(reset_time); /* - * Hard-code this to the value of BLCKSZ for now. Future - * values could include XLOG_BLCKSZ, once WAL IO is tracked, - * and constant multipliers, once non-block-oriented IO (e.g. - * temporary file IO) is tracked. + * op_bytes can change according to IOObject and IOContext. + * Get the correct op_bytes. */ - values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ); + op_bytes = pgstat_get_io_op_bytes(io_obj, io_context); + values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes); for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) { diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 57a2c0866a2..21b6b8ca31c 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -277,9 +277,10 @@ typedef enum IOObject { IOOBJECT_RELATION, IOOBJECT_TEMP_RELATION, + IOOBJECT_WAL, } IOObject; -#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1) +#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1) typedef enum IOContext { @@ -287,9 +288,10 @@ typedef enum IOContext IOCONTEXT_BULKWRITE, IOCONTEXT_NORMAL, IOCONTEXT_VACUUM, + IOCONTEXT_INIT, } IOContext; -#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1) +#define IOCONTEXT_NUM_TYPES (IOCONTEXT_INIT + 1) typedef enum IOOp { @@ -521,10 +523,12 @@ extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io, extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op); extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt); extern instr_time pgstat_prepare_io_time(void); +extern bool pgstat_should_track_io_time(IOObject io_object, IOContext io_context); extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time start_time, uint32 cnt); extern PgStat_IO *pgstat_fetch_stat_io(void); +extern int pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context); extern const char *pgstat_get_io_context_name(IOContext io_context); extern const char *pgstat_get_io_object_name(IOObject io_object); -- 2.40.1