v5 attached mostly addresses instr_time persistence issues. On Tue, Mar 14, 2023 at 6:56 PM Andres Freund <and...@anarazel.de> wrote: > On 2023-03-09 11:50:38 -0500, Melanie Plageman wrote: > > On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <and...@anarazel.de> wrote: > > > On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote: > > > > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think > > > > > that would be a good idea > > > > > to also check that if counts are not Zero then times are not Zero. > > > > > > > > Yes, I think adding some validation around the relationship between > > > > counts and timing should help prevent developers from forgetting to call > > > > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant). > > > > > > > > However, I think that we cannot check that if IO counts are non-zero > > > > that IO times are non-zero, because the user may not have > > > > track_io_timing enabled. We can check that if IO times are not zero, IO > > > > counts are not zero. I've done this in the attached v3. > > > > > > And even if track_io_timing is enabled, the timer granularity might be so > > > low > > > that we *still* get zeroes. > > > > > > I wonder if we should get rid of pgStatBlockReadTime, > > > pgStatBlockWriteTime, > > > > And then have pg_stat_reset_shared('io') reset pg_stat_database IO > > stats? > > Yes.
I think this makes sense but I am hesitant to do it in this patchset, because it feels a bit hidden...maybe? But, if you feel strongly, I will make the change. > > > > typedef struct PgStat_BktypeIO > > > > { > > > > - PgStat_Counter > > > > data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; > > > > + PgStat_Counter > > > > counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; > > > > + instr_time > > > > times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; > > > > } PgStat_BktypeIO; > > > > > > Ah, you're going to hate me. We can't store instr_time on disk. There's > > > another patch that gets substantial peformance gains by varying the > > > frequency > > > at which instr_time keeps track of time based on the CPU frequency... > > > > What does that have to do with what we can store on disk? > > The frequency can change. Ah, I see. > > If so, would it not be enough to do this when reading/writing the stats > > file? > > Theoretically yes. But to me it seems cleaner to do it when flushing to shared > stats. See also the overflow issue below. > > > > It also just doesn't have enough range to keep track of system wide > > > time on a larger system. A single backend won't run for 293 years, but > > > with a few thousand backends that's a whole different story. > > > > > > I think we need to accumulate in instr_time, but convert to floating point > > > when flushing stats. > > > > Hmmm. So, are you saying that we need to read from disk when we query > > the view and add that to what is in shared memory? That we only store > > the delta since the last restart in the instr_time array? > > No, I don't think I am suggesting that. What I am trying to suggest is that > PendingIOStats should contain instr_time, but that PgStat_IO should contain > PgStat_Counter as microseconds, as before. So, I've modified the code to make a union of instr_time and PgStat_Counter in PgStat_BktypeIO. I am not quite sure if this is okay. I store in microsec and then in pg_stat_io, I multiply to get milliseconds for display. I considered refactoring pgstat_io_end() to use INSTR_TIME_ACCUM_DIFF() like [1], but, in the end I actually think I would end up with more operations because of the various different counters needing to be updated. As it is now, I do a single subtract and a few adds (one for each of the different statistics objects tracking IO times (pgBufferUsage, pgStatBlockWrite/ReadTime). Whereas, I would need to do an accum diff for every one of those. - Melanie [1] https://www.postgresql.org/message-id/1feedb83-7aa9-cb4b-5086-598349d3f555%40gmail.com
From 5c02cd9cb784bf22f756fb9d92f4bd29ba2e744a Mon Sep 17 00:00:00 2001 From: Melanie Plageman <melanieplage...@gmail.com> Date: Mon, 6 Mar 2023 10:41:51 -0500 Subject: [PATCH v5] Track IO times in pg_stat_io Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io. Reviewed-by: Bertrand Drouvot <bertranddrouvot...@gmail.com> Reviewed-by: Andres Freund <and...@anarazel.de> Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com --- doc/src/sgml/monitoring.sgml | 59 +++++++++++++++ src/backend/catalog/system_views.sql | 4 ++ src/backend/storage/buffer/bufmgr.c | 56 ++++++--------- src/backend/storage/buffer/localbuf.c | 6 +- src/backend/storage/smgr/md.c | 27 ++++--- src/backend/utils/activity/pgstat.c | 77 +++++++++++++++++++- src/backend/utils/activity/pgstat_io.c | 99 +++++++++++++++++++++----- src/backend/utils/adt/pgstatfuncs.c | 48 +++++++++++-- src/include/catalog/pg_proc.dat | 6 +- src/include/pgstat.h | 14 +++- src/test/regress/expected/rules.out | 6 +- 11 files changed, 325 insertions(+), 77 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 6249bb50d0..ad3667f258 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i </entry> </row> + <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + <structfield>read_time</structfield> <type>double precision</type> + </para> + <para> + Time spent in read operations in milliseconds (if <xref + linkend="guc-track-io-timing"/> is enabled, otherwise zero) + </para> + </entry> + </row> + <row> <entry role="catalog_table_entry"> <para role="column_definition"> @@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i </entry> </row> + <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + <structfield>write_time</structfield> <type>double precision</type> + </para> + <para> + Time spent in write operations in milliseconds (if <xref + linkend="guc-track-io-timing"/> is enabled, otherwise zero) + </para> + </entry> + </row> + <row> <entry role="catalog_table_entry"> <para role="column_definition"> @@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i </entry> </row> + <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + <structfield>extend_time</structfield> <type>double precision</type> + </para> + <para> + Time spent in extend operations in milliseconds (if <xref + linkend="guc-track-io-timing"/> is enabled, otherwise zero) + </para> + </entry> + </row> + <row> <entry role="catalog_table_entry"> <para role="column_definition"> @@ -3902,6 +3938,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i </entry> </row> + <row> + <entry role="catalog_table_entry"> + <para role="column_definition"> + <structfield>fsync_time</structfield> <type>double precision</type> + </para> + <para> + Time spent in fsync operations in milliseconds (if <xref + linkend="guc-track-io-timing"/> is enabled, otherwise zero) + </para> + </entry> + </row> + <row> <entry role="catalog_table_entry"> <para role="column_definition"> @@ -3967,6 +4015,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i </itemizedlist> </para> + <note> + <para> + Columns tracking I/O time will only be non-zero when <xref + linkend="guc-track-io-timing"/> is enabled. The user should be careful when + using these columns in combination with their corresponding operations to + ensure that <varname>track_io_timing</varname> was enabled for the entire + time since the last reset. + </para> + </note> + + </sect2> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 34ca0e739f..39391bc2fc 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1123,12 +1123,16 @@ SELECT b.io_object, b.io_context, b.reads, + b.read_time, b.writes, + b.write_time, b.extends, + b.extend_time, b.op_bytes, b.evictions, b.reuses, b.fsyncs, + b.fsync_time, b.stats_reset FROM pg_stat_get_io() b; diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 0a05577b68..64857f1ff6 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1000,12 +1000,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, if (isExtend) { + instr_time io_start; + /* new buffers are zero-filled */ MemSet((char *) bufBlock, 0, BLCKSZ); + + io_start = pgstat_io_start(); + /* don't set checksum for all-zero page */ smgrextend(smgr, forkNum, blockNum, bufBlock, false); - pgstat_count_io_op(io_object, io_context, IOOP_EXTEND); + pgstat_io_end(io_start, io_object, io_context, IOOP_EXTEND); /* * NB: we're *not* doing a ScheduleBufferTagForWriteback here; @@ -1024,25 +1029,13 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, MemSet((char *) bufBlock, 0, BLCKSZ); else { - instr_time io_start, - io_time; + instr_time io_start; - if (track_io_timing) - INSTR_TIME_SET_CURRENT(io_start); - else - INSTR_TIME_SET_ZERO(io_start); + io_start = pgstat_io_start(); smgrread(smgr, forkNum, blockNum, bufBlock); - pgstat_count_io_op(io_object, io_context, IOOP_READ); - - if (track_io_timing) - { - INSTR_TIME_SET_CURRENT(io_time); - INSTR_TIME_SUBTRACT(io_time, io_start); - pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time)); - INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time); - } + pgstat_io_end(io_start, io_object, io_context, IOOP_READ); /* check for garbage data */ if (!PageIsVerifiedExtended((Page) bufBlock, blockNum, @@ -2848,6 +2841,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum, *blknum = bufHdr->tag.blockNum; } + /* * FlushBuffer * Physically write out a shared buffer. @@ -2873,8 +2867,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object, { XLogRecPtr recptr; ErrorContextCallback errcallback; - instr_time io_start, - io_time; + instr_time io_start; Block bufBlock; char *bufToWrite; uint32 buf_state; @@ -2949,10 +2942,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object, */ bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum); - if (track_io_timing) - INSTR_TIME_SET_CURRENT(io_start); - else - INSTR_TIME_SET_ZERO(io_start); + io_start = pgstat_io_start(); /* * bufToWrite is either the shared buffer or a copy, as appropriate. @@ -2981,17 +2971,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object, * When a strategy is not in use, the write can only be a "regular" write * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE). */ - pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE); - - if (track_io_timing) - { - INSTR_TIME_SET_CURRENT(io_time); - INSTR_TIME_SUBTRACT(io_time, io_start); - pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time)); - INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time); - } - - pgBufferUsage.shared_blks_written++; + pgstat_io_end(io_start, IOOBJECT_RELATION, io_context, IOOP_WRITE); /* * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and @@ -3597,6 +3577,7 @@ FlushRelationBuffers(Relation rel) for (i = 0; i < NLocBuffer; i++) { uint32 buf_state; + instr_time io_start; bufHdr = GetLocalBufferDescriptor(i); if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator) && @@ -3616,6 +3597,12 @@ FlushRelationBuffers(Relation rel) PageSetChecksumInplace(localpage, bufHdr->tag.blockNum); + /* + * TODO: is it okay that this formerly didn't count + * pgBufferUsage or pgStatBlockReadTime + */ + io_start = pgstat_io_start(); + smgrwrite(RelationGetSmgr(rel), BufTagGetForkNum(&bufHdr->tag), bufHdr->tag.blockNum, @@ -3625,13 +3612,14 @@ FlushRelationBuffers(Relation rel) buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED); pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state); - pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE); + pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE); /* Pop the error context stack */ error_context_stack = errcallback.previous; } } + return; } diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 5325ddb663..af45a72934 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -220,6 +220,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, */ if (buf_state & BM_DIRTY) { + instr_time io_start; SMgrRelation oreln; Page localpage = (char *) LocalBufHdrGetBlock(bufHdr); @@ -228,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, PageSetChecksumInplace(localpage, bufHdr->tag.blockNum); + io_start = pgstat_io_start(); + /* And write... */ smgrwrite(oreln, BufTagGetForkNum(&bufHdr->tag), @@ -239,8 +242,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, buf_state &= ~BM_DIRTY; pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state); - pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE); - pgBufferUsage.local_blks_written++; + pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE); } /* diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 352958e1fe..2440211f03 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -1030,6 +1030,17 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg) if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ )) { + instr_time io_start = pgstat_io_start(); + + ereport(DEBUG1, + (errmsg_internal("could not forward fsync request because request queue is full"))); + + if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0) + ereport(data_sync_elevel(ERROR), + (errcode_for_file_access(), + errmsg("could not fsync file \"%s\": %m", + FilePathName(seg->mdfd_vfd)))); + /* * We have no way of knowing if the current IOContext is * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this @@ -1041,16 +1052,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg) * IOCONTEXT_NORMAL is likely clearer when investigating the number of * backend fsyncs. */ - pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC); - - ereport(DEBUG1, - (errmsg_internal("could not forward fsync request because request queue is full"))); - - if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0) - ereport(data_sync_elevel(ERROR), - (errcode_for_file_access(), - errmsg("could not fsync file \"%s\": %m", - FilePathName(seg->mdfd_vfd)))); + pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC); } } @@ -1399,6 +1401,7 @@ int mdsyncfiletag(const FileTag *ftag, char *path) { SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId); + instr_time io_start; File file; bool need_to_close; int result, @@ -1425,6 +1428,8 @@ mdsyncfiletag(const FileTag *ftag, char *path) need_to_close = true; } + io_start = pgstat_io_start(); + /* Sync the file. */ result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC); save_errno = errno; @@ -1432,7 +1437,7 @@ mdsyncfiletag(const FileTag *ftag, char *path) if (need_to_close) FileClose(file); - pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC); + pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC); errno = save_errno; return result; diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index b125802b21..9f5f041d7c 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -95,9 +95,11 @@ #include "access/transam.h" #include "access/xact.h" +#include "executor/instrument.h" #include "lib/dshash.h" #include "pgstat.h" #include "port/atomics.h" +#include "storage/bufmgr.h" #include "storage/fd.h" #include "storage/ipc.h" #include "storage/lwlock.h" @@ -108,7 +110,6 @@ #include "utils/pgstat_internal.h" #include "utils/timestamp.h" - /* ---------- * Timer definitions. * @@ -906,6 +907,80 @@ pgstat_have_entry(PgStat_Kind kind, Oid dboid, Oid objoid) return pgstat_get_entry_ref(kind, dboid, objoid, false, NULL) != NULL; } +instr_time +pgstat_io_start(void) +{ + instr_time io_start; + + if (track_io_timing) + INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); + + return io_start; +} + +void +pgstat_io_end(instr_time io_start, IOObject io_object, + IOContext io_context, IOOp io_op) +{ + instr_time io_time; + + if (track_io_timing) + { + INSTR_TIME_SET_CURRENT(io_time); + INSTR_TIME_SUBTRACT(io_time, io_start); + pgstat_count_io_time(io_object, io_context, io_op, io_time); + + if (io_op == IOOP_WRITE) + { + if (io_object == IOOBJECT_RELATION) + { + /* TODO: AFAICT, pgstat_count_buffer_write_time is only called */ + /* for shared buffers whereas pgstat_count_buffer_read_time is */ + /* called for temp relations and shared buffers. */ + /* + * is this intentional and should I match current behavior or + * not? + */ + pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time)); + INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time); + } + } + else if (io_op == IOOP_READ) + { + pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time)); + if (io_object == IOOBJECT_RELATION) + { + INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time); + } + } + } + + pgstat_count_io_op(io_object, io_context, io_op); + + if (io_op == IOOP_WRITE) + { + if (io_object == IOOBJECT_RELATION) + pgBufferUsage.shared_blks_written++; + else if (io_object == IOOBJECT_TEMP_RELATION) + pgBufferUsage.local_blks_written++; + } + + /* + * TODO: this is normally done later in ReadBuffer_common() is it okay to + * do here? + */ + else if (io_op == IOOP_READ) + { + if (io_object == IOOBJECT_RELATION) + pgBufferUsage.shared_blks_read++; + else if (io_object == IOOBJECT_TEMP_RELATION) + pgBufferUsage.local_blks_read++; + } +} + + /* * Ensure snapshot for fixed-numbered 'kind' exists. * diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c index af5d554610..fe09cbf16d 100644 --- a/src/backend/utils/activity/pgstat_io.c +++ b/src/backend/utils/activity/pgstat_io.c @@ -25,36 +25,47 @@ bool have_iostats = false; /* * Check that stats have not been counted for any combination of IOObject, - * IOContext, and IOOp which are not tracked for the passed-in BackendType. The - * passed-in PgStat_BktypeIO must contain stats from the BackendType specified - * by the second parameter. Caller is responsible for locking the passed-in - * PgStat_BktypeIO, if needed. + * IOContext, and IOOp which are not tracked for the passed-in BackendType. If + * the IOOp is not counted for this combination but IO time is otherwise + * tracked for this IOOp, check that IO time has not been counted for this + * combination. If stats are tracked for this combination and IO times are + * non-zero, counts should be non-zero. + * + * The passed-in PgStat_BktypeIO must contain stats from the BackendType + * specified by the second parameter. Caller is responsible for locking the + * passed-in PgStat_BktypeIO, if needed. */ bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io, BackendType bktype) { - bool bktype_tracked = pgstat_tracks_io_bktype(bktype); - for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++) { for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) { - /* - * Don't bother trying to skip to the next loop iteration if - * pgstat_tracks_io_object() would return false here. We still - * need to validate that each counter is zero anyway. - */ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) { - /* No stats, so nothing to validate */ - if (backend_io->data[io_object][io_context][io_op] == 0) + /* we do track it */ + if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op)) + { + /* ensure that if IO times are non-zero, counts are > 0 */ + if (backend_io->times[io_object][io_context][io_op] != 0 && + backend_io->counts[io_object][io_context][io_op] <= 0) + return false; + continue; + } - /* There are stats and there shouldn't be */ - if (!bktype_tracked || - !pgstat_tracks_io_op(bktype, io_object, io_context, io_op)) + /* we don't track it, and it is not 0 */ + if (backend_io->counts[io_object][io_context][io_op] != 0) return false; + + /* we don't track this IOOp, so make sure its IO time is zero */ + if (pgstat_tracks_io_time(io_op) > -1) + { + if (backend_io->times[io_object][io_context][io_op] != 0) + return false; + } } } } @@ -70,7 +81,21 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op) Assert((unsigned int) io_op < IOOP_NUM_TYPES); Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op)); - PendingIOStats.data[io_object][io_context][io_op]++; + PendingIOStats.counts[io_object][io_context][io_op]++; + + have_iostats = true; +} + +void +pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time) +{ + Assert(io_object < IOOBJECT_NUM_TYPES); + Assert(io_context < IOCONTEXT_NUM_TYPES); + Assert(io_op < IOOP_NUM_TYPES); + Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op)); + Assert(pgstat_tracks_io_time(io_op) != -1); + + INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op], time); have_iostats = true; } @@ -114,8 +139,17 @@ pgstat_flush_io(bool nowait) for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++) { for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) - bktype_shstats->data[io_object][io_context][io_op] += - PendingIOStats.data[io_object][io_context][io_op]; + { + instr_time time; + + bktype_shstats->counts[io_object][io_context][io_op] += + PendingIOStats.counts[io_object][io_context][io_op]; + + time = PendingIOStats.pending_times[io_object][io_context][io_op]; + + bktype_shstats->times[io_object][io_context][io_op] += + INSTR_TIME_GET_MICROSEC(time); + } } } @@ -384,3 +418,30 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object, return true; } + +/* + * PgStat_BktypeIO->times contains IO times for IOOps. For simplicity this + * array has a spot for every IOOp. pgstat_tracks_io_time() is the source of + * truth for which IOOps have corresponding IO times. + */ +IOOp +pgstat_tracks_io_time(IOOp io_op) +{ + switch (io_op) + { + case IOOP_READ: + return IOOP_READ; + case IOOP_WRITE: + return IOOP_WRITE; + case IOOP_EXTEND: + return IOOP_EXTEND; + case IOOP_FSYNC: + return IOOP_FSYNC; + case IOOP_EVICT: + case IOOP_REUSE: + return -1; + } + + elog(ERROR, "unrecognized IOOp value: %d", io_op); + pg_unreachable(); +} diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index 35c6d46555..60fdda1a14 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -1255,12 +1255,16 @@ typedef enum io_stat_col IO_COL_IO_OBJECT, IO_COL_IO_CONTEXT, IO_COL_READS, + IO_COL_READ_TIME, IO_COL_WRITES, + IO_COL_WRITE_TIME, IO_COL_EXTENDS, + IO_COL_EXTEND_TIME, IO_COL_CONVERSION, IO_COL_EVICTIONS, IO_COL_REUSES, IO_COL_FSYNCS, + IO_COL_FSYNC_TIME, IO_COL_RESET_TIME, IO_NUM_COLUMNS, } io_stat_col; @@ -1292,6 +1296,28 @@ pgstat_get_io_op_index(IOOp io_op) pg_unreachable(); } +/* + * Get the number of the column containing IO times for the specified IOOp. If + * the specified IOOp is one for which IO time is not tracked, return -1. Note + * that this function assumes that IO time for an IOOp is displayed in the view + * in the column directly after the IOOp counts. + */ +static io_stat_col +pgstat_get_io_time_index(IOOp io_op) +{ + if (pgstat_tracks_io_time(io_op) == -1) + return -1; + + return pgstat_get_io_op_index(io_op) + 1; +} + +static inline +PgStat_Counter +pg_stat_micro_to_millisecs(PgStat_Counter val_microsec) +{ + return val_microsec * 0.001; +} + Datum pg_stat_get_io(PG_FUNCTION_ARGS) { @@ -1359,20 +1385,32 @@ pg_stat_get_io(PG_FUNCTION_ARGS) for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) { - int col_idx = pgstat_get_io_op_index(io_op); + PgStat_Counter count = bktype_stats->counts[io_obj][io_context][io_op]; + int i = pgstat_get_io_op_index(io_op); /* * Some combinations of BackendType and IOOp, of IOContext * and IOOp, and of IOObject and IOOp are not tracked. Set * these cells in the view NULL. */ - nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op); + if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op)) + values[i] = Int64GetDatum(count); + else + nulls[i] = true; + } - if (nulls[col_idx]) + for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++) + { + PgStat_Counter time = bktype_stats->times[io_obj][io_context][io_op]; + int i = pgstat_get_io_time_index(io_op); + + if (i == -1) continue; - values[col_idx] = - Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]); + if (!nulls[pgstat_get_io_op_index(io_op)]) + values[i] = Float8GetDatum(pg_stat_micro_to_millisecs(time)); + else + nulls[i] = true; } tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index fbc4aade49..a16c49fad2 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -5728,9 +5728,9 @@ proname => 'pg_stat_get_io', provolatile => 'v', prorows => '30', proretset => 't', proparallel => 'r', prorettype => 'record', proargtypes => '', - proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}', + proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,float8,timestamptz}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{backend_type,io_object,io_context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,evictions,reuses,fsyncs,fsync_time,stats_reset}', prosrc => 'pg_stat_get_io' }, { oid => '1136', descr => 'statistics: information about WAL activity', diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 1e418b682b..288a8c7c0e 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -304,7 +304,12 @@ typedef enum IOOp typedef struct PgStat_BktypeIO { - PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; + PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; + union + { + instr_time pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; + PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES]; + }; } PgStat_BktypeIO; typedef struct PgStat_IO @@ -466,6 +471,11 @@ extern TimestampTz pgstat_get_stat_snapshot_timestamp(bool *have_snapshot); extern PgStat_Kind pgstat_get_kind_from_str(char *kind_str); extern bool pgstat_have_entry(PgStat_Kind kind, Oid dboid, Oid objoid); +extern instr_time pgstat_io_start(void); + +extern void pgstat_io_end(instr_time io_start, IOObject io_object, + IOContext io_context, IOOp io_op); + /* * Functions in pgstat_archiver.c @@ -498,6 +508,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void); extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops, BackendType bktype); extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op); +extern void pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time); extern PgStat_IO *pgstat_fetch_stat_io(void); extern const char *pgstat_get_io_context_name(IOContext io_context); extern const char *pgstat_get_io_object_name(IOObject io_object); @@ -507,6 +518,7 @@ extern bool pgstat_tracks_io_object(BackendType bktype, IOObject io_object, IOContext io_context); extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object, IOContext io_context, IOOp io_op); +extern IOOp pgstat_tracks_io_time(IOOp io_op); /* diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index e953d1f515..5434851314 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1880,14 +1880,18 @@ pg_stat_io| SELECT backend_type, io_object, io_context, reads, + read_time, writes, + write_time, extends, + extend_time, op_bytes, evictions, reuses, fsyncs, + fsync_time, stats_reset - FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset); + FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, evictions, reuses, fsyncs, fsync_time, stats_reset); pg_stat_progress_analyze| SELECT s.pid, s.datid, d.datname, -- 2.37.2