Re: Track IO times in pg_stat_io

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 12:17:38 -0400, Melanie Plageman wrote:
> Attached v9 addresses review feedback as well as resolving merge
> conflicts with recent relation extension patchset.

I've edited it a bit more:

- removed pgstat_tracks_io_time() and replaced it by returning the new
  IO_COL_INVALID = -1 from pgstat_get_io_time_index() when there's no time

- moved PgStat_Counter count, time into the respective branches. It feels
  somewhat wrong to access the time when we then decide there is no time.

- s/io_object/io_obj/ in pgstat_count_io_op_time(), combined with added
  linebreaks, got the code to under 80 chars

- renamed pg_stat_microseconds_to_milliseconds to pg_stat_us_to_ms

- removed a spurious newline

- the times reported by pg_stat_io had their fractional part removed, due to
  pg_stat_us_to_ms returning an integer


Verifying this, I saw that the write time visible in pg_stat_io didn't quite
match what I saw in log_checkpoints. But not always. Eventually I figured out
that that's not pg_stat_io's fault - log_checkpoint's write includes a lot of
things, including several other CheckPoint* routines, flushing WAL, asking the
kernel to flush things to disk...  The biggest portion in my case were the
smgrwriteback() calls - which pg_stat_io doesn't track - oops.

Pushed up to and including 0003.


> I've changed pgstat_count_io_op_time() to take a count and call
> pgstat_count_io_op_n() so it can be used with smgrzeroextend(). I do
> wish that the parameter to pgstat_count_io_op_n() was called "count" and
> not "cnt"...

Heh.


> I've also reordered the call site of pgstat_count_io_op_time() in a few
> locations, but I have some questions about this.
> 
> Before, I didn't think it mattered much that we didn't finish counting
> IO time until after setting BM_VALID or BM_DIRTY and unsetting
> BM_IO_IN_PROGRESS. With the relation extension code doing this for many
> buffers at once, though, I wondered if this will make the IO timing too
> inaccurate.

> As such, I've moved pgstat_count_io_op_time() to before we set those
> flags in all locations. I did wonder if it is bad to prolong having the
> buffer pinned and not having those flags set, though.

I went back and forth about this before. I think it's ok the way you did it.


I think 0004 needs a bit more work. At the very least we would have to swap
the order of pgstat_flush_pending_entries() and pgstat_flush_io() - entirely
doable. Unlike 0003, this doesn't make pg_stat_io more complete, or such, so
I'm inclined to leave it for 17.  I think there might be some more
opportunities for having counts "flow down", like the patch does.

Greetings,

Andres Freund




Re: Track IO times in pg_stat_io

2023-04-07 Thread Melanie Plageman
Attached v9 addresses review feedback as well as resolving merge
conflicts with recent relation extension patchset.

I've changed pgstat_count_io_op_time() to take a count and call
pgstat_count_io_op_n() so it can be used with smgrzeroextend(). I do
wish that the parameter to pgstat_count_io_op_n() was called "count" and
not "cnt"...

I've also reordered the call site of pgstat_count_io_op_time() in a few
locations, but I have some questions about this.

Before, I didn't think it mattered much that we didn't finish counting
IO time until after setting BM_VALID or BM_DIRTY and unsetting
BM_IO_IN_PROGRESS. With the relation extension code doing this for many
buffers at once, though, I wondered if this will make the IO timing too
inaccurate.

As such, I've moved pgstat_count_io_op_time() to before we set those
flags in all locations. I did wonder if it is bad to prolong having the
buffer pinned and not having those flags set, though.

On Tue, Apr 4, 2023 at 8:59 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-03-31 15:44:58 -0400, Melanie Plageman wrote:
> > From 789d4bf1fb749a26523dbcd2c69795916b711c68 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Tue, 21 Mar 2023 16:00:55 -0400
> > Subject: [PATCH v8 1/4] Count IO time for temp relation writes
> >
> > Both pgstat_database and pgBufferUsage write times failed to count
> > timing for flushes of dirty local buffers when acquiring a new local
> > buffer for a temporary relation block.
>
> I think it'd be worth mentioning here that we do count read time? Otherwise
> it'd not be as clear that adding tracking increases consistency...

Done

> > From f4e0db5c833f33b30d4c0b4bebec1096a1745d81 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Tue, 21 Mar 2023 18:20:44 -0400
> > Subject: [PATCH v8 2/4] FlushRelationBuffers() counts temp relation IO 
> > timing
> >
> > Add pgstat_database and pgBufferUsage IO timing counting to
> > FlushRelationBuffers() for writes of temporary relations.
> > ---
> >  src/backend/storage/buffer/bufmgr.c | 18 ++
> >  1 file changed, 18 insertions(+)
> >
> > diff --git a/src/backend/storage/buffer/bufmgr.c 
> > b/src/backend/storage/buffer/bufmgr.c
> > index b3adbbe7d2..05e98d5994 100644
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -3571,6 +3571,8 @@ FlushRelationBuffers(Relation rel)
> >  {
> >   int i;
> >   BufferDesc *bufHdr;
> > + instr_time  io_start,
> > + io_time;
> >
> >   if (RelationUsesLocalBuffers(rel))
> >   {
> > @@ -3596,17 +3598,33 @@ FlushRelationBuffers(Relation rel)
> >
> >   PageSetChecksumInplace(localpage, 
> > bufHdr->tag.blockNum);
> >
> > + if (track_io_timing)
> > + INSTR_TIME_SET_CURRENT(io_start);
> > + else
> > + INSTR_TIME_SET_ZERO(io_start);
> > +
> >   smgrwrite(RelationGetSmgr(rel),
> > 
> > BufTagGetForkNum(>tag),
> > bufHdr->tag.blockNum,
> > localpage,
> > false);
> >
> > +
>
> Spurious newline.

Fixed.

> > From 2bdad725133395ded199ecc726096e052d6e654b Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Fri, 31 Mar 2023 15:32:36 -0400
> > Subject: [PATCH v8 3/4] Track IO times in pg_stat_io
> >
> > Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.
> >
> > Reviewed-by: Bertrand Drouvot 
> > Reviewed-by: Andres Freund 
> > Discussion: 
> > https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
> > ---
>
> > -static PgStat_BktypeIO PendingIOStats;
> > +typedef struct PgStat_PendingIO
> > +{
> > + PgStat_Counter 
> > counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > + instr_time  
> > pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > +}PgStat_PendingIO;
>
> Probably will look less awful after adding the typedef to typedefs.list.

Done.
One day I will remember to add things to typedefs.list.

> > + /* we do track it */
> > + if (pgstat_tracks_io_op(bktype, io_object, 
> > io_context, io_op))
> > + {
> > + /* ensure that if IO times are 
> > non-zero, counts are > 0 */
> > + if 
> > (backend_io->times[io_object][io_context][io_op] != 0 &&
> > + 
> > backend_io->counts[io_object][io_context][io_op] <= 0)
> > + return false;
> > +
> >   

Re: Track IO times in pg_stat_io

2023-04-04 Thread Andres Freund
Hi,

On 2023-03-31 15:44:58 -0400, Melanie Plageman wrote:
> From 789d4bf1fb749a26523dbcd2c69795916b711c68 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Tue, 21 Mar 2023 16:00:55 -0400
> Subject: [PATCH v8 1/4] Count IO time for temp relation writes
>
> Both pgstat_database and pgBufferUsage write times failed to count
> timing for flushes of dirty local buffers when acquiring a new local
> buffer for a temporary relation block.

I think it'd be worth mentioning here that we do count read time? Otherwise
it'd not be as clear that adding tracking increases consistency...



> From f4e0db5c833f33b30d4c0b4bebec1096a1745d81 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Tue, 21 Mar 2023 18:20:44 -0400
> Subject: [PATCH v8 2/4] FlushRelationBuffers() counts temp relation IO timing
>
> Add pgstat_database and pgBufferUsage IO timing counting to
> FlushRelationBuffers() for writes of temporary relations.
> ---
>  src/backend/storage/buffer/bufmgr.c | 18 ++
>  1 file changed, 18 insertions(+)
>
> diff --git a/src/backend/storage/buffer/bufmgr.c 
> b/src/backend/storage/buffer/bufmgr.c
> index b3adbbe7d2..05e98d5994 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -3571,6 +3571,8 @@ FlushRelationBuffers(Relation rel)
>  {
>   int i;
>   BufferDesc *bufHdr;
> + instr_time  io_start,
> + io_time;
>
>   if (RelationUsesLocalBuffers(rel))
>   {
> @@ -3596,17 +3598,33 @@ FlushRelationBuffers(Relation rel)
>
>   PageSetChecksumInplace(localpage, 
> bufHdr->tag.blockNum);
>
> + if (track_io_timing)
> + INSTR_TIME_SET_CURRENT(io_start);
> + else
> + INSTR_TIME_SET_ZERO(io_start);
> +
>   smgrwrite(RelationGetSmgr(rel),
> 
> BufTagGetForkNum(>tag),
> bufHdr->tag.blockNum,
> localpage,
> false);
>
> +

Spurious newline.


>   buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
>   pg_atomic_unlocked_write_u32(>state, 
> buf_state);
>
>   pgstat_count_io_op(IOOBJECT_TEMP_RELATION, 
> IOCONTEXT_NORMAL, IOOP_WRITE);
>
> + if (track_io_timing)
> + {
> + INSTR_TIME_SET_CURRENT(io_time);
> + INSTR_TIME_SUBTRACT(io_time, io_start);
> + 
> pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> + 
> INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> + }
> +
> + pgBufferUsage.local_blks_written++;
> +
>   /* Pop the error context stack */
>   error_context_stack = errcallback.previous;
>   }
> --
> 2.37.2
>


> From 2bdad725133395ded199ecc726096e052d6e654b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Fri, 31 Mar 2023 15:32:36 -0400
> Subject: [PATCH v8 3/4] Track IO times in pg_stat_io
>
> Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.
>
> Reviewed-by: Bertrand Drouvot 
> Reviewed-by: Andres Freund 
> Discussion: 
> https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
> ---

> -static PgStat_BktypeIO PendingIOStats;
> +typedef struct PgStat_PendingIO
> +{
> + PgStat_Counter 
> counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> + instr_time  
> pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> +}PgStat_PendingIO;

Probably will look less awful after adding the typedef to typedefs.list.


> + /* we do track it */
> + if (pgstat_tracks_io_op(bktype, io_object, 
> io_context, io_op))
> + {
> + /* ensure that if IO times are 
> non-zero, counts are > 0 */
> + if 
> (backend_io->times[io_object][io_context][io_op] != 0 &&
> + 
> backend_io->counts[io_object][io_context][io_op] <= 0)
> + return false;
> +
>   continue;
> + }
>
> - /* There are stats and there shouldn't be */
> - if (!bktype_tracked ||
> - !pgstat_tracks_io_op(bktype, 

Re: Track IO times in pg_stat_io

2023-03-31 Thread Melanie Plageman
Attached is a rebased version in light of 8aaa04b32d

- Melanie
From 789d4bf1fb749a26523dbcd2c69795916b711c68 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 21 Mar 2023 16:00:55 -0400
Subject: [PATCH v8 1/4] Count IO time for temp relation writes

Both pgstat_database and pgBufferUsage write times failed to count
timing for flushes of dirty local buffers when acquiring a new local
buffer for a temporary relation block.
---
 src/backend/storage/buffer/localbuf.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 6f9e7eda57..ecccb6c1a9 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -114,6 +114,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	LocalBufferLookupEnt *hresult;
 	BufferDesc *bufHdr;
 	int			b;
+	instr_time	io_start,
+io_time;
 	int			trycounter;
 	bool		found;
 	uint32		buf_state;
@@ -220,6 +222,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+		if (track_io_timing)
+			INSTR_TIME_SET_CURRENT(io_start);
+		else
+			INSTR_TIME_SET_ZERO(io_start);
+
 		/* And write... */
 		smgrwrite(oreln,
   BufTagGetForkNum(>tag),
@@ -233,6 +240,15 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		/* Temporary table I/O does not use Buffer Access Strategies */
 		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+
+		if (track_io_timing)
+		{
+			INSTR_TIME_SET_CURRENT(io_time);
+			INSTR_TIME_SUBTRACT(io_time, io_start);
+			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+			INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+		}
+
 		pgBufferUsage.local_blks_written++;
 	}
 
-- 
2.37.2

From 726ab546a11707baa167106d5b6266451dfae445 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 21 Mar 2023 20:36:10 -0400
Subject: [PATCH v8 4/4] pgstat_database uses pgstat_io time counters

Use pgstat_io's pending counters to increment pgStatBlockWriteTime and
pgStatBlockReadTime.
---
 src/backend/utils/activity/pgstat_io.c | 14 --
 src/include/pgstat.h   |  4 
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 4e98c4749a..905566decd 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -124,13 +124,11 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 
 		if (io_op == IOOP_WRITE)
 		{
-			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
 INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
 		}
 		else if (io_op == IOOP_READ)
 		{
-			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
 INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
 		}
@@ -181,15 +179,19 @@ pgstat_flush_io(bool nowait)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-instr_time	time;
+PgStat_Counter time;
 
 bktype_shstats->counts[io_object][io_context][io_op] +=
 	PendingIOStats.counts[io_object][io_context][io_op];
 
-time = PendingIOStats.pending_times[io_object][io_context][io_op];
+time = INSTR_TIME_GET_MICROSEC(PendingIOStats.pending_times[io_object][io_context][io_op]);
 
-bktype_shstats->times[io_object][io_context][io_op] +=
-	INSTR_TIME_GET_MICROSEC(time);
+bktype_shstats->times[io_object][io_context][io_op] += time;
+
+if (io_op == IOOP_WRITE)
+	pgStatBlockWriteTime += time;
+else if (io_op == IOOP_READ)
+	pgStatBlockReadTime += time;
 			}
 		}
 	}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bf54c6defe..833476a2bb 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -544,10 +544,6 @@ extern void pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount);
 extern void pgstat_report_checksum_failure(void);
 extern void pgstat_report_connect(Oid dboid);
 
-#define pgstat_count_buffer_read_time(n)			\
-	(pgStatBlockReadTime += (n))
-#define pgstat_count_buffer_write_time(n)			\
-	(pgStatBlockWriteTime += (n))
 #define pgstat_count_conn_active_time(n)			\
 	(pgStatActiveTime += (n))
 #define pgstat_count_conn_txn_idle_time(n)			\
-- 
2.37.2

From 2bdad725133395ded199ecc726096e052d6e654b Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Fri, 31 Mar 2023 15:32:36 -0400
Subject: [PATCH v8 3/4] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot 
Reviewed-by: Andres Freund 
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 

Re: Track IO times in pg_stat_io

2023-03-21 Thread Melanie Plageman
On Mon, Mar 20, 2023 at 10:34 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-03-16 17:19:16 -0400, Melanie Plageman wrote:
> > > > > I wonder if we should get rid of pgStatBlockReadTime, 
> > > > > pgStatBlockWriteTime,
> > > >
> > > > And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> > > > stats?
> > >
> > > Yes.
> >
> > I think this makes sense but I am hesitant to do it in this patchset,
> > because it feels a bit hidden...maybe?
>
> I'd not do it in the same commit, but I don't see a problem with doing it in
> the same patchset.
>
> Now that I think about it again, this wouldn't make pg_stat_reset_shared('io')
> affect pg_stat_database - I was thinking we should use pgstat_io.c stats to
> provide the information for pgstat_database.c, using its own pending counter.

So, I've done this in the attached. But, won't resetting pgstat_database
be a bit weird if you have built up some IO timing in pending counters
and right after you reset a flush happens and then suddenly the values
are way above 0 again?

> > I considered refactoring pgstat_io_end() to use INSTR_TIME_ACCUM_DIFF()
> > like [1], but, in the end I actually think I would end up with more
> > operations because of the various different counters needing to be
> > updated. As it is now, I do a single subtract and a few adds (one for
> > each of the different statistics objects tracking IO times
> > (pgBufferUsage, pgStatBlockWrite/ReadTime). Whereas, I would need to do
> > an accum diff for every one of those.
>
> Right - that only INSTR_TIME_ACCUM_DIFF() only makes sense if there's just a
> single counter to update.
>
>
> WRT:
> /* TODO: AFAICT, 
> pgstat_count_buffer_write_time is only called */
> /* for shared buffers whereas 
> pgstat_count_buffer_read_time is */
> /* called for temp relations and shared 
> buffers. */
> /*
>  * is this intentional and should I match 
> current behavior or
>  * not?
>  */
>
> It's hard to see how that behaviour could be intentional.  Probably worth
> fixing in a separate patch. I don't think we're going to backpatch, but it
> would make this clearer nonetheless.


Attached v7 does this in separate commits.

Remaining feedback is about FlushLocalBuffers(). Is the idea simply to
get it into bufmgr.c because that is cleaner from an API perspective?

- Melanie
From a7ba3cce6dbbde49efa5b20e2db5cd49c259d3ad Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 21 Mar 2023 18:20:44 -0400
Subject: [PATCH v7 2/4] FlushRelationBuffers() counts temp relation IO timing

Add pgstat_database and pgBufferUsage IO timing counting to
FlushRelationBuffers() for writes of temporary relations.
---
 src/backend/storage/buffer/bufmgr.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68..dea2e8fe40 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3591,6 +3591,8 @@ FlushRelationBuffers(Relation rel)
 {
 	int			i;
 	BufferDesc *bufHdr;
+	instr_time	io_start,
+io_time;
 
 	if (RelationUsesLocalBuffers(rel))
 	{
@@ -3616,17 +3618,33 @@ FlushRelationBuffers(Relation rel)
 
 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+if (track_io_timing)
+	INSTR_TIME_SET_CURRENT(io_start);
+else
+	INSTR_TIME_SET_ZERO(io_start);
+
 smgrwrite(RelationGetSmgr(rel),
 		  BufTagGetForkNum(>tag),
 		  bufHdr->tag.blockNum,
 		  localpage,
 		  false);
 
+
 buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 pg_atomic_unlocked_write_u32(>state, buf_state);
 
 pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 
+if (track_io_timing)
+{
+	INSTR_TIME_SET_CURRENT(io_time);
+	INSTR_TIME_SUBTRACT(io_time, io_start);
+	pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+	INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+}
+
+pgBufferUsage.local_blks_written++;
+
 /* Pop the error context stack */
 error_context_stack = errcallback.previous;
 			}
-- 
2.37.2

From 1088cde0ea0d39b7e55cd919f4c9151136f36e28 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 21 Mar 2023 16:00:55 -0400
Subject: [PATCH v7 1/4] Count IO time for temp relation writes

Both pgstat_database and pgBufferUsage write times failed to count
timing for flushes of dirty local buffers when acquiring a new local
buffer for a temporary relation block.
---
 src/backend/storage/buffer/localbuf.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663..80510411ae 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ 

Re: Track IO times in pg_stat_io

2023-03-20 Thread Andres Freund
Hi,

On 2023-03-16 17:19:16 -0400, Melanie Plageman wrote:
> > > > I wonder if we should get rid of pgStatBlockReadTime, 
> > > > pgStatBlockWriteTime,
> > >
> > > And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> > > stats?
> >
> > Yes.
> 
> I think this makes sense but I am hesitant to do it in this patchset,
> because it feels a bit hidden...maybe?

I'd not do it in the same commit, but I don't see a problem with doing it in
the same patchset.

Now that I think about it again, this wouldn't make pg_stat_reset_shared('io')
affect pg_stat_database - I was thinking we should use pgstat_io.c stats to
provide the information for pgstat_database.c, using its own pending counter.


> > No, I don't think I am suggesting that. What I am trying to suggest is that
> > PendingIOStats should contain instr_time, but that PgStat_IO should contain
> > PgStat_Counter as microseconds, as before.
> 
> So, I've modified the code to make a union of instr_time and
> PgStat_Counter in PgStat_BktypeIO. I am not quite sure if this is okay.
> I store in microsec and then in pg_stat_io, I multiply to get
> milliseconds for display.

Not a fan - what do we gain by having this union? It seems considerably
cleaner to have a struct local to pgstat_io.c that uses instr_time and have a
clean type in PgStat_BktypeIO.  In fact, the code worked after just changing
that.

I don't think it makes sense to have pgstat_io_start()/end() as well as
pgstat_count_io*. For one, the name seems in a "too general namespace" - why
not a pgstat_count*?


> I considered refactoring pgstat_io_end() to use INSTR_TIME_ACCUM_DIFF()
> like [1], but, in the end I actually think I would end up with more
> operations because of the various different counters needing to be
> updated. As it is now, I do a single subtract and a few adds (one for
> each of the different statistics objects tracking IO times
> (pgBufferUsage, pgStatBlockWrite/ReadTime). Whereas, I would need to do
> an accum diff for every one of those.

Right - that only INSTR_TIME_ACCUM_DIFF() only makes sense if there's just a
single counter to update.


WRT:
/* TODO: AFAICT, pgstat_count_buffer_write_time 
is only called */
/* for shared buffers whereas 
pgstat_count_buffer_read_time is */
/* called for temp relations and shared 
buffers. */
/*
 * is this intentional and should I match 
current behavior or
 * not?
 */

It's hard to see how that behaviour could be intentional.  Probably worth
fixing in a separate patch. I don't think we're going to backpatch, but it
would make this clearer nonetheless.


Incremental patch with some of the above changed attached.



Btw, it's quite nice how one now can attribute time more easily:

20 connections copying an 8MB file 50 times each:
SELECT reuses, evictions, writes, write_time, extends, extend_time FROM 
pg_stat_io WHERE backend_type = 'client backend' AND io_object = 'relation' AND 
io_context='bulkwrite';
┌┬───┬┬┬─┬─┐
│ reuses │ evictions │ writes │ write_time │ extends │ extend_time │
├┼───┼┼┼─┼─┤
│  36112 │ 0 │  36112 │141 │ 1523176 │8676 │
└┴───┴┴┴─┴─┘

20 connections copying an 80MB file 5 times each:
┌─┬───┬─┬┬─┬─┐
│ reuses  │ evictions │ writes  │ write_time │ extends │ extend_time │
├─┼───┼─┼┼─┼─┤
│ 1318539 │ 0 │ 1318539 │   5013 │ 1523339 │7873 │
└─┴───┴─┴┴─┴─┘
(1 row)


Greetings,

Andres
>From 5d4aa3f6c651006f1ec960f59e24ebc8b5a8ca25 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 6 Mar 2023 10:41:51 -0500
Subject: [PATCH v6 1/2] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot 
Reviewed-by: Andres Freund 
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 src/include/catalog/pg_proc.dat|   6 +-
 src/include/pgstat.h   |  14 +++-
 src/backend/catalog/system_views.sql   |   4 +
 src/backend/storage/buffer/bufmgr.c|  56 ++
 src/backend/storage/buffer/localbuf.c  |   6 +-
 src/backend/storage/smgr/md.c  |  27 ---
 src/backend/utils/activity/pgstat.c|  77 ++-
 src/backend/utils/activity/pgstat_io.c | 101 -
 src/backend/utils/adt/pgstatfuncs.c|  48 ++--
 doc/src/sgml/monitoring.sgml   |  59 +++
 src/test/regress/expected/rules.out|   6 

Re: Track IO times in pg_stat_io

2023-03-16 Thread Melanie Plageman
v5 attached mostly addresses instr_time persistence issues.

On Tue, Mar 14, 2023 at 6:56 PM Andres Freund  wrote:
> On 2023-03-09 11:50:38 -0500, Melanie Plageman wrote:
> > On Tue, Mar 7, 2023 at 1:39 PM Andres Freund  wrote:
> > > On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > > > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think 
> > > > > that would be a good idea
> > > > > to also check that if counts are not Zero then times are not Zero.
> > > >
> > > > Yes, I think adding some validation around the relationship between
> > > > counts and timing should help prevent developers from forgetting to call
> > > > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> > > >
> > > > However, I think that we cannot check that if IO counts are non-zero
> > > > that IO times are non-zero, because the user may not have
> > > > track_io_timing enabled. We can check that if IO times are not zero, IO
> > > > counts are not zero. I've done this in the attached v3.
> > >
> > > And even if track_io_timing is enabled, the timer granularity might be so 
> > > low
> > > that we *still* get zeroes.
> > >
> > > I wonder if we should get rid of pgStatBlockReadTime, 
> > > pgStatBlockWriteTime,
> >
> > And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> > stats?
>
> Yes.

I think this makes sense but I am hesitant to do it in this patchset,
because it feels a bit hidden...maybe?

But, if you feel strongly, I will make the change.

> > > >  typedef struct PgStat_BktypeIO
> > > >  {
> > > > - PgStat_Counter 
> > > > data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > > + PgStat_Counter 
> > > > counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > > + instr_time  
> > > > times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > >  } PgStat_BktypeIO;
> > >
> > > Ah, you're going to hate me. We can't store instr_time on disk. There's
> > > another patch that gets substantial peformance gains by varying the 
> > > frequency
> > > at which instr_time keeps track of time based on the CPU frequency...
> >
> > What does that have to do with what we can store on disk?
>
> The frequency can change.

Ah, I see.

> > If so, would it not be enough to do this when reading/writing the stats
> > file?
>
> Theoretically yes. But to me it seems cleaner to do it when flushing to shared
> stats. See also the overflow issue below.
>
> > > It also just doesn't have enough range to keep track of system wide
> > > time on a larger system. A single backend won't run for 293 years, but
> > > with a few thousand backends that's a whole different story.
> > >
> > > I think we need to accumulate in instr_time, but convert to floating point
> > > when flushing stats.
> >
> > Hmmm. So, are you saying that we need to read from disk when we query
> > the view and add that to what is in shared memory? That we only store
> > the delta since the last restart in the instr_time array?
>
> No, I don't think I am suggesting that. What I am trying to suggest is that
> PendingIOStats should contain instr_time, but that PgStat_IO should contain
> PgStat_Counter as microseconds, as before.

So, I've modified the code to make a union of instr_time and
PgStat_Counter in PgStat_BktypeIO. I am not quite sure if this is okay.
I store in microsec and then in pg_stat_io, I multiply to get
milliseconds for display.

I considered refactoring pgstat_io_end() to use INSTR_TIME_ACCUM_DIFF()
like [1], but, in the end I actually think I would end up with more
operations because of the various different counters needing to be
updated. As it is now, I do a single subtract and a few adds (one for
each of the different statistics objects tracking IO times
(pgBufferUsage, pgStatBlockWrite/ReadTime). Whereas, I would need to do
an accum diff for every one of those.

- Melanie

[1] 
https://www.postgresql.org/message-id/1feedb83-7aa9-cb4b-5086-598349d3f555%40gmail.com
From 5c02cd9cb784bf22f756fb9d92f4bd29ba2e744a Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 6 Mar 2023 10:41:51 -0500
Subject: [PATCH v5] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot 
Reviewed-by: Andres Freund 
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml   | 59 +++
 src/backend/catalog/system_views.sql   |  4 ++
 src/backend/storage/buffer/bufmgr.c| 56 ++-
 src/backend/storage/buffer/localbuf.c  |  6 +-
 src/backend/storage/smgr/md.c  | 27 ---
 src/backend/utils/activity/pgstat.c| 77 +++-
 src/backend/utils/activity/pgstat_io.c | 99 +-
 src/backend/utils/adt/pgstatfuncs.c| 48 +++--
 src/include/catalog/pg_proc.dat|  6 +-
 src/include/pgstat.h   | 14 +++-
 

Re: Track IO times in pg_stat_io

2023-03-14 Thread Andres Freund
Hi,

On 2023-03-09 11:50:38 -0500, Melanie Plageman wrote:
> On Tue, Mar 7, 2023 at 1:39 PM Andres Freund  wrote:
> > On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think 
> > > > that would be a good idea
> > > > to also check that if counts are not Zero then times are not Zero.
> > >
> > > Yes, I think adding some validation around the relationship between
> > > counts and timing should help prevent developers from forgetting to call
> > > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> > >
> > > However, I think that we cannot check that if IO counts are non-zero
> > > that IO times are non-zero, because the user may not have
> > > track_io_timing enabled. We can check that if IO times are not zero, IO
> > > counts are not zero. I've done this in the attached v3.
> >
> > And even if track_io_timing is enabled, the timer granularity might be so 
> > low
> > that we *still* get zeroes.
> >
> > I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,
>
> And then have pg_stat_reset_shared('io') reset pg_stat_database IO
> stats?

Yes.


> > > @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char 
> > > relpersistence, ForkNumber forkNum,
> > >
> > >   if (isExtend)
> > >   {
> > > + instr_time  io_start,
> > > + io_time;
> > > +
> > >   /* new buffers are zero-filled */
> > >   MemSet((char *) bufBlock, 0, BLCKSZ);
> > > +
> > > + if (track_io_timing)
> > > + INSTR_TIME_SET_CURRENT(io_start);
> > > + else
> > > + INSTR_TIME_SET_ZERO(io_start);
> > > +
> >
> > I wonder if there's an argument for tracking this in the existing IO stats 
> > as
> > well. But I guess we've lived with this for a long time...
>
> Not sure I want to include that in this patchset.

No, probably not.


> > >  typedef struct PgStat_BktypeIO
> > >  {
> > > - PgStat_Counter 
> > > data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > + PgStat_Counter 
> > > counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > > + instr_time  
> > > times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > >  } PgStat_BktypeIO;
> >
> > Ah, you're going to hate me. We can't store instr_time on disk. There's
> > another patch that gets substantial peformance gains by varying the 
> > frequency
> > at which instr_time keeps track of time based on the CPU frequency...
>
> What does that have to do with what we can store on disk?

The frequency can change.


> If so, would it not be enough to do this when reading/writing the stats
> file?

Theoretically yes. But to me it seems cleaner to do it when flushing to shared
stats. See also the overflow issue below.



> void
> instr_time_deserialize(instr_time *dest, int64 *src, int length)
> {
> for (size_t i = 0; i < length; i++)
> {
> INSTR_TIME_SET_ZERO(dest[i]);
> dest[i].ticks = src[i];
> }
> }

That wouldn't be correct, because what ticks means will at some point change
between postgres stopping and starting.



> > It also just doesn't have enough range to keep track of system wide
> > time on a larger system. A single backend won't run for 293 years, but
> > with a few thousand backends that's a whole different story.
> >
> > I think we need to accumulate in instr_time, but convert to floating point
> > when flushing stats.
>
> Hmmm. So, are you saying that we need to read from disk when we query
> the view and add that to what is in shared memory? That we only store
> the delta since the last restart in the instr_time array?

No, I don't think I am suggesting that. What I am trying to suggest is that
PendingIOStats should contain instr_time, but that PgStat_IO should contain
PgStat_Counter as microseconds, as before.


> But, I don't see how that avoids the problem of backend total runtime
> being 293 years. We would have to reset and write out the delta whenever
> we thought the times would overflow.

The overflow risk is due to storing nanoseconds (which is what instr_time
stores internally on linux now) - which we don't need to do once
accumulatated. Right now we store them as microseconds.

nanosecond range:
((2**63) - 1)/(10**9*60*60*24*365) -> 292 years
microsecond range:
((2**63) - 1)/(10**6*60*60*24*365) -> 292471 years

If you assume 5k connections continually doing IO, a range of 292 years would
last 21 days at nanosecond resolution. At microsecond resolution it's 58
years.

Greetings,

Andres Freund




Re: Track IO times in pg_stat_io

2023-03-09 Thread Melanie Plageman
Hi, v4 attached addresses these review comments.

On Tue, Mar 7, 2023 at 1:39 PM Andres Freund  wrote:
> On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think 
> > > that would be a good idea
> > > to also check that if counts are not Zero then times are not Zero.
> >
> > Yes, I think adding some validation around the relationship between
> > counts and timing should help prevent developers from forgetting to call
> > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> >
> > However, I think that we cannot check that if IO counts are non-zero
> > that IO times are non-zero, because the user may not have
> > track_io_timing enabled. We can check that if IO times are not zero, IO
> > counts are not zero. I've done this in the attached v3.
>
> And even if track_io_timing is enabled, the timer granularity might be so low
> that we *still* get zeroes.
>
> I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,

And then have pg_stat_reset_shared('io') reset pg_stat_database IO
stats?

> > @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char 
> > relpersistence, ForkNumber forkNum,
> >
> >   if (isExtend)
> >   {
> > + instr_time  io_start,
> > + io_time;
> > +
> >   /* new buffers are zero-filled */
> >   MemSet((char *) bufBlock, 0, BLCKSZ);
> > +
> > + if (track_io_timing)
> > + INSTR_TIME_SET_CURRENT(io_start);
> > + else
> > + INSTR_TIME_SET_ZERO(io_start);
> > +
>
> I wonder if there's an argument for tracking this in the existing IO stats as
> well. But I guess we've lived with this for a long time...

Not sure I want to include that in this patchset.

> > @@ -2981,16 +2998,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, 
> > IOObject io_object,
> >* When a strategy is not in use, the write can only be a "regular" 
> > write
> >* of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
> >*/
> > - pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
> > -
> >   if (track_io_timing)
> >   {
> >   INSTR_TIME_SET_CURRENT(io_time);
> >   INSTR_TIME_SUBTRACT(io_time, io_start);
> >   
> > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> >   INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> > + pgstat_count_io_time(IOOBJECT_RELATION, io_context, 
> > IOOP_WRITE, io_time);
> >   }
>
> I think this needs a bit of cleanup - pgstat_count_buffer_write_time(),
> pgBufferUsage.blk_write_time++, pgstat_count_io_time() is a bit excessive. We
> might not be able to reduce the whole duplication at this point, but at least
> it should be a bit more centralized.

So, in the attached v4, I've introduced pgstat_io_start() and
pgstat_io_end(...). The end IO function takes the IOObject, IOOp, and
IOContext, in addition to the start_time, so that we know which
pgBufferUsage field to increment and which pgstat_count_buffer_*_time()
to call.

I will note that calling this function now causes pgBufferUsage and
pgStatBlock*Time to be incremented in a couple of places that they were
not before. I think those might have been accidental omissions, so I
think it is okay.

The exception is pgstat_count_write_time() being only called for
relations in shared buffers and not temporary relations while
pgstat_count_buffer_read_time() is called for temporary relations and
relations in shared buffers. I left that behavior as is, though it seems
like it is wrong.

I added pgstat_io_start() to pgstat.c -- not sure if it is best there.

I could separate it into a commit that does this refactoring of the
existing counting (without adding pgstat_count_io_time()) and then
another that adds pgstat_count_io_time(). I hesitated to do that until I
knew that the new functions were viable.

> > + pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
> >   pgBufferUsage.shared_blks_written++;
> >
> >   /*
> > @@ -3594,6 +3611,9 @@ FlushRelationBuffers(Relation rel)
> >
> >   if (RelationUsesLocalBuffers(rel))
> >   {
> > + instr_time  io_start,
> > + io_time;
> > +
> >   for (i = 0; i < NLocBuffer; i++)
> >   {
> >   uint32  buf_state;
> > @@ -3616,6 +3636,11 @@ FlushRelationBuffers(Relation rel)
> >
> >   PageSetChecksumInplace(localpage, 
> > bufHdr->tag.blockNum);
> >
> > + if (track_io_timing)
> > + INSTR_TIME_SET_CURRENT(io_start);
> > + else
> > + INSTR_TIME_SET_ZERO(io_start);
> > +
> >   smgrwrite(RelationGetSmgr(rel),
> >   

Re: Track IO times in pg_stat_io

2023-03-09 Thread Imseih (AWS), Sami
> >>> Now I've a second thought: what do you think about resetting the related 
> >>> number
> >>> of operations and *_time fields when enabling/disabling track_io_timing? 
> >>> (And mention it in the doc).
> >>>
> >>> That way it'd prevent bad interpretation (at least as far the time per 
> >>> operation metrics are concerned).
> >>>
> >>> Thinking that way as we'd loose some (most?) benefits of the new *_time 
> >>> columns
> >>> if one can't "trust" their related operations and/or one is not sampling 
> >>> pg_stat_io frequently enough (to discard the samples
> >>> where the track_io_timing changes occur).
> >>>
> >>> But well, resetting the operations could also lead to bad interpretation 
> >>> about the operations...
> >>>
> >>> Not sure about which approach I like the most yet, what do you think?
> >>
> >> Oh, this is an interesting idea. I think you are right about the
> >> synchronization issues making the statistics untrustworthy and, thus,
> >> unuseable.
> > 
> > No, I don't think we can do that. It can be enabled on a per-session basis.

> Oh right. So it's even less clear to me to get how one would make use of 
> those new *_time fields, given that:

> - pg_stat_io is "global" across all sessions. So, even if one session is 
> doing some "testing" and needs to turn track_io_timing on, then it
> is even not sure it's only reflecting its own testing (as other sessions may 
> have turned it on too).

> - There is the risk mentioned above of bad interpretations for the "time per 
> operation" metrics.

> - Even if there is frequent enough sampling of it pg_stat_io, one does not 
> know which samples contain track_io_timing changes (at the cluster or session 
> level).

As long as track_io_timing can be toggled, blk_write_time could lead to wrong 
conclusions.
I think it may be helpful to track the blks_read when track_io_timing is enabled
Separately.

blks_read will be as is and give the overall blks_read, while a new column
blks_read_with_timing will only report on blks_read with track_io_timing 
enabled.

blks_read_with_timing should never be larger than blks_read.

This will then make the blks_read_time valuable if it's looked at with
the blks_read_with_timing column.


Regards,

-- 

Sami Imseih
Amazon Web Services (AWS)



Re: Track IO times in pg_stat_io

2023-03-09 Thread Drouvot, Bertrand

Hi,

On 3/9/23 1:34 AM, Andres Freund wrote:

Hi,

On 2023-03-08 12:55:34 +0100, Drouvot, Bertrand wrote:

On 3/7/23 7:47 PM, Andres Freund wrote:

On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote:
No, I don't think we can do that. It can be enabled on a per-session basis.


Oh right. So it's even less clear to me to get how one would make use of those 
new *_time fields, given that:

- pg_stat_io is "global" across all sessions. So, even if one session is doing some 
"testing" and needs to turn track_io_timing on, then it
is even not sure it's only reflecting its own testing (as other sessions may 
have turned it on too).


I think for 17 we should provide access to per-existing-connection pg_stat_io
stats, and also provide a database aggregated version. Neither should be
particularly hard.



+1 that would be great.


I don't think it's particularly useful to use the time to calculate "per IO"
costs - they can vary *drastically* due to kernel level buffering.


Exactly and I think that's the reason why it could be useful. I think that 
could help (with frequent enough sampling)
to try to identify when the IOs are served by the page cache or not (if one 
knows his infra well enough).

One could say (for example, depending on his environment) that if the read_time 
> 4ms then the IO is served by spindle disks (if any)
and if <<< ms then by the page cache.

What I mean is that one could try to characterized their IOs based on threshold 
that they could define.

Adding/reporting histograms in the game would be even better: something we 
could look for for 17?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track IO times in pg_stat_io

2023-03-08 Thread Andres Freund
Hi,

On 2023-03-08 12:55:34 +0100, Drouvot, Bertrand wrote:
> On 3/7/23 7:47 PM, Andres Freund wrote:
> > On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote:
> > > > Now I've a second thought: what do you think about resetting the 
> > > > related number
> > > > of operations and *_time fields when enabling/disabling 
> > > > track_io_timing? (And mention it in the doc).
> > > > 
> > > > That way it'd prevent bad interpretation (at least as far the time per 
> > > > operation metrics are concerned).
> > > > 
> > > > Thinking that way as we'd loose some (most?) benefits of the new *_time 
> > > > columns
> > > > if one can't "trust" their related operations and/or one is not 
> > > > sampling pg_stat_io frequently enough (to discard the samples
> > > > where the track_io_timing changes occur).
> > > > 
> > > > But well, resetting the operations could also lead to bad 
> > > > interpretation about the operations...
> > > > 
> > > > Not sure about which approach I like the most yet, what do you think?
> > > 
> > > Oh, this is an interesting idea. I think you are right about the
> > > synchronization issues making the statistics untrustworthy and, thus,
> > > unuseable.
> > 
> > No, I don't think we can do that. It can be enabled on a per-session basis.
> 
> Oh right. So it's even less clear to me to get how one would make use of 
> those new *_time fields, given that:
> 
> - pg_stat_io is "global" across all sessions. So, even if one session is 
> doing some "testing" and needs to turn track_io_timing on, then it
> is even not sure it's only reflecting its own testing (as other sessions may 
> have turned it on too).

I think for 17 we should provide access to per-existing-connection pg_stat_io
stats, and also provide a database aggregated version. Neither should be
particularly hard.


> - There is the risk mentioned above of bad interpretations for the "time per 
> operation" metrics.
> 
> - Even if there is frequent enough sampling of it pg_stat_io, one does not 
> know which samples contain track_io_timing changes (at the cluster or session 
> level).

You'd just make the same use of them you do with pg_stat_database.blks_read
etc today.

I don't think it's particularly useful to use the time to calculate "per IO"
costs - they can vary *drastically* due to kernel level buffering. The point
of having the time available is that it provides information that the number
of operations doesn't provide.


> > I think we simply shouldn't do anything here. This is a pre-existing issue.
> 
> Oh, never thought about it. You mean like for pg_stat_database.blks_read and 
> pg_stat_database.blk_read_time for example?

Yes.

Greetings,

Andres Freund




Re: Track IO times in pg_stat_io

2023-03-08 Thread Drouvot, Bertrand

Hi,

On 3/7/23 7:47 PM, Andres Freund wrote:

On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote:

Now I've a second thought: what do you think about resetting the related number
of operations and *_time fields when enabling/disabling track_io_timing? (And 
mention it in the doc).

That way it'd prevent bad interpretation (at least as far the time per 
operation metrics are concerned).

Thinking that way as we'd loose some (most?) benefits of the new *_time columns
if one can't "trust" their related operations and/or one is not sampling 
pg_stat_io frequently enough (to discard the samples
where the track_io_timing changes occur).

But well, resetting the operations could also lead to bad interpretation about 
the operations...

Not sure about which approach I like the most yet, what do you think?


Oh, this is an interesting idea. I think you are right about the
synchronization issues making the statistics untrustworthy and, thus,
unuseable.


No, I don't think we can do that. It can be enabled on a per-session basis.


Oh right. So it's even less clear to me to get how one would make use of those 
new *_time fields, given that:

- pg_stat_io is "global" across all sessions. So, even if one session is doing some 
"testing" and needs to turn track_io_timing on, then it
is even not sure it's only reflecting its own testing (as other sessions may 
have turned it on too).

- There is the risk mentioned above of bad interpretations for the "time per 
operation" metrics.

- Even if there is frequent enough sampling of it pg_stat_io, one does not know 
which samples contain track_io_timing changes (at the cluster or session level).


I think we simply shouldn't do anything here. This is a pre-existing issue.


Oh, never thought about it. You mean like for pg_stat_database.blks_read and 
pg_stat_database.blk_read_time for example?


I also think that loosing stats when turning track_io_timing on/off would not be
helpful.



Yeah not 100% sure too as that would lead to other possible bad interpretations.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track IO times in pg_stat_io

2023-03-07 Thread Andres Freund
On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote:
> > Now I've a second thought: what do you think about resetting the related 
> > number
> > of operations and *_time fields when enabling/disabling track_io_timing? 
> > (And mention it in the doc).
> >
> > That way it'd prevent bad interpretation (at least as far the time per 
> > operation metrics are concerned).
> >
> > Thinking that way as we'd loose some (most?) benefits of the new *_time 
> > columns
> > if one can't "trust" their related operations and/or one is not sampling 
> > pg_stat_io frequently enough (to discard the samples
> > where the track_io_timing changes occur).
> >
> > But well, resetting the operations could also lead to bad interpretation 
> > about the operations...
> >
> > Not sure about which approach I like the most yet, what do you think?
> 
> Oh, this is an interesting idea. I think you are right about the
> synchronization issues making the statistics untrustworthy and, thus,
> unuseable.

No, I don't think we can do that. It can be enabled on a per-session basis.

I think we simply shouldn't do anything here. This is a pre-existing issue. I
also think that loosing stats when turning track_io_timing on/off would not be
helpful.

Greetings,

Andres Freund




Re: Track IO times in pg_stat_io

2023-03-07 Thread Melanie Plageman
Thanks for taking another look!

On Tue, Mar 7, 2023 at 10:52 AM Drouvot, Bertrand
 wrote:
> On 3/6/23 5:30 PM, Melanie Plageman wrote:
> > Thanks for the review!
> >
> > On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand
> >  wrote:
> >> On 2/26/23 5:03 PM, Melanie Plageman wrote:
> >>> The timings will only be non-zero when track_io_timing is on
> >>
> >> That could lead to incorrect interpretation if one wants to divide the 
> >> timing per operations, say:
> >>
> >> - track_io_timing is set to on while there is already operations
> >> - or set to off while it was on (and the number of operations keeps 
> >> growing)
> >>
> >> Might be worth to warn/highlight in the "track_io_timing" doc?
> >
> > This is a good point. I've added a note to the docs for pg_stat_io.
>
> Thanks!
>
> Now I've a second thought: what do you think about resetting the related 
> number
> of operations and *_time fields when enabling/disabling track_io_timing? (And 
> mention it in the doc).
>
> That way it'd prevent bad interpretation (at least as far the time per 
> operation metrics are concerned).
>
> Thinking that way as we'd loose some (most?) benefits of the new *_time 
> columns
> if one can't "trust" their related operations and/or one is not sampling 
> pg_stat_io frequently enough (to discard the samples
> where the track_io_timing changes occur).
>
> But well, resetting the operations could also lead to bad interpretation 
> about the operations...
>
> Not sure about which approach I like the most yet, what do you think?

Oh, this is an interesting idea. I think you are right about the
synchronization issues making the statistics untrustworthy and, thus,
unuseable.

Building on your idea, what if we had the times be NULL instead of zero
when track_io_timing is disabled? Then as you suggested, when you enable
track_io_timing, it resets the IOOp counts and starts the times off at
zero. However, disabling track_io_timing would only NULL out the times
and not zero out the counts.

We could also, as you say, log these events.

- Melanie




Re: Track IO times in pg_stat_io

2023-03-07 Thread Andres Freund
Hi,

On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that 
> > would be a good idea
> > to also check that if counts are not Zero then times are not Zero.
> 
> Yes, I think adding some validation around the relationship between
> counts and timing should help prevent developers from forgetting to call
> pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> 
> However, I think that we cannot check that if IO counts are non-zero
> that IO times are non-zero, because the user may not have
> track_io_timing enabled. We can check that if IO times are not zero, IO
> counts are not zero. I've done this in the attached v3.

And even if track_io_timing is enabled, the timer granularity might be so low
that we *still* get zeroes.

I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,


> @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char 
> relpersistence, ForkNumber forkNum,
>  
>   if (isExtend)
>   {
> + instr_time  io_start,
> + io_time;
> +
>   /* new buffers are zero-filled */
>   MemSet((char *) bufBlock, 0, BLCKSZ);
> +
> + if (track_io_timing)
> + INSTR_TIME_SET_CURRENT(io_start);
> + else
> + INSTR_TIME_SET_ZERO(io_start);
> +

I wonder if there's an argument for tracking this in the existing IO stats as
well. But I guess we've lived with this for a long time...


> @@ -2981,16 +2998,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, 
> IOObject io_object,
>* When a strategy is not in use, the write can only be a "regular" 
> write
>* of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
>*/
> - pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
> -
>   if (track_io_timing)
>   {
>   INSTR_TIME_SET_CURRENT(io_time);
>   INSTR_TIME_SUBTRACT(io_time, io_start);
>   
> pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
>   INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> + pgstat_count_io_time(IOOBJECT_RELATION, io_context, IOOP_WRITE, 
> io_time);
>   }

I think this needs a bit of cleanup - pgstat_count_buffer_write_time(),
pgBufferUsage.blk_write_time++, pgstat_count_io_time() is a bit excessive. We
might not be able to reduce the whole duplication at this point, but at least
it should be a bit more centralized.



> + pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
>   pgBufferUsage.shared_blks_written++;
>  
>   /*
> @@ -3594,6 +3611,9 @@ FlushRelationBuffers(Relation rel)
>  
>   if (RelationUsesLocalBuffers(rel))
>   {
> + instr_time  io_start,
> + io_time;
> +
>   for (i = 0; i < NLocBuffer; i++)
>   {
>   uint32  buf_state;
> @@ -3616,6 +3636,11 @@ FlushRelationBuffers(Relation rel)
>  
>   PageSetChecksumInplace(localpage, 
> bufHdr->tag.blockNum);
>  
> + if (track_io_timing)
> + INSTR_TIME_SET_CURRENT(io_start);
> + else
> + INSTR_TIME_SET_ZERO(io_start);
> +
>   smgrwrite(RelationGetSmgr(rel),
> 
> BufTagGetForkNum(>tag),
> bufHdr->tag.blockNum,

I don't think you need the INSTR_TIME_SET_ZERO() in the body of the loop, to
silence the compiler warnings you can do it one level up.



> @@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, 
> BlockNumber blockNum,
>  
>   PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
>  
> + if (track_io_timing)
> + INSTR_TIME_SET_CURRENT(io_start);
> + else
> + INSTR_TIME_SET_ZERO(io_start);
> +
>   /* And write... */
>   smgrwrite(oreln,
> BufTagGetForkNum(>tag),
> @@ -239,6 +246,13 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, 
> BlockNumber blockNum,
>   buf_state &= ~BM_DIRTY;
>   pg_atomic_unlocked_write_u32(>state, buf_state);
>  
> + if (track_io_timing)
> + {
> + INSTR_TIME_SET_CURRENT(io_time);
> + INSTR_TIME_SUBTRACT(io_time, io_start);
> + pgstat_count_io_time(IOOBJECT_TEMP_RELATION, 
> IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
> + }
> +
>   pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, 
> IOOP_WRITE);
>   pgBufferUsage.local_blks_written++;
>   }

Perhaps we can instead introduce a 

Re: Track IO times in pg_stat_io

2023-03-07 Thread Drouvot, Bertrand

Hi,

On 3/6/23 5:30 PM, Melanie Plageman wrote:

Thanks for the review!

On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand
 wrote:

On 2/26/23 5:03 PM, Melanie Plageman wrote:

The timings will only be non-zero when track_io_timing is on


That could lead to incorrect interpretation if one wants to divide the timing 
per operations, say:

- track_io_timing is set to on while there is already operations
- or set to off while it was on (and the number of operations keeps growing)

Might be worth to warn/highlight in the "track_io_timing" doc?


This is a good point. I've added a note to the docs for pg_stat_io.


Thanks!

Now I've a second thought: what do you think about resetting the related number
of operations and *_time fields when enabling/disabling track_io_timing? (And 
mention it in the doc).

That way it'd prevent bad interpretation (at least as far the time per 
operation metrics are concerned).

Thinking that way as we'd loose some (most?) benefits of the new *_time columns
if one can't "trust" their related operations and/or one is not sampling 
pg_stat_io frequently enough (to discard the samples
where the track_io_timing changes occur).

But well, resetting the operations could also lead to bad interpretation about 
the operations...

Not sure about which approach I like the most yet, what do you think?


That leads to pgstat_count_io_time() to be called before pgstat_count_io_op() 
(for the IOOP_EXTEND case) and
after pgstat_count_io_op() (for the IOOP_READ case).

What about calling them in the same order and so that pgstat_count_io_time() is 
called before pgstat_count_io_op()?

If so, the ordering would also need to be changed in:

- FlushRelationBuffers()
- register_dirty_segment()


Yes, good point. I've updated the code to use this suggested ordering in
attached v3.



Thanks, this looks good to me.


As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that 
would be a good idea
to also check that if counts are not Zero then times are not Zero.


Yes, I think adding some validation around the relationship between
counts and timing should help prevent developers from forgetting to call
pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).

However, I think that we cannot check that if IO counts are non-zero
that IO times are non-zero, because the user may not have
track_io_timing enabled.


Yeah, right.


We can check that if IO times are not zero, IO
counts are not zero. I've done this in the attached v3.



Thanks, looks good to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track IO times in pg_stat_io

2023-03-06 Thread Melanie Plageman
Thanks for the review!

On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand
 wrote:
> On 2/26/23 5:03 PM, Melanie Plageman wrote:
> > As suggested in [1], the attached patch adds IO times to pg_stat_io;
>
> Thanks for the patch!
>
> I started to have a look at it and figured out that a tiny rebase was needed 
> (due to
> 728560db7d and b9f0e54bc9), so please find the rebase (aka V2) attached.

Thanks for doing that!

> > The timings will only be non-zero when track_io_timing is on
>
> That could lead to incorrect interpretation if one wants to divide the timing 
> per operations, say:
>
> - track_io_timing is set to on while there is already operations
> - or set to off while it was on (and the number of operations keeps growing)
>
> Might be worth to warn/highlight in the "track_io_timing" doc?

This is a good point. I've added a note to the docs for pg_stat_io.

> +   if (track_io_timing)
> +   {
> +   INSTR_TIME_SET_CURRENT(io_time);
> +   INSTR_TIME_SUBTRACT(io_time, io_start);
> +   pgstat_count_io_time(io_object, io_context, 
> IOOP_EXTEND, io_time);
> +   }
> +
> +
>  pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
>
> vs
>
> @@ -1042,6 +1059,7 @@ ReadBuffer_common(SMgrRelation smgr, char 
> relpersistence, ForkNumber forkNum,
>  INSTR_TIME_SUBTRACT(io_time, io_start);
>  
> pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
>  INSTR_TIME_ADD(pgBufferUsage.blk_read_time, 
> io_time);
> +   pgstat_count_io_time(io_object, io_context, 
> IOOP_READ, io_time);
>  }
>
> That leads to pgstat_count_io_time() to be called before pgstat_count_io_op() 
> (for the IOOP_EXTEND case) and
> after pgstat_count_io_op() (for the IOOP_READ case).
>
> What about calling them in the same order and so that pgstat_count_io_time() 
> is called before pgstat_count_io_op()?
>
> If so, the ordering would also need to be changed in:
>
> - FlushRelationBuffers()
> - register_dirty_segment()

Yes, good point. I've updated the code to use this suggested ordering in
attached v3.

> > There is one minor question (in the code as a TODO) which is whether or
> > not it is worth cross-checking that IO counts and times are either both
> > zero or neither zero in the validation function
> > pgstat_bktype_io_stats_valid().
> >
>
> As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that 
> would be a good idea
> to also check that if counts are not Zero then times are not Zero.

Yes, I think adding some validation around the relationship between
counts and timing should help prevent developers from forgetting to call
pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).

However, I think that we cannot check that if IO counts are non-zero
that IO times are non-zero, because the user may not have
track_io_timing enabled. We can check that if IO times are not zero, IO
counts are not zero. I've done this in the attached v3.

- Melanie
From 52d997001108a52c833b339f9b8dcb3d34ed3270 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 6 Mar 2023 10:41:51 -0500
Subject: [PATCH v3] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot 
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml   | 59 
 src/backend/catalog/system_views.sql   |  4 ++
 src/backend/storage/buffer/bufmgr.c| 40 +--
 src/backend/storage/buffer/localbuf.c  | 14 
 src/backend/storage/smgr/md.c  | 47 ++---
 src/backend/utils/activity/pgstat_io.c | 96 +-
 src/backend/utils/adt/pgstatfuncs.c| 40 +--
 src/include/catalog/pg_proc.dat|  6 +-
 src/include/pgstat.h   |  5 +-
 src/test/regress/expected/rules.out|  6 +-
 10 files changed, 275 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..ad3667f258 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   
+read_time double precision
+   
+   
+Time spent in read operations in milliseconds (if  is enabled, otherwise zero)
+   
+  
+ 
+
  
   

@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   
+write_time double precision
+   
+   
+Time spent in write operations in milliseconds (if  is enabled, otherwise zero)
+   
+ 

Re: Track IO times in pg_stat_io

2023-02-28 Thread Drouvot, Bertrand

Hi,

On 2/26/23 5:03 PM, Melanie Plageman wrote:

Hi,

As suggested in [1], the attached patch adds IO times to pg_stat_io;


Thanks for the patch!

I started to have a look at it and figured out that a tiny rebase was needed 
(due to
728560db7d and b9f0e54bc9), so please find the rebase (aka V2) attached.


The timings will only be non-zero when track_io_timing is on


That could lead to incorrect interpretation if one wants to divide the timing 
per operations, say:

- track_io_timing is set to on while there is already operations
- or set to off while it was on (and the number of operations keeps growing)

Might be worth to warn/highlight in the "track_io_timing" doc?


+   if (track_io_timing)
+   {
+   INSTR_TIME_SET_CURRENT(io_time);
+   INSTR_TIME_SUBTRACT(io_time, io_start);
+   pgstat_count_io_time(io_object, io_context, 
IOOP_EXTEND, io_time);
+   }
+
+
pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);

vs

@@ -1042,6 +1059,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, 
ForkNumber forkNum,
INSTR_TIME_SUBTRACT(io_time, io_start);

pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
INSTR_TIME_ADD(pgBufferUsage.blk_read_time, 
io_time);
+   pgstat_count_io_time(io_object, io_context, 
IOOP_READ, io_time);
}

That leads to pgstat_count_io_time() to be called before pgstat_count_io_op() 
(for the IOOP_EXTEND case) and
after pgstat_count_io_op() (for the IOOP_READ case).

What about calling them in the same order and so that pgstat_count_io_time() is 
called before pgstat_count_io_op()?

If so, the ordering would also need to be changed in:

- FlushRelationBuffers()
- register_dirty_segment()



There is one minor question (in the code as a TODO) which is whether or
not it is worth cross-checking that IO counts and times are either both
zero or neither zero in the validation function
pgstat_bktype_io_stats_valid().



As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that 
would be a good idea
to also check that if counts are not Zero then times are not Zero.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.comdiff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..2c62b0a437 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   
+read_time double precision
+   
+   
+Time spent in read operations in milliseconds (if  is enabled, otherwise zero)
+   
+  
+ 
+
  
   

@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   
+write_time double precision
+   
+   
+Time spent in write operations in milliseconds (if  is enabled, otherwise zero)
+   
+  
+ 
+
  
   

@@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   
+extend_time double precision
+   
+   
+Time spent in extend operations in milliseconds (if  is enabled, otherwise zero)
+   
+  
+ 
+
  
   

@@ -3902,6 +3938,18 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   
+fsync_time double precision
+   
+   
+Time spent in fsync operations in milliseconds (if  is enabled, otherwise zero)
+   
+  
+ 
+
  
   

diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index 34ca0e739f..39391bc2fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1123,12 +1123,16 @@ SELECT
b.io_object,
b.io_context,
b.reads,
+   b.read_time,
b.writes,
+   b.write_time,
b.extends,
+   b.extend_time,
b.op_bytes,
b.evictions,
b.reuses,
b.fsyncs,
+   b.fsync_time,
b.stats_reset
 FROM pg_stat_get_io() b;
 
diff --git a/src/backend/storage/buffer/bufmgr.c 
b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68..bbd2af9fae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1000,11 +1000,26 @@ ReadBuffer_common(SMgrRelation smgr, char 
relpersistence, ForkNumber forkNum,
 
if (isExtend)
{
+   instr_time  io_start,
+   io_time;
/* new