Re: Show WAL write and fsync stats in pg_stat_io

2024-06-09 Thread Nitin Jadhav
> If possible, let's have all the I/O stats (even for WAL) in
> pg_stat_io. Can't we show the WAL data we get from buffers in the hits
> column and then have read_bytes or something like that to know the
> amount of data read?

The ‘hits’ column in ‘pg_stat_io’ is a vital indicator for adjusting a
database. It signifies the count of cache hits, or in other words, the
instances where data was located in the ‘shared_buffers’. As a result,
keeping an eye on the ‘hits’ column in ‘pg_stat_io’ can offer useful
knowledge about the buffer cache’s efficiency and assist users in
making educated choices when fine-tuning their database. However, if
we include the hit count of WAL buffers in this, it may lead to
misleading interpretations for database tuning. If there’s something
I’ve overlooked that’s already been discussed, please feel free to
correct me.


Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Tue, May 28, 2024 at 6:18 AM Amit Kapila  wrote:
>
> On Mon, May 13, 2024 at 7:42 PM Bharath Rupireddy
>  wrote:
> >
> > On Fri, Apr 19, 2024 at 1:32 PM Nazir Bilal Yavuz  
> > wrote:
> > >
> > > > I wanted to inform you that the 73f0a13266 commit changed all WALRead
> > > > calls to read variable bytes, only the WAL receiver was reading
> > > > variable bytes before.
> > >
> > > I want to start working on this again if possible. I will try to
> > > summarize the current status:
> >
> > Thanks for working on this.
> >
> > > * With the 73f0a13266 commit, the WALRead() function started to read
> > > variable bytes in every case. Before, only the WAL receiver was
> > > reading variable bytes.
> > >
> > > * With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
> > > discussing what we have to do when this is merged. It is decided that
> > > WALReadFromBuffers() does not call pgstat_report_wait_start() because
> > > this function does not perform any IO [1]. We may follow the same
> > > logic by not including these to pg_stat_io?
> >
> > Right. WALReadFromBuffers doesn't do any I/O.
> >
> > Whoever reads WAL from disk (backends, walsenders, recovery process)
> > using pg_pread (XLogPageRead, WALRead) needs to be tracked in
> > pg_stat_io or some other view. If it were to be in pg_stat_io,
> > although we may not be able to distinguish WAL read stats at a backend
> > level (like how many times/bytes a walsender or recovery process or a
> > backend read WAL from disk), but it can help understand overall impact
> > of WAL read I/O at a cluster level. With this approach, the WAL I/O
> > stats are divided up - WAL read I/O and write I/O stats are in
> > pg_stat_io and pg_stat_wal respectively.
> >
> > This makes me think if we need to add WAL read I/O stats also to
> > pg_stat_wal. Then, we can also add WALReadFromBuffers stats
> > hits/misses there. With this approach, pg_stat_wal can be a one-stop
> > view for all the WAL related stats.
> >
>
> If possible, let's have all the I/O stats (even for WAL) in
> pg_stat_io. Can't we show the WAL data we get from buffers in the hits
> column and then have read_bytes or something like that to know the
> amount of data read?
>
> --
> With Regards,
> Amit Kapila.
>
>




Re: Show WAL write and fsync stats in pg_stat_io

2024-05-27 Thread Amit Kapila
On Mon, May 13, 2024 at 7:42 PM Bharath Rupireddy
 wrote:
>
> On Fri, Apr 19, 2024 at 1:32 PM Nazir Bilal Yavuz  wrote:
> >
> > > I wanted to inform you that the 73f0a13266 commit changed all WALRead
> > > calls to read variable bytes, only the WAL receiver was reading
> > > variable bytes before.
> >
> > I want to start working on this again if possible. I will try to
> > summarize the current status:
>
> Thanks for working on this.
>
> > * With the 73f0a13266 commit, the WALRead() function started to read
> > variable bytes in every case. Before, only the WAL receiver was
> > reading variable bytes.
> >
> > * With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
> > discussing what we have to do when this is merged. It is decided that
> > WALReadFromBuffers() does not call pgstat_report_wait_start() because
> > this function does not perform any IO [1]. We may follow the same
> > logic by not including these to pg_stat_io?
>
> Right. WALReadFromBuffers doesn't do any I/O.
>
> Whoever reads WAL from disk (backends, walsenders, recovery process)
> using pg_pread (XLogPageRead, WALRead) needs to be tracked in
> pg_stat_io or some other view. If it were to be in pg_stat_io,
> although we may not be able to distinguish WAL read stats at a backend
> level (like how many times/bytes a walsender or recovery process or a
> backend read WAL from disk), but it can help understand overall impact
> of WAL read I/O at a cluster level. With this approach, the WAL I/O
> stats are divided up - WAL read I/O and write I/O stats are in
> pg_stat_io and pg_stat_wal respectively.
>
> This makes me think if we need to add WAL read I/O stats also to
> pg_stat_wal. Then, we can also add WALReadFromBuffers stats
> hits/misses there. With this approach, pg_stat_wal can be a one-stop
> view for all the WAL related stats.
>

If possible, let's have all the I/O stats (even for WAL) in
pg_stat_io. Can't we show the WAL data we get from buffers in the hits
column and then have read_bytes or something like that to know the
amount of data read?

-- 
With Regards,
Amit Kapila.




Re: Show WAL write and fsync stats in pg_stat_io

2024-05-13 Thread Bharath Rupireddy
On Fri, Apr 19, 2024 at 1:32 PM Nazir Bilal Yavuz  wrote:
>
> > I wanted to inform you that the 73f0a13266 commit changed all WALRead
> > calls to read variable bytes, only the WAL receiver was reading
> > variable bytes before.
>
> I want to start working on this again if possible. I will try to
> summarize the current status:

Thanks for working on this.

> * With the 73f0a13266 commit, the WALRead() function started to read
> variable bytes in every case. Before, only the WAL receiver was
> reading variable bytes.
>
> * With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
> discussing what we have to do when this is merged. It is decided that
> WALReadFromBuffers() does not call pgstat_report_wait_start() because
> this function does not perform any IO [1]. We may follow the same
> logic by not including these to pg_stat_io?

Right. WALReadFromBuffers doesn't do any I/O.

Whoever reads WAL from disk (backends, walsenders, recovery process)
using pg_pread (XLogPageRead, WALRead) needs to be tracked in
pg_stat_io or some other view. If it were to be in pg_stat_io,
although we may not be able to distinguish WAL read stats at a backend
level (like how many times/bytes a walsender or recovery process or a
backend read WAL from disk), but it can help understand overall impact
of WAL read I/O at a cluster level. With this approach, the WAL I/O
stats are divided up - WAL read I/O and write I/O stats are in
pg_stat_io and pg_stat_wal respectively.

This makes me think if we need to add WAL read I/O stats also to
pg_stat_wal. Then, we can also add WALReadFromBuffers stats
hits/misses there. With this approach, pg_stat_wal can be a one-stop
view for all the WAL related stats. If needed, we can join info from
pg_stat_wal to pg_stat_io in system_views.sql so that the I/O stats
are emitted to the end-user via pg_stat_io.

> * With the b5a9b18cd0 commit, streaming I/O is merged but AFAIK this
> does not block anything related to putting WAL stats in pg_stat_io.
>
> If I am not missing any new changes, the only problem is reading
> variable bytes now. We have discussed a couple of solutions:
>
> 1- Change op_bytes to something like -1, 0, 1, NULL etc. and document
> that this means some variable byte I/O is happening.
>
> I kind of dislike this solution because if the *only* read I/O is
> happening in variable bytes, it will look like write and extend I/Os
> are happening in variable bytes as well. As a solution, it could be
> documented that only read I/Os could happen in variable bytes for now.

Yes, read I/O for relation and WAL can happen in variable bytes. I
think this idea seems reasonable and simple yet useful to know the
cluster-wide read I/O.

However, another point here is how the total number of bytes read is
represented with existing pg_stat_io columns 'reads' and 'op_bytes'.
It is known now with 'reads' * 'op_bytes', but with variable bytes,
how is read bytes calculated? Maybe add new columns
read_bytes/write_bytes?

> 2- Use op_bytes_[read | write | extend] columns instead of one
> op_bytes column, also use the first solution.
>
> This can solve the first solution's weakness but it introduces two
> more columns. This is more future proof compared to the first solution
> if there is a chance that some variable I/O could happen in write and
> extend calls as well in the future.

-1 as more columns impact the readability and usability.

> 3- Create a new pg_stat_io_wal view to put WAL I/Os here instead of 
> pg_stat_io.
>
> pg_stat_io could remain untouchable and we will have flexibility to
> edit this new view as much as we want. But the original aim of the
> pg_stat_io is evaluating all I/O from a single view and adding a new
> view breaks this aim.

-1 as it defeats the very purpose of one-stop view pg_stat_io for all
kinds of I/O. PS: see my response above about adding both WAL write
I/O and read I/O stats to pg_stat_wal.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Show WAL write and fsync stats in pg_stat_io

2024-04-24 Thread Nazir Bilal Yavuz
Hi,

On Fri, 19 Apr 2024 at 11:01, Nazir Bilal Yavuz  wrote:
> > On Thu, 18 Jan 2024 at 04:22, Michael Paquier  wrote:
> > >
> > > On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> > > > I agree with your points. While the other I/O related work is
> > > > happening we can discuss what we should do in the variable op_byte
> > > > cases. Also, this is happening only for WAL right now but if we try to
> > > > extend pg_stat_io in the future, that problem possibly will rise
> > > > again. So, it could be good to come up with a general solution, not
> > > > only for WAL.
> > >
> > > Okay, I've marked the patch as RwF for this CF.

Since the last commitfest entry was returned with feedback, I created
a new commitfest entry: https://commitfest.postgresql.org/48/4950/

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-04-19 Thread Nazir Bilal Yavuz
Hi,

On Mon, 19 Feb 2024 at 10:28, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Thu, 18 Jan 2024 at 04:22, Michael Paquier  wrote:
> >
> > On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> > > I agree with your points. While the other I/O related work is
> > > happening we can discuss what we should do in the variable op_byte
> > > cases. Also, this is happening only for WAL right now but if we try to
> > > extend pg_stat_io in the future, that problem possibly will rise
> > > again. So, it could be good to come up with a general solution, not
> > > only for WAL.
> >
> > Okay, I've marked the patch as RwF for this CF.
>
> I wanted to inform you that the 73f0a13266 commit changed all WALRead
> calls to read variable bytes, only the WAL receiver was reading
> variable bytes before.

I want to start working on this again if possible. I will try to
summarize the current status:

* With the 73f0a13266 commit, the WALRead() function started to read
variable bytes in every case. Before, only the WAL receiver was
reading variable bytes.

* With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
discussing what we have to do when this is merged. It is decided that
WALReadFromBuffers() does not call pgstat_report_wait_start() because
this function does not perform any IO [1]. We may follow the same
logic by not including these to pg_stat_io?

* With the b5a9b18cd0 commit, streaming I/O is merged but AFAIK this
does not block anything related to putting WAL stats in pg_stat_io.

If I am not missing any new changes, the only problem is reading
variable bytes now. We have discussed a couple of solutions:

1- Change op_bytes to something like -1, 0, 1, NULL etc. and document
that this means some variable byte I/O is happening.

I kind of dislike this solution because if the *only* read I/O is
happening in variable bytes, it will look like write and extend I/Os
are happening in variable bytes as well. As a solution, it could be
documented that only read I/Os could happen in variable bytes for now.

2- Use op_bytes_[read | write | extend] columns instead of one
op_bytes column, also use the first solution.

This can solve the first solution's weakness but it introduces two
more columns. This is more future proof compared to the first solution
if there is a chance that some variable I/O could happen in write and
extend calls as well in the future.

3- Create a new pg_stat_io_wal view to put WAL I/Os here instead of pg_stat_io.

pg_stat_io could remain untouchable and we will have flexibility to
edit this new view as much as we want. But the original aim of the
pg_stat_io is evaluating all I/O from a single view and adding a new
view breaks this aim.

I hope that I did not miss anything and my explanations are clear.

Any kind of feedback would be appreciated.


[1] 
https://www.postgresql.org/message-id/CAFiTN-sE7CJn-ZFj%2B-0Wv6TNytv_fp4n%2BeCszspxJ3mt77t5ig%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-02-18 Thread Nazir Bilal Yavuz
Hi,

On Thu, 18 Jan 2024 at 04:22, Michael Paquier  wrote:
>
> On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> > I agree with your points. While the other I/O related work is
> > happening we can discuss what we should do in the variable op_byte
> > cases. Also, this is happening only for WAL right now but if we try to
> > extend pg_stat_io in the future, that problem possibly will rise
> > again. So, it could be good to come up with a general solution, not
> > only for WAL.
>
> Okay, I've marked the patch as RwF for this CF.

I wanted to inform you that the 73f0a13266 commit changed all WALRead
calls to read variable bytes, only the WAL receiver was reading
variable bytes before.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-01-17 Thread Michael Paquier
On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> I agree with your points. While the other I/O related work is
> happening we can discuss what we should do in the variable op_byte
> cases. Also, this is happening only for WAL right now but if we try to
> extend pg_stat_io in the future, that problem possibly will rise
> again. So, it could be good to come up with a general solution, not
> only for WAL.

Okay, I've marked the patch as RwF for this CF.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2024-01-17 Thread Nazir Bilal Yavuz
Hi,

On Mon, 15 Jan 2024 at 09:27, Michael Paquier  wrote:
>
> On Fri, Jan 12, 2024 at 04:23:26PM +0300, Nazir Bilal Yavuz wrote:
> > On Thu, 11 Jan 2024 at 17:28, Melanie Plageman  
> > wrote:
> >> Even if we made a separate view for WAL I/O stats, we would still have
> >> this issue of variable sized I/O vs block sized I/O and would probably
> >> end up solving it with separate columns for the number of bytes and
> >> number of operations.
> >
> > Yes, I think it is more about flexibility and not changing the already
> > published pg_stat_io view.
>
> I don't know.  Adding more columns or changing op_bytes with an extra
> mode that reflects on what the other columns mean is kind of the same
> thing to me: we want pg_stat_io to report more modes so as all I/O can
> be evaluated from a single view, but the complication is now that
> everything is tied to BLCKSZ.
>
> IMHO, perhaps we'd better put this patch aside until we are absolutely
> *sure* of what we want to achieve when it comes to WAL, and I am
> afraid that this cannot happen until we're happy with the way we
> handle WAL reads *and* writes, including WAL receiver or anything that
> has the idea of pulling its own page callback with
> XLogReaderAllocate() in the backend.  Well, writes should be
> relatively "easy" as things happen with XLOG_BLCKSZ, mainly, but
> reads are the unknown part.
>
> That also seems furiously related to the work happening with async I/O
> or the fact that we may want to have in the view a separate meaning
> for cached pages or pages read directly from disk.  The worst thing
> that we would do is rush something into the tree and then have to deal
> with the aftermath of what we'd need to deal with in terms of
> compatibility depending on the state of the other I/O related work
> when the new view is released.  That would not be fun for the users
> and any hackers who would have to deal with that (aka mainly me if I
> were to commit something), because pg_stat_io could mean something in
> version N, still mean something entirely different in version N+1.

I agree with your points. While the other I/O related work is
happening we can discuss what we should do in the variable op_byte
cases. Also, this is happening only for WAL right now but if we try to
extend pg_stat_io in the future, that problem possibly will rise
again. So, it could be good to come up with a general solution, not
only for WAL.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-01-14 Thread Michael Paquier
On Fri, Jan 12, 2024 at 04:23:26PM +0300, Nazir Bilal Yavuz wrote:
> On Thu, 11 Jan 2024 at 17:28, Melanie Plageman  
> wrote:
>> Even if we made a separate view for WAL I/O stats, we would still have
>> this issue of variable sized I/O vs block sized I/O and would probably
>> end up solving it with separate columns for the number of bytes and
>> number of operations.
> 
> Yes, I think it is more about flexibility and not changing the already
> published pg_stat_io view.

I don't know.  Adding more columns or changing op_bytes with an extra
mode that reflects on what the other columns mean is kind of the same
thing to me: we want pg_stat_io to report more modes so as all I/O can
be evaluated from a single view, but the complication is now that
everything is tied to BLCKSZ.

IMHO, perhaps we'd better put this patch aside until we are absolutely
*sure* of what we want to achieve when it comes to WAL, and I am
afraid that this cannot happen until we're happy with the way we
handle WAL reads *and* writes, including WAL receiver or anything that
has the idea of pulling its own page callback with
XLogReaderAllocate() in the backend.  Well, writes should be
relatively "easy" as things happen with XLOG_BLCKSZ, mainly, but
reads are the unknown part.

That also seems furiously related to the work happening with async I/O
or the fact that we may want to have in the view a separate meaning
for cached pages or pages read directly from disk.  The worst thing
that we would do is rush something into the tree and then have to deal
with the aftermath of what we'd need to deal with in terms of
compatibility depending on the state of the other I/O related work
when the new view is released.  That would not be fun for the users
and any hackers who would have to deal with that (aka mainly me if I
were to commit something), because pg_stat_io could mean something in
version N, still mean something entirely different in version N+1.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2024-01-12 Thread Nazir Bilal Yavuz
Hi,

On Thu, 11 Jan 2024 at 17:28, Melanie Plageman
 wrote:
>
> On Thu, Jan 11, 2024 at 6:19 AM Nazir Bilal Yavuz  wrote:
> >
> > On Thu, 11 Jan 2024 at 08:01, Michael Paquier  wrote:
> > >
> > > On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > > > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz  
> > > > wrote:
> > > >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier  
> > > >> wrote:
> > > >> Oh, I understand it now. Yes, that makes sense.
> > > >> I thought removing op_bytes completely ( as you said "This patch
> > > >> extends it with two more operation sizes, and there are even cases
> > > >> where it may be a variable" ) from pg_stat_io view then adding
> > > >> something like {read | write | extend}_bytes and {read | write |
> > > >> extend}_calls could be better, so that we don't lose any information.
> > > >
> > > > Upthread, Michael says:
> > > >
> > > >> I find the use of 1 in this context a bit confusing, because when
> > > >> referring to a counter at N, then it can be understood as doing N
> > > >> times a operation,
> > > >
> > > > I didn't understand this argument, so I'm not sure if I agree or
> > > > disagree with it.
> > >
> > > Nazir has mentioned upthread one thing: what should we do for the case
> > > where a combination of (io_object,io_context) does I/O with a
> > > *variable* op_bytes, because that may be the case for the WAL
> > > receiver?  For this case, he has mentioned that we should set op_bytes
> > > to 1, but that's something I find confusing because it would mean that
> > > we are doing read, writes or extends 1 byte at a time.  My suggestion
> > > would be to use op_bytes = -1 or NULL for the variable case instead,
> > > with reads, writes and extends referring to a number of bytes rather
> > > than a number of operations.
> >
> > I agree but we can't do this only for the *variable* cases since
> > B_WAL_RECEIVER and other backends use the same
> > pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
> > is, if two backends use the same pgstat_count_io_op_time() function
> > call in the code; they must count the same thing (number of calls,
> > bytes, etc.). It could be better to count the number of bytes for all
> > the IOOBJECT_WAL IOs.
>
> I'm a bit confused by this. pgstat_count_io_op_time() can check
> MyBackendType. In fact, you do this to ban the wal receiver already.
> It is true that you would need to count all wal receiver normal
> context wal object IOOps in the variable way, but I don't see how
> pgstat_count_io_op_time() is the limiting factor as long as the
> callsite is always doing either the number of bytes or the number of
> calls.

Apologies for not being clear. Let me try to explain this by giving examples:

Let's assume that there are 3 different pgstat_count_io_op_time()
calls in the code base and they are labeled as 1, 2 and 3.

And let's' assume that: WAL receiver uses 1st and 2nd
pgstat_count_io_op_time(), autovacuum uses 2nd and 3rd
pgstat_count_io_op_time() and checkpointer uses 3rd
pgstat_count_io_op_time() to count IOs.

The 1st one is the only pgstat_count_io_op_time() call that must count
the number of bytes because of the variable cases and the others count
the number of calls or blocks.

a) WAL receiver uses both 1st and 2nd => 1st and 2nd
pgstat_count_io_op_time() must count the same thing => 2nd
pgstat_count_io_op_time() must count the number of bytes as well.

b) 2nd pgstat_count_io_op_time() started to count the number of bytes
=> Autovacuum will start to count the number of bytes => 2nd and 3rd
both are used by autocavuum => 3rd pgstat_count_io_op_time() must
count the number of bytes as well.

c) 3rd pgstat_count_io_op_time() started to count the number of bytes
=> Checkpointer will start to count the number of bytes.

The list goes on like this and if we don't have [write | read |
extend]_bytes, this effect will be multiplied.

> > > > I think these are the three proposals for handling WAL reads:
> > > >
> > > > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > > > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > > > number of calls to pg_pread() or similar
> > > > 3) setting op_bytes to NULL and the number of reads is the number of
> > > > calls to pg_pread() or similar
> > >
> > > 3) could be a number of bytes, actually.
> >
> > One important point is that we can't change only reads, if we decide
> > to count the number of bytes for the reads; writes and extends should
> > be counted as a number of bytes as well.
>
> Yes, that is true.
>
> > > > Looking at the patch, I think it is still doing 2.
> > >
> > > The patch disables stats for the WAL receiver, while the startup
> > > process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
> > > discard the variable case.
> > >
> > > > For an unpopular idea: we could add separate [IOOp]_bytes columns for
> > > > all those IOOps for which it would be relevant. It kind of stinks but
> > > > it 

Re: Show WAL write and fsync stats in pg_stat_io

2024-01-11 Thread Melanie Plageman
On Thu, Jan 11, 2024 at 6:19 AM Nazir Bilal Yavuz  wrote:
>
> On Thu, 11 Jan 2024 at 08:01, Michael Paquier  wrote:
> >
> > On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz  
> > > wrote:
> > >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier  
> > >> wrote:
> > >> Oh, I understand it now. Yes, that makes sense.
> > >> I thought removing op_bytes completely ( as you said "This patch
> > >> extends it with two more operation sizes, and there are even cases
> > >> where it may be a variable" ) from pg_stat_io view then adding
> > >> something like {read | write | extend}_bytes and {read | write |
> > >> extend}_calls could be better, so that we don't lose any information.
> > >
> > > Upthread, Michael says:
> > >
> > >> I find the use of 1 in this context a bit confusing, because when
> > >> referring to a counter at N, then it can be understood as doing N
> > >> times a operation,
> > >
> > > I didn't understand this argument, so I'm not sure if I agree or
> > > disagree with it.
> >
> > Nazir has mentioned upthread one thing: what should we do for the case
> > where a combination of (io_object,io_context) does I/O with a
> > *variable* op_bytes, because that may be the case for the WAL
> > receiver?  For this case, he has mentioned that we should set op_bytes
> > to 1, but that's something I find confusing because it would mean that
> > we are doing read, writes or extends 1 byte at a time.  My suggestion
> > would be to use op_bytes = -1 or NULL for the variable case instead,
> > with reads, writes and extends referring to a number of bytes rather
> > than a number of operations.
>
> I agree but we can't do this only for the *variable* cases since
> B_WAL_RECEIVER and other backends use the same
> pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
> is, if two backends use the same pgstat_count_io_op_time() function
> call in the code; they must count the same thing (number of calls,
> bytes, etc.). It could be better to count the number of bytes for all
> the IOOBJECT_WAL IOs.

I'm a bit confused by this. pgstat_count_io_op_time() can check
MyBackendType. In fact, you do this to ban the wal receiver already.
It is true that you would need to count all wal receiver normal
context wal object IOOps in the variable way, but I don't see how
pgstat_count_io_op_time() is the limiting factor as long as the
callsite is always doing either the number of bytes or the number of
calls.

> > > I think these are the three proposals for handling WAL reads:
> > >
> > > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > > number of calls to pg_pread() or similar
> > > 3) setting op_bytes to NULL and the number of reads is the number of
> > > calls to pg_pread() or similar
> >
> > 3) could be a number of bytes, actually.
>
> One important point is that we can't change only reads, if we decide
> to count the number of bytes for the reads; writes and extends should
> be counted as a number of bytes as well.

Yes, that is true.

> > > Looking at the patch, I think it is still doing 2.
> >
> > The patch disables stats for the WAL receiver, while the startup
> > process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
> > discard the variable case.
> >
> > > For an unpopular idea: we could add separate [IOOp]_bytes columns for
> > > all those IOOps for which it would be relevant. It kind of stinks but
> > > it would give us the freedom to document exactly what a single IOOp
> > > means for each combination of BackendType, IOContext, IOObject, and
> > > IOOp (as relevant) and still have an accurate number in the *bytes
> > > columns. Everyone will probably hate us if we do that, though.
> > > Especially because having bytes for the existing IOObjects is an
> > > existing feature.
> >
> > An issue I have with this one is that having multiple tuples for
> > each (object,context) if they have multiple op_bytes leads to
> > potentially a lot of bloat in the view.  That would be up to 8k extra
> > tuples just for the sake of op_byte's variability.
>
> Yes, that doesn't seem applicable to me.

My suggestion (again not sure it is a good idea) was actually that we
remove op_bytes and add "write_bytes", "read_bytes", and
"extend_bytes". AFAICT, this would add columns not rows. In this
schema, read bytes for wal receiver could be counted in one way and
writes in another. We could document that, for wal receiver, the reads
are not always done in units of the same size, so the read_bytes /
reads could be thought of as an average size of read.

Even if we made a separate view for WAL I/O stats, we would still have
this issue of variable sized I/O vs block sized I/O and would probably
end up solving it with separate columns for the number of bytes and
number of operations.

> > > A separate question: suppose [1] goes in (to read WAL from WAL 

Re: Show WAL write and fsync stats in pg_stat_io

2024-01-11 Thread Nazir Bilal Yavuz
Hi,

On Thu, 11 Jan 2024 at 08:01, Michael Paquier  wrote:
>
> On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > I have code review feedback as well, but I've saved that for my next email.
>
> Ah, cool.
>
> > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz  wrote:
> >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier  wrote:
> >> Oh, I understand it now. Yes, that makes sense.
> >> I thought removing op_bytes completely ( as you said "This patch
> >> extends it with two more operation sizes, and there are even cases
> >> where it may be a variable" ) from pg_stat_io view then adding
> >> something like {read | write | extend}_bytes and {read | write |
> >> extend}_calls could be better, so that we don't lose any information.
> >
> > Upthread, Michael says:
> >
> >> I find the use of 1 in this context a bit confusing, because when
> >> referring to a counter at N, then it can be understood as doing N
> >> times a operation,
> >
> > I didn't understand this argument, so I'm not sure if I agree or
> > disagree with it.
>
> Nazir has mentioned upthread one thing: what should we do for the case
> where a combination of (io_object,io_context) does I/O with a
> *variable* op_bytes, because that may be the case for the WAL
> receiver?  For this case, he has mentioned that we should set op_bytes
> to 1, but that's something I find confusing because it would mean that
> we are doing read, writes or extends 1 byte at a time.  My suggestion
> would be to use op_bytes = -1 or NULL for the variable case instead,
> with reads, writes and extends referring to a number of bytes rather
> than a number of operations.

I agree but we can't do this only for the *variable* cases since
B_WAL_RECEIVER and other backends use the same
pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
is, if two backends use the same pgstat_count_io_op_time() function
call in the code; they must count the same thing (number of calls,
bytes, etc.). It could be better to count the number of bytes for all
the IOOBJECT_WAL IOs.

> > I think these are the three proposals for handling WAL reads:
> >
> > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > number of calls to pg_pread() or similar
> > 3) setting op_bytes to NULL and the number of reads is the number of
> > calls to pg_pread() or similar
>
> 3) could be a number of bytes, actually.

One important point is that we can't change only reads, if we decide
to count the number of bytes for the reads; writes and extends should
be counted as a number of bytes as well.

> > Looking at the patch, I think it is still doing 2.
>
> The patch disables stats for the WAL receiver, while the startup
> process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
> discard the variable case.
>
> > For an unpopular idea: we could add separate [IOOp]_bytes columns for
> > all those IOOps for which it would be relevant. It kind of stinks but
> > it would give us the freedom to document exactly what a single IOOp
> > means for each combination of BackendType, IOContext, IOObject, and
> > IOOp (as relevant) and still have an accurate number in the *bytes
> > columns. Everyone will probably hate us if we do that, though.
> > Especially because having bytes for the existing IOObjects is an
> > existing feature.
>
> An issue I have with this one is that having multiple tuples for
> each (object,context) if they have multiple op_bytes leads to
> potentially a lot of bloat in the view.  That would be up to 8k extra
> tuples just for the sake of op_byte's variability.

Yes, that doesn't seem applicable to me.

> > A separate question: suppose [1] goes in (to read WAL from WAL buffers
> > directly). Now, WAL reads are not from permanent storage anymore. Are
> > we only tracking permanent storage I/O in pg_stat_io? I also had this
> > question for some of the WAL receiver functions. Should we track any
> > I/O other than permanent storage I/O? Or did I miss this being
> > addressed upthread?
>
> That's a good point.  I guess that this should just be a different
> IOOp?  That's not a IOOP_READ.  A IOOP_HIT is also different.

I think different IOContext rather than IOOp suits better for this.

> > In terms of what I/O we should track in a streaming/asynchronous
> > world, the options would be:
> >
> > 1) track read/write syscalls
> > 2) track blocks of BLCKSZ submitted to the kernel
> > 3) track bytes submitted to the kernel
> > 4) track merged I/Os (after doing any merging in the application)
> >
> > I think the debate was largely between 2 and 4. There was some
> > disagreement, but I think we landed on 2 because there is merging that
> > can happen at many levels in the storage stack (even the storage
> > controller). Distinguishing between whether or not Postgres submitted
> > 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
> > but I think it might be confusing 

Re: Show WAL write and fsync stats in pg_stat_io

2024-01-10 Thread Michael Paquier
On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> I have code review feedback as well, but I've saved that for my next email.

Ah, cool.

> On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz  wrote:
>> On Sun, 31 Dec 2023 at 03:58, Michael Paquier  wrote:
>> Oh, I understand it now. Yes, that makes sense.
>> I thought removing op_bytes completely ( as you said "This patch
>> extends it with two more operation sizes, and there are even cases
>> where it may be a variable" ) from pg_stat_io view then adding
>> something like {read | write | extend}_bytes and {read | write |
>> extend}_calls could be better, so that we don't lose any information.
> 
> Upthread, Michael says:
> 
>> I find the use of 1 in this context a bit confusing, because when
>> referring to a counter at N, then it can be understood as doing N
>> times a operation,
> 
> I didn't understand this argument, so I'm not sure if I agree or
> disagree with it.

Nazir has mentioned upthread one thing: what should we do for the case
where a combination of (io_object,io_context) does I/O with a
*variable* op_bytes, because that may be the case for the WAL
receiver?  For this case, he has mentioned that we should set op_bytes
to 1, but that's something I find confusing because it would mean that
we are doing read, writes or extends 1 byte at a time.  My suggestion
would be to use op_bytes = -1 or NULL for the variable case instead,
with reads, writes and extends referring to a number of bytes rather
than a number of operations.

> I think these are the three proposals for handling WAL reads:
> 
> 1) setting op_bytes to 1 and the number of reads is the number of bytes
> 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> number of calls to pg_pread() or similar
> 3) setting op_bytes to NULL and the number of reads is the number of
> calls to pg_pread() or similar

3) could be a number of bytes, actually.

> Looking at the patch, I think it is still doing 2.

The patch disables stats for the WAL receiver, while the startup
process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
discard the variable case.

> For an unpopular idea: we could add separate [IOOp]_bytes columns for
> all those IOOps for which it would be relevant. It kind of stinks but
> it would give us the freedom to document exactly what a single IOOp
> means for each combination of BackendType, IOContext, IOObject, and
> IOOp (as relevant) and still have an accurate number in the *bytes
> columns. Everyone will probably hate us if we do that, though.
> Especially because having bytes for the existing IOObjects is an
> existing feature.

An issue I have with this one is that having multiple tuples for
each (object,context) if they have multiple op_bytes leads to
potentially a lot of bloat in the view.  That would be up to 8k extra
tuples just for the sake of op_byte's variability.

> A separate question: suppose [1] goes in (to read WAL from WAL buffers
> directly). Now, WAL reads are not from permanent storage anymore. Are
> we only tracking permanent storage I/O in pg_stat_io? I also had this
> question for some of the WAL receiver functions. Should we track any
> I/O other than permanent storage I/O? Or did I miss this being
> addressed upthread?

That's a good point.  I guess that this should just be a different
IOOp?  That's not a IOOP_READ.  A IOOP_HIT is also different.

> In terms of what I/O we should track in a streaming/asynchronous
> world, the options would be:
> 
> 1) track read/write syscalls
> 2) track blocks of BLCKSZ submitted to the kernel
> 3) track bytes submitted to the kernel
> 4) track merged I/Os (after doing any merging in the application)
> 
> I think the debate was largely between 2 and 4. There was some
> disagreement, but I think we landed on 2 because there is merging that
> can happen at many levels in the storage stack (even the storage
> controller). Distinguishing between whether or not Postgres submitted
> 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
> but I think it might be confusing for the Postgres user trying to
> determine why their query is slow. It probably makes the most sense to
> still track in block size.
> 
> No matter what solution we pick, you should get a correct number if
> you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
> rather, there should be some way of getting an accurate number in
> bytes of the amount of a particular kind of I/O that has been done.

Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
writes or extends are counted as bytes, because we don't have a fixed
operation size for some (object,context) cases.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2024-01-10 Thread Melanie Plageman
I have code review feedback as well, but I've saved that for my next email.

On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz  wrote:
>
> On Sun, 31 Dec 2023 at 03:58, Michael Paquier  wrote:
> >
> > On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:
> > > On Tue, 26 Dec 2023 at 13:10, Michael Paquier  wrote:
> > >> I am not sure while the whole point of the exercise is to have all the
> > >> I/O related data in a single view.  Something that I've also found a
> > >> bit disturbing yesterday while looking at your patch is the fact that
> > >> the operation size is guessed from the context and object type when
> > >> querying the view because now everything is tied to BLCKSZ.  This
> > >> patch extends it with two more operation sizes, and there are even
> > >> cases where it may be a variable.  Could it be a better option to
> > >> extend pgstat_count_io_op_time() so as callers can themselves give the
> > >> size of the operation?
> > >
> > > Do you mean removing the op_bytes column and tracking the number of
> > > bytes in reads, writes, and extends? If so, that makes sense to me but
> > > I don't want to remove the number of operations; I believe that has a
> > > value too. We can extend the pgstat_count_io_op_time() so it can both
> > > track the number of bytes and the number of operations.
> >
> > Apologies if my previous wording sounded confusing.  The idea I had in
> > mind was to keep op_bytes in pg_stat_io, and extend it so as a value
> > of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
> > as a number of bytes.
>
> Oh, I understand it now. Yes, that makes sense.
> I thought removing op_bytes completely ( as you said "This patch
> extends it with two more operation sizes, and there are even cases
> where it may be a variable" ) from pg_stat_io view then adding
> something like {read | write | extend}_bytes and {read | write |
> extend}_calls could be better, so that we don't lose any information.

Forgive me as I catch up on this thread.

Upthread, Michael says:

> I find the use of 1 in this context a bit confusing, because when
> referring to a counter at N, then it can be understood as doing N
> times a operation,

I didn't understand this argument, so I'm not sure if I agree or
disagree with it.

I think these are the three proposals for handling WAL reads:

1) setting op_bytes to 1 and the number of reads is the number of bytes
2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
number of calls to pg_pread() or similar
3) setting op_bytes to NULL and the number of reads is the number of
calls to pg_pread() or similar

Looking at the patch, I think it is still doing 2.

It would be good to list all our options with pros and cons (if only
because they are a bit spread throughout the thread now).

For an unpopular idea: we could add separate [IOOp]_bytes columns for
all those IOOps for which it would be relevant. It kind of stinks but
it would give us the freedom to document exactly what a single IOOp
means for each combination of BackendType, IOContext, IOObject, and
IOOp (as relevant) and still have an accurate number in the *bytes
columns. Everyone will probably hate us if we do that, though.
Especially because having bytes for the existing IOObjects is an
existing feature.

A separate question: suppose [1] goes in (to read WAL from WAL buffers
directly). Now, WAL reads are not from permanent storage anymore. Are
we only tracking permanent storage I/O in pg_stat_io? I also had this
question for some of the WAL receiver functions. Should we track any
I/O other than permanent storage I/O? Or did I miss this being
addressed upthread?

> > > Also, it is not directly related to this patch but vectored IO [1] is
> > > coming soon; so the number of operations could be wrong since vectored
> > > IO could merge a couple of operations.
> >
> > Hmm.  I have not checked this patch series so I cannot say for sure,
> > but we'd likely just want to track the number of bytes if a single
> > operation has a non-equal size rather than registering in pg_stat_io N
> > rows with different op_bytes, no?
>
> Yes, that is correct.

I do not like the idea of having basically GROUP BY op_bytes in the
view (if that is the suggestion).

In terms of what I/O we should track in a streaming/asynchronous
world, the options would be:

1) track read/write syscalls
2) track blocks of BLCKSZ submitted to the kernel
3) track bytes submitted to the kernel
4) track merged I/Os (after doing any merging in the application)

I think the debate was largely between 2 and 4. There was some
disagreement, but I think we landed on 2 because there is merging that
can happen at many levels in the storage stack (even the storage
controller). Distinguishing between whether or not Postgres submitted
2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
but I think it might be confusing for the Postgres user trying to
determine why their query is slow. It probably makes the most 

Re: Show WAL write and fsync stats in pg_stat_io

2024-01-10 Thread Nazir Bilal Yavuz
Hi,

On Wed, 10 Jan 2024 at 08:25, Michael Paquier  wrote:
>
> On Wed, Jan 03, 2024 at 04:10:58PM +0300, Nazir Bilal Yavuz wrote:
> >
> > I thought removing op_bytes completely ( as you said "This patch
> > extends it with two more operation sizes, and there are even cases
> > where it may be a variable" ) from pg_stat_io view then adding
> > something like {read | write | extend}_bytes and {read | write |
> > extend}_calls could be better, so that we don't lose any information.
>
> But then you'd lose the possibility to analyze correlations between
> the size and the number of the operations, which is something that
> matters for more complex I/O scenarios.  This does not need to be
> tackled in this patch, which is useful on its own, though I am really
> wondering if this is required for the recent work done by Thomas.
> Perhaps Andres, Thomas or Melanie could comment on that?

Yes, you are right.

> >> Yeah, a limitation like that may be acceptable for now.  Tracking the
> >> WAL writer and WAL sender activities can be relevant in a lot of cases
> >> even if we don't have the full picture for the WAL receiver yet.
> >
> > I added that and disabled B_WAL_RECEIVER backend with comments
> > explaining why. v8 is attached.
>
> I can see that's what you have been adding here, which should be OK:
>
> > -if (track_io_timing)
> > +/*
> > + * B_WAL_RECEIVER backend does IOOBJECT_WAL IOObject & IOOP_READ IOOp 
> > IOs
> > + * but these IOs are not countable for now because IOOP_READ IOs' 
> > op_bytes
> > + * (number of bytes per unit of I/O) might not be the same all the 
> > time.
> > + * The current implementation requires that the op_bytes must be the 
> > same
> > + * for the same IOObject, IOContext and IOOp. To avoid confusion, the
> > + * B_WAL_RECEIVER backend & IOOBJECT_WAL IOObject IOs are disabled for
> > + * now.
> > + */
> > +if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> > +return;
>
> This could be worded better, but that's one of these nits from me I
> usually tweak when committing stuff.

Thanks for doing that! Do you have any specific comments that can help
improve it?

> > +/*
> > + * Decide if IO timings need to be tracked.  Timings associated to
> > + * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
> > + * else rely on track_io_timing.
> > + */
> > +static bool
> > +pgstat_should_track_io_time(IOObject io_object)
> > +{
> > +if (io_object == IOOBJECT_WAL)
> > +return track_wal_io_timing;
> > +
> > +return track_io_timing;
> > +}
>
> One thing I was also considering is if eliminating this routine would
> make pgstat_count_io_op_time() more readable the result, but I cannot
> get to that.

I could not think of a way to eliminate pgstat_should_track_io_time()
route without causing performance regressions. What do you think about
moving inside of 'pgstat_should_track_io_time(io_object) if check' to
another function and call this function from
pgstat_count_io_op_time()? This does not change anything but IMO it
increases the readability.

> >  if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
> >  {
> > -
> > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +if (io_object != IOOBJECT_WAL)
> > +
> > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +
> >  if (io_object == IOOBJECT_RELATION)
> >  INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, 
> > io_time);
> >  else if (io_object == IOOBJECT_TEMP_RELATION)
> > @@ -139,7 +177,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext 
> > io_context, IOOp io_op,
> >  }
> >  else if (io_op == IOOP_READ)
> >  {
> > -
> > pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +if (io_object != IOOBJECT_WAL)
> > +
> > pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +
> >  if (io_object == IOOBJECT_RELATION)
> >  INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, 
> > io_time);
> >  else if (io_object == IOOBJECT_TEMP_RELATION)
>
> A second thing is if this would be better with more switch/cases, say:
> switch (io_op):
> {
> case IOOP_EXTEND:
> case IOOP_WRITE:
> switch (io_object):
> {
> case WAL:
> /* do nothing */
> break;
> case RELATION:
> case TEMP:
> .. blah ..
> }
> break;
> case IOOP_READ:
> switch (io_object):
> {
> .. blah ..
> }
> break;
> }
>
> Or just this one to make it clear that nothing happens for WAL
> objects:
> switch (io_object):
> {
>case WAL:
>/* do nothing */
>break;
>case RELATION:
>switch (io_op):
>{
>case IOOP_EXTEND:
>

Re: Show WAL write and fsync stats in pg_stat_io

2024-01-09 Thread Michael Paquier
On Wed, Jan 03, 2024 at 04:10:58PM +0300, Nazir Bilal Yavuz wrote:
> On Sun, 31 Dec 2023 at 03:58, Michael Paquier  wrote:
>> Apologies if my previous wording sounded confusing.  The idea I had in
>> mind was to keep op_bytes in pg_stat_io, and extend it so as a value
>> of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
>> as a number of bytes.
> 
> Oh, I understand it now. Yes, that makes sense.
> I thought removing op_bytes completely ( as you said "This patch
> extends it with two more operation sizes, and there are even cases
> where it may be a variable" ) from pg_stat_io view then adding
> something like {read | write | extend}_bytes and {read | write |
> extend}_calls could be better, so that we don't lose any information.

But then you'd lose the possibility to analyze correlations between
the size and the number of the operations, which is something that
matters for more complex I/O scenarios.  This does not need to be
tackled in this patch, which is useful on its own, though I am really
wondering if this is required for the recent work done by Thomas.
Perhaps Andres, Thomas or Melanie could comment on that?

>> Yeah, a limitation like that may be acceptable for now.  Tracking the
>> WAL writer and WAL sender activities can be relevant in a lot of cases
>> even if we don't have the full picture for the WAL receiver yet.
> 
> I added that and disabled B_WAL_RECEIVER backend with comments
> explaining why. v8 is attached.

I can see that's what you have been adding here, which should be OK:

> -if (track_io_timing)
> +/*
> + * B_WAL_RECEIVER backend does IOOBJECT_WAL IOObject & IOOP_READ IOOp IOs
> + * but these IOs are not countable for now because IOOP_READ IOs' 
> op_bytes
> + * (number of bytes per unit of I/O) might not be the same all the time.
> + * The current implementation requires that the op_bytes must be the same
> + * for the same IOObject, IOContext and IOOp. To avoid confusion, the
> + * B_WAL_RECEIVER backend & IOOBJECT_WAL IOObject IOs are disabled for
> + * now.
> + */
> +if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> +return;

This could be worded better, but that's one of these nits from me I
usually tweak when committing stuff.

> +/*
> + * Decide if IO timings need to be tracked.  Timings associated to
> + * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
> + * else rely on track_io_timing.
> + */
> +static bool
> +pgstat_should_track_io_time(IOObject io_object)
> +{
> +if (io_object == IOOBJECT_WAL)
> +return track_wal_io_timing;
> +
> +return track_io_timing;
> +}

One thing I was also considering is if eliminating this routine would
make pgstat_count_io_op_time() more readable the result, but I cannot
get to that.

>  if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
>  {
> -pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> +if (io_object != IOOBJECT_WAL)
> +
> pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> +
>  if (io_object == IOOBJECT_RELATION)
>  INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
>  else if (io_object == IOOBJECT_TEMP_RELATION)
> @@ -139,7 +177,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext 
> io_context, IOOp io_op,
>  }
>  else if (io_op == IOOP_READ)
>  {
> -pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> +if (io_object != IOOBJECT_WAL)
> +
> pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> +
>  if (io_object == IOOBJECT_RELATION)
>  INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
>  else if (io_object == IOOBJECT_TEMP_RELATION)

A second thing is if this would be better with more switch/cases, say:
switch (io_op):
{
case IOOP_EXTEND:
case IOOP_WRITE:
switch (io_object):
{
case WAL:
/* do nothing */
break;
case RELATION:
case TEMP:
.. blah ..   
}
break;
case IOOP_READ:
switch (io_object):
{
.. blah ..   
}
break;
}

Or just this one to make it clear that nothing happens for WAL
objects:
switch (io_object):
{
   case WAL:
   /* do nothing */
   break;
   case RELATION:
   switch (io_op):
   {
   case IOOP_EXTEND:
   case IOOP_WRITE:
   .. blah ..
   case IOOP_READ:
   .. blah ..
   }
   break;
   case TEMP:
   /* same switch as RELATION */
   break;
}

This duplicates a bit things, but at least in the second case it's
clear which counters are updated when I/O timings are tracked.  It's
OK by me if people don't like this suggestion, but that would avoid
bugs like the one 

Re: Show WAL write and fsync stats in pg_stat_io

2024-01-03 Thread Nazir Bilal Yavuz
Hi,

On Sun, 31 Dec 2023 at 03:58, Michael Paquier  wrote:
>
> On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:
> > On Tue, 26 Dec 2023 at 13:10, Michael Paquier  wrote:
> >> I am not sure while the whole point of the exercise is to have all the
> >> I/O related data in a single view.  Something that I've also found a
> >> bit disturbing yesterday while looking at your patch is the fact that
> >> the operation size is guessed from the context and object type when
> >> querying the view because now everything is tied to BLCKSZ.  This
> >> patch extends it with two more operation sizes, and there are even
> >> cases where it may be a variable.  Could it be a better option to
> >> extend pgstat_count_io_op_time() so as callers can themselves give the
> >> size of the operation?
> >
> > Do you mean removing the op_bytes column and tracking the number of
> > bytes in reads, writes, and extends? If so, that makes sense to me but
> > I don't want to remove the number of operations; I believe that has a
> > value too. We can extend the pgstat_count_io_op_time() so it can both
> > track the number of bytes and the number of operations.
>
> Apologies if my previous wording sounded confusing.  The idea I had in
> mind was to keep op_bytes in pg_stat_io, and extend it so as a value
> of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
> as a number of bytes.

Oh, I understand it now. Yes, that makes sense.
I thought removing op_bytes completely ( as you said "This patch
extends it with two more operation sizes, and there are even cases
where it may be a variable" ) from pg_stat_io view then adding
something like {read | write | extend}_bytes and {read | write |
extend}_calls could be better, so that we don't lose any information.

> > Also, it is not directly related to this patch but vectored IO [1] is
> > coming soon; so the number of operations could be wrong since vectored
> > IO could merge a couple of operations.
>
> Hmm.  I have not checked this patch series so I cannot say for sure,
> but we'd likely just want to track the number of bytes if a single
> operation has a non-equal size rather than registering in pg_stat_io N
> rows with different op_bytes, no?

Yes, that is correct.

> I am looping in Thomas Munro in CC for comments.

Thanks for doing that.

> > If we want to entirely disable it, we can add
> >
> > if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> > return;
> >
> > to the top of the pgstat_count_io_op_time() since all IOOBJECT_WAL
> > calls are done by this function, then we can disable it at
> > pgstat_tracks_io_bktype().
>
> Yeah, a limitation like that may be acceptable for now.  Tracking the
> WAL writer and WAL sender activities can be relevant in a lot of cases
> even if we don't have the full picture for the WAL receiver yet.

I added that and disabled B_WAL_RECEIVER backend with comments
explaining why. v8 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From dc258a5b168f15c9771f174924261b151193 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 3 Jan 2024 15:36:19 +0300
Subject: [PATCH v8] Show WAL stats on pg_stat_io (except streaming
 replication)

This patch aims to showing WAL stats per backend on pg_stat_io view.

With this patch, it can be seen how many WAL operations it makes, their
context, types and total timings per backend in pg_stat_io view.

In this path new IOContext IOCONTEXT_INIT is introduced, it is for IO
operations done while creating the things. Currently, it is used only
together with IOObject IOOBJECT_WAL.

IOOBJECT_WAL means IO operations related to WAL.
IOOBJECT_WAL / IOCONTEXT_NORMAL means IO operations done on already
created WAL segments.
IOOBJECT_WAL / IOCONTEXT_INIT means IO operations done while creating
the WAL segments.

This patch currently covers:
- Documentation
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write and fsync stats on
  pg_stat_io.
- IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync stats on pg_stat_io.

doesn't cover:
- Streaming replication WAL IO.
---
 src/include/catalog/pg_proc.dat   |   6 +-
 src/include/pgstat.h  |   6 +-
 src/backend/access/transam/xlog.c |  58 +--
 src/backend/access/transam/xlogrecovery.c |  10 ++
 src/backend/catalog/system_views.sql  |  15 ++-
 src/backend/utils/activity/pgstat_io.c| 119 --
 src/backend/utils/adt/pgstatfuncs.c   |  24 ++---
 src/test/regress/expected/rules.out   |  27 +++--
 src/test/regress/expected/stats.out   |  53 ++
 src/test/regress/sql/stats.sql|  25 +
 doc/src/sgml/monitoring.sgml  |  29 --
 11 files changed, 292 insertions(+), 80 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5b67784731a..26197dbb817 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5785,9 +5785,9 @@
 { oid => '1136', descr => 'statistics: 

Re: Show WAL write and fsync stats in pg_stat_io

2023-12-30 Thread Michael Paquier
On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 26 Dec 2023 at 13:10, Michael Paquier  wrote:
>> I am not sure while the whole point of the exercise is to have all the
>> I/O related data in a single view.  Something that I've also found a
>> bit disturbing yesterday while looking at your patch is the fact that
>> the operation size is guessed from the context and object type when
>> querying the view because now everything is tied to BLCKSZ.  This
>> patch extends it with two more operation sizes, and there are even
>> cases where it may be a variable.  Could it be a better option to
>> extend pgstat_count_io_op_time() so as callers can themselves give the
>> size of the operation?
> 
> Do you mean removing the op_bytes column and tracking the number of
> bytes in reads, writes, and extends? If so, that makes sense to me but
> I don't want to remove the number of operations; I believe that has a
> value too. We can extend the pgstat_count_io_op_time() so it can both
> track the number of bytes and the number of operations.

Apologies if my previous wording sounded confusing.  The idea I had in
mind was to keep op_bytes in pg_stat_io, and extend it so as a value
of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
as a number of bytes.

> Also, it is not directly related to this patch but vectored IO [1] is
> coming soon; so the number of operations could be wrong since vectored
> IO could merge a couple of operations.

Hmm.  I have not checked this patch series so I cannot say for sure,
but we'd likely just want to track the number of bytes if a single
operation has a non-equal size rather than registering in pg_stat_io N
rows with different op_bytes, no?  I am looping in Thomas Munro in CC
for comments.

>> The whole patch is kind of itself complicated enough, so I'd be OK to
>> discard the case of the WAL receiver for now.  Now, if we do so, the
>> code stack of pgstat_io.c should handle WAL receivers as something
>> entirely disabled until all the known issues are solved.  There is
>> still a lot of value in tracking WAL data associated to the WAL
>> writer, normal backends and WAL senders.
> 
> Why can't we add comments and leave it as it is? Is it because this
> could cause misunderstandings?
> 
> If we want to entirely disable it, we can add
> 
> if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> return;
> 
> to the top of the pgstat_count_io_op_time() since all IOOBJECT_WAL
> calls are done by this function, then we can disable it at
> pgstat_tracks_io_bktype().

Yeah, a limitation like that may be acceptable for now.  Tracking the
WAL writer and WAL sender activities can be relevant in a lot of cases
even if we don't have the full picture for the WAL receiver yet.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-12-26 Thread Nazir Bilal Yavuz
Hi,

On Tue, 26 Dec 2023 at 13:10, Michael Paquier  wrote:
>
> On Tue, Dec 26, 2023 at 11:27:16AM +0300, Nazir Bilal Yavuz wrote:
> > Maybe it is better to create a pg_stat_io_wal view like you said
> > before. We could remove unused columns and add op_bytes for each
> > writes and reads. Also, we can track both the number of bytes and the
> > number of the operations. This doesn't fully solve the problem but it
> > will be easier to modify it to meet our needs.
>
> I am not sure while the whole point of the exercise is to have all the
> I/O related data in a single view.  Something that I've also found a
> bit disturbing yesterday while looking at your patch is the fact that
> the operation size is guessed from the context and object type when
> querying the view because now everything is tied to BLCKSZ.  This
> patch extends it with two more operation sizes, and there are even
> cases where it may be a variable.  Could it be a better option to
> extend pgstat_count_io_op_time() so as callers can themselves give the
> size of the operation?

Do you mean removing the op_bytes column and tracking the number of
bytes in reads, writes, and extends? If so, that makes sense to me but
I don't want to remove the number of operations; I believe that has a
value too. We can extend the pgstat_count_io_op_time() so it can both
track the number of bytes and the number of operations.
Also, it is not directly related to this patch but vectored IO [1] is
coming soon; so the number of operations could be wrong since vectored
IO could merge a couple of operations.

>
> The whole patch is kind of itself complicated enough, so I'd be OK to
> discard the case of the WAL receiver for now.  Now, if we do so, the
> code stack of pgstat_io.c should handle WAL receivers as something
> entirely disabled until all the known issues are solved.  There is
> still a lot of value in tracking WAL data associated to the WAL
> writer, normal backends and WAL senders.

Why can't we add comments and leave it as it is? Is it because this
could cause misunderstandings?

If we want to entirely disable it, we can add

if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
return;

to the top of the pgstat_count_io_op_time() since all IOOBJECT_WAL
calls are done by this function, then we can disable it at
pgstat_tracks_io_bktype().

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-12-26 Thread Michael Paquier
On Tue, Dec 26, 2023 at 11:27:16AM +0300, Nazir Bilal Yavuz wrote:
> Maybe it is better to create a pg_stat_io_wal view like you said
> before. We could remove unused columns and add op_bytes for each
> writes and reads. Also, we can track both the number of bytes and the
> number of the operations. This doesn't fully solve the problem but it
> will be easier to modify it to meet our needs.

I am not sure while the whole point of the exercise is to have all the
I/O related data in a single view.  Something that I've also found a
bit disturbing yesterday while looking at your patch is the fact that
the operation size is guessed from the context and object type when
querying the view because now everything is tied to BLCKSZ.  This
patch extends it with two more operation sizes, and there are even
cases where it may be a variable.  Could it be a better option to
extend pgstat_count_io_op_time() so as callers can themselves give the
size of the operation?

The whole patch is kind of itself complicated enough, so I'd be OK to
discard the case of the WAL receiver for now.  Now, if we do so, the
code stack of pgstat_io.c should handle WAL receivers as something
entirely disabled until all the known issues are solved.  There is
still a lot of value in tracking WAL data associated to the WAL
writer, normal backends and WAL senders.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-12-26 Thread Nazir Bilal Yavuz
Hi,

On Tue, 26 Dec 2023 at 03:06, Michael Paquier  wrote:
>
> On Mon, Dec 25, 2023 at 04:09:34PM +0300, Nazir Bilal Yavuz wrote:
> > On Wed, 9 Aug 2023 at 21:52, Melanie Plageman  
> > wrote:
> >> If there is any combination of BackendType and IOContext which will
> >> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> >> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> >> tracking reads and write IOOps in number of bytes (instead of number of
> >> pages). I don't actually know if there is a clear separation by
> >> BackendType for these different cases.
> >
> > Using op_bytes as 1 solves this problem but since it will be different
> > from the rest of the pg_stat_io view it could be hard to understand.
> > There is no clear separation by backends as it can be seen from the 
> > walsender.
>
> I find the use of 1 in this context a bit confusing, because when
> referring to a counter at N, then it can be understood as doing N
> times a operation, but it would be much less than that.  Another
> solution would be to use NULL (as a synonym of "I don't know") and
> then document that in this case all the bigint counters of pg_stat_io
> track the number of bytes rather than the number of operations?

Yes, that makes sense.

Maybe it is better to create a pg_stat_io_wal view like you said
before. We could remove unused columns and add op_bytes for each
writes and reads. Also, we can track both the number of bytes and the
number of the operations. This doesn't fully solve the problem but it
will be easier to modify it to meet our needs.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-12-25 Thread Michael Paquier
On Mon, Dec 25, 2023 at 04:09:34PM +0300, Nazir Bilal Yavuz wrote:
> On Wed, 9 Aug 2023 at 21:52, Melanie Plageman  
> wrote:
>> If there is any combination of BackendType and IOContext which will
>> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
>> op_bytes. For other cases, we may have to consider using op_bytes 1 and
>> tracking reads and write IOOps in number of bytes (instead of number of
>> pages). I don't actually know if there is a clear separation by
>> BackendType for these different cases.
> 
> Using op_bytes as 1 solves this problem but since it will be different
> from the rest of the pg_stat_io view it could be hard to understand.
> There is no clear separation by backends as it can be seen from the walsender.

I find the use of 1 in this context a bit confusing, because when
referring to a counter at N, then it can be understood as doing N
times a operation, but it would be much less than that.  Another
solution would be to use NULL (as a synonym of "I don't know") and
then document that in this case all the bigint counters of pg_stat_io
track the number of bytes rather than the number of operations?

>> The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
>> treat op_bytes * number of reads as an approximation of the number of
>> bytes read. I don't actually know what makes more sense. I don't think I
>> would like having a number for bytes that is not accurate.
> 
> Also, we have a similar problem in XLogPageRead() in xlogrecovery.c.
> pg_pread() call tries to read XLOG_BLCKSZ but it is not certain and we
> don't count IO if it couldn't read XLOG_BLCKSZ. IMO, this is not as
> important as the previous problem but it still is a problem.
> 
> I would be glad to hear opinions on these problems.

Correctness matters a lot for monitoring, IMO.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-12-25 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review and feedback on your previous reply!

On Mon, 25 Dec 2023 at 09:40, Michael Paquier  wrote:
>
> On Mon, Dec 25, 2023 at 03:20:58PM +0900, Michael Paquier wrote:
> > pgstat_tracks_io_bktype() does not look correct to me.  Why is the WAL
> > receiver considered as something correct in the list of backend types,
> > while the intention is to *not* add it to pg_stat_io?  I have tried to
> > switche to the correct behavior of returning false for a
> > B_WAL_RECEIVER, to notice that pg_rewind's test 002_databases.pl
> > freezes on its shutdown sequence.  Something weird is going on here.
> > Could you look at it?  See the XXX comment in the attached, which is
> > the same behavior as v6-0002.  It looks to me that the patch has
> > introduced an infinite loop tweaking pgstat_tracks_io_bktype() in an
> > incorrect way to avoid the root issue.
>
> Ah, that's because it would trigger an assertion failure:
> TRAP: failed Assert("pgstat_tracks_io_op(MyBackendType, io_object,
>  io_context, io_op)"), File: "pgstat_io.c", Line: 89, PID: 6824
> postgres: standby_local: walreceiver
> (ExceptionalCondition+0xa8)[0x560d1b4dd38a]
>
> And the backtrace just tells that this is the WAL receiver
> initializing a WAL segment:
> #5  0x560d1b3322c8 in pgstat_count_io_op_n
> (io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
> cnt=1) at pgstat_io.c:89
> #6  0x560d1b33254a in pgstat_count_io_op_time
> (io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
> start_time=..., cnt=1) at pgstat_io.c:181
> #7  0x560d1ae7f932 in XLogFileInitInternal (logsegno=3, logtli=1,
> added=0x7ffd2733c6eb, path=0x7ffd2733c2e0 "pg_wal/0001", '0'
> , "3") at xlog.c:3115
> #8  0x560d1ae7fc4e in XLogFileInit (logsegno=3, logtli=1) at
> xlog.c:3215

Correct.

>
> Wouldn't it be simpler to just bite the bullet in this case and handle
> WAL receivers in the IO tracking?

There is one problem and I couldn't decide how to solve it. We need to
handle read IO in WALRead() in xlogreader.c. How many bytes the
WALRead() function will read is controlled by a variable and it can be
different from XLOG_BLCKSZ. This is a problem because pg_stat_io's
op_bytes column is a constant.

Here are all WALRead() function calls:

1- read_local_xlog_page_guts() in xlogutils.c => WALRead(XLOG_BLCKSZ)
=> always reads XLOG_BLCKSZ.

2- summarizer_read_local_xlog_page() in walsummarizer.c =>
WALRead(XLOG_BLCKSZ) => always reads XLOG_BLCKSZ.

3- logical_read_xlog_page() in walsender.c => WALRead(XLOG_BLCKSZ) =>
always reads XLOG_BLCKSZ.

4- XLogSendPhysical() in walsender.c => WALRead(nbytes) =>  nbytes can
be different from XLOG_BLCKSZ.

5- WALDumpReadPage() in pg_waldump.c => WALRead(count) => count can be
different from XLOG_BLCKSZ.

4 and 5 are the problematic calls.

Melanie's answer to this problem on previous discussions:

On Wed, 9 Aug 2023 at 21:52, Melanie Plageman  wrote:
>
> If there is any combination of BackendType and IOContext which will
> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> tracking reads and write IOOps in number of bytes (instead of number of
> pages). I don't actually know if there is a clear separation by
> BackendType for these different cases.

Using op_bytes as 1 solves this problem but since it will be different
from the rest of the pg_stat_io view it could be hard to understand.
There is no clear separation by backends as it can be seen from the walsender.

>
> The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
> treat op_bytes * number of reads as an approximation of the number of
> bytes read. I don't actually know what makes more sense. I don't think I
> would like having a number for bytes that is not accurate.

Also, we have a similar problem in XLogPageRead() in xlogrecovery.c.
pg_pread() call tries to read XLOG_BLCKSZ but it is not certain and we
don't count IO if it couldn't read XLOG_BLCKSZ. IMO, this is not as
important as the previous problem but it still is a problem.

I would be glad to hear opinions on these problems.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-12-24 Thread Michael Paquier
On Mon, Dec 25, 2023 at 03:20:58PM +0900, Michael Paquier wrote:
> pgstat_tracks_io_bktype() does not look correct to me.  Why is the WAL
> receiver considered as something correct in the list of backend types,
> while the intention is to *not* add it to pg_stat_io?  I have tried to
> switche to the correct behavior of returning false for a
> B_WAL_RECEIVER, to notice that pg_rewind's test 002_databases.pl
> freezes on its shutdown sequence.  Something weird is going on here.
> Could you look at it?  See the XXX comment in the attached, which is
> the same behavior as v6-0002.  It looks to me that the patch has
> introduced an infinite loop tweaking pgstat_tracks_io_bktype() in an
> incorrect way to avoid the root issue.

Ah, that's because it would trigger an assertion failure:
TRAP: failed Assert("pgstat_tracks_io_op(MyBackendType, io_object,
 io_context, io_op)"), File: "pgstat_io.c", Line: 89, PID: 6824
postgres: standby_local: walreceiver
(ExceptionalCondition+0xa8)[0x560d1b4dd38a]

And the backtrace just tells that this is the WAL receiver
initializing a WAL segment:
#5  0x560d1b3322c8 in pgstat_count_io_op_n
(io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
cnt=1) at pgstat_io.c:89
#6  0x560d1b33254a in pgstat_count_io_op_time
(io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
start_time=..., cnt=1) at pgstat_io.c:181
#7  0x560d1ae7f932 in XLogFileInitInternal (logsegno=3, logtli=1,
added=0x7ffd2733c6eb, path=0x7ffd2733c2e0 "pg_wal/0001", '0'
, "3") at xlog.c:3115
#8  0x560d1ae7fc4e in XLogFileInit (logsegno=3, logtli=1) at
xlog.c:3215

Wouldn't it be simpler to just bite the bullet in this case and handle
WAL receivers in the IO tracking?
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-12-24 Thread Michael Paquier
On Sat, Dec 16, 2023 at 08:20:57PM +0100, Michael Paquier wrote:
> One thing that 0001 missed is an update of the header where the
> function is declared.  I've edited a few things, and applied it to
> start on this stuff.  The rest will have to wait a bit more..

I have been reviewing the whole, and spotted a couple of issues.

+* At the end of the if case, accumulate time for the pg_stat_io.
+*/
+   if (pgstat_should_track_io_time(io_object, io_context))

There was a bug here.  WAL operations can do IOOP_WRITE or IOOP_READ,
and this would cause pgstat_count_buffer_read_time() and
pgstat_count_buffer_write_time() to be called, incrementing
pgStatBlock{Read,Write}Time, which would be incorrect when it comes to
a WAL page or a WAL segment.  I was wondering what to do here first,
but we could just avoid calling these routines when working on an
IOOBJECT_WAL as that's the only object not doing a buffer operation.

A comment at the top of pgstat_tracks_io_bktype() is incorrect,
because this patch adds the WAL writer sender in the I/O tracking.

+   case B_WAL_RECEIVER:
+   case B_WAL_SENDER:
+   case B_WAL_WRITER:
+   return false;

pgstat_tracks_io_op() now needs B_WAL_SUMMARIZER.

pgstat_should_track_io_time() is used only in pgstat_io.c, so it can
be static rather than published in pgstat.h.

pgstat_tracks_io_bktype() does not look correct to me.  Why is the WAL
receiver considered as something correct in the list of backend types,
while the intention is to *not* add it to pg_stat_io?  I have tried to
switche to the correct behavior of returning false for a
B_WAL_RECEIVER, to notice that pg_rewind's test 002_databases.pl
freezes on its shutdown sequence.  Something weird is going on here.
Could you look at it?  See the XXX comment in the attached, which is
the same behavior as v6-0002.  It looks to me that the patch has
introduced an infinite loop tweaking pgstat_tracks_io_bktype() in an
incorrect way to avoid the root issue.

I have also spent more time polishing the rest, touching a few things
while reviewing.  Not sure that I see a point in splitting the tests
from the main patch.
--
Michael
From 2ca5a656fbd2a359cc585cf7f59331daf77d760b Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 25 Dec 2023 15:16:34 +0900
Subject: [PATCH v7] Show WAL stats on pg_stat_io (except streaming 
 replication)

This patch aims to showing WAL stats per backend on pg_stat_io view.

With this patch, it can be seen how many WAL operations it makes, their
context, types and total timings per backend in pg_stat_io view.

In this path new IOContext IOCONTEXT_INIT is introduced, it is for IO
operations done while creating the things. Currently, it is used only
together with IOObject IOOBJECT_WAL.

IOOBJECT_WAL means IO operations related to WAL.
IOOBJECT_WAL / IOCONTEXT_NORMAL means IO operations done on already
created WAL segments.
IOOBJECT_WAL / IOCONTEXT_INIT means IO operations done while creating
the WAL segments.

This patch currently covers:
- Documentation
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write and fsync stats on
  pg_stat_io.
- IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync stats on pg_stat_io.

doesn't cover:
- Streaming replication WAL IO.
---
 src/include/catalog/pg_proc.dat   |   6 +-
 src/include/pgstat.h  |   6 +-
 src/backend/access/transam/xlog.c |  58 +---
 src/backend/access/transam/xlogrecovery.c |  10 +++
 src/backend/catalog/system_views.sql  |  15 +++-
 src/backend/utils/activity/pgstat_io.c| 102 --
 src/backend/utils/adt/pgstatfuncs.c   |  24 ++---
 src/test/regress/expected/rules.out   |  27 +++---
 src/test/regress/expected/stats.out   |  53 +++
 src/test/regress/sql/stats.sql|  25 ++
 doc/src/sgml/monitoring.sgml  |  29 --
 11 files changed, 274 insertions(+), 81 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9052f5262a..ddccddd6bd 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5785,9 +5785,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 { oid => '6248', descr => 'statistics: information about WAL prefetching',
   proname => 'pg_stat_get_recovery_prefetch', 

Re: Show WAL write and fsync stats in pg_stat_io

2023-12-16 Thread Michael Paquier
On Tue, Dec 12, 2023 at 02:29:03PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 5 Dec 2023 at 09:16, Michael Paquier  wrote:
>> This interface from pgstat_prepare_io_time() is not really good,
>> because we could finish by setting io_start in the existing code paths
>> calling this routine even if track_io_timing is false when
>> track_wal_io_timing is true.  Why not changing this interface a bit
>> and pass down a GUC (track_io_timing or track_wal_io_timing) as an
>> argument of the function depending on what we expect to trigger the
>> timings?
> 
> Done in 0001.

One thing that 0001 missed is an update of the header where the
function is declared.  I've edited a few things, and applied it to
start on this stuff.  The rest will have to wait a bit more..
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-12-12 Thread Nazir Bilal Yavuz
Hi,

Thanks for the feedback! The new version of the patch is attached.

On Tue, 5 Dec 2023 at 09:16, Michael Paquier  wrote:
>
> -   if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
> +   if (io_op == IOOP_EXTEND || io_op == IOOP_WRITE)
>
> Unrelated diff.

Done.

>
> +   if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> +   io_op == IOOP_FSYNC)
> +   PendingWalStats.wal_sync += cnt;
>
> Nah, I really don't think that adding this dependency within
> pg_stat_io is a good idea.
>
> -   PendingWalStats.wal_sync++;
> +   pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
> +   io_start, 1);
>
> This is the only caller where this matters, and the count is always 1.

I reverted that, pgstat_count_io_op_n doesn't count
PendingWalStats.wal_sync now.

>
> +   no_wal_normal_read = bktype == B_AUTOVAC_LAUNCHER ||
> +   bktype == B_AUTOVAC_WORKER || bktype == B_BACKEND ||
> +   bktype == B_BG_WORKER || bktype == B_BG_WRITER ||
> +   bktype == B_CHECKPOINTER || bktype == B_WAL_RECEIVER ||
> +   bktype == B_WAL_SENDER || bktype == B_WAL_WRITER;
> +
> +   if (no_wal_normal_read &&
> +   (io_object == IOOBJECT_WAL &&
> +io_op == IOOP_READ))
> +   return false;
>
> This may be more readable if an enum is applied, without a default
> clause so as it would not be forgotten if a new type is added, perhaps
> in its own little routine.

Done.

>
> -   if (track_io_timing)
> +   if (track_io_timing || track_wal_io_timing)
> INSTR_TIME_SET_CURRENT(io_start);
> else
>
> This interface from pgstat_prepare_io_time() is not really good,
> because we could finish by setting io_start in the existing code paths
> calling this routine even if track_io_timing is false when
> track_wal_io_timing is true.  Why not changing this interface a bit
> and pass down a GUC (track_io_timing or track_wal_io_timing) as an
> argument of the function depending on what we expect to trigger the
> timings?

Done in 0001.

>
> -   /* Convert counters from microsec to millisec for display */
> -   values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 
> 1000.0);
> -   values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 
> 1000.0);
> +   /*
> +* There is no need to calculate timings for both pg_stat_wal and
> +* pg_stat_io. So, fetch timings from pg_stat_io to make stats 
> gathering
> +* cheaper. Note that, since timings are fetched from pg_stat_io;
> +* pg_stat_reset_shared('io') will reset pg_stat_wal's timings too.
> +*
> +* Convert counters from microsec to millisec for display
> +*/
> +   values[6] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
> + 
>  IOCONTEXT_NORMAL,
> + 
>  IOOP_WRITE));
> +   values[7] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
> + 
>  IOCONTEXT_NORMAL,
> + 
>  IOOP_FSYNC));
>
> Perhaps it is simpler to remove these columns from pg_stat_get_wal()
> and plug an SQL upgrade to the view definition of pg_stat_wal?

Done in 0003 but I am not sure if that is what you expected.

> Finding a good balance between the subroutines, the two GUCs, the
> contexts, the I/O operation type and the objects is the tricky part of
> this patch.  If the dependency to PendingWalStats is removed and if
> the interface of pgstat_prepare_io_time is improved, things are a bit
> cleaner, but it feels like we could do more..  Nya.

I agree. The patch is not logically complicated but it is hard to
select the best way.

Any kind of feedback would be appreciated.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From b7bf7b92fa274775136314ecfde90fa32ed435cb Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 29 Nov 2023 15:30:03 +0300
Subject: [PATCH v6 6/6] Add IOOBJECT_WAL / IOCONTEXT_NORMAL / read tests

---
 src/test/regress/expected/stats.out | 12 
 src/test/regress/sql/stats.sql  |  8 
 2 files changed, 20 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 4adda9e479..7f5340cd7e 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -881,6 +881,18 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_NORMAL / read.
+-- When the servers starts, StartupXLOG function must be called by postmaster
+-- or standalone-backend startup and WAL read must be done.
+-- So, check these before stats get resetted.
+SELECT SUM(reads) 

Re: Show WAL write and fsync stats in pg_stat_io

2023-12-04 Thread Michael Paquier
On Fri, Dec 01, 2023 at 12:02:05PM +0300, Nazir Bilal Yavuz wrote:
> Thanks for all the feedback. I am sharing the new version of the patchset.
> 
> - I didn't move 'PendingWalStats.wal_sync' out from the
> 'pgstat_count_io_op_n' function because they count the same thing
> (block vs system calls) but I agree that this doesn't look good.

-   if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+   if (io_op == IOOP_EXTEND || io_op == IOOP_WRITE)

Unrelated diff.

+   if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+   io_op == IOOP_FSYNC)
+   PendingWalStats.wal_sync += cnt;

Nah, I really don't think that adding this dependency within
pg_stat_io is a good idea.

-   PendingWalStats.wal_sync++;
+   pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+   io_start, 1);

This is the only caller where this matters, and the count is always 1.

+   no_wal_normal_read = bktype == B_AUTOVAC_LAUNCHER ||
+   bktype == B_AUTOVAC_WORKER || bktype == B_BACKEND ||
+   bktype == B_BG_WORKER || bktype == B_BG_WRITER ||
+   bktype == B_CHECKPOINTER || bktype == B_WAL_RECEIVER ||
+   bktype == B_WAL_SENDER || bktype == B_WAL_WRITER;
+
+   if (no_wal_normal_read &&
+   (io_object == IOOBJECT_WAL &&
+io_op == IOOP_READ))
+   return false;

This may be more readable if an enum is applied, without a default
clause so as it would not be forgotten if a new type is added, perhaps
in its own little routine.

-   if (track_io_timing)
+   if (track_io_timing || track_wal_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
else

This interface from pgstat_prepare_io_time() is not really good,
because we could finish by setting io_start in the existing code paths
calling this routine even if track_io_timing is false when
track_wal_io_timing is true.  Why not changing this interface a bit
and pass down a GUC (track_io_timing or track_wal_io_timing) as an
argument of the function depending on what we expect to trigger the
timings?

-   /* Convert counters from microsec to millisec for display */
-   values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 
1000.0);
-   values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 
1000.0);
+   /*
+* There is no need to calculate timings for both pg_stat_wal and
+* pg_stat_io. So, fetch timings from pg_stat_io to make stats gathering
+* cheaper. Note that, since timings are fetched from pg_stat_io;
+* pg_stat_reset_shared('io') will reset pg_stat_wal's timings too.
+*
+* Convert counters from microsec to millisec for display
+*/
+   values[6] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
+   
   IOCONTEXT_NORMAL,
+   
   IOOP_WRITE));
+   values[7] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
+   
   IOCONTEXT_NORMAL,
+   
   IOOP_FSYNC));

Perhaps it is simpler to remove these columns from pg_stat_get_wal()
and plug an SQL upgrade to the view definition of pg_stat_wal?

+int
+pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context) 

This interface looks like a good idea even if there is only one
caller.

Finding a good balance between the subroutines, the two GUCs, the
contexts, the I/O operation type and the objects is the tricky part of
this patch.  If the dependency to PendingWalStats is removed and if
the interface of pgstat_prepare_io_time is improved, things are a bit
cleaner, but it feels like we could do more..  Nya.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-12-01 Thread Nazir Bilal Yavuz
Hi,

Thanks for all the feedback. I am sharing the new version of the patchset.

Current status of the patchset is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write, fsync stats and their
tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats and their tests are added.
- Documentation is updated.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time /
PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() /
pgstat_count_io_op_time() respectively.

Updates & Discussion items:
- Try to set op_bytes for BackendType / IOContext: I think we don't
need this now, we will need this when we add streaming replication WAL
IOs.

- Decide which 'BackendType / IOContext / IOOp' should not be tracked:
-- IOOBJECT_WAL / IOCONTEXT_INIT + IOCONTEXT_NORMAL / write and fsync
IOs can be done on every backend that tracks IO statistics. Because of
that and since we have a pgstat_tracks_io_bktype(bktype) check, I
didn't add another check for this.
-- I found that only the standalone backend and startup backend do
IOOBJECT_WAL / IOCONTEXT_NORMAL / read IOs. So, I added a check for
that but I am not sure if there are more backends that do WAL reads on
WAL recovery.

- For the IOOBJECT_WAL / IOCONTEXT_INIT and IOOBJECT_WAL /
IOCONTEXT_NORMAL / read tests, I used initial WAL IOs to check these
stats. I am not sure if that is the correct way or enough to test
these stats.

- To not calculate WAL timings on pg_stat_wal and pg_stat_io view,
pg_stat_wal view's WAL timings are fetched from pg_stat_io. Since
these timings are fetched from pg_stat_io, pg_stat_reset_shared('io')
will reset pg_stat_wal's timings too.

- I didn't move 'PendingWalStats.wal_sync' out from the
'pgstat_count_io_op_n' function because they count the same thing
(block vs system calls) but I agree that this doesn't look good.

Any kind of feedback would be appreciated.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 4ad85b11d418ae78237ed70eced6e3b46d086ef5 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 1 Dec 2023 10:03:21 +0300
Subject: [PATCH v5 3/4] Add IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync
 tests

---
 src/test/regress/expected/stats.out | 19 +++
 src/test/regress/sql/stats.sql  | 10 ++
 2 files changed, 29 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 4d3a515bdd..4adda9e479 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,25 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync.
+-- When the servers starts, the initial WAL file must be created,
+-- so check these stats before stats get resetted.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+--
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+--
+ t
+(1 row)
+
 -
 -- Test that resetting stats works for reset timestamp
 -
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index aa48e65dc8..72e864a0d2 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,16 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync.
+-- When the servers starts, the initial WAL file must be created,
+-- so check these stats before stats get resetted.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
 -
 -- Test that resetting stats works for reset timestamp
 -
-- 
2.43.0

From c7ae6c12cd02806d9d8201d738920179985cee7a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 10 Nov 2023 14:52:22 +0300
Subject: [PATCH v5 2/4] Add IOOBJECT_WAL / IOCONTEXT_NORMAL / write and fsync
 tests

---
 src/test/regress/expected/stats.out | 26 ++
 src/test/regress/sql/stats.sql  | 11 +++
 2 files changed, 37 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 346e10a3d2..4d3a515bdd 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1255,6 +1255,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 

Re: Show WAL write and fsync stats in pg_stat_io

2023-12-01 Thread Nazir Bilal Yavuz
Hi,

On Wed, 8 Nov 2023 at 10:34, Bharath Rupireddy
 wrote:
>
> Is there any plan to account for WAL read stats in the WALRead()
> function which will cover walsenders i.e. WAL read by logical and
> streaming replication, WAL read by pg_walinspect and so on? I see the
> patch already covers WAL read stats by recovery in XLogPageRead(), but
> not other page_read callbacks which will end up in WALRead()
> eventually. If added, the feature at
> https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
> can then extend it to cover WAL read from WAL buffer stats.

Yes, I am planning to create a patch for that after this patch is
done. Thanks for informing!

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-20 Thread Michael Paquier
On Mon, Nov 20, 2023 at 05:43:17PM +0300, Nazir Bilal Yavuz wrote:
> Yes, the timings for the writes and the syncs should work. Another
> question I have in mind is the pg_stat_reset_shared() function. When
> we call it with 'io' it will reset pg_stat_wal's timings and when we
> call it with 'wal' it won't reset them, right?

pg_stat_reset_shared() with a target is IMO a very edge case, so I'm
OK with the approach of resetting timings in pg_stat_wal even if 'io'
was implied because pg_stat_wal would feed partially from pg_stat_io.
I'd take that as a side-cost in favor of compatibility while making
the stats gathering cheaper overall.  I'm OK as well if people
counter-argue on this point, though that would mean to keep entirely
separate views with duplicated fields that serve the same purpose,
impacting all deployments because it would make the stats gathering
heavier for all.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-11-20 Thread Nazir Bilal Yavuz
Hi,

Thanks for the feedback.

On Mon, 20 Nov 2023 at 10:47, Michael Paquier  wrote:
>
> On Thu, Nov 09, 2023 at 02:39:26PM +0300, Nazir Bilal Yavuz wrote:
> > There are some differences between pg_stat_wal and pg_stat_io while
> > collecting WAL stats. For example in the XLogWrite() function in the
> > xlog.c file, pg_stat_wal counts wal_writes as write system calls. This
> > is not something we want for pg_stat_io since pg_stat_io counts the
> > number of blocks rather than the system calls, so instead incremented
> > pg_stat_io by npages.
> >
> > Could that cause a problem since pg_stat_wal's behaviour will be
> > changed? Of course, as an alternative we could change pg_stat_io's
> > behaviour but in the end either pg_stat_wal's or pg_stat_io's
> > behaviour will be changed.
>
> Yep, that could be confusing for existing applications that track the
> information of pg_stat_wal.  The number of writes is not something
> that can be correctly shared between both.  The timings for the writes
> and the syncs could be shared at least, right?

Yes, the timings for the writes and the syncs should work. Another
question I have in mind is the pg_stat_reset_shared() function. When
we call it with 'io' it will reset pg_stat_wal's timings and when we
call it with 'wal' it won't reset them, right?

>
> This slightly relates to pgstat_count_io_op_n() in your latest patch,
> where it feels a bit weird to see an update of
> PendingWalStats.wal_sync sit in the middle of a routine dedicated to
> pg_stat_io..  I am not completely sure what's the right balance here,
> but I would try to implement things so as pg_stat_io paths does not
> need to know about PendingWalStats.

Write has block vs system calls differentiation but it is the same for
sync. Because of that I put PendingWalStats.wal_sync to pg_stat_io but
I agree that it looks a bit weird.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-19 Thread Michael Paquier
On Thu, Nov 09, 2023 at 02:39:26PM +0300, Nazir Bilal Yavuz wrote:
> There are some differences between pg_stat_wal and pg_stat_io while
> collecting WAL stats. For example in the XLogWrite() function in the
> xlog.c file, pg_stat_wal counts wal_writes as write system calls. This
> is not something we want for pg_stat_io since pg_stat_io counts the
> number of blocks rather than the system calls, so instead incremented
> pg_stat_io by npages.
> 
> Could that cause a problem since pg_stat_wal's behaviour will be
> changed? Of course, as an alternative we could change pg_stat_io's
> behaviour but in the end either pg_stat_wal's or pg_stat_io's
> behaviour will be changed.

Yep, that could be confusing for existing applications that track the
information of pg_stat_wal.  The number of writes is not something
that can be correctly shared between both.  The timings for the writes
and the syncs could be shared at least, right?

This slightly relates to pgstat_count_io_op_n() in your latest patch,
where it feels a bit weird to see an update of
PendingWalStats.wal_sync sit in the middle of a routine dedicated to
pg_stat_io..  I am not completely sure what's the right balance here,
but I would try to implement things so as pg_stat_io paths does not
need to know about PendingWalStats.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-11-09 Thread Nazir Bilal Yavuz
Hi,

On Wed, 8 Nov 2023 at 04:19, Andres Freund  wrote:
>
> Hi,
>
> On 2023-11-08 09:52:16 +0900, Michael Paquier wrote:
> > By the way, if the write/sync quantities and times begin to be tracked
> > by pg_stat_io, I'd see a pretty good argument in removing the
> > equivalent columns in pg_stat_wal.  It looks like this would reduce
> > the confusion related to the handling of PendingWalStats added in
> > pgstat_io.c, for one.
>
> Another approach would be to fetch the relevant columns from pg_stat_io in the
> pg_stat_wal view. That'd avoid double accounting and breaking existing
> monitoring.

There are some differences between pg_stat_wal and pg_stat_io while
collecting WAL stats. For example in the XLogWrite() function in the
xlog.c file, pg_stat_wal counts wal_writes as write system calls. This
is not something we want for pg_stat_io since pg_stat_io counts the
number of blocks rather than the system calls, so instead incremented
pg_stat_io by npages.

Could that cause a problem since pg_stat_wal's behaviour will be
changed? Of course, as an alternative we could change pg_stat_io's
behaviour but in the end either pg_stat_wal's or pg_stat_io's
behaviour will be changed.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-09 Thread Nazir Bilal Yavuz
Hi,

Thanks for all the feedback!

On Wed, 8 Nov 2023 at 08:59, Michael Paquier  wrote:
>
> By the way, note that the patch is failing to apply, and that I've
> switched it as waiting on author on 10/26.

Here is an updated patchset in attachment. Rebased on the latest HEAD
and changed 'pgstat_report_wal(false)' to 'pgstat_flush_io(false)' in
xlogrecovery.c. I will share the new version of the patchset once I
address the feedback.

Regards,
Nazir Bilal Yavuz
Microsoft
From e5db5cd6d8c47cadde0539f06bbee22368d17a41 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 26 Oct 2023 12:12:32 +0300
Subject: [PATCH v4 1/2] Show WAL stats (except streaming replication WAL) on
 pg_stat_io

This patch aims to showing WAL stats per backend on pg_stat_io view.

With this patch, it can be seen how many WAL operations it makes, their
context, types and total timings per backend in pg_stat_io view.

This patchset currently covers:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync.
- IOOBJECT_WAL / IOCONTEXT_INIT write and fsync.

doesn't cover:
- Streaming replication WAL IO.
---
 src/backend/access/transam/xlog.c |  60 -
 src/backend/access/transam/xlogrecovery.c |  10 ++
 src/backend/utils/activity/pgstat_io.c| 144 --
 src/backend/utils/adt/pgstatfuncs.c   |  10 +-
 src/include/pgstat.h  |   8 +-
 5 files changed, 178 insertions(+), 54 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec..d265b8c032 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2256,38 +2256,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time();
 			do
 			{
 errno = 0;
 
-/* Measure I/O timing to write WAL data */
-if (track_wal_io_timing)
-	INSTR_TIME_SET_CURRENT(start);
-else
-	INSTR_TIME_SET_ZERO(start);
-
 pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 written = pg_pwrite(openLogFile, from, nleft, startoffset);
 pgstat_report_wait_end();
 
-/*
- * Increment the I/O timing and the number of times WAL data
- * were written out to disk.
- */
-if (track_wal_io_timing)
-{
-	instr_time	end;
-
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-}
-
 PendingWalStats.wal_write++;
 
 if (written <= 0)
@@ -2312,6 +2296,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+	IOOP_WRITE, io_start, npages);
+
 			npages = 0;
 
 			/*
@@ -3005,6 +2992,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3048,6 +3036,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 (errcode_for_file_access(),
  errmsg("could not create file \"%s\": %m", tmppath)));
 
+	/* start timing writes for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3083,6 +3074,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+			io_start, 1);
+
 	if (save_errno)
 	{
 		/*
@@ -3099,6 +3093,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
  errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	/* start timing fsyncs for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3111,6 +3108,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+			IOOP_FSYNC, io_start, 1);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 (errcode_for_file_access(),
@@ -8282,7 +8282,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8295,11 +8295,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8343,18 

Re: Show WAL write and fsync stats in pg_stat_io

2023-11-07 Thread Bharath Rupireddy
On Wed, Sep 20, 2023 at 1:28 PM Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> Thanks for the review!
>
> Current status of the patch is:
> - IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added.
> - IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
> - IOOBJECT_WAL / IOCONTEXT_INIT stats are added.
> - pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
> - Working on which 'BackendType / IOContext / IOOp' should be banned in 
> pg_stat_io.
> - PendingWalStats.wal_sync and PendingWalStats.wal_write_time / 
> PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() / 
> pgstat_count_io_op_time() respectively.
>
> TODOs:
> - Documentation.
> - Try to set op_bytes for BackendType / IOContext.
> - Decide which 'BackendType / IOContext / IOOp' should not be tracked.
> - Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests.
> - Add IOOBJECT_WAL / IOCONTEXT_INIT tests.

This patchset currently covers:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync.
- IOOBJECT_WAL / IOCONTEXT_INIT write and fsync.

doesn't cover:
- Streaming replication WAL IO.

Is there any plan to account for WAL read stats in the WALRead()
function which will cover walsenders i.e. WAL read by logical and
streaming replication, WAL read by pg_walinspect and so on? I see the
patch already covers WAL read stats by recovery in XLogPageRead(), but
not other page_read callbacks which will end up in WALRead()
eventually. If added, the feature at
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
can then extend it to cover WAL read from WAL buffer stats.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-07 Thread Michael Paquier
On Wed, Nov 08, 2023 at 10:27:44AM +0900, Michael Paquier wrote:
> Yep, I'd be OK with that as well to maintain compatibility.

By the way, note that the patch is failing to apply, and that I've
switched it as waiting on author on 10/26.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-11-07 Thread Michael Paquier
On Tue, Nov 07, 2023 at 05:19:28PM -0800, Andres Freund wrote:
> Another approach would be to fetch the relevant columns from pg_stat_io in the
> pg_stat_wal view. That'd avoid double accounting and breaking existing
> monitoring.

Yep, I'd be OK with that as well to maintain compatibility.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-11-07 Thread Andres Freund
Hi,

On 2023-11-08 09:52:16 +0900, Michael Paquier wrote:
> By the way, if the write/sync quantities and times begin to be tracked
> by pg_stat_io, I'd see a pretty good argument in removing the
> equivalent columns in pg_stat_wal.  It looks like this would reduce
> the confusion related to the handling of PendingWalStats added in
> pgstat_io.c, for one.

Another approach would be to fetch the relevant columns from pg_stat_io in the
pg_stat_wal view. That'd avoid double accounting and breaking existing
monitoring.

Greetings,

Andres Freund




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-07 Thread Michael Paquier
On Tue, Nov 07, 2023 at 03:30:48PM -0800, Andres Freund wrote:
> I strongly disagree. A significant part of the design of pg_stat_io was to
> make it possible to collect multiple sources of IO in a single view, so that
> sysadmins don't have to look in dozens of places to figure out what is causing
> what kind of IO.

Okay.  Point taken.

> We should over time collect all sources of IO in pg_stat_io. For some things
> we might want to also have more detailed information in other views (e.g. it
> doesn't make sense to track FPIs in pg_stat_io, but does make sense in
> pg_stat_wal) - but that should be in addition, not instead of.

Sure.  I understand here that you mean the number of FPIs counted when
a record is inserted, different from the path where we decide to write
and/or flush WAL.  The proposed patch seems to be a bit inconsistent
regarding wal_sync_time, by the way.

By the way, if the write/sync quantities and times begin to be tracked
by pg_stat_io, I'd see a pretty good argument in removing the
equivalent columns in pg_stat_wal.  It looks like this would reduce
the confusion related to the handling of PendingWalStats added in
pgstat_io.c, for one.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-11-07 Thread Andres Freund
Hi,

On 2023-10-26 15:28:32 +0900, Michael Paquier wrote:
> On top of that pg_stat_io is now for block-based I/O operations, so
> that does not fit entirely in the picture, though I guess that Melanie
> has thought more on the matter than me.  That may be also a matter of
> taste.

I strongly disagree. A significant part of the design of pg_stat_io was to
make it possible to collect multiple sources of IO in a single view, so that
sysadmins don't have to look in dozens of places to figure out what is causing
what kind of IO.

We should over time collect all sources of IO in pg_stat_io. For some things
we might want to also have more detailed information in other views (e.g. it
doesn't make sense to track FPIs in pg_stat_io, but does make sense in
pg_stat_wal) - but that should be in addition, not instead of.

Greetings,

Andres Freund




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-06 Thread Michael Paquier
On Mon, Nov 06, 2023 at 03:35:01PM +0300, Nazir Bilal Yavuz wrote:
> Results are:
> 
> ╔═╦═══╦╗
> ║ ║  track_wal_io_timing  ║║
> ╠═╬═══╦═══╬╣
> ║  clock  ║   on  ║  off  ║ change ║
> ║ sources ║   ║   ║║
> ╠═╬═══╬═══╬╣
> ║   tsc   ║   ║   ║║
> ║ ║ 514814.459170 ║ 519826.284139 ║   %1   ║
> ╠═╬═══╬═══╬╣
> ║   hpet  ║   ║   ║║
> ║ ║ 132116.272121 ║ 141820.548447 ║   %7   ║
> ╠═╬═══╬═══╬╣
> ║ acpi_pm ║   ║   ║║
> ║ ║ 394793.092255 ║ 403723.874719 ║   %2   ║
> ╚═╩═══╩═══╩╝

Thanks for the tests.  That's indeed noticeable under this load.
Better to keep track_io_timing and track_wal_io_timing as two
separated beasts, at least that's clear.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-11-06 Thread Nazir Bilal Yavuz
Hi,

On Tue, 31 Oct 2023 at 16:57, Nazir Bilal Yavuz  wrote:
> On Thu, 26 Oct 2023 at 09:28, Michael Paquier  wrote:
> >
> > And perhaps just putting that everything that calls
> > pgstat_count_io_op_time() under track_io_timing is just natural?
> > What's the performance regression you would expect if both WAL and
> > block I/O are controlled by that, still one would expect only one of
> > them?
>
> I will check these and I hope I will come back with something meaningful.

I applied the patches on upstream postgres and then run pgbench for each
available clock sources couple of times:
# Set fsync = off and track_io_timing = on
# pgbench -i -s 100 test
# pgbench -M prepared -c16 -j8 -f <( echo "SELECT
pg_logical_emit_message(true, \:client_id::text, '1234567890');") -T60 test

Results are:

╔═╦═══╦╗
║ ║  track_wal_io_timing  ║║
╠═╬═══╦═══╬╣
║  clock  ║   on  ║  off  ║ change ║
║ sources ║   ║   ║║
╠═╬═══╬═══╬╣
║   tsc   ║   ║   ║║
║ ║ 514814.459170 ║ 519826.284139 ║   %1   ║
╠═╬═══╬═══╬╣
║   hpet  ║   ║   ║║
║ ║ 132116.272121 ║ 141820.548447 ║   %7   ║
╠═╬═══╬═══╬╣
║ acpi_pm ║   ║   ║║
║ ║ 394793.092255 ║ 403723.874719 ║   %2   ║
╚═╩═══╩═══╩╝

Regards,
Nazir Bilal Yavuz
Microsoft


Re: Show WAL write and fsync stats in pg_stat_io

2023-10-31 Thread Nazir Bilal Yavuz
Hi,

Thank you for the feedback!

On Thu, 26 Oct 2023 at 09:28, Michael Paquier  wrote:
>
> On Wed, Sep 20, 2023 at 10:57:48AM +0300, Nazir Bilal Yavuz wrote:
> > Any kind of feedback would be appreciated.
>
> This was registered in the CF, so I have given it a look.  Note that
> 0001 has a conflict with pgstat_count_io_op_time(), so it cannot be
> applied.
>
> +pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
> +{
> +   /*
> +* io times of IOOBJECT_WAL IOObject needs to be tracked when
> +* 'track_wal_io_timing' is set regardless of 'track_io_timing'.
> +*/
> +   if (io_object == IOOBJECT_WAL)
> +   return track_wal_io_timing;
> +
> +   return track_io_timing;
>
> I can see the temptation to do that, but I have mixed feelings about
> the approach of mixing two GUCs in a code path dedicated to pg_stat_io
> where now we only rely on track_io_timing.  The result brings
> confusion, while making pg_stat_io, which is itself only used for
> block-based operations, harder to read.
>
> The suggestion I am seeing here to have a pg_stat_io_wal (with a SRF)
> is quite tempting, actually, creating a neat separation between the
> existing pg_stat_io and pg_stat_wal (not a SRF), with a third view
> that provides more details about the contexts and backend types for
> the WAL stats with its relevant fields:
> https://www.postgresql.org/message-id/caakru_bm55pj3pprw0nd_-pawhlrkou69r816aeztbba-n1...@mail.gmail.com
>
> And perhaps just putting that everything that calls
> pgstat_count_io_op_time() under track_io_timing is just natural?
> What's the performance regression you would expect if both WAL and
> block I/O are controlled by that, still one would expect only one of
> them?

I will check these and I hope I will come back with something meaningful.

>
> +  /*  Report pending statistics to the cumulative stats system */
> +  pgstat_report_wal(false);
>
> This is hidden in 0001, still would be better if handled as a patch on
> its own and optionally backpatch it as we did for the bgwriter with
> e64c733bb1?

I thought about it again and found the use of
'pgstat_report_wal(false);' here wrong. This was mainly for flushing
WAL stats because of the WAL reads but pg_stat_wal doesn't have WAL
read stats, so there is no need to flush WAL stats here. I think this
should be replaced with 'pgstat_flush_io(false);'.

>
> Side note: I think that we should spend more efforts in documenting
> what IOContext and IOOp mean.  Not something directly related to this
> patch, still this patch or things similar make it a bit harder which
> part of it is used for what by reading pgstat.h.

I agree.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-10-26 Thread Michael Paquier
On Wed, Sep 20, 2023 at 10:57:48AM +0300, Nazir Bilal Yavuz wrote:
> Any kind of feedback would be appreciated.

This was registered in the CF, so I have given it a look.  Note that
0001 has a conflict with pgstat_count_io_op_time(), so it cannot be
applied.

+pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
+{
+   /*
+* io times of IOOBJECT_WAL IOObject needs to be tracked when
+* 'track_wal_io_timing' is set regardless of 'track_io_timing'.
+*/
+   if (io_object == IOOBJECT_WAL)
+   return track_wal_io_timing;
+
+   return track_io_timing;

I can see the temptation to do that, but I have mixed feelings about
the approach of mixing two GUCs in a code path dedicated to pg_stat_io
where now we only rely on track_io_timing.  The result brings
confusion, while making pg_stat_io, which is itself only used for
block-based operations, harder to read.

The suggestion I am seeing here to have a pg_stat_io_wal (with a SRF)
is quite tempting, actually, creating a neat separation between the
existing pg_stat_io and pg_stat_wal (not a SRF), with a third view
that provides more details about the contexts and backend types for
the WAL stats with its relevant fields:
https://www.postgresql.org/message-id/caakru_bm55pj3pprw0nd_-pawhlrkou69r816aeztbba-n1...@mail.gmail.com

And perhaps just putting that everything that calls
pgstat_count_io_op_time() under track_io_timing is just natural?
What's the performance regression you would expect if both WAL and
block I/O are controlled by that, still one would expect only one of
them?

On top of that pg_stat_io is now for block-based I/O operations, so
that does not fit entirely in the picture, though I guess that Melanie
has thought more on the matter than me.  That may be also a matter of
taste.

+  /*  Report pending statistics to the cumulative stats system */
+  pgstat_report_wal(false);

This is hidden in 0001, still would be better if handled as a patch on
its own and optionally backpatch it as we did for the bgwriter with
e64c733bb1?

Side note: I think that we should spend more efforts in documenting
what IOContext and IOOp mean.  Not something directly related to this
patch, still this patch or things similar make it a bit harder which
part of it is used for what by reading pgstat.h.
--
Michael


signature.asc
Description: PGP signature


Re: Show WAL write and fsync stats in pg_stat_io

2023-09-20 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review!

Current status of the patch is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added.
- IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats are added.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned in
pg_stat_io.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time /
PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() /
pgstat_count_io_op_time() respectively.

TODOs:
- Documentation.
- Try to set op_bytes for BackendType / IOContext.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests.
- Add IOOBJECT_WAL / IOCONTEXT_INIT tests.

I am adding tracking of BackendType / IOContext / IOOp as tables, empty
cell means it is not decided yet:

IOCONTEXT_NORMAL / Backend / IOOp table:

╔═╦═══╦═══╦═══╗
║ IOCONTEXT_NORMAL║ read  ║ write ║ fsync ║
╠═╬═══╬═══╬═══╣
║ autovacuum launcher ║ FALSE ║ FALSE ║ FALSE ║
╠═╬═══╬═══╬═══╣
║ autovacuum worker   ║ FALSE ║  TRUE ║  TRUE ║
╠═╬═══╬═══╬═══╣
║ client backend  ║   ║  TRUE ║  TRUE ║
╠═╬═══╬═══╬═══╣
║ background worker   ║   ║   ║   ║
╠═╬═══╬═══╬═══╣
║ background writer   ║   ║  TRUE ║  TRUE ║
╠═╬═══╬═══╬═══╣
║ checkpointer║   ║  TRUE ║  TRUE ║
╠═╬═══╬═══╬═══╣
║ standalone backend  ║  TRUE ║  TRUE ║  TRUE ║
╠═╬═══╬═══╬═══╣
║ startup ║  TRUE ║   ║   ║
╠═╬═══╬═══╬═══╣
║ walreceiver ║   ║   ║   ║
╠═╬═══╬═══╬═══╣
║ walsender   ║   ║   ║   ║
╠═╬═══╬═══╬═══╣
║ walwriter   ║   ║  TRUE ║  TRUE ║
╚═╩═══╩═══╩═══╝


IOCONTEXT_WAL_INIT / Backend / IOOp table:

╔═╦═══╦═══╗
║ IOCONTEXT_WAL_INIT  ║ write ║ fsync ║
╠═╬═══╬═══╣
║ autovacuum launcher ║   ║   ║
╠═╬═══╬═══╣
║ autovacuum worker   ║   ║   ║
╠═╬═══╬═══╣
║ client backend  ║  TRUE ║  TRUE ║
╠═╬═══╬═══╣
║ background worker   ║   ║   ║
╠═╬═══╬═══╣
║ background writer   ║   ║   ║
╠═╬═══╬═══╣
║ checkpointer║   ║   ║
╠═╬═══╬═══╣
║ standalone backend  ║  TRUE ║  TRUE ║
╠═╬═══╬═══╣
║ startup ║   ║   ║
╠═╬═══╬═══╣
║ walreceiver ║   ║   ║
╠═╬═══╬═══╣
║ walsender   ║   ║   ║
╠═╬═══╬═══╣
║ walwriter   ║   ║   ║
╚═╩═══╩═══╝


On Wed, 9 Aug 2023 at 21:52, Melanie Plageman 
wrote:
>
> > On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
> >  wrote:
> > > I think it would be good to count WAL reads even though they are not
> > > currently represented in pg_stat_wal. Here is a thread discussing this
> > > [1].
> >
> > I used the same implementation in the thread link [1]. I added 'WAL
> > read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
> > xlogreader.c and walsender.c because they cause some failures on:
> > '!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
> > these. Also, I added Bharath to CC. I have a question about 'WAL
> > read':
> > 1. There are two places where 'WAL read' happens.
> > a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
> > time count is equal to XLOG_BLCKSZ but there are some cases it is not.
> > For example
> > - in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
> > - in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
> > These nbytes and count variables could be different from XLOG_BLCKSZ.
> >
> > b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ
bytes:
> > pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
> >
> > So, what should op_bytes be set to for 'WAL read' operations?
>
> If there is any combination of BackendType and IOContext which will
> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> tracking reads and write IOOps in number of bytes (instead of number of
> pages). I don't actually know if there is a clear separation by
> BackendType for these different cases.

I agree. I will edit that later, added to TODOs.

>
> The other alternative I 

Re: Show WAL write and fsync stats in pg_stat_io

2023-08-09 Thread Melanie Plageman
On Thu, Aug 03, 2023 at 04:38:41PM +0300, Nazir Bilal Yavuz wrote:
> Current status of the patch is:
> - 'WAL read' stats in xlogrecovery.c are added to pg_stat_io.
> - IOCONTEXT_INIT is added to count 'WAL init'. 'WAL init' stats are
> added to pg_stat_io.
> - pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
> - Working on which 'BackendType / IOContext / IOOp' should be banned
> in pg_stat_io.
> - Working on adding 'WAL read' to the xlogreader.c and walsender.c.
> - PendingWalStats.wal_sync and
> PendingWalStats.wal_write_time/PendingWalStats.wal_sync_time are moved
> to pgstat_count_io_op_n()/pgstat_count_io_op_time() respectively.

Cool! Thanks for the summary and for continuing to work on this.

> TODOs:
> - Documentation.
> - Thinking about how to interpret the different IOOps within the two
> IOContexts and discussing what would be useful to count.
> - Decide which 'BackendType / IOContext / IOOp' should not be tracked.
> - Adding 'WAL read' to the xlogreader.c and walsender.c. (This could
> be an another patch)

Yes, I would be explicit that you are not including WAL IO done exclusively in
the context of replication.

> - Adding WAIT_EVENT_WAL_COPY_* operations to pg_stat_io if needed.
> (This could be an another patch)

Yes, I think it makes sense as another patch.

> 
> On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
>  wrote:
> > I think it would be good to count WAL reads even though they are not
> > currently represented in pg_stat_wal. Here is a thread discussing this
> > [1].
> 
> I used the same implementation in the thread link [1]. I added 'WAL
> read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
> xlogreader.c and walsender.c because they cause some failures on:
> '!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
> these. Also, I added Bharath to CC. I have a question about 'WAL
> read':
> 1. There are two places where 'WAL read' happens.
> a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
> time count is equal to XLOG_BLCKSZ but there are some cases it is not.
> For example
> - in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
> - in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
> These nbytes and count variables could be different from XLOG_BLCKSZ.
> 
> b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ bytes:
> pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
> 
> So, what should op_bytes be set to for 'WAL read' operations?

If there is any combination of BackendType and IOContext which will
always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
op_bytes. For other cases, we may have to consider using op_bytes 1 and
tracking reads and write IOOps in number of bytes (instead of number of
pages). I don't actually know if there is a clear separation by
BackendType for these different cases.

The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
treat op_bytes * number of reads as an approximation of the number of
bytes read. I don't actually know what makes more sense. I don't think I
would like having a number for bytes that is not accurate.

> > I think we will also want to add an IOContext for WAL initialization.
> > Then we can track how long is spent doing 'WAL init' (including filling
> > the WAL file with zeroes). XLogFileInitInternal() is likely where we
> > would want to add it. And op_bytes for this would likely be
> > wal_segment_size. I thought I heard about someone proposing adding WAL
> > init to pg_stat_wal, but I can't find the thread.
> 
> Done. I created a new IOCONTEXT_INIT IOContext for the 'WAL init'. I
> have a question there:
> 1. Some of the WAL processes happens at initdb (standalone backend
> IOCONTEXT_NORMAL/(IOOP_READ & IOOP_WRITE) and
> IOCONTEXT_INIT/(IOOP_WRITE & IOOP_FSYNC)). Since this happens at the
> initdb, AFAIK there is no way to set 'track_wal_io_timing' and
> 'track_io_timing' variables there. So, their timings appear as 0.
> Should I use IsBootstrapProcessingMode() to enable WAL io timings at
> the initdb or are they not that much important?

I don't have an opinion about this. I can see an argument for doing it
either way. We do track other IO during initdb in pg_stat_io.

> > Any that we decide not to count for now should be "banned" in
> > pgstat_tracks_io_op() for clarity. For example, if we create a separate
> > IOContext for WAL file init, I'm not sure what would count as an
> > IOOP_EXTEND in IOCONTEXT_NORMAL for IOOBJECT_WAL.
> >
> > Also, I think there are some backend types which will not generate WAL
> > and we should determine which those are and skip those rows in
> > pgstat_tracks_io_object().
> 
> I agree, I am working on this. I have a couple of questions:
> 1. Can client backend and background worker do IOCONTEXT_NORMAL/IOOP_READ?

I don't know the answer to this.

> 2. Is there an easy way to check if 'BackendType / IOOBJECT_WAL' does
> specific IOOp operations?

I 

Re: Show WAL write and fsync stats in pg_stat_io

2023-08-03 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review!

Current status of the patch is:
- 'WAL read' stats in xlogrecovery.c are added to pg_stat_io.
- IOCONTEXT_INIT is added to count 'WAL init'. 'WAL init' stats are
added to pg_stat_io.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned
in pg_stat_io.
- Working on adding 'WAL read' to the xlogreader.c and walsender.c.
- PendingWalStats.wal_sync and
PendingWalStats.wal_write_time/PendingWalStats.wal_sync_time are moved
to pgstat_count_io_op_n()/pgstat_count_io_op_time() respectively.

TODOs:
- Documentation.
- Thinking about how to interpret the different IOOps within the two
IOContexts and discussing what would be useful to count.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Adding 'WAL read' to the xlogreader.c and walsender.c. (This could
be an another patch)
- Adding WAIT_EVENT_WAL_COPY_* operations to pg_stat_io if needed.
(This could be an another patch)

On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
 wrote:
> I think it would be good to count WAL reads even though they are not
> currently represented in pg_stat_wal. Here is a thread discussing this
> [1].

I used the same implementation in the thread link [1]. I added 'WAL
read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
xlogreader.c and walsender.c because they cause some failures on:
'!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
these. Also, I added Bharath to CC. I have a question about 'WAL
read':
1. There are two places where 'WAL read' happens.
a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
time count is equal to XLOG_BLCKSZ but there are some cases it is not.
For example
- in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
- in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
These nbytes and count variables could be different from XLOG_BLCKSZ.

b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ bytes:
pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);

So, what should op_bytes be set to for 'WAL read' operations?

> Eventually, the docs will need an update as well. You can wait until a
> later version of the patch to do this, but I would include it in a list
> of the remaining TODOs in your next version.

Done. I shared TODOs at the top.

> I think we will also want to add an IOContext for WAL initialization.
> Then we can track how long is spent doing 'WAL init' (including filling
> the WAL file with zeroes). XLogFileInitInternal() is likely where we
> would want to add it. And op_bytes for this would likely be
> wal_segment_size. I thought I heard about someone proposing adding WAL
> init to pg_stat_wal, but I can't find the thread.

Done. I created a new IOCONTEXT_INIT IOContext for the 'WAL init'. I
have a question there:
1. Some of the WAL processes happens at initdb (standalone backend
IOCONTEXT_NORMAL/(IOOP_READ & IOOP_WRITE) and
IOCONTEXT_INIT/(IOOP_WRITE & IOOP_FSYNC)). Since this happens at the
initdb, AFAIK there is no way to set 'track_wal_io_timing' and
'track_io_timing' variables there. So, their timings appear as 0.
Should I use IsBootstrapProcessingMode() to enable WAL io timings at
the initdb or are they not that much important?

> I think there is also an argument for counting WAL files recycled as
> IOOP_REUSES. We should start thinking about how to interpret the
> different IOOps within the two IOContexts and discussing what would be
> useful to count. For example, should removing a logfile count as an
> IOOP_EVICT? Maybe it is not directly related to "IO" enough or even an
> interesting statistic, but we should think about what kinds of
> IO-related WAL statistics we want to track.

I added that to TODOs.

> Any that we decide not to count for now should be "banned" in
> pgstat_tracks_io_op() for clarity. For example, if we create a separate
> IOContext for WAL file init, I'm not sure what would count as an
> IOOP_EXTEND in IOCONTEXT_NORMAL for IOOBJECT_WAL.
>
> Also, I think there are some backend types which will not generate WAL
> and we should determine which those are and skip those rows in
> pgstat_tracks_io_object().

I agree, I am working on this. I have a couple of questions:
1. Can client backend and background worker do IOCONTEXT_NORMAL/IOOP_READ?
2. Is there an easy way to check if 'BackendType / IOOBJECT_WAL' does
specific IOOp operations?

> diff --git a/src/backend/access/transam/xlog.c
> b/src/backend/access/transam/xlog.c
> index 8b0710abe6..2ee6c21398 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -2207,6 +2207,10 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID
> tli, bool flexible)
>
> I think we should likely follow the pattern of using
> pgstat_prepare_io_time() and pgstat_count_io_op_time() as it is done
> elsewhere. You could pass the IOObject as a parameter to
> pgstat_prepare_io_time() in order 

Re: Show WAL write and fsync stats in pg_stat_io

2023-07-21 Thread Melanie Plageman
On Wed, Jun 28, 2023 at 6:09 AM Nazir Bilal Yavuz  wrote:
> This is a WIP patch to add WAL write and fsync stats to pg_stat_io
> view.

Thanks for working on this! I have some feedback on the content of the
patch as well as some items that I feel are missing.

I think it would be good to count WAL reads even though they are not
currently represented in pg_stat_wal. Here is a thread discussing this
[1].

Eventually, the docs will need an update as well. You can wait until a
later version of the patch to do this, but I would include it in a list
of the remaining TODOs in your next version.

I think we will also want to add an IOContext for WAL initialization.
Then we can track how long is spent doing WAL init (including filling
the WAL file with zeroes). XLogFileInitInternal() is likely where we
would want to add it. And op_bytes for this would likely be
wal_segment_size. I thought I heard about someone proposing adding WAL
init to pg_stat_wal, but I can't find the thread.

I think there is also an argument for counting WAL files recycled as
IOOP_REUSES. We should start thinking about how to interpret the
different IOOps within the two IOContexts and discussing what would be
useful to count. For example, should removing a logfile count as an
IOOP_EVICT? Maybe it is not directly related to "IO" enough or even an
interesting statistic, but we should think about what kinds of
IO-related WAL statistics we want to track.

Any that we decide not to count for now should be "banned" in
pgstat_tracks_io_op() for clarity. For example, if we create a separate
IOContext for WAL file init, I'm not sure what would count as an
IOOP_EXTEND in IOCONTEXT_NORMAL for IOOBJECT_WAL.

Also, I think there are some backend types which will not generate WAL
and we should determine which those are and skip those rows in
pgstat_tracks_io_object().

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 8b0710abe6..2ee6c21398 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2207,6 +2207,10 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID
tli, bool flexible)

I think we should likely follow the pattern of using
pgstat_prepare_io_time() and pgstat_count_io_op_time() as it is done
elsewhere. You could pass the IOObject as a parameter to
pgstat_prepare_io_time() in order to determine if we should check
track_io_timing or track_wal_io_timing. And we'll want to check
track_wal_io_timing if IOObject is IOOBJECT_WAL in
pgstat_count_io_op_time().

 INSTR_TIME_SET_CURRENT(duration);

INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration,
start);

+pgstat_count_io_op_time(IOOBJECT_WAL,
IOCONTEXT_NORMAL, IOOP_WRITE, start, 1);
+} else
+{

Other users of pgstat_count_io_op_time()/io_op_n() which write multiple
pages at a time pass the number of pages in as the cnt parameter. (see
ExtendBufferedRelLocal() as an example). I think we want to do that for
WAL also. In this case, it would be the local variable "npages" and we
can do it outside of this loop.

It is true that the existing WAL stats count wal_writes here. However,
this is essentially counting write system calls, which is probably not
what we want for pg_stat_io. See [2] for a discussion about whether to
count blocks written back or writeback system calls for a previous
pg_stat_io feature. All of the other block-based IO statistics in
pg_stat_io count the number of blocks.

This being said, we probably want to just leave
PendingWalStats.wal_write++ here. We would normally move it into
pg_stat_io like we have with pgBufferUsage and the db IO stats that are
updated in pgstat_count_io_op_time(). This consolidation makes it easier
to eventually reduce the duplication. However, in this case, it seems
wal_write counts something we don't count in pg_stat_io, so it can
probably be left here. I would still move the
PendingWalStats.wal_write_time into pgstat_count_io_op_time(), since
that seems like it is the same as what will be in pg_stat_io.

Also, op_bytes for IOOBJECT_WAL/IOCONTEXT_NORMAL should be XLOG_BLCKSZ
(see comment in pg_stat_get_io() in pgstatfuncs.c). Those default to the
same value but can be made to be different.


+pgstat_count_io_op_n(IOOBJECT_WAL,
IOCONTEXT_NORMAL, IOOP_WRITE, 1);
 }

 PendingWalStats.wal_write++;

@@ -8233,6 +8237,10 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)

 INSTR_TIME_SET_CURRENT(duration);
 INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, duration, start);
+pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
IOOP_FSYNC, start, 1);

I would wrap this line and check other lines to make sure they are not
too long.

+} else
+{
+pgstat_count_io_op_n(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC, 1);
 }

 PendingWalStats.wal_sync++;

Same feedback as above about using the prepare/count 

Show WAL write and fsync stats in pg_stat_io

2023-06-28 Thread Nazir Bilal Yavuz
Hi,

This is a WIP patch to add WAL write and fsync stats to pg_stat_io
view. There is a track_io_timing variable to control pg_stat_io
timings and a track_wal_io_timing variable to control WAL timings. I
couldn't decide on which logic to enable WAL timings on pg_stat_io.
For now, both pg_stat_io and track_wal_io_timing are needed to be
enabled to track WAL timings in pg_stat_io.

Also, if you compare WAL stats in pg_stat_wal and pg_stat_io; you can
come across differences. These differences are caused by the
background writer's WAL stats not being flushed. Because of that,
background writer's WAL stats are not seen in pg_stat_wal but in
pg_stat_io. I already sent a patch [1] to fix that.

[1] 
https://www.postgresql.org/message-id/CAN55FZ2FPYngovZstr%3D3w1KSEHe6toiZwrurbhspfkXe5UDocg%40mail.gmail.com

Any kind of feedback would be appreciated.

Regards,
Nazir Bilal Yavuz
Microsoft


v1-0001-WIP-Show-WAL-write-and-fsync-stats-on-pg_stat_io.patch
Description: Binary data