Re: pg_stat_io for the startup process

2023-06-12 Thread Kyotaro Horiguchi
At Mon, 12 Jun 2023 21:13:35 -0700, Andres Freund  wrote in 
> Thanks for looking - I already had pushed the commit by the time I read your
> email, otherwise I'd have mentioned you reviewing it...

Oops! But I anticipated that, and it's no problem (I should have
checked the commits). Thanks for the reply!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_stat_io for the startup process

2023-06-12 Thread Andres Freund
Hi,

On 2023-06-13 12:54:19 +0900, Kyotaro Horiguchi wrote:
> I think the reason is that we pass true only when we're in a sort of
> idle period. In addition, XLOG_RUNNING_XACTS comes too infrequently
> to cause noticeable degradation. If it results in sufficiently
> frequent updates, we will be satisfied with it.
> 
> The patch is quite simple and looks good to me. (At least way better
> than always using GetCurrentTimestamp():)

Thanks for looking - I already had pushed the commit by the time I read your
email, otherwise I'd have mentioned you reviewing it...

Greetings,

Andres Freund




Re: pg_stat_io for the startup process

2023-06-12 Thread Kyotaro Horiguchi
At Fri, 9 Jun 2023 21:28:23 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2023-05-22 13:36:42 +0900, Kyotaro Horiguchi wrote:
> > At Sun, 21 May 2023 15:14:23 -0700, Andres Freund  wrote in 
> > > Hi,
> > > 
> > > I don't really feel confident we're going to get the timeout approach solid
> > > enough for something going into the tree around beta 1.
> > > 
> > > How about this, imo a lot simpler, approach: We flush stats whenever replaying
> > > a XLOG_RUNNING_XACTS record. Unless the primary is idle, it will log those at
> > > a regular interval. If the primary is idle, we don't need to flush stats in
> > > the startup process, because we'll not have done any io.
> > > 
> > > We only log XLOG_RUNNING_XACTS when wal_level >= replica, so stats wouldn't
> > > get regularly flushed if wal_level = minimal - but in that case the stats are
> > > also not accessible, so that's not a problem.
> > 
> > I concur on the three aspects: interval regularity, reliability, and
> > the case of the minimal wal_level. While I can't confidently claim it
> > is the best, its simplicity gives us a solid reason to proceed with
> > it.  If necessary, we can switch to alternative timing sources in the
> > future without causing major disruptions for users, I believe.
> > 
> > > It's not the prettiest solution, but I think the simplicity is worth a lot.
> > > 
> > > Greetings,
> > 
> > +1.
> 
> Attached a patch implementing this approach.
> 
> It's imo always a separate bug that we were using
> GetCurrentTransactionStopTimestamp() when force was passed in - that timestamp
> can be quite out of date in some cases. But I don't immediately see a
> noticeable consequence, so ...

Thanks!

I think the reason is that we pass true only when we're in a sort of
idle period. In addition, XLOG_RUNNING_XACTS comes too infrequently
to cause noticeable degradation. If it results in sufficiently
frequent updates, we will be satisfied with it.

The patch is quite simple and looks good to me. (At least way better
than always using GetCurrentTimestamp():)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_stat_io for the startup process

2023-06-09 Thread Andres Freund
Hi,

On 2023-05-22 13:36:42 +0900, Kyotaro Horiguchi wrote:
> At Sun, 21 May 2023 15:14:23 -0700, Andres Freund  wrote in 
> > Hi,
> > 
> > I don't really feel confident we're going to get the timeout approach solid
> > enough for something going into the tree around beta 1.
> > 
> > How about this, imo a lot simpler, approach: We flush stats whenever replaying
> > a XLOG_RUNNING_XACTS record. Unless the primary is idle, it will log those at
> > a regular interval. If the primary is idle, we don't need to flush stats in
> > the startup process, because we'll not have done any io.
> > 
> > We only log XLOG_RUNNING_XACTS when wal_level >= replica, so stats wouldn't
> > get regularly flushed if wal_level = minimal - but in that case the stats are
> > also not accessible, so that's not a problem.
> 
> I concur on the three aspects: interval regularity, reliability, and
> the case of the minimal wal_level. While I can't confidently claim it
> is the best, its simplicity gives us a solid reason to proceed with
> it.  If necessary, we can switch to alternative timing sources in the
> future without causing major disruptions for users, I believe.
> 
> > It's not the prettiest solution, but I think the simplicity is worth a lot.
> > 
> > Greetings,
> 
> +1.

Attached a patch implementing this approach.

It's imo always a separate bug that we were using
GetCurrentTransactionStopTimestamp() when force was passed in - that timestamp
can be quite out of date in some cases. But I don't immediately see a
noticeable consequence, so ...

Greetings,

Andres Freund
From 249c811da9d257c5a71e42016584fc3db0c9a99b Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 9 Jun 2023 21:19:36 -0700
Subject: [PATCH v1] Report stats when replaying XLOG_RUNNING_XACTS

Previously stats in the startup process would only get reported during
shutdown of the startup process. It has been that way for a long time, but
became a lot more noticeable with the new pg_stat_io view, which separates out
IO done by different backend types...

While reporting after every XLOG_RUNNING_XACTS replay isn't the prettiest approach,
it has the advantage of being quite easy. Given that we're well past feature
freeze...

It's not a problem that we don't report stats more frequently with
wal_level=minimal, since in that case stats can't be read before the startup
process has shut down.

Reported-by: Fujii Masao 
Discussion: https://postgr.es/m/5315aedc-fbca-1556-c5de-dc2e00b23...@oss.nttdata.com
---
 src/backend/storage/ipc/standby.c   |  9 +++++++++
 src/backend/utils/activity/pgstat.c | 17 ++++++++++++++---
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index ffe5e1563f5..70b0da50fc1 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1193,6 +1193,15 @@ standby_redo(XLogReaderState *record)
 		running.xids = xlrec->xids;
 
 		ProcArrayApplyRecoveryInfo();
+
+		/*
+		 * The startup process currently has no convenient way to schedule
+		 * stats to be reported. XLOG_RUNNING_XACTS records are issued at a
+		 * regular cadence, making this a convenient location to report
+		 * stats. While these records aren't generated with wal_level=minimal,
+		 * stats also cannot be accessed during WAL replay.
+		 */
+		pgstat_report_stat(true);
 	}
 	else if (info == XLOG_INVALIDATIONS)
 	{
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0cdb552631e..d743fc0b289 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -615,10 +615,21 @@ pgstat_report_stat(bool force)
 	 */
 	Assert(!pgStatLocal.shmem->is_shutdown);
 
-	now = GetCurrentTransactionStopTimestamp();
-
-	if (!force)
+	if (force)
 	{
+		/*
+		 * Stats reports are forced either when it's been too long since stats
+		 * have been reported or in processes that force stats reporting to
+		 * happen at specific points (including shutdown). In the former case
+		 * the transaction stop time might be quite old, in the latter it
+		 * would never get cleared.
+		 */
+		now = GetCurrentTimestamp();
+	}
+	else
+	{
+		now = GetCurrentTransactionStopTimestamp();
+
 		if (pending_since > 0 &&
 			TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
 		{
-- 
2.38.0



Re: pg_stat_io for the startup process

2023-05-21 Thread Kyotaro Horiguchi
At Sun, 21 May 2023 15:14:23 -0700, Andres Freund  wrote in 
> Hi,
> 
> I don't really feel confident we're going to get the timeout approach solid
> enough for something going into the tree around beta 1.
> 
> How about this, imo a lot simpler, approach: We flush stats whenever replaying
> a XLOG_RUNNING_XACTS record. Unless the primary is idle, it will log those at
> a regular interval. If the primary is idle, we don't need to flush stats in
> the startup process, because we'll not have done any io.
> 
> We only log XLOG_RUNNING_XACTS when wal_level >= replica, so stats wouldn't
> get regularly flushed if wal_level = minimal - but in that case the stats are
> also not accessible, so that's not a problem.

I concur on the three aspects: interval regularity, reliability, and
the case of the minimal wal_level. While I can't confidently claim it
is the best, its simplicity gives us a solid reason to proceed with
it.  If necessary, we can switch to alternative timing sources in the
future without causing major disruptions for users, I believe.

> It's not the prettiest solution, but I think the simplicity is worth a lot.
> 
> Greetings,

+1.


-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_stat_io for the startup process

2023-05-21 Thread Andres Freund
Hi,

I don't really feel confident we're going to get the timeout approach solid
enough for something going into the tree around beta 1.

How about this, imo a lot simpler, approach: We flush stats whenever replaying
a XLOG_RUNNING_XACTS record. Unless the primary is idle, it will log those at
a regular interval. If the primary is idle, we don't need to flush stats in
the startup process, because we'll not have done any io.

We only log XLOG_RUNNING_XACTS when wal_level >= replica, so stats wouldn't
get regularly flushed if wal_level = minimal - but in that case the stats are
also not accessible, so that's not a problem.

It's not the prettiest solution, but I think the simplicity is worth a lot.

Greetings,

Andres Freund
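
For context on the "regular interval" above: on a primary with
wal_level >= replica, the bgwriter logs a running-xacts snapshot roughly
every 15 seconds, skipping it when nothing interesting has been written
since the last one. A paraphrased sketch of that logic (based on
src/backend/postmaster/bgwriter.c; details vary by version):

#define LOG_SNAPSHOT_INTERVAL_MS 15000

    /* In BackgroundWriterMain(), paraphrased: */
    if (XLogStandbyInfoActive() && !RecoveryInProgress())
    {
        TimestampTz now = GetCurrentTimestamp();
        TimestampTz timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
                                                          LOG_SNAPSHOT_INTERVAL_MS);

        /* Only log if enough time has passed and WAL hasn't been idle. */
        if (now >= timeout &&
            last_snapshot_lsn <= GetLastImportantRecPtr())
        {
            last_snapshot_lsn = LogStandbySnapshot();   /* emits XLOG_RUNNING_XACTS */
            last_snapshot_ts = now;
        }
    }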




Re: pg_stat_io for the startup process

2023-05-11 Thread Kyotaro Horiguchi
Thanks for clarifying!

At Mon, 8 May 2023 14:46:43 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2023-04-26 18:47:14 +0900, Kyotaro Horiguchi wrote:
> > I see four issues here.
> > 
> > 1. The current database stats omits buffer fetches that don't
> >   originate from a relation.
> > 
> > In this case pgstat_relation can't work since recovery isn't conscious
> > of relids.  We might be able to resolve relfilenode into a relid, but
> > it may not be that simple. Fortunately we already count fetches and
> > hits process-wide using pgBufferUsage, so we can use this for database
> > stats.
> 
> I don't think we need to do anything about that for 16 - they aren't updated
> at process end either.
> 
> I think the fix here is to do the architectural change of maintaining most
> stats keyed by relfilenode as we've discussed in some other threads. Then we
> also can have relation level write stats etc.

I think so.

> > 2. Even if we wanted to report stats for the startup process,
> >   pgstat_report_stats wouldn't permit it since transaction-end
> >   timestamp doesn't advance.
> > 
> > I'm not certain if it's the correct approach, but perhaps we could use
> > GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp()
> > specifically for the startup process.
> 
> What about using GetCurrentTimestamp() when force == true? That'd make sense
> for other users as well, I think?

I'm not sure I got you right, but when force==true, pgstat_report_stats
flushes without considering whether the interval has elapsed. In that
case, there's no need to keep track of the last flush time; the caller
should handle the interval instead.
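
A hedged summary of the calling convention discussed here, as the rest of
the thread uses it (see pgstat_report_stat() in
src/backend/utils/activity/pgstat.c):

    long        next_due;

    /*
     * force = false: flush only if enough time has passed since the last
     * flush. Returns the time in ms until the next flush is due if stats
     * are still pending, else 0 - so the caller can arm a timeout.
     */
    next_due = pgstat_report_stat(false);

    /*
     * force = true: flush now regardless of the interval, e.g. at process
     * exit - the caller takes responsibility for the timing.
     */
    (void) pgstat_report_stat(true);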

> > 3. When should we call pgstat_report_stats on the startup process?
> > 
> > During recovery, I think we can call pgstat_report_stats() (or a
> > subset of it) right before invoking WaitLatch and at segment
> > boundaries.
> 
> I've pondered that as well. But I don't think it's great - it's not exactly
> intuitive that stats reporting gets far less common if you use a 1GB
> wal_segment_size.

If the segment size gets larger, the archive intervals become
longer. So I have a vague feeling that users wouldn't go for such a
large segment size. I don't have a clear idea of the ideal length
for stats reporting intervals in this case, but I think reporting
every few minutes or so would not be so bad for the startup process
when recovery gets busy. Also, I think recovery will often wait
for new data once it catches up to the primary.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_stat_io for the startup process

2023-05-08 Thread Andres Freund
Hi,

On 2023-04-26 18:47:14 +0900, Kyotaro Horiguchi wrote:
> I see four issues here.
> 
> 1. The current database stats omits buffer fetches that don't
>   originate from a relation.
> 
> In this case pgstat_relation can't work since recovery isn't conscious
> of relids.  We might be able to resolve relfilenode into a relid, but
> it may not be that simple. Fortunately we already count fetches and
> hits process-wide using pgBufferUsage, so we can use this for database
> stats.

I don't think we need to do anything about that for 16 - they aren't updated
at process end either.

I think the fix here is to do the architectural change of maintaining most
stats keyed by relfilenode as we've discussed in some other threads. Then we
also can have relation level write stats etc.


> 2. Even if we wanted to report stats for the startup process,
>   pgstat_report_stats wouldn't permit it since transaction-end
>   timestamp doesn't advance.
> 
> I'm not certain if it's the correct approach, but perhaps we could use
> GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp()
> specifically for the startup process.

What about using GetCurrentTimestamp() when force == true? That'd make sense
for other users as well, I think?


> 3. When should we call pgstat_report_stats on the startup process?
> 
> During recovery, I think we can call pgstat_report_stats() (or a
> subset of it) right before invoking WaitLatch and at segment
> boundaries.

I've pondered that as well. But I don't think it's great - it's not exactly
intuitive that stats reporting gets far less common if you use a 1GB
wal_segment_size.

Greetings,

Andres Freund




Re: pg_stat_io for the startup process

2023-05-08 Thread Melanie Plageman
On Wed, May 03, 2023 at 04:11:33PM +0300, Melih Mutlu wrote:
> Andres Freund wrote on Thu, 27 Apr 2023 at 19:27:
> > >  #ifdef WAL_DEBUG
> > >  	if (XLOG_DEBUG ||
> > >  		(record->xl_rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
> > > @@ -3617,6 +3621,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
> > >  					/* Do background tasks that might benefit us later. */
> > >  					KnownAssignedTransactionIdsIdleMaintenance();
> > >  
> > > +					/*
> > > +					 * Need to disable flush timeout to avoid unnecessary
> > > +					 * wakeups. Enable it again after a WAL record is read
> > > +					 * in PerformWalRecovery.
> > > +					 */
> > > +					disable_startup_stat_flush_timeout();
> > > +
> > >  					(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
> > >  									 WL_LATCH_SET | WL_TIMEOUT |
> > >  									 WL_EXIT_ON_PM_DEATH,
> >
> > I think always disabling the timer here isn't quite right - we want to wake up
> > *once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending
> > stats before waiting - potentially for a long time - for WAL. One way would be
> > just explicitly report before the WaitLatch().
> >
> > Another approach, I think better, would be to not use enable_timeout_every(),
> > and to explicitly rearm the timer in HandleStartupProcInterrupts(). When
> > called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so
> > at the end of WaitForWALToBecomeAvailable().  That way we also wouldn't
> > repeatedly fire between calls to HandleStartupProcInterrupts().
> >
> 
> The attached patch is probably not doing what you asked. IIUC,
> HandleStartupProcInterrupts should arm the timer, but shouldn't arm it
> if it's called from WaitForWALToBecomeAvailable; the timer should be
> armed again at the very end of WaitForWALToBecomeAvailable. I've been
> thinking about how to do this properly. Should HandleStartupProcInterrupts
> take a parameter to decide whether the timer needs to be armed? Or do we
> need to add an additional global flag to rearm the timer? Any thoughts?

I had the same question about how to determine whether or not to rearm.

> From 9be7360e49db424c45c53e85efe8a4f5359b5244 Mon Sep 17 00:00:00 2001
> From: Melih Mutlu 
> Date: Wed, 26 Apr 2023 18:21:32 +0300
> Subject: [PATCH v2] Add timeout to flush stats during startup's main replay
>  loop
> diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
> index efc2580536..842394bc8f 100644
> --- a/src/backend/postmaster/startup.c
> +++ b/src/backend/postmaster/startup.c
> @@ -72,6 +72,9 @@ static TimestampTz startup_progress_phase_start_time;
>   */
>  static volatile sig_atomic_t startup_progress_timer_expired = false;
>  
> +/* Indicates whether flushing stats is needed. */
> +static volatile sig_atomic_t idle_stats_update_pending = false;
> +
>  /*
>   * Time between progress updates for long-running startup operations.
>   */
> @@ -206,6 +209,18 @@ HandleStartupProcInterrupts(void)
>   /* Perform logging of memory contexts of this process */
>   if (LogMemoryContextPending)
>   ProcessLogMemoryContextInterrupt();
> +
> + if (idle_stats_update_pending)
> + {
> + /* It's time to report wal stats. */

If we want dbstats to be updated, we'll probably have to call
pgstat_report_stat() here and fix the timestamp issue Horiguchi-san
points out upthread. Perhaps you could review those changes and consider
adding those as preliminary patches before this in a set.

I think you will then need to handle the issue he mentions with partial
flushes coming from pgstat_report_stat() and remembering to try and
flush stats again in case of a partial flush. Though, perhaps we can
just pass force=true.

> + pgstat_report_wal(true);
> + idle_stats_update_pending = false;
> + }

Good that you thought to check if the timeout was already active.

> + else if (!get_timeout_active(IDLE_STATS_UPDATE_TIMEOUT))
> + {
> + /* Set the next timeout. */
> + enable_idle_stats_update_timeout();
> + }
>  }
>  
>  
> @@ -385,3 +400,22 @@ has_startup_progress_timeout_expired(long *secs, int *usecs)
>  
>   return true;
>  }
> +
> +/* Set a flag indicating that it's time to flush wal stats. */
> +void
> +idle_stats_update_timeout_handler(void)
> +{
> + idle_stats_update_pending = true;

Is WakeupRecovery() needed when the timer goes off and the startup
process is waiting on a latch? Does this happen in
WaitForWALToBecomeAvailable()?

> + WakeupRecovery();
> +}
> +
> +/* Enable the timeout set for wal stat 

Re: pg_stat_io for the startup process

2023-05-08 Thread Melanie Plageman
I've reviewed both your patch versions and responded to the ideas in
both of them and the associated emails below.

On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi wrote:
>
> At Tue, 25 Apr 2023 16:04:23 -0700, Andres Freund  wrote in
> > key those stats by oid. However, it *does* report the read/write time. But
> > only at process exit, of course.  The weird part is that the startup process
> > does *NOT* increase pg_stat_database.blks_read/blks_hit, because instead of
> > basing those on pgBufferUsage.shared_blks_read etc, we compute them based on
> > the relation level stats. pgBufferUsage is just used for EXPLAIN.  This isn't
> > recent, afaict.
>
> I see four issues here.
>
> 1. The current database stats omits buffer fetches that don't
>   originate from a relation.
>
> In this case pgstat_relation can't work since recovery isn't conscious
> of relids.  We might be able to resolve relfilenode into a relid, but
> it may not be that simple. Fortunately we already count fetches and
> hits process-wide using pgBufferUsage, so we can use this for database
> stats.
...
> > TL;DR: Currently the startup process maintains blk_read_time, blk_write_time,
> > but doesn't maintain blks_read, blks_hit - which doesn't make sense.
>
> As a result, the attached patch, which is meant for discussion, allows
> pg_stat_database to show fetches and reads by the startup process as
> the counts for the database with id 0.

I would put this in its own patch in a patchset. Of course it relies on
having pgstat_report_stat() called at appropriate times by the startup
process, but having pg_stat_database show read/hit counts is a separate
issue from having pg_stat_io do so. I'm not suggesting we do this, but
you could argue that if we fix the startup process stats reporting that
pg_stat_database not showing reads and hits for the startup process is a
bug that also exists in 15.

Not directly related, but I do not get why the existing stats counters
for pg_stat_database count "fetches" and "hits" and then use subtraction
to get the number of reads. I find it confusing, and it seems like it
could lead to subtle inconsistencies, compared with counting reads closer
to where they actually happen.

> 2. Even if we wanted to report stats for the startup process,
>   pgstat_report_stats wouldn't permit it since transaction-end
>   timestamp doesn't advance.
>
> I'm not certain if it's the correct approach, but perhaps we could use
> GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp()
> specifically for the startup process.

In theory, since all of the approaches proposed in this thread would
exercise rigid control over how often we flush stats in the startup
process, I think it is okay to use GetCurrentTimestamp() when
pgstat_report_stat() is called by the startup process (i.e. we don't
have to worry about overhead of doing it). But looking at it implemented
in the patch made me feel unsettled for some reason.

> 3. When should we call pgstat_report_stats on the startup process?
>
> During recovery, I think we can call pgstat_report_stats() (or a
> subset of it) right before invoking WaitLatch and at segment
> boundaries.

I see in the patch you call pgstat_report_stat() in XLogPageRead(). Will
this only be called on segment boundaries?

> 4. In the existing ReadBuffer_common, there's an inconsistency in
> counting hits and reads between pgstat_io and pgBufferUsage.
>
> The difference comes from the case of RBM_ZERO pages. We should simply
> align them.

I would definitely make this a separate patch and probably a separate
thread. It isn't related to the startup process and is worth a separate
discussion.

On Thu, Apr 27, 2023 at 10:43 PM Kyotaro Horiguchi wrote:
>
> At Fri, 28 Apr 2023 11:15:51 +0900 (JST), Kyotaro Horiguchi wrote in
> > At Thu, 27 Apr 2023 17:30:40 -0400, Melanie Plageman wrote in
> > > After a quick example implementation of this, I found that it seemed to
> > > try and flush the stats less often on an idle standby (good) than using
> > > enable_timeout_every().
> >
> > Just rearming with the full-interval will work that way. Our existing
> > strategy for this is seen in PostgresMain().
> >
> >    stats_timeout = pgstat_report_stat(false);
> >    if (stats_timeout > 0)
> >    {
> >        if (!get_timeout_active(BLAH_TIMEOUT))
> >            enable_timeout_after(BLAH_TIMEOUT, stats_timeout);
> >    }
> >    else
> >    {
> >        if (get_timeout_active(BLAH_TIMEOUT))
> >            disable_timeout(BLAH_TIMEOUT, false);
> >    }
> >    WaitLatch();
>
> In my example, I left out idle-time flushing, but I realized we don't
> need the timeout mechanism here since we're already managing it. So
> the following should work (assuming the timestamp updates with
> GetCurrentTimestamp() in my last patch).
>
> @@ -3889,13 +3900,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
> 					/* Update pg_stat_recovery_prefetch 

Re: pg_stat_io for the startup process

2023-05-03 Thread Melih Mutlu
Hi,

Andres Freund wrote on Thu, 27 Apr 2023 at 19:27:

> Huh - do you have a link to the failure? That's how I would expect it to be
> done.


I think I should have registered it at the beginning of
PerformWalRecovery() and not just before the main redo loop,
since HandleStartupProcInterrupts is called before the loop too.
Here's the error log otherwise [1].

> >  #ifdef WAL_DEBUG
> >  	if (XLOG_DEBUG ||
> >  		(record->xl_rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
> > @@ -3617,6 +3621,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
> >  					/* Do background tasks that might benefit us later. */
> >  					KnownAssignedTransactionIdsIdleMaintenance();
> >  
> > +					/*
> > +					 * Need to disable flush timeout to avoid unnecessary
> > +					 * wakeups. Enable it again after a WAL record is read
> > +					 * in PerformWalRecovery.
> > +					 */
> > +					disable_startup_stat_flush_timeout();
> > +
> >  					(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
> >  									 WL_LATCH_SET | WL_TIMEOUT |
> >  									 WL_EXIT_ON_PM_DEATH,
>
> I think always disabling the timer here isn't quite right - we want to wake up
> *once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending
> stats before waiting - potentially for a long time - for WAL. One way would be
> just explicitly report before the WaitLatch().
>
> Another approach, I think better, would be to not use enable_timeout_every(),
> and to explicitly rearm the timer in HandleStartupProcInterrupts(). When
> called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so
> at the end of WaitForWALToBecomeAvailable().  That way we also wouldn't
> repeatedly fire between calls to HandleStartupProcInterrupts().
>

The attached patch is probably not doing what you asked. IIUC,
HandleStartupProcInterrupts should arm the timer, but shouldn't arm it
if it's called from WaitForWALToBecomeAvailable; the timer should be
armed again at the very end of WaitForWALToBecomeAvailable. I've been
thinking about how to do this properly. Should HandleStartupProcInterrupts
take a parameter to decide whether the timer needs to be armed? Or do we
need to add an additional global flag to rearm the timer? Any thoughts?

[1]
https://api.cirrus-ci.com/v1/artifact/task/5282291971260416/testrun/build/testrun/recovery/010_logical_decoding_timelines/log/010_logical_decoding_timelines_replica.log

Best,
-- 
Melih Mutlu
Microsoft


v2-0001-Add-timeout-to-flush-stats-during-startup-s-main-replay-loop.patch
Description: Binary data


Re: pg_stat_io for the startup process

2023-04-27 Thread Kyotaro Horiguchi
At Fri, 28 Apr 2023 11:15:51 +0900 (JST), Kyotaro Horiguchi wrote in 
> At Thu, 27 Apr 2023 17:30:40 -0400, Melanie Plageman wrote in 
> > After a quick example implementation of this, I found that it seemed to
> > try and flush the stats less often on an idle standby (good) than using
> > enable_timeout_every().
> 
> Just rearming with the full-interval will work that way. Our existing
> strategy for this is seen in PostgresMain().
> 
>    stats_timeout = pgstat_report_stat(false);
>    if (stats_timeout > 0)
>    {
>        if (!get_timeout_active(BLAH_TIMEOUT))
>            enable_timeout_after(BLAH_TIMEOUT, stats_timeout);
>    }
>    else
>    {
>        if (get_timeout_active(BLAH_TIMEOUT))
>            disable_timeout(BLAH_TIMEOUT, false);
>    }
>    WaitLatch();

In my example, I left out idle-time flushing, but I realized we don't
need the timeout mechanism here since we're already managing it. So
the following should work (assuming the timestamp updates with
GetCurrentTimestamp() in my last patch).

@@ -3889,13 +3900,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					/* Update pg_stat_recovery_prefetch before sleeping. */
 					XLogPrefetcherComputeStats(xlogprefetcher);
 
+					/*
+					 * Report stats; if not time yet, set next WaitLatch to
+					 * wake up at the next reporting time.
+					 */
+					wait_time = pgstat_report_stat(false);
+
+					/* if no pending stats, sleep forever */
+					if (wait_time == 0)
+						wait_time = -1L;
+
 					/*
 					 * Wait for more WAL to arrive, when we will be woken
 					 * immediately by the WAL receiver.
 					 */
 					(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 									 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
-									 -1L,
+									 wait_time,
 									 WAIT_EVENT_RECOVERY_WAL_STREAM);

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 188f6d6f85..0dd465c5e3 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3248,6 +3248,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		close(readFile);
 		readFile = -1;
 		readSource = XLOG_FROM_ANY;
+		pgstat_report_stat(false);
 	}
 
 	XLByteToSeg(targetPagePtr, readSegNo, wal_segment_size);
@@ -3607,6 +3608,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 											wal_retrieve_retry_interval))
 			{
 				long		wait_time;
+				long		stats_timeout;
 
 				wait_time = wal_retrieve_retry_interval -
 					TimestampDifferenceMilliseconds(last_fail_time, now);
@@ -3617,6 +3619,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				/* Do background tasks that might benefit us later. */
 				KnownAssignedTransactionIdsIdleMaintenance();
 
+				/*
+				 * Report stats; if not time yet, set next WaitLatch to
+				 * wake up at the next reporting time.
+				 */
+				stats_timeout = pgstat_report_stat(false);
+				if (stats_timeout > 0 && stats_timeout < wait_time)
+					wait_time = stats_timeout;
+
 				(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 								 WL_LATCH_SET | WL_TIMEOUT |
 								 WL_EXIT_ON_PM_DEATH,
@@ -3698,6 

Re: pg_stat_io for the startup process

2023-04-27 Thread Kyotaro Horiguchi
At Thu, 27 Apr 2023 17:30:40 -0400, Melanie Plageman wrote in 
> On Wed, Apr 26, 2023 at 2:53 PM Melih Mutlu  wrote:
> >
> > Hi all,
> >
> > Robert Haas wrote on Wed, 26 Apr 2023 at 15:34:
> >>
> >> On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi wrote:
> >> > 3. When should we call pgstat_report_stats on the startup process?
> >> >
> >> > During recovery, I think we can call pgstat_report_stats() (or a
> >> > subset of it) right before invoking WaitLatch and at segment
> >> > boundaries.
> >>
> >> I think this kind of idea is worth exploring. Andres mentioned timers,
> >> but it seems to me that something where we just do it at certain
> >> "convenient" points might be good enough and simpler. I'd much rather
> >> have statistics that were up to date as of the last time we finished a
> >> segment than have nothing at all.
> >
> >
> > I created a rough prototype of a timer-based approach for comparison. 
> > Please see attached.
> >
> > Registered a new timeout named "STARTUP_STAT_FLUSH_TIMEOUT". The timeout's
> > handler sets a flag indicating that io stats need to be flushed.
> > HandleStartupProcInterrupts checks the flag to flush stats. It's enabled if
> > any WAL record is read (in the main loop of PerformWalRecovery). And it's
> > disabled before WaitLatch to avoid unnecessary wakeups if the startup process
> > decides to sleep.
> 
> I was reviewing this and found that if I remove the calls to
> disable_startup_stat_flush_timeout() the number of calls to
> pgstat_report_wal() on a briefly active and then idle standby are about
> 8 in 100 seconds whereas with timer disablement, the calls over the same
> period are about 40. I would have thought that disabling the timer would
> have caused us to try and flush the stats less often.
> 
> > With the calls to disable_startup_stat_flush_timeout(), we do have far
> > fewer calls to setitimer(), of course.
> 
> Given the below suggestion by Andres, I tried doing some traces of the
> various approaches.
> 
> > > +   disable_startup_stat_flush_timeout();
> > > +
> > > (void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
> > > WL_LATCH_SET | WL_TIMEOUT |
> > > WL_EXIT_ON_PM_DEATH,
> >
> > I think always disabling the timer here isn't quite right - we want to wake up
> > *once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending
> > stats before waiting - potentially for a long time - for WAL. One way would be
> > just explicitly report before the WaitLatch().
> >
> > Another approach, I think better, would be to not use enable_timeout_every(),
> > and to explicitly rearm the timer in HandleStartupProcInterrupts(). When
> > called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so
> > at the end of WaitForWALToBecomeAvailable().  That way we also wouldn't
> > repeatedly fire between calls to HandleStartupProcInterrupts().
> 
> After a quick example implementation of this, I found that it seemed to
> try and flush the stats less often on an idle standby (good) than using
> enable_timeout_every().

Just rearming with the full-interval will work that way. Our existing
strategy for this is seen in PostgresMain().

   stats_timeout = pgstat_report_stat(false);
   if (stats_timeout > 0)
   {
       if (!get_timeout_active(BLAH_TIMEOUT))
           enable_timeout_after(BLAH_TIMEOUT, stats_timeout);
   }
   else
   {
       if (get_timeout_active(BLAH_TIMEOUT))
           disable_timeout(BLAH_TIMEOUT, false);
   }
   WaitLatch();

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: pg_stat_io for the startup process

2023-04-27 Thread Melanie Plageman
On Wed, Apr 26, 2023 at 2:53 PM Melih Mutlu  wrote:
>
> Hi all,
>
> Robert Haas wrote on Wed, 26 Apr 2023 at 15:34:
>>
>> On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi wrote:
>> > 3. When should we call pgstat_report_stats on the startup process?
>> >
>> > During recovery, I think we can call pgstat_report_stats() (or a
>> > subset of it) right before invoking WaitLatch and at segment
>> > boundaries.
>>
>> I think this kind of idea is worth exploring. Andres mentioned timers,
>> but it seems to me that something where we just do it at certain
>> "convenient" points might be good enough and simpler. I'd much rather
>> have statistics that were up to date as of the last time we finished a
>> segment than have nothing at all.
>
>
> I created a rough prototype of a timer-based approach for comparison. Please 
> see attached.
>
> Registered a new timeout named "STARTUP_STAT_FLUSH_TIMEOUT". The timeout's
> handler sets a flag indicating that io stats need to be flushed.
> HandleStartupProcInterrupts checks the flag to flush stats. It's enabled if
> any WAL record is read (in the main loop of PerformWalRecovery). And it's
> disabled before WaitLatch to avoid unnecessary wakeups if the startup process
> decides to sleep.

I was reviewing this and found that if I remove the calls to
disable_startup_stat_flush_timeout() the number of calls to
pgstat_report_wal() on a briefly active and then idle standby are about
8 in 100 seconds whereas with timer disablement, the calls over the same
period are about 40. I would have thought that disabling the timer would
have caused us to try and flush the stats less often.

With the calls to disable_startup_stat_flush_timeout(), we do have far
fewer calls to setitimer(), of course.

Given the below suggestion by Andres, I tried doing some traces of the
various approaches.

> > +   disable_startup_stat_flush_timeout();
> > +
> > (void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
> > WL_LATCH_SET | WL_TIMEOUT |
> > WL_EXIT_ON_PM_DEATH,
>
> I think always disabling the timer here isn't quite right - we want to wake up
> *once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending
> stats before waiting - potentially for a long time - for WAL. One way would be
> just explicitly report before the WaitLatch().
>
> Another approach, I think better, would be to not use enable_timeout_every(),
> and to explicitly rearm the timer in HandleStartupProcInterrupts(). When
> called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so
> at the end of WaitForWALToBecomeAvailable().  That way we also wouldn't
> repeatedly fire between calls to HandleStartupProcInterrupts().

After a quick example implementation of this, I found that it seemed to
try and flush the stats less often on an idle standby (good) than using
enable_timeout_every().

- Melanie




Re: pg_stat_io for the startup process

2023-04-27 Thread Andres Freund
Hi,

On 2023-04-26 21:53:16 +0300, Melih Mutlu wrote:
> Robert Haas wrote on Wed, 26 Apr 2023 at 15:34:
> 
> > On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi wrote:
> > > 3. When should we call pgstat_report_stats on the startup process?
> > >
> > > During recovery, I think we can call pgstat_report_stats() (or a
> > > subset of it) right before invoking WaitLatch and at segment
> > > boundaries.
> >
> > I think this kind of idea is worth exploring. Andres mentioned timers,
> > but it seems to me that something where we just do it at certain
> > "convenient" points might be good enough and simpler. I'd much rather
> > have statistics that were up to date as of the last time we finished a
> > segment than have nothing at all.
> >
> 
> I created a rough prototype of a timer-based approach for comparison.
> Please see attached.

Thanks!


> 2- I'm also not sure if this timeout should be registered at the beginning
> of StartupProcessMain, or does it even make any difference. I tried to do
> this just before the main redo loop in PerformWalRecovery, but that made
> the CI red.

Huh - do you have a link to the failure? That's how I would expect it to be
done.


>  /* Unsupported old recovery command file names (relative to $PGDATA) */
>  #define RECOVERY_COMMAND_FILE	"recovery.conf"
> @@ -1675,6 +1676,9 @@ PerformWalRecovery(void)
>  			ereport_startup_progress("redo in progress, elapsed time: %ld.%02d s, current LSN: %X/%X",
>  									 LSN_FORMAT_ARGS(xlogreader->ReadRecPtr));
>  
> +	/* Is this the right place to enable this? */
> +	enable_startup_stat_flush_timeout();
> +

I think we should try not have additional code for this inside the loop - we
should enable the timer once, when needed, not repeatedly.


>  #ifdef WAL_DEBUG
>  	if (XLOG_DEBUG ||
>  		(record->xl_rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
> @@ -3617,6 +3621,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
>  					/* Do background tasks that might benefit us later. */
>  					KnownAssignedTransactionIdsIdleMaintenance();
>  
> +					/*
> +					 * Need to disable flush timeout to avoid unnecessary
> +					 * wakeups. Enable it again after a WAL record is read
> +					 * in PerformWalRecovery.
> +					 */
> +					disable_startup_stat_flush_timeout();
> +
>  					(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
>  									 WL_LATCH_SET | WL_TIMEOUT |
>  									 WL_EXIT_ON_PM_DEATH,

I think always disabling the timer here isn't quite right - we want to wake up
*once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending
stats before waiting - potentially for a long time - for WAL. One way would be
just explicitly report before the WaitLatch().

Another approach, I think better, would be to not use enable_timeout_every(),
and to explicitly rearm the timer in HandleStartupProcInterrupts(). When
called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so
at the end of WaitForWALToBecomeAvailable().  That way we also wouldn't
repeatedly fire between calls to HandleStartupProcInterrupts().
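
For reference, the timeout-management API being discussed here (declarations
from src/include/utils/timeout.h; signatures as of PostgreSQL 15/16):

extern void enable_timeout_after(TimeoutId id, int delay_ms);
extern void enable_timeout_every(TimeoutId id, TimestampTz fin_time,
								 int interval_in_ms);
extern void disable_timeout(TimeoutId id, bool keep_indicator);
extern bool get_timeout_active(TimeoutId id);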


> diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
> index efc2580536..b250fa95f9 100644
> --- a/src/backend/postmaster/startup.c
> +++ b/src/backend/postmaster/startup.c
> @@ -72,6 +72,11 @@ static TimestampTz startup_progress_phase_start_time;
>   */
>  static volatile sig_atomic_t startup_progress_timer_expired = false;
>  
> +/* Indicates whether flushing stats is needed. */
> +static volatile sig_atomic_t startup_stat_need_flush = false;
> +
> +int  pgstat_stat_flush_timeout = 1000;   /* 1 sec ?? */

We probably should move the existing PGSTAT_MIN_INTERVAL constant from
pgstat.c to pgstat.h.


> +extern void enable_startup_stat_flush_timeout(void);
> +extern void disable_startup_stat_flush_timeout(void);
> +extern void startup_stat_flush_timeout_handler(void);
> +
>  #endif   /* _STARTUP_H */
> diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
> index e561a1cde9..a8d360e255 100644
> --- a/src/include/utils/timeout.h
> +++ b/src/include/utils/timeout.h
> @@ -35,6 +35,7 @@ typedef enum TimeoutId
>   IDLE_STATS_UPDATE_TIMEOUT,
>   

Re: pg_stat_io for the startup process

2023-04-26 Thread Melih Mutlu
Hi all,

Robert Haas wrote on Wed, 26 Apr 2023 at 15:34:

> On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi wrote:
> > 3. When should we call pgstat_report_stats on the startup process?
> >
> > During recovery, I think we can call pgstat_report_stats() (or a
> > subset of it) right before invoking WaitLatch and at segment
> > boundaries.
>
> I think this kind of idea is worth exploring. Andres mentioned timers,
> but it seems to me that something where we just do it at certain
> "convenient" points might be good enough and simpler. I'd much rather
> have statistics that were up to date as of the last time we finished a
> segment than have nothing at all.
>

I created a rough prototype of a timer-based approach for comparison.
Please see attached.

Registered a new timeout named "STARTUP_STAT_FLUSH_TIMEOUT". The timeout's
handler sets a flag indicating that io stats need to be flushed.
HandleStartupProcInterrupts checks the flag to flush stats.
It's enabled if any WAL record is read (in the main loop of
PerformWalRecovery). And it's disabled before WaitLatch to avoid
unnecessary wakeups if the startup process decides to sleep.

With those changes, I see that startup related rows in pg_stat_io are
updated without waiting for the startup process to exit.

postgres=# select context, reads, extends, hits from pg_stat_io where
> backend_type = 'startup';
>   context  | reads | extends | hits
> -----------+-------+---------+------
>  bulkread  |     0 |         |    0
>  bulkwrite |     0 |       0 |    0
>  normal    |     6 |       1 |   41
>  vacuum    |     0 |       0 |    0
> (4 rows)


I'm not sure this is the correct way to implement this approach though. I
appreciate any comment.

Also; some questions about this implementation if you think it's worth
discussing:
1- I set an arbitrary timeout (1 sec) for testing. I don't know what the
correct value should be. Does it make sense to use
PGSTAT_MIN/MAX/IDLE_INTERVAL (quoted after these questions) instead?
2- I'm also not sure if this timeout should be registered at the beginning
of StartupProcessMain, or does it even make any difference. I tried to do
this just before the main redo loop in PerformWalRecovery, but that made
the CI red.
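
(For question 1: the existing intervals are defined in
src/backend/utils/activity/pgstat.c, with values as of PostgreSQL 15/16;
comments paraphrased.)

#define PGSTAT_MIN_INTERVAL		1000	/* minimum ms between flushes */
#define PGSTAT_MAX_INTERVAL		60000	/* ms after which a flush is forced */
#define PGSTAT_IDLE_INTERVAL	10000	/* ms before flushing when idle */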

Best,
-- 
Melih Mutlu
Microsoft


0001-Add-timeout-to-flush-stats-during-startup-s-main-replay-loop.patch
Description: Binary data


Re: pg_stat_io for the startup process

2023-04-26 Thread Robert Haas
On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi wrote:
> 3. When should we call pgstat_report_stats on the startup process?
>
> During recovery, I think we can call pgstat_report_stats() (or a
> subset of it) right before invoking WaitLatch and at segment
> boundaries.

I think this kind of idea is worth exploring. Andres mentioned timers,
but it seems to me that something where we just do it at certain
"convenient" points might be good enough and simpler. I'd much rather
have statistics that were up to date as of the last time we finished a
segment than have nothing at all.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_stat_io for the startup process

2023-04-26 Thread Kyotaro Horiguchi
At Tue, 25 Apr 2023 16:04:23 -0700, Andres Freund  wrote in 
> I refreshed my memory: The startup process has indeed behaved that way for
> much longer than pg_stat_io existed - but it's harder to spot, because the
> stats are more coarsely aggregated :/. And it's very oddly inconsistent:
> 
> The startup process doesn't report per-relation read/hit (it might when we
> create a fake relcache entry, too lazy to see what happens exactly), because we

The key difference lies between relation-level and smgr-level;
recovery doesn't call ReadBufferExtended.

> key those stats by oid. However, it *does* report the read/write time. But
> only at process exit, of course.  The weird part is that the startup process
> does *NOT* increase pg_stat_database.blks_read/blks_hit, because instead of
> basing those on pgBufferUsage.shared_blks_read etc, we compute them based on
> the relation level stats. pgBufferUsage is just used for EXPLAIN.  This isn't
> recent, afaict.

I see four issues here.

1. The current database stats omits buffer fetches that don't
  originate from a relation.

In this case pgstat_relation can't work since recovery isn't conscious
of relids.  We might be able to resolve relfilenode into a relid, but
it may not be that simple. Fortunately we already count fetches and
hits process-wide using pgBufferUsage (see the abridged struct after
this list), so we can use this for database stats.

2. Even if we wanted to report stats for the startup process,
  pgstat_report_stats wouldn't permit it since transaction-end
  timestamp doesn't advance.

I'm not certain if it's the correct approach, but perhaps we could use
GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp()
specifically for the startup process.

3. When should we call pgstat_report_stats on the startup process?

During recovery, I think we can call pgstat_report_stats() (or a
subset of it) right before invoking WaitLatch and at segment
boundaries.

4. In the existing ReadBuffer_common, there's an inconsistency in
counting hits and reads between pgstat_io and pgBufferUsage.

The difference comes from the case of RBM_ZERO pages. We should simply
align them.
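
For issue 1, the process-wide counters live in pgBufferUsage; an abridged
view of the struct (from src/include/executor/instrument.h, trimmed here
for brevity):

typedef struct BufferUsage
{
	int64		shared_blks_hit;	/* # of shared buffer hits */
	int64		shared_blks_read;	/* # of shared disk blocks read */
	int64		local_blks_hit;		/* # of local buffer hits */
	int64		local_blks_read;	/* # of local disk blocks read */
	/* ... plus dirtied/written counters and blk_read_time/blk_write_time */
} BufferUsage;

extern PGDLLIMPORT BufferUsage pgBufferUsage;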


> TL;DR: Currently the startup process maintains blk_read_time, blk_write_time,
> but doesn't maintain blks_read, blks_hit - which doesn't make sense.

As a result, the attached patch, which is meant for discussion, allows
pg_stat_database to show fetches and reads by the startup process as
the counts for the database with id 0.

There's still some difference between pg_stat_io and pg_stat_database,
but I haven't examined it in detail.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 188f6d6f85..6010ae8662 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3248,6 +3248,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		close(readFile);
 		readFile = -1;
 		readSource = XLOG_FROM_ANY;
+		pgstat_report_stat(false);
 	}
 
 	XLByteToSeg(targetPagePtr, readSegNo, wal_segment_size);
@@ -3617,6 +3618,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				/* Do background tasks that might benefit us later. */
 				KnownAssignedTransactionIdsIdleMaintenance();
 
+				pgstat_report_stat(false);
 				(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 								 WL_LATCH_SET | WL_TIMEOUT |
 								 WL_EXIT_ON_PM_DEATH,
@@ -3893,6 +3895,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				 * Wait for more WAL to arrive, when we will be woken
 				 * immediately by the WAL receiver.
 				 */
+				pgstat_report_stat(false);
 				(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 								 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
 								 -1L,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..520936d0dd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1046,9 +1046,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
 		if (found)
 			pgBufferUsage.local_blks_hit++;
-		else if (mode == 

Re: pg_stat_io for the startup process

2023-04-25 Thread Andres Freund
Hi,

On 2023-04-25 16:00:24 -0400, Robert Haas wrote:
> On Tue, Apr 25, 2023 at 2:39 PM Andres Freund  wrote:
> > I'm mildly inclined to not consider it a bug, given that this looks to have
> > been true for other stats for quite a while? But it does still seem worth
> > improving upon - I'd make the consideration when to apply the relevant patches
> > depend on the complexity. I'm worried we'd need to introduce sufficiently new
> > infrastructure that 16 doesn't seem like a good idea. Let's come up with a
> > patch and judge it after?
> 
> ISTM that it's pretty desirable to do something about this. If the
> process isn't going to report statistics properly, at least remove it
> from the view.

It's populated after crash recovery, when shutting down and at the time of
promotion, that isn't *completely* crazy.


> If it can be made to report properly, that would be even better. But
> shipping a new view with information that will nearly always be zeroes
> instead of real data seems like a bad call, even if there are existing cases
> that have the same problem.

I refreshed my memory: The startup process has indeed behaved that way for
much longer than pg_stat_io existed - but it's harder to spot, because the
stats are more coarsely aggregated :/. And it's very oddly inconsistent:

The startup process doesn't report per-relation read/hit (it might when we
create a fake relcache entry, too lazy to see what happens exactly), because we
key those stats by oid. However, it *does* report the read/write time. But
only at process exit, of course.  The weird part is that the startup process
does *NOT* increase pg_stat_database.blks_read/blks_hit, because instead of
basing those on pgBufferUsage.shared_blks_read etc, we compute them based on
the relation level stats. pgBufferUsage is just used for EXPLAIN.  This isn't
recent, afaict.

TL;DR: Currently the startup process maintains blk_read_time, blk_write_time,
but doesn't maintain blks_read, blks_hit - which doesn't make sense.

Yikes.

Greetings,

Andres Freund




Re: pg_stat_io for the startup process

2023-04-25 Thread Michael Paquier
On Tue, Apr 25, 2023 at 04:00:24PM -0400, Robert Haas wrote:
> ISTM that it's pretty desirable to do something about this. If the
> process isn't going to report statistics properly, at least remove it
> from the view. If it can be made to report properly, that would be
> even better. But shipping a new view with information that will nearly
> always be zeroes instead of real data seems like a bad call, even if
> there are existing cases that have the same problem.

Agreed that reporting no information may be better than reporting
incorrect information, even if it means an extra qual.  As mentioned
upthread, if this requires an extra design piece, adding the correct
information had better be pushed to v17~.

Perhaps an open item should be added?
--
Michael




Re: pg_stat_io for the startup process

2023-04-25 Thread Robert Haas
On Tue, Apr 25, 2023 at 2:39 PM Andres Freund  wrote:
> I'm mildly inclined to not consider it a bug, given that this looks to have
> been true for other stats for quite a while? But it does still seem worth
> improving upon - I'd make the consideration when to apply the relevant patches
> depend on the complexity. I'm worried we'd need to introduce sufficiently new
> infrastructure that 16 doesn't seem like a good idea. Let's come up with a
> patch and judge it after?

ISTM that it's pretty desirable to do something about this. If the
process isn't going to report statistics properly, at least remove it
from the view. If it can be made to report properly, that would be
even better. But shipping a new view with information that will nearly
always be zeroes instead of real data seems like a bad call, even if
there are existing cases that have the same problem.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_stat_io for the startup process

2023-04-25 Thread Andres Freund
Hi,

On 2023-04-25 13:54:43 -0400, Melanie Plageman wrote:
> On Tue, Apr 25, 2023 at 10:51:14PM +0900, Fujii Masao wrote:
> > Regarding pg_stat_io for the startup process, I noticed that the counters
> > are only incremented after the startup process exits, not during WAL replay
> > in standby mode. This is because pgstat_flush_io() is only called when
> > the startup process exits. Shouldn't it be called during WAL replay as well
> > to report IO statistics by the startup process even in standby mode?
> 
> Yes, we definitely want stats from the startup process on the standby.
> Elsewhere on the internet where you originally raised this, I mentioned
> that I hacked a pgstat_flush_io() into the redo apply loop in
> PerformWalRecovery() but that I wasn't sure that this was affordable.
> Andres Freund replied saying that it would be too expensive and
> suggested that we set up a regular timeout which sets a flag that's
> checked by HandleStartupProcInterrupts().

It's tempting to try to reuse the STARTUP_PROGRESS_TIMEOUT timer. But it's
controlled by a GUC, so it's not really suitable.


> I'm wondering if this is something we consider a bug and thus would be
> under consideration for 16.

I'm mildly inclined to not consider it a bug, given that this looks to have
been true for other stats for quite a while? But it does still seem worth
improving upon - I'd make the consideration when to apply the relevant patches
depend on the complexity. I'm worried we'd need to introduce sufficiently new
infrastructure that 16 doesn't seem like a good idea. Let's come up with a
patch and judge it after?


> > Also, the pg_stat_io view includes a row with backend_type=startup and
> > context=vacuum, but it seems that the startup process doesn't perform
> > any I/O operations with BAS_VACUUM. If this understanding is right,
> > shouldn't we omit this row from the view? Additionally, I noticed that
> > the view also includes a row with backend_type=startup and
> > context=bulkread / bulkwrite. Do these operations actually occur
> > during startup process?
> 
> Hmm. Yes, I remember posing this question on the thread and not getting
> an answer. I read some code and did some testing and can't see a way we
> would end up with the startup process doing IO in a non-normal context.
> 
> Certainly I can't see how startup process would ever use a BAS_VACUUM
> context given that it executes heap_xlog_vacuum().
> 
> I thought at some point I had encountered an assertion failure when I
> banned the startup process from tracking io operations in bulkread and
> bulkwrite contexts. But, I'm not seeing how that could happen.

It's possible that we decided to not apply such restrictions because the
startup process can be made to execute more code via the extensible
rmgrs.

Greetings,

Andres Freund




Re: pg_stat_io for the startup process

2023-04-25 Thread Melanie Plageman
On Tue, Apr 25, 2023 at 10:51:14PM +0900, Fujii Masao wrote:
> Hi,
> 
> Regarding pg_stat_io for the startup process, I noticed that the counters
> are only incremented after the startup process exits, not during WAL replay
> in standby mode. This is because pgstat_flush_io() is only called when
> the startup process exits. Shouldn't it be called during WAL replay as well
> to report IO statistics by the startup process even in standby mode?

Yes, we definitely want stats from the startup process on the standby.
Elsewhere on the internet where you originally raised this, I mentioned
that I hacked a pgstat_flush_io() into the redo apply loop in
PerformWalRecovery() but that I wasn't sure that this was affordable.
Andres Freund replied saying that it would be too expensive and
suggested that we set up a regular timeout which sets a flag that's
checked by HandleStartupProcInterrupts().
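
A minimal sketch of that suggested pattern, with hypothetical names (the
prototypes elsewhere in this thread flesh it out):

	/* Timeout handlers run as signal handlers, so they only set a flag. */
	static volatile sig_atomic_t stats_flush_pending = false;

	static void
	stats_flush_timeout_handler(void)
	{
		stats_flush_pending = true;
	}

	/* ...checked later, from HandleStartupProcInterrupts(): */
	if (stats_flush_pending)
	{
		pgstat_report_stat(true);
		stats_flush_pending = false;
	}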

I'm wondering if this is something we consider a bug and thus would be
under consideration for 16.
 
> Also, the pg_stat_io view includes a row with backend_type=startup and
> context=vacuum, but it seems that the startup process doesn't perform
> any I/O operations with BAS_VACUUM. If this understanding is right,
> shouldn't we omit this row from the view? Additionally, I noticed that
> the view also includes a row with backend_type=startup and
> context=bulkread / bulkwrite. Do these operations actually occur
> during startup process?

Hmm. Yes, I remember posing this question on the thread and not getting
an answer. I read some code and did some testing and can't see a way we
would end up with the startup process doing IO in a non-normal context.

Certainly I can't see how startup process would ever use a BAS_VACUUM
context given that it executes heap_xlog_vacuum().

I thought at some point I had encountered an assertion failure when I
banned the startup process from tracking io operations in bulkread and
bulkwrite contexts. But, I'm not seeing how that could happen.

- Melanie




pg_stat_io for the startup process

2023-04-25 Thread Fujii Masao

Hi,

Regarding pg_stat_io for the startup process, I noticed that the counters
are only incremented after the startup process exits, not during WAL replay
in standby mode. This is because pgstat_flush_io() is only called when
the startup process exits. Shouldn't it be called during WAL replay as well
to report IO statistics by the startup process even in standby mode?

Also, the pg_stat_io view includes a row with backend_type=startup and
context=vacuum, but it seems that the startup process doesn't perform
any I/O operations with BAS_VACUUM. If this understanding is right,
shouldn't we omit this row from the view? Additionally, I noticed that
the view also includes a row with backend_type=startup and
context=bulkread / bulkwrite. Do these operations actually occur
during startup process?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION