Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-04 Thread Noah Misch
On Thu, Aug 04, 2022 at 06:32:38AM -0400, David Steele wrote:
> On 8/4/22 04:06, Kyotaro Horiguchi wrote:
> >At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch  wrote in
> >>>>I think in this case a HINT might be sufficient to at least keep people from
> >>>>wasting time tracking down a problem that has already been fixed.
> >>
> >>Here's a patch to add that HINT.  I figure it's better to do this before next
> >>week's minor releases.  In the absence of objections, I'll push this around
> >>2022-08-05 14:00 UTC.
> >
> >+1
> 
> Looks good to me as well.

Thanks for reviewing.

> >>>>However, there is another issue [1] that might argue for a back patch if
> >>>>this patch (as I believe) would fix the issue.
> >>>
> >>>>[1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
> >>>
> >>>That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> >>>chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> >>>WAL directory grows without bound.  Is that roughly the theory in mind?
> >>
> >>On further reflection, I don't expect it to happen that way.  Each failure
> >>message is LOG-level, so the remaining recycles still happen.  (The race
> >>condition can yield an ERROR under PreallocXlogFiles(), but recycling is
> >>already done at that point.)
> >
> >I agree to this.
> 
> Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal growing
> without bound. Even if we are not sure what is causing it, how confident are
> we that the patches applied to v15 would fix it?

I'll say 7%; certainly too low to stop investigating.  The next things I'd want
to see are:

- select name, setting, source from pg_settings where setting <> boot_val;
- log_checkpoints log entries, and other log entries having the same PID

If the theory about checkpoint skipping holds, there should be a period where
the volume of WAL replayed greatly exceeds max_wal_size, yet 0-1 restartpoints
begin and 0 restartpoints complete.




Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-04 Thread David Steele

On 8/4/22 04:06, Kyotaro Horiguchi wrote:

At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch  wrote in

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.


Here's a patch to add that HINT.  I figure it's better to do this before next
week's minor releases.  In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.


+1


Looks good to me as well.


However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.



[1] 
https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com


That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
WAL directory grows without bound.  Is that roughly the theory in mind?


On further reflection, I don't expect it to happen that way.  Each failure
message is LOG-level, so the remaining recycles still happen.  (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)


I agree to this.


Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal 
growing without bound. Even if we are not sure what is causing it, how 
confident are we that the patches applied to v15 would fix it?


Regards,
-David




Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-04 Thread Kyotaro Horiguchi
At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch  wrote in 
> > > I think in this case a HINT might be sufficient to at least keep people from
> > > wasting time tracking down a problem that has already been fixed.
> 
> Here's a patch to add that HINT.  I figure it's better to do this before next
> week's minor releases.  In the absence of objections, I'll push this around
> 2022-08-05 14:00 UTC.

+1

> > > However, there is another issue [1] that might argue for a back patch if
> > > this patch (as I believe) would fix the issue.
> > 
> > > [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
> > 
> > That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> > chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> > WAL directory grows without bound.  Is that roughly the theory in mind?
> 
> On further reflection, I don't expect it to happen that way.  Each failure
> message is LOG-level, so the remaining recycles still happen.  (The race
> condition can yield an ERROR under PreallocXlogFiles(), but recycling is
> already done at that point.)

I agree to this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-03 Thread Noah Misch
On Tue, Aug 02, 2022 at 07:37:27AM -0700, Noah Misch wrote:
> On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:
> > On 7/31/22 02:17, Noah Misch wrote:
> > >On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> > >>On 6/19/21 16:39, Noah Misch wrote:
> > >>>On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> > >>>>Recycling and preallocation are wasteful during archive recovery, because
> > >>>>KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
> > >>>>fix the race by adding an XLogCtl flag indicating which regime currently owns
> > >>>>the right to add long-term pg_wal directory entries.  In the archive recovery
> > >>>>regime, the checkpointer will not preallocate and will unlink old segments
> > >>>>instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.
> > >>>
> > >>>Here's the implementation.  Patches 1-4 suffice to stop the user-visible
> > >>>ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
> > >>>writes, and it provides some future-proofing.
> > >>>
> > >>>I was tempted to (but did not) just remove preallocation.  Creating one file
> > >>>per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> > >>>expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
> > >>>default, a preallocated segment covered a respectable third of the next
> > >>>checkpoint.  Before commit 63653f7 (2002), preallocation created more files.
> > >>
> > >>This also seems like it would fix the link issues we are seeing in [1].
> > >>
> > >>I wonder if that would make it worth a back patch?
> > >
> > >Perhaps.  It's sad to have multiple people deep-diving into something fixed on
> > >HEAD.  On the other hand, I'm not eager to spend risk-of-backpatch points on
> > >this.  One alternative would be adding an errhint like "This is known to
> > >happen occasionally during archive recovery, where it is harmless."  That has
> > >an unpolished look, but it's low-risk and may avoid deep-dive efforts.
> > 
> > I think in this case a HINT might be sufficient to at least keep people from
> > wasting time tracking down a problem that has already been fixed.

Here's a patch to add that HINT.  I figure it's better to do this before next
week's minor releases.  In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.

> > However, there is another issue [1] that might argue for a back patch if
> > this patch (as I believe) would fix the issue.
> 
> > [1] 
> > https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
> 
> That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> WAL directory grows without bound.  Is that roughly the theory in mind?

On further reflection, I don't expect it to happen that way.  Each failure
message is LOG-level, so the remaining recycles still happen.  (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)
Author: Noah Misch 
Commit: Noah Misch 

Add HINT for restartpoint race with KeepFileRestoredFromArchive().

The five commits ending at cc2c7d65fc27e877c9f407587b0b92d46cd6dd16
closed this race condition for v15+.  For v14 through v10, add a HINT to
discourage studying the cosmetic problem.

Reviewed by FIXME.

Discussion: https://postgr.es/m/20220731061747.ga3692...@rfd.leadboat.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cdd01f..9c845b7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3480,7 +3480,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", path)));
+				 errmsg("could not open file \"%s\": %m", path),
+				 (AmCheckpointerProcess() ?
+				  errhint("This is known to fail occasionally during archive recovery, where it is harmless.") :
+				  0)));
 
elog(DEBUG2, "done creating and filling new WAL file");
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 6688dee..e76daff 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -834,7 +834,10 @@ durable_rename_excl(const char *oldfile, const char *newfile, int elevel)
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not link file \"%s\" to \"%s\": %m",
-   

Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-03 Thread Kyotaro Horiguchi
At Wed, 3 Aug 2022 00:28:47 -0700, Noah Misch  wrote in 
> On Wed, Aug 03, 2022 at 11:24:17AM +0900, Kyotaro Horiguchi wrote:
> > At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler  wrote in 
> > > could not link file "pg_wal/xlogtemp.18799" to
> > > > "pg_wal/0001D4530010": File exists
> 
> > Hmm.  It seems like a race condition between StartupXLOG() and
> > RemoveXlogFile().  We need a wider extent of ControlFileLock.  Concretely,
> > taking ControlFileLock before deciding the target xlog file name in
> > RemoveXlogFile() seems to prevent this from happening.  (If this is correct,
> > this is a live issue on the master branch.)
> 
> RemoveXlogFile() calls InstallXLogFileSegment() with find_free=true.  The
> intent of find_free=true is to make it okay to pass a target xlog file that
> ceases to be a good target.  (InstallXLogFileSegment() searches for a good
> target while holding ControlFileLock.)  Can you say more about how that proved
> to be insufficient?

Ug.. No. I can't. I was confused by something. Sorry.

PreallocXlogFiles() and the checkpointer are also mutually excluded by the
same lock.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-03 Thread Noah Misch
On Wed, Aug 03, 2022 at 11:24:17AM +0900, Kyotaro Horiguchi wrote:
> At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler  wrote in 
> > could not link file "pg_wal/xlogtemp.18799" to
> > > "pg_wal/0001D4530010": File exists

> Hmm.  It seems like a race condition between StartupXLOG() and
> RemoveXlogFile().  We need a wider extent of ControlFileLock.  Concretely,
> taking ControlFileLock before deciding the target xlog file name in
> RemoveXlogFile() seems to prevent this from happening.  (If this is correct,
> this is a live issue on the master branch.)

RemoveXlogFile() calls InstallXLogFileSegment() with find_free=true.  The
intent of find_free=true is to make it okay to pass a target xlog file that
ceases to be a good target.  (InstallXLogFileSegment() searches for a good
target while holding ControlFileLock.)  Can you say more about how that proved
to be insufficient?




Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-02 Thread Kyotaro Horiguchi
At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler  wrote in 
> On Tue, Aug 2, 2022 at 10:01 AM David Steele  wrote:
> 
> >
> > > That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> > > chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> > > WAL directory grows without bound.  Is that roughly the theory in mind?
> >
> > Yes, though you have formulated it better than I had in my mind.

I'm not sure I understand it correctly, but isn't the cause of the issue in
the other thread the skipping of many checkpoint records within
checkpoint_timeout?  I remember proposing a GUC variable to disable that
checkpoint skipping.  As another measure for that issue, we could force a
restartpoint if max_wal_size is already filled up, or is known to be filled
during the next checkpoint cycle.  If this is correct, this patch is
irrelevant to that issue.

> > Let's see if Don can confirm that he is seeing the "could not link file"
> > messages.
> 
> 
> During my latest incident, there was only one occurrence:
> 
> could not link file "pg_wal/xlogtemp.18799" to
> > "pg_wal/0001D4530010": File exists

(I noticed that the patch in the other thread is broken:()

Hmm.  It seems like a race condition between StartupXLOG() and
RemoveXlogFile().  We need a wider extent of ControlFileLock.  Concretely,
taking ControlFileLock before deciding the target xlog file name in
RemoveXlogFile() seems to prevent this from happening.  (If this is correct,
this is a live issue on the master branch.)

> WAL restore/recovery seemed to continue on just fine then. And it would
> continue on until the pg_wal volume ran out of space unless I was manually
> rm'ing already-recovered WAL files from the side.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-02 Thread Don Seiler
On Tue, Aug 2, 2022 at 10:01 AM David Steele  wrote:

>
> > That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> > chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> > WAL directory grows without bound.  Is that roughly the theory in mind?
>
> Yes, though you have formulated it better than I had in my mind.
>
> Let's see if Don can confirm that he is seeing the "could not link file"
> messages.


During my latest incident, there was only one occurrence:

could not link file "pg_wal/xlogtemp.18799" to
> "pg_wal/0001D4530010": File exists


WAL restore/recovery seemed to continue on just fine then. And it would
continue on until the pg_wal volume ran out of space unless I was manually
rm'ing already-recovered WAL files from the side.

-- 
Don Seiler
www.seiler.us


Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-02 Thread David Steele




On 8/2/22 10:37, Noah Misch wrote:

On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:

On 7/31/22 02:17, Noah Misch wrote:

On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries.  In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.


Here's the implementation.  Patches 1-4 suffice to stop the user-visible
ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation.  Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint.  Before commit 63653f7 (2002), preallocation created more files.


This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?


Perhaps.  It's sad to have multiple people deep-diving into something fixed on
HEAD.  On the other hand, I'm not eager to spend risk-of-backpatch points on
this.  One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless."  That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.


I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

However, there is another issue [1] that might argue for a back patch if
this patch (as I believe) would fix the issue.



[1] 
https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com


That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
WAL directory grows without bound.  Is that roughly the theory in mind?


Yes, though you have formulated it better than I had in my mind.

Let's see if Don can confirm that he is seeing the "could not link file" 
messages.


Regards,
-David




Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-02 Thread Noah Misch
On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:
> On 7/31/22 02:17, Noah Misch wrote:
> >On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> >>On 6/19/21 16:39, Noah Misch wrote:
> >>>On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> >>>>Recycling and preallocation are wasteful during archive recovery, because
> >>>>KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
> >>>>fix the race by adding an XLogCtl flag indicating which regime currently owns
> >>>>the right to add long-term pg_wal directory entries.  In the archive recovery
> >>>>regime, the checkpointer will not preallocate and will unlink old segments
> >>>>instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.
> >>>
> >>>Here's the implementation.  Patches 1-4 suffice to stop the user-visible
> >>>ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
> >>>writes, and it provides some future-proofing.
> >>>
> >>>I was tempted to (but did not) just remove preallocation.  Creating one file
> >>>per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> >>>expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
> >>>default, a preallocated segment covered a respectable third of the next
> >>>checkpoint.  Before commit 63653f7 (2002), preallocation created more files.
> >>
> >>This also seems like it would fix the link issues we are seeing in [1].
> >>
> >>I wonder if that would make it worth a back patch?
> >
> >Perhaps.  It's sad to have multiple people deep-diving into something fixed on
> >HEAD.  On the other hand, I'm not eager to spend risk-of-backpatch points on
> >this.  One alternative would be adding an errhint like "This is known to
> >happen occasionally during archive recovery, where it is harmless."  That has
> >an unpolished look, but it's low-risk and may avoid deep-dive efforts.
> 
> I think in this case a HINT might be sufficient to at least keep people from
> wasting time tracking down a problem that has already been fixed.
> 
> However, there is another issue [1] that might argue for a back patch if
> this patch (as I believe) would fix the issue.

> [1] 
> https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com

That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
WAL directory grows without bound.  Is that roughly the theory in mind?




Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-08-02 Thread David Steele

On 7/31/22 02:17, Noah Misch wrote:

On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries.  In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.


Here's the implementation.  Patches 1-4 suffice to stop the user-visible
ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation.  Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint.  Before commit 63653f7 (2002), preallocation created more files.


This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?


Perhaps.  It's sad to have multiple people deep-diving into something fixed on
HEAD.  On the other hand, I'm not eager to spend risk-of-backpatch points on
this.  One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless."  That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.


I think in this case a HINT might be sufficient to at least keep people 
from wasting time tracking down a problem that has already been fixed.


However, there is another issue [1] that might argue for a back patch if 
this patch (as I believe) would fix the issue.


Regards,
-David

[1] 
https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com





Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-07-30 Thread Noah Misch
On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> On 6/19/21 16:39, Noah Misch wrote:
> >On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> >>Recycling and preallocation are wasteful during archive recovery, because
> >>KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
> >>fix the race by adding an XLogCtl flag indicating which regime currently owns
> >>the right to add long-term pg_wal directory entries.  In the archive recovery
> >>regime, the checkpointer will not preallocate and will unlink old segments
> >>instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.
> >
> >Here's the implementation.  Patches 1-4 suffice to stop the user-visible
> >ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
> >writes, and it provides some future-proofing.
> >
> >I was tempted to (but did not) just remove preallocation.  Creating one file
> >per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> >expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
> >default, a preallocated segment covered a respectable third of the next
> >checkpoint.  Before commit 63653f7 (2002), preallocation created more files.
> 
> This also seems like it would fix the link issues we are seeing in [1].
> 
> I wonder if that would make it worth a back patch?

Perhaps.  It's sad to have multiple people deep-diving into something fixed on
HEAD.  On the other hand, I'm not eager to spend risk-of-backpatch points on
this.  One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless."  That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

> [1] 
> https://www.postgresql.org/message-id/flat/CAKw-smBhLOGtRJTC5c%3DqKTPz8gz5%2BWPoVAXrHB6mY-1U4_N7-w%40mail.gmail.com




Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-07-26 Thread David Steele

Hi Noah,

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries.  In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.


Here's the implementation.  Patches 1-4 suffice to stop the user-visible
ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation.  Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint.  Before commit 63653f7 (2002), preallocation created more files.


This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

[1] 
https://www.postgresql.org/message-id/flat/CAKw-smBhLOGtRJTC5c%3DqKTPz8gz5%2BWPoVAXrHB6mY-1U4_N7-w%40mail.gmail.com





Re: Race between KeepFileRestoredFromArchive() and restartpoint

2021-06-19 Thread Noah Misch
On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> Recycling and preallocation are wasteful during archive recovery, because
> KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
> fix the race by adding an XLogCtl flag indicating which regime currently owns
> the right to add long-term pg_wal directory entries.  In the archive recovery
> regime, the checkpointer will not preallocate and will unlink old segments
> instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.

Here's the implementation.  Patches 1-4 suffice to stop the user-visible
ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation.  Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint.  Before commit 63653f7 (2002), preallocation created more files.
Author: Noah Misch 
Commit: Noah Misch 

Remove XLogFileInit() ability to skip ControlFileLock.

Cold paths, initdb and end-of-recovery, used it.  Don't optimize them.

Discussion: https://postgr.es/m/20210202151416.gb3304...@rfd.leadboat.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1b3a3d9..39a38ba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -913,8 +913,7 @@ static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
-								   bool find_free, XLogSegNo max_segno,
-								   bool use_lock);
+								   bool find_free, XLogSegNo max_segno);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
@@ -2492,7 +2491,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 
 		/* create/use new log file */
 		use_existent = true;
-		openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+		openLogFile = XLogFileInit(openLogSegNo, &use_existent);
 		ReserveExternalFD();
 	}
 
@@ -3265,10 +3264,6 @@ XLogNeedsFlush(XLogRecPtr record)
  * pre-existing file will be deleted).  On return, true if a pre-existing
  * file was used.
  *
- * use_lock: if true, acquire ControlFileLock while moving file into
- * place.  This should be true except during bootstrap log creation.  The
- * caller must *not* hold the lock at call.
- *
  * Returns FD of opened file.
  *
  * Note: errors here are ERROR not PANIC because we might or might not be
@@ -3277,7 +3272,7 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
+XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3437,8 +3432,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 */
 	max_segno = logsegno + CheckPointSegments;
 	if (!InstallXLogFileSegment(&installed_segno, tmppath,
-								*use_existent, max_segno,
-								use_lock))
+								*use_existent, max_segno))
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3592,7 +3586,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	/*
 	 * Now move the segment into place with its final name.
 	 */
-	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
+	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0))
 		elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
@@ -3616,29 +3610,20 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
  * free slot is found between *segno and max_segno. (Ignored when find_free
  * is false.)
  *
- * use_lock: if true, acquire ControlFileLock while moving file into
- * place.  This should be true except during bootstrap log creation.  The
- * caller must *not* hold the lock at call.
- *
 * Returns true if the file was installed successfully.
Race between KeepFileRestoredFromArchive() and restartpoint

2021-02-02 Thread Noah Misch
A race with KeepFileRestoredFromArchive() can cause a restartpoint to fail, as
seen once on the buildfarm[1].  The attached patch adds a test case; it
applies atop the "stop events" patch[2].  We have two systems for adding
long-term pg_wal directory entries.  KeepFileRestoredFromArchive() adds them
during archive recovery, while InstallXLogFileSegment() does so at all times.
Unfortunately, InstallXLogFileSegment() happens throughout archive recovery,
via the checkpointer recycling segments and calling PreallocXlogFiles().
Multiple processes can run InstallXLogFileSegment(), which uses
ControlFileLock to represent the authority to modify the directory listing of
pg_wal.  KeepFileRestoredFromArchive() just assumes it controls pg_wal.

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries.  In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.

Notable alternatives:

- Release ControlFileLock at the end of XLogFileInit(), not at the end of
  InstallXLogFileSegment().  Add ControlFileLock acquisition to
  KeepFileRestoredFromArchive().  This provides adequate mutual exclusion, but
  XLogFileInit() could return a file descriptor for an unlinked file.  That's
  fine for PreallocXlogFiles(), but it feels wrong.

- During restartpoints, never preallocate or recycle segments.  (Just delete
  obsolete WAL.)  By denying those benefits, this presumably makes streaming
  recovery less efficient.

- Make KeepFileRestoredFromArchive() call XLogFileInit() to open a segment,
  then copy bytes.  This is simple, but it multiplies I/O.  That might be
  negligible on account of caching, or it might not be.  A variant, incurring
  extra fsyncs, would be to use durable_rename() to replace the segment we get
  from XLogFileInit().

- Make KeepFileRestoredFromArchive() rename without first unlinking.  This
  avoids checkpoint failure, but a race could trigger noise from the LOG
  message in InstallXLogFileSegment -> durable_rename_excl.

Does anyone prefer some alternative?  It's probably not worth back-patching
anything for a restartpoint failure this rare, because most restartpoint
outcomes are not user-visible.

Thanks,
nm

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2020-10-05%2023%3A02%3A17
[2] 
https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com
Author: Noah Misch 
Commit: Noah Misch 

Not for commit; for demonstration only.

Before applying this, apply

https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 199d911..1c0a988 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -71,6 +71,7 @@
 #include "storage/reinit.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
+#include "storage/stopevent.h"
 #include "storage/sync.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
@@ -3583,6 +3584,23 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
+static Jsonb *
+InstallXLogFileSegmentEndStopEventParams(XLogSegNo segno)
+{
+   MemoryContext oldCxt;
+   JsonbParseState *state = NULL;
+   Jsonb  *res;
+
+   stopevents_make_cxt();
+   oldCxt = MemoryContextSwitchTo(stopevents_cxt);
+   pushJsonbValue(&state, WJB_BEGIN_OBJECT, NULL);
+   jsonb_push_int8_key(&state, "segno", segno);
+   res = JsonbValueToJsonb(pushJsonbValue(&state, WJB_END_OBJECT, NULL));
+   MemoryContextSwitchTo(oldCxt);
+
+   return res;
+}
+
 /*
  * Install a new XLOG segment file as a current or future log segment.
  *
@@ -3664,6 +3682,9 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
if (use_lock)
LWLockRelease(ControlFileLock);
 
+   STOPEVENT(InstallXLogFileSegmentEndStopEvent,
+ InstallXLogFileSegmentEndStopEventParams(*segno));
+
return true;
 }
 
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1c5a4f8..d9c5a3a 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -30,6 +30,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/pmsignal.h"
+#include "storage/stopevent.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -372,6 +373,23 @@ ExecuteRecoveryCommand(const char *command, const char *commandName, bool failOn
 }
 
 
+static Jsonb *
+KeepFileRestoredFr