Re: Race between KeepFileRestoredFromArchive() and restartpoint
On Thu, Aug 04, 2022 at 06:32:38AM -0400, David Steele wrote:
> On 8/4/22 04:06, Kyotaro Horiguchi wrote:
> > At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch wrote in
> > > > > I think in this case a HINT might be sufficient to at least keep people from
> > > > > wasting time tracking down a problem that has already been fixed.
> > > >
> > > > Here's a patch to add that HINT.  I figure it's better to do this before next
> > > > week's minor releases.  In the absence of objections, I'll push this around
> > > > 2022-08-05 14:00 UTC.
> >
> > +1
>
> Looks good to me as well.

Thanks for reviewing.

> > > > > However, there is another issue [1] that might argue for a back patch if
> > > > > this patch (as I believe) would fix the issue.
> > > > >
> > > > > [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
> > > >
> > > > That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> > > > chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> > > > WAL directory grows without bound.  Is that roughly the theory in mind?
> > >
> > > On further reflection, I don't expect it to happen that way.  Each failure
> > > message is LOG-level, so the remaining recycles still happen.  (The race
> > > condition can yield an ERROR under PreallocXlogFiles(), but recycling is
> > > already done at that point.)
> >
> > I agree to this.
>
> Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal growing
> without bound. Even if we are not sure what is causing it, how confident are
> we that the patches applied to v15 would fix it?

I'll say 7%; certainly too low to stop investigating.
The next things I'd want to see are:

- select name, setting, source from pg_settings where setting <> boot_val;
- log_checkpoints log entries, and other log entries having the same PID

If the theory about checkpoint skipping holds, there should be a period where
the volume of WAL replayed greatly exceeds max_wal_size, yet 0-1 restartpoints
begin and 0 restartpoints complete.
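[Editorial note: as a rough illustration of the log check described above, the sketch below counts restartpoint starts and completions in a log excerpt. The sample log lines are hypothetical; real log_checkpoints output carries timestamps, PIDs, and more detail around the "restartpoint starting"/"restartpoint complete" message text.]

```python
import re

# Hypothetical excerpt of server log output with log_checkpoints enabled.
log = """\
LOG:  restartpoint starting: time
LOG:  restartpoint complete: wrote 12 buffers (0.1%); ...
LOG:  restartpoint starting: wal
"""

starts = len(re.findall(r"restartpoint starting", log))
completes = len(re.findall(r"restartpoint complete", log))

# Under the checkpoint-skipping theory, a long stretch of replay would show
# few starts and zero completions despite WAL volume far above max_wal_size.
print(starts, completes)  # prints: 2 1
```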
Re: Race between KeepFileRestoredFromArchive() and restartpoint
On 8/4/22 04:06, Kyotaro Horiguchi wrote:
> At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch wrote in
> > > > I think in this case a HINT might be sufficient to at least keep people from
> > > > wasting time tracking down a problem that has already been fixed.
> >
> > Here's a patch to add that HINT.  I figure it's better to do this before next
> > week's minor releases.  In the absence of objections, I'll push this around
> > 2022-08-05 14:00 UTC.
>
> +1

Looks good to me as well.

> > > > However, there is another issue [1] that might argue for a back patch if
> > > > this patch (as I believe) would fix the issue.
> > > >
> > > > [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
> > >
> > > That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> > > chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> > > WAL directory grows without bound.  Is that roughly the theory in mind?
> >
> > On further reflection, I don't expect it to happen that way.  Each failure
> > message is LOG-level, so the remaining recycles still happen.  (The race
> > condition can yield an ERROR under PreallocXlogFiles(), but recycling is
> > already done at that point.)
>
> I agree to this.

Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal growing
without bound. Even if we are not sure what is causing it, how confident are
we that the patches applied to v15 would fix it?

Regards,
-David
Re: Race between KeepFileRestoredFromArchive() and restartpoint
At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch wrote in
> > > I think in this case a HINT might be sufficient to at least keep people from
> > > wasting time tracking down a problem that has already been fixed.
>
> Here's a patch to add that HINT.  I figure it's better to do this before next
> week's minor releases.  In the absence of objections, I'll push this around
> 2022-08-05 14:00 UTC.

+1

> > > However, there is another issue [1] that might argue for a back patch if
> > > this patch (as I believe) would fix the issue.
> > >
> > > [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
> >
> > That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> > chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> > WAL directory grows without bound.  Is that roughly the theory in mind?
>
> On further reflection, I don't expect it to happen that way.  Each failure
> message is LOG-level, so the remaining recycles still happen.  (The race
> condition can yield an ERROR under PreallocXlogFiles(), but recycling is
> already done at that point.)

I agree to this.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Race between KeepFileRestoredFromArchive() and restartpoint
On Tue, Aug 02, 2022 at 07:37:27AM -0700, Noah Misch wrote:
> On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:
> > On 7/31/22 02:17, Noah Misch wrote:
> > > On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> > > > On 6/19/21 16:39, Noah Misch wrote:
> > > > > On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> > > > > > Recycling and preallocation are wasteful during archive recovery, because
> > > > > > KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
> > > > > > fix the race by adding an XLogCtl flag indicating which regime currently owns
> > > > > > the right to add long-term pg_wal directory entries.  In the archive recovery
> > > > > > regime, the checkpointer will not preallocate and will unlink old segments
> > > > > > instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.
> > > > >
> > > > > Here's the implementation.  Patches 1-4 suffice to stop the user-visible
> > > > > ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
> > > > > writes, and it provides some future-proofing.
> > > > >
> > > > > I was tempted to (but did not) just remove preallocation.  Creating one file
> > > > > per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> > > > > expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
> > > > > default, a preallocated segment covered a respectable third of the next
> > > > > checkpoint.  Before commit 63653f7 (2002), preallocation created more files.
> > > >
> > > > This also seems like it would fix the link issues we are seeing in [1].
> > > >
> > > > I wonder if that would make it worth a back patch?
> > >
> > > Perhaps.  It's sad to have multiple people deep-diving into something fixed on
> > > HEAD.  On the other hand, I'm not eager to spend risk-of-backpatch points on
> > > this.  One alternative would be adding an errhint like "This is known to
> > > happen occasionally during archive recovery, where it is harmless."  That has
> > > an unpolished look, but it's low-risk and may avoid deep-dive efforts.
> >
> > I think in this case a HINT might be sufficient to at least keep people from
> > wasting time tracking down a problem that has already been fixed.

Here's a patch to add that HINT.  I figure it's better to do this before next
week's minor releases.  In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.

> > However, there is another issue [1] that might argue for a back patch if
> > this patch (as I believe) would fix the issue.
> >
> > [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
>
> That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> WAL directory grows without bound.  Is that roughly the theory in mind?

On further reflection, I don't expect it to happen that way.  Each failure
message is LOG-level, so the remaining recycles still happen.  (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)

Author: Noah Misch
Commit: Noah Misch

    Add HINT for restartpoint race with KeepFileRestoredFromArchive().

    The five commits ending at cc2c7d65fc27e877c9f407587b0b92d46cd6dd16
    closed this race condition for v15+.  For v14 through v10, add a HINT
    to discourage studying the cosmetic problem.

    Reviewed by FIXME.

    Discussion: https://postgr.es/m/20220731061747.ga3692...@rfd.leadboat.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cdd01f..9c845b7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3480,7 +3480,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", path)));
+				 errmsg("could not open file \"%s\": %m", path),
+				 (AmCheckpointerProcess() ?
+				  errhint("This is known to fail occasionally during archive recovery, where it is harmless.") :
+				  0)));
 
 	elog(DEBUG2, "done creating and filling new WAL file");
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 6688dee..e76daff 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -834,7 +834,10 @@ durable_rename_excl(const char *oldfile, const char *newfile, int elevel)
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not link file \"%s\" to \"%s\": %m",
-
Re: Race between KeepFileRestoredFromArchive() and restartpoint
At Wed, 3 Aug 2022 00:28:47 -0700, Noah Misch wrote in
> On Wed, Aug 03, 2022 at 11:24:17AM +0900, Kyotaro Horiguchi wrote:
> > At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler wrote in
> > > > could not link file "pg_wal/xlogtemp.18799" to
> > > > "pg_wal/0001D4530010": File exists
> >
> > Hmm. It seems like a race condition between StartupXLOG() and
> > RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
> > taking ControlFileLock before deciding the target xlog file name in
> > RemoveXlogFile() seems to prevent this happening. (If this is correct,
> > this is a live issue on the master branch.)
>
> RemoveXlogFile() calls InstallXLogFileSegment() with find_free=true.  The
> intent of find_free=true is to make it okay to pass a target xlog file that
> ceases to be a good target.  (InstallXLogFileSegment() searches for a good
> target while holding ControlFileLock.)  Can you say more about how that proved
> to be insufficient?

Ug.. No. I can't. I was confused by something. Sorry.

PreallocXlogFiles() and checkpointer are mutually excluded by the same lock,
too.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Race between KeepFileRestoredFromArchive() and restartpoint
On Wed, Aug 03, 2022 at 11:24:17AM +0900, Kyotaro Horiguchi wrote:
> At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler wrote in
> > > could not link file "pg_wal/xlogtemp.18799" to
> > > "pg_wal/0001D4530010": File exists
>
> Hmm. It seems like a race condition between StartupXLOG() and
> RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
> taking ControlFileLock before deciding the target xlog file name in
> RemoveXlogFile() seems to prevent this happening. (If this is correct,
> this is a live issue on the master branch.)

RemoveXlogFile() calls InstallXLogFileSegment() with find_free=true.  The
intent of find_free=true is to make it okay to pass a target xlog file that
ceases to be a good target.  (InstallXLogFileSegment() searches for a good
target while holding ControlFileLock.)  Can you say more about how that proved
to be insufficient?
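[Editorial note: the find_free=true behavior described above can be illustrated with a toy model. This is not PostgreSQL code; the lock, the set of installed segments, and the slot search are all simulated, but it shows why a caller-chosen target segment that has since been taken is harmless: the search for a free slot happens under the lock.]

```python
import threading

lock = threading.Lock()          # stands in for ControlFileLock
installed = set()                # segment numbers already present in "pg_wal"

def install_segment(segno, max_segno, find_free=True):
    """Toy InstallXLogFileSegment(): with find_free, search for an unused
    slot while holding the lock, so a stale caller-chosen segno is fine."""
    with lock:
        if find_free:
            while segno in installed:
                segno += 1
                if segno > max_segno:
                    return None  # no free slot up to max_segno; give up
        elif segno in installed:
            return None
        installed.add(segno)
        return segno

# Two "recyclers" race toward the same target; the loser simply lands
# on the next free slot instead of colliding.
assert install_segment(10, 15) == 10
assert install_segment(10, 15) == 11
```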
Re: Race between KeepFileRestoredFromArchive() and restartpoint
At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler wrote in
> On Tue, Aug 2, 2022 at 10:01 AM David Steele wrote:
> > > That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> > > chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> > > WAL directory grows without bound.  Is that roughly the theory in mind?
> >
> > Yes, though you have formulated it better than I had in my mind.

I'm not sure I understand it correctly, but isn't the cause of the issue in
the other thread due to skipping many checkpoint records within
checkpoint_timeout? I remember that I proposed a GUC variable to disable that
checkpoint skipping. As another measure for that issue, we could force
replaying a checkpoint if max_wal_size is already filled up or known to be
filled in the next checkpoint cycle. If this is correct, this patch is
irrelevant to the issue.

> > Let's see if Don can confirm that he is seeing the "could not link file"
> > messages.
>
> During my latest incident, there was only one occurrence:
>
> could not link file "pg_wal/xlogtemp.18799" to
> "pg_wal/0001D4530010": File exists

(I noticed that the patch in the other thread is broken :()

Hmm. It seems like a race condition between StartupXLOG() and
RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
taking ControlFileLock before deciding the target xlog file name in
RemoveXlogFile() seems to prevent this happening. (If this is correct, this
is a live issue on the master branch.)

> WAL restore/recovery seemed to continue on just fine then. And it would
> continue on until the pg_wal volume ran out of space unless I was manually
> rm'ing already-recovered WAL files from the side.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Race between KeepFileRestoredFromArchive() and restartpoint
On Tue, Aug 2, 2022 at 10:01 AM David Steele wrote:
> > That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> > chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> > WAL directory grows without bound.  Is that roughly the theory in mind?
>
> Yes, though you have formulated it better than I had in my mind.
>
> Let's see if Don can confirm that he is seeing the "could not link file"
> messages.

During my latest incident, there was only one occurrence:

could not link file "pg_wal/xlogtemp.18799" to
"pg_wal/0001D4530010": File exists

WAL restore/recovery seemed to continue on just fine then. And it would
continue on until the pg_wal volume ran out of space unless I was manually
rm'ing already-recovered WAL files from the side.

--
Don Seiler
www.seiler.us
Re: Race between KeepFileRestoredFromArchive() and restartpoint
On 8/2/22 10:37, Noah Misch wrote:
> On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:
> > On 7/31/22 02:17, Noah Misch wrote:
> > > On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> > > > On 6/19/21 16:39, Noah Misch wrote:
> > > > > On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> > > > > > Recycling and preallocation are wasteful during archive recovery, because
> > > > > > KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
> > > > > > fix the race by adding an XLogCtl flag indicating which regime currently owns
> > > > > > the right to add long-term pg_wal directory entries.  In the archive recovery
> > > > > > regime, the checkpointer will not preallocate and will unlink old segments
> > > > > > instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.
> > > > >
> > > > > Here's the implementation.  Patches 1-4 suffice to stop the user-visible
> > > > > ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
> > > > > writes, and it provides some future-proofing.
> > > > >
> > > > > I was tempted to (but did not) just remove preallocation.  Creating one file
> > > > > per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> > > > > expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
> > > > > default, a preallocated segment covered a respectable third of the next
> > > > > checkpoint.  Before commit 63653f7 (2002), preallocation created more files.
> > > >
> > > > This also seems like it would fix the link issues we are seeing in [1].
> > > >
> > > > I wonder if that would make it worth a back patch?
> > >
> > > Perhaps.  It's sad to have multiple people deep-diving into something fixed on
> > > HEAD.  On the other hand, I'm not eager to spend risk-of-backpatch points on
> > > this.  One alternative would be adding an errhint like "This is known to
> > > happen occasionally during archive recovery, where it is harmless."  That has
> > > an unpolished look, but it's low-risk and may avoid deep-dive efforts.
> >
> > I think in this case a HINT might be sufficient to at least keep people from
> > wasting time tracking down a problem that has already been fixed.
> >
> > However, there is another issue [1] that might argue for a back patch if
> > this patch (as I believe) would fix the issue.
> >
> > [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
>
> That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
> chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
> WAL directory grows without bound.  Is that roughly the theory in mind?

Yes, though you have formulated it better than I had in my mind.

Let's see if Don can confirm that he is seeing the "could not link file"
messages.

Regards,
-David
Re: Race between KeepFileRestoredFromArchive() and restartpoint
On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:
> On 7/31/22 02:17, Noah Misch wrote:
> > On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> > > On 6/19/21 16:39, Noah Misch wrote:
> > > > On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> > > > > Recycling and preallocation are wasteful during archive recovery, because
> > > > > KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
> > > > > fix the race by adding an XLogCtl flag indicating which regime currently owns
> > > > > the right to add long-term pg_wal directory entries.  In the archive recovery
> > > > > regime, the checkpointer will not preallocate and will unlink old segments
> > > > > instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.
> > > >
> > > > Here's the implementation.  Patches 1-4 suffice to stop the user-visible
> > > > ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
> > > > writes, and it provides some future-proofing.
> > > >
> > > > I was tempted to (but did not) just remove preallocation.  Creating one file
> > > > per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> > > > expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
> > > > default, a preallocated segment covered a respectable third of the next
> > > > checkpoint.  Before commit 63653f7 (2002), preallocation created more files.
> > >
> > > This also seems like it would fix the link issues we are seeing in [1].
> > >
> > > I wonder if that would make it worth a back patch?
> >
> > Perhaps.  It's sad to have multiple people deep-diving into something fixed on
> > HEAD.  On the other hand, I'm not eager to spend risk-of-backpatch points on
> > this.  One alternative would be adding an errhint like "This is known to
> > happen occasionally during archive recovery, where it is harmless."  That has
> > an unpolished look, but it's low-risk and may avoid deep-dive efforts.
>
> I think in this case a HINT might be sufficient to at least keep people from
> wasting time tracking down a problem that has already been fixed.
>
> However, there is another issue [1] that might argue for a back patch if
> this patch (as I believe) would fix the issue.
>
> [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com

That makes sense.  Each iteration of the restartpoint recycle loop has a 1/N
chance of failing.  Recovery adds >N files between restartpoints.  Hence, the
WAL directory grows without bound.  Is that roughly the theory in mind?
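[Editorial note: the two behaviors debated here — "each failure aborts the recycle loop, so pg_wal grows without bound" versus "failures are LOG-level, so the remaining recycles still happen" — can be contrasted with a toy simulation. All numbers are illustrative; failed recycles are simply retried at the next restartpoint.]

```python
import random

random.seed(0)

def run_recovery(cycles, added_per_cycle, fail_chance, abort_on_failure):
    """Model the restartpoint recycle loop. Each cycle, recovery restores
    added_per_cycle segments; the loop then tries to remove the backlog,
    and each removal can fail (the race). With abort_on_failure, one
    failure abandons the rest of the cycle's recycles (ERROR-like);
    otherwise the loop skips the failed file and continues (LOG-like)."""
    backlog = 0
    for _ in range(cycles):
        backlog += added_per_cycle
        remaining = 0
        for i in range(backlog):
            if random.random() < fail_chance:
                if abort_on_failure:
                    remaining += backlog - i  # rest untouched this cycle
                    break
                remaining += 1                # keep going past the failure
        backlog = remaining
    return backlog

# fail_chance = 1/N with N = 50; recovery adds >N files per restartpoint.
bounded = run_recovery(1000, 60, 0.02, abort_on_failure=False)
unbounded = run_recovery(1000, 60, 0.02, abort_on_failure=True)
print(bounded, unbounded)  # the second grows far larger
```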
Re: Race between KeepFileRestoredFromArchive() and restartpoint
On 7/31/22 02:17, Noah Misch wrote:
> On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> > On 6/19/21 16:39, Noah Misch wrote:
> > > On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> > > > Recycling and preallocation are wasteful during archive recovery, because
> > > > KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
> > > > fix the race by adding an XLogCtl flag indicating which regime currently owns
> > > > the right to add long-term pg_wal directory entries.  In the archive recovery
> > > > regime, the checkpointer will not preallocate and will unlink old segments
> > > > instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.
> > >
> > > Here's the implementation.  Patches 1-4 suffice to stop the user-visible
> > > ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
> > > writes, and it provides some future-proofing.
> > >
> > > I was tempted to (but did not) just remove preallocation.  Creating one file
> > > per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> > > expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
> > > default, a preallocated segment covered a respectable third of the next
> > > checkpoint.  Before commit 63653f7 (2002), preallocation created more files.
> >
> > This also seems like it would fix the link issues we are seeing in [1].
> >
> > I wonder if that would make it worth a back patch?
>
> Perhaps.  It's sad to have multiple people deep-diving into something fixed on
> HEAD.  On the other hand, I'm not eager to spend risk-of-backpatch points on
> this.  One alternative would be adding an errhint like "This is known to
> happen occasionally during archive recovery, where it is harmless."  That has
> an unpolished look, but it's low-risk and may avoid deep-dive efforts.

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

However, there is another issue [1] that might argue for a back patch if this
patch (as I believe) would fix the issue.

Regards,
-David

[1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
Re: Race between KeepFileRestoredFromArchive() and restartpoint
On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> On 6/19/21 16:39, Noah Misch wrote:
> > On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> > > Recycling and preallocation are wasteful during archive recovery, because
> > > KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
> > > fix the race by adding an XLogCtl flag indicating which regime currently owns
> > > the right to add long-term pg_wal directory entries.  In the archive recovery
> > > regime, the checkpointer will not preallocate and will unlink old segments
> > > instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.
> >
> > Here's the implementation.  Patches 1-4 suffice to stop the user-visible
> > ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
> > writes, and it provides some future-proofing.
> >
> > I was tempted to (but did not) just remove preallocation.  Creating one file
> > per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> > expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
> > default, a preallocated segment covered a respectable third of the next
> > checkpoint.  Before commit 63653f7 (2002), preallocation created more files.
>
> This also seems like it would fix the link issues we are seeing in [1].
>
> I wonder if that would make it worth a back patch?

Perhaps.  It's sad to have multiple people deep-diving into something fixed on
HEAD.  On the other hand, I'm not eager to spend risk-of-backpatch points on
this.  One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless."  That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

> [1] https://www.postgresql.org/message-id/flat/CAKw-smBhLOGtRJTC5c%3DqKTPz8gz5%2BWPoVAXrHB6mY-1U4_N7-w%40mail.gmail.com
Re: Race between KeepFileRestoredFromArchive() and restartpoint
Hi Noah,

On 6/19/21 16:39, Noah Misch wrote:
> On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> > Recycling and preallocation are wasteful during archive recovery, because
> > KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
> > fix the race by adding an XLogCtl flag indicating which regime currently owns
> > the right to add long-term pg_wal directory entries.  In the archive recovery
> > regime, the checkpointer will not preallocate and will unlink old segments
> > instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.
>
> Here's the implementation.  Patches 1-4 suffice to stop the user-visible
> ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
> writes, and it provides some future-proofing.
>
> I was tempted to (but did not) just remove preallocation.  Creating one file
> per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
> default, a preallocated segment covered a respectable third of the next
> checkpoint.  Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

[1] https://www.postgresql.org/message-id/flat/CAKw-smBhLOGtRJTC5c%3DqKTPz8gz5%2BWPoVAXrHB6mY-1U4_N7-w%40mail.gmail.com
Re: Race between KeepFileRestoredFromArchive() and restartpoint
On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> Recycling and preallocation are wasteful during archive recovery, because
> KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
> fix the race by adding an XLogCtl flag indicating which regime currently owns
> the right to add long-term pg_wal directory entries.  In the archive recovery
> regime, the checkpointer will not preallocate and will unlink old segments
> instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.

Here's the implementation.  Patches 1-4 suffice to stop the user-visible
ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation.  Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint.  Before commit 63653f7 (2002), preallocation created more files.

Author: Noah Misch
Commit: Noah Misch

    Remove XLogFileInit() ability to skip ControlFileLock.

    Cold paths, initdb and end-of-recovery, used it.  Don't optimize them.

    Discussion: https://postgr.es/m/20210202151416.gb3304...@rfd.leadboat.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1b3a3d9..39a38ba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -913,8 +913,7 @@ static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
-								   bool find_free, XLogSegNo max_segno,
-								   bool use_lock);
+								   bool find_free, XLogSegNo max_segno);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
@@ -2492,7 +2491,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			/* create/use new log file */
 			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+			openLogFile = XLogFileInit(openLogSegNo, &use_existent);
 			ReserveExternalFD();
 		}
@@ -3265,10 +3264,6 @@ XLogNeedsFlush(XLogRecPtr record)
  * pre-existing file will be deleted).  On return, true if a pre-existing
  * file was used.
  *
- * use_lock: if true, acquire ControlFileLock while moving file into
- * place.  This should be true except during bootstrap log creation.  The
- * caller must *not* hold the lock at call.
- *
  * Returns FD of opened file.
  *
  * Note: errors here are ERROR not PANIC because we might or might not be
@@ -3277,7 +3272,7 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
+XLogFileInit(XLogSegNo logsegno, bool *use_existent)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3437,8 +3432,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
	 */
	max_segno = logsegno + CheckPointSegments;
	if (!InstallXLogFileSegment(&installed_segno, tmppath,
-								*use_existent, max_segno,
-								use_lock))
+								*use_existent, max_segno))
	{
		/*
		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3592,7 +3586,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
	/*
	 * Now move the segment into place with its final name.
	 */
-	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
+	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0))
		elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
@@ -3616,29 +3610,20 @@
  * free slot is found between *segno and max_segno.  (Ignored when find_free
  * is false.)
  *
- * use_lock: if true, acquire ControlFileLock while moving file into
- * place.  This should be true except during bootstrap log creation.  The
- * caller must *not* hold the lock at call.
- *
  * Returns true if the file was installed successfully.
Race between KeepFileRestoredFromArchive() and restartpoint
A race with KeepFileRestoredFromArchive() can cause a restartpoint to fail, as
seen once on the buildfarm[1].  The attached patch adds a test case; it
applies atop the "stop events" patch[2].

We have two systems for adding long-term pg_wal directory entries.
KeepFileRestoredFromArchive() adds them during archive recovery, while
InstallXLogFileSegment() does so at all times.  Unfortunately,
InstallXLogFileSegment() happens throughout archive recovery, via the
checkpointer recycling segments and calling PreallocXlogFiles().  Multiple
processes can run InstallXLogFileSegment(), which uses ControlFileLock to
represent the authority to modify the directory listing of pg_wal.
KeepFileRestoredFromArchive() just assumes it controls pg_wal.

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries.  In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.

Notable alternatives:

- Release ControlFileLock at the end of XLogFileInit(), not at the end of
  InstallXLogFileSegment().  Add ControlFileLock acquisition to
  KeepFileRestoredFromArchive().  This provides adequate mutual exclusion, but
  XLogFileInit() could return a file descriptor for an unlinked file.  That's
  fine for PreallocXlogFiles(), but it feels wrong.

- During restartpoints, never preallocate or recycle segments.  (Just delete
  obsolete WAL.)  By denying those benefits, this presumably makes streaming
  recovery less efficient.

- Make KeepFileRestoredFromArchive() call XLogFileInit() to open a segment,
  then copy bytes.  This is simple, but it multiplies I/O.  That might be
  negligible on account of caching, or it might not be.  A variant, incurring
  extra fsyncs, would be to use durable_rename() to replace the segment we get
  from XLogFileInit().

- Make KeepFileRestoredFromArchive() rename without first unlinking.  This
  avoids checkpoint failure, but a race could trigger noise from the LOG
  message in InstallXLogFileSegment -> durable_rename_excl.

Does anyone prefer some alternative?

It's probably not worth back-patching anything for a restartpoint failure this
rare, because most restartpoint outcomes are not user-visible.

Thanks,
nm

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2020-10-05%2023%3A02%3A17
[2] https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com

Author: Noah Misch
Commit: Noah Misch

    Not for commit; for demonstration only.

    Before applying this, apply
    https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 199d911..1c0a988 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -71,6 +71,7 @@
 #include "storage/reinit.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
+#include "storage/stopevent.h"
 #include "storage/sync.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
@@ -3583,6 +3584,23 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
		elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
+static Jsonb *
+InstallXLogFileSegmentEndStopEventParams(XLogSegNo segno)
+{
+	MemoryContext oldCxt;
+	JsonbParseState *state = NULL;
+	Jsonb	   *res;
+
+	stopevents_make_cxt();
+	oldCxt = MemoryContextSwitchTo(stopevents_cxt);
+	pushJsonbValue(&state, WJB_BEGIN_OBJECT, NULL);
+	jsonb_push_int8_key(&state, "segno", segno);
+	res = JsonbValueToJsonb(pushJsonbValue(&state, WJB_END_OBJECT, NULL));
+	MemoryContextSwitchTo(oldCxt);
+
+	return res;
+}
+
 /*
  * Install a new XLOG segment file as a current or future log segment.
  *
@@ -3664,6 +3682,9 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
	if (use_lock)
		LWLockRelease(ControlFileLock);
 
+	STOPEVENT(InstallXLogFileSegmentEndStopEvent,
+			  InstallXLogFileSegmentEndStopEventParams(*segno));
+
	return true;
 }
 
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1c5a4f8..d9c5a3a 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -30,6 +30,7 @@
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/pmsignal.h"
+#include "storage/stopevent.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -372,6 +373,23 @@ ExecuteRecoveryCommand(const char *command, const char *commandName, bool failOn
 }
 
+static Jsonb *
+KeepFileRestoredFr
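[Editorial note: the proposed fix — a single XLogCtl flag deciding which regime may add long-term pg_wal entries — can be sketched abstractly. This toy model is not PostgreSQL code; the class and method names are invented, and the real flag lives in shared memory and interacts with ControlFileLock.]

```python
import threading

class WalDirOwnership:
    """Toy model of the proposed flag: exactly one regime at a time owns
    the right to install long-term pg_wal entries. During archive
    recovery, the restore path owns the directory, so installs attempted
    by the checkpointer (recycling, preallocation) are refused."""
    ARCHIVE_RECOVERY = "archive_recovery"
    NORMAL = "normal"

    def __init__(self):
        self._lock = threading.Lock()
        self.regime = self.NORMAL

    def set_regime(self, regime):
        with self._lock:
            self.regime = regime

    def try_install(self, caller_regime):
        # Real code would raise/report an error; we just return False.
        with self._lock:
            return self.regime == caller_regime

own = WalDirOwnership()
own.set_regime(WalDirOwnership.ARCHIVE_RECOVERY)
assert own.try_install(WalDirOwnership.ARCHIVE_RECOVERY)  # restore path wins
assert not own.try_install(WalDirOwnership.NORMAL)        # checkpointer install refused
own.set_regime(WalDirOwnership.NORMAL)                    # recovery ends
assert own.try_install(WalDirOwnership.NORMAL)
```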