At Thu, 6 Jun 2024 12:49:45 +1000, Peter Smith <smithpb2...@gmail.com> wrote in
> Hi, I have reproduced this multiple times now.
>
> I confirmed the initial post/steps from Alexander. i.e. The test
> script provided [1] gets itself into a state where function
> ReadPageInternal (called by XLogDecodeNextRecord and commented "Wait
> for the next page to become available") constantly returns
> XLREAD_FAIL. Ultimately the test times out because WalSndLoop() loops
> forever, since it never calls WalSndDone() to exit the walsender
> process.

Thanks for the repro; I believe I understand what's happening here.

During server shutdown, the latter half of the last continuation
record may fail to be flushed. This is similar to what is described in
the commit message of commit ff9f111bce. While shutting down,
WalSndLoop() waits for XLogSendLogical() to consume WAL up to
flushPtr, but in this case, the last record cannot complete without
the continuation part starting from flushPtr, which is missing.
However, in such cases, xlogreader.missingContrecPtr is set to the
beginning of the missing part, but something similar to

So, I believe the attached small patch fixes the behavior. I haven't
come up with a good test script for this issue. Something like
026_overwrite_contrecord.pl might work, but this situation seems a bit
more complex than what it handles.

Versions back to 10 should suffer from the same issue and the same
patch will be applicable without significant changes.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From 99cad7bd53a94b4b90937fb1eb2f37f2ebcadf6a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota....@gmail.com>
Date: Thu, 6 Jun 2024 14:56:53 +0900
Subject: [PATCH] Fix infinite loop in walsender during publisher shutdown

When a publisher server is shutting down, there can be a case where
the last WAL record at that point is a continuation record with its
latter part not yet flushed. In such cases, the walsender attempts to
read this unflushed part and ends up in an infinite loop. To prevent
this situation, modify the logical WAL sender to consider itself
caught up in this case. The records that are not fully flushed at this
point are generally not significant, so simply ignoring them should
not cause any issues.
---
 src/backend/replication/walsender.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c623b07cf0..635396c138 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3426,8 +3426,15 @@ XLogSendLogical(void)
 		flushPtr = GetFlushRecPtr(NULL);
 	}
 
-	/* If EndRecPtr is still past our flushPtr, it means we caught up. */
-	if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
+	/*
+	 * If EndRecPtr is still past our flushPtr, it means we caught up. When
+	 * the server is shutting down, the latter part of a continuation record
+	 * may be missing. If got_STOPPING is true, assume we are caught up if the
+	 * last record is missing its continuation part at flushPtr.
+	 */
+	if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr ||
+		(got_STOPPING &&
+		 logical_decoding_ctx->reader->missingContrecPtr == flushPtr))
 		WalSndCaughtUp = true;
 
 	/*
-- 
2.43.0
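
For anyone who wants to reason about the new condition outside the server, here is a minimal standalone sketch of the decision the patch makes. It is only a model, not code from walsender.c: the helper name walsnd_caught_up() and its parameters are made up for illustration, and XLogRecPtr is redeclared as a plain 64-bit integer so the snippet compiles on its own. A value of 0 is used where no continuation record is missing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t XLogRecPtr;

/*
 * Hypothetical helper (assumption, not part of walsender.c): models the
 * condition under which the patched XLogSendLogical() would set
 * WalSndCaughtUp.
 *
 * end_rec_ptr         - end of the last record the reader decoded
 * missing_contrec_ptr - start of a missing continuation part, or 0 if none
 * flush_ptr           - how far WAL has been flushed
 * got_stopping        - true once the server has started shutting down
 */
static bool
walsnd_caught_up(XLogRecPtr end_rec_ptr, XLogRecPtr missing_contrec_ptr,
				 XLogRecPtr flush_ptr, bool got_stopping)
{
	/* Original rule: everything up to flush_ptr has been consumed. */
	if (end_rec_ptr >= flush_ptr)
		return true;

	/*
	 * Added rule: during shutdown, a record whose continuation part would
	 * start exactly at flush_ptr can never complete, so stop waiting.
	 */
	return got_stopping && missing_contrec_ptr == flush_ptr;
}
```

With flush_ptr = 0x2000 and a last decoded record ending at 0x1F00, the helper returns false while the server is running, and true only when got_stopping is set and missing_contrec_ptr is also 0x2000. That matches the intent stated in the commit message: the extra branch applies only during shutdown and only when the missing part begins at flushPtr.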