Hello,

Chloe Dives reported that sometimes a walsender would become stuck during shutdown and *not* shut down, thus preventing postmaster from completing the shutdown cycle. This has been observed to leave servers in that state for several hours.
After a lengthy investigation, and thanks to a handy reproducer by Chris Wilson, we found that the problem is that WalSndDone wants to avoid shutting down until everything has been sent and acknowledged; but this test is coded in a way that ignores the possibility that we have never received anything from the other end. In that case, both MyWalSnd->flush and MyWalSnd->write are InvalidXLogRecPtr, so replicatedPtr is invalid too and can never equal sentPtr; the condition in WalSndDone to terminate the loop is never fulfilled. So the walsender loops forever and never terminates, blocking shutdown of the whole instance.

The attached patch fixes the problem by testing for the problematic condition.

Apparently this problem has existed forever. Fujii-san almost fixed it in 5c6d9fc4b2b8 (2014!), but missed it by a zillionth of an inch.

-- 
Álvaro Herrera
From aca27a0af5616bc1da4f08cbbc93b4d3c9380f60 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Mon, 23 Nov 2020 17:51:34 -0300
Subject: [PATCH] Don't loop forever in WalSndDone

For a walsender that hasn't sent anything, the "replicatedPtr" as
computed for shutdown is not valid, so the comparison to sentPtr fails.
Make sure to only compare if replicatedPtr is valid.
---
 src/backend/replication/walsender.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5d1b1a16be..bb86c094a3 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2936,7 +2936,8 @@ WalSndDone(WalSndSendDataCallback send_data)
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
+	if (WalSndCaughtUp &&
+		(XLogRecPtrIsInvalid(replicatedPtr) || sentPtr == replicatedPtr) &&
 		!pq_is_send_pending())
 	{
 		QueryCompletion qc;
-- 
2.20.1
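
For anyone who wants to poke at the termination test outside the server, here is a minimal standalone sketch of the condition before and after the patch. It is not walsender.c code: FakeWalSnd, replicated_ptr, done_before_patch and done_after_patch are hypothetical names made up for illustration, the typedef and macros are local redefinitions meant to mirror the PostgreSQL ones, and the result of pq_is_send_pending() is passed in as a plain bool.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins mirroring the PostgreSQL definitions (assumption:
 * redefined here only so the sketch compiles on its own). */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr		((XLogRecPtr) 0)
#define XLogRecPtrIsInvalid(r)	((r) == InvalidXLogRecPtr)

/* Hypothetical reduction of the walsender state used by the test. */
typedef struct FakeWalSnd
{
	XLogRecPtr	write;		/* last write position reported by the peer */
	XLogRecPtr	flush;		/* last flush position reported by the peer */
} FakeWalSnd;

/* Same selection as in the patch context: prefer flush, else write. */
static XLogRecPtr
replicated_ptr(const FakeWalSnd *ws)
{
	return XLogRecPtrIsInvalid(ws->flush) ? ws->write : ws->flush;
}

/* Old test: can never be true when the peer has never replied and
 * something has been sent, because replicatedPtr stays invalid. */
static bool
done_before_patch(bool caught_up, XLogRecPtr sent_ptr,
				  const FakeWalSnd *ws, bool send_pending)
{
	XLogRecPtr	replicatedPtr = replicated_ptr(ws);

	return caught_up && sent_ptr == replicatedPtr && !send_pending;
}

/* Patched test: only compare positions when replicatedPtr is valid. */
static bool
done_after_patch(bool caught_up, XLogRecPtr sent_ptr,
				 const FakeWalSnd *ws, bool send_pending)
{
	XLogRecPtr	replicatedPtr = replicated_ptr(ws);

	return caught_up &&
		(XLogRecPtrIsInvalid(replicatedPtr) || sent_ptr == replicatedPtr) &&
		!send_pending;
}
```

With a FakeWalSnd whose write and flush are both InvalidXLogRecPtr and a nonzero sent_ptr, done_before_patch stays false no matter how often it is re-evaluated, which is the endless loop described above; done_after_patch returns true once caught_up is set and nothing is pending. For a peer that has reported positions, both versions agree.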