From b6e7bff6a4f196303356dfb478604a58b077147c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 5 Mar 2024 20:33:14 +1300
Subject: [PATCH] Fix rare recovery shutdown hang due to checkpointer.

Commit 7ff23c6d started running the checkpointer during crash recovery.

As discovered by Justin, in one rare case this could prevent shutdown
from succeeding, during a narrow window at the beginning of crash
recovery.

When the server is automatically restarting but before
PMSIGNAL_RECOVERY_STARTED is received from the startup process,
FatalError is still true.  If a shutdown request arrived in that narrow
window, the PostmasterStateMachine() logic behaved as if the
checkpointer was not running and didn't need to be told to shut down,
and yet waited forever for it to exit.

Now, we can only move from PM_WAIT_BACKENDS state directly to
PM_WAIT_DEAD_END if the checkpointer isn't running.  If it is running,
we now distinguish between two cases.  In the smart and fast shutdown
case, we need to tell the checkpointer to shut down and move to
PM_SHUTDOWN.  In the immediate shutdown or child crash case, it should
already have been told to quit and we're still waiting for that to
happen, so we stay in PM_WAIT_BACKENDS.

Back-patch to 15.

XXX Experimental patch, not sure yet

Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/ZWlrdQarrZvLsgIk@pryzbyj2023
---
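
Note for reviewers: the three-way decision described in the commit
message can be sketched as a small standalone model.  This is only an
illustration, not the postmaster code.  The function next_step() and
the NextStep enum are hypothetical names invented here, the
ShutdownMode enum is redefined locally, and the sketch assumes every
child other than the checkpointer has already exited.

```c
#include <stdbool.h>

/* Local stand-in for the postmaster's shutdown levels (assumption). */
typedef enum
{
	NoShutdown,
	SmartShutdown,
	FastShutdown,
	ImmediateShutdown
} ShutdownMode;

/* Hypothetical result type: what to do while in PM_WAIT_BACKENDS. */
typedef enum
{
	STAY_IN_WAIT_BACKENDS,		/* checkpointer was told to quit; wait */
	GO_WAIT_DEAD_END,			/* only dead_end children can remain */
	GO_SHUTDOWN					/* ask checkpointer for shutdown checkpoint */
} NextStep;

/*
 * Model of the patched decision, assuming all other children are gone.
 */
static NextStep
next_step(bool checkpointer_running, bool fatal_error, ShutdownMode shutdown)
{
	/* The direct move is allowed only once the checkpointer has exited. */
	if (!checkpointer_running &&
		(shutdown >= ImmediateShutdown || fatal_error))
		return GO_WAIT_DEAD_END;

	/* Smart or fast shutdown: the checkpointer must be told to shut down. */
	if (shutdown > NoShutdown && shutdown < ImmediateShutdown)
		return GO_SHUTDOWN;

	/* Immediate shutdown or child crash: keep waiting for it to quit. */
	return STAY_IN_WAIT_BACKENDS;
}
```

In this model, the reported hang corresponds to the call
next_step(true, true, FastShutdown): the checkpointer is running,
FatalError is still set, and a fast shutdown has been requested.  The
sketch returns GO_SHUTDOWN there, so the checkpointer gets told to
shut down instead of being waited for indefinitely.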
 src/backend/postmaster/postmaster.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index da0c627107e..62db752228a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3748,12 +3748,14 @@ PostmasterStateMachine(void)
 			WalSummarizerPID == 0 &&
 			BgWriterPID == 0 &&
 			(CheckpointerPID == 0 ||
-			 (!FatalError && Shutdown < ImmediateShutdown)) &&
+			 (!FatalError && Shutdown < ImmediateShutdown) ||
+			 (FatalError && CheckpointerPID != 0)) &&
 			WalWriterPID == 0 &&
 			AutoVacPID == 0 &&
 			SlotSyncWorkerPID == 0)
 		{
-			if (Shutdown >= ImmediateShutdown || FatalError)
+			if (CheckpointerPID == 0 &&
+				(Shutdown >= ImmediateShutdown || FatalError))
 			{
 				/*
 				 * Start waiting for dead_end children to die.  This state
@@ -3767,7 +3769,7 @@ PostmasterStateMachine(void)
 				 * FatalError state.
 				 */
 			}
-			else
+			else if (Shutdown > NoShutdown && Shutdown < ImmediateShutdown)
 			{
 				/*
 				 * If we get here, we are proceeding with normal shutdown. All
@@ -3805,6 +3807,16 @@ PostmasterStateMachine(void)
 						signal_child(PgArchPID, SIGQUIT);
 				}
 			}
+			else
+			{
+				/*
+				 * Either it's an immediate shutdown or a child crashed, and
+				 * we're still waiting for all the children to quit.  The
+				 * checkpointer was already told to quit.
+				 */
+				Assert(Shutdown == ImmediateShutdown ||
+					   (Shutdown == NoShutdown && FatalError));
+			}
 		}
 	}
 
-- 
2.43.0

