On Wed, May 3, 2017 at 12:25 AM, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> On 5/2/17 10:08, Michael Paquier wrote:
>> On Tue, May 2, 2017 at 9:30 PM, Peter Eisentraut
>> <peter.eisentr...@2ndquadrant.com> wrote:
>>> On 5/2/17 03:11, Petr Jelinek wrote:
>>>> logical decoding can theoretically
>>>> do HOT pruning (even if the chance is really small) so it's not safe to
>>>> start logical replication either.
>>>
>>> This seems a bit impossible to resolve.  On the one hand, we want to
>>> allow streaming until after the shutdown checkpoint.  On the other hand,
>>> streaming itself might produce new WAL.
>>
>> It would be nice to split things into two:
>> - patch 1 adding the signal handling that wins a backpatch.
>> - patch 2 fixing the side cases with logical decoding.
>
> The side cases with logical decoding are also not new and would need
> backpatching, AIUI.

Okay, I thought that there was some new concept part of logical
replication here.

>>> Can we prevent HOT pruning during logical decoding?
>>
>> It does not sound much difficult to do, couldn't you just make it a
>> no-op with am_walsender?
>
> That's my hope.

The only code path doing HOT-pruning and generating WAL is
heap_page_prune(). Do you think that we need to worry about FPWs as
well?

Attached is an updated patch, which also forbids the run of any
replication commands when the stopping state is reached.
-- 
Michael
From c8a44b6f84926712b7b6f2b36b4f13b0d1b41977 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Fri, 5 May 2017 14:10:16 +0900
Subject: [PATCH] Prevent panic during shutdown checkpoint

When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and throws
a PANIC if so.  At that point, only walsenders are still active, so one
might think this could not happen, but WAL senders can generate WAL
in the context of certain replication commands that can be run during
the shutdown checkpoint:
- certain variants of CREATE_REPLICATION_SLOT.
- BASE_BACKUP and the backend end WAL record.
- START_REPLICATION and logical decoding, able to do HOT-pruning.

To fix this, divide the walsender shutdown into two phases.  First, the
postmaster sends a SIGUSR2 signal to all walsenders.  The walsenders
then put themselves into the "stopping" state.  In this state, they
reject any commands that might generate WAL.  The checkpointer waits for
all walsenders to reach this state before proceeding with the shutdown
checkpoint.  After the shutdown checkpoint is done, the postmaster sends
SIGINT (previously unused) to the walsenders.  This triggers the
existing shutdown behavior of sending out the shutdown checkpointer and
then terminating.

Author: Michael Paquier <michael.paqu...@gmail.com>
Reported-by: Fujii Masao <masao.fu...@gmail.com>
---
 doc/src/sgml/monitoring.sgml                |   5 +
 src/backend/access/heap/pruneheap.c         |   9 ++
 src/backend/access/transam/xlog.c           |   6 ++
 src/backend/postmaster/postmaster.c         |   7 +-
 src/backend/replication/walsender.c         | 143 ++++++++++++++++++++++++----
 src/include/replication/walsender.h         |   1 +
 src/include/replication/walsender_private.h |   3 +-
 7 files changed, 150 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2a83671b53..80d12b26d7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1690,6 +1690,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
            <literal>backup</>: This WAL sender is sending a backup.
           </para>
          </listitem>
+         <listitem>
+          <para>
+           <literal>stopping</>: This WAL sender is stopping.
+          </para>
+         </listitem>
        </itemizedlist>
      </entry>
     </row>
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d69a266c36..d510649b18 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -22,6 +22,7 @@
 #include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walsender.h"
 #include "storage/bufmgr.h"
 #include "utils/snapmgr.h"
 #include "utils/rel.h"
@@ -189,6 +190,14 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 	PruneState	prstate;
 
 	/*
+	 * Do nothing in the presence of a WAL sender. This code path can be
+	 * taken when doing logical decoding, and it is better to avoid WAL
+	 * generation as this interferes with shutdown checkpoints.
+	 */
+	if (am_walsender)
+		return ndeleted;
+
+	/*
 	 * Our strategy is to scan the page and make lists of items to change,
 	 * then apply the changes within a critical section.  This keeps as much
 	 * logic as possible out of the critical section, and also ensures that
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a89d99838a..5d6f8b75b8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8325,6 +8325,12 @@ ShutdownXLOG(int code, Datum arg)
 	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
 			(errmsg("shutting down")));
 
+	/*
+	 * Wait for WAL senders to be in stopping state.  This prevents commands
+	 * from writing new WAL.
+	 */
+	WalSndWaitStopping();
+
 	if (RecoveryInProgress())
 		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	else
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 4a25ed8f5b..01f1c2805f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2918,7 +2918,7 @@ reaper(SIGNAL_ARGS)
 				 * Waken walsenders for the last time. No regular backends
 				 * should be around anymore.
 				 */
-				SignalChildren(SIGUSR2);
+				SignalChildren(SIGINT);
 
 				pmState = PM_SHUTDOWN_2;
 
@@ -3656,7 +3656,9 @@ PostmasterStateMachine(void)
 				/*
 				 * If we get here, we are proceeding with normal shutdown. All
 				 * the regular children are gone, and it's time to tell the
-				 * checkpointer to do a shutdown checkpoint.
+				 * checkpointer to do a shutdown checkpoint. All WAL senders
+				 * are told to switch to a stopping state so that the shutdown
+				 * checkpoint can go ahead.
 				 */
 				Assert(Shutdown > NoShutdown);
 				/* Start the checkpointer if not running */
@@ -3665,6 +3667,7 @@ PostmasterStateMachine(void)
 				/* And tell it to shut down */
 				if (CheckpointerPID != 0)
 				{
+					SignalSomeChildren(SIGUSR2, BACKEND_TYPE_WALSND);
 					signal_child(CheckpointerPID, SIGUSR2);
 					pmState = PM_SHUTDOWN;
 				}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2a6c8bb62d..bcc024fec2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -24,11 +24,14 @@
  * are treated as not a crash but approximately normal termination;
  * the walsender will exit quickly without sending any more XLOG records.
  *
- * If the server is shut down, postmaster sends us SIGUSR2 after all
- * regular backends have exited and the shutdown checkpoint has been written.
- * This instructs walsender to send any outstanding WAL, including the
- * shutdown checkpoint record, wait for it to be replicated to the standby,
- * and then exit.
+ * If the server is shut down, postmaster sends us SIGUSR2 after all regular
+ * backends have exited. This causes the walsender to switch to the "stopping"
+ * state. In this state, the walsender will reject any replication command
+ * that may generate WAL activity. The checkpointer begins the shutdown
+ * checkpoint once all walsenders are confirmed as stopping. When the shutdown
+ * checkpoint finishes, the postmaster sends us SIGINT. This instructs
+ * walsender to send any outstanding WAL, including the shutdown checkpoint
+ * record, wait for it to be replicated to the standby, and then exit.
  *
  *
  * Portions Copyright (c) 2010-2017, PostgreSQL Global Development Group
@@ -177,13 +180,14 @@ static bool WalSndCaughtUp = false;
 
 /* Flags set by signal handlers for later service in main loop */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t walsender_ready_to_stop = false;
+static volatile sig_atomic_t got_SIGINT = false;
+static volatile sig_atomic_t got_SIGUSR2 = false;
 
 /*
- * This is set while we are streaming. When not set, SIGUSR2 signal will be
+ * This is set while we are streaming. When not set, SIGINT signal will be
  * handled like SIGTERM. When set, the main loop is responsible for checking
- * walsender_ready_to_stop and terminating when it's set (after streaming any
- * remaining WAL).
+ * got_SIGINT and terminating when it's set (after streaming any remaining
+ * WAL).
  */
 static volatile sig_atomic_t replication_active = false;
 
@@ -213,6 +217,7 @@ static struct
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
+static void WalSndSwitchStopping(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -299,11 +304,14 @@ WalSndErrorCleanup(void)
 	ReplicationSlotCleanup();
 
 	replication_active = false;
-	if (walsender_ready_to_stop)
+	if (got_SIGINT)
 		proc_exit(0);
 
 	/* Revert back to startup state */
 	WalSndSetState(WALSNDSTATE_STARTUP);
+
+	if (got_SIGUSR2)
+		WalSndSetState(WALSNDSTATE_STOPPING);
 }
 
 /*
@@ -323,6 +331,7 @@ WalSndShutdown(void)
 	abort();					/* keep the compiler quiet */
 }
 
+
 /*
  * Handle the IDENTIFY_SYSTEM command.
  */
@@ -676,7 +685,7 @@ StartReplication(StartReplicationCmd *cmd)
 		WalSndLoop(XLogSendPhysical);
 
 		replication_active = false;
-		if (walsender_ready_to_stop)
+		if (got_SIGINT)
 			proc_exit(0);
 		WalSndSetState(WALSNDSTATE_STARTUP);
 
@@ -1053,7 +1062,7 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	{
 		ereport(LOG,
 				(errmsg("terminating walsender process after promotion")));
-		walsender_ready_to_stop = true;
+		got_SIGINT = true;
 	}
 
 	WalSndSetState(WALSNDSTATE_CATCHUP);
@@ -1103,7 +1112,7 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	ReplicationSlotRelease();
 
 	replication_active = false;
-	if (walsender_ready_to_stop)
+	if (got_SIGINT)
 		proc_exit(0);
 	WalSndSetState(WALSNDSTATE_STARTUP);
 
@@ -1291,6 +1300,14 @@ WalSndWaitForWal(XLogRecPtr loc)
 			RecentFlushPtr = GetXLogReplayRecPtr(NULL);
 
 		/*
+		 * If postmaster asked us to switch to the stopping state, do so.
+		 * Shutdown is in progress and this will allow the checkpointer to
+		 * move on with the shutdown checkpoint.
+		 */
+		if (got_SIGUSR2)
+			WalSndSetState(WALSNDSTATE_STOPPING);
+
+		/*
 		 * If postmaster asked us to stop, don't wait here anymore. This will
 		 * cause the xlogreader to return without reading a full record, which
 		 * is the fastest way to reach the mainloop which then can quit.
@@ -1299,7 +1316,7 @@ WalSndWaitForWal(XLogRecPtr loc)
 		 * RecentFlushPtr, so we can send all remaining data before shutting
 		 * down.
 		 */
-		if (walsender_ready_to_stop)
+		if (got_SIGINT)
 			break;
 
 		/*
@@ -1374,6 +1391,21 @@ exec_replication_command(const char *cmd_string)
 	MemoryContext old_context;
 
 	/*
+	 * If WAL sender has been told that shutdown is getting close, switch its
+	 * status accordingly to handle the next replication commands correctly.
+	 */
+	if (got_SIGUSR2)
+		WalSndSetState(WALSNDSTATE_STOPPING);
+
+	/*
+	 * Throw error if in stopping mode. This is useful to prevent commands
+	 * that could generate WAL while the shutdown checkpoint is being written.
+	 */
+	if (MyWalSnd->state == WALSNDSTATE_STOPPING)
+		ereport(ERROR,
+				(errmsg("cannot execute replication command while WAL sender is in stopping mode")));
+
+	/*
 	 * CREATE_REPLICATION_SLOT ... LOGICAL exports a snapshot until the next
 	 * command arrives. Clean up the old stuff if there's anything.
 	 */
@@ -2095,13 +2127,20 @@ WalSndLoop(WalSndSendDataCallback send_data)
 			}
 
 			/*
-			 * When SIGUSR2 arrives, we send any outstanding logs up to the
+			 * At the reception of SIGUSR2, switch the WAL sender to the stopping
+			 * state.
+			 */
+			if (got_SIGUSR2)
+				WalSndSetState(WALSNDSTATE_STOPPING);
+
+			/*
+			 * When SIGINT arrives, we send any outstanding logs up to the
 			 * shutdown checkpoint record (i.e., the latest record), wait for
 			 * them to be replicated to the standby, and exit. This may be a
 			 * normal termination at shutdown, or a promotion, the walsender
 			 * is not sure which.
 			 */
-			if (walsender_ready_to_stop)
+			if (got_SIGINT)
 				WalSndDone(send_data);
 		}
 
@@ -2841,7 +2880,23 @@ WalSndXLogSendHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR2: set flag to do a last cycle and shut down afterwards */
+/* SIGUSR2: set flag to switch to stopping state */
+static void
+WalSndSwitchStopping(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_SIGUSR2 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * SIGINT: set flag to do a last cycle and shut down afterwards. The WAL
+ * sender should already have been switched to WALSNDSTATE_STOPPING at
+ * this point.
+ */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
 {
@@ -2856,7 +2911,7 @@ WalSndLastCycleHandler(SIGNAL_ARGS)
 	if (!replication_active)
 		kill(MyProcPid, SIGTERM);
 
-	walsender_ready_to_stop = true;
+	got_SIGINT = true;
 	SetLatch(MyLatch);
 
 	errno = save_errno;
@@ -2869,14 +2924,14 @@ WalSndSignals(void)
 	/* Set up signal handlers */
 	pqsignal(SIGHUP, WalSndSigHupHandler);		/* set flag to read config
 												 * file */
-	pqsignal(SIGINT, SIG_IGN);	/* not used */
+	pqsignal(SIGINT, WalSndLastCycleHandler);	/* request a last cycle and
+												 * shutdown */
 	pqsignal(SIGTERM, die);		/* request shutdown */
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
 	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
-	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
-												 * shutdown */
+	pqsignal(SIGUSR2, WalSndSwitchStopping);	/* switch to stopping state */
 
 	/* Reset some signals that are accepted by postmaster but not here */
 	pqsignal(SIGCHLD, SIG_DFL);
@@ -2954,6 +3009,50 @@ WalSndWakeup(void)
 	}
 }
 
+/*
+ * Wait that all the WAL senders have reached the stopping state. This is
+ * used by the checkpointer to control when shutdown checkpoints can
+ * safely begin.
+ */
+void
+WalSndWaitStopping(void)
+{
+	for (;;)
+	{
+		int			i;
+		bool		all_stopped = true;
+
+		for (i = 0; i < max_wal_senders; i++)
+		{
+			WalSndState state;
+			WalSnd	   *walsnd = &WalSndCtl->walsnds[i];
+
+			SpinLockAcquire(&walsnd->mutex);
+
+			if (walsnd->pid == 0)
+			{
+				SpinLockRelease(&walsnd->mutex);
+				continue;
+			}
+
+			state = walsnd->state;
+			SpinLockRelease(&walsnd->mutex);
+
+			if (state != WALSNDSTATE_STOPPING)
+			{
+				all_stopped = false;
+				break;
+			}
+		}
+
+		/* safe to leave if confirmation is done for all WAL senders */
+		if (all_stopped)
+			return;
+
+		pg_usleep(10000L);	/* wait for 10 msec */
+	}
+}
+
 /* Set state for current walsender (only called in walsender) */
 void
 WalSndSetState(WalSndState state)
@@ -2987,6 +3086,8 @@ WalSndGetStateString(WalSndState state)
 			return "catchup";
 		case WALSNDSTATE_STREAMING:
 			return "streaming";
+		case WALSNDSTATE_STOPPING:
+			return "stopping";
 	}
 	return "UNKNOWN";
 }
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 2ca903872e..99f12377e0 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -44,6 +44,7 @@ extern void WalSndSignals(void);
 extern Size WalSndShmemSize(void);
 extern void WalSndShmemInit(void);
 extern void WalSndWakeup(void);
+extern void WalSndWaitStopping(void);
 extern void WalSndRqstFileReload(void);
 
 /*
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 2c59056cef..36311e124c 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -24,7 +24,8 @@ typedef enum WalSndState
 	WALSNDSTATE_STARTUP = 0,
 	WALSNDSTATE_BACKUP,
 	WALSNDSTATE_CATCHUP,
-	WALSNDSTATE_STREAMING
+	WALSNDSTATE_STREAMING,
+	WALSNDSTATE_STOPPING
 } WalSndState;
 
 /*
-- 
2.12.2

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to