Hi,

Please find in attachment v5 of the patch set rebased on master after various
conflicts.

Regards,

On Wed, 5 Aug 2020 00:04:53 +0200
Jehan-Guillaume de Rorthais <j...@dalibo.com> wrote:

> Demote now keeps backends with no active xid alive. Smart mode keeps all
> backends: it waits for them to finish their xact and enter read-only. Fast
> mode terminate backends wit an active xid and keeps all other ones.
> Backends enters "read-only" using LocalXLogInsertAllowed=0 and flip it to -1
> (check recovery state) once demoted.
> During demote, no new session is allowed.
> 
> As backends with no active xid survive, a new SQL admin function
> "pg_demote(fast bool, wait bool, wait_seconds int)" had been added.
> 
> Demote now relies on sigusr1 instead of hijacking sigterm/sigint and pmdie().
> The resulting refactoring makes the code much simpler, cleaner, with better
> isolation of actions from the code point of view.
> 
> Thanks to the refactoring, the patch now only adds one state to the state
> machine: PM_DEMOTING. A second one could be use to replace:
> 
>     /* Demoting: start the Startup Process */
>     if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)
> 
> with eg.:
> 
>     if (pmState == PM_DEMOTED)
> 
> I believe it might be a bit simpler to understand, but the existing comment
> might be good enough as well. The full state machine path for demote is:
> 
>  PM_DEMOTING  /* wait for active xid backend to finish */
>  PM_SHUTDOWN  /* wait for checkpoint shutdown and its 
>                  various shutdown tasks */
>  PM_SHUTDOWN && !CheckpointerPID /* aka PM_DEMOTED: start Startup process */
>  PM_STARTUP
> 
> Tests in "recovery/t/021_promote-demote.pl" grows from 13 to 24 tests,
> adding tests on backend behaviors during demote and new function pg_demote().
> 
> On my todo:
> 
> * cancel running checkpoint for fast demote ?
> * forbid demote when PITR backup is in progress
> * user documentation
> * Robert's concern about snapshot during hot standby
> * anything else reported to me
> 
> Plus, I might be able to split the backend part and their signals of the patch
> 0002 in its own patch if it helps the review. It would apply after 0001 and
> before actual 0002.
> 
> As there was no consensus and the discussions seemed to conclude this patch
> set should keep growing to see were it goes, I wonder if/when I should add it
> to the commitfest. Advice? Opinion?
>From 90e26a7e2d53f7a8436de0b73eb57498f884de9d Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <j...@dalibo.com>
Date: Fri, 31 Jul 2020 10:58:40 +0200
Subject: [PATCH 1/4] demote: setter functions for LocalXLogInsert local
 variable

Adds functions extern LocalSetXLogInsertNotAllowed() and
LocalSetXLogInsertCheckRecovery() to set the local variable
LocalXLogInsert respectively to 0 and -1.

These functions are declared as extern for future need in
the demote patch.

Function LocalSetXLogInsertAllowed() already exists and
declared as static as it is not needed outside of xlog.h.
---
 src/backend/access/transam/xlog.c | 27 +++++++++++++++++++++++----
 src/include/access/xlog.h         |  2 ++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..c0d79f192c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7711,7 +7711,7 @@ StartupXLOG(void)
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
-	LocalXLogInsertAllowed = -1;
+	LocalSetXLogInsertCheckRecovery();
 
 	if (InRecovery)
 	{
@@ -8219,6 +8219,25 @@ LocalSetXLogInsertAllowed(void)
 	InitXLOGAccess();
 }
 
+/*
+ * Make XLogInsertAllowed() return false in the current process only.
+ */
+void
+LocalSetXLogInsertNotAllowed(void)
+{
+	LocalXLogInsertAllowed = 0;
+}
+
+/*
+ * Make XLogInsertCheckRecovery() return false in the current process only.
+ */
+void
+LocalSetXLogInsertCheckRecovery(void)
+{
+	LocalXLogInsertAllowed = -1;
+}
+
+
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  *
@@ -9004,9 +9023,9 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		if (flags & CHECKPOINT_END_OF_RECOVERY)
-			LocalXLogInsertAllowed = -1;	/* return to "check" state */
+			LocalSetXLogInsertCheckRecovery(); /* return to "check" state */
 		else
-			LocalXLogInsertAllowed = 0; /* never again write WAL */
+			LocalSetXLogInsertNotAllowed(); /* never again write WAL */
 	}
 
 	/*
@@ -9159,7 +9178,7 @@ CreateEndOfRecoveryRecord(void)
 
 	END_CRIT_SECTION();
 
-	LocalXLogInsertAllowed = -1;	/* return to "check" state */
+	LocalSetXLogInsertCheckRecovery();	/* return to "check" state */
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..8c9cadc6da 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -306,6 +306,8 @@ extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
+extern void LocalSetXLogInsertNotAllowed(void);
+extern void LocalSetXLogInsertCheckRecovery(void);
 extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
-- 
2.20.1

>From f655909a19250eea98a668051f5941bdd87ad5b2 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <j...@dalibo.com>
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH 2/4] demote: support demoting instance from production to
 standby

This patch adds the ability for an in production read write
instance to step back as a read only standby. Two different
modes are supported: fast or smart. Fast demote mode cancel
transactions holding a xid to demote as fast as possible.
Smart demote mode wait for existing transactions holding a
xid to finish and forbid new ones.

The demote process is triggered by creating a "demote" or
"demote_fast" trigger file then signaling postmaster with
SIGUSR1. This patch adds "pg_ctl [-m {fast|smart}] demote"
and an SQL admin function will be added in a separate commit.

The demote procedure starts be setting the postmaster state
machine to PM_DEMOTING which ends when all backends are
read-only and set LocalXLogInsert=0 to forbid new writes.

When PM_DEMOTING finish, the state is set to PM_SHUTDOWN
to create a shutdown checkpoint and exit the checkpointer.
We could have kept the checkpointer around with some more
effort but there's no good reason worthing the additionnal
code complexity.

ShutdownXLOG now takes a boolean arg to handle demote
differently than a normal shutdown. This allows the
checkpointer to leave walsenders alive to demote faster.
Moreover, this might be useful in futur for eg.
implementing a controlled switchover over the replication
protocol.

The checkpointer set the cluster state as DB_DEMOTING so
the Startup process can detect the demote procedure in
StartupXLOG() and handle subsystems accordingly.

The startup process is started on checkpointer exit. The
postmaster state machine is then switched to PM_STARTUP.

Some sub-processes are kept alive during the demote procedure:
the stat collector, bgwriter and optionally archiver and wal
senders.

At the end of demote, USR1 is sent to all backends and wal
senders to set their environment as in recovery and cascading.

Discuss/Todo:

* add doc
* do not handle backup in progress during demote
* investigate snapshots shmem needs/init during recovery compare to
  production
* cancel running checkpoint during demote
  * replace with a END_OF_PRODUCTION xlog record?
---
 src/backend/access/transam/twophase.c   |  98 ++++++++
 src/backend/access/transam/xlog.c       | 320 ++++++++++++++++--------
 src/backend/postmaster/checkpointer.c   |  28 +++
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/postmaster/postmaster.c     | 235 ++++++++++++++++-
 src/backend/storage/ipc/procarray.c     |   2 +
 src/backend/storage/ipc/procsignal.c    |  30 +++
 src/backend/storage/lmgr/lock.c         |  12 +
 src/backend/tcop/postgres.c             |  44 ++++
 src/backend/utils/init/globals.c        |   1 +
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c                 | 117 +++++++++
 src/include/access/twophase.h           |   1 +
 src/include/access/xlog.h               |  23 +-
 src/include/catalog/pg_control.h        |   1 +
 src/include/libpq/libpq-be.h            |   2 +-
 src/include/miscadmin.h                 |   1 +
 src/include/pgstat.h                    |   1 +
 src/include/postmaster/bgwriter.h       |   1 +
 src/include/storage/lock.h              |   2 +
 src/include/storage/procsignal.h        |   4 +
 src/include/tcop/tcopprot.h             |   2 +
 src/include/utils/pidfile.h             |   1 +
 23 files changed, 802 insertions(+), 129 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ef4f9981e3..31deb487ed 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1557,6 +1557,104 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	pfree(buf);
 }
 
+/*
+ * ShutdownPreparedTransactions: clean prepared from sheared memory
+ *
+ * This is called during the demote process to clean the shared memory
+ * before the startup process load everything back in correctly
+ * for the standby mode.
+ *
+ * Note: this function assue all prepared transaction have been
+ * written to disk. In consequence, it must be called AFTER the demote
+ * shutdown checkpoint.
+ *
+ * FIXME demote: pay attention to the previous note when removing shutdown
+ * checkpoint from the demote procedure.
+ */
+void
+ShutdownPreparedTransactions(void)
+{
+	int i;
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact;
+		PGPROC	   *proc;
+		TransactionId xid;
+		char	   *buf;
+		char	   *bufptr;
+		TwoPhaseFileHeader *hdr;
+		TransactionId latestXid;
+		TransactionId *children;
+
+		gxact = TwoPhaseState->prepXacts[i];
+		proc = &ProcGlobal->allProcs[gxact->pgprocno];
+		xid = gxact->xid;
+
+		/* Read and validate 2PC state data */
+		Assert(gxact->ondisk);
+		buf = ReadTwoPhaseFile(xid, false);
+
+		/*
+		 * Disassemble the header area
+		 */
+		hdr = (TwoPhaseFileHeader *) buf;
+		Assert(TransactionIdEquals(hdr->xid, xid));
+		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader))
+			   + MAXALIGN(hdr->gidlen);
+		children = (TransactionId *) bufptr;
+		bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId))
+				+ MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode))
+				+ MAXALIGN(hdr->nabortrels * sizeof(RelFileNode))
+				+ MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+
+		/* compute latestXid among all children */
+		latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children);
+
+		/* remove dummy proc associated to the gaxt */
+		ProcArrayRemove(proc, latestXid);
+
+		/*
+		 * This lock is probably not needed during the demote process
+		 * as all backends are already gone.
+		 */
+		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+		/* cleanup locks */
+		for (;;)
+		{
+			TwoPhaseRecordOnDisk *record = (TwoPhaseRecordOnDisk *) bufptr;
+
+			Assert(record->rmid <= TWOPHASE_RM_MAX_ID);
+			if (record->rmid == TWOPHASE_RM_END_ID)
+				break;
+
+			bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
+
+			if (record->rmid == TWOPHASE_RM_LOCK_ID)
+				lock_twophase_shutdown(xid, record->info,
+									 (void *) bufptr, record->len);
+
+			bufptr += MAXALIGN(record->len);
+		}
+
+		/* and put it back in the freelist */
+		gxact->next = TwoPhaseState->freeGXacts;
+		TwoPhaseState->freeGXacts = gxact;
+
+		/*
+		 * Release the lock as all callbacks are called and shared memory cleanup
+		 * is done.
+		 */
+		LWLockRelease(TwoPhaseStateLock);
+
+		pfree(buf);
+	}
+
+	TwoPhaseState->numPrepXacts -= i;
+	Assert(TwoPhaseState->numPrepXacts == 0);
+}
+
 /*
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c0d79f192c..abc88ec241 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6298,6 +6298,11 @@ CheckRequiredParameterValues(void)
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
+/*
+ * FIXME demote: part of the code here assume there's no other active
+ * processes before signal PMSIGNAL_RECOVERY_STARTED is sent.
+ */
+
 void
 StartupXLOG(void)
 {
@@ -6321,6 +6326,7 @@ StartupXLOG(void)
 	XLogPageReadPrivate private;
 	bool		promoted = false;
 	struct stat st;
+	bool		is_demoting = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6385,6 +6391,25 @@ StartupXLOG(void)
 							str_time(ControlFile->time))));
 			break;
 
+		case DB_DEMOTING:
+			ereport(LOG,
+					(errmsg("database system was demoted at %s",
+							str_time(ControlFile->time))));
+			is_demoting = true;
+			bgwriterLaunched = true;
+			InArchiveRecovery = true;
+			StandbyMode = true;
+
+			/*
+			 * previous state was RECOVERY_STATE_DONE. We need to
+			 * reinit it to something else so RecoveryInProgress()
+			 * doesn't return false.
+			 */
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+			break;
+
 		default:
 			ereport(FATAL,
 					(errmsg("control file contains invalid database cluster state")));
@@ -6418,7 +6443,8 @@ StartupXLOG(void)
 	 *   persisted.  To avoid that, fsync the entire data directory.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED &&
-		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY &&
+		ControlFile->state != DB_DEMOTING)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
@@ -6674,7 +6700,8 @@ StartupXLOG(void)
 					(errmsg("could not locate a valid checkpoint record")));
 		}
 		memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
-		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
+		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN) &&
+			!is_demoting;
 	}
 
 	/*
@@ -6736,9 +6763,9 @@ StartupXLOG(void)
 	LastRec = RecPtr = checkPointLoc;
 
 	ereport(DEBUG1,
-			(errmsg_internal("redo record is at %X/%X; shutdown %s",
+			(errmsg_internal("redo record is at %X/%X; %s checkpoint",
 							 (uint32) (checkPoint.redo >> 32), (uint32) checkPoint.redo,
-							 wasShutdown ? "true" : "false")));
+							 wasShutdown ? "shutdown" : is_demoting? "demote": "")));
 	ereport(DEBUG1,
 			(errmsg_internal("next transaction ID: " UINT64_FORMAT "; next OID: %u",
 							 U64FromFullTransactionId(checkPoint.nextXid),
@@ -6772,47 +6799,7 @@ StartupXLOG(void)
 					 checkPoint.newestCommitTsXid);
 	XLogCtl->ckptFullXid = checkPoint.nextXid;
 
-	/*
-	 * Initialize replication slots, before there's a chance to remove
-	 * required resources.
-	 */
-	StartupReplicationSlots();
-
-	/*
-	 * Startup logical state, needs to be setup now so we have proper data
-	 * during crash recovery.
-	 */
-	StartupReorderBuffer();
 
-	/*
-	 * Startup MultiXact. We need to do this early to be able to replay
-	 * truncations.
-	 */
-	StartupMultiXact();
-
-	/*
-	 * Ditto for commit timestamps.  Activate the facility if the setting is
-	 * enabled in the control file, as there should be no tracking of commit
-	 * timestamps done when the setting was disabled.  This facility can be
-	 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record.
-	 */
-	if (ControlFile->track_commit_timestamp)
-		StartupCommitTs();
-
-	/*
-	 * Recover knowledge about replay progress of known replication partners.
-	 */
-	StartupReplicationOrigin();
-
-	/*
-	 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
-	 * control file. On recovery, all unlogged relations are blown away, so
-	 * the unlogged LSN counter can be reset too.
-	 */
-	if (ControlFile->state == DB_SHUTDOWNED)
-		XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
-	else
-		XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
 
 	/*
 	 * We must replay WAL entries using the same TimeLineID they were created
@@ -6821,19 +6808,64 @@ StartupXLOG(void)
 	 */
 	ThisTimeLineID = checkPoint.ThisTimeLineID;
 
-	/*
-	 * Copy any missing timeline history files between 'now' and the recovery
-	 * target timeline from archive to pg_wal. While we don't need those files
-	 * ourselves - the history file of the recovery target timeline covers all
-	 * the previous timelines in the history too - a cascading standby server
-	 * might be interested in them. Or, if you archive the WAL from this
-	 * server to a different archive than the primary, it'd be good for all the
-	 * history files to get archived there after failover, so that you can use
-	 * one of the old timelines as a PITR target. Timeline history files are
-	 * small, so it's better to copy them unnecessarily than not copy them and
-	 * regret later.
-	 */
-	restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
+	if (!is_demoting)
+	{
+		/*
+		 * Initialize replication slots, before there's a chance to remove
+		 * required resources.
+		 */
+		StartupReplicationSlots();
+
+		/*
+		 * Startup logical state, needs to be setup now so we have proper data
+		 * during crash recovery.
+		 */
+		StartupReorderBuffer();
+
+		/*
+		 * Startup MultiXact. We need to do this early to be able to replay
+		 * truncations.
+		 */
+		StartupMultiXact();
+
+		/*
+		 * Ditto for commit timestamps.  Activate the facility if the setting is
+		 * enabled in the control file, as there should be no tracking of commit
+		 * timestamps done when the setting was disabled.  This facility can be
+		 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record.
+		 */
+		if (ControlFile->track_commit_timestamp)
+			StartupCommitTs();
+
+		/*
+		 * Recover knowledge about replay progress of known replication partners.
+		 */
+		StartupReplicationOrigin();
+
+		/*
+		 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
+		 * control file. On recovery, all unlogged relations are blown away, so
+		 * the unlogged LSN counter can be reset too.
+		 */
+		if (ControlFile->state == DB_SHUTDOWNED)
+			XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
+		else
+			XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
+
+		/*
+		 * Copy any missing timeline history files between 'now' and the recovery
+		 * target timeline from archive to pg_wal. While we don't need those files
+		 * ourselves - the history file of the recovery target timeline covers all
+		 * the previous timelines in the history too - a cascading standby server
+		 * might be interested in them. Or, if you archive the WAL from this
+		 * server to a different archive than the master, it'd be good for all the
+		 * history files to get archived there after failover, so that you can use
+		 * one of the old timelines as a PITR target. Timeline history files are
+		 * small, so it's better to copy them unnecessarily than not copy them and
+		 * regret later.
+		 */
+		restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
+	}
 
 	/*
 	 * Before running in recovery, scan pg_twophase and fill in its status to
@@ -6888,11 +6920,25 @@ StartupXLOG(void)
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
 		{
-			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+			if (is_demoting)
+			{
+				/*
+				 * Avoid concurrent access to the ControlFile datas
+				 * during demotion.
+				 */
+				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+				LWLockRelease(ControlFileLock);
+			}
+			else
+			{
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
 
-			SpinLockAcquire(&XLogCtl->info_lck);
-			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
-			SpinLockRelease(&XLogCtl->info_lck);
+				/* This is already set if demoting */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+			}
 		}
 		else
 		{
@@ -6982,7 +7028,8 @@ StartupXLOG(void)
 		/*
 		 * Reset pgstat data, because it may be invalid after recovery.
 		 */
-		pgstat_reset_all();
+		if (!is_demoting)
+			pgstat_reset_all();
 
 		/*
 		 * If there was a backup label file, it's done its job and the info
@@ -7044,7 +7091,7 @@ StartupXLOG(void)
 
 			InitRecoveryTransactionEnvironment();
 
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
@@ -7057,6 +7104,11 @@ StartupXLOG(void)
 			 * Startup commit log and subtrans only.  MultiXact and commit
 			 * timestamp have already been started up and other SLRUs are not
 			 * maintained during recovery and need not be started yet.
+			 *
+			 * Starting up commit log is technicaly not needed during demote
+			 * as the in-memory data did not move. However, this is a
+			 * lightweight initialization and this might seem expected as
+			 * pure symmetry as ShutdownCLOG() is called during ShutdownXLog().
 			 */
 			StartupCLOG();
 			StartupSUBTRANS(oldestActiveXID);
@@ -7067,7 +7119,7 @@ StartupXLOG(void)
 			 * empty running-xacts record and use that here and now. Recover
 			 * additional standby state for prepared transactions.
 			 */
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 			{
 				RunningTransactionsData running;
 				TransactionId latestCompletedXid;
@@ -7938,6 +7990,7 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->SharedHotStandbyActive = false;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8056,6 +8109,23 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Initialize the local TimeLineID
+ */
+bool
+SetLocalRecoveryInProgress(void)
+{
+	/*
+	 * use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
+	 */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
+
+	return LocalRecoveryInProgress;
+}
+
 /*
  * Is the system still in recovery?
  *
@@ -8077,13 +8147,7 @@ RecoveryInProgress(void)
 		return false;
 	else
 	{
-		/*
-		 * use volatile pointer to make sure we make a fresh read of the
-		 * shared variable.
-		 */
-		volatile XLogCtlData *xlogctl = XLogCtl;
-
-		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
+		SetLocalRecoveryInProgress();
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8503,6 +8567,8 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 void
 ShutdownXLOG(int code, Datum arg)
 {
+	bool is_demoting = DatumGetBool(arg);
+
 	/*
 	 * We should have an aux process resource owner to use, and we should not
 	 * be in a transaction that's installed some other resowner.
@@ -8512,35 +8578,60 @@ ShutdownXLOG(int code, Datum arg)
 		   CurrentResourceOwner == AuxProcessResourceOwner);
 	CurrentResourceOwner = AuxProcessResourceOwner;
 
-	/* Don't be chatty in standalone mode */
-	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
-			(errmsg("shutting down")));
-
-	/*
-	 * Signal walsenders to move to stopping state.
-	 */
-	WalSndInitStopping();
-
-	/*
-	 * Wait for WAL senders to be in stopping state.  This prevents commands
-	 * from writing new WAL.
-	 */
-	WalSndWaitStopping();
+	if (is_demoting)
+	{
+		/*
+		 * In contrast with normal shutdown, we keep wal senders alive
+		 * during demote. First, this allows the demote to complete faster.
+		 * Second, we might need them in futur to implement a controlled
+		 * switchover over the replication protocol.
+		 */
+		/* Don't be chatty in standalone mode */
+		ereport(IsPostmasterEnvironment ? LOG : NOTICE,
+				(errmsg("demoting")));
 
-	if (RecoveryInProgress())
-		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		/*
+		 * FIXME demote: avoiding checkpoint?
+		 * A checkpoint is probably running during a demote action. If
+		 * we don't want to wait for the checkpoint during the demote,
+		 * we might need to cancel it as it will not be able to write
+		 * to the WAL after the demote.
+		 */
+		CreateCheckPoint(CHECKPOINT_IS_DEMOTE | CHECKPOINT_IMMEDIATE);
+		ShutdownPreparedTransactions();
+	}
 	else
 	{
+		/* Don't be chatty in standalone mode */
+		ereport(IsPostmasterEnvironment ? LOG : NOTICE,
+				(errmsg("shutting down")));
+
 		/*
-		 * If archiving is enabled, rotate the last XLOG file so that all the
-		 * remaining records are archived (postmaster wakes up the archiver
-		 * process one more time at the end of shutdown). The checkpoint
-		 * record will go to the next XLOG file and won't be archived (yet).
+		 * Signal walsenders to move to stopping state.
 		 */
-		if (XLogArchivingActive() && XLogArchiveCommandSet())
-			RequestXLogSwitch(false);
+		WalSndInitStopping();
+
+		/*
+		 * Wait for WAL senders to be in stopping state.  This prevents commands
+		 * from writing new WAL.
+		 */
+		WalSndWaitStopping();
+
+		if (RecoveryInProgress())
+			CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		else
+		{
+			/*
+			 * If archiving is enabled, rotate the last XLOG file so that all the
+			 * remaining records are archived (postmaster wakes up the archiver
+			 * process one more time at the end of shutdown). The checkpoint
+			 * record will go to the next XLOG file and won't be archived (yet).
+			 */
+			if (XLogArchivingActive() && XLogArchiveCommandSet())
+				RequestXLogSwitch(false);
 
-		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+			CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		}
 	}
 	ShutdownCLOG();
 	ShutdownCommitTs();
@@ -8554,9 +8645,10 @@ ShutdownXLOG(int code, Datum arg)
 static void
 LogCheckpointStart(int flags, bool restartpoint)
 {
-	elog(LOG, "%s starting:%s%s%s%s%s%s%s%s",
+	elog(LOG, "%s starting:%s%s%s%s%s%s%s%s%s",
 		 restartpoint ? "restartpoint" : "checkpoint",
 		 (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
+		 (flags & CHECKPOINT_IS_DEMOTE) ? " demote" : "",
 		 (flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 		 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 		 (flags & CHECKPOINT_FORCE) ? " force" : "",
@@ -8692,6 +8784,7 @@ UpdateCheckPointDistanceEstimate(uint64 nbytes)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *	CHECKPOINT_IS_DEMOTE: checkpoint is for demote.
  *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
@@ -8720,6 +8813,7 @@ void
 CreateCheckPoint(int flags)
 {
 	bool		shutdown;
+	bool		demote;
 	CheckPoint	checkPoint;
 	XLogRecPtr	recptr;
 	XLogSegNo	_logSegNo;
@@ -8732,14 +8826,21 @@ CreateCheckPoint(int flags)
 	int			nvxids;
 
 	/*
-	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
-	 * issued at a different time.
+	 * An end-of-recovery or demote checkpoint is really a shutdown checkpoint,
+	 * just issued at a different time.
 	 */
-	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
+	if (flags & (CHECKPOINT_IS_SHUTDOWN |
+				 CHECKPOINT_IS_DEMOTE |
+				 CHECKPOINT_END_OF_RECOVERY))
 		shutdown = true;
 	else
 		shutdown = false;
 
+	if (flags & CHECKPOINT_IS_DEMOTE)
+		demote = true;
+	else
+		demote = false;
+
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
@@ -8780,7 +8881,7 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
+		ControlFile->state = demote? DB_DEMOTING:DB_SHUTDOWNING;
 		ControlFile->time = (pg_time_t) time(NULL);
 		UpdateControlFile();
 		LWLockRelease(ControlFileLock);
@@ -8826,7 +8927,7 @@ CreateCheckPoint(int flags)
 	 * avoid inserting duplicate checkpoints when the system is idle.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-				  CHECKPOINT_FORCE)) == 0)
+				  CHECKPOINT_IS_DEMOTE | CHECKPOINT_FORCE)) == 0)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
@@ -8994,8 +9095,8 @@ CreateCheckPoint(int flags)
 	 * allows us to reconstruct the state of running transactions during
 	 * archive recovery, if required. Skip, if this info disabled.
 	 *
-	 * If we are shutting down, or Startup process is completing crash
-	 * recovery we don't need to write running xact data.
+	 * If we are shutting down, demoting or Startup process is completing
+	 * crash recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
@@ -9014,11 +9115,11 @@ CreateCheckPoint(int flags)
 	XLogFlush(recptr);
 
 	/*
-	 * We mustn't write any new WAL after a shutdown checkpoint, or it will be
-	 * overwritten at next startup.  No-one should even try, this just allows
-	 * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
-	 * to just temporarily disable writing until the system has exited
-	 * recovery.
+	 * We mustn't write any new WAL after a shutdown or demote checkpoint, or
+	 * it will be overwritten at next startup.  No-one should even try, this
+	 * just allows sanity-checking.  In the case of an end-of-recovery
+	 * checkpoint, we want to just temporarily disable writing until the system
+	 * has exited recovery.
 	 */
 	if (shutdown)
 	{
@@ -9034,7 +9135,8 @@ CreateCheckPoint(int flags)
 	 */
 	if (shutdown && checkPoint.redo != ProcLastRecPtr)
 		ereport(PANIC,
-				(errmsg("concurrent write-ahead log activity while database system is shutting down")));
+				(errmsg("concurrent write-ahead log activity while database system is %s",
+						demote? "demoting":"shutting down")));
 
 	/*
 	 * Remember the prior checkpoint's redo ptr for
@@ -9047,7 +9149,7 @@ CreateCheckPoint(int flags)
 	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	if (shutdown)
-		ControlFile->state = DB_SHUTDOWNED;
+		ControlFile->state = demote? DB_DEMOTING:DB_SHUTDOWNED;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
 	ControlFile->time = (pg_time_t) time(NULL);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b8..58473a61fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -151,6 +151,7 @@ double		CheckPointCompletionTarget = 0.5;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t demoteRequestPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -552,6 +553,21 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
+	if (demoteRequestPending)
+	{
+		demoteRequestPending = false;
+		/* Close down the database */
+		ShutdownXLOG(0, BoolGetDatum(true));
+		/*
+		 * Exit checkpointer. We could keep it around during demotion, but
+		 * exiting here has multiple benefices:
+		 * - to create a fresh process with clean local vars
+		 *   (eg. LocalRecoveryInProgress)
+		 * - to signal postmaster the demote shutdown checkpoint is done
+		 *   and keep going with next steps of the demotion
+		 */
+		proc_exit(0);
+	}
 	if (ShutdownRequestPending)
 	{
 		/*
@@ -680,6 +696,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!demoteRequestPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -812,6 +829,17 @@ IsCheckpointOnSchedule(double progress)
  * --------------------------------
  */
 
+/* SIGUSR1: set flag to demote */
+void
+ReqCheckpointDemoteHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	demoteRequestPending = true;
+
+	errno = save_errno;
+}
+
 /* SIGINT: set flag to run a normal checkpoint right away */
 static void
 ReqCheckpointHandler(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944fb1..e75d33d335 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3854,6 +3854,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PROMOTE:
 			event_name = "Promote";
 			break;
+		case WAIT_EVENT_DEMOTE:
+			event_name = "Demote";
+			break;
 		case WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT:
 			event_name = "RecoveryConflictSnapshot";
 			break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 42223c0f61..231febaf2f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -277,12 +277,13 @@ static StartupStatusEnum StartupStatus = STARTUP_NOT_RUNNING;
 #define			ImmediateShutdown	3
 
 static int	Shutdown = NoShutdown;
+static bool DemoteSignal = false; /* true on demote request */
 
 static bool FatalError = false; /* T if recovering from backend crash */
 
 /*
- * We use a simple state machine to control startup, shutdown, and
- * crash recovery (which is rather like shutdown followed by startup).
+ * We use a simple state machine to control startup, shutdown, demote and
+ * crash recovery (both are rather like shutdown followed by startup).
  *
  * After doing all the postmaster initialization work, we enter PM_STARTUP
  * state and the startup process is launched. The startup process begins by
@@ -325,6 +326,7 @@ typedef enum
 {
 	PM_INIT,					/* postmaster starting */
 	PM_STARTUP,					/* waiting for startup subprocess */
+	PM_DEMOTING,				/* waiting for idle or RO backends for demote */
 	PM_RECOVERY,				/* in archive recovery mode */
 	PM_HOT_STANDBY,				/* in hot standby mode */
 	PM_RUN,						/* normal "database is alive" state */
@@ -429,10 +431,14 @@ static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
 static void TerminateChildren(int signal);
+static void RemoveDemoteSignalFiles(void);
+static bool CheckDemoteSignal(void);
+
 
 #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
 
 static int	CountChildren(int target);
+static int	CountXacts(void);
 static bool assign_backendlist_entry(RegisteredBgWorker *rw);
 static void maybe_start_bgworkers(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
@@ -2319,6 +2325,11 @@ retry1:
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
 					 errmsg("the database system is starting up")));
 			break;
+		case CAC_DEMOTE:
+			ereport(FATAL,
+					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+					 errmsg("the database system is demoting")));
+			break;
 		case CAC_SHUTDOWN:
 			ereport(FATAL,
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
@@ -2450,16 +2461,19 @@ canAcceptConnections(int backend_type)
 	CAC_state	result = CAC_OK;
 
 	/*
-	 * Can't start backends when in startup/shutdown/inconsistent recovery
-	 * state.  We treat autovac workers the same as user backends for this
-	 * purpose.  However, bgworkers are excluded from this test; we expect
-	 * bgworker_should_start_now() decided whether the DB state allows them.
+	 * Can't start backends when in startup/demote/shutdown/inconsistent
+	 * recovery state.  We treat autovac workers the same as user backends
+	 * for this purpose.  However, bgworkers are excluded from this test; we
+	 * expect bgworker_should_start_now() decided whether the DB state allows
+	 * them.
 	 */
 	if (pmState != PM_RUN && pmState != PM_HOT_STANDBY &&
 		backend_type != BACKEND_TYPE_BGWORKER)
 	{
 		if (Shutdown > NoShutdown)
 			return CAC_SHUTDOWN;	/* shutdown is pending */
+		else if (DemoteSignal)
+			return CAC_DEMOTE;	/* demote is pending */
 		else if (!FatalError &&
 				 (pmState == PM_STARTUP ||
 				  pmState == PM_RECOVERY))
@@ -3091,7 +3105,18 @@ reaper(SIGNAL_ARGS)
 		if (pid == CheckpointerPID)
 		{
 			CheckpointerPID = 0;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) &&
+					 DemoteSignal &&
+					 pmState == PM_SHUTDOWN)
+			{
+				/*
+				 * The checkpointer exit signals the demote shutdown checkpoint
+				 * is done. The startup recovery mode can be started from there.
+				 */
+				ereport(DEBUG1,
+						(errmsg_internal("checkpointer shutdown for demote")));
+			}
+			else if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
@@ -3799,6 +3824,25 @@ PostmasterStateMachine(void)
 		}
 	}
 
+	if (pmState == PM_DEMOTING)
+	{
+		int numXacts = CountXacts();
+
+		/*
+		 * PM_DEMOTING state ends when we have no active transactions
+		 * and all backends set LocalXLogInsertAllowed=0
+		 */
+		if (numXacts == 0)
+		{
+			ereport(LOG, (errmsg("all backends in read only")));
+
+			SendProcSignal(CheckpointerPID, PROCSIG_CHECKPOINTER_DEMOTING, InvalidBackendId);
+			pmState = PM_SHUTDOWN;
+		}
+		else
+			ereport(LOG, (errmsg("waiting for %d transactions to finish", numXacts)));
+	}
+
 	/*
 	 * If we're ready to do so, signal child processes to shut down.  (This
 	 * isn't a persistent state, but treating it as a distinct pmState allows
@@ -4002,6 +4046,20 @@ PostmasterStateMachine(void)
 		(StartupStatus == STARTUP_CRASHED || !restart_after_crash))
 		ExitPostmaster(1);
 
+
+	/* Demoting: start the Startup Process */
+	if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)
+	{
+		/* stop archiver process if not required during standby */
+		if (!XLogArchivingAlways() && PgArchPID != 0)
+			signal_child(PgArchPID, SIGQUIT);
+
+		StartupPID = StartupDataBase();
+		Assert(StartupPID != 0);
+		StartupStatus = STARTUP_RUNNING;
+		pmState = PM_STARTUP;
+	}
+
 	/*
 	 * If we need to recover from a crash, wait for all non-syslogger children
 	 * to exit, then reset shmem and StartupDataBase.
@@ -5212,8 +5270,12 @@ sigusr1_handler(SIGNAL_ARGS)
 		 * Crank up the background tasks.  It doesn't matter if this fails,
 		 * we'll just try again later.
 		 */
+		if (!DemoteSignal)
+			Assert(PgArchPID == 0);
+
 		Assert(CheckpointerPID == 0);
 		CheckpointerPID = StartCheckpointer();
+
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
 
@@ -5221,8 +5283,7 @@ sigusr1_handler(SIGNAL_ARGS)
 		 * Start the archiver if we're responsible for (re-)archiving received
 		 * files.
 		 */
-		Assert(PgArchPID == 0);
-		if (XLogArchivingAlways())
+		if (PgArchPID == 0 && XLogArchivingAlways())
 			PgArchPID = pgarch_start();
 
 		/*
@@ -5233,6 +5294,7 @@ sigusr1_handler(SIGNAL_ARGS)
 		if (!EnableHotStandby)
 		{
 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STANDBY);
+			DemoteSignal = false;
 #ifdef USE_SYSTEMD
 			sd_notify(0, "READY=1");
 #endif
@@ -5243,11 +5305,15 @@ sigusr1_handler(SIGNAL_ARGS)
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
 		pmState == PM_RECOVERY && Shutdown == NoShutdown)
 	{
+		dlist_iter	iter;
+
 		/*
 		 * Likewise, start other special children as needed.
 		 */
-		Assert(PgStatPID == 0);
-		PgStatPID = pgstat_start();
+		if (!DemoteSignal)
+			Assert(PgStatPID == 0);
+		if(PgStatPID == 0)
+			PgStatPID = pgstat_start();
 
 		ereport(LOG,
 				(errmsg("database system is ready to accept read only connections")));
@@ -5258,8 +5324,18 @@ sigusr1_handler(SIGNAL_ARGS)
 		sd_notify(0, "READY=1");
 #endif
 
+		if (DemoteSignal)
+			dlist_foreach(iter, &BackendList)
+			{
+				Backend    *bp = dlist_container(Backend, elem, iter.cur);
+
+				if (!bp->dead_end && bp->bkend_type & (BACKEND_TYPE_NORMAL|BACKEND_TYPE_WALSND))
+					SendProcSignal(bp->pid, PROCSIG_DEMOTED, InvalidBackendId);
+			}
+
 		pmState = PM_HOT_STANDBY;
 		connsAllowed = ALLOW_ALL_CONNS;
+		DemoteSignal = false;
 
 		/* Some workers may be scheduled to start now */
 		StartWorkerNeeded = true;
@@ -5351,6 +5427,97 @@ sigusr1_handler(SIGNAL_ARGS)
 		signal_child(StartupPID, SIGUSR2);
 	}
 
+	if (CheckDemoteSignal() && pmState != PM_RUN )
+	{
+		DemoteSignal = false;
+		RemoveDemoteSignalFiles();
+		ereport(LOG,
+				(errmsg("ignoring demote signal because already in standby mode")));
+	}
+	/* received demote signal */
+	else if (CheckDemoteSignal())
+	{
+		FILE	   *standby_file;
+		dlist_iter	iter;
+		bool fast_demote;
+		struct stat stat_buf;
+
+		fast_demote = (stat(DEMOTE_FAST_SIGNAL_FILE, &stat_buf) == 0);
+
+		DemoteSignal = true;
+		RemoveDemoteSignalFiles();
+
+		/* create the standby signal file */
+		standby_file = AllocateFile(STANDBY_SIGNAL_FILE, "w");
+		if (!standby_file)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m",
+							STANDBY_SIGNAL_FILE)));
+			goto out;
+		}
+
+		if (FreeFile(standby_file))
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write file \"%s\": %m",
+							STANDBY_SIGNAL_FILE)));
+			unlink(STANDBY_SIGNAL_FILE);
+			goto out;
+		}
+
+		if (fast_demote == 0)
+		{
+			/* smart demote */
+			ereport(LOG, (errmsg("received smart demote request")));
+
+		}
+		else
+		{
+			/* fast demote */
+			ereport(LOG, (errmsg("received fast demote request")));
+		}
+
+		SignalSomeChildren(SIGTERM,
+						   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+
+		/* and the autovac launcher too */
+		if (AutoVacPID != 0)
+			signal_child(AutoVacPID, SIGTERM);
+		/* and the bgwriter too */
+		if (BgWriterPID != 0)
+			signal_child(BgWriterPID, SIGTERM);
+		/* and the walwriter too */
+		if (WalWriterPID != 0)
+			signal_child(WalWriterPID, SIGTERM);
+
+		dlist_foreach(iter, &BackendList)
+		{
+			Backend    *bp = dlist_container(Backend, elem, iter.cur);
+
+			if (bp->dead_end)
+				continue;
+			/*
+			 * Assign bkend_type for any recently announced WAL Sender
+			 * processes.
+			 */
+			if (bp->bkend_type == BACKEND_TYPE_NORMAL &&
+				! IsPostmasterChildWalSender(bp->child_slot))
+				SendProcSignal(bp->pid,
+							   (fast_demote?PROCSIG_DEMOTING_FAST:PROCSIG_DEMOTING),
+							   InvalidBackendId);
+		}
+
+		pmState = PM_DEMOTING;
+
+		/* Report status */
+		AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING);
+	}
+
+out:
+
 #ifdef WIN32
 	PG_SETMASK(&UnBlockSig);
 #endif
@@ -5448,6 +5615,26 @@ CountChildren(int target)
 }
 
 
+/*
+ * Count up the number of active transactions
+ */
+static int
+CountXacts(void)
+{
+	int			i;
+	int			cnt = 0;
+
+	for (i = 0; i < ProcGlobal->allProcCount; ++i)
+	{
+		PGPROC   *proc = &ProcGlobal->allProcs[i];
+		if (TransactionIdIsValid(proc->xid))
+			cnt++;
+	}
+
+	return cnt;
+}
+
+
 /*
  * StartChildProcess -- start an auxiliary process for the postmaster
  *
@@ -5912,6 +6099,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 		case PM_SHUTDOWN:
 		case PM_WAIT_BACKENDS:
 		case PM_STOP_BACKENDS:
+		case PM_DEMOTING:
 			break;
 
 		case PM_RUN:
@@ -6660,3 +6848,28 @@ InitPostmasterDeathWatchHandle(void)
 								 GetLastError())));
 #endif							/* WIN32 */
 }
+
+/*
+ * Remove the files signaling a demote request.
+ */
+static void
+RemoveDemoteSignalFiles(void)
+{
+	unlink(DEMOTE_SIGNAL_FILE);
+	unlink(DEMOTE_FAST_SIGNAL_FILE);
+}
+
+/*
+ * Check if a demote request appeared.
+ */
+static bool
+CheckDemoteSignal(void)
+{
+	struct stat stat_buf;
+
+	if (stat(DEMOTE_SIGNAL_FILE, &stat_buf) == 0 ||
+		stat(DEMOTE_FAST_SIGNAL_FILE, &stat_buf) == 0)
+		return true;
+
+	return false;
+}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 96e4a87857..83d7e05944 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -349,6 +349,8 @@ ProcArrayShmemSize(void)
 	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
 
 	/*
+	 * FIXME demote: check safe hotStandby related init and snapshot mech.
+	 *
 	 * During Hot Standby processing we have a data structure called
 	 * KnownAssignedXids, created in shared memory. Local data structures are
 	 * also created in various backends during GetSnapshotData(),
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fa385b0ec..ac14c662d3 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -28,6 +28,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "postmaster/bgwriter.h"
 
 /*
  * The SIGUSR1 signal is multiplexed to support signaling multiple event
@@ -585,6 +586,35 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	/* signal checkpoint process to ignite a demote procedure */
+	if (CheckProcSignal(PROCSIG_CHECKPOINTER_DEMOTING))
+		ReqCheckpointDemoteHandler(PROCSIG_CHECKPOINTER_DEMOTING);
+
+	/*
+	 * ask backends to enter in read only by setting
+	 * LocalXLogInsertAllowed = 0 as soon as their active xact
+	 * finished
+	 */
+	if (CheckProcSignal(PROCSIG_DEMOTING))
+		ReqDemoteHandler(PROCSIG_DEMOTING);
+
+	/*
+	 * ask backends to enter in read only by setting
+	 * LocalXLogInsertAllowed = 0 if they are idle, or
+	 * interrupt their current xact and terminate.
+	 */
+	if (CheckProcSignal(PROCSIG_DEMOTING_FAST))
+		ReqDemoteHandler(PROCSIG_DEMOTING_FAST);
+
+	/*
+	 * demote complete. Ask beckends to rely on
+	 * recovery status for LocalXLogInsertAllowed by
+	 * setting it to -1.
+	 * WAL sender set am_cascading.
+	 */
+	if (CheckProcSignal(PROCSIG_DEMOTED))
+		ReqDemotedHandler(PROCSIG_DEMOTED);
+
 	SetLatch(MyLatch);
 
 	latch_sigusr1_handler();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f455..5843db0991 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4370,6 +4370,18 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
 	lock_twophase_postcommit(xid, info, recdata, len);
 }
 
+/*
+ * 2PC shutdown from lock table.
+ *
+ * This is actually just the same as the COMMIT case.
+ */
+void
+lock_twophase_shutdown(TransactionId xid, uint16 info,
+						void *recdata, uint32 len)
+{
+	lock_twophase_postcommit(xid, info, recdata, len);
+}
+
 /*
  *		VirtualXactLockTableInsert
  *
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index c9424f167c..b44e6f5876 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinval.h"
@@ -3211,6 +3212,42 @@ ProcessInterrupts(void)
 		HandleParallelMessages();
 }
 
+/* SIGUSR1: set flag to demote */
+void
+ReqDemoteHandler(ProcSignalReason reason)
+{
+	if (MyBackendType != B_BACKEND)
+		return;
+
+	if (TransactionIdIsValid(MyProc->xid))
+	{
+		if (reason == PROCSIG_DEMOTING_FAST)
+		{
+			InterruptPending = true;
+			ProcDiePending = true;
+			SetLatch(MyLatch);
+		}
+		else
+			DemotePending = true;
+	}
+	else
+		LocalSetXLogInsertNotAllowed();
+}
+
+/* SIGUSR1: reset LocalRecoveryInProgress */
+void
+ReqDemotedHandler(ProcSignalReason reason)
+{
+	ereport(LOG,
+				(errmsg("received demote complete signal")));
+
+	SetLocalRecoveryInProgress();
+	LocalSetXLogInsertCheckRecovery();
+
+	if (MyBackendType == B_WAL_SENDER)
+		am_cascading_walsender = true;
+}
+
 
 /*
  * IA64-specific code to fetch the AR.BSP register for stack depth checks.
@@ -4224,6 +4261,12 @@ PostgresMain(int argc, char *argv[],
 				/* Send out notify signals and transmit self-notifies */
 				ProcessCompletedNotifies();
 
+				if (DemotePending) {
+					LocalSetXLogInsertNotAllowed();
+					DemotePending = false;
+					SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+				}
+
 				/*
 				 * Also process incoming notifies, if any.  This is mostly to
 				 * ensure stable behavior in tests: if any notifies were
@@ -4285,6 +4328,7 @@ PostgresMain(int argc, char *argv[],
 		{
 			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
+			SetLocalRecoveryInProgress();
 		}
 
 		/*
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..021f6af434 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t DemotePending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f70..9ef133f79a 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -57,6 +57,8 @@ dbState(DBState state)
 			return _("shut down");
 		case DB_SHUTDOWNED_IN_RECOVERY:
 			return _("shut down in recovery");
+		case DB_DEMOTING:
+			return _("demoting");
 		case DB_SHUTDOWNING:
 			return _("shutting down");
 		case DB_IN_CRASH_RECOVERY:
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 1cdc3ebaa3..a7805bd219 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -62,6 +62,7 @@ typedef enum
 	RESTART_COMMAND,
 	RELOAD_COMMAND,
 	STATUS_COMMAND,
+	DEMOTE_COMMAND,
 	PROMOTE_COMMAND,
 	LOGROTATE_COMMAND,
 	KILL_COMMAND,
@@ -103,6 +104,7 @@ static char version_file[MAXPGPATH];
 static char pid_file[MAXPGPATH];
 static char backup_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
+static char demote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
 static volatile pgpid_t postmasterPID = -1;
@@ -129,6 +131,7 @@ static void do_stop(void);
 static void do_restart(void);
 static void do_reload(void);
 static void do_status(void);
+static void do_demote(void);
 static void do_promote(void);
 static void do_logrotate(void);
 static void do_kill(pgpid_t pid);
@@ -1029,6 +1032,115 @@ do_stop(void)
 }
 
 
+static void
+do_demote(void)
+{
+	int			cnt;
+	FILE	   *dmtfile;
+	pgpid_t		pid;
+	struct stat statbuf;
+
+	pid = get_pgpid(false);
+
+	if (pid == 0)				/* no pid file */
+	{
+		write_stderr(_("%s: PID file \"%s\" does not exist\n"), progname, pid_file);
+		write_stderr(_("Is server running?\n"));
+		exit(1);
+	}
+	else if (pid < 0)			/* standalone backend, not postmaster */
+	{
+		pid = -pid;
+		write_stderr(_("%s: cannot demote server; "
+					   "single-user server is running (PID: %ld)\n"),
+					 progname, pid);
+		exit(1);
+	}
+
+	if (shutdown_mode == IMMEDIATE_MODE)
+	{
+		write_stderr(_("%s: cannot demote server using immediate mode"),
+					 progname);
+		exit(1);
+	}
+	else if (shutdown_mode == FAST_MODE)
+		snprintf(demote_file, MAXPGPATH, "%s/demote_fast", pg_data);
+	else
+		snprintf(demote_file, MAXPGPATH, "%s/demote", pg_data);
+
+	if ((dmtfile = fopen(demote_file, "w")) == NULL)
+	{
+		write_stderr(_("%s: could not create demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+
+	if (fclose(dmtfile))
+	{
+		write_stderr(_("%s: could not write demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+
+	sig = SIGUSR1;
+	if (kill((pid_t) pid, sig) != 0)
+	{
+		write_stderr(_("%s: could not send demote signal (PID: %ld): %s\n"), progname, pid,
+					 strerror(errno));
+		exit(1);
+	}
+
+	if (!do_wait)
+	{
+		print_msg(_("server demoting\n"));
+		return;
+	}
+	else
+	{
+		/*
+		 * FIXME demote
+		 * If backup_label exists, an online backup is running. Warn the user
+		 * that smart demote will wait for it to finish. However, if the
+		 * server is in archive recovery, we're recovering from an online
+		 * backup instead of performing one.
+		 */
+		if (shutdown_mode == SMART_MODE &&
+			stat(backup_file, &statbuf) == 0 &&
+			get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_("WARNING: online backup mode is active\n"
+						"Demote will not complete until pg_stop_backup() is called.\n\n"));
+		}
+
+		print_msg(_("waiting for server to demote..."));
+
+		for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
+		{
+			if (get_control_dbstate() == DB_IN_ARCHIVE_RECOVERY)
+				break;
+
+			if (cnt % WAITS_PER_SEC == 0)
+				print_msg(".");
+			pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
+		}
+
+		if (get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_(" failed\n"));
+
+			write_stderr(_("%s: server does not demote\n"), progname);
+			if (shutdown_mode == SMART_MODE)
+				write_stderr(_("HINT: The \"-m fast\" option immediately disconnects sessions rather than\n"
+							   "waiting for session-initiated disconnection.\n"));
+			exit(1);
+		}
+		print_msg(_(" done\n"));
+
+		print_msg(_("server demoted\n"));
+	}
+}
+
+
 /*
  *	restart/reload routines
  */
@@ -2447,6 +2559,8 @@ main(int argc, char **argv)
 				ctl_command = RELOAD_COMMAND;
 			else if (strcmp(argv[optind], "status") == 0)
 				ctl_command = STATUS_COMMAND;
+			else if (strcmp(argv[optind], "demote") == 0)
+				ctl_command = DEMOTE_COMMAND;
 			else if (strcmp(argv[optind], "promote") == 0)
 				ctl_command = PROMOTE_COMMAND;
 			else if (strcmp(argv[optind], "logrotate") == 0)
@@ -2554,6 +2668,9 @@ main(int argc, char **argv)
 		case RELOAD_COMMAND:
 			do_reload();
 			break;
+		case DEMOTE_COMMAND:
+			do_demote();
+			break;
 		case PROMOTE_COMMAND:
 			do_promote();
 			break;
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3445..4b56f92181 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -53,6 +53,7 @@ extern void RecoverPreparedTransactions(void);
 extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
+void ShutdownPreparedTransactions(void);
 
 extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8c9cadc6da..b1b1ea67f9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -219,18 +219,20 @@ extern bool XLOG_DEBUG;
 
 /* These directly affect the behavior of CreateCheckPoint and subsidiaries */
 #define CHECKPOINT_IS_SHUTDOWN	0x0001	/* Checkpoint is for shutdown */
-#define CHECKPOINT_END_OF_RECOVERY	0x0002	/* Like shutdown checkpoint, but
+#define CHECKPOINT_IS_DEMOTE	0x0002	/* Like shutdown checkpoint, but
+											 * issued at end of WAL production */
+#define CHECKPOINT_END_OF_RECOVERY	0x0004	/* Like shutdown checkpoint, but
 											 * issued at end of WAL recovery */
-#define CHECKPOINT_IMMEDIATE	0x0004	/* Do it without delays */
-#define CHECKPOINT_FORCE		0x0008	/* Force even if no activity */
-#define CHECKPOINT_FLUSH_ALL	0x0010	/* Flush all pages, including those
+#define CHECKPOINT_IMMEDIATE	0x0008	/* Do it without delays */
+#define CHECKPOINT_FORCE		0x0010	/* Force even if no activity */
+#define CHECKPOINT_FLUSH_ALL	0x0020	/* Flush all pages, including those
 										 * belonging to unlogged tables */
 /* These are important to RequestCheckpoint */
-#define CHECKPOINT_WAIT			0x0020	/* Wait for completion */
-#define CHECKPOINT_REQUESTED	0x0040	/* Checkpoint request has been made */
+#define CHECKPOINT_WAIT			0x0040	/* Wait for completion */
+#define CHECKPOINT_REQUESTED	0x0080	/* Checkpoint request has been made */
 /* These indicate the cause of a checkpoint request */
-#define CHECKPOINT_CAUSE_XLOG	0x0080	/* XLOG consumption */
-#define CHECKPOINT_CAUSE_TIME	0x0100	/* Elapsed time */
+#define CHECKPOINT_CAUSE_XLOG	0x0100	/* XLOG consumption */
+#define CHECKPOINT_CAUSE_TIME	0x0200	/* Elapsed time */
 
 /*
  * Flag bits for the record being inserted, set using XLogSetRecordFlags().
@@ -301,6 +303,7 @@ extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
+extern bool SetLocalRecoveryInProgress(void);
 extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
@@ -397,4 +400,8 @@ extern SessionBackupState get_backup_status(void);
 /* files to signal promotion to primary */
 #define PROMOTE_SIGNAL_FILE		"promote"
 
+/* files to signal demotion to standby */
+#define DEMOTE_SIGNAL_FILE		"demote"
+#define DEMOTE_FAST_SIGNAL_FILE	"demote_fast"
+
 #endif							/* XLOG_H */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06bed90c5e..3b30c1d767 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -87,6 +87,7 @@ typedef enum DBState
 	DB_STARTUP = 0,
 	DB_SHUTDOWNED,
 	DB_SHUTDOWNED_IN_RECOVERY,
+	DB_DEMOTING,
 	DB_SHUTDOWNING,
 	DB_IN_CRASH_RECOVERY,
 	DB_IN_ARCHIVE_RECOVERY,
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 0a23281ad5..24ba3da013 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -70,7 +70,7 @@ typedef struct
 
 typedef enum CAC_state
 {
-	CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
+	CAC_OK, CAC_STARTUP, CAC_DEMOTE, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
 	CAC_SUPERUSER
 } CAC_state;
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..d60804208f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,7 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t DemotePending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..f1c0a37e76 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -880,6 +880,7 @@ typedef enum
 	WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
 	WAIT_EVENT_PROC_SIGNAL_BARRIER,
 	WAIT_EVENT_PROMOTE,
+	WAIT_EVENT_DEMOTE,
 	WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
 	WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE,
 	WAIT_EVENT_RECOVERY_PAUSE,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e..4d4f0ea1dd 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -41,5 +41,6 @@ extern Size CheckpointerShmemSize(void);
 extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
+extern void ReqCheckpointDemoteHandler(SIGNAL_ARGS);
 
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 1c3e9c1999..fa8d64e1a7 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -584,6 +584,8 @@ extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
 									 void *recdata, uint32 len);
 extern void lock_twophase_postabort(TransactionId xid, uint16 info,
 									void *recdata, uint32 len);
+extern void lock_twophase_shutdown(TransactionId xid, uint16 info,
+									void *recdata, uint32 len);
 extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
 										  void *recdata, uint32 len);
 
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f3..7264e9a705 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -34,6 +34,10 @@ typedef enum
 	PROCSIG_PARALLEL_MESSAGE,	/* message from cooperating parallel backend */
 	PROCSIG_WALSND_INIT_STOPPING,	/* ask walsenders to prepare for shutdown  */
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
+	PROCSIG_DEMOTING,			/* ask backends to demote in smart mode */
+	PROCSIG_DEMOTING_FAST,		/* ask backends to demote in fast mode */
+	PROCSIG_DEMOTED,			/* ask backends to switch to recovery mode */
+	PROCSIG_CHECKPOINTER_DEMOTING,	/* ask checkpointer to demote */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_DATABASE,
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index bd30607b07..e5f42f9fec 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -68,6 +68,8 @@ extern void StatementCancelHandler(SIGNAL_ARGS);
 extern void FloatExceptionHandler(SIGNAL_ARGS) pg_attribute_noreturn();
 extern void RecoveryConflictInterrupt(ProcSignalReason reason); /* called from SIGUSR1
 																 * handler */
+extern void ReqDemoteHandler(ProcSignalReason reason); /* called from SIGUSR1 handler */
+extern void ReqDemotedHandler(ProcSignalReason reason); /* called from SIGUSR1 handler */
 extern void ProcessClientReadInterrupt(bool blocked);
 extern void ProcessClientWriteInterrupt(bool blocked);
 
diff --git a/src/include/utils/pidfile.h b/src/include/utils/pidfile.h
index 63fefe5c4c..f761d2c4ef 100644
--- a/src/include/utils/pidfile.h
+++ b/src/include/utils/pidfile.h
@@ -50,6 +50,7 @@
  */
 #define PM_STATUS_STARTING		"starting"	/* still starting up */
 #define PM_STATUS_STOPPING		"stopping"	/* in shutdown sequence */
+#define PM_STATUS_DEMOTING		"demoting"	/* demote sequence */
 #define PM_STATUS_READY			"ready   "	/* ready for connections */
 #define PM_STATUS_STANDBY		"standby "	/* up, won't accept connections */
 
-- 
2.20.1

>From 4b1f49b6c9fc099b85fb8a561a7035dde715b211 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <j...@dalibo.com>
Date: Fri, 31 Jul 2020 18:07:38 +0200
Subject: [PATCH 3/4] demote: add pg_demote() function

---
 src/backend/access/transam/xlogfuncs.c | 94 ++++++++++++++++++++++++++
 src/backend/catalog/system_views.sql   |  6 ++
 src/include/catalog/pg_proc.dat        |  4 ++
 3 files changed, 104 insertions(+)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 290658b22c..733f465d38 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -784,3 +784,97 @@ pg_promote(PG_FUNCTION_ARGS)
 			(errmsg("server did not promote within %d seconds", wait_seconds)));
 	PG_RETURN_BOOL(false);
 }
+
+/*
+ * Demotes a production server.
+ *
+ * A result of "true" means that demotion has been completed if "wait" is
+ * "true", or initiated if "wait" is false.
+ */
+Datum
+pg_demote(PG_FUNCTION_ARGS)
+{
+	bool		fast = PG_GETARG_BOOL(0);
+	bool		wait = PG_GETARG_BOOL(1);
+	int			wait_seconds = PG_GETARG_INT32(2);
+	char		demote_filename[] = "demote_fast";
+	FILE	   *demote_file;
+	int			i;
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery in progress"),
+				 errhint("you can not demote while already in recovery.")));
+
+	if (!EnableHotStandby)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("function pg_demote() requires hot_standby parameter to be enabled"),
+				 errhint("The function can not return its status from a non hot_standby-enabled standby")));
+
+	if (wait_seconds <= 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"wait_seconds\" must not be negative or zero")));
+
+	if (!fast)
+		demote_filename[6] = '\0';
+
+	/* create the demote signal file */
+	demote_file = AllocateFile(demote_filename, "w");
+	if (!demote_file)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m",
+						demote_filename)));
+
+	if (FreeFile(demote_file))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write file \"%s\": %m",
+						demote_filename)));
+
+	/* signal the postmaster */
+	if (kill(PostmasterPid, SIGUSR1) != 0)
+	{
+		ereport(WARNING,
+				(errmsg("failed to send signal to postmaster: %m")));
+		(void) unlink(demote_filename);
+		PG_RETURN_BOOL(false);
+	}
+
+	/* return immediately if waiting was not requested */
+	if (!wait)
+		PG_RETURN_BOOL(true);
+
+	/* wait for the amount of time wanted until demotion */
+#define WAITS_PER_SECOND 10
+	for (i = 0; i < WAITS_PER_SECOND * wait_seconds; i++)
+	{
+		int			rc;
+
+		ResetLatch(MyLatch);
+
+		if (RecoveryInProgress())
+			PG_RETURN_BOOL(true);
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   1000L / WAITS_PER_SECOND,
+					   WAIT_EVENT_DEMOTE);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			PG_RETURN_BOOL(false);
+	}
+
+	ereport(WARNING,
+			(errmsg("server did not demote within %d seconds", wait_seconds)));
+	PG_RETURN_BOOL(false);
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8625cbeab6..573d7b46eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1219,6 +1219,11 @@ CREATE OR REPLACE FUNCTION
   RETURNS boolean STRICT VOLATILE LANGUAGE INTERNAL AS 'pg_promote'
   PARALLEL SAFE;
 
+CREATE OR REPLACE FUNCTION
+  pg_demote(fast boolean DEFAULT true, wait boolean DEFAULT true, wait_seconds integer DEFAULT 60)
+  RETURNS boolean STRICT VOLATILE LANGUAGE INTERNAL AS 'pg_demote'
+  PARALLEL SAFE;
+
 -- legacy definition for compatibility with 9.3
 CREATE OR REPLACE FUNCTION
   json_populate_record(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
@@ -1435,6 +1440,7 @@ REVOKE EXECUTE ON FUNCTION pg_reload_conf() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_current_logfile() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_current_logfile(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_demote(boolean, boolean, integer) FROM public;
 
 REVOKE EXECUTE ON FUNCTION pg_stat_reset() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_shared(text) FROM public;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 082a11f270..9e4d000d00 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6084,6 +6084,10 @@
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',
   proargtypes => 'bool int4', proargnames => '{wait,wait_seconds}',
   prosrc => 'pg_promote' },
+{ oid => '8967', descr => 'demote production server',
+  proname => 'pg_demote', provolatile => 'v', prorettype => 'bool',
+  proargtypes => 'bool bool int4', proargnames => '{fast,wait,wait_seconds}',
+  prosrc => 'pg_demote' },
 { oid => '2848', descr => 'switch to new wal file',
   proname => 'pg_switch_wal', provolatile => 'v', prorettype => 'pg_lsn',
   proargtypes => '', prosrc => 'pg_switch_wal' },
-- 
2.20.1

>From f47d4aca2ff2c938c33b8cabf0538e6f66c06d2b Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <j...@dalibo.com>
Date: Fri, 10 Jul 2020 02:00:38 +0200
Subject: [PATCH 4/4] demote: add various tests related to demote and promote
 actions

* demote/promote with a standby replicating from the node
* make sure 2PC survive a demote/promote cycle
* commit 2PC and check the result
* swap roles between primary and standby
* make sure wal sender enters cascade mode
* commit a 2PC on the new primary
* confirm behavior of backends during smart/fast demote
---
 src/test/perl/PostgresNode.pm             |  25 ++
 src/test/recovery/t/021_promote-demote.pl | 287 ++++++++++++++++++++++
 2 files changed, 312 insertions(+)
 create mode 100644 src/test/recovery/t/021_promote-demote.pl

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bffa2b..0f3d40088c 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -906,6 +906,31 @@ sub promote
 
 =pod
 
+=item $node->demote()
+
+Wrapper for pg_ctl demote
+
+=cut
+
+sub demote
+{
+	my ($self, $mode) = @_;
+	my $port    = $self->port;
+	my $pgdata  = $self->data_dir;
+	my $logfile = $self->logfile;
+	my $name    = $self->name;
+
+	$mode = 'fast' unless defined $mode;
+
+	print "### Demoting node \"$name\" using mode $mode\n";
+
+	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
+		'-m', $mode, 'demote');
+	return;
+}
+
+=pod
+
 =item $node->logrotate()
 
 Wrapper for pg_ctl logrotate
diff --git a/src/test/recovery/t/021_promote-demote.pl b/src/test/recovery/t/021_promote-demote.pl
new file mode 100644
index 0000000000..245acfb211
--- /dev/null
+++ b/src/test/recovery/t/021_promote-demote.pl
@@ -0,0 +1,287 @@
+# Test demote/promote actions in various scenarios using three
+# nodes alpha, beta and gamma. We check proper actions results,
+# correct data replication and cascade across multiple
+# demote/promote, manual switchover, smart and fast demote.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize node alpha
+my $node_alpha = get_new_node('alpha');
+$node_alpha->init(allows_streaming => 1);
+$node_alpha->append_conf(
+	'postgresql.conf', qq(
+	max_prepared_transactions = 10
+));
+
+# Take backup
+my $backup_name = 'alpha_backup';
+$node_alpha->start;
+$node_alpha->backup($backup_name);
+
+# Create node beta from backup
+my $node_beta = get_new_node('beta');
+$node_beta->init_from_backup($node_alpha, $backup_name);
+$node_beta->enable_streaming($node_alpha);
+$node_beta->start;
+
+# Create node gamma from backup
+my $node_gamma = get_new_node('gamma');
+$node_gamma->init_from_backup($node_alpha, $backup_name);
+$node_gamma->enable_streaming($node_alpha);
+$node_gamma->start;
+
+# Create some 2PC on alpha for future tests
+$node_alpha->safe_psql('postgres', q{
+CREATE TABLE ins AS SELECT 1 AS i;
+BEGIN;
+CREATE TABLE new AS SELECT generate_series(1,5) AS i;
+PREPARE TRANSACTION 'pxact1';
+BEGIN;
+INSERT INTO ins VALUES (2);
+PREPARE TRANSACTION 'pxact2';
+});
+
+# create an in idle in xact session
+my ($sess1_in, $sess1_out, $sess1_err) = ('', '', '');
+my $sess1 = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node_alpha->connstr('postgres')
+	],
+	'<', \$sess1_in,
+	'>', \$sess1_out,
+	'2>', \$sess1_err);
+
+$sess1_in = q{
+BEGIN;
+CREATE TABLE public.test_aborted (i int);
+SELECT pg_backend_pid();
+};
+$sess1->pump until $sess1_out =~ qr/[[:digit:]]+[\r\n]$/m;
+my $sess1_pid = $sess1_out;
+chomp $sess1_pid;
+
+# create an in idle session
+my ($sess2_in, $sess2_out, $sess2_err) = ('', '', '');
+my $sess2 = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node_alpha->connstr('postgres')
+	],
+	'<', \$sess2_in,
+	'>', \$sess2_out,
+	'2>', \$sess2_err);
+$sess2_in = q{
+SELECT pg_backend_pid();
+};
+$sess2->pump until $sess2_out =~ qr/\d+\s*$/m;
+my $sess2_pid = $sess2_out;
+chomp $sess2_pid;
+
+$sess2_in = q{
+SELECT pg_is_in_recovery();
+};
+$sess2->pump until $sess2_out =~ qr/(t|f)\s*$/m;
+
+# idle session is not in recovery
+is( $1, 'f', 'idle session is not in recovery' );
+
+# Fast demote alpha.
+# Secondaries beta and gamma should keep streaming from it as cascaded standbys.
+# Idle in xact session should be terminate, idle session should stay alive.
+$node_alpha->demote('fast');
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	't', 'node alpha demoted to standby' );
+
+is( $node_alpha->safe_psql(
+		'postgres',
+		'SELECT array_agg(application_name ORDER BY application_name ASC) FROM pg_stat_replication'),
+	'{beta,gamma}', 'standbys keep replicating with alpha after demote' );
+
+# the idle in xact session should not survive the demote
+is( $node_alpha->safe_psql(
+		'postgres',
+		qq{SELECT count(*)
+		   FROM pg_catalog.pg_stat_activity
+		   WHERE pid = $sess1_pid}),
+	'0', 'previous idle in transaction session should be terminated' );
+
+# table "test_aborted" has been rollbacked
+is( $node_alpha->safe_psql(
+		'postgres',
+		q{SELECT count(*) FROM pg_catalog.pg_class
+		  WHERE relname='test_aborted'
+		    AND relnamespace = (SELECT oid FROM pg_namespace
+		                        WHERE nspname='public')}),
+	'0', 'the tansaction bas been aborted during fast demote' );
+
+# the idle session should survive the demote
+is( $node_alpha->safe_psql(
+		'postgres',
+		qq{SELECT count(*)
+		   FROM pg_catalog.pg_stat_activity
+		   WHERE pid = $sess2_pid}),
+	'1', "the idle session should survive the demote: $sess2_pid" );
+
+# the idle session should report in recovery
+$sess2_out = '';
+$sess2_in = q{
+SELECT pg_is_in_recovery();
+};
+$sess2->pump until $sess2_out =~ qr/(t|f)\s*$/m;
+
+# idle session is not in recovery
+is( $1, 't', 'the idle session reports in recovery' );
+
+# close both sessions
+$sess1_out = $sess2_out = $sess1_in = $sess2_in = '';
+$sess1->finish;
+$sess2->finish;
+
+# Promote alpha back in production.
+$node_alpha->promote;
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	'f', "node alpha promoted" );
+
+# Check all 2PC xact have been restored
+is( $node_alpha->safe_psql(
+		'postgres',
+		"SELECT string_agg(gid, ',' order by gid asc) FROM pg_prepared_xacts"),
+	'pxact1,pxact2', "prepared transactions 'pxact1' and 'pxact2' exists" );
+
+# Commit one 2PC and check it on alpha and beta
+$node_alpha->safe_psql( 'postgres', "commit prepared 'pxact1'");
+
+is( $node_alpha->safe_psql(
+		'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"),
+	'{1,2,3,4,5}', "prepared transaction 'pxact1' commited" );
+
+$node_alpha->wait_for_catchup($node_beta);
+$node_alpha->wait_for_catchup($node_gamma);
+
+is( $node_beta->safe_psql(
+		'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"),
+	'{1,2,3,4,5}', "prepared transaction 'pxact1' replicated to beta" );
+
+is( $node_gamma->safe_psql(
+		'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"),
+	'{1,2,3,4,5}', "prepared transaction 'pxact1' replicated to gamma" );
+
+# create another idle in xact session
+$sess1_in = q{
+BEGIN;
+CREATE TABLE public.test_succeed (i int);
+SELECT pg_backend_pid();
+};
+$sess1->pump until $sess1_out =~ qr/\d+\s*$/m;
+$sess1_pid = $sess1_out;
+chomp $sess1_pid;
+
+# swap roles between alpha and beta
+
+# Demote alpha in smart mode.
+# Don't wait for demote to complete here so we can use sess1
+# to keep doing some more write activity before commit and demote.
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_demote(false, false)'),
+	't', "demote signal sent to node alpha" );
+
+# wait for the demote to begin and wait for active xact.
+my $fh;
+while (1) {
+	my $status;
+	open my $fh, '<', $node_alpha->data_dir . '/postmaster.pid';
+	$status = $_ while <$fh>;
+	close $fh;
+	chomp($status);
+	last if $status eq 'demoting';
+	sleep 1;
+}
+
+# make sure the demote waits for running xacts
+sleep 2;
+
+# test no new session possible during demote
+$sess2_in = q{
+SELECT 1;
+};
+$sess2->start;
+$sess2->finish;
+ok( $sess2_err =~ /FATAL:  the database system is demoting\s$/, 'session rejected during demote process');
+
+# add some write activity on demote-blocking session sess1
+$sess1_out = '';
+$sess1_in = q{
+INSERT INTO public.test_succeed VALUES (1) RETURNING i;
+COMMIT;
+};
+$sess1->pump until $sess1_out =~ qr/\d+\s*$/m;
+$sess1->finish;
+
+chomp($sess1_out);
+is($sess1_out, '1', 'session in active xact able to write the smart demote signal');
+
+$node_alpha->poll_query_until('postgres', 'SELECT pg_is_in_recovery()', 't');
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	't', "node alpha demoted" );
+
+# fetch the last REDO location from alpha and chek beta received everyting
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node_alpha->data_dir ]);
+$stdout =~ m{REDO location:\s+([0-9A-F]+/[0-9A-F]+)$}mg;
+my $redo_loc = $1;
+
+is( $node_beta->safe_psql(
+		'postgres',
+		"SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), '$redo_loc') > 0 "),
+	't', "node beta received the demote checkpoint from alpha" );
+
+# promote beta and check it
+$node_beta->promote;
+is( $node_beta->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	'f', "node beta promoted" );
+
+# Setup alpha to replicate from beta
+$node_alpha->enable_streaming($node_beta);
+$node_alpha->reload;
+
+# check alpha is replicating from it
+$node_beta->wait_for_catchup($node_alpha);
+
+is( $node_beta->safe_psql(
+		'postgres', 'SELECT application_name FROM pg_stat_replication'),
+	$node_alpha->name, 'alpha is replicating from beta' );
+
+# check gamma is still replicating from from alpha
+$node_alpha->wait_for_catchup($node_gamma, 'write', $node_alpha->lsn('receive'));
+
+is( $node_alpha->safe_psql(
+		'postgres', 'SELECT application_name FROM pg_stat_replication'),
+	$node_gamma->name, 'gamma is replicating from beta' );
+
+# make sure the second 2PC is still available on beta
+is( $node_beta->safe_psql(
+		'postgres', 'SELECT gid FROM pg_prepared_xacts'),
+	'pxact2', "prepared transactions pxact2' exists" );
+
+# commit the second 2PC and check its result on alpha and beta nodes
+$node_beta->safe_psql( 'postgres', "commit prepared 'pxact2'");
+
+is( $node_beta->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' commited" );
+
+$node_beta->wait_for_catchup($node_alpha);
+is( $node_alpha->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' streamed to alpha" );
+
+# check the 2PC has been cascaded to gamma
+$node_alpha->wait_for_catchup($node_gamma, 'write', $node_alpha->lsn('receive'));
+is( $node_gamma->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' streamed to gamma" );
-- 
2.20.1

  • Re: [patch] demote Jehan-Guillaume de Rorthais

Reply via email to