Hi,

Another summary + patch + tests.

This patch supports 2PC. The goal is to keep prepared transactions safe during
demote/promote actions so they can be committed or rolled back later on a
primary. See tests.

The checkpointer is now shut down after the demote shutdown checkpoint. This
removes some needless code complexity, e.g. the checkpointer no longer has to
signal the postmaster to keep going with the demotion.
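
To illustrate the idea outside of the patch, here is a minimal, standalone
sketch of the "signal sets a flag, main loop acts on it" pattern the
checkpointer relies on. It is not the patch code: install_demote_handler(),
handle_pending_demote() and the counter are hypothetical names, and the real
work (the demote checkpoint, then process exit) is replaced by a stub.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>

/* Set by the signal handler, consumed by the main loop. */
static volatile sig_atomic_t demote_request_pending = 0;

/* Hypothetical stand-in for "the demote shutdown checkpoint was done". */
static int demote_checkpoints_done = 0;

/* SIGUSR1 handler: only record the request, do no real work here. */
static void
demote_request_handler(int signo)
{
	int			save_errno = errno;

	(void) signo;
	demote_request_pending = 1;
	errno = save_errno;
}

/* Install the handler; returns 0 on success, -1 on failure. */
int
install_demote_handler(void)
{
	return (signal(SIGUSR1, demote_request_handler) == SIG_ERR) ? -1 : 0;
}

/*
 * Called from the main loop. Returns 1 if a pending demote request was
 * processed, 0 otherwise. The real checkpointer would run the demote
 * checkpoint and then exit; this sketch only bumps a counter.
 */
int
handle_pending_demote(void)
{
	if (!demote_request_pending)
		return 0;

	demote_request_pending = 0;
	demote_checkpoints_done++;
	return 1;
}
```

A caller would install the handler once, then call handle_pending_demote()
on each iteration of its loop. Keeping the handler down to a flag assignment
is what makes it safe to run at any point.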

Cascaded replication is now supported. WAL senders stay active during
demotion but set their local "am_cascading_walsender = true". It was a
rough debugging session on my side (thank you rr and tests!), but I think it
was worth it. I believe they should stay connected during the demote actions
for future features, e.g. triggering a switchover over the replication
protocol using an admin function.
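
As a rough model of that behaviour (a simplified sketch, not the walsender
code): the connection is left alone and only the sender's local mode flag
follows the server state. ServerState, WalSenderSketch and
walsender_on_state_change() are hypothetical names used for illustration
only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared recovery state. */
typedef enum ServerState
{
	STATE_PRODUCTION,
	STATE_RECOVERY
} ServerState;

/* Hypothetical per-process view of a WAL sender. */
typedef struct WalSenderSketch
{
	bool		connected;		/* replication connection is open */
	bool		am_cascading;	/* mirrors am_cascading_walsender */
} WalSenderSketch;

/*
 * React to a server state change. The connection is deliberately not
 * touched: only the local cascading flag is refreshed.
 */
void
walsender_on_state_change(WalSenderSketch *ws, ServerState state)
{
	ws->am_cascading = (state == STATE_RECOVERY);
}
```

After a demote, a sender that was streaming from a primary keeps its
connection and simply reports itself as cascading.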

The first tests have been added in "recovery/t/021_promote-demote.pl". I'll add
some more tests in future versions.

I believe the patch is ready for some preliminary tests and advice or
directions.

On my todo:

* study how to only disconnect or cancel active RW backends
  * ...then add pg_demote() admin function
* cancel running checkpoint for fast demote?
* user documentation
* Robert's concern about snapshot during hot standby
* some more coding style cleanup/refactoring
* anything else reported to me :)

Thanks,

On Fri, 3 Jul 2020 00:12:10 +0200
Jehan-Guillaume de Rorthais <j...@dalibo.com> wrote:

> Hi,
> 
> Here is a small activity summary since last report.
> 
> On Thu, 25 Jun 2020 19:27:54 +0200
> Jehan-Guillaume de Rorthais <j...@dalibo.com> wrote:
> [...]
> > I hadn't time to investigate Robert's concern about shared memory for
> > snapshot during recovery.
> 
> I hadn't time to dig very far, but I suppose this might be related to the
> comment in ProcArrayShmemSize(). If I'm right, then it seems the space is
> already allocated as long as hot_standby is enabled. I realize it doesn't
> mean we are on the safe side of the fence though. I still need a better
> understanding of this.
> 
> > The patch doesn't deal with prepared xact yet. Testing
> > "start->demote->promote" raise an assert if some prepared xact exist. I
> > suppose I will rollback them during demote in next patch version.
> 
> Rolling back all prepared transactions on demote seems easy. However, I
> realized there's no point in cancelling them. After the demote action, they
> might still be committed later on a promoted instance.
> 
> I am currently trying to clean shared memory for existing prepared
> transactions so they are handled by the startup process during recovery.
> I've been able to clean TwoPhaseState and the ProcArray. I'm now in the
> process of cleaning the remaining prepared xact locks.
> 
> Regards,
> 
> 



-- 
Jehan-Guillaume de Rorthais
Dalibo
From 4470772702273c720cdea942ed229d59f3a70318 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <j...@dalibo.com>
Date: Fri, 10 Apr 2020 18:01:45 +0200
Subject: [PATCH 1/2] Support demoting instance from production to standby

Architecture:

* creates a shutdown checkpoint on demote
* use DB_DEMOTING state in controlfile
* try to handle subsystems init correctly during demote
* keep some sub-processes alive:
  stat collector, bgwriter and optionally archiver or wal senders
* the code currently uses USR1 to signal the wal senders to check
  if they must enable the cascading mode
* ShutdownXLOG takes a boolean arg to handle demote differently
* the checkpointer is restarted for code simplicity

Trivial manual tests:

* make check: OK
* make check-world: OK
* start in production->demote->demote: OK
* start in production->demote->stop: OK
* start in production->demote->promote: OK
* switch roles between primary and standby (switchover): OK
* commit and check 2PC after demote/promote
* commit and check 2PC after switchover

Discuss/Todo:

* cancel or kill active and idle in xact RW backends
  * keep RO backends
  * pg_demote() function?
* code reviewing, arch, analysis, checks, etc
* investigate snapshots shmem needs/init during recovery compared to
  production
* add doc
* cancel running checkpoint during demote
  * replace with a END_OF_PRODUCTION xlog record?
---
 src/backend/access/transam/twophase.c   |  95 +++++++
 src/backend/access/transam/xlog.c       | 315 ++++++++++++++++--------
 src/backend/postmaster/checkpointer.c   |  28 +++
 src/backend/postmaster/postmaster.c     | 261 +++++++++++++++-----
 src/backend/replication/walsender.c     |   1 +
 src/backend/storage/ipc/procarray.c     |   2 +
 src/backend/storage/ipc/procsignal.c    |   4 +
 src/backend/storage/lmgr/lock.c         |  12 +
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/bin/pg_ctl/pg_ctl.c                 | 111 +++++++++
 src/include/access/twophase.h           |   1 +
 src/include/access/xlog.h               |  19 +-
 src/include/catalog/pg_control.h        |   1 +
 src/include/libpq/libpq-be.h            |   7 +-
 src/include/postmaster/bgwriter.h       |   1 +
 src/include/storage/lock.h              |   2 +
 src/include/storage/procsignal.h        |   1 +
 src/include/utils/pidfile.h             |   1 +
 18 files changed, 689 insertions(+), 175 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9b2e59bf0e..fda085631f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1565,6 +1565,101 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	pfree(buf);
 }
 
+/*
+ * ShutdownPreparedTransactions: clean prepared xacts from shared memory
+ *
+ * This is called during the demote process to clean the shared memory
+ * before the startup process loads everything back in correctly
+ * for the standby mode.
+ *
+ * Note: this function assumes all prepared transactions have been
+ * written to disk. In consequence, it must be called AFTER the demote
+ * shutdown checkpoint.
+ */
+void
+ShutdownPreparedTransactions(void)
+{
+	int i;
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact;
+		PGPROC	   *proc;
+		TransactionId xid;
+		char	   *buf;
+		char	   *bufptr;
+		TwoPhaseFileHeader *hdr;
+		TransactionId latestXid;
+		TransactionId *children;
+
+		gxact = TwoPhaseState->prepXacts[i];
+		proc = &ProcGlobal->allProcs[gxact->pgprocno];
+		xid = ProcGlobal->allPgXact[gxact->pgprocno].xid;
+
+		/* Read and validate 2PC state data */
+		Assert(gxact->ondisk);
+		buf = ReadTwoPhaseFile(xid, false);
+
+		/*
+		 * Disassemble the header area
+		 */
+		hdr = (TwoPhaseFileHeader *) buf;
+		Assert(TransactionIdEquals(hdr->xid, xid));
+		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader))
+			   + MAXALIGN(hdr->gidlen);
+		children = (TransactionId *) bufptr;
+		bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId))
+				+ MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode))
+				+ MAXALIGN(hdr->nabortrels * sizeof(RelFileNode))
+				+ MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+
+		/* compute latestXid among all children */
+		latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children);
+
+		/* remove dummy proc associated with the gxact */
+		ProcArrayRemove(proc, latestXid);
+
+		/*
+		 * This lock is probably not needed during the demote process
+		 * as all backends are already gone.
+		 */
+		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+		/* cleanup locks */
+		for (;;)
+		{
+			TwoPhaseRecordOnDisk *record = (TwoPhaseRecordOnDisk *) bufptr;
+
+			Assert(record->rmid <= TWOPHASE_RM_MAX_ID);
+			if (record->rmid == TWOPHASE_RM_END_ID)
+				break;
+
+			bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
+
+			if (record->rmid == TWOPHASE_RM_LOCK_ID)
+				lock_twophase_shutdown(xid, record->info,
+									 (void *) bufptr, record->len);
+
+			bufptr += MAXALIGN(record->len);
+		}
+
+		/* and put it back in the freelist */
+		gxact->next = TwoPhaseState->freeGXacts;
+		TwoPhaseState->freeGXacts = gxact;
+
+		/*
+		 * Release the lock as all callbacks are called and shared memory cleanup
+		 * is done.
+		 */
+		LWLockRelease(TwoPhaseStateLock);
+
+		pfree(buf);
+	}
+
+	TwoPhaseState->numPrepXacts -= i;
+	Assert(TwoPhaseState->numPrepXacts == 0);
+}
+
 /*
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 28daf72a50..3a52f7fde8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6301,6 +6301,11 @@ CheckRequiredParameterValues(void)
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
+/*
+ * FIXME demote: part of the code here assumes there's no other active
+ * processes before signal PMSIGNAL_RECOVERY_STARTED is sent.
+ */
+
 void
 StartupXLOG(void)
 {
@@ -6324,6 +6329,7 @@ StartupXLOG(void)
 	XLogPageReadPrivate private;
 	bool		fast_promoted = false;
 	struct stat st;
+	bool		is_demoting = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -6388,6 +6394,25 @@ StartupXLOG(void)
 							str_time(ControlFile->time))));
 			break;
 
+		case DB_DEMOTING:
+			ereport(LOG,
+					(errmsg("database system was demoted at %s",
+							str_time(ControlFile->time))));
+			is_demoting = true;
+			bgwriterLaunched = true;
+			InArchiveRecovery = true;
+			StandbyMode = true;
+
+			/*
+			 * previous state was RECOVERY_STATE_DONE. We need to
+			 * reinit it to something else so RecoveryInProgress()
+			 * doesn't return false.
+			 */
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+			break;
+
 		default:
 			ereport(FATAL,
 					(errmsg("control file contains invalid database cluster state")));
@@ -6421,7 +6446,8 @@ StartupXLOG(void)
 	 *   persisted.  To avoid that, fsync the entire data directory.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED &&
-		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY &&
+		ControlFile->state != DB_DEMOTING)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
@@ -6677,7 +6703,8 @@ StartupXLOG(void)
 					(errmsg("could not locate a valid checkpoint record")));
 		}
 		memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
-		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
+		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN) &&
+			!is_demoting;
 	}
 
 	/*
@@ -6739,9 +6766,9 @@ StartupXLOG(void)
 	LastRec = RecPtr = checkPointLoc;
 
 	ereport(DEBUG1,
-			(errmsg_internal("redo record is at %X/%X; shutdown %s",
+			(errmsg_internal("redo record is at %X/%X; %s checkpoint",
 							 (uint32) (checkPoint.redo >> 32), (uint32) checkPoint.redo,
-							 wasShutdown ? "true" : "false")));
+							 wasShutdown ? "shutdown" : is_demoting? "demote": "")));
 	ereport(DEBUG1,
 			(errmsg_internal("next transaction ID: " UINT64_FORMAT "; next OID: %u",
 							 U64FromFullTransactionId(checkPoint.nextFullXid),
@@ -6775,47 +6802,7 @@ StartupXLOG(void)
 					 checkPoint.newestCommitTsXid);
 	XLogCtl->ckptFullXid = checkPoint.nextFullXid;
 
-	/*
-	 * Initialize replication slots, before there's a chance to remove
-	 * required resources.
-	 */
-	StartupReplicationSlots();
-
-	/*
-	 * Startup logical state, needs to be setup now so we have proper data
-	 * during crash recovery.
-	 */
-	StartupReorderBuffer();
-
-	/*
-	 * Startup MultiXact. We need to do this early to be able to replay
-	 * truncations.
-	 */
-	StartupMultiXact();
-
-	/*
-	 * Ditto for commit timestamps.  Activate the facility if the setting is
-	 * enabled in the control file, as there should be no tracking of commit
-	 * timestamps done when the setting was disabled.  This facility can be
-	 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record.
-	 */
-	if (ControlFile->track_commit_timestamp)
-		StartupCommitTs();
-
-	/*
-	 * Recover knowledge about replay progress of known replication partners.
-	 */
-	StartupReplicationOrigin();
 
-	/*
-	 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
-	 * control file. On recovery, all unlogged relations are blown away, so
-	 * the unlogged LSN counter can be reset too.
-	 */
-	if (ControlFile->state == DB_SHUTDOWNED)
-		XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
-	else
-		XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
 
 	/*
 	 * We must replay WAL entries using the same TimeLineID they were created
@@ -6824,19 +6811,64 @@ StartupXLOG(void)
 	 */
 	ThisTimeLineID = checkPoint.ThisTimeLineID;
 
-	/*
-	 * Copy any missing timeline history files between 'now' and the recovery
-	 * target timeline from archive to pg_wal. While we don't need those files
-	 * ourselves - the history file of the recovery target timeline covers all
-	 * the previous timelines in the history too - a cascading standby server
-	 * might be interested in them. Or, if you archive the WAL from this
-	 * server to a different archive than the primary, it'd be good for all the
-	 * history files to get archived there after failover, so that you can use
-	 * one of the old timelines as a PITR target. Timeline history files are
-	 * small, so it's better to copy them unnecessarily than not copy them and
-	 * regret later.
-	 */
-	restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
+	if (!is_demoting)
+	{
+		/*
+		 * Initialize replication slots, before there's a chance to remove
+		 * required resources.
+		 */
+		StartupReplicationSlots();
+
+		/*
+		 * Startup logical state, needs to be setup now so we have proper data
+		 * during crash recovery.
+		 */
+		StartupReorderBuffer();
+
+		/*
+		 * Startup MultiXact. We need to do this early to be able to replay
+		 * truncations.
+		 */
+		StartupMultiXact();
+
+		/*
+		 * Ditto for commit timestamps.  Activate the facility if the setting is
+		 * enabled in the control file, as there should be no tracking of commit
+		 * timestamps done when the setting was disabled.  This facility can be
+		 * started or stopped when replaying a XLOG_PARAMETER_CHANGE record.
+		 */
+		if (ControlFile->track_commit_timestamp)
+			StartupCommitTs();
+
+		/*
+		 * Recover knowledge about replay progress of known replication partners.
+		 */
+		StartupReplicationOrigin();
+
+		/*
+		 * Initialize unlogged LSN. On a clean shutdown, it's restored from the
+		 * control file. On recovery, all unlogged relations are blown away, so
+		 * the unlogged LSN counter can be reset too.
+		 */
+		if (ControlFile->state == DB_SHUTDOWNED)
+			XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
+		else
+			XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
+
+		/*
+		 * Copy any missing timeline history files between 'now' and the recovery
+		 * target timeline from archive to pg_wal. While we don't need those files
+		 * ourselves - the history file of the recovery target timeline covers all
+		 * the previous timelines in the history too - a cascading standby server
+		 * might be interested in them. Or, if you archive the WAL from this
+		 * server to a different archive than the primary, it'd be good for all the
+		 * history files to get archived there after failover, so that you can use
+		 * one of the old timelines as a PITR target. Timeline history files are
+		 * small, so it's better to copy them unnecessarily than not copy them and
+		 * regret later.
+		 */
+		restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
+	}
 
 	/*
 	 * Before running in recovery, scan pg_twophase and fill in its status to
@@ -6891,11 +6923,25 @@ StartupXLOG(void)
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
 		{
-			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+			if (is_demoting)
+			{
+				/*
+				 * Avoid concurrent access to the ControlFile data
+				 * during demotion.
+				 */
+				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+				LWLockRelease(ControlFileLock);
+			}
+			else
+			{
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
 
-			SpinLockAcquire(&XLogCtl->info_lck);
-			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
-			SpinLockRelease(&XLogCtl->info_lck);
+				/* This is already set if demoting */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+			}
 		}
 		else
 		{
@@ -6985,7 +7031,8 @@ StartupXLOG(void)
 		/*
 		 * Reset pgstat data, because it may be invalid after recovery.
 		 */
-		pgstat_reset_all();
+		if (!is_demoting)
+			pgstat_reset_all();
 
 		/*
 		 * If there was a backup label file, it's done its job and the info
@@ -7047,7 +7094,7 @@ StartupXLOG(void)
 
 			InitRecoveryTransactionEnvironment();
 
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
@@ -7060,6 +7107,11 @@ StartupXLOG(void)
 			 * Startup commit log and subtrans only.  MultiXact and commit
 			 * timestamp have already been started up and other SLRUs are not
 			 * maintained during recovery and need not be started yet.
+			 *
+			 * Starting up commit log is technically not needed during demote
+			 * as the in-memory data did not move. However, this is a
+			 * lightweight initialization and ShutdownCLOG() is expected to
+			 * have been called during ShutdownXLOG().
 			 */
 			StartupCLOG();
 			StartupSUBTRANS(oldestActiveXID);
@@ -7070,7 +7122,7 @@ StartupXLOG(void)
 			 * empty running-xacts record and use that here and now. Recover
 			 * additional standby state for prepared transactions.
 			 */
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 			{
 				RunningTransactionsData running;
 				TransactionId latestCompletedXid;
@@ -7941,6 +7993,7 @@ StartupXLOG(void)
 
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
+	XLogCtl->SharedHotStandbyActive = false;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8059,6 +8112,23 @@ CheckRecoveryConsistency(void)
 	}
 }
 
+/*
+ * Refresh LocalRecoveryInProgress from the shared recovery state
+ */
+bool
+SetLocalRecoveryInProgress(void)
+{
+	/*
+	 * use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
+	 */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
+
+	return LocalRecoveryInProgress;
+}
+
 /*
  * Is the system still in recovery?
  *
@@ -8080,13 +8150,7 @@ RecoveryInProgress(void)
 		return false;
 	else
 	{
-		/*
-		 * use volatile pointer to make sure we make a fresh read of the
-		 * shared variable.
-		 */
-		volatile XLogCtlData *xlogctl = XLogCtl;
-
-		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
+		SetLocalRecoveryInProgress();
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8487,6 +8551,8 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 void
 ShutdownXLOG(int code, Datum arg)
 {
+	bool is_demoting = DatumGetBool(arg);
+
 	/*
 	 * We should have an aux process resource owner to use, and we should not
 	 * be in a transaction that's installed some other resowner.
@@ -8496,35 +8562,55 @@ ShutdownXLOG(int code, Datum arg)
 		   CurrentResourceOwner == AuxProcessResourceOwner);
 	CurrentResourceOwner = AuxProcessResourceOwner;
 
-	/* Don't be chatty in standalone mode */
-	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
-			(errmsg("shutting down")));
-
-	/*
-	 * Signal walsenders to move to stopping state.
-	 */
-	WalSndInitStopping();
-
-	/*
-	 * Wait for WAL senders to be in stopping state.  This prevents commands
-	 * from writing new WAL.
-	 */
-	WalSndWaitStopping();
+	if (is_demoting)
+	{
+		/* Don't be chatty in standalone mode */
+		ereport(IsPostmasterEnvironment ? LOG : NOTICE,
+				(errmsg("demoting")));
 
-	if (RecoveryInProgress())
-		CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		/*
+		 * FIXME demote: avoiding checkpoint?
+		 * A checkpoint is probably running during a demote action. If
+		 * we don't want to wait for the checkpoint during the demote,
+		 * we might need to cancel it as it will not be able to write
+		 * to the WAL after the demote.
+		 */
+		CreateCheckPoint(CHECKPOINT_IS_DEMOTE | CHECKPOINT_IMMEDIATE);
+		ShutdownPreparedTransactions();
+		LocalRecoveryInProgress = true;
+	}
 	else
 	{
+		/* Don't be chatty in standalone mode */
+		ereport(IsPostmasterEnvironment ? LOG : NOTICE,
+				(errmsg("shutting down")));
+
 		/*
-		 * If archiving is enabled, rotate the last XLOG file so that all the
-		 * remaining records are archived (postmaster wakes up the archiver
-		 * process one more time at the end of shutdown). The checkpoint
-		 * record will go to the next XLOG file and won't be archived (yet).
+		 * Signal walsenders to move to stopping state.
 		 */
-		if (XLogArchivingActive() && XLogArchiveCommandSet())
-			RequestXLogSwitch(false);
+		WalSndInitStopping();
+
+		/*
+		 * Wait for WAL senders to be in stopping state.  This prevents commands
+		 * from writing new WAL.
+		 */
+		WalSndWaitStopping();
+
+		if (RecoveryInProgress())
+			CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		else
+		{
+			/*
+			 * If archiving is enabled, rotate the last XLOG file so that all the
+			 * remaining records are archived (postmaster wakes up the archiver
+			 * process one more time at the end of shutdown). The checkpoint
+			 * record will go to the next XLOG file and won't be archived (yet).
+			 */
+			if (XLogArchivingActive() && XLogArchiveCommandSet())
+				RequestXLogSwitch(false);
 
-		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+			CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		}
 	}
 	ShutdownCLOG();
 	ShutdownCommitTs();
@@ -8538,9 +8624,10 @@ ShutdownXLOG(int code, Datum arg)
 static void
 LogCheckpointStart(int flags, bool restartpoint)
 {
-	elog(LOG, "%s starting:%s%s%s%s%s%s%s%s",
+	elog(LOG, "%s starting:%s%s%s%s%s%s%s%s%s",
 		 restartpoint ? "restartpoint" : "checkpoint",
 		 (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
+		 (flags & CHECKPOINT_IS_DEMOTE) ? " demote" : "",
 		 (flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 		 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 		 (flags & CHECKPOINT_FORCE) ? " force" : "",
@@ -8676,6 +8763,7 @@ UpdateCheckPointDistanceEstimate(uint64 nbytes)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *	CHECKPOINT_IS_DEMOTE: checkpoint is for demote.
  *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
@@ -8704,6 +8792,7 @@ void
 CreateCheckPoint(int flags)
 {
 	bool		shutdown;
+	bool		demote;
 	CheckPoint	checkPoint;
 	XLogRecPtr	recptr;
 	XLogSegNo	_logSegNo;
@@ -8716,14 +8805,21 @@ CreateCheckPoint(int flags)
 	int			nvxids;
 
 	/*
-	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
-	 * issued at a different time.
+	 * An end-of-recovery or demote checkpoint is really a shutdown checkpoint,
+	 * just issued at a different time.
 	 */
-	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
+	if (flags & (CHECKPOINT_IS_SHUTDOWN |
+				 CHECKPOINT_IS_DEMOTE |
+				 CHECKPOINT_END_OF_RECOVERY))
 		shutdown = true;
 	else
 		shutdown = false;
 
+	if (flags & CHECKPOINT_IS_DEMOTE)
+		demote = true;
+	else
+		demote = false;
+
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
@@ -8764,7 +8860,7 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
+		ControlFile->state = demote ? DB_DEMOTING : DB_SHUTDOWNING;
 		ControlFile->time = (pg_time_t) time(NULL);
 		UpdateControlFile();
 		LWLockRelease(ControlFileLock);
@@ -8810,7 +8906,7 @@ CreateCheckPoint(int flags)
 	 * avoid inserting duplicate checkpoints when the system is idle.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-				  CHECKPOINT_FORCE)) == 0)
+				  CHECKPOINT_IS_DEMOTE | CHECKPOINT_FORCE)) == 0)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
@@ -8978,8 +9074,8 @@ CreateCheckPoint(int flags)
 	 * allows us to reconstruct the state of running transactions during
 	 * archive recovery, if required. Skip, if this info disabled.
 	 *
-	 * If we are shutting down, or Startup process is completing crash
-	 * recovery we don't need to write running xact data.
+	 * If we are shutting down, demoting or Startup process is completing
+	 * crash recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
@@ -8998,11 +9094,11 @@ CreateCheckPoint(int flags)
 	XLogFlush(recptr);
 
 	/*
-	 * We mustn't write any new WAL after a shutdown checkpoint, or it will be
-	 * overwritten at next startup.  No-one should even try, this just allows
-	 * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
-	 * to just temporarily disable writing until the system has exited
-	 * recovery.
+	 * We mustn't write any new WAL after a shutdown or demote checkpoint, or
+	 * it will be overwritten at next startup.  No-one should even try, this
+	 * just allows sanity-checking.  In the case of an end-of-recovery
+	 * checkpoint, we want to just temporarily disable writing until the system
+	 * has exited recovery.
 	 */
 	if (shutdown)
 	{
@@ -9018,7 +9114,8 @@ CreateCheckPoint(int flags)
 	 */
 	if (shutdown && checkPoint.redo != ProcLastRecPtr)
 		ereport(PANIC,
-				(errmsg("concurrent write-ahead log activity while database system is shutting down")));
+				(errmsg("concurrent write-ahead log activity while database system is %s",
+						demote ? "demoting" : "shutting down")));
 
 	/*
 	 * Remember the prior checkpoint's redo ptr for
@@ -9031,7 +9128,7 @@ CreateCheckPoint(int flags)
 	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	if (shutdown)
-		ControlFile->state = DB_SHUTDOWNED;
+		ControlFile->state = demote ? DB_DEMOTING : DB_SHUTDOWNED;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
 	ControlFile->time = (pg_time_t) time(NULL);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b8..58473a61fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -151,6 +151,7 @@ double		CheckPointCompletionTarget = 0.5;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t demoteRequestPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -552,6 +553,21 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
+	if (demoteRequestPending)
+	{
+		demoteRequestPending = false;
+		/* Close down the database */
+		ShutdownXLOG(0, BoolGetDatum(true));
+		/*
+		 * Exit checkpointer. We could keep it around during demotion, but
+		 * exiting here has multiple benefits:
+		 * - to create a fresh process with clean local vars
+		 *   (eg. LocalRecoveryInProgress)
+		 * - to signal postmaster the demote shutdown checkpoint is done
+		 *   and keep going with next steps of the demotion
+		 */
+		proc_exit(0);
+	}
 	if (ShutdownRequestPending)
 	{
 		/*
@@ -680,6 +696,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!demoteRequestPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -812,6 +829,17 @@ IsCheckpointOnSchedule(double progress)
  * --------------------------------
  */
 
+/* SIGUSR1: set flag to demote */
+void
+ReqCheckpointDemoteHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	demoteRequestPending = true;
+
+	errno = save_errno;
+}
+
 /* SIGINT: set flag to run a normal checkpoint right away */
 static void
 ReqCheckpointHandler(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index dec02586c7..60f159fcb6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -150,6 +150,9 @@
 
 #define BACKEND_TYPE_WORKER		(BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
+/* file to signal demotion from primary to standby */
+#define DEMOTE_SIGNAL_FILE		"demote"
+
 /*
  * List of active backends (or child processes anyway; we don't actually
  * know whether a given child has become a backend or is still in the
@@ -269,18 +272,23 @@ typedef enum
 static StartupStatusEnum StartupStatus = STARTUP_NOT_RUNNING;
 
 /* Startup/shutdown state */
-#define			NoShutdown		0
-#define			SmartShutdown	1
-#define			FastShutdown	2
-#define			ImmediateShutdown	3
-
-static int	Shutdown = NoShutdown;
+typedef enum StepDownState {
+	NoShutdown = 0, /* find better label? */
+	SmartShutdown,
+	SmartDemote,
+	FastShutdown,
+	FastDemote,
+	ImmediateShutdown
+} StepDownState;
+
+static StepDownState StepDown = NoShutdown;
+static bool DemoteSignal = false; /* true on demote request */
 
 static bool FatalError = false; /* T if recovering from backend crash */
 
 /*
- * We use a simple state machine to control startup, shutdown, and
- * crash recovery (which is rather like shutdown followed by startup).
+ * We use a simple state machine to control startup, shutdown, demote and
+ * crash recovery (both are rather like shutdown followed by startup).
  *
  * After doing all the postmaster initialization work, we enter PM_STARTUP
  * state and the startup process is launched. The startup process begins by
@@ -314,7 +322,7 @@ static bool FatalError = false; /* T if recovering from backend crash */
  * will not be very long).
  *
  * Notice that this state variable does not distinguish *why* we entered
- * states later than PM_RUN --- Shutdown and FatalError must be consulted
+ * states later than PM_RUN --- StepDown and FatalError must be consulted
  * to find that out.  FatalError is never true in PM_RECOVERY_* or PM_RUN
  * states, nor in PM_SHUTDOWN states (because we don't enter those states
  * when trying to recover from a crash).  It can be true in PM_STARTUP state,
@@ -324,6 +332,7 @@ typedef enum
 {
 	PM_INIT,					/* postmaster starting */
 	PM_STARTUP,					/* waiting for startup subprocess */
+	PM_DEMOTING,				/* demote action in progress */
 	PM_RECOVERY,				/* in archive recovery mode */
 	PM_HOT_STANDBY,				/* in hot standby mode */
 	PM_RUN,						/* normal "database is alive" state */
@@ -414,6 +423,8 @@ static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
 static void TerminateChildren(int signal);
+static bool CheckDemoteSignal(void);
+
 
 #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
 
@@ -1550,7 +1561,7 @@ DetermineSleepTime(struct timeval *timeout)
 	 * Normal case: either there are no background workers at all, or we're in
 	 * a shutdown sequence (during which we ignore bgworkers altogether).
 	 */
-	if (Shutdown > NoShutdown ||
+	if (StepDown > NoShutdown ||
 		(!StartWorkerNeeded && !HaveCrashedWorker))
 	{
 		if (AbortStartTime != 0)
@@ -1830,7 +1841,7 @@ ServerLoop(void)
 		 *
 		 * Note we also do this during recovery from a process crash.
 		 */
-		if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
+		if ((StepDown >= ImmediateShutdown || (FatalError && !SendStop)) &&
 			AbortStartTime != 0 &&
 			(now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS)
 		{
@@ -2305,6 +2316,11 @@ retry1:
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
 					 errmsg("the database system is starting up")));
 			break;
+		case CAC_DEMOTE:
+			ereport(FATAL,
+					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+					 errmsg("the database system is demoting")));
+			break;
 		case CAC_SHUTDOWN:
 			ereport(FATAL,
 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
@@ -2436,7 +2452,7 @@ canAcceptConnections(int backend_type)
 	CAC_state	result = CAC_OK;
 
 	/*
-	 * Can't start backends when in startup/shutdown/inconsistent recovery
+	 * Can't start backends when in startup/demote/shutdown/inconsistent recovery
 	 * state.  We treat autovac workers the same as user backends for this
 	 * purpose.  However, bgworkers are excluded from this test; we expect
 	 * bgworker_should_start_now() decided whether the DB state allows them.
@@ -2452,7 +2468,9 @@ canAcceptConnections(int backend_type)
 	{
 		if (pmState == PM_WAIT_BACKUP)
 			result = CAC_WAITBACKUP;	/* allow superusers only */
-		else if (Shutdown > NoShutdown)
+		else if (StepDown == SmartDemote || StepDown == FastDemote)
+			return CAC_DEMOTE;	/* demote is pending */
+		else if (StepDown > NoShutdown)
 			return CAC_SHUTDOWN;	/* shutdown is pending */
 		else if (!FatalError &&
 				 (pmState == PM_STARTUP ||
@@ -2683,7 +2701,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 	PG_SETMASK(&BlockSig);
 #endif
 
-	if (Shutdown <= SmartShutdown)
+	if (StepDown == NoShutdown || StepDown == SmartShutdown ||
+		StepDown == SmartDemote)
 	{
 		ereport(LOG,
 				(errmsg("received SIGHUP, reloading configuration files")));
@@ -2769,26 +2788,81 @@ pmdie(SIGNAL_ARGS)
 			(errmsg_internal("postmaster received signal %d",
 							 postgres_signal_arg)));
 
+	if (CheckDemoteSignal())
+	{
+		if (pmState != PM_RUN)
+		{
+			DemoteSignal = false;
+			unlink(DEMOTE_SIGNAL_FILE);
+			ereport(LOG,
+					(errmsg("ignoring demote signal because already in standby mode")));
+			goto out;
+		}
+		else if (postgres_signal_arg == SIGQUIT)
+		{
+			DemoteSignal = false;
+			unlink(DEMOTE_SIGNAL_FILE);
+			ereport(WARNING,
+					(errmsg("cannot demote in immediate stop mode")));
+			goto out;
+		}
+		else
+		{
+			FILE	   *standby_file;
+
+			DemoteSignal = true;
+
+			unlink(DEMOTE_SIGNAL_FILE);
+
+			/* create the standby signal file */
+			standby_file = AllocateFile(STANDBY_SIGNAL_FILE, "w");
+			if (!standby_file)
+			{
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create file \"%s\": %m",
+								STANDBY_SIGNAL_FILE)));
+				goto out;
+			}
+
+			if (FreeFile(standby_file))
+			{
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write file \"%s\": %m",
+								STANDBY_SIGNAL_FILE)));
+				goto out;
+			}
+		}
+	}
+
 	switch (postgres_signal_arg)
 	{
 		case SIGTERM:
 
 			/*
-			 * Smart Shutdown:
+			 * Smart Stepdown:
 			 *
-			 * Wait for children to end their work, then shut down.
+			 * Wait for children to end their work, then shut down or demote.
 			 */
-			if (Shutdown >= SmartShutdown)
+			if (StepDown >= SmartShutdown)
 				break;
-			Shutdown = SmartShutdown;
-			ereport(LOG,
-					(errmsg("received smart shutdown request")));
 
-			/* Report status */
-			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
+			if (DemoteSignal) {
+				StepDown = SmartDemote;
+				ereport(LOG, (errmsg("received smart demote request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING);
+			}
+			else {
+				StepDown = SmartShutdown;
+				ereport(LOG, (errmsg("received smart shutdown request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
 #ifdef USE_SYSTEMD
-			sd_notify(0, "STOPPING=1");
+				sd_notify(0, "STOPPING=1");
 #endif
+			}
 
 			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 				pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
@@ -2831,22 +2905,29 @@ pmdie(SIGNAL_ARGS)
 		case SIGINT:
 
 			/*
-			 * Fast Shutdown:
+			 * Fast StepDown:
 			 *
 			 * Abort all children with SIGTERM (rollback active transactions
-			 * and exit) and shut down when they are gone.
+			 * and exit) and shut down or demote when they are gone.
 			 */
-			if (Shutdown >= FastShutdown)
+			if (StepDown >= FastShutdown)
 				break;
-			Shutdown = FastShutdown;
-			ereport(LOG,
-					(errmsg("received fast shutdown request")));
 
-			/* Report status */
-			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
+			if (DemoteSignal) {
+				StepDown = FastDemote;
+				ereport(LOG, (errmsg("received fast demote request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING);
+			}
+			else {
+				StepDown = FastShutdown;
+				ereport(LOG, (errmsg("received fast shutdown request")));
+				/* Report status */
+				AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
 #ifdef USE_SYSTEMD
-			sd_notify(0, "STOPPING=1");
+				sd_notify(0, "STOPPING=1");
 #endif
+			}
 
 			if (StartupPID != 0)
 				signal_child(StartupPID, SIGTERM);
@@ -2903,9 +2984,9 @@ pmdie(SIGNAL_ARGS)
 			 * terminate remaining ones with SIGKILL, then exit without
 			 * attempt to properly shut down the data base system.
 			 */
-			if (Shutdown >= ImmediateShutdown)
+			if (StepDown >= ImmediateShutdown)
 				break;
-			Shutdown = ImmediateShutdown;
+			StepDown = ImmediateShutdown;
 			ereport(LOG,
 					(errmsg("received immediate shutdown request")));
 
@@ -2929,6 +3010,7 @@ pmdie(SIGNAL_ARGS)
 			break;
 	}
 
+out:
 #ifdef WIN32
 	PG_SETMASK(&UnBlockSig);
 #endif
@@ -2967,10 +3049,11 @@ reaper(SIGNAL_ARGS)
 			StartupPID = 0;
 
 			/*
-			 * Startup process exited in response to a shutdown request (or it
-			 * completed normally regardless of the shutdown request).
+			 * Startup process exited in response to a shutdown or demote
+			 * request (or it completed normally regardless of the shutdown
+			 * request).
 			 */
-			if (Shutdown > NoShutdown &&
+			if (StepDown > NoShutdown &&
 				(EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus)))
 			{
 				StartupStatus = STARTUP_NOT_RUNNING;
@@ -2984,7 +3067,7 @@ reaper(SIGNAL_ARGS)
 				ereport(LOG,
 						(errmsg("shutdown at recovery target")));
 				StartupStatus = STARTUP_NOT_RUNNING;
-				Shutdown = SmartShutdown;
+				StepDown = SmartShutdown;
 				TerminateChildren(SIGTERM);
 				pmState = PM_WAIT_BACKENDS;
 				/* PostmasterStateMachine logic does the rest */
@@ -3124,7 +3207,7 @@ reaper(SIGNAL_ARGS)
 				 * archive cycle and quit. Likewise, if we have walsender
 				 * processes, tell them to send any remaining WAL and quit.
 				 */
-				Assert(Shutdown > NoShutdown);
+				Assert(StepDown > NoShutdown);
 
 				/* Waken archiver for the last time */
 				if (PgArchPID != 0)
@@ -3145,6 +3228,18 @@ reaper(SIGNAL_ARGS)
 				if (PgStatPID != 0)
 					signal_child(PgStatPID, SIGQUIT);
 			}
+			else if (EXIT_STATUS_0(exitstatus) &&
+					 DemoteSignal &&
+					 pmState == PM_DEMOTING)
+			{
+				/*
+				 * The checkpointer's exit signals that the demote shutdown
+				 * checkpoint is done. The startup process can be started from there.
+				 */
+				ereport(DEBUG1,
+						(errmsg_internal("checkpointer shutdown for demote")));
+				StepDown = NoShutdown;
+			}
 			else
 			{
 				/*
@@ -3484,7 +3579,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	 * signaled children, nonzero exit status is to be expected, so don't
 	 * clutter log.
 	 */
-	take_action = !FatalError && Shutdown != ImmediateShutdown;
+	take_action = !FatalError && StepDown != ImmediateShutdown;
 
 	if (take_action)
 	{
@@ -3702,7 +3797,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 
 	/* We do NOT restart the syslogger */
 
-	if (Shutdown != ImmediateShutdown)
+	if (StepDown != ImmediateShutdown)
 		FatalError = true;
 
 	/* We now transit into a state of waiting for children to die */
@@ -3845,11 +3940,11 @@ PostmasterStateMachine(void)
 			WalReceiverPID == 0 &&
 			BgWriterPID == 0 &&
 			(CheckpointerPID == 0 ||
-			 (!FatalError && Shutdown < ImmediateShutdown)) &&
+			 (!FatalError && StepDown < ImmediateShutdown)) &&
 			WalWriterPID == 0 &&
 			AutoVacPID == 0)
 		{
-			if (Shutdown >= ImmediateShutdown || FatalError)
+			if (StepDown >= ImmediateShutdown || FatalError)
 			{
 				/*
 				 * Start waiting for dead_end children to die.  This state
@@ -3863,6 +3958,14 @@ PostmasterStateMachine(void)
 				 * FatalError state.
 				 */
 			}
+			/* Handle demote signal */
+			else if (DemoteSignal)
+			{
+				ereport(LOG, (errmsg("all backend processes terminated; demoting")));
+
+				SendProcSignal(CheckpointerPID, PROCSIG_CHECKPOINTER_DEMOTING, InvalidBackendId);
+				pmState = PM_DEMOTING;
+			}
 			else
 			{
 				/*
@@ -3870,7 +3973,7 @@ PostmasterStateMachine(void)
 				 * the regular children are gone, and it's time to tell the
 				 * checkpointer to do a shutdown checkpoint.
 				 */
-				Assert(Shutdown > NoShutdown);
+				Assert(StepDown > NoShutdown);
 				/* Start the checkpointer if not running */
 				if (CheckpointerPID == 0)
 					CheckpointerPID = StartCheckpointer();
@@ -3958,7 +4061,8 @@ PostmasterStateMachine(void)
 	 * EOF on its input pipe, which happens when there are no more upstream
 	 * processes.
 	 */
-	if (Shutdown > NoShutdown && pmState == PM_NO_CHILDREN)
+	if (pmState == PM_NO_CHILDREN && (StepDown == SmartShutdown ||
+		StepDown == FastShutdown || StepDown == ImmediateShutdown))
 	{
 		if (FatalError)
 		{
@@ -3991,10 +4095,23 @@ PostmasterStateMachine(void)
 	 * startup process fails, because more than likely it will just fail again
 	 * and we will keep trying forever.
 	 */
-	if (pmState == PM_NO_CHILDREN &&
+	if (pmState == PM_NO_CHILDREN && !DemoteSignal &&
 		(StartupStatus == STARTUP_CRASHED || !restart_after_crash))
 		ExitPostmaster(1);
 
+
+	/* Demoting: start the Startup Process */
+	if (pmState == PM_DEMOTING && StepDown == NoShutdown)
+	{
+		if (!XLogArchivingAlways() && PgArchPID != 0)
+			signal_child(PgArchPID, SIGQUIT);
+
+		StartupPID = StartupDataBase();
+		Assert(StartupPID != 0);
+		pmState = PM_STARTUP;
+		StartupStatus = STARTUP_RUNNING;
+	}
+
 	/*
 	 * If we need to recover from a crash, wait for all non-syslogger children
 	 * to exit, then reset shmem and StartupDataBase.
@@ -5195,7 +5312,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	 * first. We don't want to go back to recovery in that case.
 	 */
 	if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) &&
-		pmState == PM_STARTUP && Shutdown == NoShutdown)
+		pmState == PM_STARTUP && StepDown == NoShutdown)
 	{
 		/* WAL redo has started. We're out of reinitialization. */
 		FatalError = false;
@@ -5205,19 +5322,29 @@ sigusr1_handler(SIGNAL_ARGS)
 		 * Crank up the background tasks.  It doesn't matter if this fails,
 		 * we'll just try again later.
 		 */
+		if (!DemoteSignal)
+		{
+			Assert(BgWriterPID == 0);
+			Assert(PgArchPID == 0);
+		}
+
 		Assert(CheckpointerPID == 0);
 		CheckpointerPID = StartCheckpointer();
-		Assert(BgWriterPID == 0);
-		BgWriterPID = StartBackgroundWriter();
+
+		if (BgWriterPID == 0)
+			BgWriterPID = StartBackgroundWriter();
 
 		/*
 		 * Start the archiver if we're responsible for (re-)archiving received
 		 * files.
 		 */
-		Assert(PgArchPID == 0);
-		if (XLogArchivingAlways())
+		if (PgArchPID == 0 && XLogArchivingAlways())
 			PgArchPID = pgarch_start();
 
+		if (DemoteSignal) {
+			SignalSomeChildren(SIGHUP, BACKEND_TYPE_WALSND);
+		}
+
 		/*
 		 * If we aren't planning to enter hot standby mode later, treat
 		 * RECOVERY_STARTED as meaning we're out of startup, and report status
@@ -5226,6 +5353,7 @@ sigusr1_handler(SIGNAL_ARGS)
 		if (!EnableHotStandby)
 		{
 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STANDBY);
+			DemoteSignal = false;
 #ifdef USE_SYSTEMD
 			sd_notify(0, "READY=1");
 #endif
@@ -5234,13 +5362,15 @@ sigusr1_handler(SIGNAL_ARGS)
 		pmState = PM_RECOVERY;
 	}
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
-		pmState == PM_RECOVERY && Shutdown == NoShutdown)
+		pmState == PM_RECOVERY && StepDown == NoShutdown)
 	{
 		/*
 		 * Likewise, start other special children as needed.
 		 */
-		Assert(PgStatPID == 0);
-		PgStatPID = pgstat_start();
+		if (!DemoteSignal)
+			Assert(PgStatPID == 0);
+		if (PgStatPID == 0)
+			PgStatPID = pgstat_start();
 
 		ereport(LOG,
 				(errmsg("database system is ready to accept read only connections")));
@@ -5252,6 +5382,7 @@ sigusr1_handler(SIGNAL_ARGS)
 #endif
 
 		pmState = PM_HOT_STANDBY;
+		DemoteSignal = false;
 		/* Some workers may be scheduled to start now */
 		StartWorkerNeeded = true;
 	}
@@ -5284,7 +5415,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER) &&
-		Shutdown == NoShutdown)
+		StepDown == NoShutdown)
 	{
 		/*
 		 * Start one iteration of the autovacuum daemon, even if autovacuuming
@@ -5299,7 +5430,7 @@ sigusr1_handler(SIGNAL_ARGS)
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER) &&
-		Shutdown == NoShutdown)
+		StepDown == NoShutdown)
 	{
 		/* The autovacuum launcher wants us to start a worker process. */
 		StartAutovacuumWorker();
@@ -5644,7 +5775,7 @@ MaybeStartWalReceiver(void)
 	if (WalReceiverPID == 0 &&
 		(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
 		 pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
-		Shutdown == NoShutdown)
+		StepDown == NoShutdown)
 	{
 		WalReceiverPID = StartWalReceiver();
 		if (WalReceiverPID != 0)
@@ -5899,6 +6030,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 		case PM_WAIT_BACKENDS:
 		case PM_WAIT_READONLY:
 		case PM_WAIT_BACKUP:
+		case PM_DEMOTING:
 			break;
 
 		case PM_RUN:
@@ -6647,3 +6779,18 @@ InitPostmasterDeathWatchHandle(void)
 								 GetLastError())));
 #endif							/* WIN32 */
 }
+
+/*
+ * Check if a demote request appeared. Should be called by postmaster before
+ * shutting down.
+ */
+static bool
+CheckDemoteSignal(void)
+{
+	struct stat stat_buf;
+
+	if (stat(DEMOTE_SIGNAL_FILE, &stat_buf) == 0)
+		return true;
+
+	return false;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5e2210dd7b..9a2bff7e5e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2267,6 +2267,7 @@ WalSndLoop(WalSndSendDataCallback send_data)
 			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
 			SyncRepInitConfig();
+			am_cascading_walsender = SetLocalRecoveryInProgress();
 		}
 
 		/* Check for input from the client */
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index b448533564..0ccc32f4ce 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -191,6 +191,8 @@ ProcArrayShmemSize(void)
 	size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS));
 
 	/*
+	 * TODO demote: check safe hotStandby related init and snapshot mech.
+	 *
 	 * During Hot Standby processing we have a data structure called
 	 * KnownAssignedXids, created in shared memory. Local data structures are
 	 * also created in various backends during GetSnapshotData(),
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fa385b0ec..1903f4db2a 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -28,6 +28,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "postmaster/bgwriter.h"
 
 /*
  * The SIGUSR1 signal is multiplexed to support signaling multiple event
@@ -585,6 +586,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CHECKPOINTER_DEMOTING))
+		ReqCheckpointDemoteHandler(PROCSIG_CHECKPOINTER_DEMOTING);
+
 	SetLatch(MyLatch);
 
 	latch_sigusr1_handler();
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 95989ce79b..52f85cd1b3 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4371,6 +4371,18 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
 	lock_twophase_postcommit(xid, info, recdata, len);
 }
 
+/*
+ * 2PC shutdown from lock table.
+ *
+ * This is actually just the same as the COMMIT case.
+ */
+void
+lock_twophase_shutdown(TransactionId xid, uint16 info,
+						void *recdata, uint32 len)
+{
+	lock_twophase_postcommit(xid, info, recdata, len);
+}
+
 /*
  *		VirtualXactLockTableInsert
  *
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..c144cc35d3 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -57,6 +57,8 @@ dbState(DBState state)
 			return _("shut down");
 		case DB_SHUTDOWNED_IN_RECOVERY:
 			return _("shut down in recovery");
+		case DB_DEMOTING:
+			return _("demoting");
 		case DB_SHUTDOWNING:
 			return _("shutting down");
 		case DB_IN_CRASH_RECOVERY:
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 3c03ace7ed..79bb42f7e7 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -62,6 +62,7 @@ typedef enum
 	RESTART_COMMAND,
 	RELOAD_COMMAND,
 	STATUS_COMMAND,
+	DEMOTE_COMMAND,
 	PROMOTE_COMMAND,
 	LOGROTATE_COMMAND,
 	KILL_COMMAND,
@@ -103,6 +104,7 @@ static char version_file[MAXPGPATH];
 static char pid_file[MAXPGPATH];
 static char backup_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
+static char demote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
 static volatile pgpid_t postmasterPID = -1;
@@ -129,6 +131,7 @@ static void do_stop(void);
 static void do_restart(void);
 static void do_reload(void);
 static void do_status(void);
+static void do_demote(void);
 static void do_promote(void);
 static void do_logrotate(void);
 static void do_kill(pgpid_t pid);
@@ -1029,6 +1032,109 @@ do_stop(void)
 }
 
 
+static void
+do_demote(void)
+{
+	int			cnt;
+	FILE	   *dmtfile;
+	pgpid_t		pid;
+	struct stat statbuf;
+
+	pid = get_pgpid(false);
+
+	if (pid == 0)				/* no pid file */
+	{
+		write_stderr(_("%s: PID file \"%s\" does not exist\n"), progname, pid_file);
+		write_stderr(_("Is server running?\n"));
+		exit(1);
+	}
+	else if (pid < 0)			/* standalone backend, not postmaster */
+	{
+		pid = -pid;
+		write_stderr(_("%s: cannot demote server; "
+					   "single-user server is running (PID: %ld)\n"),
+					 progname, pid);
+		exit(1);
+	}
+	if (shutdown_mode == IMMEDIATE_MODE)
+	{
+		write_stderr(_("%s: cannot demote server using immediate mode\n"),
+					 progname);
+		exit(1);
+	}
+
+	snprintf(demote_file, MAXPGPATH, "%s/demote", pg_data);
+
+	if ((dmtfile = fopen(demote_file, "w")) == NULL)
+	{
+		write_stderr(_("%s: could not create demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+	if (fclose(dmtfile))
+	{
+		write_stderr(_("%s: could not write demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+
+	if (kill((pid_t) pid, sig) != 0)
+	{
+		write_stderr(_("%s: could not send demote signal (PID: %ld): %s\n"), progname, pid,
+					 strerror(errno));
+		exit(1);
+	}
+
+	if (!do_wait)
+	{
+		print_msg(_("server demoting\n"));
+		return;
+	}
+	else
+	{
+		/*
+		 * If backup_label exists, an online backup is running. Warn the user
+		 * that smart demote will wait for it to finish. However, if the
+		 * server is in archive recovery, we're recovering from an online
+		 * backup instead of performing one.
+		 */
+		if (shutdown_mode == SMART_MODE &&
+			stat(backup_file, &statbuf) == 0 &&
+			get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_("WARNING: online backup mode is active\n"
+						"Demote will not complete until pg_stop_backup() is called.\n\n"));
+		}
+
+		print_msg(_("waiting for server to demote..."));
+
+		for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
+		{
+			if (get_control_dbstate() == DB_IN_ARCHIVE_RECOVERY)
+				break;
+
+			if (cnt % WAITS_PER_SEC == 0)
+				print_msg(".");
+			pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
+		}
+
+		if (get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_(" failed\n"));
+
+			write_stderr(_("%s: server does not demote\n"), progname);
+			if (shutdown_mode == SMART_MODE)
+				write_stderr(_("HINT: The \"-m fast\" option immediately disconnects sessions rather than\n"
+							   "waiting for session-initiated disconnection.\n"));
+			exit(1);
+		}
+		print_msg(_(" done\n"));
+
+		print_msg(_("server demoted\n"));
+	}
+}
+
+
 /*
  *	restart/reload routines
  */
@@ -2452,6 +2558,8 @@ main(int argc, char **argv)
 				ctl_command = RELOAD_COMMAND;
 			else if (strcmp(argv[optind], "status") == 0)
 				ctl_command = STATUS_COMMAND;
+			else if (strcmp(argv[optind], "demote") == 0)
+				ctl_command = DEMOTE_COMMAND;
 			else if (strcmp(argv[optind], "promote") == 0)
 				ctl_command = PROMOTE_COMMAND;
 			else if (strcmp(argv[optind], "logrotate") == 0)
@@ -2559,6 +2667,9 @@ main(int argc, char **argv)
 		case RELOAD_COMMAND:
 			do_reload();
 			break;
+		case DEMOTE_COMMAND:
+			do_demote();
+			break;
 		case PROMOTE_COMMAND:
 			do_promote();
 			break;
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3445..4b56f92181 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -53,6 +53,7 @@ extern void RecoverPreparedTransactions(void);
 extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
+extern void ShutdownPreparedTransactions(void);
 
 extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 5b14334887..be5e96e437 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -219,18 +219,20 @@ extern bool XLOG_DEBUG;
 
 /* These directly affect the behavior of CreateCheckPoint and subsidiaries */
 #define CHECKPOINT_IS_SHUTDOWN	0x0001	/* Checkpoint is for shutdown */
-#define CHECKPOINT_END_OF_RECOVERY	0x0002	/* Like shutdown checkpoint, but
+#define CHECKPOINT_IS_DEMOTE	0x0002	/* Like shutdown checkpoint, but
+											 * issued at end of WAL production */
+#define CHECKPOINT_END_OF_RECOVERY	0x0004	/* Like shutdown checkpoint, but
 											 * issued at end of WAL recovery */
-#define CHECKPOINT_IMMEDIATE	0x0004	/* Do it without delays */
-#define CHECKPOINT_FORCE		0x0008	/* Force even if no activity */
-#define CHECKPOINT_FLUSH_ALL	0x0010	/* Flush all pages, including those
+#define CHECKPOINT_IMMEDIATE	0x0008	/* Do it without delays */
+#define CHECKPOINT_FORCE		0x0010	/* Force even if no activity */
+#define CHECKPOINT_FLUSH_ALL	0x0020	/* Flush all pages, including those
 										 * belonging to unlogged tables */
 /* These are important to RequestCheckpoint */
-#define CHECKPOINT_WAIT			0x0020	/* Wait for completion */
-#define CHECKPOINT_REQUESTED	0x0040	/* Checkpoint request has been made */
+#define CHECKPOINT_WAIT			0x0040	/* Wait for completion */
+#define CHECKPOINT_REQUESTED	0x0080	/* Checkpoint request has been made */
 /* These indicate the cause of a checkpoint request */
-#define CHECKPOINT_CAUSE_XLOG	0x0080	/* XLOG consumption */
-#define CHECKPOINT_CAUSE_TIME	0x0100	/* Elapsed time */
+#define CHECKPOINT_CAUSE_XLOG	0x0100	/* XLOG consumption */
+#define CHECKPOINT_CAUSE_TIME	0x0200	/* Elapsed time */
 
 /*
  * Flag bits for the record being inserted, set using XLogSetRecordFlags().
@@ -300,6 +302,7 @@ extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
+extern bool SetLocalRecoveryInProgress(void);
 extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..f529f8c7bd 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -87,6 +87,7 @@ typedef enum DBState
 	DB_STARTUP = 0,
 	DB_SHUTDOWNED,
 	DB_SHUTDOWNED_IN_RECOVERY,
+	DB_DEMOTING,
 	DB_SHUTDOWNING,
 	DB_IN_CRASH_RECOVERY,
 	DB_IN_ARCHIVE_RECOVERY,
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 179ebaa104..a9e27f009e 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -70,7 +70,12 @@ typedef struct
 
 typedef enum CAC_state
 {
-	CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
+	CAC_OK,
+	CAC_STARTUP,
+	CAC_DEMOTE,
+	CAC_SHUTDOWN,
+	CAC_RECOVERY,
+	CAC_TOOMANY,
 	CAC_WAITBACKUP
 } CAC_state;
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 0a5708b32e..4d4f0ea1dd 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -41,5 +41,6 @@ extern Size CheckpointerShmemSize(void);
 extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
+extern void ReqCheckpointDemoteHandler(SIGNAL_ARGS);
 
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index fdabf42721..d3b08163a2 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
 									 void *recdata, uint32 len);
 extern void lock_twophase_postabort(TransactionId xid, uint16 info,
 									void *recdata, uint32 len);
+extern void lock_twophase_shutdown(TransactionId xid, uint16 info,
+									void *recdata, uint32 len);
 extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
 										  void *recdata, uint32 len);
 
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 5cb39697f3..eb0bda04f5 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -34,6 +34,7 @@ typedef enum
 	PROCSIG_PARALLEL_MESSAGE,	/* message from cooperating parallel backend */
 	PROCSIG_WALSND_INIT_STOPPING,	/* ask walsenders to prepare for shutdown  */
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
+	PROCSIG_CHECKPOINTER_DEMOTING,	/* ask checkpointer to demote */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_DATABASE,
diff --git a/src/include/utils/pidfile.h b/src/include/utils/pidfile.h
index 63fefe5c4c..f761d2c4ef 100644
--- a/src/include/utils/pidfile.h
+++ b/src/include/utils/pidfile.h
@@ -50,6 +50,7 @@
  */
 #define PM_STATUS_STARTING		"starting"	/* still starting up */
 #define PM_STATUS_STOPPING		"stopping"	/* in shutdown sequence */
+#define PM_STATUS_DEMOTING		"demoting"	/* demote sequence */
 #define PM_STATUS_READY			"ready   "	/* ready for connections */
 #define PM_STATUS_STANDBY		"standby "	/* up, won't accept connections */
 
-- 
2.20.1
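
For readers following the pg_ctl side of the patch above: the demote request is a signal-file handshake. do_demote() creates an empty "demote" file in the data directory and then signals the postmaster, which stat()s the file in CheckDemoteSignal() and unlink()s it in pmdie(). Below is a minimal standalone sketch of the file part only. The helper names request_demote_file() and consume_demote_file() are hypothetical and do not exist in the patch; the real code also handles signals and error reporting, which this sketch leaves out.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Requester side (roughly what do_demote() does before kill()):
 * create an empty signal file. Hypothetical helper, not in the patch.
 */
bool
request_demote_file(const char *path)
{
	FILE	   *f = fopen(path, "w");

	if (f == NULL)
		return false;
	return fclose(f) == 0;
}

/*
 * Receiver side (roughly CheckDemoteSignal() plus the unlink() in pmdie()):
 * report whether the file exists and remove it, so that one request is
 * consumed only once. Hypothetical helper, not in the patch.
 */
bool
consume_demote_file(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return false;
	unlink(path);
	return true;
}
```

Removing the file as soon as it has been seen is what keeps a later plain SIGTERM from being mistaken for a second demote request.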

From b548e865e5d0532a03416cbc8db923c1a2f2f01e Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <j...@dalibo.com>
Date: Fri, 10 Jul 2020 02:00:38 +0200
Subject: [PATCH 2/2] Add various tests related to demote and promote actions

* demote/promote with a standby replicating from the node
* make sure prepared transactions (2PC) survive a demote/promote cycle
* commit 2PC and check the result
* swap roles between primary and standby
* commit a 2PC on the new primary
---
 src/test/perl/PostgresNode.pm             |  25 +++++
 src/test/recovery/t/021_promote-demote.pl | 129 ++++++++++++++++++++++
 2 files changed, 154 insertions(+)
 create mode 100644 src/test/recovery/t/021_promote-demote.pl

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 8c1b77376f..4488365ffc 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -906,6 +906,31 @@ sub promote
 
 =pod
 
+=item $node->demote()
+
+Wrapper for pg_ctl demote
+
+=cut
+
+sub demote
+{
+	my ($self, $mode) = @_;
+	my $port    = $self->port;
+	my $pgdata  = $self->data_dir;
+	my $logfile = $self->logfile;
+	my $name    = $self->name;
+
+	$mode = 'fast' unless defined $mode;
+
+	print "### Demoting node \"$name\" using mode $mode\n";
+
+	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
+		'-m', $mode, 'demote');
+	return;
+}
+
+=pod
+
 =item $node->logrotate()
 
 Wrapper for pg_ctl logrotate
diff --git a/src/test/recovery/t/021_promote-demote.pl b/src/test/recovery/t/021_promote-demote.pl
new file mode 100644
index 0000000000..04e2207470
--- /dev/null
+++ b/src/test/recovery/t/021_promote-demote.pl
@@ -0,0 +1,129 @@
+# Test demote/promote actions in various scenarios using two
+# nodes alpha and beta. We check proper action results,
+# correct data replication across multiple demote/promote cycles,
+# and manual switchover.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 13;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize node alpha
+my $node_alpha = get_new_node('alpha');
+$node_alpha->init(allows_streaming => 1);
+$node_alpha->append_conf(
+	'postgresql.conf', qq(
+	max_prepared_transactions = 10
+	log_checkpoints = true
+	log_replication_commands = true
+
+));
+
+# Take backup
+my $backup_name = 'alpha_backup';
+$node_alpha->start;
+$node_alpha->backup($backup_name);
+
+# Create node beta from backup
+my $node_beta = get_new_node('beta');
+$node_beta->init_from_backup($node_alpha, $backup_name);
+$node_beta->enable_streaming($node_alpha);
+$node_beta->start;
+
+
+# Create some 2PC on alpha for future tests
+$node_alpha->safe_psql('postgres', q{
+CREATE TABLE ins AS SELECT 1 AS i;
+BEGIN;
+CREATE TABLE new AS SELECT generate_series(1,5) AS i;
+PREPARE TRANSACTION 'pxact1';
+BEGIN;
+INSERT INTO ins VALUES (2);
+PREPARE TRANSACTION 'pxact2';
+});
+
+# Demote alpha. beta should keep streaming from it as a
+# cascaded standby.
+$node_alpha->demote;
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	't', 'node alpha demoted to standby' );
+
+is( $node_alpha->safe_psql(
+		'postgres', 'SELECT application_name FROM pg_stat_replication'),
+	$node_beta->name, 'beta is still replicating from alpha after demote' );
+
+# Promote alpha back to production.
+$node_alpha->promote;
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	'f', "node alpha promoted" );
+
+# Check all 2PC xact have been restored
+is( $node_alpha->safe_psql(
+		'postgres',
+		"SELECT string_agg(gid, ',' order by gid asc) FROM pg_prepared_xacts"),
+	'pxact1,pxact2', "prepared transactions 'pxact1' and 'pxact2' exist" );
+
+# Commit one 2PC and check it on alpha and beta
+$node_alpha->safe_psql( 'postgres', "commit prepared 'pxact1'");
+
+is( $node_alpha->safe_psql(
+		'postgres', "SELECT string_agg(i::text, ',' order by i asc) FROM new"),
+	'1,2,3,4,5', "prepared transaction 'pxact1' committed" );
+
+$node_alpha->wait_for_catchup($node_beta);
+
+is( $node_beta->safe_psql(
+		'postgres', "SELECT string_agg(i::text, ',' order by i asc) FROM new"),
+	'1,2,3,4,5', "prepared transaction 'pxact1' replicated to beta" );
+
+# swap roles between alpha and beta
+# demote and check alpha
+$node_alpha->demote;
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	't', "node alpha demoted" );
+
+# fetch the last REDO location from alpha and check beta received everything
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $node_alpha->data_dir ]);
+$stdout =~ m{REDO location:\s+([0-9A-F]+/[0-9A-F]+)$}mg;
+my $redo_loc = $1;
+
+is( $node_beta->safe_psql(
+		'postgres',
+		"SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), '$redo_loc') > 0 "),
+	't', "node beta received the demote checkpoint from alpha" );
+
+# promote beta and check it
+$node_beta->promote;
+is( $node_beta->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	'f', "node beta promoted" );
+
+# Setup alpha to replicate from beta
+$node_alpha->enable_streaming($node_beta);
+$node_alpha->reload;
+
+# check alpha is replicating from it
+$node_beta->wait_for_catchup($node_alpha);
+
+is( $node_beta->safe_psql(
+		'postgres', 'SELECT application_name FROM pg_stat_replication'),
+	$node_alpha->name, 'alpha is replicating from beta' );
+
+# make sure the second 2PC is still available on beta
+is( $node_beta->safe_psql(
+		'postgres', 'SELECT gid FROM pg_prepared_xacts'),
+	'pxact2', "prepared transaction 'pxact2' exists" );
+
+# commit the second 2PC and check its result on both nodes
+$node_beta->safe_psql( 'postgres', "commit prepared 'pxact2'");
+
+is( $node_beta->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' committed" );
+
+$node_beta->wait_for_catchup($node_alpha);
+is( $node_alpha->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', "prepared transaction 'pxact2' streamed to alpha" );
-- 
2.20.1
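
A note on the "node beta received the demote checkpoint" check in the test above: it compares two LSNs, the REDO location printed by pg_controldata and pg_last_wal_receive_lsn() on beta. An LSN in its text form is two hexadecimal halves separated by a slash, so the comparison amounts to parsing both into 64-bit positions. A small sketch of that arithmetic follows, assuming hypothetical helper names (parse_lsn, lsn_is_past) that are not part of the patch or of PostgreSQL.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse an LSN written as "hi/lo" (two hexadecimal halves) into a 64-bit
 * position. Returns false when the text does not have that shape.
 * Hypothetical helper, for illustration only.
 */
bool
parse_lsn(const char *text, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;
	char		junk;

	/* exactly two conversions: anything trailing makes it three */
	if (sscanf(text, "%X/%X%c", &hi, &lo, &junk) != 2)
		return false;
	*lsn = ((uint64_t) hi << 32) | lo;
	return true;
}

/* True when "received" is strictly past "redo", as the test requires. */
bool
lsn_is_past(const char *received, const char *redo)
{
	uint64_t	a;
	uint64_t	b;

	return parse_lsn(received, &a) && parse_lsn(redo, &b) && a > b;
}
```

The strict comparison mirrors the "> 0" in the test query: beta must have received WAL beyond the REDO pointer, not merely up to it.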
