Re: Timeline switching with partial WAL records can break replica recovery

Alyona Vinter Wed, 10 Sep 2025 02:08:11 -0700

Hi!

I've noticed an issue with pg_rewind caused by my patches.


Some logs for issue demonstration:
pg_rewind: Source timeline history:
pg_rewind: 1: 0/00000000 - 0/03002048
pg_rewind: 2: 0/03002048 - 0/00000000
pg_rewind: Target timeline history:
pg_rewind: 1: 0/00000000 - 0/00000000
pg_rewind: servers diverged at WAL location 0/03002048 on timeline 1
pg_rewind: error: could not find previous WAL record at 0/03002048: invalid
record length at 0/03002048: expected at least 24, got 0

When a common timeline ends with an overwritten contrecord, the divergence
point may not point to the start of a valid WAL record on the target,
causing errors and making rewind impossible.
To handle this case, I suggest looking for a checkpoint preceding the
divergence point starting from the last checkpoint on the target rather
than from the divergence point itself when the common timeline is
unfinished on the target. This ensures we always begin from a known-valid
position in WAL.

I'd appreciate any feedback!

Best Regards,
Alyona Vinter

From 8cf7ed6aa3de1984f3512ec13bfdfab32e36364c Mon Sep 17 00:00:00 2001
From: Alena Vinter <[email protected]>
Date: Mon, 26 May 2025 13:18:35 +0700
Subject: [PATCH 1/3] Handle WAL timeline switches with incomplete records

When switching timelines with incomplete WAL records at the end of the
old timeline, physical replicas could enter an infinite recovery loop by
repeatedly requesting the same WAL data. This occurs because the
incomplete record isn't properly marked with XLOG_OVERWRITE_CONTRECORD
record, causing replicas to retry fetching it.

To fix this, we preserve WAL's append-only nature by writing an
XLOG_OVERWRITE_CONTRECORD to explicitly mark incomplete records before
initializing the new timeline. This ensures replicas can properly detect
transition to the new timeline without getting stuck.
---
 src/backend/access/transam/xlog.c         | 169 ++++++++++++----------
 src/backend/access/transam/xlogrecovery.c |  14 +-
 2 files changed, 93 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..654a5851eb2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5885,6 +5885,7 @@ StartupXLOG(void)
 	EndOfLogTLI = endOfRecoveryInfo->endOfLogTLI;
 	abortedRecPtr = endOfRecoveryInfo->abortedRecPtr;
 	missingContrecPtr = endOfRecoveryInfo->missingContrecPtr;
+	newTLI = endOfRecoveryInfo->lastRecTLI;
 
 	/*
 	 * Reset ps status display, so as no information related to recovery shows
@@ -5957,67 +5958,6 @@ StartupXLOG(void)
 	 */
 	SetInstallXLogFileSegmentActive();
 
-	/*
-	 * Consider whether we need to assign a new timeline ID.
-	 *
-	 * If we did archive recovery, we always assign a new ID.  This handles a
-	 * couple of issues.  If we stopped short of the end of WAL during
-	 * recovery, then we are clearly generating a new timeline and must assign
-	 * it a unique new ID.  Even if we ran to the end, modifying the current
-	 * last segment is problematic because it may result in trying to
-	 * overwrite an already-archived copy of that segment, and we encourage
-	 * DBAs to make their archive_commands reject that.  We can dodge the
-	 * problem by making the new active segment have a new timeline ID.
-	 *
-	 * In a normal crash recovery, we can just extend the timeline we were in.
-	 */
-	newTLI = endOfRecoveryInfo->lastRecTLI;
-	if (ArchiveRecoveryRequested)
-	{
-		newTLI = findNewestTimeLine(recoveryTargetTLI) + 1;
-		ereport(LOG,
-				(errmsg("selected new timeline ID: %u", newTLI)));
-
-		/*
-		 * Make a writable copy of the last WAL segment.  (Note that we also
-		 * have a copy of the last block of the old WAL in
-		 * endOfRecovery->lastPage; we will use that below.)
-		 */
-		XLogInitNewTimeline(EndOfLogTLI, EndOfLog, newTLI);
-
-		/*
-		 * Remove the signal files out of the way, so that we don't
-		 * accidentally re-enter archive recovery mode in a subsequent crash.
-		 */
-		if (endOfRecoveryInfo->standby_signal_file_found)
-			durable_unlink(STANDBY_SIGNAL_FILE, FATAL);
-
-		if (endOfRecoveryInfo->recovery_signal_file_found)
-			durable_unlink(RECOVERY_SIGNAL_FILE, FATAL);
-
-		/*
-		 * Write the timeline history file, and have it archived. After this
-		 * point (or rather, as soon as the file is archived), the timeline
-		 * will appear as "taken" in the WAL archive and to any standby
-		 * servers.  If we crash before actually switching to the new
-		 * timeline, standby servers will nevertheless think that we switched
-		 * to the new timeline, and will try to connect to the new timeline.
-		 * To minimize the window for that, try to do as little as possible
-		 * between here and writing the end-of-recovery record.
-		 */
-		writeTimeLineHistory(newTLI, recoveryTargetTLI,
-							 EndOfLog, endOfRecoveryInfo->recoveryStopReason);
-
-		ereport(LOG,
-				(errmsg("archive recovery complete")));
-	}
-
-	/* Save the selected TimeLineID in shared memory, too */
-	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->InsertTimeLineID = newTLI;
-	XLogCtl->PrevTimeLineID = endOfRecoveryInfo->lastRecTLI;
-	SpinLockRelease(&XLogCtl->info_lck);
-
 	/*
 	 * Actually, if WAL ended in an incomplete record, skip the parts that
 	 * made it through and start writing after the portion that persisted.
@@ -6026,15 +5966,9 @@ StartupXLOG(void)
 	 */
 	if (!XLogRecPtrIsInvalid(missingContrecPtr))
 	{
-		/*
-		 * We should only have a missingContrecPtr if we're not switching to a
-		 * new timeline. When a timeline switch occurs, WAL is copied from the
-		 * old timeline to the new only up to the end of the last complete
-		 * record, so there can't be an incomplete WAL record that we need to
-		 * disregard.
-		 */
 		Assert(newTLI == endOfRecoveryInfo->lastRecTLI);
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
+		Assert(EndOfLog == abortedRecPtr);
 		EndOfLog = missingContrecPtr;
 	}
 
@@ -6091,6 +6025,95 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	/* Enable WAL writes for this backend only. */
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		Assert(EndOfLog == missingContrecPtr);
+
+		/*
+		 * Save the aborted record's TimeLineID in shared memory to ensure
+		 * correct writing of the overwrite-contrecord.
+		 *
+		 * These values will be incremented if timeline switch occurs.
+		 */
+		SpinLockAcquire(&XLogCtl->info_lck);
+		XLogCtl->InsertTimeLineID = newTLI;
+		XLogCtl->PrevTimeLineID = endOfRecoveryInfo->lastRecTLI;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		EndOfLog = CreateOverwriteContrecordRecord(abortedRecPtr, missingContrecPtr, newTLI);
+		/*
+		 * Ensure the next records will be written to the next timeline segment
+		 * by closing the current segment.
+		 */
+		if (openLogFile >= 0)
+			XLogFileClose();
+	}
+
+	/*
+	 * Consider whether we need to assign a new timeline ID.
+	 *
+	 * If we did archive recovery, we always assign a new ID.  This handles a
+	 * couple of issues.  If we stopped short of the end of WAL during
+	 * recovery, then we are clearly generating a new timeline and must assign
+	 * it a unique new ID.  Even if we ran to the end, modifying the current
+	 * last segment is problematic because it may result in trying to
+	 * overwrite an already-archived copy of that segment, and we encourage
+	 * DBAs to make their archive_commands reject that.  We can dodge the
+	 * problem by making the new active segment have a new timeline ID.
+	 *
+	 * In a normal crash recovery, we can just extend the timeline we were in.
+	 */
+	if (ArchiveRecoveryRequested)
+	{
+		newTLI = findNewestTimeLine(recoveryTargetTLI) + 1;
+		ereport(LOG,
+				(errmsg("selected new timeline ID: %u", newTLI)));
+
+		/*
+		 * Make a writable copy of the last WAL segment.  (Note that we also
+		 * have a copy of the last block of the old WAL in
+		 * endOfRecovery->lastPage; we will use that below.)
+		 */
+		XLogInitNewTimeline(EndOfLogTLI, EndOfLog, newTLI);
+
+		/*
+		 * Remove the signal files out of the way, so that we don't
+		 * accidentally re-enter archive recovery mode in a subsequent crash.
+		 */
+		if (endOfRecoveryInfo->standby_signal_file_found)
+			durable_unlink(STANDBY_SIGNAL_FILE, FATAL);
+
+		if (endOfRecoveryInfo->recovery_signal_file_found)
+			durable_unlink(RECOVERY_SIGNAL_FILE, FATAL);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby
+		 * servers.  If we crash before actually switching to the new
+		 * timeline, standby servers will nevertheless think that we switched
+		 * to the new timeline, and will try to connect to the new timeline.
+		 * To minimize the window for that, try to do as little as possible
+		 * between here and writing the end-of-recovery record.
+		 */
+		writeTimeLineHistory(newTLI, recoveryTargetTLI,
+							 EndOfLog, endOfRecoveryInfo->recoveryStopReason);
+
+		ereport(LOG,
+				(errmsg("archive recovery complete")));
+	}
+
+	/* Save the selected TimeLineID in shared memory, too */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->InsertTimeLineID = newTLI;
+	XLogCtl->PrevTimeLineID = endOfRecoveryInfo->lastRecTLI;
+	SpinLockRelease(&XLogCtl->info_lck);
+
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
@@ -6134,16 +6157,6 @@ StartupXLOG(void)
 	/* Shut down xlogreader */
 	ShutdownWalRecovery();
 
-	/* Enable WAL writes for this backend only. */
-	LocalSetXLogInsertAllowed();
-
-	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
-	{
-		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
-		CreateOverwriteContrecordRecord(abortedRecPtr, missingContrecPtr, newTLI);
-	}
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 346319338a0..41f6577989e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3168,19 +3168,9 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 			 * of that record.  After recovery is done, we'll write a record
 			 * to indicate to downstream WAL readers that that portion is to
 			 * be ignored.
-			 *
-			 * However, when ArchiveRecoveryRequested = true, we're going to
-			 * switch to a new timeline at the end of recovery. We will only
-			 * copy WAL over to the new timeline up to the end of the last
-			 * complete record, so if we did this, we would later create an
-			 * overwrite contrecord in the wrong place, breaking everything.
 			 */
-			if (!ArchiveRecoveryRequested &&
-				!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
-			{
-				abortedRecPtr = xlogreader->abortedRecPtr;
-				missingContrecPtr = xlogreader->missingContrecPtr;
-			}
+			abortedRecPtr = xlogreader->abortedRecPtr;
+			missingContrecPtr = xlogreader->missingContrecPtr;
 
 			if (readFile >= 0)
 			{
-- 
2.51.0

From 7d2647b42bd49c7caf1ae5995d937145e709e46c Mon Sep 17 00:00:00 2001
From: Alena Vinter <[email protected]>
Date: Sun, 15 Jun 2025 01:29:23 +0700
Subject: [PATCH 2/3] Removed assertion in walsummarizer

Fixes an edge case in the walsummarizer process where fetching InsertTLI
during a timeline switch could lead to latest_lsn < read_upto (which
becomes walrcv->flushedUpto in this case), previously triggering an
assertion failure. We now handle this safely by replacing the assertion
with conditional logic.
---
 src/backend/postmaster/walsummarizer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index e1f142f20c7..ede8f1e1214 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -1551,8 +1551,8 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
 				if (private_data->tli == latest_tli)
 				{
 					/* Still the current timeline, update max LSN. */
-					Assert(latest_lsn >= private_data->read_upto);
-					private_data->read_upto = latest_lsn;
+					if (latest_lsn >= private_data->read_upto)
+						private_data->read_upto = latest_lsn;
 				}
 				else
 				{
-- 
2.51.0

From a189e0d922c19a23739a4508b0e722c7459e872d Mon Sep 17 00:00:00 2001
From: Alena Vinter <[email protected]>
Date: Wed, 10 Sep 2025 14:06:09 +0700
Subject: [PATCH 3/3] Handle rewind failure when a timeline ends with an
 overwritten contrecord.

When a common timeline ends with an overwritten contrecord, the
divergence point may not point to the start of a valid WAL record on the
target, causing errors and making rewind impossible.

To handle this case, when the target timeline is unfinished, we look for
a checkpoint preceding the divergence point starting from the last
checkpoint on the target rather than from the divergence point itself.
This ensures we always begin from a known-valid position in WAL.
---
 src/bin/pg_rewind/parsexlog.c | 25 +++++++++++++------------
 src/bin/pg_rewind/pg_rewind.c | 18 ++++++++++++++++--
 src/bin/pg_rewind/pg_rewind.h |  7 ++++---
 3 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 8f4b282c6b1..7d01fefb1d5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -165,9 +165,10 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
  * Find the previous checkpoint preceding given WAL location.
  */
 void
-findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
-				   XLogRecPtr *lastchkptrec, TimeLineID *lastchkpttli,
-				   XLogRecPtr *lastchkptredo, const char *restoreCommand)
+findLastCheckpoint(const char *datadir, XLogRecPtr startptr, XLogRecPtr forkptr,
+				   int tliIndex, XLogRecPtr *lastchkptrec,
+				   TimeLineID *lastchkpttli, XLogRecPtr *lastchkptredo,
+				   const char *restoreCommand)
 {
 	/* Walk backwards, starting from the given record */
 	XLogRecord *record;
@@ -179,17 +180,17 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 	TimeLineID	current_tli = 0;

 	/*
-	 * The given fork pointer points to the end of the last common record,
-	 * which is not necessarily the beginning of the next record, if the
-	 * previous record happens to end at a page boundary. Skip over the page
-	 * header in that case to find the next record.
+	 * The given start pointer may point to a page boundary if the startptr is
+	 * the end of the last common record which is not necessarily the beginning
+	 * of the next record. Skip over the page header in that case to find the
+	 * next record.
 	 */
-	if (forkptr % XLOG_BLCKSZ == 0)
+	if (startptr % XLOG_BLCKSZ == 0)
 	{
-		if (XLogSegmentOffset(forkptr, WalSegSz) == 0)
-			forkptr += SizeOfXLogLongPHD;
+		if (XLogSegmentOffset(startptr, WalSegSz) == 0)
+			startptr += SizeOfXLogLongPHD;
 		else
-			forkptr += SizeOfXLogShortPHD;
+			startptr += SizeOfXLogShortPHD;
 	}

 	private.tliIndex = tliIndex;
@@ -200,7 +201,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 	if (xlogreader == NULL)
 		pg_fatal("out of memory while allocating a WAL reading processor");

-	searchptr = forkptr;
+	searchptr = startptr;
 	for (;;)
 	{
 		uint8		info;
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0c68dd4235e..c2ecf72cacf 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -140,6 +140,7 @@ main(int argc, char **argv)
 	int			option_index;
 	int			c;
 	XLogRecPtr	divergerec;
+	XLogRecPtr	checkpoint_search_startrec;
 	int			lastcommontliIndex;
 	XLogRecPtr	chkptrec;
 	TimeLineID	chkpttli;
@@ -459,8 +460,21 @@ main(int argc, char **argv)
 	/* Initialize hashtable that tracks WAL files protected from removal */
 	keepwal_init();

-	findLastCheckpoint(datadir_target, divergerec, lastcommontliIndex,
-					   &chkptrec, &chkpttli, &chkptredo, restore_command);
+	/*
+	 * If the last common timeline is incomplete on the target, a divergence
+	 * point from the source's finished timeline may not exist in the target's
+	 * WAL. Therefore, start searching for a checkpoint preceding the divergence
+	 * point from the last checkpoint on the target server to find a safe common
+	 * point.
+	 */
+	if (targetHistory[lastcommontliIndex].end == InvalidXLogRecPtr)
+		checkpoint_search_startrec = ControlFile_target.checkPoint;
+	else
+		checkpoint_search_startrec = divergerec;
+
+	findLastCheckpoint(datadir_target, checkpoint_search_startrec, divergerec,
+					   lastcommontliIndex, &chkptrec, &chkpttli, &chkptredo,
+					   restore_command);
 	pg_log_info("rewinding from last common checkpoint at %X/%08X on timeline %u",
 				LSN_FORMAT_ARGS(chkptrec), chkpttli);

diff --git a/src/bin/pg_rewind/pg_rewind.h b/src/bin/pg_rewind/pg_rewind.h
index 9cea144d2b2..4879be1d1d4 100644
--- a/src/bin/pg_rewind/pg_rewind.h
+++ b/src/bin/pg_rewind/pg_rewind.h
@@ -35,9 +35,10 @@ extern uint64 fetch_done;
 extern void extractPageMap(const char *datadir, XLogRecPtr startpoint,
 						   int tliIndex, XLogRecPtr endpoint,
 						   const char *restoreCommand);
-extern void findLastCheckpoint(const char *datadir, XLogRecPtr forkptr,
-							   int tliIndex,
-							   XLogRecPtr *lastchkptrec, TimeLineID *lastchkpttli,
+extern void findLastCheckpoint(const char *datadir, XLogRecPtr startptr,
+							   XLogRecPtr forkptr, int tliIndex,
+							   XLogRecPtr *lastchkptrec,
+							   TimeLineID *lastchkpttli,
 							   XLogRecPtr *lastchkptredo,
 							   const char *restoreCommand);
 extern XLogRecPtr readOneRecord(const char *datadir, XLogRecPtr ptr,
--
2.51.0

Re: Timeline switching with partial WAL records can break replica recovery

Reply via email to