Re: [HACKERS] Crash on promotion when recovery.conf is renamed

Michael Paquier Sat, 02 Sep 2017 05:18:09 -0700

On Fri, Sep 1, 2017 at 7:17 PM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> On Sat, Aug 26, 2017 at 7:32 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Sat, Apr 8, 2017 at 10:05 AM, David Steele <da...@pgmasters.net> wrote:
>>> On 3/28/17 1:21 AM, Tsunakawa, Takayuki wrote:
>>>> Thanks, marked as ready for committer, as the code is good and Alexander 
>>>> reported the test success.
>>>
>>> This bug has been moved to CF 2017-07.
>>
>> This bug fix has been pending in "Ready for Committer" state for about
>> 4.5 months.  Three committers (Magnus, Heikki, Tom) have contributed
>> to the thread to date.  Maybe one of them would like to commit this?
>
> In the meantime its bits have begun to rot.  Michael, could you please rebase?


Thanks for the reminder, Thomas. The 2PC code triggered during
recovery has been largely reworked in PG10, explaining the conflicts.
As this has been some time since I touched this patch, I had again a
look at its logic and could not find any problems around it. So
attached are a rebased versions for both 0001 and 0002, I have updated
the messages as well to be more in-line with what is in HEAD for
corrupted entries.
-- 
Michael

From 2f3a16c0c0cb12f8bfef2d58656352c46a681393 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Sat, 2 Sep 2017 21:15:04 +0900
Subject: [PATCH 2/2] Minimize window between history file and end-of-recovery
 record

Once a standby node is promoted, this makes the assignment of the new
timeline number booked earlier as the history file gets archived immediately.
This way the other nodes are aware that this new timeline number is taken
and should not be assigned to other nodes.

The window between which the history file is archived and the end-of-recovery
record is written cannot be zeroed, but this way it is minimized as much as
possible. The new order of actions prevents as well a corrupted data directory
on failure.
---
 src/backend/access/transam/xlog.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 80fe2c60e6..abc0cec3d2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7449,6 +7449,24 @@ StartupXLOG(void)
 		else
 			snprintf(reason, sizeof(reason), "no recovery target specified");
 
+		/*
+		 * We are now done reading the old WAL.  Turn off archive fetching
+		 * if it was active, and make a writable copy of the last WAL segment.
+		 * (Note that we also have a copy of the last block of the old WAL
+		 * in readBuf; we will use that below.)
+		 */
+		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby servers.
+		 * If we crash before actually switching to the new timeline, standby
+		 * servers will nevertheless think that we switched to the new timeline,
+		 * and will try to connect to the new timeline. To minimize the window
+		 * for that, try to do as little as possible between here and writing the
+		 * end-of-recovery record.
+		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 							 EndRecPtr, reason);
 	}
@@ -7457,15 +7475,6 @@ StartupXLOG(void)
 	XLogCtl->ThisTimeLineID = ThisTimeLineID;
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
-	/*
-	 * We are now done reading the old WAL.  Turn off archive fetching if it
-	 * was active, and make a writable copy of the last WAL segment. (Note
-	 * that we also have a copy of the last block of the old WAL in readBuf;
-	 * we will use that below.)
-	 */
-	if (ArchiveRecoveryRequested)
-		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
 	/*
 	 * Prepare to write WAL starting at EndOfLog location, and init xlog
 	 * buffer cache using the block containing the last record from the
-- 
2.14.1

From 12ab1765f7ea032db217f2081f9561c00d1d96df Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Sat, 2 Sep 2017 21:07:21 +0900
Subject: [PATCH 1/2] Change detection of corrupted 2PC files as FATAL

When scanning 2PC files, be it for initializing a hot standby node or
finding the XID range at the end of recovery, bumping on a corrupted
2PC file is reported as a WARNING and are blindly corrupted. If it
happened that the corrupted file was actually legit, there is actually
a risk of corruption, so switch that to a hard failure.
---
 src/backend/access/transam/twophase.c | 27 +++++++++++----------------
 src/backend/access/transam/xlog.c     | 10 +++++++---
 2 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ba03d9687e..4b4f2449f8 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1778,6 +1778,10 @@ restoreTwoPhaseData(void)
  * write a WAL entry, and so there might be no evidence in WAL of those
  * subxact XIDs.
  *
+ * On corrupted two-phase files, fail immediately. Keeping around broken
+ * entries and let replay continue causes harm on the system, and likely
+ * a new backup should be rolled in.
+ *
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return ShmemVariableCache->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
@@ -2070,13 +2074,9 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		/* Read and validate file */
 		buf = ReadTwoPhaseFile(xid, true);
 		if (buf == NULL)
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state file for \"%u\"",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state file for \"%u\"",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
-			return NULL;
-		}
 	}
 	else
 	{
@@ -2088,20 +2088,15 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	hdr = (TwoPhaseFileHeader *) buf;
 	if (!TransactionIdEquals(hdr->xid, xid))
 	{
+
 		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state file for \"%u\"",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state file for \"%u\"",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
 		else
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state from memory for \"%u\"",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state in memory for \"%u\"",
 							xid)));
-			PrepareRedoRemove(xid, true);
-		}
 		pfree(buf);
 		return NULL;
 	}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index df4843f409..80fe2c60e6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7388,6 +7388,13 @@ StartupXLOG(void)
 		}
 	}
 
+	/*
+	 * Pre-scan prepared transactions to find out the range of XIDs present.
+	 * We don't need this information quite yet, but if it fails for some
+	 * reason, better to fail before we make any on-disk changes.
+	 */
+	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+
 	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
@@ -7511,9 +7518,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/* Pre-scan prepared transactions to find out the range of XIDs present */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
-- 
2.14.1

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

Reply via email to