Re: [HACKERS] Switching timeline over streaming replication

Heikki Linnakangas Wed, 21 Nov 2012 10:06:52 -0800

On 20.11.2012 15:33, Amit Kapila wrote:

Defect-2:
     1. start primary A
     2. start standby B following A
     3. start cascade standby C following B.
     4. Start another standby D following C.
     5. Execute the following commands in the primary A.
            create table tbl(f int);
            insert into tbl values(generate_series(1,1000));
     6. Promote standby B.
     7. Execute the following commands in the primary B.
            insert into tbl values(generate_series(1001,2000));
            insert into tbl values(generate_series(2001,3000));


     The following logs are observed on standby C:

     LOG:  restarted WAL streaming at position 0/7000000 on tli 2
     ERROR:  requested WAL segment 000000020000000000000007 has already been
removed
     LOG:  record with zero length at 0/7028190
     LOG:  record with zero length at 0/7048540
     LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000007, offset 0

Hmm, this one is actually a pre-existing bug. There's a sanity checkthat the sequence of timeline IDs that are seen in the XLOG page headersdoesn't go backwards. In other words, if the last XLOG page that wasread had timeline id X, the next page must have a tli >= X. The startupprocess keeps track of the last seen timeline id in lastPageTLI. Instandby_mode, when the startup process is reading from a pre-existingfile in pg_xlog (typically put there by streaming replication) and itreaches the end of valid WAL (marked by an error in decoding it, ie."record with zero length" in your case), it sleeps for five seconds andretries. At retry, the WAL file is re-opened, and as part of sanitychecking it, the first page header in the file is validated.

Now, if there was a timeline change in the current WAL segment, andwe've already replayed past that point, lastPageTLI will already be setto the new TLI, but the first page on the file contains the old TLI.When the file is re-opened, and the first page is validated, you get theerror.

The fix is quite straightforward: we should refrain from checking theTLI when we re-open a WAL file. Or better yet, compare it against theTLI we saw at the beginning of the last WAL segment, not the last WAL page.

I propose the attached patch (against 9.2) to fix that. This should bebackpatched to 9.0, where standby_mode was introduced. The code was thesame in 8.4, too, but AFAICS there was no problem there because 8.4never tried to re-open the same WAL segment after replaying some of it.


- Heikki

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8614907..045d21d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -572,6 +572,7 @@ static uint32 readRecordBufSize = 0;
 static XLogRecPtr ReadRecPtr;	/* start of last record read */
 static XLogRecPtr EndRecPtr;	/* end+1 of last record read */
 static TimeLineID lastPageTLI = 0;
+static TimeLineID lastSegmentTLI = 0;
 
 static XLogRecPtr minRecoveryPoint;		/* local copy of
 										 * ControlFile->minRecoveryPoint */
@@ -655,7 +656,7 @@ static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool ValidXLOGHeader(XLogPageHeader hdr, int emode);
+static bool ValidXLOGHeader(XLogPageHeader hdr, int emode, bool segmentonly);
 static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
 static List *readTimeLineHistory(TimeLineID targetTLI);
 static bool existsTimeLineHistory(TimeLineID probeTLI);
@@ -3927,7 +3928,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 		 * to go backwards (but we can't reset that variable right here, since
 		 * we might not change files at all).
 		 */
-		lastPageTLI = 0;		/* see comment in ValidXLOGHeader */
+		lastPageTLI = lastSegmentTLI = 0;	/* see comment in ValidXLOGHeader */
 		randAccess = true;		/* allow curFileTLI to go backwards too */
 	}
 
@@ -4190,7 +4191,7 @@ next_record_is_invalid:
  * ReadRecord.	It's not intended for use from anywhere else.
  */
 static bool
-ValidXLOGHeader(XLogPageHeader hdr, int emode)
+ValidXLOGHeader(XLogPageHeader hdr, int emode, bool segmentonly)
 {
 	XLogRecPtr	recaddr;
 
@@ -4285,18 +4286,31 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 	 * successive pages of a consistent WAL sequence.
 	 *
 	 * Of course this check should only be applied when advancing sequentially
-	 * across pages; therefore ReadRecord resets lastPageTLI to zero when
-	 * going to a random page.
+	 * across pages; therefore ReadRecord resets lastPageTLI and
+	 * lastSegmentTLI to zero when going to a random page.
+	 *
+	 * Sometimes we re-open a segment that's already been partially replayed.
+	 * In that case we cannot perform the normal TLI check: if there is a
+	 * timeline switch within the segment, the first page has a smaller TLI
+	 * than later pages following the timeline switch, and we might've read
+	 * them already. As a weaker test, we still check that it's not smaller
+	 * than the TLI we last saw at the beginning of a segment. Pass
+	 * segmentonly = true when re-validating the first page like that, and the
+	 * page you're actually interested in comes later.
 	 */
-	if (hdr->xlp_tli < lastPageTLI)
+	if (hdr->xlp_tli < (segmentonly ? lastSegmentTLI : lastPageTLI))
 	{
 		ereport(emode_for_corrupt_record(emode, recaddr),
 				(errmsg("out-of-sequence timeline ID %u (after %u) in log file %u, segment %u, offset %u",
-						hdr->xlp_tli, lastPageTLI,
+						hdr->xlp_tli,
+						segmentonly ? lastSegmentTLI : lastPageTLI,
 						readId, readSeg, readOff)));
 		return false;
 	}
 	lastPageTLI = hdr->xlp_tli;
+	if (readOff == 0)
+		lastSegmentTLI = hdr->xlp_tli;
+
 	return true;
 }
 
@@ -10440,7 +10454,7 @@ retry:
 							readId, readSeg, readOff)));
 			goto next_record_is_invalid;
 		}
-		if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
+		if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, true))
 			goto next_record_is_invalid;
 	}
 
@@ -10462,7 +10476,7 @@ retry:
 				readId, readSeg, readOff)));
 		goto next_record_is_invalid;
 	}
-	if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
+	if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
 		goto next_record_is_invalid;
 
 	Assert(targetId == readId);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Switching timeline over streaming replication

Reply via email to