On 20.11.2012 15:33, Amit Kapila wrote:
Defect-2:
     1. start primary A
     2. start standby B following A
     3. start cascade standby C following B.
     4. Start another standby D following C.
     5. Execute the following commands in the primary A.
            create table tbl(f int);
            insert into tbl values(generate_series(1,1000));
     6. Promote standby B.
     7. Execute the following commands in the primary B.
            insert into tbl values(generate_series(1001,2000));
            insert into tbl values(generate_series(2001,3000));

     The following logs are observed on standby C:

     LOG:  restarted WAL streaming at position 0/7000000 on tli 2
     ERROR:  requested WAL segment 000000020000000000000007 has already been
removed
     LOG:  record with zero length at 0/7028190
     LOG:  record with zero length at 0/7048540
     LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
000000020000000000000007, offset 0

Hmm, this one is actually a pre-existing bug. There's a sanity check that the sequence of timeline IDs that are seen in the XLOG page headers doesn't go backwards. In other words, if the last XLOG page that was read had timeline id X, the next page must have a tli >= X. The startup process keeps track of the last seen timeline id in lastPageTLI. In standby_mode, when the startup process is reading from a pre-existing file in pg_xlog (typically put there by streaming replication) and it reaches the end of valid WAL (marked by an error in decoding it, ie. "record with zero length" in your case), it sleeps for five seconds and retries. At retry, the WAL file is re-opened, and as part of sanity checking it, the first page header in the file is validated.

Now, if there was a timeline change in the current WAL segment, and we've already replayed past that point, lastPageTLI will already be set to the new TLI, but the first page on the file contains the old TLI. When the file is re-opened, and the first page is validated, you get the error.

The fix is quite straightforward: we should refrain from checking the TLI when we re-open a WAL file. Or better yet, compare it against the TLI we saw at the beginning of the last WAL segment, not the last WAL page.

I propose the attached patch (against 9.2) to fix that. This should be backpatched to 9.0, where standby_mode was introduced. The code was the same in 8.4, too, but AFAICS there was no problem there because 8.4 never tried to re-open the same WAL segment after replaying some of it.

- Heikki
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8614907..045d21d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -572,6 +572,7 @@ static uint32 readRecordBufSize = 0;
 static XLogRecPtr ReadRecPtr;	/* start of last record read */
 static XLogRecPtr EndRecPtr;	/* end+1 of last record read */
 static TimeLineID lastPageTLI = 0;
+static TimeLineID lastSegmentTLI = 0;
 
 static XLogRecPtr minRecoveryPoint;		/* local copy of
 										 * ControlFile->minRecoveryPoint */
@@ -655,7 +656,7 @@ static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool ValidXLOGHeader(XLogPageHeader hdr, int emode);
+static bool ValidXLOGHeader(XLogPageHeader hdr, int emode, bool segmentonly);
 static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
 static List *readTimeLineHistory(TimeLineID targetTLI);
 static bool existsTimeLineHistory(TimeLineID probeTLI);
@@ -3927,7 +3928,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 		 * to go backwards (but we can't reset that variable right here, since
 		 * we might not change files at all).
 		 */
-		lastPageTLI = 0;		/* see comment in ValidXLOGHeader */
+		lastPageTLI = lastSegmentTLI = 0;	/* see comment in ValidXLOGHeader */
 		randAccess = true;		/* allow curFileTLI to go backwards too */
 	}
 
@@ -4190,7 +4191,7 @@ next_record_is_invalid:
  * ReadRecord.	It's not intended for use from anywhere else.
  */
 static bool
-ValidXLOGHeader(XLogPageHeader hdr, int emode)
+ValidXLOGHeader(XLogPageHeader hdr, int emode, bool segmentonly)
 {
 	XLogRecPtr	recaddr;
 
@@ -4285,18 +4286,31 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 	 * successive pages of a consistent WAL sequence.
 	 *
 	 * Of course this check should only be applied when advancing sequentially
-	 * across pages; therefore ReadRecord resets lastPageTLI to zero when
-	 * going to a random page.
+	 * across pages; therefore ReadRecord resets lastPageTLI and
+	 * lastSegmentTLI to zero when going to a random page.
+	 *
+	 * Sometimes we re-open a segment that's already been partially replayed.
+	 * In that case we cannot perform the normal TLI check: if there is a
+	 * timeline switch within the segment, the first page has a smaller TLI
+	 * than later pages following the timeline switch, and we might've read
+	 * them already. As a weaker test, we still check that it's not smaller
+	 * than the TLI we last saw at the beginning of a segment. Pass
+	 * segmentonly = true when re-validating the first page like that, and the
+	 * page you're actually interested in comes later.
 	 */
-	if (hdr->xlp_tli < lastPageTLI)
+	if (hdr->xlp_tli < (segmentonly ? lastSegmentTLI : lastPageTLI))
 	{
 		ereport(emode_for_corrupt_record(emode, recaddr),
 				(errmsg("out-of-sequence timeline ID %u (after %u) in log file %u, segment %u, offset %u",
-						hdr->xlp_tli, lastPageTLI,
+						hdr->xlp_tli,
+						segmentonly ? lastSegmentTLI : lastPageTLI,
 						readId, readSeg, readOff)));
 		return false;
 	}
 	lastPageTLI = hdr->xlp_tli;
+	if (readOff == 0)
+		lastSegmentTLI = hdr->xlp_tli;
+
 	return true;
 }
 
@@ -10440,7 +10454,7 @@ retry:
 							readId, readSeg, readOff)));
 			goto next_record_is_invalid;
 		}
-		if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
+		if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, true))
 			goto next_record_is_invalid;
 	}
 
@@ -10462,7 +10476,7 @@ retry:
 				readId, readSeg, readOff)));
 		goto next_record_is_invalid;
 	}
-	if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
+	if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
 		goto next_record_is_invalid;
 
 	Assert(targetId == readId);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to