[HACKERS] exitArchiveRecovery woes

Heikki Linnakangas Wed, 17 Dec 2014 05:41:59 -0800

At the end of archive recovery, we copy the last segment from the old timeline to initialize the first segment on the new timeline. For example, if the timeline switch happens in the middle of WAL segment 000000010000000000000005, the whole 000000010000000000000005 segment is copied to become 000000020000000000000005. The copy is necessary so that the new segment contains valid data up to the switch point.
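
To make the naming in that example concrete, here is a minimal sketch of how a 24-hex-digit segment file name is put together from the timeline ID and the segment number. The helper name wal_file_name and the two macros are made up for illustration, and the sketch assumes the default 16 MB segment size; in the backend this job is done by XLogFileName.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: default 16 MB segments, i.e. 256 segments per 4 GB of WAL. */
#define SEG_SIZE         (UINT64_C(16) * 1024 * 1024)
#define SEGS_PER_XLOGID  (UINT64_C(0x100000000) / SEG_SIZE)

/*
 * Hypothetical helper: format a WAL segment file name as three 8-digit
 * hex fields: timeline ID, high part and low part of the segment number.
 */
static void
wal_file_name(char *out, size_t outlen, uint32_t tli, uint64_t segno)
{
    snprintf(out, outlen, "%08X%08X%08X",
             (unsigned) tli,
             (unsigned) (segno / SEGS_PER_XLOGID),
             (unsigned) (segno % SEGS_PER_XLOGID));
}
```

With these assumptions, timeline 1 and segment 5 give 000000010000000000000005, and the same segment number on timeline 2 gives 000000020000000000000005: only the first field differs between the two files.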

However, we don't really need to copy the whole segment; copying up to the switch point would be enough. In fact, copying the whole segment is a bad idea, because the copied WAL looks valid on the new timeline too. When we read the WAL at crash recovery, we rely on a number of things to determine whether the next WAL record is valid, most importantly the checksum and the prev-pointer. The checksum protects against random data appearing valid, and the prev-pointer makes sure that a WAL record copied from another location in the WAL is not mistaken for a valid one. The prev-pointer is particularly important when we recycle old WAL segments as new, because the old segment contains valid WAL records with checksums and all. When we copy a WAL segment with the same segment number, however, the prev-pointer doesn't protect us, as there can be WAL records at the exact same locations in both segments.

There is a timeline ID in the page header, but we could still be mistaken within the page. Also, we are lenient with the TLI at the start of WAL recovery, when we read the first WAL record after the checkpoint. There are further safeguards, like the fact that when writing WAL, we always write full blocks. But the write could still be torn at the OS or disk level, if you crash after writing the WAL but before fsyncing it.
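
The checksum and prev-pointer argument can be illustrated with a toy model. This is not PostgreSQL's record format or its validation code: ToyRecord, toy_crc and toy_record_is_valid are invented for the example, and the "checksum" is just a multiplicative hash. The only point is that a record is accepted when both the checksum and the prev-pointer match what the reader expects.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a WAL record header; NOT the real XLogRecord layout. */
typedef struct ToyRecord
{
    uint64_t prev;      /* location of the previous record */
    uint32_t crc;       /* checksum over the payload */
    uint32_t payload;   /* stand-in for the record contents */
} ToyRecord;

/* Toy checksum, for illustration only. */
static uint32_t
toy_crc(uint32_t payload)
{
    return payload * 2654435761u;
}

/* Accept a record only if its checksum and its prev-pointer both match. */
static bool
toy_record_is_valid(const ToyRecord *rec, uint64_t expected_prev)
{
    return rec->crc == toy_crc(rec->payload) &&
           rec->prev == expected_prev;
}
```

In this model, a leftover record in a recycled segment has a good checksum but a prev-pointer from its old location, so it is rejected. A record in a segment copied under the same segment number sits at the same location with the same prev-pointer, so it passes both checks even though it belongs to the old timeline.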

This is largely academic, but I was able to craft a test case where WAL recovery mistakenly starts to replay the WAL copied from the old timeline, as if it were on the new timeline. Attached is the shell script I used. It's very sensitive to the lengths of the WAL records, so it probably only works on a platform similar to mine (x86_64 Linux). Running pitr-test.sh ends with this:

S LOG:  database system was interrupted; last known up at 2014-12-17 15:15:42 EET
S LOG:  database system was not properly shut down; automatic recovery in progress
S LOG:  redo starts at 0/50A2018
S PANIC:  heap_insert_redo: invalid max offset number
S CONTEXT:  xlog redo Heap/INSERT: off 28
S LOG:  startup process (PID 10640) was terminated by signal 6: Aborted
S LOG:  aborting startup due to startup process failure

That PANIC happens because recovery tries to apply WAL from a different timeline, and that fails because the page the record modifies is missing an earlier change. (If you were unlucky, you could get silent corruption instead, if the WAL record happens to apply without an error.)

A simple way to avoid this is to copy the old WAL segment only up to the point of the timeline switch, and zero the rest.
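
As a rough sketch of that idea, independent of the backend's file I/O: the function below copies the first upto bytes of a source buffer block by block and zero-fills everything after that. The name copy_segment_upto and the tiny TOY_BLCKSZ are made up for the example, and it works on in-memory buffers rather than files. It uses unsigned arithmetic clamped at zero, so that no signed value is ever compared against sizeof().

```c
#include <stddef.h>
#include <string.h>

#define TOY_BLCKSZ 8    /* hypothetical tiny block size, for illustration */

/*
 * Copy the first 'upto' bytes of src into dst one block at a time, and fill
 * the rest of dst with zeros. Both buffers are 'segsize' bytes long, and
 * segsize is assumed to be a multiple of TOY_BLCKSZ.
 */
static void
copy_segment_upto(unsigned char *dst, const unsigned char *src,
                  size_t segsize, size_t upto)
{
    unsigned char buffer[TOY_BLCKSZ];
    size_t      nbytes;

    for (nbytes = 0; nbytes < segsize; nbytes += sizeof(buffer))
    {
        /* Bytes still to take from the source; 0 once past the switch point. */
        size_t      nread = (upto > nbytes) ? upto - nbytes : 0;

        if (nread > sizeof(buffer))
            nread = sizeof(buffer);

        /* Zero the block unless source data overwrites all of it. */
        if (nread < sizeof(buffer))
            memset(buffer, 0, sizeof(buffer));

        memcpy(buffer, src + nbytes, nread);
        memcpy(dst + nbytes, buffer, sizeof(buffer));
    }
}
```

Calling it with upto set to the offset of the switch point within the segment leaves everything after that offset as zeros, so nothing past the switch point can be mistaken for a record on the new timeline.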

Another thing I noticed is that we copy the last old WAL segment to the new timeline even if the timeline switch happens at a segment boundary. In that case, the copied WAL segment is 100% identical to the old segment; it contains no records belonging to the new timeline. I guess that's not wrong per se, but it seems pointless and confusing.
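
The boundary case is easy to see with a little arithmetic. The sketch below assumes the default 16 MB segment size; seg_of and prev_seg_of are illustrative names that mirror what XLByteToSeg and XLByteToPrevSeg compute for a given WAL location.

```c
#include <stdint.h>

#define SEG_SIZE (UINT64_C(16) * 1024 * 1024)   /* assumed default: 16 MB */

/* Segment containing the byte at 'ptr' (what XLByteToSeg computes). */
static uint64_t
seg_of(uint64_t ptr)
{
    return ptr / SEG_SIZE;
}

/* Segment containing the byte just before 'ptr' (XLByteToPrevSeg). */
static uint64_t
prev_seg_of(uint64_t ptr)
{
    return (ptr - 1) / SEG_SIZE;
}
```

For a switch point in the middle of a segment, such as 0/50A2018, both functions return 5, so the last old segment and the first new segment have the same number. For a switch exactly at 0/6000000, prev_seg_of returns 5 and seg_of returns 6, so the new timeline starts in a fresh segment and there is nothing to copy.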

Attached is a patch that addresses both of those issues. This doesn't seem worth the risk to back-patch, but let's fix these in master.

PS. The "if (endTLI != ThisTimeLineID)" test in exitArchiveRecovery was always true, because we always switch to a new timeline after archive recovery. I turned that into an Assert.


- Heikki

pitr-test.sh
Description: Bourne shell script

*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 2923,2934 **** XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
   * srcTLI, srclog, srcseg: identify segment to be copied (could be from
   *		a different timeline)
   *
   * Currently this is only used during recovery, and so there are no locking
   * considerations.  But we should be just as tense as XLogFileInit to avoid
   * emplacing a bogus file.
   */
  static void
! XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno)
  {
  	char		path[MAXPGPATH];
  	char		tmppath[MAXPGPATH];
--- 2923,2937 ----
   * srcTLI, srclog, srcseg: identify segment to be copied (could be from
   *		a different timeline)
   *
+  * upto: how much of the source file to copy? (the rest is filled with zeros)
+  *
   * Currently this is only used during recovery, and so there are no locking
   * considerations.  But we should be just as tense as XLogFileInit to avoid
   * emplacing a bogus file.
   */
  static void
! XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
! 			 int upto)
  {
  	char		path[MAXPGPATH];
  	char		tmppath[MAXPGPATH];
***************
*** 2967,2982 **** XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno)
  	 */
  	for (nbytes = 0; nbytes < XLogSegSize; nbytes += sizeof(buffer))
  	{
! 		errno = 0;
! 		if ((int) read(srcfd, buffer, sizeof(buffer)) != (int) sizeof(buffer))
  		{
! 			if (errno != 0)
! 				ereport(ERROR,
! 						(errcode_for_file_access(),
! 						 errmsg("could not read file \"%s\": %m", path)));
! 			else
! 				ereport(ERROR,
! 						(errmsg("not enough data in file \"%s\"", path)));
  		}
  		errno = 0;
  		if ((int) write(fd, buffer, sizeof(buffer)) != (int) sizeof(buffer))
--- 2970,3000 ----
  	 */
  	for (nbytes = 0; nbytes < XLogSegSize; nbytes += sizeof(buffer))
  	{
! 		int			nread;
! 
! 		nread = upto - nbytes;
! 
! 		/*
! 		 * The part that is not read from the source file is filled with zeros.
! 		 */
! 		if (nread < (int) sizeof(buffer))
! 			memset(buffer, 0, sizeof(buffer));
! 
! 		if (nread > 0)
  		{
! 			if (nread > sizeof(buffer))
! 				nread = sizeof(buffer);
! 			errno = 0;
! 			if (read(srcfd, buffer, nread) != nread)
! 			{
! 				if (errno != 0)
! 					ereport(ERROR,
! 							(errcode_for_file_access(),
! 							 errmsg("could not read file \"%s\": %m", path)));
! 				else
! 					ereport(ERROR,
! 							(errmsg("not enough data in file \"%s\"", path)));
! 			}
  		}
  		errno = 0;
  		if ((int) write(fd, buffer, sizeof(buffer)) != (int) sizeof(buffer))
***************
*** 4984,4991 **** exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
  	char		recoveryPath[MAXPGPATH];
  	char		xlogfname[MAXFNAMELEN];
  	XLogSegNo	endLogSegNo;
  
! 	XLByteToPrevSeg(endOfLog, endLogSegNo);	
  
  	/*
  	 * We are no longer in archive recovery state.
--- 5002,5011 ----
  	char		recoveryPath[MAXPGPATH];
  	char		xlogfname[MAXFNAMELEN];
  	XLogSegNo	endLogSegNo;
+ 	XLogSegNo	startLogSegNo;
  
! 	/* we always switch to a new timeline after archive recovery */
! 	Assert(endTLI != ThisTimeLineID);
  
  	/*
  	 * We are no longer in archive recovery state.
***************
*** 5008,5026 **** exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
  	}
  
  	/*
! 	 * If we are establishing a new timeline, we have to copy data from the
! 	 * last WAL segment of the old timeline to create a starting WAL segment
! 	 * for the new timeline. (Unless the switch happens to be at a segment
! 	 * boundary.)
  	 *
  	 * Notify the archiver that the last WAL segment of the old timeline is
  	 * ready to copy to archival storage if its .done file doesn't exist
  	 * (e.g., if it's the restored WAL file, it's expected to have .done file).
  	 * Otherwise, it is not archived for a while.
  	 */
! 	if (endTLI != ThisTimeLineID && endOfLog % XLOG_SEG_SIZE != 0)
  	{
! 		XLogFileCopy(endLogSegNo, endTLI, endLogSegNo);
  
  		/* Create .ready file only when neither .ready nor .done files exist */
  		if (XLogArchivingActive())
--- 5028,5056 ----
  	}
  
  	/*
! 	 * Calculate the last segment on the old timeline, and the first segment
! 	 * on the new timeline. If the switch happens in the middle of a segment,
! 	 * they are the same, but if the switch happens exactly at a segment
! 	 * boundary, startLogSegNo will be endLogSegNo + 1.
! 	 */
! 	XLByteToPrevSeg(endOfLog, endLogSegNo);
! 	XLByteToSeg(endOfLog, startLogSegNo);
! 
! 	/*
! 	 * Initialize the starting WAL segment for the new timeline. If the switch
! 	 * happens in the middle of a segment, copy data from the last WAL segment
! 	 * of the old timeline up to the switch point, to the starting WAL segment
! 	 * on the new timeline.
  	 *
  	 * Notify the archiver that the last WAL segment of the old timeline is
  	 * ready to copy to archival storage if its .done file doesn't exist
  	 * (e.g., if it's the restored WAL file, it's expected to have .done file).
  	 * Otherwise, it is not archived for a while.
  	 */
! 	if (endLogSegNo == startLogSegNo)
  	{
! 		XLogFileCopy(startLogSegNo, endTLI, endLogSegNo,
! 					 endOfLog % XLOG_SEG_SIZE);
  
  		/* Create .ready file only when neither .ready nor .done files exist */
  		if (XLogArchivingActive())
***************
*** 5033,5046 **** exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
  	{
  		bool		use_existent = true;
  
! 		XLogFileInit(xlogfname, &use_existent, true);
  	}
  
  	/*
  	 * Let's just make real sure there are not .ready or .done flags posted
  	 * for the new segment.
  	 */
! 	XLogFileName(xlogfname, ThisTimeLineID, endLogSegNo);
  	XLogArchiveCleanup(xlogfname);
  
  	/*
--- 5063,5076 ----
  	{
  		bool		use_existent = true;
  
! 		XLogFileInit(startLogSegNo, &use_existent, true);
  	}
  
  	/*
  	 * Let's just make real sure there are not .ready or .done flags posted
  	 * for the new segment.
  	 */
! 	XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo);
  	XLogArchiveCleanup(xlogfname);
  
  	/*
***************
*** 5628,5634 **** StartupXLOG(void)
  	XLogRecPtr	RecPtr,
  				checkPointLoc,
  				EndOfLog;
! 	XLogSegNo	endLogSegNo;
  	TimeLineID	PrevTimeLineID;
  	XLogRecord *record;
  	TransactionId oldestActiveXID;
--- 5658,5664 ----
  	XLogRecPtr	RecPtr,
  				checkPointLoc,
  				EndOfLog;
! 	XLogSegNo	startLogSegNo;
  	TimeLineID	PrevTimeLineID;
  	XLogRecord *record;
  	TransactionId oldestActiveXID;
***************
*** 6608,6614 **** StartupXLOG(void)
  	 */
  	record = ReadRecord(xlogreader, LastRec, PANIC, false);
  	EndOfLog = EndRecPtr;
! 	XLByteToPrevSeg(EndOfLog, endLogSegNo);
  
  	/*
  	 * Complain if we did not roll forward far enough to render the backup
--- 6638,6644 ----
  	 */
  	record = ReadRecord(xlogreader, LastRec, PANIC, false);
  	EndOfLog = EndRecPtr;
! 	XLByteToSeg(EndOfLog, startLogSegNo);
  
  	/*
  	 * Complain if we did not roll forward far enough to render the backup
***************
*** 6716,6722 **** StartupXLOG(void)
  	 * buffer cache using the block containing the last record from the
  	 * previous incarnation.
  	 */
! 	openLogSegNo = endLogSegNo;
  	openLogFile = XLogFileOpen(openLogSegNo);
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;
--- 6746,6752 ----
  	 * buffer cache using the block containing the last record from the
  	 * previous incarnation.
  	 */
! 	openLogSegNo = startLogSegNo;
  	openLogFile = XLogFileOpen(openLogSegNo);
  	openLogOff = 0;
  	Insert = &XLogCtl->Insert;

