Re: [BUGS] Recovery bug

Heikki Linnakangas Tue, 26 Oct 2010 00:48:52 -0700

On 25.10.2010 19:04, Jeff Davis wrote:

On Mon, 2010-10-25 at 14:44 +0300, Heikki Linnakangas wrote:

It seems we should use ReadRecord instead of the lower-level
XLogPageRead function. One difference is that ReadRecord performs a
bunch of sanity checks on the record, while XLogPageRead just reads the
raw page. Extra sanity checking before removing backup_label seems like
a good idea. Another difference is that in standby-mode, ReadRecord will
retry until it succeeds. A standby server should keep retrying, even the
very first record, until it succeeds, otherwise we have a change in
behavior.


The reason I didn't use ReadRecord is because it sets a global variable
to point to the next location in the log, so that subsequent calls can
just pass NULL for the location.

True. XLogPageRead is new in 9.0, however. We'll have to use ReadRecord
or invent something new for back-branches anyway.

It looks like the patch leaves the global variable pointing just after
the redo location rather than the checkpoint. I haven't tested your
patch yet, but it looks like some of the following code depends on
ReadRecord(NULL,...) fetching the record right after the checkpoint
record; so I think something else is required if you want to use
ReadRecord.
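
To make the concern above concrete, here is a minimal toy model of a reader that remembers where the next record starts. It is only a sketch and not PostgreSQL code: positions are plain integers, and the names toy_next_pos, toy_read_record and toy_next_after_probe are hypothetical stand-ins invented for illustration.

```c
#include <stddef.h>

/* Toy WAL: record i lives at position i; positions are plain integers. */
#define TOY_WAL_LEN 8

/* Hypothetical stand-in for the reader's private "next record" pointer. */
static int toy_next_pos = 0;

/*
 * Toy analogue of ReadRecord: an explicit position repositions the reader,
 * while NULL continues right after the previously returned record.
 * Returns the position that was read, or -1 if it is outside the toy log.
 */
int
toy_read_record(const int *pos)
{
    int at = (pos != NULL) ? *pos : toy_next_pos;

    if (at < 0 || at >= TOY_WAL_LEN)
        return -1;
    toy_next_pos = at + 1;      /* side effect: remember where to continue */
    return at;
}

/*
 * Read the checkpoint record, probe the redo location, then continue
 * with NULL. Returns the position that the NULL read ends up fetching.
 */
int
toy_next_after_probe(int checkpoint, int redo)
{
    toy_read_record(&checkpoint);
    toy_read_record(&redo);
    return toy_read_record(NULL);
}
```

In this model, toy_next_after_probe(5, 2) returns 3, the record after the redo location, rather than 6, the record after the checkpoint. When redo and checkpoint are the same position, toy_next_after_probe(5, 5) returns 6, so the probe leaves the reader where the later code expects it.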


Hmm, the next call to ReadRecord is this:

                /*
                 * Find the first record that logically follows the checkpoint --- it
                 * might physically precede it, though.
                 */
                if (XLByteLT(checkPoint.redo, RecPtr))
                {
                        /* back up to find the record */
                        record = ReadRecord(&(checkPoint.redo), PANIC, false);
                }
                else
                {
                        /* just have to read next record after CheckPoint */
                        record = ReadRecord(NULL, LOG, false);
                }

In the first case, the location is given explicitly. In the second case,
the redo pointer equals the checkpoint record, so the current position
is correct even with the patch. It makes me slightly nervous, though.
It's correct today, but if someone adds code between the backup_label
check and this that assumes that the current position is the checkpoint
record, it'll fail. Then again, any new ReadRecord call in such added
code would also break the assumption in the above block that the current
position is the checkpoint record.

In the case that the redo pointer is the same as the checkpoint record,
we don't need to re-fetch the checkpoint record. I've added a test for
that in the attached patch.
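
As a rough sketch of that test, assuming WAL positions are compared first by log file number and then by offset (which is what XLByteLT is used for in the code above), the decision could be modelled like this. ToyRecPtr, toy_byte_lt and toy_must_probe_redo are hypothetical names for illustration only, not the definitions in xlog.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two-part WAL position, modelled on the xlogid/xrecoff fields in the patch. */
typedef struct ToyRecPtr
{
    uint32_t    xlogid;     /* log file number */
    uint32_t    xrecoff;    /* byte offset within that log file */
} ToyRecPtr;

/* True if a lies strictly before b (assumed semantics of XLByteLT). */
bool
toy_byte_lt(ToyRecPtr a, ToyRecPtr b)
{
    return a.xlogid < b.xlogid ||
        (a.xlogid == b.xlogid && a.xrecoff < b.xrecoff);
}

/*
 * The redo location needs a separate read only when it precedes the
 * checkpoint record. If the two are equal, the checkpoint record that
 * was just read already proves the redo location exists.
 */
bool
toy_must_probe_redo(ToyRecPtr redo, ToyRecPtr checkpoint_loc)
{
    return toy_byte_lt(redo, checkpoint_loc);
}
```

Skipping the extra read in the equal case also means the reader's current position is still the checkpoint record when the later ReadRecord(NULL, ...) call runs.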


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22fd578..6f1fedd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5839,14 +5839,29 @@ StartupXLOG(void)
 		record = ReadCheckpointRecord(checkPointLoc, 0);
 		if (record != NULL)
 		{
+			memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
 			ereport(DEBUG1,
 					(errmsg("checkpoint record is at %X/%X",
 							checkPointLoc.xlogid, checkPointLoc.xrecoff)));
 			InRecovery = true;	/* force recovery even if SHUTDOWNED */
+
+			/*
+			 * Make sure that REDO location exists. This may not be
+			 * the case if there was a crash during an online backup,
+			 * which left a backup_label around that references a WAL
+			 * segment that's already been archived.
+			 */
+			if (XLByteLT(checkPoint.redo, checkPointLoc))
+			{
+				if (!ReadRecord(&(checkPoint.redo), LOG, false))
+					ereport(FATAL,
+							(errmsg("could not find redo location referenced by checkpoint record"),
+							 errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
+			}
 		}
 		else
 		{
-			ereport(PANIC,
+			ereport(FATAL,
 					(errmsg("could not locate required checkpoint record"),
 					 errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
 		}
@@ -5892,10 +5907,10 @@ StartupXLOG(void)
 				ereport(PANIC,
 					 (errmsg("could not locate a valid checkpoint record")));
 		}
+		memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
 	}
 
 	LastRec = RecPtr = checkPointLoc;
-	memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
 	wasShutdown = (record->xl_info == XLOG_CHECKPOINT_SHUTDOWN);
 
 	ereport(DEBUG1,
