On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:
At least for master, we should consider changing the way the archiving
works so that we only archive WAL that was generated in the same server.
I.e. we should never try to archive WAL files belonging to another timeline.

I just remembered that we discussed a different problem related to this
some time ago, at
http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyot...@lab.ntt.co.jp.
The conclusion of that was that at promotion, we should not archive the
last, partial, segment from the old timeline.

So, this is what I came up with for master. Does anyone see a problem with it?

- Heikki

>From 9115276335d37d7d2dab55ff8b0642041c16bfc3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Fri, 24 Oct 2014 15:39:02 +0300
Subject: [PATCH] Don't try to archive WAL from different timelines.

Previously, we would always create a .done file for files restored from the
archive, or streamed from a master server. That served two purposes: it
allowed the files to be recycled at restartpoints, and ensured that we don't
try to re-archive the files after promotion. However, there were some gaps
in that approach: if streaming replication was stopped at a segment boundary,
no .done file was created. The standby server might also crash just before
it has created the .done file.

This is a more wholesale solution to the problem. WAL files belonging to
a different timeline are never archived, regardless of .done or .ready
files. The rule is that the server that generates a WAL file is solely
responsible for archiving it.

This also changes the long-standing behavior at end of recovery, where we
archived the last, partial, WAL segment from the old timeline. We no longer
do that; if you copy any files manually to pg_xlog, and want them to be
archived, you should copy them manually to the archive, too. (The first
segment on the new timeline, which is copied from the last partial one, will
be archived as usual.)

This is the same idea as in commit c9cc7e05c6d82a9781883a016c70d95aa4923122,
which was reverted in favor of the .done file approach in commit
1bd42cd70abdbc946ad64c3c8eaefed4bb8b1145. Turns out the .done file approach
was not such a good idea after all. The behavior at end-of-recovery was
discussed on pgsql-hackers back in February 2014
(20131212.110002.204892575.horiguchi.kyot...@lab.ntt.co.jp), but the plan
was never followed through (my bad, I forgot all about).

This should fix the issue reported by Jehan-Guillaume de Rorthais, where
a standby server sometimes creates .ready files for WAL files streamed from
the master. But this is too risky to backpatch - the changes in behavior
are quite signicant - so we'll have to fix it with some other approach in
back-branches.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e77af22..2370467 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -208,7 +208,8 @@ static int	LocalXLogInsertAllowed = -1;
  * When ArchiveRecoveryRequested is set, archive recovery was requested,
  * ie. recovery.conf file was present. When InArchiveRecovery is set, we are
  * currently recovering using offline XLOG archives. These variables are only
- * valid in the startup process.
+ * valid in the startup process, but there is a copy of InArchiveRecovery
+ * in shared memory.
  *
  * When ArchiveRecoveryRequested is true, but InArchiveRecovery is false, we're
  * currently performing crash recovery using only XLOG files in pg_xlog, but
@@ -546,9 +547,11 @@ typedef struct XLogCtlData
 
 	/*
 	 * SharedRecoveryInProgress indicates if we're still in crash or archive
-	 * recovery.  Protected by info_lck.
+	 * recovery.  SharedInArchiveRecovery indicates if we're currently in
+	 * archive recovery.  Protected by info_lck.
 	 */
 	bool		SharedRecoveryInProgress;
+	bool		SharedInArchiveRecovery;
 
 	/*
 	 * SharedHotStandbyActive indicates if we're still in crash or archive
@@ -748,6 +751,7 @@ static MemoryContext walDebugCxt = NULL;
 #endif
 
 static void readRecoveryCommandFile(void);
+static void enterArchiveRecovery(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
 static bool recoveryStopsBefore(XLogRecord *record);
 static bool recoveryStopsAfter(XLogRecord *record);
@@ -4168,9 +4172,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			{
 				ereport(DEBUG1,
 						(errmsg_internal("reached end of WAL in pg_xlog, entering archive recovery")));
-				InArchiveRecovery = true;
-				if (StandbyModeRequested)
-					StandbyMode = true;
+				enterArchiveRecovery();
 
 				/* initialize minRecoveryPoint to this record */
 				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
@@ -4866,6 +4868,7 @@ XLOGShmemInit(void)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
 	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedInArchiveRecovery = false;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -5316,6 +5319,18 @@ readRecoveryCommandFile(void)
 	FreeConfigVariables(head);
 }
 
+static void
+enterArchiveRecovery(void)
+{
+	InArchiveRecovery = true;
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedInArchiveRecovery = true;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (StandbyModeRequested)
+		StandbyMode = true;
+}
+
 /*
  * Exit archive-recovery state
  */
@@ -5326,11 +5341,6 @@ exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
 	char		xlogpath[MAXPGPATH];
 
 	/*
-	 * We are no longer in archive recovery state.
-	 */
-	InArchiveRecovery = false;
-
-	/*
 	 * Update min recovery point one last time.
 	 */
 	UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
@@ -5349,22 +5359,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
 	 * If we are establishing a new timeline, we have to copy data from the
 	 * last WAL segment of the old timeline to create a starting WAL segment
 	 * for the new timeline.
-	 *
-	 * Notify the archiver that the last WAL segment of the old timeline is
-	 * ready to copy to archival storage. Otherwise, it is not archived for a
-	 * while.
 	 */
 	if (endTLI != ThisTimeLineID)
-	{
 		XLogFileCopy(endLogSegNo, endTLI, endLogSegNo);
 
-		if (XLogArchivingActive())
-		{
-			XLogFileName(xlogpath, endTLI, endLogSegNo);
-			XLogArchiveNotify(xlogpath);
-		}
-	}
-
 	/*
 	 * Let's just make real sure there are not .ready or .done flags posted
 	 * for the new segment.
@@ -5394,6 +5392,12 @@ exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
 
+	/* Ok, we're out of archive recovery state. */
+	InArchiveRecovery = false;
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedInArchiveRecovery = false;
+	SpinLockRelease(&XLogCtl->info_lck);
+
 	ereport(LOG,
 			(errmsg("archive recovery complete")));
 }
@@ -6108,9 +6112,7 @@ StartupXLOG(void)
 		 * file, we know how far we need to replay to reach consistency. Enter
 		 * archive recovery directly.
 		 */
-		InArchiveRecovery = true;
-		if (StandbyModeRequested)
-			StandbyMode = true;
+		enterArchiveRecovery();
 
 		/*
 		 * When a backup_label file is present, we want to roll forward from
@@ -6174,9 +6176,7 @@ StartupXLOG(void)
 			 ControlFile->backupEndPoint != InvalidXLogRecPtr ||
 			 ControlFile->state == DB_SHUTDOWNED))
 		{
-			InArchiveRecovery = true;
-			if (StandbyModeRequested)
-				StandbyMode = true;
+			enterArchiveRecovery();
 		}
 
 		/*
@@ -7382,6 +7382,23 @@ RecoveryInProgress(void)
 }
 
 /*
+ * Are we currently in archive recovery? In the startup process, you can just
+ * check InArchiveRecovery variable instead.
+ */
+bool
+ArchiveRecoveryInProgress()
+{
+	bool		result;
+
+	/* spinlock is essential on machines with weak memory ordering! */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	result = XLogCtl->SharedInArchiveRecovery;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return result;
+}
+
+/*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
  * true. Postmaster knows this by way of signal, not via shared memory.
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 047efa2..7f200f9 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -477,12 +477,6 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 						path, xlogfpath)));
 
 	/*
-	 * Create .done file forcibly to prevent the restored segment from being
-	 * archived again later.
-	 */
-	XLogArchiveForceDone(xlogfname);
-
-	/*
 	 * If the existing file was replaced, since walsenders might have it open,
 	 * request them to reload a currently-open segment. This is only required
 	 * for WAL segments, walsenders don't hold other files open, but there's
@@ -554,59 +548,6 @@ XLogArchiveNotifySeg(XLogSegNo segno)
 }
 
 /*
- * XLogArchiveForceDone
- *
- * Emit notification forcibly that an XLOG segment file has been successfully
- * archived, by creating <XLOG>.done regardless of whether <XLOG>.ready
- * exists or not.
- */
-void
-XLogArchiveForceDone(const char *xlog)
-{
-	char		archiveReady[MAXPGPATH];
-	char		archiveDone[MAXPGPATH];
-	struct stat stat_buf;
-	FILE	   *fd;
-
-	/* Exit if already known done */
-	StatusFilePath(archiveDone, xlog, ".done");
-	if (stat(archiveDone, &stat_buf) == 0)
-		return;
-
-	/* If .ready exists, rename it to .done */
-	StatusFilePath(archiveReady, xlog, ".ready");
-	if (stat(archiveReady, &stat_buf) == 0)
-	{
-		if (rename(archiveReady, archiveDone) < 0)
-			ereport(WARNING,
-					(errcode_for_file_access(),
-					 errmsg("could not rename file \"%s\" to \"%s\": %m",
-							archiveReady, archiveDone)));
-
-		return;
-	}
-
-	/* insert an otherwise empty file called <XLOG>.done */
-	fd = AllocateFile(archiveDone, "w");
-	if (fd == NULL)
-	{
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not create archive status file \"%s\": %m",
-						archiveDone)));
-		return;
-	}
-	if (FreeFile(fd))
-	{
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not write archive status file \"%s\": %m",
-						archiveDone)));
-		return;
-	}
-}
-
-/*
  * XLogArchiveCheckDone
  *
  * This is called when we are ready to delete or recycle an old XLOG segment
@@ -630,6 +571,30 @@ XLogArchiveCheckDone(const char *xlog)
 	if (!XLogArchivingActive())
 		return true;
 
+	/*
+	 * While we are still in archive recovery, we have not generated any WAL
+	 * ourselves yet. We're not responsible for archiving the segments we're
+	 * replaying; the server that generated them originally is.
+	 *
+	 * During normal operation, too, we're not responsible for archiving
+	 * segments from timelines other than the one we're generating. Such
+	 * segments might be left over in pg_xlog, after a point-in-time
+	 * recovery, or promotion of a standby server.
+	 */
+	if (ArchiveRecoveryInProgress())
+		return true;
+
+	/* ThisTimeLineID should be set, but err on the side of caution */
+	if (ThisTimeLineID != 0)
+	{
+		TimeLineID	fileTLI = 0;
+		uint64		logSegNo;
+
+		XLogFromFileName(xlog, &fileTLI, &logSegNo);
+		if (fileTLI != 0 && fileTLI != ThisTimeLineID)
+			return true;
+	}
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c2d4ed3..284de9c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -138,6 +138,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvFileClose(void);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -536,24 +537,7 @@ WalReceiverMain(void)
 		 * segment, and await for new orders from the startup process.
 		 */
 		if (recvFile >= 0)
-		{
-			char		xlogfname[MAXFNAMELEN];
-
-			XLogWalRcvFlush(false);
-			if (close(recvFile) != 0)
-				ereport(PANIC,
-						(errcode_for_file_access(),
-						 errmsg("could not close log segment %s: %m",
-								XLogFileNameP(recvFileTLI, recvSegNo))));
-
-			/*
-			 * Create .done file forcibly to prevent the streamed segment from
-			 * being archived later.
-			 */
-			XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-			XLogArchiveForceDone(xlogfname);
-		}
-		recvFile = -1;
+			XLogWalRcvFileClose();
 
 		elog(DEBUG1, "walreceiver ended streaming and awaits new instructions");
 		WalRcvWaitForStartPosition(&startpoint, &startpointTLI);
@@ -887,30 +871,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			 * would otherwise have to reopen this file to fsync it later
 			 */
 			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									XLogFileNameP(recvFileTLI, recvSegNo))));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-				XLogArchiveForceDone(xlogfname);
-			}
-			recvFile = -1;
+				XLogWalRcvFileClose();
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo);
@@ -1022,6 +983,29 @@ XLogWalRcvFlush(bool dying)
 }
 
 /*
+ * Close the currently open WAL file.
+ */
+static void
+XLogWalRcvFileClose(void)
+{
+	Assert(recvFile >= 0);
+
+	XLogWalRcvFlush(false);
+
+	/*
+	 * XLOG segment files will be re-read by recovery in startup process soon,
+	 * so we don't advise the OS to release cache pages associated with the
+	 * file like XLogFileClose() does.
+	 */
+	if (close(recvFile) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not close log segment %s: %m",
+						XLogFileNameP(recvFileTLI, recvSegNo))));
+	recvFile = -1;
+}
+
+/*
  * Send reply message to primary, indicating our current XLOG positions, oldest
  * xmin and the current time.
  *
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0ae110f..e6c6737 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -310,6 +310,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern bool ArchiveRecoveryInProgress(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 27b9899..3bb891d 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -166,6 +166,10 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 			 (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
 			 (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
 
+/*
+ * Extract TLI and segment number from WAL filename. Also works with backup
+ * history files.
+ */
 #define XLogFromFileName(fname, tli, logSegNo)	\
 	do {												\
 		uint32 log;										\
@@ -286,7 +290,6 @@ extern void ExecuteRecoveryCommand(char *command, char *commandName,
 extern void KeepFileRestoredFromArchive(char *path, char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
-extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
 extern void XLogArchiveCleanup(const char *xlog);
-- 
2.1.1

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to