On 17.12.2012 18:58, Magnus Hagander wrote:
On Mon, Dec 17, 2012 at 5:19 PM, Tom Lane<t...@sss.pgh.pa.us>  wrote:
Heikki Linnakangas<hlinnakan...@vmware.com>  writes:
I'm not happy with the fact that we just ignore the problem in a backup
taken from a standby, silently giving the user a backup that won't start
up. Why not include the timeline history file in the backup?

+1.  I was not aware that we weren't doing that --- it seems pretty
foolish, especially since as you say they're tiny.

Yeah, +1. That should probably have been a part of the whole
"basebackup from slave" patch, so it can probably be considered a
back-patchable bugfix in itself, no?

Yes, this should be backpatched to 9.2. I came up with the attached.

However, thinking about this some more, there's a another bug in the way WAL files are included in the backup, when a timeline switch happens. basebackup.c includes all the WAL files on ThisTimeLineID, but when the backup is taken from a standby, the standby might've followed a timeline switch. So it's possible that some of the WAL files should come from timeline 1, while others should come from timeline 2. This leads to an error like "requested WAL segment 00000001000000000000000C has already been removed" in pg_basebackup.

Attached is a script to reproduce that bug, if someone wants to play with it. It's a bit sensitive to timing, and needs tweaking the paths at the top.

One solution to that would be to pay more attention to the timelines to include WAL from. basebackup.c could read the timeline history file, to see exactly where the timeline switches happened, and then construct the filename of each WAL segment using the correct timeline id. Another approach would be to do readdir() on pg_xlog, and include all WAL files, regardless of timeline IDs, that fall in the right XLogRecPtr range. The latter seems easier to backpatch.

- Heikki
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 65200c1..5c0deaa 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -12,12 +12,13 @@
  */
 #include "postgres.h"
 
+#include <string.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <unistd.h>
 #include <time.h>
 
-#include "access/xlog_internal.h"		/* for pg_start/stop_backup */
+#include "access/xlog_internal.h"
 #include "catalog/pg_type.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq.h"
@@ -44,6 +45,7 @@ typedef struct
 
 
 static int64 sendDir(char *path, int basepathlen, bool sizeonly);
+static void sendTimeLineHistoryFiles(void);
 static void sendFile(char *readfilename, char *tarfilename,
 		 struct stat * statbuf);
 static void sendFileWithContent(const char *filename, const char *content);
@@ -286,6 +288,27 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 				break;
 		}
 
+		/*
+		 * Include all timeline history files.
+		 *
+		 * The timeline history files are usually not strictly required to
+		 * restore the backup, but if you take a backup from a standby server,
+		 * and the WAL segment containing the checkpoint record contains WAL
+		 * from an older timeline, recovery will complain on the older
+		 * timeline's ID if there is no timeline history file listing it. This
+		 * can happen if you take a backup right after promoting a standby to
+		 * become new master, and take the backup from a different, cascading
+		 * standby server.
+		 *
+		 * However, even when not strictly required, the timeline history
+		 * files are tiny, and provide a lot of forensic information about the
+		 * recovery history of the database, so it's best to always include
+		 * them all. (If asked to include WAL, that is. Otherwise you need a
+		 * WAL archive to restore anyway, and the timeline history files
+		 * should be present in the archive)
+		 */
+		sendTimeLineHistoryFiles();
+
 		/* Send CopyDone message for the last tar file */
 		pq_putemptymessage('c');
 	}
@@ -726,6 +749,58 @@ sendDir(char *path, int basepathlen, bool sizeonly)
 	return size;
 }
 
+/*
+ * Include all timeline history files from pg_xlog in the output tar stream.
+ */
+static void
+sendTimeLineHistoryFiles(void)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		pathbuf[MAXPGPATH];
+	struct stat statbuf;
+
+	dir = AllocateDir("./pg_xlog");
+	while ((de = ReadDir(dir, "./pg_xlog")) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		if (strlen(de->d_name) == 8 + strlen(".history") &&
+			strspn(de->d_name, "0123456789ABCDEF") == 8 &&
+			strcmp(de->d_name + 8, ".history") == 0)
+		{
+			/* It looks like a timeline history file. Include it. */
+			snprintf(pathbuf, MAXPGPATH, "./pg_xlog/%s", de->d_name);
+
+			if (lstat(pathbuf, &statbuf) != 0)
+			{
+				if (errno != ENOENT)
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not stat file or directory \"%s\": %m",
+									pathbuf)));
+
+				/* If the file went away while scanning, it's no error. */
+				continue;
+			}
+
+			if (!S_ISREG(statbuf.st_mode))
+			{
+				/*
+				 * Huh? It's named like a timeline history file, but isn't a
+				 * regular file.
+				 */
+				ereport(WARNING,
+						(errmsg("skipping special file \"%s\"", pathbuf)));
+				continue;
+			}
+
+			sendFile(pathbuf, pathbuf + 1, &statbuf);
+		}
+	}
+	FreeDir(dir);
+}
+
 /*****
  * Functions for handling tar file format
  *

Attachment: recipe12.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to