Hi hackers, With the addition of archive modules (5ef1eef) and other archiving improvements (e.g., beb4e9b), the amount of archiving overhead has been reduced. I spent some time measuring the remaining overhead, and the following function stood out:
/* * pgarch_archiveDone * * Emit notification that an xlog file has been successfully archived. * We do this by renaming the status file from NNN.ready to NNN.done. * Eventually, a checkpoint process will notice this and delete both the * NNN.done file and the xlog file itself. */ static void pgarch_archiveDone(char *xlog) { char rlogready[MAXPGPATH]; char rlogdone[MAXPGPATH]; StatusFilePath(rlogready, xlog, ".ready"); StatusFilePath(rlogdone, xlog, ".done"); (void) durable_rename(rlogready, rlogdone, WARNING); } Specifically, the call to durable_rename(), which calls fsync() a few times, can cause a decent delay. On my machine, archiving ~12k WAL files using a bare-bones archive module that did nothing but return true took over a minute. When I changed this to rename() (i.e., reverting the change to pgarch.c in 1d4a0ab), the same test took a second or less. I also spent some time investigating whether durably renaming the archive status files was even necessary. In theory, it shouldn't be. If a crash happens before the rename is persisted, we might try to re-archive files, but your archive_command/archive_library should handle that. If the file was already recycled or removed, the archiver will skip it (thanks to 6d8727f). But when digging further, I found that WAL file recyling uses durable_rename_excl(), which has the following note: * Note that a crash in an unfortunate moment can leave you with two links to * the target file. IIUC this means that in theory, a crash at an unfortunate moment could leave you with a .ready file, the file to archive, and another link to that file with a "future" WAL filename. If you re-archive the file after it has been reused, you could end up with corrupt WAL archives. I think this might already be possible today. Syncing the directory after every rename probably reduces the likelihood quite substantially, but if RemoveOldXlogFiles() quickly picks up the .done file and attempts recycling before durable_rename() calls fsync() on the directory, presumably the same problem could occur. I believe the current recommendation is to fail if the archive file already exists. The basic_archive contrib module fails if the archive already exists and has different contents, and it succeeds without re-archiving if it already exists with identical contents. Since something along these lines is recommended as a best practice, perhaps this is okay. But I think we might be able to do better. If we ensure that the archive_status directory is sync'd prior to recycling/removing a WAL file, I believe we are safe from the aforementioned problem. The drawback of this is that we are essentially offloading more work to the checkpointer process. In addition to everything else it must do, it will also need to fsync() the archive_status directory many times. I'm not sure how big of a problem this is. It should be good to ensure that archiving keeps up, and there are far more expensive tasks that the checkpointer must perform that could make the added fsync() calls relatively insignificant. I've attached a patch that makes the changes discussed above. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
>From a6d033aff90a6218345cba41847ccfdfbe6447d7 Mon Sep 17 00:00:00 2001 From: Nathan Bossart <nathandboss...@gmail.com> Date: Mon, 21 Feb 2022 16:29:14 -0800 Subject: [PATCH v1 1/1] remove more archiving overhead --- src/backend/access/transam/xlogarchive.c | 15 +++++++++++++-- src/backend/postmaster/pgarch.c | 13 ++++++++++++- 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index a2657a2005..d6823c9c38 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -585,8 +585,13 @@ XLogArchiveForceDone(const char *xlog) * If it is not time to delete it, make sure a .ready file exists, and return * false. * - * If <XLOG>.done exists, then return true; else if <XLOG>.ready exists, - * then return false; else create <XLOG>.ready and return false. + * If <XLOG>.done exists, then sync the archive_status directory to disk and + * return true. Syncing the directory ensures that the rename from + * <XLOG>.ready to <XLOG>.done is persisted so that we won't attempt to re- + * archive the file after a crash. + * + * If <XLOG>.ready exists, then return false. If neither <XLOG>.ready nor + * <XLOG>.done exist, create <XLOG>.ready and return false. * * The reason we do things this way is so that if the original attempt to * create <XLOG>.ready fails, we'll retry during subsequent checkpoints. @@ -618,7 +623,10 @@ XLogArchiveCheckDone(const char *xlog) /* First check for .done --- this means archiver is done with it */ StatusFilePath(archiveStatusPath, xlog, ".done"); if (stat(archiveStatusPath, &stat_buf) == 0) + { + fsync_fname(XLOGDIR "/archive_status", true); return true; + } /* check for .ready --- this means archiver is still busy with it */ StatusFilePath(archiveStatusPath, xlog, ".ready"); @@ -628,7 +636,10 @@ XLogArchiveCheckDone(const char *xlog) /* Race condition --- maybe archiver just finished, so recheck */ StatusFilePath(archiveStatusPath, xlog, ".done"); if (stat(archiveStatusPath, &stat_buf) == 0) + { + fsync_fname(XLOGDIR "/archive_status", true); return true; + } /* Retry creation of the .ready file */ XLogArchiveNotify(xlog); diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index d916ed39a8..44cb58e1f4 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -746,7 +746,18 @@ pgarch_archiveDone(char *xlog) StatusFilePath(rlogready, xlog, ".ready"); StatusFilePath(rlogdone, xlog, ".done"); - (void) durable_rename(rlogready, rlogdone, WARNING); + + /* + * To reduce overhead, we don't durably rename the .ready file to .done + * here. Whoever recycles or removes the WAL file is responsible for first + * persisting the archive_status directory to disk so that we won't attempt + * to re-archive the file after a crash. + */ + if (rename(rlogready, rlogdone) < 0) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not rename file \"%s\" to \"%s\": %m", + rlogready, rlogdone))); } -- 2.25.1