Hello, hackers!
I would like to thank the community and all participants of this thread
for their interest in this problem.
In our production system with tens of thousands PostgreSQL clusters we
encounter exactly the same issue and are forced to synchronize upstreams
and downstreams via external means, which is quite suboptimal.
I`ve done some work on top of the proposed v4 version patch and would
like to present v5 version for a discussion.
There are a number of changes, such as sending just TLI and Segno
instead of full WAL filename, shifting some work into archiver and
adding shared memory for walreceiver/archiver synchronization.
There are a number of issues currently unresolved, which are worth a
discussion.
1. Should we update pg_stat_archiver on standby to support cascading
replication or should we just resend the report, received from upstream?
Personally I'm more inclined towards the pg_stat_archiver path, because
this way there will be less `if-else` programming and
archive_mode=shared behaviour will be more monitoring-friendly.
2. What should we do with *.backup.ready and *.partial.ready on standby?
Can we just XLogArchiveForceDone() them?
3. Should we keep the awkward part with switchpont calculation in
timeline switch case? I think all segments that are not in our server
history should just be stamped with XLogArchiveForceDone().
4. Currently XLogArchiveForceDone is forced either by walreceiver (on
receiving report from upstream) and archiver. Should we move this into
the archiver entirely?
Any feedback will be much appreciated.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 67da9a1de66..bab0d624cee 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4046,14 +4046,36 @@ include_dir 'conf.d'
are sent to archive storage by setting
<xref linkend="guc-archive-command"/> or
<xref linkend="guc-archive-library"/>. In addition to <literal>off</literal>,
- to disable, there are two modes: <literal>on</literal>, and
- <literal>always</literal>. During normal operation, there is no
- difference between the two modes, but when set to <literal>always</literal>
- the WAL archiver is enabled also during archive recovery or standby
- mode. In <literal>always</literal> mode, all files restored from the archive
- or streamed with streaming physical replication will be archived (again). See
- <xref linkend="continuous-archiving-in-standby"/> for details.
+ to disable, there are three modes: <literal>on</literal>, <literal>shared</literal>,
+ and <literal>always</literal>. During normal operation as a primary, there is no
+ difference between the three modes, but they differ during archive recovery or
+ standby mode:
</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>on</literal>: Archives WAL only when running as a primary.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>shared</literal>: Coordinates archiving between primary and standby.
+ The standby defers WAL archival and deletion until the primary confirms
+ archival via streaming replication. This prevents WAL history loss during
+ standby promotion in high availability setups. Upon promotion, the standby
+ automatically starts archiving any remaining unarchived WAL. This mode works
+ with cascading replication, where each standby coordinates with its immediate
+ upstream server. See <xref linkend="continuous-archiving-in-standby"/> for details.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>always</literal>: Archives all WAL independently, even during recovery.
+ All files restored from the archive or streamed with streaming physical
+ replication will be archived (again), regardless of their source.
+ </para>
+ </listitem>
+ </itemizedlist>
<para>
<varname>archive_mode</varname> is a separate setting from
<varname>archive_command</varname> and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index be8d3a5bfea..e96a60bfb4c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1449,35 +1449,61 @@ postgres=# WAIT FOR LSN '0/306EE20';
</indexterm>
<para>
- When continuous WAL archiving is used in a standby, there are two
- different scenarios: the WAL archive can be shared between the primary
- and the standby, or the standby can have its own WAL archive. When
- the standby has its own WAL archive, set <varname>archive_mode</varname>
+ When continuous WAL archiving is used in a standby, there are three
+ different scenarios: the standby can have its own independent WAL archive,
+ the WAL archive can be shared between the primary and standby, or archiving
+ can be coordinated between them.
+ </para>
+
+ <para>
+ For an independent archive, set <varname>archive_mode</varname>
to <literal>always</literal>, and the standby will call the archive
command for every WAL segment it receives, whether it's by restoring
- from the archive or by streaming replication. The shared archive can
- be handled similarly, but the <varname>archive_command</varname> or <varname>archive_library</varname> must
- test if the file being archived exists already, and if the existing file
- has identical contents. This requires more care in the
- <varname>archive_command</varname> or <varname>archive_library</varname>, as it must
- be careful to not overwrite an existing file with different contents,
- but return success if the exactly same file is archived twice. And
- all that must be done free of race conditions, if two servers attempt
- to archive the same file at the same time.
+ from the archive or by streaming replication.
+ </para>
+
+ <para>
+ For a shared archive where both primary and standby can write, use
+ <literal>always</literal> mode as well, but the <varname>archive_command</varname>
+ or <varname>archive_library</varname> must test if the file being archived
+ exists already, and if the existing file has identical contents. This requires
+ more care in the <varname>archive_command</varname> or <varname>archive_library</varname>,
+ as it must be careful to not overwrite an existing file with different contents,
+ but return success if the exactly same file is archived twice. And all that must
+ be done free of race conditions, if two servers attempt to archive the same file
+ at the same time.
+ </para>
+
+ <para>
+ For coordinated archiving in high availability setups, use
+ <varname>archive_mode</varname>=<literal>shared</literal>. In this mode, only
+ the primary archives WAL segments. The standby creates <literal>.ready</literal>
+ files for received segments but defers actual archiving. The primary periodically
+ sends archival status updates to the standby via streaming replication, informing
+ it which segments have been archived. The standby then marks these as archived
+ and allows them to be recycled. Upon promotion, the standby automatically starts
+ archiving any remaining WAL segments that weren't confirmed as archived by the
+ former primary. This prevents WAL history loss during failover while avoiding
+ the complexity of coordinating concurrent archiving. This mode works with cascading
+ replication, where each standby coordinates with its immediate upstream server.
</para>
<para>
If <varname>archive_mode</varname> is set to <literal>on</literal>, the
- archiver is not enabled during recovery or standby mode. If the standby
- server is promoted, it will start archiving after the promotion, but
- will not archive any WAL or timeline history files that
- it did not generate itself. To get a complete
- series of WAL files in the archive, you must ensure that all WAL is
- archived, before it reaches the standby. This is inherently true with
- file-based log shipping, as the standby can only restore files that
- are found in the archive, but not if streaming replication is enabled.
- When a server is not in recovery mode, there is no difference between
- <literal>on</literal> and <literal>always</literal> modes.
+ archiver is not enabled during recovery or standby mode, and this setting
+ cannot be used on a standby. If a standby with <literal>archive_mode</literal>
+ set to <literal>on</literal> is promoted, it will start archiving after the
+ promotion, but will not archive any WAL or timeline history files that it did
+ not generate itself. To get a complete series of WAL files in the archive, you
+ must ensure that all WAL is archived before it reaches the standby. This is
+ inherently true with file-based log shipping, as the standby can only restore
+ files that are found in the archive, but not if streaming replication is enabled.
+ </para>
+
+ <para>
+ When a server is not in recovery mode, <literal>on</literal>,
+ <literal>shared</literal>, and <literal>always</literal> modes all behave
+ identically, archiving completed WAL segments.
</para>
</sect2>
</sect1>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f85b5286086..e41c25e8c1e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -199,6 +199,7 @@ const struct config_enum_entry archive_mode_options[] = {
{"always", ARCHIVE_MODE_ALWAYS, false},
{"on", ARCHIVE_MODE_ON, false},
{"off", ARCHIVE_MODE_OFF, false},
+ {"shared", ARCHIVE_MODE_SHARED, false},
{"true", ARCHIVE_MODE_ON, true},
{"false", ARCHIVE_MODE_OFF, true},
{"yes", ARCHIVE_MODE_ON, true},
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 0f207ac0356..d5cbcdb45a8 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -55,6 +55,9 @@
#include "utils/resowner.h"
#include "utils/timeout.h"
#include "utils/wait_event.h"
+#include "replication/walreceiver.h"
+#include "access/timeline.h"
+#include "access/xlogarchive.h"
/* ----------
@@ -108,6 +111,15 @@ static const ArchiveModuleCallbacks *ArchiveCallbacks;
static ArchiveModuleState *archive_module_state;
static MemoryContext archive_context;
+/*
+ * Last segment we successfully marked as .done. Used to optimize
+ * ProcessArchivalReport() by generating expected filenames instead
+ * of scanning the archive_status directory.
+ */
+static TimeLineID last_processed_tli = 0;
+static XLogSegNo last_processed_segno = 0;
+ArchivalReportData *ArchReport;
+static void ProcessArchivalReport(void);
/*
* Stuff for tracking multiple files to archive from each scan of
@@ -385,6 +397,18 @@ pgarch_ArchiverCopyLoop(void)
{
char xlog[MAX_XFN_CHARS + 1];
+ /*
+ * In shared archive mode during recovery, the archiver doesn't archive
+ * files. The primary is responsible for archiving, and the walreceiver
+ * marks files as .done when the primary confirms archival. After
+ * promotion, the archiver starts working normally.
+ */
+ if (XLogArchiveMode == ARCHIVE_MODE_SHARED && RecoveryInProgress())
+ {
+ ProcessArchivalReport();
+ return;
+ }
+
/* force directory scan in the first call to pgarch_readyXlog() */
arch_files->arch_files_size = 0;
@@ -960,3 +984,169 @@ pgarch_call_module_shutdown_cb(int code, Datum arg)
if (ArchiveCallbacks->shutdown_cb != NULL)
ArchiveCallbacks->shutdown_cb(archive_module_state);
}
+
+/*
+ * Process archival report from primary.
+ *
+ * The primary sends us the last WAL segment it has archived. We scan the
+ * archive_status directory for .ready files and mark segments on the same
+ * timeline as .done if they're <= the reported segment.
+ */
+static void
+ProcessArchivalReport()
+{
+ char walfile[MAX_XFN_CHARS + 1];
+ char primary_last_archived_fname[MAX_XFN_CHARS + 1];
+ char status_path[MAXPGPATH];
+// DIR *status_dir;
+// struct dirent *status_de;
+ List *tli_history = NIL;
+
+ if (ArchReport->tli == 0 || ArchReport->segno == 0)
+ {
+ ereport(DEBUG2,
+ (errmsg("archival report from upstream was not yes received")));
+ return;
+ }
+
+ XLogFileName(primary_last_archived_fname, ArchReport->tli, ArchReport->segno, wal_segment_size);
+
+ /*
+ * Optimization: If the new report is on the same timeline as the last
+ * processed segment and moves forward, we can directly check for .ready
+ * files for segments between last_processed_segno and reported_segno
+ * instead of scanning the entire archive_status directory.
+ *
+ * Fall back to directory scan if:
+ * - Timeline changed (need to handle ancestor timelines)
+ * - This is the first report (last_processed_tli == 0)
+ * - Reported segment is not ahead (nothing new to process)
+ */
+ if (last_processed_tli == ArchReport->tli &&
+ ArchReport->segno > last_processed_segno)
+ {
+ /*
+ * Direct check: generate filenames for expected segments.
+ * XLogArchiveForceDone() will handle the case where .ready doesn't
+ * exist or .done already exists, so no need to stat() first.
+ */
+ XLogSegNo segno;
+ XLogSegNo start_segno = last_processed_segno + 1;
+
+ for (segno = start_segno; segno <= ArchReport->segno; segno++)
+ {
+ char walfile[MAXFNAMELEN];
+
+ /* Generate WAL filename and mark as archived */
+ XLogFileName(walfile, ArchReport->tli, segno, wal_segment_size);
+ XLogArchiveForceDone(walfile);
+ ereport(DEBUG3,
+ (errmsg("marked WAL segment %s as archived (primary archived up to %s)",
+ walfile, primary_last_archived_fname)));
+
+ }
+ /* Track the last segment we processed */
+ last_processed_tli = ArchReport->tli;
+ last_processed_segno = segno;
+ return;
+ }
+
+ /*
+ * Directory scan: needed when timeline changed or first report.
+ * This handles both same-timeline and ancestor-timeline cases.
+ */
+ while (pgarch_readyXlog(walfile))
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Parse the WAL filename */
+ // TODO: we must handle somehow partial, .history and .backup files
+ if (!IsXLogFileName(walfile))
+ continue;
+
+ elog(WARNING, "found ready file for %s", walfile);
+
+ XLogFromFileName(walfile, &file_tli, &file_segno, wal_segment_size);
+
+ /*
+ * Mark as .done if:
+ * 1. Same timeline and segment <= reported segment, OR
+ * 2. Ancestor timeline and segment is before the timeline switch point
+ *
+ * For ancestor timelines: if primary archived segment X on timeline T,
+ * then all segments on ancestor timelines before the switch to T must
+ * have been archived (they're required to reach timeline T).
+ */
+ if (file_tli == ArchReport->tli)
+ {
+ // found walfile not yet archived by upstream, we should quit here
+ if (file_segno > ArchReport->segno)
+ {
+ elog(WARNING, "segment %s is not yet archived by upstream", walfile);
+ return;
+ }
+
+ /* Same timeline, segment already archived */
+ XLogArchiveForceDone(walfile);
+ ereport(DEBUG3,
+ (errmsg("marked WAL segment %s as archived (primary archived up to %s)",
+ walfile, primary_last_archived_fname)));
+ }
+ else
+ {
+ XLogRecPtr switchpoint;
+ XLogSegNo switchpoint_segno;
+ /*
+ * Different timeline - check if it's an ancestor and if this
+ * segment is before the timeline switch point. Only read timeline
+ * history if we haven't already (lazy loading).
+ *
+ * Note: Timelines form a tree structure, not a linear sequence,
+ * so we can't use < or > to compare them.
+ */
+
+ if (tli_history == NIL)
+ tli_history = readTimeLineHistory(ArchReport->tli);
+
+ // some garbage in archive_status, should error here
+ if (!tliInHistory(file_tli, tli_history))
+ {
+ ereport(ERROR,
+ (errmsg("walfile %s is not in this server's history", walfile)));
+ continue;
+ }
+
+ /* Get the point where we switched away from this timeline */
+ switchpoint = tliSwitchPoint(file_tli, tli_history, NULL);
+
+ /*
+ * If the segment is at or before the switch point, it must have
+ * been archived (it's required to reach the reported timeline).
+ * The segment containing the switch point belongs to the old
+ * timeline up to the switch point and should be archived.
+ */
+ XLByteToSeg(switchpoint, switchpoint_segno, wal_segment_size);
+ if (file_segno > switchpoint_segno)
+ {
+ ereport(ERROR,
+ (errmsg("walfile %s is not in this server's history", walfile)));
+ continue;
+ }
+
+ XLogArchiveForceDone(walfile);
+ ereport(DEBUG3,
+ (errmsg("marked ancestor timeline segment %s as archived (before switch to timeline %u)",
+ walfile, ArchReport->tli)));
+ }
+ /*
+ * Update our tracking to the newly reported position for future optimizations.
+ */
+ last_processed_tli = file_tli;
+ last_processed_segno = file_segno;
+ // TOOD: what if durable_rename failed in XLogArchiveForceDone ?
+ // we will erroneusly move last_processed_segno forward
+ }
+
+ //FreeDir(status_dir);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b6fd332f196..b574360056c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3401,7 +3401,9 @@ LaunchMissingBackgroundProcesses(void)
*/
if (PgArchPMChild == NULL &&
((XLogArchivingActive() && pmState == PM_RUN) ||
- (XLogArchivingAlways() && (pmState == PM_RECOVERY || pmState == PM_HOT_STANDBY))) &&
+ (XLogArchivingAlways() && (pmState == PM_RECOVERY || pmState == PM_HOT_STANDBY)) ||
+ (XLogArchivingShared())
+ ) &&
PgArchCanRestart())
PgArchPMChild = StartChildProcess(B_ARCHIVER);
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8185412a810..98ff3e4f03c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -636,7 +636,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
* Create .done file forcibly to prevent the streamed segment from
* being archived later.
*/
- if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+ if (XLogArchiveMode < ARCHIVE_MODE_ALWAYS)
XLogArchiveForceDone(xlogfname);
else
XLogArchiveNotify(xlogfname);
@@ -887,6 +887,19 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len, TimeLineID tli)
XLogWalRcvSendReply(true, false, false);
break;
}
+ case PqReplMsg_ArchiveStatusReport:
+ {
+ TimeLineID received_tli;
+ XLogSegNo received_segno;
+ StringInfoData incoming_message;
+
+ /* initialize a StringInfo with the given buffer */
+ initReadOnlyStringInfo(&incoming_message, buf, sizeof(int64) + sizeof(int64));
+ received_tli = (TimeLineID) pq_getmsgint64(&incoming_message);
+ received_segno = (XLogSegNo) pq_getmsgint64(&incoming_message);
+ StoreArchivalReport(received_tli, received_segno);
+ break;
+ }
default:
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1094,12 +1107,39 @@ XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli)
/*
* Create .done file forcibly to prevent the streamed segment from being
- * archived later.
+ * archived later, unless archive_mode is 'always' or 'shared'.
+ *
+ * In 'always' mode, the standby archives independently.
+ *
+ * In 'shared' mode, we optimize by checking if this segment is already
+ * covered by the last archival report from the primary. If so, create
+ * .done directly. Otherwise, create .ready and wait for the next report.
*/
- if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
- XLogArchiveForceDone(xlogfname);
- else
+ if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+ {
XLogArchiveNotify(xlogfname);
+ }
+ else if (XLogArchiveMode == ARCHIVE_MODE_SHARED)
+ {
+ /*
+ * In shared mode, check if this segment is already archived on primary.
+ * If we're on the same timeline and this segment is <= last archived,
+ * mark it .done immediately. Otherwise create .ready.
+ *
+ * We don't check ancestor timeline cases here to avoid reading timeline
+ * history files on every segment close. ProcessArchivalReport() will
+ * handle marking ancestor timeline segments as .done when it scans
+ * the archive_status directory.
+ */
+ if (IsSegnoArchivedByUpstream(recvFileTLI, recvSegNo))
+ XLogArchiveForceDone(xlogfname);
+ else
+ XLogArchiveNotify(xlogfname);
+ }
+ else
+ {
+ XLogArchiveForceDone(xlogfname);
+ }
recvFile = -1;
}
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index bd5d47be964..c8c6de51619 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -34,6 +34,7 @@
#include "utils/wait_event.h"
WalRcvData *WalRcv = NULL;
+//ArchivalReportData *ArchReport;
static void WalRcvShmemRequest(void *arg);
static void WalRcvShmemInit(void *arg);
@@ -57,6 +58,11 @@ WalRcvShmemRequest(void *arg)
.size = sizeof(WalRcvData),
.ptr = (void **) &WalRcv,
);
+
+ ShmemRequestStruct(.name = "Archival Report Data",
+ .size = sizeof(ArchivalReportData),
+ .ptr = (void **) &ArchReport,
+ );
}
/* Initialize walreceiver-related shared memory */
@@ -69,6 +75,8 @@ WalRcvShmemInit(void *arg)
SpinLockInit(&WalRcv->mutex);
pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
WalRcv->procno = INVALID_PROC_NUMBER;
+
+ MemSet(ArchReport, 0, sizeof(ArchivalReportData));
}
/* Is walreceiver running (or starting up)? */
@@ -422,3 +430,47 @@ GetReplicationTransferLatency(void)
return TimestampDifferenceMilliseconds(lastMsgSendTime,
lastMsgReceiptTime);
}
+
+/*
+ * Is segment was already archived by master?
+ * Can be used only if archive_mode=shared.
+ */
+bool
+IsSegnoArchivedByUpstream(TimeLineID tli, XLogSegNo segno)
+{
+ elog(WARNING, "received tli: %d, segno: %ld", tli, segno);
+ elog(WARNING, "ArchReport->tli: %d, ArchReport->segno: %ld", ArchReport->tli, ArchReport->segno);
+ return tli == ArchReport->tli && segno <= ArchReport->segno;
+}
+
+/*
+ * Store archival report data in shared memory for later use by
+ * walreceiver and archiver.
+ */
+void
+StoreArchivalReport(TimeLineID tli, XLogSegNo segno)
+{
+ ereport(WARNING,
+ (errmsg("received archival report from primary: tli %u, segno %ld",
+ tli, segno)));
+
+ if (tli == 0 || segno == 0)
+ {
+ ereport(WARNING,
+ (errmsg("invalid values in archival report: tli %d, segno %ld",
+ tli, segno)));
+ return;
+ }
+
+ /* nothing is changed */
+ if (tli == ArchReport->tli && segno <= ArchReport->segno)
+ return;
+
+ /* Remember the last archived segment for XLogWalRcvClose() */
+ ArchReport->tli = tli;
+ ArchReport->segno = segno;
+
+ /* Notify archiver that it's got something to do */
+ if (IsUnderPostmaster)
+ PgArchWakeup();
+}
\ No newline at end of file
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3d4ab929f91..09a8aeb33a4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -183,6 +183,18 @@ static TimeLineID sendTimeLineNextTLI = 0;
static bool sendTimeLineIsHistoric = false;
static XLogRecPtr sendTimeLineValidUpto = InvalidXLogRecPtr;
+/*
+ * Last archived WAL file. This is fetched from pgstat periodically and sent
+ * to the standby. last_archival_report_timestamp tracks when we last sent
+ * the report to avoid excessive pgstat access.
+ */
+static TimeLineID last_archived_tli = 0;
+static XLogSegNo last_archived_segno = 0;
+static TimestampTz last_archival_report_timestamp = 0;
+
+/* Interval for sending archival reports (10 seconds) */
+#define ARCHIVAL_REPORT_INTERVAL 10000
+
/*
* How far have we sent WAL already? This is also advertised in
* MyWalSnd->sentPtr. (Actually, this is the next WAL location to send.)
@@ -297,6 +309,7 @@ static void ProcessStandbyMessage(void);
static void ProcessStandbyReplyMessage(void);
static void ProcessStandbyHSFeedbackMessage(void);
static void ProcessStandbyPSRequestMessage(void);
+static void WalSndArchivalReport(void);
static void ProcessRepliesIfAny(void);
static void ProcessPendingWrites(void);
static void WalSndKeepalive(bool requestReply, XLogRecPtr writePtr);
@@ -2799,6 +2812,91 @@ ProcessStandbyHSFeedbackMessage(void)
}
}
+/*
+ * Send archival status report to standby.
+ *
+ * This is called periodically during physical replication to inform the
+ * standby about the last WAL segment archived by the primary. The standby
+ * can then mark segments up to that point as .done, allowing them to be
+ * recycled. This prevents WAL loss during standby promotion.
+ */
+static void
+WalSndArchivalReport(void)
+{
+ PgStat_ArchiverStats *archiver_stats;
+ TimestampTz now;
+
+ /* Only send reports when archive_mode=shared */
+ if (XLogArchiveMode != ARCHIVE_MODE_SHARED)
+ return;
+
+ /* Only send reports during physical streaming replication, not during backup */
+ if (MyWalSnd->kind != REPLICATION_KIND_PHYSICAL)
+ return;
+ if (MyWalSnd->state != WALSNDSTATE_CATCHUP &&
+ MyWalSnd->state != WALSNDSTATE_STREAMING)
+ return;
+
+ /*
+ * Don't send to temporary replication slots (used by pg_basebackup).
+ * Connections without slots (regular standbys) are OK.
+ */
+ if (MyReplicationSlot != NULL &&
+ MyReplicationSlot->data.persistency == RS_TEMPORARY)
+ return;
+
+ now = GetCurrentTimestamp();
+
+ /*
+ * Send report at most once per ARCHIVAL_REPORT_INTERVAL (10 seconds).
+ * This avoids excessive pgstat access.
+ */
+ if (now < TimestampTzPlusMilliseconds(last_archival_report_timestamp,
+ ARCHIVAL_REPORT_INTERVAL))
+ return;
+ last_archival_report_timestamp = now;
+ /*
+ * Get archiver statistics. We use non-blocking access to avoid delaying
+ * replication if stats collector is slow. If stats are unavailable or
+ * stale, we'll just try again at the next interval.
+ */
+ pgstat_clear_snapshot();
+ archiver_stats = pgstat_fetch_stat_archiver();
+ if (archiver_stats == NULL)
+ return;
+
+ /* Only send reports for WAL segments, not backup history files or other archived files */
+ if (!IsXLogFileName(archiver_stats->last_archived_wal))
+ return;
+
+ /*
+ * Only send a report if the last archived WAL has changed. This is both
+ * an optimization and ensures we don't send empty reports on startup.
+ */
+ {
+ TimeLineID tli;
+ XLogSegNo segno;
+ XLogFromFileName(archiver_stats->last_archived_wal, &tli, &segno, wal_segment_size);
+ if (last_archived_tli == tli && last_archived_segno == segno)
+ return;
+
+ /* Remember what we are about to sent */
+ last_archived_tli = tli;
+ last_archived_segno = segno;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("sending archival report: tli %d, segno %ld", last_archived_segno)));
+
+ /* Construct the message... */
+ resetStringInfo(&output_message);
+ pq_sendbyte(&output_message, PqReplMsg_ArchiveStatusReport);
+ pq_sendint64(&output_message, (int64) last_archived_tli);
+ pq_sendint64(&output_message, (int64) last_archived_segno);
+ /* ... and send it wrapped in CopyData */
+ pq_putmessage_noblock(PqMsg_CopyData, output_message.data, output_message.len);
+}
+
/*
* Process the request for a primary status update message.
*/
@@ -4350,6 +4448,9 @@ WalSndKeepaliveIfNecessary(void)
{
TimestampTz ping_time;
+ /* Send archival status report if needed */
+ WalSndArchivalReport();
+
/*
* Don't send keepalive messages if timeouts are globally disabled or
* we're doing something not partaking in timeouts.
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 437b4f32349..98a94785ebf 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -67,6 +67,7 @@ typedef enum ArchiveMode
ARCHIVE_MODE_OFF = 0, /* disabled */
ARCHIVE_MODE_ON, /* enabled while server is running normally */
ARCHIVE_MODE_ALWAYS, /* enabled always (even during recovery) */
+ ARCHIVE_MODE_SHARED, /* shared archive between primary and standby */
} ArchiveMode;
extern PGDLLIMPORT int XLogArchiveMode;
@@ -104,6 +105,9 @@ extern PGDLLIMPORT bool XLogLogicalInfo;
/* Is WAL archiving enabled always (even during recovery)? */
#define XLogArchivingAlways() \
(AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+#define XLogArchivingShared() \
+ (AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode == ARCHIVE_MODE_SHARED)
+
/*
* Is WAL-logging necessary for archival or log-shipping, or can we skip
diff --git a/src/include/libpq/protocol.h b/src/include/libpq/protocol.h
index eae8f0e7238..d22aaf9e225 100644
--- a/src/include/libpq/protocol.h
+++ b/src/include/libpq/protocol.h
@@ -72,6 +72,7 @@
/* Replication codes sent by the primary (wrapped in CopyData messages). */
+#define PqReplMsg_ArchiveStatusReport 'a'
#define PqReplMsg_Keepalive 'k'
#define PqReplMsg_PrimaryStatusUpdate 's'
#define PqReplMsg_WALData 'w'
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 47c07574d4d..2888dfb55ac 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -195,6 +195,14 @@ typedef struct
struct WalReceiverConn;
typedef struct WalReceiverConn WalReceiverConn;
+typedef struct ArchivalReportData
+{
+ TimeLineID tli;
+ XLogSegNo segno;
+} ArchivalReportData;
+/* Last archived WAL segment file reported by the primary */
+extern ArchivalReportData *ArchReport;
+
/*
* Status of walreceiver query execution.
*
@@ -502,5 +510,7 @@ extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID
extern XLogRecPtr GetWalRcvWriteRecPtr(void);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
+extern bool IsSegnoArchivedByUpstream(TimeLineID tli, XLogSegNo segno);
+extern void StoreArchivalReport(TimeLineID tli, XLogSegNo segno);
#endif /* _WALRECEIVER_H */
diff --git a/src/test/recovery/t/053_archive_shared.pl b/src/test/recovery/t/053_archive_shared.pl
new file mode 100644
index 00000000000..397b71ad79d
--- /dev/null
+++ b/src/test/recovery/t/053_archive_shared.pl
@@ -0,0 +1,270 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test archive_mode=shared for coordinated WAL archiving between primary and standby
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::Path qw(rmtree);
+
+# Initialize primary node with archiving
+my $archive_dir = PostgreSQL::Test::Utils::tempdir();
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', "
+archive_mode = shared
+archive_command = 'cp %p \"$archive_dir\"/%f'
+wal_keep_size = 128MB
+");
+$primary->start;
+
+# Create a test table and generate some WAL
+$primary->safe_psql('postgres', 'CREATE TABLE test_table (id int, data text);');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(1, 500) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(501, 1000) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for archiver to archive segments
+$primary->poll_query_until('postgres',
+ "SELECT archived_count > 0 FROM pg_stat_archiver")
+ or die "Timed out waiting for archiver to start";
+
+my $archived_count = () = glob("$archive_dir/*");
+ok($archived_count > 0, "primary has archived WAL files to shared archive");
+note("Primary archived $archived_count files");
+
+# Take backup for standby
+my $backup_name = 'standby_backup';
+$primary->backup($backup_name);
+
+# Exclude possible race condition when backup WAL is last archived
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(501, 1000) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Set up standby with archive_mode=shared
+my $standby = PostgreSQL::Test::Cluster->new('standby');
+$standby->init_from_backup($primary, $backup_name, has_streaming => 1);
+$standby->append_conf('postgresql.conf', "
+archive_mode = shared
+archive_command = 'cp %p \"$archive_dir\"/%f'
+wal_receiver_status_interval = 1s
+");
+$standby->start;
+
+# Wait for standby to catch up
+$primary->wait_for_catchup($standby);
+
+# Generate more WAL on primary (these are new segments not yet archived)
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(1001, 1500) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(1501, 2000) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for standby to receive the new WAL
+$primary->wait_for_catchup($standby);
+
+# Check that standby has .ready or .done files for the newly received segments.
+# Normally they should be .ready (not yet archived by primary), but in rare cases
+# the archiver could be very fast and an archive report sent immediately, creating
+# .done files instead. Both are correct behavior - the key is that files exist.
+my $standby_archive_status = $standby->data_dir . '/pg_wal/archive_status';
+my $status_count = 0;
+if (opendir(my $dh, $standby_archive_status))
+{
+ my @files = grep { /\.(ready|done)$/ } readdir($dh);
+ $status_count = scalar(@files);
+ my $ready_count = scalar(grep { /\.ready$/ } @files);
+ my $done_count = scalar(grep { /\.done$/ } @files);
+ note("Standby has $ready_count .ready files and $done_count .done files");
+ closedir($dh);
+}
+cmp_ok($status_count, '>', 0, "standby creates archive status files for received WAL");
+
+# Generate more WAL and wait for archiving on primary
+my $initial_archived = $primary->safe_psql('postgres', 'SELECT archived_count FROM pg_stat_archiver');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'more-data' || i FROM generate_series(2001, 2500) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'more-data2' || i FROM generate_series(2501, 3000) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for primary to archive the new segments
+$primary->poll_query_until('postgres',
+ "SELECT archived_count > $initial_archived FROM pg_stat_archiver")
+ or die "Timed out waiting for primary to archive new segments";
+
+# Wait for standby to catch up (archive status is sent during replication)
+$primary->wait_for_catchup($standby);
+
+# Wait for primary to send archival status updates and standby to process them
+# The standby should mark segments as .done after receiving archive status from primary
+my $done_count = 0;
+for (my $i = 0; $i < $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+ $done_count = 0;
+ if (opendir(my $dh, $standby_archive_status))
+ {
+ $done_count = scalar(grep { /\.done$/ } readdir($dh));
+ closedir($dh);
+ }
+ last if $done_count > 0;
+ sleep(1);
+}
+ok($done_count > 0, "standby marked segments as .done after primary's archival report");
+note("Standby has $done_count .done files");
+
+###############################################################################
+# Test 2: Standby promotion - verify archiver activates
+###############################################################################
+
+# Before promotion, verify archiver is not running on standby (shared mode during recovery)
+# In shared mode, the standby's archiver should not be archiving during recovery
+my $archived_before = $standby->safe_psql('postgres',
+ "SELECT archived_count FROM pg_stat_archiver");
+is($archived_before, '0',
+ "archiver not active on standby before promotion (archived_count=0)");
+
+# Verify standby is still in recovery before promoting
+my $in_recovery = $standby->safe_psql('postgres', "SELECT pg_is_in_recovery();");
+is($in_recovery, 't', "standby is in recovery before promotion");
+
+# Promote the standby
+$standby->promote;
+$standby->poll_query_until('postgres', "SELECT NOT pg_is_in_recovery();");
+
+# Generate WAL on new primary (former standby)
+$standby->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'post-promotion' || i FROM generate_series(2001, 2500) i;");
+$standby->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for archiver to activate and archive the new WAL
+# Check pg_stat_archiver to verify archiving is happening
+$standby->poll_query_until('postgres',
+ "SELECT archived_count > 0 FROM pg_stat_archiver")
+ or die "Timed out waiting for promoted standby to start archiving";
+pass("promoted standby started archiving");
+
+# Verify data integrity
+my $count = $standby->safe_psql('postgres', 'SELECT COUNT(*) FROM test_table;');
+ok($count >= 2500, "promoted standby has all data (got $count rows)");
+
+###############################################################################
+# Test 3: Cascading replication
+###############################################################################
+
+# Take a backup from the promoted standby (now the new primary)
+my $promoted_backup = 'promoted_backup';
+$standby->backup($promoted_backup);
+
+# Set up second-level standby (cascading from first standby, now promoted)
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup($standby, $promoted_backup, has_streaming => 1);
+$standby2->append_conf('postgresql.conf', "
+archive_mode = shared
+archive_command = 'cp %p \"$archive_dir\"/%f'
+wal_receiver_status_interval = 1s
+");
+$standby2->start;
+
+# Generate WAL on promoted standby (now primary for standby2)
+my $cascading_archived_before = $standby->safe_psql('postgres', 'SELECT archived_count FROM pg_stat_archiver');
+$standby->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'cascading' || i FROM generate_series(2501, 3000) i;");
+$standby->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for the promoted standby (acting as primary) to archive the new segment
+$standby->poll_query_until('postgres',
+ "SELECT archived_count > $cascading_archived_before FROM pg_stat_archiver")
+ or die "Timed out waiting for primary to archive segment in cascading test";
+
+# Wait for cascading standby to catch up
+$standby->wait_for_catchup($standby2);
+
+# Wait for cascading standby to receive archive status and mark segments as .done
+my $standby2_archive_status = $standby2->data_dir . '/pg_wal/archive_status';
+my $standby2_done_count = 0;
+for (my $i = 0; $i < $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+ $standby2_done_count = 0;
+ if (opendir(my $dh, $standby2_archive_status))
+ {
+ $standby2_done_count = scalar(grep { /\.done$/ } readdir($dh));
+ closedir($dh);
+ }
+ last if $standby2_done_count > 0;
+ sleep(1);
+}
+ok($standby2_done_count > 0, "cascading standby marks segments as .done");
+note("Cascading standby has $standby2_done_count .done files");
+
+# Verify cascading standby has all data
+my $standby2_count = $standby2->safe_psql('postgres', 'SELECT COUNT(*) FROM test_table;');
+ok($standby2_count >= 3000, "cascading standby has all data (got $standby2_count rows)");
+
+###############################################################################
+# Test 4: Multiple standbys from same primary
+###############################################################################
+
+# Create third standby from promoted standby (current primary)
+my $standby3 = PostgreSQL::Test::Cluster->new('standby3');
+my $backup2 = 'multi_standby_backup';
+$standby->backup($backup2);
+$standby3->init_from_backup($standby, $backup2, has_streaming => 1);
+$standby3->append_conf('postgresql.conf', "
+archive_mode = shared
+archive_command = 'cp %p \"$archive_dir\"/%f'
+wal_receiver_status_interval = 1s
+");
+$standby3->start;
+
+# Generate WAL and ensure both standbys receive it
+my $standby_archived_before = $standby->safe_psql('postgres', 'SELECT archived_count FROM pg_stat_archiver');
+$standby->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'multi' || i FROM generate_series(3001, 3500) i;");
+$standby->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for the promoted standby (acting as primary) to archive the new segment
+$standby->poll_query_until('postgres',
+ "SELECT archived_count > $standby_archived_before FROM pg_stat_archiver")
+ or die "Timed out waiting for primary to archive segment in multi-standby test";
+
+$standby->wait_for_catchup($standby2);
+$standby->wait_for_catchup($standby3);
+
+# Verify both standbys eventually mark segments as .done
+my $standby3_archive_status = $standby3->data_dir . '/pg_wal/archive_status';
+
+for (my $i = 0; $i < $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+ $standby2_done_count = 0;
+ if (opendir(my $dh, $standby2_archive_status))
+ {
+ $standby2_done_count = scalar(grep { /\.done$/ } readdir($dh));
+ closedir($dh);
+ }
+ last if $standby2_done_count > 0;
+ sleep(1);
+}
+
+my $standby3_done_count = 0;
+for (my $i = 0; $i < $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+ $standby3_done_count = 0;
+ if (opendir(my $dh, $standby3_archive_status))
+ {
+ $standby3_done_count = scalar(grep { /\.done$/ } readdir($dh));
+ closedir($dh);
+ }
+ last if $standby3_done_count > 0;
+ sleep(1);
+}
+
+ok($standby2_done_count > 0, "standby2 marks segments as .done");
+ok($standby3_done_count > 0, "standby3 marks segments as .done");
+note("standby2 has $standby2_done_count .done files, standby3 has $standby3_done_count .done files");
+
+# Verify both standbys have all data
+$standby2_count = $standby2->safe_psql('postgres', 'SELECT COUNT(*) FROM test_table;');
+my $standby3_count = $standby3->safe_psql('postgres', 'SELECT COUNT(*) FROM test_table;');
+ok($standby2_count >= 3500, "standby2 has all data (got $standby2_count rows)");
+ok($standby3_count >= 3500, "standby3 has all data (got $standby3_count rows)");
+
+done_testing();