Hi hackers,

I'd like to propose a new archive_mode setting to address a gap in WAL 
archiving for high availability streaming replication configurations.

## Problem

In HA setups using streaming replication, a standby can be promoted when 
the primary fails. Some WAL segments might not have been archived yet at 
that point. This creates gaps in the WAL archive, breaking point-in-time 
recovery:

1. Primary generates WAL, streams to standby
2. Standby receives WAL, marks segments as .done immediately
3. Standby deletes WAL during checkpoints
4. Primary hasn't archived yet (archiver lag, network issues, etc.)
5. Primary vanishes
6. Standby gets promoted
7. WAL history lost from archive

This is particularly problematic in synchronous replication where 
promotion might happen while the primary is still catching up on archival.

A promoted standby might have some WAL segments from walreceiver and some 
from the archive. In this case we need to archive only those segments that 
were received but not confirmed as archived by the primary.

## Proposed Solution

Add archive_mode=follow_primary, where standbys defer WAL deletion until 
the primary confirms archival:

- During recovery: standby creates .ready files for received segments
- Periodically: standby queries primary for archive status via replication 
protocol
- Primary responds: which segments are archived (no .ready file exists)
- Standby marks those as .done and can safely delete them
- On promotion: standby automatically archives remaining .ready segments
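
To make the intended lifecycle concrete, here is a minimal sketch of the standby-side bookkeeping described above. It is an in-memory model for discussion only, not code from the patch; the types and function names (SegId, SegEntry, seg_received, apply_archive_response, seg_removable, count_pending_on_promotion) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical identifier of one WAL segment. */
typedef struct
{
    uint32_t tli;    /* timeline ID */
    uint64_t segno;  /* WAL segment number */
} SegId;

/* Hypothetical model of the two archive_status states used here. */
typedef enum { SEG_READY, SEG_DONE } SegStatus;

typedef struct
{
    SegId     id;
    SegStatus status;
} SegEntry;

/* During recovery, a received segment starts out as .ready. */
static void
seg_received(SegEntry *e, uint32_t tli, uint64_t segno)
{
    e->id.tli = tli;
    e->id.segno = segno;
    e->status = SEG_READY;
}

/*
 * Apply the upstream's answer: every segment it reports as archived
 * flips from .ready to .done. Returns the number of entries changed.
 */
static size_t
apply_archive_response(SegEntry *segs, size_t nsegs,
                       const SegId *archived, size_t narchived)
{
    size_t changed = 0;

    for (size_t i = 0; i < nsegs; i++)
    {
        if (segs[i].status != SEG_READY)
            continue;
        for (size_t j = 0; j < narchived; j++)
        {
            if (segs[i].id.tli == archived[j].tli &&
                segs[i].id.segno == archived[j].segno)
            {
                segs[i].status = SEG_DONE;
                changed++;
                break;
            }
        }
    }
    return changed;
}

/* Only .done segments may be removed by the standby. */
static bool
seg_removable(const SegEntry *e)
{
    return e->status == SEG_DONE;
}

/* On promotion, whatever is still .ready is left for the local archiver. */
static size_t
count_pending_on_promotion(const SegEntry *segs, size_t nsegs)
{
    size_t pending = 0;

    for (size_t i = 0; i < nsegs; i++)
        if (segs[i].status == SEG_READY)
            pending++;
    return pending;
}
```

The point of the model is that a segment is never removable until the upstream has confirmed it, so after promotion the set of .ready entries is exactly the part of WAL history the archive may be missing.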

## Implementation

The patch adds two replication protocol messages:
- 'a' (PqReplMsg_ArchiveStatusQuery): standby → primary, sends (timeline, 
segno) pairs
- 'A' (PqReplMsg_ArchiveStatusResponse): primary → standby, responds with 
archived pairs
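
For readers who want to reason about message size, below is a rough sketch of how an 'a' query could be serialized. The wire layout is an assumption made for illustration (1 type byte, a 4-byte pair count, then 4-byte timeline and 8-byte segno per pair, all in network byte order); the actual layout is whatever the attached patch defines. The names encode_archive_query, put_u32, put_u64 and MAX_QUERY_PAIRS are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ARCHIVE_QUERY_MSG 'a'   /* message type byte from the proposal */
#define MAX_QUERY_PAIRS   64    /* per-query limit mentioned in the proposal */

/* Write a 32-bit value in network byte order; returns bytes written. */
static size_t
put_u32(uint8_t *buf, uint32_t v)
{
    buf[0] = (uint8_t) (v >> 24);
    buf[1] = (uint8_t) (v >> 16);
    buf[2] = (uint8_t) (v >> 8);
    buf[3] = (uint8_t) v;
    return 4;
}

/* Write a 64-bit value in network byte order; returns bytes written. */
static size_t
put_u64(uint8_t *buf, uint64_t v)
{
    put_u32(buf, (uint32_t) (v >> 32));
    put_u32(buf + 4, (uint32_t) v);
    return 8;
}

/*
 * Assumed layout: 'a' | uint32 count | count * (uint32 tli, uint64 segno).
 * Returns the encoded length, or 0 if there are too many pairs or the
 * buffer is too small.
 */
static size_t
encode_archive_query(uint8_t *buf, size_t buflen,
                     const uint32_t *tlis, const uint64_t *segnos,
                     uint32_t npairs)
{
    size_t need = 1 + 4 + (size_t) npairs * 12;
    size_t off = 0;

    if (npairs > MAX_QUERY_PAIRS || buflen < need)
        return 0;

    buf[off++] = (uint8_t) ARCHIVE_QUERY_MSG;
    off += put_u32(buf + off, npairs);
    for (uint32_t i = 0; i < npairs; i++)
    {
        off += put_u32(buf + off, tlis[i]);
        off += put_u64(buf + off, segnos[i]);
    }
    return off;
}
```

Under this assumed layout a full 64-pair query is 1 + 4 + 64 * 12 = 773 bytes.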

Key changes:
- walreceiver: XLogWalRcvSendArchiveQuery() scans archive_status, sends 
queries. I particularly dislike the need to read the whole archive_status 
directory, but found no better way.
- walsender: ProcessStandbyArchiveQueryMessage() checks .ready files, responds.
Fortunately, no potentially FS-heavy operations on the primary.
- archiver: skips archiving during recovery if archive_mode=follow_primary.
I considered creating a new kind of status file, but rejected the idea.
- XLogWalRcvClose(): creates .ready files instead of .done in follow_primary 
mode
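
As an illustration of why the walsender side is cheap, here is a standalone sketch of the per-segment check: one stat() on the corresponding .ready file. It mirrors the rule stated above (no .ready file means archived) but is not the patch's ProcessStandbyArchiveQueryMessage(); the function names and the status_dir parameter are hypothetical. The file name format follows the usual WAL segment naming (timeline, then the segment number split into two 8-digit hex fields).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* Build a WAL segment file name: TLI, then segno split into two fields. */
static void
wal_file_name(char *out, size_t outlen, uint32_t tli, uint64_t segno,
              uint64_t wal_segsz_bytes)
{
    uint64_t segs_per_xlogid = UINT64_C(0x100000000) / wal_segsz_bytes;

    snprintf(out, outlen, "%08X%08X%08X",
             (unsigned int) tli,
             (unsigned int) (segno / segs_per_xlogid),
             (unsigned int) (segno % segs_per_xlogid));
}

/*
 * Treat a segment as archived when no .ready file exists for it in
 * status_dir. Any stat() failure other than ENOENT is conservatively
 * reported as "not archived".
 */
static bool
segment_is_archived(const char *status_dir, uint32_t tli, uint64_t segno,
                    uint64_t wal_segsz_bytes)
{
    char        name[32];
    char        path[1024];
    struct stat st;

    wal_file_name(name, sizeof(name), tli, segno, wal_segsz_bytes);
    snprintf(path, sizeof(path), "%s/%s.ready", status_dir, name);

    if (stat(path, &st) == 0)
        return false;           /* .ready exists: archiver has not finished */
    return errno == ENOENT;
}
```

With 16MB segments, timeline 1 and segment number 1 map to 000000010000000000000001, so the check above would look for archive_status/000000010000000000000001.ready.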

Status requests happen at wal_receiver_status_interval (similar to 
hot_standby_feedback).
Works with cascading replication - each standby queries its immediate upstream.
Primary can be configured with archive_mode=follow_primary too.

## Testing

Included TAP tests cover:
- Basic archive status synchronization
- Standby promotion triggering archival
- Cascading standby configurations
- Multiple standbys from same primary

## Performance Impact

The overhead is minimal:
- Standby: One archive_status directory scan per wal_receiver_status_interval
- Primary: O(n) stat() calls where n = number of .ready files on standby
- Network: Small message (~1KB for 64 segments)
- Some space occupied by unarchived WALs on all standbys

## Open Questions

1. **Naming**: Is "follow_primary" the best name? Alternatives considered:
   - standby
   - synchronized/sync  
   - coordinated
   - primary_sync

2. **Query frequency**: Currently tied to wal_receiver_status_interval. 
   Should this be a separate GUC?

3. **Message protocol**: Should we batch more segments per message? 
   Current limit is 64 per query. Maybe sort requests by LSN to pick the 64 
   oldest segments?

4. **Backwards compatibility**: Primary must understand the protocol. 
   Should we version-check or gracefully degrade? I don't think an additional 
   check is necessary, but I'm not sure.
   Currently, if a walreceiver with follow_primary connects to an old primary 
   that doesn't understand the 'a' message, the primary will log a protocol 
   error but replication will continue (the standby just won't get responses).
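
For question 3, the "pick the 64 oldest" idea could look roughly like the sketch below. This is not in the patch; SegId, seg_cmp and pick_oldest are hypothetical names, and ordering by segment number first (timeline as tie-breaker) is an assumption about what "oldest by LSN" should mean across timelines.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical identifier of one WAL segment. */
typedef struct
{
    uint32_t tli;    /* timeline ID */
    uint64_t segno;  /* WAL segment number */
} SegId;

/* Order by segment number (a proxy for LSN), then by timeline. */
static int
seg_cmp(const void *a, const void *b)
{
    const SegId *x = a;
    const SegId *y = b;

    if (x->segno != y->segno)
        return (x->segno < y->segno) ? -1 : 1;
    if (x->tli != y->tli)
        return (x->tli < y->tli) ? -1 : 1;
    return 0;
}

/*
 * Sort the candidates in place and return how many of the oldest ones
 * fit into a single query (at most "limit").
 */
static size_t
pick_oldest(SegId *segs, size_t nsegs, size_t limit)
{
    qsort(segs, nsegs, sizeof(SegId), seg_cmp);
    return (nsegs < limit) ? nsegs : limit;
}
```

Querying the oldest segments first would let the standby release WAL in removal order, at the cost of sorting the directory listing on every status interval.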

## Future work

I'd like to extend the archiver design to distribute archival work between 
cluster nodes. But that would be too big a project to do at once, so I decided 
to address the PITR continuity issue first.

## Patch

The attached patch implements the feature with documentation and tests, but 
its main purpose is, of course, discussion. Does this approach seem like the 
right direction of development?
Looking forward to feedback on the approach and any concerns.


Best regards, Andrey Borodin.

Attachment: v1-0001-Add-archive_mode-follow_primary-to-prevent-WAL-lo.patch