Hello PostgreSQL developers,

I'm submitting a documentation improvement for Section 26.2.6 (Replication
Slots) based on my experience as a database architect managing PostgreSQL
high-availability environments.

# Problem

The current documentation warns that replication slots "can cause the
server to retain so many WAL segments that they fill up the space allocated
for pg_wal" and mentions max_slot_wal_keep_size as a mitigation.
However, it doesn't provide guidance on what DBAs should monitor to detect
this condition before it becomes critical.

In production environments, database engineers need to know which
metrics to watch, not just how to limit the problem after the fact.

# Solution

This patch adds specific monitoring recommendations immediately after the
warning, including:

- Disk space monitoring for the pg_wal directory
- Replication lag metrics from pg_stat_replication (write_lag, flush_lag,
  replay_lag)
- Slot status checks from pg_replication_slots (active flag, restart_lsn)

This gives database engineers actionable information they can use to
set up proactive monitoring and alerting.
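
As a rough illustration of the slot status check (a sketch only, not part of
the attached patch), the snippet below shows how a monitoring script might
turn restart_lsn and the active flag into an alert. The query uses the
built-in pg_current_wal_lsn() function and the pg_replication_slots view;
the SlotRow class, the helper names, and the threshold value are hypothetical
and exist only for this example. How the rows are fetched from the server is
left out.

```python
from dataclasses import dataclass
from typing import List, Optional

# Query a monitoring job could run on the primary. Fetching the rows
# (driver, connection handling) is deliberately out of scope here.
SLOT_QUERY = """
SELECT slot_name, active, restart_lsn, pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots
"""


def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' into a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass
class SlotRow:
    """One row of SLOT_QUERY (hypothetical container for this sketch)."""
    slot_name: str
    active: bool
    restart_lsn: Optional[str]  # NULL if the slot has not reserved WAL
    current_lsn: str


def retained_bytes(row: SlotRow) -> int:
    """Bytes of WAL between the slot's restart_lsn and the current LSN."""
    if row.restart_lsn is None:
        return 0
    return parse_lsn(row.current_lsn) - parse_lsn(row.restart_lsn)


def slots_to_alert(rows: List[SlotRow], threshold_bytes: int) -> List[str]:
    """Names of inactive slots holding back more WAL than the threshold."""
    return [
        row.slot_name
        for row in rows
        if not row.active and retained_bytes(row) > threshold_bytes
    ]
```

The threshold is an assumption that each site would tune to the size of its
pg_wal volume; the point is only that an inactive slot with a growing gap
between restart_lsn and the current LSN is the early-warning signal.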

# Background

From my production experience managing EFM clusters with streaming
replication, the most common cause of WAL accumulation is inactive standbys
that prevent WAL cleanup. The combination of monitoring restart_lsn in
pg_replication_slots along with the active flag provides early warning when
a slot is preventing cleanup, before disk space becomes critical.

I've seen this scenario multiple times in enterprise deployments:
- Standby server goes offline for maintenance or network issues
- Replication slot persists but is inactive (active = false) because the
  standby is not connected
- Primary keeps WAL segments needed by the slot
- pg_wal fills up over hours or days
- Primary eventually fails when out of disk space

The monitoring guidance in this patch would help database engineers catch
this early.
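
For the disk space side of that scenario, a minimal sketch (again not part of
the patch) of a filesystem-level check follows. It uses Python's standard
shutil.disk_usage and reports usage of the whole volume that holds the given
path, so it is most meaningful when pg_wal sits on its own volume. The
function names and the 80% warning level are hypothetical choices for this
example.

```python
import shutil


def wal_volume_used_fraction(pg_wal_path: str) -> float:
    """Fraction (0.0 to 1.0) of the volume containing pg_wal_path in use."""
    usage = shutil.disk_usage(pg_wal_path)
    return usage.used / usage.total


def should_alert(used_fraction: float, warn_at: float = 0.80) -> bool:
    """True once usage reaches the (assumed) warning level."""
    return used_fraction >= warn_at
```

A cron job or monitoring agent could call wal_volume_used_fraction() with the
path to the pg_wal directory and raise an alert when should_alert() returns
True, well before the primary runs out of space.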

# Patch Details

The patch is attached and has also been pushed to my GitHub fork:
https://github.com/mastercloudarchitect/postgres/tree/docs/improve-replication-slot-monitoring

I'm happy to revise based on feedback. If the community thinks additional
monitoring recommendations would be helpful (e.g., alerting thresholds,
integration with monitoring tools), I can add those as well.

Thanks for considering this improvement!

Best regards,
Venkat Venkatakrishnan
Database Architect

Attachment: v1-0001-docs-Add-monitoring-guidance-to-replication-slot-.patch
Description: Binary data
