Hello PostgreSQL developers,

I'm submitting a documentation improvement for Section 26.2.6 (Replication Slots), based on my experience as a database architect managing PostgreSQL high-availability environments.

# Problem

The current documentation warns that replication slots "can cause the server to retain so many WAL segments that they fill up the space allocated for pg_wal" and mentions max_slot_wal_keep_size as a mitigation. However, it gives no guidance on what DBAs should monitor to detect this condition before it becomes critical. In production environments, database engineers need to know which metrics to watch, not just how to limit the problem after the fact.

# Solution

This patch adds specific monitoring recommendations immediately after the warning, including:

- Disk space monitoring for the pg_wal directory
- Replication lag metrics from pg_stat_replication (write_lag, flush_lag, replay_lag)
- Slot status checks from pg_replication_slots (active flag, restart_lsn)

This gives database engineers actionable information they can use to set up proactive monitoring and alerting.

# Background

From my production experience managing EFM clusters with streaming replication, the most common cause of WAL accumulation is an inactive standby whose slot prevents WAL cleanup. Monitoring restart_lsn in pg_replication_slots together with the active flag gives early warning that a slot is holding back cleanup, before disk space becomes critical.

I've seen this scenario multiple times in enterprise deployments:

- The standby server goes offline for maintenance or because of network issues
- The replication slot persists (now showing active = false) while the standby is disconnected
- The primary keeps the WAL segments needed by the slot
- pg_wal fills up over hours or days
- The primary eventually fails when it runs out of disk space

The monitoring guidance in this patch would help database engineers catch this early.

# Patch Details

The patch is attached and has also been pushed to my GitHub fork: https://github.com/mastercloudarchitect/postgres/tree/docs/improve-replication-slot-monitoring

I'm happy to revise based on feedback.
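
To make the slot check from the Background section concrete, here is a rough sketch of how a monitoring job might flag inactive slots that are retaining WAL. It is an illustration only and not part of the patch: the `SlotRow` type, the `slots_to_alert` helper, and the 32 MiB threshold are hypothetical names and values chosen for the example. The only PostgreSQL-specific pieces are the `pg_replication_slots` columns `slot_name`, `active`, and `restart_lsn`, plus `pg_current_wal_lsn()`, which must be run on the primary.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Query a monitoring job could run on the primary (assumed to be executed
# elsewhere; this sketch only processes the resulting rows).
SLOT_QUERY = """
SELECT slot_name, active, restart_lsn, pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots
"""


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits and
    the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass
class SlotRow:
    """One row of SLOT_QUERY (hypothetical container for this example)."""
    slot_name: str
    active: bool
    restart_lsn: Optional[str]  # NULL if the slot has not reserved WAL
    current_lsn: str


def slots_to_alert(rows: List[SlotRow],
                   max_retained_bytes: int) -> List[Tuple[str, int]]:
    """Return (slot_name, retained_bytes) for inactive slots over the limit."""
    alerts = []
    for row in rows:
        if row.restart_lsn is None:
            continue  # nothing is being retained for this slot
        retained = lsn_to_int(row.current_lsn) - lsn_to_int(row.restart_lsn)
        if not row.active and retained > max_retained_bytes:
            alerts.append((row.slot_name, retained))
    return alerts


# Example with made-up rows; the 32 MiB threshold is arbitrary.
rows = [
    SlotRow("standby1", False, "0/1000000", "0/5000000"),  # inactive, 64 MiB behind
    SlotRow("standby2", True, "0/4F00000", "0/5000000"),   # active, 1 MiB behind
]
print(slots_to_alert(rows, max_retained_bytes=32 * 1024 * 1024))
# prints [('standby1', 67108864)]
```

The sketch alerts only on inactive slots because that matches the failure scenario described above; a real deployment might also want to watch active slots whose retained WAL keeps growing.
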
If the community thinks additional monitoring recommendations would be helpful (e.g., alerting thresholds, integration with monitoring tools), I can add those as well.

Thanks for considering this improvement!

Best regards,
Venkat Venkatakrishnan
Database Architect

v1-0001-docs-Add-monitoring-guidance-to-replication-slot-.patch
