Re: confusing results from pg_get_replication_slots()

Andrey Borodin Sat, 03 Jan 2026 04:23:19 -0800

Hi Robert!

I've tried to look at how people use wal_status.
There are lots of monitoring usages where transient race conditions do not 
matter much.
But in some cases fatal decisions are made on the basis of "lost", e.g. 
https://github.com/readysettech/readyset/blob/cb77b75a56d952fb6b1c4171afa9f0b0175fb6d8/replicators/src/postgres_connector/connector.rs#L381

I concur that showing "unreserved" when there is no actual WAL is a bug.
The proposed fix will work and is very succinct. The resulting code structure 
is not super elegant, but acceptable.

I don't fully understand the circumstances in which this bug can do any harm. 
Maybe a negative safe_wal_size could be a surprise for some monitoring tools.
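
As a rough illustration of the kind of surprise I mean, here is a minimal 
Python sketch of a monitoring check. It is not taken from any real tool; the 
function name and the returned labels are hypothetical. It assumes the 
documented meaning of safe_wal_size (NULL when the slot is lost or when 
max_slot_wal_keep_size is -1) and refuses to treat a negative value as valid 
headroom.

```python
from typing import Optional


def describe_headroom(safe_wal_size: Optional[int]) -> str:
    """Classify a safe_wal_size reading for a hypothetical monitoring check.

    The labels returned here are illustrative only; they are not values
    that PostgreSQL itself reports.
    """
    if safe_wal_size is None:
        # NULL: the slot is lost, or max_slot_wal_keep_size is -1 (no limit).
        return "unknown"
    if safe_wal_size < 0:
        # A tool that stores this in an unsigned counter would misreport it,
        # so the negative case is handled explicitly.
        return "exhausted"
    return f"{safe_wal_size} bytes left"
```

A tool that only expects NULL or a non-negative number would have no branch 
for the middle case.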

> On 2 Jan 2026, at 20:40, Robert Haas <[email protected]> wrote:
> 
> For all practical intents and purposes, such a slot is no
> more - has ceased to be - has expired and gone to meet its maker -
> it's an ex-slot. It makes no sense to me to display that slot with a
> status that shows that there is some hope of recovery when in fact
> there is none.
> 
> Note, by the way, that in existing releases, connections to
> already-invalidated physical slots are not blocked. This has been
> changed, but only in master.

I don't understand the reason to disallow reviving a slot. Of course, it would 
need some new LSN that is currently available in pg_wal.

Imagine the following scenario: in a cluster of a primary and a standby, a 
long analytical query causes huge lag, the primary removes some WAL segments 
due to max_slot_wal_keep_size, the standby is disconnected, consumes several 
WAL files from the archive, catches up and continues. Or, if something was 
vacuumed, it cancels the analytical query. If we disallow reconnection of this 
standby, it will stay in archive recovery. I don't see how that's a good thing.

> On 3 Jan 2026, at 02:10, Robert Haas <[email protected]> wrote:
> 
> Maybe we shouldn't display "lost" when the slot
> is invalidated but "invalidated", for example, and any other value
> means we're just returning whatever GetWALAvailability() told us.
> Also, maybe the exception for connect slots should just be removed, on
> the assumption that the race condition isn't common enough to matter,
> or maybe that logic should be pushed down into GetWALAvailability() if
> we want to keep it.

I don't think the following logic works: "someone seems to be connected to 
this slot, perhaps it's still not lost". This is an error-prone heuristic that 
tries to work around a possibly stale restart_lsn.
For HEAD I'd propose to actually read restart_lsn and determine whether 
walsender will issue "requested WAL segment has already been removed" on its 
next attempt to send something. In that case the slot is "lost".

If I understand correctly, a slot might be "invalidated" but not yet "lost" in 
this sense: the timeout occurred, but the WAL is still there.
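
To make the distinction concrete, here is a minimal Python sketch of the check 
I have in mind. It is not PostgreSQL code: classify_slot(), its parameters and 
the "wal_present" label are hypothetical, and it assumes the default 16 MB 
wal_segment_size. The idea is only to compare the segment that contains 
restart_lsn with the oldest segment still present in pg_wal.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumes the default wal_segment_size


def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '0/3000060' to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def classify_slot(restart_lsn: str, oldest_kept_segno: int, invalidated: bool,
                  segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Hypothetical classification; the labels are illustrative only."""
    restart_segno = parse_lsn(restart_lsn) // segment_size
    if restart_segno < oldest_kept_segno:
        # The segment holding restart_lsn is gone, so the next send attempt
        # would fail with "requested WAL segment has already been removed".
        return "lost"
    if invalidated:
        # Invalidated (e.g. by a timeout), but the WAL is still there.
        return "invalidated"
    return "wal_present"
```

For example, restart_lsn '0/3000060' falls in segment 3, so with segment 5 as 
the oldest one kept the slot is "lost" regardless of the invalidated flag, 
while with segment 3 still present an invalidated slot is reported as 
"invalidated".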

Best regards, Andrey Borodin.
