On 2020-Mar-31, Alvaro Herrera wrote:

>                       /* release lock before syscalls */
>                       foreach(l, pids_to_kill)
>                       {
>                               kill(lfirst_int(l), SIGTERM);
>                       }
> 
> I sense some attempt to salvage slots that are reading a segment that is
> "outdated" and removed, but for which the walsender has an open file
> descriptor.  (This appears to be the "losing" state.) This seems
> dangerous, for example the segment might be recycled and is being
> overwritten with different data.  Trying to keep track of that seems
> doomed.  And even if the walsender can still read that data, it's only a
> matter of time before the next segment is also removed.  So keeping the
> walsender alive is futile; it only delays the inevitable.

I think we should kill(SIGTERM) the walsender using the slot (slot->active_pid),
then acquire the slot and set it to some state indicating that it is now
useless, no longer reserving WAL; so when the walsender is restarted, it
will find the slot cannot be used any longer.  Two ideas come to mind
about doing this:

1. set the LSNs and Xmins to Invalid; keep only the slot name, database,
plugin, etc.  This makes monitoring harder, I think, because as soon as
the slot is invalidated you no longer know anything about where it was.

2. add a new flag to ReplicationSlotPersistentData to indicate that the
slot is dead.  This preserves the LSN info for forensics, and might even
be easier to code.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

