On 2020-Mar-31, Alvaro Herrera wrote:

> 	/* release lock before syscalls */
> 	foreach(l, pids_to_kill)
> 	{
> 		kill(lfirst_int(l), SIGTERM);
> 	}
>
> I sense some attempt to salvage slots that are reading a segment that is
> "outdated" and removed, but for which the walsender has an open file
> descriptor.  (This appears to be the "losing" state.)  This seems
> dangerous; for example, the segment might be recycled and overwritten
> with different data.  Trying to keep track of that seems doomed.  And
> even if the walsender can still read that data, it's only a matter of
> time before the next segment is also removed.  So keeping the walsender
> alive is futile; it only delays the inevitable.

I think we should send SIGTERM to the walsender using the slot
(slot->active_pid), then acquire the slot and set it to some state
indicating that it is now useless and no longer reserving WAL.  That way,
when the walsender is restarted, it will find that the slot cannot be used
any longer.

Two ideas come to mind about doing this:

1. set the LSNs and Xmins to Invalid; keep only the slot name, database,
   plugin, etc.  This makes monitoring harder, I think, because as soon as
   the slot is gone you know nothing at all about it.

2. add a new flag to ReplicationSlotPersistentData to indicate that the
   slot is dead.  This preserves the LSN info for forensics, and might
   even be easier to code.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services