On Fri, 8 Dec 2023 at 19:15, Ashutosh Bapat <ashutosh.bapat....@gmail.com> wrote: > > > > > pg_replication_slot could be set back to null. > > > > > > In this case, since the basebackup was taken after the slot was > > > invalidated, it > > > does not require the WAL that was removed. But it seems that once the > > > streaming starts, the slot sprints to life again and gets validated > > > again. Here's > > > pg_replication_slot output after the standby starts. > > > > Actually, It doesn't bring the invalidated slot back to life completely. > > The slot's view data looks valid while the 'invalidated' flag of this slot > > is still > > RS_INVAL_WAL_REMOVED (user are not aware of it.) > > > > I was mislead by the code in pg_get_replication_slots(). I did not > read it till the following > > --- code ---- > case WALAVAIL_REMOVED: > > /* > * If we read the restart_lsn long enough ago, maybe that file > * has been removed by now. However, the walsender could have > * moved forward enough that it jumped to another file after > * we looked. If checkpointer signalled the process to > * termination, then it's definitely lost; but if a process is > * still alive, then "unreserved" seems more appropriate. > * > * If we do change it, save the state for safe_wal_size below. > */ > --- code --- > > I see now how an invalid slot's wal status can be reported as > unreserved. So I think it's a bug. > > > > > > > > > > > > > > So, I think it's a bug and one idea to fix is to check the validity of > > > > the physical slot in > > > > StartReplication() after acquiring the slot like what the attachment > > > > does, what do you think ? > > > > > > I am not sure whether that's necessarily a bug. Of course, we don't expect > > > invalid slots to be used but in this case I don't see any harm. > > > The invalid slot has been revived and has all the properties set just > > > like a valid > > > slot. So it must be considered in > > > ReplicationSlotsComputeRequiredXmin() as well. I haven't verified it > > > myself > > > though. In case the WAL is really lost and is requested by the standby it > > > will > > > throw an error "requested WAL segment [0-9A-F]+ has already been > > > removed". So no harm there as well. > > > > Since the 'invalidated' field of the slot is still (RS_INVAL_WAL_REMOVED), > > even > > if the walsender advances the restart_lsn, the slot will not be considered > > in the > > ReplicationSlotsComputeRequiredXmin(), so the WALs and Rows are not safe > > and that's why I think it's a bug. > > > > After looking closer, it seems this behavior started from 15f8203 which > > introduced the > > ReplicationSlotInvalidationCause 'invalidated', after that we check the > > invalidated enum > > in Xmin/Lsn computation function. > > > > If we want to go back to previous behavior, we need to revert/adjust the > > check > > for invalidated in ReplicationSlotsComputeRequiredXmin(), but since the > > logical > > decoding(walsender) disallow using invalidated slot, so I feel it's > > consistent > > to do similar check for physical one. Besides, pg_replication_slot_advance() > > also disallow passing invalidated slot to it as well. What do you think ? > > What happens if you run your script on a build prior to 15f8203? > Purely from reading the code, it looks like the physical slot would > sprint back to life since its restart_lsn would be updated. But I am > not able to see what happens to invalidated_at. It probably remains a > valid LSN and the slot would still not be considred in xmin > calculation. It will be good to be compatible to pre-15f8203 > behaviour.
I have changed the patch status to "Waiting on Author", as some of the queries requested by Ashutosh are not yet addressed. Regards, Vignesh