> > > pg_replication_slot could be set back to null.
> >
> > In this case, since the basebackup was taken after the slot was
> > invalidated, it does not require the WAL that was removed. But it
> > seems that once streaming starts, the slot springs back to life and
> > gets validated again. Here's the pg_replication_slots output after
> > the standby starts.
>
> Actually, it doesn't bring the invalidated slot back to life completely.
> The slot's view data looks valid, while the 'invalidated' flag of this
> slot is still RS_INVAL_WAL_REMOVED (users are not aware of it).
>

I was misled by the code in pg_get_replication_slots(). I had not read
it as far as the following:

--- code ---
			case WALAVAIL_REMOVED:

				/*
				 * If we read the restart_lsn long enough ago, maybe that file
				 * has been removed by now. However, the walsender could have
				 * moved forward enough that it jumped to another file after
				 * we looked. If checkpointer signalled the process to
				 * termination, then it's definitely lost; but if a process is
				 * still alive, then "unreserved" seems more appropriate.
				 *
				 * If we do change it, save the state for safe_wal_size below.
				 */
--- code ---
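
The code right after that comment is what flips the reported status.
Paraphrasing from pg_get_replication_slots() in slotfuncs.c (the exact
shape may differ a bit across branches):

--- code ---
				if (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))
				{
					int			pid;

					/* Re-read active_pid and restart_lsn under the spinlock. */
					SpinLockAcquire(&slot->mutex);
					pid = slot->active_pid;
					slot_contents.data.restart_lsn = slot->data.restart_lsn;
					SpinLockRelease(&slot->mutex);

					/*
					 * A walsender still holds the slot, so the status is
					 * reported as "unreserved" even though the slot was
					 * invalidated.
					 */
					if (pid != 0)
					{
						values[i++] = CStringGetTextDatum("unreserved");
						walstate = WALAVAIL_UNRESERVED;
						break;
					}
				}
				values[i++] = CStringGetTextDatum("lost");
				break;
--- code ---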

I see now how an invalid slot's wal_status can be reported as
"unreserved". So I think it's a bug.

>
> >
> > >
> > > So, I think it's a bug, and one idea to fix it is to check the
> > > validity of the physical slot in StartReplication() after acquiring
> > > the slot, like the attachment does. What do you think?
> >
> > I am not sure whether that's necessarily a bug. Of course, we don't
> > expect invalid slots to be used, but in this case I don't see any
> > harm. The invalid slot has been revived and has all its properties
> > set just like a valid slot, so it must be considered in
> > ReplicationSlotsComputeRequiredXmin() as well. I haven't verified it
> > myself though. In case the WAL is really lost and is requested by the
> > standby, it will throw an error "requested WAL segment [0-9A-F]+ has
> > already been removed". So no harm there as well.
>
> Since the 'invalidated' field of the slot is still RS_INVAL_WAL_REMOVED,
> even if the walsender advances the restart_lsn, the slot will not be
> considered in ReplicationSlotsComputeRequiredXmin(), so the WAL and rows
> are not protected, and that's why I think it's a bug.
>
> After looking closer, it seems this behavior started with 15f8203, which
> introduced the ReplicationSlotInvalidationCause 'invalidated'; since that
> commit we check the invalidated enum in the Xmin/LSN computation
> functions.
>
> If we want to go back to the previous behavior, we need to revert/adjust
> the check for invalidated in ReplicationSlotsComputeRequiredXmin(). But
> since logical decoding (walsender) disallows using an invalidated slot,
> I feel it's consistent to do a similar check for the physical one.
> Besides, pg_replication_slot_advance() also disallows passing an
> invalidated slot. What do you think?
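
For reference, the check being referred to looks roughly like this in
ReplicationSlotsComputeRequiredXmin() in slot.c (simplified here; the
aggregation of the xmin values is elided):

--- code ---
	for (i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
		TransactionId effective_xmin;
		TransactionId effective_catalog_xmin;
		bool		invalidated;

		if (!s->in_use)
			continue;

		SpinLockAcquire(&s->mutex);
		effective_xmin = s->effective_xmin;
		effective_catalog_xmin = s->effective_catalog_xmin;
		invalidated = s->data.invalidated != RS_INVAL_NONE;
		SpinLockRelease(&s->mutex);

		/* invalidated slots need not apply */
		if (invalidated)
			continue;

		/* ... aggregate effective_xmin/effective_catalog_xmin ... */
	}
--- code ---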

What happens if you run your script on a build prior to 15f8203? Purely
from reading the code, it looks like the physical slot would spring back
to life, since its restart_lsn would be updated. But I am not able to
see what happens to invalidated_at. It probably remains a valid LSN, and
the slot would still not be considered in the xmin calculation. It would
be good to be compatible with the pre-15f8203 behaviour.

I think making logical and physical slot behaviour consistent would be
better, but if the inconsistent behaviour has been out there for some
releases, changing it now will break backward compatibility.
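
To make the proposal concrete, the kind of check being discussed for the
physical path would be something along these lines in StartReplication(),
right after the slot is acquired (a rough sketch; I haven't checked the
attachment's exact wording, and the error message text here is only
illustrative):

--- code ---
	if (cmd->slotname)
	{
		ReplicationSlotAcquire(cmd->slotname, true);
		if (SlotIsLogical(MyReplicationSlot))
			ereport(ERROR,
					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
					 errmsg("cannot use a logical replication slot for physical replication")));

		/* Sketch: refuse to stream from a slot that has been invalidated. */
		if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
			ereport(ERROR,
					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
					 errmsg("cannot read from replication slot \"%s\"",
							cmd->slotname),
					 errdetail("This slot has been invalidated because the required WAL has been removed.")));
	}
--- code ---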

-- 
Best Wishes,
Ashutosh Bapat

