On Thursday, December 7, 2023 7:43 PM Ashutosh Bapat <ashutosh.bapat....@gmail.com> wrote:
Hi, > > On Wed, Dec 6, 2023 at 6:25 PM Zhijie Hou (Fujitsu) <houzj.f...@fujitsu.com> > wrote: > > > > When testing streaming replication with a physical slot. I found an > > unexpected behavior that the walsender could use an invalidated > > physical slot for streaming. > > > > This occurs when the primary slot is invalidated due to reaching the > > max_slot_wal_keep_size before initializing the streaming replication > > (e.g. before creating the standby). Attach a draft > > script(test_invalidated_slot.sh) which can reproduce this. > > Interesting. Thanks for the script. It reproduces the problem for me easily. Thanks for testing and replying! > > > > > Once the slot is invalidated, it can no longer protect the WALs and > > Rows, as these invalidated slots are not considered in functions like > > ReplicationSlotsComputeRequiredXmin(). > > > > Besides, the walsender could advance the restart_lsn of an invalidated > > slot, then user won't be able to know that if the slot is actually > > validated or not, because the 'conflicting' of view > > pg_replication_slot could be set back to null. > > In this case, since the basebackup was taken after the slot was invalidated, > it > does not require the WAL that was removed. But it seems that once the > streaming starts, the slot sprints to life again and gets validated again. > Here's > pg_replication_slot output after the standby starts. Actually, It doesn't bring the invalidated slot back to life completely. The slot's view data looks valid while the 'invalidated' flag of this slot is still RS_INVAL_WAL_REMOVED (user are not aware of it.) > > > > > So, I think it's a bug and one idea to fix is to check the validity of > > the physical slot in > > StartReplication() after acquiring the slot like what the attachment > > does, what do you think ? > > I am not sure whether that's necessarily a bug. Of course, we don't expect > invalid slots to be used but in this case I don't see any harm. > The invalid slot has been revived and has all the properties set just like a > valid > slot. So it must be considered in > ReplicationSlotsComputeRequiredXmin() as well. I haven't verified it myself > though. In case the WAL is really lost and is requested by the standby it will > throw an error "requested WAL segment [0-9A-F]+ has already been > removed". So no harm there as well. Since the 'invalidated' field of the slot is still (RS_INVAL_WAL_REMOVED), even if the walsender advances the restart_lsn, the slot will not be considered in the ReplicationSlotsComputeRequiredXmin(), so the WALs and Rows are not safe and that's why I think it's a bug. After looking closer, it seems this behavior started from 15f8203 which introduced the ReplicationSlotInvalidationCause 'invalidated', after that we check the invalidated enum in Xmin/Lsn computation function. If we want to go back to previous behavior, we need to revert/adjust the check for invalidated in ReplicationSlotsComputeRequiredXmin(), but since the logical decoding(walsender) disallow using invalidated slot, so I feel it's consistent to do similar check for physical one. Besides, pg_replication_slot_advance() also disallow passing invalidated slot to it as well. What do you think ? Best Regards, Hou zj