On Thu, Nov 13, 2025 at 8:03 AM Zhijie Hou (Fujitsu) <[email protected]> wrote: > > On Wednesday, November 12, 2025 11:55 PMVitaly Davydov > <[email protected]> wrote: > > > > On Monday, November 10, 2025 12:11 MSK, Amit Kapila > > <[email protected]> wrote: > > > The fix seems to be only provided for bank branches, but IIUC the > > > problem can happen in HEAD as well. In Head, how about acquiring > > > ReplicationSlotAllocationLock in Exclusive mode during > > > ReplicationSlotReserveWal? This lock is acquired in SHARE mode in > > > CheckPointReplicationSlots. So, this should make our calculations > > > correct and avoid invalidating the newly created slot. > > > > Amit, I'm not sure how it may help to avoid the change of redo rec ptr > > during > > wal reservation by a slot, because it doesn't serialize redo rec ptr > > assignment > > and slot's wal reservation, but the core issue is in reco rec ptr change > > when we > > reserve the wal by a slot. > > I think the main issue here lies in the possibility that the minimum > restart_lsn > obtained during a checkpoint could be less than the WAL position that is being > reserved concurrently. So, instead of serializing the redo assignment and WAL > reservation, Amit proposes serializing the CheckPointReplicationSlots() and > WAL > reservation. This would ensure the following: > > 1) If the WAL reservation occurs first, the checkpoint must wait for the > restart_lsn to be updated before proceeding with WAL removal. This guarantees > that the most recent restart_lsn position is detected. > > 2) If the checkpoint calls CheckPointReplicationSlots() first, then any > subsequent WAL reservation must take a position later than the redo pointer. >
Your understanding is correct. This whole difference of the way same problem can happen in HEAD and back branches but in different ways because the code varies a lot between branches. Now, if we again put a different fix for the newly discovered problem, then it will take the code much further away adding more to the maintenance burden. So, I thought we should once discuss making the code same in HEAD and back branches. Can we do some research to see if any of the available extensions uses sizeof(ReplicationSlot)? I think even if we don't find any such extension there will always be a chance that there is some closed source extension that might rely on it but I think still this is worth considering to change ABI in back branches for this fix. This code area is quite subtle where the bugs exist from a long time and the initial fix also didn't close all holes. -- With Regards, Amit Kapila.
