On Monday, November 17, 2025 6:50 PM Zhijie Hou (Fujitsu) <[email protected]> wrote: > > > Attach v2 patch that improves the fix based on above analysis. > > (I am still testing some other scenarios related to slotsync to ensure the > current fix is complete)
I have been testing whether taking exclusive ReplicationSlotAllocationLock can prevent newly synced slot from being invalidated, but found that it is insufficient. During slotsync, since it fetches the remote LSN (queried from the publisher) as the initial restart_lsn for newly synced slot and we sync the slots in a asynchronous manner, the synced restart_lsn could be behind the standby's redo pointer. So, there could be race conditions that the checkpoint removes the required WALs and invalidates the newly synced slot: 1. Assuming there is no slot on standby, the minimum slot LSN would be invalid, and the checkpoint reaches InvalidateObsoleteReplicationSlots() with oldestSegno =segno of redo. 2. The slotsync successfully sync the slot A that has old restart_lsn. 3. The checkpoint would then find that the newly synced slot A should be invalidated due to old restart_lsn. Unlike in ReplicationSlotReserveWal(), holding only the ReplicationSlotAllocationLock does not safeguard a newly synced slot if a checkpoint occurs before WAL reservation during slotsync as the initial restart_lsn provided could be prior to the redo pointer. To address this issue, I changed the WAL reservation to compare the remote restart_lsn with both minimum slot LSN and redo pointer. If either local LSN is less than the remote restart_lsn, we update the local slot with the remote value. Otherwise, we use the minimum of the two local LSNs. Additionally, we acquire both ReplicationSlotAllocationLock and ReplicationSlotControlLock to prevent the checkpoint from updating the redo pointer and other backend processes from updating the minimum slot LSN. Here is the V3 patch that fixes all the race conditions found so far. ------ Apart from addressing the fix for HEAD, I would like to inquire if anyone holds a differing opinion regarding the revert of the origin fix 2090edc6f32f652a2c in the back branches and applying the same fix as HEAD. I searched for extensions that rely on the size of ReplicationSlot and did not find any, so I think it is OK to implement the same fix in the backpatches. It there are no alternative views, I will proceed with preparing the patches for the back branches in upcoming versions. Best Regards, Hou zj
v3-0001-Fix-race-conditions-causing-invalidation-of-newly.patch
Description: v3-0001-Fix-race-conditions-causing-invalidation-of-newly.patch
