On Sat, Jun 6, 2026 at 5:35 PM Zhijie Hou (Fujitsu) <[email protected]> wrote: > > On Friday, June 5, 2026 8:45 PM Xuneng Zhou <[email protected]> wrote: > > On Wed, Jun 3, 2026 at 8:03 PM Fujii Masao <[email protected]> wrote: > > > > > > On Tue, Jun 2, 2026 at 3:00 PM Xuneng Zhou <[email protected]> > > wrote: > > > > > > /* Drop the local slot if it is not required to be retained. */ > > > if (!local_sync_slot_required(local_slot, remote_slot_list)) > > > { > > > + bool dropped = false; > > > + NameData slot_name = {0}; > > > + Oid slot_database = local_slot->data.database; > > > bool synced_slot; > > > > > > Is it really safe to read slot_database before acquiring the database > > > lock? > > > > Reading slot_database before taking the database lock seems not > > inherently unsafe by itself. The comment suggests that the lock is > > primarily used to prevent conflicts with the startup process running > > ReplicationSlotsDropDBSlots() during db-drop replay; it does not > > protect replication slot array reuse. > > > > The unsafe part could be reading slot_database from local_slot after > > ReplicationSlotControlLock has been released. At this point, the slot > > array cell may already have been freed and reused, so the value read > > may no longer belong to the slot that get_local_synced_slots() > > originally collected. As a result, we could end up locking the wrong > > database. > > > > There seems to be two related issues: > > > > 1) Before drop: reading local_slot->data.database / > > local_slot->data.name after the slot-array lock was released, before > > verifying the cell still represents the same synced slot. > > I recall condition (1) is considered acceptable, since the database lock is > released immediately after re-verifying that the slot is no longer the > original > 'synced' one anyway. Additionally, this race can only occur when replaying a > DROP DATABASE, which is rare in practice. Since we only take a shared lock, it > does not seem to cause real issues. > > > > > 2) After drop: calling ReplicationSlotDropAcquired(false) and then > > reading local_slot->data.database / local_slot->data.name for > > unlocking/logging after the cell may have been reused by another > > backend. > > Right. I missed this race condition during implementation and agree it's a > real > issue. An even bigger problem is that we could release a lock on the wrong > database if the slot entry is reused after being dropped. I think we should > fix > this by saving the database OID at the beginning, as your patch does. > > > > > The prior patch targets to fix 2), but leaves 1) unfixed. > > > > > BTW, I'm also wondering whether it's safe for > > > local_sync_slot_required() to access local_slot without holding a lock. > > > > I share the same concern here. The issue smells similar to the one > > discussed above -- reading values from the array cell directly without > > the protection of array lock. > > > > To solve them altogether, it might be > > better to stop carrying raw ReplicationSlot * values across > > unprotected windows. We can get the snapshot values such as slot name > > & database oid from get_local_synced_slots() instead. Then > > local_sync_slot_required() works from the snapshot, and > > drop_local_obsolete_slots() uses the copied database OID to take the > > database lock. Before dropping, it should revalidate by copied > > name/database that the slot is still a synced logical slot, then > > acquire/drop by copied name, and use only copied values for > > unlock/logging. I plan to prepare a refactoring patch for this. Does > > that seem like the right direction to you? > > Saving only the name and database OID would force us to search for the slot > again in the loop, which was one reason we didn't implement it that way.
The extra search may not be ideal, especially for the worker with a large number of slots to sync. But the cost could be avoided with an extra field like slotno. Were there other blocking issues discussed before? -- Regards, Xuneng Zhou HighGo Software Co., Ltd.
