Hi Fujii-san, Amit, The issues look real to me and could be dealt with patch v1 partially.
On Thu, Jun 11, 2026 at 9:19 PM Fujii Masao <[email protected]> wrote: > > On Thu, Jun 11, 2026 at 8:18 PM Amit Kapila <[email protected]> wrote: > > 1. Stale name read in local_sync_slot_required(): The reused cell > > holds a different name. local_sync_slot_required() might return false > > (drop needed). But then the in_use && synced spinlock check sees > > synced = false and skips the actual drop. The wrong decision is > > caught. > > Yes, we could skip the actual drop. But then wouldn't we still emit > the log message "dropped replication slot ..." even though no slot was > actually dropped? With v1, we won't emit the log message unless the log is factually dropped. However it did not prevent the stale read in local_sync_slot_required(). > > 2. Wrong database OID read at line 551: The reused cell holds OID_B > > from the new slot. We lock OID_B, then at lines 563–565 we see synced > > = false, skip the drop, and unlock OID_B at line 579. Since no drop > > occurred, the cell is still the same non-synced slot, so the lock and > > unlock see the same OID_B. Symmetric — no lock leak. > > What happens if the slot for OID_B is dropped after we lock > OID_B, and then a new slot for OID_C reuses the same array entry? In > that case, wouldn't the later unlock read OID_C from > local_slot->data.database even though the lock was originally taken on > OID_B? V1 stops doing the venerable second read of local_slot->data.database. So if the copied value was already stale and points to OID_B, v1 is at least symmetric: read OID_B once lock OID_B cell reused as OID_C unlock OID_B But v1 seems not to fully solve issue 1. It can still do this: cell already reused before slot_database is copied v1 copies OID_B from replacement slot locks OID_B recheck sees synced=false skips drop unlocks OID_B That is still a stale read and possibly a wasted/wrong database lock, but it doesn't leak the lock, unlocks the wrong object, logs a false drop, or drops the replacement slot. In an off-list chat with Zhijie, we kinda thought that holding the lock of a wrong db for a brief time doesn't seem to harm a lot. The concurrent dropping-db operation leads to this issue seems rare in practice. He stated that the deletion of the slot seems unavoidable because we have to acquire the database lock after releasing the replication slot lock to avoid the deadlock with the startup/drop db operation. Therefore, he prefered keeping the design simple and avoiding the fatal issue over doing a broader refactoring work. I don't have a strong opinion on this. Still attaching the refactoring patch to do some clean-up in case someone thinks it's worthwhile. -- Regards, Xuneng Zhou HighGo Software Co., Ltd.
v1-0001-Avoid-stale-slot-access-after-dropping-obsolete-s.patch
Description: Binary data
v1-0002-Avoid-stale-slot-pointers-in-slotsync-cleanup.patch
Description: Binary data
