Hi Fujii-san,

On Wed, Jun 3, 2026 at 8:03 PM Fujii Masao <[email protected]> wrote:
>
> On Tue, Jun 2, 2026 at 3:00 PM Xuneng Zhou <[email protected]> wrote:
> >
> > Hi,
> >
> > On Sat, May 30, 2026 at 4:14 PM Zhijie Hou (Fujitsu)
> > <[email protected]> wrote:
> > > I'm not sure if adding an injection point for this rare case is 
> > > worthwhile. Even
> > > if we were to add one, future refactoring of that function could shift the
> > > position of the injection point, so its long-term usefulness is 
> > > uncertain. I
> > > don't have a strong opinion on this, so I'll leave it to Fujii-San to 
> > > decide.
>
> I've pushed the patch. Thanks!
>
> IMO the proposed test looks a bit too narrow, so I'm not sure it's worth
> adding at this point. For now, I've committed only the code fix.
>
>
> > There's an adjacent bug around drop_local_obsolete_slots. The root
> > cause of them looks similar -- ReplicationSlot * is a pointer to a
> > reusable shared-memory array cell, not a durable identity for the same
> > slot. In drop_local_obsolete_slots, the issue is that the slot has
> > been freed after ReplicationSlotDropAcquired(false); however, another
> > backend may reuse the same cell before the unlock/log reads. This
> > seems less severe -- it does not normally corrupt slot state, because
> > the code only read after the drop. But it can still unlock/log
> > misusing the identity of a different slot. Attached a test using
> > injection point to reproduce it and a patch to fix it.
>
> Thanks again for the report and patch!
>
>   /* Drop the local slot if it is not required to be retained. */
>   if (!local_sync_slot_required(local_slot, remote_slot_list))
>   {
> + bool dropped = false;
> + NameData slot_name = {0};
> + Oid slot_database = local_slot->data.database;
>   bool synced_slot;
>
> Is it really safe to read slot_database before acquiring the database lock?

Reading slot_database before taking the database lock seems not
inherently unsafe by itself. The comment suggests that the lock is
primarily used to prevent conflicts with the startup process running
ReplicationSlotsDropDBSlots() during db-drop replay; it does not
protect replication slot array reuse.

The unsafe part could be reading slot_database from local_slot after
ReplicationSlotControlLock has been released. At this point, the slot
array cell may already have been freed and reused, so the value read
may no longer belong to the slot that get_local_synced_slots()
originally collected. As a result, we could end up locking the wrong
database.

There seems to be two related issues:

1) Before drop: reading local_slot->data.database /
local_slot->data.name after the slot-array lock was released, before
verifying the cell still represents the same synced slot.

2) After drop: calling ReplicationSlotDropAcquired(false) and then
reading local_slot->data.database / local_slot->data.name for
unlocking/logging after the cell may have been reused by another
backend.

The prior patch targets to fix 2), but leaves 1) unfixed.

> BTW, I'm also wondering whether it's safe for
> local_sync_slot_required() to access local_slot without holding a lock.

I share the same concern here. The issue smells similar to the one
discussed above -- reading values from the array cell directly without
the protection of array lock. To solve them altogether, it might be
better to stop carrying raw ReplicationSlot * values across
unprotected windows. We can get the snapshot values such as slot name
& database oid from get_local_synced_slots() instead. Then
local_sync_slot_required() works from the snapshot, and
drop_local_obsolete_slots() uses the copied database OID to take the
database lock. Before dropping, it should revalidate by copied
name/database that the slot is still a synced logical slot, then
acquire/drop by copied name, and use only copied values for
unlock/logging. I plan to prepare a refactoring patch for this. Does
that seem like the right direction to you?

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.


Reply via email to