On Wed, Jul 30, 2025 at 12:22 AM Hayato Kuroda (Fujitsu)
<kuroda.hay...@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> While reading more, I found a race condition.

Thank you for reviewing the patch!

> In this case the effective_wal_level
> can be logical even when there is no logical slot.
> UpdateLogicalDecodingStatusEndOfRecovery() checks the number of logical
> slots and then releases the lock. The startup process then re-acquires the
> lock, compares the result with IsLogicalDecodingEnabled(), and updates the
> status afterward if needed.
> So, wal_level can be inconsistent if the status is changed after
> n_logical_slots is read.
>
> Steps:
> a) constructed a primary-standby system
> b) created a logical slot on the primary
> c) created a logical slot on the standby
> d) sent a promote signal to the standby
> e) dropped a logical slot on the standby, just after the startup process released
>    LogicalDecodingControlLock in UpdateLogicalDecodingStatusEndOfRecovery().
>
> After the above, effective_wal_level stayed logical. Is this the expected
> behavior?

No, we need to fix it.

I thought we could fix this issue by checking the number of in-use
logical slots while holding ReplicationSlotControlLock and
LogicalDecodingControlLock, but it seems we also need to deal with
another race condition between backends and the startup process at the
end of recovery.
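
To make the first window concrete, here is a toy Python model of the
check-then-act pattern described above. It is only a sketch and not
PostgreSQL code: ToyState, drop_slot(), end_of_recovery_racy() and
end_of_recovery_atomic() are hypothetical names, and a single
threading.Lock stands in for both control locks. The `between` callback
lets the interleaving be replayed deterministically.

```python
import threading


class ToyState:
    """Toy stand-in for the shared state; not PostgreSQL code."""

    def __init__(self, n_logical_slots):
        # One lock plays the role of both control locks (an assumption).
        self.lock = threading.Lock()
        self.n_logical_slots = n_logical_slots
        self.logical_decoding_enabled = n_logical_slots > 0


def drop_slot(state):
    """Backend drops a logical slot; during recovery it leaves the status alone."""
    with state.lock:
        state.n_logical_slots -= 1


def end_of_recovery_racy(state, between=None):
    """Read the slot count, release the lock, then update the status."""
    with state.lock:
        n = state.n_logical_slots
    # The lock is released here, so a concurrent drop can slip in.
    if between is not None:
        between()
    with state.lock:
        want = n > 0  # decision based on a possibly stale count
        if state.logical_decoding_enabled != want:
            state.logical_decoding_enabled = want


def end_of_recovery_atomic(state):
    """Check the slot count and update the status under one lock hold."""
    with state.lock:
        state.logical_decoding_enabled = state.n_logical_slots > 0


# Replay of the reported interleaving: the drop lands inside the window.
racy = ToyState(n_logical_slots=1)
end_of_recovery_racy(racy, between=lambda: drop_slot(racy))
# racy now has zero slots but logical decoding is still enabled.
```

With end_of_recovery_atomic() the drop can only happen entirely before
or entirely after the check, which closes this window in the model but
does nothing for a drop that arrives after the check.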

Currently a backend skips controlling the logical decoding status if the
server is in recovery (by checking RecoveryInProgress()), but it's
possible that a backend process tries to drop a logical slot after the
startup process calls UpdateLogicalDecodingStatusEndOfRecovery() and
before the server starts accepting writes. In this case, the backend
ends up not disabling logical decoding and it remains enabled. I think
we would somehow need to delay the logical decoding status change in
this period until the recovery completes.
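
As a rough illustration of this second window, the toy Python model
below replays it in a single thread. Everything in it is hypothetical
(ToyServer, the recheck_pending flag, finish_recovery(), run()); it is
not the patch and not PostgreSQL code. The defer=True path shows just
one possible reading of "delay the status change until recovery
completes": the backend records that a re-check is needed, and the
status is re-evaluated once recovery has finished.

```python
class ToyServer:
    """Toy model of the end-of-recovery window; not PostgreSQL code."""

    def __init__(self, n_logical_slots):
        self.recovery_in_progress = True
        self.n_logical_slots = n_logical_slots
        self.logical_decoding_enabled = n_logical_slots > 0
        self.recheck_pending = False  # hypothetical deferral flag

    def drop_logical_slot(self, defer=False):
        """Backend drops a slot; in recovery it skips the status change."""
        self.n_logical_slots -= 1
        if self.recovery_in_progress:
            if defer:
                # Assumption: remember that the status must be re-checked.
                self.recheck_pending = True
            return
        if self.n_logical_slots == 0:
            self.logical_decoding_enabled = False

    def update_status_end_of_recovery(self):
        """Startup process sets the status from the current slot count."""
        self.logical_decoding_enabled = self.n_logical_slots > 0

    def finish_recovery(self):
        """Recovery completes; apply any deferred status re-check."""
        self.recovery_in_progress = False
        if self.recheck_pending:
            self.recheck_pending = False
            self.logical_decoding_enabled = self.n_logical_slots > 0


def run(defer):
    """Replay: status update, then a drop in the window, then recovery ends."""
    server = ToyServer(n_logical_slots=1)
    server.update_status_end_of_recovery()
    server.drop_logical_slot(defer=defer)  # still in recovery here
    server.finish_recovery()
    return server.n_logical_slots, server.logical_decoding_enabled
```

In this model run(False) leaves logical decoding enabled with zero
slots, while run(True) ends with it disabled. How the deferral should
actually be implemented in the server is still an open question.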

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com