On Wed, Jul 30, 2025 at 12:22 AM Hayato Kuroda (Fujitsu) <kuroda.hay...@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> While reading more, I found a race condition.

Thank you for reviewing the patch!

> In this case the effective_wal_level can be logical even when there is
> no logical slot. UpdateLogicalDecodingStatusEndOfRecovery() checks the
> number of logical slots and then releases the lock. The startup process
> then acquires the lock again, compares the result with
> IsLogicalDecodingEnabled(), and updates the status if needed. So
> wal_level can become inconsistent if the status changes after
> n_logical_slots is read.
>
> Steps:
> a) constructed a primary-standby system
> b) created a logical slot on the primary
> c) created a logical slot on the standby
> d) sent a promote signal to the standby
> e) dropped the logical slot on the standby just after the startup
>    process released LogicalDecodingControlLock in
>    UpdateLogicalDecodingStatusEndOfRecovery().
>
> After the above, effective_wal_level remained logical. Is this the
> expected behavior?

No, we need to fix it.

I thought we could fix this issue by checking the number of in-use
logical slots while holding both ReplicationSlotControlLock and
LogicalDecodingControlLock, but it seems we also need to deal with
another race condition between backends and the startup process at the
end of recovery. Currently, a backend skips controlling the logical
decoding status if the server is in recovery (by checking
RecoveryInProgress()), but a backend may try to drop a logical slot
after the startup process has called
UpdateLogicalDecodingStatusEndOfRecovery() and before the server starts
accepting writes. In this case, the backend ends up not disabling
logical decoding, and it remains enabled. I think we need some way to
delay logical decoding status changes during this window until recovery
completes.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
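
For illustration only, the first fix discussed above (reading the number of logical slots and updating the status while holding both locks, instead of releasing the lock in between) can be modeled as a small C sketch. This is a toy model under stated assumptions, not PostgreSQL code: two pthread mutexes stand in for ReplicationSlotControlLock and LogicalDecodingControlLock, and the names logical_decoding_enabled, end_of_recovery_update(), drop_logical_slot() and run_race_once() are hypothetical. It deliberately does not model the second race (a backend skipping the status update because RecoveryInProgress() is still true).

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Toy model of the end-of-recovery race; NOT PostgreSQL code.
 * The mutexes stand in for ReplicationSlotControlLock and
 * LogicalDecodingControlLock; logical_decoding_enabled is hypothetical.
 */
static pthread_mutex_t slot_ctl_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t decoding_ctl_lock = PTHREAD_MUTEX_INITIALIZER;
static int n_logical_slots;
static bool logical_decoding_enabled;

/* "Startup process": read the slot count and update the status while
 * holding both locks, so no slot can be dropped in between. */
static void *end_of_recovery_update(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&slot_ctl_lock);
    pthread_mutex_lock(&decoding_ctl_lock);
    logical_decoding_enabled = (n_logical_slots > 0);
    pthread_mutex_unlock(&decoding_ctl_lock);
    pthread_mutex_unlock(&slot_ctl_lock);
    return NULL;
}

/* "Backend": drop one logical slot, taking the locks in the same order,
 * and disable decoding when the last slot goes away. */
static void *drop_logical_slot(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&slot_ctl_lock);
    pthread_mutex_lock(&decoding_ctl_lock);
    assert(n_logical_slots > 0);   /* the model only drops an existing slot */
    n_logical_slots--;
    if (n_logical_slots == 0)
        logical_decoding_enabled = false;
    pthread_mutex_unlock(&decoding_ctl_lock);
    pthread_mutex_unlock(&slot_ctl_lock);
    return NULL;
}

/* Start from one slot with decoding enabled, run both "processes"
 * concurrently, and report whether the status matches the slot count. */
static bool run_race_once(void)
{
    pthread_t startup, backend;

    n_logical_slots = 1;
    logical_decoding_enabled = true;

    pthread_create(&startup, NULL, end_of_recovery_update, NULL);
    pthread_create(&backend, NULL, drop_logical_slot, NULL);
    pthread_join(startup, NULL);
    pthread_join(backend, NULL);

    return logical_decoding_enabled == (n_logical_slots > 0);
}
```

Because both paths take the two locks in the same order, they cannot deadlock, and whichever runs second observes the other's completed update, so in this model the status matches the slot count under either interleaving. The real fix would also need to cover the window between UpdateLogicalDecodingStatusEndOfRecovery() and the end of recovery, which this sketch leaves out.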