On Tue, Aug 5, 2025 at 3:11 AM shveta malik <[email protected]> wrote: > > On Tue, Aug 5, 2025 at 5:14 AM Masahiko Sawada <[email protected]> wrote: > > > > On Mon, Aug 4, 2025 at 3:38 AM shveta malik <[email protected]> wrote: > > > > > > On Sat, Aug 2, 2025 at 4:53 AM Masahiko Sawada <[email protected]> > > > wrote: > > > > > > > > On Thu, Jul 31, 2025 at 5:00 AM Hayato Kuroda (Fujitsu) > > > > <[email protected]> wrote: > > > > > > > > > > Dear Sawada-san, > > > > > > > > > > > I thought we could fix this issue by checking the number of in-use > > > > > > logical slots while holding ReplicationSlotControlLock and > > > > > > LogicalDecodingControlLock, but it seems we need to deal with > > > > > > another > > > > > > race condition too between backends and startup processes at the end > > > > > > of recovery. > > > > > > > > > > > > Currently the backend skips controlling logical decoding status if > > > > > > the > > > > > > server is in recovery (by checking RecoveryInProgress()), but it's > > > > > > possible that a backend process tries to drop a logical slot after > > > > > > the > > > > > > startup process calling UpdateLogicalDecodingStatusEndOfRecovery() > > > > > > and > > > > > > before accepting writes. > > > > > > > > > > Right. I also verified on local and found that > > > > > ReplicationSlotDropAcquired()->DisableLogicalDecodingIfNecessary() > > > > > sometimes > > > > > skips to modify the status because RecoveryInProgress is still false. > > > > > > > > > > > In this case, the backend ends up not > > > > > > disabling logical decoding and it remains enabled. I think we would > > > > > > somehow need to delay the logical decoding status change in this > > > > > > period until the recovery completes. > > > > > > > > > > My primitive idea was to 1) keep startup acquiring the lock till end > > > > > of recovery > > > > > and 2) DisableLogicalDecodingIfNecessary() acquires lock before > > > > > checking the > > > > > recovery status, but it could not work well. Not sure but > > > > > WaitForProcSignalBarrier() > > > > > stucked if the process acquired LogicalDecodingControlLock lock.... > > > > > > > > I think that it's not realistic to keep holding a lwlock until the > > > > recovery actually completes because we perform a checkpoint after > > > > that. > > > > > > > > In the latest version patch I attached, I introduce a flag on shared > > > > memory to delay any logical decoding status change until the recovery > > > > completes. The implementation got more complex than I expected but I > > > > don't have a better idea. I'm open to other approaches. Also, I > > > > incorporated all comments I got so far[1][2][3] and updated the > > > > documentation. > > > > > > > > > > Yes, it is slightly complex, I will put more thoughts into it. That > > > said, I do have a related scenario in mind concerning the recent fix, > > > where we might still end up with an incorrect effective_wal_level > > > after promotion. > > > > > > Say primary has 'wal_level'=replica and standby has > > > 'wal_level'=logical. Since there are no slots on standby > > > 'effective_wal_level' will still be replica. Now I created a slot both > > > on primary and standby making 'effective_wal_level'=logical. Now, when > > > the standby is promoted and the slot is dropped immediately after > > > UpdateLogicalDecodingStatusEndOfRecovery() releases the lock, we still > > > expect the effective_wal_level on the promoted standby (now the > > > primary) to remain logical, since its configured 'wal_level' is > > > logical and it has become the primary. But I think that may not be the > > > case because > > > 'DisableLogicalDecodingIfNecessary-->start_logical_decoding_status_change()' > > > does not consider original wal_level on promoted standby in > > > retrial-attempt. I feel 'retry' should be above ' wal_level == > > > WAL_LEVEL_LOGICAL' check in below code snippet: > > > > > > +static bool > > > +start_logical_decoding_status_change(bool new_status) > > > +{ > > > + /* > > > + * On the primary with 'logical' WAL level, we can skip logical decoding > > > + * status change as it's always enabled. On standbys, we need to check > > > the > > > + * status on shared memory propagated from the primary and might handle > > > + * status change delay. > > > + */ > > > + if (!RecoveryInProgress() && wal_level == WAL_LEVEL_LOGICAL) > > > + return false; > > > + > > > +retry: > > > + > > > > > > Please note that I could not reproduce this scenario because as soon > > > as I put sleep or injection-point in > > > UpdateLogicalDecodingStatusEndOfRecovery(), I hit some ProcSignal > > > Barriers issue i.e. it never completes even when sleep is over. I get > > > this: 'LOG: still waiting for backend with PID 162838 to accept > > > ProcSignalBarrier'. > > > > Thank you for the comment! I think you're right. That check should be > > done after 'retry'. WIll incorporate the change in the next version > > patch. > > > > Thanks. >
Thank you for reviewing the patch!
> 1)
>
> start_logical_decoding_status_change():
>
> + LWLockRelease(LogicalDecodingControlLock);
> +
> + /* Mark the state transition is in-progress */
> + LogicalDecodingCtl->transition_in_progress = true;
>
> I think we should set transition_in_progress before releasing lock,
> else it may hit a race condition between create and drop slot and can
> end up having a slot but with logical decoding disabled.
Ugh, you're right. It should be protected by the lwlock.
>
> 2)
> CheckLogicalDecodingRequirements has this change:
>
> - if (wal_level < WAL_LEVEL_LOGICAL)
> + if (wal_level < WAL_LEVEL_REPLICA)
> ereport(ERROR,
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> - errmsg("logical decoding requires \"wal_level\" >= \"logical\"")));
> + errmsg("logical decoding requires \"wal_level\" >= \"replica\"")));
>
>
> But we already have same wal_level check in CheckSlotRequirements:
>
> if (wal_level < WAL_LEVEL_REPLICA)
> ereport(ERROR,
>
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> errmsg("replication slots can only be
> used if \"wal_level\" >= \"replica\"")));
>
> Thus the change in CheckLogicalDecodingRequirements for 'wal_level <
> WAL_LEVEL_REPLICA' will never be reached. Is it needed?
No, I agree to remove it.
I've attached the updated version patch.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
v6-0001-Enable-logical-decoding-dynamically-based-on-logi.patch
Description: Binary data
