> On Jun 2, 2026, at 00:44, Masahiko Sawada <[email protected]> wrote:
>
> On Fri, May 29, 2026 at 7:01 PM Chao Li <[email protected]> wrote:
>>
>>
>>
>>> On May 30, 2026, at 03:24, Masahiko Sawada <[email protected]> wrote:
>>>
>>> On Thu, May 28, 2026 at 8:39 PM Chao Li <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>>> On May 29, 2026, at 05:53, Masahiko Sawada <[email protected]> wrote:
>>>>>
>>>>> On Thu, May 28, 2026 at 2:09 AM Chao Li <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> While testing “Toggle logical decoding dynamically based on logical slot
>>>>>> presence”, I hit an assertion failure with concurrent logical slot
>>>>>> creation.
>>>>>>
>>>>>> This is a repo:
>>>>>>
>>>>>> 1. In session 1, attach the injection point locally and start creating a
>>>>>> logical slot. The session blocks at logical-decoding-activation:
>>>>>> ```
>>>>>> evantest=# set application_name = 'slot_a';
>>>>>> SET
>>>>>> evantest=# select injection_points_set_local();
>>>>>> injection_points_set_local
>>>>>> ----------------------------
>>>>>>
>>>>>> (1 row)
>>>>>> evantest=# select injection_points_attach('logical-decoding-activation',
>>>>>> 'wait');
>>>>>> injection_points_attach
>>>>>> -------------------------
>>>>>>
>>>>>> (1 row)
>>>>>> evantest=# select pg_create_logical_replication_slot('slot_a',
>>>>>> 'pgoutput');
>>>>>> ```
>>>>>>
>>>>>> 2. In session 2, create another logical slot. This succeeds, and
>>>>>> effective_wal_level becomes logical:
>>>>>> ```
>>>>>> evantest=# select pg_create_logical_replication_slot('slot_b',
>>>>>> 'pgoutput');
>>>>>> pg_create_logical_replication_slot
>>>>>> ------------------------------------
>>>>>> (slot_b,0/0902E418)
>>>>>> (1 row)
>>>>>>
>>>>>> evantest=# show effective_wal_level;
>>>>>> effective_wal_level
>>>>>> ---------------------
>>>>>> logical
>>>>>> (1 row)
>>>>>> ```
>>>>>>
>>>>>> 3. In session 2, cancel session 1 instead of waking it up:
>>>>>> ```
>>>>>> evantest=# select pg_cancel_backend(pid) from pg_stat_activity where
>>>>>> application_name = 'slot_a';
>>>>>> pg_cancel_backend
>>>>>> -------------------
>>>>>> t
>>>>>> (1 row)
>>>>>> ```
>>>>>>
>>>>>> Then the server hits this assertion:
>>>>>
>>>>> Thank you for the report! I've confirmed the problem.
>>>>>
>>>>>>
>>>>>> I might be over thinking, but I just feel the safest fix is to make
>>>>>> EnableLogicalDecoding() serialize. I tried serializing with
>>>>>> LogicalDecodingControlLock and with a separate lock, but both approaches
>>>>>> got deadlock around the barrier wait. I ended up with adding an
>>>>>> activation_in_progress flag in shared memory, protected by
>>>>>> LogicalDecodingControlLock, with a condition variable to wait for the
>>>>>> active activation to finish.
>>>>>>
>>>>>> With this fix, rerunning the repro makes session 2 wait while session 1
>>>>>> is blocked at the injection point. After canceling session 1 from
>>>>>> session 3, session 2 continues, creates the slot successfully, and
>>>>>> effective_wal_level becomes logical.
>>>>>
>>>>> This serialization idea is very similar to what we tried in older
>>>>> version patches. While I've confirmed that the patch fixes the
>>>>> reported issue, I'm somewhat concerned that it could introduce another
>>>>> race condition as the patch adds a new mechanism.
>>>>>
>>>>> I think that the complication stems from the fact that
>>>>> abort_logical_decoding_activation() and DisableLogicalDecoding() clear
>>>>> the xlog_logical_info flag in different ways. I guess it would be
>>>>> simpler if we could delegate all the deactivation process including
>>>>> clearing xlog_logical_info flag to DisableLogicalDecoding(). It would
>>>>> allow the system to be in the state of xlog_logical_info == true and
>>>>> logical_decoding_enabled = false, but if we change
>>>>> DisableLogicalDecoding() to handle this case, the deactivation process
>>>>> would become simpler.
>>>>>
>>>>>> I didn’t include a test in this patch, as I wasn’t sure such a test
>>>>>> would be desirable. If others think it is worth adding, I can convert
>>>>>> the repro into a TAP test.
>>>>>
>>>>> I think we should have the tests.
>>>>>
>>>>> I've attached the patch for the above idea including the regression
>>>>> tests. Please review it.
>>>>>
>>>>
>>>> Yeah, your solution of deferring the actual abort to
>>>> DisableLogicalDecoding() is better. I have a few comments on the new patch:
>>>>
>>>
>>> Thank you for reviewing the patch!
>>>
>>>> 1
>>>> ```
>>>> + else
>>>> + {
>>>> + /*
>>>> + * Logical decoding is disabled while xlog_logical_info is
>>>> true.
>>>> + * This could happen if an activation is interrupted.
>>>> Reset only
>>>> + * xlog_logical_info.
>>>> + */
>>>> + LogicalDecodingCtl->xlog_logical_info = false;
>>>> + LogicalDecodingCtl->logical_decoding_enabled = false;
>>>> + }
>>>> ```
>>>>
>>>> Here, we should also clear LogicalDecodingCtl->pending_disable = false;
>>>
>>> Right.
>>>
>>>>
>>>> 2. AKAIK, critical section only works with elog, thus simple shared memory
>>>> assignments don't have to be inside the critical section. So we can move
>>>> out LogicalDecodingCtl assignments to outside the critical section, which
>>>> avoids some duplicate code.
>>>
>>> While it works, keeping these codes in a critical section is more
>>> consistent with what EnableLogicalDecoding() does. Please review the
>>> patch as I simplified the logic.
>>
>> The new code looks good.
>>
>>>
>>>>
>>>> 3
>>>> ```
>>>> + if (!in_recovery && was_enabled)
>>>> ereport(LOG,
>>>> errmsg("logical decoding is disabled because
>>>> there are no valid logical replication slots"));
>>>> ```
>>>>
>>>> I think this log message can be emitted after releasing the lock, which
>>>> makes the locking period a bit shorter. That might delay the log output
>>>> slightly, but since slot creation is not a frequent operation, I think
>>>> that should be fine.
>>>
>>> I'm rather concerned that "disabled" log could be written after
>>> "enabled" log even if the logical decoding gets enabled. I recall
>>> there was the same proposal before and we concluded to write the
>>> deactivation log under the lock. I added the comments so as not to
>>> confuse it.
>>>
>>
>> Thanks for adding the comments which helps code reads better understand the
>> logic.
>>
>> I’m still not convinced that emitting the log before releasing the lock is
>> necessary.
>>
>> Even if there is a race, the disable path would release the lock and then
>> emit its log message immediately. The racing enable path would still need to
>> acquire the lock, set the state, write and flush the logical-decoding status
>> WAL record, release the lock, and only then emit the “enabled” log message.
>> So the misleading log order seems possible in theory, but unlikely in
>> practice.
>>
>> So, this is a trade-off. I think avoiding ereport(LOG) under the LWLock
>> gains more than loss, but I don’t think this point is critical enough to
>> argue further.
>
> Yeah, I think this is a trade-off between the accuracy of writing logs
> in the correct order and the better concurrency on toggling logical
> decoding status. Given that LogicalDecodingControlLock is unlikely to
> be contended in practice, I guess doing ereport(LOG) under the LWLock
> is acceptable in this case.
>
> I think this part is not related to the reported bug and we can change
> it in a separate patch later if we find out that the contention of
> LogicalDecodingControlLock becomes contended.
>
>> V3 overall looks good to me. A couple of small comments:
>>
>> 1
>> ```
>> + * Logging under the lock guarantees our "is disabled" message
>> appear in
>> ```
>>
>> Here, “appear” maybe better to be “appears”.
>
> Fixed.
>
>>
>> 2
>> ```
>> + # Canceling the backend should not affect the concurrent slot
>> creation.
>> + $primary->wait_for_log("canceling statement due to user request”);
>> ```
>>
>> While the final review, I just noticed the previous test cases also have
>> pg_cancel_backend(pid), so we may better to get the log offset otherwise
>> wait_for_log may get the log from a previous cancellation, like this:
>> ```
>> diff --git a/src/test/recovery/t/051_effective_wal_level.pl
>> b/src/test/recovery/t/051_effective_wal_level.pl
>> index e3d8b97c173..0e75b1536a4 100644
>> --- a/src/test/recovery/t/051_effective_wal_level.pl
>> +++ b/src/test/recovery/t/051_effective_wal_level.pl
>> @@ -440,13 +440,14 @@ select
>> pg_create_logical_replication_slot('slot_canceled2', 'pgoutput');
>>
>> # Cancel the backend initiated by $psql_create_slot, aborting its
>> activation
>> # process.
>> + my $log_offset = -s $primary->logfile;
>> $primary->safe_psql(
>> 'postgres',
>> qq[
>> select pg_cancel_backend(pid) from pg_stat_activity where query ~
>> 'slot_canceled2' and pid <> pg_backend_pid()
>> ]);
>> # Canceling the backend should not affect the concurrent slot
>> creation.
>> - $primary->wait_for_log("canceling statement due to user request");
>> + $primary->wait_for_log("canceling statement due to user request",
>> $log_offset);
>> test_wal_level($primary, "replica|logical",
>> "effective_wal_level remains 'logical'
>> even after the concurrent activation is interrupted");
>> }
>> ```
>
> Good catch. Fixed.
>
> I've attached the updated patch.
>
> FYI I've added this issue to the open item for the record.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
> <v4-0001-Fix-race-when-logical-decoding-activation-is-conc.patch>
Thanks for updating the patch. I’m good with v4.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/