On Fri, Jun 5, 2026 at 11:16 AM Masahiko Sawada <[email protected]> wrote:
>
> On Mon, Jun 1, 2026 at 5:02 PM Chao Li <[email protected]> wrote:
> >
> >
> >
> > > On Jun 2, 2026, at 00:44, Masahiko Sawada <[email protected]> wrote:
> > >
> > > On Fri, May 29, 2026 at 7:01 PM Chao Li <[email protected]> wrote:
> > >>
> > >>
> > >>
> > >>> On May 30, 2026, at 03:24, Masahiko Sawada <[email protected]> 
> > >>> wrote:
> > >>>
> > >>> On Thu, May 28, 2026 at 8:39 PM Chao Li <[email protected]> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>>> On May 29, 2026, at 05:53, Masahiko Sawada <[email protected]> 
> > >>>>> wrote:
> > >>>>>
> > >>>>> On Thu, May 28, 2026 at 2:09 AM Chao Li <[email protected]> 
> > >>>>> wrote:
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> While testing “Toggle logical decoding dynamically based on logical 
> > >>>>>> slot presence”, I hit an assertion failure with concurrent logical 
> > >>>>>> slot creation.
> > >>>>>>
> > >>>>>> This is a repo:
> > >>>>>>
> > >>>>>> 1. In session 1, attach the injection point locally and start 
> > >>>>>> creating a logical slot. The session blocks at 
> > >>>>>> logical-decoding-activation:
> > >>>>>> ```
> > >>>>>> evantest=# set application_name = 'slot_a';
> > >>>>>> SET
> > >>>>>> evantest=# select injection_points_set_local();
> > >>>>>> injection_points_set_local
> > >>>>>> ----------------------------
> > >>>>>>
> > >>>>>> (1 row)
> > >>>>>> evantest=# select 
> > >>>>>> injection_points_attach('logical-decoding-activation', 'wait');
> > >>>>>> injection_points_attach
> > >>>>>> -------------------------
> > >>>>>>
> > >>>>>> (1 row)
> > >>>>>> evantest=# select pg_create_logical_replication_slot('slot_a', 
> > >>>>>> 'pgoutput');
> > >>>>>> ```
> > >>>>>>
> > >>>>>> 2. In session 2, create another logical slot. This succeeds, and 
> > >>>>>> effective_wal_level becomes logical:
> > >>>>>> ```
> > >>>>>> evantest=# select pg_create_logical_replication_slot('slot_b', 
> > >>>>>> 'pgoutput');
> > >>>>>> pg_create_logical_replication_slot
> > >>>>>> ------------------------------------
> > >>>>>> (slot_b,0/0902E418)
> > >>>>>> (1 row)
> > >>>>>>
> > >>>>>> evantest=# show effective_wal_level;
> > >>>>>> effective_wal_level
> > >>>>>> ---------------------
> > >>>>>> logical
> > >>>>>> (1 row)
> > >>>>>> ```
> > >>>>>>
> > >>>>>> 3. In session 2, cancel session 1 instead of waking it up:
> > >>>>>> ```
> > >>>>>> evantest=# select pg_cancel_backend(pid) from pg_stat_activity where 
> > >>>>>> application_name = 'slot_a';
> > >>>>>> pg_cancel_backend
> > >>>>>> -------------------
> > >>>>>> t
> > >>>>>> (1 row)
> > >>>>>> ```
> > >>>>>>
> > >>>>>> Then the server hits this assertion:
> > >>>>>
> > >>>>> Thank you for the report! I've confirmed the problem.
> > >>>>>
> > >>>>>>
> > >>>>>> I might be over thinking, but I just feel the safest fix is to make 
> > >>>>>> EnableLogicalDecoding() serialize. I tried serializing with 
> > >>>>>> LogicalDecodingControlLock and with a separate lock, but both 
> > >>>>>> approaches got deadlock around the barrier wait. I ended up with 
> > >>>>>> adding an activation_in_progress flag in shared memory, protected by 
> > >>>>>> LogicalDecodingControlLock, with a condition variable to wait for 
> > >>>>>> the active activation to finish.
> > >>>>>>
> > >>>>>> With this fix, rerunning the repro makes session 2 wait while 
> > >>>>>> session 1 is blocked at the injection point. After canceling session 
> > >>>>>> 1 from session 3, session 2 continues, creates the slot 
> > >>>>>> successfully, and effective_wal_level becomes logical.
> > >>>>>
> > >>>>> This serialization idea is very similar to what we tried in older
> > >>>>> version patches. While I've confirmed that the patch fixes the
> > >>>>> reported issue, I'm somewhat concerned that it could introduce another
> > >>>>> race condition as the patch adds a new mechanism.
> > >>>>>
> > >>>>> I think that the complication stems from the fact that
> > >>>>> abort_logical_decoding_activation() and DisableLogicalDecoding() clear
> > >>>>> the xlog_logical_info flag in different ways. I guess it would be
> > >>>>> simpler if we could delegate all the deactivation process including
> > >>>>> clearing xlog_logical_info flag to DisableLogicalDecoding(). It would
> > >>>>> allow the system to be in the state of xlog_logical_info == true and
> > >>>>> logical_decoding_enabled = false, but if we change
> > >>>>> DisableLogicalDecoding() to handle this case, the deactivation process
> > >>>>> would become simpler.
> > >>>>>
> > >>>>>> I didn’t include a test in this patch, as I wasn’t sure such a test 
> > >>>>>> would be desirable. If others think it is worth adding, I can 
> > >>>>>> convert the repro into a TAP test.
> > >>>>>
> > >>>>> I think we should have the tests.
> > >>>>>
> > >>>>> I've attached the patch for the above idea including the regression
> > >>>>> tests. Please review it.
> > >>>>>
> > >>>>
> > >>>> Yeah, your solution of deferring the actual abort to 
> > >>>> DisableLogicalDecoding() is better. I have a few comments on the new 
> > >>>> patch:
> > >>>>
> > >>>
> > >>> Thank you for reviewing the patch!
> > >>>
> > >>>> 1
> > >>>> ```
> > >>>> +       else
> > >>>> +       {
> > >>>> +               /*
> > >>>> +                * Logical decoding is disabled while 
> > >>>> xlog_logical_info is true.
> > >>>> +                * This could happen if an activation is interrupted. 
> > >>>> Reset only
> > >>>> +                * xlog_logical_info.
> > >>>> +                */
> > >>>> +               LogicalDecodingCtl->xlog_logical_info = false;
> > >>>> +               LogicalDecodingCtl->logical_decoding_enabled = false;
> > >>>> +       }
> > >>>> ```
> > >>>>
> > >>>> Here, we should also clear LogicalDecodingCtl->pending_disable = false;
> > >>>
> > >>> Right.
> > >>>
> > >>>>
> > >>>> 2. AKAIK, critical section only works with elog, thus simple shared 
> > >>>> memory assignments don't have to be inside the critical section. So we 
> > >>>> can move out LogicalDecodingCtl assignments to outside the critical 
> > >>>> section, which avoids some duplicate code.
> > >>>
> > >>> While it works, keeping these codes in a critical section is more
> > >>> consistent with what EnableLogicalDecoding() does. Please review the
> > >>> patch as I simplified the logic.
> > >>
> > >> The new code looks good.
> > >>
> > >>>
> > >>>>
> > >>>> 3
> > >>>> ```
> > >>>> +       if (!in_recovery && was_enabled)
> > >>>>               ereport(LOG,
> > >>>>                               errmsg("logical decoding is disabled 
> > >>>> because there are no valid logical replication slots"));
> > >>>> ```
> > >>>>
> > >>>> I think this log message can be emitted after releasing the lock, 
> > >>>> which makes the locking period a bit shorter. That might delay the log 
> > >>>> output slightly, but since slot creation is not a frequent operation, 
> > >>>> I think that should be fine.
> > >>>
> > >>> I'm rather concerned that "disabled" log could be written after
> > >>> "enabled" log even if the logical decoding gets enabled. I recall
> > >>> there was the same proposal before and we concluded to write the
> > >>> deactivation log under the lock. I added the comments so as not to
> > >>> confuse it.
> > >>>
> > >>
> > >> Thanks for adding the comments which helps code reads better understand 
> > >> the logic.
> > >>
> > >> I’m still not convinced that emitting the log before releasing the lock 
> > >> is necessary.
> > >>
> > >> Even if there is a race, the disable path would release the lock and 
> > >> then emit its log message immediately. The racing enable path would 
> > >> still need to acquire the lock, set the state, write and flush the 
> > >> logical-decoding status WAL record, release the lock, and only then emit 
> > >> the “enabled” log message. So the misleading log order seems possible in 
> > >> theory, but unlikely in practice.
> > >>
> > >> So, this is a trade-off. I think avoiding ereport(LOG) under the LWLock 
> > >> gains more than loss, but I don’t think this point is critical enough to 
> > >> argue further.
> > >
> > > Yeah, I think this is a trade-off between the accuracy of writing logs
> > > in the correct order and the better concurrency on toggling logical
> > > decoding status. Given that LogicalDecodingControlLock is unlikely to
> > > be contended in practice, I guess doing ereport(LOG) under the LWLock
> > > is acceptable in this case.
> > >
> > > I think this part is not related to the reported bug and we can change
> > > it in a separate patch later if we find out that the contention of
> > > LogicalDecodingControlLock becomes contended.
> > >
> > >> V3 overall looks good to me. A couple of small comments:
> > >>
> > >> 1
> > >> ```
> > >> +        * Logging under the lock guarantees our "is disabled" message 
> > >> appear in
> > >> ```
> > >>
> > >> Here, “appear” maybe better to be “appears”.
> > >
> > > Fixed.
> > >
> > >>
> > >> 2
> > >> ```
> > >> +       # Canceling the backend should not affect the concurrent slot 
> > >> creation.
> > >> +       $primary->wait_for_log("canceling statement due to user 
> > >> request”);
> > >> ```
> > >>
> > >> While the final review, I just noticed the previous test cases also have 
> > >> pg_cancel_backend(pid), so we may better to get the log offset otherwise 
> > >> wait_for_log may get the log from a previous cancellation, like this:
> > >> ```
> > >> diff --git a/src/test/recovery/t/051_effective_wal_level.pl 
> > >> b/src/test/recovery/t/051_effective_wal_level.pl
> > >> index e3d8b97c173..0e75b1536a4 100644
> > >> --- a/src/test/recovery/t/051_effective_wal_level.pl
> > >> +++ b/src/test/recovery/t/051_effective_wal_level.pl
> > >> @@ -440,13 +440,14 @@ select 
> > >> pg_create_logical_replication_slot('slot_canceled2', 'pgoutput');
> > >>
> > >>        # Cancel the backend initiated by $psql_create_slot, aborting its 
> > >> activation
> > >>        # process.
> > >> +       my $log_offset = -s $primary->logfile;
> > >>        $primary->safe_psql(
> > >>                'postgres',
> > >>                qq[
> > >> select pg_cancel_backend(pid) from pg_stat_activity where query ~ 
> > >> 'slot_canceled2' and pid <> pg_backend_pid()
> > >> ]);
> > >>        # Canceling the backend should not affect the concurrent slot 
> > >> creation.
> > >> -       $primary->wait_for_log("canceling statement due to user 
> > >> request");
> > >> +       $primary->wait_for_log("canceling statement due to user 
> > >> request", $log_offset);
> > >>        test_wal_level($primary, "replica|logical",
> > >>                                   "effective_wal_level remains 'logical' 
> > >> even after the concurrent activation is interrupted");
> > >> }
> > >> ```
> > >
> > > Good catch. Fixed.
> > >
> > > I've attached the updated patch.
> > >
> > > FYI I've added this issue to the open item for the record.
> > >
> > > Regards,
> > >
> > > --
> > > Masahiko Sawada
> > > Amazon Web Services: https://aws.amazon.com
> > > <v4-0001-Fix-race-when-logical-decoding-activation-is-conc.patch>
> >
> > Thanks for updating the patch. I’m good with v4.
>
> Thank you for reviewing the patch!
>
> I'll push the fix early next week, unless there is further review
> comment on the v4 patch

Pushed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


Reply via email to