On Fri, Sep 5, 2025 at 3:15 PM Masahiko Sawada <[email protected]> wrote:
>
> On Thu, Sep 4, 2025 at 11:15 AM Masahiko Sawada <[email protected]> wrote:
> >
> > od On Tue, Sep 2, 2025 at 8:11 PM Hayato Kuroda (Fujitsu)
> > <[email protected]> wrote:
> > >
> > > Dear Sawada-san,
> > >
> > > Here are my comments.
> > >
> > > 01.
> > > ```
> > >         checkPoint.logicalDecodingEnabled = IsLogicalDecodingEnabled();
> > > ```
> > >
> > > Per my analysis, the value is always false here because 
> > > StartupLogicalDecodingStatus
> > > is not called yet. Can we use "false" directly?
> >
> > I think that it's better to read the shared flag instead of directly
> > setting false since LogicalDecodingCtl is already initialized.
> >
> > >
> > > 02.
> > > ```
> > > elog(DEBUG1, "waiting for %d transactions to complete", running->xcnt);
> > > ```
> > >
> > > Here plural form is always used even if the running transaction is only 
> > > one.
> > > How about something like:
> > > ```
> > > Number of transactions to wait finishing: %d
> > > ```
> >
> > Hmm, not sure it improves the message. I think we don't care much
> > about plurals in debug messages. And in our convention the main log
> > message doesn't start a capital character.
> >
> >
> > >
> > > 03.
> > > ```
> > >                 while (RecoveryInProgress())
> > >                 {
> > >                         
> > > pgstat_report_wait_start(WAIT_EVENT_LOGICAL_DECODING_STATUS_CHANGE_DELAY);
> > >                         pg_usleep(100000L); /* wait for 100 msec */
> > >                         pgstat_report_wait_end();
> > >                 }
> > > ```
> > >
> > > I found a stuck case here: if a backend process within the loop and 
> > > startup waits
> > > a signal is processed, both of them can stuck. The backend waits the 
> > > recovery
> > > state to be DONE, and the startup waits all processes handle consume the 
> > > signal.
> > > IIUC we must add CHECK_FOR_INTERRUPTS() or ProcessProcSignalBarrier().
> > >
> > > Actual steps:
> > >
> > > 0.  constructed a streaming replication system, which the only primary 
> > > server had
> > >     a logical slot. I.e., the effective_wal_level was logical.
> > > 1.  connected to a standby node
> > > 2.  attached to the backend process via gdb
> > > 3.  added a breakpoint at create_logical_replication_slot
> > > 4.  called pg_create_logical_replication_slot() on the backend.
> > >     the backend will stop before ReplicationSlotCreate().
> > > 5.  from another terminal, attached to the startup process via gdb
> > > 6.  added a breakpoint at UpdateLogicalDecodingStatusEndOfRecovery()
> > > 7.  from another terminal, send a promote signal to the standby.
> > >     The startup will stop at UpdateLogicalDecodingStatusEndOfRecovery()
> > > 8.  executed steps on startup process, untill delay_status_change was 
> > > updated
> > >     and LogicalDecodingControlLock was released.
> > > 9.  detached from the backend process. It would stop at the loop in
> > >     start_logical_decoding_status_change().
> > > 10. detached from the startup process. It would wait all processes 
> > > handled the
> > >     signal, but the backend won't do.
> >
> > Good find! I'll fix the problem by adding CHECK_FOR_INTERRUPTS() as
> > you suggested.
>
> I've attached the updated patch that incorporated all comments I got so far.

Sorry, I've attached the wrong version. Please find the attached correct one.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment: v14-0001-Enable-logical-decoding-dynamically-based-on-log.patch
Description: Binary data

Reply via email to