On Mon, Jun 8, 2026 at 4:31 PM Xuneng Zhou <[email protected]> wrote: > > Hi Kuroda-san, > > On Mon, Jun 8, 2026 at 12:25 PM Hayato Kuroda (Fujitsu) > <[email protected]> wrote: > > > > Hi Alexander, Bertrand, Xuneng, > > > > Thanks for seeing the failure. Our team also recognized but could not find > > the reason. > > > > > Yeah, it looks like there is a race condition here. I think we should > > > check if > > > the insertion timeline has already been set (like the walsummarizer is > > > doing). > > > > Sorry for stupid question; I tried to reproduce the failure but could not, > > see attached. > > > > IIUC, the issue can happen if the walsender must read the WAL record > > generated > > after the promotion but the timeline could not be updated. > > I think the race is that the logical walsender is trying to read the > pre-promotion wal from the slot's restart_lsn. > > 1) In 035_standby_logical_decoding.pl, the active slot / > pg_recvlogical process is started before: > INSERT rows 1..4 # before promotion > promote standby > INSERT rows 5..7 # after promotion > > 2) Alexander’s reproducer sleeps immediately after: > ReplicationSlotAcquire(cmd->slotname, true, true); > pg_usleep(200000); > > So with the delay: > slot is active but walsender has not yet create decoding context and > called xlogbeginread(); > > 3) Then the test continues and promotes the standby while the > walsender still sleeping, therefore promotion could happen before the > walsender starts its actual wal read. > > 4) The logical walsender is trying to read the pre-promotion wal from > the slot's restart_lsn. The failing log implies this: > Streaming transactions committing after 0/06487A20, > reading WAL from 0/064879E8. > > This line is emitted from: > errdetail("Streaming transactions committing after %X/%08X, reading > WAL from %X/%08X.", > slot->data.confirmed_flush, > slot->data.restart_lsn) > > Function StartLogicalReplication() initiates the xlogreader at the restart > lsn: > XLogBeginRead(logical_decoding_ctx->reader, > MyReplicationSlot->data.restart_lsn); > > Because the slot was created and acquired before promotion, its > restart_lsn is from the standby-era state which points to wal required > for decoding the pre-promotion stream. In the test, the first expected > output is rows 1..4, inserted before promotion. > > The failing output was empty: > got: '' > expected: > BEGIN > rows 1..4 > COMMIT > BEGIN > rows 5..7 > COMMIT > > Empty output suggests that the walsender died before emitting even the > first pre-promotion transaction. > > 5) The walsender try to read wal but failed > Failed run: > LOG: !!!WalSndSegmentOpen| nextSegNo: 6, > path: pg_wal/000000010000000000000006 > ERROR: requested WAL segment > 000000010000000000000006 has already been removed > > Passed run > LOG: !!!WalSndSegmentOpen| nextSegNo: 6, > path: pg_wal/000000020000000000000006 > > As the log suggested, the walsender of the successful run readed the > wal from timeline 2 but failed to do so for timeline1. > This comment in xlogutils.is almost exactly the issue: > > We care about timelines ... reading xlog generated prior to a promotion > ... > The server copied the segment to the new timeline ... there's no guarantee > the old segment will still exist. It may have been deleted or renamed with > a .partial suffix > > So the need for tli1 does not always imply that the wal was generated > after promotion. It could be bytes from before the switchpoint, but > they now be available through the new tli segment file. > > 6) XLogReadDetermineTimeline() needs the current system-wide TLI. > > The current code in does roughly: > > am_cascading_walsender = RecoveryInProgress(); > > if (am_cascading_walsender) > GetXLogReplayRecPtr(&currTLI); > else > currTLI = GetWALInsertionTimeLine(); > > During promotion, we set the wal insertion timeline before marking > recovery as fully done. So there is a window where: > > - InsertTimeLineID is already the new timeline, e.g. tli 2. > - RecoveryInProgress() can still return true in another backend. > - A logical walsender that started before promotion can wake up and > choose its read timeline using the old recovery-state test. > > In the race, replay tli 1 is chosen even though the promotion has > created the new tli 2 wal segment copy. Then WalSndSegmentOpen() tries > to open 000000010000000000000006, which may already be gone or > renamed, while the correct file is 000000020000000000000006. > > 7) Bertrand's proposal to fix this > Recognizing the intermediate promotion state -- if recovery still > shows progress but the insertion tli is already advanced, use that > insertion tli for the timeline to be chosen. > > > However, I think logical_read_xlog_page() is called after the new WAL > > records > > are generated, i.e., am_cascading_walsender has already been false at that > > time. > > So not sure where is the race? > > This function is not only called after new post-promotion records are > generated. When START_REPLICATION begins, CreateDecodingContext() / > XLogBeginRead() starts from the slot's restart_lsn, which can be well > before promotion. >
I tweaked the reproducer based on the theory outlined above. The main changes from the original reproducer are: 1) blocks at logical-walsender-after-slot-acquire in walsender.c, before the decoding context is created and before the reader starts from restart_lsn, matching the delay set by Alexander 2) Forces the first read to occur during promotion. It inserts rows 1..4, waits for replay, starts promotion with pg_promote(false), holds startup at startup-logical-decoding-status-change-end-of-recovery, then wakes the walsender. The test failed on HEAD and passed after applying the patch by Bertrand. -- Regards, Xuneng Zhou HighGo Software Co., Ltd.
v1-0001-Reproducer-for-logical-decoding-choosing-the-wron.patch
Description: Binary data
