Hi,

On Wed, Nov 19, 2025 at 11:44 AM Xuneng Zhou <[email protected]> wrote:
>
> Hi Michael,
>
> On Sat, Nov 8, 2025 at 7:03 AM Michael Paquier <[email protected]> wrote:
> >
> > On Fri, Nov 07, 2025 at 09:48:23PM +0800, Xuneng Zhou wrote:
> > > Now that the LSN-waiting infrastructure (3b4e53a) and WAL replay
> > > wake-up calls (447aae1) are in place, this patch has been updated to
> > > make use of them.
> > > Please check.
> >
> > That's indeed much simpler.  I'll check later what you have here.
> > --
> > Michael
>
> Thanks again for your earlier suggestion on splitting the patches to
> make the review process smoother.
>
> Although this version is simpler in terms of the amount of code, the
> review effort still feels non-trivial. During my own self-review, a
> few points stood out as areas that merit careful consideration:
>
> 1) Reliance on the new wait-for-LSN infrastructure
>
> The stability and correctness of this patch now depend heavily on the
> newly added wait-for-LSN infrastructure, which has not yet been
> battle-tested. This puts the patch in a bit of a dilemma: we want the
> infrastructure to be as reliable as possible, but it could be hard to
> fully validate its robustness without using it in real scenarios, even
> after careful review.

Here is my (admittedly incomplete) reading of the behaviors:

> 2) Wake-up behavior
>
> Are the waiting processes waking up at the correct points and under
> the right conditions? Ensuring proper wake-ups is essential for both
> correctness and performance.

Primary (Flush Wait):

The patch adds WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH, LogwrtResult.Flush)
in XLogFlush and XLogBackgroundFlush, right after the point where the
walsenders are notified:

        /* wake up walsenders now that we've released heavily contended locks */
        WalSndWakeupProcessRequests(true, !RecoveryInProgress());

Standby (Replay Wait):
-- The "End of Recovery" Wake-up
Location: xlog.c (inside StartupXLOG, around line 6266)

/*
 * Wake up all waiters for replay LSN.  They need to report an error that
 * recovery was ended before reaching the target LSN.
 */
WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);

This call happens immediately after the server transitions from
"Recovery" to "Production" mode (RECOVERY_STATE_DONE).

-- The "Continuous Replay" Wake-up
Location: xlogrecovery.c (inside the main redo loop, around line 1850)
/*
 * If we replayed an LSN that someone was waiting for then walk
 * over the shared memory array and set latches to notify the
 * waiters.
 */
if (waitLSNState &&
    (XLogRecoveryCtl->lastReplayedEndRecPtr >=
     pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
    WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);

This handles the continuous stream of replay-progress updates during
normal standby operation.

> 3) Edge cases
>
> Are edge cases—such as a promotion occurring while a process is
> waiting in standby—handled correctly and without introducing races or
> inconsistent states?

Consider the following sequence, which I traced through the logic:

1. Pre-Promotion: A backend (e.g., a logical decoding session) calls
read_local_xlog_page_guts for a future LSN. RecoveryInProgress()
returns true, so it enters WaitForLSN(WAIT_LSN_TYPE_REPLAY, ...).

2. The Event: pg_promote() is issued. The Startup process finishes
recovery and broadcasts a wake-up to all waiters.

3. Detection: WaitForLSN returns WAIT_LSN_RESULT_NOT_IN_RECOVERY. The
code explicitly handles this case:

case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
    /* Promoted while waiting... loop back */
    break;

4. The Transition: The loop restarts.
 -- RecoveryInProgress() is checked again and now returns false.
 -- The logic automatically switches branches to
WaitForLSN(WAIT_LSN_TYPE_FLUSH, ...).

5. This transition relaxes the constraint from "wait for replay"
(required for consistency on standby) to "wait for flush" (required
for durability on primary).

6. Timeline Divergence:
XLogReadDetermineTimeline is called at the top of the loop.

-- Scenario A: Waiting for Historical Data (Pre-Promotion)
  If we were waiting for LSN 0/5000 and promotion happened at 0/6000
(creating TLI 2), XLogReadDetermineTimeline will see that 0/5000
belongs to TLI 1 (now historical).
  Result: state->currTLI (1) != currTLI (2).
  Action: The loop breaks immediately (via the else block), skipping
the wait. Since the data is historical, it is immutable and assumed to
be on disk.

-- Scenario B: Waiting for Future Data (Post-Promotion)
  If we were waiting for LSN 0/7000 and promotion happened at 0/6000
(creating TLI 2), XLogReadDetermineTimeline will identify that 0/7000
belongs to the new TLI 2.
  Result: state->currTLI (2) == currTLI (2).
  Action: The loop continues, and we enter
WaitForLSN(WAIT_LSN_TYPE_FLUSH, ...) to wait for the new primary to
generate this data.

-- Scenario C: Waiting exactly at the Switch Point
  Suppose we were waiting for the exact LSN at which the timeline
switched.
  Action: XLogReadDetermineTimeline handles the boundary calculation
(tliSwitchPoint), ensuring we read from the correct segment file
(e.g., switching from 00000001... to 00000002...).


--
Best,
Xuneng

