On Sun, Dec 14, 2025 at 06:17:34PM +0800, Xuneng Zhou wrote: > On Sun, Dec 14, 2025 at 4:55 PM Xuneng Zhou <[email protected]> wrote: > > On Sun, Dec 14, 2025 at 1:14 PM Noah Misch <[email protected]> wrote: > > > > V2 makes the transition from WALRCV_CONNECTING to STREAMING only when > > > > the first valid WAL record is processed by the startup process. A new > > > > function WalRcvSetStreaming is introduced to enable the transition. > > > > > > The original patch set STREAMING in XLogWalRcvFlush(). XLogWalRcvFlush() > > > callee XLogWalRcvSendReply() already fetches applyPtr to send a status > > > message. So I would try the following before involving the startup > > > process > > > like v2 does: > > > > > > 1. store the applyPtr when we enter CONNECTING > > > 2. force a status message as long as we remain in CONNECTING > > > 3. become STREAMING when applyPtr differs from the one stored at (1) > > > > Thanks for the suggestion. Using XLogWalRcvSendReply() for the > > transition could make sense. My concern before is about latency in a > > rare case: if the first flush completes but applyPtr hasn't advanced > > yet at the time of check and then the flush stalls after that, we > > might wait up to wal_receiver_status_interval (default 10s) before the > > next check or indefinitely if (wal_receiver_status_interval <= 0). > > This could be mitigated by shortening the wakeup interval while in > > CONNECTING (step 2), which reduces worst-case latency to ~1 second. > > Given that monitoring typically doesn't require sub-second precision, > > this approach could be feasible. > > > > case WALRCV_WAKEUP_REPLY: > > if (WalRcv->walRcvState == WALRCV_CONNECTING) > > { > > /* Poll frequently while CONNECTING to avoid long latency */ > > wakeup[reason] = TimestampTzPlusMilliseconds(now, 1000); > > } > > > > > A possible issue with all patch versions: when the primary is writing no > > > WAL > > > and the standby was caught up before this walreceiver started, CONNECTING > > > could persist for an unbounded amount of time. Only actual primary WAL > > > generation would move the walreceiver to STREAMING. This relates to your > > > above point about high latency. If that's a concern, perhaps this change > > > deserves a total of two new states, CONNECTING and a state that represents > > > "connection exists, no WAL yet applied"? > > > > Yes, this could be an issue. Using two states would help address it. > > That said, when the primary is idle in this case, we might end up > > repeatedly polling the apply status in the state before streaming if > > we implement the 1s short-interval checking like above, which could be > > costful. However, If we do not implement it && > > wal_receiver_status_interval is set to < 0 && flush stalls, the > > walreceiver could stay in the pre-streaming state indefinitely even if > > streaming did occur, which violates the semantics. Do you think this > > is a valid concern or just an artificial edge case? > > After looking more closely, I found that true indefinite waiting > requires ALL of: > > wal_receiver_status_interval <= 0 (disables status updates) > wal_receiver_timeout <= 0 > Primary sends no keepalives > No more WAL arrives after the first failed-check flush > Startup never sets force_reply > > which is quite impossible and artificial, sorry for the noise here.
Even if indefinite wait is a negligible concern, you identified a lot of intricacy that I hadn't pictured. That makes your startup-process-driven version potentially more attractive. Forcing status messages like I was thinking may also yield an unwanted flurry of them if the startup process is slow. Let's see what the patch reviewer thinks.
