Hi,

On Mon, Jan 19, 2026 at 11:35 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi Michael,
>
> On Mon, Jan 19, 2026 at 8:13 AM Michael Paquier <[email protected]> wrote:
> >
> > On Sun, Jan 11, 2026 at 08:56:57PM +0800, Xuneng Zhou wrote:
> > > After some thought, I’m more inclined toward a startup-process–driven
> > > approach. Marking the status as streaming immediately after the
> > > connection is established does not seem to provide sufficient accuracy
> > > for monitoring purposes. Introducing an intermediate state, such as
> > > connected, would help reduce confusion when the startup process is
> > > stalled and would make it easier for users to detect and diagnose
> > > anomalies.
> > >
> > > V4 whitelisted CONNECTED and CONNECTING in WalRcvWaitForStartPosition
> > > to handle valid stream termination scenarios without triggering a
> > > FATAL error.
> > >
> > > Specifically, the walreceiver may need to transition to WAITING (idle) if:
> > > 1. 'CONNECTED': The handshake succeeded (COPY_BOTH started), but
> > > the stream ended before any WAL was applied (e.g., timeline divergence
> > > detected mid-stream).
> > > 2. 'CONNECTING': The handshake completed (START_REPLICATION
> > > acknowledged), but the primary declined to stream (e.g., no WAL
> > > available on the requested timeline).
> > >
> > > In both cases, the receiver should pause and await a new timeline or
> > > restart position from the startup process.
> >
> > This stuff depends on the philosophical difference you want to put
> > behind "connecting" and "streaming". My own opinion is that it is a
> > non-starter to introduce more states that can be set by the startup
> > process, and that a new state should reflect what we do in the code.
> > We already have some of that for the start and stop phases because
> > we want some ordering when the WAL receiver process is spawned and at
> > shutdown. That's just a simple way to say that we should not rely on
> > more static variables to control how to set one or more states, and I
> > don't see why that's actually required here? initialApplyPtr and
> > force_reply are what I see as potential recipes for more bugs in the
> > long term, as shown in the first approach. The second patch,
> > introducing a similar new complexity with walrcv_streaming_set, is no
> > better in terms of complexity added.
> > The main take that I can retrieve from this thread is that it may take
> > time between the moment we begin a WAL receiver in WalReceiverMain(),
> > where walRcvState is switched to WALRCV_STREAMING, and the moment we
> > actually have established a connection, location where "first_stream =
> > false" (which is just to track if a WAL receiver is restarting,
> > actually) after walrcv_startstreaming() has returned true, so as far
> > as I can see you would be happy enough with the addition of a single
> > state called CONNECTING, set at the beginning of WalReceiverMain()
> > instead of where STREAMING is set now. The same would sound kind of
> > true for WalRcvWaitForStartPosition(), because we are not actively
> > streaming yet, still we are marking the WAL receiver as streaming, so
> > the current code feels like we are cheating as if we define
> > "streaming" as a WAL receiver that has already done an active
> > connection. We also want the WAL receiver to be killable by the
> > startup process while in "connecting" or "streaming" mode.
> >
> > Hence I would suggest something like the following guidelines:
> > - Add only a CONNECTING state. Set this state where we switch the
> > state to "streaming" now, aka the two locations in the tree now.
> > - Switch to STREAMING once the connection has been established, as
> > returned by walrcv_startstreaming(), because we are acknowledging *in
> > the code* that we have started streaming successfully.
> > - Update the docs to reflect the new state, because this state can
> > show up in the system view pg_stat_wal_receiver.
> > - I am not convinced by what we gain with a CONNECTED state, either.
> > Drop it.
> > - The fact that we'd want to switch the state once the startup process
> > has acknowledged the reception of the first byte from the stream is
> > already something we track in the WAL receiver, AFAIK.
>
> Thank you for the detailed feedback. I agree with your analysis: the
> simpler approach seems preferable and should be sufficient in most
> cases. Tightly coupling the startup process with the WAL receiver to
> set state is not ideal. I'll post v5 with the simplified
> walreceiver changes as you suggested shortly.

Please see v5 of the updated patch.

--
Best,
Xuneng
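
As a reading aid, here is a minimal standalone sketch of the state transitions that v5 aims for, based on my reading of the guidelines quoted above. It is not code from the patch: the Demo* names are hypothetical, the spinlock around the shared state is omitted, and the boolean argument only stands in for the return value of walrcv_startstreaming().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for WalRcvState (illustration only). */
typedef enum
{
	DEMO_STARTING,
	DEMO_CONNECTING,
	DEMO_STREAMING,
	DEMO_WAITING,
	DEMO_RESTARTING
} DemoWalRcvState;

/* Receiver startup: advertise CONNECTING where STREAMING used to be set. */
DemoWalRcvState
demo_enter_main(DemoWalRcvState state)
{
	assert(state == DEMO_STARTING);
	return DEMO_CONNECTING;
}

/*
 * 'started' stands in for the result of walrcv_startstreaming().
 * Only a successful start moves the state to STREAMING.
 */
DemoWalRcvState
demo_after_startstreaming(DemoWalRcvState state, bool started)
{
	assert(state == DEMO_CONNECTING);
	return started ? DEMO_STREAMING : DEMO_CONNECTING;
}

/* Going idle is valid from STREAMING and, with this change, CONNECTING. */
DemoWalRcvState
demo_go_idle(DemoWalRcvState state)
{
	assert(state == DEMO_STREAMING || state == DEMO_CONNECTING);
	return DEMO_WAITING;
}

/* After a restart request, go back to CONNECTING rather than STREAMING. */
DemoWalRcvState
demo_resume(DemoWalRcvState state)
{
	assert(state == DEMO_RESTARTING);
	return DEMO_CONNECTING;
}
```

The point of the sketch is that STREAMING is reachable only through a successful start, so a receiver that is stuck connecting, or whose primary declined to stream, is never reported as streaming.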
From 8a3267ed393f3f6a5d07ded279bc25aaa36ebae6 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Wed, 21 Jan 2026 13:04:13 +0800
Subject: [PATCH v5] Add WALRCV_CONNECTING state to walreceiver

Previously, walreceiver set its state to WALRCV_STREAMING immediately at
startup, before actually establishing a replication connection. This was
misleading for monitoring, as pg_stat_wal_receiver would show "streaming"
even while the connection was still being established.

Introduce WALRCV_CONNECTING state to accurately reflect the period between
walreceiver startup and successful START_REPLICATION. The transition to
WALRCV_STREAMING now occurs only after walrcv_startstreaming() returns
successfully.

Update pg_stat_wal_receiver documentation to describe all possible status
values and clarify that the view returns no row when the WAL receiver is
not running.
---
 doc/src/sgml/monitoring.sgml               | 13 ++++++++++++-
 src/backend/replication/walreceiver.c      | 16 +++++++++++++---
 src/backend/replication/walreceiverfuncs.c |  3 ++-
 src/include/replication/walreceiver.h      |  2 ++
 4 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 817fd9f4ca7..828498569fa 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1737,7 +1737,18 @@ description | Waiting for a newly initialized WAL file to reach durable storage
        <structfield>status</structfield> <type>text</type>
       </para>
       <para>
-       Activity status of the WAL receiver process
+       Activity status of the WAL receiver process. Possible values are:
+       <literal>starting</literal> (WAL receiver process has been launched
+       but is not yet initialized),
+       <literal>connecting</literal> (WAL receiver is connecting to the
+       primary, replication has not yet started),
+       <literal>streaming</literal> (WAL receiver is streaming WAL data),
+       <literal>waiting</literal> (WAL receiver has stopped streaming and is
+       waiting for new instructions from the startup process),
+       <literal>restarting</literal> (WAL receiver has been asked to restart
+       streaming), and
+       <literal>stopping</literal> (WAL receiver has been requested to stop).
+       This view returns no row when the WAL receiver is not running.
       </para></entry>
      </row>

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a41453530a1..92e54e52e95 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -205,6 +205,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 			/* The usual case */
 			break;

+		case WALRCV_CONNECTING:
 		case WALRCV_WAITING:
 		case WALRCV_STREAMING:
 		case WALRCV_RESTARTING:
@@ -215,7 +216,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 	}
 	/* Advertise our PID so that the startup process can kill us */
 	walrcv->pid = MyProcPid;
-	walrcv->walRcvState = WALRCV_STREAMING;
+	walrcv->walRcvState = WALRCV_CONNECTING;

 	/* Fetch information required to start streaming */
 	walrcv->ready_to_display = false;
@@ -395,6 +396,12 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 									LSN_FORMAT_ARGS(startpoint), startpointTLI));
 			first_stream = false;

+			/* Connection established, switch to streaming state */
+			SpinLockAcquire(&walrcv->mutex);
+			Assert(walrcv->walRcvState == WALRCV_CONNECTING);
+			walrcv->walRcvState = WALRCV_STREAMING;
+			SpinLockRelease(&walrcv->mutex);
+
 			/* Initialize LogstreamResult and buffers for processing messages */
 			LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);
 			initStringInfo(&reply_message);
@@ -650,7 +657,7 @@ WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI)

 	SpinLockAcquire(&walrcv->mutex);
 	state = walrcv->walRcvState;
-	if (state != WALRCV_STREAMING)
+	if (state != WALRCV_STREAMING && state != WALRCV_CONNECTING)
 	{
 		SpinLockRelease(&walrcv->mutex);
 		if (state == WALRCV_STOPPING)
@@ -689,7 +696,7 @@ WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI)
 			 */
 			*startpoint = walrcv->receiveStart;
 			*startpointTLI = walrcv->receiveStartTLI;
-			walrcv->walRcvState = WALRCV_STREAMING;
+			walrcv->walRcvState = WALRCV_CONNECTING;
 			SpinLockRelease(&walrcv->mutex);
 			break;
 		}
@@ -792,6 +799,7 @@ WalRcvDie(int code, Datum arg)
 	/* Mark ourselves inactive in shared memory */
 	SpinLockAcquire(&walrcv->mutex);
 	Assert(walrcv->walRcvState == WALRCV_STREAMING ||
+		   walrcv->walRcvState == WALRCV_CONNECTING ||
 		   walrcv->walRcvState == WALRCV_RESTARTING ||
 		   walrcv->walRcvState == WALRCV_STARTING ||
 		   walrcv->walRcvState == WALRCV_WAITING ||
@@ -1391,6 +1399,8 @@ WalRcvGetStateString(WalRcvState state)
 			return "stopped";
 		case WALRCV_STARTING:
 			return "starting";
+		case WALRCV_CONNECTING:
+			return "connecting";
 		case WALRCV_STREAMING:
 			return "streaming";
 		case WALRCV_WAITING:
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index da8794cba7c..42e3e170bc0 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -179,7 +179,7 @@ WalRcvStreaming(void)
 	}

 	if (state == WALRCV_STREAMING || state == WALRCV_STARTING ||
-		state == WALRCV_RESTARTING)
+		state == WALRCV_CONNECTING || state == WALRCV_RESTARTING)
 		return true;
 	else
 		return false;
@@ -211,6 +211,7 @@ ShutdownWalRcv(void)
 			stopped = true;
 			break;

+		case WALRCV_CONNECTING:
 		case WALRCV_STREAMING:
 		case WALRCV_WAITING:
 		case WALRCV_RESTARTING:
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f3ad00fb6f3..872deb00633 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -47,6 +47,8 @@ typedef enum
 	WALRCV_STOPPED,				/* stopped and mustn't start up again */
 	WALRCV_STARTING,			/* launched, but the process hasn't
 								 * initialized yet */
+	WALRCV_CONNECTING,			/* connecting to primary, replication not yet
+								 * started */
 	WALRCV_STREAMING,			/* walreceiver is streaming */
 	WALRCV_WAITING,				/* stopped streaming, waiting for orders */
 	WALRCV_RESTARTING,			/* asked to restart streaming */
--
2.51.0
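
To illustrate what the new status means for monitoring, the sketch below separates two questions that the patch keeps distinct: whether the receiver counts as active (the final condition in WalRcvStreaming(), which now includes the connecting state) and whether WAL is actually flowing (status "streaming" only). This is a standalone model under assumptions: the Mon* names are hypothetical, the status strings are the ones shown in the patch and documentation hunk, and the startup-timeout handling that the real WalRcvStreaming() performs is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical mirror of the receiver states listed in the patch. */
typedef enum
{
	MON_STOPPED,
	MON_STARTING,
	MON_CONNECTING,
	MON_STREAMING,
	MON_WAITING,
	MON_RESTARTING,
	MON_STOPPING
} MonState;

/* Status text as it would appear in pg_stat_wal_receiver.status. */
const char *
mon_state_string(MonState state)
{
	switch (state)
	{
		case MON_STOPPED:
			return "stopped";
		case MON_STARTING:
			return "starting";
		case MON_CONNECTING:
			return "connecting";
		case MON_STREAMING:
			return "streaming";
		case MON_WAITING:
			return "waiting";
		case MON_RESTARTING:
			return "restarting";
		case MON_STOPPING:
			return "stopping";
	}
	return "unknown";
}

/* Same state test as the patched condition in WalRcvStreaming(). */
bool
mon_receiver_active(MonState state)
{
	return state == MON_STREAMING || state == MON_STARTING ||
		state == MON_CONNECTING || state == MON_RESTARTING;
}

/* For monitoring: only "streaming" means WAL is being received. */
bool
mon_wal_is_flowing(const char *status)
{
	return strcmp(status, "streaming") == 0;
}
```

With this split, a standby that reports "connecting" for a long time still counts as having an active receiver, but a monitoring check keyed on the status text can flag that no WAL is arriving yet.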
