Hi,

On Mon, Jan 19, 2026 at 11:35 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi Michael,
>
> On Mon, Jan 19, 2026 at 8:13 AM Michael Paquier <[email protected]> wrote:
> >
> > On Sun, Jan 11, 2026 at 08:56:57PM +0800, Xuneng Zhou wrote:
> > > After some thought, I’m more inclined toward a startup-process–driven
> > > approach. Marking the status as streaming immediately after the
> > > connection is established does not seem to provide sufficient accuracy for
> > > monitoring purposes. Introducing an intermediate state, such as
> > > connected, would help reduce confusion when the startup process is
> > > stalled and would make it easier for users to detect and diagnose
> > > anomalies.
> > >
> > > V4 whitelisted CONNECTED and CONNECTING in WalRcvWaitForStartPosition
> > > to handle valid stream termination scenarios without triggering a
> > > FATAL error.
> > >
> > > Specifically, the walreceiver may need to transition to WAITING (idle) if:
> > >    1. 'CONNECTED': The handshake succeeded (COPY_BOTH started), but
> > > the stream ended before any WAL was applied (e.g., timeline divergence
> > > detected mid-stream).
> > >    2. 'CONNECTING': The handshake completed (START_REPLICATION
> > > acknowledged), but the primary declined to stream (e.g., no WAL
> > > available on the requested timeline).
> > >
> > >   In both cases, the receiver should pause and await a new timeline or
> > > restart position from the startup process.
> >
> > This stuff depends on the philosophical difference you want to put
> > behind "connecting" and "streaming".  My own opinion is that it is a
> > non-starter to introduce more states that can be set by the startup
> > process, and that a new state should reflect what we do in the code.
> > We already have some of that for the start and stop phases because
> > we want some ordering when the WAL receiver process is spawned and at
> > shutdown.  That's just a simple way to say that we should not rely on
> > more static variables to control how to set one or more states, and I
> > don't see why that's actually required here?  initialApplyPtr and
> > force_reply are what I see as potential recipes for more bugs in the
> > long term, as shown in the first approach.  The second patch,
> > introducing a similar new complexity with walrcv_streaming_set, is no
> > better in terms of complexity added.
> > The main take that I can retrieve from this thread is that it may take
> > time between the moment we begin a WAL receiver in WalReceiverMain(),
> > where walRcvState is switched to WALRCV_STREAMING, and the moment we
> > actually have established a connection, location where "first_stream =
> > false" (which is just to track if a WAL receiver is restarting,
> > actually) after walrcv_startstreaming() has returned true, so as far
> > as I can see you would be happy enough with the addition of a single
> > state called CONNECTING, set at the beginning of WalReceiverMain()
> > instead of where STREAMING is set now.  The same would sound kind of
> > true for WalRcvWaitForStartPosition(), because we are not actively
> > streaming yet, still we are marking the WAL receiver as streaming, so
> > the current code feels like we are cheating if we define
> > "streaming" as a WAL receiver that has already done an active
> > connection.  We also want the WAL receiver to be killable by the
> > startup process while in "connecting" or "streaming" mode.
> >
> > Hence I would suggest something like the following guidelines:
> > - Add only a CONNECTING state.  Set this state where we switch the
> > state to "streaming" now, aka the two locations in the tree now.
> > - Switch to STREAMING once the connection has been established, as
> > returned by walrcv_startstreaming(), because we are acknowledging *in
> > the code* that we have started streaming successfully.
> > - Update the docs to reflect the new state, because this state can
> > show up in the system view pg_stat_wal_receiver.
> > - I am not convinced by what we gain with a CONNECTED state, either.
> > Drop it.
> > - The fact that we'd want to switch the state once the startup process
> > has acknowledged the reception of the first byte from the stream is
> > already something we track in the WAL receiver, AFAIK.
>
> Thank you for the detailed feedback. I agree with your analysis — the
> simpler approach seems preferable and should be sufficient in most
> cases. Tightly coupling the startup process with the WAL receiver to
> set state is not ideal. I'll post v5 with the simplified
> walreceiver changes as you suggested shortly.

Please see v5 of the updated patch.
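
For anyone skimming the thread, the state flow in v5 can be sketched
roughly as below. This is a standalone illustration, not code from the
patch: the SketchEvent names and the sketch_* functions are hypothetical,
the enum is a simplified copy of WalRcvState, and the stop/shutdown
transitions are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified copy of WalRcvState as it looks with the v5 patch applied. */
typedef enum
{
    WALRCV_STOPPED,
    WALRCV_STARTING,
    WALRCV_CONNECTING,
    WALRCV_STREAMING,
    WALRCV_WAITING,
    WALRCV_RESTARTING,
    WALRCV_STOPPING
} WalRcvState;

/* Hypothetical event names, used only by this sketch. */
typedef enum
{
    EV_MAIN_ENTERED,        /* WalReceiverMain() advertises its PID */
    EV_STARTSTREAMING_OK,   /* walrcv_startstreaming() returned true */
    EV_STREAM_ENDED,        /* WalRcvWaitForStartPosition() goes idle */
    EV_RESTART_REQUESTED,   /* startup process asks for a restart */
    EV_RESTART_PICKED_UP    /* walreceiver picks up the new start position */
} SketchEvent;

/* Apply one event; return false and leave *state alone if it is not allowed. */
static bool
sketch_transition(WalRcvState *state, SketchEvent ev)
{
    switch (ev)
    {
        case EV_MAIN_ENTERED:
            if (*state != WALRCV_STARTING)
                return false;
            *state = WALRCV_CONNECTING;     /* was STREAMING before v5 */
            return true;
        case EV_STARTSTREAMING_OK:
            if (*state != WALRCV_CONNECTING)
                return false;
            *state = WALRCV_STREAMING;
            return true;
        case EV_STREAM_ENDED:
            if (*state != WALRCV_STREAMING && *state != WALRCV_CONNECTING)
                return false;
            *state = WALRCV_WAITING;
            return true;
        case EV_RESTART_REQUESTED:
            if (*state != WALRCV_WAITING)
                return false;
            *state = WALRCV_RESTARTING;
            return true;
        case EV_RESTART_PICKED_UP:
            if (*state != WALRCV_RESTARTING)
                return false;
            *state = WALRCV_CONNECTING;     /* was STREAMING before v5 */
            return true;
    }
    return false;
}

/* Mirrors the state check in WalRcvStreaming() after the patch. */
static bool
sketch_is_active(WalRcvState state)
{
    return state == WALRCV_STREAMING || state == WALRCV_STARTING ||
           state == WALRCV_CONNECTING || state == WALRCV_RESTARTING;
}

/* Walk one start, stream, idle, restart cycle. */
static void
sketch_demo(void)
{
    WalRcvState s = WALRCV_STARTING;

    assert(sketch_transition(&s, EV_MAIN_ENTERED) && s == WALRCV_CONNECTING);
    assert(sketch_is_active(s));
    assert(sketch_transition(&s, EV_STARTSTREAMING_OK) && s == WALRCV_STREAMING);
    assert(sketch_transition(&s, EV_STREAM_ENDED) && s == WALRCV_WAITING);
    assert(!sketch_is_active(s));
    assert(sketch_transition(&s, EV_RESTART_REQUESTED) && s == WALRCV_RESTARTING);
    assert(sketch_transition(&s, EV_RESTART_PICKED_UP) && s == WALRCV_CONNECTING);
}
```

The point of the sketch is that "streaming" is now only reachable through
"connecting", and that a receiver which never got past "connecting" can
still go idle in WalRcvWaitForStartPosition() without hitting the FATAL
path.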


--
Best,
Xuneng
From 8a3267ed393f3f6a5d07ded279bc25aaa36ebae6 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Wed, 21 Jan 2026 13:04:13 +0800
Subject: [PATCH v5] Add WALRCV_CONNECTING state to walreceiver

Previously, walreceiver set its state to WALRCV_STREAMING immediately
at startup, before actually establishing a replication connection. This
was misleading for monitoring, as pg_stat_wal_receiver would show
"streaming" even while the connection was still being established.

Introduce WALRCV_CONNECTING state to accurately reflect the period
between walreceiver startup and successful START_REPLICATION. The
transition to WALRCV_STREAMING now occurs only after
walrcv_startstreaming() returns successfully.

Update pg_stat_wal_receiver documentation to describe all possible
status values and clarify that the view returns no row when the WAL
receiver is not running.
---
 doc/src/sgml/monitoring.sgml               | 13 ++++++++++++-
 src/backend/replication/walreceiver.c      | 16 +++++++++++++---
 src/backend/replication/walreceiverfuncs.c |  3 ++-
 src/include/replication/walreceiver.h      |  2 ++
 4 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 817fd9f4ca7..828498569fa 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1737,7 +1737,18 @@ description | Waiting for a newly initialized WAL file to reach durable storage
        <structfield>status</structfield> <type>text</type>
       </para>
       <para>
-       Activity status of the WAL receiver process
+       Activity status of the WAL receiver process. Possible values are:
+       <literal>starting</literal> (WAL receiver process has been launched
+       but is not yet initialized),
+       <literal>connecting</literal> (WAL receiver is connecting to the
+       primary, replication has not yet started),
+       <literal>streaming</literal> (WAL receiver is streaming WAL data),
+       <literal>waiting</literal> (WAL receiver has stopped streaming and is
+       waiting for new instructions from the startup process),
+       <literal>restarting</literal> (WAL receiver has been asked to restart
+       streaming), and
+       <literal>stopping</literal> (WAL receiver has been requested to stop).
+       This view returns no row when the WAL receiver is not running.
       </para></entry>
      </row>
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a41453530a1..92e54e52e95 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -205,6 +205,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 			/* The usual case */
 			break;
 
+		case WALRCV_CONNECTING:
 		case WALRCV_WAITING:
 		case WALRCV_STREAMING:
 		case WALRCV_RESTARTING:
@@ -215,7 +216,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 	}
 	/* Advertise our PID so that the startup process can kill us */
 	walrcv->pid = MyProcPid;
-	walrcv->walRcvState = WALRCV_STREAMING;
+	walrcv->walRcvState = WALRCV_CONNECTING;
 
 	/* Fetch information required to start streaming */
 	walrcv->ready_to_display = false;
@@ -395,6 +396,12 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 							   LSN_FORMAT_ARGS(startpoint), startpointTLI));
 			first_stream = false;
 
+			/* Connection established, switch to streaming state */
+			SpinLockAcquire(&walrcv->mutex);
+			Assert(walrcv->walRcvState == WALRCV_CONNECTING);
+			walrcv->walRcvState = WALRCV_STREAMING;
+			SpinLockRelease(&walrcv->mutex);
+
 			/* Initialize LogstreamResult and buffers for processing messages */
 			LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);
 			initStringInfo(&reply_message);
@@ -650,7 +657,7 @@ WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI)
 
 	SpinLockAcquire(&walrcv->mutex);
 	state = walrcv->walRcvState;
-	if (state != WALRCV_STREAMING)
+	if (state != WALRCV_STREAMING && state != WALRCV_CONNECTING)
 	{
 		SpinLockRelease(&walrcv->mutex);
 		if (state == WALRCV_STOPPING)
@@ -689,7 +696,7 @@ WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI)
 			 */
 			*startpoint = walrcv->receiveStart;
 			*startpointTLI = walrcv->receiveStartTLI;
-			walrcv->walRcvState = WALRCV_STREAMING;
+			walrcv->walRcvState = WALRCV_CONNECTING;
 			SpinLockRelease(&walrcv->mutex);
 			break;
 		}
@@ -792,6 +799,7 @@ WalRcvDie(int code, Datum arg)
 	/* Mark ourselves inactive in shared memory */
 	SpinLockAcquire(&walrcv->mutex);
 	Assert(walrcv->walRcvState == WALRCV_STREAMING ||
+		   walrcv->walRcvState == WALRCV_CONNECTING ||
 		   walrcv->walRcvState == WALRCV_RESTARTING ||
 		   walrcv->walRcvState == WALRCV_STARTING ||
 		   walrcv->walRcvState == WALRCV_WAITING ||
@@ -1391,6 +1399,8 @@ WalRcvGetStateString(WalRcvState state)
 			return "stopped";
 		case WALRCV_STARTING:
 			return "starting";
+		case WALRCV_CONNECTING:
+			return "connecting";
 		case WALRCV_STREAMING:
 			return "streaming";
 		case WALRCV_WAITING:
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index da8794cba7c..42e3e170bc0 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -179,7 +179,7 @@ WalRcvStreaming(void)
 	}
 
 	if (state == WALRCV_STREAMING || state == WALRCV_STARTING ||
-		state == WALRCV_RESTARTING)
+		state == WALRCV_CONNECTING || state == WALRCV_RESTARTING)
 		return true;
 	else
 		return false;
@@ -211,6 +211,7 @@ ShutdownWalRcv(void)
 			stopped = true;
 			break;
 
+		case WALRCV_CONNECTING:
 		case WALRCV_STREAMING:
 		case WALRCV_WAITING:
 		case WALRCV_RESTARTING:
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f3ad00fb6f3..872deb00633 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -47,6 +47,8 @@ typedef enum
 	WALRCV_STOPPED,				/* stopped and mustn't start up again */
 	WALRCV_STARTING,			/* launched, but the process hasn't
 								 * initialized yet */
+	WALRCV_CONNECTING,			/* connecting to primary, replication not yet
+								 * started */
 	WALRCV_STREAMING,			/* walreceiver is streaming */
 	WALRCV_WAITING,				/* stopped streaming, waiting for orders */
 	WALRCV_RESTARTING,			/* asked to restart streaming */
-- 
2.51.0
