Hello,

Ashwin and I recently got a chance to work on this, and we have addressed all
outstanding feedback and suggestions. Please find attached a significantly
reworked patch.
On 20.11.2020 11:21, Michael Paquier wrote:

> This patch thinks that it is fine to request streaming even if
> PrimaryConnInfo is not set, but that's not fine.

We introduced a check to ensure that PrimaryConnInfo is set up before we
request the WAL stream eagerly.

> Anyway, I don't quite understand what you are trying to achieve here.
> "startpoint" is used to request the beginning of streaming. It is
> roughly the consistency LSN + some alpha with some checks on WAL
> pages (those WAL page checks are not acceptable as they make
> maintenance harder). What about the case where consistency is
> reached but there are many segments still ahead that need to be
> replayed? Your patch would cause streaming to begin too early, and
> a manual copy of segments is not a rare thing as in some environments
> a bulk copy of segments can make the catchup of a standby faster than
> streaming.
>
> It seems to me that what you are looking for here is some kind of
> pre-processing before entering the redo loop to determine the LSN
> that could be reused for the fast streaming start, which should match
> the end of the WAL present locally. In short, you would need a
> XLogReaderState that begins a scan of WAL from the redo point until it
> cannot find anything more, and use the last LSN found as a base to
> begin requesting streaming. The question of timeline jumps can also
> be very tricky, but it could also be possible to not allow this option
> if a timeline jump happens while attempting to guess the end of WAL
> ahead of time. Another thing: could it be useful to have an extra
> mode to begin streaming without waiting for consistency to finish?

1. When wal_receiver_start_condition='consistency', we feel that the stream
start point calculation should be done only when we reach consistency.
Imagine a situation where consistency is reached two hours after startup, and
within those two hours a lot of WAL has been manually copied over into the
standby's pg_wal. If we pre-calculated the stream start location before
entering the main redo apply loop, we would start the stream from a much
earlier location (missing those two hours' worth of WAL), leading to wasted
work.

2. We have significantly changed the code that calculates the WAL stream
start location. We now traverse pg_wal, find the latest valid WAL segment,
and start the stream from that segment's start. This is much more performant
than reading from the beginning of the locally available WAL.

3. To perform the validation check, we no longer have duplicate code, as we
can now rely on XLogReaderState, XLogReaderValidatePageHeader() and friends.

4. We added an extra mode: wal_receiver_start_condition='startup', which
starts the WAL receiver before the startup process reaches consistency. We
don't fully understand the utility of 'startup' over 'consistency', though.

5. During the traversal of pg_wal, if we find WAL segments on differing
timelines, we bail out and abandon the attempt to start the WAL stream
eagerly.

6. To handle cases where a lot of WAL is copied over after the WAL receiver
has started at consistency, we could:
  i)   Not recommend wal_receiver_start_condition='startup|consistency'.
  ii)  Recommend copying over the WAL files and then starting the standby, so
       that the WAL stream starts from a fresher point.
  iii) Provide an LSN/segment# target to start the WAL receiver from?

7. We have significantly changed the test. It is much more simplified and
focused.

8. We did not test wal_receiver_start_condition='startup' in the test.
It's actually hard to assert that the walreceiver has started at startup:
recovery_min_apply_delay only kicks in once we reach consistency, so we could
not think of a reliable way to halt the startup process and check "has the WAL
receiver started even though the standby hasn't reached consistency?". The
only approach we could think of is to generate a large workload during the
course of the backup, so that the standby has significant WAL to replay before
it reaches consistency. But that would make the test flaky, as there would be
no precise wait condition. That said, we felt that checking 'consistency' is
enough, as it covers the majority of the added code.

9. We added a documentation section describing the GUC. A sample standby
configuration is also included below for reference.

Regards,
Ashwin and Soumyadeep (VMware)
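P.S. For anyone who wants to try the patch out, a minimal standby
configuration would look something like the following (host, port and user are
placeholders; the rest of the standby setup is unchanged):

    # postgresql.conf on the standby
    primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
    # Start streaming as soon as the standby reaches consistency, instead of
    # waiting for WAL from the archive and pg_wal to be exhausted first.
    # Requires a server restart to change; the default is 'exhaust'.
    wal_receiver_start_condition = 'consistency'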
From b3703fb16a352bd9166ed75de7b68599c735ac63 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty <soumyadeep2...@gmail.com>
Date: Fri, 30 Jul 2021 18:21:55 -0700
Subject: [PATCH v1 1/1] Introduce feature to start WAL receiver eagerly

This commit introduces a new GUC wal_receiver_start_condition which can
enable the standby to start its WAL receiver at an earlier stage. The
GUC will default to starting the WAL receiver after WAL from archives
and pg_wal have been exhausted, designated by the value 'exhaust'. The
value of 'startup' indicates that the WAL receiver will be started
immediately on standby startup. Finally, the value of 'consistency'
indicates that the server will start after the standby has replayed up
to the consistency point.

If 'startup' or 'consistency' is specified, the starting point for the
WAL receiver will always be the end of all locally available WAL in
pg_wal. The end is determined by finding the latest WAL segment in
pg_wal and then iterating to the earliest segment. The iteration is
terminated as soon as a valid WAL segment is found. Streaming can then
commence from the start of that segment.

Co-authors: Ashwin Agrawal, Asim Praveen, Wu Hao, Konstantin Knizhnik
Discussion: https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                     |  33 +++++
 src/backend/access/transam/xlog.c            | 141 +++++++++++++++++--
 src/backend/replication/walreceiver.c        |   1 +
 src/backend/utils/misc/guc.c                 |  18 +++
 src/include/replication/walreceiver.h        |  10 ++
 src/test/recovery/t/026_walreceiver_start.pl |  96 +++++++++++++
 6 files changed, 291 insertions(+), 8 deletions(-)
 create mode 100644 src/test/recovery/t/026_walreceiver_start.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2c31c35a6b1..91ab13e54d2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4602,6 +4602,39 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
      </listitem>
     </varlistentry>

+     <varlistentry id="guc-wal-receiver-start-condition" xreflabel="wal_receiver_start_condition">
+      <term><varname>wal_receiver_start_condition</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>wal_receiver_start_condition</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+      <para>
+       Specifies when the WAL receiver process will be started for a standby
+       server.
+       The allowed values of <varname>wal_receiver_start_condition</varname>
+       are <literal>startup</literal> (start immediately when the standby starts),
+       <literal>consistency</literal> (start only after reaching consistency), and
+       <literal>exhaust</literal> (start only after all WAL from the archive and
+       pg_wal has been replayed).
+       The default setting is <literal>exhaust</literal>.
+      </para>
+
+      <para>
+       Traditionally, the WAL receiver process is started only after the
+       standby server has exhausted all WAL from the WAL archive and the local
+       pg_wal directory. In some environments there can be a significant volume
+       of local WAL left to replay, along with a large volume of yet to be
+       streamed WAL. Such environments can benefit from setting
+       <varname>wal_receiver_start_condition</varname> to
+       <literal>startup</literal> or <literal>consistency</literal>. These
+       values will lead to the WAL receiver starting much earlier, and from
+       the end of locally available WAL. The network will be utilized to stream
+       WAL concurrently with replay, improving performance significantly.
+      </para>
+     </listitem>
+    </varlistentry>
+
     <varlistentry id="guc-wal-receiver-status-interval" xreflabel="wal_receiver_status_interval">
      <term><varname>wal_receiver_status_interval</varname> (<type>integer</type>)
      <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab03ec..698ccd4072c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -975,6 +975,7 @@ static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
 static void checkXLogConsistency(XLogReaderState *record);
+static void StartWALReceiverEagerly(void);

 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
@@ -1545,6 +1546,117 @@ checkXLogConsistency(XLogReaderState *record)
 	}
 }

+/*
+ * Start WAL receiver eagerly without waiting to play all WAL from the archive
+ * and pg_wal. First, find the last valid WAL segment in pg_wal and then
+ * request streaming to commence from its beginning.
+ */
+static void
+StartWALReceiverEagerly()
+{
+	DIR		   *dir;
+	struct dirent *de;
+	XLogSegNo	startsegno = -1;
+	XLogSegNo	endsegno = -1;
+
+	Assert(wal_receiver_start_condition <= WAL_RCV_START_AT_CONSISTENCY);
+
+	/* Find the latest and earliest WAL segments in pg_wal */
+	dir = AllocateDir("pg_wal");
+	while ((de = ReadDir(dir, "pg_wal")) != NULL)
+	{
+		/* Does it look like a WAL segment? */
+		if (IsXLogFileName(de->d_name))
+		{
+			int			logSegNo;
+			int			tli;
+
+			XLogFromFileName(de->d_name, &tli, &logSegNo, wal_segment_size);
+			if (tli != ThisTimeLineID)
+			{
+				/*
+				 * It seems wrong to stream WAL on a timeline different from
+				 * the one we are replaying on. So, bail in case a timeline
+				 * change is noticed.
+				 */
+				ereport(LOG,
+						(errmsg("Could not start streaming WAL eagerly"),
+						 errdetail("There are timeline changes in the locally available WAL files."),
+						 errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
+				return;
+			}
+			startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
+			endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
+		}
+	}
+	FreeDir(dir);
+
+	/*
+	 * We should have at least one valid WAL segment in our pg_wal, for the
+	 * standby to have started.
+	 */
+	Assert(startsegno != -1 && endsegno != -1);
+
+	/* Find the latest valid WAL segment and request streaming from its start */
+	while (endsegno >= startsegno)
+	{
+		XLogReaderState *state;
+		XLogRecPtr	startptr;
+		WALReadError errinfo;
+		char		xlogfname[MAXFNAMELEN];
+
+		XLogSegNoOffsetToRecPtr(endsegno, 0, wal_segment_size, startptr);
+		XLogFileName(xlogfname, ThisTimeLineID, endsegno,
+					 wal_segment_size);
+
+		state = XLogReaderAllocate(wal_segment_size, NULL,
+								   XL_ROUTINE(.segment_open = wal_segment_open,
+											  .segment_close = wal_segment_close),
+								   NULL);
+		if (!state)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory"),
+					 errdetail("Failed while allocating a WAL reading processor.")));
+
+		/*
+		 * Read the first page of the current WAL segment and validate it by
+		 * inspecting the page header. Once we find a valid WAL segment, we
+		 * can request WAL streaming from its beginning.
+		 */
+		XLogBeginRead(state, startptr);
+
+		if (!WALRead(state, state->readBuf, startptr, XLOG_BLCKSZ,
+					 ThisTimeLineID,
+					 &errinfo))
+			WALReadRaiseError(&errinfo);
+
+		if (XLogReaderValidatePageHeader(state, startptr, state->readBuf))
+		{
+			ereport(LOG,
+					errmsg("Requesting stream from beginning of: %s",
+						   xlogfname));
+			RequestXLogStreaming(ThisTimeLineID, startptr, PrimaryConnInfo,
+								 PrimarySlotName, wal_receiver_create_temp_slot);
+			XLogReaderFree(state);
+			return;
+		}
+
+		ereport(LOG,
+				errmsg("Invalid WAL segment found while calculating stream start: %s. Skipping.",
+					   xlogfname));
+
+		XLogReaderFree(state);
+		endsegno--;
+	}
+
+	/*
+	 * We should never reach here. We should have at least one valid WAL
+	 * segment in our pg_wal, for the standby to have started.
+	 */
+	Assert(false);
+}
+
 /*
  * Subroutine of XLogInsertRecord. Copies a WAL record to an already-reserved
  * area in the WAL.
@@ -7053,6 +7165,14 @@ StartupXLOG(void)
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
 	}

+	/*
+	 * Start WAL receiver eagerly if requested.
+	 */
+	if (StandbyModeRequested && !WalRcvStreaming() &&
+		PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
+		wal_receiver_start_condition == WAL_RCV_START_AT_STARTUP)
+		StartWALReceiverEagerly();
+
 	/*
 	 * Clear out any old relcache cache files. This is *necessary* if we do
 	 * any WAL replay, since that would probably result in the cache files
@@ -8403,6 +8523,15 @@ CheckRecoveryConsistency(void)

 		SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
 	}
+
+	/*
+	 * Start WAL receiver eagerly if requested and upon reaching a consistent
+	 * state.
+	 */
+	if (StandbyModeRequested && !WalRcvStreaming() && reachedConsistency &&
+		PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
+		wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY)
+		StartWALReceiverEagerly();
 }

 /*
@@ -12630,10 +12759,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,

 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
-					 * walreceiver if necessary.
+					 * walreceiver if necessary. The WAL receiver may have
+					 * already started (if it was configured to start
+					 * eagerly).
 					 */
 					currentSource = XLOG_FROM_STREAM;
-					startWalReceiver = true;
+					startWalReceiver = !WalRcvStreaming();
 					break;

 				case XLOG_FROM_STREAM:
@@ -12743,12 +12874,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:

-				/*
-				 * WAL receiver must not be running when reading WAL from
-				 * archive or pg_wal.
-				 */
-				Assert(!WalRcvStreaming());
-
 				/* Close any old file we might have open. */
 				if (readFile >= 0)
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 60de3be92c2..feb98add01b 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -89,6 +89,7 @@ int			wal_receiver_status_interval;
 int			wal_receiver_timeout;
 bool		hot_standby_feedback;
+int			wal_receiver_start_condition;

 /* libpqwalreceiver connection */
 static WalReceiverConn *wrconn = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 467b0fd6fe7..7d55a83bdb8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -556,6 +556,13 @@ static const struct config_enum_entry wal_compression_options[] = {
 	{NULL, 0, false}
 };

+const struct config_enum_entry wal_rcv_start_options[] = {
+	{"exhaust", WAL_RCV_START_AT_EXHAUST, false},
+	{"consistency", WAL_RCV_START_AT_CONSISTENCY, false},
+	{"startup", WAL_RCV_START_AT_STARTUP, false},
+	{NULL, 0, false}
+};
+
 /*
  * Options for enum values stored in other modules
  */
@@ -4970,6 +4977,17 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},

+	{
+		{"wal_receiver_start_condition", PGC_POSTMASTER, REPLICATION_STANDBY,
+			gettext_noop("When to start WAL receiver."),
+			NULL,
+		},
+		&wal_receiver_start_condition,
+		WAL_RCV_START_AT_EXHAUST,
+		wal_rcv_start_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 0b607ed777b..a0d14791f13 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -24,9 +24,19 @@
 #include "storage/spin.h"
 #include "utils/tuplestore.h"

+typedef enum
+{
+	WAL_RCV_START_AT_STARTUP,	/* start a WAL receiver immediately at startup */
+	WAL_RCV_START_AT_CONSISTENCY,	/* start a WAL receiver once consistency
+									 * has been reached */
+	WAL_RCV_START_AT_EXHAUST,	/* start a WAL receiver after WAL from archive
+								 * and pg_wal has been replayed (default) */
+} WalRcvStartCondition;
+
 /* user-settable parameters */
 extern int	wal_receiver_status_interval;
 extern int	wal_receiver_timeout;
+extern int	wal_receiver_start_condition;
 extern bool hot_standby_feedback;

 /*
diff --git a/src/test/recovery/t/026_walreceiver_start.pl b/src/test/recovery/t/026_walreceiver_start.pl
new file mode 100644
index 00000000000..f031ec2122f
--- /dev/null
+++ b/src/test/recovery/t/026_walreceiver_start.pl
@@ -0,0 +1,96 @@
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Checks for wal_receiver_start_condition = 'consistency'
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use File::Copy;
+use Test::More tests => 2;
+
+# Initialize primary node and start it.
+my $node_primary = PostgresNode->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Initial workload.
+$node_primary->safe_psql(
+	'postgres', qq {
+CREATE TABLE test_walreceiver_start(i int);
+SELECT pg_switch_wal();
+});
+
+# Take backup.
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Run a post-backup workload, whose WAL we will manually copy over to the
+# standby before it starts.
+my $wal_file_to_copy = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+$node_primary->safe_psql(
+	'postgres', qq {
+INSERT INTO test_walreceiver_start VALUES(1);
+SELECT pg_switch_wal();
+});
+
+# Initialize standby node from the backup and copy over the post-backup WAL.
+my $node_standby = PostgresNode->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+copy($node_primary->data_dir . '/pg_wal/' . $wal_file_to_copy,
+	$node_standby->data_dir . '/pg_wal')
+  or die "Copy failed: $!";
+
+# Set up a long delay to prevent the standby from replaying past the first
+# commit outside the backup.
+$node_standby->append_conf('postgresql.conf',
+	"recovery_min_apply_delay = '2h'");
+# Set up the walreceiver to start as soon as consistency is reached.
+$node_standby->append_conf('postgresql.conf',
+	"wal_receiver_start_condition = 'consistency'");
+
+$node_standby->start();
+
+# The standby should have reached consistency and should be blocked waiting for
+# recovery_min_apply_delay.
+$node_standby->poll_query_until(
+	'postgres', qq{
+SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
+WHERE backend_type='startup';
+}) or die "Timed out checking if startup is in recovery_min_apply_delay";
+
+# The walreceiver should have started, streaming from the end of valid locally
+# available WAL, i.e. from the WAL file that was copied over.
+$node_standby->poll_query_until('postgres',
+	"SELECT COUNT(1) = 1 FROM pg_stat_wal_receiver;")
+  or die "Timed out while waiting for streaming to start";
+my $receive_start_lsn = $node_standby->safe_psql('postgres',
+	'SELECT receive_start_lsn FROM pg_stat_wal_receiver');
+is( $node_primary->safe_psql(
+		'postgres', "SELECT pg_walfile_name('$receive_start_lsn');"),
+	$wal_file_to_copy,
+	"walreceiver started from end of valid locally available WAL");
+
+# Now run a workload which should get streamed over.
+$node_primary->safe_psql(
+	'postgres', qq {
+SELECT pg_switch_wal();
+INSERT INTO test_walreceiver_start VALUES(2);
+});
+
+# The walreceiver should be caught up, including all WAL generated post backup.
+$node_primary->wait_for_catchup('standby', 'flush');
+
+# Now clear the delay so that the standby can replay the received WAL.
+$node_standby->safe_psql('postgres',
+	'ALTER SYSTEM SET recovery_min_apply_delay TO 0;');
+$node_standby->reload;
+
+# Now the replay should catch up.
+$node_primary->wait_for_catchup('standby', 'replay');
+is( $node_standby->safe_psql(
+		'postgres', 'SELECT count(*) FROM test_walreceiver_start;'),
+	2,
+	"querying test_walreceiver_start now should return 2 rows");
-- 
2.25.1