Re: allow online change primary_conninfo

Kyotaro Horiguchi Tue, 24 Sep 2019 22:20:04 -0700

Hello.

At Sat, 21 Sep 2019 13:06:25 +0300, Sergei Kornilov <s...@zsrv.org> wrote in 
<41171569060...@myt5-b646bde4b8f3.qloud-c.yandex.net>
> Hello
> 
> Thank you for review! Can you please also check v4 version? v5 implements 
> design suggested by Kyotaro Horiguchi-san, while v4 has another design. Which 
> one do you prefer? Or are both wrong?
> 
> > I can't parse that comment. What does "skipping to starting" mean? I
> > assume it's just about avoiding wal_retrieve_retry_interval, but I think
> > the comment ought to be rephrased.
> 
> Design idea is to rewrite current state from working XLOG_FROM_STREAM to 
> failed XLOG_FROM_ARCHIVE (without actually try this method on this iteration) 
> and immediately go to next iteration to advance the state machine. Next 
> iteration after failed archive recovery is walreceiver. So walreceiver will 
> be stopped just before this lines and started on next iteration. Walreceiver 
> will be restarted, we do not call restore_command


Sorry, it's my bad. It meant "If wal receiver is requested to
restart, change state to XLOG_FROM_STREAM immediately skipping
the next XLOG_FROM_ARCHIVE state.".

> > Also, should we really check this before scanning for new timelines?
> 
> Hmm, on the next day... No, this is not really necessary.

No problem when timeline is not changed when explicit restart of
wal receiver.  But if it happened just after the standby received
new hisotry file, we suffer an extra restart of wal receiver. It
seems better that we check that in the case.

> > Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just
> > restarting walreceiver? The server might unnecessarily get stuck in
> > archive based recovery for a long time this way? It seems to me that
> > we'd need to actually trigger RequestXLogStreaming() in this case.
> 
> I hope I clarified this in design idea description above.

My suggestion was just to pretend that the next XLOG_FROM_STREAM
failed, but the outcome actually looks wierd as Andres commented.

I think that comes from the structure of the state machine. WAL
receiver is started not at the beginning of XLOG_FROM_STREAM
state but at the end of XLOG_FROM_ARCHIVE. So we need to switch
once to XLOG_FROM_ARCHIVE in order to start wal receiver.

So, I'd like to propose to move the stuff to the second switch().
(See the attached incomplete patch.) This is rather similar to
Sergei's previous proposal, but the structure of the state
machine is kept.

Some other comments are below.

In ProcessStartupSigHup, conninfo and slotname don't need to be
retained until the end of the function.

The log message in the function seems to be too detailed.  On the
other hand, if we changed primary_conninfo to '' (stop) or vise
versa (start), the message (restart) looks strange.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c69eb6dd7..dba0042db6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11755,12 +11755,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	for (;;)
 	{
 		int			oldSource = currentSource;
+		bool		start_wal_receiver = false;
 
 		/*
-		 * First check if we failed to read from the current source, and
+		 * First check if we failed to read from the current source or if
+		 * we want to restart wal receiver, and
 		 * advance the state machine if so. The failure to read might've
 		 * happened outside this function, e.g when a CRC check fails on a
 		 * record, or within this loop.
+		 * Restart wal receiver if explicitly requested, too.
 		 */
 		if (lastSourceFailed)
 		{
@@ -11788,53 +11791,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (!StandbyMode)
 						return false;
 
-					/*
-					 * If primary_conninfo is set, launch walreceiver to try
-					 * to stream the missing WAL.
-					 *
-					 * If fetching_ckpt is true, RecPtr points to the initial
-					 * checkpoint location. In that case, we use RedoStartLSN
-					 * as the streaming start position instead of RecPtr, so
-					 * that when we later jump backwards to start redo at
-					 * RedoStartLSN, we will have the logs streamed already.
-					 */
-					if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
-					{
-						XLogRecPtr	ptr;
-						TimeLineID	tli;
-
-						if (fetching_ckpt)
-						{
-							ptr = RedoStartLSN;
-							tli = ControlFile->checkPointCopy.ThisTimeLineID;
-						}
-						else
-						{
-							ptr = RecPtr;
-
-							/*
-							 * Use the record begin position to determine the
-							 * TLI, rather than the position we're reading.
-							 */
-							tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
-
-							if (curFileTLI > 0 && tli < curFileTLI)
-								elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
-									 (uint32) (tliRecPtr >> 32),
-									 (uint32) tliRecPtr,
-									 tli, curFileTLI);
-						}
-						curFileTLI = tli;
-						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
-											 PrimarySlotName);
-						receivedUpto = 0;
-					}
-
 					/*
 					 * Move to XLOG_FROM_STREAM state in either case. We'll
 					 * get immediate failure if we didn't launch walreceiver,
 					 * and move on to the next state.
 					 */
+					start_wal_receiver = true;
 					currentSource = XLOG_FROM_STREAM;
 					break;
 
@@ -11864,8 +11826,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						ShutdownWalRcv();
 
 					/*
-					 * Before we sleep, re-scan for possible new timelines if
-					 * we were requested to recover to the latest timeline.
+					 * Re-scan for possible new timelines if we were requested
+					 * to recover to the latest timeline.
 					 */
 					if (recoveryTargetTimeLineGoal == RECOVERY_TARGET_TIMELINE_LATEST)
 					{
@@ -11929,8 +11891,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				 lastSourceFailed ? "failure" : "success");
 
 		/*
-		 * We've now handled possible failure. Try to read from the chosen
-		 * source.
+		 * We've now handled possible failure and configuration change. Try to
+		 * read from the chosen source.
 		 */
 		lastSourceFailed = false;
 
@@ -11969,9 +11931,71 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					bool		havedata;
 
 					/*
-					 * Check if WAL receiver is still active.
+					 * shutdown wal receiver if restart is requested.
 					 */
-					if (!WalRcvStreaming())
+					if (!start_wal_receiver && pendingWalRcvRestart)
+					{
+						if (WalRcvRunning())
+							ShutdownWalRcv();
+
+						/*
+						 * Re-scan for possible new timelines if we were
+						 * requested to recover to the latest timeline.
+						 */
+						if (recoveryTargetTimeLineGoal ==
+							RECOVERY_TARGET_TIMELINE_LATEST)
+							rescanLatestTimeLine();
+						
+						start_wal_receiver = true;
+					}
+					pendingWalRcvRestart = false;
+
+					/*
+					 * Launch walreceiver if needed.
+					 *
+					 * If fetching_ckpt is true, RecPtr points to the initial
+					 * checkpoint location. In that case, we use RedoStartLSN
+					 * as the streaming start position instead of RecPtr, so
+					 * that when we later jump backwards to start redo at
+					 * RedoStartLSN, we will have the logs streamed already.
+					 */
+					if (start_wal_receiver &&
+						PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
+					{
+						XLogRecPtr	ptr;
+						TimeLineID	tli;
+
+						if (fetching_ckpt)
+						{
+							ptr = RedoStartLSN;
+							tli = ControlFile->checkPointCopy.ThisTimeLineID;
+						}
+						else
+						{
+							ptr = RecPtr;
+
+							/*
+							 * Use the record begin position to determine the
+							 * TLI, rather than the position we're reading.
+							 */
+							tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
+
+							if (curFileTLI > 0 && tli < curFileTLI)
+								elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
+									 (uint32) (tliRecPtr >> 32),
+									 (uint32) tliRecPtr,
+									 tli, curFileTLI);
+						}
+						curFileTLI = tli;
+						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
+											 PrimarySlotName);
+						receivedUpto = 0;
+					}
+
+					/*
+					 * Else, check if WAL receiver is still active.
+					 */
+					else if (!WalRcvStreaming())
 					{
 						lastSourceFailed = true;
 						break;

Re: allow online change primary_conninfo

Reply via email to