Hello.

I changed the condition from randAccess to fetching_ckpt considering
the discussion in another thread [1]. Then I moved the block that
shows the new messages to more appropriate place.

At Fri, 28 Feb 2020 17:28:06 +0900 (JST), Kyotaro Horiguchi 
<horikyota....@gmail.com> wrote in 
> > 
> > Have you considered an error context here?  Your patch leads to a bit
> > of duplication with the message a bit down of what you are changing
> > where the end of local pg_wal is reached.
> 
> It is a DEBUG message and it is for the time moving from crash
> recovery to archive recovery. I could remove that but decided to leave
> it for tracability.

I modified the message so that it has the same look to the new
messages, but I left it being DEBUG1, since it is just a intermediate
state. We should finally see one of the new three messages.

After the messages changed, another message from wal sender came to
look redundant.

| [20866] LOG:  replication terminated by primary server
| [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
| [20866] FATAL:  could not send end-of-streaming message to primary: no COPY 
in progress
| [20851] LOG:  reached end of WAL at 0/30001C8 on timeline 1 in archive during 
standby mode
| [20851] DETAIL:  invalid record length at 0/30001C8: wanted 24, got 0

I changed the above to the below, which looks more adequate.

| [24271]  LOG:  replication terminated by primary server on timeline 1 at 
0/3000240.
| [24271]  FATAL:  could not send end-of-streaming message to primary: no COPY 
in progress
| [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive 
during standby mode
| [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0

> > > + * reached the end of WAL.  Otherwise something's really wrong and
> > > + * we report just only the errormsg if any. If we don't receive
> > 
> > This sentence sounds strange to me.  Or you meant "Something is wrong,
> > so use errormsg as report if it is set"?

The message no longer exists.

> > > + (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash 
> > > recovery",
> > 
> > FWIW, you are introducing three times the same typo, in the same
> > word, in three different messages.
> 
> They're copy-pasto.  I refrained from constructing an error message
> from multiple nonindipendent parts.  Are you suggesting to do so?

The tree times repetition of almost same phrases is very unreadable. I
rewrote it in more simple shape.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 17ee82e5d44dd5a932ed69b8a1ea91a23d170952 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota....@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v2] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is happening.
Make this message less scary as "reached end of WAL".
---
 src/backend/access/transam/xlog.c     | 72 ++++++++++++++++++++-------
 src/backend/replication/walreceiver.c |  3 +-
 2 files changed, 55 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d19408b3be..849cf6fe6b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4282,12 +4282,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
+			ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
+
 			if (readFile >= 0)
 			{
 				close(readFile);
@@ -4295,14 +4298,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we are fetching checkpoint, we emit the error message right
+			 * now. Otherwise the error is regarded as "end of WAL" and the
+			 * message if any is shown as a part of the end-of-WAL message
+			 * below.
 			 */
-			if (errormsg)
-				ereport(emode_for_corrupt_record(emode, EndRecPtr),
+			if (fetching_ckpt && errormsg)
+			{
+				ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */ ));
+			}
 		}
 
 		/*
@@ -4332,11 +4337,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
-		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
 
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
+		{
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4349,11 +4355,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-				!fetching_ckpt)
+			if (!InArchiveRecovery && ArchiveRecoveryRequested)
 			{
+				/*
+				 * We don't report this as LOG, since we don't stop recovery
+				 * here
+				 */
 				ereport(DEBUG1,
-						(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+						(errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+										 (uint32) (ErrRecPtr >> 32),
+										 (uint32) ErrRecPtr,
+										 ThisTimeLineID,
+										 xlogSourceNames[currentSource])));
 				InArchiveRecovery = true;
 				if (StandbyModeRequested)
 					StandbyMode = true;
@@ -4391,12 +4404,35 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				continue;
 			}
 
-			/* In standby mode, loop back to retry. Otherwise, give up. */
-			if (StandbyMode && !CheckForStandbyTrigger())
-				continue;
-			else
-				return NULL;
+			/*
+			 *  We reached the end of WAL, show the messages just once at the
+			 *  same LSN.
+			 */
+			if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+			{
+				char *fmt;
+
+				if (StandbyMode)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+				else if (InArchiveRecovery)
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+				else
+					fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+				ereport(LOG,
+						(errmsg (fmt, (uint32) (EndRecPtr >> 32),
+								 (uint32) EndRecPtr,
+								 ThisTimeLineID,
+								 xlogSourceNames[currentSource]),
+						 (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+			}
 		}
+
+		/* In standby mode, loop back to retry. Otherwise, give up. */
+		if (StandbyMode && !CheckForStandbyTrigger())
+			continue;
+		else
+			return NULL;
 	}
 }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2ab15c3cbb..682dbb4e1f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -478,8 +478,7 @@ WalReceiverMain(void)
 						else if (len < 0)
 						{
 							ereport(LOG,
-									(errmsg("replication terminated by primary server"),
-									 errdetail("End of WAL reached on timeline %u at %X/%X.",
+									(errmsg("replication terminated by primary server on timeline %u at %X/%X.",
 											   startpointTLI,
 											   (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
 							endofwal = true;
-- 
2.18.2

Reply via email to