Hello. I changed the condition from randAccess to fetching_ckpt considering the discussion in another thread [1]. Then I moved the block that shows the new messages to more appropriate place.
At Fri, 28 Feb 2020 17:28:06 +0900 (JST), Kyotaro Horiguchi <horikyota....@gmail.com> wrote in > > > > Have you considered an error context here? Your patch leads to a bit > > of duplication with the message a bit down of what you are changing > > where the end of local pg_wal is reached. > > It is a DEBUG message and it is for the time moving from crash > recovery to archive recovery. I could remove that but decided to leave > it for tracability. I modified the message so that it has the same look to the new messages, but I left it being DEBUG1, since it is just a intermediate state. We should finally see one of the new three messages. After the messages changed, another message from wal sender came to look redundant. | [20866] LOG: replication terminated by primary server | [20866] DETAIL: End of WAL reached on timeline 1 at 0/30001C8. | [20866] FATAL: could not send end-of-streaming message to primary: no COPY in progress | [20851] LOG: reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode | [20851] DETAIL: invalid record length at 0/30001C8: wanted 24, got 0 I changed the above to the below, which looks more adequate. | [24271] LOG: replication terminated by primary server on timeline 1 at 0/3000240. | [24271] FATAL: could not send end-of-streaming message to primary: no COPY in progress | [24267] LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode | [24267] DETAIL: invalid record length at 0/3000240: wanted 24, got 0 > > > + * reached the end of WAL. Otherwise something's really wrong and > > > + * we report just only the errormsg if any. If we don't receive > > > > This sentence sounds strange to me. Or you meant "Something is wrong, > > so use errormsg as report if it is set"? The message no longer exists. > > > + (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash > > > recovery", > > > > FWIW, you are introducing three times the same typo, in the same > > word, in three different messages. > > They're copy-pasto. I refrained from constructing an error message > from multiple nonindipendent parts. Are you suggesting to do so? The tree times repetition of almost same phrases is very unreadable. I rewrote it in more simple shape. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
>From 17ee82e5d44dd5a932ed69b8a1ea91a23d170952 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota....@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v2] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Make this message less scary as "reached end of WAL". --- src/backend/access/transam/xlog.c | 72 ++++++++++++++++++++------- src/backend/replication/walreceiver.c | 3 +- 2 files changed, 55 insertions(+), 20 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index d19408b3be..849cf6fe6b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4282,12 +4282,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); ReadRecPtr = xlogreader->ReadRecPtr; EndRecPtr = xlogreader->EndRecPtr; if (record == NULL) { + ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr; + if (readFile >= 0) { close(readFile); @@ -4295,14 +4298,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * If we are fetching checkpoint, we emit the error message right + * now. Otherwise the error is regarded as "end of WAL" and the + * message if any is shown as a part of the end-of-WAL message + * below. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, EndRecPtr), + if (fetching_ckpt && errormsg) + { + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); + } } /* @@ -4332,11 +4337,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else - { - /* No valid record available from this source */ - lastSourceFailed = true; + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) + { /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4349,11 +4355,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery", + (uint32) (ErrRecPtr >> 32), + (uint32) ErrRecPtr, + ThisTimeLineID, + xlogSourceNames[currentSource]))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4391,12 +4404,35 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * We reached the end of WAL, show the messages just once at the + * same LSN. + */ + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg (fmt, (uint32) (EndRecPtr >> 32), + (uint32) EndRecPtr, + ThisTimeLineID, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 2ab15c3cbb..682dbb4e1f 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -478,8 +478,7 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", + (errmsg("replication terminated by primary server on timeline %u at %X/%X.", startpointTLI, (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write))); endofwal = true; -- 2.18.2