[Resending since I accidentally replied off-list]

On 11/18/23 17:49, Andres Freund wrote:
On 2023-11-18 10:01:42 -0800, Andres Freund wrote:
What about adding it to the "redo starts at" message, something like

   redo starts at 12/12345678, taken from control file

or

   redo starts at 12/12345678, taken from backup label

I think it'd make sense to log use of backup_label earlier than that - the
locations from backup_label might end up not being available in the archive,
the primary or locally, and we'll error out with "could not locate a valid
checkpoint record".

I'd probably just do it within the if (read_backup_label()) block in
InitWalRecovery(), *before* the ReadCheckpointRecord().

Not enamored with the phrasing of the log messages, but here's a prototype:

When starting up with backup_label present:
LOG:  starting from base backup with redo LSN A/34100028, checkpoint LSN 
A/34100080 on timeline ID 1

I'd prefer something like:

LOG:  starting backup recovery with redo...

When restarting before reaching the end of the backup, but after backup_label
has been removed:
LOG:  continuing to start from base backup with redo LSN A/34100028
LOG:  entering standby mode
LOG:  redo starts at A/3954B958

And here:

LOG:  restarting backup recovery with redo...

Note that the LSN in the "continuing" case is the one the backup started at,
not where recovery will start.

I've wondered whether it's worth also adding an explicit message just after
ReachedEndOfBackup(), but it seems far less urgent due to the existing
"consistent recovery state reached at %X/%X" message.

I think the current message is sufficient, but what do you have in mind?

We are quite inconsistent about how we spell LSNs. Sometimes with LSN
preceding, sometimes not. Sometimes with (LSN). Etc.

Well, this could be improved in HEAD for sure.

I do like the idea of expanding the "redo starts at" message
though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
ControlFile->backupEndPoint would provide information about when the node
might become consistent.

+1

Playing around with this a bit, I'm wondering if we instead should remove that
message, and emit something more informative earlier on. If there's a problem,
you kinda want the information before we finally get to the loop in
PerformWalLRecovery(). If e.g. there's no WAL you'll only get
LOG:  invalid checkpoint record
PANIC:  could not locate a valid checkpoint record

which is delightfully lacking in details.

I've been thinking about improving this myself. It would probably also help a lot to hint that restore_command may be missing or not returning results (but also not erroring). But there are a bunch of ways to get to this message so we'd need to be careful.

There also are some other oddities:

If the primary is down when starting up, and we'd need WAL from the primary
for the first record, the "redo start at" message is delayed until that
happens, because we emit the message not before we read the first record, but
after. That's just plain odd.

Agreed. Moving it up would be better.

And sometimes we'll start referencing the LSN at which we are starting
recovery before the "redo starts at" message. If e.g. we shut down
at a restart point, we'll emit

   LOG:  consistent recovery state reached at ...
before
   LOG:  redo starts at ...

Huh, I haven't seen that one. Definitely confusing.

But that's all clearly just material for HEAD.

Absolutely. I've been thinking about some of this as well, but want to see if we can remove the backup label first so we don't have to rework a bunch of stuff.

Of course, that shouldn't stop you from proceeding. I'm sure anything you are thinking of here could be adapted.

Regards,
-David


Reply via email to