Re: [HACKERS] Error restoring from a base backup taken from standby

Fujii Masao Tue, 18 Dec 2012 10:07:28 -0800

On Tue, Dec 18, 2012 at 2:39 AM, Heikki Linnakangas
<[email protected]> wrote:
> (This is different from the other issue related to timeline switches I just
> posted about. There's no timeline switch involved in this one.)
>
> If you do "pg_basebackup -x" against a standby server, in some circumstances
> the backup fails to restore with an error like this:
>
> C 2012-12-17 19:09:44.042 EET 7832 LOG:  database system was not properly
> shut down; automatic recovery in progress
> C 2012-12-17 19:09:44.091 EET 7832 LOG:  record with zero length at
> 0/1764F48
> C 2012-12-17 19:09:44.091 EET 7832 LOG:  redo is not required
> C 2012-12-17 19:09:44.091 EET 7832 FATAL:  WAL ends before end of online
> backup
> C 2012-12-17 19:09:44.091 EET 7832 HINT:  All WAL generated while online
> backup was taken must be available at recovery.
> C 2012-12-17 19:09:44.092 EET 7831 LOG:  startup process (PID 7832) exited
> with exit code 1
> C 2012-12-17 19:09:44.092 EET 7831 LOG:  aborting startup due to startup
> process failure
>
> I spotted this bug while reading the code, and it took me quite a while to
> actually construct a test case to reproduce the bug, so let me begin by
> discussing the code where the bug is. You get the above error, "WAL ends
> before end of online backup", when you reach the end of WAL before reaching
> the backupEndPoint stored in the control file, which originally comes from
> the backup_label file. backupEndPoint is only used in a base backup taken
> from a standby; in a base backup taken from the master, the end-of-backup
> WAL record is used instead to mark the end of backup. In the xlog redo loop,
> after replaying each record, we check if we've just reached backupEndPoint,
> and clear it from the control file if we have. Now the problem is, if there
> are no WAL records after the checkpoint redo point, we never even enter the
> redo loop, so backupEndPoint is not cleared even though it's reached
> immediately after reading the initial checkpoint record.


Good catch!

> To deal with the similar situation wrt. reaching consistency for hot standby
> purposes, we call CheckRecoveryConsistency() before the redo loop. The
> straightforward fix is to copy-paste the check for backupEndPoint to just
> before the redo loop, next to the CheckRecoveryConsistency() call. Even
> better, I think we should move the backupEndPoint check into
> CheckRecoveryConsistency(). It's already responsible for keeping track of
> whether minRecoveryPoint has been reached, so it seems like a good idea to
> do this check there as well.
>
> Attached is a patch for that (for 9.2), as well as a script I used to
> reproduce the bug.

The patch looks good to me.
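
For anyone following the thread without the source open, here is a toy
model of the control flow described above. It is only a sketch in Python,
not the actual xlog.c code or the attached patch: the names ControlFile,
check_backup_end, recover and the check_before_loop flag are made up for
illustration, and LSNs are plain integers. It assumes the end point counts
as reached once the replay position is at or past it.

```python
from dataclasses import dataclass
from typing import List, Optional


class RecoveryError(Exception):
    """Stands in for the FATAL 'WAL ends before end of online backup'."""


@dataclass
class ControlFile:
    # Hypothetical stand-in for the control file; None means "cleared".
    backup_end_point: Optional[int]


def check_backup_end(control: ControlFile, replayed_up_to: int) -> None:
    """Clear backup_end_point once replay has reached it."""
    if (control.backup_end_point is not None
            and replayed_up_to >= control.backup_end_point):
        control.backup_end_point = None


def recover(control: ControlFile, checkpoint_lsn: int,
            records: List[int], check_before_loop: bool) -> int:
    """Replay the records (given as LSNs) that follow the checkpoint.

    check_before_loop=False models the old behaviour, where the end point
    is only tested after replaying a record inside the loop.
    check_before_loop=True models the proposed behaviour, where the same
    test also runs once before the loop starts.
    """
    replayed = checkpoint_lsn
    if check_before_loop:
        check_backup_end(control, replayed)
    for lsn in records:
        replayed = lsn  # "replay" the record
        check_backup_end(control, replayed)
    if control.backup_end_point is not None:
        raise RecoveryError("WAL ends before end of online backup")
    return replayed
```

With no records after the checkpoint, recover(ControlFile(0x1000), 0x1000,
[], check_before_loop=False) raises RecoveryError because the loop body
never runs, while the same call with check_before_loop=True clears the end
point and returns normally.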

Regards,

-- 
Fujii Masao

