Re: [HACKERS] Online base backup from the hot-standby

Steve Singer Sun, 25 Sep 2011 19:45:02 -0700

On 11-09-22 09:24 AM, Fujii Masao wrote:

On Wed, Sep 21, 2011 at 11:50 AM, Fujii Masao<masao.fu...@gmail.com>  wrote:

2011/9/13 Jun Ishiduka<ishizuka....@po.ntts.co.jp>:

Update patch.


Changes:
  * set 'on' full_page_writes by user (in document)
  * read "FROM: XX" in backup_label (in xlog.c)
  * check status when pg_stop_backup is executed (in xlog.c)

Thanks for updating the patch.

Before reviewing the patch, to encourage people to comment and
review the patch, I explain what this patch provides:

Attached is the updated version of the patch. I refactored the code, fixed
some bugs, added lots of source code comments, improved the document,
but didn't change the basic design. Please check this patch, and let's use
this patch as the base if you agree with that.

I have looked at both Jun's patch from Sept 13 and Fujii's updates to the patch. I agree that Fujii's updated version should be used as the basis for changes going forward. My comments below refer to that version (unless otherwise noted).

In backup.sgml the new section titled "Making a Base Backup during Recovery" I would prefer to see some mention in the title that this procedure is for standby servers, i.e. "Making a Base Backup from a Standby Database". Users who have set up a hot-standby database should be familiar with the 'standby' terminology. I agree that the "during recovery" description is technically correct, but I'm not sure someone who is looking through the manual for instructions on making a base backup from their standby will realize this is the section they should read.

Around line 969, where you give an example of copying the control file, I would be a bit clearer that this is an example command, i.e. (Copy the pg_control file from the cluster directory to the global sub-directory of the backup. For example "cp $PGDATA/global/pg_control /mnt/server/backupdir/global")



Testing Notes
-----------------------------

I created a standby server from a base backup of another standby server. On this new standby server I then:


1. Ran pg_start_backup('3'); and left the psql connection open
2. touch /tmp/3 -- my trigger_file

ssinger@ssinger-laptop:/usr/local/pgsql92git/bin$ LOG: trigger file found: /tmp/3

FATAL:  terminating walreceiver process due to administrator command
LOG:  restored log file "000000010000000000000006" from archive
LOG:  record with zero length at 0/60002F0
LOG:  restored log file "000000010000000000000006" from archive
LOG:  redo done at 0/6000298
LOG:  restored log file "000000010000000000000006" from archive
PANIC:  record with zero length at 0/6000298
LOG:  startup process (PID 19011) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process

DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.

The new postmaster (the one trying to be promoted) dies. This is somewhat repeatable.


----

If a base backup is in progress on a recovery database and that recovery database is promoted to master, then following the promotion (if you don't restart the postmaster) I see

select pg_stop_backup();

ERROR: database system status mismatches between pg_start_backup() and pg_stop_backup()

If you restart the postmaster this goes away. When the postmaster leaves recovery mode I think it should abort an existing base backup so pg_stop_backup() will say no backup in progress, or give an error message on pg_stop_backup() saying that the base backup won't be usable. The above error doesn't really tell the user why there is a mismatch.
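
To make the two options concrete, here is a toy model in Python. It is not the xlog.c code: the class name `BackupModel`, the `abort_on_promote` flag, and the error strings are all made up for illustration, and it only sketches the behaviour I would expect to see from the user's side.

```python
class BackupModel:
    """Toy model of a base backup that may span a promotion (hypothetical)."""

    def __init__(self, abort_on_promote):
        # abort_on_promote=True models option 1: promotion cancels the backup.
        # abort_on_promote=False models option 2: pg_stop_backup() explains
        # why the backup cannot be used.
        self.abort_on_promote = abort_on_promote
        self.in_recovery = True
        self.backup_in_progress = False
        self.backup_started_in_recovery = False

    def pg_start_backup(self):
        self.backup_in_progress = True
        self.backup_started_in_recovery = self.in_recovery

    def promote(self):
        # The server leaves recovery mode (trigger file found).
        self.in_recovery = False
        if self.abort_on_promote:
            self.backup_in_progress = False

    def pg_stop_backup(self):
        if not self.backup_in_progress:
            raise RuntimeError("no backup in progress")
        self.backup_in_progress = False
        if self.backup_started_in_recovery and not self.in_recovery:
            # Illustrative wording only; the point is that it names the cause.
            raise RuntimeError(
                "standby was promoted during the backup; the backup is not usable"
            )
        return "ok"
```

Either way the user learns what happened without having to restart the postmaster, which is the part the current "mismatches" message leaves out.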


---------

In my testing a few times I got into a situation where a standby server coming from a recovery target took a while to finish recovery (this is on a database with no activity). Then when I tried promoting that server to master I got


LOG:  trigger file found: /tmp/3
FATAL:  terminating walreceiver process due to administrator command
LOG:  restored log file "000000010000000000000009" from archive
LOG:  restored log file "000000010000000000000009" from archive
LOG:  redo done at 0/90000E8
LOG:  restored log file "000000010000000000000009" from archive
PANIC:  unexpected pageaddr 0/6000000 in log file 0, segment 9, offset 0
LOG:  startup process (PID 1804) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes

It is *possible* I mixed up the order of a step somewhere since my testing isn't script based. A standby server that 'looks' okay but can't actually be promoted is dangerous.
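
Since these steps were run by hand, a scripted check of the server log would make the failure easier to reproduce. A minimal sketch in Python, assuming the server log has already been captured into a string; the function names are hypothetical, and it only looks for the message prefixes that appear in the log excerpts above.

```python
import re


def find_promotion_panics(log_text):
    """Return every line of the captured log that starts with 'PANIC:'."""
    panics = []
    for line in log_text.splitlines():
        # Match the severity prefix at the start of the line,
        # ignoring any leading whitespace.
        if re.match(r"\s*PANIC:", line):
            panics.append(line.strip())
    return panics


def promotion_panicked(log_text):
    """True if the trigger file was noticed and a PANIC was logged."""
    saw_trigger = "trigger file found" in log_text
    return saw_trigger and bool(find_promotion_panics(log_text))
```

A test script could touch the trigger file, wait for the server to settle, and then fail the run whenever `promotion_panicked()` returns true for the collected log.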

This version of the patch (I was testing the Sept 22nd version) seems less stable than how I remember the version from the July CF. Maybe I'm just testing it harder or maybe something has been broken.

In the current patch, there is no safeguard for preventing users from
taking backup during recovery when FPW is disabled. This is unsafe.
Are you planning to implement such a safeguard?

I agree with Fujii that we need a way (on the recovery machine) to detect if the master doesn't have FPW on. The ideas up-thread on how to do this sound good.

Regards,
