On 11-09-22 09:24 AM, Fujii Masao wrote:
On Wed, Sep 21, 2011 at 11:50 AM, Fujii Masao<masao.fu...@gmail.com>  wrote:
2011/9/13 Jun Ishiduka<ishizuka....@po.ntts.co.jp>:
Update patch.

Changes:
  * set 'on' full_page_writes by user (in document)
  * read "FROM: XX" in backup_label (in xlog.c)
  * check status when pg_stop_backup is executed (in xlog.c)
Thanks for updating the patch.

Before reviewing the patch, to encourage people to comment and
review the patch, I explain what this patch provides:
Attached is the updated version of the patch. I refactored the code, fixed
some bugs, added lots of source code comments, improved the document,
but didn't change the basic design. Please check this patch, and let's use
this patch as the base if you agree with that.


I have looked at both Jun's patch from Sept 13 and Fujii's updates to the patch. I agree that Fujii's updated version should be used as the basis for changes going forward. My comments below refer to that version (unless otherwise noted).


In backup.sgml the new section titled "Making a Base Backup during Recovery" I would prefer to see some mention in the title that this procedure is for standby servers ie "Making a Base Backup from a Standby Database". Users who have setup a hot-standby database should be familiar with the 'standby' terminology. I agree that the "during recovery" description is technically correct but I'm not sure someone who is looking through the manual for instructions on making a base backup from here standby will realize this is the section they should read.

Around line 969 where you give an example of copying the control file I would be a bit clearer that this is an example command. Ie (Copy the pg_control file from the cluster directory to the global sub-directory of the backup. For example "cp $PGDATA/global/pg_control /mnt/server/backupdir/global")


Testing Notes
-----------------------------

I created a standby server from a base backup of another standby server. On this new standby server I then

1. Ran pg_start_backup('3'); and left the psql connection open
2. touch /tmp/3 -- my trigger_file

ssinger@ssinger-laptop:/usr/local/pgsql92git/bin$ LOG: trigger file found: /tmp/3
FATAL:  terminating walreceiver process due to administrator command
LOG:  restored log file "000000010000000000000006" from archive
LOG:  record with zero length at 0/60002F0
LOG:  restored log file "000000010000000000000006" from archive
LOG:  redo done at 0/6000298
LOG:  restored log file "000000010000000000000006" from archive
PANIC:  record with zero length at 0/6000298
LOG:  startup process (PID 19011) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command.

The new postmaster (the one trying to be promoted) dies. This is somewhat repeatable.

----

If a base backup is in progress on a recovery database and that recovery database is promoted to master, following the promotion (if you don't restart the postmaster). I see
select pg_stop_backup();
ERROR: database system status mismatches between pg_start_backup() and pg_stop_backup()

If you restart the postmaster this goes away. When the postmaster leaves recovery mode I think it should abort an existing base backup so pg_stop_backup() will say no backup in progress, or give an error message on pg_stop_backup() saying that the base backup won't be usable. The above error doesn't really tell the user why there is a mismatch.

---------

In my testing a few times I got into a situation where a standby server coming from a recovery target took a while to finish recovery (this is on a database with no activity). Then when i tried promoting that server to master I got

LOG:  trigger file found: /tmp/3
FATAL:  terminating walreceiver process due to administrator command
LOG:  restored log file "000000010000000000000009" from archive
LOG:  restored log file "000000010000000000000009" from archive
LOG:  redo done at 0/90000E8
LOG:  restored log file "000000010000000000000009" from archive
PANIC:  unexpected pageaddr 0/6000000 in log file 0, segment 9, offset 0
LOG:  startup process (PID 1804) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes


It is *possible* I mixed up the order of a step somewhere since my testing isn't script based. A standby server that 'looks' okay but can't actually be promoted is dangerous.

This version of the patch (I was testing the Sept 22nd version) seems less stable than how I remember the version from the July CF. Maybe I'm just testing it harder or maybe something has been broken.



In the current patch, there is no safeguard for preventing users from
taking backup during recovery when FPW is disabled. This is unsafe.
Are you planning to implement such a safeguard?


I agree with Fujii that we need a way (on the recovery machine) to detect if the master doesn't have FPW on. The ideas up-thread on how to do this sound good.


Regards,





Reply via email to