Last night the standby server in one of our PITR setups seemingly spontaneously decided that it had recovered completely and started up on a new timeline; I'm running the details by you to see whether what happened is considered a bug, or whether there is a simple explanation.

So some questions here:

1) Is there a hard limit on the number of times archive_command will be retried for a given segment? I didn't see anything documented about this in the PITR or configuration docs, so I'm guessing the 10 failures I saw in the log are just coincidental.

2) Could the archive_command failures in the master's log be responsible for what the standby saw in the redo records?

3) Would a forced shutdown of Pg cause problems with the generated WAL when it is replayed?

4) At first I thought this had to do with a bug/failure in pg_standby, but now I'm wondering whether it has to do with the "record with zero length" referenced in the standby's logs. Thoughts?

Details:

The postgres partition on the master had completely filled up, causing the postmaster to shut down. We removed some cruft from the partition and were able to get the server started back up. When I then checked on the standby server, I noticed that it had switched out of recovery mode, was archiving its own WAL, etc., and was perfectly connectable from the command line.

The archive_command is:

cp -i "%p" /mnt/pg_wal_archive/"%f" && cat %p | ssh backupserver "cat - > /mnt/pg_wal_archive/.%f; if test -e /mnt/pg_wal_archive/%f ; then rm /mnt/pg_wal_archive/.%f ; exit 1; else mv -f /mnt/ pg_wal_archive/.%f /mnt/pg_wal_archive/%f ; fi"

The recovery.conf file is a fairly vanilla pg_standby invocation:

restore_command = '/usr/lib/postgresql/8.3/bin/pg_standby -c -d -s 2 -t /tmp/pgsql.trigger.6666 /db_data/pg_wal_archive %f %p %r 2>> /db_data/pg_log/warmstandby.log'
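
For context, the only way we ever expect this standby to leave recovery is by creating the trigger file by hand, i.e. something like:

    # on the standby, to deliberately promote it
    touch /tmp/pgsql.trigger.6666

and as far as we know nothing created that file last night.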

Some possibly relevant settings from the master:

             name             |  setting
------------------------------+-----------
 archive_mode                 | on
 archive_timeout              | 60
 checkpoint_completion_target | 0.7
 checkpoint_segments          | 100
 checkpoint_timeout           | 600
 checkpoint_warning           | 30
 commit_delay                 | 0
 commit_siblings              | 5
 fsync                        | on
 full_page_writes             | on
 synchronous_commit           | on
 wal_buffers                  | 128
 wal_sync_method              | fdatasync
 wal_writer_delay             | 200
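
(For reference, the table above can be reproduced with a query against pg_settings along these lines, with the name list matching the rows shown:)

    SELECT name, setting
      FROM pg_settings
     WHERE name IN ('archive_mode', 'archive_timeout',
                    'checkpoint_completion_target', 'checkpoint_segments',
                    'checkpoint_timeout', 'checkpoint_warning',
                    'commit_delay', 'commit_siblings', 'fsync',
                    'full_page_writes', 'synchronous_commit', 'wal_buffers',
                    'wal_sync_method', 'wal_writer_delay')
     ORDER BY name;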

----
Relevant lines from the standby:

2009-11-25 04:03:52 UTC [6315]: [7478-1] LOG:  restored log file "000000020000064200000001" from archive
2009-11-25 04:03:58 UTC [6315]: [7479-1] LOG:  restored log file "000000020000064200000002" from archive
2009-11-25 04:44:09 UTC [6315]: [7480-1] LOG:  restored log file "000000020000064200000003" from archive
2009-11-25 04:45:22 UTC [6315]: [7481-1] LOG:  record with zero length at 642/3FFEB38
2009-11-25 04:45:22 UTC [6315]: [7482-1] LOG:  redo done at 642/3FFEAF0
2009-11-25 04:45:22 UTC [6315]: [7483-1] LOG:  last completed transaction was at log time 2009-11-25 04:04:03.067187+00
2009-11-25 04:45:22 UTC [6315]: [7484-1] LOG:  restored log file "000000020000064200000003" from archive
2009-11-25 04:45:28 UTC [6315]: [7485-1] LOG:  selected new timeline ID: 3
2009-11-25 04:45:28 UTC [6315]: [7486-1] LOG:  restored log file "00000002.history" from archive
2009-11-25 04:45:28 UTC [6315]: [7487-1] LOG:  archive recovery complete
2009-11-25 04:45:28 UTC [6315]: [7488-1] LOG:  checkpoint starting: shutdown immediate
2009-11-25 04:45:29 UTC [6315]: [7489-1] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 14 recycled; write=0.009 s, sync=0.010 s, total=0.404 s
2009-11-25 04:45:30 UTC [22614]: [1-1] LOG:  autovacuum launcher started
2009-11-25 04:45:30 UTC [6313]: [1-1] LOG: database system is ready to accept connections
2009-11-25 04:55:30 UTC [22612]: [1-1] LOG:  checkpoint starting: time
----
From the master:
aws-oi-db-master-11192009:/var/log/postgresql# date
Wed Nov 25 14:40:44 EST 2009
aws-oi-db-master-11192009:/var/log/postgresql# grep WARNING postgresql-2009-11-25.log
2009-11-25 00:00:34 EST [30401]: [126-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:01:36 EST [30401]: [133-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:02:34 EST [30401]: [140-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:03:34 EST [30401]: [147-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:04:34 EST [30401]: [154-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:05:34 EST [30401]: [161-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:06:34 EST [30401]: [168-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:07:34 EST [30401]: [175-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:08:34 EST [30401]: [182-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:09:34 EST [30401]: [189-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
----
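
One thing I still intend to check (sketch below; the paths are the ones from the archive_command and restore_command above, and it assumes the backup server's /mnt/pg_wal_archive is what the standby sees as /db_data/pg_wal_archive) is whether the copy of 000000020000064200000003 that the standby restored was a complete 16 MB segment or a truncated one:

    # on the master: the local archive copy of the problem segment
    ls -l /mnt/pg_wal_archive/000000020000064200000003
    md5sum /mnt/pg_wal_archive/000000020000064200000003

    # on the standby: the copy recovery restored, if pg_standby -c
    # hasn't already cleaned it up
    ls -l /db_data/pg_wal_archive/000000020000064200000003
    md5sum /db_data/pg_wal_archive/000000020000064200000003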
Regards,

David
--
David Christensen
End Point Corporation
da...@endpoint.com




