Last night the standby server in one of our PITR setups seemingly spontaneously decided that it had recovered completely and started up on a new timeline; I'm running the details by you to see whether what happened is considered a bug, or whether there is a simple explanation.

So some questions here:

1) Is there a hard limit on the number of times archive_command will be retried for a given segment? I didn't see anything documented about this in the PITR or configuration docs, so I'm guessing the 10 failures I saw in the log are just coincidental.

2) Could the archive_command failures in the master's log be responsible for what the standby saw in the redo records?

3) Would a forced shutdown of Pg cause problems with the generated WAL when it is replayed?

4) At first I thought this had to do with a bug/failure in pg_standby, but now I'm wondering whether it has to do with the "record with zero length" referenced in the standby's logs. Thoughts?

Details:

The postgres partition on the master had completely filled up, causing the postmaster to shut down. We removed some cruft from the partition and were able to get the server started back up. When I then checked on the standby server, I noticed that it had switched out of recovery mode, was archiving its own WAL, etc., and was perfectly connectable from the command line.

The archive_command is:

cp -i "%p" /mnt/pg_wal_archive/"%f" && cat %p | ssh backupserver "cat - > /mnt/pg_wal_archive/.%f; if test -e /mnt/pg_wal_archive/%f ; then rm /mnt/pg_wal_archive/.%f ; exit 1; else mv -f /mnt/ pg_wal_archive/.%f /mnt/pg_wal_archive/%f ; fi"

The recovery.conf file is a fairly vanilla pg_standby invocation:

restore_command = '/usr/lib/postgresql/8.3/bin/pg_standby -c -d -s 2 -t /tmp/pgsql.trigger.6666 /db_data/pg_wal_archive %f %p %r 2>> /db_data/pg_log/warmstandby.log'
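
For context, the only way we ever expect this standby to leave recovery is by creating the trigger file by hand, i.e. something like:

    # on the standby, to deliberately promote it
    touch /tmp/pgsql.trigger.6666

and as far as we know nothing created that file last night.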

Some possibly relevant settings from the master:

             name             |  setting
------------------------------+-----------
 archive_mode                 | on
 archive_timeout              | 60
 checkpoint_completion_target | 0.7
 checkpoint_segments          | 100
 checkpoint_timeout           | 600
 checkpoint_warning           | 30
 commit_delay                 | 0
 commit_siblings              | 5
 fsync                        | on
 full_page_writes             | on
 synchronous_commit           | on
 wal_buffers                  | 128
 wal_sync_method              | fdatasync
 wal_writer_delay             | 200
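
(For reference, the table above can be reproduced with a query against pg_settings along these lines, with the name list matching the rows shown:)

    SELECT name, setting
      FROM pg_settings
     WHERE name IN ('archive_mode', 'archive_timeout',
                    'checkpoint_completion_target', 'checkpoint_segments',
                    'checkpoint_timeout', 'checkpoint_warning',
                    'commit_delay', 'commit_siblings', 'fsync',
                    'full_page_writes', 'synchronous_commit', 'wal_buffers',
                    'wal_sync_method', 'wal_writer_delay')
     ORDER BY name;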

----
Relevant lines from the standby:

2009-11-25 04:03:52 UTC [6315]: [7478-1] LOG:  restored log file "000000020000064200000001" from archive
2009-11-25 04:03:58 UTC [6315]: [7479-1] LOG:  restored log file "000000020000064200000002" from archive
2009-11-25 04:44:09 UTC [6315]: [7480-1] LOG:  restored log file "000000020000064200000003" from archive
2009-11-25 04:45:22 UTC [6315]: [7481-1] LOG:  record with zero length at 642/3FFEB38
2009-11-25 04:45:22 UTC [6315]: [7482-1] LOG:  redo done at 642/3FFEAF0
2009-11-25 04:45:22 UTC [6315]: [7483-1] LOG:  last completed transaction was at log time 2009-11-25 04:04:03.067187+00
2009-11-25 04:45:22 UTC [6315]: [7484-1] LOG:  restored log file "000000020000064200000003" from archive
2009-11-25 04:45:28 UTC [6315]: [7485-1] LOG:  selected new timeline ID: 3
2009-11-25 04:45:28 UTC [6315]: [7486-1] LOG:  restored log file "00000002.history" from archive
2009-11-25 04:45:28 UTC [6315]: [7487-1] LOG:  archive recovery complete
2009-11-25 04:45:28 UTC [6315]: [7488-1] LOG:  checkpoint starting: shutdown immediate
2009-11-25 04:45:29 UTC [6315]: [7489-1] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 14 recycled; write=0.009 s, sync=0.010 s, total=0.404 s
2009-11-25 04:45:30 UTC [22614]: [1-1] LOG:  autovacuum launcher started
2009-11-25 04:45:30 UTC [6313]: [1-1] LOG: database system is ready to accept connections
2009-11-25 04:55:30 UTC [22612]: [1-1] LOG:  checkpoint starting: time
----
From the master:
aws-oi-db-master-11192009:/var/log/postgresql# date
Wed Nov 25 14:40:44 EST 2009
aws-oi-db-master-11192009:/var/log/postgresql# grep WARNING postgresql-2009-11-25.log
2009-11-25 00:00:34 EST [30401]: [126-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:01:36 EST [30401]: [133-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:02:34 EST [30401]: [140-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:03:34 EST [30401]: [147-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:04:34 EST [30401]: [154-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:05:34 EST [30401]: [161-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:06:34 EST [30401]: [168-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:07:34 EST [30401]: [175-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:08:34 EST [30401]: [182-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
2009-11-25 00:09:34 EST [30401]: [189-1] WARNING:  transaction log file "000000020000064200000003" could not be archived: too many failures
----
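
One thing I still intend to check (sketch below; the paths are the ones from the archive_command and restore_command above, and it assumes the backup server's /mnt/pg_wal_archive is what the standby sees as /db_data/pg_wal_archive) is whether the copy of 000000020000064200000003 that the standby restored was a complete 16 MB segment or a truncated one:

    # on the master: the local archive copy of the problem segment
    ls -l /mnt/pg_wal_archive/000000020000064200000003
    md5sum /mnt/pg_wal_archive/000000020000064200000003

    # on the standby: the copy recovery restored, if pg_standby -c
    # hasn't already cleaned it up
    ls -l /db_data/pg_wal_archive/000000020000064200000003
    md5sum /db_data/pg_wal_archive/000000020000064200000003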
Regards,

David
--
David Christensen
End Point Corporation
da...@endpoint.com




