Hi,

I am seeing unexpected behavior on a 9.2.2 cluster, and I can reproduce it on current HEAD.
On a cluster with archiving enabled but failing, after a crash of the postmaster, the checkpoint that occurs before leaving recovery mode deletes any additional WALs, even those waiting to be archived.

Because of this, after recovering from the crash, a previous PITR backup cannot be used to restore the instance to a point in time at which archiving was failing. Any slaves fed by WAL or lagging in SR need to be recreated.

AFAICT, this is not documented, and I would expect the WALs to be archived by the archiver process when the cluster exits the recovery step.

Here is a simple scenario to reproduce this.

Configuration:

  wal_level = archive
  archive_mode = on
  archive_command = '/bin/false'
  log_checkpoints = on

Scenario:

  createdb test
  psql -c 'create table test as select i, md5(i::text) from generate_series(1,3000000) as i;' test
  kill -9 $(head -1 $PGDATA/postmaster.pid)
  pg_ctl start

Using this scenario, the log file shows:

  LOG: archive command failed with exit code 1
  DETAIL: The failed archive command was: /bin/false
  WARNING: transaction log file "000000010000000000000001" could not be archived: too many failures
  LOG: database system was interrupted; last known up at 2013-02-14 16:12:58 CET
  LOG: database system was not properly shut down; automatic recovery in progress
  LOG: crash recovery starts in timeline 1 and has target timeline 1
  LOG: redo starts at 0/11400078
  LOG: record with zero length at 0/13397190
  LOG: redo done at 0/13397160
  LOG: last completed transaction was at log time 2013-02-14 16:12:58.49303+01
  LOG: checkpoint starting: end-of-recovery immediate
  LOG: checkpoint complete: wrote 2869 buffers (17.5%); 0 transaction log file(s) added, 9 removed, 7 recycled; write=0.023 s, sync=0.468 s, total=0.739 s; sync files=2, longest=0.426 s, average=0.234 s
  LOG: autovacuum launcher started
  LOG: database system is ready to accept connections
  LOG: archive command failed with exit code 1
  DETAIL: The failed archive command was: /bin/false
  LOG: archive command failed with exit code 1
  DETAIL: The failed archive command was: /bin/false
  LOG: archive command failed with exit code 1
  DETAIL: The failed archive command was: /bin/false
  WARNING: transaction log file "000000010000000000000011" could not be archived: too many failures

Before the kill, "000000010000000000000001" was the WAL to archive.

After the kill, the checkpoint deleted 9 files before exiting recovery mode, and "000000010000000000000011" became the first WAL to archive. "000000010000000000000001" through "000000010000000000000010" were removed or recycled.

Is this expected?

-- 
Jehan-Guillaume de Rorthais
http://www.dalibo.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
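
For anyone reproducing the scenario above, here is a minimal sketch for listing the segments the archiver still considers pending. It assumes the 9.x on-disk layout, where each segment waiting to be archived has a ".ready" marker file under $PGDATA/pg_xlog/archive_status. The function name list_ready is only an illustrative helper, not something shipped with PostgreSQL.

```shell
#!/bin/sh
# Print the names of WAL segments that still have a ".ready" marker,
# i.e. segments the archiver has not yet archived successfully.
# Assumption: 9.x layout, markers live in <datadir>/pg_xlog/archive_status.
# list_ready is a hypothetical helper name used for illustration only.
list_ready() {
    dir="$1/pg_xlog/archive_status"
    for f in "$dir"/*.ready; do
        # Skip the unexpanded glob when no marker file exists.
        [ -e "$f" ] || continue
        basename "$f" .ready
    done | sort
}
```

Running `list_ready "$PGDATA"` once before the kill and once after the restart should make it possible to compare the pending segments on both sides of the crash recovery.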