Hi,

I am facing an unexpected behavior on a 9.2.2 cluster that I can
reproduce on current HEAD.

On a cluster with archive enabled but failing, after a crash of
postmaster, the checkpoint occurring before leaving the recovery mode
deletes any additional WALs, even those waiting to be archived.

Because of this, after recovering from the crash, previous PITR backup
can not be used to restore the instance to a time where archiving was
failing. Any slaves fed by WAL or lagging in SR need to be recreated.

AFAICT, this is not documented and I would expect the WALs to be
archived by the archiver process when the cluster exits the recovery step.

Here is a simple scenario to reproduce this.

Configuration:

  wal_level = archive
  archive_mode = on
  archive_command = '/bin/false'
  log_checkpoints = on

Scenario:
  createdb test
  psql -c 'create table test as select i, md5(i::text) from
generate_series(1,3000000) as i;' test
  kill -9 $(head -1 $PGDATA/postmaster.pid)
  pg_ctl start

Using this scenario, log files shows:

  LOG:  archive command failed with exit code 1
  DETAIL:  The failed archive command was: /bin/false
  WARNING:  transaction log file "000000010000000000000001" could not
be archived: too many failures
  LOG:  database system was interrupted; last known up at 2013-02-14
16:12:58 CET
  LOG:  database system was not properly shut down; automatic recovery
in progress
  LOG:  crash recovery starts in timeline 1 and has target timeline 1
  LOG:  redo starts at 0/11400078
  LOG:  record with zero length at 0/13397190
  LOG:  redo done at 0/13397160
  LOG:  last completed transaction was at log time 2013-02-14
16:12:58.49303+01
  LOG:  checkpoint starting: end-of-recovery immediate
  LOG:  checkpoint complete: wrote 2869 buffers (17.5%); 0 transaction
log file(s) added, 9 removed, 7 recycled; write=0.023 s, sync=0.468 s,
total=0.739 s; sync files=2, longest=0.426 s, average=0.234 s
  LOG:  autovacuum launcher started
  LOG:  database system is ready to accept connections
  LOG:  archive command failed with exit code 1
  DETAIL:  The failed archive command was: /bin/false
  LOG:  archive command failed with exit code 1
  DETAIL:  The failed archive command was: /bin/false
  LOG:  archive command failed with exit code 1
  DETAIL:  The failed archive command was: /bin/false
  WARNING:  transaction log file "000000010000000000000011" could not
be archived: too many failures

Before the kill, "000000010000000000000001" was the WAL to archive.
After the kill, the checkpoint deleted 9 files before exiting recovery
mode and "000000010000000000000011" become the first WAL to archive.
"000000010000000000000001" through "000000010000000000000010" were
removed or recycled.

Is it expected ?
-- 
Jehan-Guillaume de Rorthais
http://www.dalibo.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to