I have a problem with the 2nd stage of the online recovery procedure with PITR. Data written during the second stage, after the base backup and before the WAL switch, gets lost.
I've managed to reproduce the problem with PostgreSQL alone, without pgpool-II running:

- stop failed node
//1st stage
- start backup
- rsync files to failed node
- stop backup
- do intentional insert in master node
//2nd stage
- do pg_switch_xlog (also tested pgpool_switch_xlog, with the same results)
- rsync archive WAL files to failed node
- start failed node

The failed node starts fine and performs recovery, but for the last WAL file it always reports an "invalid record length" error and falls back to the last known good WAL file (the one created in the backup step).

Log from the failed node during the restore (increasing verbosity reveals no more information):

[2011-09-13 12:42:58 CEST]-[]-[31877|] LOG: database system was interrupted; last known up at 2011-09-13 12:40:46 CEST
[2011-09-13 12:42:58 CEST]-[]-[31877|] LOG: creating missing WAL directory "pg_xlog/archive_status"
[2011-09-13 12:42:58 CEST]-[]-[31877|] LOG: starting archive recovery
[2011-09-13 12:42:58 CEST]-[postgres]-[31882|] FATAL: the database system is starting up
[2011-09-13 12:42:59 CEST]-[]-[31877|] LOG: restored log file "000000020000000100000020" from archive
[2011-09-13 12:42:59 CEST]-[]-[31877|] LOG: redo starts at 1/20000078
[2011-09-13 12:42:59 CEST]-[]-[31877|] LOG: consistent recovery state reached at 1/21000000
[2011-09-13 12:42:59 CEST]-[postgres]-[31886|] FATAL: the database system is starting up
[2011-09-13 12:43:00 CEST]-[postgres]-[31887|] FATAL: the database system is starting up
[2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: restored log file "000000020000000100000021" from archive
[2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: invalid record length at 1/21000020
[2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: redo done at 1/200000A0
[2011-09-13 12:43:01 CEST]-[postgres]-[31890|] FATAL: the database system is starting up
[2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: restored log file "000000020000000100000020" from archive
[2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: selected new timeline ID: 3
[2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: archive recovery complete
[2011-09-13 12:43:01 CEST]-[]-[31883|] LOG: checkpoint starting: end-of-recovery immediate wait
[2011-09-13 12:43:02 CEST]-[]-[31883|] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.659 s
[2011-09-13 12:43:02 CEST]-[]-[31876|] LOG: database system is ready to accept connections
[2011-09-13 12:43:02 CEST]-[]-[31896|] LOG: autovacuum launcher started

I ran md5sum on the 000000020000000100000021 WAL file in the archive dir on both the master and the target node, and the file is identical on both nodes.

So my questions are:

- Did I miss something, or did I get the procedure wrong?
- Is the online recovery with PITR procedure still valid as presented in the manual?
- Can I replace the pg_switch_xlog call with another pg_start_backup and pg_stop_backup call, and what are the performance implications in that case?

Software versions:

I'm using PostgreSQL 9.0.4 on both nodes, with the same OS.

Restore master:
Linux miho 3.0-ARCH #1 SMP PREEMPT Wed Aug 17 21:55:57 CEST 2011 x86_64 Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz GenuineIntel GNU/Linux

Restore target:
Linux alice 3.0-ARCH #1 SMP PREEMPT Wed Aug 17 21:55:57 CEST 2011 x86_64 Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz GenuineIntel GNU/Linux

Thanks for help.

Nikola

_______________________________________________
Pgpool-general mailing list
Pgpool-general@pgfoundry.org
http://pgfoundry.org/mailman/listinfo/pgpool-general
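
As a cross-check of the log positions quoted above, here is a minimal Python sketch of how a position such as 1/21000020 maps to a WAL segment file name. It assumes the default 16 MB segment size and the timeline/log/segment file naming (three 8-digit hex fields) used by 9.0; the helper name lsn_to_wal_file is made up for illustration and is not part of PostgreSQL or pgpool-II.

```python
# Assumption: default WAL segment size of 16 MB (a compile-time setting).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def lsn_to_wal_file(timeline, lsn, segment_size=WAL_SEGMENT_SIZE):
    """Map a log position like '1/21000020' to (segment file name, offset).

    Hypothetical helper for illustration only.
    """
    log_hex, offset_hex = lsn.split("/")
    log_id = int(log_hex, 16)          # part before the slash
    byte_offset = int(offset_hex, 16)  # byte offset within that log id
    segment = byte_offset // segment_size
    # File name is timeline, log id and segment, each as 8 hex digits.
    name = "%08X%08X%08X" % (timeline, log_id, segment)
    return name, byte_offset % segment_size


if __name__ == "__main__":
    # Positions taken from the recovery log, timeline 2.
    for lsn in ("1/200000A0", "1/21000020"):
        name, offset = lsn_to_wal_file(2, lsn)
        print(lsn, "->", name, "offset", hex(offset))
```

Under these assumptions, 1/200000A0 ("redo done") falls in 000000020000000100000020 and 1/21000020 ("invalid record length") falls in 000000020000000100000021 at offset 0x20, i.e. right near the start of the last restored segment.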