The standby log:

2010-11-14 17:40:16 MST - 887 -LOG: database system was shut down in recovery at 2010-11-14 17:40:10 MST
2010-11-14 17:40:16 MST - 887 -LOG: entering standby mode
2010-11-14 17:40:16 MST - 887 -LOG: consistent recovery state reached at 3/3988FF8
2010-11-14 17:40:16 MST - 887 -LOG: redo starts at 3/3988F68
2010-11-14 17:40:16 MST - 887 -LOG: invalid record length at 3/3988FF8
2010-11-14 17:40:16 MST - 885 -LOG: database system is ready to accept read only connections
2010-11-14 17:40:16 MST - 890 -LOG: streaming replication successfully connected to primary
2010-11-15 02:24:26 MST - 890 -FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000300000004 has already been removed
2010-11-15 02:24:26 MST - 887 -LOG: unexpected pageaddr 2/B9BF2000 in log file 3, segment 4, offset 12525568
2010-11-15 02:24:27 MST - 2790 -LOG: streaming replication successfully connected to primary
2010-11-15 02:24:27 MST - 2790 -FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000300000004 has already been removed
2010-11-15 02:24:32 MST - 2791 -LOG: streaming replication successfully connected to primary
2010-11-15 02:24:32 MST - 2791 -FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000300000004 has already been removed
...
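Incidentally, for anyone who wants to watch for this developing, the master and standby WAL positions can be compared with the built-in functions. A minimal check, using the 9.0 function names:

    -- on the master: current WAL write position
    SELECT pg_current_xlog_location();

    -- on the standby: last WAL received from the stream, and last replayed
    SELECT pg_last_xlog_receive_location();
    SELECT pg_last_xlog_replay_location();

Comparing the two positions shows how far the standby has fallen behind; once the gap exceeds wal_keep_segments worth of WAL, you get exactly the failure above.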
Now, the standby is geographically separated from the master and connects over an internet link, so it's not a shock that, with a large enough update and wal_keep_segments not large enough, disk speed would outrun network speed by enough for this to happen.

But as far as I know there was almost no write activity at 2am: no active users at all, no batch processing. There is a pg_dumpall that kicks off at 2am, and these errors start at about the same time it finished. I also did the original sync and standby launch immediately after a mass update, before autovacuum had a chance to run, so at some point there would have been a lot of tuples marked dead.

wal_keep_segments was at 64, checkpoint_segments was 16, and the first segment still around was 000000010000000300000010. In the midst of the pg_dumpall the master logs do show messages about checkpoints occurring too frequently. The 70-ish log segments still around show mod times right around 2:23, progressing a second or so each, whereas they were created over a much longer period going back to the day before.

One question: what happened here? Why were log files created the day before updated?

One suggestion: would it be possible to not delete WAL segments that are still needed by a currently attached standby? (In the meantime, a sketch of the workaround I'm considering is in the P.S. below.)

-- 
Scott Ribe
scott_r...@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
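P.S. The workaround I'm considering: raise wal_keep_segments and also archive WAL, so the standby can fall back to fetching archived segments when streaming has already recycled them. The values below are illustrative only, and the archive path and hostname are hypothetical; the archive would of course have to be reachable from the standby:

    # postgresql.conf on the master (illustrative values)
    wal_keep_segments = 256      # was 64; at 16MB per segment this pins ~4GB of WAL
    checkpoint_segments = 16     # unchanged
    archive_mode = on            # copy each segment aside before it can be recycled
    archive_command = 'cp %p /wal_archive/%f'

    # recovery.conf on the standby (hypothetical host/path)
    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com'
    restore_command = 'cp /wal_archive/%f %p'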