On Sun, Jan 26, 2014 at 5:45 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>
>> We're also seeing log entries about "wal contains reference to invalid
>> pages" but these errors seem only vaguely correlated. Sometimes we get
>> the errors but the tables don't grow noticeably and sometimes we don't
>> get the errors and the tables are much larger.
>
> Uhm. I am a bit confused. You see those in the standby's log? At !debug
> log levels? That'd imply that the standby is dead and needed to be
> recloned, no? How do you continue after that?

So in chatting with Heikki last night we came up with a scenario where this check is insufficient. If there are multiple checkpoints during the base backup then there will be restartpoints during recovery. If the reference to the invalid page comes before a restartpoint, then after recovery crashes and comes back up it resumes from that restartpoint, never sees the reference again, and goes forward fine.

Fixing this check doesn't look trivial. I'm inclined to suppress any restartpoints while there are references to invalid pages in the hash. The problem is that this prevents trimming the xlog during recovery. It seems frightening that most days recovery will take little extra space, but if you happen to have a drop table or truncate during the base backup then your recovery might require a lot of extra space.

The alternative of spilling the hash table to disk at every restartpoint seems kind of hokey. Then we need to worry about fsyncing this file, cleaning it up, dealing with the file after crashes, etc.

--
greg
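
To make the scenario concrete, here is a toy model of it in Python. This is only a sketch and not PostgreSQL code: the record kinds, `replay`, and `recover_with_crash` are all made up for illustration, and the "hash" is just an in-memory set that is lost on a crash. It assumes recovery restarts from the last restartpoint taken before the crash.

```python
# Hypothetical record kinds for a toy WAL stream (not PostgreSQL's real ones).
REF_MISSING_PAGE = "ref_missing_page"  # replay touches a page that does not exist
CHECKPOINT = "checkpoint"              # a place where a restartpoint may be taken
OTHER = "other"


def replay(wal, start, stop, suppress_restartpoints):
    """Replay wal[start:stop]; return (unresolved references, last restartpoint)."""
    invalid_pages = set()  # in-memory only, so it starts empty after a crash
    restartpoint = start
    for pos in range(start, stop):
        if wal[pos] == REF_MISSING_PAGE:
            invalid_pages.add(pos)
        elif wal[pos] == CHECKPOINT:
            # Proposed fix: skip the restartpoint while the hash is non-empty.
            if not (suppress_restartpoints and invalid_pages):
                restartpoint = pos
    return invalid_pages, restartpoint


def recover_with_crash(wal, crash_at, suppress_restartpoints):
    """Crash once at crash_at, restart, and run the end-of-recovery check.

    Returns (check_passes, position recovery restarted from).
    """
    _, restart_from = replay(wal, 0, crash_at, suppress_restartpoints)
    invalid_pages, _ = replay(wal, restart_from, len(wal), suppress_restartpoints)
    return len(invalid_pages) == 0, restart_from


# The invalid-page reference (position 1) comes before a checkpoint (position 2).
WAL = [OTHER, REF_MISSING_PAGE, CHECKPOINT, OTHER, OTHER]

current_behaviour = recover_with_crash(WAL, crash_at=4, suppress_restartpoints=False)
with_suppression = recover_with_crash(WAL, crash_at=4, suppress_restartpoints=True)
```

In this model `current_behaviour` is `(True, 2)`: the check passes even though the reference was never resolved, because the restart skipped over it. `with_suppression` is `(False, 0)`: the check catches the problem, but recovery had to restart from position 0, which is the extra xlog retention I was worried about.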