[snip]On a nightly basis, we shut the database down and do a file system backup.
A short chronology of our database problem:
Question:
How can we tell that a database is intact? In the above example, pg_dumpall worked on the 8/21 database. Did it become corrupt between 8/21 and 8/23, or was it already corrupt and got worse? Pg_dumpall tells you nothing about the condition of indexes. Could a corrupt index corrupt data blocks?
I'm looking at doing a pg_dumpall on a weekly basis so that we have a point in time where we know we have a recoverable database. When the database reaches several hundred GB and over over a billion rows, this isn't a great solution, and doesn't address the overall database integrity.
Back to the original question... How can I verify the complete integrity of a database - especially a very large one where a reload or full index rebuild could take on the order of days?
You shouldn't have to verify anything. PG's job is to never corrupt your data, and providing your hardware is good it should do so. If you are getting problems almost daily that would suggest a RAM/disk problem to me (sig 11 usually implies RAM). Can't guarantee it's not PG but it's record of reliability is pretty good.
Steps I'd take:
1. Check your version number against the release notes and see if you should upgrade. You don't mention your version, but it's always worth having the last dot-release (7.2.5, 7.3.7, 7.4.5)
2. Schedule time to run memory/disk tests against your hardware. Finding 48 hours might not be easy, but you need to know where you stand.
3. Setup slony or some other replication so I can schedule my downtime.
-- Richard Huxton Archonet Ltd
---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend