Re: [HACKERS] Page Checksums

Kevin Grittner Mon, 19 Dec 2011 15:14:59 -0800

Greg Smith <g...@2ndquadrant.com> wrote:

> But if you need all that infrastructure just to get the feature 
> launched, that's a bit hard to stomach.

Triggering a vacuum or some hypothetical "scrubbing" feature?

> Also, as someone who follows Murphy's Law as my chosen religion,

If you don't think I pay attention to Murphy's Law, I should recap
our backup procedures -- which involve three separate forms of
backup, each to multiple servers in different buildings, real-time,
plus idle-time comparison of the databases of origin to all replicas
with reporting of any discrepancies.  And off-line "snapshot"
backups on disk at a records center controlled by a different
department.  That's in addition to RAID redundancy and hardware
health and performance monitoring.  Some people think I border on
the paranoid on this issue.

> I would expect this situation could be exactly how flaky hardware
> would first manifest itself:  server crash and a bad CRC on the
> last thing written out.  And in that case, the last thing you want
> to do is assume things are fine, then kick off a VACUUM that might
> overwrite more good data with bad.

Are you arguing that autovacuum should be disabled after crash
recovery?  I guess if you are arguing that a database VACUUM might
destroy recoverable data when hardware starts to fail, I can't
argue.  And certainly there are way too many people who don't ensure
that they have a good backup before firing up PostgreSQL after a
failure, so I can see not making autovacuum more aggressive than
usual, and perhaps even disabling it until there is some sort of
confirmation (I have no idea how) that a backup has been made.  That
said, a database VACUUM would be one of my first steps after
ensuring that I had a copy of the data directory tree, personally.
I guess I could even live with that as recommended procedure rather
than something triggered through autovacuum and not feel that the
rest of my posts on this are too far off track.

> The main way I expect to validate this sort of thing is with an as
> yet unwritten function to grab information about a data block from
> a standby server for this purpose, something like this:
> 
> Master:  Computed CRC A, Stored CRC B; error raised because A!=B
> Standby:  Computed CRC C, Stored CRC D
> 
> If C==D && A==C, the corruption is probably overwritten bits of
> the CRC B.

Are you arguing we need *that* infrastructure to get the feature
launched?
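
For concreteness, the check quoted above amounts to roughly the
sketch below.  The names (BlockCrc, diagnose) and the result strings
are hypothetical, and no function exists yet to fetch the standby's
values.  Only the one case Greg spells out is classified; anything
else is left undetermined rather than guessed at.

```python
from dataclasses import dataclass


@dataclass
class BlockCrc:
    """CRC information for one data block on one server (hypothetical type)."""
    computed: int  # CRC recomputed from the page contents as read
    stored: int    # CRC value stored in the page itself


def diagnose(master: BlockCrc, standby: BlockCrc) -> str:
    """Classify a block by comparing master and standby CRCs.

    A/B are the master's computed/stored CRCs and C/D are the standby's,
    following the labels in the quoted message.
    """
    a, b = master.computed, master.stored
    c, d = standby.computed, standby.stored

    if a == b:
        # No error would have been raised on the master at all.
        return "master-ok"
    if c == d and a == c:
        # The standby is self-consistent and the page contents match it,
        # so the damage is probably in the stored CRC B itself.
        return "stored-crc-damaged"
    # The quoted message does not say how to interpret other outcomes.
    return "undetermined"
```

With A=0x1234, B=0x9999 on the master and C=D=0x1234 on the standby,
diagnose() reports "stored-crc-damaged".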

-Kevin

