I realize Simon relented on this, but FWIW...

On 3/16/13 4:02 PM, Simon Riggs wrote:
> Most other data we store doesn't consist of
> large runs of 0x00 or 0xFF as data. Most data is more complex than
> that, so any runs of 0s or 1s written to the block will be detected.
...

It's not that uncommon for folks to have tables with a bunch of
int2/int4/int8 columns in a row, and I'd bet it's not uncommon for a lot
of those fields to be zero.
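
To make that concrete, here's a rough sketch (plain Python, with zlib.crc32 standing
in for whatever page checksum we end up with, and a made-up page layout) of the case
I'm worried about: a page whose int8 columns are almost all zero gives a "look for
suspicious runs of 0s" check nothing to flag, while a checksum taken at write time
still notices when corruption smears zeros over a field that used to hold something:

    import struct, zlib

    # Fake 8KB heap page: 1024 int8 columns, almost all of them zero --
    # the kind of table described above.
    values = [0] * 1024
    values[3] = 42            # a handful of non-zero fields
    values[700] = 7
    page = struct.pack('<1024q', *values)

    stored_crc = zlib.crc32(page)       # checksum computed when the page was written

    # Corruption writes a run of 0x00 over part of the page.  The result still
    # looks like a perfectly plausible page of zero-valued ints, so a "detect
    # runs of 0s" heuristic sees nothing unusual...
    corrupt = bytearray(page)
    corrupt[700 * 8 : 702 * 8] = b'\x00' * 16   # wipes out values[700]

    assert bytes(corrupt) != page                       # the data really did change
    assert zlib.crc32(bytes(corrupt)) != stored_crc     # ...but the checksum catches it

(zlib.crc32 here is just an illustration; the point is only that a stored checksum sees
the change when the page contents can't speak for themselves.)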

> Checksums are for detecting problems. What kind of problems? Sporadic
> changes of bits? Or repeated errors. If we were trying to trap
> isolated bit changes then CRC-32 would be appropriate. But I'm
> assuming that whatever causes the problem is going to recur,

That's opposite to my experience. When we've had corruption events, we normally see 
one to several blocks with problems show up essentially all at once. Of course we can't 
prove that all the corruption happened at exactly the same time, but I believe it's a 
strong possibility. If it wasn't exactly the same time, it was certainly over a span of 
minutes to hours... *but* we've never seen new corruption occur after we start an 
investigation (and we frequently wait several hours for the next time we can take an outage 
without incurring a huge loss in revenue). That we would run for a number of hours with 
no additional corruption leads me to believe that whatever caused the corruption was 
essentially a "one-time" [1] event.

[1] One-time except for the fact that there were several periods where corruption would 
recur at 6- or 12-month intervals.

