Re: [GENERAL] Disk corruption detection

2006-06-12 Thread Lincoln Yeoh

At 07:42 PM 6/11/2006 +0200, Florian Weimer wrote:


> We recently had a partially failed disk in a RAID-1 configuration
> which did not perform a write operation as requested.  Consequently,


What RAID1 config/hardware/software was this?

Could be good to know...

Regards,
Link.


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] Disk corruption detection

2006-06-12 Thread Jim C. Nasby
On Sun, Jun 11, 2006 at 07:42:55PM +0200, Florian Weimer wrote:
> We recently had a partially failed disk in a RAID-1 configuration
> which did not perform a write operation as requested.  Consequently,
> the mirrored disks had different contents, and the file which
> contained the block switched randomly between two copies, depending on
> which disk had been read.  (In theory, it is possible to read always
> from both disks, but this is not what RAID-1 configurations normally
> do.)

Actually, every RAID1 I've ever used will read from both to try and
balance out the load.
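
A toy model may make the symptom concrete. The sketch below is not how any particular controller works: the Raid1 class, its round-robin read policy, and the "failing" parameter are all invented for illustration. It only shows why a block can appear to flip between two versions once one mirror has silently dropped a write.

```python
import itertools


class Raid1:
    """Toy two-way mirror; read policy and API are assumptions, not a real driver."""

    def __init__(self, n_disks=2):
        # Each disk is modelled as a dict mapping block number to block contents.
        self.disks = [dict() for _ in range(n_disks)]
        # Assumed load-balancing policy: alternate reads between the disks.
        self._next = itertools.cycle(range(n_disks))

    def write(self, block_no, data, failing=()):
        # Disks listed in `failing` silently drop the write and report no error.
        for i, disk in enumerate(self.disks):
            if i not in failing:
                disk[block_no] = data

    def read(self, block_no):
        # Only one mirror is consulted per read; the copies are never compared.
        return self.disks[next(self._next)][block_no]


array = Raid1()
array.write(0, b"v1")               # both mirrors store v1
array.write(0, b"v2", failing={1})  # disk 1 silently keeps v1
reads = [array.read(0) for _ in range(4)]
print(reads)  # the same block alternates between the new and the stale version
```

Real controllers pick the mirror by queue depth, head position, or other heuristics, so in practice the flipping looks random rather than strictly alternating.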

> Anyway, how would be the chances for PostgreSQL to detect such a
> corruption on a heap or index data file?  It's typically hard to
> detect this at the application level, so I don't expect wonders.  I'm
> just curious if using PostgreSQL would have helped to catch this
> sooner.

I know that WAL pages are (or at least were) CRC'd, because there was
extensive discussion around 32 bit vs 64 bit CRCs. There is no such
check for data pages, although PostgreSQL has other ways to detect
errors. But in a nutshell, if you care about your data, buy hardware you
can trust.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [GENERAL] Disk corruption detection

2006-06-12 Thread Florian Weimer
* Lincoln Yeoh:

> At 07:42 PM 6/11/2006 +0200, Florian Weimer wrote:
>
>> We recently had a partially failed disk in a RAID-1 configuration
>> which did not perform a write operation as requested.  Consequently,
>
> What RAID1 config/hardware/software was this?

I would expect that any RAID-1 controller works in this mode by
default.  It's an analogy to RAID-5: In that case, you clearly can't
verify the parity bits on read for performance reasons.  So why do it
for RAID-1?

(If there is a controller which offers compare-on-read for RAID-1, I
would like to know its name. 8-)



Re: [GENERAL] Disk corruption detection

2006-06-12 Thread Florian Weimer
* Jim C. Nasby:

>> Anyway, how would be the chances for PostgreSQL to detect such a
>> corruption on a heap or index data file?  It's typically hard to
>> detect this at the application level, so I don't expect wonders.  I'm
>> just curious if using PostgreSQL would have helped to catch this
>> sooner.
>
> I know that WAL pages are (or at least were) CRC'd, because there was
> extensive discussion around 32 bit vs 64 bit CRCs.

CRCs wouldn't help because the out-of-date copy has got a correct CRC.
That's why it's so hard to detect this problem at the application
level.  Putting redundancy into rows doesn't help, for instance.
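
To illustrate the point, here is a minimal sketch using Python's zlib.crc32. The page layout (a padded payload followed by a 4-byte CRC), PAGE_SIZE, and the helper names make_page and crc_ok are made up for this example and are not PostgreSQL's on-disk format. A per-page checksum catches a flipped bit, but a stale page was written correctly at some earlier time, so its checksum still matches.

```python
import zlib

PAGE_SIZE = 16  # hypothetical page size, tiny for readability


def make_page(payload: bytes) -> bytes:
    """Pad the payload to PAGE_SIZE and append a 4-byte CRC-32 of the body."""
    body = payload.ljust(PAGE_SIZE, b"\0")
    return body + zlib.crc32(body).to_bytes(4, "big")


def crc_ok(page: bytes) -> bool:
    """Recompute the CRC-32 over the body and compare it with the stored value."""
    body, stored = page[:-4], page[-4:]
    return zlib.crc32(body).to_bytes(4, "big") == stored


old_page = make_page(b"balance=100")
new_page = make_page(b"balance=250")

# Disk A performed the write; disk B silently dropped it and still holds the old page.
disk_a = new_page
disk_b = old_page

# A genuinely damaged page: one bit of the new page flipped.
damaged = bytearray(new_page)
damaged[0] ^= 0x01

print(crc_ok(disk_a), crc_ok(disk_b), crc_ok(bytes(damaged)))
```

Both mirrors pass the check even though they disagree; only the bit-flipped page fails. Detecting the stale copy needs information from outside the page, such as the other mirror.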

> There is no such check for data pages, although PostgreSQL has other
> ways to detect errors. But in a nutshell, if you care about your
> data, buy hardware you can trust.

All hardware can fail. 8-/

AFAIK, compare-on-read is the recommended measure to compensate for this
kind of failure.  (The traditional recommendation also includes three
disks, so that you've got a tie-breaker.)  It seems to me that
PostgreSQL's MVCC-related "don't directly overwrite data rows" policy
might help to expose this sooner than with direct B-tree updates.
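
A rough sketch of the tie-breaker idea, under the assumption that each mirror's copy of a block is available as a bytes value. The function name compare_on_read and its return convention are invented here and do not correspond to any real controller or driver API.

```python
from collections import Counter
from typing import Optional, Sequence


def compare_on_read(copies: Sequence[bytes]) -> Optional[bytes]:
    """Return the block that a strict majority of mirrors agree on.

    Returns None when there is no strict majority, i.e. the mismatch is
    detected but cannot be resolved from the copies alone.
    """
    if not copies:
        return None
    block, votes = Counter(copies).most_common(1)[0]
    return block if votes * 2 > len(copies) else None


# Two mirrors that disagree: the mismatch is noticed, but nothing breaks the tie.
two_way = compare_on_read([b"new", b"old"])

# Three mirrors: the two matching copies outvote the stale one.
three_way = compare_on_read([b"new", b"old", b"new"])

print(two_way, three_way)
```

Note that a majority is only a heuristic: if two of three disks dropped the same write, the vote would settle on the stale block.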

In this particular case, we would have avoided the failure if we had
properly monitored the disk subsystem (the failure was gradual).
Fortunately, it was just a test system, but it got me worried a bit.



Re: [GENERAL] Disk corruption detection

2006-06-12 Thread Jim C. Nasby
On Mon, Jun 12, 2006 at 07:55:22PM +0200, Florian Weimer wrote:
> * Jim C. Nasby:
>
> >> Anyway, how would be the chances for PostgreSQL to detect such a
> >> corruption on a heap or index data file?  It's typically hard to
> >> detect this at the application level, so I don't expect wonders.  I'm
> >> just curious if using PostgreSQL would have helped to catch this
> >> sooner.
>
> > I know that WAL pages are (or at least were) CRC'd, because there was
> > extensive discussion around 32 bit vs 64 bit CRCs.
>
> CRCs wouldn't help because the out-of-date copy has got a correct CRC.
> That's why it's so hard to detect this problem at the application
> level.  Putting redundancy into rows doesn't help, for instance.
>
> > There is no such check for data pages, although PostgreSQL has other
> > ways to detect errors. But in a nutshell, if you care about your
> > data, buy hardware you can trust.
>
> All hardware can fail. 8-/

I'd argue that any RAID controller that carries on without degrading the
array even though it's getting write errors isn't worth the fiberglass
the components are soldered to. The same goes for a hard disk that can't
write something and doesn't throw an error back up the chain.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [GENERAL] Disk corruption detection

2006-06-11 Thread Qingqing Zhou

Florian Weimer [EMAIL PROTECTED] wrote

> Anyway, how would be the chances for PostgreSQL to detect such a
> corruption on a heap or index data file?  It's typically hard to
> detect this at the application level, so I don't expect wonders.  I'm
> just curious if using PostgreSQL would have helped to catch this
> sooner.


PostgreSQL will only detect this kind of corruption when it reads the
affected heap or index page. So a safe approach is to dump and restore your
database if you suspect that some inconsistency has crept in.

Regards,
Qingqing


