> the next obvious question is, what is
> causing the ZFS checksum errors?  And (possibly of
> some help in answering that question) is the disk
> seeing CRC transfer errors (which show up in its
> SMART data)?
> 
> The memory is ECC in this machine, and Memtest passed
> it for five
> days.  The disk was indeed getting some pretty lousy
> SMART scores,

Seagate ATA disks (if that's what you were using) are notorious for this in a 
couple of specific metrics:  they ship from the factory that way.  This does 
not appear to be indicative of any actual problem but rather of error 
tabulation, which they perform differently than other vendors do (e.g., I could 
imagine that they did something unusual in their burn-in exercising that 
generated nominal errors, but that's not even speculation, just a random guess).
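(For anyone wanting to check the transfer-error question directly: CRC errors on the cable usually show up as SMART attribute 199, commonly named UDMA_CRC_Error_Count, and unlike the Seagate oddities above that counter should start near zero.  A minimal sketch of pulling the raw value out of `smartctl -A` output - the sample excerpt below is illustrative, not real disk output, and attribute names vary a bit by vendor:

```python
import re

# Illustrative excerpt of `smartctl -A /dev/sdX` output; a real report
# has more columns and attributes, and the exact names vary by vendor.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   006    Pre-fail  0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   17
"""

def crc_error_count(smart_output):
    """Return the raw CRC error count, or None if the disk doesn't report one."""
    for line in smart_output.splitlines():
        if re.search(r"CRC[_ ]Error", line, re.IGNORECASE):
            # Raw value is the last column in smartctl's attribute table.
            return int(line.split()[-1])
    return None

print(crc_error_count(SAMPLE))  # 17 for the sample above
```

A nonzero and *growing* raw value there points at the cable or controller rather than the platters, which is exactly the distinction at issue here.)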

> but
> that doesn't explain the controller issue.  This
> particular controller
> is a SIIG-branded silicon image 0680 chipset (which
> is, apparently, a
> piece of junk - if I'd done my homework I would've
> bought something
> else)... but the premise stands.  I bought a piece of
> consumer-level
> hardware off the shelf, it had corruption issues, and
> ZFS told me
> about it when XFS had been silent.

Then we've been talking at cross-purposes.  Your original response was to my 
request for evidence that *platter errors that escape detection by the disk's 
ECC mechanisms* occurred sufficiently frequently to be a cause for concern - 
and that's why I asked specifically what was causing the errors you saw (to see 
whether they were in fact the kind for which I had requested evidence).

Not that detecting silent errors due to buggy firmware is useless:  it clearly 
saved you from continuing corruption in this case.  My impression is that in 
conventional consumer installations (typical consumers never crack open their 
case at all, let alone to add a RAID card) controller and disk firmware is 
sufficiently stable (especially for the limited set of functions demanded of 
it) that ZFS's added integrity checks may not count for a great deal (save 
perhaps peace of mind, but typical consumers aren't sufficiently aware of 
potential dangers to suffer from deficits in that area) - but your experience 
indicates that when you stray from that mold ZFS's added protection may 
sometimes be as significant as it was for Robert's mid-range array firmware 
bugs.
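(The mechanism behind that added protection is simple enough to sketch: keep a checksum *apart* from the data it covers, so a block that buggy firmware mangles in transit fails verification on read instead of being handed silently to the application.  This toy version uses sha256 and ignores ZFS's actual layout - the real thing stores checksums in parent block pointers up a tree - but it shows the principle:

```python
import hashlib

def write_block(data):
    """Store a block plus a checksum kept separately from the data.
    (ZFS keeps the checksum in the parent block pointer, not here.)"""
    return data, hashlib.sha256(data).digest()

def read_block(data, checksum):
    """Verify on read; raise rather than silently return bad data."""
    if hashlib.sha256(data).digest() != checksum:
        raise IOError("checksum mismatch: silent corruption detected")
    return data

block, csum = write_block(b"important payload")

# Suppose a flaky controller flips some bits on the way through...
corrupted = b"importent payload"

try:
    read_block(corrupted, csum)
except IOError as e:
    print(e)  # caught on read instead of propagating to the application
```

A plain block-device stack has no equivalent end-to-end check, which is why the 0680's corruption was visible to ZFS but invisible to XFS.)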

And since there indeed was a RAID card involved in the original hypothetical 
situation under discussion, the fact that I was specifically referring to 
undetectable *disk* errors was only implied by my subsequent discussion of disk 
error rates, rather than explicit.

The bottom line appears to be that introducing non-standard components into the 
path between RAM and disk has, at least for some specific subset of those 
components, the potential to introduce silent errors of the form that ZFS can 
catch - quite possibly in considerably greater numbers than the kinds of 
undetected disk errors that I was talking about ever would (that RAID card you 
were using has a relatively popular low-end chipset, and Robert's mid-range 
arrays were hardly fly-by-night).  So while I'm still not convinced that ZFS 
offers significant features in the reliability area compared with other 
open-source *software* solutions, the evidence that it may do so in more 
sophisticated (but not quite high-end) hardware environments is becoming more 
persuasive.

- bill
 
 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
