> the next obvious question is, what is causing the ZFS
> checksum errors? And (possibly of some help in answering
> that question) is the disk seeing CRC transfer errors
> (which show up in its SMART data)?
>
> The memory is ECC in this machine, and Memtest passed it
> for five days. The disk was indeed getting some pretty
> lousy SMART scores,
Seagate ATA disks (if that's what you were using) are notorious for this in a couple of specific metrics: they ship from the factory that way. This does not appear to be indicative of any actual problem but rather of error tabulation which they perform differently than other vendors do (e.g., I could imagine that they did something unusual in their burn-in exercising that generated nominal errors, but that's not even speculation, just a random guess).

> but that doesn't explain the controller issue. This
> particular controller is a SIIG-branded Silicon Image 0680
> chipset (which is, apparently, a piece of junk - if I'd
> done my homework I would've bought something else)... but
> the premise stands. I bought a piece of consumer-level
> hardware off the shelf, it had corruption issues, and ZFS
> told me about it when XFS had been silent.

Then we've been talking at cross-purposes. Your original response was to my request for evidence that *platter errors that escape detection by the disk's ECC mechanisms* occur sufficiently frequently to be a cause for concern - and that's why I asked specifically what was causing the errors you saw (to see whether they were in fact the kind for which I had requested evidence). Not that detecting silent errors due to buggy firmware is useless: it clearly saved you from continuing corruption in this case.
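The reason ZFS caught what XFS missed is worth making concrete: ZFS computes its checksum in the filesystem layer, above the controller, so corruption introduced anywhere lower in the path is detected on read. Here is a minimal toy sketch of that end-to-end principle (not ZFS's actual implementation - the names and the single-bit-flip "controller" are illustrative assumptions):

```python
import hashlib

# Toy model of end-to-end checksumming as ZFS does it: the checksum is
# computed in the filesystem layer, above the controller, so corruption
# introduced anywhere below (controller firmware, cable, platter) is
# caught when the block is read back.

def write_block(data: bytes):
    # ZFS keeps the checksum in the parent block pointer, separate from
    # the data block itself; here we just return the pair.
    return data, hashlib.sha256(data).digest()

def buggy_controller(data: bytes) -> bytes:
    # Hypothetical flaky controller: silently flips one bit in transit.
    corrupted = bytearray(data)
    corrupted[0] ^= 0x01
    return bytes(corrupted)

def read_block(data: bytes, checksum: bytes) -> bool:
    # A non-checksumming filesystem (XFS in this thread) would hand the
    # data back unchecked; an end-to-end check compares first.
    return hashlib.sha256(data).digest() == checksum

data, checksum = write_block(b"important payload")
print(read_block(data, checksum))                    # True: intact
print(read_block(buggy_controller(data), checksum))  # False: corruption caught
```

The point of the sketch is only the layering: because the verification happens above every hardware component in the I/O path, it cannot be fooled by any of them.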
My impression is that in conventional consumer installations (typical consumers never crack open their case at all, let alone to add a RAID card) controller and disk firmware is sufficiently stable (especially for the limited set of functions demanded of it) that ZFS's added integrity checks may not count for a great deal (save perhaps peace of mind, but typical consumers aren't sufficiently aware of potential dangers to suffer from deficits in that area) - but your experience indicates that when you stray from that mold ZFS's added protection may sometimes be as significant as it was for Robert's mid-range array firmware bugs.

And since there indeed was a RAID card involved in the original hypothetical situation under discussion, the fact that I was specifically referring to undetectable *disk* errors was only implied by my subsequent discussion of disk error rates, rather than explicit.

The bottom line appears to be that introducing non-standard components into the path between RAM and disk has, at least for some specific subset of those components, the potential to introduce silent errors of the form that ZFS can catch - quite possibly in considerably greater numbers than the kinds of undetected disk errors that I was talking about ever would (that RAID card you were using has a relatively popular low-end chipset, and Robert's mid-range arrays were hardly fly-by-night).

So while I'm still not convinced that ZFS offers significant features in the reliability area compared with other open-source *software* solutions, the evidence that it may do so in more sophisticated (but not quite high-end) hardware environments is becoming more persuasive.

- bill

This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss