Well, I guess we're going to remain stuck in this sub-topic for a bit longer:
> > The vast majority of what ZFS can detect (save for *extremely* rare
> > undetectable bit-rot and for real hardware (path-related) errors that
> > studies like CERN's have found to be very rare - and you have yet to
> > provide even anecdotal evidence to the contrary)
>
> You wanted anecdotal evidence:

To be accurate, the above was not a solicitation for just any kind of anecdotal evidence but for anecdotal evidence that specifically contradicted the notion that otherwise undetected path-related hardware errors are 'very rare'.

> During my personal experience with only two home machines, ZFS has helped
> me detect corruption at least three times in a period of a few months.
>
> One due to silent corruption due to a controller bug (and a driver that
> did not work around it).

If that experience occurred using what could be considered normal consumer hardware and software, that's relevant (and disturbing). As I noted earlier, the only path-related problem that the CERN study unearthed involved their (hardly consumer-typical) use of RAID cards, the unusual demands that those cards placed on the WD disk firmware (to the point where it produced on-disk errors), and the cards' failure to report accompanying disk time-outs.

> Another time corruption during hotswapping (though this does not
> necessarily count since I did it on hardware that I did not know was
> supposed to support it, and I would not have attempted it to begin with
> otherwise).

Using ZFS as a test platform to see whether you could get away with using hardware in a manner in which it may not have been intended to be used may not really qualify as 'consumer' use. As I've noted before, consumer relevance remains the point in question here (since that's the point that fired off this lengthy sub-discussion).

...

> In my professional life I have seen bitflips a few times in the middle of
> real live data running on "real" servers that are used for important data.
> As a result I have become pretty paranoid about it all, making heavy use
> of par2.

And well you should - but, again, that's hardly 'consumer' use.

...

> > can also be detected by scrubbing, and it's arguably a lot easier to
> > apply brute-force scrubbing (e.g., by scheduling a job that periodically
> > copies your data to the null device if your system does not otherwise
> > support the mechanism) than to switch your file system.
>
> How would your magic scrubbing detect arbitrary data corruption without
> checksumming

The assertion is that it would catch the large majority of errors that ZFS would catch (i.e., all the otherwise detectable errors, most of them detected by the disk when it attempts to read a sector), leaving a residue of no noticeable consequence to consumers (especially as one could make a reasonable case that most consumers would not experience any noticeable problem even if *none* of these errors were noticed). A rough sketch of such a brute-force scrub job appears below.

> or redundancy?

Redundancy is necessary if you want to fix (not just catch) errors, but conventional mechanisms provide redundancy just as effective as ZFS's. (With the minor exception of ZFS's added metadata redundancy, but the likelihood that an error will happen to hit the relatively minuscule amount of metadata on a disk rather than the sea of data on it is, for consumers, certainly negligible, especially considering all the far more likely potential risks in the use of their PCs.)
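For concreteness, here is a minimal sketch of the kind of manually-scheduled brute-force scrub meant above, in Python rather than as a shell one-liner so the error handling is visible. The root path and chunk size are arbitrary placeholders, not recommendations; the only point is that reading every file end to end forces the drive to return (or fail) each allocated sector, giving its built-in ECC a chance to flag latent unreadable sectors before you actually need the data.

#!/usr/bin/env python3
# Brute-force scrub sketch: read every file under a directory tree and
# report any read errors.  Reading each sector forces the drive to apply
# its own ECC, so otherwise-latent unreadable sectors surface here rather
# than at the moment the data is actually needed.
# ROOT and CHUNK below are placeholder assumptions.

import os
import sys

ROOT = "/data"              # hypothetical tree to scrub
CHUNK = 1024 * 1024         # read 1 MiB at a time

def scrub(root):
    errors = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    # Discard the data; all we care about is that it reads.
                    while f.read(CHUNK):
                        pass
            except OSError as e:
                errors += 1
                print("READ ERROR: %s (%s)" % (path, e), file=sys.stderr)
    return errors

if __name__ == "__main__":
    sys.exit(1 if scrub(ROOT) else 0)

Run something like that from cron every week or two and you get the timely-detection part of scrubbing; what it obviously cannot do is flag a sector that reads back successfully but with the wrong contents - which is exactly the residue whose consumer-level incidence is being argued about here.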
> A lot of the data people save does not have checksumming.

*All* disk data is checksummed, right at the disk - and according to the studies I'm familiar with this detects most errors (certainly enough of those that ZFS also catches to satisfy most consumers). If you've got any quantitative evidence to the contrary, by all means present it.

...

> I think one needs to stop making excuses by observing properties of
> specific file types and similar.

I'm afraid that's incorrect: given the statistical incidence of the errors in question here, in normal consumer use only humongous files will ever experience them with non-negligible probability. So those are the kinds of files at issue.

When such a file experiences one of these errors, either it will be one that ZFS is uniquely (save for WAFL) capable of detecting, or it will be one that more conventional mechanisms can detect. The latter are, according to the studies I keep mentioning, far more frequent (only relatively, of course: we're still only talking about one in every 10 TB or so, on average and according to manufacturers' specs, which seem to be if anything pessimistic in this area), and comprise primarily unreadable disk sectors which (as long as they're detected in a timely manner by scrubbing, whether ZFS's or some manually-scheduled mechanism) simply require that the bad sector (or file) be replaced by a good copy to restore the desired level of redundancy.

When we get into the realm of errors which are otherwise undetectable, we're either talking about disk read errors in the once-per-PB range (and, if they're single- or few-bit errors, they won't noticeably affect the video files which typically dominate consumer storage space use) or about the kinds of hardware errors that some people here have raised anecdotally but AFAIK haven't come back to flesh out (e.g., after questions such as whether they occurred only before ATA started CRCing its transfers). Only the latter would seem to have any potential relevance to consumers, if indeed their incidence is non-negligible. (The back-of-the-envelope arithmetic behind those two rates is sketched further down.)

> You can always use FEC to do error correction on arbitrary files if you
> really feel they are important. But the point is that with ZFS you get
> detection of *ANY* bit error for free (essentially),

And the counterpoint is that while this is true it *just doesn't matter* in normal consumer use, because the incidence of errors which would otherwise go undetected is negligible.

...

> Even without fancy high-end requirements, it is nice to have some good
> statistical reason to believe that random corruption does not occur.

And you've already got it without ZFS: all ZFS does is add a few more decimal places to an already negligible (at least to consumers) risk.

...

> It's like choosing RAM. You can make excuses all you want about doing
> proper testing, buying good RAM, or having redundancy at other levels
> etc - but you will still sleep better knowing you have ECC RAM than some
> random junk.

No one is telling you not to do whatever it takes to help you sleep better: I'm just telling you that the comfort you attain thereby may not be strictly rational (i.e., commensurate with the actual effect of your action), so you should be careful about trying to apply that experience to others.
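To put the per-10-TB and once-per-PB figures above on a common footing, here is the back-of-the-envelope conversion from an error rate quoted per bit read into the volume of data read per expected error. The 1-per-10^14-bits figure is the commonly quoted unrecoverable-read-error spec for consumer drives; the 1-per-10^16-bits figure is simply the rate implied by the "once-per-PB" wording above, not a manufacturer spec.

# Convert "one error per N bits read" into "data read per expected error",
# for the two error classes discussed above.  1e14 is the commonly quoted
# consumer-drive unrecoverable-read-error spec; 1e16 is just the rate
# implied by "once per PB", used here for illustration only.

TB = 10.0 ** 12      # decimal terabyte, as drive vendors count
PB = 10.0 ** 15

def tb_per_error(bits_per_error):
    """Terabytes expected to be read before one error, on average."""
    return (bits_per_error / 8.0) / TB

for label, rate in [("unreadable sector (URE spec, 1 per 1e14 bits)", 1e14),
                    ("otherwise-undetectable (1 per 1e16 bits)", 1e16)]:
    tb = tb_per_error(rate)
    print("%s: one expected error per %.1f TB (%.2f PB) read"
          % (label, tb, tb * TB / PB))

# Prints roughly:
#   unreadable sector (URE spec, 1 per 1e14 bits): one expected error per 12.5 TB (0.01 PB) read
#   otherwise-undetectable (1 per 1e16 bits): one expected error per 1250.0 TB (1.25 PB) read

Which is where the "one in every 10 TB or so" and "once-per-PB range" numbers above come from.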
> Or let's do the seat belt analogy. You can try to convince yourself/other
> people all you want that you are a safe driver, that you should not drive
> in a way that allows crashes or whatever else - but you are still going
> to be safer with a seat belt than without it.

Indeed. And since studies have shown that if you are wearing a seat belt an airbag at best gives you very minimal additional protection (and in some cases might actually increase your risk), you really ought to stop telling people who already wear seat belts that they need airbags too.

> This is also why we care about fsync(). It doesn't matter that you spent
> $100000 on that expensive server with redundant PSU:s hooked up to
> redundant UPS systems. *SHIT HAPPENS*, and when it does, you want to be
> maximally protected.

You're getting very far afield from consumer activities again, I'm afraid.

> Yes, ZFS is not perfect. But to me, both in the context of personal use
> and more serious use, ZFS is, barring some implementation details, more
> or less exactly what I have always wanted and solves pretty much all of
> the major problems with storage.

That's very nice for you: just (as I noted above) don't presume to apply your personal fetishes (including what may constitute a 'major' consumer storage problem) to everyone else.

> And let me be clear: That is not hype. It's ZFS actually providing what I
> have wanted, and what I knew I wanted even before ZFS (or WAFL or whatever
> else) was ever on my radar.

Having just dealt with that fairly bluntly above, let me state here that the same is true for me: that's why I was working on almost exactly the same kind of checksumming before I ever heard of ZFS (or knew that WAFL already had it). The difference is that I understand *quantitatively* how important it is - both to installations with serious reliability requirements (where it's a legitimate selling point, though not necessarily a dominant one save for pretty unusual installations) and in consumer use (where it's not).

> For some reason some people seem to disagree.

Perhaps I've now managed to make it clear exactly where that disagreement lies, if it wasn't clear before.

- bill