Once again, profuse apologies for having taken so long (well over 24 hours by now - though I'm not sure it actually appeared in the forum until a few hours after its timestamp) to respond to this.
> can you guess? wrote:
>
> > Primarily its checksumming features, since other open source solutions
> > support simple disk scrubbing (which given its ability to catch most
> > deteriorating disk sectors before they become unreadable probably has a
> > greater effect on reliability than checksums in any environment where
> > the hardware hasn't been slapped together so sloppily that connections
> > are flaky).
>
> From what I've read on the subject, that premise seems bad from the
> start.

Then you need to read more or understand it better.

> I don't believe that scrubbing will catch all the types of errors that
> checksumming will.

That's absolutely correct, but it in no way contradicts what I said (and you quoted) above. Perhaps you should read that again, more carefully: it merely states that disk scrubbing probably has a *greater* effect on reliability than checksums do, not that it completely subsumes their features.

> There is a category of errors that are not caused by firmware, or any
> type of software. The hardware just doesn't write or read the correct
> bit value this time around. Without a checksum there's no way for the
> firmware to know, and next time it very well may write or read the
> correct bit value from the exact same spot on the disk, so scrubbing is
> not going to flag this sector as 'bad'.

It doesn't have to, because that's a *correctable* error that the disk's extensive correction codes (which correct *all* single-bit errors as well as most considerably longer error bursts) resolve automatically.

> Now you may claim that this type of error happens so infrequently

No, it's actually one of the most common forms, due to the desire to pack data on the platter as tightly as possible: that's why those long correction codes were created.

Rather than comment on the rest of your confused presentation about disk error rates, I'll just present a capsule review of the various kinds:

1. Correctable errors (which I just described above). If a disk notices that a sector *consistently* requires correction it may deal with it as described in the next paragraph.

2. Errors that can be corrected only with retries (i.e., the sector is not *consistently* readable even after the ECC codes have been applied, but can be successfully read after multiple attempts, which can do things like fiddling slightly with the head position over the track and with the signal amplification to try to get a better response). A disk may try to rewrite such a sector in place to see if its readability improves as a result, and if it doesn't it will then transparently revector the data to a spare sector if one exists and mark the original sector as 'bad'. Background scrubbing gives the disk an opportunity to discover such sectors *before* they become completely unreadable, thus significantly improving reliability even in non-redundant environments.

3. Uncorrectable errors (bursts too long for the ECC codes to handle even after the kinds of retries described above, but which the ECC codes can still detect): scrubbing catches these as well, and if suitable redundancy exists it can correct them by rewriting the offending sector (the disk may transparently revector it if necessary, or the LVM or file system can if the disk can't). Disk vendor specs nominally state that one such error may occur for every 10^14 bits transferred on a contemporary commodity (ATA or SATA) drive (i.e., about once in every 12.5 TB), but studies suggest that in practice they're much rarer.

4. Undetectable errors (errors which the ECC codes don't detect but which ZFS's checksums presumably would). Disk vendors no longer provide specs for this reliability metric.
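For anyone who wants to check the arithmetic behind that 12.5 TB figure, it falls straight out of the nominal vendor spec quoted above (the numbers here are the spec'ed rates, not measurements):

```python
# Nominal uncorrectable-error spec for a commodity ATA/SATA drive:
# one unrecoverable read error per 10^14 bits transferred.
BITS_PER_ERROR = 10**14

bytes_per_error = BITS_PER_ERROR / 8      # 1.25e13 bytes
tb_per_error = bytes_per_error / 10**12   # decimal terabytes, as vendors spec them

print(f"One uncorrectable error per ~{tb_per_error:.1f} TB transferred")
# -> One uncorrectable error per ~12.5 TB transferred
```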
My recollection from a decade or more ago is that back when disk vendors did provide such specs, this rate was three orders of magnitude lower than the uncorrectable error rate: if that still obtained it would mean about one undetectable error in every 12.5 petabytes transferred, but given that the real-world incidence of uncorrectable errors is so much lower than spec'ed and that ECC codes keep increasing in length, it might be far lower than that now.

...

> > Aside from the problems that scrubbing handles (and you need scrubbing
> > even if you have checksums, because scrubbing is what helps you *avoid*
> > data loss rather than just discover it after it's too late to do
> > anything about it), and aside from problems
>
> Again I think you're wrong on the basis for your point.

No: you're just confused again.

> The checksumming in ZFS (if I understand it correctly) isn't used for
> only detecting the problem. If the ZFS pool has any redundancy at all,
> those same checksums can be used to repair that same data, thus
> *avoiding* the data loss.

1. Unlike things like disk ECC codes, ZFS's checksums can't repair data: they just detect that it's corrupt.

2. So does disk scrubbing, save for the *extremely* rare cases of undetectable errors (see above) or other rare errors that aren't related to transferring bits to and from the disk platter (see Anton's recent post, for example).

3. Unlike disk scrubbing, ZFS's checksums per se only validate data when it happens to be read, and only one copy of it - so ZFS internally schedules background data scrubs that presumably read everything, including applicable redundancy (this can be more expensive than the streaming-sequential background scrubs that can be performed when you don't have to validate file-structured checksum information, but the additional overhead shouldn't be important given that it occurs in the background).

4.
With both approaches, if redundancy is present then when corrupt data is detected it can be corrected by rewriting it using the good copy generated from the redundancy.

(more confusion snipped)

> > Robert Milkowski cited some sobering evidence that mid-range arrays may
> > have non-negligible firmware problems that ZFS could often catch, but
> > a) those are hardly 'consumer' products (to address that sub-thread,
> > which I think is what applies in Stefano's case) and b) ZFS's claimed
> > attraction for higher-end (corporate) use is its ability to *eliminate*
> > the need for such products (hence its ability to catch their bugs would
> > not apply - though I can understand why people who needed to use them
> > anyway might like to have ZFS's integrity checks along for the ride,
> > especially when using less-than-fully-mature firmware).
>
> Every drive has firmware too. If it can be used to detect and repair
> array firmware problems, then it can be used by consumers to detect and
> repair drive firmware problems too.

As usual, the question is whether that *matters* in any practical sense. Commodity drive firmware is a) far less complex than array firmware and b) typically exposed to only a few standard operations that are far more thoroughly exercised than array firmware is (i.e., any significant bugs tend to get flushed out long before a drive hits the field). Formal root-cause error analyses that I've seen have not identified disk firmware bugs as a significant source of error in conventional installations.
The CERN study did find an adverse interaction between the firmware in its commodity drives and the firmware in its 3Ware RAID controllers, due to the unusual demands that the latter were placing on the former plus the latter's inclination to ignore disk time-outs, but that's hardly a 'commodity' environment - and it was the reason I specifically focused my comment on ZFS's claimed ability to *avoid* the need to use such hardware aids, which may be less thoroughly wrung out than commodity drives in commodity environments.

...

> Sure it's true that something else that could trash your data without
> checksumming can still trash your data with it. But making sure that the
> data gets unmangled if it can is still worth something,

And I've never suggested otherwise: the question (once again) is *how much* it's worth, and the answer in most situations is "not all that much, because it doesn't significantly reduce exposure, given the magnitude of the *other* error sources that remain present even if checksums are used". Is everyone here (Anton excepted) so mathematically challenged that they can't grasp the fact that while something may be 'good' in an abstract sense, whether it's actually *valuable* is a *quantitative* question?

> and the improvements you point out as needed in other components would
> be pointless (according to your argument) if something like ZFS didn't
> also exist.

No, you're still confused. I listed a bunch of things you'd have to do to protect your data in typical situations before the residual risk became sufficiently low that further reducing it via ZFS-style checksumming would have noticeable benefit, but they're all eminently useful without ZFS as well.

Hmmm - perhaps that's once again too abstract for you to follow, so let's try something more concrete. Say your current risk level on a 100-point scale is 20 with no special precautions taken at all. Back up your data with no other changes and it might go down to 15.
Back up your data and verify the backup as well (but make no other changes) and it might go down to 12. Back up and verify your data multiple times at multiple sites (no other changes) and it might go down to 5. Periodically verify that your backups are still readable and it might go down to 3. So without using ZFS you can reduce your risk from a level of 20 to a level of 3: sounds worthwhile to me.

Now that you've done that, if you use ZFS-style checksumming perhaps you can reduce your level of risk from 3 to 2 - and for some installations that might well be worth doing. On the other hand, if you use ZFS *without* taking the other steps, you only reduce your risk level from 20 to 19: perhaps measurable, but probably not noticeable, and almost certainly not sufficiently worthwhile *by itself* to change your platform.

Whoops - I seem to have said something very similar just below in the material that you quoted, but perhaps reasoning by analogy was not sufficiently concrete either.

> > So depending upon ZFS's checksums to protect your data in most PC
> > environments is sort of like leaving on a vacation and locking and
> > bolting the back door of your house while leaving the front door wide
> > open: yes, a burglar is less likely to enter by the back door, but
> > thinking that the extra bolt there made you much safer is likely
> > foolish.
>
> ...
>
> > What I'm saying is that if you *really* care about your data, then you
> > need to be willing to make the effort to lock and bolt the front door
> > as well as the back door and install an alarm system: if you do that,
> > *then* ZFS's additional protection mechanisms may start to become
> > significant (because you've eliminated the higher-probability risks and
> > ZFS's extra protection then actually reduces the *remaining* risk by a
> > significant percentage).
>
> Agreed. Depending on only one copy of your important data is
> shortsighted.
> But using a tool like ZFS on at least the most active copy, if not all
> copies, will be an improvement, if it even once stops you from having to
> go to your other copies.

And disk scrubbing is almost equally likely to accomplish this, because it catches all but a minute portion of the same kinds of problems that ZFS catches.

> Also it's interesting that you use the term 'alarm system'. That's
> exactly how I view the checksumming features of ZFS. It is an alarm that
> goes off if any of my bits have been lost to an invisible 'burglar'.

As does disk scrubbing.

> I've also noticed how you happen to skip the data replication features
> of ZFS.

I suspect that you're not talking about RAID but about snapshots.

> While they may not be everything you've hoped

RAID-Z certainly isn't, and ZFS's more general approach to internal redundancy could be more automated and flexible, but ZFS snapshots are fairly similar to most other file system implementations: the only potentially superior approach that I'm acquainted with is something like Interbase's multi-versioning mechanism, which trades off access performance to historical data for a more compact representation and more flexibility in moving data around without creating additional snapshot overhead.

> they would be, they are features that will have value to people who want
> to do exactly what you suggest, keeping multiple copies of their data in
> multiple places.

You have me at a disadvantage here, because I'm not even a Unix (let alone Solaris and Linux) aficionado. But don't Linux snapshots in conjunction with rsync (leaving aside other possibilities that I've never heard of) provide rather similar capabilities (e.g., incremental backup or re-synching), especially when used in conjunction with scripts and cron?

...

> On the cost side of things, I think you also miss a point.
> The data checking *and repair* features of ZFS bring down the cost of
> storage, not just in the cost of the software. It also allows (as in
> safeguards) the use of significantly lower-priced hardware (SATA drives
> instead of SAS or FCAL, or expensive arrays) by making up for the
> slightly higher possibility of problems that such hardware brings with
> it.

Nothing which you describe above is unique to ZFS: comparable zero-cost open-source solutions are available on Linux using its file systems, logical volume management, and disk scrubbing.

...

> > > i'd love to see the improvements on the many shortcomings you're
> > > pointing to and passionate about written up, proposed, and freely
> > > implemented :)
> >
> > Then ask the ZFS developers to get on the stick: fixing the
> > fragmentation problem discussed elsewhere should be easy, and RAID-Z is
> > at least amenable to a redesign (though not without changing the
> > on-disk metadata structures a bit - but while they're at it, they could
> > include support for data redundancy in a manner analogous to ditto
> > blocks so that they could get rid of the vestigial LVM-style management
> > in that area).
>
> I think he was suggesting that if it's so important to you, go ahead and
> submit the changes yourself.

Then he clearly hadn't read my earlier posts, in which I explained that I have no interest whatsoever in doing that: I just came here on the off-chance that some technically interesting insights might be found, and have mostly stuck around since (despite the conspicuous lack of such insights) because I got sufficiently disgusted with some of the attitudes here that I decided to confront them (it's also kind of entertaining, though so far only in an intellectually-slumming sort of way that I won't really miss after things have run their course).
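P.S. Since the detect-vs.-repair point keeps getting muddled: here's a toy sketch (in Python, with made-up names like diskA/diskB, and SHA-256 standing in for whatever check information the layer actually keeps - nothing ZFS- or Linux-specific) of the flow that *any* scrub-plus-redundancy scheme follows, whether detection comes from disk ECC or from file-system checksums: read every copy, identify a bad one via its check information, and rewrite it from a good one.

```python
import hashlib

def checksum(data: bytes) -> str:
    # SHA-256 stands in for whatever check information the layer keeps
    # (disk ECC, file-system block checksums, ...).
    return hashlib.sha256(data).hexdigest()

def scrub(copies: dict, expected: str) -> dict:
    """Scrub all redundant copies; repair any bad copy from a good one."""
    good = next((d for d in copies.values() if checksum(d) == expected), None)
    if good is None:
        # Detection alone can't save you: with no good copy left,
        # it's time for the backups discussed above.
        raise IOError("all copies corrupt - restore from backup")
    return {name: (d if checksum(d) == expected else good)
            for name, d in copies.items()}

# A mirrored block where one side has silently rotted:
block = b"important data"
mirror = {"diskA": block, "diskB": b"important dat\xff"}
repaired = scrub(mirror, checksum(block))
assert repaired["diskB"] == block   # bad copy rewritten from the good one
```

Note that the check information only *detects* the bad copy; the repair half is the redundancy's doing in both approaches.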
- bill

This message posted from opensolaris.org

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss