Once again, profuse apologies for having taken so long (well over 24 hours by now - though I'm not sure it actually appeared in the forum until a few hours after its timestamp) to respond to this.
> can you guess? wrote:
>
> > Primarily its checksumming features, since other open source solutions
> > support simple disk scrubbing (which given its ability to catch most
> > deteriorating disk sectors before they become unreadable probably has a
> > greater effect on reliability than checksums in any environment where
> > the hardware hasn't been slapped together so sloppily that connections
> > are flaky).
>
> From what I've read on the subject, that premise seems bad from the
> start.

Then you need to read more or understand it better.

> I don't believe that scrubbing will catch all the types of errors that
> checksumming will.

That's absolutely correct, but it in no way contradicts what I said (and you quoted) above. Perhaps you should read that again, more carefully: it merely states that disk scrubbing probably has a *greater* effect on reliability than checksums do, not that it completely subsumes their features.

> There is a category of errors that are not caused by firmware, or any
> type of software. The hardware just doesn't write or read the correct
> bit value this time around. Without a checksum there's no way for the
> firmware to know, and next time it very well may write or read the
> correct bit value from the exact same spot on the disk, so scrubbing is
> not going to flag this sector as 'bad'.

It doesn't have to, because that's a *correctable* error that the disk's extensive correction codes (which correct *all* single-bit errors as well as most considerably longer error bursts) resolve automatically.

> Now you may claim that this type of error happens so infrequently

No, it's actually one of the most common forms, due to the desire to pack data on the platter as tightly as possible: that's why those long correction codes were created.

Rather than comment on the rest of your confused presentation about disk error rates, I'll just present a capsule review of the various kinds:

1. Correctable errors (which I just described above). If a disk notices that a sector *consistently* requires correction it may deal with it as described in the next paragraph.

2. Errors that can be corrected only with retries (i.e., the sector is not *consistently* readable even after the ECC codes have been applied, but can be successfully read after multiple attempts, which can do things like fiddling slightly with the head position over the track and with the signal amplification to try to get a better response). A disk may try to rewrite such a sector in place to see if its readability improves as a result, and if it doesn't it will then transparently revector the data to a spare sector if one exists and mark the original sector as 'bad'. Background scrubbing gives the disk an opportunity to discover such sectors *before* they become completely unreadable, thus significantly improving reliability even in non-redundant environments.

3. Uncorrectable errors (bursts too long for the ECC codes to handle even after the kinds of retries described above, but which the ECC codes can still detect): scrubbing catches these as well, and if suitable redundancy exists it can correct them by rewriting the offending sector (the disk may transparently revector it if necessary, or the LVM or file system can if the disk can't). Disk vendor specs nominally state that one such error may occur for every 10^14 bits transferred on a contemporary commodity (ATA or SATA) drive (i.e., about once in every 12.5 TB), but studies suggest that in practice they're much rarer.

4. Undetectable errors (errors which the ECC codes don't detect but which ZFS's checksums presumably would). Disk vendors no longer provide specs for this reliability metric.
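For anyone who wants to check the arithmetic behind that 12.5 TB figure, it falls straight out of the nominal vendor spec quoted above (the numbers here are the spec'ed rates, not measurements):

```python
# Nominal uncorrectable-error spec for a commodity ATA/SATA drive:
# one unrecoverable read error per 10^14 bits transferred.
BITS_PER_ERROR = 10**14

bytes_per_error = BITS_PER_ERROR / 8      # 1.25e13 bytes
tb_per_error = bytes_per_error / 10**12   # decimal terabytes, as vendors spec them

print(f"One uncorrectable error per ~{tb_per_error:.1f} TB transferred")
# -> One uncorrectable error per ~12.5 TB transferred
```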
My recollection from a decade or more ago is that back when disk vendors did provide such specs, this rate was three orders of magnitude lower than the uncorrectable error rate: if that still obtained it would mean about one undetectable error in every 12.5 petabytes transferred, but given that the real-world incidence of uncorrectable errors is so much lower than spec'ed and that ECC codes keep increasing in length, it might be far lower than that now.

...

> > Aside from the problems that scrubbing handles (and you need scrubbing
> > even if you have checksums, because scrubbing is what helps you *avoid*
> > data loss rather than just discover it after it's too late to do
> > anything about it), and aside from problems
>
> Again I think you're wrong on the basis for your point.

No: you're just confused again.

> The checksumming in ZFS (if I understand it correctly) isn't used for
> only detecting the problem. If the ZFS pool has any redundancy at all,
> those same checksums can be used to repair that same data, thus
> *avoiding* the data loss.

1. Unlike things like disk ECC codes, ZFS's checksums can't repair data: they just detect that it's corrupt.

2. So does disk scrubbing, save for the *extremely* rare cases of undetectable errors (see above) or other rare errors that aren't related to transferring bits to and from the disk platter (see Anton's recent post, for example).

3. Unlike disk scrubbing, ZFS's checksums per se only validate data when it happens to be read, and only one copy of it - so ZFS internally schedules background data scrubs that presumably read everything, including applicable redundancy (this can be more expensive than the streaming-sequential background scrubs that can be performed when you don't have to validate file-structured checksum information, but the additional overhead shouldn't be important given that it occurs in the background).

4.
With both approaches, if redundancy is present then when corrupt data is detected it can be corrected by rewriting it using the good copy generated from the redundancy.

(more confusion snipped)

> > Robert Milkowski cited some sobering evidence that mid-range arrays may
> > have non-negligible firmware problems that ZFS could often catch, but
> > a) those are hardly 'consumer' products (to address that sub-thread,
> > which I think is what applies in Stefano's case) and b) ZFS's claimed
> > attraction for higher-end (corporate) use is its ability to *eliminate*
> > the need for such products (hence its ability to catch their bugs would
> > not apply - though I can understand why people who needed to use them
> > anyway might like to have ZFS's integrity checks along for the ride,
> > especially when using less-than-fully-mature firmware).
>
> Every drive has firmware too. If it can be used to detect and repair
> array firmware problems, then it can be used by consumers to detect and
> repair drive firmware problems too.

As usual, the question is whether that *matters* in any practical sense. Commodity drive firmware is a) far less complex than array firmware and b) typically exposed to only a few standard operations that are far more thoroughly exercised than array firmware is (i.e., any significant bugs tend to get flushed out long before a drive hits the field). Formal root-cause error analyses that I've seen have not identified disk firmware bugs as a significant source of error in conventional installations.
The CERN study did find an adverse interaction between the firmware in its commodity drives and the firmware in its 3Ware RAID controllers, due to the unusual demands that the latter were placing on the former plus the latter's inclination to ignore disk time-outs, but that's hardly a 'commodity' environment - and it was the reason I specifically focused my comment on ZFS's claimed ability to *avoid* the need to use such hardware aids, which may be less thoroughly wrung out than commodity drives in commodity environments.

...

> Sure it's true that something else that could trash your data without
> checksumming can still trash your data with it. But making sure that the
> data gets unmangled if it can is still worth something,

And I've never suggested otherwise: the question (once again) is *how much* it's worth, and the answer in most situations is "not all that much, because it doesn't significantly reduce exposure, given the magnitude of the *other* error sources that remain present even if checksums are used". Is everyone here (Anton excepted) so mathematically challenged that they can't grasp the fact that while something may be 'good' in an abstract sense, whether it's actually *valuable* is a *quantitative* question?

> and the improvements you point out as needed in other components would
> be pointless (according to your argument) if something like ZFS didn't
> also exist.

No, you're still confused. I listed a bunch of things you'd have to do to protect your data in typical situations before the residual risk became sufficiently low that further reducing it via ZFS-style checksumming would have noticeable benefit, but they're all eminently useful without ZFS as well.

Hmmm - perhaps that's once again too abstract for you to follow, so let's try something more concrete. Say your current risk level on a 100-point scale is 20 with no special precautions taken at all. Back up your data with no other changes and it might go down to 15.
Back up your data and verify the backup as well (but make no other changes) and it might go down to 12. Back up and verify your data multiple times at multiple sites (no other changes) and it might go down to 5. Periodically verify that your backups are still readable and it might go down to 3. So without using ZFS you can reduce your risk from a level of 20 to a level of 3: sounds worthwhile to me.

Now that you've done that, if you use ZFS-style checksumming perhaps you can reduce your level of risk from 3 to 2 - and for some installations that might well be worth doing. On the other hand, if you use ZFS *without* taking the other steps, you only reduce your risk level from 20 to 19: perhaps measurable, but probably not noticeable, and almost certainly not sufficiently worthwhile *by itself* to change your platform.

Whoops - I seem to have said something very similar just below in the material that you quoted, but perhaps reasoning by analogy was not sufficiently concrete either.

> > So depending upon ZFS's checksums to protect your data in most PC
> > environments is sort of like leaving on a vacation and locking and
> > bolting the back door of your house while leaving the front door wide
> > open: yes, a burglar is less likely to enter by the back door, but
> > thinking that the extra bolt there made you much safer is likely
> > foolish.
>
> ...
>
> > What I'm saying is that if you *really* care about your data, then you
> > need to be willing to make the effort to lock and bolt the front door
> > as well as the back door and install an alarm system: if you do that,
> > *then* ZFS's additional protection mechanisms may start to become
> > significant (because you've eliminated the higher-probability risks and
> > ZFS's extra protection then actually reduces the *remaining* risk by a
> > significant percentage).
>
> Agreed. Depending on only one copy of your important data is
> shortsighted.
> But using a tool like ZFS on at least the most active copy, if not all
> copies, will be an improvement, if it even once stops you from having to
> go to your other copies.

And disk scrubbing is almost equally likely to accomplish this, because it catches all but a minute portion of the same kinds of problems that ZFS catches.

> Also it's interesting that you use the term 'alarm system'. That's
> exactly how I view the checksumming features of ZFS. It is an alarm that
> goes off if any of my bits have been lost to an invisible 'burglar'.

As does disk scrubbing.

> I've also noticed how you happen to skip the data replication features
> of ZFS.

I suspect that you're not talking about RAID but about snapshots.

> While they may not be everything you've hoped

RAID-Z certainly isn't, and ZFS's more general approach to internal redundancy could be more automated and flexible, but ZFS snapshots are fairly similar to most other file system implementations: the only potentially superior approach that I'm acquainted with is something like Interbase's multi-versioning mechanism, which trades off access performance to historical data for a more compact representation and more flexibility in moving data around without creating additional snapshot overhead.

> they would be, they are features that will have value to people who want
> to do exactly what you suggest, keeping multiple copies of their data in
> multiple places.

You have me at a disadvantage here, because I'm not even a Unix (let alone Solaris and Linux) aficionado. But don't Linux snapshots in conjunction with rsync (leaving aside other possibilities that I've never heard of) provide rather similar capabilities (e.g., incremental backup or re-synching), especially when used in conjunction with scripts and cron?

...

> On the cost side of things, I think you also miss a point.
> The data checking *and repair* features of ZFS bring down the cost of
> storage, not just in the cost of the software. It also allows (as in
> safeguards) the use of significantly lower-priced hardware (SATA drives
> instead of SAS or FCAL, or expensive arrays) by making up for the
> slightly higher possibility of problems that such hardware brings with
> it.

Nothing which you describe above is unique to ZFS: comparable zero-cost open-source solutions are available on Linux using its file systems, logical volume management, and disk scrubbing.

...

> > > i'd love to see the improvements on the many shortcomings you're
> > > pointing to and passionate about written up, proposed, and freely
> > > implemented :)
> >
> > Then ask the ZFS developers to get on the stick: fixing the
> > fragmentation problem discussed elsewhere should be easy, and RAID-Z is
> > at least amenable to a redesign (though not without changing the
> > on-disk metadata structures a bit - but while they're at it, they could
> > include support for data redundancy in a manner analogous to ditto
> > blocks so that they could get rid of the vestigial LVM-style management
> > in that area).
>
> I think he was suggesting that if it's so important to you, go ahead and
> submit the changes yourself.

Then he clearly hadn't read my earlier posts, in which I explained that I have no interest whatsoever in doing that: I just came here on the off-chance that some technically interesting insights might be found, and have mostly stuck around since (despite the conspicuous lack of such insights) because I got sufficiently disgusted with some of the attitudes here that I decided to confront them (it's also kind of entertaining, though so far only in an intellectually-slumming sort of way that I won't really miss after things have run their course).
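P.S. Since the detect-vs.-repair point keeps getting muddled: here's a toy sketch (in Python, with made-up names like diskA/diskB, and SHA-256 standing in for whatever check information the layer actually keeps - nothing ZFS- or Linux-specific) of the flow that *any* scrub-plus-redundancy scheme follows, whether detection comes from disk ECC or from file-system checksums: read every copy, identify a bad one via its check information, and rewrite it from a good one.

```python
import hashlib

def checksum(data: bytes) -> str:
    # SHA-256 stands in for whatever check information the layer keeps
    # (disk ECC, file-system block checksums, ...).
    return hashlib.sha256(data).hexdigest()

def scrub(copies: dict, expected: str) -> dict:
    """Scrub all redundant copies; repair any bad copy from a good one."""
    good = next((d for d in copies.values() if checksum(d) == expected), None)
    if good is None:
        # Detection alone can't save you: with no good copy left,
        # it's time for the backups discussed above.
        raise IOError("all copies corrupt - restore from backup")
    return {name: (d if checksum(d) == expected else good)
            for name, d in copies.items()}

# A mirrored block where one side has silently rotted:
block = b"important data"
mirror = {"diskA": block, "diskB": b"important dat\xff"}
repaired = scrub(mirror, checksum(block))
assert repaired["diskB"] == block   # bad copy rewritten from the good one
```

Note that the check information only *detects* the bad copy; the repair half is the redundancy's doing in both approaches.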
- bill

This message posted from opensolaris.org

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss