Miles Nordin wrote:
>>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
>>>>>>             
>
>     >> If you really mean there are devices out there which never
>     >> return error codes, and always silently return bad data, please
>     >> tell us which one and the story of when you encountered it,
>
>     re> I blogged about one such case.
>     re> http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file
>
>     re> However, I'm not inclined to publicly chastise the vendor or
>     re> device model.  It is a major vendor and a popular
>     re> device. 'nuff said.
>
> It's not really enough for me, but what's more the case doesn't match
> what we were looking for: a device which ``never returns error codes,
> always returns silently bad data.''  I asked for this because you said
> ``However, not all devices return error codes which indicate
> unrecoverable reads,'' which I think is wrong.  Rather, most devices
> sometimes don't, not some devices always don't.
>   

I really don't know how to please you.  I've got a bunch of
broken devices of all sorts.  If you'd like to stop by some time
and rummage through the boneyard, feel free.  Make it quick before
my wife makes me clean up :-)  The device I mentioned in my blog
does return bad data far more often than I'd like, which is why I
only use it for testing and don't store my wife's photo album on it.
Anyone who has been around for a while will have similar anecdotes.

> Your experience doesn't say anything about this drive's inability to
> return UNC errors.  It says you suspect it of silently returning bad
> data, once, but your experience doesn't even clearly implicate the
> device once: It could have been cabling/driver/power-supply/zfs-bugs
> when the block was written.  I was hoping for a device in your ``bad
> stack'' which does it over and over.
>
> Remember, I'm not arguing ZFS checksums are worthless---I think
> they're great.  I'm arguing with your original statement that ZFS is
> the only software RAID which deals with the dominant error you find in
> your testing, unrecoverable reads.  This is untrue!
>   

To be clear, I claim:
    1. The dominant failure mode in my field data for magnetic disks is
    unrecoverable reads.  You need some sort of data protection to get
    past this problem.
    2. Unrecoverable reads are not always reported by disk drives.
    3. You really want a system that performs end-to-end data verification.
    If you don't bother to code that into your applications (a rough sketch
    of what that looks like follows below), then you might rely on ZFS to
    do it for you.  If you ignore this problem, it will not go away.
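
For what it's worth, the application-level version of claim 3 is
conceptually simple.  A rough sketch in Python (purely illustrative; the
sidecar ".sha256" file and the helper names are made up, and this is not
how ZFS implements it internally):

    import hashlib

    def write_with_checksum(path, data):
        # Store the data next to its SHA-256 digest so a later read can be
        # verified end-to-end, regardless of what the disk claims.
        with open(path, "wb") as f:
            f.write(data)
        with open(path + ".sha256", "w") as f:
            f.write(hashlib.sha256(data).hexdigest())

    def read_verified(path):
        # Re-read the data and refuse to return it if it no longer matches
        # the digest recorded at write time.
        with open(path, "rb") as f:
            data = f.read()
        with open(path + ".sha256") as f:
            expected = f.read().strip()
        if hashlib.sha256(data).hexdigest() != expected:
            raise IOError("silent corruption detected in " + path)
        return data

ZFS effectively does this for every block, below your application, which
is why you might lean on it instead of writing the above yourself.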

>     re> This number should scare the *%^ out of you.  It basically
>     re> means that no data redundancy is a recipe for disaster.
>
> yeah, but that 9.5% number alone isn't an argument for ZFS over other
> software LVM's.
>
>     re> 0.466%/yr is a per-disk rate.  If you have 10 disks, your
>     re> exposure is 4.6% per year.  For 100 disks, 46% per year, etc.
>
> no, you're doing the statistics wrong, and in a really elementary way.
> You're counting multiple times the possible years in which more than
> one disk out of the hundred failed.  If what you care about for 100
> disks is that no disk experiences an error within one year, then you
> need to calculate
>
>   (1 - 0.00466) ^ 100 = 62.7%
>
> so that's 37% probability of silent corruption.  For 10 disks, the
> mistake doesn't make much difference and 4.6% is about right.
>   

Indeed.  Intuitively, a per-disk AFR multiplied by the population size is
more easily grokked by the masses.  But if you go into a customer site and
say "dude, there is only a 62.7% chance that your system won't be affected
by silent data corruption this year with my (insert favorite non-ZFS,
non-NetApp solution here)," then you will have a difficult sale.

> I don't dispute ZFS checksums have value, but the point stands that
> the reported-error failure mode is 20x more common in netapp's study
> than this one, and other software LVM's do take care of the more
> common failure mode.
>   

I agree.

>     re> UNCs don't cause ZFS to freeze as long as failmode != wait or
>     re> ZFS manages the data redundancy.
>
> The time between issuing the read and getting the UNC back can be up
> to 30 seconds, and there are often several unrecoverable sectors in a
> row as well as lower-level retries multiplying this 30-second value.
> so, it ends up being a freeze.
>   

Untrue.  The 30 seconds is not an upper bound: there are disks which will
retry forever.  But don't take my word for it; hear it from another RAID
software vendor:
http://blogs.sun.com/relling/entry/adaptec_webinar_on_disks_and
[sorry about the redirect, you have to sign up for an Adaptec
webinar before you can get to the list of webinars, so it is hard
to provide the direct URL]

Incidentally, I have one such disk in my boneyard, but it isn't
much fun to work with because it just sits there and spins when
you try to access the bad sector.

> To fix it, ZFS needs to dispatch read requests for redundant data if
> the driver doesn't reply quickly.  ``Quickly'' can be ambiguous, but
> the whole point of FMD was supposed to be that complicated statistics
> could be collected at various levels to identify even more subtle
> things than READ and CKSUM errors, like drives that are working at
> 1/10th the speed they should be, yet right now we can't even flag a
> drive taking 30 seconds to read a sector.  ZFS is still ``patiently
> waiting'', and now that FMD is supposedly integrated instead of a
> discussion of what knobs and responses there are, you're passing the
> buck to the drivers and their haphazard nonuniform exception state
> machines.  The best answer isn't changing drivers to make the drive
> timeout in 15 seconds instead---it's to send the read to other disks
> quickly using a very simple state machine, and start actually using
> FMD and a complicated state machine to generate suspicion-events for
> slow disks that aren't returning errors.
>   

I think the proposed timeouts here are too short, but the idea has
merit.  Note that such a preemptive read will have a negative performance
impact on heavily loaded systems, so it is not a given that people will
want it enabled by default.  Designing such a proactive system that
remains stable under high load may not be trivial.  Please file an RFE
at http://bugs.opensolaris.org
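
To sketch the shape of what is being proposed (purely illustrative
Python, not ZFS code; read_from() is a hypothetical per-device helper
and the 2-second threshold is an arbitrary number, not a recommended
default):

    from concurrent.futures import (ThreadPoolExecutor, TimeoutError,
                                    wait, FIRST_COMPLETED)

    HEDGE_AFTER_SECONDS = 2.0   # arbitrary value for illustration only

    def hedged_read(block, devices, read_from):
        # Ask the first replica for the block; if it has not answered within
        # the threshold, ask the remaining replicas too and return whichever
        # copy arrives first.
        pool = ThreadPoolExecutor(max_workers=len(devices))
        first = pool.submit(read_from, devices[0], block)
        try:
            return first.result(timeout=HEDGE_AFTER_SECONDS)
        except TimeoutError:
            # The primary is slow, perhaps retrying a bad sector; race the rest.
            backups = [pool.submit(read_from, d, block) for d in devices[1:]]
            done, _ = wait([first] + backups, return_when=FIRST_COMPLETED)
            return done.pop().result()
        finally:
            pool.shutdown(wait=False)   # don't block on a device that never answers

The hard part is not the dispatch logic; it is deciding when the extra
reads are worth the additional I/O load on a busy pool.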

> Also the driver and mid-layer need to work with the hypothetical
> ZFS-layer timeouts to be as good as possible about not stalling the
> SATA chip, the channel if there's a port multiplier, or freezing the
> whole SATA stack including other chips, just because one disk has an
> outstanding READ command waiting to get an UNC back.  
>
> In some sense the disk drivers and ZFS have different goals.  The goal
> of drivers should be to keep marginal disk/cabling/... subsystems
> online as aggressively as possible, while the goal of ZFS should be to
> notice and work around slightly-failing devices as soon as possible.
> I thought the point of putting off reasonable exception handling for
> two years while waiting for FMD, was to be able to pursue both goals
> simultaneously without pressure to compromise one in favor of the
> other.
>
> In addition, I'm repeating myself like crazy at this point, but ZFS
> tools used for all pools like 'zpool status' need to not freeze when a
> single pool, or single device within a pool, is unavailable or slow,
> and this expectation has nothing to do with failmode on the
> failing pool.  And NFS running above ZFS should continue serving
> filesystems from available pools even if some pools are faulted, again
> nothing to do with failmode.
>
>   

You mean something like:
http://bugs.opensolaris.org/view_bug.do?bug_id=6667208
http://bugs.opensolaris.org/view_bug.do?bug_id=6667199

Yes, we would all like to see these fixed soon.
 
> Neither is the case now, and it's not a driver fix, but even beyond
> fixing these basic problems there's vast room for improvement, to
> deliver something better than LVM2 and closer to NetApp, rather than
> just catching up.
>   

If you find more issues, please file bugs at http://bugs.opensolaris.org
 -- richard
