Ric Wheeler wrote:
> Hans Reiser wrote:
>> I am skeptical that bitflip errors above the storage layer are as common
>> as the ZFS authors say, and their statistics that I have seen somehow
>> lack a lot of detail about how they were gathered.  If, say, a device
>> with 100 errors counts as 100 instances for their statistics.....  Well,
>> it would be nice to know how they were gathered.  Next time I meet them
>> I must ask.
>>   
> I think that most big vendors have a lot of information about failure
> rates on drives, but cannot actually share the details in public (due
> to NDAs with the suppliers).
>
> One thing that we are trying to do is to get some of the more
> "community" oriented people at Seagate Research to come out and talk
> to the people about what are reasonable types of errors to code
> against.  Current idea is to get everyone in the same place a couple
> of days before the next FAST conference (i.e., linux IO people or file
> system people and these vendors).  (See the USENIX page for details on
> FAST at http://www.usenix.org/events/fast07/cfp/).
>
> I will say that media errors tend to be larger than single bit errors,
> i.e. you will lose a set of sectors instead of seeing a single bit
> flip on one sector (remember that the drive vendors do extensive ECC
> at their level).  What their ECC will not fix is something like junk
> settling on the platter or a really bad error like a bad disk head.
I think that integration of fs, fsck, and raid is the right solution for
media errors.  What I haven't seen data I trust on is the bitflip error
rate for non-media sources.  Since I haven't seen data I believe (where
belief requires details being supplied), my inclination is to say that
plugins that users can choose to use if they want them are the right
solution.
> I think that ECC would be overkill,
I view it as an option that we make available to enterprise customers
who want to feel good.

It is not for me to tell them that they are wrong, for I lack the data;
it is merely for me to supply it as a non-default option, and let the
users tell me how often it actually gets triggered when they use it.
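
To make the idea concrete, here is a minimal sketch of an opt-in integrity check that is off by default and counts how often it fires. Everything in it is an assumption for illustration: the `ChecksumPlugin` class, its in-memory block store, and the `corrupt` helper are hypothetical and are not the reiser4 plugin interface. It uses a CRC32 for detection only, which is weaker than real ECC since it cannot correct anything.

```python
import zlib


class ChecksumPlugin:
    """Hypothetical opt-in block checksum layer (not the reiser4 plugin API)."""

    def __init__(self, enabled=False):
        self.enabled = enabled   # non-default: the user has to turn it on
        self.mismatches = 0      # how often the check actually triggered
        self._store = {}         # block number -> (data, crc or None)

    def write(self, blockno, data):
        # Only pay for the checksum when the user asked for it.
        crc = zlib.crc32(data) if self.enabled else None
        self._store[blockno] = (data, crc)

    def corrupt(self, blockno, bit):
        """Test helper: flip one bit of the stored data, leaving the crc alone."""
        data, crc = self._store[blockno]
        buf = bytearray(data)
        buf[bit // 8] ^= 1 << (bit % 8)
        self._store[blockno] = (bytes(buf), crc)

    def read(self, blockno):
        data, crc = self._store[blockno]
        if self.enabled and crc is not None and zlib.crc32(data) != crc:
            self.mismatches += 1
            raise IOError("checksum mismatch on block %d" % blockno)
        return data


if __name__ == "__main__":
    plugin = ChecksumPlugin(enabled=True)
    plugin.write(0, b"some block contents")
    plugin.corrupt(0, 5)  # simulate a bitflip above the media layer
    try:
        plugin.read(0)
    except IOError as err:
        print(err)
    print("times triggered:", plugin.mismatches)
```

The `mismatches` counter is the part that matters here: it is the number users could report back, which is the data that is currently missing.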
