On Thu, 19 Sep 2013, Charles Polisher wrote:

On Wed, Sep 18, 2013 at 10:11:11AM -0700, Tracy Reed wrote:
On Wed, Sep 18, 2013 at 08:11:24AM PDT, Charles Polisher spake thusly:
 - Monte Carlo simulations of RAID systems confirmed that
   a batch of disk drives was vastly exceeding the claimed AFR;
   the vendor eventually copped to a quality problem.

I'd like to know more about how this was done.

My shop had two RAID failures with data loss in four years,
which was not supposed to be possible, and that is what got me
interested in how reliable RAID actually is. Fun fact: getting
struck by lightning is not a rare event if your name is Roy
"Dooms" Sullivan*.

You can find a spreadsheet on montecarlito.com with a pre-built
general Monte Carlo model. Elerath & Pecht give details of RAID
array reliability in "Enhanced Reliability Modeling of RAID
Storage Systems", as do other authors. Add drive reliability
specs and you're all set. Some results surprised me. For
example, a RAID6 single-disk failure can trigger whole-disk
transfers from every remaining drive all at once. One
uncorrectable error anywhere in that whole bitstream can cause
total data loss. With large drives, an uncorrected error can be
expected as often as once in every 16 whole-disk transfers.
Not as reliable as you might hope. Implementation details can
make a huge difference, like scrubbing or correlated failures
due to heat or vibration.
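
To make the whole-disk-transfer arithmetic concrete, here is a rough
sketch of how the chance of hitting at least one unrecoverable read
error (URE) grows with the number of bits read. The 1-in-10^14 error
rate and the 2 TB drive size are assumptions picked for illustration,
not figures from this thread or from Elerath & Pecht, and the function
names are made up. It also assumes bit errors are independent, which
real drives do not guarantee.

```python
import math

def p_ure(bytes_read, ure_per_bit=1e-14):
    """Chance of at least one URE while reading bytes_read bytes.

    ure_per_bit is an assumed spec-sheet style rate (1 error per 1e14 bits).
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 so tiny p stays accurate
    return -math.expm1(bits * math.log1p(-ure_per_bit))

def p_ure_rebuild(drive_bytes, surviving_drives, ure_per_bit=1e-14):
    """Chance of at least one URE when every surviving drive is read in full."""
    return p_ure(drive_bytes * surviving_drives, ure_per_bit)

if __name__ == "__main__":
    drive = 2e12  # assumed 2 TB drive
    print("one whole-disk read:   %.3f" % p_ure(drive))
    print("rebuild, 7 survivors:  %.3f" % p_ure_rebuild(drive, 7))
```

Plug in the error rate and capacity from your own drive's data sheet;
the result is quite sensitive to both.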

RAID 6 or RAID 5?

I would not expect a single error in that transfer to kill the entire RAID, just to kill a second disk (and only if you then lose a third disk would the array die).
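
A toy Monte Carlo sketch of that distinction, under coarse assumptions:
one drive has already failed, each surviving drive independently
suffers a fault (a URE or an outright failure) during the rebuild with
probability p_fault, and any fault counts as losing that whole drive.
The function name, the 0.15 fault probability and the 7 surviving
drives are all hypothetical. This is not the montecarlito.com model.

```python
import random

def rebuild_loss_rate(parity_drives, surviving, p_fault,
                      trials=100_000, seed=0):
    """Estimate how often a rebuild ends in data loss.

    parity_drives: 1 for RAID 5, 2 for RAID 6.
    surviving:     drives that must be read during the rebuild.
    p_fault:       assumed per-drive chance of a fault during the rebuild.
    """
    rng = random.Random(seed)
    # One drive is already gone, so this many further faults are survivable.
    tolerable = parity_drives - 1
    losses = 0
    for _ in range(trials):
        faults = sum(rng.random() < p_fault for _ in range(surviving))
        if faults > tolerable:
            losses += 1
    return losses / trials

if __name__ == "__main__":
    print("RAID 5:", rebuild_loss_rate(1, 7, 0.15))
    print("RAID 6:", rebuild_loss_rate(2, 7, 0.15))
```

Treating every URE as a whole-drive loss is pessimistic for RAID 6,
since two UREs on different drives rarely land in the same stripe.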

David Lang
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
