Good writeup, but I'll point out that you are actually talking about three errors happening here: the first error that triggers the start of the rebuild, and then two additional errors after that.

Also, while I am not disagreeing with your math, this still doesn't seem right.

Also, is the BER really a permanent error?

If it is, then doing an array scrub every month or so should give you confidence that you don't have multiple errors lurking on the drives.

Also, if the BER represents a permanent error, that would imply that you should expect to 'lose a drive' about every four full reads of the array. If you scrub the array monthly, then this should mean that you lose 3 drives a year, just due to the reads from the array check (in addition to any actual use of the array). Given any reasonable amount of real use, this would seem to indicate that you should be losing half of your array every year. Drive failure rates are not that high.
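
To spell out that arithmetic (a quick sketch in Python, using the
24%-per-full-read figure from the quoted message below):

  # 'lose a drive' about every 4 full reads; one scrub per month.
  reads_per_error = 1 / 0.24     # ~4.2 full-array reads per expected URE
  scrubs_per_year = 12

  drives_lost_per_year = scrubs_per_year / reads_per_error
  print(f"drives 'lost' per year from scrubbing alone: {drives_lost_per_year:.1f}")
  # ~2.9, call it 3 -- before counting any real use of the array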


I've got a 160x1T disk array (10x 16 disk RAID 6, no hot spares) that gets completely re-written about every 8 weeks (it's essentially a circular buffer).

If a full read of 6 drives will fail 25% of the time, then these 160 drives should almost never be able to complete a single pass without losing drives. But they had <10 failures among the 160 drives in the first four years of operation.
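
For comparison, here is what the same BER math would predict for this
array -- a rough sketch, assuming 1 TB = 10^12 bytes, the 1-in-10^14
consumer BER figure quoted below, and uncorrelated errors:

  import math

  bits_per_drive = 1e12 * 8      # 1 TB drive
  ber = 1e-14                    # quoted consumer-drive error rate
  drives = 160

  expected_errors = bits_per_drive * ber * drives      # per full pass
  p_clean_pass = math.exp(-expected_errors)            # Poisson approximation
  print(f"expected UREs per full pass: {expected_errors:.1f}")     # ~12.8
  print(f"chance of an error-free pass: {p_clean_pass:.1e}")       # ~3 in a million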

So something just doesn't seem to match reality here.

David Lang

On Sun, 22 Sep 2013, Charles Polisher wrote:

David Lang wrote:
RAID 6 or RAID 5?

I would not expect a single error in that transfer to kill the
entire RAID, just to kill a second disk (and only if you have a
third disk would the array die)

I apologize for the length of my response.

TL;DR: I meant RAID 6. "Expect" is an interesting word. As many as
      1 in 40 RAID 6 array rebuilds may fail in a given case. Use
      smallish drives (<= 1TB) of enterprise quality and your data
      will be safe.

This is a pessimistic walk-through with admitted weaknesses.  No
shop I'm familiar with transfers data at the maximum rate 24
hours a day, 7 days a week. But the solace we take in RAID 6
double redundancy is undermined by a key characteristic: to
rebuild a failed drive, every remaining active drive has to
perform a whole-disk transfer. This exposes the array to a
secondary hazard, and then a tertiary hazard, both of which seem
like remote possibilities, but aren't as remote as you might
hope. You'll probably never lose a RAID level 6 array, but it
depends on what you mean by 'probably'. Here's a timeline of a
failure, with both happy and sad outcomes. Starting at time t0,

t0  7 active 3TB elements (disks) are up + one available hot spare.
   Array integrity: OK. Array can sustain loss of two elements.
   +-------------------------+
   | 1  2  3  4  5  P  Q  Hs |     Elements 1..5 are data, P,Q are parity,
   +-------------------------+     Hs is Hot standby.


t1  Element 1 fails completely.
   The RAID controller removes element 1 from the array.
   The hot spare is promoted to active element 1a, the array starts
   rebuilding by reconstructing the missing data on the new element.
   Array integrity: OK. Array can sustain loss of one element.

   +-------------------------+
   | F  2  3  4  5  P  Q  1a |     F is faulted element.
   +-------------------------+     1a is marked for reconstruction.


t2  For 7200RPM drives transferring 145 Mbytes/sec, rebuild needs
   a minimum of 5.75 hours to finish (a quick check of this figure
   follows the diagram).
   Array integrity: OK. Array can sustain loss of one element.
   +-------------------------+
   | F  2  3  4  5  P  Q  1a |     F is faulted element.
   +-------------------------+     1a is being reconstructed.
        |  |  |  |  |  |  ^
        |  |  |  |  |  R->^
        |  |  |  |  R-->--^        Elements 2,3,4,5,P,Q are read
        |  |  |  R-->-----^        to reconstruct the failed element
        |  |  R-->--------^        from 144 terabits of data.
        |  R-->-----------^
        R-->--------------+
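
   A quick sanity check of the 5.75-hour figure above -- a rough sketch
   in Python, assuming a sustained 145 Mbytes/sec across the whole
   drive, which is a best case since real rebuilds compete with
   foreground I/O:

      # One full pass over a 3 TB element at a sustained 145 Mbytes/sec.
      capacity_bytes = 3e12          # 3 TB (drive sizes are powers of ten)
      rate_bytes_per_sec = 145e6     # 145 Mbytes/sec, best-case rate

      hours = capacity_bytes / rate_bytes_per_sec / 3600
      print(f"minimum rebuild time: {hours:.2f} hours")   # ~5.75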

t3  The sysadmin replaces the failed drive (was element 1) with a new drive.
   The RAID controller designates it as a new Hot spare.
   +-------------------------+
   | Hs 2  3  4  5  P  Q  1a |     1a is being reconstructed
   +-------------------------+     from 2,3,4,5,P,Q.


t4  During the reconstruction of element 1a, an unrecoverable
   single bit error is detected in element 3's bit stream.
   How likely is this? The probability of a single bit error in
   a whole-array transfer is about

            (# bits in bitstream)
      p =  -----------------------
            bits per error (1/BER)

   Here the # of bits in the bitstream is

      3 terabytes * 8 bits * 6 drives = 144 terabits

   BER is the Bit Error Rate of the disk. Manufacturers typically
   express this as 1 error in so many bits, after error correction
   has been attempted.

   For a typical consumer drive, the BER is stated to be no more
   than 1 in 10^14 bits (e.g., Western Digital Green). With 3TB
   drives (drive sizes are specified in powers of ten), the
   likelihood of a bit error in a whole-array transfer is

               144 terabits
       p1 =  ---------------  =  0.24 (24%)
               10^14 bits

   which means 1 / 0.24 = 4.166 whole array transfers between
   expected bit errors. (A Monte Carlo simulation with 500 trials
   gave a figure of 16.3%, a bit more optimistic).
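
   A minimal sketch of that sort of 500-trial Monte Carlo, in Python.
   This version just assumes the 0.24 expected errors per rebuild pass
   is spread evenly over the six surviving drives, with errors
   independent; that simplified model lands around 21%, in the same
   ballpark:

      import math, random

      random.seed(1)
      TRIALS = 500                  # as in the figure quoted above
      DRIVES = 6                    # surviving elements read for the rebuild
      lam = 0.24 / DRIVES           # assumed expected UREs per drive per pass
      p_drive = 1 - math.exp(-lam)  # P(a given drive hits >= 1 URE)

      bad = sum(1 for _ in range(TRIALS)
                if any(random.random() < p_drive for _ in range(DRIVES)))
      print(f"rebuild passes hitting at least one URE: {bad}/{TRIALS} "
            f"({100 * bad / TRIALS:.1f}%)")   # ~21% in expectation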

   +-------------------------+
   | Hs 2  3  4  5  P  Q  1a |     3 is a partially faulted element
   +-------------------------+     1a is being reconstructed
        |  .  |  |  |  |  ^        from 2,3,4,5,P,Q.
        |  .  |  |  |  R->^
        |  .  |  |  R-->--^
        |  .  |  R-->-----^
        |  .  R-->--------^
        |  . . . . . . . .
        R-->--------------+


t5  The RAID controller removes element 3 from the array,
   continues reconstructing the missing element on 1a,
   and schedules Hs to be reconstructed once 1a is restored.
   Array integrity: Partial, RAID has lost one element and is
   rebuilding one element. Array can't sustain loss of any element.
   +-------------------------+
   | Hs 2  F  4  5  P  Q  1a |     F is faulted.
   +-------------------------+     1a is being reconstructed
        |     |  |  |  |  ^        from 2,4,5,P,Q.
        |     |  |  |  R->^        Hs will be reconstructed next.
        |     |  |  R-->--^
        |     |  R-->-----^
        |     R-->--------^
        |                 ^
        R-->--------------+

............................................................
.  Here's the scenario with a Happy outcome. Everything    .
.  works as planned and the sysadmin is home for dinner.   .
............................................................

t6  (Happy outcome)
   The sysadmin replaces the failed drive (element 3) with a new drive.
   The RAID controller marks it ready for reconstruction but won't
   start rebuilding it until the rebuild underway is complete.
   Array integrity: Partial, RAID has lost one element and is
   rebuilding one element. Array can't sustain loss of any element.
   +-------------------------+
   | Hs 2  3a 4  5  P  Q  1a |     1a is being reconstructed.
   +-------------------------+     Hs will be reconstructed next.
        |     |  |  |  |  ^        3a will be reconstructed after Hs.
        |     |  |  |  R->^
        |     |  |  R-->--^
        |     |  R-->-----^
        |     R-->--------^
        |                 ^
        R-->--------------+

   **************************************************************
   * The sysadmin saddles the unicorn and rides off. Well done! *
   **************************************************************


t7  (Happy outcome)
   The rebuild of element 1a finishes at t1 + 5.75 hours.
   The controller starts rebuilding element 3a from the
   active elements 2,4,5,P,Q,1a.
   144 terabits are transferred with no detected errors.
   Array integrity: OK. Array can sustain loss of one element.
   +-------------------------+
   | Hs 2  3a 4  5  P  Q  1a |     3a is being reconstructed.
   +-------------------------+     2,4,5,P,Q,1a are active.
        |  ^  |  |  |  |  |        Hs will be reconstructed next.
        |  ^<-R  |  |  |  |
        |  ^--<--R  |  |  |
        |  ^-----<--R  |  |
        |  ^--------<--R  |
        |  ^              |
        R->+-----------<--R


t8 (Happy outcome)
   At t7 + 5.75 hours the rebuild of element 3a is done.
   7 active elements are up, plus one available hot spare.
   Array integrity: OK. Array can sustain loss of two elements.
   +-------------------------+
   | 1  2  3  4  5  P  Q  Hs |     Elements 1..5 are data, P,Q are parity,
   +-------------------------+     Hs is Hot standby.

.............................................................
.  Now we'll work through a Sad scenario. The rebuilding    .
.  hits a slight (1 bit) snag. Cold pudding for dinner.     .
.............................................................


t6 (Sad outcome)
  The operator replaces the failed drive (element 3) with a new drive.
  The RAID controller marks it ready for reconstruction but won't
  start rebuilding it until the current rebuild is finished.
  Array integrity: Partial, RAID has lost one element and is
  rebuilding one element, and will rebuild a 2nd element soon.
  Array can't sustain loss of an element.
  +-------------------------+
  | Hs 2  3a 4  5  P  Q  1a |     1a is being reconstructed.
  +-------------------------+     3a is marked to be reconstructed.
       |     |  |  |  |  ^
       |     |  |  |  R->^
       |     |  |  R-->--^
       |     |  R-->-----^
       |     R-->--------^
       |                 ^
       R-->--------------+


t7 (Sad outcome)
  During the rebuild of 1a, a *second* drive (element 4) reports
  an uncorrectable single-bit error to the RAID controller.
  The RAID controller removes element 4 from the array.
  The array fails and goes offline. With some effort, probably
  all data except one stripe could be reconstructed using the
  faulted element formerly known as 3, but that's complicated,
  a bit risky, unusual, and takes time.

  How likely is this? Using a Monte Carlo simulation assuming bit
  errors on multiple drives are uncorrelated, and normally
  distributed in the bitstream, running 500 trials the results at
  the 95% confidence level: 36.7 to 66.2 (mean 47.2) rebuilds
  between hitting 2 bit errors from 2 different drives.
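
  Under the same simplified assumptions as the single-error sketch
  above (0.24 expected UREs per rebuild, spread evenly over six
  independent drives), the two-drive case can be simulated the same
  way; with enough trials it comes out to roughly one such event per
  ~50 rebuilds, inside the interval quoted:

     import math, random

     random.seed(2)
     TRIALS = 100_000              # more than 500, just to tighten the estimate
     DRIVES = 6
     p_drive = 1 - math.exp(-0.24 / DRIVES)   # P(a given drive hits >= 1 URE)

     double = sum(1 for _ in range(TRIALS)
                  if sum(random.random() < p_drive for _ in range(DRIVES)) >= 2)
     print(f"rebuilds with UREs on two or more drives: {double}/{TRIALS}")
     print(f"roughly one every {TRIALS / max(double, 1):.0f} rebuilds")   # ~50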

  +-------------------------+
  | Hs 2  3a F  5  P  Q  1a |     F is faulted.
  +-------------------------+     Hs is marked for rebuilding.
       |        |  |  |  ^        3a is new and marked for rebuilding.
       |        |  |  R->^
       |        |  R-->--^
       |        R-->-----^
       |                 ^
       |                 ^
       R-->--------------+


t9 (Sad outcome)
  Restore the array from backups. Up to 28.75 hours is
  needed to write 3TB x 5 data elements to a RAID6, so maybe half that
  is actually needed because the array isn't full, right?
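
  For reference, the 28.75-hour figure works out to streaming the data
  back at the same 145 Mbytes/sec used for the rebuild math -- a quick
  check:

     # 5 data elements x 3 TB, written back at 145 Mbytes/sec.
     data_bytes = 5 * 3e12
     rate_bytes_per_sec = 145e6

     hours = data_bytes / rate_bytes_per_sec / 3600
     print(f"full restore time: {hours:.2f} hours")   # ~28.7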

