Hello all,
I have been kind of following this thread. I have a question about MTBF. I have four HGST UltraStar Enterprise 2TB drives set up in a hardware RAID 10 configuration. If the MTBF is 100,000 hours for each drive, does this mean that the total MTBF is 25,000 hours?
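(Sketching the arithmetic I have in mind in Python -- the exponential failure model and the 100,000 hour figure are just my assumptions, not anything from the HGST data sheet:)

    # Back-of-envelope MTBF arithmetic.  Assumptions: four independent drives,
    # exponentially distributed failures, MTBF = 100,000 hours each.
    drive_mtbf_h = 100_000.0
    n_drives = 4

    # Expected time until *any one* of the four drives fails (series view):
    mttf_first_failure = drive_mtbf_h / n_drives
    print(f"Time to first failure of any drive: {mttf_first_failure:,.0f} h")  # 25,000 h

    # Note: 25,000 h is the expected time to the *first* failure of any drive,
    # not the time to data loss.  RAID 10 only loses data if a second failure
    # lands in the same mirror pair before the first drive is replaced and
    # rebuilt.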
GOD Bless and Thanks,
rich!

On 3/28/2018 6:33 AM, Paul Koning via cctalk wrote:

On Mar 27, 2018, at 8:51 PM, Fred Cisin via cctalk <cctalk@classiccmp.org> 
wrote:

Well outside my realm of expertise (as if I had a realm!), . . .

How many drives would you need to be able to set up a RAID, or hot-swappable
RAUD (Redundant Array of Unreliable Drives), that could give decent
reliability with such drives?

How many to avoid data loss if a second one dies before the first casualty
is replaced?
How many to avoid data loss if a third one dies before the first two are
replaced?
These are straightforward questions of probability math, but it takes some time 
to get the details right.  For one thing, you need believable numbers for the 
underlying error probabilities.  And you have to analyze the cases carefully.

The basic assumption is that failures are "fail stop", i.e., a drive refuses to deliver 
data.  (In particular, it doesn't lie -- deliver wrong data.  You can build systems that deal with 
lying drives but RAID is not such a system.)  The failure may be the whole drive ("it's a 
door-stop") or individual blocks (hard read errors).

In either case, RAID-1 and RAID-5 handle single faults.  RAID-6 isn't a single 
well-defined thing but as normally defined it is a system that handles double 
faults.  So a RAID-1 system with a double fault may fail to give you your data. 
 (It may also be ok -- it depends on where the faults are.)  RAID-5 ditto.

The tricky part is what happens when a drive breaks.  Consider RAID-5 with a 
single dead drive, and the others are 100% ok.  Your data is still good.  When 
the broken drive is replaced, RAID rebuilds the bits that belong on that drive. 
 Once that rebuild finishes, you're once again fault tolerant.  But a second 
failure prior to rebuild completion means loss of data.

So one way to look at it: given the MTBF, calculate the probability of two 
drives failing within N hours (where N is the time required to replace the 
failed drive and then rebuild the data onto the new drive).  But that is not 
the whole story.
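A minimal sketch of that first calculation (the MTBF, array size, and rebuild window below are made-up illustrative numbers, and failures are assumed independent and exponentially distributed):

    import math

    # Illustrative assumptions -- not from any particular drive's spec sheet.
    mtbf_h = 100_000.0      # per-drive MTBF, hours
    n_drives = 5            # e.g. a 5-drive RAID-5 set
    rebuild_window_h = 24   # hours to replace the dead drive and rebuild it

    # After one drive has failed, (n_drives - 1) survivors keep running.
    # With exponential failures, P(a given survivor fails within t hours)
    # is 1 - exp(-t / MTBF); the chance that at least one of them does:
    survivors = n_drives - 1
    p_one = 1 - math.exp(-rebuild_window_h / mtbf_h)
    p_second_failure = 1 - (1 - p_one) ** survivors
    print(f"P(second failure during a {rebuild_window_h} h rebuild window): "
          f"{p_second_failure:.2e}")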

The other part of the story is that drives have a non-zero probability of a 
hard read error.  So during rebuild, you may encounter a sector on one of the 
remaining drives that can't be read.  If so, that sector is lost.

The probability of hard read error varies with drive technology.  And of 
course, the larger the drive, the greater the probability (all else being 
equal) of having SOME sector be unreadable.  For drives small enough to have 
PATA interfaces, the probability of hard read error is probably low enough that 
you can *usually* read the whole drive without error.  That translates to: 
RAID-1 and RAID-5 are generally adequate for PATA disks.
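To put rough numbers on that (the unrecoverable-read-error rate and the drive sizes below are illustrative assumptions in the usual published ballpark, not figures from any particular data sheet):

    # P(at least one hard read error while reading a whole drive), assuming
    # independent errors at a constant rate per bit read.
    def p_some_sector_unreadable(capacity_bytes: float, ure_per_bit: float) -> float:
        bits_read = capacity_bytes * 8
        return 1 - (1 - ure_per_bit) ** bits_read   # 1 - P(every bit reads cleanly)

    ure = 1e-14   # ~1 error per 10^14 bits read, a commonly quoted order of magnitude
    print(f"120 GB PATA-era drive: {p_some_sector_unreadable(120e9, ure):.1%}")
    print(f"12 TB modern drive:    {p_some_sector_unreadable(12e12, ure):.1%}")

With those assumed numbers the small drive comes out around 1%, the large one well over half -- which is the difference the paragraph above is pointing at.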

On the very large drives currently available, it's a different story, and the 
published drive specs make this quite clear.  This is why RAID-6 is much more 
popular now than it was earlier.  It isn't the probability of two nearly 
simultaneous drive failures, but rather the probability of a hard sector read 
error while a drive has failed, that argues for the use of RAID-6 in modern 
storage systems.
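A sketch of that argument in numbers (again with assumed figures; only the order of magnitude matters):

    import math

    # Rebuild exposure with one big drive dead: RAID-5 must read every sector
    # of every surviving drive to rebuild.  Assumed, illustrative figures only.
    def p_ure(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
        """P(at least one unreadable sector over this many bytes), Poisson approx."""
        return 1 - math.exp(-bytes_read * 8 * ure_per_bit)

    drive_bytes = 12e12   # 12 TB drives
    survivors = 5         # a 6-drive RAID-5 set with one drive down

    print(f"P(hard read error somewhere during the RAID-5 rebuild): "
          f"{p_ure(drive_bytes * survivors):.0%}")
    # RAID-6 keeps a second redundancy, so a single unreadable sector found
    # while one drive is down can still be reconstructed rather than lost.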

        paul



