On second thought, let me explain further why I included the Linux link in 
the same post.

That was written a while ago, but I think the situation for cheap RAID 
cards has not changed much, though the RAID ASICs in RAID enclosures are 
getting more and more robust, just not more "open".

If you take risk management into consideration, that range of failure 
probability is simply too much to accept when the requirement is not just 
access to the data, but access to the correct data.
We are talking about a 0.01% downtime budget for a four-nines SLA (where 
"uptime" may well be defined as "accessing the correct data").
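To put the four-nines figure in perspective, here is the standard SLA 
arithmetic as a quick Python sketch (the function name is mine, not from 
any SLA tooling):

```python
# Downtime budget implied by an availability SLA of N nines.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an SLA of the given number of nines."""
    availability = 1 - 10 ** -nines   # e.g. 4 nines -> 0.9999
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 4, 5):
    print(f"{n} nines: {downtime_budget_minutes(n):.2f} min/year")
```

A four-nines SLA leaves roughly 52.56 minutes of downtime per year; half a 
day of waiting blows through a year's budget many times over.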

You and I can wait half a day through a network failure and the world turns 
just fine, but Joe Tucci cannot.
Not to mention the additional solution ($$$) that must be implemented to 
handle possible user operational errors for high-risk users. [Still not a 
business case for you and me; not so sure about Mr. Tucci, though.  ;-) ]

best,
z




http://www.nber.org:80/sys-admin/linux-nas-raid.html

Let's repeat the reliability calculation with our new knowledge of the 
situation. In our experience perhaps half of drives have at least one 
unreadable sector in the first year. Again assume a 6 percent chance of a 
single failure. The chance of at least one of the remaining two drives 
having a bad sector is 75% (1-(1-.5)^2). So the RAID 5 failure rate is about 
4.5%/year, which is .5% MORE than the 4% failure rate one would expect from 
a two drive RAID 0 with the same capacity. Alternatively, if you just had 
two drives with a partition on each and no RAID of any kind, the chance of a 
failure would still be 4%/year but with only half the data lost per 
incident, which is considerably better than the RAID 5 can hope for under 
its current reconstruction policy, even with the most expensive hardware.
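The arithmetic above is easy to reproduce. A small sketch (plain Python, 
using the article's own numbers: a 6% annual chance that one of the three 
drives fails, and a 50% chance that any given drive has at least one 
unreadable sector; the ~2%/year per-drive figure for the RAID 0 case is 
inferred from the article's 4% two-drive total):

```python
# RAID 5 (3 drives) vs. RAID 0 (2 drives) annual data-loss risk,
# using the article's assumptions.
p_single_failure = 0.06   # chance one of the three drives fails in a year
p_bad_sector = 0.5        # chance a given drive has >= 1 unreadable sector

# After one drive fails, reconstruction must read both survivors in full;
# a single bad sector on either one aborts the rebuild.
p_rebuild_hits_bad_sector = 1 - (1 - p_bad_sector) ** 2   # 0.75

p_raid5_loss = p_single_failure * p_rebuild_hits_bad_sector  # 0.045

# Two-drive RAID 0 of the same capacity: ~2%/year failure per drive,
# and losing either drive loses the array.
p_drive = 0.02
p_raid0_loss = 1 - (1 - p_drive) ** 2                        # ~0.0396

print(f"RAID 5 loss/year: {p_raid5_loss:.2%}")
print(f"RAID 0 loss/year: {p_raid0_loss:.2%}")
```

The point of the exercise: under this reconstruction policy the "redundant" 
array is slightly more likely to lose everything than the array with no 
redundancy at all.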
We don't know what the reconstruction policy is for other RAID controllers, 
drivers or NAS devices. None of the boxes we bought acknowledged this 
"gotcha", but none promised to avoid it either. We assume NetApp and ECCS
have this under control, since we have had several single drive failures on 
those devices with no difficulty resyncing. We have not had a single drive 
failure yet in the MVD based boxes, so we really don't know what they will 
do. [Since that was written we have had such failures, and they were able to 
reconstruct the failed drive, but we don't know if they could always do so].

Some mitigation of the danger is possible. You could read and write the 
entire drive surface periodically, and replace any drives with even a single 
uncorrectable block that becomes visible. A daemon, smartd, is available for 
Linux that will scan the disk in the background for errors and report them. 
We had been
running that, but ignored errors on unwritten sectors, because we were used 
to such errors disappearing when the sector was written (and the bad sector 
remapped).

Our current inclination is to shift to a recent 3ware controller, which we 
understand has a "continue on error" rebuild policy available as an option 
in the array setup. But we would really like to know more about just what 
that means. What do the apparently similar RAID controllers from Mylex, LSI 
Logic and Adaptec do about this? A look at their web sites reveals no 
information.


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
