>> We encountered a multi-disk failure on one of our mdadm RAID6
>> 8+2 OSTs. 2 drives failed in the array within the space of a
>> couple of hours and were replaced.
There are many reports of multi-drive failures, some pretty impressive, e.g. 10 out of 20 on a long-running array after a restart. Because of common modes that is not unexpected, as failures are not uncorrelated (especially when rebuilding!).

> I guess the need for +3 parity is closer than we think...

Some people are pushing this, and I guess that you are thinking about the arguments here:

  http://blogs.sun.com/ahl/entry/acm_triple_parity_raid

But I think it is simply stupid -- adding more parity makes things slower and less reliable (e.g. more complexity), especially if one takes "advantage" of the false sense of security of more parity to have wider arrays. I'd rather have, in the few cases where it makes sense, a narrower RAID5 than a wider RAID6 (e.g. two 4+1 RAID5s instead of one 8+2 RAID6).

The usual arguments apply:

  http://WWW.BAARF.com/

plus that "stupid" is usually rewarded by "management", who see the obvious reduction in cost but don't see the reductions in performance, simplicity and reliability.

Note that one argument in the page above is "fills a niche", and as long as it is acknowledged that it is a minuscule niche, that is fine; but then "need for +3 parity" is a rather wider statement.

If an 8+2 array had 2 drive failures, perhaps instead of looking at more parity it would be better to look at common modes of failure; and not just vibration, heat or electrical common modes, but also the thoroughly moronic practice of many RAID vendors (e.g. EMC, DDN, NexSAN by my direct experience, but most/all do that) of putting into their arrays drives not only of the same manufacturer and model, but even with nearly consecutive serial numbers from the same delivery and even the same carton.
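To put some numbers on the "two 4+1 RAID5s instead of one 8+2 RAID6" comparison above, here is a minimal counting sketch over the same 10 drives. It only enumerates which sets of concurrently failed drives each layout survives. The function names and the drive numbering (0-4 in one RAID5 set, 5-9 in the other) are illustrative assumptions, and every combination is treated alike, so it deliberately says nothing about how likely each combination is, which is exactly where common modes come in.

```python
from itertools import combinations

def raid6_8p2_survives(failed):
    # Assumption: one 10-drive RAID6 set, which tolerates any
    # two concurrent drive failures.
    return len(failed) <= 2

def two_raid5_4p1_survive(failed):
    # Assumption: drives 0-4 form one 4+1 RAID5 set and drives 5-9
    # the other; each set tolerates a single failure.
    first = sum(1 for d in failed if d < 5)
    second = len(failed) - first
    return first <= 1 and second <= 1

def surviving_fraction(survives, n_failed, n_drives=10):
    # Return (survivable combinations, total combinations) for
    # n_failed concurrent failures among n_drives drives.
    combos = list(combinations(range(n_drives), n_failed))
    ok = sum(1 for c in combos if survives(c))
    return ok, len(combos)

for k in (1, 2, 3):
    print(k,
          surviving_fraction(raid6_8p2_survives, k),
          surviving_fraction(two_raid5_4p1_survive, k))
```

With two concurrent failures the single 8+2 set survives all 45 drive pairs, while the two 4+1 sets survive the 25 pairs that straddle both sets; with three, neither layout survives any combination. The count ignores rebuild time, array width during rebuild and correlated failures, so it is only the raw combinatorial side of the trade-off, not a reliability estimate.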
And in any case if one uses something like Lustre 1.x, which is a parallel metafilesystem with no data redundancy (and for very good reasons, and mirroring in 2.x is something that I have very mixed feelings about), using parity RAID is doubly stupid, as the storage layer has to provide all the redundancy.

And in any case one cannot build storage systems that never fail; what matters more is what happens when they do fail. On this, fortunately, Lustre does pretty well.

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss