begin quoting Tracy R Reed as of Tue, Feb 20, 2007 at 10:06:19PM -0800:

> A couple of very interesting studies have come out recently about the
> reliability of hardware, specifically disks:
>
> This site summarizes it nicely:
> http://storagemojo.com/?p=378
>
> With Google's actual paper here:
> http://labs.google.com/papers/disk_failures.pdf
>
> But perhaps more interesting is this one:
> http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
>
> also summarized here:
> http://storagemojo.com/?p=383

Saw these on /.!

> I have read the entirety of both of the papers. Executive summary of
> these two papers:
>
> - MTBF is useless

MTBF numbers provided by the vendor are often overstated.
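To put a number on "overstated": a back-of-the-envelope sketch (my own
figures, not from either paper) of the annualized failure rate a vendor
MTBF spec implies, assuming 24x7 operation and exponentially
distributed failures:

    import math

    HOURS_PER_YEAR = 24 * 365

    def afr_from_mtbf(mtbf_hours):
        # AFR = 1 - exp(-hours_per_year / MTBF) under the
        # constant-hazard (exponential) model vendor specs assume.
        return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

    # A typical datasheet claim: 1,000,000-hour MTBF.
    print("%.2f%%" % (100 * afr_from_mtbf(1000000)))  # ~0.87%

Both studies saw real-world replacement rates on the order of 2-4% a
year (sometimes much higher), i.e. an effective MTBF several times
lower than the spec sheet.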
> - SCSI, FC, SATA, ATA are equally reliable (surprising and embarrassing
> to people who have spent big bucks on SCSI/FC but both studies came to
> the same conclusion based on 100,000 disks each)

Well, manufacturers have been providing the same disk mechanism
underneath, and if that's what's failing (instead of the controller
cards), that's not surprising. SCSI & FC have some slight performance
advantages, but this isn't worth it for most home consumer use, where
the machines are drastically over-powered anyway.

> - Reliability between "enterprise" and "consumer" drives is the same

Again, same disk mechanism underneath. Why make two products when you
can make just one, and change the label?

> - Temp up to 40C doesn't make much difference in reliability (surprising!)

40C isn't _that_ warm. Too cool and the spindle lubricant will be less
than effective... Who wants to fund a university to test, in controlled
environments, the effect of temperature ranges? They'll need 100,000+
drives, and need to build a bunch of temperature-controlled
environments in which to run 'em.... :)

> - SMART is semi-useful for predicting failures but only catches half
> of failures
>
> - There seems to be no correlation between workload and failure rate

This one is surprising. (To me, at least.)

> - The chances of a double failure in a RAID5 are much greater than we
> think. It seems mirroring remains a good idea. I didn't quite understand

A double-disk failure in a mirror is just as bad.

> all of the reasoning in the second paper about long-term
> auto-correlation and decreasing hazard rates. They seem to say that,
> statistically speaking, one disk failure now suggests a greater chance
> of another disk failure coming soon, which bodes ill for RAID5. IIRC a
> fellow KPLUGger had a double failure in a RAID5 this week.

This seems to bear out the old folk wisdom of "buy disks from separate
lots". This gets expensive unless you're running three arrays.
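How much greater than "we think"? The naive math assumes independent
failures. A toy sketch (my own numbers, and only the optimistic floor,
since the paper's autocorrelation result says failures are *not*
independent):

    import math

    HOURS_PER_YEAR = 24 * 365

    def p_second_failure(n_drives, afr, rebuild_hours):
        # Per-drive hazard rate implied by the AFR (exponential model).
        rate = -math.log(1 - afr) / HOURS_PER_YEAR
        # P(at least one of the n-1 surviving drives dies during the
        # rebuild window) -- a second loss kills a RAID5.
        return 1 - math.exp(-(n_drives - 1) * rate * rebuild_hours)

    # 8-drive array, 3% AFR, 24-hour rebuild:
    print("%.3f%%" % (100 * p_second_failure(8, 0.03, 24)))  # ~0.058%

With positively correlated failures (same batch, same shelf, same
power and vibration and heat), the real probability sits above this
independent-failure floor -- which is what the decreasing-hazard and
autocorrelation discussion is getting at.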
> - There is no infant mortality phase for drives nor is there a
> particular age at which they tend to die (no "bathtub curve" typical
> for consumer products). Rate of drive failure is initially low but
> steadily increases as they age.

Which was surprising. Drives are supposed to automatically handle
certain sorts of failures (bad sectors), up to a point, which would
seem to create a bathtub curve.

> Unfortunately google wussed out and won't tell us whose drives are the
> most/least reliable. The second study didn't mention this either. I
> guess they are afraid of getting sued.

Or that if they did, nobody would buy 'em, and they'd go out of
business, letting the remaining vendors jack up their prices in a less
competitive market. I suspect they flip-flop as "best", or that, even
in the long run, there's no statistically significant difference
between the vendors across their entire product lines.

-- 
I have 15-year-old 40MB SCSI disks that aren't quite dead yet.
Stewart Stremler

-- 
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list