Tracy R Reed wrote:
> A couple of very interesting studies have come out recently about the
> reliability of hardware, specifically disks:
>
> This site summarizes it nicely:
> http://storagemojo.com/?p=378
>
> With Google's actual paper here:
> http://labs.google.com/papers/disk_failures.pdf
>
> But perhaps more interesting is this one:
> http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
>
> also summarized here:
> http://storagemojo.com/?p=383
>
> I have read both papers in their entirety. Executive summary of the two:
>
> - MTBF is useless.
>
> - SCSI, FC, SATA, and ATA are equally reliable (surprising and
> embarrassing to people who have spent big bucks on SCSI/FC, but both
> studies came to the same conclusion based on about 100,000 disks each).
>
> - Reliability of "enterprise" and "consumer" drives is the same.
>
> - Temperatures up to 40C don't make much difference in reliability
> (surprising!).
>
> - SMART is semi-useful for predicting failures, but only catches about
> half of them.
>
> - There seems to be no correlation between workload and failure rate.
>
> - The chance of a double failure in a RAID5 is much greater than we
> think. It seems mirroring remains a good idea. I didn't quite understand
> all of the reasoning in the second paper about long-term
> autocorrelation and decreasing hazard rates, but they seem to say that,
> statistically speaking, one disk failure now suggests a greater chance
> of another disk failure coming soon, which bodes ill for RAID5. IIRC a
> fellow KPLUGger had a double failure in a RAID5 this week.
>
> - There is no infant-mortality phase for drives, nor is there a
> particular age at which they tend to die (no "bathtub curve" typical of
> consumer products). The rate of drive failure is initially low but
> steadily increases as they age.
>
> Unfortunately Google wussed out and won't tell us whose drives are the
> most/least reliable. The second study didn't mention this either. I
> guess they are afraid of getting sued.
>
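The RAID5 double-failure point can be put in rough numbers. Below is a minimal sketch of the naive estimate, assuming fully independent failures and using made-up illustrative values for the annual failure rate and rebuild window (neither number is from the papers). Note that the Schroeder/Gibson finding of correlated failures means the real risk is higher than what independence predicts:

```python
# Naive RAID5 second-failure estimate under an independence assumption.
# The failure rate and rebuild time below are illustrative, not from
# the Google or Schroeder/Gibson papers.

ANNUAL_FAILURE_RATE = 0.03   # ~3% AFR, illustrative
REBUILD_HOURS = 24.0         # time to rebuild onto a spare, illustrative
HOURS_PER_YEAR = 24 * 365

def p_second_failure(surviving_disks,
                     afr=ANNUAL_FAILURE_RATE,
                     rebuild_hours=REBUILD_HOURS):
    """Chance that at least one surviving disk fails during the
    rebuild window, assuming failures are independent."""
    # Per-disk chance of failing within the rebuild window.
    p_one = afr * rebuild_hours / HOURS_PER_YEAR
    # Complement of "all survivors make it through the rebuild".
    return 1 - (1 - p_one) ** surviving_disks

# 8-disk RAID5: one disk is dead, 7 survivors must last the rebuild.
print("%.5f" % p_second_failure(7))
```

The number this prints looks comfortably small, which is exactly the point: if a first failure raises the odds of a second (as the second paper argues), this independence-based estimate is optimistic, and mirroring starts to look better than parity.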
I also read the PDF from Google, and found it both very readable and very interesting. The lack of confirmation of traditional guidelines on temperature and activity (after infant-mortality effects are weeded out) is especially surprising. The inadequacy of SMART data as a model for predicting failure is almost as notable. I glanced at the second paper and it looks harder to get into (is that just me?).

One of the points they didn't seem to grasp (or feel worth emphasizing) is that returned drives testing "no problem" do not mean there _was_ no problem. In fact, I have personally seen lots of cases where repeated failures at run time were not reproducible after a power cycle. I have also seen weird, apparently vibration-caused problems due to poor mounting mechanics.

Anyway, IT can probably throttle back the A/C, eh?

Regards,
..jim

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list
