Tracy R Reed wrote:
> A couple of very interesting studies have come out recently about the
> reliability of hardware, specifically disks:
> 
> This site summarizes it nicely:
> http://storagemojo.com/?p=378
> 
> With Google's actual paper here:
> http://labs.google.com/papers/disk_failures.pdf
> 
> But perhaps more interesting is this one:
> http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
> 
> also summarized here:
> http://storagemojo.com/?p=383
> 
> I have read the entirety of both of the papers. Executive summary of
> these two papers:
> 
> - MTBF is useless
> 
> - SCSI, FC, SATA, ATA are equally reliable (surprising, and
> embarrassing to people who have spent big bucks on SCSI/FC, but both
> studies came to the same conclusion based on about 100,000 disks each)
> 
> - "Enterprise" and "consumer" drives are equally reliable
> 
> - Temp up to 40C doesn't make much difference in reliability (surprising!)
> 
> - SMART is semi-useful for predicting failures, but it flags only
> about half of them
> 
> - There seems to be no correlation between workload and failure rate
> 
> - The chances of a double failure in a RAID5 are much greater than we
> think. It seems mirroring remains a good idea. I didn't quite understand
> all of the reasoning in the second paper about long-term
> auto-correlation and decreasing hazard rates. They seem to say that,
> statistically speaking, one disk failure now suggests a greater chance
> of another failure coming soon, which bodes ill for RAID5. IIRC a
> fellow KPLUGger had a double failure in a RAID5 this week.
> 
> - There is no infant-mortality phase for drives, nor is there a
> particular age at which they tend to die (no "bathtub curve" of the
> kind typical for consumer products). The failure rate is initially low
> but climbs steadily as the drives age.
> 
> Unfortunately Google wussed out and won't tell us whose drives are the
> most/least reliable. The second study didn't mention this either. I
> guess they are afraid of getting sued.
> 

I also read the PDF from Google, and found it both very readable and
very interesting. The lack of confirmation of traditional guidelines on
temperature and activity (after infant-mortality effects are weeded
out) is especially surprising. The inadequacy of SMART data as a model
for predicting failure is almost as notable.
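
It's still cheap to keep polling it, though. A minimal sketch of a
health-check loop, just to show the idea (assumes smartmontools is
installed; the device names are made up for illustration):

#!/usr/bin/env python
# Poll smartctl's overall health verdict for a list of drives. Per the
# Google paper this flags only about half of impending failures, but
# it costs nothing to check. Device names below are illustrative.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]  # adjust for your machine

for dev in DEVICES:
    # -H asks smartctl for the overall health self-assessment
    result = subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True)
    verdict = "PASSED" if "PASSED" in result.stdout else "CHECK ME"
    print(f"{dev}: {verdict}")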

I glanced at the second paper and it looks harder to get into (is that
just me?).
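
The double-failure point did sink in, though. Here's a back-of-the-
envelope sketch of why the correlation matters (all numbers are made up
for illustration, not taken from either paper):

# After one disk in an n-disk RAID5 dies, what's the chance a second
# one dies before the rebuild finishes?
n = 8                 # disks in the array (hypothetical)
afr = 0.03            # annual failure rate per disk (hypothetical)
rebuild_days = 1.0    # time spent running degraded (hypothetical)

# Chance any one surviving disk dies during the rebuild window
p_disk = afr * rebuild_days / 365.0
# If failures were independent:
p_independent = 1 - (1 - p_disk) ** (n - 1)

# The second paper's auto-correlation finding says one failure makes
# another near-term failure more likely; fold that in as a crude
# multiplier (hypothetical value).
correlation_factor = 10.0
p_correlated = min(1.0, p_independent * correlation_factor)

print(f"independent model: {p_independent:.5f}")
print(f"with correlation : {p_correlated:.5f}")

Even a modest correlation multiplier eats a lot of the margin RAID5 is
supposed to give you, which I take to be their point.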

One of the points they didn't seem to grasp (or feel worth emphasizing)
is that a returned drive testing "no problem" does not mean there _was_
no problem. In fact, I have personally seen lots of cases where
repeated failures at run time were not reproducible after a power
cycle. I have also seen weird problems apparently caused by vibration
from poor mounting mechanics.

Anyway, IT can probably throttle back the A/C, eh?
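
One last aside: the "failure rate starts low and climbs with age"
finding is what a Weibull hazard with shape > 1 looks like, if anyone
wants to play with it (parameters are purely illustrative, not fitted
to either study's data):

def weibull_hazard(t, shape=1.5, scale=5.0):
    # h(t) = (shape/scale) * (t/scale)**(shape-1); rises when shape > 1,
    # which is the opposite of the bathtub curve's early dip
    return (shape / scale) * (t / scale) ** (shape - 1)

for years in (1, 2, 3, 4, 5):
    print(f"year {years}: hazard ~ {weibull_hazard(years):.3f}")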

Regards,
..jim

