On 4/19/13 4:38 PM, "mathog" <[email protected]> wrote:

>Joe Landman <[email protected]> wrote
>
>> Use AFR and warranty, ignore everything else. MTBF does not
>> correlate at all against AFR, and AFR is an objective measure.
>
>MTBF is the inverse of the AFR times the number of hours in a year.
><snip>
>The ratings I would really like the industry to use might be called
>ef1, ef5, and ef10, where each is the percent of disks that are
>Expected to be Functioning (defined as: works at full rated speed,
>has suffered zero data-losing events, and still has unused blocks
>available) at the end of the specified number of years. It would be
>really easy to compare disks with that system. With AFR etc., not so
>much.
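For what it's worth, that AFR/MTBF relationship is just a unit
conversion. A minimal sketch in Python (the 0.7% AFR and 1.2M-hour
MTBF below are made-up example values, not any vendor's figures):

    # Sketch of the AFR <-> MTBF relationship described above.
    # The 0.7% AFR and 1.2M-hour MTBF are made-up example numbers.

    HOURS_PER_YEAR = 8766.0   # 365.25 days

    def afr_to_mtbf(afr):
        """Annualized failure rate (fraction, e.g. 0.007) -> MTBF in hours."""
        return HOURS_PER_YEAR / afr

    def mtbf_to_afr(mtbf_hours):
        """MTBF in hours -> annualized failure rate (fraction)."""
        return HOURS_PER_YEAR / mtbf_hours

    print(afr_to_mtbf(0.007))      # ~1.25 million hours for a 0.7% AFR
    print(mtbf_to_afr(1200000))    # ~0.0073, i.e. about a 0.73% AFR
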
You can get that information, but it costs a lot and would likely be
bound up in NDAs. It is the stock in trade of a disk drive
manufacturer, and you can bet that internally they know a LOT about
failure modes, rates, etc. (without even going to the extremes
described in Crichton's "Rising Sun").

If you could show that knowing this information publicly would make a
significant difference in cluster engineering, it should be possible
to get funding to do the experiment yourself: buy 100 drives, run
them at different temperatures, and so on, and repeat for various
models. (This is what Google has done internally, and they don't
publish the data because it is a strategic advantage; I'm sure Amazon
has done the same.)

The problem is that I think, for run-of-the-mill cluster (or data
center) building, what the vendors publish is "good enough". That is,
you figure on a refresh cycle of three years, buy drives with a
four-year warranty, and ignore MTBF.

I also have to comment that nobody who actually uses MTBF numbers
(e.g. DoD) actually believes the numbers. Rather, they tend to be
used as a way to compare different designs: run the analysis on
design A and it gets 49,000 hours; run it on design B and it gets
20,000 hours. Neither might actually last that many hours, but A is
clearly better than B. Actually doing a MIL-HDBK-217 analysis is
tedious, and in practice everyone has their own schemes for derating,
etc. There's also the whole logical AND/OR aspect of failure analysis
if there's any redundancy in the system.

Consider that you have 100 parts on a board, you get MTBF numbers for
each, and then you combine them to get a board-level MTBF number. But
did you assume all those parts are at the same temperature, or did
you analyze the temperature distribution across the board and assign
shorter MTBFs to the hotter parts (in accordance with the appropriate
scaling laws)?
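
To make that last point concrete, here's a rough sketch (not a real
MIL-HDBK-217 calculation; the part counts, base failure rates,
temperatures, and the "doubles every 10 C" rule are all made-up
illustrative assumptions) of how per-part temperature changes the
combined board-level number:

    # Rough sketch: combine per-part failure rates into a board-level MTBF,
    # with a crude temperature derating. All numbers, and the "failure rate
    # doubles every 10 C" rule, are illustrative assumptions.

    def derated_failure_rate(base_fr_per_hour, part_temp_c, ref_temp_c=40.0):
        # Assume the failure rate doubles for every 10 C above the reference.
        return base_fr_per_hour * 2.0 ** ((part_temp_c - ref_temp_c) / 10.0)

    # (base failure rate per hour, operating temperature in C) for each part;
    # series model: any single part failing fails the board (no redundancy).
    parts = [(1.0e-7, 45.0)] * 60 + [(5.0e-7, 55.0)] * 30 + [(2.0e-6, 70.0)] * 10

    board_fr = sum(derated_failure_rate(fr, t) for fr, t in parts)
    print("board MTBF, per-part temps: %.0f hours" % (1.0 / board_fr))

    # Same parts, naively assuming everything sits at the 40 C reference:
    naive_fr = sum(fr for fr, _ in parts)
    print("board MTBF, uniform 40 C:   %.0f hours" % (1.0 / naive_fr))

In this toy example the ten hottest parts dominate the combined
figure (roughly 4,700 hours versus 24,000 hours for the uniform
assumption), which is exactly why a uniform-temperature shortcut can
be wildly optimistic.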
