I read them as soon as they were available. Then I shrugged and noted YMMV to myself.

1= Those studies are valid for =those= users under =those= users' circumstances in =those= users' environments.
How well do those circumstances and environments mimic anyone else's?
I don't know, since the studies did not document them in enough detail (and it would be nigh unto impossible to do so) for me to compare mine to theirs. I =do= know that neither Google's nor a university's nor an ISP's nor an HPC supercomputing facility's NOC is particularly similar to, say, a financial institution's or a health care organization's NOC.
...and they had better not be.  Ditto the behavior of the personnel working them.

You yourself have said that environmental factors make a big difference. I agree. I submit that, by the same token, the differences in those environmental factors from one site to another are just as significant.


2= I'll bet all the money in your pockets vs. all the money in my pockets that people are going to leap at the chance to use these studies as yet another excuse to pinch IT spending further. In the process they are, consciously or unconsciously, going to imitate some or all of the environments that were used in those studies. Which IMHO is exactly wrong for most mission-critical functions in most non-university organizations.

While we can't all pamper our HDs to the extent that Richard Troy's organization can, frankly that is much closer to the way things should be done for most organizations. Ditto Greg Smith's =very= good habit: "I scan all my drives for reallocated sectors, and the minute there's a single one I get e-mailed about it and get all the data off that drive pronto. This has saved me from a complete failure that happened within the next day on multiple occasions."
Amen.
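
For anyone who wants to wire up that habit, here's a minimal sketch of the idea in Python (my sketch, not Greg's actual setup). It assumes smartmontools is installed; the device list and the alert action are placeholders you'd swap for your own.

#!/usr/bin/env python3
# Sketch: flag any drive whose SMART Reallocated_Sector_Ct raw value is non-zero.
# Assumes smartmontools is installed; DEVICES and the alert action are placeholders.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]  # hypothetical device list

def reallocated_sectors(device):
    """Return the raw Reallocated_Sector_Ct value reported by 'smartctl -A'."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(line.split()[-1])   # raw value is the last column
    return 0  # attribute not reported (e.g. some SAS/NVMe drives)

if __name__ == "__main__":
    for dev in DEVICES:
        count = reallocated_sectors(dev)
        if count > 0:
            # Replace this print with whatever e-mails or pages you.
            print("ALERT: %s reports %d reallocated sectors -- "
                  "get the data off that drive now." % (dev, count))

Run something like that from cron every few minutes and point the alert at your pager of choice.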

I'll make the additional bet that no matter what they say neither Google nor the CMU places had to deal with setting up and running environments where the consequences of data loss or data corruption are as serious as they are for most mission critical business applications. =Especially= DBMSs in such organizations. If anyone tried to convince me to run a mission critical or production DBMS in a business the way Google runs their HW, I'd be applying the clue-by-four liberally in "boot to the head" fashion until either they got just how wrong they were or they convinced me they were too stupid to learn.
At which point they are never touching my machines.


3= From the CMU paper:
"We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years." "In our data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks. =This may indicate that disk independent factors, such as operating conditions, usage and environmental factors, affect replacement=." (emphasis mine)

If you look at the organizations in these two studies, you will note that one thing they all have in common is that they are organizations that tend to push the environmental and usage envelopes. Especially with regards to anything involving spending money. (Google is an extreme even in that group). What these studies say clearly to me is that it is possible to be penny-wise and pound-foolish with regards to IT spending... ...and that these organizations have a tendency to be so.
Not a surprise to anyone who's worked in those environments, I'm sure.
The last thing the IT industry needs is for everyone to copy these organizations' IT behavior!


4= Tom Lane is of course correct that vendors burn in their HDs enough before selling them to get past most infant mortality. Then, any time an HD is shipped between organizations, it is usually burned in again to detect and possibly deal with issues caused by shipping. That's enough to ensure that the end operating environment is not going to see a bathtub-curve failure rate. Then environmental, usage, and maintenance factors further distort both the shape and size of the statistical failure curve.
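
To make that concrete, here's a toy illustration (all numbers invented purely for shape, not taken from either study) of how the bathtub curve is usually modeled: a decreasing infant-mortality hazard, a constant random-failure hazard, and an increasing wear-out hazard. Burn-in just moves the drive past the steep left-hand slope before it enters service.

# Toy bathtub-curve illustration; every shape/scale number below is made up.
def weibull_hazard(t_hours, shape, scale):
    """Weibull hazard h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t_hours / scale) ** (shape - 1)

def bathtub_hazard(t_hours):
    infant  = weibull_hazard(t_hours, shape=0.5, scale=20000.0)  # early defects (decreasing)
    random_ = 1.0 / 1000000.0                                    # constant background rate
    wearout = weibull_hazard(t_hours, shape=3.0, scale=60000.0)  # aging (increasing)
    return infant + random_ + wearout

BURN_IN_HOURS = 500  # hypothetical vendor burn-in period
for age in (1, 100, 1000, 10000, 40000):
    print("age %6d h: fresh %.2e /h, burned-in %.2e /h"
          % (age, bathtub_hazard(age), bathtub_hazard(age + BURN_IN_HOURS)))

The burned-in column starts out much lower because the infant-mortality term has already decayed; what's left is the flat middle plus the wear-out ramp, which is exactly the part that environment, usage, and maintenance then stretch or steepen.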


5= The major conclusion of the CMU paper is !NOT! that we should buy the cheapest HDs we can because HD quality doesn't make a difference. The important conclusion is that a very large segment of the industry operates its equipment far enough outside the manufacturers' specifications that we need a new error rate model for end use. I agree. Regardless of what Seagate et al. can do in their QA labs, we need reliability numbers that are actually valid ITRW of HD usage.
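
As a back-of-the-envelope illustration of that gap, here's a small sketch that converts a datasheet MTTF into the nominal annualized failure rate it implies (under the constant-failure-rate assumption the datasheet is built on) and compares it to a field replacement rate. The field numbers below are placeholders, not figures from either paper.

import math

HOURS_PER_YEAR = 8760

def datasheet_afr(mttf_hours):
    """Annualized failure rate implied by a vendor MTTF under a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)

def observed_arr(replacements, drive_years):
    """Annualized replacement rate actually seen in a given environment."""
    return float(replacements) / drive_years

vendor_mttf = 1000000  # hours, a typical datasheet claim
print("datasheet AFR: %.2f%%" % (100 * datasheet_afr(vendor_mttf)))
print("observed ARR : %.2f%%  (placeholder field numbers)" % (100 * observed_arr(30, 1000)))

The point is not the particular numbers; it's that the left-hand figure comes from lab conditions and the right-hand one from whatever conditions the drives actually live in, and the latter is what end users need.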

The other take-away is that most organizations' policies and procedures with regards to HD maintenance and use could stand improvement.
I strongly agree with that as well.


Cheers,
Ron Peacetree



At 01:53 AM 4/6/2007, [EMAIL PROTECTED] wrote:
On Fri, 6 Apr 2007, Ron wrote:

Bear in mind that Google was and is notorious for pushing their environmental factors to the limit while using the cheapest "PoS" HW they can get their hands on. Let's just say I'm fairly sure every piece of HW they were using for those studies was operating outside of manufacturer's suggested specifications.

Ron, please go read both the studies, unless you want to say that every organization that CMU picked to study abused their hardware as well....

Under such conditions the environmental factors are so deleterious that they swamp any other effect.

OTOH, I've spent my career being as careful as possible to as much as possible run HW within manufacturer's suggested specifications. I've been chided for it over the years... ...usually by folks who "save" money by buying commodity HDs for big RAID farms in NOCs or push their environmental envelope or push their usage envelope or ... ...and then act surprised when they have so much more down time and HW replacements than I do.

All I can tell you is that I've gotten to eat my holiday dinner far more often than my counterparts who push it in that fashion.

OTOH, there are crises like the Power Outage of 2003 in the NE USA, where some places had such Bad Things happen that it simply doesn't matter what you bought. (Power dies, generator cuts in, power comes back on, but the AC units crash; temperatures shoot up so fast that by the time everything is shut down again it's in the 100F range in the NOC. Lots of stuff dies on the spot, plus you spend the next 6 months having HW failures at +considerably+ higher rates than historical norms. Ick.)

IME, it really does make a difference =if you pay attention to the difference in the first place=. If you treat everything equally poorly, then you should not be surprised when everything acts equally poorly.

But hey, YMMV.

Cheers,
Ron Peacetree


