I read them as soon as they were available. Then I shrugged and
noted YMMV to myself.
1= Those studies are valid for =those= users under =those= users'
circumstances in =those= users' environments.
How well do those circumstances and environments mimic anyone else's?
I don't know, since the studies did not document them in enough detail
(and it would be nigh unto impossible to do so) for me to compare
mine to theirs. I =do= know that neither Google's nor a university's
nor an ISP's nor an HPC supercomputing facility's NOC is particularly
similar to, say, a financial institution's or a health care
organization's NOC.
...and they had better not be. Ditto the behavior of the personnel
working in them.
You yourself have said the environmental factors make a big
difference. I agree, and I submit that, therefore, differences in
those environmental factors are just as significant.
2= I'll bet all the money in your pockets vs all the money in my
pockets that people are going to leap at the chance to use these
studies as yet another excuse to pinch IT spending further. In the
process they are consciously or unconsciously going to imitate some
or all of the environments that were used in those studies.
Which IMHO is exactly wrong for most mission critical functions in
most non-university organizations.
While we can't all pamper our HDs to the extent that Richard Troy's
organization can, frankly that is much closer to the way things
should be done for most organizations. Ditto Greg Smith's =very= good habit:
"I scan all my drives for reallocated sectors, and the minute there's
a single one I get e-mailed about it and get all the data off that
drive pronto. This has saved me from a complete failure that
happened within the next day on multiple occasions."
Amen.
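FWIW, automating that habit takes very little code. Here's a minimal
sketch of the idea, assuming smartctl is installed and a local MTA is
accepting mail on localhost; the drive list, addresses, and the
zero-tolerance threshold are placeholders I made up, not Greg's actual
setup:

#!/usr/bin/env python3
"""Poll SMART reallocated-sector counts and mail an alert on the first
nonzero value. Drive list and mail settings are illustrative placeholders."""
import smtplib
import subprocess
from email.message import EmailMessage

DRIVES = ["/dev/sda", "/dev/sdb"]      # placeholder: adjust to your hardware
ALERT_TO = "ops@example.com"           # placeholder alert address

def reallocated_sectors(device):
    """Parse 'smartctl -A' output for the Reallocated_Sector_Ct raw value."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(line.split()[-1])   # raw value is the last column
    return 0

def send_alert(device, count):
    msg = EmailMessage()
    msg["Subject"] = f"SMART alert: {device} has {count} reallocated sectors"
    msg["From"] = "smart-monitor@example.com"
    msg["To"] = ALERT_TO
    msg.set_content(f"Get the data off {device} now; see 'smartctl -a {device}'.")
    with smtplib.SMTP("localhost") as s:   # assumes a local MTA is listening
        s.send_message(msg)

if __name__ == "__main__":
    for dev in DRIVES:
        count = reallocated_sectors(dev)
        if count > 0:                      # "the minute there's a single one"
            send_alert(dev, count)

Run it from cron every few minutes and you get roughly the behavior
Greg describes.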
I'll make the additional bet that, no matter what they say, neither
Google nor the sites CMU studied had to deal with setting up and
running environments where the consequences of data loss or data
corruption are as serious as they are for most mission critical
business applications. =Especially= DBMSs in such organizations.
If anyone tried to convince me to run a mission critical or
production DBMS in a business the way Google runs their HW, I'd be
applying the clue-by-four liberally in "boot to the head" fashion
until either they got just how wrong they were or they convinced me
they were too stupid to learn.
At which point they are never touching my machines.
3= From the CMU paper:
"We also find evidence, based on records of disk replacements in
the field, that failure rate is not constant with age, and that,
rather than a significant infant mortality effect, we see a
significant early onset of wear-out degradation. That is, replacement
rates in our data grew constantly with age, an effect often assumed
not to set in until after a nominal lifetime of 5 years."
"In our data sets, the replacement rates of SATA disks are not worse
than the replacement rates of SCSI or FC disks.
=This may indicate that disk independent factors, such as operating
conditions, usage and environmental factors, affect replacement=."
(emphasis mine)
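To make the "not constant with age" point concrete, it helps to look
at the two hazard shapes being contrasted: a rate that falls over time
(infant mortality) versus one that rises (wear-out). A purely
illustrative sketch using a Weibull hazard; the shape and scale values
below are numbers I picked for demonstration, not anything fitted in
the paper:

def weibull_hazard(t_hours, shape, scale_hours):
    """h(t) = (k/lam) * (t/lam)**(k-1); falls with age for shape < 1,
    rises with age for shape > 1."""
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)

SCALE = 5 * 8760.0                  # arbitrary ~5-year characteristic life
for year in (1, 2, 3, 4, 5):
    t = year * 8760.0
    falling = weibull_hazard(t, 0.7, SCALE)   # shape < 1: infant-mortality style
    rising = weibull_hazard(t, 1.5, SCALE)    # shape > 1: early-onset wear-out style
    print(f"year {year}: falling hazard {falling:.2e}/hr, rising hazard {rising:.2e}/hr")

Per the quote above, their field data behaves like the rising column,
not the falling one.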
If you look at the organizations in these two studies, you will note
that one thing they all have in common is that they tend to push the
environmental and usage envelopes, especially with regard to anything
involving spending money. (Google is an extreme case even within
that group.)
What these studies say clearly to me is that it is possible to be
penny-wise and pound-foolish with regard to IT spending... ...and
that these organizations have a tendency to be so.
Not a surprise to anyone who's worked in those environments, I'm sure.
The last thing the IT industry needs is for everyone to copy these
organizations' IT behavior!
4= Tom Lane is of course correct that vendors burn in their HDs
enough before selling them to get past most infant mortality. Then
any time any HD is shipped between organizations, it is usually
burned in again to detect and possibly deal with issues caused by
shipping. That's enough to see to it that the end operating
environment is not going to see a bathtub curve failure rate.
Then environmental, usage, and maintenance factors further distort
both the shape and size of the statistical failure curve.
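For the "burned in again after shipping" step in point 4, a receiving
check can be as simple as a SMART extended self-test plus a read-only
surface scan before the drive goes into service. A rough sketch,
assuming smartctl and badblocks are available; the device name and
polling interval are placeholders:

#!/usr/bin/env python3
"""Post-shipping acceptance check: run a SMART extended self-test, wait
for it to finish, then do a read-only surface scan. Device name and
polling interval are illustrative placeholders."""
import subprocess
import sys
import time

def run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout

def burn_in(device):
    run(["smartctl", "-t", "long", device])    # start the extended self-test

    # Poll the self-test log until no test is reported as in progress.
    while "in progress" in run(["smartctl", "-l", "selftest", device]).lower():
        time.sleep(600)                        # placeholder: check every 10 minutes

    # Loose check on the log; a real script would parse the newest entry only.
    if "Completed without error" not in run(["smartctl", "-l", "selftest", device]):
        return False

    # Read-only surface scan; badblocks prints any bad block numbers to stdout.
    scan = subprocess.run(["badblocks", "-sv", device],
                          capture_output=True, text=True, check=False)
    return scan.stdout.strip() == ""

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb"   # placeholder device
    print(f"{dev}: {'OK to deploy' if burn_in(dev) else 'failed burn-in'}")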
5= The major conclusion of the CMU paper is !NOT! that we should buy
the cheapest HDs we can because HD quality doesn't make a difference.
The important conclusion is that a very large segment of the industry
operates its equipment far enough outside manufacturers'
specifications that we need a new error rate model for end use. I agree.
Regardless of what Seagate et al can do in their QA labs, we need
reliability numbers that are actually valid in the real world (ITRW)
of HD usage.
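To put a number on the gap: a datasheet MTTF translates into a
surprisingly small nominal annualized failure rate if you take the
constant-failure-rate assumption at face value. The MTTF-to-AFR
conversion below is standard; the 3% field replacement rate is just
an illustrative assumption for comparison, not a figure from either
study:

import math

def nominal_afr(mttf_hours):
    """Annualized failure rate implied by a datasheet MTTF, assuming a
    constant (exponential) failure rate: AFR = 1 - exp(-8760/MTTF)."""
    return 1.0 - math.exp(-8760.0 / mttf_hours)

DATASHEET_MTTF = 1_000_000.0   # hours; a common datasheet figure
OBSERVED_ARR = 0.03            # assumption: 3% annual replacement rate in the field

afr = nominal_afr(DATASHEET_MTTF)
print(f"datasheet-implied AFR: {afr:.2%}")              # roughly 0.9%
print(f"field rate vs datasheet: {OBSERVED_ARR / afr:.1f}x higher")

A multiple like that is exactly why a field-derived error rate model
matters more than the QA-lab number.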
The other take-away is that organizational policy and procedure with
regard to HD maintenance and use in most organizations could use
improvement.
I strongly agree with that as well.
Cheers,
Ron Peacetree
At 01:53 AM 4/6/2007, [EMAIL PROTECTED] wrote:
On Fri, 6 Apr 2007, Ron wrote:
Bear in mind that Google was and is notorious for pushing their
environmental factors to the limit while using the cheapest "PoS"
HW they can get their hands on.
Let's just say I'm fairly sure every piece of HW they were using
for those studies was operating outside of manufacturer's suggested
specifications.
Ron, please go read both the studies, unless you want to say that
every organization CMU picked to study abused their hardware as
well....
Under such conditions the environmental factors are so deleterious
that they swamp any other effect.
OTOH, I've spent my career being as careful as possible to run HW
within manufacturer's suggested specifications as much as possible.
I've been chided for it over the years... ...usually by folks who
"save" money by buying commodity HDs for big RAID farms in NOCs or
push their environmental envelope or push their usage envelope or
... ...and then act surprised when they have so much more down time
and HW replacements than I do.
All I can tell you is that I've gotten to eat my holiday dinner far
more often than my counterparts who push it in that fashion.
OTOH, there are crises like the Power Outage of 2003 in the NE USA
where some places had such Bad Things happen that it simply doesn't
matter what you bought.
(Power dies, the generator cuts in, power comes back on, but the AC
units crash; temperatures shoot up so fast that by the time
everything is shut down again it's in the 100F range in the NOC.
Lots 'O Stuff dies on the spot, plus you spend the next 6 months
having HW failures at +considerably+ higher rates than historical
norms. Ick.)
IME, it really does make a difference =if you pay attention to the
difference in the first place=.
If you treat everything equally poorly, then you should not be
surprised when everything acts equally poorly.
But hey, YMMV.
Cheers,
Ron Peacetree