Hello again to all you Beowulf users!
This is similar to Doug Eadline's poll.

I was a subscriber in the late 1990s, but had to sign off the list for several 
years as my job changed.
I used to follow many of your posts with great interest, and I hope you old-timers
(and newer subscribers as well) are all doing well!

I'm working on a study on computer system recapitalization and thought this 
might make an excellent topic to engage on the Beowulf list.
We expect the results to have broad implications for US Federal Government IT 
planning as well.

I am researching the optimal life expectancy that a military program should plan
for its commercial computer-based systems.
My customer has upgraded many systems to COTS-based products (Linux/Intel-based
blade servers), and will upgrade more in the future.
They manage about 100 sites, each with a modest (1-room) sized installation of 
maybe 12 racks or so (200 to 300 nodes).
The question was raised as "When should all these servers be upgraded or 
replaced again?"

We know that from a purchasing perspective, the longer the life we plan/allow, 
the fewer systems to be bought per year, and the cheaper it gets.
But there are other factors. Over time the "older systems" become harder to
maintain: they cannot run newer licensed versions of software products, and they
need spare parts, some of which are hard or very hard to find (e.g. old RAM
modules - on eBay?!).
Sometimes the newer technology uses less power and is cheaper to operate.
(Has anyone ever created a kW/MFLOP vs. time curve? Has that really gone
down?)
After several years (e.g. 6, 7, or 8) the system administration costs on the
older systems may be higher, e.g. more labor, specialized training, unique tools.
If we know the required life is long, our customer insists on tracking
end-of-life points and buying spares to have on hand. That costs more for
longer time spans.
At some point the reliability of the fans and disk drives starts to really
impact computational production (downtime costs $$), and at that point the
repair means "complete replacement". That cost may be lower (cheaper parts) or
higher (new facility? rewritten code?) for having waited so long to do it.
We think that the "cost vs. length-of-life" curve has a roughly parabolic shape.
The left side is dominated by the cost decrease noted above; the right side is
the more troublesome.
In economics or manufacturing this is the classic capitalization problem: what
is the life expectancy of the asset? The optimal point is at the bottom of the
parabola.
Do operating costs really go up as a cluster ages?  What other factors are 
there?
For some the upgrade is necessary to tackle a larger or more complex problem  - 
but others can just let the system churn a bit longer to get the answer.  Is 
that the real driver?
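
To make the cost vs. length-of-life curve described above concrete, here is a
minimal Python sketch. Everything in it is an assumption for illustration: the
function names, the purchase price, the first-year operating cost, and the idea
that operating cost grows by a fixed percentage each year are placeholders, not
data from any real site.

```python
# Toy model of average annual cost per site versus planned service life.
# All dollar figures and growth rates are made-up placeholders (assumptions).

def annual_cost(life_years, purchase=1_000_000.0, base_om=50_000.0, om_growth=0.25):
    """Average yearly cost of owning one site for `life_years` whole years."""
    # Purchase price spread evenly over the service life: this term falls
    # as the planned life gets longer (the left side of the curve).
    capital = purchase / life_years
    # Operating and maintenance cost is ASSUMED to grow by a fixed fraction
    # every year (the right side of the curve); average it over the life.
    om_total = sum(base_om * (1.0 + om_growth) ** year for year in range(life_years))
    return capital + om_total / life_years


def cheapest_life(max_years=15, **kwargs):
    """Return the whole-year service life with the lowest average annual cost."""
    return min(range(1, max_years + 1), key=lambda n: annual_cost(n, **kwargs))


if __name__ == "__main__":
    for n in range(1, 13):
        print(f"{n:2d} years: ${annual_cost(n):,.0f} per year")
    print("lowest-cost life:", cheapest_life(), "years")
```

With these placeholder values the minimum happens to land at 8 years, but that
number only reflects the assumptions. If operating cost does not grow at all
(om_growth=0.0) the curve never turns up and the longest life always wins,
which is exactly why the question "do operating costs really go up as a cluster
ages?" matters.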

I recall that one Beowulf user facility operated both a new and an old 
production cluster, and replaced the "old one" with a "newer" one (the new "new 
one" ) on a regular basis.
Is that common?  Do most of you as users see the old system just chucked out as 
the new one is brought online?
What are the advantages or disadvantages of replacing 50% of the hardware at 25
of the 100 sites per year, vs. replacing all (100%) of the hardware at 12 sites
per year?
(On a coarse level, both plans cost about the same in hardware purchasing terms.)
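
The arithmetic behind the two replacement plans above can be sketched in a few
lines of Python. The function name and the "site-equivalent" unit are
assumptions made for this illustration, and the sketch counts hardware volume
only (no labor, shipping, or downtime).

```python
# Compare two refresh plans for a fleet of sites.
# "Site-equivalent" (one full site's worth of hardware) is an assumed unit.

def plan_summary(total_sites, sites_per_year, fraction_replaced):
    """Return (site-equivalents bought per year,
               years between visits to the same site,
               years until all hardware everywhere has been replaced)."""
    # Hardware bought each year, measured in whole-site equivalents.
    bought_per_year = sites_per_year * fraction_replaced
    # How often a given site gets touched by the refresh team.
    years_between_visits = total_sites / sites_per_year
    # How long until every piece of hardware has been replaced once.
    years_for_full_turnover = total_sites / bought_per_year
    return bought_per_year, years_between_visits, years_for_full_turnover


if __name__ == "__main__":
    plans = {
        "50% of 25 sites per year": plan_summary(100, 25, 0.5),
        "100% of 12 sites per year": plan_summary(100, 12, 1.0),
    }
    for name, (bought, visit, turnover) in plans.items():
        print(f"{name}: {bought:.1f} site-equivalents/yr, "
              f"visit every {visit:.1f} yr, full turnover in {turnover:.1f} yr")
```

Under these assumptions the half-replacement plan buys 12.5 site-equivalents
per year and the full-replacement plan buys 12, so purchasing volume is nearly
identical. The difference is that the first plan touches each site every 4
years and leaves two hardware generations running side by side, while the
second touches each site roughly every 8.3 years and keeps each site uniform.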

I expect that many of you have experience in upgrading and replacing clusters - 
I'd love to hear the feedback.
Could any/some of you please respond with information or help?
Maybe in one of three ways ....

1) Reply to this thread, with general comments, especially if you know of a 
study, presented paper, or archived thread
(if you recall a month I'll go scouring the archives again - my effort so far
has been fruitless!).   Has anyone modeled this problem?
I know I covered a lot of different aspects - feel free to comment on whatever 
issue or item you wish.

2) Send me info (directly to [email protected]) on how old your current
operational cluster or system is (and a little about it - e.g. # nodes).
I'll tally those #s and come up with an average age of operating clusters (I'll 
post a summary of all the responses I get in a week).

3) Send me info on the age of a cluster you just recently replaced - if you
send that directly to me ([email protected]) I'll post a generic table of
data back to the list in a week or so.
This will at least tell us the "hard facts" of when maybe 20 or 30 commercial 
systems were last replaced, with the assumption being that most of you or your 
lab managers were making a best case judgment that it was time for the unit to 
be replaced - this gives us statistical comparison data to the Beowulf user 
community experience.
I think actual (factual) replacement #s are more valid - that shows the result 
of collective decisions to actually spend $ and take action.
Please tell me, if you know:
a) How many nodes were there in the "old system"?
b) How old was the "old system" when it was replaced?
c) How many nodes does the "new cluster" have?
d) If known, what was the biggest reason for the replacement?
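
For anyone who wants to see how answers (a) through (d) could be tallied, here
is a minimal Python sketch. The field names and the three sample rows are
invented placeholders used only to show the arithmetic; they are not real
responses.

```python
from collections import Counter
from statistics import mean, median


def summarize(responses):
    """Summarize replacement reports given as dicts with the (assumed) keys
    'old_nodes', 'age_years', 'new_nodes', and 'reason'."""
    ages = [r["age_years"] for r in responses]
    # Ratio of new node count to old node count for each replacement.
    growth = [r["new_nodes"] / r["old_nodes"] for r in responses]
    return {
        "count": len(responses),
        "mean_age": mean(ages),
        "median_age": median(ages),
        "mean_growth": mean(growth),
        "reasons": Counter(r["reason"] for r in responses),
    }


if __name__ == "__main__":
    # Placeholder rows for illustration only (NOT real survey data).
    sample = [
        {"old_nodes": 64, "age_years": 5.0, "new_nodes": 128, "reason": "spare parts"},
        {"old_nodes": 32, "age_years": 4.0, "new_nodes": 96, "reason": "larger problems"},
        {"old_nodes": 128, "age_years": 6.0, "new_nodes": 128, "reason": "power cost"},
    ]
    for key, value in summarize(sample).items():
        print(f"{key}: {value}")
```

Reporting the median alongside the mean is a deliberate choice here: with only
20 or 30 responses, one very old system could pull the average noticeably.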

If responding directly, please put "CLUSTER RECAPITALIZATION" in the title of 
the email.

Hopefully this wasn't done too recently (last few years) - if it was, can
someone please send a pointer and I'll check it out!
I'm reviewing published data on this question as well, but would specifically
like to get the Beowulf community view.

Thanks again & in advance
Dave Lechner
Principal Systems Engineer
MITRE Corp.
[email protected]


_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
