Quoting Mark Hahn <[EMAIL PROTECTED]>, on Fri 15 Feb 2008 02:25:26 PM PST:


I'm skeptical how much sense VMs in HPC make, though.  yes, it would
be nice to have a container for MPI jobs: checkpoints for free, ability
to do migration.  both these factors depend on the scale of your jobs: if all
your jobs are 4k cpus and up, even a modest node failure rate is going to make
aggressive checkpointing necessary (versus jobs averaging 64p, which are
almost never taken down by a node failure.)  similarly, if your workload
is all serial jobs, there's probably no need at all for migration (versus
a workload with high variance in job size, length, priority, etc).


Perhaps the added overhead of using VMs for "user-transparent checkpointing" is worth it in the same sense that most folks are willing to tolerate the overhead of a compiler and linker instead of working in hex, octal, or binary machine code. Rather than force a researcher to figure out how to do checkpointing, you buy a few dozen more nodes to make up for the extra work.

You spend more on hardware and less on bodies, and since hardware is always getting cheaper (per quantum of "work"), the trade gets more attractive with time.

{Leaving aside interesting philosophical discussions about the incremental cost of labor, especially one's own, versus the capital and operating costs of the iron. I've also noticed that even though we've gone through many, many Moore's Law doublings, with probably a 5000-fold increase in computational horsepower on an engineer's desk every 20 years, design and analysis methodologies change much more slowly. In the RF world, the state of the art in design tools in 1960 was a paper Smith chart, a slide rule, and a healthy dose of simplified analytical approximations. The state of the art in 1980 was simple computer tools that essentially automated the pencil-and-paper techniques, plus some numerical analysis (e.g., SPICE for circuit simulation, which solves matrix equations and does numerical integration, or early electromagnetics codes). The state of the art in 2000 (and today, really) is integrated modeling tools with much larger matrices and tighter integration between FEM codes and circuit-theory analysis (that is, you might model the packaging with an EM code, but you'd use a behavioral model for the semiconductor device rather than applying Maxwell's equations all the way down to the atomic level).

However, even with such nifty tools, a huge number of engineers still use paper-and-pencil-style analysis. Granted, they use Excel instead of their trusty HP45 and a quad pad, but the style of analysis and design is the same. They even teach classes in "RF Design with Excel" (which I view as anathema). Why isn't everyone using the new tools, which hugely improve productivity and the quality of the resulting design?

Capital investment is required (gotta invest in the iron, and the seat license). Familiarity matters too: if you learned to design 20 years ago, you're comfortable with the methodology, you're aware of its limits, and you are satisfied with the precision and accuracy of its results. The latter is another aspect of capital investment: it takes time to get used to a new way of doing things, time that the engineer may not have in an environment that stresses getting the product out the door (or, in the case of where I work, getting to the launch pad in time for the every-two-year launch opportunity for Mars).}

So, against this background, giving up even 80% of the computational horsepower in exchange for a tool that might make you 10 times more productive is a good trade. Sometimes I think the folks developing automatic parallelizers and similar tools are working too hard to make them perfect. If I can take a chunk of software that now takes, say, 1 day to run (requiring periodic interaction, i.e., it's not a batch overnight thing) and get it to run in 10 minutes, that's a huge improvement. Put it in numbers. Say it costs me $3000 for a computer that runs it in a day. If I can run it in 10 minutes (about 50 times faster, taking a day as a working day) and I do one run a day, I don't care if it takes 100 processors to go 50x faster, as opposed to only 50. The extra processors cost me, say, $200K (with the extra overhead for connectivity, facilities, etc.), which is a small fraction of the time saved, because I've essentially replaced 50 engineers with 1 (putting those 49 engineers out on the street, where they will inevitably cause problems... idle hands, playgrounds, and so forth).

In fact, you could have some hideously inefficient scheme that takes 1000 processors to go 10 times faster, and it's probably still a good deal.
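The trade described above is easy to sanity-check with a back-of-the-envelope model. The hardware figures ($200K for the extra processors, 49 engineers' worth of throughput freed) come from the text; the fully-burdened cost per engineer-year is an assumed placeholder, since the original doesn't state one.

```python
# Back-of-the-envelope payback for the hardware-vs-labor trade above.
# $200K of extra hardware and 49 engineers freed are from the text;
# the burdened engineer rate is an assumption for illustration only.

def payback_years(extra_hardware_cost, engineers_freed, cost_per_engineer_year):
    """Engineer-years of freed labor needed to pay off the extra hardware."""
    return extra_hardware_cost / (engineers_freed * cost_per_engineer_year)

extra_hardware = 200_000       # 100 processors plus connectivity, facilities, etc.
engineers_freed = 49           # 50 engineers' output now delivered by 1
burdened_rate = 150_000        # assumed fully-burdened $/engineer-year

years = payback_years(extra_hardware, engineers_freed, burdened_rate)
print(f"Extra hardware pays for itself in {years:.3f} years of freed labor")
```

Under this (assumed) rate, the $200K of extra iron is repaid in a small fraction of a year, which is the point of the argument: even a hideously inefficient parallelization can win on labor cost alone.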

Jim Lux





_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
