On 03/05/2014 10:55 AM, Douglas Eadline wrote:


On 05/03/14 13:52, Joe Landman wrote:

I think the real question is would the system be viable in a
commercial sense, or is this another boondoggle?

At the Slurm User Group last year Dona Crawford of LLNL gave the
keynote and as part of that talked about some of the challenges of
exascale.

The one everyone thinks about first is power, but the other one she
touched on was reliability and uptime.

Indeed, the fact that these issues were not even mentioned
means to me the project is not very well thought out.
At exascale (using current tech) failure recovery must be built
into any design, in software, hardware, or both.

Yes ... such designs must assume that there will be failure, and manage it. The issue, last I checked, is that most people coding to MPI can't use, or haven't used, MPI's resiliency features.

Checkpoint/restart (CPR) on this scale is simply not an option, given that the probability of a failure occurring during CPR very rapidly approaches unity. CPR is built on the implicit assumption that copy out/copy back is *absolutely* reliable and will not fail. Ever.

One way to circumvent part of the issue is to use the SSD-on-DIMM designs to do very local "snapshot"-like CPR, and to add in erasure coding and other FEC for the data. That way you can accept some small amount of failure in the copy out or copy back.
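As a toy illustration (not any particular FEC scheme, and the block names are made up), a single XOR parity block per stripe is already enough to survive the loss of any one data block, which is the flavor of protection being described for checkpoint data:

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def make_parity(data_blocks):
    """Compute one parity block over a stripe of data blocks."""
    return xor_blocks(data_blocks)

def recover(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity])

data = [b"ckpt-000", b"ckpt-001", b"ckpt-002"]   # hypothetical checkpoint chunks
parity = make_parity(data)

# Lose one block during copy out; rebuild it from the rest:
rebuilt = recover([data[0], data[2]], parity)
assert rebuilt == data[1]
```

Real systems use Reed-Solomon or similar codes to tolerate more than one loss per stripe, but the principle is the same: a small storage overhead buys tolerance of partial failure during the copy.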



Basically if you scale a current petascale system up to exascale you
are looking at an expected full-system uptime of between seconds and
minutes.  For comparison Sequoia, their petaflop BG/Q, has a
systemwide MTBF of about a day.

I recall that HPL will take about 6 days to run
on an exascale machine.
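A back-of-the-envelope sketch shows why that combination is grim, assuming failures arrive as a Poisson process (i.e. exponentially distributed time between failures):

```python
import math

def p_uninterrupted(run_days: float, mtbf_days: float) -> float:
    """Probability a run of the given length sees no system failure,
    assuming exponentially distributed time between failures."""
    return math.exp(-run_days / mtbf_days)

# A 6-day HPL run on a machine with a 1-day systemwide MTBF:
print(f"{p_uninterrupted(6, 1):.4%}")  # about 0.25%
```

Under those assumptions you would expect to restart such a run hundreds of times before getting one clean pass, which is exactly why per-application fault tolerance, not whole-job restart, is the only plausible way forward.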


That causes problems if you're expecting to do checkpoint/restart to
cope with failures, so really you've got to look at fault tolerances
within applications themselves.   Hands up if you've got (or know of)
a code that can gracefully tolerate, and meaningfully continue after,
nodes going away whilst the job is running?

I would hate to have my $50B machine give me the wrong answer
when such large amounts of money are involved. And we all know
it is going to kick out "42" at some point.

Or the complete works of Shakespeare (http://en.wikipedia.org/wiki/Infinite_monkey_theorem), though this would be more troubling than 42.




The Slurm folks are already looking at this, in terms of having some way
of bargaining with the scheduler in case of node failure

As a side point, the Hadoop YARN scheduler allows dynamic resource
negotiation while the program is running, so if a node or rack dies,
a job can request more resources. For MR this is rather easy to do because of
the functional nature of the process.


We need to get to that place. Right now, our job scheduling, while quite sophisticated in rule sets, is firmly entrenched in ideas from the 70's and 80's. "New" concepts in schedulers (pub/sub, etc.) are needed for really huge scale. Fully distributed, able to route around failure. Not merely tolerating it, but adapting to it.
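A toy sketch of the idea (hypothetical names, nothing like a real scheduler): workers subscribe to a topic, and when one drops out the dispatcher simply routes work to the survivors instead of failing the job.

```python
import collections

class Dispatcher:
    """Minimal pub/sub-style work dispatcher that adapts to subscriber failure."""
    def __init__(self):
        self.subscribers = {}                        # name -> handler
        self.delivered = collections.defaultdict(list)

    def subscribe(self, name, handler):
        self.subscribers[name] = handler

    def fail(self, name):
        # A dead node simply disappears from the routing table.
        self.subscribers.pop(name, None)

    def publish(self, task):
        if not self.subscribers:
            raise RuntimeError("no live subscribers")
        # Route to the least-loaded live subscriber.
        name = min(self.subscribers, key=lambda n: len(self.delivered[n]))
        self.delivered[name].append(task)
        self.subscribers[name](task)

d = Dispatcher()
d.subscribe("node1", lambda t: None)
d.subscribe("node2", lambda t: None)
d.publish("task-0")
d.fail("node1")        # node dies mid-run...
d.publish("task-1")    # ...and work keeps flowing to the survivor
```

This is of course orders of magnitude simpler than a real distributed scheduler, but it captures the shift: the routing table is dynamic state, and failure is just an update to it rather than an abort condition.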

This is going to require that we code to reality, not a fictional universe where nodes never fail, storage/networking never goes offline ...

I've not done much with MPI in a few years, have they extended it beyond MPI_Init yet? Can MPI procs just join a "borgified" collective and preserve state, so that restarts/moves/reschedules of ranks are cheap? If not, what is the replacement for MPI that will do this?

FWIW, folks on Wall Street use pub sub, message passing (ala AMPS, *MQ, ...) to handle some elements of this.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: [email protected]
web  : http://scalableinformatics.com
twtr : @scalableinfo
phone: +1 734 786 8423 x121
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
