On 03/05/2014 10:55 AM, Douglas Eadline wrote:


On 05/03/14 13:52, Joe Landman wrote:

I think the real question is would the system be viable in a
commercial sense, or is this another boondoggle?

At the Slurm User Group last year Dona Crawford of LLNL gave the
keynote and as part of that talked about some of the challenges of
exascale.

The one everyone thinks about first is power, but the other one she
touched on was reliability and uptime.

Indeed, the fact that these issues were not even mentioned
means to me the project is not very well thought out.
At exascale (using current tech) failure recovery must be built
into any design, in software, hardware, or both.

Yes ... such designs must assume that there will be failure, and manage it. The issue, last I checked, is that most people coding to MPI can't use, or haven't used, MPI's resiliency features.

Checkpoint/restart (CPR) on this scale is simply not an option, given that the probability of a failure occurring during CPR very rapidly approaches unity. CPR is built on the implicit assumption that copy out/copy back is *absolutely* reliable and will not fail. Ever.

One way to circumvent part of the issue is to use the SSD-on-DIMM designs to do very local "snapshot"-like CPR, and to add in erasure coding and other FEC for the data. That way you can accept some small amount of failure in the copy out or copy back.
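As a toy illustration (not any particular FEC scheme, and the block names are made up), a single XOR parity block per stripe is already enough to survive the loss of any one data block, which is the flavor of protection being described for checkpoint data:

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def make_parity(data_blocks):
    """Compute one parity block over a stripe of data blocks."""
    return xor_blocks(data_blocks)

def recover(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity])

data = [b"ckpt-000", b"ckpt-001", b"ckpt-002"]   # hypothetical checkpoint chunks
parity = make_parity(data)

# Lose one block during copy out; rebuild it from the rest:
rebuilt = recover([data[0], data[2]], parity)
assert rebuilt == data[1]
```

Real systems use Reed-Solomon or similar codes to tolerate more than one loss per stripe, but the principle is the same: a small storage overhead buys tolerance of partial failure during the copy.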



Basically if you scale a current petascale system up to exascale you
are looking at an expected full-system uptime of between seconds and
minutes.  For comparison Sequoia, their petaflop BG/Q, has a
systemwide MTBF of about a day.

I recall that HPL will take about 6 days to run
on an exascale machine.
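A back-of-the-envelope sketch shows why that combination is grim, assuming failures arrive as a Poisson process (i.e. exponentially distributed time between failures):

```python
import math

def p_uninterrupted(run_days: float, mtbf_days: float) -> float:
    """Probability a run of the given length sees no system failure,
    assuming exponentially distributed time between failures."""
    return math.exp(-run_days / mtbf_days)

# A 6-day HPL run on a machine with a 1-day systemwide MTBF:
print(f"{p_uninterrupted(6, 1):.4%}")  # about 0.25%
```

Under those assumptions you would expect to restart such a run hundreds of times before getting one clean pass, which is exactly why per-application fault tolerance, not whole-job restart, is the only plausible way forward.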


That causes problems if you're expecting to do checkpoint/restart to
cope with failures, so really you've got to look at fault tolerances
within applications themselves.   Hands up if you've got (or know of)
a code that can gracefully tolerate, and meaningfully continue after,
nodes going away whilst the job is running?

I would hate to have my $50B machine give me the wrong answer
when such large amounts of money are involved. And we all know
it is going to kick out "42" at some point.

Or the complete works of Shakespeare (http://en.wikipedia.org/wiki/Infinite_monkey_theorem), though this would be more troubling than 42.




The Slurm folks are already looking at this, in terms of having some way
of bargaining with the scheduler in case of node failure

As a side point, the Hadoop YARN scheduler allows dynamic resource
negotiation while the program is running, so if a node or rack dies,
a job can request more resources. For MR this is rather easy to do because of
the functional nature of the process.


We need to get to that place. Right now, our job scheduling, while quite sophisticated in rule sets, is firmly entrenched in ideas from the 70's and 80's. "New" concepts in schedulers (pub/sub, etc.) are needed for really huge scale. Fully distributed, able to route around failure. Not merely tolerating it, but adapting to it.
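A toy sketch of the idea (hypothetical names, nothing like a real scheduler): workers subscribe to a topic, and when one drops out the dispatcher simply routes work to the survivors instead of failing the job.

```python
import collections

class Dispatcher:
    """Minimal pub/sub-style work dispatcher that adapts to subscriber failure."""
    def __init__(self):
        self.subscribers = {}                        # name -> handler
        self.delivered = collections.defaultdict(list)

    def subscribe(self, name, handler):
        self.subscribers[name] = handler

    def fail(self, name):
        # A dead node simply disappears from the routing table.
        self.subscribers.pop(name, None)

    def publish(self, task):
        if not self.subscribers:
            raise RuntimeError("no live subscribers")
        # Route to the least-loaded live subscriber.
        name = min(self.subscribers, key=lambda n: len(self.delivered[n]))
        self.delivered[name].append(task)
        self.subscribers[name](task)

d = Dispatcher()
d.subscribe("node1", lambda t: None)
d.subscribe("node2", lambda t: None)
d.publish("task-0")
d.fail("node1")        # node dies mid-run...
d.publish("task-1")    # ...and work keeps flowing to the survivor
```

This is of course orders of magnitude simpler than a real distributed scheduler, but it captures the shift: the routing table is dynamic state, and failure is just an update to it rather than an abort condition.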

This is going to require that we code to reality, not a fictional universe where nodes never fail, storage/networking never goes offline ...

I've not done much with MPI in a few years, have they extended it beyond MPI_Init yet? Can MPI procs just join a "borgified" collective and preserve state, so that restarts/moves/reschedules of ranks are cheap? If not, what is the replacement for MPI that will do this?

FWIW, folks on Wall Street use pub sub, message passing (ala AMPS, *MQ, ...) to handle some elements of this.



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: [email protected]
web  : http://scalableinformatics.com
twtr : @scalableinfo
phone: +1 734 786 8423 x121
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
