Re: [Beowulf] Exascale by the end of the year?

Christopher Samuel Tue, 04 Mar 2014 19:59:26 -0800


On 05/03/14 13:52, Joe Landman wrote:


> I think the real question is would the system be viable in a
> commercial sense, or is this another boondoggle?

At the Slurm User Group last year Dona Crawford of LLNL gave the
keynote and as part of that talked about some of the challenges of
exascale.

The one everyone thinks about first is power, but the other one she
touched on was reliability and uptime.

Basically, if you scale a current petascale system up to exascale you
are looking at an expected full-system uptime of between seconds and
minutes.  For comparison, Sequoia, their petascale BG/Q, has a
systemwide MTBF of about a day.
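A back-of-the-envelope way to see where "seconds to minutes" comes from:
if any single component failure takes down a full-system job, and failures
are independent and exponentially distributed (a simplifying assumption),
then system MTBF is component MTBF divided by the number of components. So
multiplying the part count by some factor divides the MTBF by that factor.
The scale-up factors below are purely illustrative, not figures from the
talk; the one-day starting point is the Sequoia number quoted above.

```python
# Sketch: full-system MTBF under the assumption of independent,
# exponentially distributed component failures, where any one failure
# kills the whole job.

def system_mtbf(component_mtbf_hours: float, n_components: int) -> float:
    """System MTBF = component MTBF / number of components."""
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


def scaled_mtbf(current_system_mtbf_hours: float, scale_factor: float) -> float:
    """System MTBF after multiplying the component count by scale_factor."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return current_system_mtbf_hours / scale_factor


if __name__ == "__main__":
    base_hours = 24.0  # "about a day", the Sequoia figure quoted above
    for factor in (50, 500, 5000):  # hypothetical scale-up factors
        minutes = scaled_mtbf(base_hours, factor) * 60.0
        print(f"x{factor} more components: MTBF ~ {minutes:.1f} minutes")
```

Even a modest 50x scale-up takes a one-day MTBF down to under half an hour,
before allowing for any change in per-component reliability.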

That causes problems if you're expecting to do checkpoint/restart to
cope with failures, so really you've got to look at fault tolerance
within applications themselves.   Hands up if you've got (or know of)
a code that can gracefully tolerate nodes going away whilst the job is
running and meaningfully continue?
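To put a number on why checkpoint/restart stops working, one standard
first-order model is Young's approximation for the optimal checkpoint
interval, tau = sqrt(2 * C * M), where C is the time to write a checkpoint
and M is the system MTBF. The fraction of time lost to checkpointing plus
recomputation after a failure is then roughly C/tau + tau/(2M), which at
Young's interval simplifies to sqrt(2C/M). This is a sketch under that
model's assumptions (C much smaller than M, restart cost ignored); the
10-minute checkpoint time is a made-up illustrative value.

```python
import math

# Sketch of Young's first-order checkpoint model. All times in seconds.
# Only meaningful when checkpoint_cost << mtbf; restart cost is ignored.

def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's approximate optimal checkpoint interval: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def waste_fraction(checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time lost: C/tau + tau/(2M) at Young's tau.

    Algebraically this equals sqrt(2C/M); values near or above 1 mean the
    model has broken down and the machine makes essentially no progress.
    """
    tau = young_interval(checkpoint_cost, mtbf)
    return checkpoint_cost / tau + tau / (2.0 * mtbf)


if __name__ == "__main__":
    checkpoint_cost = 600.0  # hypothetical: 10 minutes to write a checkpoint
    for mtbf in (86400.0, 1800.0, 600.0):  # one day, 30 minutes, 10 minutes
        w = waste_fraction(checkpoint_cost, mtbf)
        print(f"MTBF {mtbf:>7.0f} s: ~{w:.0%} of time lost")
```

With a one-day MTBF this loses around 12% of the machine; at a 30-minute
MTBF it is already around 80%, and once M approaches C the estimate exceeds
100%, which is the point where fault tolerance has to move into the
application.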

The Slurm folks are already looking at this in terms of having some way
for a job to negotiate with the scheduler in case of node failure;
there are slides up on what they are planning here:

http://slurm.schedmd.com/SUG13/nonstop.pdf

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: [email protected] Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
