On 05/03/14 13:52, Joe Landman wrote:

> I think the real question is would the system be viable in a
> commercial sense, or is this another boondoggle?

At the Slurm User Group last year Dona Crawford of LLNL gave the keynote and as part of that talked about some of the challenges of exascale. The one everyone thinks about first is power, but the other one she touched on was reliability and uptime.

Basically if you scale a current petascale system up to exascale you are looking at an expected full-system uptime of between seconds and minutes. For comparison Sequoia, their petaflop BG/Q, has a systemwide MTBF of about a day.

That causes problems if you're expecting to do checkpoint/restart to cope with failures, so really you've got to look at fault tolerance within the applications themselves. Hands up if you've got (or know of) a code that can gracefully tolerate nodes going away whilst the job is running, and meaningfully continue?

The Slurm folks are already looking at this in terms of having some way of bargaining with the scheduler in case of node failure - there are slides up on what they are planning here:

http://slurm.schedmd.com/SUG13/nonstop.pdf

cheers,
Chris
-- 
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected] Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
