Re: [Beowulf] Supercomputers face growing resilience problems

Deepak Singh Fri, 23 Nov 2012 12:51:59 -0800

And this is the bit that concerns me the most.  At scale you should only be
making two assumptions: (1) everything breaks all the time, and (2) you will
have network partitions.  Checkpoint/restart is a lazy option that has no
place in modern software. Yet there doesn't seem to be any priority on going
beyond checkpoint/restart and rethinking software architecture. I would
argue that's as important as, or more important than, figuring out manycore.


On Fri, Nov 23, 2012 at 6:44 AM, Lux, Jim (337C)
<[email protected]> wrote:

> a lot of HPC software design
> assumes perfect hardware, or, that the hardware failure rate is
> sufficiently low that a checkpoint/restart (or "do it all over from the
> beginning") is an acceptable strategy.
>

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
