And this is the bit that concerns me the most. At scale you should only be making two assumptions: (1) everything breaks all the time, and (2) you will have network partitions. Checkpoint/restart is a lazy option that has no place in modern software. Yet there doesn't seem to be any priority placed on going beyond checkpoint/restart and rethinking software architecture. I would argue that's as important as figuring out manycore, if not more so.
On Fri, Nov 23, 2012 at 6:44 AM, Lux, Jim (337C) <[email protected]> wrote:

> a lot of HPC software design
> assumes perfect hardware, or, that the hardware failure rate is
> sufficiently low that a checkpoint/restart (or "do it all over from the
> beginning") is an acceptable strategy.
>
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
