It's a "hard problem"… The natural tendency is to solve the easy problems first, and only when backed into a corner do you take on the hard ones. Or someone comes out of the background with a really novel approach. I'm sure folks thought about error-correcting codes in an empirical way (e.g., parity bits), but Hamming put it all together in a nice, consistent theoretical framework. Or Shannon, for that matter.
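To make the parity-bits-vs-Hamming contrast concrete: a single parity bit only detects an error, while Hamming's construction locates it so it can be corrected. A minimal sketch of Hamming(7,4) in Python (the function names are mine, not from the thread):

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword.

    Parity bits sit at positions 1, 2, 4 (1-indexed); each covers
    exactly the positions whose binary index contains that power of two.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(code):
    """Locate and flip a single-bit error; returns the corrected codeword."""
    c = list(code)
    # For a valid codeword the XOR of the (1-indexed) positions of all
    # set bits is 0; after a single-bit flip it equals the error position.
    syndrome = 0
    for i, bit in enumerate(c, start=1):
        if bit:
            syndrome ^= i
    if syndrome:
        c[syndrome - 1] ^= 1
    return c
```

A parity bit alone would tell you "something flipped"; the syndrome here tells you *which* bit flipped, which is the step from empirical checks to a correcting code.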
From: Deepak Singh <[email protected]>
Date: Friday, November 23, 2012 11:45 AM
To: Jim Lux <[email protected]>
Cc: Luc Vereecken <[email protected]>, [email protected], [email protected]
Subject: Re: [Beowulf] Supercomputers face growing resilience problems

And this is the bit that concerns me the most. At scale you should only be making two assumptions: (1) everything breaks all the time, and (2) you will have network partitions. Checkpoint/restart is a lazy option that has no place in modern software. Yet there doesn't seem to be a priority to go beyond checkpoint/restart and rethink software architecture. I would argue that's as much or more important than figuring out manycore.

On Fri, Nov 23, 2012 at 6:44 AM, Lux, Jim (337C) <[email protected]> wrote:
> A lot of HPC software design assumes perfect hardware, or that the hardware
> failure rate is sufficiently low that a checkpoint/restart (or "do it all
> over from the beginning") is an acceptable strategy.
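For readers outside HPC, the strategy being criticized above works roughly like this: the application periodically saves its state to durable storage, and after a failure the job resumes from the last checkpoint rather than from the beginning. A minimal single-process sketch in Python (the file name and work loop are hypothetical, invented for illustration):

```python
import json
import os

CHECKPOINT = "state.ckpt"   # hypothetical checkpoint file name

def load_checkpoint():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"step": 0, "total": 0}

def save_checkpoint(state):
    # Write to a temp file, then rename: os.replace is atomic, so a
    # crash mid-write cannot corrupt the previous good checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def run(steps=1000, ckpt_every=100):
    state = load_checkpoint()
    while state["step"] < steps:
        state["total"] += state["step"]   # stand-in for real work
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            save_checkpoint(state)
    return state["total"]
```

The thread's complaint is that this pattern scales badly: as machine size grows, mean time between failures shrinks toward the time needed to write and restore checkpoints, which is why the posters argue for rethinking the software architecture itself rather than leaning on restart.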
_______________________________________________
Beowulf mailing list, [email protected], sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
