On 09/21/12 12:13, Lux, Jim (337C) wrote:
> I would suggest that some scheme of redundant computation might be more
> effective.. Rather than try to store a single node's state on the node,
> and then, if any node hiccups, restore the state (perhaps to a spare), and
> restart, means stopping the entire cluster while you recover.

I am not 100% sure about the nitty-gritty here, but I do believe there are schemes already in place to deal with single-node failures. What I do know for sure is that checkpoints are used as a last line of defense against full-cluster failure due to overheating, power failure, or an excessive number of concurrent failures -- not for just one node going belly up. The LANL clusters I was learning about only checkpointed every 4-6 hours or so, if I remember correctly. With hundred-petascale clusters and beyond seeing failures on the order of minutes rather than hours, checkpointing is obviously not the first resort for failure recovery.

If I find some of the nitty-gritty I'm currently forgetting about how smaller, isolated failures are handled now, I'll report back.

Nevertheless, great ideas, Jim!

Best,

ellis

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
