Re: [OMPI devel] Fault tolerance

2008-03-07 Thread Aurélien Bouteiller
We now use the errmgr. Aurelien On 6 March 08 at 13:38, Aurélien Bouteiller wrote: Aside from what Josh said, we are working right now at UTK on orted/MPI recovery (without killing/respawning everything). So far we have had no use for the errmgr, but I'm quite sure this would be the smartest place to put

Re: [OMPI devel] Fault tolerance

2008-03-06 Thread Aurélien Bouteiller
Aside from what Josh said, we are working right now at UTK on orted/MPI recovery (without killing/respawning everything). So far we have had no use for the errmgr, but I'm quite sure this would be the smartest place to put all the mechanisms we are trying now. Aurelien On 6 March 08 at 11:17, Ralph Casta

Re: [OMPI devel] Fault tolerance

2008-03-06 Thread Ralph Castain
Ah - ok, thanks for clarifying! I'm happy to leave it around, but wasn't sure if/where it fit into anyone's future plans. Thanks Ralph On 3/6/08 9:13 AM, "Josh Hursey" wrote: > The checkpoint/restart work that I have integrated does not respond to > failure at the moment. If a failure happen

Re: [OMPI devel] Fault tolerance

2008-03-06 Thread Josh Hursey
The checkpoint/restart work that I have integrated does not respond to failure at the moment. If a failure happens, I want ORTE to terminate the entire job. I will then restart the entire job from a checkpoint file. This follows the 'all fall down' approach that users typically expect when

[OMPI devel] Fault tolerance

2008-03-06 Thread Ralph Castain
Hello, I've been doing some work on fault response within the system, and finally realized something I should probably have seen a while back. Perhaps I am misunderstanding somewhere, so forgive the ignorance if so. When we designed ORTE some time in the deep, dark past, we had envisioned that peop