We now use the errmgr.
Aurelien
On 6 March 08 at 13:38, Aurélien Bouteiller wrote:
Aside from what Josh said, we are working right now at UTK on orted/MPI
recovery (without killing/respawning everything). So far we have had no use
for the errmgr, but I'm quite sure it would be the smartest place to
put all the mechanisms we are trying out now.
Aurelien
On 6 March 08 at 11:17, Ralph Casta
Ah - ok, thanks for clarifying! I'm happy to leave it around, but wasn't
sure if/where it fit into anyone's future plans.
Thanks
Ralph
On 3/6/08 9:13 AM, "Josh Hursey" wrote:
The checkpoint/restart work that I have integrated does not respond to
failure at the moment. If a failure happens, I want ORTE to terminate
the entire job. I will then restart the entire job from a checkpoint
file. This follows the 'all fall down' approach that users typically
expect when
Hello
I've been doing some work on fault response within the system, and finally
realized something I should probably have seen a while back. Perhaps I am
misunderstanding somewhere, so forgive the ignorance if so.
When we designed ORTE some time in the deep, dark past, we had envisioned
that peop