Kewl. FWIW: we already have the ability to migrate processes in the ORTE code.
You can tell the system to try to restart the process in its existing location
N times before requesting relocation. Of course, if a node fails,
then we automatically relocate the procs to other nodes.
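
To make the policy above concrete, here is a rough sketch of the decision in Python. It is not ORTE code: the function name, its parameters, and the "pick the first other node" rule are all hypothetical and only illustrate "retry in place up to N times, then relocate; relocate immediately if the node is down".

```python
from typing import Optional, Sequence


def place_after_failure(
    current_node: str,
    local_restarts: int,
    max_local_restarts: int,
    node_alive: bool,
    other_nodes: Sequence[str],
) -> Optional[str]:
    """Return the node on which to restart a failed process, or None.

    Hypothetical illustration only; names and rules are assumptions,
    not the ORTE implementation.
    """
    # Retry in place while the node is up and the retry budget (N) is not spent.
    if node_alive and local_restarts < max_local_restarts:
        return current_node

    # Otherwise relocate: here we simply take the first candidate node
    # that differs from the current one (an assumed, simplistic choice).
    for node in other_nodes:
        if node != current_node:
            return node

    # No alternative node is available.
    return None
```

For example, with N = 2 a process on a healthy node is restarted in place for the first two failures and moved on the third, while a node failure triggers relocation right away.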
The
Hi all,
I kind of broke something in my mail configuration, so I haven't
been able to reply to this properly until now, sorry.
@Jsquyres We are planning to work on fault tolerance and improved
scheduling capabilities for HPC. To do so, we are first focusing on
serial tasks, and in a next step