Ralph, could you tell us when this functionality will be available in the stable version? A rough estimate will be fine.
On Fri, Sep 24, 2010 at 01:24, Ralph Castain <r...@open-mpi.org> wrote:
> In a word, no. If a node crashes, OMPI will abort the currently-running job
> if it had processes on that node. There is no current ability to "ride-thru"
> such an event.
>
> That said, there is work being done to support "ride-thru". Most of that is
> in the current developer's code trunk, and more is coming, but I wouldn't
> consider it production-quality just yet.
>
> Specifically, the code that does what you specify below is done and works.
> It is recovery of the MPI job itself (collectives, lost messages, etc.) that
> remains to be completed.
>
> On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau <andrei.fo...@neutron.kth.se> wrote:
>
>> Dear users,
>>
>> Our cluster has a number of nodes that are likely to crash, so
>> calculations quite often stop because one node goes down.
>> Do you know whether it is possible to block the crashed nodes at run-time
>> when running with OpenMPI? I am asking whether such behavior can be
>> programmed in principle. Does OpenMPI allow such dynamic checking? The scheme
>> I am curious about is the following:
>>
>> 1. A code starts its tasks via mpirun on several nodes
>> 2. At some point one node goes down
>> 3. The code realizes that the node is down (the results are lost) and
>>    excludes it from the list of nodes to run its tasks on
>> 4. At a later point the user restarts the crashed node
>> 5. The code notices that the node is up again, and puts it back on the
>>    list of active nodes
>>
>> Regards,
>> Andrei
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
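
For illustration only, the bookkeeping in steps 3 and 5 of the scheme above could be kept outside MPI, in a driver script that launches each task as a separate mpirun job. This is a rough sketch under that assumption, not something Open MPI provides: the `HostPool` class and the `is_up` probe are hypothetical names, and the probe (ping, ssh, a scheduler query) is left to the caller. It does not let a running MPI job survive a node crash; it only decides which hosts to hand to the next launch.

```python
from typing import Callable, Iterable, List, Set


class HostPool:
    """Track which hosts are currently usable for launching tasks.

    Hypothetical helper for a driver script; not part of Open MPI.
    """

    def __init__(self, hosts: Iterable[str], is_up: Callable[[str], bool]):
        self.all_hosts: List[str] = list(hosts)   # full list, original order
        self.is_up = is_up                        # caller-supplied health probe
        self.active: Set[str] = set(self.all_hosts)

    def mark_failed(self, host: str) -> None:
        """Step 3: exclude a host after a task on it was lost."""
        self.active.discard(host)

    def refresh(self) -> None:
        """Steps 3 and 5: re-probe every known host, dropping hosts that
        are down and re-admitting hosts that have come back."""
        self.active = {h for h in self.all_hosts if self.is_up(h)}

    def hostlist(self) -> str:
        """Comma-separated active hosts, in the original order."""
        return ",".join(h for h in self.all_hosts if h in self.active)
```

A driver would call `refresh()` before each launch and pass `hostlist()` to the launcher, for example through mpirun's `--host` option, then rerun any task whose job aborted.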