Re: [OMPI users] Running on crashing nodes

Andrei Fokau Fri, 24 Sep 2010 03:37:54 -0400

Ralph, could you tell us when this functionality will be available in the
stable version? A rough estimate will be fine.



On Fri, Sep 24, 2010 at 01:24, Ralph Castain <r...@open-mpi.org> wrote:

> In a word, no. If a node crashes, OMPI will abort the currently-running job
> if it had processes on that node. There is no current ability to "ride-thru"
> such an event.
>
> That said, there is work being done to support "ride-thru". Most of that is
> in the current developer's code trunk, and more is coming, but I wouldn't
> consider it production-quality just yet.
>
> Specifically, the code that does what you specify below is done and works.
> It is recovery of the MPI job itself (collectives, lost messages, etc.) that
> remains to be completed.
>
>
>  On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau <
> andrei.fo...@neutron.kth.se> wrote:
>
>>  Dear users,
>>
>> Our cluster has a number of nodes that are likely to crash, so
>> calculations quite often stop because one node goes down.
>> Do you know whether it is possible to exclude crashed nodes at run time
>> when running with OpenMPI? I am asking whether it is possible in principle
>> to program such behavior. Does OpenMPI allow such dynamic checking? The
>> scheme I am curious about is the following:
>>
>> 1. A code starts its tasks via mpirun on several nodes
>> 2. At some moment one node goes down
>> 3. The code realizes that the node is down (the results are lost) and
>> excludes it from the list of nodes to run its tasks on
>> 4. At a later moment the user restarts the crashed node
>> 5. The code notices that the node is up again, and puts it back on the
>> list of active nodes
>>
>>
>> Regards,
>> Andrei
>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
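
The five-step scheme in the original question can be sketched at the
application level, outside of MPI itself. The following is only an
illustration under stated assumptions, not an Open MPI feature: NodePool,
run_tasks, is_alive and run_on are hypothetical names, is_alive stands for
whatever health probe the site uses, and run_on stands for launching one
independent task on one node (for example a separate mpirun invocation per
task) and raising ConnectionError if that node dies.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class NodePool:
    """Tracks usable nodes. Hypothetical helper, not an Open MPI API."""
    all_nodes: List[str]
    is_alive: Callable[[str], bool]  # assumed health probe supplied by the caller
    active: Set[str] = field(default_factory=set)

    def refresh(self) -> None:
        # Steps 3 and 5: drop nodes that are down, re-admit nodes that are back.
        self.active = {n for n in self.all_nodes if self.is_alive(n)}


def run_tasks(tasks, pool, run_on, max_rounds=10):
    """Run every task, re-queuing tasks whose node failed (results were lost)."""
    results = {}
    pending = list(tasks)
    for _ in range(max_rounds):
        if not pending:
            break
        pool.refresh()
        if not pool.active:
            raise RuntimeError("no active nodes left")
        nodes = sorted(pool.active)
        retry = []
        for i, task in enumerate(pending):
            node = nodes[i % len(nodes)]  # simple round-robin placement
            try:
                results[task] = run_on(node, task)
            except ConnectionError:
                # Step 2/3: the node went down mid-task, so try again next round.
                retry.append(task)
        pending = retry
    if pending:
        raise RuntimeError(f"tasks still unfinished after {max_rounds} rounds")
    return results
```

Because the pool is refreshed at the start of every round, a node that the
user restarts (step 4) is picked up again without restarting the driver. This
only works for tasks that are independent of each other; it does not recover
a single running MPI job, which is the part described above as not yet
complete.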
