It may have been there for a long time. When the OOB loses a connection, nothing
is supposed to happen unless that connection is defined as a "lifeline". 
Remember, the OOB is not an MPI transport - it is there solely to handle 
support functions and therefore is not considered "mission critical". So losing 
an OOB connection isn't considered a "fatal" problem unless it is to the 
"lifeline".

We define a lifeline solely for the case where a daemon dies and we need the 
local procs to "suicide" and mpirun to terminate the job. So I guess the 
question is: which connection failed? Was this a connection from a daemon back 
to mpirun?

Or were you running as a direct launch process - i.e., the connection was 
between two MPI procs that were launched via srun? If so, then there is no 
"lifeline" - if a connection drops, you are on your own. Not much we can do 
about that scenario as you really don't want to abort just because a 
non-critical connection fails.
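
The policy described above can be sketched in C as follows. This is only an illustration of the decision, not the ORTE code: every type and function name here (proc_state_t, peer_t, on_connection_lost, and the ACTION_* values) is hypothetical, and the sketch assumes a process tracks at most one lifeline peer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout: none of these are ORTE identifiers. */
typedef enum {
    ACTION_IGNORE,  /* non-critical connection: keep running */
    ACTION_ABORT    /* lifeline lost: local procs exit, job is terminated */
} lost_action_t;

typedef struct {
    int peer_id;        /* illustrative identifier for the remote end */
} peer_t;

typedef struct {
    bool has_lifeline;  /* false for a direct launch (e.g. via srun) */
    int  lifeline_id;   /* only meaningful when has_lifeline is true */
} proc_state_t;

/* Decide what to do when the connection to `peer` drops. */
lost_action_t on_connection_lost(const proc_state_t *self, const peer_t *peer)
{
    assert(self != NULL && peer != NULL);

    /* Only the loss of the designated lifeline is treated as fatal. */
    if (self->has_lifeline && peer->peer_id == self->lifeline_id) {
        return ACTION_ABORT;
    }

    /* Any other OOB connection is a support channel, so its loss is ignored. */
    return ACTION_IGNORE;
}
```

In this sketch a direct-launched proc would set has_lifeline to false, so no dropped connection ever produces ACTION_ABORT, which matches the "you are on your own" case.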


On Jun 26, 2012, at 1:09 AM, ludovic.hab...@ext.bull.net wrote:

> Version 1.6. But it's already there in 1.5.4.
> 
> -----devel-boun...@open-mpi.org wrote: -----
> To: Open MPI Developers <de...@open-mpi.org>
> From: Ralph Castain 
> Sent by: devel-boun...@open-mpi.org
> Date: 25/06/2012 17:57
> Subject: Re: [OMPI devel] Problem in oob/tcp
> 
> What version?
> 
> On Jun 25, 2012, at 9:53 AM, ludovic.hab...@ext.bull.net wrote:
> 
>> Hi everybody,
>> 
>> I'm facing a problem in orte/oob/tcp/, specifically in the file 
>> oob_tcp_msg.c. Some network interruptions were making my program (a basic 
>> helloworld) hang instead of crash.
>> 
>> I reproduced the problem with gdb by simulating a read error 
>> (jumping from line 357 to 367 in oob_tcp_msg.c). Open MPI then closes the 
>> socket, performs the shutdown, and then hangs.
>> 
>> It seems that there is an exception callback function 
>> (mca_oob_tcp.oob_exception_callback) "planned" but not implemented yet.
>> 
>> Any idea how to solve this problem? Or is this the expected behavior 
>> when we lose a connection? Did I miss anything?
>> 
>> Thanks in advance,
>> 
>> Ludovic
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 