On Oct 10, 2011, at 11:14 AM, George Bosilca wrote:

> Ralph,
> 
> If you don't mind I would like to understand this issue a little bit more. 
> What exactly is broken in the termination detection?
> 
>> From a network point of view, there is a slight issue with the commit 25245. 
>> A direct call to exit will close all pending sockets, with a linger of 60 
>> seconds (quite bad if you use static ports as an example). There are proper 
>> protocols to shutdown sockets in a reliable way, maybe it is time to 
>> implement one of them.

Did I miss this message somewhere? I didn't receive it.

The problem is that the daemons with children are not seeing the sockets close 
when their child daemons terminate. So the lowest layer of daemons terminate 
(as they have no children). The next layer does not terminate, leaving mpirun 
hanging.

What's interesting is that mpirun does see the sockets directly connected to it 
close. As you probably recall, the binomial routed module has daemons call back 
directly to mpirun, so that connection is created. The socket to the next 
level of daemon gets created later - it is this later socket whose closure 
isn't being detected.

From what I could see, it appeared to be a progress issue. The socket may be 
closed, but the event lib may not be progressing to detect it. As we intend to 
fix that anyway with the async progress work, and static ports are not the 
default behavior (and rarely used), I didn't feel it worth leaving the trunk 
hanging while continuing to chase it.
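
To illustrate the progress point in general terms (this is not the ORTE code): a peer's close only becomes visible when the event loop is actually polled and the resulting read returns zero bytes. Below is a minimal sketch using Python's standard selectors module and a local socket pair standing in for a parent/child daemon connection. The function name detect_peer_close is hypothetical.

```python
import selectors
import socket


def detect_peer_close(timeout=1.0):
    """Return True if the 'parent' end observes the 'child' end closing.

    Hypothetical illustration only: a local socket pair stands in for
    the connection between a parent daemon and a child daemon.
    """
    parent, child = socket.socketpair()
    sel = selectors.DefaultSelector()
    sel.register(parent, selectors.EVENT_READ)

    # The "child daemon" terminates and its socket is closed.
    child.close()

    # Nothing is noticed until the event loop is progressed. If select()
    # is never called, the closure goes undetected.
    closed = False
    for key, _mask in sel.select(timeout=timeout):
        data = key.fileobj.recv(4096)
        if data == b"":
            # A zero-length read on a readable socket means the peer closed.
            closed = True

    sel.unregister(parent)
    sel.close()
    parent.close()
    return closed


if __name__ == "__main__":
    print("peer close detected:", detect_peer_close())
```

If the select() call is skipped, the closed socket is never examined, which is the general shape of the symptom described above.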

If someone has time, properly closing the sockets is something we've talked 
about for quite a while, but have never gotten around to doing. I'm not sure it 
will resolve this problem, but it would be worth doing anyway.
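
For reference, one common pattern for an orderly close is a half-close followed by a drain: stop writing with shutdown(), read until the peer's end-of-stream arrives, and only then close. The sketch below shows that pattern over a local socket pair. It is an assumption about what "properly closing" could look like, not the protocol George has in mind, and the names orderly_close and demo are hypothetical.

```python
import socket


def orderly_close(sock, timeout=5.0):
    """Half-close, drain until the peer's EOF, then close.

    Hypothetical helper: returns any bytes the peer sent before it
    finished writing. Raises socket.timeout if the peer never closes.
    """
    sock.settimeout(timeout)
    # Tell the peer we are done writing; reads remain possible.
    sock.shutdown(socket.SHUT_WR)
    leftover = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            # Zero-length read: the peer has finished writing too.
            break
        leftover += chunk
    sock.close()
    return leftover


def demo():
    a, b = socket.socketpair()
    # The peer still has data in flight, then finishes writing.
    b.sendall(b"bye")
    b.shutdown(socket.SHUT_WR)
    drained = orderly_close(a)
    # The peer in turn sees EOF from our half-close.
    eof = b.recv(4096)
    b.close()
    return drained, eof


if __name__ == "__main__":
    print(demo())
```

The point of draining before close() is that neither side discards data the other has already sent. Whether this would help with the hang described above is an open question.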


> 
> Thanks,
>  george.
> 
> On Oct 10, 2011, at 12:40 , Ralph Castain wrote:
> 
>> It wasn't the launcher that was broken, but termination detection, and not 
>> for all environments (e.g., worked fine for slurm). It is a progress-related 
>> issue.
>> 
>> Should be fixed in r25245.
>> 
>> 
>> On Oct 10, 2011, at 8:33 AM, Shamis, Pavel wrote:
>> 
>>> +1, I see the same issue.
>>> 
>>>> -----Original Message-----
>>>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>>>> On Behalf Of Yevgeny Kliteynik
>>>> Sent: Monday, October 10, 2011 10:24 AM
>>>> To: OpenMPI Devel
>>>> Subject: [OMPI devel] Launcher in trunk is broken?
>>>> 
>>>> It looks like the process launcher is broken in the OMPI trunk:
>>>> If you run any simple test (not necessarily including MPI calls) on 4 or
>>>> more nodes, the MPI processes won't be killed after the test finishes.
>>>> 
>>>> $ mpirun -host host_1,host_2,host_3,host_4 -np 4 --mca btl sm,tcp,self
>>>> /bin/hostname
>>>> 
>>>> Output:
>>>> host_1
>>>> host_2
>>>> host_3
>>>> host_4
>>>> 
>>>> And test is hanging......
>>>> 
>>>> I have an older trunk (r25228), and everything is OK there.
>>>> Not sure if it means that something was broken after that, or the problem
>>>> existed before, but kicked in only now due to some other change.
>>>> 
>>>> -- YK
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

