On Oct 11, 2011, at 1:40 PM, George Bosilca wrote:

> Unfortunately, this issue appears to have been introduced by a certain 
> change that was supposed to make the ORTE code more debugging-friendly 
> [obviously]. That particular change duplicated the epoch-tainted error 
> managers into their "default" versions, which are more stable but also 
> provide fewer features.

Interesting - I didn't see any "comm failure" calls back to the errmgr. 
However, it is always possible I missed them...I'll check again.

> 
> The patches (r25245, r25248) proposed so far as a solution to this problem 
> should be removed, as they do not really solve the problem; instead, they 
> only alleviate the symptoms. From here there are two possible fixes:
> 
> 1. Put back the code dealing with the daemons leaving the job in the 
> "default" version of the orted error manager.
> 
> Here are the lines to be added to update_status in 
> orte/mca/errmgr/default_orted/errmgr_default_orted.c:
> 
>        if (0 == orte_routed.num_routes() &&
>            0 == opal_list_get_size(&orte_local_children)) {
>            orte_quit();
>        }
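
[For reference, a minimal self-contained sketch of the pattern those quoted lines implement: a daemon shuts itself down once it has neither routes nor local children left. The counters and helper names below are illustrative stand-ins, not ORTE code; only the routes/children/quit logic comes from the snippet above.]

    #include <stdio.h>

    /* Stand-ins for orte_routed.num_routes() and
     * opal_list_get_size(&orte_local_children); illustrative only. */
    static int num_routes = 1;          /* routes to child daemons  */
    static int num_local_children = 1;  /* locally launched procs   */

    static void orte_quit_stub(void) { puts("daemon exiting"); }

    /* Called whenever a route or local child goes away, mirroring the
     * check proposed for update_status(). */
    static void update_status_stub(void)
    {
        if (0 == num_routes && 0 == num_local_children) {
            orte_quit_stub();   /* nothing left to route or supervise */
        }
    }

    int main(void)
    {
        num_routes = 0;         update_status_stub();  /* child still alive: keep running */
        num_local_children = 0; update_status_stub();  /* now the daemon exits            */
        return 0;
    }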

Thanks for looking at this more closely. I'll restore those lines, and see if 
we are actually getting there. Could be the system I'm using behaves 
differently.

> 
> 2. Remove the "default" versions and fall back to a single sane set of error 
> managers.

No thanks, I'd rather only go thru this recovery process once :-)

> 
>  george.
> 
> PS: I feel compelled to clarify a statement here. The failure detection at 
> the socket level is working properly under all circumstances; how we deal 
> with it at the upper levels has suffered some mishandling lately.
> 
> 
> On Oct 11, 2011, at 13:17, George Bosilca wrote:
> 
>> 
>> On Oct 11, 2011, at 01:17, Ralph Castain wrote:
>> 
>>> 
>>> On Oct 10, 2011, at 11:29 AM, Ralph Castain wrote:
>>> 
>>>> 
>>>> On Oct 10, 2011, at 11:14 AM, George Bosilca wrote:
>>>> 
>>>>> Ralph,
>>>>> 
>>>>> If you don't mind, I would like to understand this issue a little bit 
>>>>> more. What exactly is broken in the termination detection?
>>>>> 
>>>>>> From a network point of view, there is a slight issue with commit 
>>>>>> r25245. A direct call to exit will close all pending sockets, with a 
>>>>>> linger of 60 seconds (quite bad if you use static ports, for example). 
>>>>>> There are proper protocols to shut down sockets in a reliable way; maybe 
>>>>>> it is time to implement one of them.
>>> 
>>> I should have clarified something here. The change in r25245 doesn't cause 
>>> the daemons to directly call exit. It causes them to call orte_quit, which 
>>> runs thru orte_finalize on its way out. So if the oob properly shuts down 
>>> its sockets, then there will be no issue with lingering sockets.
>>> 
>>> The question has always been: does the oob properly shut down sockets? If 
>>> someone has time to address that issue, it would indeed be helpful.
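
[For reference only, and not OOB code: an orderly TCP close in plain POSIX terms looks roughly like the sketch below. The fd is assumed to be an already-connected socket, and error handling is omitted.]

    #include <sys/socket.h>
    #include <unistd.h>

    /* Orderly shutdown of a connected TCP socket: announce EOF to the peer,
     * drain anything it still has queued, then close.  This is the kind of
     * protocol that avoids lingering sockets when a process exits. */
    static void orderly_close(int fd)
    {
        char buf[512];

        shutdown(fd, SHUT_WR);                 /* send FIN: we have no more data */
        while (recv(fd, buf, sizeof(buf), 0) > 0)
            ;                                  /* read until the peer closes (or errors) */
        close(fd);
    }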
>> 
>> Please, allow me to clarify this one. We have lived with the OOB issue for 
>> 7 (oh, a lucky number) years, and we have always managed to patch the rest 
>> of ORTE to deal with it.
>> 
>> Why would someone suddenly bother to fix it?
>> 
>> george.
>> 
>>> 
>>> 
>>>> 
>>>> Did I miss this message somewhere? I didn't receive it.
>>>> 
>>>> The problem is that the daemons with children are not seeing the sockets 
>>>> close when their child daemons terminate. So the lowest layer of daemons 
>>>> terminates (as those daemons have no children), but the next layer up does 
>>>> not, leaving mpirun hanging.
>>>> 
>>>> What's interesting is that mpirun does see the sockets directly connected 
>>>> to it close. As you probably recall, the binomial routed module has 
>>>> daemons call back directly to mpirun, so that connection is created. The 
>>>> socket to the next level of daemon gets created later - it is this later 
>>>> socket whose closure isn't being detected.
>>>> 
>>>> From what I could see, it appeared to be a progress issue. The socket may 
>>>> be closed, but the event lib may not be progressing to detect it. As we 
>>>> intend to fix that anyway with the async progress work, and static ports 
>>>> are not the default behavior (and are rarely used), I didn't feel it was 
>>>> worth leaving the trunk hanging while continuing to chase it.
>>>> 
>>>> If someone has time, properly closing the sockets is something we've 
>>>> talked about for quite a while but have never gotten around to doing. I'm 
>>>> not sure it will resolve this problem, but it would be worth doing anyway.
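
[To illustrate the progress point with something self-contained: in the small libevent sketch below, which is not ORTE/OOB code, the remote close only becomes visible once the event loop actually runs and the read callback sees recv() return 0. The socketpair stands in for a daemon-to-daemon connection.]

    #include <event2/event.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <stdio.h>

    /* Read callback: recv() returning 0 is how a peer's close shows up.
     * It only fires if the event loop is actually being progressed. */
    static void on_readable(evutil_socket_t fd, short events, void *arg)
    {
        char buf[256];
        if (0 == recv(fd, buf, sizeof(buf), 0)) {
            printf("detected close of socket %d\n", (int)fd);
            event_base_loopbreak((struct event_base *)arg);
        }
        (void)events;
    }

    int main(void)
    {
        int sv[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);  /* stand-in for a daemon-daemon link */

        struct event_base *base = event_base_new();
        struct event *ev = event_new(base, sv[0], EV_READ | EV_PERSIST, on_readable, base);
        event_add(ev, NULL);

        close(sv[1]);                 /* the "remote daemon" goes away...         */
        event_base_dispatch(base);    /* ...but we only notice once the loop runs */

        event_free(ev);
        event_base_free(base);
        return 0;
    }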
>>>> 
>>>> 
>>>>> 
>>>>> Thanks,
>>>>> george.
>>>>> 
>>>>> On Oct 10, 2011, at 12:40, Ralph Castain wrote:
>>>>> 
>>>>>> It wasn't the launcher that was broken, but termination detection, and 
>>>>>> not in all environments (e.g., it worked fine under slurm). It is a 
>>>>>> progress-related issue.
>>>>>> 
>>>>>> Should be fixed in r25245.
>>>>>> 
>>>>>> 
>>>>>> On Oct 10, 2011, at 8:33 AM, Shamis, Pavel wrote:
>>>>>> 
>>>>>>> +1, I see the same issue.
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>>>>>>>> On Behalf Of Yevgeny Kliteynik
>>>>>>>> Sent: Monday, October 10, 2011 10:24 AM
>>>>>>>> To: OpenMPI Devel
>>>>>>>> Subject: [OMPI devel] Launcher in trunk is broken?
>>>>>>>> 
>>>>>>>> It looks like the process launcher is broken in the OMPI trunk:
>>>>>>>> If you run any simple test (not necessarily including MPI calls) on 4 
>>>>>>>> or
>>>>>>>> more nodes, the MPI processes won't be killed after the test finishes.
>>>>>>>> 
>>>>>>>> $ mpirun -host host_1,host_2,host_3,host_4 -np 4 --mca btl sm,tcp,self
>>>>>>>> /bin/hostname
>>>>>>>> 
>>>>>>>> Output:
>>>>>>>> host_1
>>>>>>>> host_2
>>>>>>>> host_3
>>>>>>>> host_4
>>>>>>>> 
>>>>>>>> And the test is hanging...
>>>>>>>> 
>>>>>>>> I have an older trunk (r25228), and everything is OK there.
>>>>>>>> Not sure if that means something was broken after that revision, or 
>>>>>>>> whether the problem existed before but kicked in only now due to some 
>>>>>>>> other change.
>>>>>>>> 
>>>>>>>> -- YK