I think I have this fixed here: https://github.com/open-mpi/ompi/pull/1756 

George - can you please try it on your system?


> On Jun 5, 2016, at 4:18 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Yeah, I can reproduce on my box. What is happening is that we aren’t properly 
> protected during finalize, and so we tear down some component that is 
> registered for a callback, and then the callback occurs. So we just need to 
> ensure that we finalize in the right order.
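
The ordering problem described above can be sketched in a few lines. This is only an illustration of the general pattern (deregister the callback before tearing down the state it uses), not Open MPI code: `EventLoop`, `Component`, and the `child_exit` event name are hypothetical and do not correspond to any real ORTE or PMIx API.

```python
# Illustrative only: hypothetical names, not Open MPI's actual API.


class EventLoop:
    """Holds callbacks that may fire at any time until they are removed."""

    def __init__(self):
        self._callbacks = {}

    def register(self, name, fn):
        self._callbacks[name] = fn

    def deregister(self, name):
        self._callbacks.pop(name, None)

    def fire(self, name):
        # Deliver the event only if a callback is still registered.
        fn = self._callbacks.get(name)
        return fn() if fn is not None else None


class Component:
    """A component whose callback is only valid while it is open."""

    def __init__(self, loop):
        self.loop = loop
        self.is_open = True
        loop.register("child_exit", self.on_child_exit)

    def on_child_exit(self):
        if not self.is_open:
            raise RuntimeError("callback ran after component teardown")
        return "handled"

    def finalize_unsafe(self):
        # Wrong order: state is torn down while the callback stays registered.
        self.is_open = False

    def finalize_safe(self):
        # Right order: stop callbacks first, then tear down state.
        self.loop.deregister("child_exit")
        self.is_open = False
```

With `finalize_safe()`, a late `loop.fire("child_exit")` finds no callback and does nothing. With `finalize_unsafe()`, the same late event reaches a torn-down component and raises `RuntimeError`, which is the kind of failure the paragraph above describes.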
> 
>> On Jun 5, 2016, at 3:51 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>> wrote:
>> 
>> Ok, good.
>> 
>> FWIW: I tried all Friday afternoon to reproduce on my OS X laptop (i.e., I 
>> did something like George's shell script loop), and just now I ran George's 
>> exact loop, but I haven't been able to reproduce.  In this case, I'm falling 
>> on the wrong side of whatever race condition is happening...
>> 
>> 
>>> On Jun 4, 2016, at 7:57 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>> I may have an idea of what’s going on here - I just need to finish 
>>> something else first and then I’ll take a look.
>>> 
>>> 
>>>> On Jun 4, 2016, at 4:20 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>> 
>>>>> 
>>>>> On Jun 5, 2016, at 07:53 , Ralph Castain <r...@open-mpi.org> wrote:
>>>>> 
>>>>>> 
>>>>>> On Jun 4, 2016, at 1:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> He can try adding "-mca state_base_verbose 5", but if we are failing to 
>>>>>> catch sigchld, I’m not sure what debugging info is going to help resolve 
>>>>>> that problem. These aren’t even fast-running apps, so there was plenty 
>>>>>> of time to register for the signal prior to termination.
>>>>>> 
>>>>>> I vaguely recollect that we have occasionally seen this on Mac before 
>>>>>> and it had something to do with oddness in sigchld handling…
>>>>>> 
>>>>>> Assuming sigchld has some oddness on OS X, why is mpirun then deadlocking 
>>>>>> instead of quitting, which would allow the OS to clean up all children?
>>>>> 
>>>>> I don’t think mpirun is actually “deadlocked” - I think it may just be 
>>>>> waiting for sigchld to tell it that the local processes have terminated.
>>>>> 
>>>>> However, that wouldn't explain why you see what looks like libraries 
>>>>> being unloaded. That implies mpirun is actually finalizing, but failing 
>>>>> to fully exit - which would indeed be more of a deadlock.
>>>>> 
>>>>> So the question is: are you truly seeing us missing sigchld (as was 
>>>>> suggested earlier in this thread),
>>>> 
>>>> In theory the processes remain in zombie state until the parent calls 
>>>> waitpid on them, at which point they are supposed to disappear. Based on 
>>>> this, as the processes are still in zombie state, I assumed that mpirun 
>>>> was not calling waitpid. One could also assume we are again hit by the 
>>>> fork race condition we had a while back, but as all local processes are in 
>>>> zombie mode, this is hardly believable.
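
The zombie behavior described above can be shown with a small POSIX-only sketch. It is not how mpirun reaps its children; `spawn_and_reap` is a hypothetical helper written for illustration. It forks a child that exits immediately and then reaps it with waitpid.

```python
import os


def spawn_and_reap(exit_code=0):
    """Fork a child that exits at once, then reap it with waitpid."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately, skipping the parent's cleanup handlers.
        os._exit(exit_code)
    # Parent: until this waitpid call returns, the exited child stays in
    # the process table as a zombie (state "Z" in ps output).
    reaped_pid, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
    return pid, reaped_pid, code
```

Once the child has been reaped, a second `os.waitpid(pid, 0)` on the same pid raises `ChildProcessError`, because the process table entry is gone. Children that stay in state Z therefore suggest that the parent has not yet called waitpid on them.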
>>>> 
>>>>> or did mpirun correctly see all the child processes terminate and is 
>>>>> actually hanging while trying to exit (as was also suggested earlier)?
>>>> 
>>>> One way or another the stack of the main thread looks busted. While the 
>>>> discussion about this was going on I was able to replicate the bug with 
>>>> only ORTE involved. Simply running 
>>>> 
>>>> for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
>>>> 
>>>> ‘deadlocks’ (or whatever name we want to call this) reliably before hitting 
>>>> the 300th iteration. Unfortunately, adding the verbose option alters the 
>>>> behavior enough that the issue does not reproduce.
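
A reproduction loop like the one above can also be given a per-run timeout so that a hang is reported instead of blocking the terminal. This is a sketch under assumptions: `run_until_hang` is a hypothetical helper, and the 60-second default is an arbitrary value, not a number taken from this thread.

```python
import subprocess


def run_until_hang(cmd, iterations=1000, timeout_s=60.0):
    """Run cmd repeatedly; return the 1-based iteration that hung, or None."""
    for i in range(1, iterations + 1):
        try:
            # On timeout, subprocess.run kills the child and waits for it
            # before re-raising TimeoutExpired.
            subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                timeout=timeout_s,
                check=False,
            )
        except subprocess.TimeoutExpired:
            return i
    return None
```

For the case discussed here, the call might look like `run_until_hang(["mpirun", "-np", "4", "hostname"])`. Note that the timeout kills only the command that was launched directly, so any processes it started may need separate cleanup.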
>>>> 
>>>> George.
>>>> 
>>>>> 
>>>>> Adding the state verbosity should tell us which of those two is true, 
>>>>> assuming it doesn’t affect the timing so much that everything works :-/
>>>>> 
>>>>> 
>>>>>> 
>>>>>> George.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) 
>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>> 
>>>>>>> Meh.  Ok.  Should George run with some verbose level to get more info?
>>>>>>> 
>>>>>>>> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>> 
>>>>>>>> Neither of those threads has anything to do with catching the sigchld 
>>>>>>>> - threads 4-5 are listening for OOB and PMIx connection requests. It 
>>>>>>>> looks more like mpirun thought it had picked everything up and has 
>>>>>>>> begun shutting down, but I can’t really tell for certain.
>>>>>>>> 
>>>>>>>>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) 
>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>> 
>>>>>>>>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu> 
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> After finalize. As I said in my original email I see all the output 
>>>>>>>>>> the application is generating, and all processes (which are local as 
>>>>>>>>>> this happens on my laptop) are in zombie mode (Z+). This basically 
>>>>>>>>>> means whoever was supposed to get the SIGCHLD didn't do its job of 
>>>>>>>>>> cleaning them up.
>>>>>>>>> 
>>>>>>>>> Ah -- so perhaps threads 1,2,3 are red herrings: the real problem 
>>>>>>>>> here is that the parent didn't catch the child exits (which 
>>>>>>>>> presumably should have been caught in threads 4 or 5).
>>>>>>>>> 
>>>>>>>>> Ralph: is there any state from threads 4 or 5 that would be helpful 
>>>>>>>>> to examine to see if they somehow missed catching children exits?
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Jeff Squyres
>>>>>>>>> jsquy...@cisco.com
>>>>>>>>> For corporate legal information go to: 
>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> de...@open-mpi.org
>>>>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>> Link to this post: 
>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2016/06/19070.php
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
> 
