Ok, good.

FWIW: I spent all of Friday afternoon trying to reproduce this on my OS X laptop 
(i.e., I ran something like George's shell script loop), and just now I ran 
George's exact loop, but I haven't been able to reproduce it.  In this case, I'm 
falling on the wrong side of whatever race condition is happening...


> On Jun 4, 2016, at 7:57 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> I may have an idea of what’s going on here - I just need to finish something 
> else first and then I’ll take a look.
> 
> 
>> On Jun 4, 2016, at 4:20 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>> 
>>> 
>>> On Jun 5, 2016, at 07:53 , Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>>> 
>>>> On Jun 4, 2016, at 1:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>> 
>>>> 
>>>> 
>>>> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> He can try adding "-mca state_base_verbose 5", but if we are failing to 
>>>> catch sigchld, I’m not sure what debugging info is going to help resolve 
>>>> that problem. These aren’t even fast-running apps, so there was plenty of 
>>>> time to register for the signal prior to termination.
>>>> 
>>>> I vaguely recollect that we have occasionally seen this on Mac before and 
>>>> it had something to do with oddness in sigchld handling…
>>>> 
>>>> Assuming sigchld has some oddness on OS X, why is mpirun then deadlocking 
>>>> instead of quitting, which would then allow the OS to clean up all the children?
>>> 
>>> I don’t think mpirun is actually “deadlocked” - I think it may just be 
>>> waiting for sigchld to tell it that the local processes have terminated.
>>> 
>>> However, that wouldn't explain why you see what looks like libraries being 
>>> unloaded. That implies mpirun is actually finalizing, but failing to fully 
>>> exit - which would indeed be more of a deadlock.
>>> 
>>> So the question is: are you truly seeing us missing sigchld (as was 
>>> suggested earlier in this thread),
>> 
>> In theory the processes remain in a zombie state until the parent calls 
>> waitpid on them, at which point they are supposed to disappear. Based on 
>> this, since the processes are still in a zombie state, I assumed that mpirun 
>> was not calling waitpid. One could also assume we are again being hit by the 
>> fork race condition we had a while back, but as all the local processes are 
>> zombies, that is hard to believe.
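>> 
>> (As a side note for anyone following along: below is a minimal, illustrative 
>> sketch of the mechanism in question -- not ORTE's actual code -- showing a 
>> parent that installs a SIGCHLD handler and reaps with waitpid(). If the 
>> handler never fires, or waitpid() is never called, the exited children stay 
>> in Z state exactly as observed.)
>> 
>> #include <signal.h>
>> #include <string.h>
>> #include <sys/wait.h>
>> #include <unistd.h>
>> 
>> /* Reap every terminated child; without this (or some other waitpid call)
>>  * the exited children remain zombies until the parent itself exits. */
>> static void sigchld_handler(int sig)
>> {
>>     (void)sig;
>>     while (waitpid(-1, NULL, WNOHANG) > 0)
>>         ;   /* keep reaping until no more exited children are pending */
>> }
>> 
>> int main(void)
>> {
>>     struct sigaction sa;
>>     memset(&sa, 0, sizeof(sa));
>>     sa.sa_handler = sigchld_handler;
>>     sa.sa_flags = SA_RESTART;
>>     sigemptyset(&sa.sa_mask);
>>     sigaction(SIGCHLD, &sa, NULL);
>> 
>>     for (int i = 0; i < 4; ++i) {
>>         if (fork() == 0) {                 /* child: behave like a short app */
>>             execlp("hostname", "hostname", (char *)NULL);
>>             _exit(127);
>>         }
>>     }
>>     sleep(2);   /* give the children time to exit and be reaped */
>>     return 0;   /* with the handler in place, no Z+ entries remain */
>> }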
>> 
>>> or did mpirun correctly see all the child processes terminate and is 
>>> actually hanging while trying to exit (as was also suggested earlier)?
>> 
>> One way or another, the stack of the main thread looks busted. While this 
>> discussion was going on I was able to replicate the bug with only ORTE 
>> involved. Simply running 
>> 
>> for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
>> 
>> 'deadlocks' (or whatever name we want to call this) reliably before hitting 
>> the 300th iteration. Unfortunately, adding the verbose option alters the 
>> behavior enough that the issue does not reproduce.
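>> 
>> For reference, the variant with the verbosity Ralph suggested (which, as 
>> noted, no longer reproduces the hang) would be something along the lines of 
>> 
>> for i in `seq 1 1 1000`; do echo "$i"; mpirun -mca state_base_verbose 5 -np 4 hostname; done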
>> 
>>   George.
>> 
>>> 
>>> Adding the state verbosity should tell us which of those two is true, 
>>> assuming it doesn’t affect the timing so much that everything works :-/
>>> 
>>> 
>>>> 
>>>>   George.
>>>> 
>>>>  
>>>> 
>>>> 
>>>> > On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>>>> > wrote:
>>>> >
>>>> > Meh.  Ok.  Should George run with some verbose level to get more info?
>>>> >
>>>> >> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> >>
>>>> >> Neither of those threads has anything to do with catching the sigchld 
>>>> >> - threads 4-5 are listening for OOB and PMIx connection requests. It 
>>>> >> looks more like mpirun thought it had picked everything up and has 
>>>> >> begun shutting down, but I can’t really tell for certain.
>>>> >>
>>>> >>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) 
>>>> >>> <jsquy...@cisco.com> wrote:
>>>> >>>
>>>> >>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu> 
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> After finalize. As I said in my original email, I see all the output 
>>>> >>>> the application is generating, and all processes (which are local, as 
>>>> >>>> this happens on my laptop) are in zombie mode (Z+). This basically 
>>>> >>>> means whoever was supposed to get the SIGCHLD didn't do its job of 
>>>> >>>> cleaning them up.
>>>> >>>
>>>> >>> Ah -- so perhaps threads 1,2,3 are red herrings: the real problem here 
>>>> >>> is that the parent didn't catch the child exits (which presumably 
>>>> >>> should have been caught in threads 4 or 5).
>>>> >>>
>>>> >>> Ralph: is there any state from threads 4 or 5 that would be helpful to 
>>>> >>> examine to see if they somehow missed catching children exits?
>>>> >>>


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
