Ok, good. FWIW: I spent all Friday afternoon trying to reproduce this on my OS X laptop (i.e., I did something like George's shell script loop), and just now I ran George's exact loop, but I still haven't been able to reproduce it. In this case, I'm falling on the wrong side of whatever race condition is happening...
> On Jun 4, 2016, at 7:57 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> I may have an idea of what's going on here - I just need to finish something else first and then I'll take a look.
> 
>> On Jun 4, 2016, at 4:20 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>> 
>>> On Jun 5, 2016, at 07:53, Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>>> On Jun 4, 2016, at 1:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>> 
>>>> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> He can try adding "-mca state_base_verbose 5", but if we are failing to catch SIGCHLD, I'm not sure what debugging info is going to help resolve that problem. These aren't even fast-running apps, so there was plenty of time to register for the signal prior to termination.
>>>> 
>>>> I vaguely recollect that we have occasionally seen this on Mac before, and it had something to do with oddness in SIGCHLD handling...
>>>> 
>>>> Assuming SIGCHLD has some oddness on OS X, why is mpirun then deadlocking instead of quitting, which would let the OS clean up all the children?
>>> 
>>> I don't think mpirun is actually "deadlocked" - I think it may just be waiting for SIGCHLD to tell it that the local processes have terminated.
>>> 
>>> However, that wouldn't explain why you see what looks like libraries being unloaded. That implies mpirun is actually finalizing, but failing to fully exit - which would indeed be more of a deadlock.
>>> 
>>> So the question is: are you truly seeing us miss SIGCHLD (as was suggested earlier in this thread),
>> 
>> In theory the processes remain in zombie state until the parent calls waitpid on them, at which point they are supposed to disappear. Based on this, since the processes are still in zombie state, I assumed that mpirun was not calling waitpid.
>> One could also assume we are again being hit by the fork race condition we had a while back, but since all the local processes are in zombie mode, that is hard to believe.
>> 
>>> or did mpirun correctly see all the child processes terminate, and is it actually hanging while trying to exit (as was also suggested earlier)?
>> 
>> One way or another, the stack of the main thread looks busted. While this discussion was going on, I was able to replicate the bug with only ORTE involved. Simply running
>> 
>> for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
>> 
>> "deadlocks" (or whatever name we want to give this) reliably before hitting the 300th iteration. Unfortunately, adding the verbose option alters the behavior enough that the issue does not reproduce.
>> 
>> George.
>> 
>>> Adding the state verbosity should tell us which of those two is true, assuming it doesn't affect the timing so much that everything works :-/
>>> 
>>>> George.
>>>> 
>>>> > On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>> > 
>>>> > Meh. Ok. Should George run with some verbose level to get more info?
>>>> > 
>>>> >> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> >> 
>>>> >> Neither of those threads has anything to do with catching the SIGCHLD - threads 4-5 are listening for OOB and PMIx connection requests. It looks more like mpirun thought it had picked everything up and had begun shutting down, but I can't really tell for certain.
>>>> >> 
>>>> >>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>> >>> 
>>>> >>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>> >>>> 
>>>> >>>> After finalize.
>>>> >>>> As I said in my original email, I see all the output the application is generating, and all the processes (which are local, as this happens on my laptop) are in zombie mode (Z+). This basically means whoever was supposed to get the SIGCHLD didn't do its job of cleaning them up.
>>>> >>> 
>>>> >>> Ah -- so perhaps threads 1, 2, 3 are red herrings: the real problem here is that the parent didn't catch the child exits (which presumably should have been caught in threads 4 or 5).
>>>> >>> 
>>>> >>> Ralph: is there any state from threads 4 or 5 that would be helpful to examine, to see if they somehow missed catching the children's exits?
>>>> >>> 
>>>> >>> -- 
>>>> >>> Jeff Squyres
>>>> >>> jsquy...@cisco.com
>>>> >>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>>>> >>> 
>>>> >>> _______________________________________________
>>>> >>> devel mailing list
>>>> >>> de...@open-mpi.org
>>>> >>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> >>> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/06/19070.php

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/