Yeah, I can reproduce on my box. What is happening is that we aren't properly protected during finalize: we tear down a component that is still registered for a callback, and then the callback occurs. So we just need to ensure that we finalize in the right order.
> On Jun 5, 2016, at 3:51 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> Ok, good.
>
> FWIW: I tried all Friday afternoon to reproduce on my OS X laptop (i.e., I did something like George's shell script loop), and just now I ran George's exact loop, but I haven't been able to reproduce. In this case, I'm falling on the wrong side of whatever race condition is happening...
>
>> On Jun 4, 2016, at 7:57 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> I may have an idea of what's going on here - I just need to finish something else first, and then I'll take a look.
>>
>>> On Jun 4, 2016, at 4:20 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>
>>>> On Jun 5, 2016, at 07:53, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> On Jun 4, 2016, at 1:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>
>>>>> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> He can try adding "-mca state_base_verbose 5", but if we are failing to catch sigchld, I'm not sure what debugging info is going to help resolve that problem. These aren't even fast-running apps, so there was plenty of time to register for the signal prior to termination.
>>>>>
>>>>> I vaguely recollect that we have occasionally seen this on Mac before, and it had something to do with oddness in sigchld handling...
>>>>>
>>>>> Assuming sigchld has some oddness on OS X, why is mpirun then deadlocking instead of quitting, which would allow the OS to clean up all the children?
>>>>
>>>> I don't think mpirun is actually "deadlocked" - I think it may just be waiting for sigchld to tell it that the local processes have terminated.
>>>>
>>>> However, that wouldn't explain why you see what looks like libraries being unloaded. That implies mpirun is actually finalizing but failing to fully exit - which would indeed be more of a deadlock.
>>>>
>>>> So the question is: are you truly seeing us missing sigchld (as was suggested earlier in this thread),
>>>
>>> In theory the processes remain in zombie state until the parent calls waitpid on them, at which moment they are supposed to disappear. Based on this, as the processes are still in zombie state, I assumed that mpirun was not calling waitpid. One could also assume we are again being hit by the fork race condition we had a while back, but as all local processes are in zombie mode, that is hard to believe.
>>>
>>>> or did mpirun correctly see all the child processes terminate and is actually hanging while trying to exit (as was also suggested earlier)?
>>>
>>> One way or another, the stack of the main thread looks busted. While this discussion was going on, I was able to replicate the bug with only ORTE involved. Simply running
>>>
>>> for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
>>>
>>> reliably "deadlocks" (or whatever we want to call this) before hitting the 300th iteration. Unfortunately, adding the verbose option alters the behavior enough that the issue does not reproduce.
>>>
>>> George.
>>>
>>>> Adding the state verbosity should tell us which of those two is true, assuming it doesn't affect the timing so much that everything works :-/
>>>>
>>>>> George.
>>>>>
>>>>>> On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>
>>>>>> Meh. Ok. Should George run with some verbose level to get more info?
>>>>>>
>>>>>>> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>> Neither of those threads has anything to do with catching the sigchld - threads 4-5 are listening for OOB and PMIx connection requests. It looks more like mpirun thought it had picked everything up and has begun shutting down, but I can't really tell for certain.
>>>>>>>
>>>>>>>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>>
>>>>>>>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>
>>>>>>>>> After finalize. As I said in my original email, I see all the output the application is generating, and all processes (which are local, as this happens on my laptop) are in zombie mode (Z+). This basically means whoever was supposed to get the SIGCHLD didn't do its job of cleaning them up.
>>>>>>>>
>>>>>>>> Ah -- so perhaps threads 1, 2, 3 are red herrings: the real problem here is that the parent didn't catch the child exits (which presumably should have been caught in threads 4 or 5).
>>>>>>>>
>>>>>>>> Ralph: is there any state from threads 4 or 5 that would be helpful to examine to see if they somehow missed catching children exits?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jeff Squyres
>>>>>>>> jsquy...@cisco.com
>>>>>>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> de...@open-mpi.org
>>>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/06/19070.php