I think I have this fixed here: https://github.com/open-mpi/ompi/pull/1756
George - can you please try it on your system?

> On Jun 5, 2016, at 4:18 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Yeah, I can reproduce on my box. What is happening is that we aren’t properly
> protected during finalize, and so we tear down some component that is
> registered for a callback, and then the callback occurs. So we just need to
> ensure that we finalize in the right order
>
>> On Jun 5, 2016, at 3:51 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>
>> Ok, good.
>>
>> FWIW: I tried all Friday afternoon to reproduce on my OS X laptop (i.e., I
>> did something like George's shell script loop), and just now I ran George's
>> exact loop, but I haven't been able to reproduce. In this case, I'm falling
>> on the wrong side of whatever race condition is happening...
>>
>>> On Jun 4, 2016, at 7:57 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> I may have an idea of what’s going on here - I just need to finish
>>> something else first and then I’ll take a look.
>>>
>>>> On Jun 4, 2016, at 4:20 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>
>>>>> On Jun 5, 2016, at 07:53 , Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>>> On Jun 4, 2016, at 1:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>
>>>>>> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> He can try adding "-mca state_base_verbose 5", but if we are failing to
>>>>>> catch sigchld, I’m not sure what debugging info is going to help resolve
>>>>>> that problem. These aren’t even fast-running apps, so there was plenty
>>>>>> of time to register for the signal prior to termination.
>>>>>>
>>>>>> I vaguely recollect that we have occasionally seen this on Mac before
>>>>>> and it had something to do with oddness in sigchld handling…
>>>>>>
>>>>>> Assuming sigchld has some oddness on OSX. Why is then mpirun deadlocking
>>>>>> instead of quitting which will then allow the OS to clean all children?
>>>>>
>>>>> I don’t think mpirun is actually “deadlocked” - I think it may just be
>>>>> waiting for sigchld to tell it that the local processes have terminated.
>>>>>
>>>>> However, that wouldn't explain why you see what looks like libraries
>>>>> being unloaded. That implies mpirun is actually finalizing, but failing
>>>>> to fully exit - which would indeed be more of a deadlock.
>>>>>
>>>>> So the question is: are you truly seeing us missing sigchld (as was
>>>>> suggested earlier in this thread),
>>>>
>>>> In theory the processes remain in zombie state until the parent calls
>>>> waitpid on them, at which moment they are supposed to disappear. Based on
>>>> this, as the processes are still in zombie state, I assumed that mpirun
>>>> was not calling waitpid. One could also assume we are again hit by the
>>>> fork race condition we had a while back, but as all local processes are in
>>>> zombie mode, this is hardly believable.
>>>>
>>>>> or did mpirun correctly see all the child processes terminate and is
>>>>> actually hanging while trying to exit (as was also suggested earlier)?
>>>>
>>>> One way or another the stack of the main thread looks busted. While the
>>>> discussion about this was going on I was able to replicate the bug with
>>>> only ORTE involved. Simply running
>>>>
>>>> for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
>>>>
>>>> ‘deadlocks’ (or whatever name we want to call this) reliably before hitting
>>>> the 300th iteration. Unfortunately adding the verbose option alters the
>>>> behavior enough that the issue does not reproduce.
>>>>
>>>> George.
>>>>
>>>>>
>>>>> Adding the state verbosity should tell us which of those two is true,
>>>>> assuming it doesn’t affect the timing so much that everything works :-/
>>>>>
>>>>>>
>>>>>> George.
>>>>>>
>>>>>>> On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>
>>>>>>> Meh. Ok. Should George run with some verbose level to get more info?
>>>>>>>
>>>>>>>> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>> Neither of those threads has anything to do with catching the sigchld
>>>>>>>> - threads 4-5 are listening for OOB and PMIx connection requests. It
>>>>>>>> looks more like mpirun thought it had picked everything up and has
>>>>>>>> begun shutting down, but I can’t really tell for certain.
>>>>>>>>
>>>>>>>>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>>>
>>>>>>>>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>
>>>>>>>>>> After finalize. As I said in my original email I see all the output
>>>>>>>>>> the application is generating, and all processes (which are local as
>>>>>>>>>> this happens on my laptop) are in zombie mode (Z+). This basically
>>>>>>>>>> means whoever was supposed to get the SIGCHLD, didn't do its job of
>>>>>>>>>> cleaning them up.
>>>>>>>>>
>>>>>>>>> Ah -- so perhaps threads 1,2,3 are red herrings: the real problem
>>>>>>>>> here is that the parent didn't catch the child exits (which
>>>>>>>>> presumably should have been caught in threads 4 or 5).
>>>>>>>>>
>>>>>>>>> Ralph: is there any state from threads 4 or 5 that would be helpful
>>>>>>>>> to examine to see if they somehow missed catching children exits?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jeff Squyres
>>>>>>>>> jsquy...@cisco.com
>>>>>>>>> For corporate legal information go to:
>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> de...@open-mpi.org
>>>>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>> Link to this post:
>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2016/06/19070.php
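
For readers following the zombie-state argument quoted above (a terminated child stays in the process table until its parent calls waitpid on it), here is a minimal sketch of that behavior. It is plain Python on a POSIX system, not ORTE or mpirun code, and the helper names spawn_child, has_process_entry and reap are made up purely for illustration.

```python
import os


def spawn_child(exit_code):
    """Fork a child that exits at once with exit_code; return the child's pid."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running Python cleanup handlers.
        os._exit(exit_code)
    return pid


def has_process_entry(pid):
    """Return True while the kernel still holds a process-table entry for pid.

    Signal 0 performs only the existence/permission check. That check also
    succeeds for a zombie, because an unreaped child keeps its entry.
    """
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False


def reap(pid):
    """Block in waitpid until the child is collected; return its exit code."""
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    child = spawn_child(7)
    # Running or already a zombie: either way the entry is still there.
    print("entry before waitpid:", has_process_entry(child))
    print("exit code from waitpid:", reap(child))
    # Once reaped, the entry is gone (barring immediate pid reuse).
    print("entry after waitpid:", has_process_entry(child))
```

This only shows why lingering Z entries suggest the parent has not (yet) called waitpid. It says nothing about which of the two scenarios discussed in the thread was actually occurring in mpirun.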