On Fri, Jun 3, 2016 at 11:10 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> That's disappointing / puzzling.
>
> Threads 4 and 5 look like they're in the PMIX / ORTE progress threads,
> respectively.
>
> But I don't see any obvious signs of what threads 1, 2, and 3 are for.  Huh.
>
> When is this hang happening -- during init?  Middle of the program?
> During finalize?
>

After finalize. As I said in my original email, I see all the output the
application is generating, and all the processes (which are local, as this
happens on my laptop) are in the zombie state (Z+). This basically means
whoever was supposed to get the SIGCHLD didn't do its job of cleaning them
up.
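
For reference, the cleanup that should be happening on the mpirun side is
the classic SIGCHLD + waitpid() reaping pattern. A minimal sketch of that
pattern (purely illustrative C, not the actual ORTE handler code; the
handler name here is made up):

    /* Minimal sketch of SIGCHLD reaping; illustrative only, not ORTE code. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>

    static void sigchld_handler(int sig)
    {
        int status;
        (void)sig;
        /* Reap every exited child; WNOHANG keeps the handler from blocking.
         * Each successful waitpid() moves one child out of the Z+ state. */
        while (waitpid(-1, &status, WNOHANG) > 0) {
            /* child collected */
        }
    }

    int main(void)
    {
        struct sigaction sa;
        sa.sa_handler = sigchld_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART;
        if (sigaction(SIGCHLD, &sa, NULL) < 0) {
            perror("sigaction");
            return 1;
        }
        /* ... fork() children and run the job here ... */
        return 0;
    }

If that handler never runs (or the event loop that would have called
waitpid() is already torn down), the children stay zombies exactly as
observed.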

  George.



>
>
> > On Jun 2, 2016, at 6:00 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >
> > Sure, but they mostly look similar.
> >
> >   George.
> >
> >
> > (lldb) thread list
> > Process 76811 stopped
> >   thread #1: tid = 0x272b40e, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
> >   thread #2: tid = 0x272b40f, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >   thread #3: tid = 0x272b410, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >   thread #4: tid = 0x272b411, 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> > * thread #5: tid = 0x272b412, 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> > (lldb)
> >
> >
> > (lldb) thread select 1
> > * thread #1: tid = 0x272b40e, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
> >     frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> > libsystem_kernel.dylib`__psynch_mutexwait:
> > ->  0x7fff93306de6 <+10>: jae    0x7fff93306df0            ; <+20>
> >     0x7fff93306de8 <+12>: movq   %rax, %rdi
> >     0x7fff93306deb <+15>: jmp    0x7fff933017cd            ; cerror_nocancel
> >     0x7fff93306df0 <+20>: retq
> > (lldb) bt
> > * thread #1: tid = 0x272b40e, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
> >   * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >     frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
> >     frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
> >     frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
> >     frame #4: 0x00007fff6ca8e177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
> >     frame #5: 0x00007fff6ca78063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
> >     frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
> >     frame #7: 0x000000010a5b29b0 libopen-pal.0.dylib`obj_order_type + 3776
> >
> >
> > (lldb) thread select 2
> > * thread #2: tid = 0x272b40f, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >     frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> > libsystem_kernel.dylib`__psynch_mutexwait:
> > ->  0x7fff93306de6 <+10>: jae    0x7fff93306df0            ; <+20>
> >     0x7fff93306de8 <+12>: movq   %rax, %rdi
> >     0x7fff93306deb <+15>: jmp    0x7fff933017cd            ; cerror_nocancel
> >     0x7fff93306df0 <+20>: retq
> > (lldb) bt
> > * thread #2: tid = 0x272b40f, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >   * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >     frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
> >     frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
> >     frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
> >     frame #4: 0x00007fff6ca8e177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
> >     frame #5: 0x00007fff6ca78063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
> >     frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
> >     frame #7: 0x000000010a5b29b0 libopen-pal.0.dylib`obj_order_type + 3776
> >
> >
> > (lldb) thread select 3
> > * thread #3: tid = 0x272b410, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >     frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> > libsystem_kernel.dylib`__psynch_mutexwait:
> > ->  0x7fff93306de6 <+10>: jae    0x7fff93306df0            ; <+20>
> >     0x7fff93306de8 <+12>: movq   %rax, %rdi
> >     0x7fff93306deb <+15>: jmp    0x7fff933017cd            ; cerror_nocancel
> >     0x7fff93306df0 <+20>: retq
> > (lldb) bt
> > * thread #3: tid = 0x272b410, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >   * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >     frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
> >     frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
> >     frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
> >     frame #4: 0x00007fff6ca8e177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
> >     frame #5: 0x00007fff6ca78063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
> >     frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
> >     frame #7: 0x000000010a5b29b0 libopen-pal.0.dylib`obj_order_type + 3776
> >
> >
> > (lldb) thread select 4
> > * thread #4: tid = 0x272b411, 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> >     frame #0: 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> > libsystem_kernel.dylib`__select:
> > ->  0x7fff9330707a <+10>: jae    0x7fff93307084            ; <+20>
> >     0x7fff9330707c <+12>: movq   %rax, %rdi
> >     0x7fff9330707f <+15>: jmp    0x7fff933017f2            ; cerror
> >     0x7fff93307084 <+20>: retq
> > (lldb) bt
> > * thread #4: tid = 0x272b411, 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> >   * frame #0: 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> >     frame #1: 0x000000010a9b1273 mca_pmix_pmix114.so`listen_thread(obj=0x0000000000000000) + 371 at pmix_server_listener.c:226
> >     frame #2: 0x00007fff9a00099d libsystem_pthread.dylib`_pthread_body + 131
> >     frame #3: 0x00007fff9a00091a libsystem_pthread.dylib`_pthread_start + 168
> >     frame #4: 0x00007fff99ffe351 libsystem_pthread.dylib`thread_start + 13
> >
> >
> > (lldb) thread select 5
> > * thread #5: tid = 0x272b412, 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> >     frame #0: 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> > libsystem_kernel.dylib`__select:
> > ->  0x7fff9330707a <+10>: jae    0x7fff93307084            ; <+20>
> >     0x7fff9330707c <+12>: movq   %rax, %rdi
> >     0x7fff9330707f <+15>: jmp    0x7fff933017f2            ; cerror
> >     0x7fff93307084 <+20>: retq
> > (lldb) bt
> > * thread #5: tid = 0x272b412, 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> >   * frame #0: 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> >     frame #1: 0x000000010a3c13cc libopen-rte.0.dylib`listen_thread_fn(obj=0x000000010a46e8c0) + 428 at listener.c:261
> >     frame #2: 0x00007fff9a00099d libsystem_pthread.dylib`_pthread_body + 131
> >     frame #3: 0x00007fff9a00091a libsystem_pthread.dylib`_pthread_start + 168
> >     frame #4: 0x00007fff99ffe351 libsystem_pthread.dylib`thread_start + 13
> >
> >
> >
> >
> > On Fri, Jun 3, 2016 at 9:50 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > George --
> >
> > You might want to get bt's from *all* the threads...?
> >
> >
> > > On Jun 2, 2016, at 5:31 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> > >
> > > The timeout never triggers, and when I attach to the mpirun process I see an extremely strange stack:
> > >
> > > (lldb) bt
> > > * thread #1: tid = 0x272b40e, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
> > >   * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> > >     frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
> > >     frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
> > >     frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
> > >     frame #4: 0x00007fff6ca8e177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
> > >     frame #5: 0x00007fff6ca78063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
> > >     frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
> > >     frame #7: 0x000000010a5b29b0 libopen-pal.0.dylib`obj_order_type + 3776
> > >
> > > This seems to indicate that we are trying to access a function from a dylib that has been or is in the process of being unloaded.
> > >
> > >   George.
> > >
> > >
> > > On Thu, Jun 2, 2016 at 8:34 AM, Nathan Hjelm <hje...@me.com> wrote:
> > > The osc hang is fixed by a PR to fix bugs in start in cm and ob1. See
> #1729.
> > >
> > > -Nathan
> > >
> > > On Jun 2, 2016, at 5:17 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> > >
> > >> fwiw,
> > >>
> > >> the onsided/c_fence_lock test from the ibm test suite hangs
> > >>
> > >> (mpirun -np 2 ./c_fence_lock)
> > >>
> > >> I ran a git bisect, and it incriminates commit b90c83840f472de3219b87cd7e1a364eec5c5a29
> > >>
> > >> commit b90c83840f472de3219b87cd7e1a364eec5c5a29
> > >> Author: bosilca <bosi...@users.noreply.github.com>
> > >> Date:   Tue May 24 18:20:51 2016 -0500
> > >>
> > >>     Refactor the request completion (#1422)
> > >>
> > >>     * Remodel the request.
> > >>     Added the wait sync primitive and integrate it into the PML and MTL
> > >>     infrastructure. The multi-threaded requests are now significantly
> > >>     less heavy and less noisy (only the threads associated with completed
> > >>     requests are signaled).
> > >>
> > >>     * Fix the condition to release the request.
> > >>
> > >>
> > >>
> > >>
> > >> I also noted that a warning is emitted when running with only one task
> > >>
> > >> ./c_fence_lock
> > >>
> > >> but I did not git bisect, so that might not be related
> > >>
> > >> Cheers,
> > >>
> > >>
> > >>
> > >> Gilles
> > >>
> > >>
> > >> On Thursday, June 2, 2016, Ralph Castain <r...@open-mpi.org> wrote:
> > >> Yes, please! I’d like to know what mpirun thinks is happening - if you like, just set the --timeout N --report-state-on-timeout flags and tell me what comes out.
> > >>
> > >>> On Jun 1, 2016, at 7:57 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> > >>>
> > >>> I don't think it matters. I was running the IBM collective and pt2pt tests, but each time it deadlocked, it was in a different test. If you are interested in some particular values, I would be happy to attach a debugger next time it happens.
> > >>>
> > >>>   George.
> > >>>
> > >>>
> > >>> On Wed, Jun 1, 2016 at 10:47 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > >>> What kind of apps are they? Or does it matter what you are running?
> > >>>
> > >>>
> > >>> > On Jun 1, 2016, at 7:37 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> > >>> >
> > >>> > I have a rarely occurring deadlock on an OS X laptop if I use more than 2 processes. It comes up once every 200 runs or so.
> > >>> >
> > >>> > Here is what I could gather from my experiments: All the MPI processes seem to have correctly completed (I get all the expected output, and the MPI processes are in a waiting state), but somehow mpirun does not detect their completion. As a result, mpirun never returns.
> > >>> >
> > >>> >   George.
> > >>> >
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>