George -- You might want to get bt's from *all* the threads...?
> On Jun 2, 2016, at 5:31 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> The timeout never triggers and when I attach to the mpirun process I see an
> extremely strange stack:
>
> (lldb) bt
> * thread #1: tid = 0x272b40e, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>   * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
>     frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
>     frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
>     frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
>     frame #4: 0x00007fff6ca8e177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
>     frame #5: 0x00007fff6ca78063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
>     frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
>     frame #7: 0x000000010a5b29b0 libopen-pal.0.dylib`obj_order_type + 3776
>
> This seems to indicate that we are trying to access a function from a dylib
> that has been or is in the process of being unloaded.
>
> George.
>
>
> On Thu, Jun 2, 2016 at 8:34 AM, Nathan Hjelm <hje...@me.com> wrote:
> The osc hang is fixed by a PR to fix bugs in start in cm and ob1. See #1729.
>
> -Nathan
>
> On Jun 2, 2016, at 5:17 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
>> fwiw,
>>
>> the onsided/c_fence_lock test from the ibm test suite hangs
>>
>> (mpirun -np 2 ./c_fence_lock)
>>
>> i ran a git bisect and it incriminates commit
>> b90c83840f472de3219b87cd7e1a364eec5c5a29
>>
>> commit b90c83840f472de3219b87cd7e1a364eec5c5a29
>> Author: bosilca <bosi...@users.noreply.github.com>
>> Date: Tue May 24 18:20:51 2016 -0500
>>
>> Refactor the request completion (#1422)
>>
>> * Remodel the request.
>> Added the wait sync primitive and integrate it into the PML and MTL
>> infrastructure. The multi-threaded requests are now significantly
>> less heavy and less noisy (only the threads associated with completed
>> requests are signaled).
>>
>> * Fix the condition to release the request.
>>
>>
>> I also noted a warning is emitted when running only one task
>>
>> ./c_fence_lock
>>
>> but I did not git bisect, so that might not be related
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Thursday, June 2, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>> Yes, please! I’d like to know what mpirun thinks is happening - if you like,
>> just set the --timeout N --report-state-on-timeout flags and tell me what
>> comes out
>>
>>> On Jun 1, 2016, at 7:57 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>
>>> I don't think it matters. I was running the IBM collective and pt2pt tests,
>>> but each time it deadlocked, it was in a different test. If you are
>>> interested in some particular values, I would be happy to attach a debugger
>>> next time it happens.
>>>
>>> George.
>>>
>>>
>>> On Wed, Jun 1, 2016 at 10:47 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> What kind of apps are they? Or does it matter what you are running?
>>>
>>>
>>> > On Jun 1, 2016, at 7:37 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>> >
>>> > I have a rarely occurring deadlock on an OS X laptop if I use more than
>>> > 2 processes. It comes up once every 200 runs or so.
>>> >
>>> > Here is what I could gather from my experiments: All the MPI processes
>>> > seem to have correctly completed (I get all the expected output and the
>>> > MPI processes are in a waiting state), but somehow mpirun does not
>>> > detect their completion. As a result, mpirun never returns.
>>> >
>>> > George.
>>> >
>>> > _______________________________________________
>>> > devel mailing list
>>> > de...@open-mpi.org
>>> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> > Searchable archives:
>>> > http://www.open-mpi.org/community/lists/devel/2016/06/19054.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/