Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-06 Thread Ralph Castain
Huh - okay, must be a difference in our race conditions. I can run it for more than 1k cycles without hitting it. I’ll poke some more later > On Jun 6, 2016, at 8:06 AM, George Bosilca wrote: > > Ralph, > > Things got better, in the sense that before I was getting about 1 deadlock > for each

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-06 Thread George Bosilca
Ralph, Things got better, in the sense that before I was getting about 1 deadlock for each 300 runs, now the number is more like 1 out of every 500. George. On Tue, Jun 7, 2016 at 12:04 AM, George Bosilca wrote: > Ralph, > > Not there yet. I got similar deadlocks, but the stack now looks slightl

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-06 Thread George Bosilca
Ralph, Not there yet. I got similar deadlocks, but the stack now looks slightly different. I only have 1 single thread doing something useful (aka being in listen_thread_fn); every other thread has a similar stack: * frame #0: 0x7fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait +

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-06 Thread Ralph Castain
I think I have this fixed here: https://github.com/open-mpi/ompi/pull/1756 George - can you please try it on your system? > On Jun 5, 2016, at 4:18 PM, Ralph Castain wrote: > > Yeah, I can reproduce on my box. What is happening is that we aren’t pr
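(One way to pull that PR into a local tree for testing; the branch name and install prefix below are arbitrary, and the usual from-git build steps for Open MPI are assumed.)

    git fetch https://github.com/open-mpi/ompi.git pull/1756/head:pr-1756
    git checkout pr-1756
    ./autogen.pl && ./configure --prefix=$HOME/ompi-pr1756 && make -j8 install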

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-05 Thread Ralph Castain
Yeah, I can reproduce on my box. What is happening is that we aren’t properly protected during finalize, and so we tear down some component that is registered for a callback, and then the callback occurs. So we just need to ensure that we finalize in the right order > On Jun 5, 2016, at 3:51 PM

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-05 Thread Jeff Squyres (jsquyres)
Ok, good. FWIW: I tried all Friday afternoon to reproduce on my OS X laptop (i.e., I did something like George's shell script loop), and just now I ran George's exact loop, but I haven't been able to reproduce. In this case, I'm falling on the wrong side of whatever race condition is happening

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-04 Thread Ralph Castain
I may have an idea of what’s going on here - I just need to finish something else first and then I’ll take a look. > On Jun 4, 2016, at 4:20 PM, George Bosilca wrote: > >> >> On Jun 5, 2016, at 07:53 , Ralph Castain > > wrote: >> >>> >>> On Jun 4, 2016, at 1:11 PM,

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-04 Thread George Bosilca
> On Jun 5, 2016, at 07:53 , Ralph Castain wrote: > >> >> On Jun 4, 2016, at 1:11 PM, George Bosilca > > wrote: >> >> >> >> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain > > wrote: >> He can try adding "-mca state_base_verbose 5”, but if

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-04 Thread Ralph Castain
> On Jun 4, 2016, at 1:11 PM, George Bosilca wrote: > > > > On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain > wrote: > He can try adding "-mca state_base_verbose 5”, but if we are failing to catch > sigchld, I’m not sure what debugging info is going to help resolve t

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-04 Thread George Bosilca
On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain wrote: > He can try adding "-mca state_base_verbose 5”, but if we are failing to > catch sigchld, I’m not sure what debugging info is going to help resolve > that problem. These aren’t even fast-running apps, so there was plenty of > time to register

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-04 Thread Ralph Castain
He can try adding "-mca state_base_verbose 5”, but if we are failing to catch sigchld, I’m not sure what debugging info is going to help resolve that problem. These aren’t even fast-running apps, so there was plenty of time to register for the signal prior to termination. I vaguely recollect th
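(That parameter goes on the mpirun command line; the process count and binary below are placeholders, and tee just keeps a copy of the verbose output.)

    mpirun -mca state_base_verbose 5 -np 4 ./a.out 2>&1 | tee state_verbose.log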

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-04 Thread Jeff Squyres (jsquyres)
Meh. Ok. Should George run with some verbose level to get more info? > On Jun 4, 2016, at 6:43 AM, Ralph Castain wrote: > > Neither of those threads have anything to do with catching the sigchld - > threads 4-5 are listening for OOB and PMIx connection requests. It looks more > like mpirun t

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-04 Thread Ralph Castain
Neither of those threads have anything to do with catching the sigchld - threads 4-5 are listening for OOB and PMIx connection requests. It looks more like mpirun thought it had picked everything up and has begun shutting down, but I can’t really tell for certain. > On Jun 4, 2016, at 6:29 AM,

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-04 Thread Jeff Squyres (jsquyres)
On Jun 3, 2016, at 11:07 PM, George Bosilca wrote: > > After finalize. As I said in my original email I see all the output the > application is generating, and all processes (which are local as this happens > on my laptop) are in zombie mode (Z+). This basically means whoever was > supposed to
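(The zombie state George describes can be confirmed from another shell with standard ps; nothing here is Open MPI specific.)

    # Exited-but-unreaped children show a STAT beginning with "Z";
    # their PPID should be the PID of the stuck mpirun.
    ps -axo pid,ppid,stat,command | awk '$3 ~ /^Z/ || $4 ~ /mpirun/'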

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-04 Thread George Bosilca
On Fri, Jun 3, 2016 at 11:10 PM, Jeff Squyres (jsquyres) wrote: > That's disappointing / puzzling. > > Threads 4 and 5 look like they're in the PMIX / ORTE progress threads, > respectively. > > But I don't see any obvious signs of what threads 1, 2, and 3 are for. Huh. > > When is this hang happenin

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-03 Thread Jeff Squyres (jsquyres)
That's disappointing / puzzling. Threads 4 and 5 look like they're in the PMIX / ORTE progress threads, respectively. But I don't see any obvious signs of what threads 1, 2, and 3 are for. Huh. When is this hang happening -- during init? Middle of the program? During finalize? > On Jun 2, 201

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-02 Thread George Bosilca
Sure, but they mostly look similar. George. (lldb) thread list Process 76811 stopped thread #1: tid = 0x272b40e, 0x7fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP thread #2: tid = 0x272b40f, 0x7fff93306de6 l

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-02 Thread Jeff Squyres (jsquyres)
George -- You might want to get bt's from *all* the threads...? > On Jun 2, 2016, at 5:31 PM, George Bosilca wrote: > > The timeout never triggers and when I attach to the mpirun process I see an > extremely strange stack: > > (lldb) bt > * thread #1: tid = 0x272b40e, 0x7fff93306de6 > l
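(The per-thread backtraces can be captured in one shot by attaching lldb to the stuck mpirun from another terminal; the pgrep pattern below is an assumption about how the process name shows up.)

    lldb -p $(pgrep -n mpirun) \
         -o "thread list" \
         -o "thread backtrace all" \
         -o "detach" -o "quit"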

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-02 Thread George Bosilca
The timeout never triggers and when I attach to the mpirun process I see an extremely strange stack: (lldb) bt * thread #1: tid = 0x272b40e, 0x7fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x7fff9330

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-02 Thread Nathan Hjelm
The osc hang is fixed by a PR to fix bugs in start in cm and ob1. See #1729. -Nathan > On Jun 2, 2016, at 5:17 AM, Gilles Gouaillardet > wrote: > > fwiw, > > the onsided/c_fence_lock test from the ibm test suite hangs > > (mpirun -np 2 ./c_fence_lock) > > i ran a git bisect and it incrimina

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-02 Thread Gilles Gouaillardet
fwiw, the onesided/c_fence_lock test from the ibm test suite hangs (mpirun -np 2 ./c_fence_lock) i ran a git bisect and it incriminates commit b90c83840f472de3219b87cd7e1a364eec5c5a29 commit b90c83840f472de3219b87cd7e1a364eec5c5a29 Author: bosilca Date: Tue
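(A bisect of this kind can be automated roughly as below; the last-known-good revision is a placeholder, and test.sh is a hypothetical wrapper that rebuilds and runs the hanging test, returning non-zero, e.g. via a timeout, when it hangs or fails.)

    git bisect start
    git bisect bad HEAD
    git bisect good <last-known-good-sha>
    git bisect run ./test.sh
    git bisect reset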

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-01 Thread Ralph Castain
Yes, please! I’d like to know what mpirun thinks is happening - if you like, just set the --timeout N --report-state-on-timeout flags and tell me what comes out > On Jun 1, 2016, at 7:57 PM, George Bosilca wrote: > > I don't think it matters. I was running the IBM collective and pt2pt tests, >
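(Those flags go straight on the mpirun command line; the timeout value, process count, and binary below are placeholders.)

    mpirun --timeout 60 --report-state-on-timeout -np 4 ./a.out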

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-01 Thread George Bosilca
I don't think it matters. I was running the IBM collective and pt2pt tests, but each time it deadlocked was in a different test. If you are interested in some particular values, I would be happy to attach a debugger next time it happens. George. On Wed, Jun 1, 2016 at 10:47 PM, Ralph Castain

Re: [OMPI devel] Seldom deadlock in mpirun

2016-06-01 Thread Ralph Castain
What kind of apps are they? Or does it matter what you are running? > On Jun 1, 2016, at 7:37 PM, George Bosilca wrote: > > I have a seldom-occurring deadlock on an OS X laptop (if I use more than 2 > processes). It is coming up once every 200 runs or so. > > Here is what I could gather from

[OMPI devel] Seldom deadlock in mpirun

2016-06-01 Thread George Bosilca
I have a seldom-occurring deadlock on an OS X laptop (if I use more than 2 processes). It is coming up once every 200 runs or so. Here is what I could gather from my experiments: All the MPI processes seem to have correctly completed (I get all the expected output and the MPI processes are in a wa
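(A reproduction loop along the lines described above might look like the sketch below; the process count and test binary are placeholders, not the exact tests that were run.)

    i=0
    while true; do
        i=$((i + 1))
        echo "run $i"
        mpirun -np 4 ./a.out
        # When mpirun deadlocks the loop stops advancing; attach a
        # debugger to the stuck mpirun from another terminal.
    done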