Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-21 Thread Ralph Castain
For anyone who is seeing this problem and willing to help debug it, I've added more debug output to the OOB connection handler (which is where the problem resides). I am unable to replicate the problem on any of my systems, so I'd appreciate your help. Just set "-mca oob_base_verbose 10" on

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Ralph Castain
Yeah - even in a singleton, you are still connecting back to the local daemon

On Dec 20, 2013, at 4:20 PM, Paul Hargrove wrote:
> Ralph,
>
> Does some part of the "timer that is firing to indicate a failed connection
> attempt" theory explain the case of singletons

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
FYI: My Solaris-10/SPARC build finally finished and *does* appear to be showing this same behavior.

-Paul

On Fri, Dec 20, 2013 at 4:15 PM, Ralph Castain wrote:
> This is the same problem Jeff and I are looking at on Solaris - it
> requires a slow machine to make it appear.

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
On Fri, Dec 20, 2013 at 4:02 PM, Paul Hargrove wrote:
> FWIW:
> I've confirmed that this is a REGRESSION relative to 1.7.2, which works fine
> on OpenBSD-5
>
> I could not build 1.7.3 due to some of the issues fixed for 1.7.4rc in the
> past 24 hours.
> I am going to try

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
Ralph,

Does some part of the "timer that is firing to indicate a failed connection attempt" theory explain the case of singletons hanging? I'm just bringing this up in case you might be looking in the wrong direction.

-Paul

On Fri, Dec 20, 2013 at 4:15 PM, Ralph Castain

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Ralph Castain
This is the same problem Jeff and I are looking at on Solaris - it requires a slow machine to make it appear. I'm investigating and think I know where the issue might lie (a timer that is firing to indicate a failed connection attempt and causing a race condition)

On Dec 20, 2013, at 4:02 PM,

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
FWIW:
I've confirmed that this is a REGRESSION relative to 1.7.2, which works fine on OpenBSD-5.

I could not build 1.7.3 due to some of the issues fixed for 1.7.4rc in the past 24 hours.
I am going to try back-porting the fix(es) to see if 1.7.3 works or not.

-Paul

On Fri, Dec 20, 2013 at 3:16 PM,

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
Below is the backtrace again, this time configured w/ --enable-debug and for all threads.

-Paul

Thread 2 (thread 1021110):
#0 0x1bc0ef6c5e3a in nanosleep () at :2
#1 0x1bc0f317c2d4 in nanosleep (rqtp=0x7f7bc900, rmtp=0x0) at /usr/src/lib/librthread/rthread_cancel.c:274
#2

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
Brian,

Of course, I should have thought of that myself. See below for backtrace from a singleton run. I'm starting an --enable-debug build to maybe get some line number info too.

-Paul

(gdb) where
#0 0x0406457a9e3a in nanosleep () at :2
#1 0x04063947e2d4 in nanosleep

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Barrett, Brian W
Paul -

Any chance you could grab a stack trace from the mpi app? That's probably the fastest next step

Brian

Sent with Good (www.good.com)

-----Original Message-----
From: Paul Hargrove [phhargr...@lbl.gov]
Sent: Friday, December 20, 2013 03:33 PM Mountain