Re: [OMPI devel] 1.7.4rc1 autogen error: NetBSD-6

2013-12-20 Thread Paul Hargrove
My dinner plans have been delayed. So, here is the promised fix:
$ diff -u autogen.pl~ autogen.pl
--- autogen.pl~ 2013-12-20 18:01:21.0 -0800
+++ autogen.pl 2013-12-20 18:31:09.0 -0800
@@ -967,6 +967,9 @@
 verbose "$indent_str"."Patching configure for IBM xlf libtool bug\n";

Re: [OMPI devel] 1.7.4rc1 autogen error: NetBSD-6

2013-12-20 Thread Paul Hargrove
As I indicated earlier today, the CMRed fix to push/pop "dir" in hwloc did NOT fix the problem of configure failing after running autogen.pl on my NetBSD-6/amd64 system. I've traced the problem to the following fragment from _LT_PROG_ECHO_BACKSLASH in the NetBSD-provided libtool.m4:

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Ralph Castain
Yeah - even in a singleton, you are still connecting back to the local daemon On Dec 20, 2013, at 4:20 PM, Paul Hargrove wrote: > Ralph, > > Does some part of the "timer that is firing to indicate a failed connection > attempt" theory explain the case of singletons

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
FYI: My Solaris-10/SPARC build finally finished and *does* appear to be showing this same behavior. -Paul On Fri, Dec 20, 2013 at 4:15 PM, Ralph Castain wrote: > This is the same problem Jeff and I are looking at on Solaris - it > requires a slow machine to make it appear.

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
On Fri, Dec 20, 2013 at 4:02 PM, Paul Hargrove wrote: > FWIW: > I've confirmed that this is a REGRESSION relative to 1.7.2, which works fine > on OpenBSD-5 > > I could not build 1.7.3 due to some of the issues fixed for 1.7.4rc in the > past 24 hours. > I am going to try

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
Ralph, Does some part of the "timer that is firing to indicate a failed connection attempt" theory explain the case of singletons hanging? I'm just bringing this up in case you might be looking in the wrong direction. -Paul On Fri, Dec 20, 2013 at 4:15 PM, Ralph Castain

Re: [OMPI devel] 1.7.4rc2r30031 - FreeBSD mpirun warning

2013-12-20 Thread Ralph Castain
I'll silence it - thanks! On Dec 20, 2013, at 3:20 PM, Paul Hargrove wrote: > On Fri, Dec 20, 2013 at 3:12 PM, Dave Goodell (dgoodell) > wrote: > On Dec 20, 2013, at 4:43 PM, Paul Hargrove wrote: > > > The warning is correct that

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Ralph Castain
This is the same problem Jeff and I are looking at on Solaris - it requires a slow machine to make it appear. I'm investigating and think I know where the issue might lie (a timer that is firing to indicate a failed connection attempt and causing a race condition) On Dec 20, 2013, at 4:02 PM,
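The kind of race Ralph describes (a failure timer firing around the time a connection completes) can be illustrated with a small state machine. This is only a sketch under stated assumptions: it is not Open MPI code, the thread does not say what the actual fix was, and every name here (`conn_t`, `conn_on_connected`, `conn_on_timeout`) is hypothetical. The idea is that each event handler checks the current state first, so a late timer cannot overwrite a successful connection.

```c
#include <stdbool.h>

/* Hypothetical connection states; not taken from Open MPI. */
typedef enum { CONN_PENDING, CONN_CONNECTED, CONN_FAILED } conn_state_t;

typedef struct {
    conn_state_t state;
    bool timer_armed;   /* true while the failure timer may still fire */
} conn_t;

/* Called when the connection attempt succeeds: disarm the timer. */
static void conn_on_connected(conn_t *c)
{
    if (c->state == CONN_PENDING) {
        c->state = CONN_CONNECTED;
        c->timer_armed = false;
    }
}

/* Called when the failure timer fires: act only if still pending. */
static void conn_on_timeout(conn_t *c)
{
    if (!c->timer_armed || c->state != CONN_PENDING) {
        return;         /* stale timer event: connection already resolved */
    }
    c->state = CONN_FAILED;
    c->timer_armed = false;
}
```

On a slow machine the two events can arrive close together; with the guards above, whichever handler runs first decides the outcome and the other becomes a no-op.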

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
FWIW: I've confirmed that this is a REGRESSION relative to 1.7.2, which works fine on OpenBSD-5. I could not build 1.7.3 due to some of the issues fixed for 1.7.4rc in the past 24 hours. I am going to try back-porting the fix(es) to see if 1.7.3 works or not. -Paul On Fri, Dec 20, 2013 at 3:16 PM,

Re: [OMPI devel] 1.7.4rc2r30031 - FreeBSD-9 mpirun hangs

2013-12-20 Thread Paul Hargrove
And the FreeBSD backtraces again, this time configured with --enable-debug and for all threads:
The 100%-cpu ring_c process:
(gdb) thread apply all where
Thread 2 (Thread 802007400 (LWP 182916/ring_c)):
#0 0x000800de7aac in sched_yield () from /lib/libc.so.7
#1 0x0008013c7a5a in

Re: [OMPI devel] 1.7.4rc2r30031 - FreeBSD mpirun warning

2013-12-20 Thread Paul Hargrove
On Fri, Dec 20, 2013 at 3:12 PM, Dave Goodell (dgoodell) wrote: > On Dec 20, 2013, at 4:43 PM, Paul Hargrove wrote: > > > The warning is correct that no such interface exists. > > However 127.0.0.1/24 DOES exist: > > > > $ ifconfig lo0 inet > > lo0:

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
Below is the backtrace again, this time configured w/ --enable-debug and for all threads. -Paul
Thread 2 (thread 1021110):
#0 0x1bc0ef6c5e3a in nanosleep () at :2
#1 0x1bc0f317c2d4 in nanosleep (rqtp=0x7f7bc900, rmtp=0x0) at /usr/src/lib/librthread/rthread_cancel.c:274
#2

Re: [OMPI devel] 1.7.4rc2r30031 - FreeBSD mpirun warning

2013-12-20 Thread Dave Goodell (dgoodell)
On Dec 20, 2013, at 4:43 PM, Paul Hargrove wrote: > The warning is correct that no such interface exists. > However 127.0.0.1/24 DOES exist: > > $ ifconfig lo0 inet > lo0: flags=8049 metric 0 mtu 16384 >

[OMPI devel] 1.7.4rc2r30031 - FreeBSD-9 mpirun hangs

2013-12-20 Thread Paul Hargrove
This case is not quite like my OpenBSD-5 report. On FreeBSD-9 I *can* run singletons, but "-np 2" hangs.
The following hangs:
$ mpirun -np 2 examples/ring_c
The following complains about the "bogus" btl selection. So this is not the same as my problem with OpenBSD-5:
$ mpirun -mca btl bogus -np

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
Brian, Of course, I should have thought of that myself. See below for backtrace from a singleton run. I'm starting an --enable-debug build to maybe get some line number info too. -Paul
(gdb) where
#0 0x0406457a9e3a in nanosleep () at :2
#1 0x04063947e2d4 in nanosleep

[OMPI devel] 1.7.4rc2r30031 - FreeBSD mpirun warning

2013-12-20 Thread Paul Hargrove
I have a build of OMPI 1.7.4rc2r30031 on FreeBSD-9 finally. I can (as I will detail in another email) run only singletons at the moment. However, when I do I get a warning that is, IMHO, unnecessary:
$ mpirun -np 1 examples/ring_c

Re: [OMPI devel] [EXTERNAL] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Barrett, Brian W
Paul - Any chance you could grab a stack trace from the mpi app? That's probably the fastest next step Brian Sent with Good (www.good.com) -Original Message- From: Paul Hargrove [phhargr...@lbl.gov] Sent: Friday, December 20, 2013 03:33 PM Mountain

[OMPI devel] 1.7.4rc2r30031 - OpenBSD-5 mpirun hangs

2013-12-20 Thread Paul Hargrove
With plenty of help from Jeff and Ralph's bug fixes in the past 24 hours, I can now build OMPI for NetBSD. However, running even a simple example fails:
Having set PATH and LD_LIBRARY_PATH:
$ mpirun -np 1 examples/ring_c
just hangs
Output from "top" shows idle procs:
PID USERNAME PRI NICE

[OMPI devel] 1.7.4rc2r30031 testing summary

2013-12-20 Thread Paul Hargrove
This email is a summary of my results testing 1.7.4rc2r30031. I will send detailed follow-ups on the new issues. So, this is a heads-up to let you know this version still has significant problems for me.
FreeBSD-9/amd64
+ "mpirun -np 2 examples/ring_c" hangs!
+ "mpirun -np 1 examples/ring_c" runs

Re: [OMPI devel] [PATCH v3 0/2] Trying to get the C/R code to compile again

2013-12-20 Thread Adrian Reber
On Thu, Dec 19, 2013 at 09:54:19PM +0100, Adrian Reber wrote: > This is the second try to replace the usage of blocking send and > recv in the C/R code with the non-blocking versions. The new code > compiles (in contrast to the old code) but does not work yet. > This is the first step to get the

Re: [OMPI devel] 1.7.4rc1 run failure on Solaris 10 / SPARC (not SIGBUS)

2013-12-20 Thread Paul Hargrove
Ralph and Jeff, Thanks for all the rapid fixes. I'll send openmpi-1.7.4rc2r30031 for a spin while I go wait in line at the Post Office. -Paul On Fri, Dec 20, 2013 at 11:45 AM, Ralph Castain wrote: > Hi Paul > > The binding stuff was in there, but the limit protection code

Re: [OMPI devel] 1.7.4rc1 run failure on Solaris 10 / SPARC (not SIGBUS)

2013-12-20 Thread Ralph Castain
Hi Paul The binding stuff was in there, but the limit protection code just went in today. Jeff has since regenerated the tarball for the web site, so the one up there should have most (if not all) of these problems fixed Have a great holiday! Ralph On Dec 20, 2013, at 11:40 AM, Paul Hargrove

Re: [OMPI devel] 1.7.4rc1 run failure on Solaris 10 / SPARC (not SIGBUS)

2013-12-20 Thread Paul Hargrove
Ralph, I see the same behavior w/ last night's 1.7 tarball (openmpi-1.7.4rc2r30002). The very next commit, r30003, is your addition (on trunk) of guards for RLIMIT_AS, etc. So, I DON'T think any fix for this behavior is in the 1.7 branch as you thought (maybe just CMR'ed?) Let me know if there
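For readers unfamiliar with the "guards for RLIMIT_AS" mentioned here, the general technique is a preprocessor check so that code touching the limit compiles only where the platform defines it. The following is a minimal sketch of that idea, not the actual r30003 change; the helper name `query_as_limit` and its return convention are assumptions made for illustration.

```c
#include <sys/resource.h>

/* Hypothetical helper (not from Open MPI): query the address-space limit
 * only when the platform defines RLIMIT_AS.
 * Returns 1 if the limit was read into *out, 0 if RLIMIT_AS is not
 * available on this platform, and -1 if getrlimit() failed. */
static int query_as_limit(struct rlimit *out)
{
#if defined(RLIMIT_AS)
    if (getrlimit(RLIMIT_AS, out) != 0) {
        return -1;
    }
    return 1;
#else
    (void)out;          /* nothing to query on this platform */
    return 0;
#endif
}
```

Without such a guard, a reference to RLIMIT_AS is a compile error on systems that do not define it, which matches the OpenBSD-5 build failure reported elsewhere in this listing.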

Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-20 Thread Jeff Squyres (jsquyres)
Fixed these two and just CMR'ed them. On Dec 19, 2013, at 9:47 PM, Paul Hargrove wrote: > Jeff, > > I didn't actually get very far after fixing __always_inline. > In fact, the build still fails on the *same* line, but for a different > (valid) reason: > fls() is

Re: [OMPI devel] rpath issues (re: svn:open-mpi r30005 - trunk/config)

2013-12-20 Thread Jeff Squyres (jsquyres)
Cool; thanks. Does libmxm have a .a (static) version? On Dec 20, 2013, at 11:42 AM, Mike Dubman wrote: > Hi Jeff, > Thanks for comments. > I checked #1 and it does the trick, will fix and commit it. > as for #2 - we do not modify LDFLAGS in mca/mtl/mxm/configure.m4.

Re: [OMPI devel] rpath issues (re: svn:open-mpi r30005 - trunk/config)

2013-12-20 Thread Mike Dubman
Hi Jeff, Thanks for comments. I checked #1 and it does the trick, will fix and commit it. as for #2 - we do not modify LDFLAGS in mca/mtl/mxm/configure.m4. M On Fri, Dec 20, 2013 at 3:54 PM, Jeff Squyres (jsquyres) wrote: > This commit doesn't seem right. You can't just

Re: [OMPI devel] 1.7.4rc1 autogen error: NetBSD-6

2013-12-20 Thread Jeff Squyres (jsquyres)
I just submitted a CMR to Brian to fix this: https://svn.open-mpi.org/trac/ompi/ticket/4015 On Dec 19, 2013, at 10:46 PM, Paul Hargrove wrote: > Probably nobody cares, but I'll report this for completeness. > In trying to understand the "make install" failure on

[OMPI devel] rpath issues (re: svn:open-mpi r30005 - trunk/config)

2013-12-20 Thread Jeff Squyres (jsquyres)
This commit doesn't seem right. You can't just assign -Wl,-rpath to rpath something -- those flags are dependent on the actual back-end linker (which may not be gnu ld). We have a bunch of logic in configure that was just recently revamped to figure out what the rpath linker flags should be.

Re: [OMPI devel] 1.7.4rc1 build failure: another OpenBSD-5

2013-12-20 Thread Jeff Squyres (jsquyres)
Fixed and CMR'ed: https://svn.open-mpi.org/trac/ompi/ticket/4012 On Dec 20, 2013, at 12:27 AM, Paul Hargrove wrote: > Manually #ifdef'ing out the RLIMIT_AS code which led to my previous failure > on OpenBSD-5 allows me to reach the (sigh) *next* problem: > > Making

[OMPI devel] 1.7.4rc1 build failure: another OpenBSD-5

2013-12-20 Thread Paul Hargrove
Manually #ifdef'ing out the RLIMIT_AS code which led to my previous failure on OpenBSD-5 allows me to reach the (sigh) *next* problem:
Making all in mpi/cxx
CXX mpicxx.lo
/home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/ompi/mpi/cxx/mpicxx.cc:120:21: