Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-19 Thread Siegmar Gross
Hi, at first thank you very much for your help. 1st patch: > Can you apply the following patch to a trunk tarball and see if it works > for you? 2nd patch: > Found the problem. Was accessing a boolean variable using intval. That > is a bug that has gone unnoticed on all platforms but thankfull

[OMPI devel] Consequence of bind-to-core by default

2013-12-19 Thread Jeff Squyres (jsquyres)
I notice Absoft's MTT runs are failing due to the change in bind-to-core-by-default: http://mtt.open-mpi.org/index.php?do_redir=2136 I asked Tony, who runs the Absoft MTT runs; he confirms that this particular machine has 1 socket with 2 cores (and we're running -np 4 on this machine). 1. T

Re: [OMPI devel] Consequence of bind-to-core by default

2013-12-19 Thread Ashley Pittman
On 19 Dec 2013, at 13:59, Jeff Squyres (jsquyres) wrote: > > - if we oversubscribe, (possibly) warn about the performance loss of > oversubscription, and don't bind > - don't warn about lack of memory binding > > Thoughts? +1, I hit this myself today. I typically run on a VM and oversubscrib

Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default

2013-12-19 Thread Barrett, Brian W
On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)" wrote: >3. Finally, we're giving a warning saying: > >- >WARNING: a request was made to bind a process. While the system >supports binding the process itself, at least one node does NOT >support binding memory to the process location. >- > >F

[OMPI devel] Speedup for MPI_Dims_create()

2013-12-19 Thread Andreas Schäfer
Dear all, please find attached a (trivial) patch to MPI_Dims_create(). When computing the prime factors of nnodes, it is sufficient to check for primes less than or equal to sqrt(nnodes). This was not so much of a problem in the past, but now that Tier 0 systems are capable of running O(10^6) MPI proc
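
The attached patch is not reproduced in this listing, so the following is only a minimal sketch of the idea the post describes: trial division that stops once the candidate exceeds sqrt(n). The function name prime_factors and its output-array interface are hypothetical and are not taken from the Open MPI sources.

```c
#include <assert.h>

/* Hypothetical helper, NOT the Open MPI implementation: writes the prime
 * factors of n (ascending, with multiplicity) into out and returns how many
 * were written. out needs room for at least 31 entries for a 32-bit int. */
static int prime_factors(int n, int *out)
{
    int count = 0;

    assert(n >= 1);
    assert(out != 0);

    /* Only candidates p with p * p <= n need testing: once every divisor up
     * to sqrt(n) has been divided out, whatever remains is itself prime. */
    for (int p = 2; (long long)p * p <= n; ++p) {
        while (n % p == 0) {
            out[count++] = p;
            n /= p;
        }
    }
    if (n > 1) {
        out[count++] = n;   /* single leftover prime larger than sqrt(n) */
    }
    return count;
}
```

Because n shrinks as factors are divided out, the loop bound tightens as it runs, so the number of candidates tried is at most about sqrt(nnodes) rather than nnodes.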

Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter

2013-12-19 Thread Barrett, Brian W
Someone who understands the mpi debugging handles code: The opal_progress_recursion_depth_counter and opal_progress_thread_counter are both only used internally in opal_progress (for bookkeeping, but never any decisions) and are declared in ompi_mpihandles_dll.c, but then don't appear to be used.

Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default

2013-12-19 Thread Ralph Castain
On Dec 19, 2013, at 6:27 AM, Barrett, Brian W wrote: > On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)" wrote: > >> 3. Finally, we're giving a warning saying: >> >> - >> WARNING: a request was made to bind a process. While the system >> supports binding the process itself, at least one node

Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default

2013-12-19 Thread Barrett, Brian W
On 12/19/13 8:43 AM, "Ralph Castain" wrote: > >On Dec 19, 2013, at 6:27 AM, Barrett, Brian W wrote: > >> On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)" >>wrote: >> >>> 3. Finally, we're giving a warning saying: >>> >>> - >>> WARNING: a request was made to bind a process. While the system

Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-19 Thread Jeff Squyres (jsquyres)
Siegmar -- So it looks like the net problem is fixed; good. I'll commit and CMR that. For the DDT test, can you give us access to this machine? It might help speed debugging a lot. (I'll let Nathan reply about the var problem) If not, can you provide the following information about the DDT t

Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default

2013-12-19 Thread Ralph Castain
Okay, I think I have these things fixed in r29978 on the trunk - please give it a spin and confirm so we can move it to 1.7.4 On Dec 19, 2013, at 7:54 AM, Barrett, Brian W wrote: > On 12/19/13 8:43 AM, "Ralph Castain" wrote: > >> >> On Dec 19, 2013, at 6:27 AM, Barrett, Brian W wrote: >>

Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter

2013-12-19 Thread Jeff Squyres (jsquyres)
I think there's no problem with removing them from the dll code -- that stuff doesn't affect MPI application ABI. On Dec 19, 2013, at 9:42 AM, Barrett, Brian W wrote: > Someone who understands the mpi debugging handles code: > > The opal_progress_recursion_depth_counter and opal_progress_thre

Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default

2013-12-19 Thread Jeff Squyres (jsquyres)
On Dec 19, 2013, at 10:54 AM, Barrett, Brian W wrote: >> Just to help me understand a bit better - you are saying that the node >> supports process binding, but not memory binding? I don't see how the >> error appears otherwise, but want to ensure I understand the code path. > > That appears to

Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default

2013-12-19 Thread Barrett, Brian W
That worked for me. Brian On 12/19/13 9:32 AM, "Ralph Castain" wrote: > > > >Okay, I think I have these things fixed in r29978 on the trunk - please >give it a spin and confirm so we can move it to 1.7.4 > > > >On Dec 19, 2013, at 7:54 AM, Barrett, Brian W wrote: > > >On 12/19/13 8:43 AM, "Ral

Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter

2013-12-19 Thread Barrett, Brian W
Nathan - Any chance you can remove the two counters this afternoon? Brian On 12/19/13 10:01 AM, "Jeff Squyres (jsquyres)" wrote: >I think there's no problem with removing them from the dll code -- that >stuff doesn't affect MPI application ABI. > > >On Dec 19, 2013, at 9:42 AM, Barrett, Brian

Re: [OMPI devel] Speedup for MPI_Dims_create()

2013-12-19 Thread Jeff Squyres (jsquyres)
Andreas -- Thanks for the patch. Can I ask two things? 1. Can you separate the patch into two: one with the code change, and another with the whitespace update? It will help the readability of the logs to see the exact code change, rather than bury it in a syntax update. 2. You added a copyr

Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter

2013-12-19 Thread Hjelm, Nathan T
Yes. I will do that once I finish preparing the ORNL collectives for the trunk. Will be 8pm at the latest. -Nathan From: devel [devel-boun...@open-mpi.org] on behalf of Barrett, Brian W [bwba...@sandia.gov] Sent: Thursday, December 19, 2013 10:24 AM To: O

Re: [OMPI devel] [PATCH v2 2/2] Trying to get the C/R code to compile again. (send_*_nb)

2013-12-19 Thread Adrian Reber
Thanks for the review. I am re-spinning the patches and sending the new version in a few moments. On Wed, Dec 18, 2013 at 06:56:47AM -0800, Ralph Castain wrote: > In the case of the send, there really isn't any problem with just replacing > things - the non-blocking change won't impact anything,

[OMPI devel] [PATCH v3 0/2] Trying to get the C/R code to compile again

2013-12-19 Thread Adrian Reber
From: Adrian Reber This is the second try to replace the usage of blocking send and recv in the C/R code with the non-blocking versions. The new code compiles (in contrast to the old code) but does not work yet. This is the first step to get the C/R code working again. Right now it only compiles.

[OMPI devel] [PATCH v3 1/2] Trying to get the C/R code to compile again. (recv_*_nb)

2013-12-19 Thread Adrian Reber
From: Adrian Reber This patch changes all recv/recv_buffer occurrences in the C/R code to recv_nb/recv_buffer_nb. The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED). The new code compiles but does not work. Changes from V1: * #ifdef out the code (so it is preserved for later

[OMPI devel] [PATCH v3 2/2] Trying to get the C/R code to compile again. (send_*_nb)

2013-12-19 Thread Adrian Reber
From: Adrian Reber This patch changes all send/send_buffer occurrences in the C/R code to send_nb/send_buffer_nb. The new code compiles but does not work. Changes from V1: * #ifdef out the code (so it is preserved for later re-design) * marked the broken C/R code with ENABLE_FT_FIXED Changes fr

Re: [OMPI devel] [PATCH v3 2/2] Trying to get the C/R code to compile again. (send_*_nb)

2013-12-19 Thread Ralph Castain
+1 from me On Dec 19, 2013, at 12:54 PM, Adrian Reber wrote: > From: Adrian Reber > > This patch changes all send/send_buffer occurrences in the C/R code > to send_nb/send_buffer_nb. > The new code compiles but does not work. > > Changes from V1: > * #ifdef out the code (so it is preserved f

Re: [OMPI devel] [PATCH v3 1/2] Trying to get the C/R code to compile again. (recv_*_nb)

2013-12-19 Thread Ralph Castain
Looks okay to me. On the places where you need to block while waiting for an answer, you can use OMPI_WAIT_FOR_COMPLETION - this will spin on opal_progress until the condition is met. We use it elsewhere for similar purposes. See ompi/mca/rte/rte.h for the definition On Dec 19, 2013, at 12:54

[OMPI devel] 1.7 series release plans

2013-12-19 Thread Ralph Castain
Hi folks Given the amount of changes/fixes pushed into the 1.7.4rc's this week, it seems best that we delay that release until after the holiday. Accordingly, the revised release plan looks like this: 1.7.4rc2 - this weekend 1.7.4 - Jan 10th 1.7.5 feature freeze (hard deadline) - Jan 24th 1.

[OMPI devel] 1.7.4rc1 build failure: FreeBSD-9

2013-12-19 Thread Paul Hargrove
I see the failure below when building 1.7.4rc1 on FreeBSD-9 (amd64). It looks to be just a missing header, probably sys/stat.h. $ gcc --version gcc (GCC) 4.2.1 20070831 patched [FreeBSD] Only configure option passed was --prefix-... -Paul Making all in mca/sharedfp/sm CC sharedfp_sm.l

[OMPI devel] 1.7.4rc1 build failure: OpenBSD-5 and NetBSD-6

2013-12-19 Thread Paul Hargrove
When building 1.7.4rc1 on OpenBSD-5 and NetBSD-6 (both amd64) I see what appears to be the same three errors ("make" output at end of this email) on both platforms. All three syntax errors appear to be collisions on the symbol if_mtu: -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.h | grep -w

[OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Paul Hargrove
In 1.7.4rc1's README support is still claimed for Solaris 11 on x86_64 with Sun Studio (12.2 and 12.3): - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64), with Oracle Solaris Studio 12.2 and 12.3 However, I get a build failure when configured with: CC=cc CFLAGS=-m64 --w

Re: [OMPI devel] 1.7.4rc1 build failure: Solaris 11 / x86_64

2013-12-19 Thread Paul Hargrove
I've confirmed that the ifr_hwaddr problem also occurs with this system's /usr/bin/gcc: Making all in mca/if/posix_ipv4 make[2]: Entering directory `/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-gcc452/BLD/opal/mca/if/posix_ipv4' CC if_posix.lo /shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-

Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Jeff Squyres (jsquyres)
Paul -- Does this patch fix it for you? Index: opal/mca/if/posix_ipv4/configure.m4 === --- opal/mca/if/posix_ipv4/configure.m4 (revision 29997) +++ opal/mca/if/posix_ipv4/configure.m4 (working copy) @@ -42,8 +42,10 @@ )

Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Paul Hargrove
Jeff, The patch looks fine to my eyes, but I cannot test it: 1) Not sure if email botched whitespace or what, but the patch didn't apply to if_posix.c. 2) Even if it did, I don't have sufficiently new autoconf on that system to "use" the configure.m4 part of the patch. Any chance of a patched-an

Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Jeff Squyres (jsquyres)
Try http://www.open-mpi.org/~jsquyres/unofficial/. Should have both "if" fixes in it. On Dec 19, 2013, at 7:12 PM, Paul Hargrove wrote: > Jeff, > > The patch looks fine to my eyes, but I cannot test it: > > 1) Not sure if email botched whitespace or what, but the patch didn't apply > to if_

Re: [OMPI devel] 1.7.4rc1 build failure: OpenBSD-5 and NetBSD-6

2013-12-19 Thread Jeff Squyres (jsquyres)
On Dec 19, 2013, at 6:27 PM, Paul Hargrove wrote: > When building 1.7.4rc1 on OpenBSD-5 and NetBSD-6 (both amd64) I see what > appears to be the same three errors ("make" output at end of this email) on > both platforms. > > All three syntax errors appear to be collisions on the symbol if_mt

Re: [OMPI devel] 1.7.4rc1 build failure: FreeBSD-9

2013-12-19 Thread Ralph Castain
Fixed and cmr'd thanks! On Dec 19, 2013, at 3:10 PM, Paul Hargrove wrote: > I see the failure below when building 1.7.4rc1 on FreeBSD-9 (amd64). > It looks to be just a missing header, probably sys/stat.h. > > $ gcc --version > gcc (GCC) 4.2.1 20070831 patched [FreeBSD] > > Only configure opt

Re: [OMPI devel] 1.7.4rc1 build failure: OpenBSD-5 and NetBSD-6

2013-12-19 Thread Paul Hargrove
Jeff, The unofficial "rc2forpaul" gets past the (disgusting) if_mtu problem on both platforms. On NetBSD-6 the build completes ("make install" fails, but I'll report that separately). However, on OpenBSD-5 we now encounter another failure about 20 files later: CC sys_limits.lo /home/pha

Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Paul Hargrove
Jeff, Solaris 11 / x86_64 build gets farther than before, but fails with the following: make[2]: Entering directory `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic' CC btl_usnic_module.lo In file included from /shared/OMPI/openmpi-1.7.4rc2forpaul-solari

[OMPI devel] 1.7.4rc1 run failure on Solaris 10 / SPARC (not SIGBUS)

2013-12-19 Thread Paul Hargrove
Testing with Solaris 10 on SPARC, I was expecting to encounter the bus error reported previously by Siegmar Gross. Instead I see the following hwloc-related abort: $ env PATH=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin:$PATH LD_LIBRARY_PATH_64=/home/hargrove/OMPI/o

Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Paul Hargrove
Jeff, I didn't actually get very far after fixing __always_inline. In fact, the build still fails on the *same* line, but for a different (valid) reason: fls() is declared in /usr/include/string.h Making all in mca/btl/usnic make[2]: Entering directory `/shared/OMPI/openmpi-1.7.4rc2forpaul-so

[OMPI devel] 1.7.4rc1 install failure: NetBSD-6 amd64

2013-12-19 Thread Paul Hargrove
Attached is the output from "make install" of 1.7.4rc1 + Jeff's fix for the symbol conflict on "if_mtu". There appear to be at least 2 issues. 1) There are lots of (not fatal) messages about ldconfig not existing, but according to the NetBSD lists that utility went away with the conversion from a.

Re: [OMPI devel] 1.7.4rc1 build failure: OpenBSD-5 and NetBSD-6

2013-12-19 Thread Ralph Castain
I added protections for all the RLIMIT values, just in case. Thanks! Ralph On Dec 19, 2013, at 6:25 PM, Paul Hargrove wrote: > Jeff, > > The unofficial "rc2forpaul" gets past the (disgusting) if_mtu problem on both > platforms. > > On NetBSD-6 the build completes ("make install" fails, but I'
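
The snippet does not show which RLIMIT value was missing on OpenBSD-5 or what the committed fix looks like. As a hedged illustration only, "protections" of this kind are commonly preprocessor guards around each limit, so that code still compiles on platforms that do not define it. The helper name soft_as_limit and the choice of RLIMIT_AS are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical helper (not Open MPI code): fetch the soft address-space
 * limit. Returns 0 on success, -1 if RLIMIT_AS is not defined on this
 * platform or if getrlimit() fails. */
static int soft_as_limit(rlim_t *soft)
{
    assert(soft != NULL);
#if defined(RLIMIT_AS)
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) != 0) {
        return -1;
    }
    *soft = rl.rlim_cur;
    return 0;
#else
    /* The limit does not exist here; report that instead of failing to build. */
    return -1;
#endif
}
```

Guarding every RLIMIT_* use this way trades a compile failure for a runtime "not available" result, which callers can then treat as "no limit to report".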

Re: [OMPI devel] 1.7.4rc1 run failure on Solaris 10 / SPARC (not SIGBUS)

2013-12-19 Thread Ralph Castain
I believe this one has already been fixed and is in the nightly (1.7.4rc2) - for now, you can just set "--bind-to none" on the cmd line to get past it On Dec 19, 2013, at 6:42 PM, Paul Hargrove wrote: > Testing with Solaris 10 on SPARC, I was expecting to encounter the bus error > reported pr

Re: [OMPI devel] 1.7.4rc1 run failure on Solaris 10 / SPARC (not SIGBUS)

2013-12-19 Thread Paul Hargrove
Ralph, I can confirm "--bind-to none" worked to eliminate the error, but the test now appears to hang :-( Since you say the binding is probably fixed for rc2, I'll see if the latest nightly tarball works better by default. -Paul On Thu, Dec 19, 2013 at 7:19 PM, Ralph Castain wrote: > I believe

[OMPI devel] 1.7.4rc1 autogen error: NetBSD-6

2013-12-19 Thread Paul Hargrove
Probably nobody cares, but I'll report this for completeness. In trying to understand the "make install" failure on NetBSD-6 I ran "autogen.sh". The versions detected: Searching for autoconf Found autoconf version 2.69; checking version... Found version component 2 -- need 2

Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Paul Hargrove
FYI: My Solaris-11/x86-64/gcc-4.5.2 build completes with the following three changes: + Jeff's fix for if_posix.c + changing __always_inline to __opal_attribute_always_inline__ + fixing the fls() conflict by renaming OMPI's to "my_fls()" (just a lazy choice). -Paul On Thu, Dec 19, 2013 at 6:47
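
The third change in the list above can be sketched as follows. This is only an illustration of avoiding a clash with a system-declared fls() by using a private name; the body below is a plain "find last set bit" loop written for this example, and the unsigned int signature is an assumption, not the usnic BTL's actual code.

```c
#include <assert.h>

/* Hypothetical stand-in for a privately named find-last-set helper.
 * Returns the 1-based position of the most significant set bit of x,
 * or 0 when x is 0. The distinct name avoids colliding with an fls()
 * that some platforms declare in their system headers. */
static int my_fls(unsigned int x)
{
    int pos = 0;

    while (x != 0) {
        ++pos;
        x >>= 1;
    }
    return pos;
}

/* Small usage example. */
static void my_fls_examples(void)
{
    assert(my_fls(0u) == 0);
    assert(my_fls(1u) == 1);
    assert(my_fls(8u) == 4);
}
```

A project-specific prefix would be the more conventional choice than "my_"; the post itself describes the name as a lazy one.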