Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-19 Thread Siegmar Gross
Hi,

First of all, thank you very much for your help.

1st patch:

> Can you apply the following patch to a trunk tarball and see if it works
> for you?

2nd patch:

> Found the problem. Was accessing a boolean variable using intval. That
> is a bug that has gone unnoticed on all platforms but thankfully Solaris
> caught it.
> 
> Please try the attached patch.
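
To illustrate the kind of bug described in the second patch, here is a minimal sketch of a type-aware read. It is an illustration only: the names `var_type_t`, `VAR_TYPE_INT`, `VAR_TYPE_BOOL`, and `var_storage_as_int` are hypothetical and not taken from the Open MPI sources. Reading a `bool` object as if it were an `int` touches more bytes than the object owns, which may go unnoticed on many platforms but can trap or return garbage on strict-alignment, big-endian machines such as SPARC.

```c
#include <stdbool.h>

/* Hypothetical type tags; the real MCA variable system has its own. */
typedef enum { VAR_TYPE_INT, VAR_TYPE_BOOL } var_type_t;

/*
 * Read a variable's storage as an int, dereferencing it with the
 * type it was actually declared with.
 */
int var_storage_as_int(var_type_t type, const void *storage)
{
    if (VAR_TYPE_BOOL == type) {
        /* A bool is usually a single byte: read exactly that object. */
        return *(const bool *) storage ? 1 : 0;
    }
    /* Only dereference as int when the storage really is an int. */
    return *(const int *) storage;
}
```

The point of the sketch is the branch on the type tag; the caller is assumed to pass storage that matches the tag.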


I applied both patches manually to openmpi-1.9a1r29972, because
my patch program couldn't use the patches. Unfortunately I still
get a Bus Error. Hopefully I didn't make a mistake applying your
patches. Therefore I show you a "diff" for my files. By the way,
I tried to apply your patches with "patch -b -i ".
Is it necessary to use a different command?


tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
-rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
-rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
1685,1689c1685
< [...] var->mbv_type) {
<     ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator,
<         value->boolval, &tmp);
< } else {
<     ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator,
<         value->intval, &tmp);
< }
---
> ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator,
>     value->intval, &tmp);
tyr openmpi-1.9a1r29972 163 



tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
-rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
-rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
267,271c267,268
< struct sockaddr_in inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
< [...]
< memcpy(&inaddr1, addr1, sizeof(inaddr1));
< memcpy(&inaddr2, addr2, sizeof(inaddr2));
---
> const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
> const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
274,275c271,272
< if((inaddr1.sin_addr.s_addr & netmask) ==
<    (inaddr2.sin_addr.s_addr & netmask)) {
---
> if((inaddr1->sin_addr.s_addr & netmask) ==
>    (inaddr2->sin_addr.s_addr & netmask)) {
284,290c281,284
< struct sockaddr_in6 inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
< [...]
< memcpy(&inaddr1, addr1, sizeof(inaddr1));
< memcpy(&inaddr2, addr2, sizeof(inaddr2));
< struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
< struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
---
> const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
> const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
> struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
> struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
tyr openmpi-1.9a1r29972 167 
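
The net.c change above replaces pointer casts with local copies filled by memcpy. A self-contained sketch of that idea for the IPv4 case follows. The function name `same_network_v4` and its prefix-length parameter are assumptions made for this example and are not the signature used in opal/util/net.c.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Compare the network parts of two IPv4 socket addresses.
 * The caller's buffers may not be aligned for struct sockaddr_in,
 * so copy them into properly aligned locals before reading fields.
 * prefixlen is expected to be in the range 0..32.
 */
bool same_network_v4(const struct sockaddr *addr1,
                     const struct sockaddr *addr2,
                     uint32_t prefixlen)
{
    struct sockaddr_in in1, in2;
    uint32_t netmask;

    memcpy(&in1, addr1, sizeof(in1));
    memcpy(&in2, addr2, sizeof(in2));

    /* A shift by 32 bits is undefined, so handle prefix 0 explicitly. */
    netmask = (0 == prefixlen) ? 0 : htonl(0xFFFFFFFFu << (32 - prefixlen));

    return (in1.sin_addr.s_addr & netmask) ==
           (in2.sin_addr.s_addr & netmask);
}
```

The copy costs a few bytes per call, but it avoids dereferencing a pointer whose alignment the callee cannot verify.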



Now my debug information.

tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading ompi_info
Reading ld.so.1
Reading libmpi.so.0.0.0
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run -a
Running: ompi_info -a 
(process id 10998)
Reading libc_psr.so.1
...
MCA compress: parameter "compress_base_verbose" (current value:
  "-1", data source: default, level: 8 dev/detail,
  type: int)
  Verbosity level for the compress framework (0 = no
  verbosity)
t@1 (l@1) signal BUS (invalid address alignment) in var_value_string
  at line 1680 in file "mca_base_var.c"
 1680  ret = asprintf (value_string, var_type_formats[var->mbv_type],
  value[0]);
(dbx) 
(dbx) 
(dbx) check -all
dbx: warning: check -all will be turned on in the next run of the process
access checking - OFF
memuse checking - OFF
(dbx) run -a
Running: ompi_info -a 
(process id 11000)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 4 bytes at address 0x7fffd5f8
which 

[OMPI devel] Consequence of bind-to-core by default

2013-12-19 Thread Jeff Squyres (jsquyres)
I notice Absoft's MTT runs are failing due to the change in 
bind-to-core-by-default:

   http://mtt.open-mpi.org/index.php?do_redir=2136

I asked Tony, who runs the Absoft MTT runs; he confirms that this particular 
machine has 1 socket with 2 cores (and we're running -np 4 on this machine).

1. This is an unintended consequence of the bind-to-core-by-default policy: we 
fail with "oversubscribed!" when running on a single machine for test runs like 
this.  Do we like this? 

See #3, below, for more on this.

2. Also, the error message that is displayed says:

-
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:ltljoe3
   #processes:  2
   #cpus:  1
-

Which is odd, because the command line is "mpirun -np 4 --mca btl sm,tcp,self 
./c_hello".  Any idea what's happening here?

3. Finally, we're giving a warning saying:

-
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.
-

For both #1 and #3, I wonder if we shouldn't be warning if no binding was 
explicitly stated (i.e., we're just using the defaults).  Specifically, if no 
binding is specified:

- if we oversubscribe, (possibly) warn about the performance loss of 
oversubscription, and don't bind
- don't warn about lack of memory binding
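
The proposal above could be sketched roughly as follows. This is not the ORTE mapper code: every name here (`bind_action_t`, `choose_bind_action`, `warn_no_membind`) is hypothetical, and aborting on oversubscription when binding was explicitly requested is an assumption based on the current behavior described in #1.

```c
#include <stdbool.h>

typedef enum { BIND_APPLY, BIND_SKIP, BIND_ABORT } bind_action_t;

/*
 * Decide what to do about process binding.
 * binding_requested is true only if the user named a binding policy.
 */
bind_action_t choose_bind_action(bool binding_requested, int nprocs,
                                 int ncpus, bool *warn_oversubscribed)
{
    bool oversubscribed = nprocs > ncpus;

    *warn_oversubscribed = false;
    if (!oversubscribed) {
        return BIND_APPLY;
    }
    if (binding_requested) {
        /* Assumption: an explicit request that cannot be met is an error. */
        return BIND_ABORT;
    }
    /* Default policy: fall back to no binding and (possibly) warn. */
    *warn_oversubscribed = true;
    return BIND_SKIP;
}

/* Warn about missing memory binding only for explicit requests. */
bool warn_no_membind(bool binding_requested, bool membind_supported)
{
    return binding_requested && !membind_supported;
}
```

With these rules, "mpirun -np 4" on a 2-core node with no binding options would run unbound with at most a performance warning.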

Thoughts?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Consequence of bind-to-core by default

2013-12-19 Thread Ashley Pittman

On 19 Dec 2013, at 13:59, Jeff Squyres (jsquyres)  wrote:
> 
> - if we oversubscribe, (possibly) warn about the performance loss of 
> oversubscription, and don't bind
> - don't warn about lack of memory binding
> 
> Thoughts?

+1, I hit this myself today.  I typically run on a VM and oversubscribe the 
cores.  Until the last update this worked fine, but now I get two error 
messages when trying this.  I can’t “modify” the binding options used because I 
don’t know what they are (i.e., I didn’t give any), and even when not 
oversubscribing there is a warning at startup that I neither understand nor 
can seemingly disable.

My thoughts would be:

Oversubscription is normally bad, so by all means issue a warning and/or abort; 
however, make the message meaningful and offer the user an 
--allow-oversubscription flag.

Jobs running on VMs shouldn’t give warnings to the user.

Finally, the whitespace alignment of the message is a little odd: it looks like 
it’s supposed to be a table or two columns, but the indentation is all over 
the place.

Ashley.

Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default

2013-12-19 Thread Barrett, Brian W
On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)"  wrote:

>3. Finally, we're giving a warning saying:
>
>-
>WARNING: a request was made to bind a process. While the system
>supports binding the process itself, at least one node does NOT
>support binding memory to the process location.
>-
>
>For both #1 and #3, I wonder if we shouldn't be warning if no binding was
>explicitly stated (i.e., we're just using the defaults).  Specifically,
>if no binding is specified:
>
>- if we oversubscribe, (possibly) warn about the performance loss of
>oversubscription, and don't bind
>- don't warn about lack of memory binding

We have a couple machines where memory binding is failing for one reason
or another.  If we're binding by default, we really shouldn't throw error
messages about not being able to bind memory.  It's REALLY annoying.

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories





[OMPI devel] Speedup for MPI_Dims_create()

2013-12-19 Thread Andreas Schäfer
Dear all,

please find attached a (trivial) patch to MPI_Dims_create(). When
computing the prime factors of nnodes, it is sufficient to check for
primes less than or equal to sqrt(nnodes).

This was not so much of a problem in the past, but now that Tier 0
systems are capable of running O(10^6) MPI processes, the difference
in execution time is on the order of seconds (e.g. 8.86s vs. 0.04s on
my notebook, with nnproc = 10^6).
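
As a standalone illustration of the sqrt bound (not the code in dims_create.c, which uses its own getprimes() and getfactors() helpers), here is a minimal trial-division sketch. The name `prime_factors` is hypothetical. For simplicity it divides by every integer rather than by a precomputed prime list. It also makes one detail explicit: whatever remains after dividing out all candidates up to the square root is either 1 or a single prime factor, and must be recorded too.

```c
/*
 * Store the prime factors of n (with multiplicity, ascending) in
 * factors[] and return how many were found, or -1 if the array is
 * too small.  Values of n below 2 yield 0 factors.
 */
int prime_factors(int n, int factors[], int max_factors)
{
    int count = 0;
    int p;

    /* p <= n / p is the overflow-safe form of p * p <= n. */
    for (p = 2; p <= n / p; ++p) {
        while (0 == n % p) {
            if (count == max_factors) {
                return -1;
            }
            factors[count++] = p;
            n /= p;
        }
    }

    /* Any remainder above 1 is a prime larger than the square root. */
    if (n > 1) {
        if (count == max_factors) {
            return -1;
        }
        factors[count++] = n;
    }
    return count;
}
```

For example, 14 yields the factors 2 and 7, where 7 is found only through the final remainder check.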

Best
-Andreas

PS: oh, and the patch removes some trailing whitespace. Yuck. :-)


-- 
==
Andreas Schäfer
HPC and Grid Computing
Chair of Computer Science 3
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-27910
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!
Index: ompi/mpi/c/dims_create.c
===
--- ompi/mpi/c/dims_create.c	(revision 29976)
+++ ompi/mpi/c/dims_create.c	(working copy)
@@ -5,19 +5,23 @@
  * Copyright (c) 2004-2005 The University of Tennessee and The University
  * of Tennessee Research Foundation.  All rights
  * reserved.
- * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
+ * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
  * University of Stuttgart.  All rights reserved.
  * Copyright (c) 2004-2005 The Regents of the University of California.
  * All rights reserved.
  * Copyright (c) 2012  Los Alamos National Security, LLC.  All rights
- * reserved. 
+ * reserved.
+ * Copyright (c) 2013  Friedrich-Alexander-Universitaet
+ * Erlangen-Nuernberg. All rights reserved.
  * $COPYRIGHT$
- * 
+ *
  * Additional copyrights may follow
- * 
+ *
  * $HEADER$
  */
 
+#include <math.h>
+
 #include "ompi_config.h"
 
 #include "ompi/mpi/c/bindings.h"
@@ -44,8 +48,8 @@
 /*
  * This is a utility function, no need to have anything in the lower
  * layer for this at all
- */ 
-int MPI_Dims_create(int nnodes, int ndims, int dims[]) 
+ */
+int MPI_Dims_create(int nnodes, int ndims, int dims[])
 {
 int i;
 int freeprocs;
@@ -66,9 +70,9 @@
 return OMPI_ERRHANDLER_INVOKE (MPI_COMM_WORLD,
MPI_ERR_ARG, FUNC_NAME);
 }
-
+
 if (1 > ndims) {
-return OMPI_ERRHANDLER_INVOKE (MPI_COMM_WORLD, 
+return OMPI_ERRHANDLER_INVOKE (MPI_COMM_WORLD,
MPI_ERR_DIMS, FUNC_NAME);
 }
 }
@@ -109,11 +113,11 @@
 }
 
 /* Compute the relevant prime numbers for factoring */
-if (MPI_SUCCESS != (err = getprimes(freeprocs, &nprimes, &primes))) {
+if (MPI_SUCCESS != (err = getprimes(sqrt(freeprocs), &nprimes, &primes))) {
return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, err,
  FUNC_NAME);
 }
-
+
 /* Factor the number of free processes */
 if (MPI_SUCCESS != (err = getfactors(freeprocs, nprimes, primes, &factors))) {
return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, err,
@@ -166,7 +170,7 @@
 int f;
 int *p;
 int *pmin;
-  
+
 if (0 >= ndim) {
return MPI_ERR_DIMS;
 }
@@ -181,7 +185,7 @@
 for (i = 0, p = bins; i < ndim; ++i, ++p) {
 *p = 1;
  }
-
+
 /* Loop assigning factors from the highest to the lowest */
 for (j = nfactor - 1; j >= 0; --j) {
f = pfacts[j];
@@ -196,7 +200,7 @@
 *pmin *= f;
 }
  }
-
+
  /* Sort dimensions in decreasing order (O(n^2) for now) */
  for (i = 0, pmin = bins; i < ndim - 1; ++i, ++pmin) {
  for (j = i + 1, p = pmin + 1; j < ndim; ++j, ++p) {
@@ -228,7 +232,7 @@
 int i;
 int *p;
 int *c;
-
+
 if (0 >= nprime) {
 return MPI_ERR_INTERN;
 }
@@ -309,4 +313,3 @@
*pnprime = i;
return MPI_SUCCESS;
 }
-




Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter

2013-12-19 Thread Barrett, Brian W
Someone who understands the mpi debugging handles code:

The opal_progress_recursion_depth_counter and opal_progress_thread_counter
are both only used internally in opal_progress (for bookkeeping, but
never any decisions) and are declared in ompi_mpihandles_dll.c, but then
don't appear to be used.  Is there a disadvantage to:

 1) removing them from mpihandles_dll.c

or, if that breaks ABI,

 2) Leaving them, but not doing the bookkeeping?

It's fairly heavyweight bookkeeping, so I agree with Nathan, I'd like to
remove it.  But I'd like to remove it pre-1.7.4.  Which means today.
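
To make the cost concrete, here is a rough sketch of what such bookkeeping looks like. It is not the opal_progress implementation: the names `progress_depth`, `progress_with_counter`, `progress_without_counter`, and `sample_callback` are hypothetical, and C11 atomics stand in for the OPAL atomic primitives.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the recursion depth counter. */
static atomic_int progress_depth;

typedef int (*progress_cb_t)(void);

/* Every call pays for two atomic read-modify-write operations. */
int progress_with_counter(progress_cb_t cb)
{
    int events;

    atomic_fetch_add(&progress_depth, 1);
    events = cb();
    atomic_fetch_sub(&progress_depth, 1);
    return events;
}

/* Same work, without bookkeeping that nothing ever reads. */
int progress_without_counter(progress_cb_t cb)
{
    return cb();
}

/* Trivial callback used to exercise the two variants. */
int sample_callback(void)
{
    return 1;
}
```

Both variants return the same event count; the only difference is the pair of atomic operations on the hot path.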

Brian


On 12/18/13 4:40 PM, "Nathan Hjelm"  wrote:

>Opps, yeah. Meant 1.7.5. If people agree with this change I could
>possibly slip it in before Friday but that is unlikely.
>
>On Wed, Dec 18, 2013 at 03:32:36PM -0800, Ralph Castain wrote:
>> U1.7.4 is leaving the station on Fri, Nathan, so next Tues =>
>>will have to go into 1.7.5
>> 
>> 
>> On Dec 18, 2013, at 3:23 PM, Nathan Hjelm  wrote:
>> 
>> > What: Remove the opal_progress_recursion_depth_counter from
>> > opal_progress.
>> > 
>> > Why: This counter adds two atomic adds to the critical path when
>> > OPAL_HAVE_THREADS is set (which is the case for most builds). I
>>grepped
>> > through ompi, orte, and opal to find where this value was being used
>>and
>> > did not find anything either inside or outside opal_progress.
>> > 
>> > When: I want this change to go into 1.7.4 (if possible) so setting a
>> > quick timeout for next Tuesday.
>> > 
>> > Let me know if there is a good reason to keep this counter and it will
>> > be spared.
>> > 
>> > -Nathan Hjelm
>> > HPC-5, LANL
>> > ___
>> > devel mailing list
>> > de...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>


--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories





Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default

2013-12-19 Thread Ralph Castain

On Dec 19, 2013, at 6:27 AM, Barrett, Brian W  wrote:

> On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)"  wrote:
> 
>> 3. Finally, we're giving a warning saying:
>> 
>> -
>> WARNING: a request was made to bind a process. While the system
>> supports binding the process itself, at least one node does NOT
>> support binding memory to the process location.
>> -
>> 
>> For both #1 and #3, I wonder if we shouldn't be warning if no binding was
>> explicitly stated (i.e., we're just using the defaults).  Specifically,
>> if no binding is specified:
>> 
>> - if we oversubscribe, (possibly) warn about the performance loss of
>> oversubscription, and don't bind
>> - don't warn about lack of memory binding
> 
> We have a couple machines where memory binding is failing for one reason
> or another.  If we're binding by default, we really shouldn't throw error
> messages about not being able to bind memory.  It's REALLY annoying.

Just to help me understand a bit better - you are saying that the node supports 
process binding, but not memory binding? I don't see how the error appears 
otherwise, but want to ensure I understand the code path.


> 
> Brian
> 
> --
>  Brian W. Barrett
>  Scalable System Software Group
>  Sandia National Laboratories
> 
> 
> 



Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default

2013-12-19 Thread Barrett, Brian W
On 12/19/13 8:43 AM, "Ralph Castain"  wrote:

>
>On Dec 19, 2013, at 6:27 AM, Barrett, Brian W  wrote:
>
>> On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)" 
>>wrote:
>> 
>>> 3. Finally, we're giving a warning saying:
>>> 
>>> -
>>> WARNING: a request was made to bind a process. While the system
>>> supports binding the process itself, at least one node does NOT
>>> support binding memory to the process location.
>>> -
>>> 
>>> For both #1 and #3, I wonder if we shouldn't be warning if no binding
>>>was
>>> explicitly stated (i.e., we're just using the defaults).  Specifically,
>>> if no binding is specified:
>>> 
>>> - if we oversubscribe, (possibly) warn about the performance loss of
>>> oversubscription, and don't bind
>>> - don't warn about lack of memory binding
>> 
>> We have a couple machines where memory binding is failing for one reason
>> or another.  If we're binding by default, we really shouldn't throw
>>error
>> messages about not being able to bind memory.  It's REALLY annoying.
>
>Just to help me understand a bit better - you are saying that the node
>supports process binding, but not memory binding? I don't see how the
>error appears otherwise, but want to ensure I understand the code path.

That appears to be the case, yes.

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories





Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris

2013-12-19 Thread Jeff Squyres (jsquyres)
Siegmar --

So it looks like the net problem is fixed; good.  I'll commit and CMR that.

For the DDT test, can you give us access to this machine?  It might help speed 
debugging a lot.  (I'll let Nathan reply about the var problem)

If not, can you provide the following information about the DDT test:

1. It SIGBUS's at a point; can you send the full backtrace?
2. It complains about a misaligned read of a variable and shows its address.  
Can you print the values of all the parameters of the function so that we can 
see *which* one it is using for the misaligned read?  (the printf is using 4 
different variables, and we don't know which one is causing the misaligned read)


On Dec 19, 2013, at 8:52 AM, Siegmar Gross wrote:

> Hi,
> 
> at first thank you very much for your help.
> 
> 1st patch:
> 
>> Can you apply the following patch to a trunk tarball and see if it works
>> for you?
> 
> 2nd patch:
> 
>> Found the problem. Was accessing a boolean variable using intval. That
>> is a bug that has gone unnoticed on all platforms but thankfully Solaris
>> caught it.
>> 
>> Please try the attached patch.
> 
> 
> I applied both patches manually to openmpi-1.9a1r29972, because
> my patch program couldn't use the patches. Unfortunately I still
> get a Bus Error. Hopefully I didn't make a mistake applying your
> patches. Therefore I show you a "diff" for my files. By the way,
> I tried to apply your patches with "patch -b -i ".
> Is it necessary to use a different command?
> 
> 
> tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
> -rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
> -rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
> tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
> 1685,1689c1685
> < [...] var->mbv_type) {
> <     ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator,
> <         value->boolval, &tmp);
> < } else {
> <     ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator,
> <         value->intval, &tmp);
> < }
> ---
> > ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator,
> >     value->intval, &tmp);
> tyr openmpi-1.9a1r29972 163 
> 
> 
> 
> tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
> -rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
> -rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
> tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
> 267,271c267,268
> < struct sockaddr_in inaddr1, inaddr2;
> < /* Use temporary variables and memcpy's so that we don't
> < [...]
> < memcpy(&inaddr1, addr1, sizeof(inaddr1));
> < memcpy(&inaddr2, addr2, sizeof(inaddr2));
> ---
> > const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
> > const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
> 274,275c271,272
> < if((inaddr1.sin_addr.s_addr & netmask) ==
> <    (inaddr2.sin_addr.s_addr & netmask)) {
> ---
> > if((inaddr1->sin_addr.s_addr & netmask) ==
> >    (inaddr2->sin_addr.s_addr & netmask)) {
> 284,290c281,284
> < struct sockaddr_in6 inaddr1, inaddr2;
> < /* Use temporary variables and memcpy's so that we don't
> < [...]
> < memcpy(&inaddr1, addr1, sizeof(inaddr1));
> < memcpy(&inaddr2, addr2, sizeof(inaddr2));
> < struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
> < struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
> ---
> > const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
> > const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
> > struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
> > struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
> tyr openmpi-1.9a1r29972 167 
> 
> 
> 
> Now my debug information.
> 
> tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
> tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9' in your 
> .dbxrc
> Reading ompi_info
> Reading ld.so.1
> Reading libmpi.so.0.0.0
> Reading libopen-rte.so.0.0.0
> Reading libopen-pal.so.0.0.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) run -a
> Running: ompi_info -a 
> (process id 10998)
> Reading libc_psr.so.1
> ...
>MCA compress: parameter "compress_base_verbose" (current value:
>  "-1", data source: default, level: 8 dev/detail

Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default

2013-12-19 Thread Ralph Castain
Okay, I think I have these things fixed in r29978 on the trunk - please give it 
a spin and confirm so we can move it to 1.7.4


On Dec 19, 2013, at 7:54 AM, Barrett, Brian W  wrote:

> On 12/19/13 8:43 AM, "Ralph Castain"  wrote:
> 
>> 
>> On Dec 19, 2013, at 6:27 AM, Barrett, Brian W  wrote:
>> 
>>> On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)" 
>>> wrote:
>>> 
>>>> 3. Finally, we're giving a warning saying:
>>>>
>>>> -
>>>> WARNING: a request was made to bind a process. While the system
>>>> supports binding the process itself, at least one node does NOT
>>>> support binding memory to the process location.
>>>> -
>>>>
>>>> For both #1 and #3, I wonder if we shouldn't be warning if no binding
>>>> was
>>>> explicitly stated (i.e., we're just using the defaults).  Specifically,
>>>> if no binding is specified:
>>>>
>>>> - if we oversubscribe, (possibly) warn about the performance loss of
>>>> oversubscription, and don't bind
>>>> - don't warn about lack of memory binding
>>> 
>>> We have a couple machines where memory binding is failing for one reason
>>> or another.  If we're binding by default, we really shouldn't throw
>>> error
>>> messages about not being able to bind memory.  It's REALLY annoying.
>> 
>> Just to help me understand a bit better - you are saying that the node
>> supports process binding, but not memory binding? I don't see how the
>> error appears otherwise, but want to ensure I understand the code path.
> 
> That appears to be the case, yes.
> 
> Brian
> 
> --
>  Brian W. Barrett
>  Scalable System Software Group
>  Sandia National Laboratories
> 
> 
> 



Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter

2013-12-19 Thread Jeff Squyres (jsquyres)
I think there's no problem with removing them from the dll code -- that stuff 
doesn't affect MPI application ABI.


On Dec 19, 2013, at 9:42 AM, Barrett, Brian W  wrote:

> Someone who understands the mpi debugging handles code:
> 
> The opal_progress_recursion_depth_counter and opal_progress_thread_counter
> are both only used internally in opal_progress (for book keeping, but
> never any decisions) and are declared in ompi_mpihandles_dll.c, but then
> don't appear to be used.  Is there a disadvantage to:
> 
> 1) removing them from mpihandles_dll.c
> 
> or, if that breaks ABI,
> 
> 2) Leaving them, but not doing the bookkeeping?
> 
> It's fairly heavyweight bookkeeping, so I agree with Nathan, I'd like to
> remove it.  But I'd like to remove it pre-1.7.4.  Which means today.
> 
> Brian
> 
> 
> On 12/18/13 4:40 PM, "Nathan Hjelm"  wrote:
> 
>> Opps, yeah. Meant 1.7.5. If people agree with this change I could
>> possibly slip it in before Friday but that is unlikely.
>> 
>> On Wed, Dec 18, 2013 at 03:32:36PM -0800, Ralph Castain wrote:
>>> U1.7.4 is leaving the station on Fri, Nathan, so next Tues =>
>>> will have to go into 1.7.5
>>> 
>>> 
>>> On Dec 18, 2013, at 3:23 PM, Nathan Hjelm  wrote:
>>> 
>>>> What: Remove the opal_progress_recursion_depth_counter from
>>>> opal_progress.
>>>>
>>>> Why: This counter adds two atomic adds to the critical path when
>>>> OPAL_HAVE_THREADS is set (which is the case for most builds). I
>>>> grepped
>>>> through ompi, orte, and opal to find where this value was being used
>>>> and
>>>> did not find anything either inside or outside opal_progress.
>>>>
>>>> When: I want this change to go into 1.7.4 (if possible) so setting a
>>>> quick timeout for next Tuesday.
>>>>
>>>> Let me know if there is a good reason to keep this counter and it will
>>>> be spared.
>>>>
>>>> -Nathan Hjelm
>>>> HPC-5, LANL
>> 
> 
> 
> --
>  Brian W. Barrett
>  Scalable System Software Group
>  Sandia National Laboratories
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default

2013-12-19 Thread Jeff Squyres (jsquyres)
On Dec 19, 2013, at 10:54 AM, Barrett, Brian W  wrote:

>> Just to help me understand a bit better - you are saying that the node
>> supports process binding, but not memory binding? I don't see how the
>> error appears otherwise, but want to ensure I understand the code path.
> 
> That appears to be the case, yes.

I think that's what's happening on the Absoft systems, too.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default

2013-12-19 Thread Barrett, Brian W
That worked for me.

Brian

On 12/19/13 9:32 AM, "Ralph Castain"  wrote:

>
>
>
>Okay, I think I have these things fixed in r29978 on the trunk - please
>give it a spin and confirm so we can move it to 1.7.4
>
>
>
>On Dec 19, 2013, at 7:54 AM, Barrett, Brian W  wrote:
>
>
>On 12/19/13 8:43 AM, "Ralph Castain"  wrote:
>
>
>
>On Dec 19, 2013, at 6:27 AM, Barrett, Brian W  wrote:
>
>On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)" 
>wrote:
>
>3. Finally, we're giving a warning saying:
>
>-
>WARNING: a request was made to bind a process. While the system
>supports binding the process itself, at least one node does NOT
>support binding memory to the process location.
>-
>
>For both #1 and #3, I wonder if we shouldn't be warning if no binding
>was
>explicitly stated (i.e., we're just using the defaults).  Specifically,
>if no binding is specified:
>
>- if we oversubscribe, (possibly) warn about the performance loss of
>oversubscription, and don't bind
>- don't warn about lack of memory binding
>
>
>
>We have a couple machines where memory binding is failing for one reason
>or another.  If we're binding by default, we really shouldn't throw
>error
>messages about not being able to bind memory.  It's REALLY annoying.
>
>
>
>Just to help me understand a bit better - you are saying that the node
>supports process binding, but not memory binding? I don't see how the
>error appears otherwise, but want to ensure I understand the code path.
>
>
>
>That appears to be the case, yes.
>
>Brian
>
>--
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories
>
>
>
>
>
>
>
>
>


--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories





Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter

2013-12-19 Thread Barrett, Brian W
Nathan -

Any chance you can remove the two counters this afternoon?

Brian

On 12/19/13 10:01 AM, "Jeff Squyres (jsquyres)"  wrote:

>I think there's no problem with removing them from the dll code -- that
>stuff doesn't affect MPI application ABI.
>
>
>On Dec 19, 2013, at 9:42 AM, Barrett, Brian W  wrote:
>
>> Someone who understands the mpi debugging handles code:
>> 
>> The opal_progress_recursion_depth_counter and
>>opal_progress_thread_counter
>> are both only used internally in opal_progress (for book keeping, but
>> never any decisions) and are declared in ompi_mpihandles_dll.c, but then
>> don't appear to be used.  Is there a disadvantage to:
>> 
>> 1) removing them from mpihandles_dll.c
>> 
>> or, if that breaks ABI,
>> 
>> 2) Leaving them, but not doing the bookkeeping?
>> 
>> It's fairly heavyweight bookkeeping, so I agree with Nathan, I'd like to
>> remove it.  But I'd like to remove it pre-1.7.4.  Which means today.
>> 
>> Brian
>> 
>> 
>> On 12/18/13 4:40 PM, "Nathan Hjelm"  wrote:
>> 
>>> Opps, yeah. Meant 1.7.5. If people agree with this change I could
>>> possibly slip it in before Friday but that is unlikely.
>>> 
>>> On Wed, Dec 18, 2013 at 03:32:36PM -0800, Ralph Castain wrote:
>>>> U1.7.4 is leaving the station on Fri, Nathan, so next Tues =>
>>>> will have to go into 1.7.5
>>>>
>>>>
>>>> On Dec 18, 2013, at 3:23 PM, Nathan Hjelm  wrote:
>>>>
>>>>> What: Remove the opal_progress_recursion_depth_counter from
>>>>> opal_progress.
>>>>>
>>>>> Why: This counter adds two atomic adds to the critical path when
>>>>> OPAL_HAVE_THREADS is set (which is the case for most builds). I
>>>>> grepped
>>>>> through ompi, orte, and opal to find where this value was being used
>>>>> and
>>>>> did not find anything either inside or outside opal_progress.
>>>>>
>>>>> When: I want this change to go into 1.7.4 (if possible) so setting a
>>>>> quick timeout for next Tuesday.
>>>>>
>>>>> Let me know if there is a good reason to keep this counter and it
>>>>> will
>>>>> be spared.
>>>>>
>>>>> -Nathan Hjelm
>>>>> HPC-5, LANL
>>> 
>> 
>> 
>> --
>>  Brian W. Barrett
>>  Scalable System Software Group
>>  Sandia National Laboratories
>> 
>> 
>> 
>
>
>-- 
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>


--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories





Re: [OMPI devel] Speedup for MPI_Dims_create()

2013-12-19 Thread Jeff Squyres (jsquyres)
Andreas --

Thanks for the patch.  Can I ask two things?

1. Can you separate the patch into two: one with the code change, and another 
with the whitespace update?  It will help the readability of the logs to see 
the exact code change, rather than bury it in a syntax update.

2. You added a copyright notice, which is great.  However, it puts this patch 
in a strange position for us -- I think we'd be comfortable with a copyrighted 
patch if we have a 3rd party agreement on file from your organization (i.e., so 
that the copyright holder won't come back to us later and sue us for 
distributing the patch under the BSD license).  I think there are two options 
here (and IANAL, so I could well be wrong here):

2a. Re-submit the patch without a copyright header.  It's such a small patch (1 
line of code change, AFAICT?) that I think we can accept it without a 
contribution agreement.  We'd cite you in the NEWS file and commit logs, of 
course.
2b. Submit a third party contribution agreement (see 
http://www.open-mpi.org/community/contribute/).  Then we can list your 
organization under http://www.open-mpi.org/about/members/, and we can accept 
the patch with the copyright header.

Thanks!


On Dec 19, 2013, at 9:37 AM, Andreas Schäfer  wrote:

> Dear all,
> 
> please find attached a (trivial) patch to MPI_Dims_create(). When
> computing the prime factors of nnodes, it is sufficient to check for
> primes less than or equal to sqrt(nnodes).
> 
> This was not so much of a problem in the past, but now that Tier 0
> systems are capable of running O(10^6) MPI processes, the difference
> in execution time is on the order of seconds (e.g. 8.86s vs. 0.04s on
> my notebook, with nnproc = 10^6).
> 
> Best
> -Andreas
> 
> PS: oh, and the patch removes some trailing whitespace. Yuck. :-)
> 
> 
> -- 
> ==
> Andreas Schäfer
> HPC and Grid Computing
> Chair of Computer Science 3
> Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
> +49 9131 85-27910
> PGP/GPG key via keyserver
> http://www.libgeodecomp.org
> ==
> 
> (\___/)
> (+'.'+)
> (")_(")
> This is Bunny. Copy and paste Bunny into your
> signature to help him gain world domination!
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
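
For reference, the idea in the patch description quoted above (trial division only needs candidates up to sqrt(nnodes)) can be sketched roughly as follows. This is a minimal illustration, not the actual MPI_Dims_create() code or the submitted patch; the helper name demo_prime_factors and its signature are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not Open MPI code): stores the prime factors of n,
 * with multiplicity and in ascending order, into out[] and returns how many
 * factors were found. At most max_out entries are written. */
static int demo_prime_factors(int n, int *out, int max_out)
{
    int count = 0;

    assert(n >= 1 && out != NULL);

    /* Only candidates p with p * p <= n can still divide the remaining n,
     * so the loop stops at the square root instead of at n itself. */
    for (int p = 2; (long long)p * p <= n; ++p) {
        while (n % p == 0) {
            if (count < max_out) {
                out[count] = p;
            }
            ++count;
            n /= p;
        }
    }

    /* Anything left above 1 is a single prime larger than every p tried. */
    if (n > 1) {
        if (count < max_out) {
            out[count] = n;
        }
        ++count;
    }
    return count;
}
```

For nnodes = 10^6 = 2^6 * 5^6 this loop tries only a handful of candidates before the remaining value drops to 1, where a loop bounded by nnodes would keep iterating.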



Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter

2013-12-19 Thread Hjelm, Nathan T
Yes. I will do that once I finish preparing the ORNL collectives for the trunk. 
Will be 8pm at the latest.

-Nathan

From: devel [devel-boun...@open-mpi.org] on behalf of Barrett, Brian W 
[bwba...@sandia.gov]
Sent: Thursday, December 19, 2013 10:24 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion 
depth counter

Nathan -

Any chance you can remove the two counters this afternoon?

Brian

On 12/19/13 10:01 AM, "Jeff Squyres (jsquyres)"  wrote:

>I think there's no problem with removing them from the dll code -- that
>stuff doesn't affect MPI application ABI.
>
>
>On Dec 19, 2013, at 9:42 AM, Barrett, Brian W  wrote:
>
>> Someone who understands the mpi debugging handles code:
>>
>> The opal_progress_recursion_depth_counter and opal_progress_thread_counter
>> are both only used internally in opal_progress (for bookkeeping, but
>> never any decisions) and are declared in ompi_mpihandles_dll.c, but then
>> don't appear to be used.  Is there a disadvantage to:
>>
>> 1) removing them from mpihandles_dll.c
>>
>> or, if that breaks ABI,
>>
>> 2) Leaving them, but not doing the bookkeeping?
>>
>> It's fairly heavyweight bookkeeping, so I agree with Nathan, I'd like to
>> remove it.  But I'd like to remove it pre-1.7.4.  Which means today.
>>
>> Brian
>>
>>
>> On 12/18/13 4:40 PM, "Nathan Hjelm"  wrote:
>>
>>> Oops, yeah. Meant 1.7.5. If people agree with this change I could
>>> possibly slip it in before Friday but that is unlikely.
>>>
>>> On Wed, Dec 18, 2013 at 03:32:36PM -0800, Ralph Castain wrote:
>>>> 1.7.4 is leaving the station on Fri, Nathan, so next Tues =>
>>>> will have to go into 1.7.5
>>>>
>>>>
>>>> On Dec 18, 2013, at 3:23 PM, Nathan Hjelm  wrote:
>>>>
>>>>> What: Remove the opal_progress_recursion_depth_counter from
>>>>> opal_progress.
>>>>>
>>>>> Why: This counter adds two atomic adds to the critical path when
>>>>> OPAL_HAVE_THREADS is set (which is the case for most builds). I grepped
>>>>> through ompi, orte, and opal to find where this value was being used and
>>>>> did not find anything either inside or outside opal_progress.
>>>>>
>>>>> When: I want this change to go into 1.7.4 (if possible) so setting a
>>>>> quick timeout for next Tuesday.
>>>>>
>>>>> Let me know if there is a good reason to keep this counter and it will
>>>>> be spared.
>>>>>
>>>>> -Nathan Hjelm
>>>>> HPC-5, LANL
>>>>> ___
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>>
>> --
>>  Brian W. Barrett
>>  Scalable System Software Group
>>  Sandia National Laboratories
>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>___
>devel mailing list
>de...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories





Re: [OMPI devel] [PATCH v2 2/2] Trying to get the C/R code to compile again. (send_*_nb)

2013-12-19 Thread Adrian Reber
Thanks for the review. I am re-spinning the patches and sending the new
version in a few moments.

On Wed, Dec 18, 2013 at 06:56:47AM -0800, Ralph Castain wrote:
> In the case of the send, there really isn't any problem with just replacing 
> things - the non-blocking change won't impact anything, so no need to retain 
> the old code. People were only concerned about the recv's as those places 
> will require further repair, and they wanted to ensure we know where those 
> places are located.
> 
> You also need to change those comparisons, however, as the return code isn't 
> the number of bytes sent any more - it is just ORTE_SUCCESS or else an error 
> code, so you should be testing for ORTE_SUCCESS ==
> 
> 
> 
> 
> On Dec 18, 2013, at 6:42 AM, Adrian Reber  wrote:
> 
> > From: Adrian Reber 
> > 
> > This patch changes all send/send_buffer occurrences in the C/R code
> > to send_nb/send_buffer_nb.
> > The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
> > The new code compiles but does not work.
> > 
> > Changes from V1:
> > * #ifdef out the code (so it is preserved for later re-design)
> > * marked the broken C/R code with ENABLE_FT_FIXED
> > 
> > Signed-off-by: Adrian Reber 
> > ---
> > ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 18 +++
> > orte/mca/errmgr/base/errmgr_base_tool.c |  4 ++
> > orte/mca/rml/ftrm/rml_ftrm.h| 19 
> > orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
> > orte/mca/rml/ftrm/rml_ftrm_module.c | 63 
> > +
> > orte/mca/snapc/full/snapc_full_app.c| 20 
> > orte/mca/snapc/full/snapc_full_global.c | 12 +
> > orte/mca/snapc/full/snapc_full_local.c  |  4 ++
> > orte/mca/sstore/central/sstore_central_app.c|  8 
> > orte/mca/sstore/central/sstore_central_global.c |  4 ++
> > orte/mca/sstore/central/sstore_central_local.c  | 12 +
> > orte/mca/sstore/stage/sstore_stage_app.c|  8 
> > orte/mca/sstore/stage/sstore_stage_global.c |  4 ++
> > orte/mca/sstore/stage/sstore_stage_local.c  | 16 +++
> > orte/tools/orte-checkpoint/orte-checkpoint.c|  4 ++
> > orte/tools/orte-migrate/orte-migrate.c  |  4 ++
> > 16 files changed, 130 insertions(+), 72 deletions(-)
> > 
> > diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c 
> > b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> > index cba7586..4f7bd7f 100644
> > --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> > +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> > @@ -5102,7 +5102,11 @@ static int wait_quiesce_drained(void)
> > PACK_BUFFER(buffer, response, 1, OPAL_SIZE, "");
> > 
> > /* JJH - Performance Optimization? - Why not post all isends, 
> > then wait? */
> > +#ifdef ENABLE_FT_FIXED
> > +/* This is the old, now broken code */
> > if ( 0 > ( ret = 
> > ompi_rte_send_buffer(&(cur_peer_ref->proc_name), buffer, 
> > OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
> > +#endif /* ENABLE_FT_FIXED */
> > +if ( 0 > ( ret = 
> > ompi_rte_send_buffer_nb(&(cur_peer_ref->proc_name), buffer, 
> > OMPI_CRCP_COORD_BOOKMARK_TAG, orte_rml_send_callback, NULL)) ) {
> > exit_status = ret;
> > goto cleanup;
> > }
> > @@ -5303,7 +5307,11 @@ static int send_bookmarks(int peer_idx)
> > PACK_BUFFER(buffer, (peer_ref->total_msgs_recvd), 1, OPAL_UINT32,
> > "crcp:bkmrk: send_bookmarks: Unable to pack 
> > total_msgs_recvd");
> > 
> > +#ifdef ENABLE_FT_FIXED
> > +/* This is the old, now broken code */
> > if ( 0 > ( ret = ompi_rte_send_buffer(&peer_name, buffer, 
> > OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
> > +#endif /* ENABLE_FT_FIXED */
> > +if ( 0 > ( ret = ompi_rte_send_buffer_nb(&peer_name, buffer, 
> > OMPI_CRCP_COORD_BOOKMARK_TAG, orte_rml_send_callback, NULL)) ) {
> > opal_output(mca_crcp_bkmrk_component.super.output_handle,
> > "crcp:bkmrk: send_bookmarks: Failed to send bookmark to 
> > peer %s: Return %d\n",
> > OMPI_NAME_PRINT(&peer_name),
> > @@ -5599,8 +5607,13 @@ static int 
> > do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> > /*
> >  * Do the send...
> >  */
> > +#ifdef ENABLE_FT_FIXED
> > +/* This is the old, now broken code */
> > if ( 0 > ( ret = ompi_rte_send_buffer(&peer_ref->proc_name, buffer,
> >   OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) 
> > ) {
> > +#endif /* ENABLE_FT_FIXED */
> > +if ( 0 > ( ret = ompi_rte_send_buffer_nb(&peer_ref->proc_name, buffer,
> > +  OMPI_CRCP_COORD_BOOKMARK_TAG, 
> > orte_rml_send_callback, NULL)) ) {
> > opal_output(mca_crcp_bkmrk_component.super.output_handle,
> > "crcp:bkmrk: do_send_msg_detail: Unable to send message 
> > details to peer %s: Return %d\n",
> > OMPI_NAME_PRINT(&peer_ref->proc_name),
> > @@ -62

[OMPI devel] [PATCH v3 0/2] Trying to get the C/R code to compile again

2013-12-19 Thread Adrian Reber
From: Adrian Reber 

This is the second try to replace the usage of blocking send and
recv in the C/R code with the non-blocking versions. The new code
compiles (in contrast to the old code) but does not work yet.
This is the first step to get the C/R code working again. Right
now it only compiles.

Changes from V1:
* #ifdef out the broken code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED

Changes from V2:
* only #ifdef out parts where the behaviour actually changes

Adrian Reber (2):
  Trying to get the C/R code to compile again. (recv_*_nb)
  Trying to get the C/R code to compile again. (send_*_nb)

 ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c|  64 +--
 orte/mca/errmgr/base/errmgr_base_tool.c |  20 +---
 orte/mca/rml/ftrm/rml_ftrm.h|  46 +---
 orte/mca/rml/ftrm/rml_ftrm_component.c  |   4 -
 orte/mca/rml/ftrm/rml_ftrm_module.c | 139 +++-
 orte/mca/snapc/full/snapc_full_app.c|  32 +-
 orte/mca/snapc/full/snapc_full_global.c |  52 -
 orte/mca/snapc/full/snapc_full_local.c  |  40 ++-
 orte/mca/sstore/central/sstore_central_app.c|  14 ++-
 orte/mca/sstore/central/sstore_central_global.c |  21 +---
 orte/mca/sstore/central/sstore_central_local.c  |  29 ++---
 orte/mca/sstore/stage/sstore_stage_app.c|  13 ++-
 orte/mca/sstore/stage/sstore_stage_global.c |  21 +---
 orte/mca/sstore/stage/sstore_stage_local.c  |  33 +++---
 orte/tools/orte-checkpoint/orte-checkpoint.c|  20 +---
 orte/tools/orte-migrate/orte-migrate.c  |  20 +---
 16 files changed, 186 insertions(+), 382 deletions(-)

-- 
1.8.4.2



[OMPI devel] [PATCH v3 1/2] Trying to get the C/R code to compile again. (recv_*_nb)

2013-12-19 Thread Adrian Reber
From: Adrian Reber 

This patch changes all recv/recv_buffer occurrences in the C/R code
to recv_nb/recv_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.

Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED

Changes from V2:
* only #ifdef out the code where the behaviour is changed
  (used to be blocking; now non-blocking)

Signed-off-by: Adrian Reber 
---
 ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 41 +
 orte/mca/errmgr/base/errmgr_base_tool.c | 16 +
 orte/mca/rml/ftrm/rml_ftrm.h| 27 ++---
 orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
 orte/mca/rml/ftrm/rml_ftrm_module.c | 78 +++--
 orte/mca/snapc/full/snapc_full_app.c| 12 
 orte/mca/snapc/full/snapc_full_global.c | 37 +++-
 orte/mca/snapc/full/snapc_full_local.c  | 36 +++-
 orte/mca/sstore/central/sstore_central_app.c|  6 ++
 orte/mca/sstore/central/sstore_central_global.c | 17 +-
 orte/mca/sstore/central/sstore_central_local.c  | 17 +-
 orte/mca/sstore/stage/sstore_stage_app.c|  5 ++
 orte/mca/sstore/stage/sstore_stage_global.c | 17 +-
 orte/mca/sstore/stage/sstore_stage_local.c  | 17 +-
 orte/tools/orte-checkpoint/orte-checkpoint.c| 16 +
 orte/tools/orte-migrate/orte-migrate.c  | 16 +
 16 files changed, 87 insertions(+), 273 deletions(-)

diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c 
b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
index 5d4005f..05cd598 100644
--- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
+++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
@@ -4717,7 +4717,6 @@ static int ft_event_post_drain_acks(void)
 ompi_crcp_bkmrk_pml_drain_message_ack_ref_t * drain_msg_ack = NULL;
 opal_list_item_t* item = NULL;
 size_t req_size;
-int ret;

 req_size  = opal_list_get_size(&drained_msg_ack_list);
 if(req_size <= 0) {
@@ -4739,17 +4738,8 @@ static int ft_event_post_drain_acks(void)
 drain_msg_ack = (ompi_crcp_bkmrk_pml_drain_message_ack_ref_t*)item;

 /* Post the receive */
-if( OMPI_SUCCESS != (ret = ompi_rte_recv_buffer_nb( &drain_msg_ack->peer,
-OMPI_CRCP_COORD_BOOKMARK_TAG,
-0,
-drain_message_ack_cbfunc,
-NULL) ) ) {
-opal_output(mca_crcp_bkmrk_component.super.output_handle,
-"crcp:bkmrk: %s <-- %s: Failed to post a RML receive to the peer\n",
-OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),
-OMPI_NAME_PRINT(&(drain_msg_ack->peer)));
-return ret;
-}
+ompi_rte_recv_buffer_nb(&drain_msg_ack->peer, OMPI_CRCP_COORD_BOOKMARK_TAG,
+0, drain_message_ack_cbfunc, NULL);
 }

 return OMPI_SUCCESS;
@@ -5322,26 +5312,14 @@ static int send_bookmarks(int peer_idx)
 static int recv_bookmarks(int peer_idx)
 {
 ompi_process_name_t peer_name;
-int exit_status = OMPI_SUCCESS;
-int ret;

 START_TIMER(CRCP_TIMER_CKPT_EX_PEER_R);

 peer_name.jobid  = OMPI_PROC_MY_NAME->jobid;
 peer_name.vpid   = peer_idx;

-if ( 0 > (ret = ompi_rte_recv_buffer_nb(&peer_name,
-OMPI_CRCP_COORD_BOOKMARK_TAG,
-0,
-recv_bookmarks_cbfunc,
-NULL) ) ) {
-opal_output(mca_crcp_bkmrk_component.super.output_handle,
-"crcp:bkmrk: recv_bookmarks: Failed to post receive 
bookmark from peer %s: Return %d\n",
-OMPI_NAME_PRINT(&peer_name),
-ret);
-exit_status = ret;
-goto cleanup;
-}
+ompi_rte_recv_buffer_nb(&peer_name, OMPI_CRCP_COORD_BOOKMARK_TAG,
+0, recv_bookmarks_cbfunc, NULL);

 ++total_recv_bookmarks;

@@ -5350,7 +5328,7 @@ static int recv_bookmarks(int peer_idx)
 /* JJH Doesn't make much sense to print this. The real bottleneck is 
always the send_bookmarks() */
 /*DISPLAY_INDV_TIMER(CRCP_TIMER_CKPT_EX_PEER_R, peer_idx, 1);*/

-return exit_status;
+return OMPI_SUCCESS;
 }

 static void recv_bookmarks_cbfunc(int status,
@@ -5616,6 +5594,8 @@ static int 
do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
 /*
  * Recv the ACK msg
  */
+#ifdef ENABLE_FT_FIXED
+/* This is the old, now broken code */
 if ( 0 > (ret = ompi_rte_recv_buffer(&peer_ref->proc_name, buffer,
  OMPI_CRCP_COORD_BOOKMARK_TAG, 0) ) ) {
 opal_output(mca_crcp

[OMPI devel] [PATCH v3 2/2] Trying to get the C/R code to compile again. (send_*_nb)

2013-12-19 Thread Adrian Reber
From: Adrian Reber 

This patch changes all send/send_buffer occurrences in the C/R code
to send_nb/send_buffer_nb.
The new code compiles but does not work.

Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED

Changes from V2:
* just replace the blocking calls with the non-blocking calls
* all #ifdef's introduced in V1 are gone
* send_* returns error code or ORTE_SUCCESS (not the number of bytes)

Signed-off-by: Adrian Reber 
---
 ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 23 ++
 orte/mca/errmgr/base/errmgr_base_tool.c |  4 +-
 orte/mca/rml/ftrm/rml_ftrm.h| 19 
 orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
 orte/mca/rml/ftrm/rml_ftrm_module.c | 61 +++--
 orte/mca/snapc/full/snapc_full_app.c| 20 ++--
 orte/mca/snapc/full/snapc_full_global.c | 15 --
 orte/mca/snapc/full/snapc_full_local.c  |  4 +-
 orte/mca/sstore/central/sstore_central_app.c|  8 +++-
 orte/mca/sstore/central/sstore_central_global.c |  4 +-
 orte/mca/sstore/central/sstore_central_local.c  | 12 +++--
 orte/mca/sstore/stage/sstore_stage_app.c|  8 +++-
 orte/mca/sstore/stage/sstore_stage_global.c |  4 +-
 orte/mca/sstore/stage/sstore_stage_local.c  | 16 +--
 orte/tools/orte-checkpoint/orte-checkpoint.c|  4 +-
 orte/tools/orte-migrate/orte-migrate.c  |  4 +-
 16 files changed, 99 insertions(+), 109 deletions(-)

diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c 
b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
index 05cd598..5ad9a3e 100644
--- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
+++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
@@ -5077,7 +5077,7 @@ static int wait_quiesce_drained(void)
  "crcp:bkmrk: %s --> %s Send ACKs to Peer\n",
  OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),
  OMPI_NAME_PRINT(&(cur_peer_ref->proc_name)) 
));
-
+
 /* Send All Clear to Peer */
 if (NULL == (buffer = OBJ_NEW(opal_buffer_t))) {
 exit_status = OMPI_ERROR;
@@ -5087,7 +5087,9 @@ static int wait_quiesce_drained(void)
 PACK_BUFFER(buffer, response, 1, OPAL_SIZE, "");

 /* JJH - Performance Optimization? - Why not post all isends, then 
wait? */
-if ( 0 > ( ret = ompi_rte_send_buffer(&(cur_peer_ref->proc_name), buffer, OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
+if (ORTE_SUCCESS != (ret = ompi_rte_send_buffer_nb(&(cur_peer_ref->proc_name),
+   buffer, OMPI_CRCP_COORD_BOOKMARK_TAG,
+   orte_rml_send_callback, NULL))) {
 exit_status = ret;
 goto cleanup;
 }
@@ -5288,7 +5290,9 @@ static int send_bookmarks(int peer_idx)
 PACK_BUFFER(buffer, (peer_ref->total_msgs_recvd), 1, OPAL_UINT32,
 "crcp:bkmrk: send_bookmarks: Unable to pack total_msgs_recvd");

-if ( 0 > ( ret = ompi_rte_send_buffer(&peer_name, buffer, OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
+if (ORTE_SUCCESS != (ret = ompi_rte_send_buffer_nb(&peer_name, buffer,
+  OMPI_CRCP_COORD_BOOKMARK_TAG,
+  orte_rml_send_callback, NULL))) {
 opal_output(mca_crcp_bkmrk_component.super.output_handle,
 "crcp:bkmrk: send_bookmarks: Failed to send bookmark to 
peer %s: Return %d\n",
 OMPI_NAME_PRINT(&peer_name),
@@ -5567,13 +5571,14 @@ static int 
do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
 /*
  * Do the send...
  */
-if ( 0 > ( ret = ompi_rte_send_buffer(&peer_ref->proc_name, buffer,
-  OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
+if (ORTE_SUCCESS != (ret = ompi_rte_send_buffer_nb(&peer_ref->proc_name, buffer,
+   OMPI_CRCP_COORD_BOOKMARK_TAG,
+   orte_rml_send_callback, NULL))) {
 opal_output(mca_crcp_bkmrk_component.super.output_handle,
 "crcp:bkmrk: do_send_msg_detail: Unable to send message 
details to peer %s: Return %d\n",
 OMPI_NAME_PRINT(&peer_ref->proc_name),
 ret);
-
+
 exit_status = OMPI_ERROR;
 goto cleanup;
 }
@@ -6185,8 +6190,10 @@ static int 
do_recv_msg_detail_resp(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
 "crcp:bkmrk: recv_msg_details: Unable to ask peer for more 
messages");
 PACK_BUFFER(buffer, total_found, 1, OPAL_UINT32,
 "crcp:bkmrk: recv_msg_details: Unable to ask peer for more 
messages");
-
-if ( 0 > ( ret = ompi_rte_send_buffer(&peer

Re: [OMPI devel] [PATCH v3 2/2] Trying to get the C/R code to compile again. (send_*_nb)

2013-12-19 Thread Ralph Castain
+1 from me


On Dec 19, 2013, at 12:54 PM, Adrian Reber  wrote:

> From: Adrian Reber 
> 
> This patch changes all send/send_buffer occurrences in the C/R code
> to send_nb/send_buffer_nb.
> The new code compiles but does not work.
> 
> Changes from V1:
> * #ifdef out the code (so it is preserved for later re-design)
> * marked the broken C/R code with ENABLE_FT_FIXED
> 
> Changes from V2:
> * just replace the blocking calls with the non-blocking calls
> * all #ifdef's introduced in V1 are gone
> * send_* returns error code or ORTE_SUCCESS (not the number of bytes)
> 
> Signed-off-by: Adrian Reber 
> ---
> ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 23 ++
> orte/mca/errmgr/base/errmgr_base_tool.c |  4 +-
> orte/mca/rml/ftrm/rml_ftrm.h| 19 
> orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
> orte/mca/rml/ftrm/rml_ftrm_module.c | 61 +++--
> orte/mca/snapc/full/snapc_full_app.c| 20 ++--
> orte/mca/snapc/full/snapc_full_global.c | 15 --
> orte/mca/snapc/full/snapc_full_local.c  |  4 +-
> orte/mca/sstore/central/sstore_central_app.c|  8 +++-
> orte/mca/sstore/central/sstore_central_global.c |  4 +-
> orte/mca/sstore/central/sstore_central_local.c  | 12 +++--
> orte/mca/sstore/stage/sstore_stage_app.c|  8 +++-
> orte/mca/sstore/stage/sstore_stage_global.c |  4 +-
> orte/mca/sstore/stage/sstore_stage_local.c  | 16 +--
> orte/tools/orte-checkpoint/orte-checkpoint.c|  4 +-
> orte/tools/orte-migrate/orte-migrate.c  |  4 +-
> 16 files changed, 99 insertions(+), 109 deletions(-)
> 
> diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c 
> b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> index 05cd598..5ad9a3e 100644
> --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> @@ -5077,7 +5077,7 @@ static int wait_quiesce_drained(void)
>  "crcp:bkmrk: %s --> %s Send ACKs to Peer\n",
>  OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),
>  OMPI_NAME_PRINT(&(cur_peer_ref->proc_name)) 
> ));
> -
> +
> /* Send All Clear to Peer */
> if (NULL == (buffer = OBJ_NEW(opal_buffer_t))) {
> exit_status = OMPI_ERROR;
> @@ -5087,7 +5087,9 @@ static int wait_quiesce_drained(void)
> PACK_BUFFER(buffer, response, 1, OPAL_SIZE, "");
> 
> /* JJH - Performance Optimization? - Why not post all isends, 
> then wait? */
> -if ( 0 > ( ret = 
> ompi_rte_send_buffer(&(cur_peer_ref->proc_name), buffer, 
> OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
> +if (ORTE_SUCCESS != (ret = 
> ompi_rte_send_buffer_nb(&(cur_peer_ref->proc_name),
> +   buffer, 
> OMPI_CRCP_COORD_BOOKMARK_TAG,
> +   
> orte_rml_send_callback, NULL))) {
> exit_status = ret;
> goto cleanup;
> }
> @@ -5288,7 +5290,9 @@ static int send_bookmarks(int peer_idx)
> PACK_BUFFER(buffer, (peer_ref->total_msgs_recvd), 1, OPAL_UINT32,
> "crcp:bkmrk: send_bookmarks: Unable to pack 
> total_msgs_recvd");
> 
> -if ( 0 > ( ret = ompi_rte_send_buffer(&peer_name, buffer, OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
> +if (ORTE_SUCCESS != (ret = ompi_rte_send_buffer_nb(&peer_name, buffer,
> +  OMPI_CRCP_COORD_BOOKMARK_TAG,
> +  orte_rml_send_callback, NULL))) {
> opal_output(mca_crcp_bkmrk_component.super.output_handle,
> "crcp:bkmrk: send_bookmarks: Failed to send bookmark to 
> peer %s: Return %d\n",
> OMPI_NAME_PRINT(&peer_name),
> @@ -5567,13 +5571,14 @@ static int 
> do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> /*
>  * Do the send...
>  */
> -if ( 0 > ( ret = ompi_rte_send_buffer(&peer_ref->proc_name, buffer,
> -  OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) 
> ) {
> +if (ORTE_SUCCESS != (ret = ompi_rte_send_buffer_nb(&peer_ref->proc_name, 
> buffer,
> +   
> OMPI_CRCP_COORD_BOOKMARK_TAG,
> +   
> orte_rml_send_callback, NULL))) {
> opal_output(mca_crcp_bkmrk_component.super.output_handle,
> "crcp:bkmrk: do_send_msg_detail: Unable to send message 
> details to peer %s: Return %d\n",
> OMPI_NAME_PRINT(&peer_ref->proc_name),
> ret);
> -
> +
> exit_status = OMPI_ERROR;
> goto cleanup;
> }
> @@ -6185,8 +6190,10 @@ static int 
> do_recv_msg_detail_resp(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> "crcp:bkmrk: recv_msg_details

Re: [OMPI devel] [PATCH v3 1/2] Trying to get the C/R code to compile again. (recv_*_nb)

2013-12-19 Thread Ralph Castain
Looks okay to me. On the places where you need to block while waiting for an 
answer, you can use OMPI_WAIT_FOR_COMPLETION - this will spin on opal_progress 
until the condition is met. We use it elsewhere for similar purposes.

See ompi/mca/rte/rte.h for the definition
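
As a rough illustration of that spin-until-done pattern, here is a self-contained sketch. It does not reproduce the real macro from rte.h: every demo_* name is a hypothetical stand-in, with demo_progress() playing the role of opal_progress() and demo_recv_callback() playing the role of the callback handed to the non-blocking receive.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state: number of progress calls still needed before the
 * (simulated) message arrives, and the flag the waiter spins on. */
static int demo_pending_events = 0;
static volatile bool demo_recv_active = false;

/* Stand-in for the receive callback: clears the flag when data arrives. */
static void demo_recv_callback(void)
{
    demo_recv_active = false;
}

/* Stand-in for opal_progress(): each call handles one pending event, and
 * the last event delivers the message by invoking the callback. */
static void demo_progress(void)
{
    assert(demo_pending_events > 0);
    if (--demo_pending_events == 0) {
        demo_recv_callback();
    }
}

/* Spin on the progress function until the flag is cleared. */
#define DEMO_WAIT_FOR_COMPLETION(flag) \
    do {                               \
        while (flag) {                 \
            demo_progress();           \
        }                              \
    } while (0)

/* Emulates a blocking receive on top of a non-blocking one. */
static int demo_blocking_recv(void)
{
    demo_recv_active = true;   /* set before posting the receive */
    demo_pending_events = 3;   /* simulated delay until delivery */
    /* ... a real caller would post the non-blocking receive here ... */
    DEMO_WAIT_FOR_COMPLETION(demo_recv_active);
    return 0;
}
```

The flag has to be set before the receive is posted, otherwise a callback that fires immediately could be overwritten and the wait would never end.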


On Dec 19, 2013, at 12:54 PM, Adrian Reber  wrote:

> From: Adrian Reber 
> 
> This patch changes all recv/recv_buffer occurrences in the C/R code
> to recv_nb/recv_buffer_nb.
> The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
> The new code compiles but does not work.
> 
> Changes from V1:
> * #ifdef out the code (so it is preserved for later re-design)
> * marked the broken C/R code with ENABLE_FT_FIXED
> 
> Changes from V2:
> * only #ifdef out the code where the behaviour is changed
>  (used to be blocking; now non-blocking)
> 
> Signed-off-by: Adrian Reber 
> ---
> ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 41 +
> orte/mca/errmgr/base/errmgr_base_tool.c | 16 +
> orte/mca/rml/ftrm/rml_ftrm.h| 27 ++---
> orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
> orte/mca/rml/ftrm/rml_ftrm_module.c | 78 +++--
> orte/mca/snapc/full/snapc_full_app.c| 12 
> orte/mca/snapc/full/snapc_full_global.c | 37 +++-
> orte/mca/snapc/full/snapc_full_local.c  | 36 +++-
> orte/mca/sstore/central/sstore_central_app.c|  6 ++
> orte/mca/sstore/central/sstore_central_global.c | 17 +-
> orte/mca/sstore/central/sstore_central_local.c  | 17 +-
> orte/mca/sstore/stage/sstore_stage_app.c|  5 ++
> orte/mca/sstore/stage/sstore_stage_global.c | 17 +-
> orte/mca/sstore/stage/sstore_stage_local.c  | 17 +-
> orte/tools/orte-checkpoint/orte-checkpoint.c| 16 +
> orte/tools/orte-migrate/orte-migrate.c  | 16 +
> 16 files changed, 87 insertions(+), 273 deletions(-)
> 
> diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c 
> b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> index 5d4005f..05cd598 100644
> --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> @@ -4717,7 +4717,6 @@ static int ft_event_post_drain_acks(void)
> ompi_crcp_bkmrk_pml_drain_message_ack_ref_t * drain_msg_ack = NULL;
> opal_list_item_t* item = NULL;
> size_t req_size;
> -int ret;
> 
> req_size  = opal_list_get_size(&drained_msg_ack_list);
> if(req_size <= 0) {
> @@ -4739,17 +4738,8 @@ static int ft_event_post_drain_acks(void)
> drain_msg_ack = (ompi_crcp_bkmrk_pml_drain_message_ack_ref_t*)item;
> 
> /* Post the receive */
> -if( OMPI_SUCCESS != (ret = ompi_rte_recv_buffer_nb( 
> &drain_msg_ack->peer,
> -
> OMPI_CRCP_COORD_BOOKMARK_TAG,
> -0,
> -
> drain_message_ack_cbfunc,
> -NULL) ) ) {
> -opal_output(mca_crcp_bkmrk_component.super.output_handle,
> -"crcp:bkmrk: %s <-- %s: Failed to post a RML receive 
> to the peer\n",
> -OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),
> -OMPI_NAME_PRINT(&(drain_msg_ack->peer)));
> -return ret;
> -}
> +ompi_rte_recv_buffer_nb(&drain_msg_ack->peer, 
> OMPI_CRCP_COORD_BOOKMARK_TAG,
> +0, drain_message_ack_cbfunc, NULL);
> }
> 
> return OMPI_SUCCESS;
> @@ -5322,26 +5312,14 @@ static int send_bookmarks(int peer_idx)
> static int recv_bookmarks(int peer_idx)
> {
> ompi_process_name_t peer_name;
> -int exit_status = OMPI_SUCCESS;
> -int ret;
> 
> START_TIMER(CRCP_TIMER_CKPT_EX_PEER_R);
> 
> peer_name.jobid  = OMPI_PROC_MY_NAME->jobid;
> peer_name.vpid   = peer_idx;
> 
> -if ( 0 > (ret = ompi_rte_recv_buffer_nb(&peer_name,
> -OMPI_CRCP_COORD_BOOKMARK_TAG,
> -0,
> -recv_bookmarks_cbfunc,
> -NULL) ) ) {
> -opal_output(mca_crcp_bkmrk_component.super.output_handle,
> -"crcp:bkmrk: recv_bookmarks: Failed to post receive 
> bookmark from peer %s: Return %d\n",
> -OMPI_NAME_PRINT(&peer_name),
> -ret);
> -exit_status = ret;
> -goto cleanup;
> -}
> +ompi_rte_recv_buffer_nb(&peer_name, OMPI_CRCP_COORD_BOOKMARK_TAG,
> +0, recv_bookmarks_cbfunc, NULL);
> 
> ++total_recv_bookmarks;
> 
> @@ -5350,7 +5328,7 @@ static int recv_bookmarks(int peer_idx)
> /* JJH Doesn't make much sense to print this. The real bottleneck is 
> always the send_bookmarks() */
> /*DISPLAY_INDV_TIMER(CRCP_TIMER_CKPT_EX_PEER_R, pee

[OMPI devel] 1.7 series release plans

2013-12-19 Thread Ralph Castain
Hi folks

Given the number of changes/fixes pushed into the 1.7.4rc's this week, it seems 
best that we delay that release until after the holiday. Accordingly, the 
revised release plan looks like this:

1.7.4rc2 - this weekend

1.7.4 - Jan 10th

1.7.5 feature freeze (hard deadline) - Jan 24th

1.7.5 release - mid-Feb

We are feature-freezing 1.7.4 as of now, so the ORNL collectives will go into 
1.7.5 along with oshmem (assuming it is ready by the deadline).

If we don't connect before the weekend, have a great holiday! I'll be 
occasionally available on email and plan to do a few things over the holiday, 
but it will be somewhat hit-and-miss.

Ralph



[OMPI devel] 1.7.4rc1 build failure: FreeBSD-9

2013-12-19 Thread Paul Hargrove
I see the failure below when building 1.7.4rc1 on FreeBSD-9 (amd64).
It looks to be just a missing header, probably sys/stat.h.
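
A minimal sketch of what I mean, assuming the fix is simply to include the header that POSIX says provides these permission macros. The function name demo_shared_file_mode is hypothetical and this is not the code from sharedfp_sm_file_open.c; it only combines the four macros named in the errors below.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>   /* declares S_IRUSR, S_IWUSR, S_IRGRP, S_IROTH */

/* Hypothetical helper: builds the rw-r--r-- permission mask from the four
 * macros the compiler reported as undeclared. */
static mode_t demo_shared_file_mode(void)
{
    mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;

    /* No execute bit should be set for a plain data file. */
    assert((mode & S_IXUSR) == 0);
    return mode;
}
```

Some platforms pull sys/stat.h in indirectly through other headers, which would explain why the missing include went unnoticed elsewhere.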

$ gcc --version
gcc (GCC) 4.2.1 20070831 patched [FreeBSD]

The only configure option passed was --prefix=...

-Paul



Making all in mca/sharedfp/sm
  CC   sharedfp_sm.lo
  CC   sharedfp_sm_component.lo
  CC   sharedfp_sm_seek.lo
  CC   sharedfp_sm_get_position.lo
  CC   sharedfp_sm_request_position.lo
  CC   sharedfp_sm_write.lo
  CC   sharedfp_sm_iwrite.lo
  CC   sharedfp_sm_read.lo
  CC   sharedfp_sm_iread.lo
  CC   sharedfp_sm_file_open.lo
/home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:
In function 'mca_sharedfp_sm_file_open':
/home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121:
error: 'S_IRUSR' undeclared (first use in this function)
/home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121:
error: (Each undeclared identifier is reported only once
/home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121:
error: for each function it appears in.)
/home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121:
error: 'S_IWUSR' undeclared (first use in this function)
/home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121:
error: 'S_IRGRP' undeclared (first use in this function)
/home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121:
error: 'S_IROTH' undeclared (first use in this function)
*** [sharedfp_sm_file_open.lo] Error code 1



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] 1.7.4rc1 build failure: OpenBSD-5 and NetBSD-6

2013-12-19 Thread Paul Hargrove
When building 1.7.4rc1 on OpenBSD-5 and NetBSD-6 (both amd64) I see what
appear to be the same three errors ("make" output at end of this email)
on both platforms.

All three syntax errors appear to be collisions on the symbol if_mtu:

-bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.h | grep -w -e 182
   182  OPAL_DECLSPEC int opal_ifindextomtu(int if_index, int *if_mtu);
-bash-4.2$ cat -n openmpi-1.7.4rc1/opal/mca/if/if.h | grep -w -e 98
98  int if_mtu;
-bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.c | grep -w -e 482
   482  int opal_ifindextomtu(int if_index, int *if_mtu)

-bash-4.2$ grep if_mtu  /usr/include/net/if.h
#define if_mtu  if_data.ifi_mtu\
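
To illustrate why that macro produces "before '.' token" errors, here is a small self-contained sketch. The demo_* structs and function are hypothetical stand-ins for the system header and for opal_ifindextomtu(), and renaming the parameter is only one possible workaround, not necessarily the fix that will be applied.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the BSD <net/if.h> layout: if_mtu is not a real member but
 * a macro that expands to a nested member access. */
struct demo_if_data { int ifi_mtu; };
struct demo_ifnet  { struct demo_if_data if_data; };
#define if_mtu if_data.ifi_mtu

/* With the macro visible, a declaration such as
 *     int demo_ifindextomtu(int if_index, int *if_mtu);
 * is preprocessed into
 *     int demo_ifindextomtu(int if_index, int *if_data.ifi_mtu);
 * which is a syntax error at the '.' token. Using a parameter name that
 * does not collide with the macro (mtu_out here) avoids the expansion. */
static int demo_ifindextomtu(const struct demo_ifnet *ifp, int *mtu_out)
{
    assert(ifp != NULL && mtu_out != NULL);
    *mtu_out = ifp->if_mtu;   /* expands to ifp->if_data.ifi_mtu */
    return 0;
}
```

The same expansion hits the struct member declared at opal/mca/if/if.h line 98, so a rename (or an #undef after including the system header) would be needed there as well.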


-Paul

OpenBSD:
-bash-4.2$ uname -a
OpenBSD pcp-j-16.my.domain 5.3 GENERIC.MP#62 amd64
-bash-4.2$ gcc --version
gcc (GCC) 4.2.1 20070719

Making all in keyval
  LEX  keyval_lex.c
  CC   keyval_lex.lo
  CCLD libopalutilkeyval.la
  CC   fd.lo
  CC   arch.lo
  CC   argv.lo
  CC   basename.lo
  CC   cmd_line.lo
  CC   crc.lo
  CC   convert.lo
  CC   daemon_init.lo
  CC   error.lo
  CC   few.lo
  CC   if.lo
In file included from
/home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/openmpi-1.7.4rc1/opal/util/if.c:74:
/home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/openmpi-1.7.4rc1/opal/util/if.h:182:
error: expected ';', ',' or ')' before '.' token
In file included from
/home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/openmpi-1.7.4rc1/opal/mca/if/base/base.h:18,
 from
/home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/openmpi-1.7.4rc1/opal/util/if.c:81:
/home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/openmpi-1.7.4rc1/opal/mca/if/if.h:98:
error: expected ':', ',', ';', '}' or '__attribute__' before '.' token
/home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/openmpi-1.7.4rc1/opal/util/if.c:482:
error: expected ';', ',' or ')' before '.' token
*** Error 1 in opal/util (Makefile:1642 'if.lo': @echo "  CC  "
if.lo;depbase=`echo if.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`; /bin/sh ...)
*** Error 1 in opal/util (Makefile:1731 'all-recursive')
*** Error 1 in opal (Makefile:2039 'all-recursive')
*** Error 1 in /home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/BLD
(Makefile:1572 'all-recursive')

NetBSD:

-bash-4.2$ uname -a
NetBSD pcp-j-18 6.1 NetBSD 6.1 (GENERIC) amd64
-bash-4.2$ gcc --version
gcc (NetBSD nb2 20110806) 4.5.3


Making all in keyval
  CC   keyval_lex.lo
  CCLD libopalutilkeyval.la
  CC   fd.lo
  CC   arch.lo
  CC   argv.lo
  CC   basename.lo
  CC   cmd_line.lo
  CC   crc.lo
  CC   convert.lo
  CC   daemon_init.lo
  CC   error.lo
  CC   few.lo
  CC   if.lo
In file included from
/home/phargrov/OMPI/openmpi-1.7.4rc1-netbsd6-amd64/openmpi-1.7.4rc1/opal/util/if.c:74:0:
/home/phargrov/OMPI/openmpi-1.7.4rc1-netbsd6-amd64/openmpi-1.7.4rc1/opal/util/if.h:182:56:
error: expected ';', ',' or ')' before '.' token
In file included from
/home/phargrov/OMPI/openmpi-1.7.4rc1-netbsd6-amd64/openmpi-1.7.4rc1/opal/mca/if/base/base.h:18:0,
 from
/home/phargrov/OMPI/openmpi-1.7.4rc1-netbsd6-amd64/openmpi-1.7.4rc1/opal/util/if.c:81:
/home/phargrov/OMPI/openmpi-1.7.4rc1-netbsd6-amd64/openmpi-1.7.4rc1/opal/mca/if/if.h:98:25:
error: expected ':', ',', ';', '}' or '__attribute__' before '.' token
/home/phargrov/OMPI/openmpi-1.7.4rc1-netbsd6-amd64/openmpi-1.7.4rc1/opal/util/if.c:482:42:
error: expected ';', ',' or ')' before '.' token
*** Error code 1

Stop.

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Paul Hargrove
In 1.7.4rc1's README support is still claimed for Solaris 11 on x86_64 with
Sun Studio (12.2 and 12.3):
  - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
with Oracle Solaris Studio 12.2 and 12.3

However, I get a build failure when configured with:
CC=cc CFLAGS=-m64 --with-wrapper-cflags=-m64
CXX=CC CXXFLAGS='-m64 -library=stlport4'
--with-wrapper-cxxflags=-m64
FC=f90 FCFLAGS=-m64 --with-wrapper-fcflags=-m64
--with-openib --prefix=...

The failure doesn't appear to be compiler specific, and I will be testing
gcc ASAP.

make[2]: Entering directory
`/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/BLD/opal/mca/if/posix_ipv4'
  CC   if_posix.lo
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
line 136: warning: parameter in inline asm statement unused: %3
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
line 182: warning: parameter in inline asm statement unused: %2
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
line 203: warning: parameter in inline asm statement unused: %2
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
line 224: warning: parameter in inline asm statement unused: %2
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
line 245: warning: parameter in inline asm statement unused: %2
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c",
line 272: undefined struct/union member: ifr_hwaddr
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c",
line 272: warning: left operand of "." must be struct/union object
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c",
line 272: cannot access member of non-struct/union object
cc: acomp failed for
/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c
make[2]: *** [if_posix.lo] Error 1
make[2]: Leaving directory
`/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/BLD/opal/mca/if/posix_ipv4'

The atomics warnings are concerning (and appear *MANY* times in the output).
However, the *real* problem is the three errors at
opal/mca/if/posix_ipv4/if_posix.c, line 272.

Solaris doesn't have an ifr_hwaddr field in struct ifreq.
It *does* have an ifr_addr field, but this posting:
http://comments.gmane.org/gmane.os.solaris.opensolaris.networking/12839
suggests that this ioctl probably fails on PF_INET sockets.

The surrounding code looks like:

#ifdef SIOCGIFHWADDR
/* get the MAC address */
if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
opal_output(0, "btl_usnic_opal_ifinit: ioctl(SIOCGIFHWADDR)
failed with errno=%d", errno);
break;
}
memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
#endif

#if defined(SIOCGIFMTU) && defined(HAVE_STRUCT_IFREQ_IFR_MTU)
/* get the MTU */
if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
opal_output(0, "btl_usnic_opal_ifinit: ioctl(SIOCGIFMTU)
failed with errno=%d", errno);
break;
}
intf->if_mtu = ifr->ifr_mtu;
#endif


Note the "btl_usnic_opal_ifinit:" prefix in the opal_output lines is probably a
cut-and-paste error.

-Paul



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.7.4rc1 build failure: Solaris 11 / x86_64

2013-12-19 Thread Paul Hargrove
I've confirmed that the ifr_hwaddr problem also occurs with this system's
/usr/bin/gcc:

Making all in mca/if/posix_ipv4
make[2]: Entering directory
`/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-gcc452/BLD/opal/mca/if/posix_ipv4'
  CC   if_posix.lo
/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-gcc452/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c:
In function ‘if_posix_open’:
/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-gcc452/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c:272:37:
error: ‘struct ifreq’ has no member named ‘ifr_hwaddr’
make[2]: *** [if_posix.lo] Error 1
make[2]: Leaving directory
`/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-gcc452/BLD/opal/mca/if/posix_ipv4'


-Paul


On Thu, Dec 19, 2013 at 3:51 PM, Paul Hargrove  wrote:

> In 1.7.4rc1's README support is still claimed for Solaris 11 on x86_64
> with Sun Studio (12.2 and 12.3):
>   - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
> with Oracle Solaris Studio 12.2 and 12.3
>
> However, I get a build failure when configured with:
> CC=cc CFLAGS=-m64 --with-wrapper-cflags=-m64
> CXX=CC CXXFLAGS='-m64 -library=stlport4'
> --with-wrapper-cxxflags=-m64
> FC=f90 FCFLAGS=-m64 --with-wrapper-fcflags=-m64
> --with-openib --prefix=...
>
> The failure doesn't appear to be compiler specific, and I will be testing
> gcc ASAP.
>
> make[2]: Entering directory
> `/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/BLD/opal/mca/if/posix_ipv4'
>   CC   if_posix.lo
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> line 136: warning: parameter in inline asm statement unused: %3
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> line 182: warning: parameter in inline asm statement unused: %2
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> line 203: warning: parameter in inline asm statement unused: %2
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> line 224: warning: parameter in inline asm statement unused: %2
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> line 245: warning: parameter in inline asm statement unused: %2
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c",
> line 272: undefined struct/union member: ifr_hwaddr
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c",
> line 272: warning: left operand of "." must be struct/union object
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c",
> line 272: cannot access member of non-struct/union object
> cc: acomp failed for
> /shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c
> make[2]: *** [if_posix.lo] Error 1
> make[2]: Leaving directory
> `/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/BLD/opal/mca/if/posix_ipv4'
>
> The atomics warnings are concerning (and appear *MANY* times in the
> output).
> However, the *real* problem is the three errors at
> opal/mca/if/posix_ipv4/if_posix.c, line 272.
>
> Solaris doesn't have an ifr_hwaddr field in struct ifreq.
> It *does* have an ifr_addr field, but this posting:
>
> http://comments.gmane.org/gmane.os.solaris.opensolaris.networking/12839
> suggests that this ioctl probably fails on PF_INET sockets.
>
> The surrounding code looks like:
>
> #ifdef SIOCGIFHWADDR
> /* get the MAC address */
> if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
> opal_output(0, "btl_usnic_opal_ifinit:
> ioctl(SIOCGIFHWADDR) failed with errno=%d", errno);
> break;
> }
> memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
> #endif
>
> #if defined(SIOCGIFMTU) && defined(HAVE_STRUCT_IFREQ_IFR_MTU)
> /* get the MTU */
> if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
> opal_output(0, "btl_usnic_opal_ifinit: ioctl(SIOCGIFMTU)
> failed with errno=%d", errno);
> break;
> }
> intf->if_mtu = ifr->ifr_mtu;
> #endif
>
>
> Note the "btl_usnic_opal_ifinit:" prefix in the opal_output lines is probably a
> cut-and-paste error.
>
> -Paul
>
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Jeff Squyres (jsquyres)
Paul --

Does this patch fix it for you?

Index: opal/mca/if/posix_ipv4/configure.m4
===
--- opal/mca/if/posix_ipv4/configure.m4 (revision 29997)
+++ opal/mca/if/posix_ipv4/configure.m4 (working copy)
@@ -42,8 +42,10 @@
  )
 
 AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"],
-  [AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
+  [AC_CHECK_MEMBERS([struct ifreq.ifr_hwaddr], [], [],
[[#include ]])
+   AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
+   [[#include ]])
   ])
 
 AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"], [$1], [$2]);
Index: opal/mca/if/posix_ipv4/if_posix.c
===
--- opal/mca/if/posix_ipv4/if_posix.c   (revision 29997)
+++ opal/mca/if/posix_ipv4/if_posix.c   (working copy)
@@ -263,22 +263,22 @@
 /* generate CIDR and assign to netmask */
 intf->if_mask = prefix(((struct sockaddr_in*) 
&ifr->ifr_addr)->sin_addr.s_addr);
 
-#ifdef SIOCGIFHWADDR
-/* get the MAC address */
-if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
-opal_output(0, "btl_usnic_opal_ifinit: ioctl(SIOCGIFHWADDR) 
failed with errno=%d", errno);
-break;
-}
-memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
+#if defined(SIOCGIFHWADDR) && defined(HAVE_STRUCT_IFREQ_IFR_HWADDR)
+/* get the MAC address */
+if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
+opal_output(0, "opal_ifinit: ioctl(SIOCGIFHWADDR) failed with 
errno=%d", errno);
+break;
+}
+memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
 #endif
 
 #if defined(SIOCGIFMTU) && defined(HAVE_STRUCT_IFREQ_IFR_MTU)
-/* get the MTU */
-if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
-opal_output(0, "btl_usnic_opal_ifinit: ioctl(SIOCGIFMTU) 
failed with errno=%d", errno);
-break;
-}
-intf->if_mtu = ifr->ifr_mtu;
+/* get the MTU */
+if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
+opal_output(0, "opal_ifinit: ioctl(SIOCGIFMTU) failed with 
errno=%d", errno);
+break;
+}
+intf->if_mtu = ifr->ifr_mtu;
 #endif
 
 opal_list_append(&opal_if_list, &(intf->super));
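
The guard idea in this patch can be sketched on its own. One detail matters when combining the two tests: the directive has to be written as `#if defined(A) && defined(B)`. An `#ifdef A && defined(B)` line is not valid, and compilers such as GCC typically test only A and warn about the extra tokens. In the sketch below, demo_get_mac is a hypothetical helper, and HAVE_STRUCT_IFREQ_IFR_HWADDR is assumed to come from the configure check; nothing defines it in a standalone build, so the fallback branch is what gets compiled.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <net/if.h>

#if defined(SIOCGIFHWADDR) && defined(HAVE_STRUCT_IFREQ_IFR_HWADDR)
/* Hypothetical helper: copy the 6-byte MAC of interface "name" into mac. */
int demo_get_mac(int sd, const char *name, unsigned char mac[6])
{
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, name, sizeof(ifr.ifr_name) - 1);
    if (ioctl(sd, SIOCGIFHWADDR, &ifr) < 0) {
        return -1;
    }
    memcpy(mac, ifr.ifr_hwaddr.sa_data, 6);
    return 0;
}
#else
/* No ioctl or no ifr_hwaddr member: report the MAC as unavailable. */
int demo_get_mac(int sd, const char *name, unsigned char mac[6])
{
    (void) sd;
    (void) name;
    memset(mac, 0, 6);
    return -1;
}
#endif

/* Usage: an invalid descriptor fails in either branch. */
void demo_get_mac_self_check(void)
{
    unsigned char mac[6];

    assert(demo_get_mac(-1, "lo0", mac) == -1);
}
```

Keeping every reference to ifr_hwaddr inside the guarded branch is what lets the file compile on a platform such as Solaris, where the member does not exist.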





On Dec 19, 2013, at 6:51 PM, Paul Hargrove  wrote:

> In 1.7.4rc1's README support is still claimed for Solaris 11 on x86_64 with 
> Sun Studio (12.2 and 12.3):
>   - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
> with Oracle Solaris Studio 12.2 and 12.3
> 
> However, I get a build failure when configured with:
> CC=cc CFLAGS=-m64 --with-wrapper-cflags=-m64
> CXX=CC CXXFLAGS='-m64 -library=stlport4' --with-wrapper-cxxflags=-m64
> FC=f90 FCFLAGS=-m64 --with-wrapper-fcflags=-m64
> --with-openib --prefix=...
> 
> The failure doesn't appear to be compiler specific, and I will be testing gcc 
> ASAP.
> 
> make[2]: Entering directory 
> `/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/BLD/opal/mca/if/posix_ipv4'
>   CC   if_posix.lo
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
>  line 136: warning: parameter in inline asm statement unused: %3
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
>  line 182: warning: parameter in inline asm statement unused: %2
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
>  line 203: warning: parameter in inline asm statement unused: %2
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
>  line 224: warning: parameter in inline asm statement unused: %2
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
>  line 245: warning: parameter in inline asm statement unused: %2
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c",
>  line 272: undefined struct/union member: ifr_hwaddr
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c",
>  line 272: warning: left operand of "." must be struct/union object
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c",
>  line 272: cannot access member of non-struct/union object
> cc: acomp failed for 
> /shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c
> make[2]: *** [if_posix.lo] Error 1
> make[2]: Leaving directory 
> `/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/BLD/opal/mca/if/posix_ipv4'
> 
> The atomics warnings are concerning (and appear *MANY* times in the output).

Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Paul Hargrove
Jeff,

The patch looks fine to my eyes, but I cannot test it:

1) Not sure if email botched whitespace or what, but the patch didn't apply
to if_posix.c.
2) Even if it did, I don't have sufficiently new autoconf on that system to
"use" the configure.m4 part of the patch.

Any chance of a patched-and-autogen'ed tarball to test?

-Paul


On Thu, Dec 19, 2013 at 4:04 PM, Jeff Squyres (jsquyres)  wrote:

> Paul --
>
> Does this patch fix it for you?
>
> Index: opal/mca/if/posix_ipv4/configure.m4
> ===
> --- opal/mca/if/posix_ipv4/configure.m4 (revision 29997)
> +++ opal/mca/if/posix_ipv4/configure.m4 (working copy)
> @@ -42,8 +42,10 @@
>   )
>
>  AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"],
> -  [AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
> +  [AC_CHECK_MEMBERS([struct ifreq.ifr_hwaddr], [], [],
> [[#include ]])
> +   AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
> +   [[#include ]])
>])
>
>  AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"], [$1], [$2]);
> Index: opal/mca/if/posix_ipv4/if_posix.c
> ===
> --- opal/mca/if/posix_ipv4/if_posix.c   (revision 29997)
> +++ opal/mca/if/posix_ipv4/if_posix.c   (working copy)
> @@ -263,22 +263,22 @@
>  /* generate CIDR and assign to netmask */
>  intf->if_mask = prefix(((struct sockaddr_in*)
> &ifr->ifr_addr)->sin_addr.s_addr);
>
> -#ifdef SIOCGIFHWADDR
> -/* get the MAC address */
> -if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
> -opal_output(0, "btl_usnic_opal_ifinit:
> ioctl(SIOCGIFHWADDR) failed with errno=%d", errno);
> -break;
> -}
> -memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
> +#if defined(SIOCGIFHWADDR) && defined(HAVE_STRUCT_IFREQ_IFR_HWADDR)
> +/* get the MAC address */
> +if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
> +opal_output(0, "opal_ifinit: ioctl(SIOCGIFHWADDR) failed with
> errno=%d", errno);
> +break;
> +}
> +memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
>  #endif
>
>  #if defined(SIOCGIFMTU) && defined(HAVE_STRUCT_IFREQ_IFR_MTU)
> -/* get the MTU */
> -if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
> -opal_output(0, "btl_usnic_opal_ifinit: ioctl(SIOCGIFMTU)
> failed with errno=%d", errno);
> -break;
> -}
> -intf->if_mtu = ifr->ifr_mtu;
> +/* get the MTU */
> +if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
> +opal_output(0, "opal_ifinit: ioctl(SIOCGIFMTU) failed with
> errno=%d", errno);
> +break;
> +}
> +intf->if_mtu = ifr->ifr_mtu;
>  #endif
>
>  opal_list_append(&opal_if_list, &(intf->super));
>
>
>
>
>
> On Dec 19, 2013, at 6:51 PM, Paul Hargrove  wrote:
>
> > In 1.7.4rc1's README support is still claimed for Solaris 11 on x86_64
> with Sun Studio (12.2 and 12.3):
> >   - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
> > with Oracle Solaris Studio 12.2 and 12.3
> >
> > However, I get a build failure when configured with:
> > CC=cc CFLAGS=-m64 --with-wrapper-cflags=-m64
> > CXX=CC CXXFLAGS='-m64 -library=stlport4'
> --with-wrapper-cxxflags=-m64
> > FC=f90 FCFLAGS=-m64 --with-wrapper-fcflags=-m64
> > --with-openib --prefix=...
> >
> > The failure doesn't appear to be compiler specific, and I will be
> testing gcc ASAP.
> >
> > make[2]: Entering directory
> `/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/BLD/opal/mca/if/posix_ipv4'
> >   CC   if_posix.lo
> >
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> line 136: warning: parameter in inline asm statement unused: %3
> >
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> line 182: warning: parameter in inline asm statement unused: %2
> >
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> line 203: warning: parameter in inline asm statement unused: %2
> >
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> line 224: warning: parameter in inline asm statement unused: %2
> >
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> line 245: warning: parameter in inline asm statement unused: %2
> >
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c",
> line 272: undefined struct/union member: ifr_hwaddr
> >
> "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c",
> line 272: warning: left operand of "." must b

Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Jeff Squyres (jsquyres)
Try http://www.open-mpi.org/~jsquyres/unofficial/.

Should have both "if" fixes in it.


On Dec 19, 2013, at 7:12 PM, Paul Hargrove  wrote:

> Jeff,
> 
> The patch looks fine to my eyes, but I cannot test it:
> 
> 1) Not sure if email botched whitespace or what, but the patch didn't apply
> to if_posix.c.
> 2) Even if it did, I don't have sufficiently new autoconf on that system to 
> "use" the configure.m4 part of the patch.
> 
> Any chance of a patched-and-autogen'ed tarball to test?
> 
> -Paul
> 
> 
> On Thu, Dec 19, 2013 at 4:04 PM, Jeff Squyres (jsquyres)  
> wrote:
> Paul --
> 
> Does this patch fix it for you?
> 
> Index: opal/mca/if/posix_ipv4/configure.m4
> ===
> --- opal/mca/if/posix_ipv4/configure.m4 (revision 29997)
> +++ opal/mca/if/posix_ipv4/configure.m4 (working copy)
> @@ -42,8 +42,10 @@
>   )
> 
>  AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"],
> -  [AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
> +  [AC_CHECK_MEMBERS([struct ifreq.ifr_hwaddr], [], [],
> [[#include ]])
> +   AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
> +   [[#include ]])
>])
> 
>  AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"], [$1], [$2]);
> Index: opal/mca/if/posix_ipv4/if_posix.c
> ===
> --- opal/mca/if/posix_ipv4/if_posix.c   (revision 29997)
> +++ opal/mca/if/posix_ipv4/if_posix.c   (working copy)
> @@ -263,22 +263,22 @@
>  /* generate CIDR and assign to netmask */
>  intf->if_mask = prefix(((struct sockaddr_in*) 
> &ifr->ifr_addr)->sin_addr.s_addr);
> 
> -#ifdef SIOCGIFHWADDR
> -/* get the MAC address */
> -if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
> -opal_output(0, "btl_usnic_opal_ifinit: ioctl(SIOCGIFHWADDR) 
> failed with errno=%d", errno);
> -break;
> -}
> -memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
> +#if defined(SIOCGIFHWADDR) && defined(HAVE_STRUCT_IFREQ_IFR_HWADDR)
> +/* get the MAC address */
> +if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
> +opal_output(0, "opal_ifinit: ioctl(SIOCGIFHWADDR) failed with 
> errno=%d", errno);
> +break;
> +}
> +memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
>  #endif
> 
>  #if defined(SIOCGIFMTU) && defined(HAVE_STRUCT_IFREQ_IFR_MTU)
> -/* get the MTU */
> -if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
> -opal_output(0, "btl_usnic_opal_ifinit: ioctl(SIOCGIFMTU) 
> failed with errno=%d", errno);
> -break;
> -}
> -intf->if_mtu = ifr->ifr_mtu;
> +/* get the MTU */
> +if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
> +opal_output(0, "opal_ifinit: ioctl(SIOCGIFMTU) failed with 
> errno=%d", errno);
> +break;
> +}
> +intf->if_mtu = ifr->ifr_mtu;
>  #endif
> 
>  opal_list_append(&opal_if_list, &(intf->super));
> 
> 
> 
> 
> 
> On Dec 19, 2013, at 6:51 PM, Paul Hargrove  wrote:
> 
> > In 1.7.4rc1's README support is still claimed for Solaris 11 on x86_64 with 
> > Sun Studio (12.2 and 12.3):
> >   - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
> > with Oracle Solaris Studio 12.2 and 12.3
> >
> > However, I get a build failure when configured with:
> > CC=cc CFLAGS=-m64 --with-wrapper-cflags=-m64
> > CXX=CC CXXFLAGS='-m64 -library=stlport4' 
> > --with-wrapper-cxxflags=-m64
> > FC=f90 FCFLAGS=-m64 --with-wrapper-fcflags=-m64
> > --with-openib --prefix=...
> >
> > The failure doesn't appear to be compiler specific, and I will be testing 
> > gcc ASAP.
> >
> > make[2]: Entering directory 
> > `/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/BLD/opal/mca/if/posix_ipv4'
> >   CC   if_posix.lo
> > "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> >  line 136: warning: parameter in inline asm statement unused: %3
> > "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> >  line 182: warning: parameter in inline asm statement unused: %2
> > "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> >  line 203: warning: parameter in inline asm statement unused: %2
> > "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> >  line 224: warning: parameter in inline asm statement unused: %2
> > "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h",
> >  line 245: warning: parameter in inline asm statement unused: %2
> > "/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c",
> >  li

Re: [OMPI devel] 1.7.4rc1 build failure: OpenBSD-5 and NetBSD-6

2013-12-19 Thread Jeff Squyres (jsquyres)
On Dec 19, 2013, at 6:27 PM, Paul Hargrove  wrote:

> When building 1.7.4rc1 on OpenBSD-5 and NetBSD-6 (both amd64) I see what 
> appears to be the same three errors ("make" output  at end of this email) on 
> both platforms.
> 
> All three syntax errors appear to be collisions on the symbol if_mtu:
> 
> -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.h | grep -w -e 182
>182  OPAL_DECLSPEC int opal_ifindextomtu(int if_index, int *if_mtu);
> -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/mca/if/if.h | grep -w -e 98 
> 98  int if_mtu;
> -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.c | grep -w -e 482
>482  int opal_ifindextomtu(int if_index, int *if_mtu)
> 
> -bash-4.2$ grep if_mtu  /usr/include/net/if.h
> #define if_mtu  if_data.ifi_mtu\

Bah.  Terrible.  Ok, thanks -- I'll fix...

(see the tar ball I just sent you... should have this fix in it)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.7.4rc1 build failure: FreeBSD-9

2013-12-19 Thread Ralph Castain
Fixed and cmr'd

thanks!

On Dec 19, 2013, at 3:10 PM, Paul Hargrove  wrote:

> I see the failure below when building 1.7.4rc1 on FreeBSD-9 (amd64).
> It looks to be just a missing header, probably sys/stat.h.
> 
> $ gcc --version
> gcc (GCC) 4.2.1 20070831 patched [FreeBSD]
> 
> Only configure option passed was --prefix=...
> 
> -Paul
> 
> 
> 
> Making all in mca/sharedfp/sm
>   CC   sharedfp_sm.lo
>   CC   sharedfp_sm_component.lo
>   CC   sharedfp_sm_seek.lo
>   CC   sharedfp_sm_get_position.lo
>   CC   sharedfp_sm_request_position.lo
>   CC   sharedfp_sm_write.lo
>   CC   sharedfp_sm_iwrite.lo
>   CC   sharedfp_sm_read.lo
>   CC   sharedfp_sm_iread.lo
>   CC   sharedfp_sm_file_open.lo
> /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:
>  In function 'mca_sharedfp_sm_file_open':
> /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121:
>  error: 'S_IRUSR' undeclared (first use in this function)
> /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121:
>  error: (Each undeclared identifier is reported only once
> /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121:
>  error: for each function it appears in.)
> /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121:
>  error: 'S_IWUSR' undeclared (first use in this function)
> /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121:
>  error: 'S_IRGRP' undeclared (first use in this function)
> /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121:
>  error: 'S_IROTH' undeclared (first use in this function)
> *** [sharedfp_sm_file_open.lo] Error code 1
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
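
The missing-header diagnosis quoted above can be illustrated with a short sketch. The S_I* permission macros are declared by <sys/stat.h>; some platforms happen to pull that header in through other includes, which is presumably why the omission went unnoticed elsewhere. The demo_* helpers are hypothetical and are not the sharedfp code.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>   /* S_IRUSR, S_IWUSR, S_IRGRP, S_IROTH */

/* Owner read/write, group and others read-only (0644). */
mode_t demo_shared_file_mode(void)
{
    return S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;
}

/* Hypothetical helper: create or open a file with that mode.
 * Returns the descriptor from open(2), or -1 on failure. */
int demo_open_shared_file(const char *path)
{
    return open(path, O_CREAT | O_RDWR, demo_shared_file_mode());
}

/* Usage: the four macros combine to the familiar octal mode. */
void demo_shared_file_mode_self_check(void)
{
    assert(demo_shared_file_mode() == 0644);
}
```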



Re: [OMPI devel] 1.7.4rc1 build failure: OpenBSD-5 and NetBSD-6

2013-12-19 Thread Paul Hargrove
Jeff,

The unofficial "rc2forpaul" gets past the (disgusting) if_mtu problem on
both platforms.

On NetBSD-6 the build completes ("make install" fails, but I'll report that
separately).

However, on OpenBSD-5 we now encounter another failure about 20 files later:

  CC   sys_limits.lo
/home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:
In function 'opal_util_init_sys_limits':
/home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:172:
error: 'RLIMIT_AS' undeclared (first use in this function)
/home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:172:
error: (Each undeclared identifier is reported only once
/home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:172:
error: for each function it appears in.)
*** Error 1 in opal/util (Makefile:1692 'sys_limits.lo': @echo "  CC  "
sys_limits.lo;depbase=`echo sys_limits.lo | sed 's|[^/]*$|.deps/...)
*** Error 1 in opal/util (Makefile:1780 'all-recursive')

The getrlimit manpage on this platform does not list RLIMIT_AS.
Running "grep -rl RLIMIT_AS /usr/include" confirms that this constant does
not exist.
So, I think "#ifdef RLIMIT_AS" is required.
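
A minimal sketch of such a guard, assuming nothing about how sys_limits.c is actually structured (demo_get_as_limit is a hypothetical helper):

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper: fetch the soft address-space limit where the
 * platform defines RLIMIT_AS. Returns -1 and leaves *out untouched where
 * the constant is missing or getrlimit() fails. */
int demo_get_as_limit(rlim_t *out)
{
#ifdef RLIMIT_AS
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) != 0) {
        return -1;
    }
    *out = rl.rlim_cur;
    return 0;
#else
    (void) out;   /* constant not provided by this platform */
    return -1;
#endif
}

/* Usage: callers treat -1 as "limit not available here". */
void demo_get_as_limit_self_check(void)
{
    rlim_t lim = 0;
    int rc = demo_get_as_limit(&lim);

    assert(rc == 0 || rc == -1);
}
```

The #ifdef test works only if the platform exposes RLIMIT_AS as a preprocessor macro, not solely as an enum constant; a configure-time check would be the more robust alternative.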

-Paul


On Thu, Dec 19, 2013 at 4:39 PM, Jeff Squyres (jsquyres)  wrote:

> On Dec 19, 2013, at 6:27 PM, Paul Hargrove  wrote:
>
> > When building 1.7.4rc1 on OpenBSD-5 and NetBSD-6 (both amd64) I see what
> appears to be the same three errors ("make" output  at end of this email)
> on both platforms.
> >
> > All three syntax errors appear to be collisions on the symbol if_mtu:
> >
> > -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.h | grep -w -e 182
> >182  OPAL_DECLSPEC int opal_ifindextomtu(int if_index, int *if_mtu);
> > -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/mca/if/if.h | grep -w -e 98
> > 98  int if_mtu;
> > -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.c | grep -w -e 482
> >482  int opal_ifindextomtu(int if_index, int *if_mtu)
> >
> > -bash-4.2$ grep if_mtu  /usr/include/net/if.h
> > #define if_mtu  if_data.ifi_mtu\
>
> Bah.  Terrible.  Ok, thanks -- I'll fix...
>
> (see the tar ball I just sent you... should have this fix in it)
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Paul Hargrove
Jeff,

The Solaris 11 / x86_64 build gets farther than before, but fails with the
following:

make[2]: Entering directory
`/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic'
  CC   btl_usnic_module.lo
In file included from
/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_module.c:48:0:
/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_util.h:19:24:
error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘int’
make[2]: *** [btl_usnic_module.lo] Error 1
make[2]: Leaving directory
`/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory
`/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi'
make: *** [all-recursive] Error 1

It looks like gcc is choking on __always_inline.
I believe use of __opal_attribute_always_inline__ is the proper fix.
I've made that change and resumed the build... will report again upon
success or the next failure.
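
For context, here is a rough sketch of the portable-attribute pattern. DEMO_ALWAYS_INLINE is a hypothetical name standing in for __opal_attribute_always_inline__, whose real definition is not shown in this thread. The bare __always_inline spelling is presumably supplied by a system header on Linux and is simply not defined on Solaris, so the compiler sees an unknown identifier in front of the return type.

```c
#include <assert.h>

/* Expand to the GCC-style attribute where it is understood, and to
 * nothing elsewhere, so the declaration stays valid on every compiler. */
#if defined(__GNUC__)
#  define DEMO_ALWAYS_INLINE __attribute__((__always_inline__))
#else
#  define DEMO_ALWAYS_INLINE
#endif

static inline DEMO_ALWAYS_INLINE int demo_min_int(int a, int b)
{
    return (a < b) ? a : b;
}

/* Usage: the function behaves the same whether or not it is inlined. */
void demo_min_int_self_check(void)
{
    assert(demo_min_int(3, 7) == 3);
    assert(demo_min_int(-1, -5) == -5);
}
```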

I'm not sure why one is trying to build the usnic btl on Solaris at all.
Perhaps just because the OFED stack is present?

-Paul


On Thu, Dec 19, 2013 at 4:39 PM, Jeff Squyres (jsquyres)  wrote:

> Try http://www.open-mpi.org/~jsquyres/unofficial/.
>
> Should have both "if" fixes in it.
>
>
> On Dec 19, 2013, at 7:12 PM, Paul Hargrove  wrote:
>
> > Jeff,
> >
> > The patch looks fine to my eyes, but I cannot test it:
> >
> > 1) Not sure if email botched whitespace or what, but the patch didn't
> apply to if_posix.c.
> > 2) Even if it did, I don't have sufficiently new autoconf on that system
> to "use" the configure.m4 part of the patch.
> >
> > Any chance of a patched-and-autogen'ed tarball to test?
> >
> > -Paul
> >
> >
> > On Thu, Dec 19, 2013 at 4:04 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > Paul --
> >
> > Does this patch fix it for you?
> >
> > Index: opal/mca/if/posix_ipv4/configure.m4
> > ===
> > --- opal/mca/if/posix_ipv4/configure.m4 (revision 29997)
> > +++ opal/mca/if/posix_ipv4/configure.m4 (working copy)
> > @@ -42,8 +42,10 @@
> >   )
> >
> >  AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"],
> > -  [AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
> > +  [AC_CHECK_MEMBERS([struct ifreq.ifr_hwaddr], [], [],
> > [[#include ]])
> > +   AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
> > +   [[#include ]])
> >])
> >
> >  AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"], [$1], [$2]);
> > Index: opal/mca/if/posix_ipv4/if_posix.c
> > ===
> > --- opal/mca/if/posix_ipv4/if_posix.c   (revision 29997)
> > +++ opal/mca/if/posix_ipv4/if_posix.c   (working copy)
> > @@ -263,22 +263,22 @@
> >  /* generate CIDR and assign to netmask */
> >  intf->if_mask = prefix(((struct sockaddr_in*)
> &ifr->ifr_addr)->sin_addr.s_addr);
> >
> > -#ifdef SIOCGIFHWADDR
> > -/* get the MAC address */
> > -if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
> > -opal_output(0, "btl_usnic_opal_ifinit:
> ioctl(SIOCGIFHWADDR) failed with errno=%d", errno);
> > -break;
> > -}
> > -memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
> > +#ifdef SIOCGIFHWADDR && defined(HAVE_STRUCT_IFREQ_IFR_HWADDR)
> > +/* get the MAC address */
> > +if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
> > +opal_output(0, "opal_ifinit: ioctl(SIOCGIFHWADDR) failed
> with errno=%d", errno);
> > +break;
> > +}
> > +memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
> >  #endif
> >
> >  #if defined(SIOCGIFMTU) && defined(HAVE_STRUCT_IFREQ_IFR_MTU)
> > -/* get the MTU */
> > -if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
> > -opal_output(0, "btl_usnic_opal_ifinit:
> ioctl(SIOCGIFMTU) failed with errno=%d", errno);
> > -break;
> > -}
> > -intf->if_mtu = ifr->ifr_mtu;
> > +/* get the MTU */
> > +if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
> > +opal_output(0, "opal_ifinit: ioctl(SIOCGIFMTU) failed with
> errno=%d", errno);
> > +break;
> > +}
> > +intf->if_mtu = ifr->ifr_mtu;
> >  #endif
> >
> >  opal_list_append(&opal_if_list, &(intf->super));
> >
> >
> >
> >
> >
> > On Dec 19, 2013, at 6:51 PM, Paul Hargrove  wrote:
> >
> > > In 1.7.4rc1's README support is still claimed for Solaris 11 on x86_64
> with Sun Studio (12.2 and 12.3):
> > >   - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64),
> > > with Oracle Solaris Studio 12.2 and 12.3
> > >
> > > However, I get a build failure when configured with:
> > > CC=cc CFLAGS=-m64 --wi

[OMPI devel] 1.7.4rc1 run failure on Solaris 10 / SPARC (not SIGBUS)

2013-12-19 Thread Paul Hargrove
Testing with Solaris 10 on SPARC, I was expecting to encounter the bus
error reported previously by Siegmar Gross.  Instead I see the following
hwloc-related abort:

$ env
PATH=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin:$PATH
 
LD_LIBRARY_PATH_64=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/lib:$LD_LIBRARY_PATH_64
 OMPI_MCA_shmem_mmap_enable_nfs_warning=0
 
/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin/mpirun
-mca btl sm,self -np 2 examples/ring_c
--
Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:niagara1
  Application name:  examples/ring_c
  Error message: hwloc indicates cpu binding cannot be enforced
  Location:
 
/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/openmpi-1.7.4rc1/orte/mca/odls/default/odls_default_module.c:478
--
2 total processes failed to start


I am assuming I just need some magic pixie dust to disable cpu binding.
I'd appreciate some corresponding instructions.

However, if this is NOT an expected/desired/known behavior please let me
know what I can/should do to help determine the root cause.


-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Paul Hargrove
Jeff,

I didn't actually get very far after fixing __always_inline.
In fact, the build still fails on the *same* line, but for a different
(valid) reason:
fls() is declared in /usr/include/string.h

Making all in mca/btl/usnic
make[2]: Entering directory
`/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic'
  CC   btl_usnic_module.lo
In file included from
/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_module.c:48:0:
/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_util.h:19:45:
error: static declaration of ‘fls’ follows non-static declaration
/usr/include/string.h:87:12: note: previous declaration of ‘fls’ was here
make[2]: *** [btl_usnic_module.lo] Error 1

-Paul


On Thu, Dec 19, 2013 at 6:35 PM, Paul Hargrove  wrote:

> Jeff,
>
> Solaris 11 / x86_64 build gets farther than before, but fails with the
> following:
>
> make[2]: Entering directory
> `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic'
>   CC   btl_usnic_module.lo
> In file included from
> /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_module.c:48:0:
> /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_util.h:19:24:
> error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘int’
> make[2]: *** [btl_usnic_module.lo] Error 1
> make[2]: Leaving directory
> `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory
> `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi'
> make: *** [all-recursive] Error 1
>
> It looks like gcc is choking on __always_inline.
> I believe use of __opal_attribute_always_inline__ is the proper fix.
> I've made that change and resumed the build... will report again upon
> success or the next failure.
>
> I'm not sure why one is trying to build the usnic btl on Solaris at all.
> Perhaps just because the OFED stack is present?
>
> -Paul
>
>
> On Thu, Dec 19, 2013 at 4:39 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> Try http://www.open-mpi.org/~jsquyres/unofficial/.
>>
>> Should have both "if" fixes in it.
>>
>>
>> On Dec 19, 2013, at 7:12 PM, Paul Hargrove  wrote:
>>
>> > Jeff,
>> >
>> > The patch looks fine to my eyes, but I cannot test it:
>> >
>> > 1) Not sure if email botched whitespace or what, but the patch didn't
>> apply to if_posix.c.
>> > 2) Even if it did, I don't have sufficiently new autoconf on that
>> system to "use" the configure.m4 part of the patch.
>> >
>> > Any chance of a patched-and-autogen'ed tarball to test?
>> >
>> > -Paul
>> >
>> >
>> > On Thu, Dec 19, 2013 at 4:04 PM, Jeff Squyres (jsquyres) <
>> jsquy...@cisco.com> wrote:
>> > Paul --
>> >
>> > Does this patch fix it for you?
>> >
>> > Index: opal/mca/if/posix_ipv4/configure.m4
>> > ===
>> > --- opal/mca/if/posix_ipv4/configure.m4 (revision 29997)
>> > +++ opal/mca/if/posix_ipv4/configure.m4 (working copy)
>> > @@ -42,8 +42,10 @@
>> >   )
>> >
>> >  AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"],
>> > -  [AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
>> > +  [AC_CHECK_MEMBERS([struct ifreq.ifr_hwaddr], [], [],
>> > [[#include ]])
>> > +   AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
>> > +   [[#include ]])
>> >])
>> >
>> >  AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"], [$1], [$2]);
>> > Index: opal/mca/if/posix_ipv4/if_posix.c
>> > ===
>> > --- opal/mca/if/posix_ipv4/if_posix.c   (revision 29997)
>> > +++ opal/mca/if/posix_ipv4/if_posix.c   (working copy)
>> > @@ -263,22 +263,22 @@
>> >  /* generate CIDR and assign to netmask */
>> >  intf->if_mask = prefix(((struct sockaddr_in*)
>> &ifr->ifr_addr)->sin_addr.s_addr);
>> >
>> > -#ifdef SIOCGIFHWADDR
>> > -/* get the MAC address */
>> > -if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
>> > -opal_output(0, "btl_usnic_opal_ifinit:
>> ioctl(SIOCGIFHWADDR) failed with errno=%d", errno);
>> > -break;
>> > -}
>> > -memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
>> > +#ifdef SIOCGIFHWADDR && defined(HAVE_STRUCT_IFREQ_IFR_HWADDR)
>> > +/* get the MAC address */
>> > +if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
>> > +opal_output(0, "opal_ifinit: ioctl(SIOCGIFHWADDR) failed
>> with errno=%d", errno);
>> > +break;
>> > +}
>> > +memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
>> >  #endif
>> >
>> >  #if defined(SIOCGIFMTU) && defined(HAVE_STRUCT_IFREQ_IF

[OMPI devel] 1.7.4rc1 install failure: NetBSD-6 amd64

2013-12-19 Thread Paul Hargrove
Attached is the output from "make install" of 1.7.4rc1 + Jeff's fix for the
symbol conflict on "if_mtu".

There appear to be at least 3 issues.

1) There are lots of (not fatal) messages about ldconfig not existing, but
according to the NetBSD lists that utility went away with the conversion
from a.out to ELF.

2) Many warnings of the form
   *** Warning: linker path does not have real file for library

3) The final (fatal) error about .libs/mca_btl_sm.soT not existing.

I am going to see if I can get a better result using the system libtool
(which is 2.2.6b, thus OLDER than OMPI's 2.4.2) and will report back with
my results.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


install.log.bz2
Description: BZip2 compressed data


Re: [OMPI devel] 1.7.4rc1 build failure: OpenBSD-5 and NetBSD-6

2013-12-19 Thread Ralph Castain
I added protections for all the RLIMIT values, just in case. Thanks!
Ralph
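
A minimal sketch of the kind of protection described above, for illustration only: each RLIMIT_* constant is tested with #ifdef before use, so a platform that lacks one (as OpenBSD-5 lacks RLIMIT_AS in the report quoted below) still compiles. The function names are hypothetical and this is not the actual code in opal/util/sys_limits.c.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Hypothetical helper: returns 1 and fills *out when the platform
 * defines RLIMIT_AS and getrlimit() succeeds, 0 otherwise. */
static int query_address_space_limit(struct rlimit *out)
{
#ifdef RLIMIT_AS
    return getrlimit(RLIMIT_AS, out) == 0;
#else
    (void) out;   /* constant not available on this platform */
    return 0;
#endif
}

/* The same guard applied to RLIMIT_NOFILE, "just in case". */
static int query_open_files_limit(struct rlimit *out)
{
#ifdef RLIMIT_NOFILE
    return getrlimit(RLIMIT_NOFILE, out) == 0;
#else
    (void) out;
    return 0;
#endif
}
```

A caller checks the return value and treats 0 as "limit unknown" rather than as an error, which keeps the behavior the same on platforms with and without the constant.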

On Dec 19, 2013, at 6:25 PM, Paul Hargrove  wrote:

> Jeff,
> 
> The unofficial "rc2forpaul" gets past the (disgusting) if_mtu problem on both 
> platforms.
> 
> On NetBSD-6 the build completes ("make install" fails, but I'll report that 
> separately).
> 
> However, on OpenBSD-5 we now encounter another failure about 20 files later:
> 
>   CC   sys_limits.lo
> /home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:
>  In function 'opal_util_init_sys_limits':
> /home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:172:
>  error: 'RLIMIT_AS' undeclared (first use in this function)
> /home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:172:
>  error: (Each undeclared identifier is reported only once
> /home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:172:
>  error: for each function it appears in.)
> *** Error 1 in opal/util (Makefile:1692 'sys_limits.lo': @echo "  CC  " 
> sys_limits.lo;depbase=`echo sys_limits.lo | sed 's|[^/]*$|.deps/...)
> *** Error 1 in opal/util (Makefile:1780 'all-recursive')
> 
> The getrlimit manpage on this platform does not list RLIMIT_AS.
> Running "grep -rl RLIMIT_AS /usr/include" confirms that this constant does 
> not exist.
> So, I think "#ifdef RLIMIT_AS" is required.
> 
> -Paul
> 
> 
> On Thu, Dec 19, 2013 at 4:39 PM, Jeff Squyres (jsquyres)  
> wrote:
> On Dec 19, 2013, at 6:27 PM, Paul Hargrove  wrote:
> 
> > When building 1.7.4rc1 on OpenBSD-5 and NetBSD-6 (both amd64) I see what 
> > appears to be the same three errors ("make" output  at end of this email) 
> > on both platforms.
> >
> > All three syntax errors appear to be collisions on the symbol if_mtu:
> >
> > -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.h | grep -w -e 182
> >182  OPAL_DECLSPEC int opal_ifindextomtu(int if_index, int *if_mtu);
> > -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/mca/if/if.h | grep -w -e 98
> > 98  int if_mtu;
> > -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.c | grep -w -e 482
> >482  int opal_ifindextomtu(int if_index, int *if_mtu)
> >
> > -bash-4.2$ grep if_mtu  /usr/include/net/if.h
> > #define if_mtu  if_data.ifi_mtu\
> 
> Bah.  Terrible.  Ok, thanks -- I'll fix...
> 
> (see the tar ball I just sent you... should have this fix in it)
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] 1.7.4rc1 run failure on Solaris 10 / SPARC (not SIGBUS)

2013-12-19 Thread Ralph Castain
I believe this one has already been fixed and is in the nightly (1.7.4rc2) - 
for now, you can just set "--bind-to none" on the cmd line to get past it


On Dec 19, 2013, at 6:42 PM, Paul Hargrove  wrote:

> Testing with Solaris 10 on SPARC, I was expecting to encounter the bus error 
> reported previously by Siegmar Gross.  Instead I see the following 
> hwloc-related abort:
> 
> $ env   
> PATH=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin:$PATH
>   
> LD_LIBRARY_PATH_64=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/lib:$LD_LIBRARY_PATH_64
>   OMPI_MCA_shmem_mmap_enable_nfs_warning=0  
> /home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin/mpirun
>  -mca btl sm,self -np 2 examples/ring_c
> --
> Open MPI tried to bind a new process, but something went wrong.  The
> process was killed without launching the target application.  Your job
> will now abort.
> 
>   Local host:niagara1
>   Application name:  examples/ring_c
>   Error message: hwloc indicates cpu binding cannot be enforced
>   Location:  
> /home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/openmpi-1.7.4rc1/orte/mca/odls/default/odls_default_module.c:478
> --
> 2 total processes failed to start
> 
> 
> I am assuming I just need some magic pixie dust to disable cpu binding.
> I'd appreciate some corresponding instructions.
> 
> However, if this is NOT an expected/desired/known behavior please let me know 
> what I can/should do to help determine the root cause.
> 
> 
> -Paul 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] 1.7.4rc1 run failure on Solaris 10 / SPARC (not SIGBUS)

2013-12-19 Thread Paul Hargrove
Ralph,

I can confirm "--bind-to none" worked to eliminate the error, but the test
now appears to hang :-(

Since you say the binding problem is probably fixed for rc2, I'll see if the
latest nightly tarball works better by default.

-Paul


On Thu, Dec 19, 2013 at 7:19 PM, Ralph Castain  wrote:

> I believe this one has already been fixed and is in the nightly (1.7.4rc2)
> - for now, you can just set "--bind-to none" on the cmd line to get past it
>
>
> On Dec 19, 2013, at 6:42 PM, Paul Hargrove  wrote:
>
> Testing with Solaris 10 on SPARC, I was expecting to encounter the bus
> error reported previously by Siegmar Gross.  Instead I see the following
> hwloc-related abort:
>
> $ env
> PATH=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin:$PATH
>  
> LD_LIBRARY_PATH_64=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/lib:$LD_LIBRARY_PATH_64
>  OMPI_MCA_shmem_mmap_enable_nfs_warning=0
>  
> /home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin/mpirun
> -mca btl sm,self -np 2 examples/ring_c
> --
> Open MPI tried to bind a new process, but something went wrong.  The
> process was killed without launching the target application.  Your job
> will now abort.
>
>   Local host:niagara1
>   Application name:  examples/ring_c
>   Error message: hwloc indicates cpu binding cannot be enforced
>   Location:
>  
> /home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/openmpi-1.7.4rc1/orte/mca/odls/default/odls_default_module.c:478
> --
> 2 total processes failed to start
>
>
> I am assuming I just need some magic pixie dust to disable cpu binding.
> I'd appreciate some corresponding instructions.
>
> However, if this is NOT an expected/desired/known behavior please let me
> know what I can/should do to help determine the root cause.
>
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] 1.7.4rc1 autogen error: NetBSD-6

2013-12-19 Thread Paul Hargrove
Probably nobody cares, but I'll report this for completeness.
In trying to understand the "make install" failure on NetBSD-6 I ran
"autogen.sh".

The versions detected:

   Searching for autoconf
 Found autoconf version 2.69; checking version...
   Found version component 2 -- need 2
   Found version component 69 -- need 65
 ==> ACCEPTED
   Searching for libtoolize
 Found libtoolize version 2.2.6b; checking version...
   Found version component 2 -- need 2
   Found version component 2 -- need 2
   Found version component 6b -- need 6b
 ==> ACCEPTED
   Searching for automake
 Found automake version 1.13.1; checking version...
   Found version component 1 -- need 1
   Found version component 13 -- need 12
 ==> ACCEPTED

The problem is that when run, the generated configure script dies as
follows:

*** Java compiler
configure: WARNING: Found configure shell variable clash!
configure: WARNING: OPAL_VAR_SCOPE_PUSH called on "dir",
configure: WARNING: but it is already defined with value "/bin"
configure: WARNING: This usually indicates an error in configure.
configure: error: Cannot continue


-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3

2013-12-19 Thread Paul Hargrove
FYI:

My Solaris-11/x86-64/gcc-4.5.2 build completes with the following three
changes:
+ Jeff's fix for if_posix.c
+ changing __always_inline to __opal_attribute_always_inline__
+ fixing the fls() conflict by renaming OMPI's to "my_fls()" (just a lazy
choice).
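
To make the third change concrete: the "static declaration of 'fls' follows non-static declaration" error arises when a system header has already declared fls() with external linkage and a project header then defines a static function with the same name. A minimal sketch of the rename follows, assuming the usual fls() semantics (1-based index of the most significant set bit, 0 for an input of 0). The body is illustrative and is not the code from btl_usnic_util.h.

```c
#include <string.h>   /* per the error above, Solaris 11 declares fls() here */

/* Prefixed name so the definition cannot clash with a system fls().
 * Returns the 1-based position of the highest set bit, or 0 when
 * no bit is set. */
static inline int my_fls(int x)
{
    unsigned int u = (unsigned int) x;  /* avoid shifting a negative value */
    int pos = 0;

    while (u != 0) {
        ++pos;
        u >>= 1;
    }
    return pos;
}
```

For example, my_fls(1) yields 1 and my_fls(0x10) yields 5. A project-specific prefix would be a less lazy choice than "my_", but the mechanism is the same.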

-Paul


On Thu, Dec 19, 2013 at 6:47 PM, Paul Hargrove  wrote:

> Jeff,
>
> I didn't actually get very far after fixing __always_inline.
> In fact, the build still fails on the *same* line, but for a different
> (valid) reason:
> fls() is declared in /usr/include/string.h
>
> Making all in mca/btl/usnic
> make[2]: Entering directory
> `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic'
>   CC   btl_usnic_module.lo
> In file included from
> /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_module.c:48:0:
> /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_util.h:19:45:
> error: static declaration of ‘fls’ follows non-static declaration
> /usr/include/string.h:87:12: note: previous declaration of ‘fls’ was here
> make[2]: *** [btl_usnic_module.lo] Error 1
>
> -Paul
>
>
> On Thu, Dec 19, 2013 at 6:35 PM, Paul Hargrove  wrote:
>
>> Jeff,
>>
>> Solaris 11 / x86_64 build gets farther than before, but fails with the
>> following:
>>
>> make[2]: Entering directory
>> `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic'
>>   CC   btl_usnic_module.lo
>> In file included from
>> /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_module.c:48:0:
>> /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_util.h:19:24:
>> error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘int’
>> make[2]: *** [btl_usnic_module.lo] Error 1
>> make[2]: Leaving directory
>> `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory
>> `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi'
>> make: *** [all-recursive] Error 1
>>
>> It looks like gcc is choking on __always_inline.
>> I believe use of __opal_attribute_always_inline__ is the proper fix.
>> I've made that change and resumed the build... will report again upon
>> success or the next failure.
>>
>> I'm not sure why one is trying to build the usnic btl on Solaris at all.
>> Perhaps just because the OFED stack is present?
>>
>> -Paul
>>
>>
>> On Thu, Dec 19, 2013 at 4:39 PM, Jeff Squyres (jsquyres) <
>> jsquy...@cisco.com> wrote:
>>
>>> Try http://www.open-mpi.org/~jsquyres/unofficial/.
>>>
>>> Should have both "if" fixes in it.
>>>
>>>
>>> On Dec 19, 2013, at 7:12 PM, Paul Hargrove  wrote:
>>>
>>> > Jeff,
>>> >
>>> > The patch looks fine to my eyes, but I cannot test it:
>>> >
>>> > 1) Not sure if email botched whitespace or what, but the patch didn't
>>> apply to if_posix.c.
>>> > 2) Even if it did, I don't have sufficiently new autoconf on that
>>> system to "use" the configure.m4 part of the patch.
>>> >
>>> > Any chance of a patched-and-autogen'ed tarball to test?
>>> >
>>> > -Paul
>>> >
>>> >
>>> > On Thu, Dec 19, 2013 at 4:04 PM, Jeff Squyres (jsquyres) <
>>> jsquy...@cisco.com> wrote:
>>> > Paul --
>>> >
>>> > Does this patch fix it for you?
>>> >
>>> > Index: opal/mca/if/posix_ipv4/configure.m4
>>> > ===
>>> > --- opal/mca/if/posix_ipv4/configure.m4 (revision 29997)
>>> > +++ opal/mca/if/posix_ipv4/configure.m4 (working copy)
>>> > @@ -42,8 +42,10 @@
>>> >   )
>>> >
>>> >  AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"],
>>> > -  [AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
>>> > +  [AC_CHECK_MEMBERS([struct ifreq.ifr_hwaddr], [], [],
>>> > [[#include ]])
>>> > +   AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
>>> > +   [[#include ]])
>>> >])
>>> >
>>> >  AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"], [$1], [$2]);
>>> > Index: opal/mca/if/posix_ipv4/if_posix.c
>>> > ===
>>> > --- opal/mca/if/posix_ipv4/if_posix.c   (revision 29997)
>>> > +++ opal/mca/if/posix_ipv4/if_posix.c   (working copy)
>>> > @@ -263,22 +263,22 @@
>>> >  /* generate CIDR and assign to netmask */
>>> >  intf->if_mask = prefix(((struct sockaddr_in*)
>>> &ifr->ifr_addr)->sin_addr.s_addr);
>>> >
>>> > -#ifdef SIOCGIFHWADDR
>>> > -/* get the MAC address */
>>> > -if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
>>> > -opal_output(0, "btl_usnic_opal_ifinit:
>>> ioctl(SIOCGIFHWADDR) failed with errno=%d", errno);
>>> > -break;
>>> > -}
>>> > -memcpy(intf->if_mac, ifr->ifr_hwaddr.sa