Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris
Hi,

first of all, thank you very much for your help.

1st patch:

> Can you apply the following patch to a trunk tarball and see if it works
> for you?

2nd patch:

> Found the problem. Was accessing a boolean variable using intval. That
> is a bug that has gone unnoticed on all platforms but thankfully Solaris
> caught it.
>
> Please try the attached patch.

I applied both patches manually to openmpi-1.9a1r29972, because my patch program couldn't use the patches. Unfortunately I still get a Bus Error. Hopefully I didn't make a mistake applying your patches. Therefore I show you a "diff" for my files. By the way, I tried to apply your patches with "patch -b -i ". Is it necessary to use a different command?

tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
-rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
-rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
1685,1689c1685
mbv_type) {
mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
<} else {
mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
<}
---
> ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
tyr openmpi-1.9a1r29972 163

tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
-rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
-rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
267,271c267,268
< struct sockaddr_in inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
< memcpy(&inaddr1, addr1, sizeof(inaddr1));
< memcpy(&inaddr2, addr2, sizeof(inaddr2));
---
> const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
> const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
274,275c271,272
< if((inaddr1.sin_addr.s_addr & netmask) ==
< (inaddr2.sin_addr.s_addr & netmask)) {
---
> if((inaddr1->sin_addr.s_addr & netmask) ==
> (inaddr2->sin_addr.s_addr & netmask)) {
284,290c281,284
< struct sockaddr_in6 inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't
< memcpy(&inaddr1, addr1, sizeof(inaddr1));
< memcpy(&inaddr2, addr2, sizeof(inaddr2));
< struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
< struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
---
> const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
> const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
> struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
> struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
tyr openmpi-1.9a1r29972 167

Now my debug information.

tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading ompi_info
Reading ld.so.1
Reading libmpi.so.0.0.0
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run -a
Running: ompi_info -a
(process id 10998)
Reading libc_psr.so.1
...
MCA compress: parameter "compress_base_verbose" (current value: "-1", data source: default, level: 8 dev/detail, type: int)
Verbosity level for the compress framework (0 = no verbosity)
t@1 (l@1) signal BUS (invalid address alignment) in var_value_string at line 1680 in file "mca_base_var.c"
1680 ret = asprintf (value_string, var_type_formats[var->mbv_type], value[0]);
(dbx)
(dbx)
(dbx) check -all
dbx: warning: check -all will be turned on in the next run of the process
access checking - OFF
memuse checking - OFF
(dbx) run -a
Running: ompi_info -a
(process id 11000)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1: Attempting to read 4 bytes at address 0x7fffd5f8 which
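
The patched net.c in this message uses temporary variables and memcpy's instead of casting the incoming address pointers. A minimal sketch of that pattern follows, assuming the goal is to avoid reading through a pointer that may not be suitably aligned (the kind of access that is reported as "invalid address alignment" on SPARC). The helper name same_ipv4_network and the prefix-length parameter are hypothetical and are not taken from Open MPI.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not Open MPI's function): report whether two IPv4
 * socket addresses fall into the same network for a given prefix length.
 * The addresses are passed as untyped pointers because the caller's
 * storage may not be aligned for struct sockaddr_in. */
static bool same_ipv4_network(const void *addr1, const void *addr2,
                              uint32_t prefixlen)
{
    struct sockaddr_in in1, in2;
    uint32_t netmask;

    assert(prefixlen <= 32);

    /* Copy into properly aligned temporaries instead of casting and
     * dereferencing the incoming pointers directly. */
    memcpy(&in1, addr1, sizeof(in1));
    memcpy(&in2, addr2, sizeof(in2));

    /* Build the mask in host order, then convert to network order so it
     * matches the byte order of s_addr. A prefix of 0 matches everything. */
    netmask = (0 == prefixlen)
                  ? 0
                  : htonl((uint32_t)0xFFFFFFFFu << (32u - prefixlen));

    return (in1.sin_addr.s_addr & netmask) == (in2.sin_addr.s_addr & netmask);
}
```

The copies cost a few bytes of stack per call, which is usually negligible next to a crash on platforms that trap on misaligned loads.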
[OMPI devel] Consequence of bind-to-core by default
I notice Absoft's MTT runs are failing due to the change in bind-to-core-by-default:

    http://mtt.open-mpi.org/index.php?do_redir=2136

I asked Tony, who runs the Absoft MTT runs; he confirms that this particular machine has 1 socket with 2 cores (and we're running -np 4 on this machine).

1. This is an unintended consequence of the bind-to-core-by-default policy: we fail with "oversubscribed!" when running on a single machine for test runs like this. Do we like this? See #3, below, for more on this.

2. Also, the error message that is displayed says:

-
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node: ltljoe3
   #processes: 2
   #cpus: 1
-

Which is odd, because the command line is "mpirun -np 4 --mca btl sm,tcp,self ./c_hello". Any idea what's happening here?

3. Finally, we're giving a warning saying:

-
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.
-

For both #1 and #3, I wonder if we shouldn't be warning if no binding was explicitly stated (i.e., we're just using the defaults). Specifically, if no binding is specified:

- if we oversubscribe, (possibly) warn about the performance loss of oversubscription, and don't bind
- don't warn about lack of memory binding

Thoughts?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
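
The policy proposed at the end of this message can be written down as a small decision function. This is only a sketch of the proposal as worded above, not the mpirun implementation: every name in it (bind_policy_t, bind_decision_t, decide_binding) is hypothetical, and treating an explicit request on an oversubscribed node as an error simply mirrors the failure described in #1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration only; not Open MPI's data structures. */
typedef enum { BIND_NONE, BIND_CORE } bind_policy_t;

typedef struct {
    bind_policy_t policy;      /* what the launcher would actually do */
    bool warn_oversubscribed;  /* print a performance warning */
    bool error_oversubscribed; /* refuse to run */
    bool warn_no_membind;      /* print the memory-binding warning */
} bind_decision_t;

/* user_requested: true if a binding option was given explicitly. */
static bind_decision_t decide_binding(bool user_requested, int nprocs,
                                      int ncpus, bool membind_supported)
{
    bind_decision_t d = { BIND_CORE, false, false, false };
    bool oversubscribed;

    assert(nprocs > 0 && ncpus > 0);
    oversubscribed = nprocs > ncpus;

    if (user_requested) {
        /* The user asked for binding, so problems are worth reporting. */
        d.error_oversubscribed = oversubscribed;
        d.warn_no_membind = !membind_supported;
    } else if (oversubscribed) {
        /* Defaults only: (possibly) warn, and fall back to not binding. */
        d.policy = BIND_NONE;
        d.warn_oversubscribed = true;
    }
    /* Defaults only: never warn about missing memory binding. */
    return d;
}
```

With the Absoft case from above (defaults, 4 processes, 2 cores), this sketch yields no binding and at most a performance warning instead of an abort.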
Re: [OMPI devel] Consequence of bind-to-core by default
On 19 Dec 2013, at 13:59, Jeff Squyres (jsquyres) wrote:

>
> - if we oversubscribe, (possibly) warn about the performance loss of
> oversubscription, and don't bind
> - don't warn about lack of memory binding
>
> Thoughts?

+1, I hit this myself today. I typically run on a VM and oversubscribe the cores. Until the last update this worked fine, but now I get two error messages when trying this. I can’t “modify” the binding options used because I don’t know what they are (i.e. I didn’t give any), and even when not oversubscribing there is a warning at startup that I neither understand nor can seemingly disable.

My thoughts would be:

Oversubscription is normally bad, so by all means issue a warning and/or abort; however, make the message meaningful and offer the user an --allow-oversubscription flag.

Jobs running on VMs shouldn’t give warnings to the user.

Finally, the whitespace alignment of the message is a little odd: it looks like it’s supposed to be a table or two columns, but the indentation is all over the place.

Ashley.
Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default
On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)" wrote:

>3. Finally, we're giving a warning saying:
>
>-
>WARNING: a request was made to bind a process. While the system
>supports binding the process itself, at least one node does NOT
>support binding memory to the process location.
>-
>
>For both #1 and #3, I wonder if we shouldn't be warning if no binding was
>explicitly stated (i.e., we're just using the defaults). Specifically,
>if no binding is specified:
>
>- if we oversubscribe, (possibly) warn about the performance loss of
>oversubscription, and don't bind
>- don't warn about lack of memory binding

We have a couple machines where memory binding is failing for one reason or another. If we're binding by default, we really shouldn't throw error messages about not being able to bind memory. It's REALLY annoying.

Brian

-- 
Brian W. Barrett
Scalable System Software Group
Sandia National Laboratories
[OMPI devel] Speedup for MPI_Dims_create()
Dear all,

please find attached a (trivial) patch to MPI_Dims_create(). When computing the prime factors of nnodes, it is sufficient to check for primes less than or equal to sqrt(nnodes).

This was not so much of a problem in the past, but now that Tier 0 systems are capable of running O(10^6) MPI processes, the difference in execution time is on the order of seconds (e.g. 8.86s vs. 0.04s on my notebook, with nnproc = 10^6).

Best
-Andreas

PS: oh, and the patch removes some trailing whitespace. Yuck. :-)

-- 
==
Andreas Schäfer
HPC and Grid Computing
Chair of Computer Science 3
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-27910
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!

Index: ompi/mpi/c/dims_create.c
===
--- ompi/mpi/c/dims_create.c (revision 29976)
+++ ompi/mpi/c/dims_create.c (working copy)
@@ -5,19 +5,23 @@
  * Copyright (c) 2004-2005 The University of Tennessee and The University
  * of Tennessee Research Foundation. All rights
  * reserved.
- * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
+ * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
  * University of Stuttgart. All rights reserved.
  * Copyright (c) 2004-2005 The Regents of the University of California.
  * All rights reserved.
  * Copyright (c) 2012 Los Alamos National Security, LLC. All rights
- * reserved.
+ * reserved.
+ * Copyright (c) 2013 Friedrich-Alexander-Universitaet
+ * Erlangen-Nuernberg. All rights reserved.
  * $COPYRIGHT$
- *
+ *
  * Additional copyrights may follow
- *
+ *
  * $HEADER$
  */
 
+#include
+
 #include "ompi_config.h"
 
 #include "ompi/mpi/c/bindings.h"
@@ -44,8 +48,8 @@
 /*
  * This is a utility function, no need to have anything in the lower
  * layer for this at all
- */
-int MPI_Dims_create(int nnodes, int ndims, int dims[])
+ */
+int MPI_Dims_create(int nnodes, int ndims, int dims[])
 {
     int i;
     int freeprocs;
@@ -66,9 +70,9 @@
             return OMPI_ERRHANDLER_INVOKE (MPI_COMM_WORLD,
                                            MPI_ERR_ARG, FUNC_NAME);
         }
-
+
         if (1 > ndims) {
-            return OMPI_ERRHANDLER_INVOKE (MPI_COMM_WORLD,
+            return OMPI_ERRHANDLER_INVOKE (MPI_COMM_WORLD,
                                            MPI_ERR_DIMS, FUNC_NAME);
         }
     }
@@ -109,11 +113,11 @@
     }
 
     /* Compute the relevant prime numbers for factoring */
-    if (MPI_SUCCESS != (err = getprimes(freeprocs, &nprimes, &primes))) {
+    if (MPI_SUCCESS != (err = getprimes(sqrt(freeprocs), &nprimes, &primes))) {
         return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, err,
                                       FUNC_NAME);
     }
-
+
     /* Factor the number of free processes */
     if (MPI_SUCCESS != (err = getfactors(freeprocs, nprimes, primes, &factors))) {
         return OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, err,
@@ -166,7 +170,7 @@
     int f;
     int *p;
     int *pmin;
-
+
     if (0 >= ndim) {
         return MPI_ERR_DIMS;
     }
@@ -181,7 +185,7 @@
     for (i = 0, p = bins; i < ndim; ++i, ++p) {
         *p = 1;
     }
-
+
     /* Loop assigning factors from the highest to the lowest */
     for (j = nfactor - 1; j >= 0; --j) {
         f = pfacts[j];
@@ -196,7 +200,7 @@
             *pmin *= f;
         }
     }
-
+
     /* Sort dimensions in decreasing order (O(n^2) for now) */
     for (i = 0, pmin = bins; i < ndim - 1; ++i, ++pmin) {
         for (j = i + 1, p = pmin + 1; j < ndim; ++j, ++p) {
@@ -228,7 +232,7 @@
     int i;
     int *p;
     int *c;
-
+
     if (0 >= nprime) {
         return MPI_ERR_INTERN;
     }
@@ -309,4 +313,3 @@
     *pnprime = i;
     return MPI_SUCCESS;
 }
-

signature.asc
Description: Digital signature
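
The observation behind the patch (only candidates up to sqrt(nnodes) need to be tried) can be illustrated with plain trial division. The sketch below is a standalone, hypothetical helper and not Open MPI's getprimes()/getfactors(). It also assumes one detail the message does not spell out: after dividing out all factors up to the square root, whatever remains (if larger than 1) is itself a single prime factor and must still be recorded, e.g. the 7 in 14 = 2 * 7.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical standalone helper: write the prime factors of n (ascending,
 * with multiplicity) into out[] and return how many were written.
 * max_out is the capacity of out[]; 31 entries suffice for a 32-bit int. */
static size_t prime_factors(int n, int *out, size_t max_out)
{
    size_t count = 0;

    assert(n >= 1);

    /* Only candidates with p * p <= n need to be tried. The bound shrinks
     * as factors are divided out of n. */
    for (int p = 2; (long long)p * p <= n; ++p) {
        while (0 == n % p) {
            assert(count < max_out);
            out[count++] = p;
            n /= p;
        }
    }

    /* Anything left over is a single prime larger than the square root. */
    if (n > 1) {
        assert(count < max_out);
        out[count++] = n;
    }
    return count;
}
```

For nnodes = 10^6 the loop stops after p = 5, because 10^6 = 2^6 * 5^6 and the remaining cofactor is 1 by then.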
Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter
Someone who understands the mpi debugging handles code:

The opal_progress_recursion_depth_counter and opal_progress_thread_counter are both only used internally in opal_progress (for book keeping, but never any decisions) and are declared in ompi_mpihandles_dll.c, but then don't appear to be used. Is there a disadvantage to:

1) removing them from mpihandles_dll.c

or, if that breaks ABI,

2) Leaving them, but not doing the bookkeeping?

It's fairly heavyweight bookkeeping, so I agree with Nathan, I'd like to remove it. But I'd like to remove it pre-1.7.4. Which means today.

Brian

On 12/18/13 4:40 PM, "Nathan Hjelm" wrote:

>Opps, yeah. Meant 1.7.5. If people agree with this change I could
>possibly slip it in before Friday but that is unlikely.
>
>On Wed, Dec 18, 2013 at 03:32:36PM -0800, Ralph Castain wrote:
>> U1.7.4 is leaving the station on Fri, Nathan, so next Tues =>
>> will have to go into 1.7.5
>>
>>
>> On Dec 18, 2013, at 3:23 PM, Nathan Hjelm wrote:
>>
>> > What: Remove the opal_progress_recursion_depth_counter from
>> > opal_progress.
>> >
>> > Why: This counter adds two atomic adds to the critical path when
>> > OPAL_HAVE_THREADS is set (which is the case for most builds). I grepped
>> > through ompi, orte, and opal to find where this value was being used and
>> > did not find anything either inside or outside opal_progress.
>> >
>> > When: I want this change to go into 1.7.4 (if possible) so setting a
>> > quick timeout for next Tuesday.
>> >
>> > Let me know if there is a good reason to keep this counter and it will
>> > be spared.
>> >
>> > -Nathan Hjelm
>> > HPC-5, LANL
>> > ___
>> > devel mailing list
>> > de...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Brian W. Barrett
Scalable System Software Group
Sandia National Laboratories
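
To make the cost under discussion concrete, here is a rough sketch of what "two atomic adds on the critical path" looks like, written with C11 atomics. It is not the opal_progress source: the counter, poll_events() and both progress functions are hypothetical stand-ins, and the real code may use different atomic primitives.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the recursion depth counter. */
static atomic_int recursion_depth;

/* Hypothetical stand-in for the real work done by a progress call;
 * it pretends that one event completed. */
static int poll_events(void)
{
    return 1;
}

/* With bookkeeping: one atomic add on entry and one on exit, paid on every
 * call even though nothing ever reads the counter to make a decision. */
static int progress_with_counter(void)
{
    int events;
    int previous;

    atomic_fetch_add(&recursion_depth, 1);
    events = poll_events();
    previous = atomic_fetch_sub(&recursion_depth, 1);
    assert(previous >= 1); /* depth was positive before leaving */
    return events;
}

/* Without bookkeeping: same result, no atomic operations. */
static int progress_without_counter(void)
{
    return poll_events();
}
```

Both variants return the same value, which is the point of the RFC: if no caller consumes the counter, dropping it changes nothing observable except the per-call cost.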
Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default
On Dec 19, 2013, at 6:27 AM, Barrett, Brian W wrote:

> On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)" wrote:
>
>> 3. Finally, we're giving a warning saying:
>>
>> -
>> WARNING: a request was made to bind a process. While the system
>> supports binding the process itself, at least one node does NOT
>> support binding memory to the process location.
>> -
>>
>> For both #1 and #3, I wonder if we shouldn't be warning if no binding was
>> explicitly stated (i.e., we're just using the defaults). Specifically,
>> if no binding is specified:
>>
>> - if we oversubscribe, (possibly) warn about the performance loss of
>> oversubscription, and don't bind
>> - don't warn about lack of memory binding
>
> We have a couple machines where memory binding is failing for one reason
> or another. If we're binding by default, we really shouldn't throw error
> messages about not being able to bind memory. It's REALLY annoying.

Just to help me understand a bit better - you are saying that the node supports process binding, but not memory binding? I don't see how the error appears otherwise, but want to ensure I understand the code path.

>
> Brian
>
> --
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default
On 12/19/13 8:43 AM, "Ralph Castain" wrote:

>
>On Dec 19, 2013, at 6:27 AM, Barrett, Brian W wrote:
>
>> On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)" wrote:
>>
>>> 3. Finally, we're giving a warning saying:
>>>
>>> -
>>> WARNING: a request was made to bind a process. While the system
>>> supports binding the process itself, at least one node does NOT
>>> support binding memory to the process location.
>>> -
>>>
>>> For both #1 and #3, I wonder if we shouldn't be warning if no binding was
>>> explicitly stated (i.e., we're just using the defaults). Specifically,
>>> if no binding is specified:
>>>
>>> - if we oversubscribe, (possibly) warn about the performance loss of
>>> oversubscription, and don't bind
>>> - don't warn about lack of memory binding
>>
>> We have a couple machines where memory binding is failing for one reason
>> or another. If we're binding by default, we really shouldn't throw error
>> messages about not being able to bind memory. It's REALLY annoying.
>
>Just to help me understand a bit better - you are saying that the node
>supports process binding, but not memory binding? I don't see how the
>error appears otherwise, but want to ensure I understand the code path.

That appears to be the case, yes.

Brian

-- 
Brian W. Barrett
Scalable System Software Group
Sandia National Laboratories
Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris
Siegmar -- So it looks like the net problem is fixed; good. I'll commit and CMR that. For the DDT test, can you give us access to this machine? It might help speed debugging a lot. (I'll let Nathan reply about the var problem) If not, can you provide the following information about the DDT test: 1. It SIGBUS's at a point; can you send the full backtrace? 2. It complains about a misaligned read of a variable and shows its address. Can you print the values of all the parameters of the function so that we can see *which* one it is using for the misaligned read? (the printf is using 4 different variables, and we don't know which one is causing the misaligned read) On Dec 19, 2013, at 8:52 AM, Siegmar Gross wrote: > Hi, > > at first thank you very much for your help. > > 1st patch: > >> Can you apply the following patch to a trunk tarball and see if it works >> for you? > > 2nd patch: > >> Found the problem. Was accessing a boolean variable using intval. That >> is a bug that has gone unnoticed on all platforms but thankfully Solaris >> caught it. >> >> Please try the attached patch. > > > I applied both patches manually to openmpi-1.9a1r29972, because > my patch program couldn't use the patches. Unfortunately I still > get a Bus Error. Hopefully I didn't make a mistake applying your > patches. Therefore I show you a "diff" for my files. By the way, > I tried to apply your patches with "patch -b -i ". > Is it necessary to use a different command? 
> > > tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c* > -rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c > -rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig > tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c* > 1685,1689c1685 >mbv_type) { > var->mbv_enumerator->string_from_value(var->mbv_enumerator, > value->boolval, &tmp); > <} else { > var->mbv_enumerator->string_from_value(var->mbv_enumerator, > value->intval, &tmp); > <} > --- >>ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, > value->intval, &tmp); > tyr openmpi-1.9a1r29972 163 > > > > tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c* > -rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c > -rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig > tyr openmpi-1.9a1r29972 166 diff opal/util/net.c* > 267,271c267,268 > < struct sockaddr_in inaddr1, inaddr2; > < /* Use temporary variables and memcpy's so that we don't > < memcpy(&inaddr1, addr1, sizeof(inaddr1)); > < memcpy(&inaddr2, addr2, sizeof(inaddr2)); > --- >>const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1; >>const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2; > 274,275c271,272 > < if((inaddr1.sin_addr.s_addr & netmask) == > <(inaddr2.sin_addr.s_addr & netmask)) { > --- >>if((inaddr1->sin_addr.s_addr & netmask) == >> (inaddr2->sin_addr.s_addr & netmask)) { > 284,290c281,284 > < struct sockaddr_in6 inaddr1, inaddr2; > < /* Use temporary variables and memcpy's so that we don't > < memcpy(&inaddr1, addr1, sizeof(inaddr1)); > < memcpy(&inaddr2, addr2, sizeof(inaddr2)); > < struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr; > < struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr; > --- >>const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1; >>const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2; >>struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr; >>struct 
in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr; > tyr openmpi-1.9a1r29972 167 > > > > Now my debug information. > > tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/ > tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info > For information about new features see `help changes' > To remove this message, put `dbxenv suppress_startup_message 7.9' in your > .dbxrc > Reading ompi_info > Reading ld.so.1 > Reading libmpi.so.0.0.0 > Reading libopen-rte.so.0.0.0 > Reading libopen-pal.so.0.0.0 > Reading libsendfile.so.1 > Reading libpicl.so.1 > Reading libkstat.so.1 > Reading liblgrp.so.1 > Reading libsocket.so.1 > Reading libnsl.so.1 > Reading librt.so.1 > Reading libm.so.2 > Reading libthread.so.1 > Reading libc.so.1 > Reading libdoor.so.1 > Reading libaio.so.1 > Reading libmd.so.1 > (dbx) run -a > Running: ompi_info -a > (process id 10998) > Reading libc_psr.so.1 > ... >MCA compress: parameter "compress_base_verbose" (current value: > "-1", data source: default, level: 8 dev/detail
Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default
Okay, I think I have these things fixed in r29978 on the trunk - please give it a spin and confirm so we can move it to 1.7.4

On Dec 19, 2013, at 7:54 AM, Barrett, Brian W wrote:

> On 12/19/13 8:43 AM, "Ralph Castain" wrote:
>
>>
>> On Dec 19, 2013, at 6:27 AM, Barrett, Brian W wrote:
>>
>>> On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)" wrote:
>>>
>>>> 3. Finally, we're giving a warning saying:
>>>>
>>>> -
>>>> WARNING: a request was made to bind a process. While the system
>>>> supports binding the process itself, at least one node does NOT
>>>> support binding memory to the process location.
>>>> -
>>>>
>>>> For both #1 and #3, I wonder if we shouldn't be warning if no binding was
>>>> explicitly stated (i.e., we're just using the defaults). Specifically,
>>>> if no binding is specified:
>>>>
>>>> - if we oversubscribe, (possibly) warn about the performance loss of
>>>> oversubscription, and don't bind
>>>> - don't warn about lack of memory binding
>>>
>>> We have a couple machines where memory binding is failing for one reason
>>> or another. If we're binding by default, we really shouldn't throw error
>>> messages about not being able to bind memory. It's REALLY annoying.
>>
>> Just to help me understand a bit better - you are saying that the node
>> supports process binding, but not memory binding? I don't see how the
>> error appears otherwise, but want to ensure I understand the code path.
>
> That appears to be the case, yes.
>
> Brian
>
> --
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter
I think there's no problem with removing them from the dll code -- that stuff doesn't affect MPI application ABI.

On Dec 19, 2013, at 9:42 AM, Barrett, Brian W wrote:

> Someone who understands the mpi debugging handles code:
>
> The opal_progress_recursion_depth_counter and opal_progress_thread_counter
> are both only used internally in opal_progress (for book keeping, but
> never any decisions) and are declared in ompi_mpihandles_dll.c, but then
> don't appear to be used. Is there a disadvantage to:
>
> 1) removing them from mpihandles_dll.c
>
> or, if that breaks ABI,
>
> 2) Leaving them, but not doing the bookkeeping?
>
> It's fairly heavyweight bookkeeping, so I agree with Nathan, I'd like to
> remove it. But I'd like to remove it pre-1.7.4. Which means today.
>
> Brian
>
>
> On 12/18/13 4:40 PM, "Nathan Hjelm" wrote:
>
>> Opps, yeah. Meant 1.7.5. If people agree with this change I could
>> possibly slip it in before Friday but that is unlikely.
>>
>> On Wed, Dec 18, 2013 at 03:32:36PM -0800, Ralph Castain wrote:
>>> U1.7.4 is leaving the station on Fri, Nathan, so next Tues =>
>>> will have to go into 1.7.5
>>>
>>>
>>> On Dec 18, 2013, at 3:23 PM, Nathan Hjelm wrote:
>>>
>>>> What: Remove the opal_progress_recursion_depth_counter from
>>>> opal_progress.
>>>>
>>>> Why: This counter adds two atomic adds to the critical path when
>>>> OPAL_HAVE_THREADS is set (which is the case for most builds). I grepped
>>>> through ompi, orte, and opal to find where this value was being used and
>>>> did not find anything either inside or outside opal_progress.
>>>>
>>>> When: I want this change to go into 1.7.4 (if possible) so setting a
>>>> quick timeout for next Tuesday.
>>>>
>>>> Let me know if there is a good reason to keep this counter and it will
>>>> be spared.
>>>>
>>>> -Nathan Hjelm
>>>> HPC-5, LANL
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
> --
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default
On Dec 19, 2013, at 10:54 AM, Barrett, Brian W wrote:

>> Just to help me understand a bit better - you are saying that the node
>> supports process binding, but not memory binding? I don't see how the
>> error appears otherwise, but want to ensure I understand the code path.
>
> That appears to be the case, yes.

I think that's what's happening on the Absoft systems, too.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] [EXTERNAL] Consequence of bind-to-core by default
That worked for me.

Brian

On 12/19/13 9:32 AM, "Ralph Castain" wrote:

>Okay, I think I have these things fixed in r29978 on the trunk - please
>give it a spin and confirm so we can move it to 1.7.4
>
>On Dec 19, 2013, at 7:54 AM, Barrett, Brian W wrote:
>
>On 12/19/13 8:43 AM, "Ralph Castain" wrote:
>
>On Dec 19, 2013, at 6:27 AM, Barrett, Brian W wrote:
>
>On 12/19/13 6:59 AM, "Jeff Squyres (jsquyres)" wrote:
>
>3. Finally, we're giving a warning saying:
>
>-
>WARNING: a request was made to bind a process. While the system
>supports binding the process itself, at least one node does NOT
>support binding memory to the process location.
>-
>
>For both #1 and #3, I wonder if we shouldn't be warning if no binding was
>explicitly stated (i.e., we're just using the defaults). Specifically,
>if no binding is specified:
>
>- if we oversubscribe, (possibly) warn about the performance loss of
>oversubscription, and don't bind
>- don't warn about lack of memory binding
>
>We have a couple machines where memory binding is failing for one reason
>or another. If we're binding by default, we really shouldn't throw error
>messages about not being able to bind memory. It's REALLY annoying.
>
>Just to help me understand a bit better - you are saying that the node
>supports process binding, but not memory binding? I don't see how the
>error appears otherwise, but want to ensure I understand the code path.
>
>That appears to be the case, yes.
>
>Brian
>
>--
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories
>
>___
>devel mailing list
>de...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Brian W. Barrett
Scalable System Software Group
Sandia National Laboratories
Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter
Nathan -

Any chance you can remove the two counters this afternoon?

Brian

On 12/19/13 10:01 AM, "Jeff Squyres (jsquyres)" wrote:

>I think there's no problem with removing them from the dll code -- that
>stuff doesn't affect MPI application ABI.
>
>
>On Dec 19, 2013, at 9:42 AM, Barrett, Brian W wrote:
>
>> Someone who understands the mpi debugging handles code:
>>
>> The opal_progress_recursion_depth_counter and opal_progress_thread_counter
>> are both only used internally in opal_progress (for book keeping, but
>> never any decisions) and are declared in ompi_mpihandles_dll.c, but then
>> don't appear to be used. Is there a disadvantage to:
>>
>> 1) removing them from mpihandles_dll.c
>>
>> or, if that breaks ABI,
>>
>> 2) Leaving them, but not doing the bookkeeping?
>>
>> It's fairly heavyweight bookkeeping, so I agree with Nathan, I'd like to
>> remove it. But I'd like to remove it pre-1.7.4. Which means today.
>>
>> Brian
>>
>>
>> On 12/18/13 4:40 PM, "Nathan Hjelm" wrote:
>>
>>> Opps, yeah. Meant 1.7.5. If people agree with this change I could
>>> possibly slip it in before Friday but that is unlikely.
>>>
>>> On Wed, Dec 18, 2013 at 03:32:36PM -0800, Ralph Castain wrote:
>>>> U1.7.4 is leaving the station on Fri, Nathan, so next Tues =>
>>>> will have to go into 1.7.5
>>>>
>>>>
>>>> On Dec 18, 2013, at 3:23 PM, Nathan Hjelm wrote:
>>>>
>>>> > What: Remove the opal_progress_recursion_depth_counter from
>>>> > opal_progress.
>>>> >
>>>> > Why: This counter adds two atomic adds to the critical path when
>>>> > OPAL_HAVE_THREADS is set (which is the case for most builds). I grepped
>>>> > through ompi, orte, and opal to find where this value was being used and
>>>> > did not find anything either inside or outside opal_progress.
>>>> >
>>>> > When: I want this change to go into 1.7.4 (if possible) so setting a
>>>> > quick timeout for next Tuesday.
>>>> >
>>>> > Let me know if there is a good reason to keep this counter and it will
>>>> > be spared.
>>>> >
>>>> > -Nathan Hjelm
>>>> > HPC-5, LANL
>>>> > ___
>>>> > devel mailing list
>>>> > de...@open-mpi.org
>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>>
>> --
>> Brian W. Barrett
>> Scalable System Software Group
>> Sandia National Laboratories
>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>___
>devel mailing list
>de...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Brian W. Barrett
Scalable System Software Group
Sandia National Laboratories
Re: [OMPI devel] Speedup for MPI_Dims_create()
Andreas --

Thanks for the patch. Can I ask two things?

1. Can you separate the patch into two: one with the code change, and another with the whitespace update? It will help the readability of the logs to see the exact code change, rather than bury it in a syntax update.

2. You added a copyright notice, which is great. However, it puts this patch in a strange position for us -- I think we'd be comfortable with a copyrighted patch if we have a 3rd party agreement on file from your organization (i.e., so that the copyright holder won't come back to us later and sue us for distributing the patch under the BSD license). I think there are two options here (and IANAL, so I could well be wrong here):

2a. Re-submit the patch without a copyright header. It's such a small patch (1 line of code change, AFAICT?) that I think we can accept it without a contribution agreement. We'd cite you in the NEWS file and commit logs, of course.

2b. Submit a third party contribution agreement (see http://www.open-mpi.org/community/contribute/). Then we can list your organization under http://www.open-mpi.org/about/members/, and we can accept the patch with the copyright header.

Thanks!

On Dec 19, 2013, at 9:37 AM, Andreas Schäfer wrote:

> Dear all,
>
> please find attached a (trivial) patch to MPI_Dims_create(). When
> computing the prime factors of nnodes, it is sufficient to check for
> primes less or equal to sqrt(nnodes).
>
> This was not so much of a problem in the past, but now that Tier 0
> systems are capable of running O(10^6) MPI processes, the difference
> in execution time is on the order of seconds (e.g. 8.86s vs. 0.04s on
> my notebook, with nnproc = 10^6).
>
> Best
> -Andreas
>
> PS: oh, and the patch removes some trailing whitespace. Yuck. :-)
>
>
> --
> ==
> Andreas Schäfer
> HPC and Grid Computing
> Chair of Computer Science 3
> Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
> +49 9131 85-27910
> PGP/GPG key via keyserver
> http://www.libgeodecomp.org
> ==
>
> (\___/)
> (+'.'+)
> (")_(")
> This is Bunny. Copy and paste Bunny into your
> signature to help him gain world domination!
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter
Yes. I will do that once I finish preparing the ORNL collectives for the trunk. Will be 8pm at the latest. -Nathan From: devel [devel-boun...@open-mpi.org] on behalf of Barrett, Brian W [bwba...@sandia.gov] Sent: Thursday, December 19, 2013 10:24 AM To: Open MPI Developers Subject: Re: [OMPI devel] [EXTERNAL] Re: RFC: remove opal progress recursion depth counter Nathan - Any chance you can remove the two counters this afternoon? Brian On 12/19/13 10:01 AM, "Jeff Squyres (jsquyres)" wrote: >I think there's no problem with removing them from the dll code -- that >stuff doesn't affect MPI application ABI. > > >On Dec 19, 2013, at 9:42 AM, Barrett, Brian W wrote: > >> Someone who understands the mpi debugging handles code: >> >> The opal_progress_recursion_depth_counter and >>opal_progress_thread_counter >> are both only used internally in opal_progress (for bookkeeping, but >> never any decisions) and are declared in ompi_mpihandles_dll.c, but then >> don't appear to be used. Is there a disadvantage to: >> >> 1) removing them from mpihandles_dll.c >> >> or, if that breaks ABI, >> >> 2) Leaving them, but not doing the bookkeeping? >> >> It's fairly heavyweight bookkeeping, so I agree with Nathan, I'd like to >> remove it. But I'd like to remove it pre-1.7.4. Which means today. >> >> Brian >> >> >> On 12/18/13 4:40 PM, "Nathan Hjelm" wrote: >> >>> Oops, yeah. Meant 1.7.5. If people agree with this change I could >>> possibly slip it in before Friday but that is unlikely. >>> >>> On Wed, Dec 18, 2013 at 03:32:36PM -0800, Ralph Castain wrote: 1.7.4 is leaving the station on Fri, Nathan, so next Tues => will have to go into 1.7.5 On Dec 18, 2013, at 3:23 PM, Nathan Hjelm wrote: > What: Remove the opal_progress_recursion_depth_counter from > opal_progress. > > Why: This counter adds two atomic adds to the critical path when > OPAL_HAVE_THREADS is set (which is the case for most builds). 
I grepped > through ompi, orte, and opal to find where this value was being used and > did not find anything either inside or outside opal_progress. > > When: I want this change to go into 1.7.4 (if possible) so setting a > quick timeout for next Tuesday. > > Let me know if there is a good reason to keep this counter and it will > be spared. > > -Nathan Hjelm > HPC-5, LANL -- Brian W. Barrett Scalable System Software Group Sandia National Laboratories
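To make the cost Nathan describes concrete, here is a minimal sketch of a progress function that keeps a recursion depth counter. It is not the opal_progress source: every name ending in _sketch is hypothetical, C11 <stdatomic.h> stands in for Open MPI's own atomics layer, and the assumption that the "two atomic adds" are one increment on entry and one decrement on exit is mine.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a recursion depth counter. */
static atomic_int progress_depth_sketch = 0;

/* Placeholder for the real work of polling progress callbacks. */
static int poll_events_sketch(void)
{
    return 1;
}

/* With bookkeeping: two atomic read-modify-write operations sit on
 * the critical path of every call, even though nothing reads the
 * counter to make a decision. */
int progress_with_counter_sketch(void)
{
    atomic_fetch_add(&progress_depth_sketch, 1);
    int events = poll_events_sketch();
    atomic_fetch_sub(&progress_depth_sketch, 1);
    return events;
}

/* Without bookkeeping: same result, no atomics. */
int progress_without_counter_sketch(void)
{
    return poll_events_sketch();
}

/* Accessor so a caller can observe the counter. */
int progress_depth_now_sketch(void)
{
    return atomic_load(&progress_depth_sketch);
}
```

Both variants return the same value; the only observable difference is the counter, which is the argument for removing it when no code consumes it.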
Re: [OMPI devel] [PATCH v2 2/2] Trying to get the C/R code to compile again. (send_*_nb)
Thanks for the review. I am re-spinning the patches and sending the new version in a few moments. On Wed, Dec 18, 2013 at 06:56:47AM -0800, Ralph Castain wrote: > In the case of the send, there really isn't any problem with just replacing > things - the non-blocking change won't impact anything, so no need to retain > the old code. People were only concerned about the recv's as those places > will require further repair, and they wanted to ensure we know where those > places are located. > > You also need to change those comparisons, however, as the return code isn't > the number of bytes sent any more - it is just ORTE_SUCCESS or else an error > code, so you should be testing for ORTE_SUCCESS == > > > > > On Dec 18, 2013, at 6:42 AM, Adrian Reber wrote: > > > From: Adrian Reber > > > > This patch changes all send/send_buffer occurrences in the C/R code > > to send_nb/send_buffer_nb. > > The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED). > > The new code compiles but does not work. 
> > > > Changes from V1: > > * #ifdef out the code (so it is preserved for later re-design) > > * marked the broken C/R code with ENABLE_FT_FIXED > > > > Signed-off-by: Adrian Reber > > --- > > ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 18 +++ > > orte/mca/errmgr/base/errmgr_base_tool.c | 4 ++ > > orte/mca/rml/ftrm/rml_ftrm.h| 19 > > orte/mca/rml/ftrm/rml_ftrm_component.c | 2 - > > orte/mca/rml/ftrm/rml_ftrm_module.c | 63 > > + > > orte/mca/snapc/full/snapc_full_app.c| 20 > > orte/mca/snapc/full/snapc_full_global.c | 12 + > > orte/mca/snapc/full/snapc_full_local.c | 4 ++ > > orte/mca/sstore/central/sstore_central_app.c| 8 > > orte/mca/sstore/central/sstore_central_global.c | 4 ++ > > orte/mca/sstore/central/sstore_central_local.c | 12 + > > orte/mca/sstore/stage/sstore_stage_app.c| 8 > > orte/mca/sstore/stage/sstore_stage_global.c | 4 ++ > > orte/mca/sstore/stage/sstore_stage_local.c | 16 +++ > > orte/tools/orte-checkpoint/orte-checkpoint.c| 4 ++ > > orte/tools/orte-migrate/orte-migrate.c | 4 ++ > > 16 files changed, 130 insertions(+), 72 deletions(-) > > > > diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > > b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > > index cba7586..4f7bd7f 100644 > > --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > > +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > > @@ -5102,7 +5102,11 @@ static int wait_quiesce_drained(void) > > PACK_BUFFER(buffer, response, 1, OPAL_SIZE, ""); > > > > /* JJH - Performance Optimization? - Why not post all isends, > > then wait? 
*/ > > +#ifdef ENABLE_FT_FIXED > > +/* This is the old, now broken code */ > > if ( 0 > ( ret = > > ompi_rte_send_buffer(&(cur_peer_ref->proc_name), buffer, > > OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) { > > +#endif /* ENABLE_FT_FIXED */ > > +if ( 0 > ( ret = > > ompi_rte_send_buffer_nb(&(cur_peer_ref->proc_name), buffer, > > OMPI_CRCP_COORD_BOOKMARK_TAG, orte_rml_send_callback, NULL)) ) { > > exit_status = ret; > > goto cleanup; > > } > > @@ -5303,7 +5307,11 @@ static int send_bookmarks(int peer_idx) > > PACK_BUFFER(buffer, (peer_ref->total_msgs_recvd), 1, OPAL_UINT32, > > "crcp:bkmrk: send_bookmarks: Unable to pack > > total_msgs_recvd"); > > > > +#ifdef ENABLE_FT_FIXED > > +/* This is the old, now broken code */ > > if ( 0 > ( ret = ompi_rte_send_buffer(&peer_name, buffer, > > OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) { > > +#endif /* ENABLE_FT_FIXED */ > > +if ( 0 > ( ret = ompi_rte_send_buffer_nb(&peer_name, buffer, > > OMPI_CRCP_COORD_BOOKMARK_TAG, orte_rml_send_callback, NULL)) ) { > > opal_output(mca_crcp_bkmrk_component.super.output_handle, > > "crcp:bkmrk: send_bookmarks: Failed to send bookmark to > > peer %s: Return %d\n", > > OMPI_NAME_PRINT(&peer_name), > > @@ -5599,8 +5607,13 @@ static int > > do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref, > > /* > > * Do the send... > > */ > > +#ifdef ENABLE_FT_FIXED > > +/* This is the old, now broken code */ > > if ( 0 > ( ret = ompi_rte_send_buffer(&peer_ref->proc_name, buffer, > > OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) > > ) { > > +#endif /* ENABLE_FT_FIXED */ > > +if ( 0 > ( ret = ompi_rte_send_buffer_nb(&peer_ref->proc_name, buffer, > > + OMPI_CRCP_COORD_BOOKMARK_TAG, > > orte_rml_send_callback, NULL)) ) { > > opal_output(mca_crcp_bkmrk_component.super.output_handle, > > "crcp:bkmrk: do_send_msg_detail: Unable to send message > > details to peer %s: Return %d\n", > > OMPI_NAME_PRINT(&peer_ref->proc_name), > > @@ -62
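Ralph's review comment about return codes can be sketched as follows. This is a toy model, not the ORTE RML API: sketch_send_buffer_nb, the SKETCH_* constants, and their values are all invented for illustration. The point is only the shape of the check: a non-blocking send reports a status code rather than a byte count, so the caller compares against the success constant.

```c
#include <stddef.h>

/* Hypothetical status codes; the values are made up for this sketch. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_BAD_PARAM = -5 };

typedef void (*sketch_send_cb_t)(int status, void *cbdata);

/* Toy non-blocking send: validates arguments, "completes" at once by
 * invoking the callback, and returns a status code (never a length). */
int sketch_send_buffer_nb(int peer, const void *buf,
                          sketch_send_cb_t cb, void *cbdata)
{
    if (peer < 0 || buf == NULL) {
        return SKETCH_ERR_BAD_PARAM;
    }
    if (cb != NULL) {
        cb(SKETCH_SUCCESS, cbdata);
    }
    return SKETCH_SUCCESS;
}

/* Caller in the style suggested in the review: test for
 * "success != ret" instead of "0 > ret". */
int sketch_send_bookmark(int peer, const void *buf)
{
    int ret;

    if (SKETCH_SUCCESS != (ret = sketch_send_buffer_nb(peer, buf, NULL, NULL))) {
        return ret;
    }
    return SKETCH_SUCCESS;
}
```

Comparing against the named constant keeps the caller correct regardless of whether a given error code happens to be negative.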
[OMPI devel] [PATCH v3 0/2] Trying to get the C/R code to compile again
From: Adrian Reber This is the second try to replace the usage of blocking send and recv in the C/R code with the non-blocking versions. The new code compiles (in contrast to the old code) but does not work yet. This is the first step to get the C/R code working again. Right now it only compiles. Changes from V1: * #ifdef out the broken code (so it is preserved for later re-design) * marked the broken C/R code with ENABLE_FT_FIXED Changes from V2: * only #ifdef out parts where the behaviour actually changes Adrian Reber (2): Trying to get the C/R code to compile again. (recv_*_nb) Trying to get the C/R code to compile again. (send_*_nb) ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 64 +-- orte/mca/errmgr/base/errmgr_base_tool.c | 20 +--- orte/mca/rml/ftrm/rml_ftrm.h| 46 +--- orte/mca/rml/ftrm/rml_ftrm_component.c | 4 - orte/mca/rml/ftrm/rml_ftrm_module.c | 139 +++- orte/mca/snapc/full/snapc_full_app.c| 32 +- orte/mca/snapc/full/snapc_full_global.c | 52 - orte/mca/snapc/full/snapc_full_local.c | 40 ++- orte/mca/sstore/central/sstore_central_app.c| 14 ++- orte/mca/sstore/central/sstore_central_global.c | 21 +--- orte/mca/sstore/central/sstore_central_local.c | 29 ++--- orte/mca/sstore/stage/sstore_stage_app.c| 13 ++- orte/mca/sstore/stage/sstore_stage_global.c | 21 +--- orte/mca/sstore/stage/sstore_stage_local.c | 33 +++--- orte/tools/orte-checkpoint/orte-checkpoint.c| 20 +--- orte/tools/orte-migrate/orte-migrate.c | 20 +--- 16 files changed, 186 insertions(+), 382 deletions(-) -- 1.8.4.2
[OMPI devel] [PATCH v3 1/2] Trying to get the C/R code to compile again. (recv_*_nb)
From: Adrian Reber This patch changes all recv/recv_buffer occurrences in the C/R code to recv_nb/recv_buffer_nb. The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED). The new code compiles but does not work. Changes from V1: * #ifdef out the code (so it is preserved for later re-design) * marked the broken C/R code with ENABLE_FT_FIXED Changes from V2: * only #ifdef out the code where the behaviour is changed (used to be blocking; now non-blocking) Signed-off-by: Adrian Reber --- ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 41 + orte/mca/errmgr/base/errmgr_base_tool.c | 16 + orte/mca/rml/ftrm/rml_ftrm.h| 27 ++--- orte/mca/rml/ftrm/rml_ftrm_component.c | 2 - orte/mca/rml/ftrm/rml_ftrm_module.c | 78 +++-- orte/mca/snapc/full/snapc_full_app.c| 12 orte/mca/snapc/full/snapc_full_global.c | 37 +++- orte/mca/snapc/full/snapc_full_local.c | 36 +++- orte/mca/sstore/central/sstore_central_app.c| 6 ++ orte/mca/sstore/central/sstore_central_global.c | 17 +- orte/mca/sstore/central/sstore_central_local.c | 17 +- orte/mca/sstore/stage/sstore_stage_app.c| 5 ++ orte/mca/sstore/stage/sstore_stage_global.c | 17 +- orte/mca/sstore/stage/sstore_stage_local.c | 17 +- orte/tools/orte-checkpoint/orte-checkpoint.c| 16 + orte/tools/orte-migrate/orte-migrate.c | 16 + 16 files changed, 87 insertions(+), 273 deletions(-) diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c index 5d4005f..05cd598 100644 --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c @@ -4717,7 +4717,6 @@ static int ft_event_post_drain_acks(void) ompi_crcp_bkmrk_pml_drain_message_ack_ref_t * drain_msg_ack = NULL; opal_list_item_t* item = NULL; size_t req_size; -int ret; req_size = opal_list_get_size(&drained_msg_ack_list); if(req_size <= 0) { @@ -4739,17 +4738,8 @@ static int ft_event_post_drain_acks(void) drain_msg_ack = (ompi_crcp_bkmrk_pml_drain_message_ack_ref_t*)item; /* Post the receive */ -if( OMPI_SUCCESS != (ret = 
ompi_rte_recv_buffer_nb( &drain_msg_ack->peer, - OMPI_CRCP_COORD_BOOKMARK_TAG, -0, - drain_message_ack_cbfunc, -NULL) ) ) { -opal_output(mca_crcp_bkmrk_component.super.output_handle, -"crcp:bkmrk: %s <-- %s: Failed to post a RML receive to the peer\n", -OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), -OMPI_NAME_PRINT(&(drain_msg_ack->peer))); -return ret; -} +ompi_rte_recv_buffer_nb(&drain_msg_ack->peer, OMPI_CRCP_COORD_BOOKMARK_TAG, +0, drain_message_ack_cbfunc, NULL); } return OMPI_SUCCESS; @@ -5322,26 +5312,14 @@ static int send_bookmarks(int peer_idx) static int recv_bookmarks(int peer_idx) { ompi_process_name_t peer_name; -int exit_status = OMPI_SUCCESS; -int ret; START_TIMER(CRCP_TIMER_CKPT_EX_PEER_R); peer_name.jobid = OMPI_PROC_MY_NAME->jobid; peer_name.vpid = peer_idx; -if ( 0 > (ret = ompi_rte_recv_buffer_nb(&peer_name, -OMPI_CRCP_COORD_BOOKMARK_TAG, -0, -recv_bookmarks_cbfunc, -NULL) ) ) { -opal_output(mca_crcp_bkmrk_component.super.output_handle, -"crcp:bkmrk: recv_bookmarks: Failed to post receive bookmark from peer %s: Return %d\n", -OMPI_NAME_PRINT(&peer_name), -ret); -exit_status = ret; -goto cleanup; -} +ompi_rte_recv_buffer_nb(&peer_name, OMPI_CRCP_COORD_BOOKMARK_TAG, +0, recv_bookmarks_cbfunc, NULL); ++total_recv_bookmarks; @@ -5350,7 +5328,7 @@ static int recv_bookmarks(int peer_idx) /* JJH Doesn't make much sense to print this. The real bottleneck is always the send_bookmarks() */ /*DISPLAY_INDV_TIMER(CRCP_TIMER_CKPT_EX_PEER_R, peer_idx, 1);*/ -return exit_status; +return OMPI_SUCCESS; } static void recv_bookmarks_cbfunc(int status, @@ -5616,6 +5594,8 @@ static int do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref, /* * Recv the ACK msg */ +#ifdef ENABLE_FT_FIXED +/* This is the old, now broken code */ if ( 0 > (ret = ompi_rte_recv_buffer(&peer_ref->proc_name, buffer, OMPI_CRCP_COORD_BOOKMARK_TAG, 0) ) ) { opal_output(mca_crcp
[OMPI devel] [PATCH v3 2/2] Trying to get the C/R code to compile again. (send_*_nb)
From: Adrian Reber This patch changes all send/send_buffer occurrences in the C/R code to send_nb/send_buffer_nb. The new code compiles but does not work. Changes from V1: * #ifdef out the code (so it is preserved for later re-design) * marked the broken C/R code with ENABLE_FT_FIXED Changes from V2: * just replace the blocking calls with the non-blocking calls * all #ifdef's introduced in V1 are gone * send_* returns error code or ORTE_SUCCESS (not the number of bytes) Signed-off-by: Adrian Reber --- ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 23 ++ orte/mca/errmgr/base/errmgr_base_tool.c | 4 +- orte/mca/rml/ftrm/rml_ftrm.h| 19 orte/mca/rml/ftrm/rml_ftrm_component.c | 2 - orte/mca/rml/ftrm/rml_ftrm_module.c | 61 +++-- orte/mca/snapc/full/snapc_full_app.c| 20 ++-- orte/mca/snapc/full/snapc_full_global.c | 15 -- orte/mca/snapc/full/snapc_full_local.c | 4 +- orte/mca/sstore/central/sstore_central_app.c| 8 +++- orte/mca/sstore/central/sstore_central_global.c | 4 +- orte/mca/sstore/central/sstore_central_local.c | 12 +++-- orte/mca/sstore/stage/sstore_stage_app.c| 8 +++- orte/mca/sstore/stage/sstore_stage_global.c | 4 +- orte/mca/sstore/stage/sstore_stage_local.c | 16 +-- orte/tools/orte-checkpoint/orte-checkpoint.c| 4 +- orte/tools/orte-migrate/orte-migrate.c | 4 +- 16 files changed, 99 insertions(+), 109 deletions(-) diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c index 05cd598..5ad9a3e 100644 --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c @@ -5077,7 +5077,7 @@ static int wait_quiesce_drained(void) "crcp:bkmrk: %s --> %s Send ACKs to Peer\n", OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), OMPI_NAME_PRINT(&(cur_peer_ref->proc_name)) )); - + /* Send All Clear to Peer */ if (NULL == (buffer = OBJ_NEW(opal_buffer_t))) { exit_status = OMPI_ERROR; @@ -5087,7 +5087,9 @@ static int wait_quiesce_drained(void) PACK_BUFFER(buffer, response, 1, OPAL_SIZE, ""); /* JJH - Performance Optimization? 
- Why not post all isends, then wait? */ -if ( 0 > ( ret = ompi_rte_send_buffer(&(cur_peer_ref->proc_name), buffer, OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) { +if (ORTE_SUCCESS != (ret = ompi_rte_send_buffer_nb(&(cur_peer_ref->proc_name), + buffer, OMPI_CRCP_COORD_BOOKMARK_TAG, + orte_rml_send_callback, NULL))) { exit_status = ret; goto cleanup; } @@ -5288,7 +5290,9 @@ static int send_bookmarks(int peer_idx) PACK_BUFFER(buffer, (peer_ref->total_msgs_recvd), 1, OPAL_UINT32, "crcp:bkmrk: send_bookmarks: Unable to pack total_msgs_recvd"); -if ( 0 > ( ret = ompi_rte_send_buffer(&peer_name, buffer, OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) { +if (ORTE_SUCCESS != (ret = ompi_rte_send_buffer_nb(&peer_name, buffer, + OMPI_CRCP_COORD_BOOKMARK_TAG, + orte_rml_send_callback, NULL))) { opal_output(mca_crcp_bkmrk_component.super.output_handle, "crcp:bkmrk: send_bookmarks: Failed to send bookmark to peer %s: Return %d\n", OMPI_NAME_PRINT(&peer_name), @@ -5567,13 +5571,14 @@ static int do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref, /* * Do the send... */ -if ( 0 > ( ret = ompi_rte_send_buffer(&peer_ref->proc_name, buffer, - OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) { +if (ORTE_SUCCESS != (ret = ompi_rte_send_buffer_nb(&peer_ref->proc_name, buffer, + OMPI_CRCP_COORD_BOOKMARK_TAG, + orte_rml_send_callback, NULL))) { opal_output(mca_crcp_bkmrk_component.super.output_handle, "crcp:bkmrk: do_send_msg_detail: Unable to send message details to peer %s: Return %d\n", OMPI_NAME_PRINT(&peer_ref->proc_name), ret); - + exit_status = OMPI_ERROR; goto cleanup; } @@ -6185,8 +6190,10 @@ static int do_recv_msg_detail_resp(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref, "crcp:bkmrk: recv_msg_details: Unable to ask peer for more messages"); PACK_BUFFER(buffer, total_found, 1, OPAL_UINT32, "crcp:bkmrk: recv_msg_details: Unable to ask peer for more messages"); - -if ( 0 > ( ret = ompi_rte_send_buffer(&peer
Re: [OMPI devel] [PATCH v3 2/2] Trying to get the C/R code to compile again. (send_*_nb)
+1 from me On Dec 19, 2013, at 12:54 PM, Adrian Reber wrote: > From: Adrian Reber > > This patch changes all send/send_buffer occurrences in the C/R code > to send_nb/send_buffer_nb. > The new code compiles but does not work. > > Changes from V1: > * #ifdef out the code (so it is preserved for later re-design) > * marked the broken C/R code with ENABLE_FT_FIXED > > Changes from V2: > * just replace the blocking calls with the non-blocking calls > * all #ifdef's introduced in V1 are gone > * send_* returns error code or ORTE_SUCCESS (not the number of bytes) > > Signed-off-by: Adrian Reber > --- > ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 23 ++ > orte/mca/errmgr/base/errmgr_base_tool.c | 4 +- > orte/mca/rml/ftrm/rml_ftrm.h| 19 > orte/mca/rml/ftrm/rml_ftrm_component.c | 2 - > orte/mca/rml/ftrm/rml_ftrm_module.c | 61 +++-- > orte/mca/snapc/full/snapc_full_app.c| 20 ++-- > orte/mca/snapc/full/snapc_full_global.c | 15 -- > orte/mca/snapc/full/snapc_full_local.c | 4 +- > orte/mca/sstore/central/sstore_central_app.c| 8 +++- > orte/mca/sstore/central/sstore_central_global.c | 4 +- > orte/mca/sstore/central/sstore_central_local.c | 12 +++-- > orte/mca/sstore/stage/sstore_stage_app.c| 8 +++- > orte/mca/sstore/stage/sstore_stage_global.c | 4 +- > orte/mca/sstore/stage/sstore_stage_local.c | 16 +-- > orte/tools/orte-checkpoint/orte-checkpoint.c| 4 +- > orte/tools/orte-migrate/orte-migrate.c | 4 +- > 16 files changed, 99 insertions(+), 109 deletions(-) > > diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > index 05cd598..5ad9a3e 100644 > --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > @@ -5077,7 +5077,7 @@ static int wait_quiesce_drained(void) > "crcp:bkmrk: %s --> %s Send ACKs to Peer\n", > OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), > OMPI_NAME_PRINT(&(cur_peer_ref->proc_name)) > )); > - > + > /* Send All Clear to Peer */ > if (NULL == (buffer = OBJ_NEW(opal_buffer_t))) { > exit_status = OMPI_ERROR; > @@ 
-5087,7 +5087,9 @@ static int wait_quiesce_drained(void) > PACK_BUFFER(buffer, response, 1, OPAL_SIZE, ""); > > /* JJH - Performance Optimization? - Why not post all isends, > then wait? */ > -if ( 0 > ( ret = > ompi_rte_send_buffer(&(cur_peer_ref->proc_name), buffer, > OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) { > +if (ORTE_SUCCESS != (ret = > ompi_rte_send_buffer_nb(&(cur_peer_ref->proc_name), > + buffer, > OMPI_CRCP_COORD_BOOKMARK_TAG, > + > orte_rml_send_callback, NULL))) { > exit_status = ret; > goto cleanup; > } > @@ -5288,7 +5290,9 @@ static int send_bookmarks(int peer_idx) > PACK_BUFFER(buffer, (peer_ref->total_msgs_recvd), 1, OPAL_UINT32, > "crcp:bkmrk: send_bookmarks: Unable to pack > total_msgs_recvd"); > > -if ( 0 > ( ret = ompi_rte_send_buffer(&peer_name, buffer, > OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) { > +if (ORTE_SUCCESS != (ret = ompi_rte_send_buffer_nb(&peer_name, buffer, > + > OMPI_CRCP_COORD_BOOKMARK_TAG, > + > orte_rml_send_callback, NULL))) { > opal_output(mca_crcp_bkmrk_component.super.output_handle, > "crcp:bkmrk: send_bookmarks: Failed to send bookmark to > peer %s: Return %d\n", > OMPI_NAME_PRINT(&peer_name), > @@ -5567,13 +5571,14 @@ static int > do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref, > /* > * Do the send... > */ > -if ( 0 > ( ret = ompi_rte_send_buffer(&peer_ref->proc_name, buffer, > - OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) > ) { > +if (ORTE_SUCCESS != (ret = ompi_rte_send_buffer_nb(&peer_ref->proc_name, > buffer, > + > OMPI_CRCP_COORD_BOOKMARK_TAG, > + > orte_rml_send_callback, NULL))) { > opal_output(mca_crcp_bkmrk_component.super.output_handle, > "crcp:bkmrk: do_send_msg_detail: Unable to send message > details to peer %s: Return %d\n", > OMPI_NAME_PRINT(&peer_ref->proc_name), > ret); > - > + > exit_status = OMPI_ERROR; > goto cleanup; > } > @@ -6185,8 +6190,10 @@ static int > do_recv_msg_detail_resp(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref, > "crcp:bkmrk: recv_msg_details
Re: [OMPI devel] [PATCH v3 1/2] Trying to get the C/R code to compile again. (recv_*_nb)
Looks okay to me. On the places where you need to block while waiting for an answer, you can use OMPI_WAIT_FOR_COMPLETION - this will spin on opal_progress until the condition is met. We use it elsewhere for similar purposes. See ompi/mca/rte/rte.h for the definition On Dec 19, 2013, at 12:54 PM, Adrian Reber wrote: > From: Adrian Reber > > This patch changes all recv/recv_buffer occurrences in the C/R code > to recv_nb/recv_buffer_nb. > The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED). > The new code compiles but does not work. > > Changes from V1: > * #ifdef out the code (so it is preserved for later re-design) > * marked the broken C/R code with ENABLE_FT_FIXED > > Changes from V2: > * only #ifdef out the code where the behaviour is changed > (used to be blocking; now non-blocking) > > Signed-off-by: Adrian Reber > --- > ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 41 + > orte/mca/errmgr/base/errmgr_base_tool.c | 16 + > orte/mca/rml/ftrm/rml_ftrm.h| 27 ++--- > orte/mca/rml/ftrm/rml_ftrm_component.c | 2 - > orte/mca/rml/ftrm/rml_ftrm_module.c | 78 +++-- > orte/mca/snapc/full/snapc_full_app.c| 12 > orte/mca/snapc/full/snapc_full_global.c | 37 +++- > orte/mca/snapc/full/snapc_full_local.c | 36 +++- > orte/mca/sstore/central/sstore_central_app.c| 6 ++ > orte/mca/sstore/central/sstore_central_global.c | 17 +- > orte/mca/sstore/central/sstore_central_local.c | 17 +- > orte/mca/sstore/stage/sstore_stage_app.c| 5 ++ > orte/mca/sstore/stage/sstore_stage_global.c | 17 +- > orte/mca/sstore/stage/sstore_stage_local.c | 17 +- > orte/tools/orte-checkpoint/orte-checkpoint.c| 16 + > orte/tools/orte-migrate/orte-migrate.c | 16 + > 16 files changed, 87 insertions(+), 273 deletions(-) > > diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > index 5d4005f..05cd598 100644 > --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c > @@ -4717,7 +4717,6 @@ static int ft_event_post_drain_acks(void) > 
ompi_crcp_bkmrk_pml_drain_message_ack_ref_t * drain_msg_ack = NULL; > opal_list_item_t* item = NULL; > size_t req_size; > -int ret; > > req_size = opal_list_get_size(&drained_msg_ack_list); > if(req_size <= 0) { > @@ -4739,17 +4738,8 @@ static int ft_event_post_drain_acks(void) > drain_msg_ack = (ompi_crcp_bkmrk_pml_drain_message_ack_ref_t*)item; > > /* Post the receive */ > -if( OMPI_SUCCESS != (ret = ompi_rte_recv_buffer_nb( > &drain_msg_ack->peer, > - > OMPI_CRCP_COORD_BOOKMARK_TAG, > -0, > - > drain_message_ack_cbfunc, > -NULL) ) ) { > -opal_output(mca_crcp_bkmrk_component.super.output_handle, > -"crcp:bkmrk: %s <-- %s: Failed to post a RML receive > to the peer\n", > -OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), > -OMPI_NAME_PRINT(&(drain_msg_ack->peer))); > -return ret; > -} > +ompi_rte_recv_buffer_nb(&drain_msg_ack->peer, > OMPI_CRCP_COORD_BOOKMARK_TAG, > +0, drain_message_ack_cbfunc, NULL); > } > > return OMPI_SUCCESS; > @@ -5322,26 +5312,14 @@ static int send_bookmarks(int peer_idx) > static int recv_bookmarks(int peer_idx) > { > ompi_process_name_t peer_name; > -int exit_status = OMPI_SUCCESS; > -int ret; > > START_TIMER(CRCP_TIMER_CKPT_EX_PEER_R); > > peer_name.jobid = OMPI_PROC_MY_NAME->jobid; > peer_name.vpid = peer_idx; > > -if ( 0 > (ret = ompi_rte_recv_buffer_nb(&peer_name, > -OMPI_CRCP_COORD_BOOKMARK_TAG, > -0, > -recv_bookmarks_cbfunc, > -NULL) ) ) { > -opal_output(mca_crcp_bkmrk_component.super.output_handle, > -"crcp:bkmrk: recv_bookmarks: Failed to post receive > bookmark from peer %s: Return %d\n", > -OMPI_NAME_PRINT(&peer_name), > -ret); > -exit_status = ret; > -goto cleanup; > -} > +ompi_rte_recv_buffer_nb(&peer_name, OMPI_CRCP_COORD_BOOKMARK_TAG, > +0, recv_bookmarks_cbfunc, NULL); > > ++total_recv_bookmarks; > > @@ -5350,7 +5328,7 @@ static int recv_bookmarks(int peer_idx) > /* JJH Doesn't make much sense to print this. The real bottleneck is > always the send_bookmarks() */ > /*DISPLAY_INDV_TIMER(CRCP_TIMER_CKPT_EX_PEER_R, pee
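Ralph's suggestion, posting a non-blocking receive and then spinning on progress until a condition is met, can be sketched with a self-contained toy. Nothing below is Open MPI code: the real macro is OMPI_WAIT_FOR_COMPLETION in ompi/mca/rte/rte.h, and the exact form of SKETCH_WAIT_FOR_COMPLETION here (loop while the flag is true) is an assumption, as are all sketch_* names.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*sketch_recv_cb_t)(int status, void *cbdata);

/* Toy "runtime": at most one pending receive, delivered after a
 * number of progress ticks. */
static sketch_recv_cb_t pending_cb = NULL;
static void *pending_cbdata = NULL;
static int ticks_until_arrival = 0;
int sketch_progress_calls = 0;

void sketch_recv_nb(sketch_recv_cb_t cb, void *cbdata, int delay_ticks)
{
    pending_cb = cb;
    pending_cbdata = cbdata;
    ticks_until_arrival = delay_ticks;
}

/* Stand-in for a progress function: fires the callback once the
 * simulated message has arrived. */
void sketch_progress(void)
{
    ++sketch_progress_calls;
    if (pending_cb != NULL && --ticks_until_arrival <= 0) {
        sketch_recv_cb_t cb = pending_cb;
        pending_cb = NULL;
        cb(0, pending_cbdata);
    }
}

/* Assumed shape of a wait macro: spin on progress while the flag is set. */
#define SKETCH_WAIT_FOR_COMPLETION(active) \
    do { while (active) { sketch_progress(); } } while (0)

struct sketch_ack {
    volatile bool active;   /* true until the callback has run */
    int status;
};

static void sketch_ack_cb(int status, void *cbdata)
{
    struct sketch_ack *ack = cbdata;
    ack->status = status;
    ack->active = false;
}

/* Blocking behaviour rebuilt on top of the non-blocking call. */
int sketch_blocking_recv(int delay_ticks)
{
    struct sketch_ack ack = { .active = true, .status = -1 };

    sketch_recv_nb(sketch_ack_cb, &ack, delay_ticks);
    SKETCH_WAIT_FOR_COMPLETION(ack.active);
    return ack.status;
}
```

The design choice is that the callback owns the flag: the waiting code never inspects the message itself, it only drives progress until the callback reports completion.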
[OMPI devel] 1.7 series release plans
Hi folks

Given the number of changes/fixes pushed into the 1.7.4rc's this week, it seems best that we delay that release until after the holiday. Accordingly, the revised release plan looks like this:

1.7.4rc2 - this weekend
1.7.4 - Jan 10th
1.7.5 feature freeze (hard deadline) - Jan 24th
1.7.5 release - mid-Feb

We are feature-freezing 1.7.4 as of now, so the ORNL collectives will go into 1.7.5 along with oshmem (assuming it is ready by the deadline).

If we don't connect before the weekend, have a great holiday! I'll be occasionally available on email and plan to do a few things over the holiday, but it will be somewhat hit-and-miss.

Ralph
[OMPI devel] 1.7.4rc1 build failure: FreeBSD-9
I see the failure below when building 1.7.4rc1 on FreeBSD-9 (amd64). It looks to be just a missing header, probably sys/stat.h. $ gcc --version gcc (GCC) 4.2.1 20070831 patched [FreeBSD] Only configure option passed was --prefix-... -Paul Making all in mca/sharedfp/sm CC sharedfp_sm.lo CC sharedfp_sm_component.lo CC sharedfp_sm_seek.lo CC sharedfp_sm_get_position.lo CC sharedfp_sm_request_position.lo CC sharedfp_sm_write.lo CC sharedfp_sm_iwrite.lo CC sharedfp_sm_read.lo CC sharedfp_sm_iread.lo CC sharedfp_sm_file_open.lo /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c: In function 'mca_sharedfp_sm_file_open': /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121: error: 'S_IRUSR' undeclared (first use in this function) /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121: error: (Each undeclared identifier is reported only once /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121: error: for each function it appears in.) /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121: error: 'S_IWUSR' undeclared (first use in this function) /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121: error: 'S_IRGRP' undeclared (first use in this function) /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121: error: 'S_IROTH' undeclared (first use in this function) *** [sharedfp_sm_file_open.lo] Error code 1 -- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group Computer and Data Sciences Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
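The S_I* permission macros named in the errors above are declared by <sys/stat.h>, so a translation unit that uses them needs that include explicitly rather than relying on another header to pull it in. The sketch below is an assumption about the kind of call involved; the actual statement at line 121 of sharedfp_sm_file_open.c is not shown in the report, and the sketch_* names are invented.

```c
#include <sys/types.h>
#include <sys/stat.h>   /* declares S_IRUSR, S_IWUSR, S_IRGRP, S_IROTH */
#include <fcntl.h>      /* declares open() and the O_* flags */

/* Permission bits matching the four macros in the error output:
 * owner read/write, group read, other read. */
mode_t sketch_shared_file_mode(void)
{
    return S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;
}

/* Hypothetical use: create or open a file with those permissions.
 * Returns a file descriptor, or -1 with errno set on failure. */
int sketch_open_shared_file(const char *path)
{
    return open(path, O_RDWR | O_CREAT, sketch_shared_file_mode());
}
```

On platforms where <fcntl.h> happens to expose the same macros, the missing include goes unnoticed, which would explain why only some systems report the error.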
[OMPI devel] 1.7.4rc1 build failure: OpenBSD-5 and NetBSD-6
When building 1.7.4rc1 on OpenBSD-5 and NetBSD-6 (both amd64) I see what appears to be the same three errors ("make" output at end of this email) on both platforms. All three syntax errors appear to be collisions on the symbol if_mtu: -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.h | grep -w -e 182 182 OPAL_DECLSPEC int opal_ifindextomtu(int if_index, int *if_mtu); -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/mca/if/if.h | grep -w -e 98 98 int if_mtu; -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.c | grep -w -e 482 482 int opal_ifindextomtu(int if_index, int *if_mtu) -bash-4.2$ grep if_mtu /usr/include/net/if.h #define if_mtu if_data.ifi_mtu\ -Paul OpenBSD: -bash-4.2$ uname -a OpenBSD pcp-j-16.my.domain 5.3 GENERIC.MP#62 amd64 -bash-4.2$ gcc --version gcc (GCC) 4.2.1 20070719 Making all in keyval LEX keyval_lex.c CC keyval_lex.lo CCLD libopalutilkeyval.la CC fd.lo CC arch.lo CC argv.lo CC basename.lo CC cmd_line.lo CC crc.lo CC convert.lo CC daemon_init.lo CC error.lo CC few.lo CC if.lo In file included from /home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/openmpi-1.7.4rc1/opal/util/if.c:74: /home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/openmpi-1.7.4rc1/opal/util/if.h:182: error: expected ';', ',' or ')' before '.' token In file included from /home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/openmpi-1.7.4rc1/opal/mca/if/base/base.h:18, from /home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/openmpi-1.7.4rc1/opal/util/if.c:81: /home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/openmpi-1.7.4rc1/opal/mca/if/if.h:98: error: expected ':', ',', ';', '}' or '__attribute__' before '.' token /home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/openmpi-1.7.4rc1/opal/util/if.c:482: error: expected ';', ',' or ')' before '.' token *** Error 1 in opal/util (Makefile:1642 'if.lo': @echo " CC " if.lo;depbase=`echo if.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`; /bin/sh ...) 
*** Error 1 in opal/util (Makefile:1731 'all-recursive') *** Error 1 in opal (Makefile:2039 'all-recursive') *** Error 1 in /home/phargrov/OMPI/openmpi-1.7.4rc1-openbsd5-amd64/BLD (Makefile:1572 'all-recursive') NetBSD: -bash-4.2$ uname -a NetBSD pcp-j-18 6.1 NetBSD 6.1 (GENERIC) amd64 -bash-4.2$ gcc --version gcc (NetBSD nb2 20110806) 4.5.3 Making all in keyval CC keyval_lex.lo CCLD libopalutilkeyval.la CC fd.lo CC arch.lo CC argv.lo CC basename.lo CC cmd_line.lo CC crc.lo CC convert.lo CC daemon_init.lo CC error.lo CC few.lo CC if.lo In file included from /home/phargrov/OMPI/openmpi-1.7.4rc1-netbsd6-amd64/openmpi-1.7.4rc1/opal/util/if.c:74:0: /home/phargrov/OMPI/openmpi-1.7.4rc1-netbsd6-amd64/openmpi-1.7.4rc1/opal/util/if.h:182:56: error: expected ';', ',' or ')' before '.' token In file included from /home/phargrov/OMPI/openmpi-1.7.4rc1-netbsd6-amd64/openmpi-1.7.4rc1/opal/mca/if/base/base.h:18:0, from /home/phargrov/OMPI/openmpi-1.7.4rc1-netbsd6-amd64/openmpi-1.7.4rc1/opal/util/if.c:81: /home/phargrov/OMPI/openmpi-1.7.4rc1-netbsd6-amd64/openmpi-1.7.4rc1/opal/mca/if/if.h:98:25: error: expected ':', ',', ';', '}' or '__attribute__' before '.' token /home/phargrov/OMPI/openmpi-1.7.4rc1-netbsd6-amd64/openmpi-1.7.4rc1/opal/util/if.c:482:42: error: expected ';', ',' or ')' before '.' token *** Error code 1 Stop. -- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group Computer and Data Sciences Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
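The collision Paul diagnoses can be reproduced in miniature. In the sketch below the #define imitates the line his grep found in the BSD <net/if.h>; everything else (the sketch_* names, the table, the MTU values) is invented for illustration. Renaming the identifier is one possible way around the clash, not necessarily the fix that was applied to Open MPI.

```c
#include <stddef.h>

/* Stand-in for the system header, per the grep output in the report. */
#define if_mtu if_data.ifi_mtu

/*
 * With that macro visible, a declaration such as
 *     int sketch_ifindextomtu(int if_index, int *if_mtu);
 * is preprocessed into
 *     int sketch_ifindextomtu(int if_index, int *if_data.ifi_mtu);
 * which produces the "expected ... before '.' token" errors.
 * Using a name the system header does not claim avoids the problem.
 */
struct sketch_if {
    int index;
    int mtu;        /* deliberately not named if_mtu */
};

static const struct sketch_if sketch_table[] = {
    { 1, 65536 },   /* made-up loopback entry */
    { 2, 1500 },    /* made-up ethernet entry */
};

/* Look up the MTU for an interface index; returns 0 on success and
 * -1 if the index is unknown. */
int sketch_ifindextomtu(int if_index, int *mtu)
{
    for (size_t i = 0; i < sizeof sketch_table / sizeof sketch_table[0]; ++i) {
        if (sketch_table[i].index == if_index) {
            *mtu = sketch_table[i].mtu;
            return 0;
        }
    }
    return -1;
}
```

The same reasoning covers the struct member at opal/mca/if/if.h line 98: any identifier spelled if_mtu is rewritten by the preprocessor before the compiler ever sees it.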
[OMPI devel] 1.7.4rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3
In 1.7.4rc1's README support is still claimed for Solaris 11 on x86_64 with Sun Studio (12.2 and 12.3):

  - Oracle Solaris 10 and 11, 32 and 64 bit (SPARC, i386, x86_64), with Oracle Solaris Studio 12.2 and 12.3

However, I get a build failure when configured with:

  CC=cc CFLAGS=-m64 --with-wrapper-cflags=-m64
  CXX=CC CXXFLAGS='-m64 -library=stlport4' --with-wrapper-cxxflags=-m64
  FC=f90 FCFLAGS=-m64 --with-wrapper-fcflags=-m64
  --with-openib --prefix=...

The failure doesn't appear to be compiler specific, and I will be testing gcc ASAP.

make[2]: Entering directory `/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/BLD/opal/mca/if/posix_ipv4'
  CC if_posix.lo
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h", line 136: warning: parameter in inline asm statement unused: %3
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h", line 182: warning: parameter in inline asm statement unused: %2
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h", line 203: warning: parameter in inline asm statement unused: %2
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h", line 224: warning: parameter in inline asm statement unused: %2
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/include/opal/sys/amd64/atomic.h", line 245: warning: parameter in inline asm statement unused: %2
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c", line 272: undefined struct/union member: ifr_hwaddr
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c", line 272: warning: left operand of "." must be struct/union object
"/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c", line 272: cannot access member of non-struct/union object
cc: acomp failed for /shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c
make[2]: *** [if_posix.lo] Error 1
make[2]: Leaving directory `/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-ss12u3/BLD/opal/mca/if/posix_ipv4'

The atomics warnings are concerning (and appear *MANY* times in the output). However, the *real* problem is the three errors at "opal/mca/if/posix_ipv4/if_posix.c", line 272.

Solaris doesn't have an ifr_hwaddr field in struct ifreq. It *does* have an ifr_addr field, but this posting:
  http://comments.gmane.org/gmane.os.solaris.opensolaris.networking/12839
suggests that this ioctl probably fails on PF_INET sockets.

The surrounding code looks like:

#ifdef SIOCGIFHWADDR
    /* get the MAC address */
    if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
        opal_output(0, "btl_usnic_opal_ifinit: ioctl(SIOCGIFHWADDR) failed with errno=%d", errno);
        break;
    }
    memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
#endif

#if defined(SIOCGIFMTU) && defined(HAVE_STRUCT_IFREQ_IFR_MTU)
    /* get the MTU */
    if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
        opal_output(0, "btl_usnic_opal_ifinit: ioctl(SIOCGIFMTU) failed with errno=%d", errno);
        break;
    }
    intf->if_mtu = ifr->ifr_mtu;
#endif

Note the "btl_usnic_opal_ifinit:" in the opal_output lines is probably a cut-and-paste error.

-Paul

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] 1.7.4rc1 build failure: Solaris 11 / x86_64
I've confirmed that the ifr_hwaddr problem also occurs with this system's /usr/bin/gcc:

Making all in mca/if/posix_ipv4
make[2]: Entering directory `/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-gcc452/BLD/opal/mca/if/posix_ipv4'
  CC if_posix.lo
/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-gcc452/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c: In function 'if_posix_open':
/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-gcc452/openmpi-1.7.4rc1/opal/mca/if/posix_ipv4/if_posix.c:272:37: error: 'struct ifreq' has no member named 'ifr_hwaddr'
make[2]: *** [if_posix.lo] Error 1
make[2]: Leaving directory `/shared/OMPI/openmpi-1.7.4rc1-solaris11-x64-ib-gcc452/BLD/opal/mca/if/posix_ipv4'

-Paul

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] 1.7.4rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3
Paul --

Does this patch fix it for you?

Index: opal/mca/if/posix_ipv4/configure.m4
===
--- opal/mca/if/posix_ipv4/configure.m4 (revision 29997)
+++ opal/mca/if/posix_ipv4/configure.m4 (working copy)
@@ -42,8 +42,10 @@
 )

 AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"],
- [AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
+ [AC_CHECK_MEMBERS([struct ifreq.ifr_hwaddr], [], [],
 [[#include ]])
+ AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [],
+ [[#include ]])
 ])

 AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"], [$1], [$2]);
Index: opal/mca/if/posix_ipv4/if_posix.c
===
--- opal/mca/if/posix_ipv4/if_posix.c (revision 29997)
+++ opal/mca/if/posix_ipv4/if_posix.c (working copy)
@@ -263,22 +263,22 @@
 /* generate CIDR and assign to netmask */
 intf->if_mask = prefix(((struct sockaddr_in*) &ifr->ifr_addr)->sin_addr.s_addr);

-#ifdef SIOCGIFHWADDR
-/* get the MAC address */
-if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
-opal_output(0, "btl_usnic_opal_ifinit: ioctl(SIOCGIFHWADDR) failed with errno=%d", errno);
-break;
-}
-memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
+#ifdef SIOCGIFHWADDR && defined(HAVE_STRUCT_IFREQ_IFR_HWADDR)
+/* get the MAC address */
+if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) {
+opal_output(0, "opal_ifinit: ioctl(SIOCGIFHWADDR) failed with errno=%d", errno);
+break;
+}
+memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6);
 #endif

 #if defined(SIOCGIFMTU) && defined(HAVE_STRUCT_IFREQ_IFR_MTU)
-/* get the MTU */
-if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
-opal_output(0, "btl_usnic_opal_ifinit: ioctl(SIOCGIFMTU) failed with errno=%d", errno);
-break;
-}
-intf->if_mtu = ifr->ifr_mtu;
+/* get the MTU */
+if (ioctl(sd, SIOCGIFMTU, ifr) < 0) {
+opal_output(0, "opal_ifinit: ioctl(SIOCGIFMTU) failed with errno=%d", errno);
+break;
+}
+intf->if_mtu = ifr->ifr_mtu;
 #endif

 opal_list_append(&opal_if_list, &(intf->super));
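One detail of the patch above may be worth a second look: `#ifdef` takes a single macro name, so in `#ifdef SIOCGIFHWADDR && defined(HAVE_STRUCT_IFREQ_IFR_HWADDR)` the text after the first name is not evaluated (GCC warns about extra tokens at the end of the directive). On Solaris, where SIOCGIFHWADDR is evidently defined, the guarded block would presumably still be compiled. Below is a minimal sketch of the `#if defined(...) && defined(...)` form that the MTU guard already uses. The `EXAMPLE_*` macros and `example_mac_lookup_enabled` are hypothetical stand-ins, not Open MPI names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the ioctl request exists on this platform,
 * but configure did not find the struct member. */
#define EXAMPLE_SIOCGIFHWADDR 1
/* #define EXAMPLE_HAVE_STRUCT_IFREQ_IFR_HWADDR 1 */

/* "#ifdef A && defined(B)" would test only A; "#if" evaluates the whole expression. */
#if defined(EXAMPLE_SIOCGIFHWADDR) && defined(EXAMPLE_HAVE_STRUCT_IFREQ_IFR_HWADDR)
#define EXAMPLE_MAC_LOOKUP 1
#else
#define EXAMPLE_MAC_LOOKUP 0
#endif

/* Stores 1 in *enabled when both conditions hold, 0 otherwise. */
void example_mac_lookup_enabled(int *enabled)
{
    assert(enabled != NULL);
    *enabled = EXAMPLE_MAC_LOOKUP;
}
```

With only the first macro defined, as here, the lookup is compiled out, which is the behaviour the patch appears to intend for Solaris.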
Re: [OMPI devel] 1.7.4rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3
Jeff,

The patch looks fine to my eyes, but I cannot test it:

1) Not sure if email botched whitespace or what, but the patch didn't apply to if_posix.c.
2) Even if it did, I don't have sufficiently new autoconf on that system to "use" the configure.m4 part of the patch.

Any chance of a patched-and-autogen'ed tarball to test?

-Paul
Re: [OMPI devel] 1.7.4rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3
Try http://www.open-mpi.org/~jsquyres/unofficial/.

Should have both "if" fixes in it.
Re: [OMPI devel] 1.7.4rc1 build failure: OpenBSD-5 and NetBSD-6
On Dec 19, 2013, at 6:27 PM, Paul Hargrove wrote: > When building 1.7.4rc1 on OpenBSD-5 and NetBSD-6 (both amd64) I see what > appears to be the same three errors ("make" output at end of this email) on > both platforms. > > All three syntax errors appears to be collisions on the symbol if_mtu: > > -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.h | grep -w -e 182 >182 OPAL_DECLSPEC int opal_ifindextomtu(int if_index, int *if_mtu); > -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/mca/if/if.h | grep -w -e 98 > 98 int if_mtu; > -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.c | grep -w -e 482 >482 int opal_ifindextomtu(int if_index, int *if_mtu) > > -bash-4.2$ grep if_mtu /usr/include/net/if.h > #define if_mtu if_data.ifi_mtu\ Bah. Terrible. Ok, thanks -- I'll fix... (see the tar ball I just sent you... should have this fix in it) -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
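The collision quoted above can be reproduced in isolation. In the sketch below the `#define` is copied from the grep output; `struct example_if_data`, `struct example_ifnet`, `example_ifindextomtu`, and the parameter name `mtu` are hypothetical. Renaming the parameter is one possible fix; the thread does not show which fix was actually committed.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the BSD <net/if.h> definitions (hypothetical struct names). */
struct example_if_data { int ifi_mtu; };
struct example_ifnet   { struct example_if_data if_data; };
#define if_mtu if_data.ifi_mtu

/*
 * With the macro active, a declaration such as
 *     int example_ifindextomtu(int if_index, int *if_mtu);
 * expands to "... int *if_data.ifi_mtu)", which yields the
 * "expected ';', ',' or ')' before '.' token" error.
 * A parameter name that is not a system macro avoids the expansion.
 */
int example_ifindextomtu(const struct example_ifnet *ifp, int *mtu)
{
    assert(ifp != NULL && mtu != NULL);
    *mtu = ifp->if_mtu;   /* expands to ifp->if_data.ifi_mtu */
    return 0;
}
```

The same reasoning applies to the struct field at opal/mca/if/if.h line 98: any identifier spelled `if_mtu` is rewritten once the system header has been included.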
Re: [OMPI devel] 1.7.4rc1 build failure: FreeBSD-9
Fixed and cmr'd thanks! On Dec 19, 2013, at 3:10 PM, Paul Hargrove wrote: > I see the failure below when building 1.7.4rc1 on FreeBSD-9 (amd64). > It looks to be just a missing header, probably sys/stat.h. > > $ gcc --version > gcc (GCC) 4.2.1 20070831 patched [FreeBSD] > > Only configure option passed was --prefix-... > > -Paul > > > > Making all in mca/sharedfp/sm > CC sharedfp_sm.lo > CC sharedfp_sm_component.lo > CC sharedfp_sm_seek.lo > CC sharedfp_sm_get_position.lo > CC sharedfp_sm_request_position.lo > CC sharedfp_sm_write.lo > CC sharedfp_sm_iwrite.lo > CC sharedfp_sm_read.lo > CC sharedfp_sm_iread.lo > CC sharedfp_sm_file_open.lo > /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c: > In function 'mca_sharedfp_sm_file_open': > /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121: > error: 'S_IRUSR' undeclared (first use in this function) > /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121: > error: (Each undeclared identifier is reported only once > /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121: > error: for each function it appears in.) > /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121: > error: 'S_IWUSR' undeclared (first use in this function) > /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121: > error: 'S_IRGRP' undeclared (first use in this function) > /home/phargrov/OMPI/openmpi-1.7.4rc1-freebsd9-amd64/openmpi-1.7.4rc1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:121: > error: 'S_IROTH' undeclared (first use in this function) > *** [sharedfp_sm_file_open.lo] Error code 1 > > > > -- > Paul H. 
Hargrove phhargr...@lbl.gov > Future Technologies Group > Computer and Data Sciences Department Tel: +1-510-495-2352 > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900 > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
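The four S_I* permission macros named in the errors above are specified by POSIX to come from <sys/stat.h>, which fits the diagnosis of a missing header. A minimal sketch follows; `example_shared_file_mode` is a hypothetical helper, not the code in sharedfp_sm_file_open.c, and the committed fix is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>   /* S_IRUSR, S_IWUSR, S_IRGRP, S_IROTH */

/* Stores the rw-r--r-- permission bits built from the four macros. */
void example_shared_file_mode(mode_t *mode)
{
    assert(mode != NULL);
    *mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;
}
```

Some platforms pull this header in indirectly through <fcntl.h> or similar, which may be why the omission went unnoticed elsewhere.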
Re: [OMPI devel] 1.7.4rc1 build failure: OpenBSD-5 and NetBSD-6
Jeff,

The unofficial "rc2forpaul" gets past the (disgusting) if_mtu problem on both platforms. On NetBSD-6 the build completes ("make install" fails, but I'll report that separately). However, on OpenBSD-5 we now encounter another failure about 20 files later:

  CC sys_limits.lo
/home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c: In function 'opal_util_init_sys_limits':
/home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:172: error: 'RLIMIT_AS' undeclared (first use in this function)
/home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:172: error: (Each undeclared identifier is reported only once
/home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:172: error: for each function it appears in.)
*** Error 1 in opal/util (Makefile:1692 'sys_limits.lo': @echo " CC " sys_limits.lo;depbase=`echo sys_limits.lo | sed 's|[^/]*$|.deps/...)
*** Error 1 in opal/util (Makefile:1780 'all-recursive')

The getrlimit manpage on this platform does not list RLIMIT_AS. Running "grep -rl RLIMIT_AS /usr/include" confirms that this constant does not exist. So, I think "#ifdef RLIMIT_AS" is required.

-Paul

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
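The "#ifdef RLIMIT_AS" guard suggested above could look roughly like the following. This is a sketch under assumptions, not the code in sys_limits.c: `example_get_as_limit` and its return convention are hypothetical, and only `getrlimit()` and `RLIMIT_AS` come from the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

/*
 * Queries the address-space limit where the platform defines RLIMIT_AS.
 * Returns 1 and fills *out on success, 0 if the constant is missing,
 * and -1 if getrlimit() itself fails.
 */
int example_get_as_limit(struct rlimit *out)
{
    assert(out != NULL);
#ifdef RLIMIT_AS
    return (getrlimit(RLIMIT_AS, out) == 0) ? 1 : -1;
#else
    (void) out;   /* constant absent, e.g. OpenBSD 5 per the report above */
    return 0;
#endif
}
```

A caller can treat a return of 0 as "limit not available" and skip whatever adjustment it would otherwise make.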
Re: [OMPI devel] 1.7.4rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3
Jeff,

The Solaris 11 / x86_64 build gets farther than before, but fails with the following:

make[2]: Entering directory `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic'
  CC btl_usnic_module.lo
In file included from /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_module.c:48:0:
/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_util.h:19:24: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'int'
make[2]: *** [btl_usnic_module.lo] Error 1
make[2]: Leaving directory `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi'
make: *** [all-recursive] Error 1

It looks like gcc is choking on __always_inline. I believe use of __opal_attribute_always_inline__ is the proper fix. I've made that change and resumed the build... will report again upon success or the next failure.

I'm not sure why one is trying to build the usnic btl on Solaris at all. Perhaps just because the OFED stack is present?

-Paul
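`__always_inline` is not a compiler keyword. It is a macro that some system headers provide (glibc's <sys/cdefs.h>, for one), which would explain why gcc on Solaris sees an unknown identifier at that spot. Below is a sketch of a portable wrapper in the spirit of `__opal_attribute_always_inline__`, whose real definition is not shown in this thread; `EXAMPLE_ALWAYS_INLINE`, `example_is_pow2`, and `example_align_up` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper: use the GCC attribute when available, else nothing. */
#if defined(__GNUC__)
#define EXAMPLE_ALWAYS_INLINE __attribute__((__always_inline__))
#else
#define EXAMPLE_ALWAYS_INLINE
#endif

/* Returns 1 if x is a power of two, 0 otherwise. */
static inline EXAMPLE_ALWAYS_INLINE int example_is_pow2(size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Rounds n up to a multiple of align, which must be a power of two. */
size_t example_align_up(size_t n, size_t align)
{
    assert(example_is_pow2(align));
    return (n + align - 1) & ~(align - 1);
}
```

Keeping the attribute behind a project-owned macro means a compiler without it still gets a plain `static inline` function.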
[OMPI devel] 1.7.4rc1 run failure on Solaris 10 / SPARC (not SIGBUS)
Testing with Solaris 10 on SPARC, I was expecting to encounter the bus error reported previously by Siegmar Gross. Instead I see the following hwloc-related abort:

$ env PATH=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin:$PATH LD_LIBRARY_PATH_64=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/lib:$LD_LIBRARY_PATH_64 OMPI_MCA_shmem_mmap_enable_nfs_warning=0 /home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin/mpirun -mca btl sm,self -np 2 examples/ring_c
--
Open MPI tried to bind a new process, but something went wrong. The
process was killed without launching the target application. Your job
will now abort.

  Local host:        niagara1
  Application name:  examples/ring_c
  Error message:     hwloc indicates cpu binding cannot be enforced
  Location:          /home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/openmpi-1.7.4rc1/orte/mca/odls/default/odls_default_module.c:478
--
2 total processes failed to start

I am assuming I just need some magic pixie dust to disable cpu binding. I'd appreciate some corresponding instructions. However, if this is NOT an expected/desired/known behavior, please let me know what I can/should do to help determine the root cause.

-Paul

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] 1.7.4rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3
Jeff, I didn't actually get very far after fixing __always_inline. In fact, the build still fails on the *same* line, but for a different (valid) reason: fls() is declared in /usr/include/string.h Making all in mca/btl/usnic make[2]: Entering directory `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic' CC btl_usnic_module.lo In file included from /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_module.c:48:0: /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_util.h:19:45: error: static declaration of �fls� follows non-static declaration /usr/include/string.h:87:12: note: previous declaration of �fls� was here make[2]: *** [btl_usnic_module.lo] Error 1 -Paul On Thu, Dec 19, 2013 at 6:35 PM, Paul Hargrove wrote: > Jeff, > > Solaris 11 / x86_64 build get farther than before, but fails with the > following: > > make[2]: Entering directory > `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic' > CC btl_usnic_module.lo > In file included from > /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_module.c:48:0: > /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_util.h:19:24: > error: expected �=�, �,�, �;�, �asm� or �__attribute__� before �int� > make[2]: *** [btl_usnic_module.lo] Error 1 > make[2]: Leaving directory > `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic' > make[1]: *** [all-recursive] Error 1 > make[1]: Leaving directory > `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi' > make: *** [all-recursive] Error 1 > > It looks like gcc is choking on __always_inline. > I believe use of __opal_attribute_always_inline__ is the proper fix. > I've made that change and resumed the build... 
will report again upon > success or the next failure. > > I'm not sure why one is trying to build the usnic btl on Solaris at all. > Perhaps just because the OFED stack is present? > > -Paul > > > On Thu, Dec 19, 2013 at 4:39 PM, Jeff Squyres (jsquyres) < > jsquy...@cisco.com> wrote: > >> Try http://www.open-mpi.org/~jsquyres/unofficial/. >> >> Should have both "if" fixes in it. >> >> >> On Dec 19, 2013, at 7:12 PM, Paul Hargrove wrote: >> >> > Jeff, >> > >> > The patch looks fine to my eyes, but I cannot test it: >> > >> > 1) Not sure if email botched whitespace or what, but the patch didn't >> apply to if_posix.c. >> > 2) Even if it did, I don't have sufficiently new autoconf on that >> system to "use" the configure.m4 part of the patch. >> > >> > Any chance of a patched-and-autogen'ed tarball to test? >> > >> > -Paul >> > >> > >> > On Thu, Dec 19, 2013 at 4:04 PM, Jeff Squyres (jsquyres) < >> jsquy...@cisco.com> wrote: >> > Paul -- >> > >> > Does this patch fix it for you? >> > >> > Index: opal/mca/if/posix_ipv4/configure.m4 >> > === >> > --- opal/mca/if/posix_ipv4/configure.m4 (revision 29997) >> > +++ opal/mca/if/posix_ipv4/configure.m4 (working copy) >> > @@ -42,8 +42,10 @@ >> > ) >> > >> > AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"], >> > - [AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [], >> > + [AC_CHECK_MEMBERS([struct ifreq.ifr_hwaddr], [], [], >> > [[#include ]]) >> > + AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [], >> > + [[#include ]]) >> >]) >> > >> > AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"], [$1], [$2]); >> > Index: opal/mca/if/posix_ipv4/if_posix.c >> > === >> > --- opal/mca/if/posix_ipv4/if_posix.c (revision 29997) >> > +++ opal/mca/if/posix_ipv4/if_posix.c (working copy) >> > @@ -263,22 +263,22 @@ >> > /* generate CIDR and assign to netmask */ >> > intf->if_mask = prefix(((struct sockaddr_in*) >> &ifr->ifr_addr)->sin_addr.s_addr); >> > >> > -#ifdef SIOCGIFHWADDR >> > -/* get the MAC address */ >> > -if (ioctl(sd, SIOCGIFHWADDR, 
ifr) < 0) { >> > -opal_output(0, "btl_usnic_opal_ifinit: >> ioctl(SIOCGIFHWADDR) failed with errno=%d", errno); >> > -break; >> > -} >> > -memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6); >> > +#if defined(SIOCGIFHWADDR) && defined(HAVE_STRUCT_IFREQ_IFR_HWADDR) >> > +/* get the MAC address */ >> > +if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) { >> > +opal_output(0, "opal_ifinit: ioctl(SIOCGIFHWADDR) failed >> with errno=%d", errno); >> > +break; >> > +} >> > +memcpy(intf->if_mac, ifr->ifr_hwaddr.sa_data, 6); >> > #endif >> > >> > #if defined(SIOCGIFMTU) && defined(HAVE_STRUCT_IFREQ_IF
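Note on combining the two conditions in the guard above: `#ifdef` accepts only a single identifier, so a test on both the ioctl request macro and the configure result is written as `#if defined(A) && defined(B)`. The following is a minimal sketch under stated assumptions, not the Open MPI source. `HAVE_STRUCT_IFREQ_IFR_HWADDR` is the macro that `AC_CHECK_MEMBERS([struct ifreq.ifr_hwaddr])` defines when the member exists; `have_hwaddr_support()` is a hypothetical helper added only for illustration.

```c
#include <stdio.h>

/*
 * Hypothetical helper (illustration only): reports whether the
 * MAC-address code path would be compiled in.
 *
 * Both macros must be defined for the branch to be taken:
 *   - SIOCGIFHWADDR comes from the platform's ioctl headers (if at all)
 *   - HAVE_STRUCT_IFREQ_IFR_HWADDR comes from configure (config.h)
 * This standalone file includes neither, so the #else branch is used.
 */
int have_hwaddr_support(void)
{
#if defined(SIOCGIFHWADDR) && defined(HAVE_STRUCT_IFREQ_IFR_HWADDR)
    return 1;
#else
    return 0;
#endif
}

/* Prints which branch the preprocessor selected. */
void report_hwaddr_support(void)
{
    printf("SIOCGIFHWADDR path compiled in: %s\n",
           have_hwaddr_support() ? "yes" : "no");
}
```

Testing the struct member separately matters because a platform can define the request macro without providing `ifr_hwaddr`, or the reverse; the guard only compiles the block when both are present.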
[OMPI devel] 1.7.4rc1 install failure: NetBSD-6 amd64
Attached is the output from "make install" of 1.7.4rc1 + Jeff's fix for the symbol conflict on "if_mtu".

There appear to be at least 2 issues.

1) There are lots of (not fatal) messages about ldconfig not existing, but according to the NetBSD lists that utility went away with the conversion from a.out to ELF.
2) Many warnings of the form
*** Warning: linker path does not have real file for library
3) The final (fatal) error about .libs/mca_btl_sm.soT not existing.

I am going to see if I can get a better result using the system libtool (which is 2.2.6b, thus OLDER than OMPI's 2.4.2) and will report back with my results.

-Paul

--
Paul H. Hargrove phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

install.log.bz2
Description: BZip2 compressed data
Re: [OMPI devel] 1.7.4rc1 build failure: OpenBSD-5 and NetBSD-6
I added protections for all the RLIMIT values, just in case. Thanks! Ralph On Dec 19, 2013, at 6:25 PM, Paul Hargrove wrote: > Jeff, > > The unofficial "rc2forpaul" gets past the (disgusting) if_mtu problem on both > platforms. > > On NetBSD-6 the build completes ("make install" fails, but I'll report that > separately). > > However, on OpenBSD-5 we now encounter another failure about 20 files later: > > CC sys_limits.lo > /home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c: > In function 'opal_util_init_sys_limits': > /home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:172: > error: 'RLIMIT_AS' undeclared (first use in this function) > /home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:172: > error: (Each undeclared identifier is reported only once > /home/phargrov/OMPI/openmpi-1.7.4rc2forpaul-openbsd5-amd64/openmpi-1.7.4rc2forpaul/opal/util/sys_limits.c:172: > error: for each function it appears in.) > *** Error 1 in opal/util (Makefile:1692 'sys_limits.lo': @echo " CC " > sys_limits.lo;depbase=`echo sys_limits.lo | sed 's|[^/]*$|.deps/...) > *** Error 1 in opal/util (Makefile:1780 'all-recursive') > > The getrlimit manpage on this platform does not list RLIMIT_AS. > Running "grep -rl RLIMIT_AS /usr/include" confirms that this constant does > not exist. > So, I think "#ifdef RLIMIT_AS" is required. > > -Paul > > > On Thu, Dec 19, 2013 at 4:39 PM, Jeff Squyres (jsquyres) > wrote: > On Dec 19, 2013, at 6:27 PM, Paul Hargrove wrote: > > > When building 1.7.4rc1 on OpenBSD-5 and NetBSD-6 (both amd64) I see what > > appears to be the same three errors ("make" output at end of this email) > > on both platforms. 
> > > > All three syntax errors appear to be collisions on the symbol if_mtu: > > > > -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.h | grep -w -e 182 > >182 OPAL_DECLSPEC int opal_ifindextomtu(int if_index, int *if_mtu); > > -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/mca/if/if.h | grep -w -e 98 > > 98 int if_mtu; > > -bash-4.2$ cat -n openmpi-1.7.4rc1/opal/util/if.c | grep -w -e 482 > >482 int opal_ifindextomtu(int if_index, int *if_mtu) > > > > -bash-4.2$ grep if_mtu /usr/include/net/if.h > > #define if_mtu if_data.ifi_mtu\ > > Bah. Terrible. Ok, thanks -- I'll fix... > > (see the tar ball I just sent you... should have this fix in it) > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > > -- > Paul H. Hargrove phhargr...@lbl.gov > Future Technologies Group > Computer and Data Sciences Department Tel: +1-510-495-2352 > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
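For reference, the "#ifdef RLIMIT_AS" protection suggested in this message can be sketched as follows. This is an illustration only and not the code in opal/util/sys_limits.c: `query_address_space_limit()` is a hypothetical helper, and the choice to return 0 when the constant is missing is an assumption made for the example.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not from sys_limits.c): stores the soft
 * address-space limit in *out and returns 1 when the platform defines
 * RLIMIT_AS and getrlimit() succeeds. Returns 0 otherwise, including on
 * platforms where RLIMIT_AS does not exist at all.
 */
int query_address_space_limit(rlim_t *out)
{
#ifdef RLIMIT_AS
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) == 0) {
        *out = rl.rlim_cur;
        return 1;
    }
#else
    /* RLIMIT_AS is not available here; nothing to query. */
    (void)out;
#endif
    return 0;
}
```

Because the guard is a preprocessor test, the reference to RLIMIT_AS never reaches the compiler on platforms that lack it, which is what avoids the "undeclared" error shown above. The same pattern applies to any other RLIMIT_* constant that is not universally available.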
Re: [OMPI devel] 1.7.4rc1 run failure on Solaris 10 / SPARC (not SIGBUS)
I believe this one has already been fixed and is in the nightly (1.7.4rc2) - for now, you can just set "--bind-to none" on the cmd line to get past it. On Dec 19, 2013, at 6:42 PM, Paul Hargrove wrote: > Testing with Solaris 10 on SPARC, I was expecting to encounter the bus error > reported previously by Siegmar Gross. Instead I see the following > hwloc-related abort: > > $ env > PATH=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin:$PATH > > LD_LIBRARY_PATH_64=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/lib:$LD_LIBRARY_PATH_64 > OMPI_MCA_shmem_mmap_enable_nfs_warning=0 > /home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin/mpirun > -mca btl sm,self -np 2 examples/ring_c > -- > Open MPI tried to bind a new process, but something went wrong. The > process was killed without launching the target application. Your job > will now abort. > > Local host:niagara1 > Application name: examples/ring_c > Error message: hwloc indicates cpu binding cannot be enforced > Location: > /home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/openmpi-1.7.4rc1/orte/mca/odls/default/odls_default_module.c:478 > -- > 2 total processes failed to start > > > I am assuming I just need some magic pixie dust to disable cpu binding. > I'd appreciate some corresponding instructions. > > However, if this is NOT an expected/desired/known behavior please let me know > what I can/should do to help determine the root cause. > > > -Paul > > -- > Paul H. Hargrove phhargr...@lbl.gov > Future Technologies Group > Computer and Data Sciences Department Tel: +1-510-495-2352 > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900 > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] 1.7.4rc1 run failure on Solaris 10 / SPARC (not SIGBUS)
Ralph, I can confirm "--bind-to none" worked to eliminate the error, but the test now appears to hang :-( Since you say the binding is probably fixed for rc2, I'll see if the latest nightly tarball works better by default. -Paul On Thu, Dec 19, 2013 at 7:19 PM, Ralph Castain wrote: > I believe this one has already been fixed and is in the nightly (1.7.4rc2) > - for now, you can just set "--bind-to none" on the cmd line to get past it > > > On Dec 19, 2013, at 6:42 PM, Paul Hargrove wrote: > > Testing with Solaris 10 on SPARC, I was expecting to encounter the bus > error reported previously by Siegmar Gross. Instead I see the following > hwloc-related abort: > > $ env > PATH=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin:$PATH > > LD_LIBRARY_PATH_64=/home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/lib:$LD_LIBRARY_PATH_64 > OMPI_MCA_shmem_mmap_enable_nfs_warning=0 > > /home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/INST/bin/mpirun > -mca btl sm,self -np 2 examples/ring_c > -- > Open MPI tried to bind a new process, but something went wrong. The > process was killed without launching the target application. Your job > will now abort. > > Local host:niagara1 > Application name: examples/ring_c > Error message: hwloc indicates cpu binding cannot be enforced > Location: > > /home/hargrove/OMPI/openmpi-1.7.4rc1-solaris10-sparcT2-ss12u3-v9/openmpi-1.7.4rc1/orte/mca/odls/default/odls_default_module.c:478 > -- > 2 total processes failed to start > > > I am assuming I just need some magic pixie dust to disable cpu binding. > I'd appreciate some corresponding instructions. > > However, if this is NOT an expected/desired/known behavior please let me > know what I can/should do to help determine the root cause. > > > -Paul > > -- > Paul H. 
Hargrove phhargr...@lbl.gov > Future Technologies Group > Computer and Data Sciences Department Tel: +1-510-495-2352 > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900 > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group Computer and Data Sciences Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
[OMPI devel] 1.7.4rc1 autogen error: NetBSD-6
Probably nobody cares, but I'll report this for completeness.

In trying to understand the "make install" failure on NetBSD-6 I ran "autogen.sh". The versions detected:

Searching for autoconf
Found autoconf version 2.69; checking version...
Found version component 2 -- need 2
Found version component 69 -- need 65
==> ACCEPTED
Searching for libtoolize
Found libtoolize version 2.2.6b; checking version...
Found version component 2 -- need 2
Found version component 2 -- need 2
Found version component 6b -- need 6b
==> ACCEPTED
Searching for automake
Found automake version 1.13.1; checking version...
Found version component 1 -- need 1
Found version component 13 -- need 12
==> ACCEPTED

The problem is that when run, the generated configure script dies as follows:

*** Java compiler
configure: WARNING: Found configure shell variable clash!
configure: WARNING: OPAL_VAR_SCOPE_PUSH called on "dir",
configure: WARNING: but it is already defined with value "/bin"
configure: WARNING: This usually indicates an error in configure.
configure: error: Cannot continue

-Paul

--
Paul H. Hargrove phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
Re: [OMPI devel] 1.74rc1 build failure: Solaris 11 / x86_64 / Sun Studio 12.3
FYI: My Solaris-11/x86-64/gcc-4.5.2 build completes with the following three changes:
+ Jeff's fix for if_posix.c
+ changing __always_inline to __opal_attribute_always_inline__
+ fixing the fls() conflict by renaming OMPI's to "my_fls()" (just a lazy choice).

-Paul

On Thu, Dec 19, 2013 at 6:47 PM, Paul Hargrove wrote: > Jeff, > > I didn't actually get very far after fixing __always_inline. > In fact, the build still fails on the *same* line, but for a different > (valid) reason: > fls() is declared in /usr/include/string.h > > Making all in mca/btl/usnic > make[2]: Entering directory > `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic' > CC btl_usnic_module.lo > In file included from > /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_module.c:48:0: > /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_util.h:19:45: > error: static declaration of 'fls' follows non-static declaration > /usr/include/string.h:87:12: note: previous declaration of 'fls' was here > make[2]: *** [btl_usnic_module.lo] Error 1 > > -Paul > > > On Thu, Dec 19, 2013 at 6:35 PM, Paul Hargrove wrote: > >> Jeff, >> >> Solaris 11 / x86_64 build gets farther than before, but fails with the >> following: >> >> make[2]: Entering directory >> `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic' >> CC btl_usnic_module.lo >> In file included from >> /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_module.c:48:0: >> /shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/openmpi-1.7.4rc2forpaul/ompi/mca/btl/usnic/btl_usnic_util.h:19:24: >> error: expected '=', ',', ';', 'asm' or '__attribute__' before 'int' >> make[2]: *** [btl_usnic_module.lo] Error 1 >> make[2]: Leaving directory >> 
`/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi/mca/btl/usnic' >> make[1]: *** [all-recursive] Error 1 >> make[1]: Leaving directory >> `/shared/OMPI/openmpi-1.7.4rc2forpaul-solaris11-x64-ib-gcc452/BLD/ompi' >> make: *** [all-recursive] Error 1 >> >> It looks like gcc is choking on __always_inline. >> I believe use of __opal_attribute_always_inline__ is the proper fix. >> I've made that change and resumed the build... will report again upon >> success or the next failure. >> >> I'm not sure why one is trying to build the usnic btl on Solaris at all. >> Perhaps just because the OFED stack is present? >> >> -Paul >> >> >> On Thu, Dec 19, 2013 at 4:39 PM, Jeff Squyres (jsquyres) < >> jsquy...@cisco.com> wrote: >> >>> Try http://www.open-mpi.org/~jsquyres/unofficial/. >>> >>> Should have both "if" fixes in it. >>> >>> >>> On Dec 19, 2013, at 7:12 PM, Paul Hargrove wrote: >>> >>> > Jeff, >>> > >>> > The patch looks fine to my eyes, but I cannot test it: >>> > >>> > 1) Not sure if email botched whitespace or what, but the patch didn't >>> apply to if_posix.c. >>> > 2) Even if it did, I don't have sufficiently new autoconf on that >>> system to "use" the configure.m4 part of the patch. >>> > >>> > Any chance of a patched-and-autogen'ed tarball to test? >>> > >>> > -Paul >>> > >>> > >>> > On Thu, Dec 19, 2013 at 4:04 PM, Jeff Squyres (jsquyres) < >>> jsquy...@cisco.com> wrote: >>> > Paul -- >>> > >>> > Does this patch fix it for you? 
>>> > >>> > Index: opal/mca/if/posix_ipv4/configure.m4 >>> > === >>> > --- opal/mca/if/posix_ipv4/configure.m4 (revision 29997) >>> > +++ opal/mca/if/posix_ipv4/configure.m4 (working copy) >>> > @@ -42,8 +42,10 @@ >>> > ) >>> > >>> > AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"], >>> > - [AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [], >>> > + [AC_CHECK_MEMBERS([struct ifreq.ifr_hwaddr], [], [], >>> > [[#include ]]) >>> > + AC_CHECK_MEMBERS([struct ifreq.ifr_mtu], [], [], >>> > + [[#include ]]) >>> >]) >>> > >>> > AS_IF([test "$opal_if_posix_ipv4_happy" = "yes"], [$1], [$2]); >>> > Index: opal/mca/if/posix_ipv4/if_posix.c >>> > === >>> > --- opal/mca/if/posix_ipv4/if_posix.c (revision 29997) >>> > +++ opal/mca/if/posix_ipv4/if_posix.c (working copy) >>> > @@ -263,22 +263,22 @@ >>> > /* generate CIDR and assign to netmask */ >>> > intf->if_mask = prefix(((struct sockaddr_in*) >>> &ifr->ifr_addr)->sin_addr.s_addr); >>> > >>> > -#ifdef SIOCGIFHWADDR >>> > -/* get the MAC address */ >>> > -if (ioctl(sd, SIOCGIFHWADDR, ifr) < 0) { >>> > -opal_output(0, "btl_usnic_opal_ifinit: >>> ioctl(SIOCGIFHWADDR) failed with errno=%d", errno); >>> > -break; >>> > -} >>> > -memcpy(intf->if_mac, ifr->ifr_hwaddr.sa
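Regarding the fls() rename listed at the top of this message: Solaris 11 declares fls() in /usr/include/string.h, so a static function of the same name in btl_usnic_util.h is rejected by the compiler. Below is a minimal sketch of the rename approach. The loop-based body is an illustrative find-last-set written for this note, not the code from btl_usnic_util.h, and "my_fls" is only the placeholder name mentioned above. The Open MPI attribute macro __opal_attribute_always_inline__ is left out because it is not available in a standalone file.

```c
#include <string.h>  /* on Solaris 11 this header declares fls() */

/*
 * Illustrative find-last-set under a non-clashing name.
 * Returns the 1-based position of the most significant set bit,
 * or 0 when no bit is set.
 */
static inline int my_fls(unsigned int x)
{
    int pos = 0;

    while (x != 0) {
        ++pos;      /* count how many shifts until the value is empty */
        x >>= 1;
    }
    return pos;
}
```

With the different name, including <string.h> no longer conflicts with the local helper, whether or not the platform provides its own fls(). A project-specific prefix would be a more durable choice than "my_".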