[OMPI users] [PATCH] hooks: disable malloc override inside of Gentoo sandbox
As described in the comments in the source, Gentoo's own version of fakeroot,
sandbox, also runs into hangs when malloc is overridden.  Sandbox environments
can easily be detected by looking for SANDBOX_PID in the environment.  When
detected, employ the same fix used for fakeroot.

See https://bugs.gentoo.org/show_bug.cgi?id=462602
---
 opal/mca/memory/linux/hooks.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/opal/mca/memory/linux/hooks.c b/opal/mca/memory/linux/hooks.c
index 6a1646f..ce91e76 100644
--- a/opal/mca/memory/linux/hooks.c
+++ b/opal/mca/memory/linux/hooks.c
@@ -747,9 +747,16 @@ static void opal_memory_linux_malloc_init_hook(void)
        "fakeroot" build environment that allocates memory during
        stat() (see http://bugs.debian.org/531522).  It may not be
        necessary any more since we're using access(), not stat().  But
-       we'll leave the check, anyway. */
+       we'll leave the check, anyway.
+
+       This is also an issue when using Gentoo's version of 'fakeroot',
+       sandbox v2.5.  Sandbox environments can also be detected fairly
+       easily by looking for SANDBOX_PID.
+    */
+
     if (getenv("FAKEROOTKEY") != NULL ||
-        getenv("FAKED_MODE") != NULL) {
+        getenv("FAKED_MODE") != NULL ||
+        getenv("SANDBOX_PID") != NULL ) {
         return;
     }
--
1.8.1.5

--
Justin Bronder
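For reference, a build or test script can check for the same condition the patch
tests; a minimal shell sketch (only the presence of SANDBOX_PID matters, not its
value):

    if [ -n "${SANDBOX_PID}" ]; then
        echo "Gentoo sandbox detected; Open MPI will skip its malloc hooks"
    fi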
Re: [OMPI users] Wrappers should put include path *after* user args
On 04/12/09 16:20 -0500, Jeff Squyres wrote:
> Oy -- more specifically, we should not be putting -I/usr/include on the
> command line *at all* (because it's special and already included by the
> compiler search paths; similar for /usr/lib and /usr/lib64).  We should have
> some special case code that looks for /usr/include and simply drops it.  Let
> me check and see what's going on...

I believe this was initially added here:
https://svn.open-mpi.org/trac/ompi/ticket/870

> Can you send the contents of your
> $prefix/share/openmpi/mpif90-wrapper-data.txt?  (it is *likely* in that
> directory, but it could be somewhere else under prefix as well -- the
> mpif90-wrapper-data.txt file is the important one)
>
> On Dec 4, 2009, at 1:08 PM, Jed Brown wrote:
>
> > Open MPI is installed by the distro with headers in /usr/include
> >
> >   $ mpif90 -showme:compile -I/some/special/path
> >   -I/usr/include -pthread -I/usr/lib/openmpi -I/some/special/path
> >
> > Here's why it's a problem:
> >
> > HDF5 is also installed in /usr with modules at /usr/include/h5*.mod.  A
> > new HDF5 cannot be compiled using the wrappers because it will always
> > resolve the USE statements to /usr/include, which is binary-incompatible
> > with the new version (at a minimum, they "fixed" the size of an argument
> > to H5Lget_info_f between 1.8.3 and 1.8.4).
> >
> > To build the library, the current choices are
> >
> >   (a) get rid of the system copy before building
> >   (b) not use the mpif90 wrapper
> >
> > I just checked that MPICH2 wrappers consistently put command-line args
> > before the wrapper args.
> >
> > Jed

Any news on this?  It doesn't look like it made it into the 1.4.1 release.

Also, it's not just /usr/include that is a problem, but the fact that the
wrappers are passing their paths before the user-specified ones.  Here's an
example using MPICH2 and Open MPI with non-standard install paths.

MPICH2 (some output stripped, as mpicc -compile_info prints everything):

jbronder@mejis ~ $ which mpicc
/usr/lib64/mpi/mpi-mpich2/usr/bin/mpicc
jbronder@mejis ~ $ mpicc -compile_info -I/bleh
x86_64-pc-linux-gnu-gcc -I/bleh -I/usr/lib64/mpi/mpi-mpich2/usr/include

Open MPI:

jbronder@mejis ~ $ which mpicc
/usr/lib64/mpi/mpi-openmpi/usr/bin/mpicc
jbronder@mejis ~ $ mpicc -showme:compile -I/bleh
-I/usr/lib64/mpi/mpi-openmpi/usr/include/openmpi -pthread -I/bleh

Thanks,
--
Justin Bronder
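For readers who want to see where the injected -I comes from: the wrapper pulls
its flags from the *-wrapper-data.txt file Jeff asks about.  The following is
only a rough sketch from memory of what a 1.4-era mpif90-wrapper-data.txt tends
to contain (field names and values vary by version and build; this is not the
poster's actual file):

    project=Open MPI
    language=Fortran 90
    compiler=gfortran
    module_option=-I
    extra_includes=
    preprocessor_flags=
    compiler_flags=-pthread
    linker_flags=
    libs=-lmpi_f90 -lmpi_f77 -lmpi -lopen-rte -lopen-pal -ldl -lutil
    includedir=${includedir}
    libdir=${libdir}

The include flag that lands in front of the user's arguments is generated from
the includedir field, which is why a build with prefix /usr ends up emitting
-I/usr/include before everything else.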
[OMPI users] Open-MPI 1.2 and GM
Having a user who requires some of the features of gfortran in 4.1.2, I
recently began building a new image.  The issue is that "-mca btl gm" fails
while "-mca mtl gm" works.  I have not yet done any benchmarking, as I was
wondering if the move to mtl is part of the upgrade.

Below are the packages I rebuilt:

  Kernel:     2.6.16.27 -> 2.6.20.1
  Gcc:        4.1.1 -> 4.1.2
  GM Drivers: 2.0.26 -> 2.0.26 (with patches for newer kernels)
  OpenMPI:    1.1.4 -> 1.2

The following works as expected:
  /usr/local/ompi-gnu/bin/mpirun -np 4 -mca mtl gm --host node84,node83 ./xhpl

The following fails:
  /usr/local/ompi-gnu/bin/mpirun -np 4 -mca btl gm --host node84,node83 ./xhpl

I've attached gzipped files as suggested in the "Getting Help" section of the
website, along with the output from the failed mpirun.  Both nodes are known
good Myrinet nodes, using FMA to map.

Thanks in advance,
--
Justin Bronder
Advanced Computing Research Lab
University of Maine, Orono
20 Godfrey Dr
Orono, ME 04473
www.clusters.umaine.edu

Output from the failed mpirun (the same pair of messages is printed for each
of the processes 0.1.0 through 0.1.3):

--------------------------------------------------------------------------
Process 0.1.0 is unable to reach 0.1.0 for MPI communication.

If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
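The help text above points at the usual cause of this symptom: when a BTL list
is given explicitly, the loopback component "self" must be listed as well.  A
sketch of the run with that change (untested here, and not necessarily the root
cause on this cluster):

    /usr/local/ompi-gnu/bin/mpirun -np 4 -mca btl gm,self --host node84,node83 ./xhpl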
Re: [OMPI users] problem about openmpi running
On a number of my Linux machines, /usr/local/lib is not searched by ldconfig
and hence is not going to be found by gcc.  You can fix this by adding
/usr/local/lib to /etc/ld.so.conf and running ldconfig (add the -v flag if you
want to see the output).

-Justin.

On 10/19/06, Durga Choudhury wrote:

  George,

  I knew that was the answer to Calin's question, but I still would like to
  understand the issue: by default, the Open MPI installer installs the
  libraries in /usr/local/lib, which is a standard location for the C compiler
  to look for libraries.  So *why* do I need to explicitly specify this with
  LD_LIBRARY_PATH?  For example, when I am compiling with pthread calls and
  pass -lpthread to gcc, I need not specify the location of libpthread.so with
  LD_LIBRARY_PATH.  I had the same problem as Calin, so I am curious.  This is
  assuming he has not redirected the installation path to some non-standard
  location.

  Thanks
  Durga

  On 10/19/06, George Bosilca wrote:
  >
  > Calin,
  >
  > Looks like you're missing a proper value for the LD_LIBRARY_PATH.
  > Please read the Open MPI FAQ at
  > http://www.open-mpi.org/faq/?category=running.
  >
  > Thanks,
  >   george.
  >
  > On Oct 19, 2006, at 6:41 AM, calin pal wrote:
  >
  > > hi,
  > > I am Calin from India.  I am working on Open MPI and have installed
  > > openmpi-1.1.1.tar.gz on four machines in our college lab.  I have
  > > written a "hello world" program on all the machines, but it only works
  > > properly on one of them.  On the other machines it gives:
  > >
  > >   hello: error while loading shared libraries: libmpi.so.0: cannot
  > >   open shared object file: no such file or directory
  > >
  > > What is the problem, and how do I solve it?  Please tell me.
  > >
  > > calin pal
  > > india
  > > fergusson college
  > > msc.tech (maths and computer sc.)

  --
  Devil wanted omnipresence; He therefore created communists.
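A minimal sketch of the fix described above, for systems where /etc/ld.so.conf
is a plain file (many distributions instead expect a drop-in file under
/etc/ld.so.conf.d/):

    echo '/usr/local/lib' >> /etc/ld.so.conf   # as root
    ldconfig -v | grep /usr/local/lib          # rebuild the cache and confirm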
Re: [OMPI users] Problem with Openmpi 1.1
1.) Compiling without XL will take a little while, but I have the setup for
the other questions ready now, so I figured I'd answer them right away.

2.) TCP works fine, and is quite quick compared to mpich-1.2.7p1 by the way.
I just reverified this:

WR11C2R4  5000  160  1  2  10.10  8.253e+00
||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 0.0412956 .. PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 0.0272613 .. PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0053214 .. PASSED

3.) Exactly the same setup, using mpichgm-1.2.6..14b:

WR11C2R4  5000  160  1  2  10.43  7.994e+00
||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 0.0353693 .. PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 0.0233491 .. PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0045577 .. PASSED

It also worked with mpichgm-1.2.6..15 (I believe this is the version; I don't
have a node up with it at the moment).  Obviously mpich-1.2.7p1 works as well
over ethernet.

Anyway, I'll begin the build with the standard gcc compilers that are included
with OS X.  This is powerpc-apple-darwin8-gcc-4.0.1.

Thanks,
Justin.

Jeff Squyres (jsquyres) wrote:
> Justin --
>
> Can we eliminate some variables so that we can figure out where the
> error is originating?
>
> - Can you try compiling without the XL compilers?
> - Can you try running with just TCP (and not Myrinet)?
> - With the same support library installation (such as BLAS, etc.,
>   assumedly also compiled with XL), can you try another MPI (e.g., LAM,
>   MPICH-gm, whatever)?
>
> Let us know what you find.  Thanks!
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On Behalf Of Justin Bronder
> Sent: Thursday, July 06, 2006 3:16 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Problem with Openmpi 1.1
>
> [The forwarded message that follows reproduces, verbatim, the "With
> 1.0.3a1r10670 the same problem is occurring..." message and its quoted
> context; see the next message in this archive for the full text.  The
> quoted copy is truncated.]
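The transport comparison described in point 2.) above can be pinned down
explicitly on the mpirun command line; a minimal sketch (process count and
binary as in this thread, and note that the loopback "self" component has to
be listed whenever the BTL list is given by hand):

    # run over TCP only
    mpirun -np 2 -mca btl tcp,self ./xhpl
    # run over Myrinet/GM only
    mpirun -np 2 -mca btl gm,self ./xhpl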
Re: [OMPI users] Problem with Openmpi 1.1
With 1.0.3a1r10670 the same problem is occurring, with the same configure
arguments as before.  For clarity, the Myrinet driver we are using is 2.0.21:

node90:~/src/hpl/bin/ompi-xl-1.0.3 jbronder$ gm_board_info
GM build ID is "2.0.21_MacOSX_rc20050429075134PDT
r...@node96.meldrew.clusters.umaine.edu:/usr/src/gm-2.0.21_MacOSX
Fri Jun 16 14:39:45 EDT 2006."

node90:~/src/hpl/bin/ompi-xl-1.0.3 jbronder$ /usr/local/ompi-xl-1.0.3/bin/mpirun -np 2 xhpl
This succeeds.
||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 0.1196787 .. PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 0.0283195 .. PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0063300 .. PASSED

node90:~/src/hpl/bin/ompi-xl-1.0.3 jbronder$ /usr/local/ompi-xl-1.0.3/bin/mpirun -mca btl gm -np 2 xhpl
This fails.
||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 717370209518881444284334080.000 .. FAILED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 226686309135.4274597 .. FAILED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 2386641249.6518722 .. FAILED
||Ax-b||_oo  . . . . . . . . . . . . . . . . . = 2037398812542965504.00
||A||_oo . . . . . . . . . . . . . . . . . . . = 2561.554752
||A||_1  . . . . . . . . . . . . . . . . . . . = 2558.129237
||x||_oo . . . . . . . . . . . . . . . . . . . = 300175355203841216.00
||x||_1  . . . . . . . . . . . . . . . . . . . = 31645943341479366656.00

Does anyone have a working system with OS X and Myrinet (GM)?  If so, I'd
love to hear the configure arguments and various versions you are using.
Bonus points if you are using the IBM XL compilers.

Thanks,
Justin.

On 7/6/06, Justin Bronder <jsbron...@gmail.com> wrote:

  Yes, that output was actually cut and pasted from an OS X run.  I'm about
  to test against 1.0.3a1r10670.

  Justin.

  On 7/6/06, Galen M. Shipman <gship...@lanl.gov> wrote:
  > Justin,
  >
  > Is the OS X run showing the same residual failure?
  >
  > - Galen
  >
  > On Jul 6, 2006, at 10:49 AM, Justin Bronder wrote:
  >
  > [The remainder of the quote reproduces, verbatim, the two earlier
  > messages in this thread ("Disregard the failure on Linux..." and "As far
  > as the nightly builds go...") together with their quoted context, all of
  > which appear later in this archive.  The quoted copy is truncated
  > part-way through the system information.]
Re: [OMPI users] Problem with Openmpi 1.1
Disregard the failure on Linux; a rebuild from scratch of HPL and OpenMPI
seems to have resolved the issue.  At least I'm not getting the errors during
the residual checks.

However, this is persisting under OS X.

Thanks,
Justin.

On 7/6/06, Justin Bronder <jsbron...@gmail.com> wrote:

  For OS X:
    /usr/local/ompi-xl/bin/mpirun -mca btl gm -np 4 ./xhpl

  For Linux:
    ARCH=ompi-gnu-1.1.1a
    /usr/local/$ARCH/bin/mpiexec -mca btl gm -np 2 -path /usr/local/$ARCH/bin ./xhpl

  Thanks for the speedy response,
  Justin.

  On 7/6/06, Galen M. Shipman <gship...@lanl.gov> wrote:
  > Hey Justin,
  > Please provide us your mca parameters (if any), these could be in a
  > config file, environment variables or on the command line.
  >
  > Thanks,
  >
  > Galen
  >
  > On Jul 6, 2006, at 9:22 AM, Justin Bronder wrote:
  >
  > [The remainder of the quote reproduces, verbatim, the earlier report
  > "As far as the nightly builds go..." -- the failed HPL residual checks,
  > system information, and ompi_info listing -- which appears in full as
  > the next message in this archive.  The quoted copy here is truncated
  > part-way through the ompi_info output.]
Re: [OMPI users] Problem with Openmpi 1.1
As far as the nightly builds go, I'm still seeing what I believe to be this
problem in both r10670 and r10652.  This is happening with both Linux and
OS X.  Below are the systems and ompi_info for the newest revision, 10670.

As an example of the error, when running HPL with Myrinet I get the following
error.  Using tcp everything is fine and I see the results I'd expect.

||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 42820214496954887558164928727596662784.000 .. FAILED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 156556068835.2711182 .. FAILED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 1156439380.5172558 .. FAILED
||Ax-b||_oo  . . . . . . . . . . . . . . . . . = 272683853978565028754868928512.00
||A||_oo . . . . . . . . . . . . . . . . . . . = 3822.884181
||A||_1  . . . . . . . . . . . . . . . . . . . = 3823.922627
||x||_oo . . . . . . . . . . . . . . . . . . . = 37037692483529688659798261760.00
||x||_1  . . . . . . . . . . . . . . . . . . . = 4102704048669982798475494948864.00
======================================================================

Finished 1 tests with the following results:
  0 tests completed and passed residual checks,
  1 tests completed and failed residual checks,
  0 tests skipped because of illegal input values.

Linux node41 2.6.16.19 #1 SMP Wed Jun 21 17:22:01 EDT 2006 ppc64 PPC970FX, altivec supported GNU/Linux

jbronder@node41 ~ $ /usr/local/ompi-gnu-1.1.1a/bin/ompi_info
                Open MPI: 1.1.1a1r10670
   Open MPI SVN revision: r10670
                Open RTE: 1.1.1a1r10670
   Open RTE SVN revision: r10670
                    OPAL: 1.1.1a1r10670
       OPAL SVN revision: r10670
                  Prefix: /usr/local/ompi-gnu-1.1.1a
 Configured architecture: powerpc64-unknown-linux-gnu
           Configured by: root
           Configured on: Thu Jul 6 10:15:37 EDT 2006
          Configure host: node41
                Built by: root
                Built on: Thu Jul 6 10:28:14 EDT 2006
              Built host: node41
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/powerpc64-unknown-linux-gnu/gcc-bin/4.1.0/gfortran
      Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/powerpc64-unknown-linux-gnu/gcc-bin/4.1.0/gfortran
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
              MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1.1)
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.1)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.1)
               MCA timer: linux (MCA v1.0, API v1.0, Component v1.1.1)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.1)
                MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.1)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.1.1)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.1)
                MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.1)
                  MCA io: romio (MCA v1.0, API v1.0, Component v1.1.1)
               MCA mpool: gm (MCA v1.0, API v1.0, Component v1.1.1)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.1)
                 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.1)
                 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.1)
              MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.1)
                 MCA btl: gm (MCA v1.0, API v1.0, Component v1.1.1)
                 MCA btl: self (MCA v1.0, API v1.0, Component v1.1.1)
                 MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.1)
                 MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
                MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.1)
                 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
                 MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.1)
                 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.1)
                 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.1)
                 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.1)
                 MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.1)
[OMPI users] OpenMpi 1.1 and Torque 2.1.1
I'm having trouble getting OpenMPI to execute jobs when submitting through
Torque.  Everything works fine if I use the included mpirun scripts, but this
is obviously not a good solution for the general users on the cluster.  I'm
running under OS X 10.4, Darwin 8.6.0.

I configured OpenMPI with:

export CC=/opt/ibmcmp/vac/6.0/bin/xlc
export CXX=/opt/ibmcmp/vacpp/6.0/bin/xlc++
export FC=/opt/ibmcmp/xlf/8.1/bin/xlf90_r
export F77=/opt/ibmcmp/xlf/8.1/bin/xlf_r
export LDFLAGS=-lSystemStubs
export LIBTOOL=glibtool
PREFIX=/usr/local/ompi-xl

./configure \
    --prefix=$PREFIX \
    --with-tm=/usr/local/pbs/ \
    --with-gm=/opt/gm \
    --enable-static \
    --disable-cxx

I also had to employ the fix listed in:
http://www.open-mpi.org/community/lists/users/2006/04/1007.php

I've attached the output of ompi_info while in an interactive job.  Looking
through the list, I can at least save a bit of trouble by listing what does
work.  Anything outside of Torque seems fine.  From within an interactive job,
pbsdsh works fine, hence the earlier problems with poll are fixed.

Here is the error that is reported when I attempt to run hostname on one
processor:

node96:/usr/src/openmpi-1.1 jbronder$ /usr/local/ompi-xl/bin/mpirun -np 1 -mca pls_tm_debug 1 /bin/hostname
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: final top-level argv:
[node96.meldrew.clusters.umaine.edu:00850] pls:tm:     orted --no-daemonize --bootproxy 1 --name  --num_procs 2 --vpid_start 0 --nodename  --universe jbron...@node96.meldrew.clusters.umaine.edu:default-universe --nsreplica "0.0.0;tcp://10.0.1.96:49395" --gprreplica "0.0.0;tcp://10.0.1.96:49395"
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: Set prefix: /usr/local/ompi-xl
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: launching on node localhost
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: resetting PATH: /usr/local/ompi-xl/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/pbs/bin:/usr/local/mpiexec/bin:/opt/ibmcmp/xlf/8.1/bin:/opt/ibmcmp/vac/6.0/bin:/opt/ibmcmp/vacpp/6.0/bin:/opt/gm/bin:/opt/fms/bin
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: found /usr/local/ompi-xl/bin/orted
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe jbron...@node96.meldrew.clusters.umaine.edu:default-universe --nsreplica "0.0.0;tcp://10.0.1.96:49395" --gprreplica "0.0.0;tcp://10.0.1.96:49395"
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: start_procs returned error -13
[node96.meldrew.clusters.umaine.edu:00850] [0,0,0] ORTE_ERROR_LOG: Not found in file rmgr_urm.c at line 184
[node96.meldrew.clusters.umaine.edu:00850] [0,0,0] ORTE_ERROR_LOG: Not found in file rmgr_urm.c at line 432
[node96.meldrew.clusters.umaine.edu:00850] mpirun: spawn failed with errno=-13
node96:/usr/src/openmpi-1.1 jbronder$

My thanks for any help in advance,
Justin Bronder.
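A quick sanity check before digging into the launcher itself (a sketch, not
from the original thread) is to confirm that Torque/TM support actually made
it into the build:

    /usr/local/ompi-xl/bin/ompi_info | grep tm

On a 1.1 build configured with --with-tm this should list tm components (for
example in the pls and ras frameworks); if nothing shows up, the Torque
libraries were not found at configure time.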
Re: [OMPI users] [PMX:VIRUS] Re: OpenMPI 1.0.3a1r10002 Fails to build with IBM XL Compilers.
On 5/31/06, Brian W. Barrett wrote:
> A quick workaround is to edit opal/include/opal_config.h and change the
> #defines for OMPI_CXX_GCC_INLINE_ASSEMBLY and OMPI_CC_GCC_INLINE_ASSEMBLY
> from 1 to 0.  That should allow you to build Open MPI with those XL
> compilers.  Hopefully IBM will fix this in a future version ;).

Well, I actually edited include/ompi_config.h and set both
OMPI_C_GCC_INLINE_ASSEMBLY and OMPI_CXX_GCC_INLINE_ASSEMBLY to 0.  This worked
until libtool tried to create a shared library:

/bin/sh ../libtool --tag=CC --mode=link gxlc_64 -O -DNDEBUG -qnokeyword=asm -export-dynamic -o libopal.la -rpath /usr/local/ompi-xl/lib libltdl/libltdlc.la asm/libasm.la class/libclass.la event/libevent.la mca/base/libmca_base.la memoryhooks/libopalmemory.la runtime/libruntime.la threads/libthreads.la util/libopalutil.la mca/maffinity/base/libmca_maffinity_base.la mca/memory/base/libmca_memory_base.la mca/memory/malloc_hooks/libmca_memory_malloc_hooks.la mca/paffinity/base/libmca_paffinity_base.la mca/timer/base/libmca_timer_base.la mca/timer/linux/libmca_timer_linux.la -lm -lutil -lnsl
mkdir .libs
gxlc_64 -shared --whole-archive libltdl/.libs/libltdlc.a asm/.libs/libasm.a class/.libs/libclass.a event/.libs/libevent.a mca/base/.libs/libmca_base.a memoryhooks/.libs/libopalmemory.a runtime/.libs/libruntime.a threads/.libs/libthreads.a util/.libs/libopalutil.a mca/maffinity/base/.libs/libmca_maffinity_base.a mca/memory/base/.libs/libmca_memory_base.a mca/memory/malloc_hooks/.libs/libmca_memory_malloc_hooks.a mca/paffinity/base/.libs/libmca_paffinity_base.a mca/timer/base/.libs/libmca_timer_base.a mca/timer/linux/.libs/libmca_timer_linux.a --no-whole-archive -ldl -lm -lutil -lnsl -lc -qnokeyword=asm -soname libopal.so.0 -o .libs/libopal.so.0.0.0
gxlc: 1501-257 Option --whole-archive is not recognized.  Option will be ignored.
gxlc: 1501-257 Option --no-whole-archive is not recognized.  Option will be ignored.
gxlc: 1501-257 Option -qnokeyword=asm is not recognized.  Option will be ignored.
gxlc: 1501-257 Option -soname is not recognized.  Option will be ignored.
xlc: 1501-218 file libopal.so.0 contains an incorrect file suffix
xlc: 1501-228 input file libopal.so.0 not found
xlc -q64 -qthreaded -D_REENTRANT -lpthread -qmkshrobj libltdl/.libs/libltdlc.a asm/.libs/libasm.a class/.libs/libclass.a event/.libs/libevent.a mca/base/.libs/libmca_base.a memoryhooks/.libs/libopalmemory.a runtime/.libs/libruntime.a threads/.libs/libthreads.a util/.libs/libopalutil.a mca/maffinity/base/.libs/libmca_maffinity_base.

I was able to fix this by editing libtool and replacing $CC with $LD in the
following:

# Commands used to build and install a shared archive.
archive_cmds="\$LD -shared \$libobjs \$deplibs \$compiler_flags \${wl}-soname \$wl\$soname -o \$lib"
archive_expsym_cmds="\$echo \\\"{ global:\\\" > \$output_objdir/\$libname.ver~
  cat \$export_symbols | sed -e \\\"s/(.*)/1;/\\\" >> \$output_objdir/\$libname.ver~
  \$echo \\\"local: *; };\\\" >> \$output_objdir/\$libname.ver~
  \$LD -shared \$libobjs \$deplibs \$compiler_flags \${wl}-soname \$wl\$soname \${wl}-version-script \${wl}\$output_objdir/\$libname.ver -o \$lib"

We then fail later on at:

make[3]: Entering directory `/usr/src/openmpi-1.0.3a1r10133/orte/tools/orted'
/bin/sh ../../../libtool --tag=CC --mode=link gxlc_64 -O -DNDEBUG -export-dynamic -o orted orted.o ../../../orte/liborte.la ../../../opal/libopal.la -lm -lutil -lnsl
gxlc_64 -O -DNDEBUG -o .libs/orted orted.o --export-dynamic ../../../orte/.libs/liborte.so /usr/src/openmpi-1.0.3a1r10133/opal/.libs/libopal.so ../../../opal/.libs/libopal.so -ldl -lm -lutil -lnsl --rpath /usr/local/ompi-xl/lib
gxlc: 1501-257 Option --export-dynamic is not recognized.  Option will be ignored.
gxlc: 1501-257 Option --rpath is not recognized.  Option will be ignored.
xlc: 1501-274 An incompatible level of gcc has been specified.
xlc: 1501-228 input file /usr/local/ompi-xl/lib not found
xlc -q64 -qthreaded -D_REENTRANT -lpthread -O -DNDEBUG -o .libs/orted orted.o ../../../orte/.libs/liborte.so /usr/src/openmpi-1.0.3a1r10133/opal/.libs/libopal.so ../../../opal/.libs/libopal.so -ldl -lm -lutil -lnsl /usr/local/ompi-xl/lib

Simply replacing ld for gxlc_64 here obviously won't work:

node42 orted # ld -O -DNDEBUG -o .libs/orted orted.o --export-dynamic ../../../orte/.libs/liborte.so /usr/src/openmpi-1.0.3a1r10133/opal/.libs/libopal.so ../../../opal/.libs/libopal.so -ldl -lm -lutil -lnsl --rpath /usr/local/ompi-xl/lib -lpthread
ld: warning: cannot find entry symbol _start; defaulting to 10013ed8

Of course, I've been told that directly linking with ld isn't such a great
idea in the first place.  Ideas?

Thanks,
Justin.
Re: [OMPI users] [PMX:VIRUS] Re: OpenMPI 1.0.3a1r10002 Fails to build with IBM XL Compilers.
On 5/30/06, Brian Barrett <brbar...@open-mpi.org> wrote:

  On May 28, 2006, at 8:48 AM, Justin Bronder wrote:

  > Brian Barrett wrote:
  >> On May 27, 2006, at 10:01 AM, Justin Bronder wrote:
  >>
  >>> I've attached the required logs.  Essentially the problem seems to
  >>> be that the XL Compilers fail to recognize "__asm__ __volatile__" in
  >>> opal/include/sys/powerpc/atomic.h when building 64-bit.
  >>>
  >>> I've tried using various xlc wrappers such as gxlc and xlc_r to
  >>> no avail.  The current log uses xlc_r_64 which is just a one line
  >>> shell script forcing the -q64 option.
  >>>
  >>> The same works flawlessly with gcc-4.1.0.  I'm using the nightly
  >>> build in order to link with Torque's new shared libraries.
  >>>
  >>> Any help would be greatly appreciated.  For reference here are
  >>> a few other things that may provide more information.
  >>
  >> Can you send the config.log file generated by configure?  What else
  >> is in the xlc_r_64 shell script, other than the -q64 option?
  >
  > I've attached the config.log, and here's what all of the *_64 scripts
  > look like.

  Can you try compiling without the -qkeyword=__volatile__?  It looks like
  XLC now has some support for GCC-style inline assembly, but it doesn't
  seem to be working in this case.  If that doesn't work, try setting CFLAGS
  and CXXFLAGS to include -qnokeyword=asm, which should disable GCC inline
  assembly entirely.

  I don't have access to a linux cluster with the XL compilers, so I can't
  verify this.  But it should work.

  Brian

No good sadly.  The same error continues to appear.  I had actually initially
attempted to compile without -qkeyword=__volatile__, but had hoped to force
xlc to recognize it.  This is obviously more of an XL issue, especially as
I've since found that everything works flawlessly in 32-bit mode.

If anyone has more suggestions, I'd love the help as I'm lost at this point.

Thanks for the help thus far,
Justin.
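For completeness, Brian's suggestion amounts to a configure invocation along
these lines (a sketch only; xlc_r_64 is the wrapper script from the earlier
messages, the matching XL C++ wrapper would be passed as CXX, and the exact
flags may need adjusting):

    ./configure CC=xlc_r_64 \
        CFLAGS="-qnokeyword=asm" CXXFLAGS="-qnokeyword=asm" \
        --prefix=/usr/local/ompi-xl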
[OMPI users] [PMX:VIRUS] Re: OpenMPI 1.0.3a1r10002 Fails to build with IBM XL Compilers.
Brian Barrett wrote:
> On May 27, 2006, at 10:01 AM, Justin Bronder wrote:
>
>> I've attached the required logs.  Essentially the problem seems to
>> be that the XL Compilers fail to recognize "__asm__ __volatile__" in
>> opal/include/sys/powerpc/atomic.h when building 64-bit.
>>
>> I've tried using various xlc wrappers such as gxlc and xlc_r to
>> no avail.  The current log uses xlc_r_64 which is just a one line
>> shell script forcing the -q64 option.
>>
>> The same works flawlessly with gcc-4.1.0.  I'm using the nightly
>> build in order to link with Torque's new shared libraries.
>>
>> Any help would be greatly appreciated.  For reference here are
>> a few other things that may provide more information.
>
> Can you send the config.log file generated by configure?  What else
> is in the xlc_r_64 shell script, other than the -q64 option?

I've attached the config.log, and here's what all of the *_64 scripts
look like:

node42 openmpi-1.0.3a1r10002 # cat /opt/ibmcmp/vac/8.0/bin/xlc_r_64
#!/bin/sh
xlc_r -q64 "$@"

Thanks,
--
Justin Bronder
University of Maine, Orono
Advanced Computing Research Lab
20 Godfrey Dr
Orono, ME 04473
www.clusters.umaine.edu

Mathematics Department
425 Neville Hall
Orono, ME 04469