[OMPI users] making all library components static (questions about --enable-mcs-static)
I wish to compile openmpi-1.2.2 so that it:
- supports MPI_THREAD_MULTIPLE
- enables profiling (generating gmon.out for each process after my client app finishes running), so that I can tell the CPU time of my client program apart from that of the MPI library
- statically links everything (including the client app and all components of the openmpi library)

The documentation says that --enable-mcs-static=<list> will enable static linking of the modules in the list, but what can I specify if I want to statically link *all* mcs modules without knowing the list of modules available? Also, this is the command I plan to use for configuring openmpi:

./configure CFLAGS="-g -pg -O3 -static" --prefix=./ --enable-mpi-threads --enable-progress-threads --enable-static --disable-shared --enable-mcs-static --with-devel-headers

Do you see anything wrong with this command? What else can I modify to satisfy the goals listed above? Thanks!
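One practical wrinkle with the profiling goal, worth noting here: with a -pg build every process writes its profile to a file named gmon.out in its current working directory, so ranks that share a directory overwrite each other's output. A minimal, glibc-specific sketch of one way around this (GMON_OUT_PREFIX is a glibc feature; the program name and rank PID below are placeholders):

  # Each rank writes gmon.out.<pid> instead of plain gmon.out (glibc only)
  export GMON_OUT_PREFIX=gmon.out
  mpirun -x GMON_OUT_PREFIX -np 4 ./my_mpi_app
  # Analyze one rank's profile against the statically linked binary
  gprof ./my_mpi_app gmon.out.12345 > rank12345.profile

mpirun's -x flag forwards the environment variable to ranks launched on other nodes; without it, remote ranks may not see the setting.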
Re: [OMPI users] SGE and OFED1.1
On Jun 6, 2007, at 5:44 PM, Michael Edwards wrote: I am running open-mpi 1.1.1-1 compiled from OFED1.1 which I downloaded from their website. You might want to upgrade your Open MPI installation; the current stable version is 1.2.2 (1.2.3 is pending shortly, fixing a few minor regressions that crept into 1.2.2). You can upgrade OMPI independent of OFED. Use the "--with-openib=/usr/local/ofed" option to OMPI's configure to pick up the OFED 1.1 installation (or, if you used a different OFED prefix, use that as the value for the --with-openib flag). I am using SGE installed via OSCAR 5.0 and when running under SGE I get the "mca_mpool_openib_register: ibv_reg_mr(0x59,528384) failed with error: Cannot allocate memory" error discussed at length in your FAQ. When I run from the command line using mpirun, I don't get the errors. Of course, I don't know how to tell if the code is actually using the IB interface instead of the GigE network... You can tell in two ways: 1. You can force the IB network to be used: mpirun --mca btl openib,self ... Alternatively, you can force the use of the gigE network: mpirun --mca btl tcp,self ... 2. If you look at the bandwidth/latency of any benchmark application, they should be obviously far better than over the gigE network. Here's running NetPIPE (http://www.scl.ameslab.gov/netpipe/): mpirun -np 2 NPmpi I tried the suggestions in the FAQ regarding setting the memlock parameter in /etc/security/limits.conf, and all the nodes return "unlimited" in response to "ulimit -l" after rebooting the nodes. The problem persists under SGE and still does not appear when simply using mpirun. The problem is that the SGE daemons are not starting with these memory limits. Therefore, processes that start under SGE inherit the low memory limits, and things go badly from there. I'm afraid I'm not familiar enough with SGE to know how to fix this. One Big Thing to check is that when the SGE daemons are started at init.d/boot time, they have the proper "unlimited" memory-locked limits. Then processes that start under SGE should inherit the "unlimited" value and be OK. That being said, SGE may also specifically override the memory-locked limits (some resource managers can do this based on site-wide policies). Check to see if SGE is doing this. I assumed it would work since openmpi 1.1.1 was included as working with SGE in OSCAR 5.0, but I don't know how different that version and the one included with OFED are. Any suggestions would be appreciated. ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
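For reference, a sketch of the memlock settings usually suggested alongside that FAQ entry, plus a way to check what limit an SGE job actually inherits. The limits.conf lines are the commonly suggested ones (adjust to site policy), and the qsub invocation is just one way to capture the output, not something specific to Open MPI:

  # /etc/security/limits.conf on every compute node
  *  soft  memlock  unlimited
  *  hard  memlock  unlimited

  # Check the limit from inside an SGE job, not just from an interactive shell
  echo "ulimit -l" | qsub -cwd -j y -o memlock.out

If memlock.out does not report "unlimited", the SGE execution daemons were started before the limit change and need to be restarted (or configured) so that their child processes inherit the higher limit.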
Re: [OMPI users] making all library components static (questions about --enable-mcs-static)
On Jun 7, 2007, at 2:07 AM, Code Master wrote: I wish to compile openmpi-1.2.2 so that it: - support MPI_THREAD_MULTIPLE - enable profiling (generate gmon.out for each process after my client app finish running) to tell apart CPU time of my client program from the MPI library - static linking for everything (incl client app and all components of library openmpi) in the documentation, it says that --enable-mcs-static= will enable static linking of the modules in the list, however what can I specify if I want to statically link *all* mcs modules without knowing the list of modules available? You should be able to do: ./configure --enable-static --disable-shared ... This will do 2 things: - libmpi (and friends) will be compiled as .a's (instead of .so's) - all the MCA components will be physically contained in libmpi (and friends) instead of being built as standalone plugins Also this is the plan for my command used for configuring openmpi: ./configure CFLAGS="-g -pg -O3 -static" --prefix=./ --enable-mpi-threads --enable-progress-threads --enable-static --disable-shared --enable-mcs-static --with-devel-headers It's actually --enable-mca-static, not --enable-mcs-static. However, that should not be necessary; the --enable-static and --disable-shared should take care of pulling all the components into the libraries for you. -- Jeff Squyres Cisco Systems
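Putting the corrections together, a configure invocation along these lines should cover all three goals. This is a sketch only: the install prefix is illustrative, and per the reply above --enable-mca-static is redundant once --enable-static --disable-shared are given:

  ./configure CFLAGS="-g -pg -O3" --prefix=/opt/openmpi-1.2.2-static \
      --enable-mpi-threads --enable-progress-threads \
      --enable-static --disable-shared
  make all install

An absolute --prefix is generally safer than --prefix=./, and -static is usually better passed when linking the application with mpicc than baked into the CFLAGS used to build the library itself.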
Re: [OMPI users] open-mpi with ifort in debug..trouble
Can you be a bit more descriptive? What is the exact compilation output (including the error)? And what exactly do you mean by "debug mode" -- compiling Open MPI with and without -g? Please see http://www.open-mpi.org/community/help/. FWIW, I do not see the symbol "output_local_symbols" in the Open MPI source code anywhere... On Jun 6, 2007, at 12:17 PM, Srinath Vadlamani wrote: So I have been trying to build multiple applications with an ifort+gcc implementation of Open-MPI. I wanted to build them in debug mode. This is on a Macbook Pro. System Version: Mac OS X 10.4.9 (8P2137) Kernel Version: Darwin 8.9.1 gcc: gcc version 4.0.1 ifort: 10.0.16 I have tried building PETSc from ftp://ftp.mcs.anl.gov/pub/petsc/release-snapshots/petsc-lite-2.3.3-p1.tar.gz in debug mode, and the error one gets in building the Fortran examples is: ld: internal error: output_local_symbols () inconsistent local symbol count This does not happen when *not* in debug mode. This is the same error we get with the same build parameters on one of our Fortran scientific codes. This error does *not* occur when using mpich2-1.0.5p4. ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
Re: [OMPI users] OpenMPI with multiple threads (MPI_THREAD_MULTIPLE)
This issue was just recently discussed on this list -- check out the thread here: http://www.open-mpi.org/community/lists/users/2007/05/3323.php On Jun 5, 2007, at 6:52 PM, smai...@ksu.edu wrote: Hi, I am trying a program in which I have 2 MPI nodes and each MPI node has 2 threads:

  Main node-thread                        Receive thread
  MPI_Init_Thread(MPI_THREAD_MULTIPLE);
  ...                                     ...
  LOOP:                                   LOOP:
    THREAD-BARRIER                          THREAD-BARRIER
    MPI_Send();                             MPI_Recv();
    goto LOOP;                              goto LOOP;
  ...                                     ...

The thread-barrier ensures that the 2 threads complete the previous iteration before moving ahead with this one. I get the following error SOMETIMES (while sometimes the program runs properly): *** An error occurred in MPI_Recv *** on communicator MPI_COMM_WORLD *** MPI_ERR_TRUNCATE: message truncated *** MPI_ERRORS_ARE_FATAL (goodbye) Somewhere I read that MPI_THREAD_MULTIPLE is not properly tested with OpenMPI. Can someone tell me whether I am making some mistake or is there any bug with MPI_THREAD_MULTIPLE? -Thanks and Regards, Sarang. ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
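One quick sanity check before digging further: confirm that the installed Open MPI build was configured with thread support at all, since MPI thread support was optional and off by default in the 1.2 series. A sketch (the exact wording of the ompi_info output varies between versions):

  ompi_info | grep -i thread
  # expect something like:  Thread support: posix (mpi: yes, progress: yes)

If that line reports "mpi: no", the library will grant a lower level than MPI_THREAD_MULTIPLE; it is also worth checking the "provided" value that MPI_Init_thread returns inside the program itself.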
[OMPI users] Problems configuring petsc-dev with openmpi-1.2.3a0r14886
Hi, I've been using openmpi-1.1.5 with no problems, but I decided to move up to version 1.2 yesterday. I am working with the developers' version of PETSc, so I attempted to configure it using the newly-installed open-mpi. When I tried this, though, I ran into the following problem (from PETSc's configure.log):

Possible ERROR while running preprocessor: In file included from /Users/willic3/geoframe/tools/openmpi-debug/include/mpi.h:1783, from /Users/willic3/geoframe/tools/petsc-dev-new/include/petsc.h:138, from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/ALE.hh:4, from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/Sifter.hh:15, from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/Sieve.hh:12, from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/Topology.hh:5, from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/SectionCompletion.hh:5, from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/Numbering.hh:5, from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/Mesh.hh:5, from conftest.cc:3:
/Users/willic3/geoframe/tools/openmpi-debug/include/openmpi/ompi/mpi/cxx/mpicxx.h:162:36: error: ompi/mpi/cxx/constants.h: No such file or directory
/Users/willic3/geoframe/tools/openmpi-debug/include/openmpi/ompi/mpi/cxx/mpicxx.h:163:36: error: ompi/mpi/cxx/functions.h: No such file or directory
/Users/willic3/geoframe/tools/openmpi-debug/include/openmpi/ompi/mpi/cxx/mpicxx.h:164:35: error: ompi/mpi/cxx/datatype.h: No such file or directory
ret = 256

Here is what I have for my mpicxx:

mpicxx --show
g++ -D_REENTRANT -I/Users/willic3/geoframe/tools/openmpi-debug/include -g -mcpu=G5 -Wl,-u,_munmap -Wl,-multiply_defined,suppress -L/Users/willic3/geoframe/tools/openmpi-debug/lib -lmpi_cxx -lmpi -lopen-rte -lopen-pal

I can make a change to mpicxx.h that fixes the problem:

diff mpicxx-orig.h mpicxx.h
162,164c162,164
< #include "ompi/mpi/cxx/constants.h"
< #include "ompi/mpi/cxx/functions.h"
< #include "ompi/mpi/cxx/datatype.h"
---
> #include "openmpi/ompi/mpi/cxx/constants.h"
> #include "openmpi/ompi/mpi/cxx/functions.h"
> #include "openmpi/ompi/mpi/cxx/datatype.h"

I don't know if this is the correct approach, though. Are the paths actually incorrect or have I configured open-mpi incorrectly? Thanks, Charles Charles A. Williams Dept. of Earth & Environmental Sciences Science Center, 2C01B Rensselaer Polytechnic Institute Troy, NY 12180 Phone: (518) 276-3369 FAX: (518) 276-2012 e-mail: will...@rpi.edu
Re: [OMPI users] Problems configuring petsc-dev with openmpi-1.2.3a0r14886
Yes, it is the correct approach. This code was just changed and then fixed in the immediate past (yesterday? or perhaps the day before?), and the fix was exactly as you described. https://svn.open-mpi.org/trac/ompi/changeset/14939 -- Jeff Squyres Cisco Systems
[OMPI users] Segfault in orted (home directory problem)
I am trying to switch to OpenMPI, and I ran into a problem: my home directory must exist on all the nodes, or orted will crash. I have a "master" machine where I initiate the mpirun command. Then I have a bunch of slave machines, which will also execute the MPI job. My user exists on all the machines, but the home directory is not mounted on the slaves, so it's only visible on the master node. I can log on a slave node, but don't have a home there. Of course the binary I'm running exists on all the machines (not in my home!). And the problem can be reproduced by running a shell command too, to make things simpler. We have thousands of slave nodes and we don't want to mount the users' homedirs on all the slaves, so a fix would be really really nice.

Example: I have 3 hosts, master, slave1, slave2. My home directory exists only on master. If I log on master and run "mpirun -host master,slave1 uname -a" I get a segfault. If I log on slave1 and run "mpirun -host slave1,slave2 uname -a", it runs fine. My home directory does not exist on either slave1 or slave2. If I log on master and run "mpirun -host master uname -a" it runs fine. I can run across several master nodes, it's fine too. So it runs fine if my home directory exists everywhere, or if it does not exist at all. If it exists only on some nodes and not others, orted crashes. I thought it could be related to my environment, but I created a new user with an empty home and it does the same thing. As soon as I create the homedir on slave1 and slave2 it works fine.

I'm using OpenMPI 1.2.2, here is the error message and the result of ompi_info. Short version (rnd04 is the master, r137n001 is a slave node).

-bash-3.00$ /usr/local/openmpi-1.2.2/bin/mpirun -host rnd04,r137n001 uname -a
Linux rnd04 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
[r137n001:31533] *** Process received signal ***
[r137n001:31533] Signal: Segmentation fault (11)
[r137n001:31533] Signal code: Address not mapped (1)
[r137n001:31533] Failing at address: 0x1
[r137n001:31533] [ 0] [0xe600]
[r137n001:31533] [ 1] /lib/tls/libc.so.6 [0xbf3bfc]
[r137n001:31533] [ 2] /lib/tls/libc.so.6(_IO_vfprintf+0xcb) [0xbf3e3b]
[r137n001:31533] [ 3] /usr/local/openmpi-1.2.2/lib/libopen-pal.so.0(opal_show_help+0x263) [0xf7f78de3]
[r137n001:31533] [ 4] /usr/local/openmpi-1.2.2/lib/libopen-rte.so.0(orte_rmgr_base_check_context_cwd+0xff) [0xf7fea7ef]
[r137n001:31533] [ 5] /usr/local/openmpi-1.2.2/lib/openmpi/mca_odls_default.so(orte_odls_default_launch_local_procs+0xe7f) [0xf7ea041f]
[r137n001:31533] [ 6] /usr/local/openmpi-1.2.2/bin/orted [0x804a1ea]
[r137n001:31533] [ 7] /usr/local/openmpi-1.2.2/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_deliver_notify_msg+0x136) [0xf7ef65c6]
[r137n001:31533] [ 8] /usr/local/openmpi-1.2.2/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_notify_recv+0x108) [0xf7ef4f68]
[r137n001:31533] [ 9] /usr/local/openmpi-1.2.2/lib/libopen-rte.so.0 [0xf7fd9a18]
[r137n001:31533] [10] /usr/local/openmpi-1.2.2/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x24c) [0xf7f05fdc]
[r137n001:31533] [11] /usr/local/openmpi-1.2.2/lib/openmpi/mca_oob_tcp.so [0xf7f07f61]
[r137n001:31533] [12] /usr/local/openmpi-1.2.2/lib/libopen-pal.so.0(opal_event_base_loop+0x388) [0xf7f67dd8]
[r137n001:31533] [13] /usr/local/openmpi-1.2.2/lib/libopen-pal.so.0(opal_event_loop+0x29) [0xf7f67fb9]
[r137n001:31533] [14] /usr/local/openmpi-1.2.2/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x37) [0xf7f053c7]
[r137n001:31533] [15] /usr/local/openmpi-1.2.2/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x374) [0xf7f09a04]
[r137n001:31533] [16] /usr/local/openmpi-1.2.2/lib/libopen-rte.so.0(mca_oob_recv_packed+0x4d) [0xf7fd980d]
[r137n001:31533] [17] /usr/local/openmpi-1.2.2/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_exec_compound_cmd+0x137) [0xf7ef55e7]
[r137n001:31533] [18] /usr/local/openmpi-1.2.2/bin/orted(main+0x99d) [0x8049d0d]
[r137n001:31533] [19] /lib/tls/libc.so.6(__libc_start_main+0xd3) [0xbcee23]
[r137n001:31533] [20] /usr/local/openmpi-1.2.2/bin/orted [0x80492e1]
[r137n001:31533] *** End of error message ***
mpirun noticed that job rank 1 with PID 31533 on node r137n001 exited on signal 11 (Segmentation fault).

If I create /home/toto on r137n001, it works fine (as root on r137n001: "mkdir /home/toto && chown toto:users /home/toto"):

-bash-3.00$ /usr/local/openmpi-1.2.2/bin/mpirun -host rnd04,r137n001 uname -a
Linux rnd04 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
Linux r137n001 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux

I tried to use ssh instead of rsh, it crashes too. If anyone knows a way to run OpenMPI jobs in this configuration where the home directory does not exist on all the nodes, it would really help! Or is there a way to fix orted so that it won't crash? Here
Re: [OMPI users] Segfault in orted (home directory problem)
That is the default behavior because having common home areas is fairly common, but with some work you can run your code from wherever is convenient. Using the -wd flag you can have the code run from wherever you want, but the code and data has to get there somehow. If you are using a batch scheduler it is fairly easy to write into your execution script a section that parses the list of assigned nodes and pushes your data and executable out to the scratch space on those nodes, and then cleans up afterward (or not). The sensible way to do this will depend a lot on what schedulers you are using and your application. Openmpi may have a trick to push data/executables around as well, but I haven't run across one yet. Mike Edwards
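For example, a minimal staging fragment of the kind described above, as it might appear in an SGE job script. The $PE_HOSTFILE parsing, the /scratch path, and the application name are assumptions about a typical setup, not something Open MPI itself provides:

  # Push the executable to local scratch on every node assigned to this job
  for node in $(awk '{print $1}' "$PE_HOSTFILE" | sort -u); do
      ssh "$node" mkdir -p "/scratch/$USER"
      scp ./myapp "$node:/scratch/$USER/"
  done
  mpirun -np "$NSLOTS" /scratch/$USER/myapp
  # ...clean up /scratch/$USER on each node afterwards if site policy requires it

Other resource managers expose the node list differently (e.g. $PBS_NODEFILE under PBS/Torque), but the idea is the same.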
Re: [OMPI users] Segfault in orted (home directory problem)
Have you tried the --wdir option yet? It should let you set your working directory to anywhere. I don't believe it will require you to have a home directory on the backend nodes, though I can't swear that ssh will be happy if you don't. Just do "mpirun -h" for a full list of options - it will describe the exact format of the wdir one plus others you might find useful. Ralph
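For instance, a sketch of what that looks like on the command line (the directory just needs to exist on every node; /tmp and the myapp name are used purely as illustrations):

  mpirun --wdir /tmp -host master,slave1,slave2 uname -a
  mpirun --wdir /tmp -host master,slave1 ./myapp

With --wdir given explicitly, the launcher changes to that directory instead of the user's home directory, which is the chdir that was failing on the slave nodes.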
Re: [OMPI users] Segfault in orted (home directory problem)
You're right, the --wdir option works fine! Thanks! I just tried an older version we had compiled (1.2b3), and the error was more explicit than the seg fault we get with 1.2.2:

Could not chdir to home directory /rdu/thomasco: No such file or directory
--
Failed to change to the working directory: <...>

-Guillaume
Re: [OMPI users] Segfault in orted (home directory problem)
Hmmm...well, we certainly will make a point to give you a better error message! Probably won't get it into 1.2.3, but it should make later releases. Thanks for letting me know. Ralph
[OMPI users] Issues with DL POLY
Hello, Does anyone have experience using DL POLY with OpenMPI? I've gotten it to compile, but when I run a simulation using mpirun with two dual-processor machines, it runs a little *slower* than on one CPU on one machine! Yet the program is running two instances on each node. Any ideas? The test programs included with OpenMPI show that it is running correctly across multiple nodes. Sorry if this is a little off-topic, I wasn't able to find help on the official DL POLY mailing list. Thank you! Aaron Thompson Vanderbilt University aaron.p.thomp...@vanderbilt.edu
Re: [OMPI users] Issues with DL POLY
We have a few users using DLPOLY with OMPI, running just fine. Watch out for what kind of simulation you are doing: like all MD software, not all simulations are better in parallel. In some, the communication overhead is much worse than running on just one CPU. I see this all the time. You could try just 2 CPUs on one node; sometimes that is OK (memory access vs. network access). But it's not uncommon. Brock Palen Center for Advanced Computing bro...@umich.edu (734)936-1985 On Jun 7, 2007, at 8:24 PM, Aaron Thompson wrote: Hello, Does anyone have experience using DL POLY with OpenMPI? I've gotten it to compile, but when I run a simulation using mpirun with two dual-processor machines, it runs a little *slower* than on one CPU on one machine! Yet the program is running two instances on each node. Any ideas? The test programs included with OpenMPI show that it is running correctly across multiple nodes. Sorry if this is a little off-topic, I wasn't able to find help on the official DL POLY mailing list. Thank you! Aaron Thompson Vanderbilt University aaron.p.thomp...@vanderbilt.edu ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Issues with DL POLY
If your problem size is not large enough, then any MPI program will perform worse on a "large number" of nodes because of the overhead involved in setting up the problem and network latency. Sometimes that "large number" is as small as two :) I am not at all familiar with DL POLY, but if you make the size of the problem larger you should see more performance benefit because the overhead will be small compared to the execution time. Just in general I would say to start with a problem that takes at least a minute on one node, run it a few times to see how much the run time varies, and then try it on two nodes. Especially if you are going to try and scale it much past that initial two-node version... On 6/7/07, Aaron Thompson wrote: Hello, Does anyone have experience using DL POLY with OpenMPI? I've gotten it to compile, but when I run a simulation using mpirun with two dual-processor machines, it runs a little *slower* than on one CPU on one machine! Yet the program is running two instances on each node. Any ideas? The test programs included with OpenMPI show that it is running correctly across multiple nodes. Sorry if this is a little off-topic, I wasn't able to find help on the official DL POLY mailing list. Thank you! Aaron Thompson Vanderbilt University aaron.p.thomp...@vanderbilt.edu ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
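A simple way to find the crossover point is to time the same input at a few process counts and only scale out once the single-node runs are clearly compute-bound. A sketch (DLPOLY.X is the conventional DL POLY executable name, but treat it and the host names as placeholders for your own setup):

  time mpirun -np 1 ./DLPOLY.X
  time mpirun -np 2 -host node1 ./DLPOLY.X          # both ranks on one dual-CPU node
  time mpirun -np 4 -host node1,node2 ./DLPOLY.X    # ranks spread across the network

If the 2-rank single-node run is already no faster than the serial run, a bigger system (e.g. more atoms) is needed before adding nodes will help.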
Re: [OMPI users] Issues with DL POLY
Are you saying t(single-process execution) < t(4-process execution) for identical problems on each (same total amount of data)? There's rarely a speedup in such a case-- processing the same amount of data while shipping some fraction of it over a slow network between processing steps is almost certain to be slower. Where things get interesting (and encouraging) is if you increase the total data being processed (hold data quantity per node constant). ben allan On Thu, Jun 07, 2007 at 08:24:03PM -0400, Aaron Thompson wrote: > Hello, > Does anyone have experience using DL POLY with OpenMPI? I've gotten > it to compile, but when I run a simulation using mpirun with two dual- > processor machines, it runs a little *slower* than on one CPU on one > machine! Yet the program is running two instances on each node. Any > ideas? The test programs included with OpenMPI show that it is > running correctly across multiple nodes. > Sorry if this is a little off-topic, I wasn't able to find help on > the official DL POLY mailing list. > > Thank you! > > Aaron Thompson > Vanderbilt University > aaron.p.thomp...@vanderbilt.edu > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] making all library components static (questions about --enable-mcs-static)
Hi Jeff (and everyone), Thanks! Now I have compiled openmpi-1.2.2 successfully under i386-Linux (Debian Sarge) with the following configuration:

./configure CFLAGS="-g -pg -O3" --enable-mpi-threads --enable-progress-threads --enable-static --disable-shared

However, when I compile my client program using mpicc with -static inserted (the compile is done by a makefile):

mpicc -static -g -pg -O3 -W -Wall -pedantic -std=c99 -o raytrace bbox.o cr.o env.o fbuf.o geo.o huprn.o husetup.o hutv.o isect.o main.o matrix.o memory.o poly.o raystack.o shade.o sph.o trace.o tri.o debug.o

it fails to link and complains:

In function `_int_malloc': multiple definition of `_int_malloc'
/usr/lib/libopen-pal.a(lt1-malloc.o)(.text+0x18a0):openmpi-1.2.2/opal/mca/memory/ptmalloc2/malloc.c:3954: first defined here
/usr/bin/ld: Warning: size of symbol `_int_malloc' changed from 1266 in /usr/lib/libopen-pal.a(lt1-malloc.o) to 1333 in /home/490_research/490/src/mpi.optimized_profiling//lib/libopen-pal.a(lt1-malloc.o)

So what could be going wrong here? Is it because openmpi has internal implementations of system-provided functions (such as malloc) that are also used in my program, and the one the client program uses is provided by the system whereas the one in the library has a different internal implementation? In that case, how could I do the static linking in my client program? I really need static linking as far as possible to do the profiling. Thanks! On 6/8/07, Jeff Squyres wrote: On Jun 7, 2007, at 2:07 AM, Code Master wrote: > I wish to compile openmpi-1.2.2 so that it: > - support MPI_THREAD_MULTIPLE > - enable profiling (generate gmon.out for each process after my > client app finish running) to tell apart CPU time of my client > program from the MPI library > - static linking for everything (incl client app and all components > of library openmpi) > > in the documentation, it says that --enable-mcs-static= > will enable static linking of the modules in the list, however what > can I specify if I want to statically link *all* mcs modules > without knowing the list of modules available? You should be able to do: ./configure --enable-static --disable-shared ... This will do 2 things: - libmpi (and friends) will be compiled as .a's (instead of .so's) - all the MCA components will be physically contained in libmpi (and friends) instead of being built as standalone plugins > Also this is the plan for my command used for configuring openmpi: > > ./configure CFLAGS="-g -pg -O3 -static" --prefix=./ --enable-mpi-threads --enable-progress-threads --enable-static --disable-shared > --enable-mcs-static --with-devel-headers It's actually --enable-mca-static, not --enable-mcs-static. However, that should not be necessary; the --enable-static and --disable-shared should take care of pulling all the components into the libraries for you. -- Jeff Squyres Cisco Systems ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
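The clash is between copies of the malloc replacement that Open MPI's ptmalloc2 memory-manager component carries (that is the malloc.c path in the linker message). Note also that the two libopen-pal.a files named in the warning come from different installations (/usr/lib vs. the tree under /home/...), so making sure only one Open MPI install is on the link path is the first thing to check. A workaround often suggested for fully static builds is to rebuild Open MPI without its internal memory manager; a sketch, assuming the option spelling below matches your version (check ./configure --help, and be aware that disabling the memory manager can cost performance on interconnects that rely on registered-memory caching):

  ./configure CFLAGS="-g -pg -O3" --enable-mpi-threads --enable-progress-threads \
      --enable-static --disable-shared --without-memory-manager
  make all install
  # then relink the application fully statically
  mpicc -static -g -pg -O3 -o raytrace *.o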