[OMPI users] making all library components static (questions about --enable-mcs-static)

2007-06-07 Thread Code Master

I wish to compile openmpi-1.2.2 so that it:
- supports MPI_THREAD_MULTIPLE
- enables profiling (generates a gmon.out for each process after my client app
finishes running), to tell apart the CPU time of my client program from the MPI
library
- links everything statically (including the client app and all components of
the openmpi library)

in the documentation, it says that --enable-mcs-static=<list> will
enable static linking of the modules in the list. However, what can I specify
if I want to statically link *all* mcs modules without knowing the list of
modules available?

Also this is the plan for my command used for configuring openmpi:

./configure CFLAGS="-g -pg -O3 -static" --prefix=./ --enable-mpi-threads
--enable-progress-threads --enable-static  --disable-shared
--enable-mcs-static --with-devel-headers

Do you see anything wrong with this command?  What else can I modify to
satisfy the goals listed above?

Thanks!


Re: [OMPI users] SGE and OFED1.1

2007-06-07 Thread Jeff Squyres

On Jun 6, 2007, at 5:44 PM, Michael Edwards wrote:


I am running open-mpi 1.1.1-1 compiled from OFED 1.1, which I downloaded
from their website.


You might want to upgrade your Open MPI installation; the current
stable version is 1.2.2 (1.2.3 is pending shortly, fixing a few minor
regressions that crept into 1.2.2).  You can upgrade OMPI
independent of OFED.  Use the "--with-openib=/usr/local/ofed" option
to OMPI's configure to pick up the OFED 1.1 installation (or, if you
used a different OFED prefix, use that as the value for the
--with-openib flag).
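
For example, a rebuild against the OFED 1.1 libraries might look
something like this (the install prefix below is only an illustration --
use whatever prefix and OFED path match your systems):

./configure --prefix=/opt/openmpi-1.2.2 --with-openib=/usr/local/ofed
make all install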



I am using SGE installed via OSCAR 5.0 and when running under SGE I
get the "mca_mpool_openib_register: ibv_reg_mr(0x59,528384) failed
with error: Cannot allocate memory" error discussed at length in your
FAQ.

When I run from the command line using mpirun, I don't get the errors.
 Of course, I don't know how to tell if the code is actually using the
IB interface instead of the GigE network...


You can tell in two ways:

1. You can force the IB network to be used:

mpirun --mca btl openib,self ...

Alternatively, you can force the use of the gigE network:

mpirun --mca btl tcp,self ...

2. If you look at the bandwidth/latency of running any benchmark
application, they should be obviously far better than the gigE
network.  Here's how to run NetPIPE (http://www.scl.ameslab.gov/netpipe/):


mpirun -np 2 NPmpi
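
To make the comparison explicit, you could run the same benchmark once
over each network (sketch only; NPmpi is assumed to be in your PATH):

mpirun -np 2 --mca btl openib,self NPmpi
mpirun -np 2 --mca btl tcp,self NPmpi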


I tried the suggestions in the FAQ regarding setting the memlock
parameter in /etc/security/limits.conf, and all the nodes return
"unlimited" in response to "ulimit -l" after rebooting the nodes.  The
problem persists under SGE and still does not appear when simply using
mpirun.


The problem is that the SGE daemons are not starting with these  
memory limits.  Therefore, processes that start under SGE inherit the  
low memory limits, and things go badly from there.


I'm afraid I'm not familiar enough with SGE to know how to fix this.   
One Big Thing to check is that when the SGE daemons are started at  
init.d/boot time, they have the proper "unlimited" memory locked  
limits.  Then processes that start under SGE should inherit the  
"unlimited" value and be ok.  That being said, SGE may also  
specifically override the memory locked limits (some resource  
managers can do this based on site-wide policies).  Check to see if  
SGE is doing this.
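
As a rough sketch of both checks (the init script location, the helper
script name, and SGE's output-file naming are assumptions about a
typical Linux/SGE setup -- adjust for your site):

# In the init script that starts the SGE execution daemon (sge_execd),
# raise the locked-memory limit before the daemon is launched, so that
# every job it spawns inherits it:
ulimit -l unlimited

# Then verify what a job run through SGE actually sees:
echo 'ulimit -l' > check_memlock.sh
qsub -cwd check_memlock.sh
cat check_memlock.sh.o*    # should print "unlimited"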



I assumed it would work since openmpi 1.1.1 was included as working
with SGE in OSCAR 5.0, but I don't know how different that version and
the one included with OFED are.

Any suggestions would be appreciated.



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] making all library components static (questions about --enable-mcs-static)

2007-06-07 Thread Jeff Squyres

On Jun 7, 2007, at 2:07 AM, Code Master wrote:


I wish to compile openmpi-1.2.2 so that it:
- support MPI_THREAD_MULTIPLE
- enable profiling (generate gmon.out for each process after my  
client app finish running) to tell apart CPU time of my client  
program from the MPI library
- static linking for everything (incl client app and all components  
of library openmpi)


in the documentation, it says that --enable-mcs-static=<list>
will enable static linking of the modules in the list, however what
can I specify if I want to statically link *all* mcs modules
without knowing the list of modules available?


You should be able to do:

./configure --enable-static --disable-shared ...

This will do 2 things:

- libmpi (and friends) will be compiled as .a's (instead of .so's)
- all the MCA components will be physically contained in libmpi (and  
friends) instead of being built as standalone plugins
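
One way to sanity-check the result (the prefix here is just an example)
is to look for the static archives after installing:

./configure --enable-static --disable-shared --prefix=/opt/openmpi-static ...
make all install
ls /opt/openmpi-static/lib/*.a    # libmpi.a, libopen-rte.a, libopen-pal.a should be listed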



Also this is the plan for my command used for configuring openmpi:

./configure CFLAGS="-g -pg -O3 -static" --prefix=./ --enable-mpi-threads
--enable-progress-threads --enable-static --disable-shared
--enable-mcs-static --with-devel-headers


It's actually --enable-mca-static, not --enable-mcs-static.

However, that should not be necessary; the --enable-static and
--disable-shared options should take care of pulling all the components into
the libraries for you.
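
For illustration, the original command with that flag dropped (the
prefix is shown here as an explicit install directory, which is an
assumption -- adjust to taste) would be:

./configure CFLAGS="-g -pg -O3 -static" --prefix=/opt/openmpi-profiled \
  --enable-mpi-threads --enable-progress-threads \
  --enable-static --disable-shared --with-devel-headers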


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] open-mpi with ifort in debug..trouble

2007-06-07 Thread Jeff Squyres
Can you be a bit more descriptive?  What is the exact compilation
output (including the error)?  And what exactly do you mean by "debug
mode" -- compiling Open MPI with and without -g?  Please see
http://www.open-mpi.org/community/help/.


FWIW, I do not see the symbol "output_local_symbols" in the Open MPI  
source code anywhere...



On Jun 6, 2007, at 12:17 PM, Srinath Vadlamani wrote:

So I have been trying to build multiple applications with an
ifort+gcc implementation of Open-MPI.  I wanted to build them in debug
mode.  This is on a Macbook Pro:

System Version: Mac OS X 10.4.9 (8P2137)
  Kernel Version: Darwin 8.9.1
  gcc: gcc version 4.0.1
  ifort: 10.0.16

I have tried building PETSc from
ftp://ftp.mcs.anl.gov/pub/petsc/release-snapshots/petsc-lite-2.3.3-p1.tar.gz

in debug mode, and the error one gets in building the Fortran examples is:
ld: internal error: output_local_symbols () inconsistent local symbol count

This does not happen when *not* in debug mode.
This is the same error we get with the same build parameters of one  
of our Fortran scientific codes.


This error does *not* occur when using mpich2-1.0.5p4.





--
Jeff Squyres
Cisco Systems



Re: [OMPI users] OpenMPI with multiple threads (MPI_THREAD_MULTIPLE)

2007-06-07 Thread Jeff Squyres
This issue was just recently discussed on this list -- check out the  
thread here:


http://www.open-mpi.org/community/lists/users/2007/05/3323.php


On Jun 5, 2007, at 6:52 PM, smai...@ksu.edu wrote:


Hi,
I am trying a program in which I have 2 MPI nodes and each MPI node
has 2 threads:

Main node-thread                       Receive thread
----------------                       --------------
MPI_Init_Thread(MPI_THREAD_MULTIPLE);
.
.
LOOP:                                  LOOP:
THREAD-BARRIER                         THREAD-BARRIER
MPI_Send();                            MPI_Recv();
goto LOOP;                             goto LOOP;
.
.

The thread-barrier ensures that the 2 threads complete the previous
iteration before moving ahead with this one.

I get the following error SOMETIMES (while sometimes the program runs
properly):
*** An error occurred in MPI_Recv
*** on communicator MPI_COMM_WORLD
*** MPI_ERR_TRUNCATE: message truncated
*** MPI_ERRORS_ARE_FATAL (goodbye)

Somewhere I read that MPI_THREAD_MULTIPLE is not properly tested with
OpenMPI. Can someone tell me whether I am making some mistake or is
there any bug with MPI_THREAD_MULTIPLE?

-Thanks and Regards,
Sarang.



--
Jeff Squyres
Cisco Systems



[OMPI users] Problems configuring petsc-dev with openmpi-1.2.3a0r14886

2007-06-07 Thread Charles Williams

Hi,

I've been using openmpi-1.1.5 with no problems, but I decided to move
up to version 1.2 yesterday.  I am working with the developers'
version of PETSc, so I attempted to configure it using the
newly-installed open-mpi.  When I tried this, though, I ran into the
following problem (from PETSc's configure.log):


Possible ERROR while running preprocessor: In file included from /Users/willic3/geoframe/tools/openmpi-debug/include/mpi.h:1783,
                 from /Users/willic3/geoframe/tools/petsc-dev-new/include/petsc.h:138,
                 from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/ALE.hh:4,
                 from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/Sifter.hh:15,
                 from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/Sieve.hh:12,
                 from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/Topology.hh:5,
                 from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/SectionCompletion.hh:5,
                 from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/Numbering.hh:5,
                 from /Users/willic3/geoframe/tools/petsc-dev-new/src/dm/mesh/sieve/Mesh.hh:5,
                 from conftest.cc:3:
/Users/willic3/geoframe/tools/openmpi-debug/include/openmpi/ompi/mpi/cxx/mpicxx.h:162:36: error: ompi/mpi/cxx/constants.h: No such file or directory
/Users/willic3/geoframe/tools/openmpi-debug/include/openmpi/ompi/mpi/cxx/mpicxx.h:163:36: error: ompi/mpi/cxx/functions.h: No such file or directory
/Users/willic3/geoframe/tools/openmpi-debug/include/openmpi/ompi/mpi/cxx/mpicxx.h:164:35: error: ompi/mpi/cxx/datatype.h: No such file or directory
ret = 256


Here is what I have for my mpicxx:

mpicxx --show
g++ -D_REENTRANT -I/Users/willic3/geoframe/tools/openmpi-debug/include -g -mcpu=G5 -Wl,-u,_munmap -Wl,-multiply_defined,suppress -L/Users/willic3/geoframe/tools/openmpi-debug/lib -lmpi_cxx -lmpi -lopen-rte -lopen-pal


I can make a change to mpicxx.h that fixes the problem:

diff mpicxx-orig.h mpicxx.h
162,164c162,164
< #include "ompi/mpi/cxx/constants.h"
< #include "ompi/mpi/cxx/functions.h"
< #include "ompi/mpi/cxx/datatype.h"
---
> #include "openmpi/ompi/mpi/cxx/constants.h"
> #include "openmpi/ompi/mpi/cxx/functions.h"
> #include "openmpi/ompi/mpi/cxx/datatype.h"

I don't know if this is the correct approach, though.  Are the paths  
actually incorrect or have I configured open-mpi incorrectly?


Thanks,
Charles


Charles A. Williams
Dept. of Earth & Environmental Sciences
Science Center, 2C01B
Rensselaer Polytechnic Institute
Troy, NY  12180
Phone:(518) 276-3369
FAX:(518) 276-2012
e-mail:will...@rpi.edu




Re: [OMPI users] Problems configuring petsc-dev with openmpi-1.2.3a0r14886

2007-06-07 Thread Jeff Squyres
Yes, it is the correct approach.  This code was just changed and then  
fixed in the immediate past (yesterday? or perhaps the day before?),  
and the fix was exactly as you described.


https://svn.open-mpi.org/trac/ompi/changeset/14939
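
If you would rather patch the installed header by hand than wait for
the next snapshot, a small sketch along these lines should apply the
same change (the header path is taken from the error output above; back
up the file first):

cd /Users/willic3/geoframe/tools/openmpi-debug/include/openmpi/ompi/mpi/cxx
cp mpicxx.h mpicxx.h.orig
sed -e 's|"ompi/mpi/cxx/|"openmpi/ompi/mpi/cxx/|' mpicxx.h.orig > mpicxx.h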


On Jun 7, 2007, at 11:26 AM, Charles Williams wrote:


Hi,

I've been using openmpi-1.1.5 with no problems, but I decided to
move up to version 1.2 yesterday.  I am working with the
developers' version of PETSc, so I attempted to configure it using
the newly-installed open-mpi.  When I tried this, though, I ran
into the following problem (from PETSc's configure.log):


<...>


--
Jeff Squyres
Cisco Systems



[OMPI users] Segfault in orted (home directory problem)

2007-06-07 Thread Guillaume THOMAS-COLLIGNON
I am trying to switch to OpenMPI, and I ran into a problem: my home
directory must exist on all the nodes, or orted will crash.


I have a "master" machine where I initiate the mpirun command.
Then I have a bunch of slave machines, which will also execute the  
MPI job.
My user exists on all the machines, but the home directory is not  
mounted on the slaves, so it's only visible on the master node. I can  
log on a slave node, but don't have a home there. Of course the  
binary I'm running exists on all the machines (not in my home !). And  
the problem can be reproduced by running a shell command too, to make  
things simpler.


We have thousands of slave nodes and we don't want to mount the  
user's homedirs on all the slaves, so a fix would be really really nice.


Example :

I have 3 hosts, master, slave1, slave2. My home directory exists only  
on master.


If I log on master and run "mpirun -host master,slave1 uname -a" I get
a segfault.
If I log on slave1 and run "mpirun -host slave1,slave2 uname -a", it  
runs fine. My home directory does not exist on either slave1 or slave2.
If I log on master and run "mpirun -host master uname -a" it runs  
fine. I can run across several master nodes, it's fine too.


So it runs fine if my home directory exists everywhere, or if it does  
not exist at all. If it exists only on some nodes and not others,  
orted crashes.
I thought it could be related to my environment but I created a new  
user with an empty home and it does the same thing. As soon as I  
create the homedir on slave1 and slave2 it works fine.





I'm using OpenMPI 1.2.2, here is the error message and the result of  
ompi_info.


Short version (rnd04 is the master, r137n001 is a slave node).

-bash-3.00$ /usr/local/openmpi-1.2.2/bin/mpirun -host rnd04,r137n001 uname -a
Linux rnd04 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 x86_64  
x86_64 x86_64 GNU/Linux

[r137n001:31533] *** Process received signal ***
[r137n001:31533] Signal: Segmentation fault (11)
[r137n001:31533] Signal code: Address not mapped (1)
[r137n001:31533] Failing at address: 0x1
[r137n001:31533] [ 0] [0xe600]
[r137n001:31533] [ 1] /lib/tls/libc.so.6 [0xbf3bfc]
[r137n001:31533] [ 2] /lib/tls/libc.so.6(_IO_vfprintf+0xcb) [0xbf3e3b]
[r137n001:31533] [ 3] /usr/local/openmpi-1.2.2/lib/libopen-pal.so.0(opal_show_help+0x263) [0xf7f78de3]
[r137n001:31533] [ 4] /usr/local/openmpi-1.2.2/lib/libopen-rte.so.0(orte_rmgr_base_check_context_cwd+0xff) [0xf7fea7ef]
[r137n001:31533] [ 5] /usr/local/openmpi-1.2.2/lib/openmpi/mca_odls_default.so(orte_odls_default_launch_local_procs+0xe7f) [0xf7ea041f]
[r137n001:31533] [ 6] /usr/local/openmpi-1.2.2/bin/orted [0x804a1ea]
[r137n001:31533] [ 7] /usr/local/openmpi-1.2.2/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_deliver_notify_msg+0x136) [0xf7ef65c6]
[r137n001:31533] [ 8] /usr/local/openmpi-1.2.2/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_notify_recv+0x108) [0xf7ef4f68]
[r137n001:31533] [ 9] /usr/local/openmpi-1.2.2/lib/libopen-rte.so.0 [0xf7fd9a18]
[r137n001:31533] [10] /usr/local/openmpi-1.2.2/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x24c) [0xf7f05fdc]
[r137n001:31533] [11] /usr/local/openmpi-1.2.2/lib/openmpi/mca_oob_tcp.so [0xf7f07f61]
[r137n001:31533] [12] /usr/local/openmpi-1.2.2/lib/libopen-pal.so.0(opal_event_base_loop+0x388) [0xf7f67dd8]
[r137n001:31533] [13] /usr/local/openmpi-1.2.2/lib/libopen-pal.so.0(opal_event_loop+0x29) [0xf7f67fb9]
[r137n001:31533] [14] /usr/local/openmpi-1.2.2/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x37) [0xf7f053c7]
[r137n001:31533] [15] /usr/local/openmpi-1.2.2/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x374) [0xf7f09a04]
[r137n001:31533] [16] /usr/local/openmpi-1.2.2/lib/libopen-rte.so.0(mca_oob_recv_packed+0x4d) [0xf7fd980d]
[r137n001:31533] [17] /usr/local/openmpi-1.2.2/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_exec_compound_cmd+0x137) [0xf7ef55e7]
[r137n001:31533] [18] /usr/local/openmpi-1.2.2/bin/orted(main+0x99d) [0x8049d0d]
[r137n001:31533] [19] /lib/tls/libc.so.6(__libc_start_main+0xd3) [0xbcee23]
[r137n001:31533] [20] /usr/local/openmpi-1.2.2/bin/orted [0x80492e1]
[r137n001:31533] *** End of error message ***
mpirun noticed that job rank 1 with PID 31533 on node r137n001 exited on signal 11 (Segmentation fault).




If I create /home/toto on r137n001, it works fine :
(as root on r137n001: "mkdir /home/toto && chown toto:users /home/toto")

-bash-3.00$ /usr/local/openmpi-1.2.2/bin/mpirun -host rnd04,r137n001 uname -a
Linux rnd04 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 x86_64  
x86_64 x86_64 GNU/Linux
Linux r137n001 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006  
x86_64 x86_64 x86_64 GNU/Linux



I tried to use ssh instead of rsh, it crashes too.

If anyone knows a way to run OpenMPI jobs in this configuration where
the home directory does not exist on all the nodes, it would really
help!


Or is there a way to fix orted so that it won't crash?



Re: [OMPI users] Segfault in orted (home directory problem)

2007-06-07 Thread Michael Edwards

That is the default behavior because having common home areas is
fairly common, but with some work you can run your code from wherever
is convenient.  Using the -wd flag you can have the code run from
wherever you want, but the code and data have to get there somehow.

If you are using a batch scheduler it is fairly easy to write into
your execution script a section that parses the list of assigned nodes
and pushes your data and executable out to the scratch space on those
nodes, and then cleans up afterward (or not).
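
As a rough sketch of what such a section might look like under SGE (the
scratch path, executable name, and the use of $PE_HOSTFILE/$NSLOTS are
assumptions -- other schedulers expose the node list differently):

# inside the job script: copy the binary to local scratch on every assigned node
for node in $(awk '{print $1}' $PE_HOSTFILE | sort -u); do
    ssh $node mkdir -p /scratch/$USER
    scp ./myapp $node:/scratch/$USER/
done

mpirun -np $NSLOTS -wd /scratch/$USER /scratch/$USER/myapp

# optional cleanup afterward
for node in $(awk '{print $1}' $PE_HOSTFILE | sort -u); do
    ssh $node rm -rf /scratch/$USER
done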

The sensible way to do this will depend a lot on what schedulers you
are using and your application.  Openmpi may have a trick to push
data/executables around as well, but I haven't run across one yet.

Mike Edwards


Re: [OMPI users] Segfault in orted (home directory problem)

2007-06-07 Thread Ralph Castain
Have you tried the --wdir option yet? It should let you set your working
directory to anywhere. I don't believe it will require you to have a home
directory on the backend nodes, though I can't swear that ssh will be happy
if you don't.

Just do "mpirun -h" for a full list of options - it will describe the exact
format of the wdir one plus others you might find useful.
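
For example (the host names and directory are just for illustration;
the directory has to exist on every node):

mpirun --wdir /tmp -host master,slave1 uname -a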

Ralph



On 6/7/07 11:12 AM, "Guillaume THOMAS-COLLIGNON" wrote:

> I am trying to switch to OpenMPI, and I ran into a problem: my home
> directory must exist on all the nodes, or orted will crash.
> 
> I have a "master" machine where I initiate the mpirun command. Then I have a
> bunch of slave machines, which will also execute the
> MPI job.
> My user exists on all the machines, but the home directory is not
> mounted on the slaves, so it's only visible on the master node. I can
> log on a slave node, but don't have a home there. Of course the
> binary I'm running exists on all the machines (not in my home !). And
> the problem can be reproduced by running a shell command too, to make
> things simpler.
> 
> We have thousands of slave nodes and we don't want to mount the
> user's homedirs on all the slaves, so a fix would be really really nice.
> 
> Example :
> 
> I have 3 hosts, master, slave1, slave2. My home directory exists only
> on master.
> 
> If I log on master and run "mpirun -host master,slave1 uname -a" I get
> a segfault.
> If I log on slave1 and run "mpirun -host slave1,slave2 uname -a", it
> runs fine. My home directory does not exist on either slave1 or slave2.
> If I log on master and run "mpirun -host master uname -a" it runs
> fine. I can run across several master nodes, it's fine too.
> 
> So it runs fine if my home directory exists everywhere, or if it does
> not exist at all. If it exists only on some nodes and not others,
> orted crashes.
> I thought it could be related to my environment but I created a new
> user with an empty home and it does the same thing. As soon as I
> create the homedir on slave1 and slave2 it works fine.

Re: [OMPI users] Segfault in orted (home directory problem)

2007-06-07 Thread Guillaume THOMAS-COLLIGNON

You're right, the --wdir option works fine!
Thanks!

I just tried an older version we had compiled (1.2b3), and the error
was more explicit than the seg fault we get with 1.2.2:

Could not chdir to home directory /rdu/thomasco: No such file or directory
--------------------------------------------------------------------------
Failed to change to the working directory:
<...>

-Guillaume

On Jun 7, 2007, at 12:57 PM, Ralph Castain wrote:

Have you tried the --wdir option yet? It should let you set your
working directory to anywhere. I don't believe it will require you to
have a home directory on the backend nodes, though I can't swear that
ssh will be happy if you don't.

Just do "mpirun -h" for a full list of options - it will describe the
exact format of the wdir one plus others you might find useful.

Ralph




Re: [OMPI users] Segfault in orted (home directory problem)

2007-06-07 Thread Ralph Castain
Hmmm...well, we certainly will make a point of giving you a better error
message! It probably won't get into 1.2.3, but it should make later releases.

Thanks for letting me know.
Ralph


On 6/7/07 1:22 PM, "Guillaume THOMAS-COLLIGNON" wrote:

> You're right, the --wdir option works fine !
> Thanks !
> 
> I just tried an older version we had compiled (1.2b3), and the error
> was more explicit than the seg fault we get with 1.2.2 :
> 
> Could not chdir to home directory /rdu/thomasco: No such file or
> directory
> 
> --
> Failed to change to the working directory:
> <...>
> 
> -Guillaume
> 

[OMPI users] Issues with DL POLY

2007-06-07 Thread Aaron Thompson

Hello,
	Does anyone have experience using DL POLY with OpenMPI?  I've gotten  
it to compile, but when I run a simulation using mpirun with two
dual-processor machines, it runs a little *slower* than on one CPU on one
machine!  Yet the program is running two instances on each node.  Any  
ideas?  The test programs included with OpenMPI show that it is  
running correctly across multiple nodes.
	Sorry if this is a little off-topic, I wasn't able to find help on  
the official DL POLY mailing list.


Thank you!

Aaron Thompson
Vanderbilt University
aaron.p.thomp...@vanderbilt.edu





Re: [OMPI users] Issues with DL POLY

2007-06-07 Thread Brock Palen

We have a few users using DLPOLY with OMPI, running just fine.
Watch out for what kind of simulation you are doing: like all MD
software, not all simulations are better in parallel.  In some, the
communication overhead is much worse than running on just one cpu.  I
see this all the time.  You could try just 2 cpus on one node; sometimes
that is ok (memory access vs network access).  But it's not uncommon.
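
For instance, comparing the two placements directly (host names and the
DL POLY executable name are placeholders):

mpirun -np 2 -host node1,node1 ./DLPOLY.X    # both processes on one node (shared memory)
mpirun -np 2 -host node1,node2 ./DLPOLY.X    # one process per node (over the network)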


Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985


On Jun 7, 2007, at 8:24 PM, Aaron Thompson wrote:


Hello,
Does anyone have experience using DL POLY with OpenMPI?  I've gotten
it to compile, but when I run a simulation using mpirun with two dual-
processor machines, it runs a little *slower* than on one CPU on one
machine!  Yet the program is running two instances on each node.  Any
ideas?  The test programs included with OpenMPI show that it is
running correctly across multiple nodes.
Sorry if this is a little off-topic, I wasn't able to find help on
the official DL POLY mailing list.

Thank you!

Aaron Thompson
Vanderbilt University
aaron.p.thomp...@vanderbilt.edu









Re: [OMPI users] Issues with DL POLY

2007-06-07 Thread Michael Edwards

If your problem size is not large enough, then any MPI program will
perform worse on a "large number" of nodes because of the overhead
involved in setting up the problem and network latency.  Sometimes
that "large number" is as small as two :)

I am not at all familiar with DL POLY, but if you make the size of the
problem larger you should see more performance benefit because the
overhead will be small compared to the execution time.

Just in general I would say to start with a problem that takes at
least a minute on one node, run it a few times to see how much the run
time varies and then try it on two nodes.  Especially if you are going
to try and scale it much past that initial two node version...
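
A minimal way to make that comparison (binary name and host list are
placeholders):

time mpirun -np 1 ./DLPOLY.X
time mpirun -np 2 -host node1,node2 ./DLPOLY.X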

On 6/7/07, Aaron Thompson  wrote:

Hello,
Does anyone have experience using DL POLY with OpenMPI?  I've gotten
it to compile, but when I run a simulation using mpirun with two dual-
processor machines, it runs a little *slower* than on one CPU on one
machine!  Yet the program is running two instances on each node.  Any
ideas?  The test programs included with OpenMPI show that it is
running correctly across multiple nodes.
Sorry if this is a little off-topic, I wasn't able to find help on
the official DL POLY mailing list.

Thank you!

Aaron Thompson
Vanderbilt University
aaron.p.thomp...@vanderbilt.edu






Re: [OMPI users] Issues with DL POLY

2007-06-07 Thread Ben Allan
Are you saying t(single-process execution) < t(4-process execution)
for identical problems on each (same total amount of data)?

There's rarely a speedup in such a case-- processing the same
amount of data while shipping some fraction of it over
a slow network between processing steps is almost certain to be slower.

Where things get interesting (and encouraging) is if you increase
the total data being processed (hold data quantity per node constant).

ben allan

On Thu, Jun 07, 2007 at 08:24:03PM -0400, Aaron Thompson wrote:
> Hello,
>   Does anyone have experience using DL POLY with OpenMPI?  I've gotten  
> it to compile, but when I run a simulation using mpirun with two dual- 
> processor machines, it runs a little *slower* than on one CPU on one  
> machine!  Yet the program is running two instances on each node.  Any  
> ideas?  The test programs included with OpenMPI show that it is  
> running correctly across multiple nodes.
>   Sorry if this is a little off-topic, I wasn't able to find help on  
> the official DL POLY mailing list.
> 
>   Thank you!
> 
> Aaron Thompson
> Vanderbilt University
> aaron.p.thomp...@vanderbilt.edu
> 
> 
> 



Re: [OMPI users] making all library components static (questions about --enable-mcs-static)

2007-06-07 Thread Code Master

Hi Jeff (and everyone),

Thanks!  Now I have compiled the openmpi-1.2.2 successfully under i386-Linux
(Debian Sarge) with the following configurations:

./configure CFLAGS="-g -pg -O3" --enable-mpi-threads --enable-progress-threads
--enable-static --disable-shared

However, when I compile my client program using mpicc and insert -static,

(compile is done by a makefile)
mpicc  -static -g -pg -O3 -W -Wall -pedantic -std=c99 -o raytrace  bbox.o
cr.o env.o fbuf.o geo.o huprn.o husetup.o hutv.o isect.o main.o matrix.o
memory.o poly.o raystack.o shade.o sph.o trace.o tri.o debug.o


it fails to link and complains that

<...>: In function `_int_malloc':
: multiple definition of `_int_malloc'
/usr/lib/libopen-pal.a(lt1-malloc.o)(.text+0x18a0):openmpi-1.2.2/opal/mca/memory/ptmalloc2/malloc.c:3954: first defined here
/usr/bin/ld: Warning: size of symbol `_int_malloc' changed from 1266 in /usr/lib/libopen-pal.a(lt1-malloc.o) to 1333 in /home/490_research/490/src/mpi.optimized_profiling//lib/libopen-pal.a(lt1-malloc.o)


So what could be going wrong here?

Is it because openmpi has internal implementations of system-provided
functions (such as malloc) that are also used in my program, but the one the
client program uses is provided by the system whereas the one in the library
has a different internal implementation?

In that case, how could I do the static linking in my client program?  I
really need static linking as far as possible to do the profiling.

Thanks!


On 6/8/07, Jeff Squyres  wrote:


On Jun 7, 2007, at 2:07 AM, Code Master wrote:

> I wish to compile openmpi-1.2.2 so that it:
> - support MPI_THREAD_MULTIPLE
> - enable profiling (generate gmon.out for each process after my
> client app finish running) to tell apart CPU time of my client
> program from the MPI library
> - static linking for everything (incl client app and all components
> of library openmpi)
>
> in the documentation, it says that --enable-mcs-static=
> will enable static linking of the modules in the list, however what
> can I specify if I want to statically link *all* mcs modules
> without knowing the list of modules available?

You should be able to do:

./configure --enable-static --disable-shared ...

This will do 2 things:

- libmpi (and friends) will be compiled as .a's (instead of .so's)
- all the MCA components will be physically contained in libmpi (and
friends) instead of being built as standalone plugins

> Also this is the plan for my command used for configuring openmpi:
>
> ./configure CFLAGS="-g -pg -O3 -static" --prefix=./ --enable-mpi-
> threads --enable-progress-threads --enable-static  --disable-shared
> --enable-mcs-static --with-devel-headers

It's actually --enable-mca-static, not --enable-mcs-static.

However, that should not be necessary; the --enable-static and --
disable-shared should take care of pulling all the components into
the libraries for you.

--
Jeff Squyres
Cisco Systems
