Re: [OMPI users] SGE and openmpi

2011-04-06 Thread Jason Palmer
Ok, the problem was apparently that I was still including mpif.h instead of
using "use mpi".  Seems to be working now.

-Original Message-
From: Jason Palmer [mailto:japalme...@gmail.com] 
Sent: Wednesday, April 06, 2011 5:01 PM
To: 'Open MPI Users'
Subject: RE: SGE and openmpi

Btw, I did compile openmpi with the --with-sge flag.

I am able to compile a test program using openf90 with no errors or
warnings. But when I try to run a test program that just calls
MPI_INIT(ierr), then MPI_COMM_RANK(ierr), I get the following, whether
linked statically or dynamically, and whether run with mpirun or directly:

[juggling.ucsd.edu:20218] *** An error occurred in MPI_Comm_rank
[juggling.ucsd.edu:20218] *** on communicator MPI_COMM_WORLD
[juggling.ucsd.edu:20218] *** MPI_ERR_COMM: invalid communicator
[juggling.ucsd.edu:20218] *** MPI_ERRORS_ARE_FATAL (your MPI job will now
abort)

Is there something missing in the Linux or parallel environment settings?
Thanks.

-Original Message-
From: Jason Palmer [mailto:japalme...@gmail.com]
Sent: Wednesday, April 06, 2011 4:09 PM
To: 'Open MPI Users'
Subject: SGE and openmpi

Hi,
I am having trouble running a batch job in SGE using openmpi.  I have read
the faq, which says that openmpi will automatically do the right thing, but
something seems to be wrong.

Previously I used MPICH1 under SGE without any problems. I'm avoiding MPICH2
because it doesn't seem to support static compilation, whereas I was able to
get openmpi to compile with open64 and compile my program statically.
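For reference, the sort of configure line that gets there looks roughly like
this; the Open64 compiler names and the install prefix are assumptions about my
setup, and --enable-static/--disable-shared are just the usual static-build
switches, so treat it as a sketch rather than a recipe:

cd openmpi-1.4.3
./configure --prefix=/home/jason/openmpi-1.4.3-install \
    --with-sge --enable-static --disable-shared \
    CC=opencc CXX=openCC F77=openf90 FC=openf90
make all install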

But I am having problems launching. According to the documentation, I should
be able to have a script file, qsub.sh:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q all.q
#$ -pe orte 18
MPI_DIR=/home/jason/openmpi-1.4.3-install/bin
/home/jason/openmpi-1.4.3-install/bin/mpirun -np $NSLOTS  myprog

Then,
$ qsub  qsub.sh

Previously with MPICH1 I would have

-machinefile $TMP/machines

in the mpirun arguments, and the rest of the script the same except -pe
mpich 18, and it would work. The -machinefile argument doesn't seem to work
in orte. The error in qsub.sh.o is:

[jason@juggling ~/amica_open64]$ cat qsub.sh.o7514
[compute-0-0.local:17792] *** An error occurred in MPI_Comm_rank
[compute-0-0.local:17792] *** on communicator MPI_COMM_WORLD
[compute-0-0.local:17792] *** MPI_ERR_COMM: invalid communicator
[compute-0-0.local:17792] *** MPI_ERRORS_ARE_FATAL (your MPI job will now
abort)
--
mpirun has exited due to process rank 0 with PID 17792 on node
compute-0-0.local exiting without calling "finalize". This may have caused
other processes in the application to be terminated by signals sent by
mpirun (as reported here).
--
[compute-0-0.local:17788] 8 more processes have sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[compute-0-0.local:17788] Set MCA parameter "orte_base_help_aggregate" to 0
to see all help / error messages


I ran qconf, and I get the same output as in the documentation:

[jason@juggling ~/amica_open64]$ qconf -sp orte
pe_name            orte
slots
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

The qconf mpich output is:

[jason@juggling ~/amica_open64]$ qconf -sp mpich
pe_name            mpich
slots
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

with specific scripts for start_proc_args and stop_proc_args ...

Am I missing something necessary to run openmpi under SGE?

Thanks very much,
Jason



Re: [OMPI users] SGE and openmpi

2011-04-06 Thread Ralph Castain
Are you able to run non-MPI programs like "hostname"?

I ask because that error message indicates that everything started just fine, 
but there is an error in your application.
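For example, swapping your application for hostname in the same SGE job script
(same mpirun path and $NSLOTS as before) is a quick way to confirm that the
launch side is fine:

# inside the same qsub.sh, instead of myprog:
/home/jason/openmpi-1.4.3-install/bin/mpirun -np $NSLOTS hostname
# every allocated slot should print its node name if the launch works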


On Apr 6, 2011, at 6:01 PM, Jason Palmer wrote:

> Btw, I did compile openmpi with the --with-sge flag.
> 
> I am able to compile a test program using openf90 with no errors or
> warnings. But when I try to run a test program that just calls
> MPI_INIT(ierr), then MPI_COMM_RANK(ierr), I get the following, whether
> static or linked, and whether run with mpirun or directly:
> 
> [juggling.ucsd.edu:20218] *** An error occurred in MPI_Comm_rank
> [juggling.ucsd.edu:20218] *** on communicator MPI_COMM_WORLD
> [juggling.ucsd.edu:20218] *** MPI_ERR_COMM: invalid communicator
> [juggling.ucsd.edu:20218] *** MPI_ERRORS_ARE_FATAL (your MPI job will now
> abort)
> 
> Is there something  missing in the linux or parallel environment settings?
> Thanks.
> 
> -Original Message-
> From: Jason Palmer [mailto:japalme...@gmail.com] 
> Sent: Wednesday, April 06, 2011 4:09 PM
> To: 'Open MPI Users'
> Subject: SGE and openmpi
> 
> Hi,
> I am having trouble running a batch job in SGE using openmpi.  I have read
> the faq, which says that openmpi will automatically do the right thing, but
> something seems to be wrong.
> 
> Previously I used MPICH1 under SGE without any problems. I'm avoiding MPICH2
> because it doesn't seem to support static compilation, whereas I was able to
> get openmpi to compile with open64 and compile my program statically.
> 
> But I am having problems launching. According to the documentation, I should
> be able to have a script file, qsub.sh:
> 
> #!/bin/bash
> #$ -cwd
> #$ -j y
> #$ -S /bin/bash
> #$ -q all.q
> #$ -pe orte 18
> MPI_DIR=/home/jason/openmpi-1.4.3-install/bin
> /home/jason/openmpi-1.4.3-install/bin/mpirun -np $NSLOTS  myprog
> 
> Then,
>   $ qsub  qsub.sh
> 
> Previously with MPICH1 I would have
> 
>   -machinefile $TMP/machines
> 
> in the mpirun arguments, and the rest of the script the same except -pe
> mpich 18, and it would work. The -machinefile argument doesn't seem to work
> in orte. The error in qsub.sh.o is:
> 
> [jason@juggling ~/amica_open64]$ cat qsub.sh.o7514 [compute-0-0.local:17792]
> *** An error occurred in MPI_Comm_rank [compute-0-0.local:17792] *** on
> communicator MPI_COMM_WORLD [compute-0-0.local:17792] *** MPI_ERR_COMM:
> invalid communicator [compute-0-0.local:17792] *** MPI_ERRORS_ARE_FATAL
> (your MPI job will now abort)
> --
> mpirun has exited due to process rank 0 with PID 17792 on node
> compute-0-0.local exiting without calling "finalize". This may have caused
> other processes in the application to be terminated by signals sent by
> mpirun (as reported here).
> --
> [compute-0-0.local:17788] 8 more processes have sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal [compute-0-0.local:17788] Set MCA
> parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> 
> 
> I ran qconf, and I get the same output as in the documentation:
> 
> [jason@juggling ~/amica_open64]$ qconf -sp orte
> pe_nameorte
> slots  
> user_lists NONE
> xuser_listsNONE
> start_proc_args/bin/true
> stop_proc_args /bin/true
> allocation_rule$fill_up
> control_slaves TRUE
> job_is_first_task  FALSE
> urgency_slots  min
> accounting_summary TRUE
> 
> The qconf mpich output is:
> 
> [jason@juggling ~/amica_open64]$ qconf -sp mpich
> pe_namempich
> slots  
> user_lists NONE
> xuser_listsNONE
> start_proc_args/opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args /opt/gridengine/mpi/stopmpi.sh
> allocation_rule$fill_up
> control_slaves TRUE
> job_is_first_task  FALSE
> urgency_slots  min
> accounting_summary TRUE
> 
> with specific scripts for start_proc_args and stop_proc_args ...
> 
> Am I missing something necessary to run openmpi under SGE?
> 
> Thanks very much,
> Jason
> 




Re: [OMPI users] SGE and openmpi

2011-04-06 Thread Jason Palmer
Btw, I did compile openmpi with the --with-sge flag.

I am able to compile a test program using openf90 with no errors or
warnings. But when I try to run a test program that just calls
MPI_INIT(ierr), then MPI_COMM_RANK(ierr), I get the following, whether
linked statically or dynamically, and whether run with mpirun or directly:

[juggling.ucsd.edu:20218] *** An error occurred in MPI_Comm_rank
[juggling.ucsd.edu:20218] *** on communicator MPI_COMM_WORLD
[juggling.ucsd.edu:20218] *** MPI_ERR_COMM: invalid communicator
[juggling.ucsd.edu:20218] *** MPI_ERRORS_ARE_FATAL (your MPI job will now
abort)

Is there something missing in the Linux or parallel environment settings?
Thanks.

-Original Message-
From: Jason Palmer [mailto:japalme...@gmail.com] 
Sent: Wednesday, April 06, 2011 4:09 PM
To: 'Open MPI Users'
Subject: SGE and openmpi

Hi,
I am having trouble running a batch job in SGE using openmpi.  I have read
the faq, which says that openmpi will automatically do the right thing, but
something seems to be wrong.

Previously I used MPICH1 under SGE without any problems. I'm avoiding MPICH2
because it doesn't seem to support static compilation, whereas I was able to
get openmpi to compile with open64 and compile my program statically.

But I am having problems launching. According to the documentation, I should
be able to have a script file, qsub.sh:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q all.q
#$ -pe orte 18
MPI_DIR=/home/jason/openmpi-1.4.3-install/bin
/home/jason/openmpi-1.4.3-install/bin/mpirun -np $NSLOTS  myprog

Then,
$ qsub  qsub.sh

Previously with MPICH1 I would have

-machinefile $TMP/machines

in the mpirun arguments, and the rest of the script the same except -pe
mpich 18, and it would work. The -machinefile argument doesn't seem to work
in orte. The error in qsub.sh.o is:

[jason@juggling ~/amica_open64]$ cat qsub.sh.o7514
[compute-0-0.local:17792] *** An error occurred in MPI_Comm_rank
[compute-0-0.local:17792] *** on communicator MPI_COMM_WORLD
[compute-0-0.local:17792] *** MPI_ERR_COMM: invalid communicator
[compute-0-0.local:17792] *** MPI_ERRORS_ARE_FATAL (your MPI job will now
abort)
--
mpirun has exited due to process rank 0 with PID 17792 on node
compute-0-0.local exiting without calling "finalize". This may have caused
other processes in the application to be terminated by signals sent by
mpirun (as reported here).
--
[compute-0-0.local:17788] 8 more processes have sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[compute-0-0.local:17788] Set MCA parameter "orte_base_help_aggregate" to 0
to see all help / error messages


I ran qconf, and I get the same output as in the documentation:

[jason@juggling ~/amica_open64]$ qconf -sp orte
pe_name            orte
slots
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

The qconf mpich output is:

[jason@juggling ~/amica_open64]$ qconf -sp mpich
pe_name            mpich
slots
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

with specific scripts for start_proc_args and stop_proc_args ...

Am I missing something necessary to run openmpi under SGE?

Thanks very much,
Jason



[OMPI users] SGE and openmpi

2011-04-06 Thread Jason Palmer
Hi,
I am having trouble running a batch job in SGE using openmpi.  I have read
the faq, which says that openmpi will automatically do the right thing, but
something seems to be wrong.

Previously I used MPICH1 under SGE without any problems. I'm avoiding MPICH2
because it doesn't seem to support static compilation, whereas I was able to
get openmpi to compile with open64 and compile my program statically.

But I am having problems launching. According to the documentation, I should
be able to have a script file, qsub.sh:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q all.q
#$ -pe orte 18
MPI_DIR=/home/jason/openmpi-1.4.3-install/bin
/home/jason/openmpi-1.4.3-install/bin/mpirun -np $NSLOTS  myprog

Then,
$ qsub  qsub.sh

Previously with MPICH1 I would have

-machinefile $TMP/machines

in the mpirun arguments, and the rest of the script the same except -pe
mpich 18, and it would work. The -machinefile argument doesn't seem to work
in orte. The error in qsub.sh.o is:

[jason@juggling ~/amica_open64]$ cat qsub.sh.o7514
[compute-0-0.local:17792] *** An error occurred in MPI_Comm_rank
[compute-0-0.local:17792] *** on communicator MPI_COMM_WORLD
[compute-0-0.local:17792] *** MPI_ERR_COMM: invalid communicator
[compute-0-0.local:17792] *** MPI_ERRORS_ARE_FATAL (your MPI job will now
abort)
--
mpirun has exited due to process rank 0 with PID 17792 on
node compute-0-0.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
[compute-0-0.local:17788] 8 more processes have sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[compute-0-0.local:17788] Set MCA parameter "orte_base_help_aggregate" to 0
to see all help / error messages


I ran qconf, and I get the same output as in the documentation:

[jason@juggling ~/amica_open64]$ qconf -sp orte
pe_name            orte
slots
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

The qconf mpich output is:

[jason@juggling ~/amica_open64]$ qconf -sp mpich
pe_name            mpich
slots
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

with specific scripts for start_proc_args and stop_proc_args ...

Am I missing something necessary to run openmpi under SGE?

Thanks very much,
Jason



Re: [OMPI users] OMPI 1.4.3 and "make distclean" error

2011-04-06 Thread Gus Correa

Ralph Castain wrote:

On Apr 6, 2011, at 1:21 PM, David Gunter wrote:

We tend to build OMPI for several different architectures. 
Rather than untar the archive file each time I'd rather 
do a "make distclean" in between builds.  
However, this always produces the following error:


...
Making distclean in libltdl
make[2]: Entering directory `/user/openmpi-1.4.3/opal/libltdl'
make[2]: *** No rule to make target `distclean'.  Stop.
make[2]: Leaving directory `/user/openmpi-1.4.3/opal/libltdl'
make[1]: *** [distclean-recursive] Error 1
make[1]: Leaving directory `/user/openmpi-1.4.3/opal'
make: *** [distclean-recursive] Error 1

and then fails to finish the rest of the cleanup.

The reason is due to our specific systems and the use
of the configure argument --disable-dlopen, so nothing (including the Makefile)
gets created in /user/openmpi-1.4.3/opal/libltdl.


Is there a workaround for this?


Can't think of any, other than build system changes.
I don't know of any testing done for that scenario,
so I doubt we've hit it before.


Jeff is out today - will have to ask him tomorrow if he has any suggestions. 
I can think of a couple of possible solutions, but not sure what he would prefer.



Thanks,
david
--
David Gunter
HPC-3: Infrastructure Team
Los Alamos National Laboratory


Hi David

You could build in a separate directory, one directory per build,
and use --prefix=/bla/bla to install each build in a different
location of your choice.
I do this all the time here, not for different architectures, but for
different compilers.
I use subdirectories under the main directory of the untarred source tree,
but that's a matter of taste.
Launch 'configure' from there via its full or relative path,
then do 'make' and 'make install'.
If worse comes to worst and a particular build fails,
you can delete everything in that subdirectory
(instead of relying on 'make distclean', if that fails)
and start fresh, with no harm to the original source tree.
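A rough sketch of what I mean (the directory and prefix names are just
examples, not anything required):

cd openmpi-1.4.3
mkdir build-gcc && cd build-gcc          # one subdirectory per build
../configure --prefix=$HOME/sw/openmpi-1.4.3-gcc \
    CC=gcc CXX=g++ F77=gfortran FC=gfortran
make && make install
# if this particular build goes bad, wipe only its subdirectory:
cd .. && rm -rf build-gcc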

I hope this helps,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-




Re: [OMPI users] problem with configure and c++, lib and lib64

2011-04-06 Thread Jason Palmer
Thanks, this seems to be resolved now after sorting out my previous test
installations of gcc.



-Jason



From: Ralph Castain [mailto:rhc.open...@gmail.com] On Behalf Of Ralph
Castain
Sent: Wednesday, April 06, 2011 1:35 PM
To: japal...@ucsd.edu; Open MPI Users
Subject: Re: [OMPI users] problem with configure and c++, lib and lib64





On Apr 6, 2011, at 1:27 PM, Jason Palmer wrote:





Hello,



I'm trying again with the 1.4.3 version to compile openmpi statically
with my program, but I'm running into a more basic problem, similar to one
I previously encountered and solved using LD_LIBRARY_PATH.



The configure script is dying when it tries to run the "simple C++ program".
I define CC, CXX to refer to installed gcc-4.4.3 versions, and F77 and FC to
gcc-4.4.3 gfortran, and I set LD_LIBRARY_PATH to be the corresponding
gcc-4.4.3-install/lib64.



I didn't have a problem with the C++ configure last time I tried this. One
odd thing is that it seems to be using the lib directory instead of the
lib64 directory, despite my setting LD_LIBRARY_PATH to lib64, and defining
CFLAGS and LDFLAGS to point to the lib64 library as well. I wonder if that
is causing the C++ program to fail.



Did you set CXXFLAGS too? I believe that is what gets used for C++ programs,
not CFLAGS



If you don't need c++ bindings, you can always just configure to ignore it.









The relevant output from config.log is pasted below. Thanks very much for
your help!  -Jason



configure:23457: checking for the C++ compiler vendor

configure:23490: /home/jason/gcc-4.4.3-install/bin/g++ -c -DNDEBUG
conftest.cpp >&5

conftest.cpp:2:2: error: #error "condition defined(__INTEL_COMPILER) ||
defined(__ICC) not met"

conftest.cpp:3: error: 'choke' does not name a type

configure:23497: $? = 1

configure: failed program was:

| #if !( defined(__INTEL_COMPILER) || defined(__ICC) )

| #error "condition defined(__INTEL_COMPILER) || defined(__ICC) not met"

| choke me

| #endif

configure:23529: /home/jason/gcc-4.4.3-install/bin/g++ -c -DNDEBUG
conftest.cpp >&5

configure:23536: $? = 0

configure:24651: result: gnu

configure:24673: checking if C++ compiler works

configure:24754: /home/jason/gcc-4.4.3-install/bin/g++ -o conftest -DNDEBUG
-L/home/jason/gcc-4.4.3-install/lib64 conftest.cpp

>&5

In file included from
/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/bits/loca

lefwd.h:42,

 from
/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/string:45

,

 from conftest.cpp:111:

/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c

++locale.h:52: error: 'uselocale' was not declared in this scope

/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c

++locale.h:52: error: invalid type in declaration before ';' token

/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c

++locale.h: In function 'int std::__convert_from_v(__locale_struct* const&,
char*, int, const char*, ...)':

/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c

++locale.h:72: error: '__gnu_cxx::__uselocale' cannot be used as a function

/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c

++locale.h:97: error: '__gnu_cxx::__uselocale' cannot be used as a function

configure:24758: $? = 1

configure: program exited with status 1

configure: failed program was:

| /* confdefs.h.  */

| #define PACKAGE_NAME "Open MPI"

| #define PACKAGE_TARNAME "openmpi"

| #define PACKAGE_VERSION "1.4.3"

| #define PACKAGE_STRING "Open MPI 1.4.3"

| #define PACKAGE_BUGREPORT "http://www.open-mpi.org/community/help/"

| #define OMPI_MAJOR_VERSION 1

| #define OMPI_MINOR_VERSION 4

| #define OMPI_RELEASE_VERSION 3

| #define OMPI_GREEK_VERSION ""

| #define OMPI_VERSION "3"

| #define OMPI_RELEASE_DATE "Oct 05, 2010"

| #define ORTE_MAJOR_VERSION 1

| #define ORTE_MINOR_VERSION 4

| #define ORTE_RELEASE_VERSION 3

| #define ORTE_GREEK_VERSION ""

| #define ORTE_VERSION "3"

| #define ORTE_GREEK_VERSION ""

| #define ORTE_VERSION "3"

| #define ORTE_RELEASE_DATE "Oct 05, 2010"

| #define OPAL_MAJOR_VERSION 1

| #define OPAL_MINOR_VERSION 4

| #define OPAL_RELEASE_VERSION 3

| #define OPAL_GREEK_VERSION ""

| #define OPAL_VERSION "3"

| #define OPAL_RELEASE_DATE "Oct 05, 2010"

| #define OMPI_ENABLE_PROGRESS_THREADS 0

| #define OMPI_ARCH "x86_64-unknown-linux-gnu"

| #define OMPI_ENABLE_MEM_DEBUG 0

| #define OMPI_ENABLE_MEM_PROFILE 0

| #define OMPI_ENABLE_DEBUG 0

| #define OMPI_GROUP_SPARSE 0

| #define OMPI_WANT_MPI_CXX_SEEK 1

| #define MPI_PARAM_CHECK 

Re: [OMPI users] OMPI 1.4.3 and "make distclean" error

2011-04-06 Thread Ralph Castain

On Apr 6, 2011, at 1:21 PM, David Gunter wrote:

> We tend to build OMPI for several different architectures. Rather than untar 
> the archive file each time I'd rather do a "make distclean" in between 
> builds.  However, this always produces the following error:
> 
> ...
> Making distclean in libltdl
> make[2]: Entering directory `/user/openmpi-1.4.3/opal/libltdl'
> make[2]: *** No rule to make target `distclean'.  Stop.
> make[2]: Leaving directory `/user/openmpi-1.4.3/opal/libltdl'
> make[1]: *** [distclean-recursive] Error 1
> make[1]: Leaving directory `/user/openmpi-1.4.3/opal'
> make: *** [distclean-recursive] Error 1
> 
> and then fails to finish the rest of the cleanup.
> 
> The reason is due to to our specific systems and the use of the configure 
> argument --disable-dlopen, so nothing (including the Makefile) gets created 
> in /user/openmpi-1.4.3/opal/libltd.
> 
> Is there a workaround for this?

Can't think of any, other than build system changes. I don't know of any testing done 
for that scenario, so I doubt we've hit it before.

Jeff is out today - will have to ask him tomorrow if he has any suggestions. I 
can think of a couple of possible solutions, but not sure what he would prefer.

> 
> Thanks,
> david
> --
> David Gunter
> HPC-3: Infrastructure Team
> Los Alamos National Laboratory
> 
> 
> 
> 
> 




Re: [OMPI users] problem with configure and c++, lib and lib64

2011-04-06 Thread Ralph Castain

On Apr 6, 2011, at 1:27 PM, Jason Palmer wrote:

> Hello,
>  
> I’m trying again with the 1.4.3 version to use compile openmpi statically 
> with my program … but I’m running into a more basic problem, similar to one I 
> previously encountered and solved using LD_LIBRARY_PATH.
>  
> The configure script is dying when it tries to run the “simple C++ program”. 
> I define CC, CXX to refer to installed gcc-4.4.3 versions, and F77 and FC to 
> gcc-4.4.3 gfortran, and I set LD_LIBRARY_PATH to be the corresponding 
> gcc-4.4.3-install/lib64.
>  
> I didn’t have a problem with the c++ configure last time I tried this … One 
> odd thing is that it seems to be using the lib directory instead of the lib64 
> directory, despite my setting LD_LIBRARY_FLAGS to lib64, and defining CFLAGS 
> and LDFLAGS to point to the lib64 library as well. I wonder if that is 
> causing the C++ program to fail.

Did you set CXXFLAGS too? I believe that is what gets used for C++ programs, 
not CFLAGS

If you don't need c++ bindings, you can always just configure to ignore it.
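Roughly something along these lines, reusing the paths from your message (I'm
assuming gcc and gfortran live next to that g++, and whether the lib64 flags
are needed at all depends on your gcc install, so this is a guess rather than a
recipe):

./configure \
    CC=/home/jason/gcc-4.4.3-install/bin/gcc \
    CXX=/home/jason/gcc-4.4.3-install/bin/g++ \
    F77=/home/jason/gcc-4.4.3-install/bin/gfortran \
    FC=/home/jason/gcc-4.4.3-install/bin/gfortran \
    CXXFLAGS="-O2" \
    LDFLAGS="-L/home/jason/gcc-4.4.3-install/lib64 -Wl,-rpath,/home/jason/gcc-4.4.3-install/lib64"
# adding --disable-mpi-cxx to that line skips the C++ bindings entirely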


>  
> The relevant output from config.log is pasted below. Thanks very much for 
> your help!  -Jason
>  
> configure:23457: checking for the C++ compiler vendor
> configure:23490: /home/jason/gcc-4.4.3-install/bin/g++ -c -DNDEBUG   
> conftest.cpp >&5
> conftest.cpp:2:2: error: #error "condition defined(__INTEL_COMPILER) || 
> defined(__ICC) not met"
> conftest.cpp:3: error: 'choke' does not name a type
> configure:23497: $? = 1
> configure: failed program was:
> | #if !( defined(__INTEL_COMPILER) || defined(__ICC) )
> | #error "condition defined(__INTEL_COMPILER) || defined(__ICC) not met"
> | choke me
> | #endif
> configure:23529: /home/jason/gcc-4.4.3-install/bin/g++ -c -DNDEBUG   
> conftest.cpp >&5
> configure:23536: $? = 0
> configure:24651: result: gnu
> configure:24673: checking if C++ compiler works
> configure:24754: /home/jason/gcc-4.4.3-install/bin/g++ -o conftest -DNDEBUG   
> -L/home/jason/gcc-4.4.3-install/lib64 conftest.cpp
> >&5
> In file included from 
> /home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../../../include/c++/4.4.3/bits/loca
> lefwd.h:42,
>  from 
> /home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../../../include/c++/4.4.3/string:45
> ,
>  from conftest.cpp:111:
> /home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../../../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c
> ++locale.h:52: error: 'uselocale' was not declared in this scope
> /home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../../../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c
> ++locale.h:52: error: invalid type in declaration before ';' token
> /home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../../../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c
> ++locale.h: In function 'int std::__convert_from_v(__locale_struct* const&, 
> char*, int, const char*, ...)':
> /home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../../../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c
> ++locale.h:72: error: '__gnu_cxx::__uselocale' cannot be used as a function
> /home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../../../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c
> ++locale.h:97: error: '__gnu_cxx::__uselocale' cannot be used as a function
> configure:24758: $? = 1
> configure: program exited with status 1
> configure: failed program was:
> | /* confdefs.h.  */
> | #define PACKAGE_NAME "Open MPI"
> | #define PACKAGE_TARNAME "openmpi"
> | #define PACKAGE_VERSION "1.4.3"
> | #define PACKAGE_STRING "Open MPI 1.4.3"
> | #define PACKAGE_BUGREPORT "http://www.open-mpi.org/community/help/"
> | #define OMPI_MAJOR_VERSION 1
> | #define OMPI_MINOR_VERSION 4
> | #define OMPI_RELEASE_VERSION 3
> | #define OMPI_GREEK_VERSION ""
> | #define OMPI_VERSION "3"
> | #define OMPI_RELEASE_DATE "Oct 05, 2010"
> | #define ORTE_MAJOR_VERSION 1
> | #define ORTE_MINOR_VERSION 4
> | #define ORTE_RELEASE_VERSION 3
> | #define ORTE_GREEK_VERSION ""
> | #define ORTE_VERSION "3"
> | #define ORTE_GREEK_VERSION ""
> | #define ORTE_VERSION "3"
> | #define ORTE_RELEASE_DATE "Oct 05, 2010"
> | #define OPAL_MAJOR_VERSION 1
> | #define OPAL_MINOR_VERSION 4
> | #define OPAL_RELEASE_VERSION 3
> | #define OPAL_GREEK_VERSION ""
> | #define OPAL_VERSION "3"
> | #define OPAL_RELEASE_DATE "Oct 05, 2010"
> | #define OMPI_ENABLE_PROGRESS_THREADS 0
> | #define OMPI_ARCH "x86_64-unknown-linux-gnu"
> | #define OMPI_ENABLE_MEM_DEBUG 0
> | #define OMPI_ENABLE_MEM_PROFILE 0
> | #define OMPI_ENABLE_DEBUG 0
> | #define OMPI_GROUP_SPARSE 0
> | #define OMPI_WANT_MPI_CXX_SEEK 1
> | #define MPI_PARAM_CHECK ompi_mpi_param_check
> | #define OMPI_WANT_PRETTY_PRINT_STACKTRACE 1
> | #define OMPI_WANT_PERUSE 0
> | #define OMPI_ENABLE_PTY_SUPPORT 1
> | #define OMPI_ENABLE_HETEROGENEOUS_SUPPORT 0
> | #define OPAL_ENABLE_TRACE 0
> | #define 

Re: [OMPI users] mpi problems,

2011-04-06 Thread Ralph Castain
Sigh...look at the output of mpicc --showme. It tells you where the OMPI libs 
were installed:

-I/opt/SUNWhpc/HPC8.2.1c/sun/include/64 
-I/opt/SUNWhpc/HPC8.2.1c/sun/include/64/openmpi -R/opt/mx/lib/lib64
-R/opt/SUNWhpc/HPC8.2.1c/sun/lib/lib64 -L/opt/SUNWhpc/HPC8.2.1c/sun/lib/lib64 
-lmpi -lopen-rte -lopen-pal -lnsl -lrt -lm -ldl -lutil -lpthread

Look around a little in those areas - I can't pretend to understand where you 
put them, or if there are copy/paste errors into this thread. But obviously 
OMPI -thinks- the libs are somewhere in there.
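For example (the exact output will differ, but the point is to take the
directories straight from the wrapper instead of guessing):

mpicc --showme:compile    # the -I... directories the wrapper actually uses
mpicc --showme:link       # the -L... directories plus -lmpi etc. it links against
ls /opt/SUNWhpc/HPC8.2.1c/sun/lib/lib64/libmpi*   # per the -L above, libmpi should be here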


On Apr 6, 2011, at 2:18 PM, Nehemiah Dacres wrote:

> [jian@therock lib]$ ls lib64/*.a
> lib64/libotf.a  lib64/libvt.fmpi.a  lib64/libvt.omp.a
> lib64/libvt.a   lib64/libvt.mpi.a   lib64/libvt.ompi.a
> last time i linked one of those files it told me they were in the wrong 
> format. these are in archive format, what format should they be in? 
> 
> 
> On Wed, Apr 6, 2011 at 2:44 PM, Ralph Castain  wrote:
> Look at your output from mpicc --showme. It indicates that the OMPI libs were 
> put in the lib64 directory, not lib.
> 
> 
> On Apr 6, 2011, at 1:38 PM, Nehemiah Dacres wrote:
> 
>> I am also trying to get netlib's hpl to run via sun cluster tools so i am 
>> trying to compile it and am having trouble. Which is the proper mpi library 
>> to give? 
>> naturally this isn't going to work 
>> 
>> MPdir= /opt/SUNWhpc/HPC8.2.1c/sun/
>> MPinc= -I$(MPdir)/include
>> MPlib= $(MPdir)/lib/libmpi.a
>> 
>> because that doesn't exist 
>> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libotf.a  
>> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.fmpi.a  
>> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.omp.a
>> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.a   
>> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.mpi.a   
>> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.ompi.a
>> 
>> is what I have for listing *.a  in the lib directory. none of those are 
>> equivilant becasue they are all linked with vampire trace if I am reading 
>> the names right. I've already tried putting  
>> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.mpi.a for this and it didnt work 
>> giving errors like 
>> 
>> On Wed, Apr 6, 2011 at 12:42 PM, Terry Dontje  
>> wrote:
>> Something looks fishy about your numbers.  The first two sets of numbers 
>> look the same and the last set do look better for the most part.  Your 
>> mpirun command line looks weird to me with the "-mca 
>> orte_base_help_aggregate btl,openib,self," did something get chopped off 
>> with the text copy?  You should have had a "-mca btl openib,self".  Can you 
>> do a run with "-mca btl tcp,self", it should be slower.
>> 
>> I really wouldn't have expected another compiler over IB to be that 
>> dramatically lower performing.
>> 
>> --td
>> 
>> 
>> 
>> On 04/06/2011 12:40 PM, Nehemiah Dacres wrote:
>>> also, I'm not sure if I'm reading the results right. According to the last 
>>> run, did using the sun compilers (update 1 )  result in higher performance 
>>> with sunct? 
>>> 
>>> On Wed, Apr 6, 2011 at 11:38 AM, Nehemiah Dacres  wrote:
>>> some tests I did. I hope this isn't an abuse of the list. please tell me if 
>>> it is but thanks to all those who helped me. 
>>> 
>>> this  goes to say that the sun MPI works with programs not compiled with 
>>> sun’s compilers. 
>>> this first test was run as a base case to see if MPI works., the sedcond 
>>> run is to see the speed up using OpenIB provides
>>> jian@therock ~]$ mpirun -machinefile list 
>>> /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
>>> Start mpi_stress at Wed Apr  6 10:56:29 2011
>>> 
>>>   Size (bytes) TxMessages  TxMillionBytes/s   TxMessages/s
>>> 32  1  2.77   86485.67
>>> 64  1  5.76   90049.42
>>>128  1 11.00   85923.85
>>>256  1 18.78   73344.43
>>>512  1 34.47   67331.98
>>>   1024  1 34.81   33998.09
>>>   2048  1 17.318454.27
>>>   4096  1 18.344476.61
>>>   8192  1 25.433104.28
>>>  16384  1 15.56 949.50
>>>  32768  1 13.95 425.74
>>> 
>>>  65536  1  9.88 150.79
>>> 131072   8192 11.05  84.31
>>> 262144   4096 13.12  50.04
>>> 524288   2048 16.54  31.55
>>>1048576   1024 19.92  18.99
>>>2097152512   

Re: [OMPI users] mpi problems,

2011-04-06 Thread Nehemiah Dacres
[jian@therock lib]$ ls lib64/*.a
lib64/libotf.a  lib64/libvt.fmpi.a  lib64/libvt.omp.a
lib64/libvt.a   lib64/libvt.mpi.a   lib64/libvt.ompi.a
Last time I linked one of those files, it told me they were in the wrong
format. These are in archive format; what format should they be in?


On Wed, Apr 6, 2011 at 2:44 PM, Ralph Castain  wrote:

> Look at your output from mpicc --showme. It indicates that the OMPI libs
> were put in the lib64 directory, not lib.
>
>
> On Apr 6, 2011, at 1:38 PM, Nehemiah Dacres wrote:
>
> I am also trying to get netlib's hpl to run via sun cluster tools so i am
> trying to compile it and am having trouble. Which is the proper mpi library
> to give?
> naturally this isn't going to work
>
> MPdir= /opt/SUNWhpc/HPC8.2.1c/sun/
> MPinc= -I$(MPdir)/include
> *MPlib= $(MPdir)/lib/libmpi.a*
>
> because that doesn't exist
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libotf.a
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.fmpi.a
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.omp.a
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.a
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.mpi.a
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.ompi.a
>
> is what I have for listing *.a  in the lib directory. none of those are
> equivilant becasue they are all linked with vampire trace if I am reading
> the names right. I've already tried putting
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.mpi.a for this and it didnt work
> giving errors like
>
> On Wed, Apr 6, 2011 at 12:42 PM, Terry Dontje wrote:
>
>>  Something looks fishy about your numbers.  The first two sets of numbers
>> look the same and the last set do look better for the most part.  Your
>> mpirun command line looks weird to me with the "-mca
>> orte_base_help_aggregate btl,openib,self," did something get chopped off
>> with the text copy?  You should have had a "-mca btl openib,self".  Can you
>> do a run with "-mca btl tcp,self", it should be slower.
>>
>> I really wouldn't have expected another compiler over IB to be that
>> dramatically lower performing.
>>
>> --td
>>
>>
>>
>> On 04/06/2011 12:40 PM, Nehemiah Dacres wrote:
>>
>> also, I'm not sure if I'm reading the results right. According to the last
>> run, did using the sun compilers (update 1 )  result in higher performance
>> with sunct?
>>
>> On Wed, Apr 6, 2011 at 11:38 AM, Nehemiah Dacres wrote:
>>
>>> some tests I did. I hope this isn't an abuse of the list. please tell me
>>> if it is but thanks to all those who helped me.
>>>
>>> this  goes to say that the sun MPI works with programs not compiled with
>>> sun’s compilers.
>>> this first test was run as a base case to see if MPI works., the sedcond
>>> run is to see the speed up using OpenIB provides
>>> jian@therock ~]$ mpirun -machinefile list
>>> /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
>>> Start mpi_stress at Wed Apr  6 10:56:29 2011
>>>
>>>Size (bytes) TxMessages  TxMillionBytes/s
>>> TxMessages/s
>>>  32  1  2.77
>>> 86485.67
>>>  64  1  5.76
>>> 90049.42
>>> 128  1 11.00
>>> 85923.85
>>> 256  1 18.78
>>> 73344.43
>>> 512  1 34.47
>>> 67331.98
>>>1024  1 34.81
>>> 33998.09
>>>2048  1 17.31
>>> 8454.27
>>>4096  1 18.34
>>> 4476.61
>>>8192  1 25.43
>>> 3104.28
>>>   16384  1 15.56
>>> 949.50
>>>   32768  1 13.95
>>> 425.74
>>>
>>>   65536  1  9.88
>>> 150.79
>>>  131072   8192 11.05
>>> 84.31
>>>  262144   4096 13.12
>>> 50.04
>>>  524288   2048 16.54
>>> 31.55
>>> 1048576   1024 19.92
>>> 18.99
>>> 2097152512 22.54
>>> 10.75
>>> 4194304256 25.46
>>> 6.07
>>>
>>> Iteration 0 : errors = 0, total = 0 (495 secs, Wed Apr  6 11:04:44 2011)
>>> After 1 iteration(s), 8 mins and 15 secs, total errors = 0
>>>
>>> here is the infiniband run
>>>
>>> [jian@therock ~]$ mpirun -mca orte_base_help_aggregate btl,openib,self,
>>> -machinefile list /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
>>> Start mpi_stress at Wed Apr  6 11:07:06 2011
>>>
>>>Size (bytes) TxMessages  TxMillionBytes/s
>>> TxMessages/s
>>>  32  1  2.72   84907.69
>>>  64  1  5.83   91097.94
>>> 128  1 10.75   83959.63
>>> 256   

Re: [OMPI users] mpi problems,

2011-04-06 Thread Ralph Castain
Look at your output from mpicc --showme. It indicates that the OMPI libs were 
put in the lib64 directory, not lib.


On Apr 6, 2011, at 1:38 PM, Nehemiah Dacres wrote:

> I am also trying to get netlib's hpl to run via sun cluster tools so i am 
> trying to compile it and am having trouble. Which is the proper mpi library 
> to give? 
> naturally this isn't going to work 
> 
> MPdir= /opt/SUNWhpc/HPC8.2.1c/sun/
> MPinc= -I$(MPdir)/include
> MPlib= $(MPdir)/lib/libmpi.a
> 
> because that doesn't exist 
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libotf.a  
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.fmpi.a  
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.omp.a
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.a   
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.mpi.a   
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.ompi.a
> 
> is what I have for listing *.a  in the lib directory. none of those are 
> equivilant becasue they are all linked with vampire trace if I am reading the 
> names right. I've already tried putting  
> /opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.mpi.a for this and it didnt work 
> giving errors like 
> 
> On Wed, Apr 6, 2011 at 12:42 PM, Terry Dontje  wrote:
> Something looks fishy about your numbers.  The first two sets of numbers look 
> the same and the last set do look better for the most part.  Your mpirun 
> command line looks weird to me with the "-mca orte_base_help_aggregate 
> btl,openib,self," did something get chopped off with the text copy?  You 
> should have had a "-mca btl openib,self".  Can you do a run with "-mca btl 
> tcp,self", it should be slower.
> 
> I really wouldn't have expected another compiler over IB to be that 
> dramatically lower performing.
> 
> --td
> 
> 
> 
> On 04/06/2011 12:40 PM, Nehemiah Dacres wrote:
>> also, I'm not sure if I'm reading the results right. According to the last 
>> run, did using the sun compilers (update 1 )  result in higher performance 
>> with sunct? 
>> 
>> On Wed, Apr 6, 2011 at 11:38 AM, Nehemiah Dacres  wrote:
>> some tests I did. I hope this isn't an abuse of the list. please tell me if 
>> it is but thanks to all those who helped me. 
>> 
>> this  goes to say that the sun MPI works with programs not compiled with 
>> sun’s compilers. 
>> this first test was run as a base case to see if MPI works., the sedcond run 
>> is to see the speed up using OpenIB provides
>> jian@therock ~]$ mpirun -machinefile list 
>> /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
>> Start mpi_stress at Wed Apr  6 10:56:29 2011
>> 
>>   Size (bytes) TxMessages  TxMillionBytes/s   TxMessages/s
>> 32  1  2.77   86485.67
>> 64  1  5.76   90049.42
>>128  1 11.00   85923.85
>>256  1 18.78   73344.43
>>512  1 34.47   67331.98
>>   1024  1 34.81   33998.09
>>   2048  1 17.318454.27
>>   4096  1 18.344476.61
>>   8192  1 25.433104.28
>>  16384  1 15.56 949.50
>>  32768  1 13.95 425.74
>> 
>>  65536  1  9.88 150.79
>> 131072   8192 11.05  84.31
>> 262144   4096 13.12  50.04
>> 524288   2048 16.54  31.55
>>1048576   1024 19.92  18.99
>>2097152512 22.54  10.75
>>4194304256 25.46   6.07
>> 
>> Iteration 0 : errors = 0, total = 0 (495 secs, Wed Apr  6 11:04:44 2011)
>> After 1 iteration(s), 8 mins and 15 secs, total errors = 0
>> 
>> here is the infiniband run
>> 
>> [jian@therock ~]$ mpirun -mca orte_base_help_aggregate btl,openib,self, 
>> -machinefile list /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
>> Start mpi_stress at Wed Apr  6 11:07:06 2011
>> 
>>   Size (bytes) TxMessages  TxMillionBytes/s   TxMessages/s
>> 32  1  2.72   84907.69
>> 64  1  5.83   91097.94
>>128  1 10.75   83959.63
>>256  1 18.53   72384.48
>>512  1 34.96   68285.00
>>   1024  1 11.40   11133.10
>>   2048  1 20.88  

Re: [OMPI users] mpi problems,

2011-04-06 Thread Nehemiah Dacres
I am also trying to get netlib's HPL to run via Sun Cluster Tools, so I am
trying to compile it and am having trouble. Which is the proper MPI library
to give?
Naturally this isn't going to work:

MPdir= /opt/SUNWhpc/HPC8.2.1c/sun/
MPinc= -I$(MPdir)/include
*MPlib= $(MPdir)/lib/libmpi.a*

because that doesn't exist
/opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libotf.a
/opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.fmpi.a
/opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.omp.a
/opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.a
/opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.mpi.a
/opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.ompi.a

is what I have for listing *.a in the lib directory. None of those are
equivalent because they are all linked with VampirTrace, if I am reading
the names right. I've already tried putting
/opt/SUNWhpc-O/HPC8.2.1c/sun/lib/libvt.mpi.a for this and it didn't work,
giving errors like

On Wed, Apr 6, 2011 at 12:42 PM, Terry Dontje wrote:

>  Something looks fishy about your numbers.  The first two sets of numbers
> look the same and the last set do look better for the most part.  Your
> mpirun command line looks weird to me with the "-mca
> orte_base_help_aggregate btl,openib,self," did something get chopped off
> with the text copy?  You should have had a "-mca btl openib,self".  Can you
> do a run with "-mca btl tcp,self", it should be slower.
>
> I really wouldn't have expected another compiler over IB to be that
> dramatically lower performing.
>
> --td
>
>
>
> On 04/06/2011 12:40 PM, Nehemiah Dacres wrote:
>
> also, I'm not sure if I'm reading the results right. According to the last
> run, did using the sun compilers (update 1 )  result in higher performance
> with sunct?
>
> On Wed, Apr 6, 2011 at 11:38 AM, Nehemiah Dacres  wrote:
>
>> some tests I did. I hope this isn't an abuse of the list. please tell me
>> if it is but thanks to all those who helped me.
>>
>> this  goes to say that the sun MPI works with programs not compiled with
>> sun’s compilers.
>> this first test was run as a base case to see if MPI works., the sedcond
>> run is to see the speed up using OpenIB provides
>> jian@therock ~]$ mpirun -machinefile list
>> /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
>> Start mpi_stress at Wed Apr  6 10:56:29 2011
>>
>>    Size (bytes)     TxMessages    TxMillionBytes/s    TxMessages/s
>>              32              1                2.77        86485.67
>>              64              1                5.76        90049.42
>>             128              1               11.00        85923.85
>>             256              1               18.78        73344.43
>>             512              1               34.47        67331.98
>>            1024              1               34.81        33998.09
>>            2048              1               17.31         8454.27
>>            4096              1               18.34         4476.61
>>            8192              1               25.43         3104.28
>>           16384              1               15.56          949.50
>>           32768              1               13.95          425.74
>>
>>           65536              1                9.88          150.79
>>          131072           8192               11.05           84.31
>>          262144           4096               13.12           50.04
>>          524288           2048               16.54           31.55
>>         1048576           1024               19.92           18.99
>>         2097152            512               22.54           10.75
>>         4194304            256               25.46            6.07
>>
>> Iteration 0 : errors = 0, total = 0 (495 secs, Wed Apr  6 11:04:44 2011)
>> After 1 iteration(s), 8 mins and 15 secs, total errors = 0
>>
>> here is the infiniband run
>>
>> [jian@therock ~]$ mpirun -mca orte_base_help_aggregate btl,openib,self,
>> -machinefile list /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
>> Start mpi_stress at Wed Apr  6 11:07:06 2011
>>
>>    Size (bytes)     TxMessages    TxMillionBytes/s    TxMessages/s
>>              32              1                2.72        84907.69
>>              64              1                5.83        91097.94
>>             128              1               10.75        83959.63
>>             256              1               18.53        72384.48
>>             512              1               34.96        68285.00
>>            1024              1               11.40        11133.10
>>            2048              1               20.88        10196.34
>>            4096              1               10.13         2472.13
>>            8192              1               19.32         2358.25
>>           16384              1               14.58          890.10
>>           32768              1               15.85          483.61
>>           65536              1                9.04          137.95
>>   

[OMPI users] problem with configure and c++, lib and lib64

2011-04-06 Thread Jason Palmer
Hello,



I'm trying again with the 1.4.3 version to compile openmpi statically
with my program, but I'm running into a more basic problem, similar to one
I previously encountered and solved using LD_LIBRARY_PATH.



The configure script is dying when it tries to run the "simple C++ program".
I define CC, CXX to refer to installed gcc-4.4.3 versions, and F77 and FC to
gcc-4.4.3 gfortran, and I set LD_LIBRARY_PATH to be the corresponding
gcc-4.4.3-install/lib64.



I didn't have a problem with the C++ configure last time I tried this. One
odd thing is that it seems to be using the lib directory instead of the
lib64 directory, despite my setting LD_LIBRARY_PATH to lib64, and defining
CFLAGS and LDFLAGS to point to the lib64 library as well. I wonder if that
is causing the C++ program to fail.



The relevant output from config.log is pasted below. Thanks very much for
your help!  -Jason



configure:23457: checking for the C++ compiler vendor

configure:23490: /home/jason/gcc-4.4.3-install/bin/g++ -c -DNDEBUG
conftest.cpp >&5

conftest.cpp:2:2: error: #error "condition defined(__INTEL_COMPILER) ||
defined(__ICC) not met"

conftest.cpp:3: error: 'choke' does not name a type

configure:23497: $? = 1

configure: failed program was:

| #if !( defined(__INTEL_COMPILER) || defined(__ICC) )

| #error "condition defined(__INTEL_COMPILER) || defined(__ICC) not met"

| choke me

| #endif

configure:23529: /home/jason/gcc-4.4.3-install/bin/g++ -c -DNDEBUG
conftest.cpp >&5

configure:23536: $? = 0

configure:24651: result: gnu

configure:24673: checking if C++ compiler works

configure:24754: /home/jason/gcc-4.4.3-install/bin/g++ -o conftest -DNDEBUG
-L/home/jason/gcc-4.4.3-install/lib64 conftest.cpp

>&5

In file included from
/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/bits/loca

lefwd.h:42,

 from
/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/string:45

,

 from conftest.cpp:111:

/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c

++locale.h:52: error: 'uselocale' was not declared in this scope

/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c

++locale.h:52: error: invalid type in declaration before ';' token

/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c

++locale.h: In function 'int std::__convert_from_v(__locale_struct* const&,
char*, int, const char*, ...)':

/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c

++locale.h:72: error: '__gnu_cxx::__uselocale' cannot be used as a function

/home/jason/gcc-4.4.3-install/lib/gcc/x86_64-unknown-linux-gnu/4.4.3/../../.
./../include/c++/4.4.3/x86_64-unknown-linux-gnu/bits/c

++locale.h:97: error: '__gnu_cxx::__uselocale' cannot be used as a function

configure:24758: $? = 1

configure: program exited with status 1

configure: failed program was:

| /* confdefs.h.  */

| #define PACKAGE_NAME "Open MPI"

| #define PACKAGE_TARNAME "openmpi"

| #define PACKAGE_VERSION "1.4.3"

| #define PACKAGE_STRING "Open MPI 1.4.3"

| #define PACKAGE_BUGREPORT "http://www.open-mpi.org/community/help/"

| #define OMPI_MAJOR_VERSION 1

| #define OMPI_MINOR_VERSION 4

| #define OMPI_RELEASE_VERSION 3

| #define OMPI_GREEK_VERSION ""

| #define OMPI_VERSION "3"

| #define OMPI_RELEASE_DATE "Oct 05, 2010"

| #define ORTE_MAJOR_VERSION 1

| #define ORTE_MINOR_VERSION 4

| #define ORTE_RELEASE_VERSION 3

| #define ORTE_GREEK_VERSION ""

| #define ORTE_VERSION "3"

| #define ORTE_GREEK_VERSION ""

| #define ORTE_VERSION "3"

| #define ORTE_RELEASE_DATE "Oct 05, 2010"

| #define OPAL_MAJOR_VERSION 1

| #define OPAL_MINOR_VERSION 4

| #define OPAL_RELEASE_VERSION 3

| #define OPAL_GREEK_VERSION ""

| #define OPAL_VERSION "3"

| #define OPAL_RELEASE_DATE "Oct 05, 2010"

| #define OMPI_ENABLE_PROGRESS_THREADS 0

| #define OMPI_ARCH "x86_64-unknown-linux-gnu"

| #define OMPI_ENABLE_MEM_DEBUG 0

| #define OMPI_ENABLE_MEM_PROFILE 0

| #define OMPI_ENABLE_DEBUG 0

| #define OMPI_GROUP_SPARSE 0

| #define OMPI_WANT_MPI_CXX_SEEK 1

| #define MPI_PARAM_CHECK ompi_mpi_param_check

| #define OMPI_WANT_PRETTY_PRINT_STACKTRACE 1

| #define OMPI_WANT_PERUSE 0

| #define OMPI_ENABLE_PTY_SUPPORT 1

| #define OMPI_ENABLE_HETEROGENEOUS_SUPPORT 0

| #define OPAL_ENABLE_TRACE 0

| #define ORTE_DISABLE_FULL_SUPPORT 0

| #define OPAL_ENABLE_FT 0

| #define OPAL_ENABLE_FT_CR 0

| #define OMPI_WANT_HOME_CONFIG_FILES 1

| #define OPAL_ENABLE_IPV6 1

| #define ORTE_WANT_ORTERUN_PREFIX_BY_DEFAULT 0

| #define OPAL_PACKAGE_STRING "Open MPI jason@guessing Distribution"

| #define OPAL_IDENT_STRING "1.4.3"

| #define OMPI_OPENIB_PAD_HDR 0

| #define 

[OMPI users] OMPI 1.4.3 and "make distclean" error

2011-04-06 Thread David Gunter
We tend to build OMPI for several different architectures. Rather than untar 
the archive file each time I'd rather do a "make distclean" in between builds.  
However, this always produces the following error:

...
Making distclean in libltdl
make[2]: Entering directory `/user/openmpi-1.4.3/opal/libltdl'
make[2]: *** No rule to make target `distclean'.  Stop.
make[2]: Leaving directory `/user/openmpi-1.4.3/opal/libltdl'
make[1]: *** [distclean-recursive] Error 1
make[1]: Leaving directory `/user/openmpi-1.4.3/opal'
make: *** [distclean-recursive] Error 1

and then fails to finish the rest of the cleanup.

The reason is due to our specific systems and the use of the configure 
argument --disable-dlopen, so nothing (including the Makefile) gets created in 
/user/openmpi-1.4.3/opal/libltdl.

Is there a workaround for this?

Thanks,
david
--
David Gunter
HPC-3: Infrastructure Team
Los Alamos National Laboratory







Re: [OMPI users] problems with the -xterm option

2011-04-06 Thread Ralph Castain
Here's a little more info - it's for Cygwin, but I don't see anything 
Cygwin-specific in the answers:

http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding


On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:

> Sorry Jody - I should have read your note more carefully to see that you 
> already tried -Y. :-(
> 
> Not sure what to suggest...
> 
> 
> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
> 
>> Like I said, I'm not expert. However, a quick "google" of revealed this 
>> result:
>> 
>> 
>>> When trying to set up x11 forwarding over an ssh session to a remote server 
>>> with the -X switch, I was getting an error like Warning: No xauth data; 
>>> using fake authentication data for X11 forwarding.
>>> 
>>> When doing something like:
>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I 
>>> got an error message like:
>>> 
>>> 
>>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>> [root@RHEL ~]# 
>>> 
>>> and any X programs I ran would not display on my local system..
>>> 
>>> Turns out the solution is to use the -Y switch instead.
>>> 
>>> ssh -Yl root 10.1.1.9 
>>> 
>>> and that worked fine.
>> 
>> 
>> See if that works for you - if it does, we may have to modify OMPI to 
>> accommodate.
>> 
>> 
>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>> 
>>> Hi Ralph
>>> No, after the above error message mpirun has exited.
>>> 
>>> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>> 
>>>  jody@chefli ~/share/neander $ ssh -Y squid_0
>>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>  jody@squid_0 ~ $ xterm
>>>  xterm Xt error: Can't open display:
>>>  xterm:  DISPLAY is not set
>>>  jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>  jody@squid_0 ~ $ xterm
>>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>  jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>  jody@squid_0 ~ $ xterm
>>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>  jody@squid_0 ~ $ exit
>>>  logout
>>> 
>>> same thing with ssh -X, but here i get the same warning/error message
>>> as with mpirun:
>>> 
>>>  jody@chefli ~/share/neander $ ssh -X squid_0
>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not 
>>> generated
>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>> 
>>> So perhaps the whole problem is linked to that xauth-thing.
>>> Do you have a suggestion how this can be solved?
>>> 
>>> Thank You
>>>  Jody
>>> 
>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain  wrote:
 If I read your error messages correctly, it looks like mpirun is crashing 
 - the daemon is complaining that it lost the socket connection back to 
 mpirun, and hence will abort.
 
 Are you seeing mpirun still alive?
 
 
 On Apr 5, 2011, at 4:46 AM, jody wrote:
 
> Hi
> 
> On my workstation and  the cluster i set up OpenMPI (v 1.4.2) so that
> it works in "text-mode":
>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>  OMPI_COMM_WORLD_RANK=0
>  OMPI_COMM_WORLD_RANK=1
>  OMPI_COMM_WORLD_RANK=2
>  OMPI_COMM_WORLD_RANK=3
> 
> but when i use  the -xterm option to mpirun, it doesn't work
> 
> $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep 
> WORLD_RANK
>  Warning: untrusted X11 forwarding setup failed: xauth key data not 
> generated
>  Warning: No xauth data; using fake authentication data for X11 
> forwarding.
>  OMPI_COMM_WORLD_RANK=0
>  [squid_0:05266] [[55607,0],1]->[[55607,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
> lifeline [[55607,0],0] lost
>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
> 
> (strange: somebody wrote his message to the console)
> 
> No matter whether i set the DISPLAY variable to the full hostname of
> the workstation,
> to the IP-Adress of the workstation or simply to ":0.0", it doesn't work
> 
> But i do have xauth data (as far as i know):
> On the remote (squid_0):
>  jody@squid_0 ~ $ xauth list
>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
> 
> on the workstation:
>  $ xauth list
>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>  chefli/unix:0  

Re: [OMPI users] problems with the -xterm option

2011-04-06 Thread Ralph Castain
Sorry Jody - I should have read your note more carefully to see that you 
already tried -Y. :-(

Not sure what to suggest...


On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:

> Like I said, I'm not expert. However, a quick "google" of revealed this 
> result:
> 
> 
>> When trying to set up x11 forwarding over an ssh session to a remote server 
>> with the -X switch, I was getting an error like Warning: No xauth data; 
>> using fake authentication data for X11 forwarding.
>> 
>> When doing something like:
>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I 
>> got an error message like:
>> 
>> 
>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>> [root@RHEL ~]# 
>> 
>> and any X programs I ran would not display on my local system..
>> 
>> Turns out the solution is to use the -Y switch instead.
>> 
>> ssh -Yl root 10.1.1.9 
>> 
>> and that worked fine.
> 
> 
> See if that works for you - if it does, we may have to modify OMPI to 
> accommodate.
> 
> 
> On Apr 6, 2011, at 9:19 AM, jody wrote:
> 
>> Hi Ralph
>> No, after the above error message mpirun has exited.
>> 
>> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>> 
>>  jody@chefli ~/share/neander $ ssh -Y squid_0
>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>  jody@squid_0 ~ $ xterm
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>  jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>  jody@squid_0 ~ $ xterm
>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>  jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>  jody@squid_0 ~ $ xterm
>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>  jody@squid_0 ~ $ exit
>>  logout
>> 
>> same thing with ssh -X, but here i get the same warning/error message
>> as with mpirun:
>> 
>>  jody@chefli ~/share/neander $ ssh -X squid_0
>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>> 
>> So perhaps the whole problem is linked to that xauth-thing.
>> Do you have a suggestion how this can be solved?
>> 
>> Thank You
>>  Jody
>> 
>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain  wrote:
>>> If I read your error messages correctly, it looks like mpirun is crashing - 
>>> the daemon is complaining that it lost the socket connection back to 
>>> mpirun, and hence will abort.
>>> 
>>> Are you seeing mpirun still alive?
>>> 
>>> 
>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>> 
 Hi
 
 On my workstation and  the cluster i set up OpenMPI (v 1.4.2) so that
 it works in "text-mode":
  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
  OMPI_COMM_WORLD_RANK=0
  OMPI_COMM_WORLD_RANK=1
  OMPI_COMM_WORLD_RANK=2
  OMPI_COMM_WORLD_RANK=3
 
 but when i use  the -xterm option to mpirun, it doesn't work
 
 $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep 
 WORLD_RANK
  Warning: untrusted X11 forwarding setup failed: xauth key data not 
 generated
  Warning: No xauth data; using fake authentication data for X11 forwarding.
  OMPI_COMM_WORLD_RANK=0
  [squid_0:05266] [[55607,0],1]->[[55607,0],0]
 mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
 [sd = 8]
  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
 lifeline [[55607,0],0] lost
  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
 
 (strange: somebody wrote his message to the console)
 
 No matter whether i set the DISPLAY variable to the full hostname of
 the workstation,
 to the IP-Adress of the workstation or simply to ":0.0", it doesn't work
 
 But i do have xauth data (as far as i know):
 On the remote (squid_0):
  jody@squid_0 ~ $ xauth list
  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
 
 on the workstation:
  $ xauth list
  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1
 146c7f438fab79deb8a8a7df242b6f4b
  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
 
 In sshd_config on the workstation i have 'X11Forwarding yes'
 I have also done
   xhost + squid_0
 on the workstation.
 

Re: [OMPI users] problems with the -xterm option

2011-04-06 Thread Ralph Castain
Like I said, I'm no expert. However, a quick "google" revealed this result:

> When trying to set up x11 forwarding over an ssh session to a remote server 
> with the -X switch, I was getting an error like Warning: No xauth data; using 
> fake authentication data for X11 forwarding.
> 
> When doing something like:
> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I 
> got an error message like:
> 
> 
> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
> [root@RHEL ~]# 
> 
> and any X programs I ran would not display on my local system..
> 
> Turns out the solution is to use the -Y switch instead.
> 
> ssh -Yl root 10.1.1.9 
> 
> and that worked fine.
> 


See if that works for you - if it does, we may have to modify OMPI to 
accommodate.
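
If -Y does help, one way to make it stick (a sketch, not something I have verified on your setup) is to ask the ssh client on the workstation for trusted forwarding by default, so that the ssh connections mpirun opens behave like "ssh -Y":

   # ~/.ssh/config on the workstation (chefli) - hypothetical entry
   Host squid_0
       ForwardX11 yes
       # same effect as the -Y switch
       ForwardX11Trusted yes

With that in place, a plain "ssh squid_0" should leave DISPLAY set by sshd to something like localhost:10.0, instead of you exporting it by hand.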


On Apr 6, 2011, at 9:19 AM, jody wrote:

> Hi Ralph
> No, after the above error message mpirun has exited.
> 
> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
> 
>  jody@chefli ~/share/neander $ ssh -Y squid_0
>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>  jody@squid_0 ~ $ xterm
>  xterm Xt error: Can't open display:
>  xterm:  DISPLAY is not set
>  jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>  jody@squid_0 ~ $ xterm
>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>  jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>  jody@squid_0 ~ $ xterm
>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>  jody@squid_0 ~ $ exit
>  logout
> 
> same thing with ssh -X, but here i get the same warning/error message
> as with mpirun:
> 
>  jody@chefli ~/share/neander $ ssh -X squid_0
>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
> 
> So perhaps the whole problem is linked to that xauth-thing.
> Do you have a suggestion how this can be solved?
> 
> Thank You
>  Jody
> 
> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain  wrote:
>> If I read your error messages correctly, it looks like mpirun is crashing - 
>> the daemon is complaining that it lost the socket connection back to mpirun, 
>> and hence will abort.
>> 
>> Are you seeing mpirun still alive?
>> 
>> 
>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>> 
>>> Hi
>>> 
>>> On my workstation and  the cluster i set up OpenMPI (v 1.4.2) so that
>>> it works in "text-mode":
>>>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>>>  OMPI_COMM_WORLD_RANK=0
>>>  OMPI_COMM_WORLD_RANK=1
>>>  OMPI_COMM_WORLD_RANK=2
>>>  OMPI_COMM_WORLD_RANK=3
>>> 
>>> but when i use  the -xterm option to mpirun, it doesn't work
>>> 
>>> $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep 
>>> WORLD_RANK
>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not 
>>> generated
>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>  OMPI_COMM_WORLD_RANK=0
>>>  [squid_0:05266] [[55607,0],1]->[[55607,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
>>> lifeline [[55607,0],0] lost
>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>> 
>>> (strange: somebody wrote his message to the console)
>>> 
>>> No matter whether i set the DISPLAY variable to the full hostname of
>>> the workstation,
>>> to the IP-Adress of the workstation or simply to ":0.0", it doesn't work
>>> 
>>> But i do have xauth data (as far as i know):
>>> On the remote (squid_0):
>>>  jody@squid_0 ~ $ xauth list
>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>> 
>>> on the workstation:
>>>  $ xauth list
>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1
>>> 146c7f438fab79deb8a8a7df242b6f4b
>>>  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>> 
>>> In sshd_config on the workstation i have 'X11Forwarding yes'
>>> I have also done
>>>   xhost + squid_0
>>> on the workstation.
>>> 
>>> 
>>> How can i get the -xterm option running?
>>> 
>>> Thank You
>>>  Jody
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> 

Re: [OMPI users] mpi problems,

2011-04-06 Thread Terry Dontje
Something looks fishy about your numbers.  The first two sets of numbers 
look the same, and the last set does look better for the most part.  Your 
mpirun command line looks weird to me with the "-mca 
orte_base_help_aggregate btl,openib,self,"; did something get chopped off 
in the text copy?  You should have had "-mca btl openib,self".  Can 
you do a run with "-mca btl tcp,self"?  It should be slower.
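
Something like the following is what I have in mind (untested; it assumes the same "list" machinefile and mpi_stress binary you used above):

   # force the InfiniBand transport (plus self for loopback)
   mpirun -mca btl openib,self -machinefile list /opt/iba/src/mpi_apps/mpi_stress/mpi_stress

   # same test over plain TCP - this run should come out noticeably slower
   mpirun -mca btl tcp,self -machinefile list /opt/iba/src/mpi_apps/mpi_stress/mpi_stress

If the two runs show roughly the same numbers, the first one probably never used openib at all.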


I really wouldn't have expected another compiler over IB to be that 
dramatically lower performing.


--td


On 04/06/2011 12:40 PM, Nehemiah Dacres wrote:
also, I'm not sure if I'm reading the results right. According to the 
last run, did using the sun compilers (update 1 )  result in higher 
performance with sunct?


On Wed, Apr 6, 2011 at 11:38 AM, Nehemiah Dacres > wrote:


some tests I did. I hope this isn't an abuse of the list. please
tell me if it is, but thanks to all those who helped me.

this goes to show that the sun MPI works with programs not
compiled with sun's compilers.
this first test was run as a base case to see if MPI works; the
second run is to see the speedup that using OpenIB provides
jian@therock ~]$ mpirun -machinefile list
/opt/iba/src/mpi_apps/mpi_stress/mpi_stress
Start mpi_stress at Wed Apr  6 10:56:29 2011

  Size (bytes)  TxMessages  TxMillionBytes/s   TxMessages/s
            32       1              2.77         86485.67
            64       1              5.76         90049.42
           128       1             11.00         85923.85
           256       1             18.78         73344.43
           512       1             34.47         67331.98
          1024       1             34.81         33998.09
          2048       1             17.31          8454.27
          4096       1             18.34          4476.61
          8192       1             25.43          3104.28
         16384       1             15.56           949.50
         32768       1             13.95           425.74
         65536       1              9.88           150.79
        131072    8192             11.05            84.31
        262144    4096             13.12            50.04
        524288    2048             16.54            31.55
       1048576    1024             19.92            18.99
       2097152     512             22.54            10.75
       4194304     256             25.46             6.07

Iteration 0 : errors = 0, total = 0 (495 secs, Wed Apr  6 11:04:44 2011)
After 1 iteration(s), 8 mins and 15 secs, total errors = 0

here is the infiniband run

[jian@therock ~]$ mpirun -mca orte_base_help_aggregate
btl,openib,self, -machinefile list
/opt/iba/src/mpi_apps/mpi_stress/mpi_stress
Start mpi_stress at Wed Apr  6 11:07:06 2011

  Size (bytes)  TxMessages  TxMillionBytes/s   TxMessages/s
            32       1              2.72         84907.69
            64       1              5.83         91097.94
           128       1             10.75         83959.63
           256       1             18.53         72384.48
           512       1             34.96         68285.00
          1024       1             11.40         11133.10
          2048       1             20.88         10196.34
          4096       1             10.13          2472.13
          8192       1             19.32          2358.25
         16384       1             14.58           890.10
         32768       1             15.85           483.61
         65536       1              9.04           137.95
        131072    8192             10.90            83.12
        262144    4096             13.57            51.76
        524288    2048             16.82            32.08
       1048576    1024             19.10            18.21
       2097152     512             22.13            10.55
       4194304     256             21.66             5.16

Iteration 0 : errors = 0, total = 0 (511 secs, Wed Apr  6 11:15:37 2011)
After 1 iteration(s), 8 mins and 31 secs, total errors = 0
compiled with the sun compilers i think
[jian@therock ~]$ mpirun -mca orte_base_help_aggregate
btl,openib,self, 

Re: [OMPI users] mpi problems,

2011-04-06 Thread Eugene Loh




Nehemiah Dacres wrote:
> also, I'm not sure if I'm reading the results right. According to the
> last run, did using the sun compilers (update 1) result in higher
> performance with sunct?
>
> On Wed, Apr 6, 2011 at 11:38 AM, Nehemiah Dacres wrote:
>> this first test was run as a base case to see if MPI works; the second
>> run is to see the speedup that using OpenIB provides
>> [jian@therock ~]$ mpirun -machinefile list /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
>> [jian@therock ~]$ mpirun -mca orte_base_help_aggregate btl,openib,self, -machinefile list /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
>> [jian@therock ~]$ mpirun -mca orte_base_help_aggregate btl,openib,self, -machinefile list sunMpiStress

I don't think the command-line syntax for the MCA parameters is quite
right.  I suspect it should be

--mca orte_base_help_aggregate 1 --mca btl openib,self

Further, they are unnecessary.  The first is on by default and the
second is unnecessary since OMPI finds the fastest interconnect
automatically (presumably openib,self, with sm if there are on-node
processes).  Another way of setting MCA parameters is with environment
variables:

setenv OMPI_MCA_orte_base_help_aggregate 1
setenv OMPI_MCA_btl openib,self

since then you can use ompi_info to check your settings.
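
For example (a sketch in csh syntax to match the setenv lines above; the -np value, hostfile, and application name are placeholders, and the exact ompi_info arguments may differ slightly between versions):

   setenv OMPI_MCA_orte_base_help_aggregate 1
   setenv OMPI_MCA_btl openib,self

   # check what the btl framework will actually use
   ompi_info --param btl all | grep -i btl

   mpirun -np 16 -machinefile list ./your_app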

Anyhow, it looks like your runs are probably all using openib and I
don't know why the last one is 2x faster.  If you're testing the
interconnect, the performance should be limited by the IB (more or
less) and not by the compiler.




Re: [OMPI users] mpi problems,

2011-04-06 Thread Nehemiah Dacres
also, I'm not sure if I'm reading the results right. According to the last
run, did using the sun compilers (update 1 )  result in higher performance
with sunct?

On Wed, Apr 6, 2011 at 11:38 AM, Nehemiah Dacres  wrote:

> some tests I did. I hope this isn't an abuse of the list. please tell me if
> it is but thanks to all those who helped me.
>
> this goes to show that the sun MPI works with programs not compiled with
> sun's compilers.
> this first test was run as a base case to see if MPI works; the second
> run is to see the speedup that using OpenIB provides
> jian@therock ~]$ mpirun -machinefile list
> /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
> Start mpi_stress at Wed Apr  6 10:56:29 2011
>
>   Size (bytes) TxMessages  TxMillionBytes/s   TxMessages/s
>  32  1  2.77   86485.67
>  64  1  5.76   90049.42
> 128  1 11.00   85923.85
> 256  1 18.78   73344.43
> 512  1 34.47   67331.98
>1024  1 34.81   33998.09
>2048  1 17.318454.27
>4096  1 18.344476.61
>8192  1 25.433104.28
>   16384  1 15.56 949.50
>   32768  1 13.95 425.74
>
>  65536  1  9.88 150.79
>  131072   8192 11.05  84.31
>  262144   4096 13.12  50.04
>  524288   2048 16.54  31.55
> 1048576   1024 19.92  18.99
> 2097152512 22.54  10.75
> 4194304256 25.46   6.07
>
> Iteration 0 : errors = 0, total = 0 (495 secs, Wed Apr  6 11:04:44 2011)
> After 1 iteration(s), 8 mins and 15 secs, total errors = 0
>
> here is the infiniband run
>
> [jian@therock ~]$ mpirun -mca orte_base_help_aggregate btl,openib,self,
> -machinefile list /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
> Start mpi_stress at Wed Apr  6 11:07:06 2011
>
>   Size (bytes) TxMessages  TxMillionBytes/s   TxMessages/s
>  32  1  2.72   84907.69
>  64  1  5.83   91097.94
> 128  1 10.75   83959.63
> 256  1 18.53   72384.48
> 512  1 34.96   68285.00
>1024  1 11.40   11133.10
>2048  1 20.88   10196.34
>4096  1 10.132472.13
>8192  1 19.322358.25
>   16384  1 14.58 890.10
>   32768  1 15.85 483.61
>   65536  1  9.04 137.95
>   1310728192 10.90  83.12
>  262144   4096 13.57  51.76
>  524288  2048 16.82  32.08
> 10485761024 19.10  18.21
> 2097152512 22.13  10.55
> 4194304256 21.66   5.16
>
> Iteration 0 : errors = 0, total = 0 (511 secs, Wed Apr  6 11:15:37 2011)
> After 1 iteration(s), 8 mins and 31 secs, total errors = 0
> compiled with the sun compilers i think
> [jian@therock ~]$ mpirun -mca orte_base_help_aggregate btl,openib,self,
> -machinefile list sunMpiStress
> Start mpi_stress at Wed Apr  6 11:23:18 2011
>
>   Size (bytes) TxMessages  TxMillionBytes/s   TxMessages/s
>  32  1  2.60   81159.60
>  64  1  5.19   81016.95
> 128  1 10.23   79953.34
> 256  1 16.74   65406.52
> 512  1 23.71   46304.92
>1024  1 54.62   53340.73
>2048  1 45.75   22340.58
>4096  1 29.32

Re: [OMPI users] mpi problems,

2011-04-06 Thread Nehemiah Dacres
some tests I did. I hope this isn't an abuse of the list. please tell me if
it is, but thanks to all those who helped me.

this goes to show that the sun MPI works with programs not compiled with
sun's compilers.
this first test was run as a base case to see if MPI works; the second run
is to see the speedup that using OpenIB provides
jian@therock ~]$ mpirun -machinefile list
/opt/iba/src/mpi_apps/mpi_stress/mpi_stress
Start mpi_stress at Wed Apr  6 10:56:29 2011

  Size (bytes) TxMessages  TxMillionBytes/s   TxMessages/s
32  1  2.77   86485.67
64  1  5.76   90049.42
   128  1 11.00   85923.85
   256  1 18.78   73344.43
   512  1 34.47   67331.98
  1024  1 34.81   33998.09
  2048  1 17.318454.27
  4096  1 18.344476.61
  8192  1 25.433104.28
 16384  1 15.56 949.50
 32768  1 13.95 425.74

 65536  1  9.88 150.79
131072   8192 11.05  84.31
262144   4096 13.12  50.04
524288   2048 16.54  31.55
   1048576   1024 19.92  18.99
   2097152512 22.54  10.75
   4194304256 25.46   6.07

Iteration 0 : errors = 0, total = 0 (495 secs, Wed Apr  6 11:04:44 2011)
After 1 iteration(s), 8 mins and 15 secs, total errors = 0

here is the infiniband run

[jian@therock ~]$ mpirun -mca orte_base_help_aggregate btl,openib,self,
-machinefile list /opt/iba/src/mpi_apps/mpi_stress/mpi_stress
Start mpi_stress at Wed Apr  6 11:07:06 2011

  Size (bytes) TxMessages  TxMillionBytes/s   TxMessages/s
32  1  2.72   84907.69
64  1  5.83   91097.94
   128  1 10.75   83959.63
   256  1 18.53   72384.48
   512  1 34.96   68285.00
  1024  1 11.40   11133.10
  2048  1 20.88   10196.34
  4096  1 10.132472.13
  8192  1 19.322358.25
 16384  1 14.58 890.10
 32768  1 15.85 483.61
 65536  1  9.04 137.95
 1310728192 10.90  83.12
262144   4096 13.57  51.76
524288  2048 16.82  32.08
   10485761024 19.10  18.21
   2097152512 22.13  10.55
   4194304256 21.66   5.16

Iteration 0 : errors = 0, total = 0 (511 secs, Wed Apr  6 11:15:37 2011)
After 1 iteration(s), 8 mins and 31 secs, total errors = 0
compiled with the sun compilers i think
[jian@therock ~]$ mpirun -mca orte_base_help_aggregate btl,openib,self,
-machinefile list sunMpiStress
Start mpi_stress at Wed Apr  6 11:23:18 2011

  Size (bytes) TxMessages  TxMillionBytes/s   TxMessages/s
32  1  2.60   81159.60
64  1  5.19   81016.95
   128  1 10.23   79953.34
   256  1 16.74   65406.52
   512  1 23.71   46304.92
  1024  1 54.62   53340.73
  2048  1 45.75   22340.58
   4096  1             29.32    7158.87
   8192  1             28.61    3492.77
  16384  1            184.03   11232.26
  32768  1            215.69    6582.21
  65536  1            229.88    3507.64
 131072   8192        231.64    1767.25
 262144   4096 

Re: [OMPI users] mpi problems,

2011-04-06 Thread Nehemiah Dacres
On Mon, Apr 4, 2011 at 7:35 PM, Terry Dontje wrote:

>  libfui.so is a library that is part of the Solaris Studio FORTRAN tools.  It
> should be located under lib in the directory where your Solaris Studio compilers
> are installed.  So one question is whether you actually have Studio Fortran
> installed on all your nodes or not?
>
> --td
>

actually I kind of realized this shortly after I read this message


On 04/04/2011 04:02 PM, Ralph Castain wrote:

Well, where is libfui located? Is that location in your ld path? Is the lib
present on all nodes in your hostfile?


thank you all for your help

-- 
Nehemiah I. Dacres
System Administrator
Advanced Technology Group Saint Louis University
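
For anyone hitting the same libfui.so.1 error, a quick check-and-fix pass over the nodes might look like this (a sketch; the Studio prefix /opt/sun/sunstudio12.1 is the one that shows up in the LD_LIBRARY_PATH elsewhere in this thread, and "list" is the hostfile used above):

   # verify the Fortran runtime really exists on every node
   for host in `cat list`; do
       echo "== $host =="
       ssh $host ls -l /opt/sun/sunstudio12.1/lib/libfui.so.1
   done

   # make sure the Studio lib directory is searched before anything else
   export LD_LIBRARY_PATH=/opt/sun/sunstudio12.1/lib:$LD_LIBRARY_PATH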


Re: [OMPI users] mpi problems,

2011-04-06 Thread Nehemiah Dacres
thanks all, I realized that  the sun  compilers weren't installed on all the
nodes. It seems to be working, soon I will test the mca parameters for IB

On Mon, Apr 4, 2011 at 7:35 PM, Terry Dontje wrote:

>  libfui.so is a library that is part of the Solaris Studio FORTRAN tools.  It
> should be located under lib in the directory where your Solaris Studio compilers
> are installed.  So one question is whether you actually have Studio Fortran
> installed on all your nodes or not?
>
> --td
>
>
> On 04/04/2011 04:02 PM, Ralph Castain wrote:
>
> Well, where is libfui located? Is that location in your ld path? Is the lib
> present on all nodes in your hostfile?
>
>
>  On Apr 4, 2011, at 1:58 PM, Nehemiah Dacres wrote:
>
>  [jian@therock ~]$ echo $LD_LIBRARY_PATH
>
> /opt/sun/sunstudio12.1/lib:/opt/vtk/lib:/opt/gridengine/lib/lx26-amd64:/opt/gridengine/lib/lx26-amd64:/home/jian/.crlibs:/home/jian/.crlibs32
> [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun  -np 4 -hostfile
> list ring2
> ring2: error while loading shared libraries: libfui.so.1: cannot open
> shared object file: No such file or directory
> ring2: error while loading shared libraries: libfui.so.1: cannot open
> shared object file: No such file or directory
> ring2: error while loading shared libraries: libfui.so.1: cannot open
> shared object file: No such file or directory
> mpirun: killing job...
>
>
> --
> mpirun noticed that process rank 1 with PID 31763 on node compute-0-1
> exited on signal 0 (Unknown signal 0).
> --
> mpirun: clean termination accomplished
>
>  I really don't know what's wrong here. I was sure that would work
>
> On Mon, Apr 4, 2011 at 2:43 PM, Samuel K. Gutierrez wrote:
>
>> Hi,
>>
>>  Try prepending the path to your compiler libraries.
>>
>>  Example (bash-like):
>>
>>  export
>> LD_LIBRARY_PATH=/compiler/prefix/lib:/ompi/prefix/lib:$LD_LIBRARY_PATH
>>
>>  --
>>Samuel K. Gutierrez
>> Los Alamos National Laboratory
>>
>>
>>  On Apr 4, 2011, at 1:33 PM, Nehemiah Dacres wrote:
>>
>>   altering LD_LIBRARY_PATH alter's the process's path to mpi's libraries,
>> how do i alter its path to compiler libs like libfui.so.1? it needs to find
>> them cause it was compiled by a sun compiler
>>
>>  On Mon, Apr 4, 2011 at 10:06 AM, Nehemiah Dacres wrote:
>>
>>>
>>>  As Ralph indicated, he'll add the hostname to the error message (but
 that might be tricky; that error message is coming from rsh/ssh...).

 In the meantime, you might try (csh style):

 foreach host (`cat list`)
echo $host
ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
 end


>>>  that's what the tentakel line was referring to, or ...
>>>
>>>
>>>

 On Apr 4, 2011, at 10:24 AM, Nehemiah Dacres wrote:

 > I have installed it via a symlink on all of the nodes, I can go
 'tentakel which mpirun ' and it finds it' I'll check the library paths but
 isn't there a way to find out which nodes are returning the error?

>>>
>>>  I found it mislinked on a couple of nodes. thank you
>>>
>>> --
>>>  Nehemiah I. Dacres
>>> System Administrator
>>> Advanced Technology Group Saint Louis University
>>>
>>>
>>
>>
>> --
>> Nehemiah I. Dacres
>> System Administrator
>> Advanced Technology Group Saint Louis University
>>
>>   ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
> --
> Nehemiah I. Dacres
> System Administrator
> Advanced Technology Group Saint Louis University
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> [image: Oracle]
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
>  Oracle * - Performance Technologies*
>  95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Nehemiah I. Dacres
System Administrator
Advanced Technology Group Saint Louis University


Re: [OMPI users] problems with the -xterm option

2011-04-06 Thread jody
Hi Ralph
No, after the above error message mpirun has exited.

But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:

  jody@chefli ~/share/neander $ ssh -Y squid_0
  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
  jody@squid_0 ~ $ xterm
  xterm Xt error: Can't open display:
  xterm:  DISPLAY is not set
  jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
  jody@squid_0 ~ $ xterm
  xterm Xt error: Can't open display: 130.60.126.74:0.0
  jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
  jody@squid_0 ~ $ xterm
  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
  jody@squid_0 ~ $ exit
  logout

same thing with ssh -X, but here i get the same warning/error message
as with mpirun:

  jody@chefli ~/share/neander $ ssh -X squid_0
  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
  Warning: No xauth data; using fake authentication data for X11 forwarding.
  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh

So perhaps the whole problem is linked to that xauth-thing.
Do you have a suggestion how this can be solved?

Thank You
  Jody

On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain  wrote:
> If I read your error messages correctly, it looks like mpirun is crashing - 
> the daemon is complaining that it lost the socket connection back to mpirun, 
> and hence will abort.
>
> Are you seeing mpirun still alive?
>
>
> On Apr 5, 2011, at 4:46 AM, jody wrote:
>
>> Hi
>>
>> On my workstation and  the cluster i set up OpenMPI (v 1.4.2) so that
>> it works in "text-mode":
>>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>>  OMPI_COMM_WORLD_RANK=0
>>  OMPI_COMM_WORLD_RANK=1
>>  OMPI_COMM_WORLD_RANK=2
>>  OMPI_COMM_WORLD_RANK=3
>>
>> but when i use  the -xterm option to mpirun, it doesn't work
>>
>> $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep 
>> WORLD_RANK
>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>  OMPI_COMM_WORLD_RANK=0
>>  [squid_0:05266] [[55607,0],1]->[[55607,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
>> lifeline [[55607,0],0] lost
>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>
>> (strange: somebody wrote his message to the console)
>>
>> No matter whether i set the DISPLAY variable to the full hostname of
>> the workstation,
>> to the IP-Adress of the workstation or simply to ":0.0", it doesn't work
>>
>> But i do have xauth data (as far as i know):
>> On the remote (squid_0):
>>  jody@squid_0 ~ $ xauth list
>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>
>> on the workstation:
>>  $ xauth list
>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1
>> 146c7f438fab79deb8a8a7df242b6f4b
>>  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>
>> In sshd_config on the workstation i have 'X11Forwarding yes'
>> I have also done
>>   xhost + squid_0
>> on the workstation.
>>
>>
>> How can i get the -xterm option running?
>>
>> Thank You
>>  Jody
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



Re: [OMPI users] problems with the -xterm option

2011-04-06 Thread Ralph Castain
If I read your error messages correctly, it looks like mpirun is crashing - the 
daemon is complaining that it lost the socket connection back to mpirun, and 
hence will abort.

Are you seeing mpirun still alive?


On Apr 5, 2011, at 4:46 AM, jody wrote:

> Hi
> 
> On my workstation and  the cluster i set up OpenMPI (v 1.4.2) so that
> it works in "text-mode":
>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>  OMPI_COMM_WORLD_RANK=0
>  OMPI_COMM_WORLD_RANK=1
>  OMPI_COMM_WORLD_RANK=2
>  OMPI_COMM_WORLD_RANK=3
> 
> but when i use  the -xterm option to mpirun, it doesn't work
> 
> $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep 
> WORLD_RANK
>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>  OMPI_COMM_WORLD_RANK=0
>  [squid_0:05266] [[55607,0],1]->[[55607,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
> lifeline [[55607,0],0] lost
>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
> 
> (strange: somebody wrote his message to the console)
> 
> No matter whether i set the DISPLAY variable to the full hostname of
> the workstation,
> to the IP-Adress of the workstation or simply to ":0.0", it doesn't work
> 
> But i do have xauth data (as far as i know):
> On the remote (squid_0):
>  jody@squid_0 ~ $ xauth list
>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
> 
> on the workstation:
>  $ xauth list
>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1
> 146c7f438fab79deb8a8a7df242b6f4b
>  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
> 
> In sshd_config on the workstation i have 'X11Forwarding yes'
> I have also done
>   xhost + squid_0
> on the workstation.
> 
> 
> How can i get the -xterm option running?
> 
> Thank You
>  Jody
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] openmpi self checkpointing - error while running example

2011-04-06 Thread Hellmüller Roman
Hi Toan

no, that didn't change anything. i'm trying to restart the program on the
computer it ran on before, and i execute ompi-restart on that same machine.

machinefile_cbl1 contains just cbl1

hroman@cbl1 ~/checkpoints $ ompi-restart -v -machinefile machinefile_cbl1 
ompi_global_snapshot_28952.ckpt/
[cbl1:30308] Checking for the existence of 
(/home/hroman/checkpoints/ompi_global_snapshot_28952.ckpt)
[cbl1:30308] Restarting from file (ompi_global_snapshot_28952.ckpt/)
[cbl1:30308]  Exec in self
--
Error: Unable to obtain the proper restart command to restart from the
   checkpoint file (opal_snapshot_0.ckpt). Returned -1.

--
--
Error: Unable to obtain the proper restart command to restart from the
   checkpoint file (opal_snapshot_1.ckpt). Returned -1.

--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

cheers
roman


From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of 
Nguyen Toan [nguyentoan1...@gmail.com]
Sent: Wednesday, 6 April 2011 15:00
To: Open MPI Users
Subject: Re: [OMPI users] openmpi self checkpointing - error while running 
example

Hi Roman,

It seems that you misunderstand the parameter "-machinefile".
This parameter should be followed by a file containing a list of machines
on which your MPI application will be run. For example, if you want to
run your app on 2 nodes, named "node1" and "node2", then this file, let's
call it "MACHINES_FILE", should look like this:

node1
node2

Now try to checkpoint and restart again with "-machinefile MACHINES_FILE". Hope 
it works.

On Wed, Apr 6, 2011 at 9:13 PM, Hellmüller Roman 
> wrote:
Hi Toan

Thx for your suggestion. It gives me the following result, which does not tell 
anything more.

hroman@cbl1 ~/checkpoints $ ompi-restart -v  -machinefile 
../semesterthesis/code/code2_self_example/my-hroman-cr-file.ckpt   om
pi_global_snapshot_28952.ckpt/
[cbl1:28974] Checking for the existence of 
(/home/hroman/checkpoints/ompi_global_snapshot_28952.ckpt)
[cbl1:28974] Restarting from file (ompi_global_snapshot_28952.ckpt/)
[cbl1:28974]  Exec in self
ssh: connect to host 15 port 22: Invalid argument
--
A daemon (pid 28975) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
hroman@cbl1 ~/checkpoints $ echo $LD_LIBRARY_PATH
/cluster/sw/blcr/0.8.2/x86_64/gcc//lib:/cluster/sw/openmpi/1.5.3_ft/x86_64/gcc/lib:/opt/intel/Compiler/11.1/056/lib/intel64

The library path seems to be ok or should it look different? do you have 
another idea?
cheers
roman


From: users-boun...@open-mpi.org 
[users-boun...@open-mpi.org] on behalf of 
Nguyen Toan [nguyentoan1...@gmail.com]
Sent: Wednesday, 6 April 2011 13:20
To: Open MPI Users
Subject: Re: [OMPI users] openmpi self checkpointing - error while running 
example

Hi Roman,

Did you try to checkpoint and restart with the parameter "-machinefile". It may 
work.

Regards,
Nguyen Toan

On Wed, Apr 6, 2011 at 7:05 PM, Hellmüller Roman 
>>
 wrote:
Hi

I'm trying to get fault tolerant ompi running on our cluster for my 
semesterthesis.

Build & compile were successful, blcr checkpointing works. openmpi 1.5.3, blcr 
0.8.2

Now i'm trying to set up the SELF checkpointing. the example from 
http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can run the 
application and also do checkpoints, but restarting won't work.  I got the 
following error by doning as sugested:

mpicc my-app.c -export -export-dynamic -o my-app


Re: [OMPI users] openmpi self checkpointing - error while running example

2011-04-06 Thread Nguyen Toan
Hi Roman,

It seems that you misunderstand the parameter "-machinefile".
This parameter should be followed by a file containing a list of machines
on which your MPI application will be run. For example, if you want to
run your app on 2 nodes, named "node1" and "node2", then this file, let's
call it "MACHINES_FILE", should look like this:

node1
node2

Now try to checkpoint and restart again with "-machinefile MACHINES_FILE".
Hope it works.
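
For instance, the restart command would then become something like this (a sketch; "cbl1" and "cbl2" are placeholder node names, and the snapshot name is the one that appears in Roman's mail elsewhere in this thread):

   $ cat MACHINES_FILE
   cbl1
   cbl2

   $ ompi-restart -machinefile MACHINES_FILE ompi_global_snapshot_28952.ckpt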

On Wed, Apr 6, 2011 at 9:13 PM, Hellmüller Roman wrote:

> Hi Toan
>
> Thx for your suggestion. It gives me the following result, which does not
> tell anything more.
>
> hroman@cbl1 ~/checkpoints $ ompi-restart -v  -machinefile
> ../semesterthesis/code/code2_self_example/my-hroman-cr-file.ckpt   om
> pi_global_snapshot_28952.ckpt/
> [cbl1:28974] Checking for the existence of
> (/home/hroman/checkpoints/ompi_global_snapshot_28952.ckpt)
> [cbl1:28974] Restarting from file (ompi_global_snapshot_28952.ckpt/)
> [cbl1:28974]  Exec in self
> ssh: connect to host 15 port 22: Invalid argument
> --
> A daemon (pid 28975) died unexpectedly with status 255 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> hroman@cbl1 ~/checkpoints $ echo $LD_LIBRARY_PATH
>
> /cluster/sw/blcr/0.8.2/x86_64/gcc//lib:/cluster/sw/openmpi/1.5.3_ft/x86_64/gcc/lib:/opt/intel/Compiler/11.1/056/lib/intel64
>
> The library path seems to be ok or should it look different? do you have
> another idea?
> cheers
> roman
>
> 
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf
> of Nguyen Toan [nguyentoan1...@gmail.com]
> Sent: Wednesday, 6 April 2011 13:20
> To: Open MPI Users
> Subject: Re: [OMPI users] openmpi self checkpointing - error while running
> example
>
> Hi Roman,
>
> Did you try to checkpoint and restart with the parameter "-machinefile". It
> may work.
>
> Regards,
> Nguyen Toan
>
> On Wed, Apr 6, 2011 at 7:05 PM, Hellmüller Roman  > wrote:
> Hi
>
> I'm trying to get fault tolerant ompi running on our cluster for my
> semesterthesis.
>
> Build & compile were successful, blcr checkpointing works. openmpi 1.5.3,
> blcr 0.8.2
>
> Now i'm trying to set up the SELF checkpointing. the example from
> http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can
> run the application and also do checkpoints, but restarting won't work.  I
> got the following error by doning as sugested:
>
> mpicc my-app.c -export -export-dynamic -o my-app
>
> mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app
>
> hroman@cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
> --
> Error: Unable to obtain the proper restart command to restart from the
>  checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>
> --
> --
> Error: Unable to obtain the proper restart command to restart from the
>  checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
>
> i also tryed around with setting the path in the example file (restart_path
> variable), changing the checkpoint directorys, and running the application
> in different directorys...
>
> do you have an idea where the error could be?
>
> here http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz<
> http://n.ethz.ch/%7Ehroman/downloads/ompi_mailinglist.tar.gz> (40MB)
> you'll find the library and the build of openmpi & blcr as well as the env
> variables and the output of ompi_info. there is one for the login and the
> other for the compute nodes due to different kernels.  and here
> http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz<
> 

Re: [OMPI users] openmpi self checkpointing - error while running example

2011-04-06 Thread Hellmüller Roman
Hi Toan

Thanks for your suggestion. It gives me the following result, which does not tell 
me anything more.

hroman@cbl1 ~/checkpoints $ ompi-restart -v  -machinefile 
../semesterthesis/code/code2_self_example/my-hroman-cr-file.ckpt   om
pi_global_snapshot_28952.ckpt/
[cbl1:28974] Checking for the existence of 
(/home/hroman/checkpoints/ompi_global_snapshot_28952.ckpt)
[cbl1:28974] Restarting from file (ompi_global_snapshot_28952.ckpt/)
[cbl1:28974]  Exec in self
ssh: connect to host 15 port 22: Invalid argument
--
A daemon (pid 28975) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
hroman@cbl1 ~/checkpoints $ echo $LD_LIBRARY_PATH
/cluster/sw/blcr/0.8.2/x86_64/gcc//lib:/cluster/sw/openmpi/1.5.3_ft/x86_64/gcc/lib:/opt/intel/Compiler/11.1/056/lib/intel64

The library path seems to be ok, or should it look different? Do you have 
another idea?
cheers
roman


From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of 
Nguyen Toan [nguyentoan1...@gmail.com]
Sent: Wednesday, 6 April 2011 13:20
To: Open MPI Users
Subject: Re: [OMPI users] openmpi self checkpointing - error while running 
example

Hi Roman,

Did you try to checkpoint and restart with the parameter "-machinefile". It may 
work.

Regards,
Nguyen Toan

On Wed, Apr 6, 2011 at 7:05 PM, Hellmüller Roman 
> wrote:
Hi

I'm trying to get fault tolerant ompi running on our cluster for my 
semesterthesis.

Build & compile were successful, blcr checkpointing works. openmpi 1.5.3, blcr 
0.8.2

Now i'm trying to set up the SELF checkpointing. the example from 
http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can run the 
application and also do checkpoints, but restarting won't work.  I got the 
following error by doning as sugested:

mpicc my-app.c -export -export-dynamic -o my-app

mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app

hroman@cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
--
Error: Unable to obtain the proper restart command to restart from the
  checkpoint file (opal_snapshot_0.ckpt). Returned -1.

--
--
Error: Unable to obtain the proper restart command to restart from the
  checkpoint file (opal_snapshot_1.ckpt). Returned -1.

--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

i also tryed around with setting the path in the example file (restart_path 
variable), changing the checkpoint directorys, and running the application in 
different directorys...

do you have an idea where the error could be?

here 
http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz
 (40MB) you'll find the library and the build of openmpi & blcr as well as the 
env variables and the output of ompi_info. there is one for the login and the 
other for the compute nodes due to different kernels.  and here 
http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz
 there is the produced checkpoint. please let me know if more outputs are 
needed.

cheers
roman

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] g95 + open-mpi

2011-04-06 Thread nicolas cordier
hi,

 i need to use open-mpi with g95 on debian linux lenny 5.0 - x86_64.
 i compiled it with FC=g95 F77=g95 and tested with my example.c file,
 but with g95 mpirun doesn't use process 1, only process 0.
 perhaps my compile options are wrong?

 i want mpirun to use both process 0 and process 1.

 hostname paola12

 mpicc example.c
 mpirun -np 2 a.out
 C Process 0 on paola12
 0 [1 3 ]
 0 [1.00 3.00 ]
 C Process 0 on paola12
 0 [1 3 ]
 0 [1.00 3.00 ]


 with gfortran ( 4.3.2 ) + openmpi 
 mpirun -np 2 a.out
 C Process 0 on paola12 
 C Process 1 on paola12 
 0 [2 9 ] 
 1 [2 9 ] 
 0 [3.00 6.00 ] 
 1 [3.00 6.00 



example.c

 #include <stdio.h>
 #include <stdlib.h>
 #include <mpi.h>

 int main(int argc, char** argv) {
 MPI_Init(&argc, &argv);
 int rank;
 int namelen;
 char processor_name[MPI_MAX_PROCESSOR_NAME];
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 MPI_Get_processor_name(processor_name, &namelen);
 printf("C Process %d on %s \n", rank, processor_name);
 MPI_Barrier(MPI_COMM_WORLD);

 int size = 2;
 int *array, *reducedValues;
 array = (int *) malloc((size) * sizeof (int));
 reducedValues = (int *) malloc((size) * sizeof (int));
 array[0] = rank+1;
 array[1] = 3;
 MPI_Allreduce(array, reducedValues, size, MPI_INTEGER, MPI_PROD, 
MPI_COMM_WORLD);
 int i;
 printf("%d [", rank);
 for (i = 0; i < size; i++) {
 printf("%d ", reducedValues[i]);
 }
 printf("]\n");
 free(reducedValues);
 free(array);

 /* Verif triviale pour un seul entier (OK)
 size=1;
 int *array1, *reducedValues1;
 array1 = (int *) malloc((size) * sizeof (int));
 reducedValues1 = (int *) malloc((size) * sizeof (int));
 array[0] = rank+1;
 MPI_Allreduce(array1, reducedValues1, size, MPI_INTEGER, MPI_PROD, 
MPI_COMM_WORLD);
 printf(" C scalaire %d \n", reducedValues1[0]);
 free(reducedValues1);
 free(array1);
 */

 /* Verif pour les doubles */
 size=2;
 double *Darray, *DreducedValues;
 Darray = (double *) malloc((size) * sizeof (double));
 DreducedValues = (double *) malloc((size) * sizeof (double));
 Darray[0] = (rank+1)*1.0;
 Darray[1] = 3.0;
 MPI_Allreduce(Darray, DreducedValues, size, MPI_DOUBLE, MPI_SUM, 
MPI_COMM_WORLD);
 printf("%d [", rank);
 for (i = 0; i < size; i++) {
 printf("%f ", DreducedValues[i]);
 }
 printf("]\n");
 free(DreducedValues);
 free(Darray);
 MPI_Finalize();
 }


 greetings.

 nicolas cordier


Re: [OMPI users] openmpi self checkpointing - error while running example

2011-04-06 Thread Nguyen Toan
Hi Roman,

Did you try to checkpoint and restart with the parameter "-machinefile". It
may work.

Regards,
Nguyen Toan

On Wed, Apr 6, 2011 at 7:05 PM, Hellmüller Roman wrote:

> Hi
>
> I'm trying to get fault tolerant ompi running on our cluster for my
> semesterthesis.
>
> Build & compile were successful, blcr checkpointing works. openmpi 1.5.3,
> blcr 0.8.2
>
> Now i'm trying to set up the SELF checkpointing. the example from
> http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can
> run the application and also do checkpoints, but restarting won't work.  I
> got the following error by doning as sugested:
>
> mpicc my-app.c -export -export-dynamic -o my-app
>
> mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app
>
> hroman@cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
> --
> Error: Unable to obtain the proper restart command to restart from the
>   checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>
> --
> --
> Error: Unable to obtain the proper restart command to restart from the
>   checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
>
> i also tryed around with setting the path in the example file (restart_path
> variable), changing the checkpoint directorys, and running the application
> in different directorys...
>
> do you have an idea where the error could be?
>
> here http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz<
> http://n.ethz.ch/%7Ehroman/downloads/ompi_mailinglist.tar.gz> (40MB)
> you'll find the library and the build of openmpi & blcr as well as the env
> variables and the output of ompi_info. there is one for the login and the
> other for the compute nodes due to different kernels.  and here
> http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz<
> http://n.ethz.ch/%7Ehroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz>
> there is the produced checkpoint. please let me know if more outputs are
> needed.
>
> cheers
> roman
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] openmpi self checkpointing - error while running example

2011-04-06 Thread Hellmüller Roman
Hi

I'm trying to get fault-tolerant ompi running on our cluster for my 
semester thesis.

Build & compile were successful, blcr checkpointing works. openmpi 1.5.3, blcr 
0.8.2

Now i'm trying to set up the SELF checkpointing. the example from 
http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can run the 
application and also do checkpoints, but restarting won't work.  I got the 
following error by doing as suggested:

mpicc my-app.c -export -export-dynamic -o my-app

mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app

hroman@cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
--
Error: Unable to obtain the proper restart command to restart from the
   checkpoint file (opal_snapshot_0.ckpt). Returned -1.

--
--
Error: Unable to obtain the proper restart command to restart from the
   checkpoint file (opal_snapshot_1.ckpt). Returned -1.

--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

i also tried setting the path in the example file (the restart_path 
variable), changing the checkpoint directories, and running the application in 
different directories...

do you have an idea where the error could be?

here 
http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz
 (40MB) you'll find the library and the build of openmpi & blcr as well as the 
env variables and the output of ompi_info. there is one for the login and the 
other for the compute nodes due to different kernels.  and here 
http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz
 there is the produced checkpoint. please let me know if more outputs are 
needed.

cheers
roman
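
(For reference, the basic checkpoint/restart cycle that the examples page walks through, as I understand it - treat the exact commands below as assumptions rather than verified steps:

   # terminal 1: run the application with checkpoint/restart support enabled
   mpirun -np 2 -am ft-enable-cr my-app

   # terminal 2: take a checkpoint, passing the PID of mpirun
   ompi-checkpoint <PID of mpirun>

   # afterwards: restart from the global snapshot that ompi-checkpoint reported
   ompi-restart ompi_global_snapshot_<PID>.ckpt

If restart only fails for SELF checkpoints but works for plain BLCR ones, the problem is more likely in how the user callbacks named after crs_self_prefix are found in the binary than in the commands above.)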