[OMPI devel] mpirun --prefix question

2007-03-21 Thread David Daniel
I'm experimenting with heterogeneous applications (x86_64 <-->  
ppc64), where the systems share the file system where Open MPI is  
installed.


What I would like to be able to do is something like this:

   mpirun --np 1 --host host-x86_64 --prefix /opt/ompi/x86_64 a.out.x86_64 : \
          --np 1 --host host-ppc64 --prefix /opt/ompi/ppc64 a.out.ppc64


Unfortunately it looks as if the second --prefix is always ignored.  My guess is that orte_app_context_t::prefix_dir is getting set, but only the 0th app context is ever consulted (except in the dynamic process stuff, where I do see a loop over the app context array).


I can of course work around it with startup scripts, but a command  
line solution would be attractive.
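
(The kind of startup-script workaround I mean is roughly the following, dropped into each node's shell init so that the ssh-launched daemons and applications find the tree that matches the local architecture -- the paths and names are only illustrative:

   # Illustrative per-node shell startup snippet (e.g. in ~/.bashrc):
   # point PATH/LD_LIBRARY_PATH at the Open MPI tree that matches this
   # node's architecture so that orted and libmpi are found.
   arch=`uname -m`                       # x86_64 or ppc64
   prefix=/opt/ompi/$arch
   PATH=$prefix/bin:$PATH; export PATH
   LD_LIBRARY_PATH=$prefix/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH

It works, but it is per-node state rather than something visible on the mpirun command line.)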


This is with openmpi-1.2.

Thanks, David



Re: [OMPI devel] mpirun --prefix question

2007-03-22 Thread David Daniel

This is a development system for Roadrunner, using ssh.

David

On Mar 22, 2007, at 5:19 AM, Jeff Squyres wrote:


FWIW, I believe that we had intended --prefix to handle simple cases, which is why this probably doesn't work for you.  But as long as the different prefixes are specified for different nodes, it could probably be made to work.

Which launcher are you using this with?



On Mar 21, 2007, at 11:36 PM, Ralph Castain wrote:


Yo David

What system are you running this on?  RoadRunner?  If so, I can take a look at "fixing" it for you tomorrow (Thurs).

Ralph





--
David Daniel 
Computer Science for High-Performance Computing (CCS-1)






Re: [OMPI devel] mpirun --prefix question

2007-03-22 Thread David Daniel

OK. This sounds sensible.

Thanks, David

On Mar 22, 2007, at 10:38 AM, Ralph Castain wrote:


We had a nice chat about this on the OpenRTE telecon this morning. The question of what to do with multiple prefixes has been a long-running issue, most recently captured in Trac ticket #497. The problem is that prefix is intended to tell us where to find the ORTE/OMPI executables, and therefore is associated with a node - not an app_context. What we haven't been able to define is an appropriate notation that a user can exploit to tell us the association.

This issue has arisen on several occasions where either (a) users have heterogeneous clusters with a common file system, so the prefix must be adjusted on each *type* of node to point to the correct type of binary; or (b) for whatever reason, typically on rsh/ssh clusters, users have installed the binaries in different locations on some of the nodes. In this latter case, the reports have been from homogeneous clusters, so the *type* of binary was never the issue - it just wasn't located where we expected.

Sun's solution is (I believe) what most of us would expect - they locate their executables in the same relative location on all their nodes, and the binary in that location is correct for the local architecture. This requires, though, that the "prefix" location not be on a common file system.

Unfortunately, that isn't the case with LANL's Roadrunner, nor can we expect that everyone will follow that sensible approach :-). So we need a notation to support the "exception" case where someone needs to truly specify prefix versus node(s).

We discussed a number of options, including auto-detecting the local arch and appending it to the specified "prefix", among several others. After discussing them, those of us on the call decided that adding a field to the hostfile that specifies the prefix to use on that host would be the best solution. This could be done on a cluster-level basis, so - although it is annoying to create the data file - at least it would only have to be done once.
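
As a purely illustrative sketch (the field name and syntax here are hypothetical - nothing was actually settled on the call), such a hostfile entry might look something like:

   host-x86_64 slots=4 prefix=/opt/ompi/x86_64
   host-ppc64  slots=4 prefix=/opt/ompi/ppc64

where the hypothetical prefix field would tell the launcher which installation tree to use when starting daemons and binaries on that host.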

Again, this is the exception case, so requiring a little inconvenience seems a reasonable thing to do.

Anyone have heartburn and/or other suggestions? If not, we might start to play with this next week. We would have to do some small modifications to the RAS, RMAPS, and PLS components to ensure that any multi-prefix info gets correctly propagated and used across all platforms for consistent behavior.


Ralph








--
David Daniel 
Computer Science for High-Performance Computing (CCS-1)






[OMPI devel] collective problems

2007-10-04 Thread David Daniel

Hi Folks,

I have been seeing some nasty behaviour in collectives, particularly  
bcast and reduce.  Attached is a reproducer (for bcast).


The code will rapidly slow to a crawl (usually interpreted as a hang in real applications) and sometimes gets killed with SIGBUS or SIGTERM.


I see this with

  openmpi-1.2.3 or openmpi-1.2.4
  OFED 1.2
  Linux 2.6.19 + patches
  gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
  4-socket, dual-core Opterons

run as

  mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang

To my now uneducated eye it looks as if the root process is rushing  
ahead and not progressing earlier bcasts.


Anyone else seeing similar?  Any ideas for workarounds?

As a point of reference, mvapich2 0.9.8 works fine.

Thanks, David





bcast-hang.c
Description: Binary data
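
A minimal sketch consistent with the description above (rank 0 issuing many back-to-back broadcasts; the buffer size and iteration count are guesses, not the actual bcast-hang.c) would look something like:

   /* bcast stress test -- hypothetical reconstruction, not the original
      bcast-hang.c attachment */
   #include <mpi.h>
   #include <stdio.h>
   #include <stdlib.h>

   int main(int argc, char **argv)
   {
       const int count = 1024 * 1024;   /* ints per broadcast (a guess) */
       const int iterations = 1000;     /* enough rounds to expose the slowdown */
       int rank, i;
       int *buf;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       buf = malloc(count * sizeof(int));

       for (i = 0; i < iterations; i++) {
           if (rank == 0)
               buf[0] = i;              /* root puts in a new value each round */
           MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);
           if (rank == 0 && i % 100 == 0) {
               printf("iteration %d\n", i);
               fflush(stdout);
           }
       }

       free(buf);
       MPI_Finalize();
       return 0;
   }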


[OMPI devel] libdir not propagated to contrib/vt/vt ??

2008-06-11 Thread David Daniel
Building against recent heads (r18643), it appears that libdir (as set by ./configure --prefix=$PREFIX --libdir=$PREFIX/lib64, for example) is not propagated to ompi-trunk/contrib/vt/vt.


Feature or bug?

Thanks, David


[O-MPI devel] Question on ROMIO

2005-08-18 Thread David Daniel

A question for those who did the ROMIO port...

The ROMIO component seems to be based on version 1.2.5.1 (the last version of ROMIO released independently).  Did anyone make any progress using the ROMIO from later MPICH releases (version 1.2.6, etc.)?  It seems to me these are fairly broken as far as compatibility with other MPIs is concerned.


Thanks, David

--
David Daniel 
Advanced Computing Laboratory, LANL, MS-B287, Los Alamos NM 87545, USA




Re: [O-MPI devel] Question on ROMIO

2005-08-18 Thread David Daniel

On Aug 18, 2005, at 4:24 PM, Brian Barrett wrote:


On Aug 18, 2005, at 4:53 PM, David Daniel wrote:



A question for those who did the ROMIO port...

The ROMIO component seems to be based on version 1.2.5.1 (the last
version of ROMIO released independently).  Did anyone make any
progress using the ROMIO from later MPICH's (version 1.2.6 etc.)?
Seems to me these are fairly broken as far as compatibility with
other MPIs is concerned.

Thanks, David




Yes, we took the last stable individual release of ROMIO for OMPI.
We haven't looked at bringing in the MPICH-integrated releases.  I'm
not sure how hard it will be to rip out the MPICH-specific stuff from
ROMIO, but at this point, it would take some work to replicate
everything we did to integrate ROMIO into OMPI.  Is this a 1.0
requirement?


No -- Don't panic! The parallel I/O folks here are just interested in  
seeing whether there are fixes in later versions that would help with  
performance.


I was trying to port 1.2.6 into LA-MPI but it is painful (i.e.  
broken) with an MPI implementation that doesn't have MPI_Info  
defined.  My guess is it will be easier with Open MPI.


David


[O-MPI devel] Open MPI over IB in action

2005-08-24 Thread David Daniel

Interesting news...

Jim Barker installed Open MPI on one of our visualization teams' InfiniBand clusters.  They successfully built ParaView and ran it to drive visualization on a 3x3 "power wall" tiled display.  ParaView has a history of breaking MPI implementations, so I'm very happy that this went so smoothly.


David
--
David Daniel +1-505-667-0883
Advanced Computing Laboratory, LANL, MS-B287, Los Alamos NM 87545, USA



[O-MPI devel] Fortran peculiarities on Mac OS X 10.4

2005-08-30 Thread David Daniel

Hi Folks,

Anyone had any luck building fortran on Tiger, particularly f90?

I'm probably just dumb, but appended are 4 problems I've seen.

Thanks, David



1. gfortran

configure --enable-f77 --enable-f90

[snip]

*** Fortran 77 compiler
checking for gfortran... gfortran
checking whether we are using the GNU Fortran 77 compiler... yes
checking whether gfortran accepts -g... yes
checking gfortran external symbol convention... double underscore
checking if FORTRAN compiler supports LOGICAL... yes
checking size of FORTRAN LOGICAL... 4
checking for C type corresponding to Fortran LOGICAL... int
checking alignment of FORTRAN LOGICAL... unknown
configure: WARNING: *** Problem running configure test!
configure: WARNING: *** See config.log for details.
configure: error: *** Cannot continue.



2. xlf

configure FC=xlf F77=xlf --enable-f77 --enable-f90

[snip]

*** Fortran 77 compiler
checking whether we are using the GNU Fortran 77 compiler... no
checking whether xlf accepts -g... yes
checking xlf external symbol convention... no underscore
checking if FORTRAN compiler supports LOGICAL... yes
checking size of FORTRAN LOGICAL... unknown
configure: WARNING: *** Problem running configure test!
configure: WARNING: *** See config.log for details.
configure: error: *** Cannot continue.



3. xlf again

configure FC=xlf95 F77=xlf77 --enable-f77 --enable-f90

[snip]

*** Fortran 77 compiler
checking whether we are using the GNU Fortran 77 compiler... no
checking whether xlf77 accepts -g... no
checking xlf77 external symbol convention... configure: WARNING:  
unable to produce an object file testing F77 compiler

checking if FORTRAN compiler supports LOGICAL... no
checking if FORTRAN compiler supports INTEGER... no
checking if FORTRAN compiler supports INTEGER*1... no
checking if FORTRAN compiler supports INTEGER*2... no
checking if FORTRAN compiler supports INTEGER*4... no
checking if FORTRAN compiler supports INTEGER*8... no
checking if FORTRAN compiler supports INTEGER*16... no
checking if FORTRAN compiler supports REAL... no
checking if FORTRAN compiler supports REAL*4... no
checking if FORTRAN compiler supports REAL*8... no
checking if FORTRAN compiler supports REAL*16... no
checking if FORTRAN compiler supports DOUBLE PRECISION... no
checking if FORTRAN compiler supports COMPLEX... no
checking if FORTRAN compiler supports COMPLEX*8... no
checking if FORTRAN compiler supports COMPLEX*16... no
checking if FORTRAN compiler supports COMPLEX*32... no
checking for max fortran MPI handle index... 2147483647

*** Fortran 90/95 compiler
checking whether we are using the GNU Fortran compiler... no
checking whether xlf95 accepts -g... yes
checking for Fortran flag to compile .f files... none
checking for Fortran flag to compile .f90 files... -qsuffix=f=f90
checking for Fortran flag to compile .f95 files... -qsuffix=f=f95
checking whether xlf77 and xlf95 compilers are compatible... no
configure: WARNING: *** Fortran 77 and Fortran 90 compilers are not  
link compatible

configure: WARNING: *** Disabling Fortran 90/95 bindings



4. Do we need to add -lSystemStubs ???

$ configure F77=gxlf --disable-f90 --enable-f77 --enable-static --disable-shared


$ make
$ make install
$ mpif77 hellof.f
** _main   === End of Compilation 1 ===
1501-510  Compilation successful for file hellof.f.
/usr/bin/ld: Undefined symbols:
_asprintf$LDBLStub
_fprintf$LDBLStub
_snprintf$LDBLStub
_sprintf$LDBLStub
_sscanf$LDBLStub
_printf$LDBLStub
_vfprintf$LDBLStub
_vasprintf$LDBLStub
_syslog$LDBLStub
xlf95 -I/opt/openmpi/trunk/include -I/opt/openmpi/trunk/include/openmpi/ompi hellof.f -L/opt/openmpi/trunk/lib -lmpi -lorte -lopal -lm -ldl



BUT explicitly adding -lSystemStubs (which needs to be at the end, so I can't use mpif77 -- or can I?) works:


$ xlf95 -I/opt/openmpi/trunk/include -I/opt/openmpi/trunk/include/openmpi/ompi hellof.f -L/opt/openmpi/trunk/lib -lmpi -lorte -lopal -lm -ldl -lSystemStubs

** _main   === End of Compilation 1 ===
1501-510  Compilation successful for file hellof.f.




Re: [O-MPI devel] Fortran peculiarities on Mac OS X 10.4

2005-08-30 Thread David Daniel
Thanks, Greg and George.  I now have xlf working, and I guess my gfortran build may be flaky, but I can live with that for now.


David



[O-MPI devel] totalview

2005-09-20 Thread David Daniel
TotalView now appears to be working with pls_rsh (both local and remote nodes) and with pls_bproc.


Not tested elsewhere.

David


Re: [O-MPI devel] Intel tests

2006-01-14 Thread David Daniel

Hi Graham,

On Jan 14, 2006, at 2:07 PM, Graham E Fagg wrote:

Hi all,
  whatever this fixed/changed, I no longer get corrupted memory in the tuned data segment hung off each communicator...!  I'm still testing to see if I get Tim P's error.
G

On Sat, 14 Jan 2006 bosi...@osl.iu.edu wrote:


Author: bosilca
Date: 2006-01-14 15:21:44 -0500 (Sat, 14 Jan 2006)
New Revision: 8692

Modified:
  trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c
  trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.h
  trunk/ompi/mca/btl/tcp/btl_tcp_frag.c
  trunk/ompi/mca/btl/tcp/btl_tcp_frag.h
Log:
A better implementation for the TCP endpoint cache + few comments.



On a 64-bit bproc/Myrinet system I'm seeing Tim P's problem with the current head of the trunk.  See attached output.


David




$ ompi_info | head
Open MPI: 1.1a1svn01142006
   Open MPI SVN revision: svn01142006
Open RTE: 1.1a1svn01142006
   Open RTE SVN revision: svn01142006
OPAL: 1.1a1svn01142006
   OPAL SVN revision: svn01142006
  Prefix: /scratch/modules/opt/openmpi-trunk-nofortran-bproc64

Configured architecture: x86_64-unknown-linux-gnu
   Configured by: ddd
   Configured on: Sat Jan 14 17:22:16 MST 2006

$ make MPIRUN='mpirun -mca coll basic' MPI_Allreduce_user_c
(cd src ; make MPI_Allreduce_user_c)
make[1]: Entering directory `/home/ddd/intel_tests/src'
mpicc -g -Isrc   -c -o libmpitest.o libmpitest.c
mpicc -g -Isrc  -o MPI_Allreduce_user_c MPI_Allreduce_user_c.c libmpitest.o -lm

make[1]: Leaving directory `/home/ddd/intel_tests/src'
mpirun -mca coll basic -n 4 --  `pwd`/src/MPI_Allreduce_user_c
MPITEST info  (0): Starting MPI_Allreduce_user() test
MPITEST_results: MPI_Allreduce_user() all tests PASSED (7076)

$ make MPIRUN='mpirun' MPI_Allreduce_user_c
(cd src ; make MPI_Allreduce_user_c)
make[1]: Entering directory `/home/ddd/intel_tests/src'
make[1]: `MPI_Allreduce_user_c' is up to date.
make[1]: Leaving directory `/home/ddd/intel_tests/src'
mpirun -n 4 --  `pwd`/src/MPI_Allreduce_user_c
MPITEST info  (0): Starting MPI_Allreduce_user() test
MPITEST error (0): i=0, int value=4, expected 1
MPITEST error (0): i=1, int value=4, expected 1
MPITEST error (0): i=2, int value=4, expected 1
MPITEST error (0): i=3, int value=4, expected 1

...




[O-MPI devel] LLNL OpenMP + MPI benchmarks

2006-02-02 Thread David Daniel

http://www.llnl.gov/asci/purple/benchmarks/