[OMPI devel] mpirun --prefix question
I'm experimenting with heterogeneous applications (x86_64 <--> ppc64), where the systems share the file system on which Open MPI is installed. What I would like to be able to do is something like this:

  mpirun --np 1 --host host-x86_64 --prefix /opt/ompi/x86_64 a.out.x86_64 : \
         --np 1 --host host-ppc64  --prefix /opt/ompi/ppc64  a.out.ppc64

Unfortunately it looks as if the second --prefix is always ignored. My guess is that orte_app_context_t::prefix_dir is getting set, but only the 0th app context is ever consulted (except in the dynamic process code, where I do see a loop over the app context array).

I can of course work around it with startup scripts, but a command-line solution would be attractive. This is with openmpi-1.2.

Thanks, David
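To make the suspected behaviour concrete, here is a small, purely illustrative C sketch. The struct below is a simplified stand-in for orte_app_context_t, not the real ORTE code; it only shows the difference between consulting apps[0].prefix_dir for every host (the behaviour reported) and consulting each app context's own prefix_dir (the behaviour wanted).

    /* Purely illustrative stand-in -- NOT the real ORTE launcher code.
     * "struct app_context" here is a simplified mock of orte_app_context_t,
     * just to show per-app-context prefix handling. */
    #include <stdio.h>

    struct app_context {
        const char *host;        /* target host for this app context       */
        const char *prefix_dir;  /* set from each --prefix on the cmd line */
        const char *executable;  /* binary to launch                       */
    };

    int main(void)
    {
        struct app_context apps[] = {
            { "host-x86_64", "/opt/ompi/x86_64", "a.out.x86_64" },
            { "host-ppc64",  "/opt/ompi/ppc64",  "a.out.ppc64"  },
        };
        int n = (int)(sizeof(apps) / sizeof(apps[0]));

        for (int i = 0; i < n; i++) {
            /* Reported behaviour: the launcher uses apps[0].prefix_dir for
             * every host.  Desired behaviour (shown here): use each app
             * context's own prefix_dir. */
            printf("launch %s on %s using prefix %s\n",
                   apps[i].executable, apps[i].host, apps[i].prefix_dir);
        }
        return 0;
    }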
Re: [OMPI devel] mpirun --prefix question
This is a development system for roadrunner using ssh.

David

On Mar 22, 2007, at 5:19 AM, Jeff Squyres wrote:

  FWIW, I believe that we had intended --prefix to handle simple cases, which is why this probably doesn't work for you. But as long as the different prefixes are specified for different nodes, it could probably be made to work. Which launcher are you using this with?

On Mar 21, 2007, at 11:36 PM, Ralph Castain wrote:

  Yo David

  What system are you running this on? RoadRunner? If so, I can take a look at "fixing" it for you tomorrow (Thurs).

  Ralph

--
David Daniel
Computer Science for High-Performance Computing (CCS-1)
Re: [OMPI devel] mpirun --prefix question
OK. This sounds sensible.

Thanks, David

On Mar 22, 2007, at 10:38 AM, Ralph Castain wrote:

  We had a nice chat about this on the OpenRTE telecon this morning. The question of what to do with multiple prefixes has been a long-running issue, most recently captured in Trac bug report #497.

  The problem is that prefix is intended to tell us where to find the ORTE/OMPI executables, and is therefore associated with a node - not an app_context. What we haven't been able to define is an appropriate notation that a user can exploit to tell us the association.

  This issue has arisen on several occasions where either (a) users have heterogeneous clusters with a common file system, so the prefix must be adjusted on each *type* of node to point to the correct type of binary; or (b) for whatever reason, typically on rsh/ssh clusters, users have installed the binaries in different locations on some of the nodes. In the latter case, the reports have been from homogeneous clusters, so the *type* of binary was never the issue - it just wasn't located where we expected.

  Sun's solution is (I believe) what most of us would expect - they locate their executables in the same relative location on all their nodes, and the binary in that location is correct for the local architecture. This requires, though, that the "prefix" location not be on a common file system. Unfortunately, that isn't the case with LANL's roadrunner, nor can we expect that everyone will follow that sensible approach :-). So we need a notation to support the "exception" case where someone needs to truly specify prefix versus node(s).

  We discussed a number of options, including auto-detecting the local arch and appending it to the specified "prefix", among others. Those of us on the call decided that adding a field to the hostfile that specifies the prefix to use on that host would be the best solution. This could be done on a cluster-level basis, so - although it is annoying to create the data file - at least it would only have to be done once. Again, this is the exception case, so requiring a little inconvenience seems reasonable.

  Anyone have heartburn and/or other suggestions? If not, we might start to play with this next week. We would have to make some small modifications to the RAS, RMAPS, and PLS components to ensure that any multi-prefix info gets correctly propagated and used across all platforms for consistent behavior.

  Ralph

--
David Daniel
Computer Science for High-Performance Computing (CCS-1)
[OMPI devel] collective problems
Hi Folks,

I have been seeing some nasty behaviour in collectives, particularly bcast and reduce. Attached is a reproducer (for bcast). The code will rapidly slow to a crawl (usually interpreted as a hang in real applications) and sometimes gets killed with sigbus or sigterm.

I see this with:

  openmpi-1.2.3 or openmpi-1.2.4
  OFED 1.2
  Linux 2.6.19 + patches
  gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
  4-socket, dual-core Opterons

run as:

  mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang

To my now uneducated eye it looks as if the root process is rushing ahead and not progressing earlier bcasts.

Anyone else seeing similar? Any ideas for workarounds? As a point of reference, mvapich2 0.9.8 works fine.

Thanks, David

[Attachment: bcast-hang.c]
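The reproducer itself is only referenced as an attachment here. As a rough, hypothetical sketch (not the actual bcast-hang.c), the kind of program being described is a tight loop of broadcasts, in which the root can return from MPI_Bcast and post many further broadcasts before the other ranks have progressed the earlier ones:

    /* Rough sketch only -- NOT the bcast-hang.c attachment from this post.
     * A tight loop of moderately large broadcasts of this shape lets the
     * root race ahead of the non-root ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define NITER 10000
    #define LEN   (1 << 20)          /* 1 MB payload per broadcast */

    static char buf[LEN];

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, rank & 0xff, sizeof(buf));

        for (int i = 0; i < NITER; i++) {
            MPI_Bcast(buf, LEN, MPI_CHAR, 0, MPI_COMM_WORLD);
            if (rank == 0 && i % 1000 == 0) {
                printf("iteration %d\n", i);   /* watch where progress stalls */
                fflush(stdout);
            }
        }

        MPI_Finalize();
        return 0;
    }

Something of this shape could be launched in the spirit of the original report, e.g. mpirun --mca btl self,openib --npernode 1 --np 4 ./bcast-sketch.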
[OMPI devel] libdir not propagated to contrib/vt/vt ??
Building against recent trunk heads (r18643), it appears that libdir (as set, for example, by ./configure --prefix=$PREFIX --libdir=$PREFIX/lib64) is not propagated to ompi-trunk/contrib/vt/vt. Feature or bug?

Thanks, David
[O-MPI devel] Question on ROMIO
A question for those who did the ROMIO port...

The ROMIO component seems to be based on version 1.2.5.1 (the last version of ROMIO released independently). Did anyone make any progress using the ROMIO from later MPICH releases (version 1.2.6 etc.)? It seems to me these are fairly broken as far as compatibility with other MPIs is concerned.

Thanks, David

--
David Daniel
Advanced Computing Laboratory, LANL, MS-B287, Los Alamos NM 87545, USA
Re: [O-MPI devel] Question on ROMIO
On Aug 18, 2005, at 4:24 PM, Brian Barrett wrote:

  Yes, we took the last stable individual release of ROMIO for OMPI. We haven't looked at bringing in the MPICH-integrated releases. I'm not sure how hard it will be to rip out the MPICH-specific stuff from ROMIO, but at this point it would take some work to replicate everything we did to integrate ROMIO into OMPI. Is this a 1.0 requirement?

No -- Don't panic! The parallel I/O folks here are just interested in seeing whether there are fixes in later versions that would help with performance. I was trying to port 1.2.6 into LA-MPI, but it is painful (i.e. broken) with an MPI implementation that doesn't have MPI_Info defined. My guess is it will be easier with Open MPI.

David
[O-MPI devel] Open MPI over IB in action
Interesting news... Jim Barker installed Open MPI on one of our visualization teams' InfiniBand clusters. They successfully built ParaView and ran it to drive visualization on a 3x3 "power wall" tiled display. ParaView has a history of breaking MPI implementations, so I'm very happy that this went so smoothly.

David

--
David Daniel  +1-505-667-0883
Advanced Computing Laboratory, LANL, MS-B287, Los Alamos NM 87545, USA
[O-MPI devel] Fortran peculiarities on Mac OS X 10.4
Hi Folks,

Anyone had any luck building Fortran support on Tiger, particularly f90? I'm probably just dumb, but appended are 4 problems I've seen.

Thanks, David

1. gfortran

  configure --enable-f77 --enable-f90
  [snip]
  *** Fortran 77 compiler
  checking for gfortran... gfortran
  checking whether we are using the GNU Fortran 77 compiler... yes
  checking whether gfortran accepts -g... yes
  checking gfortran external symbol convention... double underscore
  checking if FORTRAN compiler supports LOGICAL... yes
  checking size of FORTRAN LOGICAL... 4
  checking for C type corresponding to Fortran LOGICAL... int
  checking alignment of FORTRAN LOGICAL... unknown
  configure: WARNING: *** Problem running configure test!
  configure: WARNING: *** See config.log for details.
  configure: error: *** Cannot continue.

2. xlf

  configure FC=xlf F77=xlf --enable-f77 --enable-f90
  [snip]
  *** Fortran 77 compiler
  checking whether we are using the GNU Fortran 77 compiler... no
  checking whether xlf accepts -g... yes
  checking xlf external symbol convention... no underscore
  checking if FORTRAN compiler supports LOGICAL... yes
  checking size of FORTRAN LOGICAL... unknown
  configure: WARNING: *** Problem running configure test!
  configure: WARNING: *** See config.log for details.
  configure: error: *** Cannot continue.

3. xlf again

  configure FC=xlf95 F77=xlf77 --enable-f77 --enable-f90
  [snip]
  *** Fortran 77 compiler
  checking whether we are using the GNU Fortran 77 compiler... no
  checking whether xlf77 accepts -g... no
  checking xlf77 external symbol convention...
  configure: WARNING: unable to produce an object file testing F77 compiler
  checking if FORTRAN compiler supports LOGICAL... no
  checking if FORTRAN compiler supports INTEGER... no
  checking if FORTRAN compiler supports INTEGER*1... no
  checking if FORTRAN compiler supports INTEGER*2... no
  checking if FORTRAN compiler supports INTEGER*4... no
  checking if FORTRAN compiler supports INTEGER*8... no
  checking if FORTRAN compiler supports INTEGER*16... no
  checking if FORTRAN compiler supports REAL... no
  checking if FORTRAN compiler supports REAL*4... no
  checking if FORTRAN compiler supports REAL*8... no
  checking if FORTRAN compiler supports REAL*16... no
  checking if FORTRAN compiler supports DOUBLE PRECISION... no
  checking if FORTRAN compiler supports COMPLEX... no
  checking if FORTRAN compiler supports COMPLEX*8... no
  checking if FORTRAN compiler supports COMPLEX*16... no
  checking if FORTRAN compiler supports COMPLEX*32... no
  checking for max fortran MPI handle index... 2147483647
  *** Fortran 90/95 compiler
  checking whether we are using the GNU Fortran compiler... no
  checking whether xlf95 accepts -g... yes
  checking for Fortran flag to compile .f files... none
  checking for Fortran flag to compile .f90 files... -qsuffix=f=f90
  checking for Fortran flag to compile .f95 files... -qsuffix=f=f95
  checking whether xlf77 and xlf95 compilers are compatible... no
  configure: WARNING: *** Fortran 77 and Fortran 90 compilers are not link compatible
  configure: WARNING: *** Disabling Fortran 90/95 bindings

4. Do we need to add -lSystemStubs ???

  $ configure F77=gxlf --disable-f90 --enable-f77 --enable-static --disable-shared
  $ make
  $ make install
  $ mpif77 hellof.f
  ** _main   === End of Compilation 1 ===
  1501-510  Compilation successful for file hellof.f.
  /usr/bin/ld: Undefined symbols:
  _asprintf$LDBLStub
  _fprintf$LDBLStub
  _snprintf$LDBLStub
  _sprintf$LDBLStub
  _sscanf$LDBLStub
  _printf$LDBLStub
  _vfprintf$LDBLStub
  _vasprintf$LDBLStub
  _syslog$LDBLStub

  xlf95 -I/opt/openmpi/trunk/include -I/opt/openmpi/trunk/include/openmpi/ompi hellof.f -L/opt/openmpi/trunk/lib -lmpi -lorte -lopal -lm -ldl

  BUT explicitly adding -lSystemStubs (which needs to be at the end, so I can't use mpif77 -- or can I?):

  $ xlf95 -I/opt/openmpi/trunk/include -I/opt/openmpi/trunk/include/openmpi/ompi hellof.f -L/opt/openmpi/trunk/lib -lmpi -lorte -lopal -lm -ldl -lSystemStubs
  ** _main   === End of Compilation 1 ===
  1501-510  Compilation successful for file hellof.f.
Re: [O-MPI devel] Fortran peculiarities on Mac OS X 10.4
Thanks, Greg and George. I now have xlf working, and I guess my gfortran build may be flakey, but I can live with that for now. David
[O-MPI devel] totalview
TotalView now appears to be working for pls_rsh (with both local and remote nodes) and for pls_bproc. Not tested elsewhere.

David
Re: [O-MPI devel] Intel tests
Hi Graham,

On Jan 14, 2006, at 2:07 PM, Graham E Fagg wrote:

  Hi all, whatever this fixed/changed, I no longer get corrupted memory in the tuned data segment hung off each communicator... ! I'm still testing to see if I get TimPs error. G

  On Sat, 14 Jan 2006 bosi...@osl.iu.edu wrote:

    Author: bosilca
    Date: 2006-01-14 15:21:44 -0500 (Sat, 14 Jan 2006)
    New Revision: 8692
    Modified:
      trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.c
      trunk/ompi/mca/btl/tcp/btl_tcp_endpoint.h
      trunk/ompi/mca/btl/tcp/btl_tcp_frag.c
      trunk/ompi/mca/btl/tcp/btl_tcp_frag.h
    Log:
    A better implementation for the TCP endpoint cache + few comments.

On a 64-bit bproc/myrinet system I'm seeing Tim P's problem with the current head of the trunk. See attached output.

David

  $ ompi_info | head
  Open MPI: 1.1a1svn01142006
  Open MPI SVN revision: svn01142006
  Open RTE: 1.1a1svn01142006
  Open RTE SVN revision: svn01142006
  OPAL: 1.1a1svn01142006
  OPAL SVN revision: svn01142006
  Prefix: /scratch/modules/opt/openmpi-trunk-nofortran-bproc64
  Configured architecture: x86_64-unknown-linux-gnu
  Configured by: ddd
  Configured on: Sat Jan 14 17:22:16 MST 2006

  $ make MPIRUN='mpirun -mca coll basic' MPI_Allreduce_user_c
  (cd src ; make MPI_Allreduce_user_c)
  make[1]: Entering directory `/home/ddd/intel_tests/src'
  mpicc -g -Isrc -c -o libmpitest.o libmpitest.c
  mpicc -g -Isrc -o MPI_Allreduce_user_c MPI_Allreduce_user_c.c libmpitest.o -lm
  make[1]: Leaving directory `/home/ddd/intel_tests/src'
  mpirun -mca coll basic -n 4 -- `pwd`/src/MPI_Allreduce_user_c
  MPITEST info (0): Starting MPI_Allreduce_user() test
  MPITEST_results: MPI_Allreduce_user() all tests PASSED (7076)

  $ make MPIRUN='mpirun' MPI_Allreduce_user_c
  (cd src ; make MPI_Allreduce_user_c)
  make[1]: Entering directory `/home/ddd/intel_tests/src'
  make[1]: `MPI_Allreduce_user_c' is up to date.
  make[1]: Leaving directory `/home/ddd/intel_tests/src'
  mpirun -n 4 -- `pwd`/src/MPI_Allreduce_user_c
  MPITEST info (0): Starting MPI_Allreduce_user() test
  MPITEST error (0): i=0, int value=4, expected 1
  MPITEST error (0): i=1, int value=4, expected 1
  MPITEST error (0): i=2, int value=4, expected 1
  MPITEST error (0): i=3, int value=4, expected 1
  ...
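For readers without the Intel test suite handy: the failing test exercises MPI_Allreduce with a user-defined reduction operation. The sketch below is purely illustrative -- it is not the suite's MPI_Allreduce_user_c source, and it uses a simple element-wise sum as the user op.

    /* Illustrative sketch only -- NOT the Intel test suite's
     * MPI_Allreduce_user_c.  It shows the MPI feature the failing test
     * exercises: MPI_Allreduce with a user-defined reduction op. */
    #include <mpi.h>
    #include <stdio.h>

    /* Element-wise integer sum as a user-defined MPI reduction op. */
    static void my_sum(void *in, void *inout, int *len, MPI_Datatype *dtype)
    {
        int *a = (int *) in;
        int *b = (int *) inout;
        (void) dtype;                    /* only MPI_INT is used here */
        for (int i = 0; i < *len; i++) {
            b[i] += a[i];
        }
    }

    int main(int argc, char **argv)
    {
        int rank, size, errs = 0;
        int sendbuf[4], recvbuf[4];
        MPI_Op op;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 0; i < 4; i++) sendbuf[i] = 1;

        MPI_Op_create(my_sum, 1 /* commutative */, &op);
        MPI_Allreduce(sendbuf, recvbuf, 4, MPI_INT, op, MPI_COMM_WORLD);
        MPI_Op_free(&op);

        /* Every element should equal the number of ranks contributing 1. */
        for (int i = 0; i < 4; i++) {
            if (recvbuf[i] != size) {
                printf("rank %d: i=%d, value=%d, expected %d\n",
                       rank, i, recvbuf[i], size);
                errs++;
            }
        }
        if (rank == 0 && errs == 0) printf("user-op allreduce PASSED\n");

        MPI_Finalize();
        return 0;
    }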
[O-MPI devel] LLNL OpenMP + MPI benchmarks
http://www.llnl.gov/asci/purple/benchmarks/