Re: [OMPI devel] mpirun --prefix question
FWIW, I believe that we had intended --prefix to handle simple cases, which is why this probably doesn't work for you. But as long as the different prefixes are specified for different nodes, it could probably be made to work.

Which launcher are you using this with?

On Mar 21, 2007, at 11:36 PM, Ralph Castain wrote:

> Yo David
>
> What system are you running this on? RoadRunner? If so, I can take a look
> at "fixing" it for you tomorrow (Thurs).
>
> Ralph
>
> On 3/21/07 10:17 AM, "David Daniel" wrote:
>
>> I'm experimenting with heterogeneous applications (x86_64 <--> ppc64),
>> where the systems share the file system where Open MPI is installed.
>> What I would like to be able to do is something like this:
>>
>>   mpirun --np 1 --host host-x86_64 --prefix /opt/ompi/x86_64 a.out.x86_64 : \
>>          --np 1 --host host-ppc64  --prefix /opt/ompi/ppc64  a.out.ppc64
>>
>> Unfortunately it looks as if the second --prefix is always ignored. My
>> guess is that orte_app_context_t::prefix_dir is getting set, but only the
>> 0th app context is ever consulted (except in the dynamic process stuff,
>> where I do see a loop over the app context array).
>>
>> I can of course work around it with startup scripts, but a command line
>> solution would be attractive. This is with openmpi-1.2.
>>
>> Thanks, David

-- 
Jeff Squyres
Cisco Systems
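For what it's worth, the startup-script workaround David mentions usually amounts to having each node's non-interactive shell startup select the install tree that matches its own architecture, so no --prefix is needed at all. A minimal sketch, assuming the per-architecture installs live under /opt/ompi/<arch> as in the example above and that `uname -m` reports x86_64/ppc64 on the respective nodes:

  # hypothetical fragment for ~/.bashrc (read by non-interactive ssh shells)
  OMPI_PREFIX=/opt/ompi/$(uname -m)        # e.g. /opt/ompi/x86_64 or /opt/ompi/ppc64
  export PATH=$OMPI_PREFIX/bin:$PATH
  export LD_LIBRARY_PATH=$OMPI_PREFIX/lib:$LD_LIBRARY_PATH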
Re: [OMPI devel] mpirun --prefix question
This is a development system for roadrunner using ssh.

David

On Mar 22, 2007, at 5:19 AM, Jeff Squyres wrote:

> Which launcher are you using this with?

-- 
David Daniel
Computer Science for High-Performance Computing (CCS-1)
Re: [OMPI devel] mpirun --prefix question
We had a nice chat about this on the OpenRTE telecon this morning. The question of what to do with multiple prefixes has been a long-running issue, most recently captured in Trac bug report #497. The problem is that the prefix is intended to tell us where to find the ORTE/OMPI executables, and therefore is associated with a node - not an app_context. What we haven't been able to define is an appropriate notation that a user can exploit to tell us the association.

This issue has arisen on several occasions where either (a) users have heterogeneous clusters with a common file system, so the prefix must be adjusted on each *type* of node to point to the correct type of binary; or (b) for whatever reason, typically on rsh/ssh clusters, users have installed the binaries in different locations on some of the nodes. In the latter case, the reports have been from homogeneous clusters, so the *type* of binary was never the issue - it just wasn't located where we expected.

Sun's solution is (I believe) what most of us would expect - they locate their executables in the same relative location on all of their nodes, and the binary in that location is correct for the local architecture. This requires, though, that the "prefix" location not be on a common file system. Unfortunately, that isn't the case with LANL's Roadrunner, nor can we expect that everyone will follow that sensible approach :-). So we need a notation to support the "exception" case where someone truly needs to specify a prefix per node (or set of nodes).

We discussed a number of options, including auto-detecting the local arch and appending it to the specified "prefix", among several others. After discussing them, those of us on the call decided that adding a field to the hostfile that specifies the prefix to use on that host would be the best solution. This could be done on a cluster-level basis, so - although it is annoying to create the data file - at least it would only have to be done once. Again, this is the exception case, so requiring a little inconvenience seems a reasonable thing to do.

Anyone have heartburn and/or other suggestions? If not, we might start to play with this next week. We would have to make some small modifications to the RAS, RMAPS, and PLS components to ensure that any multi-prefix info gets correctly propagated and used across all platforms for consistent behavior.

Ralph

On 3/22/07 9:11 AM, "David Daniel" wrote:

> This is a development system for roadrunner using ssh.
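To make the proposal concrete, here is a purely hypothetical sketch of what a per-host prefix field in a hostfile might look like; the keyword and syntax are illustrative only, since nothing had been decided at this point (slots= is the existing hostfile keyword, prefix= is the invented part):

  # hypothetical hostfile -- the prefix= field is the proposed, not-yet-implemented part
  host-x86_64 slots=2 prefix=/opt/ompi/x86_64
  host-ppc64  slots=2 prefix=/opt/ompi/ppc64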
[OMPI devel] RH Enterprise Linux issue
Is this a known problem? Building ompi 1.2 on RHEL4:

  ./configure --with-devel-headers --without-threads

(actually tried without '--without-threads' too, but no change)

  $ mpirun -np 2 test
  [beth:06029] *** Process received signal ***
  [beth:06029] Signal: Segmentation fault (11)
  [beth:06029] Signal code: Address not mapped (1)
  [beth:06029] Failing at address: 0x2e342e33
  [beth:06029] [ 0] /lib/tls/libc.so.6 [0x21b890]
  [beth:06029] [ 1] /usr/local/lib/libopen-rte.so.0(orte_init_stage1+0x293) [0xb7fc50cb]
  [beth:06029] [ 2] /usr/local/lib/libopen-rte.so.0(orte_system_init+0x1e) [0xb7fc84be]
  [beth:06029] [ 3] /usr/local/lib/libopen-rte.so.0(orte_init+0x6a) [0xb7fc4cee]
  [beth:06029] [ 4] mpirun(orterun+0x14b) [0x8049ecb]
  [beth:06029] [ 5] mpirun(main+0x2a) [0x8049d7a]
  [beth:06029] [ 6] /lib/tls/libc.so.6(__libc_start_main+0xd3) [0x208de3]
  [beth:06029] [ 7] mpirun [0x8049cc9]
  [beth:06029] *** End of error message ***
  Segmentation fault

Thanks,
Greg
Re: [OMPI devel] RH Enterprise Linux issue
Oh, and this is a single x86 machine. Just trying to launch locally.

  $ uname -a
  Linux 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 GNU/Linux

Greg
Re: [OMPI devel] RH Enterprise Linux issue
No, not a known problem -- my cluster is RHEL4U4 -- I use it for many thousands of runs of the OMPI v1.2 branch every day.

Can you see where it's dying in orte_init_stage1?

On Mar 22, 2007, at 2:17 PM, Greg Watson wrote:

> Is this a known problem? Building ompi 1.2 on RHEL4:
>
>   ./configure --with-devel-headers --without-threads

-- 
Jeff Squyres
Cisco Systems
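A minimal way to see where it's dying -- and presumably close to how the backtrace in the next message was obtained -- is to run mpirun under gdb and print the stack once the segfault hits; a sketch, reusing Greg's 'mpirun -np 2 test' invocation:

  $ gdb --args mpirun -np 2 test
  (gdb) run                 # runs until the SIGSEGV is raised
  (gdb) bt                  # backtrace at the point of the crash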
Re: [OMPI devel] RH Enterprise Linux issue
gdb says this:

  #0  0x2e342e33 in ?? ()
  #1  0xb7fe1d31 in orte_pls_base_select () from /usr/local/lib/libopen-rte.so.0
  #2  0xb7fc50cb in orte_init_stage1 () from /usr/local/lib/libopen-rte.so.0
  #3  0xb7fc84be in orte_system_init () from /usr/local/lib/libopen-rte.so.0
  #4  0xb7fc4cee in orte_init () from /usr/local/lib/libopen-rte.so.0
  #5  0x08049ecb in orterun (argc=4, argv=0xb9f4) at orterun.c:369
  #6  0x08049d7a in main (argc=4, argv=0xb9f4) at main.c:13
  (gdb) The program is running. Exit anyway? (y or n) y

I can recompile with debugging if that would be useful. Let me know if there's anything else I can do. Here's ompi_info in case it helps:

  Open MPI: 1.2
  Open MPI SVN revision: r14027
  Open RTE: 1.2
  Open RTE SVN revision: r14027
  OPAL: 1.2
  OPAL SVN revision: r14027
  Prefix: /usr/local
  Configured architecture: i686-pc-linux-gnu
  Configured on: Thu Mar 22 13:39:30 EDT 2007
  Built on: Thu Mar 22 13:55:38 EDT 2007
  C bindings: yes
  C++ bindings: yes
  Fortran77 bindings: yes (all)
  Fortran90 bindings: no
  Fortran90 bindings size: na
  C compiler: gcc
  C compiler absolute: /usr/bin/gcc
  C++ compiler: g++
  C++ compiler absolute: /usr/bin/g++
  Fortran77 compiler: g77
  Fortran77 compiler abs: /usr/bin/g77
  Fortran90 compiler: none
  Fortran90 compiler abs: none
  C profiling: yes
  C++ profiling: yes
  Fortran77 profiling: yes
  Fortran90 profiling: no
  C++ exceptions: no
  Thread support: no
  Internal debug support: no
  MPI parameter check: runtime
  Memory profiling support: no
  Memory debugging support: no
  libltdl support: yes
  Heterogeneous support: yes
  mpirun default --prefix: no
  mca: base: component_find: unable to open pml teg: file not found (ignored)
  MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2)
  MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2)
  MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2)
  MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2)
  MCA timer: linux (MCA v1.0, API v1.0, Component v1.2)
  MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
  MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
  MCA coll: basic (MCA v1.0, API v1.0, Component v1.2)
  MCA coll: self (MCA v1.0, API v1.0, Component v1.2)
  MCA coll: sm (MCA v1.0, API v1.0, Component v1.2)
  MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2)
  MCA io: romio (MCA v1.0, API v1.0, Component v1.2)
  MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2)
  MCA pml: cm (MCA v1.0, API v1.0, Component v1.2)
  MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2)
  MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2)
  MCA rcache: rb (MCA v1.0, API v1.0, Component v1.2)
  MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2)
  MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2)
  MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2)
  MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
  MCA topo: unity (MCA v1.0, API v1.0, Component v1.2)
  MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2)
  MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2)
  MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2)
  MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2)
  MCA gpr: null (MCA v1.0, API v1.0, Component v1.2)
  MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2)
  MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2)
  MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2)
  MCA iof: svc (MCA v1.0, API v1.0, Component v1.2)
  MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2)
  MCA ns: replica (MCA v1.0, API v2.0, Component v1.2)
  MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
  MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2)
  MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2)
  MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0.2)
  MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2)
  MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2)
  MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2)
  MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2)
  MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2)
  MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2)
  MCA rmgr: proxy (MCA v1.0, API v2.0, Co
Re: [OMPI devel] RH Enterprise Linux issue
Yes, if you could recompile with debugging, that would be great. What launcher are you trying to use?

On Mar 22, 2007, at 2:35 PM, Greg Watson wrote:

> I can recompile with debugging if that would be useful. Let me know if
> there's anything else I can do.
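For reference, a debug rebuild of the 1.2 tree would look something like the following, reusing Greg's original configure options; --enable-debug is the standard Open MPI configure switch for debug builds (it adds -g and extra internal checking):

  $ ./configure --with-devel-headers --without-threads --enable-debug
  $ make all install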
Re: [OMPI devel] RH Enterprise Linux issue
Scratch that. The problem was an installation over an old copy of ompi. Obviously picking up some old stuff. Sorry for the disturbance. Back to the bat cave...

Greg

On Mar 22, 2007, at 12:46 PM, Jeff Squyres wrote:

> Yes, if you could recompile with debugging, that would be great. What
> launcher are you trying to use?
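For anyone who hits the same symptom: installing a new Open MPI over an older copy can leave stale MCA plugins behind, since components that were removed or renamed between versions are not deleted by 'make install' (the "unable to open pml teg" line in the earlier ompi_info output looks like exactly that kind of leftover). One conservative cleanup, assuming the default /usr/local prefix used in this build, is to clear the old component and header directories before reinstalling:

  # assumes prefix=/usr/local as in Greg's build -- adjust to your installation
  $ rm -rf /usr/local/lib/openmpi        # installed MCA components (plugins)
  $ rm -rf /usr/local/include/openmpi    # stale headers from --with-devel-headers, if present
  $ make install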
Re: [OMPI devel] RH Enterprise Linux issue
Whew! You had me worried there for a minute... :-)

On Mar 22, 2007, at 3:15 PM, Greg Watson wrote:

> Scratch that. The problem was an installation over an old copy of ompi.
> Obviously picking up some old stuff. Sorry for the disturbance. Back to
> the bat cave...
Re: [OMPI devel] mpirun --prefix question
OK. This sounds sensible.

Thanks,
David

On Mar 22, 2007, at 10:38 AM, Ralph Castain wrote:

> After discussing them, those of us on the call decided that adding a field
> to the hostfile that specifies the prefix to use on that host would be the
> best solution.

-- 
David Daniel
Computer Science for High-Performance Computing (CCS-1)