Re: [OMPI users] Ompi failing on mx only
I'm losing it today; I just now noticed I sent mx_info for the wrong nodes ...

// node-1
$ mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30 14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===================================================================
Instance #0: 224.9 MHz LANai, 133.3 MHz PCI bus, 2 MB SRAM
        Status: Running, P0: Link up
        MAC Address: 00:60:dd:47:ab:c9
        Product code: M3F-PCIXD-2 V2.2
        Part number: 09-03034
        Serial number: 299207
        Mapper: 46:4d:53:4d:41:50, version = 0x000c, configured
        Mapped hosts: 16

INDEX  MAC ADDRESS        HOST NAME     P0 (ROUTE COUNT)
-----  -----------        ---------     ----------------
  0)   00:60:dd:47:ab:c9  node-1:0      1,1
  1)   00:60:dd:47:c2:a7  juggernaut:0  5,3
  2)   00:60:dd:47:ab:c8  node-2:0      6,3
  3)   00:60:dd:47:ab:ca  node-3:0      7,3
  4)   00:60:dd:47:bf:65  node-7:0      7,3
  5)   00:60:dd:47:c2:e1  node-8:0      7,3
  6)   00:60:dd:47:c0:c1  node-9:0      6,3
  7)   00:60:dd:47:c0:e5  node-13:0     1,1
  8)   00:60:dd:47:c2:91  node-14:0     7,3
  9)   00:60:dd:47:c0:b2  node-15:0     7,3
 10)   00:60:dd:47:bf:f5  node-19:0     6,3
 11)   00:60:dd:47:c0:b1  node-20:0     8,3
 12)   00:60:dd:47:c0:f8  node-21:0     5,3
 13)   00:60:dd:47:c0:8a  node-25:0     6,3
 14)   00:60:dd:47:c0:c2  node-27:0     7,3
 15)   00:60:dd:47:c2:e0  node-26:0     6,3

// node-2
$ mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30 14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===================================================================
Instance #0: 224.9 MHz LANai, 133.0 MHz PCI bus, 2 MB SRAM
        Status: Running, P0: Link up
        MAC Address: 00:60:dd:47:ab:c8
        Product code: M3F-PCIXD-2 V2.2
        Part number: 09-03034
        Serial number: 299208
        Mapper: 46:4d:53:4d:41:50, version = 0x000c, configured
        Mapped hosts: 16

INDEX  MAC ADDRESS        HOST NAME     P0 (ROUTE COUNT)
-----  -----------        ---------     ----------------
  0)   00:60:dd:47:ab:c8  node-2:0      1,1
  1)   00:60:dd:47:ab:c9  node-1:0      6,3
  2)   00:60:dd:47:c2:a7  juggernaut:0  5,3
  3)   00:60:dd:47:ab:ca  node-3:0      6,3
  4)   00:60:dd:47:bf:65  node-7:0      5,3
  5)   00:60:dd:47:c2:e1  node-8:0      6,3
  6)   00:60:dd:47:c0:c1  node-9:0      5,3
  7)   00:60:dd:47:c0:e5  node-13:0     6,3
  8)   00:60:dd:47:c2:91  node-14:0     8,3
  9)   00:60:dd:47:c0:b2  node-15:0     1,1
 10)   00:60:dd:47:bf:f5  node-19:0     5,3
 11)   00:60:dd:47:c0:f8  node-21:0     5,3
 12)   00:60:dd:47:c0:8a  node-25:0     5,3
 13)   00:60:dd:47:c0:c2  node-27:0     6,3
 14)   00:60:dd:47:c2:e0  node-26:0     5,3
 15)   00:60:dd:47:c0:b1  node-20:0     6,3
Re: [OMPI users] Ompi failing on mx only
Ah, sorry about that ...

$ ./mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30 14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===================================================================
Instance #0: 224.9 MHz LANai, 99.7 MHz PCI bus, 2 MB SRAM
        Status: Running, P0: Link up
        MAC Address: 00:60:dd:47:c2:a7
        Product code: M3F-PCIXD-2 V2.2
        Part number: 09-03034
        Serial number: 291824
        Mapper: 46:4d:53:4d:41:50, version = 0x000c, configured
        Mapped hosts: 16

INDEX  MAC ADDRESS        HOST NAME     P0 (ROUTE COUNT)
-----  -----------        ---------     ----------------
  0)   00:60:dd:47:c2:a7  juggernaut:0  1,1
  1)   00:60:dd:47:ab:c9  node-1:0      6,3
  2)   00:60:dd:47:ab:c8  node-2:0      6,3
  3)   00:60:dd:47:ab:ca  node-3:0      6,3
  4)   00:60:dd:47:bf:65  node-7:0      6,3
  5)   00:60:dd:47:c2:e1  node-8:0      6,3
  6)   00:60:dd:47:c0:c1  node-9:0      6,3
  7)   00:60:dd:47:c0:e5  node-13:0     6,3
  8)   00:60:dd:47:c2:91  node-14:0     6,3
  9)   00:60:dd:47:c0:b2  node-15:0     6,3
 10)   00:60:dd:47:bf:f5  node-19:0     1,1
 11)   00:60:dd:47:c0:b1  node-20:0     6,3
 12)   00:60:dd:47:c0:f8  node-21:0     7,3
 13)   00:60:dd:47:c0:8a  node-25:0     6,3
 14)   00:60:dd:47:c0:c2  node-27:0     5,3
 15)   00:60:dd:47:c2:e0  node-26:0     5,3
Re: [OMPI users] Ompi failing on mx only
> As for the MTL, there is a bug in the MX MTL for v1.2 that has been
> fixed, but after 1.2b2 ...

oops, i was stupidly assuming he already had that fix.  yes, this is an important fix...
-reese
Re: [OMPI users] Ompi failing on mx only
Sorry to jump into the discussion late.

The MX BTL does not support communication between processes on the same node by itself, so you have to include the shared memory transport when using MX.  This will eventually be fixed, but likely not for the 1.2 release.  So if you do:

  mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi

it should work much better.

As for the MTL, there is a bug in the MX MTL for v1.2 (fixed, but only after 1.2b2 was cut) that could cause the random failures you were seeing.  It will work much better after 1.2b3 is released (or, if you are feeling really lucky, you can try out the 1.2 nightly tarballs).

The MTL is a new feature in v1.2.  It is a different communication abstraction designed to support interconnects that have matching implemented in the lower-level library or in hardware (Myrinet/MX, Portals, and InfiniPath are currently implemented).  The MTL allows us to exploit the low latency and asynchronous progress these libraries can provide, but it does mean multi-NIC abilities are reduced.  Further, the MTL is not well suited to interconnects like TCP or InfiniBand, so we will continue supporting the BTL interface as well.

Brian

On Jan 2, 2007, at 2:44 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:

> Is there any info anywhere on MTL?  Anyways, I've run w/ mtl, and sometimes it actually worked once.  But now I can't reproduce it and it's throwing sig 7's, 11's, and 4's depending upon the number of procs I give it.
>
> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi
> ...
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x2b88243862be
> mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
> 4 additional processes aborted (not shown)
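To make the distinction concrete, the two paths Brian describes are selected purely from the mpirun command line.  The forms below simply restate the invocations already used in this thread (prefix, hostfile, and process counts are the ones from the earlier messages), so treat them as a summary sketch rather than anything new:

  # BTL path (ob1 PML): MX between nodes, shared memory (sm) between
  # processes on the same node, and "self" for a process sending to itself.
  mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 5 --mca btl mx,sm,self ./cpi

  # MTL path (cm PML): message matching is done inside the MX library itself.
  mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 5 --mca pml cm --mca mtl mx,self ./cpi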
Re: [OMPI users] Ompi failing on mx only
> I've attached the ompi_info from node-1 and node-2.

thanks, but i need "mx_info", not "ompi_info" ;-)

> But now that you mention mapper, I take it that's what SEGV_MAPERR
> might be referring to.

this is an ompi red herring; it has nothing to do with Myrinet mapping, even though it kinda looks like it.

> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)

A gdb traceback would be interesting on this one.

thanks,
-reese
Re: [OMPI users] Ompi failing on mx only
About the -x, I've been trying it both ways and prefer the latter; the results for either are the same, and its value is correct.  I've attached the ompi_info from node-1 and node-2.  Sorry for not zipping them, but they were small and I think I'd have firewall issues.

$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
[node-14:19260] mx_connect fail for node-14:0 with key (error Endpoint closed or not connectable!)
[node-14:19261] mx_connect fail for node-14:0 with key (error Endpoint closed or not connectable!)
...

Is there any info anywhere on the MTL?  Anyway, I've run w/ mtl, and it actually worked once.  But now I can't reproduce it, and it's throwing sig 7's, 11's, and 4's depending upon the number of procs I give it.  But now that you mention mapper, I take it that's what SEGV_MAPERR might be referring to.  I'm looking into it.

$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi
Process 4 of 5 is on node-2
Process 0 of 5 is on node-1
Process 1 of 5 is on node-1
Process 2 of 5 is on node-1
Process 3 of 5 is on node-1
pi is approximately 3.1415926544231225, Error is 0.08333294
wall clock time = 0.019305
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b88243862be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
4 additional processes aborted (not shown)

Or sometimes I'll get this error, just depending upon the number of procs ...

$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl mx,self ./cpi
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x2aaab000
[0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
[1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
[2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
[3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
[4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
[5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
[6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
[7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576]
[8] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe3) [0x2b9b82a2d0a3]
[9] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(ompi_mpi_init+0x697) [0x2b9b7f77be07]
[10] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(MPI_Init+0x83) [0x2b9b7f79c943]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b9b8013a134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
Process 4 of 7 is on node-2
Process 5 of 7 is on node-2
Process 6 of 7 is on node-2
Process 0 of 7 is on node-1
Process 1 of 7 is on node-1
Process 2 of 7 is on node-1
Process 3 of 7 is on node-1
pi is approximately 3.1415926544231239, Error is 0.0807
wall clock time = 0.009331
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b4ba33652be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b8685aba2be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b304ffbe2be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
6 additional processes aborted (not shown)

Ok, so I take it one is down.  Would this be the cause for all the different errors I'm seeing?

$ fm_status
FMS Fabric status
17 hosts known
16 FMAs found
3 un-ACKed alerts
Mapping is complete, last map generated by node-20
Database generation not yet complete.
Re: [OMPI users] orted: command not found
I had configured the hostfile located at ~prefix/etc/openmpi-default-hostfile.  I copied the file to bernie-3, and it worked...

On the cluster I used at the Universidad de Los Andes (Venezuela), all I had to do was compile and run my applications; I never copied any file to any other machine.  On the three machines I put together as a personal project, I now had to.  I'm sorry if it was obvious and made you guys lose some time, but why did I not have to copy any files on the cluster, while here I must do so?

Thanks for your patience!

Jose
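The difference is almost certainly a shared filesystem: university clusters typically export home directories over NFS, so every node sees the same files automatically, whereas these standalone machines each have their own disk.  A minimal sketch of the extra step that implies, using only the paths and the one remote hostname (bernie-3) that appear in this thread; any additional nodes would need the same treatment:

  # The nodes do not share a filesystem, so the binary must exist at the
  # same path on every node before mpirun is invoked (run from bernie-1).
  ssh bernie-3 mkdir -p proyecto
  scp ~/proyecto/prueba.bin bernie-3:proyecto/

  # Now the remote orted can find ./prueba.bin in the same relative location.
  mpirun --prefix /usr/local/openmpi -np 2 ./prueba.bin

An NFS-exported home directory achieves the same thing without the per-run copy.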
Re: [OMPI users] orted: command not found
I'm just curious; maybe I missed something in a past post of this thread, but are these nodes diskless?  If so, then you will have to make sure that these same paths are exported to the diskless nodes, and handle non-interactive sessions as well as the init shell scripts properly.  It's easiest if you export the same account files and executable/library paths across all nodes.
Re: [OMPI users] Ompi failing on mx only
Hi, Gary-

This looks like a config problem, and not a code problem yet.  Could you send the output of mx_info from node-1 and from node-2?

Also, forgive me counter-asking a possibly dumb OMPI question, but is "-x LD_LIBRARY_PATH" really what you want, as opposed to "-x LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"?  (I would not be surprised if not specifying a value defaults to this behavior, but have to ask.)

Also, have you tried the MX MTL as opposed to the BTL?
  --mca pml cm --mca mtl mx,self
(it looks like you did)

"[node-2:10464] mx_connect fail for node-2:0 with key" makes it look like your fabric may not be fully mapped or that you may have a down link.

thanks,
-reese
Myricom, Inc.

> I was initially using 1.1.2 and moved to 1.2b2 because of a hang on MPI_Bcast() which 1.2b2 reports to fix, and seemed to have done so.  My compute nodes are 2 dual-core Xeons on Myrinet with MX.  The problem is trying to get ompi running on mx only.  My machine file is as follows ...
>
> node-1 slots=4 max-slots=4
> node-2 slots=4 max-slots=4
> node-3 slots=4 max-slots=4
>
> 'mpirun' with the minimum number of processes in order to get the error ...
>
> mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi
>
> I don't believe there's anything wrong w/ the hardware, as I can ping on mx between this failed node and the master fine.  So I tried a different set of 3 nodes and I got the same error; it always fails on the 2nd node of any group of nodes I choose.
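One quick way to settle the "-x" question Reese raises is to have each launched process print the value it actually receives.  This is just a variation on commands already shown in the thread (same prefix and hostfile), so treat it as a diagnostic sketch rather than an authoritative recipe:

  # Does the plain form forward the caller's value, or an empty one?
  mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 printenv LD_LIBRARY_PATH

  # Explicit form, for comparison:
  mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 2 printenv LD_LIBRARY_PATH

If both print the same path on every node, the -x form is not the problem.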
Re: [OMPI users] orted: command not found
On 1/2/07, Gurhan Ozen wrote:
> On 1/2/07, jcolmena...@ula.ve wrote:
> > bernie@bernie-1:~/proyecto$ mpirun --prefix /usr/local/openmpi -np 2 ./prueba.bin
> > --------------------------------------------------------------------------
> > Failed to find or execute the following executable:
> >
> > Host:       bernie-3
> > Executable: ./prueba.bin
> >
> > Cannot continue.
> > --------------------------------------------------------------------------
> >
> > and the file IS there:
> >
> > bernie@bernie-1:~/proyecto$ ls prueba*
> > prueba.bin  prueba.f90  prueba.f90~

Wait a minute... you are running mpirun from bernie-1 without providing any hostfile or hostnames, so both processes should be running on the bernie-1 host, yet the error says it can't find the executable on bernie-3.  Why is this?  Make sure that the file exists on bernie-3 and is executable.

gurhan
Re: [OMPI users] orted: command not found
it is executable:

bernie@bernie-1:~/proyecto$ ls -l prueba.bin
-rwxr-xr-x 1 bernie bernie 9619 2007-01-02 12:18 prueba.bin
Re: [OMPI users] orted: command not found
On 1/2/07, jcolmena...@ula.ve wrote:
> bernie@bernie-1:~/proyecto$ mpirun --prefix /usr/local/openmpi -np 2 ./prueba.bin
> --------------------------------------------------------------------------
> Failed to find or execute the following executable:
>
> Host:       bernie-3
> Executable: ./prueba.bin
>
> Cannot continue.
> --------------------------------------------------------------------------
>
> and the file IS there:
>
> bernie@bernie-1:~/proyecto$ ls prueba*
> prueba.bin  prueba.f90  prueba.f90~
>
> I must be missing something pretty silly, but have been looking around for days to no avail!

What are the permissions on the file?  Is it an executable file?

gurhan
Re: [OMPI users] orted: command not found
> First you should make sure that PATH and LD_LIBRARY_PATH are defined
> in the section of your .bashrc file that get parsed for non
> interactive sessions.  Run "mpirun -np 1 printenv" and check if PATH
> and LD_LIBRARY_PATH have the values you expect.

in fact they do:

bernie@bernie-1:~/proyecto$ mpirun -np 1 printenv
SHELL=/bin/bash
SSH_CLIENT=192.168.1.142 4109 22
USER=bernie
LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:
MAIL=/var/mail/bernie
PATH=/usr/local/openmpi/bin:/usr/local/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games
PWD=/home/bernie
LANG=en_US.UTF-8
HISTCONTROL=ignoredups
SHLVL=1
HOME=/home/bernie
MPI_DIR=/usr/local/openmpi
LOGNAME=bernie
SSH_CONNECTION=192.168.1.142 4109 192.168.1.113 22
LESSOPEN=| /usr/bin/lesspipe %s
LESSCLOSE=/usr/bin/lesspipe %s %s
_=/usr/local/openmpi/bin/orted
OMPI_MCA_universe=bernie@bernie-1:default-universe
OMPI_MCA_ns_nds=env
OMPI_MCA_ns_nds_vpid_start=0
OMPI_MCA_ns_nds_num_procs=1
OMPI_MCA_mpi_paffinity_processor=0
OMPI_MCA_ns_replica_uri=0.0.0;tcp://192.168.1.142:4775
OMPI_MCA_gpr_replica_uri=0.0.0;tcp://192.168.1.142:4775
OMPI_MCA_orte_base_nodename=192.168.1.113
OMPI_MCA_ns_nds_cellid=0
OMPI_MCA_ns_nds_jobid=1
OMPI_MCA_ns_nds_vpid=0

> For your second question you should give the path to your prueba.bin
> executable.  I'll do something like "mpirun --prefix /usr/local/openmpi
> -np 2 ./prueba.bin".  The reason is that usually "." is not in the PATH.

bernie@bernie-1:~/proyecto$ mpirun --prefix /usr/local/openmpi -np 2 ./prueba.bin
--------------------------------------------------------------------------
Failed to find or execute the following executable:

Host:       bernie-3
Executable: ./prueba.bin

Cannot continue.
--------------------------------------------------------------------------

and the file IS there:

bernie@bernie-1:~/proyecto$ ls prueba*
prueba.bin  prueba.f90  prueba.f90~

I must be missing something pretty silly, but have been looking around for days to no avail!

Jose

thanks
[OMPI users] Ompi failing on mx only
I was initially using 1.1.2 and moved to 1.2b2 because of a hang on MPI_Bcast() which 1.2b2 reports to fix, and seems to have done so.  My compute nodes are 2 dual-core Xeons on Myrinet with MX.  The problem is trying to get OMPI running on MX only.

My machine file is as follows ...

node-1 slots=4 max-slots=4
node-2 slots=4 max-slots=4
node-3 slots=4 max-slots=4

'mpirun' with the minimum number of processes in order to get the error ...

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi

Results in the following output ...

:~/Projects/ompi/cpi$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi
--------------------------------------------------------------------------
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 1 with PID 0 on node node-1 exited on signal 1.

--- end of output ---

I get that same error w/ the examples included in the ompi-1.2b2 distrib.  However, if I change the mca params as such ...

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 5 --mca pml cm ./cpi

Running up to -np 5 works (one of the processes does get put on the 2nd node), but running with -np 6 fails with the following ...

[node-2:10464] mx_connect fail for node-2:0 with key (error Endpoint closed or not connectable!)
[node-2:10463] mx_connect fail for node-2:0 with key (error Endpoint closed or not connectable!)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on sig
Re: [OMPI users] orted: command not found
First you should make sure that PATH and LD_LIBRARY_PATH are defined in the section of your .bashrc file that gets parsed for non-interactive sessions.  Run "mpirun -np 1 printenv" and check if PATH and LD_LIBRARY_PATH have the values you expect.

For your second question, you should give the path to your prueba.bin executable.  I'd do something like "mpirun --prefix /usr/local/openmpi -np 2 ./prueba.bin".  The reason is that usually "." is not in the PATH.

  george.

On Jan 2, 2007, at 11:20 AM, jcolmena...@ula.ve wrote:

> I installed openmpi 1.1.2 on two 686 boxes running Ubuntu 6.10 and followed the instructions given in the FAQ.  Nevertheless, I get the following message:
>
> [bernie-1:05053] ERROR: A daemon on node 192.168.1.113 failed to start as expected.
> [bernie-1:05053] ERROR: There may be more information available from
> [bernie-1:05053] ERROR: the remote shell (see above).
> [bernie-1:05053] ERROR: The daemon exited unexpectedly with status 127.
> [...]
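On Ubuntu (which Jose is using) the stock ~/.bashrc normally stops processing early for non-interactive shells, so exports placed near the bottom are never seen by the orted that mpirun starts over ssh.  A sketch of the ordering that matters on each node; the exports are the ones from this thread, while the exact guard line varies by distribution, so take it as illustrative rather than exact:

  # ~/.bashrc -- put the Open MPI settings *before* any "not interactive"
  # guard so that ssh-launched, non-interactive shells still pick them up.
  export PATH="/usr/local/openmpi/bin:${PATH}"
  export LD_LIBRARY_PATH="/usr/local/openmpi/lib:${LD_LIBRARY_PATH}"

  # Typical Debian/Ubuntu guard; everything below this line is skipped for
  # non-interactive sessions such as "ssh node <command>".
  [ -z "$PS1" ] && return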
Re: [OMPI users] segv at runtime with ifort
Jeff,

Thanks for the reply; that has fixed the problem.  The code in question appears to have only been run with MPICH and MPICH derivatives in the past.

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734) 936-1985

On Jan 2, 2007, at 9:56 AM, Jeff Squyres wrote:

> Brock --
>
> I think your test program is faulty.  For MPI_CART_CREATE, you need to
> pass in an array indicating whether the dimensions are periodic or not
> -- it is not sufficient to pass in a scalar logical value.
> [...]
[OMPI users] orted: command not found
I installed openmpi 1.1.2 on two 686 boxes running Ubuntu 6.10 and followed the instructions given in the FAQ.  Nevertheless, I get the following message:

[bernie-1:05053] ERROR: A daemon on node 192.168.1.113 failed to start as expected.
[bernie-1:05053] ERROR: There may be more information available from
[bernie-1:05053] ERROR: the remote shell (see above).
[bernie-1:05053] ERROR: The daemon exited unexpectedly with status 127.

Now, I've been browsing the web, including the mailing lists, and it appears that the error should be that I have not declared the variables

export PATH="/usr/local/openmpi/bin:${PATH}"
export LD_LIBRARY_PATH="/usr/local/openmpi/lib:${LD_LIBRARY_PATH}"

at the node, which I have.  I have even created all the possible folders proposed in the FAQ for remote logins, although I'm using bash.  If I do an "ssh user@remote_node", I can connect without being asked for a password, and if I type mpif90 I get "gfortran: no input files", which should mean that PATH and LD_LIBRARY_PATH are indeed being updated on the remote login.  But if I do:

bash$ mpirun --prefix /usr/local/openmpi -np 2 prueba.bin

the result is:

--------------------------------------------------------------------------
Failed to find the following executable:

Host:       bernie-3
Executable: prueba.bin

Cannot continue.
--------------------------------------------------------------------------
mpirun noticed that job rank 0 with PID 0 on node "192.168.1.113" exited on signal 4.

I've been looking around, but have not been able to find out what signal 4 means.

Just in case: I was running an example program which runs fine at my university cluster.  Nevertheless, I decided to run an even simpler one, which I include, since the error may be there (I definitely hope not!...):

program test
  use mpi
  implicit none
  integer :: myid, sizze, ierr

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, sizze, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)

  print *, "I'm using ", sizze, " processors"
  print *, "of which I'm the number ", myid

  call MPI_FINALIZE(ierr)
end program test

This is the first time I have installed (and used) any parallel programming library, and I'm doing it as a personal project for a graduate course, so any help will be greatly appreciated!

Best regards,

Jose Colmenares
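Exit status 127 from a shell means "command not found", so the daemon (orted) is most likely not on the PATH that a non-interactive ssh session gets on the remote machine, even though interactive logins look fine.  A quick diagnostic along these lines, using the remote hostname from this thread (purely a sketch; adjust the host as needed):

  # What does a NON-interactive shell on the remote node actually see?
  ssh bernie-3 'echo $PATH; which orted'

If "which orted" prints nothing there, the exports are only being applied to interactive logins, which matches the advice elsewhere in this thread about where in .bashrc they belong.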
Re: [OMPI users] openmpi / mpirun problem on aix: poll failed with errno=25, opal_event_loop: ompi_evesel->dispatch() failed.
Yikes - that's not a good error.  :-(

We don't regularly build / test on AIX, so I don't have much immediate guidance for you.  My best suggestion at this point would be to try the latest 1.2 beta or nightly snapshot.  We did an update of the event engine (the portion of the code that you're seeing the error issue from) that *may* alleviate the problem...?  (I have no idea, actually -- I'm just kinda hoping that the new version of the event engine will fix your problem :-\ )

On Dec 27, 2006, at 10:29 AM, Michael Marti wrote:

> Dear All
>
> I am trying to get openmpi-1.1.2 to work on AIX 5.3 / power5.
>
> Compilation seems to have worked with the following sequence:
>
>   setenv OBJECT_MODE 64
>   setenv CC xlc
>   setenv CXX xlC
>   setenv F77 xlf
>   setenv FC xlf90
>   setenv CFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>   setenv CXXFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>   setenv FFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>   setenv FCFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
>   setenv LDFLAGS "-Wl,-brtl"
>
>   ./configure --prefix=/ist/openmpi-1.1.2 \
>     --disable-mpi-cxx \
>     --disable-mpi-cxx-seek \
>     --enable-mpi-threads \
>     --enable-progress-threads \
>     --enable-static \
>     --disable-shared \
>     --disable-io-romio
>
> After the compilation I ran "make check" and all 11 tests passed successfully.
>
> Now I'm trying to run the following command just as a test:
>
>   # mpirun -hostfile /gpfs/MICHAEL/MPI_hostfiles/mpinodes_b41-b44_1.asc -np 2 /usr/bin/hostname
>
> The file /gpfs/MICHAEL/MPI_hostfiles/mpinodes_b41-b44_1.asc contains 4 hosts:
>
>   r1blade041 slots=1
>   r1blade042 slots=1
>   r1blade043 slots=1
>   r1blade044 slots=1
>
> The mpirun command eventually hangs with the following message:
>
>   [r1blade041:418014] poll failed with errno=25
>   [r1blade041:418014] opal_event_loop: ompi_evesel->dispatch() failed.
>
> In this state mpirun cannot be killed by hitting Ctrl-C; only a "kill -9" will do the trick.  While mpirun still hangs I can see that "orted" has been launched on both requested hosts.
>
> I turned on all debug options in openmpi-mca-params.conf.  The output for the same call of mpirun is in the file mpirun-debug.txt.gz.  As suggested in the mailing list rules, I include config.log (config.log.gz) and the output of ompi_info (ompi_info.txt.gz).
>
> As I am completely new to openmpi (I have some experience with LAM), I am lost at this stage.  I would really appreciate it if someone could give me some hints as to what is going wrong and where I could get more info.
>
> Best regards,
>
> Michael Marti

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
Re: [OMPI users] segv at runtime with ifort
Brock --

I think your test program is faulty.  For MPI_CART_CREATE, you need to pass in an array indicating whether the dimensions are periodic or not -- it is not sufficient to pass in a scalar logical value.  For example, the following program seems to work fine for me:

      program cart
      include 'mpif.h'
      logical periods(1)

      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_world_id, ierr)

      periods(1) = .false.
      call MPI_CART_CREATE(MPI_COMM_WORLD, 1, numprocs, periods, .true., ITER_COMM, ierr)
      call MPI_COMM_RANK(ITER_COMM, myid, ierr)

      call MPI_FINALIZE(ierr)
      end

On Dec 19, 2006, at 3:22 PM, Brock Palen wrote:

> Hello,
>
> We are getting seg faults at run time in MPI_CART_CREATE() when using openmpi-1.1.2 built with the Intel compilers.  I have included all versions, code, and messages below.  I know there were problems in the past, and dug around the archives, but didn't find this anywhere.  Has anyone else seen this problem?  The message is like so:
>
> [brockp@nyxtest1 bagci]$ mpirun -np 1 ./a.out
> Signal:11 info.si_errno:0(Success) si_code:2(SEGV_ACCERR)
> Failing at addr:0x448c78
> [0] func:/home/software/rhel4/openmpi-1.1.2/intel/lib/libopal.so.0 [0x2a958dea55]
> [1] func:/lib64/tls/libpthread.so.0 [0x325560c430]
> [2] func:/home/software/rhel4/openmpi-1.1.2/intel/lib/libmpi.so.0(mpi_cart_create__+0x60) [0x2a955ef93c]
> [3] func:./a.out(MAIN__+0x8c) [0x406b8c]
> [4] func:./a.out(main+0x32) [0x406aea]
> [5] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x325471c3fb]
> [6] func:./a.out [0x406a2a]
> *** End of error message ***
>
> ifort is version 9.1.037; icc and icpc are version 9.1.043.  This problem does not happen with the PGI compilers (version 6.1).  Both were done on em64t systems (Xeon 5160s).
>
> Here is the code sample that fails:
>
>       program cart
>       include 'mpif.h'
>       call MPI_INIT(ierr)
>       call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
>       call MPI_COMM_RANK(MPI_COMM_WORLD, my_world_id, ierr)
>       call MPI_CART_CREATE(MPI_COMM_WORLD, 1, numprocs, .false., .true., ITER_COMM, ierr)
>       call MPI_COMM_RANK(ITER_COMM, myid, ierr)
>       call MPI_FINALIZE(ierr)
>       end
>
> Brock Palen
> Center for Advanced Computing
> bro...@umich.edu
> (734) 936-1985

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
Re: [OMPI users] mpicc problems finding libraries (mostly)
Welcome back from the holidays!  I'll try to catch up on the right-before-the-holidays e-mail today...

On Dec 21, 2006, at 6:07 PM, Dennis McRitchie wrote:

> I am trying to build openmpi so that mpicc does not require me to set up
> the compiler's environment, and so that any executables built with mpicc
> can run without my having to point LD_LIBRARY_PATH to the openmpi lib

We had not really considered this use case before.  The current assumption (as you undoubtedly figured out) is that on the node where you're invoking OMPI commands, the PATH/LD_LIBRARY_PATH has been set up properly.  I'm not saying that we can't change this; I'm just trying to give you the rationale for why the wrappers are the way they [currently] are.

> directory.  I made some unsuccessful attempts to accomplish this (which I
> describe below), but after building openmpi using the Intel compiler, I
> found the following:
>
> 1) When typing "mpicc -showme" I get:
>
>   mpicc: error while loading shared libraries: libsvml.so: cannot open
>   shared object file: No such file or directory
>
> I then set LD_LIBRARY_PATH to point to the Intel compiler libraries, and
> now "-showme" works, and returns:
>
>   icc -I/usr/local/openmpi-1.1.2-intel/include
>   -I/usr/local/openmpi-1.1.2-intel/include/openmpi -pthread
>   -L/usr/local/openmpi-1.1.2-intel/lib -L/usr/ofed/lib -L/usr/ofed/lib64
>   -lmpi -lorte -lopal -libverbs -lrt -lpbs -lnsl -lutil

This behavior reflects the current assumption (above).  However...

> 2) When typing "mpicc hello.c" I now get:
>
>   --------------------------------------------------------------------------
>   The Open MPI wrapper compiler was unable to find the specified compiler
>   icc in your PATH.
>
>   Note that this compiler was either specified at configure time or in
>   one of several possible environment variables.
>   --------------------------------------------------------------------------
>
> Of course, this is due to the fact that -showme indicates that mpicc
> invokes "icc" instead of the full path to icc.  If I now set up the PATH
> to the Intel compiler, it works.  However...

Mmm.  Yes.  Also a good point; another working assumption is that you're set up for your compiler as well (re: PATH, LD_LIBRARY_PATH, LM_LICENSE_FILE, etc.).  OMPI *does* save the absolute pathname of the compiler, but we had shied away from using it in the wrappers by default for a few reasons:

1. You may not have the compiler installed in the same location on all nodes.
2. There may be other factors that need to be set up in the environment (such as an env variable containing a license file) that the wrapper compilers are not currently set up to handle.
3. As you noted later, users can specify an absolute path name in CC, CXX (and friends) to configure, and that propagates through.

Hence, users have the choice of specifying the full pathname if they want to; OMPI's current setup allows you to do it either way.  Additionally, be aware that the wrapper compilers are configurable via a text file.  Check out this section of the FAQ:

  http://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0

> 3) When I try to run the executable thus created, I get:
>
>   ./a.out: error while loading shared libraries: libmpi.so.0: cannot open
>   shared object file: No such file or directory
>
> I now need to set LD_LIBRARY_PATH to point to the openmpi lib directory.

Correct.  The mpirun --prefix option may help here, though (and its synonym: providing a full absolute path to mpirun).
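As an aside on problem (3): independent of mpirun --prefix, the library search path can also be baked into the executable at link time, since the wrapper passes flags it does not recognize straight through to the underlying compiler and linker.  A rough sketch using the install path from this message; this is just one possible workaround, not the approach Dennis or Jeff settled on:

  # Embed an rpath so ./hello can resolve libmpi.so.0 without LD_LIBRARY_PATH.
  mpicc hello.c -o hello -Wl,-rpath,/usr/local/openmpi-1.1.2-intel/lib

  # Verify where the dynamic loader now finds the MPI library.
  ldd ./hello | grep libmpi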
> ---------------
> To avoid problems (1) and (2), I built openmpi with:
>
>   export CC=/opt/intel/cce/latest/bin/icc
>   export CXX=/opt/intel/cce/latest/bin/icpc
>   export F77=/opt/intel/fce/latest/bin/ifort
>   export FC=/opt/intel/fce/latest/bin/ifort
>   export LDFLAGS="-Wl,-rpath,/opt/intel/cce/latest/lib,-rpath,/opt/intel/fce/latest/lib"
>
> But while this satisfied the configure script and all its tests, it did
> not produce the results I hoped for.
>
> To avoid problem (3), I added the following option to configure:
>
>   --with-wrapper-ldflags=-Wl,-rpath,/usr/local/openmpi-1.1.2-intel/lib
>
> I was hoping "-showme" would add this to its parameters, but no such
> luck.  Looking at the build output, it seems that the
> --with-wrapper-ldflags parameter is parsed differently from how LDFLAGS
> gets parsed, and I get a compilation line:
>
>   /opt/intel/cce/latest/bin/icc -O3 -DNDEBUG -fno-strict-aliasing -pthread
>   -Wl,-rpath -Wl,/opt/intel/cce/latest/lib -Wl,-rpath -Wl,/opt/intel/fce/latest/lib
>   -o .libs/opal_wrapper opal_wrapper.o ../../../opal/.libs/libopal.so -lnsl -lutil
>   -Wl,--rpath -Wl,/usr/local/openmpi-1.1.2-intel/lib
>
> Notice that the rpath preceding the openmpi lib directory is specified as
> "--rpath", which is probably why it is ignored.  Is this perhaps a bug?

Hmm.  I'd have to trace into why that happens; that's pretty weird.  We put the --with-wrapper-*flags [almost] directl