[OMPI users] Ompi failing on mx only
I was initially using 1.1.2 and moved to 1.2b2 because of a hang on MPI_Bcast() which 1.2b2 reportedly fixes, and seems to have done so. My compute nodes are 2 dual-core Xeons on Myrinet with MX. The problem is trying to get ompi running on mx only. My machine file is as follows ...

node-1 slots=4 max-slots=4
node-2 slots=4 max-slots=4
node-3 slots=4 max-slots=4

'mpirun' with the minimum number of processes in order to get the error ...

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi

Results with the following output ...

:~/Projects/ompi/cpi$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi
--------------------------------------------------------------------------
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 1 with PID 0 on node node-1 exited on signal 1.

end of output ---

I get that same error w/ the examples included in the ompi-1.2b2 distrib. However, if I change the mca params as such ...

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 5 --mca pml cm ./cpi

Running up to -np 5 works (one of the processes does get put on the 2nd node), but running with -np 6 fails with the following ...

[node-2:10464] mx_connect fail for node-2:0 with key (error Endpoint closed or not connectable!)
[node-2:10463] mx_connect fail for node-2:0 with key (error Endpoint closed or not connectable!)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on sig
Re: [OMPI users] Ompi failing on mx only
Hi, Gary-

This looks like a config problem, and not a code problem yet. Could you send the output of mx_info from node-1 and from node-2?

Also, forgive me counter-asking a possibly dumb OMPI question, but is "-x LD_LIBRARY_PATH" really what you want, as opposed to "-x LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"? (I would not be surprised if not specifying a value defaults to this behavior, but have to ask.)

Also, have you tried the MX MTL as opposed to the BTL?
  --mca pml cm --mca mtl mx,self
(it looks like you did)

"[node-2:10464] mx_connect fail for node-2:0 with key" makes it look like your fabric may not be fully mapped or that you may have a down link.

thanks,
-reese
Myricom, Inc.

> 'mpirun' with the minimum number of processes in order to get the error ...
> mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi
>
> I don't believe there's anything wrong w/ the hardware as I can ping on mx between this failed node and the master fine. So I tried a different set of 3 nodes and I got the same error, it always fails on the 2nd node of any group of nodes I choose.
Re: [OMPI users] Ompi failing on mx only
About the -x, I've been trying it both ways and prefer the latter, and results for either are the same. But its value is correct. I've attached the ompi_info from node-1 and node-2. Sorry for not zipping them, but they were small and I think I'd have firewall issues.

$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
[node-14:19260] mx_connect fail for node-14:0 with key (error Endpoint closed or not connectable!)
[node-14:19261] mx_connect fail for node-14:0 with key (error Endpoint closed or not connectable!)
...

Is there any info anywhere on MTL? Anyways, I've run w/ mtl, and sometimes it actually worked once. But now I can't reproduce it and it's throwing sig 7's, 11's, and 4's depending upon the number of procs I give it. But now that you mention mapper, I take it that's what SEGV_MAPERR might be referring to. I'm looking into the

$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi
Process 4 of 5 is on node-2
Process 0 of 5 is on node-1
Process 1 of 5 is on node-1
Process 2 of 5 is on node-1
Process 3 of 5 is on node-1
pi is approximately 3.1415926544231225, Error is 0.08333294
wall clock time = 0.019305
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b88243862be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
4 additional processes aborted (not shown)

Or sometimes I'll get this error, just depending upon the number of procs ...

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl mx,self ./cpi
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x2aaab000
[0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
[1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
[2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
[3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
[4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
[5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
[6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
[7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576]
[8] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe3) [0x2b9b82a2d0a3]
[9] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(ompi_mpi_init+0x697) [0x2b9b7f77be07]
[10] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(MPI_Init+0x83) [0x2b9b7f79c943]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b9b8013a134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
Process 4 of 7 is on node-2
Process 5 of 7 is on node-2
Process 6 of 7 is on node-2
Process 0 of 7 is on node-1
Process 1 of 7 is on node-1
Process 2 of 7 is on node-1
Process 3 of 7 is on node-1
pi is approximately 3.1415926544231239, Error is 0.0807
wall clock time = 0.009331
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b4ba33652be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b8685aba2be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b304ffbe2be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
6 additional processes aborted (not shown)

Ok, so I take it one is down. Would this be the cause for all the different errors I'm seeing?

$ fm_status
FMS Fabric status
17 hosts known
16 FMAs found
3 un-ACKed alerts
Mapping is complete, last map generated by node-20
Database generation not yet complete.

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 2:52 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only
Re: [OMPI users] Ompi failing on mx only
> I've attached the ompi_info from node-1 and node-2.

thanks, but i need "mx_info", not "ompi_info" ;-)

> But now that you mention mapper, I take it that's what SEGV_MAPERR
> might be referring to.

this is an ompi red herring; it has nothing to do with Myrinet mapping, even though it kinda looks like it.

> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)

A gdb traceback would be interesting on this one.

thanks,
-reese
Re: [OMPI users] Ompi failing on mx only
Sorry to jump into the discussion late. The mx btl does not support communication between processes on the same node by itself, so you have to include the shared memory transport when using MX. This will eventually be fixed, but likely not for the 1.2 release. So if you do:

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi

It should work much better.

As for the MTL, there is a bug in the MX MTL for v1.2 that has been fixed, but after 1.2b2 that could cause the random failures you were seeing. It will work much better after 1.2b3 is released (or if you are feeling really lucky, you can try out the 1.2 nightly tarballs).

The MTL is a new feature in v1.2. It is a different communication abstraction designed to support interconnects that have matching implemented in the lower level library or in hardware (Myrinet/MX, Portals, InfiniPath are currently implemented). The MTL allows us to exploit the low latency and asynchronous progress these libraries can provide, but does mean multi-nic abilities are reduced. Further, the MTL is not well suited to interconnects like TCP or InfiniBand, so we will continue supporting the BTL interface as well.

Brian

--
Brian Barrett
Open MPI Team, CCS-1
Los Alamos National Laboratory
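To make the two working configurations easier to compare side by side, here is a sketch built only from the flags quoted in this thread (Gary's hostfile, prefix, and process count are reused as-is; nothing here is new Open MPI syntax):

    # BTL path (ob1 PML): MX between nodes, shared memory + self within a node
    mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH \
        --hostfile ./h1-3 -np 6 --mca btl mx,sm,self ./cpi

    # MTL path (cm PML): the MX library handles all communication itself
    mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH \
        --hostfile ./h1-3 -np 6 --mca pml cm --mca mtl mx ./cpi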
Re: [OMPI users] Ompi failing on mx only
> As for the MTL, there is a bug in the MX MTL for v1.2 that has been
> fixed, but after 1.2b2 ...

oops, i was stupidly assuming he already had that fix. yes, this is an important fix...
-reese
Re: [OMPI users] Ompi failing on mx only
Ah, sorry about that ...

$ ./mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30 14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===================================================================
Instance #0: 224.9 MHz LANai, 99.7 MHz PCI bus, 2 MB SRAM
        Status: Running, P0: Link up
        MAC Address: 00:60:dd:47:c2:a7
        Product code: M3F-PCIXD-2 V2.2
        Part number: 09-03034
        Serial number: 291824
        Mapper: 46:4d:53:4d:41:50, version = 0x000c, configured
        Mapped hosts: 16

                                    ROUTE COUNT
INDEX   MAC ADDRESS         HOST NAME       P0
-----   -----------         ---------       ---
   0)   00:60:dd:47:c2:a7   juggernaut:0    1,1
   1)   00:60:dd:47:ab:c9   node-1:0        6,3
   2)   00:60:dd:47:ab:c8   node-2:0        6,3
   3)   00:60:dd:47:ab:ca   node-3:0        6,3
   4)   00:60:dd:47:bf:65   node-7:0        6,3
   5)   00:60:dd:47:c2:e1   node-8:0        6,3
   6)   00:60:dd:47:c0:c1   node-9:0        6,3
   7)   00:60:dd:47:c0:e5   node-13:0       6,3
   8)   00:60:dd:47:c2:91   node-14:0       6,3
   9)   00:60:dd:47:c0:b2   node-15:0       6,3
  10)   00:60:dd:47:bf:f5   node-19:0       1,1
  11)   00:60:dd:47:c0:b1   node-20:0       6,3
  12)   00:60:dd:47:c0:f8   node-21:0       7,3
  13)   00:60:dd:47:c0:8a   node-25:0       6,3
  14)   00:60:dd:47:c0:c2   node-27:0       5,3
  15)   00:60:dd:47:c2:e0   node-26:0       5,3

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 4:08 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only
Re: [OMPI users] Ompi failing on mx only
I'm losing it today, I just now noticed I sent mx_info for the wrong nodes ...

// node-1
$ mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30 14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===================================================================
Instance #0: 224.9 MHz LANai, 133.3 MHz PCI bus, 2 MB SRAM
        Status: Running, P0: Link up
        MAC Address: 00:60:dd:47:ab:c9
        Product code: M3F-PCIXD-2 V2.2
        Part number: 09-03034
        Serial number: 299207
        Mapper: 46:4d:53:4d:41:50, version = 0x000c, configured
        Mapped hosts: 16

                                    ROUTE COUNT
INDEX   MAC ADDRESS         HOST NAME       P0
-----   -----------         ---------       ---
   0)   00:60:dd:47:ab:c9   node-1:0        1,1
   1)   00:60:dd:47:c2:a7   juggernaut:0    5,3
   2)   00:60:dd:47:ab:c8   node-2:0        6,3
   3)   00:60:dd:47:ab:ca   node-3:0        7,3
   4)   00:60:dd:47:bf:65   node-7:0        7,3
   5)   00:60:dd:47:c2:e1   node-8:0        7,3
   6)   00:60:dd:47:c0:c1   node-9:0        6,3
   7)   00:60:dd:47:c0:e5   node-13:0       1,1
   8)   00:60:dd:47:c2:91   node-14:0       7,3
   9)   00:60:dd:47:c0:b2   node-15:0       7,3
  10)   00:60:dd:47:bf:f5   node-19:0       6,3
  11)   00:60:dd:47:c0:b1   node-20:0       8,3
  12)   00:60:dd:47:c0:f8   node-21:0       5,3
  13)   00:60:dd:47:c0:8a   node-25:0       6,3
  14)   00:60:dd:47:c0:c2   node-27:0       7,3
  15)   00:60:dd:47:c2:e0   node-26:0       6,3

// node-2
$ mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30 14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===================================================================
Instance #0: 224.9 MHz LANai, 133.0 MHz PCI bus, 2 MB SRAM
        Status: Running, P0: Link up
        MAC Address: 00:60:dd:47:ab:c8
        Product code: M3F-PCIXD-2 V2.2
        Part number: 09-03034
        Serial number: 299208
        Mapper: 46:4d:53:4d:41:50, version = 0x000c, configured
        Mapped hosts: 16

                                    ROUTE COUNT
INDEX   MAC ADDRESS         HOST NAME       P0
-----   -----------         ---------       ---
   0)   00:60:dd:47:ab:c8   node-2:0        1,1
   1)   00:60:dd:47:ab:c9   node-1:0        6,3
   2)   00:60:dd:47:c2:a7   juggernaut:0    5,3
   3)   00:60:dd:47:ab:ca   node-3:0        6,3
   4)   00:60:dd:47:bf:65   node-7:0        5,3
   5)   00:60:dd:47:c2:e1   node-8:0        6,3
   6)   00:60:dd:47:c0:c1   node-9:0        5,3
   7)   00:60:dd:47:c0:e5   node-13:0       6,3
   8)   00:60:dd:47:c2:91   node-14:0       8,3
   9)   00:60:dd:47:c0:b2   node-15:0       1,1
  10)   00:60:dd:47:bf:f5   node-19:0       5,3
  11)   00:60:dd:47:c0:f8   node-21:0       5,3
  12)   00:60:dd:47:c0:8a   node-25:0       5,3
  13)   00:60:dd:47:c0:c2   node-27:0       6,3
  14)   00:60:dd:47:c2:e0   node-26:0       5,3
  15)   00:60:dd:47:c0:b1   node-20:0       6,3

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 4:08 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only
Re: [OMPI users] Ompi failing on mx only
Just as an FYI, I also included the sm param as you suggested and changed the -np to 1, because anything more than that just duplicates the same error. I also saw this same error message in previous posts as a bug. Would that be the same issue in this case?

$ mpirun --prefix /usr/local/openmpi-1.2b2 --hostfile ./h1-3 -np 1 --mca btl mx,sm,self ./cpi
[node-1:09704] mca: base: component_find: unable to open mtl mx: file not found (ignored)
[node-1:09704] mca: base: component_find: unable to open btl mx: file not found (ignored)
Process 0 of 1 is on node-1
pi is approximately 3.1415926544231341, Error is 0.08333410
wall clock time = 0.000331

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Brian W. Barrett
Sent: Tuesday, January 02, 2007 4:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only
Re: [OMPI users] Ompi failing on mx only
> $ mpirun --prefix /usr/local/openmpi-1.2b2 --hostfile ./h1-3 -np 1 --mca btl mx,sm,self ./cpi
> [node-1:09704] mca: base: component_find: unable to open mtl mx: file not found (ignored)
> [node-1:09704] mca: base: component_find: unable to open btl mx: file not found (ignored)

This in particular is almost certainly a library path issue. A quick way to check to see if your LD_LIBRARY_PATH is correct is to run:

$ mpirun ldd /lib/openmpi/mca_mtl_mx.so

If things are good, you will get a first line like:

  libmyriexpress.so => /opt/mx/lib/libmyriexpress.so (0xb7f1d000)

If not, it will tell you explicitly. Since all you specified is the --prefix line, I'm not surprised libmyriexpress.so is not found in this case.

-reese
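A hedged sketch of what the fix usually looks like, assuming MX is installed under /opt/mx as in the configure lines later in this thread (adjust both paths to your installs):

    # Put the Open MPI and MX library directories on the library path, and
    # export that value to the remote ranks instead of relying on --prefix alone:
    export LD_LIBRARY_PATH=/usr/local/openmpi-1.2b2/lib:/opt/mx/lib:$LD_LIBRARY_PATH
    mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} \
        --hostfile ./h1-3 -np 1 --mca btl mx,sm,self ./cpi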
Re: [OMPI users] Ompi failing on mx only
I've grabbed last night's tarball (1.2b3r12981) and tried using the shared mem transport on btl and mx,self on mtl, same results. What I don't get is that sometimes it works, and sometimes it doesn't (for either). For example, I can run it 10 times successfully, then incr the -np from 7 to 10 across 3 nodes, and it'll immediately fail. Here's an example of one run right after another.

$ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 --mca mtl mx,self ./cpi
Process 0 of 10 is on node-25
Process 4 of 10 is on node-26
Process 1 of 10 is on node-25
Process 5 of 10 is on node-26
Process 2 of 10 is on node-25
Process 8 of 10 is on node-27
Process 6 of 10 is on node-26
Process 9 of 10 is on node-27
Process 7 of 10 is on node-26
Process 3 of 10 is on node-25
pi is approximately 3.1415926544231256, Error is 0.0825
wall clock time = 0.017513

$ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 --mca mtl mx,self ./cpi
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b8ddf3ccd3f]
[1] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0 [0x2b8ddf3cb891]
[2] func:/lib/libpthread.so.0 [0x2b8ddf98f6c0]
[3] func:/opt/mx/lib/libmyriexpress.so(mx_open_endpoint+0x6df) [0x2b8de25bf2af]
[4] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_btl_mx.so(mca_btl_mx_component_init+0x5d7) [0x2b8de27dcd27]
[5] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_btl_base_select+0x156) [0x2b8ddf125b46]
[6] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x11) [0x2b8de26d7491]
[7] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_bml_base_init+0x7d) [0x2b8ddf12543d]
[8] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_init+0x6b) [0x2b8de23a4f8b]
[9] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_pml_base_select+0x113) [0x2b8ddf12cea3]
[10] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(ompi_mpi_init+0x45a) [0x2b8ddf0f5bda]
[11] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(MPI_Init+0x83) [0x2b8ddf116af3]
[12] func:./cpi(main+0x42) [0x400cd5]
[13] func:/lib/libc.so.6(__libc_start_main+0xe3) [0x2b8ddfab50e3]
[14] func:./cpi [0x400bd9]
*** End of error message ***
mpirun noticed that job rank 0 with PID 0 on node node-25 exited on signal 11.
9 additional processes aborted (not shown)

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Brian W. Barrett
Sent: Tuesday, January 02, 2007 4:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only
Re: [OMPI users] Ompi failing on mx only
FWIW, I think we may have broken something in last night's tarball (this just came up on an internal development list, too). I.e., someone broke something that was fixed a little while later, but the nightly tarball was created before the problem was fixed. Sorry about that. :-(

Such is the nature of nightly snapshots...

On Jan 4, 2007, at 2:00 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:

> I've grabbed last night's tarball (1.2b3r12981) and tried using the shared mem transport on btl and mx,self on mtl, same results. What I don't get is that sometimes it works, and sometimes it doesn't (for either).
Re: [OMPI users] Ompi failing on mx only
There is some confusion here. I see that you try to run using the MTL but you have the wrong MCA parameters. In order to activate the MTL you should have on the mpirun command line "--mca pml cm --mca mtl mx". As you can see from your backtrace, it segfaults in the BTL initialization, which means that you're using the BTL and not the MTL.

Second thing: from one of your previous emails, I see that MX is configured with 4 instances per node. You're running with exactly 4 processes on the first 2 nodes. Weird things might happen ...

Now, if you use the latest trunk, you can use the new MX BTL which provides support for shared memory and self communications. Add "--mca pml ob1 --mca btl mx --mca btl_mx_shared_mem 1 --mca btl_mx_self 1" in order to activate these new features. If you have a 10G card, I suggest you add "--mca btl_mx_flags 2" as well.

Thanks,
george.

PS: Is there any way you can attach to the processes with gdb? I would like to see the backtrace as showed by gdb in order to be able to figure out what's wrong there.

On Jan 4, 2007, at 2:00 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:

> I've grabbed last night's tarball (1.2b3r12981) and tried using the shared mem transport on btl and mx,self on mtl, same results. What I don't get is that sometimes it works, and sometimes it doesn't (for either).
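Putting George's suggested parameters onto one command line, as a sketch only: the btl_mx_shared_mem/btl_mx_self/btl_mx_flags parameters exist only in the then-current trunk per the paragraph above, and the install prefix here is a made-up placeholder.

    mpirun --prefix /usr/local/openmpi-trunk -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} \
        --hostfile ./h25-27 -np 10 \
        --mca pml ob1 --mca btl mx \
        --mca btl_mx_shared_mem 1 --mca btl_mx_self 1 ./cpi
    # add "--mca btl_mx_flags 2" on Myri-10G cards, as suggested above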
Re: [OMPI users] Ompi failing on mx only
This is just an FYI of the Jan 5th snapshot. I'll send a backtrace of the processes as soon as I get a b3 running. Between my filtered webdav svn access problems and the latest nightly snapshots, my builds are currently failing where the same config lines worked on previous snapshots ...

$ ./configure --prefix=/usr/local/openmpi-1.2b3r13006 --with-mx=/opt/mx --with-mx-libdir=/opt/mx/lib
...
*** GNU libltdl setup
configure: OMPI configuring in opal/libltdl
configure: running /bin/sh './configure' '--prefix=/usr/local/openmpi-1.2b3r13006' '--with-mx=/opt/mx' '--with-mx-libdir=/opt/mx/lib' 'F77=ifort' --enable-ltdl-convenience --disable-ltdl-install --enable-shared --disable-static --cache-file=/dev/null --srcdir=.
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... configure: error: newly created file is older than distributed files!
Check your system clock
configure: /bin/sh './configure' *failed* for opal/libltdl
configure: error: Failed to build GNU libltdl. This usually means that something is incorrectly setup with your environment. There may be useful information in opal/libltdl/config.log. You can also disable GNU libltdl (which will disable dynamic shared object loading) by configuring with --disable-dlopen.

end of output of opal/libltdl/config.log ...

## ----------- ##
## confdefs.h. ##
## ----------- ##
#define PACKAGE_BUGREPORT "bug-libt...@gnu.org"
#define PACKAGE_NAME "libltdl"
#define PACKAGE_STRING "libltdl 2.1a"
#define PACKAGE_TARNAME "libltdl"
#define PACKAGE_VERSION "2.1a"
configure: exit 1

-----Original Message-----
> Now, if you use the latest trunk, you can use the new MX BTL which provides support for shared memory and self communications.
Re: [OMPI users] Ompi failing on mx only
Ok, sorry about that last one. I think someone just bumped up the required version of Automake.

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Grobe, Gary L. (JSC-EV)[ESCG]
Sent: Friday, January 05, 2007 2:29 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

> This is just an FYI of the Jan 5th snapshot. I'll send a backtrace of the processes as soon as I get a b3 running. Between my filtered webdav svn access problems and the latest nightly snapshots, my builds are currently failing where the same config lines worked on previous snapshots ...
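A couple of quick, generic checks that usually narrow this kind of failure down; nothing here is Open MPI specific, and the required minimum versions are whatever the snapshot's build machinery was generated with:

    date                 # compare across build host and file server; NFS clock skew triggers
                         # "newly created file is older than distributed files"
    automake --version   # trunk snapshots periodically raise the minimum required version
    autoconf --version
    libtool --version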
Re: [OMPI users] Ompi failing on mx only
I was wondering if someone could send me the HACKING file so I can do a bit more with debugging on the snapshots. Our web proxy has webdav methods turned off (request methods fail) so that I can't get to the latest of the svn repos.

> Second thing: from one of your previous emails, I see that MX
> is configured with 4 instances per node. You're running with
> exactly 4 processes on the first 2 nodes. Weird things might
> happen ...

Just curious about this comment. Are you referring to oversubscribing? We run 4 processes on each node because we have 2 dual-core CPUs on each node. Am I not understanding processor counts correctly?

> PS: Is there any way you can attach to the processes with gdb?
> I would like to see the backtrace as showed by gdb in order
> to be able to figure out what's wrong there.

When I can get more detailed dbg, I'll send. Though I'm not clear on what executable is being searched for below.

$ mpirun -dbg=gdb --prefix /usr/local/openmpi-1.2b3r13030 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca pml cm --mca mtl mx ./cpi
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] [0,0,0] setting up session dir with
[juggernaut:14949]  universe default-universe-14949
[juggernaut:14949]  user ggrobe
[juggernaut:14949]  host juggernaut
[juggernaut:14949]  jobid 0
[juggernaut:14949]  procid 0
[juggernaut:14949] procdir: /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/0/0
[juggernaut:14949] jobdir: /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/0
[juggernaut:14949] unidir: /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949
[juggernaut:14949] top: openmpi-sessions-ggrobe@juggernaut_0
[juggernaut:14949] tmp: /tmp
[juggernaut:14949] [0,0,0] contact_file /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/universe-setup.txt
[juggernaut:14949] [0,0,0] wrote setup file
[juggernaut:14949] pls:rsh: local csh: 0, local sh: 1
[juggernaut:14949] pls:rsh: assuming same remote shell as local shell
[juggernaut:14949] pls:rsh: remote csh: 0, remote sh: 1
[juggernaut:14949] pls:rsh: final template argv:
[juggernaut:14949] pls:rsh:     /usr/bin/ssh orted --debug --bootproxy 1 --name --num_procs 2 --vpid_start 0 --nodename --universe ggrobe@juggernaut:default-universe-14949 --nsreplica "0.0.0;tcp://192.168.2.10:43121" --gprreplica "0.0.0;tcp://192.168.2.10:43121"
[juggernaut:14949] pls:rsh: launching on node juggernaut
[juggernaut:14949] pls:rsh: juggernaut is a LOCAL node
[juggernaut:14949] pls:rsh: changing to directory /home/ggrobe
[juggernaut:14949] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename juggernaut --universe ggrobe@juggernaut:default-universe-14949 --nsreplica "0.0.0;tcp://192.168.2.10:43121" --gprreplica "0.0.0;tcp://192.168.2.10:43121"
[juggernaut:14950] [0,0,1] setting up session dir with
[juggernaut:14950]  universe default-universe-14949
[juggernaut:14950]  user ggrobe
[juggernaut:14950]  host juggernaut
[juggernaut:14950]  jobid 0
[juggernaut:14950]  procid 1
[juggernaut:14950] procdir: /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/0/1
[juggernaut:14950] jobdir: /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/0
[juggernaut:14950] unidir: /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949
[juggernaut:14950] top: openmpi-sessions-ggrobe@juggernaut_0
[juggernaut:14950] tmp: /tmp
--------------------------------------------------------------------------
Failed to find the following executable:

Host:       juggernaut
Executable: -b

Cannot continue.
--------------------------------------------------------------------------
[juggernaut:14950] [0,0,1] ORTE_ERROR_LOG: Fatal in file odls_default_module.c at line 1193
[juggernaut:14949] spawn: in job_state_callback(jobid = 1, state = 0x80)
[juggernaut:14950] [0,0,1] ORTE_ERROR_LOG: Fatal in file orted.c at line 575
[juggernaut:14950] sess_dir_finalize: job session dir not empty - leaving
[juggernaut:14950] sess_dir_finalize: proc session dir not empty - leaving
[juggernaut:14949] sess_dir_finalize: proc session dir not empty - leaving
Re: [OMPI users] Ompi failing on mx only
On Jan 8, 2007, at 2:52 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:

> I was wondering if someone could send me the HACKING file so I can do a bit more with debugging on the snapshots. Our web proxy has webdav methods turned off (request methods fail) so that I can't get to the latest of the svn repos.

Bummer. :-( You are definitely falling victim to the fact that our nightly snapshots have been less-than-stable recently. Sorry [again] about that!

FWIW, there are two ways to browse the source in the repository without an SVN checkout:

- you can just point a normal web browser to our SVN repository (I'm pretty sure that doesn't use DAV, but I'm not 100% sure...), e.g.: https://svn.open-mpi.org/svn/ompi/trunk/HACKING

- you can use our Trac SVN browser, e.g.: https://svn.open-mpi.org/trac/ompi/browser/trunk/HACKING (there's a link at the bottom to download each file without all the HTML markup).

>> Second thing: from one of your previous emails, I see that MX is configured with 4 instances per node. You're running with exactly 4 processes on the first 2 nodes. Weird things might happen ...
>
> Just curious about this comment. Are you referring to oversubscribing? We run 4 processes on each node because we have 2 dual-core CPUs on each node. Am I not understanding processor counts correctly?

I'll have to defer to Reese on this one...

>> PS: Is there any way you can attach to the processes with gdb? I would like to see the backtrace as showed by gdb in order to be able to figure out what's wrong there.
>
> When I can get more detailed dbg, I'll send. Though I'm not clear on what executable is being searched for below.
>
> $ mpirun -dbg=gdb --prefix /usr/local/openmpi-1.2b3r13030 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca pml cm --mca mtl mx ./cpi

FWIW, note that "-dbg" is not a recognized Open MPI mpirun command line switch -- after all the debugging information, Open MPI finally gets to telling you:

--------------------------------------------------------------------------
Failed to find the following executable:

Host:       juggernaut
Executable: -b

Cannot continue.
--------------------------------------------------------------------------

So nothing actually ran in this instance.

Our debugging entries on the FAQ (http://www.open-mpi.org/faq/?category=debugging) are fairly inadequate at the moment, but if you're running in an ssh environment, you generally have 2 choices to attach serial debuggers:

1. Put a loop in your app that pauses until you can attach a debugger. Perhaps something like this:

   { int i = 0; printf("pid %d ready\n", getpid()); while (0 == i) sleep(5); }

   Kludgey and horrible, but it works.

2. mpirun an xterm with gdb. You'll need to specifically use the -d option to mpirun in order to keep the ssh sessions alive to relay back your X information, or separately setup your X channels yourself (e.g., if you're on a closed network, it may be acceptable to "xhost +" the nodes that you're running on and just manually setup the DISPLAY variable for the target nodes, perhaps via the -x option to mpirun) -- in which case you would not need to use the -d option to mpirun.

Make sense?

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
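Since option 1 above is only sketched inline, here is a slightly fuller, hedged version of that wait-for-debugger idiom as a self-contained C program; the hostname/PID printout, the "hold" variable name, and the 5-second poll are just one way to do it, not an Open MPI facility:

    /* Pause every rank before MPI_Init until a debugger attaches.
     * Attach with "gdb -p <pid>" on the node it printed, then run
     * "set var hold = 0" and "continue" to let the rank proceed. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        volatile int hold = 1;              /* cleared from inside gdb */
        char host[256];

        gethostname(host, sizeof(host));
        printf("pid %d on %s waiting for debugger\n", (int)getpid(), host);
        fflush(stdout);
        while (hold)
            sleep(5);

        MPI_Init(&argc, &argv);
        /* ... the rest of the application (e.g. the cpi computation) ... */
        MPI_Finalize();
        return 0;
    }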
Re: [OMPI users] Ompi failing on mx only
> >> PS: Is there any way you can attach to the processes with gdb? I
> >> would like to see the backtrace as showed by gdb in order to be able
> >> to figure out what's wrong there.
> >
> > When I can get more detailed dbg, I'll send. Though I'm not clear on
> > what executable is being searched for below.
> >
> > $ mpirun -dbg=gdb --prefix /usr/local/openmpi-1.2b3r13030 -x
> > LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca pml
> > cm --mca mtl mx ./cpi
>
> FWIW, note that "-dbg" is not a recognized Open MPI mpirun
> command line switch -- after all the debugging information,
> Open MPI finally gets to telling you:

Sorry, wrong mpi, ok ... Fwiw, here's a working crash w/ just the -d option. The problem I'm trying to get to right now is how to dbg the 2nd process on the 2nd node, since that's where the crash is always happening. One process past the 1st node works fine (5 procs w/ 4 per node), but when a second process on the 2nd node starts, or anything more than that, the crashes will occur.

$ mpirun -d --prefix /usr/local/openmpi-1.2b3r13030 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 6 --mca pml cm --mca mtl mx ./cpi > dbg.out 2>&1
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] [0,0,0] setting up session dir with
[juggernaut:15087]  universe default-universe-15087
[juggernaut:15087]  user ggrobe
[juggernaut:15087]  host juggernaut
[juggernaut:15087]  jobid 0
[juggernaut:15087]  procid 0
[juggernaut:15087] procdir: /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087/0/0
[juggernaut:15087] jobdir: /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087/0
[juggernaut:15087] unidir: /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087
[juggernaut:15087] top: openmpi-sessions-ggrobe@juggernaut_0
[juggernaut:15087] tmp: /tmp
[juggernaut:15087] [0,0,0] contact_file /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087/universe-setup.txt
[juggernaut:15087] [0,0,0] wrote setup file
[juggernaut:15087] pls:rsh: local csh: 0, local sh: 1
[juggernaut:15087] pls:rsh: assuming same remote shell as local shell
[juggernaut:15087] pls:rsh: remote csh: 0, remote sh: 1
[juggernaut:15087] pls:rsh: final template argv:
[juggernaut:15087] pls:rsh:     /usr/bin/ssh orted --debug --bootproxy 1 --name --num_procs 3 --vpid_start 0 --nodename --universe ggrobe@juggernaut:default-universe-15087 --nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica "0.0.0;tcp://192.168.2.10:52099"
[juggernaut:15087] pls:rsh: launching on node node-1
[juggernaut:15087] pls:rsh: node-1 is a REMOTE node
[juggernaut:15087] pls:rsh: executing: /usr/bin/ssh node-1 PATH=/usr/local/openmpi-1.2b3r13030/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi-1.2b3r13030/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi-1.2b3r13030/bin/orted --debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename node-1 --universe ggrobe@juggernaut:default-universe-15087 --nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica "0.0.0;tcp://192.168.2.10:52099"
[juggernaut:15087] pls:rsh: launching on node node-2
[juggernaut:15087] pls:rsh: node-2 is a REMOTE node
[juggernaut:15087] pls:rsh: executing: /usr/bin/ssh node-2 PATH=/usr/local/openmpi-1.2b3r13030/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi-1.2b3r13030/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi-1.2b3r13030/bin/orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename node-2 --universe ggrobe@juggernaut:default-universe-15087 --nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica "0.0.0;tcp://192.168.2.10:52099"
[node-2:11499] [0,0,2] setting up session dir with
[node-2:11499]  universe default-universe-15087
[node-2:11499]  user ggrobe
[node-2:11499]  host node-2
[node-2:11499]  jobid 0
[node-2:11499]  procid 2
[node-1:10307] procdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/0/1
[node-1:10307] jobdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/0
[node-1:10307] unidir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087
[node-1:10307] top: openmpi-sessions-ggrobe@node-1_0
[node-2:11499] procdir: /tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087/0/2
[node-2:11499] jobdir: /tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087/0
[node-2:11499] unidir: /tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087
[node-2:11499] top: openmpi-sessions-ggrobe@node-2_0
[node-2:11499] tmp: /tm
Re: [OMPI users] Ompi failing on mx only
On Mon, Jan 08, 2007 at 03:07:57PM -0500, Jeff Squyres wrote:
> if you're running in an ssh environment, you generally have 2 choices to
> attach serial debuggers:
>
> 1. Put a loop in your app that pauses until you can attach a
> debugger. Perhaps something like this:
>
>   { int i = 0; printf("pid %d ready\n", getpid()); while (0 == i) sleep(5); }
>
> Kludgey and horrible, but it works.
>
> 2. mpirun an xterm with gdb.

If one of the participating hosts is the localhost and it's sufficient to debug only one process, it's even possible to call gdb directly:

adi@ipc654~$ mpirun -np 2 -host ipc654,dana \
    sh -c 'if [[ $(hostname) == "ipc654" ]]; then gdb test/vm/ring; \
    else test/vm/ring ; fi '

(also works great with ddd).

--
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
private: http://adi.thur.de
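For the multi-node case, Jeff's option 2 can be spelled out along the same lines. This is only a sketch and assumes X access to the compute nodes is already arranged (e.g. via "xhost +" and a DISPLAY value the nodes can reach, as described in the previous message):

    # One xterm+gdb per rank; -d keeps the ssh sessions (and X relaying) alive,
    # -x DISPLAY exports the display setting to the remote ranks.
    mpirun -d -np 6 --hostfile ./h1-3 -x DISPLAY xterm -e gdb ./cpi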
Re: [OMPI users] Ompi failing on mx only
> >> PS: Is there any way you can attach to the processes with gdb? I
> >> would like to see the backtrace as shown by gdb in order to be able
> >> to figure out what's wrong there.

I found out that all processes on the 2nd node crash, so I just put a 30-second wait before MPI_Init in order to attach gdb and go from there. The code in cpi starts off as follows (to show where the SIGTERM below is coming from).

MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Get_processor_name(processor_name,&namelen);

---

Attaching to process 11856
Reading symbols from /home/ggrobe/Projects/ompi/cpi/cpi...done.
Using host libthread_db library "/lib/libthread_db.so.1".
Reading symbols from /usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0...done.
Loaded symbols for /usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0
Reading symbols from /usr/local/openmpi-1.2b3r13030/lib/libopen-rte.so.0...done.
Loaded symbols for /usr/local/openmpi-1.2b3r13030/lib/libopen-rte.so.0
Reading symbols from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0...done.
Loaded symbols for /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib64/libutil.so.1...done.
Loaded symbols for /lib/libutil.so.1
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 46974166086512 (LWP 11856)]
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x2ab90661e880 in nanosleep () from /lib/libc.so.6
(gdb) break MPI_Init
Breakpoint 1 at 0x2ab905c0c880
(gdb) break MPI_Comm_size
Breakpoint 2 at 0x2ab905c01af0
(gdb) continue
Continuing.
[Switching to Thread 46974166086512 (LWP 11856)]
Breakpoint 1, 0x2ab905c0c880 in PMPI_Init () from /usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0
(gdb) n
Single stepping until exit from function PMPI_Init, which has no line number information.
[New Thread 1082132816 (LWP 11862)]
Program received signal SIGTERM, Terminated.
0x2ab906643f47 in ioctl () from /lib/libc.so.6
(gdb) backtrace
#0  0x2ab906643f47 in ioctl () from /lib/libc.so.6
Cannot access memory at address 0x7fffa50102f8

---

Does this help in any way?
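For clarity, the 30-second wait described above presumably amounts to something like the following sketch; the sleep(30) placement and the pid printout are assumptions for illustration, since the actual cpi source isn't shown beyond the first four calls:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int numprocs, myid, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    /* Window in which to attach on the remote node, e.g. gdb -p <pid> */
    printf("pid %d sleeping 30s before MPI_Init\n", (int)getpid());
    fflush(stdout);
    sleep(30);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);
    /* ... pi computation elided ... */
    MPI_Finalize();
    return 0;
}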
Re: [OMPI users] Ompi failing on mx only
Not really. This is the backtrace of the process that gets killed because mpirun detects that the other one died ... What I need is the backtrace of the process which generates the segfault. Second, in order to understand the backtrace, it's better to have run a debug version of Open MPI. Without the debug version we only see the address where the fault occurs, without having access to the line number ...

Thanks,
george.

On Mon, 8 Jan 2007, Grobe, Gary L. (JSC-EV)[ESCG] wrote:
> Does this help in any way?
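As an aside, the line-number information George asks for normally requires a debug build; a typical way to get one (my suggestion, not something spelled out in this thread) is to reconfigure Open MPI with debugging enabled and rebuild the test program with -g:

$ ./configure --prefix=/usr/local/openmpi-1.2b3r13030 --enable-debug
$ make all install
$ mpicc -g -o cpi cpi.c

With that, backtraces from orterun and the MPI libraries resolve to file and line instead of bare addresses.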
Re: [OMPI users] Ompi failing on mx only
> Second thing. From one of your previous emails, I see that MX is
> configured with 4 instances per node. You're running with exactly 4
> processes on the first 2 nodes. Weird things might happen ...

4 processes per node will be just fine. This is not like GM, where the 4 includes some "reserved" ports.
-reese
Re: [OMPI users] Ompi failing on mx only
On Jan 8, 2007, at 9:11 PM, Reese Faucette wrote:
> > Second thing. From one of your previous emails, I see that MX is
> > configured with 4 instances per node. You're running with exactly 4
> > processes on the first 2 nodes. Weird things might happen ...
>
> 4 processes per node will be just fine. This is not like GM, where the
> 4 includes some "reserved" ports.

Right, that's the maximum number of open MX channels, i.e. processes that can run on the node using MX.

With MX (1.2.0c, I think), I get weird messages if I run a second mpirun quickly after the first one failed. The Myrinet guys, I'm quite sure, can explain why and how. Somehow, when an application segfaults while the MX port is open, things are not cleaned up right away. It takes a few seconds (not more than one minute) to have everything running correctly after that.

george.
Re: [OMPI users] Ompi failing on mx only
> Right, that's the maximum number of open MX channels, i.e. processes
> that can run on the node using MX.
>
> With MX (1.2.0c, I think), I get weird messages if I run a second
> mpirun quickly after the first one failed. The Myrinet guys, I'm quite
> sure, can explain why and how. Somehow, when an application segfaults
> while the MX port is open, things are not cleaned up right away. It
> takes a few seconds (not more than one minute) to have everything
> running correctly after that.

Supposedly I am a "myrinet guy" ;-) Yeah, the endpoint cleanup stuff could take a few seconds after an ungraceful exit. But if you're getting behavior that looks like something you ought not be getting, please let us know!
-reese
Myricom, Inc.
Re: [OMPI users] Ompi failing on mx only
On Jan 8, 2007, at 9:34 PM, Reese Faucette wrote:
> > Right, that's the maximum number of open MX channels, i.e. processes
> > that can run on the node using MX.
> >
> > With MX (1.2.0c, I think), I get weird messages if I run a second
> > mpirun quickly after the first one failed. The Myrinet guys, I'm
> > quite sure, can explain why and how. Somehow, when an application
> > segfaults while the MX port is open, things are not cleaned up right
> > away. It takes a few seconds (not more than one minute) to have
> > everything running correctly after that.
>
> Supposedly I am a "myrinet guy" ;-) Yeah, the endpoint cleanup stuff
> could take a few seconds after an ungraceful exit. But if you're
> getting behavior that looks like something you ought not be getting,
> please let us know!

I think what I get makes sense. If I loop in a script starting mpiruns and one of the runs segfaults, the next one is usually unable to open the MX endpoints. That happens only if I run 4 processes per node, where 4 is the number of instances as reported by mx_info. If I put a sleep of 30 seconds between my runs, then everything runs just fine.

george.
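To illustrate the workaround George describes, a looping test script of that sort might look roughly like this (the iteration count and the exact mpirun arguments are illustrative; only the sleep between runs is the point):

for i in $(seq 1 20); do
    mpirun --prefix /usr/local/openmpi-1.2b3r13030 -x LD_LIBRARY_PATH \
        --hostfile ./h1-3 -np 6 --mca pml cm --mca mtl mx ./cpi
    sleep 30    # give MX time to release endpoints after an ungraceful exit
done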
Re: [OMPI users] Ompi failing on mx only
> I need the backtrace on the process which generates the
> segfault. Second, in order to understand the backtrace, it's
> better to have run a debug version of Open MPI. Without the
> debug version we only see the address where the fault occurs,
> without having access to the line number ...

How about this: this is the section I was stepping through in order to get the first error I usually run into ... "mx_connect fail for node-1:0 with key (error Endpoint closed or not connectable!)"

// gdb output
Breakpoint 1, 0x2ac856bd92e0 in opal_progress () from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
(gdb) s
Single stepping until exit from function opal_progress, which has no line number information.
0x2ac857361540 in sched_yield () from /lib/libc.so.6
(gdb) s
Single stepping until exit from function sched_yield, which has no line number information.
opal_condition_wait (c=0x5098e0, m=0x5098a0) at condition.h:80
80          while (c->c_signaled == 0) {
(gdb) s
81              opal_progress();
(gdb) s
Breakpoint 1, 0x2ac856bd92e0 in opal_progress () from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
(gdb) s
Single stepping until exit from function opal_progress, which has no line number information.
0x2ac857361540 in sched_yield () from /lib/libc.so.6
(gdb) backtrace
#0  0x2ac857361540 in sched_yield () from /lib/libc.so.6
#1  0x00402f60 in opal_condition_wait (c=0x5098e0, m=0x5098a0) at condition.h:81
#2  0x00402b3c in orterun (argc=17, argv=0x7fff54151088) at orterun.c:427
#3  0x00402713 in main (argc=17, argv=0x7fff54151088) at main.c:13

---

This is the mpirun output as I was stepping through it. At the end of this is the error that the backtrace above shows.

[node-2:11909] top: openmpi-sessions-ggrobe@node-2_0
[node-2:11909] tmp: /tmp
[node-1:10719] procdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1/0
[node-1:10719] jobdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1
[node-1:10719] unidir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414
[node-1:10719] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10719] tmp: /tmp
[juggernaut:17414] spawn: in job_state_callback(jobid = 1, state = 0x4)
[juggernaut:17414] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 6
  MPIR_proctable:
    (i, host, exe, pid) = (0, node-1, /home/ggrobe/Projects/ompi/cpi/./cpi, 10719)
    (i, host, exe, pid) = (1, node-1, /home/ggrobe/Projects/ompi/cpi/./cpi, 10720)
    (i, host, exe, pid) = (2, node-1, /home/ggrobe/Projects/ompi/cpi/./cpi, 10721)
    (i, host, exe, pid) = (3, node-1, /home/ggrobe/Projects/ompi/cpi/./cpi, 10722)
    (i, host, exe, pid) = (4, node-2, /home/ggrobe/Projects/ompi/cpi/./cpi, 11908)
    (i, host, exe, pid) = (5, node-2, /home/ggrobe/Projects/ompi/cpi/./cpi, 11909)
[node-1:10718] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10718] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10721] procdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1/2
[node-1:10721] jobdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1
[node-1:10721] unidir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414
[node-1:10721] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10721] tmp: /tmp
[node-1:10720] mx_connect fail for node-1:0 with key (error Endpoint closed or not connectable!)