I've grabbed last night's tarball (1.2b3r12981) and tried using the
shared memory transport with the BTL, and mx,self with the MTL; same
results either way. What I don't get is that sometimes it works and
sometimes it doesn't (for either). For example, I can run it 10 times
successfully, then increase -np from 7 to 10 across 3 nodes, and it
will immediately fail.
Here's an example of one run right after another.
$ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} \
    --hostfile ./h25-27 -np 10 --mca mtl mx,self ./cpi
Process 0 of 10 is on node-25
Process 4 of 10 is on node-26
Process 1 of 10 is on node-25
Process 5 of 10 is on node-26
Process 2 of 10 is on node-25
Process 8 of 10 is on node-27
Process 6 of 10 is on node-26
Process 9 of 10 is on node-27
Process 7 of 10 is on node-26
Process 3 of 10 is on node-25
pi is approximately 3.1415926544231256, Error is 0.0000000008333325
wall clock time = 0.017513
$ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} \
    --hostfile ./h25-27 -np 10 --mca mtl mx,self ./cpi
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b8ddf3ccd3f]
[1] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0 [0x2b8ddf3cb891]
[2] func:/lib/libpthread.so.0 [0x2b8ddf98f6c0]
[3] func:/opt/mx/lib/libmyriexpress.so(mx_open_endpoint+0x6df) [0x2b8de25bf2af]
[4] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_btl_mx.so(mca_btl_mx_component_init+0x5d7) [0x2b8de27dcd27]
[5] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_btl_base_select+0x156) [0x2b8ddf125b46]
[6] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x11) [0x2b8de26d7491]
[7] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_bml_base_init+0x7d) [0x2b8ddf12543d]
[8] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_init+0x6b) [0x2b8de23a4f8b]
[9] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_pml_base_select+0x113) [0x2b8ddf12cea3]
[10] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(ompi_mpi_init+0x45a) [0x2b8ddf0f5bda]
[11] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(MPI_Init+0x83) [0x2b8ddf116af3]
[12] func:./cpi(main+0x42) [0x400cd5]
[13] func:/lib/libc.so.6(__libc_start_main+0xe3) [0x2b8ddfab50e3]
[14] func:./cpi [0x400bd9]
*** End of error message ***
mpirun noticed that job rank 0 with PID 0 on node node-25 exited on
signal 11.
9 additional processes aborted (not shown)
-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Brian W. Barrett
Sent: Tuesday, January 02, 2007 4:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only
Sorry to jump into the discussion late. The mx BTL does not by itself
support communication between processes on the same node, so you have
to include the shared memory transport when using MX. This will
eventually be fixed, but likely not for the 1.2 release. So if you do:

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH \
    --hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi

It should work much better.
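The same btl list can also be set once instead of on every command line;
a minimal sketch, assuming only the standard Open MPI MCA parameter
mechanisms (the per-user params file and OMPI_MCA_* environment
variables), not anything specific to this thread:

  # in $HOME/.openmpi/mca-params.conf
  btl = mx,sm,self

  # or exported in the shell before launching
  $ export OMPI_MCA_btl=mx,sm,self
  $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 ./cpi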
As for the MTL, there was a bug in the MX MTL for v1.2 that could cause
the random failures you were seeing; it has been fixed, but not until
after 1.2b2. It will work much better once 1.2b3 is released (or, if
you are feeling really lucky, you can try out the 1.2 nightly tarballs).
The MTL is a new feature in v1.2. It is a different communication
abstraction designed to support interconnects that implement matching
in the lower-level library or in hardware (Myrinet/MX, Portals, and
InfiniPath are currently implemented). The MTL lets us exploit the low
latency and asynchronous progress these libraries can provide, but it
does mean that multi-NIC support is reduced. Further, the MTL is not
well suited to interconnects like TCP or InfiniBand, so we will
continue supporting the BTL interface as well.
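In practice, which stack you get is chosen at mpirun time. A rough
side-by-side, reusing the commands already shown in this thread (the
hostfile and -np options are elided here):

  # BTL path: ob1 PML over byte-transfer layers; sm is needed for same-node peers
  $ mpirun --mca btl mx,sm,self ... ./cpi

  # MTL path: cm PML over a matching transport layer; MX does its own matching
  $ mpirun --mca pml cm --mca mtl mx,self ... ./cpi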
Brian
On Jan 2, 2007, at 2:44 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:
About the -x: I've been trying it both ways and prefer the latter; the
results are the same either way, but its value is correct.
I've attached the ompi_info from node-1 and node-2. Sorry for not
zipping them, but they were small and I think I'd have firewall
issues.
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH \
    --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
[node-14:19260] mx_connect fail for node-14:0 with key aaaaffff (error Endpoint closed or not connectable!)
[node-14:19261] mx_connect fail for node-14:0 with key aaaaffff (error Endpoint closed or not connectable!)
...
Is there any info anywhere on the MTL? Anyway, I've run with the MTL,
and it actually worked once. But now I can't reproduce it, and it's
throwing sig 7's, 11's, and 4's depending upon the number of procs I
give it. Now that you mention the mapper, I take it that's what
SEGV_MAPERR might be referring to. I'm looking into the ...
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} \
    --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi
Process 4 of 5 is on node-2
Process 0 of 5 is on node-1
Process 1 of 5 is on node-1
Process 2 of 5 is on node-1
Process 3 of 5 is on node-1
pi is approximately 3.1415926544231225, Error is 0.0000000008333294
wall clock time = 0.019305
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b88243862be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
4 additional processes aborted (not shown)

Or sometimes I'll get this error, just depending upon the number of procs ...
mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} \
    --hostfile ./h1-3 -np 7 --mca mtl mx,self ./cpi
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x2aaaaaaab000
[0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
[1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
[2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
[3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
[4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
[5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
[6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
[7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576]
[8] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe3) [0x2b9b82a2d0a3]
[9] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(ompi_mpi_init+0x697) [0x2b9b7f77be07]
[10] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(MPI_Init+0x83) [0x2b9b7f79c943]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b9b8013a134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
Process 4 of 7 is on node-2
Process 5 of 7 is on node-2
Process 6 of 7 is on node-2
Process 0 of 7 is on node-1
Process 1 of 7 is on node-1
Process 2 of 7 is on node-1
Process 3 of 7 is on node-1
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.009331
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at addr:0x2b4ba33652be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at addr:0x2b8685aba2be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at addr:0x2b304ffbe2be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
6 additional processes aborted (not shown)
OK, so I take it one of them is down. Would this be the cause of all the
different errors I'm seeing?
$ fm_status
FMS Fabric status
17 hosts known
16 FMAs found
3 un-ACKed alerts
Mapping is complete, last map generated by node-20
Database generation not yet complete.
From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 2:52 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only
Hi, Gary-
This looks like a config problem, and not a code problem yet.
Could you send the output of mx_info from node-1 and from node-2?
Also, forgive me for counter-asking a possibly dumb OMPI question, but
is "-x LD_LIBRARY_PATH" really what you want, as opposed to
"-x LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"? (I would not be surprised if
not specifying a value defaults to this behavior, but I have to ask.)
Also, have you tried the MX MTL as opposed to the BTL?
--mca pml cm --mca mtl mx,self (it looks like you did)
"[node-2:10464] mx_connect fail for node-2:0 with key aaaaffff"
makes it look like your fabric may not be fully mapped or that you may
have a down link.
thanks,
-reese
Myricom, Inc.
I was initially using 1.1.2 and moved to 1.2b2 because of a hang in
MPI_Bcast() that 1.2b2 was reported to fix, and it seems to have done
so. My compute nodes are 2 dual-core Xeons on Myrinet with MX. The
problem is getting OMPI running on MX only. My machine file is as
follows ...
node-1 slots=4 max-slots=4
node-2 slots=4 max-slots=4
node-3 slots=4 max-slots=4
Here is mpirun with the minimum number of processes needed to produce
the error ...
mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
--hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi
I don't believe there's anything wrong with the hardware, as I can ping
over MX between this failed node and the master just fine. I also tried
a different set of 3 nodes and got the same error; it always fails on
the 2nd node of whatever group of nodes I choose.
<node-2.out>
<node-1.out>
--
Brian Barrett
Open MPI Team, CCS-1
Los Alamos National Laboratory
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users