[OMPI users] Ompi failing on mx only

2007-01-02 Thread Grobe, Gary L. (JSC-EV)[ESCG]

I was initially using 1.1.2 and moved to 1.2b2 because of a hang on
MPI_Bcast() which 1.2b2 was reported to fix, and it seems to have done so.
My compute nodes each have 2 dual-core Xeons on Myrinet with MX. The problem
is getting ompi running on mx only. My machine file is as follows ...

node-1 slots=4 max-slots=4
node-2 slots=4 max-slots=4
node-3 slots=4 max-slots=4

Running 'mpirun' with the minimum number of processes needed to reproduce the error
...
mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
--hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi

... results in the following output ...

:~/Projects/ompi/cpi$ mpirun --prefix /usr/local/openmpi-1.2b2 -x
LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi


--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of 
usable components.

--

--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of 
usable components.

--

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 1 with PID 0 on node node-1 exited on
signal 1.

---- end of output ----

I get that same error w/ the examples included in the ompi-1.2b2
distrib. However, if I change the mca params as such ...

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
--hostfile ./h1-3 -np 5 --mca pml cm ./cpi

Running up to -np 5 works (one of the processes does get put on the 2nd
node), but running with -np 6 fails with the following ...

[node-2:10464] mx_connect fail for node-2:0 with key  (error
Endpoint closed or not connectable!)
[node-2:10463] mx_connect fail for node-2:0 with key  (error
Endpoint closed or not connectable!)

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on
sig

Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Reese Faucette
Hi, Gary-
This looks like a config problem, and not a code problem yet.  Could you send 
the output of mx_info from node-1 and from node-2?  Also, forgive me 
counter-asking a possibly dumb OMPI question, but is "-x LD_LIBRARY_PATH" 
really what you want, as opposed to "-x LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" ?  
(I would not be surprised if not specifying a value defaults to this behavior, 
but have to ask).

Also, have you tried MX MTL as opposed to BTL?  --mca pml cm --mca mtl mx,self  
(it looks like you did)

"[node-2:10464] mx_connect fail for node-2:0 with key  " makes it look 
like your fabric may not be fully mapped or that you may have a down link.

thanks,
-reese
Myricom, Inc.


  I was initially using 1.1.2 and moved to 1.2b2 because of a hang on 
MPI_Bcast() which 1.2b2 reports to fix, and seemed to have done so. My compute 
nodes are 2 dual core xeons on myrinet with mx. The problem is trying to get 
ompi running on mx only. My machine file is as follows .

  node-1 slots=4 max-slots=4 
  node-2 slots=4 max-slots=4 
  node-3 slots=4 max-slots=4 

  'mpirun' with the minimum number of processes in order to get the error ... 
  mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH 
--hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi 

  I don't believe there's anything wrong w/ the hardware, as I can ping on mx 
between this failed node and the master fine. So I tried a different set of 3 
nodes and got the same error; it always fails on the 2nd node of any group of 
nodes I choose.


Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Grobe, Gary L. (JSC-EV)[ESCG]
About the -x, I've been trying it both ways and prefer the latter, and
results for either are the same. But its value is correct. I've
attached the ompi_info from node-1 and node-2. Sorry for not zipping
them, but they were small and I think I'd have firewall issues.
 
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile
./h13-15 -np 6 --mca pml cm ./cpi 
[node-14:19260] mx_connect fail for node-14:0 with key  (error
Endpoint closed or not connectable!)
[node-14:19261] mx_connect fail for node-14:0 with key  (error
Endpoint closed or not connectable!)
...
 
Is there any info anywhere on the MTL? Anyway, I've run w/ mtl, and it
actually worked once. But now I can't reproduce it and it's
throwing sig 7's, 11's, and 4's depending upon the number of procs I
give it. But now that you mention mapper, I take it that's what
SEGV_MAPERR might be referring to. I'm looking into the 
 
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl
mx,self ./cpi 
Process 4 of 5 is on node-2
Process 0 of 5 is on node-1
Process 1 of 5 is on node-1
Process 2 of 5 is on node-1
Process 3 of 5 is on node-1
pi is approximately 3.1415926544231225, Error is 0.08333294
wall clock time = 0.019305
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b88243862be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on
signal 1.
4 additional processes aborted (not shown)

Or sometimes I'll get this error, just depending upon the number of
procs ...
 
 mpirun --prefix /usr/local/openmpi-1.2b2 -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl
mx,self ./cpi 
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x2aaab000
[0]
func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+
0x1f) [0x2b9b7fa52d1f]
[1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
[2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
[3]
func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mm
ap_init+0x1e3) [0x2b9b8270ef83]
[4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so
[0x2b9b8260d0ff]
[5]
func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_crea
te+0x70) [0x2b9b7f7afac0]
[6]
func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_p
rocs_same_base_addr+0x907) [0x2b9b83070517]
[7]
func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_p
rocs+0x206) [0x2b9b82d5f576]
[8]
func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add
_procs+0xe3) [0x2b9b82a2d0a3]
[9] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(ompi_mpi_init+0x697)
[0x2b9b7f77be07]
[10] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(MPI_Init+0x83)
[0x2b9b7f79c943]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b9b8013a134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
Process 4 of 7 is on node-2
Process 5 of 7 is on node-2
Process 6 of 7 is on node-2
Process 0 of 7 is on node-1
Process 1 of 7 is on node-1
Process 2 of 7 is on node-1
Process 3 of 7 is on node-1
pi is approximately 3.1415926544231239, Error is 0.0807
wall clock time = 0.009331
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b4ba33652be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b8685aba2be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b304ffbe2be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on
signal 1.
6 additional processes aborted (not shown)

 
Ok, so I take it one is down. Would this be the cause for all the
different errors I'm seeing?
 
$ fm_status 
FMS Fabric status
 
17  hosts known
16  FMAs found
3   un-ACKed alerts
Mapping is complete, last map generated by node-20
Database generation not yet complete.


 


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 2:52 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only


Hi, Gary-
This looks like a config problem, and not a code problem yet.  Could you
send the output of mx_info from node-1 and from node-2?  Also, forgive
me counter-asking a possibly dumb OMPI question, but is "-x
LD_LIBRARY_PATH" really what you want, as opposed to "-x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" ?  (I would not be surprised if not
specifying a value defaults to this behavior, but have to ask).
 
Also, have you tried MX MTL as opposed to BTL?  --mca pml cm --mca mtl
mx,self  (it looks like you did)
 
"[node-2:10464] mx_connect fail for node-2:0 with key  " makes
it look like your fabric may not be fully mapped or that you may have a
down link.
 
thanks,
-reese
Myricom, Inc.



Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Reese Faucette

> I've attached the ompi_info from node-1 and node-2.

thanks, but i need "mx_info", not "ompi_info" ;-)

But now that you mention mapper, I take it that's what SEGV_MAPERR might 
be referring to.


this is an ompi red herring; it has nothing to do with Myrinet mapping, even 
though it kinda looks like it.


$ mpirun --prefix /usr/local/openmpi-1.2b2 -x 
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl 
mx,self ./cpi

Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)


A gdb traceback would be interesting on this one.
thanks,
-reese 





Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Brian W. Barrett
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on
signal 1.

6 additional processes aborted (not shown)

Ok, so I take it one is down. Would this be the cause for all the  
different errors I'm seeing?


$ fm_status
FMS Fabric status

17  hosts known
16  FMAs found
3   un-ACKed alerts
Mapping is complete, last map generated by node-20
Database generation not yet complete.


From: users-boun...@open-mpi.org [mailto:users-bounces@open- 
mpi.org] On Behalf Of Reese Faucette

Sent: Tuesday, January 02, 2007 2:52 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

Hi, Gary-
This looks like a config problem, and not a code problem yet.   
Could you send the output of mx_info from node-1 and from node-2?   
Also, forgive me counter-asking a possibly dumb OMPI question, but  
is "-x LD_LIBRARY_PATH" really what you want, as opposed to "-x  
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" ?  (I would not be surprised if  
not specifying a value defaults to this behavior, but have to ask).


Also, have you tried MX MTL as opposed to BTL?  --mca pml cm --mca  
mtl mx,self  (it looks like you did)


"[node-2:10464] mx_connect fail for node-2:0 with key  "  
makes it look like your fabric may not be fully mapped or that you  
may have a down link.


thanks,
-reese
Myricom, Inc.

I was initially using 1.1.2 and moved to 1.2b2 because of a hang on  
MPI_Bcast() which 1.2b2 reports to fix, and seemed to have done so.  
My compute nodes are 2 dual core xeons on myrinet with mx. The  
problem is trying to get ompi running on mx only. My machine file  
is as follows …


node-1 slots=4 max-slots=4
node-2 slots=4 max-slots=4
node-3 slots=4 max-slots=4

'mpirun' with the minimum number of processes in order to get the  
error ...
mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH  
--hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi


I don't believe there's anything wrong w/ the hardware as I can  
ping on mx between this failed node and the master fine. So I tried  
a different set of 3 nodes and I got the same error, it always  
fails on the 2nd node of any group of nodes I choose.






--
  Brian Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory





Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Reese Faucette

 As for the MTL, there is a bug in the MX
MTL for v1.2 that has been fixed, but after 1.2b2 ...


oops, i was stupidly assuming he already had that fix.  yes, this is an 
important fix...

-reese




Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Grobe, Gary L. (JSC-EV)[ESCG]
Ah, sorry about that ... 

$ ./mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30
14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0:  224.9 MHz LANai, 99.7 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address:00:60:dd:47:c2:a7
Product code:   M3F-PCIXD-2 V2.2
Part number:09-03034
Serial number:  291824
Mapper: 46:4d:53:4d:41:50, version = 0x000c,
configured
Mapped hosts:   16

                                            ROUTE COUNT
INDEX    MAC ADDRESS         HOST NAME      P0
-----    -----------         ---------      -----------
   0) 00:60:dd:47:c2:a7 juggernaut:0  1,1
   1) 00:60:dd:47:ab:c9 node-1:0  6,3
   2) 00:60:dd:47:ab:c8 node-2:0  6,3
   3) 00:60:dd:47:ab:ca node-3:0  6,3
   4) 00:60:dd:47:bf:65 node-7:0  6,3
   5) 00:60:dd:47:c2:e1 node-8:0  6,3
   6) 00:60:dd:47:c0:c1 node-9:0  6,3
   7) 00:60:dd:47:c0:e5 node-13:0 6,3
   8) 00:60:dd:47:c2:91 node-14:0 6,3
   9) 00:60:dd:47:c0:b2 node-15:0 6,3
  10) 00:60:dd:47:bf:f5 node-19:0 1,1
  11) 00:60:dd:47:c0:b1 node-20:0 6,3
  12) 00:60:dd:47:c0:f8 node-21:0 7,3
  13) 00:60:dd:47:c0:8a node-25:0 6,3
  14) 00:60:dd:47:c0:c2 node-27:0 5,3
  15) 00:60:dd:47:c2:e0 node-26:0 5,3

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 4:08 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

> I've attached the ompi_info from node-1 and
node-2.

thanks, but i need "mx_info", not "ompi_info" ;-)

> But now that you mention mapper, I take it that's what SEGV_MAPERR 
> might be referring to.

this is an ompi red herring; it has nothing to do with Myrinet mapping,
even though it kinda looks like it.

> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x 
> LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl 
> mx,self ./cpi
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)

A gdb traceback would be interesting on this one.
thanks,
-reese 





Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Grobe, Gary L. (JSC-EV)[ESCG]
I'm losing it today; I just now noticed I sent mx_info for the wrong
nodes ...


// node-1
$ mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30
14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0:  224.9 MHz LANai, 133.3 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address:00:60:dd:47:ab:c9
Product code:   M3F-PCIXD-2 V2.2
Part number:09-03034
Serial number:  299207
Mapper: 46:4d:53:4d:41:50, version = 0x000c,
configured
Mapped hosts:   16

                                            ROUTE COUNT
INDEX    MAC ADDRESS         HOST NAME      P0
-----    -----------         ---------      -----------
   0) 00:60:dd:47:ab:c9 node-1:0  1,1
   1) 00:60:dd:47:c2:a7 juggernaut:0  5,3
   2) 00:60:dd:47:ab:c8 node-2:0  6,3
   3) 00:60:dd:47:ab:ca node-3:0  7,3
   4) 00:60:dd:47:bf:65 node-7:0  7,3
   5) 00:60:dd:47:c2:e1 node-8:0  7,3
   6) 00:60:dd:47:c0:c1 node-9:0  6,3
   7) 00:60:dd:47:c0:e5 node-13:0 1,1
   8) 00:60:dd:47:c2:91 node-14:0 7,3
   9) 00:60:dd:47:c0:b2 node-15:0 7,3
  10) 00:60:dd:47:bf:f5 node-19:0 6,3
  11) 00:60:dd:47:c0:b1 node-20:0 8,3
  12) 00:60:dd:47:c0:f8 node-21:0 5,3
  13) 00:60:dd:47:c0:8a node-25:0 6,3
  14) 00:60:dd:47:c0:c2 node-27:0 7,3
  15) 00:60:dd:47:c2:e0 node-26:0 6,3

// node-2
$ mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30
14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0:  224.9 MHz LANai, 133.0 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address:00:60:dd:47:ab:c8
Product code:   M3F-PCIXD-2 V2.2
Part number:09-03034
Serial number:  299208
Mapper: 46:4d:53:4d:41:50, version = 0x000c,
configured
Mapped hosts:   16

                                            ROUTE COUNT
INDEX    MAC ADDRESS         HOST NAME      P0
-----    -----------         ---------      -----------
   0) 00:60:dd:47:ab:c8 node-2:0  1,1
   1) 00:60:dd:47:ab:c9 node-1:0  6,3
   2) 00:60:dd:47:c2:a7 juggernaut:0  5,3
   3) 00:60:dd:47:ab:ca node-3:0  6,3
   4) 00:60:dd:47:bf:65 node-7:0  5,3
   5) 00:60:dd:47:c2:e1 node-8:0  6,3
   6) 00:60:dd:47:c0:c1 node-9:0  5,3
   7) 00:60:dd:47:c0:e5 node-13:0 6,3
   8) 00:60:dd:47:c2:91 node-14:0 8,3
   9) 00:60:dd:47:c0:b2 node-15:0 1,1
  10) 00:60:dd:47:bf:f5 node-19:0 5,3
  11) 00:60:dd:47:c0:f8 node-21:0 5,3
  12) 00:60:dd:47:c0:8a node-25:0 5,3
  13) 00:60:dd:47:c0:c2 node-27:0 6,3
  14) 00:60:dd:47:c2:e0 node-26:0 5,3
  15) 00:60:dd:47:c0:b1 node-20:0 6,3

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 4:08 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

> I've attached the ompi_info from node-1 and
node-2.

thanks, but i need "mx_info", not "ompi_info" ;-)

> But now that you mention mapper, I take it that's what SEGV_MAPERR 
> might be referring to.

this is an ompi red herring; it has nothing to do with Myrinet mapping,
even though it kinda looks like it.

> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x 
> LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl 
> mx,self ./cpi
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)

A gdb traceback would be interesting on this one.
thanks,
-reese 



Re: [OMPI users] Ompi failing on mx only

2007-01-03 Thread Grobe, Gary L. (JSC-EV)[ESCG]
Just as an FYI, I also included the sm param as you suggested and
changed the -np to 1, because anything more than that just duplicates
the same error. I also saw this same error message in previous posts as
a bug. Would that be the same issue in this case?

$ mpirun --prefix /usr/local/openmpi-1.2b2 --hostfile ./h1-3 -np 1 --mca
btl mx,sm,self ./cpi
[node-1:09704] mca: base: component_find: unable to open mtl mx: file
not found (ignored)
[node-1:09704] mca: base: component_find: unable to open btl mx: file
not found (ignored)
Process 0 of 1 is on node-1
pi is approximately 3.1415926544231341, Error is 0.08333410
wall clock time = 0.000331


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Brian W. Barrett
Sent: Tuesday, January 02, 2007 4:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

Sorry to jump into the discussion late.  The mx btl does not support
communication between processes on the same node by itself, so you have
to include the shared memory transport when using MX.  This will
eventually be fixed, but likely not for the 1.2 release.  So if you do:

   mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --
hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi

It should work much better.  As for the MTL, there is a bug in the MX
MTL for v1.2 that has been fixed, but after 1.2b2 that could cause the
random failures you were seeing.  It will work much better after
1.2b3 is released (or if you are feeling really lucky, you can try out
the 1.2 nightly tarballs).

The MTL is a new feature in v1.2. It is a different communication
abstraction designed to support interconnects that have matching
implemented in the lower level library or in hardware (Myrinet/MX,
Portals, InfiniPath are currently implemented).  The MTL allows us to
exploit the low latency and asynchronous progress these libraries can
provide, but does mean multi-nic abilities are reduced.  Further, the
MTL is not well suited to interconnects like TCP or InfiniBand, so we
will continue supporting the BTL interface as well.
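
For reference, here is a minimal sketch of the two selection modes described
above, reusing the prefix and hostfile from earlier in this thread (these
exact command lines are illustrative, not taken verbatim from any of the runs):

   # BTL path (ob1 PML): MX between nodes, shared memory within a node,
   # self for loopback
   mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH \
       --hostfile ./h1-3 -np 6 --mca btl mx,sm,self ./cpi

   # MTL path (cm PML): matching is handled by the MX library itself
   mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH \
       --hostfile ./h1-3 -np 6 --mca pml cm --mca mtl mx ./cpi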

Brian


On Jan 2, 2007, at 2:44 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:

> About the -x, I've been trying it both ways and prefer the latter, and

> results for either are the same. But it's value is correct.
> I've attached the ompi_info from node-1 and node-2. Sorry for not 
> zipping them, but they were small and I think I'd have firewall 
> issues.
>
> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH -- 
> hostfile ./h13-15 -np 6 --mca pml cm ./cpi [node-14:19260] mx_connect 
> fail for node-14:0 with key  (error Endpoint closed or not 
> connectable!) [node-14:19261] mx_connect fail for node-14:0 with key 
>  (error Endpoint closed or not connectable!) ...
>
> Is there any info anywhere's on MTL? Anyways, I've run w/ mtl, and 
> sometimes it actually worked once. But now I can't reproduce it and 
> it's throwing sig 7's, 11's, and 4's depending upon the number of 
> procs I give it. But now that you mention mapper, I take it that's 
> what SEGV_MAPERR might be referring to. I'm looking into the
>
> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=$ 
> {LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi 
> Process 4 of 5 is on node-2 Process 0 of 5 is on node-1 Process 1 of 5

> is on node-1 Process 2 of 5 is on node-1 Process 3 of 5 is on node-1 
> pi is approximately 3.1415926544231225, Error is 0.08333294 
> wall clock time = 0.019305
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at 
> addr:0x2b88243862be mpirun noticed that job rank 0 with PID 0 on node 
> node-1 exited on signal 1.
> 4 additional processes aborted (not shown) Or sometimes I'll get this 
> error, just depending upon the number of procs ...
>
>  mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=$ 
> {LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl mx,self ./cpi
> Signal:7 info.si_errno:0(Success) si_code:2() Failing at 
> addr:0x2aaab000 [0] 
> func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0
> (opal_backtrace_print+0x1f) [0x2b9b7fa52d1f] [1] 
> func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0
> [0x2b9b7fa51871]
> [2] func:/lib/libpthread.so.0 [0x2b9b80013d00] [3] 
> func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0
> (mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83] [4] 
> func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so
> [0x2b9b8260d0ff]
> [5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0
> (mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0] [6] 
> func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so
> (mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517] [7] 
> func:/usr/local/openmpi-1.2b2/lib/open

Re: [OMPI users] Ompi failing on mx only

2007-01-03 Thread Reese Faucette

$ mpirun --prefix /usr/local/openmpi-1.2b2 --hostfile ./h1-3 -np 1 --mca
btl mx,sm,self ./cpi
[node-1:09704] mca: base: component_find: unable to open mtl mx: file
not found (ignored)
[node-1:09704] mca: base: component_find: unable to open btl mx: file
not found (ignored)


This in particular is almost certainly a library path issue.  A quick way to 
check to see if your LD_LIBRARY_PATH is correct is to run:
$ mpirun  ldd 
/lib/openmpi/mca_mtl_mx.so


If things are good, you will get a first line like:
   libmyriexpress.so => /opt/mx/lib/libmyriexpress.so (0xb7f1d000)

If not, it will tell you explicitly.  Since all you specified is 
the --prefix line, I'm not surprised libmyriexpress.so is not found in this 
case.
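
A concrete form of that check, as a sketch (the prefix and hostfile here are
just the ones used earlier in this thread; substitute your own):

   mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH \
       --hostfile ./h1-3 -np 1 \
       ldd /usr/local/openmpi-1.2b2/lib/openmpi/mca_mtl_mx.so

If libmyriexpress.so shows up as "not found", the LD_LIBRARY_PATH seen by the
remote shell does not include the MX library directory (e.g. /opt/mx/lib).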

-reese





Re: [OMPI users] Ompi failing on mx only

2007-01-04 Thread Grobe, Gary L. (JSC-EV)[ESCG]
I've grabbed last night's tarball (1.2b3r12981) and tried using the
shared mem transport on btl and mx,self on mtl, with the same results. What I
don't get is that sometimes it works and sometimes it doesn't (for
either). For example, I can run it 10 times successfully, then increase the
-np from 7 to 10 across 3 nodes, and it'll immediately fail.

Here's an example of one run right after another.

$ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 --mca mtl
mx,self ./cpi 
Process 0 of 10 is on node-25
Process 4 of 10 is on node-26
Process 1 of 10 is on node-25
Process 5 of 10 is on node-26
Process 2 of 10 is on node-25
Process 8 of 10 is on node-27
Process 6 of 10 is on node-26
Process 9 of 10 is on node-27
Process 7 of 10 is on node-26
Process 3 of 10 is on node-25
pi is approximately 3.1415926544231256, Error is 0.0825
wall clock time = 0.017513

$ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 --mca mtl
mx,self ./cpi 
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0]
func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0(opal_backtrace_
print+0x1f) [0x2b8ddf3ccd3f]
[1] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0
[0x2b8ddf3cb891]
[2] func:/lib/libpthread.so.0 [0x2b8ddf98f6c0]
[3] func:/opt/mx/lib/libmyriexpress.so(mx_open_endpoint+0x6df)
[0x2b8de25bf2af]
[4]
func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_btl_mx.so(mca_btl_mx
_component_init+0x5d7) [0x2b8de27dcd27]
[5]
func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_btl_base_select+
0x156) [0x2b8ddf125b46]
[6]
func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_bml_r2.so(mca_bml_r2
_component_init+0x11) [0x2b8de26d7491]
[7]
func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_bml_base_init+0x
7d) [0x2b8ddf12543d]
[8]
func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_pml_ob1.so(mca_pml_o
b1_component_init+0x6b) [0x2b8de23a4f8b]
[9]
func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_pml_base_select+
0x113) [0x2b8ddf12cea3]
[10]
func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(ompi_mpi_init+0x45a)
[0x2b8ddf0f5bda]
[11] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(MPI_Init+0x83)
[0x2b8ddf116af3]
[12] func:./cpi(main+0x42) [0x400cd5]
[13] func:/lib/libc.so.6(__libc_start_main+0xe3) [0x2b8ddfab50e3]
[14] func:./cpi [0x400bd9]
*** End of error message ***
mpirun noticed that job rank 0 with PID 0 on node node-25 exited on
signal 11.
9 additional processes aborted (not shown) 

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Brian W. Barrett
Sent: Tuesday, January 02, 2007 4:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

Sorry to jump into the discussion late.  The mx btl does not support
communication between processes on the same node by itself, so you have
to include the shared memory transport when using MX.  This will
eventually be fixed, but likely not for the 1.2 release.  So if you do:

   mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --
hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi

It should work much better.  As for the MTL, there is a bug in the MX
MTL for v1.2 that has been fixed, but after 1.2b2 that could cause the
random failures you were seeing.  It will work much better after
1.2b3 is released (or if you are feeling really lucky, you can try out
the 1.2 nightly tarballs).

The MTL is a new feature in v1.2. It is a different communication
abstraction designed to support interconnects that have matching
implemented in the lower level library or in hardware (Myrinet/MX,
Portals, InfiniPath are currently implemented).  The MTL allows us to
exploit the low latency and asynchronous progress these libraries can
provide, but does mean multi-nic abilities are reduced.  Further, the
MTL is not well suited to interconnects like TCP or InfiniBand, so we
will continue supporting the BTL interface as well.

Brian


On Jan 2, 2007, at 2:44 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:

> About the -x, I've been trying it both ways and prefer the latter, and

> results for either are the same. But it's value is correct.
> I've attached the ompi_info from node-1 and node-2. Sorry for not 
> zipping them, but they were small and I think I'd have firewall 
> issues.
>
> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH -- 
> hostfile ./h13-15 -np 6 --mca pml cm ./cpi [node-14:19260] mx_connect 
> fail for node-14:0 with key  (error Endpoint closed or not 
> connectable!) [node-14:19261] mx_connect fail for node-14:0 with key 
>  (error Endpoint closed or not connectable!) ...
>
> Is there any info anywhere's on MTL? Anyways, I've run w/ mtl, and 
> sometimes it actually worked once. But now I can't reproduce i

Re: [OMPI users] Ompi failing on mx only

2007-01-04 Thread Jeff Squyres
FWIW, I think we may have broken something in last night's tarball  
(this just came up on an internal development list, too).  I.e.,  
someone broke something that was fixed a little while later, but the  
nightly tarball was created before the problem was fixed.


Sorry about that.  :-(  Such is the nature of nightly snapshots...



On Jan 4, 2007, at 2:00 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:


I've grabbed last nights tarball (1.2b3r12981) and tried using the
shared mem transport on btl and mx,self on mtl, same results. What I
don't get is that, sometimes it works, and sometimes it doesn't (for
either). For example, I can run it 10 times successfully then incr the
-np from 7 to 10 across 3 nodes, and it'll immediately fail.

Here's an example of one run right after another.

$ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 --mca  
mtl

mx,self ./cpi
Process 0 of 10 is on node-25
Process 4 of 10 is on node-26
Process 1 of 10 is on node-25
Process 5 of 10 is on node-26
Process 2 of 10 is on node-25
Process 8 of 10 is on node-27
Process 6 of 10 is on node-26
Process 9 of 10 is on node-27
Process 7 of 10 is on node-26
Process 3 of 10 is on node-25
pi is approximately 3.1415926544231256, Error is 0.0825
wall clock time = 0.017513

$ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 --mca  
mtl

mx,self ./cpi
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0]
func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0 
(opal_backtrace_

print+0x1f) [0x2b8ddf3ccd3f]
[1] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0
[0x2b8ddf3cb891]
[2] func:/lib/libpthread.so.0 [0x2b8ddf98f6c0]
[3] func:/opt/mx/lib/libmyriexpress.so(mx_open_endpoint+0x6df)
[0x2b8de25bf2af]
[4]
func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_btl_mx.so 
(mca_btl_mx

_component_init+0x5d7) [0x2b8de27dcd27]
[5]
func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0 
(mca_btl_base_select+

0x156) [0x2b8ddf125b46]
[6]
func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_bml_r2.so 
(mca_bml_r2

_component_init+0x11) [0x2b8de26d7491]
[7]
func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0 
(mca_bml_base_init+0x

7d) [0x2b8ddf12543d]
[8]
func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_pml_ob1.so 
(mca_pml_o

b1_component_init+0x6b) [0x2b8de23a4f8b]
[9]
func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0 
(mca_pml_base_select+

0x113) [0x2b8ddf12cea3]
[10]
func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(ompi_mpi_init 
+0x45a)

[0x2b8ddf0f5bda]
[11] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(MPI_Init 
+0x83)

[0x2b8ddf116af3]
[12] func:./cpi(main+0x42) [0x400cd5]
[13] func:/lib/libc.so.6(__libc_start_main+0xe3) [0x2b8ddfab50e3]
[14] func:./cpi [0x400bd9]
*** End of error message ***
mpirun noticed that job rank 0 with PID 0 on node node-25 exited on
signal 11.
9 additional processes aborted (not shown)

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-bounces@open- 
mpi.org] On

Behalf Of Brian W. Barrett
Sent: Tuesday, January 02, 2007 4:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

Sorry to jump into the discussion late.  The mx btl does not support
communication between processes on the same node by itself, so you  
have

to include the shared memory transport when using MX.  This will
eventually be fixed, but likely not for the 1.2 release.  So if you  
do:


   mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --
hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi

It should work much better.  As for the MTL, there is a bug in the MX
MTL for v1.2 that has been fixed, but after 1.2b2 that could cause the
random failures you were seeing.  It will work much better after
1.2b3 is released (or if you are feeling really lucky, you can try out
the 1.2 nightly tarballs).

The MTL is a new feature in v1.2. It is a different communication
abstraction designed to support interconnects that have matching
implemented in the lower level library or in hardware (Myrinet/MX,
Portals, InfiniPath are currently implemented).  The MTL allows us to
exploit the low latency and asynchronous progress these libraries can
provide, but does mean multi-nic abilities are reduced.  Further, the
MTL is not well suited to interconnects like TCP or InfiniBand, so we
will continue supporting the BTL interface as well.

Brian


On Jan 2, 2007, at 2:44 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:

About the -x, I've been trying it both ways and prefer the latter,  
and



results for either are the same. But it's value is correct.
I've attached the ompi_info from node-1 and node-2. Sorry for not
zipping them, but they were small and I think I'd have firewall
issues.

$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --
hostfile ./h13-15 -np 6 --mca pml cm ./cpi 

Re: [OMPI users] Ompi failing on mx only

2007-01-04 Thread George Bosilca
There is some confusion here. I see that you are trying to run using the MTL,
but you have the wrong mca parameters. In order to activate the MTL you
should have "--mca pml cm --mca mtl mx" on the mpirun command line.
As you can see from your backtrace, it segfaults in the BTL
initialization, which means that you're using the BTL and not the MTL.
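
So for the run quoted below, MTL selection would look like this (a sketch
only, reusing the prefix, hostfile, and process count from that run):

   mpirun --prefix /usr/local/openmpi-1.2b3r12981/ \
       -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 \
       --mca pml cm --mca mtl mx ./cpi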


Second thing. From one of your previous emails, I see that MX is
configured with 4 instances per node. You're running with exactly 4
processes on the first 2 nodes. Weird things might happen ...


Now, if you use the latest trunk, you can use the new MX BTL, which
provides support for shared memory and self communications. Add "--mca
pml ob1 --mca btl mx --mca btl_mx_shared_mem 1 --mca btl_mx_self 1"
in order to activate these new features. If you have 10G cards, I
suggest you add "--mca btl_mx_flags 2" as well.
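
Put together as a single command line (a sketch; the install prefix is a
placeholder for wherever the trunk build lands, and the hostfile and process
count are reused from the runs quoted below):

   mpirun --prefix /path/to/ompi-trunk \
       -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 \
       --mca pml ob1 --mca btl mx \
       --mca btl_mx_shared_mem 1 --mca btl_mx_self 1 ./cpi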


  Thanks,
george.

PS: Is there any way you can attach to the processes with gdb? I
would like to see the backtrace as shown by gdb in order to be able
to figure out what's wrong there.



On Jan 4, 2007, at 2:00 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:


I've grabbed last nights tarball (1.2b3r12981) and tried using the
shared mem transport on btl and mx,self on mtl, same results. What I
don't get is that, sometimes it works, and sometimes it doesn't (for
either). For example, I can run it 10 times successfully then incr the
-np from 7 to 10 across 3 nodes, and it'll immediately fail.

Here's an example of one run right after another.

$ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 --mca  
mtl

mx,self ./cpi
Process 0 of 10 is on node-25
Process 4 of 10 is on node-26
Process 1 of 10 is on node-25
Process 5 of 10 is on node-26
Process 2 of 10 is on node-25
Process 8 of 10 is on node-27
Process 6 of 10 is on node-26
Process 9 of 10 is on node-27
Process 7 of 10 is on node-26
Process 3 of 10 is on node-25
pi is approximately 3.1415926544231256, Error is 0.0825
wall clock time = 0.017513

$ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 --mca  
mtl

mx,self ./cpi
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0]
func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0 
(opal_backtrace_

print+0x1f) [0x2b8ddf3ccd3f]
[1] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0
[0x2b8ddf3cb891]
[2] func:/lib/libpthread.so.0 [0x2b8ddf98f6c0]
[3] func:/opt/mx/lib/libmyriexpress.so(mx_open_endpoint+0x6df)
[0x2b8de25bf2af]
[4]
func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_btl_mx.so 
(mca_btl_mx

_component_init+0x5d7) [0x2b8de27dcd27]
[5]
func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0 
(mca_btl_base_select+

0x156) [0x2b8ddf125b46]
[6]
func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_bml_r2.so 
(mca_bml_r2

_component_init+0x11) [0x2b8de26d7491]
[7]
func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0 
(mca_bml_base_init+0x

7d) [0x2b8ddf12543d]
[8]
func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_pml_ob1.so 
(mca_pml_o

b1_component_init+0x6b) [0x2b8de23a4f8b]
[9]
func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0 
(mca_pml_base_select+

0x113) [0x2b8ddf12cea3]
[10]
func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(ompi_mpi_init 
+0x45a)

[0x2b8ddf0f5bda]
[11] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(MPI_Init 
+0x83)

[0x2b8ddf116af3]
[12] func:./cpi(main+0x42) [0x400cd5]
[13] func:/lib/libc.so.6(__libc_start_main+0xe3) [0x2b8ddfab50e3]
[14] func:./cpi [0x400bd9]
*** End of error message ***
mpirun noticed that job rank 0 with PID 0 on node node-25 exited on
signal 11.
9 additional processes aborted (not shown)

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-bounces@open- 
mpi.org] On

Behalf Of Brian W. Barrett
Sent: Tuesday, January 02, 2007 4:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

Sorry to jump into the discussion late.  The mx btl does not support
communication between processes on the same node by itself, so you  
have

to include the shared memory transport when using MX.  This will
eventually be fixed, but likely not for the 1.2 release.  So if you  
do:


   mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --
hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi

It should work much better.  As for the MTL, there is a bug in the MX
MTL for v1.2 that has been fixed, but after 1.2b2 that could cause the
random failures you were seeing.  It will work much better after
1.2b3 is released (or if you are feeling really lucky, you can try out
the 1.2 nightly tarballs).

The MTL is a new feature in v1.2. It is a different communication
abstraction designed to support interconnects that have matching
implemented in the lower level library or in hardware (Myrinet/MX,
Portals, InfiniPath are currentl

Re: [OMPI users] Ompi failing on mx only

2007-01-05 Thread Grobe, Gary L. (JSC-EV)[ESCG]
This is just an FYI of the Jan 5th snapshot.

I'll send a backtrace of the processes as soon as I get a b3 running.
Between my filtered webdav svn access problems and the latest nightly
snapshots, my builds are currently failing where the same config lines
worked on previous snapshots ...

$./configure --prefix=/usr/local/openmpi-1.2b3r13006 --with-mx=/opt/mx
--with-mx-libdir=/opt/mx/lib
...

*** GNU libltdl setup
configure: OMPI configuring in opal/libltdl
configure: running /bin/sh './configure'
'--prefix=/usr/local/openmpi-1.2b3r13006' '--with-mx=/opt/mx'
'--with-mx-libdir=/opt/mx/lib' 'F77=ifort' --enable-ltdl-convenience
--disable-ltdl-install --enable-shared --disable-static
--cache-file=/dev/null --srcdir=.
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... configure: error: newly
created file is older than distributed files!
Check your system clock
configure: /bin/sh './configure' *failed* for opal/libltdl
configure: error: Failed to build GNU libltdl.  This usually means that
something
is incorrectly setup with your environment.  There may be useful
information in
opal/libltdl/config.log.  You can also disable GNU libltdl (which will
disable
dynamic shared object loading) by configuring with --disable-dlopen.

 end of output of /opal/libltdl/config.log 

## --- ##
## confdefs.h. ##
## --- ##

#define PACKAGE_BUGREPORT "bug-libt...@gnu.org"
#define PACKAGE_NAME "libltdl"
#define PACKAGE_STRING "libltdl 2.1a"
#define PACKAGE_TARNAME "libltdl"
#define PACKAGE_VERSION "2.1a"

configure: exit 1


-Original Message-
Now, if you use the latest trunk, you can use the new MX BTL which
provide support for shared memory and self communications. Add "--mca
pml ob1 --mca btl mx --mca btl_mx_shared_mem 1 --mca btl_mx_self 1"  
in order to activate these new features. If you have a 10G cards, I
suggest you add "--mca btl_mx_flags 2" as well.

   Thanks,
 george.

PS: Is there any way you can attach to the processes with gdb ? I would
like to see the backtrace as showed by gdb in order to be able to figure
out what's wrong there.




Re: [OMPI users] Ompi failing on mx only

2007-01-05 Thread Grobe, Gary L. (JSC-EV)[ESCG]
Ok, sorry about that last. I think someone just bumped up the required
version of Automake. 

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Grobe, Gary L. (JSC-EV)[ESCG]
Sent: Friday, January 05, 2007 2:29 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

This is just an FYI of the Jan 5th snapshot.

I'll send a backtrace of the processes as soon as I get a b3 running.
Between my filtered webdav svn access problems and the latest nightly
snapshots, my builds are currently failing where the same config lines
worked on previous snapshots ...

$./configure --prefix=/usr/local/openmpi-1.2b3r13006 --with-mx=/opt/mx
--with-mx-libdir=/opt/mx/lib ...

*** GNU libltdl setup
configure: OMPI configuring in opal/libltdl
configure: running /bin/sh './configure'
'--prefix=/usr/local/openmpi-1.2b3r13006' '--with-mx=/opt/mx'
'--with-mx-libdir=/opt/mx/lib' 'F77=ifort' --enable-ltdl-convenience
--disable-ltdl-install --enable-shared --disable-static
--cache-file=/dev/null --srcdir=.
checking for a BSD-compatible install... /usr/bin/install -c checking
whether build environment is sane... configure: error: newly created
file is older than distributed files!
Check your system clock
configure: /bin/sh './configure' *failed* for opal/libltdl
configure: error: Failed to build GNU libltdl.  This usually means that
something is incorrectly setup with your environment.  There may be
useful information in opal/libltdl/config.log.  You can also disable GNU
libltdl (which will disable dynamic shared object loading) by
configuring with --disable-dlopen.

 end of output of /opal/libltdl/config.log 

## --- ##
## confdefs.h. ##
## --- ##

#define PACKAGE_BUGREPORT "bug-libt...@gnu.org"
#define PACKAGE_NAME "libltdl"
#define PACKAGE_STRING "libltdl 2.1a"
#define PACKAGE_TARNAME "libltdl"
#define PACKAGE_VERSION "2.1a"

configure: exit 1


-Original Message-
Now, if you use the latest trunk, you can use the new MX BTL which
provide support for shared memory and self communications. Add "--mca
pml ob1 --mca btl mx --mca btl_mx_shared_mem 1 --mca btl_mx_self 1"  
in order to activate these new features. If you have a 10G cards, I
suggest you add "--mca btl_mx_flags 2" as well.

   Thanks,
 george.

PS: Is there any way you can attach to the processes with gdb ? I would
like to see the backtrace as showed by gdb in order to be able to figure
out what's wrong there.





Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread Grobe, Gary L. (JSC-EV)[ESCG]
I was wondering if someone could send me the HACKING file so I can do a
bit more with debugging on the snapshots. Our web proxy has webdav
methods turned off (request methods fail) so that I can't get to the
latest of the svn repos.

> Second thing. From one of your previous emails, I see that MX 
> is configured with 4 instance by node. Your running with 
> exactly 4 processes on the first 2 nodes. Weirds things might 
> happens ...

Just curious about this comment. Are you referring to oversubscribing?
We run 4 processes on each node because we have 2 dual-core CPUs on
each node. Am I not understanding processor counts correctly?

> PS: Is there any way you can attach to the processes with gdb 
> ? I would like to see the backtrace as showed by gdb in order 
> to be able to figure out what's wrong there.

When I can get more detailed dbg, I'll send. Though I'm not clear on
what executable is being searched for below.

$ mpirun -dbg=gdb --prefix /usr/local/openmpi-1.2b3r13030 -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca pml cm
--mca mtl mx ./cpi

[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] [0,0,0] setting up session dir with
[juggernaut:14949]  universe default-universe-14949
[juggernaut:14949]  user ggrobe
[juggernaut:14949]  host juggernaut
[juggernaut:14949]  jobid 0
[juggernaut:14949]  procid 0
[juggernaut:14949] procdir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/0/0
[juggernaut:14949] jobdir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/0
[juggernaut:14949] unidir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949
[juggernaut:14949] top: openmpi-sessions-ggrobe@juggernaut_0
[juggernaut:14949] tmp: /tmp
[juggernaut:14949] [0,0,0] contact_file
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/univers
e-setup.txt
[juggernaut:14949] [0,0,0] wrote setup file
[juggernaut:14949] pls:rsh: local csh: 0, local sh: 1
[juggernaut:14949] pls:rsh: assuming same remote shell as local shell
[juggernaut:14949] pls:rsh: remote csh: 0, remote sh: 1
[juggernaut:14949] pls:rsh: final template argv:
[juggernaut:14949] pls:rsh: /usr/bin/ssh  orted --debug
--bootproxy 1 --name  --num_procs 2 --vpid_start 0 --nodename
 --universe ggrobe@juggernaut:default-universe-14949
--nsreplica "0.0.0;tcp://192.168.2.10:43121" --gprreplica
"0.0.0;tcp://192.168.2.10:43121"
[juggernaut:14949] pls:rsh: launching on node juggernaut
[juggernaut:14949] pls:rsh: juggernaut is a LOCAL node
[juggernaut:14949] pls:rsh: changing to directory /home/ggrobe
[juggernaut:14949] pls:rsh: executing: orted --debug --bootproxy 1
--name 0.0.1 --num_procs 2 --vpid_start 0 --nodename juggernaut
--universe ggrobe@juggernaut:default-universe-14949 --nsreplica
"0.0.0;tcp://192.168.2.10:43121" --gprreplica
"0.0.0;tcp://192.168.2.10:43121"
[juggernaut:14950] [0,0,1] setting up session dir with
[juggernaut:14950]  universe default-universe-14949
[juggernaut:14950]  user ggrobe
[juggernaut:14950]  host juggernaut
[juggernaut:14950]  jobid 0
[juggernaut:14950]  procid 1
[juggernaut:14950] procdir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/0/1
[juggernaut:14950] jobdir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/0
[juggernaut:14950] unidir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949
[juggernaut:14950] top: openmpi-sessions-ggrobe@juggernaut_0
[juggernaut:14950] tmp: /tmp

--
Failed to find the following executable:

Host:   juggernaut
Executable: -b

Cannot continue.

--
[juggernaut:14950] [0,0,1] ORTE_ERROR_LOG: Fatal in file
odls_default_module.c at line 1193
[juggernaut:14949] spawn: in job_state_callback(jobid = 1, state = 0x80)
[juggernaut:14950] [0,0,1] ORTE_ERROR_LOG: Fatal in file orted.c at line
575
[juggernaut:14950] sess_dir_finalize: job session dir not empty -
leaving
[juggernaut:14950] sess_dir_finalize: proc session dir not empty -
leaving
[juggernaut:14949] sess_dir_finalize: proc session dir not empty -
leaving






Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread Jeff Squyres

On Jan 8, 2007, at 2:52 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:

I was wondering if someone could send me the HACKING file so I can  
do a

bit more with debugging on the snapshots. Our web proxy has webdav
methods turned off (request methods fail) so that I can't get to the
latest of the svn repos.


Bummer.  :-(  You are definitely falling victim to the fact that our
nightly snapshots have been less-than-stable recently.  Sorry [again]  
about that!


FWIW, there's two ways to browse the source in the repository without  
an SVN checkout:


- you can just point a normal web browser to our SVN repository (I'm  
pretty sure that doesn't use DAV, but I'm not 100% sure...), e.g.:  
https://svn.open-mpi.org/svn/ompi/trunk/HACKING


- you can use our Trac SVN browser, e.g.:
https://svn.open-mpi.org/trac/ompi/browser/trunk/HACKING (there's a link
at the bottom to download each file without all the HTML markup).



Second thing. From one of your previous emails, I see that MX
is configured with 4 instance by node. Your running with
exactly 4 processes on the first 2 nodes. Weirds things might
happens ...


Just curious about this comment. Are you referring to over  
subscribing?

We run 4 processes on each node because we have 2 dual core cpu's on
each node. Am I not understanding processor counts correctly?


I'll have to defer to Reese on this one...


PS: Is there any way you can attach to the processes with gdb
? I would like to see the backtrace as showed by gdb in order
to be able to figure out what's wrong there.


When I can get more detailed dbg, I'll send. Though I'm not clear on
what executable is being searched for below.

$ mpirun -dbg=gdb --prefix /usr/local/openmpi-1.2b3r13030 -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca  
pml cm

--mca mtl mx ./cpi


FWIW, note that "-dbg" is not a recognized Open MPI mpirun command  
line switch -- after all the debugging information, Open MPI finally  
gets to telling you:


-- 
--

Failed to find the following executable:

Host:   juggernaut
Executable: -b

Cannot continue.
-- 
--


So nothing actually ran in this instance.

Our debugging entries on the FAQ
(http://www.open-mpi.org/faq/?category=debugging) are fairly inadequate
at the moment, but if you're running in an ssh environment, you generally
have 2 choices to attach serial debuggers:


1. Put a loop in your app that pauses until you can attach a  
debugger.  Perhaps something like this:


{ int i = 0; printf("pid %d ready\n", getpid()); while (0 == i) sleep(5); }


Kludgey and horrible, but it works.
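
Once a rank prints its pid, attaching from another shell is then just (the
binary path and pid here are placeholders; node-2 is where the crashes occur):

   ssh node-2
   gdb /path/to/cpi <pid>       # pid printed by the loop above
   (gdb) up                     # repeat until you are in the frame that has i
   (gdb) set variable i = 1     # let the while loop exit
   (gdb) continue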

2. mpirun an xterm with gdb.  You'll need to specifically use the -d  
option to mpirun in order to keep the ssh sessions alive to relay  
back your X information, or separately setup your X channels yourself  
(e.g., if you're on a closed network, it may be acceptable to "xhost  
+" the nodes that you're running on and just manually setup the  
DISPLAY variable for the target nodes, perhaps via the -x option to  
mpirun) -- in which case you would not need to use the -d option to  
mpirun.
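
A sketch of that second approach (the xterm/gdb line is illustrative; the
prefix, hostfile, and mca settings are reused from earlier runs in this
thread):

   mpirun -d --prefix /usr/local/openmpi-1.2b3r13030 \
       -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 6 \
       --mca pml cm --mca mtl mx xterm -e gdb ./cpi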


Make sense?

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread Grobe, Gary L. (JSC-EV)[ESCG]
> >> PS: Is there any way you can attach to the processes with gdb ? I 
> >> would like to see the backtrace as showed by gdb in order 
> to be able 
> >> to figure out what's wrong there.
> >
> > When I can get more detailed dbg, I'll send. Though I'm not 
> clear on 
> > what executable is being searched for below.
> >
> > $ mpirun -dbg=gdb --prefix /usr/local/openmpi-1.2b3r13030 -x 
> > LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 
> --mca pml 
> > cm --mca mtl mx ./cpi
> 
> FWIW, note that "-dbg" is not a recognized Open MPI mpirun 
> command line switch -- after all the debugging information, 
> Open MPI finally gets to telling you:
> 

Sorry, wrong MPI, ok ... Fwiw, here's a working crash w/ just the -d
option. The problem I'm trying to get to right now is how to debug the 2nd
process on the 2nd node, since that's where the crash is always
happening. One process past the 1st node works fine (5 procs w/ 4 per
node), but when a second process on the 2nd node starts, or anything more
than that, the crashes occur.

$ mpirun -d --prefix /usr/local/openmpi-1.2b3r13030 -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 6 --mca pml cm
--mca mtl mx ./cpi > dbg.out 2>&1

[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] [0,0,0] setting up session dir with
[juggernaut:15087]  universe default-universe-15087
[juggernaut:15087]  user ggrobe
[juggernaut:15087]  host juggernaut
[juggernaut:15087]  jobid 0
[juggernaut:15087]  procid 0
[juggernaut:15087] procdir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087/0/0
[juggernaut:15087] jobdir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087/0
[juggernaut:15087] unidir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087
[juggernaut:15087] top: openmpi-sessions-ggrobe@juggernaut_0
[juggernaut:15087] tmp: /tmp
[juggernaut:15087] [0,0,0] contact_file
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087/univers
e-setup.txt
[juggernaut:15087] [0,0,0] wrote setup file
[juggernaut:15087] pls:rsh: local csh: 0, local sh: 1
[juggernaut:15087] pls:rsh: assuming same remote shell as local shell
[juggernaut:15087] pls:rsh: remote csh: 0, remote sh: 1
[juggernaut:15087] pls:rsh: final template argv:
[juggernaut:15087] pls:rsh: /usr/bin/ssh  orted --debug
--bootproxy 1 --name  --num_procs 3 --vpid_start 0 --nodename
 --universe ggrobe@juggernaut:default-universe-15087
--nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica
"0.0.0;tcp://192.168.2.10:52099"
[juggernaut:15087] pls:rsh: launching on node node-1
[juggernaut:15087] pls:rsh: node-1 is a REMOTE node
[juggernaut:15087] pls:rsh: executing: /usr/bin/ssh node-1
PATH=/usr/local/openmpi-1.2b3r13030/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/openmpi-1.2b3r13030/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ; /usr/local/openmpi-1.2b3r13030/bin/orted
--debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0
--nodename node-1 --universe ggrobe@juggernaut:default-universe-15087
--nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica
"0.0.0;tcp://192.168.2.10:52099"
[juggernaut:15087] pls:rsh: launching on node node-2
[juggernaut:15087] pls:rsh: node-2 is a REMOTE node
[juggernaut:15087] pls:rsh: executing: /usr/bin/ssh node-2
PATH=/usr/local/openmpi-1.2b3r13030/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/openmpi-1.2b3r13030/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ; /usr/local/openmpi-1.2b3r13030/bin/orted
--debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0
--nodename node-2 --universe ggrobe@juggernaut:default-universe-15087
--nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica
"0.0.0;tcp://192.168.2.10:52099"
[node-2:11499] [0,0,2] setting up session dir with
[node-2:11499]  universe default-universe-15087
[node-2:11499]  user ggrobe
[node-2:11499]  host node-2
[node-2:11499]  jobid 0
[node-2:11499]  procid 2
[node-1:10307] procdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/0/1
[node-1:10307] jobdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/0
[node-1:10307] unidir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087
[node-1:10307] top: openmpi-sessions-ggrobe@node-1_0
[node-2:11499] procdir:
/tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087/0/2
[node-2:11499] jobdir:
/tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087/0
[node-2:11499] unidir:
/tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087
[node-2:11499] top: openmpi-sessions-ggrobe@node-2_0
[node-2:11499] tmp: /tm

Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread Adrian Knoth
On Mon, Jan 08, 2007 at 03:07:57PM -0500, Jeff Squyres wrote:

> if you're running in an ssh environment, you generally have 2 choices to  
> attach serial debuggers:
> 
> 1. Put a loop in your app that pauses until you can attach a  
> debugger.  Perhaps something like this:
> 
> { int i = 0; printf("pid %d ready\n", getpid()); while (0 == i) sleep  
> (5); }
> 
> Kludgey and horrible, but it works.
> 
> 2. mpirun an xterm with gdb. 

If one of the participating hosts is the localhost and it's
sufficient to debug only one process, it's even possible to
call gdb directly:

adi@ipc654~$ mpirun -np 2 -host ipc654,dana \
    sh -c 'if [[ $(hostname) == "ipc654" ]]; then gdb test/vm/ring; \
           else test/vm/ring; fi'


(also works great with ddd).
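
(The xterm variant Jeff mentions above usually looks roughly like the
sketch below; this is only an illustration, and it assumes each node can
open X windows back to your display, e.g. via "ssh -X" or a properly
exported DISPLAY, which is not always the case in an ssh-only cluster:)

mpirun -np 2 --hostfile ./h1-3 -x DISPLAY xterm -e gdb ./cpi

(Each rank then comes up in its own xterm under gdb, so breakpoints can be
set before typing "run".)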


-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de


Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread Grobe, Gary L. (JSC-EV)[ESCG]
> >> PS: Is there any way you can attach to the processes with gdb? I
> >> would like to see the backtrace as shown by gdb in order to be able
> >> to figure out what's wrong there.

I found out that all processes on the 2nd node crash, so I just put a
30-second wait before MPI_Init in order to attach gdb and go from there.

The code in cpi starts off as follows (to show where the SIGTERM below is
coming from).

MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Get_processor_name(processor_name,&namelen);
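
(Not the exact code used, but the 30-second wait amounts to something like
the sketch below; the printf is only added here to make it easy to find
the pid for "gdb -p", and "cpi" is just the example from this thread:)

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int numprocs, myid, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    /* Window for attaching gdb before MPI_Init is entered. */
    printf("pid %d waiting 30s for debugger attach\n", (int)getpid());
    fflush(stdout);
    sleep(30);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);
    /* ... rest of cpi ... */
    MPI_Finalize();
    return 0;
}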

---

Attaching to process 11856
Reading symbols from /home/ggrobe/Projects/ompi/cpi/cpi...done.
Using host libthread_db library "/lib/libthread_db.so.1".
Reading symbols from
/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0...done.
Loaded symbols for /usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0
Reading symbols from
/usr/local/openmpi-1.2b3r13030/lib/libopen-rte.so.0...done.
Loaded symbols for /usr/local/openmpi-1.2b3r13030/lib/libopen-rte.so.0
Reading symbols from
/usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0...done.
Loaded symbols for /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib64/libutil.so.1...done.
Loaded symbols for /lib/libutil.so.1
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 46974166086512 (LWP 11856)]
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x2ab90661e880 in nanosleep () from /lib/libc.so.6
(gdb) break MPI_Init
Breakpoint 1 at 0x2ab905c0c880
(gdb) break MPI_Comm_size
Breakpoint 2 at 0x2ab905c01af0
(gdb) continue
Continuing.
[Switching to Thread 46974166086512 (LWP 11856)]

Breakpoint 1, 0x2ab905c0c880 in PMPI_Init ()
   from /usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0
(gdb) n
Single stepping until exit from function PMPI_Init, 
which has no line number information.
[New Thread 1082132816 (LWP 11862)]

Program received signal SIGTERM, Terminated.
0x2ab906643f47 in ioctl () from /lib/libc.so.6
(gdb) backtrace
#0  0x2ab906643f47 in ioctl () from /lib/libc.so.6
Cannot access memory at address 0x7fffa50102f8
---

Does this help in any way?



Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread George Bosilca


Not really. This is the backtrace of the process that gets killed because
mpirun detects that the other one died ... What I need is the backtrace
of the process which generates the segfault. Second, in order to understand
the backtrace, it's better to run a debug version of Open MPI. Without
the debug version we only see the address where the fault occurs, without
having access to the line number ...
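
(For reference, a debug build is normally obtained by reconfiguring Open
MPI with its --enable-debug option, roughly as sketched below; the source
directory and install prefix here are just placeholders:)

cd openmpi-1.2b3
./configure --prefix=/usr/local/openmpi-1.2b3-debug --enable-debug
make all install

(Then re-run cpi with LD_LIBRARY_PATH pointing at that installation so gdb
can resolve line numbers inside libmpi.)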


  Thanks,
george.




Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread Reese Faucette

> Second thing. From one of your previous emails, I see that MX
> is configured with 4 instances per node. You're running with
> exactly 4 processes on the first 2 nodes. Weird things might
> happen ...


4 processes per node will be just fine.  This is not like GM where the 4 
includes some "reserved" ports.

-reese




Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread George Bosilca


On Jan 8, 2007, at 9:11 PM, Reese Faucette wrote:


>> Second thing. From one of your previous emails, I see that MX
>> is configured with 4 instances per node. You're running with
>> exactly 4 processes on the first 2 nodes. Weird things might
>> happen ...
>
> 4 processes per node will be just fine.  This is not like GM where
> the 4 includes some "reserved" ports.


Right, that's the maximum number of open MX channels, i.e. processes
that can run on the node using MX. With MX (1.2.0c I think), I get
weird messages if I run a second mpirun quickly after the first one
failed. The myrinet guys, I'm quite sure, can explain why and how.
Somehow, when an application segfaults while the MX port is open,
things are not cleaned up right away. It takes a few seconds (not more
than a minute) to have everything running correctly after that.


  george.



Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread Reese Faucette

> Right, that's the maximum number of open MX channels, i.e. processes
> that can run on the node using MX. With MX (1.2.0c I think), I get
> weird messages if I run a second mpirun quickly after the first one
> failed. The myrinet guys, I'm quite sure, can explain why and how.
> Somehow, when an application segfaults while the MX port is open,
> things are not cleaned up right away. It takes a few seconds (not more
> than a minute) to have everything running correctly after that.


Supposedly I am a "myrinet guy" ;-)  Yeah, the endpoint cleanup stuff could 
take a few seconds after an ungraceful exit.  But, if you're getting some 
behavior that looks like you ought not be getting, please let us know!

-reese
Myricom, Inc.




Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread George Bosilca


On Jan 8, 2007, at 9:34 PM, Reese Faucette wrote:


>> Right, that's the maximum number of open MX channels, i.e. processes
>> that can run on the node using MX. With MX (1.2.0c I think), I get
>> weird messages if I run a second mpirun quickly after the first one
>> failed. The myrinet guys, I'm quite sure, can explain why and how.
>> Somehow, when an application segfaults while the MX port is open,
>> things are not cleaned up right away. It takes a few seconds (not more
>> than a minute) to have everything running correctly after that.
>
> Supposedly I am a "myrinet guy" ;-)  Yeah, the endpoint cleanup stuff
> could take a few seconds after an ungraceful exit.  But, if you're
> getting some behavior that looks like you ought not be getting, please
> let us know!


I think what I get makes sense. If I loop in a script starting mpiruns
and one of the runs segfaults, the next one is usually unable to open
the MX endpoints. That happens only if I run 4 processes per node, where
4 is the number of instances as reported by mx_info. If I put a sleep of
30 seconds between my runs, then everything runs just fine.
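
(In script form, the workaround described above is roughly the following
sketch; the iteration count, np value, hostfile, and binary are just the
ones used earlier in this thread, for illustration:)

#!/bin/sh
# Give MX time to clean up endpoints after an ungraceful exit
# before launching the next run.
for i in 1 2 3; do
    mpirun --hostfile ./h1-3 -np 6 --mca pml cm --mca mtl mx ./cpi
    sleep 30
done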


  george.








Re: [OMPI users] Ompi failing on mx only

2007-01-09 Thread Grobe, Gary L. (JSC-EV)[ESCG]
> I need is the backtrace of the process which generates the
> segfault. Second, in order to understand the backtrace, it's
> better to run a debug version of Open MPI. Without the
> debug version we only see the address where the fault occurs,
> without having access to the line number ...

How about this: this is the section I was stepping through in order
to get the first error I usually run into ... "mx_connect fail for
node-1:0 with key  (error Endpoint closed or not connectable!)"

// gdb output

Breakpoint 1, 0x2ac856bd92e0 in opal_progress ()
   from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
(gdb) s
Single stepping until exit from function opal_progress, 
which has no line number information.
0x2ac857361540 in sched_yield () from /lib/libc.so.6
(gdb) s
Single stepping until exit from function sched_yield, 
which has no line number information.
opal_condition_wait (c=0x5098e0, m=0x5098a0) at condition.h:80
80  while (c->c_signaled == 0) {
(gdb) s
81  opal_progress();
(gdb) s

Breakpoint 1, 0x2ac856bd92e0 in opal_progress ()
   from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
(gdb) s
Single stepping until exit from function opal_progress, 
which has no line number information.
0x2ac857361540 in sched_yield () from /lib/libc.so.6
(gdb) backtrace
#0  0x2ac857361540 in sched_yield () from /lib/libc.so.6
#1  0x00402f60 in opal_condition_wait (c=0x5098e0, m=0x5098a0)
at condition.h:81
#2  0x00402b3c in orterun (argc=17, argv=0x7fff54151088)
at orterun.c:427
#3  0x00402713 in main (argc=17, argv=0x7fff54151088) at
main.c:13

--- This is the mpirun output as I was stepping through it. At the end
of this is the error that the backtrace above shows.

[node-2:11909] top: openmpi-sessions-ggrobe@node-2_0
[node-2:11909] tmp: /tmp
[node-1:10719] procdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1/0
[node-1:10719] jobdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1
[node-1:10719] unidir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414
[node-1:10719] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10719] tmp: /tmp
[juggernaut:17414] spawn: in job_state_callback(jobid = 1, state = 0x4)
[juggernaut:17414] Info: Setting up debugger process table for
applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 6
  MPIR_proctable:
(i, host, exe, pid) = (0, node-1,
/home/ggrobe/Projects/ompi/cpi/./cpi, 10719)
(i, host, exe, pid) = (1, node-1,
/home/ggrobe/Projects/ompi/cpi/./cpi, 10720)
(i, host, exe, pid) = (2, node-1,
/home/ggrobe/Projects/ompi/cpi/./cpi, 10721)
(i, host, exe, pid) = (3, node-1,
/home/ggrobe/Projects/ompi/cpi/./cpi, 10722)
(i, host, exe, pid) = (4, node-2,
/home/ggrobe/Projects/ompi/cpi/./cpi, 11908)
(i, host, exe, pid) = (5, node-2,
/home/ggrobe/Projects/ompi/cpi/./cpi, 11909)
[node-1:10718] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10718] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10721] procdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1/2
[node-1:10721] jobdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1
[node-1:10721] unidir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414
[node-1:10721] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10721] tmp: /tmp
[node-1:10720] mx_connect fail for node-1:0 with key  (error
Endpoint closed or not connectable!)
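
(One quick sanity check when chasing "Endpoint closed or not connectable"
errors is to compare the instance/endpoint count mx_info reports on each
node against the number of processes being placed there, e.g. with the
sketch below; the exact output fields vary with the MX release:)

for n in node-1 node-2 node-3; do
    echo "== $n =="
    ssh $n mx_info
done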