Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Grobe, Gary L. (JSC-EV)[ESCG]
I'm losing it today, I just now noticed I sent mx_info for the wrong
nodes ...


// node-1
$ mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30
14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0:  224.9 MHz LANai, 133.3 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address:    00:60:dd:47:ab:c9
Product code:   M3F-PCIXD-2 V2.2
Part number:    09-03034
Serial number:  299207
Mapper:         46:4d:53:4d:41:50, version = 0x000c, configured
Mapped hosts:   16

                                                   ROUTE COUNT
INDEX    MAC ADDRESS        HOST NAME               P0
-----    -----------        ---------               --
   0) 00:60:dd:47:ab:c9 node-1:0  1,1
   1) 00:60:dd:47:c2:a7 juggernaut:0  5,3
   2) 00:60:dd:47:ab:c8 node-2:0  6,3
   3) 00:60:dd:47:ab:ca node-3:0  7,3
   4) 00:60:dd:47:bf:65 node-7:0  7,3
   5) 00:60:dd:47:c2:e1 node-8:0  7,3
   6) 00:60:dd:47:c0:c1 node-9:0  6,3
   7) 00:60:dd:47:c0:e5 node-13:0 1,1
   8) 00:60:dd:47:c2:91 node-14:0 7,3
   9) 00:60:dd:47:c0:b2 node-15:0 7,3
  10) 00:60:dd:47:bf:f5 node-19:0 6,3
  11) 00:60:dd:47:c0:b1 node-20:0 8,3
  12) 00:60:dd:47:c0:f8 node-21:0 5,3
  13) 00:60:dd:47:c0:8a node-25:0 6,3
  14) 00:60:dd:47:c0:c2 node-27:0 7,3
  15) 00:60:dd:47:c2:e0 node-26:0 6,3

// node-2
$ mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30
14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0:  224.9 MHz LANai, 133.0 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address:    00:60:dd:47:ab:c8
Product code:   M3F-PCIXD-2 V2.2
Part number:    09-03034
Serial number:  299208
Mapper:         46:4d:53:4d:41:50, version = 0x000c, configured
Mapped hosts:   16

                                                   ROUTE COUNT
INDEX    MAC ADDRESS        HOST NAME               P0
-----    -----------        ---------               --
   0) 00:60:dd:47:ab:c8 node-2:0  1,1
   1) 00:60:dd:47:ab:c9 node-1:0  6,3
   2) 00:60:dd:47:c2:a7 juggernaut:0  5,3
   3) 00:60:dd:47:ab:ca node-3:0  6,3
   4) 00:60:dd:47:bf:65 node-7:0  5,3
   5) 00:60:dd:47:c2:e1 node-8:0  6,3
   6) 00:60:dd:47:c0:c1 node-9:0  5,3
   7) 00:60:dd:47:c0:e5 node-13:0 6,3
   8) 00:60:dd:47:c2:91 node-14:0 8,3
   9) 00:60:dd:47:c0:b2 node-15:0 1,1
  10) 00:60:dd:47:bf:f5 node-19:0 5,3
  11) 00:60:dd:47:c0:f8 node-21:0 5,3
  12) 00:60:dd:47:c0:8a node-25:0 5,3
  13) 00:60:dd:47:c0:c2 node-27:0 6,3
  14) 00:60:dd:47:c2:e0 node-26:0 5,3
  15) 00:60:dd:47:c0:b1 node-20:0 6,3

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 4:08 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

> I've attached the ompi_info from node-1 and
node-2.

thanks, but i need "mx_info", not "ompi_info" ;-)

> But now that you mention mapper, I take it that's what SEGV_MAPERR 
> might be referring to.

this is an ompi red herring; it has nothing to do with Myrinet mapping,
even though it kinda looks like it.

> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x 
> LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl 
> mx,self ./cpi
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)

A gdb traceback would be interesting on this one.
thanks,
-reese 



Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Grobe, Gary L. (JSC-EV)[ESCG]
Ah, sorry about that ... 

$ ./mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30
14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0:  224.9 MHz LANai, 99.7 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address:    00:60:dd:47:c2:a7
Product code:   M3F-PCIXD-2 V2.2
Part number:    09-03034
Serial number:  291824
Mapper:         46:4d:53:4d:41:50, version = 0x000c, configured
Mapped hosts:   16

                                                   ROUTE COUNT
INDEX    MAC ADDRESS        HOST NAME               P0
-----    -----------        ---------               --
   0) 00:60:dd:47:c2:a7 juggernaut:0  1,1
   1) 00:60:dd:47:ab:c9 node-1:0  6,3
   2) 00:60:dd:47:ab:c8 node-2:0  6,3
   3) 00:60:dd:47:ab:ca node-3:0  6,3
   4) 00:60:dd:47:bf:65 node-7:0  6,3
   5) 00:60:dd:47:c2:e1 node-8:0  6,3
   6) 00:60:dd:47:c0:c1 node-9:0  6,3
   7) 00:60:dd:47:c0:e5 node-13:0 6,3
   8) 00:60:dd:47:c2:91 node-14:0 6,3
   9) 00:60:dd:47:c0:b2 node-15:0 6,3
  10) 00:60:dd:47:bf:f5 node-19:0 1,1
  11) 00:60:dd:47:c0:b1 node-20:0 6,3
  12) 00:60:dd:47:c0:f8 node-21:0 7,3
  13) 00:60:dd:47:c0:8a node-25:0 6,3
  14) 00:60:dd:47:c0:c2 node-27:0 5,3
  15) 00:60:dd:47:c2:e0 node-26:0 5,3

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 4:08 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

> I've attached the ompi_info from node-1 and
node-2.

thanks, but i need "mx_info", not "ompi_info" ;-)

> But now that you mention mapper, I take it that's what SEGV_MAPERR 
> might be referring to.

this is an ompi red herring; it has nothing to do with Myrinet mapping,
even though it kinda looks like it.

> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x 
> LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl 
> mx,self ./cpi
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)

A gdb traceback would be interesting on this one.
thanks,
-reese 





Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Reese Faucette

 As for the MTL, there is a bug in the MX
MTL for v1.2 that has been fixed, but after 1.2b2 ...


oops, i was stupidly assuming he already had that fix.  yes, this is an 
important fix...

-reese




Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Brian W. Barrett
Sorry to jump into the discussion late.  The mx btl does not support  
communication between processes on the same node by itself, so you  
have to include the shared memory transport when using MX.  This will  
eventually be fixed, but likely not for the 1.2 release.  So if you do:


  mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi


It should work much better.  As for the MTL, there is a bug in the MX
MTL for v1.2 that has been fixed, but only after 1.2b2 was cut; it could
cause the random failures you were seeing.  It will work much better after
1.2b3 is released (or, if you are feeling really lucky, you can try
out the 1.2 nightly tarballs).


The MTL is a new feature in v1.2. It is a different communication  
abstraction designed to support interconnects that have matching  
implemented in the lower level library or in hardware (Myrinet/MX,  
Portals, InfiniPath are currently implemented).  The MTL allows us to  
exploit the low latency and asynchronous progress these libraries can  
provide, but does mean multi-nic abilities are reduced.  Further, the  
MTL is not well suited to interconnects like TCP or InfiniBand, so we  
will continue supporting the BTL interface as well.
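
For reference, the ./cpi test exercised throughout this thread is the
classic compute-pi MPI example.  A minimal sketch along those lines,
using only standard MPI C calls (not necessarily the exact source being
run here), looks like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int n = 10000, rank, size, i, namelen;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, t0, err;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &namelen);
    printf("Process %d of %d is on %s\n", rank, size, name);

    t0 = MPI_Wtime();
    /* the collective that hung under 1.1.2 in the original report */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* midpoint rule for the integral of 4/(1+x^2) over [0,1] */
    h = 1.0 / (double)n;
    sum = 0.0;
    for (i = rank + 1; i <= n; i += size) {
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        err = (pi > PI25DT) ? pi - PI25DT : PI25DT - pi;
        printf("pi is approximately %.16f, Error is %.16f\n", pi, err);
        printf("wall clock time = %f\n", MPI_Wtime() - t0);
    }
    MPI_Finalize();
    return 0;
}

Compiled with "mpicc cpi.c -o cpi", it can be launched with any of the
mpirun command lines quoted in this thread.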


Brian


On Jan 2, 2007, at 2:44 PM, Grobe, Gary L. (JSC-EV)[ESCG] wrote:

About the -x, I've been trying it both ways and prefer the latter,
and results for either are the same. But its value is correct.
I've attached the ompi_info from node-1 and node-2. Sorry for not  
zipping them, but they were small and I think I'd have firewall  
issues.


$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
[node-14:19260] mx_connect fail for node-14:0 with key   
(error Endpoint closed or not connectable!)
[node-14:19261] mx_connect fail for node-14:0 with key   
(error Endpoint closed or not connectable!)

...

Is there any info anywhere on the MTL? Anyway, I've run w/ mtl, and
it actually worked once. But now I can't reproduce it and
it's throwing sig 7's, 11's, and 4's depending upon the number of
procs I give it. But now that you mention the mapper, I take it that's
what SEGV_MAPERR might be referring to. I'm looking into the


$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi

Process 4 of 5 is on node-2
Process 0 of 5 is on node-1
Process 1 of 5 is on node-1
Process 2 of 5 is on node-1
Process 3 of 5 is on node-1
pi is approximately 3.1415926544231225, Error is 0.08333294
wall clock time = 0.019305
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b88243862be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on  
signal 1.

4 additional processes aborted (not shown)
Or sometimes I'll get this error, just depending upon the number of  
procs ...


 mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl mx,self ./cpi

Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x2aaab000
[0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
[1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
[2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
[3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
[4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
[5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
[6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
[7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576]
[8] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe3) [0x2b9b82a2d0a3]
[9] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(ompi_mpi_init+0x697) [0x2b9b7f77be07]
[10] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(MPI_Init+0x83) [0x2b9b7f79c943]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b9b8013a134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
Process 4 of 7 is on node-2
Process 5 of 7 is on node-2
Process 6 of 7 is on node-2
Process 0 of 7 is on node-1
Process 1 of 7 is on node-1
Process 2 of 7 is on node-1
Process 3 of 7 is on node-1
pi is approximately 3.1415926544231239, Error is 0.0807
wall clock time = 0.009331
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b4ba33652be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b8685aba2be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b304ffbe2be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on  
signal 1.

6 additional processes aborted (not shown)

Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Reese Faucette

> I've attached the ompi_info from node-1 and node-2.

thanks, but i need "mx_info", not "ompi_info" ;-)

But now that you mention mapper, I take it that's what SEGV_MAPERR might 
be referring to.


this is an ompi red herring; it has nothing to do with Myrinet mapping, even 
though it kinda looks like it.


$ mpirun --prefix /usr/local/openmpi-1.2b2 -x 
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl 
mx,self ./cpi

Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)


A gdb traceback would be interesting on this one.
thanks,
-reese 





Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Grobe, Gary L. (JSC-EV)[ESCG]
About the -x, I've been trying it both ways and prefer the latter, and
results for either are the same. But its value is correct. I've
attached the ompi_info from node-1 and node-2. Sorry for not zipping
them, but they were small and I think I'd have firewall issues.
 
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile
./h13-15 -np 6 --mca pml cm ./cpi 
[node-14:19260] mx_connect fail for node-14:0 with key  (error
Endpoint closed or not connectable!)
[node-14:19261] mx_connect fail for node-14:0 with key  (error
Endpoint closed or not connectable!)
...
 
Is there any info anywhere on the MTL? Anyway, I've run w/ mtl, and
it actually worked once. But now I can't reproduce it and it's
throwing sig 7's, 11's, and 4's depending upon the number of procs I
give it. But now that you mention the mapper, I take it that's what
SEGV_MAPERR might be referring to. I'm looking into the
 
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl
mx,self ./cpi 
Process 4 of 5 is on node-2
Process 0 of 5 is on node-1
Process 1 of 5 is on node-1
Process 2 of 5 is on node-1
Process 3 of 5 is on node-1
pi is approximately 3.1415926544231225, Error is 0.08333294
wall clock time = 0.019305
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b88243862be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on
signal 1.
4 additional processes aborted (not shown)

Or sometimes I'll get this error, just depending upon the number of
procs ...
 
 mpirun --prefix /usr/local/openmpi-1.2b2 -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl
mx,self ./cpi 
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x2aaab000
[0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
[1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
[2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
[3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
[4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
[5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
[6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
[7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576]
[8] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe3) [0x2b9b82a2d0a3]
[9] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(ompi_mpi_init+0x697) [0x2b9b7f77be07]
[10] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(MPI_Init+0x83) [0x2b9b7f79c943]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b9b8013a134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
Process 4 of 7 is on node-2
Process 5 of 7 is on node-2
Process 6 of 7 is on node-2
Process 0 of 7 is on node-1
Process 1 of 7 is on node-1
Process 2 of 7 is on node-1
Process 3 of 7 is on node-1
pi is approximately 3.1415926544231239, Error is 0.0807
wall clock time = 0.009331
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b4ba33652be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b8685aba2be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b304ffbe2be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on
signal 1.
6 additional processes aborted (not shown)

 
Ok, so I take it one is down. Would this be the cause for all the
different errors I'm seeing?
 
$ fm_status 
FMS Fabric status
 
17  hosts known
16  FMAs found
3   un-ACKed alerts
Mapping is complete, last map generated by node-20
Database generation not yet complete.


 


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 2:52 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only


Hi, Gary-
This looks like a config problem, and not a code problem yet.  Could you
send the output of mx_info from node-1 and from node-2?  Also, forgive
me counter-asking a possibly dumb OMPI question, but is "-x
LD_LIBRARY_PATH" really what you want, as opposed to "-x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" ?  (I would not be surprised if not
specifying a value defaults to this behavior, but have to ask).
 
Also, have you tried MX MTL as opposed to BTL?  --mca pml cm --mca mtl
mx,self  (it looks like you did)
 
"[node-2:10464] mx_connect fail for node-2:0 with key  " makes
it look like your fabric may not be fully mapped or that you may have a
down link.
 
thanks,
-reese
Myricom, Inc.




I was initially using 1.1.2 and moved to 1.2b2 because of a hang
on MPI_Bcast() which 1.2b2 reports to fix, and seemed to have done so.

Re: [OMPI users] orted: command not found

2007-01-02 Thread jcolmenares
I had configured the hostfile located at
~prefix/etc/openmpi-default-hostfile.

I copied the file to bernie-3, and it worked...

Now, at the cluster I was working on at the Universidad de Los Andes
(Venezuela) - I decided to install MPI on three machines I was able to put
together as a personal project - all I had to do was compile and run my
applications; that is, I never copied any file to any other machine...
now I had to. I'm sorry if it was obvious and made you guys lose some
time, but why didn't I have to copy any files on the cluster, while now I
must do so?

Thanks for your patience!

Jose



Re: [OMPI users] orted: command not found

2007-01-02 Thread Grobe, Gary L. (JSC-EV)[ESCG]
I'm just curious, maybe I missed something in a past post of this
thread, but ... Are these nodes diskless? If so, then you will have to
make sure that these same paths are exported to the diskless nodes and
handle non-interactive sessions as well as the init shell scripts
properly. It's easiest if you export the same account files and
executable/lib paths across all nodes.

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Gurhan Ozen
Sent: Tuesday, January 02, 2007 2:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] orted: command not found

On 1/2/07, Gurhan Ozen  wrote:
> On 1/2/07, jcolmena...@ula.ve  wrote:
> > > First you should make sure that PATH and LD_LIBRARY_PATH are 
> > > defined in the section of your .bashrc file that get parsed for 
> > > non interactive sessions. Run "mpirun -np 1 printenv" and check if

> > > PATH and LD_LIBRARY_PATH have the values you expect.
> >
> > in fact they do:
> >
> > bernie@bernie-1:~/proyecto$ mpirun -np 1 printenv
> > SHELL=/bin/bash
> > SSH_CLIENT=192.168.1.142 4109 22
> > USER=bernie
> > LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:
> > MAIL=/var/mail/bernie
> > PATH=/usr/local/openmpi/bin:/usr/local/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games
> > PWD=/home/bernie
> > LANG=en_US.UTF-8
> > HISTCONTROL=ignoredups
> > SHLVL=1
> > HOME=/home/bernie
> > MPI_DIR=/usr/local/openmpi
> > LOGNAME=bernie
> > SSH_CONNECTION=192.168.1.142 4109 192.168.1.113 22
> > LESSOPEN=| /usr/bin/lesspipe %s
> > LESSCLOSE=/usr/bin/lesspipe %s %s
> > _=/usr/local/openmpi/bin/orted
> > OMPI_MCA_universe=bernie@bernie-1:default-universe
> > OMPI_MCA_ns_nds=env
> > OMPI_MCA_ns_nds_vpid_start=0
> > OMPI_MCA_ns_nds_num_procs=1
> > OMPI_MCA_mpi_paffinity_processor=0
> > OMPI_MCA_ns_replica_uri=0.0.0;tcp://192.168.1.142:4775
> > OMPI_MCA_gpr_replica_uri=0.0.0;tcp://192.168.1.142:4775
> > OMPI_MCA_orte_base_nodename=192.168.1.113
> > OMPI_MCA_ns_nds_cellid=0
> > OMPI_MCA_ns_nds_jobid=1
> > OMPI_MCA_ns_nds_vpid=0
> >
> >
> > > For your second question you should give the path to your 
> > > prueba.bin executable. I'll do something like "mpirun --prefix 
> > > /usr/local/ openmpi -np 2 ./prueba.bin". The reason is that 
> > > usually "." is not in the PATH.
> > >
> >
> > bernie@bernie-1:~/proyecto$ mpirun --prefix /usr/local/openmpi -np 2

> > ./prueba.bin
> > 
> > --
> > Failed to find or execute the following executable:
> >
> > Host:   bernie-3
> > Executable: ./prueba.bin
> >
> > Cannot continue.
> > 
> > --
> >
> > and the file IS there:
> >
> > bernie@bernie-1:~/proyecto$ ls prueba*
> > prueba.bin  prueba.f90  prueba.f90~
> >

Wait a minute... you are running mpirun from bernie-1 without providing
any hostfile or hostnames, so both processes should be running on the
bernie-1 host, yet the error says it can't find the executable on
bernie-3. Why is this? Make sure that the file exists on
bernie-3 and is executable.

   gurhan

> >
> > I must be missing something pretty silly, but have been looking 
> > around for days to no avail!
> >
>
>What are the permissions on the file? Is it an executable file?
>
>gurhan
>
> > Jose
> >
> > thanks
> >
> >
>



Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Reese Faucette
Hi, Gary-
This looks like a config problem, and not a code problem yet.  Could you send 
the output of mx_info from node-1 and from node-2?  Also, forgive me 
counter-asking a possibly dumb OMPI question, but is "-x LD_LIBRARY_PATH" 
really what you want, as opposed to "-x LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" ?  
(I would not be surprised if not specifying a value defaults to this behavior, 
but have to ask).

Also, have you tried MX MTL as opposed to BTL?  --mca pml cm --mca mtl mx,self  
(it looks like you did)

"[node-2:10464] mx_connect fail for node-2:0 with key  " makes it look 
like your fabric may not be fully mapped or that you may have a down link.

thanks,
-reese
Myricom, Inc.


  I was initially using 1.1.2 and moved to 1.2b2 because of a hang on 
MPI_Bcast() which 1.2b2 reports to fix, and seemed to have done so. My compute 
nodes are 2 dual core xeons on myrinet with mx. The problem is trying to get 
ompi running on mx only. My machine file is as follows .

  node-1 slots=4 max-slots=4 
  node-2 slots=4 max-slots=4 
  node-3 slots=4 max-slots=4 

  'mpirun' with the minimum number of processes in order to get the error ... 
  mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH 
--hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi 

  I don't believe there's anything wrong w/ the hardware, as I can ping on mx 
between this failed node and the master fine. So I tried a different set of 3 
nodes and got the same error; it always fails on the 2nd node of any group of 
nodes I choose.


Re: [OMPI users] orted: command not found

2007-01-02 Thread Gurhan Ozen

On 1/2/07, Gurhan Ozen  wrote:

On 1/2/07, jcolmena...@ula.ve  wrote:
> > First you should make sure that PATH and LD_LIBRARY_PATH are defined
> > in the section of your .bashrc file that get parsed for non
> > interactive sessions. Run "mpirun -np 1 printenv" and check if PATH
> > and LD_LIBRARY_PATH have the values you expect.
>
> in fact they do:
>
> bernie@bernie-1:~/proyecto$ mpirun -np 1 printenv
> SHELL=/bin/bash
> SSH_CLIENT=192.168.1.142 4109 22
> USER=bernie
> LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:
> MAIL=/var/mail/bernie
> 
PATH=/usr/local/openmpi/bin:/usr/local/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games
> PWD=/home/bernie
> LANG=en_US.UTF-8
> HISTCONTROL=ignoredups
> SHLVL=1
> HOME=/home/bernie
> MPI_DIR=/usr/local/openmpi
> LOGNAME=bernie
> SSH_CONNECTION=192.168.1.142 4109 192.168.1.113 22
> LESSOPEN=| /usr/bin/lesspipe %s
> LESSCLOSE=/usr/bin/lesspipe %s %s
> _=/usr/local/openmpi/bin/orted
> OMPI_MCA_universe=bernie@bernie-1:default-universe
> OMPI_MCA_ns_nds=env
> OMPI_MCA_ns_nds_vpid_start=0
> OMPI_MCA_ns_nds_num_procs=1
> OMPI_MCA_mpi_paffinity_processor=0
> OMPI_MCA_ns_replica_uri=0.0.0;tcp://192.168.1.142:4775
> OMPI_MCA_gpr_replica_uri=0.0.0;tcp://192.168.1.142:4775
> OMPI_MCA_orte_base_nodename=192.168.1.113
> OMPI_MCA_ns_nds_cellid=0
> OMPI_MCA_ns_nds_jobid=1
> OMPI_MCA_ns_nds_vpid=0
>
>
> > For your second question you should give the path to your prueba.bin
> > executable. I'll do something like "mpirun --prefix /usr/local/
> > openmpi -np 2 ./prueba.bin". The reason is that usually "." is not in
> > the PATH.
> >
>
> bernie@bernie-1:~/proyecto$ mpirun --prefix /usr/local/openmpi -np 2
> ./prueba.bin
> --
> Failed to find or execute the following executable:
>
> Host:   bernie-3
> Executable: ./prueba.bin
>
> Cannot continue.
> --
>
> and the file IS there:
>
> bernie@bernie-1:~/proyecto$ ls prueba*
> prueba.bin  prueba.f90  prueba.f90~
>


   Wait a minute... you are running mpirun from bernie-1 without
providing any hostfile or hostnames, so both processes should be
running on the bernie-1 host, yet the error says it can't find the
executable on bernie-3. Why is this? Make sure that the file exists on
bernie-3 and is executable.

  gurhan


>
> I must be missing something pretty silly, but have been looking around for
> days to no avail!
>

   What are the permissions on the file? Is it an executable file?

   gurhan

> Jose
>
> thanks
>
>
>



Re: [OMPI users] orted: command not found

2007-01-02 Thread jcolmenares
it is executable

bernie@bernie-1:~/proyecto$ ls -l prueba.bin
-rwxr-xr-x 1 bernie bernie 9619 2007-01-02 12:18 prueba.bin




Re: [OMPI users] orted: command not found

2007-01-02 Thread Gurhan Ozen

On 1/2/07, jcolmena...@ula.ve  wrote:

> First you should make sure that PATH and LD_LIBRARY_PATH are defined
> in the section of your .bashrc file that get parsed for non
> interactive sessions. Run "mpirun -np 1 printenv" and check if PATH
> and LD_LIBRARY_PATH have the values you expect.

in fact they do:

bernie@bernie-1:~/proyecto$ mpirun -np 1 printenv
SHELL=/bin/bash
SSH_CLIENT=192.168.1.142 4109 22
USER=bernie
LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:
MAIL=/var/mail/bernie
PATH=/usr/local/openmpi/bin:/usr/local/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games
PWD=/home/bernie
LANG=en_US.UTF-8
HISTCONTROL=ignoredups
SHLVL=1
HOME=/home/bernie
MPI_DIR=/usr/local/openmpi
LOGNAME=bernie
SSH_CONNECTION=192.168.1.142 4109 192.168.1.113 22
LESSOPEN=| /usr/bin/lesspipe %s
LESSCLOSE=/usr/bin/lesspipe %s %s
_=/usr/local/openmpi/bin/orted
OMPI_MCA_universe=bernie@bernie-1:default-universe
OMPI_MCA_ns_nds=env
OMPI_MCA_ns_nds_vpid_start=0
OMPI_MCA_ns_nds_num_procs=1
OMPI_MCA_mpi_paffinity_processor=0
OMPI_MCA_ns_replica_uri=0.0.0;tcp://192.168.1.142:4775
OMPI_MCA_gpr_replica_uri=0.0.0;tcp://192.168.1.142:4775
OMPI_MCA_orte_base_nodename=192.168.1.113
OMPI_MCA_ns_nds_cellid=0
OMPI_MCA_ns_nds_jobid=1
OMPI_MCA_ns_nds_vpid=0


> For your second question you should give the path to your prueba.bin
> executable. I'll do something like "mpirun --prefix /usr/local/
> openmpi -np 2 ./prueba.bin". The reason is that usually "." is not in
> the PATH.
>

bernie@bernie-1:~/proyecto$ mpirun --prefix /usr/local/openmpi -np 2
./prueba.bin
--
Failed to find or execute the following executable:

Host:   bernie-3
Executable: ./prueba.bin

Cannot continue.
--

and the file IS there:

bernie@bernie-1:~/proyecto$ ls prueba*
prueba.bin  prueba.f90  prueba.f90~


I must be missing something pretty silly, but have been looking around for
days to no avail!



  What are the permissions on the file? Is it an executable file?

  gurhan


Jose

thanks





Re: [OMPI users] orted: command not found

2007-01-02 Thread jcolmenares
> First you should make sure that PATH and LD_LIBRARY_PATH are defined
> in the section of your .bashrc file that get parsed for non
> interactive sessions. Run "mpirun -np 1 printenv" and check if PATH
> and LD_LIBRARY_PATH have the values you expect.

in fact they do:

bernie@bernie-1:~/proyecto$ mpirun -np 1 printenv
SHELL=/bin/bash
SSH_CLIENT=192.168.1.142 4109 22
USER=bernie
LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:
MAIL=/var/mail/bernie
PATH=/usr/local/openmpi/bin:/usr/local/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games
PWD=/home/bernie
LANG=en_US.UTF-8
HISTCONTROL=ignoredups
SHLVL=1
HOME=/home/bernie
MPI_DIR=/usr/local/openmpi
LOGNAME=bernie
SSH_CONNECTION=192.168.1.142 4109 192.168.1.113 22
LESSOPEN=| /usr/bin/lesspipe %s
LESSCLOSE=/usr/bin/lesspipe %s %s
_=/usr/local/openmpi/bin/orted
OMPI_MCA_universe=bernie@bernie-1:default-universe
OMPI_MCA_ns_nds=env
OMPI_MCA_ns_nds_vpid_start=0
OMPI_MCA_ns_nds_num_procs=1
OMPI_MCA_mpi_paffinity_processor=0
OMPI_MCA_ns_replica_uri=0.0.0;tcp://192.168.1.142:4775
OMPI_MCA_gpr_replica_uri=0.0.0;tcp://192.168.1.142:4775
OMPI_MCA_orte_base_nodename=192.168.1.113
OMPI_MCA_ns_nds_cellid=0
OMPI_MCA_ns_nds_jobid=1
OMPI_MCA_ns_nds_vpid=0


> For your second question you should give the path to your prueba.bin
> executable. I'll do something like "mpirun --prefix /usr/local/
> openmpi -np 2 ./prueba.bin". The reason is that usually "." is not in
> the PATH.
>

bernie@bernie-1:~/proyecto$ mpirun --prefix /usr/local/openmpi -np 2
./prueba.bin
--
Failed to find or execute the following executable:

Host:   bernie-3
Executable: ./prueba.bin

Cannot continue.
--

and the file IS there:

bernie@bernie-1:~/proyecto$ ls prueba*
prueba.bin  prueba.f90  prueba.f90~


I must be missing something pretty silly, but have been looking around for
days to no avail!

Jose

thanks




[OMPI users] Ompi failing on mx only

2007-01-02 Thread Grobe, Gary L. (JSC-EV)[ESCG]

I was initially using 1.1.2 and moved to 1.2b2 because of a hang on
MPI_Bcast() which 1.2b2 reports to fix, and seemed to have done so. My
compute nodes are 2 dual core xeons on myrinet with mx. The problem is
trying to get ompi running on mx only. My machine file is as follows ...

node-1 slots=4 max-slots=4
node-2 slots=4 max-slots=4
node-3 slots=4 max-slots=4

'mpirun' with the minimum number of processes in order to get the error
...
mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
--hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi

Results with the following output ...

:~/Projects/ompi/cpi$ mpirun --prefix /usr/local/openmpi-1.2b2 -x
LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi


--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of 
usable components.

--

--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of 
usable components.

--

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 1 with PID 0 on node node-1 exited on
signal 1.

 end of output ---

I get that same error w/ the examples included in the ompi-1.2b2
distrib. However, if I change the mca params as such ...

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
--hostfile ./h1-3 -np 5 --mca pml cm ./cpi

Running up to -np 5 works (one of the processes does get put on the 2nd
node), but running with -np 6 fails with the following ...

[node-2:10464] mx_connect fail for node-2:0 with key  (error
Endpoint closed or not connectable!)
[node-2:10463] mx_connect fail for node-2:0 with key  (error
Endpoint closed or not connectable!)

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on
signal 1.

Re: [OMPI users] orted: command not found

2007-01-02 Thread George Bosilca
First you should make sure that PATH and LD_LIBRARY_PATH are defined
in the section of your .bashrc file that gets parsed for
non-interactive sessions. Run "mpirun -np 1 printenv" and check if PATH
and LD_LIBRARY_PATH have the values you expect.


For your second question, you should give the path to your prueba.bin
executable. I'd do something like "mpirun --prefix /usr/local/openmpi
-np 2 ./prueba.bin". The reason is that usually "." is not in the PATH.


  george.

On Jan 2, 2007, at 11:20 AM, jcolmena...@ula.ve wrote:


I installed openmpi 1.1.2 on two 686 boxes running Ubuntu 6.10.
Followed the instructions given in the FAQ. Nevertheless, I get the
following message:

[bernie-1:05053] ERROR: A daemon on node 192.168.1.113 failed to start as

expected.
[bernie-1:05053] ERROR: There may be more information available from
[bernie-1:05053] ERROR: the remote shell (see above).
[bernie-1:05053] ERROR: The daemon exited unexpectedly with status  
127.


now, I've been browsing the web, including the mailing lists, and it
appears that the error should be that I have not declared the  
variables


export PATH="/usr/local/openmpi/bin:${PATH}"
export LD_LIBRARY_PATH="/usr/local/openmpi/lib:${LD_LIBRARY_PATH}"

at the node, which I have. I have even created all the possible folders
proposed in the FAQ for remote logins, although I'm using bash.

If I do a ssh user@remote_node, I can connect without being asked  
for a
password, and if I type mpif90, I get: "gfortran: no input files",  
wich
should mean that indeed the PATH and LD_LIBRARY_PATH are being  
updated on

the remote logging.

But, if I do:

bash$  mpirun --prefix /usr/local/openmpi -np 2 prueba.bin

the result is:

-- 


Failed to find the following executable:

Host:   bernie-3
Executable: prueba.bin

Cannot continue.
-- 

mpirun noticed that job rank 0 with PID 0 on node "192.168.1.113"  
exited

on signal 4.

I've been looking around, but have not been able to find out what
signal 4 means.

Just in case, I was running an example program which runs fine on my
university cluster. Nevertheless, I decided to run an even simpler one,
which I include, for it may be that the error is there (I definitely hope
not!...)

program test

  use mpi

  implicit none

  integer :: myid,sizze,ierr

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,sizze,ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,myid,ierr)

  print *,"I'm using ",sizze," processors"
  print *,"of wich I'm the number ",myid

  call MPI_FINALIZE(ierr)

end program test


This is the first time I have installed - and used - any parallel
programming library, and I'm doing it as a personal project for a
graduate course, so any help will be greatly appreciated!

Best regards

Jose Colmenares





Re: [OMPI users] segv at runtime with ifort

2007-01-02 Thread Brock Palen

Jeff,

Thanks for the reply, that has fixed the problem.  The code in
question appears to have only been run with MPICH and MPICH
derivatives in the past.


Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985


On Jan 2, 2007, at 9:56 AM, Jeff Squyres wrote:


Brock --

I think your test program is faulty.  For MPI_CART_CREATE, you need
to pass in an array indicating whether the dimensions are periodic or
not -- it is not sufficient to pass in a scalar logical value.

For example, the following program seems to work fine for me:

program cart
include 'mpif.h'

logical periods(1)

call MPI_INIT(ierr);
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_world_id, ierr)
periods(1) = .false.
call MPI_CART_CREATE(MPI_COMM_WORLD, 1, numprocs, periods, .true.,
ITER_COMM, ierr)
call MPI_COMM_RANK(ITER_COMM,myid, ierr)

call MPI_FINALIZE(ierr)
end




On Dec 19, 2006, at 3:22 PM, Brock Palen wrote:


Hello,

We are getting seg faults at run time on MPI_CART_CREATE() when
using openmpi-1.1.2 built with the Intel compilers. I have included
all versions, code, and messages below.  I know there were problems
in the past, and dug around the archives but didn't find this
anywhere.  Has anyone else seen this problem?

The message is like so:

[brockp@nyxtest1 bagci]$ mpirun -np 1 ./a.out
Signal:11 info.si_errno:0(Success) si_code:2(SEGV_ACCERR)
Failing at addr:0x448c78
[0] func:/home/software/rhel4/openmpi-1.1.2/intel/lib/libopal.so.0
[0x2a958dea55]
[1] func:/lib64/tls/libpthread.so.0 [0x325560c430]
[2] func:/home/software/rhel4/openmpi-1.1.2/intel/lib/libmpi.so.0
(mpi_cart_create__+0x60) [0x2a955ef93c]
[3] func:./a.out(MAIN__+0x8c) [0x406b8c]
[4] func:./a.out(main+0x32) [0x406aea]
[5] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x325471c3fb]
[6] func:./a.out [0x406a2a]
*** End of error message ***

ifort is version 9.1.037
icc icpc are version 9.1.043

This problem does not happen with the PGI compilers (version 6.1).
Both were done on em64t systems (Xeon 5160s).

Here is the code sample that fails.

program cart
include 'mpif.h'

call MPI_INIT(ierr);
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_world_id, ierr)
call MPI_CART_CREATE(MPI_COMM_WORLD, 1, numprocs, .false., .true.,
ITER_COMM, ierr)
call MPI_COMM_RANK(ITER_COMM,myid, ierr)

call MPI_FINALIZE(ierr)
end



Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985





--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems







[OMPI users] orted: command not found

2007-01-02 Thread jcolmenares
I installed openmpi 1.1.2 on two 686 boxes running Ubuntu 6.10. I
followed the instructions given in the FAQ. Nevertheless, I get the
following message:

[bernie-1:05053] ERROR: A daemon on node 192.168.1.113 failed to start as
expected.
[bernie-1:05053] ERROR: There may be more information available from
[bernie-1:05053] ERROR: the remote shell (see above).
[bernie-1:05053] ERROR: The daemon exited unexpectedly with status 127.

Now, I've been browsing the web, including the mailing lists, and it
appears that the cause is that I have not declared the variables

export PATH="/usr/local/openmpi/bin:${PATH}"
export LD_LIBRARY_PATH="/usr/local/openmpi/lib:${LD_LIBRARY_PATH}"

at the node, which I have. I have even created all the possible folders
proposed in the FAQ for remote logins, although I'm using bash.

If I do an ssh user@remote_node, I can connect without being asked for a
password, and if I type mpif90, I get "gfortran: no input files", which
should mean that the PATH and LD_LIBRARY_PATH are indeed being set on
the remote login.

But, if I do:

bash$  mpirun --prefix /usr/local/openmpi -np 2 prueba.bin

the result is:

--
Failed to find the following executable:

Host:   bernie-3
Executable: prueba.bin

Cannot continue.
--
mpirun noticed that job rank 0 with PID 0 on node "192.168.1.113" exited
on signal 4.

I've been looking around, but have not been able to find out what
signal 4 means.

Just in case, I was running an example program which runs fine on my
university cluster. Nevertheless, I decided to run an even simpler one, which
I include, for it may be that the error is there (I definitely hope
not!...)

program test

  use mpi

  implicit none

  integer :: myid,sizze,ierr

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,sizze,ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,myid,ierr)

  print *,"I'm using ",sizze," processors"
  print *,"of wich I'm the number ",myid

  call MPI_FINALIZE(ierr)

end program test


This is the first time I have installed - and used - any parallel programming
library, and I'm doing it as a personal project for a graduate
course, so any help will be greatly appreciated!

Best regards

Jose Colmenares



Re: [OMPI users] openmpi / mpirun problem on aix: poll failed with errno=25, opal_event_loop: ompi_evesel->dispatch() failed.

2007-01-02 Thread Jeff Squyres

Yikes - that's not a good error.  :-(

We don't regularly build / test on AIX, so I don't have much  
immediate guidance for you.  My best suggestion at this point would  
be to try the latest 1.2 beta or nightly snapshot.  We did an update  
of the event engine (the portion of the code that you're seeing the  
error issue from) that *may* alleviate the problem...?  (I have no  
idea, actually -- I'm just kinda hoping that the new version of the  
event engine will fix your problem :-\ )



On Dec 27, 2006, at 10:29 AM, Michael Marti wrote:


Dear All

I am trying to get openmpi-1.1.2 to work on AIX 5.3 / power5.

:: Compilation seems to have worked with the following sequence:

setenv OBJECT_MODE 64

setenv CC xlc
setenv CXX xlC
setenv F77 xlf
setenv FC xlf90

setenv CFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
setenv CXXFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
setenv FFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"
setenv FCFLAGS "-qthreaded -O3 -qmaxmem=-1 -qarch=pwr5x -qtune=pwr5 -q64"

setenv LDFLAGS "-Wl,-brtl"

./configure --prefix=/ist/openmpi-1.1.2 \
  --disable-mpi-cxx \
  --disable-mpi-cxx-seek \
  --enable-mpi-threads \
  --enable-progress-threads \
  --enable-static \
  --disable-shared \
  --disable-io-romio


:: After the compilation I ran make check and all 11 tests passed  
successfully.


:: Now I'm trying to run the following command just for test:
# mpirun -hostfile /gpfs/MICHAEL/MPI_hostfiles/mpinodes_b41-b44_1.asc -np 2 /usr/bin/hostname
- The file /gpfs/MICHAEL/MPI_hostfiles/mpinodes_b41-b44_1.asc  
contains 4 hosts:

r1blade041 slots=1
r1blade042 slots=1
r1blade043 slots=1
r1blade044 slots=1
- The mpirun command eventually hangs with the following message:
[r1blade041:418014] poll failed with errno=25
[r1blade041:418014] opal_event_loop: ompi_evesel->dispatch()  
failed.
- In this state mpirun cannot be killed by hitting ; only a
kill -9 will do the trick.
- While the mpirun still hangs I can see that the "orted" has been  
launched on both requested hosts.


:: I turned on all debug options in openmpi-mca-params.conf. The  
output for the same call of mpirun is in the file mpirun-debug.txt.gz.



:: As suggested in the mailing list rules, I include config.log
(config.log.gz) and the output of ompi_info (ompi_info.txt.gz).






:: As I am completely new to openmpi (I have some experience with
lam), I am lost at this stage. I would really appreciate it if someone
could give me some hints as to what is going wrong and where I
could get more info.


Best regards,

Michael Marti.


--
-- 
--

Michael Marti
Centro de Fisica dos Plasmas
Instituto Superior Tecnico
Av. Rovisco Pais
1049-001 Lisboa
Portugal

Tel:   +351 218 419 379
Fax:  +351 218 464 455
Mobile:  +351 968 434 327
-- 
--






--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] segv at runtime with ifort

2007-01-02 Thread Jeff Squyres

Brock --

I think your test program is faulty.  For MPI_CART_CREATE, you need  
to pass in an array indicating whether the dimensions are periodic or  
not -- it is not sufficient to pass in a scalar logical value.


For example, the following program seems to work fine for me:

program cart
include 'mpif.h'

logical periods(1)

call MPI_INIT(ierr);
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_world_id, ierr)
periods(1) = .false.
call MPI_CART_CREATE(MPI_COMM_WORLD, 1, numprocs, periods, .true.,  
ITER_COMM, ierr)

call MPI_COMM_RANK(ITER_COMM,myid, ierr)

call MPI_FINALIZE(ierr)
end
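
For comparison, here is the same fix sketched in C, where the prototype
makes the array requirement explicit (a sketch only; the code under
discussion is the Fortran above):

#include <mpi.h>

int main(int argc, char *argv[])
{
    int numprocs, my_world_id, myid;
    int dims[1];
    int periods[1] = { 0 };   /* non-periodic in the single dimension */
    MPI_Comm iter_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_world_id);

    dims[0] = numprocs;
    /* periods must be an int array with one entry per dimension,
       not a scalar flag */
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, &iter_comm);
    MPI_Comm_rank(iter_comm, &myid);

    MPI_Finalize();
    return 0;
}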




On Dec 19, 2006, at 3:22 PM, Brock Palen wrote:


Hello,

We are getting seg faults at run time on MPI_CART_CREATE() when
using openmpi-1.1.2 built with the Intel compilers. I have included
all versions, code, and messages below.  I know there were problems
in the past, and dug around the archives but didn't find this
anywhere.  Has anyone else seen this problem?

The message is like so:

[brockp@nyxtest1 bagci]$ mpirun -np 1 ./a.out
Signal:11 info.si_errno:0(Success) si_code:2(SEGV_ACCERR)
Failing at addr:0x448c78
[0] func:/home/software/rhel4/openmpi-1.1.2/intel/lib/libopal.so.0
[0x2a958dea55]
[1] func:/lib64/tls/libpthread.so.0 [0x325560c430]
[2] func:/home/software/rhel4/openmpi-1.1.2/intel/lib/libmpi.so.0
(mpi_cart_create__+0x60) [0x2a955ef93c]
[3] func:./a.out(MAIN__+0x8c) [0x406b8c]
[4] func:./a.out(main+0x32) [0x406aea]
[5] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x325471c3fb]
[6] func:./a.out [0x406a2a]
*** End of error message ***

ifort is version 9.1.037
icc icpc are version 9.1.043

This problem does not happen with the PGI compilers (version 6.1).
Both were done on em64t systems (Xeon 5160s).

Here is the code sample that fails.

program cart
include 'mpif.h'

call MPI_INIT(ierr);
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_world_id, ierr)
call MPI_CART_CREATE(MPI_COMM_WORLD, 1, numprocs, .false., .true.,
ITER_COMM, ierr)
call MPI_COMM_RANK(ITER_COMM,myid, ierr)

call MPI_FINALIZE(ierr)
end



Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985





--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] mpicc problems finding libraries (mostly)

2007-01-02 Thread Jeff Squyres
Welcome back from the holidays!  I'll try to catch up on the
right-before-the-holidays e-mail today...



On Dec 21, 2006, at 6:07 PM, Dennis McRitchie wrote:

I am trying to build openmpi so that mpicc does not require me to  
set up
the compiler's environment, and so that any executables built with  
mpicc

can run without my having to point LD_LIBRARY_PATH to the openmpi lib


We had not really considered this use-case before.  The current  
assumption (as you undoubtedly figured out) is that on the node where  
you're invoking OMPI commands, the PATH/LD_LIBRARY_PATH has been  
setup properly.


I'm not saying that we can't change this; I'm just trying to give you  
the rationale for why the wrappers are the way they [currently] are.


directory. I made some unsuccessful attempts to accomplish this  
(which I
describe below), but after building openmpi using the Intel  
compiler, I

found the following:

1) When typing "/mpicc -showme" I get:
/mpicc: error while loading shared libraries:  
libsvml.so:

cannot open shared object file: No such file or directory

I then set LD_LIBRARY_PATH to point to the Intel compiler  
libraries, and

now "-showme" works, and returns:
icc -I/usr/local/openmpi-1.1.2-intel/include
-I/usr/local/openmpi-1.1.2-intel/include/openmpi -pthread
-L/usr/local/openmpi-1.1.2-intel/lib -L/usr/ofed/lib -L/usr/ofed/lib64
-lmpi -lorte -lopal -libverbs -lrt -lpbs -lnsl -lutil


This behavior reflects the current assumption (above).


However...

2) When typing "/mpicc hello.c" I now get:
-- 
--

--
The Open MPI wrapper compiler was unable to find the specified compiler

icc in your PATH.

Note that this compiler was either specified at configure time or in
one of several possible environment variables.
-- 
--

--

Of course, this is due to the fact that -showme indicates that mpicc
invokes "icc" instead of "/icc". If I now set up the PATH
to the Intel compiler, it works. However...


Mmm.  Yes.  Also a good point; another working assumption is that  
you're setup for your compiler as well (re: PATH, LD_LIBRARY_PATH,  
LM_LICENSE_FILE, ...etc.).  OMPI *does* save the absolute pathname of  
the compiler, but we had shied away from using it in the wrappers by  
default for a few reasons:


1. You may not have the compiler installed in the same location on  
all nodes.


2. There may be other factors that need to be setup in the  
environment (such as an env variable containing a license file) that  
the wrapper compilers are not currently setup to handle.


3. As you noted later, users can specify an absolute path name in CC,  
CXX (and friends) to configure and that propagates through.  Hence,  
users have the choice of specifying the full pathname if they want  
to; OMPI's current setup allows you to do it either way.


Additionally, be aware that the wrapper compilers are configurable via a
text file.  Check out this section of the FAQ:
http://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0



3) When I try to run the executable thus created, I get:
./a.out: error while loading shared libraries: libmpi.so.0: cannot  
open

shared object file: No such file or directory

I now need to set LD_LIBRARY_PATH to point to the openmpi lib  
directory.


Correct.  The mpirun --prefix option may help here, though (and its  
synonym -- providing a full absolute path to mpirun).



---
---

To avoid problems (1) and (2), I built openmpi with:
export CC=/opt/intel/cce/latest/bin/icc
export CXX=/opt/intel/cce/latest/bin/icpc
export F77=/opt/intel/fce/latest/bin/ifort
export FC=/opt/intel/fce/latest/bin/ifort
export LDFLAGS="-Wl,-rpath,/opt/intel/cce/latest/lib,-rpath,/opt/intel/fce/latest/lib"

But while this satisfied the configure script and all its tests, it  
did

not produce the results I hoped for.

To avoid problem (3), I added the following option to configure:
--with-wrapper-ldflags=-Wl,-rpath,/usr/local/openmpi-1.1.2-intel/lib

I was hoping "-showme" would add this to its parameters, but no such
luck. Looking at the build output, it seems that the
--with-wrapper-ldflags parameter seems to be parsed differently  
from how

LDFLAGS gets parsed, and I get a compilation line:
/opt/intel/cce/latest/bin/icc -O3 -DNDEBUG -fno-strict-aliasing -pthread

-Wl,-rpath -Wl,/opt/intel/cce/latest/lib -Wl,-rpath
-Wl,/opt/intel/fce/latest/lib -o .libs/opal_wrapper opal_wrapper.o
../../../opal/.libs/libopal.so -lnsl -lutil -Wl,--rpath
-Wl,/usr/local/openmpi-1.1.2-intel/lib

Notice that the rpath preceding the openmpi lib directory is specified
as "--rpath", which is probably why it is ignored. Is this perhaps a
bug?


Hmm.  I'd have to trace into why that happens; that's pretty weird.   
We put the --with-wrapper-*flags [almost] directl