Re: [OMPI users] Ompi failing on mx only

2007-01-03 Thread Reese Faucette

$ mpirun --prefix /usr/local/openmpi-1.2b2 --hostfile ./h1-3 -np 1 --mca
btl mx,sm,self ./cpi
[node-1:09704] mca: base: component_find: unable to open mtl mx: file
not found (ignored)
[node-1:09704] mca: base: component_find: unable to open btl mx: file
not found (ignored)


This in particular is almost certainly a library path issue.  A quick way to 
check whether your LD_LIBRARY_PATH is correct is to run:
$ mpirun <mpirun args> ldd <prefix>/lib/openmpi/mca_mtl_mx.so


If things are good, you will get a first line like:
   libmyriexpress.so => /opt/mx/lib/libmyriexpress.so (0xb7f1d000)

If not, it will tell you explicitly.  Since all you specified is 
the --prefix line, I'm not surprised libmyriexpress.so is not found in this 
case.
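If ldd reports libmyriexpress.so as "not found", one common fix is to export
the MX and Open MPI library directories and forward the variable with -x (the
/opt/mx/lib location below is only an assumption based on the example above;
use wherever libmyriexpress.so actually lives on your nodes):

$ export LD_LIBRARY_PATH=/opt/mx/lib:/usr/local/openmpi-1.2b2/lib:$LD_LIBRARY_PATH
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 \
    -np 1 --mca btl mx,sm,self ./cpi

You can also confirm that the mx components were built into this install at
all with something like "ompi_info | grep mx", which should list MCA btl and
mtl entries for mx if the components are present.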

-reese





Re: [OMPI users] Ompi failing on mx only

2007-01-03 Thread Grobe, Gary L. (JSC-EV)[ESCG]
Just as an FYI, I also included the sm param as you suggested and
changed the -np to 1, because anything more than that just duplicates
the same error. I also saw this same error message in previous posts as
a bug. Would that be the same issue in this case?

$ mpirun --prefix /usr/local/openmpi-1.2b2 --hostfile ./h1-3 -np 1 --mca
btl mx,sm,self ./cpi
[node-1:09704] mca: base: component_find: unable to open mtl mx: file
not found (ignored)
[node-1:09704] mca: base: component_find: unable to open btl mx: file
not found (ignored)
Process 0 of 1 is on node-1
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000331


-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Brian W. Barrett
Sent: Tuesday, January 02, 2007 4:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

Sorry to jump into the discussion late.  The mx btl does not support
communication between processes on the same node by itself, so you have
to include the shared memory transport when using MX.  This will
eventually be fixed, but likely not for the 1.2 release.  So if you do:

   mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH \
       --hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi

It should work much better.  As for the MTL, there is a bug in the MX
MTL for v1.2 that could cause the random failures you were seeing; it
has been fixed, but only after 1.2b2 was cut.  Things should work much
better once 1.2b3 is released (or, if you are feeling really lucky, you
can try out the 1.2 nightly tarballs).

The MTL is a new feature in v1.2. It is a different communication
abstraction designed to support interconnects that have matching
implemented in the lower level library or in hardware (Myrinet/MX,
Portals, InfiniPath are currently implemented).  The MTL allows us to
exploit the low latency and asynchronous progress these libraries can
provide, but does mean multi-nic abilities are reduced.  Further, the
MTL is not well suited to interconnects like TCP or InfiniBand, so we
will continue supporting the BTL interface as well.
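To make the distinction concrete, the two selection styles that appear in this
thread are (arguments as in the commands above; both assume the library path
issue is resolved first):

   mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi
   mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca pml cm ./cpi

The first uses the ob1 PML over the MX BTL (with sm and self covering on-node
traffic); the second selects the cm PML, which drives the MX MTL directly.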

Brian


On Jan 2, 2007, at 2:44 PM, Grobe, Gary L. (JSC-EV)[ESCG] wrote:

> About the -x, I've been trying it both ways and prefer the latter, and
> the results for either are the same. But its value is correct.
> I've attached the ompi_info from node-1 and node-2. Sorry for not
> zipping them, but they were small and I think I'd have firewall
> issues.
>
> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
> [node-14:19260] mx_connect fail for node-14:0 with key  (error Endpoint closed or not connectable!)
> [node-14:19261] mx_connect fail for node-14:0 with key  (error Endpoint closed or not connectable!)
> ...
>
> Is there any info anywhere on the MTL? Anyway, I've run with mtl, and
> it actually worked once, but now I can't reproduce it and it's
> throwing sig 7's, 11's, and 4's depending upon the number of procs I
> give it. But now that you mention mapper, I take it that's what
> SEGV_MAPERR might be referring to. I'm looking into the
>
> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi
> Process 4 of 5 is on node-2
> Process 0 of 5 is on node-1
> Process 1 of 5 is on node-1
> Process 2 of 5 is on node-1
> Process 3 of 5 is on node-1
> pi is approximately 3.1415926544231225, Error is 0.0000000008333294
> wall clock time = 0.019305
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at addr:0x2b88243862be
> mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
> 4 additional processes aborted (not shown)
>
> Or sometimes I'll get this error, just depending upon the number of procs ...
>
> mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl mx,self ./cpi
> Signal:7 info.si_errno:0(Success) si_code:2() Failing at addr:0x2aaab000
> [0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
> [1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
> [2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
> [3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
> [4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
> [5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
> [6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
> [7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576]
> [8] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe3) [0x2b9b82a2d0a3]
> [9] func:/usr/local/openmpi-1.2b2/lib

Re: [OMPI users] orted: command not found

2007-01-03 Thread Ralph H Castain
Hi Jose

Sorry for entering the discussion late. From tracing the email thread, I
somewhat gather the following:

1. you have installed Open MPI 1.1.2 on two 686 boxes

2. you created a hostfile on one of the nodes and executed mpirun from that
node. You gave us a prefix indicating where we should find the Open MPI
executables on each node

3. you were getting an error message indicating that mpirun was unable to
find your executable

4. you didn't encounter this problem when running on a cluster

If I have those facts correct, then the problem is simple. My guess is that
the cluster you were using has a shared file system - hence, the remote
nodes "see" your executable in the same relative location across the
cluster.

In your simple setup with the two boxes, it sounds like you don't have a
shared file system. When mpirun attempts to locate the executable on
bernie-3, it won't find the file since it doesn't exist on that node's file
system. Once you copied the file over to bernie-3, mpirun could find it, so
everything worked fine.

We hope to add file pre-positioning at some point in the future for systems
such as yours. However, that is some time away due to priorities. For now,
Open MPI requires that your executable (and the Open MPI executables and
libraries) be available on each node you are trying to use.
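As a concrete sketch (the program name a.out, the destination path, and the
hostfile name below are placeholders, not taken from your setup), copying the
binary to the same location on the remote box before launching is enough:

   scp ./a.out bernie-3:~/a.out
   mpirun --prefix <your Open MPI prefix> --hostfile ~/hostfile -np 2 ./a.out

assuming ./a.out resolves to the same relative location on both machines.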

Hope that helps to explain the problem.
Ralph


On 1/2/07 2:03 PM, "jcolmena...@ula.ve"  wrote:

> I had configured the hostfile located at
> ~prefix/etc/openmpi-default-hostfile.
> 
> I copied the file to bernie-3, and it worked...
> 
> Now, at the cluster I was working on at the Universidad de Los Andes
> (Venezuela), all I had to do was compile and run my applications; that is,
> I never copied any file to any other machine. (I decided to install MPI on
> three machines I was able to put together as a personal project.) Now, I
> had to. I'm sorry if it was obvious and made you guys lose some time, but
> why didn't I have to copy any files on the cluster, while now I must do so?
> 
> Thanks for your patience!
> 
> Jose
> 