Re: [OMPI users] Cluster with IB hosts and Ethernet hosts

2009-01-23 Thread Sangamesh B
Any solution for the following problem?


Re: [OMPI users] Cluster with IB hosts and Ethernet hosts

2009-01-23 Thread Sangamesh B
On Fri, Jan 23, 2009 at 5:41 PM, Jeff Squyres  wrote:
> On Jan 22, 2009, at 11:26 PM, Sangamesh B wrote:
>
>>   We've a cluster with 23 nodes connected to an IB switch and 8 nodes
>> connected to an ethernet switch. The master node is also connected to the IB
>> switch. SGE (with tight integration, -pe orte) is used for
>> parallel/serial job submission.
>>
>> Open MPI-1.3 is installed on the master node with IB support
>> (--with-openib=/usr). The same folder is copied to the remaining 23 IB
>> nodes.
>
> Sounds good.
>
>> Now what shall I do for the remaining 8 ethernet nodes:
>> (1) Copy the same folder (IB) to these nodes
>> (2) Install Open MPI on one of the 8 ethernet nodes. Copy the
>> same to the other 7 nodes.
>> (3) Install an ethernet version of Open MPI on the master node and copy to
>> the 8 nodes.
>
> Either 1 or 2 is your best bet.
>
> Do you have OFED installed on all nodes (either explicitly, or included in
> your Linux distro)?
No
>
> If so, I believe that at least some users with configurations like this
> install OMPI with OFED support (--with-openib=/usr, as you mentioned above)
> on all nodes.  OMPI will notice that there is no OpenFabrics-capable
> hardware on the ethernet-only nodes and will simply not use the openib BTL
> plugin.
>
> Note that OMPI v1.3 got better about being silent about the lack of
> OpenFabrics devices when the openib BTL is present (OMPI v1.2 issued a
> warning about this).
>
> How you intend to use this setup is up to you; you may want to restrict jobs
> to 100% IB or 100% ethernet via SGE, or you may want to let them mix,
> realizing that the overall parallel job may be slowed down to the speed of
> the slowest network (e.g., ethernet).
>

Now I've two basic problems:

(1) Open MPI 1.3 is configured as:
# ./configure --prefix=/opt/mpi/openmpi/1.3/intel --with-sge
--with-openib=/usr | tee config_out

But,

 /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

shows only one component. Is this ok?
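For context, the tight integration that the gridengine component provides is exercised by a job script along these lines. This is only a sketch: the PE name "orte" and the install prefix come from this thread, while the script path and application name are hypothetical. The script is written to a file here rather than submitted with qsub:

```shell
# Sketch of an SGE job script using the "orte" parallel environment.
# Written to a temp file for illustration; normally: qsub mpi_job.sh
cat > /tmp/mpi_job.sh <<'EOF'
#!/bin/sh
#$ -pe orte 8
#$ -cwd
# Under tight integration, mpirun reads the SGE-granted slot list itself,
# so no -np or -hostfile is needed:
/opt/mpi/openmpi/1.3/intel/bin/mpirun ./my_mpi_app
EOF
```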

(2) Open MPI itself is not working:
ssh: connect to host chemfist.iitk.ac.in port 22: Connection timed out
A daemon (pid 31343) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished


On two nodes:

# /opt/mpi/openmpi/1.3/intel/bin/mpirun -np 2 -hostfile ih hostname
bash: /opt/mpi/openmpi/1.3/intel/bin/orted: No such file or directory
--
A daemon (pid 31184) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
ibc0 - daemon did not report back when launched
ibc1 - daemon did not report back when launched


#cat ih
ibc0
ibc1

The hostfile looks fine, and these IB interfaces can be pinged from the master.
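Ping alone does not prove the remote side can find the Open MPI tree. A quick pre-flight sketch for this situation (orted path from the thread; the ssh commands are only echoed here, not executed, and the hostfile is a stand-in copy):

```shell
# Exit status 127 above means bash could not find orted on the remote node,
# so check that every host in the hostfile has orted at the expected path.
ORTED=/opt/mpi/openmpi/1.3/intel/bin/orted
printf 'ibc0\nibc1\n' > /tmp/ih     # stand-in copy of the hostfile "ih"
while read -r host; do
    # In real use, drop the leading "echo" to actually run the check:
    echo ssh "$host" test -x "$ORTED"
done < /tmp/ih
```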

# echo $LD_LIBRARY_PATH
/opt/mpi/openmpi/1.3/intel/lib:/opt/intel/cce/10.0.023/lib:/opt/intel/fce/10.0.023/lib:/opt/intel/mkl/10.0.5.025/lib/em64t:/opt/gridengine/lib/lx26-amd6

IB tests are also working fine.
Please help us to resolve this.
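For reference, two common remedies for the "orted: No such file or directory" failure above, sketched under the assumption that the install tree really exists at the same path on every node (mpirun's --prefix option is a real Open MPI option; the mpirun line is echoed rather than run here):

```shell
PREFIX=/opt/mpi/openmpi/1.3/intel

# (a) Have mpirun tell the remote daemons where Open MPI is installed:
echo "$PREFIX/bin/mpirun" --prefix "$PREFIX" -np 2 -hostfile ih hostname

# (b) Or set the paths so they apply to non-interactive remote shells
# (e.g. in ~/.bashrc on every node, not only in the login environment):
export PATH="$PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$PREFIX/lib:$LD_LIBRARY_PATH"
```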

> Make sense?
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Cluster with IB hosts and Ethernet hosts

2009-01-23 Thread Jeff Squyres

On Jan 22, 2009, at 11:26 PM, Sangamesh B wrote:


> We've a cluster with 23 nodes connected to an IB switch and 8 nodes
> connected to an ethernet switch. The master node is also connected to the IB
> switch. SGE (with tight integration, -pe orte) is used for
> parallel/serial job submission.
>
> Open MPI-1.3 is installed on the master node with IB support
> (--with-openib=/usr). The same folder is copied to the remaining 23 IB
> nodes.


Sounds good.


> Now what shall I do for the remaining 8 ethernet nodes:
> (1) Copy the same folder (IB) to these nodes
> (2) Install Open MPI on one of the 8 ethernet nodes. Copy the
> same to the other 7 nodes.
> (3) Install an ethernet version of Open MPI on the master node and copy
> to the 8 nodes.


Either 1 or 2 is your best bet.

Do you have OFED installed on all nodes (either explicitly, or  
included in your Linux distro)?


If so, I believe that at least some users with configurations like  
this install OMPI with OFED support (--with-openib=/usr, as you  
mentioned above) on all nodes.  OMPI will notice that there is no  
OpenFabrics-capable hardware on the ethernet-only nodes and will  
simply not use the openib BTL plugin.


Note that OMPI v1.3 got better about being silent about the lack of  
OpenFabrics devices when the openib BTL is present (OMPI v1.2 issued a  
warning about this).


How you intend to use this setup is up to you; you may want to  
restrict jobs to 100% IB or 100% ethernet via SGE, or you may want to  
let them mix, realizing that the overall parallel job may be slowed  
down to the speed of the slowest network (e.g., ethernet).
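A minimal sketch of such a restriction at the mpirun level: the BTL component names openib/tcp/self and the --mca option are real Open MPI pieces, while the JOB_NET switch (which an SGE queue or job script would set) and the application name are hypothetical:

```shell
# Hypothetical per-job network selection; JOB_NET would come from the
# SGE queue the job landed in ("ib" queue vs "tcp" queue).
JOB_NET="tcp"

if [ "$JOB_NET" = "ib" ]; then
    BTLS="openib,self"     # 100% IB jobs
else
    BTLS="tcp,self"        # 100% ethernet jobs
fi

# Echoed here for illustration; a job script would run this mpirun line:
echo mpirun --mca btl "$BTLS" ./my_mpi_app
```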


Make sense?

--
Jeff Squyres
Cisco Systems



[OMPI users] Cluster with IB hosts and Ethernet hosts

2009-01-22 Thread Sangamesh B
Hello all,

We've a cluster with 23 nodes connected to an IB switch and 8 nodes
connected to an ethernet switch. The master node is also connected to the IB
switch. SGE (with tight integration, -pe orte) is used for
parallel/serial job submission.

Open MPI-1.3 is installed on the master node with IB support
(--with-openib=/usr). The same folder is copied to the remaining 23 IB
nodes.

Now what shall I do for the remaining 8 ethernet nodes:
 (1) Copy the same folder (IB) to these nodes
 (2) Install Open MPI on one of the 8 ethernet nodes. Copy the
same to the other 7 nodes.
 (3) Install an ethernet version of Open MPI on the master node and copy to the 8 nodes.
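Option (1) amounts to replicating the existing install tree; Open MPI simply will not load the openib BTL on nodes without IB hardware. A sketch (the ethernet node names are hypothetical, and the rsync commands are echoed rather than executed):

```shell
# Replicate the IB-built Open MPI tree onto the ethernet-only nodes too.
PREFIX=/opt/mpi/openmpi/1.3/intel
for node in eth01 eth02; do    # hypothetical ethernet node names
    # Drop the leading "echo" to really copy the tree:
    echo rsync -a "$PREFIX/" "$node:$PREFIX/"
done
```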

Which of the above will let SGE submit parallel jobs across both the ethernet and IB nodes?

Thanks,
Sangamesh