Re: [OMPI users] Cluster with IB hosts and Ethernet hosts
Any solution for the following problem?

On Fri, Jan 23, 2009 at 7:58 PM, Sangamesh B wrote:
> On Fri, Jan 23, 2009 at 5:41 PM, Jeff Squyres wrote:
>> On Jan 22, 2009, at 11:26 PM, Sangamesh B wrote:
>>
>>> We've a cluster with 23 nodes connected to an IB switch and 8 nodes
>>> connected to an ethernet switch. The master node is also connected to
>>> the IB switch. SGE (with tight integration, -pe orte) is used for
>>> parallel/serial job submission.
>>>
>>> Open MPI 1.3 is installed on the master node with IB support
>>> (--with-openib=/usr). The same folder is copied to the remaining 23
>>> IB nodes.
>>
>> Sounds good.
>>
>>> Now what shall I do for the remaining 8 ethernet nodes:
>>> (1) Copy the same (IB) folder to these nodes
>>> (2) Install Open MPI on one of the 8 ethernet nodes and copy the
>>> same to the other 7 nodes
>>> (3) Install an ethernet version of Open MPI on the master node and
>>> copy it to the 8 nodes
>>
>> Either 1 or 2 is your best bet.
>>
>> Do you have OFED installed on all nodes (either explicitly, or
>> included in your Linux distro)?
>
> No
>
>> If so, I believe that at least some users with configurations like
>> this install OMPI with OFED support (--with-openib=/usr, as you
>> mentioned above) on all nodes. OMPI will notice that there is no
>> OpenFabrics-capable hardware on the ethernet-only nodes and will
>> simply not use the openib BTL plugin.
>>
>> Note that OMPI v1.3 got better about being silent about the lack of
>> OpenFabrics devices when the openib BTL is present (OMPI v1.2 issued
>> a warning about this).
>>
>> How you intend to use this setup is up to you; you may want to
>> restrict jobs to 100% IB or 100% ethernet via SGE, or you may want to
>> let them mix, realizing that the overall parallel job may be slowed
>> down to the speed of the slowest network (e.g., ethernet).
>
> Now I've two basic problems:
>
> (1) Open MPI 1.3 is configured as:
>
> # ./configure --prefix=/opt/mpi/openmpi/1.3/intel --with-sge \
>     --with-openib=/usr | tee config_out
>
> But
>
> /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>   MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>
> shows only one component. Is this ok?
>
> (2) Open MPI itself is not working:
>
> ssh: connect to host chemfist.iitk.ac.in port 22: Connection timed out
> A daemon (pid 31343) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
> the location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> mpirun: clean termination accomplished
>
> On two nodes:
>
> # /opt/mpi/openmpi/1.3/intel/bin/mpirun -np 2 -hostfile ih hostname
> bash: /opt/mpi/openmpi/1.3/intel/bin/orted: No such file or directory
> --
> A daemon (pid 31184) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
> the location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --
> ibc0 - daemon did not report back when launched
> ibc1 - daemon did not report back when launched
>
> # cat ih
> ibc0
> ibc1
>
> Everything is fine.
> These IB interfaces are able to ping from the master.
>
> # echo $LD_LIBRARY_PATH
> /opt/mpi/openmpi/1.3/intel/lib:/opt/intel/cce/10.0.023/lib:/opt/intel/fce/10.0.023/lib:/opt/intel/mkl/10.0.5.025/lib/em64t:/opt/gridengine/lib/lx26-amd6
>
> IB tests are also working fine.
> Please help us to resolve this.
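The "orted: No such file or directory" error above usually means the Open MPI install directory is not on PATH/LD_LIBRARY_PATH in the *non-interactive* shell started by ssh on the remote nodes. A minimal sketch of the usual fix, using the install path quoted in this thread (adjust for your own install):

```shell
# Add these lines to a startup file that non-interactive shells read on
# every node (e.g. ~/.bashrc for bash), so ssh-launched orted is found:
export PATH=/opt/mpi/openmpi/1.3/intel/bin:$PATH
export LD_LIBRARY_PATH=/opt/mpi/openmpi/1.3/intel/lib:$LD_LIBRARY_PATH
```

Alternatively, `mpirun --prefix /opt/mpi/openmpi/1.3/intel -np 2 -hostfile ih hostname` tells mpirun to forward the install location to the remote nodes for that single run, without editing any startup files.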
Re: [OMPI users] Cluster with IB hosts and Ethernet hosts
On Fri, Jan 23, 2009 at 5:41 PM, Jeff Squyres wrote:
> On Jan 22, 2009, at 11:26 PM, Sangamesh B wrote:
>
>> We've a cluster with 23 nodes connected to an IB switch and 8 nodes
>> connected to an ethernet switch. The master node is also connected to
>> the IB switch. SGE (with tight integration, -pe orte) is used for
>> parallel/serial job submission.
>>
>> Open MPI 1.3 is installed on the master node with IB support
>> (--with-openib=/usr). The same folder is copied to the remaining 23
>> IB nodes.
>
> Sounds good.
>
>> Now what shall I do for the remaining 8 ethernet nodes:
>> (1) Copy the same (IB) folder to these nodes
>> (2) Install Open MPI on one of the 8 ethernet nodes and copy the same
>> to the other 7 nodes
>> (3) Install an ethernet version of Open MPI on the master node and
>> copy it to the 8 nodes
>
> Either 1 or 2 is your best bet.
>
> Do you have OFED installed on all nodes (either explicitly, or
> included in your Linux distro)?

No

> If so, I believe that at least some users with configurations like
> this install OMPI with OFED support (--with-openib=/usr, as you
> mentioned above) on all nodes. OMPI will notice that there is no
> OpenFabrics-capable hardware on the ethernet-only nodes and will
> simply not use the openib BTL plugin.
>
> Note that OMPI v1.3 got better about being silent about the lack of
> OpenFabrics devices when the openib BTL is present (OMPI v1.2 issued
> a warning about this).
>
> How you intend to use this setup is up to you; you may want to
> restrict jobs to 100% IB or 100% ethernet via SGE, or you may want to
> let them mix, realizing that the overall parallel job may be slowed
> down to the speed of the slowest network (e.g., ethernet).

Now I've two basic problems:

(1) Open MPI 1.3 is configured as:

# ./configure --prefix=/opt/mpi/openmpi/1.3/intel --with-sge \
    --with-openib=/usr | tee config_out

But

/opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

shows only one component. Is this ok?

(2) Open MPI itself is not working:

ssh: connect to host chemfist.iitk.ac.in port 22: Connection timed out
A daemon (pid 31343) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have
the location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished

On two nodes:

# /opt/mpi/openmpi/1.3/intel/bin/mpirun -np 2 -hostfile ih hostname
bash: /opt/mpi/openmpi/1.3/intel/bin/orted: No such file or directory
--
A daemon (pid 31184) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have
the location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
ibc0 - daemon did not report back when launched
ibc1 - daemon did not report back when launched

# cat ih
ibc0
ibc1

Everything is fine.
These IB interfaces are able to ping from the master.

# echo $LD_LIBRARY_PATH
/opt/mpi/openmpi/1.3/intel/lib:/opt/intel/cce/10.0.023/lib:/opt/intel/fce/10.0.023/lib:/opt/intel/mkl/10.0.5.025/lib/em64t:/opt/gridengine/lib/lx26-amd6

IB tests are also working fine.
Please help us to resolve this.

> Make sense?
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Cluster with IB hosts and Ethernet hosts
On Jan 22, 2009, at 11:26 PM, Sangamesh B wrote:

> We've a cluster with 23 nodes connected to an IB switch and 8 nodes
> connected to an ethernet switch. The master node is also connected to
> the IB switch. SGE (with tight integration, -pe orte) is used for
> parallel/serial job submission.
>
> Open MPI 1.3 is installed on the master node with IB support
> (--with-openib=/usr). The same folder is copied to the remaining 23
> IB nodes.

Sounds good.

> Now what shall I do for the remaining 8 ethernet nodes:
> (1) Copy the same (IB) folder to these nodes
> (2) Install Open MPI on one of the 8 ethernet nodes and copy the same
> to the other 7 nodes
> (3) Install an ethernet version of Open MPI on the master node and
> copy it to the 8 nodes

Either 1 or 2 is your best bet.

Do you have OFED installed on all nodes (either explicitly, or included
in your Linux distro)?

If so, I believe that at least some users with configurations like this
install OMPI with OFED support (--with-openib=/usr, as you mentioned
above) on all nodes. OMPI will notice that there is no
OpenFabrics-capable hardware on the ethernet-only nodes and will simply
not use the openib BTL plugin.

Note that OMPI v1.3 got better about being silent about the lack of
OpenFabrics devices when the openib BTL is present (OMPI v1.2 issued a
warning about this).

How you intend to use this setup is up to you; you may want to restrict
jobs to 100% IB or 100% ethernet via SGE, or you may want to let them
mix, realizing that the overall parallel job may be slowed down to the
speed of the slowest network (e.g., ethernet).

Make sense?

--
Jeff Squyres
Cisco Systems
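The "restrict jobs to 100% IB or 100% ethernet" advice above can also be enforced at the Open MPI level with the btl MCA parameter, independently of SGE. A sketch; the hostfile names and the ./my_app binary are hypothetical placeholders:

```shell
# Force TCP (ethernet) transport only -- usable on all 31 nodes,
# including the 8 ethernet-only ones:
mpirun --mca btl tcp,self -np 8 -hostfile eth_hosts ./my_app

# Force the OpenFabrics transport (plus shared memory within a node)
# on the IB-connected nodes:
mpirun --mca btl openib,sm,self -np 16 -hostfile ib_hosts ./my_app
```

Without an explicit `--mca btl` setting, Open MPI picks the fastest available transport per node pair, so a job spanning both node groups falls back to TCP between them.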
[OMPI users] Cluster with IB hosts and Ethernet hosts
Hello all,

We've a cluster with 23 nodes connected to an IB switch and 8 nodes
connected to an ethernet switch. The master node is also connected to
the IB switch. SGE (with tight integration, -pe orte) is used for
parallel/serial job submission.

Open MPI 1.3 is installed on the master node with IB support
(--with-openib=/usr). The same folder is copied to the remaining 23 IB
nodes.

Now what shall I do for the remaining 8 ethernet nodes:
(1) Copy the same (IB) folder to these nodes
(2) Install Open MPI on one of the 8 ethernet nodes and copy the same to
the other 7 nodes
(3) Install an ethernet version of Open MPI on the master node and copy
it to the 8 nodes

Which of the above could solve the SGE ethernet/IB parallel job
submission?

Thanks,
Sangamesh
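On the SGE side, one common way to keep a job on a single fabric is to put the IB and ethernet nodes in separate cluster queues and target a queue at submission time. A sketch, assuming hypothetical queue names ib.q and eth.q and a job script job.sh:

```shell
# Submit a 16-slot Open MPI job under the orte PE, restricted to the
# (hypothetical) IB-only queue:
qsub -pe orte 16 -q ib.q job.sh

# Same idea for the ethernet-only nodes:
qsub -pe orte 8 -q eth.q job.sh
```

With tight integration, mpirun inside job.sh then launches only on the slots SGE granted, so the job never mixes fabrics.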