And where can I find the run/job/submission?

On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel <sham...@ornl.gov> wrote:
> You have to check the port states on *all* nodes in the
> run/job/submission. Checking on a single node is not enough.
> My guess is that 01-00 tries to connect to 01-01, and the ports are down
> on 01-01.
>
> You may disable support for InfiniBand by adding --mca btl ^openib.
>
> Best,
> Pavel (Pasha) Shamis
> ---
> Computer Science Research Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
>
>
> On Jul 21, 2014, at 3:17 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
> Dear All
>
> I need your help to solve this cluster-related issue, which is causing
> mpirun to fail. I get the following warning for some of the nodes, and
> then a route-failure message appears and mpirun aborts:
>
> WARNING: There is at least one OpenFabrics device found but there are no
> active ports detected (or Open MPI was unable to use them). This
> is most certainly not what you wanted. Check your cables, subnet
> manager configuration, etc. The openib BTL will be ignored for this
> job.
>
> Local host: compute-01-01.private.dns.zone
> --------------------------------------------------------------------------
> SETUP OF THE LM
> INITIALIZATIONS
> INPUT OF THE NAMELISTS
> [pmd.pakmet.com:30198] 7 more processes have sent help message
> help-mpi-btl-openib.txt / no active ports found
> [pmd.pakmet.com:30198] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> [compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
> [compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
>
> My questions are:
>
> I don't include any flags for running Open MPI over InfiniBand, so why
> does it still give this warning?
> If the InfiniBand ports are not active, it should start the job over the
> cluster's gigabit Ethernet instead. Why is it unable to find the route,
> when the node can be pinged and reached via ssh from the other nodes and
> from the master node as well?
>
> The ibstatus of the node above (for which I was getting the error) shows
> that both ports are up. What is causing the error then?
>
> [root@compute-01-00 ~]# ibstatus
> Infiniband device 'mlx4_0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
>         base lid:        0x5
>         sm lid:          0x1
>         state:           4: ACTIVE
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
>
> Infiniband device 'mlx4_0' port 2 status:
>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
>         base lid:        0x0
>         sm lid:          0x0
>         state:           2: INIT
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
>
> Thank you in advance for your guidance and support.
>
> Regards
>
> --
> Ahsan
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24833.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24835.php

--
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)
Research & Development Division
Pakistan Meteorological Department
H-8/4, Islamabad.
Phone # off: +92518358714
Cell #: +923155145014
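[Editor's note] Pasha's advice to check port states on *all* nodes in the run can be scripted. A minimal sketch, assuming passwordless ssh to the compute nodes; the node names below are placeholders for whatever your hostfile or scheduler (e.g. $PBS_NODEFILE or $SLURM_JOB_NODELIST) lists for the job:

```shell
#!/bin/sh
# Hypothetical sketch: report IB port state on every node of the run.
# Node names are placeholders; take the real list from your hostfile.
for node in compute-01-00 compute-01-01; do
  echo "== $node =="
  # A usable port needs 'state: 4: ACTIVE'; 'phys state: 5: LinkUp'
  # alone is not enough (port 2 above is LinkUp but stuck in INIT).
  ssh "$node" ibstatus | grep -E "port [0-9]+ status|state:"
done
```

A port that is LinkUp but INIT usually means the subnet manager has not brought it to ACTIVE, which matches the "no active ports" warning above.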
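[Editor's note] For reference, Pasha's suggested workaround as a full command line, together with one MCA parameter that often accompanies it: restricting the TCP BTL to the Ethernet interface, so Open MPI stops trying the IB address (192.168.108.14) that produces the "No route to host" errors. This is a sketch only; the interface name eth0, the process count, and ./my_app are assumptions to adapt to your setup:

```shell
# Sketch: interface name, -np value, and ./my_app are placeholders.
# ^openib disables the InfiniBand BTL; btl_tcp_if_include keeps the TCP
# BTL on the gigabit Ethernet interface instead of the IB IPs.
mpirun --mca btl ^openib \
       --mca btl_tcp_if_include eth0 \
       -np 8 ./my_app
```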