You have to check the port state on *all* nodes in the run/job/submission; checking a single node is not enough. My guess is that compute-01-00 tries to connect to compute-01-01 and the ports are down on compute-01-01.
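A quick way to check every node at once is something like the following (a minimal sketch, assuming passwordless ssh and a plain-text list of hostnames in a file called "hosts" -- both names are placeholders, not taken from your setup):

    # query the InfiniBand port state on every node in the job
    for h in $(cat hosts); do
        echo "=== $h ==="
        ssh "$h" ibstatus | grep -E "Infiniband device|state:"
    done

A port that the openib BTL can actually use must show "state: 4: ACTIVE"; a port in "2: INIT" has its physical link up but has not been brought active by the subnet manager.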
You may disable support for InfiniBand by adding --mca btl ^openib to the mpirun command line (a sample command line is sketched after the quoted message below).

Best,
Pavel (Pasha) Shamis
---
Computer Science Research Group
Computer Science and Math Division
Oak Ridge National Laboratory


On Jul 21, 2014, at 3:17 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:

Dear All

I need your help to solve this cluster-related issue, which is causing mpirun to malfunction. I get the following warning for some of the nodes, and then the route-failure messages appear and mpirun fails.

WARNING: There is at least one OpenFabrics device found but there are no
active ports detected (or Open MPI was unable to use them). This is most
certainly not what you wanted. Check your cables, subnet manager
configuration, etc. The openib BTL will be ignored for this job.

  Local host: compute-01-01.private.dns.zone
--------------------------------------------------------------------------

SETUP OF THE LM INITIALIZATIONS
INPUT OF THE NAMELISTS

[pmd.pakmet.com:30198] 7 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[pmd.pakmet.com:30198] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
[compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)

My questions are:

1. I don't include any flags for running Open MPI over InfiniBand, so why does it still give this warning? If the InfiniBand ports are not active, it should simply start the job over the cluster's gigabit Ethernet.
2. Why is it unable to find the route, while the node can be pinged and reached over ssh from the other nodes and from the master node as well?
3. The ibstatus of the above node (for which I was getting the error) shows that both ports are up. What is causing the error then?

[root@compute-01-00 ~]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
        base lid:        0x5
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

Thank you in advance for your guidance and support.

Regards
--
Ahsan
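For reference, the "--mca btl ^openib" option suggested above goes directly on the mpirun command line. A minimal sketch (the process count, hostfile name, and executable are placeholders, not taken from the post above):

    # exclude the openib BTL so the job does not use the InfiniBand verbs transport
    mpirun --mca btl ^openib -np 16 -hostfile hosts ./lm_executable

The "^" prefix tells Open MPI to exclude the listed component, so the job is set up with the remaining BTLs (tcp, sm, self) instead of openib.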