Dear All

I need your help with a cluster-related issue that is causing mpirun to fail. On some of the nodes I get the following warning, and then a "No route to host" message appears and the mpirun job fails.



WARNING: There is at least one OpenFabrics device found but there are no
active ports detected (or Open MPI was unable to use them).  This
is most certainly not what you wanted.  Check your cables, subnet
manager configuration, etc.  The openib BTL will be ignored for this
job.

   Local host: compute-01-01.private.dns.zone
--------------------------------------------------------------------------

   SETUP OF THE LM
     INITIALIZATIONS
     INPUT OF THE NAMELISTS

[pmd.pakmet.com:30198] 7 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[pmd.pakmet.com:30198] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
[compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)

My questions are:

I do not pass any flags to run Open MPI over InfiniBand, so why does it still give this warning? If the InfiniBand ports are not active, it should simply start the job over the cluster's gigabit Ethernet. And why is it unable to find a route, when the node can be pinged and reached over ssh from the other nodes as well as from the master node?
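
For reference, the job is currently launched without any interconnect-related MCA flags. Would something along these lines be the right way to force the job onto the gigabit Ethernet only? This is only a sketch: the process count, hostfile, executable name and the eth0 interface name are placeholders, not my exact command.

  # Option 1: exclude only the openib BTL (other BTLs such as tcp, sm, self stay available)
  mpirun --mca btl ^openib -np 16 -hostfile hosts ./lm_exe

  # Option 2: name the BTLs explicitly and pin TCP to the GigE interface
  # (replace eth0 with the actual gigabit interface name)
  mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -np 16 -hostfile hosts ./lm_exe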

The ibstatus output on the node for which I was getting the error shows that both ports have their physical link up. What is causing the error then?


[root@compute-01-00 ~]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
        base lid:        0x5
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
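
To narrow this down, are these the right kinds of checks to run from compute-01-00 toward the address in the "No route to host" message? (192.168.108.14 is taken directly from the log above; the commands below are ordinary Linux tools, shown only as a sketch of what I would try.)

  # Which local interface and route would be used to reach the failing address?
  ip addr show
  ip route get 192.168.108.14

  # Basic reachability and firewall rules (a restrictive iptables policy on the
  # compute nodes could also be rejecting the TCP BTL's connections)
  ping -c 3 192.168.108.14
  iptables -L -n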


Thank you in advance for your guidance and support.

Regards

-- 
Ahsan
