Yes, I had checked by running mpirun on the nodes one by one to find the
problematic one. As I already mentioned, compute-01-01 is causing the
problem; when I remove it from the hostlist, mpirun works fine. Here is the
ibstatus of compute-01-01 (a loop for checking every node at once is
sketched below the output):

Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
        base lid:        0x5
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
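
Note that port 2 reports state "2: INIT" even though its physical state is
LinkUp, so only port 1 is fully ACTIVE. To run this check across the whole
hostlist in one go, a loop like the following works (a minimal sketch; it
assumes passwordless ssh and that the node names match the hostfile):

    # Print each IB port's logical and physical state on every node.
    # A port with "phys state: LinkUp" but "state: INIT" has a live link
    # that the subnet manager has not (yet) brought to ACTIVE.
    for node in compute-01-00 compute-01-01; do
        echo "=== $node ==="
        ssh "$node" ibstatus | grep -E "port [0-9]+ status|state:"
    done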


On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel <sham...@ornl.gov> wrote:

>
> You have to check the port states on *all* nodes in the
> run/job/submission. Checking a single node is not enough.
> My guess is that 01-00 tries to connect to 01-01 and the ports are down on
> 01-01.
>
> You may disable support for InfiniBand by adding --mca btl ^openib to the
> mpirun command line.
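>
> For example (a hedged sketch; the process count, hostfile name, and
> application binary are placeholders, not values from this thread):
>
>     # Exclude the openib BTL so Open MPI falls back to TCP (plus shared
>     # memory within a node) instead of trying the InfiniBand ports.
>     mpirun --mca btl ^openib -np 16 -hostfile hosts ./your_app
>
>     # If the TCP BTL then picks an unreachable interface (one possible
>     # cause of "No route to host"), restricting it may help, assuming
>     # the gigabit interface is eth0:
>     mpirun --mca btl ^openib --mca btl_tcp_if_include eth0 \
>            -np 16 -hostfile hosts ./your_app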
>
> Best,
> Pavel (Pasha) Shamis
> ---
> Computer Science Research Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
>
>
> On Jul 21, 2014, at 3:17 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
> Dear All
>
> I need your help with a cluster issue that is making mpirun fail. I get
> the following warning for some of the nodes, and then a route-failure
> message appears and mpirun fails.
>
>
> WARNING: There is at least one OpenFabrics device found but there are no
> active ports detected (or Open MPI was unable to use them).  This
> is most certainly not what you wanted.  Check your cables, subnet
> manager configuration, etc.  The openib BTL will be ignored for this
> job.
>    Local host: compute-01-01.private.dns.zone
> --------------------------------------------------------------------------
>    SETUP OF THE LM
>      INITIALIZATIONS
>      INPUT OF THE NAMELISTS
> [pmd.pakmet.com:30198] 7 more processes have sent help message
> help-mpi-btl-openib.txt / no active ports found
> [pmd.pakmet.com:30198] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> [compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
> [compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
> My questions are:
>
> 1. I don't pass any flags to run Open MPI over InfiniBand, so why does it
> still print this warning? If the InfiniBand ports are not active, the job
> should simply start over the cluster's gigabit ethernet.
> 2. Why is mpirun unable to find a route to the node, when the node can be
> pinged and reached over ssh from the other nodes and from the master node?
> 3. The ibstatus of the node for which I was getting the error (shown
> below) shows that both ports are up. What is causing the error then?
>
> [root@compute-01-00 ~]# ibstatus
> Infiniband device 'mlx4_0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
>         base lid:        0x5
>         sm lid:          0x1
>         state:           4: ACTIVE
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
> Infiniband device 'mlx4_0' port 2 status:
>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
>         base lid:        0x0
>         sm lid:          0x0
>         state:           2: INIT
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
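>
> To see why connect() to 192.168.108.14 fails even though the node answers
> ping, the checks below may help (a hedged sketch; which node and interface
> own 192.168.108.14 is an assumption to verify, not a fact from the logs):
>
>     # On compute-01-00: which local interface and route would be used?
>     ip route get 192.168.108.14
>     # On the node that owns the address: which interface carries it?
>     # If it is the IPoIB interface (often ib0) of a port stuck in INIT,
>     # TCP connections over it can fail even though ethernet ping works.
>     ip addr | grep -B2 192.168.108.14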
>
>
> Thank you in advance for your guidance and support.
>
> Regards
>
> --
> Ahsan
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24833.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24835.php
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014
