Hi Josh

It was my mistake. The port status of the node that generates the error is pasted below:

Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94fe
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      4: PortConfigurationTraining
        rate:            10 Gb/sec (4X)
Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94ff
        base lid:        0x29
        sm lid:          0x15

As you can see, one port is down. I have full sysadmin rights since I manage
the cluster, but I am not an expert. Could you explain a bit about the
ports? Does each InfiniBand card in a system have two physical ports?
What should I look for when a port's status is DOWN?
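
For reference, here is a minimal Python sketch of what "checking the port state" could look like when automated. It assumes the plain-text ibstatus format shown in this thread, and it uses the compute-01-00 output quoted further down as sample input. The function names (parse_ibstatus, usable_ports, if_include_value) are made up for illustration and are not part of Open MPI or OFED. It treats a port as usable only if the state is ACTIVE and the phys state is LinkUp, which is an assumption on my side about what mpirun needs.

```python
import re

# Header line printed by ibstatus for each port, e.g.
# "Infiniband device 'mlx4_0' port 1 status:"
HEADER = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")
# Indented "key: value" lines such as "state:           4: ACTIVE"
FIELD = re.compile(r"^\s+([a-z ]+):\s+(.*)$")

# Sample input: the compute-01-00 output quoted in this thread.
SAMPLE = """\
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
        base lid:        0x5
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
"""


def parse_ibstatus(text):
    """Return a list of dicts, one per port, from ibstatus text output."""
    ports = []
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            ports.append({"device": header.group(1),
                          "port": int(header.group(2))})
            continue
        field = FIELD.match(line)
        if field and ports:
            # Attach the field to the most recently seen port.
            ports[-1][field.group(1).strip()] = field.group(2).strip()
    return ports


def usable_ports(ports):
    """Keep ports whose state is ACTIVE and whose phys state is LinkUp."""
    return [
        p for p in ports
        if p.get("state", "").endswith("ACTIVE")
        and p.get("phys state", "").endswith("LinkUp")
    ]


def if_include_value(ports):
    """Format ports as a comma-separated device:port list."""
    return ",".join(f"{p['device']}:{p['port']}" for p in ports)


print(if_include_value(usable_ports(parse_ibstatus(SAMPLE))))  # mlx4_0:1
```

The printed value has the same device:port form as the btl_openib_if_include argument Josh suggested below. Running the same check on the ibstatus output of every node in the job would show which nodes have no usable port at all.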

Ahsan

On Tue, Jul 22, 2014 at 6:14 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:

>  Sayed,
>
> You might try this link (or have your sysadmin do it if you do not have
> admin privileges.) To me it looks like your second port is in the "INIT"
> state but has not been added by the subnet manager.
>
>
> https://software.intel.com/en-us/articles/troubleshooting-infiniband-connection-issues-using-ofed-tools
>
> You might also try running only over port 1 with the MCA parameter:
>
> -mca btl_openib_if_include mlx4_0:1
>
> Hope this helps.
>
> Josh
>
>
>  On Tue, Jul 22, 2014 at 12:10 AM, Syed Ahsan Ali <ahsansha...@gmail.com>
> wrote:
>
>>  And where can I find the run/job/submission?
>>
>>  On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel <sham...@ornl.gov> wrote:
>>
>>>
>>> You have to check the port states on *all* nodes in the
>>> run/job/submission. Checking on a single node is not enough.
>>> My guess is that 01-00 tries to connect to 01-01 and the ports are
>>> down on 01-01.
>>>
>>> You may disable support for infiniband by adding --mca btl ^openib.
>>>
>>> Best,
>>> Pavel (Pasha) Shamis
>>> ---
>>> Computer Science Research Group
>>> Computer Science and Math Division
>>> Oak Ridge National Laboratory
>>>
>>>
>>> On Jul 21, 2014, at 3:17 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>>>
>>> Dear All
>>>
>>> I need your help with a cluster-related issue that causes mpirun to
>>> fail. I get the following warning for some of the nodes, followed by a
>>> "No route to host" message, and then mpirun fails.
>>>
>>>
>>> WARNING: There is at least one OpenFabrics device found but there are no
>>> active ports detected (or Open MPI was unable to use them).  This
>>> is most certainly not what you wanted.  Check your cables, subnet
>>> manager configuration, etc.  The openib BTL will be ignored for this
>>> job.
>>>    Local host: compute-01-01.private.dns.zone
>>>
>>> --------------------------------------------------------------------------
>>>    SETUP OF THE LM
>>>      INITIALIZATIONS
>>>      INPUT OF THE NAMELISTS
>>> [pmd.pakmet.com:30198] 7 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
>>> [pmd.pakmet.com:30198] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>> [compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.108.14 failed: No route to host (113)
>>> [compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>> connect() to 192.168.108.14 failed: No route to host (113)
>>> My questions are:
>>> I don't pass any flags for running Open MPI over InfiniBand, so why
>>> does it still give this warning? If the InfiniBand ports are not active,
>>> it should run the job over the cluster's gigabit Ethernet. Why is it
>>> unable to find a route when the node can be pinged and reached over ssh
>>> from the other nodes and from the master node as well?
>>> The ibstatus output of the above node (for which I was getting the
>>> error) shows that both ports are up. What is causing the error, then?
>>>
>>> [root@compute-01-00 ~]# ibstatus
>>> Infiniband device 'mlx4_0' port 1 status:
>>>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
>>>         base lid:        0x5
>>>         sm lid:          0x1
>>>         state:           4: ACTIVE
>>>         phys state:      5: LinkUp
>>>         rate:            20 Gb/sec (4X DDR)
>>> Infiniband device 'mlx4_0' port 2 status:
>>>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
>>>         base lid:        0x0
>>>         sm lid:          0x0
>>>         state:           2: INIT
>>>         phys state:      5: LinkUp
>>>         rate:            20 Gb/sec (4X DDR)
>>>
>>>
>>> Thank you in advance for your guidance and support.
>>>
>>> Regards
>>>
>>> --
>>> Ahsan
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2014/07/24833.php
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2014/07/24835.php
>>>
>>
>>
>>
>> --
>> Syed Ahsan Ali Bokhari
>> Electronic Engineer (EE)
>>
>> Research & Development Division
>> Pakistan Meteorological Department H-8/4, Islamabad.
>> Phone # off  +92518358714
>> Cell # +923155145014
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/07/24842.php
>>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24844.php
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014
