Ahsan,

This link might be helpful in trying to diagnose and treat IB fabric issues:

http://docs.oracle.com/cd/E18476_01/doc.220/e18478/fabric.htm#CIHIHJGD

You might try resetting the problematic port, or just use port 2 for your
jobs as a quick workaround:

-mca btl_openib_if_include mlx4_0:2
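
If you would rather check which ports are usable on a node before hard-coding one, a small script along these lines might help. This is only a rough sketch: the function names (active_ports, if_include_args) are made up for illustration, and it assumes the ibstatus output format quoted further down in this thread. It picks every port whose state is ACTIVE and builds the matching btl_openib_if_include argument.

```python
import re

# Sample copied from the ibstatus output for compute-01-01 quoted in this thread.
SAMPLE = """\
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94fe
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      4: PortConfigurationTraining
        rate:            10 Gb/sec (4X)
Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94ff
        base lid:        0x29
        sm lid:          0x15
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
"""

# Matches the per-port header line, capturing device name and port number.
HEADER = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")
# Matches the logical "state:" line only; "phys state:" does not match
# because the line must start with optional whitespace followed by "state:".
STATE = re.compile(r"^\s*state:\s+\d+:\s+(\w+)")


def active_ports(ibstatus_text):
    """Return 'device:port' strings for every port whose state is ACTIVE."""
    result = []
    current = None
    for line in ibstatus_text.splitlines():
        header = HEADER.search(line)
        if header:
            current = f"{header.group(1)}:{header.group(2)}"
            continue
        state = STATE.match(line)
        if state and current and state.group(1) == "ACTIVE":
            result.append(current)
    return result


def if_include_args(ibstatus_text):
    """Build mpirun arguments restricting the openib BTL to ACTIVE ports."""
    ports = active_ports(ibstatus_text)
    if not ports:
        return []  # nothing usable; caller may prefer to disable openib instead
    return ["--mca", "btl_openib_if_include", ",".join(ports)]


if __name__ == "__main__":
    print(" ".join(if_include_args(SAMPLE)))
```

On a real node you would feed it the text captured from running ibstatus rather than the embedded sample. Keep in mind that a port being ACTIVE on every node does not by itself prove the ports are on the same subnet.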

Josh



On Wed, Jul 23, 2014 at 11:02 AM, Shamis, Pavel <sham...@ornl.gov> wrote:

> It seems that the network is not consistently wired.
> Port DOWN means that the port is not wired (or has a bad cable). Moreover, on
> some nodes port 1 is connected, and on others port 2.
> My concern is that they are not connected to the same subnet. If you have
> at least one port on each node connected to the same subnet,
> you should be able to get it running with the "--mca btl_openib_max_btls 1"
> flag. If that does not work for you, it means that you
> have a serious issue with your network, and you have to review the
> configuration of your switches and the wiring of your machines.
>
> Best,
>
> Pavel (Pasha) Shamis
> ---
> Computer Science Research Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
>
>
>
>
>
>
> On Jul 22, 2014, at 11:46 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
> Dear Pasha
>
> The ibstatus output is not from two different machines; it is from the same
> machine. There are two InfiniBand ports showing up on all nodes. I checked on
> all the nodes that one of the ports is always in INIT state and the other one
> is ACTIVE. Now please see below the ibstatus of the problem-causing node
> (compute-01-01). One of its ports is down. Maybe this is the reason for the
> error? Is it a physical port?
>
> [root@compute-01-01 ~]# ibstatus
> Infiniband device 'mlx4_0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94fe
>         base lid:        0x0
>         sm lid:          0x0
>         state:           1: DOWN
>         phys state:      4: PortConfigurationTraining
>         rate:            10 Gb/sec (4X)
> Infiniband device 'mlx4_0' port 2 status:
>         default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94ff
>         base lid:        0x29
>         sm lid:          0x15
>         state:           4: ACTIVE
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
>
> On Tue, Jul 22, 2014 at 6:50 PM, Shamis, Pavel <sham...@ornl.gov> wrote:
>
> Hmm, this does not make sense.
> Your copy-and-paste shows that both machines (00 and 01) have the same
> GUID/LID (roughly the equivalent of a MAC address in the Ethernet world).
> As you can guess, these cannot be identical for two different machines
> (unless you moved the card around).
>
> Best,
> Pasha
>
> On Jul 21, 2014, at 11:26 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
> Yes, I had checked by running mpirun on all nodes one by one to find the
> problematic one. As I already mentioned, compute-01-01 is causing the
> problem; when I remove it from the hostlist, mpirun works fine. Here is
> the ibstatus of compute-01-01.
>
> Infiniband device 'mlx4_0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
>         base lid:        0x5
>         sm lid:          0x1
>         state:           4: ACTIVE
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
> Infiniband device 'mlx4_0' port 2 status:
>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
>         base lid:        0x0
>         sm lid:          0x0
>         state:           2: INIT
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
>
>
> On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel <sham...@ornl.gov> wrote:
>
> You have to check the port states on *all* nodes in the
> run/job/submission. Checking on a single node is not enough.
> My guess is that 01-00 tries to connect to 01-01 and the ports are down on
> 01-01.
>
> You may disable support for InfiniBand by adding --mca btl ^openib.
>
> Best,
> Pavel (Pasha) Shamis
> ---
> Computer Science Research Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
>
>
>
>
>
>
> On Jul 21, 2014, at 3:17 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
>
> Dear All
>
> I need your help to solve a cluster-related issue that is causing mpirun to
> malfunction. I get the following warning for some of the nodes, and then a
> route failure message appears, causing mpirun to fail.
>
>
> WARNING: There is at least one OpenFabrics device found but there are no
> active ports detected (or Open MPI was unable to use them).  This
> is most certainly not what you wanted.  Check your cables, subnet
> manager configuration, etc.  The openib BTL will be ignored for this
> job.
>    Local host: compute-01-01.private.dns.zone
> --------------------------------------------------------------------------
>    SETUP OF THE LM
>      INITIALIZATIONS
>      INPUT OF THE NAMELISTS
> [pmd.pakmet.com:30198] 7 more processes have sent help message
> help-mpi-btl-openib.txt / no active ports found
> [pmd.pakmet.com:30198] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> [compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
> [compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.14 failed: No route to host (113)
>
> My questions are:
> I don't include flags for running Open MPI over InfiniBand, so why does it
> still give the warning? If the InfiniBand ports are not active, then it should
> start the job over the cluster's gigabit Ethernet. Why is it unable to find
> the route when the node can be pinged and reached via ssh from other nodes and
> from the master node as well?
> The ibstatus of the above node (for which I was getting the error) shows that
> both ports are up. What is causing the error then?
>
> [root@compute-01-00 ~]# ibstatus
> Infiniband device 'mlx4_0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
>         base lid:        0x5
>         sm lid:          0x1
>         state:           4: ACTIVE
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
> Infiniband device 'mlx4_0' port 2 status:
>         default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
>         base lid:        0x0
>         sm lid:          0x0
>         state:           2: INIT
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
>
>
> Thank you in advance for your guidance and support.
>
> Regards
>
> --
> Ahsan
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24833.php
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
>
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off  +92518358714
> Cell # +923155145014
