It seems that the network was not consistently wired.
Port DOWN means that the port is not wired (or the cable is bad). Moreover, on 
some nodes port 1 is connected, while on others it is port 2.
My concern is that they are not connected to the same subnet. If you have at 
least one port on each node connected to the same subnet, you should be able 
to get it running with the "--mca btl_openib_max_btls 1" flag. If that does 
not work for you, it means that you have a serious issue with your network 
and you have to review the configuration of your switches and the wiring of 
your machines.
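For example, a minimal invocation could look like this (the hostfile name, 
process count, and application name below are only placeholders, not taken 
from this thread):

    # restrict the openib BTL to one port per HCA
    mpirun --mca btl_openib_max_btls 1 --hostfile myhosts -np 16 ./my_app

If you know which port is cabled on every node, another option worth trying 
is to point the openib BTL at that port explicitly, e.g. 
"--mca btl_openib_if_include mlx4_0:2".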

Best,

Pavel (Pasha) Shamis
---
Computer Science Research Group
Computer Science and Math Division
Oak Ridge National Laboratory






On Jul 22, 2014, at 11:46 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:

Dear Pasha

The ibstatus output was not from two different machines; it was from the same 
machine. There are two InfiniBand ports showing up on all nodes. I checked on 
all the nodes that one of the ports is always in INIT state and the other one 
is ACTIVE. Now please see below the ibstatus of the problem-causing node 
(compute-01-01). One of its ports is DOWN. Maybe that is the reason for the 
error? Is it a physical port?

[root@compute-01-01 ~]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94fe
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      4: PortConfigurationTraining
        rate:            10 Gb/sec (4X)
Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0018:8b90:97fe:94ff
        base lid:        0x29
        sm lid:          0x15
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
On Tue, Jul 22, 2014 at 6:50 PM, Shamis, Pavel <sham...@ornl.gov> wrote:
Hmm, this does not make sense.
Your copy-and-paste shows that both machines (00 and 01) have the same 
GUID/LID (roughly the equivalent of a MAC address in the Ethernet world).
As you can guess, these two cannot be identical on two different machines 
(unless you moved the card around).
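A quick way to confirm this, assuming passwordless ssh to the compute nodes 
and using the hostnames from this thread, is to pull the gid lines from both 
machines side by side; the port GUID is embedded there, so identical values 
on two different hosts would confirm the mix-up:

    # compare the default gid (and therefore the port GUID) of the two hosts
    for n in compute-01-00 compute-01-01; do
        echo "== $n =="
        ssh $n "ibstatus | grep gid"
    done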

Best,
Pasha

On Jul 21, 2014, at 11:26 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:

Yes, I had checked by running mpirun on the nodes one by one to find the 
problematic one. I had already mentioned that compute-01-01 is causing the 
problem; when I remove it from the hostlist, mpirun works fine. Here is the 
ibstatus of compute-01-01.

Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
        base lid:        0x5
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)


On Mon, Jul 21, 2014 at 6:57 PM, Shamis, Pavel <sham...@ornl.gov> wrote:

You have to check the port states on *all* nodes in the run/job/submission. 
Checking a single node is not enough.
My guess is that 01-00 tries to connect to 01-01 and the ports are down on 
01-01.
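If your hostfile lists one node per line, a small loop such as the following 
can collect the port states from every node (the hostfile name is a 
placeholder, and passwordless ssh to the compute nodes is assumed):

    # print the device/port header and the state lines from ibstatus on every node
    for n in $(cat myhosts); do
        echo "== $n =="
        ssh $n "ibstatus | grep -E 'port [12]|state'"
    done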

You may disable support for InfiniBand by adding "--mca btl ^openib".
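For instance, again with placeholder hostfile and application names:

    # skip the openib BTL entirely and fall back to TCP (gigabit Ethernet) and shared memory
    mpirun --mca btl ^openib --hostfile myhosts -np 16 ./my_app

That should at least let the job run over the gigabit Ethernet while the 
InfiniBand wiring is sorted out.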

Best,
Pavel (Pasha) Shamis
---
Computer Science Research Group
Computer Science and Math Division
Oak Ridge National Laboratory






On Jul 21, 2014, at 3:17 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:

Dear All

I need your help to solve this cluster-related issue that is causing mpirun 
to fail. I get the following warning for some of the nodes, and then the 
route failure message appears, causing mpirun to fail.


WARNING: There is at least one OpenFabrics device found but there are no active 
ports detected (or Open MPI was unable to use them).  This
is most certainly not what you wanted.  Check your cables, subnet
manager configuration, etc.  The openib BTL will be ignored for this
job.
   Local host: compute-01-01.private.dns.zone
--------------------------------------------------------------------------
   SETUP OF THE LM
     INITIALIZATIONS
     INPUT OF THE NAMELISTS
[pmd.pakmet.com:30198] 7 more processes have sent help message 
help-mpi-btl-openib.txt / no active ports found
[pmd.pakmet.com:30198] Set MCA parameter "orte_base_help_aggregate" to 0 to 
see all help / error messages
[compute-01-00.private.dns.zone][[40500,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
[compute-01-00.private.dns.zone][[40500,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.14 failed: No route to host (113)
My questions are:
1. I don't include any flags for running Open MPI over InfiniBand, so why 
does it still give this warning? If the InfiniBand ports are not active, it 
should start the job over the cluster's gigabit Ethernet.
2. Why is it unable to find the route, while the node can be pinged and 
reached over ssh from the other nodes and from the master node as well?
3. The ibstatus of the node above (for which I was getting the error) shows 
that both ports are up. What is causing the error then?

[root@compute-01-00 ~]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c61
        base lid:        0x5
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0024:e890:97ff:1c62
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)


Thank you in advance for your guidance and support.

Regards

--
Ahsan




--
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off +92518358714
Cell # +923155145014




--
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014
