Ok ok I can disable that as well.
Thank you guys. :)
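
For reference, I will try something like the following next (just a sketch,
assuming btl_tcp_if_exclude accepts a comma-separated list so that the
loopback interface can be excluded together with ib0):

# exclude both lo and ib0 from the tcp btl
mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_exclude lo,ib0 ring_c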

On Thu, Nov 13, 2014 at 12:50 PM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
> Now it tries to connect through the loopback address
>
> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 --mca
> btl_tcp_if_exclude ib0 ring_c
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> [compute-01-01.private.dns.zone][[37713,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 127.0.0.1 failed: Connection refused (111)
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> [pmd.pakmet.com:30867] 1 more process has sent help message
> help-mpi-btl-openib.txt / no active ports found
> [pmd.pakmet.com:30867] Set MCA parameter "orte_base_help_aggregate" to
> 0 to see all help / error messages
>
>
>
> On Thu, Nov 13, 2014 at 12:46 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>> --mca btl ^openib
>> disables the openib btl, which is native infiniband only.
>>
>> ib0 is treated like any other TCP interface and is then handled by the tcp btl
>>
>> another option is for you to use
>> --mca btl_tcp_if_exclude ib0
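>> for example, the full command line would look something like this (a sketch
>> reusing the hosts from your previous runs):
>>
>> mpirun --host compute-01-01,compute-01-06 --mca btl_tcp_if_exclude ib0 ring_c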
>>
>> On 2014/11/13 16:43, Syed Ahsan Ali wrote:
>>> You are right, it is running on the 10.0.0.0 interface:
>>>
>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>>> btl_tcp_if_include 10.0.0.0/8 ring_c
>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>> Process 0 sent to 1
>>> Process 0 decremented value: 9
>>> Process 0 decremented value: 8
>>> Process 0 decremented value: 7
>>> Process 0 decremented value: 6
>>> Process 1 exiting
>>> Process 0 decremented value: 5
>>> Process 0 decremented value: 4
>>> Process 0 decremented value: 3
>>> Process 0 decremented value: 2
>>> Process 0 decremented value: 1
>>> Process 0 decremented value: 0
>>> Process 0 exiting
>>> [pmdtest@pmd ~]$
>>>
>>> The 192.168.108.* ip addresses belong to the ib0 interface.
>>>
>>>  [root@compute-01-01 ~]# ifconfig
>>> eth0      Link encap:Ethernet  HWaddr 00:24:E8:59:4C:2A
>>>           inet addr:10.0.0.3  Bcast:10.255.255.255  Mask:255.0.0.0
>>>           inet6 addr: fe80::224:e8ff:fe59:4c2a/64 Scope:Link
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           RX packets:65588 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:14184 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:1000
>>>           RX bytes:18692977 (17.8 MiB)  TX bytes:1834122 (1.7 MiB)
>>>           Interrupt:169 Memory:dc000000-dc012100
>>> ib0       Link encap:InfiniBand  HWaddr
>>> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>>           inet addr:192.168.108.14  Bcast:192.168.108.255  
>>> Mask:255.255.255.0
>>>           UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:256
>>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>>
>>>
>>>
>>> So the question is: why is mpirun still following the ib path when it has
>>> been disabled? Possible solutions?
>>>
>>> On Thu, Nov 13, 2014 at 12:32 PM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>> mpirun complains about the 192.168.108.10 ip address, but ping reports a
>>>> 10.0.0.8 address
>>>>
>>>> is the 192.168.* network a point-to-point network (for example between a
>>>> host and a mic), so that two nodes cannot ping each other via this address ?
>>>> /* e.g. from compute-01-01 can you ping the 192.168.108.* ip address of
>>>> compute-01-06 ? */
>>>>
>>>> could you also run
>>>>
>>>> mpirun --mca btl ^openib --host compute-01-01,compute-01-06 --mca
>>>> btl_tcp_if_include 10.0.0.0/8 ring_c
>>>>
>>>> and see whether it helps ?
>>>>
>>>>
>>>> On 2014/11/13 16:24, Syed Ahsan Ali wrote:
>>>>> Same result in both cases
>>>>>
>>>>> [pmdtest@pmd ~]$ mpirun --mca btl ^openib --host
>>>>> compute-01-01,compute-01-06 ring_c
>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>> Process 0 sent to 1
>>>>> Process 0 decremented value: 9
>>>>> [compute-01-01.private.dns.zone][[47139,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>
>>>>>
>>>>> [pmdtest@compute-01-01 ~]$ mpirun --mca btl ^openib --host
>>>>> compute-01-01,compute-01-06 ring_c
>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>> Process 0 sent to 1
>>>>> Process 0 decremented value: 9
>>>>> [compute-01-01.private.dns.zone][[11064,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>
>>>>>
>>>>> On Thu, Nov 13, 2014 at 12:11 PM, Gilles Gouaillardet
>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> it seems you messed up the command line
>>>>>>
>>>>>> could you try
>>>>>>
>>>>>> $ mpirun --mca btl ^openib --host compute-01-01,compute-01-06 ring_c
>>>>>>
>>>>>>
>>>>>> can you also try to run mpirun from a compute node instead of the head
>>>>>> node ?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 2014/11/13 16:07, Syed Ahsan Ali wrote:
>>>>>>> Here is what I see when disabling openib support:
>>>>>>>
>>>>>>>
>>>>>>> [pmdtest@pmd ~]$ mpirun --host --mca btl ^openib
>>>>>>> compute-01-01,compute-01-06 ring_c
>>>>>>> ssh:  orted: Temporary failure in name resolution
>>>>>>> ssh:  orted: Temporary failure in name resolution
>>>>>>> --------------------------------------------------------------------------
>>>>>>> A daemon (pid 7608) died unexpectedly with status 255 while attempting
>>>>>>> to launch so we are aborting.
>>>>>>>
>>>>>>> Meanwhile, the nodes can still ssh to each other:
>>>>>>>
>>>>>>> [pmdtest@compute-01-01 ~]$ ssh compute-01-06
>>>>>>> Last login: Thu Nov 13 12:05:58 2014 from compute-01-01.private.dns.zone
>>>>>>> [pmdtest@compute-01-06 ~]$
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 13, 2014 at 12:03 PM, Syed Ahsan Ali 
>>>>>>> <ahsansha...@gmail.com> wrote:
>>>>>>>> Hi Jeff,
>>>>>>>>
>>>>>>>> No firewall is enabled. Running the diagnostics, I found that the
>>>>>>>> non-communication MPI job (hello_c) runs fine, while ring_c remains stuck.
>>>>>>>> There are of course warnings about OpenFabrics, but in my case I am running
>>>>>>>> the application with openib disabled. Please see below:
>>>>>>>>
>>>>>>>>  [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 hello_c.out
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>> no active ports detected (or Open MPI was unable to use them).  This
>>>>>>>> is most certainly not what you wanted.  Check your cables, subnet
>>>>>>>> manager configuration, etc.  The openib BTL will be ignored for this
>>>>>>>> job.
>>>>>>>>   Local host: compute-01-01.private.dns.zone
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> Hello, world, I am 0 of 2
>>>>>>>> Hello, world, I am 1 of 2
>>>>>>>> [pmd.pakmet.com:06386] 1 more process has sent help message
>>>>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>>>>> [pmd.pakmet.com:06386] Set MCA parameter "orte_base_help_aggregate" to
>>>>>>>> 0 to see all help / error messages
>>>>>>>>
>>>>>>>> [pmdtest@pmd ~]$ mpirun --host compute-01-01,compute-01-06 ring_c
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> WARNING: There is at least one OpenFabrics device found but there are
>>>>>>>> no active ports detected (or Open MPI was unable to use them).  This
>>>>>>>> is most certainly not what you wanted.  Check your cables, subnet
>>>>>>>> manager configuration, etc.  The openib BTL will be ignored for this
>>>>>>>> job.
>>>>>>>>   Local host: compute-01-01.private.dns.zone
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>>>>>> Process 0 sent to 1
>>>>>>>> Process 0 decremented value: 9
>>>>>>>> [compute-01-01.private.dns.zone][[54687,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>> [pmd.pakmet.com:15965] 1 more process has sent help message
>>>>>>>> help-mpi-btl-openib.txt / no active ports found
>>>>>>>> [pmd.pakmet.com:15965] Set MCA parameter "orte_base_help_aggregate" to
>>>>>>>> 0 to see all help / error messages
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 12, 2014 at 7:32 PM, Jeff Squyres (jsquyres)
>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>> Do you have firewalling enabled on either server?
>>>>>>>>>
>>>>>>>>> See this FAQ item:
>>>>>>>>>
>>>>>>>>>     
>>>>>>>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
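>>>>>>>>>
>>>>>>>>> For example, on a RHEL/CentOS-style node you could check with something
>>>>>>>>> like this (a sketch; the exact commands depend on your distribution):
>>>>>>>>>
>>>>>>>>>     # run as root on each compute node
>>>>>>>>>     service iptables status
>>>>>>>>>     iptables -L -n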
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 12, 2014, at 4:57 AM, Syed Ahsan Ali <ahsansha...@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Dear All
>>>>>>>>>>
>>>>>>>>>> I need your advice. While trying to run an mpirun job across nodes I get
>>>>>>>>>> the following error. It seems that the two nodes, i.e. compute-01-01 and
>>>>>>>>>> compute-01-06, are not able to communicate with each other, although the
>>>>>>>>>> nodes can see each other via ping.
>>>>>>>>>>
>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ mpirun -np 16 -hostfile hostlist --mca btl
>>>>>>>>>> ^openib ../bin/regcmMPICLM45 regcm.in
>>>>>>>>>>
>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>> [compute-01-06.private.dns.zone][[48897,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>> connect() to 192.168.108.14 failed: No route to host (113)
>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],10][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],12][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>> [compute-01-01.private.dns.zone][[48897,1],14][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>> connect() to 192.168.108.10 failed: No route to host (113)
>>>>>>>>>>
>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>
>>>>>>>>>> [pmdtest@pmd ERA_CLM45]$ ssh compute-01-01
>>>>>>>>>> Last login: Wed Nov 12 09:48:53 2014 from pmd-eth0.private.dns.zone
>>>>>>>>>> [pmdtest@compute-01-01 ~]$ ping compute-01-06
>>>>>>>>>> PING compute-01-06.private.dns.zone (10.0.0.8) 56(84) bytes of data.
>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=1
>>>>>>>>>> ttl=64 time=0.108 ms
>>>>>>>>>> 64 bytes from compute-01-06.private.dns.zone (10.0.0.8): icmp_seq=2
>>>>>>>>>> ttl=64 time=0.088 ms
>>>>>>>>>>
>>>>>>>>>> --- compute-01-06.private.dns.zone ping statistics ---
>>>>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>>>>>>> rtt min/avg/max/mdev = 0.088/0.098/0.108/0.010 ms
>>>>>>>>>> [pmdtest@compute-01-01 ~]$
>>>>>>>>>>
>>>>>>>>>> Thanks in advance.
>>>>>>>>>>
>>>>>>>>>> Ahsan



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014
