On 11.11.2014, at 02:12, Gilles Gouaillardet wrote:

> Hi,
> 
> IIRC there were some bug fixes between 1.8.1 and 1.8.2 in order to really use 
> all the published interfaces.
> 
> By any chance, are you running a firewall on your head node?

Yes, but only on the interface to the outside world. Nevertheless I switched it 
off, and the result was the same 2-minute delay during startup.


> one possible explanation is that the compute node tries to access the public 
> interface of the head node, and the packets get dropped by the firewall.
> 
> if you are running a firewall, can you make a test without it ?
> /* if you do need NAT, then just remove the DROP and REJECT rules */
> 
> another possible explanation is that the compute node is doing (reverse) DNS 
> requests with the public name and/or IP of the head node, and that takes some 
> time to complete (success or failure does not really matter here)

In the machinefile I tried both the internal and the external name of the head 
node, i.e. different names for different interfaces. The result is the same.
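
(Just to rule out slow (reverse) DNS on the compute node side, the lookups can 
be checked there directly - "node01" below is only a placeholder for the compute 
node's prompt, the addresses are the head node's from further down:

node01:~> getent hosts annemarie         # forward lookup of the head node
node01:~> getent hosts 192.168.154.30    # reverse lookup of its internal address

If both return instantly, slow DNS should be off the table.)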


> /* a simple test is to make sure all the hosts/ip of the head node are in the 
> /etc/hosts of the compute node */
> 
> could you check your network config (firewall and dns) ?
> 
> can you reproduce the delay when running mpirun on the head node and with one 
> mpi task on the compute node ?

You mean one on the head node and one on the compute node, as opposed to two + 
two in my initial test?

Sure, but with 1+1 I get the same result.


> if yes, then the hard way to trace the delay issue would be to strace -ttt 
> both the orted and the MPI task that are launched on the compute node and see 
> where the time is lost.
> /* at this stage, i would suspect orted ... */

As the `ssh` on the head node hangs for a while, I suspect it's something on the 
compute node. During startup I see there:

orted -mca ess env -mca orte_ess_jobid 2412773376 -mca orte_ess_vpid 1 -mca 
orte_ess_num_procs 2 -mca orte_hnp_uri 
2412773376.0;tcp://137.248.x.y,192.168.154.30,192.168.154.187:58782 
--tree-spawn -mca plm rsh
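
(Should it come to the strace route, I'd simply attach to that orted on the 
compute node right after issuing mpiexec - just a sketch, the output file name 
is arbitrary and it assumes a single orted on the node:

strace -f -ttt -o /tmp/orted.strace -p "$(pgrep -x orted)"

and then look for the long gaps between the timestamps.)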

===
Only the subnet 192.168.154.0/26 (yes, /26) is used to access the nodes from the 
master, i.e. the login machine. As additional information: the nodes have two 
network interfaces, one in 192.168.154.0/26 and one in 192.168.154.64/26 to 
reach a file server.
===


Falling back to 1.8.1 I see:

bash -c  orted -mca ess env -mca orte_ess_jobid 3182034944 -mca orte_ess_vpid 1 
-mca orte_ess_num_procs 2 -mca orte_hnp_uri 
"3182034944.0;tcp://137.248.x.y,192.168.154.30,192.168.154.187:54436" 
--tree-spawn -mca plm rsh -mca hwloc_base_binding_policy none

So the `bash -c` wrapper was removed in the newer version. But I don't think 
that causes anything.
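
For the moment I keep restricting the OOB (and, following Jeff's earlier advice, 
also the TCP BTL) to the internal subnet as a workaround - the oob_tcp_if_include 
alone already removed the delay here, so the BTL part is just for good measure:

reuti@annemarie:~> mpiexec --mca oob_tcp_if_include 192.168.154.0/26 \
                           --mca btl_tcp_if_include 192.168.154.0/26 \
                           -n 4 --hostfile machines ./mpihello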

-- Reuti


> Cheers,
> 
> Gilles
> 
> On Mon, Nov 10, 2014 at 5:56 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
> 
> On 10.11.2014, at 16:39, Ralph Castain wrote:
> 
> > That is indeed bizarre - we haven’t heard of anything similar from other 
> > users. What is your network configuration? If you use oob_tcp_if_include or 
> > exclude, can you resolve the problem?
> 
> Thx - this option helped to get it working.
> 
> For the sake of simplicity these tests were made between the head node of the 
> cluster and one (idle) compute node. I then tried between the (identical) 
> compute nodes and this worked fine. The head node of the cluster and the 
> compute node are slightly different though (i.e. number of cores), and use 
> eth1 and eth0, respectively, for the internal network of the cluster.
> 
> I tried --hetero-nodes with no change.
> 
> Then I turned to:
> 
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca oob_tcp_if_include 
> 192.168.154.0/26 -n 4 --hetero-nodes --hostfile machines ./mpihello; date
> 
> and the application started instantly. On another cluster, where the head node 
> is identical to the compute nodes but with the same network setup as above, I 
> observed a delay of "only" 30 seconds. Nevertheless, on this cluster too, 
> adding the correct "oob_tcp_if_include" solved the issue.
> 
> The questions which remain: a) is this the intended behavior, and b) what 
> changed in this respect between 1.8.1 and 1.8.2?
> 
> -- Reuti
> 
> 
> >
> >> On Nov 10, 2014, at 4:50 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> >>
> >> On 10.11.2014, at 12:50, Jeff Squyres (jsquyres) wrote:
> >>
> >>> Wow, that's pretty terrible!  :(
> >>>
> >>> Is the behavior BTL-specific, perchance?  E.G., if you only use certain 
> >>> BTLs, does the delay disappear?
> >>
> >> You mean something like:
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile 
> >> machines ./mpihello; date
> >> Mon Nov 10 13:44:34 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 3.
> >> Hello World from Node 2.
> >> Mon Nov 10 13:46:42 CET 2014
> >>
> >> (the above was even with the latest v1.8.3-186-g978f61d)
> >>
> >> Falling back to 1.8.1 gives (as expected):
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile 
> >> machines ./mpihello; date
> >> Mon Nov 10 13:49:51 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 2.
> >> Hello World from Node 3.
> >> Mon Nov 10 13:49:53 CET 2014
> >>
> >>
> >> -- Reuti
> >>
> >>> FWIW: the use-all-IP interfaces approach has been in OMPI forever.
> >>>
> >>> Sent from my phone. No type good.
> >>>
> >>>> On Nov 10, 2014, at 6:42 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> >>>>
> >>>>> On 10.11.2014, at 12:24, Reuti wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>>> On 09.11.2014, at 05:38, Ralph Castain wrote:
> >>>>>>
> >>>>>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. 
> >>>>>> Each process receives a complete map of that info for every process in 
> >>>>>> the job. So when the TCP btl sets itself up, it attempts to connect 
> >>>>>> across -all- the interfaces published by the other end.
> >>>>>>
> >>>>>> So it doesn’t matter what hostname is provided by the RM. We discover 
> >>>>>> and “share” all of the interface info for every node, and then use 
> >>>>>> it for load balancing.
> >>>>>
> >>>>> does this lead to any time delay when starting up? I stayed with Open 
> >>>>> MPI 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there 
> >>>>> was a delay when the application started with my first compilation of 
> >>>>> 1.8.3, I dropped even all my extra options and ran it outside of any 
> >>>>> queuing system - the delay remains - on two different clusters.
> >>>>
> >>>> I forgot to mention: the delay is more or less exactly 2 minutes from 
> >>>> the time I issued `mpiexec` until the `mpihello` starts up (there is no 
> >>>> delay for the initial `ssh` to reach the other node though).
> >>>>
> >>>> -- Reuti
> >>>>
> >>>>
> >>>>> I tracked it down: up to 1.8.1 it works fine, but 1.8.2 already shows 
> >>>>> this delay when starting up a simple mpihello. I assume it may lie in 
> >>>>> the way other machines are reached, as with one single machine there is 
> >>>>> no delay. But using one (and only one - no tree spawn involved) 
> >>>>> additional machine already triggers this delay.
> >>>>>
> >>>>> Did anyone else notice it?
> >>>>>
> >>>>> -- Reuti
> >>>>>
> >>>>>
> >>>>>> HTH
> >>>>>> Ralph
> >>>>>>
> >>>>>>
> >>>>>>> On Nov 8, 2014, at 8:13 PM, Brock Palen <bro...@umich.edu> wrote:
> >>>>>>>
> >>>>>>> OK, I figured. I'm going to have to read some more for my own 
> >>>>>>> curiosity. The reason I mention the resource manager we use, and that 
> >>>>>>> the hostnames given by PBS/Torque match the 1gig-e interfaces, is that 
> >>>>>>> I'm curious what path it would take to get to a peer node when the 
> >>>>>>> entries in the given node list all match the 1gig interfaces, yet data 
> >>>>>>> is being sent out the 10gig eoib0/ib0 interfaces.
> >>>>>>>
> >>>>>>> I'll go do some measurements and see.
> >>>>>>>
> >>>>>>> Brock Palen
> >>>>>>> www.umich.edu/~brockp
> >>>>>>> CAEN Advanced Computing
> >>>>>>> XSEDE Campus Champion
> >>>>>>> bro...@umich.edu
> >>>>>>> (734)936-1985
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres) 
> >>>>>>>> <jsquy...@cisco.com> wrote:
> >>>>>>>>
> >>>>>>>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by 
> >>>>>>>> default.
> >>>>>>>>
> >>>>>>>> This short FAQ has links to 2 other FAQs that provide detailed 
> >>>>>>>> information about reachability:
> >>>>>>>>
> >>>>>>>> http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
> >>>>>>>>
> >>>>>>>> The usNIC BTL uses UDP for its wire transport and actually does a 
> >>>>>>>> much more standards-conformant peer reachability determination 
> >>>>>>>> (i.e., it actually checks routing tables to see if it can reach a 
> >>>>>>>> given peer which has all kinds of caching benefits, kernel controls 
> >>>>>>>> if you want them, etc.).  We haven't back-ported this to the TCP BTL 
> >>>>>>>> because a) most people who use TCP for MPI still use a single L2 
> >>>>>>>> address space, and b) no one has asked for it.  :-)
> >>>>>>>>
> >>>>>>>> As for the round robin scheduling, there's no indication from the 
> >>>>>>>> Linux TCP stack what the bandwidth is on a given IP interface.  So 
> >>>>>>>> unless you use the btl_tcp_bandwidth_<IP_INTERFACE_NAME> (e.g., 
> >>>>>>>> btl_tcp_bandwidth_eth0) MCA params, OMPI will round-robin across 
> >>>>>>>> them equally.
> >>>>>>>>
> >>>>>>>> If you have multiple IP interfaces sharing a single physical link, 
> >>>>>>>> there will likely be no benefit from having Open MPI use more than 
> >>>>>>>> one of them.  You should probably use btl_tcp_if_include / 
> >>>>>>>> btl_tcp_if_exclude to select just one.
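> >>>>>>>> 
> >>>>>>>> For example (the interface names are the ones from your mail; the 
> >>>>>>>> bandwidth values below are only illustrative):
> >>>>>>>> 
> >>>>>>>>   mpirun --mca btl self,tcp --mca btl_tcp_if_include eth0 ...
> >>>>>>>> 
> >>>>>>>> or, to keep several interfaces but weight them:
> >>>>>>>> 
> >>>>>>>>   mpirun --mca btl self,tcp \
> >>>>>>>>          --mca btl_tcp_bandwidth_eth0 1000 \
> >>>>>>>>          --mca btl_tcp_bandwidth_eoib0 10000 ...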
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Nov 7, 2014, at 2:53 PM, Brock Palen <bro...@umich.edu> wrote:
> >>>>>>>>>
> >>>>>>>>> I was doing a test on our IB-based cluster, where I was disabling IB
> >>>>>>>>>
> >>>>>>>>> --mca btl ^openib --mca mtl ^mxm
> >>>>>>>>>
> >>>>>>>>> I was sending very large messages (>1 GB) and I was surprised by the 
> >>>>>>>>> speed.
> >>>>>>>>>
> >>>>>>>>> I noticed then that of all our ethernet interfaces
> >>>>>>>>>
> >>>>>>>>> eth0   (1gig-e)
> >>>>>>>>> ib0    (IP over IB, for the Lustre configuration at vendor request)
> >>>>>>>>> eoib0  (Ethernet over IB interface for an IB -> Ethernet gateway, for 
> >>>>>>>>> some external storage support at >1Gig speed)
> >>>>>>>>>
> >>>>>>>>> I saw all three were getting traffic.
> >>>>>>>>>
> >>>>>>>>> We use Torque for our resource manager with TM support; the 
> >>>>>>>>> hostnames given by Torque match the eth0 interfaces.
> >>>>>>>>>
> >>>>>>>>> How does OMPI figure out that it can also talk over the others?  
> >>>>>>>>> How does it choose to load balance?
> >>>>>>>>>
> >>>>>>>>> BTW that is fine, but we will use if_exclude on one of the IB ones, 
> >>>>>>>>> as ib0 and eoib0 are the same physical device and may screw with 
> >>>>>>>>> load balancing if anyone ever falls back to TCP.
> >>>>>>>>>
> >>>>>>>>> Brock Palen
> >>>>>>>>> www.umich.edu/~brockp
> >>>>>>>>> CAEN Advanced Computing
> >>>>>>>>> XSEDE Campus Champion
> >>>>>>>>> bro...@umich.edu
> >>>>>>>>> (734)936-1985
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Jeff Squyres
> >>>>>>>> jsquy...@cisco.com
> >>>>>>>> For corporate legal information go to: 
> >>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> >
> 
> 
