Hi, IIRC there were some bug fixes between 1.8.1 and 1.8.2 so that all the published interfaces are really used.
By any chance, are you running a firewall on your head node? One possible explanation is that the compute node tries to access the public interface of the head node, and the packets get dropped by the firewall. If you are running a firewall, can you make a test without it? /* if you do need NAT, then just remove the DROP and REJECT rules */

Another possible explanation is that the compute node is doing (reverse) DNS requests with the public name and/or IP of the head node, and that takes some time to complete (success or failure, this does not really matter here). /* a simple test is to make sure all the hosts/IPs of the head node are in the /etc/hosts of the compute node */

Could you check your network config (firewall and DNS)? Can you reproduce the delay when running mpirun on the head node with one MPI task on the compute node? If yes, then the hard way to trace the delay issue would be to strace -ttt both orted and the MPI task launched on the compute node and see where the time is lost. /* at this stage, i would suspect orted ... */

Cheers,

Gilles

On Mon, Nov 10, 2014 at 5:56 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
>
> On 10.11.2014 at 16:39, Ralph Castain wrote:
>
> > That is indeed bizarre - we haven’t heard of anything similar from other
> > users. What is your network configuration? If you use oob_tcp_if_include or
> > exclude, can you resolve the problem?
>
> Thx - this option helped to get it working.
>
> These tests were made for the sake of simplicity between the head node of the
> cluster and one (idle) compute node. I then tried between the (identical)
> compute nodes, and this worked fine. The head node of the cluster and the
> compute node are slightly different though (i.e. number of cores), and use
> eth1 resp. eth0 for the internal network of the cluster.
>
> I tried --hetero-nodes with no change.
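[Editor's note: Gilles's DNS hypothesis above can be checked directly. A minimal sketch follows; "localhost" is a stand-in, and on a real cluster you would pass the head node's public and internal hostnames. A multi-second entry in the output would match the kind of startup delay discussed in this thread.]

```python
import socket
import time

def time_lookups(names):
    """Time a forward + reverse DNS lookup for each name; a slow result
    here is the kind of delay Gilles suspects above."""
    results = {}
    for name in names:
        start = time.monotonic()
        try:
            addr = socket.gethostbyname(name)  # forward lookup
            socket.gethostbyaddr(addr)         # reverse lookup
            status = "ok"
        except OSError:                        # gaierror/herror both subclass OSError
            status = "failed"
        results[name] = (round(time.monotonic() - start, 3), status)
    return results

# "localhost" normally resolves from /etc/hosts, so it should be near-instant;
# a slow entry for the head node's public name would point at DNS.
print(time_lookups(["localhost"]))
```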
>
> Then I turned to:
>
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca oob_tcp_if_include 192.168.154.0/26 -n 4 --hetero-nodes --hostfile machines ./mpihello; date
>
> and the application started instantly. On another cluster, where the head node is identical to the compute nodes but with the same network setup as above, I observed a delay of "only" 30 seconds. Nevertheless, on this cluster too the addition that solved the issue was the correct "oob_tcp_if_include".
>
> The questions which remain: a) is this intended behavior, b) what changed in this scope between 1.8.1 and 1.8.2?
>
> -- Reuti
>
>
>
> >> On Nov 10, 2014, at 4:50 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> >>
> >> On 10.11.2014 at 12:50, Jeff Squyres (jsquyres) wrote:
> >>
> >>> Wow, that's pretty terrible! :(
> >>>
> >>> Is the behavior BTL-specific, perchance? E.g., if you only use certain BTLs, does the delay disappear?
> >>
> >> You mean something like:
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines ./mpihello; date
> >> Mon Nov 10 13:44:34 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 3.
> >> Hello World from Node 2.
> >> Mon Nov 10 13:46:42 CET 2014
> >>
> >> (the above was even the latest v1.8.3-186-g978f61d)
> >>
> >> Falling back to 1.8.1 gives (as expected):
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines ./mpihello; date
> >> Mon Nov 10 13:49:51 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 2.
> >> Hello World from Node 3.
> >> Mon Nov 10 13:49:53 CET 2014
> >>
> >>
> >> -- Reuti
> >>
> >>> FWIW: the use-all-IP-interfaces approach has been in OMPI forever.
> >>>
> >>> Sent from my phone. No type good.
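[Editor's note: the oob_tcp_if_include value used above is a CIDR subnet; only interfaces whose address falls inside it are used. A quick illustration of which addresses the /26 actually admits - the interface names and addresses here are made up for the example:]

```python
import ipaddress

# oob_tcp_if_include 192.168.154.0/26 selects interfaces by subnet:
# a /26 covers only 192.168.154.0 - 192.168.154.63.
cluster_net = ipaddress.ip_network("192.168.154.0/26")

# Hypothetical head-node addresses (public eth0, internal eth1, ib0).
candidates = {
    "eth0": "141.2.3.4",        # public address: outside the /26, excluded
    "eth1": "192.168.154.10",   # inside the /26: included
    "ib0":  "192.168.154.100",  # .64-.255 fall outside the /26: excluded
}

for iface, addr in candidates.items():
    print(iface, addr, ipaddress.ip_address(addr) in cluster_net)
```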
> >>>
> >>>> On Nov 10, 2014, at 6:42 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> >>>>
> >>>>> On 10.11.2014 at 12:24, Reuti wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>>> On 09.11.2014 at 05:38, Ralph Castain wrote:
> >>>>>>
> >>>>>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each process receives a complete map of that info for every process in the job. So when the TCP btl sets itself up, it attempts to connect across -all- the interfaces published by the other end.
> >>>>>>
> >>>>>> So it doesn’t matter what hostname is provided by the RM. We discover and “share” all of the interface info for every node, and then use them for load balancing.
> >>>>>
> >>>>> Does this lead to any time delay when starting up? I stayed with Open MPI 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there was a delay when the application started in my first compilation of 1.8.3, I disregarded even all my extra options and ran it outside of any queuing system - the delay remains - on two different clusters.
> >>>>
> >>>> I forgot to mention: the delay is more or less exactly 2 minutes from the time I issued `mpiexec` until the `mpihello` starts up (there is no delay for the initial `ssh` to reach the other node though).
> >>>>
> >>>> -- Reuti
> >>>>
> >>>>
> >>>>> I tracked it down: up to 1.8.1 it works fine, but 1.8.2 already creates this delay when starting up a simple mpihello. I assume it may lie in the way other machines are reached, as with one single machine there is no delay. But using one (and only one - no tree spawn involved) additional machine already triggers this delay.
> >>>>>
> >>>>> Did anyone else notice it?
> >>>>>
> >>>>> -- Reuti
> >>>>>
> >>>>>
> >>>>>> HTH
> >>>>>> Ralph
> >>>>>>
> >>>>>>
> >>>>>>> On Nov 8, 2014, at 8:13 PM, Brock Palen <bro...@umich.edu> wrote:
> >>>>>>>
> >>>>>>> OK, I figured; I'm going to have to read some more for my own curiosity.
> >>>>>>> The reason I mention the Resource Manager we use, and that the hostnames given by PBS/Torque match the 1gig-e interfaces, is that I'm curious what path it would take to get to a peer node when the node list given all matches the 1gig interfaces, yet data is being sent out the 10gig eoib0/ib0 interfaces.
> >>>>>>>
> >>>>>>> I'll go do some measurements and see.
> >>>>>>>
> >>>>>>> Brock Palen
> >>>>>>> www.umich.edu/~brockp
> >>>>>>> CAEN Advanced Computing
> >>>>>>> XSEDE Campus Champion
> >>>>>>> bro...@umich.edu
> >>>>>>> (734)936-1985
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >>>>>>>>
> >>>>>>>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by default.
> >>>>>>>>
> >>>>>>>> This short FAQ has links to 2 other FAQs that provide detailed information about reachability:
> >>>>>>>>
> >>>>>>>> http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
> >>>>>>>>
> >>>>>>>> The usNIC BTL uses UDP for its wire transport and actually does a much more standards-conformant peer reachability determination (i.e., it actually checks routing tables to see if it can reach a given peer), which has all kinds of caching benefits, kernel controls if you want them, etc. We haven't back-ported this to the TCP BTL because a) most people who use TCP for MPI still use a single L2 address space, and b) no one has asked for it. :-)
> >>>>>>>>
> >>>>>>>> As for the round-robin scheduling: there's no indication from the Linux TCP stack what the bandwidth is on a given IP interface. So unless you use the btl_tcp_bandwidth_<IP_INTERFACE_NAME> (e.g., btl_tcp_bandwidth_eth0) MCA params, OMPI will round-robin across them equally.
> >>>>>>>>
> >>>>>>>> If you have multiple IP interfaces sharing a single physical link, there will likely be no benefit from having Open MPI use more than one of them.
> >>>>>>>> You should probably use btl_tcp_if_include / btl_tcp_if_exclude to select just one.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Nov 7, 2014, at 2:53 PM, Brock Palen <bro...@umich.edu> wrote:
> >>>>>>>>>
> >>>>>>>>> I was doing a test on our IB-based cluster, where I was disabling IB:
> >>>>>>>>>
> >>>>>>>>> --mca btl ^openib --mca mtl ^mxm
> >>>>>>>>>
> >>>>>>>>> I was sending very large messages (>1GB) and I was surprised by the speed.
> >>>>>>>>>
> >>>>>>>>> I noticed then that of all our Ethernet interfaces:
> >>>>>>>>>
> >>>>>>>>> eth0 (1gig-e)
> >>>>>>>>> ib0 (IP over IB, for Lustre configuration at vendor request)
> >>>>>>>>> eoib0 (Ethernet-over-IB interface for an IB -> Ethernet gateway, for some external storage support at >1Gig speed)
> >>>>>>>>>
> >>>>>>>>> I saw all three were getting traffic.
> >>>>>>>>>
> >>>>>>>>> We use Torque for our Resource Manager and use TM support; the hostnames given by Torque match the eth0 interfaces.
> >>>>>>>>>
> >>>>>>>>> How does OMPI figure out that it can also talk over the others? How does it choose to load-balance?
> >>>>>>>>>
> >>>>>>>>> BTW that is fine, but we will use if_exclude on one of the IB ones, as ib0 and eoib0 are the same physical device and may screw with load balancing if anyone ever falls back to TCP.
> >>>>>>>>>
> >>>>>>>>> Brock Palen
> >>>>>>>>> www.umich.edu/~brockp
> >>>>>>>>> CAEN Advanced Computing
> >>>>>>>>> XSEDE Campus Champion
> >>>>>>>>> bro...@umich.edu
> >>>>>>>>> (734)936-1985
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> users mailing list
> >>>>>>>>> us...@open-mpi.org
> >>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/11/25709.php
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Jeff Squyres
> >>>>>>>> jsquy...@cisco.com
> >>>>>>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
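[Editor's note: the striping behavior Jeff describes - equal round-robin across interfaces unless btl_tcp_bandwidth_* weights are set - can be modeled with a toy scheduler. This is only an illustration of the scheduling idea, not Open MPI's actual implementation; the interface names and weights are made up:]

```python
def schedule_fragments(n_fragments, bandwidths):
    """Toy model of bandwidth-weighted striping: each fragment goes to the
    interface currently furthest below its bandwidth share. With all-equal
    weights this reproduces the plain round-robin default."""
    total = sum(bandwidths.values())
    shares = {ifc: bw / total for ifc, bw in bandwidths.items()}
    counts = {ifc: 0 for ifc in bandwidths}
    for _ in range(n_fragments):
        # pick the interface with the largest deficit relative to its share
        ifc = min(counts, key=lambda i: counts[i] / n_fragments - shares[i])
        counts[ifc] += 1
    return counts

# Equal weights -> even split, the default described in the thread.
print(schedule_fragments(90, {"eth0": 1, "ib0": 1, "eoib0": 1}))
# Weights in the spirit of btl_tcp_bandwidth_eth0=1000, btl_tcp_bandwidth_ib0=10000
# -> ib0 carries ten times the fragments.
print(schedule_fragments(110, {"eth0": 1000, "ib0": 10000}))
```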