I've been lurking on this thread for a while, but I have some thoughts on the 
many issues that were discussed here (sorry, I'm still pretty under water 
trying to get ready for SC next week...).  These points are in no particular 
order...

0. Two fundamental points have been missed in this thread:

   - A hostname technically has nothing to do with the resolvable name of an IP 
interface.  By convention, many people set the hostname to be the same as some 
"primary" IP interface (for some definition of "primary", e.g., eth0).  But 
they are actually unrelated concepts.

   - Open MPI uses host specifications only to identify a remote server, *NOT* 
an interface.  E.g., when you list names in a hostfile or via the --host CLI 
option, those names only specify the server -- not the interface(s).  This was 
an intentional design choice, because there tends to be confusion and different 
schools of thought about the question "What's the [resolvable] name of that 
remote server?"  Hence, OMPI will take any old name you throw at it to identify 
that remote server, but we then have separate controls for specifying which 
interface(s) to use to communicate with that server (see the example below).
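
   To make that separation concrete, here's a hypothetical illustration (the 
host and interface names are made up):

      # Both of these identify the same remote server; neither picks an interface
      mpirun --host node37 ...
      mpirun --host node37-eth1.example.com ...

      # Interface selection is a separate control, e.g.:
      mpirun --host node37 --mca btl_tcp_if_include eth0 ...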

1. Remember that Open MPI has at least one, and possibly two, uses of TCP 
communications -- and they are used differently:

   - Command/control (sometimes referred to as "oob"): used for things like 
mpirun control messages, shuttling IO from remote processes back to mpirun, 
etc.  Generally, unless you have a mountain of stdout/stderr from your launched 
processes, this isn't a huge amount of traffic.

   - MPI messages: kernel-based TCP is the fallback if you don't have some kind 
of faster off-server network -- i.e., the TCP BTL.  Like all BTLs, the TCP BTL 
carries all MPI traffic when it is used.  How much traffic is sent/received 
depends on your application.

2. For OOB, I believe the current ORTE mechanism is to try all available IP 
interfaces and use the *first* one that succeeds.  Meaning: after some 
negotiation, only one IP interface will be used to communicate with a given 
peer.
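
   If you'd rather take the guesswork out of that negotiation, you can pin the 
OOB to a specific interface yourself (the interface name here is just a 
placeholder):

      mpirun --mca oob_tcp_if_include eth0 ...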

3. The TCP BTL will examine all local IP interfaces and determine all that can 
be used to reach each peer according to the algorithm described here: 
http://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3.  It will use 
*all* IP interfaces to reach a given peer in order to maximize the available 
bandwidth.
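
   For example, if you want MPI traffic striped across exactly two interfaces 
(again, the names are illustrative), you can say so explicitly:

      mpirun --mca btl_tcp_if_include eth0,eth2 ...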

4. The usNIC BTL uses UDP as its wire transport, and therefore has the same 
reachability issues as both the TCP OOB and BTL.  However, we use a different 
mechanism than the algorithm described in the above-cited FAQ item: we simply 
query the Linux routing table.  This can cause ARP requests, but the kernel 
caches them (e.g., for multiple MPI procs on the same server making the 
same/similar requests), and for a properly-segmented L3 network, each MPI 
process will effectively end up querying for its local gateway (vs. the actual 
peer), so the chances of that ARP entry already being cached are quite high.

--> I want to make this clear: there's nothing magic about the 
usNIC/check-the-routing-table approach.  It's actually a very standard 
IP/datacenter method.  With a proper routing table, you can know fairly quickly 
whether local IP interface X can reach remote IP interface Y.
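
   Roughly speaking, it's the programmatic equivalent of asking the kernel from 
the shell (the addresses and exact output here are made up for illustration):

      $ ip route get 10.10.20.7
      10.10.20.7 via 10.10.10.1 dev eth2 src 10.10.10.5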

5. The original problem cited in this thread was about the TCP OOB, not the TCP 
BTL.  It's important to keep straight that the OOB, with no guidance from the 
user, was trying to probe the different IP interfaces and find one that would 
reach a peer.  Using the check-the-routing-table approach cited in #4, we might 
be able to make this better (that's what Ralph and I are going to talk about in 
December / post-SC / post-US Thanksgiving holiday).

6. As a sidenote to #5, the TCP OOB and TCP BTL determine reachability in 
different ways.  Remember that the TCP BTL has the benefit of having all the 
ORTE infrastructure up and running.  Meaning: MPI processes can exchange IP 
interface information and then use that information to compute which peer IP 
interfaces can be reached.  The TCP OOB doesn't have this benefit -- it's being 
used to establish initial connectivity.  Hence, it probes each IP interface to 
see if it can reach a given peer.

--> We apparently need to do that probe better (vs. blocking in a serial 
fashion, and eventually timing out on "bad" interfaces and then trying the next 
one). 

Having a bad route or gateway listed in a server's IP setup, however, will make 
the process take an artificially long time.  This is a user error that Open MPI 
cannot compensate for.  If prior versions of OMPI tried interfaces in a 
different order that luckily worked nicely, cool.  But as Gilles mentioned, 
that was luck -- there was still a user config error that was the real 
underlying issue.

7. Someone asked: does it matter in which order you specify interfaces in 
btl_tcp_if_include?  No, it effectively does not.  Open MPI will use all of the 
listed interfaces.  If you only send one short MPI message to a peer, then yes, 
OMPI will only use one of those interfaces, but that's not the usual case.  
Open MPI will effectively round-robin/multiplex across all the interfaces that 
you list (or all the interfaces that are not excluded).  They're all used 
equally unless you specify a weighting factor (i.e., bandwidth) for each 
interface.
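
   If you do want to weight them, the knobs are MCA params on the TCP BTL.  I'm 
writing the per-interface form from memory, so treat it as a sketch and check 
ompi_info for the exact names your build supports:

      # List the TCP BTL's bandwidth/latency params (--level is needed on 1.7/1.8)
      ompi_info --param btl tcp --level 9 | grep -E 'bandwidth|latency'

      # Hypothetical example: tell OMPI that eth2 has ~10x the bandwidth of eth0
      mpirun --mca btl_tcp_if_include eth0,eth2 \
             --mca btl_tcp_bandwidth_eth0 1000 \
             --mca btl_tcp_bandwidth_eth2 10000 ...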

8. Don't forget that you can use CIDR notation to specify which interfaces to 
use, too.  E.g., "--mca btl_tcp_if_include 10.10.10.0/24".  That way, you don't 
have to know which interface a given network uses (and it might even be 
different on different servers).  Same goes for the oob_tcp_if_*clude MCA 
params, too.

9. If I followed the thread properly (and I might not have?), I think Reuti 
eliminated a bad route/gateway and greatly reduced the dead time during 
startup.  But there still seems to be a 30-second timeout in there when no 
sysadmin-specified oob_tcp_if_include param is provided.  If this is correct, 
Reuti, can you send the full "ifconfig -a" output from the two servers in 
question (i.e., 2 servers where you can reproduce the problem), and the full 
routing tables between those two servers?  (Make sure to show all routing 
tables on each server -- fun fact: did you know that you can have a different 
routing table for each IP interface in Linux?)  Include any relevant network 
routing tables (e.g., from intermediate switches), too, if they're not just 
pass-through.
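
   To be concrete, something like the following from each server should capture 
what I'm after (the iproute2 commands assume a reasonably modern Linux):

      ifconfig -a
      ip route show            # the main routing table
      ip rule show             # policy-routing rules (per-interface tables show up here)
      ip route show table all  # dump every table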




On Nov 13, 2014, at 9:17 PM, Gilles Gouaillardet 
<gilles.gouaillar...@iferc.org> wrote:

> My 0.02 US$
> 
> first, the root cause of the problem was that a default gateway was
> configured on the node,
> but this gateway was unreachable.
> imho, this is an incorrect system setting that can lead to unpredictable
> results:
> - openmpi 1.8.1 works (you are lucky, good for you)
> - openmpi 1.8.3 fails (no luck this time, too bad)
> so i believe it is incorrect to blame openmpi for this.
> 
> that being said, you raise some good points about how to improve user
> friendliness for end users
> that have limited skills and/or interest in OpenMPI and system
> administration.
> 
> basically, i agree with Gus. HPC is complex, not every cluster is the same
> and imho some minimal config/tuning might not be avoidable to get OpenMPI
> working,
> or operating at full speed.
> 
> 
> let me give a few examples :
> 
> you recommend OpenMPI use only the interfaces that match the
> hostnames in the machinefile.
> what if you submit from the head node? should you use the interface
> that matches the hostname?
> what if this interface is the public interface, there is a firewall
> and/or the compute nodes have no default gateway?
> that will simply not work ...
> so mpirun needs to pass orted all its interfaces.
> which one should be picked by orted?
> - the first one? it might be the unreachable public interface ...
> - the one on the same subnet? what if none is on the same subnet?
>  on the cluster i am working on, the eth0 interfaces are in different
> subnets, ib0 is on a single subnet
>  and i do *not* want to use ib0. but on some other clusters, the
> ethernet network is so cheap
>  they *want* to use ib0.
> 
> on your cluster, you want to use eth0 for oob and mpi, and eth1 for NFS.
> that is legitimate.
> in my case, i want to use eth0 (gigE) for oob and eth2 (10gigE) for MPI.
> that is legitimate too.
> 
> we both want OpenMPI to work *and* perform at its best out of the box.
> it is a good thing to have high expectations, but they might not all be met.
> 
> i'd rather implement some pre-defined policies that rule how ethernet
> interfaces should be picked,
> and add a FAQ that mentions: if it does not work (or does not work as
> fast as expected) out of the box, you should
> first try another policy.
> 
> then the next legitimate question will be "what is the default policy?"
> regardless of the answer, it will be good for some and bad for others.
> 
> 
> imho, posting a mail to the OMPI users mailing list was the right thing
> to do:
> - you got help on how to troubleshoot and fix the issue
> - we got some valuable feedback on end users' expectations.
> 
> Cheers,
> 
> Gilles
> 
> On 2014/11/14 3:36, Gus Correa wrote:
>> On 11/13/2014 11:14 AM, Ralph Castain wrote:
>>> Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to
>>> assign different hostnames to their interfaces - I’ve seen it in the
>>> Hadoop world, but not in HPC. Still, no law against it.
>> 
>> No, not so unusual.
>> I have clusters from respectable vendors that come with
>> /etc/hosts for name resolution of the various interfaces.
>> If I remember right, Rocks clusters also do that (or actually
>> allow the sysadmin to set up additional networks, and at that point
>> will append /etc/hosts with the additional names, or perhaps put those
>> names in DHCP).
>> I am not so familiar with xCAT, but I think it has similar DHCP
>> functionality, or maybe DNS on the head node.
>> 
>> Having said that, I don't think this is an obstacle to setting up the
>> right "if_include/if_exclude" choices (along with the btl, oob, etc.),
>> for each particular cluster in the MCA parameter configuration file.
>> That is what my parallel conversation with Reuti was about.
>> 
>> I believe the current approach w.r.t. interfaces:
>> "use everything, let the sysadmin/user restrict as
>> (s)he sees fit" is both a wise and flexible way to do it.
>> Guessing the "right interface to use" sounds risky to me (wrong
>> choices may happen), and a bit of a gamble.
>> 
>>> 
>>> This will take a little thought to figure out a solution. One problem
>>> that immediately occurs is if someone includes a hostfile that has lines
>>> which refer to the same physical server, but using different interface
>>> names. We’ll think those are completely distinct servers, and so the
>>> process placement will be totally messed up.
>>> 
>> 
>> Sure, and besides this, there will be machines with
>> inconsistent/wrong/conflicting name resolution schemes
>> that the current OMPI approach simply (and wisely) ignores.
>> 
>> 
>>> We’ll also encounter issues with the daemon when it reports back, as the
>>> hostname it gets will almost certainly differ from the hostname we were
>>> expecting. Not as critical, but need to check to see where that will
>>> impact the code base
>>> 
>> 
>> I'm sure that will happen.
>> Torque uses hostname by default for several things, and it can be a
>> configuration nightmare to work around that when what hostname reports
>> is not what you want.
>> 
>> IMHO, you may face a daunting guesswork task to get this right,
>> to pick the
>> interfaces that are best for a particular computer or cluster.
>> It is so much easier to let the sysadmin/user, who presumably knows
>> his/her machine, write an MCA parameter config file,
>> as it is now in OMPI.
>> 
>>> We can look at the hostfile changes at that time - no real objection to
>>> them, but would need to figure out how to pass that info to the
>>> appropriate subsystems. I assume you want this to apply to both the oob
>>> and tcp/btl?
>>> 
>>> Obviously, this won’t make it for 1.8 as it is going to be fairly
>>> intrusive, but we can probably do something for 1.9
>>> 
>> 
>> The status quo is good.
>> Long live the OMPI status quo.
>> (You don't know how reluctant I am to support the status quo, any
>> status quo.  :) )
>> My vote (... well, I don't have voting rights on that, but I'll vote
>> anyway ...) is to keep the current approach.
>> It is wise and flexible, and easy to adjust and configure to specific
>> machines with their own oddities, via MCA parameters, as I tried to
>> explain in previous postings.
>> 
>> My two cents,
>> Gus Correa
>> 
>>> 
>>>> On Nov 13, 2014, at 4:23 AM, Reuti <re...@staff.uni-marburg.de
>>>> <mailto:re...@staff.uni-marburg.de>> wrote:
>>>> 
>>>> Am 13.11.2014 um 00:34 schrieb Ralph Castain:
>>>> 
>>>>>> On Nov 12, 2014, at 2:45 PM, Reuti <re...@staff.uni-marburg.de
>>>>>> <mailto:re...@staff.uni-marburg.de>> wrote:
>>>>>> 
>>>>>> Am 12.11.2014 um 17:27 schrieb Reuti:
>>>>>> 
>>>>>>> Am 11.11.2014 um 02:25 schrieb Ralph Castain:
>>>>>>> 
>>>>>>>> Another thing you can do is (a) ensure you built with
>>>>>>>> --enable-debug, and then (b) run it with -mca oob_base_verbose 100
>>>>>>>> (without the tcp_if_include option) so we can watch the
>>>>>>>> connection handshake and see what it is doing. The --hetero-nodes
>>>>>>>> will have no effect here and can be ignored.
>>>>>>> 
>>>>>>> Done. It really tries to connect to the outside interface of the
>>>>>>> headnode. But firewall or not: the nodes have no clue
>>>>>>> how to reach 137.248.0.0 - they have no gateway to this network
>>>>>>> at all.
>>>>>> 
>>>>>> I have to revert this. They think that there is a gateway although
>>>>>> there isn't one. When I remove the entry for the gateway from the
>>>>>> routing table by hand, it starts up instantly too.
>>>>>> 
>>>>>> While I can do this on my own cluster, I still have the 30-second
>>>>>> delay on a cluster where I'm not root, though this may be because of
>>>>>> the firewall there. The gateway on this cluster does indeed go to
>>>>>> the outside world.
>>>>>> 
>>>>>> Personally I find this behavior of using all interfaces a little bit
>>>>>> too aggressive. If you don't check this carefully beforehand and
>>>>>> start a long-running application, one might not even notice the delay
>>>>>> during the startup.
>>>>> 
>>>>> Agreed - do you have any suggestions on how we should choose the
>>>>> order in which to try them? I haven’t been able to come up with
>>>>> anything yet. Jeff has some fancy algo in his usnic BTL that we are
>>>>> going to discuss after SC that I’m hoping will help, but I’d be open
>>>>> to doing something better in the interim for 1.8.4
>>>> 
>>>> The plain `mpiexec` should just use the specified interface it finds in
>>>> the hostfile, be it hand-crafted or prepared by any queuing system.
>>>> 
>>>> 
>>>> Option: could a single entry for a machine in the hostfile contain a
>>>> list of interfaces? I mean something like:
>>>> 
>>>> node01,node01-extra-eth1,node01-extra-eth2 slots=4
>>>> 
>>>> or
>>>> 
>>>> node01* slots=4
>>>> 
>>>> Means: use exactly these interfaces or even try to find all available
>>>> interfaces on/between the machines.
>>>> 
>>>> In case all interfaces have the same name, then it's up to the admin
>>>> to correct this.
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>>> -- Reuti
>>>>>> 
>>>>>> 
>>>>>>> It does so independent of whether the internal or external name of the
>>>>>>> headnode is given in the machinefile - I hit ^C then. I attached the
>>>>>>> output of Open MPI 1.8.1 for this setup too.
>>>>>>> 
>>>>>>> -- Reuti
>>>>>>> 
>>>>>>> <openmpi1.8.3.txt><openmpi1.8.1.txt>
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
