Jeff, Gus, Gilles,

On 14.11.2014 at 15:56, Jeff Squyres (jsquyres) wrote:
> I lurked on this thread for a while, but I have some thoughts on the many
> issues that were discussed on this thread (sorry, I'm still pretty under
> water trying to get ready for SC next week...).

I appreciate your replies and will read them thoroughly. I think it's best
to continue the discussion after SC14. I don't want to put any burden on
anyone when time is tight.

-- Reuti

> These points are in no particular order...
>
> 0. Two fundamental points have been missed in this thread:
>
> - A hostname technically has nothing to do with the resolvable name of an
> IP interface. By convention, many people set the hostname to be the same
> as some "primary" IP interface (for some definition of "primary", e.g.,
> eth0). But they are actually unrelated concepts.
>
> - Open MPI uses host specifications only to specify a remote server, *NOT*
> an interface. E.g., when you list names in a hostfile or the --host CLI
> option, those only specify the server -- not the interface(s). This was an
> intentional design choice, because there tends to be confusion and
> different schools of thought about the question "What's the [resolvable]
> name of that remote server?" Hence, OMPI will take any old name you throw
> at it to identify that remote server, but then we have separate controls
> for specifying which interface(s) to use to communicate with that server.
>
> 1. Remember that there is at least one, and possibly two, uses of TCP
> communications in Open MPI -- and they are used differently:
>
> - Command/control (sometimes referred to as "oob"): used for things like
> mpirun control messages, shuttling IO from remote processes back to
> mpirun, etc. Generally, unless you have a mountain of stdout/stderr from
> your launched processes, this isn't a huge amount of traffic.
>
> - MPI messages: kernel-based TCP is the fallback if you don't have some
> kind of faster off-server network -- i.e., the TCP BTL. Like all BTLs,
> the TCP BTL carries all MPI traffic when it is used. How much traffic is
> sent/received depends on your application.
>
> 2. For OOB, I believe the current ORTE mechanism is to try all available
> IP interfaces and use the *first* one that succeeds. Meaning: after some
> negotiation, only one IP interface will be used to communicate with a
> given peer.
>
> 3. The TCP BTL will examine all local IP interfaces and determine all that
> can be used to reach each peer, according to the algorithm described here:
> http://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3. It will use
> *all* IP interfaces to reach a given peer in order to maximize the
> available bandwidth.
>
> 4. The usNIC BTL uses UDP as its wire transport, and therefore has the
> same reachability issues as both the TCP OOB and BTL. However, we use a
> different mechanism than the algorithm described in the above-cited FAQ
> item: we simply query the Linux routing table. This can cause ARP
> requests, but the kernel caches them (e.g., for multiple MPI procs on the
> same server making the same/similar requests), and on a properly-segmented
> L3 network, each MPI process will effectively end up querying about its
> local gateway (vs. the actual peer), and therefore the chances of having
> that ARP already cached are quite high.
>
> --> I want to make this clear: there's nothing magic about the
> usNIC/check-the-routing-table approach. It's actually a very standard
> IP/datacenter method. With a proper routing table, you can know fairly
> quickly whether local IP interface X can reach remote IP interface Y.
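As a concrete illustration of this check-the-routing-table idea (the
addresses below are invented for the example): on Linux you can ask the
kernel which interface and source address it would use to reach a given
peer, and the answer comes straight from the routing table:

    $ ip route get 10.10.10.42
    10.10.10.42 dev eth0 src 10.10.10.5

A sane answer names a usable local device; an answer that points at a dead
default gateway exposes the misconfiguration just as quickly.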
> 5. The original problem cited in this thread was about the TCP OOB, not
> the TCP BTL. It's important to keep straight that the OOB, with no
> guidance from the user, was trying to probe the different IP interfaces
> and find one that would reach a peer. Using the check-the-routing-table
> approach cited in #4, we might be able to make this better (that's what
> Ralph and I are going to talk about in December / post-SC / post-US
> Thanksgiving holiday).
>
> 6. As a sidenote to #5, the TCP OOB and TCP BTL determine reachability in
> different ways. Remember that the TCP BTL has the benefit of having all
> the ORTE infrastructure up and running. Meaning: MPI processes can
> exchange IP interface information and then use that information to compute
> which peer IP interfaces can be reached. The TCP OOB doesn't have this
> benefit -- it's being used to establish initial connectivity. Hence, it
> probes each IP interface to see if it can reach a given peer.
>
> --> We apparently need to do that probe better (vs. blocking in a serial
> fashion, and eventually timing out on "bad" interfaces and then trying the
> next one).
>
> Having a bad route or gateway listed in a server's IP setup, however, will
> make the process take an artificially long time. This is a user error that
> Open MPI cannot compensate for. If prior versions of OMPI tried interfaces
> in a different order that luckily worked nicely, cool. But as Gilles
> mentioned, that was luck -- there was still a user config error that was
> the real underlying issue.
>
> 7. Someone asked: does it matter in which order you specify interfaces in
> btl_tcp_if_include? No, it effectively does not. Open MPI will use both
> interfaces. If you only send one short MPI message to a peer, then yes,
> OMPI will only use one of those interfaces, but that's not the usual case.
> Open MPI will effectively round-robin multiplex across all the interfaces
> that you list (or all the interfaces that are not excluded). They're all
> used equally unless you specify a weighting factor (i.e., bandwidth) for
> each interface.
>
> 8. Don't forget that you can use CIDR notation to specify which interfaces
> to use, too. E.g., "--mca btl_tcp_if_include 10.10.10.0/24". That way, you
> don't have to know which interface a given network uses (and it might even
> be different on different servers). Same goes for the oob_tcp_if_*clude
> MCA params, too.
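To make point 8 concrete, an invocation using CIDR-based selection might
look like the following (the subnet and application name are invented for
the example):

    $ mpirun --mca oob_tcp_if_include 10.10.10.0/24 \
             --mca btl_tcp_if_include 10.10.10.0/24 \
             -np 16 ./my_mpi_app

Each node then uses whichever of its interfaces carries an address inside
10.10.10.0/24, regardless of what that interface happens to be called
locally.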
> 9. If I followed the thread properly (and I might not have?), I think
> Reuti eliminated a bad route/gateway and reduced the dead time during
> startup to be much shorter. But there still seems to be a 30-second
> timeout in there when no sysadmin-specified oob_tcp_if_include param is
> provided. If this is correct, Reuti, can you send the full "ifconfig -a"
> output from the two servers in question (i.e., two servers where you can
> reproduce the problem), and the full routing tables between those two
> servers? (Make sure to show all routing tables on each server - fun fact,
> did you know that you can have a different routing table for each IP
> interface in Linux?) Include any relevant network routing tables (e.g.,
> from intermediate switches), if they're not just pass-thru.
>
> On Nov 13, 2014, at 9:17 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> My 0.02 US$
>>
>> First, the root cause of the problem was a default gateway that was
>> configured on the node, but this gateway was unreachable.
>> IMHO, this is an incorrect system setting that can lead to unpredictable
>> results:
>> - OpenMPI 1.8.1 works (you are lucky, good for you)
>> - OpenMPI 1.8.3 fails (no luck this time, too bad)
>> So I believe it is incorrect to blame OpenMPI for this.
>>
>> That being said, you raise some good points on how to improve
>> friendliness for end users who have limited skills and/or interest in
>> OpenMPI and system administration.
>>
>> Basically, I agree with Gus. HPC is complex, not every cluster is the
>> same, and IMHO some minimal config/tuning may be unavoidable to get
>> OpenMPI working, or operating at full speed.
>>
>> Let me give a few examples:
>>
>> You recommend that OpenMPI use only the interfaces that match the
>> hostnames in the machinefile. What if you submit from the head node?
>> Should you use the interface that matches the hostname? What if this
>> interface is the public interface, there is a firewall, and/or the
>> compute nodes have no default gateway? That will simply not work ...
>> So mpirun needs to pass orted all its interfaces. Which one should be
>> picked by orted?
>> - the first one? It might be the unreachable public interface ...
>> - the one on the same subnet? What if none is on the same subnet?
>> On the cluster I am working on, the eth0 interfaces are in different
>> subnets and ib0 is on a single subnet, and I do *not* want to use ib0.
>> But on some other clusters, where the ethernet network is cheap, they
>> *want* to use ib0.
>>
>> On your cluster, you want to use eth0 for oob and MPI, and eth1 for NFS.
>> That is legitimate.
>> In my case, I want to use eth0 (gigE) for oob and eth2 (10gigE) for MPI.
>> That is legitimate too.
>>
>> We both want OpenMPI to work, *and* with the best performance, out of
>> the box. It is a good thing to have high expectations, but they might
>> not all be met.
>>
>> I'd rather implement some pre-defined policies that rule how ethernet
>> interfaces are picked, and add a FAQ entry that says: if it does not
>> work (or does not work as fast as expected) out of the box, you should
>> first try another policy.
>>
>> Then the next legitimate question will be: "What is the default policy?"
>> Regardless of the answer, it will be good for some and bad for others.
>>
>> IMHO, posting a mail to the OMPI users mailing list was the right thing
>> to do:
>> - you got help on how to troubleshoot and fix the issue
>> - we got some valuable feedback on end-user expectations.
>>
>> Cheers,
>>
>> Gilles
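Gilles' split (OOB on gigE, MPI traffic on 10gigE) is the kind of per-site
choice that can be pinned once in an MCA parameter file, so plain users
never have to think about it. A minimal sketch, assuming his interface
names (these are of course site-specific):

    # $HOME/.openmpi/mca-params.conf (per user) or
    # <prefix>/etc/openmpi-mca-params.conf (system-wide)
    oob_tcp_if_include = eth0
    btl_tcp_if_include = eth2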
>>
>> On 2014/11/14 3:36, Gus Correa wrote:
>>> On 11/13/2014 11:14 AM, Ralph Castain wrote:
>>>> Hmmm... I'm beginning to grok the issue. It is a tad unusual for
>>>> people to assign different hostnames to their interfaces - I've seen
>>>> it in the Hadoop world, but not in HPC. Still, no law against it.
>>>
>>> No, not so unusual.
>>> I have clusters from respectable vendors that come with
>>> /etc/hosts for name resolution of the various interfaces.
>>> If I remember right, Rocks clusters also do that (or actually
>>> allow the sysadmin to set up additional networks, and at that point
>>> will append the additional names to /etc/hosts, or perhaps put those
>>> names in DHCP).
>>> I am not so familiar with xCAT, but I think it has similar DHCP
>>> functionality, or maybe DNS on the head node.
>>>
>>> Having said that, I don't think this is an obstacle to setting up the
>>> right "if_include/if_exclude" choices (along with the btl, oob, etc.)
>>> for each particular cluster in the MCA parameter configuration file.
>>> That is what my parallel conversation with Reuti was about.
>>>
>>> I believe the current approach w.r.t. interfaces:
>>> "use everything, let the sysadmin/user restrict as
>>> (s)he sees fit" is both a wise and flexible way to do it.
>>> Guessing the "right interface to use" sounds risky to me (wrong
>>> choices may happen), and a bit of a gamble.
>>>
>>>> This will take a little thought to figure out a solution. One problem
>>>> that immediately occurs is if someone includes a hostfile that has
>>>> lines which refer to the same physical server, but using different
>>>> interface names. We'll think those are completely distinct servers,
>>>> and so the process placement will be totally messed up.
>>>
>>> Sure, and besides this, there will be machines with
>>> inconsistent/wrong/conflicting name resolution schemes
>>> that the current OMPI approach simply (and wisely) ignores.
>>>
>>>> We'll also encounter issues with the daemon when it reports back, as
>>>> the hostname it gets will almost certainly differ from the hostname we
>>>> were expecting. Not as critical, but we need to check to see where
>>>> that will impact the code base.
>>>
>>> I'm sure that will happen.
>>> Torque uses hostname by default for several things, and it can be a
>>> configuration nightmare to work around that when what hostname reports
>>> is not what you want.
>>>
>>> IMHO, you may face a daunting guesswork task to get this right,
>>> i.e., to pick the interfaces that are best for a particular computer or
>>> cluster. It is so much easier to let the sysadmin/user, who presumably
>>> knows his/her machine, write an MCA parameter config file,
>>> as it is now in OMPI.
>>>
>>>> We can look at the hostfile changes at that time - no real objection
>>>> to them, but we would need to figure out how to pass that info to the
>>>> appropriate subsystems. I assume you want this to apply to both the
>>>> oob and tcp/btl?
>>>>
>>>> Obviously, this won't make it for 1.8 as it is going to be fairly
>>>> intrusive, but we can probably do something for 1.9.
>>>
>>> The status quo is good.
>>> Long live the OMPI status quo.
>>> (You don't know how reluctant I am to support the status quo, any
>>> status quo. :) )
>>> My vote (... well, I don't have voting rights on that, but I'll vote
>>> anyway ...) is to keep the current approach.
>>> It is wise and flexible, and easy to adjust and configure for specific
>>> machines with their own oddities, via MCA parameters, as I tried to
>>> explain in previous postings.
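A concrete (entirely made-up) instance of the pitfall Ralph describes
above: if a hostfile lists two resolvable names of the same physical box,

    node01      slots=4
    node01-eth1 slots=4

Open MPI would count two distinct four-slot servers and happily place
eight processes where only four slots really exist.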
>>>
>>> My two cents,
>>> Gus Correa
>>>
>>>>
>>>>> On Nov 13, 2014, at 4:23 AM, Reuti <re...@staff.uni-marburg.de>
>>>>> wrote:
>>>>>
>>>>> On 13.11.2014 at 00:34, Ralph Castain wrote:
>>>>>
>>>>>>> On Nov 12, 2014, at 2:45 PM, Reuti <re...@staff.uni-marburg.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>> On 12.11.2014 at 17:27, Reuti wrote:
>>>>>>>
>>>>>>>> On 11.11.2014 at 02:25, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> Another thing you can do is (a) ensure you built with
>>>>>>>>> --enable-debug, and then (b) run it with -mca oob_base_verbose
>>>>>>>>> 100 (without the tcp_if_include option) so we can watch the
>>>>>>>>> connection handshake and see what it is doing. The
>>>>>>>>> --hetero-nodes will have no effect here and can be ignored.
>>>>>>>>
>>>>>>>> Done. It really tries to connect to the outside interface of the
>>>>>>>> headnode. But firewall or not: the nodes have no clue how to
>>>>>>>> reach 137.248.0.0 - they have no gateway to this network at all.
>>>>>>>
>>>>>>> I have to revert this. They think that there is a gateway although
>>>>>>> there isn't one. When I remove the gateway entry from the routing
>>>>>>> table by hand, it starts up instantly too.
>>>>>>>
>>>>>>> While I can do this on my own cluster, I still have the 30-second
>>>>>>> delay on a cluster where I'm not root, though this may be because
>>>>>>> of the firewall there. The gateway on this cluster does indeed go
>>>>>>> to the outside world.
>>>>>>>
>>>>>>> Personally, I find this behavior of using all interfaces a little
>>>>>>> bit too aggressive. If you don't check this carefully beforehand
>>>>>>> and start a long-running application, one might not even notice
>>>>>>> the delay during the startup.
>>>>>>
>>>>>> Agreed - do you have any suggestions on how we should choose the
>>>>>> order in which to try them? I haven't been able to come up with
>>>>>> anything yet. Jeff has some fancy algo in his usnic BTL that we are
>>>>>> going to discuss after SC that I'm hoping will help, but I'd be
>>>>>> open to doing something better in the interim for 1.8.4.
>>>>>
>>>>> A plain `mpiexec` should just use the interface specified in the
>>>>> hostfile, be it hand-crafted or prepared by a queuing system.
>>>>>
>>>>> Option: could a single entry for a machine in the hostfile contain a
>>>>> list of interfaces? I mean something like:
>>>>>
>>>>> node01,node01-extra-eth1,node01-extra-eth2 slots=4
>>>>>
>>>>> or
>>>>>
>>>>> node01* slots=4
>>>>>
>>>>> Meaning: use exactly these interfaces, or even try to find all
>>>>> available interfaces on/between the machines.
>>>>>
>>>>> In case all interfaces have the same name, then it's up to the admin
>>>>> to correct this.
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>> It tried this regardless of whether the internal or external name
>>>>>>>> of the headnode was given in the machinefile - I hit ^C then. I
>>>>>>>> attached the output of Open MPI 1.8.1 for this setup too.
>>>>>>>>
>>>>>>>> -- Reuti
>>>>>>>>
>>>>>>>> <openmpi1.8.3.txt><openmpi1.8.1.txt>
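For reference, the bogus-gateway situation Reuti describes can be
inspected and, as root, removed at runtime (the addresses below are
invented for the example):

    $ ip route show
    default via 192.168.1.254 dev eth0    <- gateway that never answers
    10.10.10.0/24 dev eth0  proto kernel  scope link  src 10.10.10.5
    $ ip route del default via 192.168.1.254

With the dead default route gone, connections to unreachable networks fail
immediately instead of waiting out a TCP connect timeout - matching the
instant startup Reuti saw after removing the entry by hand.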
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/