That raises a larger issue -- what about Ethernet-only clusters that span 
multiple IP/L3 subnets?  This is a scenario that Cisco definitely wants to 
enable/support.

The usnic BTL, for example, can handle this scenario.  We hadn't previously 
considered how the TCP oob component behaves in this scenario -- oops.

Hmm.

The usnic BTL both does lazy connections (so to speak...) and uses a 
connectivity checker to ensure that it can actually communicate with each peer. 
 In this way, OMPI has a way of knowing whether process A can communicate with 
process B, even if A and B have effectively unrelated IP addresses (i.e., 
they're not on the same IP subnet).
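
To be clear, the real usnic checker is more involved than this, but the core 
idea is "ask the network instead of guessing from subnet masks."  A minimal 
sketch of that idea in C is below: send a small UDP probe to the peer and see 
whether anything comes back within a timeout.  The host/port values and the 
assumption that the peer echoes the probe back are placeholders for 
illustration, not anything taken from the actual usnic code.

    /* Minimal sketch of an active reachability check: send a UDP probe and
     * wait briefly for an echo.  If a reply arrives, the path works, even if
     * the two addresses are on unrelated subnets.  NOT the real usnic agent. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <sys/select.h>

    /* Returns 1 if the peer echoed our probe within timeout_ms, 0 otherwise. */
    static int peer_is_reachable(const char *host, const char *port, int timeout_ms)
    {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_INET;
        hints.ai_socktype = SOCK_DGRAM;
        if (0 != getaddrinfo(host, port, &hints, &res)) {
            return 0;
        }

        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0) {
            freeaddrinfo(res);
            return 0;
        }

        const char probe[] = "oob-probe";
        int ok = 0;
        if (sendto(fd, probe, sizeof(probe), 0, res->ai_addr, res->ai_addrlen) > 0) {
            fd_set rfds;
            struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
            FD_ZERO(&rfds);
            FD_SET(fd, &rfds);
            /* Anything that comes back proves the route works. */
            if (select(fd + 1, &rfds, NULL, NULL, &tv) > 0) {
                char buf[64];
                ok = (recv(fd, buf, sizeof(buf), 0) > 0);
            }
        }

        close(fd);
        freeaddrinfo(res);
        return ok;
    }

    int main(void)
    {
        /* "peer-host" and "7777" are placeholders, not real OMPI values. */
        printf("peer reachable: %d\n", peer_is_reachable("peer-host", "7777", 500));
        return 0;
    }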

I don't think the TCP oob will be able to use this same kind of strategy.

As a simple solution, there could be a TCP oob MCA param that says "regardless 
of peer IP address, I can connect to them" (i.e., assume IP routing will make 
everything work out ok).
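
To make the idea concrete, I'm picturing something along these lines (the 
param name below is made up -- no such param exists today):

    mpirun --mca oob_tcp_assume_reachable 1 -np 16 --hostfile hosts ./a.out

i.e., "trust IP routing: treat every peer as TCP-reachable, even if its 
address is not on any subnet that my local interfaces are on."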

That doesn't seem like a good overall solution, however -- it doesn't 
necessarily fit in the "it just works out of the box" philosophy that we like 
to have in OMPI.

Let me take this back to some IP experts here and see if someone can come up 
with a better idea.



On Jun 4, 2014, at 10:09 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Well, the problem is that we can't simply decide that anything called "ib.." 
> is an IB port and should be ignored. There is no naming rule regarding IP 
> interfaces that I've ever heard about that would allow us to make such an 
> assumption, though I admit most people let the system create default names 
> and thus would get something like an "ib..".
> 
> So we leave it up to the sys admin to configure the system based on their 
> knowledge of what they want to use. On the big clusters at the labs, we 
> commonly put MCA params in the default param file for this purpose as we 
> *don't* want OOB traffic going over the IB fabric.
> 
> But that's the sys admin's choice, not a requirement. I've seen organizations 
> that do it the other way because their Ethernet is really slow.
> 
> In this case, the problem is really in the OOB itself. The local proc is 
> connecting to its local daemon via eth0, which is fine. When it sends a 
> message to mpirun on a different proc, that message goes from the app to the 
> daemon via eth0. The daemon looks for mpirun in its contact list, and sees 
> that it has a direct link to mpirun via this nifty "ib0" interface - and so 
> it uses that one to relay the message along.
> 
> This is where we are hitting the problem - the OOB isn't correctly doing the 
> transfer between those two interfaces like it should. So it is a bug that we 
> need to fix, regardless of any other actions (e.g., if it was an eth1 that 
> was the direct connection, we would still want to transfer the message to the 
> other interface).
> 
> HTH
> Ralph
> 
> On Jun 4, 2014, at 7:32 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
>> Thanks Ralph,
>> 
>> For the time being, I just found a workaround:
>> --mca oob_tcp_if_include eth0
>> 
>> Generally speaking, is Open MPI doing the wisest thing?
>> Here is what I mean:
>> on the cluster I work on (4k+ nodes), each node has two IP interfaces:
>>  * eth0 (gigabit ethernet): because of the cluster size, several subnets 
>> are used.
>>  * ib0 (IP over IB): only one subnet
>> I can easily understand that such a large cluster is not so common, but on 
>> the other hand I do not believe this IP configuration (subnetted GigE and 
>> single-subnet IPoIB) can be called exotic.
>> 
>> If nodes from different eth0 subnets are used, and if I understand your 
>> previous replies correctly, orte will "discard" eth0 because the nodes 
>> cannot contact each other "directly".
>> "Directly" means the nodes are not on the same subnet. That being said, they 
>> can still communicate via IP thanks to IP routing (I mean IP routing, I do 
>> *not* mean orte routing).
>> That means orte communications will use IPoIB, which might not be the best 
>> thing to do, since establishing an IPoIB connection can take a long time 
>> (especially at scale *and* if the arp table is not populated).
>> 
>> Is my understanding correct so far?
>> 
>> Bottom line, I would have expected Open MPI to use eth0 regardless of 
>> whether IP routing is required, and ib0 to simply not be used (or possibly 
>> used as a fallback option).
>> 
>> This leads to my next question: is the current default OK? If not, should 
>> we change it, and how?
>> /*
>> imho:
>>  - IP routing is not always a bad/slow thing
>>  - GigE can sometimes be better than IPoIB
>> */
>> 
>> I am fine if, in the end:
>> - this issue is fixed
>> - we decide it is up to the sysadmin to make --mca oob_tcp_if_include eth0 
>> the default if this is really thought to be best for the cluster (and I can 
>> try to draft a FAQ entry if needed)
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Wed, Jun 4, 2014 at 11:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> I'll work on it - may take a day or two to really fix. Only impacts systems 
>> with mismatched interfaces, which is why we aren't generally seeing it.
>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
