Because Gilles wants to avoid using IB for TCP messages, and using eth0 also 
solves the problem (the messages are simply routed).

On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Another random thought for Gilles' situation: why not oob_tcp_if_include ib0?  
> (And not eth0)
> 
> That should solve his problem, but not the larger issue I raised in my 
> previous email. 
> 
> Sent from my phone. No type good. 
> 
> On Jun 4, 2014, at 9:32 PM, "Gilles Gouaillardet" 
> <gilles.gouaillar...@gmail.com> wrote:
> 
>> Thanks Ralph,
>> 
>> for the time being, i just found a workaround
>> --mca oob_tcp_if_include eth0
>> 
>> Generally speaking, is openmpi doing the wiser thing ?
>> here is what i mean :
>> on the cluster i work on (4k+ nodes), each node has two IP interfaces :
>>  * eth0 (gigabit ethernet) : because of the cluster size, several subnets 
>> are used.
>>  * ib0 (IP over IB) : only one subnet
>> i can easily understand such a large cluster is not so common, but on the 
>> other hand i do not believe the IP configuration (subnetted gigE and single 
>> subnet IPoIB) can be called exotic.
>> 
>> if nodes from different eth0 subnets are used, and if i understand correctly 
>> your previous replies, orte will "discard" eth0 because nodes cannot contact 
>> each other "directly".
>> "not directly" means the nodes are not on the same subnet. that being said, 
>> they can communicate via IP thanks to IP routing (i mean IP routing, i do 
>> *not* mean orte routing).
>> that means orte communications will use IPoIB, which might not be the best 
>> thing to do since establishing an IPoIB connection can take a long time 
>> (especially at scale *and* if the arp table is not populated)
>> 
>> is my understanding correct so far ?
>> 
>> bottom line, i would have expected openmpi to use eth0 regardless of whether 
>> IP routing is required, with ib0 simply not used (or possibly used as a 
>> fallback option)
>> 
>> this leads to my next question : is the current default ok ? if not should 
>> we change it and how ?
>> /*
>> imho :
>>  - IP routing is not always a bad/slow thing
>>  - gigE can sometimes be better than IPoIB
>> */
>> 
>> i am fine if, in the end, either :
>> - this issue is fixed
>> - we decide it is up to the sysadmin to make --mca oob_tcp_if_include eth0 
>> the default if this is really thought to be best for the cluster. (and i can 
>> try to draft a FAQ entry if needed)
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Wed, Jun 4, 2014 at 11:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> I'll work on it - may take a day or two to really fix. Only impacts systems 
>> with mismatched interfaces, which is why we aren't generally seeing it.
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/06/14972.php
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/14977.php
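
The "same subnet" test that Gilles describes in the quoted message can be sketched in a few lines of Python. This is only an illustration of the idea, not Open MPI's actual interface-selection code. The node names, addresses, and prefix lengths are hypothetical, chosen to mimic a cluster with several eth0 subnets and a single ib0 subnet.

```python
import ipaddress

def same_subnet(a: str, b: str) -> bool:
    """Return True if two interface addresses (CIDR notation) share an IP network."""
    return ipaddress.ip_interface(a).network == ipaddress.ip_interface(b).network

# Hypothetical addressing: eth0 is split into several /24 subnets,
# while ib0 (IPoIB) sits in one large /16 subnet.
nodes = {
    "node0001": {"eth0": "10.1.0.11/24", "ib0": "192.168.0.11/16"},
    "node2049": {"eth0": "10.2.0.11/24", "ib0": "192.168.8.11/16"},
}

def directly_reachable(n1: str, n2: str) -> list:
    """List the interfaces on which n1 and n2 are in the same subnet."""
    return [
        ifname
        for ifname in nodes[n1]
        if ifname in nodes[n2]
        and same_subnet(nodes[n1][ifname], nodes[n2][ifname])
    ]

print(directly_reachable("node0001", "node2049"))
```

With these made-up addresses only ib0 passes the same-subnet test, even though the two eth0 addresses could still reach each other through an IP router. That is the situation in which forcing eth0 with --mca oob_tcp_if_include eth0 works as a workaround.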
