Because Gilles wants to avoid using IB for TCP messages, and using eth0 also solves the problem (the messages are simply routed between the subnets).
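
For anyone following along, the distinction at stake in the quoted message (interfaces on the same subnet versus interfaces that can only reach each other through IP routing) can be sketched as below. This is only an illustration: the addresses and prefix lengths are made up, `same_subnet` is a hypothetical helper, and this is not how ORTE actually makes its decision.

```python
import ipaddress


def same_subnet(if_a: str, if_b: str) -> bool:
    """Return True if two interface addresses (CIDR notation) lie in the same IP network."""
    net_a = ipaddress.ip_interface(if_a).network
    net_b = ipaddress.ip_interface(if_b).network
    return net_a == net_b


# Hypothetical addresses: eth0 is split across several subnets, ib0 is one large subnet.
node1 = {"eth0": "10.1.0.5/24", "ib0": "10.128.0.5/16"}
node2 = {"eth0": "10.2.0.7/24", "ib0": "10.128.3.7/16"}

for ifname in ("eth0", "ib0"):
    if same_subnet(node1[ifname], node2[ifname]):
        print(ifname, "same subnet (direct)")
    else:
        print(ifname, "different subnets (reachable only via IP routing)")
```

With these made-up addresses the eth0 pair falls in different subnets while the ib0 pair shares one, which is the situation Gilles describes: eth0 still works, but only through IP routing.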

On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Another random thought for Gilles situation: why not oob-TCP-if-include ib0?
> (And not eth0)
>
> That should solve his problem, but not the larger issue I raised in my
> previous email.
>
> Sent from my phone. No type good.
>
> On Jun 4, 2014, at 9:32 PM, "Gilles Gouaillardet"
> <gilles.gouaillar...@gmail.com> wrote:
>
>> Thanks Ralf,
>>
>> for the time being, i just found a workaround
>> --mca oob_tcp_if_include eth0
>>
>> Generally speaking, is openmpi doing the wiser thing ?
>> here is what i mean :
>> the cluster i work on (4k+ nodes) each node has two ip interfaces :
>> * eth0 (gigabit ethernet) : because of the cluster size, several subnets
>> are used.
>> * ib0 (IP over IB) : only one subnet
>> i can easily understand such a large cluster is not so common, but on the
>> other hand i do not believe the IP configuration (subnetted gigE and single
>> subnet IPoIB) can be called exotic.
>>
>> if nodes from different eth0 subnets are used, and if i understand correctly
>> your previous replies, orte will "discard" eth0 because nodes cannot contact
>> each other "directly".
>> directly means the nodes are not on the same subnet. that being said, they
>> can communicate via IP thanks to IP routing (i mean IP routing, i do *not*
>> mean orte routing).
>> that means orte communications will use IPoIB which might not be the best
>> thing to do since establishing an IPoIB connection can be long (especially
>> at scale *and* if the arp table is not populated)
>>
>> is my understanding correct so far ?
>>
>> bottom line, i would have expected openmpi uses eth0 regardless IP routing
>> is required, and ib0 is simply not used (or eventually used as a fallback
>> option)
>>
>> this leads to my next question : is the current default ok ? if not should
>> we change it and how ?
>> /*
>> imho :
>> - IP routing is not always a bad/slow thing
>> - gigE can sometimes be better than IPoIB)
>> */
>>
>> i am fine if at the end :
>> - this issue is fixed
>> - we decide it is up to the sysadmin to make --mca oob_tcp_if_include eth0
>> the default if this is really thought to be best for the cluster. (and i can
>> try to draft a faq if needed)
>>
>> Cheers,
>>
>> Gilles
>>
>> On Wed, Jun 4, 2014 at 11:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> I'll work on it - may take a day or two to really fix. Only impacts systems
>> with mismatched interfaces, which is why we aren't generally seeing it.
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/06/14972.php
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14977.php