On 2/10/2012 11:50 AM, Jeff Squyres wrote:
This is an open question to OMPI developers...
It looks like RHEL (and maybe others?) adds the "virbr0" IP interface when Xen
is activated. This IP interface is only used to communicate with the local Xen
instance(s); it is not used to communicate over the real network.
In a case that I saw, the interface is created, set to "up", and is given an IP address
in the 192.168.1.x range. This was done by default -- all the user had done was either say
"yes, I want Xen enabled", or he didn't say he wanted it *disabled* (I'm not sure which).
I've done the latter and hit the same problem. I found instructions
somewhere on the web that explain how to disable virbr0.
This causes a problem if you have Xen enabled on multiple machines in an OMPI job. OMPI
will see the 192.168.1.x address and see that it's "up", so it'll add it to the
eligible subnets that can be used. When OMPI sees that its peer processes also have
192.168.1.x, it'll try to use that network for OOB/BTL traffic -- which will fail,
because these are local-only interfaces.
Should we add "virbr0" to the default value for [btl|oob]_tcp_if_exclude?
What happens to that default if you then set btl_tcp_if_exclude to some
other value on the mpirun command line? That brings me to something that
has annoyed me for a while. It would be nice to have a conf file where
you can list interface names to exclude, without it being interpreted
as a btl_tcp_if_exclude option. For example, certain Sun machines (a
long time ago) had interfaces that went to the diagnostic processor and
caused an issue similar to the virbr0 one. So we started delivering a
conf file that set btl_tcp_if_exclude, but that precluded anyone from
setting btl_tcp_if_include. If we had a file that specified the set of
interfaces to use or exclude, and the user could still operate on the
result of that set, that would be nice.
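
To make the idea concrete, here is a rough sketch of the two-stage
filtering I have in mind. This is not how OMPI works today; the function
name, the site_exclude set (standing in for the contents of the proposed
conf file), and the interface names are all made up for illustration. I
am assuming the user's include and exclude stay mutually exclusive, as
they are now.

```python
def eligible_interfaces(available, site_exclude,
                        user_include=None, user_exclude=None):
    """Return the usable interfaces, keeping the order of `available`.

    Hypothetical semantics: the site-wide list is applied first, then the
    user's include OR exclude is applied to whatever is left.
    """
    if user_include and user_exclude:
        raise ValueError("include and exclude are mutually exclusive")

    # Stage 1: drop everything the site-wide file rules out.
    candidates = [name for name in available if name not in site_exclude]

    # Stage 2: let the user narrow down the remaining set.
    if user_include:
        return [name for name in candidates if name in user_include]
    if user_exclude:
        return [name for name in candidates if name not in user_exclude]
    return candidates


if __name__ == "__main__":
    available = ["lo", "eth0", "ib0", "virbr0"]
    site_exclude = {"lo", "virbr0"}

    # The user can still pick an interface without losing the site list.
    print(eligible_interfaces(available, site_exclude, user_include={"eth0"}))
    # A user exclude is applied on top of the site list.
    print(eligible_interfaces(available, site_exclude, user_exclude={"ib0"}))
```

The point of the design is that the site list never occupies the
btl_tcp_if_exclude slot, so a user include remains legal. Note that in
this sketch the user cannot re-enable something the site file excluded;
whether that should be possible is an open question.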
--td
Or is there another way to detect that an interface is local-only and should
not be used for OOB/BTL communication?
See this post on the user's list:
http://www.open-mpi.org/community/lists/users/2012/02/18432.php
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com