There is no high-speed network on this cluster, only eth0, so all MPI 
communication has to go over TCP on eth0.  I have tried forcing eth0 with 
--mca btl_tcp_if_include eth0, and also by specifying the eth0 subnet.  
(Looking at the btl_tcp_component.c source, I see that a subnet is just 
translated back into the matching interface name, so the two forms are 
equivalent.)
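
For example, these are the two forms I tried (the application name and 
subnet here are placeholders; our real subnet differs):

    mpirun --mca btl_tcp_if_include eth0 ./app
    mpirun --mca btl_tcp_if_include 192.168.1.0/24 ./app

I also ran with --mca btl_base_verbose 100 to get the verbose BTL output 
mentioned below.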

The problem is that including eth0 does not exclude the virtual interfaces 
(eth0:1 and eth0:5 in my case).  According to the bug, the Linux kernel 
assigns the same interface index to the physical interface and its virtual 
aliases.  Because the TCP BTL uses this kernel index to choose the 
interface, it can't distinguish between the physical and virtual 
interfaces.  I can see this play out in the verbose TCP BTL output: OOB 
and TCP communication happen over all three subnets rather than just the 
eth0 subnet, and the job then hangs.
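
The shared index is easy to demonstrate with a few lines of C.  This is 
just a minimal sketch (error handling trimmed, and not necessarily the 
exact call path the BTL takes), but SIOCGIFINDEX reports the same index 
for the physical interface and its aliases:

    #include <net/if.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        /* Any datagram socket works as a handle for interface ioctls. */
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        for (int i = 1; i < argc; i++) {
            struct ifreq ifr;
            memset(&ifr, 0, sizeof(ifr));
            strncpy(ifr.ifr_name, argv[i], IFNAMSIZ - 1);

            /* SIOCGIFINDEX resolves a name (or an alias label like
               "eth0:1") to the kernel's interface index. */
            if (ioctl(fd, SIOCGIFINDEX, &ifr) == 0)
                printf("%-8s -> index %d\n", argv[i], ifr.ifr_ifindex);
            else
                perror(argv[i]);
        }

        close(fd);
        return 0;
    }

Running it as "./ifindex eth0 eth0:1 eth0:5" on a compute node prints the 
same index three times, which is exactly the ambiguity the BTL hits.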

I'm looking into whether we could move IPMI and system management onto 
tun/tap interfaces instead, but I'm not sure that's feasible.  The bug 
report also mentions using tun/tap for MPI, but I don't know what overhead 
that would add.  I was hoping someone might have come up with some other 
workaround.

Thanks,
Kris


> From: George Bosilca (bosilca_at_[hidden])
> Date: 2015-01-26 15:19:40
> 
> Using mpirun --mca btl_tcp_if_exclude eth0 should fix your problem. Otherwise
> you can add it to your configuration file. Everything is extensively
> described in the FAQ.
> 
> George.
> On Jan 26, 2015 12:11 PM, "Kris Kersten" <kkersten_at_[hidden]> wrote:
> 
> > I'm working on an ethernet cluster that uses virtual eth0:* interfaces
> > on the compute nodes for IPMI and system management.  As described in Trac
> > ticket #3339 (https://svn.open-mpi.org/trac/ompi/ticket/3339), this
> > setup confuses the TCP BTL, which can't differentiate between the physical
> > and virtual interfaces.  Verbose BTL output confirms this, showing
> > attempted communication on both the physical and virtual IP addresses
> > followed by a hang.
> >
> > Has there been any progress on this bug?  Or has anyone managed to figure
> > out a workaround?
> >
> > Thanks,
> > Kris
