On May 29, 2010, at 11:35 AM, Rahul Nabar wrote:

> On Sat, May 29, 2010 at 8:19 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> 
>>> From your other note, it sounds like #3 might be the problem here. Do you 
>>> have some nodes that are configured with "eth0" pointing to your 10.x 
>>> network, and other nodes with "eth0" pointing to your 192.x network? I have 
>>> found that having interfaces that share a name but are on different IP 
>>> addresses sometimes causes OMPI to misconnect.
>> 
>> If you randomly got some of those nodes in your allocation, that might 
>> explain why your jobs sometimes work and sometimes don't.
> 
> That is exactly true. On some nodes eth0 is 1Gig and on others 10Gig
> and vice versa. Is that going to be a problem and is there a
> workaround? I mean 192.168 is always the 10Gig and 10.0 the 1 Gig but
> the correspondence with eth0 vs eth1 is not consistent. I'd have liked
> that but couldn't figure out a way to guarantee the order of the eth
> interfaces.

Just set the MCA param oob_tcp_if_include to 192.168 and you should be okay. I 
forget the exact param syntax for specifying an IP network instead of an 
interface name, but you can get it by running:

ompi_info --param oob tcp
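
As a rough sketch only, the setting could be passed on the mpirun command line as shown here. The CIDR form 192.168.0.0/16 is an assumption (the exact network syntax is not confirmed in this thread and may depend on the Open MPI version, so check the ompi_info output first), and ./my_app and -np 16 are placeholders.

```shell
# Assumption: this Open MPI build accepts CIDR notation for oob_tcp_if_include.
# Confirm the accepted syntax with `ompi_info --param oob tcp` before relying on it.
OOB_NET="192.168.0.0/16"   # the 10Gig network from this thread; the /16 prefix length is a guess

# Build the command line. ./my_app and -np 16 are hypothetical placeholders.
MPIRUN_CMD="mpirun --mca oob_tcp_if_include ${OOB_NET} -np 16 ./my_app"

# Print the command for review; once it looks right, run it with: eval "$MPIRUN_CMD"
echo "$MPIRUN_CMD"
```

Selecting by network rather than by interface name is what sidesteps the inconsistent eth0/eth1 ordering across nodes.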


> 
> -- 
> Rahul
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

