Open MPI cannot handle two interfaces on a node being on the same subnet.  I 
believe the problem is in our matching code when we try to match up a 
connection; the result is the hang you observe.  I also believe it is bad 
practice to have two interfaces on the same subnet.
If you put them on different subnets, things will work fine and communication 
will stripe across both of them.
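For illustration only, here is a sketch of that advice using the interface names and addresses from the original report quoted below; the second subnet (10.3.2.0/24) and the specific host addresses on it are my assumptions, not part of the original thread:

```shell
# On denver: leave eth23 at 10.3.1.1/24, move eth24 onto a second subnet
ifconfig eth24 10.3.2.2 netmask 255.255.255.0

# On chicago: leave eth29 at 10.3.1.3/24, move eth30 onto the same second subnet
ifconfig eth30 10.3.2.4 netmask 255.255.255.0
```

With each port on its own subnet, every TCP BTL endpoint pair is unambiguous, and Open MPI can stripe traffic across both links.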

Rolf


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Richard Bardwell
Sent: Friday, February 17, 2012 5:37 AM
To: Open MPI Users
Subject: Re: [OMPI users] Problem running an mpi application on nodes with 
more than one interface

I had exactly the same problem.
Trying to run MPI between two separate machines, each with two Ethernet
ports, caused really weird behaviour with even the most basic code.
I had to disable one of the Ethernet ports on each machine,
and it worked just fine after that. No idea why, though!

----- Original Message -----
From: Jingcha Joba <pukkimon...@gmail.com>
To: us...@open-mpi.org
Sent: Thursday, February 16, 2012 8:43 PM
Subject: [OMPI users] Problem running an mpi application on nodes with more 
than one interface

Hello Everyone,
This is my first post on the open-mpi list.
I am trying to run a simple program that does a Sendrecv between two nodes, 
each of which has two interface cards.
Both nodes run RHEL 6 with Open MPI 1.4.4 on an 8-core Xeon processor.
What I noticed is that when two or more interfaces are in use on both nodes, 
MPI "hangs" while attempting to connect.
These details might help:
Node 1, Denver, has a single-port "A" card (eth21 - 25.192.xx.xx, which I use 
to ssh to that machine) and a dual-port "B" card (eth23 - 10.3.1.1 and eth24 - 
10.3.1.2).
Node 2, Chicago, has the same single-port A card (eth19 - 25.192.xx.xx, again 
used for ssh) and a dual-port B card (eth29 - 10.3.1.3 and eth30 - 10.3.1.4).
My /etc/hosts looks like:
25.192.xx.xx denver.xxx.com denver
10.3.1.1 denver.xxx.com denver
10.3.1.2 denver.xxx.com denver
25.192.xx.xx chicago.xxx.com chicago
10.3.1.3 chicago.xxx.com chicago
10.3.1.4 chicago.xxx.com chicago
...
...
...
This is how I run it:
mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_exclude 
eth21,eth19,lo,virbr0 --mca btl_base_verbose 30 -np 4 ./Sendrecv
I get a bunch of output from both chicago and denver saying it has found 
components like tcp, sm, and self, and then it hangs at:
[denver.xxx.com:21682] btl: tcp: attempting to 
connect() to address 10.3.1.3 on port 4
[denver.xxx.com:21682] btl: tcp: attempting to 
connect() to address 10.3.1.4 on port 4
However, if I run the same program while also excluding eth29 or eth30, it 
works fine. Something like this:
mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_exclude 
eth21,eth19,eth29,lo,virbr0 --mca btl_base_verbose 30 -np 4 ./Sendrecv
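(A possible alternative, sketched here and untested on this exact setup: Open MPI also has a btl_tcp_if_include parameter that whitelists the interfaces to use instead of enumerating every unwanted one. The assumption is that listing each node's B-card port — eth23 on denver, eth29 on chicago — lets each node pick up whichever name exists locally:)

```shell
mpirun --hostfile host1 --mca btl tcp,sm,self \
       --mca btl_tcp_if_include eth23,eth29 \
       --mca btl_base_verbose 30 -np 4 ./Sendrecv
```

Note that btl_tcp_if_include and btl_tcp_if_exclude are mutually exclusive; specify only one of them.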
My hostfile looks like this:
[sshuser@denver Sendrecv]$ cat host1
denver slots=2
chicago slots=2
I am not sure whether I need to provide anything else; if I do, please feel 
free to ask.
thanks,
--
Joba
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

