Yes, in Open MPI the connections are usually created on demand. As far as I know there are a few devices that do not abide by this "law", but MX is not one of them.

To be more precise about how the connections are established: if we say that each node has two rails and we're doing a ping-pong, the first message from p0 to p1 will connect the first NIC, and the second message the second NIC (here I made the assumption that both networks are similar). Moreover, in MX the connection is not symmetric, so your (1) and (2) might happen simultaneously.
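
To make that traffic pattern concrete, here is a minimal ping-pong sketch in plain MPI (nothing below is Open MPI internal code). Under the assumption of two similar rails scheduled round-robin by the BTL layer, the first iteration would trigger the connection on the first NIC and the second iteration on the second NIC; the rail selection itself stays invisible at the MPI level.

/* Minimal ping-pong sketch.  Assumption: two similar rails, round-robin
 * scheduling inside the BTL layer.  The first send from rank 0 would
 * trigger the connection on the first NIC, the second send on the second
 * NIC; none of this is visible at the MPI level. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 2; i++) {       /* two iterations -> two rails touched */
        if (rank == 0) {
            MPI_Send(&buf, 1, MPI_INT, 1, i, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_INT, 1, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_INT, 0, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_INT, 0, i, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}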

Does the code contain an MPI_Barrier? If yes, this might be why you see the sequence (1), (2), (3) and (4) ...

  george.

On Jun 17, 2009, at 12:13, Brice Goglin wrote:

Thanks for the answer. So if I understand correctly, the connection
order is decided dynamically, depending on when each peer has some
messages to send and how the upper level load-balances them. There
shouldn't be anything preventing (1) and (2) from happening at the same
time, then. So I wonder why I always see the (1), (2), (3), (4) order
with MX (using IMB) and not with Open-MX...

Brice



George Bosilca wrote:
Brice,

The connection mechanism in the MX BTL suffers from a big problem on
multi-rail (if all NICs are identical). If the rails are connected
using the same mapper, they will have identical IDs. Unfortunately,
these IDs are supposed to be unique in order to guarantee the
connection ordering (0 to 0, 1 to 1, and so on, based on the mapper's
MAC). However, the outcome I saw in the past in this case is not a
deadlock but a poor distribution of the data over the two NICs: one
will be overloaded while the other will not be used at all.

There is no answer expected from the peer when we connect the MX BTLs.
If the steps are the ones you described in your email, then I guess
both peers try to connect to each other simultaneously. Now, when you
have multiple rails, we treat them at the upper level as independent
devices, and we try to load-balance the messages over all of them.
Step (3) seems to indicate that another (MPI) message has been sent,
and because of the load-balancing scheme we try to connect the second
device (rail, in this context). In MX this works because we use the
blocking connect function (mx_connect).
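
To illustrate the scheme, here is a rough sketch of round-robin rail selection with lazy connection. Every name in it (rail_t, rail_connect, rail_send) is made up for illustration; this is not the actual MX BTL code, just the general behavior described above.

/* Hedged sketch only: hypothetical round-robin scheduling over two rails.
 * All names here are illustrative, not Open MPI data structures. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int id; bool connected; } rail_t;

static void rail_connect(rail_t *r, int peer)
{
    /* Stands in for the blocking connect (mx_connect in the MX case):
     * the peer has to progress this rail for the handshake to complete. */
    printf("connect rail %d to peer %d\n", r->id, peer);
    r->connected = true;
}

static void rail_send(rail_t *r, int peer)
{
    printf("send on rail %d to peer %d\n", r->id, peer);
}

static void send_round_robin(rail_t *rails, int nrails, int peer)
{
    static int next = 0;                  /* next rail to use */
    rail_t *r = &rails[next];
    next = (next + 1) % nrails;

    if (!r->connected)                    /* connect on first use only */
        rail_connect(r, peer);
    rail_send(r, peer);
}

int main(void)
{
    rail_t rails[2] = { { 0, false }, { 1, false } };
    for (int i = 0; i < 4; i++)           /* message 1 -> rail 0, message 2 -> rail 1, ... */
        send_round_robin(rails, 2, 1);
    return 0;
}

The second message going to the second rail is exactly what makes step (3) appear right after step (1) on the sender side.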

 george.

On Jun 17, 2009, at 08:23, Brice Goglin wrote:

Hello,

I am debugging some sort of deadlock when doing multirail over Open-MX. What I am seeing with 2 processes and 2 boards per node with *MX* is:
1) process 0 rail 0 connects to process 1 rail 0
2) p1r0 connects back to p0r0
3) p0 rail 1 connects to p1 rail 1
4) p1r1 connects back to p0r1
For some reason, with *Open-MX*, process 0 seems to start (3) before
process 1 has finished (2). This probably causes a deadlock because p1 is
polling on rail 0 for (2), while (3) needs somebody to poll on rail 1
for the connect handshake.

So, the question is: is there anything in OMPI (1.3) guaranteeing that
the above 4 steps will occur in some specified order? If so, Open-MX is
probably doing something wrong that breaks the order. If not, adding a
progression thread to Open-MX might be the only solution...
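
For what it's worth, here is a rough sketch of what such a progression thread could look like: a background thread polling every rail, so that a connect handshake arriving on rail 1 still gets serviced while the application is blocked on rail 0. poll_rail() is a hypothetical stand-in for whatever per-rail progress call Open-MX would expose, not an existing function.

/* Hedged sketch, not Open-MX code. */
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

#define NRAILS 2

/* Hypothetical stand-in: in the real library this would make progress on
 * one rail (complete pending connect handshakes, drain incoming packets). */
static void poll_rail(int rail)
{
    (void)rail;
}

static volatile bool running = true;

static void *progress_thread(void *arg)
{
    (void)arg;
    while (running) {
        for (int r = 0; r < NRAILS; r++)   /* service every rail, not just rail 0 */
            poll_rail(r);
        usleep(100);                       /* don't burn a full core */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, progress_thread, NULL);
    /* ... the application would block in MPI/MX calls here ... */
    usleep(1000);
    running = false;
    pthread_join(tid, NULL);
    return 0;
}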

thanks,
Brice
