Yes, in Open MPI the connections are usually created on demand. As far as I know there are a few devices that do not abide by this "law", but MX is not one of them.

To be more precise about how the connections are established: if we say that each node has two rails and we're doing a ping-pong, the first message from p0 to p1 will connect the first NIC, and the second message the second NIC (here I made the assumption that both networks are similar). Moreover, in MX the connection is not symmetric, so your (1) and (2) might happen simultaneously.
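
To make that traffic pattern concrete, here is a minimal ping-pong sketch in plain MPI (nothing below is Open MPI internal code). Under the assumption of two similar rails scheduled round-robin by the BTL layer, the first iteration would trigger the connection on the first NIC and the second iteration on the second NIC; the rail selection itself stays invisible at the MPI level.

/* Minimal ping-pong sketch.  Assumption: two similar rails, round-robin
 * scheduling inside the BTL layer.  The first send from rank 0 would
 * trigger the connection on the first NIC, the second send on the second
 * NIC; none of this is visible at the MPI level. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 2; i++) {       /* two iterations -> two rails touched */
        if (rank == 0) {
            MPI_Send(&buf, 1, MPI_INT, 1, i, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_INT, 1, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_INT, 0, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_INT, 0, i, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}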

Does the code contain an MPI_Barrier? If yes, this might be why you see the sequence (1), (2), (3) and (4) ...

  george.

On Jun 17, 2009, at 12:13, Brice Goglin wrote:

Thanks for the answer. So if I understand correctly, the connection
order is decided dynamically, depending on when each peer has some
messages to send and how the upper level load-balances them. There
shouldn't be anything preventing (1) and (2) from happening at the same
time, then. So I wonder why I always see the (1), (2), (3), (4) order
with MX (using IMB) and not with Open-MX...

Brice



George Bosilca wrote:
Brice,

The connection mechanism in the MX BTL suffers from a big problem on
multi-rail (if all NICs are identical). If the rails are connected
using the same mapper, they will have identical IDs. Unfortunately,
these IDs are supposed to be unique in order to guarantee the
connection ordering (0 to 0, 1 to 1, and so on, based on the mapper's
MAC). However, the outcome I saw in the past in this case is not a
deadlock but a poor distribution of the data over the two NICs: one
will be overloaded while the other will not be used at all.

There is no answer expected from the peer when we connect the MX BTLs.
If the steps are the ones you described in your email, then I guess
both peers try to connect to each other simultaneously. Now, when you
have multiple rails, we treat them at the upper level as independent
devices, and we try to load-balance the messages over all of them.
Step (3) seems to indicate that another (MPI) message has been sent,
and because of the load-balancing scheme we try to connect the second
device (rail, in this context). In MX this works because we use the
blocking connect function (mx_connect).
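
To illustrate the scheme, here is a rough sketch of round-robin rail selection with lazy connection. Every name in it (rail_t, rail_connect, rail_send) is made up for illustration; this is not the actual MX BTL code, just the general behavior described above.

/* Hedged sketch only: hypothetical round-robin scheduling over two rails.
 * All names here are illustrative, not Open MPI data structures. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int id; bool connected; } rail_t;

static void rail_connect(rail_t *r, int peer)
{
    /* Stands in for the blocking connect (mx_connect in the MX case):
     * the peer has to progress this rail for the handshake to complete. */
    printf("connect rail %d to peer %d\n", r->id, peer);
    r->connected = true;
}

static void rail_send(rail_t *r, int peer)
{
    printf("send on rail %d to peer %d\n", r->id, peer);
}

static void send_round_robin(rail_t *rails, int nrails, int peer)
{
    static int next = 0;                  /* next rail to use */
    rail_t *r = &rails[next];
    next = (next + 1) % nrails;

    if (!r->connected)                    /* connect on first use only */
        rail_connect(r, peer);
    rail_send(r, peer);
}

int main(void)
{
    rail_t rails[2] = { { 0, false }, { 1, false } };
    for (int i = 0; i < 4; i++)           /* message 1 -> rail 0, message 2 -> rail 1, ... */
        send_round_robin(rails, 2, 1);
    return 0;
}

The second message going to the second rail is exactly what makes step (3) appear right after step (1) on the sender side.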

 george.

On Jun 17, 2009, at 08:23, Brice Goglin wrote:

Hello,

I am debugging some sort of deadlock when doing multirail over Open-MX. What I am seeing with 2 processes and 2 boards per node with *MX* is:
1) process 0 rail 0 connects to process 1 rail 0
2) p1r0 connects back to p0r0
3) p0 rail 1 connects to p1 rail 1
4) p1r1 connects back to p0r1
For some reason, with *Open-MX*, process 0 seems to start (3) before
process 1 has finished (2). This probably causes a deadlock because p1 is
polling on rail 0 for (2), while (3) needs somebody to poll on rail 1
for the connect handshake.

So, the question is: is there anything in OMPI (1.3) guaranteeing that
the above 4 steps will occur in some specified order? If so, Open-MX is
probably doing something wrong that breaks the order. If not, adding a
progression thread to Open-MX might be the only solution...
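
For what it's worth, here is a rough sketch of what such a progression thread could look like: a background thread polling every rail, so that a connect handshake arriving on rail 1 still gets serviced while the application is blocked on rail 0. poll_rail() is a hypothetical stand-in for whatever per-rail progress call Open-MX would expose, not an existing function.

/* Hedged sketch, not Open-MX code. */
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

#define NRAILS 2

/* Hypothetical stand-in: in the real library this would make progress on
 * one rail (complete pending connect handshakes, drain incoming packets). */
static void poll_rail(int rail)
{
    (void)rail;
}

static volatile bool running = true;

static void *progress_thread(void *arg)
{
    (void)arg;
    while (running) {
        for (int r = 0; r < NRAILS; r++)   /* service every rail, not just rail 0 */
            poll_rail(r);
        usleep(100);                       /* don't burn a full core */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, progress_thread, NULL);
    /* ... the application would block in MPI/MX calls here ... */
    usleep(1000);
    running = false;
    pthread_join(tid, NULL);
    return 0;
}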

thanks,
Brice
