Brice,

The connection mechanism in the MX BTL has a serious problem on multi-rail setups when all NICs are identical. If the rails are connected using the same mapper, they will have identical IDs. Unfortunately, these IDs are supposed to be unique in order to guarantee the connection ordering (0 to 0, 1 to 1 and so on, based on the mapper's MAC). However, the outcome I saw in the past in this case was not a deadlock but a poor distribution of the data over the two NICs: one is overloaded while the other is not used at all.
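
To make the ID collision concrete, here is a toy sketch in Python. It is not the MX BTL code: the function name and the dictionary lookup are assumptions made only for illustration. It assumes each local rail picks the remote rail that carries the same ID, so duplicate IDs collapse onto a single remote rail.

```python
def match_rails_by_id(local_ids, remote_ids):
    """Return, for each local rail, the index of the remote rail it targets.

    Hypothetical model: rails are matched purely by ID. Building the lookup
    table from duplicate IDs keeps only the last rail seen for each ID.
    """
    remote_by_id = {rail_id: index for index, rail_id in enumerate(remote_ids)}
    return [remote_by_id[rail_id] for rail_id in local_ids]


# Unique IDs: rail 0 talks to rail 0, rail 1 talks to rail 1.
unique = match_rails_by_id([0, 1], [0, 1])

# Identical IDs: both local rails end up targeting the same remote rail,
# so one NIC carries all the traffic and the other stays idle.
identical = match_rails_by_id([0, 0], [0, 0])
```

In this model the second call sends both rails to the same remote index, which matches the symptom described above (one NIC loaded, the other unused) rather than a deadlock.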

There is no answer from a peer when we connect the MX BTLs. If the steps are the ones you described in your email, then I guess both peers try to connect to each other simultaneously. Now, when you have multiple rails, we treat them at the upper level as independent devices, and we try to load balance the messages over all of them. Step (3) seems to indicate that another (MPI) message has been sent, and because of the load balancing scheme we try to connect the second device (a rail in this context). In MX this works because we use the blocking function (mx_connect).

  george.

On Jun 17, 2009, at 08:23, Brice Goglin wrote:

Hello,

I am debugging some sort of deadlock when doing multirail over Open-MX.
What I am seeing with 2 processes and 2 boards per node with *MX* is:
1) process 0 rail 0 connects to process 1 rail 0
2) p1r0 connects back to p0r0
3) p0 rail 1 connects to p1 rail 1
4) p1r1 connects back to p0r1
For some reason, with *Open-MX*, process 0 seems to start (3) before
process 1 has finished (2). It probably causes a deadlock because p1 is
polling on rail 0 for (2), while (3) needs somebody to poll on rail 1
for the connect handshake.
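
The suspected circular wait can be sketched with a small toy model in Python. This is not Open-MX or OMPI code: the function, its arguments, and the polling rule are assumptions for illustration only. It assumes a process stuck in a blocking connect polls only the rail it is connecting on, while an idle process polls every rail.

```python
def is_deadlocked(blocked, progress_thread=False):
    """Return True if no blocked connect can ever complete.

    blocked maps a process id to (peer, rail): the process is stuck in a
    blocking connect to `peer` on `rail`. Assumption of this toy model: a
    blocked process polls only the rail it is connecting on, and a process
    that is not blocked polls every rail.
    """
    if progress_thread:
        # A background thread would service every rail, so any handshake
        # can complete regardless of what the main thread is doing.
        return False
    for peer, rail in blocked.values():
        if peer not in blocked:
            return False  # the peer is free to answer on any rail
        if blocked[peer][1] == rail:
            return False  # the peer is polling the rail we need
    return len(blocked) > 0


# p0 has started (3) on rail 1 while p1 is still doing (2) on rail 0.
out_of_order = is_deadlocked({0: (1, 1), 1: (0, 0)})

# Both processes are still working on rail 0, as in the MX ordering.
in_order = is_deadlocked({0: (1, 0), 1: (0, 0)})
```

Under these assumptions the out-of-order case cannot make progress, the in-order case can, and passing progress_thread=True removes the deadlock in both.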

So, the question is: is there anything in OMPI (1.3) guaranteeing that
the above 4 steps will occur in some specified order? If so, Open-MX is
probably doing something wrong that breaks the order. If not, adding a
progression thread to Open-MX might be the only solution...

thanks,
Brice

