Hi again,
I found out that if I add an
MPI_Barrier after the MPI_Recv part, the minute-long latency disappears.
Is it possible that even when MPI_Recv returns, the openib btl does not
guarantee that the acknowledgement is sent promptly? In other words, can
the computation following the MPI_Recv delay the acknowledgement? If so,
is this the intended behavior, and why is the same behavior not observed
with the tcp btl?
Maxime Boissonneault
On 2013-02-14 11:50, Maxime Boissonneault wrote:
Hi,
I have a strange case here. The application is "plink"
(http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml). The
computation/communication pattern of the application is the following:
1- MPI_Init
2- Some single rank computation
3- MPI_Bcast
4- Some single rank computation
5- MPI_Barrier
6a- rank 0 sends data to each other rank with MPI_Ssend, one rank at a
time.
6b- other ranks use MPI_Recv
7- Some single rank computation
8a- other ranks send result to rank 0 with MPI_Ssend
8b- rank 0 receives data with MPI_Recv
9- rank 0 analyses result
10- MPI_Finalize
The amount of data being sent is on the order of kilobytes, and we are
running over IB.
The problem we observe is in step 6. I've printed timestamps before and
after each MPI operation. With the openib btl, the behavior I observe
is that:
- rank 0 starts sending
- rank n receives almost instantly, and MPI_Recv returns.
- rank 0's MPI_Ssend often returns only _minutes_ later.
It looks like the acknowledgement from rank n takes minutes to reach
rank 0.
Now, the tricky part is that if I disable the openib btl and use tcp
over IB instead, there is no such latency and the acknowledgement
comes back within a fraction of a second. Also, if rank 0 and rank n
are on the same node, the acknowledgement is also quasi-instantaneous
(I guess it goes through the SM btl instead of openib).
I tried to reproduce this in a simple test case, but I observed no such
latency. The duration I measured for the whole communication is on the
order of milliseconds.
Does anyone have an idea of what could cause such very high latencies
when using the OpenIB BTL ?
Also, I tried replacing step 6 with an explicit confirmation:
- rank 0 does MPI_Isend to rank n followed by MPI_Recv from rank n
- rank n does MPI_Recv from rank 0 followed by MPI_Isend to rank 0
In this case too, rank n's MPI_Isend completes quasi-instantaneously,
but rank 0's MPI_Recv only returns a few minutes later.
Thanks,
Maxime Boissonneault
--
---------------------------------
Maxime Boissonneault
Computing analyst - Calcul Québec, Université Laval
Ph.D. in Physics