There is a significant improvement in the performance of non-blocking MPI calls (over InfiniBand) from Open MPI version 1.4 to version 1.6.
I am comparing two methods of exchanging messages between two nodes. The message size varies from 1 MB to 1 GB.

The first method sends using MPI_Isend() and receives using MPI_Irecv(). The same buffers are used repeatedly to exchange messages between the two nodes, and the buffers are allocated using malloc().

In the second method, the buffers are allocated using MPI_Alloc_mem(), and the send and receive are initialized using MPI_Send_init() and MPI_Recv_init(). The sends and receives are posted using MPI_Start().

In version 1.4, the first method has a peak bidirectional bandwidth of 5.3 GB/s and the second method has a peak of 6.2 GB/s. In version 1.6, both methods have a peak bandwidth of 6.2 GB/s. These peak bandwidths are close to the numbers reported by the ib_read_bw and ib_write_bw commands for InfiniBand.

1. Why does version 1.6 handle nonblocking Isend/Irecv better than version 1.4? I would assume that in the second method, memory is pinned and registered during MPI_Alloc_mem() and the transfers use RDMA direct. In the first method, where the buffers are allocated using malloc(), I would assume that RDMA pipelining is used. I emphasize that the mpi_leave_pinned parameter has its default value of -1 and is turned off in all the runs. I would expect some overhead from registering and unregistering memory during each Isend/Irecv, even though pipelining tries to amortize the cost. The numbers for version 1.4 are in line with this expectation. In version 1.6, however, there seems to be no overhead at all from registering and unregistering memory. What is going on? Do large messages still use RDMA pipelining? How has the RDMA pipeline been improved?

2. To send and receive a large message, Open MPI may choose between RDMA write and RDMA read. If RDMA pipelining is used, it seems advantageous to use RDMA write because some fragments use send/recv semantics. If the memory is registered and the send/recv results in a single RDMA operation, there seems to be nothing to choose between the two. Is that correct? If so, does Open MPI use RDMA write or RDMA read?

Thanks!
Divakar Viswanath