Dear Developers,

This is an old problem, which I described in an email to the users list in 2015, but I continue to struggle with it. In short, the MPI_Comm_accept / MPI_Comm_disconnect combination causes any communication over the openib BTL (even a barrier) to hang after a few clients have connected to and disconnected from the server. I've noticed that the number of successful connects depends on the number of server ranks: for example, if my server has 32 ranks, communication already hangs for the second connecting client.

I have now checked that the problem also exists in 1.10.6. As far as I could tell, MPI_Comm_accept does not work at all in 2.0 and 2.1, so I could not test those versions. My previous investigations showed that the problem was introduced in 1.8.4.

I wonder, will this be addressed in OpenMPI, or is this part of the MPI functionality considered less important than the core? Should I file a bug report?

Thanks!

Marcin Krotkiewski


On 09/16/2015 04:06 PM, marcin.krotkiewski wrote:
I have run into a freeze / potential bug when using MPI_Comm_accept in a simple client / server implementation. I have attached the two simplest programs I could produce:

1. mpi-receiver.c opens a port using MPI_Open_port and saves the port name to a file

2. mpi-receiver enters an infinite loop and waits for connections using MPI_Comm_accept

3. mpi-sender.c connects to that port using MPI_Comm_connect, sends one MPI_UNSIGNED_LONG, calls a barrier and disconnects using MPI_Comm_disconnect

4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls a barrier, disconnects using MPI_Comm_disconnect and goes back to point 2 (the infinite loop)

Everything works fine, but only exactly 5 times. After that, the receiver hangs in MPI_Recv, right after returning from MPI_Comm_accept. This is 100% repeatable. I have tried with Intel MPI and saw no such problem.

I execute the programs using OpenMPI 1.10 as follows:

mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver


Do you have any clue what the reason could be? Am I doing something wrong, or is it a problem with the internal state of OpenMPI?

Thanks a lot!

Marcin


_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
