Bullseye!

Thank you, Takahiro, for your quick answer. Brief tests with 1.10.6 show that this did indeed solve the problem! I will look at this in more detail, but it looks really good now.

Regarding MPI_Comm_accept in 2.1.x: I've seen a thread here started by Adam Sylvester which essentially says that it is not working there, nor in 2.0.x. I've checked master, and it does not work there either. Is there any timeline for this?

Thanks a lot!

Marcin



On 04/04/2017 11:03 AM, Kawashima, Takahiro wrote:
Hi,

I encountered a similar problem using MPI_COMM_SPAWN last month.
Your problem may be the same.

The problem was fixed by commit 0951a34 in Open MPI master and
backported to v2.1.x and v2.0.x, but not to v1.8.x and v1.10.x.

   https://github.com/open-mpi/ompi/commit/0951a34

Please try the attached patch; it is a backport for the v1.10 branch.

The problem lies in the memory registration limit calculation in the
openib BTL: openib_reg_mr returns OMPI_ERR_OUT_OF_RESOURCE when
connecting to other ORTE jobs, so processes loop forever in
OMPI_FREE_LIST_WAIT_MT. It probably affects MPI_COMM_SPAWN,
MPI_COMM_SPAWN_MULTIPLE, MPI_COMM_ACCEPT, and MPI_COMM_CONNECT.
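
If you want to double-check that openib is really the trigger before
trying the patch, excluding that BTL, e.g.

  mpirun -np 1 --mca btl ^openib ./mpi-receiver

should avoid the hang, since the inter-job connection then falls back
to the tcp BTL.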

Takahiro Kawashima,
MPI development team,
Fujitsu

Dear Developers,

This is an old problem, which I described in an email to the users list
in 2015, but I continue to struggle with it. In short, the MPI_Comm_accept /
MPI_Comm_disconnect combination causes any subsequent communication over the
openib BTL (even a barrier) to hang after a few clients have connected to and
disconnected from the server. I've noticed that the number of successful
connects depends on the number of server ranks; e.g., if my server has 32
ranks, the communication already hangs for the second connecting client.

I have now checked that the problem also exists in 1.10.6. As far as I
could tell, MPI_Comm_accept does not work at all in 2.0 and 2.1, so I
could not test those versions. My earlier investigation showed that the
problem was introduced in 1.8.4.

I wonder: will this be addressed in Open MPI, or is this part of the MPI
functionality considered less important than the core? Should I file a
bug report?

Thanks!

Marcin Krotkiewski


On 09/16/2015 04:06 PM, marcin.krotkiewski wrote:
I have run into a freeze / potential bug when using MPI_Comm_accept in
a simple client / server implementation. I have attached the two simplest
programs I could produce (a rough sketch of both follows the list below):

  1. mpi-receiver.c opens a port using MPI_Open_port and saves the port
name to a file

  2. mpi-receiver enters an infinite loop and waits for connections using
MPI_Comm_accept

  3. mpi-sender.c connects to that port using MPI_Comm_connect, sends
one MPI_UNSIGNED_LONG, calls a barrier, and disconnects using
MPI_Comm_disconnect

  4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls a barrier,
disconnects using MPI_Comm_disconnect, and goes back to point 2 - the
infinite loop
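
Schematically (simplifying details such as how the port name is passed - in
this sketch it simply goes through a file called port.txt), the two programs
look roughly like this:

/* mpi-receiver.c - sketch; error handling omitted */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Init(&argc, &argv);

    MPI_Open_port(MPI_INFO_NULL, port);     /* step 1: open and publish the port */
    FILE *f = fopen("port.txt", "w");
    fprintf(f, "%s\n", port);
    fclose(f);

    for (;;) {                              /* step 2: wait for the next client */
        MPI_Comm client;
        unsigned long value;
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);

        /* step 4: receive, print, barrier, disconnect, and loop again */
        MPI_Recv(&value, 1, MPI_UNSIGNED_LONG, 0, 0, client, MPI_STATUS_IGNORE);
        printf("received %lu\n", value);
        MPI_Barrier(client);
        MPI_Comm_disconnect(&client);
    }
}

/* mpi-sender.c - sketch of step 3 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm server;
    unsigned long value = 42;               /* arbitrary payload */
    MPI_Init(&argc, &argv);

    FILE *f = fopen("port.txt", "r");       /* read the published port name */
    fgets(port, MPI_MAX_PORT_NAME, f);
    fclose(f);
    port[strcspn(port, "\n")] = '\0';

    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    MPI_Send(&value, 1, MPI_UNSIGNED_LONG, 0, 0, server);
    MPI_Barrier(server);
    MPI_Comm_disconnect(&server);

    MPI_Finalize();
    return 0;
}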

Everything works fine, but only exactly 5 times. After that the receiver
hangs in MPI_Recv, right after returning from MPI_Comm_accept. This is 100%
reproducible. I have tried Intel MPI - no such problem there.

I execute the programs using Open MPI 1.10 as follows:

mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver


Do you have any clues as to what could be the reason? Am I doing something
wrong, or is it some problem with the internal state of Open MPI?

Thanks a lot!

Marcin


