I filed a PR against v1.10.7, though v1.10.7 may not be released.

  https://github.com/open-mpi/ompi/pull/3276

I'm not aware of the v2.1.x issue, sorry. Other developers may be
able to answer.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Bullseye!
> 
> Thank you, Takahiro, for your quick answer. Brief tests with 1.10.6 show 
> that this did indeed solve the problem! I will look at this in more 
> detail, but it looks really good now.
> 
> About MPI_Comm_accept in 2.1.x: I've seen a thread here by Adam 
> Sylvester which essentially says that it is not working now, nor in 
> 2.0.x. I've checked master, and it does not work there either. Is 
> there any timeline for this?
> 
> Thanks a lot!
> 
> Marcin
> 
> 
> 
> On 04/04/2017 11:03 AM, Kawashima, Takahiro wrote:
> > Hi,
> >
> > I encountered a similar problem using MPI_COMM_SPAWN last month.
> > Your problem may be the same.
> >
> > The problem was fixed by commit 0951a34 in Open MPI master and
> > backported to v2.1.x and v2.0.x, but not to v1.8.x or v1.10.x.
> >
> >    https://github.com/open-mpi/ompi/commit/0951a34
> >
> > Please try the attached patch. It was backported to the v1.10 branch.
> >
> > The problem is in the memory registration limit calculation in the
> > openib BTL: processes loop forever in OMPI_FREE_LIST_WAIT_MT when
> > connecting to other ORTE jobs because openib_reg_mr returns
> > OMPI_ERR_OUT_OF_RESOURCE. It probably affects MPI_COMM_SPAWN,
> > MPI_COMM_SPAWN_MULTIPLE, MPI_COMM_ACCEPT, and MPI_COMM_CONNECT.
> >
> > Takahiro Kawashima,
> > MPI development team,
> > Fujitsu
> >
> >> Dear Developers,
> >>
> >> This is an old problem, which I described in an email to the users list
> >> in 2015, but I continue to struggle with it. In short, the
> >> MPI_Comm_accept / MPI_Comm_disconnect combination causes any
> >> communication over the openib BTL (e.g., even a barrier) to hang after
> >> a few clients connect to and disconnect from the server. I've noticed
> >> that the number of successful connects depends on the number of server
> >> ranks; e.g., if my server has 32 ranks, the communication already hangs
> >> on the second connecting client.
> >>
> >> I have now checked that the problem also exists in 1.10.6. As far as I
> >> can tell, MPI_Comm_accept is not working in 2.0 and 2.1 at all, so I
> >> could not test those versions. My previous investigation showed that
> >> the problem was introduced in 1.8.4.
> >>
> >> I wonder: will this be addressed in Open MPI, or is this part of the
> >> MPI functionality considered less important than the core? Should I
> >> file a bug report?
> >>
> >> Thanks!
> >>
> >> Marcin Krotkiewski
> >>
> >>
> >> On 09/16/2015 04:06 PM, marcin.krotkiewski wrote:
> >>> I have run into a freeze / potential bug when using MPI_Comm_accept in
> >>> a simple client / server implementation. I have attached the two
> >>> simplest programs I could produce (a rough sketch of them follows the
> >>> list below):
> >>>
> >>>   1. mpi-receiver.c opens a port using MPI_Open_port, saves the port
> >>> name to a file
> >>>
> >>>   2. mpi-receiver enters an infinite loop and waits for connections
> >>> using MPI_Comm_accept
> >>>
> >>>   3. mpi-sender.c connects to that port using MPI_Comm_connect, sends
> >>> one MPI_UNSIGNED_LONG, calls a barrier, and disconnects using
> >>> MPI_Comm_disconnect
> >>>
> >>>   4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls a
> >>> barrier, disconnects using MPI_Comm_disconnect, and goes back to
> >>> step 2 (infinite loop)
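> >>>
> >>> In outline, the attached programs look roughly like this (a trimmed-down
> >>> sketch, with error checking omitted and the port file name chosen just
> >>> as an example; this is not the exact attached code):
> >>>
> >>> /* mpi-receiver.c (sketch) */
> >>> #include <mpi.h>
> >>> #include <stdio.h>
> >>>
> >>> int main(int argc, char **argv)
> >>> {
> >>>     char port[MPI_MAX_PORT_NAME];
> >>>     MPI_Init(&argc, &argv);
> >>>
> >>>     MPI_Open_port(MPI_INFO_NULL, port);        /* 1. open a port and   */
> >>>     FILE *f = fopen("port.txt", "w");          /*    save its name     */
> >>>     fprintf(f, "%s\n", port);
> >>>     fclose(f);
> >>>
> >>>     while (1) {                                /* 2. wait for connections */
> >>>         MPI_Comm client;
> >>>         unsigned long val;
> >>>         MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);
> >>>         MPI_Recv(&val, 1, MPI_UNSIGNED_LONG, 0, 0, client,
> >>>                  MPI_STATUS_IGNORE);           /* 4. read, print, barrier, */
> >>>         printf("received %lu\n", val);         /*    disconnect, repeat    */
> >>>         MPI_Barrier(client);
> >>>         MPI_Comm_disconnect(&client);
> >>>     }
> >>>     /* never reached */
> >>> }
> >>>
> >>> /* mpi-sender.c (sketch) */
> >>> #include <mpi.h>
> >>> #include <stdio.h>
> >>>
> >>> int main(int argc, char **argv)
> >>> {
> >>>     char port[MPI_MAX_PORT_NAME];
> >>>     MPI_Comm server;
> >>>     unsigned long val = 42;
> >>>
> >>>     MPI_Init(&argc, &argv);
> >>>     FILE *f = fopen("port.txt", "r");          /* read the saved port name */
> >>>     fscanf(f, "%s", port);
> >>>     fclose(f);
> >>>
> >>>     MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);
> >>>     MPI_Send(&val, 1, MPI_UNSIGNED_LONG, 0, 0, server); /* 3. send, barrier, */
> >>>     MPI_Barrier(server);                                 /*    disconnect     */
> >>>     MPI_Comm_disconnect(&server);
> >>>
> >>>     MPI_Finalize();
> >>>     return 0;
> >>> }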
> >>>
> >>> Everything works fine, but only exactly 5 times. After that the
> >>> receiver hangs in MPI_Recv after returning from MPI_Comm_accept. That
> >>> is 100% repeatable. I have tried the same with Intel MPI - no such
> >>> problem there.
> >>>
> >>> I execute the programs using Open MPI 1.10 as follows:
> >>>
> >>> mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver
> >>>
> >>>
> >>> Do you have any clue what the reason could be? Am I doing something
> >>> wrong, or is it some problem with the internal state of Open MPI?
> >>>
> >>> Thanks a lot!
> >>>
> >>> Marcin