There is one particular use-case that is not currently supported (connecting jobs launched by separate mpirun invocations), but it will be fixed as time permits. Jobs launched by the same mpirun can currently execute MPI_Comm_connect/accept.
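For reference, here is a minimal sketch of the supported case, with both sides launched by the same mpirun. The program name, the two-rank layout, and the use of MPI_Bcast to hand the port string to the connecting rank are my own illustration, not code from this thread:

/* port_demo.c: sketch of MPI_Comm_accept/MPI_Comm_connect between two
 * ranks started by the same mpirun. Structure is illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        MPI_Open_port(MPI_INFO_NULL, port);       /* server side opens a port */

    /* Hand the port string to the other rank; the Bcast also guarantees
     * the port is open before anyone tries to connect to it. */
    MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Close_port(port);
    } else {
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    }

    MPI_Barrier(inter);               /* exercise the new inter-communicator */
    printf("rank %d connected\n", rank);
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}

Run it with exactly two ranks, e.g. "mpirun -np 2 ./port_demo"; rank 0 plays the server and rank 1 the client.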
> On Apr 4, 2017, at 5:33 AM, Kawashima, Takahiro <t-kawash...@jp.fujitsu.com> wrote:
>
> I filed a PR against v1.10.7, though v1.10.7 may not be released.
>
> https://github.com/open-mpi/ompi/pull/3276
>
> I'm not aware of the v2.1.x issue, sorry. Another developer may be
> able to answer.
>
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
>
>> Bullseye!
>>
>> Thank you, Takahiro, for your quick answer. Brief tests with 1.10.6 show
>> that this did indeed solve the problem! I will look at this in more
>> detail, but it looks really good now.
>>
>> About MPI_Comm_accept in 2.1.x: I've seen a thread here by Adam
>> Sylvester which essentially says that it is not working now, nor in
>> 2.0.x. I've checked master, and it does not work there either. Is
>> there any timeline for this?
>>
>> Thanks a lot!
>>
>> Marcin
>>
>> On 04/04/2017 11:03 AM, Kawashima, Takahiro wrote:
>>> Hi,
>>>
>>> I encountered a similar problem using MPI_COMM_SPAWN last month.
>>> Your problem may be the same.
>>>
>>> The problem was fixed by commit 0951a34 in Open MPI master and
>>> backported to v2.1.x and v2.0.x, but not to v1.8.x and v1.10.x.
>>>
>>> https://github.com/open-mpi/ompi/commit/0951a34
>>>
>>> Please try the attached patch. It was backported for the v1.10 branch.
>>>
>>> The problem is in the memory registration limit calculation in the
>>> openib BTL: processes loop forever in OMPI_FREE_LIST_WAIT_MT when
>>> connecting to other ORTE jobs because openib_reg_mr returns
>>> OMPI_ERR_OUT_OF_RESOURCE. It probably affects MPI_COMM_SPAWN,
>>> MPI_COMM_SPAWN_MULTIPLE, MPI_COMM_ACCEPT, and MPI_COMM_CONNECT.
>>>
>>> Takahiro Kawashima,
>>> MPI development team,
>>> Fujitsu
>>>
>>>> Dear Developers,
>>>>
>>>> This is an old problem, which I described in an email to the users list
>>>> in 2015, but I continue to struggle with it. In short, the MPI_Comm_accept /
>>>> MPI_Comm_disconnect combination causes any communication over the openib
>>>> btl (e.g., even a barrier) to hang after a few clients connect to and
>>>> disconnect from the server. I've noticed that the number of successful
>>>> connects depends on the number of server ranks; e.g., if my server has
>>>> 32 ranks, the communication already hangs for the second connecting
>>>> client.
>>>>
>>>> I have now checked that the problem also exists in 1.10.6. As far as I
>>>> could tell, MPI_Comm_accept is not working in 2.0 and 2.1 at all, so I
>>>> could not test those versions. My previous investigations have shown
>>>> that the problem was introduced in 1.8.4.
>>>>
>>>> I wonder, will this be addressed in Open MPI, or is this part of the MPI
>>>> functionality considered less important than the core? Should I file a
>>>> bug report?
>>>>
>>>> Thanks!
>>>>
>>>> Marcin Krotkiewski
>>>>
>>>> On 09/16/2015 04:06 PM, marcin.krotkiewski wrote:
>>>>> I have run into a freeze / potential bug when using MPI_Comm_accept in
>>>>> a simple client / server implementation. I have attached the two
>>>>> simplest programs I could produce:
>>>>>
>>>>> 1. mpi-receiver.c opens a port using MPI_Open_port and saves the port
>>>>> name to a file.
>>>>>
>>>>> 2. mpi-receiver enters an infinite loop and waits for connections using
>>>>> MPI_Comm_accept.
>>>>>
>>>>> 3. mpi-sender.c connects to that port using MPI_Comm_connect, sends
>>>>> one MPI_UNSIGNED_LONG, calls a barrier, and disconnects using
>>>>> MPI_Comm_disconnect.
>>>>>
>>>>> 4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls the
>>>>> barrier, disconnects using MPI_Comm_disconnect, and goes back to
>>>>> point 2 (the infinite loop).
>>>>>
>>>>> All works fine, but only exactly 5 times. After that the receiver
>>>>> hangs in MPI_Recv, after returning from MPI_Comm_accept. That is 100%
>>>>> repeatable. I have tried with Intel MPI: no such problem there.
>>>>>
>>>>> I execute the programs using Open MPI 1.10 as follows:
>>>>>
>>>>> mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver
>>>>>
>>>>> Do you have any clues what could be the reason? Am I doing something
>>>>> wrong, or is it some problem with the internal state of Open MPI?
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>> Marcin
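For completeness, the reproducer described in the 2015 message boils down to something like the following two sketches. They are reconstructed from the step list above; the file name port.txt, the payload value, and the missing error handling are my guesses, not Marcin's actual code:

/* mpi-receiver.c (sketch): steps 1, 2 and 4 from the description above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Init(&argc, &argv);

    MPI_Open_port(MPI_INFO_NULL, port);            /* step 1: open a port ...  */
    FILE *f = fopen("port.txt", "w");              /* ... and save its name    */
    fprintf(f, "%s\n", port);
    fclose(f);

    while (1) {                                    /* step 2: accept forever   */
        MPI_Comm client;
        unsigned long value;
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);
        MPI_Recv(&value, 1, MPI_UNSIGNED_LONG, 0, 0, client, MPI_STATUS_IGNORE);
        printf("received %lu\n", value);           /* step 4: print ...        */
        MPI_Barrier(client);                       /* ... barrier ...          */
        MPI_Comm_disconnect(&client);              /* ... disconnect and loop  */
    }
}

/* mpi-sender.c (sketch): step 3 from the description above. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    unsigned long value = 42;                      /* illustrative payload     */
    MPI_Comm server;

    MPI_Init(&argc, &argv);

    FILE *f = fopen("port.txt", "r");              /* read the saved port name */
    fgets(port, MPI_MAX_PORT_NAME, f);
    fclose(f);
    port[strcspn(port, "\n")] = '\0';              /* strip trailing newline   */

    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);
    MPI_Send(&value, 1, MPI_UNSIGNED_LONG, 0, 0, server);
    MPI_Barrier(server);
    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}

The receiver is launched as shown above (mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver); each client run would then be something like mpirun -np 1 ./mpi-sender.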