Connecting jobs launched by separate mpirun invocations is a use-case that
is not currently supported; it will be fixed as time permits. Jobs launched
by the same mpirun can currently execute MPI_Comm_connect/accept.
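
For reference, a minimal sketch of the supported case: two application
contexts launched by a single mpirun (e.g. mpirun -np 1 ./server : -np 1
./client) that exchange the port name through MPI_Publish_name /
MPI_Lookup_name, which mpirun services for the jobs it launched. The
service name below is illustrative, and synchronization between publish
and lookup is omitted.

/* server.c -- one app context of a single mpirun (illustrative sketch) */
#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port);
    /* mpirun itself answers publish/lookup for jobs it launched */
    MPI_Publish_name("demo-service", MPI_INFO_NULL, port);
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    MPI_Comm_disconnect(&inter);
    MPI_Unpublish_name("demo-service", MPI_INFO_NULL, port);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}

/* client.c -- the other app context of the same mpirun */
#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);
    /* a real client would retry until the name has been published */
    MPI_Lookup_name("demo-service", MPI_INFO_NULL, port);
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}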


> On Apr 4, 2017, at 5:33 AM, Kawashima, Takahiro <t-kawash...@jp.fujitsu.com> 
> wrote:
> 
> I filed a PR against v1.10.7, though v1.10.7 may not be released.
> 
>  https://github.com/open-mpi/ompi/pull/3276
> 
> I'm not aware of the v2.1.x issue, sorry. Other developers may be
> able to answer.
> 
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
> 
>> Bullseye!
>> 
>> Thank you, Takahiro, for your quick answer. Brief tests with 1.10.6 show 
>> that this did indeed solve the problem! I will look at this in more 
>> detail, but it looks really good now.
>> 
>> Regarding MPI_Comm_accept in 2.1.x: I've seen a thread here by Adam 
>> Sylvester which essentially says that it is not working there, nor in 
>> 2.0.x. I've checked master, and it does not work there either. Is 
>> there any timeline for this?
>> 
>> Thanks a lot!
>> 
>> Marcin
>> 
>> 
>> 
>> On 04/04/2017 11:03 AM, Kawashima, Takahiro wrote:
>>> Hi,
>>> 
>>> I encountered a similar problem using MPI_COMM_SPAWN last month.
>>> Your problem may be the same.
>>> 
>>> The problem was fixed by commit 0951a34 on Open MPI master and
>>> backported to v2.1.x and v2.0.x, but not to v1.8.x or v1.10.x.
>>> 
>>>   https://github.com/open-mpi/ompi/commit/0951a34
>>> 
>>> Please try the attached patch; it is a backport for the v1.10 branch.
>>> 
>>> The problem is in the memory registration limit calculation in the
>>> openib BTL: processes loop forever in OMPI_FREE_LIST_WAIT_MT when
>>> connecting to other ORTE jobs, because openib_reg_mr returns
>>> OMPI_ERR_OUT_OF_RESOURCE. It probably affects MPI_COMM_SPAWN,
>>> MPI_COMM_SPAWN_MULTIPLE, MPI_COMM_ACCEPT, and MPI_COMM_CONNECT.
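
To illustrate that failure mode only (all names below are hypothetical; this
is not Open MPI code): if the computed registration limit is lower than what
an additional connection needs, every registration attempt fails with an
out-of-resource error, so a grow-and-wait loop can never make progress.

/* toy_reg_limit.c -- hypothetical model of the hang described above */
#include <stdio.h>

enum { TOY_SUCCESS = 0, TOY_ERR_OUT_OF_RESOURCE = -2 };

static long reg_limit  = 4096;   /* limit computed too low            */
static long registered = 0;      /* bytes registered so far           */

static int toy_reg_mr(long bytes)
{
    if (registered + bytes > reg_limit)
        return TOY_ERR_OUT_OF_RESOURCE;
    registered += bytes;
    return TOY_SUCCESS;
}

int main(void)
{
    long needed = 8192;          /* what one more connection requires */

    /* stand-in for the wait/retry loop: it spins forever because the
     * registration can never succeed under the too-low limit         */
    while (toy_reg_mr(needed) != TOY_SUCCESS)
        ;

    puts("never reached");
    return 0;
}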
>>> 
>>> Takahiro Kawashima,
>>> MPI development team,
>>> Fujitsu
>>> 
>>>> Dear Developers,
>>>> 
>>>> This is an old problem, which I described in an email to the users list
>>>> in 2015, but I continue to struggle with it. In short, the
>>>> MPI_Comm_accept / MPI_Comm_disconnect combination causes any
>>>> communication over the openib BTL (even a barrier) to hang after a few
>>>> clients connect to and disconnect from the server. I've noticed that the
>>>> number of successful connects depends on the number of server ranks;
>>>> e.g., if my server has 32 ranks, the communication already hangs for the
>>>> second connecting client.
>>>> 
>>>> I have now checked that the problem also exists in 1.10.6. As far as I
>>>> could tell, MPI_Comm_accept is not working at all in 2.0 and 2.1, so I
>>>> could not test those versions. My previous investigations have shown
>>>> that the problem was introduced in 1.8.4.
>>>> 
>>>> I wonder: will this be addressed in Open MPI, or is this part of the MPI
>>>> functionality considered less important than the core? Should I file a
>>>> bug report?
>>>> 
>>>> Thanks!
>>>> 
>>>> Marcin Krotkiewski
>>>> 
>>>> 
>>>> On 09/16/2015 04:06 PM, marcin.krotkiewski wrote:
>>>>> I have run into a freeze / potential bug when using MPI_Comm_accept in
>>>>> a simple client / server implementation. I have attached the two
>>>>> simplest programs I could produce (see the sketch after the list):
>>>>> 
>>>>>  1. mpi-receiver.c opens a port using MPI_Open_port and saves the port
>>>>> name to a file
>>>>> 
>>>>>  2. mpi-receiver enters an infinite loop and waits for connections
>>>>> using MPI_Comm_accept
>>>>> 
>>>>>  3. mpi-sender.c connects to that port using MPI_Comm_connect, sends
>>>>> one MPI_UNSIGNED_LONG, calls a barrier, and disconnects using
>>>>> MPI_Comm_disconnect
>>>>> 
>>>>>  4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls a
>>>>> barrier, disconnects using MPI_Comm_disconnect, and goes back to
>>>>> step 2 (the infinite loop)
>>>>> 
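
For reference, a minimal reconstruction of the two programs described above
(these are not the original attachments; error checking is omitted and the
port file name is made up):

/* mpi-receiver.c (sketch): open a port, write its name to a file, then
 * accept clients forever: receive one unsigned long, print it, barrier,
 * disconnect, and accept again.                                        */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    unsigned long value;
    MPI_Comm client;
    FILE *f;

    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port);
    f = fopen("port.txt", "w");                        /* step 1 */
    fprintf(f, "%s\n", port);
    fclose(f);
    for (;;) {                                         /* step 2 */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);
        MPI_Recv(&value, 1, MPI_UNSIGNED_LONG, 0, 0,   /* step 4 */
                 client, MPI_STATUS_IGNORE);
        printf("received %lu\n", value);
        MPI_Barrier(client);
        MPI_Comm_disconnect(&client);
    }
}

/* mpi-sender.c (sketch): read the port name, connect, send one unsigned
 * long, barrier, disconnect.                                           */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    unsigned long value = 42;
    MPI_Comm server;
    FILE *f;

    MPI_Init(&argc, &argv);
    f = fopen("port.txt", "r");                        /* step 3 */
    fscanf(f, "%s", port);
    fclose(f);
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);
    MPI_Send(&value, 1, MPI_UNSIGNED_LONG, 0, 0, server);
    MPI_Barrier(server);
    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}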
>>>>> Everything works fine, but only exactly 5 times. After that the
>>>>> receiver hangs in MPI_Recv after returning from MPI_Comm_accept. That
>>>>> is 100% repeatable. With Intel MPI there is no such problem.
>>>>> 
>>>>> I execute the programs using Open MPI 1.10 as follows:
>>>>> 
>>>>> mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver
>>>>> 
>>>>> 
>>>>> Do you have any clue what the reason could be? Am I doing something
>>>>> wrong, or is it some problem with the internal state of Open MPI?
>>>>> 
>>>>> Thanks a lot!
>>>>> 
>>>>> Marcin