FWIW: I just tested this on today’s OMPI master and it is working there. Could 
just be something that didn’t get moved to a release branch.


> On May 21, 2018, at 8:43 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
> 
> Hi Ralph,
> 
> Thanks for that. That would also explain why it works with OMPI 1.10.7. In 
> which case, I’ll just suggest they continue using 1.10.7 for now.
> 
> I just went back over the doMPI R code, and it looks like it’s using 
> MPI_Comm_spawn to create its “cluster” of MPI worker processes but then 
> using MPI_Comm_disconnect when closing the cluster. I think the idea is that 
> they can then create and destroy clusters several times within the same R 
> script. But of course, that won’t work here when you can’t disconnect 
> processes.
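> 
> For anyone who wants to poke at this outside of R, the call sequence that 
> doMPI boils down to is roughly the following. This is only a hedged C sketch 
> (the self-spawn and the worker count of 15 are for illustration; it is not 
> the doMPI source, and the comments are just my reading of what 
> startMPIcluster/closeCluster end up doing):
> 
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
>     MPI_Comm parent, workers;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_get_parent(&parent);
> 
>     if (parent == MPI_COMM_NULL) {
>         /* "startMPIcluster": launch the workers */
>         MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 15, MPI_INFO_NULL, 0,
>                        MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);
> 
>         /* ... hand work to the workers over the intercommunicator ... */
> 
>         /* "closeCluster": tear the workers down so another cluster can
>          * be created later in the same run */
>         MPI_Comm_disconnect(&workers);
>     } else {
>         /* spawned worker: do its work, then disconnect from the parent */
>         MPI_Comm_disconnect(&parent);
>     }
> 
>     MPI_Finalize();
>     return 0;
> }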
> 
> Cheers,
> Ben
> 
> 
> 
>> On 22 May 2018, at 1:09 pm, r...@open-mpi.org wrote:
>> 
>> Comm_connect and Comm_disconnect are both broken in OMPI v2.0 and above, 
>> including OMPI master - the precise reasons differ across the various 
>> releases. From what I can tell, the problem is on the OMPI side (as opposed 
>> to PMIx). I’ll try to file a few issues in the next few days that point to 
>> the problems (since the problem is different in the various releases).
>> 
>> Comm_spawn is okay, FWIW
>> 
>> Ralph
>> 
>> 
>>> On May 21, 2018, at 8:00 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
>>> 
>>> Hi,
>>> 
>>> Moving this over to the devel list... I’m not sure if this is a problem 
>>> with PMIx or with OMPI’s integration with it. It looks like the wait_cbfunc 
>>> callback enqueued as part of the PMIX_PTL_SEND_RECV at 
>>> pmix_client_connect.c:329 is never called, and so the main thread is never 
>>> woken from the PMIX_WAIT_THREAD at pmix_client_connect.c:232. (This is for 
>>> PMIx v2.1.1.) But I haven’t yet worked out why that callback is not being 
>>> called… looking at the output, I think it’s expecting a message back from 
>>> the PMIx server that it never gets.
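>>> 
>>> To make the failure mode concrete: the blocking side of this is the usual 
>>> wait-on-a-callback pattern, sketched below in plain pthreads (an 
>>> illustration only, not the PMIx source or its macros). The API call parks 
>>> the caller on a condition variable, and only the callback, fired from the 
>>> progress thread when the server’s reply is dispatched, can wake it; if 
>>> that dispatch never happens, the caller sits in pthread_cond_wait forever, 
>>> which matches the backtrace further down this thread.
>>> 
>>> #include <pthread.h>
>>> #include <stdbool.h>
>>> #include <unistd.h>
>>> 
>>> static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
>>> static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
>>> static bool done = false;
>>> 
>>> /* stands in for wait_cbfunc: runs on the progress thread once the
>>>  * server's reply has been matched to the posted recv */
>>> static void callback(void)
>>> {
>>>     pthread_mutex_lock(&lock);
>>>     done = true;
>>>     pthread_cond_signal(&cond);
>>>     pthread_mutex_unlock(&lock);
>>> }
>>> 
>>> static void *progress_thread(void *arg)
>>> {
>>>     (void)arg;
>>>     sleep(1);
>>>     callback();   /* comment this out and main() hangs, like the apps here */
>>>     return NULL;
>>> }
>>> 
>>> /* stands in for the PMIX_WAIT_THREAD step in PMIx_Disconnect */
>>> int main(void)
>>> {
>>>     pthread_t t;
>>>     pthread_create(&t, NULL, progress_thread, NULL);
>>> 
>>>     pthread_mutex_lock(&lock);
>>>     while (!done) {
>>>         pthread_cond_wait(&cond, &lock);
>>>     }
>>>     pthread_mutex_unlock(&lock);
>>> 
>>>     pthread_join(t, NULL);
>>>     return 0;
>>> }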
>>> 
>>> [raijin7:05505] pmix: disconnect called
>>> [raijin7:05505] [../../../../../src/mca/ptl/tcp/ptl_tcp.c:431] post send to server
>>> [raijin7:05505] posting recv on tag 119
>>> [raijin7:05505] QUEIENG MSG TO SERVER OF SIZE 645
>>> [raijin7:05505] 1746468865:0 ptl:base:send_handler SENDING TO PEER 1746468864:0 tag 119 with NON-NULL msg
>>> [raijin7:05505] ptl:base:send_handler SENDING MSG
>>> [raijin7:05493] 1746468864:0 ptl:base:recv:handler called with peer 1746468865:0
>>> [raijin7:05493] ptl:base:recv:handler allocate new recv msg
>>> [raijin7:05493] ptl:base:recv:handler read hdr on socket 27
>>> [raijin7:05493] RECVD MSG FOR TAG 119 SIZE 645
>>> [raijin7:05493] ptl:base:recv:handler allocate data region of size 645
>>> [raijin7:05505] ptl:base:send_handler MSG SENT
>>> [raijin7:05493] 1746468864:0 RECVD COMPLETE MESSAGE FROM SERVER OF 645 BYTES FOR TAG 119 ON PEER SOCKET 27
>>> [raijin7:05493] [../../../../src/mca/ptl/base/ptl_base_sendrecv.c:507] post msg
>>> [raijin7:05493] 1746468864:0 message received 645 bytes for tag 119 on socket 27
>>> [raijin7:05493] checking msg on tag 119 for tag 0
>>> [raijin7:05493] checking msg on tag 119 for tag 4294967295
>>> [raijin7:05505] pmix: disconnect completed
>>> [raijin7:05493] 1746468864:0 EXECUTE CALLBACK for tag 119
>>> [raijin7:05493] SWITCHYARD for 1746468865:0:27
>>> [raijin7:05493] recvd pmix cmd 11 from 1746468865:0
>>> [raijin7:05493] recvd CONNECT from peer 1746468865:0
>>> [raijin7:05493] get_tracker called with 32 procs
>>> [raijin7:05493] 1746468864:0 CALLBACK COMPLETE
>>> 
>>> Here, 5493 is the mpirun and 5505 is one of the spawned processes. All of 
>>> the MPI processes (i.e. the original one along with the dynamically 
>>> launched ones) look to be waiting on the same pthread_cond_wait in the 
>>> backtrace below, while the mpirun is just in the standard event loops 
>>> (event_base_loop, oob_tcp_listener, opal_progress_threads, 
>>> ptl_base_listener, and pmix_progress_threads).
>>> 
>>> That said, I’m not sure why get_tracker is reporting 32 procs — there’s 
>>> only 16 running here (i.e. 1 original + 15 spawned).
>>> 
>>> Or should I post this over in the PMIx list instead?
>>> 
>>> Cheers,
>>> Ben
>>> 
>>> 
>>>> On 17 May 2018, at 9:59 am, Ben Menadue <ben.mena...@nci.org.au> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I’m trying to debug a user’s program that uses dynamic process management 
>>>> through Rmpi + doMPI. We’re seeing a hang in MPI_Comm_disconnect. Each of 
>>>> the processes is in
>>>> 
>>>> #0  0x00007ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>> #1  0x00007ff7130760d3 in PMIx_Disconnect (procs=0x5b0d600, nprocs=<value optimized out>, info=<value optimized out>, ninfo=0) at ../../src/client/pmix_client_connect.c:232
>>>> #2  0x00007ff7132fa670 in ext2x_disconnect (procs=0x7fff48ed6700) at ext2x_client.c:1432
>>>> #3  0x00007ff71a3b7ce4 in ompi_dpm_disconnect (comm=0x5af6910) at ../../../../../ompi/dpm/dpm.c:596
>>>> #4  0x00007ff71a402ff8 in PMPI_Comm_disconnect (comm=0x5a3c4f8) at pcomm_disconnect.c:67
>>>> #5  0x00007ff71a7466b9 in mpi_comm_disconnect () from /home/900/bjm900/R/x86_64-pc-linux-gnu-library/3.4/Rmpi/libs/Rmpi.so
>>>> 
>>>> This is using OMPI 3.1.0 against an external install of PMIx 2.1.1, but I 
>>>> see exactly the same issue with 3.0.1 using its internal PMIx. It looks 
>>>> similar to issue #4542, but the corresponding patch in PR#4549 doesn’t 
>>>> seem to help (it just hangs in PMIx_Fence instead of PMIx_Disconnect).
>>>> 
>>>> Attached is the offending R script; it hangs in the “closeCluster” call. 
>>>> Has anyone seen this issue? I’m not sure what approach to take to debug 
>>>> it, but I have builds of the MPI libraries with --enable-debug available 
>>>> if needed.
>>>> 
>>>> Cheers,
>>>> Ben
>>>> 
>>>> <Rmpi_test.r>

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
