Comm_connect and Comm_disconnect are both broken in OMPI v2.0 and above, 
including OMPI master - the precise reasons differ across the various releases. 
From what I can tell, the problem is on the OMPI side (as opposed to PMIx). 
I’ll try to file a few issues over the next few days (since the problem is 
different in the various releases) that point to the specific problems.

Comm_spawn is okay, FWIW
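
For reference, the failing pattern isn’t anything exotic: a spawn followed by a 
disconnect of the resulting intercommunicator. Something along the lines of the 
sketch below is enough to exercise it (hand-written here for illustration, not 
taken from anyone’s actual code; the "./child" command and the process count of 
2 are placeholders):

#include <mpi.h>

/* Sketch only: spawn a couple of children, then disconnect the intercomm.
 * The children would call MPI_Comm_get_parent() and then
 * MPI_Comm_disconnect() on that communicator. */
int main(int argc, char **argv)
{
    MPI_Comm children;

    MPI_Init(&argc, &argv);

    /* this step works */
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

    /* this step hangs on the affected releases */
    MPI_Comm_disconnect(&children);

    MPI_Finalize();
    return 0;
}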

Ralph


> On May 21, 2018, at 8:00 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
> 
> Hi,
> 
> Moving this over to the devel list... I’m not sure if it’s a problem with 
> PMIx or with OMPI’s integration with it. It looks like the wait_cbfunc callback 
> enqueued as part of the PMIX_PTL_SEND_RECV at pmix_client_connect.c:329 is 
> never called, and so the main thread is never woken from the PMIX_WAIT_THREAD 
> at pmix_client_connect.c:232. (This is for PMIx v2.1.1.) But I haven’t yet 
> worked out why that callback is never called… looking at the output, I think 
> the client is expecting a message back from the PMIx server that it never 
> receives.
> 
> [raijin7:05505] pmix: disconnect called
> [raijin7:05505] [../../../../../src/mca/ptl/tcp/ptl_tcp.c:431] post send to 
> server
> [raijin7:05505] posting recv on tag 119
> [raijin7:05505] QUEIENG MSG TO SERVER OF SIZE 645
> [raijin7:05505] 1746468865:0 ptl:base:send_handler SENDING TO PEER 
> 1746468864:0 tag 119 with NON-NULL msg
> [raijin7:05505] ptl:base:send_handler SENDING MSG
> [raijin7:05493] 1746468864:0 ptl:base:recv:handler called with peer 
> 1746468865:0
> [raijin7:05493] ptl:base:recv:handler allocate new recv msg
> [raijin7:05493] ptl:base:recv:handler read hdr on socket 27
> [raijin7:05493] RECVD MSG FOR TAG 119 SIZE 645
> [raijin7:05493] ptl:base:recv:handler allocate data region of size 645
> [raijin7:05505] ptl:base:send_handler MSG SENT
> [raijin7:05493] 1746468864:0 RECVD COMPLETE MESSAGE FROM SERVER OF 645 BYTES 
> FOR TAG 119 ON PEER SOCKET 27
> [raijin7:05493] [../../../../src/mca/ptl/base/ptl_base_sendrecv.c:507] post 
> msg
> [raijin7:05493] 1746468864:0 message received 645 bytes for tag 119 on socket 
> 27
> [raijin7:05493] checking msg on tag 119 for tag 0
> [raijin7:05493] checking msg on tag 119 for tag 4294967295
> [raijin7:05505] pmix: disconnect completed
> [raijin7:05493] 1746468864:0 EXECUTE CALLBACK for tag 119
> [raijin7:05493] SWITCHYARD for 1746468865:0:27
> [raijin7:05493] recvd pmix cmd 11 from 1746468865:0
> [raijin7:05493] recvd CONNECT from peer 1746468865:0
> [raijin7:05493] get_tracker called with 32 procs
> [raijin7:05493] 1746468864:0 CALLBACK COMPLETE
> 
> Here, 5493 is the mpirun and 5505 is one of the spawned processes. All of the 
> MPI processes (i.e. the original one along with the dynamically launched 
> ones) look to be waiting on the same pthread_cond_wait in the backtrace 
> below, while the mpirun is just in the standard event loops (event_base_loop, 
> oob_tcp_listener, opal_progress_threads, ptl_base_listener, and 
> pmix_progress_threads).
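> 
> For anyone following along without the client code handy, the blocking path is 
> essentially a condition-variable handshake: the caller posts the request and 
> sleeps, and the progress thread is supposed to wake it once the server’s reply 
> for that tag is matched. Below is a stand-alone illustration of the pattern, 
> written for this email rather than taken from the PMIx sources (plain pthreads; 
> the names are mine):
> 
> #include <pthread.h>
> #include <stdio.h>
> #include <unistd.h>
> 
> /* Stands in for the per-request lock object the client constructs. */
> typedef struct {
>     pthread_mutex_t mutex;
>     pthread_cond_t  cond;
>     int             active;          /* request still outstanding? */
> } lock_t;
> 
> static lock_t lock = { PTHREAD_MUTEX_INITIALIZER,
>                        PTHREAD_COND_INITIALIZER, 1 };
> 
> /* Stands in for wait_cbfunc: run by the progress thread once the
>  * server's reply for the request's tag is matched. */
> static void completion_cb(void)
> {
>     pthread_mutex_lock(&lock.mutex);
>     lock.active = 0;
>     pthread_cond_signal(&lock.cond);
>     pthread_mutex_unlock(&lock.mutex);
> }
> 
> static void *progress_thread(void *arg)
> {
>     (void)arg;
>     sleep(1);            /* pretend the server's reply arrived */
>     completion_cb();     /* comment this out and main() sleeps forever,
>                           * which is the hang we're seeing */
>     return NULL;
> }
> 
> int main(void)
> {
>     pthread_t tid;
>     pthread_create(&tid, NULL, progress_thread, NULL);
> 
>     /* The PMIX_WAIT_THREAD step: block until the callback fires. */
>     pthread_mutex_lock(&lock.mutex);
>     while (lock.active) {
>         pthread_cond_wait(&lock.cond, &lock.mutex);
>     }
>     pthread_mutex_unlock(&lock.mutex);
> 
>     pthread_join(tid, NULL);
>     printf("blocking call returned\n");
>     return 0;
> }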
> 
> That said, I’m not sure why get_tracker is reporting 32 procs — there are only 
> 16 running here (i.e. 1 original + 15 spawned).
> 
> Or should I post this over in the PMIx list instead?
> 
> Cheers,
> Ben
> 
> 
>> On 17 May 2018, at 9:59 am, Ben Menadue <ben.mena...@nci.org.au> wrote:
>> 
>> Hi,
>> 
>> I’m trying to debug a user’s program that uses dynamic process management 
>> through Rmpi + doMPI. We’re seeing a hang in MPI_Comm_disconnect. Each of 
>> the processes is in
>> 
>> #0  0x00007ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from 
>> /lib64/libpthread.so.0
>> #1  0x00007ff7130760d3 in PMIx_Disconnect (procs=0x5b0d600, nprocs=<value 
>> optimized out>, info=<value optimized out>, ninfo=0) at 
>> ../../src/client/pmix_client_connect.c:232
>> #2  0x00007ff7132fa670 in ext2x_disconnect (procs=0x7fff48ed6700) at 
>> ext2x_client.c:1432
>> #3  0x00007ff71a3b7ce4 in ompi_dpm_disconnect (comm=0x5af6910) at 
>> ../../../../../ompi/dpm/dpm.c:596
>> #4  0x00007ff71a402ff8 in PMPI_Comm_disconnect (comm=0x5a3c4f8) at 
>> pcomm_disconnect.c:67
>> #5  0x00007ff71a7466b9 in mpi_comm_disconnect () from 
>> /home/900/bjm900/R/x86_64-pc-linux-gnu-library/3.4/Rmpi/libs/Rmpi.so
>> 
>> This is Open MPI 3.1.0 built against an external install of PMIx 2.1.1. But I 
>> see exactly the same issue with 3.0.1 using its internal PMIx. It looks similar 
>> to issue #4542, but the corresponding patch in PR#4549 doesn’t seem to help 
>> (it just hangs in PMIx_Fence instead of PMIx_Disconnect).
>> 
>> Attached is the offending R script; it hangs in the “closeCluster” call. Has 
>> anyone seen this issue? I’m not sure what approach to take to debug it, but 
>> I have builds of the MPI libraries with --enable-debug available if needed.
>> 
>> Cheers,
>> Ben
>> 
>> <Rmpi_test.r>
