Comm_connect and Comm_disconnect are both broken in OMPI v2.0 and above, including OMPI master - the precise reasons differ across the various releases. From what I can tell, the problem is on the OMPI side (as opposed to PMIx). I’ll try to file a few issues in the next few days (since the problem differs across the releases) that point to the specific problems.

Comm_spawn is okay, FWIW.

Ralph
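For reference, a minimal sketch (in C, not the user's actual Rmpi code) of the spawn/disconnect call pattern under discussion: the program spawns copies of itself, then both the parent group and the children call MPI_Comm_disconnect on the intercommunicator, which is the call reported to hang on the affected releases.

/* Illustrative sketch only: self-spawning program exercising the
 * MPI_Comm_spawn + MPI_Comm_disconnect path discussed in this thread. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, inter;
    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: spawn two copies of this binary, then disconnect. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
        MPI_Comm_disconnect(&inter);   /* reported to hang on affected releases */
    } else {
        /* Child side: disconnect from the parent intercommunicator. */
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}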
> On May 21, 2018, at 8:00 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
>
> Hi,
>
> Moving this over to the devel list... I’m not sure if it’s a problem with PMIx or with OMPI’s integration with it. It looks like the wait_cbfunc callback enqueued as part of the PMIX_PTL_SEND_RECV at pmix_client_connect.c:329 is never called, and so the main thread is never woken from the PMIX_WAIT_THREAD at pmix_client_connect.c:232. (This is for PMIx v2.1.1.) But I haven’t worked out why that callback is not being called yet… looking at the output, I think that it’s expecting a message back from the PMIx server that it’s never getting.
>
> [raijin7:05505] pmix: disconnect called
> [raijin7:05505] [../../../../../src/mca/ptl/tcp/ptl_tcp.c:431] post send to server
> [raijin7:05505] posting recv on tag 119
> [raijin7:05505] QUEIENG MSG TO SERVER OF SIZE 645
> [raijin7:05505] 1746468865:0 ptl:base:send_handler SENDING TO PEER 1746468864:0 tag 119 with NON-NULL msg
> [raijin7:05505] ptl:base:send_handler SENDING MSG
> [raijin7:05493] 1746468864:0 ptl:base:recv:handler called with peer 1746468865:0
> [raijin7:05493] ptl:base:recv:handler allocate new recv msg
> [raijin7:05493] ptl:base:recv:handler read hdr on socket 27
> [raijin7:05493] RECVD MSG FOR TAG 119 SIZE 645
> [raijin7:05493] ptl:base:recv:handler allocate data region of size 645
> [raijin7:05505] ptl:base:send_handler MSG SENT
> [raijin7:05493] 1746468864:0 RECVD COMPLETE MESSAGE FROM SERVER OF 645 BYTES FOR TAG 119 ON PEER SOCKET 27
> [raijin7:05493] [../../../../src/mca/ptl/base/ptl_base_sendrecv.c:507] post msg
> [raijin7:05493] 1746468864:0 message received 645 bytes for tag 119 on socket 27
> [raijin7:05493] checking msg on tag 119 for tag 0
> [raijin7:05493] checking msg on tag 119 for tag 4294967295
> [raijin7:05505] pmix: disconnect completed
> [raijin7:05493] 1746468864:0 EXECUTE CALLBACK for tag 119
> [raijin7:05493] SWITCHYARD for 1746468865:0:27
> [raijin7:05493] recvd pmix cmd 11 from 1746468865:0
> [raijin7:05493] recvd CONNECT from peer 1746468865:0
> [raijin7:05493] get_tracker called with 32 procs
> [raijin7:05493] 1746468864:0 CALLBACK COMPLETE
>
> Here, 5493 is the mpirun and 5505 is one of the spawned processes. All of the MPI processes (i.e. the original one along with the dynamically launched ones) look to be waiting on the same pthread_cond_wait in the backtrace below, while the mpirun is just in the standard event loops (event_base_loop, oob_tcp_listener, opal_progress_threads, ptl_base_listener, and pmix_progress_threads).
>
> That said, I’m not sure why get_tracker is reporting 32 procs — there’s only 16 running here (i.e. 1 original + 15 spawned).
>
> Or should I post this over in the PMIx list instead?
>
> Cheers,
> Ben
>
>
>> On 17 May 2018, at 9:59 am, Ben Menadue <ben.mena...@nci.org.au <mailto:ben.mena...@nci.org.au>> wrote:
>>
>> Hi,
>>
>> I’m trying to debug a user’s program that uses dynamic process management through Rmpi + doMPI. We’re seeing a hang in MPI_Comm_disconnect.
>> Each of the processes is in
>>
>> #0 0x00007ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #1 0x00007ff7130760d3 in PMIx_Disconnect (procs=0x5b0d600, nprocs=<value optimized out>, info=<value optimized out>, ninfo=0) at ../../src/client/pmix_client_connect.c:232
>> #2 0x00007ff7132fa670 in ext2x_disconnect (procs=0x7fff48ed6700) at ext2x_client.c:1432
>> #3 0x00007ff71a3b7ce4 in ompi_dpm_disconnect (comm=0x5af6910) at ../../../../../ompi/dpm/dpm.c:596
>> #4 0x00007ff71a402ff8 in PMPI_Comm_disconnect (comm=0x5a3c4f8) at pcomm_disconnect.c:67
>> #5 0x00007ff71a7466b9 in mpi_comm_disconnect () from /home/900/bjm900/R/x86_64-pc-linux-gnu-library/3.4/Rmpi/libs/Rmpi.so
>>
>> This is using Open MPI 3.1.0 against an external install of PMIx 2.1.1, but I see exactly the same issue with 3.0.1 using its internal PMIx. It looks similar to issue #4542, but the corresponding patch in PR#4549 doesn’t seem to help (it just hangs in PMIx_Fence instead of PMIx_Disconnect).
>>
>> Attached is the offending R script; it hangs in the “closeCluster” call. Has anyone seen this issue? I’m not sure what approach to take to debug it, but I have builds of the MPI libraries with --enable-debug available if needed.
>>
>> Cheers,
>> Ben
>>
>> <Rmpi_test.r>
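To make the hang mechanism described above easier to follow, here is a simplified, self-contained illustration (not the actual PMIx source; the names are only placeholders) of the pattern the backtrace shows: the blocking call posts a request and then sleeps on a condition variable until a completion callback, run from the progress thread when the server's reply is matched, wakes it up. If that callback never fires, as appears to happen here, the caller stays in pthread_cond_wait forever.

/* Simplified stand-in for the blocking-call / completion-callback pattern.
 * The wait loop below plays the role of the PMIX_WAIT_THREAD step at
 * pmix_client_connect.c:232 and wait_cbfunc() plays the completion callback. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            active;   /* true while the request is outstanding */
} lock_t;

static lock_t lock = { PTHREAD_MUTEX_INITIALIZER,
                       PTHREAD_COND_INITIALIZER, true };

/* Completion callback: in the real code this runs once the server's
 * reply for the disconnect request has been received and matched. */
static void wait_cbfunc(void)
{
    pthread_mutex_lock(&lock.mutex);
    lock.active = false;
    pthread_cond_signal(&lock.cond);
    pthread_mutex_unlock(&lock.mutex);
}

/* Stand-in for the progress thread that reads the server's reply. */
static void *progress_thread(void *arg)
{
    sleep(1);          /* pretend the reply arrives after a moment */
    wait_cbfunc();     /* if this were never called, main would hang */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, progress_thread, NULL);

    /* Blocking side: wait until the callback clears the flag. */
    pthread_mutex_lock(&lock.mutex);
    while (lock.active) {
        pthread_cond_wait(&lock.cond, &lock.mutex);
    }
    pthread_mutex_unlock(&lock.mutex);

    pthread_join(tid, NULL);
    printf("disconnect completed\n");
    return 0;
}

In the hang being reported, the equivalent of wait_cbfunc() never runs, so every MPI process remains parked in pthread_cond_wait, which matches the backtrace above.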