Hi Ralph,

Thanks for that. That would also explain why it works with OMPI 1.10.7. In which case, I’ll just suggest they continue using 1.10.7 for now.

I just went back over the doMPI R code, and it looks like it’s using MPI_Comm_spawn to create its “cluster” of MPI worker processes, but then using MPI_Comm_disconnect when closing the cluster (roughly the pattern sketched below). I think the idea is that they can then create and destroy clusters several times within the same R script. But of course, that won’t work here when you can’t disconnect processes.
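In C terms, I think the pattern amounts to something like the sketch below. This is just my reading of the doMPI/Rmpi code rather than its actual source; the "./worker" path and the count of 15 are illustrative placeholders.

    /* parent.c: roughly what creating and closing a doMPI cluster does */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        for (int cycle = 0; cycle < 2; cycle++) {
            MPI_Comm workers;    /* intercommunicator to the spawned procs */

            /* create the cluster: launch the worker processes */
            MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 15, MPI_INFO_NULL,
                           0, MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);

            /* ... distribute work over the intercommunicator ... */

            /* close the cluster: collective over both sides, so a fresh
             * cluster can be spawned on the next iteration */
            MPI_Comm_disconnect(&workers);
        }

        MPI_Finalize();
        return 0;
    }

    /* worker.c: the spawned side */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm parent;
        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);    /* intercomm back to the spawner */

        /* ... receive work, send results ... */

        MPI_Comm_disconnect(&parent);    /* must pair with the parent's call */
        MPI_Finalize();
        return 0;
    }

Without a working disconnect, the second trip around the loop has no clean way to retire the first set of workers.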
Cheers,
Ben

> On 22 May 2018, at 1:09 pm, r...@open-mpi.org wrote:
>
> Comm_connect and Comm_disconnect are both broken in OMPI v2.0 and above, including OMPI master - the precise reasons differ across the various releases. From what I can tell, the problem is on the OMPI side (as opposed to PMIx). I’ll try to file a few issues (since the problem is different in the various releases) in the next few days that point to the problems.
>
> Comm_spawn is okay, FWIW
>
> Ralph
>
>> On May 21, 2018, at 8:00 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
>>
>> Hi,
>>
>> Moving this over to the devel list… I’m not sure whether it’s a problem with PMIx or with OMPI’s integration with it. It looks like the wait_cbfunc callback enqueued as part of the PMIX_PTL_SEND_RECV at pmix_client_connect.c:329 is never called, and so the main thread is never woken from the PMIX_WAIT_THREAD at pmix_client_connect.c:232. (This is for PMIx v2.1.1.) But I haven’t worked out why that callback is not being called yet… looking at the output, I think it’s expecting a message back from the PMIx server that never arrives.
>>
>> [raijin7:05505] pmix: disconnect called
>> [raijin7:05505] [../../../../../src/mca/ptl/tcp/ptl_tcp.c:431] post send to server
>> [raijin7:05505] posting recv on tag 119
>> [raijin7:05505] QUEIENG MSG TO SERVER OF SIZE 645
>> [raijin7:05505] 1746468865:0 ptl:base:send_handler SENDING TO PEER 1746468864:0 tag 119 with NON-NULL msg
>> [raijin7:05505] ptl:base:send_handler SENDING MSG
>> [raijin7:05493] 1746468864:0 ptl:base:recv:handler called with peer 1746468865:0
>> [raijin7:05493] ptl:base:recv:handler allocate new recv msg
>> [raijin7:05493] ptl:base:recv:handler read hdr on socket 27
>> [raijin7:05493] RECVD MSG FOR TAG 119 SIZE 645
>> [raijin7:05493] ptl:base:recv:handler allocate data region of size 645
>> [raijin7:05505] ptl:base:send_handler MSG SENT
>> [raijin7:05493] 1746468864:0 RECVD COMPLETE MESSAGE FROM SERVER OF 645 BYTES FOR TAG 119 ON PEER SOCKET 27
>> [raijin7:05493] [../../../../src/mca/ptl/base/ptl_base_sendrecv.c:507] post msg
>> [raijin7:05493] 1746468864:0 message received 645 bytes for tag 119 on socket 27
>> [raijin7:05493] checking msg on tag 119 for tag 0
>> [raijin7:05493] checking msg on tag 119 for tag 4294967295
>> [raijin7:05505] pmix: disconnect completed
>> [raijin7:05493] 1746468864:0 EXECUTE CALLBACK for tag 119
>> [raijin7:05493] SWITCHYARD for 1746468865:0:27
>> [raijin7:05493] recvd pmix cmd 11 from 1746468865:0
>> [raijin7:05493] recvd CONNECT from peer 1746468865:0
>> [raijin7:05493] get_tracker called with 32 procs
>> [raijin7:05493] 1746468864:0 CALLBACK COMPLETE
>>
>> Here, 5493 is the mpirun and 5505 is one of the spawned processes. All of the MPI processes (i.e. the original one along with the dynamically launched ones) look to be waiting on the same pthread_cond_wait in the backtrace below, while the mpirun is just in the standard event loops (event_base_loop, oob_tcp_listener, opal_progress_threads, ptl_base_listener, and pmix_progress_threads).
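(To make the shape of that hang concrete: it is the usual condition-variable handshake, something like the sketch below. The names here are invented for illustration and are not the real PMIx internals; the real code is around pmix_client_connect.c:232. If the reply callback never runs, the waiter sleeps forever.)

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct {
        pthread_mutex_t mutex;
        pthread_cond_t  cond;
        bool            active;    /* true while the request is outstanding */
    } wait_lock_t;

    /* Runs on the progress thread when the server's reply arrives for our tag. */
    static void wait_cbfunc(wait_lock_t *lock)
    {
        pthread_mutex_lock(&lock->mutex);
        lock->active = false;                /* mark the operation complete */
        pthread_cond_signal(&lock->cond);    /* wake the blocked caller */
        pthread_mutex_unlock(&lock->mutex);
    }

    /* What the main thread effectively does inside PMIx_Disconnect. */
    static void wait_for_reply(wait_lock_t *lock)
    {
        pthread_mutex_lock(&lock->mutex);
        while (lock->active) {
            /* blocks here indefinitely if wait_cbfunc is never invoked */
            pthread_cond_wait(&lock->cond, &lock->mutex);
        }
        pthread_mutex_unlock(&lock->mutex);
    }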
>> That said, I’m not sure why get_tracker is reporting 32 procs — there’s only 16 running here (i.e. 1 original + 15 spawned).
>>
>> Or should I post this over on the PMIx list instead?
>>
>> Cheers,
>> Ben
>>
>>> On 17 May 2018, at 9:59 am, Ben Menadue <ben.mena...@nci.org.au> wrote:
>>>
>>> Hi,
>>>
>>> I’m trying to debug a user’s program that uses dynamic process management through Rmpi + doMPI. We’re seeing a hang in MPI_Comm_disconnect; each of the processes is in
>>>
>>> #0 0x00007ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>> #1 0x00007ff7130760d3 in PMIx_Disconnect (procs=0x5b0d600, nprocs=<value optimized out>, info=<value optimized out>, ninfo=0) at ../../src/client/pmix_client_connect.c:232
>>> #2 0x00007ff7132fa670 in ext2x_disconnect (procs=0x7fff48ed6700) at ext2x_client.c:1432
>>> #3 0x00007ff71a3b7ce4 in ompi_dpm_disconnect (comm=0x5af6910) at ../../../../../ompi/dpm/dpm.c:596
>>> #4 0x00007ff71a402ff8 in PMPI_Comm_disconnect (comm=0x5a3c4f8) at pcomm_disconnect.c:67
>>> #5 0x00007ff71a7466b9 in mpi_comm_disconnect () from /home/900/bjm900/R/x86_64-pc-linux-gnu-library/3.4/Rmpi/libs/Rmpi.so
>>>
>>> This is using Open MPI 3.1.0 against an external install of PMIx 2.1.1, but I see exactly the same issue with 3.0.1 using its internal PMIx. It looks similar to issue #4542, but the corresponding patch in PR #4549 doesn’t seem to help (it just hangs in PMIx_Fence instead of PMIx_Disconnect).
>>>
>>> Attached is the offending R script; it hangs in the “closeCluster” call. Has anyone seen this issue? I’m not sure what approach to take to debug it, but I have builds of the MPI libraries with --enable-debug available if needed.
>>>
>>> Cheers,
>>> Ben
>>>
>>> <Rmpi_test.r>
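P.S. If it helps to take Rmpi out of the picture, something like the untested sketch below should exercise the same path with 1 parent + 15 spawned workers, matching the job above (it respawns its own binary, so argv[0] needs to resolve on all nodes). Run it as "mpirun -np 1 ./repro"; if it also hangs in MPI_Comm_disconnect, the problem is squarely on the OMPI/PMIx side rather than anything in R.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Comm other;
        MPI_Init(&argc, &argv);

        MPI_Comm_get_parent(&other);
        if (MPI_COMM_NULL == other) {
            /* parent: spawn 15 copies of this binary as the "cluster" */
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 15, MPI_INFO_NULL,
                           0, MPI_COMM_SELF, &other, MPI_ERRCODES_IGNORE);
            printf("parent: spawned workers, disconnecting\n");
        }

        /* both sides disconnect collectively; this is where the hang appears */
        MPI_Comm_disconnect(&other);
        printf("disconnect completed\n");

        MPI_Finalize();
        return 0;
    }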
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel