FWIW: I just tested this on today’s OMPI master and it is working there. Could just be something that didn’t get moved to a release branch.
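In case it helps anyone reproduce this outside of R, the sketch below is roughly what I exercised - a single parent that spawns a couple of workers and then disconnects from them. I'm assuming this mirrors the create/destroy-cluster pattern doMPI uses; it is not the attached R script itself. Build with mpicc and run the parent with "mpirun -np 1 ./spawn_disconnect".

/* spawn_disconnect.c - minimal spawn/disconnect sketch (assumed to
 * approximate the doMPI pattern, not taken from the attached script). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, inter;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Parent: spawn two workers (the "cluster"), then tear it down. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);
        printf("parent: workers spawned, disconnecting\n");
        MPI_Comm_disconnect(&inter);   /* hangs on the affected releases */
        printf("parent: disconnected\n");
    } else {
        /* Worker: nothing to do but disconnect from the parent. */
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}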
> On May 21, 2018, at 8:43 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
>
> Hi Ralph,
>
> Thanks for that. That would also explain why it works with OMPI 1.10.7. In
> which case, I’ll just suggest they continue using 1.10.7 for now.
>
> I just went back over the doMPI R code, and it looks like it’s using
> MPI_Comm_spawn to create its “cluster” of MPI worker processes but then
> using MPI_Comm_disconnect when closing the cluster. I think the idea is that
> they can then create and destroy clusters several times within the same R
> script. But of course, that won’t work here when you can’t disconnect
> processes.
>
> Cheers,
> Ben
>
>
>> On 22 May 2018, at 1:09 pm, r...@open-mpi.org wrote:
>>
>> Comm_connect and Comm_disconnect are both broken in OMPI v2.0 and above,
>> including OMPI master - the precise reasons differ across the various
>> releases. From what I can tell, the problem is on the OMPI side (as opposed
>> to PMIx). I’ll try to file a few issues (since the problem is different in
>> the various releases) in the next few days that point to the problems.
>>
>> Comm_spawn is okay, FWIW.
>>
>> Ralph
>>
>>
>>> On May 21, 2018, at 8:00 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
>>>
>>> Hi,
>>>
>>> Moving this over to the devel list... I’m not sure whether it’s a problem
>>> with PMIx or with OMPI’s integration with it. It looks like the wait_cbfunc
>>> callback enqueued as part of the PMIX_PTL_SEND_RECV at
>>> pmix_client_connect.c:329 is never called, and so the main thread is never
>>> woken from the PMIX_WAIT_THREAD at pmix_client_connect.c:232. (This is for
>>> PMIx v2.1.1.) But I haven’t worked out why that callback is not being
>>> called yet… looking at the output, I think that it’s expecting a message
>>> back from the PMIx server that it’s never getting.
>>>
>>> [raijin7:05505] pmix: disconnect called
>>> [raijin7:05505] [../../../../../src/mca/ptl/tcp/ptl_tcp.c:431] post send to server
>>> [raijin7:05505] posting recv on tag 119
>>> [raijin7:05505] QUEIENG MSG TO SERVER OF SIZE 645
>>> [raijin7:05505] 1746468865:0 ptl:base:send_handler SENDING TO PEER 1746468864:0 tag 119 with NON-NULL msg
>>> [raijin7:05505] ptl:base:send_handler SENDING MSG
>>> [raijin7:05493] 1746468864:0 ptl:base:recv:handler called with peer 1746468865:0
>>> [raijin7:05493] ptl:base:recv:handler allocate new recv msg
>>> [raijin7:05493] ptl:base:recv:handler read hdr on socket 27
>>> [raijin7:05493] RECVD MSG FOR TAG 119 SIZE 645
>>> [raijin7:05493] ptl:base:recv:handler allocate data region of size 645
>>> [raijin7:05505] ptl:base:send_handler MSG SENT
>>> [raijin7:05493] 1746468864:0 RECVD COMPLETE MESSAGE FROM SERVER OF 645 BYTES FOR TAG 119 ON PEER SOCKET 27
>>> [raijin7:05493] [../../../../src/mca/ptl/base/ptl_base_sendrecv.c:507] post msg
>>> [raijin7:05493] 1746468864:0 message received 645 bytes for tag 119 on socket 27
>>> [raijin7:05493] checking msg on tag 119 for tag 0
>>> [raijin7:05493] checking msg on tag 119 for tag 4294967295
>>> [raijin7:05505] pmix: disconnect completed
>>> [raijin7:05493] 1746468864:0 EXECUTE CALLBACK for tag 119
>>> [raijin7:05493] SWITCHYARD for 1746468865:0:27
>>> [raijin7:05493] recvd pmix cmd 11 from 1746468865:0
>>> [raijin7:05493] recvd CONNECT from peer 1746468865:0
>>> [raijin7:05493] get_tracker called with 32 procs
>>> [raijin7:05493] 1746468864:0 CALLBACK COMPLETE
>>>
>>> Here, 5493 is the mpirun and 5505 is one of the spawned processes. All of
>>> the MPI processes (i.e. the original one along with the dynamically
>>> launched ones) look to be waiting on the same pthread_cond_wait in the
>>> backtrace below, while the mpirun is just in the standard event loops
>>> (event_base_loop, oob_tcp_listener, opal_progress_threads,
>>> ptl_base_listener, and pmix_progress_threads).
>>>
>>> That said, I’m not sure why get_tracker is reporting 32 procs — there’s
>>> only 16 running here (i.e. 1 original + 15 spawned).
>>>
>>> Or should I post this over in the PMIx list instead?
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>>> On 17 May 2018, at 9:59 am, Ben Menadue <ben.mena...@nci.org.au> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I’m trying to debug a user’s program that uses dynamic process management
>>>> through Rmpi + doMPI. We’re seeing a hang in MPI_Comm_disconnect. Each of
>>>> the processes is in
>>>>
>>>> #0  0x00007ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>> #1  0x00007ff7130760d3 in PMIx_Disconnect (procs=0x5b0d600, nprocs=<value optimized out>, info=<value optimized out>, ninfo=0) at ../../src/client/pmix_client_connect.c:232
>>>> #2  0x00007ff7132fa670 in ext2x_disconnect (procs=0x7fff48ed6700) at ext2x_client.c:1432
>>>> #3  0x00007ff71a3b7ce4 in ompi_dpm_disconnect (comm=0x5af6910) at ../../../../../ompi/dpm/dpm.c:596
>>>> #4  0x00007ff71a402ff8 in PMPI_Comm_disconnect (comm=0x5a3c4f8) at pcomm_disconnect.c:67
>>>> #5  0x00007ff71a7466b9 in mpi_comm_disconnect () from /home/900/bjm900/R/x86_64-pc-linux-gnu-library/3.4/Rmpi/libs/Rmpi.so
>>>>
>>>> This is using OMPI 3.1.0 against an external install of PMIx 2.1.1, but I
>>>> see exactly the same issue with 3.0.1 using its internal PMIx.
>>>> It looks similar to issue #4542, but the corresponding patch in PR #4549
>>>> doesn’t seem to help (it just hangs in PMIx_Fence instead of
>>>> PMIx_Disconnect).
>>>>
>>>> Attached is the offending R script; it hangs in the “closeCluster” call.
>>>> Has anyone seen this issue? I’m not sure what approach to take to debug
>>>> it, but I have builds of the MPI libraries with --enable-debug available
>>>> if needed.
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>> <Rmpi_test.r>
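To separate the PMIx side from OMPI's dpm layer, something like the stand-alone client sketch below might be useful - it just calls PMIx_Disconnect directly, i.e. the call that frame #1 of the backtrace never returns from, and has to be launched under mpirun so there is a PMIx server to talk to. The choice of process set (everything in our own nspace via the wildcard rank) is my assumption; ompi_dpm_disconnect passes the explicit set of procs from the communicator's groups instead.

/* pmix_disconnect_test.c - hedged sketch of calling PMIx_Disconnect
 * directly from a PMIx client (launch under mpirun). The wildcard
 * process set is an assumption; OMPI builds an explicit proc list. */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_status_t rc;

    rc = PMIx_Init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* Target every process in our own namespace. */
    PMIX_PROC_LOAD(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);

    /* This is the blocking call shown in frame #1 of the backtrace. */
    rc = PMIx_Disconnect(&wildcard, 1, NULL, 0);
    fprintf(stderr, "PMIx_Disconnect returned: %s\n", PMIx_Error_string(rc));

    PMIx_Finalize(NULL, 0);
    return 0;
}

If this returns cleanly while the OMPI path still hangs, that would point at the OMPI integration rather than PMIx itself, consistent with Ralph's assessment above.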
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel