I’ve verified that the orte/util/listener thread is not being started, so I don’t think it should be involved in this problem.
HTH
Ralph

> On Oct 30, 2015, at 8:07 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Hmmm…there is a hook that would allow the PMIx server to utilize that listener thread, but we aren’t currently using it. Each daemon plus mpirun will call orte_start_listener, but nothing is currently registering, and so the listener in that code is supposed to just return without starting the thread.
>
> So the only listener thread that should exist is the one inside the PMIx server itself. If something else is happening, then that would be a bug. I can look at the orte listener code to ensure that the thread isn’t incorrectly starting.
>
>
>> On Oct 29, 2015, at 10:03 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> Some progress, which puzzles me but might help you understand. Once the deadlock appears, if I manually kill the MPI process on the node where the deadlock was created, the local orte daemon doesn't notice and just keeps waiting.
>>
>> Quick question: I am under the impression that the issue is not in the PMIx server but somewhere around listener_thread_fn in orte/util/listener.c. Possible?
>>
>> George.
>>
>>
>> On Wed, Oct 28, 2015 at 3:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> Should have also clarified: the prior fixes are indeed in the current master.
>>
>>> On Oct 28, 2015, at 12:42 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Nope - I was wrong. The correction on the client side consisted of attempting to time out if the blocking recv failed. We then modified the blocking send/recv so they would handle errors.
>>>
>>> So that problem occurred -after- the server had correctly called accept. The listener code is in opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>
>>> It looks to me like the only way we could drop the accept (assuming the OS doesn’t lose it) is if the file descriptor lies outside the expected range once we fall out of select:
>>>
>>>
>>>     /* Spin accepting connections until all active listen sockets
>>>      * do not have any incoming connections, pushing each connection
>>>      * onto the event queue for processing
>>>      */
>>>     do {
>>>         accepted_connections = 0;
>>>         /* according to the man pages, select replaces the given descriptor
>>>          * set with a subset consisting of those descriptors that are ready
>>>          * for the specified operation - in this case, a read. So we need to
>>>          * first check to see if this file descriptor is included in the
>>>          * returned subset
>>>          */
>>>         if (0 == FD_ISSET(pmix_server_globals.listen_socket, &readfds)) {
>>>             /* this descriptor is not included */
>>>             continue;
>>>         }
>>>
>>>         /* this descriptor is ready to be read, which means a connection
>>>          * request has been received - so harvest it. All we want to do
>>>          * here is accept the connection and push the info onto the event
>>>          * library for subsequent processing - we don't want to actually
>>>          * process the connection here as it takes too long, and so the
>>>          * OS might start rejecting connections due to timeout.
>>>          */
>>>         pending_connection = PMIX_NEW(pmix_pending_connection_t);
>>>         event_assign(&pending_connection->ev, pmix_globals.evbase, -1,
>>>                      EV_WRITE, connection_handler, pending_connection);
>>>         pending_connection->sd = accept(pmix_server_globals.listen_socket,
>>>                                         (struct sockaddr*)&(pending_connection->addr),
>>>                                         &addrlen);
>>>         if (pending_connection->sd < 0) {
>>>             PMIX_RELEASE(pending_connection);
>>>             if (pmix_socket_errno != EAGAIN ||
>>>                 pmix_socket_errno != EWOULDBLOCK) {
>>>                 if (EMFILE == pmix_socket_errno) {
>>>                     PMIX_ERROR_LOG(PMIX_ERR_OUT_OF_RESOURCE);
>>>                 } else {
>>>                     pmix_output(0, "listen_thread: accept() failed: %s (%d).",
>>>                                 strerror(pmix_socket_errno), pmix_socket_errno);
>>>                 }
>>>                 goto done;
>>>             }
>>>             continue;
>>>         }
>>>
>>>         pmix_output_verbose(8, pmix_globals.debug_output,
>>>                             "listen_thread: new connection: (%d, %d)",
>>>                             pending_connection->sd, pmix_socket_errno);
>>>         /* activate the event */
>>>         event_active(&pending_connection->ev, EV_WRITE, 1);
>>>         accepted_connections++;
>>>     } while (accepted_connections > 0);
>>>
>>>
>>>> On Oct 28, 2015, at 12:25 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Looking at the code, it appears that a fix was committed for this problem, and that we correctly resolved the issue found by Paul. The problem is that the fix didn’t get upstreamed, and so it was lost the next time we refreshed PMIx. Sigh.
>>>>
>>>> Let me try to recreate the fix and have you take a gander at it.
>>>>
>>>>
>>>>> On Oct 28, 2015, at 12:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> Here is the discussion - afraid it is fairly lengthy. Ignore the hwloc references in it, as that was a separate issue:
>>>>>
>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
>>>>>
>>>>> It definitely sounds like the same issue creeping in again. I’d appreciate any thoughts on how to correct it. If it helps, you could look at the PMIx master - there are standalone tests in the test/simple directory that fork/exec a child and just do the connection.
>>>>>
>>>>> https://github.com/pmix/master
>>>>>
>>>>> The test server is simptest.c - it will spawn a single copy of simpclient.c by default.
>>>>>
>>>>>
>>>>>> On Oct 27, 2015, at 10:14 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>
>>>>>> Interesting. Do you have a pointer to the commit (and/or to the discussion)?
>>>>>>
>>>>>> I looked at the PMIx code and identified a few issues, but unfortunately none of them seem to fix the problem for good. However, I now need more than 1000 runs to get a deadlock (instead of a few tens).
>>>>>>
>>>>>> Looking with "netstat -ax" at the status of the UDS while the processes are deadlocked, I see 2 UDS with the same name: one from the server, which is in LISTEN state, and one from the client, which is in CONNECTING state (while the client has already sent a message into the socket and is now waiting in a blocking receive). This somehow suggests that the server has not yet called accept on the UDS. Unfortunately, there are 3 threads all doing different flavors of event_base and select, so I have a hard time tracking the path of the UDS on the server side.
>>>>>>
>>>>>> So, in order to validate my assumption, I wrote a minimalistic UDS client and server application and tried different scenarios. The conclusion is that, in order to see the same type of output from "netstat -ax", I have to call listen on the server, connect on the client, and not call accept on the server. [A minimal sketch of such a test is appended at the end of this thread.]
>>>>>>
>>>>>> On the same occasion I also confirmed that the UDS holds the data sent, so there is no need for further synchronization for the case where the data is sent first. We only need to find out how the server forgets to call accept.
>>>>>>
>>>>>> George.
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 27, 2015 at 7:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Hmmm…this looks like it might be that problem we previously saw where the blocking recv hangs in a proc when the blocking send tries to send before the domain socket is actually ready, and so the send fails on the other end. As I recall, it was something to do with the socket options - and then Paul had a problem on some of his machines, and we backed it out?
>>>>>>
>>>>>> I wonder if that’s what is biting us here again, and whether what we need is to either remove the blocking send/recvs altogether or figure out a way to wait until the socket is really ready. [A sketch of one way to wait for connect readiness is appended at the end of this thread.]
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>>
>>>>>>> On Oct 27, 2015, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>
>>>>>>> It appears the branch solves the problem at least partially. I asked one of my students to hammer it pretty badly, and he reported that the deadlocks still occur. He also graciously provided some stack traces:
>>>>>>>
>>>>>>> #0  0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
>>>>>>> #1  0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
>>>>>>> #2  0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7fff3c561960, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>> #3  0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:306
>>>>>>> #4  0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, argv=0x7fff3c561ea8, requested=3, provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
>>>>>>> #5  0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, argv=0x7fff3c561d70, required=3, provided=0x7fff3c561d84) at pinit_thread.c:69
>>>>>>> #6  0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at osu_mbw_mr.c:86
>>>>>>>
>>>>>>> And another process:
>>>>>>>
>>>>>>> #0  0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
>>>>>>> #1  0x00007f7b9b0aa42d in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7ffd62139004 "", size=4) at src/usock/usock.c:168
>>>>>>> #2  0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at src/client/pmix_client.c:844
>>>>>>> #3  0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at src/client/pmix_client.c:1110
>>>>>>> #4  0x00007f7b9b0acc24 in connect_to_server (address=0x7ffd62139330, cbdata=0x7ffd621390e0) at src/client/pmix_client.c:181
>>>>>>> #5  0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7f7b9b4e9b60) at src/client/pmix_client.c:362
>>>>>>> #6  0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
>>>>>>> #7  0x00007f7b9b4eb95f in pmi_component_query (module=0x7ffd62139490, priority=0x7ffd6213948c) at ess_pmi_component.c:90
>>>>>>> #8  0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059 "ess", output_id=-1, components_available=0x7f7b9d431eb0, best_module=0x7ffd621394d0, best_component=0x7ffd621394d8, priority_out=0x0) at mca_base_components_select.c:77
>>>>>>> #9  0x00007f7b9d1a956b in orte_ess_base_select () at base/ess_base_select.c:40
>>>>>>> #10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:219
>>>>>>> #11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, argv=0x7ffd621397f8, requested=3, provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
>>>>>>> #12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, argv=0x7ffd621396c0, required=3, provided=0x7ffd621396d4) at pinit_thread.c:69
>>>>>>> #13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at osu_mbw_mr.c:86
>>>>>>>
>>>>>>> George.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> I haven’t been able to replicate this when using the branch in this PR:
>>>>>>>
>>>>>>> https://github.com/open-mpi/ompi/pull/1073
>>>>>>>
>>>>>>> Would you mind giving it a try? It fixes some other race conditions and might pick this one up too.
>>>>>>>
>>>>>>>
>>>>>>>> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>> Okay, I’ll take a look - I’ve been chasing a race condition that might be related.
>>>>>>>>
>>>>>>>>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>
>>>>>>>>> No, it's using 2 nodes.
>>>>>>>>> George.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>> Is this on a single node?
>>>>>>>>>
>>>>>>>>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>
>>>>>>>>>> I get intermittent deadlocks with the latest trunk. The smallest reproducer is a shell for loop around a small (2 processes), short (20 seconds) MPI application. After a few tens of iterations, MPI_Init will deadlock with the following backtrace:
>>>>>>>>>>
>>>>>>>>>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>>>>>>>>>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>>>>>>>>>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7ffd7934fb90, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>>>>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:305
>>>>>>>>>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, requested=3, provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>>>>>>>>>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, argv=0x7ffd7934ff80, required=3, provided=0x7ffd7934ff94) at pinit_thread.c:69
>>>>>>>>>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at osu_mbw_mr.c:86
>>>>>>>>>>
>>>>>>>>>> On my machines this is reproducible at 100% after anywhere between 50 and 100 iterations.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> George.
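
For reference, a minimal sketch of the kind of Unix-domain-socket test George describes above: a server that listens but deliberately never calls accept, and a client that connects, writes, and then blocks in recv. This is an illustrative reconstruction under those assumptions (the file name uds_test.c and the socket path are made up); it is not George's actual test program.

    /* uds_test.c - run "./uds_test server &" then "./uds_test" as the client,
     * and inspect the sockets with "netstat -ax" while the client is blocked. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    #define SOCK_PATH "/tmp/uds_accept_test"

    int main(int argc, char **argv)
    {
        struct sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);

        int sd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (sd < 0) { perror("socket"); return 1; }

        if (argc > 1 && 0 == strcmp(argv[1], "server")) {
            unlink(SOCK_PATH);
            if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
                listen(sd, 128) < 0) {
                perror("bind/listen"); return 1;
            }
            pause();   /* deliberately never call accept() */
        } else {
            if (connect(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("connect"); return 1;
            }
            char msg[] = "hello";
            if (write(sd, msg, sizeof(msg)) < 0) { perror("write"); return 1; }
            char buf[16];
            recv(sd, buf, sizeof(buf), 0);   /* blocks forever, like the hung client */
        }
        close(sd);
        return 0;
    }

With the server never accepting, the client's write is buffered by the kernel and its recv blocks, which matches the picture George reports: the server's socket stays in LISTEN and the client's socket shows as CONNECTING in "netstat -ax" until accept is called.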
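
Also for reference, on the idea of waiting "until the socket is really ready" before the blocking send: one common idiom is to make the connect non-blocking, poll() the descriptor for writability, and then check SO_ERROR before issuing the first send. The sketch below shows only that general technique; it is not the PMIx usock code, and the function name wait_until_connected and its timeout parameter are invented for illustration.

    #include <poll.h>
    #include <sys/socket.h>

    /* Wait for a pending non-blocking connect() to complete.
     * Returns 0 once the connection is established, -1 on error or timeout. */
    static int wait_until_connected(int sd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = sd, .events = POLLOUT };
        if (poll(&pfd, 1, timeout_ms) <= 0) {
            return -1;                      /* timeout or poll error */
        }
        int err = 0;
        socklen_t len = sizeof(err);
        if (getsockopt(sd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || 0 != err) {
            return -1;                      /* the connect itself failed */
        }
        return 0;                           /* safe to start the handshake send/recv */
    }

The caller would set the socket non-blocking, call connect(), and if the connect has not yet completed, invoke wait_until_connected() before attempting the blocking connect-ack exchange.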