I’ve verified that the orte/util/listener thread is not being started, so I don’t think it should be involved in this problem.
HTH
Ralph

> On Oct 30, 2015, at 8:07 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Hmmm…there is a hook that would allow the PMIx server to utilize that listener thread, but we aren’t currently using it. Each daemon plus mpirun will call orte_start_listener, but nothing is currently registering, and so the listener in that code is supposed to just return without starting the thread.
>
> So the only listener thread that should exist is the one inside the PMIx server itself. If something else is happening, then that would be a bug. I can look at the orte listener code to ensure that the thread isn’t incorrectly starting.
>
>
>> On Oct 29, 2015, at 10:03 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> Some progress, which puzzles me but might help you understand. Once the deadlock appears, if I manually kill the MPI process on the node where the deadlock was created, the local orte daemon doesn't notice and just keeps waiting.
>>
>> Quick question: I am under the impression that the issue is not in the PMIx server but somewhere around listener_thread_fn in orte/util/listener.c. Possible?
>>
>> George.
>>
>>
>> On Wed, Oct 28, 2015 at 3:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> Should have also clarified: the prior fixes are indeed in the current master.
>>
>>> On Oct 28, 2015, at 12:42 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Nope - I was wrong. The correction on the client side consisted of attempting to time out if the blocking recv failed. We then modified the blocking send/recv so they would handle errors.
>>>
>>> So that problem occurred -after- the server had correctly called accept. The listener code is in opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>
>>> It looks to me like the only way we could drop the accept (assuming the OS doesn’t lose it) is if the file descriptor lies outside the expected range once we fall out of select:
>>>
>>>
>>>     /* Spin accepting connections until all active listen sockets
>>>      * do not have any incoming connections, pushing each connection
>>>      * onto the event queue for processing
>>>      */
>>>     do {
>>>         accepted_connections = 0;
>>>         /* according to the man pages, select replaces the given descriptor
>>>          * set with a subset consisting of those descriptors that are ready
>>>          * for the specified operation - in this case, a read. So we need to
>>>          * first check to see if this file descriptor is included in the
>>>          * returned subset
>>>          */
>>>         if (0 == FD_ISSET(pmix_server_globals.listen_socket, &readfds)) {
>>>             /* this descriptor is not included */
>>>             continue;
>>>         }
>>>
>>>         /* this descriptor is ready to be read, which means a connection
>>>          * request has been received - so harvest it. All we want to do
>>>          * here is accept the connection and push the info onto the event
>>>          * library for subsequent processing - we don't want to actually
>>>          * process the connection here as it takes too long, and so the
>>>          * OS might start rejecting connections due to timeout.
>>>          */
>>>         pending_connection = PMIX_NEW(pmix_pending_connection_t);
>>>         event_assign(&pending_connection->ev, pmix_globals.evbase, -1,
>>>                      EV_WRITE, connection_handler, pending_connection);
>>>         pending_connection->sd = accept(pmix_server_globals.listen_socket,
>>>                                         (struct sockaddr*)&(pending_connection->addr),
>>>                                         &addrlen);
>>>         if (pending_connection->sd < 0) {
>>>             PMIX_RELEASE(pending_connection);
>>>             if (pmix_socket_errno != EAGAIN ||
>>>                 pmix_socket_errno != EWOULDBLOCK) {
>>>                 if (EMFILE == pmix_socket_errno) {
>>>                     PMIX_ERROR_LOG(PMIX_ERR_OUT_OF_RESOURCE);
>>>                 } else {
>>>                     pmix_output(0, "listen_thread: accept() failed: %s (%d).",
>>>                                 strerror(pmix_socket_errno), pmix_socket_errno);
>>>                 }
>>>                 goto done;
>>>             }
>>>             continue;
>>>         }
>>>
>>>         pmix_output_verbose(8, pmix_globals.debug_output,
>>>                             "listen_thread: new connection: (%d, %d)",
>>>                             pending_connection->sd, pmix_socket_errno);
>>>         /* activate the event */
>>>         event_active(&pending_connection->ev, EV_WRITE, 1);
>>>         accepted_connections++;
>>>     } while (accepted_connections > 0);
>>>
>>>
>>>> On Oct 28, 2015, at 12:25 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Looking at the code, it appears that a fix was committed for this problem, and that we correctly resolved the issue found by Paul. The problem is that the fix didn’t get upstreamed, and so it was lost the next time we refreshed PMIx. Sigh.
>>>>
>>>> Let me try to recreate the fix and have you take a gander at it.
>>>>
>>>>
>>>>> On Oct 28, 2015, at 12:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> Here is the discussion - afraid it is fairly lengthy. Ignore the hwloc references in it, as that was a separate issue:
>>>>>
>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
>>>>>
>>>>> It definitely sounds like the same issue creeping in again. I’d appreciate any thoughts on how to correct it. If it helps, you could look at the PMIx master - there are standalone tests in the test/simple directory that fork/exec a child and just do the connection.
>>>>>
>>>>> https://github.com/pmix/master
>>>>>
>>>>> The test server is simptest.c - it will spawn a single copy of simpclient.c by default.
>>>>>
>>>>>
>>>>>> On Oct 27, 2015, at 10:14 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>
>>>>>> Interesting. Do you have a pointer to the commit (and/or to the discussion)?
>>>>>>
>>>>>> I looked at the PMIx code and identified a few issues, but unfortunately none of them seem to fix the problem for good. However, I now need more than 1000 runs to get a deadlock (instead of a few tens).
>>>>>>
>>>>>> Looking with "netstat -ax" at the status of the UDS while the processes are deadlocked, I see 2 UDS with the same name: one from the server, which is in LISTEN state, and one from the client, which is in CONNECTING state (while the client has already sent a message into the socket and is now waiting in a blocking receive). This somehow suggests that the server has not yet called accept on the UDS. Unfortunately, there are 3 threads all doing different flavors of event_base and select, so I have a hard time tracking the path of the UDS on the server side.
>>>>>>
>>>>>> So, in order to validate my assumption, I wrote a minimalistic UDS client and server application and tried different scenarios. The conclusion is that, in order to see the same type of output from "netstat -ax", I have to call listen on the server, connect on the client, and not call accept on the server. [A minimal sketch of such a test is appended at the end of this thread.]
>>>>>>
>>>>>> On the same occasion I also confirmed that the UDS holds the data sent, so there is no need for further synchronization for the case where the data is sent first. We only need to find out how the server forgets to call accept.
>>>>>>
>>>>>> George.
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 27, 2015 at 7:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Hmmm…this looks like it might be that problem we previously saw where the blocking recv hangs in a proc when the blocking send tries to send before the domain socket is actually ready, and so the send fails on the other end. As I recall, it was something to do with the socket options - and then Paul had a problem on some of his machines, and we backed it out?
>>>>>>
>>>>>> I wonder if that’s what is biting us here again, and whether what we need is to either remove the blocking send/recvs altogether or figure out a way to wait until the socket is really ready. [A sketch of one way to wait for connect readiness is appended at the end of this thread.]
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>>
>>>>>>> On Oct 27, 2015, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>
>>>>>>> It appears the branch solves the problem at least partially. I asked one of my students to hammer it pretty badly, and he reported that the deadlocks still occur. He also graciously provided some stack traces:
>>>>>>>
>>>>>>> #0  0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
>>>>>>> #1  0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
>>>>>>> #2  0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7fff3c561960, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>> #3  0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:306
>>>>>>> #4  0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, argv=0x7fff3c561ea8, requested=3, provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
>>>>>>> #5  0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, argv=0x7fff3c561d70, required=3, provided=0x7fff3c561d84) at pinit_thread.c:69
>>>>>>> #6  0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at osu_mbw_mr.c:86
>>>>>>>
>>>>>>> And another process:
>>>>>>>
>>>>>>> #0  0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
>>>>>>> #1  0x00007f7b9b0aa42d in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7ffd62139004 "", size=4) at src/usock/usock.c:168
>>>>>>> #2  0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at src/client/pmix_client.c:844
>>>>>>> #3  0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at src/client/pmix_client.c:1110
>>>>>>> #4  0x00007f7b9b0acc24 in connect_to_server (address=0x7ffd62139330, cbdata=0x7ffd621390e0) at src/client/pmix_client.c:181
>>>>>>> #5  0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7f7b9b4e9b60) at src/client/pmix_client.c:362
>>>>>>> #6  0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
>>>>>>> #7  0x00007f7b9b4eb95f in pmi_component_query (module=0x7ffd62139490, priority=0x7ffd6213948c) at ess_pmi_component.c:90
>>>>>>> #8  0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059 "ess", output_id=-1, components_available=0x7f7b9d431eb0, best_module=0x7ffd621394d0, best_component=0x7ffd621394d8, priority_out=0x0) at mca_base_components_select.c:77
>>>>>>> #9  0x00007f7b9d1a956b in orte_ess_base_select () at base/ess_base_select.c:40
>>>>>>> #10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:219
>>>>>>> #11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, argv=0x7ffd621397f8, requested=3, provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
>>>>>>> #12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, argv=0x7ffd621396c0, required=3, provided=0x7ffd621396d4) at pinit_thread.c:69
>>>>>>> #13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at osu_mbw_mr.c:86
>>>>>>>
>>>>>>> George.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> I haven’t been able to replicate this when using the branch in this PR:
>>>>>>>
>>>>>>> https://github.com/open-mpi/ompi/pull/1073
>>>>>>>
>>>>>>> Would you mind giving it a try? It fixes some other race conditions and might pick this one up too.
>>>>>>>
>>>>>>>
>>>>>>>> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>> Okay, I’ll take a look - I’ve been chasing a race condition that might be related.
>>>>>>>>
>>>>>>>>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>
>>>>>>>>> No, it's using 2 nodes.
>>>>>>>>> George.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>> Is this on a single node?
>>>>>>>>>
>>>>>>>>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>
>>>>>>>>>> I get intermittent deadlocks with the latest trunk. The smallest reproducer is a shell for loop around a small (2 processes), short (20 seconds) MPI application. After a few tens of iterations, MPI_Init will deadlock with the following backtrace:
>>>>>>>>>>
>>>>>>>>>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>>>>>>>>>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>>>>>>>>>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7ffd7934fb90, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>>>>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:305
>>>>>>>>>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, requested=3, provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>>>>>>>>>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, argv=0x7ffd7934ff80, required=3, provided=0x7ffd7934ff94) at pinit_thread.c:69
>>>>>>>>>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at osu_mbw_mr.c:86
>>>>>>>>>>
>>>>>>>>>> On my machines this is reproducible at 100% after anywhere between 50 and 100 iterations.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> George.
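
For reference, a minimal sketch of the kind of Unix-domain-socket test George describes above: a server that listens but deliberately never calls accept, and a client that connects, writes, and then blocks in recv. This is an illustrative reconstruction under those assumptions (the file name uds_test.c and the socket path are made up); it is not George's actual test program.

    /* uds_test.c - run "./uds_test server &" then "./uds_test" as the client,
     * and inspect the sockets with "netstat -ax" while the client is blocked. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    #define SOCK_PATH "/tmp/uds_accept_test"

    int main(int argc, char **argv)
    {
        struct sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);

        int sd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (sd < 0) { perror("socket"); return 1; }

        if (argc > 1 && 0 == strcmp(argv[1], "server")) {
            unlink(SOCK_PATH);
            if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
                listen(sd, 128) < 0) {
                perror("bind/listen"); return 1;
            }
            pause();   /* deliberately never call accept() */
        } else {
            if (connect(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("connect"); return 1;
            }
            char msg[] = "hello";
            if (write(sd, msg, sizeof(msg)) < 0) { perror("write"); return 1; }
            char buf[16];
            recv(sd, buf, sizeof(buf), 0);   /* blocks forever, like the hung client */
        }
        close(sd);
        return 0;
    }

With the server never accepting, the client's write is buffered by the kernel and its recv blocks, which matches the picture George reports: the server's socket stays in LISTEN and the client's socket shows as CONNECTING in "netstat -ax" until accept is called.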
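
Also for reference, on the idea of waiting "until the socket is really ready" before the blocking send: one common idiom is to make the connect non-blocking, poll() the descriptor for writability, and then check SO_ERROR before issuing the first send. The sketch below shows only that general technique; it is not the PMIx usock code, and the function name wait_until_connected and its timeout parameter are invented for illustration.

    #include <poll.h>
    #include <sys/socket.h>

    /* Wait for a pending non-blocking connect() to complete.
     * Returns 0 once the connection is established, -1 on error or timeout. */
    static int wait_until_connected(int sd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = sd, .events = POLLOUT };
        if (poll(&pfd, 1, timeout_ms) <= 0) {
            return -1;                      /* timeout or poll error */
        }
        int err = 0;
        socklen_t len = sizeof(err);
        if (getsockopt(sd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || 0 != err) {
            return -1;                      /* the connect itself failed */
        }
        return 0;                           /* safe to start the handshake send/recv */
    }

The caller would set the socket non-blocking, call connect(), and if the connect has not yet completed, invoke wait_until_connected() before attempting the blocking connect-ack exchange.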