2015-11-09 22:42 GMT+06:00 Artem Polyakov <artpo...@gmail.com>:

> This is a very good point, Nysal!
>
> This is definitely a problem, and I can say even more: on average, 3 out of
> every 10 tasks were affected by this bug. Once the PR (
> https://github.com/pmix/master/pull/8) was applied I was able to run 100
> test tasks without any hangs.
>
> Here is some more information on my symptoms. I was observing this without
> OMPI, just running the pmix_client test binary from the PMIx test suite
> with the SLURM PMIx plugin.
> Periodically the application was hanging. Investigation showed that not all
> processes were able to initialize correctly.
> Here is what such a client's backtrace looks like:
>

P.S. I think that this backtrace may be relevant to George's problem as
well. In my case not all of the processes were hanging in
connect_to_server; most of them were able to move forward and reach the Fence.
George, was the backtrace that you posted the same on both processes, or
was it a "random" one from one of them?


> (gdb) bt
> #0  0x00007f1448f1b7eb in recv () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #1  0x00007f144914c191 in pmix_usock_recv_blocking (sd=9,
> data=0x7fff367f7c64 "", size=4) at src/usock/usock.c:166
> #2  0x00007f1449152d18 in recv_connect_ack (sd=9) at
> src/client/pmix_client.c:837
> #3  0x00007f14491546bf in usock_connect (addr=0x7fff367f7d60) at
> src/client/pmix_client.c:1103
> #4  0x00007f144914f94c in connect_to_server (address=0x7fff367f7d60,
> cbdata=0x7fff367f7dd0) at src/client/pmix_client.c:179
> #5  0x00007f1449150421 in PMIx_Init (proc=0x7fff367f81d0) at
> src/client/pmix_client.c:355
> #6  0x0000000000401b97 in main (argc=9, argv=0x7fff367f83d8) at
> pmix_client.c:62
>
>
> The server-side debug has the following lines at the end of the file:
> [cn33:00482] pmix:server register client slurm.pmix.22.0:10
> [cn33:00482] pmix:server _register_client for nspace slurm.pmix.22.0 rank
> 10
> [cn33:00482] pmix:server setup_fork for nspace slurm.pmix.22.0 rank 10
>
> In normal operation, the following lines should appear after the lines above:
> ....
> [cn33:00188] listen_thread: new connection: (26, 0)
> [cn33:00188] connection_handler: new connection: 26
> [cn33:00188] RECV CONNECT ACK FROM PEER ON SOCKET 26
> [cn33:00188] waiting for blocking recv of 16 bytes
> [cn33:00188] blocking receive complete from remote
> ....
>
> On the client side I see the following lines:
> [cn33:00491] usock_peer_try_connect: attempting to connect to server
> [cn33:00491] usock_peer_try_connect: attempting to connect to server on
> socket 10
> [cn33:00491] pmix: SEND CONNECT ACK
> [cn33:00491] sec: native create_cred
> [cn33:00491] sec: using credential 1000:1000
> [cn33:00491] send blocking of 54 bytes to socket 10
> [cn33:00491] blocking send complete to socket 10
> [cn33:00491] pmix: RECV CONNECT ACK FROM SERVER
> [cn33:00491] waiting for blocking recv of 4 bytes
> [cn33:00491] blocking_recv received error 11:Resource temporarily
> unavailable from remote - cycling
> [cn33:00491] blocking_recv received error 11:Resource temporarily
> unavailable from remote - cycling
> [... repeated many times ...]
>
> With the fix for the problem highlighted by Nysal, everything runs cleanly.
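For context, the "error 11 ... cycling" lines above come from the client's blocking receive, which simply retries on EAGAIN until the server finally answers; if the server never accepts the connection, that loop spins forever. A minimal sketch of the retry pattern (illustrative names, assuming a non-blocking socket; not the actual pmix_usock_recv_blocking() code):

    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Keep reading until 'size' bytes have arrived, retrying on EAGAIN /
     * EWOULDBLOCK the same way the client log above keeps "cycling".
     * If the peer never sends anything (e.g. the server never accepts the
     * connection and never writes its ack), this loop never returns,
     * which is exactly the hang seen in the backtrace. */
    static bool recv_blocking_sketch(int sd, char *data, size_t size)
    {
        size_t cnt = 0;
        while (cnt < size) {
            ssize_t rc = recv(sd, data + cnt, size - cnt, 0);
            if (rc < 0) {
                if (EAGAIN == errno || EWOULDBLOCK == errno || EINTR == errno) {
                    continue;   /* nothing there yet - cycle and try again */
                }
                return false;   /* hard error */
            }
            if (0 == rc) {
                return false;   /* peer closed the socket */
            }
            cnt += (size_t)rc;
        }
        return true;
    }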
>
>
> 2015-11-09 10:53 GMT+06:00 Nysal Jan K A <jny...@gmail.com>:
>
>> In listen_thread():
>> 194     while (pmix_server_globals.listen_thread_active) {
>> 195         FD_ZERO(&readfds);
>> 196         FD_SET(pmix_server_globals.listen_socket, &readfds);
>> 197         max = pmix_server_globals.listen_socket;
>>
>> Is it possible that pmix_server_globals.listen_thread_active can be
>> false, in which case the thread just exits and will never call accept()?
>>
>> In pmix_start_listening():
>> 147         /* fork off the listener thread */
>> 148         if (0 > pthread_create(&engine, NULL, listen_thread, NULL)) {
>> 149             return PMIX_ERROR;
>> 150         }
>> 151         pmix_server_globals.listen_thread_active = true;
>>
>> pmix_server_globals.listen_thread_active is set to true after the thread
>> is created; could this cause a race?
>> listen_thread_active might also need to be declared volatile.
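One way to read Nysal's observation: if listen_thread_active is published only after pthread_create() returns, the new thread can run first, see the flag still false, and exit without ever reaching accept(), which would match the hang above. A minimal sketch of the reordering he suggests, with illustrative names rather than the real PMIx symbols:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /* volatile so the listener re-reads the flag on every iteration
     * instead of caching it in a register */
    static volatile bool listen_thread_active = false;

    static void *listen_thread(void *arg)
    {
        (void)arg;
        while (listen_thread_active) {
            /* the real thread would select()/accept() here */
            usleep(1000);
        }
        return NULL;
    }

    static int start_listening(pthread_t *engine)
    {
        /* publish the flag BEFORE the thread exists, so it can never
         * observe a stale 'false' and exit prematurely */
        listen_thread_active = true;
        if (0 != pthread_create(engine, NULL, listen_thread, NULL)) {
            listen_thread_active = false;   /* roll back on failure */
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        pthread_t engine;
        if (0 != start_listening(&engine)) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
        sleep(1);
        listen_thread_active = false;       /* ask the listener to stop */
        pthread_join(engine, NULL);
        puts("listener exited cleanly");
        return 0;
    }

Strictly speaking, volatile only keeps the compiler from caching the flag; a C11 atomic (or a mutex-protected flag) would be the more portable way to make the hand-off well defined.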
>>
>> Regards
>> --Nysal
>>
>> On Sun, Nov 8, 2015 at 10:38 PM, George Bosilca <bosi...@icl.utk.edu>
>> wrote:
>>
>>> We had a power outage last week and the local disks on our cluster were
>>> wiped out. My tester was in there. But I can rewrite it after SC.
>>>
>>>   George.
>>>
>>> On Sat, Nov 7, 2015 at 12:04 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Could you send me your stress test? I’m wondering if it is just
>>>> something about how we set socket options.
>>>>
>>>>
>>>> On Nov 7, 2015, at 8:58 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>
>>>> I had to postpone this until after SC. However, I ran a stress test of
>>>> UDS for 3 days, reproducing the opening and sending of data (what Ralph
>>>> described in his email), and I could never get a deadlock.
>>>>
>>>>   George.
>>>>
>>>>
>>>> On Sat, Nov 7, 2015 at 11:26 AM, Ralph Castain <r...@open-mpi.org>
>>>> wrote:
>>>>
>>>>> George was looking into it, but I don’t know if he has had time
>>>>> recently to continue the investigation. We understand “what” is happening
>>>>> (accept sometimes ignores the connection), but we don’t yet know “why”.
>>>>> I’ve done some digging around the web, and found that sometimes you can
>>>>> try to talk to a Unix Domain Socket too quickly - i.e., you open it and
>>>>> then send to it, but the OS hasn’t yet set it up. In those cases, you can hang
>>>>> the socket. However, I’ve tried adding some artificial delay, and while it
>>>>> helped, it didn’t completely solve the problem.
>>>>>
>>>>> I have an idea for a workaround (set a timer and retry after a while),
>>>>> but would obviously prefer a real solution. I’m not even sure it will
>>>>> work, as it is unclear that the server (who is the one hung in accept)
>>>>> will break free if the client closes the socket and retries.
>>>>>
>>>>>
>>>>> On Nov 6, 2015, at 10:53 PM, Artem Polyakov <artpo...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hello, is there any progress on this topic? This affects our PMIx
>>>>> measurements.
>>>>>
>>>>> 2015-10-30 21:21 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>>>>
>>>>>> I’ve verified that the orte/util/listener thread is not being
>>>>>> started, so I don’t think it should be involved in this problem.
>>>>>>
>>>>>> HTH
>>>>>> Ralph
>>>>>>
>>>>>> On Oct 30, 2015, at 8:07 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> Hmmm…there is a hook that would allow the PMIx server to utilize that
>>>>>> listener thread, but we aren’t currently using it. Each daemon plus
>>>>>> mpirun will call orte_start_listener, but nothing is currently
>>>>>> registering, and so the listener in that code is supposed to just
>>>>>> return without starting the thread.
>>>>>>
>>>>>> So the only listener thread that should exist is the one inside the
>>>>>> PMIx server itself. If something else is happening, then that would be a
>>>>>> bug. I can look at the orte listener code to ensure that the thread isn’t
>>>>>> incorrectly starting.
>>>>>>
>>>>>>
>>>>>> On Oct 29, 2015, at 10:03 PM, George Bosilca <bosi...@icl.utk.edu>
>>>>>> wrote:
>>>>>>
>>>>>> Some progress that puzzles me but might help you understand. Once
>>>>>> the deadlock appears, if I manually kill the MPI process on the node
>>>>>> where the deadlock was created, the local orte daemon doesn't notice
>>>>>> and will just keep waiting.
>>>>>>
>>>>>> Quick question: I am under the impression that the issue is not in
>>>>>> the PMIx server but somewhere around the listener_thread_fn in
>>>>>> orte/util/listener.c. Possible?
>>>>>>
>>>>>>   George.
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 28, 2015 at 3:56 AM, Ralph Castain <r...@open-mpi.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Should have also clarified: the prior fixes are indeed in the
>>>>>>> current master.
>>>>>>>
>>>>>>> On Oct 28, 2015, at 12:42 AM, Ralph Castain <r...@open-mpi.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Nope - I was wrong. The correction on the client side consisted of
>>>>>>> attempting to timeout if the blocking recv failed. We then modified the
>>>>>>> blocking send/recv so they would handle errors.
>>>>>>>
>>>>>>> So that problem occurred -after- the server had correctly called
>>>>>>> accept. The listener code is in
>>>>>>> opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>>>>
>>>>>>> It looks to me like the only way we could drop the accept (assuming
>>>>>>> the OS doesn’t lose it) is if the file descriptor lies outside the
>>>>>>> expected range once we fall out of select:
>>>>>>>
>>>>>>>
>>>>>>>         /* Spin accepting connections until all active listen sockets
>>>>>>>          * do not have any incoming connections, pushing each connection
>>>>>>>          * onto the event queue for processing
>>>>>>>          */
>>>>>>>         do {
>>>>>>>             accepted_connections = 0;
>>>>>>>             /* according to the man pages, select replaces the given descriptor
>>>>>>>              * set with a subset consisting of those descriptors that are ready
>>>>>>>              * for the specified operation - in this case, a read. So we need to
>>>>>>>              * first check to see if this file descriptor is included in the
>>>>>>>              * returned subset
>>>>>>>              */
>>>>>>>             if (0 == FD_ISSET(pmix_server_globals.listen_socket, &readfds)) {
>>>>>>>                 /* this descriptor is not included */
>>>>>>>                 continue;
>>>>>>>             }
>>>>>>>
>>>>>>>             /* this descriptor is ready to be read, which means a connection
>>>>>>>              * request has been received - so harvest it. All we want to do
>>>>>>>              * here is accept the connection and push the info onto the event
>>>>>>>              * library for subsequent processing - we don't want to actually
>>>>>>>              * process the connection here as it takes too long, and so the
>>>>>>>              * OS might start rejecting connections due to timeout.
>>>>>>>              */
>>>>>>>             pending_connection = PMIX_NEW(pmix_pending_connection_t);
>>>>>>>             event_assign(&pending_connection->ev, pmix_globals.evbase, -1,
>>>>>>>                          EV_WRITE, connection_handler, pending_connection);
>>>>>>>             pending_connection->sd = accept(pmix_server_globals.listen_socket,
>>>>>>>                                             (struct sockaddr*)&(pending_connection->addr),
>>>>>>>                                             &addrlen);
>>>>>>>             if (pending_connection->sd < 0) {
>>>>>>>                 PMIX_RELEASE(pending_connection);
>>>>>>>                 if (pmix_socket_errno != EAGAIN ||
>>>>>>>                     pmix_socket_errno != EWOULDBLOCK) {
>>>>>>>                     if (EMFILE == pmix_socket_errno) {
>>>>>>>                         PMIX_ERROR_LOG(PMIX_ERR_OUT_OF_RESOURCE);
>>>>>>>                     } else {
>>>>>>>                         pmix_output(0, "listen_thread: accept() failed: %s (%d).",
>>>>>>>                                     strerror(pmix_socket_errno), pmix_socket_errno);
>>>>>>>                     }
>>>>>>>                     goto done;
>>>>>>>                 }
>>>>>>>                 continue;
>>>>>>>             }
>>>>>>>
>>>>>>>             pmix_output_verbose(8, pmix_globals.debug_output,
>>>>>>>                                 "listen_thread: new connection: (%d, %d)",
>>>>>>>                                 pending_connection->sd, pmix_socket_errno);
>>>>>>>             /* activate the event */
>>>>>>>             event_active(&pending_connection->ev, EV_WRITE, 1);
>>>>>>>             accepted_connections++;
>>>>>>>         } while (accepted_connections > 0);
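An aside on the "outside the expected range" remark above: this presumably refers to select()'s fixed FD_SETSIZE limit - an fd_set can only track descriptors 0 .. FD_SETSIZE-1, so a socket numbered at or above that limit would never show up in readfds and FD_ISSET() would never fire for it. That reading is an assumption on the editor's part, not something established in the thread; the limit itself is easy to check:

    #include <stdio.h>
    #include <sys/select.h>

    int main(void)
    {
        /* select() can only track descriptors 0 .. FD_SETSIZE-1
         * (typically 1024); anything beyond that is invisible to it */
        printf("FD_SETSIZE on this system: %d\n", FD_SETSIZE);
        return 0;
    }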
>>>>>>>
>>>>>>>
>>>>>>> On Oct 28, 2015, at 12:25 AM, Ralph Castain <r...@open-mpi.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Looking at the code, it appears that a fix was committed for this
>>>>>>> problem, and that we correctly resolved the issue found by Paul. The
>>>>>>> problem is that the fix didn’t get upstreamed, and so it was lost the
>>>>>>> next time we refreshed PMIx. Sigh.
>>>>>>>
>>>>>>> Let me try to recreate the fix and have you take a gander at it.
>>>>>>>
>>>>>>>
>>>>>>> On Oct 28, 2015, at 12:22 AM, Ralph Castain <r...@open-mpi.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Here is the discussion - afraid it is fairly lengthy. Ignore the
>>>>>>> hwloc references in it as that was a separate issue:
>>>>>>>
>>>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
>>>>>>>
>>>>>>> It definitely sounds like the same issue creeping in again. I’d
>>>>>>> appreciate any thoughts on how to correct it. If it helps, you could
>>>>>>> look at the PMIx master - there are standalone tests in the test/simple
>>>>>>> directory that fork/exec a child and just do the connection.
>>>>>>>
>>>>>>> https://github.com/pmix/master
>>>>>>>
>>>>>>> The test server is simptest.c - it will spawn a single copy of
>>>>>>> simpclient.c by default.
>>>>>>>
>>>>>>>
>>>>>>> On Oct 27, 2015, at 10:14 PM, George Bosilca <bosi...@icl.utk.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Interesting. Do you have a pointer to the commit (and/or to the
>>>>>>> discussion)?
>>>>>>>
>>>>>>> I looked at the PMIx code and identified a few issues, but
>>>>>>> unfortunately none of the fixes seems to solve the problem for good.
>>>>>>> However, now I need more than 1000 runs to get a deadlock (instead of
>>>>>>> a few tens).
>>>>>>>
>>>>>>> Looking with "netstat -ax" at the status of the UDS while the
>>>>>>> processes are deadlocked, I see 2 UDS with the same name: one from the
>>>>>>> server, which is in LISTEN state, and one from the client, which is in
>>>>>>> CONNECTING state (while the client has already sent a message on the
>>>>>>> socket and is now waiting in a blocking receive). This suggests that
>>>>>>> the server has not yet called accept on the UDS. Unfortunately, there
>>>>>>> are 3 threads all doing different flavors of event_base and select, so
>>>>>>> I have a hard time tracking the path of the UDS on the server side.
>>>>>>>
>>>>>>> So in order to validate my assumption I wrote a minimalistic UDS
>>>>>>> client and server application and tried different scenarios. The
>>>>>>> conclusion is that in order to see the same type of output from
>>>>>>> "netstat -ax" I have to call listen on the server, connect on the
>>>>>>> client, and not call accept on the server.
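A minimal sketch reproducing the scenario George describes - the server listens but deliberately never accepts, while the client connects and sends - so the same "netstat -ax" picture can be observed. Illustrative code, not his actual test program:

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    #define SOCK_PATH "/tmp/uds_no_accept_demo"

    int main(void)
    {
        struct sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
        unlink(SOCK_PATH);

        /* server side: bind and listen, but never call accept() */
        int srv = socket(AF_UNIX, SOCK_STREAM, 0);
        if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(srv, 8) < 0) {
            perror("server setup");
            return 1;
        }

        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (0 == pid) {
            /* client side: connect() succeeds (the kernel queues it on the
             * listen backlog) and the write is buffered, but a blocking read
             * for the server's reply would hang forever because accept() is
             * never called */
            int cli = socket(AF_UNIX, SOCK_STREAM, 0);
            if (cli < 0 || connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("client");
                _exit(1);
            }
            (void)write(cli, "hello", 5);
            printf("client: connected and sent data; now look at 'netstat -ax'\n");
            pause();            /* mimic the hung blocking recv */
            _exit(0);
        }

        pause();                /* server never calls accept() */
        return 0;
    }

Running it and then checking "netstat -ax" for uds_no_accept_demo should show the listening server socket plus the queued, never-accepted client connection, with the client's data held by the kernel - matching the observation above.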
>>>>>>>
>>>>>>> While doing this I also confirmed that the UDS holds the data that was
>>>>>>> sent, so there is no need for further synchronization for the case
>>>>>>> where the data is sent first. We only need to find out how the server
>>>>>>> forgets to call accept.
>>>>>>>
>>>>>>>   George.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 27, 2015 at 7:52 PM, Ralph Castain <r...@open-mpi.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hmmm…this looks like it might be that problem we previously saw
>>>>>>>> where the blocking recv hangs in a proc when the blocking send tries
>>>>>>>> to send before the domain socket is actually ready, and so the send
>>>>>>>> fails on the other end. As I recall, it was something to do with the
>>>>>>>> socket options - and then Paul had a problem on some of his machines,
>>>>>>>> and we backed it out?
>>>>>>>>
>>>>>>>> I wonder if that’s what is biting us here again, and what we need
>>>>>>>> is to either remove the blocking send/recvs altogether, or figure out
>>>>>>>> a way to wait until the socket is really ready.
>>>>>>>>
>>>>>>>> Any thoughts?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Oct 27, 2015, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> It appears the branch solves the problem at least partially. I asked
>>>>>>>> one of my students to hammer it pretty badly, and he reported that the
>>>>>>>> deadlocks still occur. He also graciously provided some stacktraces:
>>>>>>>>
>>>>>>>> #0  0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
>>>>>>>> #1  0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
>>>>>>>> #2  0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0,
>>>>>>>> nprocs=0, info=0x7fff3c561960,
>>>>>>>>     ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>>> #3  0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1)
>>>>>>>> at pmix1_client.c:306
>>>>>>>> #4  0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3,
>>>>>>>> argv=0x7fff3c561ea8, requested=3,
>>>>>>>>     provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
>>>>>>>> #5  0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c,
>>>>>>>> argv=0x7fff3c561d70, required=3,
>>>>>>>>     provided=0x7fff3c561d84) at pinit_thread.c:69
>>>>>>>> #6  0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at
>>>>>>>> osu_mbw_mr.c:86
>>>>>>>>
>>>>>>>> And another process:
>>>>>>>>
>>>>>>>> #0  0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
>>>>>>>> #1  0x00007f7b9b0aa42d in
>>>>>>>> opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7ffd62139004 
>>>>>>>> "",
>>>>>>>>     size=4) at src/usock/usock.c:168
>>>>>>>> #2  0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at
>>>>>>>> src/client/pmix_client.c:844
>>>>>>>> #3  0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at
>>>>>>>> src/client/pmix_client.c:1110
>>>>>>>> #4  0x00007f7b9b0acc24 in connect_to_server
>>>>>>>> (address=0x7ffd62139330, cbdata=0x7ffd621390e0)
>>>>>>>>     at src/client/pmix_client.c:181
>>>>>>>> #5  0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init
>>>>>>>> (proc=0x7f7b9b4e9b60)
>>>>>>>>     at src/client/pmix_client.c:362
>>>>>>>> #6  0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
>>>>>>>> #7  0x00007f7b9b4eb95f in pmi_component_query
>>>>>>>> (module=0x7ffd62139490, priority=0x7ffd6213948c)
>>>>>>>>     at ess_pmi_component.c:90
>>>>>>>> #8  0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059
>>>>>>>> "ess", output_id=-1,
>>>>>>>>     components_available=0x7f7b9d431eb0,
>>>>>>>> best_module=0x7ffd621394d0, best_component=0x7ffd621394d8,
>>>>>>>>     priority_out=0x0) at mca_base_components_select.c:77
>>>>>>>> #9  0x00007f7b9d1a956b in orte_ess_base_select () at
>>>>>>>> base/ess_base_select.c:40
>>>>>>>> #10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0,
>>>>>>>> flags=32) at runtime/orte_init.c:219
>>>>>>>> #11 0x00007f7b9da4377a in ompi_mpi_init (argc=3,
>>>>>>>> argv=0x7ffd621397f8, requested=3,
>>>>>>>>     provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
>>>>>>>> #12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc,
>>>>>>>> argv=0x7ffd621396c0, required=3,
>>>>>>>>     provided=0x7ffd621396d4) at pinit_thread.c:69
>>>>>>>> #13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at
>>>>>>>> osu_mbw_mr.c:86
>>>>>>>>
>>>>>>>>   George.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I haven’t been able to replicate this when using the branch in
>>>>>>>>> this PR:
>>>>>>>>>
>>>>>>>>> https://github.com/open-mpi/ompi/pull/1073
>>>>>>>>>
>>>>>>>>> Would you mind giving it a try? It fixes some other race
>>>>>>>>> conditions and might pick this one up too.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Okay, I’ll take a look - I’ve been chasing a race condition that
>>>>>>>>> might be related.
>>>>>>>>>
>>>>>>>>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> No, it's using 2 nodes.
>>>>>>>>>   George.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Is this on a single node?
>>>>>>>>>>
>>>>>>>>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I get intermittent deadlocks with the latest trunk. The smallest
>>>>>>>>>> reproducer is a shell for loop around a small (2 processes), short
>>>>>>>>>> (20 seconds) MPI application. After a few tens of iterations,
>>>>>>>>>> MPI_Init will deadlock with the following backtrace:
>>>>>>>>>>
>>>>>>>>>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>>>>>>>>>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>>>>>>>>>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence
>>>>>>>>>> (procs=0x0, nprocs=0, info=0x7ffd7934fb90,
>>>>>>>>>>     ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>>>>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1)
>>>>>>>>>> at pmix1_client.c:305
>>>>>>>>>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3,
>>>>>>>>>> argv=0x7ffd793500a8, requested=3,
>>>>>>>>>>     provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>>>>>>>>>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c,
>>>>>>>>>> argv=0x7ffd7934ff80, required=3,
>>>>>>>>>>     provided=0x7ffd7934ff94) at pinit_thread.c:69
>>>>>>>>>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at
>>>>>>>>>> osu_mbw_mr.c:86
>>>>>>>>>>
>>>>>>>>>> On my machines this is reproducible at 100% after anywhere
>>>>>>>>>> between 50 and 100 iterations.
>>>>>>>>>>
>>>>>>>>>>   Thanks,
>>>>>>>>>>     George.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> С Уважением, Поляков Артем Юрьевич
>>>>> Best regards, Artem Y. Polyakov
>>>>
>>>
>>>
>>
>>
>
>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
