Ralph,
yes and no ...
mpi hello world with four nodes can be used to reproduce the issue,
you can increase the likelyhood of producing the race condition by hacking
./opal/mca/event/libevent2021/libevent/poll.c
and replace
i = random() % nfds;
with
if (nfds < 2) {
i = 0;
} else {
i = nfds - 2;
}
but since this is really a race condition, all i could do is show you
how to use a debugger in order to force it
here is what really happens :
- thanks to your patch, when vpid 2 cannot read the acknowledge, this is
no more a fatal error.
- that being said, the peer->recv_event is not removed from the libevent
- later, send_event will be added to the libevent
- and then peer->recv_event will be added to the libevent
/* this is clearly not supported, and the interesting behaviour is that
peer->send_event will be kicked out of libevent (!) */
The attached patch fixes this race condition, could you please review it ?
Cheers,
Gilles
On 2014/09/17 22:17, Ralph Castain wrote:
> Do you have a reproducer you can share for testing this? I'm unable to get it
> to happen on my machine, but maybe you have a test code that triggers it so I
> can continue debugging
>
> Ralph
>
> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet
> <[email protected]> wrote:
>
>> Thanks Ralph,
>>
>> this is much better but there is still a bug :
>> with the very same scenario i described earlier, vpid 2 does not send
>> its message to vpid 3 once the connection has been established.
>>
>> i tried to debug it but i have been pretty unsuccessful so far ..
>>
>> vpid 2 calls tcp_peer_connected and execute the following snippet
>>
>> if (NULL != peer->send_msg && !peer->send_ev_active) {
>> opal_event_add(&peer->send_event, 0);
>> peer->send_ev_active = true;
>> }
>>
>> but when evmap_io_active is invoked later, the following part :
>>
>> TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
>> if (ev->ev_events & events)
>> event_active_nolock(ev, ev->ev_events & events, 1);
>> }
>>
>> finds only one ev (mca_oob_tcp_recv_handler and *no*
>> mca_oob_tcp_send_handler)
>>
>> i will resume my investigations tomorrow
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/17 4:01, Ralph Castain wrote:
>>> Hi Gilles
>>>
>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned
>>> it to you for review. Give it a try and let me know if I (hopefully) got it.
>>>
>>> The approach we have used in the past is to have both sides close their
>>> connections, and then have the higher vpid retry while the lower one waits.
>>> The logic for that was still in place, but it looks like you are hitting a
>>> different code path, and I found another potential one as well. So I think
>>> I plugged the holes, but will wait to hear if you confirm.
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet
>>> <[email protected]> wrote:
>>>
>>>> Ralph,
>>>>
>>>> here is the full description of a race condition in oob/tcp i very briefly
>>>> mentionned in a previous post :
>>>>
>>>> the race condition can occur when two not connected orted try to send a
>>>> message to each other for the first time and at the same time.
>>>>
>>>> that can occur when running mpi helloworld on 4 nodes with the grpcomm/rcd
>>>> module.
>>>>
>>>> here is a scenario in which the race condition occurs :
>>>>
>>>> orted vpid 2 and 3 enter the allgather
>>>> /* they are not orte yet oob/tcp connected*/
>>>> and they call orte.send_buffer_nb each other.
>>>> from a libevent point of view, vpid 2 and 3 will call
>>>> mca_oob_tcp_peer_try_connect
>>>>
>>>> vpid 2 calls mca_oob_tcp_send_handler
>>>>
>>>> vpid 3 calls connection_event_handler
>>>>
>>>> depending on the value returned by random() in libevent, vpid 3 will
>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely)
>>>> if vpid 3 calls recv_handler, it will close the two sockets to vpid 2
>>>>
>>>> then vpid 2 will call mca_oob_tcp_recv_handler
>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK)
>>>> that will invoke mca_oob_tcp_recv_connect_ack
>>>> tcp_peer_recv_blocking will fail
>>>> /* zero bytes are recv'ed since vpid 3 previously closed the socket before
>>>> writing a header */
>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>>>> /* ORTE_FORCED_TERMINATE(1) */
>>>>
>>>> could you please have a look at it ?
>>>>
>>>> if you are too busy, could you please advise where this scenario should be
>>>> handled differently ?
>>>> - should vpid 3 keep one socket instead of closing both and retrying ?
>>>> - should vpid 2 handle the failure as a non fatal error ?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>> _______________________________________________
>>>> devel mailing list
>>>> [email protected]
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15836.php
>>> _______________________________________________
>>> devel mailing list
>>> [email protected]
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/09/15844.php
>> _______________________________________________
>> devel mailing list
>> [email protected]
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/09/15854.php
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15855.php
Index: orte/mca/oob/tcp/oob_tcp_connection.c
===================================================================
--- orte/mca/oob/tcp/oob_tcp_connection.c (revision 32752)
+++ orte/mca/oob/tcp/oob_tcp_connection.c (working copy)
@@ -14,6 +14,8 @@
* Copyright (c) 2009-2014 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2011 Oak Ridge National Labs. All rights reserved.
* Copyright (c) 2013-2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -867,6 +869,16 @@
peer->active_addr->state = MCA_OOB_TCP_CLOSED;
}
+ /* unregister active events */
+ if (peer->recv_ev_active) {
+ opal_event_del(&peer->recv_event);
+ peer->recv_ev_active = false;
+ }
+ if (peer->send_ev_active) {
+ opal_event_del(&peer->send_event);
+ peer->send_ev_active = false;
+ }
+
/* inform the component-level that we have lost a connection so
* it can decide what to do about it.
*/