The patch looks fine to me - please go ahead and apply it. Thanks!
On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet
<[email protected]> wrote:
> Ralph,
>
> yes and no ...
>
> an MPI hello world run on four nodes can be used to reproduce the issue.
>
>
> you can increase the likelihood of hitting the race condition by hacking
> ./opal/mca/event/libevent2021/libevent/poll.c
> and replacing
> i = random() % nfds;
> with
> if (nfds < 2) {
>     i = 0;
> } else {
>     i = nfds - 2;
> }
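>
> For context, the hacked line only picks the *starting* index of the loop
> that walks the ready fds; roughly (paraphrasing libevent 2.0.x poll.c, the
> exact code in the tree may differ slightly):
>
> i = random() % nfds;               /* <-- the line being replaced */
> for (j = 0; j < nfds; j++) {
>     if (++i == nfds)
>         i = 0;                     /* wrap around: i is only a start point */
>     if (!event_set[i].revents)
>         continue;
>     /* ... translate revents to EV_READ/EV_WRITE and call
>      *     evmap_io_active(base, event_set[i].fd, res); ... */
> }
>
> Forcing i instead of randomizing it makes the order in which the handlers
> fire deterministic, which makes the otherwise-unlikely interleaving much
> easier to hit.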
>
> but since this is really a race condition, all I could do otherwise is show
> you how to use a debugger in order to force it.
>
>
> here is what really happens:
> - thanks to your patch, when vpid 2 cannot read the acknowledgment, it is
> no longer a fatal error.
> - that being said, peer->recv_event is not removed from libevent
> - later, peer->send_event is added to libevent
> - and then peer->recv_event is added to libevent (again, even though it was
> never removed)
> /* this is clearly not supported, and the interesting behaviour is that
> peer->send_event gets kicked out of libevent (!) -- see the sketch below */
>
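> A minimal sketch of the idea (this is not necessarily what the attached
> patch does, just the shape of the fix): take the stale recv_event out of
> libevent before any event is (re)added for that peer, e.g. when the failed
> ack read is handled:
>
> /* sketch only: assumes the usual peer->*_ev_active bookkeeping */
> if (peer->recv_ev_active) {
>     opal_event_del(&peer->recv_event);
>     peer->recv_ev_active = false;
> }
> if (peer->send_ev_active) {
>     opal_event_del(&peer->send_event);
>     peer->send_ev_active = false;
> }
> /* later, when the connection is re-established, send_event and
>  * recv_event can both be added again without one kicking the other out */
>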
> The attached patch fixes this race condition; could you please review it?
>
> Cheers,
>
> Gilles
>
> On 2014/09/17 22:17, Ralph Castain wrote:
>> Do you have a reproducer you can share for testing this? I'm unable to get
>> it to happen on my machine, but maybe you have a test code that triggers it
>> so I can continue debugging.
>>
>> Ralph
>>
>> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet
>> <[email protected]> wrote:
>>
>>> Thanks Ralph,
>>>
>>> this is much better, but there is still a bug:
>>> with the very same scenario I described earlier, vpid 2 does not send
>>> its message to vpid 3 once the connection has been established.
>>>
>>> I tried to debug it but have been pretty unsuccessful so far ...
>>>
>>> vpid 2 calls tcp_peer_connected and executes the following snippet:
>>>
>>> if (NULL != peer->send_msg && !peer->send_ev_active) {
>>>     opal_event_add(&peer->send_event, 0);
>>>     peer->send_ev_active = true;
>>> }
>>>
>>> but when evmap_io_active is invoked later, the following part:
>>>
>>> TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
>>>     if (ev->ev_events & events)
>>>         event_active_nolock(ev, ev->ev_events & events, 1);
>>> }
>>>
>>> finds only one ev (the one for mca_oob_tcp_recv_handler, and *no*
>>> mca_oob_tcp_send_handler)
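>>>
>>> For what it is worth, here is a tiny standalone libevent 2 program (not
>>> Open MPI code) illustrating the same symptom: the per-fd list walked by
>>> evmap_io_active() only contains events that are currently added, so a
>>> send-side event that was never added (or was dropped) simply never fires:
>>>
>>> #include <event2/event.h>
>>> #include <stdio.h>
>>> #include <sys/socket.h>
>>> #include <unistd.h>
>>>
>>> static void on_ready(evutil_socket_t fd, short what, void *arg)
>>> {
>>>     printf("%s handler fired (what=0x%x)\n", (const char *)arg, (unsigned)what);
>>> }
>>>
>>> int main(void)
>>> {
>>>     int sv[2];
>>>     socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
>>>     write(sv[1], "x", 1);                 /* sv[0] is now readable (and writable) */
>>>
>>>     struct event_base *base = event_base_new();
>>>     struct event *rd = event_new(base, sv[0], EV_READ,  on_ready, "recv");
>>>     struct event *wr = event_new(base, sv[0], EV_WRITE, on_ready, "send");
>>>
>>>     event_add(rd, NULL);
>>>     /* wr is intentionally NOT added: even though sv[0] is writable,
>>>      * its handler never fires */
>>>     (void)wr;
>>>
>>>     event_base_loop(base, EVLOOP_ONCE);   /* prints only "recv handler fired" */
>>>
>>>     event_free(rd);
>>>     event_free(wr);
>>>     event_base_free(base);
>>>     return 0;
>>> }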
>>>
>>> I will resume my investigation tomorrow.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/09/17 4:01, Ralph Castain wrote:
>>>> Hi Gilles
>>>>
>>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned
>>>> it to you for review. Give it a try and let me know if I (hopefully) got
>>>> it.
>>>>
>>>> The approach we have used in the past is to have both sides close their
>>>> connections, and then have the higher vpid retry while the lower one
>>>> waits. The logic for that was still in place, but it looks like you are
>>>> hitting a different code path, and I found another potential one as well.
>>>> So I think I plugged the holes, but will wait to hear if you confirm.
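>>>>
>>>> For reference, a minimal sketch of that tie-break rule, with stand-in
>>>> types rather than the actual oob/tcp structures:
>>>>
>>>> #include <stdbool.h>
>>>> #include <stdint.h>
>>>>
>>>> typedef uint32_t vpid_t;   /* stand-in for the real vpid type */
>>>>
>>>> /* Both sides opened a connection to each other at the same time:
>>>>  * the higher vpid drops its socket and retries, the lower vpid
>>>>  * keeps waiting for the peer to reconnect. */
>>>> static bool i_should_retry(vpid_t my_vpid, vpid_t peer_vpid)
>>>> {
>>>>     return my_vpid > peer_vpid;
>>>> }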
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet
>>>> <[email protected]> wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> here is the full description of a race condition in oob/tcp that I very
>>>>> briefly mentioned in a previous post:
>>>>>
>>>>> the race condition can occur when two orteds that are not yet connected
>>>>> try to send a message to each other for the first time, at the same time.
>>>>>
>>>>> That can happen when running an MPI hello world on 4 nodes with the
>>>>> grpcomm/rcd module.
>>>>>
>>>>> here is a scenario in which the race condition occurs:
>>>>>
>>>>> orted vpids 2 and 3 enter the allgather
>>>>> /* they are not yet oob/tcp connected */
>>>>> and they call orte.send_buffer_nb to each other.
>>>>> From a libevent point of view, vpids 2 and 3 will call
>>>>> mca_oob_tcp_peer_try_connect.
>>>>>
>>>>> vpid 2 calls mca_oob_tcp_send_handler
>>>>>
>>>>> vpid 3 calls connection_event_handler
>>>>>
>>>>> depending on the value returned by random() in libevent, vpid 3 will
>>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely).
>>>>> If vpid 3 calls recv_handler, it will close both sockets to vpid 2.
>>>>>
>>>>> Then vpid 2 will call mca_oob_tcp_recv_handler
>>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
>>>>> which will invoke mca_oob_tcp_recv_connect_ack.
>>>>> tcp_peer_recv_blocking will fail
>>>>> /* zero bytes are recv'ed since vpid 3 previously closed the socket
>>>>> before writing a header */
>>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>>>>> /* ORTE_FORCED_TERMINATE(1) */
>>>>>
>>>>> could you please have a look at it?
>>>>>
>>>>> if you are too busy, could you please advise how this scenario should
>>>>> be handled differently?
>>>>> - should vpid 3 keep one socket instead of closing both and retrying?
>>>>> - should vpid 2 handle the failure as a non-fatal error?
>>>>> (a rough sketch of that second option is below)
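>>>>>
>>>>> A self-contained sketch of the second option, with stand-in types rather
>>>>> than the real oob/tcp structures: treat a zero-byte read while waiting
>>>>> for the connect ack as "peer went away, retry later" instead of a fatal
>>>>> error:
>>>>>
>>>>> #include <stdbool.h>
>>>>> #include <sys/types.h>
>>>>> #include <sys/socket.h>
>>>>> #include <unistd.h>
>>>>>
>>>>> typedef enum { PEER_CLOSED, PEER_CONNECT_ACK, PEER_CONNECTED } peer_state_t;
>>>>>
>>>>> typedef struct {
>>>>>     int          sd;      /* socket to the peer */
>>>>>     peer_state_t state;
>>>>> } peer_t;
>>>>>
>>>>> /* Returns true if the ack was fully read; false if the peer closed its
>>>>>  * end first, in which case the caller should fall back to the retry
>>>>>  * path rather than calling ORTE_FORCED_TERMINATE. */
>>>>> static bool recv_connect_ack(peer_t *peer, void *hdr, size_t len)
>>>>> {
>>>>>     ssize_t rc = recv(peer->sd, hdr, len, MSG_WAITALL);
>>>>>     if (rc == (ssize_t)len) {
>>>>>         return true;
>>>>>     }
>>>>>     /* rc <= 0 (typically 0 here): the peer closed the socket before
>>>>>      * writing a header, exactly the situation described above */
>>>>>     close(peer->sd);
>>>>>     peer->sd = -1;
>>>>>     peer->state = PEER_CLOSED;
>>>>>     return false;
>>>>> }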
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>
> <oob_tcp.patch>