Re: [OMPI devel] race condition in oob/tcp

Gilles Gouaillardet Fri, 19 Sep 2014 07:05:55 -0400 (EDT)

Ralph,

i found another race condition.
in a very specific scenario, vpid 3 is in the MCA_OOB_TCP_CLOSED state
and still processes data received on the socket from vpid 2.
vpid 3 is in the MCA_OOB_TCP_CLOSED state because vpid 2 called retry()
and closed both of its sockets to vpid 3.


vpid 3 reads the ack data that was sent to the socket (ok) and then ends
up calling tcp_peer_send_blocking

Function
main (orted.c:62)
  orte_daemon (orted_main.c:828)
    opal_libevent2021_event_base_loop (event.c:1645)
      event_process_active (event.c:1437)
        event_process_active_single_queue (event.c:1367)
          recv_handler (oob_tcp.c:599)
            mca_oob_tcp_peer_accept (oob_tcp_connection.c:1071)
              tcp_peer_send_connect_ack (oob_tcp_connection.c:384)
                tcp_peer_send_blocking (oob_tcp_connection.c:525)


though the socket (fd 17 in my case) has been closed by the peer, and is
hence reported in the CLOSE_WAIT state by lsof,
send(17, ...) is a success (!!!)

i thought the root cause was that we had previously set the O_NONBLOCK flag
on this socket,
so i explicitly cleared this flag (which was not set anyway...) before
invoking mca_oob_tcp_peer_accept,
but i got the very same behaviour :-(

could you please advise :
- should the send fail because the socket is in the CLOSE_WAIT state ?
- if success is the correct behaviour, does this mean another step
should be added to the oob/tcp "handshake" ?
- or could this mean that when the peer state was moved from
MCA_OOB_TCP_CONNECT_ACK to MCA_OOB_TCP_CLOSED,
retry() should have been invoked ?
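
for reference on the first question: as far as i understand TCP, CLOSE_WAIT
only means the peer has sent its FIN, i.e. it will not write any more, and the
local side is still allowed to write. the first send() is simply queued by the
local stack, and an error (EPIPE or ECONNRESET) can only show up on a later
call, once the peer has answered with a RST. the sketch below is a standalone
loopback illustration of that behaviour and is not taken from oob/tcp : the
names close_wait_demo and cleanup are made up for this example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of Open MPI: close every descriptor
 * that is still open and hand back the result code. */
static int cleanup(int lfd, int cfd, int afd, int rc)
{
    if (lfd >= 0) close(lfd);
    if (cfd >= 0) close(cfd);
    if (afd >= 0) close(afd);
    return rc;
}

/* Hypothetical demo: the "peer" writes an ack and closes, then the local
 * side reads the ack, peeks for EOF and sends one byte anyway.
 * Returns 0 if the scenario could be set up, -1 otherwise. */
int close_wait_demo(ssize_t *peek_rc, ssize_t *send_rc)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    char buf[4];
    int lfd = -1, cfd = -1, afd = -1;

    assert(NULL != peek_rc && NULL != send_rc);

    /* listen on an ephemeral loopback port */
    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) return cleanup(lfd, cfd, afd, -1);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0) {
        return cleanup(lfd, cfd, afd, -1);
    }

    /* connect the local side and accept on the "peer" side */
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 || connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        return cleanup(lfd, cfd, afd, -1);
    }
    afd = accept(lfd, NULL, NULL);
    if (afd < 0) return cleanup(lfd, cfd, afd, -1);

    /* the peer writes its ack and closes: a FIN follows the data */
    if (send(afd, "ack", 3, 0) != 3) return cleanup(lfd, cfd, afd, -1);
    close(afd);
    afd = -1;

    /* the local side reads the ack that was sent before the close */
    if (recv(cfd, buf, 3, MSG_WAITALL) != 3) return cleanup(lfd, cfd, afd, -1);

    /* a blocking peek returns 0 once the FIN is seen: the peer is gone */
    *peek_rc = recv(cfd, buf, 1, MSG_PEEK);

    /* yet the first send() is still accepted by the local TCP stack */
    *send_rc = send(cfd, "x", 1, 0);

    return cleanup(lfd, cfd, afd, 0);
}
```

calling close_wait_demo(&peek_rc, &send_rc) on Linux, i would expect peek_rc
to be 0 (end of file) and send_rc to be 1 (success). if that holds, a
successful send() says nothing about the peer still being there, whereas a
zero-byte read does, so a check of that kind (or an extra handshake step)
would be needed before trusting the connection.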

Cheers,

Gilles

On 2014/09/18 17:02, Ralph Castain wrote:
> The patch looks fine to me - please go ahead and apply it. Thanks!
>
> On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet wrote:
>
>> Ralph,
>>
>> yes and no ...
>>
>> mpi hello world with four nodes can be used to reproduce the issue,
>>
>>
>> you can increase the likelihood of producing the race condition by hacking
>> ./opal/mca/event/libevent2021/libevent/poll.c
>> and replacing
>>       i = random() % nfds;
>> with
>>       if (nfds < 2) {
>>           i = 0;
>>       } else {
>>           i = nfds - 2;
>>       }
>>
>> but since this is really a race condition, all i could do is show you
>> how to use a debugger in order to force it
>>
>>
>> here is what really happens :
>> - thanks to your patch, when vpid 2 cannot read the acknowledgement, this is
>> no longer a fatal error.
>> - that being said, peer->recv_event is not removed from libevent
>> - later, send_event will be added to libevent
>> - and then peer->recv_event will be added to libevent again
>> /* this is clearly not supported, and the interesting behaviour is that
>> peer->send_event will be kicked out of libevent (!) */
>>
>> The attached patch fixes this race condition, could you please review it ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/17 22:17, Ralph Castain wrote:
>>> Do you have a reproducer you can share for testing this? I'm unable to get 
>>> it to happen on my machine, but maybe you have a test code that triggers it 
>>> so I can continue debugging
>>>
>>> Ralph
>>>
>>> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet wrote:
>>>
>>>> Thanks Ralph,
>>>>
>>>> this is much better but there is still a bug :
>>>> with the very same scenario i described earlier, vpid 2 does not send
>>>> its message to vpid 3 once the connection has been established.
>>>>
>>>> i tried to debug it but i have been pretty unsuccessful so far ..
>>>>
>>>> vpid 2 calls tcp_peer_connected and executes the following snippet
>>>>
>>>>   if (NULL != peer->send_msg && !peer->send_ev_active) {
>>>>       opal_event_add(&peer->send_event, 0);
>>>>       peer->send_ev_active = true;
>>>>   }
>>>>
>>>> but when evmap_io_active is invoked later, the following part :
>>>>
>>>>   TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
>>>>       if (ev->ev_events & events)
>>>>           event_active_nolock(ev, ev->ev_events & events, 1);
>>>>   }
>>>>
>>>> finds only one ev (mca_oob_tcp_recv_handler and *no*
>>>> mca_oob_tcp_send_handler)
>>>>
>>>> i will resume my investigations tomorrow
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/09/17 4:01, Ralph Castain wrote:
>>>>> Hi Gilles
>>>>>
>>>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned 
>>>>> it to you for review. Give it a try and let me know if I (hopefully) got 
>>>>> it.
>>>>>
>>>>> The approach we have used in the past is to have both sides close their 
>>>>> connections, and then have the higher vpid retry while the lower one 
>>>>> waits. The logic for that was still in place, but it looks like you are 
>>>>> hitting a different code path, and I found another potential one as well. 
>>>>> So I think I plugged the holes, but will wait to hear if you confirm.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet wrote:
>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> here is the full description of a race condition in oob/tcp that i very
>>>>>> briefly mentioned in a previous post :
>>>>>>
>>>>>> the race condition can occur when two orted daemons that are not yet
>>>>>> connected try to send a message to each other for the first time and
>>>>>> at the same time.
>>>>>>
>>>>>> that can occur when running mpi helloworld on 4 nodes with the 
>>>>>> grpcomm/rcd module.
>>>>>>
>>>>>> here is a scenario in which the race condition occurs :
>>>>>>
>>>>>> orted vpid 2 and 3 enter the allgather
>>>>>> /* they are not yet oob/tcp connected */
>>>>>> and they call orte.send_buffer_nb to send to each other.
>>>>>> from a libevent point of view, vpid 2 and 3 will call 
>>>>>> mca_oob_tcp_peer_try_connect
>>>>>>
>>>>>> vpid 2 calls mca_oob_tcp_send_handler
>>>>>>
>>>>>> vpid 3 calls connection_event_handler
>>>>>>
>>>>>> depending on the value returned by random() in libevent, vpid 3 will
>>>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely)
>>>>>> if vpid 3 calls recv_handler, it will close the two sockets to vpid 2
>>>>>>
>>>>>> then vpid 2 will call mca_oob_tcp_recv_handler
>>>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK)
>>>>>> that will invoke mca_oob_tcp_recv_connect_ack
>>>>>> tcp_peer_recv_blocking will fail 
>>>>>> /* zero bytes are recv'ed since vpid 3 previously closed the socket 
>>>>>> before writing a header */
>>>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>>>>>> /* ORTE_FORCED_TERMINATE(1) */
>>>>>>
>>>>>> could you please have a look at it ?
>>>>>>
>>>>>> if you are too busy, could you please advise where this scenario should 
>>>>>> be handled differently ?
>>>>>> - should vpid 3 keep one socket instead of closing both and retrying ?
>>>>>> - should vpid 2 handle the failure as a non fatal error ?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> [email protected]
>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>> Link to this post: 
>>>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15836.php
>>>>> Link to this post: 
>>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15844.php
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15854.php
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/09/15855.php
>> <oob_tcp.patch>
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/09/15862.php
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15863.php
