The patch looks fine to me - please go ahead and apply it. Thanks!
On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet
<[email protected]> wrote:
> Ralph,
>
> yes and no ...
>
> an MPI hello world run on four nodes can be used to reproduce the issue.
>
>
> you can increase the likelihood of hitting the race condition by hacking
> ./opal/mca/event/libevent2021/libevent/poll.c
> and replacing
> i = random() % nfds;
> with
> if (nfds < 2) {
>     i = 0;
> } else {
>     i = nfds - 2;
> }
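>
> For context, the hacked line only picks the *starting* index of the loop
> that walks the ready fds; roughly (paraphrasing libevent 2.0.x poll.c, the
> exact code in the tree may differ slightly):
>
> i = random() % nfds;               /* <-- the line being replaced */
> for (j = 0; j < nfds; j++) {
>     if (++i == nfds)
>         i = 0;                     /* wrap around: i is only a start point */
>     if (!event_set[i].revents)
>         continue;
>     /* ... translate revents to EV_READ/EV_WRITE and call
>      *     evmap_io_active(base, event_set[i].fd, res); ... */
> }
>
> Forcing i instead of randomizing it makes the order in which the handlers
> fire deterministic, which makes the otherwise-unlikely interleaving much
> easier to hit.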
>
> but since this is really a race condition, all I could do otherwise is show
> you how to use a debugger in order to force it.
>
>
> here is what really happens:
> - thanks to your patch, when vpid 2 cannot read the acknowledgment, it is
> no longer a fatal error.
> - that being said, peer->recv_event is not removed from libevent
> - later, peer->send_event is added to libevent
> - and then peer->recv_event is added to libevent (again, even though it was
> never removed)
> /* this is clearly not supported, and the interesting behaviour is that
> peer->send_event gets kicked out of libevent (!) -- see the sketch below */
>
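> A minimal sketch of the idea (this is not necessarily what the attached
> patch does, just the shape of the fix): take the stale recv_event out of
> libevent before any event is (re)added for that peer, e.g. when the failed
> ack read is handled:
>
> /* sketch only: assumes the usual peer->*_ev_active bookkeeping */
> if (peer->recv_ev_active) {
>     opal_event_del(&peer->recv_event);
>     peer->recv_ev_active = false;
> }
> if (peer->send_ev_active) {
>     opal_event_del(&peer->send_event);
>     peer->send_ev_active = false;
> }
> /* later, when the connection is re-established, send_event and
>  * recv_event can both be added again without one kicking the other out */
>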
> The attached patch fixes this race condition; could you please review it?
>
> Cheers,
>
> Gilles
>
> On 2014/09/17 22:17, Ralph Castain wrote:
>> Do you have a reproducer you can share for testing this? I'm unable to get
>> it to happen on my machine, but maybe you have a test code that triggers it
>> so I can continue debugging.
>>
>> Ralph
>>
>> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet
>> <[email protected]> wrote:
>>
>>> Thanks Ralph,
>>>
>>> this is much better, but there is still a bug:
>>> with the very same scenario I described earlier, vpid 2 does not send
>>> its message to vpid 3 once the connection has been established.
>>>
>>> I tried to debug it but have been pretty unsuccessful so far ...
>>>
>>> vpid 2 calls tcp_peer_connected and executes the following snippet:
>>>
>>> if (NULL != peer->send_msg && !peer->send_ev_active) {
>>>     opal_event_add(&peer->send_event, 0);
>>>     peer->send_ev_active = true;
>>> }
>>>
>>> but when evmap_io_active is invoked later, the following part:
>>>
>>> TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
>>>     if (ev->ev_events & events)
>>>         event_active_nolock(ev, ev->ev_events & events, 1);
>>> }
>>>
>>> finds only one ev (the one for mca_oob_tcp_recv_handler, and *no*
>>> mca_oob_tcp_send_handler)
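>>>
>>> For what it is worth, here is a tiny standalone libevent 2 program (not
>>> Open MPI code) illustrating the same symptom: the per-fd list walked by
>>> evmap_io_active() only contains events that are currently added, so a
>>> send-side event that was never added (or was dropped) simply never fires:
>>>
>>> #include <event2/event.h>
>>> #include <stdio.h>
>>> #include <sys/socket.h>
>>> #include <unistd.h>
>>>
>>> static void on_ready(evutil_socket_t fd, short what, void *arg)
>>> {
>>>     printf("%s handler fired (what=0x%x)\n", (const char *)arg, (unsigned)what);
>>> }
>>>
>>> int main(void)
>>> {
>>>     int sv[2];
>>>     socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
>>>     write(sv[1], "x", 1);                 /* sv[0] is now readable (and writable) */
>>>
>>>     struct event_base *base = event_base_new();
>>>     struct event *rd = event_new(base, sv[0], EV_READ,  on_ready, "recv");
>>>     struct event *wr = event_new(base, sv[0], EV_WRITE, on_ready, "send");
>>>
>>>     event_add(rd, NULL);
>>>     /* wr is intentionally NOT added: even though sv[0] is writable,
>>>      * its handler never fires */
>>>     (void)wr;
>>>
>>>     event_base_loop(base, EVLOOP_ONCE);   /* prints only "recv handler fired" */
>>>
>>>     event_free(rd);
>>>     event_free(wr);
>>>     event_base_free(base);
>>>     return 0;
>>> }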
>>>
>>> I will resume my investigation tomorrow.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/09/17 4:01, Ralph Castain wrote:
>>>> Hi Gilles
>>>>
>>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned
>>>> it to you for review. Give it a try and let me know if I (hopefully) got
>>>> it.
>>>>
>>>> The approach we have used in the past is to have both sides close their
>>>> connections, and then have the higher vpid retry while the lower one
>>>> waits. The logic for that was still in place, but it looks like you are
>>>> hitting a different code path, and I found another potential one as well.
>>>> So I think I plugged the holes, but will wait to hear if you confirm.
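>>>>
>>>> For reference, a minimal sketch of that tie-break rule, with stand-in
>>>> types rather than the actual oob/tcp structures:
>>>>
>>>> #include <stdbool.h>
>>>> #include <stdint.h>
>>>>
>>>> typedef uint32_t vpid_t;   /* stand-in for the real vpid type */
>>>>
>>>> /* Both sides opened a connection to each other at the same time:
>>>>  * the higher vpid drops its socket and retries, the lower vpid
>>>>  * keeps waiting for the peer to reconnect. */
>>>> static bool i_should_retry(vpid_t my_vpid, vpid_t peer_vpid)
>>>> {
>>>>     return my_vpid > peer_vpid;
>>>> }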
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet
>>>> <[email protected]> wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> here is the full description of a race condition in oob/tcp that I very
>>>>> briefly mentioned in a previous post:
>>>>>
>>>>> the race condition can occur when two orteds that are not yet connected
>>>>> try to send a message to each other for the first time, at the same time.
>>>>>
>>>>> That can happen when running an MPI hello world on 4 nodes with the
>>>>> grpcomm/rcd module.
>>>>>
>>>>> here is a scenario in which the race condition occurs:
>>>>>
>>>>> orted vpids 2 and 3 enter the allgather
>>>>> /* they are not yet oob/tcp connected */
>>>>> and they call orte.send_buffer_nb to each other.
>>>>> From a libevent point of view, vpids 2 and 3 will call
>>>>> mca_oob_tcp_peer_try_connect.
>>>>>
>>>>> vpid 2 calls mca_oob_tcp_send_handler
>>>>>
>>>>> vpid 3 calls connection_event_handler
>>>>>
>>>>> depending on the value returned by random() in libevent, vpid 3 will
>>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely).
>>>>> If vpid 3 calls recv_handler, it will close both sockets to vpid 2.
>>>>>
>>>>> Then vpid 2 will call mca_oob_tcp_recv_handler
>>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
>>>>> which will invoke mca_oob_tcp_recv_connect_ack.
>>>>> tcp_peer_recv_blocking will fail
>>>>> /* zero bytes are recv'ed since vpid 3 previously closed the socket
>>>>> before writing a header */
>>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>>>>> /* ORTE_FORCED_TERMINATE(1) */
>>>>>
>>>>> could you please have a look at it?
>>>>>
>>>>> if you are too busy, could you please advise how this scenario should
>>>>> be handled differently?
>>>>> - should vpid 3 keep one socket instead of closing both and retrying?
>>>>> - should vpid 2 handle the failure as a non-fatal error?
>>>>> (a rough sketch of that second option is below)
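>>>>>
>>>>> A self-contained sketch of the second option, with stand-in types rather
>>>>> than the real oob/tcp structures: treat a zero-byte read while waiting
>>>>> for the connect ack as "peer went away, retry later" instead of a fatal
>>>>> error:
>>>>>
>>>>> #include <stdbool.h>
>>>>> #include <sys/types.h>
>>>>> #include <sys/socket.h>
>>>>> #include <unistd.h>
>>>>>
>>>>> typedef enum { PEER_CLOSED, PEER_CONNECT_ACK, PEER_CONNECTED } peer_state_t;
>>>>>
>>>>> typedef struct {
>>>>>     int          sd;      /* socket to the peer */
>>>>>     peer_state_t state;
>>>>> } peer_t;
>>>>>
>>>>> /* Returns true if the ack was fully read; false if the peer closed its
>>>>>  * end first, in which case the caller should fall back to the retry
>>>>>  * path rather than calling ORTE_FORCED_TERMINATE. */
>>>>> static bool recv_connect_ack(peer_t *peer, void *hdr, size_t len)
>>>>> {
>>>>>     ssize_t rc = recv(peer->sd, hdr, len, MSG_WAITALL);
>>>>>     if (rc == (ssize_t)len) {
>>>>>         return true;
>>>>>     }
>>>>>     /* rc <= 0 (typically 0 here): the peer closed the socket before
>>>>>      * writing a header, exactly the situation described above */
>>>>>     close(peer->sd);
>>>>>     peer->sd = -1;
>>>>>     peer->state = PEER_CLOSED;
>>>>>     return false;
>>>>> }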
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>
> <oob_tcp.patch>