Thanks Ralph,

this is much better, but there is still a bug: with the very same
scenario I described earlier, vpid 2 does not send its message to
vpid 3 once the connection has been established.

I tried to debug it, but I have been pretty unsuccessful so far.

vpid 2 calls tcp_peer_connected and executes the following snippet:

    if (NULL != peer->send_msg && !peer->send_ev_active) {
        /* a message is already queued for this peer, so (re)arm the send
         * event; mca_oob_tcp_send_handler should fire once the socket is
         * writable */
        opal_event_add(&peer->send_event, 0);
        peer->send_ev_active = true;
    }
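
As a sanity check of the expected behaviour, here is a minimal,
self-contained sketch using plain libevent 2 (not the opal_event
wrappers, and with a made-up callback name on_writable): once a write
event has been added on a connected fd, the next pass of the event loop
should find it and invoke its callback, which is what I would expect
mca_oob_tcp_send_handler to get here.

    #include <event2/event.h>
    #include <stdio.h>
    #include <sys/socket.h>

    static void on_writable(evutil_socket_t fd, short what, void *arg)
    {
        /* stands in for mca_oob_tcp_send_handler */
        printf("fd %d is writable (events=0x%x)\n", (int) fd, (unsigned) what);
    }

    int main(void)
    {
        int sv[2];
        struct event_base *base = event_base_new();

        /* a connected socket pair; sv[0] is immediately writable */
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

        /* rough equivalent of opal_event_add(&peer->send_event, 0):
         * register a one-shot EV_WRITE event on the fd */
        struct event *ev = event_new(base, sv[0], EV_WRITE, on_writable, NULL);
        event_add(ev, NULL);

        /* the loop should fire on_writable once and then return,
         * since the event is not persistent */
        event_base_dispatch(base);

        event_free(ev);
        event_base_free(base);
        return 0;
    }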

But when libevent's evmap_io_active is invoked later, the following loop:

    /* walk every event registered on this fd and activate the ones whose
     * requested event mask matches what just triggered */
    TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
        if (ev->ev_events & events)
            event_active_nolock(ev, ev->ev_events & events, 1);
    }

finds only one ev (whose callback is mca_oob_tcp_recv_handler) and *no*
event pointing at mca_oob_tcp_send_handler.
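
To narrow down why the send event is missing from that list, one thing
I may try is dumping the state of peer->send_event and peer->recv_event
right after tcp_peer_connected returns. A sketch of such a debug helper
(dump_event_state is a made-up name), using the public libevent 2 API
and assuming the opal events can be treated as plain struct event
underneath:

    #include <event2/event.h>
    #include <stdio.h>

    /* print which fd an event is bound to and whether it is currently
     * pending for EV_READ / EV_WRITE */
    static void dump_event_state(const char *label, const struct event *ev)
    {
        fprintf(stderr, "%s: fd=%d pending_read=%d pending_write=%d\n",
                label,
                (int) event_get_fd(ev),
                event_pending(ev, EV_READ, NULL) != 0,
                event_pending(ev, EV_WRITE, NULL) != 0);
    }

If the send event turns out to be pending on a different fd than the
recv event, that would suggest the peer's socket was closed and
re-created after the send event had been armed, which would fit the
close-and-retry path Ralph mentioned.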

I will resume my investigation tomorrow.

Cheers,

Gilles

On 2014/09/17 4:01, Ralph Castain wrote:
> Hi Gilles
>
> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned it 
> to you for review. Give it a try and let me know if I (hopefully) got it.
>
> The approach we have used in the past is to have both sides close their 
> connections, and then have the higher vpid retry while the lower one waits. 
> The logic for that was still in place, but it looks like you are hitting a 
> different code path, and I found another potential one as well. So I think I 
> plugged the holes, but will wait to hear if you confirm.
>
> Thanks
> Ralph
>
> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
>
>> Ralph,
>>
>> Here is the full description of a race condition in oob/tcp I very briefly
>> mentioned in a previous post:
>>
>> The race condition can occur when two orteds that are not yet connected try
>> to send a message to each other for the first time, at the same time.
>>
>> That can occur when running an MPI hello world on 4 nodes with the
>> grpcomm/rcd module.
>>
>> Here is a scenario in which the race condition occurs:
>>
>> orted vpids 2 and 3 enter the allgather
>> /* they are not yet oob/tcp connected */
>> and they call orte.send_buffer_nb to each other.
>> From a libevent point of view, vpids 2 and 3 will call
>> mca_oob_tcp_peer_try_connect.
>>
>> vpid 2 calls mca_oob_tcp_send_handler
>>
>> vpid 3 calls connection_event_handler
>>
>> Depending on the value returned by random() in libevent, vpid 3 will
>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely).
>> If vpid 3 calls recv_handler, it will close the two sockets to vpid 2.
>>
>> Then vpid 2 will call mca_oob_tcp_recv_handler
>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
>> which will invoke mca_oob_tcp_recv_connect_ack.
>> tcp_peer_recv_blocking will fail
>> /* zero bytes are recv'ed since vpid 3 previously closed the socket before
>> writing a header */
>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>> /* ORTE_FORCED_TERMINATE(1) */.
>>
>> Could you please have a look at it?
>>
>> If you are too busy, could you please advise where this scenario should be
>> handled differently?
>> - Should vpid 3 keep one socket instead of closing both and retrying?
>> - Should vpid 2 handle the failure as a non-fatal error?
>>
>> Cheers,
>>
>> Gilles
