Thanks Ralph, this is much better, but there is still a bug: with the very same scenario I described earlier, vpid 2 does not send its message to vpid 3 once the connection has been established.

I tried to debug it but I have been pretty unsuccessful so far. vpid 2 calls tcp_peer_connected and executes the following snippet:

    if (NULL != peer->send_msg && !peer->send_ev_active) {
        opal_event_add(&peer->send_event, 0);
        peer->send_ev_active = true;
    }

but when evmap_io_active is invoked later, the following part:

    TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
        if (ev->ev_events & events)
            event_active_nolock(ev, ev->ev_events & events, 1);
    }

finds only one ev (mca_oob_tcp_recv_handler and *no* mca_oob_tcp_send_handler).
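For reference, here is a minimal stand-alone model of that dispatch loop (simplified, with made-up types -- this is *not* the actual libevent code): dispatch can only activate events that were previously added to the per-fd list, so if the send event was never (re-)added for the connected socket, the loop has no way to find it.

    #include <stdio.h>
    #include <sys/queue.h>

    #define EV_READ  0x02
    #define EV_WRITE 0x04

    struct ev {
        const char *name;
        short ev_events;              /* EV_READ / EV_WRITE mask */
        TAILQ_ENTRY(ev) ev_io_next;
    };
    TAILQ_HEAD(ev_list, ev);

    /* activate every registered event whose mask matches the ready fd */
    static void io_active(struct ev_list *ctx, short events)
    {
        struct ev *e;
        TAILQ_FOREACH(e, ctx, ev_io_next) {
            if (e->ev_events & events) {
                printf("activating %s\n", e->name);
            }
        }
    }

    int main(void)
    {
        struct ev_list ctx = TAILQ_HEAD_INITIALIZER(ctx);
        struct ev recv_ev = { "mca_oob_tcp_recv_handler", EV_READ };

        /* the send event was never added for this fd, so it is simply
         * not on the list -- only the recv handler can be activated */
        TAILQ_INSERT_TAIL(&ctx, &recv_ev, ev_io_next);
        io_active(&ctx, EV_READ | EV_WRITE);
        return 0;
    }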
I will resume my investigations tomorrow.

Cheers,

Gilles

On 2014/09/17 4:01, Ralph Castain wrote:
> Hi Gilles
>
> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned it
> to you for review. Give it a try and let me know if I (hopefully) got it.
>
> The approach we have used in the past is to have both sides close their
> connections, and then have the higher vpid retry while the lower one waits.
> The logic for that was still in place, but it looks like you are hitting a
> different code path, and I found another potential one as well. So I think I
> plugged the holes, but will wait to hear if you confirm.
>
> Thanks
> Ralph
>
> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
>> Ralph,
>>
>> Here is the full description of a race condition in oob/tcp I very briefly
>> mentioned in a previous post.
>>
>> The race condition can occur when two not-yet-connected orteds try to send
>> a message to each other for the first time and at the same time.
>>
>> That can occur when running mpi helloworld on 4 nodes with the grpcomm/rcd
>> module.
>>
>> Here is a scenario in which the race condition occurs:
>>
>> orted vpid 2 and 3 enter the allgather
>> /* they are not yet oob/tcp connected */
>> and they each call orte.send_buffer_nb to the other.
>> From a libevent point of view, vpid 2 and 3 will call
>> mca_oob_tcp_peer_try_connect.
>>
>> vpid 2 calls mca_oob_tcp_send_handler
>>
>> vpid 3 calls connection_event_handler
>>
>> Depending on the value returned by random() in libevent, vpid 3 will
>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely).
>> If vpid 3 calls recv_handler, it will close the two sockets to vpid 2.
>>
>> Then vpid 2 will call mca_oob_tcp_recv_handler
>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
>> which will invoke mca_oob_tcp_recv_connect_ack.
>> tcp_peer_recv_blocking will fail
>> /* zero bytes are recv'ed since vpid 3 previously closed the socket before
>> writing a header */
>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>> /* ORTE_FORCED_TERMINATE(1) */
>>
>> Could you please have a look at it?
>>
>> If you are too busy, could you please advise where this scenario should be
>> handled differently?
>> - should vpid 3 keep one socket instead of closing both and retrying?
>> - should vpid 2 handle the failure as a non-fatal error?
>>
>> Cheers,
>>
>> Gilles
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/09/15836.php
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15844.php
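PS: for completeness, here is a minimal stand-alone sketch of the tie-break Ralph describes above -- both sides close their sockets, then the higher vpid retries while the lower one waits. All names here are hypothetical, not the actual oob/tcp code; the real trigger would be something like recv() returning 0 because the peer closed its socket before writing a header.

    #include <stdio.h>
    #include <unistd.h>

    typedef unsigned int vpid_t;

    /* called once a simultaneous connect has been detected */
    static void resolve_dual_connect(int sd, vpid_t my_vpid, vpid_t peer_vpid)
    {
        if (sd >= 0) {
            close(sd);               /* both sides close their sockets */
        }
        if (my_vpid > peer_vpid) {
            /* higher vpid: re-run the connect logic */
            printf("vpid %u: retrying connect to vpid %u\n",
                   my_vpid, peer_vpid);
        } else {
            /* lower vpid: keep listening, the peer will reconnect */
            printf("vpid %u: waiting for vpid %u to reconnect\n",
                   my_vpid, peer_vpid);
        }
    }

    int main(void)
    {
        resolve_dual_connect(-1, 2, 3);  /* vpid 2 waits   */
        resolve_dual_connect(-1, 3, 2);  /* vpid 3 retries */
        return 0;
    }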