[OMPI devel] race condition in oob/tcp

2014-09-16 Thread Gilles Gouaillardet
Ralph, here is the full description of a race condition in oob/tcp that I very briefly mentioned in a previous post: the race condition can occur when two orted daemons that are not yet connected try to send a message to each other for the first time, at the same time. This can happen when running an MPI hello world on 4

Re: [OMPI devel] race condition in oob/tcp

2014-09-16 Thread Ralph Castain
Hi Gilles I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned it to you for review. Give it a try and let me know if I (hopefully) got it. The approach we have used in the past is to have both sides close their connections, and then have the higher vpid retry while the low
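A minimal sketch of the tie-break described above, for readers following the thread. It assumes both sides have already closed their colliding sockets and that the side with the lower vpid simply waits for the peer to reconnect; the message is cut off at that point, so the second half is an assumption. The names `sim_connect_action_t` and `sim_connect_resolve` are hypothetical and are not part of the oob/tcp code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration only; this is not the oob/tcp API. */
typedef enum {
    SIM_CONNECT_RETRY, /* this side initiates the connection again */
    SIM_CONNECT_WAIT   /* this side waits for the peer to reconnect (assumed) */
} sim_connect_action_t;

/*
 * Decide what to do after a simultaneous-connect collision, once both
 * sides have closed their sockets. Exactly one of the two peers gets
 * SIM_CONNECT_RETRY, so the collision cannot repeat forever.
 */
static sim_connect_action_t sim_connect_resolve(uint32_t my_vpid,
                                                uint32_t peer_vpid)
{
    /* A process never collides with itself. */
    assert(my_vpid != peer_vpid);
    return (my_vpid > peer_vpid) ? SIM_CONNECT_RETRY : SIM_CONNECT_WAIT;
}
```

With vpids 2 and 3 as in this thread, vpid 3 would retry and vpid 2 would wait. Comparing vpids works as a tie-break because each daemon has a distinct vpid, so both sides reach opposite decisions without exchanging any further messages.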

Re: [OMPI devel] race condition in oob/tcp

2014-09-17 Thread Gilles Gouaillardet
Thanks Ralph, this is much better, but there is still a bug: with the very same scenario I described earlier, vpid 2 does not send its message to vpid 3 once the connection has been established. I tried to debug it but have been pretty unsuccessful so far. vpid 2 calls tcp_peer_connected and

Re: [OMPI devel] race condition in oob/tcp

2014-09-17 Thread Ralph Castain
Do you have a reproducer you can share for testing this? I'm unable to get it to happen on my machine, but maybe you have a test code that triggers it so I can continue debugging.
Ralph
On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet wrote:
> Thanks Ralph,
>
> this is much better but there

Re: [OMPI devel] race condition in oob/tcp

2014-09-18 Thread Gilles Gouaillardet
Ralph, yes and no ... An MPI hello world on four nodes can be used to reproduce the issue. You can increase the likelihood of hitting the race condition by hacking ./opal/mca/event/libevent2021/libevent/poll.c and replacing i = random() % nfds; with if (nfds < 2) { i =

Re: [OMPI devel] race condition in oob/tcp

2014-09-18 Thread Ralph Castain
The patch looks fine to me - please go ahead and apply it. Thanks!
On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet wrote:
> Ralph,
>
> yes and no ...
>
> mpi hello world with four nodes can be used to reproduce the issue,
>
>
> you can increase the likelihood of producing the race conditi

Re: [OMPI devel] race condition in oob/tcp

2014-09-19 Thread Gilles Gouaillardet
Ralph, I found another race condition. In a very specific scenario, vpid 3 is in the MCA_OOB_TCP_CLOSED state and processes data from the socket received from vpid 2. vpid 3 is in the MCA_OOB_TCP_CLOSED state because vpid 2 called retry() and closed both of its sockets to vpid 3. vpid 3 read the ack

Re: [OMPI devel] race condition in oob/tcp

2014-09-19 Thread Gilles Gouaillardet
Ralph, let me detail the new race condition.
- orted 2 and 3 are not connected to each other, and each sends a message to the other
- orted 2 and 3 call send_process (which sets peer->state = MCA_OOB_TCP_PEER_CONNECTING)
- they both end up calling mca_oob_tcp_peer_try_connect
- now if orted 3 calls mca_oob_tcp

Re: [OMPI devel] race condition in oob/tcp

2014-09-19 Thread Ralph Castain
You know, I'm almost beginning to dread opening my email in the morning for fear of seeing another "race condition" subject line! :-) I think the correct answer here is that orted 3 should be entering "retry" when it sees the peer state change to "closed", regardless of what happened in the sen

Re: [OMPI devel] race condition in oob/tcp

2014-09-19 Thread George Bosilca
Or copy the handshake protocol design of the TCP BTL...
George.
On Fri, Sep 19, 2014 at 6:23 PM, Ralph Castain wrote:
> You know, I'm almost beginning to dread opening my email in the morning
> for fear of seeing another "race condition" subject line! :-)
>
> I think the correct answer here i

Re: [OMPI devel] race condition in oob/tcp

2014-09-21 Thread Gilles Gouaillardet
Thanks for the pointer, George!
On Sat, Sep 20, 2014 at 5:46 AM, George Bosilca wrote:
> Or copy the handshake protocol design of the TCP BTL...
The main difference between oob/tcp and btl/tcp is the way we resolve the situation in which two processes send their first message to each other a

Re: [OMPI devel] race condition in oob/tcp

2014-09-21 Thread Ralph Castain
Sounds fine with me - please go ahead, and thanks.
On Sep 20, 2014, at 10:26 PM, Gilles Gouaillardet wrote:
> Thanks for the pointer George !
>
> On Sat, Sep 20, 2014 at 5:46 AM, George Bosilca wrote:
> Or copy the handshake protocol design of the TCP BTL...
>
>
> the main difference between

Re: [OMPI devel] race condition in oob/tcp

2014-09-22 Thread Ralph Castain
Gilles - please let me know if/when you think you'll do this. I'm debating about adding it to 1.8.3, but don't want to delay that release too long. Alternatively, I can take care of it if you don't have time (I'm asking if you can do it solely because you have the reproducer). On Sep 21, 2014,

Re: [OMPI devel] race condition in oob/tcp

2014-09-22 Thread Gilles Gouaillardet
Ralph, here is the patch I am using so far. I will resume working on this on Wednesday (at least one race condition still remains) unless you have time to take care of it today. So far, the race condition has only been observed in real life with the grpcomm/rcd module, and this is

Re: [OMPI devel] race condition in oob/tcp

2014-09-23 Thread Ralph Castain
Thanks! I won't have time to work on it this week, but appreciate your effort. Also, thanks for clarifying the race condition vis 1.8 - I agree it is not a blocker for that release.
Ralph
On Sep 22, 2014, at 4:49 PM, Gilles Gouaillardet wrote:
> Ralph,
>
> here is the patch i am using so fa

Re: [OMPI devel] race condition in oob/tcp

2014-09-26 Thread Gilles Gouaillardet
Ralph, I just committed r32799 to fix this issue. I CMR'ed it (#4923) and set the target to 1.8.4.
Cheers,
Gilles
On 2014/09/23 22:55, Ralph Castain wrote:
> Thanks! I won't have time to work on it this week, but appreciate your
> effort. Also, thanks for clarifying the race condition vis

Re: [OMPI devel] race condition in oob/tcp

2014-09-26 Thread Ralph Castain
Thanks!
On Fri, Sep 26, 2014 at 12:56 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
> Ralph,
>
> i just committed r32799 in order to fix this issue.
> i cmr'ed (#4923) and set the target for 1.8.4
>
> Cheers,
>
> Gilles
>
>
> On 2014/09/23 22:55, Ralph Castain wrote:
> > Thanks