Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-12 Thread Joshua Ladd
Let me know if Nadia can help here, Ralph. Josh On Fri, Sep 12, 2014 at 9:31 AM, Ralph Castain wrote: > On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet wrote: >> Ralph, >> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain wrote: >>> The design is supposed…

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-12 Thread Ralph Castain
On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet wrote: > Ralph, > On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain wrote: >> The design is supposed to be that each node knows precisely how many daemons are involved in each collective, and who is going to talk to them. > ok, but the design does not ensure that things will happen in the right order…

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-12 Thread Gilles Gouaillardet
Ralph, On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain wrote: > The design is supposed to be that each node knows precisely how many > daemons are involved in each collective, and who is going to talk to them. ok, but the design does not ensure that things will happen in the right order:…

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Ralph Castain
The design is supposed to be that each node knows precisely how many daemons are involved in each collective, and who is going to talk to them. The signature contains the info required to ensure the receiver knows which collective this message relates to, and just happens to also allow them to…

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Gilles Gouaillardet
Ralph, you are right, this was definitely not the right fix (at least with 4 nodes or more). I finally understood what is going wrong here: to make it simple, the allgather recursive doubling algo is not implemented with MPI_Recv(..., peer, ...)-like functions but with MPI_Recv(..., MPI_ANY_SOURCE, …
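[Editor's note] To make the race concrete, here is a minimal sketch of a recursive-doubling allgather written as plain MPI in C. It is illustrative only: rd_allgather, the buffer layout, and the per-round tag are assumptions for this sketch, not the actual grpcomm/rcd code. The point is that every round has exactly one well-defined partner; the rcd bug is the analogue of receiving with a wildcard source, so a message from a faster peer's later round (or a different collective) can be matched by the wrong receive.

    /* Sketch only: recursive-doubling allgather over a power-of-two
     * number of ranks. Assumes each rank has pre-placed its own block
     * at buf + rank*blocksz. Not the actual grpcomm/rcd code. */
    #include <mpi.h>

    void rd_allgather(char *buf, int blocksz, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int dist = 1; dist < size; dist <<= 1) {
            int peer      = rank ^ dist;                    /* partner this round */
            int my_base   = (rank & ~(dist - 1)) * blocksz; /* blocks I hold      */
            int peer_base = (peer & ~(dist - 1)) * blocksz; /* blocks I will get  */

            /* Safe form: receive from the specific partner computed for
             * this round. Substituting MPI_ANY_SOURCE for `peer` in the
             * receive is the analogue of the race Gilles describes. */
            MPI_Sendrecv(buf + my_base,   dist * blocksz, MPI_BYTE, peer, dist,
                         buf + peer_base, dist * blocksz, MPI_BYTE, peer, dist,
                         comm, MPI_STATUS_IGNORE);
        }
    }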

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Ralph Castain
Yeah, that's not the right fix, I'm afraid. I've made the direct component the default again until I have time to dig into this deeper. On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet wrote: > Ralph, > the root cause is when the second orted/mpirun runs rcd_finalize_coll, it does not invoke…

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Gilles Gouaillardet
Ralph, the root cause is when the second orted/mpirun runs rcd_finalize_coll: it does not invoke pmix_server_release because allgather_stub was not previously invoked, since the fence was not yet entered. /* in rcd_finalize_coll, coll->cbfunc is NULL */ the attached patch is likely not the right…
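[Editor's note] For readers outside the grpcomm code, a hedged sketch of the failure mode Gilles is pointing at. Every type and function name below (coll_tracker_t, finalize_coll, release_fn) is invented for illustration and is not the OMPI source:

    #include <stddef.h>

    typedef void (*release_fn)(void *cbdata);

    typedef struct {
        int        nreported;   /* contributions seen so far            */
        int        nexpected;   /* daemons participating                */
        release_fn cbfunc;      /* set by allgather_stub on fence entry */
        void      *cbdata;
    } coll_tracker_t;

    static void finalize_coll(coll_tracker_t *coll)
    {
        if (coll->nreported < coll->nexpected)
            return;                   /* collective not yet complete */

        if (NULL == coll->cbfunc) {
            /* The race: remote contributions completed the collective
             * before this daemon entered the fence, so allgather_stub
             * has not yet registered the callback. Falling through here
             * means the release never happens and the local procs hang;
             * a real fix must defer the release until cbfunc is set. */
            return;
        }
        coll->cbfunc(coll->cbdata);   /* releases the waiting procs */
    }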

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Gilles Gouaillardet
Ralph, things got worse indeed :-( now a simple hello world involving two hosts hangs in mpi_init. there is still a race condition: if task a calls fence long after task b, then task b will never leave the fence. I'll try to debug this ... Cheers, Gilles On 2014/09/11 2:36, Ralph Castain wrote:…

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-10 Thread Ralph Castain
I think I now have this fixed - let me know what you see. On Sep 9, 2014, at 6:15 AM, Ralph Castain wrote: > Yeah, that's not the correct fix. The right way to fix it is for all three > components to have their own RML tag, and for each of them to establish a > persistent receive. They then can…

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-09 Thread Ralph Castain
Yeah, that's not the correct fix. The right way to fix it is for all three components to have their own RML tag, and for each of them to establish a persistent receive. They then can use the signature to tell which collective the incoming message belongs to. I'll fix it, but it won't be until t…
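[Editor's note] As a rough illustration of that design (invented names throughout; sig_t, coll_t, and recv_handler are stand-ins, not the real ORTE/RML API), the persistent receive demultiplexes on the collective signature rather than on sender identity or arrival order, and creates the tracker on demand when a remote contribution beats the local fence entry:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct sig {            /* stand-in for the real signature   */
        uint32_t *members;          /* participating process identifiers */
        size_t    nmembers;
    } sig_t;

    typedef struct coll {
        struct coll *next;
        sig_t        sig;
        /* ... per-collective state: buffers, counts, cbfunc ... */
    } coll_t;

    static coll_t *active = NULL;   /* in-flight collectives */

    static int sig_equal(const sig_t *a, const sig_t *b)
    {
        return a->nmembers == b->nmembers &&
               0 == memcmp(a->members, b->members,
                           a->nmembers * sizeof(uint32_t));
    }

    /* Handler behind this component's persistent receive on its own RML
     * tag: every message carries its signature, so matching never depends
     * on who sent it or when it arrived. */
    static void recv_handler(const sig_t *sig, const void *payload, size_t len)
    {
        coll_t *c;
        for (c = active; c != NULL; c = c->next)
            if (sig_equal(&c->sig, sig))
                break;
        if (NULL == c) {            /* message beat the local fence entry */
            c = calloc(1, sizeof(*c));
            c->sig = *sig;          /* a real version would deep-copy     */
            c->next = active;
            active = c;
        }
        /* ... fold payload into c; finalize once all members reported ... */
        (void)payload; (void)len;
    }

Creating the tracker on first sight of a signature is what removes the ordering dependency: it no longer matters whether the local fence entry or the first remote contribution arrives first.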