Let me know if Nadia can help here, Ralph.
Josh
On Fri, Sep 12, 2014 at 9:31 AM, Ralph Castain wrote:
>
> On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain wrote:
>
>> The design is supposed
On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet wrote:
> Ralph,
>
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain wrote:
> The design is supposed to be that each node knows precisely how many daemons
> are involved in each collective, and who is going to talk to them.
>
> ok, but in the
Ralph,
On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain wrote:
> The design is supposed to be that each node knows precisely how many
> daemons are involved in each collective, and who is going to talk to them.
ok, but the design does not ensure that things will happen in the right
order:
-
The design is supposed to be that each node knows precisely how many daemons
are involved in each collective, and who is going to talk to them. The
signature contains the info required to ensure the receiver knows which
collective this message relates to, and just happens to also allow them to
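A minimal sketch of the signature idea described above; the struct and function names are hypothetical illustrations, not the actual ORTE grpcomm code:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* A collective "signature": the sorted list of participating process ids.
 * Illustrative only; the real signature carries ORTE process names. */
typedef struct {
    uint32_t *procs;
    size_t    nprocs;
} coll_signature_t;

static int signature_equal(const coll_signature_t *a, const coll_signature_t *b)
{
    return a->nprocs == b->nprocs &&
           0 == memcmp(a->procs, b->procs, a->nprocs * sizeof(uint32_t));
}

/* Pending collectives tracked by the daemon. */
typedef struct pending_coll {
    coll_signature_t     sig;
    struct pending_coll *next;
} pending_coll_t;

/* Match an incoming fragment to a collective by its signature, regardless
 * of which daemon sent it. */
static pending_coll_t *match_collective(pending_coll_t *list,
                                        const coll_signature_t *incoming)
{
    for (pending_coll_t *p = list; NULL != p; p = p->next) {
        if (signature_equal(&p->sig, incoming)) {
            return p;
        }
    }
    return NULL;   /* no collective with this signature has started yet */
}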
Ralph,
you are right, this was definitely not the right fix (at least with 4
nodes or more).
I finally understood what is going wrong here:
to make it simple, the allgather recursive doubling algo is not
implemented with MPI_Recv(..., peer, ...)-like functions but with
MPI_Recv(..., MPI_ANY_SOURCE, ..
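For reference, a textbook recursive-doubling allgather receives from a specific peer (rank XOR 2^k) in every round; the sketch below uses MPI_Sendrecv with an explicit peer, which is the contrast to the MPI_ANY_SOURCE-style receive described above. It assumes a power-of-two number of ranks and one int per rank, and it is not the grpcomm/rcd code:

#include <mpi.h>

/* Recursive-doubling allgather sketch: in round k each rank exchanges the
 * contiguous block it has gathered so far with peer = rank ^ (1 << k).
 * Receiving from the explicit peer keeps the rounds ordered; a receive on
 * MPI_ANY_SOURCE could instead match a message from a later round. */
static void rd_allgather(int myval, int *all, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);          /* assumed to be a power of two */

    all[rank] = myval;
    int have = 1;                        /* size of the block owned so far */
    int base = rank;                     /* start index of that block */

    for (int mask = 1; mask < size; mask <<= 1) {
        int peer      = rank ^ mask;
        int peer_base = base ^ mask;     /* block the peer sends to us */
        MPI_Sendrecv(&all[base],      have, MPI_INT, peer, mask,
                     &all[peer_base], have, MPI_INT, peer, mask,
                     comm, MPI_STATUS_IGNORE);
        base &= ~mask;                   /* merged block starts lower */
        have <<= 1;
    }
}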
Yeah, that's not the right fix, I'm afraid. I've made the direct component the
default again until I have time to dig into this deeper.
On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet wrote:
> Ralph,
>
> the root cause is when the second orted/mpirun runs rcd_finalize_coll,
> it does not inv
Ralph,
the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
it does not invoke pmix_server_release,
because allgather_stub was not previously invoked since the fence
was not yet entered.
/* in rcd_finalize_coll, coll->cbfunc is NULL */
the attached patch is likely not the rig
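A schematic of the ordering problem described here, with hypothetical names rather than the actual orted code: if the daemon-level collective completes before the local fence was entered, no completion callback is registered yet, so the release has to be parked instead of dropped:

#include <stddef.h>
#include <stdbool.h>

typedef void (*release_fn_t)(void *cbdata);

typedef struct {
    release_fn_t cbfunc;    /* registered only once the local fence is entered */
    void        *cbdata;
    bool         complete;  /* daemon-level collective already finished */
} coll_t;

/* Runs when the daemon-level collective completes (cf. rcd_finalize_coll). */
static void finalize_coll(coll_t *coll)
{
    coll->complete = true;
    if (NULL != coll->cbfunc) {
        coll->cbfunc(coll->cbdata);   /* normal path: release the waiter */
    }
    /* else: nobody has entered the fence yet; keep the result for later
     * instead of silently dropping the release (which hangs the fence). */
}

/* Runs when the local process finally enters the fence. */
static void fence_entered(coll_t *coll, release_fn_t fn, void *cbdata)
{
    coll->cbfunc = fn;
    coll->cbdata = cbdata;
    if (coll->complete) {
        coll->cbfunc(coll->cbdata);   /* collective already done: release now */
    }
}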
Ralph,
things got worse indeed :-(
now a simple hello world involving two hosts hangs in mpi_init.
there is still a race condition: if task a calls fence long after task b,
then task b will never leave the fence.
I'll try to debug this ...
Cheers,
Gilles
On 2014/09/11 2:36, Ralph Castain wro
I think I now have this fixed - let me know what you see.
On Sep 9, 2014, at 6:15 AM, Ralph Castain wrote:
> Yeah, that's not the correct fix. The right way to fix it is for all three
> components to have their own RML tag, and for each of them to establish a
> persistent receive. They then c
Yeah, that's not the correct fix. The right way to fix it is for all three
components to have their own RML tag, and for each of them to establish a
persistent receive. They then can use the signature to tell which collective
the incoming message belongs to.
I'll fix it, but it won't be until t
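A sketch of that approach with hypothetical names (the real code would go through the ORTE RML, which is not reproduced here): the component posts one persistent receive on its own tag, and the signature carried in each message selects the collective it belongs to:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define MY_COMPONENT_TAG 42           /* one dedicated tag per component */
#define MAX_COLLS 16                  /* array bound check omitted for brevity */

typedef struct {
    uint64_t sig;                     /* stands in for the participant list */
    int      frags;
} coll_t;

static coll_t pending[MAX_COLLS];
static int    npending;

/* The signature, not the sender, decides which collective a message feeds. */
static coll_t *lookup_or_create(uint64_t sig)
{
    for (int i = 0; i < npending; i++) {
        if (pending[i].sig == sig) {
            return &pending[i];
        }
    }
    pending[npending].sig = sig;
    return &pending[npending++];
}

/* One callback serves every collective that uses this component. */
static void component_recv_cb(uint64_t sig, const void *msg, size_t len)
{
    coll_t *coll = lookup_or_create(sig);
    coll->frags++;
    (void)msg; (void)len;
    printf("tag %d: fragment %d for collective %llu\n",
           MY_COMPONENT_TAG, coll->frags, (unsigned long long)sig);
}

/* Stand-in for posting a persistent non-blocking receive on a tag. */
static void post_persistent_recv(int tag,
                                 void (*cb)(uint64_t, const void *, size_t))
{
    (void)tag; (void)cb;              /* registration only in this sketch */
}

static void component_init(void)
{
    /* Posted once at component start; stays armed across collectives. */
    post_persistent_recv(MY_COMPONENT_TAG, component_recv_cb);
}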
Folks,
Since r32672 (trunk), grpcomm/rcd is the default module.
the attached spawn.c test program is a trimmed version of the
spawn_with_env_vars.c test case from the IBM test suite.
when invoked on two nodes:
- the program hangs with -np 2
- the program can crash with -np > 2
error message is
[n
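The attached spawn.c and the full error output are not part of this excerpt; for context, a generic MPI_Comm_spawn test of the same shape (the parent spawns copies of the same binary, then both sides disconnect) looks roughly like the sketch below. It could be run across two nodes with something like "mpirun -np 2 -host node0,node1 ./spawn" (host names hypothetical).

/* Generic MPI_Comm_spawn smoke test; illustrative only, not the attached
 * spawn.c from the IBM test suite. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, inter;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* parent side: spawn two copies of this same binary */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
        printf("parent %d: spawn done\n", rank);
        MPI_Comm_disconnect(&inter);
    } else {
        /* child side: report and disconnect from the parent */
        printf("child %d: spawned\n", rank);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}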