Ralph,

On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain <r...@open-mpi.org> wrote:
> The design is supposed to be that each node knows precisely how many
> daemons are involved in each collective, and who is going to talk to them.

OK, but the design does not ensure that things will happen in the right order:
- enter the allgather
- receive data from the daemon at distance 1
- receive data from the daemon at distance 2
- and so on

With the current implementation, when 2 daemons are involved, if a daemon
enters the allgather after it has received data from its peer, then the MPI
processes local to this daemon will hang.

With 4 nodes, it gets trickier:
- 0 enters the allgather and sends a message to 1
- 1 receives the message and sends to 2, but with data from 0 only
  /* 1 did not enter the allgather, so its data cannot be sent to 2 */

This issue did not occur before the persistent receive: no receive was posted
if the daemon had not entered the allgather.

> The signature contains the info required to ensure the receiver knows which
> collective this message relates to, and just happens to also allow them to
> lookup the number of daemons involved (the base function takes care of that
> for them).

OK too; this issue was solved with the persistent receive.

> So there is no need for a "pending" list - if you receive a message about a
> collective you don't yet know about, you just put it on the ongoing
> collective list. You should only receive it if you are going to be involved
> - i.e., you have local procs that are going to participate. So you wait
> until your local procs participate, and then pass your collected bucket
> along.

OK, I did something similar (i.e. pass all the available data). Some data
might be passed twice, but that might not be an issue.

> I suspect the link to the local procs isn't being correctly dealt with,
> else you couldn't be hanging. Or the rcd isn't correctly passing incoming
> messages to the base functions to register the collective.
>
> I'll look at it over the weekend and can resolve it then.
The attached patch is an illustration of what I was trying to explain.
coll->nreported is used by rcd as a bitmask of the received messages
(bit 0 is for the local daemon, bit n for the daemon at distance n).

I was still debugging a race condition: if daemons 2 and 3 enter the
allgather at the same time, they will send a message to each other at the
same time and RML fails establishing the connection. I could not find
whether this is linked to my changes...

Cheers,

Gilles

> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Ralph,
>>
>> you are right, this was definitely not the right fix (at least with 4
>> nodes or more)
>>
>> i finally understood what is going wrong here:
>> to make it simple, the allgather recursive doubling algo is not
>> implemented with MPI_Recv(...,peer,...) like functions but with
>> MPI_Recv(...,MPI_ANY_SOURCE,...) like functions,
>> and that makes things slightly more complicated.
>> right now:
>> - with two nodes: if node 1 is late, it gets stuck in the allgather
>> - with four nodes: if node 0 is first, then nodes 2 and 3 while node 1
>>   is still late, then node 0 will likely leave the allgather though it
>>   did not receive anything from node 1
>> - and so on
>>
>> i think i can fix that from now
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/11 23:47, Ralph Castain wrote:
>>> Yeah, that's not the right fix, I'm afraid. I've made the direct
>>> component the default again until I have time to dig into this deeper.
>>>
>>> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>
>>>> Ralph,
>>>>
>>>> the root cause is when the second orted/mpirun runs rcd_finalize_coll,
>>>> it does not invoke pmix_server_release
>>>> because allgather_stub was not previously invoked since the fence
>>>> was not yet entered.
>>>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>>>>
>>>> the attached patch is likely not the right fix, it was very lightly
>>>> tested, but so far, it works for me ...
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>>>>> Ralph,
>>>>>
>>>>> things got worse indeed :-(
>>>>>
>>>>> now a simple hello world involving two hosts hangs in mpi_init.
>>>>> there is still a race condition: if task a calls the fence long after
>>>>> task b, then task b will never leave the fence
>>>>>
>>>>> i'll try to debug this ...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 2014/09/11 2:36, Ralph Castain wrote:
>>>>>> I think I now have this fixed - let me know what you see.
>>>>>>
>>>>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>>> Yeah, that's not the correct fix. The right way to fix it is for
>>>>>>> all three components to have their own RML tag, and for each of them
>>>>>>> to establish a persistent receive. They then can use the signature
>>>>>>> to tell which collective the incoming message belongs to.
>>>>>>>
>>>>>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is
>>>>>>> shot.
>>>>>>>
>>>>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet
>>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>>
>>>>>>>> Folks,
>>>>>>>>
>>>>>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>>>>>>> the attached spawn.c test program is a trimmed version of the
>>>>>>>> spawn_with_env_vars.c test case from the ibm test suite.
>>>>>>>> when invoked on two nodes:
>>>>>>>> - the program hangs with -np 2
>>>>>>>> - the program can crash with np > 2
>>>>>>>> the error message is
>>>>>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>>>>>>> AND TAG -33 - ABORTING
>>>>>>>>
>>>>>>>> here is my full command line (from node0):
>>>>>>>>
>>>>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self
>>>>>>>> --mca coll ^ml ./spawn
>>>>>>>>
>>>>>>>> a simple workaround is to add the following extra parameter to the
>>>>>>>> mpirun command line:
>>>>>>>> --mca grpcomm_rcd_priority 0
>>>>>>>>
>>>>>>>> my understanding is that the race condition occurs when all the
>>>>>>>> processes call MPI_Finalize()
>>>>>>>> internally, the pmix module will have mpirun/orted issue two
>>>>>>>> ALLGATHERs involving mpirun and orted
>>>>>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned
>>>>>>>> tasks)
>>>>>>>> the error message is very explicit: this is not (currently) supported
>>>>>>>>
>>>>>>>> i wrote the attached rml.patch, which is really a workaround and
>>>>>>>> not a fix:
>>>>>>>> in this case, each job will invoke an ALLGATHER but with a
>>>>>>>> different tag
>>>>>>>> /* that works for a limited number of jobs only */
>>>>>>>>
>>>>>>>> i did not commit this patch since this is not a fix, could someone
>>>>>>>> (Ralph ?) please review the issue and comment ?
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> <spawn.c><rml.patch>
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> de...@open-mpi.org
>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>> Link to this post:
>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15780.php
>>>> <rml2.patch>
rml3.patch
Description: Binary data