Ralph,

On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain <r...@open-mpi.org> wrote:
> The design is supposed to be that each node knows precisely how many
> daemons are involved in each collective, and who is going to talk to them.

OK, but the design does not ensure that things will happen in the right order:
- enter the allgather
- receive data from the daemon at distance 1
- receive data from the daemon at distance 2
- and so on

With the current implementation, when 2 daemons are involved, if a daemon
enters the allgather after it has received data from its peer, then the MPI
processes local to this daemon will hang.

With 4 nodes, it gets trickier:
- 0 enters the allgather and sends a message to 1
- 1 receives the message and sends to 2, but with data from 0 only
  /* 1 did not enter the allgather, so its data cannot be sent to 2 */

This issue did not occur before the persistent receive: no receive was posted
if the daemon had not entered the allgather.

> The signature contains the info required to ensure the receiver knows which
> collective this message relates to, and just happens to also allow them to
> lookup the number of daemons involved (the base function takes care of that
> for them).

OK too; this issue was solved with the persistent receive.

> So there is no need for a "pending" list - if you receive a message about a
> collective you don't yet know about, you just put it on the ongoing
> collective list. You should only receive it if you are going to be involved
> - i.e., you have local procs that are going to participate. So you wait
> until your local procs participate, and then pass your collected bucket
> along.

OK, I did something similar (i.e. pass all the available data). Some data
might be passed twice, but that might not be an issue.

> I suspect the link to the local procs isn't being correctly dealt with,
> else you couldn't be hanging. Or the rcd isn't correctly passing incoming
> messages to the base functions to register the collective.
>
> I'll look at it over the weekend and can resolve it then.
The attached patch is an illustration of what I was trying to explain.
coll->nreported is used by rcd as a bitmask of the received messages
(bit 0 is for the local daemon, bit n for the daemon at distance n).

I was still debugging a race condition: if daemons 2 and 3 enter the
allgather at the same time, they will send a message to each other at the
same time and RML fails establishing the connection. I could not find
whether this is linked to my changes...

Cheers,

Gilles

> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Ralph,
>>
>> you are right, this was definitely not the right fix (at least with 4
>> nodes or more)
>>
>> i finally understood what is going wrong here:
>> to make it simple, the allgather recursive doubling algo is not
>> implemented with MPI_Recv(...,peer,...) like functions but with
>> MPI_Recv(...,MPI_ANY_SOURCE,...) like functions,
>> and that makes things slightly more complicated.
>> right now:
>> - with two nodes: if node 1 is late, it gets stuck in the allgather
>> - with four nodes: if node 0 is first, then nodes 2 and 3 while node 1
>>   is still late, then node 0 will likely leave the allgather though it
>>   did not receive anything from node 1
>> - and so on
>>
>> i think i can fix that from now
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/11 23:47, Ralph Castain wrote:
>>> Yeah, that's not the right fix, I'm afraid. I've made the direct
>>> component the default again until I have time to dig into this deeper.
>>>
>>> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>
>>>> Ralph,
>>>>
>>>> the root cause is when the second orted/mpirun runs rcd_finalize_coll,
>>>> it does not invoke pmix_server_release
>>>> because allgather_stub was not previously invoked since the fence
>>>> was not yet entered.
>>>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>>>>
>>>> the attached patch is likely not the right fix, it was very lightly
>>>> tested, but so far, it works for me ...
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>>>>> Ralph,
>>>>>
>>>>> things got worse indeed :-(
>>>>>
>>>>> now a simple hello world involving two hosts hangs in mpi_init.
>>>>> there is still a race condition: if task a calls the fence long after
>>>>> task b, then task b will never leave the fence
>>>>>
>>>>> i'll try to debug this ...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 2014/09/11 2:36, Ralph Castain wrote:
>>>>>> I think I now have this fixed - let me know what you see.
>>>>>>
>>>>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>>> Yeah, that's not the correct fix. The right way to fix it is for
>>>>>>> all three components to have their own RML tag, and for each of them
>>>>>>> to establish a persistent receive. They then can use the signature
>>>>>>> to tell which collective the incoming message belongs to.
>>>>>>>
>>>>>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is
>>>>>>> shot.
>>>>>>>
>>>>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet
>>>>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>>>>
>>>>>>>> Folks,
>>>>>>>>
>>>>>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>>>>>>> the attached spawn.c test program is a trimmed version of the
>>>>>>>> spawn_with_env_vars.c test case from the ibm test suite.
>>>>>>>> when invoked on two nodes:
>>>>>>>> - the program hangs with -np 2
>>>>>>>> - the program can crash with np > 2
>>>>>>>> the error message is
>>>>>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>>>>>>> AND TAG -33 - ABORTING
>>>>>>>>
>>>>>>>> here is my full command line (from node0):
>>>>>>>>
>>>>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self
>>>>>>>> --mca coll ^ml ./spawn
>>>>>>>>
>>>>>>>> a simple workaround is to add the following extra parameter to the
>>>>>>>> mpirun command line:
>>>>>>>> --mca grpcomm_rcd_priority 0
>>>>>>>>
>>>>>>>> my understanding is that the race condition occurs when all the
>>>>>>>> processes call MPI_Finalize()
>>>>>>>> internally, the pmix module will have mpirun/orted issue two
>>>>>>>> ALLGATHERs involving mpirun and orted
>>>>>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned
>>>>>>>> tasks)
>>>>>>>> the error message is very explicit: this is not (currently) supported
>>>>>>>>
>>>>>>>> i wrote the attached rml.patch, which is really a workaround and
>>>>>>>> not a fix:
>>>>>>>> in this case, each job will invoke an ALLGATHER but with a
>>>>>>>> different tag
>>>>>>>> /* that works for a limited number of jobs only */
>>>>>>>>
>>>>>>>> i did not commit this patch since this is not a fix, could someone
>>>>>>>> (Ralph ?) please review the issue and comment ?
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> <spawn.c><rml.patch>
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> de...@open-mpi.org
>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>> Link to this post:
>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15780.php
>>>> <rml2.patch>
rml3.patch
Description: Binary data