Re: [OMPI devel] race condition in grpcomm/rcd
Let me know if Nadia can help here, Ralph. Josh

On Fri, Sep 12, 2014 at 9:31 AM, Ralph Castain wrote:
>
> On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain wrote:
>> The design is supposed to be that each node knows precisely how many
>> daemons are involved in each collective, and who is going to talk to them.
>
> ok, but the design does not ensure that things will happen in the right
> order:
> - enter the allgather
> - receive data from the daemon at distance 1
> - receive data from the daemon at distance 2
> - and so on
>
> with the current implementation, when 2 daemons are involved, if a daemon
> enters the allgather after it received data from the peer, then the mpi
> processes local to this daemon will hang
>
> with 4 nodes, it gets trickier:
> 0 enters the allgather and sends a message to 1
> 1 receives the message and sends to 2, but with data from 0 only
> /* 1 did not enter the allgather, so its data cannot be sent to 2 */
>
> It's just a bug in the rcd logic, Gilles. I'll take a look and get it
> fixed - just don't have time right now
>
> this issue did not occur before the persistent receive:
> no receive was posted if the daemon did not enter the allgather
>
>> The signature contains the info required to ensure the receiver knows
>> which collective this message relates to, and just happens to also allow
>> them to lookup the number of daemons involved (the base function takes
>> care of that for them).
>
> ok too, this issue was solved with the persistent receive
>
>> So there is no need for a "pending" list - if you receive a message about
>> a collective you don't yet know about, you just put it on the ongoing
>> collective list. You should only receive it if you are going to be
>> involved - i.e., you have local procs that are going to participate. So
>> you wait until your local procs participate, and then pass your collected
>> bucket along.
>
> ok, i did something similar
> (e.g. pass all the available data)
> some data might be passed twice, but that might not be an issue
>
>> I suspect the link to the local procs isn't being correctly dealt with,
>> else you couldn't be hanging. Or the rcd isn't correctly passing incoming
>> messages to the base functions to register the collective.
>>
>> I'll look at it over the weekend and can resolve it then.
>
> the attached patch is an illustration of what i was trying to explain.
> coll->nreported is used by rcd as a bitmask of the received messages
> (bit 0 is for the local daemon, bit n for the daemon at distance n)
>
> i was still debugging a race condition:
> if daemons 2 and 3 enter the allgather at the same time, they will send a
> message to each other at the same time and rml fails to establish the
> connection. i could not find whether this is linked to my changes...
>
> Cheers,
>
> Gilles
>
>> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>>> Ralph,
>>>
>>> you are right, this was definitely not the right fix (at least with 4
>>> nodes or more)
>>>
>>> i finally understood what is going wrong here:
>>> to make it simple, the allgather recursive doubling algo is not
>>> implemented with MPI_Recv(...,peer,...)-like functions but with
>>> MPI_Recv(...,MPI_ANY_SOURCE,...)-like functions,
>>> and that makes things slightly more complicated:
>>> right now:
>>> - with two nodes: if node 1 is late, it gets stuck in the allgather
>>> - with four nodes: if node 0 is first, then nodes 2 and 3, while node 1
>>> is still late, then node 0 will likely leave the allgather even though
>>> it did not receive anything from node 1
>>> - and so on
>>>
>>> i think i can fix that from now
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/09/11 23:47, Ralph Castain wrote:
>>>> Yeah, that's not the right fix, I'm afraid. I've made the direct
>>>> component the default again until I have time to dig into this deeper.
>>>>
>>>> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> the root cause is when the second orted/mpirun runs rcd_finalize_coll,
>>>>> it does not invoke pmix_server_release
>>>>> because allgather_stub was not previously invoked since the fence
>>>>> was not yet entered.
>>>>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>>>>>
>>>>> the attached patch is likely not the right fix, it was very lightly
>>>>> tested, but so far, it works for me ...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>>>>>> Ralph,
>>>>>>
>>>>>> things got worse indeed :-(
>>>>>>
>>>>>> now a simple hello world involving two hosts hangs in mpi_init.
>>>>>> there is still a race condition: if task a calls fence long after
>>>>>> task b, then task b will never leave the fence
Re: [OMPI devel] race condition in grpcomm/rcd
On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet wrote:
> Ralph,
>
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain wrote:
>> The design is supposed to be that each node knows precisely how many
>> daemons are involved in each collective, and who is going to talk to them.
>
> ok, but the design does not ensure that things will happen in the right
> order:
> - enter the allgather
> - receive data from the daemon at distance 1
> - receive data from the daemon at distance 2
> - and so on
>
> with the current implementation, when 2 daemons are involved, if a daemon
> enters the allgather after it received data from the peer, then the mpi
> processes local to this daemon will hang
>
> with 4 nodes, it gets trickier:
> 0 enters the allgather and sends a message to 1
> 1 receives the message and sends to 2, but with data from 0 only
> /* 1 did not enter the allgather, so its data cannot be sent to 2 */

It's just a bug in the rcd logic, Gilles. I'll take a look and get it fixed -
just don't have time right now

> this issue did not occur before the persistent receive:
> no receive was posted if the daemon did not enter the allgather
>
>> The signature contains the info required to ensure the receiver knows
>> which collective this message relates to, and just happens to also allow
>> them to lookup the number of daemons involved (the base function takes
>> care of that for them).
>
> ok too, this issue was solved with the persistent receive
>
>> So there is no need for a "pending" list - if you receive a message about
>> a collective you don't yet know about, you just put it on the ongoing
>> collective list. You should only receive it if you are going to be
>> involved - i.e., you have local procs that are going to participate. So
>> you wait until your local procs participate, and then pass your collected
>> bucket along.
>
> ok, i did something similar
> (e.g. pass all the available data)
> some data might be passed twice, but that might not be an issue
>
>> I suspect the link to the local procs isn't being correctly dealt with,
>> else you couldn't be hanging. Or the rcd isn't correctly passing incoming
>> messages to the base functions to register the collective.
>>
>> I'll look at it over the weekend and can resolve it then.
>
> the attached patch is an illustration of what i was trying to explain.
> coll->nreported is used by rcd as a bitmask of the received messages
> (bit 0 is for the local daemon, bit n for the daemon at distance n)
>
> i was still debugging a race condition:
> if daemons 2 and 3 enter the allgather at the same time, they will send a
> message to each other at the same time and rml fails to establish the
> connection. i could not find whether this is linked to my changes...
>
> Cheers,
>
> Gilles
>
> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet wrote:
>
>> Ralph,
>>
>> you are right, this was definitely not the right fix (at least with 4
>> nodes or more)
>>
>> i finally understood what is going wrong here:
>> to make it simple, the allgather recursive doubling algo is not
>> implemented with MPI_Recv(...,peer,...)-like functions but with
>> MPI_Recv(...,MPI_ANY_SOURCE,...)-like functions,
>> and that makes things slightly more complicated:
>> right now:
>> - with two nodes: if node 1 is late, it gets stuck in the allgather
>> - with four nodes: if node 0 is first, then nodes 2 and 3, while node 1
>> is still late, then node 0 will likely leave the allgather even though
>> it did not receive anything from node 1
>> - and so on
>>
>> i think i can fix that from now
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/11 23:47, Ralph Castain wrote:
>>> Yeah, that's not the right fix, I'm afraid. I've made the direct
>>> component the default again until I have time to dig into this deeper.
>>>
>>> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet wrote:
>>>
>>>> Ralph,
>>>>
>>>> the root cause is when the second orted/mpirun runs rcd_finalize_coll,
>>>> it does not invoke pmix_server_release
>>>> because allgather_stub was not previously invoked since the fence
>>>> was not yet entered.
>>>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>>>>
>>>> the attached patch is likely not the right fix, it was very lightly
>>>> tested, but so far, it works for me ...
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>>>>> Ralph,
>>>>>
>>>>> things got worse indeed :-(
>>>>>
>>>>> now a simple hello world involving two hosts hangs in mpi_init.
>>>>> there is still a race condition: if task a calls fence long after
>>>>> task b, then task b will never leave the fence
>>>>>
>>>>> i'll try to debug this ...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 2014/09/11 2:36, Ralph Castain wrote:
>>>>>> I think I now have this fixed - let me know what you see.
Re: [OMPI devel] race condition in grpcomm/rcd
Ralph,

On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain wrote:
> The design is supposed to be that each node knows precisely how many
> daemons are involved in each collective, and who is going to talk to them.

ok, but the design does not ensure that things will happen in the right
order:
- enter the allgather
- receive data from the daemon at distance 1
- receive data from the daemon at distance 2
- and so on

with the current implementation, when 2 daemons are involved, if a daemon
enters the allgather after it received data from the peer, then the mpi
processes local to this daemon will hang

with 4 nodes, it gets trickier:
0 enters the allgather and sends a message to 1
1 receives the message and sends to 2, but with data from 0 only
/* 1 did not enter the allgather, so its data cannot be sent to 2 */

this issue did not occur before the persistent receive:
no receive was posted if the daemon did not enter the allgather

> The signature contains the info required to ensure the receiver knows
> which collective this message relates to, and just happens to also allow
> them to lookup the number of daemons involved (the base function takes
> care of that for them).

ok too, this issue was solved with the persistent receive

> So there is no need for a "pending" list - if you receive a message about
> a collective you don't yet know about, you just put it on the ongoing
> collective list. You should only receive it if you are going to be
> involved - i.e., you have local procs that are going to participate. So
> you wait until your local procs participate, and then pass your collected
> bucket along.

ok, i did something similar
(e.g. pass all the available data)
some data might be passed twice, but that might not be an issue

> I suspect the link to the local procs isn't being correctly dealt with,
> else you couldn't be hanging. Or the rcd isn't correctly passing incoming
> messages to the base functions to register the collective.
>
> I'll look at it over the weekend and can resolve it then.

the attached patch is an illustration of what i was trying to explain.
coll->nreported is used by rcd as a bitmask of the received messages
(bit 0 is for the local daemon, bit n for the daemon at distance n)

i was still debugging a race condition:
if daemons 2 and 3 enter the allgather at the same time, they will send a
message to each other at the same time and rml fails to establish the
connection. i could not find whether this is linked to my changes...

Cheers,

Gilles

> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Ralph,
>>
>> you are right, this was definitely not the right fix (at least with 4
>> nodes or more)
>>
>> i finally understood what is going wrong here:
>> to make it simple, the allgather recursive doubling algo is not
>> implemented with MPI_Recv(...,peer,...)-like functions but with
>> MPI_Recv(...,MPI_ANY_SOURCE,...)-like functions,
>> and that makes things slightly more complicated:
>> right now:
>> - with two nodes: if node 1 is late, it gets stuck in the allgather
>> - with four nodes: if node 0 is first, then nodes 2 and 3, while node 1
>> is still late, then node 0 will likely leave the allgather even though
>> it did not receive anything from node 1
>> - and so on
>>
>> i think i can fix that from now
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/11 23:47, Ralph Castain wrote:
>>> Yeah, that's not the right fix, I'm afraid. I've made the direct
>>> component the default again until I have time to dig into this deeper.
>>>
>>> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>
>>>> Ralph,
>>>>
>>>> the root cause is when the second orted/mpirun runs rcd_finalize_coll,
>>>> it does not invoke pmix_server_release
>>>> because allgather_stub was not previously invoked since the fence
>>>> was not yet entered.
>>>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>>>>
>>>> the attached patch is likely not the right fix, it was very lightly
>>>> tested, but so far, it works for me ...
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>>>>> Ralph,
>>>>>
>>>>> things got worse indeed :-(
>>>>>
>>>>> now a simple hello world involving two hosts hangs in mpi_init.
>>>>> there is still a race condition: if task a calls fence long after
>>>>> task b, then task b will never leave the fence
>>>>>
>>>>> i'll try to debug this ...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 2014/09/11 2:36, Ralph Castain wrote:
>>>>>> I think I now have this fixed - let me know what you see.
>>>>>>
>>>>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain wrote:
>>>>>>
>>>>>>> Yeah, that's not the correct fix. The right way to fix it is for
>>>>>>> all three components to have their own RML tag, and for each of them
>>>>>>> to establish a persistent receive. They then can use the signature to
>>>>>>> tell which collective the incoming message belongs to.
Re: [OMPI devel] race condition in grpcomm/rcd
The design is supposed to be that each node knows precisely how many daemons
are involved in each collective, and who is going to talk to them. The
signature contains the info required to ensure the receiver knows which
collective this message relates to, and just happens to also allow them to
lookup the number of daemons involved (the base function takes care of that
for them).

So there is no need for a "pending" list - if you receive a message about a
collective you don't yet know about, you just put it on the ongoing collective
list. You should only receive it if you are going to be involved - i.e., you
have local procs that are going to participate. So you wait until your local
procs participate, and then pass your collected bucket along.

I suspect the link to the local procs isn't being correctly dealt with, else
you couldn't be hanging. Or the rcd isn't correctly passing incoming messages
to the base functions to register the collective.

I'll look at it over the weekend and can resolve it then.

On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet wrote:
> Ralph,
>
> you are right, this was definitely not the right fix (at least with 4
> nodes or more)
>
> i finally understood what is going wrong here:
> to make it simple, the allgather recursive doubling algo is not
> implemented with MPI_Recv(...,peer,...)-like functions but with
> MPI_Recv(...,MPI_ANY_SOURCE,...)-like functions,
> and that makes things slightly more complicated:
> right now:
> - with two nodes: if node 1 is late, it gets stuck in the allgather
> - with four nodes: if node 0 is first, then nodes 2 and 3, while node 1
> is still late, then node 0 will likely leave the allgather even though
> it did not receive anything from node 1
> - and so on
>
> i think i can fix that from now
>
> Cheers,
>
> Gilles
>
> On 2014/09/11 23:47, Ralph Castain wrote:
>> Yeah, that's not the right fix, I'm afraid. I've made the direct
>> component the default again until I have time to dig into this deeper.
>>
>> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet wrote:
>>
>>> Ralph,
>>>
>>> the root cause is when the second orted/mpirun runs rcd_finalize_coll,
>>> it does not invoke pmix_server_release
>>> because allgather_stub was not previously invoked since the fence
>>> was not yet entered.
>>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>>>
>>> the attached patch is likely not the right fix, it was very lightly
>>> tested, but so far, it works for me ...
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>>>> Ralph,
>>>>
>>>> things got worse indeed :-(
>>>>
>>>> now a simple hello world involving two hosts hangs in mpi_init.
>>>> there is still a race condition: if task a calls fence long after
>>>> task b, then task b will never leave the fence
>>>>
>>>> i'll try to debug this ...
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/09/11 2:36, Ralph Castain wrote:
>>>>> I think I now have this fixed - let me know what you see.
>>>>>
>>>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain wrote:
>>>>>
>>>>>> Yeah, that's not the correct fix. The right way to fix it is for all
>>>>>> three components to have their own RML tag, and for each of them to
>>>>>> establish a persistent receive. They then can use the signature to
>>>>>> tell which collective the incoming message belongs to.
>>>>>>
>>>>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is
>>>>>> shot.
>>>>>>
>>>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet wrote:
>>>>>>
>>>>>>> Folks,
>>>>>>>
>>>>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>>>>>> the attached spawn.c test program is a trimmed version of the
>>>>>>> spawn_with_env_vars.c test case from the ibm test suite.
>>>>>>>
>>>>>>> when invoked on two nodes :
>>>>>>> - the program hangs with -np 2
>>>>>>> - the program can crash with np > 2
>>>>>>> error message is
>>>>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>>>>>> AND TAG -33 - ABORTING
>>>>>>>
>>>>>>> here is my full command line (from node0) :
>>>>>>>
>>>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self
>>>>>>> --mca coll ^ml ./spawn
>>>>>>>
>>>>>>> a simple workaround is to add the following extra parameter to the
>>>>>>> mpirun command line :
>>>>>>> --mca grpcomm_rcd_priority 0
>>>>>>>
>>>>>>> my understanding is that the race condition occurs when all the
>>>>>>> processes call MPI_Finalize()
>>>>>>> internally, the pmix module will have mpirun/orted issue two
>>>>>>> ALLGATHERs involving mpirun and orted
>>>>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned
>>>>>>> tasks)
>>>>>>> the error message is very explicit : this is not (currently)
>>>>>>> supported
>>>>>>>
>>>>>>> i wrote the attached rml.patch which is really a workaround and not
>>>>>>> a fix
Re: [OMPI devel] race condition in grpcomm/rcd
Ralph,

you are right, this was definitely not the right fix (at least with 4
nodes or more)

i finally understood what is going wrong here:
to make it simple, the allgather recursive doubling algo is not
implemented with MPI_Recv(...,peer,...)-like functions but with
MPI_Recv(...,MPI_ANY_SOURCE,...)-like functions,
and that makes things slightly more complicated:
right now:
- with two nodes: if node 1 is late, it gets stuck in the allgather
- with four nodes: if node 0 is first, then nodes 2 and 3, while node 1
is still late, then node 0 will likely leave the allgather even though it
did not receive anything from node 1
- and so on

i think i can fix that from now

Cheers,

Gilles

On 2014/09/11 23:47, Ralph Castain wrote:
> Yeah, that's not the right fix, I'm afraid. I've made the direct component
> the default again until I have time to dig into this deeper.
>
> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet wrote:
>
>> Ralph,
>>
>> the root cause is when the second orted/mpirun runs rcd_finalize_coll,
>> it does not invoke pmix_server_release
>> because allgather_stub was not previously invoked since the fence
>> was not yet entered.
>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>>
>> the attached patch is likely not the right fix, it was very lightly
>> tested, but so far, it works for me ...
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>>> Ralph,
>>>
>>> things got worse indeed :-(
>>>
>>> now a simple hello world involving two hosts hangs in mpi_init.
>>> there is still a race condition: if task a calls fence long after
>>> task b, then task b will never leave the fence
>>>
>>> i'll try to debug this ...
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/09/11 2:36, Ralph Castain wrote:
>>>> I think I now have this fixed - let me know what you see.
>>>>
>>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain wrote:
>>>>
>>>>> Yeah, that's not the correct fix. The right way to fix it is for all
>>>>> three components to have their own RML tag, and for each of them to
>>>>> establish a persistent receive. They then can use the signature to
>>>>> tell which collective the incoming message belongs to.
>>>>>
>>>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is
>>>>> shot.
>>>>>
>>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet wrote:
>>>>>
>>>>>> Folks,
>>>>>>
>>>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>>>>> the attached spawn.c test program is a trimmed version of the
>>>>>> spawn_with_env_vars.c test case from the ibm test suite.
>>>>>>
>>>>>> when invoked on two nodes :
>>>>>> - the program hangs with -np 2
>>>>>> - the program can crash with np > 2
>>>>>> error message is
>>>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>>>>> AND TAG -33 - ABORTING
>>>>>>
>>>>>> here is my full command line (from node0) :
>>>>>>
>>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self
>>>>>> --mca coll ^ml ./spawn
>>>>>>
>>>>>> a simple workaround is to add the following extra parameter to the
>>>>>> mpirun command line :
>>>>>> --mca grpcomm_rcd_priority 0
>>>>>>
>>>>>> my understanding is that the race condition occurs when all the
>>>>>> processes call MPI_Finalize()
>>>>>> internally, the pmix module will have mpirun/orted issue two
>>>>>> ALLGATHERs involving mpirun and orted
>>>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned
>>>>>> tasks)
>>>>>> the error message is very explicit : this is not (currently)
>>>>>> supported
>>>>>>
>>>>>> i wrote the attached rml.patch which is really a workaround and not
>>>>>> a fix :
>>>>>> in this case, each job will invoke an ALLGATHER but with a different
>>>>>> tag
>>>>>> /* that works for a limited number of jobs only */
>>>>>>
>>>>>> i did not commit this patch since this is not a fix, could someone
>>>>>> (Ralph ?) please review the issue and comment ?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> ___
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>> Link to this post:
>>>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15780.php
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15794.php
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/09/15804.php
Re: [OMPI devel] race condition in grpcomm/rcd
Yeah, that's not the right fix, I'm afraid. I've made the direct component the
default again until I have time to dig into this deeper.

On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet wrote:
> Ralph,
>
> the root cause is when the second orted/mpirun runs rcd_finalize_coll,
> it does not invoke pmix_server_release
> because allgather_stub was not previously invoked since the fence
> was not yet entered.
> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>
> the attached patch is likely not the right fix, it was very lightly
> tested, but so far, it works for me ...
>
> Cheers,
>
> Gilles
>
> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>> Ralph,
>>
>> things got worse indeed :-(
>>
>> now a simple hello world involving two hosts hangs in mpi_init.
>> there is still a race condition: if task a calls fence long after
>> task b, then task b will never leave the fence
>>
>> i'll try to debug this ...
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/11 2:36, Ralph Castain wrote:
>>> I think I now have this fixed - let me know what you see.
>>>
>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain wrote:
>>>
>>>> Yeah, that's not the correct fix. The right way to fix it is for all
>>>> three components to have their own RML tag, and for each of them to
>>>> establish a persistent receive. They then can use the signature to
>>>> tell which collective the incoming message belongs to.
>>>>
>>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is
>>>> shot.
>>>>
>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet wrote:
>>>>
>>>>> Folks,
>>>>>
>>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>>>> the attached spawn.c test program is a trimmed version of the
>>>>> spawn_with_env_vars.c test case from the ibm test suite.
>>>>>
>>>>> when invoked on two nodes :
>>>>> - the program hangs with -np 2
>>>>> - the program can crash with np > 2
>>>>> error message is
>>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>>>> AND TAG -33 - ABORTING
>>>>>
>>>>> here is my full command line (from node0) :
>>>>>
>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self
>>>>> --mca coll ^ml ./spawn
>>>>>
>>>>> a simple workaround is to add the following extra parameter to the
>>>>> mpirun command line :
>>>>> --mca grpcomm_rcd_priority 0
>>>>>
>>>>> my understanding is that the race condition occurs when all the
>>>>> processes call MPI_Finalize()
>>>>> internally, the pmix module will have mpirun/orted issue two
>>>>> ALLGATHERs involving mpirun and orted
>>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned
>>>>> tasks)
>>>>> the error message is very explicit : this is not (currently)
>>>>> supported
>>>>>
>>>>> i wrote the attached rml.patch which is really a workaround and not
>>>>> a fix :
>>>>> in this case, each job will invoke an ALLGATHER but with a different
>>>>> tag
>>>>> /* that works for a limited number of jobs only */
>>>>>
>>>>> i did not commit this patch since this is not a fix, could someone
>>>>> (Ralph ?) please review the issue and comment ?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15805.php
Re: [OMPI devel] race condition in grpcomm/rcd
Ralph,

the root cause is when the second orted/mpirun runs rcd_finalize_coll,
it does not invoke pmix_server_release
because allgather_stub was not previously invoked since the fence
was not yet entered.
/* in rcd_finalize_coll, coll->cbfunc is NULL */

the attached patch is likely not the right fix, it was very lightly
tested, but so far, it works for me ...

Cheers,

Gilles

On 2014/09/11 16:11, Gilles Gouaillardet wrote:
> Ralph,
>
> things got worse indeed :-(
>
> now a simple hello world involving two hosts hangs in mpi_init.
> there is still a race condition: if task a calls fence long after task b,
> then task b will never leave the fence
>
> i'll try to debug this ...
>
> Cheers,
>
> Gilles
>
> On 2014/09/11 2:36, Ralph Castain wrote:
>> I think I now have this fixed - let me know what you see.
>>
>> On Sep 9, 2014, at 6:15 AM, Ralph Castain wrote:
>>
>>> Yeah, that's not the correct fix. The right way to fix it is for all
>>> three components to have their own RML tag, and for each of them to
>>> establish a persistent receive. They then can use the signature to
>>> tell which collective the incoming message belongs to.
>>>
>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is
>>> shot.
>>>
>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet wrote:
>>>
>>>> Folks,
>>>>
>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>>> the attached spawn.c test program is a trimmed version of the
>>>> spawn_with_env_vars.c test case from the ibm test suite.
>>>>
>>>> when invoked on two nodes :
>>>> - the program hangs with -np 2
>>>> - the program can crash with np > 2
>>>> error message is
>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>>> AND TAG -33 - ABORTING
>>>>
>>>> here is my full command line (from node0) :
>>>>
>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self
>>>> --mca coll ^ml ./spawn
>>>>
>>>> a simple workaround is to add the following extra parameter to the
>>>> mpirun command line :
>>>> --mca grpcomm_rcd_priority 0
>>>>
>>>> my understanding is that the race condition occurs when all the
>>>> processes call MPI_Finalize()
>>>> internally, the pmix module will have mpirun/orted issue two
>>>> ALLGATHERs involving mpirun and orted
>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned
>>>> tasks)
>>>> the error message is very explicit : this is not (currently) supported
>>>>
>>>> i wrote the attached rml.patch which is really a workaround and not a
>>>> fix :
>>>> in this case, each job will invoke an ALLGATHER but with a different
>>>> tag
>>>> /* that works for a limited number of jobs only */
>>>>
>>>> i did not commit this patch since this is not a fix, could someone
>>>> (Ralph ?) please review the issue and comment ?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles

Index: orte/mca/grpcomm/rcd/grpcomm_rcd.c
===================================================================
--- orte/mca/grpcomm/rcd/grpcomm_rcd.c (revision 32706)
+++ orte/mca/grpcomm/rcd/grpcomm_rcd.c (working copy)
@@ -6,6 +6,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC. All
  *                         rights reserved.
  * Copyright (c) 2014      Intel, Inc. All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -85,6 +87,9 @@
 static int allgather(orte_grpcomm_coll_t *coll,
                      opal_buffer_t *sendbuf)
 {
+    orte_grpcomm_base_pending_coll_t *pc;
+    bool pending = false;
+
     OPAL_OUTPUT_VERBOSE((5, orte_grpcomm_base_framework.framework_output,
                          "%s grpcomm:coll:recdub algo employed for %d processes",
                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (int)coll->ndmns));
@@ -106,6 +111,33 @@
      */
     rcd_allgather_send_dist(coll, 1);

+    OPAL_LIST_FOREACH(pc, &orte_grpcomm_base.pending, orte_grpcomm_base_pending_coll_t) {
+        if (NULL == coll->sig->signature) {
+            if (NULL == pc->coll->sig->signature) {
+                /* only one collective can operate
Re: [OMPI devel] race condition in grpcomm/rcd
Ralph,

things got worse indeed :-(

now a simple hello world involving two hosts hangs in mpi_init.
there is still a race condition: if task a calls fence long after task b,
then task b will never leave the fence

i'll try to debug this ...

Cheers,

Gilles

On 2014/09/11 2:36, Ralph Castain wrote:
> I think I now have this fixed - let me know what you see.
>
> On Sep 9, 2014, at 6:15 AM, Ralph Castain wrote:
>
>> Yeah, that's not the correct fix. The right way to fix it is for all
>> three components to have their own RML tag, and for each of them to
>> establish a persistent receive. They then can use the signature to tell
>> which collective the incoming message belongs to.
>>
>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>>
>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet wrote:
>>
>>> Folks,
>>>
>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>> the attached spawn.c test program is a trimmed version of the
>>> spawn_with_env_vars.c test case from the ibm test suite.
>>>
>>> when invoked on two nodes :
>>> - the program hangs with -np 2
>>> - the program can crash with np > 2
>>> error message is
>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>> AND TAG -33 - ABORTING
>>>
>>> here is my full command line (from node0) :
>>>
>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self
>>> --mca coll ^ml ./spawn
>>>
>>> a simple workaround is to add the following extra parameter to the
>>> mpirun command line :
>>> --mca grpcomm_rcd_priority 0
>>>
>>> my understanding is that the race condition occurs when all the
>>> processes call MPI_Finalize()
>>> internally, the pmix module will have mpirun/orted issue two ALLGATHERs
>>> involving mpirun and orted
>>> (one for job 1, aka the parent, and one for job 2, aka the spawned
>>> tasks)
>>> the error message is very explicit : this is not (currently) supported
>>>
>>> i wrote the attached rml.patch which is really a workaround and not a
>>> fix :
>>> in this case, each job will invoke an ALLGATHER but with a different
>>> tag
>>> /* that works for a limited number of jobs only */
>>>
>>> i did not commit this patch since this is not a fix, could someone
>>> (Ralph ?) please review the issue and comment ?
>>>
>>> Cheers,
>>>
>>> Gilles
Re: [OMPI devel] race condition in grpcomm/rcd
I think I now have this fixed - let me know what you see.

On Sep 9, 2014, at 6:15 AM, Ralph Castain wrote:

> Yeah, that's not the correct fix. The right way to fix it is for all three
> components to have their own RML tag, and for each of them to establish a
> persistent receive. They then can use the signature to tell which collective
> the incoming message belongs to.
>
> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>
> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet wrote:
>
>> [...]
Re: [OMPI devel] race condition in grpcomm/rcd
Yeah, that's not the correct fix. The right way to fix it is for all three
components to have their own RML tag, and for each of them to establish a
persistent receive. They then can use the signature to tell which collective
the incoming message belongs to.

I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.

On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet wrote:

> [...]
[OMPI devel] race condition in grpcomm/rcd
Folks,

Since r32672 (trunk), grpcomm/rcd is the default module.
the attached spawn.c test program is a trimmed version of the
spawn_with_env_vars.c test case from the ibm test suite.

when invoked on two nodes :
- the program hangs with -np 2
- the program can crash with np > 2
error message is
[node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1] AND TAG -33 - ABORTING

here is my full command line (from node0) :

mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca coll ^ml ./spawn

a simple workaround is to add the following extra parameter to the mpirun command line :

--mca grpcomm_rcd_priority 0

my understanding is that the race condition occurs when all the processes call MPI_Finalize() :
internally, the pmix module will have mpirun/orted issue two ALLGATHERs involving mpirun and orted
(one for job 1, aka the parent, and one for job 2, aka the spawned tasks).
the error message is very explicit : this is not (currently) supported

i wrote the attached rml.patch, which is really a workaround and not a fix :
in this case, each job will invoke an ALLGATHER but with a different tag
/* that works for a limited number of jobs only */

i did not commit this patch since this is not a fix; could someone
(Ralph ?) please review the issue and comment ?

Cheers,

Gilles

___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2014/09/15780.php

/*
 * $HEADER$
 *
 * Program to test MPI_Comm_spawn with environment variables.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "mpi.h"

static void do_parent(char *cmd, int rank, int count)
{
    int *errcode, err;
    int i;
    MPI_Comm child_inter;
    MPI_Comm intra;
    FILE *fp;
    int found;
    int size;

    /* First, see if cmd exists on all ranks */
    fp = fopen(cmd, "r");
    if (NULL == fp) {
        found = 0;
    } else {
        fclose(fp);
        found = 1;
    }
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Allreduce(&found, &count, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (count != size) {
        if (rank == 0) {
            MPI_Abort(MPI_COMM_WORLD, 77);
        }
        return;
    }

    /* Now try the spawn if it's found anywhere */
    errcode = malloc(sizeof(int) * count);
    if (NULL == errcode) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    memset(errcode, -1, sizeof(int) * count);
    MPI_Comm_spawn(cmd, MPI_ARGV_NULL, count, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &child_inter, errcode);

    /* Clean up */
    MPI_Barrier(child_inter);
    MPI_Comm_disconnect(&child_inter);
    free(errcode);
}

static void do_target(MPI_Comm parent)
{
    MPI_Barrier(parent);
    MPI_Comm_disconnect(&parent);
}

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Comm parent;

    /* Ok, we're good. Proceed with the test. */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Check to see if we *were* spawned -- because this is a test, we
       can only assume the existence of this one executable.  Hence, we
       both mpirun it and spawn it. */
    parent = MPI_COMM_NULL;
    MPI_Comm_get_parent(&parent);
    if (parent != MPI_COMM_NULL) {
        do_target(parent);
    } else {
        do_parent(argv[0], rank, size);
    }

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (0 < rank) sleep(3);

    MPI_Finalize();

    /* All done */
    return 0;
}

Index: orte/mca/grpcomm/brks/grpcomm_brks.c
===================================================================
--- orte/mca/grpcomm/brks/grpcomm_brks.c    (revision 32688)
+++ orte/mca/grpcomm/brks/grpcomm_brks.c    (working copy)
@@ -6,6 +6,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC. All
  *                         rights reserved.
  * Copyright (c) 2014      Intel, Inc. All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -111,6 +113,7 @@
 static int brks_allgather_send_dist(orte_grpcomm_coll_t *coll, orte_vpid_t distance) {
     orte_process_name_t peer_send, peer_recv;
     opal_buffer_t *send_buf;
+    orte_rml_tag_t tag;
     int rc;

     peer_send.jobid = ORTE_PROC_MY_NAME->jobid;
@@ -174,8 +177,14 @@
                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                          ORTE_NAME_PRINT(&peer_send)));

+    if (1 != coll->sig->sz || ORTE_VPID_WILDCARD != coll->sig->signature[0].vpid) {
+        tag = ORTE_RML_TAG_ALLGATHER;
+    } else {
+        tag = ORTE_RML_TAG_JOB_ALLGATHER + ORTE_LOCAL_JOBID(coll->sig->signature[0].jobid) % (ORTE_RML_TAG_MAX-ORTE_RML_TAG_JOB_ALLGATHER);
+    }
+
     if (0 > (rc = orte_rml.send_buffer_nb(&peer_send, send_buf,
-                                          -ORTE_RML_TAG_ALLGATHER,
+                                          -tag,
                                           orte_rml_send_callback, NULL))) {
         ORTE_ERROR_LOG(rc);
         OBJ_RELEASE(send_buf);
@@ -189,7 +198,7 @@
     /* setup recv for distance data