Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-12 Thread Joshua Ladd
Let me know if Nadia can help here, Ralph.

Josh


On Fri, Sep 12, 2014 at 9:31 AM, Ralph Castain  wrote:

>
> On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain  wrote:
>
>> The design is supposed to be that each node knows precisely how many
>> daemons are involved in each collective, and who is going to talk to them.
>
>
> ok, but the design does not ensure that things will happen in the right
> order :
> - enter the allgather
> - receive data from the daemon at distance 1
> - receive data from the daemon at distance 2
> - and so on
>
> with the current implementation, when 2 daemons are involved, if a daemon
> enters the allgather after it has received data from its peer, then the mpi
> processes local to this daemon will hang
>
> with 4 nodes, it gets trickier :
> 0 enters the allgather and sends a message to 1
> 1 receives the message and sends it to 2, but with data from 0 only
> /* 1 did not enter the allgather, so its own data cannot be sent to 2 */
>
>
> It's just a bug in the rcd logic, Gilles. I'll take a look and get it
> fixed - just don't have time right now
>
>
> this issue did not occur before the persistent receive :
> no receive was posted if the daemon did not enter the allgather
>
>
> The signature contains the info required to ensure the receiver knows
>> which collective this message relates to, and just happens to also allow
>> them to lookup the number of daemons involved (the base function takes care
>> of that for them).
>>
>>
> ok too, this issue was solved with the persistent receive
>
> So there is no need for a "pending" list - if you receive a message about
>> a collective you don't yet know about, you just put it on the ongoing
>> collective list. You should only receive it if you are going to be involved
>> - i.e., you have local procs that are going to participate. So you wait
>> until your local procs participate, and then pass your collected bucket
>> along.
>>
>> ok, i did something similar
> (e.g. pass all the available data)
> some data might be passed twice, but that might not be an issue
>
>
>> I suspect the link to the local procs isn't being correctly dealt with,
>> else you couldn't be hanging. Or the rcd isn't correctly passing incoming
>> messages to the base functions to register the collective.
>>
>> I'll look at it over the weekend and can resolve it then.
>>
>>
>  the attached patch is an illustration of what i was trying to explain.
> coll->nreported is used by rcd as a bitmask of the received messages
> (bit 0 is for the local daemon, bit n for the daemon at distance n)
>
> i was still debugging a race condition :
> if daemons 2 and 3 enter the allgather at the same time, they will send a
> message to each other at the same time and rml fails to establish the
> connection.  i could not figure out whether this is linked to my changes...
>
> Cheers,
>
> Gilles
>
>>
>> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>>
>> > Ralph,
>> >
>> > you are right, this was definitely not the right fix (at least with 4
>> > nodes or more)
>> >
>> > i finally understood what is going wrong here :
>> > to make it simple, the allgather recursive doubling algo is not
>> > implemented with
>> > MPI_Recv(...,peer,...) like functions but with
>> > MPI_Recv(...,MPI_ANY_SOURCE,...) like functions
>> > and that makes things slightly more complicated :
>> > right now :
>> > - with two nodes : if node 1 is late, it gets stuck in the allgather
>> > - with four nodes : if node 0 is first, then nodes 2 and 3 follow while node 1
>> > is still late, then node 0
>> > will likely leave the allgather even though it did not receive anything
>> > from node 1
>> > - and so on
>> >
>> > i think i can fix that now
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> > On 2014/09/11 23:47, Ralph Castain wrote:
>> >> Yeah, that's not the right fix, I'm afraid. I've made the direct
>> component the default again until I have time to dig into this deeper.
>> >>
>> >> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>> >>
>> >>> Ralph,
>> >>>
>> >>> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
>> >>> it does not invoke pmix_server_release
>> >>> because allgather_stub was not previously invoked since the fence
>> >>> was not yet entered.
>> >>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>> >>>
>> >>> the attached patch is likely not the right fix, it was very lightly
>> >>> tested, but so far, it works for me ...
>> >>>
>> >>> Cheers,
>> >>>
>> >>> Gilles
>> >>>
>> >>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>>  Ralph,
>> 
>>  things got worse indeed :-(
>> 
>>  now a simple hello world involving two hosts hangs in mpi_init.
>>  there is still a race condition : if task a calls fence long after
>> task b,
>>  then task b will never leave the fence

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-12 Thread Ralph Castain

On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet 
 wrote:

> Ralph,
> 
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain  wrote:
> The design is supposed to be that each node knows precisely how many daemons 
> are involved in each collective, and who is going to talk to them.
> 
> ok, but the design does not ensure that things will happen in the right
> order :
> - enter the allgather
> - receive data from the daemon at distance 1
> - receive data from the daemon at distance 2
> - and so on
> 
> with the current implementation, when 2 daemons are involved, if a daemon enters 
> the allgather after it has received data from its peer, then the mpi processes 
> local to this daemon will hang
> 
> with 4 nodes, it gets trickier :
> 0 enters the allgather and sends a message to 1
> 1 receives the message and sends it to 2, but with data from 0 only
> /* 1 did not enter the allgather, so its own data cannot be sent to 2 */

It's just a bug in the rcd logic, Gilles. I'll take a look and get it fixed - 
just don't have time right now

> 
> this issue did not occur before the persistent receive :
> no receive was posted if the daemon did not enter the allgather 
> 
> 
> The signature contains the info required to ensure the receiver knows which 
> collective this message relates to, and just happens to also allow them to 
> lookup the number of daemons involved (the base function takes care of that 
> for them).
> 
>  
> ok too, this issue was solved with the persistent receive
> 
> So there is no need for a "pending" list - if you receive a message about a 
> collective you don't yet know about, you just put it on the ongoing 
> collective list. You should only receive it if you are going to be involved - 
> i.e., you have local procs that are going to participate. So you wait until 
> your local procs participate, and then pass your collected bucket along.
> 
> ok, i did something similar
> (e.g. pass all the available data)
> some data might be passed twice, but that might not be an issue
>  
> I suspect the link to the local procs isn't being correctly dealt with, else 
> you couldn't be hanging. Or the rcd isn't correctly passing incoming messages 
> to the base functions to register the collective.
> 
> I'll look at it over the weekend and can resolve it then.
> 
> 
>  the attached patch is an illustration of what i was trying to explain.
> coll->nreported is used by rcd as a bitmask of the received messages
> (bit 0 is for the local daemon, bit n for the daemon at distance n)
> 
> i was still debugging a race condition :
> if daemons 2 and 3 enter the allgather at the same time, they will send a 
> message to each other at the same time and rml fails to establish the 
> connection.  i could not figure out whether this is linked to my changes...
> 
> Cheers,
> 
> Gilles
> 
> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet 
>  wrote:
> 
> > Ralph,
> >
> > you are right, this was definitely not the right fix (at least with 4
> > nodes or more)
> >
> > i finally understood what is going wrong here :
> > to make it simple, the allgather recursive doubling algo is not
> > implemented with
> > MPI_Recv(...,peer,...) like functions but with
> > MPI_Recv(...,MPI_ANY_SOURCE,...) like functions
> > and that makes things slightly more complicated :
> > right now :
> > - with two nodes : if node 1 is late, it gets stuck in the allgather
> > - with four nodes : if node 0 is first, then nodes 2 and 3 follow while node 1
> > is still late, then node 0
> > will likely leave the allgather even though it did not receive anything
> > from node 1
> > - and so on
> >
> > i think i can fix that now
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2014/09/11 23:47, Ralph Castain wrote:
> >> Yeah, that's not the right fix, I'm afraid. I've made the direct component 
> >> the default again until I have time to dig into this deeper.
> >>
> >> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet 
> >>  wrote:
> >>
> >>> Ralph,
> >>>
> >>> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
> >>> it does not invoke pmix_server_release
> >>> because allgather_stub was not previously invoked since the fence
> >>> was not yet entered.
> >>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
> >>>
> >>> the attached patch is likely not the right fix, it was very lightly
> >>> tested, but so far, it works for me ...
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>  Ralph,
> 
>  things got worse indeed :-(
> 
>  now a simple hello world involving two hosts hangs in mpi_init.
>  there is still a race condition : if task a calls fence long after 
>  task b,
>  then task b will never leave the fence
> 
>  i ll try to debug this ...
> 
>  Cheers,
> 
>  Gilles
> 
>  On 2014/09/11 2:36, Ralph Castain wrote:
> I think I now have this fixed - let me know what you see.

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-12 Thread Gilles Gouaillardet
Ralph,

On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain  wrote:

> The design is supposed to be that each node knows precisely how many
> daemons are involved in each collective, and who is going to talk to them.


ok, but the design does not ensure that things will happen in the right
order :
- enter the allgather
- receive data from the daemon at distance 1
- receive data from the daemon at distance 2
- and so on

with the current implementation, when 2 daemons are involved, if a daemon enters
the allgather after it has received data from its peer, then the mpi processes
local to this daemon will hang

with 4 nodes, it gets trickier :
0 enters the allgather and sends a message to 1
1 receives the message and sends it to 2, but with data from 0 only
/* 1 did not enter the allgather, so its own data cannot be sent to 2 */

this issue did not occur before the persistent receive :
no receive was posted if the daemon did not enter the allgather


The signature contains the info required to ensure the receiver knows which
> collective this message relates to, and just happens to also allow them to
> lookup the number of daemons involved (the base function takes care of that
> for them).
>
>
ok too, this issue was solved with the persistent receive

So there is no need for a "pending" list - if you receive a message about a
> collective you don't yet know about, you just put it on the ongoing
> collective list. You should only receive it if you are going to be involved
> - i.e., you have local procs that are going to participate. So you wait
> until your local procs participate, and then pass your collected bucket
> along.
>
> ok, i did something similar
(e.g. pass all the available data)
some data might be passed twice, but that might not be an issue


> I suspect the link to the local procs isn't being correctly dealt with,
> else you couldn't be hanging. Or the rcd isn't correctly passing incoming
> messages to the base functions to register the collective.
>
> I'll look at it over the weekend and can resolve it then.
>
>
 the attached patch is an illustration of what i was trying to explain.
coll->nreported is used by rcd as a bitmask of the received messages
(bit 0 is for the local daemon, bit n for the daemon at distance n)
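
to illustrate the idea with a standalone sketch (this is not the orte code, every name below is made up) : tracking each contribution as one bit lets the completion test ignore the order in which the local fence entry and the distance-d messages show up.

#include <stdint.h>
#include <stdio.h>

/* hypothetical sketch only -- not the ORTE data structures
 * bit 0       : this daemon has entered the allgather itself
 * bit d (d>0) : the message from the daemon at distance d has arrived */
typedef struct {
    uint64_t reported;   /* bitmask of contributions seen so far */
    int      ndmns;      /* number of daemons (assumed a power of two here) */
} coll_state_t;

static void mark_local(coll_state_t *c)           { c->reported |= 1ull; }
static void mark_distance(coll_state_t *c, int d) { c->reported |= (1ull << d); }

/* complete once the local entry and every round's message (distance 1, 2, 4, ...)
 * have been recorded, regardless of arrival order */
static int is_complete(const coll_state_t *c)
{
    uint64_t expected = 1ull;
    for (int dist = 1; dist < c->ndmns; dist <<= 1) {
        expected |= (1ull << dist);
    }
    return (c->reported & expected) == expected;
}

int main(void)
{
    coll_state_t c = { 0, 4 };                   /* 4 daemons -> distances 1 and 2 */
    mark_distance(&c, 1);                        /* peer data can arrive before we enter */
    printf("complete? %d\n", is_complete(&c));   /* prints 0 : still waiting */
    mark_local(&c);
    mark_distance(&c, 2);
    printf("complete? %d\n", is_complete(&c));   /* prints 1 */
    return 0;
}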

i was still debugging a race condition :
if daemons 2 and 3 enter the allgather at the same time, they will send a
message to each other at the same time and rml fails to establish the
connection.  i could not figure out whether this is linked to my changes...

Cheers,

Gilles

>
> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
> > Ralph,
> >
> > you are right, this was definitely not the right fix (at least with 4
> > nodes or more)
> >
> > i finally understood what is going wrong here :
> > to make it simple, the allgather recursive doubling algo is not
> > implemented with
> > MPI_Recv(...,peer,...) like functions but with
> > MPI_Recv(...,MPI_ANY_SOURCE,...) like functions
> > and that makes things slightly more complicated :
> > right now :
> > - with two nodes : if node 1 is late, it gets stuck in the allgather
> > - with four nodes : if node 0 is first, then nodes 2 and 3 follow while node 1
> > is still late, then node 0
> > will likely leave the allgather even though it did not receive anything
> > from node 1
> > - and so on
> >
> > i think i can fix that now
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2014/09/11 23:47, Ralph Castain wrote:
> >> Yeah, that's not the right fix, I'm afraid. I've made the direct
> component the default again until I have time to dig into this deeper.
> >>
> >> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
> >>
> >>> Ralph,
> >>>
> >>> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
> >>> it does not invoke pmix_server_release
> >>> because allgather_stub was not previously invoked since the fence
> >>> was not yet entered.
> >>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
> >>>
> >>> the attached patch is likely not the right fix, it was very lightly
> >>> tested, but so far, it works for me ...
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>  Ralph,
> 
>  things got worse indeed :-(
> 
>  now a simple hello world involving two hosts hangs in mpi_init.
>  there is still a race condition : if task a calls fence long after
> task b,
>  then task b will never leave the fence
> 
>  i ll try to debug this ...
> 
>  Cheers,
> 
>  Gilles
> 
>  On 2014/09/11 2:36, Ralph Castain wrote:
> > I think I now have this fixed - let me know what you see.
> >
> >
> > On Sep 9, 2014, at 6:15 AM, Ralph Castain  wrote:
> >
> >> Yeah, that's not the correct fix. The right way to fix it is for
> all three components to have their own RML tag, and for each of them to
> establish a persistent receive. They then can use the signature to tell which collective the incoming message belongs to.

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Ralph Castain
The design is supposed to be that each node knows precisely how many daemons 
are involved in each collective, and who is going to talk to them. The 
signature contains the info required to ensure the receiver knows which 
collective this message relates to, and just happens to also allow them to 
lookup the number of daemons involved (the base function takes care of that for 
them).

So there is no need for a "pending" list - if you receive a message about a 
collective you don't yet know about, you just put it on the ongoing collective 
list. You should only receive it if you are going to be involved - i.e., you 
have local procs that are going to participate. So you wait until your local 
procs participate, and then pass your collected bucket along.
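
As a minimal sketch of that receive-side pattern (hypothetical types and names only - the real code lives in the grpcomm base and uses opal_list_t and the actual signature type):

#include <stdlib.h>
#include <string.h>

/* stand-ins for the ORTE types, for illustration only */
typedef struct sig  { int jobid; int seq; } sig_t;
typedef struct coll {
    sig_t        sig;        /* identifies the collective */
    int          nreported;  /* contributions collected so far */
    struct coll *next;
} coll_t;
typedef struct { coll_t *head; } coll_list_t;

/* look the collective up by signature; if this is the first we hear of it,
 * create it and put it on the ongoing list -- no separate "pending" list,
 * the entry simply waits until the local procs participate */
static coll_t *get_tracker(coll_list_t *ongoing, const sig_t *sig)
{
    for (coll_t *c = ongoing->head; NULL != c; c = c->next) {
        if (0 == memcmp(&c->sig, sig, sizeof(*sig))) {
            return c;                    /* already known */
        }
    }
    coll_t *c = calloc(1, sizeof(*c));   /* first message for this collective */
    c->sig  = *sig;
    c->next = ongoing->head;
    ongoing->head = c;
    return c;
}

int main(void)
{
    coll_list_t ongoing = { NULL };
    sig_t s = { 1, 42 };
    get_tracker(&ongoing, &s)->nreported++;   /* remote data arrives first: tracker is created */
    get_tracker(&ongoing, &s)->nreported++;   /* local procs enter later: same tracker is found */
    return 0;
}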

I suspect the link to the local procs isn't being correctly dealt with, else 
you couldn't be hanging. Or the rcd isn't correctly passing incoming messages 
to the base functions to register the collective.

I'll look at it over the weekend and can resolve it then.


On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet 
 wrote:

> Ralph,
> 
> you are right, this was definitely not the right fix (at least with 4
> nodes or more)
> 
> i finally understood what is going wrong here :
> to make it simple, the allgather recursive doubling algo is not
> implemented with
> MPI_Recv(...,peer,...) like functions but with
> MPI_Recv(...,MPI_ANY_SOURCE,...) like functions
> and that makes things slightly more complicated :
> right now :
> - with two nodes : if node 1 is late, it gets stuck in the allgather
> - with four nodes : if node 0 is first, then nodes 2 and 3 follow while node 1
> is still late, then node 0
> will likely leave the allgather even though it did not receive anything
> from node 1
> - and so on
> 
> i think i can fix that now
> 
> Cheers,
> 
> Gilles
> 
> On 2014/09/11 23:47, Ralph Castain wrote:
>> Yeah, that's not the right fix, I'm afraid. I've made the direct component 
>> the default again until I have time to dig into this deeper.
>> 
>> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet 
>>  wrote:
>> 
>>> Ralph,
>>> 
>>> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
>>> it does not invoke pmix_server_release
>>> because allgather_stub was not previously invoked since the fence
>>> was not yet entered.
>>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>>> 
>>> the attached patch is likely not the right fix, it was very lightly
>>> tested, but so far, it works for me ...
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
 Ralph,
 
 things got worse indeed :-(
 
 now a simple hello world involving two hosts hangs in mpi_init.
 there is still a race condition : if task a calls fence long after task 
 b,
 then task b will never leave the fence
 
 i ll try to debug this ...
 
 Cheers,
 
 Gilles
 
 On 2014/09/11 2:36, Ralph Castain wrote:
> I think I now have this fixed - let me know what you see.
> 
> 
> On Sep 9, 2014, at 6:15 AM, Ralph Castain  wrote:
> 
>> Yeah, that's not the correct fix. The right way to fix it is for all 
>> three components to have their own RML tag, and for each of them to 
>> establish a persistent receive. They then can use the signature to tell 
>> which collective the incoming message belongs to.
>> 
>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>> 
>> 
>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet 
>>  wrote:
>> 
>>> Folks,
>>> 
>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>> the attached spawn.c test program is a trimmed version of the
>>> spawn_with_env_vars.c test case
>>> from the ibm test suite.
>>> 
>>> when invoked on two nodes :
>>> - the program hangs with -np 2
>>> - the program can crash with np > 2
>>> error message is
>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>> AND TAG -33 - ABORTING
>>> 
>>> here is my full command line (from node0) :
>>> 
>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>>> coll ^ml ./spawn
>>> 
>>> a simple workaround is to add the following extra parameter to the
>>> mpirun command line :
>>> --mca grpcomm_rcd_priority 0
>>> 
>>> my understanding is that the race condition occurs when all the
>>> processes call MPI_Finalize()
>>> internally, the pmix module will have mpirun/orted issue two ALLGATHER
>>> involving mpirun and orted
>>> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks)
>>> the error message is very explicit : this is not (currently) supported
>>> 
>>> i wrote the attached rml.patch which is really a workaround and not a 

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Gilles Gouaillardet
Ralph,

you are right, this was definitely not the right fix (at least with 4
nodes or more)

i finally understood what is going wrong here :
to make it simple, the allgather recursive doubling algo is not
implemented with
MPI_Recv(...,peer,...) like functions but with
MPI_Recv(...,MPI_ANY_SOURCE,...) like functions
and that makes things slightly more complicated :
right now :
- with two nodes : if node 1 is late, it gets stuck in the allgather
- with four nodes : if node 0 is first, then nodes 2 and 3 follow while node 1
is still late, then node 0
will likely leave the allgather even though it did not receive anything
from node 1
- and so on
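
for reference, the exchange schedule of a recursive doubling allgather looks like the sketch below (generic code, not the rcd component) : in the round at distance d, each daemon swaps its accumulated bucket with the peer at distance d, so with MPI_ANY_SOURCE-style receives and a plain message counter a daemon can be satisfied by a later-round message while an earlier-round one is still missing.

#include <stdio.h>

/* print the recursive doubling exchange schedule for n daemons
 * (n a power of two): in the round at distance d, daemon r exchanges
 * its accumulated data with daemon r XOR d */
int main(void)
{
    const int n = 4;
    for (int dist = 1; dist < n; dist <<= 1) {     /* distance 1, 2, 4, ... */
        printf("round with distance %d:\n", dist);
        for (int r = 0; r < n; r++) {
            printf("  daemon %d <-> daemon %d\n", r, r ^ dist);
        }
    }
    return 0;
}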

i think i can fix that now

Cheers,

Gilles

On 2014/09/11 23:47, Ralph Castain wrote:
> Yeah, that's not the right fix, I'm afraid. I've made the direct component 
> the default again until I have time to dig into this deeper.
>
> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet 
>  wrote:
>
>> Ralph,
>>
>> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
>> it does not invoke pmix_server_release
>> because allgather_stub was not previously invoked since the fence
>> was not yet entered.
>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>>
>> the attached patch is likely not the right fix, it was very lightly
>> tested, but so far, it works for me ...
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>>> Ralph,
>>>
>>> things got worse indeed :-(
>>>
>>> now a simple hello world involving two hosts hangs in mpi_init.
>>> there is still a race condition : if task a calls fence long after task b,
>>> then task b will never leave the fence
>>>
>>> i ll try to debug this ...
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/09/11 2:36, Ralph Castain wrote:
 I think I now have this fixed - let me know what you see.


 On Sep 9, 2014, at 6:15 AM, Ralph Castain  wrote:

> Yeah, that's not the correct fix. The right way to fix it is for all 
> three components to have their own RML tag, and for each of them to 
> establish a persistent receive. They then can use the signature to tell 
> which collective the incoming message belongs to.
>
> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>
>
> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet 
>  wrote:
>
>> Folks,
>>
>> Since r32672 (trunk), grpcomm/rcd is the default module.
>> the attached spawn.c test program is a trimmed version of the
>> spawn_with_env_vars.c test case
>> from the ibm test suite.
>>
>> when invoked on two nodes :
>> - the program hangs with -np 2
>> - the program can crash with np > 2
>> error message is
>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>> AND TAG -33 - ABORTING
>>
>> here is my full command line (from node0) :
>>
>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>> coll ^ml ./spawn
>>
>> a simple workaround is to add the following extra parameter to the
>> mpirun command line :
>> --mca grpcomm_rcd_priority 0
>>
>> my understanding is that the race condition occurs when all the
>> processes call MPI_Finalize()
>> internally, the pmix module will have mpirun/orted issue two ALLGATHER
>> involving mpirun and orted
>> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks)
>> the error message is very explicit : this is not (currently) supported
>>
>> i wrote the attached rml.patch which is really a workaround and not a 
>> fix :
>> in this case, each job will invoke an ALLGATHER but with a different tag
>> /* that works for a limited number of jobs only */
>>
>> i did not commit this patch since this is not a fix, could someone
>> (Ralph ?) please review the issue and comment ?
>>
>>
>> Cheers,
>>
>> Gilles
>>

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Ralph Castain
Yeah, that's not the right fix, I'm afraid. I've made the direct component the 
default again until I have time to dig into this deeper.

On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet 
 wrote:

> Ralph,
> 
> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
> it does not invoke pmix_server_release
> because allgather_stub was not previously invoked since the fence
> was not yet entered.
> /* in rcd_finalize_coll, coll->cbfunc is NULL */
> 
> the attached patch is likely not the right fix, it was very lightly
> tested, but so far, it works for me ...
> 
> Cheers,
> 
> Gilles
> 
> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>> Ralph,
>> 
>> things got worse indeed :-(
>> 
>> now a simple hello world involving two hosts hangs in mpi_init.
>> there is still a race condition : if task a calls fence long after task b,
>> then task b will never leave the fence
>> 
>> i ll try to debug this ...
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/09/11 2:36, Ralph Castain wrote:
>>> I think I now have this fixed - let me know what you see.
>>> 
>>> 
>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain  wrote:
>>> 
 Yeah, that's not the correct fix. The right way to fix it is for all three 
 components to have their own RML tag, and for each of them to establish a 
 persistent receive. They then can use the signature to tell which 
 collective the incoming message belongs to.
 
 I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
 
 
 On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet 
  wrote:
 
> Folks,
> 
> Since r32672 (trunk), grpcomm/rcd is the default module.
> the attached spawn.c test program is a trimmed version of the
> spawn_with_env_vars.c test case
> from the ibm test suite.
> 
> when invoked on two nodes :
> - the program hangs with -np 2
> - the program can crash with np > 2
> error message is
> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
> AND TAG -33 - ABORTING
> 
> here is my full command line (from node0) :
> 
> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
> coll ^ml ./spawn
> 
> a simple workaround is to add the following extra parameter to the
> mpirun command line :
> --mca grpcomm_rcd_priority 0
> 
> my understanding is that the race condition occurs when all the
> processes call MPI_Finalize()
> internally, the pmix module will have mpirun/orted issue two ALLGATHER
> involving mpirun and orted
> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks)
> the error message is very explicit : this is not (currently) supported
> 
> i wrote the attached rml.patch which is really a workaround and not a fix 
> :
> in this case, each job will invoke an ALLGATHER but with a different tag
> /* that works for a limited number of jobs only */
> 
> i did not commit this patch since this is not a fix, could someone
> (Ralph ?) please review the issue and comment ?
> 
> 
> Cheers,
> 
> Gilles
> 



Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Gilles Gouaillardet
Ralph,

the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
it does not invoke pmix_server_release
because allgather_stub was not previously invoked since the fence
was not yet entered.
/* in rcd_finalize_coll, coll->cbfunc is NULL */
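
put differently, the finalize path presumably amounts to the guard sketched below (hypothetical names except coll->cbfunc) : if the remote contribution completes the collective before the local fence has installed the callback, nothing ever releases the pmix server and the local mpi processes wait forever.

/* sketch of the failure mode, not the actual rcd source */
typedef void (*release_fn_t)(void *cbdata);

typedef struct {
    int          nreported;   /* contributions collected so far */
    int          nexpected;   /* contributions required */
    release_fn_t cbfunc;      /* set by allgather_stub once the local fence is entered */
    void        *cbdata;
} coll_t;

void finalize_coll(coll_t *coll)
{
    if (coll->nreported < coll->nexpected) {
        return;                      /* still waiting for data */
    }
    if (NULL != coll->cbfunc) {      /* only release if the local side already
                                      * entered the fence and set the callback */
        coll->cbfunc(coll->cbdata);
    }
    /* else: the collective is complete but nobody is notified -> hang */
}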

the attached patch is likely not the right fix, it was very lightly
tested, but so far, it works for me ...

Cheers,

Gilles

On 2014/09/11 16:11, Gilles Gouaillardet wrote:
> Ralph,
>
> things got worse indeed :-(
>
> now a simple hello world involving two hosts hangs in mpi_init.
> there is still a race condition : if task a calls fence long after task b,
> then task b will never leave the fence
>
> i ll try to debug this ...
>
> Cheers,
>
> Gilles
>
> On 2014/09/11 2:36, Ralph Castain wrote:
>> I think I now have this fixed - let me know what you see.
>>
>>
>> On Sep 9, 2014, at 6:15 AM, Ralph Castain  wrote:
>>
>>> Yeah, that's not the correct fix. The right way to fix it is for all three 
>>> components to have their own RML tag, and for each of them to establish a 
>>> persistent receive. They then can use the signature to tell which 
>>> collective the incoming message belongs to.
>>>
>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>>>
>>>
>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet 
>>>  wrote:
>>>
 Folks,

 Since r32672 (trunk), grpcomm/rcd is the default module.
 the attached spawn.c test program is a trimmed version of the
 spawn_with_env_vars.c test case
 from the ibm test suite.

 when invoked on two nodes :
 - the program hangs with -np 2
 - the program can crash with np > 2
 error message is
 [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
 AND TAG -33 - ABORTING

 here is my full command line (from node0) :

 mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
 coll ^ml ./spawn

 a simple workaround is to add the following extra parameter to the
 mpirun command line :
 --mca grpcomm_rcd_priority 0

 my understanding is that the race condition occurs when all the
 processes call MPI_Finalize()
 internally, the pmix module will have mpirun/orted issue two ALLGATHER
 involving mpirun and orted
 (one for job 1, aka the parent, and one for job 2, aka the spawned tasks)
 the error message is very explicit : this is not (currently) supported

 i wrote the attached rml.patch which is really a workaround and not a fix :
 in this case, each job will invoke an ALLGATHER but with a different tag
 /* that works for a limited number of jobs only */

 i did not commit this patch since this is not a fix, could someone
 (Ralph ?) please review the issue and comment ?


 Cheers,

 Gilles


Index: orte/mca/grpcomm/rcd/grpcomm_rcd.c
===
--- orte/mca/grpcomm/rcd/grpcomm_rcd.c  (revision 32706)
+++ orte/mca/grpcomm/rcd/grpcomm_rcd.c  (working copy)
@@ -6,6 +6,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC. All
  * rights reserved.
  * Copyright (c) 2014  Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014  Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -85,6 +87,9 @@
 static int allgather(orte_grpcomm_coll_t *coll,
  opal_buffer_t *sendbuf)
 {
+orte_grpcomm_base_pending_coll_t *pc;
+bool pending = false;
+
 OPAL_OUTPUT_VERBOSE((5, orte_grpcomm_base_framework.framework_output,
                      "%s grpcomm:coll:recdub algo employed for %d processes",
                      ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (int)coll->ndmns));
@@ -106,6 +111,33 @@
  */
 rcd_allgather_send_dist(coll, 1);

+OPAL_LIST_FOREACH(pc, &orte_grpcomm_base.pending, orte_grpcomm_base_pending_coll_t) {
+if (NULL == coll->sig->signature) {
+if (NULL == pc->coll->sig->signature) {
+/* only one collective can operate 

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Gilles Gouaillardet
Ralph,

things got worse indeed :-(

now a simple hello world involving two hosts hangs in mpi_init.
there is still a race condition : if task a calls fence long after task b,
then task b will never leave the fence

i ll try to debug this ...

Cheers,

Gilles

On 2014/09/11 2:36, Ralph Castain wrote:
> I think I now have this fixed - let me know what you see.
>
>
> On Sep 9, 2014, at 6:15 AM, Ralph Castain  wrote:
>
>> Yeah, that's not the correct fix. The right way to fix it is for all three 
>> components to have their own RML tag, and for each of them to establish a 
>> persistent receive. They then can use the signature to tell which collective 
>> the incoming message belongs to.
>>
>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>>
>>
>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet 
>>  wrote:
>>
>>> Folks,
>>>
>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>> the attached spawn.c test program is a trimmed version of the
>>> spawn_with_env_vars.c test case
>>> from the ibm test suite.
>>>
>>> when invoked on two nodes :
>>> - the program hangs with -np 2
>>> - the program can crash with np > 2
>>> error message is
>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>> AND TAG -33 - ABORTING
>>>
>>> here is my full command line (from node0) :
>>>
>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>>> coll ^ml ./spawn
>>>
>>> a simple workaround is to add the following extra parameter to the
>>> mpirun command line :
>>> --mca grpcomm_rcd_priority 0
>>>
>>> my understanding is that the race condition occurs when all the
>>> processes call MPI_Finalize()
>>> internally, the pmix module will have mpirun/orted issue two ALLGATHER
>>> involving mpirun and orted
>>> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks)
>>> the error message is very explicit : this is not (currently) supported
>>>
>>> i wrote the attached rml.patch which is really a workaround and not a fix :
>>> in this case, each job will invoke an ALLGATHER but with a different tag
>>> /* that works for a limited number of jobs only */
>>>
>>> i did not commit this patch since this is not a fix, could someone
>>> (Ralph ?) please review the issue and comment ?
>>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>>



Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-10 Thread Ralph Castain
I think I now have this fixed - let me know what you see.


On Sep 9, 2014, at 6:15 AM, Ralph Castain  wrote:

> Yeah, that's not the correct fix. The right way to fix it is for all three 
> components to have their own RML tag, and for each of them to establish a 
> persistent receive. They then can use the signature to tell which collective 
> the incoming message belongs to.
> 
> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
> 
> 
> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet 
>  wrote:
> 
>> Folks,
>> 
>> Since r32672 (trunk), grpcomm/rcd is the default module.
>> the attached spawn.c test program is a trimmed version of the
>> spawn_with_env_vars.c test case
>> from the ibm test suite.
>> 
>> when invoked on two nodes :
>> - the program hangs with -np 2
>> - the program can crash with np > 2
>> error message is
>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>> AND TAG -33 - ABORTING
>> 
>> here is my full command line (from node0) :
>> 
>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>> coll ^ml ./spawn
>> 
>> a simple workaround is to add the following extra parameter to the
>> mpirun command line :
>> --mca grpcomm_rcd_priority 0
>> 
>> my understanding is that the race condition occurs when all the
>> processes call MPI_Finalize()
>> internally, the pmix module will have mpirun/orted issue two ALLGATHER
>> involving mpirun and orted
>> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks)
>> the error message is very explicit : this is not (currently) supported
>> 
>> i wrote the attached rml.patch which is really a workaround and not a fix :
>> in this case, each job will invoke an ALLGATHER but with a different tag
>> /* that works for a limited number of jobs only */
>> 
>> i did not commit this patch since this is not a fix, could someone
>> (Ralph ?) please review the issue and comment ?
>> 
>> 
>> Cheers,
>> 
>> Gilles
>> 



Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-09 Thread Ralph Castain
Yeah, that's not the correct fix. The right way to fix it is for all three 
components to have their own RML tag, and for each of them to establish a 
persistent receive. They then can use the signature to tell which collective 
the incoming message belongs to.
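
Roughly, the shape of that fix for one component looks like the sketch below (the dedicated tag value and the callback body are placeholders, not the actual commit; orte_rml.recv_buffer_nb with ORTE_RML_PERSISTENT is the usual way to post a persistent RML receive):

#include "orte/mca/rml/rml.h"
#include "orte/runtime/orte_globals.h"

/* placeholder for a per-component tag; the real fix would add a dedicated
 * ORTE_RML_TAG_* value for each grpcomm component */
#define MY_RCD_ALLGATHER_TAG ((orte_rml_tag_t)(ORTE_RML_TAG_MAX - 1))

static void rcd_allgather_recv(int status, orte_process_name_t *sender,
                               opal_buffer_t *buffer, orte_rml_tag_t tag,
                               void *cbdata)
{
    /* unpack the signature from the buffer, look up (or create) the matching
     * ongoing collective in the grpcomm base, record the sender's contribution,
     * and forward the accumulated bucket when this round is complete */
}

static int rcd_start_recv(void)
{
    /* one persistent receive per component, posted once at init */
    orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, MY_RCD_ALLGATHER_TAG,
                            ORTE_RML_PERSISTENT, rcd_allgather_recv, NULL);
    return 0;   /* ORTE_SUCCESS in real code */
}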

I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.


On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet  
wrote:

> Folks,
> 
> Since r32672 (trunk), grpcomm/rcd is the default module.
> the attached spawn.c test program is a trimmed version of the
> spawn_with_env_vars.c test case
> from the ibm test suite.
> 
> when invoked on two nodes :
> - the program hangs with -np 2
> - the program can crash with np > 2
> error message is
> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
> AND TAG -33 - ABORTING
> 
> here is my full command line (from node0) :
> 
> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
> coll ^ml ./spawn
> 
> a simple workaround is to add the following extra parameter to the
> mpirun command line :
> --mca grpcomm_rcd_priority 0
> 
> my understanding is that the race condition occurs when all the
> processes call MPI_Finalize()
> internally, the pmix module will have mpirun/orted issue two ALLGATHER
> involving mpirun and orted
> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks)
> the error message is very explicit : this is not (currently) supported
> 
> i wrote the attached rml.patch which is really a workaround and not a fix :
> in this case, each job will invoke an ALLGATHER but with a different tag
> /* that works for a limited number of jobs only */
> 
> i did not commit this patch since this is not a fix, could someone
> (Ralph ?) please review the issue and comment ?
> 
> 
> Cheers,
> 
> Gilles
> 



[OMPI devel] race condition in grpcomm/rcd

2014-09-09 Thread Gilles Gouaillardet
Folks,

Since r32672 (trunk), grpcomm/rcd is the default module.
the attached spawn.c test program is a trimmed version of the
spawn_with_env_vars.c test case
from the ibm test suite.

when invoked on two nodes :
- the program hangs with -np 2
- the program can crash with np > 2
error message is
[node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
AND TAG -33 - ABORTING

here is my full command line (from node0) :

mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
coll ^ml ./spawn

a simple workaround is to add the following extra parameter to the
mpirun command line :
--mca grpcomm_rcd_priority 0

my understanding is that the race condition occurs when all the
processes call MPI_Finalize()
internally, the pmix module will have mpirun/orted issue two ALLGATHER
involving mpirun and orted
(one for job 1, aka the parent, and one for job 2, aka the spawned tasks)
the error message is very explicit : this is not (currently) supported

i wrote the attached rml.patch which is really a workaround and not a fix :
in this case, each job will invoke an ALLGATHER but with a different tag
/* that works for a limited number of jobs only */

i did not commit this patch since this is not a fix, could someone
(Ralph ?) please review the issue and comment ?


Cheers,

Gilles

/*
 * $HEADER$
 *
 * Program to test MPI_Comm_spawn with environment variables.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include "mpi.h"

static void do_parent(char *cmd, int rank, int count)
{
    int *errcode, err;
    int i;
    MPI_Comm child_inter;
    MPI_Comm intra;
    FILE *fp;
    int found;
    int size;

    /* First, see if cmd exists on all ranks */

    fp = fopen(cmd, "r");
    if (NULL == fp) {
        found = 0;
    } else {
        fclose(fp);
        found = 1;
    }
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Allreduce(&found, &count, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (count != size) {
        if (rank == 0) {
            MPI_Abort(MPI_COMM_WORLD, 77);
        }
        return;
    }

    /* Now try the spawn if it's found anywhere */

    errcode = malloc(sizeof(int) * count);
    if (NULL == errcode) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    memset(errcode, -1, sizeof(int) * count);
    MPI_Comm_spawn(cmd, MPI_ARGV_NULL, count, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &child_inter, errcode);

    /* Clean up */
    MPI_Barrier(child_inter);

    MPI_Comm_disconnect(&child_inter);
    free(errcode);
}


static void do_target(MPI_Comm parent)
{
    MPI_Barrier(parent);
    MPI_Comm_disconnect(&parent);
}


int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Comm parent;

    /* Ok, we're good.  Proceed with the test. */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Check to see if we *were* spawned -- because this is a test, we
       can only assume the existence of this one executable.  Hence, we
       both mpirun it and spawn it. */

    parent = MPI_COMM_NULL;
    MPI_Comm_get_parent(&parent);
    if (parent != MPI_COMM_NULL) {
        do_target(parent);
    } else {
        do_parent(argv[0], rank, size);
    }

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (0 < rank) sleep(3);

    MPI_Finalize();

    /* All done */

    return 0;
}
Index: orte/mca/grpcomm/brks/grpcomm_brks.c
===
--- orte/mca/grpcomm/brks/grpcomm_brks.c(revision 32688)
+++ orte/mca/grpcomm/brks/grpcomm_brks.c(working copy)
@@ -6,6 +6,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC. All
  * rights reserved.
  * Copyright (c) 2014  Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014  Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -111,6 +113,7 @@
 static int brks_allgather_send_dist(orte_grpcomm_coll_t *coll, orte_vpid_t distance) {
 orte_process_name_t peer_send, peer_recv;
 opal_buffer_t *send_buf;
+orte_rml_tag_t tag;
 int rc;

 peer_send.jobid = ORTE_PROC_MY_NAME->jobid;
@@ -174,8 +177,14 @@
  ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
 ORTE_NAME_PRINT(&peer_send)));

+if (1 != coll->sig->sz || ORTE_VPID_WILDCARD != coll->sig->signature[0].vpid) {
+tag = ORTE_RML_TAG_ALLGATHER;
+} else {
+tag = ORTE_RML_TAG_JOB_ALLGATHER + ORTE_LOCAL_JOBID(coll->sig->signature[0].jobid) % (ORTE_RML_TAG_MAX-ORTE_RML_TAG_JOB_ALLGATHER);
+}
+
 if (0 > (rc = orte_rml.send_buffer_nb(&peer_send, send_buf,
-  -ORTE_RML_TAG_ALLGATHER,
+  -tag,
   orte_rml_send_callback, NULL))) {
 ORTE_ERROR_LOG(rc);
 OBJ_RELEASE(send_buf);
@@ -189,7 +198,7 @@

 /* setup recv for distance data