Yeah, that's not the right fix, I'm afraid. I've made the direct component the 
default again until I have time to dig into this deeper.

On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet 
<gilles.gouaillar...@iferc.org> wrote:

> Ralph,
> 
> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
> it does not invoke pmix_server_release,
> because allgather_stub was not previously invoked, since the fence
> had not yet been entered.
> /* in rcd_finalize_coll, coll->cbfunc is NULL */
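>
> to make the failure mode concrete, here is a self-contained sketch
> (the struct and all names below are simplified/hypothetical, not the
> actual orte code):
>
> #include <stdio.h>
>
> typedef void (*release_cb_t)(void);
>
> typedef struct {
>     /* set by allgather_stub when the local fence is entered;
>      * NULL until then */
>     release_cb_t cbfunc;
> } coll_t;
>
> static void pmix_server_release_model(void)
> {
>     printf("fence released\n");
> }
>
> /* models rcd_finalize_coll */
> static void finalize_coll(coll_t *coll)
> {
>     if (NULL != coll->cbfunc) {
>         coll->cbfunc();   /* normal path */
>     } else {
>         /* the collective completed before this daemon entered the
>          * fence: no callback was registered, so nobody is released
>          * and the peer blocks forever */
>         fprintf(stderr, "completion dropped\n");
>     }
> }
>
> int main(void)
> {
>     coll_t early = { .cbfunc = NULL };  /* fence not yet entered */
>     coll_t ready = { .cbfunc = pmix_server_release_model };
>     finalize_coll(&early);   /* models the hang */
>     finalize_coll(&ready);   /* normal release */
>     return 0;
> }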
> 
> the attached patch is likely not the right fix; it was only lightly
> tested, but so far it works for me ...
> 
> Cheers,
> 
> Gilles
> 
> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>> Ralph,
>> 
>> things got worse indeed :-(
>> 
>> now a simple hello world involving two hosts hangs in MPI_Init.
>> there is still a race condition: if task a calls the fence long after task b,
>> then task b will never leave the fence
>> 
>> I'll try to debug this ...
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/09/11 2:36, Ralph Castain wrote:
>>> I think I now have this fixed - let me know what you see.
>>> 
>>> 
>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>>> Yeah, that's not the correct fix. The right way to fix it is for all three 
>>>> components to have their own RML tag, and for each of them to establish a 
>>>> persistent receive. They then can use the signature to tell which 
>>>> collective the incoming message belongs to.
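>>>>
>>>> Roughly what I have in mind (a hypothetical sketch only, not the
>>>> actual RML/grpcomm code -- tag values, types, and names are made up):
>>>>
>>>> #include <stdio.h>
>>>> #include <string.h>
>>>>
>>>> /* one dedicated RML tag per grpcomm component (illustrative values) */
>>>> enum { TAG_DIRECT = 33, TAG_BRKS = 34, TAG_RCD = 35 };
>>>>
>>>> /* the "signature" identifying a collective: the participating jobs */
>>>> typedef struct {
>>>>     int njobs;
>>>>     int jobids[4];
>>>> } signature_t;
>>>>
>>>> typedef struct coll {
>>>>     signature_t sig;
>>>>     struct coll *next;
>>>> } coll_t;
>>>>
>>>> static coll_t *active;   /* collectives in flight on this component */
>>>>
>>>> /* handler behind a component's persistent receive: match the incoming
>>>>  * message to a collective by signature, so two concurrent allgathers
>>>>  * (e.g. the parent job and the spawned job) no longer clash on the
>>>>  * same (peer, tag) pair */
>>>> static coll_t *match_coll(const signature_t *sig)
>>>> {
>>>>     for (coll_t *c = active; c != NULL; c = c->next) {
>>>>         if (c->sig.njobs == sig->njobs &&
>>>>             0 == memcmp(c->sig.jobids, sig->jobids,
>>>>                         (size_t)sig->njobs * sizeof(int))) {
>>>>             return c;
>>>>         }
>>>>     }
>>>>     return NULL;   /* unknown signature: would create a new tracker */
>>>> }
>>>>
>>>> int main(void)
>>>> {
>>>>     coll_t parent = { .sig = { .njobs = 1, .jobids = { 1 } } };
>>>>     active = &parent;
>>>>     signature_t incoming = { .njobs = 1, .jobids = { 1 } };
>>>>     printf("matched: %s\n", match_coll(&incoming) ? "yes" : "no");
>>>>     return 0;
>>>> }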
>>>> 
>>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>>>> 
>>>> 
>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet 
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>> 
>>>>> Folks,
>>>>> 
>>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>>>> the attached spawn.c test program is a trimmed version of the
>>>>> spawn_with_env_vars.c test case from the IBM test suite.
>>>>> 
>>>>> when invoked on two nodes:
>>>>> - the program hangs with -np 2
>>>>> - the program can crash with -np > 2; the error message is:
>>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>>>> AND TAG -33 - ABORTING
>>>>> 
>>>>> here is my full command line (from node0):
>>>>> 
>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>>>>> coll ^ml ./spawn
>>>>> 
>>>>> a simple workaround is to add the following extra parameter to the
>>>>> mpirun command line:
>>>>> --mca grpcomm_rcd_priority 0
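>>>>>
>>>>> for example, the full command line from above then becomes
>>>>> (presumably this lowers the rcd priority so another grpcomm
>>>>> component is selected instead):
>>>>>
>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>>>>> coll ^ml --mca grpcomm_rcd_priority 0 ./spawn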
>>>>> 
>>>>> my understanding is that the race condition occurs when all the
>>>>> processes call MPI_Finalize():
>>>>> internally, the pmix module will have mpirun/orted issue two ALLGATHERs
>>>>> involving mpirun and the orteds
>>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks).
>>>>> the error message is very explicit: this is not (currently) supported
>>>>> 
>>>>> i wrote the attached rml.patch, which is really a workaround and not a
>>>>> fix: in this case, each job will invoke an ALLGATHER, but with a
>>>>> different tag
>>>>> /* that works for a limited number of jobs only */
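>>>>>
>>>>> as a rough illustration of the idea (hypothetical code, not the actual
>>>>> patch -- the base tag and the job limit are made-up values):
>>>>>
>>>>> #include <stdio.h>
>>>>>
>>>>> #define ALLGATHER_TAG_BASE  33   /* illustrative base tag */
>>>>> #define MAX_INFLIGHT_JOBS   16   /* hence "a limited number of jobs" */
>>>>>
>>>>> /* derive the allgather tag from the job id so concurrent collectives
>>>>>  * from different jobs do not collide on the same (peer, tag) pair */
>>>>> static int allgather_tag_for_job(int jobid)
>>>>> {
>>>>>     /* tags wrap after MAX_INFLIGHT_JOBS, so two jobs can still
>>>>>      * collide -- this is why it is a workaround, not a fix */
>>>>>     return ALLGATHER_TAG_BASE + (jobid % MAX_INFLIGHT_JOBS);
>>>>> }
>>>>>
>>>>> int main(void)
>>>>> {
>>>>>     printf("job 1 -> tag %d\n", allgather_tag_for_job(1));   /* parent */
>>>>>     printf("job 2 -> tag %d\n", allgather_tag_for_job(2));   /* spawned */
>>>>>     return 0;
>>>>> }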
>>>>> 
>>>>> i did not commit this patch since this is not a fix, could someone
>>>>> (Ralph ?) please review the issue and comment ?
>>>>> 
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Gilles
>>>>> 
>>>>> <spawn.c><rml.patch>
> 
> <rml2.patch>
