Ralph,

things got worse indeed :-(

now a simple hello world involving two hosts hangs in MPI_Init().
there is still a race condition: if task A calls fence long after task B,
then task B will never leave the fence.

I'll try to debug this ...

Cheers,

Gilles

On 2014/09/11 2:36, Ralph Castain wrote:
> I think I now have this fixed - let me know what you see.
>
>
> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Yeah, that's not the correct fix. The right way to fix it is for all three 
>> components to have their own RML tag, and for each of them to establish a 
>> persistent receive. They then can use the signature to tell which collective 
>> the incoming message belongs to.
>>
>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>>
>>
>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet 
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>>> Folks,
>>>
>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>> the attached spawn.c test program is a trimmed version of the
>>> spawn_with_env_vars.c test case
>>> from the ibm test suite.
>>>
>>> when invoked on two nodes :
>>> - the program hangs with -np 2
>>> - the program can crash with -np > 2
>>> the error message is
>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>> AND TAG -33 - ABORTING
>>>
>>> here is my full command line (from node0) :
>>>
>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>>> coll ^ml ./spawn
>>>
>>> a simple workaround is to add the following extra parameter to the
>>> mpirun command line :
>>> --mca grpcomm_rcd_priority 0
>>>
>>> my understanding is that the race condition occurs when all the
>>> processes call MPI_Finalize() :
>>> internally, the pmix module has mpirun/orted issue two ALLGATHERs
>>> involving mpirun and orted
>>> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks).
>>> the error message is very explicit : this is not (currently) supported
>>>
>>> i wrote the attached rml.patch which is really a workaround and not a fix :
>>> in this case, each job will invoke an ALLGATHER but with a different tag
>>> /* that works for a limited number of jobs only */
>>>
>>> i did not commit this patch since this is not a fix, could someone
>>> (Ralph ?) please review the issue and comment ?
>>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> <spawn.c><rml.patch>_______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/09/15780.php
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15794.php
