Yeah, that's not the right fix, I'm afraid. I've made the direct component the default again until I have time to dig into this deeper.
On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Ralph,
>
> The root cause is that when the second orted/mpirun runs rcd_finalize_coll,
> it does not invoke pmix_server_release, because allgather_stub was not
> previously invoked: the fence had not yet been entered.
> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>
> The attached patch is likely not the right fix; it was very lightly
> tested, but so far it works for me ...
>
> Cheers,
>
> Gilles
>
> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>> Ralph,
>>
>> Things got worse indeed :-(
>>
>> Now a simple hello world involving two hosts hangs in mpi_init.
>> There is still a race condition: if task a calls the fence long after task b,
>> then task b will never leave the fence.
>>
>> I'll try to debug this ...
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/11 2:36, Ralph Castain wrote:
>>> I think I now have this fixed - let me know what you see.
>>>
>>>
>>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Yeah, that's not the correct fix. The right way to fix it is for all three
>>>> components to have their own RML tag, and for each of them to establish a
>>>> persistent receive. They can then use the signature to tell which
>>>> collective the incoming message belongs to.
>>>>
>>>> I'll fix it, but it won't be until tomorrow I'm afraid, as today is shot.
>>>>
>>>>
>>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>>
>>>>> Folks,
>>>>>
>>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>>>> The attached spawn.c test program is a trimmed version of the
>>>>> spawn_with_env_vars.c test case from the ibm test suite.
>>>>>
>>>>> When invoked on two nodes:
>>>>> - the program hangs with -np 2
>>>>> - the program can crash with -np > 2; the error message is
>>>>>   [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>>>>   AND TAG -33 - ABORTING
>>>>>
>>>>> Here is my full command line (from node0):
>>>>>
>>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>>>>> coll ^ml ./spawn
>>>>>
>>>>> A simple workaround is to add the following extra parameter to the
>>>>> mpirun command line:
>>>>> --mca grpcomm_rcd_priority 0
>>>>>
>>>>> My understanding is that the race condition occurs when all the
>>>>> processes call MPI_Finalize().
>>>>> Internally, the pmix module will have mpirun/orted issue two ALLGATHERs
>>>>> involving mpirun and orted
>>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks).
>>>>> The error message is very explicit: this is not (currently) supported.
>>>>>
>>>>> I wrote the attached rml.patch, which is really a workaround and not a fix:
>>>>> in this case, each job will invoke an ALLGATHER, but with a different tag.
>>>>> /* that works for a limited number of jobs only */
>>>>>
>>>>> I did not commit this patch since this is not a fix. Could someone
>>>>> (Ralph?) please review the issue and comment?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> <spawn.c><rml.patch>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15780.php
>
> <rml2.patch>