Yeah, that's not the correct fix. The right way to fix it is for all three components to have their own RML tag, and for each of them to establish a persistent receive. They then can use the signature to tell which collective the incoming message belongs to.
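
To make the idea concrete, here is a rough, self-contained C sketch of the dispatch logic only. It does not call the real RML API. The tag values, type names, and function names are all made up for illustration, and the signature is reduced to a single job id, which is an assumption. The point is just that one long-lived receive per component tag can serve several overlapping collectives if each message is routed by its signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag values, one per component (NOT the real RML tag constants). */
enum { TAG_COMP_A = 101, TAG_COMP_B = 102, TAG_COMP_C = 103 };

#define MAX_ACTIVE 8 /* arbitrary cap on concurrent collectives for this sketch */

/* Assumption: a collective's signature is reduced to a single job id. */
typedef struct {
    int in_use;
    int signature;
    int contributions; /* messages received so far for this collective */
} tracker_t;

/* One component: its own tag plus the collectives it is currently tracking. */
typedef struct {
    int tag;
    tracker_t active[MAX_ACTIVE];
} component_t;

typedef struct {
    int tag;
    int sender;
    int signature;
} message_t;

/* Find the tracker for a signature; optionally create it in a free slot. */
static tracker_t *lookup(component_t *c, int signature, int create)
{
    tracker_t *unused = NULL;
    for (size_t i = 0; i < MAX_ACTIVE; i++) {
        tracker_t *t = &c->active[i];
        if (t->in_use && t->signature == signature) {
            return t;
        }
        if (!t->in_use && unused == NULL) {
            unused = t;
        }
    }
    if (!create || unused == NULL) {
        return NULL;
    }
    unused->in_use = 1;
    unused->signature = signature;
    unused->contributions = 0;
    return unused;
}

/* Stand-in for the callback of a receive that is posted once per component
 * and stays posted. Returns 0 on success, -1 if the tag belongs to another
 * component or the tracker table is full. */
static int on_message(component_t *c, const message_t *m)
{
    if (m->tag != c->tag) {
        return -1;
    }
    tracker_t *t = lookup(c, m->signature, 1);
    if (t == NULL) {
        return -1;
    }
    t->contributions++;
    return 0;
}

/* Number of contributions recorded for a signature (0 if unknown). */
static int contributions(component_t *c, int signature)
{
    tracker_t *t = lookup(c, signature, 0);
    return t ? t->contributions : 0;
}

/* Demo: peer 1 contributes to two overlapping allgathers (jobs 1 and 2)
 * on the same tag; a message carrying another component's tag is ignored. */
static int demo_overlapping_jobs(void)
{
    component_t comp = { .tag = TAG_COMP_C };
    const message_t inbox[] = {
        { TAG_COMP_C, 1, 1 },
        { TAG_COMP_C, 1, 2 },
        { TAG_COMP_A, 1, 1 },
    };
    int accepted = 0;
    for (size_t i = 0; i < sizeof inbox / sizeof inbox[0]; i++) {
        if (on_message(&comp, &inbox[i]) == 0) {
            accepted++;
        }
    }
    assert(contributions(&comp, 1) == 1);
    assert(contributions(&comp, 2) == 1);
    return accepted;
}
```

In this sketch the same peer and tag show up twice, once per job, and nothing ever needs to post a second receive: the two messages simply land in different trackers. How the real signature is compared and how a tracker is released when the collective completes are left out.
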
I'll fix it, but I'm afraid it won't be until tomorrow, as today is shot.

On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Folks,
>
> Since r32672 (trunk), grpcomm/rcd is the default module.
> The attached spawn.c test program is a trimmed version of the
> spawn_with_env_vars.c test case from the ibm test suite.
>
> When invoked on two nodes:
> - the program hangs with -np 2
> - the program can crash with np > 2
>
> The error message is:
> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1] AND TAG -33 - ABORTING
>
> Here is my full command line (from node0):
>
> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca coll ^ml ./spawn
>
> A simple workaround is to add the following extra parameter to the
> mpirun command line:
> --mca grpcomm_rcd_priority 0
>
> My understanding is that the race condition occurs when all the
> processes call MPI_Finalize().
> Internally, the pmix module will have mpirun/orted issue two ALLGATHERs
> involving mpirun and orted
> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks).
> The error message is very explicit: this is not (currently) supported.
>
> I wrote the attached rml.patch, which is really a workaround and not a fix:
> in this case, each job will invoke an ALLGATHER but with a different tag
> /* that works for a limited number of jobs only */
>
> I did not commit this patch since this is not a fix. Could someone
> (Ralph?) please review the issue and comment?
>
> Cheers,
>
> Gilles
>
> <spawn.c><rml.patch>