I think I now have this fixed - let me know what you see.
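
For anyone following the thread, here is a rough sketch of the approach described in the quoted message below: one tag per component, a single persistent receive on that tag, and the signature carried in each message used to pick the collective it belongs to. This is only an illustration and not the code that went into the trunk. Every name in it (proc_name_t, signature_t, tracker_t, on_message, and so on) is hypothetical and does not correspond to the real RML or grpcomm API.

```c
#include <stddef.h>

#define MAX_COLLECTIVES 8   /* hypothetical limit on in-flight collectives */
#define MAX_SIG_PROCS   8   /* hypothetical limit on participants per signature */

/* Hypothetical stand-in for a process name: (jobid, vpid). */
typedef struct {
    unsigned jobid;
    unsigned vpid;
} proc_name_t;

/* Hypothetical signature: the ordered list of participants in a collective. */
typedef struct {
    proc_name_t procs[MAX_SIG_PROCS];
    size_t nprocs;
} signature_t;

/* One in-flight collective, identified by its signature. */
typedef struct {
    signature_t sig;
    int in_use;
    int nreported;   /* contributions received so far */
} collective_t;

/* All collectives tracked by one component behind its single tag. */
typedef struct {
    collective_t slots[MAX_COLLECTIVES];
} tracker_t;

/* Two signatures match when they list the same participants in the same order. */
static int sig_equal(const signature_t *a, const signature_t *b)
{
    if (a->nprocs != b->nprocs) {
        return 0;
    }
    for (size_t i = 0; i < a->nprocs; i++) {
        if (a->procs[i].jobid != b->procs[i].jobid ||
            a->procs[i].vpid != b->procs[i].vpid) {
            return 0;
        }
    }
    return 1;
}

/* Find the collective for this signature, creating it in a free slot if needed.
 * Returns NULL when no slot is available. */
static collective_t *tracker_get(tracker_t *t, const signature_t *sig)
{
    collective_t *free_slot = NULL;
    for (size_t i = 0; i < MAX_COLLECTIVES; i++) {
        collective_t *c = &t->slots[i];
        if (c->in_use && sig_equal(&c->sig, sig)) {
            return c;
        }
        if (!c->in_use && free_slot == NULL) {
            free_slot = c;
        }
    }
    if (free_slot == NULL) {
        return NULL;
    }
    free_slot->in_use = 1;
    free_slot->sig = *sig;
    free_slot->nreported = 0;
    return free_slot;
}

/* What the callback of the one persistent receive would do for every message
 * arriving on the component's tag: route it by signature rather than by tag.
 * Returns the number of contributions recorded for that collective, or -1. */
static int on_message(tracker_t *t, const signature_t *sig)
{
    collective_t *c = tracker_get(t, sig);
    if (c == NULL) {
        return -1;
    }
    return ++c->nreported;
}
```

With this shape, the parent job's allgather and the spawned job's allgather can both arrive from the same peer on the same tag without a second receive ever being posted, since the two are told apart by signature only.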

On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Yeah, that's not the correct fix. The right way to fix it is for all three
> components to have their own RML tag, and for each of them to establish a
> persistent receive. They then can use the signature to tell which collective
> the incoming message belongs to.
>
> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>
>
> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Folks,
>>
>> Since r32672 (trunk), grpcomm/rcd is the default module.
>> the attached spawn.c test program is a trimmed version of the
>> spawn_with_env_vars.c test case from the ibm test suite.
>>
>> when invoked on two nodes :
>> - the program hangs with -np 2
>> - the program can crash with np > 2
>> error message is
>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1] AND TAG -33 - ABORTING
>>
>> here is my full command line (from node0) :
>>
>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca coll ^ml ./spawn
>>
>> a simple workaround is to add the following extra parameter to the
>> mpirun command line :
>> --mca grpcomm_rcd_priority 0
>>
>> my understanding is that the race condition occurs when all the
>> processes call MPI_Finalize()
>> internally, the pmix module will have mpirun/orted issue two ALLGATHER
>> involving mpirun and orted
>> (one for job 1 aka the parent, and one for job 2 aka the spawned tasks)
>> the error message is very explicit : this is not (currently) supported
>>
>> i wrote the attached rml.patch which is really a workaround and not a fix :
>> in this case, each job will invoke an ALLGATHER but with a different tag
>> /* that works for a limited number of jobs only */
>>
>> i did not commit this patch since this is not a fix, could someone
>> (Ralph ?) please review the issue and comment ?
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> <spawn.c><rml.patch>_______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/09/15780.php