Ralph, things got worse indeed :-(
now a simple hello world involving two hosts hangs in MPI_Init.
there is still a race condition : if task A calls fence long after task B,
then task B will never leave the fence.

i'll try to debug this ...

Cheers,

Gilles

On 2014/09/11 2:36, Ralph Castain wrote:
> I think I now have this fixed - let me know what you see.
>
>
> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Yeah, that's not the correct fix. The right way to fix it is for all three
>> components to have their own RML tag, and for each of them to establish a
>> persistent receive. They then can use the signature to tell which collective
>> the incoming message belongs to.
>>
>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>>
>>
>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>>> Folks,
>>>
>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>> the attached spawn.c test program is a trimmed version of the
>>> spawn_with_env_vars.c test case
>>> from the ibm test suite.
>>>
>>> when invoked on two nodes :
>>> - the program hangs with -np 2
>>> - the program can crash with np > 2
>>> error message is
>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>> AND TAG -33 - ABORTING
>>>
>>> here is my full command line (from node0) :
>>>
>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>>> coll ^ml ./spawn
>>>
>>> a simple workaround is to add the following extra parameter to the
>>> mpirun command line :
>>> --mca grpcomm_rcd_priority 0
>>>
>>> my understanding is that the race condition occurs when all the
>>> processes call MPI_Finalize()
>>> internally, the pmix module will have mpirun/orted issue two ALLGATHER
>>> involving mpirun and orted
>>> (one for job 1 aka the parent, and one for job 2 aka the spawned tasks)
>>> the error message is very explicit : this is not (currently) supported
>>>
>>> i wrote the attached rml.patch which is really a workaround and not a fix :
>>> in this case, each job will invoke an ALLGATHER but with a different tag
>>> /* that works for a limited number of jobs only */
>>>
>>> i did not commit this patch since this is not a fix, could someone
>>> (Ralph ?) please review the issue and comment ?
>>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> <spawn.c><rml.patch>_______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/09/15780.php
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15794.php
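
For reference, the approach quoted above (one RML tag per component, a single
persistent receive, and the signature used to tell which collective an incoming
message belongs to) can be sketched in plain C. This is only an illustration
and assumes nothing about the real grpcomm/RML code: tracker_t, collective_t,
tracker_init, tracker_on_message, the string signature and the fixed-size table
are all hypothetical names and simplifications. The point it tries to show is
that a contribution arriving before the local side has started the collective
simply creates the tracker on demand, so two concurrent ALLGATHERs (job 1 and
job 2) can share one tag without posting two receives.

```c
#include <assert.h>
#include <string.h>

#define MAX_COLLECTIVES 8   /* illustrative limit, not taken from Open MPI */
#define MAX_SIG 32          /* illustrative signature size */

/* One in-flight collective, identified by its signature (hypothetical). */
typedef struct {
    char signature[MAX_SIG]; /* e.g. a string naming the participating job */
    int expected;            /* contributions needed to complete */
    int received;            /* contributions seen so far */
    int in_use;              /* 1 while the collective is in flight */
} collective_t;

/* Per-component state: one tag, one persistent receive, many collectives. */
typedef struct {
    int tag;
    collective_t active[MAX_COLLECTIVES];
} tracker_t;

void tracker_init(tracker_t *t, int tag)
{
    assert(t != NULL);
    memset(t, 0, sizeof *t);
    t->tag = tag;
}

/* Return the in-flight collective matching sig, creating it on demand. */
static collective_t *find_or_create(tracker_t *t, const char *sig, int expected)
{
    collective_t *free_slot = NULL;
    for (int i = 0; i < MAX_COLLECTIVES; i++) {
        collective_t *c = &t->active[i];
        if (c->in_use && strcmp(c->signature, sig) == 0) {
            return c;
        }
        if (!c->in_use && free_slot == NULL) {
            free_slot = c;
        }
    }
    if (free_slot != NULL) {
        strcpy(free_slot->signature, sig); /* length checked by the caller */
        free_slot->expected = expected;
        free_slot->received = 0;
        free_slot->in_use = 1;
    }
    return free_slot;
}

/*
 * Stand-in for the callback of the single persistent receive.
 * Returns 1 when the collective named by sig just completed,
 * 0 while it is still waiting for contributions, -1 on error.
 */
int tracker_on_message(tracker_t *t, int tag, const char *sig, int expected)
{
    assert(t != NULL && sig != NULL);
    if (tag != t->tag || expected <= 0 || strlen(sig) >= MAX_SIG) {
        return -1;
    }
    collective_t *c = find_or_create(t, sig, expected);
    if (c == NULL) {
        return -1; /* table full */
    }
    c->received++;
    if (c->received < c->expected) {
        return 0;
    }
    c->in_use = 0; /* completed: release the slot for reuse */
    return 1;
}
```

With this layout, interleaved messages for "job1" and "job2" on the same tag
are counted independently, and a message on any other tag is rejected rather
than matched against the wrong collective.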