On May 28, 2014, at 7:34 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> Calling MPI_Comm_free is not enough from the MPI perspective to clean up
> all knowledge about remote processes, nor to sever the links between
> the local and remote groups. One MUST call MPI_Comm_disconnect in
> order to achieve this.
>
> Look at the code in ompi/mpi/c and see the difference between
> MPI_Comm_free and MPI_Comm_disconnect. In addition to the barrier, only
> disconnect calls into the DPM framework, giving it a chance to do
> further cleanup.

Good point - however, that doesn't fix it. Changing the Comm_free calls to
Comm_disconnect results in the same error messages when the parent finalizes:

Parent:

    MPI_Init(&argc, &argv);
    for (iter = 0; iter < 100; ++iter) {
        MPI_Comm_spawn(EXE_TEST, NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &comm, &err);
        printf("parent: MPI_Comm_spawn #%d return : %d\n", iter, err);
        MPI_Intercomm_merge(comm, 0, &merged);
        MPI_Comm_rank(merged, &rank);
        MPI_Comm_size(merged, &size);
        printf("parent: MPI_Comm_spawn #%d rank %d, size %d\n", iter, rank, size);
        MPI_Comm_disconnect(&merged);
    }
    MPI_Finalize();

Child:

    MPI_Init(&argc, &argv);
    printf("Child: launch\n");
    MPI_Comm_get_parent(&parent);
    MPI_Intercomm_merge(parent, 1, &merged);
    MPI_Comm_rank(merged, &rank);
    MPI_Comm_size(merged, &size);
    printf("Child merged rank = %d, size = %d\n", rank, size);
    MPI_Comm_disconnect(&merged);
    MPI_Finalize();

Upon the parent calling finalize:

    dpm_base_disconnect_init: error -12 in isend to process 0
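For reference, a self-contained, compilable version of those two snippets
might look like the following (a sketch: the EXE_TEST definition, the file
and binary names, and the added declarations are assumptions; the MPI calls
mirror the snippets above).

Parent (loop_spawn.c):

    #include <stdio.h>
    #include <mpi.h>

    /* assumption: path of the child binary built from the child snippet */
    #define EXE_TEST "./loop_child"

    int main(int argc, char *argv[])
    {
        MPI_Comm comm, merged;
        int iter, rank, size, err;

        MPI_Init(&argc, &argv);
        for (iter = 0; iter < 100; ++iter) {
            /* spawn a single child; err receives its error code */
            MPI_Comm_spawn(EXE_TEST, MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                           MPI_COMM_WORLD, &comm, &err);
            printf("parent: MPI_Comm_spawn #%d return : %d\n", iter, err);
            /* merge parent and child into one intracommunicator,
               parent group ordered first (high = 0) */
            MPI_Intercomm_merge(comm, 0, &merged);
            MPI_Comm_rank(merged, &rank);
            MPI_Comm_size(merged, &size);
            printf("parent: MPI_Comm_spawn #%d rank %d, size %d\n",
                   iter, rank, size);
            /* only the merged comm is disconnected; the spawn
               intercomm (comm) is left alone, as in the snippet above */
            MPI_Comm_disconnect(&merged);
        }
        MPI_Finalize();
        return 0;
    }

Child (loop_child.c):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm parent, merged;
        int rank, size;

        MPI_Init(&argc, &argv);
        printf("Child: launch\n");
        /* the intercommunicator connecting us back to the parent */
        MPI_Comm_get_parent(&parent);
        /* child group ordered after the parent (high = 1) */
        MPI_Intercomm_merge(parent, 1, &merged);
        MPI_Comm_rank(merged, &rank);
        MPI_Comm_size(merged, &size);
        printf("Child merged rank = %d, size = %d\n", rank, size);
        MPI_Comm_disconnect(&merged);
        MPI_Finalize();
        return 0;
    }

Built with, e.g., mpicc -o loop_spawn loop_spawn.c and
mpicc -o loop_child loop_child.c, then launched with
mpirun -n 1 ./loop_spawn as described below.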
> George.
>
> On Wed, May 28, 2014 at 10:10 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On May 28, 2014, at 6:41 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>> Ralph,
>>
>> On Wed, May 28, 2014 at 9:33 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> This is definitely what happens: only some tasks call MPI_Comm_free()
>>>
>>> Really? I don't see how that can happen in loop_spawn - every process is
>>> clearly calling comm_free. Or are you referring to the intercomm_create
>>> test?
>>>
>> yes, I am referring to the intercomm_create test
>>
>> kewl - thanks
>>
>> about loop_spawn, I could not get any error on my single-host,
>> single-socket VM
>> (I tried --mca btl tcp,sm,self and --mca btl tcp,self)
>>
>> MPI_Finalize will end up calling ompi_dpm_dyn_finalize, which causes the
>> error message on the parent of intercomm_create.
>> A necessary condition is ompi_comm_num_dyncomm > 1
>> /* which by the way sounds odd to me, should it be 0 ? */
>>
>> That does sound odd
>>
>> which imho cannot happen if all communicators have been freed
>>
>> Can you detail your full mpirun command line, the number of servers you
>> are using, the btl involved, and the ompi release that can be used to
>> reproduce the issue?
>>
>> Running on only one server, using the current head of the svn repo. My
>> cluster only has Ethernet, and I let it freely choose the BTLs (so I
>> imagine the candidates are sm,self,tcp,vader). The cmd line is really
>> trivial:
>>
>>     mpirun -n 1 ./loop_spawn
>>
>> I modified loop_spawn to only run 100 iterations as I am not patient
>> enough to wait for 1000, and the number of iters isn't a factor so long
>> as it is greater than 1. When the parent calls finalize, I get one of the
>> following emitted for every iteration that was done:
>>
>>     dpm_base_disconnect_init: error -12 in isend to process 0
>>
>> So in other words, the parent is attempting to isend to every child that
>> was spawned during the test - it thinks that every comm_spawn'd process
>> remains connected to it.
>>
>> I'm wondering if the issue is that the parent and child are calling
>> comm_free, but neither side called comm_disconnect. So unless comm_free
>> is calling disconnect under the covers, it might explain why the parent
>> thinks all the children are still present.
>>
>> I will try to reproduce this myself.
>>
>> Cheers,
>>
>> Gilles
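One detail worth noting in the snippets above: MPI_Comm_disconnect is called
only on the merged intracommunicator, while the intercommunicator returned by
MPI_Comm_spawn (comm on the parent side, parent on the child side) is never
freed or disconnected. Since two processes remain connected as long as any
communicator still spans both groups, a variant that severs both handles may
be worth trying (a sketch, untested):

    /* at the end of each parent iteration */
    MPI_Comm_disconnect(&merged);   /* the merged intracommunicator */
    MPI_Comm_disconnect(&comm);     /* the spawn intercommunicator as well */

with a matching MPI_Comm_disconnect(&parent) on the child side.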