On May 27, 2014, at 6:11 PM, Gilles Gouaillardet
<[email protected]> wrote:
> Ralph,
>
> In the case of intercomm_create, the children free all the communicators,
> then call MPI_Comm_disconnect() and MPI_Finalize(), and exit.
> The parent only calls MPI_Comm_disconnect() without freeing all the
> communicators, so its MPI_Finalize() tries to disconnect from and communicate
> with already exited processes.
>
> My understanding is that there are two ways of seeing things:
> a) the "R-way": the problem is that the parent should not try to communicate
> with already exited processes
> b) the "J-way": the problem is that the children should have waited, either in
> MPI_Comm_free() or in MPI_Finalize()
I don't think you can use option (b) - we can't have the children lingering
around waiting for the parent to call finalize, if I'm understanding you correctly.
When I look at loop_spawn, I see this being done by the parent on every
iteration:
MPI_Init(&argc, &argv);
loop() {
    MPI_Comm_spawn(EXE_TEST, NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &comm, &err);
    printf("parent: MPI_Comm_spawn #%d return : %d\n", iter, err);
    MPI_Intercomm_merge(comm, 0, &merged);
    MPI_Comm_rank(merged, &rank);
    MPI_Comm_size(merged, &size);
    printf("parent: MPI_Comm_spawn #%d rank %d, size %d\n",
           iter, rank, size);
    MPI_Comm_free(&merged);
}
MPI_Finalize();
The child does:
MPI_Init(&argc, &argv);
MPI_Comm_get_parent(&parent);
MPI_Intercomm_merge(parent, 1, &merged);
MPI_Comm_rank(merged, &rank);
MPI_Comm_size(merged, &size);
printf("Child merged rank = %d, size = %d\n", rank, size);
MPI_Comm_free(&merged);
MPI_Finalize();
So it looks to me like either something is missing here, or there is a bug in
MPI_Comm_free that isn't removing the child from the parent's field of view.
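For what it's worth, if the intent is for the parent and each child to be fully
disconnected (in the MPI sense) before either side finalizes, then my reading is
that the spawn intercommunicator itself would also have to be freed or
disconnected on both sides, not just the merged intracomm. A rough sketch of
what that might look like - this is only a sketch, and it assumes the intercomm
from MPI_Comm_spawn / MPI_Comm_get_parent is what keeps the two jobs
connected - would be:

    /* parent, inside the loop, after it is done with the merged comm */
    MPI_Comm_free(&merged);        /* drop the merged intracomm */
    MPI_Comm_disconnect(&comm);    /* drop the spawn intercomm too, waiting
                                      for any pending traffic to complete */

    /* child, before finalizing */
    MPI_Comm_free(&merged);        /* drop the merged intracomm */
    MPI_Comm_disconnect(&parent);  /* drop the intercomm from
                                      MPI_Comm_get_parent */
    MPI_Finalize();

With something like that, the parent's MPI_Finalize should have nothing left to
disconnect from an already-exited child. I'm not claiming that is what the test
is supposed to do - just that trying it would tell us whether the problem is in
the test or in our Comm_free/disconnect bookkeeping.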
>
> I have not investigated the loop_spawn test yet; I will do so today.
>
> As far as I am concerned, I have no opinion on which of a) or b) is the
> correct/most appropriate approach.
>
> Cheers,
>
> Gilles
>
>
> On Wed, May 28, 2014 at 9:46 AM, Ralph Castain <[email protected]> wrote:
> Since you ignored my response, I'll reiterate and clarify it here. The
> problem in the case of loop_spawn is that the parent process remains
> "connected" to each child after that child has finalized and died. Hence, when
> the parent attempts to finalize, it tries to "disconnect" itself from
> processes that no longer exist - and that is what generates the error message.
>
> So the issue in that case appears to be that "finalize" is not marking the
> child process as "disconnected", thus leaving the parent thinking that it
> needs to disconnect when it finally ends.
>
>
> On May 27, 2014, at 5:33 PM, Jeff Squyres (jsquyres) <[email protected]>
> wrote:
>
> > Note that MPI says that COMM_DISCONNECT simply disconnects that individual
> > communicator. It does *not* guarantee that the processes involved will be
> > fully disconnected.
> >
> > So I think that the freeing of communicators is good app behavior, but it
> > is not required by the MPI spec.
> >
> > If OMPI is requiring this for correct termination, then something is wrong.
> > MPI_FINALIZE is supposed to be collective across all connected MPI procs
> > -- and if the parent and spawned procs in this test are still connected
> > (because they have not disconnected all communicators between them), the
> > FINALIZE is supposed to be collective across all of them.
> >
> > This means that FINALIZE is allowed to block if it needs to, such that OMPI
> > sending control messages to procs that are still "connected" (in the MPI
> > sense) should never cause a race condition.
> >
> > As such, this sounds like an OMPI bug.
> >
> >
> >
> >
> > On May 27, 2014, at 2:27 AM, Gilles Gouaillardet
> > <[email protected]> wrote:
> >
> >> Folks,
> >>
> >> Currently, the dynamic/intercomm_create test from the ibm test suite
> >> outputs the following message:
> >>
> >> dpm_base_disconnect_init: error -12 in isend to process 1
> >>
> >> The root cause is that task 0 tries to send messages to already exited tasks.
> >>
> >> One way of seeing things is that this is an application issue:
> >> task 0 should have MPI_Comm_free'd all its communicators before calling
> >> MPI_Comm_disconnect.
> >> This can be achieved via the attached patch.
> >>
> >> Another way of seeing things is that this is a bug in Open MPI.
> >> In that case, what would be the right approach?
> >> - automatically free communicators (if needed) when MPI_Comm_disconnect is
> >> invoked?
> >> - simply remove communicators (if needed) from ompi_mpi_communicators when
> >> MPI_Comm_disconnect is invoked?
> >> /* this causes a memory leak, but the application can be seen as
> >> responsible for it */
> >> - other?
> >>
> >> Thanks in advance for your feedback,
> >>
> >> Gilles
> >> <intercomm_create.patch>
> >
> >
> > --
> > Jeff Squyres
> > [email protected]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/