OK, figured it out. There were three problems with the del_procs code:
1) ompi_mpi_finalize used ompi_proc_all to get the list of procs but
never released the references to them (ompi_proc_all calls
OBJ_RETAIN on every proc it returns). When calling del_procs at
finalize it should suffice to call ompi_proc_world, which does not
increment the reference count.
2) del_procs is called BEFORE ompi_comm_finalize. That means the
references taken by the pml_add_comm function are still held when
del_procs runs. The fix is to reorder the calls to do
ompi_comm_finalize, del_procs, pml_finalize instead of del_procs,
pml_finalize, ompi_comm_finalize.
3) The check in del_procs in r2 checked for a reference count of
1. This is incorrect. At this point there should be 2 references: 1
from ompi_proc, and another from the add_procs. The fix is to change
this check to look for 2. This check makes me extremely uncomfortable,
as nothing will call del_procs if the reference count of a proc is
not 2 when del_procs is called. Maybe there should be an assert,
since this is a developer error IMHO.
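
A rough sketch of the reference accounting described in 1) to 3), in
case it helps to see it in one place. This is a toy model only: every
toy_* name is made up for illustration and none of it is the real
ompi/bml code. It assumes each proc starts with one reference from the
proc list and one from add_procs, that a communicator holds one more,
and that the retaining lookup adds another.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a reference-counted proc; NOT the real ompi_proc_t. */
typedef struct {
    int refcount;
    int deleted;   /* set once toy_del_procs drops the add_procs reference */
} toy_proc_t;

enum { TOY_NPROCS = 3, TOY_EXPECTED_REFS = 2 };

static void toy_retain(toy_proc_t *p)  { p->refcount++; }
static void toy_release(toy_proc_t *p) { p->refcount--; }

/* Startup: one reference owned by the proc list, one taken by add_procs. */
static void toy_init(toy_proc_t procs[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        procs[i].refcount = 1;
        procs[i].deleted = 0;
        toy_retain(&procs[i]);
    }
}

/* Models a lookup that retains every proc it returns (problem 1). */
static void toy_proc_all(toy_proc_t procs[], size_t n)
{
    for (size_t i = 0; i < n; i++) toy_retain(&procs[i]);
}

/* Models a communicator taking and dropping its per-proc reference. */
static void toy_comm_add(toy_proc_t procs[], size_t n)
{
    for (size_t i = 0; i < n; i++) toy_retain(&procs[i]);
}

static void toy_comm_finalize(toy_proc_t procs[], size_t n)
{
    for (size_t i = 0; i < n; i++) toy_release(&procs[i]);
}

/* Only tears down procs holding exactly the expected 2 references.
 * With strict != 0 any other count trips an assert instead of being
 * skipped silently (the suggestion in 3). Returns the number deleted. */
static size_t toy_del_procs(toy_proc_t procs[], size_t n, int strict)
{
    size_t ndeleted = 0;
    for (size_t i = 0; i < n; i++) {
        if (strict) assert(procs[i].refcount == TOY_EXPECTED_REFS);
        if (procs[i].refcount == TOY_EXPECTED_REFS) {
            toy_release(&procs[i]);   /* drop the add_procs reference */
            procs[i].deleted = 1;
            ndeleted++;
        }
    }
    return ndeleted;
}

/* Old order: retaining lookup, del_procs before comm finalize. */
static size_t toy_finalize_old(void)
{
    toy_proc_t procs[TOY_NPROCS];
    toy_init(procs, TOY_NPROCS);
    toy_comm_add(procs, TOY_NPROCS);
    toy_proc_all(procs, TOY_NPROCS);          /* never released */
    size_t n = toy_del_procs(procs, TOY_NPROCS, 0);
    toy_comm_finalize(procs, TOY_NPROCS);
    return n;
}

/* Fixed order: comm finalize first, no retaining lookup, strict check. */
static size_t toy_finalize_fixed(void)
{
    toy_proc_t procs[TOY_NPROCS];
    toy_init(procs, TOY_NPROCS);
    toy_comm_add(procs, TOY_NPROCS);
    toy_comm_finalize(procs, TOY_NPROCS);
    return toy_del_procs(procs, TOY_NPROCS, 1);
}
```

In this model toy_finalize_old() deletes nothing because every count is
above 2 when the check runs, while toy_finalize_fixed() deletes all
procs and would abort on the assert if any count were off.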
Committing a patch to fix all three of these issues.
-Nathan
On Thu, May 15, 2014 at 11:52:27AM -0600, Nathan Hjelm wrote:
> On Thu, May 15, 2014 at 11:44:05AM -0600, Nathan Hjelm wrote:
> > On Thu, May 15, 2014 at 01:33:31PM -0400, George Bosilca wrote:
> > > The solution you propose here is definitively not OK. It is 1) ugly and
> > > 2) break the separation barrier that we hold dear.
> >
> > Which is why I asked :)
> >
> > > Regarding your other suggestion I don’t see any reasons not to call the
> > > delete_proc on MPI_COMM_WORLD as the last action we do before tearing
> > > down everything else.
> >
> > I spoke too soon. It looks like we *are* calling del_procs but I am not
> > seeing the call reach the bml.... I will try and track this down.
>
> /bml/btl/ .. I see what is happening. The proc reference counts are all
> larger than 1 when we call del_procs:
>
>
> [1,2]<stderr>:Deleting proc 0x7b83190 with reference count 5
> [1,1]<stderr>:Deleting proc 0x7b83180 with reference count 5
> [1,2]<stderr>:Deleting proc 0x7b832b0 with reference count 5
> [1,1]<stderr>:Deleting proc 0x7b832a0 with reference count 7
> [1,2]<stderr>:Deleting proc 0x7b83360 with reference count 7
> [1,1]<stderr>:Deleting proc 0x7b833a0 with reference count 5
> [1,0]<stderr>:Deleting proc 0x7b83190 with reference count 7
> [1,0]<stderr>:Deleting proc 0x7b83300 with reference count 5
> [1,0]<stderr>:Deleting proc 0x7b833b0 with reference count 5
>
>
> I will track that down.
>
> -Nathan
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14812.php
