[OMPI devel] r31765 causes crash in mpirun

2014-05-15 Thread Gilles Gouaillardet
Folks,

Since r31765 (opal/event: release the opal event context when closing
the event base), mpirun crashes at the end of the job.

For example:

$ mpirun --mca btl tcp,self -n 4 `pwd`/src/MPI_Allreduce_user_c
MPITEST info  (0): Starting MPI_Allreduce_user() test
MPITEST_results: MPI_Allreduce_user() all tests PASSED (7076)
[soleil:10959] *** Process received signal ***
[soleil:10959] Signal: Segmentation fault (11)
[soleil:10959] Signal code: Address not mapped (1)
[soleil:10959] Failing at address: 0x7fd969e75a98
[soleil:10959] [ 0] /lib64/libpthread.so.0[0x3c9da0f500]
[soleil:10959] [ 1]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x7bae5)[0x7fd96a55dae5]
[soleil:10959] [ 2]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x7ac97)[0x7fd96a55cc97]
[soleil:10959] [ 3]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_del+0x88)[0x7fd96a55ca15]
[soleil:10959] [ 4]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_free+0x132)[0x7fd96a558831]
[soleil:10959] [ 5]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x74126)[0x7fd96a556126]
[soleil:10959] [ 6]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(mca_base_framework_close+0xdd)[0x7fd96a54026f]
[soleil:10959] [ 7]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_finalize+0x7e)[0x7fd96a50d36e]
[soleil:10959] [ 8]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-rte.so.0(orte_finalize+0xd3)[0x7fd96a7ead2f]
[soleil:10959] [ 9] mpirun(orterun+0x1298)[0x404f0e]
[soleil:10959] [10] mpirun(main+0x20)[0x4038a4]
[soleil:10959] [11] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3c9d21ecdd]
[soleil:10959] [12] mpirun[0x4037c9]
[soleil:10959] *** End of error message ***
Segmentation fault (core dumped)
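
The backtrace points at event_base_free() calling event_del() on memory
that is no longer valid. For illustration only (a minimal standalone
libevent 2.x program, not the actual OMPI code path), the same failure
mode can be reproduced by releasing a registered event's storage before
freeing its base:

#include <stdlib.h>
#include <sys/time.h>
#include <event2/event.h>

static void cb(evutil_socket_t fd, short what, void *arg)
{
    (void) fd; (void) what; (void) arg;
}

int main(void)
{
    struct event_base *base = event_base_new();

    /* heap-allocate an event and register it with the base */
    struct event *ev = event_new(base, -1, EV_PERSIST, cb, NULL);
    struct timeval tv = { 60, 0 };
    event_add(ev, &tv);

    /* BUG: releasing the event's storage without event_del()/event_free()
     * leaves a dangling pointer inside the base ... */
    free(ev);

    /* ... so event_base_free() walks the remaining registered events,
     * calls event_del() on freed memory, and crashes much like the
     * backtrace above */
    event_base_free(base);
    return 0;
}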

Gilles



Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-15 Thread Gilles Gouaillardet
Nathan,

This had no effect in my environment :-(

I am not sure you can reuse mca_btl_scif_module.scif_fd with connect();
I had to use a new scif fd for that.

Then I ran into another glitch: if the listen thread does not
scif_accept() the connection, scif_connect() takes 30 seconds
(the default timeout value, I guess).

I fixed this in r31772.
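
For reference, a hedged sketch of the connect side against the public
SCIF API (scif.h); connect_to_peer() and its port handling are
illustrative, not the actual btl/scif code:

#include <stdint.h>
#include <scif.h>

/* open a dedicated endpoint for the outgoing connection instead of
 * reusing the listening endpoint */
static scif_epd_t connect_to_peer(uint16_t peer_node, uint16_t peer_port)
{
    struct scif_portID dst;
    dst.node = peer_node;
    dst.port = peer_port;

    scif_epd_t epd = scif_open();
    if (SCIF_OPEN_FAILED == epd) {
        return SCIF_OPEN_FAILED;
    }

    /* if the peer's listen thread never calls scif_accept() on its end,
     * this call blocks for the connection timeout (roughly the 30 seconds
     * observed above) before returning an error */
    if (scif_connect(epd, &dst) < 0) {
        scif_close(epd);
        return SCIF_OPEN_FAILED;
    }

    return epd;
}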

Gilles

On 2014/05/15 1:16, Nathan Hjelm wrote:
> That is exactly how I decided to fix it. It looks like it is
> working. Please try r31755 when you get a chance.
>



[OMPI devel] RFC: fix leak of bml endpoints

2014-05-15 Thread Nathan Hjelm

What: We never call del_procs on the procs in comm world. This leads us
to leak the bml endpoints created by r2.

The proposed solution is not ideal, but it avoids adding a call to
del_procs for comm world, something I know would require more discussion
since there is likely a reason for the current behavior. I propose we
delete any remaining bml endpoints when we tear down the ompi_proc_t:

diff --git a/ompi/proc/proc.c b/ompi/proc/proc.c
index f549335..9ea0311 100644
--- a/ompi/proc/proc.c
+++ b/ompi/proc/proc.c
@@ -89,6 +89,13 @@ void ompi_proc_destruct(ompi_proc_t* proc)
     OPAL_THREAD_LOCK(&ompi_proc_lock);
     opal_list_remove_item(&ompi_proc_list, (opal_list_item_t*)proc);
     OPAL_THREAD_UNLOCK(&ompi_proc_lock);
+
+#if defined(OMPI_PROC_ENDPOINT_TAG_BML)
+    /* release the bml endpoint if it still exists */
+    if (proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_BML]) {
+        OBJ_RELEASE(proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_BML]);
+    }
+#endif
 }
 
This fixes the leak and appears to have no negative side effects for
r2.

Why: Trying to clean up the last remaining leaks in the Open MPI code
base. This is one of the larger ones, as it grows with the size of comm
world.

When: I want this to go into 1.8.2 if possible. Setting a short timeout
of 1 week.

Keep in mind I do not know the full history of add_procs/del_procs, so
there may be a better way to fix this. This RFC is meant to open the
discussion about how to address this leak. If the above fix looks OK, I
will commit it.

-Nathan




Re: [OMPI devel] r31765 causes crash in mpirun

2014-05-15 Thread Ralph Castain
I fixed this by reverting r31765 in r31775, and annotated the ticket with an explanation.


On May 15, 2014, at 1:20 AM, Gilles Gouaillardet wrote:

> Folks,
> 
> Since r31765 (opal/event: release the opal event context when closing
> the event base), mpirun crashes at the end of the job.
> 
> For example:
> 
> $ mpirun --mca btl tcp,self -n 4 `pwd`/src/MPI_Allreduce_user_c
> MPITEST info  (0): Starting MPI_Allreduce_user() test
> MPITEST_results: MPI_Allreduce_user() all tests PASSED (7076)
> [soleil:10959] *** Process received signal ***
> [soleil:10959] Signal: Segmentation fault (11)
> [soleil:10959] Signal code: Address not mapped (1)
> [soleil:10959] Failing at address: 0x7fd969e75a98
> [soleil:10959] [ 0] /lib64/libpthread.so.0[0x3c9da0f500]
> [soleil:10959] [ 1]
> /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x7bae5)[0x7fd96a55dae5]
> [soleil:10959] [ 2]
> /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x7ac97)[0x7fd96a55cc97]
> [soleil:10959] [ 3]
> /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_del+0x88)[0x7fd96a55ca15]
> [soleil:10959] [ 4]
> /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_free+0x132)[0x7fd96a558831]
> [soleil:10959] [ 5]
> /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x74126)[0x7fd96a556126]
> [soleil:10959] [ 6]
> /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(mca_base_framework_close+0xdd)[0x7fd96a54026f]
> [soleil:10959] [ 7]
> /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_finalize+0x7e)[0x7fd96a50d36e]
> [soleil:10959] [ 8]
> /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-rte.so.0(orte_finalize+0xd3)[0x7fd96a7ead2f]
> [soleil:10959] [ 9] mpirun(orterun+0x1298)[0x404f0e]
> [soleil:10959] [10] mpirun(main+0x20)[0x4038a4]
> [soleil:10959] [11] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3c9d21ecdd]
> [soleil:10959] [12] mpirun[0x4037c9]
> [soleil:10959] *** End of error message ***
> Segmentation fault (core dumped)
> 
> Gilles
> 



Re: [OMPI devel] RFC: fix leak of bml endpoints

2014-05-15 Thread George Bosilca
The solution you propose here is definitely not OK. It is 1) ugly and 2)
breaks the separation barrier that we hold dear.

Regarding your other suggestion, I don't see any reason not to call
delete_proc on MPI_COMM_WORLD as the last action we take before tearing
down everything else.

  George.

On May 15, 2014, at 11:22, Nathan Hjelm wrote:

> 
> What: We never call del_procs on the procs in comm world. This leads us
> to leak the bml endpoints created by r2.
> 
> The proposed solution is not ideal, but it avoids adding a call to
> del_procs for comm world, something I know would require more discussion
> since there is likely a reason for the current behavior. I propose we
> delete any remaining bml endpoints when we tear down the ompi_proc_t:
> 
> diff --git a/ompi/proc/proc.c b/ompi/proc/proc.c
> index f549335..9ea0311 100644
> --- a/ompi/proc/proc.c
> +++ b/ompi/proc/proc.c
> @@ -89,6 +89,13 @@ void ompi_proc_destruct(ompi_proc_t* proc)
>      OPAL_THREAD_LOCK(&ompi_proc_lock);
>      opal_list_remove_item(&ompi_proc_list, (opal_list_item_t*)proc);
>      OPAL_THREAD_UNLOCK(&ompi_proc_lock);
> +
> +#if defined(OMPI_PROC_ENDPOINT_TAG_BML)
> +    /* release the bml endpoint if it still exists */
> +    if (proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_BML]) {
> +        OBJ_RELEASE(proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_BML]);
> +    }
> +#endif
>  }
> 
> This fixes the leak and appears to have no negative side effects for
> r2.
> 
> Why: Trying to clean up the last remaining leaks in the Open MPI code
> base. This is one of the larger ones as it grows with comm world.
> 
> When: I want this to go into 1.8.2 if possible. Setting a short timeout
> of 1 week.
> 
> Keep in mind I do not know the full history of add_procs/del_procs so
> there may be a better way to fix this. This RFC is meant to open the
> discussion about how to address this leak. If the above fix looks ok I
> will commit it.
> 
> -Nathan



Re: [OMPI devel] RFC: fix leak of bml endpoints

2014-05-15 Thread Nathan Hjelm
On Thu, May 15, 2014 at 01:33:31PM -0400, George Bosilca wrote:
> The solution you propose here is definitely not OK. It is 1) ugly and 2)
> breaks the separation barrier that we hold dear.

Which is why I asked :)

> Regarding your other suggestion, I don't see any reason not to call
> delete_proc on MPI_COMM_WORLD as the last action we take before tearing
> down everything else.

I spoke too soon. It looks like we *are* calling del_procs, but I am not
seeing the call reach the bml. I will try to track this down.

-Nathan




Re: [OMPI devel] RFC: fix leak of bml endpoints

2014-05-15 Thread Nathan Hjelm
On Thu, May 15, 2014 at 11:44:05AM -0600, Nathan Hjelm wrote:
> On Thu, May 15, 2014 at 01:33:31PM -0400, George Bosilca wrote:
> > The solution you propose here is definitely not OK. It is 1) ugly and 2)
> > breaks the separation barrier that we hold dear.
> 
> Which is why I asked :)
> 
> > Regarding your other suggestion, I don't see any reason not to call
> > delete_proc on MPI_COMM_WORLD as the last action we take before tearing
> > down everything else.
> 
> I spoke too soon. It looks like we *are* calling del_procs, but I am not
> seeing the call reach the bml. I will try to track this down.

s/bml/btl/ ... I see what is happening. The proc reference counts are all
larger than 1 when we call del_procs:


[1,2]:Deleting proc 0x7b83190 with reference count 5
[1,1]:Deleting proc 0x7b83180 with reference count 5
[1,2]:Deleting proc 0x7b832b0 with reference count 5
[1,1]:Deleting proc 0x7b832a0 with reference count 7
[1,2]:Deleting proc 0x7b83360 with reference count 7
[1,1]:Deleting proc 0x7b833a0 with reference count 5
[1,0]:Deleting proc 0x7b83190 with reference count 7
[1,0]:Deleting proc 0x7b83300 with reference count 5
[1,0]:Deleting proc 0x7b833b0 with reference count 5


I will track that down.
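
Those lines look like output from a temporary debug print. A hedged
sketch of how such a print can be produced (debug_print_proc_refcount()
is hypothetical), assuming the usual OPAL object layout where every
OBJ-managed structure embeds an opal_object_t whose obj_reference_count
holds the current count:

#include "opal/class/opal_object.h"
#include "opal/util/output.h"
#include "ompi/proc/proc.h"

static void debug_print_proc_refcount(ompi_proc_t *proc)
{
    /* the opal_object_t is the first member of any OBJ class, so the
     * cast below is safe */
    opal_object_t *obj = (opal_object_t *) proc;
    opal_output(0, "Deleting proc %p with reference count %d",
                (void *) proc, (int) obj->obj_reference_count);
}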

-Nathan




Re: [OMPI devel] RFC: fix leak of bml endpoints

2014-05-15 Thread Nathan Hjelm
OK, I figured it out. There were three problems with the del_procs code
(a rough sketch of the resulting finalize ordering follows the list):

 1) ompi_mpi_finalize used ompi_proc_all to get the list of procs but
never released the references to them (ompi_proc_all calls
OBJ_RETAIN on all the procs it returns). When calling del_procs at
finalize it should suffice to call ompi_proc_world, which does not
increment the reference count.

 2) del_procs is called BEFORE ompi_comm_finalize. This leaves in place
the references to the procs taken when pml_add_comm was called. The
fix is to reorder the calls to ompi_comm_finalize, del_procs,
pml_finalize instead of del_procs, pml_finalize,
ompi_comm_finalize.

 3) The check in del_procs in r2 looked for a reference count of
1. This is incorrect. At this point there should be 2 references: 1
from ompi_proc, and another from add_procs. The fix is to change
this check to look for 2. This check makes me extremely uncomfortable,
as nothing will call del_procs if the reference count of a proc is
not 2 when del_procs is called. Maybe there should be an assert here,
since this is a developer error IMHO.
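
A hedged sketch (not the committed patch) of the ordering implied by
items 1 and 2; finalize_procs_sketch() is hypothetical and the internal
signatures of ompi_comm_finalize(), ompi_proc_world() and
MCA_PML_CALL(del_procs()) are assumed as I understand the 1.8-era code:

#include <stdlib.h>
#include "opal/mca/base/mca_base_framework.h"
#include "ompi/constants.h"
#include "ompi/proc/proc.h"
#include "ompi/communicator/communicator.h"
#include "ompi/mca/pml/pml.h"
#include "ompi/mca/pml/base/base.h"

static int finalize_procs_sketch(void)
{
    size_t nprocs;
    ompi_proc_t **procs;
    int ret;

    /* tear down communicators first so the references taken by
     * pml_add_comm are dropped before del_procs runs */
    ret = ompi_comm_finalize();
    if (OMPI_SUCCESS != ret) {
        return ret;
    }

    /* ompi_proc_world() does not OBJ_RETAIN the procs it returns,
     * unlike ompi_proc_all() */
    procs = ompi_proc_world(&nprocs);
    if (NULL == procs) {
        return OMPI_ERROR;
    }

    /* at this point each proc should carry exactly 2 references:
     * one from ompi_proc and one from add_procs */
    ret = MCA_PML_CALL(del_procs(procs, nprocs));
    free(procs);
    if (OMPI_SUCCESS != ret) {
        return ret;
    }

    /* close the pml framework last */
    return mca_base_framework_close(&ompi_pml_base_framework);
}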

Committing a patch to fix all three of these issues.

-Nathan

On Thu, May 15, 2014 at 11:52:27AM -0600, Nathan Hjelm wrote:
> On Thu, May 15, 2014 at 11:44:05AM -0600, Nathan Hjelm wrote:
> > On Thu, May 15, 2014 at 01:33:31PM -0400, George Bosilca wrote:
> > > The solution you propose here is definitely not OK. It is 1) ugly and
> > > 2) breaks the separation barrier that we hold dear.
> > 
> > Which is why I asked :)
> > 
> > > Regarding your other suggestion, I don't see any reason not to call
> > > delete_proc on MPI_COMM_WORLD as the last action we take before tearing
> > > down everything else.
> > 
> > I spoke too soon. It looks like we *are* calling del_procs, but I am not
> > seeing the call reach the bml. I will try to track this down.
> 
> s/bml/btl/ ... I see what is happening. The proc reference counts are all
> larger than 1 when we call del_procs:
> 
> 
> [1,2]:Deleting proc 0x7b83190 with reference count 5
> [1,1]:Deleting proc 0x7b83180 with reference count 5
> [1,2]:Deleting proc 0x7b832b0 with reference count 5
> [1,1]:Deleting proc 0x7b832a0 with reference count 7
> [1,2]:Deleting proc 0x7b83360 with reference count 7
> [1,1]:Deleting proc 0x7b833a0 with reference count 5
> [1,0]:Deleting proc 0x7b83190 with reference count 7
> [1,0]:Deleting proc 0x7b83300 with reference count 5
> [1,0]:Deleting proc 0x7b833b0 with reference count 5
> 
> 
> I will track that down.
> 
> -Nathan





