[OMPI devel] r31765 causes crash in mpirun
Folks,

since r31765 (opal/event: release the opal event context when closing the event base), mpirun crashes at the end of the job.

For example:

$ mpirun --mca btl tcp,self -n 4 `pwd`/src/MPI_Allreduce_user_c
MPITEST info (0): Starting MPI_Allreduce_user() test
MPITEST_results: MPI_Allreduce_user() all tests PASSED (7076)
[soleil:10959] *** Process received signal ***
[soleil:10959] Signal: Segmentation fault (11)
[soleil:10959] Signal code: Address not mapped (1)
[soleil:10959] Failing at address: 0x7fd969e75a98
[soleil:10959] [ 0] /lib64/libpthread.so.0[0x3c9da0f500]
[soleil:10959] [ 1] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x7bae5)[0x7fd96a55dae5]
[soleil:10959] [ 2] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x7ac97)[0x7fd96a55cc97]
[soleil:10959] [ 3] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_del+0x88)[0x7fd96a55ca15]
[soleil:10959] [ 4] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_free+0x132)[0x7fd96a558831]
[soleil:10959] [ 5] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x74126)[0x7fd96a556126]
[soleil:10959] [ 6] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(mca_base_framework_close+0xdd)[0x7fd96a54026f]
[soleil:10959] [ 7] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_finalize+0x7e)[0x7fd96a50d36e]
[soleil:10959] [ 8] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-rte.so.0(orte_finalize+0xd3)[0x7fd96a7ead2f]
[soleil:10959] [ 9] mpirun(orterun+0x1298)[0x404f0e]
[soleil:10959] [10] mpirun(main+0x20)[0x4038a4]
[soleil:10959] [11] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3c9d21ecdd]
[soleil:10959] [12] mpirun[0x4037c9]
[soleil:10959] *** End of error message ***
Segmentation fault (core dumped)

Gilles
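For context, a minimal sketch of the failure class this backtrace points at: an event's storage is released while the event base still knows about it, so event_base_free() later walks freed memory and ends up in event_del() on a garbage address. The sketch uses the plain libevent 2.x API rather than the opal_libevent2021_* wrappers and is an illustration of the crash class only, not the actual opal/event code.

#include <stdlib.h>
#include <event2/event.h>

static void cb(evutil_socket_t fd, short what, void *arg)
{
    (void) fd; (void) what; (void) arg;
}

int main(void)
{
    struct event_base *base = event_base_new();
    struct timeval tv = { 1, 0 };

    /* a simple one-shot timer registered on the base */
    struct event *ev = event_new(base, -1, 0, cb, NULL);
    event_add(ev, &tv);

    /* Wrong teardown order: the event's storage goes away while the base
     * still has it queued.  The correct order would be
     * event_del(ev); event_free(ev); event_base_free(base); */
    free(ev);

    /* Tearing the base down now dereferences freed memory (undefined
     * behaviour); in the trace above this shows up as a segfault inside
     * event_base_free() -> event_del(). */
    event_base_free(base);
    return 0;
}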
Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)
Nathan,

this had no effect on my environment :-(

I am not sure you can reuse mca_btl_scif_module.scif_fd with connect(); I had to use a new scif fd for that.

Then I ran into another glitch: if the listen thread does not scif_accept() the connection, the scif_connect() takes 30 seconds (the default timeout value, I guess). I fixed this in r31772.

Gilles

On 2014/05/15 1:16, Nathan Hjelm wrote:
> That is exactly how I decided to fix it. It looks like it is
> working. Please try r31755 when you get a chance.
>
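A hedged sketch of the connect-side pattern described above: open a fresh SCIF endpoint for the outgoing connection instead of reusing the module's listening endpoint. The calls and struct members follow the SCIF user API in <scif.h> as documented for MPSS and should be treated as assumptions; this is not the actual btl/scif code, and the port/node parameters are placeholders.

#include <stdint.h>
#include <stdio.h>
#include <scif.h>

scif_epd_t connect_to_peer(uint16_t peer_node, uint16_t peer_port)
{
    struct scif_portID dst;
    dst.node = peer_node;
    dst.port = peer_port;

    /* Do NOT reuse the listening endpoint (mca_btl_scif_module.scif_fd in
     * the mail above); open a separate endpoint just for this connection. */
    scif_epd_t epd = scif_open();
    if (SCIF_OPEN_FAILED == epd) {
        perror("scif_open");
        return SCIF_OPEN_FAILED;
    }

    /* If the peer's listen thread never calls scif_accept(), this call
     * blocks for the connection timeout (~30 s observed) before failing. */
    if (scif_connect(epd, &dst) < 0) {
        perror("scif_connect");
        scif_close(epd);
        return SCIF_OPEN_FAILED;
    }

    return epd;   /* caller uses this endpoint for scif_send()/scif_recv() */
}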
[OMPI devel] RFC: fix leak of bml endpoints
What: We never call del_procs on the procs in comm world. This leads us to leak the bml endpoints created by r2.

The proposed solution is not ideal, but it avoids adding a call to del_procs for comm world, something I know would require more discussion since there is likely a reason for that. I propose we delete any remaining bml endpoints when we tear down the ompi_proc_t:

diff --git a/ompi/proc/proc.c b/ompi/proc/proc.c
index f549335..9ea0311 100644
--- a/ompi/proc/proc.c
+++ b/ompi/proc/proc.c
@@ -89,6 +89,13 @@ void ompi_proc_destruct(ompi_proc_t* proc)
     OPAL_THREAD_LOCK(&ompi_proc_lock);
     opal_list_remove_item(&ompi_proc_list, (opal_list_item_t*)proc);
     OPAL_THREAD_UNLOCK(&ompi_proc_lock);
+
+#if defined(OMPI_PROC_ENDPOINT_TAG_BML)
+    /* release the bml endpoint if it still exists */
+    if (proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_BML]) {
+        OBJ_RELEASE(proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_BML]);
+    }
+#endif
 }

This fixes the leak and appears to have no negative side effects for r2.

Why: Trying to clean up the last remaining leaks in the Open MPI code base. This is one of the larger ones, as it grows with comm world.

When: I want this to go into 1.8.2 if possible. Setting a short timeout of 1 week.

Keep in mind I do not know the full history of add_procs/del_procs, so there may be a better way to fix this. This RFC is meant to open the discussion about how to address this leak. If the above fix looks OK I will commit it.

-Nathan
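For readers outside the OMPI object system, a minimal sketch of the reference counting the diff above relies on; the class here is made up for illustration and this is not code from the BML or from ompi/proc/proc.c.

#include "opal/class/opal_object.h"   /* OBJ_NEW / OBJ_RETAIN / OBJ_RELEASE */

/* hypothetical endpoint-like class; the refcount lives in the base object */
struct my_endpoint_t {
    opal_object_t super;
    int whatever;
};
typedef struct my_endpoint_t my_endpoint_t;
OBJ_CLASS_INSTANCE(my_endpoint_t, opal_object_t, NULL, NULL);

static void refcount_example(void)
{
    my_endpoint_t *ep = OBJ_NEW(my_endpoint_t);   /* refcount == 1 */
    OBJ_RETAIN(ep);                               /* refcount == 2, e.g. cached in proc_endpoints[] */

    OBJ_RELEASE(ep);                              /* refcount == 1 */
    OBJ_RELEASE(ep);                              /* refcount == 0: destructor runs, memory freed */

    /* The diff performs that last OBJ_RELEASE in ompi_proc_destruct() only
     * if the endpoint slot is still populated, i.e. if del_procs never
     * dropped its reference, which is exactly the leak being discussed. */
}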
Re: [OMPI devel] r31765 causes crash in mpirun
I fixed this by reverting r31765 in r31775. Annotated ticket with explanation.

On May 15, 2014, at 1:20 AM, Gilles Gouaillardet wrote:

> Folks,
>
> since r31765 (opal/event: release the opal event context when closing
> the event base)
> mpirun crashes at the end of the job.
>
> for example :
>
> $ mpirun --mca btl tcp,self -n 4 `pwd`/src/MPI_Allreduce_user_c
> MPITEST info (0): Starting MPI_Allreduce_user() test
> MPITEST_results: MPI_Allreduce_user() all tests PASSED (7076)
> [soleil:10959] *** Process received signal ***
> [soleil:10959] Signal: Segmentation fault (11)
> [soleil:10959] Signal code: Address not mapped (1)
> [soleil:10959] Failing at address: 0x7fd969e75a98
> [soleil:10959] [ 0] /lib64/libpthread.so.0[0x3c9da0f500]
> [soleil:10959] [ 1] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x7bae5)[0x7fd96a55dae5]
> [soleil:10959] [ 2] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x7ac97)[0x7fd96a55cc97]
> [soleil:10959] [ 3] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_del+0x88)[0x7fd96a55ca15]
> [soleil:10959] [ 4] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_free+0x132)[0x7fd96a558831]
> [soleil:10959] [ 5] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x74126)[0x7fd96a556126]
> [soleil:10959] [ 6] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(mca_base_framework_close+0xdd)[0x7fd96a54026f]
> [soleil:10959] [ 7] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_finalize+0x7e)[0x7fd96a50d36e]
> [soleil:10959] [ 8] /csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-rte.so.0(orte_finalize+0xd3)[0x7fd96a7ead2f]
> [soleil:10959] [ 9] mpirun(orterun+0x1298)[0x404f0e]
> [soleil:10959] [10] mpirun(main+0x20)[0x4038a4]
> [soleil:10959] [11] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3c9d21ecdd]
> [soleil:10959] [12] mpirun[0x4037c9]
> [soleil:10959] *** End of error message ***
> Segmentation fault (core dumped)
>
> Gilles
Re: [OMPI devel] RFC: fix leak of bml endpoints
The solution you propose here is definitively not OK. It is 1) ugly and 2) breaks the separation barrier that we hold dear.

Regarding your other suggestion, I don’t see any reasons not to call the delete_proc on MPI_COMM_WORLD as the last action we do before tearing down everything else.

  George.

On May 15, 2014, at 11:22, Nathan Hjelm wrote:

>
> What: We never call del_procs on the procs in comm world. This leads us
> to leak the bml endpoints created by r2.
>
> The proposed solution is not ideal but it avoids adding a call to
> del_procs for comm world. Something I know would require more discussion
> since there is likely a reason for that. I propose we delete any
> remaining bml endpoints when we tear down the ompi_proc_t:
>
> diff --git a/ompi/proc/proc.c b/ompi/proc/proc.c
> index f549335..9ea0311 100644
> --- a/ompi/proc/proc.c
> +++ b/ompi/proc/proc.c
> @@ -89,6 +89,13 @@ void ompi_proc_destruct(ompi_proc_t* proc)
>      OPAL_THREAD_LOCK(&ompi_proc_lock);
>      opal_list_remove_item(&ompi_proc_list, (opal_list_item_t*)proc);
>      OPAL_THREAD_UNLOCK(&ompi_proc_lock);
> +
> +#if defined(OMPI_PROC_ENDPOINT_TAG_BML)
> +    /* release the bml endpoint if it still exists */
> +    if (proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_BML]) {
> +        OBJ_RELEASE(proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_BML]);
> +    }
> +#endif
>  }
>
> This fixes the leak and appears to have no negative side effects for
> r2.
>
> Why: Trying to clean up the last remaining leaks in the Open MPI code
> base. This is one of the larger ones as it grows with comm world.
>
> When: I want this to go into 1.8.2 if possible. Setting a short timeout
> of 1 week.
>
> Keep in mind I do not know the full history of add_procs/del_procs so
> there may be a better way to fix this. This RFC is meant to open the
> discussion about how to address this leak. If the above fix looks OK I
> will commit it.
>
> -Nathan
Re: [OMPI devel] RFC: fix leak of bml endpoints
On Thu, May 15, 2014 at 01:33:31PM -0400, George Bosilca wrote:
> The solution you propose here is definitively not OK. It is 1) ugly and 2)
> breaks the separation barrier that we hold dear.

Which is why I asked :)

> Regarding your other suggestion I don’t see any reasons not to call the
> delete_proc on MPI_COMM_WORLD as the last action we do before tearing down
> everything else.

I spoke too soon. It looks like we *are* calling del_procs but I am not seeing the call reach the bml. I will try and track this down.

-Nathan
Re: [OMPI devel] RFC: fix leak of bml endpoints
On Thu, May 15, 2014 at 11:44:05AM -0600, Nathan Hjelm wrote:
> On Thu, May 15, 2014 at 01:33:31PM -0400, George Bosilca wrote:
> > The solution you propose here is definitively not OK. It is 1) ugly and 2)
> > breaks the separation barrier that we hold dear.
>
> Which is why I asked :)
>
> > Regarding your other suggestion I don’t see any reasons not to call the
> > delete_proc on MPI_COMM_WORLD as the last action we do before tearing down
> > everything else.
>
> I spoke too soon. It looks like we *are* calling del_procs but I am not
> seeing the call reach the bml. I will try and track this down.

s/bml/btl/ .. I see what is happening. The proc reference counts are all larger than 1 when we call del_procs:

[1,2]:Deleting proc 0x7b83190 with reference count 5
[1,1]:Deleting proc 0x7b83180 with reference count 5
[1,2]:Deleting proc 0x7b832b0 with reference count 5
[1,1]:Deleting proc 0x7b832a0 with reference count 7
[1,2]:Deleting proc 0x7b83360 with reference count 7
[1,1]:Deleting proc 0x7b833a0 with reference count 5
[1,0]:Deleting proc 0x7b83190 with reference count 7
[1,0]:Deleting proc 0x7b83300 with reference count 5
[1,0]:Deleting proc 0x7b833b0 with reference count 5

I will track that down.

-Nathan
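For illustration, one way a "Deleting proc ... with reference count N" trace like the one above could be produced; the obj_reference_count field path into opal_object_t is an assumption about the tree of that era, and this is not claimed to be the instrumentation Nathan actually used.

#include <stdio.h>
#include "opal/class/opal_object.h"
#include "ompi/proc/proc.h"

/* dump the current reference count of each proc before releasing it;
 * an ompi_proc_t starts with opal_object_t, so the cast below is valid */
static void dump_proc_refcounts(ompi_proc_t **procs, size_t nprocs)
{
    for (size_t i = 0; i < nprocs; ++i) {
        opal_object_t *obj = (opal_object_t *) procs[i];
        fprintf(stderr, "Deleting proc %p with reference count %d\n",
                (void *) procs[i], (int) obj->obj_reference_count);
    }
}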
Re: [OMPI devel] RFC: fix leak of bml endpoints
Ok, figured it out. There were three problems with the del_procs code:

1) ompi_mpi_finalize used ompi_proc_all to get the list of procs but never released the reference to them (ompi_proc_all calls OBJ_RETAIN on all the procs returned). When calling del_procs at finalize it should suffice to call ompi_proc_world, which does not increment the reference count.

2) del_procs is called BEFORE ompi_comm_finalize. This leaves the references to the procs taken by the pml_add_comm function. The fix is to reorder the calls to ompi_comm_finalize, del_procs, pml_finalize instead of del_procs, pml_finalize, ompi_comm_finalize.

3) The check in del_procs in r2 looked for a reference count of 1. This is incorrect. At this point there should be 2 references: 1 from ompi_proc, and another from add_procs. The fix is to change this check to look for 2. This check makes me extremely uncomfortable, as nothing will call del_procs if the reference count of a proc is not 2 when del_procs is called. Maybe there should be an assert, since this is a developer error IMHO.

Committing a patch to fix all three of these issues.

-Nathan

On Thu, May 15, 2014 at 11:52:27AM -0600, Nathan Hjelm wrote:
> On Thu, May 15, 2014 at 11:44:05AM -0600, Nathan Hjelm wrote:
> > On Thu, May 15, 2014 at 01:33:31PM -0400, George Bosilca wrote:
> > > The solution you propose here is definitively not OK. It is 1) ugly and
> > > 2) breaks the separation barrier that we hold dear.
> >
> > Which is why I asked :)
> >
> > > Regarding your other suggestion I don’t see any reasons not to call the
> > > delete_proc on MPI_COMM_WORLD as the last action we do before tearing
> > > down everything else.
> >
> > I spoke too soon. It looks like we *are* calling del_procs but I am not
> > seeing the call reach the bml. I will try and track this down.
>
> s/bml/btl/ .. I see what is happening. The proc reference counts are all
> larger than 1 when we call del_procs:
>
> [1,2]:Deleting proc 0x7b83190 with reference count 5
> [1,1]:Deleting proc 0x7b83180 with reference count 5
> [1,2]:Deleting proc 0x7b832b0 with reference count 5
> [1,1]:Deleting proc 0x7b832a0 with reference count 7
> [1,2]:Deleting proc 0x7b83360 with reference count 7
> [1,1]:Deleting proc 0x7b833a0 with reference count 5
> [1,0]:Deleting proc 0x7b83190 with reference count 7
> [1,0]:Deleting proc 0x7b83300 with reference count 5
> [1,0]:Deleting proc 0x7b833b0 with reference count 5
>
> I will track that down.
>
> -Nathan
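A hedged sketch of the finalize ordering described in points 1) and 2) above. The helper names come from the mail; the exact signatures, headers, and the MCA_PML_CALL dispatch are assumptions about the tree of that era, so treat this as an illustration rather than the committed patch.

#include <stdlib.h>
#include "ompi/constants.h"
#include "ompi/proc/proc.h"
#include "ompi/communicator/communicator.h"
#include "ompi/mca/pml/pml.h"

static int finalize_teardown_sketch(void)
{
    size_t nprocs;
    ompi_proc_t **procs;

    /* (2) Tear communicators down first so the proc references taken by
     *     pml_add_comm() are dropped before del_procs runs. */
    ompi_comm_finalize();

    /* (1) Use ompi_proc_world(), which does not OBJ_RETAIN the procs, so we
     *     do not add a reference that would then never be released. */
    procs = ompi_proc_world(&nprocs);
    if (NULL != procs) {
        /* (3) With the ordering above, each proc should now hold exactly two
         *     references (ompi_proc list + add_procs), which is what the
         *     corrected check in r2's del_procs looks for. */
        MCA_PML_CALL(del_procs(procs, nprocs));
        free(procs);
    }

    /* only now shut the PML itself down */
    /* ... pml framework finalize goes here ... */

    return OMPI_SUCCESS;
}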