Re: [OMPI devel] race condition in finalize

2015-07-22 Thread Ralph Castain
Thanks Gilles! The “clean” event doesn’t have to go last - any messages that arrive after all recvs have been removed will simply be dropped upon termination. This commit only ensures that the list of posted recvs is cleanly destructed, which will prevent the segfault. > On Jul 22, 2015, at 1
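To illustrate why the ordering of the "clean" event need not matter, here is a minimal single-threaded sketch in C. It assumes events are handled one at a time in queue order; every name in it (event_t, progress_state_t, post_recv, handle_event, run_events) is hypothetical and is not the ORTE/OPAL API or the code from the commit. A message handled after the clean event finds no posted recv and is counted as dropped.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RECVS 8

/* Hypothetical event kinds: an incoming message, or the "clean" event. */
typedef enum { EV_MESSAGE, EV_CLEAN } event_kind_t;

typedef struct {
    event_kind_t kind;
    int tag;                      /* only meaningful for EV_MESSAGE */
} event_t;

/* Hypothetical state owned by the progress loop. */
typedef struct {
    int posted_tags[MAX_RECVS];   /* stand-in for the list of posted recvs */
    size_t num_posted;
    size_t delivered;
    size_t dropped;
} progress_state_t;

/* Register interest in a tag. Returns 0 on success, -1 if the table is full. */
static int post_recv(progress_state_t *st, int tag)
{
    if (st->num_posted >= MAX_RECVS) {
        return -1;
    }
    st->posted_tags[st->num_posted++] = tag;
    return 0;
}

/* Handle one event. The clean event empties the posted-recv table;
 * a message with no matching posted recv is dropped, not dereferenced. */
static void handle_event(progress_state_t *st, const event_t *ev)
{
    if (EV_CLEAN == ev->kind) {
        st->num_posted = 0;
        return;
    }
    for (size_t i = 0; i < st->num_posted; i++) {
        if (st->posted_tags[i] == ev->tag) {
            st->delivered++;
            return;
        }
    }
    st->dropped++;
}

/* Process events strictly in queue order, as a single progress thread would. */
static void run_events(progress_state_t *st, const event_t *events, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        handle_event(st, &events[i]);
    }
}
```

Queuing a message, the clean event, and then two more messages leaves one delivered and two dropped, with nothing touching the emptied table.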

Re: [OMPI devel] race condition in finalize

2015-07-22 Thread Gilles Gouaillardet
Thanks Ralph, I was unable to reproduce any crash with this fix :-) I checked the code that is invoked in the progress thread, and it might queue other events. Bottom line, I am not 100% convinced the "clean" event will be executed last. That being said, and once again, I was una

Re: [OMPI devel] race condition in finalize

2015-07-21 Thread Ralph Castain
I believe I have this fixed - please see if this solves the problem: https://github.com/open-mpi/ompi/pull/730 > On Jul 21, 2015, at 12:22 AM, Gilles Gouaillardet wrote: > > Ralph, > > here is some more detailed information. > > > from orte_ess_b

Re: [OMPI devel] race condition in finalize

2015-07-21 Thread Gilles Gouaillardet
Ralph, here is some more detailed information. From orte_ess_base_app_finalize(), first orte_rml_base_close() is invoked (via mca_base_framework_close(&orte_rml_base_framework)), and it does while (NULL != (item = opal_list_remove_first(&orte_rml_base.posted_recvs))) { OBJ_RELEASE(
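For readers unfamiliar with the pattern, the drain loop quoted above can be sketched with a plain singly linked list. This is only an illustration under stated assumptions: recv_item_t, recv_list_t, recv_list_push, recv_list_remove_first and recv_list_drain are hypothetical stand-ins, not the opal_list_t or OBJ_RELEASE implementation, and free() stands in for dropping the last reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a posted receive; not the ORTE type. */
typedef struct recv_item {
    struct recv_item *next;
} recv_item_t;

/* Hypothetical stand-in for a list such as posted_recvs. */
typedef struct {
    recv_item_t *head;
    size_t length;
} recv_list_t;

/* Allocate an item and put it at the front. Returns 0 on success. */
static int recv_list_push(recv_list_t *list)
{
    recv_item_t *item = malloc(sizeof(*item));
    if (NULL == item) {
        return -1;
    }
    item->next = list->head;
    list->head = item;
    list->length++;
    return 0;
}

/* Detach and return the first item, or NULL when the list is empty. */
static recv_item_t *recv_list_remove_first(recv_list_t *list)
{
    recv_item_t *item = list->head;
    if (NULL != item) {
        list->head = item->next;
        item->next = NULL;
        list->length--;
    }
    return item;
}

/* Drain loop with the same shape as the quoted while loop.
 * Returns the number of items released. Not thread safe. */
static size_t recv_list_drain(recv_list_t *list)
{
    size_t released = 0;
    recv_item_t *item;
    while (NULL != (item = recv_list_remove_first(list))) {
        free(item);   /* stand-in for releasing the object */
        released++;
    }
    return released;
}
```

Nothing in this sketch takes a lock, so it is only safe if no other thread touches the list while the loop runs. That is the kind of situation the race discussed in this thread would violate.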

Re: [OMPI devel] race condition in finalize

2015-07-17 Thread Ralph Castain
It’s probably a race condition caused by uniting the ORTE and OPAL async threads, though I can’t confirm that yet. > On Jul 17, 2015, at 3:11 AM, Gilles Gouaillardet wrote: > > Folks, > > I noticed several errors such as > http://mtt.open-mpi.org/index.php?do_redir=2244 >

[OMPI devel] race condition in finalize

2015-07-17 Thread Gilles Gouaillardet
Folks, I noticed several errors such as http://mtt.open-mpi.org/index.php?do_redir=2244 that did not make any sense to me (at first glance). I was able to attach to one process when the issue occurs. The SIGSEGV occurs in thread 2, while thread 1 is invoking ompi_rte_finalize. All I can think is a s