I believe I have this fixed - please see if this solves the problem: https://github.com/open-mpi/ompi/pull/730
> On Jul 21, 2015, at 12:22 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Ralph,
> 
> here is some more detailed information.
> 
> In orte_ess_base_app_finalize(), orte_rml_base_close() is invoked first
> (via mca_base_framework_close(&orte_rml_base_framework)), and it does
> 
>     while (NULL != (item = opal_list_remove_first(&orte_rml_base.posted_recvs))) {
>         OBJ_RELEASE(item);
>     }
> 
> and only then is opal_stop_progress_thread() invoked.
> 
> That means that when orte_rml_base_close() runs, the progress thread is
> still up and running, and can potentially invoke orte_rml_base_post_recv(),
> which does
> 
>     if (req->cancel) {
>         OPAL_LIST_FOREACH(recv, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
>             if (OPAL_EQUAL == orte_util_compare_name_fields(mask, &post->peer, &recv->peer) &&
>                 post->tag == recv->tag) {
>                 opal_output_verbose(5, orte_rml_base_framework.framework_output,
>                                     "%s canceling recv %d for peer %s",
>                                     ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>                                     post->tag, ORTE_NAME_PRINT(&recv->peer));
>                 /* got a match - remove it */
>                 opal_list_remove_item(&orte_rml_base.posted_recvs, &recv->super);
>                 OBJ_RELEASE(recv);
>                 break;
>             }
>         }
>         OBJ_RELEASE(req);
>         return;
>     }
> 
> /* this is where the assert fails */
> 
> Since there is no lock, there is room for a race condition.
> 
> Before all that happens, still in orte_ess_base_app_finalize(),
> mca_base_framework_close(&orte_grpcomm_base_framework) invokes the finalize
> function in grpcomm_rcd.c, which calls
> orte_rml.recv_cancel(ORTE_NAME_WILDCARD, ORTE_RML_TAG_ALLGATHER_RCD), i.e.
> orte_rml_oob_recv_cancel(), which ends up doing
> opal_event_set(..., orte_rml_base_post_recv).
> 
> My first and naive attempt was to stop the opal_async progress thread
> before closing the rml_base framework:
> 
> diff --git a/orte/mca/ess/base/ess_base_std_app.c b/orte/mca/ess/base/ess_base_std_app.c
> index d3bc6e6..4e09b47 100644
> --- a/orte/mca/ess/base/ess_base_std_app.c
> +++ b/orte/mca/ess/base/ess_base_std_app.c
> @@ -353,18 +353,18 @@ int orte_ess_base_app_finalize(void)
>      (void) mca_base_framework_close(&orte_dfs_base_framework);
>      (void) mca_base_framework_close(&orte_routed_base_framework);
>  
> -    (void) mca_base_framework_close(&orte_rml_base_framework);
> -    (void) mca_base_framework_close(&orte_oob_base_framework);
> -    (void) mca_base_framework_close(&orte_state_base_framework);
> -
> -    orte_session_dir_finalize(ORTE_PROC_MY_NAME);
> -
>      /* release the event base */
>      if (progress_thread_running) {
>          opal_stop_progress_thread("opal_async", true);
>          progress_thread_running = false;
>      }
>  
> +    (void) mca_base_framework_close(&orte_rml_base_framework);
> +    (void) mca_base_framework_close(&orte_oob_base_framework);
> +    (void) mca_base_framework_close(&orte_state_base_framework);
> +
> +    orte_session_dir_finalize(ORTE_PROC_MY_NAME);
> +
>      return ORTE_SUCCESS;
>  }
> 
> That did not work: the opal_async progress thread is also used by pmix, so
> at this stage, invoking opal_stop_progress_thread() only decrements the
> refcount (i.e. there is no pthread_join()).
> 
> My second and dumb attempt was to finalize pmix before ess_base_app, and
> that did not work either (crash).
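For context on the race described above: the pattern is one thread draining a
shared list while another thread may still be posting to it or canceling from
it. Below is a minimal, self-contained pthread sketch of the lock-based
alternative implied by the observation that "there is no lock". It is not
OPAL code; the list, the function names, and the closed flag are all
hypothetical stand-ins for opal_list_t and the RML teardown path.

    #include <pthread.h>
    #include <stdlib.h>

    /* hypothetical stand-in for opal_list_t: a mutex-guarded linked list */
    typedef struct item { struct item *next; } item_t;

    static item_t         *posted_recvs = NULL;
    static pthread_mutex_t posted_lock  = PTHREAD_MUTEX_INITIALIZER;
    static int             closed       = 0;  /* set once teardown is done */

    /* event callback: only touches the list while holding the lock, and
     * refuses to post once the list has been torn down */
    void post_recv(item_t *req)
    {
        pthread_mutex_lock(&posted_lock);
        if (!closed) {
            req->next    = posted_recvs;
            posted_recvs = req;
            req = NULL;            /* ownership transferred to the list */
        }
        pthread_mutex_unlock(&posted_lock);
        free(req);                 /* dropped if we lost the race with close */
    }

    /* finalizer: drains the list under the same lock, then marks it closed
     * so a late callback cannot resurrect it */
    void rml_close(void)
    {
        pthread_mutex_lock(&posted_lock);
        closed = 1;
        while (NULL != posted_recvs) {
            item_t *item = posted_recvs;
            posted_recvs = item->next;
            free(item);            /* stand-in for OBJ_RELEASE(item) */
        }
        pthread_mutex_unlock(&posted_lock);
    }

    int main(void)
    {
        post_recv(malloc(sizeof(item_t)));  /* would run on the progress thread */
        rml_close();                        /* would run on the main thread */
        return 0;
    }

With both sides serialized on one mutex plus a closed flag, the callback can
no longer dereference a list the finalizer has already drained; whether such
locking is acceptable inside OPAL's event callbacks is a separate question.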
> I ran out of ideas on how to fix this issue, but I found a simple
> workaround: adding a short sleep (10 ms) in orte_rml_base_close() seems to
> be enough to avoid the race condition:
> 
> diff --git a/orte/mca/rml/base/rml_base_frame.c b/orte/mca/rml/base/rml_base_frame.c
> index 54dc454..050154c 100644
> --- a/orte/mca/rml/base/rml_base_frame.c
> +++ b/orte/mca/rml/base/rml_base_frame.c
> @@ -17,6 +17,7 @@
>  
>  #include "orte_config.h"
>  
> +#include <sys/poll.h>
>  #include <string.h>
>  
>  #include "opal/dss/dss.h"
> @@ -78,6 +79,7 @@ static int orte_rml_base_close(void)
>  {
>      opal_list_item_t *item;
>  
> +    poll(NULL, 0, 10);
>      while (NULL != (item = opal_list_remove_first(&orte_rml_base.posted_recvs))) {
>          OBJ_RELEASE(item);
>      }
> 
> Incidentally, I found two OPAL_LIST_FOREACH "loops" in which
> opal_list_remove_item() is invoked. Per a comment in opal_list.h, this is
> unsafe, and OPAL_LIST_FOREACH_SAFE should be used instead:
> 
> diff --git a/orte/mca/rml/base/rml_base_msg_handlers.c b/orte/mca/rml/base/rml_base_msg_handlers.c
> index 758bf91..22c7601 100644
> --- a/orte/mca/rml/base/rml_base_msg_handlers.c
> +++ b/orte/mca/rml/base/rml_base_msg_handlers.c
> @@ -12,7 +12,9 @@
>   *                         All rights reserved.
>   * Copyright (c) 2007-2013 Los Alamos National Security, LLC.  All rights
>   *                         reserved.
> - * Copyright (c) 2015      Intel, Inc. All rights reserved.
> + * Copyright (c) 2015      Intel, Inc. All rights reserved.
> + * Copyright (c) 2015      Research Organization for Information Science
> + *                         and Technology (RIST). All rights reserved.
>   * $COPYRIGHT$
>   *
>   * Additional copyrights may follow
> @@ -55,7 +57,7 @@ static void msg_match_recv(orte_rml_posted_recv_t *rcv, bool get_all);
>  void orte_rml_base_post_recv(int sd, short args, void *cbdata)
>  {
>      orte_rml_recv_request_t *req = (orte_rml_recv_request_t*)cbdata;
> -    orte_rml_posted_recv_t *post, *recv;
> +    orte_rml_posted_recv_t *post, *recv, *next;
>      orte_ns_cmp_bitmask_t mask = ORTE_NS_CMP_ALL | ORTE_NS_CMP_WILD;
>  
>      opal_output_verbose(5, orte_rml_base_framework.framework_output,
> @@ -75,7 +77,7 @@ void orte_rml_base_post_recv(int sd, short args, void *cbdata)
>       * and remove it from our list
>       */
>      if (req->cancel) {
> -        OPAL_LIST_FOREACH(recv, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
> +        OPAL_LIST_FOREACH_SAFE(recv, next, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
>              if (OPAL_EQUAL == orte_util_compare_name_fields(mask, &post->peer, &recv->peer) &&
>                  post->tag == recv->tag) {
>                  opal_output_verbose(5, orte_rml_base_framework.framework_output,
> @@ -120,12 +122,12 @@ void orte_rml_base_post_recv(int sd, short args, void *cbdata)
>  
>  void orte_rml_base_complete_recv_msg (orte_rml_recv_t **recv_msg)
>  {
> -    orte_rml_posted_recv_t *post;
> +    orte_rml_posted_recv_t *post, *next;
>      orte_ns_cmp_bitmask_t mask = ORTE_NS_CMP_ALL | ORTE_NS_CMP_WILD;
>      opal_buffer_t buf;
>      orte_rml_recv_t *msg = *recv_msg;
>      /* see if we have a waiting recv for this message */
> -    OPAL_LIST_FOREACH(post, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
> +    OPAL_LIST_FOREACH_SAFE(post, next, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
>          /* since names could include wildcards, must use
>           * the more generalized comparison function
>           */
> 
> I hope this helps,
> 
> Gilles
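On the OPAL_LIST_FOREACH point above: the plain variant re-reads the current
element's next pointer after the loop body has run, so releasing that element
inside the body is a use-after-free. The following simplified sketch of the
two macro shapes (hypothetical names, not the actual opal_list.h
implementation) shows why the _SAFE form tolerates removal of the current
element:

    #include <stdlib.h>

    typedef struct node { struct node *next; } node_t;

    /* simplified analogue of OPAL_LIST_FOREACH: reads item->next at the top
     * of every iteration, so freeing 'item' inside the body is a
     * use-after-free on the next iteration */
    #define LIST_FOREACH(item, head) \
        for ((item) = (head); NULL != (item); (item) = (item)->next)

    /* simplified analogue of OPAL_LIST_FOREACH_SAFE: caches the successor in
     * 'tmp' before the body runs, so the body may free the current element */
    #define LIST_FOREACH_SAFE(item, tmp, head)                          \
        for ((item) = (head), (tmp) = (item) ? (item)->next : NULL;     \
             NULL != (item);                                            \
             (item) = (tmp), (tmp) = (item) ? (item)->next : NULL)

    /* remove every element: correct only with the _SAFE form */
    static void drain(node_t **head)
    {
        node_t *cur, *next;
        LIST_FOREACH_SAFE(cur, next, *head) {
            free(cur);   /* with LIST_FOREACH, the next iteration would
                          * read cur->next after this free */
        }
        *head = NULL;
    }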
> On 7/17/2015 11:04 PM, Ralph Castain wrote:
>> It’s probably a race condition caused by uniting the ORTE and OPAL async
>> threads, though I can’t confirm that yet.
>> 
>>> On Jul 17, 2015, at 3:11 AM, Gilles Gouaillardet
>>> <gilles.gouaillar...@gmail.com> wrote:
>>> 
>>> Folks,
>>> 
>>> I noticed several errors such as
>>> http://mtt.open-mpi.org/index.php?do_redir=2244
>>> that did not make any sense to me (at first glance).
>>> 
>>> I was able to attach to one process when the issue occurred.
>>> The sigsegv occurs in thread 2, while thread 1 is invoking
>>> ompi_rte_finalize.
>>> 
>>> All I can think of is a scenario in which the progress thread (aka
>>> thread 2) is still dealing with some memory that was just
>>> freed/unmapped/corrupted by the main thread.
>>> 
>>> I empirically noticed the error is more likely to occur when there are
>>> many tasks on one node, e.g.
>>> 
>>>     mpirun --oversubscribe -np 32 a.out
>>> 
>>> Cheers,
>>> 
>>> Gilles
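The failure mode in the original report above, a progress thread
dereferencing memory the main thread released during finalize, can be shown
in miniature outside of Open MPI. The standalone pthread program below is a
contrived sketch, not Open MPI code, and every name in it is hypothetical;
it demonstrates the shutdown ordering that avoids the crash, and the comment
marks the reordering that would reproduce it.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* hypothetical stand-in for the state a progress thread polls */
    typedef struct {
        atomic_bool stop;
        int         pending;
    } progress_state_t;

    static void *progress_loop(void *arg)
    {
        progress_state_t *state = arg;
        while (!atomic_load(&state->stop)) {
            state->pending++;     /* worker still touching shared state */
            usleep(1000);
        }
        return NULL;
    }

    int main(void)
    {
        progress_state_t *state = calloc(1, sizeof(*state));
        pthread_t tid;

        atomic_init(&state->stop, false);
        pthread_create(&tid, NULL, progress_loop, state);
        usleep(10 * 1000);

        /* correct shutdown order: signal, join, then free. Freeing 'state'
         * before pthread_join() reproduces the reported class of crash:
         * the progress thread dereferences memory main already released. */
        atomic_store(&state->stop, true);
        pthread_join(tid, NULL);
        free(state);
        return 0;
    }

The point of the sketch is only the ordering: nothing the worker polls may be
released until the worker has been joined (or otherwise proven idle).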