On Jul 21, 2015, at 12:22 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
Ralph,
Here is some more detailed information.
In orte_ess_base_app_finalize(), orte_rml_base_close() is invoked first (via
mca_base_framework_close(&orte_rml_base_framework)), and it does

    while (NULL != (item = opal_list_remove_first(&orte_rml_base.posted_recvs))) {
        OBJ_RELEASE(item);
    }
Only then is opal_stop_progress_thread() invoked. That means that while
orte_rml_base_close() runs, the progress thread is still up and running, and can
concurrently invoke orte_rml_base_post_recv(), which does
    if (req->cancel) {
        OPAL_LIST_FOREACH(recv, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
            if (OPAL_EQUAL == orte_util_compare_name_fields(mask, &post->peer, &recv->peer) &&
                post->tag == recv->tag) {
                opal_output_verbose(5, orte_rml_base_framework.framework_output,
                                    "%s canceling recv %d for peer %s",
                                    ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                                    post->tag,
                                    ORTE_NAME_PRINT(&recv->peer));
                /* got a match - remove it */
                opal_list_remove_item(&orte_rml_base.posted_recvs, &recv->super);
                OBJ_RELEASE(recv);
                break;
            }
        }
        OBJ_RELEASE(req);
        return;
    }
This is where the assert fails: since there is no lock, there is room for a race condition.
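To make the window concrete, here is a minimal sketch (generic pthread code with
hypothetical names, not the actual ORTE code or fix): drain() plays the role of
orte_rml_base_close() on the main thread, cancel() plays the role of the cancel
branch of orte_rml_base_post_recv() on the progress thread; the shared list is
only safe here because both sides take the same lock, and without it the two
traversals can interleave and corrupt the list, which is presumably what the
failing assert catches.

    #include <pthread.h>
    #include <stddef.h>

    struct node { struct node *next; };            /* stand-in for orte_rml_posted_recv_t */

    static struct node *posted = NULL;             /* stand-in for orte_rml_base.posted_recvs */
    static pthread_mutex_t posted_lock = PTHREAD_MUTEX_INITIALIZER;

    /* main thread, during close: drain the whole list */
    static void drain(void)
    {
        pthread_mutex_lock(&posted_lock);
        while (posted != NULL) {
            struct node *n = posted;
            posted = n->next;
            /* OBJ_RELEASE(n) in the real code */
        }
        pthread_mutex_unlock(&posted_lock);
    }

    /* progress thread, from the deferred event: unlink one matching element */
    static void cancel(struct node *victim)
    {
        pthread_mutex_lock(&posted_lock);
        for (struct node **p = &posted; *p != NULL; p = &(*p)->next) {
            if (*p == victim) {
                *p = victim->next;                 /* only safe because drain() is excluded */
                break;
            }
        }
        pthread_mutex_unlock(&posted_lock);
    }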
Before all that happens, still inside orte_ess_base_app_finalize(),
mca_base_framework_close(&orte_grpcomm_base_framework) invokes the finalize
function from grpcomm_rcd.c, which calls

    orte_rml.recv_cancel(ORTE_NAME_WILDCARD, ORTE_RML_TAG_ALLGATHER_RCD)

i.e. orte_rml_oob_recv_cancel(), which ends up doing
opal_event_set(..., orte_rml_base_post_recv).
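So the cancel is not performed where recv_cancel() is called; it is only queued as
an event on the progress thread's event base and runs later on that thread. Roughly
this pattern, sketched here with plain libevent and made-up names rather than the
opal wrappers:

    #include <stdlib.h>
    #include <event2/event.h>

    struct cancel_req {
        int tag;
        struct event *ev;
    };

    /* runs later, in the thread that drives the event base (the progress thread);
     * this is where the shared posted_recvs list would actually be walked */
    static void post_recv_cb(evutil_socket_t fd, short what, void *arg)
    {
        struct cancel_req *req = arg;
        /* ... remove the posted recv matching req->tag from the shared list ... */
        event_free(req->ev);
        free(req);
    }

    /* called from the finalizing component (e.g. grpcomm): returns immediately,
     * the real work is deferred to the progress thread */
    static void recv_cancel(struct event_base *base, int tag)
    {
        struct cancel_req *req = malloc(sizeof(*req));
        req->tag = tag;
        req->ev = event_new(base, -1, 0, post_recv_cb, req);
        event_active(req->ev, EV_WRITE, 1);
    }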
My first (naive) attempt was to stop the opal_async progress thread before closing
the rml_base framework:
diff --git a/orte/mca/ess/base/ess_base_std_app.c b/orte/mca/ess/base/ess_base_std_app.c
index d3bc6e6..4e09b47 100644
--- a/orte/mca/ess/base/ess_base_std_app.c
+++ b/orte/mca/ess/base/ess_base_std_app.c
@@ -353,18 +353,18 @@ int orte_ess_base_app_finalize(void)
(void) mca_base_framework_close(&orte_dfs_base_framework);
(void) mca_base_framework_close(&orte_routed_base_framework);
- (void) mca_base_framework_close(&orte_rml_base_framework);
- (void) mca_base_framework_close(&orte_oob_base_framework);
- (void) mca_base_framework_close(&orte_state_base_framework);
-
- orte_session_dir_finalize(ORTE_PROC_MY_NAME);
-
/* release the event base */
if (progress_thread_running) {
opal_stop_progress_thread("opal_async", true);
progress_thread_running = false;
}
+ (void) mca_base_framework_close(&orte_rml_base_framework);
+ (void) mca_base_framework_close(&orte_oob_base_framework);
+ (void) mca_base_framework_close(&orte_state_base_framework);
+
+ orte_session_dir_finalize(ORTE_PROC_MY_NAME);
+
return ORTE_SUCCESS;
}
That did not work: the opal_async progress thread is also used by pmix, so at this
stage invoking opal_stop_progress_thread() only decrements the refcount (i.e. there
is no pthread_join()).
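In other words, the stop call behaves roughly like the sketch below (illustrative
names, not the actual opal progress-thread API): as long as another user such as
pmix still holds a reference, nothing is joined and the thread keeps running.

    #include <pthread.h>
    #include <stdbool.h>

    struct progress_thread {
        pthread_t     tid;
        int           refcount;      /* one per user: orte, pmix, ... */
        volatile bool done;
    };

    static void stop_progress_thread(struct progress_thread *pt)
    {
        if (--pt->refcount > 0) {
            return;                  /* pmix still holds a reference: no join */
        }
        pt->done = true;             /* ask the event loop to exit ... */
        pthread_join(pt->tid, NULL); /* ... and only the last user waits for it */
    }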
My second (dumb) attempt was to finalize pmix before ess_base_app, and that did not
work either (crash).
I ran out of ideas on how to fix this issue properly, but I found a simple
workaround: adding a short sleep (10 ms) in orte_rml_base_close() seems to be
enough to avoid the race condition.
diff --git a/orte/mca/rml/base/rml_base_frame.c b/orte/mca/rml/base/rml_base_frame.c
index 54dc454..050154c 100644
--- a/orte/mca/rml/base/rml_base_frame.c
+++ b/orte/mca/rml/base/rml_base_frame.c
@@ -17,6 +17,7 @@
#include "orte_config.h"
+#include <sys/poll.h>
#include <string.h>
#include "opal/dss/dss.h"
@@ -78,6 +79,7 @@ static int orte_rml_base_close(void)
{
opal_list_item_t *item;
+ poll(NULL,0,10);
while (NULL != (item = opal_list_remove_first(&orte_rml_base.posted_recvs))) {
OBJ_RELEASE(item);
}
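(For reference, poll() with no file descriptors is just a portable millisecond
sleep; a nanosleep() equivalent, with a helper name made up for illustration, would
look like this:)

    #include <time.h>

    static void sleep_ms(long ms)
    {
        struct timespec ts = { .tv_sec = ms / 1000, .tv_nsec = (ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);        /* poll(NULL, 0, ms) has the same effect */
    }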
Incidentally, I found two OPAL_LIST_FOREACH loops in which opal_list_remove_item()
is invoked. Per a comment in opal_list.h, this is unsafe, and OPAL_LIST_FOREACH_SAFE
should be used instead (a simplified illustration follows the diff below):
diff --git a/orte/mca/rml/base/rml_base_msg_handlers.c b/orte/mca/rml/base/rml_base_msg_handlers.c
index 758bf91..22c7601 100644
--- a/orte/mca/rml/base/rml_base_msg_handlers.c
+++ b/orte/mca/rml/base/rml_base_msg_handlers.c
@@ -12,7 +12,9 @@
* All rights reserved.
* Copyright (c) 2007-2013 Los Alamos National Security, LLC. All rights
* reserved.
- * Copyright (c) 2015 Intel, Inc. All rights reserved.
+ * Copyright (c) 2015 Intel, Inc. All rights reserved.
+ * Copyright (c) 2015 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -55,7 +57,7 @@ static void msg_match_recv(orte_rml_posted_recv_t *rcv, bool get_all);
void orte_rml_base_post_recv(int sd, short args, void *cbdata)
{
orte_rml_recv_request_t *req = (orte_rml_recv_request_t*)cbdata;
- orte_rml_posted_recv_t *post, *recv;
+ orte_rml_posted_recv_t *post, *recv, *next;
orte_ns_cmp_bitmask_t mask = ORTE_NS_CMP_ALL | ORTE_NS_CMP_WILD;
opal_output_verbose(5, orte_rml_base_framework.framework_output,
@@ -75,7 +77,7 @@ void orte_rml_base_post_recv(int sd, short args, void *cbdata)
* and remove it from our list
*/
if (req->cancel) {
- OPAL_LIST_FOREACH(recv, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
+ OPAL_LIST_FOREACH_SAFE(recv, next, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
if (OPAL_EQUAL == orte_util_compare_name_fields(mask,
&post->peer, &recv->peer) &&
post->tag == recv->tag) {
opal_output_verbose(5,
orte_rml_base_framework.framework_output,
@@ -120,12 +122,12 @@ void orte_rml_base_post_recv(int sd, short args, void *cbdata)
void orte_rml_base_complete_recv_msg (orte_rml_recv_t **recv_msg)
{
- orte_rml_posted_recv_t *post;
+ orte_rml_posted_recv_t *post, *next;
orte_ns_cmp_bitmask_t mask = ORTE_NS_CMP_ALL | ORTE_NS_CMP_WILD;
opal_buffer_t buf;
orte_rml_recv_t *msg = *recv_msg;
/* see if we have a waiting recv for this message */
- OPAL_LIST_FOREACH(post, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
+ OPAL_LIST_FOREACH_SAFE(post, next, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
/* since names could include wildcards, must use
* the more generalized comparison function
*/
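For illustration, here is the difference between the two idioms using simplified
look-alike macros (not the actual opal_list.h definitions): the plain form reads
the current item's next pointer after the body has run, so unlinking and freeing
the current item leaves the loop chasing a dangling pointer, while the _SAFE form
caches the successor before the body runs.

    #include <stdlib.h>

    struct item { struct item *next; int tag; };   /* circular list, "head" is a sentinel */

    #define LIST_FOREACH(it, head) \
        for ((it) = (head)->next; (it) != (head); (it) = (it)->next)

    #define LIST_FOREACH_SAFE(it, nxt, head) \
        for ((it) = (head)->next, (nxt) = (it)->next; \
             (it) != (head); \
             (it) = (nxt), (nxt) = (it)->next)

    /* removing elements while iterating is only legal with the _SAFE form */
    static void remove_tag(struct item *head, int tag)
    {
        struct item *it, *nxt, *prev = head;
        LIST_FOREACH_SAFE(it, nxt, head) {
            if (it->tag == tag) {
                prev->next = it->next;             /* unlink ... */
                free(it);                          /* ... and free; loop resumes at nxt */
            } else {
                prev = it;
            }
        }
    }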
I hope this helps,
Gilles
On 7/17/2015 11:04 PM, Ralph Castain wrote:
It’s probably a race condition caused by uniting the ORTE and OPAL
async threads, though I can’t confirm that yet.
On Jul 17, 2015, at 3:11 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
Folks,
I noticed several errors such as http://mtt.open-mpi.org/index.php?do_redir=2244
that did not make any sense to me (at first glance). I was able to attach to one of
the processes when the issue occurred.
The SIGSEGV occurs in thread 2 while thread 1 is invoking ompi_rte_finalize. All I
can think of is a scenario in which the progress thread (aka thread 2) is still
dealing with some memory that was just freed/unmapped/corrupted by the main thread.
I empirically noticed the error is more likely to occur when there are many tasks
on one node, e.g. mpirun --oversubscribe -np 32 a.out
Cheers,
Gilles