Ralph,

the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
it does not invoke pmix_server_release,
because allgather_stub was not previously invoked since the fence
was not yet entered.
/* in rcd_finalize_coll, coll->cbfunc is NULL */

the attached patch is likely not the right fix, it was very lightly
tested, but so far, it works for me ...
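
To make the race easier to picture, here is a minimal standalone sketch in C of the "park the result until the local side shows up" idea. It is only an illustration under assumptions: coll_t, coll_finalize, coll_enter and count_release are hypothetical stand-ins and not ORTE symbols. coll_finalize loosely plays the role of rcd_finalize_coll, and coll_enter loosely plays the role of the local entry into the collective.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: invoked once the collective result is ready. */
typedef void (*coll_cb_t)(int status, const char *reply, void *cbdata);

/* Hypothetical tracker for one collective (not the real orte_grpcomm_coll_t). */
typedef struct {
    coll_cb_t   cbfunc;        /* NULL until the local side enters the collective */
    void       *cbdata;
    bool        parked;        /* true if a result arrived before cbfunc was set */
    int         parked_status;
    const char *parked_reply;
} coll_t;

/* Called when the last remote contribution has arrived. */
static void coll_finalize(coll_t *c, int status, const char *reply)
{
    if (NULL != c->cbfunc) {
        /* normal case: the local side already entered, deliver right away */
        c->cbfunc(status, reply, c->cbdata);
        return;
    }
    /* race case: nobody to call yet, so keep the result instead of dropping it */
    assert(!c->parked);
    c->parked = true;
    c->parked_status = status;
    c->parked_reply = reply;
}

/* Called when the local side finally enters the collective. */
static void coll_enter(coll_t *c, coll_cb_t cb, void *cbdata)
{
    c->cbfunc = cb;
    c->cbdata = cbdata;
    if (c->parked) {
        /* the result was already complete: deliver the parked copy now */
        c->parked = false;
        cb(c->parked_status, c->parked_reply, cbdata);
    }
}

/* Example callback: counts successful releases through cbdata. */
static void count_release(int status, const char *reply, void *cbdata)
{
    (void)reply;
    if (0 == status) {
        ++*(int *)cbdata;
    }
}
```

With this shape, the order in which coll_finalize and coll_enter run no longer matters: the callback fires exactly once either way. Whether a pending list is the right place for this in grpcomm is a separate question.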

Cheers,

Gilles

On 2014/09/11 16:11, Gilles Gouaillardet wrote:
> Ralph,
>
> things got worse indeed :-(
>
> now a simple hello world involving two hosts hangs in mpi_init.
> there is still a race condition : if task a calls fence long after task b,
> then task b will never leave the fence
>
> i'll try to debug this ...
>
> Cheers,
>
> Gilles
>
> On 2014/09/11 2:36, Ralph Castain wrote:
>> I think I now have this fixed - let me know what you see.
>>
>>
>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Yeah, that's not the correct fix. The right way to fix it is for all three 
>>> components to have their own RML tag, and for each of them to establish a 
>>> persistent receive. They then can use the signature to tell which 
>>> collective the incoming message belongs to.
>>>
>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>>>
>>>
>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>
>>>> Folks,
>>>>
>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>>> the attached spawn.c test program is a trimmed version of the
>>>> spawn_with_env_vars.c test case
>>>> from the ibm test suite.
>>>>
>>>> when invoked on two nodes :
>>>> - the program hangs with -np 2
>>>> - the program can crash with -np > 2
>>>> the error message is
>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1] AND TAG -33 - ABORTING
>>>>
>>>> here is my full command line (from node0) :
>>>>
>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca coll ^ml ./spawn
>>>>
>>>> a simple workaround is to add the following extra parameter to the
>>>> mpirun command line :
>>>> --mca grpcomm_rcd_priority 0
>>>>
>>>> my understanding is that the race condition occurs when all the
>>>> processes call MPI_Finalize() :
>>>> internally, the pmix module will have mpirun/orted issue two ALLGATHERs
>>>> involving mpirun and orted
>>>> (one for job 1 aka the parent, and one for job 2 aka the spawned tasks)
>>>> the error message is very explicit : this is not (currently) supported
>>>>
>>>> i wrote the attached rml.patch which is really a workaround and not a fix :
>>>> in this case, each job will invoke an ALLGATHER but with a different tag
>>>> /* that works for a limited number of jobs only */
>>>>
>>>> i did not commit this patch since this is not a fix, could someone
>>>> (Ralph ?) please review the issue and comment ?
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> <spawn.c><rml.patch>_______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15780.php

Index: orte/mca/grpcomm/rcd/grpcomm_rcd.c
===================================================================
--- orte/mca/grpcomm/rcd/grpcomm_rcd.c  (revision 32706)
+++ orte/mca/grpcomm/rcd/grpcomm_rcd.c  (working copy)
@@ -6,6 +6,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC. All
  *                         rights reserved.
  * Copyright (c) 2014      Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -85,6 +87,9 @@
 static int allgather(orte_grpcomm_coll_t *coll,
                      opal_buffer_t *sendbuf)
 {
+    orte_grpcomm_base_pending_coll_t *pc;
+    bool pending = false;
+
     OPAL_OUTPUT_VERBOSE((5, orte_grpcomm_base_framework.framework_output,
                          "%s grpcomm:coll:recdub algo employed for %d processes",
                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (int)coll->ndmns));
@@ -106,6 +111,33 @@
      */
     rcd_allgather_send_dist(coll, 1);

+    OPAL_LIST_FOREACH(pc, &orte_grpcomm_base.pending, orte_grpcomm_base_pending_coll_t) {
+        if (NULL == coll->sig->signature) {
+            if (NULL == pc->coll->sig->signature) {
+                /* only one collective can operate at a time
+                 * across every process in the system */
+                 pending = true;
+                 break;
+            }
+            /* if only one is NULL, then we can't possibly match */
+            break;
+        }
+        if (OPAL_EQUAL == opal_dss.compare(coll->sig, pc->coll->sig, ORTE_SIGNATURE)) {
+            OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_framework.framework_output,
+                                 "%s grpcomm:rcd:found existing pending collective",
+                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+            pending = true;
+            break;
+        }
+    }
+    if (pending) {
+        assert (coll == pc->coll);
+        pc->coll->cbfunc(pc->ret, pc->reply, coll->cbdata);
+        opal_list_remove_item(&orte_grpcomm_base.pending, &pc->super);
+        OBJ_RELEASE(pc->reply);
+        OBJ_RELEASE(pc);
+    }
+    
     return ORTE_SUCCESS;
 }

@@ -271,11 +303,16 @@
     /* execute the callback */
     if (NULL != coll->cbfunc) {
         coll->cbfunc(ret, reply, coll->cbdata);
+        opal_list_remove_item(&orte_grpcomm_base.ongoing, &coll->super);
+        OBJ_RELEASE(reply);
+    } else {
+        orte_grpcomm_base_pending_coll_t *pcoll = OBJ_NEW(orte_grpcomm_base_pending_coll_t);
+        opal_list_remove_item(&orte_grpcomm_base.ongoing, &coll->super);
+        pcoll->ret = ret;
+        pcoll->reply = reply;
+        pcoll->coll = coll;
+        opal_list_append(&orte_grpcomm_base.pending, &pcoll->super);
     }

-    opal_list_remove_item(&orte_grpcomm_base.ongoing, &coll->super);
-
-    OBJ_RELEASE(reply);
-
     return ORTE_SUCCESS;
 }
Index: orte/mca/grpcomm/base/grpcomm_base_stubs.c
===================================================================
--- orte/mca/grpcomm/base/grpcomm_base_stubs.c  (revision 32706)
+++ orte/mca/grpcomm/base/grpcomm_base_stubs.c  (working copy)
@@ -12,6 +12,8 @@
  *                         All rights reserved.
  * Copyright (c) 2011-2012 Los Alamos National Security, LLC.
  *                         All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -72,6 +74,11 @@
                           opal_object_t,
                           gccon, NULL);

+OBJ_CLASS_INSTANCE(orte_grpcomm_base_pending_coll_t,
+                   opal_list_item_t,
+                   NULL,
+                   NULL);
+
 int orte_grpcomm_API_xcast(orte_grpcomm_signature_t *sig,
                            orte_rml_tag_t tag,
                            opal_buffer_t *msg)
@@ -184,6 +191,7 @@
 orte_grpcomm_coll_t* orte_grpcomm_base_get_tracker(orte_grpcomm_signature_t *sig, bool create)
 {
     orte_grpcomm_coll_t *coll;
+    orte_grpcomm_base_pending_coll_t *pcoll;
     int rc;

     /* search the existing tracker list to see if this already exists */
@@ -204,6 +212,23 @@
             return coll;
         }
     }
+    OPAL_LIST_FOREACH(pcoll, &orte_grpcomm_base.pending, orte_grpcomm_base_pending_coll_t) {
+        if (NULL == sig->signature) {
+            if (NULL == pcoll->coll->sig->signature) {
+                /* only one collective can operate at a time
+                 * across every process in the system */
+                return pcoll->coll;
+            }
+            /* if only one is NULL, then we can't possibly match */
+            break;
+        }
+        if (OPAL_EQUAL == opal_dss.compare(sig, pcoll->coll->sig, ORTE_SIGNATURE)) {
+            OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_framework.framework_output,
+                                 "%s grpcomm:base:returning existing pending collective",
+                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+            return pcoll->coll;
+        }
+    }
     /* if we get here, then this is a new collective - so create
      * the tracker for it */
     if (!create) {
Index: orte/mca/grpcomm/base/base.h
===================================================================
--- orte/mca/grpcomm/base/base.h        (revision 32706)
+++ orte/mca/grpcomm/base/base.h        (working copy)
@@ -12,6 +12,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC.
  *                         All rights reserved.
  * Copyright (c) 2013-2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -65,8 +67,18 @@
 OBJ_CLASS_DECLARATION(orte_grpcomm_base_active_t);

 typedef struct {
+    opal_list_item_t super;
+    orte_grpcomm_coll_t * coll;
+    int ret;
+    opal_buffer_t *reply;
+} orte_grpcomm_base_pending_coll_t;
+
+OBJ_CLASS_DECLARATION(orte_grpcomm_base_pending_coll_t);
+
+typedef struct {
     opal_list_t actives;
     opal_list_t ongoing;
+    opal_list_t pending;
 } orte_grpcomm_base_t;

 ORTE_DECLSPEC extern orte_grpcomm_base_t orte_grpcomm_base;
Index: orte/mca/grpcomm/base/grpcomm_base_frame.c
===================================================================
--- orte/mca/grpcomm/base/grpcomm_base_frame.c  (revision 32706)
+++ orte/mca/grpcomm/base/grpcomm_base_frame.c  (working copy)
@@ -12,6 +12,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC.
  *                         All rights reserved.
  * Copyright (c) 2014      Intel, Inc. All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -70,6 +72,7 @@
     }
     OPAL_LIST_DESTRUCT(&orte_grpcomm_base.actives);
     OPAL_LIST_DESTRUCT(&orte_grpcomm_base.ongoing);
+    OPAL_LIST_DESTRUCT(&orte_grpcomm_base.pending);

     return mca_base_framework_components_close(&orte_grpcomm_base_framework, NULL);
 }
@@ -82,6 +85,7 @@
 {
     OBJ_CONSTRUCT(&orte_grpcomm_base.actives, opal_list_t);
     OBJ_CONSTRUCT(&orte_grpcomm_base.ongoing, opal_list_t);
+    OBJ_CONSTRUCT(&orte_grpcomm_base.pending, opal_list_t);

     return mca_base_framework_components_open(&orte_grpcomm_base_framework, flags);
 }
