Ralph,
the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
it does not invoke pmix_server_release,
because allgather_stub was not previously invoked: the fence
had not yet been entered.
/* in rcd_finalize_coll, coll->cbfunc is NULL */
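To make the failure mode concrete, here is a minimal, hypothetical C model of that code path (the struct and callback names are stand-ins, not the real ORTE types): if the fence result arrives before allgather_stub has installed the callback, rcd_finalize_coll simply drops it, and the release that would let the peer leave the fence never happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for orte_grpcomm_coll_t: cbfunc is only set once
 * allgather_stub has run, i.e. once the fence was entered locally. */
typedef struct {
    void (*cbfunc)(int ret, void *cbdata);
    void *cbdata;
} coll_t;

static bool released = false;

/* Stand-in for the pmix_server_release path. */
static void release_cb(int ret, void *cbdata) {
    (void)ret; (void)cbdata;
    released = true;
}

/* Mirrors the tail of rcd_finalize_coll: no callback, no release. */
static void rcd_finalize_coll(coll_t *coll) {
    if (NULL != coll->cbfunc) {
        coll->cbfunc(0, coll->cbdata);
    }
    /* else: the collective's result is silently dropped */
}
```

If rcd_finalize_coll fires while cbfunc is still NULL, released stays false forever: that is the hang.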
The attached patch is likely not the right fix; it was only lightly
tested, but so far it works for me ...
Cheers,
Gilles
On 2014/09/11 16:11, Gilles Gouaillardet wrote:
> Ralph,
>
> things got worse indeed :-(
>
> now a simple hello world involving two hosts hangs in MPI_Init.
> there is still a race condition: if task a enters the fence long after task b,
> then task b will never leave the fence.
>
> I'll try to debug this ...
>
> Cheers,
>
> Gilles
>
> On 2014/09/11 2:36, Ralph Castain wrote:
>> I think I now have this fixed - let me know what you see.
>>
>>
>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <[email protected]> wrote:
>>
>>> Yeah, that's not the correct fix. The right way to fix it is for all three
>>> components to have their own RML tag, and for each of them to establish a
>>> persistent receive. They then can use the signature to tell which
>>> collective the incoming message belongs to.
>>>
>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>>>
>>>
>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet
>>> <[email protected]> wrote:
>>>
>>>> Folks,
>>>>
>>>> Since r32672 (trunk), grpcomm/rcd is the default module.
>>>> The attached spawn.c test program is a trimmed version of the
>>>> spawn_with_env_vars.c test case from the IBM test suite.
>>>>
>>>> When invoked on two nodes:
>>>> - the program hangs with -np 2
>>>> - the program can crash with -np > 2; the error message is:
>>>> [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER [[42913,0],1]
>>>> AND TAG -33 - ABORTING
>>>>
>>>> Here is my full command line (from node0):
>>>>
>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self --mca
>>>> coll ^ml ./spawn
>>>>
>>>> A simple workaround is to add the following extra parameter to the
>>>> mpirun command line:
>>>> --mca grpcomm_rcd_priority 0
>>>>
>>>> My understanding is that the race condition occurs when all the
>>>> processes call MPI_Finalize():
>>>> internally, the pmix module has mpirun/orted issue two ALLGATHERs
>>>> involving mpirun and orted
>>>> (one for job 1, aka the parent, and one for job 2, aka the spawned tasks).
>>>> The error message is very explicit: this is not (currently) supported.
>>>>
>>>> I wrote the attached rml.patch, which is really a workaround and not a fix:
>>>> with it, each job invokes its ALLGATHER with a different tag
>>>> /* that works for a limited number of jobs only */
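As a rough sketch of the workaround's idea (the constants and helper name below are invented, not taken from rml.patch): derive the collective's RML tag from the jobid, so the parent job and the spawned job post receives on different tags. Because the reserved tag span is finite, two jobs can still collide, which is why this only works for a limited number of jobs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: BASE_TAG matches the tag seen in the error
 * message above; TAG_SPAN is an invented bound on concurrent jobs. */
#define BASE_TAG (-33)
#define TAG_SPAN 16

static int coll_tag_for_job(uint32_t jobid) {
    return BASE_TAG - (int)(jobid % TAG_SPAN);
}
```

With these numbers, jobid 0 and jobid 16 map to the same tag, reproducing the original clash once enough jobs exist.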
>>>>
>>>> I did not commit this patch since it is not a fix; could someone
>>>> (Ralph?) please review the issue and comment?
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> <spawn.c><rml.patch>_______________________________________________
>>>> devel mailing list
>>>> [email protected]
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15780.php
>> _______________________________________________
>> devel mailing list
>> [email protected]
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/09/15794.php
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15804.php
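The patch below defers the callback instead: when rcd_finalize_coll finds no cbfunc installed, it parks the result on a new "pending" list, and a later allgather() on the matching signature drains it. A simplified, self-contained model of that hand-off (a single pending slot instead of an opal_list_t, with invented stand-in types):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    void (*cbfunc)(int ret, void *cbdata);
    void *cbdata;
} coll_t;

/* One parked result; the real patch keeps an opal_list_t of these. */
static struct { coll_t *coll; int ret; bool used; } pending;

static int delivered = -1;
static void record_cb(int ret, void *cbdata) {
    (void)cbdata;
    delivered = ret;  /* stands in for pmix_server_release */
}

/* Result arrives: deliver now, or park it if the fence was not
 * yet entered locally (cbfunc still NULL). */
static void rcd_finalize_coll(coll_t *coll, int ret) {
    if (NULL != coll->cbfunc) {
        coll->cbfunc(ret, coll->cbdata);
    } else {
        pending.coll = coll;
        pending.ret = ret;
        pending.used = true;
    }
}

/* Fence entered: install the callback, then drain any parked result. */
static void allgather(coll_t *coll, void (*cb)(int, void *), void *cbdata) {
    coll->cbfunc = cb;
    coll->cbdata = cbdata;
    if (pending.used && pending.coll == coll) {
        coll->cbfunc(pending.ret, coll->cbdata);
        pending.used = false;
    }
}
```

The ordering that previously hung (result before fence) now delivers the callback as soon as allgather runs.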
Index: orte/mca/grpcomm/rcd/grpcomm_rcd.c
===================================================================
--- orte/mca/grpcomm/rcd/grpcomm_rcd.c (revision 32706)
+++ orte/mca/grpcomm/rcd/grpcomm_rcd.c (working copy)
@@ -6,6 +6,8 @@
* Copyright (c) 2011-2013 Los Alamos National Security, LLC. All
* rights reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -85,6 +87,9 @@
static int allgather(orte_grpcomm_coll_t *coll,
opal_buffer_t *sendbuf)
{
+ orte_grpcomm_base_pending_coll_t *pc;
+ bool pending = false;
+
OPAL_OUTPUT_VERBOSE((5, orte_grpcomm_base_framework.framework_output,
"%s grpcomm:coll:recdub algo employed for %d processes",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
(int)coll->ndmns));
@@ -106,6 +111,33 @@
*/
rcd_allgather_send_dist(coll, 1);
+ OPAL_LIST_FOREACH(pc, &orte_grpcomm_base.pending, orte_grpcomm_base_pending_coll_t) {
+ if (NULL == coll->sig->signature) {
+ if (NULL == pc->coll->sig->signature) {
+ /* only one collective can operate at a time
+ * across every process in the system */
+ pending = true;
+ break;
+ }
+ /* if only one is NULL, then we can't possibly match */
+ break;
+ }
+ if (OPAL_EQUAL == opal_dss.compare(coll->sig, pc->coll->sig, ORTE_SIGNATURE)) {
+ OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_framework.framework_output,
+ "%s grpcomm:rcd:found existing pending collective",
+ ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+ pending = true;
+ break;
+ }
+ }
+ if (pending) {
+ assert (coll == pc->coll);
+ pc->coll->cbfunc(pc->ret, pc->reply, coll->cbdata);
+ opal_list_remove_item(&orte_grpcomm_base.pending, &pc->super);
+ OBJ_RELEASE(pc->reply);
+ OBJ_RELEASE(pc);
+ }
+
return ORTE_SUCCESS;
}
@@ -271,11 +303,16 @@
/* execute the callback */
if (NULL != coll->cbfunc) {
coll->cbfunc(ret, reply, coll->cbdata);
+ opal_list_remove_item(&orte_grpcomm_base.ongoing, &coll->super);
+ OBJ_RELEASE(reply);
+ } else {
+ orte_grpcomm_base_pending_coll_t *pcoll = OBJ_NEW(orte_grpcomm_base_pending_coll_t);
+ opal_list_remove_item(&orte_grpcomm_base.ongoing, &coll->super);
+ pcoll->ret = ret;
+ pcoll->reply = reply;
+ pcoll->coll = coll;
+ opal_list_append(&orte_grpcomm_base.pending, &pcoll->super);
}
- opal_list_remove_item(&orte_grpcomm_base.ongoing, &coll->super);
-
- OBJ_RELEASE(reply);
-
return ORTE_SUCCESS;
}
Index: orte/mca/grpcomm/base/grpcomm_base_stubs.c
===================================================================
--- orte/mca/grpcomm/base/grpcomm_base_stubs.c (revision 32706)
+++ orte/mca/grpcomm/base/grpcomm_base_stubs.c (working copy)
@@ -12,6 +12,8 @@
* All rights reserved.
* Copyright (c) 2011-2012 Los Alamos National Security, LLC.
* All rights reserved.
+ * Copyright (c) 2014 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -72,6 +74,11 @@
opal_object_t,
gccon, NULL);
+OBJ_CLASS_INSTANCE(orte_grpcomm_base_pending_coll_t,
+ opal_list_item_t,
+ NULL,
+ NULL);
+
int orte_grpcomm_API_xcast(orte_grpcomm_signature_t *sig,
orte_rml_tag_t tag,
opal_buffer_t *msg)
@@ -184,6 +191,7 @@
orte_grpcomm_coll_t* orte_grpcomm_base_get_tracker(orte_grpcomm_signature_t *sig, bool create)
{
orte_grpcomm_coll_t *coll;
+ orte_grpcomm_base_pending_coll_t *pcoll;
int rc;
/* search the existing tracker list to see if this already exists */
@@ -204,6 +212,23 @@
return coll;
}
}
+ OPAL_LIST_FOREACH(pcoll, &orte_grpcomm_base.pending, orte_grpcomm_base_pending_coll_t) {
+ if (NULL == sig->signature) {
+ if (NULL == pcoll->coll->sig->signature) {
+ /* only one collective can operate at a time
+ * across every process in the system */
+ return pcoll->coll;
+ }
+ /* if only one is NULL, then we can't possibly match */
+ break;
+ }
+ if (OPAL_EQUAL == opal_dss.compare(sig, pcoll->coll->sig, ORTE_SIGNATURE)) {
+ OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_framework.framework_output,
+ "%s grpcomm:base:returning existing pending collective",
+ ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+ return pcoll->coll;
+ }
+ }
/* if we get here, then this is a new collective - so create
* the tracker for it */
if (!create) {
Index: orte/mca/grpcomm/base/base.h
===================================================================
--- orte/mca/grpcomm/base/base.h (revision 32706)
+++ orte/mca/grpcomm/base/base.h (working copy)
@@ -12,6 +12,8 @@
* Copyright (c) 2011-2013 Los Alamos National Security, LLC.
* All rights reserved.
* Copyright (c) 2013-2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -65,8 +67,18 @@
OBJ_CLASS_DECLARATION(orte_grpcomm_base_active_t);
typedef struct {
+ opal_list_item_t super;
+ orte_grpcomm_coll_t * coll;
+ int ret;
+ opal_buffer_t *reply;
+} orte_grpcomm_base_pending_coll_t;
+
+OBJ_CLASS_DECLARATION(orte_grpcomm_base_pending_coll_t);
+
+typedef struct {
opal_list_t actives;
opal_list_t ongoing;
+ opal_list_t pending;
} orte_grpcomm_base_t;
ORTE_DECLSPEC extern orte_grpcomm_base_t orte_grpcomm_base;
Index: orte/mca/grpcomm/base/grpcomm_base_frame.c
===================================================================
--- orte/mca/grpcomm/base/grpcomm_base_frame.c (revision 32706)
+++ orte/mca/grpcomm/base/grpcomm_base_frame.c (working copy)
@@ -12,6 +12,8 @@
* Copyright (c) 2011-2013 Los Alamos National Security, LLC.
* All rights reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -70,6 +72,7 @@
}
OPAL_LIST_DESTRUCT(&orte_grpcomm_base.actives);
OPAL_LIST_DESTRUCT(&orte_grpcomm_base.ongoing);
+ OPAL_LIST_DESTRUCT(&orte_grpcomm_base.pending);
return mca_base_framework_components_close(&orte_grpcomm_base_framework,
NULL);
}
@@ -82,6 +85,7 @@
{
OBJ_CONSTRUCT(&orte_grpcomm_base.actives, opal_list_t);
OBJ_CONSTRUCT(&orte_grpcomm_base.ongoing, opal_list_t);
+ OBJ_CONSTRUCT(&orte_grpcomm_base.pending, opal_list_t);
return mca_base_framework_components_open(&orte_grpcomm_base_framework,
flags);
}