Ralph,

the root cause is that when the second orted/mpirun runs rcd_finalize_coll, it
does not invoke pmix_server_release: allgather_stub had not been invoked,
since the fence had not yet been entered, so in rcd_finalize_coll,
coll->cbfunc is NULL.
The attached patch is likely not the right fix, and it was only very lightly
tested, but so far it works for me ...

Cheers,

Gilles

On 2014/09/11 16:11, Gilles Gouaillardet wrote:
> Ralph,
>
> things got worse indeed :-(
>
> now a simple hello world involving two hosts hangs in mpi_init.
> there is still a race condition: if task a enters the fence long after
> task b, then task b will never leave the fence.
>
> i'll try to debug this ...
>
> Cheers,
>
> Gilles
>
> On 2014/09/11 2:36, Ralph Castain wrote:
>> I think I now have this fixed - let me know what you see.
>>
>>
>> On Sep 9, 2014, at 6:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Yeah, that's not the correct fix. The right way to fix it is for all
>>> three components to have their own RML tag, and for each of them to
>>> establish a persistent receive. They then can use the signature to
>>> tell which collective the incoming message belongs to.
>>>
>>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>>>
>>>
>>> On Sep 9, 2014, at 3:10 AM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>
>>>> Folks,
>>>>
>>>> since r32672 (trunk), grpcomm/rcd is the default module.
>>>> the attached spawn.c test program is a trimmed version of the
>>>> spawn_with_env_vars.c test case from the ibm test suite.
>>>>
>>>> when invoked on two nodes:
>>>> - the program hangs with -np 2
>>>> - the program can crash with -np > 2; the error message is
>>>>   [node0:30701] [[42913,0],0] TWO RECEIVES WITH SAME PEER
>>>>   [[42913,0],1] AND TAG -33 - ABORTING
>>>>
>>>> here is my full command line (from node0):
>>>>
>>>> mpirun -host node0,node1 -np 2 --oversubscribe --mca btl tcp,self
>>>> --mca coll ^ml ./spawn
>>>>
>>>> a simple workaround is to add the following extra parameter to the
>>>> mpirun command line:
>>>> --mca grpcomm_rcd_priority 0
>>>>
>>>> my understanding is that the race condition occurs when all the
>>>> processes call MPI_Finalize().
>>>> internally, the pmix module has mpirun/orted issue two ALLGATHERs
>>>> involving mpirun and the orted (one for job 1, aka the parent, and
>>>> one for job 2, aka the spawned tasks).
>>>> the error message is very explicit: this is not (currently) supported.
>>>>
>>>> i wrote the attached rml.patch, which is really a workaround and not
>>>> a fix: in this case, each job will invoke an ALLGATHER, but with a
>>>> different tag /* that works for a limited number of jobs only */
>>>>
>>>> i did not commit this patch since this is not a fix. could someone
>>>> (Ralph ?) please review the issue and comment?
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> <spawn.c><rml.patch>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2014/09/15780.php
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/09/15794.php
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15804.php
Index: orte/mca/grpcomm/rcd/grpcomm_rcd.c
===================================================================
--- orte/mca/grpcomm/rcd/grpcomm_rcd.c	(revision 32706)
+++ orte/mca/grpcomm/rcd/grpcomm_rcd.c	(working copy)
@@ -6,6 +6,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC. All
  *                         rights reserved.
  * Copyright (c) 2014      Intel, Inc. All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -85,6 +87,9 @@
 static int allgather(orte_grpcomm_coll_t *coll,
                      opal_buffer_t *sendbuf)
 {
+    orte_grpcomm_base_pending_coll_t *pc;
+    bool pending = false;
+
     OPAL_OUTPUT_VERBOSE((5, orte_grpcomm_base_framework.framework_output,
                          "%s grpcomm:coll:recdub algo employed for %d processes",
                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (int)coll->ndmns));
@@ -106,6 +111,33 @@
      */
     rcd_allgather_send_dist(coll, 1);
 
+    OPAL_LIST_FOREACH(pc, &orte_grpcomm_base.pending, orte_grpcomm_base_pending_coll_t) {
+        if (NULL == coll->sig->signature) {
+            if (NULL == pc->coll->sig->signature) {
+                /* only one collective can operate at a time
+                 * across every process in the system */
+                pending = true;
+                break;
+            }
+            /* if only one is NULL, then we can't possibly match */
+            break;
+        }
+        if (OPAL_EQUAL == opal_dss.compare(coll->sig, pc->coll->sig, ORTE_SIGNATURE)) {
+            OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_framework.framework_output,
+                                 "%s grpcomm:rcd:found existing pending collective",
+                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+            pending = true;
+            break;
+        }
+    }
+    if (pending) {
+        assert (coll == pc->coll);
+        pc->coll->cbfunc(pc->ret, pc->reply, coll->cbdata);
+        opal_list_remove_item(&orte_grpcomm_base.pending, &pc->super);
+        OBJ_RELEASE(pc->reply);
+        OBJ_RELEASE(pc);
+    }
+
     return ORTE_SUCCESS;
 }
 
@@ -271,11 +303,16 @@
     /* execute the callback */
     if (NULL != coll->cbfunc) {
         coll->cbfunc(ret, reply, coll->cbdata);
+        opal_list_remove_item(&orte_grpcomm_base.ongoing, &coll->super);
+        OBJ_RELEASE(reply);
+    } else {
+        orte_grpcomm_base_pending_coll_t *pcoll = OBJ_NEW(orte_grpcomm_base_pending_coll_t);
+        opal_list_remove_item(&orte_grpcomm_base.ongoing, &coll->super);
+        pcoll->ret = ret;
+        pcoll->reply = reply;
+        pcoll->coll = coll;
+        opal_list_append(&orte_grpcomm_base.pending, &pcoll->super);
     }
-    opal_list_remove_item(&orte_grpcomm_base.ongoing, &coll->super);
-
-    OBJ_RELEASE(reply);
-
     return ORTE_SUCCESS;
 }
Index: orte/mca/grpcomm/base/grpcomm_base_stubs.c
===================================================================
--- orte/mca/grpcomm/base/grpcomm_base_stubs.c	(revision 32706)
+++ orte/mca/grpcomm/base/grpcomm_base_stubs.c	(working copy)
@@ -12,6 +12,8 @@
  *                         All rights reserved.
  * Copyright (c) 2011-2012 Los Alamos National Security, LLC.
  *                         All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -72,6 +74,11 @@
                    opal_object_t,
                    gccon, NULL);
 
+OBJ_CLASS_INSTANCE(orte_grpcomm_base_pending_coll_t,
+                   opal_list_item_t,
+                   NULL,
+                   NULL);
+
 int orte_grpcomm_API_xcast(orte_grpcomm_signature_t *sig,
                            orte_rml_tag_t tag,
                            opal_buffer_t *msg)
@@ -184,6 +191,7 @@
 orte_grpcomm_coll_t* orte_grpcomm_base_get_tracker(orte_grpcomm_signature_t *sig, bool create)
 {
     orte_grpcomm_coll_t *coll;
+    orte_grpcomm_base_pending_coll_t *pcoll;
     int rc;
 
     /* search the existing tracker list to see if this already exists */
@@ -204,6 +212,23 @@
             return coll;
         }
     }
+    OPAL_LIST_FOREACH(pcoll, &orte_grpcomm_base.pending, orte_grpcomm_base_pending_coll_t) {
+        if (NULL == sig->signature) {
+            if (NULL == pcoll->coll->sig->signature) {
+                /* only one collective can operate at a time
+                 * across every process in the system */
+                return pcoll->coll;
+            }
+            /* if only one is NULL, then we can't possibly match */
+            break;
+        }
+        if (OPAL_EQUAL == opal_dss.compare(sig, pcoll->coll->sig, ORTE_SIGNATURE)) {
+            OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_framework.framework_output,
+                                 "%s grpcomm:base:returning existing pending collective",
+                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+            return pcoll->coll;
+        }
+    }
     /* if we get here, then this is a new collective - so create
      * the tracker for it */
     if (!create) {
Index: orte/mca/grpcomm/base/base.h
===================================================================
--- orte/mca/grpcomm/base/base.h	(revision 32706)
+++ orte/mca/grpcomm/base/base.h	(working copy)
@@ -12,6 +12,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC.
  *                         All rights reserved.
  * Copyright (c) 2013-2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -65,8 +67,18 @@
 OBJ_CLASS_DECLARATION(orte_grpcomm_base_active_t);
 
 typedef struct {
+    opal_list_item_t super;
+    orte_grpcomm_coll_t * coll;
+    int ret;
+    opal_buffer_t *reply;
+} orte_grpcomm_base_pending_coll_t;
+
+OBJ_CLASS_DECLARATION(orte_grpcomm_base_pending_coll_t);
+
+typedef struct {
     opal_list_t actives;
     opal_list_t ongoing;
+    opal_list_t pending;
 } orte_grpcomm_base_t;
 ORTE_DECLSPEC extern orte_grpcomm_base_t orte_grpcomm_base;
Index: orte/mca/grpcomm/base/grpcomm_base_frame.c
===================================================================
--- orte/mca/grpcomm/base/grpcomm_base_frame.c	(revision 32706)
+++ orte/mca/grpcomm/base/grpcomm_base_frame.c	(working copy)
@@ -12,6 +12,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC.
  *                         All rights reserved.
  * Copyright (c) 2014      Intel, Inc. All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -70,6 +72,7 @@
     }
     OPAL_LIST_DESTRUCT(&orte_grpcomm_base.actives);
     OPAL_LIST_DESTRUCT(&orte_grpcomm_base.ongoing);
+    OPAL_LIST_DESTRUCT(&orte_grpcomm_base.pending);
 
     return mca_base_framework_components_close(&orte_grpcomm_base_framework, NULL);
 }
@@ -82,6 +85,7 @@
 {
     OBJ_CONSTRUCT(&orte_grpcomm_base.actives, opal_list_t);
     OBJ_CONSTRUCT(&orte_grpcomm_base.ongoing, opal_list_t);
+    OBJ_CONSTRUCT(&orte_grpcomm_base.pending, opal_list_t);
 
     return mca_base_framework_components_open(&orte_grpcomm_base_framework, flags);
 }