Ralph and all,

The following trivial test hangs. In my environment it hangs at least 99% of the time; the remaining 1% of runs hit a race condition and the program behaves as expected.

    mpirun -np 1 --mca btl self /bin/false

The same behaviour happens with the following trivial MPI program:

    #include <mpi.h>

    int main (int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        MPI_Finalize();
        return 1;
    }

The attached patch fixes the hang (i.e. the program aborts cleanly with the correct error message).
I did not commit it since I am not at all confident in it. Could you please review it?

Cheers,

Gilles

Index: orte/mca/errmgr/default_hnp/errmgr_default_hnp.c
===================================================================
--- orte/mca/errmgr/default_hnp/errmgr_default_hnp.c	(revision 32642)
+++ orte/mca/errmgr/default_hnp/errmgr_default_hnp.c	(working copy)
@@ -10,6 +10,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC.
  *                         All rights reserved.
  * Copyright (c) 2014      Intel, Inc. All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -382,6 +384,14 @@
             jdata->num_terminated++;
         }
 
+        /* FIXME ???
+         * mark the proc as no more alive if needed
+         */
+        if (ORTE_PROC_STATE_KILLED_BY_CMD == state) {
+            if (ORTE_FLAG_TEST(pptr, ORTE_PROC_FLAG_WAITPID) && ORTE_FLAG_TEST(pptr, ORTE_PROC_FLAG_IOF_COMPLETE)) {
+                ORTE_FLAG_UNSET(pptr, ORTE_PROC_FLAG_ALIVE);
+            }
+        }
         /* if we were ordered to terminate, mark this proc as dead and see if
          * any of our routes or local children remain alive - if not, then
          * terminate ourselves. */
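
For review purposes, the condition added by the patch can be looked at in isolation. The following is only a minimal sketch: the type, flag values, macros and the function name mark_dead_if_reaped are hypothetical stand-ins, not the real ORTE definitions, which live in the Open MPI tree and may differ. It only shows the intended rule: a proc reported as killed by command is marked not alive once both its waitpid and its IOF completion have been recorded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the ORTE flag bits and macros used in the
 * patch. The values and definitions here are assumptions for illustration. */
typedef uint16_t proc_flags_t;

#define PROC_FLAG_ALIVE        0x0001u
#define PROC_FLAG_IOF_COMPLETE 0x0002u
#define PROC_FLAG_WAITPID      0x0004u

#define FLAG_TEST(p, f)  (((p)->flags & (f)) != 0)
#define FLAG_UNSET(p, f) ((p)->flags &= (proc_flags_t)~(f))

/* Hypothetical, reduced set of process states. */
typedef enum {
    PROC_STATE_RUNNING,
    PROC_STATE_KILLED_BY_CMD
} proc_state_t;

/* Hypothetical, reduced process object: only the flags matter here. */
typedef struct {
    proc_flags_t flags;
} proc_t;

/* Mirrors the condition added by the patch: when the proc was killed by
 * command and both waitpid and IOF completion have fired, clear the ALIVE
 * flag. Returns true if the flag was cleared by this call. */
bool mark_dead_if_reaped(proc_t *p, proc_state_t state)
{
    if (PROC_STATE_KILLED_BY_CMD != state) {
        return false;
    }
    if (FLAG_TEST(p, PROC_FLAG_WAITPID) &&
        FLAG_TEST(p, PROC_FLAG_IOF_COMPLETE)) {
        FLAG_UNSET(p, PROC_FLAG_ALIVE);
        return true;
    }
    return false;
}
```

In this sketch a proc that has been waited on but whose IOF is not yet complete stays alive, which is the case I am least sure the real code handles correctly.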