Ralph and all,

The following trivial test hangs
/* it hangs at least 99% of the time in my environment, 1% is a race
condition and the program behaves as expected */

mpirun -np 1 --mca btl self /bin/false

same behaviour happen with the following trivial but MPI program :

#include <mpi.h>

int main (int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 1;
}

The attached patch fixes the hang (e.g. the program nicely abort with
the correct error message)

i did not commit it since i am not confident at all

could you please review it ?

Cheers

Gilles
Index: orte/mca/errmgr/default_hnp/errmgr_default_hnp.c
===================================================================
--- orte/mca/errmgr/default_hnp/errmgr_default_hnp.c    (revision 32642)
+++ orte/mca/errmgr/default_hnp/errmgr_default_hnp.c    (working copy)
@@ -10,6 +10,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC.
  *                         All rights reserved.
  * Copyright (c) 2014      Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -382,6 +384,14 @@
         jdata->num_terminated++;
     }

+    /* FIXME ???
+     * mark the proc as no more alive if needed
+     */
+    if (ORTE_PROC_STATE_KILLED_BY_CMD == state) {
+        if (ORTE_FLAG_TEST(pptr, ORTE_PROC_FLAG_WAITPID) && 
ORTE_FLAG_TEST(pptr, ORTE_PROC_FLAG_IOF_COMPLETE)) {
+            ORTE_FLAG_UNSET(pptr, ORTE_PROC_FLAG_ALIVE);
+        }
+    }
     /* if we were ordered to terminate, mark this proc as dead and see if
      * any of our routes or local  children remain alive - if not, then
      * terminate ourselves. */

Reply via email to