Folks,
several mtt tests (ompi-trunk r31963) failed (SIGSEGV in mpirun) with a
similar stack trace.
For example, you can refer to :
http://mtt.open-mpi.org/index.php?do_redir=2199
the issue is not related whatsoever to the init_thread_serialized test
(other tests failed with similar symptoms)
so far i could find that :
- the issue is intermittent and can be hard to reproduce (1 failure in
1000 runs)
- per the mtt logs, it seems this is quite a recent failure
- a necessary condition is that MPI tasks exit with a non-zero status after
having called MPI_Finalize()
- the crash occurs in orte/mca/oob/base/oob_base_frame.c at line 89 when
invoking
OBJ_RELEASE(value) ;
in some rare cases, value is NULL which causes the crash.
- though i cannot incriminate one changeset in particular, i highly suspect
the changes that were made in order to address the issue(s) discussed at
http://www.open-mpi.org/community/lists/devel/2014/05/14908.php
the attached patch works around this issue.
i did not commit it because i consider this a workaround and not a
fix :
the root cause might be a tricky race condition ("abort" after
MPI_Finalize).
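to illustrate the idea of the workaround outside of ORTE, here is a minimal stand-alone sketch. the types and functions (demo_obj_t, demo_release, demo_release_all) are hypothetical stand-ins and not OPAL code; the sketch only assumes that the release operation dereferences its argument, so a NULL entry has to be skipped by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* hypothetical stand-in for a reference-counted object (not opal_object_t) */
typedef struct {
    int refcount;
} demo_obj_t;

/* hypothetical stand-in for OBJ_RELEASE: it dereferences *obj,
 * so calling it on a NULL entry would crash */
static void demo_release(demo_obj_t **obj)
{
    if (0 == --(*obj)->refcount) {
        free(*obj);
        *obj = NULL;
    }
}

/* release every non-NULL entry and return how many entries were skipped */
static int demo_release_all(demo_obj_t **values, size_t n)
{
    int skipped = 0;
    for (size_t i = 0; i < n; i++) {
        if (NULL == values[i]) {
            /* same guard as in the patch: do not release a NULL value */
            skipped++;
            continue;
        }
        demo_release(&values[i]);
    }
    return skipped;
}
```

counting the skipped entries is only there to make the rare case visible; it does not explain why a NULL value ends up in the table in the first place.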
as a side note, here is the definition of OBJ_RELEASE
(opal/class/opal_object.h)
#if OPAL_ENABLE_DEBUG
#define OBJ_RELEASE(object) \
do { \
assert(NULL != ((opal_object_t *) (object))->obj_class); \
assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (object))->obj_magic_id); \
} while (0)
...
#else
...
should we add the following assert at the beginning ?
assert(NULL != object);
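as a rough sketch of what this could look like, here is a simplified stand-alone macro. DEMO_RELEASE and demo_obj_t are hypothetical names and this is not the real OBJ_RELEASE; it only shows the NULL check placed before any dereference, so a debug build would abort with a clear assertion message instead of a SIGSEGV.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* hypothetical stand-in for a reference-counted object (not opal_object_t) */
typedef struct {
    int refcount;
} demo_obj_t;

/* simplified, hypothetical release macro: the NULL check comes first,
 * before the object is dereferenced by the other checks */
#define DEMO_RELEASE(object)                                        \
    do {                                                            \
        assert(NULL != (object));      /* proposed extra check */   \
        assert(0 < (object)->refcount);                             \
        if (0 == --(object)->refcount) {                            \
            free(object);                                           \
            (object) = NULL;                                        \
        }                                                           \
    } while (0)
```

note that assert() is compiled out when NDEBUG is defined, so such a check would only help in debug builds and would not replace a guard in the caller.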
Thanks in advance for your comments,
Gilles
Index: orte/mca/oob/base/oob_base_frame.c
===================================================================
--- orte/mca/oob/base/oob_base_frame.c (revision 31967)
+++ orte/mca/oob/base/oob_base_frame.c (working copy)
@@ -13,6 +13,8 @@
* Copyright (c) 2007 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2013-2014 Los Alamos National Security, LLC. All rights
* reserved.
+ * Copyright (c) 2014 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -86,7 +88,11 @@
rc = opal_hash_table_get_first_key_uint64 (&orte_oob_base.peers, &key,
(void **) &value, &node);
while (OPAL_SUCCESS == rc) {
- OBJ_RELEASE(value);
+ /* in some rare cases, value can be NULL.
+ this would cause a crash in OBJ_RELEASE */
+ if (NULL != value) {
+ OBJ_RELEASE(value);
+ }
rc = opal_hash_table_get_next_key_uint64 (&orte_oob_base.peers, &key,
(void **) &value, node, &node);
}