Ralph,
I tried to debug the issue reported by Siegmar at
http://www.open-mpi.org/community/lists/users/2014/12/26052.php
I have not been able to try this on a heterogeneous cluster yet, but I could
reproduce a hang with 2 nodes and 3 tasks:

mpirun -host node0,node1 -np 3 --mca btl tcp,self --mca coll ^ml ./helloworld
git bisect pointed to commit
https://github.com/hppritcha/ompi/commit/bffb2b7a4bb49c9188d942201b8a8f04872ff63c,
and reverting part of that commit "fixes" the hang (see the attached patch).
Your change correctly prevents the use of uninitialized data (the worst-case
scenario being a crash), but it has an undesired side effect that eventually
causes a hang: with the reply now queued only inside the "stored" branch, a
requesting peer whose data was not stored never receives a response and
waits forever.
Could you please take a look?
Cheers,
Gilles
diff --git a/orte/orted/pmix/pmix_server.c b/orte/orted/pmix/pmix_server.c
index 4f0493c..0f4c816 100644
--- a/orte/orted/pmix/pmix_server.c
+++ b/orte/orted/pmix/pmix_server.c
@@ -1241,9 +1241,9 @@ static void pmix_server_dmdx_resp(int status, orte_process_name_t* sender,
/* pass across any returned blobs */
opal_dss.copy_payload(reply, buffer);
stored = true;
- OBJ_RETAIN(reply);
- PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
}
+ OBJ_RETAIN(reply);
+ PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
} else {
/* If peer has an access to shared memory dstore, check
* if we already stored data for the target process.