Ralph,

I tried to debug the issue reported by Siegmar at
http://www.open-mpi.org/community/lists/users/2014/12/26052.php

I have not been able to try this on a heterogeneous cluster yet, but I
could reproduce a hang with 2 nodes and 3 tasks:

mpirun -host node0,node1 -np 3 --mca btl tcp,self --mca coll ^ml
./helloworld

git bisect pointed to commit
https://github.com/hppritcha/ompi/commit/bffb2b7a4bb49c9188d942201b8a8f04872ff63c,
and reverting part of that commit "fixes" the hang
(see attached patch).

Your change correctly prevents the use of uninitialized data (the worst-case
scenario being a crash), but it has an undesired side effect that eventually
causes a hang.


Could you please have a look at it?

Cheers,

Gilles
diff --git a/orte/orted/pmix/pmix_server.c b/orte/orted/pmix/pmix_server.c
index 4f0493c..0f4c816 100644
--- a/orte/orted/pmix/pmix_server.c
+++ b/orte/orted/pmix/pmix_server.c
@@ -1241,9 +1241,9 @@ static void pmix_server_dmdx_resp(int status, orte_process_name_t* sender,
                     /* pass across any returned blobs */
                     opal_dss.copy_payload(reply, buffer);
                     stored = true;
-                    OBJ_RETAIN(reply);
-                    PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
                 }
+                OBJ_RETAIN(reply);
+                PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
             } else {
                 /* If peer has an access to shared memory dstore, check
                  * if we already stored data for the target process.
