Ralph, I tried to debug the issue reported by Siegmar at http://www.open-mpi.org/community/lists/users/2014/12/26052.php
I have not been able to try this on a heterogeneous cluster yet, but I could reproduce a hang with 2 nodes and 3 tasks:

mpirun -host node0,node1 -np 3 --mca btl tcp,self --mca coll ^ml ./helloworld

git bisect pointed to commit https://github.com/hppritcha/ompi/commit/bffb2b7a4bb49c9188d942201b8a8f04872ff63c, and reverting a subpart of this commit "fixes" the hang (see the attached patch).

Your change correctly prevents the use of uninitialized data (the worst-case scenario being a crash), but it has some undesired side effects that eventually cause a hang. Could you please have a look at it?

Cheers,

Gilles
diff --git a/orte/orted/pmix/pmix_server.c b/orte/orted/pmix/pmix_server.c
index 4f0493c..0f4c816 100644
--- a/orte/orted/pmix/pmix_server.c
+++ b/orte/orted/pmix/pmix_server.c
@@ -1241,9 +1241,9 @@ static void pmix_server_dmdx_resp(int status, orte_process_name_t* sender,
             /* pass across any returned blobs */
             opal_dss.copy_payload(reply, buffer);
             stored = true;
-            OBJ_RETAIN(reply);
-            PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
         }
+        OBJ_RETAIN(reply);
+        PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
     } else {
         /* If peer has an access to shared memory dstore, check
          * if we already stored data for the target process.