Hi Ralph,

> On 15 Oct 2015, at 0:26 , Ralph Castain <r...@open-mpi.org> wrote:
>
> Okay, so each orte-submit is reporting job has launched, which means the hang
> is coming while waiting to hear the job completed. Are you sure that orte-dvm
> believes the job has completed?

No, I'm not.

> In other words, when you say that you observe the job as completing, are you
> basing that on some output from orte-dvm, or because the procs have exited,
> or...?

... because the tasks have created their output.

> I can send you a patch tonight that would cause orte-dvm to emit a "job
> completed" message when it determines each job has terminated - might help us
> take the next step.

Great.

> I'm wondering if orte-dvm thinks the job is still running, and the race
> condition is in that area (as opposed to being in orte-submit itself)

Do some counts from the output of orte-dvm provide some hints?

$ grep "Releasing job data.*INVALID" dvm_output.txt |wc -l
42
$ grep "ORTE_DAEMON_SPAWN_JOB_CMD" dvm_output.txt |wc -l
42
$ grep "ORTE_DAEMON_ADD_LOCAL_PROCS" dvm_output.txt |wc -l
42
$ grep "sess_dir_finalize" dvm_output.txt |wc -l
35

In other words, the "[netbook:XXXX] sess_dir_finalize: proc session dir does
not exist" message doesn't show up for the hanging ones, which could support
your question that the orte-dvm is at fault.

Gr,
Mark
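
P.S. A minimal sketch of how the four counts above could be gathered in one
pass instead of four separate grep runs. The patterns are copied from the grep
commands; the function names (count_markers, missing_finalize) and the short
labels are made up for illustration and are not part of ORTE. It only counts
matching lines, exactly like `grep PATTERN | wc -l`; it does not try to map a
message to a specific job, since the log format is not known well enough here.

```python
import re

# Patterns copied from the grep commands above; the dict keys are
# hypothetical labels chosen for this sketch.
PATTERNS = {
    "released": r"Releasing job data.*INVALID",
    "spawned": r"ORTE_DAEMON_SPAWN_JOB_CMD",
    "add_local_procs": r"ORTE_DAEMON_ADD_LOCAL_PROCS",
    "sess_dir_finalize": r"sess_dir_finalize",
}


def count_markers(lines):
    """Count the lines matching each pattern (same as `grep PATTERN | wc -l`)."""
    compiled = {name: re.compile(pattern) for name, pattern in PATTERNS.items()}
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, regex in compiled.items():
            if regex.search(line):
                counts[name] += 1
    return counts


def missing_finalize(counts):
    """Spawned jobs minus sess_dir_finalize lines: jobs lacking that message."""
    return counts["spawned"] - counts["sess_dir_finalize"]


def summarize(path="dvm_output.txt"):
    """Read the orte-dvm output file and return (counts, missing)."""
    with open(path, errors="replace") as handle:
        counts = count_markers(handle)
    return counts, missing_finalize(counts)
```

Calling summarize() on the dvm_output.txt from this run should reproduce the
numbers above, with a difference of 42 - 35 = 7 between spawned jobs and
sess_dir_finalize messages.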