Hi Ralph,

> On 15 Oct 2015, at 0:26, Ralph Castain <r...@open-mpi.org> wrote:
> Okay, so each orte-submit is reporting job has launched, which means the hang 
> is coming while waiting to hear the job completed. Are you sure that orte-dvm 
> believes the job has completed?

No, I'm not.

> In other words, when you say that you observe the job as completing, are you 
> basing that on some output from orte-dvm, or because the procs have exited, 
> or...?

... because the tasks have created their output.

> I can send you a patch tonight that would cause orte-dvm to emit a "job 
> completed" message when it determines each job has terminated - might help us 
> take the next step.

Great.

> I'm wondering if orte-dvm thinks the job is still running, and the race 
> condition is in that area (as opposed to being in orte-submit itself)

Do these counts from the orte-dvm output provide any hints?


$ grep "Releasing job data.*INVALID" dvm_output.txt |wc -l
      42

$ grep "ORTE_DAEMON_SPAWN_JOB_CMD" dvm_output.txt |wc -l
      42

$ grep "ORTE_DAEMON_ADD_LOCAL_PROCS" dvm_output.txt |wc -l
      42

$ grep "sess_dir_finalize" dvm_output.txt |wc -l
      35


In other words, the "[netbook:XXXX] sess_dir_finalize: proc session dir does 
not exist" message shows up only 35 times for 42 spawned jobs, so it is 
missing for the hanging ones. That would support your suspicion that orte-dvm, 
rather than orte-submit, is at fault.
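
To make the comparison repeatable, something like the sketch below could 
compute the gap directly. The helper names count_matches and report_missing 
are made up for illustration, and it assumes (as the counts above suggest, but 
I have not verified) that each job produces exactly one spawn line and at most 
one sess_dir_finalize line in the log.

```shell
#!/bin/sh
# Hypothetical helper: print the number of lines in file $2 matching pattern $1.
count_matches() {
    # grep -c prints the count of matching lines; it exits non-zero when
    # there are no matches, so "|| true" keeps callers running under "set -e".
    grep -c -- "$1" "$2" || true
}

# Hypothetical helper: compare spawned jobs with finalized session dirs.
# Assumption: one ORTE_DAEMON_SPAWN_JOB_CMD line and at most one
# sess_dir_finalize line per job.
report_missing() {
    log=$1
    spawned=$(count_matches "ORTE_DAEMON_SPAWN_JOB_CMD" "$log")
    finalized=$(count_matches "sess_dir_finalize" "$log")
    echo "spawned=$spawned finalized=$finalized missing=$((spawned - finalized))"
}

# Usage: report_missing dvm_output.txt
```

If the per-job assumption does not hold, the "missing" number is only a rough 
indicator and not a count of hung jobs.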

Gr,

Mark
