Okay, please try the attached patch. It will cause two messages to be output for each job: one indicating the job has been marked terminated, and the other reporting that the completion message was sent to the requestor. Let's see what that tells us.
Thanks Ralph On Wed, Oct 14, 2015 at 3:44 PM, Mark Santcroos <mark.santcr...@rutgers.edu> wrote: > Hi Ralph, > > > On 15 Oct 2015, at 0:26 , Ralph Castain <r...@open-mpi.org> wrote: > > Okay, so each orte-submit is reporting job has launched, which means the > hang is coming while waiting to hear the job completed. Are you sure that > orte-dvm believes the job has completed? > > No, I'm not. > > > In other words, when you say that you observe the job as completing, are > you basing that on some output from orte-dvm, or because the procs have > exited, or...? > > ... because the tasks have created their output. > > > I can send you a patch tonight that would cause orte-dvm to emit a "job > completed" message when it determines each job has terminated - might help > us take the next step. > > Great. > > > I'm wondering if orte-dvm thinks the job is still running, and the race > condition is in that area (as opposed to being in orte-submit itself) > > Do some counts from the output of orte-dvm provide some hints? > > > $ grep "Releasing job data.*INVALID" dvm_output.txt |wc -l > 42 > > $ grep "ORTE_DAEMON_SPAWN_JOB_CMD" dvm_output.txt |wc -l > 42 > > $ grep "ORTE_DAEMON_ADD_LOCAL_PROCS" dvm_output.txt |wc -l > 42 > > $ grep "sess_dir_finalize" dvm_output.txt |wc -l > 35 > > > In other words, the "[netbook:XXXX] sess_dir_finalize: proc session dir > does not exist" message doesn't show up for the hanging ones, which could > support your question that the orte-dvm is at fault. > > Gr, > > Mark > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/10/18171.php >
diff --git a/orte/mca/state/dvm/state_dvm.c b/orte/mca/state/dvm/state_dvm.c index 0e7309c..5b1a841 100644 --- a/orte/mca/state/dvm/state_dvm.c +++ b/orte/mca/state/dvm/state_dvm.c @@ -267,6 +267,7 @@ void check_complete(int fd, short args, void *cbdata) if (jdata->state < ORTE_JOB_STATE_UNTERMINATED) { jdata->state = ORTE_JOB_STATE_TERMINATED; } + opal_output(0, "%s JOB %s HAS TERMINATED", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_JOBID_PRINT(jdata->jobid)); } /* tell the IOF that the job is complete */ diff --git a/orte/tools/orte-dvm/orte-dvm.c b/orte/tools/orte-dvm/orte-dvm.c index 3cdf585..003f93a 100644 --- a/orte/tools/orte-dvm/orte-dvm.c +++ b/orte/tools/orte-dvm/orte-dvm.c @@ -442,18 +442,6 @@ int main(int argc, char *argv[]) exit(orte_exit_status); } -static void send_callback(int status, orte_process_name_t *peer, - opal_buffer_t* buffer, orte_rml_tag_t tag, - void* cbdata) - -{ - orte_job_t *jdata = (orte_job_t*)cbdata; - - OBJ_RELEASE(buffer); - /* cleanup the job object */ - opal_pointer_array_set_item(orte_job_data, ORTE_LOCAL_JOBID(jdata->jobid), NULL); - OBJ_RELEASE(jdata); -} static void notify_requestor(int sd, short args, void *cbdata) { orte_state_caddy_t *caddy = (orte_state_caddy_t*)cbdata; @@ -462,6 +450,11 @@ static void notify_requestor(int sd, short args, void *cbdata) int ret; opal_buffer_t *reply; +opal_output(0, "%s NOTIFYING %s OF JOB %s COMPLETION", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + ORTE_NAME_PRINT(&jdata->originator), + ORTE_JOBID_PRINT(jdata->jobid)); + /* notify the requestor */ reply = OBJ_NEW(opal_buffer_t); /* see if there was any problem */ @@ -471,11 +464,13 @@ static void notify_requestor(int sd, short args, void *cbdata) ret = 0; } opal_dss.pack(reply, &ret, 1, OPAL_INT); - orte_rml.send_buffer_nb(&jdata->originator, reply, ORTE_RML_TAG_TOOL, send_callback, jdata); + orte_rml.send_buffer_nb(&jdata->originator, reply, ORTE_RML_TAG_TOOL, orte_rml_send_callback, NULL); + + /* flag that we were notified */ + jdata->state = ORTE_JOB_STATE_NOTIFIED; + /* send us back thru job complete */ + ORTE_ACTIVATE_JOB_STATE(jdata, ORTE_JOB_STATE_TERMINATED); - /* we cannot cleanup the job object as we might - * hit an error during transmission, so clean it - * up in the send callback */ OBJ_RELEASE(caddy); }