Interesting - I see why. Please try this version.
Ralph
On Thu, Oct 15, 2015 at 4:05 AM, Mark Santcroos <[email protected]> wrote:
>
> > On 15 Oct 2015, at 4:38, Ralph Castain <[email protected]> wrote:
> > Okay, please try the attached patch.
>
> *scratch*
>
> Although I reported results with the patch earlier, I can't reproduce it
> anymore.
> Now orte-dvm shuts down after the first orte-submit completes with:
>
>
> [netbook:72038] [[9827,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SPAWN_JOB_CMD
> [netbook:72038] [[9827,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
> [netbook:72038] [[9827,0],0] Releasing job data for [INVALID]
> [netbook:72038] sess_dir_finalize: proc session dir does not exist
> [netbook:72038] [[9827,0],0] JOB [9827,1] HAS TERMINATED
> [netbook:72038] [[9827,0],0] NOTIFYING [[9826,0],0] OF JOB [9827,1] COMPLETION
> [netbook:72038] [[9827,0],0] JOB [9827,1] HAS TERMINATED
> [netbook:72038] [[9827,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_EXIT_CMD
> [netbook:72038] sess_dir_finalize: proc session dir does not exist
> [netbook:72038] sess_dir_cleanup: job session dir does not exist
> exiting with status 0
>
>
> (Earlier I maybe had an unpatched instance of orte-dvm still running and
> either the installation or some dynamic linking got messed up?!?!)
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18178.php
>
diff --git a/orte/mca/state/dvm/state_dvm.c b/orte/mca/state/dvm/state_dvm.c
index 0e7309c..5b1a841 100644
--- a/orte/mca/state/dvm/state_dvm.c
+++ b/orte/mca/state/dvm/state_dvm.c
@@ -267,6 +267,7 @@ void check_complete(int fd, short args, void *cbdata)
if (jdata->state < ORTE_JOB_STATE_UNTERMINATED) {
jdata->state = ORTE_JOB_STATE_TERMINATED;
}
+ opal_output(0, "%s JOB %s HAS TERMINATED", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_JOBID_PRINT(jdata->jobid));
}
/* tell the IOF that the job is complete */
diff --git a/orte/tools/orte-dvm/orte-dvm.c b/orte/tools/orte-dvm/orte-dvm.c
index 3cdf585..f9a969a 100644
--- a/orte/tools/orte-dvm/orte-dvm.c
+++ b/orte/tools/orte-dvm/orte-dvm.c
@@ -462,6 +462,11 @@ static void notify_requestor(int sd, short args, void *cbdata)
int ret;
opal_buffer_t *reply;
+opal_output(0, "%s NOTIFYING %s OF JOB %s COMPLETION",
+ ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+ ORTE_NAME_PRINT(&jdata->originator),
+ ORTE_JOBID_PRINT(jdata->jobid));
+
/* notify the requestor */
reply = OBJ_NEW(opal_buffer_t);
/* see if there was any problem */
@@ -476,6 +481,7 @@ static void notify_requestor(int sd, short args, void *cbdata)
/* we cannot cleanup the job object as we might
* hit an error during transmission, so clean it
* up in the send callback */
+
OBJ_RELEASE(caddy);
}