I've been working on adding BLCR checkpointing for OpenMPI jobs on our
cluster. Although checkpoint and restart themselves seem to work, in the
process I've run into a few issues when I reschedule a multi-node job via
qmod -rq or qmod -rj.
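
For context, the submission and reschedule look roughly like this (the
checkpoint environment, PE and queue names below are placeholders rather
than our exact configuration):

  # submit an OpenMPI job with BLCR checkpointing (example names only)
  qsub -ckpt blcr -pe orte 16 -r y job_script.sh

  # later, reschedule either the single job or the whole queue instance
  qmod -rj 1843574
  qmod -rq all.q@node-f10
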
1) I get errors in the messages file on nodes running slave tasks, but not
on the master:
12/07/2012 15:43:52|  main|node-f10|E|slave shepherd of job 1843574.1
exited with exit status = 11
12/07/2012 15:43:52|  main|node-f10|E|can't find directory
active_jobs/1843574.1/1.node-f10 for reaping job 1843574.1 task 1.node-f10
--
12/10/2012 13:27:41|  main|node-f10|E|slave shepherd of job 2061719.1
exited with exit status = 11
12/10/2012 13:27:41|  main|node-f10|E|can't find directory
active_jobs/2061719.1/1.node-f10 for reaping job 2061719.1 task 1.node-f10
--
12/10/2012 14:42:37|  main|node-f10|E|slave shepherd of job 2062825.1
exited with exit status = 11
12/10/2012 14:42:37|  main|node-f10|E|can't find directory
active_jobs/2062825.1/1.node-f10 for reaping job 2062825.1 task 1.node-f10
--
12/10/2012 14:57:57|  main|node-f10|E|slave shepherd of job 2062825.1
exited with exit status = 11
12/10/2012 14:57:57|  main|node-f10|E|can't find directory
active_jobs/2062825.1/1.node-f10 for reaping job 2062825.1 task 1.node-f10
--
12/11/2012 09:27:45|  main|node-f10|E|slave shepherd of job 2066267.1
exited with exit status = 11
12/11/2012 09:27:45|  main|node-f10|E|can't find directory
active_jobs/2066267.1/1.node-f10 for reaping job 2066267.1 task 1.node-f10
--
12/12/2012 09:38:02|  main|node-f10|E|slave shepherd of job 2067358.1
exited with exit status = 11
12/12/2012 09:38:02|  main|node-f10|E|can't find directory
active_jobs/2067358.1/1.node-f10 for reaping job 2067358.1 task 1.node-f10
12/12/2012 11:51:53|  main|node-f10|E|slave shepherd of job 2067359.1
exited with exit status = 11
12/12/2012 11:51:53|  main|node-f10|E|can't find directory
active_jobs/2067359.1/1.node-f10 for reaping job 2067359.1 task 1.node-f10

2) On the non-master nodes the actual job processes do not die; instead
they get re-parented to init.
3) The job in question will not run on the non-master nodes of its
previous incarnation. If it tries to start on them it gets stuck in an Rt
state until I restart (softstop then start) the sge_execd.

I can probably find a way to kill the rogue processes and kick the
sge_execd when the errors appear, but I wonder if anyone has encountered
this before and has a way to prevent the issues in the first place.
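
For what it's worth, the manual cleanup I have in mind would look roughly
like this on each affected slave node (user name, application name and
script path are placeholders for our install, so treat it as a sketch):

  # 1) list leftover job processes that got re-parented to init (PPID 1)
  pgrep -l -P 1 -u someuser

  # 2) kill the rogue MPI processes once identified
  pkill -P 1 -u someuser mpi_app

  # 3) kick the execd the same way I do it by hand: softstop, then start
  $SGE_ROOT/default/common/sgeexecd softstop
  $SGE_ROOT/default/common/sgeexecd start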


Thanks

William