William Hay <[email protected]> writes:

> I've been working on adding BLCR checkpointing for OpenMPI jobs on our
> cluster.

Is that the one with infinipath?  If so, where did you get the
checkpointing support?

> Although the checkpoint and restart themselves seem to work in
> the process I encountered a few issues if I reschedule a multi-node job via
> qmod -rq or qmod -rj.
> 1)I get errors in the messages file of nodes running slave tasks but not
> the master.
> 12/07/2012 15:43:52| main|node-f10|E|slave shepherd of job 1843574.1
> exited with exit status = 11

Presumably the first thing to do is figure out why the job went to 11.

> 2)On the non-master nodes the actual job processes do not die but instead
> get re-parented to init.
> 3)The job in question will not run on the non-master nodes of its previous
> incarnation. If it tries to start on them It gets stuck in an Rt state
> until I restart (softstop then start) the sge_execd.

No log messages, I assume.  Do other jobs start on those nodes?  Is it a
problem with the original spool directories not being deleted?

> I can probably find a way to kill the rogue processes and kick the
> sge_execd when the errors appear but I wonder if anyone has encountered
> this before and has a way to prevent the issues in the first place

The latest SGE should prevent that.  Otherwise, provided you can avoid
loosely integrated jobs, proc_police should do the job:
<http://arc.liv.ac.uk/SGE/howto/remove_orphaned_processes.html>.

-- 
Community Grid Engine: http://arc.liv.ac.uk/SGE/
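
For spotting the re-parented job processes mentioned above, something like the following may help as a rough sketch. It is not proc_police and not part of SGE; the function names (parse_stat, find_reparented) are made up for illustration. It assumes a Linux /proc and that orphans end up under PID 1, which is not guaranteed where a subreaper is in use. It only reports candidates, since daemons legitimately have init as their parent.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    left, right = stat_text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    fields = right.split()
    # After comm: fields[0] is the state, fields[1] is the parent PID.
    return int(pid_str), comm, int(fields[1])


def find_reparented(uid, proc_root="/proc"):
    """List (pid, comm) for processes owned by uid whose parent is PID 1.

    Hypothetical helper: uid would be the job owner's numeric user ID.
    """
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        path = os.path.join(proc_root, entry)
        try:
            if os.stat(path).st_uid != uid:
                continue
            with open(os.path.join(path, "stat"), errors="replace") as fh:
                pid, comm, ppid = parse_stat(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        if ppid == 1:
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    # Report candidates for the current user; deciding what to kill is
    # left to the admin.
    for pid, comm in find_reparented(os.getuid()):
        print(pid, comm)
```

Run as the job owner (or adapt the uid) on a node that hosted slave tasks; anything listed still needs checking against the job before it is killed.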
