Jake Carroll <jake.carr...@uq.edu.au> writes:

> Hi.
>
> We've now shot the head node in the head (heh) and we're exploring killing
> off/restarting each execd on the compute nodes.
>
> Do you recommend a kill -HUP on the process, or something more aggressive?
> This will in theory "kill" currently executing jobs on each compute host,
> we're assuming?

I can't remember what this refers to, but the init scripts for SGE 8
have a "restart" option which does softstop+start.

> Also, we just caught another one in the act, on one of the nodes that just
> threw the 137:
>
> [root@compute-0-6 ~]# tail -f
> /opt/gridengine/default/spool/compute-0-6/messages
> 01/17/2013 08:03:15| main|compute-0-6|W|reaping job "1371379" ptf
> complains: Job does not exist
> 01/17/2013 09:22:33| main|compute-0-6|W|reaping job "1371379" ptf
> complains: Job does not exist
> 01/17/2013 09:24:55| main|compute-0-6|W|reaping job "1371379" ptf
> complains: Job does not exist
> 01/17/2013 09:34:12| main|compute-0-6|W|reaping job "1371379" ptf
> complains: Job does not exist
> 01/17/2013 10:06:45| main|compute-0-6|E|removing unreferenced job
> 1371379.7545 without job report from ptf
> 01/17/2013 10:09:25| main|compute-0-6|W|reaping job "1371379" ptf
> complains: Job does not exist
> 01/18/2013 17:10:52| main|compute-0-6|W|can't register at qmaster
> "cluster.local": abort qmaster registration due to communication errors
> 01/18/2013 17:16:42| main|compute-0-6|W|gethostbyname(cluster.local) took
> 20 seconds and returns TRY_AGAIN
>
> 01/18/2013 17:25:37| main|compute-0-6|E|commlib error: got select error
> (No route to host)

You'd better address the network errors before anything else. As in
the tracker, I don't know what causes the PTF errors, though.

> What's most unusual, about this, is that these time stamps don't match up
> with the error 137 we just saw.

Look in the messages files for what does.

> This example job was running for two days or so, then just became unhappy
> today, then threw the 137:
>
> Job 1307803 (b5_set11_9) Complete
> User = someguy
> Queue = medium.q@compute-0-6.local
> Host = compute-0-6.local
> Start Time = 01/14/2013 14:22:12
> End Time = 01/21/2013 12:23:02

That's nearly a week, not two days.
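
For what it's worth, here is a rough Python sketch of both checks: the
job's actual run time from the quoted Start/End times, and which messages
entries fall near the end time. It is only an illustration. The function
names (parse_stamp, entries_near) and the 30-minute window are my own
assumptions, and the field layout (timestamp|thread|host|level|text) is
inferred from the log lines quoted above, not from any SGE documentation.

```python
from datetime import datetime, timedelta

# Timestamp layout as it appears in the quoted messages file and job report.
STAMP = "%m/%d/%Y %H:%M:%S"


def parse_stamp(text):
    """Parse a timestamp such as '01/21/2013 12:23:02'."""
    return datetime.strptime(text.strip(), STAMP)


def entries_near(lines, when, window=timedelta(minutes=30)):
    """Return (timestamp, level, text) for entries within `window` of `when`.

    Assumes each entry is one line of the form
    timestamp|thread|host|level|text (an assumption based on the sample).
    The 30-minute default window is arbitrary.
    """
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # not a complete entry, skip it
        try:
            stamp = parse_stamp(fields[0])
        except ValueError:
            continue  # first field is not a timestamp, skip it
        if abs(stamp - when) <= window:
            hits.append((stamp, fields[3], fields[4]))
    return hits


# Start and end times copied from the quoted job report.
start = parse_stamp("01/14/2013 14:22:12")
end = parse_stamp("01/21/2013 12:23:02")
run_time = end - start

# Two entries copied from the quoted log, rejoined onto single lines.
sample = [
    '01/17/2013 10:09:25| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist',
    '01/18/2013 17:25:37| main|compute-0-6|E|commlib error: got select error (No route to host)',
]
near_end = entries_near(sample, end)

print(run_time)   # 6 days, 22:00:50
print(near_end)   # [] (neither sample entry is within 30 minutes of the end)
```

To run it against a real spool file, pass the open file object in place
of `sample`, e.g. the /opt/gridengine/default/spool/compute-0-6/messages
file from the quoted tail command.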

-- 
Community Grid Engine: http://arc.liv.ac.uk/SGE/
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users