Jake Carroll <jake.carr...@uq.edu.au> writes:

> Hi.
>
> We've now shot the head node in the head (heh) and we're exploring killing
> off/restarting each execd on the compute nodes.
>
> Do you recommend a kill -HUP on the process, or something more aggressive?
> This will in theory "kill" currently executing jobs on each compute host,
> we're assuming?

I can't remember what this refers to, but the init scripts for SGE 8
have a "restart" option which does softstop+start.

> Also, we just caught another one in the act, on one of the nodes that just
> threw the 137:
>
> [root@compute-0-6 ~]# tail -f /opt/gridengine/default/spool/compute-0-6/messages
> 01/17/2013 08:03:15|  main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
> 01/17/2013 09:22:33|  main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
> 01/17/2013 09:24:55|  main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
> 01/17/2013 09:34:12|  main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
> 01/17/2013 10:06:45|  main|compute-0-6|E|removing unreferenced job 1371379.7545 without job report from ptf
> 01/17/2013 10:09:25|  main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
> 01/18/2013 17:10:52|  main|compute-0-6|W|can't register at qmaster "cluster.local": abort qmaster registration due to communication errors
> 01/18/2013 17:16:42|  main|compute-0-6|W|gethostbyname(cluster.local) took 20 seconds and returns TRY_AGAIN
>
> 01/18/2013 17:25:37|  main|compute-0-6|E|commlib error: got select error (No route to host)

You'd better address the network errors before anything else.  As I said
in the tracker, though, I don't know what causes the PTF errors.
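
One way to check the slow lookup reported in the log above is to time
name resolution directly on the compute node.  This is only a rough
sketch: the helper name time_lookup and its resolver parameter are made
up for illustration, and it uses nothing beyond Python's standard
socket module.

```python
import socket
import time


def time_lookup(host, resolver=socket.gethostbyname):
    """Time one name lookup; return (seconds, address_or_None, error_or_None).

    `resolver` is a hypothetical hook so the timing logic can be exercised
    without touching the real resolver.
    """
    begin = time.monotonic()
    try:
        addr, err = resolver(host), None
    except OSError as exc:  # socket.gaierror and socket.herror subclass OSError
        addr, err = None, exc
    return time.monotonic() - begin, addr, err
```

Calling time_lookup("cluster.local") on compute-0-6 should show whether
the resolver is still taking many seconds or failing outright; a
healthy lookup would normally return well under a second.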

> What's most unusual, about this, is that these time stamps don't match up
> with the error 137 we just saw.

Look in the messages files for entries whose time stamps do match.
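
If it helps, something along these lines would pull out the entries
close to a given time.  It is a sketch only: it assumes every entry
starts with an "MM/DD/YYYY HH:MM:SS|" stamp as in the excerpt quoted
earlier, and the function name entries_near and the ten-minute window
are arbitrary choices.

```python
from datetime import datetime, timedelta

FMT = "%m/%d/%Y %H:%M:%S"  # assumed layout of the leading time stamp


def entries_near(lines, when, window=timedelta(minutes=10)):
    """Return the lines whose leading time stamp lies within `window` of `when`."""
    hits = []
    for line in lines:
        stamp = line.split("|", 1)[0].strip()
        try:
            ts = datetime.strptime(stamp, FMT)
        except ValueError:
            continue  # blank line, continuation line, or unexpected format
        if abs(ts - when) <= window:
            hits.append(line.rstrip("\n"))
    return hits


# Two entries taken from the excerpt quoted earlier in this thread.
sample = [
    '01/17/2013 10:09:25|  main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist',
    '01/18/2013 17:25:37|  main|compute-0-6|E|commlib error: got select error (No route to host)',
]
job_end = datetime.strptime("01/21/2013 12:23:02", FMT)
print(entries_near(sample, job_end))  # prints [] : nothing quoted is near the job's end
```

On a real system you would pass an open file instead of the sample
list, for the execd messages file on the node and for the qmaster one.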

> This example job was running for two days or so, then just became unhappy
> today, then threw the 137:
>
> Job 1307803 (b5_set11_9) Complete
> User             = someguy
> Queue            = medium.q@compute-0-6.local
> Host             = compute-0-6.local
> Start Time       = 01/14/2013 14:22:12
> End Time         = 01/21/2013 12:23:02

That's nearly a week, not two days.
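
The arithmetic on the two quoted time stamps, as a quick check (the
format string assumes month/day/year order, which is how the report
appears to print dates):

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"  # assumed month/day/year order

start = datetime.strptime("01/14/2013 14:22:12", FMT)
end = datetime.strptime("01/21/2013 12:23:02", FMT)

elapsed = end - start
print(elapsed)  # 6 days, 22:00:50
```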

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
