I'm currently running Grid Engine 2011.11p1 on CentOS 6. I'm using
classic spooling to a local disk, a local $SGE_ROOT (except for
$SGE_ROOT/$SGE_CELL/common), and local spool directories on the nodes
(of which there are more than 600). I'm occasionally seeing *really* long
scheduling runs (the last two took 4005 and 4847 seconds). This leads to
extra fun in the qmaster log like:
11/01/2013 08:35:39|event_|sortinghat|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "$SGE_MASTER"
11/01/2013 08:35:39|event_|sortinghat|E|removing event client (schedd:0) on host "$SGE_MASTER" after acknowledge timeout from event client list
I have "PROFILE=1" set, and as expected most of the time is spent in "job
dispatching". Beyond that, I'm not sure how to track down the cause.
Where should I be looking? Are there any other options I can set to get
more info?
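
A minimal sketch for picking the acknowledge timeouts out of the qmaster
messages file, in case it helps anyone correlate them with the long
scheduling runs. It assumes the pipe-delimited layout seen in the excerpt
above (timestamp|thread|host|level|message), month/day/year dates, and one
entry per line in the actual file. The function names are made up for
illustration and are not part of Grid Engine.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the log excerpt:
#   MM/DD/YYYY HH:MM:SS|thread|host|level|message
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|"
    r"(?P<thread>[^|]*)\|(?P<host>[^|]*)\|(?P<level>[A-Z])\|(?P<msg>.*)$"
)
TIMEOUT_RE = re.compile(r"acknowledge timeout after (\d+) seconds")


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # Assumption: the date is month/day/year.
    rec["ts"] = datetime.strptime(rec["ts"], "%m/%d/%Y %H:%M:%S")
    return rec


def ack_timeouts(lines):
    """Return (timestamp, timeout_seconds) for every acknowledge-timeout warning."""
    hits = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        m = TIMEOUT_RE.search(rec["msg"])
        if m:
            hits.append((rec["ts"], int(m.group(1))))
    return hits
```

ack_timeouts() takes any iterable of lines, so an open file object for the
messages file can be passed in directly.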
Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF