I'm currently running Grid Engine 2011.11p1 on CentOS-6. I'm using classic spooling to a local disk, local $SGE_ROOT (except for $SGE_ROOT/$SGE_CELL/common), and local spooling directories on the nodes (of which there are more than 600). I'm occasionally seeing *really* long scheduling runs (the last two were 4005 and 4847 seconds). This leads to extra fun like:

11/01/2013 08:35:39|event_|sortinghat|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "$SGE_MASTER"
11/01/2013 08:35:39|event_|sortinghat|E|removing event client (schedd:0) on host "$SGE_MASTER" after acknowledge timeout from event client list
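
To see how often these timeouts occur and when, the qmaster messages file can be scanned for them. What follows is only a rough sketch, not an official tool. It assumes every entry has the pipe-separated layout seen above (timestamp|thread|host|level|message) and that the date is month/day/year. The names LogEntry, parse_messages and ack_timeouts are made up for illustration.

```python
import re
from datetime import datetime
from typing import Iterable, Iterator, List, NamedTuple


class LogEntry(NamedTuple):
    timestamp: datetime
    source: str   # logging thread, e.g. "event_"
    host: str
    level: str    # single letter, e.g. "W" or "E"
    message: str


# An entry starts with "MM/DD/YYYY HH:MM:SS|" (assumed layout, see above).
_ENTRY_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\|")


def _to_entry(text: str) -> LogEntry:
    # Split on the first four pipes only, so pipes in the message survive.
    fields = text.split("|", 4)
    fields += [""] * (5 - len(fields))
    stamp = datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S")
    return LogEntry(stamp, fields[1], fields[2], fields[3], fields[4])


def parse_messages(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield one LogEntry per log entry, re-joining wrapped lines."""
    pending = None
    for raw in lines:
        line = raw.rstrip("\n")
        if _ENTRY_START.match(line):
            if pending is not None:
                yield _to_entry(pending)
            pending = line
        elif pending is not None and line.strip():
            # Continuation of an entry that was wrapped (e.g. by a mailer).
            pending = pending.rstrip() + " " + line.strip()
    if pending is not None:
        yield _to_entry(pending)


def ack_timeouts(entries: Iterable[LogEntry]) -> List[LogEntry]:
    """Keep only the entries that mention an event client acknowledge timeout."""
    return [e for e in entries if "acknowledge timeout" in e.message]
```

Typical use would be to open the qmaster messages file, pass the file object to parse_messages, and print the timestamps returned by ack_timeouts so they can be lined up against the long scheduling runs.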

I have "PROFILE=1" set, and of course most of the time is spent in "job dispatching". But I'm really not sure how else to track down the cause of this. Where should I be looking? Are there any other options I can set to get more info?

Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
