[gridengine users] Debugging really long scheduling runs

Joshua Baker-LePain Fri, 01 Nov 2013 10:51:52 -0700

I'm currently running Grid Engine 2011.11p1 on CentOS-6. I'm using classic spooling to a local disk, local $SGE_ROOT (except for $SGE_ROOT/$SGE_CELL/common), and local spooling directories on the nodes (of which there are more than 600). I'm occasionally seeing *really* long scheduling runs (the last two were 4005 and 4847 seconds). This leads to extra fun like:


11/01/2013 08:35:39|event_|sortinghat|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "$SGE_MASTER"
11/01/2013 08:35:39|event_|sortinghat|E|removing event client (schedd:0) on host "$SGE_MASTER" after acknowledge timeout from event client list
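
To see how often these timeouts occur and when, the qmaster messages file can be scanned for the warning. What follows is only a minimal sketch: it assumes every entry sits on a single line in the pipe-delimited layout of the two entries quoted above (timestamp|thread|host|level|message), and the function name find_ack_timeouts is made up for illustration.

```python
import re
from datetime import datetime

# Assumed message wording, copied from the warning quoted above.
TIMEOUT_RE = re.compile(
    r"acknowledge timeout after (\d+) seconds for event\s+client \((\S+?)\)"
)


def find_ack_timeouts(lines):
    """Return (timestamp, host, client, seconds) for each acknowledge-timeout warning."""
    hits = []
    for line in lines:
        # Assumed layout: timestamp|thread|host|level|message
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) != 5:
            continue  # wrapped or unrelated line
        stamp, _thread, host, level, message = parts
        match = TIMEOUT_RE.search(message)
        if level != "W" or match is None:
            continue
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
        hits.append((when, host, match.group(2), int(match.group(1))))
    return hits
```

Passing an open file handle for the messages file (wherever qmaster spooling puts it on the master) gives one tuple per timeout, which can then be lined up against the long scheduling runs.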

I have "PROFILE=1" set, and of course most of the time is spent in "job dispatching". But I'm really not sure how else to track down the cause of this. Where should I be looking? Are there any other options I can set to get more info?


Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF