On Fri, 1 Nov 2013 at 10:44 AM, Joshua Baker-LePain wrote:

I'm currently running Grid Engine 2011.11p1 on CentOS-6. I'm using classic spooling to a local disk, local $SGE_ROOT (except for $SGE_ROOT/$SGE_CELL/common), and local spooling directories on the nodes (of which there are more than 600). I'm occasionally seeing *really* long scheduling runs (the last two were 4005 and 4847 seconds). This leads to extra fun like:

11/01/2013 08:35:39|event_|sortinghat|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "$SGE_MASTER"
11/01/2013 08:35:39|event_|sortinghat|E|removing event client (schedd:0) on host "$SGE_MASTER" after acknowledge timeout from event client list

I have "PROFILE=1" set, and of course most of the time is spent in "job dispatching". But I'm really not sure how else to track down the cause of this. Where should I be looking? Are there any other options I can set to get more info?

Over the weekend this got extremely bad: one scheduling run took 22319 seconds (over six hours). This morning I started suspending jobs to see if I could find any that were causing it. Lo and behold, one user has 39 jobs in the queue, each of which is an array job with 100,000 tasks (our max_aj_tasks setting). The resource requests for the jobs are pretty basic:

hard resource_list:         h_rt=600,mem_free=1G
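
For illustration (job.sh is just a stand-in name, and the exact submit line is my guess), each of those jobs was presumably submitted along the lines of:

    qsub -t 1-100000 -l h_rt=600,mem_free=1G job.sh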

We do have mem_free set as a consumable. With these jobs on hold, scheduler runs take just a few seconds. If I release the hold on even one of them, though, the scheduler goes crazy again: very long runs and ballooning memory use.
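
For completeness, the mem_free line in our complex configuration (qconf -mc) looks roughly like this, with the consumable column flipped to YES; the default and urgency values are the stock ones, quoted from memory, and per-host capacity is then set in each exec host's complex_values:

    #name       shortcut   type      relop   requestable   consumable   default   urgency
    mem_free    mf         MEMORY    <=      YES           YES          0         0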

Looking at the qacct data for these jobs, each task runs for just a few seconds. I've already "encouraged" the user to reformulate the jobs so that each task runs much longer, but should these jobs really confound the scheduler this badly? Is my max_aj_tasks setting too high?
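
For what it's worth, the reformulation I suggested looks roughly like this (job.sh and do_one_unit are stand-in names, h_rt is an illustrative bump to cover the longer tasks, and I haven't tested this exact form here): submit with a task step size and loop inside the script so each task handles a chunk of 100 work units.

    qsub -t 1-100000:100 -l h_rt=3600,mem_free=1G job.sh

    # inside job.sh: loop over the chunk of work this task owns
    last=$(( SGE_TASK_ID + SGE_TASK_STEPSIZE - 1 ))
    for i in $(seq $SGE_TASK_ID $last); do
        ./do_one_unit $i
    done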

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

Reply via email to