On Fri, 1 Nov 2013 at 10:44 AM, Joshua Baker-LePain wrote:

I'm currently running Grid Engine 2011.11p1 on CentOS-6. I'm using classic spooling to a local disk, local $SGE_ROOT (except for $SGE_ROOT/$SGE_CELL/common), and local spooling directories on the nodes (of which there are more than 600). I'm occasionally seeing *really* long scheduling runs (the last two were 4005 and 4847 seconds). This leads to extra fun like:

11/01/2013 08:35:39|event_|sortinghat|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "$SGE_MASTER"
11/01/2013 08:35:39|event_|sortinghat|E|removing event client (schedd:0) on host "$SGE_MASTER" after acknowledge timeout from event client list

I have "PROFILE=1" set, and of course most of the time is spent in "job dispatching". But I'm really not sure how else to track down the cause of this. Where should I be looking? Are there any other options I can set to get more info?

Over the weekend this got extremely bad: one scheduling run took 22319 seconds (over six hours). This morning I started suspending jobs to see if I could find any that were causing it. Lo and behold, one user has 39 jobs in the queue, each of which is an array job with 100,000 tasks (our max_aj_tasks setting). The resource requests for the jobs are pretty basic:

hard resource_list:         h_rt=600,mem_free=1G
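
For illustration (job.sh is just a stand-in name, and the exact submit line is my guess), each of those jobs was presumably submitted along the lines of:

    qsub -t 1-100000 -l h_rt=600,mem_free=1G job.sh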

We do have mem_free set as a consumable. With these jobs on hold, scheduler runs take just a few seconds. If I release the hold on even one of them, though, the scheduler goes crazy again: very long runs and ballooning memory use.
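
For completeness, the mem_free line in our complex configuration (qconf -mc) looks roughly like this, with the consumable column flipped to YES; the default and urgency values are the stock ones, quoted from memory, and per-host capacity is then set in each exec host's complex_values:

    #name       shortcut   type      relop   requestable   consumable   default   urgency
    mem_free    mf         MEMORY    <=      YES           YES          0         0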

Looking at the qacct data for these jobs, each task runs for just a few seconds. I've already "encouraged" the user to reformulate the jobs so that each task runs much longer, but should these jobs really confound the scheduler this badly? Is my max_aj_tasks setting too high?
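
For what it's worth, the reformulation I suggested looks roughly like this (job.sh and do_one_unit are stand-in names, h_rt is an illustrative bump to cover the longer tasks, and I haven't tested this exact form here): submit with a task step size and loop inside the script so each task handles a chunk of 100 work units.

    qsub -t 1-100000:100 -l h_rt=3600,mem_free=1G job.sh

    # inside job.sh: loop over the chunk of work this task owns
    last=$(( SGE_TASK_ID + SGE_TASK_STEPSIZE - 1 ))
    for i in $(seq $SGE_TASK_ID $last); do
        ./do_one_unit $i
    done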

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

Reply via email to