On 04/11/13 20:13, Joshua Baker-LePain wrote:
> On Fri, 1 Nov 2013 at 10:44am, Joshua Baker-LePain wrote
>
>> I'm currently running Grid Engine 2011.11p1 on CentOS-6.  I'm using
>> classic spooling to a local disk, local $SGE_ROOT (except for
>> $SGE_ROOT/$SGE_CELL/common), and local spooling directories on the
>> nodes (of which there are more than 600).  I'm occasionally seeing
>> *really* long scheduling runs (the last two were 4005 and 4847
>> seconds).  This leads to extra fun like:
>>
>> 11/01/2013 08:35:39|event_|sortinghat|W|acknowledge timeout after
>> 600 seconds for event client (schedd:0) on host "$SGE_MASTER"
>> 11/01/2013 08:35:39|event_|sortinghat|E|removing event client
>> (schedd:0) on host "$SGE_MASTER" after acknowledge timeout from
>> event client list
>>
>> I have "PROFILE=1" set, and of course most of the time is spent in
>> "job dispatching".  But I'm really not sure how else to track down
>> the cause of this.  Where should I be looking?  Are there any other
>> options I can set to get more info?
>
> Over the weekend this got extremely bad -- one scheduling run took
> 22319s.  This morning I started suspending jobs to see if I could
> find any that were causing this.  Lo and behold, one user has 39 jobs
> in the queue, each of which is an array job with 100,000 tasks (our
> setting for max_aj_tasks).  The resource requests for the jobs are
> pretty basic:
>
> hard resource_list: h_rt=600,mem_free=1G
>
> We do have mem_free set as consumable.  With these jobs on hold, the
> scheduler runs are taking a few seconds.  If I take the hold off of
> even one of these jobs, though, the scheduler goes crazy again (long
> runs, eating up memory).
>
> In looking at the qacct data for these jobs, each task runs for just
> a few seconds.  I've already "encouraged" the user to reformulate the
> jobs so that each task runs much longer, but should these jobs really
> confound the scheduler so?  Is my max_aj_tasks setting too high?

I would say no, both in theory (array jobs are an optimisation and
shouldn't stress the scheduler) and in practice: I've just submitted 39
100,000-task array jobs to my cluster and the scheduler is taking about
the same time as before (< 40 seconds).
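(For what it's worth, if the user does reformulate as you suggested, a
step size on -t plus a small wrapper keeps the submission simple.  A
rough, untested sketch -- input.list and the do_one_unit command are
made up for illustration, and h_rt would need raising to match the
longer tasks:

    qsub -t 1-100000:100 batch_wrapper.sh

    #!/bin/sh
    #$ -S /bin/sh
    #$ -l h_rt=3600,mem_free=1G
    # With -t 1-100000:100 each task gets IDs 1, 101, 201, ... and
    # SGE_TASK_STEPSIZE is set to 100, so each task works through its
    # own block of 100 lines from the (hypothetical) input.list.
    first=$SGE_TASK_ID
    last=$(( SGE_TASK_ID + SGE_TASK_STEPSIZE - 1 ))
    [ "$last" -gt "$SGE_TASK_LAST" ] && last=$SGE_TASK_LAST
    i=$first
    while [ "$i" -le "$last" ]; do
        do_one_unit "$(sed -n "${i}p" input.list)"
        i=$(( i + 1 ))
    done

That turns 100,000 few-second tasks into 1,000 tasks of a few minutes
each without changing how the work itself is described.)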
What I have found does slug the scheduler is fancy string matching
(duh).  Since there don't appear to be any string matches in the hard
resource_list, and presumably there are no soft requests, has the user
requested some bizarre queue list full of alternatives?

William
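P.S.  The quickest way to check is to look at what one of the jobs is
actually requesting, e.g. (job ID made up):

    qstat -j 1234567 | egrep 'resource_list|queue_list'

If no queue_list lines come back and there's no soft resource_list,
then wildcard/alternative requests aren't the culprit and something
else is making the dispatch loop expensive.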
