Re: [gridengine users] Verifying behavior of max_reservations

Brian Smith Wed, 29 Aug 2012 08:09:16 -0700

I have a mix of high-throughput and long wait jobs. We classify and prioritize jobs based on runtime. We use a jsv to set


devel  # job length < 1hr
short  # job length < 6hr
medium # job length < 2day
long   # job length < 1wk
xlong  # job length > 1wk (goes to ACLed queue)


Users have to specify runtime or they get the default of 1 hr.
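
For illustration only, the runtime-to-class mapping listed above could be sketched as below. This is not the site's actual JSV. The function name `classify`, the constants, and the handling of values exactly on a boundary are assumptions. The list leaves a runtime of exactly one week unspecified; here it falls into xlong. Under a strict reading of "<", the 1 hr default lands in "short" rather than "devel".

```python
# Hypothetical sketch of the runtime classes described in the post.
# Names and boundary handling are assumptions, not the real JSV.
HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY

# Default applied when the user gives no runtime (1 hr, per the post).
DEFAULT_RUNTIME = 1 * HOUR

# (exclusive upper bound in seconds, class name), checked in order.
CLASS_LIMITS = [
    (1 * HOUR, "devel"),
    (6 * HOUR, "short"),
    (2 * DAY, "medium"),
    (1 * WEEK, "long"),
]


def classify(runtime_seconds=None):
    """Return the job class for a requested runtime in seconds."""
    if runtime_seconds is None:
        runtime_seconds = DEFAULT_RUNTIME
    for limit, name in CLASS_LIMITS:
        if runtime_seconds < limit:
            return name
    # Anything at or beyond one week goes to the ACLed xlong class (assumed).
    return "xlong"


if __name__ == "__main__":
    for seconds in (None, 30 * 60, 5 * HOUR, 3 * DAY, 8 * DAY):
        print(seconds, classify(seconds))
```

A real JSV would additionally have to read the requested runtime from the job submission and attach the matching complex; that part is left out because it depends on the site's configuration.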

The complexes are consumable and have various urgencies assigned. The limits for these complexes are set in the global host definition and they are used to help manage SLAs for various job classes. It appears to work pretty well in our environment, allowing us to stay in the 85-95% utilization range, keeping wait times reasonable based on overall runtime, and turnover within 110-120% of runtime, worst case.

8.1.1 has been fantastic, overall, so this is a fairly minor issue; mostly me trying to eke out a couple more percentage points of utilization :)


Thanks for the link to the design documents.  They'll be very helpful.

-Brian

Brian Smith
Sr. System Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. SVC4010
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu

On 08/28/2012 06:25 PM, Dave Love wrote:

Brian Smith <b...@mail.usf.edu> writes:

Hi, Dave,

I'm mostly trying to verify the behavior of max_reservations as the
clarity of the man page is a little lacking.


No great surprise...  I can try to clarify it for things I understand --
maybe after Reuti explains.  I don't know whether the design document
<http://arc.liv.ac.uk/repos/darcs/sge/doc/devel/rfe/resource_reservation.txt>
is any more use.  Anyway, experimentally the number of individual
resources reserved can be much bigger than max_reservations (which
counts jobs as far as I know without checking the code).

I'm on 8.1.1 SoG, and if
I set the number too low, I get starving, short >128 slot parallel
jobs. If I set it too high, everything seems to get stalled.


Do you have a high throughput, frequent scheduling, or lots of waiting
jobs?  Anything else that might be unusual?  We don't usually have more
than a few 10s of jobs waiting (although often there are very large
arrays), and I've not seen particular problems with qmaster stalling.
Is there anything useful in its messages file, especially after changing
the log level to "info"?

Note that DURATION_OFFSET might be relevant, but it doesn't sound so in
this case.

I've been looking at the schedule file (and now that you've pointed
out qsched, I'll probably be checking that out as well), and I'm just
having some trouble finding the best middle-ground for my environment.


I'm not sure what to suggest and hope someone else has advice.  By the
way, I doubt there will be anything very different in this area
between that version and older ones.  I have seen problems with
reservations, but mainly with them apparently being lost and maybe
returning.

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
