I have a mix of high-throughput and long-wait jobs. We classify and prioritize jobs based on runtime, using a JSV to set one of the following complexes:

devel  # job length < 1hr
short  # job length < 6hr
medium # job length < 2day
long   # job length < 1wk
xlong  # job length > 1wk (goes to ACLed queue)

Users have to specify runtime or they get the default of 1 hr.
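
To make the classification above concrete, here is a minimal Python sketch of the runtime-to-class mapping. It is not our actual JSV and does not use the JSV API; the function names, the HH:MM:SS input format, and the treatment of the boundaries as inclusive (so that the 1 hr default lands in devel and only jobs strictly over a week become xlong) are all assumptions made for illustration.

```python
# Hypothetical illustration only: this is not the site's real JSV and it
# does not call any Grid Engine API. All names here are made up.

HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY

# Default applied when the user does not specify a runtime (1 hr).
DEFAULT_RUNTIME = HOUR

# (upper bound in seconds, class name), checked in ascending order.
# Assumption: bounds are inclusive, so a job of exactly 1 hr is "devel".
CLASS_LIMITS = [
    (HOUR, "devel"),
    (6 * HOUR, "short"),
    (2 * DAY, "medium"),
    (WEEK, "long"),
]


def parse_runtime(text):
    """Convert 'HH:MM:SS', 'MM:SS', or plain seconds into seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def classify(runtime=None):
    """Return the job class for a requested runtime string (or the default)."""
    seconds = DEFAULT_RUNTIME if runtime is None else parse_runtime(runtime)
    for limit, name in CLASS_LIMITS:
        if seconds <= limit:
            return name
    # Anything longer than a week goes to the ACLed class.
    return "xlong"


if __name__ == "__main__":
    for request in (None, "00:30:00", "05:00:00", "24:00:00", "200:00:00"):
        print(request, "->", classify(request))
```

In a real JSV the resulting class name would be attached to the job as a resource request, which is what lets the consumable limits and urgencies apply per class.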

The complexes are consumable and have various urgencies assigned. The limits for these complexes are set in the global host definition, and they are used to help manage SLAs for the various job classes. It appears to work pretty well in our environment: we stay in the 85-95% utilization range, wait times stay reasonable relative to overall runtime, and turnover stays within 110-120% of runtime in the worst case.

8.1.1 has been fantastic overall, so this is a fairly minor issue; it's mostly me trying to eke out a couple more percentage points of utilization :)

Thanks for the link to the design documents.  They'll be very helpful.

-Brian

Brian Smith
Sr. System Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. SVC4010
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu

On 08/28/2012 06:25 PM, Dave Love wrote:
Brian Smith <b...@mail.usf.edu> writes:

Hi, Dave,

I'm mostly trying to verify the behavior of max_reservations as the
clarity of the man page is a little lacking.

No great surprise...  I can try to clarify it for things I understand --
maybe after Reuti explains.  I don't know whether the design document
<http://arc.liv.ac.uk/repos/darcs/sge/doc/devel/rfe/resource_reservation.txt>
is any more use.  Anyway, experimentally the number of individual
resources reserved can be much bigger than max_reservations (which
counts jobs as far as I know without checking the code).

I'm on 8.1.1 SoG, and if
I set the number too low, I get starving, short >128 slot parallel
jobs. If I set it too high, everything seems to get stalled.

Do you have a high throughput, frequent scheduling, or lots of waiting
jobs?  Anything else that might be unusual?  We don't usually have more
than a few 10s of jobs waiting (although often there are very large
arrays), and I've not seen particular problems with qmaster stalling.
Is there anything useful in its messages file, especially after changing
the log level to "info"?

Note that DURATION_OFFSET might be relevant, but it doesn't sound so in
this case.

I've been looking at the schedule file (and now that you've pointed
out qsched, I'll probably be checking that out as well), and I'm just
having some trouble finding the best middle-ground for my environment.

I'm not sure what to suggest and hope someone else has advice.  By the
way, I doubt there will be anything very different in this area
between that version and older ones.  I have seen problems with
reservations, but mainly with them apparently being lost and maybe
returning.

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
