Re: [gridengine users] question about managing queues

Carl G. Riches Fri, 31 Jul 2015 14:08:27 -0700

On Wed, 29 Jul 2015, Reuti wrote:

Am 28.07.2015 um 21:31 schrieb Carl G. Riches:

On Tue, 28 Jul 2015, Reuti wrote:

Hi,

Am 28.07.2015 um 20:03 schrieb Carl G. Riches:


We have a Rocks cluster (Rocks release 6.1) with the SGE roll (rocks-sge
6.1.2 [GE2011]).  Usage levels have grown faster than the cluster's capacity.  
We have a single consumable resource (number of CPUs) that we are trying to 
manage in a way that is acceptable to our users.  Before diving in on a 
solution, I would like to find out if others have dealt with our particular 
problem.

Here is a statement of the problem:
- There is a fixed amount of a resource called "number of CPUs" available.
- There are many possible users of the resource "number of CPUs".
- There is a variable number of the resource in use at any given time.
- When the resource is exhausted, requests to use the resource queue up
until some amount of the resource becomes available again.
- In the event that resource use requests have queued up, we must manage
the resource in some way.

The way we would like to manage the resource is this:
1. In the event that no requests for the resource are queued up, do
 nothing.
2. In the event that a single user is consuming all of the resource and
 all queued requests for the resource belong to the same user that is
 using all of the resource, do nothing.
3. In the event that a single user is consuming all of the resource and
 not all queued requests for the resource belong to the same user that
 is using all of the resource, "manage the resource".
4. In the event that there are queued requests for the resource and the
 resource is completely used by more than one user, "manage the
 resource".

By "manage the resource" we mean:
a. If a user is consuming more than some arbitrary limit of the resource
 (call it L), suspend one of that user's jobs.
b. Determine how much of the resource (CPUs) is made available by the
 prior step.


None. A suspended job in SGE still holds the resources it was granted. The only 
option would be to reschedule the job so that it returns to the pending state 
(and perhaps put it on hold first, so that it is not scheduled again instantly).


I didn't know that, thanks!

c. Find a job in the list of queued requests that uses less than or equal
 to the resources made available in the last step _and_ does not belong
 to a user currently using some arbitrary limit L (or more) of the
 resource, then dispatch the job.
d. Repeat the prior step until the available resource is less than the
 resource required by jobs in the list of queued requests.

Steps 1-4 above would be repeated at regular intervals to ensure that the
resource is shared.
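
Steps c and d above can be sketched roughly as follows. This is only an illustration of the selection logic, not something Grid Engine provides: the Job record, the pick_dispatchable function and the idea of counting a newly chosen job against its owner's usage are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    owner: str
    slots: int  # number of CPUs the job requests


def pick_dispatchable(pending, usage, free_slots, limit):
    """Walk the pending list in order and pick jobs that fit into the
    free slots and whose owner is currently using less than `limit`."""
    usage = dict(usage)  # work on a copy so the caller's data is untouched
    chosen = []
    for job in pending:
        if job.slots > free_slots:
            continue  # does not fit into what was freed up
        if usage.get(job.owner, 0) >= limit:
            continue  # owner is already at or above the limit L
        chosen.append(job)
        free_slots -= job.slots
        # Assumption: a chosen job counts against its owner right away.
        usage[job.owner] = usage.get(job.owner, 0) + job.slots
    return chosen


pending = [Job(1, "alice", 4), Job(2, "bob", 2), Job(3, "carol", 8), Job(4, "bob", 2)]
chosen = pick_dispatchable(pending, {"alice": 12}, free_slots=5, limit=10)
print([job.job_id for job in chosen])
```

The loop ends once the pending list has been scanned; any job that is still too large for the remaining free slots simply stays queued until the next interval.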


e. unsuspend the prior suspended job in case of...?


Well, that step wasn't explicitly asked for... but the users probably assumed 
that a suspended (or held) job would eventually be restarted.

Has anyone on the list tried to do this sort of queue management?  If so,
how did you go about the task?  Is this possible with Grid Engine?


All this management must be done by a co-scheduler, which you would have to 
program to fulfill the above requirements, i.e. putting jobs on hold, 
rescheduling them, removing the hold on other jobs... It's not built into SGE.
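
As a rough idea of what such a co-scheduler would have to issue, the sketch below only builds the command lines for the hold, reschedule and release steps; it does not run them. The function names are made up for this example. qhold, qmod -rj and qrls are the standard SGE commands for these actions, but whether a given job may be rescheduled (for instance, whether it has to be flagged as rerunnable) depends on your setup, so check `man qmod` first.

```python
def reschedule_commands(job_id):
    """Commands to put a job on hold and then push it back to pending."""
    job = str(job_id)
    return [
        ["qhold", job],        # hold first, so it is not dispatched again at once
        ["qmod", "-rj", job],  # reschedule the running job
    ]


def release_command(job_id):
    """Command to remove the hold so the job can be scheduled again."""
    return ["qrls", str(job_id)]


for argv in reschedule_commands(4711) + [release_command(4711)]:
    print(" ".join(argv))
```

Each list could be handed to subprocess.run() by the co-scheduler once it has decided which job to push back.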


That's what I was afraid of.


Would your users be happy to see that they got the promised computing time over 
a timeframe, so that a share-tree policy could be used? Of course, no user will 
see the instant effect of a just-submitted job starting immediately, but over 
time they can check that the granted computing time matched the promised one. 
SGE is also able to put a penalty on a user in case he used too much computing 
time in the past. His jobs will then get a lower priority when other users' 
jobs are pending.


I'm not familiar with share tree policies.  What are they and how are they used?


Some links about the share-tree policy:

http://www.informit.com/articles/article.aspx?p=101162
http://www.bioteam.net/wp-content/uploads/2009/09/06-SGE-6-Admin-Policies.pdf
http://gridscheduler.sourceforge.net/howto/geee.html
https://blogs.oracle.com/sgrell/entry/n1ge_6_scheduler_hacks_the

There is also a short man page in SGE: `man share_tree`.

Thanks for these pointers. After going through these and others, I think I have a basic understanding of share-tree and functional share policies. I'm not clear on whether or not these can be combined, but I think so.

The goal is to limit user access to CPU when there is contention for a queue. The limit would apply to all users and the limit would be 12% of all CPUs in a queue. There are no "projects" or "departments" at this time. Would these parameter settings achieve that goal?



algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.640000
weight_project                    0.120000
weight_department                 0.120000
weight_job                        0.120000
weight_tickets_functional         1000000
weight_tickets_share              1000
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   2000
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     1000.000000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   0.100000
max_reservation                   0
default_duration                  INFINITY
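
For reference, here is how I read the halftime setting above, as a small sketch. It assumes that halftime is given in hours (so 168 is one week) and that past usage decays exponentially, halving once per halftime; the function name is made up for the example, and `man sched_conf` is the authority on the exact behaviour.

```python
def decayed_usage(usage, hours_elapsed, halftime_hours=168.0):
    """Usage still counted against a user after hours_elapsed,
    assuming it halves once every halftime_hours."""
    return usage * 0.5 ** (hours_elapsed / halftime_hours)


# 1000 CPU-seconds of past usage, looked at one and two weeks later
print(decayed_usage(1000.0, 168))
print(decayed_usage(1000.0, 336))
```

Under that assumption, heavy use from a week ago counts only half as much against a user as the same amount of use today.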

Must we also define a "default" user in some manner such that this policy is applied? If so, how do we do that?



Thanks,
Carl
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
