On 28.07.2015 at 21:31, Carl G. Riches wrote:

> On Tue, 28 Jul 2015, Reuti wrote:
>
>> Hi,
>>
>> On 28.07.2015 at 20:03, Carl G. Riches wrote:
>>
>>> We have a Rocks cluster (Rocks release 6.1) with the SGE roll
>>> (rocks-sge 6.1.2 [GE2011]). Usage levels have grown faster than the
>>> cluster's capacity. We have a single consumable resource (number of
>>> CPUs) that we are trying to manage in a way that is acceptable to our
>>> users. Before diving in on a solution, I would like to find out if
>>> others have dealt with our particular problem.
>>>
>>> Here is a statement of the problem:
>>> - There is a fixed amount of a resource called "number of CPUs"
>>>   available.
>>> - There are many possible users of the resource "number of CPUs".
>>> - There is a variable amount of the resource in use at any given time.
>>> - When the resource is exhausted, requests to use the resource queue
>>>   up until some amount of the resource becomes available again.
>>> - In the event that resource use requests have queued up, we must
>>>   manage the resource in some way.
>>>
>>> The way we would like to manage the resource is this:
>>> 1. In the event that no requests for the resource are queued up, do
>>>    nothing.
>>> 2. In the event that a single user is consuming all of the resource
>>>    and all queued requests for the resource belong to the same user
>>>    that is using all of the resource, do nothing.
>>> 3. In the event that a single user is consuming all of the resource
>>>    and not all queued requests for the resource belong to the same
>>>    user that is using all of the resource, "manage the resource".
>>> 4. In the event that there are queued requests for the resource and
>>>    the resource is completely used by more than one user, "manage the
>>>    resource".
>>>
>>> By "manage the resource" we mean:
>>> a. If a user is consuming more than some arbitrary limit of the
>>>    resource (call it L), suspend one of that user's jobs.
>>> b. Determine how much of the resource (CPUs) is made available by the
>>>    prior step.
>>
>> None. Suspending a job in SGE still consumes the granted resources.
>> The only option would be to reschedule the job to put it into the
>> pending state again (and maybe put it on hold before, to avoid that it
>> gets scheduled again instantly).
>
> I didn't know that, thanks!
>
>>> c. Find a job in the list of queued requests that uses less than or
>>>    equal to the resources made available in the last step _and_ does
>>>    not belong to a user currently using some arbitrary limit L (or
>>>    more) of the resource, then dispatch the job.
>>> d. Repeat the prior step until the available resource is less than
>>>    the resource required by jobs in the list of queued requests.
>>>
>>> Steps 1-4 above would be repeated at regular intervals to ensure that
>>> the resource is shared.
>>
>> e. Unsuspend the previously suspended job in case of...?
>
> Well, that step wasn't explicitly asked for... but it was probably
> assumed by the users that a suspended (or held) job would eventually be
> restarted.
>
>>> Has anyone on the list tried to do this sort of queue management? If
>>> so, how did you go about the task? Is this possible with Grid Engine?
>>
>> All this management must be done by a co-scheduler which you have to
>> program to fulfill the above requirements, i.e. putting jobs on hold,
>> rescheduling them, removing the hold on other jobs... It's not built
>> into SGE.
>
> That's what I was afraid of.
>
>> Would your users be happy to see that they got the promised computing
>> time over a timeframe, so that a share-tree policy could be used? Of
>> course, no user will see the instant effect that his just-submitted
>> job starts immediately, but over time they can check that the granted
>> computing time was according to the promised one. SGE is also able to
>> put a penalty on a user, in case he used too much computing time in
>> the past.
>> Then his jobs will get a lower priority, in case other users' jobs
>> are pending.
>
> I'm not familiar with share-tree policies. What are they and how are
> they used?
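[Editor's note: the co-scheduler Reuti describes above would have to implement steps 1-4 and a-d itself. A minimal sketch of that decision logic as a pure function, with Reuti's correction applied (over-limit jobs are rescheduled back to pending, not suspended, since suspension does not free the slots). All names here (`Job`, `plan_actions`, the limit `L`) are illustrative, not part of any SGE API.]

```python
from dataclasses import dataclass

@dataclass
class Job:
    jobid: int
    user: str
    slots: int  # CPUs granted (running job) or requested (pending job)

def plan_actions(running, pending, total_cpus, limit):
    """Return (jobs_to_reschedule, jobs_to_dispatch) per steps 1-4/a-d."""
    free = total_cpus - sum(j.slots for j in running)
    users_running = {j.user for j in running}
    users_pending = {j.user for j in pending}
    # Case 1: nothing queued -> do nothing.
    # Also do nothing while the resource is not actually exhausted.
    if not pending or free > 0:
        return [], []
    # Case 2: one user owns everything, running and queued -> do nothing.
    if len(users_running) == 1 and users_pending <= users_running:
        return [], []
    # Step a: pick ONE job of a user who is over the limit L.
    # (In practice this would map to something like `qhold -h o <id>`
    # followed by `qmod -r <id>` -- hedged, check the man pages.)
    usage = {}
    for j in running:
        usage[j.user] = usage.get(j.user, 0) + j.slots
    to_reschedule = []
    for j in sorted(running, key=lambda j: -j.slots):
        if usage[j.user] > limit:
            to_reschedule.append(j)          # step a
            usage[j.user] -= j.slots
            free += j.slots                  # step b: slots freed
            break
    # Steps c-d: dispatch fitting pending jobs of under-limit users
    # until nothing else fits.
    to_dispatch = []
    for j in pending:
        if j.slots <= free and usage.get(j.user, 0) + j.slots <= limit:
            to_dispatch.append(j)
            usage[j.user] = usage.get(j.user, 0) + j.slots
            free -= j.slots
    return to_reschedule, to_dispatch
```

Running this loop "at regular intervals", as the original post asks, would be the job of the external co-scheduler; SGE itself only sees the resulting hold/reschedule/release commands.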
Some links about the share-tree policy:

http://www.informit.com/articles/article.aspx?p=101162
http://www.bioteam.net/wp-content/uploads/2009/09/06-SGE-6-Admin-Policies.pdf
http://gridscheduler.sourceforge.net/howto/geee.html
https://blogs.oracle.com/sgrell/entry/n1ge_6_scheduler_hacks_the

There is also a short man page in SGE: `man share_tree`.

-- Reuti

> Thanks,
> Carl

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
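[Editor's note: to make the share-tree suggestion concrete, here is a hedged sketch of a minimal setup. The node layout, share values, and scheduler parameters are examples only; consult `man share_tree` and `man sched_conf` for the authoritative format.]

```shell
# A single "default" leaf under the root gives every user equal
# long-term shares without listing users individually:
cat > /tmp/stree <<'EOF'
id=0
name=Root
type=0
shares=1
childnodes=1
id=1
name=default
type=0
shares=100
childnodes=NONE
EOF
qconf -Astree /tmp/stree   # load the share tree

# Then give share-tree tickets weight in the scheduler configuration
# (qconf -msconf opens it in an editor); example parameters:
#   weight_tickets_share   10000
#   halftime               168     # hours over which past usage decays
```

With this in place, a user who consumed more than his share in the recent past automatically gets a lower priority for his pending jobs, which is the "penalty" effect described above.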
