Some examples are here:
https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting#quality-of-service-qos

/Ole

On 19-12-2019 19:30, Prentice Bisbal wrote:

On 12/19/19 10:44 AM, Ransom, Geoffrey M. wrote:

The simplest is probably to just have a separate partition that will only allow job times of 1 hour or less.
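
A minimal sketch of what that could look like in slurm.conf, with two partitions overlapping the same nodes (the node list and the long-partition limit below are only placeholders):

    PartitionName=short  Nodes=node[001-100]  MaxTime=01:00:00     Default=NO
    PartitionName=long   Nodes=node[001-100]  MaxTime=14-00:00:00  Default=YES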

This is how our Univa queues used to work, with two queues overlapping the same hardware. Univa shows available "slots" to the users, so we had a lot of confused users complaining about all those free slots (really busy slots in the other queue) while their jobs sat in the queue, and new users confused as to why their jobs were being killed after 4 hours. I was able to move the short/long behavior into job classes, use RQSes, and keep a single queue.

While Slurm isn't showing users unused resources, I am concerned that going back to two queues (partitions) will cause user-interaction and adoption problems.

It all depends on what best suits the specific needs.

Is there a way to have one partition that holds aside a small percentage of resources for jobs with a runtime under 4 hours, i.e., so that jobs with long runtimes cannot tie up 100% of the resources at one time? Some kind of virtual partition that feeds into two other partitions based on runtime would also work. The goal is that users can continue to submit jobs to one partition, but the scheduler won't let 100% of the compute resources get tied up with multi-week jobs.

The way to do this is with Quality of Service (QOS) in Slurm. When creating a QOS, you can cap the total resources that all jobs in that QOS can use at once. Create a QOS for the longer-running jobs and set its GrpTRES so that the number of CPUs is less than 100% of your cluster. Create a QOS for the shorter jobs with a shorter time limit (MaxWall).
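
For example, something along these lines with sacctmgr (the QOS names, the cpu= cap, and the wall-time limits are just placeholders to adapt to your site):

    sacctmgr add qos long
    sacctmgr modify qos long set GrpTRES=cpu=800 MaxWall=14-00:00:00
    sacctmgr add qos short
    sacctmgr modify qos short set MaxWall=04:00:00

Users (or their associations) also need to be allowed to use the QOSes, e.g. sacctmgr modify user someuser set qos+=short,long.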

Once the QOSes are set up, you can instruct your users to specify the proper QOS when submitting a job, or edit the job_submit.lua script to look at the requested time limit and assign/override the QOS based on that.
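
A rough sketch of such a rule in job_submit.lua (the 4-hour cutoff and the QOS names "short"/"long" are assumptions; job_desc.time_limit is in minutes):

    -- route jobs to a QOS based on the requested time limit
    function slurm_job_submit(job_desc, part_list, submit_uid)
       if job_desc.time_limit ~= slurm.NO_VAL and job_desc.time_limit <= 240 then
          job_desc.qos = "short"
       else
          -- unset or long time limits fall through to the capped QOS
          job_desc.qos = "long"
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end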
