Hello,

This is hopefully a very simple set of questions for someone. I'm evaluating Slurm with a view to replacing our existing Torque/Moab system, and I've been reading about defining partitions and QoSs. I like the idea of being able to use a QoS to throttle user activity, for example by setting per-user limits such as MaxJobsPerUser and MaxTRESPerUser (CPU and node counts).
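To make that concrete, the kind of QoS setup I have in mind is sketched below; the QoS name and the limit values are just placeholders I've made up, not a tested configuration:

    # create a QoS for ordinary batch work, with per-user throttles
    sacctmgr add qos normal
    sacctmgr modify qos normal set \
        MaxJobsPerUser=50 \
        MaxTRESPerUser=cpu=256,node=8 \
        MaxWall=24:00:00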
Also, I'm going to define a very simple set of partitions to reflect the different types of nodes in the cluster, for example:

    Batch   - normal compute nodes
    Highmem - high memory nodes
    Gpu     - GPU nodes

So presumably it makes sense to associate the "normal" QoS with the batch partition and define throttling limits as needed, then define corresponding QoSs for the highmem and gpu partitions. In this respect, do the QoS definitions override any limits set on the PartitionName line? For example, does the QoS MaxWall override the partition's MaxTime?

I suspect I'll also need to define a test partition with a high level of throttling, so that users can get a limited number of small test jobs through the system quickly. Does it make sense for my batch and test partitions to overlap, either partially or completely? At any one time the test partition would only take a few resources out of the pool of normal compute nodes. (I've sketched the layout I have in mind in the P.S. below.)

Another issue is that we have a large mix of small and large jobs. In our Torque/Moab cluster we make use of the XFACTOR priority component to make sure that small jobs don't get starved out of the system. I don't think there is a direct analogue of this parameter in Slurm, so I need to understand how to let smaller jobs compete with the larger ones without being starved. I understand that the backfill scheduler, together with multifactor priority settings such as PriorityFavorSmall=YES and the SMALL_RELATIVE_TO_TIME priority flag, can help the situation. What are your thoughts?

Your advice on the above points would be appreciated.

Best regards,
David
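P.S. For concreteness, the partition layout I'm currently imagining looks roughly like the slurm.conf fragment below. All of the node names, counts, and time limits are placeholders:

    # normal compute, high-memory and GPU nodes, each with its own QoS
    PartitionName=batch   Nodes=node[001-100] Default=YES MaxTime=24:00:00 QOS=normal
    PartitionName=highmem Nodes=himem[01-08]  MaxTime=48:00:00 QOS=highmem
    PartitionName=gpu     Nodes=gpu[01-04]    MaxTime=24:00:00 QOS=gpu

    # test partition overlapping a small slice of the batch nodes,
    # with a short time limit and a heavily throttled QoS
    PartitionName=test    Nodes=node[001-004] MaxTime=01:00:00 QOS=test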
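And on the scheduling side, the settings I'm currently looking at are roughly these (again, just a sketch based on my reading of the documentation; the weight values are arbitrary):

    # backfill scheduling plus multifactor priority,
    # weighted so that small and/or short jobs are not starved
    SchedulerType=sched/backfill
    PriorityType=priority/multifactor
    PriorityFavorSmall=YES
    PriorityFlags=SMALL_RELATIVE_TO_TIME
    PriorityWeightJobSize=1000
    PriorityWeightAge=1000
    PriorityWeightFairshare=10000

Does that look like a sensible starting point?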