Hello,

This is hopefully a very simple set of questions for someone. I'm evaluating 
Slurm with a view to replacing our existing Torque/Moab system, and I've been 
reading about defining partitions and QoSs. I like the idea of being able to 
use a QoS to throttle user activity -- for example to set maximum CPUs per user, 
maximum jobs per user, maximum nodes per user and so on. I'm also going to define 
a very simple set of partitions to reflect the different types of nodes in the 
cluster, for example (a rough slurm.conf sketch follows the list):

batch   - normal compute nodes
highmem - high-memory nodes
gpu     - GPU nodes
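
Roughly, I'm picturing PartitionName lines along these lines (only a sketch -- 
the node ranges, time limits and QoS names are placeholders I've made up):

  PartitionName=batch   Nodes=node[001-100] Default=YES MaxTime=1-00:00:00 State=UP QOS=normal
  PartitionName=highmem Nodes=himem[01-04]  MaxTime=2-00:00:00 State=UP QOS=highmem
  PartitionName=gpu     Nodes=gpu[01-08]    MaxTime=1-00:00:00 State=UP QOS=gpu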

So presumably it makes sense to associate a "normal" QoS with the batch 
partition and define throttling limits as needed, and then to define 
corresponding QoSs for the highmem and gpu partitions. In this respect, do the 
QoS definitions override any definitions on the PartitionName line? For example, 
does the QoS MaxWall override the partition's MaxTime?
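
For concreteness, this is the sort of thing I have in mind (again just a sketch 
-- the limit values are placeholders, and I'm assuming slurm.conf needs 
AccountingStorageEnforce=limits,qos for these limits to actually be enforced):

  sacctmgr add qos normal
  sacctmgr modify qos normal set MaxJobsPerUser=50 MaxTRESPerUser=cpu=256,node=8 MaxWall=1-00:00:00
  sacctmgr add qos highmem
  sacctmgr modify qos highmem set MaxJobsPerUser=10 MaxTRESPerUser=node=2
  sacctmgr add qos gpu
  sacctmgr modify qos gpu set MaxJobsPerUser=20 MaxTRESPerUser=gres/gpu=4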

I also suspect I'll need to define a test queue with a high level of throttling, 
so that users can get a limited number of small test jobs through the system 
quickly. In this respect, does it make sense for my batch and test partitions to 
overlap, either partially or completely? At any one time the test partition 
would only take a few resources out of the pool of normal compute nodes.
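
Concretely I was picturing something like this, where the test partition 
overlaps a small slice of the batch nodes and is tied to a heavily throttled 
QoS (again only a sketch -- the node ranges, names and limits are made-up 
placeholders):

  PartitionName=batch Nodes=node[001-100] Default=YES MaxTime=1-00:00:00 QOS=normal State=UP
  PartitionName=test  Nodes=node[001-004] MaxTime=01:00:00 QOS=test State=UP

  sacctmgr add qos test
  sacctmgr modify qos test set MaxJobsPerUser=2 MaxSubmitJobsPerUser=4 MaxTRESPerUser=cpu=16 MaxWall=01:00:00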

Another issue is that we have a large mix of small and large jobs. In our 
Torque/Moab cluster we make use of the XFACTOR component to make sure that 
small jobs don't get starved out of the system. I don't think there is an 
analogue of this parameter in Slurm, so I need to understand how to let smaller 
jobs compete with the larger jobs without being starved out. My understanding 
is that in Slurm the backfill scheduler and priority settings such as 
PriorityFavorSmall and the SMALL_RELATIVE_TO_TIME priority flag can help here. 
What are your thoughts?
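
For reference, this is the sort of slurm.conf configuration I've been reading 
about (only a sketch -- the weights and backfill parameters are placeholder 
values I've guessed at, and I'm not sure whether PriorityFavorSmall should be 
YES or NO for this purpose):

  SchedulerType=sched/backfill
  SchedulerParameters=bf_window=2880,bf_continue,bf_max_job_user=20
  PriorityType=priority/multifactor
  PriorityFavorSmall=YES            # unsure whether YES or NO is right here
  PriorityFlags=SMALL_RELATIVE_TO_TIME
  PriorityWeightJobSize=1000
  PriorityWeightAge=1000
  PriorityWeightFairshare=10000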

Your advice on the above points would be much appreciated.

Best regards,
David
