On 19/06/2019 22.30, Fulcomer, Samuel wrote:
>
> (...and yes, the name is inspired by a certain OEM's software licensing
> schemes...)
>
> At Brown we run a ~400 node cluster containing nodes of multiple
> architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade) purchased
> in some cases by University funds and in others by investigator funding
> (~50:50). They all appear in the default SLURM partition. We have 3
> classes of SLURM users:
>
> 1. Exploratory - no-charge access to up to 16 cores
> 2. Priority - $750/quarter for access to up to 192 cores (and with a
>    GrpTRESRunMins=cpu limit). Each user has their own QoS
> 3. Condo - an investigator group who paid for nodes added to the
>    cluster. The group has its own QoS and SLURM Account. The QoS allows
>    use of the number of cores purchased and has a much higher priority
>    than the QoS' of the "priority" users.
>
> The first problem with this scheme is that condo users who have
> purchased the older hardware now have access to the newest without
> penalty. In addition, we're encountering resistance to the idea of
> turning off their hardware and terminating their condos (despite MOUs
> stating a 5yr life). The pushback is the stated belief that the hardware
> should run until it dies.
>
> What I propose is a new TRES called a Processor Performance Unit (PPU)
> that would be specified on the Node line in slurm.conf, and used such
> that GrpTRES=ppu=N was calculated as the number of allocated cores
> multiplied by their associated PPU numbers.
>
> We could then assign a base PPU to the oldest hardware, say, "1" for
> Sandy/Ivy, and increase it for later architectures based on their
> performance improvement. We'd set the condo QoS to GrpTRES=ppu=N*X+M*Y,...,
> where N is the number of cores of the oldest architecture and X its
> configured PPU/core, repeating for any newer nodes/cores the investigator
> has purchased since.
>
> The result is that the investigator group gets to run on an
> approximation of the performance that they've purchased, rather than on
> the raw purchased core count.
>
> Thoughts?
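If I understand the proposal correctly, the condo limit would work out
something like this (hypothetical syntax; there is no "ppu" TRES in
Slurm today, and the node names, QoS name and PPU values below are
invented purely for illustration):

# Hypothetical slurm.conf fragment: a per-core PPU on the Node lines,
# with the oldest generation as the 1.0 baseline.
NodeName=sandy[001-064] CPUs=16 RealMemory=64000  ppu=1.0
NodeName=bdw[001-032]   CPUs=28 RealMemory=128000 ppu=1.4
NodeName=skl[001-016]   CPUs=40 RealMemory=192000 ppu=2.0

# A condo group that bought 64 Sandy Bridge cores and 80 Skylake cores
# would then get GrpTRES=ppu = 64*1.0 + 80*2.0 = 224:
sacctmgr modify qos condo_smith set GrpTRES=ppu=224

We get at a similar goal with mechanisms that exist today.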
What we do is that our nodes are grouped into separate partitions based
on the CPU model. E.g. the partition "batch-skl" is where our Skylake
(6148) nodes are. Then we have a job_submit.lua script which sends jobs
without an explicit partition spec to all batch-xxx partitions (checking
constraints etc. along the way).

Then, for each partition, we set TRESBillingWeights= to "normalize" the
fairshare consumption based on the geometric mean of a set of hopefully
not too unrepresentative single-node benchmarks [1]. We also set a memory
billing weight, and have MAX_TRES among our PriorityFlags, approximating
dominant resource fairness (DRF) [2].

[1] https://github.com/AaltoScienceIT/docker-fgci-benchmark
[2] https://people.eecs.berkeley.edu/~alig/papers/drf.pdf
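A minimal sketch of the relevant slurm.conf pieces (node names, partition
names and weights here are illustrative, not our production values; the
real CPU weights come from the benchmark results in [1], and
job_submit.lua does the routing to the batch-* partitions):

# Assuming, purely for illustration, that Skylake benchmarks at ~1.6x
# the Haswell baseline.
PriorityFlags=MAX_TRES
JobSubmitPlugins=lua

NodeName=hsw[001-100] CPUs=24 RealMemory=128000
NodeName=skl[001-100] CPUs=40 RealMemory=192000

PartitionName=batch-hsw Nodes=hsw[001-100] TRESBillingWeights="CPU=1.0,Mem=0.25G"
PartitionName=batch-skl Nodes=skl[001-100] TRESBillingWeights="CPU=1.6,Mem=0.25G"

With MAX_TRES the billable usage of a job is (roughly) the maximum of its
per-TRES billing on a node rather than their sum, which is what gives the
DRF-like behaviour.

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi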