I appreciate the write-ups, thanks. Has anyone using GrpCPURunMins (or any of the similar limits, really) run into unanticipated negatives worth reporting?
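For context, GrpCPURunMins caps the total cores × remaining minutes of a group's running jobs. A minimal Python sketch of that mechanic (the function names and the limit value are illustrative, not part of Slurm):

```python
# Sketch of the GrpCPURunMins idea: a new job can start only if the
# cores * remaining-minutes of everything already running, plus its own
# cores * requested walltime, stays under the group limit.
# Function names and numbers are illustrative, not part of Slurm.

def used_cpu_run_mins(running_jobs):
    """Sum cores * remaining minutes over running jobs [(cores, mins_left), ...]."""
    return sum(cores * mins_left for cores, mins_left in running_jobs)

def job_fits(running_jobs, cores, walltime_mins, limit):
    """Would a new job fit under a GrpCPURunMins-style limit right now?"""
    return used_cpu_run_mins(running_jobs) + cores * walltime_mins <= limit

# Example with a limit of 1,000,000 core-minutes:
running = [(128, 120), (16, 60 * 24 * 30)]  # 128 cores/2h left, 16 cores/30d left
print(job_fits(running, 128, 120, 1_000_000))           # wide but short job
print(job_fits(running, 128, 60 * 24 * 60, 1_000_000))  # wide job for 2 months
```

This is what lets the limit distinguish 128 cores for 2 hours from 128 cores for 2 months: the first costs 15,360 core-minutes of "area", the second over 11 million.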
--
Ryan Novosielski - novos...@rutgers.edu
Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
Rutgers, the State University of NJ
Office of Advanced Research Computing - MSB C630, Newark

On Jul 24, 2017, at 19:10, Ryan Cox <ryan_...@byu.edu> wrote:

Corey,

We almost exclusively use GrpCPURunMins, together with 3- or 7-day walltime limits depending on the partition. For my (somewhat rambling) thoughts on the matter, see http://tech.ryancox.net/2014/04/scheduler-limit-remaining-cputime-per.html. It generally works pretty well.

We also have https://marylou.byu.edu/simulation/grpcpurunmins.php to simulate various settings, though it needs some improvement, such as a realistic maximum.

sshare -l (TRESRunMins) should have the live stats you're looking for.

Ryan

On 07/24/2017 02:39 PM, Corey Keasling wrote:

Hi Slurm-Dev,

I'm currently designing and testing what will ultimately be a small Slurm cluster of about 60 heterogeneous nodes (five different generations of hardware). Our user base is also diverse, with needs ranging from fast turnover of small sequential jobs to long-duration parallel codes (e.g., 16 cores for several months). In the past we limited users by how many cores they could allocate at any one time.
This has the drawback that no distinction is made between, say, 128 cores for 2 hours and 128 cores for 2 months. We want users to be able to run on a large portion of the cluster when it is available, while ensuring that they cannot take advantage of an idle period to start jobs that will monopolize it for weeks. Limiting by GrpCPURunMins seems like a good answer: I think of it as allocating computational area (i.e., cores*minutes) rather than just width (cores). I'd love to know if anyone has experience or thoughts on imposing limits this way.

Also, is anyone aware of a simple way to calculate remaining "area"? I can use squeue or sacct to derive how much of a limit is in use by looking at remaining walltimes and core counts, but if something built-in - or pre-existing - does this, it would be nice to know.

It's worth noting that the cluster is divided into several partitions, with most nodes belonging to more than one. This is partly political (to give groups increased priority on nodes they helped pay for) and partly practical (to make users explicitly request slow nodes rather than having their jobs silently dumped on ancient Opterons). Also, each user gets their own Account, so the QoS Grp limits apply to each person separately. Accounts would also have absolute core limits.

Thank you for your thoughts!

Corey

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
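On Corey's question about calculating in-use "area": besides sshare -l, one can sum cores × remaining walltime from squeue output. A hedged sketch, assuming squeue is invoked as `squeue -h -o "%C %L"` (CPU count and time left); the parsing helpers are mine, not part of Slurm:

```python
# Sum cores * remaining minutes from lines in squeue's "%C %L" format,
# e.g. "16 2-04:00:00" (16 cores, 2 days 4 hours left). These helpers are
# an illustrative sketch and assume well-formed time fields.

def timeleft_to_mins(s):
    """Parse Slurm's time-left format: [days-]HH:MM:SS or MM:SS."""
    days, rest = (s.split("-", 1) if "-" in s else ("0", s))
    parts = [int(p) for p in rest.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)          # pad to [hours, minutes, seconds]
    h, m, _sec = parts              # seconds ignored for this estimate
    return int(days) * 1440 + h * 60 + m

def area_in_use(squeue_lines):
    """Total cores * remaining minutes over squeue -h -o "%C %L" output lines."""
    total = 0
    for line in squeue_lines:
        cores, left = line.split()
        total += int(cores) * timeleft_to_mins(left)
    return total

# Example with sample output (in real use, feed in squeue -h -o "%C %L" lines):
sample = ["16 2-04:00:00", "128 1:30:00"]
print(area_in_use(sample))  # 16*3120 + 128*90 = 61440 core-minutes
```

Subtracting that total from the GrpCPURunMins value gives the remaining "area"; on newer Slurm releases the same live number appears in the TRESRunMins column of sshare -l, as Ryan notes above.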