[slurm-dev] Re: Thoughts on GrpCPURunMins as primary constraint?
Corey,

We almost exclusively use GrpCPURunMins as well as 3- or 7-day walltime limits depending on the partition. For my (somewhat rambling) thoughts on the matter, see http://tech.ryancox.net/2014/04/scheduler-limit-remaining-cputime-per.html. It generally works pretty well. We also have https://marylou.byu.edu/simulation/grpcpurunmins.php to simulate various settings, though it needs some improvement, such as a realistic maximum. sshare -l (TRESRunMins) should have the live stats you're looking for.

Ryan

On 07/24/2017 02:39 PM, Corey Keasling wrote:

Hi Slurm-Dev,

I'm currently designing and testing what will ultimately be a small Slurm cluster of about 60 heterogeneous nodes (five different generations of hardware). Our user base is also diverse, with a need for fast turnover of small, sequential jobs and for long-duration parallel codes (e.g., 16 cores for several months).

In the past we limited users by how many cores they could allocate at any one time. This has the drawback that no distinction is made between, say, 128 cores for 2 hours and 128 cores for 2 months. We want users to be able to run on a large portion of the cluster when it is available while ensuring that they cannot take advantage of an idle period to start jobs which will monopolize it for weeks. Limiting by GrpCPURunMins seems like a good answer. I think of it as allocating computational area (i.e., cores*minutes) and not just width (cores). I'd love to know if anyone has any experience or thoughts on imposing limits in this way.

Also, is anyone aware of a simple way to calculate remaining "area"? I can use squeue or sacct to ultimately derive how much of a limit is in use by looking at remaining wall-time and core count, but if there's something built in - or pre-existing - it would be nice to know.

It's worth noting that the cluster is divided into several partitions, with most nodes existing in several. This is partially political (to give groups increased priority on nodes they helped pay for) and partially practical (to ensure users are explicitly requesting slow nodes rather than having jobs dumped on ancient Opterons). Also, each user gets their own Account, so the QoS Grp limits apply to each human separately. Accounts would also have absolute core limits.

Thank you for your thoughts!

Corey

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
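For reference, the remaining-"area" derivation Corey describes (allocated CPUs times remaining walltime, summed over running jobs) is straightforward to script against squeue. Below is a rough, hypothetical sketch in Lua, using the standard %C and %L output fields and a placeholder user name; sshare -l remains the simpler built-in view of the same quantity.

-- grpcpurunmins_used.lua: hypothetical sketch, not a supported tool.
-- Sums CPUs * remaining minutes over a user's running jobs, i.e. how much of
-- a GrpCPURunMins limit is currently committed.

-- Convert squeue's %L output ([days-]hours:minutes:seconds, or minutes:seconds)
-- into minutes.
local function to_minutes(timeleft)
  local days, rest = timeleft:match("^(%d+)%-(.+)$")
  days = tonumber(days) or 0
  rest = rest or timeleft
  local h, m, s = rest:match("^(%d+):(%d+):(%d+)$")
  if not h then
    h = 0
    m, s = rest:match("^(%d+):(%d+)$")
  end
  if not m then return 0 end  -- UNLIMITED, NOT_SET, etc.
  return days * 1440 + tonumber(h) * 60 + tonumber(m) + tonumber(s) / 60
end

local user = arg[1] or "corey"   -- placeholder user name
local total = 0
local pipe = io.popen(('squeue -h -t R -u %s -o "%%C %%L"'):format(user))
for line in pipe:lines() do
  local cpus, timeleft = line:match("^(%d+)%s+(%S+)")
  if cpus then
    total = total + tonumber(cpus) * to_minutes(timeleft)
  end
end
pipe:close()
print(string.format("CPURunMins currently committed by %s: %.0f", user, total))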
[slurm-dev] Re: Job Submit Lua Plugin
Nathan and Darby, For you and anyone else using Lua, see https://bugs.schedmd.com/show_bug.cgi?id=3815 with regards to --mem vs --mem-per-cpu starting in 17.02. Ryan On 06/27/2017 02:30 PM, Nathan Vance wrote: Re: [slurm-dev] Re: Job Submit Lua Plugin Darby, The "job_submit.lua: initialized" line in slurm.conf was indeed the issue. When compiling slurm I only got the "yes lua" line without the flags, but that seems to be just a difference in OS's. Now that I have debugging feedback I should be good to go! Thanks, Nathan On 27 June 2017 at 16:13, Vicker, Darby (JSC-EG311) <darby.vicke...@nasa.gov <mailto:darby.vicke...@nasa.gov>> wrote: We recently started using a lua job submit plugin as well. You have to have the lua-devel package installed when you compile slurm. It looks like you do (but we use RHEL the package name is lua-devel) but confirm that you see something like these in config.log: configure:24784: result: yes lua pkg_cv_lua_LIBS='-llua -lm -ldl ' lua_CFLAGS=' -DLUA_COMPAT_ALL' lua_LIBS='-llua -lm -ldl ' Do you have this in your slurm.conf? JobSubmitPlugins=lua I'm guessing not given you don't see anything in the logs. Before I got all the errors worked out, I would see errors like this in slurmctld_log: error: Couldn't find the specified plugin name for job_submit/lua looking at all files error: cannot find job_submit plugin for job_submit/lua error: cannot create job_submit context for job_submit/lua failed to initialize job_submit plugin After getting everything working, you should see this: job_submit.lua: initialized As well as any other slurm.log_info messages you put in your lua script. *From: *Nathan Vance <naterva...@gmail.com <mailto:naterva...@gmail.com>> *Reply-To: *slurm-dev <slurm-dev@schedmd.com <mailto:slurm-dev@schedmd.com>> *Date: *Tuesday, June 27, 2017 at 12:15 PM *To: *slurm-dev <slurm-dev@schedmd.com <mailto:slurm-dev@schedmd.com>> *Subject: *[slurm-dev] Job Submit Lua Plugin Hello all! I've been working on getting off the ground with Lua plugins. The goal is to implement Torque's routing queues for SLURM, but so far I have been unable to get SLURM to even call my plugin. What I have tried: 1) Copied contrib/lua/job_submit.lua to /etc/slurm/ (the same directory as slurm.conf) 2) Restarted slurmctld and verified that no functionality was broken 3) Added slurm.log_info("I got here") to several points in the script. After restarting slurmctld and submitting a job, grep "I got here" -R /var/log found no results. 4) In case there was a problem with the log file, I added os.execute("touch /home/myUser/slurm_job_submitted") to the top of the slurm_job_submit method. Restarting slurmctld and submitting a job still produced no evidence that my plugin was called. 5) In case there were permission issues, I made job_submit.lua executable. Nothing. Even grep "job_submit" -R /var/log (in case there was an error calling the script) comes up dry. Relevant information: OS: Ubuntu 16.04 Lua: lua5.2 and liblua5.2-dev (I can use Lua interactively) SLURM version: 17.02.5, compiled from source (after installing Lua) using ./configure --prefix=/usr --sysconfdir=/etc/slurm Any guidance to get me up and running would be greatly appreciated! Thanks, Nathan -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
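For anyone starting from zero, a minimal job_submit.lua that produces the log lines discussed in this thread looks roughly like the following. This is only a sketch: it assumes JobSubmitPlugins=lua is set in slurm.conf, that the script lives next to slurm.conf, and the available job_desc fields vary somewhat between Slurm versions.

-- /etc/slurm/job_submit.lua -- minimal sketch
-- Both functions must be defined and should return slurm.SUCCESS (or an
-- error code such as slurm.ERROR) for every job.

function slurm_job_submit(job_desc, part_list, submit_uid)
    slurm.log_info("slurm_job_submit: uid=%s partition=%s",
                   tostring(submit_uid), tostring(job_desc.partition))
    -- routing/validation logic (e.g. rewriting job_desc.partition) goes here
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

slurm.log_info("initialized")
return slurm.SUCCESS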
[slurm-dev] Re: Slurm & CGROUP
From: Wensheng Deng <w...@nyu.edu>
Sent: 15 March 2017 10:28
To: slurm-dev
Subject: [ext] [slurm-dev] Re: Slurm & CGROUP

It should be (sorry): we 'cp'ed a 5GB file from scratch to node local disk.

On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng <w...@nyu.edu> wrote:

Hello experts:

We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a 5GB job from scratch to node local disk, declared 5 GB memory for the job, and saw the error messages below although the file was copied okay:

slurmstepd: error: Exceeded job memory limit at some point.
srun: error: [nodenameXXX]: task 0: Out Of Memory
srun: Terminating job step 41.0
slurmstepd: error: Exceeded job memory limit at some point.

From the cgroup document https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt:

Features:
- accounting anonymous pages, file caches, swap caches usage and limiting them.

It seems that cgroup charges "RSS + file caches" to user processes like 'cp', in our case charged to users' jobs. Swap is off in this case. The file cache can be small or very big, and it should not be charged to users' batch jobs in my opinion. How do other sites circumvent this issue? The Slurm version is 16.05.4.

Thank you and Best Regards.

Could you set AllowedRamSpace/AllowedSwapSpace in /etc/slurm/cgroup.conf to some big number? That way the job memory limit will be the cgroup soft limit, and the cgroup hard limit (which is when the kernel will OOM kill the job) would be "job_memory_limit * AllowedRamSpace", that is, some large value?

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
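A sketch of what Janne's suggestion looks like in practice; the parameter names are from cgroup.conf, but the values below are made up and the exact soft/hard-limit behaviour depends on the Slurm version.

# /etc/slurm/cgroup.conf (illustrative values only)
ConstrainRAMSpace=yes
# Per the suggestion above: the requested memory acts as the cgroup soft
# limit, while the hard limit (where the kernel OOM killer steps in) becomes
# roughly requested_memory * AllowedRAMSpace/100.
AllowedRAMSpace=400
AllowedSwapSpace=0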
[slurm-dev] Re: Stopping compute usage on login nodes
If you're interested in the programmatic method I mentioned to increase limits for file transfers, https://github.com/BYUHPC/uft/tree/master/cputime_controls might be worth looking at. It works well for us, though a user will occasionally start using a new file transfer program that you might want to centrally install and whitelist. We used to use LVS for load balancing and it worked pretty well. We finally scrapped it in favor of DNS round robin since it gets expensive to have a load balancer that's capable of moving that much bandwidth. We have a script that can drop some of the login nodes from the DNS round robin based on CPU and memory usage (with sanity checks to not drop all of them at the same time, of course :) ). There may be a better way of doing this but it has worked so far. Ryan On 02/09/2017 11:15 AM, Nicholas McCollum wrote: While this isn't a SLURM issue, it's something we all face. Due to my system being primarily students, it's something I face a lot. I second the use of ulimits, although this can kill off long running file transfers. What you can do to help out users is set a low soft limit and a somewhat larger hard limit. Encourage users that want to do a file transfer to increase their limit (they wont be able to go over the hard limit). A method that I am testing to employ is having each login node as a KVM virtual machine, and then limiting the amount of CPU that the virtual machine can use. Each login-VM will be identical minus the MAC and the IP address, then using IP tables on the VM-host to push the connections out to the VM that responds first. The idea is that a loaded down VM would have a delay in responding and provide a user with a login node that doesn't have any users on it. I'm sure someone has already blazed this trail before, but this is how I am going about it. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
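For the soft/hard ulimit split Nicholas describes, one way to express it is in /etc/security/limits.conf on the login nodes. The numbers below are illustrative only; "cpu" here is CPU time in minutes, and users can raise their own soft limit with ulimit -t (which takes seconds) up to the hard limit before starting a long transfer.

# /etc/security/limits.conf on the login nodes (illustrative numbers)
*    soft    cpu    30
*    hard    cpu    240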
[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?
I should probably add some example output:

Someone we need to talk to:

Node      |    Memory (GB)     |        CPUs
Hostname  | Alloc   Max   Cur  | Alloc   Used   Eff%
m8-10-5      19.5     0     0       1   0.00      0
*m8-10-2     19.5   2.3   2.2       1   0.99     99
m8-10-3      19.5     0     0       1   0.00      0
m8-10-4      19.5     0     0       1   0.00      0

* denotes the node where the batch script executes (node 0)
CPU usage is cumulative since the start of the job

Much better:

Node      |    Memory (GB)     |        CPUs
Hostname  | Alloc   Max   Cur  | Alloc   Used   Eff%
m9-48-2     112.0  21.1  19.3      16  15.97     99
m9-48-3      98.0  18.5  16.8      14  13.98     99
m9-16-3     112.0  20.9  19.2      16  15.97     99
m9-44-1     112.0  21.0  19.2      16  15.97     99
m9-43-3     119.0  22.3  20.4      17  16.97     99
m9-44-2     112.0  21.2  19.3      16  15.98     99
m9-14-4     112.0  21.0  19.2      16  15.97     99
m9-46-4     119.0  22.5  20.5      17  16.97     99
*m9-10-2     91.0  32.0  15.8      13  12.81     98
m9-43-1     119.0  22.3  20.4      17  16.97     99
m9-16-1     126.0  23.9  21.6      18  17.97     99
m9-47-4     119.0  22.4  20.5      17  16.97     99
m9-43-4     119.0  22.4  20.5      17  16.97     99
m9-48-1      84.0  15.7  14.4      12  11.98     99
m9-42-4     119.0  22.2  20.3      17  16.97     99
m9-43-2     119.0  22.2  20.4      17  16.97     99

* denotes the node where the batch script executes (node 0)
CPU usage is cumulative since the start of the job

Ryan

On 09/19/2016 11:13 AM, Ryan Cox wrote:

We use this script that we cobbled together: https://github.com/BYUHPC/slurm-random/blob/master/rjobstat. It assumes that you're using cgroups. It uses ssh to connect to each node so it's not very scalable but it works well enough for us.

Ryan

On 09/18/2016 06:42 PM, Igor Yakushin wrote:

how to monitor CPU/RAM usage on each node of a slurm job? python API?

Hi All,

I'd like to be able to see for a given jobid how much resources are used by a job on each node it is running on at this moment. Is there a way to do it? So far it looks like I have to script it: get the list of the involved nodes using, for example, squeue or qstat, ssh to each node and find all the user processes (not 100% guaranteed that they would be from the job I am interested in: is there a way to find UNIX pids corresponding to Slurm jobid?). Another question: is there python API to slurm? I found pyslurm but so far it would not build with my version of Slurm.

Thank you,
Igor

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?
We use this script that we cobbled together: https://github.com/BYUHPC/slurm-random/blob/master/rjobstat. It assumes that you're using cgroups. It uses ssh to connect to each node so it's not very scalable but it works well enough for us. Ryan On 09/18/2016 06:42 PM, Igor Yakushin wrote: how to monitor CPU/RAM usage on each node of a slurm job? python API? Hi All, I'd like to be able to see for a given jobid how much resources are used by a job on each node it is running on at this moment. Is there a way to do it? So far it looks like I have to script it: get the list of the involved nodes using, for example, squeue or qstat, ssh to each node and find all the user processes (not 100% guaranteed that they would be from the job I am interested in: is there a way to find UNIX pids corresponding to Slurm jobid?). Another question: is there python API to slurm? I found pyslurm but so far it would not build with my version of Slurm. Thank you, Igor
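The per-node collection behind a script like rjobstat essentially boils down to reading a few cgroup files. Below is a hypothetical, stripped-down version of the per-node step, using cgroup v1 paths as typically created by the task/cgroup plugin; the exact hierarchy can differ between configurations.

-- read_job_cgroup.lua <uid> <jobid> -- hypothetical per-node sketch.
-- Reports current/peak memory and cumulative CPU time for one job from the
-- cgroup v1 hierarchy created by the task/cgroup plugin.
local uid, jobid = arg[1], arg[2]

local function read_number(path)
  local f = io.open(path, "r")
  if not f then return nil end
  local v = tonumber(f:read("*l"))
  f:close()
  return v
end

local base = string.format("/sys/fs/cgroup/%%s/slurm/uid_%s/job_%s/%%s", uid, jobid)
local cur  = read_number(base:format("memory", "memory.usage_in_bytes"))
local peak = read_number(base:format("memory", "memory.max_usage_in_bytes"))
local cpu  = read_number(base:format("cpuacct", "cpuacct.usage"))  -- nanoseconds

if cur then
  print(string.format("mem cur %.1f GB, peak %.1f GB, cpu %.0f s",
        cur / 2^30, (peak or 0) / 2^30, (cpu or 0) / 1e9))
else
  print("no cgroup found for that job on this node")
end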
[slurm-dev] Re: scontrol update not allowing jobs
The --reservation is for sbatch, salloc, et al. It tells it that the job should run in the specified reservation. On 04/15/2016 11:37 AM, Glen MacLachlan wrote: Re: [slurm-dev] Re: scontrol update not allowing jobs Thanks for your feedbacl. Taking nodes out of maintenance still leaves them in the reserved state "resv" but still unable to run jobs even though I believe I've given the correct exception as shown in the original post. @Ryan: Yeah, I did specify the reservation, Reservation=root_13. The -- before reservation is syntactically incorrect too. In fact, if you don't specify which reservation is getting updated the scontrol command won't work. Best, Glen == Glen MacLachlan, PhD /HPC Specialist //for Physical Sciences & / /Professorial Lecturer, Data Sciences / Office of Technology Services The George Washington University 725 21st Street Washington, DC 20052 Suite 211, Corcoran Hall == On Fri, Apr 15, 2016 at 1:07 PM, Ryan Cox <ryan_...@byu.edu <mailto:ryan_...@byu.edu>> wrote: Did you try this: --reservation=root_13 On 04/15/2016 08:10 AM, Glen MacLachlan wrote: Dear all, Wrapping up a maintenance period and I want to run some test jobs before I release the reservation and allow regular user jobs to start running. I've modified the reservation to allow jobs from my account: $ scontrol show res ReservationName=root_13 StartTime=2016-04-12T09:00:00 EndTime=2016-04-15T20:00:00 Duration=3-11:00:00 Nodes=ALL NodeCnt=220 CoreCnt=3328 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES TRES=cpu=3328 Users=bindatype Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a but when I try to allocate a set of nodes I keep seeing the following: $ salloc -p defq -t 10 salloc: Required node not available (down, drained or reserved) salloc: Pending job allocation 1692921 salloc: job 1692921 queued and waiting for resources Note that all the nodes are currently in the maint state. Am I missing something here or is this a problem with scontrol update? -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
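In other words, the reservation is named on the job request itself rather than via scontrol; combining the two commands quoted above, the test allocation would look something like:

salloc -p defq -t 10 --reservation=root_13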
[slurm-dev] Re: scontrol update not allowing jobs
Did you try this: --reservation=root_13 On 04/15/2016 08:10 AM, Glen MacLachlan wrote: scontrol update not allowing jobs Dear all, Wrapping up a maintenance period and I want to run some test jobs before I release the reservation and allow regular user jobs to start running. I've modified the reservation to allow jobs from my account: $ scontrol show res ReservationName=root_13 StartTime=2016-04-12T09:00:00 EndTime=2016-04-15T20:00:00 Duration=3-11:00:00 Nodes=ALL NodeCnt=220 CoreCnt=3328 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES TRES=cpu=3328 Users=bindatype Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a but when I try to allocate a set of nodes I keep seeing the following: $ salloc -p defq -t 10 salloc: Required node not available (down, drained or reserved) salloc: Pending job allocation 1692921 salloc: job 1692921 queued and waiting for resources Note that all the nodes are currently in the maint state. Am I missing something here or is this a problem with scontrol update?
[slurm-dev] Re: AssocGrp*Limits being considered for scheduling
Coincidentally, I asked about that yesterday in a bug report: http://bugs.schedmd.com/show_bug.cgi?id=2465. The short answer is to use SchedulerParameters=assoc_limit_continue that was introduced in 15.08.8. It only works if the Reason for the job is something like Assoc*Limit. Ryan On 02/23/2016 10:58 AM, Lucas Gabriel Vuotto wrote: Hello, we want to know if there is a "built-in" solution for the situation we have: We have an special account A in sacctmgr which gives some users more cpu minutes to use monthly. Also, we use the multifactor priority plugin to decide which jobs start first. Right now, there are some jobs from account A that can't start because the extra resources were consumed, so until march, 1st they won't start. Still, there are other jobs enqueued that have less priority than the ones from account A, so they're not starting because the scheduler still consider the jobs from account A to be able to schedule, assigning them a StartTime from today. Basically, what we want to know is if there is some option/plugin to either: 1. delay the StartTime from jobs that can't start because of AssocGrp*Limits 2. turn priority to 0 for that jobs until the next month 3. any other idea which can have the desire effect (run jobs that can actually run this month this month) Ideally, we want to know if there is some solution from slurm itself and not running cron jobs every 10 minutes to do option 1 manually, which is the only idea we have right now (better ideas are welcome, though). Cheers & thanks! -- lv.
[slurm-dev] Re: distribution for array jobs
g to get more than one job to run on a node? Thanks in advance, Brian Andrus -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
[slurm-dev] Re: Slurmd restart without loosing jobs?
That particular problem is now fixed: http://bugs.schedmd.com/show_bug.cgi?id=587 Ryan On 10/13/2015 03:26 AM, Bjørn-Helge Mevik wrote: Restarting the slurmd daemons and/or the slurmctld daemon should in general not kill jobs. But if you change things in slurm.conf such that the format of the slurm state files changes, then restarting slurmctld might result in all jobs being killed. We did this once a couple of years ago when we activated checkpointing. When slurmcltd started, the checkpointing plugin expected some extra data in the job states, which obviously wasn't there, and slurmctld decided the data was invalid and killed all jobs. (I don't know if this is still a problem.) -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
[slurm-dev] Re: Batch job submission failed: Invalid account or account/partition combination specified
We have seen similar issues on 14.11.8 but haven't bothered to diagnose or report it. I think I've seen it twice so far out of dozens of new users. Ryan On 09/07/2015 09:16 AM, Loris Bennett wrote: Hi, This problem occurs with 14.11.8. A user I set up today got the following error when submitting a job: Batch job submission failed: Invalid account or account/partition combination specified Using sacctmgr show user withassoc I can't see any difference between the user with the problem and another user associated with the same account who can submit. In the slurmcltd log I have [2015-09-07T17:02:00.790] _job_create: invalid account or partition for user 123456, account '(null)', and partition 'main' [2015-09-07T17:02:00.790] _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified Access to the partition 'main' is allowed to all. Restarting slurmctld fixed the problem. Is this a known issue? Cheers, Loris
[slurm-dev] Re: Changing /dev file permissions for particular user
Be sure to test it first before trying anything else: https://stackoverflow.com/questions/18661976/reading-dev-cpu-msr-from-userspace-operation-not-permitted. We ran into this issue once when we had a trusted person and we couldn't easily grant him access to the MSRs. We couldn't find a good solution. You could add the caps to a copy of the rdmsr binary and make that file only usable by your trusted user... Assuming you have an old enough kernel, I would just add the user to the group that MSR files are owned by (and change settings so the relevant /dev files are owned by a different group than root). Ryan On 06/24/2015 03:08 PM, Marcin Stolarek wrote: Changing /dev file permissions for particular user Hey! I've got one user I trust and know that he isn't going to do anything malicious, he needs a direct acces to file in dev (/dev/cpu/*/msr in particular). Have anybody checked how to do such a thing in slurm? We are thinking abuot doing it in prologue and changing back in epilogue, checking if the node is exclusive for user X. Do you know if the file permissions can be changed in users namespace or how to achieve this using slurm on Linux? cheers, marcin
[slurm-dev] Re: concurrent job limit
Job arrays can kind of be used for that: From http://slurm.schedmd.com/job_array.html: A maximum number of simultaneously running tasks from the job array may be specified using a % separator. For example --array=0-15%4 will limit the number of simultaneously running tasks from this job array to 4. Ryan On 06/11/2015 08:12 AM, Martin, Eric wrote: Is there a way for users to self limit the number of jobs that they concurrently run? Eric Martin Center for Genome Sciences Systems Biology Washington University School of Medicine Forest Park Avenue St. Louis, MO 63108 The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
[slurm-dev] Re: FAIR_TREE in SLURM 14.11
inf 0 secant 100.003315 00.00 0.00 inf 0 physics parent0.0044470.396478 0.396478 0 hepx 7000.23201944470.396478 0.396478 0.585199 0 hepx test-hepx 10.01282144470.396478 1.00 0.226415 0.012821 0 stat parent0.00 00.00 0.00 0 carroll 100.003315 00.00 0.00 inf 0 = Trey Dockendorf Systems Analyst I Texas AM University Academy for Advanced Telecommunications and Learning Technologies Phone: (979)458-2396 Email: treyd...@tamu.edu mailto:treyd...@tamu.edu Jabber: treyd...@tamu.edu mailto:treyd...@tamu.edu On Thu, Jun 4, 2015 at 11:51 AM, Ryan Cox ryan_...@byu.edu mailto:ryan_...@byu.edu wrote: Trey, In http://slurm.schedmd.com/fair_tree.html#fairshare, take a look at the definition for S. Basically, the normalized shares only matters between sibling associations and will equal 1.0 when summed. If an association has no siblings, the value is 1.0. If each of the four siblings in an account has the same Raw Shares (as defined in sacctmgr) value, the normalized shares value for each is 0.25. The reason why is because the Level Fairshare calculations are only done within in account, comparing siblings to each other. Note that Norm Usage is still presented in sshare but not used in the calculations. The sshare manpage has a section about the Fair Tree modifications to existing columns: http://slurm.schedmd.com/sshare.html#SECTION_FAIR_TREE%20MODIFICATIONS Ryan On 06/03/2015 02:47 PM, Trey Dockendorf wrote: My site is currently on 14.03.10 and we are evaluating and testing 14.11.7 as well as moving from PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME to using PriorityFlags=FAIR_TREE,SMALL_RELATIVE_TO_TIME. Our account hierarchy is very deep and is intended to represent the org structure of departments and research organizations that are using our cluster [1]. We were able to make the normalized share ratio match up so all non-stakeholders were equal (0.000323) and all stakeholders had the correct ratio based on their contributions to the cluster. The Shares value assigned represents CPUs funded. All the CPUs no longer belonging to stakeholders were given to the mgmt group so that the Shares given to the top level (tamu) had a meaningful value when divided up amongst all the accounts. While testing FAIR_TREE I noticed the normalized shares were drastically different [2]. In particular the current stakeholders (idhcm and hepx) both ended up with 1.0. I'm guessing this is due to having no sibling accounts. The docs for FAIR_TREE only describe the formula used to calculate the Level FairShare. Does the method for calculating normalized shares change for FAIR_TREE? Is the hierarchy we are using not a good fit for FAIR_TREE? The description and benefits of FAIR_TREE appeal to our use case, so modifying our hierarchy is within the realm of things I'm willing to change. Any advice on migrating into FAIR_TREE is more than welcome. Right now I've been running sleep jobs under different UIDs to simulate usage to try and work out how we may need to adjust things for a migration to FAIR_TREE. I used the attached spreadsheet to work out the share values we are using with 14.03.10. 
Thanks, - Trey [1]: Account User Raw Shares Norm Shares Raw Usage Effectv Usage FairShare -- -- --- --- - -- root 1.00 114089982 1.00 0.870551 root root 10.000323 0 0.00 1.00 grid10.0003233688 0.32 0.986174 cms 100.0002693688 0.27 0.986155 suragrid 1 0.27 0 0.00 1.00 tamu 30960.999354 114086294 0.68 0.870477 agriculture 20 0.0066712697 0.24 0.999507 aglife10.003336 2697 0.12 0.999507 genomics 1 0.003336 0 0.00 1.00 engineering 10 0.003336 0 0.00 1.00 pete10.003336 0 0.00 1.00 general 100.003336
[slurm-dev] Re: FAIR_TREE in SLURM 14.11
]: Account User Raw Shares Norm Shares Raw Usage Effectv Usage FairShare -- -- --- --- - -- root0.00 53229 1.00 rootroot 10.000323 0 0.00 1.00 grid 10.000323 0 0.00 cms 100.909091 0 0.00 suragrid 10.090909 0 0.00 tamu 30960.999354 53229 1.00 agriculture 200.006676 0 0.00 aglife 10.50 0 0.00 genomics 10.50 0 0.00 engineering 100.003338 0 0.00 pete 11.00 0 0.00 general 100.0033386326 0.118860 geo 100.003338 0 0.00 atmo 11.00 0 0.00 liberalarts1280.042724 13122 0.246522 idhmc 11.00 13122 1.00 mgmt20580.686916 20984 0.394237 science7600.253672 12795 0.240382 acad 100.013158 0 0.00 chem 100.013158 0 0.00 iamcs 100.013158 0 0.00 math-dept 200.026316 0 0.00 math 100.50 0 0.00 secant 100.50 0 0.00 physics 7000.921053 12795 1.00 hepx 11.00 12795 1.00 stat 100.013158 0 0.00 carroll 11.00 0 0.00 = Trey Dockendorf Systems Analyst I Texas AM University Academy for Advanced Telecommunications and Learning Technologies Phone: (979)458-2396 Email: treyd...@tamu.edu mailto:treyd...@tamu.edu Jabber: treyd...@tamu.edu mailto:treyd...@tamu.edu -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
[slurm-dev] Re: GPU node allocation policy
You can do something like this: JobSubmitPlugins=all_partitions,lua. Have a special empty partition, as you suggest. Use the submit plugin to detect if the empty partition is in there. If it is in the job's list of partitions, you know that the user didn't specify a particular partition. If it is not in the list, you know that the user requested a particular partition (or set of partitions). You can then do all sorts of fun logic. Does all the GPU code in question need only one CPU core? Some of our users have code that can use multiple CPUs and multiple GPUs simultaneously (LAMMPS? NAMD? I'd have to check...). It might be limiting to restrict users to a certain amount of cores. If you're scheduling memory, it's also important to make sure that there is some memory available for the GPU jobs. What we do is uses QOSs to control access to our GPU partition with AllowQos. We use a job submit plugin to place jobs with the appropriate GRES into the gpu QOS, which is allowed into that partition. We also allow jobs in a preemptable QOS into the partition, with the gpu QOS able to preempt jobs in the preemptible QOS. We could also do a shorter walltime QOS or something with a lower priority but haven't done so; GPU jobs could get on there quickly even if all-CPU jobs are on there. They could also have the job submit plugin add the gpu partition into their list of partitions if the job meets certain criteria even if not requesting GPUs (short walltime or something else). Just some thoughts. Ryan On 04/07/2015 07:47 AM, Aaron Knister wrote: Ah, I was wondering about that. You could try this: Rename standard partition to cpu1 Create a partition called standard with no nodes Use the lua submit plugin to rewrite the partition list from standard to cpu1,cpufromgpunode I *think* that will work. I'm not sure about the empty partition piece and whether that will deny your submission before the submit filter kicks in but my gut says no. Sent from my iPhone On Apr 7, 2015, at 9:18 AM, Schmidtmann, Carl carl.schmidtm...@rochester.edu wrote: That only works if ALL the nodes have GPUs. We have 200+ nodes and 30 of them have GPUs. So we have to create three partitions - standard, gpu and cpufromgpunode. People in the standard partition can’t use the cpus on the gpu nodes. People that submit to the cpufromgpunode can’t use the cpus in the standard partition. We would like to see a way to specify MaxCPUsPerJobOnThisNode so the standard partition can use 24 cores on nodes without a GPU and less on nodes with a GPU. Or a way to specify ReserveCPUForGPU on the node or some such thing. I assume this is difficult because people have asked for it but it hasn’t been implemented. Carl Carl Schmidtmann Center for Integrated Research Computing University of Rochester On Apr 7, 2015, at 4:51 AM, Aaron Knister aaron.knis...@gmail.com wrote: Would MaxCPUsPerNode set at the partition level help? Here's the snippet from the man page: MaxCPUsPerNode Maximum number of CPUs on any node available to all jobs from this partition. This can be especially useful to schedule GPUs. For example a node can be associated with two Slurm partitions (e.g. cpu and gpu) and the partition/queue cpu could be limited to only a subset of the node's CPUs, insuring that one or more CPUs would be available to jobs in the gpu partition/queue. 
Sent from my iPhone On Apr 6, 2015, at 11:25 PM, Novosielski, Ryan novos...@ca.rutgers.edu wrote: I am imagine part of the reason is to keep people from running CPU jobs that would take more than 20 cores on the GPU machine as others do not have GPU's. I'd be interested in knowing strategies here too. *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences* || \\UTGERS |-*O*- ||_// Biomedical | Ryan Novosielski - Senior Technologist || \\ and Health | novos...@rutgers.edu- 973/972.0922 (2x0922) || \\ Sciences | OIRT/High Perf Res Comp - MSB C630, Newark `' On Apr 6, 2015, at 20:17, Ryan Cox ryan_...@byu.edu wrote: Chris, Just have GPU users request the numbers of CPU cores that they need and don't lie to Slurm about the number of cores. If a GPU user needs 4 cores and 4 GPUs, have them request that. That leaves 20 cores for others to use. Ryan On 04/06/2015 03:43 PM, Christopher B Coffey wrote: Hello, I’m curious how you handle the allocation of GPU’s and cores on GPU systems in your cluster. My new GPU system is 24 core, with 2 Tesla K80’s (4 gpus total). We allocate cores/mem by: SelectType=select/cons_res SelectTypeParameters=CR_Core_Memory What I’m thinking of doing is lying to Slurm about the true cores, and specifying CPUs=20, along with Gres=gpu:tesla:4. Is this a reasonable solution in order to ensure there is a core reserved for each gpu in the system? My thought is to allocate the 20 cores on the system to non-GPU type work instead of leaving them idle. Thanks
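A hypothetical sketch of the all_partitions-plus-Lua approach described at the top of this thread. The partition names here are placeholders, and fields such as job_desc.gres vary by Slurm version.

-- job_submit.lua fragment (sketch). Assumes JobSubmitPlugins=all_partitions,lua
-- and an empty partition named "catchall" that all_partitions adds to every job
-- whose submitter did not pick a partition explicitly.
function slurm_job_submit(job_desc, part_list, submit_uid)
  local parts = job_desc.partition
  if parts == nil or not parts:find("catchall", 1, true) then
    -- user asked for specific partitions; leave the request alone
    return slurm.SUCCESS
  end
  local wants_gpu = job_desc.gres ~= nil and job_desc.gres:find("gpu", 1, true) ~= nil
  local keep = {}
  for p in parts:gmatch("[^,]+") do
    if p == "catchall" then
      -- drop the marker partition itself
    elseif p == "gpu" and not wants_gpu then
      -- no GPU GRES requested: keep the gpu nodes free for GPU jobs
    else
      table.insert(keep, p)
    end
  end
  job_desc.partition = table.concat(keep, ",")
  return slurm.SUCCESS
end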
[slurm-dev] Re: GPU node allocation policy
Chris, Just have GPU users request the numbers of CPU cores that they need and don't lie to Slurm about the number of cores. If a GPU user needs 4 cores and 4 GPUs, have them request that. That leaves 20 cores for others to use. Ryan On 04/06/2015 03:43 PM, Christopher B Coffey wrote: Hello, I’m curious how you handle the allocation of GPU’s and cores on GPU systems in your cluster. My new GPU system is 24 core, with 2 Tesla K80’s (4 gpus total). We allocate cores/mem by: SelectType=select/cons_res SelectTypeParameters=CR_Core_Memory What I’m thinking of doing is lying to Slurm about the true cores, and specifying CPUs=20, along with Gres=gpu:tesla:4. Is this a reasonable solution in order to ensure there is a core reserved for each gpu in the system? My thought is to allocate the 20 cores on the system to non-GPU type work instead of leaving them idle. Thanks! Chris
[slurm-dev] RE: fairshare allocations
On 01/21/2015 09:23 AM, Bill Wichser wrote:

A user underneath gets the expected 0.009091 normalized shares since there are a lot of fairshare=1 users there. user3 gets basically 25x this value since fairshare=25 for user3. Yet the normalized shares is actually MORE than the normalized shares for the account as a whole. What should I make of this?

This is actually by design in Fair Tree and is different from other algorithms. The manpage for sshare covers this under FAIR_TREE MODIFICATIONS. The manpage states that Norm Shares is "The shares assigned to the user or account normalized to the total number of assigned shares within the level." Basically, Norm Shares is the association's raw shares value divided by the sum of it and its sibling associations' assigned raw shares values. For example, if an account has 10 users, each having 1 assigned raw share, the Norm Shares value will be 0.1 for each of those users under Fair Tree. Fair Tree only uses Norm Shares and Effective Usage (the other sshare field that's modified) when comparing sibling associations.

Our Slurm UG presentation slides also mention this on pages 35 and 76 (http://slurm.schedmd.com/SUG14/fair_tree.pdf).

Ryan
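As a worked example consistent with Bill's numbers: if user3 has Fairshare=25 and, say, 85 sibling users each have Fairshare=1, the level total is 110 shares, so each small user gets 1/110 ≈ 0.009091 while user3 gets 25/110 ≈ 0.227273. That value can easily be larger than the account's own Norm Shares, because the account is only normalized against its sibling accounts, not against its children.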
[slurm-dev] Re: [ sshare ] RAW Usage
to this RAW usage. Roshan *From:*Ryan Cox ryan_...@byu.edu *Sent:* 25 November 2014 17:43 *To:* slurm-dev *Subject:* [slurm-dev] Re: [ sshare ] RAW Usage Raw usage is a long double and the time added by jobs can be off by a few seconds. You can take a look at _apply_new_usage() in src/plugins/priority/multifactor/priority_multifactor.c to see exactly what happens. Ryan On 11/25/2014 10:34 AM, Roshan Mathew wrote: Hello SLURM users, http://slurm.schedmd.com/sshare.html *Raw Usage* The number of cpu-seconds of all the jobs that charged the account by the user. This number will decay over time when PriorityDecayHalfLife is defined. I am getting different /RAW Usage/ values for the same job every time it is executed. The Job am using is a CPU stress test for 1 minute. It would be very useful to understand the formula for how this RAW Usage is calculated when we are using the plugin PriorityType=priority/multifactor. Snip of my slurm.conf file:- # Activate the Multi-factor Job Priority Plugin with decay PriorityType=priority/multifactor # apply no decay PriorityDecayHalfLife=0 PriorityCalcPeriod=1 PriorityUsageResetPeriod=MONTHLY # The larger the job, the greater its job size priority. PriorityFavorSmall=NO # The job's age factor reaches 1.0 after waiting in the # queue for 2 weeks. PriorityMaxAge=14-0 # This next group determines the weighting of each of the # components of the Multi-factor Job Priority Plugin. # The default value for each of the following is 1. PriorityWeightAge=0 PriorityWeightFairshare=100 PriorityWeightJobSize=0 PriorityWeightPartition=0 PriorityWeightQOS=0 # don't use the qos factor Thanks! Image removed by sender. Image removed by sender. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
[slurm-dev] Re: [ sshare ] RAW Usage
* # Activate the Multi-factor Job Priority Plugin with decay PriorityType=priority/multifactor # apply no decay PriorityDecayHalfLife=0 PriorityCalcPeriod=1 # reset usage after 1 month PriorityUsageResetPeriod=MONTHLY # The larger the job, the greater its job size priority. PriorityFavorSmall=NO # The job's age factor reaches 1.0 after waiting in the # queue for 2 weeks. PriorityMaxAge=14-0 # This next group determines the weighting of each of the # components of the Multi-factor Job Priority Plugin. # The default value for each of the following is 1. PriorityWeightAge=0 PriorityWeightFairshare=100 PriorityWeightJobSize=0 PriorityWeightPartition=0 PriorityWeightQOS=0 # don't use the qos factor *Questions* 1. Given that I have set the PriorityDecayHalfLife=0, i.e no decay applied at any stage, shouldnt both the jobs have the same RAW Usage reported by SSHARE? 2. Also shouldnt CPUTimeRAW in sacct be same as RAW Usage in sshare? From: Skouson, Gary B gary.skou...@pnnl.gov Sent: 25 November 2014 21:09 To: slurm-dev Subject: [slurm-dev] Re: [ sshare ] RAW Usage I believe that the info share data is kept by slurmctld in memory. As far as I could tell from the code, it should be checkpointing the info to the assoc_usage file wherever slurm is saving state information. I couldn’t find any docs on that, you’d have to check the code for more information. However, if you just want to see what was used, you can get the raw usage using sacct. For example, for a given job, you can do something like: sacct -X -a -j 1182128 --format Jobid,jobname,partition,account,alloccpus,state,exitcode,cputimeraw - Gary Skouson From: Roshan Mathew [mailto:r.t.mat...@bath.ac.uk] Sent: Tuesday, November 25, 2014 9:51 AM To: slurm-dev Subject: [slurm-dev] Re: [ sshare ] RAW Usage Thanks Ryan, Is this value stored anywhere in the SLURM accounting DB? I could not find any value for the JOB that corresponds to this RAW usage. Roshan From: Ryan Cox ryan_...@byu.edu Sent: 25 November 2014 17:43 To: slurm-dev Subject: [slurm-dev] Re: [ sshare ] RAW Usage Raw usage is a long double and the time added by jobs can be off by a few seconds. You can take a look at _apply_new_usage() in src/plugins/priority/multifactor/priority_multifactor.c to see exactly what happens. Ryan On 11/25/2014 10:34 AM, Roshan Mathew wrote: Hello SLURM users, http://slurm.schedmd.com/sshare.html Raw Usage The number of cpu-seconds of all the jobs that charged the account by the user. This number will decay over time when PriorityDecayHalfLife is defined. I am getting different RAW Usage values for the same job every time it is executed. The Job am using is a CPU stress test for 1 minute. It would be very useful to understand the formula for how this RAW Usage is calculated when we are using the plugin PriorityType=priority/multifactor. Snip of my slurm.conf file:- # Activate the Multi-factor Job Priority Plugin with decay PriorityType=priority/multifactor # apply no decay PriorityDecayHalfLife=0 PriorityCalcPeriod=1 PriorityUsageResetPeriod=MONTHLY # The larger the job, the greater its job size priority. PriorityFavorSmall=NO # The job's age factor reaches 1.0 after waiting in the # queue for 2 weeks. PriorityMaxAge=14-0 # This next group determines the weighting of each of the # components of the Multi-factor Job Priority Plugin. # The default value for each of the following is 1. PriorityWeightAge=0 PriorityWeightFairshare=100 PriorityWeightJobSize=0 PriorityWeightPartition=0 PriorityWeightQOS=0 # don't use the qos factor Thanks! 
--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
[slurm-dev] Re: [ sshare ] RAW Usage
Raw usage is a long double and the time added by jobs can be off by a few seconds. You can take a look at _apply_new_usage() in src/plugins/priority/multifactor/priority_multifactor.c to see exactly what happens. Ryan On 11/25/2014 10:34 AM, Roshan Mathew wrote: Hello SLURM users, http://slurm.schedmd.com/sshare.html *Raw Usage* The number of cpu-seconds of all the jobs that charged the account by the user. This number will decay over time when PriorityDecayHalfLife is defined. I am getting different /RAW Usage/ values for the same job every time it is executed. The Job am using is a CPU stress test for 1 minute. It would be very useful to understand the formula for how this RAW Usage is calculated when we are using the plugin PriorityType=priority/multifactor. Snip of my slurm.conf file:- # Activate the Multi-factor Job Priority Plugin with decay PriorityType=priority/multifactor # apply no decay PriorityDecayHalfLife=0 PriorityCalcPeriod=1 PriorityUsageResetPeriod=MONTHLY # The larger the job, the greater its job size priority. PriorityFavorSmall=NO # The job's age factor reaches 1.0 after waiting in the # queue for 2 weeks. PriorityMaxAge=14-0 # This next group determines the weighting of each of the # components of the Multi-factor Job Priority Plugin. # The default value for each of the following is 1. PriorityWeightAge=0 PriorityWeightFairshare=100 PriorityWeightJobSize=0 PriorityWeightPartition=0 PriorityWeightQOS=0 # don't use the qos factor Thanks!
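For reference, the decay behaviour the documentation describes can be summarized roughly (ignoring the bookkeeping details in _apply_new_usage()) as decaying each chunk of usage by its age:

Raw Usage ~= sum over past usage chunks of cpu_seconds_k * 2**(-age_k / PriorityDecayHalfLife)

With PriorityDecayHalfLife=0, as in the configuration quoted above, no decay is applied at all and raw usage simply accumulates until PriorityUsageResetPeriod clears it.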
[slurm-dev] Re: How many accounts can SLURM support?
Dave, I have done testing on 5-6 year old hardware with 100,000 users randomly distributed in 10,000 accounts with semi-random depths with most being between 1-4 levels from root but some much deeper than that, plus 100,000 jobs pending. slurmctld startup time was really long but, after getting started, fairshare and decay iterations in all fairshare algorithms took 50-150 milliseconds depending on how you measure it. Those calculations run no more frequently than once per minute and can be configured to run less frequently. You shouldn't have any problems. Ryan On 11/18/2014 12:30 PM, David Lipowitz wrote: How many accounts can SLURM support? Does anyone have a sense of how far SLURM scales regarding accounts and sub-accounts? In our batch environment, all jobs need to run under the same service account for a number of reasons (which I won't go into here). Since our scheduler knows which end user is actually submitting the job, we'd like to handle prioritization by creating sub-accounts for each user under each of the leaf accounts depicted below: root | +- query || |+- type_a || |+- type_b || |+- type_c || |+- type_d | +- process So I'd have five accounts, one for each type of query and another for the process account: query_type_a_dlipowitz query_type_b_dlipowitz query_type_c_dlipowitz query_type_d_dlipowitz process_dlipowitz And each other user would have five analogous accounts. Given that we have 600 users, can SLURM handle 3000 sub-accounts like this? If we doubled in size, could SLURM handle 6000? Thanks for any insight you might be able to offer. Cheers, Dave
[slurm-dev] Re: Non static partition definition
George, Wouldn't a QOS with GrpNodes=10 accomplish that? Ryan On 10/30/2014 11:47 AM, Brown George Andrew wrote: Hi, I would like to have a partition of N nodes without statically defining which nodes should belong to a partition and I'm trying to work out the best way to achieve this. Currently I have partitions which span across all the nodes in my cluster with differing settings, but I would like some of these to only occupy a subset of the cluster. I could say define partition A which can use all nodes but partition B may only access nodes 01-10. But I would like avoid partition B being reduced in size in the event of maintenance or hardware failure. I'm thinking the way to do this would be via a plugin. I would keep all partitions spanning all nodes in the cluster but upon submission check how many nodes are in use on the requested partition. If there were say already 10 nodes in use in partition B the job should be queued. However things then get a bit more complex as to when slurm should de-queue and then run the job. Is there a native method to do this in slurm? Essentially I would like something like the MaxNodes option that exists for partitions today but have it limit the total number of nodes used by jobs submitted to that partition rather than just a limit per job. Many thanks, George
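Something along these lines should do it (sacctmgr syntax from memory, so treat it as a sketch and verify against your version):

sacctmgr add qos instb
sacctmgr modify qos instb set GrpNodes=10
sacctmgr modify user where name=someuser set QOS+=instb DefaultQOS=instb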
[slurm-dev] Re: Understanding Fairshare and effect on background/backfill type partitions
Trey, I'm not sure why your jobs aren't starting. Someone else will have to answer that question. You can model an organizational hierarchy a lot better in 14.11 due to changes in Fairshare=parent for accounts. If you only want fairshare to matter at the research group and user levels but want to maintain an account structure that reflects your organization, set everything above the research group to be Fairshare=parent. It makes it so that those accounts disappear for fairshare calculation purposes (but not limits, accounting, etc). As for fairshare, precision loss can be a real issue and I'm guessing that you're affected. I won't rehash our Slurm UG presentation here, but we spent some time discussing precision loss issues. What normalized shares values do you see? Try plugging that into 2^(-EffectvUsage / SharesNorm) to see how small the number is. That number then has to be multiplied by PriorityWeightFairshare, which I see you sized properly. I would suggest looking at the Fair Tree fairshare algorithm once 14.11 is released. In case you want more information: http://slurm.schedmd.com/SUG14/fair_tree.pdf and https://fsl.byu.edu/documentation/slurm/fair_tree.php. The slides in the first link also discuss Fairshare=parent in slides 82-91. Ryan Disclaimer: I have some personal interest in both of the suggestions since we developed them. On 10/24/2014 10:49 AM, Trey Dockendorf wrote: Understanding Fairshare and effect on background/backfill type partitions In our setup we use a background partition that can be preempted but has access to the entire cluster. The idea is that when stakeholder partitions are not fully utilized, users can be opportunistic in making use of the cluster when the system is not 100% utilized. Recently I submitted a batch of jobs , ~60, to our background partition. All nodes were idle but half my jobs ended up pending with reason of Priority. I checked sshare and my FairShare value was at 0.00. Would my Fairshare dropping to 0 cause my jobs to be queued when resources were IDLE and no other jobs were queued in that partition besides my own? I'm also wondering what method is used to come up with sane Fairshare values. We have a (likely unnecessarily) complex account structure in slurmdbd that mimics the organizational structure of the departments / colleges / research groups using the cluster. Be interested how other groups have configured fairshare and the multifactor priority. 
For completeness, here are relevant config items I'm working with: AccountingStorageEnforce=limits,qos PreemptMode=SUSPEND,GANG PreemptType=preempt/partition_prio PriorityCalcPeriod=5 PriorityDecayHalfLife=7-0 PriorityFavorSmall=YES PriorityFlags=SMALL_RELATIVE_TO_TIME PriorityMaxAge=7-0 PriorityType=priority/multifactor PriorityUsageResetPeriod=NONE PriorityWeightAge=2000# 20% PriorityWeightFairshare=4000 # 40% PriorityWeightJobSize=3000# 30% PriorityWeightPartition=0 # 0% PriorityWeightQOS=1000# 10% SchedulerParameters=assume_swap # An option for in-house patch SchedulerTimeSlice=30 SchedulerType=sched/backfill SelectType=select/cons_res SelectTypeParameters=CR_CPU_Memory,CR_CORE_DEFAULT_DIST_BLOCK Example of a stakeholder partition and background: PartitionName=hepx Nodes=c0[101-116,120-132,227,416,530-532,933-936] Priority=100 AllowQOS=hepx MaxNodes=1 MaxTime=120:00:00 State=UP PartitionName=background Priority=10 AllowQOS=background MaxNodes=1 MaxTime=96:00:00 State=UP Thanks, - Trey = Trey Dockendorf Systems Analyst I Texas AM University Academy for Advanced Telecommunications and Learning Technologies Phone: (979)458-2396 Email: treyd...@tamu.edu mailto:treyd...@tamu.edu Jabber: treyd...@tamu.edu mailto:treyd...@tamu.edu
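To make the precision-loss point concrete with numbers on the scale quoted in this thread (purely illustrative): with SharesNorm = 0.000323 and EffectvUsage = 0.01, 2^(-EffectvUsage / SharesNorm) = 2^(-30.96), which is about 5e-10. Multiplied by a PriorityWeightFairshare of a few thousand, that contributes essentially zero priority points, so heavily-used associations all collapse to the same fairshare factor no matter how much their usage differs.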
[slurm-dev] RE: EXTERNAL: Re: question on multifactor priority plugin - fairshare basics
Ed,

Your math looks correct. In 14.11 you can achieve what you want by setting Fairshare=parent on your dev account with sacctmgr. Fairshare=parent on accounts (only defined on users prior to 14.11) makes it so that accounts effectively disappear for fairshare calculations but still exist for limits and organizational purposes. Children are effectively reparented to their account's parent (root in your case) for fairshare.

Ryan

On 10/14/2014 08:06 PM, Blosch, Edwin L wrote:

Thanks for the reply Ryan,

Yes, I'm using the basic fairshare. I am trying to use fairshare across a flat listing of users only, with a placeholder parent account called 'dev', but for now, it has no siblings. All users are under 'dev'. I think the way it is calculated, in my configuration, the largest fairshare I will ever see is 0.5.

F = 2**(-Ue/S), where in my case S = 1000 / 16000 (1000 per user, 16 users who each have 1000), and I have Ue = S for a user who has never submitted a job yet, because Ue = 0 (Uactual) + (1.0 - 0.0)*1000/16000 (1.0 is parent usage, which is always 1.0 in my case because dev is the only parent account for any user).

I was expecting/hoping/wishing the values would be between 0.0 and 1.0, but I can work with 0.5 as the max value. It just means that I need to double the PriorityWeightFairshare factor in order to achieve the intended relative weighting between Fairshare, QOS, Partitions, JobSize, Age.

Ed

From: Ryan Cox [mailto:ryan_...@byu.edu]
Sent: Tuesday, October 14, 2014 6:00 PM
To: slurm-dev
Subject: EXTERNAL: [slurm-dev] Re: question on multifactor priority plugin - fairshare basics

I assume you are using the default fairshare algorithm since you didn't specify otherwise. F=2**(-U/S) where U is Effectv Usage (often displayed in documentation as UE) and S is Norm Shares. See http://slurm.schedmd.com/priority_multifactor.html under the heading The SLURM Fair-Share Formula. Basically, Effectv Usage needs to be less than Norm Shares for Fairshare to be greater than 0.5.

Ryan

On 10/14/2014 04:27 PM, Blosch, Edwin L wrote:

I must be misunderstanding a basic concept here. What conditions would have to exist to cause a Fairshare value greater than 0.5?

[bloscel@maruhpc5 ~]$ sshare -a
Account    User      Raw Shares  Norm Shares  Raw Usage  Effectv Usage  FairShare
---------- --------- ----------- ------------ ---------- -------------- ----------
root                                    1.00    11376527           1.00       0.50
root       root                0        0.00           0           0.00       0.00
cfd                            1        1.00    11376527           1.00       0.50
cfd        bendeee          1000    0.076923           0       0.076923       0.50
cfd        bloscel          1000    0.076923      712296       0.134718   0.297027
(more users under same group)

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
[slurm-dev] Re: question on multifactor priority plugin - fairshare basics
I assume you are using the default fairshare algorithm since you didn't specify otherwise. F=2**(-U/S) where U is Effectv Usage (often displayed in documentation as UE) and S is Norm Shares. See http://slurm.schedmd.com/priority_multifactor.html under the heading The SLURM Fair-Share Formula. Basically, Effectv Usage needs to be less than Norm Shares for Fairshare to be greater than 0.5.

Ryan

On 10/14/2014 04:27 PM, Blosch, Edwin L wrote:

I must be misunderstanding a basic concept here. What conditions would have to exist to cause a Fairshare value greater than 0.5?

[bloscel@maruhpc5 ~]$ sshare -a
Account    User      Raw Shares  Norm Shares  Raw Usage  Effectv Usage  FairShare
---------- --------- ----------- ------------ ---------- -------------- ----------
root                                    1.00    11376527           1.00       0.50
root       root                0        0.00           0           0.00       0.00
cfd                            1        1.00    11376527           1.00       0.50
cfd        bendeee          1000    0.076923           0       0.076923       0.50
cfd        bloscel          1000    0.076923      712296       0.134718   0.297027
(more users under same group)
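As a quick check against the sshare output above: for cfd/bloscel, U = 0.134718 and S = 0.076923, so F = 2**(-0.134718/0.076923) = 2**(-1.7514) ≈ 0.297, matching the FairShare column. bendeee has no usage of its own but carries its share of the parent account's usage, so U = S and F = 2**(-1) = 0.5 exactly.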
[slurm-dev] Re: Submitting to multiple partitions with job_submit plugin (Was: Implementing fair-share policy using BLCR)
On 09/23/2014 11:27 AM, Trey Dockendorf wrote:

Has anyone used the Lua job_submit plugin while also allowing multiple partitions? I'm not even sure what the partition value would be in the Lua code when a job is submitted with --partition=general,background, for example.

We do. We use the all_partitions plugin and our own Lua plugin for job submission. In the Lua script, we remove partitions from the array that they shouldn't have access to for whatever reason. Reasons include: the job didn't request enough memory to need a bigmem node, the job didn't request a GPU and this is a GPU partition, etc. The partition string has commas so you can explode() it into an array.

Ryan
[slurm-dev] Re: Dynamic partitions on Linux cluster
I would also recommend QOS if you absolutely can't use fairshare. Set up a QOS per institute with a GrpNodes limit that is the correct ratio and only allow institute members to use their QOS (make it their default too). Alternatively you can also do one account per institute and set GrpNodes there, though that is less flexible than a QOS. Ryan On 08/14/2014 07:48 AM, Paul Edmon wrote: We have a bit of a similar situation here. A possible solution that may work for you is QoS. The QoS's behave like a synthetic partition. That way you can have a single partition but multiple QoS's which can flex around down nodes. From the experimentation I have done with them this may be a good solution for you. -Paul Edmon- On 08/14/2014 09:25 AM, Uwe Sauter wrote: I would totally agree with you but university administration has to justify the part of the first institute (because it was paid with federal money) while the other institute paid for themselves and can do with their part what they want. This is the reason for the current inflexible mapping between partitions and nodes. To get away from that for better availability I'm looking for a way to have a dynamic mapping that just enforces the ratio between the institutes while flexibly allocating the nodes from the whole pool. I know it's a waste of resources but I am bound to this decision... Regards, Uwe Am 14.08.2014 um 14:59 schrieb Bill Barth: Yes, yes it does. I don't mean to be harsh, but doing it their way is a potentially huge waste of resources. Why not get each institute to agree to share the whole machine in proportion to what they paid? Each institute gets an allocation of time (through accounting) and a fairshare fraction in the ratio of their contribution, but is allowed to use the whole machine. If both institutes have periods of down time, then the machine will be less likely to sit idle and more work will get done. I'll get off my soapbox now. Best, Bill. -- Bill Barth, Ph.D., Director, HPC bba...@tacc.utexas.edu | Phone: (512) 232-7069 Office: ROC 1.435 | Fax: (512) 475-9445 On 8/14/14, 7:48 AM, Uwe Sauter uwe.sauter...@gmail.com wrote: Hi Bill, if I understand the concept of fairshare correctly, this could result in a situation where one institute uses all resources. Because of this, fairshare is out of the question as I have to enforce the ratio between the institutes - I cannot allow usage that would result in one institute using more than what they paid for. If an institute doesn't use the resources they have to run idle (or power down). You could compare my situation with running two clusters that use the same base infrastructure. What I want to do is enable users of both institutes to use both clusters - but at each point in time use at most the number of nodes that belong to their cluster. Regards, Uwe Am 14.08.2014 um 14:34 schrieb Bill Barth: Why not make one partition and use fairshare to balance the usage over time? That way both institutes can run large jobs that span the whole machine when others are not using it. Bill. -- Bill Barth, Ph.D., Director, HPC bba...@tacc.utexas.edu | Phone: (512) 232-7069 Office: ROC 1.435 | Fax: (512) 475-9445 On 8/14/14, 4:11 AM, Uwe Sauter uwe.sauter...@gmail.com wrote: Hi all, I got a question about a configuration detail: dynamic partitions. Situation: I operate a Linux cluster of currently 54 nodes for a cooperation of two different institutes at the university. To reflect the ratio of cash those institutes invested I configured SLURM with two partitions, one for each institute.
Those partitions have different numbers of nodes assigned to them in a hard-coded way, e.g.
PartitionName=InstA Nodes=n[01-20]
PartitionName=InstB Nodes=n[21-54]
To improve availability in case nodes break (and perhaps save some power) I'd like to configure SLURM in a way that jobs can be assigned nodes from the whole pool, respecting the number of nodes each institute bought. Research so far: There is an option for partition configuration called MaxNodes but the man pages state that this restricts the maximum number of nodes PER JOB. It probably is possible to get something similar working using limit enforcement through accounting, but I haven't understood that part of SLURM 100% yet. BlueGene systems seem to have the ability for something similar but then this is for IBM systems only. Question: Is it possible to configure SLURM so that both partitions could utilize all nodes but respect a maximum number of nodes that may be used at the same time? Something like:
PartitionName=InstA Nodes=n[01-54] MaxPartNodes=20
PartitionName=InstB Nodes=n[01-54] MaxPartNodes=34
So is there a way to achieve this using the config file? Do I have to use accounting to enforce the limits? Or is there another way that I don't see? Best regards, Uwe Sauter -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
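For anyone who wants to try the QOS route described above, here is a minimal sketch; the QOS names, node counts, partition name, and user name are made up for illustration, and it assumes the accounting database is already in place so limits can be enforced:

# one QOS per institute, capped at that institute's share of the nodes
sacctmgr add qos inst_a
sacctmgr add qos inst_b
sacctmgr modify qos inst_a set GrpNodes=20
sacctmgr modify qos inst_b set GrpNodes=34
# allow each institute's users only their own QOS and make it their default
sacctmgr modify user where name=alice set qos=inst_a defaultqos=inst_a

# slurm.conf: one partition spanning all nodes, with limit enforcement turned on
PartitionName=all Nodes=n[01-54] Default=YES
AccountingStorageEnforce=limits,qos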
[slurm-dev] RE: fairshare - memory resource allocation
Janne, I appreciate the feedback. I agree that it makes the most sense to specify rates like DRF most of the time. However, there are some use cases that I'm aware of and others that are probably out there that would make a DRF imitation difficult or less desirable if it's the only option. We happen to have one partition that has mixed memory amounts per node, 32 GB and 64 GB. Besides the memory differences (long story), the nodes are homogeneous and each have 16 cores. I'm not sure I would like the DRF approach for this particular scenario. In this case we would like to set the charge rate to be .5/GB, or 1 core == 2 GB RAM. If someone needs 64 GB per node, they are contending for a more limited resource and we would be happy to double the charge rate for the 64 GB nodes. If they need all 64 GB, they would end up being charged for 32 CPU/processor equivalents instead of 16. With DRF that wouldn't be possible if I understand correctly. One other feature that could be interesting is to have a baseline standard for a CPU charge on a per-partition basis. Let's say that you have three partitions: old_hardware, new_hardware, and super_cooled_overclocked_awesomeness. You could set the per CPU charges to be 0.8, 1.0, and 20.0. That would reflect that a cpu-hour on one partition doesn't result in the same amount of computation as in another partition. You could accomplish the same thing automatically by using a QOS (and maybe some other parameter I'm not aware of) and maybe a job submit plugin but this would make it easier. I don't know that we would do this in our setup but it would be possible. It would be possible to add a config parameter that is something like Mem=DRF that would auto-configure it to match. The one question I have about that approach is what to do about partitions with non-homogeneous nodes. Does it make sense to sum the total cores and memory, etc or should it default to a charge rate that is the min() of the node configurations? Of course, partitions with mixed node types could be difficult to support no matter what method is used for picking charge rates. So yes, having a DRF-like auto-configuration could be nice and we might even use it for most of our partitions. I don't think I'll attempt it for the initial implementation but we'll see. Thanks, Ryan On 07/30/2014 03:31 PM, Blomqvist Janne wrote: Hi, if I understand it correctly, this is actually very close to Dominant Resource Fairness (DRF) which I mentioned previously, with the difference that in DRF the charge rates are determined automatically from the available resources (in a partition) rather than being specified explicitly by the administrator. So for an example, say we have a partition with 100 cores and 400 GB memory. Now for a job requesting (10CPU's, 20 GB) the domination calculation proceeds as follows: 1) Calculate the domination vector by dividing each element in the request vector (here, CPU MEM) with the available resources. That is (10/100, 20/400) = (0.1, 0.05). 2) The MAX element in the domination vector is chosen (it dominates the others, hence the name of the algorithm) as the one to use in fairshare calculations, accounting etc. In this case, the CPU element (0.1). Now for another job request, (1CPU, 20 GB) the domination vector is (0.01, 0.05) and the MAX element is the memory element (0.05), so in this case the memory part of the request dominates. In your patch you have used cpu-sec equivalents rather than dominant share secs, but that's just a difference of a scaling factor. 
From a backwards compatibility and user education point of view cpu-sec equivalents seem like a better choice to me, actually. So while your patch is more flexible than DRF in that it allows arbitrary charge rates to be specified, I'm not sure it makes sense to specify rates different from the DRF ones? Or if one does specify different rates, it might end up breaking some of the fairness properties that are described in the DRF paper and open the algorithm up to gaming? -- Janne Blomqvist From: Ryan Cox [ryan_...@byu.edu] Sent: Tuesday, July 29, 2014 18:47 To: slurm-dev Subject: [slurm-dev] RE: fairshare - memory resource allocation I'm interested in hearing opinions on this, if any. Basically, I think there is an easy solution to the problem of a user using few CPUs but a lot of memory and that not being reflected well in the CPU-centric usage stats. Below is my proposal. There are likely some other good approaches out there too (Don and Janne presented some) so feel free to tell me that you don't like this idea :) Short version: I propose that the Raw Usage be modified to *optionally* be (CPU equivalents * time) instead of just (CPUs * time). The CPU equivalent would be a MAX() of CPUs, memory, nodes, GPUs, energy over that time period, or whatever multiplied by a corresponding charge rate
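To make the domination-vector arithmetic concrete, here is a small stand-alone sketch of the DRF calculation for Janne's example partition (100 cores, 400 GB); the numbers are the ones from the example above, nothing Slurm-specific:

# dominant share for a (10 CPU, 20 GB) request in a 100-core / 400 GB partition
awk -v req_cpu=10 -v req_mem=20 -v part_cpu=100 -v part_mem=400 'BEGIN {
    cpu_share = req_cpu / part_cpu           # 10/100 = 0.10
    mem_share = req_mem / part_mem           # 20/400 = 0.05
    dom = (cpu_share > mem_share) ? cpu_share : mem_share
    printf "cpu=%.2f mem=%.2f dominant=%.2f\n", cpu_share, mem_share, dom
}'
# for a (1 CPU, 20 GB) request the same arithmetic gives cpu=0.01 mem=0.05,
# so the memory term dominates, matching the second example above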
[slurm-dev] RE: fairshare - memory resource allocation
Thanks. I can certainly call it that. My understanding is that this would be a slightly different implementation from Moab/Maui, but I don't know those as well so I could be wrong. Either way, the concept is similar enough that a more recognizable term might be good. Does anyone else have thoughts on this? I called it CPU equivalents because the calculation in the code is currently (total_cpus * time) so I stuck with CPUs. Slurm seems to use lots of terms somewhat interchangeably so I couldn't really decide. I don't really have an opinion on the name so I'll just accept what others decide. Ryan On 07/31/2014 02:28 AM, Bjørn-Helge Mevik wrote: Just a short note about terminology. I believe processor equivalents (PE) is a much used term for this. It is at least what Maui and Moab uses, if I recall correctly. The resource*time would then be PE seconds (or hours, or whatever).
[slurm-dev] RE: fairshare - memory resource allocation
. The patch currently implements charging for CPUs, memory (GB), and nodes. Note: I saw a similar idea in a bug report from the University of Chicago: http://bugs.schedmd.com/show_bug.cgi?id=858. Ryan On 07/25/2014 10:31 AM, Ryan Cox wrote: Bill and Don, We have wondered about this ourselves. I just came up with this idea and haven't thought it through completely, but option two seems like the easiest. For example, you could modify lines like https://github.com/SchedMD/slurm/blob/8a1e1384bacf690aed4c1f384da77a0cd978a63f/src/plugins/priority/multifactor/priority_multifactor.c#L952 to have a MAX() of a few different types. I seem to recall seeing this on the list or in a bug report somewhere already, but you could have different charge rates for memory or GPUs compared to a CPU, maybe on a per partition basis. You could give each of them a charge rate like: PartitionName=p1 ChargePerCPU=1.0 ChargePerGB=0.5 ChargePerGPU=2.0 .. So the line I referenced would be something like the following (except using real code and real struct members, etc): real_decay = run_decay * MAX(CPUs*ChargePerCPU, TotalJobMemory*ChargePerGB, GPUs*ChargePerGPU); In this case, each CPU is 1.0 but each GB of RAM is 0.5. Assuming no GPUs used, if the user requests 1 CPU and 2 GB of RAM the resulting usage is 1.0. But if they use 4 GB of RAM and 1 CPU, it is 2.0 just like they had been using 2 CPUs. Essentially you define every 2 GB of RAM to be equal to 1 CPU, so raw_usage could be redefined to deal with cpu equivalents. It might be harder to explain to users but I don't think it would be too bad. Ryan On 07/25/2014 10:05 AM, Lipari, Don wrote: Bill, As I understand the dilemma you presented, you want to maximize the utilization of node resources when running with Slurm configured for SelectType=select/cons_res. To do this, you would like to nudge users into requesting only the amount of memory they will need for their jobs. The nudge would be in the form of decreased fair-share priority for users' jobs that request only one core but lots of memory. I don't know of a way for Slurm to do this as it exists. I can only offer alternatives that have their pros and cons. One alternative would be to add memory usage support to the multifactor priority plugin. This would be a substantial undertaking as it touches code not just in multifactor/priority_multifactor.c but also in structures that are defined in common/assoc_mgr.h as well as sshare itself. A second less invasive option would be to redefine the multifactor/priority_multifactor.c's raw_usage to make it a configurable blend of cpu and memory usage. These changes could be more localized to the multifactor/priority_multifactor.c module. However you would have a harder time justifying a user's sshare report because the usage numbers would no longer track jobs' historical cpu usage. You response to a user who asked you to justify their sshare usage report would be, trust me, it's right. A third alternative (as I'm sure you know) is to give up on perfectly packed nodes and make every 4G of memory requested cost 1 cpu of allocation. Perhaps there are other options, but those are the ones that immediately come to mind. Don Lipari -Original Message- From: Bill Wichser [mailto:b...@princeton.edu] Sent: Friday, July 25, 2014 6:14 AM To: slurm-dev Subject: [slurm-dev] fairshare - memory resource allocation I'd like to revisit this... 
After struggling with memory allocations in some flavor of PBS for over 20 years, it was certainly a wonderful thing to have cgroup support right out of the box with Slurm. No longer do we have a shared node's jobs eating all the memory and killing everything running there. But we have found that there is a cost to this and that is a failure to adequately feed back this information to the fairshare mechanism. In looking at running jobs over the past 4 months, we found a spot where we could reduce the DefMemPerCPU allocation in slurm.conf to a value about 1G less than the actual G/core available. This meant that we had to notify the users close to this max value so that they could adjust their scripts. We also notified users that if this value was too high that they'd do best to reduce that limit to exactly what they require. This has proven much less successful. So our default is 3G/core with an actual node having 4G/core available. This allows some bigger memory jobs and some smaller memory jobs to make use of the node as there are available cores but not enough memory for the default case. Now that is good. It allows higher utilization of nodes, all the while protecting the memory of each other's processes. But the problem of fairshare comes about pretty quickly when there are jobs requiring say half the node's memory. This is mostly serial jobs requesting a single core. So this leaves about 11 cores with only about 2G/core left. Worse, when it comes
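For reference, a hedged sketch of what Bill's setup might look like in slurm.conf; the node and partition names and the 16-core/64 GB node size are invented, but the 3 GB/core default against 4 GB/core of physical memory matches what he describes:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=3072                      # 3 GB/core default on nodes that have 4 GB/core
ProctrackType=proctrack/cgroup         # cgroups enforce the per-job memory, as discussed
TaskPlugin=task/cgroup
NodeName=node[01-10] CPUs=16 RealMemory=64000
PartitionName=general Nodes=node[01-10] Default=YES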
[slurm-dev] Re: fairshare
Bill, I may be wrong (corrections welcomed), but I'm pretty sure you'll have to use a database query. My understanding is that the decayed usage is stored as a single usage_raw value per association (https://github.com/SchedMD/slurm/blob/f8025c1484838ecbe3e690fa565452d990123361/src/plugins/priority/multifactor/priority_multifactor.c#L1119). There is no history of any kind. You would have to do a fairly complex query to get an accurate representation or write some code to recreate the way Slurm does it. If you look at _apply_decay() and _apply_new_usage() in src/plugins/priority/multifactor/priority_multifactor.c, you can see all that happens. Basically, once per decay thread iteration each association's usage_raw and the job's cputime for that time period is calculated and decayed accordingly. This can happen many, many times over the length of a job. If a job terminates before reaching its timelimit, the remaining allocated cputime is immediately added all at the same time (https://github.com/SchedMD/slurm/blob/f8025c1484838ecbe3e690fa565452d990123361/src/plugins/priority/multifactor/priority_multifactor.c#L1036). Those are some of the issues that you may run into while creating a database tool for this. I could be mistaken on some of the details but that is my understanding of the code (we looked recently for an unrelated reason). Ryan On 07/14/2014 02:15 PM, Bill Wichser wrote: Is there any way to get a better view of fairshare than the sshare command? Under PBS, there was the diagnose -f command which showed the breakdown per set time period which calculated this value. What was nice about this was I could point a group to this command, or cut and paste, showing that you have been using 20% over the last 30 days even though you haven't run anything in the last three days. It's a much more difficult problem when asked now. I have no tool which shows the value, and decay, over the time. So I'm wondering if anyone has a method to demonstrate that, yes, this fairshare value is correct and here is why. Or do I just need to figure out a database query to cull this information? Thanks, Bill -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
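For the "show a group what they used over the last 30 days" part of the question, sreport can at least produce the historical per-account totals, even though it knows nothing about the decay the priority plugin applies; a sketch (adjust dates and flags to taste):

sreport cluster AccountUtilizationByUser start=$(date -d '-30 days' +%Y-%m-%d) end=$(date +%Y-%m-%d) -t hours
# the decayed number Slurm actually uses is only visible as the single RawUsage value per association:
sshare -l -a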
[slurm-dev] Re: installing slurm on CentOS 5.10
Steve, Our script generator was rewritten recently and released on Github: https://github.com/BYUHPC/BYUJobScriptGenerator. You might want to try that out and tailor it for your needs, though we have no problem with people linking to our site directly if you don't want to host your own version. Ryan On 06/24/2014 08:38 AM, Love, Steve W. wrote: Hello, I’m trying to build a version of SLURM on a VM for the purpose of testing. The VM is running CentOS 5.10 and has 4 processors. Our HPC users will be faced with the task of changing their submission scripts from a cluster running SGE to one where they’ll be using SLURM. I’d like to use the installation of SLURM in order that our users can test simple scripts with; _https://marylou.byu.edu/documentation/slurm/script-generator_ I’ve been following some notes from the Clustervision user portal which suggests performing the following; use yum to install numactl libraries build hwloc which I did with; ./configure --prefix=/usr/local/hwloc/1.8.1 build munge which I did with; ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var make make install build slurm which I did with; ./configure --prefix=/usr/local/slurm/14.03.3-2/ --enable-multiple-slurmd --with-hwloc=/usr/local/hwloc/1.8.1/ --enable-pam When I try to start a slurm daemon it complains about not having any configuration files ... which I can never find? I’ve went with; ./configure --prefix=/usr/local/slurm/14.03.3-2/ --enable-multiple-slurmd --with-hwloc=/usr/local/hwloc/1.8.1/ --enable-pam --sysconfdir=/usr/local/slurm/14.03.3-2/ But that too failed to produce any config files. Any ideas as to what I’m doing wrong here? Thanks, Steve Love. British Geological Survey Edinburgh _ _ This message (and any attachments) is for the recipient only. NERC is subject to the Freedom of Information Act 2000 and the contents of this email and any reply you make may be disclosed by NERC unless it is exempt from release under the Act. Any material supplied to NERC may be stored in an electronic records management system. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
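One detail that trips people up here: ./configure never generates a slurm.conf, it only records where the daemons will look for one (--sysconfdir). A hedged sketch of the missing steps, reusing Steve's paths where possible (the generated slurm.conf path is a placeholder):

./configure --prefix=/usr/local/slurm/14.03.3-2 --sysconfdir=/etc/slurm \
    --with-hwloc=/usr/local/hwloc/1.8.1 --enable-multiple-slurmd --enable-pam
make && make install
# create slurm.conf yourself, e.g. with doc/html/configurator.html from the source
# tree or the online configurator, then put it where --sysconfdir points:
mkdir -p /etc/slurm
cp /path/to/generated/slurm.conf /etc/slurm/slurm.conf
/usr/local/slurm/14.03.3-2/sbin/slurmctld -D -vvv   # foreground start to check that the config parses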
[slurm-dev] LEVEL_BASED prioritization method
Levi Morrison and I have developed a new Slurm prioritization method that we call LEVEL_BASED. It prioritizes users such that users in an under-served account will always have a higher fair share factor than users in an over-served account. It works very well for us, though I understand that many sites have different needs. If you're interested, check out the documentation at https://fsl.byu.edu/documentation/slurm/level_based.php or try it out at https://github.com/BYUHPC/slurm in the level_based branch. If you want to read about some of the problems we ran into with existing algorithms (as they apply to our use case), see http://tech.ryancox.net/2014/06/problems-with-slurm-prioritization.html. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
[slurm-dev] Fairshare=parent on an account: What should it do?
We're trying to figure out what the intended behavior of Fairshare=parent is when set on an account (http://bugs.schedmd.com/show_bug.cgi?id=864). We know what the actual behavior is but we're wondering if anyone actually likes the current behavior. There could be some use case out there that we don't know about. For example, you can end up with a scenario like the following:

             acctProf
            /    |    \
acctTA(parent)  uD(5)  uE(5)
    /    |    \
uA(5)  uB(5)  uC(5)

The number in parentheses is Fairshare according to sacctmgr. We incorrectly thought that Fairshare=parent would essentially collapse the tree so that uA through uE would all be on the same level. Thus, all five users would each get 5 / 25 shares. What actually happens is you get the following shares at the user level:
shares (uA) = 5 / 15 = .333
shares (uB) = 5 / 15 = .333
shares (uC) = 5 / 15 = .333
shares (uD) = 5 / 10 = .5
shares (uE) = 5 / 10 = .5
That's pretty far off from each other, but not as far as it would be if one account had two users and the other had forty. Assuming this demonstration value of 5 shares, that would be:
user_in_small_account = 5 / (2*5) = .5
user_in_large_account = 5 / (40*5) = .025
Is that actually useful to someone? We want to use subaccounts below a faculty account to hold, for example, a grad student or postdoc who teaches a class. It would be nice for the grad student to have administrative control over the subaccount since he actually knows the students but not have it affect priority calculations. Ryan -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
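If anyone wants to reproduce the example, something along these lines should recreate the tree (the account, user, and share values match the diagram; the exact sacctmgr syntax may vary slightly by version):

sacctmgr add account acctProf
sacctmgr add account acctTA parent=acctProf fairshare=parent
sacctmgr add user uA account=acctTA fairshare=5
sacctmgr add user uB account=acctTA fairshare=5
sacctmgr add user uC account=acctTA fairshare=5
sacctmgr add user uD account=acctProf fairshare=5
sacctmgr add user uE account=acctProf fairshare=5
sshare -a    # shows the resulting normalized shares per user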
[slurm-dev] Re: How to spread jobs among nodes?
Rather than maximize fragmentation, you probably want to do it on a per-job basis. If you want one core per node: sbatch: -N $numnodes -n $numnodes. Anything else would require the -m flag. I haven't played with it recently but I think you would want -m cyclic. Ryan On 05/08/2014 11:49 AM, Atom Powers wrote: How to spread jobs among nodes? It appears that my Slurm cluster is scheduling jobs to load up nodes as much as possible before putting jobs on other nodes. I understand the reasons for doing this, however I foresee my users wanting to spread jobs out among as many nodes as possible for various reasons, some of which are even valid. How would I configure the scheduler to distribute jobs in something like a round-robin fashion to many nodes instead of loading jobs onto just a few nodes? I currently have: 'SchedulerType' = 'sched/builtin', 'SelectTypeParameters' = 'CR_Core_Memory', 'SelectType'= 'select/cons_res', -- Perfection is just a word I use occasionally with mustard. --Atom Powers-- -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
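A couple of concrete command lines for the above; my_app and the node/task counts are placeholders:

# one task per node, so 8 tasks land on 8 different nodes
sbatch -N 8 -n 8 --wrap="srun ./my_app"
# or explicitly request a cyclic (round-robin) distribution of tasks across the allocation
sbatch -N 4 -n 16 -m cyclic --wrap="srun ./my_app"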
[slurm-dev] Re: Need Help Understanding Cgroup Swapiness
Note that the output of your job was printed successfully, then slurmstepd output occurred. At job/step exit time, the Slurm code simply reads the memory.failcnt and memory.memsw.failcnt files in the relevant cgroup (explanation: https://www.kernel.org/doc/Documentation/cgroups/memory.txt). Your job's cgroup has memory.failcnt > 0, meaning some of the job was swapped out but not killed. The output is different for memory.memsw.failcnt > 0 because that means that a process was killed. Ryan On 04/21/2014 01:48 PM, Guglielmi Matteo wrote: Installed memory per node: RAM 32 GB SWAP 10 GB
### slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SelectTypeParameters=CR_Core_Memory
NodeName=... RealMemory=29000
### cgroup.conf
AllowedRAMSpace=100
AllowedSwapSpace=30.0
ConstrainRAMSpace=YES
ConstrainSwapSpace=YES
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30
This program just eats up the requested amount of memory:
### memoryHog.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define PAGE_SZ (1<<12)
int main(int argc, char **argv)
{
    int i;
    int gb = atoi(argv[1]); // memory to consume in GB
    for (i = 0; i < ((unsigned long)gb << 30) / PAGE_SZ; ++i) {
        void *m = malloc(PAGE_SZ);
        if (!m)
            break;
        memset(m, 0, 1);
    }
    printf("allocated %lu MB\n", ((unsigned long)i * PAGE_SZ) >> 20);
    sleep(10);
    return 0;
}
### TESTING ###
$ salloc --mem-per-cpu=9000
salloc: Granted job allocation 1503
$ srun memoryHog.x 8
allocated 8192 MB
$ srun memoryHog.x 9
allocated 9050 MB
slurmstepd: Exceeded step memory limit at some point. Step may have been partially swapped out to disk.
### LOGS: /var/log/slurm/slurmd.log ###
[2014-04-15T18:58:26.212] [1503.0] task/cgroup: /slurm/uid_500/job_1503: alloc=9000MB mem.limit=9000MB memsw.limit=11700MB
[2014-04-15T18:58:26.212] [1503.0] task/cgroup: /slurm/uid_500/job_1503/step_0: alloc=9000MB mem.limit=9000MB memsw.limit=11700MB
[2014-04-15T18:58:39.961] [1503.0] done with job
..
..
[2014-04-15T18:58:45.916] [1503.1] task/cgroup: /slurm/uid_500/job_1503: alloc=9000MB mem.limit=9000MB memsw.limit=11700MB
[2014-04-15T18:58:45.916] [1503.1] task/cgroup: /slurm/uid_500/job_1503/step_1: alloc=9000MB mem.limit=9000MB memsw.limit=11700MB
[2014-04-15T18:59:01.087] [1503.1] Exceeded step memory limit at some point. Step may have been partially swapped out to disk.
[2014-04-15T18:59:01.120] [1503.1] done with job
Since slurm sets memsw.limit=11700MB I was expecting the cgroup feature to start swapping out the exceeding 50 MB or so... they would actually fit in the swap area and the job should not be killed... What am I missing here? Should the code itself be aware of the given mem.limit=9000MB? Thanks for any explanation. MG -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
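If you want to see exactly what Slurm is reporting on, you can read the same counters yourself while the job is running; the uid/job IDs below come from the log excerpt above, and the cgroup mount point is an assumption (older distros may mount it under /cgroup instead):

cat /sys/fs/cgroup/memory/slurm/uid_500/job_1503/memory.failcnt        # > 0: the RAM limit was hit, pages went to swap
cat /sys/fs/cgroup/memory/slurm/uid_500/job_1503/memory.memsw.failcnt  # > 0: the RAM+swap limit was hit, something was killed
cat /sys/fs/cgroup/memory/slurm/uid_500/job_1503/memory.max_usage_in_bytes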
[slurm-dev] Re: SLRUM as a load balancer for interactive use
This isn't exactly what you're looking for but I'll chime in anyway with how we do things. We decided to buy a few slightly beefier interactive nodes and set up cgroups, /tmp and /dev/shm namespaces (/tmp and /dev/shm are per-user), cputime limits, /tmp quotas, etc. to sanely oversubscribe resources. This ended up being cheaper than other options and it has worked really well. We currently use LVS to load balance between interactive nodes but may switch to something else at some point. We allow users to edit files, compile code, transfer files around, etc. and also test their code for a little while. Anything beyond that requires submitting a job. We limit users to 1/4 of the RAM on the node and only 60 CPU-minutes per process via ulimit. The cpu cgroup (not cpuset) is used to set a soft limit of just a few cores for each user but allows them to burst to 100% of the cores when there is no contention on the node. It would take at least four users all using the maximum amount of RAM plus some extra use before the node crashes. The memory per node ratio and other settings could easily be changed if necessary. In practice, these settings have made it so that no user has crashed an interactive node since everything was deployed. Obviously I didn't answer the original question about SLURM but this is an alternative approach that has worked well for us. If you're interested in the code we used to set everything up, it is available at: https://github.com/BYUHPC/uft Ryan On 03/24/2014 01:32 PM, Olli-Pekka Lehto wrote: I can foresee the screen issue as well. One could fairly simply add a check when the user logs in to see if the user has a node assigned to them already and force the session to use that node. It could perhaps even prompt if they want to access this session or get a new one. The immediate issue we encountered when testing with screen, however, is the fact that when you detach and exit the interactive session SLURM faithfully cleans up all the processes. In most cases this would be preferred but in this case I want the screen session (and the associated interactive job) to persist. Any ideas how to do this? In our case there is no real time limit on the current interactive use nodes so setting the runtime as unlimited is probably the way to go, at least initially. Of course one needs to have a sufficiently large oversubscription factor of slots in this case. Olli-Pekka On Mar 24, 2014, at 3:48 PM, Schmidtmann, Carl carl.schmidtm...@rochester.edu wrote: We considered this option as well but the problem we saw with it is what happens when a user tries to use screen? Many of our users log in, start screen, do some work and then disconnect. Whenever they reconnect they can pick up from where they left off. If you are allocated to a compute node based on loads, you likely won't be on the same node where your last session was. This is inconvenient for the users but then also leaves screen sessions open, at least until the time limit expires, on compute nodes. The other issue is the time limit. Do you make it 1 hour, 4 hours, 8 hours? How long does a user get to be logged in? If the time limit expires, what happens to the open editor session? Can this be recovered on a different compute node? We are still looking for a good way to balance users on login nodes. Right now we are working on a method of redirecting ssh logins based on user IDs, which feels extremely hacky as well.
Carl -- Carl Schmidtmann Center for Integrated Research Computing University of Rochester On Mar 24, 2014, at 5:44 AM, Olli-Pekka Lehto wrote: Dear devs, We are testing a concept where we are dynamically allocating a portion of our compute nodes as oversubscribed interactive nodes for low-intensity use. To make the use as simple as possible, we are testing redirecting user login sessions directly to these nodes via SLURM. Basically the shell initialization on the actual login node contains a SLURM srun command to spawn an interactive session and the user gets transparently dropped into a shell session on a compute node. This would offer more flexibility than physically setting up a set of login nodes. Furthermore, SLURM should be able to make better decisions on where to assign each incoming session based on resource usage than a more naive round-robin load balancer. This way all interactive use can also be tracked with SLURM's accounting. Based on simple initial testing this seems to work but it's still a bit hacky. My question is: has anyone been doing similar things, and what are your experiences? Are there some caveats that we should be aware of? Best regards, Olli-Pekka -- Olli-Pekka Lehto Development Manager Computing Platforms CSC - IT Center for Science Ltd. E-Mail: olli-pekka.le...@csc.fi Tel: +358 50 381 8604 skype: oplehto // twitter: ople -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
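For anyone curious what the per-user limits Ryan describes might look like mechanically, here is a rough sketch; the paths, values, and cgroup layout are assumptions (not BYU's actual setup), the cgroup writes need root, and the numbers assume a 32 GB node:

ulimit -t 3600                                         # 60 CPU-minutes per process, e.g. from /etc/profile.d
cg_mem=/sys/fs/cgroup/memory/interactive/uid_$(id -u)
mkdir -p "$cg_mem"
echo $((8 * 1024 * 1024 * 1024)) > "$cg_mem/memory.limit_in_bytes"   # 1/4 of a 32 GB node
echo $$ > "$cg_mem/tasks"
cg_cpu=/sys/fs/cgroup/cpu/interactive/uid_$(id -u)
mkdir -p "$cg_cpu"
echo 512 > "$cg_cpu/cpu.shares"                        # soft limit: can still burst to all cores when the node is idle
echo $$ > "$cg_cpu/tasks"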
[slurm-dev] Re: Job being canceled due to time limits
-t and --time are synonymous. You're using both. Ryan On 09/05/2013 12:38 PM, Matthew Russell wrote: Job being canceled due to time limits Hi, I can't figure out why my job is being canceled due to time limits. My queue has an infinite time limit, and my batch file requests several hours, yet the job is always canceled within a few minutes.
gm1@dena:GEM-MACH_1.5.1_dev$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq        up   infinite     4  idle dena[1-4]
*headnode   up   infinite     1  idle dena*
matt        up   infinite     2  idle dena[1-2]
My batch file:
#!/home/gm1/ECssm/multi/bin/s.sge_dummy_shell
#SBATCH -D /home/gm1
#SBATCH --export=NONE
#SBATCH -o /home/gm1/listings/dena/gm338_21388_M.21737.out.o
#SBATCH -e /home/gm1/listings/dena/gm338_21388_M.21737.out.e
#SBATCH -J gm338_21388_M.30296
*#SBATCH --time=38380*
#SBATCH --partition=headnode
#SBATCH
#SBATCH -c 1
#SBATCH -t 4
#SBATCH
#
The error log:
gm1@dena:GEM-MACH_1.5.1_dev$ cat ~/listings/dena/gm338_21388_M.21737.out.e
slurmd[dena]: *** JOB 1683 CANCELLED AT 2013-09-05T14:24:27 DUE TO TIME LIMIT ***
Is there somewhere else where a time limit can be imposed? The time limit is being imposed about 5 minutes into the job. Thanks -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
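In other words, the script sets the limit twice, first --time=38380 and then -t 4, and the 4-minute value is evidently what took effect, which matches the job dying a few minutes in. Also note that Slurm reads a bare number as minutes, so if 38380 was meant as seconds (about 10.7 hours), spell it out. A cleaned-up header would keep a single directive, e.g.:

#SBATCH -J gm338_21388_M.30296
#SBATCH --partition=headnode
#SBATCH -c 1
#SBATCH --time=10:39:40      # hours:minutes:seconds; drop the separate "-t 4" line entirely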
[slurm-dev] Re: job steps not properly identified for jobs using step_batch cgroups
Moe, In what way is it experimental? Is it possibly unstable or just not feature-complete? We're writing a script to independently gather statistics for our own database and would like to use the cpuacct cgroup, thus the interest in the jobacct_gather/cgroup plugin. Ryan On 08/09/2013 10:07 AM, Moe Jette wrote: I misspoke. The JobAcctGatherType=jobacct_gather/cgroup plugin is experimental and not ready for use. Your configuration should work. Quoting Moe Jette je...@schedmd.com: Your explanation seems likely. You probably want to change your configuration to: JobAcctGatherType=jobacct_gather/cgroup Quoting Andy Wettstein wettst...@uchicago.edu: I understand this problem more fully now. Certains jobs that our users run fork processes in a way that the parent PID gets set to 1. The _get_offspring_data function in jobacct_gather/linux ignores these when adding up memory usage. It seems like if proctrack/cgroup is enabled, the jobacct_gather/linux plugin should rely on the cgroup.procs file to identify the pids instead of trying to figure things out based on parent PID. Is something like that reasonable? Andy On Tue, Jul 30, 2013 at 10:59:56AM -0700, Andy Wettstein wrote: Hi, I have the following set: ProctrackType = proctrack/cgroup TaskPlugin = task/cgroup JobAcctGatherType = jobacct_gather/linux This is on slurm 2.5.7. When I use sstat on all running jobs, there are a large number of jobs that say they have no steps running (for example: sstat: error: couldn't get steps for job 4783548). This seems to be the case for all steps that use the step_batch cgroup. If the step gets created in something like step_0, everything seems to be reported ok. In both instances, the PIDs are actually listed in the right cgroup.procs file. I noticed this because there were several jobs that should have been killed due to memory limits, but were not. The jobacct_gather plugin doesn't know about the processes in the step_batch cgroup so it doesn't count the memory usage. Andy -- andy wettstein hpc system administrator research computing center university of chicago 773.702.1104 -- andy wettstein hpc system administrator research computing center university of chicago 773.702.1104 -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
[slurm-dev] Re: cgroups usage
We made the mistake of setting TaskAffinity=yes, though I'm not sure why we did that. There seems to be a bug where the first node has cgroup/cpuset and task affinity set correctly, but subsequent nodes set task affinity for *all* tasks to be CPU 0. We hadn't gotten around to reporting it yet but it's worth checking out. Ryan On 08/05/2013 05:52 PM, Kevin Abbey wrote: Hi, I started using cgroups to control memory usage last week. One user reported his application takes 4 times longer to complete. I read elsewhere that cgroup memory control can reduce performance. Is this amount realistic? Is there a more efficient method to control memory usage on nodes which are shared? Thank you for any advice, Kevin -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
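For reference, a minimal configuration sketch for memory containment that leaves TaskAffinity off and so avoids the pinning behavior described above; the values are illustrative, not a recommendation:

### cgroup.conf
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRAMSpace=100
AllowedSwapSpace=0
TaskAffinity=no
### slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup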
[slurm-dev] Re: Job submit plugin to improve backfill
An alternative that we do is choose very low defaults for people:
PartitionName=Default DefaultTime=30:00 #plus other options
DefMemPerCPU=512
The disadvantage to this approach is that it doesn't give an obvious error message at submit time. However, it's not hard to figure out what happened when they hit the time limit or the error output says they went over their memory limit. Ryan On 06/28/2013 08:29 AM, Daniel M. Weeks wrote: At CCNI, we use backfill scheduling on all our systems. However, we have found that users typically do not specify a time limit for their job so the scheduler assumes the maximum from QoS/user limits/partition limits/etc. This really hurts backfilling since the scheduler remains ignorant of short jobs. Attached is a small patch I wrote containing a job submit plugin and a new error message. The plugin rejects a job submission when it is missing a time limit and will provide the user with a clear and distinct error. I've just re-tested and the patch applies and builds cleanly on the slurm-2.5, slurm-2.6, and master branches. Please let me know if you find this useful, run across problems, or have suggestions/improvements. Thanks. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
[slurm-dev] Re: Job Groups
Paul, We were discussing this yesterday due to a user not limiting the amount of jobs hammering our storage. A QOS with a GrpJobs limit sounds like the best approach for both us and you. Ryan On 06/19/2013 09:36 AM, Paul Edmon wrote: I have a group here that wants to submit a ton of jobs to the queue, but want to restrict how many they have running at any given time so that they don't torch their fileserver. They were using bgmod -L in LSF to do this, but they were wondering if there was a similar way in SLURM to do so. I know you can do this via the accounting interface but it would be good if I didn't have to apply it as a blanket to all their jobs and if they could manage it themselves. If nothing exists in SLURM to do this that's fine. One can always engineer around it. I figured I would ping the dev list first before putting a nail in it. From my look at the documentation I don't see anyway to do this other than what I stated above. -Paul Edmon- -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
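A sketch of that approach with sacctmgr; the QOS and user names are made up, and enforcement also requires AccountingStorageEnforce to include limits and qos in slurm.conf:

sacctmgr add qos iolimited
sacctmgr modify qos iolimited set GrpJobs=50
sacctmgr modify user where name=alice set qos+=iolimited defaultqos=iolimited
# jobs can also opt in explicitly:
sbatch --qos=iolimited job.sh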
[slurm-dev] Re: Job Groups
Not that I'm aware of. I don't know of a way to give users control over a QOS like you can do with account coordinators for accounts. Ryan On 06/19/2013 10:55 AM, Paul Edmon wrote: Thanks for the input. Can GrpJobs be modified from the user side? -Paul Edmon- On 06/19/2013 12:15 PM, Ryan Cox wrote: Paul, We were discussing this yesterday due to a user not limiting the amount of jobs hammering our storage. A QOS with a GrpJobs limit sounds like the best approach for both us and you. Ryan On 06/19/2013 09:36 AM, Paul Edmon wrote: I have a group here that wants to submit a ton of jobs to the queue, but want to restrict how many they have running at any given time so that they don't torch their fileserver. They were using bgmod -L in LSF to do this, but they were wondering if there was a similar way in SLURM to do so. I know you can do this via the accounting interface but it would be good if I didn't have to apply it as a blanket to all their jobs and if they could manage it themselves. If nothing exists in SLURM to do this that's fine. One can always engineer around it. I figured I would ping the dev list first before putting a nail in it. From my look at the documentation I don't see anyway to do this other than what I stated above. -Paul Edmon- -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
[slurm-dev] Re: untracked processes
This may not be exactly what you're looking for but it could be a start. We're looking at modifying ssh_config and sshd_config to propagate SLURM_JOB_ID for jobs that use ssh to spawn processes (credit to our sysadmin Lloyd Brown for that one). Then we will use something like a script in /etc/profile.d to add the process to the correct cgroup if it's launched via ssh and has $SLURM_JOB_ID set. We're not using cgroups yet (still have some CentOS 5) so I don't have exact implementation details at this point. Then the cgroups should work for resource control and, I assume, accounting if using the correct plugin. This may not catch 100% of everything, but we would probably have something look for all user processes that are not part of a cgroup and add them to the user cgroup. I don't think accounting could work in that case, but that would help catch and control rogue processes that aren't accounted for under SLURM. Epilog or a cron could clean up all of a user's processes after they don't have jobs on the node anymore. I don't know if SLURM has something like Torque's tm_adopt, but that could work in lieu of cgroups for accounting if you don't happen to use cgroups. tm_adopt allowed you to add a random process to be accounted for under Torque, even if it wasn't launched under Torque. We used to have a wrapper script for ssh that did just that when we used Torque and Moab. Ryan P.S. We've only been using SLURM for a few weeks so you might want to double-check the accuracy and viability of my statements :) On 02/21/2013 12:57 PM, Moe Jette wrote: Slurm only tracks the processes that its daemons launch (most MPI implementations can launch their tasks using slurm). Anything launched outside of Slurm can be killed as part of a job prolog, but accounting and job step management are outside of Slurm's control. Quoting Michael Colonno mcolo...@stanford.edu: SLURM gurus ~ I'm trying to configure a commercial MPI code to run through SLURM. I can launch this code through either srun or sbatch without any issues (the good) but the processes manage to run completely disconnected from SLURM's notice (the bad). i.e. the job is running just fine but SLURM thinks it's completed and hence does not report anything running. I'm guessing this is due to the fact that this tool runs a pre-processing-type executable and then launches sub-processes to solve (MPI on a local system) without connecting the process IDs(?) In any event, I'm guessing I'm not the first person to run into this. Is there a recommended solution to configure SLURM to track codes like this? Thanks, ~Mike C. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
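A rough sketch of the ssh propagation idea; everything here is an assumption about how one might wire it up rather than BYU's actual implementation, the cgroup mount point varies by distro, and writing to the tasks file needs root or a delegated helper, which is glossed over:

# client side, /etc/ssh/ssh_config:      SendEnv SLURM_JOB_ID
# compute nodes, /etc/ssh/sshd_config:   AcceptEnv SLURM_JOB_ID
# then something like /etc/profile.d/slurm_adopt.sh on the compute nodes:
if [ -n "$SLURM_JOB_ID" ]; then
    cg="/sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}"
    # add this shell (and its children) to the job's existing cgroup, if there is one
    [ -d "$cg" ] && echo $$ > "$cg/tasks" 2>/dev/null
fi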