Re: [slurm-users] How to increase the cpu and gpu usage in our Slurm cluster?

2020-05-08 Thread feng lu
Hi All, our Slurm CPU and GPU clusters have 22 and 150 machines respectively, with 1,400 cores (hyper-threading enabled) and 900 GPU cards. The CPU allocation is up to 90%, but actual usage is only about 25%. How can we increase the CPU usage in our cluster? I can think of the following methods: 1. Virtualize our cpu core using

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Ole Holm Nielsen
Hi Michael, You can inquire the database for a job summary of a particular user and time period using the slurmacct command: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct You can also call "sacct --user=USER" directly like in slurmacct: # Request job data export
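For context, a hedged sketch of the kind of sacct call the slurmacct approach builds on; the user name, time window, and field list below are illustrative, not taken from the script:
  # Summarize one user's jobs over a period; USER and the dates are placeholders
  sacct --user=USER --starttime=2020-04-01 --endtime=2020-05-01 \
        --format=JobID,JobName,State,Elapsed,AllocCPUS,CPUTimeRAW --parsable2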

Re: [slurm-users] [External] Defining a default --nodes=1

2020-05-08 Thread Fulcomer, Samuel
"-N 1" restricts a job to a single node. We've continued to have issues with this. Historically we've had a single partition with multiple generations of nodes segregated for multinode scheduling via topology.conf. "Use -N 1" (unless you really know what you're doing) only goes so far. There are

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Ole Holm Nielsen
Hi Michael, Yes, my Slurm tools use and trust the output of Slurm commands such as sacct, and any discrepancy would have to come from the Slurm database. Which version of Slurm are you running on the database server and the node where you run sacct? Did you add up the GrpTRESRunMins values

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Ole Holm Nielsen
Hi Michael, Maybe you will find a couple of my Slurm tools useful for displaying data from the Slurm database in a more user-friendly format: showjob: Show status of Slurm job(s). Both queue information and accounting information are printed. showuserlimits: Print Slurm resource user limits

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
Working on something like that now. From an SQL export, I see 16 jobs from my user that have a state of 7. Both states 3 and 7 show up as COMPLETED in sacct, and may also have some duplicate job entries found via sacct --duplicates.
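For reference, a hedged example of pulling all record versions for one user with the duplicates flag (the user name and start date are placeholders):
  # -D/--duplicates lists every version of a job id, not just the most recent one
  sacct --duplicates --user=USER --starttime=2020-01-01 \
        --format=JobID,State,Submit,Start,End,AllocCPUS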

Re: [slurm-users] [External] Defining a default --nodes=1

2020-05-08 Thread Michael Robbert
Manuel, you may want to instruct your users to use '-c' or '--cpus-per-task' to define the number of CPUs that they need. Please correct me if I'm wrong, but I believe that will restrict the jobs to a single node, whereas '-n' or '--ntasks' is really for multi-process jobs which can be spread
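A short sketch of the distinction (values and script names are illustrative): --cpus-per-task requests CPUs for a single task, which must sit on one node, while --ntasks counts tasks that the scheduler may spread across nodes.
  # Multi-threaded (e.g. OpenMP) job: one task, several CPUs, necessarily one node
  sbatch --ntasks=1 --cpus-per-task=16 threaded_job.sh
  # MPI-style job: 16 tasks, which may land on more than one node
  sbatch --ntasks=16 mpi_job.sh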

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
Slurm 19.05.3 (packaged by Bright). For the three running jobs, the total GrpTRESRunMins requested is 564480 CPU-minutes as shown by 'showjob', and their remaining usage that the limit would check against is less than that. My download of your scripts dates from August 21, 2019, and I've just now

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
Thanks, Ole. Your showuserlimits script is actually where I got started today, and where I found the sacct command I sent earlier. Your script gives the same output for that user: the only line that's not a "Limit = None" is for the user's GrpTRESRunMins value, which is at "Limit = 144,

[slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Renfro, Michael
Hey, folks. I've had a 1000 CPU-day (1,440,000 CPU-minutes) GrpTRESMins limit applied to each user for years. It generally works as intended, but I've noticed one user whose reported usage is highly inflated relative to reality, causing the GrpTRESMins limit to be enforced much earlier than necessary:
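For readers following along, the thread mentions both GrpTRESMins and GrpTRESRunMins; a hedged sketch of setting and inspecting the latter with sacctmgr (the user name is a placeholder, and 1,440,000 cpu-minutes is simply 1000 CPU-days):
  # Cap the running-usage reservation of one user's association
  sacctmgr modify user where name=USER set GrpTRESRunMins=cpu=1440000
  # Inspect the limit afterwards
  sacctmgr show assoc where user=USER format=User,Account,GrpTRESRunMins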

Re: [slurm-users] Defining a default --nodes=1

2020-05-08 Thread Renfro, Michael
There are MinNodes and MaxNodes settings that can be defined for each partition listed in slurm.conf [1]. Set both to 1 and you should end up with the non-MPI partitions you want. [1] https://slurm.schedmd.com/slurm.conf.html
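A minimal slurm.conf sketch of that suggestion (partition and node names are placeholders):
  # Non-MPI partition: every job is confined to exactly one node
  PartitionName=smp Nodes=node[001-022] MinNodes=1 MaxNodes=1 Default=YES State=UP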

[slurm-users] Defining a default --nodes=1

2020-05-08 Thread Holtgrewe, Manuel
Dear all, we're running a cluster where the large majority of jobs will use multi-threading and no message passing. Sometimes CPU>1 jobs are scheduled to run on more than one node (which would be fine for MPI jobs of course...) Is it possible to automatically set "--nodes=1" for all jobs
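Pending the replies elsewhere in the thread, a minimal batch-script sketch of what pinning a job to one node looks like from the user side (resource counts and program name are made up):
  #!/bin/bash
  #SBATCH --nodes=1          # same as -N 1: keep the whole allocation on one node
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8  # illustrative thread count for a multi-threaded job
  srun ./my_threaded_app     # placeholder executable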

[slurm-users] How to display the effective QOS of a job?

2020-05-08 Thread Holtgrewe, Manuel
Hi, is it possible to display the effective QOS of a job? I need to investigate some unexpected behaviour of the scheduler (at least unexpected to me at the moment). I want to limit the maximum number of CPUs per user in each partition. It is my understanding from the documentation that partition
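For reference, two hedged ways to read a job's QOS from the command line (the job id is a placeholder):
  # Full job record, including the QOS= field
  scontrol show job 12345
  # Just the relevant columns from the queue
  squeue --job=12345 -O jobid,partition,qos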

Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

2020-05-08 Thread Killian Murphy
Hi Thomas. The output you provided from sacctmgr doesn't look quite right to me. There is a field count mismatch between the header line and the rows, and I'm not seeing some fields that I would expect to see, particularly MaxTRESPU (MaxTRESPerUser) - I don't think this is a Slurm version
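For comparison, a hedged example of a sacctmgr query where MaxTRESPU would normally show up (the field list is illustrative):
  # MaxTRESPU (MaxTRESPerUser) is a QOS-level limit
  sacctmgr show qos format=Name,Priority,MaxTRESPU,MaxWall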