Hi All,
Our Slurm CPU and GPU clusters have 22 and 150 machines respectively, with
1400 cores (hyper-threading enabled) and 900 GPU cards. The CPU allocation is
now up to 90%, but the actual usage is only about 25%. How can we increase the
CPU usage in our cluster? I can think of the following methods:
1. Virtualize our CPU cores using
Hi Michael,
You can query the database for a job summary of a particular user and
time period using the slurmacct command:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct
You can also call "sacct --user=USER" directly like in slurmacct:
# Request job data
export
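Below is a rough sketch of that kind of query (not the exact code from
slurmacct; the time range and format fields are placeholders I chose):

    # Summarize one user's jobs for a period; adjust the fields as needed
    sacct --user=USER \
          --starttime=2020-05-01 --endtime=2020-05-08 \
          --format=JobID,Partition,AllocCPUS,Elapsed,State \
          --parsable2 --noheader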
"-N 1" restricts a job to a single node.
We've continued to have issues with this. Historically we've had a single
partition with multiple generations of nodes segregated for
multinode scheduling via topology.conf. "Use -N 1" (unless you really know
what you're doing) only goes so far.
There are
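For reference, a minimal sketch of the single-node restriction in a batch
script (the job step itself is a placeholder):

    #!/bin/bash
    #SBATCH -N 1        # run on exactly one node
    #SBATCH -n 4        # four tasks, all placed on that node
    srun ./my_program   # placeholder executable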
Hi Michael,
Yes, my Slurm tools use and trust the output of Slurm commands such as
sacct, and any discrepancy would have to come from the Slurm database.
Which version of Slurm are you running on the database server and the
node where you run sacct?
Did you add up the GrpTRESRunMins values
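One way to pull those numbers for comparison (a sketch; USER is a
placeholder) is directly from the association and queue data:

    # The configured per-user limit
    sacctmgr show assoc where user=USER format=User,Account,GrpTRESRunMins
    # Running jobs with allocated CPUs and time remaining, from which the
    # outstanding CPU-minutes can be added up
    squeue --user=USER --states=RUNNING -o "%i %C %L"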
Hi Michael,
Maybe you will find a couple of my Slurm tools useful for displaying
data from the Slurm database in a more user-friendly format:
showjob: Show status of Slurm job(s). Both queue information and
accounting information is printed.
showuserlimits: Print Slurm resource user limits
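Typical invocations look something like this (check each script's -h output;
the job ID and username are placeholders):

    showjob 123456          # queue and accounting info for one job
    showuserlimits -u USER  # limits relevant to one user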
Working on something like that now. From an SQL export, I see 16 jobs from my
user that have a state of 7. Both states 3 and 7 show up as COMPLETED in sacct,
and some of those jobs may also have duplicate entries, found via sacct --duplicates.
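A query along these lines reproduces that view (a sketch only: in the
slurm_acct_db schema the job table is named <cluster>_job_table, and the
numeric user ID here is a placeholder):

    # Count jobs per raw state value for one user; states 3 and 7 both
    # render as COMPLETED in sacct output
    mysql slurm_acct_db -e \
      "SELECT state, COUNT(*) FROM mycluster_job_table
       WHERE id_user = 1001 GROUP BY state;"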
> On May 8, 2020, at 11:34 AM, Ole Holm Nielsen wrote:
Manuel,
You may want to instruct your users to use '-c' or '--cpus-per-task' to define
the number of CPUs that they need. Please correct me if I'm wrong, but I
believe that will restrict the job to a single node, whereas '-n' or '--ntasks'
is really for multi-process jobs, which can be spread across multiple nodes.
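To illustrate the difference (a sketch; the script names are placeholders):

    # Multi-threaded job: one task with 8 CPUs, necessarily on one node
    sbatch --ntasks=1 --cpus-per-task=8 threaded_job.sh
    # MPI-style job: 8 tasks that Slurm is free to spread across nodes
    sbatch --ntasks=8 mpi_job.sh

Since a single task can never straddle nodes, --cpus-per-task keeps the
allocation on one node as long as there is only one task.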
Slurm 19.05.3 (packaged by Bright). For the three running jobs, the total
GrpTRESRunMins requested is 564480 CPU-minutes as shown by 'showjob', and their
remaining usage that the limit would check against is less than that.
My download of your scripts dates to August 21, 2019, and I've just now
Thanks, Ole. Your showuserlimits script is actually where I got started today,
and where I found the sacct command I sent earlier.
Your script gives the same output for that user: the only line that's not a
"Limit = None" is for the user's GrpTRESRunMins value, which is at "Limit =
1,440,000
Hey, folks. I've had a 1000 CPU-day (1,440,000 CPU-minute) GrpTRESRunMins limit
applied to each user for years. It generally works as intended, but I've
noticed one user whose usage is highly inflated relative to reality, causing
the GrpTRESRunMins limit to be enforced much earlier than necessary:
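For context, a limit of that shape would be applied roughly like this (a
sketch; 1000 CPU-days = 1,440,000 CPU-minutes, and USER is a placeholder):

    # Cap the CPU-minutes that a user's running jobs may have outstanding
    sacctmgr modify user where name=USER set GrpTRESRunMins=cpu=1440000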
There are MinNodes and MaxNodes settings that can be defined for each partition
listed in slurm.conf [1]. Set both to 1 and you should end up with the non-MPI
partitions you want.
[1] https://slurm.schedmd.com/slurm.conf.html
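Something like this in slurm.conf (the partition and node names are
placeholders):

    # Jobs in this partition are confined to exactly one node
    PartitionName=smp Nodes=node[001-099] MinNodes=1 MaxNodes=1 State=UP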
Dear all,
we're running a cluster where the large majority of jobs will use
multi-threading and no message passing. Sometimes CPU>1 jobs are scheduled to
run on more than one node (which would be fine for MPI jobs of course...)
Is it possible to automatically set "--nodes=1" for all jobs
Hi,
is it possible to display the effective QOS of a job?
I need to investigate some unexpected behaviour of the scheduler (at least
unexpected to me at the moment). I want to limit the maximum number of CPUs per
user in each partition. It is my understanding from the documentation that
partition
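Two commands that may help here (a sketch; the job ID and the QOS, partition,
and CPU-count values are placeholders):

    # Display the QOS a job is actually running under
    scontrol show job 123456 | grep -i qos
    # A per-partition CPU cap is typically built as a partition QOS
    sacctmgr add qos cpu_cap
    sacctmgr modify qos cpu_cap set MaxTRESPerUser=cpu=128
    # then attach it in slurm.conf:  PartitionName=batch ... QOS=cpu_cap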
Hi Thomas.
The output you provided from sacctmgr doesn't look quite right to me. There
is a field count mismatch between the header line and the rows, and I'm not
seeing some fields that I would expect to see, particularly MaxTRESPU
(MaxTRESPerUser) - I don't think this is a Slurm version
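For comparison, requesting an explicit field list avoids any header/row
ambiguity (field names as in the sacctmgr man page):

    # Print QOS limits with a fixed set of columns, one record per line
    sacctmgr -P show qos format=Name,MaxTRESPU,GrpTRES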