Hi Edward,
In addition to my Slurm Wiki page https://wiki.fysik.dtu.dk/niflheim/SLURM,
I have written a number of tools that we use for monitoring our cluster;
see https://github.com/OleHolmNielsen/Slurm_tools. I particularly
recommend these tools:
* pestat: Prints the status of the Slurm cluster nodes, one line per
node, including job info.
* showuserjobs: Prints the current node status and batch job status,
broken down by user ID.
Use the option "-p <partition>" to display partition data.
I also recommend this nice tool for displaying partition statistics:
* spart: A user-oriented partition info command for Slurm.
https://github.com/mercanca/spart
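For the two specific questions in your mail (CPUs in use in a partition,
and why jobs are waiting), the standard commands can also be queried
directly. What follows is only a minimal Python sketch, not a tested
tool. It assumes that "sinfo -h -p <partition> -o %C" prints a single
line of the form allocated/idle/other/total with plain integer fields,
and that "squeue -h -t PENDING -o %r" prints one reason per pending job.
The function names are hypothetical and chosen for illustration.

```python
import subprocess
from collections import Counter


def parse_cpu_states(field):
    """Parse sinfo's %C field, assumed to be allocated/idle/other/total."""
    allocated, idle, other, total = (int(x) for x in field.strip().split("/"))
    return {"allocated": allocated, "idle": idle, "other": other, "total": total}


def count_pending_reasons(squeue_output):
    """Tally pending jobs per reason, given one reason per line."""
    lines = (line.strip() for line in squeue_output.splitlines())
    return Counter(line for line in lines if line)


def run(cmd):
    """Run a command and return its standard output as text."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def partition_cpu_usage(partition):
    # -h suppresses the header; %C gives CPU counts by state.
    # Assumption: the first output line summarizes the whole partition.
    out = run(["sinfo", "-h", "-p", partition, "-o", "%C"])
    return parse_cpu_states(out.splitlines()[0])


def pending_reasons(partition=None):
    # %r is the reason a job is in its current state.
    cmd = ["squeue", "-h", "-t", "PENDING", "-o", "%r"]
    if partition:
        cmd += ["-p", partition]
    return count_pending_reasons(run(cmd))
```

With this, partition_cpu_usage("foobar")["allocated"] would give the
number of CPUs currently in use in the foobar partition, and
pending_reasons("foobar") a tally of waiting jobs by reason. The parsing
is kept separate from the command calls so it can be checked without a
running cluster.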
/Ole
On 7/8/19 9:33 PM, Edward Ned Harvey (slurm) wrote:
I am an experienced sysadmin but new to being a Slurm admin, and I'm
encountering some difficulty:
For simple questions such as "how many CPUs are currently being used in
the foobar partition?" or "give me an overview of the waiting jobs and
the reasons they're waiting," I don't yet have any good, easy way to get
an answer. I can get the total number of CPUs in a partition via
"scontrol show partition foobar", I can get how many CPUs are being used
on a particular node via "scontrol show node somenode", and I can get a
(not easily parsable) list of nodes within a partition via "sinfo". So
all the information is available, but it is very difficult to access
because it would require some very nontrivial parsing.
I see projects like https://github.com/fasrc/slurm_showq and
https://github.com/fasrc/scalc, which seem to have been created exactly
for this purpose: making information in Slurm more easily accessible.
So, is there a better way to manage a Slurm cluster? Are there better
tools, or better ways to use them? Any other suggestions for me from
experienced Slurm admins? Like a cheatsheet of common commands, or
scripts like slurm_showq and scalc? Or is this just the normal state of
the world?