Hi Edward,

Besides my Slurm Wiki page https://wiki.fysik.dtu.dk/niflheim/SLURM, I have written a number of tools which we use for monitoring our cluster, see https://github.com/OleHolmNielsen/Slurm_tools. I recommend in particular these tools:

* pestat: Prints the status of the Slurm cluster nodes, one line per node, including job info.

* showuserjobs: Prints the current node status and batch job status, broken down by userid.

Use the option "-p <partition>" to display partition data.

I also recommend this nice tool for displaying partition statistics:

* spart: A user-oriented partition info command for Slurm. https://github.com/mercanca/spart
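
If you would rather script the two questions from your mail directly, here is a minimal Python sketch, not taken from any of the tools above. It assumes that sinfo's "%C" format field prints CPU counts as plain integers in the form allocated/idle/other/total, and that squeue's "%r" field prints the reason a job is pending. The function names are my own, purely for illustration, and "foobar" is the partition name from your example.

```python
import subprocess
from collections import Counter


def parse_cpu_states(text):
    """Sum sinfo '%C' lines ('allocated/idle/other/total') into one dict."""
    keys = ("allocated", "idle", "other", "total")
    totals = dict.fromkeys(keys, 0)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: every field is a plain integer (no abbreviated counts).
        for key, value in zip(keys, line.split("/")):
            totals[key] += int(value)
    return totals


def count_pending_reasons(text):
    """Count how often each reason occurs in squeue '%r' output."""
    return Counter(line.strip() for line in text.splitlines() if line.strip())


def run(cmd):
    """Run a command and return its stdout; raises if the command fails."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def partition_cpus(partition):
    # -h suppresses the header line, -p selects the partition.
    return parse_cpu_states(run(["sinfo", "-h", "-p", partition, "-o", "%C"]))


def pending_reasons(partition):
    # -t PENDING restricts the listing to waiting jobs.
    return count_pending_reasons(
        run(["squeue", "-h", "-p", partition, "-t", "PENDING", "-o", "%r"])
    )
```

On a machine with the Slurm client commands installed, partition_cpus("foobar") and pending_reasons("foobar") should then give the CPU usage and a tally of waiting reasons for that partition. The parsing is kept separate from the command calls so it can be checked without a cluster.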

/Ole

On 7/8/19 9:33 PM, Edward Ned Harvey (slurm) wrote:
I am an experienced sysadmin, new to being a slurm admin, and I'm encountering some difficulty:

If you have a simple question such as "how many CPUs are currently being used in the foobar partition," or "give me an overview of the waiting jobs and the reasons they're waiting," I don't yet have any good, easy way to answer it. I can get the total number of CPUs in a partition via "scontrol show partition foobar", I can get how many CPUs are being used on a particular node via "scontrol show node somenode", and I can get a (not easily parsable) list of nodes within a partition via "sinfo". So all the information is available, but it is very difficult to access because it would require some very nontrivial parsing.

I see projects like this: https://github.com/fasrc/slurm_showq and https://github.com/fasrc/scalc which seem to be created exactly for this purpose. They're trying to make information in slurm more easily accessible.

So, is there a better way to manage a slurm cluster, are there better tools, or better ways to use them? Any other suggestions for me from experienced slurm admins? Like, a cheatsheet of common commands or scripts like slurm_showq and scalc? Or is this just the normal state of the world?
