Hi Loris,
Thanks so much for your relevant comments!
On 07/21/2017 12:00 PM, Loris Bennett wrote:
Hi Ole,
Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:
As a small contribution to the Slurm community, I've moved my collection of
Slurm tools to GitHub at https://github.com/OleHolmNielsen/Slurm_tools. These
are tools which I feel make daily cluster monitoring and management a
little easier.
The following Slurm tools are available:
* pestat Prints Slurm cluster node status, one line per node, with job info.
* slurmreportmonth Generate monthly accounting statistics from Slurm using the
sreport command.
* showuserjobs Print the current node status and batch job status broken down
by userid.
* slurmibtopology Infiniband topology tool for Slurm.
* Slurm triggers scripts.
* Scripts for managing nodes.
* Scripts for managing jobs.
The tools "pestat" and "slurmibtopology" have previously been announced to this
list, but future updates will be on GitHub only.
I would also like to mention our Slurm deployment HowTo guide at
https://wiki.fysik.dtu.dk/niflheim/SLURM
/Ole
Thanks for sharing your tools. Here are some brief comments:
- psjob/psnode
- The USERLIST variable makes the commands a bit brittle, since ps
will fail if you pass an unknown username.
Good point!
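One possible guard, sketched here in Python purely for illustration: drop any name that does not resolve to an account before it reaches ps. The helper names known_users and ps_user_argument are hypothetical and not part of the tools; the only assumption about the system is that pwd.getpwnam raises KeyError for an unknown account.

```python
import pwd


def known_users(names, lookup=pwd.getpwnam):
    """Return only the names that resolve to an account, preserving order."""
    found = []
    for name in names:
        try:
            lookup(name)
        except KeyError:
            continue  # unknown account: skip it rather than let ps fail
        found.append(name)
    return found


def ps_user_argument(names, lookup=pwd.getpwnam):
    """Build a comma-separated user list for ps, or None if nothing is left."""
    users = known_users(names, lookup)
    return ",".join(users) if users else None
```

A caller would skip the user option entirely when ps_user_argument returns None. The lookup parameter exists only so the filter can be exercised without depending on the accounts present on a given machine.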
- showuserjobs
- Doesn't handle usernames longer than 8 characters (we have longer names)
Good point!
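A minimal sketch of one way around the fixed width, again in Python for illustration only: size the username column from the longest name actually present, with 8 as the floor. The function names and the three-column row layout are assumptions made for the example, not the real showuserjobs output format.

```python
def username_width(usernames, minimum=8):
    """Width of the username column: the longest name, but at least `minimum`."""
    return max([minimum] + [len(name) for name in usernames])


def format_rows(rows):
    """Format (username, jobs, cpus) tuples as aligned text lines with a header."""
    width = username_width([row[0] for row in rows])
    lines = [f"{'Username':<{width}} {'Jobs':>4} {'CPUs':>5}"]
    lines.append(f"{'=' * width} {'=' * 4} {'=' * 5}")
    for name, jobs, cpus in rows:
        lines.append(f"{name:<{width}} {jobs:>4} {cpus:>5}")
    return lines
```

With short names the layout stays at the familiar 8 characters; it only widens when a longer name shows up.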
- The grouping doesn't seem quite correct. As shown in the example
below, not all the users of the group appear under the group total
for the appropriate group:
I tried to make the "sort" command do the final sorting, but I couldn't
make it put the GROUP_TOTAL line first within each group. Maybe I have to
move the sorting into the awk code...
Username     Jobs  CPUs  Jobs  CPUs  Group    Further info
===========  ====  ====  ====  ====  =======  ===============================
GRAND_TOTAL   168  1089    55   451  ALL      running+idle=1540 CPUs 29 users
GROUP_TOTAL    56   349    10   119  group01  running+idle=468 CPUs 8 users
user01         27   324     4    52  group02  One, User
GROUP_TOTAL    27   324     4    52  group02  running+idle=376 CPUs 1 users
user02         29   174     1     6  group01  Two, User
GROUP_TOTAL     5   148    18   208  group03  running+idle=356 CPUs 4 users
user03          3   120    16   176  group03  Three, User
user04         11    96     3    48  group01  Four, User
...
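The listing above looks like a single sort on the first CPUs column, which is what interleaves the groups. A rough sketch of the alternative, in Python rather than awk and with a simplified row layout that is an assumption for the example: order the groups by their total first, then emit each GROUP_TOTAL line followed directly by its members. The function name grouped_report is hypothetical.

```python
from collections import defaultdict


def grouped_report(user_rows):
    """Order rows so each GROUP_TOTAL line is followed by that group's members.

    user_rows: iterable of (username, cpus, group) tuples.
    Returns a list of (label, cpus, group) tuples.
    """
    members = defaultdict(list)
    for username, cpus, group in user_rows:
        members[group].append((username, cpus, group))

    # Total CPUs per group, computed from the member rows.
    totals = {g: sum(c for _, c, _ in rows) for g, rows in members.items()}

    report = []
    # Groups by descending total (name breaks ties), users likewise within a group.
    for group in sorted(totals, key=lambda g: (-totals[g], g)):
        report.append(("GROUP_TOTAL", totals[group], group))
        report.extend(sorted(members[group], key=lambda r: (-r[1], r[0])))
    return report
```

Fed only the four user rows visible in the listing, this puts user02 and user04 together under group01. The totals it computes come from those four rows alone, so they will not match the full totals shown above.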
In general, maybe it would be good to have a common config file, where things
such as paths to binaries, USERLIST and username lengths are defined.
Yes, but what's the best way to do this? I'd like the scripts to be
self-contained, so that people can pick what they need without any
additional setup for users and sysadmins.
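One pattern that might satisfy both wishes, sketched in Python for illustration: each script carries its own built-in defaults and only consults a shared file if one happens to exist. The section name "tools", the key names and the idea of passing a path are all hypothetical; nothing like this exists in the tools today.

```python
import configparser

# Built-in defaults, so a script works with no config file at all.
# Section and key names are made up for this example.
DEFAULTS = {
    "userlist": "",
    "username_width": "8",
    "sinfo": "sinfo",
}


def load_settings(path=None):
    """Return settings as a dict: defaults, overridden by `path` if it exists."""
    parser = configparser.ConfigParser()
    parser.read_dict({"tools": DEFAULTS})
    if path:
        parser.read(path)  # a missing file is silently ignored
    return dict(parser["tools"])
```

A site that wants longer usernames would only need a file containing a [tools] section with username_width set; everyone else could ignore the file entirely.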
/Ole