I should probably add some example output:

Someone we need to talk to:
      Node   |     Memory (GB)     |         CPUs
    Hostname   Alloc    Max    Cur   Alloc   Used  Eff%
     m8-10-5    19.5      0      0       1   0.00     0
    *m8-10-2    19.5    2.3    2.2       1   0.99    99
     m8-10-3    19.5      0      0       1   0.00     0
     m8-10-4    19.5      0      0       1   0.00     0

* denotes the node where the batch script executes (node 0)
CPU usage is cumulative since the start of the job


Much better:
      Node   |     Memory (GB)     |         CPUs
    Hostname   Alloc    Max    Cur   Alloc   Used  Eff%
     m9-48-2   112.0   21.1   19.3      16  15.97    99
     m9-48-3    98.0   18.5   16.8      14  13.98    99
     m9-16-3   112.0   20.9   19.2      16  15.97    99
     m9-44-1   112.0   21.0   19.2      16  15.97    99
     m9-43-3   119.0   22.3   20.4      17  16.97    99
     m9-44-2   112.0   21.2   19.3      16  15.98    99
     m9-14-4   112.0   21.0   19.2      16  15.97    99
     m9-46-4   119.0   22.5   20.5      17  16.97    99
    *m9-10-2    91.0   32.0   15.8      13  12.81    98
     m9-43-1   119.0   22.3   20.4      17  16.97    99
     m9-16-1   126.0   23.9   21.6      18  17.97    99
     m9-47-4   119.0   22.4   20.5      17  16.97    99
     m9-43-4   119.0   22.4   20.5      17  16.97    99
     m9-48-1    84.0   15.7   14.4      12  11.98    99
     m9-42-4   119.0   22.2   20.3      17  16.97    99
     m9-43-2   119.0   22.2   20.4      17  16.97    99

* denotes the node where the batch script executes (node 0)
CPU usage is cumulative since the start of the job

Ryan

On 09/19/2016 11:13 AM, Ryan Cox wrote:
We use this script that we cobbled together: https://github.com/BYUHPC/slurm-random/blob/master/rjobstat. It assumes that you're using cgroups. It uses ssh to connect to each node, so it's not very scalable, but it works well enough for us.
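
For anyone who just wants the gist rather than the full script, here is a rough Python sketch of the same approach: expand the job's node list, ssh to each node, and read the job's cgroup counters. This is not rjobstat itself, and the cgroup v1 paths under /sys/fs/cgroup/<controller>/slurm/uid_<uid>/job_<jobid> are an assumption that may not match every site's configuration.

#!/usr/bin/env python3
"""Minimal sketch of the rjobstat idea, not the real script: expand the
job's node list, ssh to every node, and read the job's cgroup counters.
Paths assume cgroup v1 and the usual
/sys/fs/cgroup/<controller>/slurm/uid_<uid>/job_<jobid>/ layout created by
Slurm's cgroup plugins; adjust for your site."""
import subprocess
import sys


def job_nodes(jobid):
    # Expand the compressed node list (e.g. "m9-48-[1-3]") into hostnames.
    nodelist = subprocess.check_output(
        ["squeue", "-h", "-j", jobid, "-o", "%N"], text=True).strip()
    return subprocess.check_output(
        ["scontrol", "show", "hostnames", nodelist], text=True).split()


def job_uid(jobid):
    # The cgroup directory name includes the job owner's numeric uid.
    user = subprocess.check_output(
        ["squeue", "-h", "-j", jobid, "-o", "%u"], text=True).strip()
    return subprocess.check_output(["id", "-u", user], text=True).strip()


def read_remote(host, path):
    # One ssh round trip per file: fine for a handful of nodes, not for hundreds.
    return subprocess.check_output(["ssh", host, "cat", path], text=True).strip()


def main(jobid):
    uid = job_uid(jobid)
    base = f"/sys/fs/cgroup/{{}}/slurm/uid_{uid}/job_{jobid}"
    for host in job_nodes(jobid):
        cur = int(read_remote(host, base.format("memory") + "/memory.usage_in_bytes"))
        peak = int(read_remote(host, base.format("memory") + "/memory.max_usage_in_bytes"))
        cpu_s = int(read_remote(host, base.format("cpuacct") + "/cpuacct.usage")) / 1e9
        print(f"{host:>12}  cur {cur / 2**30:6.1f} GB  "
              f"max {peak / 2**30:6.1f} GB  cpu {cpu_s:10.0f} s (cumulative)")


if __name__ == "__main__":
    main(sys.argv[1])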

Ryan

On 09/18/2016 06:42 PM, Igor Yakushin wrote:
Subject: how to monitor CPU/RAM usage on each node of a Slurm job? Python API?
Hi All,

I'd like to be able to see, for a given jobid, how many resources the job is using at this moment on each node it is running on. Is there a way to do that?

So far it looks like I have to script it: get the list of the involved nodes (using, for example, squeue or qstat), ssh to each node, and find all of the user's processes there. That is not 100% guaranteed to catch only the job I am interested in: is there a way to find the UNIX PIDs corresponding to a Slurm jobid?
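
On the PID question: scontrol listpids <jobid>, run on one of the job's compute nodes, reports the PIDs slurmd is tracking for that job, and with cgroups the kernel keeps the same list under the job's cgroup. A minimal Python sketch, with a placeholder job id and an assumed cgroup v1 path:

import glob
import subprocess

jobid = "1234567"  # hypothetical job id; run this on one of the job's nodes

# slurmd can report the PIDs it is tracking for the job (and its steps) here.
print(subprocess.check_output(["scontrol", "listpids", jobid], text=True))

# With cgroups enabled the kernel keeps the same list; this path assumes the
# cgroup v1 layout used by Slurm's cgroup plugins.
for procs in glob.glob(f"/sys/fs/cgroup/memory/slurm/uid_*/job_{jobid}/cgroup.procs"):
    print(procs, open(procs).read())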

Another question: is there a Python API for Slurm? I found pyslurm, but so far it would not build with my version of Slurm.
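
One workaround while pyslurm refuses to build is to wrap Slurm's own sstat command from Python; it reports live accounting (MaxRSS, AveCPU, and so on) for a running job's steps as long as a JobAcctGather plugin is configured. A minimal sketch with a placeholder job id:

# Fallback while pyslurm won't build: shell out to Slurm's own sstat, which
# reports live accounting for a running job's steps. The job id below is a
# placeholder; pass "-a" or query "<jobid>.batch" if you also want the batch step.
import subprocess


def step_usage(jobid):
    fields = ["JobID", "MaxRSS", "MaxRSSNode", "AveCPU", "NTasks"]
    out = subprocess.check_output(
        ["sstat", "-j", jobid, "--noheader", "--parsable2",
         "--format=" + ",".join(fields)], text=True)
    for line in out.splitlines():
        yield dict(zip(fields, line.split("|")))


if __name__ == "__main__":
    for step in step_usage("1234567"):
        print(step)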

Thank you,
Igor



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
