[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

Rémi Palancher Mon, 19 Sep 2016 10:47:23 -0700


Hi Carlos,


On 19/09/2016 at 18:08, Carlos Fenoy wrote:

Hi All,

I'm working on a plugin that stores performance information for every
task of every job in influxdb. This can be visualized easily with
Grafana and shows CPU and memory usage as well as filesystem reads and
writes. The plugin uses the profile capability of slurm, and it has been
working fine in our cluster for almost a year.

The code has been tested in 15.8.04, and I'm now testing it with the
latest stable version and making some small adjustments so that this
plugin can be integrated into the standard slurm distribution.

The code is available here:
https://github.com/cfenoy/influxdb-slurm-monitoring

A presentation about this plugin will be given at next week's SLUG.

FWIW, we're doing basically the same thing here at EDF. I developed two collectd plugins for slurm:

https://github.com/collectd/collectd/pull/1198

It is based on collectd because we use it for other types of metrics on the nodes as well.

The slurmd plugin gathers process statistics (memory and CPU usage for the moment) and collectd sends them to influxdb. It discovers the jobs' processes from the slurm cgroup hierarchy, which avoids polling slurmctld.
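
To give an idea of the cgroup-based discovery, here is a minimal Python sketch. It is not the code of the collectd plugin. It assumes a cgroup v1 style layout of the form <root>/uid_<uid>/job_<jobid>/step_<stepid>/cgroup.procs, and the function name discover_job_pids is made up for this illustration. The root to pass depends on how cgroups are mounted on your nodes.

```python
import os
import re

# Assumed directory naming inside the slurm cgroup tree (illustration only).
UID_DIR = re.compile(r"^uid_(\d+)$")
JOB_DIR = re.compile(r"^job_(\d+)$")


def discover_job_pids(root):
    """Map each job ID found under `root` to the sorted PIDs in its cgroup.

    `root` is the slurm directory of one cgroup controller; its exact
    location is site-specific and is an assumption of this sketch.
    """
    jobs = {}
    if not os.path.isdir(root):
        return jobs
    for uid_name in os.listdir(root):
        uid_path = os.path.join(root, uid_name)
        if not UID_DIR.match(uid_name) or not os.path.isdir(uid_path):
            continue
        for job_name in os.listdir(uid_path):
            match = JOB_DIR.match(job_name)
            job_path = os.path.join(uid_path, job_name)
            if not match or not os.path.isdir(job_path):
                continue
            pids = set()
            # Processes usually sit in step sub-cgroups, so walk the whole job tree.
            for dirpath, _dirs, files in os.walk(job_path):
                if "cgroup.procs" in files:
                    with open(os.path.join(dirpath, "cgroup.procs")) as fh:
                        pids.update(int(line) for line in fh if line.strip())
            jobs[int(match.group(1))] = sorted(pids)
    return jobs
```

Everything is read locally on the compute node, so no request goes to slurmctld. Per-process CPU and memory figures could then be read from /proc/<pid> for each PID returned.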

Then, because we were facing very specific issues with grafana, I also ended up developing a small webapp that sends requests to influxdb and draws a diagram in real time (10s sampling):


https://github.com/edf-hpc/jobmetrics
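
For readers who want to try something similar, here is a rough Python sketch of querying influxdb over its 1.x HTTP /query endpoint. It is not taken from jobmetrics. The measurement name "cpu", the tag "job", the field "value" and both function names are assumptions made up for this example, so adapt them to whatever schema your collector writes.

```python
from urllib.parse import urlencode


def build_query_url(base_url, database, job_id, window="10m"):
    """Build a GET URL for the influxdb 1.x /query endpoint.

    The measurement, tag and field names are hypothetical.
    """
    query = (
        'SELECT mean("value") FROM "cpu" '
        f"WHERE \"job\" = '{int(job_id)}' AND time > now() - {window} "
        "GROUP BY time(10s)"
    )
    return f"{base_url}/query?" + urlencode({"db": database, "q": query})


def extract_points(payload):
    """Flatten a decoded 1.x JSON response into a list of {column: value} dicts."""
    points = []
    for result in payload.get("results", []):
        for series in result.get("series", []):
            columns = series["columns"]
            for row in series["values"]:
                points.append(dict(zip(columns, row)))
    return points
```

The URL can be fetched with any HTTP client, and the decoded JSON body passed to extract_points before plotting. The 10s grouping interval simply mirrors the sampling period mentioned above.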

There will be some material about this in the EDF site report at the next SLUG as well.


Best,
Rémi
