Hi Carlos,

On 19/09/2016 at 18:08, Carlos Fenoy wrote:
Hi All,

I'm working on a plugin that stores performance information for every
task of every job in influxdb. This can be visualized easily with
Grafana and provides information on CPU and memory usage as well as
reads and writes on filesystems. This plugin uses the profile
capability of slurm, and it's been working fine in our cluster for
almost a year.

The code has been tested in 15.8.04 and I'm working on testing it with
the latest stable version and making some small adjustments so this plugin
can be integrated with the standard slurm distribution.

Here is the code:
https://github.com/cfenoy/influxdb-slurm-monitoring

A presentation about this plugin will take place in next week's SLUG.

FWIW, we're doing basically the same thing here at EDF. I developed 2 collectd plugins for slurm[1]:

https://github.com/collectd/collectd/pull/1198

It is based on collectd because we use it for other types of metrics on the nodes as well.

The slurmd plugin gathers process statistics (memory and CPU usage for the moment) and collectd sends them to influxdb. It discovers the jobs' processes based on the slurm cgroup hierarchy, avoiding slurmctld polling.
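
To illustrate the cgroup-based discovery idea, here is a minimal Python sketch (not the actual collectd plugin code). It assumes a cgroup v1 layout of the form <root>/uid_<uid>/job_<jobid>/step_<stepid>/ with a cgroup.procs file in each directory; the exact root path depends on the site configuration, and the function name discover_job_pids is made up for this example.

```python
import os
import re

# Assumption: job cgroup directories are named job_<jobid>.
JOB_DIR = re.compile(r"^job_(\d+)$")


def discover_job_pids(cgroup_root):
    """Return {job_id: sorted list of PIDs} found below cgroup_root."""
    jobs = {}
    for dirpath, _dirnames, filenames in os.walk(cgroup_root):
        # Find the job_<id> component of the current path, if any.
        job_id = None
        for part in os.path.relpath(dirpath, cgroup_root).split(os.sep):
            match = JOB_DIR.match(part)
            if match:
                job_id = int(match.group(1))
                break
        if job_id is None or "cgroup.procs" not in filenames:
            continue
        # cgroup.procs lists one PID per line for this cgroup only,
        # so the PIDs of all step sub-cgroups are merged per job.
        with open(os.path.join(dirpath, "cgroup.procs")) as fh:
            pids = {int(line) for line in fh if line.strip()}
        jobs.setdefault(job_id, set()).update(pids)
    return {job: sorted(pids) for job, pids in jobs.items()}
```

On a node this would be called with something like discover_job_pids("/sys/fs/cgroup/cpuset/slurm"), where that path is again an assumption. Everything is read locally on the compute node, so no request goes to slurmctld.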

Then, because we were facing very specific issues with grafana, I also ended up developing a small webapp that sends requests to influxdb and draws a diagram in real time (10s sampling):

https://github.com/edf-hpc/jobmetrics
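
For illustration only, a request of the kind such a webapp might send to the InfluxDB 1.x HTTP /query endpoint could be built as in the sketch below. This is not taken from jobmetrics: the database name, the measurement name, the "job" tag and the helper name build_job_query_url are all assumptions made for the example.

```python
from urllib.parse import urlencode


def build_job_query_url(base_url, database, measurement, job_id,
                        window="1h", step="10s"):
    """Build an InfluxDB 1.x /query URL averaging a metric per time step."""
    # Hypothetical schema: one measurement per metric, tagged with "job".
    # int() keeps arbitrary strings out of the query text.
    query = (
        'SELECT mean("value") FROM "{m}" '
        "WHERE \"job\" = '{j}' AND time > now() - {w} "
        "GROUP BY time({s})"
    ).format(m=measurement, j=int(job_id), w=window, s=step)
    params = urlencode({"db": database, "q": query, "epoch": "s"})
    return "{}/query?{}".format(base_url.rstrip("/"), params)
```

The resulting URL can be fetched with any HTTP client; the default step of 10s matches the sampling period mentioned above.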

There will be some stuff about this in the EDF site report during next SLUG as well.

Best,
Rémi
