Hi Carlos,
On 19/09/2016 at 18:08, Carlos Fenoy wrote:
Hi All,
I'm working on a plugin that stores performance information for every
task of every job in influxdb. This can be visualized easily with
Grafana and provides information on CPU and memory usage as well as
reads and writes to filesystems. This plugin uses the profile
capability of slurm, and it's been working fine in our cluster for
almost a year.
The code has been tested in 15.8.04 and I'm working on testing it with
the latest stable version and making some small adjustments so this plugin
can be integrated with the standard slurm distribution.
The code is here:
https://github.com/cfenoy/influxdb-slurm-monitoring
A presentation about this plugin will take place at next week's SLUG.
FWIW, we're doing basically the same thing here at EDF. I developed 2
collectd plugins for slurm[1]:
https://github.com/collectd/collectd/pull/1198
It is based on collectd because we use it for other types of metrics on
the nodes as well.
The slurmd plugin gathers process statistics (memory and CPU usage for
the moment) and collectd sends them to influxdb. It discovers each job's
processes from the slurm cgroup hierarchy, which avoids polling slurmctld.
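
To illustrate the cgroup-based discovery, here is a minimal Python sketch
(not the code of the collectd plugin itself). It assumes a cgroup v1
layout of the form <root>/uid_<uid>/job_<jobid>/step_<stepid>/cgroup.procs,
where <root> would be something like /sys/fs/cgroup/memory/slurm; the
function name job_pids and that root path are assumptions made for the
example.

```python
import os
import re

# Directory names such as "job_1234" identify a job in the assumed layout.
JOB_DIR = re.compile(r"^job_(\d+)$")


def job_pids(cgroup_root):
    """Return {job_id: sorted list of PIDs} found below cgroup_root.

    Hypothetical helper: walks the tree, and for every directory that
    sits at or below a job_<id> directory, reads its cgroup.procs file.
    """
    jobs = {}
    for dirpath, _dirnames, filenames in os.walk(cgroup_root):
        if "cgroup.procs" not in filenames:
            continue
        # Look for a job_<id> component in the path relative to the root.
        rel = os.path.relpath(dirpath, cgroup_root)
        job_id = None
        for part in rel.split(os.sep):
            match = JOB_DIR.match(part)
            if match:
                job_id = int(match.group(1))
                break
        if job_id is None:
            continue  # not inside a job cgroup (e.g. the root itself)
        with open(os.path.join(dirpath, "cgroup.procs")) as procs:
            pids = {int(line) for line in procs if line.strip()}
        # Merge PIDs from the job level and all of its steps.
        jobs.setdefault(job_id, set()).update(pids)
    return {job: sorted(pids) for job, pids in jobs.items()}
```

Everything is read locally on the compute node, so nothing has to ask
slurmctld which jobs are running there.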
Then, because we were facing very specific issues with grafana, I also
ended up developing a small webapp that sends requests to influxdb and
draws a diagram in real time (10s sampling):
https://github.com/edf-hpc/jobmetrics
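
As a rough idea of the kind of request involved, here is a small Python
sketch that builds an InfluxQL query with 10s buckets. It is not taken
from jobmetrics: the measurement name, the "job" tag, the "value" field
and the function name job_metric_query are all assumptions made for the
example.

```python
import re

# Conservative patterns so that user input cannot alter the query.
_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_DURATION = re.compile(r"^\d+(s|m|h|d)$")


def job_metric_query(measurement, job_id, window="1h", step="10s"):
    """Build an InfluxQL query averaging one metric of one job.

    Hypothetical schema: points carry a "job" tag and a "value" field.
    """
    if not _NAME.match(measurement):
        raise ValueError("invalid measurement name: %r" % measurement)
    for duration in (window, step):
        if not _DURATION.match(duration):
            raise ValueError("invalid duration: %r" % duration)
    return (
        'SELECT mean("value") FROM "{m}" '
        "WHERE \"job\" = '{j}' AND time > now() - {w} "
        "GROUP BY time({s}) fill(none)"
    ).format(m=measurement, j=int(job_id), w=window, s=step)
```

For example, job_metric_query("cpu", 42) asks for the last hour of a
hypothetical "cpu" measurement for job 42, one averaged point every 10s.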
There will be some material about this in the EDF site report during the
next SLUG as well.
Best,
Rémi