I attempted to make something somewhat similar to this with the intent of
finding users who submit highly inefficient jobs. I wanted it to be
real-time, require no extra plugins, and let users see how their jobs
scale. It uses only sacct and sstat. For anyone with sudo access, it
will show
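A minimal sketch of that idea, using only sacct; the job selection and the
efficiency formula here are illustrative guesses, not the poster's actual
script:

    #!/bin/bash
    # Rough CPU-efficiency report for today's completed jobs, using only sacct.
    # Efficiency = TotalCPU / (Elapsed * AllocCPUS); very low values suggest
    # a job that asked for far more cores than it used.
    sacct -a -X -S today -s COMPLETED \
          --format=JobID,User,AllocCPUS,Elapsed,TotalCPU -n -P |
    awk -F'|' '
        function secs(t,  a, n, d) {        # [DD-]HH:MM:SS -> seconds
            d = 0
            if (t ~ /-/) { split(t, a, "-"); d = a[1]; t = a[2] }
            n = split(t, a, ":")
            return d*86400 + a[n-2]*3600 + a[n-1]*60 + a[n]
        }
        {
            alloc = $3; wall = secs($4); cpu = secs($5)
            if (wall * alloc > 0)
                printf "%-14s %-10s %6.1f%%\n", $1, $2, 100*cpu/(wall*alloc)
        }'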
Hi Carlos,
On 19/09/2016 at 18:08, Carlos Fenoy wrote:
Hi All,
I'm working on a plugin that stores performance information of every
task of every job in influxdb. This can be visualized easily with
Grafana and provides information on CPU and memory usage as well as
reads and writes from filesystems.
The advantage of this plugin over the HDF5 one is that you get the results in
almost real time. The plugin has a small buffer to avoid spamming the
influxdb server with too many writes.
Sent from my iPhone
> On 19 Sep 2016, at 18:57, Igor Yakushin wrote:
>
> Hi Carlos,
> Can one get results in real time as the job is running or only once it is
> finished?
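For anyone who would rather query the data directly than go through Grafana,
something like the following would work against InfluxDB's 1.x HTTP API; the
database name, measurement, and tag are invented for illustration, since the
plugin's actual schema isn't shown here:

    # Pull one job's per-task CPU samples out of InfluxDB.
    # 'jobs', 'cpu_used' and the 'jobid' tag are assumptions, not the
    # plugin's documented schema.
    curl -G 'http://influxdb.example.com:8086/query' \
         --data-urlencode 'db=jobs' \
         --data-urlencode "q=SELECT * FROM cpu_used WHERE jobid = '12345'"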
I should probably add some example output:
Someone we need to talk to:

                  Memory (GB)         CPUs
    Hostname    Alloc  Max   Cur    Alloc  Used   Eff%
    m8-10-5     19.5   0     0      1      0.00   0
    *m8-10-2    19.5   2.3   2.2    1      0.99   99
    m8-
We use this script that we cobbled together:
https://github.com/BYUHPC/slurm-random/blob/master/rjobstat. It assumes
that you're using cgroups. It uses ssh to connect to each node so it's
not very scalable but it works well enough for us.
Ryan
On 09/18/2016 06:42 PM, Igor Yakushin wrote:
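The per-node, cgroup-based approach is easy to approximate by hand. A rough
sketch, assuming cgroup v1 and Slurm's usual memory cgroup layout (the path
is a guess at a typical setup, not taken from rjobstat itself):

    # Print current memory usage of job 12345 on every node it occupies.
    # Assumes cgroup v1 with Slurm's hierarchy at .../slurm/uid_*/job_<id>.
    for node in $(scontrol show hostnames "$(squeue -h -o %N -j 12345)"); do
        ssh "$node" 'cat /sys/fs/cgroup/memory/slurm/uid_*/job_12345/memory.usage_in_bytes' |
            awk -v n="$node" '{printf "%-12s %6.2f GB\n", n, $1/2^30}'
    done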
Hi Carlos,
Can one get results in real time as the job is running or only once it is
finished?
Thank you,
Igor
On Mon, Sep 19, 2016 at 11:07 AM, Carlos Fenoy wrote:
> Hi All,
>
> I'm working on a plugin that stores performance information of every task
> of every job in influxdb. This can be visualized easily with Grafana and
> provides information on CPU and memory usage as well as reads and writes
> from filesystems.
Hi All,
I'm working on a plugin that stores performance information of every task
of every job in influxdb. This can be visualized easily with Grafana and
provides information on CPU and memory usage as well as reads and writes
from filesystems. This plugin is using the profile capability of slurm.
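The profiling framework is configured in slurm.conf. By analogy with the
existing HDF5 profile plugin, enabling an InfluxDB backend could look like
the following; the plugin type name is an assumption, since the plugin was
still under development at the time:

    # slurm.conf (sketch; acct_gather_profile/influxdb is hypothetical here)
    AcctGatherProfileType=acct_gather_profile/influxdb
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30

Jobs would then opt in with srun/sbatch --profile=task, as with the HDF5
plugin.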
On 9/19/16, 11:33 AM, "Igor Yakushin" wrote:
>
> Hi Giovanni,
> We have just upgraded to 16.05.4.
> When I try building pyslurm, it says that version 2.6 of Slurm is required.
This thread should move to the PySlurm google group instead:
https://groups.google.com/forum/#!forum/pyslurm. You can
Hi Giovanni,
We have just upgraded to 16.05.4.
When I try building pyslurm, it says that version 2.6 of Slurm is required.
Thank you,
Igor
On Mon, Sep 19, 2016 at 8:09 AM, Torres, Giovanni
wrote:
> On 9/18/16, 8:41 PM, "Igor Yakushin" wrote:
> >
> > Hi All,
> >
> >
> > I'd like to be able to see for a given jobid how much resources are used
> > by a job on each node it is running on at this moment. Is there a way to
> > do it?
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Sunday, September 18, 2016 at 9:57 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm
job? python API?
Hi Peter,
Ganglia plugin would be interesting.
On 9/18/16, 8:41 PM, "Igor Yakushin" wrote:
>
> Hi All,
>
>
> I'd like to be able to see for a given jobid how much resources are used by a
> job on each node it is running on at this moment. Is there a way to do it?
>
> So far it looks like I have to script it: get the list of the involved nodes
Sorry, didn't notice you wanted real time profiling.
As mentioned, you can use sstat, or if you wish to probe each node for
the local resources used, you could:
xdsh $(squeue -h -o %N -j $SLURM_JOBID) \
    "ps -q \$(scontrol listpids $SLURM_JOBID | awk 'NR>1{print \$1}' | paste -sd, -)"
On 09/19/2016 08:28
how to monitor CPU/RAM usage on each node of a slurm job? python API?
You should use HDF5:
http://slurm.schedmd.com/hdf5_profile_user_guide.html
On 09/19/2016 03:41 AM, Igor Yakushin wrote:
Hi All,
I'd like to be able to see for a given jobid how much resources are
used by a job on each node it is running on at this moment. Is there a
way to do it?
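For reference, the basic HDF5 profiling workflow from that guide, assuming
AcctGatherProfileType=acct_gather_profile/hdf5 is configured:

    # Run the job with per-task profiling enabled.
    sbatch --profile=task job.sh

    # After (or during) the run, merge the per-node profile files for job
    # 12345 into a single HDF5 file for inspection:
    sh5util -j 12345 -o job_12345.h5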
Hi Peter,
Ganglia plugin would be interesting. How do ganglia clients on different
nodes communicate? Typically they do not talk to each other, only to the
central node; but to determine that they are part of the same job, wouldn't
they somehow need to talk to each other?
Thank you,
Igor
On Sun, Sep
Hi Lachlan,
At first glance sstat does not give the kind of information I want. It does
not break usage down by core and only gives aggregated statistics rather
than current values. I looked at all the binaries in the Slurm bin directory
and it does not look like any of them does exactly what I need
Gah, yes. sstat, not sinfo.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 19 September 2016 at 13:00, Peter A Ruprecht wrote:
> Igor,
>
> Would sstat give you what you need? (http://slurm.schedmd.com/sstat.html)
> It doesn't update instantaneously but at least a few times a minute.
Igor,
Would sstat give you what you need? (http://slurm.schedmd.com/sstat.html) It
doesn't update instantaneously but at least a few times a minute.
If you want to get fancy, I believe that xdmod can integrate with TACC-stats to
provide graphs about what is happening inside a job but I'm not
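For reference, a typical sstat invocation against a running job looks like
this; the format fields are standard, pick whichever you need:

    # Live stats for the batch step of job 12345, refreshed each accounting
    # sample interval (JobAcctGatherFrequency).
    sstat -j 12345.batch --format=JobID,Nodelist,AveCPU,AveRSS,MaxRSS,MaxRSSNode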
Also, if you have slurm installed on a deb based distro, you can try this
https://github.com/edf-hpc/slurm-web
I tried to get it running on RPM (Centos) but it is too tightly coupled to
deb for my ability to port it.
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
I think you need a few things going on:
1. you have to have some sort of accounting organised and set up
2. your sbatch scripts need to use srun, not just run commands directly (see
the sketch below)
3. sinfo should then work on the job number.
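A sketch of point 2: wrapping the payload in srun creates a job step that
the accounting tools can actually report on:

    #!/bin/bash
    #SBATCH --job-name=example
    #SBATCH --ntasks=4

    # Launched via srun this becomes step <jobid>.0, visible to sstat/sacct;
    # run bare, it would only be accounted under the batch step.
    srun ./my_program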
When I asked, that was the response iirc.
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper