On 9/18/16, 8:41 PM, "Igor Yakushin" <igor.2...@gmail.com> wrote:
> 
> Hi All,
> 
> 
> I'd like to be able to see, for a given jobid, how much of each resource a 
> job is using on each node it is running on at this moment. Is there a way to 
> do it? 
> 
> So far it looks like I have to script it: get the list of the involved nodes 
> using, for example, squeue or qstat, ssh to each node, and find all the user 
> processes (it is not 100% guaranteed that they would be from the job I am 
> interested in: is there a way to find the UNIX PIDs corresponding to a Slurm 
> jobid?).
>
You can run `scontrol listpids` on a node; it returns a mapping of PIDs to 
JobIDs.  From a script, though, you would have to fork a subshell to execute 
scontrol and then parse its output.
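
For example, a rough sketch in Python (this assumes PID and JOBID are the 
first two columns of the listpids table; check the header on your system, as 
the format may vary by Slurm version):

```python
#!/usr/bin/env python
"""Build a JobID -> PIDs mapping from `scontrol listpids` output.

Assumes the first two columns are PID and JOBID; adjust the parsing
if your Slurm version formats the table differently.
"""
import subprocess
from collections import defaultdict

def pids_by_job():
    out = subprocess.check_output(["scontrol", "listpids"]).decode()
    jobs = defaultdict(list)
    for line in out.strip().splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) >= 2:
            pid, jobid = fields[0], fields[1]
            jobs[jobid].append(int(pid))
    return jobs

if __name__ == "__main__":
    for jobid, pids in sorted(pids_by_job().items()):
        print("%s: %s" % (jobid, pids))
```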

If you are using the cgroup task plugin, a better way would be to walk the 
cgroup hierarchy (/cgroup or /sys/fs/cgroup, depending on your OS) on each 
compute node.  There is a Python API to libcgroup 
(https://git.fedorahosted.org/git/python-libcgroup.git), but I don’t think it 
is complete, and I’m not sure whether it is still maintained.  If you are 
doing this from Python, however, I find it easier and faster to just glob the 
cgroup hierarchy and read cgroup.procs and memory.stat under the Slurm 
cgroups.  You still need to read the CPU state of each process or thread 
under a given job in order to compute the “cpu load” for that job.
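
A rough sketch of that approach (assuming the cgroup v1 layout 
slurm/uid_<uid>/job_<jobid> that the Slurm cgroup plugins typically create 
under the memory controller; adjust CGROUP_ROOT and the glob to match your 
mount point and configuration):

```python
#!/usr/bin/env python
"""Collect per-job PIDs, memory use, and CPU ticks from Slurm cgroups.

Assumes a cgroup v1 hierarchy of slurm/uid_<uid>/job_<jobid> under
the memory controller; paths depend on OS and Slurm configuration.
"""
import glob
import os

CGROUP_ROOT = "/sys/fs/cgroup/memory"   # or /cgroup/memory on older distros

def cpu_ticks(pid):
    """Cumulative (utime, stime) clock ticks for a PID from /proc."""
    with open("/proc/%d/stat" % pid) as f:
        data = f.read()
    # skip past the process name, which is wrapped in parentheses
    fields = data.rsplit(")", 1)[1].split()
    return int(fields[11]), int(fields[12])    # utime, stime

def job_stats():
    stats = {}
    for jobdir in glob.glob(os.path.join(CGROUP_ROOT, "slurm",
                                         "uid_*", "job_*")):
        jobid = os.path.basename(jobdir).split("_", 1)[1]
        # PIDs attached to this job's cgroup
        with open(os.path.join(jobdir, "cgroup.procs")) as f:
            pids = [int(line) for line in f if line.strip()]
        # memory.stat is a plain "key value" table
        memstat = {}
        with open(os.path.join(jobdir, "memory.stat")) as f:
            for line in f:
                key, value = line.split()
                memstat[key] = int(value)
        ticks = 0
        for pid in pids:
            try:
                ticks += sum(cpu_ticks(pid))
            except IOError:        # process exited between the two reads
                pass
        stats[jobid] = {"pids": pids,
                        "rss": memstat.get("total_rss"),
                        "cpu_ticks": ticks}
    return stats

if __name__ == "__main__":
    for jobid, info in job_stats().items():
        print("%s: %s" % (jobid, info))
```

To turn cpu_ticks into a “cpu load”, sample it twice and divide the delta by 
the sampling interval (and by os.sysconf("SC_CLK_TCK") to convert ticks to 
seconds).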

My take on this was to write a small daemon that runs on each node.  It gathers 
metrics for all running slurm processes on a node and aggregates them by job.  
The daemon then sends the info periodically (every 30 seconds) to a Redis 
database in JSON format.  From there, I can write command utilities or web 
tools that query Redis instead of slurmctld.  This makes for a stateless 
monitoring environment.  Given that Redis runs in-memory, if Redis goes down, 
all metrics are lost.  However, as long as the daemon is running on each 
compute node, Redis will be fully repopulated in 30 seconds.
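
The core of such a daemon might look roughly like this (a hypothetical sketch 
using the redis-py client; job_stats() stands in for the cgroup collector 
above, and the host, key scheme, and TTL are my own choices):

```python
#!/usr/bin/env python
"""Periodically push per-job metrics to Redis as JSON.

Hypothetical sketch: job_stats() is the cgroup collector sketched
earlier, and the Redis host, key names, and TTL are arbitrary.
"""
import json
import socket
import time

import redis                            # pip install redis

from jobstats import job_stats          # hypothetical module: collector above

INTERVAL = 30                           # seconds between samples
TTL = 3 * INTERVAL                      # stale keys expire if the daemon dies

def main():
    r = redis.StrictRedis(host="redis.example.com")   # hypothetical host
    node = socket.gethostname()
    while True:
        for jobid, metrics in job_stats().items():
            # one key per (node, job); setex stores the value and an
            # expiry in a single call, so a restarted Redis repopulates
            # itself as long as the daemons keep running
            r.setex("slurm:%s:%s" % (node, jobid), TTL,
                    json.dumps(metrics))
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()
```

Giving every key a TTL also means that entries for finished jobs or dead 
nodes age out on their own, which is what keeps the setup stateless.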

I have some code that does all this already, but I don’t think it is ready for 
mass consumption.  I could put it on GitHub if anyone is interested.

>
> Another question: is there a Python API to Slurm? I found PySlurm, but so 
> far it would not build with my version of Slurm.

What version of Slurm are you running?  If you are having problems building 
PySlurm, feel free to post questions here: 
https://groups.google.com/forum/#!forum/pyslurm

We’d be happy to help you get PySlurm going.

Best,
Giovanni