Experimental, meaning it doesn't work as correctly as the linux plugin does. I know that when we last worked on it, the cgroup plugin did not do memory accounting correctly. There is also quite a bit of functionality missing (profiling and such). Basically, it is half-baked at this point.

I don't know when/if the plugin will get out of experimental state, but I would definitely stay away from it in production for the bad memory accounting alone.

Danny

On 08/12/13 15:04, Ryan Cox wrote:

Moe,

In what way is it experimental? Is it possibly unstable or just not feature-complete?

We're writing a script to independently gather statistics for our own database and would like to use the cpuacct cgroup, thus the interest in the jobacct_gather/cgroup plugin.
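For a script like that, a minimal sketch of reading cumulative CPU time from a job's cpuacct cgroup might look like the following. The `/sys/fs/cgroup/cpuacct/slurm/uid_<uid>/job_<jobid>` path layout is an assumption based on common Slurm cgroup v1 setups; adjust it for your site.

```python
import os

def job_cpu_usage_ns(jobid, uid, root="/sys/fs/cgroup/cpuacct/slurm"):
    """Return cumulative CPU time in nanoseconds for a job's cgroup.

    Reads the cpuacct.usage file under the (assumed) Slurm cgroup
    hierarchy root/uid_<uid>/job_<jobid>/.
    """
    path = os.path.join(root, f"uid_{uid}", f"job_{jobid}", "cpuacct.usage")
    with open(path) as f:
        return int(f.read().strip())
```

The `root` parameter is there so the same function works against a test tree or a non-default cgroup mount point.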

Ryan

On 08/09/2013 10:07 AM, Moe Jette wrote:

I misspoke. The JobAcctGatherType=jobacct_gather/cgroup plugin is experimental and not ready for use. Your configuration should work.

Quoting Moe Jette <je...@schedmd.com>:

Your explanation seems likely. You probably want to change your configuration to:
JobAcctGatherType=jobacct_gather/cgroup

Quoting Andy Wettstein <wettst...@uchicago.edu>:


I understand this problem more fully now.

Certain jobs that our users run fork processes in a way that leaves the
parent PID set to 1. The _get_offspring_data function in
jobacct_gather/linux ignores these when adding up memory usage.

It seems like if proctrack/cgroup is enabled, the jobacct_gather/linux
plugin should rely on the cgroup.procs file to identify the PIDs instead
of trying to figure things out based on parent PID. Is something like
that reasonable?
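The suggestion above could be sketched as follows: enumerate a step's PIDs from cgroup.procs rather than walking parent PIDs, so processes that get reparented to init (PPID 1) are still counted. The path layout under the freezer controller is an assumption, not the actual plugin code.

```python
import os

def step_pids(jobid, step, uid, root="/sys/fs/cgroup/freezer/slurm"):
    """Read the PIDs attached to a job step's cgroup.

    cgroup.procs lists one PID per line; membership in the cgroup is
    authoritative regardless of each process's current parent PID.
    """
    path = os.path.join(root, f"uid_{uid}", f"job_{jobid}",
                        f"step_{step}", "cgroup.procs")
    with open(path) as f:
        return [int(line) for line in f if line.strip()]
```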

Andy

On Tue, Jul 30, 2013 at 10:59:56AM -0700, Andy Wettstein wrote:

Hi,

I have the following set:

ProctrackType           = proctrack/cgroup
TaskPlugin              = task/cgroup
JobAcctGatherType       = jobacct_gather/linux

This is on slurm 2.5.7.

When I use sstat on all running jobs, there are a large number of jobs that say they have no steps running (for example: sstat: error: couldn't
get steps for job 4783548).

This seems to be the case for all steps that use the step_batch cgroup. If the step gets created in something like step_0, everything seems to be reported ok. In both instances, the PIDs are actually listed in the
right cgroup.procs file.

I noticed this because there were several jobs that should have been
killed due to memory limits, but were not. The jobacct_gather plugin
doesn't know about the processes in the step_batch cgroup so it doesn't
count the memory usage.
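To illustrate what gets missed: once you have the PIDs from cgroup.procs, summing their resident memory from /proc is straightforward. This is a hypothetical sketch, not the plugin's actual accounting logic; `total_rss_kb` and its `proc_root` parameter are names invented for the example.

```python
import os

def total_rss_kb(pids, proc_root="/proc"):
    """Sum VmRSS (in kB) from /proc/<pid>/status for the given PIDs.

    Processes that exit between listing and reading are skipped.
    """
    total = 0
    for pid in pids:
        try:
            with open(os.path.join(proc_root, str(pid), "status")) as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        total += int(line.split()[1])
                        break
        except FileNotFoundError:
            continue
    return total
```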


Andy




--
andy wettstein
hpc system administrator
research computing center
university of chicago
773.702.1104
