OK. When you say it should work, do you mean there is a bug in Slurm that is causing this problem?
I can send a fairly trivial example that can bypass any memory limits if
you need it.

On Fri, Aug 09, 2013 at 09:07:53AM -0700, Moe Jette wrote:
>
> I misspoke. The JobAcctGatherType=jobacct_gather/cgroup plugin is
> experimental and not ready for use. Your configuration should work.
>
> Quoting Moe Jette <je...@schedmd.com>:
>
> >Your explanation seems likely. You probably want to change your
> >configuration to:
> >JobAcctGatherType=jobacct_gather/cgroup
> >
> >Quoting Andy Wettstein <wettst...@uchicago.edu>:
> >
> >>
> >>I understand this problem more fully now.
> >>
> >>Certain jobs that our users run fork processes in a way that the parent
> >>PID gets set to 1. The _get_offspring_data function in
> >>jobacct_gather/linux ignores these when adding up memory usage.
> >>
> >>It seems like if proctrack/cgroup is enabled, the jobacct_gather/linux
> >>plugin should rely on the cgroup.procs file to identify the pids instead
> >>of trying to figure things out based on parent PID. Is something like
> >>that reasonable?
> >>
> >>Andy
> >>
> >>On Tue, Jul 30, 2013 at 10:59:56AM -0700, Andy Wettstein wrote:
> >>>
> >>>Hi,
> >>>
> >>>I have the following set:
> >>>
> >>>ProctrackType = proctrack/cgroup
> >>>TaskPlugin = task/cgroup
> >>>JobAcctGatherType = jobacct_gather/linux
> >>>
> >>>This is on slurm 2.5.7.
> >>>
> >>>When I use sstat on all running jobs, there are a large number of jobs
> >>>that say they have no steps running (for example: sstat: error: couldn't
> >>>get steps for job 4783548).
> >>>
> >>>This seems to be the case for all steps that use the step_batch cgroup.
> >>>If the step gets created in something like step_0, everything seems to
> >>>be reported ok. In both instances, the PIDs are actually listed in the
> >>>right cgroup.procs file.
> >>>
> >>>I noticed this because there were several jobs that should have been
> >>>killed due to memory limits, but were not. The jobacct_gather plugin
> >>>doesn't know about the processes in the step_batch cgroup so it doesn't
> >>>count the memory usage.
> >>>
> >>>
> >>>Andy
> >>>
> >>>--
> >>>andy wettstein
> >>>hpc system administrator
> >>>research computing center
> >>>university of chicago
> >>>773.702.1104
> >>
> >>--
> >>andy wettstein
> >>hpc system administrator
> >>research computing center
> >>university of chicago
> >>773.702.1104

--
andy wettstein
hpc system administrator
research computing center
university of chicago
773.702.1104
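
P.S. To make the parent-PID versus cgroup.procs point concrete, here is a minimal sketch in Python. It is only a simplified model of the two accounting strategies, not Slurm's actual code: the names Proc, rss_by_ppid, rss_by_cgroup and parse_cgroup_procs, and all the PIDs and memory sizes, are made up for illustration. It assumes the cgroup.procs file holds one PID per line.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Proc(NamedTuple):
    ppid: int    # parent PID of the process
    rss_kb: int  # resident memory in kB (hypothetical values below)


def descendants_by_ppid(table: Dict[int, Proc], root: int) -> Set[int]:
    """Return root plus every process reachable by following parent PIDs."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for pid, proc in table.items():
            if pid not in found and proc.ppid in found:
                found.add(pid)
                changed = True
    return found


def rss_by_ppid(table: Dict[int, Proc], root: int) -> int:
    """Sum memory over the parent-PID tree rooted at the step's first process."""
    return sum(table[pid].rss_kb for pid in descendants_by_ppid(table, root))


def parse_cgroup_procs(text: str) -> Set[int]:
    """Parse the contents of a cgroup.procs file (assumed one PID per line)."""
    return {int(tok) for tok in text.split()}


def rss_by_cgroup(table: Dict[int, Proc], pids: Iterable[int]) -> int:
    """Sum memory over every PID the cgroup lists, whatever its parent is."""
    return sum(table[pid].rss_kb for pid in pids if pid in table)


# Hypothetical snapshot of one job step.
table = {
    1: Proc(ppid=0, rss_kb=1_000),      # init, not part of the job
    100: Proc(ppid=1, rss_kb=2_000),    # first process of the step
    101: Proc(ppid=100, rss_kb=50_000), # ordinary child
    102: Proc(ppid=1, rss_kb=900_000),  # detached child, parent PID is now 1
}
cgroup_pids = parse_cgroup_procs("100\n101\n102\n")

tree_total = rss_by_ppid(table, root=100)        # misses PID 102
cgroup_total = rss_by_cgroup(table, cgroup_pids) # includes PID 102
print(tree_total, cgroup_total)
```

In this model the parent-PID walk never sees the detached process, so a memory limit checked against tree_total would not trigger, while the cgroup-based total counts it.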