I've been trying to discover why accounting doesn't properly reflect memory usage when using jobacct_gather/cgroup. I've tracked it down to what looks like one definite bug, plus some conflicting behavior with task/cgroup. The problem is observed when using sbatch without srun, but it could be wider than that; I'm not sure yet.

First, the bug. Without srun, the step number is set to SLURM_BATCH_SCRIPT (-2, or 4294967294 when unsigned). plugins/jobacct_gather/cgroup/jobacct_gather_cgroup_cpuacct.c contains a special case for this, where it sets the cgroup step path to "step_batch". This special case is missing from jobacct_gather_cgroup_memory.c, and as a result the step path ends up as "step_4294967294". I believe this is a bug, as that directory does not exist.

Fixing that to mirror what cpuacct does gets us a little further, but now comes the conflicting behavior. The following discussion concerns the 'memory' cgroup subsystem.

With the paths fixed, the jobacct_gather/cgroup plugin writes slurmstepd's PID to step_batch/task_0/cgroup.procs. Almost immediately thereafter, the task/cgroup plugin writes slurmstepd's PID to step_batch/cgroup.procs, which removes the only PID from task_0 and causes task_0 itself to be removed by the cgroup release agent. With task_0 now gone, the periodic calls to the jobacct_gather/cgroup plugin fail to collect memory data:

"unable to open '/cgroup/memory/slurm/uid_7260/job_1079/step_batch/task_0/memory.stat' for reading : No such file or directory"

There appears to be a race between task/cgroup and jobacct_gather/cgroup. If I introduce enough delays so that jobacct_gather/cgroup runs last and the PID stays in task_0, everything seems to work properly. If task/cgroup runs last and task_0 gets removed, the accounting info is lost.

Before I waste any more time trying to debug this, can someone please tell me what the desired operation should be? It seems to me that the memory should be associated with task_0, and not step_batch, but I'm not sure.

All testing here was done with 14.11.9.

Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland, College Park
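For reference, the step-path special case described above could be sketched roughly as follows. This is a minimal standalone illustration, not the actual Slurm code: build_step_path and BATCH_STEP_ID are hypothetical names invented here, and the only details taken from the report are the sentinel value (-2 as an unsigned 32-bit integer) and the "step_batch" directory name.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: stands in for Slurm's SLURM_BATCH_SCRIPT sentinel,
 * which is -2, i.e. 4294967294 when stored as an unsigned 32-bit value. */
#define BATCH_STEP_ID ((uint32_t)-2)

/*
 * Hypothetical helper (not a Slurm function): build the cgroup path for a
 * step underneath job_path. The batch step gets the fixed name "step_batch"
 * instead of having its sentinel id printed as a number.
 * Returns 0 on success, -1 if the path did not fit in buf.
 */
static int build_step_path(char *buf, size_t len,
                           const char *job_path, uint32_t step_id)
{
    int n;

    if (step_id == BATCH_STEP_ID)
        n = snprintf(buf, len, "%s/step_batch", job_path);
    else
        n = snprintf(buf, len, "%s/step_%" PRIu32, job_path, step_id);

    /* snprintf reports the length it wanted; treat truncation as an error. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Without the first branch, a batch step would produce ".../step_4294967294", which is the nonexistent directory reported above.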
