Rémi,

Thanks for investigating this. It looks like some work will be required to fully integrate Slurm cgroups with systemd. The different mount point shouldn't be a problem, since that is configurable with CgroupMountPoint. The main issue seems to be with cleanup. The release_agent cleanup mechanism has always been a bit clunky; I think the long-term plan has been to do the cleanup inside Slurm instead. As you note, this has already been partially implemented for some of the subsystems. If the cgroup unified hierarchy provides a better way of doing cleanup, then maybe we should use that. Not sure when any of the previous Slurm cgroup developers will have time to work on this though...
Martin Perry
Bull

-----Original Message-----
From: Rémi Palancher [mailto:[email protected]]
Sent: Thursday, August 28, 2014 9:18 AM
To: slurm-dev
Subject: [slurm-dev] Feedback on integration tests systemd/slurm and questions

Hi developers,

You probably already know that systemd[1] is the fast-growing init alternative that will be the new default on all major GNU/Linux distributions, including RHEL7, CentOS, Fedora, Debian, Ubuntu and so on. Among other things, systemd has the particularity of putting all processes into cgroups. This includes all system services, and therefore the Slurm daemons. Since slurmd is also able to manage cgroups, we (with workmates at EDF) were curious to know how systemd and Slurm could work together.

My testing environment is:

- Debian Wheezy 7.6
- Linux kernel 3.2.60
- systemd 204
- slurm 14.11.0-0pre3

systemd
=======

Here are short explanations of how systemd works (at least AFAIU!). At boot time, systemd mounts the following cgroup filesystems:

tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)

Inside a tmpfs, it mounts a first cgroup filesystem named 'systemd':

- without any resource controller associated to it
- with notify_on_release set to 1
- with /lib/systemd/systemd-cgroups-agent as the release agent

Then, systemd looks for all resource controllers available in the running kernel and mounts one filesystem for each of them (except cpu and cpuacct, which are mounted together). None of these controller cgroup filesystems has a release_agent set.

systemd actually manages all processes running on the system (user sessions, kernel threads, services and forks) in dedicated cgroups inside the 'systemd' hierarchy. Then, if limits are configured in so-called unit files, it also creates cgroups in the appropriate controller filesystems. For example, if you set a memory usage limit on the slurmd service, then in addition to the /system/slurmd.service cgroup in the systemd fs, it will also create a /system/slurmd.service cgroup in the memory fs with the appropriate memory limits. By default for services, it simply creates a cgroup in the cpu,cpuacct controllers. For example with slurmd:

# cat /proc/`pgrep slurmd`/cgroup
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/
5:devices:/
4:memory:/
3:cpuacct,cpu:/system/slurmd.service
2:cpuset:/
1:name=systemd:/system/slurmd.service

When all processes of a cgroup end, systemd is notified through the execution of the release agent /lib/systemd/systemd-cgroups-agent in the 'systemd' fs. This program basically sends a DBUS notification to the systemd core daemon with the path of the empty cgroup as parameter. When the core daemon receives this DBUS notification, it looks through its internal data structures for all associated cgroups in all controller filesystems and deletes all of them. This is how all the cgroup controller filesystems are kept clean when they become empty.

slurm
=====

Well, then the question is: how could slurmd and its cgroup plugins work on top of that?

First, here is an excerpt of cgroups.txt in the Linux kernel documentation[2]:

"If an active hierarchy with exactly the same set of subsystems already exists, it will be reused for the new mount.
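The per-process view printed by `cat /proc/PID/cgroup` above can be parsed mechanically. Here is a small illustrative sketch (the helper name `parse_proc_cgroup` is my own, not a Slurm or systemd API), assuming the cgroup-v1 format where each line is `hierarchy-id:controller-list:path`:

```python
def parse_proc_cgroup(text):
    """Parse cgroup-v1 /proc/<pid>/cgroup content into {controller: path}.

    Each line has the form "<hierarchy-id>:<controllers>:<path>"; controllers
    mounted together (cpuacct,cpu) share one line, and named hierarchies such
    as systemd's appear as "name=<name>".
    """
    mapping = {}
    for line in text.strip().splitlines():
        _hier_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            mapping[controller] = path
    return mapping

# The slurmd example from above:
sample = """\
9:perf_event:/
4:memory:/
3:cpuacct,cpu:/system/slurmd.service
1:name=systemd:/system/slurmd.service"""
```

For instance, `parse_proc_cgroup(sample)["cpu"]` yields `/system/slurmd.service`, showing that slurmd itself lives only in the cgroups systemd created for its service.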
If no existing hierarchy matches, and any of the requested subsystems are in use in an existing hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy is activated, associated with the requested subsystems."

Therefore, if we configure slurmd to mount by itself (on other mountpoints) the already existing cpuset, memory and freezer controller filesystems, and to set its own release_agent for emptiness notification, it works. The cleanup at the end of jobs is correctly done by the Slurm release agent, and systemd does not complain. Here is the corresponding cgroup.conf:

CgroupMountpoint=/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"

(In slurm.conf, I enable the proctrack/cgroup and task/cgroup plugins, but I avoided jobacct_gather/none since it's still flagged as "experimental" in the doc.) Then:

# mkdir /cgroup
# mount -t tmpfs tmpfs /cgroup
# slurmd

My only source of sadness with this solution is the number of mounts:

tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,release_agent=/etc/slurm-llnl/cgroup/release_cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory,release_agent=/etc/slurm-llnl/cgroup/release_memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer,release_agent=/etc/slurm-llnl/cgroup/release_freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
tmpfs on /cgroup type tmpfs (rw,relatime)
cgroup on /cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer,release_agent=/etc/slurm-llnl/cgroup/release_freezer)
cgroup on /cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,release_agent=/etc/slurm-llnl/cgroup/release_cpuset)
cgroup on /cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory,release_agent=/etc/slurm-llnl/cgroup/release_memory)

Therefore, I tried to make Slurm use the controller filesystems already mounted by systemd. In cgroup.conf, it looks like this:

CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=no
#CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"

With this configuration, slurmd does not set the release_agent in the root directory of the controller filesystems (and commenting out the CgroupReleaseAgentDir parameter does not change anything):

# for controller in perf_event net_cls freezer devices memory cpu,cpuacct cpuset systemd; do echo "${controller}: $(cat /sys/fs/cgroup/${controller}/release_agent)"; done
perf_event:
net_cls:
freezer:
devices:
memory:
cpu,cpuacct:
cpuset:
systemd: /lib/systemd/systemd-cgroups-agent

When a job is launched, the cgroups are properly created by slurmd:

# find /sys/fs/cgroup/ -path '*slurm*tasks'
/sys/fs/cgroup/freezer/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/freezer/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/freezer/slurm/uid_1000/tasks
/sys/fs/cgroup/freezer/slurm/tasks
/sys/fs/cgroup/memory/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/memory/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/memory/slurm/uid_1000/tasks
/sys/fs/cgroup/memory/slurm/tasks
/sys/fs/cgroup/cpu,cpuacct/system/slurmd.service/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/tasks
/sys/fs/cgroup/cpuset/slurm/tasks
/sys/fs/cgroup/systemd/system/slurmd.service/tasks

(My job 27 has one batch step running `sleep 600`.)

# cat /proc/`pgrep sleep`/cgroup
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/slurm/uid_1000/job_27/step_batch
5:devices:/
4:memory:/slurm/uid_1000/job_27/step_batch
3:cpuacct,cpu:/system/slurmd.service
2:cpuset:/slurm/uid_1000/job_27/step_batch
1:name=systemd:/system/slurmd.service

But when I cancel the job, some garbage is left in the cgroup filesystems:

# find /sys/fs/cgroup/ -path '*slurm*tasks'
/sys/fs/cgroup/freezer/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/freezer/slurm/uid_1000/tasks
/sys/fs/cgroup/freezer/slurm/tasks
/sys/fs/cgroup/memory/slurm/tasks
/sys/fs/cgroup/cpu,cpuacct/system/slurmd.service/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/tasks
/sys/fs/cgroup/cpuset/slurm/tasks
/sys/fs/cgroup/systemd/system/slurmd.service/tasks

The systemd release agent was not called by Linux, since the cgroups were not present in the 'systemd' fs. The Slurm release script was not called either, since it was not set as release_agent in the controller filesystems. But strangely, the memory controller has been totally cleaned, and the step_batch cgroup in the freezer controller has vanished. Actually, I figured out that there is some cleanup logic which explains that result in:

- _slurm_cgroup_destroy(), called by fini() in src/plugins/proctrack/cgroup/proctrack_cgroup.c
- task_cgroup_memory_fini() in src/plugins/task/cgroup/task_cgroup_memory.c

But there is no equivalent logic in task_cgroup_cpuset_fini() in src/plugins/task/cgroup/task_cgroup_cpuset.c.

So finally, here come my questions:

- Is the cleanup logic in the plugins supposed to work for all controllers, and is it simply unfinished? And is the release_agent script just a workaround?
- Or is slurmd supposed to rely only on the release_agent for the cleanup, making the cleanup logic triggered by fini() in the plugins irrelevant?
- Or a mix of these that I just don't understand?
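Whichever component does the cleanup, it has to work bottom-up: a cgroup directory can only be removed once it contains no tasks and no child cgroups, so step_batch must go before job_27, and so on up to slurm/. A minimal sketch of that ordering (the helper name `cleanup_empty_cgroups` is my own; it operates on plain directories rather than a real cgroup filesystem, and real code would also have to check each cgroup's tasks file and tolerate EBUSY races):

```python
import os
import tempfile

def cleanup_empty_cgroups(root):
    """Remove empty directories bottom-up, deepest first, mimicking the
    order in which an emptied cgroup hierarchy must be torn down.

    Illustrative sketch on plain directories only."""
    removed = []
    # topdown=False makes os.walk yield children before their parents,
    # so each directory is already empty by the time we reach it.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(os.path.relpath(dirpath, root))
    return removed

# Simulate the leftover cpuset hierarchy shown above.
mount = tempfile.mkdtemp()
os.makedirs(os.path.join(mount, "slurm", "uid_1000", "job_27", "step_batch"))
removed = cleanup_empty_cgroups(mount)
```

After the call, `removed` lists the four directories from step_batch up to slurm, and the mount point itself is left in place, which matches what the partial fini() logic achieves for the memory controller but not for cpuset.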
I would be glad to have your insights on this matter :) I would also appreciate feedback from other people who have run tests with Slurm and systemd!

The funny thing about all of this is that it will become totally irrelevant with the upcoming releases of the Linux kernel (3.16+) and the ongoing effort on the cgroup unified hierarchy[3][4]! So if modifications are to be made to cgroup management in Slurm, it would be wise to take this into account.

[1] http://www.freedesktop.org/wiki/Software/systemd/
[2] https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
[3] http://lwn.net/Articles/601840/
[4] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroups/unified-hierarchy.txt

Thank you for having taken the time to read this!

Regards,
--
Rémi Palancher <[email protected]>
