Rémi,

Thanks for investigating this. It looks like some work will be required to fully integrate Slurm cgroups with systemd. The different mount point shouldn't be a problem, since that is configurable with CgroupMountPoint. The main issue seems to be with cleanup. The release_agent cleanup mechanism has always been a bit clunky; I think the long-term plan has been to do the cleanup inside Slurm instead. As you note, this has already been partially implemented for some of the subsystems. If the cgroup unified hierarchy provides a better way of doing cleanup, then maybe we should use that. Not sure when any of the previous Slurm cgroup developers will have time to work on this though...
Martin Perry
Bull

-----Original Message-----
From: Rémi Palancher [mailto:[email protected]]
Sent: Thursday, August 28, 2014 9:18 AM
To: slurm-dev
Subject: [slurm-dev] Feedback on integration tests systemd/slurm and questions

Hi developers,

You probably already know that systemd[1] is the fast-growing init alternative that will be the new default on all major GNU/Linux distributions, including RHEL7, CentOS, Fedora, Debian, Ubuntu and so on. Among other things, systemd has the particularity of putting all processes into cgroups. This includes all system services, and therefore the Slurm daemons. Since slurmd is also able to manage cgroups, we (with workmates at EDF) were curious to know how systemd and Slurm could work together.

My testing environment is:

- Debian Wheezy 7.6
- Linux kernel 3.2.60
- systemd 204
- slurm 14.11.0-0pre3

systemd
=======

Here are short explanations of how systemd works (at least AFAIU!). At boot time, systemd mounts the following cgroup filesystems:

tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)

Inside a tmpfs, it mounts a first cgroup filesystem named 'systemd':

- without any resource controller associated to it
- with notify_on_release set to 1
- with /lib/systemd/systemd-cgroups-agent as the release agent

Then, systemd looks for all resource controllers available in the running kernel and mounts one filesystem for each of them (except cpu and cpuacct, which are mounted together). None of these controller cgroup filesystems has a release_agent set.

systemd actually manages all processes running on the system (user sessions, kernel threads, services and forks) in dedicated cgroups inside the 'systemd' hierarchy. Then, if limits are configured in so-called unit files, it also creates cgroups in the appropriate controller filesystems. For example, if you set a memory usage limit on the slurmd service, then in addition to the /system/slurmd.service cgroup in the systemd fs, it will also create a /system/slurmd.service cgroup in the memory fs with the appropriate memory limits. By default for services, it simply creates a cgroup in the cpu,cpuacct controllers. For example with slurmd:

# cat /proc/`pgrep slurmd`/cgroup
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/
5:devices:/
4:memory:/
3:cpuacct,cpu:/system/slurmd.service
2:cpuset:/
1:name=systemd:/system/slurmd.service

When all processes of a cgroup end, systemd is notified through the execution of the release agent /lib/systemd/systemd-cgroups-agent in the 'systemd' fs. This program basically sends a DBUS notification to the systemd core daemon with the path of the empty cgroup as parameter. When the core daemon receives this DBUS notification, it looks through its internal data structures for all associated cgroups in all controller filesystems and deletes all of them. This is how all the cgroup controller filesystems are kept clean when they become empty.

slurm
=====

Well, then the question is: how could slurmd and its cgroup plugins work on top of that?

First, here is an excerpt of cgroups.txt in the Linux kernel documentation[2]:

"If an active hierarchy with exactly the same set of subsystems already exists, it will be reused for the new mount.
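The per-process view printed by `cat /proc/PID/cgroup` above can be parsed mechanically. Here is a small illustrative sketch (the helper name `parse_proc_cgroup` is my own, not a Slurm or systemd API), assuming the cgroup-v1 format where each line is `hierarchy-id:controller-list:path`:

```python
def parse_proc_cgroup(text):
    """Parse cgroup-v1 /proc/<pid>/cgroup content into {controller: path}.

    Each line has the form "<hierarchy-id>:<controllers>:<path>"; controllers
    mounted together (cpuacct,cpu) share one line, and named hierarchies such
    as systemd's appear as "name=<name>".
    """
    mapping = {}
    for line in text.strip().splitlines():
        _hier_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            mapping[controller] = path
    return mapping

# The slurmd example from above:
sample = """\
9:perf_event:/
4:memory:/
3:cpuacct,cpu:/system/slurmd.service
1:name=systemd:/system/slurmd.service"""
```

For instance, `parse_proc_cgroup(sample)["cpu"]` yields `/system/slurmd.service`, showing that slurmd itself lives only in the cgroups systemd created for its service.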
If no existing hierarchy matches, and any of the requested subsystems are in use in an existing hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy is activated, associated with the requested subsystems."

Therefore, if we configure slurmd to mount by itself (on other mountpoints) the already existing cpuset, memory and freezer controller filesystems, and to set its own release_agent for emptiness notification, it works. The cleanup at the end of jobs is correctly done by the Slurm release agent, and systemd does not complain. Here is the corresponding cgroup.conf:

CgroupMountpoint=/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"

(In slurm.conf, I enable the proctrack/cgroup and task/cgroup plugins, but I avoided jobacct_gather/none since it's still flagged as "experimental" in the doc.) Then:

# mkdir /cgroup
# mount -t tmpfs tmpfs /cgroup
# slurmd

My only source of sadness with this solution is the number of mounts:

tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,release_agent=/etc/slurm-llnl/cgroup/release_cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory,release_agent=/etc/slurm-llnl/cgroup/release_memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer,release_agent=/etc/slurm-llnl/cgroup/release_freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
tmpfs on /cgroup type tmpfs (rw,relatime)
cgroup on /cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer,release_agent=/etc/slurm-llnl/cgroup/release_freezer)
cgroup on /cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,release_agent=/etc/slurm-llnl/cgroup/release_cpuset)
cgroup on /cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory,release_agent=/etc/slurm-llnl/cgroup/release_memory)

Therefore, I tried to make Slurm use the controller filesystems already mounted by systemd. In cgroup.conf, it looks like this:

CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=no
#CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"

With this configuration, slurmd does not set the release_agent in the root directory of the controller filesystems (and commenting out the CgroupReleaseAgentDir parameter does not change anything):

# for controller in perf_event net_cls freezer devices memory cpu,cpuacct cpuset systemd; do echo "${controller}: $(cat /sys/fs/cgroup/${controller}/release_agent)"; done
perf_event:
net_cls:
freezer:
devices:
memory:
cpu,cpuacct:
cpuset:
systemd: /lib/systemd/systemd-cgroups-agent

When a job is launched, the cgroups are properly created by slurmd:

# find /sys/fs/cgroup/ -path '*slurm*tasks'
/sys/fs/cgroup/freezer/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/freezer/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/freezer/slurm/uid_1000/tasks
/sys/fs/cgroup/freezer/slurm/tasks
/sys/fs/cgroup/memory/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/memory/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/memory/slurm/uid_1000/tasks
/sys/fs/cgroup/memory/slurm/tasks
/sys/fs/cgroup/cpu,cpuacct/system/slurmd.service/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/tasks
/sys/fs/cgroup/cpuset/slurm/tasks
/sys/fs/cgroup/systemd/system/slurmd.service/tasks

(My job 27 has one batch step running `sleep 600`.)

# cat /proc/`pgrep sleep`/cgroup
9:perf_event:/
8:blkio:/
7:net_cls:/
6:freezer:/slurm/uid_1000/job_27/step_batch
5:devices:/
4:memory:/slurm/uid_1000/job_27/step_batch
3:cpuacct,cpu:/system/slurmd.service
2:cpuset:/slurm/uid_1000/job_27/step_batch
1:name=systemd:/system/slurmd.service

But when I cancel the job, some garbage is left in the cgroup filesystems:

# find /sys/fs/cgroup/ -path '*slurm*tasks'
/sys/fs/cgroup/freezer/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/freezer/slurm/uid_1000/tasks
/sys/fs/cgroup/freezer/slurm/tasks
/sys/fs/cgroup/memory/slurm/tasks
/sys/fs/cgroup/cpu,cpuacct/system/slurmd.service/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/step_batch/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/job_27/tasks
/sys/fs/cgroup/cpuset/slurm/uid_1000/tasks
/sys/fs/cgroup/cpuset/slurm/tasks
/sys/fs/cgroup/systemd/system/slurmd.service/tasks

The systemd release agent was not called by Linux, since the cgroups were not present in the 'systemd' fs. The Slurm release script was not called either, since it was not set as release_agent in the controller filesystems. But strangely, the memory controller has been totally cleaned, and the step_batch cgroup in the freezer controller has vanished. Actually, I figured out that there is some cleanup logic which explains that result in:

- _slurm_cgroup_destroy(), called by fini() in src/plugins/proctrack/cgroup/proctrack_cgroup.c
- task_cgroup_memory_fini() in src/plugins/task/cgroup/task_cgroup_memory.c

But there is no equivalent logic in task_cgroup_cpuset_fini() in src/plugins/task/cgroup/task_cgroup_cpuset.c.

So finally, here come my questions:

- Is the cleanup logic in the plugins supposed to work for all controllers, and is it simply unfinished? And is the release_agent script just a workaround?
- Or is slurmd supposed to rely only on the release_agent for the cleanup, making the cleanup logic triggered by fini() in the plugins irrelevant?
- Or a mix of these that I just don't understand?
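Whichever component does the cleanup, it has to work bottom-up: a cgroup directory can only be removed once it contains no tasks and no child cgroups, so step_batch must go before job_27, and so on up to slurm/. A minimal sketch of that ordering (the helper name `cleanup_empty_cgroups` is my own; it operates on plain directories rather than a real cgroup filesystem, and real code would also have to check each cgroup's tasks file and tolerate EBUSY races):

```python
import os
import tempfile

def cleanup_empty_cgroups(root):
    """Remove empty directories bottom-up, deepest first, mimicking the
    order in which an emptied cgroup hierarchy must be torn down.

    Illustrative sketch on plain directories only."""
    removed = []
    # topdown=False makes os.walk yield children before their parents,
    # so each directory is already empty by the time we reach it.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(os.path.relpath(dirpath, root))
    return removed

# Simulate the leftover cpuset hierarchy shown above.
mount = tempfile.mkdtemp()
os.makedirs(os.path.join(mount, "slurm", "uid_1000", "job_27", "step_batch"))
removed = cleanup_empty_cgroups(mount)
```

After the call, `removed` lists the four directories from step_batch up to slurm, and the mount point itself is left in place, which matches what the partial fini() logic achieves for the memory controller but not for cpuset.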
I would be glad to have your insights on this matter :) I would also appreciate feedback from other people who have run tests with Slurm and systemd!

The funny thing about all of this is that it will become totally irrelevant with the upcoming releases of the Linux kernel (3.16+) and the ongoing effort on the cgroup unified hierarchy[3][4]! So if modifications are to be made to cgroup management in Slurm, it would be wise to take this into account.

[1] http://www.freedesktop.org/wiki/Software/systemd/
[2] https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
[3] http://lwn.net/Articles/601840/
[4] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroups/unified-hierarchy.txt

Thank you for having taken the time to read this!

Regards,
--
Rémi Palancher <[email protected]>
