[systemd-devel] [PATCHSET RE-RESEND] update unified hierarchy support
(sorry, of course forgot to attach the patches)
(bounced for not being subscribed, resending...)

Hello,

Unified hierarchy is available on the 4.5 kernel but there have been
several updates.

1. The __DEVEL__sane_behavior flag is gone. Unified hierarchy is now
   available as "cgroup2" filesystem type with its own super magic
   number.

2. "cgroup.populated" file is replaced with "populated" field of
   "cgroup.events" file.

3. A zombie task remains associated with the cgroup it was associated
   with at the time of death instead of being moved immediately to
   root. This means that pid to unit lookup may return a slice if the
   session or service unit the pid belonged to is already gone.

Three patches are attached addressing each of the above.

Thanks!

--
tejun

From 278a39f0a8fa34cd899c6a08e76626c987a4713e Mon Sep 17 00:00:00 2001
From: Tejun Heo
Date: Fri, 25 Mar 2016 11:38:50 -0400
Subject: [PATCH 1/3] core: update unified hierarchy support

Unified hierarchy is official as of Linux v4.5 and now available
through a new filesystem type, cgroup2, with its own super magic.
Update mount logic accordingly.

Signed-off-by: Tejun Heo
---
 src/basic/cgroup-util.c | 2 +-
 src/basic/missing.h     | 4 ++++
 src/core/mount-setup.c  | 2 +-
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/basic/cgroup-util.c b/src/basic/cgroup-util.c
index 56c1fca..5124b5b 100644
--- a/src/basic/cgroup-util.c
+++ b/src/basic/cgroup-util.c
@@ -2129,7 +2129,7 @@ int cg_unified(void) {
         if (statfs("/sys/fs/cgroup/", &fs) < 0)
                 return -errno;
 
-        if (F_TYPE_EQUAL(fs.f_type, CGROUP_SUPER_MAGIC))
+        if (F_TYPE_EQUAL(fs.f_type, CGROUP2_SUPER_MAGIC))
                 unified_cache = true;
         else if (F_TYPE_EQUAL(fs.f_type, TMPFS_MAGIC))
                 unified_cache = false;
diff --git a/src/basic/missing.h b/src/basic/missing.h
index 034e334..66cd592 100644
--- a/src/basic/missing.h
+++ b/src/basic/missing.h
@@ -437,6 +437,10 @@ struct btrfs_ioctl_quota_ctl_args {
 #define CGROUP_SUPER_MAGIC 0x27e0eb
 #endif
 
+#ifndef CGROUP2_SUPER_MAGIC
+#define CGROUP2_SUPER_MAGIC 0x63677270
+#endif
+
 #ifndef TMPFS_MAGIC
 #define TMPFS_MAGIC 0x01021994
 #endif
diff --git a/src/core/mount-setup.c b/src/core/mount-setup.c
index de1a361..32fe51c 100644
--- a/src/core/mount-setup.c
+++ b/src/core/mount-setup.c
@@ -94,7 +94,7 @@ static const MountPoint mount_table[] = {
 #endif
         { "tmpfs", "/run", "tmpfs", "mode=755", MS_NOSUID|MS_NODEV|MS_STRICTATIME,
           NULL, MNT_FATAL|MNT_IN_CONTAINER },
-        { "cgroup", "/sys/fs/cgroup", "cgroup", "__DEVEL__sane_behavior", MS_NOSUID|MS_NOEXEC|MS_NODEV,
+        { "cgroup", "/sys/fs/cgroup", "cgroup2", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV,
           cg_is_unified_wanted, MNT_FATAL|MNT_IN_CONTAINER },
         { "tmpfs", "/sys/fs/cgroup", "tmpfs", "mode=755", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_STRICTATIME,
           cg_is_legacy_wanted, MNT_FATAL|MNT_IN_CONTAINER },
--
2.5.5

From 0fed0c3cdebe72557db528572ed2c531e32e7d5a Mon Sep 17 00:00:00 2001
From: Tejun Heo
Date: Fri, 25 Mar 2016 11:38:50 -0400
Subject: [PATCH 2/3] core: update populated event handling in unified hierarchy

Earlier during the development of unified hierarchy, the populated
event was reported through the dedicated "cgroup.populated" file;
however, the interface was updated so that it's reported through the
"populated" field of "cgroup.events" file. Update populated event
handling logic accordingly.

Signed-off-by: Tejun Heo
---
 src/basic/cgroup-util.c    | 45 ++++++++++++++++++++++++++++++++++++---------
 src/basic/cgroup-util.h    | 2 ++
 src/core/cgroup.c          | 6 +++---
 src/nspawn/nspawn-cgroup.c | 3 +--
 4 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/src/basic/cgroup-util.c b/src/basic/cgroup-util.c
index 5124b5b..5043180 100644
--- a/src/basic/cgroup-util.c
+++ b/src/basic/cgroup-util.c
@@ -101,6 +101,39 @@ int cg_read_pid(FILE *f, pid_t *_pid) {
         return 1;
 }
 
+int cg_read_event(const char *controller, const char *path, const char *event,
+                  char **val)
+{
+        _cleanup_free_ char *events = NULL, *content = NULL;
+        char *p, *line;
+        int r;
+
+        r = cg_get_path(controller, path, "cgroup.events", &events);
+        if (r < 0)
+                return r;
+
+        r = read_full_file(events, &content, NULL);
+        if (r < 0)
+                return r;
+
+        p = content;
+        while ((line = strsep(&p, "\n"))) {
+
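
The second patch's description says the populated state is now read from the "populated" field of "cgroup.events". The patch text above is cut off in this archive, so the following is not a reconstruction of it. It is only a minimal standalone sketch of how such a flat "key value" file could be parsed once its content is in memory. The helper names events_lookup() and events_populated() are hypothetical and do not come from systemd.

```c
#include <string.h>

/* Hypothetical helper, not taken from the patch: look up `key` in the text
 * of a flat-keyed file such as cgroup.events ("populated 1\n...").
 * Returns 1 and copies the value into `val` if found, 0 if the key is absent. */
static int events_lookup(const char *content, const char *key,
                         char *val, size_t val_size)
{
        size_t key_len = strlen(key);
        const char *line = content;

        if (val_size == 0)
                return 0;

        while (line && *line) {
                const char *eol = strchr(line, '\n');
                size_t len = eol ? (size_t) (eol - line) : strlen(line);

                /* A matching line has the shape "<key><space><value>". */
                if (len > key_len &&
                    strncmp(line, key, key_len) == 0 &&
                    line[key_len] == ' ') {
                        size_t vlen = len - key_len - 1;

                        if (vlen >= val_size)
                                vlen = val_size - 1; /* truncate to fit */
                        memcpy(val, line + key_len + 1, vlen);
                        val[vlen] = '\0';
                        return 1;
                }

                line = eol ? eol + 1 : NULL;
        }

        return 0;
}

/* Returns 1 if the cgroup is populated, 0 if it is not,
 * and -1 if the "populated" field is missing. */
static int events_populated(const char *content)
{
        char val[16];

        if (!events_lookup(content, "populated", val, sizeof(val)))
                return -1;

        return strcmp(val, "1") == 0;
}
```

A caller would read the whole file into a buffer first and then pass the buffer to events_populated(). Keeping the parsing separate from the file I/O is a choice made here for testability, not something the patch is known to do.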
[systemd-devel] [PATCHSET RESEND] update unified hierarchy support
(bounced for not being subscribed, resending...)

Hello,

Unified hierarchy is available on the 4.5 kernel but there have been
several updates.

1. The __DEVEL__sane_behavior flag is gone. Unified hierarchy is now
   available as "cgroup2" filesystem type with its own super magic
   number.

2. "cgroup.populated" file is replaced with "populated" field of
   "cgroup.events" file.

3. A zombie task remains associated with the cgroup it was associated
   with at the time of death instead of being moved immediately to
   root. This means that pid to unit lookup may return a slice if the
   session or service unit the pid belonged to is already gone.

Three patches are attached addressing each of the above.

Thanks!

--
tejun
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
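
Item 1 of the cover letter says cgroup2 has its own super magic number, which makes the hierarchy type detectable with statfs(). The following is a minimal sketch of that check, assuming the magic values shown in the series (0x63677270 for cgroup2, 0x01021994 for tmpfs). The enum and the function names cg_layout_from_magic() and cg_detect_layout() are hypothetical, and the plain integer comparison is a simplification of the F_TYPE_EQUAL() macro that the systemd code uses.

```c
#include <errno.h>
#include <sys/vfs.h>

/* Fallback definitions, same values as in the patch series. */
#ifndef CGROUP2_SUPER_MAGIC
#define CGROUP2_SUPER_MAGIC 0x63677270
#endif
#ifndef TMPFS_MAGIC
#define TMPFS_MAGIC 0x01021994
#endif

/* Hypothetical names, used only for this illustration. */
enum cg_layout {
        CG_LAYOUT_UNKNOWN = 0,
        CG_LAYOUT_LEGACY,       /* tmpfs with per-controller mounts below it */
        CG_LAYOUT_UNIFIED,      /* cgroup2 mounted directly */
};

/* Map a filesystem magic number to a cgroup layout. */
static enum cg_layout cg_layout_from_magic(unsigned long f_type)
{
        if (f_type == CGROUP2_SUPER_MAGIC)
                return CG_LAYOUT_UNIFIED;
        if (f_type == TMPFS_MAGIC)
                return CG_LAYOUT_LEGACY;
        return CG_LAYOUT_UNKNOWN;
}

/* statfs() the given mount point (normally "/sys/fs/cgroup/") and classify
 * it. Returns 0 on success or a negative errno value on failure. */
static int cg_detect_layout(const char *mount_point, enum cg_layout *ret)
{
        struct statfs fs;

        if (statfs(mount_point, &fs) < 0)
                return -errno;

        *ret = cg_layout_from_magic((unsigned long) fs.f_type);
        return 0;
}
```

On a system booted with the unified hierarchy, cg_detect_layout("/sys/fs/cgroup/", &layout) would be expected to report CG_LAYOUT_UNIFIED. On a legacy setup the same path is a tmpfs, hence CG_LAYOUT_LEGACY.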
Re: [systemd-devel] [PATCH 1/4] cgroups: support for MemoryAndSwapLimit= setting
(cc'ing Johannes and quoting the whole body for context)

Hey, guys.

On Thu, Oct 10, 2013 at 10:28:16AM -0400, Tejun Heo wrote:
> Hello,
>
> On Thu, Oct 10, 2013 at 04:03:20PM +0200, Lennart Poettering wrote:
> > For example MemorySoftLimit is something we supported previously, but
> > which I recently removed because Tejun Heo (the kernel cgroup
> > maintainer, added to CC) suggested that the attribute wouldn't continue
> > to exist on the kernel side or at least not in this form.
>
> The problem with the current softlimit is that we currently aren't
> sure what it means. Its semantics is defined only by its
> implementation details with all its quirks and different parties
> interpret and use it differently. memcg people are trying to clear
> that up so I think it'd be worthwhile to wait to see what happens
> there.
>
> > Tejun, Mika sent patches to wrap memory.memsw.limit_in_bytes,
> > memory.kmem.limit_in_bytes, memory.soft_limit_in_bytes,
> > memory.kmem.tcp.limit_in_bytes in high-level systemd attributes. Could
> > you comment on the future of these attributes in the kernel? Should we
> > expose them in systemd?
> >
> > At the systemd hack fest in New Orleans we already discussed
> > memory.soft_limit_in_bytes and memory.memsw.limit_in_bytes and you
> > suggested not to expose them. What about the other two?
>
> Except for soft_limit_in_bytes, at least the meanings of the knobs are
> well-defined and stable, so I think it should be at least safe to
> expose those.
>
> > (I have the suspicion though that if we want to expose something we
> > probably want to expose a single knob that puts a limit on all kinds of
> > memory, regardless of "RAM", "swap", "kernel" or "tcp"...)
>
> Yeah, the different knobs grew organically to cover more stuff which
> wasn't covered before, so, yeah, when viewed together, they don't
> really make a cohesive sense. Another problem is that, enabling kmem
> knobs would involve noticeable amount of extra overhead. kmem also
> has restrictions on when it can be enabled - it can't be enabled on a
> populated cgroup.
>
> Maybe an approach which makes sense is where one sets the amount of
> memory which can be used and toggle which types of memory should be
> included in the accounting. Setting kmem limit equal to that of
> limit_in_bytes makes limit_in_bytes applied to both kernel and user
> memories. I'll ask memcg people and find out how viable such approach
> is.

I talked with Johannes about the knobs and think something like the
following could be useful.

* A swap knob, which, when set, configures memsw.limit_in_bytes to
  memory.limit_in_bytes + the set value.

* A switch to enable kmem. When enabled, kmem.limit_in_bytes tracks
  memory.limit_in_bytes. ie. kmem is accounted and both kernel and
  user memory live under the same memory limit.

* A kmem knob which can be optionally configured to a lower value than
  memory.limit_in_bytes. This is useful for overcommit scenarios as
  explained in Documentation/cgroups/memory.txt::2.7.3.

* tcp knobs are currently completely separate from other memory
  limits. This should probably be included in memory.limit_in_bytes.
  I think it probably is a better idea to hold off on this one.

* What softlimit means is still very unclear. We might end up with
  explicit guarantee knob and keep softlimit as it is, whatever it
  currently means.

Caveats

* This setup doesn't allow setting (memory + swap) limit without
  setting memory limit.

* The overcommit scenario described in memory.txt::2.7.3 is somewhat
  bogus because not all userland memory is reclaimable and not all
  kernel memory is unreclaimable. Oh well...

Thanks.

--
tejun
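
The swap and kmem bullets in the proposal above boil down to simple arithmetic on memory.limit_in_bytes. The following is a small worked sketch of that arithmetic, not an existing systemd or kernel interface. The function names memsw_limit() and kmem_limit() are hypothetical, and so are three conventions: treating a kmem value of 0 as "track the memory limit", clamping a larger kmem value down to the memory limit, and saturating on overflow.

```c
#include <stdint.h>

/* Swap knob: memsw.limit_in_bytes = memory.limit_in_bytes + swap allowance.
 * Saturates at UINT64_MAX instead of wrapping around (an assumption made
 * for this sketch). */
static uint64_t memsw_limit(uint64_t memory_limit, uint64_t swap_allowance)
{
        if (swap_allowance > UINT64_MAX - memory_limit)
                return UINT64_MAX;

        return memory_limit + swap_allowance;
}

/* kmem knob: with no override (0, a convention invented here) the kmem limit
 * tracks memory.limit_in_bytes, so kernel and user memory share one limit.
 * An override may lower the kmem limit; a value above the memory limit is
 * clamped, since the proposal only describes configuring a lower value. */
static uint64_t kmem_limit(uint64_t memory_limit, uint64_t kmem_override)
{
        if (kmem_override == 0 || kmem_override > memory_limit)
                return memory_limit;

        return kmem_override;
}
```

For example, a 1 GiB memory limit with a 512 MiB swap allowance yields a memsw limit of 1.5 GiB. The first caveat is visible in the signature: without a memory limit to pass as memory_limit there is nothing to derive the (memory + swap) limit from.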
Re: [systemd-devel] [PATCH 1/4] cgroups: support for MemoryAndSwapLimit= setting
Hello,

On Thu, Oct 10, 2013 at 04:03:20PM +0200, Lennart Poettering wrote:
> For example MemorySoftLimit is something we supported previously, but
> which I recently removed because Tejun Heo (the kernel cgroup
> maintainer, added to CC) suggested that the attribute wouldn't continue
> to exist on the kernel side or at least not in this form.

The problem with the current softlimit is that we currently aren't
sure what it means. Its semantics is defined only by its
implementation details with all its quirks and different parties
interpret and use it differently. memcg people are trying to clear
that up so I think it'd be worthwhile to wait to see what happens
there.

> Tejun, Mika sent patches to wrap memory.memsw.limit_in_bytes,
> memory.kmem.limit_in_bytes, memory.soft_limit_in_bytes,
> memory.kmem.tcp.limit_in_bytes in high-level systemd attributes. Could
> you comment on the future of these attributes in the kernel? Should we
> expose them in systemd?
>
> At the systemd hack fest in New Orleans we already discussed
> memory.soft_limit_in_bytes and memory.memsw.limit_in_bytes and you
> suggested not to expose them. What about the other two?

Except for soft_limit_in_bytes, at least the meanings of the knobs are
well-defined and stable, so I think it should be at least safe to
expose those.

> (I have the suspicion though that if we want to expose something we
> probably want to expose a single knob that puts a limit on all kinds of
> memory, regardless of "RAM", "swap", "kernel" or "tcp"...)

Yeah, the different knobs grew organically to cover more stuff which
wasn't covered before, so, yeah, when viewed together, they don't
really make a cohesive sense. Another problem is that, enabling kmem
knobs would involve noticeable amount of extra overhead. kmem also
has restrictions on when it can be enabled - it can't be enabled on a
populated cgroup.

Maybe an approach which makes sense is where one sets the amount of
memory which can be used and toggle which types of memory should be
included in the accounting. Setting kmem limit equal to that of
limit_in_bytes makes limit_in_bytes applied to both kernel and user
memories. I'll ask memcg people and find out how viable such approach
is.

Thanks!

--
tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello,

On Mon, Jun 24, 2013 at 4:38 PM, Andy Lutomirski wrote:
> Now I'm confused. I thought that support for multiple hierarchies was
> going away. Is it here to stay after all?

It is going to be deprecated but also stay around for quite a while.
That said, I didn't mean to use multiple hierarchies. I was saying
that if you build a sub-hierarchy in the unified hierarchy, you're
likely to get away with it in most cases.

Thanks.

--
tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello, Andy.

On Mon, Jun 24, 2013 at 04:27:17PM -0700, Andy Lutomirski wrote:
> I guess what I'm trying to say here is that many systems will rather
> fundamentally use systemd. Admins of those systems should still have
> access to a reasonably large subset of cgroup functionality. If the
> single-hierarchy model is going to prevent going around systemd and if
> systemd isn't going to expose all of the useful cgroup functionality,
> then perhaps there should be a way to separate systemd's hierarchy
> from the cgroup hierarchy.

I don't think systemd will prevent you from building your own
hierarchy on the side. It sure won't be properly supported and things
might break in corner cases / over time but if you're willing to take
such risks anyway...

In the long term tho, what should happen probably is examining use
cases like yours and then incorporating sensible mechanisms to support
that into the base system infrastructure. It might not be completely
identical but I'm sure over time we'll be able to find what are the
fundamental pieces and proper abstractions.

Right now, we're exposing way too much without even clearly
understanding what is being enabled. It is unsustainable.

Thanks.

--
tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello,

On Mon, Jun 24, 2013 at 04:01:07PM -0700, Andy Lutomirski wrote:
> So what is cgroup for? That is, what's the goal for what the new API
> should be able to do?

It is for controlling and distributing resources. That part doesn't
change. It's just not built to be used directly by individual
applications. It's an admin tool just like sysctl - be that admin a
human or userland base system.

There's a huge chasm between something which can be generally used by
normal applications and something which is restricted to admins and
base systems in terms of interface generality and stability, security,
how the abstractions fit together with the existing APIs and so on.
cgroup firmly belongs to the latter. It still serves the same purpose
but isn't, in a way, developed enough to be used directly by
individual applications and I'm not even sure we want or need to
develop it to such a level.

Thanks.

--
tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello,

On Mon, Jun 24, 2013 at 12:24:38PM -0700, Andy Lutomirski wrote:
> Because more things are becoming per cpu without the option of moving
> of per-cpu things on behalf of one cpu to another cpu. RCU is a nice
> exception.

Hmm... but in most cases it's per-cpu on the same cpu that initiated
the task. If a given CPU is just crunching numbers and IRQ affinity
is properly configured, the CPU shouldn't be bothered too much by
per-cpu work items. If there are, please let us know. We can hunt
them down.

> The functionality I care about is that a program can reliably and
> hierarchically subdivide system resources -- think rlimits but
> actually useful. I, and probably many other things, want this
> functionality. Yes, the current cgroup interface is awful, but it
> gets one thing right: it's a hierarchy.

And the hierarchy support was completely broken for many resource
controllers up until only several releases ago.

> I would argue that designing a kernel interface that requires exactly
> one userspace component to manage it and ties that one userspace
> component to something that can't easily be deployed everywhere (the
> init system) is as big a cheat as the old approach of sneaking bad
> APIs in through a filesystem was.

In terms of API, it is firmly at the level of sysctl. That's it.
While I agree that having a proper kernel API for hierarchical
resource management could be nice, that currently is out of scope.
We're already knee-deep in shit with the limited capabilities we're
trying to implement.

Also, I really don't think cgroup is the right interface for such
thing even if we get to that. It should be part of the usual
process/thread model, not this completely separate thing on the side.

> IOW, please, when designing this, please specify an API that programs
> are permitted to use, and let that API be reviewed.

cgroup is not that API and it's never gonna be in all likelihood. As
for systemd vs. non-systemd compatibility, I'm afraid I don't have a
good answer. This is still all in a pretty early phase and the proper
abstractions and APIs are being figured out. Hopefully, we'll
converge on a mostly compatible high-level abstraction which can be
presented regardless of the actual base system implementation.

Thanks.

--
tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello, Andy.

On Mon, Jun 24, 2013 at 11:49:05AM -0700, Andy Lutomirski wrote:
> > I have an idea where it should be headed in the long term but am not
> > sure about short-term solution. Given that the only sort of wide-spread
> > use case is virt kthreads, maybe it just needs to be special cased for
> > now. Not sure.
>
> I'll be okay (I think) if I can reliably set affinities of these
> threads. I'm currently doing it with cgroups.
>
> That being said, I don't like the direction that kernel thread magic
> affinity is going. It may be great for cache performance and reducing
> random bounding, but I have a scheduling-jitter-sensitive workload and
> I don't care about overall system throughput. I need the kernel to
> stay the f!&k off my important cpus, and arranging for this to happen
> is becoming increasingly complicated.

Why is it becoming increasingly complicated? The biggest change
probably was the shared workqueue pool implementation but that was
years ago and workqueue has grown pool attributes recently adding more
properly designed flexibility and, for example, adding default
affinity for !per-cpu workqueues should be pretty easy now. But
anyways, if it's an issue, it should be examined and properly solved
rather than hacking up a hacky solution with cgroup.

> cgroups are most certainly something that a binary can be aware of.
> It's not like a sysctl knob at all -- it's per process. I have lots

No, it definitely is not. Sure it is more granular than sysctl but
that's it. It exposes control knobs which are directly tied into
kernel implementation details. It is not a properly designed
programming API by any stretch of the imagination. It is an extreme
failure on the kernel side that that part hasn't been made crystal
clear from the beginning. I don't know how intentional it was but the
whole thing is completely botched.

cgroup *never* was held to the standard necessary for any widely
available API and many of the controls it exposes are exactly at the
level of sysctls. As the interface was a filesystem, it could evade
scrutiny and with the hierarchical organization also gave the
impression that it's something which can be used directly by
individual applications. It found a loophole in the way we implement
and police kernel APIs and then exploited it like there's no tomorrow.

We are firmly bound to maintain what already has been exposed from the
kernel side and I'm not gonna break any of them but the free-for-all
cgroup is broken and deprecated. It's gonna wither and fade away and
any attempt to reverse that will be met with extreme prejudice.

> of binaries that have worked quite well for a couple years that move
> themselves into different cgroups. I have no problem with a unified
> hierarchy, but I need control of my little piece of the hierarchy.
>
> I don't care if the interface to do so changes, but the basic
> functionality is important.

Whether you care or not is completely irrelevant. Individual binaries
widely incorporating cgroup details automatically binds the kernel.
It becomes excruciatingly painful to back out after a certain point.
I don't think we're there yet given the overall immaturity and
brokenness of cgroups and it's imperative that we back the hell out as
fast as possible before this insanity spreads any wider.

Thanks.

--
tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello,

On Mon, Jun 24, 2013 at 03:27:15PM +0200, Lennart Poettering wrote:
> On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:
>
> > 1. I put all the entire world into a separate, highly constrained
> > cgroup. My real-time code runs outside that cgroup. This seems to
> > be exactly what slices are for, but I need kernel threads to go in to
> > the constrained cgroup. Will systemd support this?
>
> I am not sure whether the ability to move kernel threads into cgroups
> will stay around at all, from the kernel side. Tejun, can you comment
> on this?

Any kernel threads with PF_NO_SETAFFINITY set already can't be removed
from the root cgroup. In general, I don't think moving kernel threads
into !root cgroups is a good idea. They're in most cases shared
resources and userland doesn't really have much idea what they're
actually doing, which is the fundamental issue.

Which kthreads are running on the kernel side and what they're doing
is strictly an implementation detail from the kernel side. There's no
effort from kernel side in keeping them stable and userland is likely
to get things completely wrong - e.g. many kernel threads named after
workqueues in any recent kernels don't actually do anything until the
system is under heavy memory pressure. Userland can't tell and has no
control over what's being executed where at all and that's the way it
should be.

That said, there are cases where certain async executions are
concretely bound to userland processes - say, (planned) aio updates,
virt drivers and so on. Right now, virt implements something pretty
hacky but I think they'll have to be tied closer to the usual process
mechanism - ie. they should be saying that these kthreads are serving
this process and should be treated as such in terms of resource
control rather than the current "move this kthread to this set of
cgroups, don't ask why" thing. Another not-well-thought-out aspect of
the current cgroup. :(

I have an idea where it should be headed in the long term but am not
sure about short-term solution. Given that the only sort of wide-spread
use case is virt kthreads, maybe it just needs to be special cased for
now. Not sure.

> > 2. I manage services and tasks outside systemd (for one thing, I
> > currently use Ubuntu, but even if I were on Fedora, I have a bunch
> > of fine-grained things that figure out how they're supposed to
> > allocate resources, and porting them to systemd just to keep working
> > in the new world order would be a PITA [1]).
> >
> > (cgroups have the odd feature that they are per-task, not per thread
> > group, and the systemd proposal seems likely to break anything that
> > actually wants task granularity. I may actually want to use this,
> > even though it's a bit evil -- my real-time thread groups have
> > non-real-time threads.)
>
> Here too, Tejun is pretty keen on removing the ability of splitting up
> threads into cgroups from the kernel, and will only allow this
> per-process. Tejun, please comment!

Yes, again, the biggest issue is how much of low-level cgroup details
become known to individual programs. Splitting threads into different
cgroups would in most cases mean that the binary itself would become
aware of cgroup and it's akin to burying sysctl knob tunings into
individual binaries. cgroup is not an interface for each individual
program to fiddle with. If certain thread-granular control is
absolutely necessary and justifiable, it's something to be added to
the existing thread API, not something to be bolted on using cgroups.

So, I'm quite strongly against allowing splitting threads of the same
process into different cgroups.

Thanks.

--
tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello,

On Mon, Jun 24, 2013 at 02:39:53PM +0100, Daniel P. Berrange wrote:
> On Mon, Jun 24, 2013 at 03:27:15PM +0200, Lennart Poettering wrote:
> > On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:
> >
> > > 1. I put all the entire world into a separate, highly constrained
> > > cgroup. My real-time code runs outside that cgroup. This seems to
> > > be exactly what slices are for, but I need kernel threads to go in to
> > > the constrained cgroup. Will systemd support this?
> >
> > I am not sure whether the ability to move kernel threads into cgroups
> > will stay around at all, from the kernel side. Tejun, can you comment
> > on this?
>
> KVM uses the vhost_net device for accelerating guest network I/O
> paths. This device creates a new kernel thread on each open(),
> and that kernel thread is attached to the cgroup associated
> with the process that open()d the device.
>
> If systemd allows for a process to be moved between cgroups, then
> it must also be capable of moving any associated kernel threads to
> the new cgroup at the same time. This co-placement of vhost-net
> threads with the KVM process, is very critical for I/O performance
> of KVM networking.

Yeah, the way virt drivers use cgroups right now is pretty hacky. I
was thinking about adding per-process workqueue which follows the
cgroup association of the process after the unified hierarchy and then
convert virt to use that. At any rate, those kthreads can be moved
via cgroup.procs, so unified hierarchy wouldn't break it from kernel
side. Not sure how the interface would look from systemd side tho.

Thanks.

--
tejun
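
The reply above notes that such kthreads can be moved via cgroup.procs. At the kernel interface level, moving means writing the PID into the cgroup.procs file of the destination cgroup. The following is a minimal sketch of that write. The function name cg_attach_pid() is hypothetical, the caller has to supply the full path of the target cgroup.procs file, and nothing here reflects how systemd itself would expose the operation.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move the process `pid` into a cgroup by writing its
 * PID to that cgroup's cgroup.procs file, whose full path is `procs_path`.
 * Returns 0 on success or a negative errno value on failure. */
static int cg_attach_pid(const char *procs_path, pid_t pid)
{
        char buf[32];
        ssize_t n;
        int len, fd, r = 0;

        len = snprintf(buf, sizeof(buf), "%ld\n", (long) pid);
        if (len < 0 || (size_t) len >= sizeof(buf))
                return -EINVAL;

        fd = open(procs_path, O_WRONLY);
        if (fd < 0)
                return -errno;

        /* The kernel performs the migration as part of this write, so a
         * refused move shows up here as a write error. */
        n = write(fd, buf, (size_t) len);
        if (n < 0)
                r = -errno;
        else if (n != (ssize_t) len)
                r = -EIO;

        close(fd);
        return r;
}
```

A call might look like cg_attach_pid("/sys/fs/cgroup/some-group/cgroup.procs", pid), where the path is made up for illustration. Since the result is whatever the kernel returns for the write, a thread the kernel refuses to move is reported as an error rather than silently ignored.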