Re: cgroup: status-quo and userland efforts
On Wed, 4 Mar 2015, Luke Kenneth Casson Leighton wrote: >> and why he concludes that having a single hierarchy for all resource types. correcting to add "is not always a good idea" > i think having a single hierarchy is fine *if* and only if it is possible to overlay something similar to SE/Linux policy files - enforced by the kernel *not* by userspace (sorry serge!) - such that through those policy files any type of hierarchy be it single or multi layer, recursive or in fact absolutely anything, may be emulated and properly enforced. The fundamental problem is that sometimes you have types of controls that are orthogonal to each other, and you either manage the two types of things in separate hierarchies, or you end up with one hierarchy that is a permutation of all the combinations of what would have been separate hierarchies. David Lang -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
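To put a rough number on the "permutation of all the combinations" point, here is a small sketch. The class names are invented for illustration and do not come from anyone's real setup; the merged count is the worst case, where every combination is actually in use.

```python
from itertools import product

# Hypothetical example values; neither list comes from the thread.
cpu_classes = ["batch", "interactive", "realtime"]
memory_classes = ["small", "medium", "large", "huge"]

# Separate hierarchies: each controller keeps its own short list of groups.
separate = len(cpu_classes) + len(memory_classes)

# Single hierarchy, worst case: one group per (cpu, memory) combination.
merged = [f"{c}-{m}" for c, m in product(cpu_classes, memory_classes)]

print(separate)     # 3 + 4 groups
print(len(merged))  # 3 * 4 groups
```

With two orthogonal controls the difference is small, but the merged count grows as a product while the separate count grows as a sum, so each additional independent control makes the single hierarchy bulkier.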
Re: cgroup: status-quo and userland efforts
On Wed, Mar 4, 2015 at 5:08 AM, David Lang wrote: > On Tue, 3 Mar 2015, Luke Leighton wrote: >> whilst the majority of people view management to be "hierarchical" >> (so there is a top dog or God process and everything trickles down >> from that), this is viewed as such an anathema in the security >> industry that someone came up with a formal specification for the >> real-world way in which permissions are managed, sorry i should have said "managed in the security esp. defense industry" >> and it's called the FLASK model. > > > On this topic it's also worth reading Neil Brown's series of articles on > this over at http://lwn.net/Articles/604609/ oo good background, thank you david. happily reading now :) > and why he concludes that having a single hierarchy for all resource types. i think having a single hierarchy is fine *if* and only if it is possible to overlay something similar to SE/Linux policy files - enforced by the kernel *not* by userspace (sorry serge!) - such that through those policy files any type of hierarchy be it single or multi layer, recursive or in fact absolutely anything, may be emulated and properly enforced. l.
Re: cgroup: status-quo and userland efforts
On Tue, 3 Mar 2015, Luke Leighton wrote: >> I wrote about that many times, but here are two of the problems. >> >> * There's no way to designate a cgroup to a resource, because cgroup >> is only defined by the combination of who's looking at it for which >> controller. That's how you end up with tagging the same resource >> multiple times for different controllers and even then it's broken >> as when you move resources from one cgroup to another, you can't >> tell what to do with other tags. >> >> While allowing obscene level of flexibility, multiple hierarchies >> destroy a very fundamental concept that it *should* provide - that >> of a resource container. It can't because a "cgroup" is undefined >> under multiple hierarchies. > > ok, there is an alternative to hierarchies, which has precedent (and, > importantly, a set of userspace management tools as well as existing code in > the linux kernel), and it's the FLASK model which you know as SE/Linux. > > whilst the majority of people view management to be "hierarchical" (so there > is a top dog or God process and everything trickles down from that), this is > viewed as such an anathema in the security industry that someone came up > with a formal specification for the real-world way in which permissions are > managed, and it's called the FLASK model. On this topic it's also worth reading Neil Brown's series of articles on this over at http://lwn.net/Articles/604609/ and why he concludes that having a single hierarchy for all resource types. David Lang
Re: [Workman-devel] cgroup: status-quo and userland efforts
Serge Hallyn writes: > > Quoting Daniel P. Berrange (berrange@...): > > Are you also planning to actually write a new cgroup parent manager > > daemon too ? Currently my plan for libvirt is to just talk directly > > I'm toying with the idea, yes. (Right now my toy runs in either native > mode, using cgroupfs, or child mode, talking to a parent manager) I'd > love if someone else does it, but it needs to be done. > > As I've said elsewhere in the thread, I see 2 problems to be addressed: > > 1. The ability to nest the cgroup manager daemons, so that a daemon > running in a container can talk to a daemon running on the host. This > is the problem my current toy is aiming to address. But the API it > exports is just a thin layer over cgroupfs. cool! that's funny, that sounds exactly like what i asked if you could provide, and it turns out that you already did :) so, in theoorryy. you could have this: * run the service on top of /dev/cgroups, republishing [a subset?] as /run/cgroups and some other parts as /run/cgroups2 * have PID1, instead of going directly to /dev/cgroups, to go to /run/cgroups *instead*. * have lxc, instead of going directly to /dev/cgroups, to go to /run/cgroups2 *instead*. the problem: as lennart mentions, PID1s such as systemd may be expecting to manage the setup of cgroups - entirely - for security or other initialisation reasons - *before* even the service that you've created, serge, is allowed to run. and *that's* why i suggested the idea of following what SE/Linux has done, which is to have policy files that compile down to a set of permissions that the (various) managers can and cannot do. bits of cgroup that they are and are not permitted to manage. flat at the kernel implementation level; hierarchical (or other) at the "compile-the-policy-file" level. l. 
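A minimal sketch of the "hierarchical at the compile-the-policy-file level, flat at the enforcement level" idea follows. The policy layout, the manager names and both function names are assumptions made up for this illustration; nothing here is an existing kernel or SE/Linux interface.

```python
def compile_policy(tree):
    """Flatten a nested policy into {cgroup_path: set_of_managers}.

    A manager granted a node may also manage everything beneath it.
    """
    flat = {}

    def walk(node, path, inherited):
        allowed = inherited | set(node.get("managers", []))
        flat[path or "/"] = allowed
        for name, child in node.get("children", {}).items():
            walk(child, f"{path}/{name}", allowed)

    walk(tree, "", set())
    return flat


def may_manage(flat, manager, path):
    # The enforcement side only ever consults the flat table.
    return manager in flat.get(path, set())


# Hypothetical policy: PID1 owns everything, lxc only its own subtree.
policy = {
    "managers": ["pid1"],
    "children": {
        "system": {},
        "lxc": {"managers": ["lxc"], "children": {"c1": {}}},
    },
}
flat = compile_policy(policy)
print(may_manage(flat, "lxc", "/lxc/c1"))   # lxc inside its own subtree
print(may_manage(flat, "lxc", "/system"))   # lxc outside its subtree
```

The point of the split is that the checker stays a plain lookup; any notion of nesting lives only in the tool that produces the table.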
Re: cgroup: status-quo and userland efforts
Tejun Heo writes: > > Hello, Serge. > > On Thu, Jun 27, 2013 at 08:22:06AM -0500, Serge Hallyn wrote: > > At some point (probably soon) we might want to talk about a standard API > > for these things. However I think it will have to come in the form of > > a standard library, which knows to either send requests over dbus to > > systemd, or over /dev/cgroup sock to the manager. > > Yeah, eventually, I think we'll have a standardized way to configure > resource distribution in the system. Maybe we'll agree on a > standardized dbus protocol or there will be a library, I don't know; > however, whatever form it may be in, its abstraction level should be > way higher than that of direct cgroupfs access. It's way too low > level and very easy to end up in a complete nonsense configuration. just because it sounds easy to end up in a complete nonsense configuration does not mean that the entire API should be abandoned. instead, it sounds to me like there should be explicit policies (taking a leaf out of SE/Linux's book) on what is and is not permitted. i think you'll find that that is much more acceptable [to have explicit policy files which define what can and can't be done]. it then becomes possible to define "sensible and sane" default policies for the average situation, whilst also allowing for more complex cases to be created by those people who really really know what they're doing. the "ridiculous counterexample" to what you are suggesting is that just because "rm -fr /*" does such a lot of damage, rm should have its "-r" option removed. perhaps a better example would involve rsync, which even as far back as 1999 had already run out of lowercase _and_ uppercase letters to use as options... but i can't think of one because rsync is awesome :) l.
Re: cgroup: status-quo and userland efforts
Serge Hallyn writes: > > Quoting Tim Hockin (thockin@...): > > > FWIW, the code is too embarrassing yet to see daylight, but I'm playing > > > with a very lowlevel cgroup manager which supports nesting itself. > > > Access in this POC is low-level ("set freezer.state to THAWED for cgroup > > > /c1/c2", "Create /c3"), but the key feature is that it can run in two > > > modes - native mode in which it uses cgroupfs, and child mode where it > > > talks to a parent manager to make the changes. > > > > In this world, are users able to read cgroup files, or do they have to > > go through a central agent, too? > > The agent won't itself do anything to stop access through cgroupfs, but > the idea would be that cgroupfs would only be mounted in the agent's > mntns. My hope would be that the libcgroup commands (like cgexec, > cgcreate, etc) would know to talk to the agent when possible, and users > would use those. serge, i realise this is a year on, so you probably have something at least working by now... but i have a possibly crazy idea.. would it be possible or convenient for the agent that you are writing to emulate - in userspace - the *exact* same interface as /dev/cgroups, providing a controlled hierarchy yet presenting itself to other processes in such a way that its hierarchical management would be completely transparent to anything that used it? including of course a new instance of the agent itself, in a recursive fashion :) the important question on top of this would be: is there anything that needs to be atomic which emulation of the /dev/cgroups kernel API in userspace could not handle? l.
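For readers trying to picture the two modes, here is a toy sketch. It is not Serge's code (which had not been published at this point); the class names are invented, a scratch directory stands in for a cgroupfs mount, and a direct method call stands in for whatever socket or dbus transport a real child manager would use. Paths containing ".." are not filtered, and atomicity is not addressed at all.

```python
import os
import tempfile


class NativeBackend:
    """Native mode: acts directly on a cgroupfs-like directory tree."""

    def __init__(self, root):
        self.root = root

    def _path(self, cgroup):
        return os.path.join(self.root, cgroup.strip("/"))

    def create(self, cgroup):
        os.makedirs(self._path(cgroup), exist_ok=True)

    def set_value(self, cgroup, key, value):
        with open(os.path.join(self._path(cgroup), key), "w") as f:
            f.write(value)


class ChildBackend:
    """Child mode: forwards every request to a parent manager,
    prefixing the path with the subtree this manager was given."""

    def __init__(self, parent, subtree):
        self.parent = parent
        self.subtree = subtree.strip("/")

    def _path(self, cgroup):
        return "/".join(p for p in (self.subtree, cgroup.strip("/")) if p)

    def create(self, cgroup):
        self.parent.create(self._path(cgroup))

    def set_value(self, cgroup, key, value):
        self.parent.set_value(self._path(cgroup), key, value)


# Demo: a scratch directory plays the role of the host's cgroupfs mount.
root = tempfile.mkdtemp()
host = NativeBackend(root)
host.create("/c1")
nested = ChildBackend(host, "/c1")    # manager running inside container c1
nested.create("/c2")
nested.set_value("/c2", "freezer.state", "THAWED")
inner = ChildBackend(nested, "/c2")   # a manager nested one level deeper
inner.create("/c3")
```

Because both backends expose the same two methods, a child can be stacked on a child, which is the recursive arrangement asked about above.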
Re: cgroup: status-quo and userland efforts
Tejun Heo writes: > > Hello, Tim. > > On Fri, Jun 28, 2013 at 11:44:23AM -0700, Tim Hockin wrote: > The goal is to reach sane and widely useable / useful state with > minimum amount of complexity. Maintaining backward compatibility for > some period - likely quite a few years - while still allowing future > development is a pretty important consideration. Another factor is > that the general situation has been more or less atrocious and cgroup > as a whole has been failing in the very basic places, which also > reinforces the drive for simplicity. was it einstein who said that something should be made as simple as it needs to be... but no simpler? > That said, I still don't know very well the scope and severity of the > problems you guys might face from the loss of multiple orthogonal > hierarchies. i think he made it very clear that it would be utterly catastrophic, with the cost being millions of dollars or more. the thing is, if you compare a "normal" company or individual user(s) needs, the numbers of such users may be large but each one has only one or a few machines. but in this case, it's just "one person" (tim) saying "i represent hundreds of thousands of machines, here, being adversely affected by these discussions". so he feels that you *should* be lending far more weight to what he's saying *but*... see below... > So, can you please explain the issues that you've experienced and are > foreseeing in detail with their contexts? ie. if you have certain > requirement, please give at least brief explanation on where such > requirement is coming from and how important the requirement is. well... that's the problem, tejun: he's not permitted to. he's under NDA. thus the "weighting" gets multiplied by... a number significantly less than 1e-5... oops :) l.
Re: cgroup: status-quo and userland efforts
Tejun Heo writes: > > Hello, Tim. > > On Wed, Jun 26, 2013 at 08:42:21PM -0700, Tim Hockin wrote: > > OK, then what I don't know is what is the new interface? A new cgroupfs? > > It's gonna be a new mount option for cgroupfs. > > > DTF and CPU and cpuset all have "default" groups for some tasks (and > > not others) in our world today. DTF actually has default, prio, and > > "normal". I was simplifying before. I really wish it were as simple > > as you think it is. But if it were, do you think I'd still be > > arguing? > > How am I supposed to know when you don't communicate it but just wave > your hands saying it's all very complicated? i'd say that tejun's got you there, tim. how is anyone supposed to understand or help you to support what your team is doing if the entire work - no matter how good it is - is kept secret and proprietary? we *know* that secret and proprietary is risky, so why is the company that you work for indulging itself in such dangerous practices, especially when there appears to be so much at risk here if the only mindshare for the work you're doing exists solely and exclusively in some "secret lair"? my suggestion to you would be to urgently, *urgently* get the *entire* set of tools and documentation surrounding what is clearly mission critical infrastructure released *immediately* as a software libre project. and the second suggestion would - if they are amenable - to hire tejun and any of his associates - to come over for as long as possible and necessary to review what you've been doing, on site, giving them carte blanche (or even a remit) to update and refine the online documentation. without that happening - without there being publicly-available documentation - i really don't see how you can be expected to ask tejun to understand the complexity of what the team needs, when the majority of what you want - and need! - to say you *can't*... because you're under some bloody stupid NDA! that's... insane! 
you *need mindshare*: that means releasing the tools and documentation as a software libre project so that, if nothing else, there's other people whom the company you work for can poach when they get proficient at working with it :) l.
Re: cgroup: status-quo and userland efforts
Tejun Heo writes: > I don't really understand your example anyway because you can classify > by DTF / non-DTF first and then just propagate cpuset settings along. > You won't lose anything that way, right? without spoiling the fun by reading ahead, based on the extreme complexity of what tim's team have spent probably man-decades possibly even getting on for a man-century getting right, i'm guessing two things: (a) that he will have said "we lose everything we worked to achieve over the past few years" and (b) "what we have now, whilst extremely complex, works really really well: why would we even remotely contemplate changing / losing it / replacing it with something that, from our deep level of expertise which we seem unable to get across to you quite how complex it is, we *know* will simply not possibly be adequate". tim: the only thing i can suggest here which may help is that you discuss seriously amongst the team as to whether to fork the functionality present in the linux kernel re hierarchical cgroups, and to maintain it indefinitely. > I wrote about that many times, but here are two of the problems. > > * There's no way to designate a cgroup to a resource, because cgroup > is only defined by the combination of who's looking at it for which > controller. That's how you end up with tagging the same resource > multiple times for different controllers and even then it's broken > as when you move resources from one cgroup to another, you can't > tell what to do with other tags. > > While allowing obscene level of flexibility, multiple hierarchies > destroy a very fundamental concept that it *should* provide - that > of a resource container. It can't because a "cgroup" is undefined > under multiple hierarchies. ok, there is an alternative to hierarchies, which has precedent (and, importantly, a set of userspace management tools as well as existing code in the linux kernel), and it's the FLASK model which you know as SE/Linux. 
whilst the majority of people view management to be "hierarchical" (so there is a top dog or God process and everything trickles down from that), this is viewed as such an anathema in the security industry that someone came up with a formal specification for the real-world way in which permissions are managed, and it's called the FLASK model. basically you have a security policy which may, in its extreme limits, either contain absolutely all and any permissions (in the case of SE/Linux that's quite literally every single system call), or it may contain absolutely none. *but* - and this is the key bit: when a process exec's a new one, there is *no correlation* between the amount of permissions that the new child process has and its parent. in other words, the security policy *may* say that a parent may exec a process which has *more* permissions (or even an entirely different set) than the parent. in other words there *is* no hierarchy. it's all "flat", with inter-relationships. now, the way in which the security policy is expressed is in an m4 macro language that may contain wildcards and includes and macros and functions and so on, meaning that its expression can be kept really quite simple if properly managed (and the SE/Linux team do an extraordinarily good job of doing exactly that). basically the reason why i mention this, tejun, is because it has distinct advantages. intuitively i am guessing that the reason why you are freaking out about hierarchies is because it is effectively potentially infinite depth. the reason why i mention SE/Linux is because it is effectively completely flat, and the responsibility for creating hierarchies (or not) is down to the userspace tools that compile the m4 macros into the binary files that the kernel reads and acts upon. 
so i think you'll find that if you investigate this approach and copy it, you should be able to keep the inherent simplicity of a "unified" underlying approach, but not have tim's team freaking out because they would be able to create policy files based on a hierarchical arrangement. it would also mean that policies could be written that ensure lxc doesn't need to get rewritten; PID1 could be allocated specific permissions that it can manage, and so on. does that make any sense? l.
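To make the "no correlation between parent and child permissions" point concrete, here is a toy model. The domain names, the permission strings and both tables are invented for this example; this is neither SE/Linux policy syntax nor its implementation, only the shape of the idea.

```python
# Hypothetical domains and permissions, for illustration only.
ALLOW = {
    "init_t":   {"cgroup:create"},
    "lxc_t":    {"cgroup:create", "cgroup:attach", "freezer:write"},
    "backup_t": set(),
}

# (parent domain, type of the executable) -> domain of the new process
TRANSITIONS = {
    ("init_t", "lxc_exec_t"): "lxc_t",
    ("lxc_t", "backup_exec_t"): "backup_t",
}


def domain_after_exec(parent_domain, exec_type):
    # With no matching rule the new process stays in its parent's domain.
    return TRANSITIONS.get((parent_domain, exec_type), parent_domain)


def allowed(domain, permission):
    # Flat lookup: the answer never depends on who the parent was.
    return permission in ALLOW.get(domain, set())


child = domain_after_exec("init_t", "lxc_exec_t")
print(child)
print(allowed(child, "freezer:write"), allowed("init_t", "freezer:write"))
```

Here the child domain ends up with permissions its parent never had, and its own child (backup_t) has none at all, so the tables describe relationships between domains rather than a tree.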
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Mon 15-07-13 14:49:40, Vivek Goyal wrote: > On Sun, Jun 30, 2013 at 08:38:38PM +0200, Michal Hocko wrote: > > On Fri 28-06-13 14:01:55, Vivek Goyal wrote: > > > On Fri, Jun 28, 2013 at 05:05:13PM +0200, Michal Hocko wrote: > > [...] > > > > OK, so libcgroup's rules daemon will still work and place my tasks in > > > > appropriate cgroups? > > > > > > Do you use that daemon in practice? > > > > I am not but my users do. And that is why I care. > > Michael, > > would you have more details of how those users are exactly using > rules engine daemon. The most common usage is uid and exec names. > To me rulesengined processed 3 kinds of rules. > > - uid based > - gid based > - exec file path based > > uid/gid based rule execution can be taken care by pam_cgroup module too. > So I think one should not need cgrulesengined for that. I am not familiar with pam_cgroup much but it is a part of libcgroup package, right? > I am curious what kind of exec rules are useful. Any placement of > services one can do using systemd. So only executables we are left > to manage are which are not services. Yes, those are usually backup processes which should not disrupt the regular server workload. uid ones are used to keep a leash on local users of the machine but i do not have many details as I usually do not have access to those machines. All I see are complaints when something explodes ;) -- Michal Hocko SUSE Labs
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Sun, Jun 30, 2013 at 08:38:38PM +0200, Michal Hocko wrote: > On Fri 28-06-13 14:01:55, Vivek Goyal wrote: > > On Fri, Jun 28, 2013 at 05:05:13PM +0200, Michal Hocko wrote: > [...] > > > OK, so libcgroup's rules daemon will still work and place my tasks in > > > appropriate cgroups? > > > > Do you use that daemon in practice? > > I am not but my users do. And that is why I care. Michael, would you have more details of how those users are exactly using rules engine daemon. To me rulesengined processed 3 kinds of rules. - uid based - gid based - exec file path based uid/gid based rule execution can be taken care by pam_cgroup module too. So I think one should not need cgrulesengined for that. I am curious what kind of exec rules are useful. Any placement of services one can do using systemd. So only executables we are left to manage are which are not services. In practice is it very useful for an admin to say if "firefox" is launched by a user then it should run in xyz cgroup? And if user cares about firefox running in a sub cgroup, then it can always use cgexec to do that. Thanks Vivek
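For anyone who has not used the rules daemon, the three rule kinds listed above can be pictured roughly as follows. This is a sketch only: the rule table, its ordering (first match wins) and the classify() helper are assumptions for illustration, not the cgrules.conf format and not how the libcgroup daemon is implemented.

```python
import fnmatch

# (kind, pattern, cgroup); first match wins. All values are made up.
RULES = [
    ("exec", "/usr/bin/firefox", "desktop/browser"),
    ("exec", "/usr/local/bin/backup-*", "background/backup"),
    ("uid", 1000, "users/alice"),
    ("gid", 100, "users/common"),
]


def classify(uid, gid, exe, rules=RULES, default="/"):
    """Return the cgroup a new task would be placed in."""
    for kind, pattern, cgroup in rules:
        if kind == "uid" and uid == pattern:
            return cgroup
        if kind == "gid" and gid == pattern:
            return cgroup
        if kind == "exec" and fnmatch.fnmatch(exe, pattern):
            return cgroup
    return default


print(classify(1000, 100, "/usr/bin/firefox"))
print(classify(2000, 200, "/usr/local/bin/backup-nightly"))
```

The backup case is the sort of exec rule Michal describes: the executable is not a service, so nothing else would place it, yet it should not compete with the regular workload.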
Re: cgroup: status-quo and userland efforts
On Wed, 3 Jul 2013, Kay Sievers wrote: > >> > But that's not my point. It seems pretty easy to make this cgroup > >> > management (in "native mode") a library that can have either a thin > >> > veneer of a main() function, while also being usable by systemd. The > >> > point is to solve all of the problems ONCE. I'm trying to make the > >> > case that systemd itself should be focusing on features and policies > >> > and awesome APIs. > >> > >> You know, getting this all right isn't easy. If you want to do things > >> properly, then you need to propagate attribute changes between the units > >> you > >> manage. You also need something like a scheduler, since a number of > >> controllers can only be configured under certain external conditions (for > >> example: the blkio or devices controller use major/minor parameters for > >> configuring per-device limits. Since major/minor assignments are pretty > >> much > >> unpredictable these days -- and users probably want to configure things > >> with > >> friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us > >> to > >> wait for devices to show up before we can configure the parameters.) Soo... > >> you need a graph of units, where you can propagate things, and schedule > >> things > >> based on some execution/event queue. And the propagation and scheduling are > >> closely intermingled. > > > > you are confusing policy and mechanisms. > > > > The access to cgroupfs is mechanism. > > > > The propagation of changes, the scheduling of cgroupfs access and > > the correlation to external conditions are policy. > > > > What Tim is asking for is to have a common interface, i.e. a library > > which implements the low level access to the cgroupfs mechanism > > without imposing systemd defined policies to it (It might implement a > > set of common useful policies, but that's a different discussion). 
> > > > That's definitely not an unreasonable request, because he wants to > > implement his own set of policies which are not necessarily the same > > as those which are implemented by systemd. > > > > You are simply ignoring the fact, that Linux is used in other ways > > than those which you are focussed on. That's true for Google's way to > > manage its gazillion machines and that's equally true for the other > > end of the spectrum which is deep embedded or any other specialized > > use case. Just face it: running Linux on your laptop and on some RHT > > lab machines is covering about 1% of the use cases. > > > > Nevertheless you repeatedly claim, that systemd is the only way to > > deal with system startup and system management, is covering _ALL_ use > > cases and the interfaces you expose are sufficient. > > > > Did you ever work on specialized embedded or big data use cases? I > > really doubt that, but I might be wrong as usual. > > > > So I invite you to prove that you can beat an existing setup for an > > automotive use case with your magic systemd foo. I refund you fully, > > if you can beat the mark of a functional system less than 800ms after > > reset release on a 200MHz ARM machine. Functional is defined by the > > use case requirements and means: > > > > - Basic cgroups management working > > - GUI up and running > > - Main communication interface (CAN bus) up and running > > > > The rest of the system is starting up after that including a more > > complex cgroup management. > > > > According to your claim that systemd is covering everything and some > > more, this should take you a few hours. I grant you a full week to > > work on that. > > > > The use case Tim is talking about is different, but has similar > > constraints which are completely driven by his particular use case > > scenario. I'm sure, that Tim can persuade his management to setup a > > similar contest to prove your expertise on the other extreme of the > > Linux world. 
> > > > Before answering please think about the relevance of your statements > > "getting this all right isn't easy", "something like a scheduler", > > "users probably want ..." and "stable /dev/disk/by-id/* symlinks" in > > those contexts. > > I don't think anybody needs your money. > > But it's sure an improvement over last time when you wanted to use a > "Kantholz" to make your statement. Now how about the policy vs. mechanisms part of Thomas' e-mail? -- Jiri Kosina SUSE Labs
Re: cgroup: status-quo and userland efforts
On Wed, 2013-07-03 at 01:57 +0200, Thomas Gleixner wrote: > Lennart, > > On Sun, 30 Jun 2013, Lennart Poettering wrote: > > On 29.06.2013 05:05, Tim Hockin wrote: > > > But that's not my point. It seems pretty easy to make this cgroup > > > management (in "native mode") a library that can have either a thin > > > veneer of a main() function, while also being usable by systemd. The > > > point is to solve all of the problems ONCE. I'm trying to make the > > > case that systemd itself should be focusing on features and policies > > > and awesome APIs. > > > > You know, getting this all right isn't easy. If you want to do things > > properly, then you need to propagate attribute changes between the units you > > manage. You also need something like a scheduler, since a number of > > controllers can only be configured under certain external conditions (for > > example: the blkio or devices controller use major/minor parameters for > > configuring per-device limits. Since major/minor assignments are pretty much > > unpredictable these days -- and users probably want to configure things with > > friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to > > wait for devices to show up before we can configure the parameters.) Soo... > > you need a graph of units, where you can propagate things, and schedule > > things > > based on some execution/event queue. And the propagation and scheduling are > > closely intermingled. > > you are confusing policy and mechanisms. > > The access to cgroupfs is mechanism. > > The propagation of changes, the scheduling of cgroupfs access and > the correlation to external conditions are policy. > > What Tim is asking for is to have a common interface, i.e. a library > which implements the low level access to the cgroupfs mechanism > without imposing systemd defined policies to it (It might implement a > set of common useful policies, but that's a different discussion). 
> > That's definitely not an unreasonable request, because he wants to > implement his own set of policies which are not necessarily the same > as those which are implemented by systemd. Could I just add a me too to this from Parallels. We need the ability to impose our own container policy on the kernel mechanisms. Perhaps I should step back a bit and say first of all that we all use the word "container" a lot, but if you analyse what we mean, you'll find that a Google container is different from a Parallels/OpenVZ container which is different from an LXC container and so on. How we all build our containers is a policy we impose on the various cgroup and namespace mechanisms within the kernel. We've spent a lot of discussion time over the years making sure that the kernel mechanisms support all of our different use cases, so I really don't want to see that change in the name of simplifying the API. I also don't think any quest for the one true container will be successful for the simple reason that containers are best when tuned for the job they're doing. For instance at Parallels we do IaaS containers. That means we can take a container, boot up any old Linux OS inside it and give you root on it in exactly the same way as you could for a virtual machine. Google does something more like application containers for job control and some network companies do pure namespace containers without any cgroup controllers at all. There's no one container description that would fit all use cases. So where we are is that the current APIs may be messy, but they support all use cases and all container structure policies. If anyone, systemd included, wants to do a new API, it must support all use cases as well. Ideally, it should be agreed to and in the kernel as well rather than having some userspace filter. 
James
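On the major/minor point in the quoted text above, the mechanism half is small. The sketch below resolves a stable device path to the "major:minor limit" form that the blkio throttle files take; both helper names are made up for this example, and it does not check that the path really is a block device. Deciding what to do when the device has not appeared yet (wait, retry, give up) is exactly the policy part and is deliberately left to the caller.

```python
import os


def format_limit(rdev, bytes_per_second):
    """Format a device number and a limit as 'major:minor limit'."""
    return f"{os.major(rdev)}:{os.minor(rdev)} {bytes_per_second}"


def throttle_line(device_path, bytes_per_second):
    """Build the line for a path such as /dev/disk/by-id/<name>.

    os.stat follows symlinks, so a by-id symlink resolves to the
    underlying device node. Raises FileNotFoundError if the device
    has not shown up yet; handling that is the caller's policy.
    """
    rdev = os.stat(device_path).st_rdev
    return format_limit(rdev, bytes_per_second)


# 8:16 is only an example device number, built here without any hardware.
print(format_limit(os.makedev(8, 16), 1048576))
```

A library exposing just this much would serve systemd, a Google-style job manager and an 800ms automotive boot equally; they would differ only in what they do around the FileNotFoundError.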
Re: cgroup: status-quo and userland efforts
On Wed, 3 Jul 2013, Kay Sievers wrote: > On Wed, Jul 3, 2013 at 1:57 AM, Thomas Gleixner wrote: > > Before answering please think about the relevance of your statements > > "getting this all right isn't easy", "something like a scheduler", > > "users probably want ..." and "stable /dev/disk/by-id/* symlinks" in > > those contexts. > > I don't think anybody needs your money. Thanks for your well thought out technical argument. > But it's sure an improvement over last time when you wanted to use a > "Kantholz" to make your statement. Using an out of context snippet from a private conversation at the bar to answer a technical argument is definitely proving your point. Thanks, tglx
Re: cgroup: status-quo and userland efforts
On Wed, Jul 03, 2013 at 02:44:31AM +0200, Kay Sievers wrote:
> I don't think anybody needs your money.
>
> But it's sure an improvement over last time when you wanted to use a
> "Kantholz" to make your statement.

Kantholz, frozen sharks, whatever helps get the real point across.

Hint: this is not at all about the money.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
Re: cgroup: status-quo and userland efforts
On Wed, Jul 3, 2013 at 1:57 AM, Thomas Gleixner wrote: > On Sun, 30 Jun 2013, Lennart Poettering wrote: >> On 29.06.2013 05:05, Tim Hockin wrote: >> > But that's not my point. It seems pretty easy to make this cgroup >> > management (in "native mode") a library that can have either a thin >> > veneer of a main() function, while also being usable by systemd. The >> > point is to solve all of the problems ONCE. I'm trying to make the >> > case that systemd itself should be focusing on features and policies >> > and awesome APIs. >> >> You know, getting this all right isn't easy. If you want to do things >> properly, then you need to propagate attribute changes between the units you >> manage. You also need something like a scheduler, since a number of >> controllers can only be configured under certain external conditions (for >> example: the blkio or devices controller use major/minor parameters for >> configuring per-device limits. Since major/minor assignments are pretty much >> unpredictable these days -- and users probably want to configure things with >> friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to >> wait for devices to show up before we can configure the parameters.) Soo... >> you need a graph of units, where you can propagate things, and schedule >> things >> based on some execution/event queue. And the propagation and scheduling are >> closely intermingled. > > you are confusing policy and mechanisms. > > The access to cgroupfs is mechanism. > > The propagation of changes, the scheduling of cgroupfs access and > the correlation to external conditions are policy. > > What Tim is asking for is to have a common interface, i.e. a library > which implements the low level access to the cgroupfs mechanism > without imposing systemd defined policies to it (It might implement a > set of common useful policies, but that's a different discussion). 
> > That's definitely not an unreasonable request, because he wants to > implement his own set of policies which are not necessarily the same > as those which are implemented by systemd. > > You are simply ignoring the fact, that Linux is used in other ways > than those which you are focussed on. That's true for Google's way to > manage its gazillion machines and that's equally true for the other > end of the spectrum which is deep embedded or any other specialized > use case. Just face it: running Linux on your laptop and on some RHT > lab machines is covering about 1% of the use cases. > > Nevertheless you repeatedly claim, that systemd is the only way to > deal with system startup and system management, is covering _ALL_ use > cases and the interfaces you expose are sufficient. > > Did you ever work on specialized embedded or big data use cases? I > really doubt that, but I might be wrong as usual. > > So I invite you to prove that you can beat an existing setup for an > automotive use case with your magic systemd foo. I refund you fully, > if you can beat the mark of a functional system less than 800ms after > reset release on a 200MHz ARM machine. Functional is defined by the > use case requirements and means: > > - Basic cgroups management working > - GUI up and running > - Main communication interface (CAN bus) up and running > > The rest of the system is starting up after that including a more > complex cgroup management. > > According to your claim that systemd is covering everything and some > more, this should take you a few hours. I grant you a full week to > work on that. > > The use case Tim is talking about is different, but has similar > constraints which are completely driven by his particular use case > scenario. I'm sure, that Tim can persuade his management to setup a > similar contest to prove your expertise on the other extreme of the > Linux world. 
> > Before answering please think about the relevance of your statements > "getting this all right isn't easy", "something like a scheduler", > "users probably want ..." and "stable /dev/disk/by-id/* symlinks" in > those contexts. I don't think anybody needs your money. But it's sure an improvement over last time when you wanted to use a "Kantholz" to make your statement. Thanks, Kay -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
Lennart, On Sun, 30 Jun 2013, Lennart Poettering wrote: > On 29.06.2013 05:05, Tim Hockin wrote: > > But that's not my point. It seems pretty easy to make this cgroup > > management (in "native mode") a library that can have either a thin > > veneer of a main() function, while also being usable by systemd. The > > point is to solve all of the problems ONCE. I'm trying to make the > > case that systemd itself should be focusing on features and policies > > and awesome APIs. > > You know, getting this all right isn't easy. If you want to do things > properly, then you need to propagate attribute changes between the units you > manage. You also need something like a scheduler, since a number of > controllers can only be configured under certain external conditions (for > example: the blkio or devices controller use major/minor parameters for > configuring per-device limits. Since major/minor assignments are pretty much > unpredictable these days -- and users probably want to configure things with > friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to > wait for devices to show up before we can configure the parameters.) Soo... > you need a graph of units, where you can propagate things, and schedule things > based on some execution/event queue. And the propagation and scheduling are > closely intermingled. you are confusing policy and mechanisms. The access to cgroupfs is mechanism. The propagation of changes, the scheduling of cgroupfs access and the correlation to external conditions are policy. What Tim is asking for is to have a common interface, i.e. a library which implements the low level access to the cgroupfs mechanism without imposing systemd defined policies to it (It might implement a set of common useful policies, but that's a different discussion). That's definitely not an unreasonable request, because he wants to implement his own set of policies which are not necessarily the same as those which are implemented by systemd. 
You are simply ignoring the fact, that Linux is used in other ways than those which you are focussed on. That's true for Google's way to manage its gazillion machines and that's equally true for the other end of the spectrum which is deep embedded or any other specialized use case. Just face it: running Linux on your laptop and on some RHT lab machines is covering about 1% of the use cases. Nevertheless you repeatedly claim, that systemd is the only way to deal with system startup and system management, is covering _ALL_ use cases and the interfaces you expose are sufficient. Did you ever work on specialized embedded or big data use cases? I really doubt that, but I might be wrong as usual. So I invite you to prove that you can beat an existing setup for an automotive use case with your magic systemd foo. I refund you fully, if you can beat the mark of a functional system less than 800ms after reset release on a 200MHz ARM machine. Functional is defined by the use case requirements and means: - Basic cgroups management working - GUI up and running - Main communication interface (CAN bus) up and running The rest of the system is starting up after that including a more complex cgroup management. According to your claim that systemd is covering everything and some more, this should take you a few hours. I grant you a full week to work on that. The use case Tim is talking about is different, but has similar constraints which are completely driven by his particular use case scenario. I'm sure, that Tim can persuade his management to setup a similar contest to prove your expertise on the other extreme of the Linux world. Before answering please think about the relevance of your statements "getting this all right isn't easy", "something like a scheduler", "users probably want ..." and "stable /dev/disk/by-id/* symlinks" in those contexts. 
Thanks, tglx
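The mechanism/policy split described in the message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not anyone's actual library: the class `CgroupFs`, the policy functions `equal_shares` and `favour_first`, and the group names are all hypothetical, and a scratch directory stands in for a mounted cgroupfs. The only real kernel name used is the cgroup v1 knob `cpu.shares`.

```python
import os
import tempfile


class CgroupFs:
    """Mechanism only: create groups and read/write knob files under a root."""

    def __init__(self, root):
        self.root = root

    def create(self, group):
        # Equivalent of `mkdir <root>/<group>`.
        os.makedirs(os.path.join(self.root, group), exist_ok=True)

    def write(self, group, knob, value):
        # Equivalent of `echo <value> > <root>/<group>/<knob>`.
        with open(os.path.join(self.root, group, knob), "w") as f:
            f.write(f"{value}\n")

    def read(self, group, knob):
        with open(os.path.join(self.root, group, knob)) as f:
            return f.read().strip()


def equal_shares(groups):
    """Hypothetical policy A: every group gets the same CPU weight."""
    return [(g, "cpu.shares", 1024) for g in groups]


def favour_first(groups):
    """Hypothetical policy B: the first group gets four times the weight."""
    return [(g, "cpu.shares", 4096 if i == 0 else 1024)
            for i, g in enumerate(groups)]


def apply_policy(fs, policy, groups):
    """Run a policy function and push its decisions through the mechanism."""
    for group, knob, value in policy(groups):
        fs.create(group)
        fs.write(group, knob, value)


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for a real cgroupfs mount.
    fs = CgroupFs(tempfile.mkdtemp())
    apply_policy(fs, favour_first, ["gui", "canbus"])
    print(fs.read("gui", "cpu.shares"), fs.read("canbus", "cpu.shares"))
```

The point of the split is that `CgroupFs` knows nothing about units, scheduling or device arrival; two callers with different policies can share it unchanged.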
Re: cgroup: status-quo and userland efforts
On Sun, Jun 30, 2013 at 12:39 PM, Lennart Poettering wrote: > Heya, > > > On 29.06.2013 05:05, Tim Hockin wrote: >> >> Come on, now, Lennart. You put a lot of words in my mouth. > > >>> I for sure am not going to make the PID 1 a client of another daemon. >>> That's >>> just wrong. If you have a daemon that is both conceptually the manager of >>> another service and the client of that other service, then that's bad >>> design >>> and you will easily run into deadlocks and such. Just think about it: if >>> you >>> have some external daemon for managing cgroups, and you need cgroups for >>> running external daemons, how are you going to start the external daemon >>> for >>> managing cgroups? Sure, you can hack around this, make that daemon >>> special, >>> and magic, and stuff -- or you can just not do such nonsense. There's no >>> reason to repeat the fuckup that cgroup became in kernelspace a second >>> time, >>> but this time in userspace, with multiple manager daemons all with >>> different >>> and slightly incompatible definitions what a unit to manage actualy is... >> >> >> I forgot about the tautology of systemd. systemd is monolithic. > > > systemd is certainly not monolithic for almost any definition of that term. > I am not sure where you are taking that from, and I am not sure I want to > discuss on that level. This just sounds like FUD you picked up somewhere and > are repeating carelessly... It does a number of sort-of-related things. Maybe it does them better by doing them together. I can't say, really. We don't use it at work, and I am on Ubuntu elsewhere, for now. >> But that's not my point. It seems pretty easy to make this cgroup >> management (in "native mode") a library that can have either a thin >> veneer of a main() function, while also being usable by systemd. The >> point is to solve all of the problems ONCE. I'm trying to make the >> case that systemd itself should be focusing on features and policies >> and awesome APIs. 
> > You know, getting this all right isn't easy. If you want to do things > properly, then you need to propagate attribute changes between the units you > manage. You also need something like a scheduler, since a number of > controllers can only be configured under certain external conditions (for > example: the blkio or devices controller use major/minor parameters for > configuring per-device limits. Since major/minor assignments are pretty much > unpredictable these days -- and users probably want to configure things with > friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to > wait for devices to show up before we can configure the parameters.) Soo... > you need a graph of units, where you can propagate things, and schedule > things based on some execution/event queue. And the propagation and > scheduling are closely intermingled. I'm really just talking about the most basic low-level substrate of writing to cgroupfs. Again, we don't use udev (yet?) so we don't have these problems. It seems to me that it's possible to formulate a bottom layer that is usable by both systemd and non-systemd systems. But, you know, maybe I am wrong and our internal universe is so much simpler (and behind the times) than the rest of the world that layering can work for us and not you. > Now, that's pretty much exactly what systemd actually *is*. It implements a > graph of units with a scheduler. And if you rip that part out of systemd to > make this an "easy cgroup management library", then you simply turn what > systemd is into a library without leaving anything. Which is just bogus. > > So no, if you say "seems pretty easy to make this cgroup management a > library" then well, I have to disagree with you. > > >>> We want to run fewer, simpler things on our systems, we want to reuse as >> >> >> Fewer and simpler are not compatible, unless you are losing >> functionality. Systemd is fewer, but NOT simpler. > > > Oh, certainly it is. 
If we'd split up the cgroup fs access into separate > daemon of some kind, then we'd need some kind of IPC for that, and so you > have more daemons and you have some complex IPC between the processes. So > yeah, the systemd approach is certainly both simpler and uses fewer daemons > then your hypothetical one. Well, it SOUNDS like Serge is trying to develop this to demonstrate that a standalone daemon works. That's what I am keen to help with (or else we have to invent ourselves). I am not really afraid of IPC or of "more daemons". I much prefer simple agents doing one thing and interacting with each other in simple ways. But that's me. >>> much of the code as we can. You don't achieve that by running yet another >>> daemon that does worse what systemd can anyway do simpler, easier and >>> better. >> >> >> Considering this is all hypothetical, I find this to be a funny >> debate. My hypothetical idea is better than your hypothetical idea. > > > Well, systemd is pretty real, and the code to do the unified cg
Re: cgroup: status-quo and userland efforts
Heya, On 29.06.2013 05:05, Tim Hockin wrote: Come on, now, Lennart. You put a lot of words in my mouth. I for sure am not going to make the PID 1 a client of another daemon. That's just wrong. If you have a daemon that is both conceptually the manager of another service and the client of that other service, then that's bad design and you will easily run into deadlocks and such. Just think about it: if you have some external daemon for managing cgroups, and you need cgroups for running external daemons, how are you going to start the external daemon for managing cgroups? Sure, you can hack around this, make that daemon special, and magic, and stuff -- or you can just not do such nonsense. There's no reason to repeat the fuckup that cgroup became in kernelspace a second time, but this time in userspace, with multiple manager daemons all with different and slightly incompatible definitions what a unit to manage actualy is... I forgot about the tautology of systemd. systemd is monolithic. systemd is certainly not monolithic for almost any definition of that term. I am not sure where you are taking that from, and I am not sure I want to discuss on that level. This just sounds like FUD you picked up somewhere and are repeating carelessly... But that's not my point. It seems pretty easy to make this cgroup management (in "native mode") a library that can have either a thin veneer of a main() function, while also being usable by systemd. The point is to solve all of the problems ONCE. I'm trying to make the case that systemd itself should be focusing on features and policies and awesome APIs. You know, getting this all right isn't easy. If you want to do things properly, then you need to propagate attribute changes between the units you manage. 
You also need something like a scheduler, since a number of controllers can only be configured under certain external conditions (for example: the blkio or devices controller use major/minor parameters for configuring per-device limits. Since major/minor assignments are pretty much unpredictable these days -- and users probably want to configure things with friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to wait for devices to show up before we can configure the parameters.) Soo... you need a graph of units, where you can propagate things, and schedule things based on some execution/event queue. And the propagation and scheduling are closely intermingled. Now, that's pretty much exactly what systemd actually *is*. It implements a graph of units with a scheduler. And if you rip that part out of systemd to make this an "easy cgroup management library", then you simply turn what systemd is into a library without leaving anything. Which is just bogus. So no, if you say "seems pretty easy to make this cgroup management a library" then well, I have to disagree with you. We want to run fewer, simpler things on our systems, we want to reuse as Fewer and simpler are not compatible, unless you are losing functionality. Systemd is fewer, but NOT simpler. Oh, certainly it is. If we'd split up the cgroup fs access into separate daemon of some kind, then we'd need some kind of IPC for that, and so you have more daemons and you have some complex IPC between the processes. So yeah, the systemd approach is certainly both simpler and uses fewer daemons then your hypothetical one. much of the code as we can. You don't achieve that by running yet another daemon that does worse what systemd can anyway do simpler, easier and better. Considering this is all hypothetical, I find this to be a funny debate. My hypothetical idea is better than your hypothetical idea. 
Well, systemd is pretty real, and the code to do the unified cgroup management within systemd is pretty complete. systemd is certainly not hypothetical. The least you could grant us is to have a look at the final APIs we will have to offer before you already imply that systemd cannot be a valid implementation of any API people could ever agree on. Whoah, don't get defensive. I said nothing of the sort. The fact of the matter is that we do not run systemd, at least in part because of the monolithic nature. That's unlikely to change in this timescale. Oh, my. I am not sure what makes you think it is monolithic. What I said was that it would be a shame if we had to invent our own low-level cgroup daemon just because the "upstream" daemon was too tightly coupled with systemd. I have no interest in reimplementing systemd as a library, just to make you happy... I am quite happy with what we already have. This is supposed to be collaborative, not combative. It certainly sounds *very* different in what you are writing. Lennart
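The device-ordering problem raised in the message above (blkio limits are keyed by major/minor, while users prefer stable symlinks that may not exist yet) can be illustrated with a small sketch. It assumes a simple polling wait rather than an event queue, which is not how systemd does it; the helper names `wait_for_path`, `device_numbers` and `throttle_line` are hypothetical. The `major:minor value` line format is the one the cgroup v1 blkio throttle files accept.

```python
import os
import time


def wait_for_path(path, timeout=5.0, interval=0.1):
    """Poll until `path` exists or `timeout` seconds pass; True if it appeared."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def device_numbers(path):
    """Follow symlinks and return (major, minor) of the node at `path`."""
    # os.stat() resolves symlinks, so a /dev/disk/by-id/* link yields the
    # numbers of the underlying device node.
    rdev = os.stat(path).st_rdev
    return os.major(rdev), os.minor(rdev)


def throttle_line(major, minor, bytes_per_second):
    """Format one entry for a blkio throttle file: 'major:minor value'."""
    return f"{major}:{minor} {bytes_per_second}"
```

A caller would first wait for the symlink to show up, then resolve it and write the formatted line into the group's throttle file. Whether that waiting belongs in the low-level layer or in the policy layer above it is exactly what the thread is arguing about.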
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Fri 28-06-13 14:01:55, Vivek Goyal wrote:
> On Fri, Jun 28, 2013 at 05:05:13PM +0200, Michal Hocko wrote:
[...]
> > OK, so libcgroup's rules daemon will still work and place my tasks in
> > appropriate cgroups?
>
> Do you use that daemon in practice?

I do not, but my users do. And that is why I care.

> For user session logins, I think systemd has plans to put user
> sessions in a cgroup (kind of making pam_cgroup redundant).
>
> Other functionality rulesengined was providing moving tasks automatically
> in a cgroup based on executable name. I think that was racy and not
> many people had liked it.

It doesn't make sense for short lived processes, all right, but it can
be useful for those that live for a long time.

> IIUC, systemd can't disable access to cgroupfs from other utilities.

The previous messages read otherwise. And that is why this raised red
flags on many fronts.

> So most likely rulesengined should continue to work. But having both
> systemd and libcgroup might not make much sense though.
>
> Thanks
> Vivek

-- 
Michal Hocko
SUSE Labs
Re: cgroup: status-quo and userland efforts
Hello, Tim. On Fri, Jun 28, 2013 at 11:44:23AM -0700, Tim Hockin wrote: > I totally understand where you're coming from - trying to get back to > a stable feature set. But it sucks to be on the losing end of that Oh, it has been sucking and will continue to suck like hell for me too for the foreseeable future. Trust me, this side ain't any greener. > battle - you're cutting things that REALLY matter to us, and without a > really viable alternative. So we'll keep fighting. Yeah, that's understandable. More on this later. > Splitting threads is sort of important for some cgroups, like CPU. I > wonder if pjt is paying attention to this thread. Paul? > I think this is wrong. Take the opportunity to define the RIGHT > interface that you WANT - a container. Implement it in terms of > cgroups (and maybe other stuff!). Make that API so compelling that > people want to use it, and your war of attrition on direct cgroup > madness will be won, but with net progress rather than regress. The goal is to reach a sane and widely usable / useful state with a minimum amount of complexity. Maintaining backward compatibility for some period - likely quite a few years - while still allowing future development is a pretty important consideration. Another factor is that the general situation has been more or less atrocious and cgroup as a whole has been failing in the very basic places, which also reinforces the drive for simplicity. I probably am forgetting some, but anyways, from my POV, there are fairly strong by-default factors which push for simplicity even if that means some loss of functionality as long as it isn't something catastrophic. I've been going over the decisions for the past few days and unified hierarchy still seems the best, or rather, most acceptable solution. That said, I still don't know very well the scope and severity of the problems you guys might face from the loss of multiple orthogonal hierarchies.
The cpuset one wasn't very convincing, especially given that most of the expressibility problems can be mitigated if you presume a central managing facility which can adapt the configurations as the workload changes. Dynamic execution of configuration is of course the job of cgroup proper, but larger cadence changes don't have to be statically encoded in the hierarchy itself and, as I wrote before, some just can't be, whether with multiple hierarchies or not. While the bar to overcome is pretty high, I do want to learn about the problems you guys are foreseeing, so that I can at least evaluate the severity properly and hopefully compromises which can mitigate the sorest ones can be made wherever necessary. So, can you please explain the issues that you've experienced and are foreseeing in detail with their contexts? i.e. if you have a certain requirement, please give at least a brief explanation of where such a requirement is coming from and how important it is. Thanks. -- tejun
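For readers trying to picture what "loss of multiple orthogonal hierarchies" can cost, here is a toy count, assuming the worst case in which every combination of classes is actually in use. The class names are made up for illustration, and the message above argues that a central manager can often avoid this worst case.

```python
from itertools import product


def separate_hierarchies(cpu_classes, mem_classes):
    """One hierarchy per controller: each class is a group in its own tree."""
    return {
        "cpu": list(cpu_classes),
        "memory": list(mem_classes),
    }


def unified_hierarchy(cpu_classes, mem_classes):
    """One shared tree: a task's group must encode both choices at once."""
    return [f"{c}+{m}" for c, m in product(cpu_classes, mem_classes)]


# Hypothetical class names, chosen only for the example.
cpu = ["batch", "latency", "interactive"]
mem = ["small", "medium", "large", "huge"]

split = separate_hierarchies(cpu, mem)
print(len(split["cpu"]) + len(split["memory"]))  # 7 groups in total
print(len(unified_hierarchy(cpu, mem)))          # 12 groups in total
```

With n classes on one axis and m on the other, separate trees need n + m groups while a single tree needs up to n * m, which is the expressibility concern the multi-hierarchy users are raising.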
Re: cgroup: status-quo and userland efforts
Come on, now, Lennart. You put a lot of words in my mouth. On Fri, Jun 28, 2013 at 6:48 PM, Lennart Poettering wrote: > On 28.06.2013 20:53, Tim Hockin wrote: > >> a single-agent, we should make a kick-ass implementation that is >> flexible and scalable, and full-featured enough to not require >> divergence at the lowest layer of the stack. Then build systemd on >> top of that. Let systemd offer more features and policies and >> "semantic" APIs. > > > Well, what if systemd is already kick-ass? I mean, if you have a problem > with systemd, then that's your own problem, but I really don't think why I > should bother? I didn't say it wasn't. I said that we can build a common substrate that systemd can build on *and* non-systemd systems can use *and* Google can participate in. > I for sure am not going to make the PID 1 a client of another daemon. That's > just wrong. If you have a daemon that is both conceptually the manager of > another service and the client of that other service, then that's bad design > and you will easily run into deadlocks and such. Just think about it: if you > have some external daemon for managing cgroups, and you need cgroups for > running external daemons, how are you going to start the external daemon for > managing cgroups? Sure, you can hack around this, make that daemon special, > and magic, and stuff -- or you can just not do such nonsense. There's no > reason to repeat the fuckup that cgroup became in kernelspace a second time, > but this time in userspace, with multiple manager daemons all with different > and slightly incompatible definitions what a unit to manage actualy is... I forgot about the tautology of systemd. systemd is monolithic. Therefore it can not have any external dependencies. Therefore it must absorb anything it depends on. Therefore systemd continues to grow in size and scope. Up next: systemd manages your X sessions! But that's not my point. 
It seems pretty easy to make this cgroup management (in "native mode") a library that can have either a thin veneer of a main() function, while also being usable by systemd. The point is to solve all of the problems ONCE. I'm trying to make the case that systemd itself should be focusing on features and policies and awesome APIs. > We want to run fewer, simpler things on our systems, we want to reuse as Fewer and simpler are not compatible, unless you are losing functionality. Systemd is fewer, but NOT simpler. > much of the code as we can. You don't achieve that by running yet another > daemon that does worse what systemd can anyway do simpler, easier and > better. Considering this is all hypothetical, I find this to be a funny debate. My hypothetical idea is better than your hypothetical idea. > The least you could grant us is to have a look at the final APIs we will > have to offer before you already imply that systemd cannot be a valid > implementation of any API people could ever agree on. Whoah, don't get defensive. I said nothing of the sort. The fact of the matter is that we do not run systemd, at least in part because of the monolithic nature. That's unlikely to change in this timescale. What I said was that it would be a shame if we had to invent our own low-level cgroup daemon just because the "upstream" daemons was too tightly coupled with systemd. I think we have a lot of experience to offer to this project, and a vested interest in seeing it done well. But if it is purely targetting systemd, we have little incentive to devote resources to it. Please note that I am strictly talking about the lowest layer of the API. Just the thing that guards cgroupfs against mere mortals. The higher layers - where abstractions exist, that are actually USEFUL to end users - are not really in scope right now. We already have our own higher level APIs. This is supposed to be collaborative, not combative. 
Tim
Re: cgroup: status-quo and userland efforts
On 28.06.2013 20:53, Tim Hockin wrote:

> a single-agent, we should make a kick-ass implementation that is
> flexible and scalable, and full-featured enough to not require
> divergence at the lowest layer of the stack. Then build systemd on
> top of that. Let systemd offer more features and policies and
> "semantic" APIs.

Well, what if systemd is already kick-ass? I mean, if you have a problem
with systemd, then that's your own problem, but I really don't see why I
should bother?

I for sure am not going to make the PID 1 a client of another daemon.
That's just wrong. If you have a daemon that is both conceptually the
manager of another service and the client of that other service, then
that's bad design and you will easily run into deadlocks and such. Just
think about it: if you have some external daemon for managing cgroups,
and you need cgroups for running external daemons, how are you going to
start the external daemon for managing cgroups? Sure, you can hack around
this, make that daemon special, and magic, and stuff -- or you can just
not do such nonsense. There's no reason to repeat the fuckup that cgroup
became in kernelspace a second time, but this time in userspace, with
multiple manager daemons all with different and slightly incompatible
definitions of what a unit to manage actually is...

We want to run fewer, simpler things on our systems, we want to reuse as
much of the code as we can. You don't achieve that by running yet another
daemon that does worse what systemd can anyway do simpler, easier and
better.

The least you could grant us is to have a look at the final APIs we will
have to offer before you already imply that systemd cannot be a valid
implementation of any API people could ever agree on.

Lennart
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Fri, Jun 28, 2013 at 05:40:53PM -0500, Serge Hallyn wrote:
> > The kernel can expose a knob that would allow systemd to lock that
> > down
>
> Gah - why would you give him that idea? :)

That's one of the ideas I had from the beginning.

> But yes, I'd sort of assume that was coming, eventually.

But I think we'll probably settle on a mechanism to find out whether
someone else is touching the hierarchy, which will be generally useful
for other consumers of cgroup too.

Thanks.

-- 
tejun
Re: [Workman-devel] cgroup: status-quo and userland efforts
Quoting Daniel P. Berrange (berra...@redhat.com): > On Fri, Jun 28, 2013 at 02:01:55PM -0400, Vivek Goyal wrote: > > On Fri, Jun 28, 2013 at 05:05:13PM +0200, Michal Hocko wrote: > > > On Thu 27-06-13 22:01:38, Tejun Heo wrote: > > > > Hello, Mike. > > > > > > > > On Fri, Jun 28, 2013 at 06:49:10AM +0200, Mike Galbraith wrote: > > > > > I always thought that was a very cool feature, mkdir+echo, poof done. > > > > > Now maybe that interface is suboptimal for serious usage, but it makes > > > > > the things usable via dirt simple scripts, very flexible, nice. > > > > > > > > Oh, that in itself is not bad. I mean, if you're root, it's pretty > > > > easy to play with and that part is fine. But combined with the > > > > hierarchical nature of cgroup and file permissions, it encourages > > > > people to "deligate" subdirectories to less previledged domains, > > > > > > OK, this really depends on what you expose to non-root users. I have > > > seen use cases where admin prepares top-level which is root-only but > > > it allows creating sub-groups which are under _full_ control of the > > > subdomain. This worked nicely for memcg for example because hard limit, > > > oom handling and other knobs are hierarchical so the subdomain cannot > > > overwrite what admin has said. > > > > > > > which > > > > in turn leads to normal binaries to manipulate them directly, which is > > > > where the horror begins. We end up exposing control knobs which are > > > > tightly coupled to kernel implementation details right into lay > > > > binaries and scripts directly used by end users. > > > > > > > > I think this is the first time this happened, which is probably why > > > > nobody really noticed the mess earlier. > > > > > > > > Anyways, if you're root, you can keep doing whatever you want. > > > > > > OK, so libcgroup's rules daemon will still work and place my tasks in > > > appropriate cgroups? > > > > Do you use that daemon in practice? 
For user session logins, I think
> > systemd has plans to put user sessions in a cgroup (kind of making
> > pam_cgroup redundant).
> >
> > Other functionality rulesengined was providing moving tasks automatically
> > in a cgroup based on executable name. I think that was racy and not
> > many people had liked it.
>
> Regardless of the changes being proposed, IMHO, the cgrulesd should
> never be used. It is just outright dangerous for a daemon to be
> arbitrarily re-arranging what cgroups a process is placed in without
> the applications being aware of it. It can only be safely used in a
> scenario where cgroups are exclusively used by the administrator,
> and never used by applications for their own needs.

Even then it's not safe, since if the program quickly forks or clones a
few times, you can end up with some of the tasks being reclassified and
some not.

> > IIUC, systemd can't disable access to cgroupfs from other utilities.
>
> The kernel can expose a knob that would allow systemd to lock that
> down

Gah - why would you give him that idea? :)

But yes, I'd sort of assume that was coming, eventually.

-serge
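The fork race Serge describes can be illustrated with a small simulation. This is only a sketch under stated assumptions: the PIDs, the cgroup names "default" and "limited", and the `reclassify` helper are hypothetical and do not reflect how cgrulesd is actually implemented. The point is only that a daemon acting on a stale task list misses children forked in the meantime.

```python
def reclassify(snapshot, placement, target):
    """Move every PID seen in the (possibly stale) snapshot into `target`."""
    for pid in snapshot:
        placement[pid] = target


def simulate_race():
    # placement maps PID -> cgroup name; a child starts in its parent's cgroup.
    placement = {100: "default"}

    # The hypothetical rules daemon lists the tasks of the program it matched.
    snapshot = list(placement)

    # Before the daemon acts, the program forks twice.
    for child in (101, 102):
        placement[child] = placement[100]

    # The daemon now moves only the tasks it saw in its snapshot.
    reclassify(snapshot, placement, "limited")
    return placement


if __name__ == "__main__":
    print(simulate_race())
```

In this model the parent ends up in "limited" while both children stay in "default", which is the partially reclassified state the message warns about.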
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Fri, Jun 28, 2013 at 02:01:55PM -0400, Vivek Goyal wrote: > On Fri, Jun 28, 2013 at 05:05:13PM +0200, Michal Hocko wrote: > > On Thu 27-06-13 22:01:38, Tejun Heo wrote: > > > Hello, Mike. > > > > > > On Fri, Jun 28, 2013 at 06:49:10AM +0200, Mike Galbraith wrote: > > > > I always thought that was a very cool feature, mkdir+echo, poof done. > > > > Now maybe that interface is suboptimal for serious usage, but it makes > > > > the things usable via dirt simple scripts, very flexible, nice. > > > > > > Oh, that in itself is not bad. I mean, if you're root, it's pretty > > > easy to play with and that part is fine. But combined with the > > > hierarchical nature of cgroup and file permissions, it encourages > > > people to "deligate" subdirectories to less previledged domains, > > > > OK, this really depends on what you expose to non-root users. I have > > seen use cases where admin prepares top-level which is root-only but > > it allows creating sub-groups which are under _full_ control of the > > subdomain. This worked nicely for memcg for example because hard limit, > > oom handling and other knobs are hierarchical so the subdomain cannot > > overwrite what admin has said. > > > > > which > > > in turn leads to normal binaries to manipulate them directly, which is > > > where the horror begins. We end up exposing control knobs which are > > > tightly coupled to kernel implementation details right into lay > > > binaries and scripts directly used by end users. > > > > > > I think this is the first time this happened, which is probably why > > > nobody really noticed the mess earlier. > > > > > > Anyways, if you're root, you can keep doing whatever you want. > > > > OK, so libcgroup's rules daemon will still work and place my tasks in > > appropriate cgroups? > > Do you use that daemon in practice? For user session logins, I think > systemd has plans to put user sessions in a cgroup (kind of making > pam_cgroup redundant). 
>
> Other functionality rulesengined was providing moving tasks automatically
> in a cgroup based on executable name. I think that was racy and not
> many people had liked it.

Regardless of the changes being proposed, IMHO, the cgrulesd should
never be used. It is just outright dangerous for a daemon to be
arbitrarily re-arranging what cgroups a process is placed in without
the applications being aware of it. It can only be safely used in a
scenario where cgroups are exclusively used by the administrator,
and never used by applications for their own needs.

> IIUC, systemd can't disable access to cgroupfs from other utilities.

The kernel can expose a knob that would allow systemd to lock that
down

> So most likely rulesengined should continue to work. But having both
> systemd and libcgroup might not make much sense though.

Daniel
--
|: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: cgroup: status-quo and userland efforts
Quoting Andy Lutomirski (l...@amacapital.net):
> On 06/27/2013 11:01 AM, Tejun Heo wrote:
> > AFAICS, having a userland agent which has overall knowledge of the
> > hierarchy and enforces structure and limitations is a requirement to
> > make cgroup generally useable and useful. For systemd based systems,
> > systemd serving that role isn't too crazy. It's sure gonna have
> > teething issues at the beginning but it has all the necessary
> > information to manage workloads on the system.
> >
> > A valid issue is interoperability between systemd and non-systemd
> > systems. I don't have an immediately good answer for that. I wrote
> > in another reply but making cgroup generally available is a pretty new
> > effort and we're still in the process of figuring out what the right
> > constructs and abstractions are. Hopefully, we'll be able to reach a
> > common set of abstractions to base things on top in time.
> >
>
> The systemd stuff will break my code, too (although the single hierarchy
> by itself won't, I think). I think that the kernel should make whatever
> simple changes are needed so that systemd can function without using
> cgroups at all. That way users of a different cgroup scheme can turn
> off systemd's.
>
> Here was my proposal, which hasn't gotten a clear reply:
>
> http://article.gmane.org/gmane.comp.sysutils.systemd.devel/11424

Neat. I like that proposal.

> I've already sent a patch to make /proc//task//children
> available regardless of configuration.

-serge
Re: cgroup: status-quo and userland efforts
On 06/27/2013 11:01 AM, Tejun Heo wrote:
> AFAICS, having a userland agent which has overall knowledge of the
> hierarchy and enforces structure and limitations is a requirement to
> make cgroup generally useable and useful. For systemd based systems,
> systemd serving that role isn't too crazy. It's sure gonna have
> teething issues at the beginning but it has all the necessary
> information to manage workloads on the system.
>
> A valid issue is interoperability between systemd and non-systemd
> systems. I don't have an immediately good answer for that. I wrote
> in another reply but making cgroup generally available is a pretty new
> effort and we're still in the process of figuring out what the right
> constructs and abstractions are. Hopefully, we'll be able to reach a
> common set of abstractions to base things on top in time.
>

The systemd stuff will break my code, too (although the single hierarchy
by itself won't, I think). I think that the kernel should make whatever
simple changes are needed so that systemd can function without using
cgroups at all. That way users of a different cgroup scheme can turn
off systemd's.

Here was my proposal, which hasn't gotten a clear reply:

http://article.gmane.org/gmane.comp.sysutils.systemd.devel/11424

I've already sent a patch to make /proc//task//children
available regardless of configuration.

--Andy
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Fri, Jun 28, 2013 at 8:53 AM, Serge Hallyn wrote: > Quoting Daniel P. Berrange (berra...@redhat.com): >> Are you also planning to actually write a new cgroup parent manager >> daemon too ? Currently my plan for libvirt is to just talk directly > > I'm toying with the idea, yes. (Right now my toy runs in either native > mode, using cgroupfs, or child mode, talking to a parent manager) I'd > love if someone else does it, but it needs to be done. > > As I've said elsewhere in the thread, I see 2 problems to be addressed: > > 1. The ability to nest the cgroup manager daemons, so that a daemon > running in a container can talk to a daemon running on the host. This > is the problem my current toy is aiming to address. But the API it > exports is just a thin layer over cgroupfs. > > 2. Abstract away the kernel/cgroupfs details so that userspace can > explain its cgroup needs generically. This is IIUC what systemd is > addressing with slices and scopes. > > (2) is where I'd really like to have a well thought out, community > designed API that everyone can agree on, and it might be worth getting > together (with Tejun) at plumbers or something to lay something out. We're also working on (2) (well, we HAVE it, but we're dis-integrating it so we can hopefully publish more widely). But our (2) depends on direct cgroupfs access. If that is to change, we need a really robust (1). It's OK (desireable, in fact) that (1) be a very thin layer of abstraction. > In the end, something like libvirt or lxc should not need to care > what is running underneat it. It should be able to make its requests > the same way regardless of whether it running in fedora or ubuntu, > and whether it is running on the host or in a tightly bound container. > That's my goal anyway :) > >> to systemd's new DBus APIs for all management of cgroups, and then >> fall back to writing to cgroupfs directly for cases where systemd >> is not around. 
Having a library to abstract these two possible
>> alternatives isn't all that compelling unless we think there will
>> be multiple cgroups manager daemons. I've been somewhat assuming that
>> even Ubuntu will eventually see the benefits & switch to systemd,
>
> So far I've seen no indication of that :)
>
> If the systemd code to manage slices could be made separately
> compilable as a standalone library or daemon, then I'd advocate
> using that. But I don't see a lot of incentive for systemd to do
> that, so I'd feel like a heel even asking.

I want to say "let the best API win", but I know that systemd is a
giant katamari ball, and it's absorbing subsystems so it may win by
default. That isn't going to stop us from trying to do what we do,
and sharing that with the world.

>> then the issue of multiple manager daemons wouldn't really exist.
>
> True. But I'm running under the assumption that Ubuntu will stick with
> upstart, and therefore yes I'll need a separate (perhaps pair of)
> management daemons.
>
> Even if we were to switch to systemd, I'd like the API for userspace
> programs to configure and use cgroups to be as generic as possible,
> so that anyone who wanted to write their own daemon could do so.
>
> -serge
Re: cgroup: status-quo and userland efforts
On Fri, Jun 28, 2013 at 8:05 AM, Michal Hocko wrote: > On Thu 27-06-13 22:01:38, Tejun Heo wrote: >> Oh, that in itself is not bad. I mean, if you're root, it's pretty >> easy to play with and that part is fine. But combined with the >> hierarchical nature of cgroup and file permissions, it encourages >> people to "deligate" subdirectories to less previledged domains, > > OK, this really depends on what you expose to non-root users. I have > seen use cases where admin prepares top-level which is root-only but > it allows creating sub-groups which are under _full_ control of the > subdomain. This worked nicely for memcg for example because hard limit, > oom handling and other knobs are hierarchical so the subdomain cannot > overwrite what admin has said. bingo > And the systemd, with its history of eating projects and not caring much > about their previous users who are not willing to jump in to the systemd > car, doesn't sound like a good place where to place the new interface to > me. +1 If systemd is the only upstream implementation of this single-agent idea, we will have to invent our own, and continue to diverge rather than converge. I think that, if we are going to pursue this model of a single-agent, we should make a kick-ass implementation that is flexible and scalable, and full-featured enough to not require divergence at the lowest layer of the stack. Then build systemd on top of that. Let systemd offer more features and policies and "semantic" APIs. We will build our own semantic APIs that are, necessarily, different from systemd. But we can all use the same low-level mechanism. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
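Michal's point about hierarchical hard limits can be sketched as follows. This is a simplified model, not kernel code: it assumes that the limit a task actually experiences is the tightest limit set on its own group or any ancestor, which is why a delegated subdomain cannot exceed what the admin configured above it. The function name `effective_limit` and the example paths are hypothetical.

```python
def effective_limit(path, limits):
    """Return the tightest limit on `path` or any of its ancestors.

    path:   cgroup path such as "/top/user"
    limits: dict mapping cgroup path -> hard limit in bytes
    Returns None when no limit is set anywhere along the path.
    """
    parts = [p for p in path.split("/") if p]
    effective = None
    # Walk from the root ("/") down to the group itself.
    for depth in range(len(parts) + 1):
        node = "/" + "/".join(parts[:depth])
        limit = limits.get(node)
        if limit is not None and (effective is None or limit < effective):
            effective = limit
    return effective


if __name__ == "__main__":
    # The admin caps /top; the subdomain tries to grant itself more.
    limits = {"/top": 1024, "/top/user": 4096}
    print(effective_limit("/top/user", limits))
```

Under this model the subdomain can only tighten its own limit further; raising the number in "/top/user" has no effect beyond the cap on "/top".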
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 2:04 PM, Tejun Heo wrote: > Hello, > > On Thu, Jun 27, 2013 at 01:46:18PM -0700, Tim Hockin wrote: >> So what you're saying is that you don't care that this new thing is >> less capable than the old thing, despite it having real impact. > > Sort of. I'm saying, at least up until now, moving away from > orthogonal hierarchy support seems to be the right trade-off. It all > depends on how you measure how much things are simplified and how > heavy the "real impacts" are. It's not like these things can be > determined white and black. Given the current situation, I think it's > the right call. I totally understand where you're coming from - trying to get back to a stable feature set. But it sucks to be on the losing end of that battle - you're cutting things that REALLY matter to us, and without a really viable alternative. So we'll keep fighting. >> If controller C is enabled at level X but disabled at level X/Y, does >> that mean that X/Y uses the limits set in X? How about X/Y/Z? > > Y and Y/Z wouldn't make any difference. Tasks belonging to them would > behave as if they belong to X as far as C is concerened. OK, that *sounds* sane. It doesn't solve all our problems, but it alleviates some of them. >> So take away some of the flexibility that has minimal impact and >> maximum return. Splitting threads across cgroups - we use it, but we >> could get off that. Force all-or-nothing joining of an aggregate > > Please do so. Splitting threads is sort of important for some cgroups, like CPU. I wonder if pjt is paying attention to this thread. >> construct (a container vs N cgroups). >> >> But perform surgery with a scalpel, not a hatchet. > > As anything else, it's drawing a line in a continuous spectrum of > grey. Right now, given that maintaining multiple orthogonal > hierarchies while introducing a proper concept of resource container > involves addition of completely new constructs and complexity, I don't > think that's a good option. 
If there are problems which can't be
> resolved / worked around in a reasonable manner, please bring them up
> along with their contexts. Let's examine them and see whether there
> are other ways to accommodate them.

You're arguing that the abstraction you want is that of a "container"
but that it's easier to remove options than to actually build a better
API.

I think this is wrong. Take the opportunity to define the RIGHT
interface that you WANT - a container. Implement it in terms of
cgroups (and maybe other stuff!). Make that API so compelling that
people want to use it, and your war of attrition on direct cgroup
madness will be won, but with net progress rather than regression.
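The rule discussed in this message (controller C enabled at X but disabled at X/Y, so tasks in X/Y and X/Y/Z behave as if they belong to X) can be sketched as a lookup of the nearest ancestor where the controller is enabled. This is an illustrative model only; the function name `effective_cgroup` and the set-of-paths representation are assumptions, and the sketch does not check that enablement is consistent from the top down.

```python
def effective_cgroup(path, enabled):
    """Return the group whose settings apply to tasks in `path`.

    path:    cgroup path relative to the root, e.g. "X/Y/Z"
    enabled: set of paths at which the controller is enabled
    Returns the nearest ancestor-or-self in `enabled`, or None when the
    controller is enabled nowhere on the path (tasks fall back to the root).
    """
    parts = [p for p in path.split("/") if p]
    # Try the group itself first, then walk up one level at a time.
    for depth in range(len(parts), 0, -1):
        node = "/".join(parts[:depth])
        if node in enabled:
            return node
    return None


if __name__ == "__main__":
    enabled = {"X"}
    for group in ("X", "X/Y", "X/Y/Z"):
        print(group, "->", effective_cgroup(group, enabled))
```

With only "X" enabled, all three groups resolve to "X", matching the answer that Y and Y/Z "wouldn't make any difference" as far as C is concerned.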
Re: cgroup: status-quo and userland efforts
Hello, Michal. On Fri, Jun 28, 2013 at 05:05:13PM +0200, Michal Hocko wrote: > OK, this really depends on what you expose to non-root users. I have > seen use cases where admin prepares top-level which is root-only but > it allows creating sub-groups which are under _full_ control of the > subdomain. This worked nicely for memcg for example because hard limit, > oom handling and other knobs are hierarchical so the subdomain cannot > overwrite what admin has said. Some knobs are safer than others and memcg probably has it easy as it doesn't implement proportional control. But, even then, there's a huge chasm between cgroup knobs and proper kernel API visible to normal programs. Just imagine exposing memcg features by extending rlimits. It'll take months if not a couple years ironing out the API details and going through review process, and rightfully so, these things, once published and made widely available, can't be taken back. Now compare that to how we decide what knobs to expose in cgroup. I mean, you even recently suggested flipping the default polarity of soft limit knob. cgroup's interface standard is very low. It's probably a notch higher than boot params but about at the same level as sysctl knobs. It isn't necessarily a bad thing as it allows us to rapidly explore various options and expose useable things in a very agile manner, but we should be very aware of how widely the interface is exposed; otherwise, we'd be exposing features and leaking kernel implementation details directly into userland programs without going through proper review process or buliding consensus, which, in the long term, is gonna be much worse than not having the feature exposed at all. "It works for special cases XXX and YYY" is a very poor and extremely short-sighted argument when the whole approach is breaching the very fundamentals of kernel API conventions. In addition, I really don't think cgroup is the right interface to directly expose to individual programs. 
As a management thing, it does make some sense but kernel API already has its, at times ancient but, generally working hierarchy and inheritance rules and conventions and primitive resource control contructs - nice, ionice, rlimits and so on. If exposing cgroup-level resource control directly to individual applications proves to be beneficial enough, what we should do is extending those things. The backend sure can be supported by cgroups but this mkdiring and echoing things with separate hierarchy from the usual process hierarchy isn't something which should be visible to individual applications. Currently, I'm not convinced that this is something which should be exposed to individual applications, but I sure can be wrong. But, right now, let's first get the existing part settled. We can worry about the rest later. Also, in light of the rather sneaky subversion happened with cgroup filesystem interface, I wonder whether we need to add some sort of generic warning mechanism which warns when permissions of pseudo file systems like cgroupfs are delegated to lesser security domains. In itself, it could be harmless but it can serves as a useful beacon. Not sure to what extent or how tho. > OK, so libcgroup's rules daemon will still work and place my tasks in > appropriate cgroups? You have two competing managers of the same hierarchy. There are ways to make them not interfere with each other too much but ultimately it's gonna be something clunky. That said, libcgroup itself is pretty clunky, so maybe you'll be okay with it. I don't know. > This is not quite in par with "libcgroup is dead and others have to > migrate to systemd as well" statements from the link posted earlier. > I really do not think that _any_ central agent will understand my > requirements and needs so I need a way to talk to cgroupfs somehow - I > have used libcgroups so far but touching cgroupfs is quite convinient > as well. 
As a developer who knows what's going on, I don't think it'd be too difficult to meddle with things manually with or without the central manager. It'll complain that someone else is meddling with the cgroup hierarchy and some functionalities might not work as expected, but I don't think it'll lock you out. At the same time, while us, the developers, having the level of latitude required to do our work is necessary, that shouldn't be the overruling focal point of the design of the whole system. It's something to be used and supporting the actual use cases should be the priority. I'm not saying developer convenience is not important but that it's not the only thing which matters. The way I see it, cgroup has basically been a playground for devs going wild without too much, if any, thought on how it'll actually be useable and useful to wider audience, so let's please adjust our priorities a bit. And, no, I don't believe that the use cases are so wildly different that we can't have a capable enough central manager. That's usually a symptom of n
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Fri, Jun 28, 2013 at 05:05:13PM +0200, Michal Hocko wrote: > On Thu 27-06-13 22:01:38, Tejun Heo wrote: > > Hello, Mike. > > > > On Fri, Jun 28, 2013 at 06:49:10AM +0200, Mike Galbraith wrote: > > > I always thought that was a very cool feature, mkdir+echo, poof done. > > > Now maybe that interface is suboptimal for serious usage, but it makes > > > the things usable via dirt simple scripts, very flexible, nice. > > > > Oh, that in itself is not bad. I mean, if you're root, it's pretty > > easy to play with and that part is fine. But combined with the > > hierarchical nature of cgroup and file permissions, it encourages > > people to "deligate" subdirectories to less previledged domains, > > OK, this really depends on what you expose to non-root users. I have > seen use cases where admin prepares top-level which is root-only but > it allows creating sub-groups which are under _full_ control of the > subdomain. This worked nicely for memcg for example because hard limit, > oom handling and other knobs are hierarchical so the subdomain cannot > overwrite what admin has said. > > > which > > in turn leads to normal binaries to manipulate them directly, which is > > where the horror begins. We end up exposing control knobs which are > > tightly coupled to kernel implementation details right into lay > > binaries and scripts directly used by end users. > > > > I think this is the first time this happened, which is probably why > > nobody really noticed the mess earlier. > > > > Anyways, if you're root, you can keep doing whatever you want. > > OK, so libcgroup's rules daemon will still work and place my tasks in > appropriate cgroups? Do you use that daemon in practice? For user session logins, I think systemd has plans to put user sessions in a cgroup (kind of making pam_cgroup redundant). Other functionality rulesengined was providing moving tasks automatically in a cgroup based on executable name. I think that was racy and not many people had liked it. 
IIUC, systemd can't disable access to cgroupfs from other utilities. So
most likely rulesengined should continue to work. But having both
systemd and libcgroup might not make much sense.

Thanks
Vivek
Re: [Workman-devel] cgroup: status-quo and userland efforts
Quoting Daniel P. Berrange (berra...@redhat.com): > On Thu, Jun 27, 2013 at 08:22:06AM -0500, Serge Hallyn wrote: > > FWIW, the code is too embarassing yet to see daylight, but I'm playing > > with a very lowlevel cgroup manager which supports nesting itself. > > Access in this POC is low-level ("set freezer.state to THAWED for cgroup > > /c1/c2", "Create /c3"), but the key feature is that it can run in two > > modes - native mode in which it uses cgroupfs, and child mode where it > > talks to a parent manager to make the changes. > > > > So then the idea would be that userspace (like libvirt and lxc) would > > talk over /dev/cgroup to its manager. Userspace inside a container > > (which can't actually mount cgroups itself) would talk to its own > > manager which is talking over a passed-in socket to the host manager, > > which in turn runs natively (uses cgroupfs, and nests "create /c1" under > > the requestor's cgroup). > > > > At some point (probably soon) we might want to talk about a standard API > > for these things. However I think it will have to come in the form of > > a standard library, which knows to either send requests over dbus to > > systemd, or over /dev/cgroup sock to the manager. > > Are you also planning to actually write a new cgroup parent manager > daemon too ? Currently my plan for libvirt is to just talk directly I'm toying with the idea, yes. (Right now my toy runs in either native mode, using cgroupfs, or child mode, talking to a parent manager) I'd love if someone else does it, but it needs to be done. As I've said elsewhere in the thread, I see 2 problems to be addressed: 1. The ability to nest the cgroup manager daemons, so that a daemon running in a container can talk to a daemon running on the host. This is the problem my current toy is aiming to address. But the API it exports is just a thin layer over cgroupfs. 2. Abstract away the kernel/cgroupfs details so that userspace can explain its cgroup needs generically. 
This is IIUC what systemd is addressing with slices and scopes.

(2) is where I'd really like to have a well thought out, community
designed API that everyone can agree on, and it might be worth getting
together (with Tejun) at plumbers or something to lay something out.

In the end, something like libvirt or lxc should not need to care
what is running underneath it. It should be able to make its requests
the same way regardless of whether it is running in fedora or ubuntu,
and whether it is running on the host or in a tightly bound container.
That's my goal anyway :)

> to systemd's new DBus APIs for all management of cgroups, and then
> fall back to writing to cgroupfs directly for cases where systemd
> is not around. Having a library to abstract these two possible
> alternatives isn't all that compelling unless we think there will
> be multiple cgroups manager daemons. I've been somewhat assuming that
> even Ubuntu will eventually see the benefits & switch to systemd,

So far I've seen no indication of that :)

If the systemd code to manage slices could be made separately
compilable as a standalone library or daemon, then I'd advocate
using that. But I don't see a lot of incentive for systemd to do
that, so I'd feel like a heel even asking.

> then the issue of multiple manager daemons wouldn't really exist.

True. But I'm running under the assumption that Ubuntu will stick with
upstart, and therefore yes I'll need a separate (perhaps pair of)
management daemons.

Even if we were to switch to systemd, I'd like the API for userspace
programs to configure and use cgroups to be as generic as possible,
so that anyone who wanted to write their own daemon could do so.

-serge
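The two-mode manager described in this message (native mode using cgroupfs, child mode forwarding to a parent manager that nests the request under the requestor's cgroup) might be modeled roughly like this. It is a sketch under assumptions, not Serge's code: the `Manager` class, its `own_cgroup` attribute, and the in-memory `created` list standing in for cgroupfs are all hypothetical. A real parent would presumably determine the requestor's cgroup itself (for example from the connecting process) rather than trust an argument.

```python
class Manager:
    """Toy cgroup manager that runs in native mode or child mode."""

    def __init__(self, parent=None, own_cgroup=""):
        self.parent = parent          # None means native mode
        self.own_cgroup = own_cgroup  # where the parent sees this manager running
        self.created = []             # native mode only: stand-in for cgroupfs

    def create(self, path, requestor_cgroup=""):
        # Nest the request under the cgroup of whoever asked.
        full = requestor_cgroup + path
        if self.parent is None:
            # Native mode: this is where a real manager would mkdir on cgroupfs.
            self.created.append(full)
            return full
        # Child mode: forward to the parent, which nests the request
        # under this manager's own cgroup.
        return self.parent.create(full, requestor_cgroup=self.own_cgroup)


if __name__ == "__main__":
    host = Manager()
    guest = Manager(parent=host, own_cgroup="/lxc/c1")
    print(guest.create("/c3"))
```

A request for "/c3" made inside the container ends up as "/lxc/c1/c3" on the host, and the same forwarding step repeats for each additional level of nesting.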
Re: cgroup: status-quo and userland efforts
On Thu 27-06-13 22:01:38, Tejun Heo wrote: > Hello, Mike. > > On Fri, Jun 28, 2013 at 06:49:10AM +0200, Mike Galbraith wrote: > > I always thought that was a very cool feature, mkdir+echo, poof done. > > Now maybe that interface is suboptimal for serious usage, but it makes > > the things usable via dirt simple scripts, very flexible, nice. > > Oh, that in itself is not bad. I mean, if you're root, it's pretty > easy to play with and that part is fine. But combined with the > hierarchical nature of cgroup and file permissions, it encourages > people to "deligate" subdirectories to less previledged domains, OK, this really depends on what you expose to non-root users. I have seen use cases where admin prepares top-level which is root-only but it allows creating sub-groups which are under _full_ control of the subdomain. This worked nicely for memcg for example because hard limit, oom handling and other knobs are hierarchical so the subdomain cannot overwrite what admin has said. > which > in turn leads to normal binaries to manipulate them directly, which is > where the horror begins. We end up exposing control knobs which are > tightly coupled to kernel implementation details right into lay > binaries and scripts directly used by end users. > > I think this is the first time this happened, which is probably why > nobody really noticed the mess earlier. > > Anyways, if you're root, you can keep doing whatever you want. OK, so libcgroup's rules daemon will still work and place my tasks in appropriate cgroups? This is not quite in par with "libcgroup is dead and others have to migrate to systemd as well" statements from the link posted earlier. I really do not think that _any_ central agent will understand my requirements and needs so I need a way to talk to cgroupfs somehow - I have used libcgroups so far but touching cgroupfs is quite convinient as well. 
And systemd, with its history of eating projects and not caring much
about their previous users who are not willing to jump into the systemd
car, doesn't sound to me like a good place to put the new interface.

[...]
--
Michal Hocko
SUSE Labs
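For readers unfamiliar with the "mkdir+echo" interface quoted in this message, here is a minimal sketch of the two steps. It assumes the classic (v1-style) layout in which each group directory holds a `tasks` file; the function name and the `root` parameter are hypothetical. On a real cgroupfs mount the kernel creates `tasks` when the directory is made and root privileges are normally needed; pointed at an ordinary directory, the code simply creates a plain file, which is enough to try it out safely.

```python
import os


def create_group_and_attach(root, group, pid):
    """Create a cgroup directory and place a task in it.

    root:  mount point of one hierarchy (or any scratch directory)
    group: group path relative to root, e.g. "memory/demo"
    pid:   task ID to write into the group's `tasks` file
    """
    group_dir = os.path.join(root, group)
    os.makedirs(group_dir, exist_ok=True)      # the "mkdir" step
    tasks = os.path.join(group_dir, "tasks")
    with open(tasks, "w") as f:                # the "echo PID > tasks" step
        f.write(f"{pid}\n")
    return tasks


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as scratch:
        print(create_group_and_attach(scratch, "memory/demo", os.getpid()))
```

The delegation pattern Michal describes amounts to the admin owning the top-level directory and handing ownership of a subdirectory to a less privileged user, who can then run these same two steps beneath it.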
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 08:22:06AM -0500, Serge Hallyn wrote:
> FWIW, the code is too embarrassing yet to see daylight, but I'm playing with a very lowlevel cgroup manager which supports nesting itself. Access in this POC is low-level ("set freezer.state to THAWED for cgroup /c1/c2", "Create /c3"), but the key feature is that it can run in two modes - native mode in which it uses cgroupfs, and child mode where it talks to a parent manager to make the changes.
>
> So then the idea would be that userspace (like libvirt and lxc) would talk over /dev/cgroup to its manager. Userspace inside a container (which can't actually mount cgroups itself) would talk to its own manager which is talking over a passed-in socket to the host manager, which in turn runs natively (uses cgroupfs, and nests "create /c1" under the requestor's cgroup).
>
> At some point (probably soon) we might want to talk about a standard API for these things. However I think it will have to come in the form of a standard library, which knows to either send requests over dbus to systemd, or over /dev/cgroup sock to the manager.

Are you also planning to actually write a new cgroup parent manager daemon too?

Currently my plan for libvirt is to just talk directly to systemd's new DBus APIs for all management of cgroups, and then fall back to writing to cgroupfs directly for cases where systemd is not around. Having a library to abstract these two possible alternatives isn't all that compelling unless we think there will be multiple cgroups manager daemons. I've been somewhat assuming that even Ubuntu will eventually see the benefits & switch to systemd, then the issue of multiple manager daemons wouldn't really exist.

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
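The two-mode manager Serge describes can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not Serge's unpublished code: the class names `NativeManager` and `ChildManager`, the method names, and the use of a plain directory in place of a mounted cgroupfs are all hypothetical, and a direct object reference stands in for the passed-in socket. What it tries to show is the nesting rule from the quoted text: a child never touches the filesystem, and the host resolves every request underneath the requestor's own cgroup.

```python
import os
import tempfile


class NativeManager:
    """Host-side manager: the only one that touches the (stand-in) cgroupfs."""

    def __init__(self, root):
        self.root = root  # hypothetical stand-in for a cgroupfs mount point

    def _resolve(self, requestor, cgroup):
        # Normalise against "/" so that ".." cannot climb out of the
        # requestor's subtree, then nest under the requestor's cgroup.
        inner = os.path.normpath("/" + cgroup).lstrip("/")
        return os.path.join(self.root, requestor.strip("/"), inner)

    def create(self, requestor, cgroup):
        os.makedirs(self._resolve(requestor, cgroup), exist_ok=True)

    def set_value(self, requestor, cgroup, key, value):
        path = os.path.join(self._resolve(requestor, cgroup), key)
        with open(path, "w") as f:
            f.write(value + "\n")


class ChildManager:
    """Container-side manager: no cgroupfs access, forwards to its parent."""

    def __init__(self, parent, own_cgroup):
        self.parent = parent          # stands in for the passed-in socket
        self.own_cgroup = own_cgroup  # where the host placed this container

    def create(self, cgroup):
        self.parent.create(self.own_cgroup, cgroup)

    def set_value(self, cgroup, key, value):
        self.parent.set_value(self.own_cgroup, cgroup, key, value)
```

In this sketch, a container manager created as `ChildManager(host, "/lxc/c1")` that asks for "Create /c3" ends up with a directory at `lxc/c1/c3` under the host's root. A real implementation would derive the requestor's cgroup from the peer's credentials rather than trusting a value supplied by the child.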
Re: cgroup: status-quo and userland efforts
On Thu, 2013-06-27 at 22:01 -0700, Tejun Heo wrote: > Anyways, if you're root, you can keep doing whatever you want. You > could be stepping on the centralized agent's toes a bit and vice-versa Keep on truckn' sounds good, that vice-versa toe stomping not so good, but yeah, until systemd or ilk grows the ability to shut me down, I shouldn't feel any burning need to introduce it to my machete. > but I don't think that's gonna be disastrous. What I'm trying to > stamp out is direct usages from !root domains and !system-management > binaries / scripts. They absolutely have to go. There's no question > about it and I'll take totalitarian userland agent anyday over the > current mess. I get some of the why.. and yeah, it's the dirt simple usage that I care about most, not the big hairy problem cases you're trying to address. > Eventually, I think we'll be able to reach an equilibrium where most > things are reasonable and we'll be exploring the acceptable limits of > flexibility again, but right now, please bear with the brutality. > We're way over the line and I can't see a way back which isn't gonna > sting a bit. I'm and will keep trying to make it as painless as > possible. Keep on driving, and thanks for listening. Aaao ;-) -Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
Hello, Mike.

On Fri, Jun 28, 2013 at 06:49:10AM +0200, Mike Galbraith wrote:
> I always thought that was a very cool feature, mkdir+echo, poof done. Now maybe that interface is suboptimal for serious usage, but it makes the things usable via dirt simple scripts, very flexible, nice.

Oh, that in itself is not bad. I mean, if you're root, it's pretty easy to play with and that part is fine. But combined with the hierarchical nature of cgroup and file permissions, it encourages people to "delegate" subdirectories to less privileged domains, which in turn leads to normal binaries manipulating them directly, which is where the horror begins. We end up exposing control knobs which are tightly coupled to kernel implementation details right into lay binaries and scripts directly used by end users.

I think this is the first time this happened, which is probably why nobody really noticed the mess earlier.

Anyways, if you're root, you can keep doing whatever you want. You could be stepping on the centralized agent's toes a bit and vice-versa but I don't think that's gonna be disastrous. What I'm trying to stamp out is direct usages from !root domains and !system-management binaries / scripts. They absolutely have to go. There's no question about it and I'll take totalitarian userland agent anyday over the current mess.

Eventually, I think we'll be able to reach an equilibrium where most things are reasonable and we'll be exploring the acceptable limits of flexibility again, but right now, please bear with the brutality. We're way over the line and I can't see a way back which isn't gonna sting a bit. I'm and will keep trying to make it as painless as possible.

Thanks!

--
tejun
Re: cgroup: status-quo and userland efforts
On Thu, 2013-06-27 at 21:09 -0700, Tejun Heo wrote:
> No, it's completely messed up. We're now starting to see users trying to embed low level cgroup details into their binaries and cgroup is exposing sysctl-level knobs which are directly tied to internal implementation of core subsystems. cgroup successfully bypassed the usual kernel API policing with the help of hierarchical filesystem interface which allows delegation on the surface. We completely fucked up. This is a full scale disaster unrolling.

I always thought that was a very cool feature, mkdir+echo, poof done. Now maybe that interface is suboptimal for serious usage, but it makes the things usable via dirt simple scripts, very flexible, nice.

But whatever, not my call, you know your business better than I. If mandatory agent happens, fine, but imho that will be sad day.

-Mike
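For readers who have not used the interface, the "mkdir+echo" workflow mentioned above amounts to two filesystem operations. The following is a minimal Python sketch of those two steps. The helper names are made up for illustration, and the root directory is a parameter so the code can be tried against a scratch directory; on a real cgroup (v1) mount the kernel creates the control files itself and writing to them normally requires root.

```python
import os
import tempfile


def create_group(root, name):
    """Equivalent of `mkdir <root>/<name>`: one directory per cgroup."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    return path


def write_knob(group_path, knob, value):
    """Equivalent of `echo <value> > <group>/<knob>`."""
    with open(os.path.join(group_path, knob), "w") as f:
        f.write(f"{value}\n")


def read_knob(group_path, knob):
    """Equivalent of `cat <group>/<knob>`, with the trailing newline removed."""
    with open(os.path.join(group_path, knob)) as f:
        return f.read().strip()
```

The simplicity is exactly what both sides of the thread are pointing at: nothing beyond file permissions stands between a script and the knob it writes.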
Re: cgroup: status-quo and userland efforts
Hello, Mike.

On Fri, Jun 28, 2013 at 05:46:38AM +0200, Mike Galbraith wrote:
> Sure, because in private property and I mandatory agent, I see "gimme yer wallet bitch", an incredibly arrogant and brutal mugging. That's not the way it's meant, I know that, but that's how it comes across. You asked, so you get the straight up answer.

I don't know. It reads more like a tongue-in-cheek thing to me rather than being actually arrogant, and some part of the brutality is necessary at this point.

> Offering to manage cgroups is one thing, very generous, forcefully placing itself between user and kernel quite another. Perhaps I misread, but my interpretation was that the intent is to make systemd a mandatory agent, even saw reference to it taking up residence in the kernel tree (that bit made me chuckle, pull request would have to be very cleverly worded methinks). I'm sure it will be quite capable, its authors are. However, when I want to talk to my kernel, I expect to be able to tell anyone else using the phone to hang up.. now.

I don't know how to respond to this. It feels more emotional than technical.

> It's useful now, usable to the point that enterprise users exist who have integrated cgroups into their business model. But then you know that. Sure, there are problems, things could and no doubt will get a lot better.

No, it's completely messed up. We're now starting to see users trying to embed low level cgroup details into their binaries and cgroup is exposing sysctl-level knobs which are directly tied to internal implementation of core subsystems. cgroup successfully bypassed the usual kernel API policing with the help of hierarchical filesystem interface which allows delegation on the surface. We completely fucked up. This is a full scale disaster unrolling.

> However, wrt userspace agent, no agent is going to be the right answer for all, so that agent needs to have a step aside button so another agent can be tasked with the managerial duties, whether that be little ole /me or Aunt Tilly piddling with this and that because we damn well feel like it, or BigFoot company X going massively wild and crazy doing their business thing.

*ANY* agent is better than now. We need to back the hell out of direct usages as soon as possible. cgroup is leaking kernel implementation details into individual binaries. The current situation is dangerous and putting an agent in between is a good way of gradually backing out of it.

> No, it's not at all crazy, _offering_ the user a managerial service is great, generous, way to go guys, pass out the white hats. Use force, and those pretty white hats turn black as night, hero to villain.

No, it's completely crazy. Full psycho crazy. You just don't realize it yet.

> systemd and no systemd is also a valid issue. I'm sure it'll all get worked out, but that link, and others like it make me see bright red.

That red is nothing compared to the kernel implementation detail leak going on right now. The alarm for that has been blinking psychedelically for some time now.

Thanks.

--
tejun
Re: cgroup: status-quo and userland efforts
On Thu, 2013-06-27 at 11:01 -0700, Tejun Heo wrote: > Hello, Mike. > > On Thu, Jun 27, 2013 at 07:45:07AM +0200, Mike Galbraith wrote: > > I can understand some alarm. When I saw the below I started frothing at > > the face and howling at the moon, and I don't even use the things much. > > Can I ask why? The reasons are not apparent to me. Sure, because in private property and I mandatory agent, I see "gimme yer wallet bitch", an incredibly arrogant and brutal mugging. That's not the way it's meant, I know that, but that's how it comes across. You asked, so you get the straight up answer. Offering to manage cgroups is one thing, very generous, forcefully placing itself between user and kernel quite another. Perhaps I misread, but my interpretation was that the intent is to make systemd a mandatory agent, even saw reference to it taking up residence in the kernel tree (that bit made me chuckle, pull request would have to be very cleverly worded methinks). I'm sure it will be quite capable, its authors are. However, when I want to talk to my kernel, I expect to be able to tell anyone else using the phone to hang up.. now. > > http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html > > > > Hierarchy layout aside, that "private property" bit says that the folks > > who currently own and use the cgroups interface will lose direct access > > to it. I can imagine folks who have become dependent upon an on the fly > > management agents of their own design becoming a tad alarmed. > > They're gonna be able to do what they've been doing for the > foreseeable future if they choose not to use systemd's unified Those are the comforting words I wanted to hear, that the user chooses, that the user will not find that this that or any other userspace agent gains the right to insert itself between user and kernel. 
> AFAICS, having a userland agent which has overall knowledge of the > hierarchy and enforcesf structure and limiations is a requirement to > make cgroup generally useable and useful. It's useful now, usable to the point that enterprise users exist who have integrated cgroups into their business model. But then you know that. Sure, there are problems, things could and no doubt will get a lot better. However, wrt userspace agent, no agent is going to be the right answer for all, so that agent needs to have a step aside button so another agent can be tasked with the managerial duties, whether that be little ole /me or Aunt Tilly piddling with this and that because we damn well feel like it, or BigFoot company X going massively wild and crazy doing their business thing. > For systemd based systems, > systemd serving that role isn't too crazy. It's sure gonna have > teeting issues at the beginning but it has all the necessary > information to manage workloads on the system. No, it's not at all crazy, _offering_ the user a managerial service is great, generous, way to go guys, pass out the white hats. Use force, and those pretty white hats turn black as night, hero to villain. > A valid issue is interoperability between systemd and non-systemd > systems. I don't have an immediately good answer for that. I wrote > in another reply but making cgroup generally available is a pretty new > effort and we're still in the process of figuring out what the right > constructs and abstractions are. Hopefully, we'll be able to reach a > common set of abstractions to base things on top in itme. systemd and no systemd is also a valid issue. I'm sure it'll all get worked out, but that link, and others like it make me see bright red. -Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
Hello,

On Thu, Jun 27, 2013 at 01:46:18PM -0700, Tim Hockin wrote:
> So what you're saying is that you don't care that this new thing is less capable than the old thing, despite it having real impact.

Sort of. I'm saying, at least up until now, moving away from orthogonal hierarchy support seems to be the right trade-off. It all depends on how you measure how much things are simplified and how heavy the "real impacts" are. It's not like these things can be determined in black and white. Given the current situation, I think it's the right call.

> If controller C is enabled at level X but disabled at level X/Y, does that mean that X/Y uses the limits set in X? How about X/Y/Z?

Y and Y/Z wouldn't make any difference. Tasks belonging to them would behave as if they belong to X as far as C is concerned.

> So take away some of the flexibility that has minimal impact and maximum return. Splitting threads across cgroups - we use it, but we could get off that. Force all-or-nothing joining of an aggregate

Please do so.

> construct (a container vs N cgroups).
>
> But perform surgery with a scalpel, not a hatchet.

As anything else, it's drawing a line in a continuous spectrum of grey. Right now, given that maintaining multiple orthogonal hierarchies while introducing a proper concept of resource container involves addition of completely new constructs and complexity, I don't think that's a good option.

If there are problems which can't be resolved / worked around in a reasonable manner, please bring them up along with their contexts. Let's examine them and see whether there are other ways to accommodate them.

Thanks.

--
tejun
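The answer to the X / X/Y / X/Y/Z question above can be restated as a lookup rule: for a given controller, a task is governed by the nearest cgroup at or above its own where that controller is enabled. The sketch below expresses that rule in Python. It is an interpretation of the reply, not kernel code; the function name and the representation of the hierarchy as a set of path strings are assumptions made for illustration.

```python
def effective_cgroup(path, enabled):
    """Return the cgroup whose settings for one controller apply to `path`.

    `enabled` is the set of cgroup paths on which the controller is enabled.
    Walk up from `path` towards the root and stop at the first enabled
    cgroup; fall back to the root "/" if none is found.
    """
    parts = [p for p in path.split("/") if p]
    while parts:
        candidate = "/" + "/".join(parts)
        if candidate in enabled:
            return candidate
        parts.pop()  # move one level up
    return "/"
```

With the controller enabled only on `/X`, both `effective_cgroup("/X/Y", {"/X"})` and `effective_cgroup("/X/Y/Z", {"/X"})` resolve to `/X`, matching "would behave as if they belong to X".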
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 11:14 AM, Serge Hallyn wrote: > Quoting Tejun Heo (t...@kernel.org): >> Hello, Serge. >> >> On Thu, Jun 27, 2013 at 08:22:06AM -0500, Serge Hallyn wrote: >> > At some point (probably soon) we might want to talk about a standard API >> > for these things. However I think it will have to come in the form of >> > a standard library, which knows to either send requests over dbus to >> > systemd, or over /dev/cgroup sock to the manager. >> >> Yeah, eventually, I think we'll have a standardized way to configure >> resource distribution in the system. Maybe we'll agree on a >> standardized dbus protocol or there will be library, I don't know; >> however, whatever form it may be in, it abstraction level should be >> way higher than that of direct cgroupfs access. It's way too low >> level and very easy to end up in a complete nonsense configuration. >> >> e.g. enabling "cpu" on a cgroup whlie leaving other cgroups alone >> wouldn't enable fair scheduling on that cgroup but drastically reduce >> the amount of cpu share it gets as it now gets treated as single >> entity competing with all tasks at the parent level. > > Right. I *think* this can be offered as a daemon which sits as the > sole consumer of my agent's API, and offers a higher level "do what I > want" API. But designing that API is going to be interesting. This is something we have, partially, and are working to be able to open-source. We have a LOT of experience feeding into the semantics that actually make users happy. Today it leverages split-hierarchies, but that is not required in the generic case (only if you want to offer the semantics we do). It explicitly delegates some aspects of sub-cgroup control to users, but that could go away if your lowest-level agency can handle it. > I should find a good, up-to-date summary of the current behaviors of > each controller so I can talk more intelligently about it. 
(I'll > start by looking at the kernel Documentation/cgroups, but don't > feel too confident that they'll be uptodate :) > >> At the moment, I'm not sure what the eventual abstraction would look >> like. systemd is extending its basic constructs by adding slices and >> scopes and it does make sense to integrate the general organization of >> the system (services, user sessions, VMs and so on) with resource >> management. Given some time, I'm hoping we'll be able to come up with >> and agree on some common constructs so that each workload can indicate >> its resource requirements in a unified way. >> >> That said, I really think we should experiment for a while before >> trying to settle down on things. We've now just started exploring how >> system-wide resource managment can be made widely available to systems >> without requiring extremely specialized hand-crafted configurations >> and I'm pretty sure we're getting and gonna get quite a few details >> wrong, so I don't think it'd be a good idea to try to agree on things >> right now. As far as such integration goes, I think it's time to play >> with things and observe the results. > > Right, I'm not attached to my toy implementation at all - except for > the ability, in some fashion, to have nested agents which don't have > cgroupfs access but talk to another agent to get the job done. > > -serge -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 10:38 AM, Tejun Heo wrote: > Hello, Tim. > > On Wed, Jun 26, 2013 at 08:42:21PM -0700, Tim Hockin wrote: >> OK, then what I don't know is what is the new interface? A new cgroupfs? > > It's gonna be a new mount option for cgroupfs. > >> DTF and CPU and cpuset all have "default" groups for some tasks (and >> not others) in our world today. DTF actually has default, prio, and >> "normal". I was simplifying before. I really wish it were as simple >> as you think it is. But if it were, do you think I'd still be >> arguing? > > How am I supposed to know when you don't communicate it but just wave > your hands saying it's all very complicated? The cpuset / blkcg > example is pretty bad because you can enforce any cpuset rules at the > leaves. Modifying hundreds of cgroups is really painful, and yes, we do it often enough to be able to see it. >> This really doesn't scale when I have thousands of jobs running. >> Being able to disable at some levels on some controllers probably >> helps some, but I can't say for sure without knowing the new interface > > How does the number of jobs affect it? Does each job create a new > cgroup? Well, in your model it does... >> We tried it in unified hierarchy. We had our Top People on the >> problem. The best we could get was bad enough that we embarked on a >> LITERAL 2 year transition to make it better. > > What didn't work? What part was so bad? I find it pretty difficult > to believe that multiple orthogonal hierarchies is the only possible > solution, so please elaborate the issues that you guys have > experienced. I'm looping in more Google people. > The hierarchy is for organization and enforcement of dynamic > hierarchical resource distribution and that's it. If its expressive > power is lacking, take compromise or tune the configuration according > to the workloads. 
The latter is necessary in workloads which have > clear distinction of foreground and background anyway - anything which > interacts with human beings including androids. So what you're saying is that you don't care that this new thing is less capable than the old thing, despite it having real impact. >> In other words, define a container as a set of cgroups, one under each >> each active controller type. A TID enters the container atomically, >> joining all of the cgroups or none of the cgroups. >> >> container C1 = { /cgroup/cpu/foo, /cgroup/memory/bar, >> /cgroup/io/default/foo/bar, /cgroup/cpuset/ >> >> This is an abstraction that we maintain in userspace (more or less) >> and we do actually have headaches from split hierarchies here >> (handling partial failures, non-atomic joins, etc) > > That'd separate out task organization from controllre config > hierarchies. Kay had a similar idea some time ago. I think it makes > things even more complex than it is right now. I'll continue on this > below. > >> I'm still a bit fuzzy - is all of this written somewhere? > > If you dig through cgroup ML, most are there. There'll be > "cgroup.controllers" file with which you can enable / disable > controllers. Enabling a controller in a cgroup implies that the > controller is enabled in all ancestors. Implies or requires? Cause or predicate? If controller C is enabled at level X but disabled at level X/Y, does that mean that X/Y uses the limits set in X? How about X/Y/Z? This will get rid of the bulk of the cpuset scaling problem, but not all of it. I think we still have the same problems with cpu as we do with io. Perhaps that should have been the example. >> It sounds like you're missing a layer of abstraction. Why not add the >> abstraction you want to expose on top of powerful primitives, instead >> of dumbing down the primitives? 
> > It sure would be possible build more and try to address the issues > we're seeing now; however, after looking at cgroups for some time now, > the underlying theme is failure to take reasonable trade-offs and > going for maximum flexibility in making each choice - the choice of > interface, multiple hierarchies, no restriction on hierarchical > behavior, splitting threads of the same process into separate cgroups, > semi-encouraging delegation through file permission without actually > pondering the consequences and so on. And each choice probably made > sense trying to serve each immediate requirement at the time but added > up it's a giant pile of mess which developed without direction. I am very sympathetic to this problem. You could have just described some of our internal problems too. The difference is that we are trying to make changes that provide more structure and boundaries in ways that retain the fundamental power, without tossing out the baby with the bathwater. > So, at this point, I'm very skeptical about adding more flexibility. > Once the basics are settled, we sure can look into the missing pieces > but I don't think that's what we should be doing right now. Another > thing is that the unified hiera
Re: cgroup: status-quo and userland efforts
Hello,

On Thu, Jun 27, 2013 at 11:51 AM, Serge Hallyn wrote:
>> I think it probably would be better to allow organization and RO
>
> What do you mean by "organization"? Creating cgroups and moving tasks between them, without setting other cgroup values?

Yeap, I also think that's how user sessions are gonna be handled. We're gonna have limited amount of delegation for organization and read accesses.

Thanks.

--
tejun
Re: cgroup: status-quo and userland efforts
Quoting Tejun Heo (t...@kernel.org):
> Hello, Serge.
>
> On Thu, Jun 27, 2013 at 01:14:57PM -0500, Serge Hallyn wrote:
> > I should find a good, up-to-date summary of the current behaviors of each controller so I can talk more intelligently about it. (I'll start by looking at the kernel Documentation/cgroups, but don't feel too confident that they'll be uptodate :)
>
> Heh, it's hopelessly outdated. Sorry about that. I'll get around to updating it eventually. Right now everything is in flux.
>
> > Right, I'm not attached to my toy implementation at all - except for the ability, in some fashion, to have nested agents which don't have cgroupfs access but talk to another agent to get the job done.
>
> I think it probably would be better to allow organization and RO

What do you mean by "organization"? Creating cgroups and moving tasks between them, without setting other cgroup values?

> access to knobs and stat files inside containers, for lower overhead, if nothing else, and have comm channel for operations which need supervision at a wider level.
>
> Thanks.
>
> --
> tejun
Re: cgroup: status-quo and userland efforts
Hello, Serge.

On Thu, Jun 27, 2013 at 01:14:57PM -0500, Serge Hallyn wrote:
> I should find a good, up-to-date summary of the current behaviors of each controller so I can talk more intelligently about it. (I'll start by looking at the kernel Documentation/cgroups, but don't feel too confident that they'll be uptodate :)

Heh, it's hopelessly outdated. Sorry about that. I'll get around to updating it eventually. Right now everything is in flux.

> Right, I'm not attached to my toy implementation at all - except for the ability, in some fashion, to have nested agents which don't have cgroupfs access but talk to another agent to get the job done.

I think it probably would be better to allow organization and RO access to knobs and stat files inside containers, for lower overhead, if nothing else, and have comm channel for operations which need supervision at a wider level.

Thanks.

--
tejun
Re: cgroup: status-quo and userland efforts
Quoting Tejun Heo (t...@kernel.org): > Hello, Serge. > > On Thu, Jun 27, 2013 at 08:22:06AM -0500, Serge Hallyn wrote: > > At some point (probably soon) we might want to talk about a standard API > > for these things. However I think it will have to come in the form of > > a standard library, which knows to either send requests over dbus to > > systemd, or over /dev/cgroup sock to the manager. > > Yeah, eventually, I think we'll have a standardized way to configure > resource distribution in the system. Maybe we'll agree on a > standardized dbus protocol or there will be library, I don't know; > however, whatever form it may be in, it abstraction level should be > way higher than that of direct cgroupfs access. It's way too low > level and very easy to end up in a complete nonsense configuration. > > e.g. enabling "cpu" on a cgroup whlie leaving other cgroups alone > wouldn't enable fair scheduling on that cgroup but drastically reduce > the amount of cpu share it gets as it now gets treated as single > entity competing with all tasks at the parent level. Right. I *think* this can be offered as a daemon which sits as the sole consumer of my agent's API, and offers a higher level "do what I want" API. But designing that API is going to be interesting. I should find a good, up-to-date summary of the current behaviors of each controller so I can talk more intelligently about it. (I'll start by looking at the kernel Documentation/cgroups, but don't feel too confident that they'll be uptodate :) > At the moment, I'm not sure what the eventual abstraction would look > like. systemd is extending its basic constructs by adding slices and > scopes and it does make sense to integrate the general organization of > the system (services, user sessions, VMs and so on) with resource > management. Given some time, I'm hoping we'll be able to come up with > and agree on some common constructs so that each workload can indicate > its resource requirements in a unified way. 
> > That said, I really think we should experiment for a while before > trying to settle down on things. We've now just started exploring how > system-wide resource managment can be made widely available to systems > without requiring extremely specialized hand-crafted configurations > and I'm pretty sure we're getting and gonna get quite a few details > wrong, so I don't think it'd be a good idea to try to agree on things > right now. As far as such integration goes, I think it's time to play > with things and observe the results. Right, I'm not attached to my toy implementation at all - except for the ability, in some fashion, to have nested agents which don't have cgroupfs access but talk to another agent to get the job done. -serge -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
Hello, Mike.

On Thu, Jun 27, 2013 at 07:45:07AM +0200, Mike Galbraith wrote:
> I can understand some alarm. When I saw the below I started frothing at the face and howling at the moon, and I don't even use the things much.

Can I ask why? The reasons are not apparent to me.

> http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html
>
> Hierarchy layout aside, that "private property" bit says that the folks who currently own and use the cgroups interface will lose direct access to it. I can imagine folks who have become dependent upon an on the fly management agents of their own design becoming a tad alarmed.

They're gonna be able to do what they've been doing for the foreseeable future if they choose not to use systemd's unified resource management. That said, what we have today is pretty lousy and a lot of hierarchical stuff was completely broken until some releases ago and things *must* have been broken on the userland side too. It could have worked for their specific setup but I strongly doubt there is anything generic working well out in the wild. cgroup hasn't been capable of supporting something like that.

AFAICS, having a userland agent which has overall knowledge of the hierarchy and enforces structure and limitations is a requirement to make cgroup generally usable and useful. For systemd based systems, systemd serving that role isn't too crazy. It's sure gonna have teething issues at the beginning but it has all the necessary information to manage workloads on the system.

A valid issue is interoperability between systemd and non-systemd systems. I don't have an immediately good answer for that. As I wrote in another reply, making cgroup generally available is a pretty new effort and we're still in the process of figuring out what the right constructs and abstractions are. Hopefully, we'll be able to reach a common set of abstractions to base things on top in time.

Thanks.

--
tejun
Re: cgroup: status-quo and userland efforts
Hello, Serge. On Thu, Jun 27, 2013 at 08:22:06AM -0500, Serge Hallyn wrote: > At some point (probably soon) we might want to talk about a standard API > for these things. However I think it will have to come in the form of > a standard library, which knows to either send requests over dbus to > systemd, or over /dev/cgroup sock to the manager. Yeah, eventually, I think we'll have a standardized way to configure resource distribution in the system. Maybe we'll agree on a standardized dbus protocol or there will be library, I don't know; however, whatever form it may be in, it abstraction level should be way higher than that of direct cgroupfs access. It's way too low level and very easy to end up in a complete nonsense configuration. e.g. enabling "cpu" on a cgroup whlie leaving other cgroups alone wouldn't enable fair scheduling on that cgroup but drastically reduce the amount of cpu share it gets as it now gets treated as single entity competing with all tasks at the parent level. At the moment, I'm not sure what the eventual abstraction would look like. systemd is extending its basic constructs by adding slices and scopes and it does make sense to integrate the general organization of the system (services, user sessions, VMs and so on) with resource management. Given some time, I'm hoping we'll be able to come up with and agree on some common constructs so that each workload can indicate its resource requirements in a unified way. That said, I really think we should experiment for a while before trying to settle down on things. We've now just started exploring how system-wide resource managment can be made widely available to systems without requiring extremely specialized hand-crafted configurations and I'm pretty sure we're getting and gonna get quite a few details wrong, so I don't think it'd be a good idea to try to agree on things right now. As far as such integration goes, I think it's time to play with things and observe the results. Thanks. 
--
tejun
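The "cpu" example in the message above can be made concrete with a back-of-the-envelope model. The sketch below is a deliberate simplification and not scheduler code: it assumes every task and every cgroup carries the same weight, that all tasks are always runnable, and that a cgroup with the controller enabled competes at the parent level as one entity. The function names and the task counts are hypothetical.

```python
from fractions import Fraction


def per_task_share_flat(total_tasks: int) -> Fraction:
    """CPU fraction per task when all tasks compete directly at one level."""
    return Fraction(1, total_tasks)


def per_task_share_grouped(parent_tasks: int, group_tasks: int) -> Fraction:
    """CPU fraction per task inside a cgroup that competes as one entity.

    The parent level sees parent_tasks + 1 equal-weight entities (the loose
    tasks plus the group). The group's slice is then split among its tasks.
    """
    group_slice = Fraction(1, parent_tasks + 1)
    return group_slice / group_tasks


if __name__ == "__main__":
    # Hypothetical numbers: 10 tasks stay at the parent level and 10 are
    # moved into a cgroup that has "cpu" enabled.
    print("flat:   ", per_task_share_flat(20))         # 1/20
    print("grouped:", per_task_share_grouped(10, 10))  # 1/110
```

Under these assumptions each task in the group drops from 1/20 to 1/110 of the CPU, which is the "drastically reduce" effect described above.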
Re: cgroup: status-quo and userland efforts
Hello, Tim.

On Wed, Jun 26, 2013 at 08:42:21PM -0700, Tim Hockin wrote:
> OK, then what I don't know is what is the new interface? A new cgroupfs?

It's gonna be a new mount option for cgroupfs.

> DTF and CPU and cpuset all have "default" groups for some tasks (and
> not others) in our world today. DTF actually has default, prio, and
> "normal". I was simplifying before. I really wish it were as simple
> as you think it is. But if it were, do you think I'd still be
> arguing?

How am I supposed to know when you don't communicate it but just wave your hands saying it's all very complicated? The cpuset / blkcg example is pretty bad because you can enforce any cpuset rules at the leaves.

> This really doesn't scale when I have thousands of jobs running.
> Being able to disable at some levels on some controllers probably
> helps some, but I can't say for sure without knowing the new interface

How does the number of jobs affect it? Does each job create a new cgroup?

> We tried it in unified hierarchy. We had our Top People on the
> problem. The best we could get was bad enough that we embarked on a
> LITERAL 2 year transition to make it better.

What didn't work? What part was so bad? I find it pretty difficult to believe that multiple orthogonal hierarchies are the only possible solution, so please elaborate on the issues that you guys have experienced.

The hierarchy is for organization and enforcement of dynamic hierarchical resource distribution and that's it. If its expressive power is lacking, compromise or tune the configuration according to the workloads. The latter is necessary in workloads which have a clear distinction of foreground and background anyway - anything which interacts with human beings including androids.

> In other words, define a container as a set of cgroups, one under
> each active controller type. A TID enters the container atomically,
> joining all of the cgroups or none of the cgroups.
>
> container C1 = { /cgroup/cpu/foo, /cgroup/memory/bar,
> /cgroup/io/default/foo/bar, /cgroup/cpuset/
>
> This is an abstraction that we maintain in userspace (more or less)
> and we do actually have headaches from split hierarchies here
> (handling partial failures, non-atomic joins, etc)

That'd separate out task organization from controller config hierarchies. Kay had a similar idea some time ago. I think it makes things even more complex than they are right now. I'll continue on this below.

> I'm still a bit fuzzy - is all of this written somewhere?

If you dig through the cgroup ML, most of it is there. There'll be a "cgroup.controllers" file with which you can enable / disable controllers. Enabling a controller in a cgroup implies that the controller is enabled in all ancestors.

> It sounds like you're missing a layer of abstraction. Why not add the
> abstraction you want to expose on top of powerful primitives, instead
> of dumbing down the primitives?

It sure would be possible to build more and try to address the issues we're seeing now; however, after looking at cgroups for some time now, the underlying theme is failure to make reasonable trade-offs and going for maximum flexibility in making each choice - the choice of interface, multiple hierarchies, no restriction on hierarchical behavior, splitting threads of the same process into separate cgroups, semi-encouraging delegation through file permission without actually pondering the consequences and so on. And each choice probably made sense in serving an immediate requirement at the time, but added up it's a giant pile of mess which developed without direction.

So, at this point, I'm very skeptical about adding more flexibility. Once the basics are settled, we sure can look into the missing pieces but I don't think that's what we should be doing right now. Another thing is that the unified hierarchy can be implemented by using most of the constructs cgroup core already has in a more controlled way. Given that we're gonna have to maintain both interfaces for quite some time, the deviation should be kept as minimal as possible.

> But it seems vastly better to define a next-gen API that retains the
> important flexibility but adds structure where it was lacking
> previously.

I suppose that's where we disagree. I think a lot of cgroup's problems stem from too much flexibility. The problem with such a level of flexibility is that, in addition to breaking fundamental constructs and adding significantly to maintenance overhead, it blocks reasonable trade-offs from being made at the right places, in turn requiring more "flexibility" to address the introduced deficiencies.

Thanks.

--
tejun
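The rule stated in the message above, that enabling a controller in a cgroup implies it is enabled in all ancestors, can be illustrated with a toy in-memory tree. This is only a sketch of that one rule. The `Cgroup` class and its `enable` method are hypothetical names, and nothing here reflects how the kernel or the planned "cgroup.controllers" file is actually implemented.

```python
from __future__ import annotations


class Cgroup:
    """Hypothetical in-memory stand-in for one node of a unified hierarchy."""

    def __init__(self, name: str, parent: Cgroup | None = None) -> None:
        self.name = name
        self.parent = parent
        self.enabled: set[str] = set()

    def enable(self, controller: str) -> None:
        """Enable a controller here and, by implication, in every ancestor."""
        node: Cgroup | None = self
        while node is not None:
            node.enabled.add(controller)
            node = node.parent


if __name__ == "__main__":
    root = Cgroup("root")
    a = Cgroup("a", root)
    b = Cgroup("b", a)
    c = Cgroup("c", root)  # sibling subtree, left untouched
    b.enable("memory")
    for node in (root, a, b, c):
        print(node.name, sorted(node.enabled))
```

Siblings such as `c` are unaffected, so a subtree that does not need a controller can simply leave it off.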
Re: cgroup: status-quo and userland efforts
Quoting Tim Hockin (thoc...@hockin.org): > On Thu, Jun 27, 2013 at 6:22 AM, Serge Hallyn wrote: > > Quoting Mike Galbraith (bitbuc...@online.de): > >> On Wed, 2013-06-26 at 14:20 -0700, Tejun Heo wrote: > >> > Hello, Tim. > >> > > >> > On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote: > >> > > I really want to understand why this is SO IMPORTANT that you have to > >> > > break userspace compatibility? I mean, isn't Linux supposed to be the > >> > > OS with the stable kernel interface? I've seen Linus rant time and > >> > > time again about this - why is it OK now? > >> > > >> > What the hell are you talking about? Nobody is breaking userland > >> > interface. A new version of interface is being phased in and the old > >> > one will stay there for the foreseeable future. It will be phased out > >> > eventually but that's gonna take a long time and it will have to be > >> > something hardly noticeable. Of course new features will only be > >> > available with the new interface and there will be efforts to nudge > >> > people away from the old one but the existing interface will keep > >> > working it does. > >> > >> I can understand some alarm. When I saw the below I started frothing at > >> the face and howling at the moon, and I don't even use the things much. > >> > >> http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html > >> > >> Hierarchy layout aside, that "private property" bit says that the folks > >> who currently own and use the cgroups interface will lose direct access > >> to it. I can imagine folks who have become dependent upon an on the fly > >> management agents of their own design becoming a tad alarmed. > > > > FWIW, the code is too embarassing yet to see daylight, but I'm playing > > with a very lowlevel cgroup manager which supports nesting itself. 
> > Access in this POC is low-level ("set freezer.state to THAWED for cgroup
> > /c1/c2", "Create /c3"), but the key feature is that it can run in two
> > modes - native mode in which it uses cgroupfs, and child mode where it
> > talks to a parent manager to make the changes.
>
> In this world, are users able to read cgroup files, or do they have to
> go through a central agent, too?

The agent won't itself do anything to stop access through cgroupfs, but the idea would be that cgroupfs would only be mounted in the agent's mntns. My hope would be that the libcgroup commands (like cgexec, cgcreate, etc) would know to talk to the agent when possible, and users would use those.

> > So then the idea would be that userspace (like libvirt and lxc) would
> > talk over /dev/cgroup to its manager. Userspace inside a container
> > (which can't actually mount cgroups itself) would talk to its own
> > manager which is talking over a passed-in socket to the host manager,
> > which in turn runs natively (uses cgroupfs, and nests "create /c1" under
> > the requestor's cgroup).
>
> How do you handle updates of this agent? Suppose I have hundreds of
> running containers, and I want to release a new version of the cgroupd
> ?

This may change (which is part of what I want to investigate with some POC), but right now I'm not building any controller-aware smarts into it. I think that's what you're asking about? The agent doesn't do "slices" etc. This may turn out to be insufficient, we'll see.

So the only state which the agent stores is a list of cgroup mounts (if in native mode) or an open socket to the parent (if in child mode), and a list of connected children sockets.

HUPping the agent will cause it to reload the cgroupfs mounts (in case you've mounted a new controller, living in "the old world" :). If you just kill it and start a new one, it shouldn't matter.

> (note: inquiries about the implementation do not denote acceptance of
> the model :)

To put it another way, the problem I'm solving (for now) is not the "I want a daemon to ensure that requested guarantees are correctly implemented" one. In that sense I'm maintaining the status quo, i.e. the admin needs to architect the layout correctly. The problem I'm solving is really that I want containers to be able to handle cgroups even if they can't mount cgroupfs, and I want all userspace to be able to behave the same whether they are in a container or not.

This isn't meant as a poke in the eye of anyone who wants to address the other problem. If it turns out that we (meaning "the community of cgroup users") really want such an agent, then we can add that. I'm not convinced.

What would probably be a better design, then, would be that the agent I'm working on can plug into a resource allocation agent. Or, I suppose, the other way around.

> > At some point (probably soon) we might want to talk about a standard API
> > for these things. However I think it will have to come in the form of
> > a standard library, which knows to either send requests over dbus to
> > systemd, or over /dev/cgroup sock to the manager.
> >
> > -serge
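The two-mode manager described in the message above can be sketched as a small in-memory model. Everything here is an assumption made for illustration: the `Manager` class, its `create` method and the example paths are hypothetical, a Python list stands in for cgroupfs, a direct method call stands in for the socket, and the requestor's cgroup is passed explicitly where a real manager would have to determine it from the connection.

```python
from __future__ import annotations


def _join(base: str, path: str) -> str:
    """Nest path under base, avoiding duplicate slashes."""
    return base.rstrip("/") + "/" + path.lstrip("/")


class Manager:
    """Toy cgroup manager: native mode if parent is None, else child mode."""

    def __init__(self, own_cgroup: str = "/", parent: Manager | None = None) -> None:
        self.own_cgroup = own_cgroup      # cgroup this manager itself lives in
        self.parent = parent              # stand-in for the socket to the parent
        self.created: list[str] = []      # stand-in for cgroupfs (native mode)

    def create(self, path: str, requestor_cgroup: str = "/") -> str:
        nested = _join(requestor_cgroup, path)
        if self.parent is None:
            # Native mode: a real manager would create the directory in cgroupfs.
            self.created.append(nested)
            return nested
        # Child mode: forward upward, identifying ourselves as the requestor.
        return self.parent.create(nested, requestor_cgroup=self.own_cgroup)


if __name__ == "__main__":
    host = Manager()
    container = Manager(own_cgroup="/lxc/web", parent=host)
    print(host.create("/c3"))       # created directly by the host manager
    print(container.create("/c1"))  # nested under the container's own cgroup
```

Because a child only forwards requests, the only state it needs is the link to its parent, which matches the point above that killing and restarting the agent should not matter.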
Re: cgroup: status-quo and userland efforts
On Thu, Jun 27, 2013 at 6:22 AM, Serge Hallyn wrote: > Quoting Mike Galbraith (bitbuc...@online.de): >> On Wed, 2013-06-26 at 14:20 -0700, Tejun Heo wrote: >> > Hello, Tim. >> > >> > On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote: >> > > I really want to understand why this is SO IMPORTANT that you have to >> > > break userspace compatibility? I mean, isn't Linux supposed to be the >> > > OS with the stable kernel interface? I've seen Linus rant time and >> > > time again about this - why is it OK now? >> > >> > What the hell are you talking about? Nobody is breaking userland >> > interface. A new version of interface is being phased in and the old >> > one will stay there for the foreseeable future. It will be phased out >> > eventually but that's gonna take a long time and it will have to be >> > something hardly noticeable. Of course new features will only be >> > available with the new interface and there will be efforts to nudge >> > people away from the old one but the existing interface will keep >> > working it does. >> >> I can understand some alarm. When I saw the below I started frothing at >> the face and howling at the moon, and I don't even use the things much. >> >> http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html >> >> Hierarchy layout aside, that "private property" bit says that the folks >> who currently own and use the cgroups interface will lose direct access >> to it. I can imagine folks who have become dependent upon an on the fly >> management agents of their own design becoming a tad alarmed. > > FWIW, the code is too embarassing yet to see daylight, but I'm playing > with a very lowlevel cgroup manager which supports nesting itself. > Access in this POC is low-level ("set freezer.state to THAWED for cgroup > /c1/c2", "Create /c3"), but the key feature is that it can run in two > modes - native mode in which it uses cgroupfs, and child mode where it > talks to a parent manager to make the changes. 
In this world, are users able to read cgroup files, or do they have to go through a central agent, too?

> So then the idea would be that userspace (like libvirt and lxc) would
> talk over /dev/cgroup to its manager. Userspace inside a container
> (which can't actually mount cgroups itself) would talk to its own
> manager which is talking over a passed-in socket to the host manager,
> which in turn runs natively (uses cgroupfs, and nests "create /c1" under
> the requestor's cgroup).

How do you handle updates of this agent? Suppose I have hundreds of running containers, and I want to release a new version of the cgroupd?

(note: inquiries about the implementation do not denote acceptance of the model :)

> At some point (probably soon) we might want to talk about a standard API
> for these things. However I think it will have to come in the form of
> a standard library, which knows to either send requests over dbus to
> systemd, or over /dev/cgroup sock to the manager.
>
> -serge
Re: cgroup: status-quo and userland efforts
Quoting Mike Galbraith (bitbuc...@online.de):
> On Wed, 2013-06-26 at 14:20 -0700, Tejun Heo wrote:
> > Hello, Tim.
> >
> > On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote:
> > > I really want to understand why this is SO IMPORTANT that you have to
> > > break userspace compatibility? I mean, isn't Linux supposed to be the
> > > OS with the stable kernel interface? I've seen Linus rant time and
> > > time again about this - why is it OK now?
> >
> > What the hell are you talking about? Nobody is breaking userland
> > interface. A new version of interface is being phased in and the old
> > one will stay there for the foreseeable future. It will be phased out
> > eventually but that's gonna take a long time and it will have to be
> > something hardly noticeable. Of course new features will only be
> > available with the new interface and there will be efforts to nudge
> > people away from the old one but the existing interface will keep
> > working it does.
>
> I can understand some alarm. When I saw the below I started frothing at
> the face and howling at the moon, and I don't even use the things much.
>
> http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html
>
> Hierarchy layout aside, that "private property" bit says that the folks
> who currently own and use the cgroups interface will lose direct access
> to it. I can imagine folks who have become dependent upon an on the fly
> management agents of their own design becoming a tad alarmed.

FWIW, the code is too embarrassing yet to see daylight, but I'm playing with a very low-level cgroup manager which supports nesting itself. Access in this POC is low-level ("set freezer.state to THAWED for cgroup /c1/c2", "Create /c3"), but the key feature is that it can run in two modes - native mode in which it uses cgroupfs, and child mode where it talks to a parent manager to make the changes.

So then the idea would be that userspace (like libvirt and lxc) would talk over /dev/cgroup to its manager. Userspace inside a container (which can't actually mount cgroups itself) would talk to its own manager which is talking over a passed-in socket to the host manager, which in turn runs natively (uses cgroupfs, and nests "create /c1" under the requestor's cgroup).

At some point (probably soon) we might want to talk about a standard API for these things. However I think it will have to come in the form of a standard library, which knows to either send requests over dbus to systemd, or over /dev/cgroup sock to the manager.

-serge
Re: cgroup: status-quo and userland efforts
On Wed, 2013-06-26 at 14:20 -0700, Tejun Heo wrote:
> Hello, Tim.
>
> On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote:
> > I really want to understand why this is SO IMPORTANT that you have to
> > break userspace compatibility? I mean, isn't Linux supposed to be the
> > OS with the stable kernel interface? I've seen Linus rant time and
> > time again about this - why is it OK now?
>
> What the hell are you talking about? Nobody is breaking userland
> interface. A new version of interface is being phased in and the old
> one will stay there for the foreseeable future. It will be phased out
> eventually but that's gonna take a long time and it will have to be
> something hardly noticeable. Of course new features will only be
> available with the new interface and there will be efforts to nudge
> people away from the old one but the existing interface will keep
> working it does.

I can understand some alarm. When I saw the below I started frothing at the face and howling at the moon, and I don't even use the things much.

http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html

Hierarchy layout aside, that "private property" bit says that the folks who currently own and use the cgroups interface will lose direct access to it. I can imagine folks who have become dependent upon on-the-fly management agents of their own design becoming a tad alarmed.

-Mike
Re: cgroup: status-quo and userland efforts
On Wed, Jun 26, 2013 at 6:04 PM, Tejun Heo wrote: > Hello, > > On Wed, Jun 26, 2013 at 05:06:02PM -0700, Tim Hockin wrote: >> The first assertion, as I understood, was that (eventually) cgroupfs >> will not allow split hierarchies - that unified hierarchy would be the >> only mode. Is that not the case? > > No, unified hierarchy would be an optional thing for quite a while. > >> The second assertion, as I understood, was that (eventually) cgroupfs >> would not support granting access to some cgroup control files to >> users (through chown/chmod). Is that not the case? > > Again, it'll be an opt-in thing. The hierarchy controller would be > able to notice that and issue warnings if it wants to. > >> Hmm, so what exactly is changing then? If, as you say here, the >> existing interfaces will keep working - what is changing? > > New interface is being added and new features will be added only for > the new interface. The old one will eventually be deprecated and > removed, but that *years* away. OK, then what I don't know is what is the new interface? A new cgroupfs? >> As I said, it's controlled delegated access. And we have some patches >> that we carry to prevent some of these DoS situations. > > I don't know. You can probably hack around some of the most serious > problems but the whole thing isn't built for proper delgation and > that's not the direction the upstream kernel is headed at the moment. > >> I actually can not speak to the details of the default IO problem, as >> it happened before I really got involved. But just think through it. >> If one half of the split has 5 processes running and the other half >> has 200, the processes in the 200 set each get FAR less spindle time >> than those in the 5 set. That is NOT the semantic we need. We're >> trying to offer ~equal access for users of the non-DTF class of jobs. >> >> This is not the tail doing the wagging. This is your assertion that >> something should work, when it just doesn't. 
We have two, totally >> orthogonal classes of applications on two totally disjoint sets of >> resources. Conjoining them is the wrong answer. > > As I've said multiple times, there sure are things that you cannot > achieve without orthogonal multiple hierarchies, but given the options > we have at hands, compromising inside a unified hierarchy seems like > the best trade-off. Please take a step back from the immediate detail > and think of the general hierarchical organization of workloads. If > DTF / non-DTF is a fundamental part of your workload classfication, > that should go above. DTF and CPU and cpuset all have "default" groups for some tasks (and not others) in our world today. DTF actually has default, prio, and "normal". I was simplifying before. I really wish it were as simple as you think it is. But if it were, do you think I'd still be arguing? > I don't really understand your example anyway because you can classify > by DTF / non-DTF first and then just propagate cpuset settings along. > You won't lose anything that way, right? This really doesn't scale when I have thousands of jobs running. Being able to disable at some levels on some controllers probably helps some, but I can't say for sure without knowing the new interface > Again, in general, you might not be able to achieve *exactly* what > you've been doing, but, an acceptable compromise should be possible > and not doing so leads to complete mess. We tried it in unified hierarchy. We had our Top People on the problem. The best we could get was bad enough that we embarked on a LITERAL 2 year transition to make it better. >> > But I don't follow the conclusion here. For short term workaround, >> > sure, but having that dictate the whole architecture decision seems >> > completely backwards to me. >> >> My point is that the orthogonality of resources is intrinsic. Letting >> "it's hard to make it work" dictate the architecture is what's >> backwards. > > No, it's not "it's hard to make it work". 
It's more "it's > fundamentally broken". You can't identify a resource to be belonging > to a cgroup independent of who's looking at the resource. What if you could ensure that for a given TID (or PID if required) in dir X of controller C, all of the other TIDs in that cgroup were in the same group, but maybe not the same sub-path, under every controller? This gives you what it sounds like you wanted elsewhere - a container abstraction. In other words, define a container as a set of cgroups, one under each each active controller type. A TID enters the container atomically, joining all of the cgroups or none of the cgroups. container C1 = { /cgroup/cpu/foo, /cgroup/memory/bar, /cgroup/io/default/foo/bar, /cgroup/cpuset/ This is an abstraction that we maintain in userspace (more or less) and we do actually have headaches from split hierarchies here (handling partial failures, non-atomic joins, etc) >> I'm not sure what "differing level of granularities" means? But that >
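The container abstraction proposed in the message above (a TID joins all of the container's cgroups or none of them) can be emulated in userspace with a rollback. The sketch below uses the three complete paths from the C1 example and leaves out the cpuset entry, which is cut off in the original. The `FakeCgroups` class and `join_container` function are hypothetical and in-memory only, and the rollback is simplified: a real agent would have to move the TID back to its previous cgroup, and without kernel support the join is not truly atomic, which is exactly the "partial failures, non-atomic joins" headache mentioned above.

```python
class JoinError(Exception):
    """Raised when a TID could not join every cgroup of a container."""


class FakeCgroups:
    """In-memory stand-in for per-controller hierarchies (illustration only)."""

    def __init__(self, broken=()):
        self.members = {}          # cgroup path -> set of TIDs
        self.broken = set(broken)  # paths where attaching is made to fail

    def attach(self, cgroup, tid):
        if cgroup in self.broken:
            raise OSError(f"cannot attach {tid} to {cgroup}")
        self.members.setdefault(cgroup, set()).add(tid)

    def detach(self, cgroup, tid):
        self.members.get(cgroup, set()).discard(tid)


def join_container(tid, cgroups, fs):
    """Attach tid to every cgroup in order; undo earlier attaches on failure."""
    joined = []
    try:
        for cgroup in cgroups:
            fs.attach(cgroup, tid)
            joined.append(cgroup)
    except OSError as exc:
        for cgroup in reversed(joined):
            fs.detach(cgroup, tid)
        raise JoinError(f"TID {tid} joined none of the cgroups") from exc


C1 = ["/cgroup/cpu/foo", "/cgroup/memory/bar", "/cgroup/io/default/foo/bar"]

if __name__ == "__main__":
    fs = FakeCgroups()
    join_container(42, C1, fs)
    print(fs.members)
```

A kernel-side container object would make the rollback path unnecessary, at the cost of a new primitive.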
Re: cgroup: status-quo and userland efforts
Hello,

On Wed, Jun 26, 2013 at 05:06:02PM -0700, Tim Hockin wrote:
> The first assertion, as I understood, was that (eventually) cgroupfs
> will not allow split hierarchies - that unified hierarchy would be the
> only mode. Is that not the case?

No, unified hierarchy would be an optional thing for quite a while.

> The second assertion, as I understood, was that (eventually) cgroupfs
> would not support granting access to some cgroup control files to
> users (through chown/chmod). Is that not the case?

Again, it'll be an opt-in thing. The hierarchy controller would be able to notice that and issue warnings if it wants to.

> Hmm, so what exactly is changing then? If, as you say here, the
> existing interfaces will keep working - what is changing?

A new interface is being added and new features will be added only for the new interface. The old one will eventually be deprecated and removed, but that's *years* away.

> As I said, it's controlled delegated access. And we have some patches
> that we carry to prevent some of these DoS situations.

I don't know. You can probably hack around some of the most serious problems but the whole thing isn't built for proper delegation and that's not the direction the upstream kernel is headed at the moment.

> I actually can not speak to the details of the default IO problem, as
> it happened before I really got involved. But just think through it.
> If one half of the split has 5 processes running and the other half
> has 200, the processes in the 200 set each get FAR less spindle time
> than those in the 5 set. That is NOT the semantic we need. We're
> trying to offer ~equal access for users of the non-DTF class of jobs.
>
> This is not the tail doing the wagging. This is your assertion that
> something should work, when it just doesn't. We have two, totally
> orthogonal classes of applications on two totally disjoint sets of
> resources. Conjoining them is the wrong answer.

As I've said multiple times, there sure are things that you cannot achieve without orthogonal multiple hierarchies, but given the options we have at hand, compromising inside a unified hierarchy seems like the best trade-off. Please take a step back from the immediate detail and think of the general hierarchical organization of workloads. If DTF / non-DTF is a fundamental part of your workload classification, that should go above.

I don't really understand your example anyway because you can classify by DTF / non-DTF first and then just propagate cpuset settings along. You won't lose anything that way, right?

Again, in general, you might not be able to achieve *exactly* what you've been doing, but an acceptable compromise should be possible, and not compromising leads to a complete mess.

> > But I don't follow the conclusion here. For short term workaround,
> > sure, but having that dictate the whole architecture decision seems
> > completely backwards to me.
>
> My point is that the orthogonality of resources is intrinsic. Letting
> "it's hard to make it work" dictate the architecture is what's
> backwards.

No, it's not "it's hard to make it work". It's more "it's fundamentally broken". You can't identify a resource as belonging to a cgroup independent of who's looking at the resource.

> I'm not sure what "differing level of granularities" means? But that

It means that you'll be able to ignore subtrees depending on controllers.

> aside, who have you spoken to here? On our internal discussions I
> have not heard a SINGLE member of our prod-kernel team nor our cluster
> management team who think this is a good idea. Not one.

Some of the memcg and blkcg people in the infra kernel team.

> I still don't really get what the hellish mess is, and why it can't be
> solved another way. Your statement of "unified hierarchy isn't gonna
> break them" is patently false, though.
If we did this it would a) > cause a large amount of work to happen and b) cause a major regression > for our users. No, what I meant was that unified hierarchy won't break the multiple hierarchy support immediately. > I'm trying to understand your root problem so that I can try to find > other solutions. "Just do what I say" is not a great way to defend > your position in the face of evidence to the contrary. I'm presenting > you real life cases of situations that simply do not work, neither > philosophically nor in practice, and you continue to assert that it's > fine. It's not fine. I wrote about that many times, but here are two of the problems. * There's no way to designate a cgroup to a resource, because cgroup is only defined by the combination of who's looking at it for which controller. That's how you end up with tagging the same resource multiple times for different controllers and even then it's broken as when you move resources from one cgroup to another, you can't tell what to do with other tags. While allowing obscene level of flexibility, multiple hierarchies destroy a very fu
Re: cgroup: status-quo and userland efforts
On Wed, 26 Jun 2013, Tim Hockin wrote:
> On Wed, Jun 26, 2013 at 2:20 PM, Tejun Heo wrote:
>> Hello, Tim.
>>
>> On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote:
>>> I really want to understand why this is SO IMPORTANT that you have to
>>> break userspace compatibility? I mean, isn't Linux supposed to be the
>>> OS with the stable kernel interface? I've seen Linus rant time and
>>> time again about this - why is it OK now?
>>
>> What the hell are you talking about? Nobody is breaking userland
>> interface. A new version of interface is being phased in and the old
>
> The first assertion, as I understood, was that (eventually) cgroupfs
> will not allow split hierarchies - that unified hierarchy would be the
> only mode. Is that not the case?
>
> The second assertion, as I understood, was that (eventually) cgroupfs
> would not support granting access to some cgroup control files to
> users (through chown/chmod). Is that not the case?

As a bystander, what I understand to be happening is:

1. the kernel developers are saying that multiple hierarchies are causing lots of problems, and so they are starting the migration to a unified hierarchy. In the near term this will be optional; at a later (unspecified) point, it will no longer be optional. It is recognized that this is an API break, but the problem is bad enough (too much undefined behavior) that it looks like they are going to do this anyway.

2. independently from this, the systemd people have declared that systemd is going to take control of this unified hierarchy and all applications had better use DBUS calls to systemd to make any cgroup changes or else. (i.e. systemd may break whatever you are doing)

I don't think the kernel developers are talking about changing the ways to control cgroups, just eliminating having multiple hierarchies.

Now, I could be completely misunderstanding this (and I expect to hear about it if I am :-)

David Lang
Re: cgroup: status-quo and userland efforts
On Wed, Jun 26, 2013 at 2:20 PM, Tejun Heo wrote: > Hello, Tim. > > On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote: >> I really want to understand why this is SO IMPORTANT that you have to >> break userspace compatibility? I mean, isn't Linux supposed to be the >> OS with the stable kernel interface? I've seen Linus rant time and >> time again about this - why is it OK now? > > What the hell are you talking about? Nobody is breaking userland > interface. A new version of interface is being phased in and the old The first assertion, as I understood, was that (eventually) cgroupfs will not allow split hierarchies - that unified hierarchy would be the only mode. Is that not the case? The second assertion, as I understood, was that (eventually) cgroupfs would not support granting access to some cgroup control files to users (through chown/chmod). Is that not the case? > one will stay there for the foreseeable future. It will be phased out > eventually but that's gonna take a long time and it will have to be > something hardly noticeable. Of course new features will only be > available with the new interface and there will be efforts to nudge > people away from the old one but the existing interface will keep > working it does. Hmm, so what exactly is changing then? If, as you say here, the existing interfaces will keep working - what is changing? >> Examples? we obviously don't grant full access, but our kernel gang >> and security gang seem to trust the bits we're enabling well enough... > > Then the security gang doesn't have any clue what's going on, or at > least operating on very different assumptions (ie. the workloads are > trusted by default). You can OOM the whole kernel by creating many > cgroups, completely mess up controllers by creating deep hierarchies, > affect your siblings by adjusting your weight and so on. It's really > easy to DoS the whole system if you have write access to a cgroup > directory. 
As I said, it's controlled delegated access. And we have some patches that we carry to prevent some of these DoS situations. >> The non-DTF jobs have a combined share that is small but non-trivial. >> If we cut that share in half, giving one slice to prod and one slice >> to batch, we get bad sharing under contention. We tried this. We > > Why is that tho? It *should* work fine and I can't think of a reason > why that would behave particularly badly off the top of my head. > Maybe I forgot too much of the iosched modification used in google. > Anyways, if there's a problem, that should be fixable, right? And > controller-specific issues like that should really dictate the > architectural design too much. I actually can not speak to the details of the default IO problem, as it happened before I really got involved. But just think through it. If one half of the split has 5 processes running and the other half has 200, the processes in the 200 set each get FAR less spindle time than those in the 5 set. That is NOT the semantic we need. We're trying to offer ~equal access for users of the non-DTF class of jobs. This is not the tail doing the wagging. This is your assertion that something should work, when it just doesn't. We have two, totally orthogonal classes of applications on two totally disjoint sets of resources. Conjoining them is the wrong answer. >> could add control loops in userspace code which try to balance the >> shares in proportion to the load. We did that with CPU, and it's sort > > Yeah, that is horrible. Yeah, I would love to explain some of the really nasty things we have done and are moving away from. I am not sure I am allowed to, though :) >> of horrible. We're moving AWAY from all this craziness in favor of >> well-defined hierarchical behaviors. > > But I don't follow the conclusion here. For short term workaround, > sure, but having that dictate the whole architecture decision seems > completely backwards to me. 
My point is that the orthogonality of resources is intrinsic. Letting "it's hard to make it work" dictate the architecture is what's backwards. >> It's a bit naive to think that this is some absolute truth, don't you >> think? It just isn't so. You should know better than most what >> craziness our users do, and what (legit) rationales they can produce. >> I have $large_number of machines running $huge_number of jobs from >> thousands of developers running for years upon years backing up my >> worldview. > > If so, you aren't communicating it very well. I've talked with quite > a few people about multiple orthogonal hierarchies including people > inside google. Sure, some are using it as it is there but I couldn't > find strong enough rationale to continue that way given the amount of > crazys it implies / encourages. On the other hand, most people agreed > that having a unified hierarchy with differing level of granularities > would serve their cases well enough while not being crazy. I'm not sure what "diff
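To put rough numbers on the 5-versus-200 argument in the message above, here is a minimal sketch. It assumes an idealized proportional-share scheduler in which a group's share is divided evenly among its runnable processes. The 10% combined share and the helper name are made up for illustration; none of this models the real blkio/CFQ code.

```python
from fractions import Fraction

def per_process_share(group_share, nprocs):
    """Share of the device each process gets if the group's share is split evenly."""
    return Fraction(group_share) / nprocs

combined = Fraction(1, 10)  # assumed combined non-DTF share (illustrative only)

# One shared "default" group: all 205 processes compete as equals.
shared = per_process_share(combined, 5 + 200)

# Same share cut in half: one slice for prod (5 procs), one for batch (200 procs).
prod = per_process_share(combined / 2, 5)
batch = per_process_share(combined / 2, 200)

print(shared, prod, batch, prod / batch)  # 1/2050 1/100 1/4000 40
```

Under these assumptions a process in the 5-process half gets 40 times the share of a process in the 200-process half, whereas the single shared group gives every process the same 1/2050.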
Re: cgroup: status-quo and userland efforts
Hello, Tim.

On Mon, Jun 24, 2013 at 09:07:47PM -0700, Tim Hockin wrote:
> I really want to understand why this is SO IMPORTANT that you have to
> break userspace compatibility? I mean, isn't Linux supposed to be the
> OS with the stable kernel interface? I've seen Linus rant time and
> time again about this - why is it OK now?

What the hell are you talking about? Nobody is breaking userland
interface. A new version of interface is being phased in and the old
one will stay there for the foreseeable future. It will be phased out
eventually but that's gonna take a long time and it will have to be
something hardly noticeable. Of course new features will only be
available with the new interface and there will be efforts to nudge
people away from the old one but the existing interface will keep
working as it does.

> Examples? we obviously don't grant full access, but our kernel gang
> and security gang seem to trust the bits we're enabling well enough...

Then the security gang doesn't have any clue what's going on, or is at
least operating on very different assumptions (ie. the workloads are
trusted by default). You can OOM the whole kernel by creating many
cgroups, completely mess up controllers by creating deep hierarchies,
affect your siblings by adjusting your weight and so on. It's really
easy to DoS the whole system if you have write access to a cgroup
directory.

> The non-DTF jobs have a combined share that is small but non-trivial.
> If we cut that share in half, giving one slice to prod and one slice
> to batch, we get bad sharing under contention. We tried this. We

Why is that tho? It *should* work fine and I can't think of a reason
why that would behave particularly badly off the top of my head.
Maybe I forgot too much of the iosched modification used in google.
Anyways, if there's a problem, that should be fixable, right? And
controller-specific issues like that shouldn't really dictate the
architectural design too much.
> could add control loops in userspace code which try to balance the > shares in proportion to the load. We did that with CPU, and it's sort Yeah, that is horrible. > of horrible. We're moving AWAY from all this craziness in favor of > well-defined hierarchical behaviors. But I don't follow the conclusion here. For short term workaround, sure, but having that dictate the whole architecture decision seems completely backwards to me. > It's a bit naive to think that this is some absolute truth, don't you > think? It just isn't so. You should know better than most what > craziness our users do, and what (legit) rationales they can produce. > I have $large_number of machines running $huge_number of jobs from > thousands of developers running for years upon years backing up my > worldview. If so, you aren't communicating it very well. I've talked with quite a few people about multiple orthogonal hierarchies including people inside google. Sure, some are using it as it is there but I couldn't find strong enough rationale to continue that way given the amount of crazys it implies / encourages. On the other hand, most people agreed that having a unified hierarchy with differing level of granularities would serve their cases well enough while not being crazy. Really, I have $huge_number of machines configured certain way isn't much of an argument when unified hierarchy isn't gonna break them and many people involved in cgroup both on kernel and userland sides share the view that the whole thing is a hellish mess which can only be used by crafting very specialized configurations for each setup. > I'm not sure I really grok that statement. I'm OK with defining new That's about google's blkcg modifications to support blkcg on writeback IOs. It works but can't be upstreamed as it requires tagging each page both with memcg and blkcg tags. > rules that bring some order to the chaos. Give us new rules to live > by. All-or-nothing would be fine. 
What if mounting cgroupfs gives me > N sub-dirs, one for each compiled-in controller? You could make THAT > the mount option - you can have either a unified hierarchy of all > controllers or fully disjoint hierarchies. Or some other rule. Now I'm lost what you're talking about. But the summary is, in the future, use a single unified hierarchy with differing granularities. It's still being worked on, so, for now, try not to depend on creating completely orthogonal hierarchies for different controllers. > The time frame you talk about IS reason for panic. If I know that What time frame are you referring to? > you're going to completely screw me in a a year and a half, I have to How the hell am I gonna screw you in a year and half? What are you talking about? Where is this coming from? > start moving NOW to find new ways to hack around the mess you're > making, make my userspace mesh with it, test those things with > critical customers, find a way to deploy it safely to a bajillion > machines, handle inevitable rollba
Re: cgroup: status-quo and userland efforts
On Mon, Jun 24, 2013 at 5:01 PM, Tejun Heo wrote: > Hello, Tim. > > On Sat, Jun 22, 2013 at 04:13:41PM -0700, Tim Hockin wrote: >> I'm very sorry I let this fall off my plate. I was pointed at a >> systemd-devel message indicating that this is done. Is it so? It > > It's progressing pretty fast. > >> seems so completely ass-backwards to me. Below is one of our use-cases >> that I just don't see how we can reproduce in a single-heierarchy. > > Configurations which depend on orthogonal multiple hierarchies of > course won't be replicated under unified hierarchy. It's unfortunate > but those just have to go. More on this later. I really want to understand why this is SO IMPORTANT that you have to break userspace compatibility? I mean, isn't Linux supposed to be the OS with the stable kernel interface? I've seen Linus rant time and time again about this - why is it OK now? >> We're also long into the model that users can control their own >> sub-cgroups (moderated by permissions decided by admin SW up front). > > If you're in control of the base system, nothing prevents you from > doing so. It's utterly broken security and policy-enforcement point > of view but if you can trust each software running on your system to > do the right thing, it's gonna be fine. Examples? we obviously don't grant full access, but our kernel gang and security gang seem to trust the bits we're enabling well enough... >> This gives us 4 combinations: >> 1) { production, DTF } >> 2) { production, non-DTF } >> 3) { batch, DTF } >> 4) { batch non-DTF } >> >> Of these, (3) is sort of nonsense, but the others are actually used >> and needed. This is only >> possible because of split hierarchies. In fact, we undertook a very painful >> process to move from a unified cgroup hierarchy to split hierarchies in large >> part _because of_ these examples. > > You can create three sibling cgroups and configure cpuset and blkio > accordingly. For cpuset, the setup wouldn't make any different. 
For > blkio, the two non-DTFs would now belong to different cgroups and > compete with each other as two groups, which won't matter at all as > non-DTFs are given what's left over after serving DTFs anyway, IIRC. The non-DTF jobs have a combined share that is small but non-trivial. If we cut that share in half, giving one slice to prod and one slice to batch, we get bad sharing under contention. We tried this. We could add control loops in userspace code which try to balance the shares in proportion to the load. We did that with CPU, and it's sort of horrible. We're moving AWAY from all this craziness in favor of well-defined hierarchical behaviors. >> Making cgroups composable allows us to build a higher level abstraction that >> is very powerful and flexible. Moving back to unified hierarchies goes >> against everything that we're doing here, and will cause us REAL pain. > > Categorizing processes into hierarchical groups of tasks is a > fundamental idea and a fundamental idea is something to base things on > top of as it's something people can agree upon relatively easily and > establish a structure by. I'd go as far as saying that it's the > failure on the part of workload design if they in general can't be > categorized hierarchically. It's a bit naive to think that this is some absolute truth, don't you think? It just isn't so. You should know better than most what craziness our users do, and what (legit) rationales they can produce. I have $large_number of machines running $huge_number of jobs from thousands of developers running for years upon years backing up my worldview. > Even at the practical level, the orthogonal hierarchy encouraged, at > the very least, the blkcg writeback support which can't be upstreamed > in any reasonable manner because it is impossible to say that a > resource can't be said to belong to a cgroup irrespective of who's > looking at it. I'm not sure I really grok that statement. 
I'm OK with defining new rules that bring some order to the chaos. Give us new rules to live by. All-or-nothing would be fine. What if mounting cgroupfs gives me N sub-dirs, one for each compiled-in controller? You could make THAT the mount option - you can have either a unified hierarchy of all controllers or fully disjoint hierarchies. Or some other rule. > It's something fundamentally broken and I have very difficult time > believing google's workload is so different that it can't be > categorized in a single hierarchy for the purpose of resource > distribution. I'm sure there are cases where some compromises are > necessary but the laternative is much worse here. As I wrote multiple > times now, multiple orthogonal hierarchy support is gonna be around > for some time, so I don't think there's any rason for panic; that > said, please at least plan to move on. The time frame you talk about IS reason for panic. If I know that you're going to completely screw me in a a year and a half, I have to start
Re: cgroup: status-quo and userland efforts
Hello, Tim. On Sat, Jun 22, 2013 at 04:13:41PM -0700, Tim Hockin wrote: > I'm very sorry I let this fall off my plate. I was pointed at a > systemd-devel message indicating that this is done. Is it so? It It's progressing pretty fast. > seems so completely ass-backwards to me. Below is one of our use-cases > that I just don't see how we can reproduce in a single-heierarchy. Configurations which depend on orthogonal multiple hierarchies of course won't be replicated under unified hierarchy. It's unfortunate but those just have to go. More on this later. > We're also long into the model that users can control their own > sub-cgroups (moderated by permissions decided by admin SW up front). If you're in control of the base system, nothing prevents you from doing so. It's utterly broken security and policy-enforcement point of view but if you can trust each software running on your system to do the right thing, it's gonna be fine. > This gives us 4 combinations: > 1) { production, DTF } > 2) { production, non-DTF } > 3) { batch, DTF } > 4) { batch non-DTF } > > Of these, (3) is sort of nonsense, but the others are actually used > and needed. This is only > possible because of split hierarchies. In fact, we undertook a very painful > process to move from a unified cgroup hierarchy to split hierarchies in large > part _because of_ these examples. You can create three sibling cgroups and configure cpuset and blkio accordingly. For cpuset, the setup wouldn't make any different. For blkio, the two non-DTFs would now belong to different cgroups and compete with each other as two groups, which won't matter at all as non-DTFs are given what's left over after serving DTFs anyway, IIRC. > Making cgroups composable allows us to build a higher level abstraction that > is very powerful and flexible. Moving back to unified hierarchies goes > against everything that we're doing here, and will cause us REAL pain. 
Categorizing processes into hierarchical groups of tasks is a
fundamental idea and a fundamental idea is something to base things on
top of as it's something people can agree upon relatively easily and
establish a structure by. I'd go as far as saying that it's the
failure on the part of workload design if they in general can't be
categorized hierarchically.

Even at the practical level, the orthogonal hierarchy encouraged, at
the very least, the blkcg writeback support which can't be upstreamed
in any reasonable manner because it is impossible to say that a
resource can't be said to belong to a cgroup irrespective of who's
looking at it.

It's something fundamentally broken and I have a very difficult time
believing google's workload is so different that it can't be
categorized in a single hierarchy for the purpose of resource
distribution. I'm sure there are cases where some compromises are
necessary but the alternative is much worse here. As I wrote multiple
times now, multiple orthogonal hierarchy support is gonna be around
for some time, so I don't think there's any reason for panic; that
said, please at least plan to move on.

Thanks.

--
tejun
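The "three sibling cgroups" suggestion in the message above can be modelled at class level with a small sketch. It is an assumption-laden simplification: the real setup described in the thread uses per-job subdirectories, and the function names here are hypothetical. It only shows how the four { CPU class, IO class } combinations map onto groups in each layout.

```python
from itertools import product

CPU_CLASSES = ("production", "batch")
IO_CLASSES = ("DTF", "non-DTF")

def split_groups():
    """Split hierarchies: one set of groups per controller, chosen independently."""
    return {"cpuset": set(CPU_CLASSES), "io": set(IO_CLASSES)}

def unified_groups(unused=(("batch", "DTF"),)):
    """Unified hierarchy: one sibling group per combination that is actually used."""
    return [c for c in product(CPU_CLASSES, IO_CLASSES) if c not in unused]

siblings = unified_groups()
print(siblings)

# In the unified layout the two non-DTF groups are distinct siblings, so they
# no longer sit together in a single io "default" group.
non_dtf = [g for g in siblings if g[1] == "non-DTF"]
print(len(siblings), len(non_dtf))  # 3 2
```

In this toy model the split layout keeps every non-DTF job in one io group regardless of its CPU class, while the unified layout ends up with two non-DTF siblings that compete as groups. That is the point of disagreement between the two posters.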
Re: cgroup: status-quo and userland efforts
I'm very sorry I let this fall off my plate. I was pointed at a
systemd-devel message indicating that this is done. Is it so? It
seems so completely ass-backwards to me. Below is one of our use-cases
that I just don't see how we can reproduce in a single hierarchy.
We're also long into the model that users can control their own
sub-cgroups (moderated by permissions decided by admin SW up front).

We have classes of jobs which can run together on shared machines.
This is VERY important to us, and is a key part of how we run things.
Over the years we have evolved from very little isolation to fairly
strong isolation, and cgroups are a large part of that. We have
experienced and adapted to a number of problems around isolation over
time. I won't go into the history of all of these, because it's not so
relevant, but here is how we set things up today.

From a CPU perspective, we have two classes of jobs: production and
batch. Production jobs can (but don't always) ask for exclusive cores,
which ensures that no batch work runs on those CPUs. We manage this
with the cpuset cgroup. Batch jobs are relegated to the set of CPUs
that are "left-over" after exclusivity rules are applied. This is
implemented with a shared subdirectory of the cpuset cgroup called
"batch". Production jobs get their own subdirectories under cpuset.

From an IO perspective we also have two classes of jobs: normal and
DTF-approved. Normal jobs do not get strong isolation for IO, whereas
DTF-enabled jobs do. The vast majority of jobs are NOT DTF-enabled,
and they share a nominal amount of IO bandwidth. This is implemented
with a shared subdirectory of the io cgroup called "default". Jobs
that are DTF-enabled get their own subdirectories under IO.

This gives us 4 combinations:
  1) { production, DTF }
  2) { production, non-DTF }
  3) { batch, DTF }
  4) { batch, non-DTF }

Of these, (3) is sort of nonsense, but the others are actually used
and needed. This is only possible because of split hierarchies.
In fact, we undertook a very painful process to move from a unified cgroup hierarchy to split hierarchies in large part _because of_ these examples. And for more fun, I am simplifying this all. Batch jobs are actually bound to NUMA-node specific cpuset cgroups when possible. And we have a similar concept for the cpu cgroup as for cpuset. And we have a third tier of IO jobs. We don't do all of this for fun - it is in direct response to REAL problems we have experienced. Making cgroups composable allows us to build a higher level abstraction that is very powerful and flexible. Moving back to unified hierarchies goes against everything that we're doing here, and will cause us REAL pain. On Mon, Apr 22, 2013 at 3:33 PM, Tim Hockin wrote: > On Mon, Apr 22, 2013 at 11:41 PM, Tejun Heo wrote: >> Hello, Tim. >> >> On Mon, Apr 22, 2013 at 11:26:48PM +0200, Tim Hockin wrote: >>> We absolutely depend on the ability to split cgroup hierarchies. It >>> pretty much saved our fleet from imploding, in a way that a unified >>> hierarchy just could not do. A mandated unified hierarchy is madness. >>> Please step away from the ledge. >> >> You need to be a lot more specific about why unified hierarchy can't >> be implemented. The last time I asked around blk/memcg people in >> google, while they said that they'll need different levels of >> granularities for different controllers, google's use of cgroup >> doesn't require multiple orthogonal classifications of the same group >> of tasks. > > I'll pull some concrete examples together. I don't have them on hand, > and I am out of country this week. I have looped in the gang at work > (though some are here with me). > >> Also, cgroup isn't dropping multiple hierarchy support over-night. >> What has been working till now will continue to work for very long >> time. If there is no fundamental conflict with the future changes, >> there should be enough time to migrate gradually as desired. 
>> >>> More, going towards a unified hierarchy really limits what we can >>> delegate, and that is the word of the day. We've got a central >>> authority agent running which manages cgroups, and we want out of this >>> business. At least, we want to be able to grant users a set of >>> constraints, and then let them run wild within those constraints. >>> Forcing all such work to go through a daemon has proven to be very >>> problematic, and it has been great now that users can have DIY >>> sub-cgroups. >> >> Sorry, but that doesn't work properly now. It gives you the illusion >> of proper delegation but it's inherently dangerous. If that sort of >> illusion has been / is good enough for your setup, fine. Delegate at >> your own risks, but cgroup in itself doesn't support delegation to >> lesser security domains and it won't in the foreseeable future. > > We've had great success letting users create sub-cgroups in a few > specific controller types (cpu, cpuacct, memory). This is, of course, > w
Re: cgroup: status-quo and userland efforts
On Mon, Apr 22, 2013 at 11:41 PM, Tejun Heo wrote: > Hello, Tim. > > On Mon, Apr 22, 2013 at 11:26:48PM +0200, Tim Hockin wrote: >> We absolutely depend on the ability to split cgroup hierarchies. It >> pretty much saved our fleet from imploding, in a way that a unified >> hierarchy just could not do. A mandated unified hierarchy is madness. >> Please step away from the ledge. > > You need to be a lot more specific about why unified hierarchy can't > be implemented. The last time I asked around blk/memcg people in > google, while they said that they'll need different levels of > granularities for different controllers, google's use of cgroup > doesn't require multiple orthogonal classifications of the same group > of tasks. I'll pull some concrete examples together. I don't have them on hand, and I am out of country this week. I have looped in the gang at work (though some are here with me). > Also, cgroup isn't dropping multiple hierarchy support over-night. > What has been working till now will continue to work for very long > time. If there is no fundamental conflict with the future changes, > there should be enough time to migrate gradually as desired. > >> More, going towards a unified hierarchy really limits what we can >> delegate, and that is the word of the day. We've got a central >> authority agent running which manages cgroups, and we want out of this >> business. At least, we want to be able to grant users a set of >> constraints, and then let them run wild within those constraints. >> Forcing all such work to go through a daemon has proven to be very >> problematic, and it has been great now that users can have DIY >> sub-cgroups. > > Sorry, but that doesn't work properly now. It gives you the illusion > of proper delegation but it's inherently dangerous. If that sort of > illusion has been / is good enough for your setup, fine. 
> Delegate at your own risks, but cgroup in itself doesn't support
> delegation to lesser security domains and it won't in the foreseeable
> future.

We've had great success letting users create sub-cgroups in a few
specific controller types (cpu, cpuacct, memory). This is, of course,
with some restrictions. We do not just give them blanket access to
all knobs. We don't need ALL cgroups, just the important ones.

For a simple example, letting users create sub-groups in freezer or
job (we have a job group that we've been carrying) lets them launch
sub-tasks and manage them in a very clean way.

We've been doing a LOT of development internally to make user-defined
sub-memcgs work in our cluster scheduling system, and it's made some
of our biggest, more insane users very happy.

And for some cgroups, like cpuset, hierarchy just doesn't really make
sense to me. I just don't care if that never works, though I have no
problem with others wanting it. :)

Aside: if the last CPU in your cpuset goes offline, you should go into
a state akin to freezer. Running on any other CPU is an overt
violation of policy that the user, or worse - the admin, set up. Just
my 2cents.

>> Strong disagreement, here. We use split hierarchies to great effect.
>> Containment should be composable. If your users or abstractions can't
>> handle it, please feel free to co-mount the universe, but please
>> PLEASE don't force us to.
>>
>> I'm happy to talk more about what we do and why.
>
> Please do so. Why do you need multiple orthogonal hierarchies?

Look for this in the next few days/weeks. From our point of view,
cgroups are the ideal match for how we want to manage things (no
surprise, really, since Mr. Menage worked on both).

Tim
Re: cgroup: status-quo and userland efforts
Hello, Tim. On Mon, Apr 22, 2013 at 11:26:48PM +0200, Tim Hockin wrote: > We absolutely depend on the ability to split cgroup hierarchies. It > pretty much saved our fleet from imploding, in a way that a unified > hierarchy just could not do. A mandated unified hierarchy is madness. > Please step away from the ledge. You need to be a lot more specific about why unified hierarchy can't be implemented. The last time I asked around blk/memcg people in google, while they said that they'll need different levels of granularities for different controllers, google's use of cgroup doesn't require multiple orthogonal classifications of the same group of tasks. Also, cgroup isn't dropping multiple hierarchy support over-night. What has been working till now will continue to work for very long time. If there is no fundamental conflict with the future changes, there should be enough time to migrate gradually as desired. > More, going towards a unified hierarchy really limits what we can > delegate, and that is the word of the day. We've got a central > authority agent running which manages cgroups, and we want out of this > business. At least, we want to be able to grant users a set of > constraints, and then let them run wild within those constraints. > Forcing all such work to go through a daemon has proven to be very > problematic, and it has been great now that users can have DIY > sub-cgroups. Sorry, but that doesn't work properly now. It gives you the illusion of proper delegation but it's inherently dangerous. If that sort of illusion has been / is good enough for your setup, fine. Delegate at your own risks, but cgroup in itself doesn't support delegation to lesser security domains and it won't in the foreseeable future. > Strong disagreement, here. We use split hierarchies to great effect. > Containment should be composable. If your users or abstractions can't > handle it, please feel free to co-mount the universe, but please > PLEASE don't force us to. 
>
> I'm happy to talk more about what we do and why.

Please do so. Why do you need multiple orthogonal hierarchies?

Thanks.

--
tejun
Re: cgroup: status-quo and userland efforts
Hi Tejun, This email worries me. A lot. It sounds very much like retrograde motion from our (Google's) point of view. We absolutely depend on the ability to split cgroup hierarchies. It pretty much saved our fleet from imploding, in a way that a unified hierarchy just could not do. A mandated unified hierarchy is madness. Please step away from the ledge. More, going towards a unified hierarchy really limits what we can delegate, and that is the word of the day. We've got a central authority agent running which manages cgroups, and we want out of this business. At least, we want to be able to grant users a set of constraints, and then let them run wild within those constraints. Forcing all such work to go through a daemon has proven to be very problematic, and it has been great now that users can have DIY sub-cgroups. berra...@redhat.com said, downthread: > We ultimately do need the ability to delegate hierarchy creation to > unprivileged users / programs, in order to allow containerized OS to > have the ability to use cgroups. Requiring any applications inside a > container to talk to a cgroups "authority" existing on the host OS is > not a satisfactory architecture. We need to allow for a container to > be self-contained in its usage of cgroups. This! A thousand times, this! > At the same time, we don't need/want to give them unrestricted ability > to create arbitarily complex hiearchies - we need some limits on it > to avoid them exposing pathelogically bad kernel behaviour. > > This could be as simple as saying that each cgroup controller directory > has a tunable "cgroups.max_children" and/or "cgroups.max_depth" which > allow limits to be placed when delegating administration of part of a >cgroups tree to an unprivileged user. We've been bitten by this, and more limitations would be great. We've got some less-than-perfect patches that impose limits for us now. > I've no disagreement that we need a unified hiearchy. 
The workman > app explicitly does /not/ expose the concept of differing hiearchies > per controller. Likewise libvirt will not allow the user to configure > non-unified hiearchies. Strong disagreement, here. We use split hierarchies to great effect. Containment should be composable. If your users or abstractions can't handle it, please feel free to co-mount the universe, but please PLEASE don't force us to. I'm happy to talk more about what we do and why. Tim On Sat, Apr 6, 2013 at 3:21 AM, Tejun Heo wrote: > Hello, guys. > > Status-quo > == > > It's been about a year since I wrote up a summary on cgroup status quo > and future plans. We're not there yet but much closer than we were > before. At least the locking and object life-time management aren't > crazy anymore and most controllers now support proper hierarchy > although not all of them agree on how to treat inheritance. > > IIRC, the yet-to-be-converted ones are blk-throttle and perf. cpu > needs to be updated so that it at least supports a similar mechanism > as cfq-iosched for configuring ratio between tasks on an internal > cgroup and its children. Also, we really should update how cpuset > handles a cgroup becoming empty (no cpus or memory node left due to > hot-unplug). It currently transfers all its tasks to the nearest > ancestor with executing resources, which is an irreversible process > which would affect all other co-mounted controllers. We probably want > it to just take on the masks of the ancestor until its own executing > resources become online again, and the new behavior should be gated > behind a switch (Li, can you please look into this?). > > While we have still ways to go, I feel relatively confident saying > that we aren't too far out now, well, except for the writeback mess > that still needs to be tackled. Anyways, once the remaining bits are > settled, we can proceed to implement the unified hierarchy mode I've > been talking about forever. 
I can't think of any fundamental > roadblocks at the moment but who knows? The devil usually is in the > details. Let's hope it goes okay. > > So, while we aren't moving as fast as we wish we were, the kernel side > of things are falling into places. At least, that's how I see it. > From now on, I think how to make it actually useable to userland > deserves a bit more focus, and by "useable to userland", I don't mean > some group hacking up an elaborate, manual configuration which is > tailored to the point of being eccentric to suit the needs of the said > group. There's nothing wrong with that and they can continue to do > so, but it just isn't generically useable or useful. It should be > possible to generically and automatically split resources among, say, > several servers and a couple users sharing a system without resorting > to indecipherable ad-hoc shell script running off rc.local. > > > Userland efforts > > > There are currently a few userland efforts trying to make interfacing > with cgroup le
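The delegation limits proposed in the text quoted above ("cgroups.max_children" and/or "cgroups.max_depth") can be illustrated with a toy in-memory tree. This is only a sketch of one possible reading of that proposal: the `Group` class, its `mkdir` method and the exact semantics (a per-directory cap on direct children, and a depth cap measured from the directory that carries the limit) are assumptions for illustration, not an existing kernel interface.

```python
class Group:
    """Toy in-memory cgroup directory; not a kernel interface."""

    def __init__(self, name, parent=None, max_children=None, max_depth=None):
        self.name, self.parent, self.children = name, parent, []
        self.max_children, self.max_depth = max_children, max_depth

    def mkdir(self, name):
        # Walk up from the would-be parent and check every ancestor's depth
        # limit. `depth` is how far below `node` the new group would sit.
        node, depth = self, 1
        while node is not None:
            if node.max_depth is not None and depth > node.max_depth:
                raise PermissionError(f"max_depth exceeded at {node.name}")
            node, depth = node.parent, depth + 1
        # Cap the number of direct children of this directory.
        if self.max_children is not None and len(self.children) >= self.max_children:
            raise PermissionError(f"max_children exceeded at {self.name}")
        child = Group(name, parent=self)
        self.children.append(child)
        return child

# An admin delegates "user" with at most 2 children and 2 levels below it.
user = Group("user", max_children=2, max_depth=2)
a = user.mkdir("a")
b = user.mkdir("b")
a1 = a.mkdir("a1")      # two levels below "user": still allowed
```

With these limits, a third child of `user` or a grandchild of `a` would be refused, which is the kind of bound on "arbitrarily complex hierarchies" the quoted proposal asks for.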
Re: cgroup: status-quo and userland efforts
On 2013/4/17 1:10, Tejun Heo wrote:
> Hello, Li.
>
> On Tue, Apr 16, 2013 at 07:17:17PM +0800, Li Zefan wrote:
> ...
>>> hot-unplug). It currently transfers all its tasks to the nearest
>>> ancestor with executing resources, which is an irreversible process
>>> which would affect all other co-mounted controllers. We probably want
>>> it to just take on the masks of the ancestor until its own executing
>>> resources become online again, and the new behavior should be gated
>>> behind a switch (Li, can you please look into this?).
>>>
>>
>> Sure, I'll be working on sane hierarchy behavior for cpuset.
>
> Great, it'd be great if you can share how it's gonna be done once the
> basic design gets settled before full implementation.
>

The basic idea is, when a cpuset becomes empty due to hotplug, we don't
move the tasks in it, but instead we update the tasks' cpumask/nodemask
using the nearest non-empty ancestor cpuset's cpus_allowed and
mems_allowed.

- then it's allowed to move those tasks from the empty cpuset to
  another cpuset

- when this ancestor cpuset's cpumask/nodemask is changed (either by
  writing cpuset.cpus/mems or by hotplug), not only the tasks in it but
  also the tasks in the empty cpuset will be updated.

- it's allowed to move a task to an empty cpuset, and the task's
  cpumask/nodemask will be updated according to the nearest non-empty
  ancestor, no matter if this empty cpuset is exclusive or not.

- if a previously offlined cpu becomes online again, the empty cpuset
  won't get this cpu resource automatically, which is the current
  behavior.

How does this sound?
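The lookup rule proposed above can be sketched in a few lines of Python. This is only a toy model, not kernel code: the `Cpuset` class, `effective_cpus()` and `unplug_cpu()` are names made up for illustration, and only the CPU mask is modelled (the nodemask would follow the same rule).

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class Cpuset:
    """Toy model of a cpuset; only the configured CPU mask is tracked."""
    name: str
    cpus_allowed: Set[int] = field(default_factory=set)
    parent: Optional["Cpuset"] = None


def effective_cpus(cs: Cpuset) -> Set[int]:
    """Return cs's own mask, or the nearest non-empty ancestor's mask."""
    node: Optional[Cpuset] = cs
    while node is not None:
        if node.cpus_allowed:
            return set(node.cpus_allowed)
        node = node.parent
    return set()  # nothing configured anywhere up the tree


def unplug_cpu(cpusets: Iterable[Cpuset], cpu: int) -> None:
    """Model hot-unplug: drop the CPU from every cpuset's configured mask."""
    for cs in cpusets:
        cs.cpus_allowed.discard(cpu)
```

With `root = {0,1,2,3}`, `mid = {2,3}` and `leaf = {3}`, unplugging CPU 3 leaves `leaf` empty, and its tasks would then run on `mid`'s remaining mask. A later change to `mid` is picked up by `leaf` as well, and `leaf` stays empty even if CPU 3 comes back, matching the last bullet.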
Re: cgroup: status-quo and userland efforts
Hello, Li.

On Tue, Apr 16, 2013 at 07:17:17PM +0800, Li Zefan wrote:
...
> > hot-unplug). It currently transfers all its tasks to the nearest
> > ancestor with executing resources, which is an irreversible process
> > which would affect all other co-mounted controllers. We probably want
> > it to just take on the masks of the ancestor until its own executing
> > resources become online again, and the new behavior should be gated
> > behind a switch (Li, can you please look into this?).
> >
>
> Sure, I'll be working on sane hierarchy behavior for cpuset.

Great, it'd be great if you can share how it's gonna be done once the
basic design gets settled before full implementation.

Thanks a lot!

--
tejun
Re: cgroup: status-quo and userland efforts
On 2013/4/6 9:21, Tejun Heo wrote: > Hello, guys. > > Status-quo > == > > It's been about a year since I wrote up a summary on cgroup status quo > and future plans. We're not there yet but much closer than we were > before. At least the locking and object life-time management aren't > crazy anymore and most controllers now support proper hierarchy > although not all of them agree on how to treat inheritance. > > IIRC, the yet-to-be-converted ones are blk-throttle and perf. cpu > needs to be updated so that it at least supports a similar mechanism > as cfq-iosched for configuring ratio between tasks on an internal > cgroup and its children. Also, we really should update how cpuset > handles a cgroup becoming empty (no cpus or memory node left due to > hot-unplug). It currently transfers all its tasks to the nearest > ancestor with executing resources, which is an irreversible process > which would affect all other co-mounted controllers. We probably want > it to just take on the masks of the ancestor until its own executing > resources become online again, and the new behavior should be gated > behind a switch (Li, can you please look into this?). > Sure, I'll be working on sane hierarchy behavior for cpuset. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
Hey, Serge. On Tue, Apr 09, 2013 at 04:04:22PM -0500, Serge Hallyn wrote: > So for instance if there is a dbus call saying "please create cgroup > /x with (some constraints) and put $$ into it", "something" in the > container can convert that into "please create cgroup /lxc/c1/x > and put (host_uid($$)) into it" and pass that to the host's (or > parent container's) "something". Yeap, definitely. It shouldn't be difficult to make it transparent to individual consumers. It would actually be far easier to achieve that with userland agent which knows what's going on in the middle. > So perhaps it is best if the container monitor, living in the parent > namespaces, opens a socket '@cgroup_monitor' in the container > namespace (through setns), listens for container-userpsace requests > there, and passes them on to the host's monitor (which hopefully > also listens on '@cgroup_monitor', @ being '\0'). Note that my > mentino of converting pids requires a new kernel feature which we > don't currently have (but have wanted for a long time). Yeah, details may change but in principle something like that. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
Quoting Tejun Heo (t...@kernel.org):
> A bit of addition.
>
> On Tue, Apr 09, 2013 at 12:38:51PM -0700, Tejun Heo wrote:
> > > We need to make the distribute approach work in order to support
> > > containers, which requiring them to have a back-channel open to
> > > the host userspace. If we can do that, then we've solved the problem
>
> Why is back-channel such a bad thing? Even fully virtualized
> environments do special things to communicate with the host (the whole
> stack of virt drivers). It is sub-optimal and pointless to make
> everything completely transparent. There's nothing wrong with the
> basesystem knowing that they're inside a container or a virtualized
> environment, so I don't understand why a back-channel is such a big
> problem.

Agreed, that's fine so long as it will be a consistent interface.

Ideally, we could do it in a way that the container monitor can
transparently proxy between userspace inside the container and the
library on the host - so that userspace can 'use cgroups' the same way
no matter where it is.

So for instance if there is a dbus call saying "please create cgroup
/x with (some constraints) and put $$ into it", "something" in the
container can convert that into "please create cgroup /lxc/c1/x
and put (host_uid($$)) into it" and pass that to the host's (or
parent container's) "something".

So perhaps it is best if the container monitor, living in the parent
namespaces, opens a socket '@cgroup_monitor' in the container
namespace (through setns), listens for container-userspace requests
there, and passes them on to the host's monitor (which hopefully
also listens on '@cgroup_monitor', @ being '\0'). Note that my
mention of converting pids requires a new kernel feature which we
don't currently have (but have wanted for a long time).
-serge
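The request rewriting described in the message above ("/x" inside the container becoming "/lxc/c1/x" on the host) could look roughly like the following Python sketch. Everything here is an assumption for illustration: the function names, the dict-shaped request, and in particular `pid_map`, which stands in for the container-to-host pid translation that, as noted above, the kernel does not currently provide.

```python
import posixpath
from typing import Dict


def to_host_cgroup(container_root: str, requested: str) -> str:
    """Map a cgroup path as seen inside a container to the host's view.

    The requested path is normalized against "/" first, so ".."
    components cannot climb out of the container's own subtree.
    """
    rel = posixpath.normpath("/" + requested.lstrip("/")).lstrip("/")
    return posixpath.join(container_root, rel) if rel else container_root


def translate_request(container_root: str, request: Dict,
                      pid_map: Dict[int, int]) -> Dict:
    """Rewrite a container-side request into what the host monitor sees.

    pid_map is a hypothetical container-pid -> host-pid table.
    """
    return {
        "cgroup": to_host_cgroup(container_root, request["cgroup"]),
        "pid": pid_map[request["pid"]],
    }
```

A monitor for container c1 would call `translate_request("/lxc/c1", ...)` on each incoming request before forwarding it, which keeps the interface identical for userspace inside and outside the container.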
Re: cgroup: status-quo and userland efforts
A bit of addition.

On Tue, Apr 09, 2013 at 12:38:51PM -0700, Tejun Heo wrote:
> > We need to make the distribute approach work in order to support
> > containers, which requiring them to have a back-channel open to
> > the host userspace. If we can do that, then we've solved the problem

Why is back-channel such a bad thing? Even fully virtualized
environments do special things to communicate with the host (the whole
stack of virt drivers). It is sub-optimal and pointless to make
everything completely transparent. There's nothing wrong with the
basesystem knowing that they're inside a container or a virtualized
environment, so I don't understand why a back-channel is such a big
problem.

Thanks.

--
tejun
Re: cgroup: status-quo and userland efforts
Hello, Daniel. On Tue, Apr 09, 2013 at 10:50:25AM +0100, Daniel P. Berrange wrote: > The PaxControlGroups document is the key piece to making distributed > management work. This document does need updating, since some of what > it describes doesn't really work, but its goal is sound IMHO. I think we should add a comment to the doc saying "this is how to keep things from falling apart completely but in no way is a long term solution." > The Workman library is presuming that apps will follow the PaxControlGroups > guidelines for use of cgroups, and from there aims to provide system > administrators with a "single world view" and tools to then configure > this. It does not, however, attempt to force itself underneath the > apps like systemd / libvirt, since there is no need todo that. It > just aggregates information from system/libvirt/etc so that admin has > the complete picture of what the cgroups are being used for. I suppose that can be useful for now but pretty strongly disagree it would be acceptable as long term solution. > I don't see that creating a "single authority" magically solves any > of the problems you describe. For example, such an authority can't > know whether it should delete a cgroup just because an application > exits. It is quite possible an application would want the cgroup to > continue to exist, so that it is still there when it restarts. Sure, then make it request the persistency explicitly. The debate is not whether trusting each individual player can show similar result. Sure, that's in the realm of possibility. If you push it as far as "everyone" should and would behave properly even on edge cases, I would have to add "theoretical" there tho. The debate is which is the better way to achieve the desired goals and up until now I don't see any pros for the distributed approach other than "this is what we've been doing till now". > Ultimately it is the end admin or top level management tool that has > the whole picture. 
The Workman library / cli is aiming to provide > admins / apps with the complete picture of everything that is using > resources on the system, so they can adjust policies dynamically. Again, I don't know. It can be useful for now I suppose. I just can't see it being the long term solution. > You seem to be implying that 'distributed == anything goes', which is > certainly not what I consider to be the case. Indeed the main point > of having the PaxControlGroups guidelines is explicitly because we do > *not* want an "anything goes" approach. Yeah, by asking cooperations from individual players without any way to monitor or police them. > We ultimately do need the ability to delegate hierarchy creation to > unprivileged users / programs, in order to allow containerized OS to > have the ability to use cgroups. Requiring any applications inside a > container to talk to a cgroups "authority" existing on the host OS is > not a satisfactory architecture. We need to allow for a container to > be self-contained in its usage of cgroups. I'm not sure about this one. Yeah, we might need delegation there at least for now. That said, it's not gonna be completely consistent. Root cgroup is special for several controllers and we even have controllers which propagate config changes down the hierarchy. It just isn't built for proper delegation. > I don't think that requiring a single userspace authority is > satisfactory. We need to be able to delegate this to containers, > without them needing to talk to some authority back in the > host OS, so that they remain 100% isolated from processes in > the host OS. It's unlikely to work that well. I think a good mental image to have for cgroup is that of sysctl rather than a generic file system. You can't go delegate sysctl control knobs to containers or !root users. You need an extra layer of control to do that. 
It's true that such policing could happen in the kernel, but something in the kernel being exposed to untrusted entities has a lot of implications as the kernel now becomes heavily involved in *policy* decisions as to what can be allowed and what can't be and the kernel has a lot less latitude in making those decisions compared to userland base system. There are also security implications. memcg control knobs directly regulate the operation of memory reclaim and writeback. I wouldn't be surprised if there are pretty easy ways to make them go bonkers while staying inside the limits from the parent. Again, think of sysctl. You don't wanna hand these out to untrusted entities. > We need to make the distribute approach work in order to support > containers, which requiring them to have a back-channel open to > the host userspace. If we can do that, then we've solved the problem > of delegated to unprivileged users in non-container environments too. > IMHO with a sufficiently specified PaxControlGroups the distributed > approach is just fine. If applications are badly behaved and don't > follow the rules,
Re: cgroup: status-quo and userland efforts
Hello, On Tue, Apr 09, 2013 at 01:32:01AM +0200, Lennart Poettering wrote: > The other big thing we want from the systemd side is saner > notifications when cgroups run empty. i.e. currently we don't get > these at all in containers (since the agent can be only installed > once, for the host). And the way we get this is awful, via > kernel-spawned processes. I am looking for a way how I can establish > a watch on a certain subtree (not just one directory) and get simple > notifications in a race-free whenever a cgroup runs empty. Oh yeah, it's horrifying. There was something going on a while ago but I couldn't get hold of Eric Paris. We probably should resurrect that patch. As for delegating to namespaces, I'm not exactly sure what to do. At least for now, it could be an acceptable trade-off to delegate the subdirectory with some limits on the number of cgroups / depth of hierarchy / whatever. That said, I'm not really fond of the idea. It isn't likely to work seamlessly. The root cgroup is special anyway and I don't really like the idea of putting NS related stuff directly into cgroupfs. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
On Fri, Apr 05, 2013 at 06:21:59PM -0700, Tejun Heo wrote: > Userland efforts > > > There are currently a few userland efforts trying to make interfacing > with cgroup less painful. > > * libcg: Make cgroup interface accessible from programming languages > with support for configuration persistency, which also brings its > own config files to remember what to do on the next boot. Sans the > persistence part, it just seems to directly translate the filesystem > interface to function interface. > > http://libcg.sourceforge.net/ > > * Workman: It's a rather young project but as its name (workload > management) implies, its aims are higher level than that of libcg. > It aims to provide high-level resource allocation and management and > introduces new concepts like resource partitions to represent its > view of resource hierarchy. Like libcg, this one is implemented as > a library but provides bindings for more languages. > > https://gitorious.org/workman/pages/Home > > * Pax Controla Groupiana: A document on how not to step on other's > toes while using cgroup. It's not a software project but tries to > define precautions that a software or user can take to avoid > breaking or confusing other users of the cgroup filesystem. > > http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups > > All try to play nice with other possible users of the cgroup > filesystem - be it libvirt cgroup, applications doing their own cgroup > tricks, or hand-crafted custom scripts. While the approach is > understandable given that those usages already exist, I don't think > it's a workable solution in the long term. There are several reasons > for that. Actually libcg doesn't really try to play nice with anything - being just a direct representation of the cgroups filesystem, it allows for absolutely anything to be done with no regard for best practice or co-operation. The PaxControlGroups document is the key piece to making distributed management work. 
This document does need updating, since some of what it describes doesn't really work, but its goal is sound IMHO. The Workman library is presuming that apps will follow the PaxControlGroups guidelines for use of cgroups, and from there aims to provide system administrators with a "single world view" and tools to then configure this. It does not, however, attempt to force itself underneath the apps like systemd / libvirt, since there is no need todo that. It just aggregates information from system/libvirt/etc so that admin has the complete picture of what the cgroups are being used for. > * The configurations aren't independent. e.g. for weight-based > controllers, your weight is only meaningful in relation to other > weights at that level. Distributing configuration to whatever > entities which may write to cgroupfs simply cannot work. It's > fundamentally flawed. I agree that whatever is setting weight values needs to be aware of what other weight values are set at the same point in the hiearchy. This doesn't imply we have to have a single authority setting these values though, just that anything that wants to set them, needs to be aware of the bigger picture. > * It's fragile like hell. There's no accountability. Nobody really > knows what's going on. Is this subdirectory still there due to a > bug in this program, or something or someone else created it and > crashed / forgot to remove it, or what? Oh, the cgroup I wanted to > create already exists. Maybe the previous instance created it and > then crashed or maybe some other program just happened to choose the > same name. Who owns config knobs in that directory? This way lies > madness. I understand why the Pax doc exists but I'm not sure its > long-term effect would be positive - best practices which ultimately > lead to utter confusion and fragility. I don't see that creating a "single authority" magically solves any of the problems you describe. 
For example, such an authority can't know whether it should delete a cgroup just because an application exits. It is quite possible an application would want the cgroup to continue to exist, so that it is still there when it restarts. > * In many cases, resource distribution is system-wide policy decisions > and determining what to do often requires system-wide knowledge. > You can't provision memory limits without knowing what's available > in the system and what else is going on in the system, and you want > to be able to adjust them as situation and configuration changes. > Without anybody having full picture of how resources are > provisioned, how would any of that be possible? Ultimately it is the end admin or top level management tool that has the whole picture. The Workman library / cli is aiming to provide admins / apps with the complete picture of everything that is using resources on the system, so they can adjust policies dynamically. > I thin
Re: cgroup: status-quo and userland efforts
On 04/09/2013 03:32 AM, Lennart Poettering wrote: > The other big thing we want from the systemd side is saner notifications > when cgroups run empty. i.e. currently we don't get these at all in > containers (since the agent can be only installed once, for the host). > And the way we get this is awful, via kernel-spawned processes. I am > looking for a way how I can establish a watch on a certain subtree (not > just one directory) and get simple notifications in a race-free whenever > a cgroup runs empty. > Well, as I am trying to port our tools for Upstream Linux (aka cgroups), I also got a pet peeve on this one as well. The notification system is global and done at the root level. IOW, notify_on_release is local, but release_agent is global. We use our management tool to enter containers and call something like init 0, that will shut the container down. But if the admin does it itself, the cgroup directory will stay there. We would like them to automatically disappear. Maybe that is not something that needs to be done in the kernel. If systemd had some very easy and well documented way for a 3rd party software to register a notification to be called upon a certain cgroup release (if it exists already, sorry Lennart, but I haven't found anything in the likes. Just enlighten me) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: cgroup: status-quo and userland efforts
Heya,

On 08.04.2013 15:46, Glauber Costa wrote:
> On 04/06/2013 05:21 AM, Tejun Heo wrote:
>> Hello, guys.
>
> Hello Tejun, how are you?
>
>> Status-quo
>> ==
>
> tl;did read;
>
> This is mostly sensible. There is still one problem that we hadn't yet
> had the bandwidth to tackle that should be added to your official TODO
> list.
>
> The cpu cgroup needs a real-time timeslice to accept real time tasks.
> It defaults to 0, meaning that a newly created cpu cgroup cannot accept
> tasks (rt tasks) without the user having to manually configure it.
>
> As far as I know, this problem hasn't yet been fixed. The fix of
> course, is as trivial as setting a new value instead of 0 as a default.
> The complication lies in determining which value should that be.
>
> There are many things that we should ask from a controller to implement
> in order to be able to handle fully joint hierarchies. One of them,
> IMHO, is that if you drop a task into a newly created cgroup it should
> run without the user having to do anything for it.

The other big thing we want from the systemd side is saner notifications
when cgroups run empty. i.e. currently we don't get these at all in
containers (since the agent can be only installed once, for the host).
And the way we get this is awful, via kernel-spawned processes. I am
looking for a way how I can establish a watch on a certain subtree (not
just one directory) and get simple notifications in a race-free whenever
a cgroup runs empty.

Lennart
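The real-time timeslice problem mentioned above can be checked from userland with a small Python sketch. It assumes the cgroup v1 cpu controller is mounted and built with RT group scheduling, in which case each group exposes a `cpu.rt_runtime_us` file; the helper names are made up, and treating any non-zero value as "can accept RT tasks" is a simplification.

```python
import os


def parse_rt_runtime(text: str) -> int:
    """Parse the contents of a cpu.rt_runtime_us file (microseconds)."""
    return int(text.strip())


def can_accept_rt_tasks(cgroup_dir: str) -> bool:
    """Return False when the group's RT runtime is 0, i.e. the default
    for a newly created group described above."""
    path = os.path.join(cgroup_dir, "cpu.rt_runtime_us")
    with open(path) as f:
        return parse_rt_runtime(f.read()) != 0
```

A manager could run this check right after creating a group and either configure a runtime itself or refuse to place real-time tasks there.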
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Mon, Apr 08, 2013 at 03:46:31PM -0400, Vivek Goyal wrote:
> It would be good to think more about it. How a user can ensure minimum
> resources to a partition/service. Because in that case at every level
> somebody needs to keep track how much of resources have been committed
> as minimum requirements and more consumers can't be allowed at same level.
> (This sounds like cpu RT time division among various cgroups).

Yes, please take a step back from what we have right now because it
isn't very good. It's a general policy decision / enforcement problem
and even the policies may change dynamically. Having a central
authority doesn't automatically solve any of that and it'd be most
likely as limited as existing solutions at the beginning but it allows
for future improvements unlike scattering the solution all over the
place which just digs the hole deeper.

Thanks.

--
tejun
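The bookkeeping Vivek describes in the quoted text (tracking committed minimums per level and refusing consumers that no longer fit) amounts to simple admission control. The Python sketch below is purely illustrative: the `Level` class, its percent-based accounting and the service names are assumptions, not part of Workman or any existing tool.

```python
from typing import Dict


class Level:
    """One level of the hierarchy, tracking minimum guarantees in percent."""

    def __init__(self, capacity_pct: int = 100) -> None:
        self.capacity_pct = capacity_pct
        self.minimums: Dict[str, int] = {}

    def committed(self) -> int:
        """Total of all minimums already promised at this level."""
        return sum(self.minimums.values())

    def reserve(self, name: str, minimum_pct: int) -> bool:
        """Grant (or update) a minimum if it still fits, else refuse."""
        if minimum_pct < 0:
            raise ValueError("minimum must be non-negative")
        current = self.minimums.get(name, 0)
        if self.committed() - current + minimum_pct > self.capacity_pct:
            return False  # over-committed: this consumer cannot be admitted
        self.minimums[name] = minimum_pct
        return True
```

Whoever holds this table, be it a central authority or a cooperating library, has to be the only writer for the guarantee to mean anything, which is the crux of the disagreement in this thread.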
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Mon, Apr 08, 2013 at 12:20:24PM -0700, Tejun Heo wrote:

[..]
> > For example, one might want to say that maximum IO bandwidth for
> > virtual machine virt1 on disk sda should be 10MB/s. Now libvirt
> > should be able to save it in virtual machine specific configuration
> > easily and whenever virtual machine is started, create a children
> > cgroup, set the limits as specified.
>
> Yes, sure, libvirt can *request* whatever it seems appropriate to the
> central authority, which will decide whether it'll be able to honor
> the request and grant it if possible and allowed by policies in
> effect.

10MB/s is an absolute limit. So I guess there is nothing to be requested
from a central authority here in terms of resources. Even in the case of
IO weight or cpu shares, there is nothing to be asked from a central
authority.

Well, there is. Creation of new cgroups changes the effective %share of
peer groups. More below.

Where it makes sense though is if one says give a particular service
25% cpu. Then suddenly all the peer and parent entities become important.
IIUC, the initial draft of workman does not address this issue.

It would be good to think more about it. How can a user ensure minimum
resources for a partition/service? Because in that case, at every level
somebody needs to keep track of how much of the resources has been
committed as minimum requirements, and more consumers can't be allowed
at the same level. (This sounds like cpu RT time division among various
cgroups.)

Thanks
Vivek
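The point about peer groups can be made concrete with a little arithmetic. In the Python sketch below, `effective_shares()` is a hypothetical helper and the weights are made-up numbers; it assumes a purely proportional controller where a group's share is its weight divided by the sum of the weights of all busy siblings.

```python
from typing import Dict


def effective_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Fraction of the parent's resource each sibling group receives."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


before = effective_shares({"virt1": 500, "virt2": 500})
# Nobody touched virt1's weight, yet creating a peer shrinks its share.
after = effective_shares({"virt1": 500, "virt2": 500, "virt3": 1000})
```

Here `virt1` drops from half of the parent's resource to a quarter purely because `virt3` was created, so a "25% cpu" promise cannot be kept by looking at one group's knob in isolation.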
Re: [Workman-devel] cgroup: status-quo and userland efforts
Hey, On Mon, Apr 08, 2013 at 03:11:05PM -0400, Vivek Goyal wrote: > > What if the program crashes? > > I am not sure about this. I guess when applications comes back after crash, > it can go through all the children cgroups and reclaim empty cgroups. Fragile, right? What are you arguing here? > > Wouldn't it make more sense to just have > > a central arbitrator that everyone talks to? > > May be. Just that in the past folks have not liked the idea of talking > to central authority to figure out resource group of an object they are > managing. What we've been doing seems tragically broken to me, so I'm not sure "people didn't use to do it that way" is a good point. > > What's the benefit of > > distributing the responsiblities here? It's not like we can put them > > in different security domains. > > To me it makes sense in a way, as these resources associated with the > service is just one another property and there does not seem to be > anything special about this property that it should be managed using > a single centralized authority. > > For example, one might want to say that maximum IO bandwidth for > virtual machine virt1 on disk sda should be 10MB/s. Now libvirt > should be able to save it in virtual machine specific configuration > easily and whenever virtual machine is started, create a children > cgroup, set the limits as specified. Yes, sure, libvirt can *request* whatever it seems appropriate to the central authority, which will decide whether it'll be able to honor the request and grant it if possible and allowed by policies in effect. > That would make sense. systemd had this conflict with cgconfig > too. Problem is that systemd starts first and sets up everything. Now > if there is a service which sets up cgroups, after systemd startup, > it is already late. Come on, that's not a difficult or fundamental problem. 
Whatever the central authority may be, systemd can use it to setup the initial hierarchy or set up bare-bone hierarchy in compatible manner. This isn't that different from udev. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Mon, Apr 08, 2013 at 11:16:07AM -0700, Tejun Heo wrote: > Hey, Vivek. > > On Mon, Apr 08, 2013 at 01:59:26PM -0400, Vivek Goyal wrote: > > But using the library admin application should be able to query the > > full "paritition" hierarchy and their weigths and calculate % system > > resources. I think one problem there is cpu controller where % resoruce > > of a cgroup depends on tasks entities which are peer to group. But that's > > a kernel issue and not user space thing. > > Yeah, we're gonna have to implement a different operation mode. > > > So I am not sure what are potential problems with proposed model of > > configuration in workman. All the consumer managers still follow what > > libarary has told them to do. > > Sure, if we assume everyone follows the rules and behaves nicely. > It's more about the general approach. Allowing / encouraging sharing > or distributing control of cgroup hierarchy without forcing structure > and rigid control over it is likely to lead to confusion and > fragility. > > > > or maybe some other program just happened to choose the > > > same name. > > > > Two programs ideally would have their own sub hiearchy. And if not one > > of the programs should get the conflict when trying to create cgroup and > > should back-off or fail or give warning... > > And who's responsible for deleting it? I think "consumer" manager should delete its own cgroup directories when associated consumer[s] stop running. And partitions created by workman will just remain there until and unless user wanted to delete these explicitly. > What if the program crashes? I am not sure about this. I guess when applications comes back after crash, it can go through all the children cgroups and reclaim empty cgroups. > > > > Who owns config knobs in that directory? > > > > IIUC, workman was looking at two types of cgroups. 
Once called > > "partitions" which will be created by library at startup time and > > library manages the configuration (something like cgconfig.conf). > > > > And individual managers create their own children groups for various > > services under that partition and control the config knobs for those > > services. > > > > user-defined-partition > > /|\ > >virt1 virt2 virt3 > > > > So user should be able to define a partition and control the configuration > > using workman lib. And if multiple virtual machines are being run in > > the partition, then they create their own cgroups and libvirt controls > > the properties of virt1, virt2, virt3 cgroups. I thought that was the > > the understanding when we dicussed ownership of config knobs las time. > > But things might have changed since last time. Workman folks should > > be able to shed light on this. > > I just read the introduction doc and haven't delved into the API or > code so I could be off but why should there be multiple managers? > What's the benefit of that? A centralized authority does not know about all the managed objects. Only respective manager knows about what objects it is managing and what are the controllable attributes of that object. systemd is managing services and libvirt is managing virtual machines, containers etc. Some people view associated resource group as just one additional attribute of the managed service. These managers already maintain multiple attributes of a service and can store one additional attribute easily. > Wouldn't it make more sense to just have > a central arbitrator that everyone talks to? May be. Just that in the past folks have not liked the idea of talking to central authority to figure out resource group of an object they are managing. > What's the benefit of > distributing the responsiblities here? It's not like we can put them > in different security domains. 
To me it makes sense in a way, as the resources associated with a service are just one more property, and there does not seem to be anything special about this property that requires it to be managed by a single centralized authority. For example, one might want to say that the maximum IO bandwidth for virtual machine virt1 on disk sda should be 10MB/s. libvirt should be able to save that in the virtual machine's own configuration easily and, whenever the virtual machine is started, create a child cgroup and set the limits as specified. If a central authority keeps track of all this, I am not sure what it would look like, and it might get messy. [..] > > > I think the only logical thing to do is creating a centralized > > > userland authority which takes full ownership of the cgroup filesystem > > > interface, gives it a sane structure, > > > > Right now systemd seems to be giving initial structure. I guess we will > > require some changes where systemd itself runs in a cgroup and that > > allows one to create peer groups. Something like. > > > > root > >
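The virt1 example in this message (10MB/s on sda, applied when the virtual machine starts) could be sketched roughly as below. This is only an illustration, not how libvirt does it: `throttle_rule` and `apply_read_limit` are hypothetical helper names, the file name `blkio.throttle.read_bps_device` and its "major:minor bytes_per_second" format are those of the cgroup v1 blkio controller, sda is assumed to be block device 8:0, and 10MB/s is taken as 10 * 1024 * 1024 bytes per second. A temporary directory stands in for the partition's cgroup directory.

```python
import os
import tempfile


def throttle_rule(major: int, minor: int, bytes_per_sec: int) -> str:
    """Format a blkio throttle rule: "<major>:<minor> <bytes_per_second>"."""
    return f"{major}:{minor} {bytes_per_sec}"


def apply_read_limit(parent_dir: str, name: str, rule: str) -> str:
    """Create child cgroup `name` under parent_dir and write the read limit.

    Hypothetical helper; on a real cgroupfs the kernel creates the control
    file itself, here the write simply creates a plain file.
    """
    child = os.path.join(parent_dir, name)
    os.makedirs(child, exist_ok=True)
    with open(os.path.join(child, "blkio.throttle.read_bps_device"), "w") as f:
        f.write(rule + "\n")
    return child


# Assumption: sda is 8:0; 10MB/s read as 10 * 1024 * 1024 bytes per second.
RULE = throttle_rule(8, 0, 10 * 1024 * 1024)

if __name__ == "__main__":
    # The temporary directory is a stand-in for "user-defined-partition".
    with tempfile.TemporaryDirectory() as partition:
        print(apply_read_limit(partition, "virt1", RULE), RULE)
```

The point of the sketch is that the limit is a single value the manager can keep next to the rest of the virtual machine's configuration and replay at start-up.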
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Mon, Apr 08, 2013 at 11:16:07AM -0700, Tejun Heo wrote: > > Given the fact that library has view of full system resoruces (both > > persistent view and active view), shouldn't we just be able to extend > > the API to meet additional configuration or resource needs. > > Maybe, I don't know. It just looks like a weird approach to me. > Wouldn't it make more sense to implement it as a dbus service that > everyone talks to? That's how our base system is structured these > days. Why should this be any different? To expand a bit, the base system being composed that way makes a lot of sense. It becomes clear who's responsible for what and there's a reliable way to recover when things go awry on the clients' sides. Also, it pretty much *forces* you to design an interface which fits the problem domain properly rather than exposing all the control knobs there are without thinking how they'd be actually useful. The language binding issue is much easier too - it's already solved. It seems like the only logical thing to do, well, at least to me. Am I missing something? Thanks. -- tejun
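To make the "who's responsible for what" and "recover when things go awry on the clients' sides" points concrete, here is a toy, in-process sketch of a central arbiter. Everything in it is hypothetical: `CgroupArbiter` and its methods are invented for illustration, it does not talk to dbus or to the cgroup filesystem, and a real service would need far more (policy, persistence, authentication).

```python
class CgroupArbiter:
    """Toy stand-in for a central service that owns the cgroup hierarchy."""

    def __init__(self):
        # Maps each cgroup path to the client that asked for it.
        self._owner_of = {}

    def create(self, client: str, path: str) -> None:
        """Record `client` as owner of `path`; refuse someone else's path."""
        owner = self._owner_of.get(path)
        if owner is not None and owner != client:
            raise PermissionError(f"{path} is owned by {owner}")
        self._owner_of[path] = client

    def owner(self, path: str):
        """Return the owning client of `path`, or None if nobody owns it."""
        return self._owner_of.get(path)

    def client_gone(self, client: str):
        """Drop everything a crashed or exited client left behind."""
        stale = sorted(p for p, o in self._owner_of.items() if o == client)
        for p in stale:
            del self._owner_of[p]
        return stale


if __name__ == "__main__":
    arbiter = CgroupArbiter()
    arbiter.create("libvirt", "/partition/virt1")
    print(arbiter.owner("/partition/virt1"))
    print(arbiter.client_gone("libvirt"))
```

Because every request goes through one place, a name collision is an explicit error instead of a surprise, and cleanup after a client crash is a single lookup rather than guesswork over leftover directories.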
Re: cgroup: status-quo and userland efforts
Hey, Glauber. On Mon, Apr 08, 2013 at 05:46:09PM +0400, Glauber Costa wrote: > On 04/06/2013 05:21 AM, Tejun Heo wrote: > > Hello, guys. > > Hello Tejun, how are you? I'm doing okay. :) > > Status-quo > > == > > > tl;did read; > > This is mostly sensible. There is still one problem that we hadn't yet > had the bandwidth to tackle that should be added to your official TODO list. > > The cpu cgroup needs a real-time timeslice to accept real time tasks. It > defaults to 0, meaning that a newly created cpu cgroup cannot accept > tasks (rt tasks) without the user having to manually configure it. > As far as I know, this problem hasn't yet been fixed. > > The fix of course, is as trivial as setting a new value instead of 0 as > a default. The complication lies in determining which value should that be. > > There are many things that we should ask from a controller to implement > in order to be able to handle fully joint hierarchies. One of them, > IMHO, is that if you drop a task into a newly created cgroup it should > run without the user having to do anything for it. Yeap, definitely. cpuset has similar problems (Li, help us!). For the controllers whose behaviors don't allow sharing a single hierarchy, I think the solution is to implement an alternate behavior which can be switched on at mount time, and to force that switch on when mounting the unified hierarchy, so that we don't disturb the existing users while pushing for more consistent behavior. Thanks. -- tejun
Re: [Workman-devel] cgroup: status-quo and userland efforts
Hey, Vivek. On Mon, Apr 08, 2013 at 01:59:26PM -0400, Vivek Goyal wrote: > But using the library admin application should be able to query the > full "paritition" hierarchy and their weigths and calculate % system > resources. I think one problem there is cpu controller where % resoruce > of a cgroup depends on tasks entities which are peer to group. But that's > a kernel issue and not user space thing. Yeah, we're gonna have to implement a different operation mode. > So I am not sure what are potential problems with proposed model of > configuration in workman. All the consumer managers still follow what > libarary has told them to do. Sure, if we assume everyone follows the rules and behaves nicely. It's more about the general approach. Allowing / encouraging sharing or distributing control of cgroup hierarchy without forcing structure and rigid control over it is likely to lead to confusion and fragility. > > or maybe some other program just happened to choose the > > same name. > > Two programs ideally would have their own sub hiearchy. And if not one > of the programs should get the conflict when trying to create cgroup and > should back-off or fail or give warning... And who's responsible for deleting it? What if the program crashes? > > Who owns config knobs in that directory? > > IIUC, workman was looking at two types of cgroups. Once called > "partitions" which will be created by library at startup time and > library manages the configuration (something like cgconfig.conf). > > And individual managers create their own children groups for various > services under that partition and control the config knobs for those > services. > > user-defined-partition >/|\ >virt1 virt2 virt3 > > So user should be able to define a partition and control the configuration > using workman lib. And if multiple virtual machines are being run in > the partition, then they create their own cgroups and libvirt controls > the properties of virt1, virt2, virt3 cgroups. 
I thought that was the > the understanding when we dicussed ownership of config knobs las time. > But things might have changed since last time. Workman folks should > be able to shed light on this. I just read the introduction doc and haven't delved into the API or code so I could be off but why should there be multiple managers? What's the benefit of that? Wouldn't it make more sense to just have a central arbitrator that everyone talks to? What's the benefit of distributing the responsiblities here? It's not like we can put them in different security domains. > > * In many cases, resource distribution is system-wide policy decisions > > and determining what to do often requires system-wide knowledge. > > You can't provision memory limits without knowing what's available > > in the system and what else is going on in the system, and you want > > to be able to adjust them as situation and configuration changes. > > Without anybody having full picture of how resources are > > provisioned, how would any of that be possible? > > I thought workman library will provide interfaces so that one can query > and be able to construct the full system view. > > Their doc says. > > GList *workmanager_partition_get_children(WorkmanPartition *partition, > GError **error); > > So I am assuming this can be used to construct the full partition > hierarchy and associated resource allocation. Sure, maybe it can be used as a building block. > [..] > > I think the only logical thing to do is creating a centralized > > userland authority which takes full ownership of the cgroup filesystem > > interface, gives it a sane structure, > > Right now systemd seems to be giving initial structure. I guess we will > require some changes where systemd itself runs in a cgroup and that > allows one to create peer groups. Something like. > > root > / \ > systemd other-groups No, we need a single structured hierarchy which everyone uses *including* systemd. 
> > represents available resources > > in a sane form, and makes policy decisions based on configuration and > > requests. > > Given the fact that library has view of full system resoruces (both > persistent view and active view), shouldn't we just be able to extend > the API to meet additional configuration or resource needs. Maybe, I don't know. It just looks like a weird approach to me. Wouldn't it make more sense to implement it as a dbus service that everyone talks to? That's how our base system is structured these days. Why should this be any different? Thanks. -- tejun
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Mon, Apr 08, 2013 at 05:46:09PM +0400, Glauber Costa wrote: [..] > The cpu cgroup needs a real-time timeslice to accept real time tasks. It > defaults to 0, meaning that a newly created cpu cgroup cannot accept > tasks (rt tasks) without the user having to manually configure it. > As far as I know, this problem hasn't yet been fixed. Yes, systemd folks wanted this to be fixed so that out of the box they could put each individual user session in a cgroup and still expect that the user's RT applications are not broken. Thanks Vivek
Re: [Workman-devel] cgroup: status-quo and userland efforts
On Fri, Apr 05, 2013 at 06:21:59PM -0700, Tejun Heo wrote: [..] > Userland efforts > > > There are currently a few userland efforts trying to make interfacing > with cgroup less painful. > > * libcg: Make cgroup interface accessible from programming languages > with support for configuration persistency, which also brings its > own config files to remember what to do on the next boot. Sans the > persistence part, it just seems to directly translate the filesystem > interface to function interface. > > http://libcg.sourceforge.net/ > > * Workman: It's a rather young project but as its name (workload > management) implies, its aims are higher level than that of libcg. > It aims to provide high-level resource allocation and management and > introduces new concepts like resource partitions to represent its > view of resource hierarchy. Like libcg, this one is implemented as > a library but provides bindings for more languages. > > https://gitorious.org/workman/pages/Home > > * Pax Controla Groupiana: A document on how not to step on other's > toes while using cgroup. It's not a software project but tries to > define precautions that a software or user can take to avoid > breaking or confusing other users of the cgroup filesystem. > > http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups > > All try to play nice with other possible users of the cgroup > filesystem - be it libvirt cgroup, applications doing their own cgroup > tricks, or hand-crafted custom scripts. While the approach is > understandable given that those usages already exist, I don't think > it's a workable solution in the long term. There are several reasons > for that. > > * The configurations aren't independent. e.g. for weight-based > controllers, your weight is only meaningful in relation to other > weights at that level. Distributing configuration to whatever > entities which may write to cgroupfs simply cannot work. It's > fundamentally flawed. 
Hi Tejun, I thought in workman, "partition" configuration was still centralized while individual "consumer" configuration was with the consumer manager (systemd, libvirt, etc.). IOW, the library can tell the consumer manager which partition to associate a consumer with at startup time. (Consumer managers can assume their own defaults if nothing has been told.) Agreed, that weight is meaningful only if one has the full hierarchy view, and then one should be able to calculate the effective % share of resources of a group. But using the library, an admin application should be able to query the full "partition" hierarchy and the weights and calculate % system resources. I think one problem there is the cpu controller, where the % resource of a cgroup depends on task entities which are peers of the group. But that's a kernel issue and not a user space thing. So I am not sure what the potential problems are with the proposed model of configuration in workman. All the consumer managers still follow what the library has told them to do. > > * It's fragile like hell. There's no accountability. Nobody really > knows what's going on. Is this subdirectory still there due to a > bug in this program, or something or someone else created it and > crashed / forgot to remove it, or what? I thought any directory under a consumer manager is managed by that manager and nobody is supposed to dynamically create resource partition/cgroup there. So that takes away a bit of confusion. > Oh, the cgroup I wanted to > create already exists. Maybe the previous instance created it and > then crashed This should be the case as long as we stick to the notion of a manager managing its own sub-hierarchy. > or maybe some other program just happened to choose the > same name. Two programs ideally would have their own sub-hierarchy. And if not, one of the programs should get the conflict when trying to create the cgroup and should back off, fail, or give a warning... > Who owns config knobs in that directory? IIUC, workman was looking at two types of cgroups.
Ones called "partitions", which will be created by the library at startup time and whose configuration the library manages (something like cgconfig.conf). And individual managers create their own child groups for various services under that partition and control the config knobs for those services. user-defined-partition /|\ virt1 virt2 virt3 So the user should be able to define a partition and control its configuration using the workman lib. And if multiple virtual machines are being run in the partition, then they create their own cgroups and libvirt controls the properties of the virt1, virt2, virt3 cgroups. I thought that was the understanding when we discussed ownership of config knobs last time. But things might have changed since last time. Workman folks should be able to shed light on this. > This way lies > madness. I understand why
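The "effective % share" calculation mentioned earlier in this message can be sketched as follows, assuming a simplified model: every group is busy, all weights are positive, and a group's share is its parent's share times its weight over the sum of its siblings' weights. The function name, the tree layout, and the weight values are made up for illustration. The sketch deliberately ignores the cpu controller issue raised above, where tasks sitting directly in a group compete with that group's children.

```python
def effective_shares(tree, weights, root="root"):
    """Return {group: fraction of the root's resources}.

    tree maps a parent to its list of children; weights maps a group to
    its (positive) weight. Both structures are hypothetical.
    """
    shares = {root: 1.0}
    stack = [root]
    while stack:
        parent = stack.pop()
        children = tree.get(parent, [])
        total = sum(weights[c] for c in children)
        for child in children:
            # A weight only means something relative to its siblings.
            shares[child] = shares[parent] * weights[child] / total
            stack.append(child)
    return shares


if __name__ == "__main__":
    tree = {
        "root": ["partition", "system"],
        "partition": ["virt1", "virt2", "virt3"],
    }
    weights = {"partition": 600, "system": 400,
               "virt1": 100, "virt2": 100, "virt3": 200}
    for group, share in sorted(effective_shares(tree, weights).items()):
        print(f"{group}: {share:.0%}")
```

Note that changing the weight of "system" changes the effective share of virt1 even though nobody touched virt1's own knobs, which is the interdependence the thread is arguing about.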
Re: cgroup: status-quo and userland efforts
On 04/06/2013 05:21 AM, Tejun Heo wrote: > Hello, guys. > Hello Tejun, how are you? > Status-quo > == > tl;did read; This is mostly sensible. There is still one problem that we haven't yet had the bandwidth to tackle and that should be added to your official TODO list. The cpu cgroup needs a real-time timeslice to accept real-time tasks. It defaults to 0, meaning that a newly created cpu cgroup cannot accept rt tasks without the user having to manually configure it. As far as I know, this problem hasn't yet been fixed. The fix, of course, is as trivial as setting a new value instead of 0 as the default. The complication lies in determining which value that should be. There are many things that we should ask a controller to implement in order to be able to handle fully joint hierarchies. One of them, IMHO, is that if you drop a task into a newly created cgroup it should run without the user having to do anything for it.
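A rough sketch of the check a manager would have to make today before placing a real-time task in a new group. The message above does not name the knob; the sketch assumes it is `cpu.rt_runtime_us`, the cgroup v1 cpu controller file used with RT group scheduling. `can_accept_rt_tasks` is a hypothetical helper, and a temporary directory with a plain file stands in for a real cgroup.

```python
import os
import tempfile


def can_accept_rt_tasks(cgroup_dir: str) -> bool:
    """True if the group has a non-zero real-time budget.

    Assumption: the budget lives in cpu.rt_runtime_us inside cgroup_dir.
    """
    with open(os.path.join(cgroup_dir, "cpu.rt_runtime_us")) as f:
        return int(f.read().strip()) != 0


def set_rt_runtime(cgroup_dir: str, runtime_us: int) -> None:
    """Write a real-time budget; the manual step a user must do today."""
    with open(os.path.join(cgroup_dir, "cpu.rt_runtime_us"), "w") as f:
        f.write(f"{runtime_us}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as group:
        set_rt_runtime(group, 0)          # mimic the default of 0
        print(can_accept_rt_tasks(group))
        set_rt_runtime(group, 100000)     # arbitrary non-zero value
        print(can_accept_rt_tasks(group))
```

The second write is the manual configuration step the message wants to make unnecessary; which default value would be safe is exactly the open question.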
cgroup: status-quo and userland efforts
Hello, guys. Status-quo == It's been about a year since I wrote up a summary on cgroup status quo and future plans. We're not there yet but much closer than we were before. At least the locking and object life-time management aren't crazy anymore and most controllers now support proper hierarchy although not all of them agree on how to treat inheritance. IIRC, the yet-to-be-converted ones are blk-throttle and perf. cpu needs to be updated so that it at least supports a mechanism similar to cfq-iosched's for configuring the ratio between tasks on an internal cgroup and its children. Also, we really should update how cpuset handles a cgroup becoming empty (no cpus or memory node left due to hot-unplug). It currently transfers all its tasks to the nearest ancestor with executing resources, which is an irreversible process which would affect all other co-mounted controllers. We probably want it to just take on the masks of the ancestor until its own executing resources become online again, and the new behavior should be gated behind a switch (Li, can you please look into this?). While we still have a ways to go, I feel relatively confident saying that we aren't too far out now, well, except for the writeback mess that still needs to be tackled. Anyways, once the remaining bits are settled, we can proceed to implement the unified hierarchy mode I've been talking about forever. I can't think of any fundamental roadblocks at the moment but who knows? The devil usually is in the details. Let's hope it goes okay. So, while we aren't moving as fast as we wish we were, the kernel side of things is falling into place. At least, that's how I see it. From now on, I think how to make it actually useable to userland deserves a bit more focus, and by "useable to userland", I don't mean some group hacking up an elaborate, manual configuration which is tailored to the point of being eccentric to suit the needs of the said group.
There's nothing wrong with that and they can continue to do so, but it just isn't generically useable or useful. It should be possible to generically and automatically split resources among, say, several servers and a couple users sharing a system without resorting to indecipherable ad-hoc shell script running off rc.local. Userland efforts There are currently a few userland efforts trying to make interfacing with cgroup less painful. * libcg: Make cgroup interface accessible from programming languages with support for configuration persistency, which also brings its own config files to remember what to do on the next boot. Sans the persistence part, it just seems to directly translate the filesystem interface to function interface. http://libcg.sourceforge.net/ * Workman: It's a rather young project but as its name (workload management) implies, its aims are higher level than that of libcg. It aims to provide high-level resource allocation and management and introduces new concepts like resource partitions to represent its view of resource hierarchy. Like libcg, this one is implemented as a library but provides bindings for more languages. https://gitorious.org/workman/pages/Home * Pax Controla Groupiana: A document on how not to step on other's toes while using cgroup. It's not a software project but tries to define precautions that a software or user can take to avoid breaking or confusing other users of the cgroup filesystem. http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups All try to play nice with other possible users of the cgroup filesystem - be it libvirt cgroup, applications doing their own cgroup tricks, or hand-crafted custom scripts. While the approach is understandable given that those usages already exist, I don't think it's a workable solution in the long term. There are several reasons for that. * The configurations aren't independent. e.g. 
for weight-based controllers, your weight is only meaningful in relation to other weights at that level. Distributing configuration to whatever entities which may write to cgroupfs simply cannot work. It's fundamentally flawed. * It's fragile like hell. There's no accountability. Nobody really knows what's going on. Is this subdirectory still there due to a bug in this program, or something or someone else created it and crashed / forgot to remove it, or what? Oh, the cgroup I wanted to create already exists. Maybe the previous instance created it and then crashed or maybe some other program just happened to choose the same name. Who owns config knobs in that directory? This way lies madness. I understand why the Pax doc exists but I'm not sure its long-term effect would be positive - best practices which ultimately lead to utter confusion and fragility. * In many cases, resource distribution is system-wide policy decisions and determining what to do often requires system-wide knowledge. You can't prov