Re: [Documentation] State of CPU controller in cgroup v2
On Tue, Oct 04, 2016 at 10:47:17AM -0400, Tejun Heo wrote:
> > cgroup-v2, by placing the system style controllers first and foremost,
> > completely renders that scenario impossible. Note also that any proposed
> > rgroup would not work for this, since that, per design, is a subtree,
> > and therefore not disjoint.
>
> If a use case absolutely requires disjoint resource hierarchies, the
> only solution is to keep using multiple v1 hierarchies, which
> necessarily excludes the possibility of doing anything across different
> resource types.
>
> > So my objection to the whole cgroup-v2 model and implementation stems
> > from the fact that it purports to be a 'better' and 'improved' system,
> > while in actuality it neuters and destroys a lot of useful usecases.
> >
> > It completely disregards all task-controllers and labels their use-cases
> > as irrelevant.
>
> Your objection then doesn't have much to do with the specifics of the
> cgroup v2 model or implementation.

It is too; I've stated multiple times that the no-internal-tasks thing
is bad and that the root exception is an inexcusable wart that makes the
whole thing internally inconsistent.

But talking to you guys is pointless. You'll just keep moving air until
the other party tires and gives up. My NAK on v2 stands.

> It's an objection against
> establishing common resource domains as that excludes building
> orthogonal multiple hierarchies. That, necessarily, can only be
> achieved by having multiple hierarchies for different resource types
> and thus giving up the benefits of common resource domains.

Yes, v2 not allowing that rules it out as a valid model.

> Assuming that, I don't think your position is against cgroup v2 but
> more toward keeping v1 around. We're talking about two quite
> different mutually exclusive classes of use cases. You need unified
> for one and disjoint for the other. v1 is gonna be there and can
> easily be used alongside v2 for different controller types, which
> would in most cases be cpu and cpuset.
>
> I can't see a reason why this would need to block properly supporting
> containerization use cases.

I don't block that use-case, I block cgroup-v2; it's shit.

The fact is, the naming "v2" suggests it's a replacement and will
deprecate "v1". Also the implementation is mutually exclusive with v1:
you have to pick one and the other becomes inaccessible. You cannot
even pick another one inside a container, breaking the container
invariant.
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Peter.

On Tue, Sep 06, 2016 at 12:29:50PM +0200, Peter Zijlstra wrote:
> The fundamental problem is that we have 2 different types of
> controllers, on the one hand these controllers above, that work on tasks
> and form groups of them and build up from that. Let's call them
> task-controllers.
>
> On the other hand we have controllers like memcg which take the 'system'
> as a whole and shrink it down into smaller bits. Let's call these
> system-controllers.
>
> They are fundamentally at odds with capabilities, simply because of the
> granularity they can work on.

As pointed out multiple times, the picture is not that simple. For
example, eventually, we want to be able to account for cpu cycles spent
during memory reclaim or processing IOs (e.g. encryption), which can
only be tied to the resource domain, not a specific task.

There surely are things that can only be done by task-level
controllers, but there are two different aspects here. One is the
actual capabilities (e.g. hierarchical proportional cpu cycle
distribution) and the other is how such capabilities are exposed. I'll
continue below.

> Merging the two into a common hierarchy is a useful concept for
> containerization, no argument on that, esp. when also coupled with
> namespaces and the like.

Great, we now agree that comprehensive system resource control is
useful.

> However, where I object _most_ strongly is having this one use dominate
> and destroy the capabilities (which are in use) of the task-controllers.

The objection isn't necessarily just about loss of capabilities but
also about not being able to do them in the same way as v1. The reason
I proposed rgroup instead of scoped task-granularity is because I think
that a properly insulated programmable interface which is in line with
other widely used APIs is a better solution in the long run.

If we go the cgroupfs route for thread granularity, we pretty much lose
the possibility, or at least make it very difficult, to make
hierarchical resource control widely available to individual
applications. How important such use cases are is debatable. I don't
find it too difficult to imagine scenarios where individual
applications like apache or torrent clients make use of it.

Probably more importantly, rgroup, or something like it, gives an
application an officially supported way to build and expose its
resource hierarchies, which can then be used both by the application
itself and from the outside to monitor and manipulate resource
distribution.

The decision between cgroupfs thread granularity and something like
rgroup isn't an obvious one. Choosing the former is the path of lower
resistance but it comes at the cost of certain long-term benefits.

> > It could be made to work without races, though, with minimal (or even
> > no) ABI change. The managed program could grab an fd pointing to its
> > cgroup. Then it would use openat, etc for all operations. As long as
> > 'mv /cgroup/a/b /cgroup/c/' didn't cause that fd to stop working,
> > we're fine.
>
> I've mentioned openat() and related APIs several times, but so far never
> got good reasons why that wouldn't work.

Hopefully, this part was addressed in my reply to Andy.

> cgroup-v2, by placing the system style controllers first and foremost,
> completely renders that scenario impossible. Note also that any proposed
> rgroup would not work for this, since that, per design, is a subtree,
> and therefore not disjoint.

If a use case absolutely requires disjoint resource hierarchies, the
only solution is to keep using multiple v1 hierarchies, which
necessarily excludes the possibility of doing anything across different
resource types.
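[Editor's note: the fd-based scheme quoted above — grab a directory fd
for your own cgroup once, then use openat() and friends for everything —
can be sketched with ordinary directory fds. This only illustrates the
vfs property under discussion; a plain temporary directory stands in for
a cgroup directory, and "knob" is a made-up file name, not a real cgroup
interface file.]

```python
import os
import tempfile

# Hypothetical stand-in for a cgroup tree; a real managed program would
# open its own directory under /sys/fs/cgroup instead of a tempdir.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))

# Grab an fd pointing at "our" cgroup directory once, up front.
cg_fd = os.open(os.path.join(root, "a", "b"), os.O_RDONLY | os.O_DIRECTORY)

# An external manager now migrates the group: mv /cgroup/a/b /cgroup/c
os.rename(os.path.join(root, "a", "b"), os.path.join(root, "c"))

# The directory fd still refers to the same (renamed) directory, so
# openat()-style operations keep working without knowing the new path.
fd = os.open("knob", os.O_WRONLY | os.O_CREAT, dir_fd=cg_fd)
os.write(fd, b"100")
os.close(fd)

with open(os.path.join(root, "c", "knob")) as f:
    print(f.read())  # prints "100" -- the write landed in the moved directory
os.close(cg_fd)
```

The point being argued: cg_fd keeps naming the same directory across
the rename, so the managed program never needs to learn its new path.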
> So my objection to the whole cgroup-v2 model and implementation stems
> from the fact that it purports to be a 'better' and 'improved' system,
> while in actuality it neuters and destroys a lot of useful usecases.
>
> It completely disregards all task-controllers and labels their use-cases
> as irrelevant.

Your objection then doesn't have much to do with the specifics of the
cgroup v2 model or implementation. It's an objection against
establishing common resource domains as that excludes building
orthogonal multiple hierarchies. That, necessarily, can only be
achieved by having multiple hierarchies for different resource types
and thus giving up the benefits of common resource domains.

Assuming that, I don't think your position is against cgroup v2 but
more toward keeping v1 around. We're talking about two quite different
mutually exclusive classes of use cases. You need unified for one and
disjoint for the other. v1 is gonna be there and can easily be used
alongside v2 for different controller types, which would in most cases
be cpu and cpuset.

I can't see a reason why this would need to block properly supporting
containerization use cases.

Thanks.

-- tejun
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, 2016-09-30 at 11:06 +0200, Tejun Heo wrote:
> Hello, Mike.
>
> On Sat, Sep 10, 2016 at 12:08:57PM +0200, Mike Galbraith wrote:
> > On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote:
> > > > > As for your example, who performs the cgroup setup and configuration,
> > > > > the application itself or an external entity? If an external entity,
> > > > > how does it know which thread is what?
> > > >
> > > > In my case, it would be a little script that reads a config file that
> > > > knows all kinds of internal information about the application and its
> > > > threads.
> > >
> > > I see. One-of-a-kind custom setup. This is a completely valid usage;
> > > however, please also recognize that it's an extremely specific one
> > > which is niche by definition.
> >
> > This is the same pigeon hole you placed Google into. So Google, my
> > (also decidedly non-petite) users, and now Andy are all sharing the one
> > of a kind extremely specific niche.. it's becoming a tad crowded.
>
> I wasn't trying to say that these use cases are small in numbers when
> added up, but that they're all isolated in their own small silos.

These use cases exist, and are perfectly valid use cases. That is the
sum and total of what is relevant.

> Facebook has a lot of these usages too but they're almost all mutually
> exclusive. Making workloads share machines or even adding resource
> control for base system operations afterwards is extremely difficult.

The cases I have in mind are not difficult to deal with, as you don't
have to worry about collisions.

> There are cases these adhoc approaches make sense but insisting that
> this is all there is to resource control is short-sighted.

1. I never insisted any such thing.
2. Please stop pigeon-holing.

The usage cases in question are no more ad hoc than any other usage;
they are all "for this", none are globally applicable. What they are is
power users utilizing the intimate knowledge that is both required and
in the possession of power users who are in fact using controllers
precisely as said controllers were designed to be used.

No, these usages do not belong in an "adhoc" (aka disposable refuse)
pigeon-hole. I choose to ignore the one you stuffed me into.

-Mike
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Mike.

On Sat, Sep 10, 2016 at 12:08:57PM +0200, Mike Galbraith wrote:
> On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote:
> > > > As for your example, who performs the cgroup setup and configuration,
> > > > the application itself or an external entity? If an external entity,
> > > > how does it know which thread is what?
> > >
> > > In my case, it would be a little script that reads a config file that
> > > knows all kinds of internal information about the application and its
> > > threads.
> >
> > I see. One-of-a-kind custom setup. This is a completely valid usage;
> > however, please also recognize that it's an extremely specific one
> > which is niche by definition.
>
> This is the same pigeon hole you placed Google into. So Google, my
> (also decidedly non-petite) users, and now Andy are all sharing the one
> of a kind extremely specific niche.. it's becoming a tad crowded.

I wasn't trying to say that these use cases are small in numbers when
added up, but that they're all isolated in their own small silos.
Facebook has a lot of these usages too but they're almost all mutually
exclusive. Making workloads share machines or even adding resource
control for base system operations afterwards is extremely difficult.

There are cases these adhoc approaches make sense but insisting that
this is all there is to resource control is short-sighted.

Thanks.

-- tejun
Re: [Documentation] State of CPU controller in cgroup v2
Hello,

On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
> With regard to no-internal-tasks, I see (at least) three options:
>
> 1. Keep the cgroup2 status quo. Lots of distros and such are likely
> to have their cgroup management fail if run in a container. I really,

I don't know where you're getting this. The no-internal-tasks rule has
*NOTHING* to do with how or how not cgroup v1 hierarchies can be used
inside a namespace. I suppose this is coming from the same
misunderstanding that Austin has. Please see my reply there for more
details.

> really dislike this option.

Up until this point, you haven't supplied any valid technical reasons
for your objection. Repeating "really" doesn't add to the discussion
at all. If you're indicating that you don't like it on aesthetic
grounds, please just say so.

> 2. Enforce no-internal-tasks for the root cgroup. Un-cgroupable
> things will still get accounted to the root cgroup even if subtree
> control is on, but no tasks can be in the root cgroup if the root
> cgroup has subtree control on. (If some controllers removed the
> no-internal-tasks restriction, this would apply to the root as well.)
> I think this may annoy certain users. If so, and if those users are
> doing something valid, then I think that either those users should be
> strongly encouraged or even forced to change so namespacing works for
> them or that we should do (3) instead.

Theoretically, we can do that, but what are the upsides and are they
enough to justify the added inconveniences? Up until now, the only
argument you provided is that people may do certain things in
system-root which might not work in namespace-root, but that isn't a
critical problem. No real functionality is lost by implementing the
same behaviors both inside and outside namespaces.

> 3. Remove the no-internal-tasks restriction entirely. I can see this
> resulting in a lot of configuration awkwardness, but I think it will
> *work*, especially since all of the controllers already need to do
> something vaguely intelligent when subtree control is on in the root
> and there are tasks in the root.

The reasons for the no-internal-tasks restriction have been explained
multiple times in the documentation and throughout this thread, and we
also discussed how and why system-root is special and why allowing
system-root's special treatment doesn't break things.

> What I'm trying to say is that I think that option (1) is sufficiently
> bad that cgroup2 should do (2) or (3) instead. If option (2) is
> preferred and if it would break userspace, then I think we can work
> around it by entirely deprecating cgroup2, renaming it to cgroup3, and
> doing option (2) there. You've given reasons you don't like options
> (2) and (3). I mostly agree with those reasons, but I don't think
> they're strong enough to overcome the problems with (1).

And you keep suggesting very drastic measures for an issue which isn't
critical without providing any substantial technical reasons why such
drastic measures would be necessary. This part of the discussion
started with your misunderstanding of the implications of the
system-root being special, and the only reason you presented in the
previous message is still a, different, misunderstanding. The only
thing which isn't changing here is your opinions on how it should be.
It is a baffling situation because your opinions don't seem to be
affected at all by the validity of the reasons for thinking so.

> BTW, Mike keeps mentioning exclusive cgroups as problematic with the
> no-internal-tasks constraints. Do exclusive cgroups still exist in
> cgroup2? Could we perhaps just remove that capability entirely? I've
> never understood what problem exclusive cpusets and such solve that
> can't be more comprehensibly solved by just assigning the cpusets the
> normal inclusive way.
This was explained before during the discussion. Maybe it wasn't clear
enough. The knob is a config protector, protecting a cgroup's own
configuration from being changed. It doesn't really belong in the
kernel. My guess is that it was added because the delegation model
wasn't properly established and people tried to delegate resource
control knobs along with the cgroups and then wanted to prevent those
knobs from being changed in certain ways.

> >> What kind of migration do you mean? Having fds follow rename(2) around is
> >> the normal vfs behavior, so I don't really know what you mean.
> >
> > Process or task migration by writing pid to cgroup.procs or tasks
> > file. cgroup never supported directory / cgroup level migrations.
>
> Ugh. Perhaps cgroup2 should start supporting this. I think that
> making rename(2) work is simpler than adding a whole new API for
> rgroups, and I think it could solve a lot of the same problems that
> rgroups are trying to solve.

We haven't needed that yet, and supporting rename(2) doesn't
necessarily make the API safe in terms of migration atomicity. Also,
as pointed out in my previous reply (and
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Austin.

On Mon, Sep 12, 2016 at 11:20:03AM -0400, Austin S. Hemmelgarn wrote:
> > If you confine it to the cpu controller, ignore anonymous
> > consumptions, the rather ugly mapping between nice and weight values
> > and the fact that nobody could come up with a practical usefulness for
> > such setup, yes. My point was never that the cpu controller can't do
> > it but that we should find a better way of coordinating it with other
> > controllers and exposing it to individual applications.
>
> So, having a container where not everything in the container is split
> further into subgroups is not a practically useful situation? Because
> that's exactly what both systemd and every other cgroup management tool
> expects to have work as things stand right now. The root cgroup within a

Not true.

$ cat /proc/1/cgroup
11:hugetlb:/
10:pids:/init.scope
9:blkio:/
8:cpuset:/
7:memory:/
6:freezer:/
5:perf_event:/
4:net_cls,net_prio:/
3:cpu,cpuacct:/
2:devices:/init.scope
1:name=systemd:/init.scope
$ systemctl --version
systemd 229
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN

> cgroup namespace has to function exactly like the system-root, otherwise
> nothing can depend on the special cases for the system root, because they
> might get run in a cgroup namespace and such assumptions will be invalid.

systemd already behaves exactly the same whether it's inside a
namespace or not.

> This in turn means that no current distro can run unmodified in a cgroup
> namespace under a v2 hierarchy, which is a Very Bad Thing.

cgroup v1 hierarchies can be mounted the same inside a namespace
whether the system itself is on cgroup v1 or v2. Obviously, a given
controller can only be attached to one hierarchy, so a controller
can't be used at the same time on both v1 and v2 hierarchies; however,
that is true with different v1 hierarchies too, and, given that
delegation doesn't work properly on v1, shouldn't be that much of an
issue. I'm not just claiming it. systemd-nspawn can already be on
either v1 or v2 hierarchies regardless of what the outer systemd uses.

Out of the claims that you made, the only one which holds up is that
existing software can't make use of cgroup v2 without modifications,
which is true but at the same time doesn't mean much of anything.

Thanks.

-- tejun
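[Editor's note: the /proc/1/cgroup output quoted above is one line per
hierarchy, in hierarchy-ID:controller-list:cgroup-path form. A minimal
sketch of parsing that format; the helper name is made up and the
sample is an abbreviated copy of the listing.]

```python
# Abbreviated sample in the /proc/<pid>/cgroup format shown above.
sample = """\
11:hugetlb:/
10:pids:/init.scope
2:devices:/init.scope
1:name=systemd:/init.scope
"""

def parse_proc_cgroup(text):
    """Split each "hierarchy-ID:controllers:path" line into a tuple."""
    entries = []
    for line in text.splitlines():
        hier_id, controllers, path = line.split(":", 2)
        entries.append((int(hier_id), controllers.split(","), path))
    return entries

for hier_id, controllers, path in parse_proc_cgroup(sample):
    print(hier_id, controllers, path)
```

Note how, on this v1 setup, only pids, devices, and the named systemd
hierarchy place init in /init.scope; the rest leave it in the root.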
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, Sep 16, 2016 at 11:19:38AM -0700, Andy Lutomirski wrote:
> On Fri, Sep 16, 2016 at 9:50 AM, Peter Zijlstra wrote:
> > {1,2} {3,4} {5} seem exclusive, did I miss something? (other than that 5
> > cpu parts are 'rare').
>
> There's no overlap, so they're logically exclusive, but it avoids
> needing the "cpu_exclusive" parameter.

I'd need to double check, but I don't think you _need_ that. That's
more for enforcing nobody else steals your CPUs and 'accidentally'
creates overlaps. But if you configure it right, non-overlap should be
enough.

That is, generate_sched_domains() only uses cpusets_overlap() which is
cpumask_intersects(). Then again, it is almost 4am, so who knows.

> > So there's a problem with sticking kernel threads (and esp. kthreadd)
> > into !root groups. For example if you place it in a cpuset that doesn't
> > have all cpus, then binding your shiny new kthread to a cpu will fail.
> >
> > You can fix that of course, and we used to do exactly that, but we kept
> > running into 'fun' cases like that.
>
> Blech. But maybe this *should* have that effect. I'm sick of random
> kernel crap being scheduled on my RT CPUs and on the CPUs that I
> intend to be kept forcibly idle.

Hehe, so ideally those threads don't do anything unless the tasks
running on those CPUs explicitly ask for it. If you find any of the
CPU-bound kernel tasks do work that is unrelated to the tasks running
on that CPU, we should certainly look into it. Personally I'm not much
bothered by idle threads sitting about.
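[Editor's note: the check Peter describes — generate_sched_domains()
relying on cpusets_overlap(), i.e. cpumask_intersects() — can be
mimicked in a few lines. A sketch with made-up helper names, using the
kernel's cpulist notation for the {1,2} {3,4} {5} partitions above.]

```python
def parse_cpulist(s):
    """Parse a Linux cpulist string like "1-2,5" into a set of CPUs."""
    cpus = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def cpus_intersect(a, b):
    """Userspace mirror of cpumask_intersects(): do the masks share a CPU?"""
    return bool(parse_cpulist(a) & parse_cpulist(b))

# The three partitions from the example: {1,2} {3,4} {5}
parts = ["1-2", "3-4", "5"]
disjoint = all(not cpus_intersect(a, b)
               for i, a in enumerate(parts) for b in parts[i + 1:])
print(disjoint)  # pairwise non-overlap is what splits the sched domains
```

Prints True: the masks are pairwise disjoint, so no cpu_exclusive flag
is needed for the domains to split.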
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, Sep 16, 2016 at 9:50 AM, Peter Zijlstra wrote:
> On Fri, Sep 16, 2016 at 09:29:06AM -0700, Andy Lutomirski wrote:
>
>> > SCHED_DEADLINE, it's a 'Global'-EDF like scheduler that doesn't support
>> > CPU affinities (because that doesn't make sense). The only way to
>> > restrict it is to partition.
>> >
>> > 'Global' because you can partition it. If you reduce your system to
>> > single CPU partitions you'll reduce to P-EDF.
>> >
>> > (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
>> > partition scheme, it however does support sched_affinity, but using it
>> > gives 'interesting' schedulability results -- call it a historic
>> > accident).
>>
>> Hmm, I didn't realize that the deadline scheduler was global. But
>> ISTM requiring the use of "exclusive" to get this working is
>> unfortunate. What if a user wants two separate partitions, one using
>> CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
>> non-RT stuff)?
>
> {1,2} {3,4} {5} seem exclusive, did I miss something? (other than that 5
> cpu parts are 'rare').

There's no overlap, so they're logically exclusive, but it avoids
needing the "cpu_exclusive" parameter. It always seemed confusing to me
that a setting on a child cgroup would strictly remove a resource from
the parent.

(To be clear: I don't have any particularly strong objection to
cpu_exclusive. It just always seemed like a bit of a hack that mostly
duplicated what you could get by just setting the cpusets appropriately
throughout the hierarchy.)

>> > Note that related, but differently, we have the isolcpus boot parameter
>> > which creates single CPU partitions for all listed CPUs and gives the
>> > rest to the root cpuset. Ideally we'd kill this option given it's a boot
>> > time setting (for something which is trivial to do at runtime).
>> >
>> > But this cannot be done, because that would mean we'd have to start with
>> > a !0 cpuset layout:
>> >
>> >             '/'
>> >        load_balance=0
>> >        /            \
>> >   'system'        'isolated'
>> >   cpus=~isolcpus  cpus=isolcpus
>> >                   load_balance=0
>> >
>> > And start with _everything_ in the /system group (including default IRQ
>> > affinities).
>> >
>> > Of course, that will break everything cgroup :-(
>>
>> I would actually *much* prefer this over the status quo. I'm tired of
>> my crappy, partially-working script that sits there and creates
>> exactly this configuration (minus the isolcpus part because I actually
>> want migration to work) on boot. (Actually, it could have two
>> automatic cgroups: /kernel and /init -- init and UMH would go in init
>> and kernel threads and such would go in /kernel. Userspace would be
>> able to request that a different cgroup be used for newly-created
>> kernel threads.)
>
> So there's a problem with sticking kernel threads (and esp. kthreadd)
> into !root groups. For example if you place it in a cpuset that doesn't
> have all cpus, then binding your shiny new kthread to a cpu will fail.
>
> You can fix that of course, and we used to do exactly that, but we kept
> running into 'fun' cases like that.

Blech. But maybe this *should* have that effect. I'm sick of random
kernel crap being scheduled on my RT CPUs and on the CPUs that I
intend to be kept forcibly idle.

> The unbound workqueue stuff is totally arbitrary borkage though, that
> can be made to work just fine, TJ didn't like it for some reason which I
> really cannot remember.
>
> Also, UMH?

User mode helper. Fortunately most users are gone now, but it still
exists.
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, Sep 16, 2016 at 09:29:06AM -0700, Andy Lutomirski wrote:
> > SCHED_DEADLINE, it's a 'Global'-EDF like scheduler that doesn't support
> > CPU affinities (because that doesn't make sense). The only way to
> > restrict it is to partition.
> >
> > 'Global' because you can partition it. If you reduce your system to
> > single CPU partitions you'll reduce to P-EDF.
> >
> > (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
> > partition scheme, it however does support sched_affinity, but using it
> > gives 'interesting' schedulability results -- call it a historic
> > accident).
>
> Hmm, I didn't realize that the deadline scheduler was global. But
> ISTM requiring the use of "exclusive" to get this working is
> unfortunate. What if a user wants two separate partitions, one using
> CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
> non-RT stuff)?

{1,2} {3,4} {5} seem exclusive, did I miss something? (other than that
5 cpu parts are 'rare').

> Shouldn't we be able to have a cgroup for each of the
> DL partitions and do something to tell the deadline scheduler "here is
> your domain"?

Somewhat confused; by doing the non-overlapping domains, you do exactly
that, no? You end up with 2 (or more) independent deadline schedulers,
but if you're not running deadline tasks (like in the /system
partition) you don't care it's there.

> > Note that related, but differently, we have the isolcpus boot parameter
> > which creates single CPU partitions for all listed CPUs and gives the
> > rest to the root cpuset. Ideally we'd kill this option given it's a boot
> > time setting (for something which is trivial to do at runtime).
> >
> > But this cannot be done, because that would mean we'd have to start with
> > a !0 cpuset layout:
> >
> >             '/'
> >        load_balance=0
> >        /            \
> >   'system'        'isolated'
> >   cpus=~isolcpus  cpus=isolcpus
> >                   load_balance=0
> >
> > And start with _everything_ in the /system group (including default IRQ
> > affinities).
> >
> > Of course, that will break everything cgroup :-(
>
> I would actually *much* prefer this over the status quo. I'm tired of
> my crappy, partially-working script that sits there and creates
> exactly this configuration (minus the isolcpus part because I actually
> want migration to work) on boot. (Actually, it could have two
> automatic cgroups: /kernel and /init -- init and UMH would go in init
> and kernel threads and such would go in /kernel. Userspace would be
> able to request that a different cgroup be used for newly-created
> kernel threads.)

So there's a problem with sticking kernel threads (and esp. kthreadd)
into !root groups. For example if you place it in a cpuset that doesn't
have all cpus, then binding your shiny new kthread to a cpu will fail.

You can fix that of course, and we used to do exactly that, but we kept
running into 'fun' cases like that.

The unbound workqueue stuff is totally arbitrary borkage though, that
can be made to work just fine, TJ didn't like it for some reason which
I really cannot remember.

Also, UMH?

> Heck, even systemd would probably prefer this. Then it could cleanly
> expose a "slice" or whatever it's called for random kernel shit and at
> least you could configure it meaningfully.

No clue about systemd, I'm still on systems without that virus.
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, Sep 16, 2016 at 9:19 AM, Peter Zijlstra wrote:
> On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote:
>> On Sep 16, 2016 12:51 AM, "Peter Zijlstra" wrote:
>> >
>> > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
>> > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the
>> > > no-internal-tasks constraints. Do exclusive cgroups still exist in
>> > > cgroup2? Could we perhaps just remove that capability entirely? I've
>> > > never understood what problem exclusive cpusets and such solve that
>> > > can't be more comprehensibly solved by just assigning the cpusets the
>> > > normal inclusive way.
>> >
>> > Without exclusive sets we cannot split the sched_domain structure.
>> > Which leads to not being able to actually partition things. That would
>> > break DL for one.
>>
>> Can you sketch out a toy example?
>
> [ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]
>
> mkdir /cpuset
> mount -t cgroup -o cpuset none /cpuset
>
> mkdir /cpuset/A
> mkdir /cpuset/B
>
> cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
> echo 0 > /cpuset/A/cpuset.mems
>
> cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
> echo 1 > /cpuset/B/cpuset.mems
>
> # move all movable tasks into A
> cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks ; done
>
> # kill machine wide load-balancing
> echo 0 > /cpuset/cpuset.sched_load_balance
>
> # now place 'special' tasks in B
>
> This partitions the scheduler into two, one for each node.
>
> Hereafter no task will be moved from one node to another. The
> load-balancer is split in two, one balances in A one balances in B
> nothing crosses. (It is important that A.cpus and B.cpus do not
> intersect.)
>
> Ideally no task would remain in the root group, back in the day we could
> actually do this (with exception of the cpu bound kernel threads), but
> this has significantly regressed :-(
>
> (still hate the workqueue affinity interface)

I wonder if we could address this by creating (automatically at boot or
when the cpuset controller is enabled or whatever) a
/cpuset/random_kernel_shit cgroup and have all of the unmoveable tasks
land there?

> As is, tasks that are left in the root group get balanced within
> whatever domain they ended up in.
>
>> And what's DL?
>
> SCHED_DEADLINE, it's a 'Global'-EDF like scheduler that doesn't support
> CPU affinities (because that doesn't make sense). The only way to
> restrict it is to partition.
>
> 'Global' because you can partition it. If you reduce your system to
> single CPU partitions you'll reduce to P-EDF.
>
> (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
> partition scheme, it however does support sched_affinity, but using it
> gives 'interesting' schedulability results -- call it a historic
> accident).

Hmm, I didn't realize that the deadline scheduler was global. But ISTM
requiring the use of "exclusive" to get this working is unfortunate.
What if a user wants two separate partitions, one using CPUs 1 and 2
and the other using CPUs 3 and 4 (with 5 reserved for non-RT stuff)?
Shouldn't we be able to have a cgroup for each of the DL partitions and
do something to tell the deadline scheduler "here is your domain"?

> Note that related, but differently, we have the isolcpus boot parameter
> which creates single CPU partitions for all listed CPUs and gives the
> rest to the root cpuset. Ideally we'd kill this option given it's a boot
> time setting (for something which is trivial to do at runtime).
>
> But this cannot be done, because that would mean we'd have to start with
> a !0 cpuset layout:
>
>             '/'
>        load_balance=0
>        /            \
>   'system'        'isolated'
>   cpus=~isolcpus  cpus=isolcpus
>                   load_balance=0
>
> And start with _everything_ in the /system group (including default IRQ
> affinities).
>
> Of course, that will break everything cgroup :-(

I would actually *much* prefer this over the status quo. I'm tired of
my crappy, partially-working script that sits there and creates exactly
this configuration (minus the isolcpus part because I actually want
migration to work) on boot. (Actually, it could have two automatic
cgroups: /kernel and /init -- init and UMH would go in init and kernel
threads and such would go in /kernel. Userspace would be able to
request that a different cgroup be used for newly-created kernel
threads.)

Heck, even systemd would probably prefer this. Then it could cleanly
expose a "slice" or whatever it's called for random kernel shit and at
least you could configure it meaningfully.
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote: > On Sep 16, 2016 12:51 AM, "Peter Zijlstra" wrote: > > > > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote: > > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the > > > no-internal-tasks constraints. Do exclusive cgroups still exist in > > > cgroup2? Could we perhaps just remove that capability entirely? I've > > > never understood what problem exlusive cpusets and such solve that > > > can't be more comprehensibly solved by just assigning the cpusets the > > > normal inclusive way. > > > > Without exclusive sets we cannot split the sched_domain structure. > > Which leads to not being able to actually partition things. That would > > break DL for one. > > Can you sketch out a toy example? [ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ] mkdir /cpuset mount -t cgroup -o cpuset none /cpuset mkdir /cpuset/A mkdir /cpuset/B cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus echo 0 > /cpuset/A/cpuset.mems cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus echo 1 > /cpuset/B/cpuset.mems # move all movable tasks into A cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks ; done # kill machine wide load-balancing echo 0 > /cpuset/cpuset.sched_load_balance # now place 'special' tasks in B This partitions the scheduler into two, one for each node. Hereafter no task will be moved from one node to another. The load-balancer is split in two, one balances in A one balances in B nothing crosses. (It is important that A.cpus and B.cpus do not intersect.) Ideally no task would remain in the root group, back in the day we could actually do this (with exception of the cpu bound kernel threads), but this has significantly regressed :-( (still hate the workqueue affinity interface) As is, tasks that are left in the root group get balanced within whatever domain they ended up in. > And what's DL? 
SCHED_DEADLINE. It's a 'Global'-EDF-like scheduler that doesn't support CPU affinities (because that doesn't make sense). The only way to restrict it is to partition. 'Global' because you can partition it; if you reduce your system to single-CPU partitions you'll reduce to P-EDF.

(The same is true of SCHED_FIFO; that's a 'Global'-FIFO on the same partition scheme. It does, however, support sched_affinity, but using it gives 'interesting' schedulability results -- call it a historic accident.)

Note that, relatedly but differently, we have the isolcpus boot parameter, which creates single-CPU partitions for all listed CPUs and gives the rest to the root cpuset. Ideally we'd kill this option given it's a boot-time setting (for something which is trivial to do at runtime). But this cannot be done, because that would mean we'd have to start with a !0 cpuset layout:

              '/'
         load_balance=0
          /          \
    'system'        'isolated'
  cpus=~isolcpus   cpus=isolcpus
                   load_balance=0

And start with _everything_ in the /system group (including default IRQ affinities). Of course, that will break everything cgroup :-(
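[Editorial aside] The partitioning requirement stressed above -- that the exclusive sets' CPU lists must not intersect -- can be sanity-checked directly from the cpulist strings that files like cpuset.cpus and /sys/devices/system/node/*/cpulist use. This is an illustrative sketch, not from the thread; the helper names are invented:

```python
def parse_cpulist(s):
    """Parse a kernel cpulist string like '0-3,8,10-11' into a set of CPU ids."""
    cpus = set()
    for part in s.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def disjoint(a, b):
    """True if two cpulist strings describe non-intersecting CPU sets,
    as required for splitting the sched_domain structure."""
    return parse_cpulist(a).isdisjoint(parse_cpulist(b))

# e.g. node0 = CPUs 0-3, node1 = CPUs 4-7 on a toy two-node box
print(disjoint("0-3", "4-7"))   # True: a valid partition
print(disjoint("0-3", "3,5"))   # False: CPU 3 is shared, no partition
```

The same check is what makes the A/B cpuset example above a real partition: once the sets are disjoint and root load-balancing is off, the load-balancer cannot move tasks across the boundary.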
Re: [Documentation] State of CPU controller in cgroup v2
On Sep 16, 2016 12:51 AM, "Peter Zijlstra" wrote: > > > > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote: > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the > > no-internal-tasks constraints. Do exclusive cgroups still exist in > > cgroup2? Could we perhaps just remove that capability entirely? I've > > never understood what problem exclusive cpusets and such solve that > > can't be more comprehensibly solved by just assigning the cpusets the > > normal inclusive way. > > Without exclusive sets we cannot split the sched_domain structure. > Which leads to not being able to actually partition things. That would > break DL for one. Can you sketch out a toy example? And what's DL?
Re: [Documentation] State of CPU controller in cgroup v2
On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote: > BTW, Mike keeps mentioning exclusive cgroups as problematic with the > no-internal-tasks constraints. Do exclusive cgroups still exist in > cgroup2? Could we perhaps just remove that capability entirely? I've > never understood what problem exclusive cpusets and such solve that > can't be more comprehensibly solved by just assigning the cpusets the > normal inclusive way. Without exclusive sets we cannot split the sched_domain structure. Which leads to not being able to actually partition things. That would break DL for one.
Re: [Documentation] State of CPU controller in cgroup v2
On Wed, Sep 14, 2016 at 1:00 PM, Tejun Heo wrote:
> Hello,

With regard to no-internal-tasks, I see (at least) three options:

1. Keep the cgroup2 status quo. Lots of distros and such are likely to have their cgroup management fail if run in a container. I really, really dislike this option.

2. Enforce no-internal-tasks for the root cgroup. Un-cgroupable things will still get accounted to the root cgroup even if subtree control is on, but no tasks can be in the root cgroup if the root cgroup has subtree control on. (If some controllers removed the no-internal-tasks restriction, this would apply to the root as well.) I think this may annoy certain users. If so, and if those users are doing something valid, then I think that either those users should be strongly encouraged or even forced to change so namespacing works for them or that we should do (3) instead.

3. Remove the no-internal-tasks restriction entirely. I can see this resulting in a lot of configuration awkwardness, but I think it will *work*, especially since all of the controllers already need to do something vaguely intelligent when subtree control is on in the root and there are tasks in the root.

What I'm trying to say is that I think that option (1) is sufficiently bad that cgroup2 should do (2) or (3) instead. If option (2) is preferred and if it would break userspace, then I think we can work around it by entirely deprecating cgroup2, renaming it to cgroup3, and doing option (2) there.

You've given reasons you don't like options (2) and (3). I mostly agree with those reasons, but I don't think they're strong enough to overcome the problems with (1).

BTW, Mike keeps mentioning exclusive cgroups as problematic with the no-internal-tasks constraints. Do exclusive cgroups still exist in cgroup2? Could we perhaps just remove that capability entirely?
I've never understood what problem exclusive cpusets and such solve that can't be more comprehensibly solved by just assigning the cpusets the normal inclusive way. >> > After a migration, the cgroup and its interface knobs are a different >> > directory and files. Semantically, during migration, we aren't moving >> > the directory or files and it'd be bizarre to overlay the semantics >> > you're describing on top of the existing cgroupfs. We will have to >> > break away from the very basic vfs rules such as a fd, once opened, >> > always corresponding to the same file. >> >> What kind of migration do you mean? Having fds follow rename(2) around is >> the normal vfs behavior, so I don't really know what you mean. > > Process or task migration by writing pid to cgroup.procs or tasks > file. cgroup never supported directory / cgroup level migrations. > Ugh. Perhaps cgroup2 should start supporting this. I think that making rename(2) work is simpler than adding a whole new API for rgroups, and I think it could solve a lot of the same problems that rgroups are trying to solve. --Andy
Re: [Documentation] State of CPU controller in cgroup v2
Hello, On Mon, Sep 12, 2016 at 10:39:04AM -0700, Andy Lutomirski wrote: > > > Your idea of "trivially" doesn't match mine. You gave a use case in > > > > I suppose I wasn't clear enough. It is trivial in the sense that if > > the userland implements something which works for namespace-root, it > > would work the same in system-root without further modifications. > > So I guess userspace can trivially get it right and can just as trivially > get it wrong. I wasn't trying to play a word game. What I was trying to say is that a configuration which works for namespace-roots works for the system-root too, in terms of cgroup hierarchy, without any modifications. > > Great, now we agree that what's currently implemented is valid. I > > think you're still failing to recognize the inherent specialness of > > the system-root and how much unnecessary pain the removal of the > > exemption would cause at virtually no practical gain. I won't repeat > > the same backing points here. > > I'm starting to think that you could extend the exemption with considerably > less difficulty. Can you please elaborate? It feels like you're repeating the same opinions without really describing them in detail or backing them up in the last couple replies. Having differing opinions is fine but to actually hash them out, the opinions and their rationales need to be laid out in detail. > > There isn't much which is getting in the way of doing that. Again, > > something which follows no-internal-task rule would behave the same no > > matter where it is. The system-root is different in that it is exempt > > from the rule and thus is more flexible but that difference is serving > > the purpose of handling the inherent specialness of the system-root. > > From *userspace's* POV, I still don't think there's any specialness except > from an accounting POV. After all, userspace has no control over the > special stuff anyway.
And accounting doesn't matter: a namespace could > just see zeros in any special root accounting slots. The disagreement here isn't really consequential. The only reason this part became important is because you felt that something must be broken, which you now don't think is the case. I agree that there can be other ways to handle this but what's your proposal here? And how would that be practically and substantially better than what is implemented now? > > You've been pushing for enforcing the restriction on the system-root > > too and now are jumping to the opposite end. It's really frustrating > > that this is such a whack-a-mole game where you throw ideas without > > really thinking through them and only concede the bare minimum when > > all other logical avenues are closed off. Here, again, you seem to be > > stating a strong opinion when you haven't fully thought about it or > > tried to understand the reasons behind it. > > I think you should make it work the same way in namespace roots as it does > in the system root. I acknowledge that there are pros and cons of each. I > think the current middle ground is worse than either of the consistent > options. Again, the only thing you're doing is restating the same opinion. I understand that you have an impression that this can be done better but how exactly? > > But, whatever, let's go there: Given the arguments that I laid out for > > the no-internal-tasks rule, how does the problem seem fixable through > > relaxing the constraint? > > By deciding that, despite the arguments you laid out, it's still worth > relaxing the constraint. Or by deciding to add the constraint to the root. You're not really saying anything of substance in the above paragraph. > > > Isn't this the same thing? IIUC the constraint in question is that, > > > if a non-root cgroup has subtree control on, then it can't have > > > processes in it. This is the no-internal-tasks constraint, right?
> > > > Yes, that is what no-internal-tasks rule is but I don't understand how > > that is the same thing as process granularity. Am I completely > > misunderstanding what you are trying to say here? > > Yes. I'm saying that no-internal-tasks could be relaxed per controller. I was asking whether you were wondering whether no-internal-tasks rule and process-granularity are the same thing. And, if that's not the case, what the previous sentence meant. I can't make out what you're responding to. > > If you confine it to the cpu controller, ignore anonymous > > consumptions, the rather ugly mapping between nice and weight values > > and the fact that nobody could come up with a practical usefulness for > > such setup, yes. My point was never that the cpu controller can't do > > it but that we should find a better way of coordinating it with other > > controllers and exposing it to individual applications. > > I'm not sure what the nice-vs-weight thing has to do with internal > processes, but all of this is a question for Peter. That part is from cgroup cpu controller weig
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote: > > > As for your example, who performs the cgroup setup and configuration, > > > the application itself or an external entity? If an external entity, > > > how does it know which thread is what? > > > > In my case, it would be a little script that reads a config file that > > knows all kinds of internal information about the application and its > > threads. > > I see. One-of-a-kind custom setup. This is a completely valid usage; > however, please also recognize that it's an extremely specific one > which is niche by definition. This is the same pigeon hole you placed Google into. So Google, my (also decidedly non-petite) users, and now Andy are all sharing the one of a kind extremely specific niche.. it's becoming a tad crowded. -Mike
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote: > But, whatever, let's go there: Given the arguments that I laid out for > the no-internal-tasks rule, how does the problem seem fixable through > relaxing the constraint? Well, for one thing, cpusets would cease to leak CPUs. With the no-internal-tasks constraint, no task can acquire affinity of exclusive set A if set B is an exclusive subset thereof, as there is one and only one spot where the affinity of set A exists: in the forbidden set A. Relaxing no-internal-tasks would fix that, but without also relaxing the process-only rule, cpusets would remain useless for the purpose for which it was created. After all, it doesn't do much good to use the one and only dynamic partitioning tool to partition a box if you cannot subsequently place your tasks/threads properly therein. > What people do now with cgroup inside an application is extremely > limited. Because there is no proper support for it, each use case has > to craft up a dedicated custom setup which is all but guaranteed to be > incompatible with what someone else would come up for another > application. Everybody is in "this is mine, I control the entire > system" mindset, which is fine for those specific setups but > detrimental to making it widely available and useful. IMO, the problem with that making it available to the huddled masses bit is that it is a completely unrealistic fantasy. Can hordes of programs really autonomously carve up a single set of resources? I do not believe they can. The system agent cannot autonomously do so either. Intimate knowledge of local requirements is not optional, it is a prerequisite to sound decision making. You have to have a well defined need before it makes any sense to turn these things on, they are not free, and impact is global. -Mike
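[Editorial aside] The rule being debated throughout this thread can be stated precisely as a predicate over a cgroup tree. The following is an illustrative toy model only -- not kernel code, all names invented -- of the v2 no-internal-tasks constraint together with the root exemption that Peter objects to:

```python
class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.procs = set()            # member processes (pids)
        self.subtree_control = set()  # controllers enabled for child cgroups
        if parent is not None:
            parent.children.append(self)

def violates_no_internal_tasks(cg):
    """A non-root cgroup violates the rule if it both enables controllers
    in subtree_control and still contains processes. The root is exempt --
    the asymmetry debated in this thread."""
    if cg.parent is None:   # root exemption
        return False
    return bool(cg.subtree_control) and bool(cg.procs)

root = Cgroup("/")
root.subtree_control = {"cpu", "memory"}
root.procs = {1}                 # tasks in the root despite subtree control: allowed

a = Cgroup("a", root)
a.subtree_control = {"cpu"}
a.procs = {100}                  # same shape below the root: violation

print(violates_no_internal_tasks(root))  # False (exempt)
print(violates_no_internal_tasks(a))     # True
```

The two print statements are the whole argument in miniature: the same configuration is legal at the root and illegal one level down.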
Re: [Documentation] State of CPU controller in cgroup v2
Hello, again. On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote: > > * It doesn't bring any practical benefits in terms of capability. > > Userland can trivially handle the system-root and namespace-roots in > > a symmetrical manner. > > Your idea of "trivially" doesn't match mine. You gave a use case in I suppose I wasn't clear enough. It is trivial in the sense that if the userland implements something which works for namespace-root, it would work the same in system-root without further modifications. > which userspace might take advantage of root being special. If I was emphasizing the cases where userspace would have to deal with the inherent differences, and, when they don't, they can behave exactly the same way. > userspace does that, then that userspace cannot be run in a container. > This could be a problem for real users. Sure, "don't do that" is a > *valid* answer, but it's not a very helpful answer. Great, now we agree that what's currently implemented is valid. I think you're still failing to recognize the inherent specialness of the system-root and how much unnecessary pain the removal of the exemption would cause at virtually no practical gain. I won't repeat the same backing points here. > > * It's an unnecessary inconvenience, especially for cases where the > > cgroup agent isn't in control of boot, for partial usage cases, or > > just for playing with it. > > > > You say that I'm ignoring the same use case for namespace-scope but > > namespace-roots don't have the same hybrid function for partial and > > uncontrolled systems, so it's not clear why there even NEEDS to be > > strict symmetry. > > I think their functions are much closer than you think they are. I > want a whole Linux distro to be able to run in a container. This > means that useful things people do in a distro or initramfs or > whatever should just work if containerized. There isn't much which is getting in the way of doing that.
Again, something which follows no-internal-task rule would behave the same no matter where it is. The system-root is different in that it is exempt from the rule and thus is more flexible but that difference is serving the purpose of handling the inherent specialness of the system-root. AFAICS, it is the solution which causes the least amount of contortion and unnecessary inconvenience to userland. > > It's easy and understandable to get hangups on asymmetries or > > exemptions like this, but they also often are acceptable trade-offs. > > It's really frustrating to see you first getting hung up on "this must > > be wrong" and even after explanations repeating the same thing just in > > different ways. > > > > If there is something fundamentally wrong with it, sure, let's fix it, > > but what's actually broken? > > I'm not saying it's fundamentally wrong. I'm saying it's a design You were. > that has a big wart, and that wart is unfortunate, and after thinking > a bit, I'm starting to agree with PeterZ that this is problematic. It > also seems fixable: the constraint could be relaxed. You've been pushing for enforcing the restriction on the system-root too and now are jumping to the opposite end. It's really frustrating that this is such a whack-a-mole game where you throw ideas without really thinking through them and only concede the bare minimum when all other logical avenues are closed off. Here, again, you seem to be stating a strong opinion when you haven't fully thought about it or tried to understand the reasons behind it. But, whatever, let's go there: Given the arguments that I laid out for the no-internal-tasks rule, how does the problem seem fixable through relaxing the constraint? > >> >> Also, here's an idea to maybe make PeterZ happier: relax the > >> >> restriction a bit per-controller. Currently (except for /), if you > >> >> have subtree control enabled you can't have any processes in the > >> >> cgroup. 
Could you change this so it only applies to certain > >> >> controllers? If the cpu controller is entirely happy to have > >> >> processes and cgroups as siblings, then maybe a cgroup with only cpu > >> >> subtree control enabled could allow processes to exist. > >> > > >> > The document lists several reasons for not doing this and also that > >> > there is no known real world use case for such configuration. > > > > So, up until this point, we were talking about no-internal-tasks > > constraint. > > Isn't this the same thing? IIUC the constraint in question is that, > if a non-root cgroup has subtree control on, then it can't have > processes in it. This is the no-internal-tasks constraint, right? Yes, that is what no-internal-tasks rule is but I don't understand how that is the same thing as process granularity. Am I completely misunderstanding what you are trying to say here? > And I still think that, at least for cpu, nothing at all goes wrong if > you allow processes to exist in cgroups that have cpu set in > subtree-c
Re: [Documentation] State of CPU controller in cgroup v2
On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote: > And I still think that, at least for cpu, nothing at all goes wrong if > you allow processes to exist in cgroups that have cpu set in > subtree-control. cpu, cpuset, perf, cpuacct (although we all agree that really should be part of cpu), pid, and possibly freezer (but I think we all agree freezer is 'broken'). That's roughly half the controllers out there. They all work on tasks, and should therefore have no problems whatsoever to allow the full hierarchy without silly exceptions and constraints. The fundamental problem is that we have two different types of controllers. On the one hand these controllers above, that work on tasks and form groups of them and build up from that. Let's call them task-controllers. On the other hand we have controllers like memcg which take the 'system' as a whole and shrink it down into smaller bits. Let's call these system-controllers. They are fundamentally at odds with capabilities, simply because of the granularity they can work on. Merging the two into a common hierarchy is a useful concept for containerization, no argument on that, esp. when also coupled with namespaces and the like. However, where I object _most_ strongly is having this one use dominate and destroy the capabilities (which are in use) of the task-controllers. > > I do. It's a horrible userland API to expose to individual > > applications if the organization that a given application expects can > > be disturbed by system operations. Imagine how this would be > > documented - "if this operation races with system operation, it may > > return -ENOENT. Repeating the path lookup might make the operation > > succeed again." > > It could be made to work without races, though, with minimal (or even > no) ABI change. The managed program could grab an fd pointing to its > cgroup. Then it would use openat, etc for all operations.
As long as > 'mv /cgroup/a/b /cgroup/c/" didn't cause that fd to stop working, > we're fine. I've mentioned openat() and related APIs several times, but so far never got good reasons why that wouldn't work. Also note that in order to partition the cpus with cpusets, you're required to generate a disjoint hierarchy (that is, one where the (common) parent is 'disabled' and the children have no overlap). This is rather fundamental to partitioning, that by its very nature requires separation. The result is that if you want to place your RT threads (consider an application that consists of RT and !RT parts) in a different partition there is no common parent you can place the process in. cgroup-v2, by placing the system style controllers first and foremost, completely renders that scenario impossible. Note also that any proposed rgroup would not work for this, since that, per design, is a subtree, and therefore not disjoint. So my objection to the whole cgroup-v2 model and implementation stems from the fact that it purports to be a 'better' and 'improved' system, while in actuality it neuters and destroys a lot of useful usecases. It completely disregards all task-controllers and labels their use-cases as irrelevant.
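[Editorial aside] The fd-based access pattern discussed above -- grab a handle on your own cgroup directory, then do everything relative to it so a concurrent rename can't pull the rug out -- is ordinary vfs behavior and can be demonstrated on any directory. A sketch, with plain tmpfs directories standing in for cgroupfs and the file names purely illustrative:

```python
import os
import tempfile

base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "a", "b"))
with open(os.path.join(base, "a", "b", "cgroup.procs"), "w") as f:
    f.write("1234\n")

# The managed program opens a handle on its own "cgroup" directory.
dfd = os.open(os.path.join(base, "a", "b"), os.O_RDONLY)

# An external manager "migrates" the group: mv base/a/b base/c/b
os.makedirs(os.path.join(base, "c"))
os.rename(os.path.join(base, "a", "b"), os.path.join(base, "c", "b"))

# The open directory fd follows the rename; openat(dfd, ...) still works.
fd = os.open("cgroup.procs", os.O_RDONLY, dir_fd=dfd)
data = os.read(fd, 64).decode().strip()
print(data)  # 1234
os.close(fd)
os.close(dfd)
```

(`dir_fd` support requires Linux; on cgroupfs the open-by-fd behavior for *directories* is exactly what works today -- the dispute in the thread is about pid-write migration, which moves tasks rather than directories, so no fd exists to follow.)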
Re: [Documentation] State of CPU controller in cgroup v2
On Sat, Sep 3, 2016 at 3:05 PM, Tejun Heo wrote: > Hello, Andy. > > On Wed, Aug 31, 2016 at 02:46:20PM -0700, Andy Lutomirski wrote: >> > Consider a use case where the user isn't interested in fully >> > accounting and dividing up system resources but wants to just cap >> > resource usage from a subset of workloads. There is no reason to >> > require such usages to fully contain all processes in non-root >> > cgroups. Furthermore, it's not trivial to migrate all processes out >> > of root to a sub-cgroup unless the agent is in full control of boot >> > process. >> >> Then please also consider exactly the same use case while running in a >> container. >> >> I'm a bit frustrated that you're saying that my example failure modes >> consist of shooting oneself in the foot and then you go on to come up >> with your own examples that have precisely the same problem. > > You have a point, which is > > The system-root and namespace-roots are not symmetric. > > and that's a valid concern. Here's why the system-root is special. > [...] > > Now, due to the various issues with direct competition between > processes and cgroups, cgroup v2 disallows resource control across > them (the no-internal-tasks restriction); however, cgroup v2 currently > doesn't apply the restriction to the system-root. Here are the > reasons. > > * It doesn't bring any practical benefits in terms of implementation. > As noted above, all controllers already have to allow uncontained > consumptions in the system-root and that's the only attribute > required for the exemption. > > * It doesn't bring any practical benefits in terms of capability. > Userland can trivially handle the system-root and namespace-roots in > a symmetrical manner. Your idea of "trivially" doesn't match mine. You gave a use case in which userspace might take advantage of root being special. If userspace does that, then that userspace cannot be run in a container. This could be a problem for real users. 
Sure, "don't do that" is a *valid* answer, but it's not a very helpful answer. > > * It's an unnecessary inconvenience, especially for cases where the > cgroup agent isn't in control of boot, for partial usage cases, or > just for playing with it. > > You say that I'm ignoring the same use case for namespace-scope but > namespace-roots don't have the same hybrid function for partial and > uncontrolled systems, so it's not clear why there even NEEDS to be > strict symmetry. I think their functions are much closer than you think they are. I want a whole Linux distro to be able to run in a container. This means that useful things people do in a distro or initramfs or whatever should just work if containerized. > > It's easy and understandable to get hangups on asymmetries or > exemptions like this, but they also often are acceptable trade-offs. > It's really frustrating to see you first getting hung up on "this must > be wrong" and even after explanations repeating the same thing just in > different ways. > > If there is something fundamentally wrong with it, sure, let's fix it, > but what's actually broken? I'm not saying it's fundamentally wrong. I'm saying it's a design that has a big wart, and that wart is unfortunate, and after thinking a bit, I'm starting to agree with PeterZ that this is problematic. It also seems fixable: the constraint could be relaxed. >> >> Also, here's an idea to maybe make PeterZ happier: relax the >> >> restriction a bit per-controller. Currently (except for /), if you >> >> have subtree control enabled you can't have any processes in the >> >> cgroup. Could you change this so it only applies to certain >> >> controllers? If the cpu controller is entirely happy to have >> >> processes and cgroups as siblings, then maybe a cgroup with only cpu >> >> subtree control enabled could allow processes to exist.
>> > >> > The document lists several reasons for not doing this and also that >> > there is no known real world use case for such configuration. > > So, up until this point, we were talking about no-internal-tasks > constraint. Isn't this the same thing? IIUC the constraint in question is that, if a non-root cgroup has subtree control on, then it can't have processes in it. This is the no-internal-tasks constraint, right? And I still think that, at least for cpu, nothing at all goes wrong if you allow processes to exist in cgroups that have cpu set in subtree-control. - begin talking about process granularity - > >> My company's production workload would map quite nicely to this >> relaxed model. I have quite a few processes each with several >> threads. Some of those threads get some CPUs, some get other CPUs, >> and they vary in what shares of what CPUs they get. To be clear, >> there is not a hierarchy of resource usage that's compatible with the >> process hierarchy. Multiple processes have threads that should be >> grouped in a different place in the hierarchy than other threads. >> Concre
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Andy. On Wed, Aug 31, 2016 at 02:46:20PM -0700, Andy Lutomirski wrote: > > Consider a use case where the user isn't interested in fully > > accounting and dividing up system resources but wants to just cap > > resource usage from a subset of workloads. There is no reason to > > require such usages to fully contain all processes in non-root > > cgroups. Furthermore, it's not trivial to migrate all processes out > > of root to a sub-cgroup unless the agent is in full control of boot > > process. > > Then please also consider exactly the same use case while running in a > container. > > I'm a bit frustrated that you're saying that my example failure modes > consist of shooting oneself in the foot and then you go on to come up > with your own examples that have precisely the same problem. You have a point, which is The system-root and namespace-roots are not symmetric. and that's a valid concern. Here's why the system-root is special. * A system has entities and resource consumptions which can only be attributed to the "system". The system-root is the natural place to put them. The system-root has stuff no other cgroups, not even namespace-roots, have. It's a unique situation. * The need to bypass most cgroup related overhead when not in use. The system-root is there whether cgroup is actually in use or not and thus can not impose noticeable overhead. It has to make sense for both resource-controlled systems as well as ones that aren't. Again, no other group has these requirements. Note that this means that all controllers should be able to and already allow uncontained consumptions in the system-root. I'll come back to this later. Now, due to the various issues with direct competition between processes and cgroups, cgroup v2 disallows resource control across them (the no-internal-tasks restriction); however, cgroup v2 currently doesn't apply the restriction to the system-root. Here are the reasons.
* It doesn't bring any practical benefits in terms of implementation. As noted above, all controllers already have to allow uncontained consumptions in the system-root and that's the only attribute required for the exemption. * It doesn't bring any practical benefits in terms of capability. Userland can trivially handle the system-root and namespace-roots in a symmetrical manner. * It's an unnecessary inconvenience, especially for cases where the cgroup agent isn't in control of boot, for partial usage cases, or just for playing with it. You say that I'm ignoring the same use case for namespace-scope but namespace-roots don't have the same hybrid function for partial and uncontrolled systems, so it's not clear why there even NEEDS to be strict symmetry. On this subject, your only actual point is that there is an asymmetry and that's bothersome. I've been trying to explain why the special case doesn't actually get in the way in terms of implementation or capability and is actually beneficial. Instead of engaging in the actual discussion, you're constantly coming up with different ways of saying "it's not symmetric". The system-root and namespace-roots aren't equivalent. There are a lot of parallels between system-root and namespace-root but they aren't the same thing (e.g. bootstrapping a namespace is a less complicated and more malleable process). The system-root is not even a fully qualified node of the resource graph. It's easy and understandable to get hangups on asymmetries or exemptions like this, but they also often are acceptable trade-offs. It's really frustrating to see you first getting hung up on "this must be wrong" and even after explanations repeating the same thing just in different ways. If there is something fundamentally wrong with it, sure, let's fix it, but what's actually broken? > > I have, multiple times. Can you please read 2-1-2 of the document in > > the original post and take the discussion from there? 
> > I've read it multiple times, and I don't see any explanation that's > consistent with the fact that you are exempting the root cgroup from > this constraint. If the constraint were really critical to everything > working, then I would expect the root cgroup to have exactly the same > problem. This makes me think that either something nasty is being > fudged for the root cgroup or that the constraint isn't actually so > important after all. The only thing on point I can find is: > > > Root cgroup is exempt from this constraint, which is in line with > > how root cgroup is handled in general - it's excluded from cgroup > > resource accounting and control. > > and that's not very helpful. My apologies. I somehow thought that was part of the documentation. Will update it later, but here's an excerpt from my earlier response. Having a special case doesn't necessarily get in the way of benefiting from a set of general rules. The root cgroup is inherently special as it has to be the catch-all scope for en
Re: [Documentation] State of CPU controller in cgroup v2
On Wed, Aug 31, 2016 at 2:07 PM, Tejun Heo wrote: > Hello, > > On Wed, Aug 31, 2016 at 12:11:58PM -0700, Andy Lutomirski wrote: >> > You can say that allowing the possibility of deviation isn't a good >> > design choice but it is a design choice with other implications - on >> > how we deal with configurations without cgroup at all, transitioning >> > from v1, bootstrapping a system and avoiding surprising >> > userland-visible behaviors (e.g. like creating magic preset cgroups >> > and silently migrating processes there on certain events). >> >> Are there existing userspace programs that use cgroup2 and enable >> subtree control on / when there are processes in /? If the answer is >> no, then I think you should change cgroup2 to just disallow it. If >> the answer is yes, then I think there's a problem and maybe you should >> consider a breaking change. Given that cgroup2 hasn't really launched >> on a large scale, it seems worthwhile to get it right. > > Adding the restriction isn't difficult from an implementation point of > view and for a system agent which controls the boot process > implementing that wouldn't be difficult either but I can't see what > the actual benefits of the extra restriction would be and there are > tangible downsides to doing so. > > Consider a use case where the user isn't interested in fully > accounting and dividing up system resources but wants to just cap > resource usage from a subset of workloads. There is no reason to > require such usages to fully contain all processes in non-root > cgroups. Furthermore, it's not trivial to migrate all processes out > of root to a sub-cgroup unless the agent is in full control of the boot > process. Then please also consider exactly the same use case while running in a container. I'm a bit frustrated that you're saying that my example failure modes consist of shooting oneself in the foot and then you go on to come up with your own examples that have precisely the same problem. 
> >> I don't understand what you're talking about wrt silently migrating >> processes. Are you thinking about usermodehelper? If so, maybe it >> really does make sense to allow (or require?) the cgroup manager to >> specify which cgroup these processes end up in. > > That was from one of the ideas that I was considering way back where > enabling resource control in an intermediate node automatically moves > internal processes to a preset cgroup whether visible or hidden, which > would be another way of addressing the problem. > > None of these affects what cgroup v2 can do at all and the only thing > the userland is asked to do under the current scheme is "if you wanna > keep the whole system divided up and use the same mode of operations > across system-scope and namespace-scope move out of root while setting > yourself up, which also happens to be what you have to do inside > namespaces anyway." > >> But, given that all the controllers need to support the current magic >> root exception (for genuinely unaccountable things if nothing else), >> can you explain what would actually go wrong if you just removed the >> restriction entirely? > > I have, multiple times. Can you please read 2-1-2 of the document in > the original post and take the discussion from there? I've read it multiple times, and I don't see any explanation that's consistent with the fact that you are exempting the root cgroup from this constraint. If the constraint were really critical to everything working, then I would expect the root cgroup to have exactly the same problem. This makes me think that either something nasty is being fudged for the root cgroup or that the constraint isn't actually so important after all. The only thing on point I can find is: > Root cgroup is exempt from this constraint, which is in line with > how root cgroup is handled in general - it's excluded from cgroup > resource accounting and control. and that's not very helpful. 
> >> Also, here's an idea to maybe make PeterZ happier: relax the >> restriction a bit per-controller. Currently (except for /), if you >> have subtree control enabled you can't have any processes in the >> cgroup. Could you change this so it only applies to certain >> controllers? If the cpu controller is entirely happy to have >> processes and cgroups as siblings, then maybe a cgroup with only cpu >> subtree control enabled could allow processes to exist. > > The document lists several reasons for not doing this and also that > there is no known real world use case for such configuration. My company's production workload would map quite nicely to this relaxed model. I have quite a few processes each with several threads. Some of those threads get some CPUs, some get other CPUs, and they vary in what shares of what CPUs they get. To be clear, there is not a hierarchy of resource usage that's compatible with the process hierarchy. Multiple processes have threads that should be grouped in a different place in the hierarchy than other threads.
Re: [Documentation] State of CPU controller in cgroup v2
Hello, On Wed, Aug 31, 2016 at 12:11:58PM -0700, Andy Lutomirski wrote: > > You can say that allowing the possibility of deviation isn't a good > > design choice but it is a design choice with other implications - on > > how we deal with configurations without cgroup at all, transitioning > > from v1, bootstrapping a system and avoiding surprising > > userland-visible behaviors (e.g. like creating magic preset cgroups > > and silently migrating processes there on certain events). > > Are there existing userspace programs that use cgroup2 and enable > subtree control on / when there are processes in /? If the answer is > no, then I think you should change cgroup2 to just disallow it. If > the answer is yes, then I think there's a problem and maybe you should > consider a breaking change. Given that cgroup2 hasn't really launched > on a large scale, it seems worthwhile to get it right. Adding the restriction isn't difficult from an implementation point of view and for a system agent which controls the boot process implementing that wouldn't be difficult either, but I can't see what the actual benefits of the extra restriction would be and there are tangible downsides to doing so. Consider a use case where the user isn't interested in fully accounting and dividing up system resources but wants to just cap resource usage from a subset of workloads. There is no reason to require such usages to fully contain all processes in non-root cgroups. Furthermore, it's not trivial to migrate all processes out of root to a sub-cgroup unless the agent is in full control of the boot process. At least up until this point in the discussion, I can't see actual benefits of adding this restriction and the only reason for pushing it seems to be the initial misunderstanding and purism. > I don't understand what you're talking about wrt silently migrating > processes. Are you thinking about usermodehelper? If so, maybe it > really does make sense to allow (or require?) 
the cgroup manager to > specify which cgroup these processes end up in. That was from one of the ideas that I was considering way back where enabling resource control in an intermediate node automatically moves internal processes to a preset cgroup whether visible or hidden, which would be another way of addressing the problem. None of these affects what cgroup v2 can do at all and the only thing the userland is asked to do under the current scheme is "if you wanna keep the whole system divided up and use the same mode of operations across system-scope and namespace-scope move out of root while setting yourself up, which also happens to be what you have to do inside namespaces anyway." > But, given that all the controllers need to support the current magic > root exception (for genuinely unaccountable things if nothing else), > can you explain what would actually go wrong if you just removed the > restriction entirely? I have, multiple times. Can you please read 2-1-2 of the document in the original post and take the discussion from there? > Also, here's an idea to maybe make PeterZ happier: relax the > restriction a bit per-controller. Currently (except for /), if you > have subtree control enabled you can't have any processes in the > cgroup. Could you change this so it only applies to certain > controllers? If the cpu controller is entirely happy to have > processes and cgroups as siblings, then maybe a cgroup with only cpu > subtree control enabled could allow processes to exist. The document lists several reasons for not doing this and also that there is no known real world use case for such configuration. Please also note that the behavior that you're describing is actually what rgroup implements. It makes a lot more sense there because threads and groups share the same configuration mechanism and it only has to worry about competition among threads (anonymous consumption is out of scope for rgroup). 
> >> It *also* won't work (I think) if subtree control is enabled on the > >> root, but I don't think this is a problem in practice because subtree > >> control won't be enabled on the namespace root by a sensible cgroup > >> manager. > > > > Exactly the same thing. You can shoot yourself in the foot but it's > > easy not to. > > Somewhat off-topic: this appears to be either a bug or a misfeature: > > bash-4.3# mkdir foo > bash-4.3# ls foo > cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control > bash-4.3# mkdir foo/io.max <-- IMO this shouldn't have worked > bash-4.3# echo +io >cgroup.subtree_control > [ 40.470712] cgroup: cgroup_addrm_files: failed to add max, err=-17 > > Shouldn't cgroups with names that potentially conflict with > kernel-provided dentries be disallowed? Yeap, the name collisions suck. I thought about disallowing all sub-cgroups whose names start with "KNOWN_SUBSYS." but that has a non-trivial chance of breaking users which were happy before when a new controller gets added. But, yeah, we at least should disallow the
Re: [Documentation] State of CPU controller in cgroup v2
I'm replying separately to keep the two issues in separate emails. On Mon, Aug 29, 2016 at 3:20 PM, Tejun Heo wrote: > Hello, Andy. > > Sorry about the delay. Was kinda overwhelmed with other things. > > On Sat, Aug 20, 2016 at 11:45:55AM -0700, Andy Lutomirski wrote: >> > This becomes clear whenever an entity is allocating memory on behalf >> > of someone else - get_user_pages(), khugepaged, swapoff and so on (and >> > likely userfaultfd too). When a task is trying to add a page to a >> > VMA, the task might not have any relationship with the VMA other than >> > that it's operating on it for someone else. The page has to be >> > charged to whoever is responsible for the VMA and the only ownership >> > which can be established is the containing mm_struct. >> >> This surprises me a bit. If I do access_process_vm(), then I would >> have expected the charge to go to the caller, not the mm being accessed. > > It does and should go to the target mm. Who faults in a page shouldn't > be the final determinant in the ownership; otherwise, we end up in > situations where the ownership changes due to, for example, > fluctuations in page fault pattern. It doesn't make semantic sense > either. If a kthread is doing PIO for a process, why would it get > charged for the memory it's faulting in? OK, that makes sense. Although, given that cgroup1 allows tasks in the same process to be split up, how does this work in cgroup1? Do you just pick the mm associated with the thread group leader? If so, why can't cgroup2 do the same thing? But even this is at best a vague approximation. If you have MAP_SHARED mappings (libc.so, for example), then the cgroup you charge it to is more or less arbitrary. > >> What happens if a program calls read(2), though? A page may be >> inserted into page cache on behalf of an address_space without any >> particular mm being involved. There will usually be a calling task, >> though. 
> > Most faults are synchronous and the faulting thread is a member of the > mm to be charged, so this usually isn't an issue. I don't think there > are places where we populate an address_space without knowing who it > is for (as opposed / in addition to who the operator is). True, but there's no *mm* involved in any fundamental sense. You can look at the task and find the task's mm (or actually the task's thread group leader, since cgroup2 doesn't literally map mms to cgroups), but that seems to me to be a pretty poor reason to argue that tasks should have to be kept together. > >> But this is all very memcg-specific. What about other cgroups? I/O >> is per-task, right? Scheduling is definitely per-task. > > They aren't separate. Think about IOs to write out page cache, CPU > cycles spent reclaiming memory or encrypting writeback IOs. It's fine > to get more granular with specific resources but the semantics gets > messy for cross-resource accounting and control without proper > scoping. Page cache doesn't belong to a specific mm. Memory reclaim only has an mm associated if the memory being reclaimed belongs cleanly to an mm. Encrypting writeback (I assume you mean the cpu usage) is just like page cache writeback IO -- there's no specific mm involved in general. > >> > Consider the scenario where you have somebody faulting on behalf of a >> > foreign VMA, but the thread who created and is actively using that VMA >> > is in a different cgroup than the process leader. Who are we going to >> > charge? All possible answers seem erratic. >> >> Indeed, and this problem is probably not solvable in practice unless >> you charge all involved cgroups. But the caller's *mm* is entirely >> irrelevant here, so I don't see how this implies that cgroups need to >> keep tasks in the same process together. 
The relevant entities are >> the calling *task* and the target mm, and you're going to be >> hard-pressed to ensure that they belong to the same cgroup, so I think >> you need to be able to handle weird cases in which there isn't an >> obviously correct cgroup to charge. > > It is an erratic case which is caused by the userland interface allowing > non-sensical configuration. We can accept it as a necessary trade-off > given big enough benefits or unavoidable constraints but it isn't > something to do willy-nilly. > >> > For system-level and process-level operations to not step on each >> > other's toes, they need to agree on the granularity boundary - >> > system-level should be able to treat an application hierarchy as a >> > single unit. A possible solution is allowing rgroup hierarchies to >> > span across process boundaries and implementing cgroup migration >> > operations which treat such hierarchies as a single unit. I'm not yet >> > sure whether the boundary should be at program groups or rgroups. >> >> I think that, if the system cgroup manager is moving processes around >> after starting them and execing the final binary, there will be races >> and confusion, and no amount of granularity fiddling will fix that.
Re: [Documentation] State of CPU controller in cgroup v2
On Wed, Aug 31, 2016 at 10:32 AM, Tejun Heo wrote: > Hello, Andy. > > >> >> I really, really think that cgroup v2 should supply the same >> >> *interface* inside and outside of a non-root namespace. If this is >> > >> > It *does*. That's what I tried to explain, that it's exactly >> > isomorphic once you discount the system-wide consumptions. >> >> I don't think I agree. >> >> Suppose I wrote an init program or a cgroup manager. I can expect >> that init program to be started in the root cgroup. The program can >> be lazy and write +io to /cgroup/cgroup.subtree_control and then >> create some new cgroup /cgroup/a and it will work (I just tried it). >> >> Now I run that program in a namespace. It will not work because it'll >> get -EBUSY when it tries to write to cgroup.subtree_control. (I just >> tried this, too, only using cd instead of a namespace.) So it's *not* >> isomorphic. > > Yeah, it is possible to shoot yourself in the foot but both > system-scope and namespace-scope can implement exactly the same > behavior - move yourself out of root before enabling resource controls > and get the same expected outcome, which BTW is how systemd behaves > already. > > You can say that allowing the possibility of deviation isn't a good > design choice but it is a design choice with other implications - on > how we deal with configurations without cgroup at all, transitioning > from v1, bootstrapping a system and avoiding surprising > userland-visible behaviors (e.g. like creating magic preset cgroups > and silently migrating processes there on certain events). Are there existing userspace programs that use cgroup2 and enable subtree control on / when there are processes in /? If the answer is no, then I think you should change cgroup2 to just disallow it. If the answer is yes, then I think there's a problem and maybe you should consider a breaking change. Given that cgroup2 hasn't really launched on a large scale, it seems worthwhile to get it right. 
I don't understand what you're talking about wrt silently migrating processes. Are you thinking about usermodehelper? If so, maybe it really does make sense to allow (or require?) the cgroup manager to specify which cgroup these processes end up in. But, given that all the controllers need to support the current magic root exception (for genuinely unaccountable things if nothing else), can you explain what would actually go wrong if you just removed the restriction entirely? Also, here's an idea to maybe make PeterZ happier: relax the restriction a bit per-controller. Currently (except for /), if you have subtree control enabled you can't have any processes in the cgroup. Could you change this so it only applies to certain controllers? If the cpu controller is entirely happy to have processes and cgroups as siblings, then maybe a cgroup with only cpu subtree control enabled could allow processes to exist. > >> It *also* won't work (I think) if subtree control is enabled on the >> root, but I don't think this is a problem in practice because subtree >> control won't be enabled on the namespace root by a sensible cgroup >> manager. > > Exactly the same thing. You can shoot yourself in the foot but it's > easy not to. > Somewhat off-topic: this appears to be either a bug or a misfeature: bash-4.3# mkdir foo bash-4.3# ls foo cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control bash-4.3# mkdir foo/io.max <-- IMO this shouldn't have worked bash-4.3# echo +io >cgroup.subtree_control [ 40.470712] cgroup: cgroup_addrm_files: failed to add max, err=-17 Shouldn't cgroups with names that potentially conflict with kernel-provided dentries be disallowed? --Andy
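[Editorial note: the constraint being argued over in this exchange can be condensed into a small toy model. This is an illustrative sketch only, not kernel code; the class and method names are invented, and only the two rules under discussion are modeled: a non-root cgroup may not have both member processes and controllers enabled in its subtree_control, while the system root is exempt.]

```python
# Toy model of cgroup v2's no-internal-processes rule and the root
# exemption.  Illustrative only; names are invented for this sketch.

class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        self.procs = set()
        self.subtree_control = set()

    @property
    def is_root(self):
        return self.parent is None

    def mkdir(self, name):
        child = Cgroup(name, parent=self)
        self.children[name] = child
        return child

    def attach(self, pid):
        # v2 also refuses to attach a process to a non-root cgroup
        # that has already delegated controllers to its children.
        if self.subtree_control and not self.is_root:
            raise OSError("EBUSY: controllers enabled for children")
        self.procs.add(pid)

    def enable_subtree_control(self, controller):
        # The contested rule: a *non-root* cgroup may not have both
        # member processes and controllers enabled for its subtree.
        if self.procs and not self.is_root:
            raise OSError("EBUSY: cgroup has member processes")
        self.subtree_control.add(controller)

root = Cgroup("/")
root.attach(1)                      # processes may live in the root...
root.enable_subtree_control("io")   # ...and the root is still exempt

ns_root = root.mkdir("ns")          # a delegated "namespace root"
ns_root.attach(42)
try:
    ns_root.enable_subtree_control("io")
except OSError as e:
    print(e)                        # the -EBUSY asymmetry Andy hit
```

Under this model the system root and a delegated subtree behave differently only when processes are left sitting in the root of the subtree, which is the asymmetry the thread is debating.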
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Andy. On Tue, Aug 30, 2016 at 08:42:20PM -0700, Andy Lutomirski wrote: > On Mon, Aug 29, 2016 at 3:20 PM, Tejun Heo wrote: > >> This seems to explain why the controllers need to be able to handle > >> things being charged to the root cgroup (or to an unidentifiable > >> cgroup, anyway). That isn't quite the same thing as allowing, from an > >> ABI point of view, the root cgroup to contain processes and cgroups > >> but not allowing other cgroups to do the same thing. Consider: > > > > The points are 1. we need the root to be a special container anyway > > But you don't need to let userspace see that. I'm not saying that what cgroup v2 implements is the only solution. There of course can be other approaches which don't expose this particular detail to userland. I was highlighting that there is an underlying condition to be dealt with and that what cgroup v2 implements is one working solution for it. It's fine to have, say, aesthetic disagreements on the specifics of the chosen approach, and, while a bit late, we can still talk about pros and cons of different possible approaches and make improvements where it makes sense. However, this isn't in any way a make-it-or-break-it issue as you implied before. > >> I really, really think that cgroup v2 should supply the same > >> *interface* inside and outside of a non-root namespace. If this is > > > > It *does*. That's what I tried to explain, that it's exactly > > isomorphic once you discount the system-wide consumptions. > > I don't think I agree. > > Suppose I wrote an init program or a cgroup manager. I can expect > that init program to be started in the root cgroup. The program can > be lazy and write +io to /cgroup/cgroup.subtree_control and then > create some new cgroup /cgroup/a and it will work (I just tried it). > > Now I run that program in a namespace. It will not work because it'll > get -EBUSY when it tries to write to cgroup.subtree_control. 
(I just > tried this, too, only using cd instead of a namespace.) So it's *not* > isomorphic. Yeah, it is possible to shoot yourself in the foot but both system-scope and namespace-scope can implement exactly the same behavior - move yourself out of root before enabling resource controls and get the same expected outcome, which BTW is how systemd behaves already. You can say that allowing the possibility of deviation isn't a good design choice but it is a design choice with other implications - on how we deal with configurations without cgroup at all, transitioning from v1, bootstrapping a system and avoiding surprising userland-visible behaviors (e.g. like creating magic preset cgroups and silently migrating processes there on certain events). > It *also* won't work (I think) if subtree control is enabled on the > root, but I don't think this is a problem in practice because subtree > control won't be enabled on the namespace root by a sensible cgroup > manager. Exactly the same thing. You can shoot yourself in the foot but it's easy not to. Thanks. -- tejun
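[Editorial note: the "move yourself out of root before enabling resource controls" sequence described here can be sketched as follows. The helper name and the "manager" leaf path are invented for the example; on a real system these writes require ownership of the (delegated) cgroup tree and root privileges, and the sequence is the same whether `cgroup_root` is the system root or a namespace root.]

```python
# Sketch of a manager's startup sequence against a cgroup v2 mount.
# Illustrative only: setup_manager and the "manager" leaf are invented.
import os

def setup_manager(cgroup_root, controllers=("io",), pid=None):
    pid = pid if pid is not None else os.getpid()
    # 1. Create a leaf cgroup for ourselves and move there, so the
    #    root of this subtree no longer has member processes.
    leaf = os.path.join(cgroup_root, "manager")
    os.makedirs(leaf, exist_ok=True)
    with open(os.path.join(leaf, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    # 2. Only now enable controllers for the root's children; with the
    #    root emptied this succeeds in both the system root and a
    #    namespace root, giving the symmetric behavior described above.
    with open(os.path.join(cgroup_root, "cgroup.subtree_control"), "w") as f:
        f.write(" ".join("+" + c for c in controllers))
```

Doing the two steps in the opposite order is exactly the -EBUSY failure discussed earlier in the thread.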
Re: [Documentation] State of CPU controller in cgroup v2
On Mon, Aug 29, 2016 at 3:20 PM, Tejun Heo wrote: >> > These base-system operations are special regardless of cgroup and we >> > already have sometimes crude ways to affect their behaviors where >> > necessary through sysctl knobs, priorities on specific kernel threads >> > and so on. cgroup doesn't change the situation all that much. What >> > gets left in the root cgroup usually are the base-system operations >> > which are outside the scope of cgroup resource control in the first >> > place and cgroup resource graph can treat the root as an opaque anchor >> > point. >> >> This seems to explain why the controllers need to be able to handle >> things being charged to the root cgroup (or to an unidentifiable >> cgroup, anyway). That isn't quite the same thing as allowing, from an >> ABI point of view, the root cgroup to contain processes and cgroups >> but not allowing other cgroups to do the same thing. Consider: > > The points are 1. we need the root to be a special container anyway But you don't need to let userspace see that. > 2. allowing it to be special and contain system-wide consumptions > doesn't make the resource graph inconsistent once all non-system-wide > consumptions are put in non-root cgroups, and 3. this is the most > natural way to handle the situation both from implementation and > interface standpoints as it makes non-cgroup configuration a natural > degenerate case of cgroup configuration. > >> suppose that systemd (or some competing cgroup manager) is designed to >> run in the root cgroup namespace. It presumably expects *itself* to >> be in the root cgroup. Now try to run it using cgroups v2 in a >> non-root namespace. I don't see how it can possibly work if the >> hierarchy constraints don't permit it to create sub-cgroups while it's >> still in the root. In fact, this seems impossible to fix even with >> user code changes. 
The manager would need to simultaneously create a >> new child cgroup to contain itself and assign itself to that child >> cgroup, because the intermediate state is illegal. > > Please re-read the constraint. It doesn't prevent any organizational > operations before resource control is enabled. > >> I really, really think that cgroup v2 should supply the same >> *interface* inside and outside of a non-root namespace. If this is > > It *does*. That's what I tried to explain, that it's exactly > isomorphic once you discount the system-wide consumptions. > I don't think I agree. Suppose I wrote an init program or a cgroup manager. I can expect that init program to be started in the root cgroup. The program can be lazy and write +io to /cgroup/cgroup.subtree_control and then create some new cgroup /cgroup/a and it will work (I just tried it). Now I run that program in a namespace. It will not work because it'll get -EBUSY when it tries to write to cgroup.subtree_control. (I just tried this, too, only using cd instead of a namespace.) So it's *not* isomorphic. It *also* won't work (I think) if subtree control is enabled on the root, but I don't think this is a problem in practice because subtree control won't be enabled on the namespace root by a sensible cgroup manager. --Andy
Re: [Documentation] State of CPU controller in cgroup v2
Hello, James. On Sat, Aug 20, 2016 at 10:34:14PM -0700, James Bottomley wrote: > I can see that process based is conceptually easier in v2 because you > begin with a process tree, but it would really be a pity to lose the > thread based controls we have now and permanently lose the ability to > create more as we find uses for them. I can't really see how improving > "common resource domain" is a good tradeoff for this. Thread based control for namespace is not a different problem from thread based control for individual applications, right? And the problems with using cgroupfs directly for in-process control still apply the same whether it's system-wide or inside a namespace. One argument could be that inside a namespace, as the cgroupfs is already scoped, cgroup path headaches are less of an issue, which is true; however, that isn't applicable to applications which aren't scoped in their own namespaces and we can't scope every binary on the system. More importantly, a given application can't rely on being scoped in a certain way. You can craft a custom config for a specific setup but that's a horrible way to solve the problem of in-application hierarchical resource distribution, and that's what rgroup was all about. Thanks. -- tejun
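[Editorial note: one concrete form of the "cgroup path headaches" mentioned here is that before an application can manage sub-cgroups for itself it must first discover where it was placed. On the v2 hierarchy `/proc/self/cgroup` contains a single `0::<path>` line; the sketch below parses it. The function name is invented, and the example path is hypothetical.]

```python
# Parse the cgroup v2 entry out of /proc/self/cgroup contents.
# On v2 the line has the form "0::<path>" (hierarchy id 0, empty
# controller list); v1 lines like "12:cpu,cpuacct:/foo" are skipped.

def cgroup2_self_path(proc_self_cgroup_text):
    """Return the v2 cgroup path from /proc/self/cgroup contents."""
    for line in proc_self_cgroup_text.splitlines():
        hierarchy_id, controllers, path = line.split(":", 2)
        if hierarchy_id == "0" and controllers == "":
            return path
    raise LookupError("not on the cgroup v2 hierarchy")

# e.g. a process that a system agent scoped into a service cgroup:
print(cgroup2_self_path("0::/system.slice/myapp.service"))
# -> /system.slice/myapp.service
```

The point in the message stands: the result of this lookup depends entirely on how the surrounding system chose to scope the application, which is exactly what an application cannot rely on.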
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Andy. Sorry about the delay. Was kinda overwhelmed with other things. On Sat, Aug 20, 2016 at 11:45:55AM -0700, Andy Lutomirski wrote: > > This becomes clear whenever an entity is allocating memory on behalf > > of someone else - get_user_pages(), khugepaged, swapoff and so on (and > > likely userfaultfd too). When a task is trying to add a page to a > > VMA, the task might not have any relationship with the VMA other than > > that it's operating on it for someone else. The page has to be > > charged to whoever is responsible for the VMA and the only ownership > > which can be established is the containing mm_struct. > > This surprises me a bit. If I do access_process_vm(), then I would > have expected the charge to go to the caller, not the mm being accessed. It does and should go to the target mm. Who faults in a page shouldn't be the final determinant in the ownership; otherwise, we end up in situations where the ownership changes due to, for example, fluctuations in page fault pattern. It doesn't make semantic sense either. If a kthread is doing PIO for a process, why would it get charged for the memory it's faulting in? > What happens if a program calls read(2), though? A page may be > inserted into page cache on behalf of an address_space without any > particular mm being involved. There will usually be a calling task, > though. Most faults are synchronous and the faulting thread is a member of the mm to be charged, so this usually isn't an issue. I don't think there are places where we populate an address_space without knowing who it is for (as opposed / in addition to who the operator is). > But this is all very memcg-specific. What about other cgroups? I/O > is per-task, right? Scheduling is definitely per-task. They aren't separate. Think about IOs to write out page cache, CPU cycles spent reclaiming memory or encrypting writeback IOs. 
It's fine to get more granular with specific resources but the semantics gets messy for cross-resource accounting and control without proper scoping. > > Consider the scenario where you have somebody faulting on behalf of a > > foreign VMA, but the thread who created and is actively using that VMA > > is in a different cgroup than the process leader. Who are we going to > > charge? All possible answers seem erratic. > > Indeed, and this problem is probably not solvable in practice unless > you charge all involved cgroups. But the caller's *mm* is entirely > irrelevant here, so I don't see how this implies that cgroups need to > keep tasks in the same process together. The relevant entities are > the calling *task* and the target mm, and you're going to be > hard-pressed to ensure that they belong to the same cgroup, so I think > you need to be able to handle weird cases in which there isn't an > obviously correct cgroup to charge. It is an erratic case which is caused by the userland interface allowing non-sensical configuration. We can accept it as a necessary trade-off given big enough benefits or unavoidable constraints but it isn't something to do willy-nilly. > > For system-level and process-level operations to not step on each > > other's toes, they need to agree on the granularity boundary - > > system-level should be able to treat an application hierarchy as a > > single unit. A possible solution is allowing rgroup hierarchies to > > span across process boundaries and implementing cgroup migration > > operations which treat such hierarchies as a single unit. I'm not yet > > sure whether the boundary should be at program groups or rgroups. > > I think that, if the system cgroup manager is moving processes around > after starting them and execing the final binary, there will be races > and confusion, and no amount of granularity fiddling will fix that. I don't see how that statement is true. 
For example, if you confine the hierarchy to in-process, there is proper isolation and whether the system agent migrates the process or not doesn't make any difference to the internal hierarchy. > I know nothing about rgroups. Are they upstream? It was linked from the original message. [7] http://lkml.kernel.org/r/20160105154503.gc5...@mtj.duckdns.org [RFD] cgroup: thread granularity support for cpu controller Tejun Heo [8] http://lkml.kernel.org/r/1457710888-31182-1-git-send-email...@kernel.org [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo [9] http://lkml.kernel.org/r/20160311160522.ga24...@htj.duckdns.org Example program for PRIO_RGRP Tejun Heo > > These base-system operations are special regardless of cgroup and we > > already have sometimes crude ways to affect their behaviors where > > necessary through sysctl knobs, priorities on specific kernel threads > > and so on. cgroup doesn't change the situation all that much. What > > gets left in the root cgroup usually are the base-system operations > > which are outside the scope of cgroup resource co
Re: [Documentation] State of CPU controller in cgroup v2
On Sat, 2016-08-20 at 11:56 -0400, Tejun Heo wrote: > > > there are other reasons to enforce process granularity. One > > > important one is isolating system-level management operations from > > > in-process application operations. The cgroup interface, being a > > > virtual filesystem, is very unfit for multiple independent > > > operations taking place at the same time as most operations have to > > > be multi-step and there is no way to synchronize multiple accessors. > > > See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity" > > > > I don't buy this argument at all. System-level code is likely to > > assign single process *trees*, which are a different beast entirely. > > I.e. you fork, move the child into a cgroup, and that child and its > > children stay in that cgroup. I don't see how the thread/process > > distinction matters. > > Good point on the multi-process issue, this is something which nagged > me a bit while working on rgroup, although I have to point out that > the issue here is one of not going far enough rather than the approach > being wrong. There are limitations to scoping it to individual > processes but that doesn't negate the underlying problem or the > usefulness of in-process control. > > For system-level and process-level operations to not step on each > other's toes, they need to agree on the granularity boundary - > system-level should be able to treat an application hierarchy as a > single unit. A possible solution is allowing rgroup hierarchies to > span across process boundaries and implementing cgroup migration > operations which treat such hierarchies as a single unit. I'm not yet > sure whether the boundary should be at program groups or rgroups. Why is it not viable to predicate contentious lowest-common-denominator restrictions upon the set of enabled controllers? 
If only thread granularity controllers are enabled, from that point onward, v2 restrictions cease to make any sense, thus could be lifted, leaving nobody cast adrift in a leaky v1 lifeboat when v2 sets sail. Or? -Mike
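Mike's suggestion hinges on per-cgroup controller state, which cgroup v2 already exposes through interface files. A minimal sketch of inspecting them from the shell; the mount point is the conventional one and may differ on a given system, and the script degrades to a message where no v2 hierarchy is mounted:

```shell
# Where available controllers and the set delegated to children are visible.
CG=${CG:-/sys/fs/cgroup}

if [ -r "$CG/cgroup.controllers" ]; then
    avail=$(cat "$CG/cgroup.controllers")      # controllers usable at this level
    child=$(cat "$CG/cgroup.subtree_control")  # controllers enabled for children
    echo "available: $avail"
    echo "enabled for children: ${child:-(none)}"
else
    echo "no cgroup2 hierarchy mounted at $CG"
fi
```

A rule like the one Mike proposes would key off the `cgroup.subtree_control` contents: if only thread-capable controllers appear there, the process-granularity restriction could in principle be relaxed for that subtree.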
Re: [Documentation] State of CPU controller in cgroup v2
On Wed, 2016-08-17 at 13:18 -0700, Andy Lutomirski wrote: > On Aug 5, 2016 7:07 PM, "Tejun Heo" wrote: [...] > > 2. Disagreements and Arguments > > > > There have been several lengthy discussion threads [3][4] on LKML > > around the structural constraints of cgroup v2. The two that > > affect the CPU controller are process granularity and no internal > > process constraint. Both arise primarily from the need for common > > resource domain definition across different resources. > > > > The common resource domain is a powerful concept in cgroup v2 that > > allows controllers to make basic assumptions about the structural > > organization of processes and controllers inside the cgroup > > hierarchy, and thus solve problems spanning multiple types of > > resources. The prime example for this is page cache writeback: > > dirty page cache is regulated through throttling buffered writers > > based on memory availability, and initiating batched write outs to > > the disk based on IO capacity. Tracking and controlling writeback > > inside a cgroup thus requires the direct cooperation of the memory > > and the IO controller. > > > > This easily extends to other areas, such as CPU cycles consumed > > while performing memory reclaim or IO encryption. > > > > > > 2-1. Contentious Restrictions > > > > For controllers of different resources to work together, they must > > agree on a common organization. This uniform model across > > controllers imposes two contentious restrictions on the CPU > > controller: process granularity and the no-internal-process > > constraint. > > > > > > 2-1-1. Process Granularity > > > > For memory, because an address space is shared between all > > threads > > of a process, the terminal consumer is a process, not a thread. > > Separating the threads of a single process into different memory > > control domains doesn't make semantical sense. 
cgroup v2 ensures > > that all controllers can agree on the same organization by > > requiring > > that threads of the same process belong to the same cgroup. > > I haven't followed all of the history here, but it seems to me that > this argument is less accurate than it appears. Linux, for better or > for worse, has somewhat orthogonal concepts of thread groups > (processes), mms, and file tables. An mm has VMAs in it, and VMAs > can reference things (files, etc) that hold resources. (Two mms can > share resources by mapping the same thing or using fork().) File > tables hold files, and files can use resources. Both of these are, > at best, moderately good approximations of what actually holds > resources. Meanwhile, threads (tasks) do syscalls, take page faults, > *allocate* resources, etc. > > So I think it's not really true to say that the "terminal consumer" > of anything is a process, not a thread. > > While it's certainly easier to think about assigning processes to > cgroups, and I certainly agree that, in the common case, it's the > right thing to do, I don't see why requiring it is a good idea. Can > we turn this around: what actually goes wrong if cgroup v2 were to > allow assigning individual threads if a user specifically requests > it? A similar point from a different consumer: from the unprivileged containers point of view, I'm interested in a thread-based interface as well. The principal utility of unprivileged containers is to allow applications that wish to use container properties (effectively to become self-containerising). Some that use the producer/consumer model do use process pools (apache springs to mind instantly) but some use thread pools. 
It is useful to the latter to preserve the concept of a thread as being the entity inhabiting the cgroup (but only where the granularity of the cgroup permits threads to participate) so we can easily modify them to be self-containerising without forcing them to switch back from a thread pool model to a process pool model. I can see that process-based is conceptually easier in v2 because you begin with a process tree, but it would really be a pity to lose the thread-based controls we have now and permanently lose the ability to create more as we find uses for them. I can't really see how improving "common resource domain" is a good tradeoff for this. James
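The granularity difference James describes maps onto two different interface files: v1 hierarchies accept individual thread IDs via a per-cgroup "tasks" file, while v2 moves whole processes via "cgroup.procs". A hedged sketch with hypothetical cgroup paths; the writes are attempted only when the files are writable, so this is a safe no-op elsewhere:

```shell
TID=$$                              # stand-in for a single worker thread's id
V1_GRP=/sys/fs/cgroup/cpu/pool      # hypothetical v1 cpu cgroup
V2_GRP=/sys/fs/cgroup/pool          # hypothetical v2 cgroup

# v1: thread granularity -- a lone TID goes into the per-cgroup "tasks" file.
if [ -w "$V1_GRP/tasks" ]; then
    echo "$TID" > "$V1_GRP/tasks"
fi

# v2: process granularity -- writing any TID to "cgroup.procs" migrates the
# entire thread group; threads of one process cannot be split across cgroups.
if [ -w "$V2_GRP/cgroup.procs" ]; then
    echo "$TID" > "$V2_GRP/cgroup.procs"
fi
```

The self-containerising thread pool James mentions relies on the first form: each worker thread writes its own TID into an account cgroup before doing work on its behalf.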
Re: [Documentation] State of CPU controller in cgroup v2
On Sat, Aug 20, 2016 at 8:56 AM, Tejun Heo wrote: > Hello, Andy. > > On Wed, Aug 17, 2016 at 01:18:24PM -0700, Andy Lutomirski wrote: >> > 2-1-1. Process Granularity >> > >> > For memory, because an address space is shared between all threads >> > of a process, the terminal consumer is a process, not a thread. >> > Separating the threads of a single process into different memory >> > control domains doesn't make semantical sense. cgroup v2 ensures >> > that all controllers can agree on the same organization by requiring >> > that threads of the same process belong to the same cgroup. >> >> I haven't followed all of the history here, but it seems to me that >> this argument is less accurate than it appears. Linux, for better or >> for worse, has somewhat orthogonal concepts of thread groups >> (processes), mms, and file tables. An mm has VMAs in it, and VMAs can >> reference things (files, etc) that hold resources. (Two mms can share >> resources by mapping the same thing or using fork().) File tables >> hold files, and files can use resources. Both of these are, at best, >> moderately good approximations of what actually holds resources. >> Meanwhile, threads (tasks) do syscalls, take page faults, *allocate* >> resources, etc. >> >> So I think it's not really true to say that the "terminal consumer" of >> anything is a process, not a thread. > > The terminal consumer is actually the mm context. A task may be the > allocating entity but not always for itself. > > This becomes clear whenever an entity is allocating memory on behalf > of someone else - get_user_pages(), khugepaged, swapoff and so on (and > likely userfaultfd too). When a task is trying to add a page to a > VMA, the task might not have any relationship with the VMA other than > that it's operating on it for someone else. The page has to be > charged to whoever is responsible for the VMA and the only ownership > which can be established is the containing mm_struct. This surprises me a bit. 
If I do access_process_vm(), then I would have expected the charge to go to the caller, not the mm being accessed. What happens if a program calls read(2), though? A page may be inserted into page cache on behalf of an address_space without any particular mm being involved. There will usually be a calling task, though. But this is all very memcg-specific. What about other cgroups? I/O is per-task, right? Scheduling is definitely per-task. > > While a mm_struct technically may not map to a process, it is a very > close approximation which is hardly ever broken in practice. > >> While it's certainly easier to think about assigning processes to >> cgroups, and I certainly agree that, in the common case, it's the >> right thing to do, I don't see why requiring it is a good idea. Can >> we turn this around: what actually goes wrong if cgroup v2 were to >> allow assigning individual threads if a user specifically requests it? > > Consider the scenario where you have somebody faulting on behalf of a > foreign VMA, but the thread who created and is actively using that VMA > is in a different cgroup than the process leader. Who are we going to > charge? All possible answers seem erratic. > Indeed, and this problem is probably not solvable in practice unless you charge all involved cgroups. But the caller's *mm* is entirely irrelevant here, so I don't see how this implies that cgroups need to keep tasks in the same process together. The relevant entities are the calling *task* and the target mm, and you're going to be hard-pressed to ensure that they belong to the same cgroup, so I think you need to be able to handle weird cases in which there isn't an obviously correct cgroup to charge. >> > there are other reasons to enforce process granularity. One >> > important one is isolating system-level management operations from >> > in-process application operations. 
The cgroup interface, being a >> > virtual filesystem, is very unfit for multiple independent >> > operations taking place at the same time as most operations have to >> > be multi-step and there is no way to synchronize multiple accessors. >> > See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity" >> >> I don't buy this argument at all. System-level code is likely to >> assign single process *trees*, which are a different beast entirely. >> I.e. you fork, move the child into a cgroup, and that child and its >> children stay in that cgroup. I don't see how the thread/process >> distinction matters. > > Good point on the multi-process issue, this is something which nagged > me a bit while working on rgroup, although I have to point out that > the issue here is one of not going far enough rather than the approach > being wrong. There are limitations to scoping it to individual > processes but that doesn't negate the underlying problem or the > usefulness of in-process control. > > For system-level and process-level operations to not step on each >
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Andy. On Wed, Aug 17, 2016 at 01:18:24PM -0700, Andy Lutomirski wrote: > > 2-1-1. Process Granularity > > > > For memory, because an address space is shared between all threads > > of a process, the terminal consumer is a process, not a thread. > > Separating the threads of a single process into different memory > > control domains doesn't make semantical sense. cgroup v2 ensures > > that all controllers can agree on the same organization by requiring > > that threads of the same process belong to the same cgroup. > > I haven't followed all of the history here, but it seems to me that > this argument is less accurate than it appears. Linux, for better or > for worse, has somewhat orthogonal concepts of thread groups > (processes), mms, and file tables. An mm has VMAs in it, and VMAs can > reference things (files, etc) that hold resources. (Two mms can share > resources by mapping the same thing or using fork().) File tables > hold files, and files can use resources. Both of these are, at best, > moderately good approximations of what actually holds resources. > Meanwhile, threads (tasks) do syscalls, take page faults, *allocate* > resources, etc. > > So I think it's not really true to say that the "terminal consumer" of > anything is a process, not a thread. The terminal consumer is actually the mm context. A task may be the allocating entity but not always for itself. This becomes clear whenever an entity is allocating memory on behalf of someone else - get_user_pages(), khugepaged, swapoff and so on (and likely userfaultfd too). When a task is trying to add a page to a VMA, the task might not have any relationship with the VMA other than that it's operating on it for someone else. The page has to be charged to whoever is responsible for the VMA and the only ownership which can be established is the containing mm_struct. While a mm_struct technically may not map to a process, it is a very close approximation which is hardly ever broken in practice. 
> While it's certainly easier to think about assigning processes to > cgroups, and I certainly agree that, in the common case, it's the > right thing to do, I don't see why requiring it is a good idea. Can > we turn this around: what actually goes wrong if cgroup v2 were to > allow assigning individual threads if a user specifically requests it? Consider the scenario where you have somebody faulting on behalf of a foreign VMA, but the thread who created and is actively using that VMA is in a different cgroup than the process leader. Who are we going to charge? All possible answers seem erratic. Please note that I agree that thread granularity can be useful for some resources; however, my points are 1. it should be scoped so that the resource distribution tree as a whole can be shared across different resources, and, 2. cgroup filesystem interface isn't a good interface for the purpose. I'll continue the second point below. > > there are other reasons to enforce process granularity. One > > important one is isolating system-level management operations from > > in-process application operations. The cgroup interface, being a > > virtual filesystem, is very unfit for multiple independent > > operations taking place at the same time as most operations have to > > be multi-step and there is no way to synchronize multiple accessors. > > See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity" > > I don't buy this argument at all. System-level code is likely to > assign single process *trees*, which are a different beast entirely. > I.e. you fork, move the child into a cgroup, and that child and its > children stay in that cgroup. I don't see how the thread/process > distinction matters. Good point on the multi-process issue, this is something which nagged me a bit while working on rgroup, although I have to point out that the issue here is one of not going far enough rather than the approach being wrong. 
There are limitations to scoping it to individual processes but that doesn't negate the underlying problem or the usefulness of in-process control. For system-level and process-level operations to not step on each other's toes, they need to agree on the granularity boundary - system-level should be able to treat an application hierarchy as a single unit. A possible solution is allowing rgroup hierarchies to span across process boundaries and implementing cgroup migration operations which treat such hierarchies as a single unit. I'm not yet sure whether the boundary should be at program groups or rgroups. > On the contrary: with cgroup namespaces, one could easily create a > cgroup namespace, shove a process in it, and let that process delegate > its threads to child cgroups however it likes. (Well, children of the > namespace root.) cgroup namespace solves just one piece of the whole problem and not in a very robust way. It's okay for containers but not so for individual applications.
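The namespace mechanism Andy refers to can be exercised from the shell via util-linux unshare(1). A rough sketch; it needs root and an unshare built with cgroup namespace support, and degrades to a message otherwise:

```shell
# Inside a new cgroup namespace the creating cgroup becomes the namespace
# root, so the process sees itself at "/" (e.g. "0::/" on a v2 hierarchy)
# regardless of its real position, and can delegate children under it.
if [ "$(id -u)" -eq 0 ] && unshare --cgroup true 2>/dev/null; then
    msg=$(unshare --cgroup cat /proc/self/cgroup)
else
    msg="cannot create a cgroup namespace here"
fi
echo "$msg"
```

This is the containment Tejun concedes works for containers: the namespaced process only sees its own subtree. His objection is that an ordinary application still manipulates that subtree through the shared vfs interface, with no way to synchronize against an outside manager migrating it.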
Re: [Documentation] State of CPU controller in cgroup v2
On Aug 5, 2016 7:07 PM, "Tejun Heo" wrote: > > Hello, > > There have been several discussions around CPU controller support. > Unfortunately, no consensus was reached and cgroup v2 is sorely > lacking CPU controller support. This document includes a summary of the > situation and arguments along with an interim solution for parties who > want to use the out-of-tree patches for CPU controller cgroup v2 > support. I'll post the two patches as replies for reference. > > Thanks. > > > CPU Controller on Control Group v2 > > August, 2016  Tejun Heo > > > While most controllers have support for cgroup v2 now, the CPU > controller support is not upstream yet due to objections from the > scheduler maintainers on the basic designs of cgroup v2. This > document explains the current situation as well as an interim > solution, and details the disagreements and arguments. The latest > version of this document can be found at the following URL. > > > https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu > > > CONTENTS > > 1. Current Situation and Interim Solution > 2. Disagreements and Arguments > 2-1. Contentious Restrictions > 2-1-1. Process Granularity > 2-1-2. No Internal Process Constraint > 2-2. Impact on CPU Controller > 2-2-1. Impact of Process Granularity > 2-2-2. Impact of No Internal Process Constraint > 2-3. Arguments for cgroup v2 > 3. Way Forward > 4. References > > > 1. Current Situation and Interim Solution > > All objections from the scheduler maintainers apply to cgroup v2 core > design, and there are no known objections to the specifics of the CPU > controller cgroup v2 interface. 
The only blocked part is changes to > expose the CPU controller interface on cgroup v2, which comprises the > following two patches: > > [1] sched: Misc preps for cgroup unified hierarchy interface > [2] sched: Implement interface for cgroup unified hierarchy > > The necessary changes are superficial and implement the interface > files on cgroup v2. The combined diffstat is as follows. > > kernel/sched/core.c | 149 +++-- > kernel/sched/cpuacct.c | 57 -- > kernel/sched/cpuacct.h | 5 + > 3 files changed, 189 insertions(+), 22 deletions(-) > > The patches are easy to apply and forward-port. The following git > branch will always carry the two patches on top of the latest release > of the upstream kernel. > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu > > There also are versioned branches going back to v4.4. > > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu-$KERNEL_VER > > While it's difficult to tell whether the CPU controller support will > be merged, there are crucial resource control features in cgroup v2 > that are only possible due to the design choices that are being > objected to, and every effort will be made to ease enabling the CPU > controller cgroup v2 support out-of-tree for parties which choose to. > > > 2. Disagreements and Arguments > > There have been several lengthy discussion threads [3][4] on LKML > around the structural constraints of cgroup v2. The two that affect > the CPU controller are process granularity and no internal process > constraint. Both arise primarily from the need for common resource > domain definition across different resources. > > The common resource domain is a powerful concept in cgroup v2 that > allows controllers to make basic assumptions about the structural > organization of processes and controllers inside the cgroup hierarchy, > and thus solve problems spanning multiple types of resources. 
The > prime example for this is page cache writeback: dirty page cache is > regulated through throttling buffered writers based on memory > availability, and initiating batched write outs to the disk based on > IO capacity. Tracking and controlling writeback inside a cgroup thus > requires the direct cooperation of the memory and the IO controller. > > This easily extends to other areas, such as CPU cycles consumed while > performing memory reclaim or IO encryption. > > > 2-1. Contentious Restrictions > > For controllers of different resources to work together, they must > agree on a common organization. This uniform model across controllers > imposes two contentious restrictions on the CPU controller: process > granularity and the no-internal-process constraint. > > > 2-1-1. Process Granularity > > For memory, because an address space is shared between all threads > of a process, the terminal consumer is a process, not a thread. > Separating the threads of a single process into different memory > control domains doesn't make semantical sense. cgroup v2 ensures > that all controllers can agree on the same organization by requiring > that threads of the same process belong to the same cgroup. I haven't followed all of the history here, but
Re: [Documentation] State of CPU controller in cgroup v2
On Tue, 2016-08-16 at 12:30 -0400, Johannes Weiner wrote: > On Tue, Aug 16, 2016 at 04:07:38PM +0200, Peter Zijlstra wrote: > > Also, the argument there seems unfair at best, you don't need cpu-v2 for > > buffered write control, you only need memcg and block co-mounted. > > Yes, memcg and block agreeing is enough for that case. But I mentioned > a whole bunch of these examples, to make the broader case for a common > controller model. The core issue I have with that model is that it defines context=mm, and declares context=task to be invalid, while in reality, both views are perfectly valid, useful, and in use. That redefinition of context is demonstrably harmful when applied to scheduler-related controllers, rendering a substantial portion of to-be-managed objects completely unmanageable. You (collectively) know that full well. AFAICT, there is only one viable option, and that is to continue to allow both. Whether you like the duality or not (who would), it's deeply embedded in what's under the controllers, and won't go away. I'll now go try a little harder while you ponder (or pop) this thought bubble, see if I can set a new personal best at the art of ignoring. (CC did not help btw, your bad if you don't like bubble content) -Mike
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Peter. On Tue, Aug 16, 2016 at 04:07:38PM +0200, Peter Zijlstra wrote: > On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: > > > [ That, and a disturbing number of emotional outbursts against > > systemd, which has nothing to do with any of this. ] > > Oh, so I'm entirely dreaming this then: > > https://github.com/systemd/systemd/pull/3905 > > Completely unrelated. We use centos in the fleet and are trying to control resources in the base system, which of course requires writeback control and thus cgroup v2. I'm working to solve the use cases people are facing and systemd is a piece of the puzzle. There is no big conspiracy. As Johannes and Chris already pointed out, systemd is a user of cgroup v2, a pretty important one at this point. While I of course care about it having proper support for cgroup v2, systemd is just picking up the changes in cgroup v2. cgroup v2's design wouldn't be different without systemd. We'd just have something else playing its role in resource management. > Also, the argument there seems unfair at best, you don't need cpu-v2 for > buffered write control, you only need memcg and block co-mounted. ( Everything I'm gonna write below has already been extensively documented in the posted documentation. I'm gonna repeat the points for completeness but if we're gonna start an actual technical discussion, let's please start from the documentation instead of jumping off of a one-liner and trying to rebuild the entire argument each time. I'm not sure what exactly you meant by the above sentence; I'm assuming you're saying that there are no new capabilities gained by the cpu controller being on the v2 hierarchy and thus that the cpu controller doesn't need to be on cgroup v2. If I'm mistaken, please let me know. 
) Just co-mounting isn't enough as it still leaves the problems with anonymous consumption, different handling of threads belonging to different cgroups, and whether it's acceptable to always require blkio to use the memory controller. cgroup v2 is what we got after working through all these issues. While it is true that the cpu controller doesn't need to be on cgroup v2 for writeback control to work, it misses the point about the larger design issues identified during the writeback control work, which can be easily applied to the cpu controller - e.g. accounting cpu cycles spent for packet reception, memory reclaim, IO encryption and so on. In addition, it is an unnecessary inconvenience to require users who want writeback control to deal with the complication of mixed v1 and v2 hierarchies when their requirements can be easily served by v2, especially considering that the only blocked part is trivial changes to expose the cpu controller interface on v2 and that enabling it on v2 doesn't preclude it from being used on a v1 hierarchy if necessary. Thanks. -- tejun
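The mixed v1/v2 arrangement Tejun calls an inconvenience looks roughly like this in practice: memory and io live together on the unified hierarchy so writeback control works, while cpu stays on a separate v1 hierarchy. A sketch with illustrative mount points, guarded since mounting requires root:

```shell
# Errors are tolerated: the hierarchies may already be mounted elsewhere.
if [ "$(id -u)" -eq 0 ]; then
    mkdir -p /mnt/cgroup2 /mnt/cgroup-cpu
    # v2 unified hierarchy: memory + io co-located for writeback control.
    mount -t cgroup2 none /mnt/cgroup2 2>/dev/null || true
    # cpu (and cpuacct) kept on their own v1 hierarchy.
    mount -t cgroup -o cpu,cpuacct cgroup /mnt/cgroup-cpu 2>/dev/null || true
fi
# Either way, report which cgroup filesystem types are currently visible.
status=$(awk '$3 ~ /^cgroup2?$/ {print $3}' /proc/self/mounts 2>/dev/null | sort -u)
echo "mounted cgroup filesystems: ${status:-none}"
```

A v1 controller is only available to such a mount if it is not already bound to the v2 hierarchy, which is exactly why the out-of-tree cpu patches leave both options open.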
Re: [Documentation] State of CPU controller in cgroup v2
On Tue, Aug 16, 2016 at 04:07:38PM +0200, Peter Zijlstra wrote: > On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: > > > [ That, and a disturbing number of emotional outbursts against > > systemd, which has nothing to do with any of this. ] > > Oh, so I'm entirely dreaming this then: > > https://github.com/systemd/systemd/pull/3905 > > Completely unrelated. Yes and no. We certainly do use systemd (kind of hard not to at this point if you're using any major distribution), and we do feed back the changes we make to it upstream. But this is updating systemd to work with the resource control design choices we made in the kernel, not the other way round. As I wrote to Mike before, we have been running into these resource control issues way before systemd, when we used a combination of libcgroup and custom hacks to coordinate the jobs on the system. The cgroup2 design choices fell out of experiences with those setups. Neither the problem statement nor the proposed solutions depend on systemd, which is why I had hoped we could focus these cgroup2 debates around the broader resource control issues we are trying to address, rather than get hung up on one contentious user of the interface. > Also, the argument there seems unfair at best, you don't need cpu-v2 for > buffered write control, you only need memcg and block co-mounted. Yes, memcg and block agreeing is enough for that case. But I mentioned a whole bunch of these examples, to make the broader case for a common controller model.
Re: [Documentation] State of CPU controller in cgroup v2
On 08/16/2016 10:07 AM, Peter Zijlstra wrote: On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: [ That, and a disturbing number of emotional outbursts against systemd, which has nothing to do with any of this. ] Oh, so I'm entirely dreaming this then: https://github.com/systemd/systemd/pull/3905 Completely unrelated. Also, the argument there seems unfair at best, you don't need cpu-v2 for buffered write control, you only need memcg and block co-mounted. This isn't systemd dictating cgroups2 or systemd trying to get rid of v1. But systemd is a common user of cgroups, and we do use it here in production. We're just sending patches upstream for the tools we're using. It's better than keeping them private, or reinventing a completely different tool that does almost the same thing. -chris
Re: [Documentation] State of CPU controller in cgroup v2
On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: > [ That, and a disturbing number of emotional outbursts against > systemd, which has nothing to do with any of this. ] Oh, so I'm entirely dreaming this then: https://github.com/systemd/systemd/pull/3905 Completely unrelated. Also, the argument there seems unfair at best, you don't need cpu-v2 for buffered write control, you only need memcg and block co-mounted.
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, 2016-08-12 at 18:17 -0400, Johannes Weiner wrote: > > > This argument that cgroup2 is not backward compatible is laughable. > > > > Fine, you're entitled to your sense of humor. I have one too, I find it > > laughable that threaded applications can only sit there like a lump of > > mud simply because they share more than applications written as a > > gaggle of tasks. "Threads are like.. so yesterday, the future belongs > > to the process" tickles my funny-bone. Whatever, to each his own. > > Who are you quoting here? This is such a grotesque misrepresentation > of what we have been saying and implementing, it's not even funny. Agreed, it's not funny to me either. Excluding threaded applications from doing.. anything.. implies to me that either someone thinks same do not need resource management facilities due to some magical property of threading itself, or someone doesn't realize that an application thread is a task, i.e. one and the same thing, which can be doing one and the same job. No matter how I turn it, what I see is nonsense. > https://yourlogicalfallacyis.com/black-or-white > https://yourlogicalfallacyis.com/strawman > https://yourlogicalfallacyis.com/appeal-to-emotion Nope, plain ole sarcasm, an expression of shock and awe. > It's great that cgroup1 works for some of your customers, and they are > free to keep using it. If no third party can flush my customers' investment down the toilet, I can cease to care. Please don't CC me in future, you're unlikely to convince me that v2 is remotely sane, nor do you need to. Lucky you. -Mike
Re: [Documentation] State of CPU controller in cgroup v2
On Thu, Aug 11, 2016 at 08:25:06AM +0200, Mike Galbraith wrote: > On Wed, 2016-08-10 at 18:09 -0400, Johannes Weiner wrote: > > The complete lack of cohesiveness between v1 controllers prevents us > > from implementing even the most fundamental resource control that > > cloud fleets like Google's and Facebook's are facing, such as > > controlling buffered IO; attributing CPU cycles spent receiving > > packets, reclaiming memory in kswapd, encrypting the disk; attributing > > swap IO etc. That's why cgroup2 runs a tighter ship when it comes to > > the controllers: to make something much bigger work. > > Where is the gun wielding thug forcing people to place tasks where v2 > now explicitly forbids them? The problems with supporting this are well-documented. Please see R-2 in Documentation/cgroup-v2.txt. > > Agreeing on something - in this case a common controller model - is > > necessarily going to take away some flexibility from how you approach > > a problem. What matters is whether the problem can still be solved. > > What annoys me about this more than the seemingly gratuitous breakage > is that the decision is passed to third parties who have nothing to > lose, and have done quite a bit of breaking lately. Mike, there is no connection between what you are quoting and what you are replying to here. We cannot have a technical discussion when you enter it with your mind fully made up, repeat the same inflammatory talking points over and over - some of them trivially false, some a gross misrepresentation of what we have been trying to do - and are completely unwilling to even entertain the idea that there might be problems outside of the one-controller-scope you are looking at. But to address your point: there is no 'breakage' here. Or in your words: there is no gun wielding thug forcing people to upgrade to v2. If v1 does everything your specific setup needs, nobody forces you to upgrade. 
We are fairly confident that the majority of users *will* upgrade,
simply because v2 solves so many basic resource control problems that
v1 is inherently incapable of solving. There is a positive incentive,
but we are trying not to create negative ones.

And even if you run a systemd distribution, and systemd switches to
v2, it's trivially easy to pry the CPU controller from its hands and
maintain your setup exactly as-is using the current CPU controller.

This is really not a technical argument.

> > This argument that cgroup2 is not backward compatible is laughable.
>
> Fine, you're entitled to your sense of humor. I have one too, I find it
> laughable that threaded applications can only sit there like a lump of
> mud simply because they share more than applications written as a
> gaggle of tasks. "Threads are like.. so yesterday, the future belongs
> to the process" tickles my funny-bone. Whatever, to each his own.

Who are you quoting here? This is such a grotesque misrepresentation
of what we have been saying and implementing, it's not even funny.

In reality, the rgroup extension for setpriority() was directly based
on your and PeterZ's feedback regarding thread control. Except that,
unlike cgroup1's approach to threads, which might work in some setups
but suffers immensely from the global nature of the vfs interface once
you have to cooperate with other applications and system management*,
rgroup was proposed as a much more generic and robust interface to do
hierarchical resource control from inside the application.

* This doesn't have to be systemd, btw. We have used cgroups to
isolate system services, maintenance jobs, cron jobs etc. from our
applications way before systemd, and it's been a pita to coordinate
the system managing applications and the applications managing their
workers using the same globally scoped vfs interface.
> > > I mentioned a real world case of a thread pool servicing customer
> > > accounts by doing something quite sane: hop into an account (cgroup),
> > > do work therein, send bean count off to the $$ department, wash, rinse
> > > repeat. That's real world users making real world cash registers go
> > > ka-ching so real world people can pay their real world bills.
> >
> > Sure, but you're implying that this is the only way to run this real
> > world cash register.
>
> I implied no such thing. Of course it can be done differently, all
> they have to do is rip out these archaic thread thingies.
>
> Apologies for dripping sarcasm all over your monitor, but this annoys
> me far more than it should any casual user of cgroups. Perhaps I
> shouldn't care about the users (suse customers) who will step in this
> eventually, but I do.

https://yourlogicalfallacyis.com/black-or-white
https://yourlogicalfallacyis.com/strawman
https://yourlogicalfallacyis.com/appeal-to-emotion

Can you please try to stay objective?

> > > As with the thread pool, process granularity makes it impossible for
> > > any threaded application affinity to be managed via cpusets, such as
> > > say stuffing realtime critical threads into a shielded cpuset, mundane
> > > threads into another.
Re: [Documentation] State of CPU controller in cgroup v2
On Wed, 2016-08-10 at 18:09 -0400, Johannes Weiner wrote:
> The complete lack of cohesiveness between v1 controllers prevents us
> from implementing even the most fundamental resource control that
> cloud fleets like Google's and Facebook's are facing, such as
> controlling buffered IO; attributing CPU cycles spent receiving
> packets, reclaiming memory in kswapd, encrypting the disk; attributing
> swap IO etc. That's why cgroup2 runs a tighter ship when it comes to
> the controllers: to make something much bigger work.

Where is the gun wielding thug forcing people to place tasks where v2
now explicitly forbids them?

> Agreeing on something - in this case a common controller model - is
> necessarily going to take away some flexibility from how you approach
> a problem. What matters is whether the problem can still be solved.

What annoys me about this more than the seemingly gratuitous breakage
is that the decision is passed to third parties who have nothing to
lose, and have done quite a bit of breaking lately.

> This argument that cgroup2 is not backward compatible is laughable.

Fine, you're entitled to your sense of humor. I have one too, I find it
laughable that threaded applications can only sit there like a lump of
mud simply because they share more than applications written as a
gaggle of tasks. "Threads are like.. so yesterday, the future belongs
to the process" tickles my funny-bone. Whatever, to each his own.

...

> Lastly, again - and this was the whole point of this document - the
> changes in cgroup2 are not gratuitous. They are driven by fundamental
> resource control problems faced by more comprehensive applications of
> cgroup. On the other hand, the opposition here mainly seems to be the
> inconvenience of switching some specialized setups from a v1-oriented
> way of solving a problem to a v2-oriented way.
>
> [ That, and a disturbing number of emotional outbursts against
>   systemd, which has nothing to do with any of this.
>   ]
>
> It's a really myopic line of argument.

And I think the myopia is on the other side of my monitor, whatever.

> That being said, let's go through your points:
>
> > Priority and affinity are not process wide attributes, never have
> > been, but you're insisting that they must become so for the sake of
> > progress.
>
> Not really.
>
> It's just questionable whether the cgroup interface is the best way to
> manipulate these attributes, or whether existing interfaces like
> setpriority() and sched_setaffinity() should be extended to manipulate
> groups, like the rgroup proposal does. The problems of using the
> cgroup interface for this are extensively documented, including in the
> email you were replying to.
>
> > I mentioned a real world case of a thread pool servicing customer
> > accounts by doing something quite sane: hop into an account (cgroup),
> > do work therein, send bean count off to the $$ department, wash, rinse
> > repeat. That's real world users making real world cash registers go
> > ka-ching so real world people can pay their real world bills.
>
> Sure, but you're implying that this is the only way to run this real
> world cash register.

I implied no such thing. Of course it can be done differently, all
they have to do is rip out these archaic thread thingies.

Apologies for dripping sarcasm all over your monitor, but this annoys
me far more than it should any casual user of cgroups. Perhaps I
shouldn't care about the users (suse customers) who will step in this
eventually, but I do.

> I'm not going down the rabbit hole again of arguing against an
> incomplete case description. Scale matters. Number of workers
> matter. Amount of work each thread does matters to evaluate
> transaction overhead. Task migration is an expensive operation etc.
>
> > I also mentioned breakage to cpusets: given exclusive set A and
> > exclusive subset B therein, there is one and only one spot where
> > affinity A exists... at the to be forbidden junction of A and B.
> Again, a means to an end rather than a goal

I don't believe I described a means to an end, I believe I described
affinity bits going missing.

> - and a particularly
> suspicious one at that: why would a cgroup need to tell its *siblings*
> which cpus/nodes it cannot use? In the hierarchical model, it's
> clearly the task of the ancestor to allocate the resources downward.
>
> More details would be needed to properly discuss what we are trying to
> accomplish here.
>
> > As with the thread pool, process granularity makes it impossible for
> > any threaded application affinity to be managed via cpusets, such as
> > say stuffing realtime critical threads into a shielded cpuset, mundane
> > threads into another. There are any number of affinity usages that
> > will break.
>
> Ditto. It's not obvious why this needs to be the cgroup interface and
> couldn't instead be solved with extending sched_setaffinity() - again
> weighing that against the power of the common controller model.
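[Editorial note: the per-thread nature of affinity that this sub-thread keeps returning to can be demonstrated from userspace. A minimal Python sketch (Linux-only; the CPU choice and names are illustrative, not from the thread): sched_setaffinity() addresses a kernel tid, so pinning one thread leaves its siblings' masks untouched.]

```python
import os
import threading

# Pick a CPU this process is actually allowed on, so the sketch also
# works inside a restricted cpuset or container.
cpu = min(os.sched_getaffinity(0))
before = os.sched_getaffinity(0)
result = []

def worker():
    # sched_setaffinity() takes a kernel tid: this pins only the
    # calling thread, not the whole process.
    tid = threading.get_native_id()
    os.sched_setaffinity(tid, {cpu})
    result.append(os.sched_getaffinity(tid))

t = threading.Thread(target=worker)
t.start()
t.join()

print(result[0] == {cpu})                  # worker thread is pinned
print(os.sched_getaffinity(0) == before)   # main thread's mask is untouched
```

This is exactly the property Mike argues cpusets' process-granularity model throws away: the syscall interface distinguishes threads, the v2 cgroup interface does not.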
Re: [Documentation] State of CPU controller in cgroup v2
On Sat, Aug 06, 2016 at 11:04:51AM +0200, Mike Galbraith wrote:
> On Fri, 2016-08-05 at 13:07 -0400, Tejun Heo wrote:
> > It is true that the trees are semantically different from each other
> > and the symmetric handling of tasks and cgroups is aesthetically
> > pleasing. However, it isn't clear what the practical usefulness of
> > a layout with direct competition between tasks and cgroups would be,
> > considering that number and behavior of tasks are controlled by each
> > application, and cgroups primarily deal with system level resource
> > distribution; changes in the number of active threads would directly
> > impact resource distribution. Real world use cases of such layouts
> > could not be established during the discussions.
>
> You apparently intend to ignore any real world usages that don't work
> with these new constraints.

He didn't ignore these use cases. He offered alternatives like rgroup
to allow manipulating threads from within the application, only in a
way that does not interfere with cgroup2's common controller model.

The complete lack of cohesiveness between v1 controllers prevents us
from implementing even the most fundamental resource control that
cloud fleets like Google's and Facebook's are facing, such as
controlling buffered IO; attributing CPU cycles spent receiving
packets, reclaiming memory in kswapd, encrypting the disk; attributing
swap IO etc. That's why cgroup2 runs a tighter ship when it comes to
the controllers: to make something much bigger work.

Agreeing on something - in this case a common controller model - is
necessarily going to take away some flexibility from how you approach
a problem. What matters is whether the problem can still be solved.

This argument that cgroup2 is not backward compatible is laughable.
Of course it's going to be different, otherwise we wouldn't have had
to version it.
The question is not whether the exact same configurations and existing
application design can be used in v1 and v2 - that's a strange onus to
put on a versioned interface. The question is whether you can
translate a solution from v1 to v2. Yeah, it might be a hassle
depending on how specialized your setup is, but that's why we keep v1
around until the last user dies and allow you to freely mix and match
v1 and v2 controllers within a single system to ease the transition.

But this distinction between approach and application design, and the
application's actual purpose is crucial. Every time this discussion
came up, somebody said 'moving worker threads between different
resource domains'. That's not a goal, though, that's a very specific
means to an end, with no explanation of why it has to be done that
way. When comparing the cgroup v1 and v2 interface, we should be
discussing goals, not 'this is my favorite way to do it'. If you have
an actual real-world goal that can be accomplished in v1 but not in
v2 + rgroup, then that's what we should be talking about.

Lastly, again - and this was the whole point of this document - the
changes in cgroup2 are not gratuitous. They are driven by fundamental
resource control problems faced by more comprehensive applications of
cgroup. On the other hand, the opposition here mainly seems to be the
inconvenience of switching some specialized setups from a v1-oriented
way of solving a problem to a v2-oriented way.

[ That, and a disturbing number of emotional outbursts against
  systemd, which has nothing to do with any of this. ]

It's a really myopic line of argument.

That being said, let's go through your points:

> Priority and affinity are not process wide attributes, never have
> been, but you're insisting that they must become so for the sake of
> progress.

Not really.
It's just questionable whether the cgroup interface is the best way to
manipulate these attributes, or whether existing interfaces like
setpriority() and sched_setaffinity() should be extended to manipulate
groups, like the rgroup proposal does. The problems of using the
cgroup interface for this are extensively documented, including in the
email you were replying to.

> I mentioned a real world case of a thread pool servicing customer
> accounts by doing something quite sane: hop into an account (cgroup),
> do work therein, send bean count off to the $$ department, wash, rinse
> repeat. That's real world users making real world cash registers go
> ka-ching so real world people can pay their real world bills.

Sure, but you're implying that this is the only way to run this real
world cash register. I think it's entirely justified to re-evaluate
this, given the myriad of much more fundamental problems that cgroup2
is solving by building on a common controller model.

I'm not going down the rabbit hole again of arguing against an
incomplete case description. Scale matters. Number of workers
matter. Amount of work each thread does matters to evaluate
transaction overhead. Task migration is an expensive operation etc.

> I also mentioned breakage to cpusets: given exclusive set A and
> exclusive subset B therein, there is one and only one spot where
> affinity A exists... at the to be forbidden junction of A and B.
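[Editorial note: the "mix and match v1 and v2 controllers" mentioned above is a mount-time arrangement: a controller can be bound to only one hierarchy at a time, so whatever a v1 mount claims simply stays out of the unified hierarchy. A config sketch, with illustrative paths, requiring root and controllers not already in use elsewhere:]

```shell
# Mount the unified (v2) hierarchy; only controllers that are not
# claimed by a v1 mount appear in its cgroup.controllers file.
mount -t cgroup2 none /sys/fs/cgroup/unified

# Keep the v1 cpu controller for an existing setup.
mkdir -p /sys/fs/cgroup/cpu
mount -t cgroup -o cpu,cpuacct cgroup /sys/fs/cgroup/cpu

# cpu should now be absent here, while memory, io, etc. remain.
cat /sys/fs/cgroup/unified/cgroup.controllers
```

This is the mechanism Johannes refers to for prying the CPU controller out of a v2 setup while migrating everything else.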
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, 2016-08-05 at 13:07 -0400, Tejun Heo wrote:

> 2-2. Impact on CPU Controller
>
> As indicated earlier, the CPU controller's resource distribution graph
> is the simplest. Every schedulable resource consumption can be
> attributed to a specific task. In addition, for weight based control,
> the per-task priority set through setpriority(2) can be translated to
> and from a per-cgroup weight. As such, the CPU controller can treat a
> task and a cgroup symmetrically, allowing support for any tree layout
> of cgroups and tasks. Both process granularity and the no internal
> process constraint restrict how the CPU controller can be used.

Not only the cpu controller, but also cpuacct and cpuset.

> 2-2-1. Impact of Process Granularity
>
> Process granularity prevents tasks belonging to the same process to
> be assigned to different cgroups. It was pointed out [6] that this
> excludes the valid use case of hierarchical CPU distribution within
> processes.

Does that not obsolete the rather useful/common concept "thread pool"?

> 2-2-2. Impact of No Internal Process Constraint
>
> The no internal process constraint disallows tasks from competing
> directly against cgroups. Here is an excerpt from Peter Zijlstra
> pointing out the issue [10] - R, L and A are cgroups; t1, t2, t3 and
> t4 are tasks:
>
>        R
>      / | \
>    t1  t2  A
>           / \
>         t3   t4
>
>   Is fundamentally different from:
>
>         R
>       /   \
>      L     A
>     / \   / \
>   t1  t2 t3  t4
>
>   Because if in the first hierarchy you add a task (t5) to R, all of
>   its A will run at 1/4th of total bandwidth where before it had
>   1/3rd, whereas with the second example, if you add our t5 to L, A
>   doesn't get any less bandwidth.
>
> It is true that the trees are semantically different from each other
> and the symmetric handling of tasks and cgroups is aesthetically
> pleasing.
> However, it isn't clear what the practical usefulness of
> a layout with direct competition between tasks and cgroups would be,
> considering that number and behavior of tasks are controlled by each
> application, and cgroups primarily deal with system level resource
> distribution; changes in the number of active threads would directly
> impact resource distribution. Real world use cases of such layouts
> could not be established during the discussions.

You apparently intend to ignore any real world usages that don't work
with these new constraints.

Priority and affinity are not process wide attributes, never have
been, but you're insisting that they must become so for the sake of
progress.

I mentioned a real world case of a thread pool servicing customer
accounts by doing something quite sane: hop into an account (cgroup),
do work therein, send bean count off to the $$ department, wash, rinse
repeat. That's real world users making real world cash registers go
ka-ching so real world people can pay their real world bills.

I also mentioned breakage to cpusets: given exclusive set A and
exclusive subset B therein, there is one and only one spot where
affinity A exists... at the to be forbidden junction of A and B.

As with the thread pool, process granularity makes it impossible for
any threaded application affinity to be managed via cpusets, such as
say stuffing realtime critical threads into a shielded cpuset, mundane
threads into another. There are any number of affinity usages that
will break.

Try as I may, I can't see anything progressive about enforcing process
granularity of per thread attributes. I do see regression potential
for users of these controllers, and no viable means to even report
them as being such. It will likely be systemd flipping the V2 on
switch, not the kernel, not the user. Regression reports would thus
presumably be deflected to... those who want this. Sweet.

	-Mike
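[Editorial note: the bandwidth arithmetic in Peter's excerpt above can be checked mechanically. A toy model of hierarchical fair-share splitting, assuming equal weights everywhere; purely illustrative, nothing like the actual CFS implementation:]

```python
# Each node splits its bandwidth evenly among its children, so tasks
# and cgroups sitting at the same level compete directly, as in the
# first layout of Peter's example.

def shares(tree, total=1.0):
    """tree maps child name -> subtree dict (cgroup) or None (task)."""
    out = {}
    part = total / len(tree)
    for name, sub in tree.items():
        out[name] = part
        if sub is not None:
            out.update(shares(sub, part))
    return out

# First layout: t1, t2 and cgroup A compete directly under R.
r1 = {"t1": None, "t2": None, "A": {"t3": None, "t4": None}}
print(shares(r1)["A"])    # A gets 1/3 of total bandwidth

r1["t5"] = None           # add t5 directly to R ...
print(shares(r1)["A"])    # ... and A drops to 1/4

# Second layout: tasks live only in the leaf cgroups L and A.
r2 = {"L": {"t1": None, "t2": None}, "A": {"t3": None, "t4": None}}
r2["L"]["t5"] = None      # adding t5 to L ...
print(shares(r2)["A"])    # ... leaves A at 1/2
```

Under this model the two trees really are "fundamentally different": in the first, every task added to R dilutes A; in the second, membership changes inside L never affect A's share.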