Re: [systemd-devel] [HEADSUP] cgroup changes
On Tue, 25.06.13 08:31, Brian Bockelman (bbock...@cse.unl.edu) wrote:

> On Jun 25, 2013, at 4:56 AM, Lennart Poettering <lenn...@poettering.net> wrote:
>
>> On Tue, 25.06.13 02:21, Brian Bockelman (bbock...@cse.unl.edu) wrote:
>>
>>> A few questions came to mind which may provide interesting input to
>>> your design process:
>>>
>>> 1) I use cgroups heavily for resource accounting. Do you envision me
>>> querying via dbus for each accounting attribute? Or do you envision me
>>> querying for the cgroup name, then accessing the controller statistics
>>> directly?
>>
>> Good question. Tejun wants systemd to cover that too. I am not entirely
>> sure. I don't like the extra roundtrip for measuring the accounting
>> bits. But maybe we can add a library that avoids the roundtrip, and
>> simply provides you with high-level accounting values for cgroups. That
>> way, for *changing* things you'd need to go via the bus; for *reading*
>> things we'd give you a library that goes directly to the cgroupfs and
>> avoids the roundtrip.
>
> I like this idea. Hopefully single-writer, multiple-reader is a more
> sustainable path forward.
>
> What about the notification APIs? We currently use memory.oom_control
> to get a notification when a job hits its limits (this allows us to
> know the job died due to memory issues, as the user code itself
> typically just SIGSEGVs). Is subscribing to notifications considered
> reading or writing in this case?

That sounds like another case for the library, i.e. it is considered
more like reading. That said, I think the current notification
infrastructure for cgroup attributes is really, really awful, so I am
not too keen to support that right away.

Lennart

--
Lennart Poettering - Red Hat, Inc.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
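For context on the notification mechanism under discussion: with the cgroup v1 memory controller, a watcher registers an eventfd against memory.oom_control by writing "<eventfd> <control-fd>" to cgroup.event_control. A minimal Python sketch of that registration follows (the cgroup path is hypothetical, `os.eventfd` requires Python 3.10+ on Linux, and this is an illustration of the v1 interface Brian describes, not systemd code):

```python
import os

def oom_registration_line(event_fd: int, control_fd: int) -> str:
    # cgroup.event_control expects "<eventfd> <fd of the control file>".
    return f"{event_fd} {control_fd}"

def watch_oom(cgroup_dir: str) -> None:
    """Block until the memory cgroup at cgroup_dir reports an OOM event."""
    efd = os.eventfd(0)  # Python 3.10+, Linux only
    ofd = os.open(os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY)
    with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as f:
        f.write(oom_registration_line(efd, ofd))
    os.read(efd, 8)  # kernel bumps the eventfd counter on OOM
    print("job hit its memory limit")
```

A batch system would run `watch_oom()` per job cgroup to distinguish "killed by the memory limit" from an ordinary crash.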
Re: [systemd-devel] [HEADSUP] cgroup changes
On Jun 25, 2013, at 4:56 AM, Lennart Poettering <lenn...@poettering.net> wrote:

> On Tue, 25.06.13 02:21, Brian Bockelman (bbock...@cse.unl.edu) wrote:
>
>> A few questions came to mind which may provide interesting input to
>> your design process:
>>
>> 1) I use cgroups heavily for resource accounting. Do you envision me
>> querying via dbus for each accounting attribute? Or do you envision me
>> querying for the cgroup name, then accessing the controller statistics
>> directly?
>
> Good question. Tejun wants systemd to cover that too. I am not entirely
> sure. I don't like the extra roundtrip for measuring the accounting
> bits. But maybe we can add a library that avoids the roundtrip, and
> simply provides you with high-level accounting values for cgroups. That
> way, for *changing* things you'd need to go via the bus; for *reading*
> things we'd give you a library that goes directly to the cgroupfs and
> avoids the roundtrip.

I like this idea. Hopefully single-writer, multiple-reader is a more
sustainable path forward.

What about the notification APIs? We currently use memory.oom_control
to get a notification when a job hits its limits (this allows us to know
the job died due to memory issues, as the user code itself typically
just SIGSEGVs). Is subscribing to notifications considered reading or
writing in this case?

>> 2) I currently fork and set up the resource environment (namespaces,
>> environment, working directory, etc). Can an appropriately privileged
>> process create a sub-slice, place itself in it, and then drop privs /
>> exec?
>
> We'll probably have a way for you to take an existing set of processes
> and turn them dynamically into a new unit in systemd. These units would
> be mostly like service units, except that systemd wouldn't start the
> processes; they would be foreign-created. We are not sure about the
> name for this yet (i.e. whether to cover it under the .service suffix),
> but we'll probably call them "scopes" instead, with the suffix .scope.
>
> The scope units could then be manipulated at runtime for (cgroup-based)
> resource management the way normal services are too. So basically, a
> service unit could be assigned to a slice unit, and could then create
> scope units which detach subprocesses from the original service unit
> and get their own cgroup in the same slice or any other.

This sounds manageable.

>> 5) Will I be able to delegate management of a subslice to a
>> non-privileged user?
>
> Unlikely, at least for the beginning.

(Very) long-term, this is attractive for us. We prefer the batch system
to run unprivileged when possible (and to sacrifice the minimal amount
of functionality to do so!).

>> I'm excited to see new ideas (again, having system tools be aware of
>> the batch system activity is intriguing [2]), but am a bit worried
>> about losing functionality and the cost of porting things to the new
>> era!
>
> There's certainly going to be some lost flexibility. But of course
> we'll try to cover all interesting usecases.

I'll try to lurk and provide guidance about how us nutty batch system
folks may try to use it.

>> [2] Hopefully something that works better than
>> "ps xawf -eo pid,user,cgroup,args", which currently segfaults for me :(
>
> Hmm, could you file a bug, please?

Couldn't figure out a patch -- too little time. However, I at least
tracked down the offending code. Bug report is here:
https://bugzilla.redhat.com/show_bug.cgi?id=977854

Thanks,

Brian
Re: [systemd-devel] [HEADSUP] cgroup changes
On Mon, 24.06.13 17:09, Andy Lutomirski (l...@amacapital.net) wrote:

> On Mon, Jun 24, 2013 at 4:57 PM, Lennart Poettering
> <lenn...@poettering.net> wrote:
>
>> On Mon, 24.06.13 16:01, Andy Lutomirski (l...@amacapital.net) wrote:
>>
>>> AFAICT the main reason that systemd uses cgroups is to efficiently
>>> track which service various processes came from and to send signals,
>>> and it seems like that use case could be handled without cgroups at
>>> all by creative use of subreapers and a syscall to broadcast a signal
>>> to everything that has a given subreaper as an ancestor. In that
>>> case, systemd could be asked to stay away from cgroups even in the
>>> single-hierarchy case.
>>
>> systemd uses cgroups to manage services. Managing services means many
>> things. Among them: keeping track of processes, listing processes of a
>> service, killing processes of a service, doing per-service logging
>> (which means reliably, immediately, and race-freely tracing back
>> messages to the service which logged them), about 55 other things, and
>> also resource management. I don't see how I can do any of this without
>> something like cgroups, i.e. a hierarchical, resource-management-aware
>> mechanism which allows me to securely put labels on processes.
>
> Boneheaded straw-man proposal: two new syscalls and a few spare
> processes.
>
> int sys_task_reaper(int tid): Returns the reaper for the task tid
> (which is 1 if there's no subreaper). (This could just as easily be a
> file in /proc.)
>
> int sys_killall_under_subreaper(int subreaper, int sig): Broadcasts
> sig to all tasks under subreaper (excluding subreaper). Guarantees
> that, even if those tasks are forking, they all get the signal.
>
> Then, when starting a service, systemd forks, sets the child to be a
> subreaper, then forks that child again to exec the service.
>
> Does this do everything that's needed?

No. It doesn't do anything that's needed. How do I list all PIDs in a
service with this? How do I determine the service of a PID? How do I do
resource management with this?

> sys_task_reaper is trivial to implement (that functionality is already
> there in the reparenting code), and sys_killall_under_subreaper is
> probably not so bad.
>
> This has one main downside I can think of: it wastes a decent number
> of processes (one subreaper per service).

Yeah, also the downside that it doesn't do what we need.

Lennart

--
Lennart Poettering - Red Hat, Inc.
Re: [systemd-devel] [HEADSUP] cgroup changes
On Tue, 25.06.13 02:21, Brian Bockelman (bbock...@cse.unl.edu) wrote:

> A few questions came to mind which may provide interesting input to
> your design process:
>
> 1) I use cgroups heavily for resource accounting. Do you envision me
> querying via dbus for each accounting attribute? Or do you envision me
> querying for the cgroup name, then accessing the controller statistics
> directly?

Good question. Tejun wants systemd to cover that too. I am not entirely
sure. I don't like the extra roundtrip for measuring the accounting
bits. But maybe we can add a library that avoids the roundtrip, and
simply provides you with high-level accounting values for cgroups. That
way, for *changing* things you'd need to go via the bus; for *reading*
things we'd give you a library that goes directly to the cgroupfs and
avoids the roundtrip.

> 2) I currently fork and set up the resource environment (namespaces,
> environment, working directory, etc). Can an appropriately privileged
> process create a sub-slice, place itself in it, and then drop privs /
> exec?

We'll probably have a way for you to take an existing set of processes
and turn them dynamically into a new unit in systemd. These units would
be mostly like service units, except that systemd wouldn't start the
processes; they would be foreign-created. We are not sure about the name
for this yet (i.e. whether to cover it under the .service suffix), but
we'll probably call them "scopes" instead, with the suffix .scope.

The scope units could then be manipulated at runtime for (cgroup-based)
resource management the way normal services are too. So basically, a
service unit could be assigned to a slice unit, and could then create
scope units which detach subprocesses from the original service unit
and get their own cgroup in the same slice or any other.

> 3) More generally, will I be able to interact with slices directly, or
> will I need to create throw-away units and launch them via systemd
> (versus a normal fork/exec)?

Basically, with this scope concept in place, you'd create a throw-away
scope. In fact, scope units can only be created as throw-away units.

> - The latter causes quite a bit of anxiety for me - we currently
> support many POSIX platforms plus Windows (hey - at least we dropped
> HPUX) and I'd like to avoid a completely independent code path for
> spawning jobs on Linux.
>
> 4) Will many short-lived jobs cause any heartache? Would anything
> untoward happen to my system if I spawned / destroyed jobs (and
> corresponding units or slices) at, say, 1Hz?

Well, the idea is that these scopes are very lightweight. And we need to
make them scale (but I don't see why they shouldn't).

> 5) Will I be able to delegate management of a subslice to a
> non-privileged user?

Unlikely, at least for the beginning.

> I'm excited to see new ideas (again, having system tools be aware of
> the batch system activity is intriguing [2]), but am a bit worried
> about losing functionality and the cost of porting things to the new
> era!

There's certainly going to be some lost flexibility. But of course we'll
try to cover all interesting usecases.

> [2] Hopefully something that works better than
> "ps xawf -eo pid,user,cgroup,args", which currently segfaults for me :(

Hmm, could you file a bug, please?

Lennart

--
Lennart Poettering - Red Hat, Inc.
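To make the read side of that split concrete: cgroup v1 controllers expose flat "key value" accounting files such as memory.stat and cpuacct.stat, so a read-only library could simply parse them straight from the cgroupfs with no bus roundtrip. A hypothetical sketch (this is not the proposed systemd library, and the cgroup path layout is an assumption for illustration):

```python
def parse_cgroup_stat(text: str) -> dict[str, int]:
    """Parse flat 'key value' accounting files (memory.stat, cpuacct.stat)."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        stats[key] = int(value)
    return stats

def read_service_memory_stat(service_cgroup: str) -> dict[str, int]:
    # Direct read from the cgroupfs -- the "library" path for *reading*,
    # as opposed to the bus path required for *changing* things.
    with open(f"{service_cgroup}/memory.stat") as f:
        return parse_cgroup_stat(f.read())
```

The single-writer/multiple-reader split Brian likes falls out naturally: readers only ever open() files, so they cannot corrupt the tree that systemd owns.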
Re: [systemd-devel] [HEADSUP] cgroup changes
On Jun 25, 2013 2:43 AM, Lennart Poettering <lenn...@poettering.net> wrote:

> On Mon, 24.06.13 17:09, Andy Lutomirski (l...@amacapital.net) wrote:
>
>> [straw-man proposal quoted in full above: sys_task_reaper and
>> sys_killall_under_subreaper, with one subreaper per service]
>
> No. It doesn't do anything that's needed. How do I list all PIDs in a
> service with this?

Walk /proc/subreaper/children recursively. A kernel patch to make that
field show up unconditionally instead of hiding under EXPERT would help.

> How do I determine the service of a PID?

Call sys_task_reaper, then look up what service that subreaper comes
from.

> How do I do resource management with this?

With cgroups, unless the admin has configured systemd not to use
cgroups, in which case you don't. (The whole point would be to keep
DefaultControllers= without using the one and only cgroup hierarchy.)

--Andy
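The recursive walk Andy describes can be sketched as follows. The children list really does exist as /proc/<pid>/task/<tid>/children (gated behind CONFIG_PROC_CHILDREN / checkpoint-restore options, which is the EXPERT-gating he refers to); the traversal is written against an injectable lookup function so it is separate from the proc parsing. Note the walk is inherently racy against concurrently forking processes, which is part of Lennart's objection:

```python
def descendants(pid, children_of):
    """All PIDs under `pid`, depth-first, using a children-lookup callable."""
    out = []
    for child in children_of(pid):
        out.append(child)
        out.extend(descendants(child, children_of))
    return out

def proc_children(pid):
    # Real interface: /proc/<pid>/task/<tid>/children, a space-separated
    # PID list (only present when the kernel was built with the
    # checkpoint/restore options -- the EXPERT gating mentioned above).
    try:
        with open(f"/proc/{pid}/task/{pid}/children") as f:
            return [int(p) for p in f.read().split()]
    except FileNotFoundError:
        return []
```

Against a live system one would call `descendants(subreaper_pid, proc_children)`; unlike a cgroup.procs read, the snapshot can miss children forked mid-walk.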
Re: [systemd-devel] [HEADSUP] cgroup changes
On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:

> 1. I put the entire world into a separate, highly constrained cgroup.
> My real-time code runs outside that cgroup. This seems to be exactly
> what slices are for, but I need kernel threads to go into the
> constrained cgroup. Will systemd support this?

I am not sure whether the ability to move kernel threads into cgroups
will stay around at all, from the kernel side. Tejun, can you comment on
this?

> 2. I manage services and tasks outside systemd (for one thing, I
> currently use Ubuntu, but even if I were on Fedora, I have a bunch of
> fine-grained things that figure out how they're supposed to allocate
> resources, and porting them to systemd just to keep working in the new
> world order would be a PITA [1]).
>
> (cgroups have the odd feature that they are per-task, not per thread
> group, and the systemd proposal seems likely to break anything that
> actually wants task granularity. I may actually want to use this, even
> though it's a bit evil -- my real-time thread groups have non-real-time
> threads.)

Here too, Tejun is pretty keen on removing the ability to split up
threads into cgroups from the kernel, and will only allow this
per-process. Tejun, please comment!

> I think that what I want is something like sub-unit cgroups -- I want
> to be able to ask systemd to further subdivide the group for my unit,
> login session, or whatever. Would this be reasonable? (Another way of
> thinking of this is that a unit would have a whole cgroup hierarchy
> instead of just one cgroup.)

The idea is not even to allow this. Basically, if you want to partition
your daemon into different cgroups you need to do that through systemd's
abstractions: slices and services. To make this more palatable we'll
introduce throw-away units though, so that you can dynamically run
something as a workload and don't need to be concerned about naming it,
or cleaning it up.

> I think that the single-hierarchy model will require that I subdivide
> my user session so that the default sub-unit cgroup is constrained
> similarly to the default slice. I'll lose functionality, but I don't
> think this is a showstopper. A different approach would be to allow
> units to (with systemd's cooperation) escape into their own,
> dynamically created unit. This seems kind of awful.

This is basically what I meant with throw-away units.

> 3. My code runs unprivileged, but it still wants to configure itself.
> If needed, I can write a little privileged daemon to handle the systemd
> calls.

So, at least in the beginning, I am pretty sure that manipulating the
resource parameters will be restricted to root only, since this is much
more security-sensitive than one might assume and we simply can't
oversee it all yet.

> I think I can get away without anything fancy if a unit (login
> session?) grants the right to manipulate sub-unit cgroups to a non-root
> user.

As mentioned, this will not be possible.

> 4. As mentioned, I'm on Ubuntu some of the time. I'd like to keep the
> same code working on systemd and non-systemd systems. How hard would it
> be to run systemd as just a cgroup controller? That is, have systemd
> create its slices, run exactly one unit that represents the whole
> system, and let other things use the cgroup API.

I have no idea, I don't develop Ubuntu. They will have to come up with
some cgroup maintenance daemon of their own. As I know them, they'll
either do a port of the systemd counterpart (but that's going to be
tough!), or they'll stick something half-baked into Upstart...

Sorry if this all sounds a bit disappointing. But yeah, this all is not
a trivial change...

Lennart

--
Lennart Poettering - Red Hat, Inc.
Re: [systemd-devel] [HEADSUP] cgroup changes
On Fri, 21.06.13 14:47, Kok, Auke-jan H (auke-jan.h@intel.com) wrote:

> Do you suggest these manipulations should be implemented without
> high-level systemd APIs and the controller just manipulates the
> cgroups directly?

All changes to cgroup attributes must go through systemd. If the WM
wants to freeze or adjust OOM behaviour, it needs to issue systemd bus
calls for that.

>> The run-away stuff I can't follow? The kernel will distribute CPU
>> evenly among running apps if all want it, so I'm not seeing why
>> there's more monitoring needed. The thermal stuff is probably best
>> done in-kernel, I guess... Too dangerous/subject-to-latency for
>> userspace, no?
>
> Only userspace can distinguish between e.g. a foreground and a
> background application (WM) and decide that CPU consumption of certain
> apps in the background is excessive, and throttle it down further,
> which is somewhat similar to using the freezer to just SIGSTOP them
> entirely, basically.

Yes, userspace can do that via systemd; there will be high-level
operations on the bus for this. For example: SetCPUShares() to alter the
cpu.shares value, and so on. This method call will do much more, though,
than just write this value. One of the complexities of the cgroup stuff
here is that adding a unit to a controller like cpu means you have to do
the same for all its immediate siblings (i.e. other units in the same
slice) plus all its parent slices (and recursively their siblings).
Why? Because otherwise you might end up granting the service that is in
the cpu controller the same amount of CPU *in total* as the other
services in the same slice get *individually* for each process. And that
would be grossly unfair...

> Thermal throttling from userspace allows you to distinguish between
> "never make my SETI turn the fan on" and "throttle the entire system
> when I reach high fan speeds". You can't do that in the kernel.
>
> [1] Arguably this could be done in-task and not by an external
> controller, but you're still trusting the task to do the right thing,
> which may not be something you want to do.

So, if userspace needs to communicate something to kernel space about
what kind of cooling strategy it would prefer, and that is done via
cgroups, then I am sure we can add similar high-level per-unit control
calls to systemd, too.

Lennart

--
Lennart Poettering - Red Hat, Inc.
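The relative-shares arithmetic behind that fairness point can be made concrete with a small numeric sketch (not systemd code; it assumes cgroup v1 cpu.shares semantics under full CPU contention, where each group gets its shares divided by the sum of its siblings' shares, multiplied down the hierarchy):

```python
def cpu_fraction(levels):
    """Effective CPU fraction of a unit under full contention.

    levels: one (my_shares, all_sibling_shares) pair per hierarchy
    level, from the top-level slice down to the unit.  At each level a
    group receives my_shares / sum(all_sibling_shares) of what its
    parent level received.
    """
    frac = 1.0
    for mine, siblings in levels:
        frac *= mine / sum(siblings)
    return frac

# Two slices with equal shares; inside one slice, two services with
# equal shares: each of those services ends up with a quarter of the CPU.
print(cpu_fraction([(1024, [1024, 1024]),    # my slice vs. sibling slice
                    (1024, [1024, 1024])]))  # my service vs. sibling service
```

This is why setting cpu.shares on a single unit is meaningless in isolation: the value is only relative to its siblings and parents, so a method call like SetCPUShares() has to configure the whole slice subtree consistently.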
Re: [systemd-devel] [HEADSUP] cgroup changes
On Mon, Jun 24, 2013 at 03:27:15PM +0200, Lennart Poettering wrote:

> On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:
>
>> 1. I put the entire world into a separate, highly constrained cgroup.
>> My real-time code runs outside that cgroup. This seems to be exactly
>> what slices are for, but I need kernel threads to go into the
>> constrained cgroup. Will systemd support this?
>
> I am not sure whether the ability to move kernel threads into cgroups
> will stay around at all, from the kernel side. Tejun, can you comment
> on this?

KVM uses the vhost_net device for accelerating guest network I/O paths.
This device creates a new kernel thread on each open(), and that kernel
thread is attached to the cgroup associated with the process that
open()d the device.

If systemd allows a process to be moved between cgroups, then it must
also be capable of moving any associated kernel threads to the new
cgroup at the same time. This co-placement of vhost-net threads with the
KVM process is very critical for I/O performance of KVM networking.

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [systemd-devel] [HEADSUP] cgroup changes
On 06/21/2013 10:36 AM, Lennart Poettering wrote:

> 2) This hierarchy becomes private property of systemd. systemd will set
> it up. systemd will maintain it. systemd will rearrange it. Other
> software that wants to make use of cgroups can do so only through
> systemd's APIs. This single-writer logic is absolutely necessary, since
> the interdependencies between the various controllers, attributes, and
> cgroups are non-obvious, and we simply cannot allow cgroup users to
> alter the tree independently of each other forever. Due to all this:
> the "Pax Cgroup" document is a thing of the past; it is dead.
>
> If you are using non-trivial cgroup setups with systemd right now, then
> things will change for you. We will provide you with similar
> functionality as before, but things will be different and less
> low-level. As long as you only used the high-level options such as
> CPUShares, MemoryLimit and so on, you should be on the safe side.

Hmm. This may be tricky for my use case. Here are a few issues. For all
I know, they may already be supported (or planned), but I don't want to
get caught.

1. I put the entire world into a separate, highly constrained cgroup. My
real-time code runs outside that cgroup. This seems to be exactly what
slices are for, but I need kernel threads to go into the constrained
cgroup. Will systemd support this?

2. I manage services and tasks outside systemd (for one thing, I
currently use Ubuntu, but even if I were on Fedora, I have a bunch of
fine-grained things that figure out how they're supposed to allocate
resources, and porting them to systemd just to keep working in the new
world order would be a PITA [1]).

(cgroups have the odd feature that they are per-task, not per thread
group, and the systemd proposal seems likely to break anything that
actually wants task granularity. I may actually want to use this, even
though it's a bit evil -- my real-time thread groups have non-real-time
threads.)

I think that what I want is something like sub-unit cgroups -- I want to
be able to ask systemd to further subdivide the group for my unit, login
session, or whatever. Would this be reasonable? (Another way of thinking
of this is that a unit would have a whole cgroup hierarchy instead of
just one cgroup.)

I think that the single-hierarchy model will require that I subdivide my
user session so that the default sub-unit cgroup is constrained
similarly to the default slice. I'll lose functionality, but I don't
think this is a showstopper. A different approach would be to allow
units to (with systemd's cooperation) escape into their own, dynamically
created unit. This seems kind of awful.

3. My code runs unprivileged, but it still wants to configure itself. If
needed, I can write a little privileged daemon to handle the systemd
calls. I think I can get away without anything fancy if a unit (login
session?) grants the right to manipulate sub-unit cgroups to a non-root
user.

4. As mentioned, I'm on Ubuntu some of the time. I'd like to keep the
same code working on systemd and non-systemd systems. How hard would it
be to run systemd as just a cgroup controller? That is, have systemd
create its slices, run exactly one unit that represents the whole
system, and let other things use the cgroup API.

[1] Some day, I might convert my code to use a session systemd instance.
I'm not holding my breath, but it could be nice.

--Andy
Re: [systemd-devel] [HEADSUP] cgroup changes
On Mon, Jun 24, 2013 at 6:27 AM, Lennart Poettering
<lenn...@poettering.net> wrote:

> On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:
>
>> 2. I manage services and tasks outside systemd (for one thing, I
>> currently use Ubuntu, but even if I were on Fedora, I have a bunch of
>> fine-grained things that figure out how they're supposed to allocate
>> resources, and porting them to systemd just to keep working in the new
>> world order would be a PITA [1]).
>>
>> [...]
>>
>> I think that what I want is something like sub-unit cgroups -- I want
>> to be able to ask systemd to further subdivide the group for my unit,
>> login session, or whatever. Would this be reasonable? (Another way of
>> thinking of this is that a unit would have a whole cgroup hierarchy
>> instead of just one cgroup.)
>
> The idea is not even to allow this. Basically, if you want to partition
> your daemon into different cgroups you need to do that through
> systemd's abstractions: slices and services. To make this more
> palatable we'll introduce throw-away units though, so that you can
> dynamically run something as a workload and don't need to be concerned
> about naming it, or cleaning it up.

Hmm. My particular software can maybe live with this, with unpleasant
modifications, but this will break anything that, say, accepts a
connection from a client, forks into a (possibly new) cgroup based on
the identity of that client, and then does something.

How can this support containers or the use of cgroups in a
non-systemwide systemd instance? Containers may no longer be allowed to
escape from the cgroup they start in, but there should (IMO) still be a
way for things to subdivide their cgroup-controlled resources.

If I want to have a hierarchy more than two levels deep, I suspect I'm
SOL under this model. If I'm understanding correctly, there will be
slices, then units, and that's it.

>> 4. As mentioned, I'm on Ubuntu some of the time. I'd like to keep the
>> same code working on systemd and non-systemd systems. How hard would
>> it be to run systemd as just a cgroup controller? That is, have
>> systemd create its slices, run exactly one unit that represents the
>> whole system, and let other things use the cgroup API.
>
> I have no idea, I don't develop Ubuntu. They will have to come up with
> some cgroup maintenance daemon of their own. As I know them, they'll
> either do a port of the systemd counterpart (but that's going to be
> tough!), or they'll stick something half-baked into Upstart...
>
> Sorry if this all sounds a bit disappointing. But yeah, this all is not
> a trivial change...

I'm worried that the impedance mismatch between systemd and any other
possible API is going to be enormous. On systemd, I'll have to:

- Create a throwaway unit
- Figure out how to wire up stdout and stderr correctly (I use them for
  communication between processes)
- Translate the current directory, the environment, etc. into systemd
  configuration
- Translate my desired resource controls into systemd's "let's pretend
  that there aren't really cgroups underlying it" configuration
- Start the throwaway unit
- Figure out how to get notified when it finishes

Without systemd, I'll have to:

- fork()
- Ask whatever is managing cgroups to switch me to a different cgroup
- exec()

This is going to suck, I think.

--Andy
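The non-systemd three-step flow Andy lists can be sketched directly (cgroup v1 semantics; it assumes the child may write its own PID to the target cgroup's cgroup.procs, either because it has direct access or because a cooperating cgroup manager delegated that file):

```python
import os

def spawn_in_cgroup(argv, cgroup_dir):
    """fork, move the child into a cgroup, then exec -- the three steps above."""
    pid = os.fork()
    if pid == 0:
        # Child: place ourselves into the target cgroup by writing our
        # PID to cgroup.procs (moves the whole thread group in v1)...
        with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
            f.write(str(os.getpid()))
        # ...then become the real workload.
        os.execvp(argv[0], argv)
    return pid  # parent: child PID, e.g. for waitpid()
```

The contrast with the systemd-side list is the point: here the cgroup placement is a single write that happens between fork() and exec(), with stdio, cwd, and environment inherited for free.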
Re: [systemd-devel] [HEADSUP] cgroup changes
On Fri, Jun 21, 2013 at 10:36 AM, Lennart Poettering
<lenn...@poettering.net> wrote:

> As long as you only used the high-level options such as CPUShares,
> MemoryLimit and so on, you should be on the safe side.

This is already representative of how we're doing things in large-scale
production and how we recommend other users use cgroups on systemd-based
distributions. So, +1.

--
David Strauss | da...@davidstrauss.net | +1 512 577 5827 [mobile]
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello,

On Mon, Jun 24, 2013 at 02:39:53PM +0100, Daniel P. Berrange wrote:

> KVM uses the vhost_net device for accelerating guest network I/O
> paths. This device creates a new kernel thread on each open(), and
> that kernel thread is attached to the cgroup associated with the
> process that open()d the device.
>
> If systemd allows a process to be moved between cgroups, then it must
> also be capable of moving any associated kernel threads to the new
> cgroup at the same time. This co-placement of vhost-net threads with
> the KVM process is very critical for I/O performance of KVM
> networking.

Yeah, the way virt drivers use cgroups right now is pretty hacky. I was
thinking about adding a per-process workqueue which follows the cgroup
association of the process after the unified hierarchy, and then
converting virt to use that.

At any rate, those kthreads can be moved via cgroup.procs, so the
unified hierarchy wouldn't break it from the kernel side. Not sure how
the interface would look from the systemd side, though.

Thanks.

--
tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello, On Mon, Jun 24, 2013 at 03:27:15PM +0200, Lennart Poettering wrote: On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote: 1. I put all the entire world into a separate, highly constrained cgroup. My real-time code runs outside that cgroup. This seems to exactly what slices are for, but I need kernel threads to go in to the constrained cgroup. Will systemd support this? I am not sure whether the ability to move kernel threads into cgroups will stay around at all, from the kernel side. Tejun, can you comment on this? Any kernel threads with PF_NO_SETAFFINITY set already can't be removed from the root cgroup. In general, I don't think moving kernel threads into !root cgroups is a good idea. They're in most cases shared resources and userland doesn't really have much idea what they're actually doing, which is the fundmental issue. Which kthreads are running on the kernel side and what they're doing is strict implementation detail from the kernel side. There's no effort from kernel side in keeping them stable and userland is likely to get things completely wrong - e.g. many kernel threads named after workqueues in any recent kernels don't actually do anything until the system is under heavy memory pressure. Userland can't tell and has no control over what's being executed where at all and that's the way it should be. That said, there are cases where certain async executions are concretely bound to userland processes - say, (planned) aio updates, virt drivers and so on. Right now, virt implements something pretty hacky but I think they'll have to be tied closer to the usual process mechanism - ie. they should be saying that these kthreads are serving this process and should be treated as such in terms of resource control rather than the current move this kthread to this set of cgroups, don't ask why thing. Another not-well-thought-out aspect of the current cgroup. 
:( I have an idea where it should be headed in the long term but am not sure about a short-term solution. Given that the only sort-of widespread use case is virt kthreads, maybe it just needs to be special-cased for now. Not sure. 2. I manage services and tasks outside systemd (for one thing, I currently use Ubuntu, but even if I were on Fedora, I have a bunch of fine-grained things that figure out how they're supposed to allocate resources, and porting them to systemd just to keep working in the new world order would be a PITA [1]). (cgroups have the odd feature that they are per-task, not per thread group, and the systemd proposal seems likely to break anything that actually wants task granularity. I may actually want to use this, even though it's a bit evil -- my real-time thread groups have non-real-time threads.) Here too, Tejun is pretty keen on removing the ability to split up threads into cgroups from the kernel, and will only allow this per-process. Tejun, please comment! Yes, again, the biggest issue is how much of the low-level cgroup details become known to individual programs. Splitting threads into different cgroups would in most cases mean that the binary itself would become aware of cgroup, and it's akin to burying sysctl knob tunings into individual binaries. cgroup is not an interface for each individual program to fiddle with. If certain thread-granular control is absolutely necessary and justifiable, it's something to be added to the existing thread API, not something to be bolted on using cgroups. So, I'm quite strongly against allowing splitting threads of the same process into different cgroups. Thanks. -- tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello, Andy. On Mon, Jun 24, 2013 at 11:49:05AM -0700, Andy Lutomirski wrote: I have an idea where it should be headed in the long term but am not sure about a short-term solution. Given that the only sort-of widespread use case is virt kthreads, maybe it just needs to be special-cased for now. Not sure. I'll be okay (I think) if I can reliably set affinities of these threads. I'm currently doing it with cgroups. That being said, I don't like the direction that kernel thread magic affinity is going. It may be great for cache performance and reducing random bouncing, but I have a scheduling-jitter-sensitive workload and I don't care about overall system throughput. I need the kernel to stay the f!k off my important cpus, and arranging for this to happen is becoming increasingly complicated. Why is it becoming increasingly complicated? The biggest change probably was the shared workqueue pool implementation, but that was years ago, and workqueue has recently grown pool attributes adding more properly designed flexibility; for example, adding default affinity for !per-cpu workqueues should be pretty easy now. But anyways, if it's an issue, it should be examined and properly solved rather than hacking up a hacky solution with cgroup. cgroups are most certainly something that a binary can be aware of. It's not like a sysctl knob at all -- it's per process. I have lots No, it definitely is not. Sure, it is more granular than sysctl, but that's it. It exposes control knobs which are directly tied into kernel implementation details. It is not a properly designed programming API by any stretch of the imagination. It is an extreme failure on the kernel side that that part hasn't been made crystal clear from the beginning. I don't know how intentional it was but the whole thing is completely botched. cgroup *never* was held to the standard necessary for any widely available API and many of the controls it exposes are exactly at the level of sysctls. 
As the interface was a filesystem, it could evade scrutiny, and with the hierarchical organization it also gave the impression that it's something which can be used directly by individual applications. It found a loophole in the way we implement and police kernel APIs and then exploited it like there's no tomorrow. We are firmly bound to maintain what already has been exposed from the kernel side and I'm not gonna break any of them, but the free-for-all cgroup is broken and deprecated. It's gonna wither and fade away and any attempt to reverse that will be met with extreme prejudice. of binaries that have worked quite well for a couple years that move themselves into different cgroups. I have no problem with a unified hierarchy, but I need control of my little piece of the hierarchy. I don't care if the interface to do so changes, but the basic functionality is important. Whether you care or not is completely irrelevant. Individual binaries widely incorporating cgroup details automatically binds the kernel. It becomes excruciatingly painful to back out after a certain point. I don't think we're there yet given the overall immaturity and brokenness of cgroups, and it's imperative that we back the hell out as fast as possible before this insanity spreads any wider. Thanks. -- tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
On Mon, Jun 24, 2013 at 12:10 PM, Tejun Heo t...@kernel.org wrote: Hello, Andy. On Mon, Jun 24, 2013 at 11:49:05AM -0700, Andy Lutomirski wrote: I have an idea where it should be headed in the long term but am not sure about a short-term solution. Given that the only sort-of widespread use case is virt kthreads, maybe it just needs to be special-cased for now. Not sure. I'll be okay (I think) if I can reliably set affinities of these threads. I'm currently doing it with cgroups. That being said, I don't like the direction that kernel thread magic affinity is going. It may be great for cache performance and reducing random bouncing, but I have a scheduling-jitter-sensitive workload and I don't care about overall system throughput. I need the kernel to stay the f!k off my important cpus, and arranging for this to happen is becoming increasingly complicated. Why is it becoming increasingly complicated? The biggest change probably was the shared workqueue pool implementation, but that was years ago, and workqueue has recently grown pool attributes adding more properly designed flexibility; for example, adding default affinity for !per-cpu workqueues should be pretty easy now. But anyways, if it's an issue, it should be examined and properly solved rather than hacking up a hacky solution with cgroup. Because more things are becoming per cpu without the option of moving per-cpu things on behalf of one cpu to another cpu. RCU is a nice exception. cgroups are most certainly something that a binary can be aware of. It's not like a sysctl knob at all -- it's per process. I have lots No, it definitely is not. Sure, it is more granular than sysctl, but that's it. It exposes control knobs which are directly tied into kernel implementation details. It is not a properly designed programming API by any stretch of the imagination. It is an extreme failure on the kernel side that that part hasn't been made crystal clear from the beginning. 
I don't know how intentional it was but the whole thing is completely botched. cgroup *never* was held to the standard necessary for any widely available API and many of the controls it exposes are exactly at the level of sysctls. As the interface was a filesystem, it could evade scrutiny, and with the hierarchical organization it also gave the impression that it's something which can be used directly by individual applications. It found a loophole in the way we implement and police kernel APIs and then exploited it like there's no tomorrow. We are firmly bound to maintain what already has been exposed from the kernel side and I'm not gonna break any of them, but the free-for-all cgroup is broken and deprecated. It's gonna wither and fade away and any attempt to reverse that will be met with extreme prejudice. The functionality I care about is that a program can reliably and hierarchically subdivide system resources -- think rlimits but actually useful. I, and probably many other things, want this functionality. Yes, the current cgroup interface is awful, but it gets one thing right: it's a hierarchy. Back when my software ran on Windows, I used the awful job interface to allocate resources among different parts of my software. When I switched to Linux, I lost some of that functionality and replaced other bits with cgroups. It's hackish, but it works. Now we're apparently moving toward having a unified hierarchy (great!), a more sane API (great!), and a nasty userspace situation where systemd-using systems control the hierarchy through a highly limiting systemd-specific interface and non-systemd systems do something else which will presumably look nothing like what systemd does. 
I would argue that designing a kernel interface that requires exactly one userspace component to manage it and ties that one userspace component to something that can't easily be deployed everywhere (the init system) is as big a cheat as the old approach of sneaking bad APIs in through a filesystem was. IOW, please, when designing this, please specify an API that programs are permitted to use, and let that API be reviewed. --Andy
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello, On Mon, Jun 24, 2013 at 12:24:38PM -0700, Andy Lutomirski wrote: Because more things are becoming per cpu without the option of moving per-cpu things on behalf of one cpu to another cpu. RCU is a nice exception. Hmm... but in most cases it's per-cpu on the same cpu that initiated the task. If a given CPU is just crunching numbers and IRQ affinity is properly configured, the CPU shouldn't be bothered too much by per-cpu work items. If there are, please let us know. We can hunt them down. The functionality I care about is that a program can reliably and hierarchically subdivide system resources -- think rlimits but actually useful. I, and probably many other things, want this functionality. Yes, the current cgroup interface is awful, but it gets one thing right: it's a hierarchy. And the hierarchy support was completely broken for many resource controllers up until only several releases ago. I would argue that designing a kernel interface that requires exactly one userspace component to manage it and ties that one userspace component to something that can't easily be deployed everywhere (the init system) is as big a cheat as the old approach of sneaking bad APIs in through a filesystem was. In terms of API, it is firmly at the level of sysctl. That's it. While I agree that having a proper kernel API for hierarchical resource management could be nice, that currently is out of scope. We're already knee-deep in shit with the limited capabilities we're trying to implement. Also, I really don't think cgroup is the right interface for such a thing even if we get to that. It should be part of the usual process/thread model, not this completely separate thing on the side. IOW, please, when designing this, please specify an API that programs are permitted to use, and let that API be reviewed. cgroup is not that API and it's never gonna be in all likelihood. As for systemd vs. non-systemd compatibility, I'm afraid I don't have a good answer. 
This is still all in a pretty early phase and the proper abstractions and APIs are being figured out. Hopefully, we'll converge on a mostly compatible high-level abstraction which can be presented regardless of the actual base system implementation. Thanks. -- tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
On Mon, Jun 24, 2013 at 12:37 PM, Tejun Heo t...@kernel.org wrote: Hello, On Mon, Jun 24, 2013 at 12:24:38PM -0700, Andy Lutomirski wrote: Because more things are becoming per cpu without the option of moving per-cpu things on behalf of one cpu to another cpu. RCU is a nice exception. Hmm... but in most cases it's per-cpu on the same cpu that initiated the task. If a given CPU is just crunching numbers and IRQ affinity is properly configured, the CPU shouldn't be bothered too much by per-cpu work items. If there are, please let us know. We can hunt them down. I'm not just crunching numbers -- I do (nonblocking) I/O as well. The functionality I care about is that a program can reliably and hierarchically subdivide system resources -- think rlimits but actually useful. I, and probably many other things, want this functionality. Yes, the current cgroup interface is awful, but it gets one thing right: it's a hierarchy. And the hierarchy support was completely broken for many resource controllers up until only several releases ago. I would argue that designing a kernel interface that requires exactly one userspace component to manage it and ties that one userspace component to something that can't easily be deployed everywhere (the init system) is as big a cheat as the old approach of sneaking bad APIs in through a filesystem was. In terms of API, it is firmly at the level of sysctl. That's it. While I agree that having a proper kernel API for hierarchical resource management could be nice, that currently is out of scope. We're already knee-deep in shit with the limited capabilities we're trying to implement. Also, I really don't think cgroup is the right interface for such a thing even if we get to that. It should be part of the usual process/thread model, not this completely separate thing on the side. IOW, please, when designing this, please specify an API that programs are permitted to use, and let that API be reviewed. 
cgroup is not that API and it's never gonna be in all likelihood. As for systemd vs. non-systemd compatibility, I'm afraid I don't have a good answer. This is still all in a pretty early phase and the proper abstractions and APIs are being figured out. Hopefully, we'll converge on a mostly compatible high-level abstraction which can be presented regardless of the actual base system implementation. So what is cgroup for? That is, what's the goal for what the new API should be able to do? AFAICT the main reason that systemd uses cgroup is to efficiently track which service various processes came from and to send signals, and it seems like that use case could be handled without cgroups at all by creative use of subreapers and a syscall to broadcast a signal to everything that has a given subreaper as an ancestor. In that case, systemd could be asked to stay away from cgroups even in the single-hierarchy case. --Andy
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello, On Mon, Jun 24, 2013 at 04:01:07PM -0700, Andy Lutomirski wrote: So what is cgroup for? That is, what's the goal for what the new API should be able to do? It is for controlling and distributing resources. That part doesn't change. It's just not built to be used directly by individual applications. It's an admin tool just like sysctl - be that admin a human or a userland base system. There's a huge chasm between something which can be generally used by normal applications and something which is restricted to admins and base systems in terms of interface generality and stability, security, how the abstractions fit together with the existing APIs and so on. cgroup firmly belongs to the latter. It still serves the same purpose but isn't, in a way, developed enough to be used directly by individual applications and I'm not even sure we want or need to develop it to such a level. Thanks. -- tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
On Mon, Jun 24, 2013 at 4:19 PM, Tejun Heo t...@kernel.org wrote: Hello, On Mon, Jun 24, 2013 at 04:01:07PM -0700, Andy Lutomirski wrote: So what is cgroup for? That is, what's the goal for what the new API should be able to do? It is for controlling and distributing resources. That part doesn't change. It's just not built to be used directly by individual applications. It's an admin tool just like sysctl - be that admin a human or a userland base system. There's a huge chasm between something which can be generally used by normal applications and something which is restricted to admins and base systems in terms of interface generality and stability, security, how the abstractions fit together with the existing APIs and so on. cgroup firmly belongs to the latter. It still serves the same purpose but isn't, in a way, developed enough to be used directly by individual applications and I'm not even sure we want or need to develop it to such a level. My application is running on a single-purpose system I administer. I guess what I'm trying to say here is that many systems will rather fundamentally use systemd. Admins of those systems should still have access to a reasonably large subset of cgroup functionality. If the single-hierarchy model is going to prevent going around systemd and if systemd isn't going to expose all of the useful cgroup functionality, then perhaps there should be a way to separate systemd's hierarchy from the cgroup hierarchy. Looking at http://0pointer.de/blog/projects/cgroups-vs-cgroups.html, it looks like systemd doesn't actually need the cgroup resource control functionality. Maybe there's a way to disentangle this stuff. The /proc/pid/children feature that CRIU added seems like a decent start. --Andy
Re: [systemd-devel] [HEADSUP] cgroup changes
On Mon, Jun 24, 2013 at 4:37 PM, Tejun Heo t...@kernel.org wrote: Hello, Andy. On Mon, Jun 24, 2013 at 04:27:17PM -0700, Andy Lutomirski wrote: I guess what I'm trying to say here is that many systems will rather fundamentally use systemd. Admins of those systems should still have access to a reasonably large subset of cgroup functionality. If the single-hierarchy model is going to prevent going around systemd and if systemd isn't going to expose all of the useful cgroup functionality, then perhaps there should be a way to separate systemd's hierarchy from the cgroup hierarchy. I don't think systemd will prevent you from building your own hierarchy on the side. It sure won't be properly supported and things might break in corner cases / over time, but if you're willing to take such risks anyway... In the long term tho, what should happen probably is examining use cases like yours and then incorporating sensible mechanisms to support that into the base system infrastructure. It might not be completely identical but I'm sure over time we'll be able to find what are the fundamental pieces and proper abstractions. Right now, we're exposing way too much without even clearly understanding what is being enabled. It is unsustainable. Now I'm confused. I thought that support for multiple hierarchies was going away. Is it here to stay after all? --Andy
Re: [systemd-devel] [HEADSUP] cgroup changes
Hello, On Mon, Jun 24, 2013 at 4:38 PM, Andy Lutomirski l...@amacapital.net wrote: Now I'm confused. I thought that support for multiple hierarchies was going away. Is it here to stay after all? It is going to be deprecated but will also stay around for quite a while. That said, I didn't mean to use multiple hierarchies. I was saying that if you build a sub-hierarchy in the unified hierarchy, you're likely to get away with it in most cases. Thanks. -- tejun
Re: [systemd-devel] [HEADSUP] cgroup changes
On Mon, Jun 24, 2013 at 4:40 PM, Tejun Heo t...@kernel.org wrote: Hello, On Mon, Jun 24, 2013 at 4:38 PM, Andy Lutomirski l...@amacapital.net wrote: Now I'm confused. I thought that support for multiple hierarchies was going away. Is it here to stay after all? It is going to be deprecated but will also stay around for quite a while. That said, I didn't mean to use multiple hierarchies. I was saying that if you build a sub-hierarchy in the unified hierarchy, you're likely to get away with it in most cases. Isn't that exactly what I was originally asking for? Quoting from earlier in the thread: On Mon, Jun 24, 2013 at 6:27 AM, Lennart Poettering lenn...@poettering.net wrote: On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote: 2. I manage services and tasks outside systemd (for one thing, I currently use Ubuntu, but even if I were on Fedora, I have a bunch of fine-grained things that figure out how they're supposed to allocate resources, and porting them to systemd just to keep working in the new world order would be a PITA [1]). [...] I think that what I want is something like sub-unit cgroups -- I want to be able to ask systemd to further subdivide the group for my unit, login session, or whatever. Would this be reasonable? (Another way of thinking of this is that a unit would have a whole cgroup hierarchy instead of just one cgroup.) The idea is not even to allow this. Basically, if you want to partition your daemon into different cgroups you need to do that through systemd's abstractions: slices and services. To make this more palatable we'll introduce throw-away units though, so that you can dynamically run something as a workload and don't need to be concerned about naming this, or cleaning it up. If I can subdivide my service in the hierarchy, then I'm happy. If this gets lost *and* systemd insists on controlling the one and only cgroup hierarchy, then I think I have serious problems with the new regime. 
--Andy
Re: [systemd-devel] [HEADSUP] cgroup changes
On Mon, 24.06.13 16:01, Andy Lutomirski (l...@amacapital.net) wrote: AFAICT the main reason that systemd uses cgroup is to efficiently track which service various processes came from and to send signals, and it seems like that use case could be handled without cgroups at all by creative use of subreapers and a syscall to broadcast a signal to everything that has a given subreaper as an ancestor. In that case, systemd could be asked to stay away from cgroups even in the single-hierarchy case. systemd uses cgroups to manage services. Managing services means many things. Among them: keeping track of processes, listing processes of a service, killing processes of a service, doing per-service logging (which means reliably, immediately, and race-freely tracing back messages to the service which logged them), about 55 other things, and also resource management. I don't see how I can do any of this without something like cgroups, i.e. something hierarchical, with resource management involved, which allows me to securely put labels on processes. Lennart -- Lennart Poettering - Red Hat, Inc.
Re: [systemd-devel] [HEADSUP] cgroup changes
On Mon, Jun 24, 2013 at 4:57 PM, Lennart Poettering lenn...@poettering.net wrote: On Mon, 24.06.13 16:01, Andy Lutomirski (l...@amacapital.net) wrote: AFAICT the main reason that systemd uses cgroup is to efficiently track which service various processes came from and to send signals, and it seems like that use case could be handled without cgroups at all by creative use of subreapers and a syscall to broadcast a signal to everything that has a given subreaper as an ancestor. In that case, systemd could be asked to stay away from cgroups even in the single-hierarchy case. systemd uses cgroups to manage services. Managing services means many things. Among them: keeping track of processes, listing processes of a service, killing processes of a service, doing per-service logging (which means reliably, immediately, and race-freely tracing back messages to the service which logged them), about 55 other things, and also resource management. I don't see how I can do any of this without something like cgroups, i.e. something hierarchical, with resource management involved, which allows me to securely put labels on processes. Boneheaded straw-man proposal: two new syscalls and a few spare processes. int sys_task_reaper(int tid): Returns the reaper for the task tid (which is 1 if there's no subreaper). (This could just as easily be a file in /proc.) int sys_killall_under_subreaper(int subreaper, int sig): Broadcasts sig to all tasks under subreaper (excluding subreaper). Guarantees that, even if those tasks are forking, they all get the signal. Then, when starting a service, systemd forks, sets the child to be a subreaper, and that child then forks again to exec the service. Does this do everything that's needed? sys_task_reaper is trivial to implement (that functionality is already there in the reparenting code), and sys_killall_under_subreaper is probably not so bad. This has one main downside I can think of: it wastes a decent number of processes (one subreaper per service). 
--Andy
Re: [systemd-devel] [HEADSUP] cgroup changes
Lennart Poettering lennart at poettering.net writes: 2) This hierarchy becomes private property of systemd. systemd will set it up. Systemd will maintain it. Systemd will rearrange it. Other software that wants to make use of cgroups can do so only through systemd's APIs. This single-writer logic is absolutely necessary, since interdependencies between the various controllers, the various attributes, the various cgroups are non-obvious and we simply cannot allow cgroup users to alter the tree independently of each other forever. Due to all this: The Pax Cgroup document is a thing of the past, it is dead. Hi [1], I currently contribute cgroup support to a batch system (http://research.cs.wisc.edu/htcondor/) and am trying to figure out how this will affect me. Right now, I take the resources provided by the cgroup set up by the sysadmin and sub-divide them amongst the running jobs. Cgroups are used for resource management, resource accounting, and job management (using the freezer controller to deliver signals to all processes at once). Jobs last between seconds and hours; a setup time of, say, several hundred milliseconds is acceptable - as long as we can easily create and destroy many jobs. A few questions came to mind which may provide interesting input to your design process: 1) I use cgroups heavily for resource accounting. Do you envision me querying via dbus for each accounting attribute? Or do you envision me querying for the cgroup name, then accessing the controller statistics directly? 2) I currently fork and set up the resource environment (namespaces, environment, working directory, etc). Can an appropriately privileged process create a sub-slice, place itself in it, and then drop privs / exec? 3) More generally, will I be able to interact with slices directly, or will I need to create throw-away units and launch them via systemd (versus a normal fork/exec)? 
- The latter causes quite a bit of anxiety for me - we currently support many POSIX platforms plus Windows (hey - at least we dropped HPUX) and I'd like to avoid a completely independent code path for spawning jobs on Linux. 4) Will many short-lived jobs cause any heartache? Would anything untoward happen to my system if I spawned / destroyed jobs (and corresponding units or slices) at, say, 1Hz? 5) Will I be able to delegate management of a subslice to a non-privileged user? I'm excited to see new ideas (again, having system tools be aware of the batch system activity is intriguing [2]), but am a bit worried about losing functionality and the cost of porting things to the new era! Thanks! Brian [1] apologies if the reply comes through mangled; posting through the gmane web interface. [2] Hopefully something that works better than ps xawf -eo pid,user,cgroup,args which currently segfaults for me :(
[systemd-devel] [HEADSUP] cgroup changes
Heya, On Monday I posted this mail: http://lists.freedesktop.org/archives/systemd-devel/2013-June/011388.html Here's an update and a bit on the bigger picture: Half of what I mentioned there is now in place. There's now a new slice unit type in place in git, and everything is hooked up to it. logind will now also keep track of running containers/VMs. The various container/VM managers have to register with logind now. This serves the purpose of better integration of containers/VMs everywhere (so that ps can show, for each process, where it belongs). However, the main reason for this is that this is eventually going to be the only way containers/VMs can get a cgroup of their own. So, in that context, a bit of the bigger picture: It took us a while to realize the full extent of how awfully unusable cgroups currently are. The attributes have way more interdependencies than people might think and it is trivial to create nonsensical configurations... Of course, understanding how awful the status quo is makes a good first step. But we really needed to figure out what we can do about this to clean this up in the long run, and how we can get to something useful quickly. So, after much discussion between Tejun (the kernel cgroup maintainer) and various other folks, here's the new scheme that we want to go for: 1) In the long run there's only going to be a single kernel cgroup hierarchy, the per-controller hierarchies will go away. The single hierarchy will allow controllers to be individually enabled for each cgroup. The net effect is that the hierarchies the controllers see are not orthogonal anymore, they are always subtrees of the full single hierarchy. 2) This hierarchy becomes private property of systemd. systemd will set it up. Systemd will maintain it. Systemd will rearrange it. Other software that wants to make use of cgroups can do so only through systemd's APIs. 
This single-writer logic is absolutely necessary, since interdependencies between the various controllers, the various attributes, the various cgroups are non-obvious and we simply cannot allow cgroup users to alter the tree independently of each other forever. Due to all this: The Pax Cgroup document is a thing of the past, it is dead. 3) systemd will hide the fact that cgroups are internally used almost entirely. In fact, we will take away the unit configuration options ControlGroup=, ControlGroupModify=, ControlGroupPersistent=, ControlGroupAttribute= in their entirety. The high-level options CPUShares=, MemoryLimit=, .. and so on will continue to exist and we'll add additional ones like them. The system.conf setting DefaultControllers=cpu will go away too. Basically, you'll get more high-level settings, but all the low-level bits will go away without replacement. We will take away the ability for the admin to set arbitrary low-level attributes, to arrange things in completely arbitrary cgroup trees or to enable arbitrary controllers for a service. 4) systemd git introduced a new unit type called slice (see above). This is for partitioning the resources of the system into slices. Slices are hierarchical, and other units (such as services, but also containers/VMs and logged-in users) can then be assigned to these slices. Slices internally map to cgroups, but they are a very high-level construct. Slices will expose the same CPUShares=, MemoryLimit= properties as the other units do. This means resource management will become a first-class, built-in functionality of systemd. You can create slices for your customers, and in them subslices for their departments, and then run services, users, VMs in them. In the long run these will even be dynamically movable (while they are running), but that'll take more kernel work. 
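In unit-file form, the customer-slice scheme described here might look roughly like the following sketch. The option names CPUShares=, MemoryLimit= and the slice concept come from the mail itself; the file names, values, and exact layout below are invented for illustration:

```ini
# Hypothetical /etc/systemd/system/customer1.slice -- a slice carved out
# for one customer, using the high-level knobs named above.
[Unit]
Description=Resource slice for customer 1

[Slice]
CPUShares=512
MemoryLimit=2G

# Hypothetical service assigned into that slice; Slice= attaches the
# unit below customer1.slice in the hierarchy, and the unit's own
# settings subdivide the slice's share.
# /etc/systemd/system/webapp.service
[Unit]
Description=Customer 1 web application

[Service]
ExecStart=/usr/bin/webapp
Slice=customer1.slice
CPUShares=256
MemoryLimit=512M
```

The point of the design is that the admin expresses intent at this level and systemd alone translates it into cgroup attributes, rather than anyone writing low-level knobs directly.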
By default there will be three slices: system.slice (where all system services are located by default), user.slice (where all logged-in users are located by default), and machine.slice (where all running VMs/containers are located by default). However, the admin will have full freedom to create arbitrary slices and move the other units into them.

5) systemd's logind daemon already kept track of logged-in users/sessions. It is now extended to also keep track of virtual machines/containers. In fact, this is how libvirt/nspawn and friends will now get their own cgroups. They register as a machine, which means passing a bit of meta info to systemd and getting a cgroup assigned in response. This registration ensures that ps and friends can show which VM/container a process belongs to, and it easily allows other tools to query container/VM info too, so that in the long run we'll be able to provide a level of container/VM integration comparable to what Solaris Zones offer.

So, this all together sounds like an awful lot of change. #1 and #2 are long-term changes. However, #3, #4, and #5 are something we can do now and should do now, as preparation for the single-writer, unified cgroup tree. We really, really shouldn't ship the cgroup mess for longer.
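As a concrete illustration of the scheme above, a custom slice plus a service assigned to it might look roughly like the following. This is a hypothetical sketch of the intended configuration style: the directive names (Slice=, CPUShares=, MemoryLimit=) follow the high-level options mentioned in the mail, and the file paths and values are invented for the example.

```ini
# /etc/systemd/system/customer.slice  (hypothetical example)
[Unit]
Description=Resource slice for one customer

[Slice]
CPUShares=1024
MemoryLimit=2G

# /etc/systemd/system/webapp.service  (hypothetical example)
[Unit]
Description=Customer web application

[Service]
ExecStart=/usr/bin/webapp
Slice=customer.slice
CPUShares=512
```

The point of the design is that the admin only ever writes settings like these; the cgroup tree and controller attributes behind them are derived and owned by systemd.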
Re: [systemd-devel] [HEADSUP] cgroup changes
On Fri, Jun 21, 2013 at 10:36 AM, Lennart Poettering lenn...@poettering.net wrote: Heya, On monday I posted this mail: http://lists.freedesktop.org/archives/systemd-devel/2013-June/011388.html Here's an update and a bit on the bigger picture: Thanks for doing this - I am really looking forward to seeing this all take shape, and I hope to be able to leverage this in the future :^) All the points below are great, and problems that I've encountered in the past have all hinted towards this being the right way forward. #2 below has my interest - when you have some ideas about how the API will look I'd like to review it and match against our use cases... Auke ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] [HEADSUP] cgroup changes
On Fri, 21.06.13 12:59, Kok, Auke-jan H (auke-jan.h@intel.com) wrote: #2 below has my interest - when you have some ideas about how the API will look I'd like to review it and match against our use cases...

Point #2 is precisely about not having APIs for this... ;-) So, in the future, when you have some service, and that service wants to alter some cgroup resource limits for itself (let's say: set its own CPU shares value to 1500), this is what should happen: the service should use a call like sd_pid_get_unit() to get its own unit name, and then use D-Bus to invoke SetCPUShares(1500) for that service. systemd will then do the rest. (*)

Lennart

(*) To make this even simpler we have been thinking of defining a new virtual bus object path /org/freedesktop/systemd1/self/ or so, which will always point to the caller's own unit. This would be similar to /proc/self/, which points to the process's own PID dir. With that in place you could then set any resource setting you want with a single bus method call.

-- Lennart Poettering - Red Hat, Inc.
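The flow Lennart describes - look up your own unit name, then address it on the bus - needs a mapping from unit name to D-Bus object path. A minimal sketch of that mapping, assuming the escaping convention systemd uses for unit object paths (every byte outside [A-Za-z0-9] becomes _ followed by two hex digits); the helper name is illustrative:

```python
def unit_dbus_path(unit_name):
    """Escape a unit name into a systemd D-Bus object path.

    Assumes systemd's label escaping: ASCII alphanumerics pass
    through, every other byte becomes _XX in lowercase hex.
    """
    escaped = "".join(
        ch if ch.isascii() and ch.isalnum()
        else "".join("_%02x" % b for b in ch.encode())
        for ch in unit_name)
    return "/org/freedesktop/systemd1/unit/" + escaped

print(unit_dbus_path("foo.service"))
# /org/freedesktop/systemd1/unit/foo_2eservice
```

With the virtual /org/freedesktop/systemd1/self/ path from the footnote, a service could skip this lookup entirely and issue its resource call against that path directly.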
Re: [systemd-devel] [HEADSUP] cgroup changes
On Fri, Jun 21, 2013 at 1:10 PM, Lennart Poettering lenn...@poettering.net wrote: (*) to make this even simpler we have been thinking of defining a new virtual bus object path /org/freedesktop/systemd1/self/ or so which will always point to the caller's own unit. [...] With that in place you could then set any resource setting you want with a single bus method call.

This is fine for applications that manage themselves, but I'm seeing more interest in use cases where we want external influence on cgroup hierarchies, for instance:

- foreground/background priorities - a window manager marks background applications, puts them in the freezer, and changes oom_score_adj so that old apps can get automatically cleaned up in case memory availability is low.
- detecting runaway apps and taking CPU slices away from them.
- thermally constraining classes of applications.

Those would be tasks that an external process would do by manipulating properties of cgroups, not something each task would do on its own. Do you suggest these manipulations should be implemented without high-level systemd APIs, with the controller manipulating the cgroups directly?

Auke
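Today, the external-controller pattern Auke describes is typically implemented against the raw kernel interfaces. A minimal sketch of the oom_score_adj part (writing /proc/&lt;pid&gt;/oom_score_adj is the kernel interface, not a systemd API; the function names and the adjustment value are illustrative):

```python
OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX = -1000, 1000

def clamp_oom_score_adj(value):
    # The kernel only accepts values in [-1000, 1000].
    return max(OOM_SCORE_ADJ_MIN, min(OOM_SCORE_ADJ_MAX, value))

def mark_background(pid, adj=500):
    """Make a backgrounded app a preferred OOM-kill victim.

    This is the direct procfs write a WM-style controller would
    use today; needs privileges to touch other processes.
    """
    with open("/proc/%d/oom_score_adj" % pid, "w") as f:
        f.write(str(clamp_oom_score_adj(adj)))
```

Under the proposed single-writer scheme, the same intent would instead be expressed as a bus call to systemd, which would perform the write on the controller's behalf.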
Re: [systemd-devel] [HEADSUP] cgroup changes
On Fri, 21.06.13 14:10, Kok, Auke-jan H (auke-jan.h@intel.com) wrote: Those would be tasks that an external process would do by manipulating properties of cgroups, not something each task would do on its own. Do you suggest these manipulations should be implemented without high-level systemd APIs, with the controller manipulating the cgroups directly?

All changes to cgroup attributes must go through systemd. If the WM wants to freeze an app or adjust its OOM score, it needs to issue systemd bus calls for that. The runaway stuff I can't follow: the kernel will distribute CPU evenly among running apps if all want it, so I'm not seeing why more monitoring is needed. The thermal stuff is probably best done in-kernel, I guess... Too dangerous/subject-to-latency for userspace, no?

Lennart

-- Lennart Poettering - Red Hat, Inc.
Re: [systemd-devel] [HEADSUP] cgroup changes
On Fri, Jun 21, 2013 at 2:17 PM, Lennart Poettering lenn...@poettering.net wrote: The runaway stuff I can't follow: the kernel will distribute CPU evenly among running apps if all want it, so I'm not seeing why more monitoring is needed. The thermal stuff is probably best done in-kernel, I guess... Too dangerous/subject-to-latency for userspace, no?

Only userspace (the WM) can distinguish between e.g. a foreground and a background application, decide that the CPU consumption of certain background apps is excessive, and throttle them down further, which is somewhat similar to using the freezer to just SIGSTOP them entirely. Thermal throttling from userspace lets you distinguish between "never make my SETI turn the fan on" and "throttle the entire system when I reach high fan speeds". You can't do that in the kernel. [1] Arguably this could be done in-task and not by an external controller, but then you're trusting the task to do the right thing, which may not be something you want to do.

Auke

[1] Note that the new Intel P-state driver by Dirk Brandewie changes how things work with nice(). The old behaviour was abused by folks running bitcoin miners at nice values, which caused ondemand to do something irrational: nice-only tasks would keep the CPU at the lowest frequencies, which is terrible from a power perspective - now every daemon running at a nice value takes much longer to complete its task, burning more power than when it had raced to idle.
Re: [systemd-devel] [HEADSUP] cgroup changes
On Fri, Jun 21, 2013 at 11:47 PM, Kok, Auke-jan H auke-jan.h@intel.com wrote: Only userspace can distinguish between e.g. a foreground and background application (WM) and decide that CPU consumption of certain apps in the background is excessive, and throttle it down further,

This would probably be some bus call to the systemd --user instance managing the services in the session, if that's what you mean?

Kay
Re: [systemd-devel] [HEADSUP] cgroup changes
On Fri, Jun 21, 2013 at 3:07 PM, Kay Sievers k...@vrfy.org wrote: This would probably be some bus call to the systemd --user instance managing the services in the session, if that's what you mean?

For instance, yes.

Auke