Re: [systemd-devel] [HEADSUP] cgroup changes

2013-07-16 Thread Lennart Poettering
On Tue, 25.06.13 08:31, Brian Bockelman (bbock...@cse.unl.edu) wrote:

 
 On Jun 25, 2013, at 4:56 AM, Lennart Poettering lenn...@poettering.net 
 wrote:
 
  On Tue, 25.06.13 02:21, Brian Bockelman (bbock...@cse.unl.edu) wrote:
  
  A few questions came to mind which may provide interesting input 
  to your design process:
  1) I use cgroups heavily for resource accounting.  Do you envision 
   me querying via dbus for each accounting attribute?  Or do you 
   envision me querying for the cgroup name, then accessing the 
  controller statistics directly?
  
  Good question. Tejun wants systemd to cover that too. I am not entirely
  sure. I don't like the extra roundtrip for measuring the accounting
  bits. But maybe we can add a library that avoids the roundtrip, and
  simply provides you with high-level accounting values for cgroups. That
  way, for *changing* things you'd need to go via the bus, for *reading*
  things we'd give you a library that goes directly to the cgroupfs and
  avoids the roundtrip.
 
 I like this idea.  Hopefully single-writer, multiple-reader is a more 
 sustainable path forward.
 
 What about the notification APIs?  We currently use the
 memory.oom_control to get a notification when a job hits limits (this
 allows us to know the job died due to memory issues, as the user code
 itself typically just SIGSEGV's).  Is subscribing to notifications
 considered reading or writing in this case?

That sounds like another case for the library, i.e. it's considered more
like reading. That said, I think the current notification infrastructure
for cgroup attributes is really awful, so I am not too keen to support
that right away.
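(For concreteness: the cgroup-v1 OOM notification Brian describes works by creating an eventfd and writing "&lt;eventfd&gt; &lt;attribute fd&gt;" into the group's cgroup.event_control file. A minimal sketch of the bookkeeping involved; the file format follows the documented v1 layout, but the helper names here are made up:)

```python
def parse_oom_control(text):
    """Parse the contents of a cgroup-v1 memory.oom_control file
    (lines of "key value") into a dict of ints."""
    result = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key:
            result[key] = int(value)
    return result

def event_control_line(eventfd_fd, oom_control_fd):
    """Registration string written to cgroup.event_control to tie an
    eventfd to memory.oom_control notifications (cgroup v1)."""
    return "%d %d" % (eventfd_fd, oom_control_fd)

# Example contents as documented for cgroup v1:
state = parse_oom_control("oom_kill_disable 0\nunder_oom 1\n")
# state["under_oom"] == 1 means the group is currently under OOM
```

After registration, a blocking read on the eventfd returns whenever the group hits its memory limit, which is the "job died due to memory issues" signal the batch system wants.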

Lennart

-- 
Lennart Poettering - Red Hat, Inc.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-29 Thread Brian Bockelman

On Jun 25, 2013, at 4:56 AM, Lennart Poettering lenn...@poettering.net wrote:

 On Tue, 25.06.13 02:21, Brian Bockelman (bbock...@cse.unl.edu) wrote:
 
 A few questions came to mind which may provide interesting input 
 to your design process:
 1) I use cgroups heavily for resource accounting.  Do you envision 
  me querying via dbus for each accounting attribute?  Or do you 
  envision me querying for the cgroup name, then accessing the 
 controller statistics directly?
 
 Good question. Tejun wants systemd to cover that too. I am not entirely
 sure. I don't like the extra roundtrip for measuring the accounting
 bits. But maybe we can add a library that avoids the roundtrip, and
 simply provides you with high-level accounting values for cgroups. That
 way, for *changing* things you'd need to go via the bus, for *reading*
 things we'd give you a library that goes directly to the cgroupfs and
 avoids the roundtrip.

I like this idea.  Hopefully single-writer, multiple-reader is a more
sustainable path forward.

What about the notification APIs?  We currently use the memory.oom_control to 
get a notification when a job hits limits (this allows us to know the job died 
due to memory issues, as the user code itself typically just SIGSEGV's).  Is 
subscribing to notifications considered reading or writing in this case?

 
 2) I currently fork and setup the resource environment (namespaces, 
  environment, working directory, etc).  Can an appropriately privileged 
  process create a sub-slice, place itself in it, and then drop privs 
 / exec?
 
 We'll probably have a way for you to take an existing set of processes
 and turn them dynamically into a new unit in systemd. These units would
 be mostly like service units, except that systemd wouldn't start the
 processes; they would be foreign-created. We are not sure about
 the name for this yet (i.e. whether to cover it under the .service
 suffix), but we'll probably call them Scopes, with the suffix
 .scope.
 
 The scope units could then be manipulated at runtime for (cgroup based)
 resource management the way normal services are too.
 
 So basically, a service unit could be assigned to a slice unit, and
 could then create scope units which detach subprocesses from the
 original service unit, and get their own cgroup in the same slice or any
 other.
 

This sounds manageable.

 
 5) Will I be able to delegate management of a subslice to a
 non-privileged user?
 
 Unlikely, at least for the beginning. 
 

(Very) long-term, this is attractive for us.  We prefer the batch system to run 
as unprivileged when possible (and to sacrifice the minimal amount of 
functionality to do so!).

 I'm excited to see new ideas (again, having system tools be aware of 
 the batch system activity is intriguing [2]), but am a bit worried about
 losing functionality and the cost of porting things to the new era!
 
 There's certainly going to be some lost flexibility. But of course we'll
 try to cover all interesting usecases.

I'll try to lurk and provide guidance about how us nutty batch system folks may 
try to use it.

 
 [2] Hopefully something that works better than 
 ps xawf -eo pid,user,cgroup,args which currently segfaults for me :(
 
 Hmm, could you file a bug, please?
 

Couldn't figure out a patch -- too little time.  However, I at least tracked 
down the offending code.  Bug report is here:

https://bugzilla.redhat.com/show_bug.cgi?id=977854

Thanks,

Brian





Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-25 Thread Lennart Poettering
On Mon, 24.06.13 17:09, Andy Lutomirski (l...@amacapital.net) wrote:

 
 On Mon, Jun 24, 2013 at 4:57 PM, Lennart Poettering
 lenn...@poettering.net wrote:
  On Mon, 24.06.13 16:01, Andy Lutomirski (l...@amacapital.net) wrote:
 
  AFAICT the main reason that systemd uses cgroup is to efficiently
  track which service various processes came from and to send signals,
  and it seems like that use case could be handled without cgroups at
  all by creative use of subreapers and a syscall to broadcast a signal
  to everything that has a given subreaper as an ancestor.  In that
  case, systemd could be asked to stay away from cgroups even in the
  single-hierarchy case.
 
  systemd uses cgroups to manage services. Managing services means many
  things. Among them: keeping track of processes, listing processes of a
  service, killing processes of a service, doing per-service logging
  (which means reliably, immediately, and race-freely tracing back
  messages to the service which logged them), about 55 other things, and
  also resource management.
 
  I don't see how I can do any of this without something like
  cgroups, i.e. hierarchical resource management involving systemd, which
  allows me to securely put labels on processes.
 
 Boneheaded straw-man proposal: two new syscalls and a few spare processes.
 
 int sys_task_reaper(int tid): Returns the reaper for the task tid
 (which is 1 if there's no subreaper).  (This could just as easily be a
 file in /proc.)
 
 int sys_killall_under_subreaper(int subreaper, int sig): Broadcasts
 sig to all tasks under subreaper (excluding subreaper).  Guarantees
 that, even if those tasks are forking, they all get the signal.
 
 Then, when starting a service, systemd forks, sets the child to be a
 subreaper, then forks that child again to exec the service.
 
 Does this do everything that's needed?  

No. It doesn't do anything that's needed. How do I list all PIDs in a
service with this? How do I determine the service of a PID? How do I do
resource management with this?

 sys_task_reaper is trivial to
 implement (that functionality is already there in the reparenting
 code), and sys_killall_under_subreaper is probably not so bad.
 
 This has one main downside I can think of: it wastes a decent number
 of processes (one subreaper per service).

Yeah, also the downside that it doesn't do what we need. 

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-25 Thread Lennart Poettering
On Tue, 25.06.13 02:21, Brian Bockelman (bbock...@cse.unl.edu) wrote:

 A few questions came to mind which may provide interesting input 
 to your design process:
 1) I use cgroups heavily for resource accounting.  Do you envision 
   me querying via dbus for each accounting attribute?  Or do you 
   envision me querying for the cgroup name, then accessing the 
 controller statistics directly?

Good question. Tejun wants systemd to cover that too. I am not entirely
sure. I don't like the extra roundtrip for measuring the accounting
bits. But maybe we can add a library that avoids the roundtrip, and
simply provides you with high-level accounting values for cgroups. That
way, for *changing* things you'd need to go via the bus, for *reading*
things we'd give you a library that goes directly to the cgroupfs and
avoids the roundtrip.
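(The read path Lennart sketches, going straight to cgroupfs with no bus round trip, boils down to knowing where a group's accounting files live. An illustration assuming a v1-style mount at /sys/fs/cgroup; the helper names and the example cgroup path are hypothetical:)

```python
import os

V1_ROOT = "/sys/fs/cgroup"  # assumed v1-style mount point

def attribute_path(controller, cgroup, attribute, root=V1_ROOT):
    """Filesystem path of a cgroup attribute, e.g. a memory counter
    for a service's cgroup."""
    return os.path.join(root, controller, cgroup.lstrip("/"), attribute)

def read_counter(controller, cgroup, attribute):
    """Read a single-integer accounting attribute directly from
    cgroupfs -- the no-round-trip read path discussed above."""
    with open(attribute_path(controller, cgroup, attribute)) as f:
        return int(f.read())

p = attribute_path("memory", "/system.slice/httpd.service",
                   "memory.usage_in_bytes")
# -> "/sys/fs/cgroup/memory/system.slice/httpd.service/memory.usage_in_bytes"
```

Writes (limits, shares) would still go through the manager's bus API; only reads bypass it.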

 2) I currently fork and setup the resource environment (namespaces, 
   environment, working directory, etc).  Can an appropriately privileged 
   process create a sub-slice, place itself in it, and then drop privs 
 / exec?

We'll probably have a way for you to take an existing set of processes
and turn them dynamically into a new unit in systemd. These units would
be mostly like service units, except that systemd wouldn't start the
processes; they would be foreign-created. We are not sure about
the name for this yet (i.e. whether to cover it under the .service
suffix), but we'll probably call them Scopes, with the suffix
.scope.

The scope units could then be manipulated at runtime for (cgroup based)
resource management the way normal services are too.

So basically, a service unit could be assigned to a slice unit, and
could then create scope units which detach subprocesses from the
original service unit, and get their own cgroup in the same slice or any
other.
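(This scope idea later shipped as systemd's transient units: the manager's StartTransientUnit D-Bus call takes a unit name, a job mode, and a property list. A sketch that only builds the call's arguments, without a bus connection; the property names "PIDs" and "Slice" reflect the API systemd eventually grew, but treat the whole thing as illustrative:)

```python
def make_scope_request(name, pids, slice_unit="system.slice"):
    """Build StartTransientUnit-style arguments that wrap existing,
    foreign-created processes in a new scope unit."""
    if not name.endswith(".scope"):
        raise ValueError("scope units must carry the .scope suffix")
    properties = [
        ("PIDs", list(pids)),   # the existing processes to adopt
        ("Slice", slice_unit),  # which slice the scope lands in
    ]
    # (unit name, job mode, properties, auxiliary units)
    return (name, "fail", properties, [])

req = make_scope_request("batch-job-42.scope", [1234, 1235])
```

A batch system would fork its job as usual, then ask the manager to turn the resulting PIDs into a scope it can apply resource limits to.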



 3) More generally, will I be able to interact with slices directly, or 
   will I need to create throw-away units and launch them via systemd 
   (versus a normal fork/exec)?

Basically, with this scope concept in place, you'd create a throw-away
scope. In fact, scope units can only be created as throw-away units.

 - The latter causes quite a bit of anxiety for me - we currently 
   support many POSIX platforms plus Windows (hey - at least 
   we dropped HPUX) and I'd like to avoid a completely independent 
   code path for spawning jobs on Linux.
 4) Will many short-lived jobs cause any heartache?  Would anything 
   untoward happen to my system if I spawned / destroyed jobs (and 
   corresponding units or slices) at, say, 1Hz?

Well, the idea is that these scopes are very lightweight. And we need
to make them scale (but I don't see why they shouldn't).

 5) Will I be able to delegate management of a subslice to a
 non-privileged user?

Unlikely, at least for the beginning. 

 I'm excited to see new ideas (again, having system tools be aware of 
 the batch system activity is intriguing [2]), but am a bit worried about
 losing functionality and the cost of porting things to the new era!

There's certainly going to be some lost flexibility. But of course we'll
try to cover all interesting usecases.

 [2] Hopefully something that works better than 
  ps xawf -eo pid,user,cgroup,args which currently segfaults for me :(

Hmm, could you file a bug, please?

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-25 Thread Andy Lutomirski
On Jun 25, 2013 2:43 AM, Lennart Poettering lenn...@poettering.net
wrote:

 On Mon, 24.06.13 17:09, Andy Lutomirski (l...@amacapital.net) wrote:

 
  On Mon, Jun 24, 2013 at 4:57 PM, Lennart Poettering
  lenn...@poettering.net wrote:
   On Mon, 24.06.13 16:01, Andy Lutomirski (l...@amacapital.net) wrote:
  
   AFAICT the main reason that systemd uses cgroup is to efficiently
   track which service various processes came from and to send signals,
   and it seems like that use case could be handled without cgroups at
   all by creative use of subreapers and a syscall to broadcast a signal
   to everything that has a given subreaper as an ancestor.  In that
   case, systemd could be asked to stay away from cgroups even in the
   single-hierarchy case.
  
   systemd uses cgroups to manage services. Managing services means many
   things. Among them: keeping track of processes, listing processes of a
   service, killing processes of a service, doing per-service logging
   (which means reliably, immediately, and race-freely tracing back
   messages to the service which logged them), about 55 other things, and
   also resource management.
  
   I don't see how I can do any of this without something like
   cgroups, i.e. hierarchical resource management involving systemd, which
   allows me to securely put labels on processes.
 
  Boneheaded straw-man proposal: two new syscalls and a few spare
processes.
 
  int sys_task_reaper(int tid): Returns the reaper for the task tid
  (which is 1 if there's no subreaper).  (This could just as easily be a
  file in /proc.)
 
  int sys_killall_under_subreaper(int subreaper, int sig): Broadcasts
  sig to all tasks under subreaper (excluding subreaper).  Guarantees
  that, even if those tasks are forking, they all get the signal.
 
  Then, when starting a service, systemd forks, sets the child to be a
  subreaper, then forks that child again to exec the service.
 
  Does this do everything that's needed?

 No. It doesn't do anything that's needed. How do I list all PIDs in a
 service with this?

Walk /proc/subreaper/children recursively.  A kernel patch to make that
field show up unconditionally instead of hiding under EXPERT would help.
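(Andy's suggested walk amounts to a recursive traversal of the children lists. A toy version over an in-memory map standing in for the /proc reads:)

```python
def pids_under(subreaper, children):
    """Collect every PID below a subreaper, the way a recursive walk
    over /proc/<pid>/task/<tid>/children would.  `children` maps a
    PID to its direct children (a stand-in for reading /proc)."""
    found = []
    stack = list(children.get(subreaper, []))
    while stack:
        pid = stack.pop()
        found.append(pid)
        stack.extend(children.get(pid, []))
    return found

# subreaper 100 parents 101 and 102; 101 parents 103
tree = {100: [101, 102], 101: [103]}
sorted(pids_under(100, tree))  # [101, 102, 103]
```

Note this walk is inherently racy against forking processes, which is part of Lennart's objection: cgroup membership gives an atomic answer, a /proc traversal does not.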

 How do I determine the service of a PID?

Call sys_task_reaper, then look up what service that subreaper comes from.

 How do I do
 resource management with this?

With cgroups, unless the admin has configured systemd not to use cgroups,
in which case you don't.  (The whole point would be to keep
DefaultControllers= without using the one and only cgroup hierarchy.)

--Andy


  sys_task_reaper is trivial to
  implement (that functionality is already there in the reparenting
  code), and sys_killall_under_subreaper is probably not so bad.
 
  This has one main downside I can think of: it wastes a decent number
  of processes (one subreaper per service).

 Yeah, also the downside that it doesn't do what we need.

 Lennart

 --
 Lennart Poettering - Red Hat, Inc.


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Lennart Poettering
On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:

 1. I put the entire world into a separate, highly constrained
 cgroup.  My real-time code runs outside that cgroup.  This seems to
 be exactly what slices are for, but I need kernel threads to go into
 the constrained cgroup.  Will systemd support this?

I am not sure whether the ability to move kernel threads into cgroups
will stay around at all, from the kernel side. Tejun, can you comment on this?
 
 2. I manage services and tasks outside systemd (for one thing, I
 currently use Ubuntu, but even if I were on Fedora, I have a bunch
 of fine-grained things that figure out how they're supposed to
 allocate resources, and porting them to systemd just to keep working
 in the new world order would be a PITA [1]).
 
 (cgroups have the odd feature that they are per-task, not per thread
 group, and the systemd proposal seems likely to break anything that
 actually wants task granularity.  I may actually want to use this,
 even though it's a bit evil -- my real-time thread groups have
 non-real-time threads.)

Here too, Tejun is pretty keen on removing the ability of splitting up
threads into cgroups from the kernel, and will only allow this
per-process. Tejun, please comment!

 I think that what I want is something like sub-unit cgroups -- I
 want to be able to ask systemd to further subdivide the group for my
 unit, login session, or whatever.  Would this be reasonable?
 (Another way of thinking of this is that a unit would have a whole
 cgroup hierarchy instead of just one cgroup.)

The idea is not even to allow this. Basically, if you want to partition
your daemon into different cgroups you need to do that through systemd's
abstractions: slices and services. To make this more palatable we'll
introduce throw-away units, though, so that you can dynamically run
something as a workload and don't need to be concerned about naming
it, or cleaning it up.

 I think that the single-hierarchy model will require that I
 subdivide my user session so that the default sub-unit cgroup is
 constrained similarly to the default slice.  I'll lose
 functionality, but I don't think this is a showstopper.
 
 A different approach would be to allow units to (with systemd's
 cooperation) escape into their own, dynamically created unit.  This
 seems kind of awful.

This is basically what I meant by throw-away units. 

 3. My code runs unprivileged, but it still wants to configure
 itself. If needed, I can write a little privileged daemon to handle
 the systemd calls.

So, at least in the beginning, I am pretty sure we'll restrict
manipulating the resource parameters to root only, since this is much
more security-sensitive than one might assume and we simply can't
oversee all of it.

 I think I can get away without anything fancy if a unit (login
 session?) grants the right to manipulate sub-unit cgroups to a
 non-root user.

As mentioned, this will not be possible.

 4. As mentioned, I'm on Ubuntu some of the time.  I'd like to keep
 the same code working on systemd and non-systemd systems.
 
 How hard would it be to run systemd as just a cgroup controller?
 That is, have systemd create its slices, run exactly one unit that
 represents the whole system, and let other things use the cgroup
 API.

I have no idea, I don't develop Ubuntu. They will have to come up with
some cgroup maintenance daemon of their own. As I know them, they'll
either do a port of the systemd counterpart (but that's going to be
tough!), or they'll stick something half-baked into Upstart...

Sorry, if this all sounds a bit disappointing. But yeah, this all is not
a trivial change...

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Lennart Poettering
On Fri, 21.06.13 14:47, Kok, Auke-jan H (auke-jan.h@intel.com) wrote:

  Do you suggest these manipulations should be implemented without high
  level systemd API's and the controller just manipulates the cgroups
  directly?
 
  All changes to cgroup attributes must go through systemd. If the WM
  wants to freeze or adjust OOM he needs to issue systemd bus calls for
  that.
 
  The run-away stuff I can't follow: the kernel will distribute CPU
  evenly among running apps if all want it, so I'm not seeing why more
  monitoring is needed.
 
  The thermal stuff is probably best done in-kernel i guess... Too
  dangerous/subject-to-latency for userspace, no?
 
 Only userspace can distinguish between e.g. a foreground and
 background application (WM) and decide that CPU consumption of certain
 apps in the background is excessive, and throttle it down further,
 which is somewhat similar to using freezer to just SIGSTOP them
 entirely basically.

Yes, userspace can do that via systemd; there will be high-level
operations on the bus for this. For example: SetCPUShares() to alter the
cpu.shares value, and so on. This method call will do much more, though,
than just write this value. One of the complexities of the cgroup stuff
here is that adding a unit to a controller like "cpu" means you have to
do the same for all its immediate siblings (i.e. other units in the
same slice) plus all its parent slices (and recursively their
siblings). Why? Because otherwise you might end up granting the service
that is in the "cpu" controller the same amount of CPU *in total* as the
other services in the same slice get *individually* for each
process. And that would be grossly unfair...
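(The unfairness can be made concrete with a toy model of the v1 "cpu" controller: at a given level, runnable entities split CPU in proportion to their shares, and a child cgroup and a loose task each count as one entity at the default weight of 1024. Assumptions: an idealized fair scheduler and everything CPU-bound.)

```python
DEFAULT_SHARES = 1024  # cpu.shares default; a loose task has the same weight

def split_cpu(entities):
    """Toy model: divide 100% of CPU among sibling entities in
    proportion to their shares (child cgroups and loose tasks alike)."""
    total = sum(shares for _, shares in entities)
    return {name: shares / total for name, shares in entities}

# One service placed in its own cpu cgroup, next to three processes of a
# sibling service that was never added to the "cpu" controller:
shares = split_cpu([
    ("a.service (cgroup)", DEFAULT_SHARES),
    ("b process 1", DEFAULT_SHARES),
    ("b process 2", DEFAULT_SHARES),
    ("b process 3", DEFAULT_SHARES),
])
# a.service as a whole gets 25% -- the same as each b process gets
# individually, which is the unfairness described above.
```

Hence the need to move all siblings (and parent slices) into the controller together, which is exactly the bookkeeping a single-writer manager can do atomically.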

 Thermal throttling from userspace allows you to distinguish between
 "never make my SETI turn the fan on" and "throttle the entire system
 when I reach high fan speeds". You can't do that in the kernel. [1]
 Arguably this could be done in-task and not by an external controller,
 but you're still trusting the task to do the right thing, which may
 not be something you want to do.

So, if userspace needs to communicate something to kernel space about
what kind of cooling strategy it would prefer, and that is done via
cgroups, then I am sure we can add similar high-level per-unit control
calls to systemd, too.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Daniel P. Berrange
On Mon, Jun 24, 2013 at 03:27:15PM +0200, Lennart Poettering wrote:
 On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:
 
  1. I put the entire world into a separate, highly constrained
  cgroup.  My real-time code runs outside that cgroup.  This seems to
  be exactly what slices are for, but I need kernel threads to go into
  the constrained cgroup.  Will systemd support this?
 
 I am not sure whether the ability to move kernel threads into cgroups
 will stay around at all, from the kernel side. Tejun, can you comment
 on this?

KVM uses the vhost_net device for accelerating guest network I/O
paths. This device creates a new kernel thread on each open(),
and that kernel thread is attached to the cgroup associated
with the process that open()d the device.

If systemd allows for a process to be moved between cgroups, then
it must also be capable of moving any associated kernel threads to
the new cgroup at the same time. This co-placement of vhost-net
threads with the KVM process, is very critical for I/O performance
of KVM networking.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Andy Lutomirski

On 06/21/2013 10:36 AM, Lennart Poettering wrote:


2) This hierarchy becomes private property of systemd. systemd will set
it up. Systemd will maintain it. Systemd will rearrange it. Other
software that wants to make use of cgroups can do so only through
systemd's APIs. This single-writer logic is absolutely necessary, since
the interdependencies between the various controllers, attributes, and
cgroups are non-obvious, and we simply cannot allow cgroup users to
alter the tree independently of each other forever. Due to all this: the
Pax Cgroup document is a thing of the past; it is dead.




If you are using non-trivial cgroup setups with systemd right now, then
things will change for you. We will provide you with similar
functionality as before, but things will be different and less
low-level. As long as you only used the high-level options such as
CPUShares, MemoryLimit and so on you should be on the safe side.



Hmm.  This may be tricky for my use case.  Here are a few issues.  For 
all I know, they may already be supported (or planned), but I don't want 
to get caught.


1. I put the entire world into a separate, highly constrained 
cgroup.  My real-time code runs outside that cgroup.  This seems to be 
exactly what slices are for, but I need kernel threads to go into the 
constrained cgroup.  Will systemd support this?


2. I manage services and tasks outside systemd (for one thing, I 
currently use Ubuntu, but even if I were on Fedora, I have a bunch of 
fine-grained things that figure out how they're supposed to allocate 
resources, and porting them to systemd just to keep working in the new 
world order would be a PITA [1]).


(cgroups have the odd feature that they are per-task, not per thread 
group, and the systemd proposal seems likely to break anything that 
actually wants task granularity.  I may actually want to use this, even 
though it's a bit evil -- my real-time thread groups have non-real-time 
threads.)


I think that what I want is something like sub-unit cgroups -- I want 
to be able to ask systemd to further subdivide the group for my unit, 
login session, or whatever.  Would this be reasonable?  (Another way of 
thinking of this is that a unit would have a whole cgroup hierarchy 
instead of just one cgroup.)


I think that the single-hierarchy model will require that I subdivide my 
user session so that the default sub-unit cgroup is constrained 
similarly to the default slice.  I'll lose functionality, but I don't 
think this is a showstopper.


A different approach would be to allow units to (with systemd's 
cooperation) escape into their own, dynamically created unit.  This 
seems kind of awful.


3. My code runs unprivileged, but it still wants to configure itself. 
If needed, I can write a little privileged daemon to handle the systemd 
calls.


I think I can get away without anything fancy if a unit (login session?) 
grants the right to manipulate sub-unit cgroups to a non-root user.


4. As mentioned, I'm on Ubuntu some of the time.  I'd like to keep the 
same code working on systemd and non-systemd systems.


How hard would it be to run systemd as just a cgroup controller?  That 
is, have systemd create its slices, run exactly one unit that represents 
the whole system, and let other things use the cgroup API.



[1] Some day, I might convert my code to use a session systemd instance. 
 I'm not holding my breath, but it could be nice.


--Andy


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Andy Lutomirski
On Mon, Jun 24, 2013 at 6:27 AM, Lennart Poettering
lenn...@poettering.net wrote:
 On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:


 2. I manage services and tasks outside systemd (for one thing, I
 currently use Ubuntu, but even if I were on Fedora, I have a bunch
 of fine-grained things that figure out how they're supposed to
 allocate resources, and porting them to systemd just to keep working
 in the new world order would be a PITA [1]).


[...]


 I think that what I want is something like sub-unit cgroups -- I
 want to be able to ask systemd to further subdivide the group for my
 unit, login session, or whatever.  Would this be reasonable?
 (Another way of thinking of this is that a unit would have a whole
 cgroup hierarchy instead of just one cgroup.)

 The idea is not even to allow this. Basically, if you want to partition
 your daemon into different cgroups you need to do that through systemd's
 abstractions: slices and services. To make this more palatable we'll
 introduce throw-away units, though, so that you can dynamically run
 something as a workload and don't need to be concerned about naming
 it, or cleaning it up.


Hmm.  My particular software can maybe live with this with unpleasant
modifications, but this will break anything that, say, accepts a
connection from a client, forks into a (possibly new) cgroup based on
the identity of that client, and then does something.

How can this support containers or the use of cgroups in a
non-systemwide systemd instance?  Containers may no longer be allowed
to escape from the cgroup they start in, but there should (IMO) still
be a way for things to subdivide their cgroup-controlled resources.

If I want to have a hierarchy more than two levels deep, I suspect I'm
SOL under this model.  If I'm understanding correctly, there will be
slices, then units, and that's it.


 4. As mentioned, I'm on Ubuntu some of the time.  I'd like to keep
 the same code working on systemd and non-systemd systems.

 How hard would it be to run systemd as just a cgroup controller?
 That is, have systemd create its slices, run exactly one unit that
 represents the whole system, and let other things use the cgroup
 API.

 I have no idea, I don't develop Ubuntu. They will have to come up with
 some cgroup maintenance daemon of their own. As I know them, they'll
 either do a port of the systemd counterpart (but that's going to be
 tough!), or they'll stick something half-baked into Upstart...

 Sorry, if this all sounds a bit disappointing. But yeah, this all is not
 a trivial change...


I'm worried that the impedance mismatch between systemd and any other
possible API is going to be enormous.  On systemd, I'll have to:

 - Create a throwaway unit
 - Figure out how to wire up stdout and stderr correctly (I use them
for communication between processes)
 - Translate the current directory, the environment, etc. into systemd
configuration
 - Translate my desired resource controls into systemd's "let's
pretend that there aren't really cgroups underlying it" configuration
 - Start the throwaway unit
 - Figure out how to get notified when it finishes

Without systemd, I'll have to:

 - fork()
 - Ask whatever is managing cgroups to switch me to a different cgroup
 - exec()

This is going to suck, I think.

--Andy


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread David Strauss
On Fri, Jun 21, 2013 at 10:36 AM, Lennart Poettering
lenn...@poettering.net wrote:
 As long as you only used the high-level options such as
 CPUShares, MemoryLimit and so on you should be on the safe side.

This is already representative of how we're doing things in large-scale
production and how we recommend other users use cgroups on
systemd-based distributions.

So, +1.

--
David Strauss
   | da...@davidstrauss.net
   | +1 512 577 5827 [mobile]


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Tejun Heo
Hello,

On Mon, Jun 24, 2013 at 02:39:53PM +0100, Daniel P. Berrange wrote:
 On Mon, Jun 24, 2013 at 03:27:15PM +0200, Lennart Poettering wrote:
  On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:
  
   1. I put the entire world into a separate, highly constrained
   cgroup.  My real-time code runs outside that cgroup.  This seems to
   be exactly what slices are for, but I need kernel threads to go into
   the constrained cgroup.  Will systemd support this?
  
  I am not sure whether the ability to move kernel threads into cgroups
  will stay around at all, from the kernel side. Tejun, can you comment
  on this?
 
 KVM uses the vhost_net device for accelerating guest network I/O
 paths. This device creates a new kernel thread on each open(),
 and that kernel thread is attached to the cgroup associated
 with the process that open()d the device.
 
 If systemd allows for a process to be moved between cgroups, then
 it must also be capable of moving any associated kernel threads to
 the new cgroup at the same time. This co-placement of vhost-net
 threads with the KVM process, is very critical for I/O performance
 of KVM networking.

Yeah, the way virt drivers use cgroups right now is pretty hacky.  I
was thinking about adding a per-process workqueue which follows the
cgroup association of the process after the unified hierarchy and then
converting virt to use that.

At any rate, those kthreads can be moved via cgroup.procs, so unified
hierarchy wouldn't break it from kernel side.  Not sure how the
interface would look from systemd side tho.

Thanks.

-- 
tejun


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Tejun Heo
Hello,

On Mon, Jun 24, 2013 at 03:27:15PM +0200, Lennart Poettering wrote:
 On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:
 
  1. I put the entire world into a separate, highly constrained
  cgroup.  My real-time code runs outside that cgroup.  This seems to
  be exactly what slices are for, but I need kernel threads to go into
  the constrained cgroup.  Will systemd support this?
 
 I am not sure whether the ability to move kernel threads into cgroups
 will stay around at all, from the kernel side. Tejun, can you comment on this?

Any kernel threads with PF_NO_SETAFFINITY set already can't be removed
from the root cgroup.  In general, I don't think moving kernel threads
into !root cgroups is a good idea.  They're in most cases shared
resources and userland doesn't really have much idea what they're
actually doing, which is the fundamental issue.

Which kthreads are running on the kernel side and what they're doing
is strictly an implementation detail of the kernel.  There's no
effort from kernel side in keeping them stable and userland is likely
to get things completely wrong - e.g. many kernel threads named after
workqueues in any recent kernels don't actually do anything until the
system is under heavy memory pressure.  Userland can't tell and has no
control over what's being executed where at all and that's the way it
should be.

That said, there are cases where certain async executions are
concretely bound to userland processes - say, (planned) aio updates,
virt drivers and so on.  Right now, virt implements something pretty
hacky but I think they'll have to be tied closer to the usual process
mechanism - ie. they should be saying that these kthreads are serving
this process and should be treated as such in terms of resource
control rather than the current "move this kthread to this set of
cgroups, don't ask why" thing.  Another not-well-thought-out aspect of
the current cgroup.  :(

I have an idea where it should be headed in the long term but am not
sure about short-term solution.  Given that the only sort-of widespread
use case is virt kthreads, maybe it just needs to be special cased for
now.  Not sure.

  2. I manage services and tasks outside systemd (for one thing, I
  currently use Ubuntu, but even if I were on Fedora, I have a bunch
  of fine-grained things that figure out how they're supposed to
  allocate resources, and porting them to systemd just to keep working
  in the new world order would be a PITA [1]).
  
  (cgroups have the odd feature that they are per-task, not per thread
  group, and the systemd proposal seems likely to break anything that
  actually wants task granularity.  I may actually want to use this,
  even though it's a bit evil -- my real-time thread groups have
  non-real-time threads.)
 
 Here too, Tejun is pretty keen on removing the ability of splitting up
 threads into cgroups from the kernel, and will only allow this
 per-process. Tejun, please comment!

Yes, again, the biggest issue is how much of low-level cgroup details
become known to individual programs.  Splitting threads into different
cgroups would in most cases mean that the binary itself would become
aware of cgroup and it's akin to burying sysctl knob tunings into
individual binaries.  cgroup is not an interface for each individual
program to fiddle with.  If certain thread-granular control is
absolutely necessary and justifiable, it's something to be added to
the existing thread API, not something to be bolted on using cgroups.

So, I'm quite strongly against allowing splitting threads of the same
process into different cgroups.
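The task granularity being argued over is visible from userspace: v1 cgroups' `tasks` file takes the per-thread kernel task ids that Linux exposes under `/proc/<pid>/task/`, whereas `cgroup.procs` deliberately works on whole thread groups. A Linux-only sketch, assuming procfs is mounted:

```python
import os
import threading

def thread_tids():
    """Kernel task ids (tids) of the current process -- the per-thread
    granularity the v1 'tasks' file accepts, and which the unified
    hierarchy's cgroup.procs deliberately does not offer."""
    return sorted(int(t) for t in os.listdir("/proc/self/task"))

# A second thread shows up as a second tid alongside the main one.
stop = threading.Event()
t = threading.Thread(target=stop.wait)
t.start()
tids = thread_tids()
stop.set()
t.join()
```

The main thread's tid equals the pid, so moving "the process" and moving "its threads" only coincide for single-threaded programs.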

Thanks.

-- 
tejun


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Tejun Heo
Hello, Andy.

On Mon, Jun 24, 2013 at 11:49:05AM -0700, Andy Lutomirski wrote:
  I have an idea where it should be headed in the long term but am not
  sure about short-term solution.  Given that the only sort-of widespread
  use case is virt kthreads, maybe it just needs to be special cased for
  now.  Not sure.
 
 I'll be okay (I think) if I can reliably set affinities of these
 threads.  I'm currently doing it with cgroups.
 
 That being said, I don't like the direction that kernel thread magic
 affinity is going.  It may be great for cache performance and reducing
random bouncing, but I have a scheduling-jitter-sensitive workload and
 I don't care about overall system throughput.  I need the kernel to
 stay the f!k off my important cpus, and arranging for this to happen
 is becoming increasingly complicated.

Why is it becoming increasingly complicated?  The biggest change
probably was the shared workqueue pool implementation but that was
years ago and workqueue has grown pool attributes recently adding more
properly designed flexibility and, for example, adding default
affinity for !per-cpu workqueues should be pretty easy now.  But
anyways, if it's an issue, it should be examined and properly solved
rather than hacking up a hacky solution with cgroup.

 cgroups are most certainly something that a binary can be aware of.
 It's not like a sysctl knob at all -- it's per process.  I have lots

No, it definitely is not.  Sure it is more granular than sysctl but
that's it.  It exposes control knobs which are directly tied into
kernel implementation details.  It is not a properly designed
programming API by any stretch of imagination.  It is an extreme
failure on the kernel side that that part hasn't been made crystal
clear from the beginning.  I don't know how intentional it was but the
whole thing is completely botched.

cgroup *never* was held to the standard necessary for any widely
available API and many of the controls it exposes are exactly at the
level of sysctls.  As the interface was filesystem, it could evade
scrutiny and with the hierarchical organization also gave the
impression that it's something which can be used directly by
individual applications.  It found a loophole in the way we implement
and police kernel APIs and then exploited it like there's no tomorrow.

We are firmly bound to maintain what already has been exposed from the
kernel side and I'm not gonna break any of them but the free-for-all
cgroup is broken and deprecated.  It's gonna wither and fade away and
any attempt to reverse that will be met with extreme prejudice.

 of binaries that have worked quite well for a couple years that move
 themselves into different cgroups.  I have no problem with a unified
 hierarchy, but I need control of my little piece of the hierarchy.
 
 I don't care if the interface to do so changes, but the basic
 functionality is important.

Whether you care or not is completely irrelevant.  Individual binaries
widely incorporating cgroup details automatically binds the kernel.
It becomes excruciatingly painful to back out after certain point.  I
don't think we're there yet given the overall immaturity and brokenness
of cgroups and it's imperative that we back the hell out as fast as
possible before this insanity spreads any wider.

Thanks.

-- 
tejun


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Andy Lutomirski
On Mon, Jun 24, 2013 at 12:10 PM, Tejun Heo t...@kernel.org wrote:
 Hello, Andy.

 On Mon, Jun 24, 2013 at 11:49:05AM -0700, Andy Lutomirski wrote:
  I have an idea where it should be headed in the long term but am not
  sure about short-term solution.  Given that the only sort-of widespread
  use case is virt kthreads, maybe it just needs to be special cased for
  now.  Not sure.

 I'll be okay (I think) if I can reliably set affinities of these
 threads.  I'm currently doing it with cgroups.

 That being said, I don't like the direction that kernel thread magic
 affinity is going.  It may be great for cache performance and reducing
  random bouncing, but I have a scheduling-jitter-sensitive workload and
 I don't care about overall system throughput.  I need the kernel to
 stay the f!k off my important cpus, and arranging for this to happen
 is becoming increasingly complicated.

 Why is it becoming increasingly complicated?  The biggest change
 probably was the shared workqueue pool implementation but that was
 years ago and workqueue has grown pool attributes recently adding more
 properly designed flexibility and, for example, adding default
 affinity for !per-cpu workqueues should be pretty easy now.  But
 anyways, if it's an issue, it should be examined and properly solved
 rather than hacking up a hacky solution with cgroup.

Because more things are becoming per-cpu without the option of moving
per-cpu work done on behalf of one cpu to another cpu.  RCU is a nice
exception.


 cgroups are most certainly something that a binary can be aware of.
 It's not like a sysctl knob at all -- it's per process.  I have lots

 No, it definitely is not.  Sure it is more granular than sysctl but
 that's it.  It exposes control knobs which are directly tied into
 kernel implementation details.  It is not a properly designed
 programming API by any stretch of imagination.  It is an extreme
 failure on the kernel side that that part hasn't been made crystal
 clear from the beginning.  I don't know how intentional it was but the
 whole thing is completely botched.

 cgroup *never* was held to the standard necessary for any widely
 available API and many of the controls it exposes are exactly at the
 level of sysctls.  As the interface was filesystem, it could evade
 scrutiny and with the hierarchical organization also gave the
 impression that it's something which can be used directly by
 individual applications.  It found a loophole in the way we implement
 and police kernel APIs and then exploited it like there's no tomorrow.

 We are firmly bound to maintain what already has been exposed from the
 kernel side and I'm not gonna break any of them but the free-for-all
 cgroup is broken and deprecated.  It's gonna wither and fade away and
 any attempt to reverse that will be met with extreme prejudice.

The functionality I care about is that a program can reliably and
hierarchically subdivide system resources -- think rlimits but
actually useful.  I, and probably many other things, want this
functionality.  Yes, the current cgroup interface is awful, but it
gets one thing right: it's a hierarchy.
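For contrast, the rlimit mechanism Andy invokes is flat and per-process: a process can lower its own limits (children inherit them), but there is no nesting and no aggregate accounting across a subtree. A small sketch of that one-level control:

```python
import resource

# A process may lower its soft limit, and children inherit it -- crude,
# per-process control with no hierarchy and no subtree-wide accounting,
# which is precisely what cgroups add on top.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
new_soft = min(soft, 256)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
lowered = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

"rlimits but actually useful" then roughly means: this, but nestable, with limits that apply to a whole subtree of processes rather than one at a time.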

Back when my software ran on Windows, I used the awful job interface
to allocate resources among different parts of my software.  When I
switched to Linux, I lost some of that functionality and replaced
other bits with cgroups.  It's hackish, but it works.

Now we're apparently moving toward having a unified hierarchy
(great!), a more sane API (great!), and a nasty userspace situation
where systemd-using systems control the hierarchy through a highly
limiting systemd-specific interface and non-systemd systems do
something else which will presumably look nothing like what systemd
does.

I would argue that designing a kernel interface that requires exactly
one userspace component to manage it and ties that one userspace
component to something that can't easily be deployed everywhere (the
init system) is as big a cheat as the old approach of sneaking bad
APIs in through a filesystem was.

IOW, please, when designing this, please specify an API that programs
are permitted to use, and let that API be reviewed.

--Andy


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Tejun Heo
Hello,

On Mon, Jun 24, 2013 at 12:24:38PM -0700, Andy Lutomirski wrote:
 Because more things are becoming per-cpu without the option of moving
 per-cpu work done on behalf of one cpu to another cpu.  RCU is a nice
 exception.

Hmm... but in most cases it's per-cpu on the same cpu that initiated
the task.  If a given CPU is just crunching numbers and IRQ affinity
is properly configured, the CPU shouldn't be bothered too much by
per-cpu work items.  If there are, please let us know.  We can hunt
them down.

 The functionality I care about is that a program can reliably and
 hierarchically subdivide system resources -- think rlimits but
 actually useful.  I, and probably many other things, want this
 functionality.  Yes, the current cgroup interface is awful, but it
 gets one thing right: it's a hierarchy.

And the hierarchy support was completely broken for many resource
controllers up until only several releases ago.

 I would argue that designing a kernel interface that requires exactly
 one userspace component to manage it and ties that one userspace
 component to something that can't easily be deployed everywhere (the
 init system) is as big a cheat as the old approach of sneaking bad
 APIs in through a filesystem was.

In terms of API, it is firmly at the level of sysctl.  That's it.

While I agree that having a proper kernel API for hierarchical
resource management could be nice, that currently is out of scope.
We're already knee-deep in shit with the limited capabilities we're
trying to implement.  Also, I really don't think cgroup is the right
interface for such thing even if we get to that.  It should be part of
the usual process/thread model, not this completely separate thing on
the side.

 IOW, please, when designing this, please specify an API that programs
 are permitted to use, and let that API be reviewed.

cgroup is not that API and it's never gonna be in all likelihood.  As
for systemd vs. non-systemd compatibility, I'm afraid I don't have a
good answer.  This is still all in a pretty early phase and the
proper abstractions and APIs are being figured out.  Hopefully, we'll
converge on a mostly compatible high-level abstraction which can be
presented regardless of the actual base system implementation.

Thanks.

-- 
tejun


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Andy Lutomirski
On Mon, Jun 24, 2013 at 12:37 PM, Tejun Heo t...@kernel.org wrote:
 Hello,

 On Mon, Jun 24, 2013 at 12:24:38PM -0700, Andy Lutomirski wrote:
 Because more things are becoming per-cpu without the option of moving
 per-cpu work done on behalf of one cpu to another cpu.  RCU is a nice
 exception.

 Hmm... but in most cases it's per-cpu on the same cpu that initiated
 the task.  If a given CPU is just crunching numbers and IRQ affinity
 is properly configured, the CPU shouldn't be bothered too much by
 per-cpu work items.  If there are, please let us know.  We can hunt
 them down.

I'm not just crunching numbers -- I do (nonblocking) I/O as well.


 The functionality I care about is that a program can reliably and
 hierarchically subdivide system resources -- think rlimits but
 actually useful.  I, and probably many other things, want this
 functionality.  Yes, the current cgroup interface is awful, but it
 gets one thing right: it's a hierarchy.

 And the hierarchy support was completely broken for many resource
 controllers up until only several releases ago.

 I would argue that designing a kernel interface that requires exactly
 one userspace component to manage it and ties that one userspace
 component to something that can't easily be deployed everywhere (the
 init system) is as big a cheat as the old approach of sneaking bad
 APIs in through a filesystem was.

 In terms of API, it is firmly at the level of sysctl.  That's it.

 While I agree that having a proper kernel API for hierarchical
 resource management could be nice, that currently is out of scope.
 We're already knee-deep in shit with the limited capabilities we're
 trying to implement.  Also, I really don't think cgroup is the right
 interface for such thing even if we get to that.  It should be part of
 the usual process/thread model, not this completely separate thing on
 the side.

 IOW, please, when designing this, please specify an API that programs
 are permitted to use, and let that API be reviewed.

 cgroup is not that API and it's never gonna be in all likelihood.  As
 for systemd vs. non-systemd compatibility, I'm afraid I don't have a
 good answer.  This is still all in a pretty early phase and the
 proper abstractions and APIs are being figured out.  Hopefully, we'll
 converge on a mostly compatible high-level abstraction which can be
 presented regardless of the actual base system implementation.


So what is cgroup for?  That is, what's the goal for what the new API
should be able to do?

AFAICT the main reason that systemd uses cgroup is to efficiently
track which service various processes came from and to send signals,
and it seems like that use case could be handled without cgroups at
all by creative use of subreapers and a syscall to broadcast a signal
to everything that has a given subreaper as an ancestor.  In that
case, systemd could be asked to stay away from cgroups even in the
single-hierarchy case.

--Andy


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Tejun Heo
Hello,

On Mon, Jun 24, 2013 at 04:01:07PM -0700, Andy Lutomirski wrote:
 So what is cgroup for?  That is, what's the goal for what the new API
 should be able to do?

It is for controlling and distributing resources.  That part doesn't
change.  It's just not built to be used directly by individual
applications.  It's an admin tool just like sysctl - whether that
admin is a human or the userland base system.

There's a huge chasm between something which can be generally used by
normal applications and something which is restricted to admins and
base systems in terms of interface generality and stability, security,
how the abstractions fit together with the existing APIs and so on.
cgroup firmly belongs to the latter.  It still serves the same purpose
but isn't, in a way, developed enough to be used directly by
individual applications and I'm not even sure we want or need to
develop it to such a level.

Thanks.

-- 
tejun


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Andy Lutomirski
On Mon, Jun 24, 2013 at 4:19 PM, Tejun Heo t...@kernel.org wrote:
 Hello,

 On Mon, Jun 24, 2013 at 04:01:07PM -0700, Andy Lutomirski wrote:
 So what is cgroup for?  That is, what's the goal for what the new API
 should be able to do?

 It is for controlling and distributing resources.  That part doesn't
 change.  It's just not built to be used directly by individual
 applications.  It's an admin tool just like sysctl - whether that
 admin is a human or the userland base system.

 There's a huge chasm between something which can be generally used by
 normal applications and something which is restricted to admins and
 base systems in terms of interface generality and stability, security,
 how the abstractions fit together with the existing APIs and so on.
 cgroup firmly belongs to the latter.  It still serves the same purpose
 but isn't, in a way, developed enough to be used directly by
 individual applications and I'm not even sure we want or need to
 develop it to such a level.

My application is running on a single-purpose system I administer.

I guess what I'm trying to say here is that many systems will rather
fundamentally use systemd.  Admins of those systems should still have
access to a reasonably large subset of cgroup functionality.  If the
single-hierarchy model is going to prevent going around systemd and if
systemd isn't going to expose all of the useful cgroup functionality,
then perhaps there should be a way to separate systemd's hierarchy
from the cgroup hierarchy.

Looking at http://0pointer.de/blog/projects/cgroups-vs-cgroups.html,
it looks like systemd doesn't actually need the cgroup resource
control functionality.  Maybe there's a way to disentangle this stuff.
 The /proc/pid/children feature that CRIU added seems like a decent
start.

--Andy


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Andy Lutomirski
On Mon, Jun 24, 2013 at 4:37 PM, Tejun Heo t...@kernel.org wrote:
 Hello, Andy.

 On Mon, Jun 24, 2013 at 04:27:17PM -0700, Andy Lutomirski wrote:
 I guess what I'm trying to say here is that many systems will rather
 fundamentally use systemd.  Admins of those systems should still have
 access to a reasonably large subset of cgroup functionality.  If the
 single-hierarchy model is going to prevent going around systemd and if
 systemd isn't going to expose all of the useful cgroup functionality,
 then perhaps there should be a way to separate systemd's hierarchy
 from the cgroup hierarchy.

 I don't think systemd will prevent you from buildling your own
 hierarchy on the side.  It sure won't be properly supported and things
 might break in corner cases / over time but if you're willing to take
 such risks anyway...  In the long term tho, what should happen
 probably is examining use cases like yours and then incorporating
 sensible mechanisms to support that into the base system
 infrastructure.  It might not be completely identical but I'm sure
 over time we'll be able to find what are the fundamental pieces and
 proper abstractions.  Right now, we're exposing way too much without
 even clearly understanding what is being enabled.  It is
 unsustainable.

Now I'm confused.  I thought that support for multiple hierarchies was
going away.  Is it here to stay after all?

--Andy


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Tejun Heo
Hello,

On Mon, Jun 24, 2013 at 4:38 PM, Andy Lutomirski l...@amacapital.net wrote:
 Now I'm confused.  I thought that support for multiple hierarchies was
 going away.  Is it here to stay after all?

It is going to be deprecated but also stay around for quite a while.
That said, I didn't mean to use multiple hierarchies. I was saying
that if you build a sub-hierarchy in the unified hierarchy, you're
likely to get away with it in most cases.

Thanks.

--
tejun


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Andy Lutomirski
On Mon, Jun 24, 2013 at 4:40 PM, Tejun Heo t...@kernel.org wrote:
 Hello,

 On Mon, Jun 24, 2013 at 4:38 PM, Andy Lutomirski l...@amacapital.net wrote:
 Now I'm confused.  I thought that support for multiple hierarchies was
 going away.  Is it here to stay after all?

 It is going to be deprecated but also stay around for quite a while.
 That said, I didn't mean to use multiple hierarchies. I was saying
 that if you build a sub-hierarchy in the unified hierarchy, you're
 likely to get away with it in most cases.

Isn't that exactly what I was originally asking for?  Quoting from
earlier in the thread:

On Mon, Jun 24, 2013 at 6:27 AM, Lennart Poettering
lenn...@poettering.net wrote:
 On Sat, 22.06.13 15:19, Andy Lutomirski (l...@amacapital.net) wrote:

 2. I manage services and tasks outside systemd (for one thing, I
 currently use Ubuntu, but even if I were on Fedora, I have a bunch
 of fine-grained things that figure out how they're supposed to
 allocate resources, and porting them to systemd just to keep working
 in the new world order would be a PITA [1]).


[...]


 I think that what I want are something like sub-unit cgroups -- I
 want to be able to ask systemd to further subdivide the group for my
 unit, login session, or whatever.  Would this be reasonable?
 (Another way of thinking of this is that a unit would have a whole
 cgroup hierarchy instead of just one cgroup.)

 The idea is not even to allow this. Basically, if you want to partition
 your daemon into different cgroups you need to do that through systemd's
 abstractions: slices and services. To make this more palatable we'll
 introduce throw-away units though, so that you can dynamically run
 something as a workload and don't need to be concerned about naming
 this, or cleaning it up.


If I can subdivide my service in the hierarchy, then I'm happy.  If
this gets lost *and* systemd insists on controlling the one and only
cgroup hierarchy, then I think I have serious problems with the new
regime.

--Andy


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Lennart Poettering
On Mon, 24.06.13 16:01, Andy Lutomirski (l...@amacapital.net) wrote:

 AFAICT the main reason that systemd uses cgroup is to efficiently
 track which service various processes came from and to send signals,
 and it seems like that use case could be handled without cgroups at
 all by creative use of subreapers and a syscall to broadcast a signal
 to everything that has a given subreaper as an ancestor.  In that
 case, systemd could be asked to stay away from cgroups even in the
 single-hierarchy case.

systemd uses cgroups to manage services. Managing services means many
things. Among them: keeping track of processes, listing processes of a
service, killing processes of a service, doing per-service logging
(which means reliably, immediately, and race-freely tracing back
messages to the service which logged them), about 55 other things, and
also resource management.

I don't see how I can do any of this without something like cgroups,
i.e. hierarchical resource management which allows me to securely put
labels on processes.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Andy Lutomirski
On Mon, Jun 24, 2013 at 4:57 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Mon, 24.06.13 16:01, Andy Lutomirski (l...@amacapital.net) wrote:

 AFAICT the main reason that systemd uses cgroup is to efficiently
 track which service various processes came from and to send signals,
 and it seems like that use case could be handled without cgroups at
 all by creative use of subreapers and a syscall to broadcast a signal
 to everything that has a given subreaper as an ancestor.  In that
 case, systemd could be asked to stay away from cgroups even in the
 single-hierarchy case.

 systemd uses cgroups to manage services. Managing services means many
 things. Among them: keeping track of processes, listing processes of a
 service, killing processes of a service, doing per-service logging
 (which means reliably, immediately, and race-freely tracing back
 messages to the service which logged them), about 55 other things, and
 also resource management.

 I don't see how I can do any of this without something like cgroups,
 i.e. hierarchical resource management which allows me to securely put
 labels on processes.

Boneheaded straw-man proposal: two new syscalls and a few spare processes.

int sys_task_reaper(int tid): Returns the reaper for the task tid
(which is 1 if there's no subreaper).  (This could just as easily be a
file in /proc.)

int sys_killall_under_subreaper(int subreaper, int sig): Broadcasts
sig to all tasks under subreaper (excluding subreaper).  Guarantees
that, even if those tasks are forking, they all get the signal.

Then, when starting a service, systemd forks, sets the child to be a
subreaper, then forks that child again to exec the service.

Does this do everything that's needed?  sys_task_reaper is trivial to
implement (that functionality is already there in the reparenting
code), and sys_killall_under_subreaper is probably not so bad.


This has one main downside I can think of: it wastes a decent number
of processes (one subreaper per service).

--Andy


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-24 Thread Brian Bockelman
Lennart Poettering lennart at poettering.net writes:

 
 2) This hierarchy becomes private property of systemd. systemd will set
 it up. Systemd will maintain it. Systemd will rearrange it. Other
 software that wants to make use of cgroups can do so only through
 systemd's APIs. This single-writer logic is absolutely necessary, since
 interdependencies between the various controllers, the various
 attributes, the various cgroups are non-obvious and we simply cannot
 allow cgroup users to alter the tree independently of each other
 forever. Due to all this: The Pax Cgroup document is a thing of the
 past, it is dead.
 

Hi [1],

I currently contribute cgroup support to a batch system
(http://research.cs.wisc.edu/htcondor/) and am trying to figure out
how this will affect me.

Right now, I take the resources provided by the cgroup set up by the
sysadmin and sub-divide them amongst the running jobs.  Cgroups are
used for resource management, resource accounting, and job management
(using the freezer controller to deliver signals to all processes at
once).  Jobs last between seconds and hours; a setup time of, say,
several hundred milliseconds is acceptable - as long as we can easily
create and destroy many jobs.

A few questions came to mind which may provide interesting input
to your design process:
1) I use cgroups heavily for resource accounting.  Do you envision
   me querying via dbus for each accounting attribute?  Or do you
   envision me querying for the cgroup name, then accessing the
   controller statistics directly?
2) I currently fork and set up the resource environment (namespaces,
   environment, working directory, etc).  Can an appropriately
   privileged process create a sub-slice, place itself in it, and then
   drop privs / exec?
3) More generally, will I be able to interact with slices directly, or
   will I need to create throw-away units and launch them via systemd
   (versus a normal fork/exec)?
   - The latter causes quite a bit of anxiety for me - we currently
     support many POSIX platforms plus Windows (hey - at least we
     dropped HPUX) and I'd like to avoid a completely independent
     code path for spawning jobs on Linux.
4) Will many short-lived jobs cause any heartache?  Would anything
   untoward happen to my system if I spawned / destroyed jobs (and
   corresponding units or slices) at, say, 1Hz?
5) Will I be able to delegate management of a subslice to a
   non-privileged user?
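On question 1, "accessing the controller statistics directly" starts from the per-process mapping the kernel already exports: /proc/self/cgroup gives controller -> cgroup path, and the statistics files then live under the corresponding cgroupfs mount (mount points vary by distribution, so the join step is left out). A sketch of the parsing step:

```python
import os

def own_cgroups(path="/proc/self/cgroup"):
    """Map controller name -> cgroup path for the current process.
    v1 lines look like '4:memory:/system.slice/foo.service'; the v2
    unified hierarchy uses a single '0::/...' line (empty name)."""
    mapping = {}
    if not os.path.exists(path):        # non-Linux or no cgroup support
        return mapping
    with open(path) as f:
        for line in f:
            _, controllers, cgpath = line.rstrip("\n").split(":", 2)
            for name in controllers.split(","):
                mapping[name] = cgpath  # '' keys the v2 unified entry
    return mapping

mine = own_cgroups()
```

Reading, say, memory usage is then a matter of opening the accounting file under the memory controller's mount at `mine["memory"]` - no round-trip through a manager required, which is the asymmetry the read-side library idea is meant to preserve.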

I'm excited to see new ideas (again, having system tools be aware of 
the batch system activity is intriguing [2]), but am a bit worried about
losing functionality and the cost of porting things to the new era!

Thanks!

Brian

[1] apologies if the reply comes through mangled; posting through
  the gmane web interface.
[2] Hopefully something that works better than 
 ps xawf -eo pid,user,cgroup,args which currently segfaults for me :(


___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] [HEADSUP] cgroup changes

2013-06-21 Thread Lennart Poettering
Heya,

On monday I posted this mail:

http://lists.freedesktop.org/archives/systemd-devel/2013-June/011388.html

Here's an update and a bit on the bigger picture:

Half of what I mentioned there is now in place. There's now a new
slice unit type in git, and everything is hooked up to it. logind
will now also keep track of running containers/VMs. The various
container/VM managers have to register with logind now. This serves
the purpose of better integration of containers/VMs everywhere (so
that ps can show, for each process, where it belongs). However, the
main reason for this is that it is eventually going to be the only
way containers/VMs can get a cgroup of their own.

So, in that context, a bit of the bigger picture:

It took us a while to realize the full extent of how awfully unusable
cgroups currently are. The attributes have way more interdependencies
than people might think, and it is trivial to create nonsensical
configurations...

Of course, understanding how awful the status quo is makes a good first
step. But we really needed to figure out what we can do about it, to
clean things up in the long run and to get to something useful
quickly. So, after much discussion between Tejun (the kernel cgroup
maintainer) and various other folks, here's the new scheme that we want
to go for:

1) In the long run there's only going to be a single kernel cgroup
hierarchy, the per-controller hierarchies will go away. The single
hierarchy will allow controllers to be individually enabled for each
cgroup. The net effect is that the hierarchies the controllers see are
not orthogonal anymore, they are always subtrees of the full single
hierarchy.

2) This hierarchy becomes the private property of systemd. systemd will
set it up. Systemd will maintain it. Systemd will rearrange it. Other
software that wants to make use of cgroups can do so only through
systemd's APIs. This single-writer logic is absolutely necessary, since
the interdependencies between the various controllers, the various
attributes, and the various cgroups are non-obvious, and we simply cannot
allow cgroup users to alter the tree independently of each other
forever. Due to all this: the Pax Cgroup document is a thing of the
past; it is dead.

3) systemd will almost entirely hide the fact that cgroups are used
internally. In fact, we will take away the unit configuration options
ControlGroup=, ControlGroupModify=, ControlGroupPersistent=, and
ControlGroupAttribute= in their entirety. The high-level options
CPUShares=, MemoryLimit=, and so on will continue to exist, and we'll
add additional ones like them. The system.conf setting
DefaultControllers=cpu will go away too. Basically, you'll get more
high-level settings, but all the low-level bits will go away without
replacement. We will take away the ability for the admin to set
arbitrary low-level attributes, to arrange things in completely
arbitrary cgroup trees, or to enable arbitrary controllers for a service.
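As a sketch of what this leaves the admin with, a unit file would carry only the high-level knobs. CPUShares= and MemoryLimit= are the directives named above; the service name and ExecStart= line here are made up for illustration:

```ini
# Hypothetical service using only high-level resource settings;
# systemd translates these into cgroup attributes internally.
[Service]
ExecStart=/usr/bin/my-daemon
CPUShares=1500
MemoryLimit=512M
```

No ControlGroup*= lines and no raw attribute names appear anywhere; the mapping onto the cgroup tree is systemd's business alone.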

4) systemd git introduced a new unit type called slice (see
above). This is for partitioning up the resources of the system into
slices. Slices are hierarchical, and other units (such as services, but
also containers/VMs and logged-in users) can then be assigned to these
slices. Slices internally map to cgroups, but they are a very high-level
construct. Slices will expose the same CPUShares=, MemoryLimit=
properties as the other units do. This means resource management will
become a first-class, built-in functionality of systemd. You can create
slices for your customers, and in them subslices for their departments,
and then run services, users, and VMs in them. In the long run these will
even be dynamically movable (while they are running), but that'll take
more kernel work. By default there will be three slices: system.slice
(where all system services are located by default), user.slice (where
all logged-in users are located by default), and machine.slice (where all
running VMs/containers are located by default). However, the admin will
have full freedom to create arbitrary slices and then move the other
units into them.
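A detail worth sketching: in the slice scheme, dashes in a slice name encode the hierarchy, so "customer-dept.slice" lives below "customer.slice" in the cgroup tree. The helper below illustrates that mapping; the function name is mine, and this is an assumption about the naming convention rather than a documented API:

```python
def slice_cgroup_path(name: str) -> str:
    """Map a slice unit name to its cgroup path.

    Assumes the convention that each dash-separated prefix of the
    slice name is itself a parent slice, e.g.
    "customer-dept.slice" -> "customer.slice/customer-dept.slice".
    """
    suffix = ".slice"
    if not name.endswith(suffix):
        raise ValueError("not a slice unit: %s" % name)
    parts = name[:-len(suffix)].split("-")
    # Each successively longer prefix becomes one level of the tree.
    levels = ["-".join(parts[:i]) + suffix for i in range(1, len(parts) + 1)]
    return "/".join(levels)
```

So a customer/department layout like the one described above nests naturally without the admin ever touching cgroup paths.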

5) systemd's logind daemon already kept track of logged-in
users/sessions. It is now extended to also keep track of virtual
machines/containers. In fact, this is how libvirt/nspawn and friends
will now get their own cgroups. They register as a machine, which means
passing a bit of meta info to systemd and getting a cgroup assigned in
response. This registration ensures that ps and friends can show
which VM/container a process belongs to, and it easily allows other
tools to query container/VM info too, so that in the long run we'll be
able to provide a level of container/VM integration like Solaris zones
can.

So, this all together sounds like an awful lot of change. #1 and #2 are
long-term changes. However, #3, #4, #5 are something we can do now and
should do now, as preparation for the single-writer, unified cgroup
tree. We really, really shouldn't ship the cgroup mess for longer, 

Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-21 Thread Kok, Auke-jan H
On Fri, Jun 21, 2013 at 10:36 AM, Lennart Poettering
lenn...@poettering.net wrote:
 Heya,

 On monday I posted this mail:

 http://lists.freedesktop.org/archives/systemd-devel/2013-June/011388.html

 Here's an update and a bit on the bigger picture:

Thanks for doing this - I am really looking forward to seeing this all
take shape, and I hope to be able to leverage this in the future :^)

All the points below are great, and problems that I've encountered in
the past have all hinted towards this being the right way forward.

#2 below has my interest - when you have some ideas about how the API
will look I'd like to review it and match against our use cases...

Auke


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-21 Thread Lennart Poettering
On Fri, 21.06.13 12:59, Kok, Auke-jan H (auke-jan.h@intel.com) wrote:

  http://lists.freedesktop.org/archives/systemd-devel/2013-June/011388.html
 
  Here's an update and a bit on the bigger picture:
 
 Thanks for doing this - I am really looking forward to seeing this all
 take shape, and I hope to be able to leverage this in the future :^)
 
 All the points below are great, and problems that I've encountered in
 the past have all hinted towards this being the right way forward.
 
 #2 below has my interest - when you have some ideas about how the API
 will look I'd like to review it and match against our use cases...

Point #2 is precisely about not having APIs for this... ;-)

So, in the future, when you have some service, and that service wants to
alter some cgroup resource limits for itself (let's say: set its own cpu
shares value to 1500), this is what should happen: the service should
use a call like sd_pid_get_unit() to get its own unit name, and then use
dbus to invoke SetCPUShares(1500) for that service. systemd will then do
the rest. (*)

Lennart

(*) to make this even simpler we have been thinking of defining a new
virtual bus object path /org/freedesktop/systemd1/self/ or so, which
will always point to the caller's own unit. This would be similar to
/proc/self/, which also points to each process's own PID dir... With
that in place you could then set any resource setting you want with a
single bus method call.
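For illustration, here is roughly the lookup that sd_pid_get_unit() performs under the hood: scan the process's /proc/PID/cgroup lines (format "ID:subsystems:/path") for the name=systemd hierarchy and take the innermost path component that looks like a unit. This is a hedged re-sketch in Python, not the libsystemd implementation:

```python
def unit_from_cgroup(cgroup_text: str):
    """Extract the owning unit name from /proc/PID/cgroup contents.

    Looks for the "name=systemd" hierarchy line and returns the last
    path component containing a unit-style suffix (a dot), or None.
    """
    for line in cgroup_text.splitlines():
        fields = line.split(":", 2)
        if len(fields) == 3 and fields[1] == "name=systemd":
            for component in reversed(fields[2].split("/")):
                if "." in component:
                    return component
    return None
```

With the unit name in hand, the service would then issue the SetCPUShares(1500)-style bus call Lennart describes; the proposed /org/freedesktop/systemd1/self/ object would remove even this lookup step.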

-- 
Lennart Poettering - Red Hat, Inc.


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-21 Thread Kok, Auke-jan H
On Fri, Jun 21, 2013 at 1:10 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Fri, 21.06.13 12:59, Kok, Auke-jan H (auke-jan.h@intel.com) wrote:

  http://lists.freedesktop.org/archives/systemd-devel/2013-June/011388.html
 
  Here's an update and a bit on the bigger picture:

 Thanks for doing this - I am really looking forward to seeing this all
 take shape, and I hope to be able to leverage this in the future :^)

 All the points below are great, and problems that I've encountered in
 the past have all hinted towards this being the right way forward.

 #2 below has my interest - when you have some ideas about how the API
 will look I'd like to review it and match against our use cases...

 Point #2 is precisely about not having APIs for this... ;-)

 So, in the future, when you have some service, and that service wants to
 alter some cgroup resource limits for itself (let's say: set its own cpu
 shares value to 1500), this is what should happen: the service should
 use a call like sd_pid_get_unit() to get its own unit name, and then use
 dbus to invoke SetCPUShares(1500) for that service. systemd will then do
 the rest. (*)

 Lennart

 (*) to make this even simpler we have been thinking of defining a new
 virtual bus object path /org/freedesktop/systemd1/self/ or so which
 will always points to the callers own unit. This would be similar to
 /proc/self/ which also points to its own PID dir for each
 process... With that in place you could then set any resource setting
 you want with a single bus method call.

This is fine for applications that manage themselves, but I'm seeing
more interest in use cases where we want external influence on cgroup
hierarchies, for instance:

- foreground/background priorities - a window manager marks background
applications and puts them in the freezer, changes oom_score_adj so
that old apps can get automatically cleaned up in case memory
availability is low.
- detecting runaway apps and taking cpu slices away from them.
- thermally constraining classes of applications

Those would be tasks that an external process would do by manipulating
properties of cgroups, not something each task would do on its own.

Do you suggest these manipulations should be implemented without
high-level systemd APIs, with the controller just manipulating the
cgroups directly?

Auke


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-21 Thread Lennart Poettering
On Fri, 21.06.13 14:10, Kok, Auke-jan H (auke-jan.h@intel.com) wrote:

  So, in the future, when you have some service, and that service wants to
  alter some cgroup resource limits for itself (let's say: set its own cpu
  shares value to 1500), this is what should happen: the service should
  use a call like sd_pid_get_unit() to get its own unit name, and then use
  dbus to invoke SetCPUShares(1500) for that service. systemd will then do
  the rest. (*)
 
  Lennart
 
  (*) to make this even simpler we have been thinking of defining a new
  virtual bus object path /org/freedesktop/systemd1/self/ or so which
  will always points to the callers own unit. This would be similar to
  /proc/self/ which also points to its own PID dir for each
  process... With that in place you could then set any resource setting
  you want with a single bus method call.
 
 This is fine for applications that manage themselves, but I'm seeing
 more interest in use cases where we want external influence on cgroup
 hierarchies, for instance:
 
 - foreground/background priorities - a window manager marks background
 applications and puts them in the freezer, changes oom_score_adj so
 that old apps can get automatically cleaned up in case memory
 availability is low.
 - detecting runaway apps and taking cpu slices away from them.
 - thermally constraining classes of applications
 
 Those would be tasks that an external process would do by manipulating
 properties of cgroups, not something each task would do on it's own.
 
 Do you suggest these manipulations should be implemented without high
 level systemd API's and the controller just manipulates the cgroups
 directly?

All changes to cgroup attributes must go through systemd. If the WM
wants to freeze an app or adjust its OOM score, it needs to issue
systemd bus calls for that. 
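As a sketch of what "go through systemd" could look like from an external manager such as a WM: name a unit and a high-level property rather than touching cgroupfs. The use of systemctl set-property with a --runtime flag as the bus front-end is my assumption about the eventual interface, not something this thread specifies:

```python
import subprocess

def throttle_unit(unit: str, shares: int, dry_run: bool = True):
    """Ask systemd (not cgroupfs) to lower a unit's CPU shares.

    Assumes a set-property verb exists as a front-end to the bus call;
    with dry_run=True the command is returned for inspection instead
    of being executed.
    """
    cmd = ["systemctl", "set-property", "--runtime",
           unit, "CPUShares=%d" % shares]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

The same shape would cover the freezer and OOM-adjust cases, given corresponding high-level properties.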

The runaway stuff I can't follow: the kernel will distribute CPU 
evenly among running apps if they all want it, so I'm not seeing why
more monitoring is needed.

The thermal stuff is probably best done in-kernel, I guess... Too
dangerous/subject-to-latency for userspace, no?

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-21 Thread Kok, Auke-jan H
On Fri, Jun 21, 2013 at 2:17 PM, Lennart Poettering
lenn...@poettering.net wrote:
 On Fri, 21.06.13 14:10, Kok, Auke-jan H (auke-jan.h@intel.com) wrote:

  So, in the future, when you have some service, and that service wants to
  alter some cgroup resource limits for itself (let's say: set its own cpu
  shares value to 1500), this is what should happen: the service should
  use a call like sd_pid_get_unit() to get its own unit name, and then use
  dbus to invoke SetCPUShares(1500) for that service. systemd will then do
  the rest. (*)
 
  Lennart
 
  (*) to make this even simpler we have been thinking of defining a new
  virtual bus object path /org/freedesktop/systemd1/self/ or so which
  will always points to the callers own unit. This would be similar to
  /proc/self/ which also points to its own PID dir for each
  process... With that in place you could then set any resource setting
  you want with a single bus method call.

 This is fine for applications that manage themselves, but I'm seeing
 more interest in use cases where we want external influence on cgroup
 hierarchies, for instance:

 - foreground/background priorities - a window manager marks background
 applications and puts them in the freezer, changes oom_score_adj so
 that old apps can get automatically cleaned up in case memory
 availability is low.
 - detecting runaway apps and taking cpu slices away from them.
 - thermally constraining classes of applications

 Those would be tasks that an external process would do by manipulating
 properties of cgroups, not something each task would do on it's own.

 Do you suggest these manipulations should be implemented without high
 level systemd API's and the controller just manipulates the cgroups
 directly?

 All changes to cgroup attributes must go through systemd. If the WM
 wants to freeze or adjust OOM he needs to issue systemd bus calls for
 that.

 The run-away stuff I can't follow? the kernel will distribute CPU
 evenly among running apps if all want it, so not seeing why there's more
 monitoring needed.

 The thermal stuff is probably best done in-kernel i guess... Too
 dangerous/subject-to-latency for userspace, no?

Only userspace (e.g. the WM) can distinguish between a foreground and
a background application, decide that the CPU consumption of certain
apps in the background is excessive, and throttle them down further,
which is somewhat similar to just SIGSTOPing them entirely via the
freezer.

Thermal throttling from userspace allows you to distinguish between
"never make my SETI job turn the fan on" and "throttle the entire
system when I reach high fan speeds". You can't do that in the kernel. [1]
Arguably this could be done in-task rather than by an external
controller, but then you're still trusting the task to do the right
thing, which may not be something you want to do.


Auke


[1] Note that the new Intel P-state driver by Dirk Brandewie changes
how things work with nice(). The old behaviour was abused by folks
running bitcoin miners at nice values, which caused ondemand to do
something irrational: nice-only tasks would keep the CPU at the lowest
frequencies, which is terrible from a power perspective - now every
daemon running at a nice value takes much longer to complete its task,
burning more power than if it had raced to idle.


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-21 Thread Kay Sievers
On Fri, Jun 21, 2013 at 11:47 PM, Kok, Auke-jan H
auke-jan.h@intel.com wrote:
 Only userspace can distinguish between e.g. a foreground and
 background application (WM) and decide that CPU consumption of certain
 apps in the background is excessive, and throttle it down further,

This would probably be some bus call to the systemd --user instance
managing the services in the session, if that's what you mean?

Kay


Re: [systemd-devel] [HEADSUP] cgroup changes

2013-06-21 Thread Kok, Auke-jan H
On Fri, Jun 21, 2013 at 3:07 PM, Kay Sievers k...@vrfy.org wrote:
 On Fri, Jun 21, 2013 at 11:47 PM, Kok, Auke-jan H
 auke-jan.h@intel.com wrote:
 Only userspace can distinguish between e.g. a foreground and
 background application (WM) and decide that CPU consumption of certain
 apps in the background is excessive, and throttle it down further,

 This would probably be some bus call to the systemd --user instance
 managing the services in the session, if that's what you mean?

for instance, yes.

Auke