Re: [Documentation] State of CPU controller in cgroup v2
On Tue, Oct 04, 2016 at 10:47:17AM -0400, Tejun Heo wrote:
> > cgroup-v2, by placing the system style controllers first and foremost,
> > completely renders that scenario impossible. Note also that any proposed
> > rgroup would not work for this, since that, per design, is a subtree,
> > and therefore not disjoint.
>
> If a use case absolutely requires disjoint resource hierarchies, the
> only solution is to keep using multiple v1 hierarchies, which
> necessarily excludes the possibility of doing anything across different
> resource types.
>
> > So my objection to the whole cgroup-v2 model and implementation stems
> > from the fact that it purports to be a 'better' and 'improved' system,
> > while in actuality it neuters and destroys a lot of useful usecases.
> >
> > It completely disregards all task-controllers and labels their use-cases
> > as irrelevant.
>
> Your objection then doesn't have much to do with the specifics of the
> cgroup v2 model or implementation.

It is too; I've stated multiple times that the no-internal-tasks thing
is bad and that the root exception is an inexcusable wart that makes the
whole thing internally inconsistent.

But talking to you guys is pointless. You'll just keep moving air until
the other party tires and gives up. My NAK on v2 stands.

> It's an objection against
> establishing common resource domains as that excludes building
> orthogonal multiple hierarchies. That, necessarily, can only be
> achieved by having multiple hierarchies for different resource types
> and thus giving up the benefits of common resource domains.

Yes, v2 not allowing that rules it out as a valid model.

> Assuming that, I don't think your position is against cgroup v2 but
> more toward keeping v1 around. We're talking about two quite
> different mutually exclusive classes of use cases. You need unified
> for one and disjoint for the other. v1 is gonna be there and can
> easily be used alongside v2 for different controller types, which
> would in most cases be cpu and cpuset.
>
> I can't see a reason why this would need to block properly supporting
> containerization use cases.

I don't block that use-case, I block cgroup-v2; it's shit.

The fact is, the naming "v2" suggests it's a replacement and will
deprecate "v1". Also the implementation is mutually exclusive with v1:
you have to pick one and the other becomes inaccessible. You cannot
even pick another one inside a container, breaking the container
invariant.
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Peter.

On Tue, Sep 06, 2016 at 12:29:50PM +0200, Peter Zijlstra wrote:
> The fundamental problem is that we have 2 different types of
> controllers, on the one hand these controllers above, that work on tasks
> and form groups of them and build up from that. Let's call them
> task-controllers.
>
> On the other hand we have controllers like memcg which take the 'system'
> as a whole and shrink it down into smaller bits. Let's call these
> system-controllers.
>
> They are fundamentally at odds with capabilities, simply because of the
> granularity they can work on.

As pointed out multiple times, the picture is not that simple. For
example, eventually, we want to be able to account for cpu cycles spent
during memory reclaim or processing IOs (e.g. encryption), which can
only be tied to the resource domain, not a specific task.

There surely are things that can only be done by task-level
controllers, but there are two different aspects here. One is the
actual capabilities (e.g. hierarchical proportional cpu cycle
distribution) and the other is how such capabilities are exposed. I'll
continue below.

> Merging the two into a common hierarchy is a useful concept for
> containerization, no argument on that, esp. when also coupled with
> namespaces and the like.

Great, we now agree that comprehensive system resource control is
useful.

> However, where I object _most_ strongly is having this one use dominate
> and destroy the capabilities (which are in use) of the task-controllers.

The objection isn't necessarily just about loss of capabilities but
also about not being able to do them in the same way as v1. The reason
I proposed rgroup instead of scoped task-granularity is because I think
that a properly insulated programmable interface which is in line with
other widely used APIs is a better solution in the long run.

If we go the cgroupfs route for thread granularity, we pretty much lose
the possibility, or at least make it very difficult, to make
hierarchical resource control widely available to individual
applications. How important such use cases are is debatable. I don't
find it too difficult to imagine scenarios where individual
applications like apache or torrent clients make use of it.

Probably more importantly, rgroup, or something like it, gives an
application an officially supported way to build and expose its
resource hierarchies, which can then be used both by the application
itself and from the outside to monitor and manipulate resource
distribution.

The decision between cgroupfs thread granularity and something like
rgroup isn't an obvious one. Choosing the former is the path of lower
resistance but it comes at the cost of certain long-term benefits.

> > It could be made to work without races, though, with minimal (or even
> > no) ABI change. The managed program could grab an fd pointing to its
> > cgroup. Then it would use openat, etc for all operations. As long as
> > 'mv /cgroup/a/b /cgroup/c/' didn't cause that fd to stop working,
> > we're fine.
>
> I've mentioned openat() and related APIs several times, but so far never
> got good reasons why that wouldn't work.

Hopefully, this part was addressed in my reply to Andy.

> cgroup-v2, by placing the system style controllers first and foremost,
> completely renders that scenario impossible. Note also that any proposed
> rgroup would not work for this, since that, per design, is a subtree,
> and therefore not disjoint.

If a use case absolutely requires disjoint resource hierarchies, the
only solution is to keep using multiple v1 hierarchies, which
necessarily excludes the possibility of doing anything across different
resource types.
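[Editor's note: the fd-based scheme quoted above — grab a directory fd
for your own cgroup once, then use openat() and friends for everything —
can be sketched with ordinary directory fds. This only illustrates the
vfs property under discussion; a plain temporary directory stands in for
a cgroup directory, and "knob" is a made-up file name, not a real cgroup
interface file.]

```python
import os
import tempfile

# Hypothetical stand-in for a cgroup tree; a real managed program would
# open its own directory under /sys/fs/cgroup instead of a tempdir.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))

# Grab an fd pointing at "our" cgroup directory once, up front.
cg_fd = os.open(os.path.join(root, "a", "b"), os.O_RDONLY | os.O_DIRECTORY)

# An external manager now migrates the group: mv /cgroup/a/b /cgroup/c
os.rename(os.path.join(root, "a", "b"), os.path.join(root, "c"))

# The directory fd still refers to the same (renamed) directory, so
# openat()-style operations keep working without knowing the new path.
fd = os.open("knob", os.O_WRONLY | os.O_CREAT, dir_fd=cg_fd)
os.write(fd, b"100")
os.close(fd)

with open(os.path.join(root, "c", "knob")) as f:
    print(f.read())  # prints "100" -- the write landed in the moved directory
os.close(cg_fd)
```

The point being argued: cg_fd keeps naming the same directory across
the rename, so the managed program never needs to learn its new path.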
> So my objection to the whole cgroup-v2 model and implementation stems
> from the fact that it purports to be a 'better' and 'improved' system,
> while in actuality it neuters and destroys a lot of useful usecases.
>
> It completely disregards all task-controllers and labels their use-cases
> as irrelevant.

Your objection then doesn't have much to do with the specifics of the
cgroup v2 model or implementation. It's an objection against
establishing common resource domains as that excludes building
orthogonal multiple hierarchies. That, necessarily, can only be
achieved by having multiple hierarchies for different resource types
and thus giving up the benefits of common resource domains.

Assuming that, I don't think your position is against cgroup v2 but
more toward keeping v1 around. We're talking about two quite different
mutually exclusive classes of use cases. You need unified for one and
disjoint for the other. v1 is gonna be there and can easily be used
alongside v2 for different controller types, which would in most cases
be cpu and cpuset.

I can't see a reason why this would need to block properly supporting
containerization use cases.

Thanks.

-- tejun
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, 2016-09-30 at 11:06 +0200, Tejun Heo wrote:
> Hello, Mike.
>
> On Sat, Sep 10, 2016 at 12:08:57PM +0200, Mike Galbraith wrote:
> > On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote:
> > > > > As for your example, who performs the cgroup setup and configuration,
> > > > > the application itself or an external entity? If an external entity,
> > > > > how does it know which thread is what?
> > > >
> > > > In my case, it would be a little script that reads a config file that
> > > > knows all kinds of internal information about the application and its
> > > > threads.
> > >
> > > I see. One-of-a-kind custom setup. This is a completely valid usage;
> > > however, please also recognize that it's an extremely specific one
> > > which is niche by definition.
> >
> > This is the same pigeon hole you placed Google into. So Google, my
> > (also decidedly non-petite) users, and now Andy are all sharing the one
> > of a kind extremely specific niche.. it's becoming a tad crowded.
>
> I wasn't trying to say that these use cases are small in numbers when
> added up, but that they're all isolated in their own small silos.

These use cases exist, and are perfectly valid use cases. That is the
sum and total of what is relevant.

> Facebook has a lot of these usages too but they're almost all mutually
> exclusive. Making workloads share machines or even adding resource
> control for base system operations afterwards is extremely difficult.

The cases I have in mind are not difficult to deal with, as you don't
have to worry about collisions.

> There are cases these adhoc approaches make sense but insisting that
> this is all there is to resource control is short-sighted.

1. I never insisted any such thing.
2. Please stop pigeon-holing.

The usage cases in question are no more ad hoc than any other usage;
they are all "for this", none are globally applicable. What they are is
power users utilizing the intimate knowledge that is both required and
in the possession of power users who are in fact using controllers
precisely as said controllers were designed to be used.

No, these usages do not belong in an "adhoc" (aka disposable refuse)
pigeon-hole. I choose to ignore the one you stuffed me into.

-Mike
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Mike.

On Sat, Sep 10, 2016 at 12:08:57PM +0200, Mike Galbraith wrote:
> On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote:
> > > > As for your example, who performs the cgroup setup and configuration,
> > > > the application itself or an external entity? If an external entity,
> > > > how does it know which thread is what?
> > >
> > > In my case, it would be a little script that reads a config file that
> > > knows all kinds of internal information about the application and its
> > > threads.
> >
> > I see. One-of-a-kind custom setup. This is a completely valid usage;
> > however, please also recognize that it's an extremely specific one
> > which is niche by definition.
>
> This is the same pigeon hole you placed Google into. So Google, my
> (also decidedly non-petite) users, and now Andy are all sharing the one
> of a kind extremely specific niche.. it's becoming a tad crowded.

I wasn't trying to say that these use cases are small in numbers when
added up, but that they're all isolated in their own small silos.
Facebook has a lot of these usages too but they're almost all mutually
exclusive. Making workloads share machines or even adding resource
control for base system operations afterwards is extremely difficult.

There are cases these adhoc approaches make sense but insisting that
this is all there is to resource control is short-sighted.

Thanks.

-- tejun
Re: [Documentation] State of CPU controller in cgroup v2
Hello,

On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
> With regard to no-internal-tasks, I see (at least) three options:
>
> 1. Keep the cgroup2 status quo. Lots of distros and such are likely
> to have their cgroup management fail if run in a container. I really,

I don't know where you're getting this. The no-internal-tasks rule has
*NOTHING* to do with how or how not cgroup v1 hierarchies can be used
inside a namespace. I suppose this is coming from the same
misunderstanding that Austin has. Please see my reply there for more
details.

> really dislike this option.

Up until this point, you haven't supplied any valid technical reasons
for your objection. Repeating "really" doesn't add to the discussion
at all. If you're indicating that you don't like it on aesthetic
grounds, please just say so.

> 2. Enforce no-internal-tasks for the root cgroup. Un-cgroupable
> things will still get accounted to the root cgroup even if subtree
> control is on, but no tasks can be in the root cgroup if the root
> cgroup has subtree control on. (If some controllers removed the
> no-internal-tasks restriction, this would apply to the root as well.)
> I think this may annoy certain users. If so, and if those users are
> doing something valid, then I think that either those users should be
> strongly encouraged or even forced to change so namespacing works for
> them or that we should do (3) instead.

Theoretically, we can do that, but what are the upsides and are they
enough to justify the added inconveniences? Up until now, the only
argument you provided is that people may do certain things in
system-root which might not work in namespace-root, but that isn't a
critical problem. No real functionality is lost by implementing the
same behaviors both inside and outside namespaces.

> 3. Remove the no-internal-tasks restriction entirely. I can see this
> resulting in a lot of configuration awkwardness, but I think it will
> *work*, especially since all of the controllers already need to do
> something vaguely intelligent when subtree control is on in the root
> and there are tasks in the root.

The reasons for the no-internal-tasks restriction have been explained
multiple times in the documentation and throughout this thread, and we
also discussed how and why system-root is special and why allowing
system-root's special treatment doesn't break things.

> What I'm trying to say is that I think that option (1) is sufficiently
> bad that cgroup2 should do (2) or (3) instead. If option (2) is
> preferred and if it would break userspace, then I think we can work
> around it by entirely deprecating cgroup2, renaming it to cgroup3, and
> doing option (2) there. You've given reasons you don't like options
> (2) and (3). I mostly agree with those reasons, but I don't think
> they're strong enough to overcome the problems with (1).

And you keep suggesting very drastic measures for an issue which isn't
critical without providing any substantial technical reasons why such
drastic measures would be necessary. This part of the discussion
started with your misunderstanding of the implications of the
system-root being special, and the only reason you presented in the
previous message is still a, different, misunderstanding. The only
thing which isn't changing here is your opinions on how it should be.
It is a baffling situation because your opinions don't seem to be
affected at all by the validity of the reasons for thinking so.

> BTW, Mike keeps mentioning exclusive cgroups as problematic with the
> no-internal-tasks constraints. Do exclusive cgroups still exist in
> cgroup2? Could we perhaps just remove that capability entirely? I've
> never understood what problem exclusive cpusets and such solve that
> can't be more comprehensibly solved by just assigning the cpusets the
> normal inclusive way.
This was explained before during the discussion. Maybe it wasn't clear
enough. The knob is a config protector, protecting a cgroup's own
configuration from being changed. It doesn't really belong in the
kernel. My guess is that it was added because the delegation model
wasn't properly established and people tried to delegate resource
control knobs along with the cgroups and then wanted to prevent those
knobs from being changed in certain ways.

> >> What kind of migration do you mean? Having fds follow rename(2) around is
> >> the normal vfs behavior, so I don't really know what you mean.
> >
> > Process or task migration by writing pid to cgroup.procs or tasks
> > file. cgroup never supported directory / cgroup level migrations.
>
> Ugh. Perhaps cgroup2 should start supporting this. I think that
> making rename(2) work is simpler than adding a whole new API for
> rgroups, and I think it could solve a lot of the same problems that
> rgroups are trying to solve.

We haven't needed that yet, and supporting rename(2) doesn't
necessarily make the API safe in terms of migration atomicity. Also,
as pointed out in my previous reply (and
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Austin.

On Mon, Sep 12, 2016 at 11:20:03AM -0400, Austin S. Hemmelgarn wrote:
> > If you confine it to the cpu controller, ignore anonymous
> > consumptions, the rather ugly mapping between nice and weight values
> > and the fact that nobody could come up with a practical usefulness for
> > such setup, yes. My point was never that the cpu controller can't do
> > it but that we should find a better way of coordinating it with other
> > controllers and exposing it to individual applications.
>
> So, having a container where not everything in the container is split
> further into subgroups is not a practically useful situation? Because
> that's exactly what both systemd and every other cgroup management tool
> expects to have work as things stand right now. The root cgroup within a

Not true.

$ cat /proc/1/cgroup
11:hugetlb:/
10:pids:/init.scope
9:blkio:/
8:cpuset:/
7:memory:/
6:freezer:/
5:perf_event:/
4:net_cls,net_prio:/
3:cpu,cpuacct:/
2:devices:/init.scope
1:name=systemd:/init.scope
$ systemctl --version
systemd 229
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN

> cgroup namespace has to function exactly like the system-root, otherwise
> nothing can depend on the special cases for the system root, because they
> might get run in a cgroup namespace and such assumptions will be invalid.

systemd already behaves exactly the same whether it's inside a
namespace or not.

> This in turn means that no current distro can run unmodified in a cgroup
> namespace under a v2 hierarchy, which is a Very Bad Thing.

cgroup v1 hierarchies can be mounted the same inside a namespace
whether the system itself is on cgroup v1 or v2. Obviously, a given
controller can only be attached to one hierarchy, so a controller
can't be used at the same time on both v1 and v2 hierarchies; however,
that is true with different v1 hierarchies too, and, given that
delegation doesn't work properly on v1, shouldn't be that much of an
issue. I'm not just claiming it. systemd-nspawn can already be on
either v1 or v2 hierarchies regardless of what the outer systemd uses.

Out of the claims that you made, the only one which holds up is that
existing software can't make use of cgroup v2 without modifications,
which is true but at the same time doesn't mean much of anything.

Thanks.

-- tejun
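[Editor's note: the /proc/1/cgroup output quoted above is one line per
hierarchy, in hierarchy-ID:controller-list:cgroup-path form. A minimal
sketch of parsing that format; the helper name is made up and the
sample is an abbreviated copy of the listing.]

```python
# Abbreviated sample in the /proc/<pid>/cgroup format shown above.
sample = """\
11:hugetlb:/
10:pids:/init.scope
2:devices:/init.scope
1:name=systemd:/init.scope
"""

def parse_proc_cgroup(text):
    """Split each "hierarchy-ID:controllers:path" line into a tuple."""
    entries = []
    for line in text.splitlines():
        hier_id, controllers, path = line.split(":", 2)
        entries.append((int(hier_id), controllers.split(","), path))
    return entries

for hier_id, controllers, path in parse_proc_cgroup(sample):
    print(hier_id, controllers, path)
```

Note how, on this v1 setup, only pids, devices, and the named systemd
hierarchy place init in /init.scope; the rest leave it in the root.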
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, Sep 16, 2016 at 11:19:38AM -0700, Andy Lutomirski wrote:
> On Fri, Sep 16, 2016 at 9:50 AM, Peter Zijlstra wrote:
> > {1,2} {3,4} {5} seem exclusive, did I miss something? (other than that 5
> > cpu parts are 'rare').
>
> There's no overlap, so they're logically exclusive, but it avoids
> needing the "cpu_exclusive" parameter.

I'd need to double check, but I don't think you _need_ that. That's
more for enforcing nobody else steals your CPUs and 'accidentally'
creates overlaps. But if you configure it right, non-overlap should be
enough.

That is, generate_sched_domains() only uses cpusets_overlap() which is
cpumask_intersects(). Then again, it is almost 4am, so who knows.

> > So there's a problem with sticking kernel threads (and esp. kthreadd)
> > into !root groups. For example if you place it in a cpuset that doesn't
> > have all cpus, then binding your shiny new kthread to a cpu will fail.
> >
> > You can fix that of course, and we used to do exactly that, but we kept
> > running into 'fun' cases like that.
>
> Blech. But maybe this *should* have that effect. I'm sick of random
> kernel crap being scheduled on my RT CPUs and on the CPUs that I
> intend to be kept forcibly idle.

Hehe, so ideally those threads don't do anything unless the tasks
running on those CPUs explicitly ask for it. If you find any of the
CPU-bound kernel tasks do work that is unrelated to the tasks running
on that CPU, we should certainly look into it. Personally I'm not much
bothered by idle threads sitting about.
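[Editor's note: the check Peter describes — generate_sched_domains()
relying on cpusets_overlap(), i.e. cpumask_intersects() — can be
mimicked in a few lines. A sketch with made-up helper names, using the
kernel's cpulist notation for the {1,2} {3,4} {5} partitions above.]

```python
def parse_cpulist(s):
    """Parse a Linux cpulist string like "1-2,5" into a set of CPUs."""
    cpus = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def cpus_intersect(a, b):
    """Userspace mirror of cpumask_intersects(): do the masks share a CPU?"""
    return bool(parse_cpulist(a) & parse_cpulist(b))

# The three partitions from the example: {1,2} {3,4} {5}
parts = ["1-2", "3-4", "5"]
disjoint = all(not cpus_intersect(a, b)
               for i, a in enumerate(parts) for b in parts[i + 1:])
print(disjoint)  # pairwise non-overlap is what splits the sched domains
```

Prints True: the masks are pairwise disjoint, so no cpu_exclusive flag
is needed for the domains to split.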
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, Sep 16, 2016 at 9:50 AM, Peter Zijlstra wrote:
> On Fri, Sep 16, 2016 at 09:29:06AM -0700, Andy Lutomirski wrote:
>
>> > SCHED_DEADLINE, it's a 'Global'-EDF like scheduler that doesn't support
>> > CPU affinities (because that doesn't make sense). The only way to
>> > restrict it is to partition.
>> >
>> > 'Global' because you can partition it. If you reduce your system to
>> > single CPU partitions you'll reduce to P-EDF.
>> >
>> > (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
>> > partition scheme, it however does support sched_affinity, but using it
>> > gives 'interesting' schedulability results -- call it a historic
>> > accident).
>>
>> Hmm, I didn't realize that the deadline scheduler was global. But
>> ISTM requiring the use of "exclusive" to get this working is
>> unfortunate. What if a user wants two separate partitions, one using
>> CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
>> non-RT stuff)?
>
> {1,2} {3,4} {5} seem exclusive, did I miss something? (other than that 5
> cpu parts are 'rare').

There's no overlap, so they're logically exclusive, but it avoids
needing the "cpu_exclusive" parameter. It always seemed confusing to me
that a setting on a child cgroup would strictly remove a resource from
the parent.

(To be clear: I don't have any particularly strong objection to
cpu_exclusive. It just always seemed like a bit of a hack that mostly
duplicated what you could get by just setting the cpusets appropriately
throughout the hierarchy.)

>> > Note that related, but differently, we have the isolcpus boot parameter
>> > which creates single CPU partitions for all listed CPUs and gives the
>> > rest to the root cpuset. Ideally we'd kill this option given it's a boot
>> > time setting (for something which is trivial to do at runtime).
>> >
>> > But this cannot be done, because that would mean we'd have to start with
>> > a !0 cpuset layout:
>> >
>> >             '/'
>> >        load_balance=0
>> >        /            \
>> >   'system'        'isolated'
>> >   cpus=~isolcpus  cpus=isolcpus
>> >                   load_balance=0
>> >
>> > And start with _everything_ in the /system group (including default IRQ
>> > affinities).
>> >
>> > Of course, that will break everything cgroup :-(
>>
>> I would actually *much* prefer this over the status quo. I'm tired of
>> my crappy, partially-working script that sits there and creates
>> exactly this configuration (minus the isolcpus part because I actually
>> want migration to work) on boot. (Actually, it could have two
>> automatic cgroups: /kernel and /init -- init and UMH would go in init
>> and kernel threads and such would go in /kernel. Userspace would be
>> able to request that a different cgroup be used for newly-created
>> kernel threads.)
>
> So there's a problem with sticking kernel threads (and esp. kthreadd)
> into !root groups. For example if you place it in a cpuset that doesn't
> have all cpus, then binding your shiny new kthread to a cpu will fail.
>
> You can fix that of course, and we used to do exactly that, but we kept
> running into 'fun' cases like that.

Blech. But maybe this *should* have that effect. I'm sick of random
kernel crap being scheduled on my RT CPUs and on the CPUs that I
intend to be kept forcibly idle.

> The unbound workqueue stuff is totally arbitrary borkage though, that
> can be made to work just fine, TJ didn't like it for some reason which I
> really cannot remember.
>
> Also, UMH?

User mode helper. Fortunately most users are gone now, but it still
exists.
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, Sep 16, 2016 at 09:29:06AM -0700, Andy Lutomirski wrote:
> > SCHED_DEADLINE, it's a 'Global'-EDF like scheduler that doesn't support
> > CPU affinities (because that doesn't make sense). The only way to
> > restrict it is to partition.
> >
> > 'Global' because you can partition it. If you reduce your system to
> > single CPU partitions you'll reduce to P-EDF.
> >
> > (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
> > partition scheme, it however does support sched_affinity, but using it
> > gives 'interesting' schedulability results -- call it a historic
> > accident).
>
> Hmm, I didn't realize that the deadline scheduler was global. But
> ISTM requiring the use of "exclusive" to get this working is
> unfortunate. What if a user wants two separate partitions, one using
> CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
> non-RT stuff)?

{1,2} {3,4} {5} seem exclusive, did I miss something? (other than that
5 cpu parts are 'rare').

> Shouldn't we be able to have a cgroup for each of the
> DL partitions and do something to tell the deadline scheduler "here is
> your domain"?

Somewhat confused; by doing the non-overlapping domains, you do exactly
that, no? You end up with 2 (or more) independent deadline schedulers,
but if you're not running deadline tasks (like in the /system
partition) you don't care it's there.

> > Note that related, but differently, we have the isolcpus boot parameter
> > which creates single CPU partitions for all listed CPUs and gives the
> > rest to the root cpuset. Ideally we'd kill this option given it's a boot
> > time setting (for something which is trivial to do at runtime).
> >
> > But this cannot be done, because that would mean we'd have to start with
> > a !0 cpuset layout:
> >
> >             '/'
> >        load_balance=0
> >        /            \
> >   'system'        'isolated'
> >   cpus=~isolcpus  cpus=isolcpus
> >                   load_balance=0
> >
> > And start with _everything_ in the /system group (including default IRQ
> > affinities).
> >
> > Of course, that will break everything cgroup :-(
>
> I would actually *much* prefer this over the status quo. I'm tired of
> my crappy, partially-working script that sits there and creates
> exactly this configuration (minus the isolcpus part because I actually
> want migration to work) on boot. (Actually, it could have two
> automatic cgroups: /kernel and /init -- init and UMH would go in init
> and kernel threads and such would go in /kernel. Userspace would be
> able to request that a different cgroup be used for newly-created
> kernel threads.)

So there's a problem with sticking kernel threads (and esp. kthreadd)
into !root groups. For example if you place it in a cpuset that doesn't
have all cpus, then binding your shiny new kthread to a cpu will fail.

You can fix that of course, and we used to do exactly that, but we kept
running into 'fun' cases like that.

The unbound workqueue stuff is totally arbitrary borkage though, that
can be made to work just fine, TJ didn't like it for some reason which
I really cannot remember.

Also, UMH?

> Heck, even systemd would probably prefer this. Then it could cleanly
> expose a "slice" or whatever it's called for random kernel shit and at
> least you could configure it meaningfully.

No clue about systemd, I'm still on systems without that virus.
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, Sep 16, 2016 at 9:19 AM, Peter Zijlstra wrote:
> On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote:
>> On Sep 16, 2016 12:51 AM, "Peter Zijlstra" wrote:
>> >
>> > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
>> > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the
>> > > no-internal-tasks constraints. Do exclusive cgroups still exist in
>> > > cgroup2? Could we perhaps just remove that capability entirely? I've
>> > > never understood what problem exclusive cpusets and such solve that
>> > > can't be more comprehensibly solved by just assigning the cpusets the
>> > > normal inclusive way.
>> >
>> > Without exclusive sets we cannot split the sched_domain structure.
>> > Which leads to not being able to actually partition things. That would
>> > break DL for one.
>>
>> Can you sketch out a toy example?
>
> [ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]
>
> mkdir /cpuset
> mount -t cgroup -o cpuset none /cpuset
>
> mkdir /cpuset/A
> mkdir /cpuset/B
>
> cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
> echo 0 > /cpuset/A/cpuset.mems
>
> cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
> echo 1 > /cpuset/B/cpuset.mems
>
> # move all movable tasks into A
> cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks ; done
>
> # kill machine wide load-balancing
> echo 0 > /cpuset/cpuset.sched_load_balance
>
> # now place 'special' tasks in B
>
> This partitions the scheduler into two, one for each node.
>
> Hereafter no task will be moved from one node to another. The
> load-balancer is split in two, one balances in A one balances in B
> nothing crosses. (It is important that A.cpus and B.cpus do not
> intersect.)
>
> Ideally no task would remain in the root group, back in the day we could
> actually do this (with exception of the cpu bound kernel threads), but
> this has significantly regressed :-(
>
> (still hate the workqueue affinity interface)

I wonder if we could address this by creating (automatically at boot or
when the cpuset controller is enabled or whatever) a
/cpuset/random_kernel_shit cgroup and have all of the unmoveable tasks
land there?

> As is, tasks that are left in the root group get balanced within
> whatever domain they ended up in.
>
>> And what's DL?
>
> SCHED_DEADLINE, it's a 'Global'-EDF like scheduler that doesn't support
> CPU affinities (because that doesn't make sense). The only way to
> restrict it is to partition.
>
> 'Global' because you can partition it. If you reduce your system to
> single CPU partitions you'll reduce to P-EDF.
>
> (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
> partition scheme, it however does support sched_affinity, but using it
> gives 'interesting' schedulability results -- call it a historic
> accident).

Hmm, I didn't realize that the deadline scheduler was global. But ISTM
requiring the use of "exclusive" to get this working is unfortunate.
What if a user wants two separate partitions, one using CPUs 1 and 2
and the other using CPUs 3 and 4 (with 5 reserved for non-RT stuff)?
Shouldn't we be able to have a cgroup for each of the DL partitions and
do something to tell the deadline scheduler "here is your domain"?

> Note that related, but differently, we have the isolcpus boot parameter
> which creates single CPU partitions for all listed CPUs and gives the
> rest to the root cpuset. Ideally we'd kill this option given it's a boot
> time setting (for something which is trivial to do at runtime).
>
> But this cannot be done, because that would mean we'd have to start with
> a !0 cpuset layout:
>
>             '/'
>        load_balance=0
>        /            \
>   'system'        'isolated'
>   cpus=~isolcpus  cpus=isolcpus
>                   load_balance=0
>
> And start with _everything_ in the /system group (including default IRQ
> affinities).
>
> Of course, that will break everything cgroup :-(

I would actually *much* prefer this over the status quo. I'm tired of
my crappy, partially-working script that sits there and creates exactly
this configuration (minus the isolcpus part because I actually want
migration to work) on boot. (Actually, it could have two automatic
cgroups: /kernel and /init -- init and UMH would go in init and kernel
threads and such would go in /kernel. Userspace would be able to
request that a different cgroup be used for newly-created kernel
threads.)

Heck, even systemd would probably prefer this. Then it could cleanly
expose a "slice" or whatever it's called for random kernel shit and at
least you could configure it meaningfully.
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote: > On Sep 16, 2016 12:51 AM, "Peter Zijlstra" wrote: > > > > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote: > > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the > > > no-internal-tasks constraints. Do exclusive cgroups still exist in > > > cgroup2? Could we perhaps just remove that capability entirely? I've > > > never understood what problem exlusive cpusets and such solve that > > > can't be more comprehensibly solved by just assigning the cpusets the > > > normal inclusive way. > > > > Without exclusive sets we cannot split the sched_domain structure. > > Which leads to not being able to actually partition things. That would > > break DL for one. > > Can you sketch out a toy example? [ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ] mkdir /cpuset mount -t cgroup -o cpuset none /cpuset mkdir /cpuset/A mkdir /cpuset/B cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus echo 0 > /cpuset/A/cpuset.mems cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus echo 1 > /cpuset/B/cpuset.mems # move all movable tasks into A cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks ; done # kill machine wide load-balancing echo 0 > /cpuset/cpuset.sched_load_balance # now place 'special' tasks in B This partitions the scheduler into two, one for each node. Hereafter no task will be moved from one node to another. The load-balancer is split in two, one balances in A one balances in B nothing crosses. (It is important that A.cpus and B.cpus do not intersect.) Ideally no task would remain in the root group, back in the day we could actually do this (with exception of the cpu bound kernel threads), but this has significantly regressed :-( (still hate the workqueue affinity interface) As is, tasks that are left in the root group get balanced within whatever domain they ended up in. > And what's DL? 
SCHED_DEADLINE. It's a 'Global'-EDF-like scheduler that doesn't support CPU affinities (because that doesn't make sense). The only way to restrict it is to partition. 'Global' because you can partition it; if you reduce your system to single-CPU partitions you'll reduce to P-EDF.

(The same is true of SCHED_FIFO; that's a 'Global'-FIFO on the same partition scheme. It does, however, support sched_affinity, but using it gives 'interesting' schedulability results -- call it a historic accident.)

Note that, relatedly but differently, we have the isolcpus boot parameter, which creates single-CPU partitions for all listed CPUs and gives the rest to the root cpuset. Ideally we'd kill this option given it's a boot-time setting (for something which is trivial to do at runtime). But this cannot be done, because that would mean we'd have to start with a !0 cpuset layout:

              '/'
         load_balance=0
          /          \
    'system'        'isolated'
  cpus=~isolcpus   cpus=isolcpus
                   load_balance=0

And start with _everything_ in the /system group (including default IRQ affinities). Of course, that will break everything cgroup :-(
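[Editorial aside] The partitioning requirement stressed above -- that the exclusive sets' CPU lists must not intersect -- can be sanity-checked directly from the cpulist strings that files like cpuset.cpus and /sys/devices/system/node/*/cpulist use. This is an illustrative sketch, not from the thread; the helper names are invented:

```python
def parse_cpulist(s):
    """Parse a kernel cpulist string like '0-3,8,10-11' into a set of CPU ids."""
    cpus = set()
    for part in s.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def disjoint(a, b):
    """True if two cpulist strings describe non-intersecting CPU sets,
    as required for splitting the sched_domain structure."""
    return parse_cpulist(a).isdisjoint(parse_cpulist(b))

# e.g. node0 = CPUs 0-3, node1 = CPUs 4-7 on a toy two-node box
print(disjoint("0-3", "4-7"))   # True: a valid partition
print(disjoint("0-3", "3,5"))   # False: CPU 3 is shared, no partition
```

The same check is what makes the A/B cpuset example above a real partition: once the sets are disjoint and root load-balancing is off, the load-balancer cannot move tasks across the boundary.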
Re: [Documentation] State of CPU controller in cgroup v2
On Sep 16, 2016 12:51 AM, "Peter Zijlstra" wrote: > > > > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote: > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the > > no-internal-tasks constraints. Do exclusive cgroups still exist in > > cgroup2? Could we perhaps just remove that capability entirely? I've > > never understood what problem exclusive cpusets and such solve that > > can't be more comprehensibly solved by just assigning the cpusets the > > normal inclusive way. > > Without exclusive sets we cannot split the sched_domain structure. > Which leads to not being able to actually partition things. That would > break DL for one. Can you sketch out a toy example? And what's DL?
Re: [Documentation] State of CPU controller in cgroup v2
On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote: > BTW, Mike keeps mentioning exclusive cgroups as problematic with the > no-internal-tasks constraints. Do exclusive cgroups still exist in > cgroup2? Could we perhaps just remove that capability entirely? I've > never understood what problem exclusive cpusets and such solve that > can't be more comprehensibly solved by just assigning the cpusets the > normal inclusive way. Without exclusive sets we cannot split the sched_domain structure. Which leads to not being able to actually partition things. That would break DL for one.
Re: [Documentation] State of CPU controller in cgroup v2
On Wed, Sep 14, 2016 at 1:00 PM, Tejun Heo wrote:
> Hello,

With regard to no-internal-tasks, I see (at least) three options:

1. Keep the cgroup2 status quo. Lots of distros and such are likely to have their cgroup management fail if run in a container. I really, really dislike this option.

2. Enforce no-internal-tasks for the root cgroup. Un-cgroupable things will still get accounted to the root cgroup even if subtree control is on, but no tasks can be in the root cgroup if the root cgroup has subtree control on. (If some controllers removed the no-internal-tasks restriction, this would apply to the root as well.) I think this may annoy certain users. If so, and if those users are doing something valid, then I think that either those users should be strongly encouraged or even forced to change so namespacing works for them or that we should do (3) instead.

3. Remove the no-internal-tasks restriction entirely. I can see this resulting in a lot of configuration awkwardness, but I think it will *work*, especially since all of the controllers already need to do something vaguely intelligent when subtree control is on in the root and there are tasks in the root.

What I'm trying to say is that I think that option (1) is sufficiently bad that cgroup2 should do (2) or (3) instead. If option (2) is preferred and if it would break userspace, then I think we can work around it by entirely deprecating cgroup2, renaming it to cgroup3, and doing option (2) there.

You've given reasons you don't like options (2) and (3). I mostly agree with those reasons, but I don't think they're strong enough to overcome the problems with (1).

BTW, Mike keeps mentioning exclusive cgroups as problematic with the no-internal-tasks constraints. Do exclusive cgroups still exist in cgroup2? Could we perhaps just remove that capability entirely?
I've never understood what problem exclusive cpusets and such solve that can't be more comprehensibly solved by just assigning the cpusets the normal inclusive way. >> > After a migration, the cgroup and its interface knobs are a different >> > directory and files. Semantically, during migration, we aren't moving >> > the directory or files and it'd be bizarre to overlay the semantics >> > you're describing on top of the existing cgroupfs. We will have to >> > break away from the very basic vfs rules such as a fd, once opened, >> > always corresponding to the same file. >> >> What kind of migration do you mean? Having fds follow rename(2) around is >> the normal vfs behavior, so I don't really know what you mean. > > Process or task migration by writing pid to cgroup.procs or tasks > file. cgroup never supported directory / cgroup level migrations. > Ugh. Perhaps cgroup2 should start supporting this. I think that making rename(2) work is simpler than adding a whole new API for rgroups, and I think it could solve a lot of the same problems that rgroups are trying to solve. --Andy
Re: [Documentation] State of CPU controller in cgroup v2
Hello, On Mon, Sep 12, 2016 at 10:39:04AM -0700, Andy Lutomirski wrote: > > > Your idea of "trivially" doesn't match mine. You gave a use case in > > > > I suppose I wasn't clear enough. It is trivial in the sense that if > > the userland implements something which works for namespace-root, it > > would work the same in system-root without further modifications. > > So I guess userspace can trivially get it right and can just as trivially > get it wrong. I wasn't trying to play a word game. What I was trying to say is that a configuration which works for namespace-roots works for the system-root too, in terms of cgroup hierarchy, without any modifications. > > Great, now we agree that what's currently implemented is valid. I > > think you're still failing to recognize the inherent specialness of > > the system-root and how much unnecessary pain the removal of the > > exemption would cause at virtually no practical gain. I won't repeat > > the same backing points here. > > I'm starting to think that you could extend the exemption with considerably > less difficulty. Can you please elaborate? It feels like you're repeating the same opinions without really describing them in detail or backing them up in the last couple replies. Having differing opinions is fine but to actually hash them out, the opinions and their rationales need to be laid out in detail. > > There isn't much which is getting in the way of doing that. Again, > > something which follows no-internal-task rule would behave the same no > > matter where it is. The system-root is different in that it is exempt > > from the rule and thus is more flexible but that difference is serving > > the purpose of handling the inherent specialness of the system-root. > > From *userspace's* POV, I still don't think there's any specialness except > from an accounting POV. After all, userspace has no control over the > special stuff anyway.
And accounting doesn't matter: a namespace could > just see zeros in any special root accounting slots. The disagreement here isn't really consequential. The only reason this part became important is because you felt that something must be broken, which you now don't think is the case. I agree that there can be other ways to handle this but what's your proposal here? And how would that be practically and substantially better than what is implemented now? > > You've been pushing for enforcing the restriction on the system-root > > too and now are jumping to the opposite end. It's really frustrating > > that this is such a whack-a-mole game where you throw ideas without > > really thinking through them and only concede the bare minimum when > > all other logical avenues are closed off. Here, again, you seem to be > > stating a strong opinion when you haven't fully thought about it or > > tried to understand the reasons behind it. > > I think you should make it work the same way in namespace roots as it does > in the system root. I acknowledge that there are pros and cons of each. I > think the current middle ground is worse than either of the consistent > options. Again, the only thing you're doing is restating the same opinion. I understand that you have an impression that this can be done better but how exactly? > > But, whatever, let's go there: Given the arguments that I laid out for > > the no-internal-tasks rule, how does the problem seem fixable through > > relaxing the constraint? > > By deciding that, despite the arguments you laid out, it's still worth > relaxing the constraint. Or by deciding to add the constraint to the root. You're not really saying anything of substance in the above paragraph. > > > Isn't this the same thing? IIUC the constraint in question is that, > > > if a non-root cgroup has subtree control on, then it can't have > > > processes in it. This is the no-internal-tasks constraint, right?
> > > > Yes, that is what no-internal-tasks rule is but I don't understand how > > that is the same thing as process granularity. Am I completely > > misunderstanding what you are trying to say here? > > Yes. I'm saying that no-internal-tasks could be relaxed per controller. I was asking whether you were wondering whether no-internal-tasks rule and process-granularity are the same thing. And, if that's not the case, what the previous sentence meant. I can't make out what you're responding to. > > If you confine it to the cpu controller, ignore anonymous > > consumptions, the rather ugly mapping between nice and weight values > > and the fact that nobody could come up with a practical usefulness for > > such setup, yes. My point was never that the cpu controller can't do > > it but that we should find a better way of coordinating it with other > > controllers and exposing it to individual applications. > > I'm not sure what the nice-vs-weight thing has to do with internal > processes, but all of this is a question for Peter. That part is from cgroup cpu controller weig
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote: > > > As for your example, who performs the cgroup setup and configuration, > > > the application itself or an external entity? If an external entity, > > > how does it know which thread is what? > > > > In my case, it would be a little script that reads a config file that > > knows all kinds of internal information about the application and its > > threads. > > I see. One-of-a-kind custom setup. This is a completely valid usage; > however, please also recognize that it's an extremely specific one > which is niche by definition. This is the same pigeon hole you placed Google into. So Google, my (also decidedly non-petite) users, and now Andy are all sharing the one of a kind extremely specific niche.. it's becoming a tad crowded. -Mike
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote: > But, whatever, let's go there: Given the arguments that I laid out for > the no-internal-tasks rule, how does the problem seem fixable through > relaxing the constraint? Well, for one thing, cpusets would cease to leak CPUs. With the no-internal-tasks constraint, no task can acquire affinity of exclusive set A if set B is an exclusive subset thereof, as there is one and only one spot where the affinity of set A exists: in the forbidden set A. Relaxing no-internal-tasks would fix that, but without also relaxing the process-only rule, cpusets would remain useless for the purpose for which it was created. After all, it doesn't do much good to use the one and only dynamic partitioning tool to partition a box if you cannot subsequently place your tasks/threads properly therein. > What people do now with cgroup inside an application is extremely > limited. Because there is no proper support for it, each use case has > to craft up a dedicated custom setup which is all but guaranteed to be > incompatible with what someone else would come up for another > application. Everybody is in "this is mine, I control the entire > system" mindset, which is fine for those specific setups but > detrimental to making it widely available and useful. IMO, the problem with that making it available to the huddled masses bit is that it is a completely unrealistic fantasy. Can hordes of programs really autonomously carve up a single set of resources? I do not believe they can. The system agent cannot autonomously do so either. Intimate knowledge of local requirements is not optional, it is a prerequisite to sound decision making. You have to have a well defined need before it makes any sense to turn these things on, they are not free, and impact is global. -Mike
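[Editorial aside] The rule being debated throughout this thread can be stated precisely as a predicate over a cgroup tree. The following is an illustrative toy model only -- not kernel code, all names invented -- of the v2 no-internal-tasks constraint together with the root exemption that Peter objects to:

```python
class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.procs = set()            # member processes (pids)
        self.subtree_control = set()  # controllers enabled for child cgroups
        if parent is not None:
            parent.children.append(self)

def violates_no_internal_tasks(cg):
    """A non-root cgroup violates the rule if it both enables controllers
    in subtree_control and still contains processes. The root is exempt --
    the asymmetry debated in this thread."""
    if cg.parent is None:   # root exemption
        return False
    return bool(cg.subtree_control) and bool(cg.procs)

root = Cgroup("/")
root.subtree_control = {"cpu", "memory"}
root.procs = {1}                 # tasks in the root despite subtree control: allowed

a = Cgroup("a", root)
a.subtree_control = {"cpu"}
a.procs = {100}                  # same shape below the root: violation

print(violates_no_internal_tasks(root))  # False (exempt)
print(violates_no_internal_tasks(a))     # True
```

The two print statements are the whole argument in miniature: the same configuration is legal at the root and illegal one level down.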
Re: [Documentation] State of CPU controller in cgroup v2
Hello, again. On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote: > > * It doesn't bring any practical benefits in terms of capability. > > Userland can trivially handle the system-root and namespace-roots in > > a symmetrical manner. > > Your idea of "trivially" doesn't match mine. You gave a use case in I suppose I wasn't clear enough. It is trivial in the sense that if the userland implements something which works for namespace-root, it would work the same in system-root without further modifications. > which userspace might take advantage of root being special. If I was emphasizing the cases where userspace would have to deal with the inherent differences, and, when they don't, they can behave exactly the same way. > userspace does that, then that userspace cannot be run in a container. > This could be a problem for real users. Sure, "don't do that" is a > *valid* answer, but it's not a very helpful answer. Great, now we agree that what's currently implemented is valid. I think you're still failing to recognize the inherent specialness of the system-root and how much unnecessary pain the removal of the exemption would cause at virtually no practical gain. I won't repeat the same backing points here. > > * It's an unnecessary inconvenience, especially for cases where the > > cgroup agent isn't in control of boot, for partial usage cases, or > > just for playing with it. > > > > You say that I'm ignoring the same use case for namespace-scope but > > namespace-roots don't have the same hybrid function for partial and > > uncontrolled systems, so it's not clear why there even NEEDS to be > > strict symmetry. > > I think their functions are much closer than you think they are. I > want a whole Linux distro to be able to run in a container. This > means that useful things people do in a distro or initramfs or > whatever should just work if containerized. There isn't much which is getting in the way of doing that.
Again, something which follows no-internal-task rule would behave the same no matter where it is. The system-root is different in that it is exempt from the rule and thus is more flexible but that difference is serving the purpose of handling the inherent specialness of the system-root. AFAICS, it is the solution which causes the least amount of contortion and unnecessary inconvenience to userland. > > It's easy and understandable to get hangups on asymmetries or > > exemptions like this, but they also often are acceptable trade-offs. > > It's really frustrating to see you first getting hung up on "this must > > be wrong" and even after explanations repeating the same thing just in > > different ways. > > > > If there is something fundamentally wrong with it, sure, let's fix it, > > but what's actually broken? > > I'm not saying it's fundamentally wrong. I'm saying it's a design You were. > that has a big wart, and that wart is unfortunate, and after thinking > a bit, I'm starting to agree with PeterZ that this is problematic. It > also seems fixable: the constraint could be relaxed. You've been pushing for enforcing the restriction on the system-root too and now are jumping to the opposite end. It's really frustrating that this is such a whack-a-mole game where you throw ideas without really thinking through them and only concede the bare minimum when all other logical avenues are closed off. Here, again, you seem to be stating a strong opinion when you haven't fully thought about it or tried to understand the reasons behind it. But, whatever, let's go there: Given the arguments that I laid out for the no-internal-tasks rule, how does the problem seem fixable through relaxing the constraint? > >> >> Also, here's an idea to maybe make PeterZ happier: relax the > >> >> restriction a bit per-controller. Currently (except for /), if you > >> >> have subtree control enabled you can't have any processes in the > >> >> cgroup. 
Could you change this so it only applies to certain > >> >> controllers? If the cpu controller is entirely happy to have > >> >> processes and cgroups as siblings, then maybe a cgroup with only cpu > >> >> subtree control enabled could allow processes to exist. > >> > > >> > The document lists several reasons for not doing this and also that > >> > there is no known real world use case for such configuration. > > > > So, up until this point, we were talking about no-internal-tasks > > constraint. > > Isn't this the same thing? IIUC the constraint in question is that, > if a non-root cgroup has subtree control on, then it can't have > processes in it. This is the no-internal-tasks constraint, right? Yes, that is what no-internal-tasks rule is but I don't understand how that is the same thing as process granularity. Am I completely misunderstanding what you are trying to say here? > And I still think that, at least for cpu, nothing at all goes wrong if > you allow processes to exist in cgroups that have cpu set in > subtree-c
Re: [Documentation] State of CPU controller in cgroup v2
On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote: > And I still think that, at least for cpu, nothing at all goes wrong if > you allow processes to exist in cgroups that have cpu set in > subtree-control. cpu, cpuset, perf, cpuacct (although we all agree that really should be part of cpu), pid, and possibly freezer (but I think we all agree freezer is 'broken'). That's roughly half the controllers out there. They all work on tasks, and should therefore have no problems whatsoever to allow the full hierarchy without silly exceptions and constraints. The fundamental problem is that we have two different types of controllers. On the one hand these controllers above, that work on tasks and form groups of them and build up from that. Let's call them task-controllers. On the other hand we have controllers like memcg which take the 'system' as a whole and shrink it down into smaller bits. Let's call these system-controllers. They are fundamentally at odds with capabilities, simply because of the granularity they can work on. Merging the two into a common hierarchy is a useful concept for containerization, no argument on that, esp. when also coupled with namespaces and the like. However, where I object _most_ strongly is having this one use dominate and destroy the capabilities (which are in use) of the task-controllers. > > I do. It's a horrible userland API to expose to individual > > applications if the organization that a given application expects can > > be disturbed by system operations. Imagine how this would be > > documented - "if this operation races with system operation, it may > > return -ENOENT. Repeating the path lookup might make the operation > > succeed again." > > It could be made to work without races, though, with minimal (or even > no) ABI change. The managed program could grab an fd pointing to its > cgroup. Then it would use openat, etc for all operations.
As long as > 'mv /cgroup/a/b /cgroup/c/" didn't cause that fd to stop working, > we're fine. I've mentioned openat() and related APIs several times, but so far never got good reasons why that wouldn't work. Also note that in order to partition the cpus with cpusets, you're required to generate a disjoint hierarchy (that is, one where the (common) parent is 'disabled' and the children have no overlap). This is rather fundamental to partitioning, that by its very nature requires separation. The result is that if you want to place your RT threads (consider an application that consists of RT and !RT parts) in a different partition there is no common parent you can place the process in. cgroup-v2, by placing the system style controllers first and foremost, completely renders that scenario impossible. Note also that any proposed rgroup would not work for this, since that, per design, is a subtree, and therefore not disjoint. So my objection to the whole cgroup-v2 model and implementation stems from the fact that it purports to be a 'better' and 'improved' system, while in actuality it neuters and destroys a lot of useful usecases. It completely disregards all task-controllers and labels their use-cases as irrelevant.
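[Editorial aside] The fd-based access pattern discussed above -- grab a handle on your own cgroup directory, then do everything relative to it so a concurrent rename can't pull the rug out -- is ordinary vfs behavior and can be demonstrated on any directory. A sketch, with plain tmpfs directories standing in for cgroupfs and the file names purely illustrative:

```python
import os
import tempfile

base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "a", "b"))
with open(os.path.join(base, "a", "b", "cgroup.procs"), "w") as f:
    f.write("1234\n")

# The managed program opens a handle on its own "cgroup" directory.
dfd = os.open(os.path.join(base, "a", "b"), os.O_RDONLY)

# An external manager "migrates" the group: mv base/a/b base/c/b
os.makedirs(os.path.join(base, "c"))
os.rename(os.path.join(base, "a", "b"), os.path.join(base, "c", "b"))

# The open directory fd follows the rename; openat(dfd, ...) still works.
fd = os.open("cgroup.procs", os.O_RDONLY, dir_fd=dfd)
data = os.read(fd, 64).decode().strip()
print(data)  # 1234
os.close(fd)
os.close(dfd)
```

(`dir_fd` support requires Linux; on cgroupfs the open-by-fd behavior for *directories* is exactly what works today -- the dispute in the thread is about pid-write migration, which moves tasks rather than directories, so no fd exists to follow.)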
Re: [Documentation] State of CPU controller in cgroup v2
On Sat, Sep 3, 2016 at 3:05 PM, Tejun Heo wrote: > Hello, Andy. > > On Wed, Aug 31, 2016 at 02:46:20PM -0700, Andy Lutomirski wrote: >> > Consider a use case where the user isn't interested in fully >> > accounting and dividing up system resources but wants to just cap >> > resource usage from a subset of workloads. There is no reason to >> > require such usages to fully contain all processes in non-root >> > cgroups. Furthermore, it's not trivial to migrate all processes out >> > of root to a sub-cgroup unless the agent is in full control of boot >> > process. >> >> Then please also consider exactly the same use case while running in a >> container. >> >> I'm a bit frustrated that you're saying that my example failure modes >> consist of shooting oneself in the foot and then you go on to come up >> with your own examples that have precisely the same problem. > > You have a point, which is > > The system-root and namespace-roots are not symmetric. > > and that's a valid concern. Here's why the system-root is special. > [...] > > Now, due to the various issues with direct competition between > processes and cgroups, cgroup v2 disallows resource control across > them (the no-internal-tasks restriction); however, cgroup v2 currently > doesn't apply the restriction to the system-root. Here are the > reasons. > > * It doesn't bring any practical benefits in terms of implementation. > As noted above, all controllers already have to allow uncontained > consumptions in the system-root and that's the only attribute > required for the exemption. > > * It doesn't bring any practical benefits in terms of capability. > Userland can trivially handle the system-root and namespace-roots in > a symmetrical manner. Your idea of "trivially" doesn't match mine. You gave a use case in which userspace might take advantage of root being special. If userspace does that, then that userspace cannot be run in a container. This could be a problem for real users. 
Sure, "don't do that" is a *valid* answer, but it's not a very helpful answer. > > * It's an unnecessary inconvenience, especially for cases where the > cgroup agent isn't in control of boot, for partial usage cases, or > just for playing with it. > > You say that I'm ignoring the same use case for namespace-scope but > namespace-roots don't have the same hybrid function for partial and > uncontrolled systems, so it's not clear why there even NEEDS to be > strict symmetry. I think their functions are much closer than you think they are. I want a whole Linux distro to be able to run in a container. This means that useful things people do in a distro or initramfs or whatever should just work if containerized. > > It's easy and understandable to get hangups on asymmetries or > exemptions like this, but they also often are acceptable trade-offs. > It's really frustrating to see you first getting hung up on "this must > be wrong" and even after explanations repeating the same thing just in > different ways. > > If there is something fundamentally wrong with it, sure, let's fix it, > but what's actually broken? I'm not saying it's fundamentally wrong. I'm saying it's a design that has a big wart, and that wart is unfortunate, and after thinking a bit, I'm starting to agree with PeterZ that this is problematic. It also seems fixable: the constraint could be relaxed. >> >> Also, here's an idea to maybe make PeterZ happier: relax the >> >> restriction a bit per-controller. Currently (except for /), if you >> >> have subtree control enabled you can't have any processes in the >> >> cgroup. Could you change this so it only applies to certain >> >> controllers? If the cpu controller is entirely happy to have >> >> processes and cgroups as siblings, then maybe a cgroup with only cpu >> >> subtree control enabled could allow processes to exist.
>> > >> > The document lists several reasons for not doing this and also that >> > there is no known real world use case for such configuration. > > So, up until this point, we were talking about no-internal-tasks > constraint. Isn't this the same thing? IIUC the constraint in question is that, if a non-root cgroup has subtree control on, then it can't have processes in it. This is the no-internal-tasks constraint, right? And I still think that, at least for cpu, nothing at all goes wrong if you allow processes to exist in cgroups that have cpu set in subtree-control. - begin talking about process granularity - > >> My company's production workload would map quite nicely to this >> relaxed model. I have quite a few processes each with several >> threads. Some of those threads get some CPUs, some get other CPUs, >> and they vary in what shares of what CPUs they get. To be clear, >> there is not a hierarchy of resource usage that's compatible with the >> process hierarchy. Multiple processes have threads that should be >> grouped in a different place in the hierarchy than other threads. >> Concre
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Andy. On Wed, Aug 31, 2016 at 02:46:20PM -0700, Andy Lutomirski wrote: > > Consider a use case where the user isn't interested in fully > > accounting and dividing up system resources but wants to just cap > > resource usage from a subset of workloads. There is no reason to > > require such usages to fully contain all processes in non-root > > cgroups. Furthermore, it's not trivial to migrate all processes out > > of root to a sub-cgroup unless the agent is in full control of boot > > process. > > Then please also consider exactly the same use case while running in a > container. > > I'm a bit frustrated that you're saying that my example failure modes > consist of shooting oneself in the foot and then you go on to come up > with your own examples that have precisely the same problem. You have a point, which is The system-root and namespace-roots are not symmetric. and that's a valid concern. Here's why the system-root is special. * A system has entities and resource consumptions which can only be attributed to the "system". The system-root is the natural place to put them. The system-root has stuff no other cgroups, not even namespace-roots, have. It's a unique situation. * The need to bypass most cgroup related overhead when not in use. The system-root is there whether cgroup is actually in use or not and thus can not impose noticeable overhead. It has to make sense for both resource-controlled systems as well as ones that aren't. Again, no other group has these requirements. Note that this means that all controllers should be able to and already allow uncontained consumptions in the system-root. I'll come back to this later. Now, due to the various issues with direct competition between processes and cgroups, cgroup v2 disallows resource control across them (the no-internal-tasks restriction); however, cgroup v2 currently doesn't apply the restriction to the system-root. Here are the reasons.
* It doesn't bring any practical benefits in terms of implementation. As noted above, all controllers already have to allow uncontained consumptions in the system-root and that's the only attribute required for the exemption. * It doesn't bring any practical benefits in terms of capability. Userland can trivially handle the system-root and namespace-roots in a symmetrical manner. * It's an unnecessary inconvenience, especially for cases where the cgroup agent isn't in control of boot, for partial usage cases, or just for playing with it. You say that I'm ignoring the same use case for namespace-scope but namespace-roots don't have the same hybrid function for partial and uncontrolled systems, so it's not clear why there even NEEDS to be strict symmetry. On this subject, your only actual point is that there is an asymmetry and that's bothersome. I've been trying to explain why the special case doesn't actually get in the way in terms of implementation or capability and is actually beneficial. Instead of engaging in the actual discussion, you're constantly coming up with different ways of saying "it's not symmetric". The system-root and namespace-roots aren't equivalent. There are a lot of parallels between system-root and namespace-root but they aren't the same thing (e.g. bootstrapping a namespace is a less complicated and more malleable process). The system-root is not even a fully qualified node of the resource graph. It's easy and understandable to get hangups on asymmetries or exemptions like this, but they also often are acceptable trade-offs. It's really frustrating to see you first getting hung up on "this must be wrong" and even after explanations repeating the same thing just in different ways. If there is something fundamentally wrong with it, sure, let's fix it, but what's actually broken? > > I have, multiple times. Can you please read 2-1-2 of the document in > > the original post and take the discussion from there? 
> > I've read it multiple times, and I don't see any explanation that's > consistent with the fact that you are exempting the root cgroup from > this constraint. If the constraint were really critical to everything > working, then I would expect the root cgroup to have exactly the same > problem. This makes me think that either something nasty is being > fudged for the root cgroup or that the constraint isn't actually so > important after all. The only thing on point I can find is: > > > Root cgroup is exempt from this constraint, which is in line with > > how root cgroup is handled in general - it's excluded from cgroup > > resource accounting and control. > > and that's not very helpful. My apologies. I somehow thought that was part of the documentation. Will update it later, but here's an excerpt from my earlier response. Having a special case doesn't necessarily get in the way of benefiting from a set of general rules. The root cgroup is inherently special as it has to be the catch-all scope for en
Re: [Documentation] State of CPU controller in cgroup v2
On Wed, Aug 31, 2016 at 2:07 PM, Tejun Heo wrote: > Hello, > > On Wed, Aug 31, 2016 at 12:11:58PM -0700, Andy Lutomirski wrote: >> > You can say that allowing the possibility of deviation isn't a good >> > design choice but it is a design choice with other implications - on >> > how we deal with configurations without cgroup at all, transitioning >> > from v1, bootstrapping a system and avoiding surprising >> > userland-visible behaviors (e.g. like creating magic preset cgroups >> > and silently migrating processes there on certain events). >> >> Are there existing userspace programs that use cgroup2 and enable >> subtree control on / when there are processes in /? If the answer is >> no, then I think you should change cgroup2 to just disallow it. If >> the answer is yes, then I think there's a problem and maybe you should >> consider a breaking change. Given that cgroup2 hasn't really launched >> on a large scale, it seems worthwhile to get it right. > > Adding the restriction isn't difficult from an implementation point of > view and for a system agent which controls the boot process > implementing that wouldn't be difficult either but I can't see what > the actual benefits of the extra restriction would be and there are > tangible downsides to doing so. > > Consider a use case where the user isn't interested in fully > accounting and dividing up system resources but wants to just cap > resource usage from a subset of workloads. There is no reason to > require such usages to fully contain all processes in non-root > cgroups. Furthermore, it's not trivial to migrate all processes out > of root to a sub-cgroup unless the agent is in full control of the boot > process. Then please also consider exactly the same use case while running in a container. I'm a bit frustrated that you're saying that my example failure modes consist of shooting oneself in the foot and then you go on to come up with your own examples that have precisely the same problem. 
> >> I don't understand what you're talking about wrt silently migrating >> processes. Are you thinking about usermodehelper? If so, maybe it >> really does make sense to allow (or require?) the cgroup manager to >> specify which cgroup these processes end up in. > > That was from one of the ideas that I was considering way back where > enabling resource control in an intermediate node automatically moves > internal processes to a preset cgroup whether visible or hidden, which > would be another way of addressing the problem. > > None of these affects what cgroup v2 can do at all and the only thing > the userland is asked to do under the current scheme is "if you wanna > keep the whole system divided up and use the same mode of operations > across system-scope and namespace-scope move out of root while setting > yourself up, which also happens to be what you have to do inside > namespaces anyway." > >> But, given that all the controllers need to support the current magic >> root exception (for genuinely unaccountable things if nothing else), >> can you explain what would actually go wrong if you just removed the >> restriction entirely? > > I have, multiple times. Can you please read 2-1-2 of the document in > the original post and take the discussion from there? I've read it multiple times, and I don't see any explanation that's consistent with the fact that you are exempting the root cgroup from this constraint. If the constraint were really critical to everything working, then I would expect the root cgroup to have exactly the same problem. This makes me think that either something nasty is being fudged for the root cgroup or that the constraint isn't actually so important after all. The only thing on point I can find is: > Root cgroup is exempt from this constraint, which is in line with > how root cgroup is handled in general - it's excluded from cgroup > resource accounting and control. and that's not very helpful. 
> >> Also, here's an idea to maybe make PeterZ happier: relax the >> restriction a bit per-controller. Currently (except for /), if you >> have subtree control enabled you can't have any processes in the >> cgroup. Could you change this so it only applies to certain >> controllers? If the cpu controller is entirely happy to have >> processes and cgroups as siblings, then maybe a cgroup with only cpu >> subtree control enabled could allow processes to exist. > > The document lists several reasons for not doing this and also that > there is no known real world use case for such configuration. My company's production workload would map quite nicely to this relaxed model. I have quite a few processes each with several threads. Some of those threads get some CPUs, some get other CPUs, and they vary in what shares of what CPUs they get. To be clear, there is not a hierarchy of resource usage that's compatible with the process hierarchy. Multiple processes have threads that should be grouped in a different place in the hierarchy than other threads.
Re: [Documentation] State of CPU controller in cgroup v2
Hello, On Wed, Aug 31, 2016 at 12:11:58PM -0700, Andy Lutomirski wrote: > > You can say that allowing the possibility of deviation isn't a good > > design choice but it is a design choice with other implications - on > > how we deal with configurations without cgroup at all, transitioning > > from v1, bootstrapping a system and avoiding surprising > > userland-visible behaviors (e.g. like creating magic preset cgroups > > and silently migrating processes there on certain events). > > Are there existing userspace programs that use cgroup2 and enable > subtree control on / when there are processes in /? If the answer is > no, then I think you should change cgroup2 to just disallow it. If > the answer is yes, then I think there's a problem and maybe you should > consider a breaking change. Given that cgroup2 hasn't really launched > on a large scale, it seems worthwhile to get it right. Adding the restriction isn't difficult from an implementation point of view and for a system agent which controls the boot process implementing that wouldn't be difficult either, but I can't see what the actual benefits of the extra restriction would be and there are tangible downsides to doing so. Consider a use case where the user isn't interested in fully accounting and dividing up system resources but wants to just cap resource usage from a subset of workloads. There is no reason to require such usages to fully contain all processes in non-root cgroups. Furthermore, it's not trivial to migrate all processes out of root to a sub-cgroup unless the agent is in full control of the boot process. At least up until this point in the discussion, I can't see actual benefits of adding this restriction and the only reason for pushing it seems to be the initial misunderstanding and purism. > I don't understand what you're talking about wrt silently migrating > processes. Are you thinking about usermodehelper? If so, maybe it > really does make sense to allow (or require?) 
the cgroup manager to > specify which cgroup these processes end up in. That was from one of the ideas that I was considering way back where enabling resource control in an intermediate node automatically moves internal processes to a preset cgroup whether visible or hidden, which would be another way of addressing the problem. None of these affects what cgroup v2 can do at all and the only thing the userland is asked to do under the current scheme is "if you wanna keep the whole system divided up and use the same mode of operations across system-scope and namespace-scope move out of root while setting yourself up, which also happens to be what you have to do inside namespaces anyway." > But, given that all the controllers need to support the current magic > root exception (for genuinely unaccountable things if nothing else), > can you explain what would actually go wrong if you just removed the > restriction entirely? I have, multiple times. Can you please read 2-1-2 of the document in the original post and take the discussion from there? > Also, here's an idea to maybe make PeterZ happier: relax the > restriction a bit per-controller. Currently (except for /), if you > have subtree control enabled you can't have any processes in the > cgroup. Could you change this so it only applies to certain > controllers? If the cpu controller is entirely happy to have > processes and cgroups as siblings, then maybe a cgroup with only cpu > subtree control enabled could allow processes to exist. The document lists several reasons for not doing this and also that there is no known real world use case for such configuration. Please also note that the behavior that you're describing is actually what rgroup implements. It makes a lot more sense there because threads and groups share the same configuration mechanism and it only has to worry about competition among threads (anonymous consumption is out of scope for rgroup). 
> >> It *also* won't work (I think) if subtree control is enabled on the > >> root, but I don't think this is a problem in practice because subtree > >> control won't be enabled on the namespace root by a sensible cgroup > >> manager. > > > > Exactly the same thing. You can shoot yourself in the foot but it's > > easy not to. > > Somewhat off-topic: this appears to be either a bug or a misfeature: > > bash-4.3# mkdir foo > bash-4.3# ls foo > cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control > bash-4.3# mkdir foo/io.max <-- IMO this shouldn't have worked > bash-4.3# echo +io >cgroup.subtree_control > [ 40.470712] cgroup: cgroup_addrm_files: failed to add max, err=-17 > > Shouldn't cgroups with names that potentially conflict with > kernel-provided dentries be disallowed? Yeap, the name collisions suck. I thought about disallowing all sub-cgroups whose names start with "KNOWN_SUBSYS." but that has a non-trivial chance of breaking users which were happy before when a new controller gets added. But, yeah, we at least should disallow the
Re: [Documentation] State of CPU controller in cgroup v2
I'm replying separately to keep the two issues in separate emails. On Mon, Aug 29, 2016 at 3:20 PM, Tejun Heo wrote: > Hello, Andy. > > Sorry about the delay. Was kinda overwhelmed with other things. > > On Sat, Aug 20, 2016 at 11:45:55AM -0700, Andy Lutomirski wrote: >> > This becomes clear whenever an entity is allocating memory on behalf >> > of someone else - get_user_pages(), khugepaged, swapoff and so on (and >> > likely userfaultfd too). When a task is trying to add a page to a >> > VMA, the task might not have any relationship with the VMA other than >> > that it's operating on it for someone else. The page has to be >> > charged to whoever is responsible for the VMA and the only ownership >> > which can be established is the containing mm_struct. >> >> This surprises me a bit. If I do access_process_vm(), then I would >> have expected the charge to go to the caller, not the mm being accessed. > > It does and should go to the target mm. Who faults in a page shouldn't > be the final determinant in the ownership; otherwise, we end up in > situations where the ownership changes due to, for example, > fluctuations in page fault pattern. It doesn't make semantic sense > either. If a kthread is doing PIO for a process, why would it get > charged for the memory it's faulting in? OK, that makes sense. Although, given that cgroup1 allows tasks in the same process to be split up, how does this work in cgroup1? Do you just pick the mm associated with the thread group leader? If so, why can't cgroup2 do the same thing? But even this is at best a vague approximation. If you have MAP_SHARED mappings (libc.so, for example), then the cgroup you charge it to is more or less arbitrary. > >> What happens if a program calls read(2), though? A page may be >> inserted into page cache on behalf of an address_space without any >> particular mm being involved. There will usually be a calling task, >> though. 
> > Most faults are synchronous and the faulting thread is a member of the > mm to be charged, so this usually isn't an issue. I don't think there > are places where we populate an address_space without knowing who it > is for (as opposed / in addition to who the operator is). True, but there's no *mm* involved in any fundamental sense. You can look at the task and find the task's mm (or actually the task's thread group leader, since cgroup2 doesn't literally map mms to cgroups), but that seems to me to be a pretty poor reason to argue that tasks should have to be kept together. > >> But this is all very memcg-specific. What about other cgroups? I/O >> is per-task, right? Scheduling is definitely per-task. > > They aren't separate. Think about IOs to write out page cache, CPU > cycles spent reclaiming memory or encrypting writeback IOs. It's fine > to get more granular with specific resources but the semantics gets > messy for cross-resource accounting and control without proper > scoping. Page cache doesn't belong to a specific mm. Memory reclaim only has an mm associated if the memory being reclaimed belongs cleanly to an mm. Encrypting writeback (I assume you mean the cpu usage) is just like page cache writeback IO -- there's no specific mm involved in general. > >> > Consider the scenario where you have somebody faulting on behalf of a >> > foreign VMA, but the thread who created and is actively using that VMA >> > is in a different cgroup than the process leader. Who are we going to >> > charge? All possible answers seem erratic. >> >> Indeed, and this problem is probably not solvable in practice unless >> you charge all involved cgroups. But the caller's *mm* is entirely >> irrelevant here, so I don't see how this implies that cgroups need to >> keep tasks in the same process together. 
The relevant entities are >> the calling *task* and the target mm, and you're going to be >> hard-pressed to ensure that they belong to the same cgroup, so I think >> you need to be able to handle weird cases in which there isn't an >> obviously correct cgroup to charge. > > It is an erratic case which is caused by the userland interface allowing > non-sensical configuration. We can accept it as a necessary trade-off > given big enough benefits or unavoidable constraints but it isn't > something to do willy-nilly. > >> > For system-level and process-level operations to not step on each >> > other's toes, they need to agree on the granularity boundary - >> > system-level should be able to treat an application hierarchy as a >> > single unit. A possible solution is allowing rgroup hierarchies to >> > span across process boundaries and implementing cgroup migration >> > operations which treat such hierarchies as a single unit. I'm not yet >> > sure whether the boundary should be at program groups or rgroups. >> >> I think that, if the system cgroup manager is moving processes around >> after starting them and execing the final binary, there will be races >> and confusion, and no amount of granularity fiddling will fix that.
Re: [Documentation] State of CPU controller in cgroup v2
On Wed, Aug 31, 2016 at 10:32 AM, Tejun Heo wrote: > Hello, Andy. > > >> >> I really, really think that cgroup v2 should supply the same >> >> *interface* inside and outside of a non-root namespace. If this is >> > >> > It *does*. That's what I tried to explain, that it's exactly >> > isomorphic once you discount the system-wide consumptions. >> >> I don't think I agree. >> >> Suppose I wrote an init program or a cgroup manager. I can expect >> that init program to be started in the root cgroup. The program can >> be lazy and write +io to /cgroup/cgroup.subtree_control and then >> create some new cgroup /cgroup/a and it will work (I just tried it). >> >> Now I run that program in a namespace. It will not work because it'll >> get -EBUSY when it tries to write to cgroup.subtree_control. (I just >> tried this, too, only using cd instead of a namespace.) So it's *not* >> isomorphic. > > Yeah, it is possible to shoot yourself in the foot but both > system-scope and namespace-scope can implement exactly the same > behavior - move yourself out of root before enabling resource controls > and get the same expected outcome, which BTW is how systemd behaves > already. > > You can say that allowing the possibility of deviation isn't a good > design choice but it is a design choice with other implications - on > how we deal with configurations without cgroup at all, transitioning > from v1, bootstrapping a system and avoiding surprising > userland-visible behaviors (e.g. like creating magic preset cgroups > and silently migrating processes there on certain events). Are there existing userspace programs that use cgroup2 and enable subtree control on / when there are processes in /? If the answer is no, then I think you should change cgroup2 to just disallow it. If the answer is yes, then I think there's a problem and maybe you should consider a breaking change. Given that cgroup2 hasn't really launched on a large scale, it seems worthwhile to get it right. 
I don't understand what you're talking about wrt silently migrating processes. Are you thinking about usermodehelper? If so, maybe it really does make sense to allow (or require?) the cgroup manager to specify which cgroup these processes end up in. But, given that all the controllers need to support the current magic root exception (for genuinely unaccountable things if nothing else), can you explain what would actually go wrong if you just removed the restriction entirely? Also, here's an idea to maybe make PeterZ happier: relax the restriction a bit per-controller. Currently (except for /), if you have subtree control enabled you can't have any processes in the cgroup. Could you change this so it only applies to certain controllers? If the cpu controller is entirely happy to have processes and cgroups as siblings, then maybe a cgroup with only cpu subtree control enabled could allow processes to exist. > >> It *also* won't work (I think) if subtree control is enabled on the >> root, but I don't think this is a problem in practice because subtree >> control won't be enabled on the namespace root by a sensible cgroup >> manager. > > Exactly the same thing. You can shoot yourself in the foot but it's > easy not to. > Somewhat off-topic: this appears to be either a bug or a misfeature: bash-4.3# mkdir foo bash-4.3# ls foo cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control bash-4.3# mkdir foo/io.max <-- IMO this shouldn't have worked bash-4.3# echo +io >cgroup.subtree_control [ 40.470712] cgroup: cgroup_addrm_files: failed to add max, err=-17 Shouldn't cgroups with names that potentially conflict with kernel-provided dentries be disallowed? --Andy
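[Editorial note: the constraint being argued over in this exchange can be condensed into a small toy model. This is an illustrative sketch only, not kernel code; the class and method names are invented, and only the two rules under discussion are modeled: a non-root cgroup may not have both member processes and controllers enabled in its subtree_control, while the system root is exempt.]

```python
# Toy model of cgroup v2's no-internal-processes rule and the root
# exemption.  Illustrative only; names are invented for this sketch.

class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        self.procs = set()
        self.subtree_control = set()

    @property
    def is_root(self):
        return self.parent is None

    def mkdir(self, name):
        child = Cgroup(name, parent=self)
        self.children[name] = child
        return child

    def attach(self, pid):
        # v2 also refuses to attach a process to a non-root cgroup
        # that has already delegated controllers to its children.
        if self.subtree_control and not self.is_root:
            raise OSError("EBUSY: controllers enabled for children")
        self.procs.add(pid)

    def enable_subtree_control(self, controller):
        # The contested rule: a *non-root* cgroup may not have both
        # member processes and controllers enabled for its subtree.
        if self.procs and not self.is_root:
            raise OSError("EBUSY: cgroup has member processes")
        self.subtree_control.add(controller)

root = Cgroup("/")
root.attach(1)                      # processes may live in the root...
root.enable_subtree_control("io")   # ...and the root is still exempt

ns_root = root.mkdir("ns")          # a delegated "namespace root"
ns_root.attach(42)
try:
    ns_root.enable_subtree_control("io")
except OSError as e:
    print(e)                        # the -EBUSY asymmetry Andy hit
```

Under this model the system root and a delegated subtree behave differently only when processes are left sitting in the root of the subtree, which is the asymmetry the thread is debating.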
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Andy. On Tue, Aug 30, 2016 at 08:42:20PM -0700, Andy Lutomirski wrote: > On Mon, Aug 29, 2016 at 3:20 PM, Tejun Heo wrote: > >> This seems to explain why the controllers need to be able to handle > >> things being charged to the root cgroup (or to an unidentifiable > >> cgroup, anyway). That isn't quite the same thing as allowing, from an > >> ABI point of view, the root cgroup to contain processes and cgroups > >> but not allowing other cgroups to do the same thing. Consider: > > > > The points are 1. we need the root to be a special container anyway > > But you don't need to let userspace see that. I'm not saying that what cgroup v2 implements is the only solution. There of course can be other approaches which don't expose this particular detail to userland. I was highlighting that there is an underlying condition to be dealt with and that what cgroup v2 implements is one working solution for it. It's fine to have, say, aesthetic disagreements on the specifics of the chosen approach, and, while a bit late, we can still talk about pros and cons of different possible approaches and make improvements where it makes sense. However, this isn't in any way a make-it-or-break-it issue as you implied before. > >> I really, really think that cgroup v2 should supply the same > >> *interface* inside and outside of a non-root namespace. If this is > > > > It *does*. That's what I tried to explain, that it's exactly > > isomorphic once you discount the system-wide consumptions. > > I don't think I agree. > > Suppose I wrote an init program or a cgroup manager. I can expect > that init program to be started in the root cgroup. The program can > be lazy and write +io to /cgroup/cgroup.subtree_control and then > create some new cgroup /cgroup/a and it will work (I just tried it). > > Now I run that program in a namespace. It will not work because it'll > get -EBUSY when it tries to write to cgroup.subtree_control. 
(I just > tried this, too, only using cd instead of a namespace.) So it's *not* > isomorphic. Yeah, it is possible to shoot yourself in the foot but both system-scope and namespace-scope can implement exactly the same behavior - move yourself out of root before enabling resource controls and get the same expected outcome, which BTW is how systemd behaves already. You can say that allowing the possibility of deviation isn't a good design choice but it is a design choice with other implications - on how we deal with configurations without cgroup at all, transitioning from v1, bootstrapping a system and avoiding surprising userland-visible behaviors (e.g. like creating magic preset cgroups and silently migrating processes there on certain events). > It *also* won't work (I think) if subtree control is enabled on the > root, but I don't think this is a problem in practice because subtree > control won't be enabled on the namespace root by a sensible cgroup > manager. Exactly the same thing. You can shoot yourself in the foot but it's easy not to. Thanks. -- tejun
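[Editorial note: the "move yourself out of root before enabling resource controls" sequence described here can be sketched as follows. The helper name and the "manager" leaf path are invented for the example; on a real system these writes require ownership of the (delegated) cgroup tree and root privileges, and the sequence is the same whether `cgroup_root` is the system root or a namespace root.]

```python
# Sketch of a manager's startup sequence against a cgroup v2 mount.
# Illustrative only: setup_manager and the "manager" leaf are invented.
import os

def setup_manager(cgroup_root, controllers=("io",), pid=None):
    pid = pid if pid is not None else os.getpid()
    # 1. Create a leaf cgroup for ourselves and move there, so the
    #    root of this subtree no longer has member processes.
    leaf = os.path.join(cgroup_root, "manager")
    os.makedirs(leaf, exist_ok=True)
    with open(os.path.join(leaf, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    # 2. Only now enable controllers for the root's children; with the
    #    root emptied this succeeds in both the system root and a
    #    namespace root, giving the symmetric behavior described above.
    with open(os.path.join(cgroup_root, "cgroup.subtree_control"), "w") as f:
        f.write(" ".join("+" + c for c in controllers))
```

Doing the two steps in the opposite order is exactly the -EBUSY failure discussed earlier in the thread.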
Re: [Documentation] State of CPU controller in cgroup v2
On Mon, Aug 29, 2016 at 3:20 PM, Tejun Heo wrote: >> > These base-system operations are special regardless of cgroup and we >> > already have sometimes crude ways to affect their behaviors where >> > necessary through sysctl knobs, priorities on specific kernel threads >> > and so on. cgroup doesn't change the situation all that much. What >> > gets left in the root cgroup usually are the base-system operations >> > which are outside the scope of cgroup resource control in the first >> > place and cgroup resource graph can treat the root as an opaque anchor >> > point. >> >> This seems to explain why the controllers need to be able to handle >> things being charged to the root cgroup (or to an unidentifiable >> cgroup, anyway). That isn't quite the same thing as allowing, from an >> ABI point of view, the root cgroup to contain processes and cgroups >> but not allowing other cgroups to do the same thing. Consider: > > The points are 1. we need the root to be a special container anyway But you don't need to let userspace see that. > 2. allowing it to be special and contain system-wide consumptions > doesn't make the resource graph inconsistent once all non-system-wide > consumptions are put in non-root cgroups, and 3. this is the most > natural way to handle the situation both from implementation and > interface standpoints as it makes non-cgroup configuration a natural > degenerate case of cgroup configuration. > >> suppose that systemd (or some competing cgroup manager) is designed to >> run in the root cgroup namespace. It presumably expects *itself* to >> be in the root cgroup. Now try to run it using cgroups v2 in a >> non-root namespace. I don't see how it can possibly work if the >> hierarchy constraints don't permit it to create sub-cgroups while it's >> still in the root. In fact, this seems impossible to fix even with >> user code changes. 
The manager would need to simultaneously create a >> new child cgroup to contain itself and assign itself to that child >> cgroup, because the intermediate state is illegal. > > Please re-read the constraint. It doesn't prevent any organizational > operations before resource control is enabled. > >> I really, really think that cgroup v2 should supply the same >> *interface* inside and outside of a non-root namespace. If this is > > It *does*. That's what I tried to explain, that it's exactly > isomorphic once you discount the system-wide consumptions. > I don't think I agree. Suppose I wrote an init program or a cgroup manager. I can expect that init program to be started in the root cgroup. The program can be lazy and write +io to /cgroup/cgroup.subtree_control and then create some new cgroup /cgroup/a and it will work (I just tried it). Now I run that program in a namespace. It will not work because it'll get -EBUSY when it tries to write to cgroup.subtree_control. (I just tried this, too, only using cd instead of a namespace.) So it's *not* isomorphic. It *also* won't work (I think) if subtree control is enabled on the root, but I don't think this is a problem in practice because subtree control won't be enabled on the namespace root by a sensible cgroup manager. --Andy
Re: [Documentation] State of CPU controller in cgroup v2
Hello, James. On Sat, Aug 20, 2016 at 10:34:14PM -0700, James Bottomley wrote: > I can see that process based is conceptually easier in v2 because you > begin with a process tree, but it would really be a pity to lose the > thread based controls we have now and permanently lose the ability to > create more as we find uses for them. I can't really see how improving > "common resource domain" is a good tradeoff for this. Thread based control for namespace is not a different problem from thread based control for individual applications, right? And the problems with using cgroupfs directly for in-process control still apply the same whether it's system-wide or inside a namespace. One argument could be that inside a namespace, as the cgroupfs is already scoped, cgroup path headaches are less of an issue, which is true; however, that isn't applicable to applications which aren't scoped in their own namespaces and we can't scope every binary on the system. More importantly, a given application can't rely on being scoped in a certain way. You can craft a custom config for a specific setup but that's a horrible way to solve the problem of in-application hierarchical resource distribution, and that's what rgroup was all about. Thanks. -- tejun
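[Editorial note: one concrete form of the "cgroup path headaches" mentioned here is that before an application can manage sub-cgroups for itself it must first discover where it was placed. On the v2 hierarchy `/proc/self/cgroup` contains a single `0::<path>` line; the sketch below parses it. The function name is invented, and the example path is hypothetical.]

```python
# Parse the cgroup v2 entry out of /proc/self/cgroup contents.
# On v2 the line has the form "0::<path>" (hierarchy id 0, empty
# controller list); v1 lines like "12:cpu,cpuacct:/foo" are skipped.

def cgroup2_self_path(proc_self_cgroup_text):
    """Return the v2 cgroup path from /proc/self/cgroup contents."""
    for line in proc_self_cgroup_text.splitlines():
        hierarchy_id, controllers, path = line.split(":", 2)
        if hierarchy_id == "0" and controllers == "":
            return path
    raise LookupError("not on the cgroup v2 hierarchy")

# e.g. a process that a system agent scoped into a service cgroup:
print(cgroup2_self_path("0::/system.slice/myapp.service"))
# -> /system.slice/myapp.service
```

The point in the message stands: the result of this lookup depends entirely on how the surrounding system chose to scope the application, which is exactly what an application cannot rely on.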
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Andy. Sorry about the delay. Was kinda overwhelmed with other things. On Sat, Aug 20, 2016 at 11:45:55AM -0700, Andy Lutomirski wrote: > > This becomes clear whenever an entity is allocating memory on behalf > > of someone else - get_user_pages(), khugepaged, swapoff and so on (and > > likely userfaultfd too). When a task is trying to add a page to a > > VMA, the task might not have any relationship with the VMA other than > > that it's operating on it for someone else. The page has to be > > charged to whoever is responsible for the VMA and the only ownership > > which can be established is the containing mm_struct. > > This surprises me a bit. If I do access_process_vm(), then I would > have expected the charge to go to the caller, not the mm being accessed. It does and should go to the target mm. Who faults in a page shouldn't be the final determinant in the ownership; otherwise, we end up in situations where the ownership changes due to, for example, fluctuations in page fault pattern. It doesn't make semantic sense either. If a kthread is doing PIO for a process, why would it get charged for the memory it's faulting in? > What happens if a program calls read(2), though? A page may be > inserted into page cache on behalf of an address_space without any > particular mm being involved. There will usually be a calling task, > though. Most faults are synchronous and the faulting thread is a member of the mm to be charged, so this usually isn't an issue. I don't think there are places where we populate an address_space without knowing who it is for (as opposed / in addition to who the operator is). > But this is all very memcg-specific. What about other cgroups? I/O > is per-task, right? Scheduling is definitely per-task. They aren't separate. Think about IOs to write out page cache, CPU cycles spent reclaiming memory or encrypting writeback IOs. 
It's fine to get more granular with specific resources but the semantics gets messy for cross-resource accounting and control without proper scoping. > > Consider the scenario where you have somebody faulting on behalf of a > > foreign VMA, but the thread who created and is actively using that VMA > > is in a different cgroup than the process leader. Who are we going to > > charge? All possible answers seem erratic. > > Indeed, and this problem is probably not solvable in practice unless > you charge all involved cgroups. But the caller's *mm* is entirely > irrelevant here, so I don't see how this implies that cgroups need to > keep tasks in the same process together. The relevant entities are > the calling *task* and the target mm, and you're going to be > hard-pressed to ensure that they belong to the same cgroup, so I think > you need to be able to handle weird cases in which there isn't an > obviously correct cgroup to charge. It is an erratic case which is caused by the userland interface allowing non-sensical configuration. We can accept it as a necessary trade-off given big enough benefits or unavoidable constraints but it isn't something to do willy-nilly. > > For system-level and process-level operations to not step on each > > other's toes, they need to agree on the granularity boundary - > > system-level should be able to treat an application hierarchy as a > > single unit. A possible solution is allowing rgroup hierarchies to > > span across process boundaries and implementing cgroup migration > > operations which treat such hierarchies as a single unit. I'm not yet > > sure whether the boundary should be at program groups or rgroups. > > I think that, if the system cgroup manager is moving processes around > after starting them and execing the final binary, there will be races > and confusion, and no amount of granularity fiddling will fix that. I don't see how that statement is true. 
For example, if you confine the hierarchy to in-process, there is proper isolation and whether the system agent migrates the process or not doesn't make any difference to the internal hierarchy. > I know nothing about rgroups. Are they upstream? It was linked from the original message. [7] http://lkml.kernel.org/r/20160105154503.gc5...@mtj.duckdns.org [RFD] cgroup: thread granularity support for cpu controller Tejun Heo [8] http://lkml.kernel.org/r/1457710888-31182-1-git-send-email...@kernel.org [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo [9] http://lkml.kernel.org/r/20160311160522.ga24...@htj.duckdns.org Example program for PRIO_RGRP Tejun Heo > > These base-system operations are special regardless of cgroup and we > > already have sometimes crude ways to affect their behaviors where > > necessary through sysctl knobs, priorities on specific kernel threads > > and so on. cgroup doesn't change the situation all that much. What > > gets left in the root cgroup usually are the base-system operations > > which are outside the scope of cgroup resource co
Re: [Documentation] State of CPU controller in cgroup v2
On Sat, 2016-08-20 at 11:56 -0400, Tejun Heo wrote: > > > there are other reasons to enforce process granularity. One > > > important one is isolating system-level management operations from > > > in-process application operations. The cgroup interface, being a > > > virtual filesystem, is very unfit for multiple independent > > > operations taking place at the same time as most operations have to > > > be multi-step and there is no way to synchronize multiple accessors. > > > See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity" > > > > I don't buy this argument at all. System-level code is likely to > > assign single process *trees*, which are a different beast entirely. > > I.e. you fork, move the child into a cgroup, and that child and its > > children stay in that cgroup. I don't see how the thread/process > > distinction matters. > > Good point on the multi-process issue, this is something which nagged > me a bit while working on rgroup, although I have to point out that > the issue here is one of not going far enough rather than the approach > being wrong. There are limitations to scoping it to individual > processes but that doesn't negate the underlying problem or the > usefulness of in-process control. > > For system-level and process-level operations to not step on each > other's toes, they need to agree on the granularity boundary - > system-level should be able to treat an application hierarchy as a > single unit. A possible solution is allowing rgroup hierarchies to > span across process boundaries and implementing cgroup migration > operations which treat such hierarchies as a single unit. I'm not yet > sure whether the boundary should be at program groups or rgroups. Why is it not viable to predicate contentious lowest-common-denominator restrictions upon the set of enabled controllers? 
If only thread granularity controllers are enabled, from that point onward, v2 restrictions cease to make any sense, thus could be lifted, leaving nobody cast adrift in a leaky v1 lifeboat when v2 sets sail. Or? -Mike
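Mike's suggestion hinges on per-cgroup controller state, which cgroup v2 already exposes through interface files. A minimal sketch of inspecting them from the shell; the mount point is the conventional one and may differ on a given system, and the script degrades to a message where no v2 hierarchy is mounted:

```shell
# Where available controllers and the set delegated to children are visible.
CG=${CG:-/sys/fs/cgroup}

if [ -r "$CG/cgroup.controllers" ]; then
    avail=$(cat "$CG/cgroup.controllers")      # controllers usable at this level
    child=$(cat "$CG/cgroup.subtree_control")  # controllers enabled for children
    echo "available: $avail"
    echo "enabled for children: ${child:-(none)}"
else
    echo "no cgroup2 hierarchy mounted at $CG"
fi
```

A rule like the one Mike proposes would key off the `cgroup.subtree_control` contents: if only thread-capable controllers appear there, the process-granularity restriction could in principle be relaxed for that subtree.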
Re: [Documentation] State of CPU controller in cgroup v2
On Wed, 2016-08-17 at 13:18 -0700, Andy Lutomirski wrote: > On Aug 5, 2016 7:07 PM, "Tejun Heo" wrote: [...] > > 2. Disagreements and Arguments > > > > There have been several lengthy discussion threads [3][4] on LKML > > around the structural constraints of cgroup v2. The two that > > affect the CPU controller are process granularity and no internal > > process constraint. Both arise primarily from the need for common > > resource domain definition across different resources. > > > > The common resource domain is a powerful concept in cgroup v2 that > > allows controllers to make basic assumptions about the structural > > organization of processes and controllers inside the cgroup > > hierarchy, and thus solve problems spanning multiple types of > > resources. The prime example for this is page cache writeback: > > dirty page cache is regulated through throttling buffered writers > > based on memory availability, and initiating batched write outs to > > the disk based on IO capacity. Tracking and controlling writeback > > inside a cgroup thus requires the direct cooperation of the memory > > and the IO controller. > > > > This easily extends to other areas, such as CPU cycles consumed > > while performing memory reclaim or IO encryption. > > > > > > 2-1. Contentious Restrictions > > > > For controllers of different resources to work together, they must > > agree on a common organization. This uniform model across > > controllers imposes two contentious restrictions on the CPU > > controller: process granularity and the no-internal-process > > constraint. > > > > > > 2-1-1. Process Granularity > > > > For memory, because an address space is shared between all > > threads > > of a process, the terminal consumer is a process, not a thread. > > Separating the threads of a single process into different memory > > control domains doesn't make semantical sense. 
cgroup v2 ensures > > that all controllers can agree on the same organization by > > requiring > > that threads of the same process belong to the same cgroup. > > I haven't followed all of the history here, but it seems to me that > this argument is less accurate than it appears. Linux, for better or > for worse, has somewhat orthogonal concepts of thread groups > (processes), mms, and file tables. An mm has VMAs in it, and VMAs > can reference things (files, etc) that hold resources. (Two mms can > share resources by mapping the same thing or using fork().) File > tables hold files, and files can use resources. Both of these are, > at best, moderately good approximations of what actually holds > resources. Meanwhile, threads (tasks) do syscalls, take page faults, > *allocate* resources, etc. > > So I think it's not really true to say that the "terminal consumer" > of anything is a process, not a thread. > > While it's certainly easier to think about assigning processes to > cgroups, and I certainly agree that, in the common case, it's the > right thing to do, I don't see why requiring it is a good idea. Can > we turn this around: what actually goes wrong if cgroup v2 were to > allow assigning individual threads if a user specifically requests > it? A similar point from a different consumer: from the unprivileged containers point of view, I'm interested in a thread-based interface as well. The principal utility of unprivileged containers is to allow applications that wish to use container properties (effectively to become self-containerising). Some that use the producer/consumer model do use process pools (apache springs to mind instantly) but some use thread pools. 
It is useful to the latter to preserve the concept of a thread as being the entity inhabiting the cgroup (but only where the granularity of the cgroup permits threads to participate) so we can easily modify them to be self-containerising without forcing them to switch back from a thread pool model to a process pool model. I can see that process-based is conceptually easier in v2 because you begin with a process tree, but it would really be a pity to lose the thread-based controls we have now and permanently lose the ability to create more as we find uses for them. I can't really see how improving "common resource domain" is a good tradeoff for this. James
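The granularity difference James describes maps onto two different interface files: v1 hierarchies accept individual thread IDs via a per-cgroup "tasks" file, while v2 moves whole processes via "cgroup.procs". A hedged sketch with hypothetical cgroup paths; the writes are attempted only when the files are writable, so this is a safe no-op elsewhere:

```shell
TID=$$                              # stand-in for a single worker thread's id
V1_GRP=/sys/fs/cgroup/cpu/pool      # hypothetical v1 cpu cgroup
V2_GRP=/sys/fs/cgroup/pool          # hypothetical v2 cgroup

# v1: thread granularity -- a lone TID goes into the per-cgroup "tasks" file.
if [ -w "$V1_GRP/tasks" ]; then
    echo "$TID" > "$V1_GRP/tasks"
fi

# v2: process granularity -- writing any TID to "cgroup.procs" migrates the
# entire thread group; threads of one process cannot be split across cgroups.
if [ -w "$V2_GRP/cgroup.procs" ]; then
    echo "$TID" > "$V2_GRP/cgroup.procs"
fi
```

The self-containerising thread pool James mentions relies on the first form: each worker thread writes its own TID into an account cgroup before doing work on its behalf.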
Re: [Documentation] State of CPU controller in cgroup v2
On Sat, Aug 20, 2016 at 8:56 AM, Tejun Heo wrote: > Hello, Andy. > > On Wed, Aug 17, 2016 at 01:18:24PM -0700, Andy Lutomirski wrote: >> > 2-1-1. Process Granularity >> > >> > For memory, because an address space is shared between all threads >> > of a process, the terminal consumer is a process, not a thread. >> > Separating the threads of a single process into different memory >> > control domains doesn't make semantical sense. cgroup v2 ensures >> > that all controllers can agree on the same organization by requiring >> > that threads of the same process belong to the same cgroup. >> >> I haven't followed all of the history here, but it seems to me that >> this argument is less accurate than it appears. Linux, for better or >> for worse, has somewhat orthogonal concepts of thread groups >> (processes), mms, and file tables. An mm has VMAs in it, and VMAs can >> reference things (files, etc) that hold resources. (Two mms can share >> resources by mapping the same thing or using fork().) File tables >> hold files, and files can use resources. Both of these are, at best, >> moderately good approximations of what actually holds resources. >> Meanwhile, threads (tasks) do syscalls, take page faults, *allocate* >> resources, etc. >> >> So I think it's not really true to say that the "terminal consumer" of >> anything is a process, not a thread. > > The terminal consumer is actually the mm context. A task may be the > allocating entity but not always for itself. > > This becomes clear whenever an entity is allocating memory on behalf > of someone else - get_user_pages(), khugepaged, swapoff and so on (and > likely userfaultfd too). When a task is trying to add a page to a > VMA, the task might not have any relationship with the VMA other than > that it's operating on it for someone else. The page has to be > charged to whoever is responsible for the VMA and the only ownership > which can be established is the containing mm_struct. This surprises me a bit. 
If I do access_process_vm(), then I would have expected the charge to go to the caller, not the mm being accessed. What happens if a program calls read(2), though? A page may be inserted into page cache on behalf of an address_space without any particular mm being involved. There will usually be a calling task, though. But this is all very memcg-specific. What about other cgroups? I/O is per-task, right? Scheduling is definitely per-task. > > While a mm_struct technically may not map to a process, it is a very > close approximation which is hardly ever broken in practice. > >> While it's certainly easier to think about assigning processes to >> cgroups, and I certainly agree that, in the common case, it's the >> right thing to do, I don't see why requiring it is a good idea. Can >> we turn this around: what actually goes wrong if cgroup v2 were to >> allow assigning individual threads if a user specifically requests it? > > Consider the scenario where you have somebody faulting on behalf of a > foreign VMA, but the thread who created and is actively using that VMA > is in a different cgroup than the process leader. Who are we going to > charge? All possible answers seem erratic. > Indeed, and this problem is probably not solvable in practice unless you charge all involved cgroups. But the caller's *mm* is entirely irrelevant here, so I don't see how this implies that cgroups need to keep tasks in the same process together. The relevant entities are the calling *task* and the target mm, and you're going to be hard-pressed to ensure that they belong to the same cgroup, so I think you need to be able to handle weird cases in which there isn't an obviously correct cgroup to charge. >> > there are other reasons to enforce process granularity. One >> > important one is isolating system-level management operations from >> > in-process application operations. 
The cgroup interface, being a >> > virtual filesystem, is very unfit for multiple independent >> > operations taking place at the same time as most operations have to >> > be multi-step and there is no way to synchronize multiple accessors. >> > See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity" >> >> I don't buy this argument at all. System-level code is likely to >> assign single process *trees*, which are a different beast entirely. >> I.e. you fork, move the child into a cgroup, and that child and its >> children stay in that cgroup. I don't see how the thread/process >> distinction matters. > > Good point on the multi-process issue, this is something which nagged > me a bit while working on rgroup, although I have to point out that > the issue here is one of not going far enough rather than the approach > being wrong. There are limitations to scoping it to individual > processes but that doesn't negate the underlying problem or the > usefulness of in-process control. > > For system-level and process-level operations to not step on each >
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Andy. On Wed, Aug 17, 2016 at 01:18:24PM -0700, Andy Lutomirski wrote: > > 2-1-1. Process Granularity > > > > For memory, because an address space is shared between all threads > > of a process, the terminal consumer is a process, not a thread. > > Separating the threads of a single process into different memory > > control domains doesn't make semantical sense. cgroup v2 ensures > > that all controllers can agree on the same organization by requiring > > that threads of the same process belong to the same cgroup. > > I haven't followed all of the history here, but it seems to me that > this argument is less accurate than it appears. Linux, for better or > for worse, has somewhat orthogonal concepts of thread groups > (processes), mms, and file tables. An mm has VMAs in it, and VMAs can > reference things (files, etc) that hold resources. (Two mms can share > resources by mapping the same thing or using fork().) File tables > hold files, and files can use resources. Both of these are, at best, > moderately good approximations of what actually holds resources. > Meanwhile, threads (tasks) do syscalls, take page faults, *allocate* > resources, etc. > > So I think it's not really true to say that the "terminal consumer" of > anything is a process, not a thread. The terminal consumer is actually the mm context. A task may be the allocating entity but not always for itself. This becomes clear whenever an entity is allocating memory on behalf of someone else - get_user_pages(), khugepaged, swapoff and so on (and likely userfaultfd too). When a task is trying to add a page to a VMA, the task might not have any relationship with the VMA other than that it's operating on it for someone else. The page has to be charged to whoever is responsible for the VMA and the only ownership which can be established is the containing mm_struct. While a mm_struct technically may not map to a process, it is a very close approximation which is hardly ever broken in practice. 
> While it's certainly easier to think about assigning processes to > cgroups, and I certainly agree that, in the common case, it's the > right thing to do, I don't see why requiring it is a good idea. Can > we turn this around: what actually goes wrong if cgroup v2 were to > allow assigning individual threads if a user specifically requests it? Consider the scenario where you have somebody faulting on behalf of a foreign VMA, but the thread who created and is actively using that VMA is in a different cgroup than the process leader. Who are we going to charge? All possible answers seem erratic. Please note that I agree that thread granularity can be useful for some resources; however, my points are 1. it should be scoped so that the resource distribution tree as a whole can be shared across different resources, and, 2. cgroup filesystem interface isn't a good interface for the purpose. I'll continue the second point below. > > there are other reasons to enforce process granularity. One > > important one is isolating system-level management operations from > > in-process application operations. The cgroup interface, being a > > virtual filesystem, is very unfit for multiple independent > > operations taking place at the same time as most operations have to > > be multi-step and there is no way to synchronize multiple accessors. > > See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity" > > I don't buy this argument at all. System-level code is likely to > assign single process *trees*, which are a different beast entirely. > I.e. you fork, move the child into a cgroup, and that child and its > children stay in that cgroup. I don't see how the thread/process > distinction matters. Good point on the multi-process issue, this is something which nagged me a bit while working on rgroup, although I have to point out that the issue here is one of not going far enough rather than the approach being wrong. 
There are limitations to scoping it to individual processes but that doesn't negate the underlying problem or the usefulness of in-process control. For system-level and process-level operations to not step on each other's toes, they need to agree on the granularity boundary - system-level should be able to treat an application hierarchy as a single unit. A possible solution is allowing rgroup hierarchies to span across process boundaries and implementing cgroup migration operations which treat such hierarchies as a single unit. I'm not yet sure whether the boundary should be at program groups or rgroups. > On the contrary: with cgroup namespaces, one could easily create a > cgroup namespace, shove a process in it, and let that process delegate > its threads to child cgroups however it likes. (Well, children of the > namespace root.) cgroup namespace solves just one piece of the whole problem and not in a very robust way. It's okay for containers but not so for individual applications.
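The namespace mechanism Andy refers to can be exercised from the shell via util-linux unshare(1). A rough sketch; it needs root and an unshare built with cgroup namespace support, and degrades to a message otherwise:

```shell
# Inside a new cgroup namespace the creating cgroup becomes the namespace
# root, so the process sees itself at "/" (e.g. "0::/" on a v2 hierarchy)
# regardless of its real position, and can delegate children under it.
if [ "$(id -u)" -eq 0 ] && unshare --cgroup true 2>/dev/null; then
    msg=$(unshare --cgroup cat /proc/self/cgroup)
else
    msg="cannot create a cgroup namespace here"
fi
echo "$msg"
```

This is the containment Tejun concedes works for containers: the namespaced process only sees its own subtree. His objection is that an ordinary application still manipulates that subtree through the shared vfs interface, with no way to synchronize against an outside manager migrating it.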
Re: [Documentation] State of CPU controller in cgroup v2
On Aug 5, 2016 7:07 PM, "Tejun Heo" wrote: > > Hello, > > There have been several discussions around CPU controller support. > Unfortunately, no consensus was reached and cgroup v2 is sorely > lacking CPU controller support. This document includes a summary of the > situation and arguments along with an interim solution for parties who > want to use the out-of-tree patches for CPU controller cgroup v2 > support. I'll post the two patches as replies for reference. > > Thanks. > > > CPU Controller on Control Group v2 > > August, 2016  Tejun Heo > > > While most controllers have support for cgroup v2 now, the CPU > controller support is not upstream yet due to objections from the > scheduler maintainers on the basic designs of cgroup v2. This > document explains the current situation as well as an interim > solution, and details the disagreements and arguments. The latest > version of this document can be found at the following URL. > > > https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu > > > CONTENTS > > 1. Current Situation and Interim Solution > 2. Disagreements and Arguments > 2-1. Contentious Restrictions > 2-1-1. Process Granularity > 2-1-2. No Internal Process Constraint > 2-2. Impact on CPU Controller > 2-2-1. Impact of Process Granularity > 2-2-2. Impact of No Internal Process Constraint > 2-3. Arguments for cgroup v2 > 3. Way Forward > 4. References > > > 1. Current Situation and Interim Solution > > All objections from the scheduler maintainers apply to cgroup v2 core > design, and there are no known objections to the specifics of the CPU > controller cgroup v2 interface. 
The only blocked part is changes to > expose the CPU controller interface on cgroup v2, which comprises the > following two patches: > > [1] sched: Misc preps for cgroup unified hierarchy interface > [2] sched: Implement interface for cgroup unified hierarchy > > The necessary changes are superficial and implement the interface > files on cgroup v2. The combined diffstat is as follows. > > kernel/sched/core.c | 149 +++-- > kernel/sched/cpuacct.c | 57 -- > kernel/sched/cpuacct.h | 5 + > 3 files changed, 189 insertions(+), 22 deletions(-) > > The patches are easy to apply and forward-port. The following git > branch will always carry the two patches on top of the latest release > of the upstream kernel. > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu > > There also are versioned branches going back to v4.4. > > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu-$KERNEL_VER > > While it's difficult to tell whether the CPU controller support will > be merged, there are crucial resource control features in cgroup v2 > that are only possible due to the design choices that are being > objected to, and every effort will be made to ease enabling the CPU > controller cgroup v2 support out-of-tree for parties which choose to. > > > 2. Disagreements and Arguments > > There have been several lengthy discussion threads [3][4] on LKML > around the structural constraints of cgroup v2. The two that affect > the CPU controller are process granularity and no internal process > constraint. Both arise primarily from the need for common resource > domain definition across different resources. > > The common resource domain is a powerful concept in cgroup v2 that > allows controllers to make basic assumptions about the structural > organization of processes and controllers inside the cgroup hierarchy, > and thus solve problems spanning multiple types of resources. 
The > prime example for this is page cache writeback: dirty page cache is > regulated through throttling buffered writers based on memory > availability, and initiating batched write outs to the disk based on > IO capacity. Tracking and controlling writeback inside a cgroup thus > requires the direct cooperation of the memory and the IO controller. > > This easily extends to other areas, such as CPU cycles consumed while > performing memory reclaim or IO encryption. > > > 2-1. Contentious Restrictions > > For controllers of different resources to work together, they must > agree on a common organization. This uniform model across controllers > imposes two contentious restrictions on the CPU controller: process > granularity and the no-internal-process constraint. > > > 2-1-1. Process Granularity > > For memory, because an address space is shared between all threads > of a process, the terminal consumer is a process, not a thread. > Separating the threads of a single process into different memory > control domains doesn't make semantical sense. cgroup v2 ensures > that all controllers can agree on the same organization by requiring > that threads of the same process belong to the same cgroup. I haven't followed all of the history here, but
Re: [Documentation] State of CPU controller in cgroup v2
On Tue, 2016-08-16 at 12:30 -0400, Johannes Weiner wrote: > On Tue, Aug 16, 2016 at 04:07:38PM +0200, Peter Zijlstra wrote: > > Also, the argument there seems unfair at best, you don't need cpu-v2 for > > buffered write control, you only need memcg and block co-mounted. > > Yes, memcg and block agreeing is enough for that case. But I mentioned > a whole bunch of these examples, to make the broader case for a common > controller model. The core issue I have with that model is that it defines context=mm, and declares context=task to be invalid, while in reality, both views are perfectly valid, useful, and in use. That redefinition of context is demonstrably harmful when applied to scheduler-related controllers, rendering a substantial portion of to-be-managed objects completely unmanageable. You (collectively) know that full well. AFAICT, there is only one viable option, and that is to continue to allow both. Whether you like the duality or not (who would), it's deeply embedded in what's under the controllers, and won't go away. I'll now go try a little harder while you ponder (or pop) this thought bubble, see if I can set a new personal best at the art of ignoring. (CC did not help btw, your bad if you don't like bubble content) -Mike
Re: [Documentation] State of CPU controller in cgroup v2
Hello, Peter. On Tue, Aug 16, 2016 at 04:07:38PM +0200, Peter Zijlstra wrote: > On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: > > > [ That, and a disturbing number of emotional outbursts against > > systemd, which has nothing to do with any of this. ] > > Oh, so I'm entirely dreaming this then: > > https://github.com/systemd/systemd/pull/3905 > > Completely unrelated. We use centos in the fleet and are trying to control resources in the base system, which of course requires writeback control and thus cgroup v2. I'm working to solve the use cases people are facing and systemd is a piece of the puzzle. There is no big conspiracy. As Johannes and Chris already pointed out, systemd is a user of cgroup v2, a pretty important one at this point. While I of course care about it having proper support for cgroup v2, systemd is just picking up the changes in cgroup v2. cgroup v2's design wouldn't be different without systemd. We'd just have something else playing its role in resource management. > Also, the argument there seems unfair at best, you don't need cpu-v2 for > buffered write control, you only need memcg and block co-mounted. ( Everything I'm gonna write below has already been extensively documented in the posted documentation. I'm gonna repeat the points for completeness but if we're gonna start an actual technical discussion, let's please start from the documentation instead of jumping off of a one-liner and trying to rebuild the entire argument each time. I'm not sure what exactly you meant by the above sentence; I'm assuming you're saying that there are no new capabilities gained by the cpu controller being on the v2 hierarchy and thus that the cpu controller doesn't need to be on cgroup v2. If I'm mistaken, please let me know. 
) Just co-mounting isn't enough as it still leaves the problems with anonymous consumption, different handling of threads belonging to different cgroups, and whether it's acceptable to always require blkio to use the memory controller. cgroup v2 is what we got after working through all these issues. While it is true that the cpu controller doesn't need to be on cgroup v2 for writeback control to work, it misses the point about the larger design issues identified during the writeback control work, which can be easily applied to the cpu controller - e.g. accounting cpu cycles spent for packet reception, memory reclaim, IO encryption and so on. In addition, it is an unnecessary inconvenience to require users who want writeback control to deal with the complication of mixed v1 and v2 hierarchies when their requirements can be easily served by v2, especially considering that the only blocked part is trivial changes to expose the cpu controller interface on v2 and that enabling it on v2 doesn't preclude it from being used on a v1 hierarchy if necessary. Thanks. -- tejun
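The mixed v1/v2 arrangement Tejun calls an inconvenience looks roughly like this in practice: memory and io live together on the unified hierarchy so writeback control works, while cpu stays on a separate v1 hierarchy. A sketch with illustrative mount points, guarded since mounting requires root:

```shell
# Errors are tolerated: the hierarchies may already be mounted elsewhere.
if [ "$(id -u)" -eq 0 ]; then
    mkdir -p /mnt/cgroup2 /mnt/cgroup-cpu
    # v2 unified hierarchy: memory + io co-located for writeback control.
    mount -t cgroup2 none /mnt/cgroup2 2>/dev/null || true
    # cpu (and cpuacct) kept on their own v1 hierarchy.
    mount -t cgroup -o cpu,cpuacct cgroup /mnt/cgroup-cpu 2>/dev/null || true
fi
# Either way, report which cgroup filesystem types are currently visible.
status=$(awk '$3 ~ /^cgroup2?$/ {print $3}' /proc/self/mounts 2>/dev/null | sort -u)
echo "mounted cgroup filesystems: ${status:-none}"
```

A v1 controller is only available to such a mount if it is not already bound to the v2 hierarchy, which is exactly why the out-of-tree cpu patches leave both options open.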
Re: [Documentation] State of CPU controller in cgroup v2
On Tue, Aug 16, 2016 at 04:07:38PM +0200, Peter Zijlstra wrote: > On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: > > > [ That, and a disturbing number of emotional outbursts against > > systemd, which has nothing to do with any of this. ] > > Oh, so I'm entirely dreaming this then: > > https://github.com/systemd/systemd/pull/3905 > > Completely unrelated. Yes and no. We certainly do use systemd (kind of hard not to at this point if you're using any major distribution), and we do feed back the changes we make to it upstream. But this is updating systemd to work with the resource control design choices we made in the kernel, not the other way round. As I wrote to Mike before, we have been running into these resource control issues way before systemd, when we used a combination of libcgroup and custom hacks to coordinate the jobs on the system. The cgroup2 design choices fell out of experiences with those setups. Neither the problem statement nor the proposed solutions depend on systemd, which is why I had hoped we could focus these cgroup2 debates around the broader resource control issues we are trying to address, rather than get hung up on one contentious user of the interface. > Also, the argument there seems unfair at best, you don't need cpu-v2 for > buffered write control, you only need memcg and block co-mounted. Yes, memcg and block agreeing is enough for that case. But I mentioned a whole bunch of these examples, to make the broader case for a common controller model.
Re: [Documentation] State of CPU controller in cgroup v2
On 08/16/2016 10:07 AM, Peter Zijlstra wrote: On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: [ That, and a disturbing number of emotional outbursts against systemd, which has nothing to do with any of this. ] Oh, so I'm entirely dreaming this then: https://github.com/systemd/systemd/pull/3905 Completely unrelated. Also, the argument there seems unfair at best, you don't need cpu-v2 for buffered write control, you only need memcg and block co-mounted. This isn't systemd dictating cgroups2 or systemd trying to get rid of v1. But systemd is a common user of cgroups, and we do use it here in production. We're just sending patches upstream for the tools we're using. It's better than keeping them private, or reinventing a completely different tool that does almost the same thing. -chris
Re: [Documentation] State of CPU controller in cgroup v2
On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: > [ That, and a disturbing number of emotional outbursts against > systemd, which has nothing to do with any of this. ] Oh, so I'm entirely dreaming this then: https://github.com/systemd/systemd/pull/3905 Completely unrelated. Also, the argument there seems unfair at best, you don't need cpu-v2 for buffered write control, you only need memcg and block co-mounted.
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, 2016-08-12 at 18:17 -0400, Johannes Weiner wrote: > > > This argument that cgroup2 is not backward compatible is laughable. > > > > Fine, you're entitled to your sense of humor. I have one too, I find it > > laughable that threaded applications can only sit there like a lump of > > mud simply because they share more than applications written as a > > gaggle of tasks. "Threads are like.. so yesterday, the future belongs > > to the process" tickles my funny-bone. Whatever, to each his own. > > Who are you quoting here? This is such a grotesque misrepresentation > of what we have been saying and implementing, it's not even funny. Agreed, it's not funny to me either. Excluding threaded applications from doing.. anything.. implies to me that either someone thinks same do not need resource management facilities due to some magical property of threading itself, or someone doesn't realize that an application thread is a task, i.e. one and the same thing, which can be doing one and the same job. No matter how I turn it, what I see is nonsense. > https://yourlogicalfallacyis.com/black-or-white > https://yourlogicalfallacyis.com/strawman > https://yourlogicalfallacyis.com/appeal-to-emotion Nope, plain ole sarcasm, an expression of shock and awe. > It's great that cgroup1 works for some of your customers, and they are > free to keep using it. If no third party can flush my customers' investment down the toilet, I can cease to care. Please don't CC me in future, you're unlikely to convince me that v2 is remotely sane, nor do you need to. Lucky you. -Mike
Re: [Documentation] State of CPU controller in cgroup v2
On Thu, Aug 11, 2016 at 08:25:06AM +0200, Mike Galbraith wrote: > On Wed, 2016-08-10 at 18:09 -0400, Johannes Weiner wrote: > > The complete lack of cohesiveness between v1 controllers prevents us > > from implementing even the most fundamental resource control that > > cloud fleets like Google's and Facebook's are facing, such as > > controlling buffered IO; attributing CPU cycles spent receiving > > packets, reclaiming memory in kswapd, encrypting the disk; attributing > > swap IO etc. That's why cgroup2 runs a tighter ship when it comes to > > the controllers: to make something much bigger work. > > Where is the gun wielding thug forcing people to place tasks where v2 > now explicitly forbids them? The problems with supporting this are well-documented. Please see R-2 in Documentation/cgroup-v2.txt. > > Agreeing on something - in this case a common controller model - is > > necessarily going to take away some flexibility from how you approach > > a problem. What matters is whether the problem can still be solved. > > What annoys me about this more than the seemingly gratuitous breakage > is that the decision is passed to third parties who have nothing to > lose, and have done quite a bit of breaking lately. Mike, there is no connection between what you are quoting and what you are replying to here. We cannot have a technical discussion when you enter it with your mind fully made up, repeat the same inflammatory talking points over and over - some of them trivially false, some a gross misrepresentation of what we have been trying to do - and are completely unwilling to even entertain the idea that there might be problems outside of the one-controller-scope you are looking at. But to address your point: there is no 'breakage' here. Or in your words: there is no gun wielding thug forcing people to upgrade to v2. If v1 does everything your specific setup needs, nobody forces you to upgrade. 
We are fairly confident that the majority of users *will* upgrade,
simply because v2 solves so many basic resource control problems that
v1 is inherently incapable of solving. There is a positive incentive,
but we are trying not to create negative ones.

And even if you run a systemd distribution, and systemd switches to
v2, it's trivially easy to pry the CPU controller from its hands and
maintain your setup exactly as-is using the current CPU controller.

This is really not a technical argument.

> > This argument that cgroup2 is not backward compatible is laughable.
>
> Fine, you're entitled to your sense of humor. I have one too, I find it
> laughable that threaded applications can only sit there like a lump of
> mud simply because they share more than applications written as a
> gaggle of tasks. "Threads are like.. so yesterday, the future belongs
> to the process" tickles my funny-bone. Whatever, to each his own.

Who are you quoting here? This is such a grotesque misrepresentation
of what we have been saying and implementing, it's not even funny.

In reality, the rgroup extension for setpriority() was directly based
on your and PeterZ's feedback regarding thread control. Except that,
unlike cgroup1's approach to threads, which might work in some setups
but suffers immensely from the global nature of the vfs interface once
you have to cooperate with other applications and system management*,
rgroup was proposed as a much more generic and robust interface to do
hierarchical resource control from inside the application.

* This doesn't have to be systemd, btw. We have used cgroups to
isolate system services, maintenance jobs, cron jobs etc. from our
applications way before systemd, and it's been a pita to coordinate
the system managing applications and the applications managing their
workers using the same globally scoped vfs interface.
> > > I mentioned a real world case of a thread pool servicing customer
> > > accounts by doing something quite sane: hop into an account (cgroup),
> > > do work therein, send bean count off to the $$ department, wash, rinse
> > > repeat. That's real world users making real world cash registers go
> > > ka-ching so real world people can pay their real world bills.
> >
> > Sure, but you're implying that this is the only way to run this real
> > world cash register.
>
> I implied no such thing. Of course it can be done differently, all
> they have to do is rip out these archaic thread thingies.
>
> Apologies for dripping sarcasm all over your monitor, but this annoys
> me far more than it should any casual user of cgroups. Perhaps I
> shouldn't care about the users (suse customers) who will step in this
> eventually, but I do.

https://yourlogicalfallacyis.com/black-or-white
https://yourlogicalfallacyis.com/strawman
https://yourlogicalfallacyis.com/appeal-to-emotion

Can you please try to stay objective?

> > > As with the thread pool, process granularity makes it impossible for
> > > any threaded application affinity to be managed via cpusets, such as
> > > say stuffing realtime critical threads into a shielded cpuset, mundane
> > > threads into another.
Re: [Documentation] State of CPU controller in cgroup v2
On Wed, 2016-08-10 at 18:09 -0400, Johannes Weiner wrote:
> The complete lack of cohesiveness between v1 controllers prevents us
> from implementing even the most fundamental resource control that
> cloud fleets like Google's and Facebook's are facing, such as
> controlling buffered IO; attributing CPU cycles spent receiving
> packets, reclaiming memory in kswapd, encrypting the disk; attributing
> swap IO etc. That's why cgroup2 runs a tighter ship when it comes to
> the controllers: to make something much bigger work.

Where is the gun wielding thug forcing people to place tasks where v2
now explicitly forbids them?

> Agreeing on something - in this case a common controller model - is
> necessarily going to take away some flexibility from how you approach
> a problem. What matters is whether the problem can still be solved.

What annoys me about this more than the seemingly gratuitous breakage
is that the decision is passed to third parties who have nothing to
lose, and have done quite a bit of breaking lately.

> This argument that cgroup2 is not backward compatible is laughable.

Fine, you're entitled to your sense of humor. I have one too, I find it
laughable that threaded applications can only sit there like a lump of
mud simply because they share more than applications written as a
gaggle of tasks. "Threads are like.. so yesterday, the future belongs
to the process" tickles my funny-bone. Whatever, to each his own.

...

> Lastly, again - and this was the whole point of this document - the
> changes in cgroup2 are not gratuitous. They are driven by fundamental
> resource control problems faced by more comprehensive applications of
> cgroup. On the other hand, the opposition here mainly seems to be the
> inconvenience of switching some specialized setups from a v1-oriented
> way of solving a problem to a v2-oriented way.
>
> [ That, and a disturbing number of emotional outbursts against
>   systemd, which has nothing to do with any of this.
>   ]
>
> It's a really myopic line of argument.

And I think the myopia is on the other side of my monitor, whatever.

> That being said, let's go through your points:
>
> > Priority and affinity are not process wide attributes, never have
> > been, but you're insisting that they must become so for the sake of
> > progress.
>
> Not really.
>
> It's just questionable whether the cgroup interface is the best way to
> manipulate these attributes, or whether existing interfaces like
> setpriority() and sched_setaffinity() should be extended to manipulate
> groups, like the rgroup proposal does. The problems of using the
> cgroup interface for this are extensively documented, including in the
> email you were replying to.
>
> > I mentioned a real world case of a thread pool servicing customer
> > accounts by doing something quite sane: hop into an account (cgroup),
> > do work therein, send bean count off to the $$ department, wash, rinse
> > repeat. That's real world users making real world cash registers go
> > ka-ching so real world people can pay their real world bills.
>
> Sure, but you're implying that this is the only way to run this real
> world cash register.

I implied no such thing. Of course it can be done differently, all
they have to do is rip out these archaic thread thingies.

Apologies for dripping sarcasm all over your monitor, but this annoys
me far more than it should any casual user of cgroups. Perhaps I
shouldn't care about the users (suse customers) who will step in this
eventually, but I do.

> I'm not going down the rabbit hole again of arguing against an
> incomplete case description. Scale matters. Number of workers
> matter. Amount of work each thread does matters to evaluate
> transaction overhead. Task migration is an expensive operation etc.
>
> > I also mentioned breakage to cpusets: given exclusive set A and
> > exclusive subset B therein, there is one and only one spot where
> > affinity A exists... at the to be forbidden junction of A and B.
> Again, a means to an end rather than a goal

I don't believe I described a means to an end, I believe I described
affinity bits going missing.

> - and a particularly
> suspicious one at that: why would a cgroup need to tell its *siblings*
> which cpus/nodes it cannot use? In the hierarchical model, it's
> clearly the task of the ancestor to allocate the resources downward.
>
> More details would be needed to properly discuss what we are trying to
> accomplish here.
>
> > As with the thread pool, process granularity makes it impossible for
> > any threaded application affinity to be managed via cpusets, such as
> > say stuffing realtime critical threads into a shielded cpuset, mundane
> > threads into another. There are any number of affinity usages that
> > will break.
>
> Ditto. It's not obvious why this needs to be the cgroup interface and
> couldn't instead be solved with extending sched_setaffinity() - again
> weighing that against the power of the common controller model.
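[Editorial note: the per-thread nature of affinity that this sub-thread keeps returning to can be demonstrated from userspace. A minimal Python sketch (Linux-only; the CPU choice and names are illustrative, not from the thread): sched_setaffinity() addresses a kernel tid, so pinning one thread leaves its siblings' masks untouched.]

```python
import os
import threading

# Pick a CPU this process is actually allowed on, so the sketch also
# works inside a restricted cpuset or container.
cpu = min(os.sched_getaffinity(0))
before = os.sched_getaffinity(0)
result = []

def worker():
    # sched_setaffinity() takes a kernel tid: this pins only the
    # calling thread, not the whole process.
    tid = threading.get_native_id()
    os.sched_setaffinity(tid, {cpu})
    result.append(os.sched_getaffinity(tid))

t = threading.Thread(target=worker)
t.start()
t.join()

print(result[0] == {cpu})                  # worker thread is pinned
print(os.sched_getaffinity(0) == before)   # main thread's mask is untouched
```

This is exactly the property Mike argues cpusets' process-granularity model throws away: the syscall interface distinguishes threads, the v2 cgroup interface does not.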
Re: [Documentation] State of CPU controller in cgroup v2
On Sat, Aug 06, 2016 at 11:04:51AM +0200, Mike Galbraith wrote:
> On Fri, 2016-08-05 at 13:07 -0400, Tejun Heo wrote:
> > It is true that the trees are semantically different from each other
> > and the symmetric handling of tasks and cgroups is aesthetically
> > pleasing. However, it isn't clear what the practical usefulness of
> > a layout with direct competition between tasks and cgroups would be,
> > considering that number and behavior of tasks are controlled by each
> > application, and cgroups primarily deal with system level resource
> > distribution; changes in the number of active threads would directly
> > impact resource distribution. Real world use cases of such layouts
> > could not be established during the discussions.
>
> You apparently intend to ignore any real world usages that don't work
> with these new constraints.

He didn't ignore these use cases. He offered alternatives like rgroup
to allow manipulating threads from within the application, only in a
way that does not interfere with cgroup2's common controller model.

The complete lack of cohesiveness between v1 controllers prevents us
from implementing even the most fundamental resource control that
cloud fleets like Google's and Facebook's are facing, such as
controlling buffered IO; attributing CPU cycles spent receiving
packets, reclaiming memory in kswapd, encrypting the disk; attributing
swap IO etc. That's why cgroup2 runs a tighter ship when it comes to
the controllers: to make something much bigger work.

Agreeing on something - in this case a common controller model - is
necessarily going to take away some flexibility from how you approach
a problem. What matters is whether the problem can still be solved.

This argument that cgroup2 is not backward compatible is laughable.
Of course it's going to be different, otherwise we wouldn't have had
to version it.
The question is not whether the exact same configurations and existing
application design can be used in v1 and v2 - that's a strange onus to
put on a versioned interface. The question is whether you can
translate a solution from v1 to v2. Yeah, it might be a hassle
depending on how specialized your setup is, but that's why we keep v1
around until the last user dies and allow you to freely mix and match
v1 and v2 controllers within a single system to ease the transition.

But this distinction between approach and application design, and the
application's actual purpose is crucial. Every time this discussion
came up, somebody said 'moving worker threads between different
resource domains'. That's not a goal, though, that's a very specific
means to an end, with no explanation of why it has to be done that
way. When comparing the cgroup v1 and v2 interface, we should be
discussing goals, not 'this is my favorite way to do it'. If you have
an actual real-world goal that can be accomplished in v1 but not in
v2 + rgroup, then that's what we should be talking about.

Lastly, again - and this was the whole point of this document - the
changes in cgroup2 are not gratuitous. They are driven by fundamental
resource control problems faced by more comprehensive applications of
cgroup. On the other hand, the opposition here mainly seems to be the
inconvenience of switching some specialized setups from a v1-oriented
way of solving a problem to a v2-oriented way.

[ That, and a disturbing number of emotional outbursts against
  systemd, which has nothing to do with any of this. ]

It's a really myopic line of argument.

That being said, let's go through your points:

> Priority and affinity are not process wide attributes, never have
> been, but you're insisting that they must become so for the sake of
> progress.

Not really.
It's just questionable whether the cgroup interface is the best way to
manipulate these attributes, or whether existing interfaces like
setpriority() and sched_setaffinity() should be extended to manipulate
groups, like the rgroup proposal does. The problems of using the
cgroup interface for this are extensively documented, including in the
email you were replying to.

> I mentioned a real world case of a thread pool servicing customer
> accounts by doing something quite sane: hop into an account (cgroup),
> do work therein, send bean count off to the $$ department, wash, rinse
> repeat. That's real world users making real world cash registers go
> ka-ching so real world people can pay their real world bills.

Sure, but you're implying that this is the only way to run this real
world cash register. I think it's entirely justified to re-evaluate
this, given the myriad of much more fundamental problems that cgroup2
is solving by building on a common controller model.

I'm not going down the rabbit hole again of arguing against an
incomplete case description. Scale matters. Number of workers
matter. Amount of work each thread does matters to evaluate
transaction overhead. Task migration is an expensive operation etc.

> I also mentioned breakage to cpusets: given exclusive set A and
> exclusive subset B therein, there is one and only one spot where
> affinity A exists... at the to be forbidden junction of A and B.
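[Editorial note: the "mix and match v1 and v2 controllers" mentioned above is a mount-time arrangement: a controller can be bound to only one hierarchy at a time, so whatever a v1 mount claims simply stays out of the unified hierarchy. A config sketch, with illustrative paths, requiring root and controllers not already in use elsewhere:]

```shell
# Mount the unified (v2) hierarchy; only controllers that are not
# claimed by a v1 mount appear in its cgroup.controllers file.
mount -t cgroup2 none /sys/fs/cgroup/unified

# Keep the v1 cpu controller for an existing setup.
mkdir -p /sys/fs/cgroup/cpu
mount -t cgroup -o cpu,cpuacct cgroup /sys/fs/cgroup/cpu

# cpu should now be absent here, while memory, io, etc. remain.
cat /sys/fs/cgroup/unified/cgroup.controllers
```

This is the mechanism Johannes refers to for prying the CPU controller out of a v2 setup while migrating everything else.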
Re: [Documentation] State of CPU controller in cgroup v2
On Fri, 2016-08-05 at 13:07 -0400, Tejun Heo wrote:

> 2-2. Impact on CPU Controller
>
> As indicated earlier, the CPU controller's resource distribution graph
> is the simplest. Every schedulable resource consumption can be
> attributed to a specific task. In addition, for weight based control,
> the per-task priority set through setpriority(2) can be translated to
> and from a per-cgroup weight. As such, the CPU controller can treat a
> task and a cgroup symmetrically, allowing support for any tree layout
> of cgroups and tasks. Both process granularity and the no internal
> process constraint restrict how the CPU controller can be used.

Not only the cpu controller, but also cpuacct and cpuset.

> 2-2-1. Impact of Process Granularity
>
> Process granularity prevents tasks belonging to the same process to
> be assigned to different cgroups. It was pointed out [6] that this
> excludes the valid use case of hierarchical CPU distribution within
> processes.

Does that not obsolete the rather useful/common concept "thread pool"?

> 2-2-2. Impact of No Internal Process Constraint
>
> The no internal process constraint disallows tasks from competing
> directly against cgroups. Here is an excerpt from Peter Zijlstra
> pointing out the issue [10] - R, L and A are cgroups; t1, t2, t3 and
> t4 are tasks:
>
>        R
>      / | \
>    t1  t2  A
>           / \
>         t3   t4
>
>   Is fundamentally different from:
>
>         R
>       /   \
>      L     A
>     / \   / \
>   t1  t2 t3  t4
>
>   Because if in the first hierarchy you add a task (t5) to R, all of
>   its A will run at 1/4th of total bandwidth where before it had
>   1/3rd, whereas with the second example, if you add our t5 to L, A
>   doesn't get any less bandwidth.
>
> It is true that the trees are semantically different from each other
> and the symmetric handling of tasks and cgroups is aesthetically
> pleasing.
> However, it isn't clear what the practical usefulness of
> a layout with direct competition between tasks and cgroups would be,
> considering that number and behavior of tasks are controlled by each
> application, and cgroups primarily deal with system level resource
> distribution; changes in the number of active threads would directly
> impact resource distribution. Real world use cases of such layouts
> could not be established during the discussions.

You apparently intend to ignore any real world usages that don't work
with these new constraints.

Priority and affinity are not process wide attributes, never have
been, but you're insisting that they must become so for the sake of
progress.

I mentioned a real world case of a thread pool servicing customer
accounts by doing something quite sane: hop into an account (cgroup),
do work therein, send bean count off to the $$ department, wash, rinse
repeat. That's real world users making real world cash registers go
ka-ching so real world people can pay their real world bills.

I also mentioned breakage to cpusets: given exclusive set A and
exclusive subset B therein, there is one and only one spot where
affinity A exists... at the to be forbidden junction of A and B.

As with the thread pool, process granularity makes it impossible for
any threaded application affinity to be managed via cpusets, such as
say stuffing realtime critical threads into a shielded cpuset, mundane
threads into another. There are any number of affinity usages that
will break.

Try as I may, I can't see anything progressive about enforcing process
granularity of per thread attributes. I do see regression potential
for users of these controllers, and no viable means to even report
them as being such. It will likely be systemd flipping the V2 on
switch, not the kernel, not the user. Regression reports would thus
presumably be deflected to... those who want this. Sweet.

	-Mike
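[Editorial note: the bandwidth arithmetic in Peter's excerpt above can be checked mechanically. A toy model of hierarchical fair-share splitting, assuming equal weights everywhere; purely illustrative, nothing like the actual CFS implementation:]

```python
# Each node splits its bandwidth evenly among its children, so tasks
# and cgroups sitting at the same level compete directly, as in the
# first layout of Peter's example.

def shares(tree, total=1.0):
    """tree maps child name -> subtree dict (cgroup) or None (task)."""
    out = {}
    part = total / len(tree)
    for name, sub in tree.items():
        out[name] = part
        if sub is not None:
            out.update(shares(sub, part))
    return out

# First layout: t1, t2 and cgroup A compete directly under R.
r1 = {"t1": None, "t2": None, "A": {"t3": None, "t4": None}}
print(shares(r1)["A"])    # A gets 1/3 of total bandwidth

r1["t5"] = None           # add t5 directly to R ...
print(shares(r1)["A"])    # ... and A drops to 1/4

# Second layout: tasks live only in the leaf cgroups L and A.
r2 = {"L": {"t1": None, "t2": None}, "A": {"t3": None, "t4": None}}
r2["L"]["t5"] = None      # adding t5 to L ...
print(shares(r2)["A"])    # ... leaves A at 1/2
```

Under this model the two trees really are "fundamentally different": in the first, every task added to R dilutes A; in the second, membership changes inside L never affect A's share.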