Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-27 Thread Mike Galbraith
On Tue, 2015-10-27 at 15:00 +0900, Tejun Heo wrote:
> On Tue, Oct 27, 2015 at 06:56:42AM +0100, Mike Galbraith wrote:
> > > Well, if you think certain things are being missed, please speak up.
> > > Not in some media campaign way but with technical reasoning and
> > > justifications.
> > 
> > Inserting a middle-man is extremely unlikely to improve performance.
> 
> I'm not following you at all.  Technical reasoning and justifications
> is a middle-man?

No, user <-> systemd or whatever <-> kernel
             ^^^^^^^^^^^^^^^^^^^

> I don't think anything productive is likely to come out of this
> conversation.  Let's just end this sub-thread.

Agreed.

-Mike



Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-27 Thread Tejun Heo
On Tue, Oct 27, 2015 at 06:56:42AM +0100, Mike Galbraith wrote:
> > Well, if you think certain things are being missed, please speak up.
> > Not in some media campaign way but with technical reasoning and
> > justifications.
> 
> Inserting a middle-man is extremely unlikely to improve performance.

I'm not following you at all.  Technical reasoning and justifications
is a middle-man?

I don't think anything productive is likely to come out of this
conversation.  Let's just end this sub-thread.

Thanks.

-- 
tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-26 Thread Mike Galbraith
On Tue, 2015-10-27 at 14:46 +0900, Tejun Heo wrote:
> Hello,
> 
> On Tue, Oct 27, 2015 at 06:42:11AM +0100, Mike Galbraith wrote:
> > Sure, sounds fine, I just fervently hope that the below is foul swamp
> > gas having nothing what so ever to do with your definition of "saner".
> 
> lol, idk, you keep taking things in weird directions.  Let's just stay
> technical, okay?
> 
> > http://www.linuxfoundation.org/news-media/blogs/browse/2013/08/all-about-linux-kernel-cgroup%E2%80%99s-redesign
> > 
> > http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
> > 
> > I'm not into begging.  I really don't want to have to ask anyone to
> > pretty please do for me what I can currently do all by my little self
> > without having to give a rats ass less whether what I want to do fits in
> > the world view of this or that obnoxious little control freak.
> 
> Well, if you think certain things are being missed, please speak up.
> Not in some media campaign way but with technical reasoning and
> justifications.

Inserting a middle-man is extremely unlikely to improve performance.

-Mike



Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-26 Thread Tejun Heo
Hello,

On Tue, Oct 27, 2015 at 06:42:11AM +0100, Mike Galbraith wrote:
> Sure, sounds fine, I just fervently hope that the below is foul swamp
> gas having nothing what so ever to do with your definition of "saner".

lol, idk, you keep taking things in weird directions.  Let's just stay
technical, okay?

> http://www.linuxfoundation.org/news-media/blogs/browse/2013/08/all-about-linux-kernel-cgroup%E2%80%99s-redesign
> 
> http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
> 
> I'm not into begging.  I really don't want to have to ask anyone to
> pretty please do for me what I can currently do all by my little self
> without having to give a rats ass less whether what I want to do fits in
> the world view of this or that obnoxious little control freak.

Well, if you think certain things are being missed, please speak up.
Not in some media campaign way but with technical reasoning and
justifications.

Thanks.

-- 
tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-26 Thread Mike Galbraith
On Tue, 2015-10-27 at 12:16 +0900, Tejun Heo wrote:
> Hello, Mike.
> 
> On Sun, Oct 25, 2015 at 04:43:33AM +0100, Mike Galbraith wrote:
> > I don't think it's weird, it's just a thought wrt where pigeon holing
> > could lead:  If you filter out current users who do so in a manner you
> > consider to be in some way odd, when all the filtering is done, you may
> > find that you've filtered out the vast majority of current deployment.
> 
> I think you misunderstood what I wrote.  It's not about excluding
> existing odd use cases.  It's about examining the usages and
> extracting the required capabilities and building an interface which
> is well defined and blends well with the rest of programming interface
> provided by the kernel so that those can be achieved in a saner way.

Sure, sounds fine, I just fervently hope that the below is foul swamp
gas having nothing what so ever to do with your definition of "saner".

http://www.linuxfoundation.org/news-media/blogs/browse/2013/08/all-about-linux-kernel-cgroup%E2%80%99s-redesign

http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/

I'm not into begging.  I really don't want to have to ask anyone to
pretty please do for me what I can currently do all by my little self
without having to give a rats ass less whether what I want to do fits in
the world view of this or that obnoxious little control freak.

-Mike



Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-26 Thread Tejun Heo
Hello, Mike.

On Sun, Oct 25, 2015 at 04:43:33AM +0100, Mike Galbraith wrote:
> I don't think it's weird, it's just a thought wrt where pigeon holing
> could lead:  If you filter out current users who do so in a manner you
> consider to be in some way odd, when all the filtering is done, you may
> find that you've filtered out the vast majority of current deployment.

I think you misunderstood what I wrote.  It's not about excluding
existing odd use cases.  It's about examining the usages and
extracting the required capabilities and building an interface which
is well defined and blends well with the rest of programming interface
provided by the kernel so that those can be achieved in a saner way.

If doing acrobatics with the current interface is necessary to achieve
certain capabilities, we need to come up with a better interface for
those.  If fringe usages can be satisfied using better constructs, we
should implement that and encourage transition to a better mechanism.

> I'm not at all sure of this, but I suspect that SUSE's gigabuck size
> cgroup power user will land in the same "fringe" pigeon hole.  If so,
> that would be another sizeable dent in volume.
> 
> My point is that these power users likely _are_ your general audience.

Sure, that doesn't mean we shouldn't scrutinize the interface we
implement to support those users.  Also, cgroup definitely had a
negative spiral effect where eccentric mechanisms and interfaces
discouraged wider general usage, fortifying the argument that "we're
the main users", which in turn fed back to even weirder things being
added.  Everybody, including the "heavy" users, suffers from such
failures in the long term.

We sure want to support all the valid use cases from heavy users in a
reasonable way but that doesn't mean we say yes to everything.

> Sure, it was just a thought wrt "actively filter those out" and who all
> "those" may end up being.

I hope what I meant is clearer now.

Thanks.

-- 
tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-26 Thread Peter Zijlstra
On Sun, Oct 25, 2015 at 02:17:23PM +0100, Florian Weimer wrote:
> On 10/25/2015 12:58 PM, Theodore Ts'o wrote:
> 
> > Well, I was thinking we could just teach them to use
> > "syscall(SYS_gettid)".
> 
> Right, and that's easier if TIDs are officially part of the GNU API.
> 
> I think the worry is that some future system might have TIDs which do
> not share the PID space, or are real descriptors (that they need
> explicit open and close operations).

For the scheduler the sharing of pid/tid space is not an issue.

Semantically all [1] scheduler syscalls take a tid. There isn't a single
syscall that iterates the thread group.

Even sys_setpriority() interprets its @who argument as a tid when
@which == PRIO_PROCESS (PRIO_PGRP looks to be the actual process).

[1] as seen from: git grep SYSCALL kernel/sched/
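
For illustration, a minimal user-space sketch of the point above - assuming
Linux with glibc, with the tid fetched via syscall(2) since older glibc
versions do not provide a gettid() wrapper; compile with -pthread - might
look like this:

  #include <pthread.h>
  #include <stdio.h>
  #include <sys/resource.h>
  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <unistd.h>

  static void *worker(void *arg)
  {
      int nice_val = *(int *) arg;
      pid_t tid = (pid_t) syscall(SYS_gettid);

      /* @which == PRIO_PROCESS, @who == tid: only this thread is reniced. */
      if (setpriority(PRIO_PROCESS, tid, nice_val) != 0)
          perror("setpriority");
      printf("tid %d now at nice %d\n", tid, getpriority(PRIO_PROCESS, tid));
      return NULL;
  }

  int main(void)
  {
      pthread_t a, b;
      int nice_a = 10, nice_b = 0;

      /* Two threads of the same process end up with different nice values. */
      pthread_create(&a, NULL, worker, &nice_a);
      pthread_create(&b, NULL, worker, &nice_b);
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      return 0;
  }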




Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-25 Thread Florian Weimer
On 10/25/2015 12:58 PM, Theodore Ts'o wrote:

> Well, I was thinking we could just teach them to use
> "syscall(SYS_gettid)".

Right, and that's easier if TIDs are officially part of the GNU API.

I think the worry is that some future system might have TIDs which do
not share the PID space, or are real descriptors (that they need
explicit open and close operations).

> On a different subject, I'm going to start telling people to use
> "syscall(SYS_getrandom)", since I think that's going to be easier than
> asking people to change their Makefiles to link against some
> Linux-specific library, but that's a different debate, and I recognize
> the glibc folks aren't willing to bend on that one.

I think we can reach consensus for an implementation which makes this code

  unsigned char session_key[32];
  getrandom (session_key, sizeof (session_key), 0);
  install_session_key (session_key);

correct.  That is, no error handling code for ENOMEM, ENOSYS, EINTR,
or short reads is necessary.  It seems that several getrandom
wrappers currently built into applications do not get this completely right.

Florian
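
For context, the kind of defensive wrapper that applications carry today -
and that the guarantee described above would make unnecessary - looks
roughly like the following sketch.  It assumes Linux headers that define
SYS_getrandom and falls back to /dev/urandom on ENOSYS; the helper name is
made up for illustration.

  #include <errno.h>
  #include <fcntl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Hypothetical helper: fill the buffer with len random bytes, coping with
   * the failure modes mentioned above (ENOSYS, EINTR, short reads). */
  static int get_random_bytes(unsigned char *p, size_t len)
  {
      while (len > 0) {
          long n = syscall(SYS_getrandom, p, len, 0);

          if (n < 0) {
              if (errno == EINTR)
                  continue;                 /* interrupted: retry */
              if (errno == ENOSYS)
                  break;                    /* old kernel: fall back below */
              return -1;
          }
          p += n;                           /* short read: keep reading */
          len -= (size_t) n;
      }

      if (len > 0) {                        /* ENOSYS: /dev/urandom fallback */
          int fd = open("/dev/urandom", O_RDONLY | O_CLOEXEC);

          if (fd < 0)
              return -1;
          while (len > 0) {
              ssize_t n = read(fd, p, len);

              if (n <= 0) {
                  if (n < 0 && errno == EINTR)
                      continue;
                  close(fd);
                  return -1;
              }
              p += n;
              len -= (size_t) n;
          }
          close(fd);
      }
      return 0;
  }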



Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-25 Thread Theodore Ts'o
On Sun, Oct 25, 2015 at 11:47:04AM +0100, Florian Weimer wrote:
> On 10/25/2015 11:41 AM, Theodore Ts'o wrote:
> > On Sun, Oct 25, 2015 at 10:33:32AM +0100, Ingo Molnar wrote:
> >>
> >> Hm, that's weird - all our sched_*() system call APIs that set task
> >> scheduling priorities are fundamentally per thread, not per process.
> >> Same goes for the old sys_nice() interface. The scheduler has no real
> >> notion of 'process', and certainly not at the system call level.
> >>
> > 
> > I suspect the main issue is that the games programmers were trying to
> > access it via libc / pthreads, which hides a lot of the power
> > available at the raw syscall level.  This is probably more of a
> > "tutorial needed for userspace programmers" issue, at a guess.
> 
> If this refers to the lack of exposure of thread IDs in glibc, we are
> willing to change that on glibc side.  The discussion has progressed to
> the point where it is now about the question whether it should be part
> of the GNU API (like sched_setaffinity), or live in glibc as a
> Linux-specific extension (like sched_getcpu).  More input is certainly
> welcome.

Well, I was thinking we could just teach them to use
"syscall(SYS_gettid)".

On a different subject, I'm going to start telling people to use
"syscall(SYS_getrandom)", since I think that's going to be easier than
asking people to change their Makefiles to link against some
Linux-specific library, but that's a different debate, and I recognize
the glibc folks aren't willing to bend on that one.

Cheers,

- Ted
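
In practice "teaching them" amounts to a couple of one-line wrappers along
these lines - a sketch assuming Linux headers that define SYS_gettid and
SYS_getrandom:

  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Hypothetical wrapper names; the raw syscalls are the point. */
  static pid_t my_gettid(void)
  {
      return (pid_t) syscall(SYS_gettid);
  }

  static ssize_t my_getrandom(void *buf, size_t len, unsigned int flags)
  {
      return syscall(SYS_getrandom, buf, len, flags);
  }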


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-25 Thread Florian Weimer
On 10/25/2015 11:41 AM, Theodore Ts'o wrote:
> On Sun, Oct 25, 2015 at 10:33:32AM +0100, Ingo Molnar wrote:
>>
>> Hm, that's weird - all our sched_*() system call APIs that set task
>> scheduling priorities are fundamentally per thread, not per process. Same
>> goes for the old sys_nice() interface. The scheduler has no real notion of
>> 'process', and certainly not at the system call level.
>>
> 
> I suspect the main issue is that the games programmers were trying to
> access it via libc / pthreads, which hides a lot of the power
> available at the raw syscall level.  This is probably more of a
> "tutorial needed for userspace programmers" issue, at a guess.

If this refers to the lack of exposure of thread IDs in glibc, we are
willing to change that on glibc side.  The discussion has progressed to
the point where it is now about the question whether it should be part
of the GNU API (like sched_setaffinity), or live in glibc as a
Linux-specific extension (like sched_getcpu).  More input is certainly
welcome.

Old concerns about support for n:m threading implementations in glibc
are no longer relevant because too much code using well-documented
interfaces would break.

Florian



Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-25 Thread Theodore Ts'o
On Sun, Oct 25, 2015 at 10:33:32AM +0100, Ingo Molnar wrote:
> 
> Hm, that's weird - all our sched_*() system call APIs that set task
> scheduling priorities are fundamentally per thread, not per process. Same
> goes for the old sys_nice() interface. The scheduler has no real notion of
> 'process', and certainly not at the system call level.
>

I suspect the main issue is that the games programmers were trying to
access it via libc / pthreads, which hides a lot of the power
available at the raw syscall level.  This is probably more of a
"tutorial needed for userspace programmers" issue, at a guess.

   - Ted


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-25 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Sun, Oct 25, 2015 at 11:18 AM, Tejun Heo  wrote:
> >
> > We definitely need to weigh the inputs from heavy users but also need to
> > discern the actual problems which need to be solved from the specific
> > mechanisms chosen to solve them.  Let's please keep the discussions
> > technical.  That's the best way to reach a viable long-term solution which
> > can benefit a lot wider audience in the long term.  Even though that might
> > not be the path of least immediate resistance, I believe that google will
> > be an eventual beneficiary too.
> 
> So here's a somewhat odd request I got to hear very recently (at LinuxCon
> EU in Ireland)..
> 
> At least some game engine writers apparently would like to be able to set
> scheduling priorities for threads within a single process, because they may
> want the game as a whole to have a certain priority, but then some of the
> threads are critical for latency and may want certain guaranteed resources
> (eg audio or actual gameplay) while others are very much background things
> (garbage collection etc).
> 
> I suspect that's a very non-google use. We apparently don't really support
> that kind of per-thread model right now at all.

Hm, that's weird - all our sched_*() system call APIs that set task scheduling
priorities are fundamentally per thread, not per process. Same goes for the
old sys_nice() interface. The scheduler has no real notion of 'process', and
certainly not at the system call level.

This was always so and is expected to remain so in the future as well - and
this is unrelated to cgroups.

> Do they want cgroups? Maybe not. You can apparently do something like this
> under Windows and OS X, but not under Linux (and I'm reporting second-hand
> here, I don't know the exact details). I'm just bringing it up as a somewhat
> unusual non-server thing that is certainly very relevant despite being
> different.

So I'd really like to hear about specifics, and they might be banging on open
doors!

Thanks,

Ingo
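
For the game-engine case above, the "open door" already looks roughly like
the following sketch: one latency-critical thread gets a real-time policy
while the rest of the process stays at SCHED_OTHER.  It assumes Linux/NPTL
and sufficient privilege (CAP_SYS_NICE or an adequate RLIMIT_RTPRIO); the
priority value and thread bodies are illustrative only.

  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>
  #include <string.h>

  static void *audio_loop(void *arg)
  {
      struct sched_param sp = { .sched_priority = 50 };
      int err;

      (void) arg;
      /* Affects only this thread, not the game process as a whole. */
      err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
      if (err)
          fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
      /* ... low-latency audio mixing ... */
      return NULL;
  }

  static void *gc_loop(void *arg)
  {
      (void) arg;
      /* Left at the default SCHED_OTHER policy. */
      /* ... background garbage collection ... */
      return NULL;
  }

  int main(void)
  {
      pthread_t audio, gc;

      pthread_create(&audio, NULL, audio_loop, NULL);
      pthread_create(&gc, NULL, gc_loop, NULL);
      pthread_join(audio, NULL);
      pthread_join(gc, NULL);
      return 0;
  }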


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-24 Thread Linus Torvalds
On Sun, Oct 25, 2015 at 11:18 AM, Tejun Heo  wrote:
>
> We definitely need to weigh the inputs from heavy users but also need
> to discern the actual problems which need to be solved from the
> specific mechanisms chosen to solve them.  Let's please keep the
> discussions technical.  That's the best way to reach a viable
> long-term solution which can benefit a lot wider audience in the long
> term.  Even though that might not be the path of least immediate
> resistance, I believe that google will be an eventual beneficiary too.

So here's a somewhat odd request I got to hear very recently (at
LinuxCon EU in Ireland)..

At least some game engine writers apparently would like to be able to
set scheduling priorities for threads within a single process, because
they may want the game as a whole to have a certain priority, but then
some of the threads are critical for latency and may want certain
guaranteed resources (eg audio or actual gameplay) while others are
very much background things (garbage collection etc).

I suspect that's a very non-google use. We apparently don't really
support that kind of per-thread model right now at all.

Do they want cgroups? Maybe not. You can apparently do something like
this under Windows and OS X, but not under Linux (and I'm reporting
second-hand here, I don't know the exact details). I'm just bringing
it up as a somewhat unusual non-server thing that is certainly very
relevant despite being different.

 Linus


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-24 Thread Mike Galbraith
On Sun, 2015-10-25 at 11:18 +0900, Tejun Heo wrote:
> Hello, Mike.
> 
> On Sat, Oct 24, 2015 at 06:36:07AM +0200, Mike Galbraith wrote:
> > On Sat, 2015-10-24 at 07:21 +0900, Tejun Heo wrote:
> > 
> > > It'd be a step back in usability only for users who have been using
> > > cgroups in fringing ways which can't be justified for ratification and
> > > we do want to actively filter those out.
> > 
> > Of all the cgroup signal currently in existence, seems the Google signal
> > has to have the most volume under the curve by a mile.  If you were to
> > filter that signal out, what remained would be a flat line of noise.
> 
> This is a weird direction to take the discussion, but let me provide a
> counter argument.

I don't think it's weird, it's just a thought wrt where pigeon holing
could lead:  If you filter out current users who do so in a manner you
consider to be in some way odd, when all the filtering is done, you may
find that you've filtered out the vast majority of current deployment.

> Google sure is an important user of the kernel and likely the most
> extensive user of cgroup.  At the same time, its kernel efforts are
> largely in service of a few very big internal customers which are in
> control of large part of the entire software stack.  The things that
> are important for general audience of the kernel in the long term
> don't necessarily coincide with what such efforts need or want.

I'm not at all sure of this, but I suspect that SUSE's gigabuck size
cgroup power user will land in the same "fringe" pigeon hole.  If so,
that would be another sizeable dent in volume.

My point is that these power users likely _are_ your general audience.
 
> I'd even venture to say as much as the inputs coming out of google are
> interesting and important, they're also a lot more prone to lock-in
> effects to short term solutions and status-quo given their priorities.
> This is not to denigrate google's kernel efforts but just to
> counter-balance "it's google" as a shortcut for proper technical
> discussions.
> 
> There are good reasons why cgroup is the design disaster as it is now
> and chasing each usage scenario and hack which provides the least
> immediate resistance without paying the effort to extract the actual
> requirements and common solutions is an important one.  It is critical
> to provide back-pressure for long-term thinking and solutions;
> otherwise, we're bound to repeat the errors and end up with something
> which everyone loves to hate.
> 
> We definitely need to weigh the inputs from heavy users but also need
> to discern the actual problems which need to be solved from the
> specific mechanisms chosen to solve them.  Let's please keep the
> discussions technical.

Sure, it was just a thought wrt "actively filter those out" and who all
"those" may end up being.

-Mike



Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-24 Thread Tejun Heo
Hello, Mike.

On Sat, Oct 24, 2015 at 06:36:07AM +0200, Mike Galbraith wrote:
> On Sat, 2015-10-24 at 07:21 +0900, Tejun Heo wrote:
> 
> > It'd be a step back in usability only for users who have been using
> > cgroups in fringing ways which can't be justified for ratification and
> > we do want to actively filter those out.
> 
> Of all the cgroup signal currently in existence, seems the Google signal
> has to have the most volume under the curve by a mile.  If you were to
> filter that signal out, what remained would be a flat line of noise.

This is a weird direction to take the discussion, but let me provide a
counter argument.

Google sure is an important user of the kernel and likely the most
extensive user of cgroup.  At the same time, its kernel efforts are
largely in service of a few very big internal customers which are in
control of large part of the entire software stack.  The things that
are important for general audience of the kernel in the long term
don't necessarily coincide with what such efforts need or want.

I'd even venture to say as much as the inputs coming out of google are
interesting and important, they're also a lot more prone to lock-in
effects to short term solutions and status-quo given their priorities.
This is not to denigrate google's kernel efforts but just to
counter-balance "it's google" as a shortcut for proper technical
discussions.

There are good reasons why cgroup is the design disaster as it is now
and chasing each usage scenario and hack which provides the least
immediate resistance without paying the effort to extract the actual
requirements and common solutions is an important one.  It is critical
to provide back-pressure for long-term thinking and solutions;
otherwise, we're bound to repeat the errors and end up with something
which everyone loves to hate.

We definitely need to weigh the inputs from heavy users but also need
to discern the actual problems which need to be solved from the
specific mechanisms chosen to solve them.  Let's please keep the
discussions technical.  That's the best way to reach a viable
long-term solution which can benefit a lot wider audience in the long
term.  Even though that might not be the path of least immediate
resistance, I believe that google will be an eventual beneficiary too.

Thanks.

-- 
tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-23 Thread Mike Galbraith
On Sat, 2015-10-24 at 07:21 +0900, Tejun Heo wrote:

> It'd be a step back in usability only for users who have been using
> cgroups in fringing ways which can't be justified for ratification and
> we do want to actively filter those out.

Of all the cgroup signal currently in existence, seems the Google signal
has to have the most volume under the curve by a mile.  If you were to
filter that signal out, what remained would be a flat line of noise.

-Mike



Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-23 Thread Tejun Heo
Hello, Paul.

On Thu, Oct 15, 2015 at 04:42:37AM -0700, Paul Turner wrote:
> > The thing which bothers me the most is that cpuset behavior is
> > different from global case for no good reason.
> 
> I've tried to explain above that I believe there are reasonable
> reasons for it working the way it does from an interface perspective.
> I do not think they can be so quickly discarded out of hand.  However,
> I think we should continue winnowing focus and first resolve the model
> of interaction for sub-process hierarchies,

One way or the other, I think the kernel needs to sort out how task
affinity masks are handled when the available CPUs change, be that
from CPU hotplug or cpuset config changes.

On forcing all affinity masks to the set of available CPUs, I'm still
not convinced that it's a useful extra behavior to implement for
cpuset especially given that the same can be achieved from userland
without too much difficulty.  This goes back to the argument for
implementing the minimal set of functionality which can be used as
building blocks.  Updating all task affinity masks is an irreversible,
destructive operation.  It doesn't enable anything which can't be done
otherwise but does end up restricting how the feature can be used.

But yeah, let's shelve this subject for now.
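
(For reference, the userland route mentioned above is roughly the following
sketch: walk /proc/<pid>/task and reset each thread's mask explicitly.  The
helper is hypothetical and error handling is trimmed.)

  #define _GNU_SOURCE
  #include <dirent.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>

  static void reset_affinity(pid_t pid, const cpu_set_t *mask)
  {
      char path[64];
      struct dirent *de;
      DIR *dir;

      snprintf(path, sizeof(path), "/proc/%d/task", (int) pid);
      dir = opendir(path);
      if (!dir)
          return;

      while ((de = readdir(dir)) != NULL) {
          pid_t tid = (pid_t) atoi(de->d_name);   /* "." and ".." give 0 */

          /* Apply the new mask to every thread of the target process. */
          if (tid > 0 && sched_setaffinity(tid, sizeof(*mask), mask) != 0)
              perror("sched_setaffinity");
      }
      closedir(dir);
  }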

> > Now, if you make the in-process grouping dynamic and accessible to
> > external entities (and if we aren't gonna do that, why even bother?),
> > this breaks down and we have some of the same problems we have with
> > allowing applications to directly manipulate cgroup sub-directories.
> > This is a fundamental problem.  Setting attributes can be shared but
> > organization is an exclusive process.  You can't share that without
> > close coordination.
> 
> Your concern here is centered on permissions, not the interface.
> 
> This seems directly remedied by exactly:
>   Any sub-process hierarchy we exposed would be locked down in terms
> of write access.  These would not be generally writable.  You're
> absolutely correct that you can't share without close coordination,
> and granting the appropriate permissions is part of that.

It is not about permissions.  It is about designing an interface which
guarantees a certain set of invariants regardless of privileges - even
root can't violate such invariants short of injecting code into and
modifying the behavior of the target process.  This isn't anything
unusual.  In fact, permission based access control is something which
is added if and only if allowing and controlling accesses from
multiple parties is necessary and needs to be explicitly justified.

If coordination in terms of thread hierarchy organization from the
target process is needed for allowing external entities to twiddle
with resource distribution, no capability is lost by making the
organization solely the responsibility of the target process while
gaining a lot stronger set of behavioral invariants.  I can't see
strong enough justifications for allowing external entities to
manipulate in-process thread organization.

> > assigning the full responsibility of in-process organization to the
> > application itself and tying it to static parental relationship allows
> > for solid common grounds where these resource operations can be
> > performed by different entities without causing structural issues just
> > like other similar operations.
> 
> But cases have already been presented above where the full
> responsibility cannot be delegated to the application.  Because we
> explicitly depend on constraints being provided by the external
> environment.

I don't think such cases have been presented.  The only thing
necessary is the target processes organizing threads in a way which
allows external agents to apply external constraints.

> > It's not that but more about what the file-system interface implies.
> > It's not just different.  It breaks a lot of expectations a lot of
> > application visible kernel interface provides as explained above.
> > There are reasons why we usually don't do things this way.
> 
> The arguments you've made above are largely centered on permissions
> and the right to make modifications.  I don't see what other
> expectations you believe are being broken here.  This still feels like
> an aesthetic objection.

I hope my points are clear by now.

> > It does require the applications to follow certain protocols to
> > organize itself but this is a pretty trivial thing to do and comes
> > with the benefit that we don't need to introduce a completely new
> > grouping concept to applications.
> 
> I strongly disagree here:  Applications today do _not_ use sub-process
> clone hierarchies.  As a result, this _is_ introducing a
> completely new grouping concept because it's one applications have
> never cared about outside of a shell implementation.

It is a logical extension of how the kernel organizes processes in the
system.  It's a lot more native to how programs usually interact with
the system than 

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-23 Thread Mike Galbraith
On Sat, 2015-10-24 at 07:21 +0900, Tejun Heo wrote:

> It'd be a step back in usability only for users who have been using
> cgroups in fringing ways which can't be justified for ratification and
> we do want to actively filter those out.

Of all the cgroup signal currently in existence, seems the Google signal
has to have the most volume under the curve by a mile.  If you were to
filter that signal out, what remained would be a flat line of noise.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-15 Thread Paul Turner
On Thu, Oct 1, 2015 at 11:46 AM, Tejun Heo  wrote:
> Hello, Paul.
>
> Sorry about the delay.  Things were kinda hectic in the past couple
> weeks.

Likewise :-(

>
> On Fri, Sep 18, 2015 at 04:27:07AM -0700, Paul Turner wrote:
>> On Sat, Sep 12, 2015 at 7:40 AM, Tejun Heo  wrote:
>> > On Wed, Sep 09, 2015 at 05:49:31AM -0700, Paul Turner wrote:
>> >> I do not think this is a layering problem.  This is more like C++:
>> >> there is no sane way to concurrently use all the features available,
>> >> however, reasonably self-consistent subsets may be chosen.
>> >
>> > That's just admitting failure.
>> >
>>
>> Alternatively: accepting there are varied use-cases to
>> support.
>
> Analogies like this can go awry but as we're in it anyway, let's push
> it a bit further.  One of the reasons why C++ isn't lauded as an
> example of great engineering is while it does support a vast number of
> use-cases or rather usage-scenarios (it's not necessarily focused on
> utility but just how things are done) it fails to distill the essence
> of the actual utility out of them and condense it.  It's not just an
> aesthetic argument.  That failure exacts heavy costs on its users and
> is one of the reasons why C++ projects are more prone to horrible
> disarrays unless specific precautions are taken.
>
> I'm not against supporting valid and useful use-cases but not all
> usage-scenarios are equal.  If we can achieve the same eventual goals
> with reasonable trade-offs in a simpler and more straight-forward way,
> that's what we should do even though that'd require some modifications
> to specific usage-scenarios.  ie. the usage-scenarios need to
>> be scrutinized so that the core of the utility can be extracted and
> abstracted in the, hopefully, minimal way.
>
> This is what worries me when you liken the situation to C++.  You
> probably were trying to make a different point but I'm not sure we're
> on the same page and I think we need to agree at least on this in
> principle; otherwise, we'll just keep talking past each other.


I agree with trying to reach a minimal core functionality that
satisfies all use-cases.  I am only saying, however, that I do not
think we can reduce to an API so minimal that all users will use all
parts of it.  We have to fit more than one usage model in.

>
>> > The kernel does not update all CPU affinity masks when a CPU goes down
>> > or comes up.  It just enforces the intersection and when the
>> > intersection becomes empty, ignores it.  cgroup-scoped behaviors
>> > should reflect what the system does in the global case in general, and
>> > the global behavior here, although missing some bits, is a lot saner
>> > than what cpuset is currently doing.
>>
>> You are conflating two things here:
>> 1) How we maintain these masks
>> 2) The interactions on updates
>>
>> I absolutely agree with you that we want to maintain (1) in a
>> non-pointwise format.  I've already talked about that in other replies
>> on this thread.
>>
>> However for (2) I feel you are:
>>  i) Underestimating the complexity of synchronizing updates with user-space.
>>  ii) Introducing more non-desirable behaviors [partial overwrite] than
>> those you object to [total overwrite].
>
> The thing which bothers me the most is that cpuset behavior is
> different from global case for no good reason.

I've tried to explain above that I believe there are reasonable
reasons for it working the way it does from an interface perspective.
I do not think they can be so quickly discarded out of hand.  However,
I think we should continue winnowing focus and first resolve the model
of interaction for sub-process hierarchies,

> We don't have a model
> right now.  It's schizophrenic.  And what I was trying to say was that
> maybe this is because we never had a working model in the global case
> either but if that's the case we need to solve the global case too or
> at least figure out where we wanna be in the long term.
>
>> It's the most consistent choice; you've not given any reasons above
>> why a solution with only partial consistency is any better.
>>
>> Any choice here is difficult to coordinate, that two APIs allow
>> manipulation of the same property means that we must always
>> choose some compromise here.  I prefer the one with the least
>> surprises.
>
> I don't think the current situation around affinity mask handling can
> be considered consistent and cpuset is pouring more inconsistencies
> into it.  We need to figure it out one way or the other.
>
> ...
>> I do not yet see a good reason why the threads arbitrarily not sharing an
>> address space necessitates the use of an entirely different API.  The
>> only problems stated so far in this discussion have been:
>>   1) Actual issues involving relative paths, which are potentially solvable.
>
> Also the ownership of organization.  If the use-cases can be
> reasonably served with static grouping, I think it'd definitely be a
> worthwhile trade-off to make.  It's different from 

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-10-01 Thread Tejun Heo
Hello, Paul.

Sorry about the delay.  Things were kinda hectic in the past couple
weeks.

On Fri, Sep 18, 2015 at 04:27:07AM -0700, Paul Turner wrote:
> On Sat, Sep 12, 2015 at 7:40 AM, Tejun Heo  wrote:
> > On Wed, Sep 09, 2015 at 05:49:31AM -0700, Paul Turner wrote:
> >> I do not think this is a layering problem.  This is more like C++:
> >> there is no sane way to concurrently use all the features available,
> >> however, reasonably self-consistent subsets may be chosen.
> >
> > That's just admitting failure.
> >
> 
> Alternatively: accepting there are varied use-cases to
> support.

Analogies like this can go awry but as we're in it anyway, let's push
it a bit further.  One of the reasons why C++ isn't lauded as an
example of great engineering is while it does support a vast number of
use-cases or rather usage-scenarios (it's not necessarily focused on
utility but just how things are done) it fails to distill the essence
of the actual utility out of them and condense it.  It's not just an
aesthetic argument.  That failure exacts heavy costs on its users and
is one of the reasons why C++ projects are more prone to horrible
disarrays unless specific precautions are taken.

I'm not against supporting valid and useful use-cases but not all
usage-scenarios are equal.  If we can achieve the same eventual goals
with reasonable trade-offs in a simpler and more straight-forward way,
that's what we should do even though that'd require some modifications
to specific usage-scenarios.  ie. the usage-scenarios need to
be scrutinized so that the core of the utility can be extracted and
abstracted in the, hopefully, minimal way.

This is what worries me when you liken the situation to C++.  You
probably were trying to make a different point but I'm not sure we're
on the same page and I think we need to agree at least on this in
principle; otherwise, we'll just keep talking past each other.

> > The kernel does not update all CPU affinity masks when a CPU goes down
> > or comes up.  It just enforces the intersection and when the
> > intersection becomes empty, ignores it.  cgroup-scoped behaviors
> > should reflect what the system does in the global case in general, and
> > the global behavior here, although missing some bits, is a lot saner
> > than what cpuset is currently doing.
> 
> You are conflating two things here:
> 1) How we maintain these masks
> 2) The interactions on updates
> 
> I absolutely agree with you that we want to maintain (1) in a
> non-pointwise format.  I've already talked about that in other replies
> on this thread.
> 
> However for (2) I feel you are:
>  i) Underestimating the complexity of synchronizing updates with user-space.
>  ii) Introducing more non-desirable behaviors [partial overwrite] than
> those you object to [total overwrite].

The thing which bothers me the most is that cpuset behavior is
different from global case for no good reason.  We don't have a model
right now.  It's schizophrenic.  And what I was trying to say was that
maybe this is because we never had a working model in the global case
either but if that's the case we need to solve the global case too or
at least figure out where we wanna be in the long term.

> It's the most consistent choice; you've not given any reasons above
> why a solution with only partial consistency is any better.
>
> Any choice here is difficult to coordinate, that two APIs allow
> manipulation of the same property means that we must always
> choose some compromise here.  I prefer the one with the least
> surprises.

I don't think the current situation around affinity mask handling can
be considered consistent and cpuset is pouring more inconsistencies
into it.  We need to figure it out one way or the other.

...
> I do not yet see a good reason why the threads arbitrarily not sharing an
> address space necessitates the use of an entirely different API.  The
> only problems stated so far in this discussion have been:
>   1) Actual issues involving relative paths, which are potentially solvable.

Also the ownership of organization.  If the use-cases can be
reasonably served with static grouping, I think it'd definitely be a
worthwhile trade-off to make.  It's different from process level
grouping.  There, we can simply state that this is to be arbitrated in
the userland and that arbitration isn't that difficult as it's among
administration stack of userspace.

In-process attributes are different.  The process itself can
manipulate its own attributes but it's also common for external tools
to peek into processes and set certain attributes.  Even when the two
parties aren't coordinated, this is usually fine because there's no
reason for applications to depend on what those attributes are set to
and even when the different entities do different things, the
combination is still something coherent.
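
For example, an uncoordinated external tool can already adjust a
per-thread attribute today without touching how the process has
organized itself; the tid below would come from /proc/<pid>/task and is
purely illustrative.

#include <sys/resource.h>
#include <sys/time.h>

/* Lower the priority of somebody else's thread from the outside.
 * On Linux, PRIO_PROCESS with a tid applies to just that thread. */
int deprioritize_thread(int tid)
{
    return setpriority(PRIO_PROCESS, tid, 10);
}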

Now, if you make the in-process grouping dynamic and accessible to
external entities (and if we aren't gonna do that, why even bother?),
this breaks down 

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-18 Thread Paul Turner
On Sat, Sep 12, 2015 at 7:40 AM, Tejun Heo  wrote:
> Hello,
>
> On Wed, Sep 09, 2015 at 05:49:31AM -0700, Paul Turner wrote:
>> I do not think this is a layering problem.  This is more like C++:
>> there is no sane way to concurrently use all the features available,
>> however, reasonably self-consistent subsets may be chosen.
>
> That's just admitting failure.
>

Alternatively: accepting there are varied use-cases to
support.

>> > I don't get it.  How is that the only consistent way?  Why is making
>> > irreversible changes even a good way?  Just layer the masks and
>> > trigger notification on changes.
>>
>> I'm not sure if you're agreeing or disagreeing here.  Are you saying
>> there's another consistent way from "clobbering then triggering
>> notification changes" since it seems like that's what is rejected and
>> then described.  This certainly does not include any provisions for
>> reversibility (which I think is a non-starter).
>>
>> With respect to layering:  Are you proposing we maintain a separate
>> mask for sched_setaffinity and cpusets, then do different things
>> depending on their intersection, or lack thereof?  I feel this would
>> introduce more inconsistencies than it would solve as these masks would
>> not be separately inspectable from user-space, leading to confusing
>> interactions as they are changed.
>
> So, one of the problems is that the kernel can't have tasks w/o
> runnable CPUs, so we have to do some workaround when, for whatever
> reason, a task ends up with no CPU that it can run on.  The other is
> that we never established a consistent way to deal with it in global
> case either.
>
> You say cpuset isn't a layering thing but that simply isn't true.
> It's a cgroup-scope CPU mask.  It layers atop task affinities
> restricting what they can be configured to, limiting the effective
> cpumask to the intersection of actually existing CPUs and overriding
> individual affinity setting when the intersection doesn't exist.
>
> The kernel does not update all CPU affinity masks when a CPU goes down
> or comes up.  It just enforces the intersection and when the
> intersection becomes empty, ignores it.  cgroup-scoped behaviors
> should reflect what the system does in the global case in general, and
> the global behavior here, although missing some bits, is a lot saner
> than what cpuset is currently doing.

You are conflating two things here:
1) How we maintain these masks
2) The interactions on updates

I absolutely agree with you that we want to maintain (1) in a
non-pointwise format.  I've already talked about that in other replies
on this thread.

However for (2) I feel you are:
 i) Underestimating the complexity of synchronizing updates with user-space.
 ii) Introducing more non-desirable behaviors [partial overwrite] than
those you object to [total overwrite].

>
>> There are also very real challenges with how any notification is
>> implemented, independent of delivery:
>> The 'main' of an application often does not have good control or even
>> understanding over its own threads since many may be library managed.
>> Designation of responsibility for updating these masks is difficult.
>> That said, I think a notification would still be a useful improvement
>> here and that some applications would benefit.
>
> And this is the missing piece in the global case too.  We've just
> never solved this problem properly but that does not mean we should go
> off and do something completely different for cgroup case.  Clobbering
> is fine if there's a single entity controlling everything but at that
> level it's nothing more than a shorthand for running taskset on member
> tasks.
>

From user-space's perspective it always involves some out-of-band
clobber since what's specified by cpusets takes precedence.

However the result of overlaying the masks is that different update
combinations will have very different effects, varying from greatly
expanding parallelism to greatly restricting it.  Further, these
effects are hard to predict since anything returned by getaffinity is
obscured by whatever the instantaneous cpuset-level masks happen to be.
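
A small illustration of that observability problem, assuming nothing
beyond the standard affinity calls: however the masks end up being
combined, sched_getaffinity() only reports the net effective set, so
the original request cannot be reconstructed from user-space.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t want, got;
    int cpu;

    CPU_ZERO(&want);
    for (cpu = 0; cpu < 8; cpu++)   /* ask for CPUs 0-7 */
        CPU_SET(cpu, &want);
    if (sched_setaffinity(0, sizeof(want), &want))
        perror("sched_setaffinity");

    /* ... an external cpuset change may rewrite or mask this here ... */

    if (sched_getaffinity(0, sizeof(got), &got))
        perror("sched_getaffinity");
    printf("CPUs visible now: %d (requested %d)\n",
           CPU_COUNT(&got), CPU_COUNT(&want));
    return 0;
}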

>> At the very least, I do not think that cpuset's behavior here can be
>> dismissed as unreasonable.
>
> It sure is very misguided.
>

It's the most consistent choice; you've not given any reasons above
why a solution with only partial consistency is any better.

Any choice here is difficult to coordinate, that two APIs allow
manipulation of the same property means that we must always
choose some compromise here.  I prefer the one with the least
surprises.

>> > What if optimizing cache allocation across competing threads of a
>> > process can yield, say, 3% gain across large compute farm?  Is that
>> > fringe?
>>
>> Frankly, yes.  If you have a compute farm sufficiently dedicated to a
>> single application I'd say that's a fairly specialized use.  I see no
>> reason why a more 'technical' API should be a challenge for such a
>> user.  The fundamental 

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-17 Thread Tejun Heo
Paul?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-17 Thread Peter Zijlstra
On Thu, Sep 17, 2015 at 11:52:45AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Thu, Sep 17, 2015 at 05:10:49PM +0200, Peter Zijlstra wrote:
> > Subject: sched: Refuse to unplug a CPU if this will violate user task 
> > affinity
> > 
> > It's bad policy to allow unplugging a CPU for which a user set explicit
> > affinity, either strictly on this CPU or in case this was the last
> > online CPU in its mask.
> > 
> > Either would end up forcing the thread on a random other CPU, violating
> > the sys_sched_setaffinity() constraint.
> 
> Shouldn't this at least handle suspend differently?  Otherwise any
> userland task would be able to block suspend.

It does, it will allow suspend.
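
Presumably this is because suspend-time unplug arrives as
CPU_DOWN_PREPARE_FROZEN, i.e. with CPU_TASKS_FROZEN or'ed in, so the
exact-match test in the migration_call() hunk of the patch below never
fires on that path.  This is an inference from the posted patch, and
the constant values in the sketch are quoted from memory of
include/linux/cpu.h purely to show the flag relation.

#include <stdio.h>

/* assumed values, only to demonstrate why "action == CPU_DOWN_PREPARE"
 * does not match the frozen (suspend) variant */
#define CPU_DOWN_PREPARE        0x0005
#define CPU_TASKS_FROZEN        0x0010
#define CPU_DOWN_PREPARE_FROZEN (CPU_DOWN_PREPARE | CPU_TASKS_FROZEN)

int main(void)
{
    unsigned long action = CPU_DOWN_PREPARE_FROZEN; /* suspend path */

    printf("strict check fires: %d\n", action == CPU_DOWN_PREPARE);
    return 0;
}
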
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-17 Thread Tejun Heo
Hello,

On Thu, Sep 17, 2015 at 05:10:49PM +0200, Peter Zijlstra wrote:
> Subject: sched: Refuse to unplug a CPU if this will violate user task affinity
> 
> It's bad policy to allow unplugging a CPU for which a user set explicit
> affinity, either strictly on this CPU or in case this was the last
> online CPU in its mask.
> 
> Either would end up forcing the thread on a random other CPU, violating
> the sys_sched_setaffinity() constraint.

Shouldn't this at least handle suspend differently?  Otherwise any
userland task would be able to block suspend.

> Disallow this by default; root might not be aware of all user
> affinities, but can negotiate and change affinities for all tasks.
>
> Provide a sysctl to go back to the old behaviour.

I don't think a sysctl is a good way to control this as that breaks
the invariant - all tasks always have some cpus online in its affinity
mask - which otherwise can be guaranteed.

If we wanna go this way, let's please start the discussion in a
separate thread with detailed explanation on implications of the
change.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-17 Thread Peter Zijlstra
On Thu, Sep 17, 2015 at 10:53:09AM -0400, Tejun Heo wrote:
> > I'd be happy to fail a CPU down for user tasks where this is the last
> > runnable CPU of.
> 
> So, yeah, we need to keep these things consistent across global and
> cgroup cases.
> 

Ok, I'll go extend the sysctl_sched_strict_affinity to the cpuset bits
too then :-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-17 Thread Peter Zijlstra
On Thu, Sep 17, 2015 at 04:35:27PM +0200, Peter Zijlstra wrote:
> I'd be happy to fail a CPU down for user tasks where this is the last
> runnable CPU of.

A little like so. Completely untested.

---
Subject: sched: Refuse to unplug a CPU if this will violate user task affinity

It's bad policy to allow unplugging a CPU for which a user set explicit
affinity, either strictly on this CPU or in case this was the last
online CPU in its mask.

Either would end up forcing the thread on a random other CPU, violating
the sys_sched_setaffinity() constraint.

Disallow this by default; root might not be aware of all user
affinities, but can negotiate and change affinities for all tasks.

Provide a sysctl to go back to the old behaviour.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/core.c  | 46 
 kernel/sysctl.c  |  9 +
 3 files changed, 56 insertions(+)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731cf10b..9444b549914b 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -39,6 +39,7 @@ extern unsigned int sysctl_sched_latency;
 extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_child_runs_first;
+extern unsigned int sysctl_sched_strict_affinity;
 
 enum sched_tunable_scaling {
SCHED_TUNABLESCALING_NONE,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ab415aa15c4..457c8b912fc6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -284,6 +284,11 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 95;
 
+/*
+ * Disallows cpu unplug if that would result in a task without runnable CPUs.
+ */
+unsigned int sysctl_sched_strict_affinity = 1;
+
 /* cpus with isolated domains */
 cpumask_var_t cpu_isolated_map;
 
@@ -5430,6 +5435,42 @@ static void set_rq_offline(struct rq *rq)
 }
 
 /*
+ * Test if there's a user task for which @cpu is the last runnable CPU
+ */
+static bool migration_possible(int cpu)
+{
+   struct task_struct *g, *p;
+   bool ret = true;
+   int next;
+
+   read_lock(&tasklist_lock);
+   for_each_process_thread(g, p) {
+   /* if its running elsewhere, this cannot be its last cpu */
+   if (task_cpu(p) != cpu)
+   continue;
+
+   /* we only care about user state */
+   if (p->flags & PF_KTHREAD)
+   continue;
+
+   next = -1;
+again:
+   next = cpumask_next_and(next, tsk_cpus_allowed(p), cpu_active_mask);
+   if (next >= nr_cpu_ids) {
+   printk(KERN_WARNING "task %s-%d refused unplug of CPU %d\n",
+   p->comm, p->pid, cpu);
+   ret = false;
+   break;
+   }
+   if (next == cpu)
+   goto again;
+   }
+   read_unlock(&tasklist_lock);
+
+   return ret;
+}
+
+/*
  * migration_call - callback that gets triggered when a CPU is added.
  * Here we can start up the necessary migration thread for the new CPU.
  */
@@ -5440,6 +5481,11 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
unsigned long flags;
struct rq *rq = cpu_rq(cpu);
 
+   if (action == CPU_DOWN_PREPARE && sysctl_sched_strict_affinity) {
+   if (!migration_possible(cpu))
+   return notifier_from_errno(-EBUSY);
+   }
+
switch (action & ~CPU_TASKS_FROZEN) {
 
case CPU_UP_PREPARE:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e69201d8094e..9d0edcc73cc3 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -283,6 +283,15 @@ static struct ctl_table kern_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec,
},
+#ifdef CONFIG_SMP
+   {
+   .procname   = "sched_strict_affinity",
+   .data   = &sysctl_sched_strict_affinity,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+#endif /* CONFIG_SMP */
 #ifdef CONFIG_SCHED_DEBUG
{
.procname   = "sched_min_granularity_ns",
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-17 Thread Tejun Heo
Hello,

On Thu, Sep 17, 2015 at 04:35:27PM +0200, Peter Zijlstra wrote:
> On Sat, Sep 12, 2015 at 10:40:07AM -0400, Tejun Heo wrote:
> > So, one of the problems is that the kernel can't have tasks w/o
> > runnable CPUs, so we have to do some workaround when, for whatever
> > reason, a task ends up with no CPU that it can run on.
> 
> No, just refuse that configuration.

Well, you've been saying that but that's not our behavior on cpu
hotunplug either and it applies the same.  If we reject cpu hotunplugs
while tasks are affined to it, we can do the same in cpuset too.

> > The kernel does not update all CPU affinity masks when a CPU goes down
> > or comes up.
> 
> I'd be happy to fail a CPU down for user tasks where this is the last
> runnable CPU of.

So, yeah, we need to keep these things consistent across global and
cgroup cases.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-17 Thread Peter Zijlstra
On Sat, Sep 12, 2015 at 10:40:07AM -0400, Tejun Heo wrote:
> So, one of the problems is that the kernel can't have tasks w/o
> runnable CPUs, so we have to do some workaround when, for whatever
> reason, a task ends up with no CPU that it can run on.

No, just refuse that configuration.

> You say cpuset isn't a layering thing but that simply isn't true.
> It's a cgroup-scope CPU mask.  It layers atop task affinities
> restricting what they can be configured to, limiting the effective
> cpumask to the intersection of actually existing CPUs and overriding
> individual affinity setting when the intersection doesn't exist.

No, just fail.

> The kernel does not update all CPU affinity masks when a CPU goes down
> or comes up.

I'd be happy to fail a CPU down for user tasks where this is the last
runnable CPU of.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-12 Thread Tejun Heo
Hello,

On Wed, Sep 09, 2015 at 05:49:31AM -0700, Paul Turner wrote:
> I do not think this is a layering problem.  This is more like C++:
> there is no sane way to concurrently use all the features available,
> however, reasonably self-consistent subsets may be chosen.

That's just admitting failure.

> > I don't get it.  How is that the only consistent way?  Why is making
> > irreversible changes even a good way?  Just layer the masks and
> > trigger notification on changes.
> 
> I'm not sure if you're agreeing or disagreeing here.  Are you saying
> there's another consistent way from "clobbering then triggering
> notification changes" since it seems like that's what is rejected and
> then described.  This certainly does not include any provisions for
> reversibility (which I think is a non-starter).
> 
> With respect to layering:  Are you proposing we maintain a separate
> mask for sched_setaffinity and cpusets, then do different things
> depending on their intersection, or lack thereof?  I feel this would
> introduce more inconsistencies than it would solve as these masks would
> not be separately inspectable from user-space, leading to confusing
> interactions as they are changed.

So, one of the problems is that the kernel can't have tasks w/o
runnable CPUs, so we have to have some workaround when, for whatever
reason, a task ends up with no CPU that it can run on.  The other is
that we never established a consistent way to deal with it in the
global case either.

You say cpuset isn't a layering thing but that simply isn't true.
It's a cgroup-scope CPU mask.  It layers atop task affinities
restricting what they can be configured to, limiting the effective
cpumask to the intersection of actually existing CPUs and overriding
individual affinity setting when the intersection doesn't exist.

The kernel does not update all CPU affinity masks when a CPU goes down
or comes up.  It just enforces the intersection and when the
intersection becomes empty, ignores it.  cgroup-scoped behaviors
should reflect what the system does in the global case in general, and
the global behavior here, although missing some bits, is a lot saner
than what cpuset is currently doing.
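
[A rough userspace-flavored sketch of the layering described above, using glibc
cpu_set_t purely for illustration; this is not kernel code and the helper name
is invented:]

#define _GNU_SOURCE
#include <sched.h>

/*
 * The per-thread affinity and the cgroup-scope mask are kept as separate
 * inputs; only the *effective* mask (their intersection with online CPUs)
 * is what gets enforced.  If the intersection goes empty, the effective
 * mask falls back to the cpuset's online CPUs while the stored per-thread
 * mask stays intact, so nothing is clobbered irreversibly.
 */
static cpu_set_t effective_mask(cpu_set_t *task_affinity,
				cpu_set_t *cpuset_cpus,
				cpu_set_t *online)
{
	cpu_set_t eff;

	CPU_AND(&eff, task_affinity, cpuset_cpus);
	CPU_AND(&eff, &eff, online);
	if (CPU_COUNT(&eff) == 0)
		CPU_AND(&eff, cpuset_cpus, online);
	return eff;
}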

> There are also very real challenges with how any notification is
> implemented, independent of delivery:
> The 'main' of an application often does not have good control or even
> understanding over its own threads since many may be library managed.
> Designation of responsibility for updating these masks is difficult.
> That said, I think a notification would still be a useful improvement
> here and that some applications would benefit.

And this is the missing piece in the global case too.  We've just
never solved this problem properly but that does not mean we should go
off and do something completely different for cgroup case.  Clobbering
is fine if there's a single entity controlling everything but at that
level it's nothing more than a shorthand for running taskset on member
tasks.

> At the very least, I do not think that cpuset's behavior here can be
> dismissed as unreasonable.

It sure is very misguided.

> > What if optimizing cache allocation across competing threads of a
> > process can yield, say, 3% gain across large compute farm?  Is that
> > fringe?
> 
> Frankly, yes.  If you have a compute farm sufficiently dedicated to a
> single application I'd say that's a fairly specialized use.  I see no
> reason why a more 'technical' API should be a challenge for such a
> user.  The fundamental point here was that it's ok for the API of some
> controllers to be targeted at system rather than user control in terms
> of interface.  This does not restrict their use by users where
> appropriate.

We can go back and forth forever on this but I'm fairly sure
everything CAT related is niche at this point.

> So there is definitely a proliferation of discussion regarding
> applying cgroups to other problems which I agree we need to take a
> step back and re-examine.
> 
> However, here we're fundamentally talking about APIs designed to
> partition resources which is the problem that cgroups was introduced
> to address.  If we want to introduce another API to do that below the
> process level we need to talk about why it's fundamentally different
> for processes versus threads, and whether we want two APIs that
> interface with the same underlying kernel mechanics.

More on this below but going full-on cgroup controller w/ thread-level
interface is akin to introducing syscalls for this.  That really is
what it is.

> > For the cache allocation thing, I'd strongly suggest something way
> > simpler and non-committal - e.g. create a char device node with
> > simple configuration and basic access control.  If this *really* turns
> > out to be useful and its configuration complex enough to warrant
> > cgroup integration, let's do it then, and if we actually end up there,
> > I bet the interface that we'd come up with at that point would be
> > different from what 

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-09 Thread Paul Turner
[ Picking this back up, I was out of the country last week.  Note that
we are also wrestling with some DMARC issues as it was just activated
for Google.com so apologies if people do not receive this directly. ]

On Tue, Aug 25, 2015 at 2:02 PM, Tejun Heo  wrote:
> Hello,
>
> On Mon, Aug 24, 2015 at 04:06:39PM -0700, Paul Turner wrote:
>> > This is an erratic behavior on cpuset's part tho.  Nothing else
>> > behaves this way and it's borderline buggy.
>>
>> It's actually the only sane possible interaction here.
>>
>> If you don't overwrite the masks you can no longer manage cpusets with
>> a multi-threaded application.
>> If you partially overwrite the masks you can create a host of
>> inconsistent behaviors where an application suddenly loses
>> parallelism.
>
> It's a layering problem.  It'd be fine if cpuset either did "layer
> per-thread affinities below w/ config change notification" or "ignore
> and/or reject per-thread affinities".  What we have now is two layers
> manipulating the same field without any mechanism for coordination.
>

I think this is a mischaracterization.  With respect to the two
proposed solutions:

a) Notifications do not solve this problem.
b) Rejecting per-thread affinities is a non-starter.  It's absolutely
needed.  (Aside:  This would also wholly break the existing
sched_setaffinity/getaffinity syscalls.)

I do not think this is a layering problem.  This is more like C++:
there is no sane way to concurrently use all the features available,
however, reasonably self-consistent subsets may be chosen.

>> The *only* consistent way is to clobber all masks uniformly.  Then
>> either arrange for some notification to the application to re-sync, or
>> use sub-sub-containers within the cpuset hierarchy to advertise
>> finer-partitions.
>
> I don't get it.  How is that the only consistent way?  Why is making
> irreversible changes even a good way?  Just layer the masks and
> trigger notification on changes.

I'm not sure if you're agreeing or disagreeing here.  Are you saying
there's another consistent way from "clobbering then triggering
notification changes" since it seems like that's what is rejected and
then described.  This certainly does not include any provisions for
reversibility (which I think is a non-starter).

With respect to layering:  Are you proposing we maintain a separate
mask for sched_setaffinity and cpusets, then do different things
depending on their intersection, or lack thereof?  I feel this would
introduce more inconsistencies than it would solve as these masks would
not be separately inspectable from user-space, leading to confusing
interactions as they are changed.

There are also very real challenges with how any notification is
implemented, independent of delivery:
The 'main' of an application often does not have good control or even
understanding over its own threads since many may be library managed.
Designation of responsibility for updating these masks is difficult.
That said, I think a notification would still be a useful improvement
here and that some applications would benefit.

At the very least, I do not think that cpuset's behavior here can be
dismissed as unreasonable.

>
>> I don't think the case of having a large compute farm with
>> "unimportant" and "important" work is particularly fringe.  Reducing
>> the impact on the "important" work so that we can scavenge more cycles
>> for the latency insensitive "unimportant" is very real.
>
> What if optimizing cache allocation across competing threads of a
> process can yield, say, 3% gain across large compute farm?  Is that
> fringe?

Frankly, yes.  If you have a compute farm sufficiently dedicated to a
single application I'd say that's a fairly specialized use.  I see no
reason why a more 'technical' API should be a challenge for such a
user.  The fundamental point here was that it's ok for the API of some
controllers to be targeted at system rather than user control in terms
of interface.  This does not restrict their use by users where
appropriate.

>
>> Right, but it's exactly because of _how bad_ those other mechanisms
>> _are_ that cgroups was originally created.  Its growth was not
>> managed well from there, but let's not step away from the fact that
>> this interface was created to solve this problem.
>
> Sure, at the same time, please don't forget that there are ample
> reasons we can't replace more basic mechanisms with cgroups.  I'm not
> saying this can't be part of cgroup but rather that it's misguided to
> plunge into cgroups as the first and only step.
>

So there is definitely a proliferation of discussion regarding
applying cgroups to other problems which I agree we need to take a
step back and re-examine.

However, here we're fundamentally talking about APIs designed to
partition resources which is the problem that cgroups was introduced
to address.  If we want to introduce another API to do that below the
process level we need to talk about why it's fundamentally different
for 

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-09-02 Thread Tejun Heo
Paul?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-25 Thread Tejun Heo
Hello, Kame.

On Tue, Aug 25, 2015 at 11:36:25AM +0900, Kamezawa Hiroyuki wrote:
> I think I should explain my customer's use case of qemu + cpuset/cpu (via 
> libvirt)
> 
> (1) Isolating hypervisor thread.
>As already discussed, hypervisor threads are isolated by cpuset. But their 
> purpose
>is to avoid _latency_ spike caused by hypervisor behavior. So, "nice" 
> cannot be solution
>as already discussed.
> 
> (2) Fixed rate vcpu service.
>With using cpu controller's quota/period feature, my customer creates  
> vcpu models like
>Low(1GHz), Mid(2GHz), High(3GHz) for IaaS system.
> 
>To do this, each vcpus should be quota-limited independently, with 
> per-thread cpu control.
> 
> Especially, the method (1) is used in several enterprise customers for 
> stabilizing their system.
> 
> Sub-process control should be provided by some way.

Can you please take a look at the proposal on my reply to Paul's
email?  AFAICS, both of above cases should be fine with that.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
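
[To make use case (2) quoted above concrete, a sketch of a per-vcpu quota setup
with the existing cgroup v1 cpu controller; the directory layout and the
quota/period numbers are made-up examples, while cpu.cfs_quota_us,
cpu.cfs_period_us and the per-thread "tasks" file are the real v1 interfaces:]

#include <stdio.h>
#include <sys/types.h>

/* write one value into a cgroup control file, e.g.
 * /sys/fs/cgroup/cpu/machine/guest1/vcpu0/cpu.cfs_quota_us */
static int cg_write(const char *dir, const char *file, long val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", dir, file);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%ld\n", val);
	return fclose(f);
}

/*
 * Give one vcpu thread a fixed CPU rate, e.g. quota_us = 250000 with
 * period_us = 1000000 for a "25% of one core" model.  The per-vcpu
 * directory is assumed to exist already (created by libvirt or some
 * management daemon); the v1 "tasks" file takes a TID, which is the
 * per-thread granularity under discussion.
 */
static int set_vcpu_rate(const char *vcpu_dir, pid_t vcpu_tid,
			 long quota_us, long period_us)
{
	if (cg_write(vcpu_dir, "cpu.cfs_period_us", period_us))
		return -1;
	if (cg_write(vcpu_dir, "cpu.cfs_quota_us", quota_us))
		return -1;
	return cg_write(vcpu_dir, "tasks", (long)vcpu_tid);
}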


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-25 Thread Tejun Heo
Hello,

On Mon, Aug 24, 2015 at 04:06:39PM -0700, Paul Turner wrote:
> > This is an erratic behavior on cpuset's part tho.  Nothing else
> > behaves this way and it's borderline buggy.
> 
> It's actually the only sane possible interaction here.
> 
> If you don't overwrite the masks you can no longer manage cpusets with
> a multi-threaded application.
> If you partially overwrite the masks you can create a host of
> inconsistent behaviors where an application suddenly loses
> parallelism.

It's a layering problem.  It'd be fine if cpuset either did "layer
per-thread affinities below w/ config change notification" or "ignore
and/or reject per-thread affinities".  What we have now is two layers
manipulating the same field without any mechanism for coordination.

> The *only* consistent way is to clobber all masks uniformly.  Then
> either arrange for some notification to the application to re-sync, or
> use sub-sub-containers within the cpuset hierarchy to advertise
> finer-partitions.

I don't get it.  How is that the only consistent way?  Why is making
irreversible changes even a good way?  Just layer the masks and
trigger notification on changes.

> I don't think the case of having a large compute farm with
> "unimportant" and "important" work is particularly fringe.  Reducing
> the impact on the "important" work so that we can scavenge more cycles
> for the latency insensitive "unimportant" is very real.

What if optimizing cache allocation across competing threads of a
process can yield, say, 3% gain across large compute farm?  Is that
fringe?

> Right, but it's exactly because of _how bad_ those other mechanisms
> _are_ that cgroups was originally created.  Its growth was not
> managed well from there, but let's not step away from the fact that
> this interface was created to solve this problem.

Sure, at the same time, please don't forget that there are ample
reasons we can't replace more basic mechanisms with cgroups.  I'm not
saying this can't be part of cgroup but rather that it's misguided to
plunge into cgroups as the first and only step.

More importantly, I am extremely doubtful that we understand the usage
scenarios and their benefits very well at this point and want to avoid
over-committing to something we'll look back and regret.  As it
currently stands, this has a high likelihood of becoming a mismanaged
growth.

For the cache allocation thing, I'd strongly suggest something way
simpler and non-committal - e.g. create a char device node with
simple configuration and basic access control.  If this *really* turns
out to be useful and its configuration complex enough to warrant
cgroup integration, let's do it then, and if we actually end up there,
I bet the interface that we'd come up with at that point would be
different from what people are proposing now.
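
[For illustration only, the char-device alternative suggested above could look
roughly like this; the device name, ioctl number and the notion of a per-thread
class id are all invented, and access control would simply be the permissions
on the device node:]

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/ioctl.h>
#include <linux/uaccess.h>

/* one ioctl that tags the calling thread with a cache-allocation class;
 * the backend that would consume class_id is left out entirely */
#define CACHEALLOC_SET_CLASS	_IOW('x', 1, int)

static long cachealloc_ioctl(struct file *file, unsigned int cmd,
			     unsigned long arg)
{
	int class_id;

	if (cmd != CACHEALLOC_SET_CLASS)
		return -ENOTTY;
	if (copy_from_user(&class_id, (void __user *)arg, sizeof(class_id)))
		return -EFAULT;
	/* record class_id for 'current' in the (hypothetical) backend */
	return 0;
}

static const struct file_operations cachealloc_fops = {
	.owner		= THIS_MODULE,
	.unlocked_ioctl	= cachealloc_ioctl,
};

static struct miscdevice cachealloc_dev = {
	.minor	= MISC_DYNAMIC_MINOR,
	.name	= "cachealloc",
	.fops	= &cachealloc_fops,
};

static int __init cachealloc_init(void)
{
	return misc_register(&cachealloc_dev);
}

static void __exit cachealloc_exit(void)
{
	misc_deregister(&cachealloc_dev);
}

module_init(cachealloc_init);
module_exit(cachealloc_exit);
MODULE_LICENSE("GPL");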

> > Yeah, I understand the similarity part but don't buy that the benefit
> > there is big enough to introduce a kernel API which is expected to be
> > used by individual programs which is radically different from how
> > processes / threads are organized and applications interact with the
> > kernel.
> 
> Sorry, I don't quite follow, in what way is it radically different?
> What is magically different about a process versus a thread in this
> sub-division?

I meant that cgroupfs as opposed to most other programming interfaces
that we publish to applications.  We already have process / thread
hierarchy which is created through forking/cloning and conventions
built around them for interaction.  No sane application programming
interface requires individual applications to open a file somewhere,
echo some values to it and use directory operations to manage its
organization.  Will get back to this later.

> > All controllers only get what their ancestors can hand down to them.
> > That's basic hierarchical behavior.
> 
> And many users want non work-conserving systems in which we can add
> and remove idle resources.  This means that how much bandwidth an
> ancestor has is not fixed in stone.

I'm having a hard time following you on this part of the discussion.
Can you give me an example?

> > If that's the case and we fail miserably at creating a reasonable
> > programming interface for that, we can always revive thread
> > granularity.  This is mostly a policy decision after all.
> 
> These interfaces should be presented side-by-side.  This is not a
> reasonable patch-later part of the interface as we depend on it today.

Revival of thread affinity is trivial and will stay that way for a
long time and the transition is already gradual, so it'll be a lost
opportunity but there is quite a bit of maneuvering room.  Anyways, on
with the sub-process interface.

Skipping description of the problems with the current setup here as
I've repeated it a couple times in this thread already.

On the other sub-thread, I said that process/thread tree and cgroup
association are inherently tied.  This is because a new child task is
always born into the parent's cgroup and the only reason cgroup works
on system management 

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-25 Thread Tejun Heo
Hello, Paul.

On Mon, Aug 24, 2015 at 04:15:59PM -0700, Paul Turner wrote:
> > Hmmm... if that's the case, would limiting iops on those IO devices
> > (or classes of them) work?  qemu already implements IO limit mechanism
> > after all.
> 
> No.
> 
> 1) They should proceed at the maximum rate that they can that's still
> within their provisioning budget.

Ooh, right.

> 2) The cost/IO is both inconsistent and changes over time.  Attempting
> to micro-optimize every backend for this is infeasible, this is
> exactly the type of problem that the scheduler can usefully help
> arbitrate.
> 3) Even pretending (2) is fixable, dynamically dividing these
> right-to-work tokens between different I/O device backends is
> extremely complex.
>
> > Anyways, a point here is that threads of the same process competing
> > isn't a new problem.  There are many ways to make those threads play
> > nice as the application itself often has to be involved anyway,
> > especially for something like qemu which is heavily involved in
> > provisioning resources.
> 
> It's certainly not a new problem, but it's a real one, and it's
> _hard_.  You're proposing removing the best known solution.

Well, I'm trying to figure out whether we actually need it and
implement something sane if so.  We actually can't do hierarchical
resource distribution with existing mechanisms, so if that is
something which is beneficial enough, let's go ahead and figure it
out.

> > cgroups can be a nice brute-force add-on which lets sysadmins do wild
> > things but it's inherently hacky and incomplete for coordinating
> > threads.  For example, what is it gonna do if qemu cloned vcpus and IO
> > helpers dynamically off of the same parent thread?
> 
> We're talking about sub-process usage here.  This is the application
> coordinating itself, NOT the sysadmin.  Processes are becoming larger
> and larger, we need many of the same controls within them that we have
> between them.
>
> >  It requires
> > application's cooperation anyway but at the same time is painful to
> > actually interact from those applications.
> 
> As discussed elsewhere on thread this is really not a problem if you
> define consistent rules with respect to which parts are managed by
> who.  The argument of potential interference is no different to
> messing with an application's on-disk configuration behind its back.
> Alternate strawmen which greatly improve this from where we are today
> have also been proposed.

Let's continue in the other sub-thread but it's not just system
management and applications not stepping on each other's toes although
even just that is extremely painful with the current interface.
cgroup membership is inherently tied to process tree no matter who's
managing it which requires coordination from the application side for
sub-process management and at that point it's really a matter of putting
one and one together.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-25 Thread Peter Zijlstra
On Tue, Aug 25, 2015 at 11:24:42AM +0200, Ingo Molnar wrote:
> 
> * Paul Turner  wrote:
> 
> > > Anyways, a point here is that threads of the same process competing
> > > isn't a new problem.  There are many ways to make those threads play
> > > nice as the application itself often has to be involved anyway,
> > > especially for something like qemu which is heavily involved in
> > > provisioning resources.
> > 
> > It's certainly not a new problem, but it's a real one, and it's
> > _hard_.  You're proposing removing the best known solution.
> 
> Also, just to make sure this is resolved properly, I'm NAK-ing the current 
> scheduler bits in this series:
> 
>   NAKed-by: Ingo Molnar 
> 
> until all of pjt's API design concerns are resolved. This is conceptual, it is
> not a 'we can fix it later' detail.
> 
> Tejun, please keep me Cc:-ed to future versions of this series so that I can
> lift the NAK if things get resolved.

You can add:

NAKed-by: Peter Zijlstra 

to that.

There have been at least 3 different groups of people:

 - Mike, representing Suse customers
 - Kamezawa-san, representing Fujitsu customers
 - Paul, representing Google

that claim per-thread control groups are in use and important.

Any replacement _must_ provide for this use case up front; it's not
something that can be cobbled on later.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-25 Thread Ingo Molnar

* Paul Turner  wrote:

> > Anyways, a point here is that threads of the same process competing
> > isn't a new problem.  There are many ways to make those threads play
> > nice as the application itself often has to be involved anyway,
> > especially for something like qemu which is heavily involved in
> > provisioning resources.
> 
> It's certainly not a new problem, but it's a real one, and it's
> _hard_.  You're proposing removing the best known solution.

Also, just to make sure this is resolved properly, I'm NAK-ing the current 
scheduler bits in this series:

  NAKed-by: Ingo Molnar 

until all of pjt's API design concerns are resolved. This is conceptual, it is
not a 'we can fix it later' detail.

Tejun, please keep me Cc:-ed to future versions of this series so that I can
lift the NAK if things get resolved.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Kamezawa Hiroyuki

On 2015/08/25 8:15, Paul Turner wrote:
> On Mon, Aug 24, 2015 at 3:49 PM, Tejun Heo  wrote:
>> Hello,
>>
>> On Mon, Aug 24, 2015 at 03:03:05PM -0700, Paul Turner wrote:
>>>> Hmm... I was hoping for an actual configurations and usage scenarios.
>>>> Preferably something people can set up and play with.
>>>
>>> This is much easier to set up and play with synthetically.  Just
>>> create the 10 threads and 100 threads above then experiment with
>>> configurations designed at guaranteeing the set of 100 threads
>>> relatively uniform throughput regardless of how many are active.   I
>>> don't think trying to run a VM stack adds anything except complexity
>>> of reproduction here.
>>
>> Well, but that loses most of details and why such use cases matter to
>> begin with.  We can imagine up stuff to induce arbitrary set of
>> requirements.
>
> All that's being proved or disproved here is that it's difficult to
> coordinate the consumption of asymmetric thread pools using nice.  The
> constraints here were drawn from a real-world example.
>
>>>> I take that the
>>>> CPU intensive helper threads are usually IO workers?  Is the scenario
>>>> where the VM is set up with a lot of IO devices and different ones may
>>>> consume large amount of CPU cycles at any given point?
>>>
>>> Yes, generally speaking there are a few major classes of IO (flash,
>>> disk, network) that a guest may invoke.  Each of these backends is
>>> separate and chooses its own threading.
>>
>> Hmmm... if that's the case, would limiting iops on those IO devices
>> (or classes of them) work?  qemu already implements IO limit mechanism
>> after all.
>
> No.
>
> 1) They should proceed at the maximum rate that they can that's still
> within their provisioning budget.
> 2) The cost/IO is both inconsistent and changes over time.  Attempting
> to micro-optimize every backend for this is infeasible, this is
> exactly the type of problem that the scheduler can usefully help
> arbitrate.
> 3) Even pretending (2) is fixable, dynamically dividing these
> right-to-work tokens between different I/O device backends is
> extremely complex.

I think I should explain my customer's use case of qemu + cpuset/cpu (via
libvirt)

(1) Isolating hypervisor thread.
    As already discussed, hypervisor threads are isolated by cpuset. But their
    purpose is to avoid _latency_ spike caused by hypervisor behavior. So,
    "nice" cannot be solution as already discussed.

(2) Fixed rate vcpu service.
    With using cpu controller's quota/period feature, my customer creates vcpu
    models like Low(1GHz), Mid(2GHz), High(3GHz) for IaaS system.

    To do this, each vcpus should be quota-limited independently, with
    per-thread cpu control.

Especially, the method (1) is used in several enterprise customers for
stabilizing their system.

Sub-process control should be provided by some way.

Thanks,
-Kame

>> Anyways, a point here is that threads of the same process competing
>> isn't a new problem.  There are many ways to make those threads play
>> nice as the application itself often has to be involved anyway,
>> especially for something like qemu which is heavily involved in
>> provisioning resources.
>
> It's certainly not a new problem, but it's a real one, and it's
> _hard_.  You're proposing removing the best known solution.
>
>> cgroups can be a nice brute-force add-on which lets sysadmins do wild
>> things but it's inherently hacky and incomplete for coordinating
>> threads.  For example, what is it gonna do if qemu cloned vcpus and IO
>> helpers dynamically off of the same parent thread?
>
> We're talking about sub-process usage here.  This is the application
> coordinating itself, NOT the sysadmin.  Processes are becoming larger
> and larger, we need many of the same controls within them that we have
> between them.
>
>>  It requires
>> application's cooperation anyway but at the same time is painful to
>> actually interact from those applications.
>
> As discussed elsewhere on thread this is really not a problem if you
> define consistent rules with respect to which parts are managed by
> who.  The argument of potential interference is no different to
> messing with an application's on-disk configuration behind its back.
> Alternate strawmen which greatly improve this from where we are today
> have also been proposed.
>
>> Thanks.
>>
>> --
>> tejun
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 3:49 PM, Tejun Heo  wrote:
> Hello,
>
> On Mon, Aug 24, 2015 at 03:03:05PM -0700, Paul Turner wrote:
>> > Hmm... I was hoping for an actual configurations and usage scenarios.
>> > Preferably something people can set up and play with.
>>
>> This is much easier to set up and play with synthetically.  Just
>> create the 10 threads and 100 threads above then experiment with
>> configurations designed at guaranteeing the set of 100 threads
>> relatively uniform throughput regardless of how many are active.   I
>> don't think trying to run a VM stack adds anything except complexity
>> of reproduction here.
>
> Well, but that loses most of details and why such use cases matter to
> begin with.  We can imagine up stuff to induce arbitrary set of
> requirements.

All that's being proved or disproved here is that it's difficult to
coordinate the consumption of asymmetric thread pools using nice.  The
constraints here were drawn from a real-world example.
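
[A self-contained sketch of that synthetic setup; the 10/100 pool sizes are the
ones quoted above, everything else is arbitrary; build with -pthread, run once
as-is and once with the two pools split into cgroups or reniced, and compare
how uniform the helper throughput stays:]

#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

/* one pool of 10 "vcpu-like" busy threads and one pool of 100 "helper"
 * threads, each just burning cycles and counting iterations */
#define VCPUS	10
#define HELPERS	100

static atomic_long vcpu_work, helper_work;
static atomic_int stop;

static void *burn(void *counter)
{
	atomic_long *c = counter;

	while (!atomic_load(&stop))
		atomic_fetch_add(c, 1);
	return NULL;
}

int main(void)
{
	pthread_t t[VCPUS + HELPERS];
	int i;

	for (i = 0; i < VCPUS; i++)
		pthread_create(&t[i], NULL, burn, &vcpu_work);
	for (; i < VCPUS + HELPERS; i++)
		pthread_create(&t[i], NULL, burn, &helper_work);

	sleep(10);
	atomic_store(&stop, 1);
	for (i = 0; i < VCPUS + HELPERS; i++)
		pthread_join(t[i], NULL);

	printf("vcpu pool:   %ld iterations\n", atomic_load(&vcpu_work));
	printf("helper pool: %ld iterations\n", atomic_load(&helper_work));
	return 0;
}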

>
>> > I take that the
>> > CPU intensive helper threads are usually IO workers?  Is the scenario
>> > where the VM is set up with a lot of IO devices and different ones may
>> > consume large amount of CPU cycles at any given point?
>>
>> Yes, generally speaking there are a few major classes of IO (flash,
>> disk, network) that a guest may invoke.  Each of these backends is
>> separate and chooses its own threading.
>
> Hmmm... if that's the case, would limiting iops on those IO devices
> (or classes of them) work?  qemu already implements IO limit mechanism
> after all.

No.

1) They should proceed at the maximum rate that they can that's still
within their provisioning budget.
2) The cost/IO is both inconsistent and changes over time.  Attempting
to micro-optimize every backend for this is infeasible, this is
exactly the type of problem that the scheduler can usefully help
arbitrate.
3) Even pretending (2) is fixable, dynamically dividing these
right-to-work tokens between different I/O device backends is
extremely complex.

>
> Anyways, a point here is that threads of the same process competing
> isn't a new problem.  There are many ways to make those threads play
> nice as the application itself often has to be involved anyway,
> especially for something like qemu which is heavily involved in
> provisioning resources.

It's certainly not a new problem, but it's a real one, and it's
_hard_.  You're proposing removing the best known solution.

>
> cgroups can be a nice brute-force add-on which lets sysadmins do wild
> things but it's inherently hacky and incomplete for coordinating
> threads.  For example, what is it gonna do if qemu cloned vcpus and IO
> helpers dynamically off of the same parent thread?

We're talking about sub-process usage here.  This is the application
coordinating itself, NOT the sysadmin.  Processes are becoming larger
and larger, we need many of the same controls within them that we have
between them.

>  It requires
> application's cooperation anyway but at the same time is painful to
> actually interact from those applications.

As discussed elsewhere on thread this is really not a problem if you
define consistent rules with respect to which parts are managed by
who.  The argument of potential interference is no different to
messing with an application's on-disk configuration behind its back.
Alternate strawmen which greatly improve this from where we are today
have also been proposed.

>
> Thanks.
>
> --
> tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 3:19 PM, Tejun Heo  wrote:
> Hey,
>
> On Mon, Aug 24, 2015 at 02:58:23PM -0700, Paul Turner wrote:
>> > Why isn't it?  Because the programs themselves might try to override
>> > it?
>>
>> The major reasons are:
>>
>> 1) Isolation.  Doing everything with sched_setaffinity means that
>> programs can use arbitrary resources if they desire.
>>   1a) These restrictions need to also apply to threads created by
>> library code.  Which may be 3rd party.
>> 2) Interaction between cpusets and sched_setaffinity.  For necessary
>> reasons, a cpuset update always overwrites all extant
>> sched_setaffinity values. ...And we need some cpusets for (1).  And
>> we need periodic updates for access to shared cores.
>
> This is an erratic behavior on cpuset's part tho.  Nothing else
> behaves this way and it's borderline buggy.
>

It's actually the only sane possible interaction here.

If you don't overwrite the masks you can no longer manage cpusets with
a multi-threaded application.
If you partially overwrite the masks you can create a host of
inconsistent behaviors where an application suddenly loses
parallelism.

The *only* consistent way is to clobber all masks uniformly.  Then
either arrange for some notification to the application to re-sync, or
use sub-sub-containers within the cpuset hierarchy to advertise
finer-partitions.

(Generally speaking, there is no real way to mate these APIs and part
of the reason we use sub-containers here.  What's being proposed will
make this worse rather than better.)
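
(To make the "re-sync" option concrete, here is a minimal sketch; the
path, the cpulist parsing, and the notification mechanism are all
illustrative assumptions rather than anything being proposed.  On
whatever signal the manager sends, a thread re-reads the advertised
cpulist from its sub-container and re-applies it:)

/* hedged sketch: refresh this thread's affinity from an advertised
 * cpuset sub-container after a cpuset-driven clobber of setaffinity */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int resync_affinity(const char *cpus_file)
{
        char buf[256], *tok, *save;
        cpu_set_t set;
        FILE *f = fopen(cpus_file, "r");

        if (!f)
                return -1;
        if (!fgets(buf, sizeof(buf), f)) {
                fclose(f);
                return -1;
        }
        fclose(f);

        CPU_ZERO(&set);
        /* cpuset.cpus is a cpulist, e.g. "0-3,8,10-11" */
        for (tok = strtok_r(buf, ",\n", &save); tok;
             tok = strtok_r(NULL, ",\n", &save)) {
                int lo, hi, cpu;

                if (sscanf(tok, "%d-%d", &lo, &hi) != 2)
                        lo = hi = atoi(tok);
                for (cpu = lo; cpu <= hi; cpu++)
                        CPU_SET(cpu, &set);
        }
        /* 0 == calling thread; repeat per worker tid as needed */
        return sched_setaffinity(0, sizeof(set), &set);
}

void on_manager_notification(void)      /* hypothetical hook */
{
        resync_affinity("/sys/fs/cgroup/cpuset/job0/private/cpuset.cpus");
}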

>> 3) Virtualization of CPU ids.  (Multiple applications all binding to
>> core 1 is a bad thing.)
>
> This is about who's setting the affinity, right?  As long as an agent
> which knows system details sets it, which mechanism doesn't really
> matter.

Yes, there are other ways to implement this.

>
>> >> Let me try to restate:
>> >>   I think that we can specify the usage is specifically niche that it
>> >> will *typically* be used by higher level management daemons which
>> >
>> > I really don't think that's the case.
>> >
>>
>> Can you provide examples of non-exceptional usage in this fashion?
>
> I heard of two use cases.  One is system-partitioning that you're
> talking about and the other is preventing threads of the same process
> from stepping on each other's toes.  There was a fancy word for the
> cacheline cannibalizing behavior which shows up in those scenarios.

So this is a single example right, since the system partitioning case
is the one in which it's exclusively used by a higher level management
daemon.

The case of a process with specifically identified threads in
conflict certainly seems exceptional in the level of optimization both
in implementation and analysis present.  I would expect in this case
that either they are comfortable with the more technical API, or they
can coordinate with an external controller.  Which is much less
overloaded both by number of callers and by number of interfaces than
it is in the cpuset case.

>
>> > It's more like there are two niche sets of use cases.  If a
>> > programmable interface or cgroups has to be picked as an exclusive
>> > alternative, it's pretty clear that programmable interface is the way
>> > to go.
>>
>> I strongly disagree here:
>>   The *major obvious use* is partitioning of a system, which must act
>
> I don't know.  Why is that more major obvious use?  This is super
> duper fringe to begin with.  It's like tallying up beans.  Sure, some
> may be taller than others but they're all still beans and I'm not even
> sure there's a big difference between the two use cases here.

I don't think the case of having a large compute farm with
"unimportant" and "important" work is particularly fringe.  Reducing
the impact on the "important" work so that we can scavenge more cycles
for the latency insensitive "unimportant" is very real.

>
>> on groups of processes.  Cgroups is the only interface we have which
>> satisfies this today.
>
> Well, not really.  cgroups is more convenient / better at these things
> but not the only way to do it.  People have been doing isolation to
> varying degrees with other mechanisms for ages.
>

Right, but it's exactly because of _how bad_ those other mechanisms
_are_ that cgroups was originally created.Its growth was not
managed well from there, but let's not step away from the fact that
this interface was created to solve this problem.

>> > Ditto.  Wouldn't it be better to implement something which resembles
>> > conventional programming interface rather than contorting the
>> > filesystem semantics?
>>
>> Maybe?  This is a trade-off, some of which is built on the assumptions
>> we're now debating.
>>
>> There is also value, cost-wise, in iterative improvement of what we
>> have today rather than trying to nuke it from orbit.  I do not know
>> which of these is the right choice, it likely depends strongly on
>> where we end up for sub-process interfaces.  If we do support those
>> I'm not sure it 

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Tejun Heo
Hello,

On Mon, Aug 24, 2015 at 03:03:05PM -0700, Paul Turner wrote:
> > Hmm... I was hoping for actual configurations and usage scenarios.
> > Preferably something people can set up and play with.
> 
> This is much easier to set up and play with synthetically.  Just
> create the 10 threads and 100 threads above then experiment with
> configurations designed to guarantee the set of 100 threads
> relatively uniform throughput regardless of how many are active.   I
> don't think trying to run a VM stack adds anything except complexity
> of reproduction here.

Well, but that loses most of details and why such use cases matter to
begin with.  We can imagine up stuff to induce arbitrary set of
requirements.

> > I take that the
> > CPU intensive helper threads are usually IO workers?  Is the scenario
> > where the VM is set up with a lot of IO devices and different ones may
> > consume large amount of CPU cycles at any given point?
> 
> Yes, generally speaking there are a few major classes of IO (flash,
> disk, network) that a guest may invoke.  Each of these backends is
> separate and chooses its own threading.

Hmmm... if that's the case, would limiting iops on those IO devices
(or classes of them) work?  qemu already implements IO limit mechanism
after all.

Anyways, a point here is that threads of the same process competing
isn't a new problem.  There are many ways to make those threads play
nice as the application itself often has to be involved anyway,
especially for something like qemu which is heavily involved in
provisioning resources.

cgroups can be a nice brute-force add-on which lets sysadmins do wild
things but it's inherently hacky and incomplete for coordinating
threads.  For example, what is it gonna do if qemu cloned vcpus and IO
helpers dynamically off of the same parent thread?  It requires
application's cooperation anyway but at the same time is painful to
actually interact from those applications.

Thanks.

-- 
tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Tejun Heo
Hey,

On Mon, Aug 24, 2015 at 02:58:23PM -0700, Paul Turner wrote:
> > Why isn't it?  Because the programs themselves might try to override
> > it?
> 
> The major reasons are:
> 
> 1) Isolation.  Doing everything with sched_setaffinity means that
> programs can use arbitrary resources if they desire.
>   1a) These restrictions need to also apply to threads created by
> library code.  Which may be 3rd party.
> 2) Interaction between cpusets and sched_setaffinity.  For necessary
> reasons, a cpuset update always overwrites all extant
> sched_setaffinity values. ...And we need some cpusets for (1).  And
> we need periodic updates for access to shared cores.

This is an erratic behavior on cpuset's part tho.  Nothing else
behaves this way and it's borderline buggy.

> 3) Virtualization of CPU ids.  (Multiple applications all binding to
> core 1 is a bad thing.)

This is about who's setting the affinity, right?  As long as an agent
which knows system details sets it, which mechanism doesn't really
matter.

> >> Let me try to restate:
> >>   I think that we can specify the usage is specifically niche that it
> >> will *typically* be used by higher level management daemons which
> >
> > I really don't think that's the case.
> >
> 
> Can you provide examples of non-exceptional usage in this fashion?

I heard of two use cases.  One is system-partitioning that you're
talking about and the other is preventing threads of the same process
from stepping on each other's toes.  There was a fancy word for the
cacheline cannibalizing behavior which shows up in those scenarios.

> > It's more like there are two niche sets of use cases.  If a
> > programmable interface or cgroups has to be picked as an exclusive
> > alternative, it's pretty clear that programmable interface is the way
> > to go.
> 
> I strongly disagree here:
>   The *major obvious use* is partitioning of a system, which must act

I don't know.  Why is that more major obvious use?  This is super
duper fringe to begin with.  It's like tallying up beans.  Sure, some
may be taller than others but they're all still beans and I'm not even
sure there's a big difference between the two use cases here.

> on groups of processes.  Cgroups is the only interface we have which
> satisfies this today.

Well, not really.  cgroups is more convenient / better at these things
but not the only way to do it.  People have been doing isolation to
varying degrees with other mechanisms for ages.

> > Ditto.  Wouldn't it be better to implement something which resembles
> > conventional programming interface rather than contorting the
> > filesystem semantics?
> 
> Maybe?  This is a trade-off, some of which is built on the assumptions
> we're now debating.
> 
> There is also value, cost-wise, in iterative improvement of what we
> have today rather than trying to nuke it from orbit.  I do not know
> which of these is the right choice, it likely depends strongly on
> where we end up for sub-process interfaces.  If we do support those
> I'm not sure it makes sense for them to have an entirely different API
> from process-level coordination, at which point the file-system
> overload is a trade-off rather than a cost.

Yeah, I understand the similarity part but don't buy that the benefit
there is big enough to introduce a kernel API which is expected to be
used by individual programs which is radically different from how
processes / threads are organized and applications interact with the
kernel.  These are a lot more grave issues and if we end up paying
some complexity from kernel side internally, so be it.

> > So, except for cpuset, this doesn't matter for controllers.  All
> > limits are hierarchical and that's it.
> 
> Well no, it still matters because I might want to lower the limit
> below what children have set.

All controllers only get what their ancestors can hand down to them.
That's basic hierarchical behavior.

> > For cpuset, it's tricky
> > because a nested cgroup might end up with no intersecting execution
> > resource.  The kernel can't have threads which don't have any
> > execution resources and the solution has been assuming the resources
> > from higher-ups till there's some.  Application control has always
> > behaved the same way.  If the configured affinity becomes empty, the
> > scheduler ignored it.
> 
> Actually no, any configuration change that would result in this state
> is rejected.
>
> It's not possible to configure an empty cpuset once tasks are in it,
> or attach tasks to an empty set.
> It's also not possible to create this state using setaffinity, these
> restrictions are always over-ridden by updates, even if they do not
> need to be.

So, even in traditional hierarchies, this isn't true.  You can get to
no-resource config through cpu hot-unplug and cpuset currently ejects
tasks to the closest ancestor with execution resources.

> > Because the details on this particular issue can be hashed out in the
> > future?  There's nothing permanently 

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 2:40 PM, Tejun Heo  wrote:
> On Mon, Aug 24, 2015 at 02:19:29PM -0700, Paul Turner wrote:
>> > Would it be possible for you to give realistic and concrete examples?
>> > I'm not trying to play down the use cases but concrete examples are
>> > usually helpful at putting things in perspective.
>>
>> I don't think there's anything that's not realistic or concrete about
>> the example above.  The "suppose" parts were only for qualifying the
>> pool sizes for vcpu and non-vcpu threads above since discussion of
>> implementation using nice is dependent on knowing these counts.
>
> Hmm... I was hoping for actual configurations and usage scenarios.
> Preferably something people can set up and play with.

This is much easier to set up and play with synthetically.  Just
create the 10 threads and 100 threads above then experiment with
configurations designed to guarantee the set of 100 threads
relatively uniform throughput regardless of how many are active.   I
don't think trying to run a VM stack adds anything except complexity
of reproduction here.
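
(Something along these lines is enough to reproduce it synthetically; a
hedged sketch, where the two tasks-file paths just stand in for whichever
vcpu/support sub-groups you created:)

/* hedged sketch: 10 "vcpu" spinners vs. 100 "support" spinners, each
 * thread adding itself to a cgroup v1 "tasks" file before burning CPU */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void join_cgroup(const char *tasks_file)
{
        FILE *f = fopen(tasks_file, "w");

        if (f) {
                fprintf(f, "%ld\n", (long)syscall(SYS_gettid));
                fclose(f);
        }
}

static void *spin(void *tasks_file)
{
        join_cgroup(tasks_file);
        for (;;)
                ;       /* pure CPU burn; compare per-group throughput */
        return NULL;
}

int main(void)
{
        pthread_t t;
        int i;

        for (i = 0; i < 10; i++)
                pthread_create(&t, NULL, spin, "/sys/fs/cgroup/cpu/vm0/vcpus/tasks");
        for (i = 0; i < 100; i++)
                pthread_create(&t, NULL, spin, "/sys/fs/cgroup/cpu/vm0/support/tasks");
        pause();
        return 0;
}

Then just watch how much CPU each of the 100 gets (top -H is enough) as
you vary how many of them are actually runnable.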

> I take that the
> CPU intensive helper threads are usually IO workers?  Is the scenario
> where the VM is set up with a lot of IO devices and different ones may
> consume large amount of CPU cycles at any given point?

Yes, generally speaking there are a few major classes of IO (flash,
disk, network) that a guest may invoke.  Each of these backends is
separate and chooses its own threading.


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 2:36 PM, Tejun Heo  wrote:
> Hello, Paul.
>
> On Mon, Aug 24, 2015 at 01:52:01PM -0700, Paul Turner wrote:
>> We typically share our machines between many jobs, these jobs can have
>> cores that are "private" (and not shared with other jobs) and cores
>> that are "shared" (general purpose cores accessible to all jobs on the
>> same machine).
>>
>> The pool of cpus in the "shared" pool is dynamic as jobs entering and
>> leaving the machine take or release their associated "private" cores.
>>
>> By creating the appropriate sub-containers within the cpuset group we
>> allow jobs to pin specific threads to run on their (typically) private
>> cores.  This also allows the management daemons additional flexibility
>> as it's possible to update which cores we place as private, without
>> synchronization with the application.  Note that sched_setaffinity()
>> is a non-starter here.
>
> Why isn't it?  Because the programs themselves might try to override
> it?

The major reasons are:

1) Isolation.  Doing everything with sched_setaffinity means that
programs can use arbitrary resources if they desire.
  1a) These restrictions need to also apply to threads created by
library code.  Which may be 3rd party.
2) Interaction between cpusets and sched_setaffinity.  For necessary
reasons, a cpuset update always overwrites all extant
sched_setaffinity values. ...And we need some cpusets for (1).  And
we need periodic updates for access to shared cores.
3) Virtualization of CPU ids.  (Multiple applications all binding to
core 1 is a bad thing.)

>
>> Let me try to restate:
>>   I think that we can specify the usage is specifically niche that it
>> will *typically* be used by higher level management daemons which
>
> I really don't think that's the case.
>

Can you provide examples of non-exceptional usage in this fashion?

>> prefer a more technical and specific interface.  This does not
>> preclude use by threads, it just makes it less convenient; I think
>> that we should be optimizing for flexibility over ease-of-use for a
>> very small number of cases here.
>
> It's more like there are two niche sets of use cases.  If a
> programmable interface or cgroups has to be picked as an exclusive
> alternative, it's pretty clear that programmable interface is the way
> to go.

I strongly disagree here:
  The *major obvious use* is partitioning of a system, which must act
on groups of processes.  Cgroups is the only interface we have which
satisfies this today.

>
>> > It's not contained in the process at all.  What if an external entity
>> > decides to migrate the process into another cgroup inbetween?
>> >
>>
>> If we have 'atomic' moves and a way to access our sub-containers from
>> the process in a consistent fashion (e.g. relative paths) then this is
>> not an issue.
>
> But it gets so twisted.  Relative paths aren't enough.  It actually
> has to proxy accesses to already open files.  At that point, why would
> we even keep it as a file-system based interface?

Well no, this can just be reversed and we can have the relative paths
be the actual files which the hierarchy points back at.

Ultimately, they could potentially not even be exposed in the regular
hierarchy.  At this point we could not expose anything that does not
support sub-process splits within processes' hierarchy and we're at a
more reasonable state of affairs.

There is real value in being able to duplicate interface between
process and sub-process level control.

>
>> I am not endorsing the world we are in today, only describing how it
>> can be somewhat sanely managed.  Some of these lessons could be
>> formalized in imagining the world of tomorrow.  E.g. the sub-process
>> mounts could appear within some (non-movable) alternate file-system
>> path.
>
> Ditto.  Wouldn't it be better to implement something which resembles
> conventional programming interface rather than contorting the
> filesystem semantics?
>

Maybe?  This is a trade-off, some of which is built on the assumptions
we're now debating.

There is also value, cost-wise, in iterative improvement of what we
have today rather than trying to nuke it from orbit.  I do not know
which of these is the right choice, it likely depends strongly on
where we end up for sub-process interfaces.  If we do support those
I'm not sure it makes sense for them to have an entirely different API
from process-level coordination, at which point the file-system
overload is a trade-off rather than a cost.

>> >> The harder answer is:  How do we handle non-fungible resources such as
>> >> CPU assignments within a hierarchy?  This is a big part of why I make
>> >> arguments for certain partitions being management-software only above.
>> >> This is imperfect, but better then where we stand today.
>> >
>> > I'm not following.  Why is that different?
>>
>> This is generally any time a change in the external-to-application's
>> cgroup-parent requires changes in the sub-hierarchy.  This is most
>> visible 

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Tejun Heo
On Mon, Aug 24, 2015 at 02:19:29PM -0700, Paul Turner wrote:
> > Would it be possible for you to give realistic and concrete examples?
> > I'm not trying to play down the use cases but concrete examples are
> > usually helpful at putting things in perspective.
> 
> I don't think there's anything that's not realistic or concrete about
> the example above.  The "suppose" parts were only for qualifying the
> pool sizes for vcpu and non-vcpu threads above since discussion of
> implementation using nice is dependent on knowing these counts.

Hmm... I was hoping for actual configurations and usage scenarios.
Preferably something people can set up and play with.  I take that the
CPU intensive helper threads are usually IO workers?  Is the scenario
where the VM is set up with a lot of IO devices and different ones may
consume large amount of CPU cycles at any given point?

Thanks.

-- 
tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Tejun Heo
Hello, Paul.

On Mon, Aug 24, 2015 at 01:52:01PM -0700, Paul Turner wrote:
> We typically share our machines between many jobs, these jobs can have
> cores that are "private" (and not shared with other jobs) and cores
> that are "shared" (general purpose cores accessible to all jobs on the
> same machine).
> 
> The pool of cpus in the "shared" pool is dynamic as jobs entering and
> leaving the machine take or release their associated "private" cores.
> 
> By creating the appropriate sub-containers within the cpuset group we
> allow jobs to pin specific threads to run on their (typically) private
> cores.  This also allows the management daemons additional flexibility
> as it's possible to update which cores we place as private, without
> synchronization with the application.  Note that sched_setaffinity()
> is a non-starter here.

Why isn't it?  Because the programs themselves might try to override
it?

> Let me try to restate:
>   I think that we can specify the usage is specifically niche that it
> will *typically* be used by higher level management daemons which

I really don't think that's the case.

> prefer a more technical and specific interface.  This does not
> preclude use by threads, it just makes it less convenient; I think
> that we should be optimizing for flexibility over ease-of-use for a
> very small number of cases here.

It's more like there are two niche sets of use cases.  If a
programmable interface or cgroups has to be picked as an exclusive
alternative, it's pretty clear that programmable interface is the way
to go.

> > It's not contained in the process at all.  What if an external entity
> > decides to migrate the process into another cgroup inbetween?
> >
> 
> If we have 'atomic' moves and a way to access our sub-containers from
> the process in a consistent fashion (e.g. relative paths) then this is
> not an issue.

But it gets so twisted.  Relative paths aren't enough.  It actually
has to proxy accesses to already open files.  At that point, why would
we even keep it as a file-system based interface?

> I am not endorsing the world we are in today, only describing how it
> can be somewhat sanely managed.  Some of these lessons could be
> formalized in imagining the world of tomorrow.  E.g. the sub-process
> mounts could appear within some (non-movable) alternate file-system
> path.

Ditto.  Wouldn't it be better to implement something which resembles
conventional programming interface rather than contorting the
filesystem semantics?

> >> The harder answer is:  How do we handle non-fungible resources such as
> >> CPU assignments within a hierarchy?  This is a big part of why I make
> >> arguments for certain partitions being management-software only above.
> >> This is imperfect, but better then where we stand today.
> >
> > I'm not following.  Why is that different?
> 
> This is generally any time a change in the external-to-application's
> cgroup-parent requires changes in the sub-hierarchy.  This is most
> visible with a resource such as a cpu which is uniquely identified,
> but similarly applies to any limits.

So, except for cpuset, this doesn't matter for controllers.  All
limits are hierarchical and that's it.  For cpuset, it's tricky
because a nested cgroup might end up with no intersecting execution
resource.  The kernel can't have threads which don't have any
execution resources and the solution has been assuming the resources
from higher-ups till there's some.  Application control has always
behaved the same way.  If the configured affinity becomes empty, the
scheduler ignored it.

> > The transition can already be gradual.  Why would you add yet another
> > transition step?
> 
> Because what's being proposed today does not offer any replacement for
> the sub-process control that we depend on today?  Why would we embark
> on merging the new interface before these details are sufficiently
> resolved?

Because the details on this particular issue can be hashed out in the
future?  There's nothing permanently blocking any direction that we
might choose in the future and what's working today will keep working.
Why block the whole thing which can be useful for the majority of use
cases for this particular corner case?

Thanks.

-- 
tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 2:17 PM, Tejun Heo  wrote:
> Hello,
>
> On Mon, Aug 24, 2015 at 02:10:17PM -0700, Paul Turner wrote:
>> Suppose that we have 10 vcpu threads and 100 support threads.
>> Suppose that we want the support threads to receive up to 10% of the
>> time available to the VM as a whole on that machine.
>>
>> If I have one particular support thread that is busy, I want it to
>> receive that entire 10% (maybe a guest is pounding on scsi for
>> example, or in the thread-pool case, I've passed a single expensive
>> computation).  Conversely, suppose the guest is doing lots of
>> different things and several support threads are active, I want the
>> time to be shared between them.
>>
>> There is no way to implement this with nice.  Either a single thread
>> can consume 10%, and the group can dominate, or the group cannot
>> dominate and the single thread can be starved.
>
> Would it be possible for you to give realistic and concrete examples?
> I'm not trying to play down the use cases but concrete examples are
> usually helpful at putting things in perspective.

I don't think there's anything that's not realistic or concrete about
the example above.  The "suppose" parts were only for qualifying the
pool sizes for vcpu and non-vcpu threads above since discussion of
implementation using nice is dependent on knowing these counts.


>
> Thanks.
>
> --
> tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Tejun Heo
Hello,

On Mon, Aug 24, 2015 at 02:10:17PM -0700, Paul Turner wrote:
> Suppose that we have 10 vcpu threads and 100 support threads.
> Suppose that we want the support threads to receive up to 10% of the
> time available to the VM as a whole on that machine.
> 
> If I have one particular support thread that is busy, I want it to
> receive that entire 10% (maybe a guest is pounding on scsi for
> example, or in the thread-pool case, I've passed a single expensive
> computation).  Conversely, suppose the guest is doing lots of
> different things and several support threads are active, I want the
> time to be shared between them.
> 
> There is no way to implement this with nice.  Either a single thread
> can consume 10%, and the group can dominate, or the group cannot
> dominate and the single thread can be starved.

Would it be possible for you to give realistic and concrete examples?
I'm not trying to play down the use cases but concrete examples are
usually helpful at putting things in perspective.

Thanks.

-- 
tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 2:12 PM, Tejun Heo  wrote:
> Hello, Paul.
>
> On Mon, Aug 24, 2015 at 02:00:54PM -0700, Paul Turner wrote:
>> > Hmmm... I'm trying to understand the usecases where having hierarchy
>> > inside a process are actually required so that we don't end up doing
>> > something complex unnecessarily.  So far, it looks like an easy
>> > alternative for qemu would be teaching it to manage priorities of its
>> > threads given that the threads are mostly static - vcpus going up and
>> > down are explicit operations which can trigger priority adjustments if
>> > necessary, which is unlikely to begin with.
>>
>> What you're proposing is both unnecessarily complex and imprecise.
>> Arbitrating competition between groups of threads is exactly why we
>> support sub-hierarchies within cpu.
>
> Sure, and to make that behave half-way acceptable, we'll have to take
> on significant amount of effort and likely complexity and I'm trying
> to see whether the usecases are actually justifiable.  I get that
> priority based solution will be less precise and more complex on the
> application side but by how much and does the added precision enough
> to justify the extra facilities to support that?  If it is, sure,
> let's get to it but it'd be great if the concrete prolem cases are
> properly identified and understood.  I'll continue on the other reply.
>

No problem, I think the conversation is absolutely
constructive/important to have and am happy to help drill down.

> Thanks.
>
> --
> tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Tejun Heo
Hello, Paul.

On Mon, Aug 24, 2015 at 02:00:54PM -0700, Paul Turner wrote:
> > Hmmm... I'm trying to understand the usecases where having hierarchy
> > inside a process are actually required so that we don't end up doing
> > something complex unnecessarily.  So far, it looks like an easy
> > alternative for qemu would be teaching it to manage priorities of its
> > threads given that the threads are mostly static - vcpus going up and
> > down are explicit operations which can trigger priority adjustments if
> > necessary, which is unlikely to begin with.
> 
> What you're proposing is both unnecessarily complex and imprecise.
> Arbitrating competition between groups of threads is exactly why we
> support sub-hierarchies within cpu.

Sure, and to make that behave half-way acceptable, we'll have to take
on significant amount of effort and likely complexity and I'm trying
to see whether the usecases are actually justifiable.  I get that
priority based solution will be less precise and more complex on the
application side but by how much, and is the added precision enough
to justify the extra facilities to support that?  If it is, sure,
let's get to it but it'd be great if the concrete problem cases are
properly identified and understood.  I'll continue on the other reply.

Thanks.

-- 
tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 2:02 PM, Tejun Heo  wrote:
> Hello,
>
> On Mon, Aug 24, 2015 at 01:54:08PM -0700, Paul Turner wrote:
>> > That alone doesn't require hierarchical resource distribution tho.
>> > Setting nice levels reasonably is likely to alleviate most of the
>> > problem.
>>
>> Nice is not sufficient here.  There could be arbitrarily many threads
>> within the hypervisor that are not actually hosting guest CPU threads.
>> The only way to have this competition occur at a reasonably fixed
>> ratio is a sub-hierarchy.
>
> I get that having hierarchy of threads would be nicer but am having a
> bit of difficulty seeing why adjusting priorities of threads wouldn't
> be sufficient.  It's not like threads of the same process competing
> with each other is a new problem.  People have been dealing with it
> for ages.  Hierarchical management can be a nice plus but we want the
> problem and proposed solution to be justifiable.

Consider what happens with load asymmetry:

Suppose that we have 10 vcpu threads and 100 support threads.
Suppose that we want the support threads to receive up to 10% of the
time available to the VM as a whole on that machine.

If I have one particular support thread that is busy, I want it to
receive that entire 10% (maybe a guest is pounding on scsi for
example, or in the thread-pool case, I've passed a single expensive
computation).  Conversely, suppose the guest is doing lots of
different things and several support threads are active, I want the
time to be shared between them.

There is no way to implement this with nice.  Either a single thread
can consume 10%, and the group can dominate, or the group cannot
dominate and the single thread can be starved.
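
(For concreteness, a rough sketch of the sub-hierarchy that fixes the
ratio, using the v1 cpu controller; the paths are illustrative and
9216:1024 is just one way of expressing roughly 90%:10%:)

/* hedged sketch: fix the vcpu vs. support competition at ~90%:10% of
 * the VM's share, independent of how many support threads are runnable */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (f) {
                fputs(val, f);
                fclose(f);
        }
}

int main(void)
{
        mkdir("/sys/fs/cgroup/cpu/vm0/vcpus", 0755);
        mkdir("/sys/fs/cgroup/cpu/vm0/support", 0755);

        /* siblings compete by cpu.shares: 9216 vs. 1024 is ~9:1 */
        write_str("/sys/fs/cgroup/cpu/vm0/vcpus/cpu.shares", "9216");
        write_str("/sys/fs/cgroup/cpu/vm0/support/cpu.shares", "1024");

        /* vcpu TIDs then go into vcpus/tasks, helper TIDs into
         * support/tasks; one busy helper can take the whole ~10%,
         * several busy helpers share it */
        return 0;
}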

>
> Thanks.
>
> --
> tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Tejun Heo
Hello,

On Mon, Aug 24, 2015 at 01:54:08PM -0700, Paul Turner wrote:
> > That alone doesn't require hierarchical resource distribution tho.
> > Setting nice levels reasonably is likely to alleviate most of the
> > problem.
> 
> Nice is not sufficient here.  There could be arbitrarily many threads
> within the hypervisor that are not actually hosting guest CPU threads.
> The only way to have this competition occur at a reasonably fixed
> ratio is a sub-hierarchy.

I get that having hierarchy of threads would be nicer but am having a
bit of difficulty seeing why adjusting priorities of threads wouldn't
be sufficient.  It's not like threads of the same process competing
with each other is a new problem.  People have been dealing with it
for ages.  Hierarchical management can be a nice plus but we want the
problem and proposed solution to be justifiable.

Thanks.

-- 
tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 1:25 PM, Tejun Heo  wrote:
> Hello, Austin.
>
> On Mon, Aug 24, 2015 at 04:00:49PM -0400, Austin S Hemmelgarn wrote:
>> >That alone doesn't require hierarchical resource distribution tho.
>> >Setting nice levels reasonably is likely to alleviate most of the
>> >problem.
>>
>> In the cases I've dealt with this myself, nice levels didn't cut it, and I
>> had to resort to SCHED_RR with particular care to avoid priority inversions.
>
> I wonder why.  The difference between -20 and 20 is around 2500x in
> terms of weight.  That should have been enough for expressing whatever
> precedence the vcpus should have over other threads.

This strongly perturbs the load-balancer which performs busiest cpu
selection by weight.
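
(For reference, the weights in question are the scheduler's
nice-to-weight table; a few prio_to_weight[] entries, with adjacent nice
levels roughly 1.25x apart:)

/* excerpt of the CFS nice-to-weight mapping; these weights feed the
 * per-cpu load figures that busiest-cpu selection compares */
static const int prio_to_weight_excerpt[] = {
        /* nice -20 */ 88761,
        /* nice -10 */  9548,
        /* nice   0 */  1024,
        /* nice  10 */   110,
        /* nice  19 */    15,
};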

Note also that we do not necessarily want total dominance by vCPU
threads; the hypervisor threads are almost always doing work on their
behalf and we want to provision them with _some_ time.  A
sub-hierarchy allows this to be performed in a way that is independent
of how many vCPUs or support threads are present.

>
>> >I don't know.  "Someone running one or two VM's on a laptop under
>> >QEMU" doesn't really sound like the use case which absolutely requires
>> >hierarchical cpu cycle distribution.
>>
>> It depends on the use case.  I never have more than 2 VM's running on my
>> laptop (always under QEMU, setting up Xen is kind of pointless on a quad core
>> system with only 8G of RAM), and I take extensive advantage of the cpu
>> cgroup to partition resources among various services on the host.
>
> Hmmm... I'm trying to understand the usecases where having hierarchy
> inside a process are actually required so that we don't end up doing
> something complex unnecessarily.  So far, it looks like an easy
> alternative for qemu would be teaching it to manage priorities of its
> threads given that the threads are mostly static - vcpus going up and
> down are explicit operations which can trigger priority adjustments if
> necessary, which is unlikely to begin with.

What you're proposing is both unnecessarily complex and imprecise.
Arbitrating competition between groups of threads is exactly why we
support sub-hierarchies within cpu.

>
> Thanks.
>
> --
> tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 10:04 AM, Tejun Heo  wrote:
> Hello, Austin.
>
> On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote:
>> >Just to learn more, what sort of hypervisor support threads are we
>> >talking about?  They would have to consume considerable amount of cpu
>> >cycles for problems like this to be relevant and be dynamic in numbers
>> >in a way which letting them competing against vcpus makes sense.  Do
>> >IO helpers meet these criteria?
>> >
>> Depending on the configuration, yes they can.  VirtualBox has some rather
>> CPU intensive threads that aren't vCPU threads (their emulated APIC thread
>> immediately comes to mind), and so does QEMU depending on the emulated
>
> And the number of those threads fluctuate widely and dynamically?
>
>> hardware configuration (it gets more noticeable when the disk images are
>> stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is
>> pretty typical usage for large virtualization deployments).  I've seen cases
>> first hand where the vCPU's can make no reasonable progress because they are
>> constantly getting crowded out by other threads.
>
> That alone doesn't require hierarchical resource distribution tho.
> Setting nice levels reasonably is likely to alleviate most of the
> problem.

Nice is not sufficient here.  There could be arbitrarily many threads
within the hypervisor that are not actually hosting guest CPU threads.
The only way to have this competition occur at a reasonably fixed
ratio is a sub-hierarchy.

>
>> The use of the term 'hypervisor support threads' for this is probably not
>> the best way of describing the contention, as it's almost always a full
>> system virtualization issue, and the contending threads are usually storage
>> back-end access threads.
>>
>> I would argue that there are better ways to deal properly with this (Isolate
>> the non vCPU threads on separate physical CPU's from the hardware emulation
>> threads), but such methods require large systems to be practical at any
>> scale, and many people don't have the budget for such large systems, and
>> this way of doing things is much more flexible for small scale use cases
>> (for example, someone running one or two VM's on a laptop under QEMU or
>> VirtualBox).
>
> I don't know.  "Someone running one or two VM's on a laptop under
> QEMU" doesn't really sound like the use case which absolutely requires
> hierarchical cpu cycle distribution.
>

We run more than 'one or two' VMs using this configuration.  :)


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Sat, Aug 22, 2015 at 11:29 AM, Tejun Heo  wrote:
> Hello, Paul.
>
> On Fri, Aug 21, 2015 at 12:26:30PM -0700, Paul Turner wrote:
> ...
>> A very concrete example of the above is a virtual machine in which you
>> want to guarantee scheduling for the vCPU threads which must schedule
>> beside many hypervisor support threads.   A hierarchy is the only way
>> to fix the ratio at which these compete.
>
> Just to learn more, what sort of hypervisor support threads are we
> talking about?  They would have to consume considerable amount of cpu
> cycles for problems like this to be relevant and be dynamic in numbers
> in a way which letting them competing against vcpus makes sense.  Do
> IO helpers meet these criteria?

I'm not sure what you mean by an IO helper.  By support threads I mean
any threads that are used in the hypervisor implementation that are
not hosting a vCPU.

>
>> An example that's not the cpu controller is that we use cpusets to
>> expose to applications their "shared" and "private" cores.  (These
>> sets are dynamic based on what is coscheduled on a given machine.)
>
> Can you please go into more details with these?

We typically share our machines between many jobs, these jobs can have
cores that are "private" (and not shared with other jobs) and cores
that are "shared" (general purpose cores accessible to all jobs on the
same machine).

The pool of cpus in the "shared" pool is dynamic as jobs entering and
leaving the machine take or release their associated "private" cores.

By creating the appropriate sub-containers within the cpuset group we
allow jobs to pin specific threads to run on their (typically) private
cores.  This also allows the management daemons additional flexibility
as it's possible to update which cores we place as private, without
synchronization with the application.  Note that sched_setaffinity()
is a non-starter here.
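
(Roughly, the mechanics look like this; the paths are illustrative only
and assume the v1 cpuset hierarchy.  The management side keeps
private/cpuset.cpus up to date, and the job pins a particular worker by
writing that thread's TID into the sub-container, which moves just that
one thread:)

/* hedged sketch: pin the calling worker thread to the job's "private"
 * cpuset sub-container (cgroup v1: writing a TID to "tasks" moves only
 * that thread, the rest of the job stays in the shared parent) */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static int move_self_to(const char *tasks_file)
{
        FILE *f = fopen(tasks_file, "w");

        if (!f)
                return -1;
        fprintf(f, "%ld\n", (long)syscall(SYS_gettid));
        return fclose(f) ? -1 : 0;
}

void pin_self_to_private_cores(void)    /* hypothetical helper */
{
        move_self_to("/sys/fs/cgroup/cpuset/job0/private/tasks");
}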

>
>> > Why would you assume that threads of a process wouldn't want to
>> > configure it ever?  How is this different from CPU affinity?
>>
>> In general cache and CPU behave differently.  Generally for it to make
>> sense between threads in a process they would have to have wholly
>> disjoint memory, at which point the only sane long-term implementation
>> is separate processes and the management moves up a level anyway.
>>
>> That said, there are surely cases in which it might be convenient to
>> use at a per-thread level to correct a specific performance anomaly.
>> But at that point, you have certainly reached the level of hammer that
>> you can coordinate with an external daemon if necessary.
>
> So, I'm not super familiar with all the use cases but the whole cache
> allocation thing is almost by nature a specific niche thing and I feel
> pretty reluctant to blow off per-thread usages as too niche to worry
> about.

Let me try to restate:
  I think that we can specify the usage is specifically niche that it
will *typically* be used by higher level management daemons which
prefer a more technical and specific interface.  This does not
preclude use by threads, it just makes it less convenient; I think
that we should be optimizing for flexibility over ease-of-use for a
very small number of cases here.

>
>> > I don't follow what you're trying to way with the above paragraph.
>> > Are you still talking about CAT?  If so, that use case isn't the only
>> > one.  I'm pretty sure there are people who would want to configure
>> > cache allocation at thread level.
>>
>> I'm not agreeing with you that "in cgroups" means "must be usable by
>> applications within that hierarchy".  A cgroup subsystem used as a
>> partitioning API only by system management daemons is entirely
>> reasonable.  CAT is a reasonable example of this.
>
> I see.  The same argument.  I don't think CAT just being system
> management thing makes sense.
>
>> > So, this is a trade-off we're consciously making.  If there are
>> > common-enough use cases which require jumping across different cgroup
>> > domains, we'll try to figure out a way to accomodate those but by
>> > default migration is a very cold and expensive path.
>>
>> The core here was the need for allowing sub-process migration.  I'm
>> not sure I follow the performance trade-off argument; haven't we
>> historically seen the opposite?  That migration has been a slow-path
>> without optimizations and people pushing to make it faster?  This
>> seems a hard generalization to make for something that's inherently
>> tied to a particular controller.
>
> It isn't something tied to a particular controller.  Some controllers
> may get impacted less by than others but there's an inherent
> connection between how dynamic an association is and how expensive the
> locking around it needs to be and we need to set up basic behavior and
> usage conventions so that different controllers are designed and
> implemented assuming similar usage patterns; otherwise, we end up with
> the chaotic shit show that we have had where everything behaves
> 

Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Tejun Heo
Hello, Austin.

On Mon, Aug 24, 2015 at 04:00:49PM -0400, Austin S Hemmelgarn wrote:
> >That alone doesn't require hierarchical resource distribution tho.
> >Setting nice levels reasonably is likely to alleviate most of the
> >problem.
>
> In the cases I've dealt with this myself, nice levels didn't cut it, and I
> had to resort to SCHED_RR with particular care to avoid priority inversions.

I wonder why.  The difference between -20 and 20 is around 2500x in
terms of weight.  That should have been enough for expressing whatever
precedence the vcpus should have over other threads.

> >I don't know.  "Someone running one or two VM's on a laptop under
> >QEMU" doesn't really sound like the use case which absolutely requires
> >hierarchical cpu cycle distribution.
>
> It depends on the use case.  I never have more than 2 VM's running on my
> laptop (always under QEMU, setting up Xen is kind of pointless on a quad core
> system with only 8G of RAM), and I take extensive advantage of the cpu
> cgroup to partition resources among various services on the host.

Hmmm... I'm trying to understand the usecases where having hierarchy
inside a process are actually required so that we don't end up doing
something complex unnecessarily.  So far, it looks like an easy
alternative for qemu would be teaching it to manage priorities of its
threads given that the threads are mostly static - vcpus going up and
down are explicit operations which can trigger priority adjustments if
necessary, which is unlikely to begin with.

Thanks.

-- 
tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Austin S Hemmelgarn

On 2015-08-24 13:04, Tejun Heo wrote:

> Hello, Austin.
>
> On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote:
>>> Just to learn more, what sort of hypervisor support threads are we
>>> talking about?  They would have to consume considerable amount of cpu
>>> cycles for problems like this to be relevant and be dynamic in numbers
>>> in a way which letting them competing against vcpus makes sense.  Do
>>> IO helpers meet these criteria?
>>
>> Depending on the configuration, yes they can.  VirtualBox has some rather
>> CPU intensive threads that aren't vCPU threads (their emulated APIC thread
>> immediately comes to mind), and so does QEMU depending on the emulated
>
> And the number of those threads fluctuate widely and dynamically?

It depends, usually there isn't dynamic fluctuation unless there is a
lot of hot[un]plugging of virtual devices going on (which can be the
case for situations with tight host/guest integration), but the number
of threads can vary widely between configurations (most of the VM's I
run under QEMU have about 16 threads on average, but I've seen instances
with more than 100 threads).  The most likely case to cause wide and
dynamic fluctuations of threads would be systems set up to dynamically
hot[un]plug vCPU's based on system load (such systems have other issues
to contend with also, but they do exist).

>> hardware configuration (it gets more noticeable when the disk images are
>> stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is
>> pretty typical usage for large virtualization deployments).  I've seen cases
>> first hand where the vCPU's can make no reasonable progress because they are
>> constantly getting crowded out by other threads.
>
> That alone doesn't require hierarchical resource distribution tho.
> Setting nice levels reasonably is likely to alleviate most of the
> problem.

In the cases I've dealt with this myself, nice levels didn't cut it, and
I had to resort to SCHED_RR with particular care to avoid priority
inversions.

>> The use of the term 'hypervisor support threads' for this is probably not
>> the best way of describing the contention, as it's almost always a full
>> system virtualization issue, and the contending threads are usually storage
>> back-end access threads.
>>
>> I would argue that there are better ways to deal properly with this (Isolate
>> the non vCPU threads on separate physical CPU's from the hardware emulation
>> threads), but such methods require large systems to be practical at any
>> scale, and many people don't have the budget for such large systems, and
>> this way of doing things is much more flexible for small scale use cases
>> (for example, someone running one or two VM's on a laptop under QEMU or
>> VirtualBox).
>
> I don't know.  "Someone running one or two VM's on a laptop under
> QEMU" doesn't really sound like the use case which absolutely requires
> hierarchical cpu cycle distribution.

It depends on the use case.  I never have more than 2 VM's running on my 
laptop (always under QEMU, setting up Xen is kind of pointless on a quad 
core system with only 8G of RAM), and I take extensive advantage of the 
cpu cgroup to partition resources among various services on the host.







Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Mike Galbraith
On Mon, 2015-08-24 at 13:04 -0400, Tejun Heo wrote:
> Hello, Austin.
> 
> On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote:
> > >Just to learn more, what sort of hypervisor support threads are we
> > >talking about?  They would have to consume considerable amount of cpu
> > >cycles for problems like this to be relevant and be dynamic in numbers
> > >in a way which letting them competing against vcpus makes sense.  Do
> > >IO helpers meet these criteria?
> > >
> > Depending on the configuration, yes they can.  VirtualBox has some rather
> > CPU intensive threads that aren't vCPU threads (their emulated APIC thread
> > immediately comes to mind), and so does QEMU depending on the emulated
> 
> And the number of those threads fluctuate widely and dynamically?
> 
> > hardware configuration (it gets more noticeable when the disk images are
> > stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is
> > pretty typical usage for large virtualization deployments).  I've seen cases
> > first hand where the vCPU's can make no reasonable progress because they are
> > constantly getting crowded out by other threads.

Hm. Serious CPU starvation would seem to require quite a few hungry
threads, but even a few IO threads with kick butt hardware under them
could easily tilt fairness heavily in favor of vCPUs generating IO.

> That alone doesn't require hierarchical resource distribution tho.
> Setting nice levels reasonably is likely to alleviate most of the
> problem.

Unless the CPU controller is in use.

-Mike



Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Tejun Heo
Hello, Austin.

On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote:
> >Just to learn more, what sort of hypervisor support threads are we
> >talking about?  They would have to consume considerable amount of cpu
> >cycles for problems like this to be relevant and be dynamic in numbers
> >in a way which letting them competing against vcpus makes sense.  Do
> >IO helpers meet these criteria?
> >
> Depending on the configuration, yes they can.  VirtualBox has some rather
> CPU intensive threads that aren't vCPU threads (their emulated APIC thread
> immediately comes to mind), and so does QEMU depending on the emulated

And the number of those threads fluctuate widely and dynamically?

> hardware configuration (it gets more noticeable when the disk images are
> stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is
> pretty typical usage for large virtualization deployments).  I've seen cases
> first hand where the vCPU's can make no reasonable progress because they are
> constantly getting crowded out by other threads.

That alone doesn't require hierarchical resource distribution tho.
Setting nice levels reasonably is likely to alleviate most of the
problem.

> The use of the term 'hypervisor support threads' for this is probably not
> the best way of describing the contention, as it's almost always a full
> system virtualization issue, and the contending threads are usually storage
> back-end access threads.
> 
> I would argue that there are better ways to deal properly with this (Isolate
> the non vCPU threads on separate physical CPU's from the hardware emulation
> threads), but such methods require large systems to be practical at any
> scale, and many people don't have the budget for such large systems, and
> this way of doing things is much more flexible for small scale use cases
> (for example, someone running one or two VM's on a laptop under QEMU or
> VirtualBox).

I don't know.  "Someone running one or two VM's on a laptop under
QEMU" doesn't really sound like the use case which absolutely requires
hierarchical cpu cycle distribution.

Thanks.

-- 
tejun


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Austin S Hemmelgarn

On 2015-08-22 14:29, Tejun Heo wrote:

> Hello, Paul.
>
> On Fri, Aug 21, 2015 at 12:26:30PM -0700, Paul Turner wrote:
> ...
>> A very concrete example of the above is a virtual machine in which you
>> want to guarantee scheduling for the vCPU threads which must schedule
>> beside many hypervisor support threads.   A hierarchy is the only way
>> to fix the ratio at which these compete.
>
> Just to learn more, what sort of hypervisor support threads are we
> talking about?  They would have to consume considerable amount of cpu
> cycles for problems like this to be relevant and be dynamic in numbers
> in a way which letting them competing against vcpus makes sense.  Do
> IO helpers meet these criteria?

Depending on the configuration, yes they can.  VirtualBox has some 
rather CPU intensive threads that aren't vCPU threads (their emulated 
APIC thread immediately comes to mind), and so does QEMU depending on 
the emulated hardware configuration (it gets more noticeable when the 
disk images are stored on a SAN and served through iSCSI, NBD, FCoE, or 
ATAoE, which is pretty typical usage for large virtualization 
deployments).  I've seen cases first hand where the vCPU's can make no 
reasonable progress because they are constantly getting crowded out by 
other threads.


The use of the term 'hypervisor support threads' for this is probably 
not the best way of describing the contention, as it's almost always a 
full system virtualization issue, and the contending threads are usually 
storage back-end access threads.


I would argue that there are better ways to deal properly with this 
(Isolate the non vCPU threads on separate physical CPU's from the 
hardware emulation threads), but such methods require large systems to 
be practical at any scale, and many people don't have the budget for 
such large systems, and this way of doing things is much more flexible 
for small scale use cases (for example, someone running one or two VM's 
on a laptop under QEMU or VirtualBox).






Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Austin S Hemmelgarn

On 2015-08-22 14:29, Tejun Heo wrote:

Hello, Paul.

On Fri, Aug 21, 2015 at 12:26:30PM -0700, Paul Turner wrote:
...

A very concrete example of the above is a virtual machine in which you
want to guarantee scheduling for the vCPU threads which must schedule
beside many hypervisor support threads.   A hierarchy is the only way
to fix the ratio at which these compete.


Just to learn more, what sort of hypervisor support threads are we
talking about?  They would have to consume considerable amount of cpu
cycles for problems like this to be relevant and be dynamic in numbers
in a way which letting them competing against vcpus makes sense.  Do
IO helpers meet these criteria?

Depending on the configuration, yes they can.  VirtualBox has some 
rather CPU intensive threads that aren't vCPU threads (their emulated 
APIC thread immediately comes to mind), and so does QEMU depending on 
the emulated hardware configuration (it gets more noticeable when the 
disk images are stored on a SAN and served through iSCSI, NBD, FCoE, or 
ATAoE, which is pretty typical usage for large virtualization 
deployments).  I've seen cases first hand where the vCPU's can make no 
reasonable progress because they are constantly getting crowded out by 
other threads.


The use of the term 'hypervisor support threads' for this is probably 
not the best way of describing the contention, as it's almost always a 
full system virtualization issue, and the contending threads are usually 
storage back-end access threads.


I would argue that there are better ways to deal properly with this 
(Isolate the non vCPU threads on separate physical CPU's from the 
hardware emulation threads), but such methods require large systems to 
be practical at any scale, and many people don't have the budget for 
such large systems, and this way of doing things is much more flexible 
for small scale use cases (for example, someone running one or two VM's 
on a laptop under QEMU or VirtualBox).




smime.p7s
Description: S/MIME Cryptographic Signature


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Tejun Heo
Hello, Austin.

On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote:
 Just to learn more, what sort of hypervisor support threads are we
 talking about?  They would have to consume considerable amount of cpu
 cycles for problems like this to be relevant and be dynamic in numbers
 in a way which letting them competing against vcpus makes sense.  Do
 IO helpers meet these criteria?
 
 Depending on the configuration, yes they can.  VirtualBox has some rather
 CPU intensive threads that aren't vCPU threads (their emulated APIC thread
 immediately comes to mind), and so does QEMU depending on the emulated

And the number of those threads fluctuate widely and dynamically?

 hardware configuration (it gets more noticeable when the disk images are
 stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is
 pretty typical usage for large virtualization deployments).  I've seen cases
 first hand where the vCPU's can make no reasonable progress because they are
 constantly getting crowded out by other threads.

That alone doesn't require hierarchical resource distribution tho.
Setting nice levels reasonably is likely to alleviate most of the
problem.

 The use of the term 'hypervisor support threads' for this is probably not
 the best way of describing the contention, as it's almost always a full
 system virtualization issue, and the contending threads are usually storage
 back-end access threads.
 
 I would argue that there are better ways to deal properly with this (Isolate
 the non vCPU threads on separate physical CPU's from the hardware emulation
 threads), but such methods require large systems to be practical at any
 scale, and many people don't have the budget for such large systems, and
 this way of doing things is much more flexible for small scale use cases
 (for example, someone running one or two VM's on a laptop under QEMU or
 VirtualBox).

I don't know.  "Someone running one or two VM's on a laptop under
QEMU" doesn't really sound like the use case which absolutely requires
hierarchical cpu cycle distribution.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Mike Galbraith
On Mon, 2015-08-24 at 13:04 -0400, Tejun Heo wrote:
 Hello, Austin.
 
 On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote:
  Just to learn more, what sort of hypervisor support threads are we
  talking about?  They would have to consume considerable amount of cpu
  cycles for problems like this to be relevant and be dynamic in numbers
  in a way which letting them competing against vcpus makes sense.  Do
  IO helpers meet these criteria?
  
  Depending on the configuration, yes they can.  VirtualBox has some rather
  CPU intensive threads that aren't vCPU threads (their emulated APIC thread
  immediately comes to mind), and so does QEMU depending on the emulated
 
 And the number of those threads fluctuate widely and dynamically?
 
  hardware configuration (it gets more noticeable when the disk images are
  stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is
  pretty typical usage for large virtualization deployments).  I've seen cases
  first hand where the vCPU's can make no reasonable progress because they are
  constantly getting crowded out by other threads.

Hm. Serious CPU starvation would seem to require quite a few hungry
threads, but even a few IO threads with kick butt hardware under them
could easily tilt fairness heavily in favor of vCPUs generating IO.

 That alone doesn't require hierarchical resource distribution tho.
 Setting nice levels reasonably is likely to alleviate most of the
 problem.

Unless the CPU controller is in use.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Tejun Heo
Hello,

On Mon, Aug 24, 2015 at 01:54:08PM -0700, Paul Turner wrote:
  That alone doesn't require hierarchical resource distribution tho.
  Setting nice levels reasonably is likely to alleviate most of the
  problem.
 
 Nice is not sufficient here.  There could be arbitrarily many threads
 within the hypervisor that are not actually hosting guest CPU threads.
 The only way to have this competition occur at a reasonably fixed
 ratio is a sub-hierarchy.

I get that having a hierarchy of threads would be nicer but am having a
bit of difficulty seeing why adjusting priorities of threads wouldn't
be sufficient.  It's not like threads of the same process competing
with each other is a new problem.  People have been dealing with it
for ages.  Hierarchical management can be a nice plus but we want the
problem and proposed solution to be justifiable.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Tejun Heo
Hello, Austin.

On Mon, Aug 24, 2015 at 04:00:49PM -0400, Austin S Hemmelgarn wrote:
 That alone doesn't require hierarchical resource distribution tho.
 Setting nice levels reasonably is likely to alleviate most of the
 problem.

 In the cases I've dealt with this myself, nice levels didn't cut it, and I
 had to resort to SCHED_RR with particular care to avoid priority inversions.

I wonder why.  The difference between -20 and 20 is around 2500x in
terms of weight.  That should have been enough for expressing whatever
precedence the vcpus should have over other threads.

 I don't know.  "Someone running one or two VM's on a laptop under
 QEMU" doesn't really sound like the use case which absolutely requires
 hierarchical cpu cycle distribution.

 It depends on the use case.  I never have more than 2 VM's running on my
 laptop (always under QEMU, setting up Xen is kind of pointless on a quad core
 system with only 8G of RAM), and I take extensive advantage of the cpu
 cgroup to partition resources among various services on the host.

Hmmm... I'm trying to understand the use cases where having a hierarchy
inside a process is actually required so that we don't end up doing
something complex unnecessarily.  So far, it looks like an easy
alternative for qemu would be teaching it to manage priorities of its
threads given that the threads are mostly static - vcpus going up and
down are explicit operations which can trigger priority adjustments if
necessary, which is unlikely to begin with.
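
As an illustration of what that could look like, here is a minimal sketch
relying on the Linux behaviour that setpriority(PRIO_PROCESS, tid, nice)
applies to a single thread when a TID is passed; the nice values are made up,
and lowering nice below the current value needs CAP_SYS_NICE or a suitable
RLIMIT_NICE:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Set the nice value of one thread (Linux treats the TID as the target). */
static int set_thread_nice(pid_t tid, int nice_val)
{
	if (setpriority(PRIO_PROCESS, tid, nice_val)) {
		perror("setpriority");
		return -1;
	}
	return 0;
}

int main(void)
{
	pid_t self = syscall(SYS_gettid);

	/* A vCPU thread might boost itself at creation time... */
	set_thread_nice(self, -5);

	/* ...while IO/emulation helper threads would demote themselves,
	 * e.g. set_thread_nice(helper_tid, 5). */
	return 0;
}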

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Paul Turner
On Mon, Aug 24, 2015 at 10:04 AM, Tejun Heo t...@kernel.org wrote:
 Hello, Austin.

 On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote:
 Just to learn more, what sort of hypervisor support threads are we
 talking about?  They would have to consume considerable amount of cpu
 cycles for problems like this to be relevant and be dynamic in numbers
 in a way which letting them competing against vcpus makes sense.  Do
 IO helpers meet these criteria?
 
 Depending on the configuration, yes they can.  VirtualBox has some rather
 CPU intensive threads that aren't vCPU threads (their emulated APIC thread
 immediately comes to mind), and so does QEMU depending on the emulated

 And the number of those threads fluctuate widely and dynamically?

 hardware configuration (it gets more noticeable when the disk images are
 stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is
 pretty typical usage for large virtualization deployments).  I've seen cases
 first hand where the vCPU's can make no reasonable progress because they are
 constantly getting crowded out by other threads.

 That alone doesn't require hierarchical resource distribution tho.
 Setting nice levels reasonably is likely to alleviate most of the
 problem.

Nice is not sufficient here.  There could be arbitrarily many threads
within the hypervisor that are not actually hosting guest CPU threads.
The only way to have this competition occur at a reasonably fixed
ratio is a sub-hierarchy.
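
For concreteness, a rough sketch of such a fixed-ratio sub-hierarchy with the
cgroup v1 cpu controller; the mount point is assumed to be the conventional
/sys/fs/cgroup/cpu, and the group names and share values are illustrative only:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write a single value into a cgroup control file, aborting on error. */
static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fprintf(f, "%s\n", val) < 0) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

int main(void)
{
	/*
	 * vm1/vcpus gets cpu.shares 8192 and vm1/helpers gets 1024, so the
	 * vCPU group as a whole gets roughly 8x the cpu time of the helper
	 * group under contention, no matter how many helper threads exist.
	 */
	const char *base = "/sys/fs/cgroup/cpu/vm1";
	char path[256];

	mkdir(base, 0755);

	snprintf(path, sizeof(path), "%s/vcpus", base);
	mkdir(path, 0755);
	snprintf(path, sizeof(path), "%s/vcpus/cpu.shares", base);
	write_file(path, "8192");

	snprintf(path, sizeof(path), "%s/helpers", base);
	mkdir(path, 0755);
	snprintf(path, sizeof(path), "%s/helpers/cpu.shares", base);
	write_file(path, "1024");

	/* Threads are then placed by writing their TIDs into
	 * vm1/vcpus/tasks and vm1/helpers/tasks. */
	return 0;
}

Placement into the tasks files is per-thread, which is what keeps the ratio
fixed no matter how many helper threads come and go.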


 The use of the term 'hypervisor support threads' for this is probably not
 the best way of describing the contention, as it's almost always a full
 system virtualization issue, and the contending threads are usually storage
 back-end access threads.

 I would argue that there are better ways to deal properly with this (Isolate
 the non vCPU threads on separate physical CPU's from the hardware emulation
 threads), but such methods require large systems to be practical at any
 scale, and many people don't have the budget for such large systems, and
 this way of doing things is much more flexible for small scale use cases
 (for example, someone running one or two VM's on a laptop under QEMU or
 VirtualBox).

 I don't know.  "Someone running one or two VM's on a laptop under
 QEMU" doesn't really sound like the use case which absolutely requires
 hierarchical cpu cycle distribution.


We run more than 'one or two' VMs using this configuration.  :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Austin S Hemmelgarn

On 2015-08-24 13:04, Tejun Heo wrote:

Hello, Austin.

On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote:

Just to learn more, what sort of hypervisor support threads are we
talking about?  They would have to consume considerable amount of cpu
cycles for problems like this to be relevant and be dynamic in numbers
in a way which letting them competing against vcpus makes sense.  Do
IO helpers meet these criteria?


Depending on the configuration, yes they can.  VirtualBox has some rather
CPU intensive threads that aren't vCPU threads (their emulated APIC thread
immediately comes to mind), and so does QEMU depending on the emulated


And the number of those threads fluctuate widely and dynamically?
It depends; usually there isn't dynamic fluctuation unless there is a 
lot of hot[un]plugging of virtual devices going on (which can be the 
case for situations with tight host/guest integration), but the number 
of threads can vary widely between configurations (most of the VM's I 
run under QEMU have about 16 threads on average, but I've seen instances 
with more than 100 threads).  The most likely case to cause wide and 
dynamic fluctuations of threads would be systems set up to dynamically 
hot[un]plug vCPU's based on system load (such systems have other issues 
to contend with also, but they do exist).

hardware configuration (it gets more noticeable when the disk images are
stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is
pretty typical usage for large virtualization deployments).  I've seen cases
first hand where the vCPU's can make no reasonable progress because they are
constantly getting crowded out by other threads.


That alone doesn't require hierarchical resource distribution tho.
Setting nice levels reasonably is likely to alleviate most of the
problem.
In the cases I've dealt with this myself, nice levels didn't cut it, and 
I had to resort to SCHED_RR with particular care to avoid priority 
inversions.
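
As a point of reference, that SCHED_RR fallback boils down to something like
the sketch below; it needs CAP_SYS_NICE or an adequate RLIMIT_RTPRIO, the
priority value is illustrative, and as noted it has to be applied with care so
that helper threads the vCPUs depend on do not get priority-inverted:

#include <sched.h>
#include <stdio.h>

/* Put one thread (0 = calling thread) into the SCHED_RR class. */
static int make_rr(pid_t tid, int rt_prio)
{
	struct sched_param sp = { .sched_priority = rt_prio };

	if (sched_setscheduler(tid, SCHED_RR, &sp)) {
		perror("sched_setscheduler");
		return -1;
	}
	return 0;
}

int main(void)
{
	/* A real setup would iterate over the vCPU TIDs. */
	return make_rr(0, 10) ? 1 : 0;
}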

The use of the term 'hypervisor support threads' for this is probably not
the best way of describing the contention, as it's almost always a full
system virtualization issue, and the contending threads are usually storage
back-end access threads.

I would argue that there are better ways to deal properly with this (Isolate
the non vCPU threads on separate physical CPU's from the hardware emulation
threads), but such methods require large systems to be practical at any
scale, and many people don't have the budget for such large systems, and
this way of doing things is much more flexible for small scale use cases
(for example, someone running one or two VM's on a laptop under QEMU or
VirtualBox).


I don't know.  "Someone running one or two VM's on a laptop under
QEMU" doesn't really sound like the use case which absolutely requires
hierarchical cpu cycle distribution.
It depends on the use case.  I never have more than 2 VM's running on my 
laptop (always under QEMU, setting up Xen is kind of pointless on a quad 
core system with only 8G of RAM), and I take extensive advantage of the 
cpu cgroup to partition resources among various services on the host.






