Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-09-02 Thread Thomas Gleixner
On Fri, 2 Sep 2016, Juri Lelli wrote:
> On 01/09/16 14:48, Steve Muckle wrote:
> > On Wed, Aug 31, 2016 at 06:00:02PM +0100, Juri Lelli wrote:
> > > > Another problem is that we have many semi related knobs; we have the
> > > > global RT runtime limit knob, but that doesn't affect cpufreq (maybe it
> > > > should)
> > > 
> > > Maybe we could create this sort of link when using the cgroup RT
> > > throttling interface as well? It should still then fit well once we
> > > replace the underlying mechanism with DL reservations. And, AFAIK, the
> > > interface is used by Android folks already.
> > 
> > I'm not sure how the upper bounds can be used to infer CPU frequency...
> > On my Nexus 6p (an Android device), the global RT runtime limit
> > seems to be set at 950ms/1sec, the root cgroup is set to 800ms/1sec, and
> > bg_non_interactive is set at 700ms/1sec.
> > 
> 
> Right, unfortunately. Still too coarse grained (as Thomas is also saying
> in his last reply, if I read it correctly).

Yes, you do. It's a big hammer and really unsuitable for this kind of
mechanism.

> Doesn't pay off the added complexity I'm afraid.

Certainly not. And the only choice we have is heuristics. Heuristic is an
euphemism for saying that it cannot work.

Thanks,

tglx


Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-09-02 Thread Juri Lelli
On 01/09/16 14:48, Steve Muckle wrote:
> On Wed, Aug 31, 2016 at 06:00:02PM +0100, Juri Lelli wrote:
> > > Another problem is that we have many semi related knobs; we have the
> > > global RT runtime limit knob, but that doesn't affect cpufreq (maybe it
> > > should)
> > 
> > Maybe we could create this sort of link when using the cgroup RT
> > throttling interface as well? It should still then fit well once we
> > replace the underlying mechanism with DL reservations. And, AFAIK, the
> > interface is used by Android folks already.
> 
> I'm not sure how the upper bounds can be used to infer CPU frequency...
> On my Nexus 6p (an Android device), the global RT runtime limit
> seems to be set at 950ms/1sec, the root cgroup is set to 800ms/1sec, and
> bg_non_interactive is set at 700ms/1sec.
> 

Right, unfortunately. Still too coarse grained (as Thomas is also saying
in his last reply, if I read it correctly). Doesn't pay off the added
complexity I'm afraid.


Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-09-02 Thread Thomas Gleixner
On Wed, 31 Aug 2016, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 06:28:10PM +0200, Thomas Gleixner wrote:
> > > That is the way it's been with cpufreq and many systems (including all
> > > mobile devices) rely on that to not destroy power. RT + variable cpufreq
> > > is not deterministic.
> > > 
> > > Given we don't have good constraints on RT tasks I don't think we should
> > > try to strengthen the semantics there. Folks should either move to DL if
> > > they want determinism *and* not-sucky power, or continue disabling
> > > cpufreq if they are able to do so.
> > 
> > RT deterministic behaviour is all about meeting the deadlines. If your
> > deadline is relaxed enough that you can meet it even with the lowest cpu
> > frequency then it's perfectly fine to enable cpufreq. The same logic applies
> > to C-States.
> > 
> > There are a lot of RT systems out there which enable both. If cpufreq or
> > c-states cause a deadline violation because the constraints of the system are
> > tight, then people will disable it and we need a knob for both.
> > 
> > Realtime is not as fast as possible. It's as fast as specified.
> 
> Sure, problem is of course that RR/FIFO doesn't specify anything so the
> users are left to prod knobs.

I know :(
 
> Another problem is that we have many semi related knobs; we have the
> global RT runtime limit knob, but that doesn't affect cpufreq (maybe it
> should) and cpufreq has knobs to set f_min and f_max, which again are
> unaware of RT anything.
> 
> So before we go do anything, I'd like input on what is needed and how
> things should tie together to make most sense.

RT systems and especially RR/FIFO driven ones need a lot of specific tuning
and configuration. I doubt that we can do anything except lousy heuristics
which will end up being wrong for most use cases.

In the DL case we certainly can do informed decisions, but for the RR/FIFO
case the global RT runtime limit is just a too big hammer which shouldn't be
abused for calculating cpufreq limits.

I think that we should concentrate on DL and make it work very well and just
leave the rest of the RT folks with rather simplistic knobs (i.e. on/off/hard
limits). That will force people who have RT _and_ power constraints to think
harder about their system design and eventually make them move over to DL.

Thanks,

tglx


Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-09-01 Thread Steve Muckle
On Wed, Aug 31, 2016 at 06:00:02PM +0100, Juri Lelli wrote:
> > Another problem is that we have many semi related knobs; we have the
> > global RT runtime limit knob, but that doesn't affect cpufreq (maybe it
> > should)
> 
> Maybe we could create this sort of link when using the cgroup RT
> throttling interface as well? It should still then fit well once we
> replace the underlying mechanism with DL reservations. And, AFAIK, the
> interface is used by Android folks already.

I'm not sure how the upper bounds can be used to infer CPU frequency...
On my Nexus 6p (an Android device), the global RT runtime limit
seems to be set at 950ms/1sec, the root cgroup is set to 800ms/1sec, and
bg_non_interactive is set at 700ms/1sec.
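
For illustration only, here is a minimal userspace sketch of what a naive mapping from those upper bounds to a capacity request would look like. The helper name rt_limit_to_capacity() and the 1024-based scale are assumptions made for this example; nothing like this exists in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SHIFT 10	/* assumed scale: 1024 == one full CPU */

/*
 * Hypothetical helper: convert an RT bandwidth limit (runtime per period,
 * both in microseconds) into capacity units on a 0..1024 scale.
 */
static unsigned long rt_limit_to_capacity(uint64_t runtime_us,
					  uint64_t period_us)
{
	assert(period_us > 0);

	/* A limit above the period cannot request more than a full CPU. */
	if (runtime_us > period_us)
		runtime_us = period_us;

	return (unsigned long)((runtime_us << CAPACITY_SHIFT) / period_us);
}
```

With the values quoted above, rt_limit_to_capacity(950000, 1000000) gives 972, the root cgroup (800ms/1sec) gives 819 and bg_non_interactive (700ms/1sec) gives 716, all out of 1024. Every one of these would request a frequency close to fmax regardless of how little RT work actually runs, which is the coarseness being pointed out.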



Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-09-01 Thread Peter Zijlstra
On Wed, Aug 31, 2016 at 06:00:02PM +0100, Juri Lelli wrote:
> On 31/08/16 18:40, Peter Zijlstra wrote:
> > Another problem is that we have many semi related knobs; we have the
> > global RT runtime limit knob, but that doesn't affect cpufreq (maybe it
> > should)
> 
> Maybe we could create this sort of link when using the cgroup RT
> throttling interface as well? It should still then fit well once we
> replace the underlying mechanism with DL reservations. And, AFAIK, the
> interface is used by Android folks already.

Tricky, but possible I suppose.

Since minimal cpufreq is 'global', the cgroup reservation only matters
if there are no tasks in any of its parent groups. Computing the
effective rt min then again becomes somewhat tricky, since we'd have to
iterate the cgroup tree.


Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-31 Thread Rafael J. Wysocki
On Wednesday, August 31, 2016 06:40:09 PM Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 06:28:10PM +0200, Thomas Gleixner wrote:
> > > That is the way it's been with cpufreq and many systems (including all
> > > mobile devices) rely on that to not destroy power. RT + variable cpufreq
> > > is not deterministic.
> > > 
> > > Given we don't have good constraints on RT tasks I don't think we should
> > > try to strengthen the semantics there. Folks should either move to DL if
> > > they want determinism *and* not-sucky power, or continue disabling
> > > cpufreq if they are able to do so.
> > 
> > RT deterministic behaviour is all about meeting the deadlines. If your
> > deadline is relaxed enough that you can meet it even with the lowest cpu
> > frequency then it's perfectly fine to enable cpufreq. The same logic applies
> > to C-States.
> > 
> > There are a lot of RT systems out there which enable both. If cpufreq or
> > c-states cause a deadline violation because the constraints of the system are
> > tight, then people will disable it and we need a knob for both.
> > 
> > Realtime is not as fast as possible. It's as fast as specified.
> 
> Sure, problem is of course that RR/FIFO doesn't specify anything so the
> users are left to prod knobs.
> 
> Another problem is that we have many semi related knobs; we have the
> global RT runtime limit knob, but that doesn't affect cpufreq (maybe it
> should) and cpufreq has knobs to set f_min and f_max, which again are
> unaware of RT anything.
> 
> So before we go do anything, I'd like input on what is needed and how
> things should tie together to make most sense.

I totally agree.

We need to know where we want to get to before deciding on which way to go.

Thanks,
Rafael



Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-31 Thread Thomas Gleixner
On Wed, 31 Aug 2016, Steve Muckle wrote:
> On Wed, Aug 31, 2016 at 04:39:07PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 26, 2016 at 11:40:48AM -0700, Steve Muckle wrote:
> > > A policy of going to fmax on any RT activity will be detrimental
> > > for power on many platforms. Often RT accounts for only a small amount
> > > of CPU activity so sending the CPU frequency to fmax is overkill. Worse
> > > still, some platforms may not be able to even complete the CPU frequency
> > > change before the RT activity has already completed.
> > > 
> > > Cpufreq governors have not treated RT activity this way in the past so
> > > it is not part of the expected semantics of the RT scheduling class. The
> > > DL class offers guarantees about task completion and could be used for
> > > this purpose.
> > 
> > Not entirely true. People have simply disabled cpufreq because of this.
> >
> > Yes, RR/FIFO are a pain, but they should still be deterministic, and
> > variable cpufreq destroys that.
> 
> That is the way it's been with cpufreq and many systems (including all
> mobile devices) rely on that to not destroy power. RT + variable cpufreq
> is not deterministic.
> 
> Given we don't have good constraints on RT tasks I don't think we should
> try to strengthen the semantics there. Folks should either move to DL if
> they want determinism *and* not-sucky power, or continue disabling
> cpufreq if they are able to do so.

RT deterministic behaviour is all about meeting the deadlines. If your
deadline is relaxed enough that you can meet it even with the lowest cpu
frequency then it's perfectly fine to enable cpufreq. The same logic applies
to C-States.

There are a lot of RT systems out there which enable both. If cpufreq or
c-states cause a deadline violation because the constraints of the system are
tight, then people will disable it and we need a knob for both.

Realtime is not as fast as possible. It's as fast as specified.
 
Thanks,

tglx
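
The "as fast as specified" argument can be sketched as a simple feasibility check. This is only an illustration under stated assumptions: the function name meets_deadline_at() is hypothetical, execution time is assumed to scale inversely with CPU frequency, and frequency transition latency and C-state exit latency are ignored.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical check: given a worst-case execution time measured at fmax,
 * would the task still meet its deadline when running at f_khz?
 * Assumes execution time scales linearly with fmax / f.
 */
static int meets_deadline_at(uint64_t wcet_us_at_fmax, uint32_t fmax_khz,
			     uint32_t f_khz, uint64_t deadline_us)
{
	uint64_t wcet_us;

	assert(f_khz > 0 && f_khz <= fmax_khz);

	/* Round up so the estimate never understates the execution time. */
	wcet_us = (wcet_us_at_fmax * fmax_khz + f_khz - 1) / f_khz;

	return wcet_us <= deadline_us;
}
```

For example, a task with a 200us worst case at 2 GHz needs about 800us at 500 MHz, so meets_deadline_at(200, 2000000, 500000, 1000) is true and frequency scaling can stay enabled, while the same task with a 700us deadline fails the check and needs a higher floor or a knob to turn scaling off.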


Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-31 Thread Juri Lelli
On 31/08/16 18:40, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 06:28:10PM +0200, Thomas Gleixner wrote:
> > > That is the way it's been with cpufreq and many systems (including all
> > > mobile devices) rely on that to not destroy power. RT + variable cpufreq
> > > is not deterministic.
> > > 
> > > Given we don't have good constraints on RT tasks I don't think we should
> > > try to strengthen the semantics there. Folks should either move to DL if
> > > they want determinism *and* not-sucky power, or continue disabling
> > > cpufreq if they are able to do so.
> > 
> > RT deterministic behaviour is all about meeting the deadlines. If your
> > deadline is relaxed enough that you can meet it even with the lowest cpu
> > frequency then it's perfectly fine to enable cpufreq. The same logic applies
> > to C-States.
> > 
> > There are a lot of RT systems out there which enable both. If cpufreq or
> > c-states cause a deadline violation because the constraints of the system are
> > tight, then people will disable it and we need a knob for both.
> > 
> > Realtime is not as fast as possible. It's as fast as specified.
> 
> Sure, problem is of course that RR/FIFO doesn't specify anything so the
> users are left to prod knobs.
> 
> Another problem is that we have many semi related knobs; we have the
> global RT runtime limit knob, but that doesn't affect cpufreq (maybe it
> should)

Maybe we could create this sort of link when using the cgroup RT
throttling interface as well? It should still then fit well once we
replace the underlying mechanism with DL reservations. And, AFAIK, the
interface is used by Android folks already.

> and cpufreq has knobs to set f_min and f_max, which again are
> unaware of RT anything.
> 
> So before we go do anything, I'd like input on what is needed and how
> things should tie together to make most sense.
> 


Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-31 Thread Peter Zijlstra
On Wed, Aug 31, 2016 at 06:28:10PM +0200, Thomas Gleixner wrote:
> > That is the way it's been with cpufreq and many systems (including all
> > mobile devices) rely on that to not destroy power. RT + variable cpufreq
> > is not deterministic.
> > 
> > Given we don't have good constraints on RT tasks I don't think we should
> > try to strengthen the semantics there. Folks should either move to DL if
> > they want determinism *and* not-sucky power, or continue disabling
> > cpufreq if they are able to do so.
> 
> RT deterministic behaviour is all about meeting the deadlines. If your
> deadline is relaxed enough that you can meet it even with the lowest cpu
> frequency then it's perfectly fine to enable cpufreq. The same logic applies
> to C-States.
> 
> There are a lot of RT systems out there which enable both. If cpufreq or
> c-states cause a deadline violation because the constraints of the system are
> tight, then people will disable it and we need a knob for both.
> 
> Realtime is not as fast as possible. It's as fast as specified.

Sure, problem is of course that RR/FIFO doesn't specify anything so the
users are left to prod knobs.

Another problem is that we have many semi related knobs; we have the
global RT runtime limit knob, but that doesn't affect cpufreq (maybe it
should) and cpufreq has knobs to set f_min and f_max, which again are
unaware of RT anything.

So before we go do anything, I'd like input on what is needed and how
things should tie together to make most sense.


Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-31 Thread Steve Muckle
On Wed, Aug 31, 2016 at 04:39:07PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2016 at 11:40:48AM -0700, Steve Muckle wrote:
> > A policy of going to fmax on any RT activity will be detrimental
> > for power on many platforms. Often RT accounts for only a small amount
> > of CPU activity so sending the CPU frequency to fmax is overkill. Worse
> > still, some platforms may not be able to even complete the CPU frequency
> > change before the RT activity has already completed.
> > 
> > Cpufreq governors have not treated RT activity this way in the past so
> > it is not part of the expected semantics of the RT scheduling class. The
> > DL class offers guarantees about task completion and could be used for
> > this purpose.
> 
> Not entirely true. People have simply disabled cpufreq because of this.
>
> Yes, RR/FIFO are a pain, but they should still be deterministic, and
> variable cpufreq destroys that.

That is the way it's been with cpufreq and many systems (including all
mobile devices) rely on that to not destroy power. RT + variable cpufreq
is not deterministic.

Given we don't have good constraints on RT tasks I don't think we should
try to strengthen the semantics there. Folks should either move to DL if
they want determinism *and* not-sucky power, or continue disabling
cpufreq if they are able to do so.

> I realize that the fmax thing is annoying, but I'm not seeing how rt_avg
> is much better.

Rt_avg is much closer to the current behavior offered by the most
commonly used cpufreq governors since it tracks actual CPU utilization.
Power is not impacted by minimal RT activity and the frequency is raised
if RT activity is high.



Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-31 Thread Steve Muckle
On Wed, Aug 31, 2016 at 03:31:07AM +0200, Rafael J. Wysocki wrote:
> On Friday, August 26, 2016 11:40:48 AM Steve Muckle wrote:
> > A policy of going to fmax on any RT activity will be detrimental
> > for power on many platforms. Often RT accounts for only a small amount
> > of CPU activity so sending the CPU frequency to fmax is overkill. Worse
> > still, some platforms may not be able to even complete the CPU frequency
> > change before the RT activity has already completed.
> > 
> > Cpufreq governors have not treated RT activity this way in the past so
> > it is not part of the expected semantics of the RT scheduling class. The
> > DL class offers guarantees about task completion and could be used for
> > this purpose.
> > 
> > Modify the schedutil algorithm to instead use rt_avg as an estimate of
> > RT utilization of the CPU.
> > 
> > Based on previous work by Vincent Guittot .
> 
> If we do it for RT, why not to do a similar thing for DL?  As in the
> original patch from Peter, for example?

Agreed DL should have a similar change. I think that could be done in a
separate patch. I also would need to discuss it with the deadline sched
devs to fully understand the metric used there.

> 
> > Signed-off-by: Steve Muckle 
> > ---
> >  kernel/sched/cpufreq_schedutil.c | 26 +-
> >  1 file changed, 17 insertions(+), 9 deletions(-)
> > 
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index cb8a77b1ef1b..89094a466250 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -146,13 +146,21 @@ static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util,
> >  
> >  static void sugov_get_util(unsigned long *util, unsigned long *max)
> >  {
> > -   struct rq *rq = this_rq();
> > -   unsigned long cfs_max;
> > +   int cpu = smp_processor_id();
> > +   struct rq *rq = cpu_rq(cpu);
> > +   unsigned long max_cap, rt;
> > +   s64 delta;
> >  
> > -   cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
> > +   max_cap = arch_scale_cpu_capacity(NULL, cpu);
> >  
> > -   *util = min(rq->cfs.avg.util_avg, cfs_max);
> > -   *max = cfs_max;
> > +   delta = rq_clock(rq) - rq->age_stamp;
> > +   if (unlikely(delta < 0))
> > +           delta = 0;
> > +   rt = div64_u64(rq->rt_avg, sched_avg_period() + delta);
> > +   rt = (rt * max_cap) >> SCHED_CAPACITY_SHIFT;
> 
> These computations are rather heavy, so I wonder if they are avoidable based
> on the flags, for example?

Yeah the div is bad. I don't know that we can avoid it based on the
flags because rt_avg will decay during CFS activity and you'd want to
take note of that.

One way to make this a little better is to assume that the divisor,
sched_avg_period() + delta, fits into 32 bits so that div_u64 can be
used, which I believe is less bad. Doing that means placing a
restriction on how large sysctl_sched_time_avg (which determines
sched_avg_period()) can be, a max of 4.2 seconds I think. I don't know
that anyone uses a value that large anyway but there's currently no
limit on it.

Another option would be just adding another separate metric to track rt
activity that is more mathematically favorable to deal with.

Both these seemed potentially heavy handed so I figured I'd just start
with the obvious, if suboptimal, solution...

> Plus is SCHED_CAPACITY_SHIFT actually defined for all architectures?

Yes.

> One more ugly thing is about using rq_clock(rq) directly from here whereas we
> pass it around as the 'time' argument elsewhere.

Sure I'll clean this up.

> 
> > +
> > +   *util = min(rq->cfs.avg.util_avg + rt, max_cap);
> > +   *max = max_cap;
> >  }
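
As a rough userspace model of the arithmetic discussed in this message (not the kernel code itself), the sketch below reproduces the rt estimate from the quoted hunk and the 32-bit divisor check. The names rt_util_estimate() and divisor_fits_u32() are hypothetical, and it assumes that rt_avg accumulates RT runtime in nanoseconds weighted by a 1024-based scale, as the shift in the patch implies.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SHIFT 10	/* stands in for SCHED_CAPACITY_SHIFT */

/*
 * Hypothetical model of the patch arithmetic: divide the accumulated
 * rt_avg by the averaging window (period + time since the last aging
 * step), then scale the result to the CPU's maximum capacity.
 */
static unsigned long rt_util_estimate(uint64_t rt_avg, uint64_t period_ns,
				      int64_t delta_ns, unsigned long max_cap)
{
	uint64_t divisor, rt;

	/* Clamp a negative clock delta, as the patch does. */
	if (delta_ns < 0)
		delta_ns = 0;

	divisor = period_ns + (uint64_t)delta_ns;
	assert(divisor > 0);

	rt = rt_avg / divisor;			/* the 64-by-64 division */
	rt = (rt * max_cap) >> CAPACITY_SHIFT;

	return (unsigned long)rt;
}

/*
 * The cheaper 64-by-32 division is only usable if the divisor fits in
 * 32 bits, i.e. the window is at most 4294967295 ns (about 4.29 s).
 */
static int divisor_fits_u32(uint64_t divisor_ns)
{
	return divisor_ns <= UINT32_MAX;
}
```

With 250ms of RT runtime in a 1s window (rt_avg = 250000000 * 1024) and max_cap = 1024, the estimate is 256, a quarter of the CPU. The 32-bit bound works out to roughly 4.29 seconds, which matches the "4.2 seconds" figure mentioned above.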


Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-31 Thread Peter Zijlstra
On Fri, Aug 26, 2016 at 11:40:48AM -0700, Steve Muckle wrote:
> A policy of going to fmax on any RT activity will be detrimental
> for power on many platforms. Often RT accounts for only a small amount
> of CPU activity so sending the CPU frequency to fmax is overkill. Worse
> still, some platforms may not be able to even complete the CPU frequency
> change before the RT activity has already completed.
> 
> Cpufreq governors have not treated RT activity this way in the past so
> it is not part of the expected semantics of the RT scheduling class. The
> DL class offers guarantees about task completion and could be used for
> this purpose.

Not entirely true. People have simply disabled cpufreq because of this.

Yes, RR/FIFO are a pain, but they should still be deterministic, and
variable cpufreq destroys that.

I realize that the fmax thing is annoying, but I'm not seeing how rt_avg
is much better.


Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-30 Thread Rafael J. Wysocki
On Friday, August 26, 2016 11:40:48 AM Steve Muckle wrote:
> A policy of going to fmax on any RT activity will be detrimental
> for power on many platforms. Often RT accounts for only a small amount
> of CPU activity so sending the CPU frequency to fmax is overkill. Worse
> still, some platforms may not be able to even complete the CPU frequency
> change before the RT activity has already completed.
> 
> Cpufreq governors have not treated RT activity this way in the past so
> it is not part of the expected semantics of the RT scheduling class. The
> DL class offers guarantees about task completion and could be used for
> this purpose.
> 
> Modify the schedutil algorithm to instead use rt_avg as an estimate of
> RT utilization of the CPU.
> 
> Based on previous work by Vincent Guittot .

If we do it for RT, why not to do a similar thing for DL?  As in the
original patch from Peter, for example?

> Signed-off-by: Steve Muckle 
> ---
>  kernel/sched/cpufreq_schedutil.c | 26 +-
>  1 file changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index cb8a77b1ef1b..89094a466250 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -146,13 +146,21 @@ static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util,
>  
>  static void sugov_get_util(unsigned long *util, unsigned long *max)
>  {
> - struct rq *rq = this_rq();
> - unsigned long cfs_max;
> + int cpu = smp_processor_id();
> + struct rq *rq = cpu_rq(cpu);
> + unsigned long max_cap, rt;
> + s64 delta;
>  
> - cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
> + max_cap = arch_scale_cpu_capacity(NULL, cpu);
>  
> - *util = min(rq->cfs.avg.util_avg, cfs_max);
> - *max = cfs_max;
> + delta = rq_clock(rq) - rq->age_stamp;
> + if (unlikely(delta < 0))
> +         delta = 0;
> + rt = div64_u64(rq->rt_avg, sched_avg_period() + delta);
> + rt = (rt * max_cap) >> SCHED_CAPACITY_SHIFT;

These computations are rather heavy, so I wonder if they are avoidable based
on the flags, for example?

Plus is SCHED_CAPACITY_SHIFT actually defined for all architectures?

One more ugly thing is about using rq_clock(rq) directly from here whereas we
pass it around as the 'time' argument elsewhere.

> +
> + *util = min(rq->cfs.avg.util_avg + rt, max_cap);
> + *max = max_cap;
>  }

Thanks,
Rafael