Re: [PATCH] sched/fair: schedutil: update only with all info available
On Thu, Apr 26, 2018 at 12:15:33PM +0100, Patrick Bellasi wrote:

> > Yes, these patches predate those, but indeed, now that we age the
> > blocked load consistently it should no longer be required.
>
> After this discussion, I think there is a general consensus about
> always adding sg_cpu->util_cfs in cpufreq_schedutil.c::sugov_aggregate_util.
>
> Is that right?

Yes, I think so. I've been waiting to see the problem with the blocked
load aging patches sorted though, because if we had to revert those we
would be back to needing the current stuff again. Luckily it appears
Vincent found something there, so fingers crossed.

> For the rest, what this patch proposes is a code reorganization which
> is no longer required to fix this specific issue but is still
> required to fix the other issue reported by Vincent: i.e. util_est is
> not updated before schedutil.
>
> Thus, I would propose to still keep this refactoring, but in the
> context of a different patch that specifically fixes the util_est case.
>
> If there are no major complaints, I'll post a new series in the next
> few days.

Fair enough..
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11-Apr 17:37, Peter Zijlstra wrote:
> On Wed, Apr 11, 2018 at 05:29:01PM +0200, Vincent Guittot wrote:
> > On 11 April 2018 at 17:14, Peter Zijlstra wrote:
> > > On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
> > > > On 09-Apr 10:51, Vincent Guittot wrote:
> > > >
> > > > > Peter,
> > > > > what was your goal with adding the condition "if
> > > > > (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
> > > >
> > > > The original intent was to get rid of the sched class flags, used
> > > > to track which class has tasks runnable, from within schedutil.
> > > > The reason was to solve some misalignment between scheduler class
> > > > status and schedutil status.
> > > >
> > > > The solution, initially suggested by Viresh, and finally proposed
> > > > by Peter, was to exploit RQ knowledge directly from within
> > > > schedutil.
> > > >
> > > > The problem is that schedutil updates now depend on two pieces of
> > > > information: utilization changes and the number of RT and CFS
> > > > runnable tasks.
> > > >
> > > > Thus, using cfs_rq::h_nr_running is not the problem... it's
> > > > actually part of a much cleaner solution than the code we used
> > > > to have.
> > >
> > > Specifically, the h_nr_running test was to get rid of
> > >
> > >     if (delta_ns > TICK_NSEC) {
> > >             j_sg_cpu->iowait_boost = 0;
> > >             j_sg_cpu->iowait_boost_pending = false;
> > > -           j_sg_cpu->util_cfs = 0;
> > >
> > >     ^^^ that..
> > >
> > > -           if (j_sg_cpu->util_dl == 0)
> > > -                   continue;
> > >     }
> > >
> > > because that felt rather arbitrary.
> >
> > yes I agree.
> >
> > With the patch that updates blocked idle load, we should not have the
> > problem of blocked utilization anymore, and can get rid of the code
> > above and the h_nr_running test.
>
> Yes, these patches predate those, but indeed, now that we age the
> blocked load consistently it should no longer be required.

After this discussion, I think there is a general consensus about
always adding sg_cpu->util_cfs in cpufreq_schedutil.c::sugov_aggregate_util.

Is that right?

For the rest, what this patch proposes is a code reorganization which
is no longer required to fix this specific issue but is still required
to fix the other issue reported by Vincent: i.e. util_est is not
updated before schedutil.

Thus, I would propose to still keep this refactoring, but in the
context of a different patch that specifically fixes the util_est case.

If there are no major complaints, I'll post a new series in the next
few days.

--
#include

Patrick Bellasi
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Thu, Apr 12, 2018 at 12:01 AM, Vincent Guittot wrote:
>>
>> Also that aside, the "running util" is what was used to drive the CFS
>> util before Peter's patch (8f111bc357a). That was accounting the
>> blocked and decaying utilization, but that patch changed the behavior.
>> It seems logical we should just use that and not check h_nr_running
>> for CFS, so we don't miss out on the decayed utilization. What is the
>> use of checking h_nr_running or the state of the runqueue for CFS? I
>> am sure to be missing something here. :-(
>
> As Peter mentioned, the change in commit (8f111bc357a) was to remove
> the test that was arbitrarily removing the util_avg of a cpu that had
> not been updated since a tick.
>
> But with the update of blocked idle load, we don't need to handle the
> case of stalled load/utilization.

Thanks a lot for the clarification. It makes sense now.

- Joel
Re: [PATCH] sched/fair: schedutil: update only with all info available
Hi Joel,

On 11 April 2018 at 23:34, Joel Fernandes wrote:
> Hi Vincent,
>
> On Wed, Apr 11, 2018 at 4:56 AM, Vincent Guittot wrote:
>> On 11 April 2018 at 12:15, Patrick Bellasi wrote:
>>> On 11-Apr 08:57, Vincent Guittot wrote:
>>>> On 10 April 2018 at 13:04, Patrick Bellasi wrote:
>>>>> On 09-Apr 10:51, Vincent Guittot wrote:
>>>>>> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
>>>>>> Peter,
>>>>>> what was your goal with adding the condition "if
>>>>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>>>>
>>>>> The original intent was to get rid of the sched class flags, used
>>>>> to track which class has tasks runnable, from within schedutil.
>>>>> The reason was to solve some misalignment between scheduler class
>>>>> status and schedutil status.
>>>>
>>>> This was mainly for RT tasks, but it was not the case for cfs tasks
>>>> before commit 8f111bc357aa.
>>>
>>> True, but with his solution Peter has actually come up with a
>>> unified interface which is now (and can be, IMO) based just on
>>> RUNNABLE counters for each class.
>>
>> But do we really want to only take care of runnable counters for all
>> classes?
>>
>>>>> The solution, initially suggested by Viresh, and finally proposed
>>>>> by Peter, was to exploit RQ knowledge directly from within
>>>>> schedutil.
>>>>>
>>>>> The problem is that schedutil updates now depend on two pieces of
>>>>> information: utilization changes and the number of RT and CFS
>>>>> runnable tasks.
>>>>>
>>>>> Thus, using cfs_rq::h_nr_running is not the problem... it's
>>>>> actually part of a much cleaner solution than the code we used to
>>>>> have.
>>>>
>>>> So there are 2 problems there:
>>>> - using cfs_rq::h_nr_running when aggregating cfs utilization,
>>>>   which generates a lot of frequency drops
>>>
>>> You mean because we now completely disregard the blocked utilization
>>> when a CPU is idle, right?
>>
>> yes
>>
>>> Given how PELT works and the recent support for updating IDLE CPUs,
>>> we should probably always add contributions for the CFS class.
>>>
>>>> - making sure that the nr-running are up-to-date when used in
>>>>   sched_util
>>>
>>> Right... but, if we always add the cfs_rq (to always account for
>>> blocked utilization), we no longer have this last dependency, do we?
>>
>> yes
>>
>>> We still have to account for the util_est dependency.
>>>
>>> Should I add a patch to this series to disregard
>>> cfs_rq::h_nr_running from schedutil, as you suggested?
>>
>> It's probably better to have a separate patch, as these are 2
>> different topics:
>> - when to update cfs_rq::h_nr_running and when to call
>>   cpufreq_update_util
>> - should we use runnable or running utilization for CFS
>
> By runnable you don't mean sched_avg::load_avg, right? I got a bit
> confused, since runnable means load_avg and running means util_avg.

Sorry for the confusion. By runnable utilization, I meant taking into
account the number of runnable tasks (cfs_rq::h_nr_running), like what
is done by commit (8f111bc357a).

> But I didn't follow here why we are talking about load_avg for
> schedutil at all. I am guessing by "runnable" you mean h_nr_running !=
> 0.

yes

> Also that aside, the "running util" is what was used to drive the CFS
> util before Peter's patch (8f111bc357a). That was accounting the
> blocked and decaying utilization, but that patch changed the behavior.
> It seems logical we should just use that and not check h_nr_running
> for CFS, so we don't miss out on the decayed utilization. What is the
> use of checking h_nr_running or the state of the runqueue for CFS? I
> am sure to be missing something here. :-(

As Peter mentioned, the change in commit (8f111bc357a) was to remove
the test that was arbitrarily removing the util_avg of a cpu that had
not been updated since a tick.

But with the update of blocked idle load, we don't need to handle the
case of stalled load/utilization.

> thanks!
>
> - Joel
Re: [PATCH] sched/fair: schedutil: update only with all info available
Hi Vincent,

On Wed, Apr 11, 2018 at 4:56 AM, Vincent Guittot wrote:
> On 11 April 2018 at 12:15, Patrick Bellasi wrote:
>> On 11-Apr 08:57, Vincent Guittot wrote:
>>> On 10 April 2018 at 13:04, Patrick Bellasi wrote:
>>>> On 09-Apr 10:51, Vincent Guittot wrote:
>>>>> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
>>>>> Peter,
>>>>> what was your goal with adding the condition "if
>>>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>>>
>>>> The original intent was to get rid of the sched class flags, used
>>>> to track which class has tasks runnable, from within schedutil.
>>>> The reason was to solve some misalignment between scheduler class
>>>> status and schedutil status.
>>>
>>> This was mainly for RT tasks, but it was not the case for cfs tasks
>>> before commit 8f111bc357aa.
>>
>> True, but with his solution Peter has actually come up with a
>> unified interface which is now (and can be, IMO) based just on
>> RUNNABLE counters for each class.
>
> But do we really want to only take care of runnable counters for all
> classes?
>
>>>> The solution, initially suggested by Viresh, and finally proposed
>>>> by Peter, was to exploit RQ knowledge directly from within
>>>> schedutil.
>>>>
>>>> The problem is that schedutil updates now depend on two pieces of
>>>> information: utilization changes and the number of RT and CFS
>>>> runnable tasks.
>>>>
>>>> Thus, using cfs_rq::h_nr_running is not the problem... it's
>>>> actually part of a much cleaner solution than the code we used to
>>>> have.
>>>
>>> So there are 2 problems there:
>>> - using cfs_rq::h_nr_running when aggregating cfs utilization,
>>>   which generates a lot of frequency drops
>>
>> You mean because we now completely disregard the blocked utilization
>> when a CPU is idle, right?
>
> yes
>
>> Given how PELT works and the recent support for updating IDLE CPUs,
>> we should probably always add contributions for the CFS class.
>>
>>> - making sure that the nr-running are up-to-date when used in
>>>   sched_util
>>
>> Right... but, if we always add the cfs_rq (to always account for
>> blocked utilization), we no longer have this last dependency, do we?
>
> yes
>
>> We still have to account for the util_est dependency.
>>
>> Should I add a patch to this series to disregard
>> cfs_rq::h_nr_running from schedutil as you suggested?
>
> It's probably better to have a separate patch, as these are 2
> different topics:
> - when to update cfs_rq::h_nr_running and when to call
>   cpufreq_update_util
> - should we use runnable or running utilization for CFS

By runnable you don't mean sched_avg::load_avg, right? I got a bit
confused, since runnable means load_avg and running means util_avg.

But I didn't follow here why we are talking about load_avg for
schedutil at all. I am guessing by "runnable" you mean h_nr_running !=
0.

Also that aside, the "running util" is what was used to drive the CFS
util before Peter's patch (8f111bc357a). That was accounting the
blocked and decaying utilization, but that patch changed the behavior.
It seems logical we should just use that and not check h_nr_running
for CFS, so we don't miss out on the decayed utilization. What is the
use of checking h_nr_running or the state of the runqueue for CFS? I
am sure to be missing something here. :-(

thanks!

- Joel
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11 April 2018 at 18:15, Peter Zijlstra wrote:
> On Wed, Apr 11, 2018 at 06:10:47PM +0200, Vincent Guittot wrote:
>>> Could it be that for some reason the nohz balancer now takes a very
>>> long time to run?
>>
>> Heiner mentions that it was a relatively slow celeron and he uses the
>> ondemand governor. So I was about to ask him to use the performance
>> governor, to see if it can be because the cpu runs slow and takes too
>> much time to enter idle.
>
> Maybe also hand him a patch with some trace_printk()s in and ask him to
> send /debug/tracing/trace after it happens or somesuch. Maybe we can
> find a clue that way.

yes
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Wed, Apr 11, 2018 at 06:10:47PM +0200, Vincent Guittot wrote:
> > Could it be that for some reason the nohz balancer now takes a very
> > long time to run?
>
> Heiner mentions that it was a relatively slow celeron and he uses the
> ondemand governor. So I was about to ask him to use the performance
> governor, to see if it can be because the cpu runs slow and takes too
> much time to enter idle.

Maybe also hand him a patch with some trace_printk()s in and ask him to
send /debug/tracing/trace after it happens or somesuch. Maybe we can
find a clue that way.
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11 April 2018 at 18:00, Peter Zijlstra wrote:
> On Wed, Apr 11, 2018 at 05:41:24PM +0200, Vincent Guittot wrote:
>> Yes, and to be honest I don't have any clue about the root cause :-(
>> Heiner mentioned that it's much better in latest linux-next, but I
>> haven't seen any changes related to the code of those patches.
>
> Yeah, it's a bit of a puzzle. Now you touch nohz, and the patches in
> next that are most likely to have affected this are rjw's
> cpuidle-vs-nohz patches. The common denominator being nohz.
>
> Now I think rjw's patches will ensure we enter nohz _less_; they avoid
> stopping the tick when we expect to go idle for only a short period.
>
> So if your patch makes nohz go wobbly, going nohz less will make that
> better.
>
> Of course, I've no actual clue as to what that patch (it's the last one
> in the series, right?:
>
>   31e77c93e432 ("sched/fair: Update blocked load when newly idle")
>
> ) does that is so offensive to that one machine. You never did manage
> to reproduce it, right?

yes

> Could it be that for some reason the nohz balancer now takes a very
> long time to run?

Heiner mentions that it was a relatively slow celeron and he uses the
ondemand governor. So I was about to ask him to use the performance
governor, to see if it can be because the cpu runs slow and takes too
much time to enter idle.

> Could something like the following happen (and this is really flaky
> thinking here):
>
> the last CPU goes idle, we enter idle_balance(), that kicks the ilb,
> the ilb runs, which somehow again triggers idle_balance, and around
> we go?
>
> I'm not immediately seeing how that could happen, but if we do
> something daft like that we can tie up the CPU for a while, mostly
> with IRQs disabled, and that would be visible as the latency he sees.
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Wed, Apr 11, 2018 at 05:41:24PM +0200, Vincent Guittot wrote:
> Yes, and to be honest I don't have any clue about the root cause :-(
> Heiner mentioned that it's much better in latest linux-next, but I
> haven't seen any changes related to the code of those patches.

Yeah, it's a bit of a puzzle. Now you touch nohz, and the patches in
next that are most likely to have affected this are rjw's
cpuidle-vs-nohz patches. The common denominator being nohz.

Now I think rjw's patches will ensure we enter nohz _less_; they avoid
stopping the tick when we expect to go idle for only a short period.

So if your patch makes nohz go wobbly, going nohz less will make that
better.

Of course, I've no actual clue as to what that patch (it's the last one
in the series, right?:

  31e77c93e432 ("sched/fair: Update blocked load when newly idle")

) does that is so offensive to that one machine. You never did manage
to reproduce it, right?

Could it be that for some reason the nohz balancer now takes a very
long time to run?

Could something like the following happen (and this is really flaky
thinking here):

the last CPU goes idle, we enter idle_balance(), that kicks the ilb,
the ilb runs, which somehow again triggers idle_balance, and around we
go?

I'm not immediately seeing how that could happen, but if we do
something daft like that we can tie up the CPU for a while, mostly with
IRQs disabled, and that would be visible as the latency he sees.
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11 April 2018 at 17:37, Peter Zijlstra wrote:
> On Wed, Apr 11, 2018 at 05:29:01PM +0200, Vincent Guittot wrote:
>> On 11 April 2018 at 17:14, Peter Zijlstra wrote:
>>> On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
>>>> On 09-Apr 10:51, Vincent Guittot wrote:
>>>
>>>>> Peter,
>>>>> what was your goal with adding the condition "if
>>>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>>>
>>>> The original intent was to get rid of the sched class flags, used
>>>> to track which class has tasks runnable, from within schedutil.
>>>> The reason was to solve some misalignment between scheduler class
>>>> status and schedutil status.
>>>>
>>>> The solution, initially suggested by Viresh, and finally proposed
>>>> by Peter, was to exploit RQ knowledge directly from within
>>>> schedutil.
>>>>
>>>> The problem is that schedutil updates now depend on two pieces of
>>>> information: utilization changes and the number of RT and CFS
>>>> runnable tasks.
>>>>
>>>> Thus, using cfs_rq::h_nr_running is not the problem... it's
>>>> actually part of a much cleaner solution than the code we used to
>>>> have.
>>>>
>>>> The problem, IMO, is that we now depend on other information which
>>>> needs to be in sync before calling schedutil... and the patch I
>>>> proposed is meant to make it less likely that all the required
>>>> information is misaligned (also in the future).
>>>
>>> Specifically, the h_nr_running test was to get rid of
>>>
>>>     if (delta_ns > TICK_NSEC) {
>>>             j_sg_cpu->iowait_boost = 0;
>>>             j_sg_cpu->iowait_boost_pending = false;
>>> -           j_sg_cpu->util_cfs = 0;
>>>
>>>     ^^^ that..
>>>
>>> -           if (j_sg_cpu->util_dl == 0)
>>> -                   continue;
>>>     }
>>>
>>> because that felt rather arbitrary.
>>
>> yes I agree.
>>
>> With the patch that updates blocked idle load, we should not have the
>> problem of blocked utilization anymore, and can get rid of the code
>> above and the h_nr_running test.
>
> Yes, these patches predate those, but indeed, now that we age the
> blocked load consistently it should no longer be required.
>
> Of course, you still have that weird regression report against those
> patches... :-)

Yes, and to be honest I don't have any clue about the root cause :-(
Heiner mentioned that it's much better in latest linux-next, but I
haven't seen any changes related to the code of those patches.
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Wed, Apr 11, 2018 at 05:29:01PM +0200, Vincent Guittot wrote:
> On 11 April 2018 at 17:14, Peter Zijlstra wrote:
>> On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
>>> On 09-Apr 10:51, Vincent Guittot wrote:
>>
>>>> Peter,
>>>> what was your goal with adding the condition "if
>>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>>
>>> The original intent was to get rid of the sched class flags, used to
>>> track which class has tasks runnable, from within schedutil. The
>>> reason was to solve some misalignment between scheduler class status
>>> and schedutil status.
>>>
>>> The solution, initially suggested by Viresh, and finally proposed by
>>> Peter, was to exploit RQ knowledge directly from within schedutil.
>>>
>>> The problem is that schedutil updates now depend on two pieces of
>>> information: utilization changes and the number of RT and CFS
>>> runnable tasks.
>>>
>>> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
>>> part of a much cleaner solution than the code we used to have.
>>>
>>> The problem, IMO, is that we now depend on other information which
>>> needs to be in sync before calling schedutil... and the patch I
>>> proposed is meant to make it less likely that all the required
>>> information is misaligned (also in the future).
>>
>> Specifically, the h_nr_running test was to get rid of
>>
>>     if (delta_ns > TICK_NSEC) {
>>             j_sg_cpu->iowait_boost = 0;
>>             j_sg_cpu->iowait_boost_pending = false;
>> -           j_sg_cpu->util_cfs = 0;
>>
>>     ^^^ that..
>>
>> -           if (j_sg_cpu->util_dl == 0)
>> -                   continue;
>>     }
>>
>> because that felt rather arbitrary.
>
> yes I agree.
>
> With the patch that updates blocked idle load, we should not have the
> problem of blocked utilization anymore, and can get rid of the code
> above and the h_nr_running test.

Yes, these patches predate those, but indeed, now that we age the
blocked load consistently it should no longer be required.

Of course, you still have that weird regression report against those
patches... :-)
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11 April 2018 at 17:14, Peter Zijlstra wrote:
> On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
>> On 09-Apr 10:51, Vincent Guittot wrote:
>
>>> Peter,
>>> what was your goal with adding the condition "if
>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>
>> The original intent was to get rid of the sched class flags, used to
>> track which class has tasks runnable, from within schedutil. The
>> reason was to solve some misalignment between scheduler class status
>> and schedutil status.
>>
>> The solution, initially suggested by Viresh, and finally proposed by
>> Peter, was to exploit RQ knowledge directly from within schedutil.
>>
>> The problem is that schedutil updates now depend on two pieces of
>> information: utilization changes and the number of RT and CFS
>> runnable tasks.
>>
>> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
>> part of a much cleaner solution than the code we used to have.
>>
>> The problem, IMO, is that we now depend on other information which
>> needs to be in sync before calling schedutil... and the patch I
>> proposed is meant to make it less likely that all the required
>> information is misaligned (also in the future).
>
> Specifically, the h_nr_running test was to get rid of
>
>     if (delta_ns > TICK_NSEC) {
>             j_sg_cpu->iowait_boost = 0;
>             j_sg_cpu->iowait_boost_pending = false;
> -           j_sg_cpu->util_cfs = 0;
>
>     ^^^ that..
>
> -           if (j_sg_cpu->util_dl == 0)
> -                   continue;
>     }
>
> because that felt rather arbitrary.

yes I agree.

With the patch that updates blocked idle load, we should not have the
problem of blocked utilization anymore, and can get rid of the code
above and the h_nr_running test.
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
> On 09-Apr 10:51, Vincent Guittot wrote:
>> Peter,
>> what was your goal with adding the condition "if
>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>
> The original intent was to get rid of the sched class flags, used to
> track which class has tasks runnable, from within schedutil. The
> reason was to solve some misalignment between scheduler class status
> and schedutil status.
>
> The solution, initially suggested by Viresh, and finally proposed by
> Peter, was to exploit RQ knowledge directly from within schedutil.
>
> The problem is that schedutil updates now depend on two pieces of
> information: utilization changes and the number of RT and CFS runnable
> tasks.
>
> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
> part of a much cleaner solution than the code we used to have.
>
> The problem, IMO, is that we now depend on other information which
> needs to be in sync before calling schedutil... and the patch I
> proposed is meant to make it less likely that all the required
> information is misaligned (also in the future).

Specifically, the h_nr_running test was to get rid of

    if (delta_ns > TICK_NSEC) {
            j_sg_cpu->iowait_boost = 0;
            j_sg_cpu->iowait_boost_pending = false;
-           j_sg_cpu->util_cfs = 0;

    ^^^ that..

-           if (j_sg_cpu->util_dl == 0)
-                   continue;
    }

because that felt rather arbitrary.
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Fri, Apr 06, 2018 at 06:28:35PM +0100, Patrick Bellasi wrote:
> is maintained although there are no actual usages so far in mainline
> for this hint... do we really need it?

There were intel_pstate patches that I expected to have shown up
already, and I meant to have a look at sugov, except I got side-tracked
again :/
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11-Apr 13:56, Vincent Guittot wrote:
> On 11 April 2018 at 12:15, Patrick Bellasi wrote:
>> On 11-Apr 08:57, Vincent Guittot wrote:
>>> On 10 April 2018 at 13:04, Patrick Bellasi wrote:
>>>> On 09-Apr 10:51, Vincent Guittot wrote:
>>>>> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
>>>>> Peter,
>>>>> what was your goal with adding the condition "if
>>>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>>>
>>>> The original intent was to get rid of the sched class flags, used
>>>> to track which class has tasks runnable, from within schedutil.
>>>> The reason was to solve some misalignment between scheduler class
>>>> status and schedutil status.
>>>
>>> This was mainly for RT tasks, but it was not the case for cfs tasks
>>> before commit 8f111bc357aa.
>>
>> True, but with his solution Peter has actually come up with a
>> unified interface which is now (and can be, IMO) based just on
>> RUNNABLE counters for each class.
>
> But do we really want to only take care of runnable counters for all
> classes?

Perhaps. Once we have PELT RT support with your patches, we can
consider blocked utilization also for those tasks...

However, we can also argue that a policy where we trigger updates based
on RUNNABLE counters, and it's then up to the schedutil policy to
decide for how long to ignore a frequency drop, using a step-down
holding timer similar to what we already have, can also be a possible
solution.

I also kind-of see a possible interesting per-task tuning of such a
policy. Meaning that, for example, for certain tasks we want to use a
longer down-scale throttling time, which can instead be shorter if only
"background" tasks are currently active.

>>>> The solution, initially suggested by Viresh, and finally proposed
>>>> by Peter, was to exploit RQ knowledge directly from within
>>>> schedutil.
>>>>
>>>> The problem is that schedutil updates now depend on two pieces of
>>>> information: utilization changes and the number of RT and CFS
>>>> runnable tasks.
>>>>
>>>> Thus, using cfs_rq::h_nr_running is not the problem... it's
>>>> actually part of a much cleaner solution than the code we used to
>>>> have.
>>>
>>> So there are 2 problems there:
>>> - using cfs_rq::h_nr_running when aggregating cfs utilization,
>>>   which generates a lot of frequency drops
>>
>> You mean because we now completely disregard the blocked utilization
>> when a CPU is idle, right?
>
> yes
>
>> Given how PELT works and the recent support for updating IDLE CPUs,
>> we should probably always add contributions for the CFS class.
>>
>>> - making sure that the nr-running are up-to-date when used in
>>>   sched_util
>>
>> Right... but, if we always add the cfs_rq (to always account for
>> blocked utilization), we no longer have this last dependency, do we?
>
> yes
>
>> We still have to account for the util_est dependency.
>>
>> Should I add a patch to this series to disregard
>> cfs_rq::h_nr_running from schedutil as you suggested?
>
> It's probably better to have a separate patch, as these are 2
> different topics:
> - when to update cfs_rq::h_nr_running and when to call
>   cpufreq_update_util
> - should we use runnable or running utilization for CFS

Yes, well... since OSPM is just next week, we can also have a better
discussion there and decide by then.

What is true so far is that using RUNNABLE is a change with respect to
the previous behavior which unfortunately went unnoticed so far.

--
#include

Patrick Bellasi
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11 April 2018 at 12:15, Patrick Bellasi wrote:
> On 11-Apr 08:57, Vincent Guittot wrote:
>> On 10 April 2018 at 13:04, Patrick Bellasi wrote:
>>> On 09-Apr 10:51, Vincent Guittot wrote:
>>>> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
>>>> Peter,
>>>> what was your goal with adding the condition "if
>>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>>
>>> The original intent was to get rid of the sched class flags, used to
>>> track which class has tasks runnable, from within schedutil. The
>>> reason was to solve some misalignment between scheduler class status
>>> and schedutil status.
>>
>> This was mainly for RT tasks, but it was not the case for cfs tasks
>> before commit 8f111bc357aa.
>
> True, but with his solution Peter has actually come up with a unified
> interface which is now (and can be, IMO) based just on RUNNABLE
> counters for each class.

But do we really want to only take care of runnable counters for all
classes?

>>> The solution, initially suggested by Viresh, and finally proposed by
>>> Peter, was to exploit RQ knowledge directly from within schedutil.
>>>
>>> The problem is that schedutil updates now depend on two pieces of
>>> information: utilization changes and the number of RT and CFS
>>> runnable tasks.
>>>
>>> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
>>> part of a much cleaner solution than the code we used to have.
>>
>> So there are 2 problems there:
>> - using cfs_rq::h_nr_running when aggregating cfs utilization, which
>>   generates a lot of frequency drops
>
> You mean because we now completely disregard the blocked utilization
> when a CPU is idle, right?

yes

> Given how PELT works and the recent support for updating IDLE CPUs, we
> should probably always add contributions for the CFS class.
>
>> - making sure that the nr-running are up-to-date when used in
>>   sched_util
>
> Right... but, if we always add the cfs_rq (to always account for
> blocked utilization), we no longer have this last dependency, do we?

yes

> We still have to account for the util_est dependency.
>
> Should I add a patch to this series to disregard cfs_rq::h_nr_running
> from schedutil as you suggested?

It's probably better to have a separate patch, as these are 2 different
topics:
- when to update cfs_rq::h_nr_running and when to call
  cpufreq_update_util
- should we use runnable or running utilization for CFS

Vincent

>>> The problem, IMO, is that we now depend on other information which
>>> needs to be in sync before calling schedutil... and the patch I
>>> proposed is meant to make it less likely that all the required
>>> information is misaligned (also in the future).
>>>
>>> --
>>> #include
>>>
>>> Patrick Bellasi
>
> --
> #include
>
> Patrick Bellasi
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11-Apr 08:57, Vincent Guittot wrote: > On 10 April 2018 at 13:04, Patrick Bellasi wrote: > > On 09-Apr 10:51, Vincent Guittot wrote: > >> On 6 April 2018 at 19:28, Patrick Bellasi wrote: > >> Peter, > >> what was your goal with adding the condition "if > >> (rq->cfs.h_nr_running)" for the aggragation of CFS utilization > > > > The original intent was to get rid of sched class flags, used to track > > which class has tasks runnable from within schedutil. The reason was > > to solve some misalignment between scheduler class status and > > schedutil status. > > This was mainly for RT tasks but it was not the case for cfs task > before commit 8f111bc357aa True, but with his solution Peter has actually come up with a unified interface which is now (and can be IMO) based just on RUNNABLE counters for each class. > > The solution, initially suggested by Viresh, and finally proposed by > > Peter was to exploit RQ knowledges directly from within schedutil. > > > > The problem is that now schedutil updated depends on two information: > > utilization changes and number of RT and CFS runnable tasks. > > > > Thus, using cfs_rq::h_nr_running is not the problem... it's actually > > part of a much more clean solution of the code we used to have. > > So there are 2 problems there: > - using cfs_rq::h_nr_running when aggregating cfs utilization which > generates a lot of frequency drop You mean because we now completely disregard the blocked utilization where a CPU is idle, right? Given how PELT works and the recent support for IDLE CPUs updated, we should probably always add contributions for the CFS class. > - making sure that the nr-running are up-to-date when used in sched_util Right... but, if we always add the cfs_rq (to always account for blocked utilization), we don't have anymore this last dependency, isn't it? We still have to account for the util_est dependency. Should I add a patch to this series to disregard cfs_rq::h_nr_running from schedutil as you suggested? 
> > The problem, IMO, is that we now depend on other information which needs to be in sync before calling schedutil... and the patch I proposed is meant to make it less likely that all the required information is not aligned (also in the future).

--
Patrick Bellasi
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11-Apr 09:57, Vincent Guittot wrote:
> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
> > 	}
> > @@ -5454,8 +5441,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  		update_cfs_group(se);
> >  	}
> >
> > -	if (!se)
> > +	/* The task is no longer visible from the root cfs_rq */
> > +	if (!se) {
> >  		sub_nr_running(rq, 1);
> > +		cpufreq_update_util(rq, 0);
>
> call to cpufreq_update_util() should be done after util_est_dequeue()

Yeah... good point, looks like I should have noticed it :)

That's another compelling example of why updating schedutil as a side effect of update_load_avg() does not make it easy to track when it's the best time to trigger an update.

util_est is now a factor which can impact OPP selection for CFS tasks, and thus we should trigger the update at the very end of the function.

> > +	}
> >
> >  	util_est_dequeue(&rq->cfs, p, task_sleep);
> >  	hrtick_update(rq);

--
Patrick Bellasi
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 6 April 2018 at 19:28, Patrick Bellasi wrote:
> 	}
> @@ -5454,8 +5441,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		update_cfs_group(se);
>  	}
>
> -	if (!se)
> +	/* The task is no longer visible from the root cfs_rq */
> +	if (!se) {
>  		sub_nr_running(rq, 1);
> +		cpufreq_update_util(rq, 0);

call to cpufreq_update_util() should be done after util_est_dequeue()

> +	}
>
>  	util_est_dequeue(&rq->cfs, p, task_sleep);
>  	hrtick_update(rq);
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 10 April 2018 at 13:04, Patrick Bellasi wrote:
> Hi Vincent,
>
> On 09-Apr 10:51, Vincent Guittot wrote:
>> Hi Patrick
>>
>> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
>> > Schedutil is not properly updated when the first FAIR task wakes up on a CPU and when a RQ is (un)throttled. This is mainly due to the current integration strategy, which relies on updates being triggered implicitly each time a cfs_rq's utilization is updated.
>> >
>> > Those updates are currently provided (mainly) via cfs_rq_util_change() which is used in:
>> > - update_cfs_rq_load_avg(), when the utilization of a cfs_rq is updated
>> > - {attach,detach}_entity_load_avg()
>> > This is done based on the idea that "we should callback schedutil frequently enough" to properly update the CPU frequency at every utilization change.
>> >
>> > Since this recent schedutil update:
>> >
>> > commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
>> >
>> > we use additional RQ information to properly account for FAIR tasks utilization. Specifically, cfs_rq::h_nr_running has to be non-zero in sugov_aggregate_util() to sum up the cfs_rq's utilization.
>>
>> Isn't the use of cfs_rq::h_nr_running the root cause of the problem?
>
> Not really...
>
>> I can now see a lot of frequency changes on my hikey with this new condition in sugov_aggregate_util(). With an rt-app use case that creates a periodic cfs task, I have a lot of frequency changes instead of staying at the same frequency
>
> I don't remember a similar behavior... but I'll check better.

I have discovered this behavior quite recently while preparing OSPM

>> Peter, what was your goal with adding the condition "if (rq->cfs.h_nr_running)" for the aggregation of CFS utilization
>
> The original intent was to get rid of sched class flags, used to track which class has tasks runnable from within schedutil.
> The reason was to solve some misalignment between scheduler class status and schedutil status.

This was mainly for RT tasks, but it was not the case for CFS tasks before commit 8f111bc357aa

> The solution, initially suggested by Viresh, and finally proposed by Peter, was to exploit RQ knowledge directly from within schedutil.
>
> The problem is that schedutil updates now depend on two pieces of information: utilization changes and the number of RT and CFS runnable tasks.
>
> Thus, using cfs_rq::h_nr_running is not the problem... it's actually part of a much cleaner solution than the code we used to have.

So there are 2 problems there:
- using cfs_rq::h_nr_running when aggregating cfs utilization, which generates a lot of frequency drops
- making sure that the nr_running counters are up-to-date when used in schedutil

> The problem, IMO, is that we now depend on other information which needs to be in sync before calling schedutil... and the patch I proposed is meant to make it less likely that all the required information is not aligned (also in the future).
Re: [PATCH] sched/fair: schedutil: update only with all info available
Hi Joel,

On 06-Apr 16:48, Joel Fernandes wrote:
> On Fri, Apr 6, 2018 at 10:28 AM, Patrick Bellasi wrote:
> > Schedutil is not properly updated when the first FAIR task wakes up on a CPU and when a RQ is (un)throttled. This is mainly due to the current integration strategy, which relies on updates being triggered implicitly each time a cfs_rq's utilization is updated.
>
> To me that's not implicit, but is explicitly done when the util is updated. It seems that's the logical place..

I'm not arguing that the place is not logical, it makes perfect sense. IMO it just makes it more difficult to track dependencies like the one we have now: when the root cfs_rq utilization is updated, h_nr_running is not always aligned.

Moreover, when task groups are in use, many times we call a wrapper function which just bails out, since we are updating a non-root cfs_rq. Other reasons have also been detailed in the changelog.

I've noticed that we already have pretty well defined call sites in fair.c where we update the core nr_running counter. These are also the exact and only places where the root cfs_rq utilization changes, apart from the tick.

What I'm proposing here is meant not only to fix the current issue, but also to possibly find a code organization which makes it less likely to miss dependencies in possible future code updates too.

To me it looks cleaner and, while I still have to look better at the code, I think this can be a first step to possibly factor all schedutil updates (for FAIR and RT) into just core.c

[...]

> > This means that we are likely to see a zero cfs_rq utilization when we enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when instead, for example, we are throttling all the FAIR tasks of a CPU.
> >
> > While the second issue is less important, since we are less likely to
>
> Actually I wanted behavior like the second issue, because that means frequency will not be dropped if the CPU is about to idle, which is similar to a patch I sent a long time ago (skip freq update if CPU about to idle).

I think that's a slightly different case, since here a cfs_rq is intentionally throttled, thus we want to stop the tasks and potentially drop the frequency.

> > reduce frequency when CPU utilization decreases, the first issue can instead impact performance. Indeed, we potentially introduce an undesired
>
> This issue would be fixed by just fixing the h_nr_running bug right?

Sure, something like this:

---8<---
index 0951d1c58d2f..e9a31258d345 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5368,6 +5368,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running++;
 		enqueue_entity(cfs_rq, se, flags);
 
 		/*
@@ -5377,8 +5378,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		 * post the final h_nr_running increment below.
 		 */
 		if (cfs_rq_throttled(cfs_rq))
+			cfs_rq->h_nr_running--;
 			break;
-		cfs_rq->h_nr_running++;
+		}
 
 		flags = ENQUEUE_WAKEUP;
 	}
---8<---

can probably easily fix one of the problems. But I did not check what it implies to increment h_nr_running before calling enqueue_entity() and cfs_rq_throttled().

Still, we miss the (IMO) opportunity to make a more suitable single and explicit function call only when needed, which is also just a few code lines below in the same function, as proposed by this patch.

> > latency between a task enqueue on a CPU and its frequency increase.
> >
> > Another possible unwanted side effect is the iowait boosting of a CPU when we enqueue a task into a throttled cfs_rq.
>
> Probably very very rare.

Still possible by code... and when you notice it, it's clearly not a wanted behavior.
> > Moreover, the current schedutil integration has these other downsides:
> >
> > - schedutil updates are triggered by RQ's load updates, which makes sense in general but it does not allow to know exactly which other RQ related information has been updated (e.g. h_nr_running).
>
> It seems broken that all information that schedutil needs isn't updated _before_ the cpufreq update request, so that should be fixed instead?

That's what I'm trying to do here but, instead of doing some operations earlier, I'm proposing to postpone what can be done later. Schedutil updates can be done right before returning to the core scheduler, at which time we should be pretty sure all CFS related info has been properly updated.

[...]

> > All the above considered, let's try to make schedutil updates more explicit in fair.c by:
> >
> > - removing the cfs_rq_util_change() wrapper function to use the cpufreq_update_util() public API only when the root cfs_rq
Re: [PATCH] sched/fair: schedutil: update only with all info available
Hi Vincent,

On 09-Apr 10:51, Vincent Guittot wrote:
> Hi Patrick
>
> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
> > Schedutil is not properly updated when the first FAIR task wakes up on a CPU and when a RQ is (un)throttled. This is mainly due to the current integration strategy, which relies on updates being triggered implicitly each time a cfs_rq's utilization is updated.
> >
> > Those updates are currently provided (mainly) via cfs_rq_util_change() which is used in:
> > - update_cfs_rq_load_avg(), when the utilization of a cfs_rq is updated
> > - {attach,detach}_entity_load_avg()
> > This is done based on the idea that "we should callback schedutil frequently enough" to properly update the CPU frequency at every utilization change.
> >
> > Since this recent schedutil update:
> >
> > commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
> >
> > we use additional RQ information to properly account for FAIR tasks utilization. Specifically, cfs_rq::h_nr_running has to be non-zero in sugov_aggregate_util() to sum up the cfs_rq's utilization.
>
> Isn't the use of cfs_rq::h_nr_running the root cause of the problem?

Not really...

> I can now see a lot of frequency changes on my hikey with this new condition in sugov_aggregate_util(). With an rt-app use case that creates a periodic cfs task, I have a lot of frequency changes instead of staying at the same frequency

I don't remember a similar behavior... but I'll check better.

> Peter, what was your goal with adding the condition "if (rq->cfs.h_nr_running)" for the aggregation of CFS utilization

The original intent was to get rid of sched class flags, used to track which class has tasks runnable from within schedutil. The reason was to solve some misalignment between scheduler class status and schedutil status. The solution, initially suggested by Viresh, and finally proposed by Peter, was to exploit RQ knowledge directly from within schedutil.
The problem is that schedutil updates now depend on two pieces of information: utilization changes and the number of RT and CFS runnable tasks.

Thus, using cfs_rq::h_nr_running is not the problem... it's actually part of a much cleaner solution than the code we used to have.

The problem, IMO, is that we now depend on other information which needs to be in sync before calling schedutil... and the patch I proposed is meant to make it less likely that all the required information is not aligned (also in the future).

--
Patrick Bellasi
Re: [PATCH] sched/fair: schedutil: update only with all info available
Hi Patrick

On 6 April 2018 at 19:28, Patrick Bellasi wrote:
> Schedutil is not properly updated when the first FAIR task wakes up on a CPU and when a RQ is (un)throttled. This is mainly due to the current integration strategy, which relies on updates being triggered implicitly each time a cfs_rq's utilization is updated.
>
> Those updates are currently provided (mainly) via cfs_rq_util_change() which is used in:
> - update_cfs_rq_load_avg(), when the utilization of a cfs_rq is updated
> - {attach,detach}_entity_load_avg()
> This is done based on the idea that "we should callback schedutil frequently enough" to properly update the CPU frequency at every utilization change.
>
> Since this recent schedutil update:
>
> commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
>
> we use additional RQ information to properly account for FAIR tasks utilization. Specifically, cfs_rq::h_nr_running has to be non-zero in sugov_aggregate_util() to sum up the cfs_rq's utilization.

Isn't the use of cfs_rq::h_nr_running the root cause of the problem?

I can now see a lot of frequency changes on my hikey with this new condition in sugov_aggregate_util(). With an rt-app use case that creates a periodic cfs task, I have a lot of frequency changes instead of staying at the same frequency.

Peter, what was your goal with adding the condition "if (rq->cfs.h_nr_running)" for the aggregation of CFS utilization

Thanks
Vincent

> However, cfs_rq::h_nr_running is usually updated as:
>
> enqueue_entity()
>   ...
>   update_load_avg()
>     ...
>     cfs_rq_util_change() ==> trigger schedutil update
>   ...
>   cfs_rq->h_nr_running += number_of_tasks
>
> both in enqueue_task_fair() as well as in unthrottle_cfs_rq(). A similar pattern is used also in dequeue_task_fair() and throttle_cfs_rq() to remove tasks.
>
> This means that we are likely to see a zero cfs_rq utilization when we enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when instead, for example, we are throttling all the FAIR tasks of a CPU.
>
> While the second issue is less important, since we are less likely to reduce frequency when CPU utilization decreases, the first issue can instead impact performance. Indeed, we potentially introduce an undesired latency between a task enqueue on a CPU and its frequency increase.
>
> Another possible unwanted side effect is the iowait boosting of a CPU when we enqueue a task into a throttled cfs_rq.
>
> Moreover, the current schedutil integration has these other downsides:
>
> - schedutil updates are triggered by RQ's load updates, which makes sense in general but it does not allow to know exactly which other RQ related information has been updated (e.g. h_nr_running).
>
> - increasing the chances to update schedutil does not always correspond to providing the most accurate information for a proper frequency selection, thus we can skip some updates.
>
> - we don't know exactly at which point a schedutil update is triggered, and thus potentially a frequency change started, and that's because the update is a side effect of cfs_rq_util_change() instead of an explicit call from the most suitable call path.
>
> - cfs_rq_util_change() is mainly a wrapper function for an already existing "public API", cpufreq_update_util(), to ensure we actually update schedutil only when we are updating a root RQ. Thus, especially when task groups are in use, most of the calls to this wrapper function are really not required.
>
> - the usage of a wrapper function is not completely consistent across fair.c, since we still need sometimes additional explicit calls to cpufreq_update_util(), for example to support the IOWAIT boost flag in the wakeup path
>
> - it makes it hard to integrate new features since it could require changing other function prototypes just to pass in an additional flag, as it happened for example here:
>
> commit ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
>
> All the above considered, let's try to make schedutil updates more explicit in fair.c by:
>
> - removing the cfs_rq_util_change() wrapper function to use the cpufreq_update_util() public API only when the root cfs_rq is updated
>
> - remove indirect and side-effect (sometimes not required) schedutil updates when the cfs_rq utilization is updated
>
> - call cpufreq_update_util() explicitly in the few call sites where it really makes sense and all the required information has been updated
>
> By doing so, this patch mainly removes code and adds explicit calls to schedutil only when we:
> - {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
> - (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
> - task_tick_fair() to update the utilization of the root cfs_rq
>
> All the other code paths, currently _indirectly_ covered by a call to update_load_avg()
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Fri, Apr 6, 2018 at 10:28 AM, Patrick Bellasi wrote:
> Schedutil is not properly updated when the first FAIR task wakes up on a CPU and when a RQ is (un)throttled. This is mainly due to the current integration strategy, which relies on updates being triggered implicitly each time a cfs_rq's utilization is updated.

To me that's not implicit, but is explicitly done when the util is updated. It seems that's the logical place..

> Those updates are currently provided (mainly) via cfs_rq_util_change() which is used in:
> - update_cfs_rq_load_avg(), when the utilization of a cfs_rq is updated
> - {attach,detach}_entity_load_avg()
> This is done based on the idea that "we should callback schedutil frequently enough" to properly update the CPU frequency at every

This also I didn't get, it's not called "frequently enough" but at the right time (when the util is updated).

> utilization change.
>
> Since this recent schedutil update:
>
> commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
>
> we use additional RQ information to properly account for FAIR tasks utilization. Specifically, cfs_rq::h_nr_running has to be non-zero in sugov_aggregate_util() to sum up the cfs_rq's utilization.
>
> However, cfs_rq::h_nr_running is usually updated as:
>
> enqueue_entity()
>   ...
>   update_load_avg()
>     ...
>     cfs_rq_util_change() ==> trigger schedutil update
>   ...
>   cfs_rq->h_nr_running += number_of_tasks
>
> both in enqueue_task_fair() as well as in unthrottle_cfs_rq(). A similar pattern is used also in dequeue_task_fair() and throttle_cfs_rq() to remove tasks.

Maybe this should be fixed then? It seems broken to begin with...

> This means that we are likely to see a zero cfs_rq utilization when we enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when instead, for example, we are throttling all the FAIR tasks of a CPU.
>
> While the second issue is less important, since we are less likely to

Actually I wanted behavior like the second issue, because that means frequency will not be dropped if the CPU is about to idle, which is similar to a patch I sent a long time ago (skip freq update if CPU about to idle).

> reduce frequency when CPU utilization decreases, the first issue can instead impact performance. Indeed, we potentially introduce an undesired

This issue would be fixed by just fixing the h_nr_running bug right?

> latency between a task enqueue on a CPU and its frequency increase.
>
> Another possible unwanted side effect is the iowait boosting of a CPU when we enqueue a task into a throttled cfs_rq.

Probably very very rare.

> Moreover, the current schedutil integration has these other downsides:
>
> - schedutil updates are triggered by RQ's load updates, which makes sense in general but it does not allow to know exactly which other RQ related information has been updated (e.g. h_nr_running).

It seems broken that all information that schedutil needs isn't updated _before_ the cpufreq update request, so that should be fixed instead?

> - increasing the chances to update schedutil does not always correspond to providing the most accurate information for a proper frequency selection, thus we can skip some updates.

Again, IMO it's just updated at the right time already, not just frequently enough...

> - we don't know exactly at which point a schedutil update is triggered, and thus potentially a frequency change started, and that's because the update is a side effect of cfs_rq_util_change() instead of an explicit call from the most suitable call path.
>
> - cfs_rq_util_change() is mainly a wrapper function for an already existing "public API", cpufreq_update_util(), to ensure we actually update schedutil only when we are updating a root RQ. Thus, especially when task groups are in use, most of the calls to this wrapper function are really not required.
>
> - the usage of a wrapper function is not completely consistent across fair.c, since we still need sometimes additional explicit calls to cpufreq_update_util(), for example to support the IOWAIT boost flag in the wakeup path
>
> - it makes it hard to integrate new features since it could require changing other function prototypes just to pass in an additional flag, as it happened for example here:
>
> commit ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
>
> All the above considered, let's try to make schedutil updates more explicit in fair.c by:
>
> - removing the cfs_rq_util_change() wrapper function to use the cpufreq_update_util() public API only when the root cfs_rq is updated
>
> - remove indirect and side-effect (sometimes not required) schedutil updates when the cfs_rq utilization is updated
>
> - call cpufreq_update_util() explicitly in the few call sites where it really makes sense and all the required information has been updated
[PATCH] sched/fair: schedutil: update only with all info available
Schedutil is not properly updated when the first FAIR task wakes up on a CPU and when a RQ is (un)throttled. This is mainly due to the current integration strategy, which relies on updates being triggered implicitly each time a cfs_rq's utilization is updated.

Those updates are currently provided (mainly) via cfs_rq_util_change() which is used in:
- update_cfs_rq_load_avg(), when the utilization of a cfs_rq is updated
- {attach,detach}_entity_load_avg()
This is done based on the idea that "we should callback schedutil frequently enough" to properly update the CPU frequency at every utilization change.

Since this recent schedutil update:

commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")

we use additional RQ information to properly account for FAIR tasks utilization. Specifically, cfs_rq::h_nr_running has to be non-zero in sugov_aggregate_util() to sum up the cfs_rq's utilization.

However, cfs_rq::h_nr_running is usually updated as:

enqueue_entity()
  ...
  update_load_avg()
    ...
    cfs_rq_util_change() ==> trigger schedutil update
  ...
  cfs_rq->h_nr_running += number_of_tasks

both in enqueue_task_fair() as well as in unthrottle_cfs_rq(). A similar pattern is used also in dequeue_task_fair() and throttle_cfs_rq() to remove tasks.

This means that we are likely to see a zero cfs_rq utilization when we enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when instead, for example, we are throttling all the FAIR tasks of a CPU.

While the second issue is less important, since we are less likely to reduce frequency when CPU utilization decreases, the first issue can instead impact performance. Indeed, we potentially introduce an undesired latency between a task enqueue on a CPU and its frequency increase.

Another possible unwanted side effect is the iowait boosting of a CPU when we enqueue a task into a throttled cfs_rq.
Moreover, the current schedutil integration has these other downsides:

- schedutil updates are triggered by RQ's load updates, which makes sense in general but it does not allow to know exactly which other RQ related information has been updated (e.g. h_nr_running).

- increasing the chances to update schedutil does not always correspond to providing the most accurate information for a proper frequency selection, thus we can skip some updates.

- we don't know exactly at which point a schedutil update is triggered, and thus potentially a frequency change started, and that's because the update is a side effect of cfs_rq_util_change() instead of an explicit call from the most suitable call path.

- cfs_rq_util_change() is mainly a wrapper function for an already existing "public API", cpufreq_update_util(), to ensure we actually update schedutil only when we are updating a root RQ. Thus, especially when task groups are in use, most of the calls to this wrapper function are really not required.

- the usage of a wrapper function is not completely consistent across fair.c, since we still need sometimes additional explicit calls to cpufreq_update_util(), for example to support the IOWAIT boost flag in the wakeup path

- it makes it hard to integrate new features since it could require changing other function prototypes just to pass in an additional flag, as it happened for example here:

commit ea14b57e8a18 ("sched/cpufreq: Provide migration hint")

All the above considered, let's try to make schedutil updates more explicit in fair.c by:

- removing the cfs_rq_util_change() wrapper function to use the cpufreq_update_util() public API only when the root cfs_rq is updated

- remove indirect and side-effect (sometimes not required) schedutil updates when the cfs_rq utilization is updated

- call cpufreq_update_util() explicitly in the few call sites where it really makes sense and all the required information has been updated

By doing so, this patch mainly removes code and adds explicit
calls to schedutil only when we:
- {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
- (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
- task_tick_fair() to update the utilization of the root cfs_rq

All the other code paths, currently _indirectly_ covered by a call to update_load_avg(), are also covered by the above three calls.

Some already imply enqueue/dequeue calls:
- switch_{to,from}_fair()
- sched_move_task()

or are followed by enqueue/dequeue calls:
- cpu_cgroup_fork() and post_init_entity_util_avg(): are used at wakeup_new_task() time and thus already followed by an enqueue_task_fair()
- migrate_task_rq_fair(): updates the removed utilization but not the actual cfs_rq utilization, which is updated by a following sched event

This new proposal also allows to better aggregate schedutil related flags, which are required only at enqueue_task_fair() time. Indeed, IOWAIT and MIGRATION flags are now requested only when