Re: [PATCH] sched/fair: schedutil: update only with all info available
On Thu, Apr 26, 2018 at 12:15:33PM +0100, Patrick Bellasi wrote:

> > Yes, these patches predate those, but indeed, now that we age the
> > blocked load consistently it should no longer be required.
>
> After this discussion, I think there is a general consensus about
> always adding sg_cpu->util_cfs in cpufreq_schedutil.c::sugov_aggregate_util.
>
> Is that right?

Yes, I think so. I've been waiting to see the problem with the blocked
load aging patches sorted though, because if we had to revert those we
would be back to needing the current stuff again. Luckily it appears
Vincent found something there, so fingers crossed.

> For the rest, what this patch proposes is a code reorganization which
> is no longer required to fix this specific issue but is still
> required to fix the other issue reported by Vincent: i.e. util_est is
> not updated before schedutil.
>
> Thus, I would propose to still keep this refactoring, but in the
> context of a different patch that specifically fixes the util_est case.
>
> If there are no major complaints, I'll post a new series in the next
> few days.

Fair enough..
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11-Apr 17:37, Peter Zijlstra wrote:
> On Wed, Apr 11, 2018 at 05:29:01PM +0200, Vincent Guittot wrote:
> > On 11 April 2018 at 17:14, Peter Zijlstra wrote:
> > > On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
> > > > On 09-Apr 10:51, Vincent Guittot wrote:
> > > >
> > > > > Peter,
> > > > > what was your goal with adding the condition "if
> > > > > (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
> > > >
> > > > The original intent was to get rid of the sched class flags, used
> > > > to track which class has tasks runnable, from within schedutil.
> > > > The reason was to solve some misalignment between scheduler class
> > > > status and schedutil status.
> > > >
> > > > The solution, initially suggested by Viresh, and finally proposed
> > > > by Peter, was to exploit RQ knowledge directly from within
> > > > schedutil.
> > > >
> > > > The problem is that schedutil updates now depend on two pieces of
> > > > information: utilization changes and the number of RT and CFS
> > > > runnable tasks.
> > > >
> > > > Thus, using cfs_rq::h_nr_running is not the problem... it's
> > > > actually part of a much cleaner solution than the code we used
> > > > to have.
> > >
> > > Specifically, the h_nr_running test was to get rid of
> > >
> > >     if (delta_ns > TICK_NSEC) {
> > >             j_sg_cpu->iowait_boost = 0;
> > >             j_sg_cpu->iowait_boost_pending = false;
> > > -           j_sg_cpu->util_cfs = 0;
> > >
> > >     ^^^ that..
> > >
> > > -           if (j_sg_cpu->util_dl == 0)
> > > -                   continue;
> > >     }
> > >
> > > because that felt rather arbitrary.
> >
> > yes I agree.
> >
> > With the patch that updates blocked idle load, we should not have the
> > problem of blocked utilization anymore, and can get rid of the code
> > above and the h_nr_running test.
>
> Yes, these patches predate those, but indeed, now that we age the
> blocked load consistently it should no longer be required.

After this discussion, I think there is a general consensus about
always adding sg_cpu->util_cfs in cpufreq_schedutil.c::sugov_aggregate_util.

Is that right?

For the rest, what this patch proposes is a code reorganization which
is no longer required to fix this specific issue but is still required
to fix the other issue reported by Vincent: i.e. util_est is not
updated before schedutil.

Thus, I would propose to still keep this refactoring, but in the
context of a different patch that specifically fixes the util_est case.

If there are no major complaints, I'll post a new series in the next
few days.

--
#include

Patrick Bellasi
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Thu, Apr 12, 2018 at 12:01 AM, Vincent Guittot wrote:
>>
>> Also that aside, the "running util" is what was used to drive the CFS
>> util before Peter's patch (8f111bc357a). That was accounting the
>> blocked and decaying utilization, but that patch changed the behavior.
>> It seems logical we should just use that and not check h_nr_running
>> for CFS, so we don't miss out on the decayed utilization. What is the
>> use of checking h_nr_running or the state of the runqueue for CFS? I
>> am sure to be missing something here. :-(
>
> As Peter mentioned, the change in commit (8f111bc357a) was to remove
> the test that was arbitrarily removing the util_avg of a cpu that had
> not been updated since a tick.
>
> But with the update of blocked idle load, we don't need to handle the
> case of stalled load/utilization.

Thanks a lot for the clarification. It makes sense now.

- Joel
Re: [PATCH] sched/fair: schedutil: update only with all info available
Hi Joel,

On 11 April 2018 at 23:34, Joel Fernandes wrote:
> Hi Vincent,
>
> On Wed, Apr 11, 2018 at 4:56 AM, Vincent Guittot wrote:
>> On 11 April 2018 at 12:15, Patrick Bellasi wrote:
>>> On 11-Apr 08:57, Vincent Guittot wrote:
>>>> On 10 April 2018 at 13:04, Patrick Bellasi wrote:
>>>>> On 09-Apr 10:51, Vincent Guittot wrote:
>>>>>> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
>>>>>> Peter,
>>>>>> what was your goal with adding the condition "if
>>>>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>>>>
>>>>> The original intent was to get rid of the sched class flags, used
>>>>> to track which class has tasks runnable, from within schedutil.
>>>>> The reason was to solve some misalignment between scheduler class
>>>>> status and schedutil status.
>>>>
>>>> This was mainly for RT tasks, but it was not the case for cfs tasks
>>>> before commit 8f111bc357aa.
>>>
>>> True, but with his solution Peter has actually come up with a
>>> unified interface which is now (and can be, IMO) based just on
>>> RUNNABLE counters for each class.
>>
>> But do we really want to only take care of runnable counters for all
>> classes?
>>
>>>>> The solution, initially suggested by Viresh, and finally proposed
>>>>> by Peter, was to exploit RQ knowledge directly from within
>>>>> schedutil.
>>>>>
>>>>> The problem is that schedutil updates now depend on two pieces of
>>>>> information: utilization changes and the number of RT and CFS
>>>>> runnable tasks.
>>>>>
>>>>> Thus, using cfs_rq::h_nr_running is not the problem... it's
>>>>> actually part of a much cleaner solution than the code we used to
>>>>> have.
>>>>
>>>> So there are 2 problems there:
>>>> - using cfs_rq::h_nr_running when aggregating cfs utilization,
>>>>   which generates a lot of frequency drops
>>>
>>> You mean because we now completely disregard the blocked utilization
>>> when a CPU is idle, right?
>>
>> yes
>>
>>> Given how PELT works and the recent support for updating IDLE CPUs,
>>> we should probably always add contributions for the CFS class.
>>>
>>>> - making sure that the nr-running are up-to-date when used in
>>>>   sched_util
>>>
>>> Right... but, if we always add the cfs_rq (to always account for
>>> blocked utilization), we no longer have this last dependency, do we?
>>
>> yes
>>
>>> We still have to account for the util_est dependency.
>>>
>>> Should I add a patch to this series to disregard
>>> cfs_rq::h_nr_running from schedutil, as you suggested?
>>
>> It's probably better to have a separate patch, as these are 2
>> different topics:
>> - when to update cfs_rq::h_nr_running and when to call
>>   cpufreq_update_util
>> - should we use runnable or running utilization for CFS
>
> By runnable you don't mean sched_avg::load_avg, right? I got a bit
> confused, since runnable means load_avg and running means util_avg.

Sorry for the confusion. By runnable utilization, I meant taking into
account the number of runnable tasks (cfs_rq::h_nr_running), like what
is done by commit (8f111bc357a).

> But I didn't follow here why we are talking about load_avg for
> schedutil at all. I am guessing by "runnable" you mean h_nr_running !=
> 0.

yes

> Also that aside, the "running util" is what was used to drive the CFS
> util before Peter's patch (8f111bc357a). That was accounting the
> blocked and decaying utilization, but that patch changed the behavior.
> It seems logical we should just use that and not check h_nr_running
> for CFS, so we don't miss out on the decayed utilization. What is the
> use of checking h_nr_running or the state of the runqueue for CFS? I
> am sure to be missing something here. :-(

As Peter mentioned, the change in commit (8f111bc357a) was to remove
the test that was arbitrarily removing the util_avg of a cpu that had
not been updated since a tick.

But with the update of blocked idle load, we don't need to handle the
case of stalled load/utilization.

> thanks!
>
> - Joel
Re: [PATCH] sched/fair: schedutil: update only with all info available
Hi Vincent,

On Wed, Apr 11, 2018 at 4:56 AM, Vincent Guittot wrote:
> On 11 April 2018 at 12:15, Patrick Bellasi wrote:
>> On 11-Apr 08:57, Vincent Guittot wrote:
>>> On 10 April 2018 at 13:04, Patrick Bellasi wrote:
>>>> On 09-Apr 10:51, Vincent Guittot wrote:
>>>>> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
>>>>> Peter,
>>>>> what was your goal with adding the condition "if
>>>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>>>
>>>> The original intent was to get rid of the sched class flags, used
>>>> to track which class has tasks runnable, from within schedutil.
>>>> The reason was to solve some misalignment between scheduler class
>>>> status and schedutil status.
>>>
>>> This was mainly for RT tasks, but it was not the case for cfs tasks
>>> before commit 8f111bc357aa.
>>
>> True, but with his solution Peter has actually come up with a
>> unified interface which is now (and can be, IMO) based just on
>> RUNNABLE counters for each class.
>
> But do we really want to only take care of runnable counters for all
> classes?
>
>>>> The solution, initially suggested by Viresh, and finally proposed
>>>> by Peter, was to exploit RQ knowledge directly from within
>>>> schedutil.
>>>>
>>>> The problem is that schedutil updates now depend on two pieces of
>>>> information: utilization changes and the number of RT and CFS
>>>> runnable tasks.
>>>>
>>>> Thus, using cfs_rq::h_nr_running is not the problem... it's
>>>> actually part of a much cleaner solution than the code we used to
>>>> have.
>>>
>>> So there are 2 problems there:
>>> - using cfs_rq::h_nr_running when aggregating cfs utilization,
>>>   which generates a lot of frequency drops
>>
>> You mean because we now completely disregard the blocked utilization
>> when a CPU is idle, right?
>
> yes
>
>> Given how PELT works and the recent support for updating IDLE CPUs,
>> we should probably always add contributions for the CFS class.
>>
>>> - making sure that the nr-running are up-to-date when used in
>>>   sched_util
>>
>> Right... but, if we always add the cfs_rq (to always account for
>> blocked utilization), we no longer have this last dependency, do we?
>
> yes
>
>> We still have to account for the util_est dependency.
>>
>> Should I add a patch to this series to disregard
>> cfs_rq::h_nr_running from schedutil as you suggested?
>
> It's probably better to have a separate patch, as these are 2
> different topics:
> - when to update cfs_rq::h_nr_running and when to call
>   cpufreq_update_util
> - should we use runnable or running utilization for CFS

By runnable you don't mean sched_avg::load_avg, right? I got a bit
confused, since runnable means load_avg and running means util_avg.

But I didn't follow here why we are talking about load_avg for
schedutil at all. I am guessing by "runnable" you mean h_nr_running !=
0.

Also that aside, the "running util" is what was used to drive the CFS
util before Peter's patch (8f111bc357a). That was accounting the
blocked and decaying utilization, but that patch changed the behavior.
It seems logical we should just use that and not check h_nr_running
for CFS, so we don't miss out on the decayed utilization. What is the
use of checking h_nr_running or the state of the runqueue for CFS? I
am sure to be missing something here. :-(

thanks!

- Joel
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11 April 2018 at 18:15, Peter Zijlstra wrote:
> On Wed, Apr 11, 2018 at 06:10:47PM +0200, Vincent Guittot wrote:
>>> Could it be that for some reason the nohz balancer now takes a very
>>> long time to run?
>>
>> Heiner mentions that it was a relatively slow celeron and he uses the
>> ondemand governor. So I was about to ask him to use the performance
>> governor, to see if it can be because the cpu runs slow and takes too
>> much time to enter idle.
>
> Maybe also hand him a patch with some trace_printk()s in and ask him to
> send /debug/tracing/trace after it happens or somesuch. Maybe we can
> find a clue that way.

yes
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Wed, Apr 11, 2018 at 06:10:47PM +0200, Vincent Guittot wrote:
> > Could it be that for some reason the nohz balancer now takes a very
> > long time to run?
>
> Heiner mentions that it was a relatively slow celeron and he uses the
> ondemand governor. So I was about to ask him to use the performance
> governor, to see if it can be because the cpu runs slow and takes too
> much time to enter idle.

Maybe also hand him a patch with some trace_printk()s in and ask him to
send /debug/tracing/trace after it happens or somesuch. Maybe we can
find a clue that way.
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11 April 2018 at 18:00, Peter Zijlstra wrote:
> On Wed, Apr 11, 2018 at 05:41:24PM +0200, Vincent Guittot wrote:
>> Yes, and to be honest I don't have any clue about the root cause :-(
>> Heiner mentioned that it's much better in latest linux-next, but I
>> haven't seen any changes related to the code of those patches.
>
> Yeah, it's a bit of a puzzle. Now you touch nohz, and the patches in
> next that are most likely to have affected this are rjw's
> cpuidle-vs-nohz patches. The common denominator being nohz.
>
> Now I think rjw's patches will ensure we enter nohz _less_; they avoid
> stopping the tick when we expect to go idle for only a short period.
>
> So if your patch makes nohz go wobbly, going nohz less will make that
> better.
>
> Of course, I've no actual clue as to what that patch (it's the last one
> in the series, right?:
>
>   31e77c93e432 ("sched/fair: Update blocked load when newly idle")
>
> ) does that is so offensive to that one machine. You never did manage
> to reproduce it, right?

yes

> Could it be that for some reason the nohz balancer now takes a very
> long time to run?

Heiner mentions that it was a relatively slow celeron and he uses the
ondemand governor. So I was about to ask him to use the performance
governor, to see if it can be because the cpu runs slow and takes too
much time to enter idle.

> Could something like the following happen (and this is really flaky
> thinking here):
>
> the last CPU goes idle, we enter idle_balance(), that kicks the ilb,
> the ilb runs, which somehow again triggers idle_balance, and around
> we go?
>
> I'm not immediately seeing how that could happen, but if we do
> something daft like that we can tie up the CPU for a while, mostly
> with IRQs disabled, and that would be visible as the latency he sees.
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Wed, Apr 11, 2018 at 05:41:24PM +0200, Vincent Guittot wrote:
> Yes, and to be honest I don't have any clue about the root cause :-(
> Heiner mentioned that it's much better in latest linux-next, but I
> haven't seen any changes related to the code of those patches.

Yeah, it's a bit of a puzzle. Now you touch nohz, and the patches in
next that are most likely to have affected this are rjw's
cpuidle-vs-nohz patches. The common denominator being nohz.

Now I think rjw's patches will ensure we enter nohz _less_; they avoid
stopping the tick when we expect to go idle for only a short period.

So if your patch makes nohz go wobbly, going nohz less will make that
better.

Of course, I've no actual clue as to what that patch (it's the last one
in the series, right?:

  31e77c93e432 ("sched/fair: Update blocked load when newly idle")

) does that is so offensive to that one machine. You never did manage
to reproduce it, right?

Could it be that for some reason the nohz balancer now takes a very
long time to run?

Could something like the following happen (and this is really flaky
thinking here):

the last CPU goes idle, we enter idle_balance(), that kicks the ilb,
the ilb runs, which somehow again triggers idle_balance, and around we
go?

I'm not immediately seeing how that could happen, but if we do
something daft like that we can tie up the CPU for a while, mostly with
IRQs disabled, and that would be visible as the latency he sees.
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11 April 2018 at 17:37, Peter Zijlstra wrote:
> On Wed, Apr 11, 2018 at 05:29:01PM +0200, Vincent Guittot wrote:
>> On 11 April 2018 at 17:14, Peter Zijlstra wrote:
>>> On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
>>>> On 09-Apr 10:51, Vincent Guittot wrote:
>>>
>>>>> Peter,
>>>>> what was your goal with adding the condition "if
>>>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>>>
>>>> The original intent was to get rid of the sched class flags, used
>>>> to track which class has tasks runnable, from within schedutil.
>>>> The reason was to solve some misalignment between scheduler class
>>>> status and schedutil status.
>>>>
>>>> The solution, initially suggested by Viresh, and finally proposed
>>>> by Peter, was to exploit RQ knowledge directly from within
>>>> schedutil.
>>>>
>>>> The problem is that schedutil updates now depend on two pieces of
>>>> information: utilization changes and the number of RT and CFS
>>>> runnable tasks.
>>>>
>>>> Thus, using cfs_rq::h_nr_running is not the problem... it's
>>>> actually part of a much cleaner solution than the code we used to
>>>> have.
>>>>
>>>> The problem, IMO, is that we now depend on other information which
>>>> needs to be in sync before calling schedutil... and the patch I
>>>> proposed is meant to make it less likely that all the required
>>>> information is misaligned (also in the future).
>>>
>>> Specifically, the h_nr_running test was to get rid of
>>>
>>>     if (delta_ns > TICK_NSEC) {
>>>             j_sg_cpu->iowait_boost = 0;
>>>             j_sg_cpu->iowait_boost_pending = false;
>>> -           j_sg_cpu->util_cfs = 0;
>>>
>>>     ^^^ that..
>>>
>>> -           if (j_sg_cpu->util_dl == 0)
>>> -                   continue;
>>>     }
>>>
>>> because that felt rather arbitrary.
>>
>> yes I agree.
>>
>> With the patch that updates blocked idle load, we should not have the
>> problem of blocked utilization anymore, and can get rid of the code
>> above and the h_nr_running test.
>
> Yes, these patches predate those, but indeed, now that we age the
> blocked load consistently it should no longer be required.
>
> Of course, you still have that weird regression report against those
> patches... :-)

Yes, and to be honest I don't have any clue about the root cause :-(
Heiner mentioned that it's much better in latest linux-next, but I
haven't seen any changes related to the code of those patches.
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Wed, Apr 11, 2018 at 05:29:01PM +0200, Vincent Guittot wrote:
> On 11 April 2018 at 17:14, Peter Zijlstra wrote:
>> On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
>>> On 09-Apr 10:51, Vincent Guittot wrote:
>>
>>>> Peter,
>>>> what was your goal with adding the condition "if
>>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>>
>>> The original intent was to get rid of the sched class flags, used to
>>> track which class has tasks runnable, from within schedutil. The
>>> reason was to solve some misalignment between scheduler class status
>>> and schedutil status.
>>>
>>> The solution, initially suggested by Viresh, and finally proposed by
>>> Peter, was to exploit RQ knowledge directly from within schedutil.
>>>
>>> The problem is that schedutil updates now depend on two pieces of
>>> information: utilization changes and the number of RT and CFS
>>> runnable tasks.
>>>
>>> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
>>> part of a much cleaner solution than the code we used to have.
>>>
>>> The problem, IMO, is that we now depend on other information which
>>> needs to be in sync before calling schedutil... and the patch I
>>> proposed is meant to make it less likely that all the required
>>> information is misaligned (also in the future).
>>
>> Specifically, the h_nr_running test was to get rid of
>>
>>     if (delta_ns > TICK_NSEC) {
>>             j_sg_cpu->iowait_boost = 0;
>>             j_sg_cpu->iowait_boost_pending = false;
>> -           j_sg_cpu->util_cfs = 0;
>>
>>     ^^^ that..
>>
>> -           if (j_sg_cpu->util_dl == 0)
>> -                   continue;
>>     }
>>
>> because that felt rather arbitrary.
>
> yes I agree.
>
> With the patch that updates blocked idle load, we should not have the
> problem of blocked utilization anymore, and can get rid of the code
> above and the h_nr_running test.

Yes, these patches predate those, but indeed, now that we age the
blocked load consistently it should no longer be required.

Of course, you still have that weird regression report against those
patches... :-)
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11 April 2018 at 17:14, Peter Zijlstra wrote:
> On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
>> On 09-Apr 10:51, Vincent Guittot wrote:
>
>>> Peter,
>>> what was your goal with adding the condition "if
>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>
>> The original intent was to get rid of the sched class flags, used to
>> track which class has tasks runnable, from within schedutil. The
>> reason was to solve some misalignment between scheduler class status
>> and schedutil status.
>>
>> The solution, initially suggested by Viresh, and finally proposed by
>> Peter, was to exploit RQ knowledge directly from within schedutil.
>>
>> The problem is that schedutil updates now depend on two pieces of
>> information: utilization changes and the number of RT and CFS
>> runnable tasks.
>>
>> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
>> part of a much cleaner solution than the code we used to have.
>>
>> The problem, IMO, is that we now depend on other information which
>> needs to be in sync before calling schedutil... and the patch I
>> proposed is meant to make it less likely that all the required
>> information is misaligned (also in the future).
>
> Specifically, the h_nr_running test was to get rid of
>
>     if (delta_ns > TICK_NSEC) {
>             j_sg_cpu->iowait_boost = 0;
>             j_sg_cpu->iowait_boost_pending = false;
> -           j_sg_cpu->util_cfs = 0;
>
>     ^^^ that..
>
> -           if (j_sg_cpu->util_dl == 0)
> -                   continue;
>     }
>
> because that felt rather arbitrary.

yes I agree.

With the patch that updates blocked idle load, we should not have the
problem of blocked utilization anymore, and can get rid of the code
above and the h_nr_running test.
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
> On 09-Apr 10:51, Vincent Guittot wrote:
>> Peter,
>> what was your goal with adding the condition "if
>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>
> The original intent was to get rid of the sched class flags, used to
> track which class has tasks runnable, from within schedutil. The
> reason was to solve some misalignment between scheduler class status
> and schedutil status.
>
> The solution, initially suggested by Viresh, and finally proposed by
> Peter, was to exploit RQ knowledge directly from within schedutil.
>
> The problem is that schedutil updates now depend on two pieces of
> information: utilization changes and the number of RT and CFS runnable
> tasks.
>
> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
> part of a much cleaner solution than the code we used to have.
>
> The problem, IMO, is that we now depend on other information which
> needs to be in sync before calling schedutil... and the patch I
> proposed is meant to make it less likely that all the required
> information is misaligned (also in the future).

Specifically, the h_nr_running test was to get rid of

    if (delta_ns > TICK_NSEC) {
            j_sg_cpu->iowait_boost = 0;
            j_sg_cpu->iowait_boost_pending = false;
-           j_sg_cpu->util_cfs = 0;

    ^^^ that..

-           if (j_sg_cpu->util_dl == 0)
-                   continue;
    }

because that felt rather arbitrary.
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Fri, Apr 06, 2018 at 06:28:35PM +0100, Patrick Bellasi wrote:
> is maintained although there are no actual usages so far in mainline
> for this hint... do we really need it?

There were intel_pstate patches that I expected to have shown up
already, and I meant to have a look at sugov, except I got side-tracked
again :/
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11-Apr 13:56, Vincent Guittot wrote:
> On 11 April 2018 at 12:15, Patrick Bellasi wrote:
>> On 11-Apr 08:57, Vincent Guittot wrote:
>>> On 10 April 2018 at 13:04, Patrick Bellasi wrote:
>>>> On 09-Apr 10:51, Vincent Guittot wrote:
>>>>> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
>>>>> Peter,
>>>>> what was your goal with adding the condition "if
>>>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>>>
>>>> The original intent was to get rid of the sched class flags, used
>>>> to track which class has tasks runnable, from within schedutil.
>>>> The reason was to solve some misalignment between scheduler class
>>>> status and schedutil status.
>>>
>>> This was mainly for RT tasks, but it was not the case for cfs tasks
>>> before commit 8f111bc357aa.
>>
>> True, but with his solution Peter has actually come up with a
>> unified interface which is now (and can be, IMO) based just on
>> RUNNABLE counters for each class.
>
> But do we really want to only take care of runnable counters for all
> classes?

Perhaps. Once we have PELT RT support with your patches, we can
consider blocked utilization also for those tasks...

However, we can also argue that a policy where we trigger updates based
on RUNNABLE counters, and it's then up to the schedutil policy to
decide for how long to ignore a frequency drop, using a step-down
holding timer similar to what we already have, can also be a possible
solution.

I also kind-of see a possible interesting per-task tuning of such a
policy. Meaning that, for example, for certain tasks we want to use a
longer down-scale throttling time, which can instead be shorter if only
"background" tasks are currently active.

>>>> The solution, initially suggested by Viresh, and finally proposed
>>>> by Peter, was to exploit RQ knowledge directly from within
>>>> schedutil.
>>>>
>>>> The problem is that schedutil updates now depend on two pieces of
>>>> information: utilization changes and the number of RT and CFS
>>>> runnable tasks.
>>>>
>>>> Thus, using cfs_rq::h_nr_running is not the problem... it's
>>>> actually part of a much cleaner solution than the code we used to
>>>> have.
>>>
>>> So there are 2 problems there:
>>> - using cfs_rq::h_nr_running when aggregating cfs utilization,
>>>   which generates a lot of frequency drops
>>
>> You mean because we now completely disregard the blocked utilization
>> when a CPU is idle, right?
>
> yes
>
>> Given how PELT works and the recent support for updating IDLE CPUs,
>> we should probably always add contributions for the CFS class.
>>
>>> - making sure that the nr-running are up-to-date when used in
>>>   sched_util
>>
>> Right... but, if we always add the cfs_rq (to always account for
>> blocked utilization), we no longer have this last dependency, do we?
>
> yes
>
>> We still have to account for the util_est dependency.
>>
>> Should I add a patch to this series to disregard
>> cfs_rq::h_nr_running from schedutil as you suggested?
>
> It's probably better to have a separate patch, as these are 2
> different topics:
> - when to update cfs_rq::h_nr_running and when to call
>   cpufreq_update_util
> - should we use runnable or running utilization for CFS

Yes, well... since OSPM is just next week, we can also have a better
discussion there and decide by then.

What is true so far is that using RUNNABLE is a change with respect to
the previous behavior which unfortunately went unnoticed so far.

--
#include

Patrick Bellasi
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11 April 2018 at 12:15, Patrick Bellasi wrote:
> On 11-Apr 08:57, Vincent Guittot wrote:
>> On 10 April 2018 at 13:04, Patrick Bellasi wrote:
>>> On 09-Apr 10:51, Vincent Guittot wrote:
>>>> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
>>>> Peter,
>>>> what was your goal with adding the condition "if
>>>> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
>>>
>>> The original intent was to get rid of the sched class flags, used to
>>> track which class has tasks runnable, from within schedutil. The
>>> reason was to solve some misalignment between scheduler class status
>>> and schedutil status.
>>
>> This was mainly for RT tasks, but it was not the case for cfs tasks
>> before commit 8f111bc357aa.
>
> True, but with his solution Peter has actually come up with a unified
> interface which is now (and can be, IMO) based just on RUNNABLE
> counters for each class.

But do we really want to only take care of runnable counters for all
classes?

>>> The solution, initially suggested by Viresh, and finally proposed by
>>> Peter, was to exploit RQ knowledge directly from within schedutil.
>>>
>>> The problem is that schedutil updates now depend on two pieces of
>>> information: utilization changes and the number of RT and CFS
>>> runnable tasks.
>>>
>>> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
>>> part of a much cleaner solution than the code we used to have.
>>
>> So there are 2 problems there:
>> - using cfs_rq::h_nr_running when aggregating cfs utilization, which
>>   generates a lot of frequency drops
>
> You mean because we now completely disregard the blocked utilization
> when a CPU is idle, right?

yes

> Given how PELT works and the recent support for updating IDLE CPUs, we
> should probably always add contributions for the CFS class.
>
>> - making sure that the nr-running are up-to-date when used in
>>   sched_util
>
> Right... but, if we always add the cfs_rq (to always account for
> blocked utilization), we no longer have this last dependency, do we?

yes

> We still have to account for the util_est dependency.
>
> Should I add a patch to this series to disregard cfs_rq::h_nr_running
> from schedutil as you suggested?

It's probably better to have a separate patch, as these are 2 different
topics:
- when to update cfs_rq::h_nr_running and when to call
  cpufreq_update_util
- should we use runnable or running utilization for CFS

Vincent

>>> The problem, IMO, is that we now depend on other information which
>>> needs to be in sync before calling schedutil... and the patch I
>>> proposed is meant to make it less likely that all the required
>>> information is misaligned (also in the future).
>>>
>>> --
>>> #include
>>>
>>> Patrick Bellasi
>
> --
> #include
>
> Patrick Bellasi
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11-Apr 08:57, Vincent Guittot wrote: > On 10 April 2018 at 13:04, Patrick Bellasi wrote: > > On 09-Apr 10:51, Vincent Guittot wrote: > >> On 6 April 2018 at 19:28, Patrick Bellasi wrote: > >> Peter, > >> what was your goal with adding the condition "if > >> (rq->cfs.h_nr_running)" for the aggragation of CFS utilization > > > > The original intent was to get rid of sched class flags, used to track > > which class has tasks runnable from within schedutil. The reason was > > to solve some misalignment between scheduler class status and > > schedutil status. > > This was mainly for RT tasks but it was not the case for cfs task > before commit 8f111bc357aa True, but with his solution Peter has actually come up with a unified interface which is now (and can be IMO) based just on RUNNABLE counters for each class. > > The solution, initially suggested by Viresh, and finally proposed by > > Peter was to exploit RQ knowledges directly from within schedutil. > > > > The problem is that now schedutil updated depends on two information: > > utilization changes and number of RT and CFS runnable tasks. > > > > Thus, using cfs_rq::h_nr_running is not the problem... it's actually > > part of a much more clean solution of the code we used to have. > > So there are 2 problems there: > - using cfs_rq::h_nr_running when aggregating cfs utilization which > generates a lot of frequency drop You mean because we now completely disregard the blocked utilization where a CPU is idle, right? Given how PELT works and the recent support for IDLE CPUs updated, we should probably always add contributions for the CFS class. > - making sure that the nr-running are up-to-date when used in sched_util Right... but, if we always add the cfs_rq (to always account for blocked utilization), we don't have anymore this last dependency, isn't it? We still have to account for the util_est dependency. Should I add a patch to this series to disregard cfs_rq::h_nr_running from schedutil as you suggested? 
> > The problem, IMO, is that we now depend on other information which needs to be in sync before calling schedutil... and the patch I proposed is meant to make it less likely that all the required information is not aligned (also in the future).

--
Patrick Bellasi
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 11-Apr 09:57, Vincent Guittot wrote:
> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
> > 	}
> > @@ -5454,8 +5441,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  		update_cfs_group(se);
> >  	}
> >
> > -	if (!se)
> > +	/* The task is no longer visible from the root cfs_rq */
> > +	if (!se) {
> >  		sub_nr_running(rq, 1);
> > +		cpufreq_update_util(rq, 0);
>
> call to cpufreq_update_util() should be done after util_est_dequeue()

Yeah... good point, looks like I should have noticed it :)

That's another compelling example of why updating schedutil as a side effect of update_load_avg() does not make it easy to track when it's the best time to trigger an update.

util_est is now a factor which can impact OPP selection for CFS tasks, and thus we should trigger the update at the very end of the function.

> > +	}
> >
> >  	util_est_dequeue(&rq->cfs, p, task_sleep);
> >  	hrtick_update(rq);

--
Patrick Bellasi
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 6 April 2018 at 19:28, Patrick Bellasi wrote:
> 	}
> @@ -5454,8 +5441,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		update_cfs_group(se);
>  	}
>
> -	if (!se)
> +	/* The task is no longer visible from the root cfs_rq */
> +	if (!se) {
>  		sub_nr_running(rq, 1);
> +		cpufreq_update_util(rq, 0);

call to cpufreq_update_util() should be done after util_est_dequeue()

> +	}
>
>  	util_est_dequeue(&rq->cfs, p, task_sleep);
>  	hrtick_update(rq);
Re: [PATCH] sched/fair: schedutil: update only with all info available
On 10 April 2018 at 13:04, Patrick Bellasi wrote:
> Hi Vincent,
>
> On 09-Apr 10:51, Vincent Guittot wrote:
>> Hi Patrick
>>
>> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
>> > Schedutil is not properly updated when the first FAIR task wakes up on a CPU and when a RQ is (un)throttled. This is mainly due to the current integration strategy, which relies on updates being triggered implicitly each time a cfs_rq's utilization is updated.
>> >
>> > Those updates are currently provided (mainly) via cfs_rq_util_change() which is used in:
>> > - update_cfs_rq_load_avg(), when the utilization of a cfs_rq is updated
>> > - {attach,detach}_entity_load_avg()
>> > This is done based on the idea that "we should callback schedutil frequently enough" to properly update the CPU frequency at every utilization change.
>> >
>> > Since this recent schedutil update:
>> >
>> > commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
>> >
>> > we use additional RQ information to properly account for FAIR tasks utilization. Specifically, cfs_rq::h_nr_running has to be non-zero in sugov_aggregate_util() to sum up the cfs_rq's utilization.
>>
>> Isn't the use of cfs_rq::h_nr_running the root cause of the problem?
>
> Not really...
>
>> I can now see a lot of frequency changes on my hikey with this new condition in sugov_aggregate_util(). With an rt-app use case that creates a periodic cfs task, I have a lot of frequency changes instead of staying at the same frequency
>
> I don't remember a similar behavior... but I'll check better.

I have discovered this behavior quite recently while preparing OSPM

>> Peter, what was your goal with adding the condition "if (rq->cfs.h_nr_running)" for the aggregation of CFS utilization
>
> The original intent was to get rid of sched class flags, used to track which class has tasks runnable from within schedutil.
> The reason was to solve some misalignment between scheduler class status and schedutil status.

This was mainly for RT tasks, but it was not the case for CFS tasks before commit 8f111bc357aa

> The solution, initially suggested by Viresh, and finally proposed by Peter, was to exploit RQ knowledge directly from within schedutil.
>
> The problem is that schedutil updates now depend on two pieces of information: utilization changes and the number of RT and CFS runnable tasks.
>
> Thus, using cfs_rq::h_nr_running is not the problem... it's actually part of a much cleaner solution than the code we used to have.

So there are 2 problems there:
- using cfs_rq::h_nr_running when aggregating cfs utilization, which generates a lot of frequency drops
- making sure that the nr_running counters are up-to-date when used in schedutil

> The problem, IMO, is that we now depend on other information which needs to be in sync before calling schedutil... and the patch I proposed is meant to make it less likely that all the required information is not aligned (also in the future).
Re: [PATCH] sched/fair: schedutil: update only with all info available
Hi Joel,

On 06-Apr 16:48, Joel Fernandes wrote:
> On Fri, Apr 6, 2018 at 10:28 AM, Patrick Bellasi wrote:
> > Schedutil is not properly updated when the first FAIR task wakes up on a CPU and when a RQ is (un)throttled. This is mainly due to the current integration strategy, which relies on updates being triggered implicitly each time a cfs_rq's utilization is updated.
>
> To me that's not implicit, but is explicitly done when the util is updated. It seems that's the logical place..

I'm not arguing that the place is not logical, it makes perfect sense. IMO it just makes it more difficult to track dependencies like the one we have now: when the root cfs_rq utilization is updated, h_nr_running is not always aligned.

Moreover, when task groups are in use, many times we call a wrapper function which just bails out, since we are updating a non-root cfs_rq. Other reasons have also been detailed in the changelog.

I've noticed that we already have pretty well defined call sites in fair.c where we update the core nr_running counter. These are also the exact and only places where the root cfs_rq utilization changes, apart from the tick.

What I'm proposing here is meant not only to fix the current issue, but also to possibly find a code organization which makes it less likely to miss dependencies in possible future code updates too.

To me it looks cleaner and, while I still have to look better at the code, I think this can be a first step to possibly factor all schedutil updates (for FAIR and RT) into just core.c

[...]

> > This means that we are likely to see a zero cfs_rq utilization when we enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when instead, for example, we are throttling all the FAIR tasks of a CPU.
> >
> > While the second issue is less important, since we are less likely to
>
> Actually I wanted behavior like the second issue, because that means frequency will not be dropped if the CPU is about to idle, which is similar to a patch I sent a long time ago (skip freq update if CPU about to idle).

I think that's a slightly different case, since here a cfs_rq is intentionally throttled, thus we want to stop the tasks and potentially drop the frequency.

> > reduce frequency when CPU utilization decreases, the first issue can instead impact performance. Indeed, we potentially introduce an undesired
>
> This issue would be fixed by just fixing the h_nr_running bug right?

Sure, something like this:

---8<---
index 0951d1c58d2f..e9a31258d345 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5368,6 +5368,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running++;
 		enqueue_entity(cfs_rq, se, flags);
 
 		/*
@@ -5377,8 +5378,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		 * post the final h_nr_running increment below.
 		 */
 		if (cfs_rq_throttled(cfs_rq))
+			cfs_rq->h_nr_running--;
 			break;
-		cfs_rq->h_nr_running++;
+		}
 
 		flags = ENQUEUE_WAKEUP;
 	}
---8<---

can probably easily fix one of the problems. But I did not check what it implies to increment h_nr_running before calling enqueue_entity() and cfs_rq_throttled().

Still, we miss the (IMO) opportunity to make a more suitable single and explicit function call only when needed, which is also just a few code lines below in the same function, as proposed by this patch.

> > latency between a task enqueue on a CPU and its frequency increase.
> >
> > Another possible unwanted side effect is the iowait boosting of a CPU when we enqueue a task into a throttled cfs_rq.
>
> Probably very very rare.

Still possible by code... and when you notice it, it's clearly not a wanted behavior.
> > Moreover, the current schedutil integration has these other downsides:
> >
> > - schedutil updates are triggered by RQ's load updates, which makes sense in general but it does not allow to know exactly which other RQ related information has been updated (e.g. h_nr_running).
>
> It seems broken that all information that schedutil needs isn't updated _before_ the cpufreq update request, so that should be fixed instead?

That's what I'm trying to do here but, instead of doing some operations earlier, I'm proposing to postpone what can be done later. Schedutil updates can be done right before returning to the core scheduler, at which time we should be pretty sure all CFS related info has been properly updated.

[...]

> > All the above considered, let's try to make schedutil updates more explicit in fair.c by:
> >
> > - removing the cfs_rq_util_change() wrapper function to use the cpufreq_update_util() public API only when the root cfs_rq
Re: [PATCH] sched/fair: schedutil: update only with all info available
Hi Vincent,

On 09-Apr 10:51, Vincent Guittot wrote:
> Hi Patrick
>
> On 6 April 2018 at 19:28, Patrick Bellasi wrote:
> > Schedutil is not properly updated when the first FAIR task wakes up on a CPU and when a RQ is (un)throttled. This is mainly due to the current integration strategy, which relies on updates being triggered implicitly each time a cfs_rq's utilization is updated.
> >
> > Those updates are currently provided (mainly) via cfs_rq_util_change() which is used in:
> > - update_cfs_rq_load_avg(), when the utilization of a cfs_rq is updated
> > - {attach,detach}_entity_load_avg()
> > This is done based on the idea that "we should callback schedutil frequently enough" to properly update the CPU frequency at every utilization change.
> >
> > Since this recent schedutil update:
> >
> > commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
> >
> > we use additional RQ information to properly account for FAIR tasks utilization. Specifically, cfs_rq::h_nr_running has to be non-zero in sugov_aggregate_util() to sum up the cfs_rq's utilization.
>
> Isn't the use of cfs_rq::h_nr_running the root cause of the problem?

Not really...

> I can now see a lot of frequency changes on my hikey with this new condition in sugov_aggregate_util(). With an rt-app use case that creates a periodic cfs task, I have a lot of frequency changes instead of staying at the same frequency

I don't remember a similar behavior... but I'll check better.

> Peter, what was your goal with adding the condition "if (rq->cfs.h_nr_running)" for the aggregation of CFS utilization

The original intent was to get rid of sched class flags, used to track which class has tasks runnable from within schedutil. The reason was to solve some misalignment between scheduler class status and schedutil status. The solution, initially suggested by Viresh, and finally proposed by Peter, was to exploit RQ knowledge directly from within schedutil.
The problem is that schedutil updates now depend on two pieces of information: utilization changes and the number of RT and CFS runnable tasks.

Thus, using cfs_rq::h_nr_running is not the problem... it's actually part of a much cleaner solution than the code we used to have.

The problem, IMO, is that we now depend on other information which needs to be in sync before calling schedutil... and the patch I proposed is meant to make it less likely that all the required information is not aligned (also in the future).

--
Patrick Bellasi
Re: [PATCH] sched/fair: schedutil: update only with all info available
Hi Patrick

On 6 April 2018 at 19:28, Patrick Bellasi wrote:
> Schedutil is not properly updated when the first FAIR task wakes up on a CPU and when a RQ is (un)throttled. This is mainly due to the current integration strategy, which relies on updates being triggered implicitly each time a cfs_rq's utilization is updated.
>
> Those updates are currently provided (mainly) via cfs_rq_util_change() which is used in:
> - update_cfs_rq_load_avg(), when the utilization of a cfs_rq is updated
> - {attach,detach}_entity_load_avg()
> This is done based on the idea that "we should callback schedutil frequently enough" to properly update the CPU frequency at every utilization change.
>
> Since this recent schedutil update:
>
> commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
>
> we use additional RQ information to properly account for FAIR tasks utilization. Specifically, cfs_rq::h_nr_running has to be non-zero in sugov_aggregate_util() to sum up the cfs_rq's utilization.

Isn't the use of cfs_rq::h_nr_running the root cause of the problem?

I can now see a lot of frequency changes on my hikey with this new condition in sugov_aggregate_util(). With an rt-app use case that creates a periodic cfs task, I have a lot of frequency changes instead of staying at the same frequency.

Peter, what was your goal with adding the condition "if (rq->cfs.h_nr_running)" for the aggregation of CFS utilization

Thanks
Vincent

> However, cfs_rq::h_nr_running is usually updated as:
>
> enqueue_entity()
>   ...
>   update_load_avg()
>     ...
>     cfs_rq_util_change() ==> trigger schedutil update
>   ...
>   cfs_rq->h_nr_running += number_of_tasks
>
> both in enqueue_task_fair() as well as in unthrottle_cfs_rq(). A similar pattern is used also in dequeue_task_fair() and throttle_cfs_rq() to remove tasks.
>
> This means that we are likely to see a zero cfs_rq utilization when we enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when instead, for example, we are throttling all the FAIR tasks of a CPU.
>
> While the second issue is less important, since we are less likely to reduce frequency when CPU utilization decreases, the first issue can instead impact performance. Indeed, we potentially introduce an undesired latency between a task enqueue on a CPU and its frequency increase.
>
> Another possible unwanted side effect is the iowait boosting of a CPU when we enqueue a task into a throttled cfs_rq.
>
> Moreover, the current schedutil integration has these other downsides:
>
> - schedutil updates are triggered by RQ's load updates, which makes sense in general but it does not allow to know exactly which other RQ related information has been updated (e.g. h_nr_running).
>
> - increasing the chances to update schedutil does not always correspond to providing the most accurate information for a proper frequency selection, thus we can skip some updates.
>
> - we don't know exactly at which point a schedutil update is triggered, and thus potentially a frequency change started, and that's because the update is a side effect of cfs_rq_util_change() instead of an explicit call from the most suitable call path.
>
> - cfs_rq_util_change() is mainly a wrapper function for an already existing "public API", cpufreq_update_util(), to ensure we actually update schedutil only when we are updating a root RQ. Thus, especially when task groups are in use, most of the calls to this wrapper function are really not required.
>
> - the usage of a wrapper function is not completely consistent across fair.c, since we still need sometimes additional explicit calls to cpufreq_update_util(), for example to support the IOWAIT boost flag in the wakeup path
>
> - it makes it hard to integrate new features since it could require changing other function prototypes just to pass in an additional flag, as it happened for example here:
>
> commit ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
>
> All the above considered, let's try to make schedutil updates more explicit in fair.c by:
>
> - removing the cfs_rq_util_change() wrapper function to use the cpufreq_update_util() public API only when the root cfs_rq is updated
>
> - remove indirect and side-effect (sometimes not required) schedutil updates when the cfs_rq utilization is updated
>
> - call cpufreq_update_util() explicitly in the few call sites where it really makes sense and all the required information has been updated
>
> By doing so, this patch mainly removes code and adds explicit calls to schedutil only when we:
> - {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
> - (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
> - task_tick_fair() to update the utilization of the root cfs_rq
>
> All the other code paths, currently _indirectly_ covered by a call to update_load_avg()
Re: [PATCH] sched/fair: schedutil: update only with all info available
On Fri, Apr 6, 2018 at 10:28 AM, Patrick Bellasi wrote:
> Schedutil is not properly updated when the first FAIR task wakes up on a CPU and when a RQ is (un)throttled. This is mainly due to the current integration strategy, which relies on updates being triggered implicitly each time a cfs_rq's utilization is updated.

To me that's not implicit, but is explicitly done when the util is updated. It seems that's the logical place..

> Those updates are currently provided (mainly) via cfs_rq_util_change() which is used in:
> - update_cfs_rq_load_avg(), when the utilization of a cfs_rq is updated
> - {attach,detach}_entity_load_avg()
> This is done based on the idea that "we should callback schedutil frequently enough" to properly update the CPU frequency at every

This also I didn't get, it's not called "frequently enough" but at the right time (when the util is updated).

> utilization change.
>
> Since this recent schedutil update:
>
> commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
>
> we use additional RQ information to properly account for FAIR tasks utilization. Specifically, cfs_rq::h_nr_running has to be non-zero in sugov_aggregate_util() to sum up the cfs_rq's utilization.
>
> However, cfs_rq::h_nr_running is usually updated as:
>
> enqueue_entity()
>   ...
>   update_load_avg()
>     ...
>     cfs_rq_util_change() ==> trigger schedutil update
>   ...
>   cfs_rq->h_nr_running += number_of_tasks
>
> both in enqueue_task_fair() as well as in unthrottle_cfs_rq(). A similar pattern is used also in dequeue_task_fair() and throttle_cfs_rq() to remove tasks.

Maybe this should be fixed then? It seems broken to begin with...

> This means that we are likely to see a zero cfs_rq utilization when we enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when instead, for example, we are throttling all the FAIR tasks of a CPU.
>
> While the second issue is less important, since we are less likely to

Actually I wanted behavior like the second issue, because that means frequency will not be dropped if the CPU is about to idle, which is similar to a patch I sent a long time ago (skip freq update if CPU about to idle).

> reduce frequency when CPU utilization decreases, the first issue can instead impact performance. Indeed, we potentially introduce an undesired

This issue would be fixed by just fixing the h_nr_running bug right?

> latency between a task enqueue on a CPU and its frequency increase.
>
> Another possible unwanted side effect is the iowait boosting of a CPU when we enqueue a task into a throttled cfs_rq.

Probably very very rare.

> Moreover, the current schedutil integration has these other downsides:
>
> - schedutil updates are triggered by RQ's load updates, which makes sense in general but it does not allow to know exactly which other RQ related information has been updated (e.g. h_nr_running).

It seems broken that all information that schedutil needs isn't updated _before_ the cpufreq update request, so that should be fixed instead?

> - increasing the chances to update schedutil does not always correspond to providing the most accurate information for a proper frequency selection, thus we can skip some updates.

Again, IMO it's just updated at the right time already, not just frequently enough...

> - we don't know exactly at which point a schedutil update is triggered, and thus potentially a frequency change started, and that's because the update is a side effect of cfs_rq_util_change() instead of an explicit call from the most suitable call path.
>
> - cfs_rq_util_change() is mainly a wrapper function for an already existing "public API", cpufreq_update_util(), to ensure we actually update schedutil only when we are updating a root RQ. Thus, especially when task groups are in use, most of the calls to this wrapper function are really not required.
>
> - the usage of a wrapper function is not completely consistent across fair.c, since we still need sometimes additional explicit calls to cpufreq_update_util(), for example to support the IOWAIT boost flag in the wakeup path
>
> - it makes it hard to integrate new features since it could require changing other function prototypes just to pass in an additional flag, as it happened for example here:
>
> commit ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
>
> All the above considered, let's try to make schedutil updates more explicit in fair.c by:
>
> - removing the cfs_rq_util_change() wrapper function to use the cpufreq_update_util() public API only when the root cfs_rq is updated
>
> - remove indirect and side-effect (sometimes not required) schedutil updates when the cfs_rq utilization is updated
>
> - call cpufreq_update_util() explicitly in the few call sites where it really makes sense and all the required information has been updated
[PATCH] sched/fair: schedutil: update only with all info available
Schedutil is not properly updated when the first FAIR task wakes up on a CPU and when a RQ is (un)throttled. This is mainly due to the current integration strategy, which relies on updates being triggered implicitly each time a cfs_rq's utilization is updated.

Those updates are currently provided (mainly) via cfs_rq_util_change() which is used in:
- update_cfs_rq_load_avg(), when the utilization of a cfs_rq is updated
- {attach,detach}_entity_load_avg()
This is done based on the idea that "we should callback schedutil frequently enough" to properly update the CPU frequency at every utilization change.

Since this recent schedutil update:

commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")

we use additional RQ information to properly account for FAIR tasks utilization. Specifically, cfs_rq::h_nr_running has to be non-zero in sugov_aggregate_util() to sum up the cfs_rq's utilization.

However, cfs_rq::h_nr_running is usually updated as:

enqueue_entity()
  ...
  update_load_avg()
    ...
    cfs_rq_util_change() ==> trigger schedutil update
  ...
  cfs_rq->h_nr_running += number_of_tasks

both in enqueue_task_fair() as well as in unthrottle_cfs_rq(). A similar pattern is used also in dequeue_task_fair() and throttle_cfs_rq() to remove tasks.

This means that we are likely to see a zero cfs_rq utilization when we enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when instead, for example, we are throttling all the FAIR tasks of a CPU.

While the second issue is less important, since we are less likely to reduce frequency when CPU utilization decreases, the first issue can instead impact performance. Indeed, we potentially introduce an undesired latency between a task enqueue on a CPU and its frequency increase.

Another possible unwanted side effect is the iowait boosting of a CPU when we enqueue a task into a throttled cfs_rq.
Moreover, the current schedutil integration has these other downsides:

- schedutil updates are triggered by RQ's load updates, which makes sense in general but it does not allow to know exactly which other RQ related information has been updated (e.g. h_nr_running).

- increasing the chances to update schedutil does not always correspond to providing the most accurate information for a proper frequency selection, thus we can skip some updates.

- we don't know exactly at which point a schedutil update is triggered, and thus potentially a frequency change started, and that's because the update is a side effect of cfs_rq_util_change() instead of an explicit call from the most suitable call path.

- cfs_rq_util_change() is mainly a wrapper function for an already existing "public API", cpufreq_update_util(), to ensure we actually update schedutil only when we are updating a root RQ. Thus, especially when task groups are in use, most of the calls to this wrapper function are really not required.

- the usage of a wrapper function is not completely consistent across fair.c, since we still need sometimes additional explicit calls to cpufreq_update_util(), for example to support the IOWAIT boost flag in the wakeup path

- it makes it hard to integrate new features since it could require changing other function prototypes just to pass in an additional flag, as it happened for example here:

commit ea14b57e8a18 ("sched/cpufreq: Provide migration hint")

All the above considered, let's try to make schedutil updates more explicit in fair.c by:

- removing the cfs_rq_util_change() wrapper function to use the cpufreq_update_util() public API only when the root cfs_rq is updated

- remove indirect and side-effect (sometimes not required) schedutil updates when the cfs_rq utilization is updated

- call cpufreq_update_util() explicitly in the few call sites where it really makes sense and all the required information has been updated

By doing so, this patch mainly removes code and adds explicit
calls to schedutil only when we:
- {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
- (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
- task_tick_fair() to update the utilization of the root cfs_rq

All the other code paths, currently _indirectly_ covered by a call to update_load_avg(), are also covered by the above three calls.

Some already imply enqueue/dequeue calls:
- switch_{to,from}_fair()
- sched_move_task()

or are followed by enqueue/dequeue calls:
- cpu_cgroup_fork() and post_init_entity_util_avg(): are used at wakeup_new_task() time and thus already followed by an enqueue_task_fair()
- migrate_task_rq_fair(): updates the removed utilization but not the actual cfs_rq utilization, which is updated by a following sched event

This new proposal also allows to better aggregate schedutil related flags, which are required only at enqueue_task_fair() time. Indeed, IOWAIT and MIGRATION flags are now requested only when