On Tue, Jul 4, 2017 at 10:34 AM, Patrick Bellasi <patrick.bell...@arm.com> wrote: > Currently, sg_cpu's flags are set to the value defined by the last call of > the cpufreq_update_util()/cpufreq_update_this_cpu(); for RT/DL classes > this corresponds to the SCHED_CPUFREQ_{RT/DL} flags always being set. > > When multiple CPU shares the same frequency domain it might happen that a > CPU which executed a RT task, right before entering IDLE, has one of the > SCHED_CPUFREQ_RT_DL flags set, permanently, until it exits IDLE. > > Although such an idle CPU is _going to be_ ignored by the > sugov_next_freq_shared(): > 1. this kind of "useless RT requests" are ignored only if more then > TICK_NSEC have elapsed since the last update > 2. we can still potentially trigger an already too late switch to > MAX, which starts also a new throttling interval > 3. the internal state machine is not consistent with what the > scheduler knows, i.e. the CPU is now actually idle > > Thus, in sugov_next_freq_shared(), where utilisation and flags are > aggregated across all the CPUs of a frequency domain, it can turn out > that all the CPUs of that domain can run unnecessary at the maximum OPP > until another event happens in the idle CPU, which eventually clear the > SCHED_CPUFREQ_{RT/DL} flag, or the IDLE CPUs gets ignored after > TICK_NSEC since the CPU entering IDLE. > > Such a behaviour can harm the energy efficiency of systems where RT > workloads are not so frequent and other CPUs in the same frequency > domain are running small utilisation workloads, which is a quite common > scenario in mobile embedded systems. > > This patch proposes a solution which is aligned with the current principle > to update the flags each time a scheduling event happens. The scheduling > of the idle_task on a CPU is considered one of such meaningful events. > That's why when the idle_task is selected for execution we poke the > schedutil policy to reset the flags for that CPU. > > No frequency transitions are activated at that point, which is fair in > case the RT workload should come back in the future. However, this still > allows other CPUs in the same frequency domain to scale down the > frequency in case that should be possible. > > Signed-off-by: Patrick Bellasi <patrick.bell...@arm.com> > Cc: Ingo Molnar <mi...@redhat.com> > Cc: Peter Zijlstra <pet...@infradead.org> > Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com> > Cc: Viresh Kumar <viresh.ku...@linaro.org> > Cc: linux-kernel@vger.kernel.org > Cc: linux...@vger.kernel.org > > --- > Changes from v1: > - added "unlikely()" around the statement (SteveR) > --- > include/linux/sched/cpufreq.h | 1 + > kernel/sched/cpufreq_schedutil.c | 7 +++++++ > kernel/sched/idle_task.c | 4 ++++ > 3 files changed, 12 insertions(+) > > diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h > index d2be2cc..36ac8d2 100644 > --- a/include/linux/sched/cpufreq.h > +++ b/include/linux/sched/cpufreq.h > @@ -10,6 +10,7 @@ > #define SCHED_CPUFREQ_RT (1U << 0) > #define SCHED_CPUFREQ_DL (1U << 1) > #define SCHED_CPUFREQ_IOWAIT (1U << 2) > +#define SCHED_CPUFREQ_IDLE (1U << 3) > > #define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL) > > diff --git a/kernel/sched/cpufreq_schedutil.c > b/kernel/sched/cpufreq_schedutil.c > index eaba6d6..004ae18 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -304,6 +304,12 @@ static void sugov_update_shared(struct update_util_data > *hook, u64 time, > > sg_cpu->util = util; > sg_cpu->max = max; > + > + /* CPU is entering IDLE, reset flags without triggering an update */ > + if (unlikely(flags & SCHED_CPUFREQ_IDLE)) { > + sg_cpu->flags = 0; > + goto done; > + }
Instead of defining a new flag for idle, wouldn't another way be to just clear the flag from the RT scheduling class with an extra call to cpufreq_update_util with flags = 0 during dequeue_rt_entity? That seems to me to be also the right place to clear the flag since the flag is set in the corresponding class to begin with. thanks, -Joel