Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

2018-04-20 Thread Juri Lelli
On 20/04/18 09:31, Quentin Perret wrote:
> On Friday 20 Apr 2018 at 01:14:35 (-0700), Joel Fernandes wrote:
> > On Fri, Apr 20, 2018 at 1:13 AM, Joel Fernandes  wrote:
> > > On Wed, Apr 18, 2018 at 4:17 AM, Quentin Perret  
> > > wrote:
> > >> On Friday 13 Apr 2018 at 16:56:39 (-0700), Joel Fernandes wrote:

[...]

> > >>>
> > >>> I'm wondering if it makes sense to consider scenarios where
> > >>> other classes cause CPUs in the domain to go above the tipping point.
> > >>> In that case too, it makes sense not to do EAS in that domain
> > >>> because of the overutilization.
> > >>>
> > >>> I guess task_fits uses cpu_util, which is PELT-only at the moment...
> > >>> so it may require some other method, like aggregating CFS PELT with
> > >>> RT-PELT and DL running bw or something.
> > >>>
> > >>
> > >> So at the moment in cpu_overutilized() we compare cpu_util() to
> > >> capacity_of() which should include RT and IRQ pressure IIRC. But
> > >> you're right, we might be able to do more here... Perhaps we
> > >> could also use cpu_util_dl() which is available in sched.h now ?
> > >
> > > Yes, should be Ok, and then when RT utilization stuff is available,
> > > then that can be included in the equation as well (probably for now
> > > you could use rt_avg).
> > >
> > > Another crazy idea is to check the contribution of higher classes in
> > > one-shot with (capacity_orig_of - capacity_of) although I think that
> > > method would be less instantaneous/accurate.
> > 
> > Just to add to the last point, the capacity_of also factors in the IRQ
> > contribution if I remember correctly, which is probably a good thing?
> > 
> 
> I think so too yes. But actually, since we compare cpu_util() to
> capacity_of() in cpu_overutilized(), the current implementation should
> already be fairly similar to the "capacity_orig_of - capacity_of"
> implementation you're suggesting I guess.
> And I agree that when Vincent's RT PELT patches get merged we should
> probably use that :-)

Mind that rt_avg contains DL contribution as well ATM

https://elixir.bootlin.com/linux/v4.17-rc1/source/kernel/sched/deadline.c#L1182

So you shouldn't add the newer DL utilization signal to it.

OTOH, using RT PELT (once in) is of course to be preferred.

Best,

- Juri
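
[Editor's note: a minimal userspace sketch of Juri's point above, with made-up
numbers; it is illustrative only, not kernel code. Because rt_avg in v4.17-rc1
is charged from both the RT and the DL class, an aggregate over-utilization
estimate should use either rt_avg alone or, once RT PELT is merged, one signal
per class, but it must not add a separate DL utilization term on top of rt_avg.]

/*
 * Editorial sketch (not kernel code): adding a separate DL utilization
 * term on top of an rt_avg-style signal counts deadline time twice,
 * because rt_avg already folds the DL contribution in.
 */
#include <stdio.h>

int main(void)
{
	unsigned long cfs_util = 300;  /* PELT utilization of CFS tasks (0..1024) */
	unsigned long rt_time  = 100;  /* CPU time consumed by RT tasks           */
	unsigned long dl_time  = 150;  /* CPU time consumed by DL tasks           */

	/* rt_avg-style signal: RT and DL time already folded into one number. */
	unsigned long rt_avg_like = rt_time + dl_time;

	/* Wrong: the DL contribution ends up counted twice. */
	unsigned long double_counted = cfs_util + rt_avg_like + dl_time;

	/* Either use the combined signal alone (what rt_avg gives today) ... */
	unsigned long use_rt_avg = cfs_util + rt_avg_like;

	/* ... or, once RT PELT is merged, use each class's signal exactly once. */
	unsigned long use_per_class = cfs_util + rt_time + dl_time;

	printf("double_counted=%lu use_rt_avg=%lu use_per_class=%lu\n",
	       double_counted, use_rt_avg, use_per_class);
	return 0;
}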


Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

2018-04-20 Thread Quentin Perret
On Friday 20 Apr 2018 at 01:14:35 (-0700), Joel Fernandes wrote:
> On Fri, Apr 20, 2018 at 1:13 AM, Joel Fernandes  wrote:
> > On Wed, Apr 18, 2018 at 4:17 AM, Quentin Perret  
> > wrote:
> >> On Friday 13 Apr 2018 at 16:56:39 (-0700), Joel Fernandes wrote:
> >>> Hi,
> >>>
> >>> On Fri, Apr 6, 2018 at 8:36 AM, Dietmar Eggemann
> >>>  wrote:
> >>> > From: Thara Gopinath 
> >>> >
> >>> > Energy-aware scheduling should only operate when the system is not
> >>> > overutilized. There must be cpu time available to place tasks based on
> >>> > utilization in an energy-aware fashion, i.e. to pack tasks on
> >>> > energy-efficient cpus without harming the overall throughput.
> >>> >
> >>> > In case the system operates above this tipping point the tasks have to
> >>> > be placed based on task and cpu load in the classical way of spreading
> >>> > tasks across as many cpus as possible.
> >>> >
> >>> > The point in which a system switches from being not overutilized to
> >>> > being overutilized is called the tipping point.
> >>> >
> >>> > Such a tipping point indicator on a sched domain as the system
> >>> > boundary is introduced here. As soon as one cpu of a sched domain is
> >>> > overutilized the whole sched domain is declared overutilized as well.
> >>> > A cpu becomes overutilized when its utilization is higher than 80%
> >>> > (capacity_margin) of its capacity.
> >>> >
> >>> > The implementation takes advantage of the shared sched domain which is
> >>> > shared across all per-cpu views of a sched domain level. The new
> >>> > overutilized flag is placed in this shared sched domain.
> >>> >
> >>> > Load balancing is skipped in case the energy model is present and the
> >>> > sched domain is not overutilized because under this condition the
> >>> > predominantly load-per-capacity driven load-balancer should not
> >>> > interfere with the energy-aware wakeup placement based on utilization.
> >>> >
> >>> > In case the total utilization of a sched domain is greater than the
> >>> > total sched domain capacity the overutilized flag is set at the parent
> >>> > sched domain level to let other sched groups help getting rid of the
> >>> > overutilization of cpus.
> >>> >
> >>> > Signed-off-by: Thara Gopinath 
> >>> > Signed-off-by: Dietmar Eggemann 
> >>> > ---
> >>> >  include/linux/sched/topology.h |  1 +
> >>> >  kernel/sched/fair.c| 62 
> >>> > --
> >>> >  kernel/sched/sched.h   |  1 +
> >>> >  kernel/sched/topology.c| 12 +++-
> >>> >  4 files changed, 65 insertions(+), 11 deletions(-)
> >>> >
> >>> > diff --git a/include/linux/sched/topology.h 
> >>> > b/include/linux/sched/topology.h
> >>> > index 26347741ba50..dd001c232646 100644
> >>> > --- a/include/linux/sched/topology.h
> >>> > +++ b/include/linux/sched/topology.h
> >>> > @@ -72,6 +72,7 @@ struct sched_domain_shared {
> >>> > atomic_t ref;
> >>> > atomic_t nr_busy_cpus;
> >>> > int has_idle_cores;
> >>> > +   int overutilized;
> >>> >  };
> >>> >
> >>> >  struct sched_domain {
> >>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> > index 0a76ad2ef022..6960e5ef3c14 100644
> >>> > --- a/kernel/sched/fair.c
> >>> > +++ b/kernel/sched/fair.c
> >>> > @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
> >>> >  }
> >>> >  #endif
> >>> >
> >>> > +#ifdef CONFIG_SMP
> >>> > +static inline int cpu_overutilized(int cpu);
> >>> > +
> >>> > +static inline int sd_overutilized(struct sched_domain *sd)
> >>> > +{
> >>> > +   return READ_ONCE(sd->shared->overutilized);
> >>> > +}
> >>> > +
> >>> > +static inline void update_overutilized_status(struct rq *rq)
> >>> > +{
> >>> > +   struct sched_domain *sd;
> >>> > +
> >>> > +   rcu_read_lock();
> >>> > +   sd = rcu_dereference(rq->sd);
> >>> > +   if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
> >>> > +   WRITE_ONCE(sd->shared->overutilized, 1);
> >>> > +   rcu_read_unlock();
> >>> > +}
> >>> > +#else
> >>> > +static inline void update_overutilized_status(struct rq *rq) {}
> >>> > +#endif /* CONFIG_SMP */
> >>> > +
> >>> >  /*
> >>> >   * The enqueue_task method is called before nr_running is
> >>> >   * increased. Here we update the fair scheduling stats and
> >>> > @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct 
> >>> > task_struct *p, int flags)
> >>> > update_cfs_group(se);
> >>> > }
> >>> >
> >>> > -   if (!se)
> >>> > +   if (!se) {
> >>> > add_nr_running(rq, 1);
> >>> > +   update_overutilized_status(rq);
> >>> > +   }
> >>>
> > >>> I'm wondering if it makes sense to consider scenarios where
> >>> other classes cause CPUs in the domain to go above the tipping point.
> >>> 

Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

2018-04-20 Thread Joel Fernandes
On Fri, Apr 20, 2018 at 1:13 AM, Joel Fernandes  wrote:
> On Wed, Apr 18, 2018 at 4:17 AM, Quentin Perret  
> wrote:
>> On Friday 13 Apr 2018 at 16:56:39 (-0700), Joel Fernandes wrote:
>>> Hi,
>>>
>>> On Fri, Apr 6, 2018 at 8:36 AM, Dietmar Eggemann
>>>  wrote:
>>> > From: Thara Gopinath 
>>> >
>>> > Energy-aware scheduling should only operate when the system is not
>>> > overutilized. There must be cpu time available to place tasks based on
>>> > utilization in an energy-aware fashion, i.e. to pack tasks on
>>> > energy-efficient cpus without harming the overall throughput.
>>> >
>>> > In case the system operates above this tipping point the tasks have to
>>> > be placed based on task and cpu load in the classical way of spreading
>>> > tasks across as many cpus as possible.
>>> >
>>> > The point in which a system switches from being not overutilized to
>>> > being overutilized is called the tipping point.
>>> >
>>> > Such a tipping point indicator on a sched domain as the system
>>> > boundary is introduced here. As soon as one cpu of a sched domain is
>>> > overutilized the whole sched domain is declared overutilized as well.
>>> > A cpu becomes overutilized when its utilization is higher than 80%
>>> > (capacity_margin) of its capacity.
>>> >
>>> > The implementation takes advantage of the shared sched domain which is
>>> > shared across all per-cpu views of a sched domain level. The new
>>> > overutilized flag is placed in this shared sched domain.
>>> >
>>> > Load balancing is skipped in case the energy model is present and the
>>> > sched domain is not overutilized because under this condition the
>>> > predominantly load-per-capacity driven load-balancer should not
>>> > interfere with the energy-aware wakeup placement based on utilization.
>>> >
>>> > In case the total utilization of a sched domain is greater than the
>>> > total sched domain capacity the overutilized flag is set at the parent
>>> > sched domain level to let other sched groups help getting rid of the
>>> > overutilization of cpus.
>>> >
>>> > Signed-off-by: Thara Gopinath 
>>> > Signed-off-by: Dietmar Eggemann 
>>> > ---
>>> >  include/linux/sched/topology.h |  1 +
>>> >  kernel/sched/fair.c| 62 
>>> > --
>>> >  kernel/sched/sched.h   |  1 +
>>> >  kernel/sched/topology.c| 12 +++-
>>> >  4 files changed, 65 insertions(+), 11 deletions(-)
>>> >
>>> > diff --git a/include/linux/sched/topology.h 
>>> > b/include/linux/sched/topology.h
>>> > index 26347741ba50..dd001c232646 100644
>>> > --- a/include/linux/sched/topology.h
>>> > +++ b/include/linux/sched/topology.h
>>> > @@ -72,6 +72,7 @@ struct sched_domain_shared {
>>> > atomic_t ref;
>>> > atomic_t nr_busy_cpus;
>>> > int has_idle_cores;
>>> > +   int overutilized;
>>> >  };
>>> >
>>> >  struct sched_domain {
>>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> > index 0a76ad2ef022..6960e5ef3c14 100644
>>> > --- a/kernel/sched/fair.c
>>> > +++ b/kernel/sched/fair.c
>>> > @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
>>> >  }
>>> >  #endif
>>> >
>>> > +#ifdef CONFIG_SMP
>>> > +static inline int cpu_overutilized(int cpu);
>>> > +
>>> > +static inline int sd_overutilized(struct sched_domain *sd)
>>> > +{
>>> > +   return READ_ONCE(sd->shared->overutilized);
>>> > +}
>>> > +
>>> > +static inline void update_overutilized_status(struct rq *rq)
>>> > +{
>>> > +   struct sched_domain *sd;
>>> > +
>>> > +   rcu_read_lock();
>>> > +   sd = rcu_dereference(rq->sd);
>>> > +   if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
>>> > +   WRITE_ONCE(sd->shared->overutilized, 1);
>>> > +   rcu_read_unlock();
>>> > +}
>>> > +#else
>>> > +static inline void update_overutilized_status(struct rq *rq) {}
>>> > +#endif /* CONFIG_SMP */
>>> > +
>>> >  /*
>>> >   * The enqueue_task method is called before nr_running is
>>> >   * increased. Here we update the fair scheduling stats and
>>> > @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct 
>>> > task_struct *p, int flags)
>>> > update_cfs_group(se);
>>> > }
>>> >
>>> > -   if (!se)
>>> > +   if (!se) {
>>> > add_nr_running(rq, 1);
>>> > +   update_overutilized_status(rq);
>>> > +   }
>>>
>>> I'm wondering if it makes sense to consider scenarios where
>>> other classes cause CPUs in the domain to go above the tipping point.
>>> In that case too, it makes sense not to do EAS in that domain
>>> because of the overutilization.
>>>
>>> I guess task_fits uses cpu_util, which is PELT-only at the moment...
>>> so it may require some other method, like aggregating CFS PELT with
>>> RT-PELT and DL running bw or 

Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

2018-04-20 Thread Joel Fernandes
On Wed, Apr 18, 2018 at 4:17 AM, Quentin Perret  wrote:
> On Friday 13 Apr 2018 at 16:56:39 (-0700), Joel Fernandes wrote:
>> Hi,
>>
>> On Fri, Apr 6, 2018 at 8:36 AM, Dietmar Eggemann
>>  wrote:
>> > From: Thara Gopinath 
>> >
>> > Energy-aware scheduling should only operate when the system is not
>> > overutilized. There must be cpu time available to place tasks based on
>> > utilization in an energy-aware fashion, i.e. to pack tasks on
>> > energy-efficient cpus without harming the overall throughput.
>> >
>> > In case the system operates above this tipping point the tasks have to
>> > be placed based on task and cpu load in the classical way of spreading
>> > tasks across as many cpus as possible.
>> >
>> > The point in which a system switches from being not overutilized to
>> > being overutilized is called the tipping point.
>> >
>> > Such a tipping point indicator on a sched domain as the system
>> > boundary is introduced here. As soon as one cpu of a sched domain is
>> > overutilized the whole sched domain is declared overutilized as well.
>> > A cpu becomes overutilized when its utilization is higher than 80%
>> > (capacity_margin) of its capacity.
>> >
>> > The implementation takes advantage of the shared sched domain which is
>> > shared across all per-cpu views of a sched domain level. The new
>> > overutilized flag is placed in this shared sched domain.
>> >
>> > Load balancing is skipped in case the energy model is present and the
>> > sched domain is not overutilized because under this condition the
>> > predominantly load-per-capacity driven load-balancer should not
>> > interfere with the energy-aware wakeup placement based on utilization.
>> >
>> > In case the total utilization of a sched domain is greater than the
>> > total sched domain capacity the overutilized flag is set at the parent
>> > sched domain level to let other sched groups help getting rid of the
>> > overutilization of cpus.
>> >
>> > Signed-off-by: Thara Gopinath 
>> > Signed-off-by: Dietmar Eggemann 
>> > ---
>> >  include/linux/sched/topology.h |  1 +
>> >  kernel/sched/fair.c| 62 
>> > --
>> >  kernel/sched/sched.h   |  1 +
>> >  kernel/sched/topology.c| 12 +++-
>> >  4 files changed, 65 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/include/linux/sched/topology.h 
>> > b/include/linux/sched/topology.h
>> > index 26347741ba50..dd001c232646 100644
>> > --- a/include/linux/sched/topology.h
>> > +++ b/include/linux/sched/topology.h
>> > @@ -72,6 +72,7 @@ struct sched_domain_shared {
>> > atomic_t ref;
>> > atomic_t nr_busy_cpus;
>> > int has_idle_cores;
>> > +   int overutilized;
>> >  };
>> >
>> >  struct sched_domain {
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 0a76ad2ef022..6960e5ef3c14 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
>> >  }
>> >  #endif
>> >
>> > +#ifdef CONFIG_SMP
>> > +static inline int cpu_overutilized(int cpu);
>> > +
>> > +static inline int sd_overutilized(struct sched_domain *sd)
>> > +{
>> > +   return READ_ONCE(sd->shared->overutilized);
>> > +}
>> > +
>> > +static inline void update_overutilized_status(struct rq *rq)
>> > +{
>> > +   struct sched_domain *sd;
>> > +
>> > +   rcu_read_lock();
>> > +   sd = rcu_dereference(rq->sd);
>> > +   if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
>> > +   WRITE_ONCE(sd->shared->overutilized, 1);
>> > +   rcu_read_unlock();
>> > +}
>> > +#else
>> > +static inline void update_overutilized_status(struct rq *rq) {}
>> > +#endif /* CONFIG_SMP */
>> > +
>> >  /*
>> >   * The enqueue_task method is called before nr_running is
>> >   * increased. Here we update the fair scheduling stats and
>> > @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct 
>> > *p, int flags)
>> > update_cfs_group(se);
>> > }
>> >
>> > -   if (!se)
>> > +   if (!se) {
>> > add_nr_running(rq, 1);
>> > +   update_overutilized_status(rq);
>> > +   }
>>
>> I'm wondering if it makes sense to consider scenarios where
>> other classes cause CPUs in the domain to go above the tipping point.
>> In that case too, it makes sense not to do EAS in that domain
>> because of the overutilization.
>>
>> I guess task_fits uses cpu_util, which is PELT-only at the moment...
>> so it may require some other method, like aggregating CFS PELT with
>> RT-PELT and DL running bw or something.
>>
>
> So at the moment in cpu_overutilized() we compare cpu_util() to
> capacity_of() which should include RT and IRQ pressure IIRC. But
> you're right, we might be able to do more 

Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

2018-04-18 Thread Quentin Perret
On Friday 13 Apr 2018 at 16:56:39 (-0700), Joel Fernandes wrote:
> Hi,
> 
> On Fri, Apr 6, 2018 at 8:36 AM, Dietmar Eggemann
>  wrote:
> > From: Thara Gopinath 
> >
> > Energy-aware scheduling should only operate when the system is not
> > overutilized. There must be cpu time available to place tasks based on
> > utilization in an energy-aware fashion, i.e. to pack tasks on
> > energy-efficient cpus without harming the overall throughput.
> >
> > In case the system operates above this tipping point the tasks have to
> > be placed based on task and cpu load in the classical way of spreading
> > tasks across as many cpus as possible.
> >
> > The point in which a system switches from being not overutilized to
> > being overutilized is called the tipping point.
> >
> > Such a tipping point indicator on a sched domain as the system
> > boundary is introduced here. As soon as one cpu of a sched domain is
> > overutilized the whole sched domain is declared overutilized as well.
> > A cpu becomes overutilized when its utilization is higher than 80%
> > (capacity_margin) of its capacity.
> >
> > The implementation takes advantage of the shared sched domain which is
> > shared across all per-cpu views of a sched domain level. The new
> > overutilized flag is placed in this shared sched domain.
> >
> > Load balancing is skipped in case the energy model is present and the
> > sched domain is not overutilized because under this condition the
> > predominantly load-per-capacity driven load-balancer should not
> > interfere with the energy-aware wakeup placement based on utilization.
> >
> > In case the total utilization of a sched domain is greater than the
> > total sched domain capacity the overutilized flag is set at the parent
> > sched domain level to let other sched groups help getting rid of the
> > overutilization of cpus.
> >
> > Signed-off-by: Thara Gopinath 
> > Signed-off-by: Dietmar Eggemann 
> > ---
> >  include/linux/sched/topology.h |  1 +
> >  kernel/sched/fair.c| 62 
> > --
> >  kernel/sched/sched.h   |  1 +
> >  kernel/sched/topology.c| 12 +++-
> >  4 files changed, 65 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 26347741ba50..dd001c232646 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -72,6 +72,7 @@ struct sched_domain_shared {
> > atomic_t ref;
> > atomic_t nr_busy_cpus;
> > int has_idle_cores;
> > +   int overutilized;
> >  };
> >
> >  struct sched_domain {
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0a76ad2ef022..6960e5ef3c14 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
> >  }
> >  #endif
> >
> > +#ifdef CONFIG_SMP
> > +static inline int cpu_overutilized(int cpu);
> > +
> > +static inline int sd_overutilized(struct sched_domain *sd)
> > +{
> > +   return READ_ONCE(sd->shared->overutilized);
> > +}
> > +
> > +static inline void update_overutilized_status(struct rq *rq)
> > +{
> > +   struct sched_domain *sd;
> > +
> > +   rcu_read_lock();
> > +   sd = rcu_dereference(rq->sd);
> > +   if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
> > +   WRITE_ONCE(sd->shared->overutilized, 1);
> > +   rcu_read_unlock();
> > +}
> > +#else
> > +static inline void update_overutilized_status(struct rq *rq) {}
> > +#endif /* CONFIG_SMP */
> > +
> >  /*
> >   * The enqueue_task method is called before nr_running is
> >   * increased. Here we update the fair scheduling stats and
> > @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct 
> > *p, int flags)
> > update_cfs_group(se);
> > }
> >
> > -   if (!se)
> > +   if (!se) {
> > add_nr_running(rq, 1);
> > +   update_overutilized_status(rq);
> > +   }
> 
> I'm wondering if it makes sense to consider scenarios where
> other classes cause CPUs in the domain to go above the tipping point.
> In that case too, it makes sense not to do EAS in that domain
> because of the overutilization.
> 
> I guess task_fits uses cpu_util, which is PELT-only at the moment...
> so it may require some other method, like aggregating CFS PELT with
> RT-PELT and DL running bw or something.
>

So at the moment in cpu_overutilized() we compare cpu_util() to
capacity_of() which should include RT and IRQ pressure IIRC. But
you're right, we might be able to do more here... Perhaps we
could also use cpu_util_dl() which is available in sched.h now ?
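
[Editor's note: a minimal userspace sketch of the kind of aggregate check
discussed here. It reuses the series' margin check (capacity_margin = 1280,
i.e. the ~80% tipping point, since 1024/1280 = 0.8) and simply adds a DL term
to the CFS utilization; the helper names and the plain addition are
illustrative assumptions, not the patch's actual cpu_overutilized().]

/*
 * Editorial sketch, not kernel code: over-utilization check that also
 * folds in a DL utilization term, modelled in userspace.
 */
#include <stdbool.h>
#include <stdio.h>

#define CAPACITY_MARGIN 1280UL	/* ~80% tipping point: 1024 / 1280 = 0.8 */

/* Mirrors util_fits_capacity() from the series. */
static bool util_fits_capacity(unsigned long util, unsigned long capacity)
{
	return capacity * 1024 > util * CAPACITY_MARGIN;
}

/*
 * Hypothetical aggregate check: cfs_util stands in for cpu_util(),
 * dl_util for something like cpu_util_dl(), and capacity for
 * capacity_of(), which already reflects RT/IRQ pressure.
 */
static bool cpu_overutilized(unsigned long cfs_util, unsigned long dl_util,
			     unsigned long capacity)
{
	return !util_fits_capacity(cfs_util + dl_util, capacity);
}

int main(void)
{
	/* Capacity 1024: over-utilized once the aggregate util exceeds ~819. */
	printf("%d\n", cpu_overutilized(700, 100, 1024)); /* 800 -> fits -> 0 */
	printf("%d\n", cpu_overutilized(700, 200, 1024)); /* 900 -> over -> 1 */
	return 0;
}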


Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

2018-04-17 Thread Leo Yan
On Tue, Apr 17, 2018 at 07:39:21PM +0200, Dietmar Eggemann wrote:
> On 04/17/2018 04:25 PM, Leo Yan wrote:
> 
> >>@@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct 
> >>*p, int flags)
> >>update_cfs_group(se);
> >>}
> >>-   if (!se)
> >>+   if (!se) {
> >>add_nr_running(rq, 1);
> >>+   update_overutilized_status(rq);
> >>+   }
> >
> >Maybe this isn't a good question, but why is the overutilized flag only
> >updated in the enqueue flow and not in the dequeue flow?
> 
> [...]
> 
> >>@@ -9955,6 +10009,8 @@ static void task_tick_fair(struct rq *rq, struct 
> >>task_struct *curr, int queued)
> >>   if (static_branch_unlikely(&sched_numa_balancing))
> >>task_tick_numa(rq, curr);
> >>+
> >>+   update_overutilized_status(rq);
> >
> >Can the sched tick clear the overutilized flag if we are under the tipping point?
> 
> No, only the load balancer for this particular sched domain level is able to
> clear the flag. We want to use the existing iteration over all cpus of the
> sched domain span to reset the flag.

Yes, and sorry for introducing noise.  The overutilized flag is shared by
all CPUs in the sched domain, so one CPU can set the overutilized flag
for itself, but that CPU cannot clear the flag because the flag might have
been set by other CPUs in the same sched domain.

Thanks,
Leo Yan


Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

2018-04-17 Thread Dietmar Eggemann

On 04/17/2018 04:25 PM, Leo Yan wrote:


@@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_group(se);
 	}
 
-	if (!se)
+	if (!se) {
 		add_nr_running(rq, 1);
+		update_overutilized_status(rq);
+	}


Maybe this isn't a good question, but why is the overutilized flag only
updated in the enqueue flow and not in the dequeue flow?


[...]


@@ -9955,6 +10009,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
+
+	update_overutilized_status(rq);


Can the sched tick clear the overutilized flag if we are under the tipping point?


No, only the load balancer for this particular sched domain level is 
able to clear the flag. We want to use the existing iteration over all 
cpus of the sched domain span to reset the flag.
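
[Editor's note: a small userspace model of the asymmetry described here,
purely illustrative and not the patch's code. Per-CPU paths (enqueue, tick)
only ever set the shared flag; a pass that walks every CPU of the sched
domain span, as the load balancer does, recomputes it and is therefore the
only place it can be cleared.]

/*
 * Editorial sketch of the shared-flag protocol, not kernel code:
 * per-CPU paths may only set the domain flag; a domain-wide scan
 * recomputes it and may clear it.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS		4
#define CAPACITY_MARGIN	1280UL

struct cpu_state {
	unsigned long util;
	unsigned long capacity;
};

static bool cpu_overutilized(const struct cpu_state *c)
{
	return c->capacity * 1024 <= c->util * CAPACITY_MARGIN;
}

/* Enqueue/tick path: a single CPU's view, so it may only set the flag. */
static void update_overutilized_status(const struct cpu_state *c, int *sd_flag)
{
	if (!*sd_flag && cpu_overutilized(c))
		*sd_flag = 1;
}

/* Load-balance path: walks the whole span, so it may also clear the flag. */
static void rebalance_domain(const struct cpu_state cpus[], int *sd_flag)
{
	int overutilized = 0;
	int i;

	for (i = 0; i < NR_CPUS; i++)
		overutilized |= cpu_overutilized(&cpus[i]);
	*sd_flag = overutilized;
}

int main(void)
{
	struct cpu_state cpus[NR_CPUS] = {
		{ 900, 1024 }, { 200, 1024 }, { 100, 512 }, { 50, 512 },
	};
	int sd_overutilized = 0;

	update_overutilized_status(&cpus[0], &sd_overutilized);
	printf("after enqueue on cpu0: %d\n", sd_overutilized);	/* 1 */

	cpus[0].util = 300;	/* the load has since gone away */
	update_overutilized_status(&cpus[0], &sd_overutilized);
	printf("after another enqueue: %d\n", sd_overutilized);	/* still 1 */

	rebalance_domain(cpus, &sd_overutilized);
	printf("after a balance pass:  %d\n", sd_overutilized);	/* 0 */
	return 0;
}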


Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

2018-04-17 Thread Leo Yan
On Fri, Apr 06, 2018 at 04:36:04PM +0100, Dietmar Eggemann wrote:
> From: Thara Gopinath 
> 
> Energy-aware scheduling should only operate when the system is not
> overutilized. There must be cpu time available to place tasks based on
> utilization in an energy-aware fashion, i.e. to pack tasks on
> energy-efficient cpus without harming the overall throughput.
> 
> In case the system operates above this tipping point the tasks have to
> be placed based on task and cpu load in the classical way of spreading
> tasks across as many cpus as possible.
> 
> The point in which a system switches from being not overutilized to
> being overutilized is called the tipping point.
> 
> Such a tipping point indicator on a sched domain as the system
> boundary is introduced here. As soon as one cpu of a sched domain is
> overutilized the whole sched domain is declared overutilized as well.
> A cpu becomes overutilized when its utilization is higher than 80%
> (capacity_margin) of its capacity.
> 
> The implementation takes advantage of the shared sched domain which is
> shared across all per-cpu views of a sched domain level. The new
> overutilized flag is placed in this shared sched domain.
> 
> Load balancing is skipped in case the energy model is present and the
> sched domain is not overutilized because under this condition the
> predominantly load-per-capacity driven load-balancer should not
> interfere with the energy-aware wakeup placement based on utilization.
> 
> In case the total utilization of a sched domain is greater than the
> total sched domain capacity the overutilized flag is set at the parent
> sched domain level to let other sched groups help getting rid of the
> overutilization of cpus.
> 
> Signed-off-by: Thara Gopinath 
> Signed-off-by: Dietmar Eggemann 
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c| 62 
> --
>  kernel/sched/sched.h   |  1 +
>  kernel/sched/topology.c| 12 +++-
>  4 files changed, 65 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 26347741ba50..dd001c232646 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -72,6 +72,7 @@ struct sched_domain_shared {
>   atomic_t ref;
>   atomic_t nr_busy_cpus;
>   int has_idle_cores;
> + int overutilized;
>  };
>  
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a76ad2ef022..6960e5ef3c14 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
>  }
>  #endif
>  
> +#ifdef CONFIG_SMP
> +static inline int cpu_overutilized(int cpu);
> +
> +static inline int sd_overutilized(struct sched_domain *sd)
> +{
> + return READ_ONCE(sd->shared->overutilized);
> +}
> +
> +static inline void update_overutilized_status(struct rq *rq)
> +{
> + struct sched_domain *sd;
> +
> + rcu_read_lock();
> + sd = rcu_dereference(rq->sd);
> + if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
> + WRITE_ONCE(sd->shared->overutilized, 1);
> + rcu_read_unlock();
> +}
> +#else
> +static inline void update_overutilized_status(struct rq *rq) {}
> +#endif /* CONFIG_SMP */
> +
>  /*
>   * The enqueue_task method is called before nr_running is
>   * increased. Here we update the fair scheduling stats and
> @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct 
> *p, int flags)
>   update_cfs_group(se);
>   }
>  
> - if (!se)
> + if (!se) {
>   add_nr_running(rq, 1);
> + update_overutilized_status(rq);
> + }

Maybe this isn't a good question, but why is the overutilized flag only
updated in the enqueue flow and not in the dequeue flow?

>   util_est_enqueue(&rq->cfs, p);
>   hrtick_update(rq);
> @@ -6579,6 +6603,11 @@ static inline int util_fits_capacity(unsigned long 
> util, unsigned long capacity)
>   return capacity * 1024 > util * capacity_margin;
>  }
>  
> +static inline int cpu_overutilized(int cpu)
> +{
> + return !util_fits_capacity(cpu_util(cpu), capacity_of(cpu));
> +}
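
[Editor's note on the 80% figure in the changelog: with capacity_margin =
1280, util_fits_capacity() holds while util * 1280 < capacity * 1024, i.e.
while util stays below 80% of capacity, so a CPU of capacity 1024 becomes
over-utilized once its utilization exceeds roughly 819. A tiny self-contained
check of that threshold follows; the constants are the ones used by the
series, the program itself is only illustrative.]

/* Editorial check of the ~80% tipping point (capacity_margin == 1280). */
#include <assert.h>

int main(void)
{
	unsigned long capacity = 1024, capacity_margin = 1280;

	assert(capacity * 1024 >  819 * capacity_margin);  /* util 819 still fits */
	assert(capacity * 1024 <= 820 * capacity_margin);  /* util 820 is over    */
	return 0;
}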
> +
>  /*
>   * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
>   * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> @@ -7817,6 +7846,7 @@ struct sd_lb_stats {
>   unsigned long total_running;
>   unsigned long total_load;   /* Total load of all groups in sd */
>   unsigned long total_capacity;   /* Total capacity of all groups in sd */
> + unsigned long total_util;   /* Total util of all groups in sd */
>   unsigned long avg_load; /* Average load across all groups in sd */
>  
>   struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
> @@ -7837,6 +7867,7 

Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

2018-04-17 Thread Leo Yan
On Fri, Apr 06, 2018 at 04:36:04PM +0100, Dietmar Eggemann wrote:
> From: Thara Gopinath 
> 
> Energy-aware scheduling should only operate when the system is not
> overutilized. There must be cpu time available to place tasks based on
> utilization in an energy-aware fashion, i.e. to pack tasks on
> energy-efficient cpus without harming the overall throughput.
> 
> In case the system operates above this tipping point the tasks have to
> be placed based on task and cpu load in the classical way of spreading
> tasks across as many cpus as possible.
> 
> The point in which a system switches from being not overutilized to
> being overutilized is called the tipping point.
> 
> Such a tipping point indicator on a sched domain as the system
> boundary is introduced here. As soon as one cpu of a sched domain is
> overutilized the whole sched domain is declared overutilized as well.
> A cpu becomes overutilized when its utilization is higher that 80%
> (capacity_margin) of its capacity.
> 
> The implementation takes advantage of the shared sched domain which is
> shared across all per-cpu views of a sched domain level. The new
> overutilized flag is placed in this shared sched domain.
> 
> Load balancing is skipped in case the energy model is present and the
> sched domain is not overutilized because under this condition the
> predominantly load-per-capacity driven load-balancer should not
> interfere with the energy-aware wakeup placement based on utilization.
> 
> In case the total utilization of a sched domain is greater than the
> total sched domain capacity the overutilized flag is set at the parent
> sched domain level to let other sched groups help getting rid of the
> overutilization of cpus.
> 
> Signed-off-by: Thara Gopinath 
> Signed-off-by: Dietmar Eggemann 
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c| 62 
> --
>  kernel/sched/sched.h   |  1 +
>  kernel/sched/topology.c| 12 +++-
>  4 files changed, 65 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 26347741ba50..dd001c232646 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -72,6 +72,7 @@ struct sched_domain_shared {
>   atomic_tref;
>   atomic_tnr_busy_cpus;
>   int has_idle_cores;
> + int overutilized;
>  };
>  
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a76ad2ef022..6960e5ef3c14 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
>  }
>  #endif
>  
> +#ifdef CONFIG_SMP
> +static inline int cpu_overutilized(int cpu);
> +
> +static inline int sd_overutilized(struct sched_domain *sd)
> +{
> + return READ_ONCE(sd->shared->overutilized);
> +}
> +
> +static inline void update_overutilized_status(struct rq *rq)
> +{
> + struct sched_domain *sd;
> +
> + rcu_read_lock();
> + sd = rcu_dereference(rq->sd);
> + if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
> + WRITE_ONCE(sd->shared->overutilized, 1);
> + rcu_read_unlock();
> +}
> +#else
> +static inline void update_overutilized_status(struct rq *rq) {}
> +#endif /* CONFIG_SMP */
> +
>  /*
>   * The enqueue_task method is called before nr_running is
>   * increased. Here we update the fair scheduling stats and
> @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct 
> *p, int flags)
>   update_cfs_group(se);
>   }
>  
> - if (!se)
> + if (!se) {
>   add_nr_running(rq, 1);
> + update_overutilized_status(rq);
> + }

Maybe this isn't a good question, why only update overutilized flag
for enqueue flow but not for dequeue flow?

>   util_est_enqueue(>cfs, p);
>   hrtick_update(rq);
> @@ -6579,6 +6603,11 @@ static inline int util_fits_capacity(unsigned long 
> util, unsigned long capacity)
>   return capacity * 1024 > util * capacity_margin;
>  }
>  
> +static inline int cpu_overutilized(int cpu)
> +{
> + return !util_fits_capacity(cpu_util(cpu), capacity_of(cpu));
> +}
> +
>  /*
>   * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
>   * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> @@ -7817,6 +7846,7 @@ struct sd_lb_stats {
>   unsigned long total_running;
>   unsigned long total_load;   /* Total load of all groups in sd */
>   unsigned long total_capacity;   /* Total capacity of all groups in sd */
> + unsigned long total_util;   /* Total util of all groups in sd */
>   unsigned long avg_load; /* Average load across all groups in sd */
>  
>   struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
> @@ -7837,6 +7867,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>  

Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

2018-04-13 Thread Joel Fernandes
Hi,

On Fri, Apr 6, 2018 at 8:36 AM, Dietmar Eggemann wrote:
> From: Thara Gopinath 
>
> Energy-aware scheduling should only operate when the system is not
> overutilized. There must be cpu time available to place tasks based on
> utilization in an energy-aware fashion, i.e. to pack tasks on
> energy-efficient cpus without harming the overall throughput.
>
> In case the system operates above this tipping point the tasks have to
> be placed based on task and cpu load in the classical way of spreading
> tasks across as many cpus as possible.
>
> The point at which a system switches from being not overutilized to
> being overutilized is called the tipping point.
>
> Such a tipping point indicator on a sched domain as the system
> boundary is introduced here. As soon as one cpu of a sched domain is
> overutilized the whole sched domain is declared overutilized as well.
> A cpu becomes overutilized when its utilization is higher than 80%
> (capacity_margin) of its capacity.
>
> The implementation takes advantage of the shared sched domain which is
> shared across all per-cpu views of a sched domain level. The new
> overutilized flag is placed in this shared sched domain.
>
> Load balancing is skipped in case the energy model is present and the
> sched domain is not overutilized because under this condition the
> predominantly load-per-capacity driven load-balancer should not
> interfere with the energy-aware wakeup placement based on utilization.
>
> In case the total utilization of a sched domain is greater than the
> total sched domain capacity the overutilized flag is set at the parent
> sched domain level to let other sched groups help getting rid of the
> overutilization of cpus.
>
> Signed-off-by: Thara Gopinath 
> Signed-off-by: Dietmar Eggemann 
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 62 --
>  kernel/sched/sched.h           |  1 +
>  kernel/sched/topology.c        | 12 +++-
>  4 files changed, 65 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 26347741ba50..dd001c232646 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -72,6 +72,7 @@ struct sched_domain_shared {
> 	atomic_t	ref;
> 	atomic_t	nr_busy_cpus;
> 	int		has_idle_cores;
> +	int		overutilized;
>  };
>
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a76ad2ef022..6960e5ef3c14 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
>  }
>  #endif
>
> +#ifdef CONFIG_SMP
> +static inline int cpu_overutilized(int cpu);
> +
> +static inline int sd_overutilized(struct sched_domain *sd)
> +{
> +   return READ_ONCE(sd->shared->overutilized);
> +}
> +
> +static inline void update_overutilized_status(struct rq *rq)
> +{
> +   struct sched_domain *sd;
> +
> +   rcu_read_lock();
> +   sd = rcu_dereference(rq->sd);
> +   if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
> +   WRITE_ONCE(sd->shared->overutilized, 1);
> +   rcu_read_unlock();
> +}
> +#else
> +static inline void update_overutilized_status(struct rq *rq) {}
> +#endif /* CONFIG_SMP */
> +
>  /*
>   * The enqueue_task method is called before nr_running is
>   * increased. Here we update the fair scheduling stats and
> @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> update_cfs_group(se);
> }
>
> -   if (!se)
> +   if (!se) {
> add_nr_running(rq, 1);
> +   update_overutilized_status(rq);
> +   }

I'm wondering if it makes sense to consider scenarios where other
classes cause CPUs in the domain to go above the tipping point. In that
case too, it makes sense not to do EAS in that domain because of the
overutilization.

I guess task_fits uses cpu_util, which is CFS PELT only at the moment...
so this may require some other method, like aggregating CFS PELT with
RT PELT and the DL running bandwidth or something.
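
Something along these lines is what I have in mind -- a rough sketch only,
not a proposal. cpu_util_dl() is the DL helper mentioned elsewhere in the
thread, while cpu_util_rt_est() is a made-up placeholder until an RT PELT
signal exists (rt_avg could stand in for it meanwhile):

	/* Sketch: aggregate per-class utilization before the margin check. */
	static inline int cpu_overutilized(int cpu)
	{
		unsigned long util = cpu_util(cpu);	/* CFS PELT */

		util += cpu_util_dl(cpu_rq(cpu));	/* DL running bw */
		util += cpu_util_rt_est(cpu);		/* placeholder RT term */

		/*
		 * Compare against the original capacity: capacity_of()
		 * already subtracts RT/DL pressure, so using it here would
		 * double count the explicit terms above.
		 */
		return !util_fits_capacity(util, capacity_orig_of(cpu));
	}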

thanks,

- Joel


[RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

2018-04-06 Thread Dietmar Eggemann
From: Thara Gopinath 

Energy-aware scheduling should only operate when the system is not
overutilized. There must be cpu time available to place tasks based on
utilization in an energy-aware fashion, i.e. to pack tasks on
energy-efficient cpus without harming the overall throughput.

In case the system operates above this tipping point the tasks have to
be placed based on task and cpu load in the classical way of spreading
tasks across as many cpus as possible.

The point at which a system switches from being not overutilized to
being overutilized is called the tipping point.

Such a tipping point indicator on a sched domain as the system
boundary is introduced here. As soon as one cpu of a sched domain is
overutilized the whole sched domain is declared overutilized as well.
A cpu becomes overutilized when its utilization is higher than 80%
(capacity_margin) of its capacity.
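
For concreteness: with the existing capacity_margin = 1280 (~20% headroom)
in fair.c, the util_fits_capacity() check added below reads
capacity * 1024 > util * 1280, i.e. util < capacity * 4/5. A cpu of
capacity 1024 therefore tips at util ~819, and one of capacity 512 at
util ~409.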

The implementation takes advantage of the shared sched domain which is
shared across all per-cpu views of a sched domain level. The new
overutilized flag is placed in this shared sched domain.

Load balancing is skipped in case the energy model is present and the
sched domain is not overutilized because under this condition the
predominantly load-per-capacity driven load-balancer should not
interfere with the energy-aware wakeup placement based on utilization.
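
Not all of the corresponding fair.c hunks are visible below, but the gate
being described amounts to something like this sketch (illustration only;
energy_model_present() is a placeholder name for however the series detects
that an energy model is available):

	/* Sketch: defer to energy-aware wakeup placement when possible. */
	static inline bool lb_defer_to_eas(struct sched_domain *sd)
	{
		return energy_model_present(sd) && !sd_overutilized(sd);
	}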

In case the total utilization of a sched domain is greater than the
total sched domain capacity the overutilized flag is set at the parent
sched domain level to let other sched groups help getting rid of the
overutilization of cpus.

Signed-off-by: Thara Gopinath 
Signed-off-by: Dietmar Eggemann 
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 62 --
 kernel/sched/sched.h           |  1 +
 kernel/sched/topology.c        | 12 +++-
 4 files changed, 65 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..dd001c232646 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -72,6 +72,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	int		overutilized;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a76ad2ef022..6960e5ef3c14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+#ifdef CONFIG_SMP
+static inline int cpu_overutilized(int cpu);
+
+static inline int sd_overutilized(struct sched_domain *sd)
+{
+   return READ_ONCE(sd->shared->overutilized);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+   struct sched_domain *sd;
+
+   rcu_read_lock();
+   sd = rcu_dereference(rq->sd);
+   if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
+   WRITE_ONCE(sd->shared->overutilized, 1);
+   rcu_read_unlock();
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) {}
+#endif /* CONFIG_SMP */
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
}
 
-   if (!se)
+   if (!se) {
add_nr_running(rq, 1);
+   update_overutilized_status(rq);
+   }
 
 	util_est_enqueue(&rq->cfs, p);
hrtick_update(rq);
@@ -6579,6 +6603,11 @@ static inline int util_fits_capacity(unsigned long util, unsigned long capacity)
return capacity * 1024 > util * capacity_margin;
 }
 
+static inline int cpu_overutilized(int cpu)
+{
+   return !util_fits_capacity(cpu_util(cpu), capacity_of(cpu));
+}
+
 /*
  * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
  * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
@@ -7817,6 +7846,7 @@ struct sd_lb_stats {
unsigned long total_running;
unsigned long total_load;   /* Total load of all groups in sd */
unsigned long total_capacity;   /* Total capacity of all groups in sd */
+   unsigned long total_util;   /* Total util of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
 
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7837,6 +7867,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.total_running = 0UL,
.total_load = 0UL,
.total_capacity = 0UL,
+   .total_util = 0UL,
.busiest_stat = {
.avg_load = 0UL,
.sum_nr_running = 0,
@@ -8133,11 +8164,12 @@ static bool update_nohz_stats(struct rq *rq, bool force)
  * @local_group: Does group contain
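
The diff is cut off at this point in the archive. Based on the changelog and
the new total_util field above, the parent-level propagation presumably boils
down to something like the following sketch in the load-balance statistics
path (illustration only, not the missing hunk itself):

	/*
	 * Sketch: once the domain's total utilization no longer fits its
	 * total capacity, flag the parent domain so that sibling groups
	 * can help spread the overutilized cpus.
	 */
	if (sd->parent && sd->parent->shared &&
	    sds->total_util > sds->total_capacity)
		WRITE_ONCE(sd->parent->shared->overutilized, 1);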
