On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote:
> On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
> > Schedutil updates for FAIR tasks are triggered implicitly each time a
> > cfs_rq's utilization is updated via cfs_rq_util_change(), currently
> > called by update_cfs_rq_load_avg(), when the utilization of a cfs_rq has
> > changed, and {attach,detach}_entity_load_avg().
> > 
> > This design is based on the idea that "we should callback schedutil
> > frequently enough" to properly update the CPU frequency at every
> > utilization change. However, such an integration strategy has also
> > some downsides:
> 
> Hi Patrick,
> 
> I agree making the call explicit would make schedutil integration easier, so
> that's really awesome. However, I also fear that if some path in the fair
> class changes the utilization in the future but forgets to update schedutil
> explicitly (because it forgot to call the explicit public API), then the
> schedutil update won't go through. In that respect, the previous design of
> doing the schedutil update in the wrapper was kind of a nice-to-have.
> 
> Just thinking out loud, but is there a way you could still make the implicit
> call in case the explicit call wasn't made for some reason? That's probably
> hard to do correctly though..
> 
> Some more comments below:
> 
> > 
> >  - schedutil updates are triggered by RQ load updates, which makes
> >    sense in general but does not allow us to know exactly which other
> >    RQ-related information has been updated.
> >    Recently, for example, we had issues due to schedutil dependencies on
> >    cfs_rq->h_nr_running and estimated utilization updates.
> > 
> >  - cfs_rq_util_change() is mainly a wrapper function for an already
> >    existing "public API", cpufreq_update_util(), which is required
> >    just to ensure we actually update schedutil only when we are updating
> >    a root cfs_rq.
> >    Thus, especially when task groups are in use, most of the calls to
> >    this wrapper function are not required.
> > 
> >  - the usage of a wrapper function is not completely consistent across
> >    fair.c, since we could still need additional explicit calls to
> >    cpufreq_update_util().
> >    For example, this already happens to report the IOWAIT boost flag in
> >    the wakeup path.
> > 
> >  - it makes it hard to integrate new features, since it could require
> >    changing other function prototypes just to pass in an additional flag,
> >    as happened for example in commit:
> > 
> >       ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
> > 
> > All the above considered, let's make schedutil updates more explicit in
> > fair.c by removing the cfs_rq_util_change() wrapper function in favour
> > of the existing cpufreq_update_util() public API.
> > This can be done by calling cpufreq_update_util() explicitly in the few
> > call sites where it really makes sense and when all the (potentially)
> > required cfs_rq's information have been updated.
> > 
> > This patch mainly removes code and adds explicit schedutil updates
> > only when we:
> >  - {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
> >  - (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
> >  - task_tick_fair() to update the utilization of the root cfs_rq
> > 
> > All the other code paths, currently _indirectly_ covered by a call to
> > update_load_avg(), are still covered. Indeed, some paths already imply
> > enqueue/dequeue calls:
> >  - switch_{to,from}_fair()
> >  - sched_move_task()
> > while others are followed by enqueue/dequeue calls:
> >  - cpu_cgroup_fork() and
> >    post_init_entity_util_avg():
> >      are used at wake_up_new_task() time and thus already followed by an
> >      enqueue_task_fair()
> >  - migrate_task_rq_fair():
> >      updates the removed utilization but not the actual cfs_rq
> >      utilization, which is updated by a following sched event
> > 
> > This new proposal also allows better aggregation of the schedutil related
> > flags, which are required only at enqueue_task_fair() time.
> > IOWAIT and MIGRATION flags are now requested only when a task is
> > actually visible at the root cfs_rq level.
> > 
> > Signed-off-by: Patrick Bellasi <patrick.bell...@arm.com>
> > Cc: Ingo Molnar <mi...@redhat.com>
> > Cc: Peter Zijlstra <pet...@infradead.org>
> > Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
> > Cc: Viresh Kumar <viresh.ku...@linaro.org>
> > Cc: Joel Fernandes <joe...@google.com>
> > Cc: Juri Lelli <juri.le...@redhat.com>
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux...@vger.kernel.org
> > 
> > ---
> > 
> > NOTE: this patch changes the behavior of the IOWAIT flag: in case of a
> > task waking up on a throttled RQ we do not assert the flag to schedutil
> > anymore. However, this seems to make sense since the task will not be
> > running anyway.
> > ---
> >  kernel/sched/fair.c | 81 ++++++++++++++++++++++++-----------------------------
> >  1 file changed, 36 insertions(+), 45 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 01dfc47541e6..87f092151a6e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -772,7 +772,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
> >                      * For !fair tasks do:
> >                      *
> >                     update_cfs_rq_load_avg(now, cfs_rq);
> > -                   attach_entity_load_avg(cfs_rq, se, 0);
> > +                   attach_entity_load_avg(cfs_rq, se);
> >                     switched_from_fair(rq, p);
> >                      *
> >                      * such that the next switched_to_fair() has the
> > @@ -3009,29 +3009,6 @@ static inline void update_cfs_group(struct sched_entity *se)
> >  }
> >  #endif /* CONFIG_FAIR_GROUP_SCHED */
> >  
> > -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> > -{
> > -   struct rq *rq = rq_of(cfs_rq);
> > -
> > -   if (&rq->cfs == cfs_rq || (flags & SCHED_CPUFREQ_MIGRATION)) {
> > -           /*
> > -            * There are a few boundary cases this might miss but it should
> > -            * get called often enough that that should (hopefully) not be
> > -            * a real problem.
> > -            *
> > -            * It will not get called when we go idle, because the idle
> > -            * thread is a different class (!fair), nor will the utilization
> > -            * number include things like RT tasks.
> > -            *
> > -            * As is, the util number is not freq-invariant (we'd have to
> > -            * implement arch_scale_freq_capacity() for that).
> > -            *
> > -            * See cpu_util().
> > -            */
> > -           cpufreq_update_util(rq, flags);
> > -   }
> > -}
> > -
> >  #ifdef CONFIG_SMP
> >  /*
> >   * Approximate:
> > @@ -3712,9 +3689,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >     cfs_rq->load_last_update_time_copy = sa->last_update_time;
> >  #endif
> >  
> > -   if (decayed)
> > -           cfs_rq_util_change(cfs_rq, 0);
> > -
> >     return decayed;
> >  }
> >  
> > @@ -3726,7 +3700,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >   * Must call update_cfs_rq_load_avg() before this, since we rely on
> >   * cfs_rq->avg.last_update_time being current.
> >   */
> > -static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > +static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> >     u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
> >  
> > @@ -3762,7 +3736,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >  
> >     add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
> >  
> > -   cfs_rq_util_change(cfs_rq, flags);
> >  }
> >  
> >  /**
> > @@ -3781,7 +3754,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >  
> >     add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
> >  
> > -   cfs_rq_util_change(cfs_rq, 0);
> >  }
> >  
> >  /*
> > @@ -3818,7 +3790,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >              *
> >              * IOW we're enqueueing a task on a new CPU.
> >              */
> > -           attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
> > +           attach_entity_load_avg(cfs_rq, se);
> >             update_tg_load_avg(cfs_rq, 0);
> >  
> >     } else if (decayed && (flags & UPDATE_TG))
> > @@ -4028,13 +4000,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >  
> >  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
> >  {
> > -   cfs_rq_util_change(cfs_rq, 0);
> 
> How about kill that extra line by doing:
> 
> static inline void update_load_avg(struct cfs_rq *cfs_rq,
>                                  struct sched_entity *se, int not_used1) {}
> 
> >  }
> >  
> >  static inline void remove_entity_load_avg(struct sched_entity *se) {}
> >  
> >  static inline void
> > -attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) {}
> > +attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
> >  static inline void
> >  detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
> >  
> > @@ -4762,8 +4733,11 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> >                     dequeue = 0;
> >     }
> >  
> > -   if (!se)
> > +   /* The tasks are no longer visible from the root cfs_rq */
> > +   if (!se) {
> >             sub_nr_running(rq, task_delta);
> > +           cpufreq_update_util(rq, 0);
> > +   }
> >  
> >     cfs_rq->throttled = 1;
> >     cfs_rq->throttled_clock = rq_clock(rq);
> > @@ -4825,8 +4799,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> >                     break;
> >     }
> >  
> > -   if (!se)
> > +   /* The tasks are now visible from the root cfs_rq */
> > +   if (!se) {
> >             add_nr_running(rq, task_delta);
> > +           cpufreq_update_util(rq, 0);
> > +   }
> >  
> >     /* Determine whether we need to wake up potentially idle CPU: */
> >     if (rq->curr == rq->idle && rq->cfs.nr_running)
> > @@ -5359,14 +5336,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >     /* Estimated utilization must be updated before schedutil */
> >     util_est_enqueue(&rq->cfs, p);
> >  
> > -   /*
> > -    * If in_iowait is set, the code below may not trigger any cpufreq
> > -    * utilization updates, so do it here explicitly with the IOWAIT flag
> > -    * passed.
> > -    */
> > -   if (p->in_iowait)
> > -           cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> > -
> >     for_each_sched_entity(se) {
> >             if (se->on_rq)
> >                     break;
> > @@ -5397,9 +5366,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >             update_cfs_group(se);
> >     }
> >  
> > -   if (!se)
> > +   /* The task is visible from the root cfs_rq */
> > +   if (!se) {
> > +           unsigned int flags = 0;
> > +
> >             add_nr_running(rq, 1);
> >  
> > +           if (p->in_iowait)
> > +                   flags |= SCHED_CPUFREQ_IOWAIT;
> > +
> > +           /*
> > +            * !last_update_time means we've passed through
> > +            * migrate_task_rq_fair() indicating we migrated.
> > +            *
> > +            * IOW we're enqueueing a task on a new CPU.
> > +            */
> > +           if (!p->se.avg.last_update_time)
> > +                   flags |= SCHED_CPUFREQ_MIGRATION;
> > +
> > +           cpufreq_update_util(rq, flags);
> > +   }
> > +
> >     hrtick_update(rq);
> >  }
> >  
> > @@ -5456,10 +5443,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >             update_cfs_group(se);
> >     }
> >  
> > +   /* The task is no longer visible from the root cfs_rq */
> >     if (!se)
> >             sub_nr_running(rq, 1);
> >  
> >     util_est_dequeue(&rq->cfs, p, task_sleep);
> > +   cpufreq_update_util(rq, 0);
> 
> One question about this change. In enqueue, throttle and unthrottle you are
> calling cpufreq_update_util conditionally, depending on whether the task
> became visible/not-visible in the hierarchy.
> 
> But in dequeue you're calling it unconditionally. That seems a bit
> inconsistent. Is this because of util_est or something? Could you add a
> comment here explaining why this is so?

The big question I have is: in case se != NULL, the task is still visible at
the root RQ level. In that case, should we still call util_est_dequeue() and
cpufreq_update_util()?

Sorry if I missed something obvious.

thanks!

- Joel
