On 12-May 23:25, Joel Fernandes wrote:
> On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote:
> > On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
> > > Schedutil updates for FAIR tasks are triggered implicitly each time a
> > > cfs_rq's utilization is updated via cfs_rq_util_change(), currently
> > > called by update_cfs_rq_load_avg(), when the utilization of a cfs_rq has
> > > changed, and {attach,detach}_entity_load_avg().
> > >
> > > This design is based on the idea that "we should callback schedutil
> > > frequently enough" to properly update the CPU frequency at every
> > > utilization change. However, such an integration strategy has also
> > > some downsides:
> >
> > Hi Patrick,

Hi Joel,

> > I agree making the call explicit would make schedutil integration easier so
> > that's really awesome. However I also fear that if some path in the fair
> > class in the future changes the utilization but forgets to update schedutil
> > explicitly (because they forgot to call the explicit public API) then the
> > schedutil update wouldn't go through. In this case the previous design of
> > doing the schedutil update in the wrapper kind of was a nice to have

Right now I cannot see other possible future paths where we could
actually change the utilization signal without considering that,
eventually, we should call an existing API to update schedutil if it
makes sense.

What I see as more likely instead, also because it has already happened
a couple of times, is that because of code changes in fair.c we end up
calling (implicitly) schedutil with a wrong utilization value.

Noticing this kind of broken dependency has already proven more
difficult than noticing an update of the utilization without a
corresponding explicit call of the public API.

> > Just thinking out loud but is there a way you could make the implicit call
> > anyway incase the explicit call wasn't requested for some reason? That's
> > probably hard to do correctly though..
> >
> > Some more comments below:
> >
> > [...]
> >
> > >
> > > - it makes it hard to integrate new features since it could require to
> > >   change other function prototypes just to pass in an additional flag,
> > >   as it happened for example in commit:
> > >
> > >      ea14b57e8a18 ("sched/cpufreq: Provide migration hint")

IMHO, the point above is also a good example of how convoluted it is to
add support for one new simple feature because of the current implicit
updates.

[...]

> > > @@ -4028,13 +4000,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
> > > *cfs_rq)
> > >
> > >  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct
> > >  sched_entity *se, int not_used1)
> > >  {
> > > -        cfs_rq_util_change(cfs_rq, 0);
> >
> > How about kill that extra line by doing:
> >
> > static inline void update_load_avg(struct cfs_rq *cfs_rq,
> >                                    struct sched_entity *se, int not_used1) {}
> >
> > >  }

Right, that could make sense, thanks!

[...]

> > > @@ -5397,9 +5366,27 @@ enqueue_task_fair(struct rq *rq, struct
> > > task_struct *p, int flags)
> > >                  update_cfs_group(se);
> > >          }
> > >
> > > -        if (!se)
> > > +        /* The task is visible from the root cfs_rq */
> > > +        if (!se) {
> > > +                unsigned int flags = 0;
> > > +
> > >                  add_nr_running(rq, 1);
> > >
> > > +                if (p->in_iowait)
> > > +                        flags |= SCHED_CPUFREQ_IOWAIT;
> > > +
> > > +                /*
> > > +                 * !last_update_time means we've passed through
> > > +                 * migrate_task_rq_fair() indicating we migrated.
> > > +                 *
> > > +                 * IOW we're enqueueing a task on a new CPU.
> > > +                 */
> > > +                if (!p->se.avg.last_update_time)
> > > +                        flags |= SCHED_CPUFREQ_MIGRATION;
> > > +
> > > +                cpufreq_update_util(rq, flags);
> > > +        }
> > > +
> > >          hrtick_update(rq);
> > >  }
> > >
> > > @@ -5456,10 +5443,12 @@ static void dequeue_task_fair(struct rq *rq,
> > > struct task_struct *p, int flags)
> > >                  update_cfs_group(se);
> > >          }
> > >
> > > +        /* The task is no more visible from the root cfs_rq */
> > >          if (!se)
> > >                  sub_nr_running(rq, 1);
> > >
> > >          util_est_dequeue(&rq->cfs, p, task_sleep);
> > > +        cpufreq_update_util(rq, 0);
> >
> > One question about this change. In enqueue, throttle and unthrottle - you
> > are conditionally calling cpufreq_update_util incase the task was
> > visible/not-visible in the hierarchy.
> >
> > But in dequeue you're unconditionally calling it. Seems a bit inconsistent.
> > Is this because of util_est or something? Could you add a comment here
> > explaining why this is so?
>
> The big question I have is incase se != NULL, then its still visible at the
> root RQ level.

My understanding is that you get se != NULL at dequeue time when we are
dequeuing a task from a throttled RQ. Isn't it?

Thus, this means you are dequeuing a throttled task, I guess for
example because of a migration.
However, the point is that a task dequeued from a throttled RQ _is
already_ not visible from the root RQ, because of the sub_nr_running()
done by throttle_cfs_rq().

> In that case should we still call the util_est_dequeue and the
> cpufreq_update_util?

I had a better look at the different code paths and I've possibly come
up with some interesting observations. Let me try to summarize them here.

First of all, we need to distinguish between estimated utilization
updates and schedutil updates, since they respond to two very different
goals.


.:: Estimated utilization updates
=================================

Goal: account for the amount of utilization we are going to
      expect on a CPU

At {en,de}queue time, util_est_{en,de}queue() is always
unconditionally called because it tracks the utilization which is
estimated to be generated by all the RUNNABLE tasks.

We do not care about throttled/un-throttled RQ here because the effect
of throttling is already folded into the estimated utilization.

For example, a 100% task which is placed into a 50% bandwidth
limited TG will generate a 50% (estimated) utilization. Thus, when the
task is enqueued we can account immediately for that utilization
although the RQ can be currently throttled.


.:: Schedutil updates
=====================

Goal: select a better frequency, if and _when_ required

At enqueue time, if the task is visible at the root RQ then it's
expected to run within a scheduler latency period. Thus, it makes sense
to call schedutil immediately to account for its estimated utilization
and possibly increase the OPP.

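
To make the enqueue-side decision concrete, here is a minimal
stand-alone sketch in user-space C, not kernel code. The flag values,
struct task_view and the helper name enqueue_cpufreq_flags() are
assumptions made up for illustration; only the two flag names and the
conditions are taken from the enqueue_task_fair() hunk quoted earlier.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; the real definitions live in kernel headers. */
#define SCHED_CPUFREQ_IOWAIT    (1U << 0)
#define SCHED_CPUFREQ_MIGRATION (1U << 1)

/* Hypothetical, reduced view of the task state consulted at enqueue time. */
struct task_view {
        bool in_iowait;             /* stands in for p->in_iowait */
        uint64_t last_update_time;  /* stands in for p->se.avg.last_update_time */
};

/*
 * Hypothetical helper: returns true when schedutil should be notified and
 * stores the flags to pass along in *flags. The visible_at_root argument
 * models the "!se" check done after the enqueue loops.
 */
static bool enqueue_cpufreq_flags(const struct task_view *p,
                                  bool visible_at_root,
                                  unsigned int *flags)
{
        *flags = 0;

        /* Task enqueued below a throttled RQ: no update is requested. */
        if (!visible_at_root)
                return false;

        if (p->in_iowait)
                *flags |= SCHED_CPUFREQ_IOWAIT;

        /* A zero last_update_time marks a task which has just migrated. */
        if (!p->last_update_time)
                *flags |= SCHED_CPUFREQ_MIGRATION;

        return true;
}
```

In this sketch a caller would invoke cpufreq_update_util(rq, flags)
only when the helper returns true, which matches the conditional call
in the hunk.
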
If instead the task is enqueued into a throttled RQ, then I'm
skipping the update since the task will not run until the RQ is
actually un-throttled.

HOWEVER, I would say that in general we could skip this last
optimization and always unconditionally update schedutil at enqueue
time, considering that the effects of a throttled RQ are always
reflected in the (estimated) utilization of a task.

At dequeue time instead, since we certainly removed some estimated
utilization, I unconditionally updated schedutil.

HOWEVER, I was not considering these two things:

1. for a task going to sleep, we still have its blocked utilization
   accounted in the cfs_rq utilization.

2. for a task being migrated, at dequeue time we still have not
   removed the task's utilization from the cfs_rq's utilization.
   This usually happens later, for example we can have:

     move_queued_task()
       dequeue_task()                   --> CFS task dequeued
       set_task_cpu()                   --> schedutil updated
         migrate_task_rq_fair()
           detach_entity_cfs_rq()
             detach_entity_load_avg()   --> CFS util removal
       enqueue_task()

   Moreover, the "CFS util removal" actually affects the cfs_rq only if
   we hold the RQ lock, otherwise we know that it's just back annotated
   as "removed" utilization and the actual cfs_rq utilization is fixed
   up at the next chance we have the RQ lock.

Thus, I would say that in both cases, at dequeue time it does not make
sense to update schedutil since we always see the task's utilization
in the cfs_rq and thus we will not reduce the frequency.

NOTE, this is true independently from the refactoring I'm proposing.
At dequeue time, although we call update_load_avg() on the root RQ,
it does not make sense to update schedutil since we still see either
the blocked utilization of a sleeping task or the not yet removed
utilization of a migrating task. In both cases the risk is to ask for
a higher OPP right when a CPU is going to be IDLE.

Moreover, it seems that in general we prefer a "conservative" approach
in frequency reduction.
For example, it could be harmful to trigger a frequency reduction when
a task is migrating off a CPU, if right after another task should be
instead migrated into the same CPU.


.:: Conclusions
===============

All that considered, I think I've convinced myself that we really need
to notify schedutil only in these cases:

1. enqueue time
   because of the changes in estimated utilization and the
   possibility to jump straight to a better OPP

2. task tick time
   because of the possible ramp-up of the utilization

Another case is related to remote CPUs blocked utilization update,
after Vincent's recent patches. Currently indeed:

   update_blocked_averages()
      update_load_avg()
         --> update schedutil

and thus, potentially we wake up an IDLE cluster just to reduce its
OPP. If the cluster is in a deep idle state, I'm not entirely sure
this is good from an energy saving standpoint.
However, with the patch I'm proposing we are missing that support,
meaning that an IDLE cluster will get its utilization decayed but we
don't wake it up just to drop its frequency.

Perhaps we should better pass in this information to schedutil via a
flag (e.g. SCHED_FREQ_REMOTE_UPDATE) and implement there a policy to
decide if and when it makes sense to drop the OPP. Or otherwise find a
way for the special DL tasks to always run on the lower capacity_orig
CPUs.

> Sorry if I missed something obvious.

Thanks for the question, it has actually triggered a better analysis of
what we have and what we need.

Looking forward to some feedback about the above before posting a
new version of this last patch.

> thanks!
>
> - Joel

-- 
#include <best/regards.h>

Patrick Bellasi