On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> > I think I follow - Leo please correct me if I mangle your intentions.
> > It's an issue that Morten and Dietmar had mentioned to me as well.
Yes. We have been working on this issue for a while without getting to
a nice solution yet.

> > Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a
> > task group other than the root. The task migrates from one CPU to
> > another. The cfs_rq.avg structures on the src and dest CPUs
> > corresponding to the group containing the task are updated
> > immediately via remove_entity_load_avg()/update_cfs_rq_load_avg()
> > and attach_entity_load_avg(). But the cfs_rq.avg corresponding to
> > the root on the src and dest are not immediately updated. The root
> > cfs_rq.avg must catch up over time with PELT.

Yes. The problem is that only the cfs_rq.avg of the cfs_rq where the
se is enqueued/dequeued gets immediately updated. If the cfs_rq is a
group cfs_rq, its group entity se.avg doesn't get updated immediately;
it has to adapt over time at the pace defined by the geometric series.
The impact of a task migration therefore doesn't trickle through to
the root cfs_rq.avg. This behaviour is one of the fundamental changes
Yuyang introduced with his rewrite of PELT.

> > As to why we care, there's at least one issue which may or may not
> > be Leo's - the root cfs_rq is the one whose avg structure we read
> > from to determine the frequency with schedutil. If you have a
> > cpu-bound task in a non-root cgroup which periodically migrates
> > among CPUs on an otherwise idle system, I believe each time it
> > migrates the frequency would drop to fmin and have to ramp up again
> > with PELT.

It makes any scheduling decision based on utilization difficult if
fair group scheduling is used, as cpu_util() doesn't give an
up-to-date picture of the utilization caused by tasks in task groups.
For the energy-aware scheduling patches, and the patches we have in
the pipeline for improving capacity awareness in the scheduler, we
rely on cpu_util().

> Steve, thanks for the explanation; totally agree.
> My initial purpose was not schedutil's perspective, I just did some
> basic analysis of utilization. My case is: fix the cpu to its maximum
> frequency of 1.2GHz, then launch a periodic task (period: 100ms, duty
> cycle: 40%) and limit its affinity to only two CPUs. I observed the
> same result you describe.
>
> After applying this patch, I get a much better result for the CPU's
> utilization after the task's migration. Please see the parsed result
> for the CPU's utilization:
> http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.png
> http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg.png

> > Leo I noticed you did not modify detach_entity_load_average(). I
> > think this would be needed to avoid the task's stats being double
> > counted for a while after switched_from_fair() or
> > task_move_group_fair().

I'm afraid that the solution to the problem is more complicated than
that :-(

You are adding/removing a contribution from the root cfs_rq.avg which
isn't part of the signal in the first place. The root cfs_rq.avg only
contains the sum of the load/util of the sched_entities on that
cfs_rq. If you remove the contribution of the tasks from there you may
end up double-accounting for the task migration: once due to your
patch, and then again slowly over time as the group sched_entity
starts reflecting that the task has migrated. Furthermore, for group
scheduling to make sense it has to be the task_h_load() you
add/remove, otherwise the group weighting is completely lost. Or am I
completely misreading your patch?

I don't think the slow response time for _load_ is necessarily a big
problem. Otherwise we would have had people complaining already about
group scheduling being broken. It is however a problem for all the
initiatives that build on utilization.
We have to either make the updates trickle through the group hierarchy
for utilization, which is difficult without making a lot of changes to
the current code structure, or introduce a new avg structure at root
level which contains the sum of the utilization of all _tasks_ (not
groups) in the group hierarchy and maintain it separately. Neither of
the two is particularly pretty. Better suggestions?

