Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

Vincent Guittot Fri, 14 Oct 2016 01:26:00 -0700

On 13 October 2016 at 23:34, Vincent Guittot <[email protected]> wrote:
> On 13 October 2016 at 20:49, Dietmar Eggemann <[email protected]> 
> wrote:
>> On 13/10/16 17:48, Vincent Guittot wrote:
>>> On 13 October 2016 at 17:52, Joseph Salisbury
>>> <[email protected]> wrote:
>>>> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
>>>>> Hi,
>>>>>
>>>>> On 12 October 2016 at 18:21, Joseph Salisbury
>>>>> <[email protected]> wrote:
>>>>>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>>>>>>> On 8 October 2016 at 13:49, Mike Galbraith <[email protected]> wrote:
>>>>>>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>>>>>>>> On 8 October 2016 at 10:39, Ingo Molnar <[email protected]> wrote:
>>>>>>>>>> * Peter Zijlstra <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>>>>>>>>>>>> Hello Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> A kernel bug report was opened against Ubuntu [0].  After a
>>>>>>>>>>>> kernel
>>>>>>>>>>>> bisect, it was found that reverting the following commit
>>>>>>>>>>>> resolved this bug:
>>>>>>>>>>>>
>>>>>>>>>>>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>>>>>>>>>>>> Author: Peter Zijlstra <[email protected]>
>>>>>>>>>>>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>>>>>>>>>>>
>>>>>>>>>>>>     sched/fair: Apply more PELT fixes
>>>>>>>>> This patch only speeds up the update of task group load in order to
>>>>>>>>> reflect the new load balance but It should not change the final value
>>>>>>>>> and as a result the final behavior. I will try to reproduce it in my
>>>>>>>>> target later today
>>>>>>>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
>>>>>>> Me too
>>>>>>>
>>>>>>> Is it possible to get some dump of  /proc/sched_debug while the problem 
>>>>>>> occurs ?
>>>>>>>
>>>>>>> Vincent
>>>>>>>
>>>>>>>>         -Mike
>>>>>> The output from /proc/shed_debug can be seen here:
>>>>>> http://paste.ubuntu.com/23312351/
>>>>> I have looked at the dump and there is something very odd for
>>>>> system.slice task group where the display manager is running.
>>>>> system.slice->tg_load_avg is around 381697 but  tg_load_avg is
>>>>> normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
>>>>> whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
>>>>> case. We can have some differences because the dump of
>>>>> /proc/shed_debug is not atomic and some changes can happen but nothing
>>>>> like this difference.
>>>>>
>>>>> The main effect of this quite high value is that the weight/prio of
>>>>> the sched_entity that represents system.slice in root cfs_rq is very
>>>>> low (lower than task with the smallest nice prio) so the system.slice
>>>>> task group will not get the CPU quite often compared to the user.slice
>>>>> task group: less than 1% for the system.slice where lightDM and xorg
>>>>> are running compared 99% for the user.slice where the stress tasks are
>>>>> running. This is confirmed by the se->avg.util_avg value of the task
>>>>> groups which reflect how much time each task group is effectively
>>>>> running on a CPU:
>>>>> system.slice[CPU3].se->avg.util_avg = 8 whereas
>>>>> user.slice[CPU3].se->avg.util_avg = 991
>>>>>
>>>>> This difference of weight/priority explains why the system becomes
>>>>> unresponsive. For now, I can't explain is why
>>>>> system.slice->tg_load_avg = 381697 whereas is should be around 1013
>>>>> and how the patch can generate this situation.
>>>>>
>>>>> Is it possible to have a dump of /proc/sched_debug before starting
>>>>> stress command ? to check if the problem is there from the beginning
>>>>> but not seen because not overloaded. Or if it the problem comes when
>>>>> user starts to load the system
>>>> Here is the dump before stress is started:
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
>>>
>>> This one is ok.
>>> The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
>>> #12~lp1627108Commit3d30544Reverted
>>> so this is without the culprit commit
>>>
>>>>
>>>> Here it is after:
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
>>>>
>>>
>>> This one has the exact same odds values for system.slice->tg_load_avg
>>> than the 1st dump that you sent yesterday
>>> The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
>>> So this dump has been done with a different kernel than for the dump above.
>>> As I can't find any stress task in the dump, i tend to believe that
>>> the dump has been done before starting the stress tasks and not after
>>> starting them. Can you confirm ?
>>>
>>> If i'm right, it mean that the problem was already there before
>>> starting stress tasks.
>>
>> Could it be a problem I'm also seeing on my ARM64 Juno (6 logical cpus) w/o 
>> systemd
>> and w/o autogroup (tip/sched/core 447976ef4fd0):
>>
>> When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the 
>> tg_x->load_avg
>> becomes > 6*1024 before any tasks ran in it.
>
> This is normal as se->avg.load_avg is initialized to
> scale_load_down(se->load.weight) and this se->avg.load_avg will be
> added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq
>
>>
>> tg_x   : 0xffff800975800d80
>> tg_y_1 : 0xffff800975800c00
>> tg_y_2 : 0xffff80097543d200
>>
>>  mkdir-2177 [002] 117.235241: bprint: sched_online_group: 
>> tg=0xffff800975800d80 tg->parent=0xffff000008fd0300
>>  mkdir-2177 [002] 117.235244: bprint: online_fair_sched_group: 
>> tg=0xffff800975800d80 cpu=0
>>  mkdir-2177 [002] 117.235247: bprint: online_fair_sched_group: 
>> tg=0xffff800975800d80 cpu=1
>>  mkdir-2177 [002] 117.235249: bprint: online_fair_sched_group: 
>> tg=0xffff800975800d80 cpu=2
>>  mkdir-2177 [002] 117.235251: bprint: online_fair_sched_group: 
>> tg=0xffff800975800d80 cpu=3
>>  mkdir-2177 [002] 117.235253: bprint: online_fair_sched_group: 
>> tg=0xffff800975800d80 cpu=4
>>  mkdir-2177 [002] 117.235255: bprint: online_fair_sched_group: 
>> tg=0xffff800975800d80 cpu=5
>>
>>  mkdir-2181 [002] 117.353155: bprint: sched_online_group: 
>> tg=0xffff800975800c00 tg->parent=0xffff800975800d80
>>  mkdir-2181 [002] 117.353158: bprint: online_fair_sched_group: 
>> tg=0xffff800975800c00 cpu=0
>>  mkdir-2181 [002] 117.353162: bprint: post_init_entity_util_avg: cpu=0 
>> tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=0 
>> cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>>  mkdir-2181 [002] 117.353164: bprint: online_fair_sched_group: 
>> tg=0xffff800975800c00 cpu=1
>>  mkdir-2181 [002] 117.353167: bprint: post_init_entity_util_avg: cpu=1 
>> tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=1024 
>> cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>>  mkdir-2181 [002] 117.353168: bprint: online_fair_sched_group: 
>> tg=0xffff800975800c00 cpu=2
>>  mkdir-2181 [002] 117.353171: bprint: post_init_entity_util_avg: cpu=2 
>> tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=2048 
>> cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>>  mkdir-2181 [002] 117.353173: bprint: online_fair_sched_group: 
>> tg=0xffff800975800c00 cpu=3
>>  mkdir-2181 [002] 117.353175: bprint: post_init_entity_util_avg: cpu=3 
>> tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=3072 
>> cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>>  mkdir-2181 [002] 117.353177: bprint: online_fair_sched_group: 
>> tg=0xffff800975800c00 cpu=4
>>  mkdir-2181 [002] 117.353179: bprint: post_init_entity_util_avg: cpu=4 
>> tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=4096 
>> cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>>  mkdir-2181 [002] 117.353180: bprint: online_fair_sched_group: 
>> tg=0xffff800975800c00 cpu=5
>>  mkdir-2181 [002] 117.353183: bprint: post_init_entity_util_avg: cpu=5 
>> tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=5120 
>> cfs_rq->avg.load_avg=1024 cfs_rq->tg_load_avg_contrib=0 delta=1024
>>
>
> This looks coherent as tg_x->load_avg = 5120 == sum of
> cfs_rq[cpu]->load_avg and cfs_rq->tg_load_avg_contrib will be set to
> cfs_rq[cpu]->load_avg at the end of the function call
> In the buggy case, tg->load_avg = 381697 but sum of
> cfs_rq[cpu]->tg_load_avg_contrib = 1013 = cfs_rq[cpu]->avg.load_avg so
> there is no way to remove the 381697
>
>>  mkdir-2185 [001] 117.502980: bprint: sched_online_group: 
>> tg=0xffff80097543d200 tg->parent=0xffff800975800d80
>>  mkdir-2185 [001] 117.502982: bprint: online_fair_sched_group: 
>> tg=0xffff80097543d200 cpu=0
>>  mkdir-2185 [001] 117.502987: bprint: post_init_entity_util_avg: cpu=0 
>> tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6144 
>> cfs_rq->avg.load_avg=1068 cfs_rq->tg_load_avg_contrib=1024 delta=44
>>  mkdir-2185 [001] 117.502988: bprint: online_fair_sched_group: 
>> tg=0xffff80097543d200 cpu=1
>>  mkdir-2185 [001] 117.502992: bprint: post_init_entity_util_avg: cpu=1 
>> tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6188 
>> cfs_rq->avg.load_avg=1058 cfs_rq->tg_load_avg_contrib=1024 delta=34
>>  mkdir-2185 [001] 117.502993: bprint: online_fair_sched_group: 
>> tg=0xffff80097543d200 cpu=2
>>  mkdir-2185 [001] 117.502996: bprint: post_init_entity_util_avg: cpu=2 
>> tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6222 
>> cfs_rq->avg.load_avg=1092 cfs_rq->tg_load_avg_contrib=1024 delta=68
>>  mkdir-2185 [001] 117.502998: bprint: online_fair_sched_group: 
>> tg=0xffff80097543d200 cpu=3
>>  mkdir-2185 [001] 117.503001: bprint: post_init_entity_util_avg: cpu=3 
>> tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6290 
>> cfs_rq->avg.load_avg=1069 cfs_rq->tg_load_avg_contrib=1024 delta=45
>>  mkdir-2185 [001] 117.503002: bprint: online_fair_sched_group: 
>> tg=0xffff80097543d200 cpu=4
>>  mkdir-2185 [001] 117.503005: bprint: post_init_entity_util_avg: cpu=4 
>> tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=6335 
>> cfs_rq->avg.load_avg=1064 cfs_rq->tg_load_avg_contrib=1024 delta=40
>>  mkdir-2185 [001] 117.503006: bprint: online_fair_sched_group: 
>> tg=0xffff80097543d200 cpu=5
>>  mkdir-2185 [001] 117.503009: bprint: post_init_entity_util_avg: cpu=5 
>> tg=0xffff800975800d80 tg_css_id=2 cfs_rq->tg->load_avg=*6375* 
>> cfs_rq->avg.load_avg=1061 cfs_rq->tg_load_avg_contrib=1024 delta=37
>>
>> We add delta=1024 to tg_x->load_avg during tg_root/tg_x/tg_y_1 
>> initialization but only
>> delta=~40 for tg_root/tg_x/tg_y_2.
>
> IMO, this comes from the decay of tg_y_2->se[cpu]->avg.load_avg before
> being attached to tg_x
>
> But I don't think that it's the same issue as the tg_x->load_avg is
> still coherent with the sum of cfs_rq->avg.load_avg and the sume of
> cfs_rq->tg_load_avg_contrib
>
>>
>> system.slice->tg_load_avg = 381697 is still pretty much higher and AFAICS, 
>> it's an
>> i5-5300U CPU so only 4 logical cpus.
>>
>> How many system.slice/foo's actually exists in this system? I only see 
>> user.slice
>
> There is several dozen of childs in system.slice but all


but all child have null load_avg and load_avg_contrib

>
>> related cfs_rq[x]:/autogroups-xxx on my Ubuntu-16.04 Desktop system.
>>
>> [...]

Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

Reply via email to