Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-20 Thread Morten Rasmussen
On Wed, Oct 19, 2016 at 07:41:36PM +0200, Vincent Guittot wrote:
> On 19 October 2016 at 15:30, Morten Rasmussen  
> wrote:
> > On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
> >> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
> >> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> >> > > On 18 October 2016 at 11:07, Peter Zijlstra  
> >> > > wrote:
> >> > > > So aside from funny BIOSes, this should also show up when creating
> >> > > > cgroups when you have offlined a few CPUs, which is far more common I'd think.
> >> > >
> >> > > The problem is also that the load of the tg->se[cpu] that represents
> >> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
> >> > > alloc_fair_sched_group
> >> > >  for_each_possible_cpu(i) {
> >> > >  init_entity_runnable_average(se);
> >> > > sa->load_avg = scale_load_down(se->load.weight);
> >> > >
> >> > > Initializing  sa->load_avg to 1024 for a newly created task makes
> >> > > sense as we don't know yet what will be its real load but i'm not sure
> >> > > that we have to do the same for se that represents a task group. This
> >> > > load should be initialized to 0 and it will increase when task will be
> >> > > moved/attached into task group
> >> >
> >> > Yes, I think that makes sense, not sure how horrible that is with the
> >>
> >> That should not be that bad because this initial value is only useful for
> >> the few dozens of ms that follow the creation of the task group
> >
> > IMHO, it doesn't make much sense to initialize empty containers, which
> > group sched_entities really are, to 1024. The value is meant to represent what
> > is in the group, and at creation it is empty, so in my opinion initializing it
> > to zero makes sense.
> >
> >> > current state of things, but after your propagate patch, that
> >> > reinstates the interactivity hack that should work for sure.
> >
> > It actually works on mainline/tip as well.
> >
> > As I see it, the fundamental problem is keeping group entities up to
> > date. Because the load_weight, and hence se->avg.load_avg, of each per-cpu
> > group sched_entity depends on the group cfs_rq->tg_load_avg_contrib for
> > all cpus (tg->load_avg), including those that might be empty and
> > therefore not enqueued, we must ensure that they are updated some other
> > way. Most naturally as part of update_blocked_averages().
> >
> > To guarantee that, it basically boils down to making sure:
> > Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
> > leaf_cfs_rq_list.
> >
> > We can do that in different ways: 1) Add all cfs_rqs to the
> > leaf_cfs_rq_list at task group creation, or 2) initialize group
> > sched_entity contributions to zero and make sure that they are added to
> > leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
> > on it.
> >
> > Vincent's patch below gives us the second option.
> >
> >>  kernel/sched/fair.c | 9 -
> >>  1 file changed, 8 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 8b03fb5..89776ac 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
> >>* will definitely be update (after enqueue).
> >>*/
> >>   sa->period_contrib = 1023;
> >> - sa->load_avg = scale_load_down(se->load.weight);
> >> + /*
> >> +  * Tasks are initialized with full load to be seen as heavy tasks until
> >> +  * they get a chance to stabilize to their real load level.
> >> +  * Group entities are initialized with zero load to reflect the fact that
> >> +  * nothing has been attached yet to the task group.
> >> +  */
> >> + if (entity_is_task(se))
> >> + sa->load_avg = scale_load_down(se->load.weight);
> >>   sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
> >>   /*
> >>* At this point, util_avg won't be used in select_task_rq_fair anyway
> >
> > I would suggest adding a comment somewhere stating that we need to keep
> > group cfs_rqs up to date:
> >
> > -
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index abb3763dff69..2b820d489be0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
> > if (throttled_hierarchy(cfs_rq))
> > continue;
> >
> > +   /*
> > +* Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
> > +* _must_ be on the leaf_cfs_rq_list to ensure that group shares
> > +* are updated correctly.
> > +*/
> 
> As discussed on IRC, the point is that even if the leaf cfs_rq is
> added to the leaf_cfs_rq_list, it doesn't ensure that it will be
> updated correctly for unplugged CPUs

Agreed. We have to ensure that tg_load_avg_contrib is

Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-19 Thread Vincent Guittot
On 19 October 2016 at 17:33, Dietmar Eggemann  wrote:
> On 19/10/16 12:25, Vincent Guittot wrote:
>> On 19 October 2016 at 11:46, Dietmar Eggemann  
>> wrote:
>>> On 18/10/16 12:56, Vincent Guittot wrote:
 On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>> On 18 October 2016 at 11:07, Peter Zijlstra  wrote:
>
> [...]
>
>>> But this test only makes sure that we don't see any ghost contribution
>>> (from non-existing cpus) any more.
>>>
>>> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
>>> (with the highest tg having a task enqueued) a little bit more, with and
>>> without your v5 'sched: reflect sched_entity move into task_group's load'.
>>
>> Can you elaborate ?
>
> I try :-)
>
> I thought I will see some different behaviour because of the fact that
> the tg se's are initialized differently [1024 versus 0].

This difference should be noticeable (if at all) only during the few
hundred ms after the creation of the task group, until the load_avg
has reached its real value.
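
As a rough illustration of that timescale (a userspace sketch, not kernel code, assuming the usual PELT constants of ~1ms periods and a 32ms half-life, i.e. y^32 = 0.5), an unused initial contribution of 1024 decays like this:

#include <math.h>
#include <stdio.h>

int main(void)
{
        /* per-millisecond PELT decay factor y, chosen so that y^32 == 0.5 */
        const double y = pow(0.5, 1.0 / 32.0);
        int ms;

        for (ms = 0; ms <= 320; ms += 32)
                printf("after %3d ms: remaining contribution ~ %4.0f\n",
                       ms, 1024.0 * pow(y, ms));

        return 0;
}

After roughly ten half-lives (~320ms) the initial 1024 is down to ~1, which matches the 'few hundred ms' window above.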

>
> But I can't spot any difference. The test case is running a sysbench
> thread affine to cpu1 in tg_root/tg_1/tg_11/tg_111 on tip/sched/core on
> an ARM64 Juno (6 logical cpus).
> The moment the sysbench task is put into tg_111
> tg_111->se[1]->avg.load_avg gets updated to 0 any way because of the
> huge time difference between creating this tg and attaching a task to
> it. So the tg->se[2]->avg.load_avg signals for tg_111, tg_11 and tg_1
> look exactly the same w/o and w/ your patch.
>
> But your patch helps in this (very synthetic) test case as well. W/o
> your patch I see remaining tg->load_avg for tg_1 and tg_11 after the
> test case has finished because the tg's were exclusively used on cpu1.
>
> # cat /proc/sched_debug
>
>  cfs_rq[1]:/tg_1
>.tg_load_avg_contrib   : 0
>.tg_load_avg   : 5120 (5 (unused cpus) * 1024 * 1)
>  cfs_rq[1]:/tg_1/tg_11/tg_111
>.tg_load_avg_contrib   : 0
>.tg_load_avg   : 0
>  cfs_rq[1]:/tg_1/tg_11
>.tg_load_avg_contrib   : 0
>.tg_load_avg   : 5120
>
> With your patch applied all the .tg_load_avg are 0.


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-19 Thread Vincent Guittot
On 19 October 2016 at 15:30, Morten Rasmussen  wrote:
> On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>> > > On 18 October 2016 at 11:07, Peter Zijlstra  wrote:
>> > > > So aside from funny BIOSes, this should also show up when creating
>> > > > cgroups when you have offlined a few CPUs, which is far more common I'd
>> > > > think.
>> > >
>> > > The problem is also that the load of the tg->se[cpu] that represents
>> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
>> > > alloc_fair_sched_group
>> > >  for_each_possible_cpu(i) {
>> > >  init_entity_runnable_average(se);
>> > > sa->load_avg = scale_load_down(se->load.weight);
>> > >
>> > > Initializing  sa->load_avg to 1024 for a newly created task makes
>> > > sense as we don't know yet what will be its real load but i'm not sure
>> > > that we have to do the same for se that represents a task group. This
>> > > load should be initialized to 0 and it will increase when task will be
>> > > moved/attached into task group
>> >
>> > Yes, I think that makes sense, not sure how horrible that is with the
>>
>> That should not be that bad because this initial value is only useful for
>> the few dozens of ms that follow the creation of the task group
>
> IMHO, it doesn't make much sense to initialize empty containers, which
> group sched_entities really are, to 1024. The value is meant to represent what
> is in the group, and at creation it is empty, so in my opinion initializing it
> to zero makes sense.
>
>> > current state of things, but after your propagate patch, that
>> > reinstates the interactivity hack that should work for sure.
>
> It actually works on mainline/tip as well.
>
> As I see it, the fundamental problem is keeping group entities up to
> date. Because the load_weight, and hence se->avg.load_avg, of each per-cpu
> group sched_entity depends on the group cfs_rq->tg_load_avg_contrib for
> all cpus (tg->load_avg), including those that might be empty and
> therefore not enqueued, we must ensure that they are updated some other
> way. Most naturally as part of update_blocked_averages().
>
> To guarantee that, it basically boils down to making sure:
> Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
> leaf_cfs_rq_list.
>
> We can do that in different ways: 1) Add all cfs_rqs to the
> leaf_cfs_rq_list at task group creation, or 2) initialize group
> sched_entity contributions to zero and make sure that they are added to
> leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
> on it.
>
> Vincent's patch below gives us the second option.
>
>>  kernel/sched/fair.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>* will definitely be update (after enqueue).
>>*/
>>   sa->period_contrib = 1023;
>> - sa->load_avg = scale_load_down(se->load.weight);
>> + /*
>> +  * Tasks are initialized with full load to be seen as heavy tasks until
>> +  * they get a chance to stabilize to their real load level.
>> +  * Group entities are initialized with zero load to reflect the fact that
>> +  * nothing has been attached yet to the task group.
>> +  */
>> + if (entity_is_task(se))
>> + sa->load_avg = scale_load_down(se->load.weight);
>>   sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>   /*
>>* At this point, util_avg won't be used in select_task_rq_fair anyway
>
> I would suggest adding a comment somewhere stating that we need to keep
> group cfs_rqs up to date:
>
> -
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index abb3763dff69..2b820d489be0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
> if (throttled_hierarchy(cfs_rq))
> continue;
>
> +   /*
> +* Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
> +* _must_ be on the leaf_cfs_rq_list to ensure that group shares
> +* are updated correctly.
> +*/

As discussed on IRC, the point is that even if the leaf cfs_rq is
added to the leaf_cfs_rq_list, it doesn't ensure that it will be
updated correctly for unplugged CPUs

> if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
> update_tg_load_avg(cfs_rq, 0);
> }
> -
>
> I did a couple of simple tests on tip/sched/core to test whether
> Vincent's fix works even without reflecting group load/util in the group
> hierarchy:
>
> Juno (2xA57+4xA53)
>
> tip:
>  

Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-19 Thread Joonwoo Park
On Wed, Oct 19, 2016 at 04:33:03PM +0100, Dietmar Eggemann wrote:
> On 19/10/16 12:25, Vincent Guittot wrote:
> > On 19 October 2016 at 11:46, Dietmar Eggemann  
> > wrote:
> >> On 18/10/16 12:56, Vincent Guittot wrote:
> >>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>  On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> > On 18 October 2016 at 11:07, Peter Zijlstra  
> > wrote:
> 
> [...]
> 
> >> But this test only makes sure that we don't see any ghost contribution
> >> (from non-existing cpus) any more.
> >>
> >> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
> >> (with the highest tg having a task enqueued) a little bit more, with and
> >> without your v5 'sched: reflect sched_entity move into task_group's load'.
> > 
> > Can you elaborate ?
> 
> I try :-)
> 
> I thought I will see some different behaviour because of the fact that
> the tg se's are initialized differently [1024 versus 0].

This is the exact thing I was also worried about, and that's the reason I
tried to fix this in a different way.
However, I didn't find any behaviour difference once a task is attached to
the child cfs_rq, which is the point we really care about.

I found this bug while making the patch at https://lkml.org/lkml/2016/10/18/841,
which will fail with a wrong task_group load_avg.
I tested Vincent's patch and the above together and confirmed it's still good.

Though I know Ingo has already sent out the pull request.  Anyway.

Tested-by: Joonwoo Park 

Thanks,
Joonwoo

> 
> But I can't spot any difference. The test case is running a sysbench
> thread affine to cpu1 in tg_root/tg_1/tg_11/tg_111 on tip/sched/core on
> an ARM64 Juno (6 logical cpus).
> The moment the sysbench task is put into tg_111
> tg_111->se[1]->avg.load_avg gets updated to 0 any way because of the
> huge time difference between creating this tg and attaching a task to
> it. So the tg->se[2]->avg.load_avg signals for tg_111, tg_11 and tg_1
> look exactly the same w/o and w/ your patch.
> 
> But your patch helps in this (very synthetic) test case as well. W/o
> your patch I see remaining tg->load_avg for tg_1 and tg_11 after the
> test case has finished because the tg's were exclusively used on cpu1.
> 
> # cat /proc/sched_debug
> 
>  cfs_rq[1]:/tg_1
>.tg_load_avg_contrib   : 0
>.tg_load_avg   : 5120 (5 (unused cpus) * 1024 * 1)
>  cfs_rq[1]:/tg_1/tg_11/tg_111
>.tg_load_avg_contrib   : 0
>.tg_load_avg   : 0
>  cfs_rq[1]:/tg_1/tg_11
>.tg_load_avg_contrib   : 0
>.tg_load_avg   : 5120
> 
> With your patch applied all the .tg_load_avg are 0.


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-19 Thread Morten Rasmussen
On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> > > On 18 October 2016 at 11:07, Peter Zijlstra  wrote:
> > > > So aside from funny BIOSes, this should also show up when creating
> > > > cgroups when you have offlined a few CPUs, which is far more common I'd
> > > > think.
> > > 
> > > The problem is also that the load of the tg->se[cpu] that represents
> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
> > > alloc_fair_sched_group
> > >  for_each_possible_cpu(i) {
> > >  init_entity_runnable_average(se);
> > > sa->load_avg = scale_load_down(se->load.weight);
> > > 
> > > Initializing  sa->load_avg to 1024 for a newly created task makes
> > > sense as we don't know yet what will be its real load but i'm not sure
> > > that we have to do the same for se that represents a task group. This
> > > load should be initialized to 0 and it will increase when task will be
> > > moved/attached into task group
> > 
> > Yes, I think that makes sense, not sure how horrible that is with the
> 
> That should not be that bad because this initial value is only useful for
> the few dozens of ms that follow the creation of the task group

IMHO, it doesn't make much sense to initialize empty containers, which
group sched_entities really are, to 1024. The value is meant to represent what
is in the group, and at creation it is empty, so in my opinion initializing it
to zero makes sense.
 
> > current state of things, but after your propagate patch, that
> > reinstates the interactivity hack that should work for sure.

It actually works on mainline/tip as well.

As I see it, the fundamental problem is keeping group entities up to
date. Because the load_weight, and hence se->avg.load_avg, of each per-cpu
group sched_entity depends on the group cfs_rq->tg_load_avg_contrib for
all cpus (tg->load_avg), including those that might be empty and
therefore not enqueued, we must ensure that they are updated some other
way, most naturally as part of update_blocked_averages().
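
For reference, that update path looks roughly like this in the 4.8-era kernel/sched/fair.c (a from-memory sketch, so details may differ slightly); note that it only walks the cfs_rqs that are actually on the rq's leaf_cfs_rq_list:

static void update_blocked_averages(int cpu)
{
        struct rq *rq = cpu_rq(cpu);
        struct cfs_rq *cfs_rq;
        unsigned long flags;

        raw_spin_lock_irqsave(&rq->lock, flags);
        update_rq_clock(rq);

        /* Iterates the task_group tree bottom-up, but only over leaf cfs_rqs. */
        for_each_leaf_cfs_rq(rq, cfs_rq) {
                /* throttled entities do not contribute to load */
                if (throttled_hierarchy(cfs_rq))
                        continue;

                if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
                        update_tg_load_avg(cfs_rq, 0);
        }
        raw_spin_unlock_irqrestore(&rq->lock, flags);
}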

To guarantee that, it basically boils down to making sure:
Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
leaf_cfs_rq_list.

We can do that in different ways: 1) Add all cfs_rqs to the
leaf_cfs_rq_list at task group creation, or 2) initialize group
sched_entity contributions to zero and make sure that they are added to
leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
on it.
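
A rough, untested sketch of what option 1 could look like, based on the 4.8-era online_fair_sched_group() (the list_add_leaf_cfs_rq() call is the only addition; everything else is reproduced from memory):

void online_fair_sched_group(struct task_group *tg)
{
        struct sched_entity *se;
        struct rq *rq;
        int i;

        for_each_possible_cpu(i) {
                rq = cpu_rq(i);
                se = tg->se[i];

                raw_spin_lock_irq(&rq->lock);
                post_init_entity_util_avg(se);
                sync_throttle(tg, i);
                /* option 1: make the group cfs_rq visible to update_blocked_averages() */
                list_add_leaf_cfs_rq(tg->cfs_rq[i]);
                raw_spin_unlock_irq(&rq->lock);
        }
}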

Vincent's patch below gives us the second option.

>  kernel/sched/fair.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>* will definitely be update (after enqueue).
>*/
>   sa->period_contrib = 1023;
> - sa->load_avg = scale_load_down(se->load.weight);
> + /*
> +  * Tasks are initialized with full load to be seen as heavy tasks until
> +  * they get a chance to stabilize to their real load level.
> +  * Group entities are initialized with zero load to reflect the fact that
> +  * nothing has been attached yet to the task group.
> +  */
> + if (entity_is_task(se))
> + sa->load_avg = scale_load_down(se->load.weight);
>   sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>   /*
>* At this point, util_avg won't be used in select_task_rq_fair anyway

I would suggest adding a comment somewhere stating that we need to keep
group cfs_rqs up to date:

-
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index abb3763dff69..2b820d489be0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
if (throttled_hierarchy(cfs_rq))
continue;
 
+   /*
+* Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
+* _must_ be on the leaf_cfs_rq_list to ensure that group shares
+* are updated correctly.
+*/
if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
update_tg_load_avg(cfs_rq, 0);
}
-

I did a couple of simple tests on tip/sched/core to test whether
Vincent's fix works even without reflecting group load/util in the group
hierarchy:

Juno (2xA57+4xA53)

tip:
grouped hog(1) alone: 2841
non-grouped hogs(6) alone: 40830
grouped hog(1): 218
non-grouped hogs(6): 40580

tip+vg:
grouped hog alone: 2849
non-grouped hogs(6) alone: 40831
grouped hog: 2363
non-grouped hogs: 38418

See script below for details, but we basically see that the grouped task
is not getting its 'fair' share 

Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-19 Thread Peter Zijlstra
On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:

> ---
>  kernel/sched/fair.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity 
> *se)
>* will definitely be update (after enqueue).
>*/
>   sa->period_contrib = 1023;
> - sa->load_avg = scale_load_down(se->load.weight);
> + /*
>> +  * Tasks are initialized with full load to be seen as heavy tasks until
> +  * they get a chance to stabilize to their real load level.
>> +  * Group entities are initialized with zero load to reflect the fact that
> +  * nothing has been attached yet to the task group.
> +  */
> + if (entity_is_task(se))
> + sa->load_avg = scale_load_down(se->load.weight);
>   sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>   /*
>* At this point, util_avg won't be used in select_task_rq_fair anyway
> 

Vince, could you post a proper version of this patch with changelogs and
tags so that we can get that merged into Linus' tree and stable for 4.8?



Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-19 Thread Dietmar Eggemann
On 18/10/16 12:56, Vincent Guittot wrote:
> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>> On 18 October 2016 at 11:07, Peter Zijlstra  wrote:

[...]

> 
> The patch below fixes the issue on my platform: 
> 
> Dietmar, Omer can you confirm that this fix the problem of your platform too ?

It fixes this broken BIOS issue on my T430 ( cpu_possible_mask >
cpu_online_mask). I ran the original test with the cpu hogs (stress -c
4). Launch time of applications becomes normal again.

Tested-by: Dietmar Eggemann 

But this test only makes sure that we don't see any ghost contribution
(from non-existing cpus) any more.

We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
(with the highest tg having a task enqueued) a little bit more, with and
without your v5 'sched: reflect sched_entity move into task_group's load'.

> ---
>  kernel/sched/fair.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>* will definitely be update (after enqueue).
>*/
>   sa->period_contrib = 1023;
> - sa->load_avg = scale_load_down(se->load.weight);
> + /*
> +  * Tasks are initialized with full load to be seen as heavy tasks until
> +  * they get a chance to stabilize to their real load level.
> +  * Group entities are initialized with zero load to reflect the fact that
> +  * nothing has been attached yet to the task group.
> +  */
> + if (entity_is_task(se))
> + sa->load_avg = scale_load_down(se->load.weight);
>   sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>   /*
>* At this point, util_avg won't be used in select_task_rq_fair anyway


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-19 Thread Vincent Guittot
On 19 October 2016 at 11:46, Dietmar Eggemann  wrote:
> On 18/10/16 12:56, Vincent Guittot wrote:
>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
 On 18 October 2016 at 11:07, Peter Zijlstra  wrote:
>
> [...]
>
>>
>> The patch below fixes the issue on my platform:
>>
>> Dietmar, Omer can you confirm that this fix the problem of your platform too ?
>
> It fixes this broken BIOS issue on my T430 ( cpu_possible_mask >
> cpu_online_mask). I ran the original test with the cpu hogs (stress -c
> 4). Launch time of applications becomes normal again.
>
> Tested-by: Dietmar Eggemann 

Thanks

>
> But this test only makes sure that we don't see any ghost contribution
> (from non-existing cpus) any more.
>
> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
> (with the highest tg having a task enqueued) a little bit more, with and
> without your v5 'sched: reflect sched_entity move into task_group's load'.

Can you elaborate?

Vincent

>
>> ---
>>  kernel/sched/fair.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>* will definitely be update (after enqueue).
>>*/
>>   sa->period_contrib = 1023;
>> - sa->load_avg = scale_load_down(se->load.weight);
>> + /*
>> +  * Tasks are initialized with full load to be seen as heavy tasks until
>> +  * they get a chance to stabilize to their real load level.
>> +  * Group entities are initialized with zero load to reflect the fact that
>> +  * nothing has been attached yet to the task group.
>> +  */
>> + if (entity_is_task(se))
>> + sa->load_avg = scale_load_down(se->load.weight);
>>   sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>   /*
>>* At this point, util_avg won't be used in select_task_rq_fair anyway


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-19 Thread Dietmar Eggemann
On 19/10/16 12:25, Vincent Guittot wrote:
> On 19 October 2016 at 11:46, Dietmar Eggemann  
> wrote:
>> On 18/10/16 12:56, Vincent Guittot wrote:
>>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
 On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> On 18 October 2016 at 11:07, Peter Zijlstra  wrote:

[...]

>> But this test only makes sure that we don't see any ghost contribution
>> (from non-existing cpus) any more.
>>
>> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
>> (with the highest tg having a task enqueued) a little bit more, with and
>> without your v5 'sched: reflect sched_entity move into task_group's load'.
> 
> Can you elaborate ?

I try :-)

I thought I would see some different behaviour because of the fact that
the tg se's are initialized differently [1024 versus 0].

But I can't spot any difference. The test case is running a sysbench
thread affine to cpu1 in tg_root/tg_1/tg_11/tg_111 on tip/sched/core on
an ARM64 Juno (6 logical cpus).
The moment the sysbench task is put into tg_111
tg_111->se[1]->avg.load_avg gets updated to 0 anyway because of the
huge time difference between creating this tg and attaching a task to
it. So the tg->se[2]->avg.load_avg signals for tg_111, tg_11 and tg_1
look exactly the same w/o and w/ your patch.

But your patch helps in this (very synthetic) test case as well. W/o
your patch I see remaining tg->load_avg for tg_1 and tg_11 after the
test case has finished because the tg's were exclusively used on cpu1.

# cat /proc/sched_debug

 cfs_rq[1]:/tg_1
   .tg_load_avg_contrib   : 0
   .tg_load_avg   : 5120 (5 (unused cpus) * 1024 * 1)
 cfs_rq[1]:/tg_1/tg_11/tg_111
   .tg_load_avg_contrib   : 0
   .tg_load_avg   : 0
 cfs_rq[1]:/tg_1/tg_11
   .tg_load_avg_contrib   : 0
   .tg_load_avg   : 5120

With your patch applied all the .tg_load_avg are 0.


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-19 Thread Vincent Guittot
On 19 October 2016 at 16:49, Joseph Salisbury
 wrote:
> On 10/18/2016 07:56 AM, Vincent Guittot wrote:
>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
 On 18 October 2016 at 11:07, Peter Zijlstra  wrote:
> So aside from funny BIOSes, this should also show up when creating
> cgroups when you have offlined a few CPUs, which is far more common I'd
> think.
 The problem is also that the load of the tg->se[cpu] that represents
 the tg->cfs_rq[cpu] is initialized to 1024 in:
 alloc_fair_sched_group
  for_each_possible_cpu(i) {
  init_entity_runnable_average(se);
 sa->load_avg = scale_load_down(se->load.weight);

 Initializing  sa->load_avg to 1024 for a newly created task makes
 sense as we don't know yet what will be its real load but i'm not sure
 that we have to do the same for se that represents a task group. This
 load should be initialized to 0 and it will increase when task will be
 moved/attached into task group
>>> Yes, I think that makes sense, not sure how horrible that is with the
>> That should not be that bad because this initial value is only useful for
>> the few dozens of ms that follow the creation of the task group
>>
>>> current state of things, but after your propagate patch, that
>>> reinstates the interactivity hack that should work for sure.
>> The patch below fixes the issue on my platform:
>>
>> Dietmar, Omer can you confirm that this fix the problem of your platform too ?
>>
>> ---
>>  kernel/sched/fair.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>* will definitely be update (after enqueue).
>>*/
>>   sa->period_contrib = 1023;
>> - sa->load_avg = scale_load_down(se->load.weight);
>> + /*
>> +  * Tasks are initialized with full load to be seen as heavy tasks until
>> +  * they get a chance to stabilize to their real load level.
>> +  * Group entities are initialized with zero load to reflect the fact that
>> +  * nothing has been attached yet to the task group.
>> +  */
>> + if (entity_is_task(se))
>> + sa->load_avg = scale_load_down(se->load.weight);
>>   sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>   /*
>>* At this point, util_avg won't be used in select_task_rq_fair anyway
>>
>>
>>
>>
> Omer also reports that this patch fixes the bug for him as well.  Thanks
> for the great work, Vincent!

Thanks

>


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-19 Thread Joseph Salisbury
On 10/18/2016 07:56 AM, Vincent Guittot wrote:
> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>>> On 18 October 2016 at 11:07, Peter Zijlstra  wrote:
 So aside from funny BIOSes, this should also show up when creating
 cgroups when you have offlined a few CPUs, which is far more common I'd
 think.
>>> The problem is also that the load of the tg->se[cpu] that represents
>>> the tg->cfs_rq[cpu] is initialized to 1024 in:
>>> alloc_fair_sched_group
>>>  for_each_possible_cpu(i) {
>>>  init_entity_runnable_average(se);
>>> sa->load_avg = scale_load_down(se->load.weight);
>>>
>>> Initializing  sa->load_avg to 1024 for a newly created task makes
>>> sense as we don't know yet what will be its real load but i'm not sure
>>> that we have to do the same for se that represents a task group. This
>>> load should be initialized to 0 and it will increase when task will be
>>> moved/attached into task group
>> Yes, I think that makes sense, not sure how horrible that is with the
> That should not be that bad because this initial value is only useful for
> the few dozens of ms that follow the creation of the task group 
>
>> current state of things, but after your propagate patch, that
>> reinstates the interactivity hack that should work for sure.
> The patch below fixes the issue on my platform: 
>
> Dietmar, Omer can you confirm that this fix the problem of your platform too ?
>
> ---
>  kernel/sched/fair.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>* will definitely be update (after enqueue).
>*/
>   sa->period_contrib = 1023;
> - sa->load_avg = scale_load_down(se->load.weight);
> + /*
> +  * Tasks are initialized with full load to be seen as heavy tasks until
> +  * they get a chance to stabilize to their real load level.
> +  * Group entities are initialized with zero load to reflect the fact that
> +  * nothing has been attached yet to the task group.
> +  */
> + if (entity_is_task(se))
> + sa->load_avg = scale_load_down(se->load.weight);
>   sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>   /*
>* At this point, util_avg won't be used in select_task_rq_fair anyway
>
>
>
>
Omer also reports that this patch fixes the bug for him as well.  Thanks
for the great work, Vincent!



Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-19 Thread Vincent Guittot
On 19 October 2016 at 13:33, Peter Zijlstra  wrote:
> On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
>
>> ---
>>  kernel/sched/fair.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>* will definitely be update (after enqueue).
>>*/
>>   sa->period_contrib = 1023;
>> - sa->load_avg = scale_load_down(se->load.weight);
>> + /*
>> +  * Tasks are initialized with full load to be seen as heavy tasks until
>> +  * they get a chance to stabilize to their real load level.
>> +  * Group entities are initialized with zero load to reflect the fact that
>> +  * nothing has been attached yet to the task group.
>> +  */
>> + if (entity_is_task(se))
>> + sa->load_avg = scale_load_down(se->load.weight);
>>   sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>   /*
>>* At this point, util_avg won't be used in select_task_rq_fair anyway
>>
>
> Vince, could you post a proper version of this patch with changelogs and
> tags so that we can get that merged into Linus' tree and stable for 4.8?

Yes. I just have to finish the changelog and then I'll send it.

>


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-18 Thread Vincent Guittot
On 18 October 2016 at 23:58, Joonwoo Park  wrote:
>
>
> On 10/18/2016 04:56 AM, Vincent Guittot wrote:
>>
>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>>>
>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:

 On 18 October 2016 at 11:07, Peter Zijlstra 
 wrote:
>
> So aside from funny BIOSes, this should also show up when creating
> cgroups when you have offlined a few CPUs, which is far more common I'd
> think.


 The problem is also that the load of the tg->se[cpu] that represents
 the tg->cfs_rq[cpu] is initialized to 1024 in:
 alloc_fair_sched_group
  for_each_possible_cpu(i) {
  init_entity_runnable_average(se);
 sa->load_avg = scale_load_down(se->load.weight);

 Initializing  sa->load_avg to 1024 for a newly created task makes
 sense as we don't know yet what will be its real load but i'm not sure
 that we have to do the same for se that represents a task group. This
 load should be initialized to 0 and it will increase when task will be
 moved/attached into task group
>>>
>>>
>>> Yes, I think that makes sense, not sure how horrible that is with the
>>
>>
>> That should not be that bad because this initial value is only useful for
>> the few dozens of ms that follow the creation of the task group
>>
>>>
>>> current state of things, but after your propagate patch, that
>>> reinstates the interactivity hack that should work for sure.
>>
>>
>> The patch below fixes the issue on my platform:
>>
>> Dietmar, Omer can you confirm that this fix the problem of your platform
>> too ?
>
>
> I just noticed this thread after posting
> https://lkml.org/lkml/2016/10/18/719...
> Noticed this bug a while ago and have had the patch above for at least a week, but
> unfortunately didn't have time to post...
> I think Omer had the same problem I was trying to fix, and I believe the patch I posted
> should address it.
>
> Vincent, your version fixes my test case as well.

Thanks for testing.
Can I consider this as a Tested-by?

> This is sched_stat from the same test case I had in my changelog.
> Note that dd-2030, which is in the root cgroup, had the same runtime as dd-2033,
> which is in a child cgroup.
>
> dd (2030, #threads: 1)
> ---
> se.exec_start:275700.024137
> se.vruntime  : 10589.114654
> se.sum_exec_runtime  :  1576.837993
> se.nr_migrations :0
> nr_switches  :  159
> nr_voluntary_switches:0
> nr_involuntary_switches  :  159
> se.load.weight   :  1048576
> se.avg.load_sum  : 48840575
> se.avg.util_sum  : 19741820
> se.avg.load_avg  : 1022
> se.avg.util_avg  :  413
> se.avg.last_update_time  : 275700024137
> policy   :0
> prio :  120
> clock-delta  :   34
> dd (2033, #threads: 1)
> ---
> se.exec_start:275710.037178
> se.vruntime  :  2383.802868
> se.sum_exec_runtime  :  1576.547591
> se.nr_migrations :0
> nr_switches  :  162
> nr_voluntary_switches:0
> nr_involuntary_switches  :  162
> se.load.weight   :  1048576
> se.avg.load_sum  : 48316646
> se.avg.util_sum  : 21235249
> se.avg.load_avg  : 1011
> se.avg.util_avg  :  444
> se.avg.last_update_time  : 275710037178
> policy   :0
> prio :  120
> clock-delta  :   36
>
> Thanks,
> Joonwoo
>
>
>>
>> ---
>>  kernel/sched/fair.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_

Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-18 Thread Joonwoo Park



On 10/18/2016 04:56 AM, Vincent Guittot wrote:

On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:

On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:

On 18 October 2016 at 11:07, Peter Zijlstra  wrote:

So aside from funny BIOSes, this should also show up when creating
cgroups when you have offlined a few CPUs, which is far more common I'd
think.


The problem is also that the load of the tg->se[cpu] that represents
the tg->cfs_rq[cpu] is initialized to 1024 in:
alloc_fair_sched_group
 for_each_possible_cpu(i) {
 init_entity_runnable_average(se);
sa->load_avg = scale_load_down(se->load.weight);

Initializing  sa->load_avg to 1024 for a newly created task makes
sense as we don't know yet what will be its real load but i'm not sure
that we have to do the same for se that represents a task group. This
load should be initialized to 0 and it will increase when task will be
moved/attached into task group


Yes, I think that makes sense, not sure how horrible that is with the


That should not be that bad because this initial value is only useful for
the few dozens of ms that follow the creation of the task group



current state of things, but after your propagate patch, that
reinstates the interactivity hack that should work for sure.


The patch below fixes the issue on my platform:

Dietmar, Omer can you confirm that this fix the problem of your platform too ?


I just noticed this thread after posting 
https://lkml.org/lkml/2016/10/18/719...
Noticed this bug a while ago and have had the patch above for at least a week, but 
unfortunately didn't have time to post...
I think Omer had the same problem I was trying to fix, and I believe the patch I 
posted should address it.


Vincent, your version fixes my test case as well.
This is sched_stat from the same test case I had in my changelog.
Note that dd-2030, which is in the root cgroup, had the same runtime as dd-2033, 
which is in a child cgroup.


dd (2030, #threads: 1)
---
se.exec_start:275700.024137
se.vruntime  : 10589.114654
se.sum_exec_runtime  :  1576.837993
se.nr_migrations :0
nr_switches  :  159
nr_voluntary_switches:0
nr_involuntary_switches  :  159
se.load.weight   :  1048576
se.avg.load_sum  : 48840575
se.avg.util_sum  : 19741820
se.avg.load_avg  : 1022
se.avg.util_avg  :  413
se.avg.last_update_time  : 275700024137
policy   :0
prio :  120
clock-delta  :   34
dd (2033, #threads: 1)
---
se.exec_start:275710.037178
se.vruntime  :  2383.802868
se.sum_exec_runtime  :  1576.547591
se.nr_migrations :0
nr_switches  :  162
nr_voluntary_switches:0
nr_involuntary_switches  :  162
se.load.weight   :  1048576
se.avg.load_sum  : 48316646
se.avg.util_sum  : 21235249
se.avg.load_avg  : 1011
se.avg.util_avg  :  444
se.avg.last_update_time  : 275710037178
policy   :0
prio :  120
clock-delta  :   36

Thanks,
Joonwoo



---
 kernel/sched/fair.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b03fb5..89776ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
 * will definitely be update (after enqueue).
 */
sa->period_contrib = 1023;
-   sa->load_avg = scale_load_down(se->load.weight);
+   /*
+* Tasks are initialized with full load to be seen as heavy tasks until
+* they get a chance to stabilize to their real load level.
+* Group entities are initialized with zero load to reflect the fact that
+* nothing has been attached yet to the task group.
+*/

Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-18 Thread Peter Zijlstra
On Tue, Oct 18, 2016 at 12:15:11PM +0100, Dietmar Eggemann wrote:
> On 18/10/16 10:07, Peter Zijlstra wrote:
> > On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:

> > On IRC you mentioned that adding list_add_leaf_cfs_rq() to
> > online_fair_sched_group() cures this, this would actually match with
> > unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid
> > a few instructions on the enqueue path, so that's all good.
> 
> Yes, I was able to recreate a similar problem (not related to the cpu
> masks) on ARM64 (6 logical cpus). I created 100 2. level tg's but only
> put one task (no cpu affinity, so it could run on multiple cpus) in one
> of these tg's (mainly to see the related cfs_rq's in /proc/sched_debug).
> 
> I get a remaining .tg_load_avg : 49898 for cfs_rq[x]:/tg_1

Ah, and since all those CPUs are online, we decay all that load away. OK
makes sense now.

>  > I'm just not immediately seeing how that cures things. The only relevant
> > user of the leaf_cfs_rq list seems to be update_blocked_averages() which
> > is called from the balance code (idle_balance() and
> > rebalance_domains()). But neither should call that for offline (or
> > !present) CPUs.
> 
> Assuming this is load from the 99 2. level tg's which never had a task
> running, putting list_add_leaf_cfs_rq() into online_fair_sched_group()
> for all cpus makes sure that all the 'blocked load' get's decayed.
> 
> Doing what Vincent just suggested, not initializing tg se's w/ 1024 but
> w/ 0 instead prevents this from being necessary.

Indeed. I just worry about the cases where we do not propagate the load
up, e.g. the stuff fixed by:

  1476695653-12309-5-git-send-email-vincent.guit...@linaro.org

If we hit an intermediary cgroup with 0 load, we might get some
interactivity issues.

But it could be I got lost again :-)


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-18 Thread Vincent Guittot
On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> > On 18 October 2016 at 11:07, Peter Zijlstra  wrote:
> > > So aside from funny BIOSes, this should also show up when creating
> > > cgroups when you have offlined a few CPUs, which is far more common I'd
> > > think.
> > 
> > The problem is also that the load of the tg->se[cpu] that represents
> > the tg->cfs_rq[cpu] is initialized to 1024 in:
> > alloc_fair_sched_group
> >  for_each_possible_cpu(i) {
> >  init_entity_runnable_average(se);
> > sa->load_avg = scale_load_down(se->load.weight);
> > 
> > Initializing  sa->load_avg to 1024 for a newly created task makes
> > sense as we don't know yet what will be its real load but i'm not sure
> > that we have to do the same for se that represents a task group. This
> > load should be initialized to 0 and it will increase when task will be
> > moved/attached into task group
> 
> Yes, I think that makes sense, not sure how horrible that is with the

That should not be that bad, because this initial value is only useful for
the few dozen ms that follow the creation of the task group.

>
> current state of things, but after your propagate patch, that
> reinstates the interactivity hack that should work for sure.

The patch below fixes the issue on my platform: 

Dietmar, Omer, can you confirm that this fixes the problem on your platforms too?

---
 kernel/sched/fair.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b03fb5..89776ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
 * will definitely be update (after enqueue).
 */
sa->period_contrib = 1023;
-   sa->load_avg = scale_load_down(se->load.weight);
+   /*
+* Tasks are initialized with full load to be seen as heavy tasks until
+* they get a chance to stabilize to their real load level.
+* Group entities are initialized with zero load to reflect the fact that
+* nothing has been attached yet to the task group.
+*/
+   if (entity_is_task(se))
+   sa->load_avg = scale_load_down(se->load.weight);
sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
/*
 * At this point, util_avg won't be used in select_task_rq_fair anyway
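
For reference, the entity_is_task() test used above distinguishes task entities from group entities; with CONFIG_FAIR_GROUP_SCHED it should be just:

/* a group entity owns a group cfs_rq (my_q); a task entity does not */
#define entity_is_task(se)      (!se->my_q)

With the patch, a group se therefore keeps load_avg == 0 (its sched_avg was zero-allocated), while tasks still start out with their full initial weight.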






Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-18 Thread Dietmar Eggemann
On 18/10/16 10:07, Peter Zijlstra wrote:
> On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:

[...]

>> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
>> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
>> of system.slice tg is 0 after startup.
> 
> Right, so the reason for using present_mask is that it avoids having to
> deal with hotplug, also all the per-cpu memory is allocated and present
> for !online CPUs anyway, so might as well set it up properly anyway.
> 
> (You might want to start booting your laptop with "possible_cpus=4" to
> save some memory FWIW.)

The question for me is: could this be the reason on the X1 Carbon
platform as well?

The initial pastebin from Joseph (http://paste.ubuntu.com/23312351)
showed .tg_load_avg : 381697 on a machine with 4 logical cpus. With a couple
more services than the 80 here, this might be the problem.

> 
> But yes, we have a bug here too... /me ponders
> 
> So aside from funny BIOSes, this should also show up when creating
> cgroups when you have offlined a few CPUs, which is far more common I'd
> think.

Yes.

> On IRC you mentioned that adding list_add_leaf_cfs_rq() to
> online_fair_sched_group() cures this, this would actually match with
> unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid
> a few instructions on the enqueue path, so that's all good.

Yes, I was able to recreate a similar problem (not related to the cpu
masks) on ARM64 (6 logical cpus). I created 100 second-level tg's but only
put one task (no cpu affinity, so it could run on multiple cpus) in one
of these tg's (mainly to see the related cfs_rq's in /proc/sched_debug).

I get a remaining .tg_load_avg : 49898 for cfs_rq[x]:/tg_1

 > I'm just not immediately seeing how that cures things. The only relevant
> user of the leaf_cfs_rq list seems to be update_blocked_averages() which
> is called from the balance code (idle_balance() and
> rebalance_domains()). But neither should call that for offline (or
> !present) CPUs.

Assuming this is load from the 99 second-level tg's which never had a task
running, putting list_add_leaf_cfs_rq() into online_fair_sched_group()
for all cpus makes sure that all the 'blocked load' gets decayed.

Doing what Vincent just suggested, not initializing tg se's w/ 1024 but
w/ 0 instead prevents this from being necessary.

[...]


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-18 Thread Peter Zijlstra
On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> On 18 October 2016 at 11:07, Peter Zijlstra  wrote:
> > So aside from funny BIOSes, this should also show up when creating
> > cgroups when you have offlined a few CPUs, which is far more common I'd
> > think.
> 
> The problem is also that the load of the tg->se[cpu] that represents
> the tg->cfs_rq[cpu] is initialized to 1024 in:
> alloc_fair_sched_group
>  for_each_possible_cpu(i) {
>  init_entity_runnable_average(se);
> sa->load_avg = scale_load_down(se->load.weight);
> 
> Initializing  sa->load_avg to 1024 for a newly created task makes
> sense as we don't know yet what will be its real load but i'm not sure
> that we have to do the same for se that represents a task group. This
> load should be initialized to 0 and it will increase when task will be
> moved/attached into task group

Yes, I think that makes sense. Not sure how horrible that is with the
current state of things, but after your propagate patch, which
reinstates the interactivity hack, that should work for sure.


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-18 Thread Vincent Guittot
On 18 October 2016 at 11:07, Peter Zijlstra  wrote:
> On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:
>>
>> Something looks weird related to the use of for_each_possible_cpu(i) in
>> online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).
>>
>> In case I print out cpu id and the cpu masks inside the 
>> for_each_possible_cpu(i)
>> I get:
>>
>> [ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 
>> cpu_present_mask=0-3 cpu_active_mask=0-3
>
> OK, you have a buggy BIOS :-) It enumerates too many CPU slots. There is
> no reason to have 4 empty CPU slots on a machine that cannot do physical
> hotplug.
>
> This also explains why it doesn't show on many machines, most machines
> will not have this and possible_mask == present_mask == online_mask for
> most all cases.
>
> x86 folk, can we detect the lack of physical hotplug capability and FW_WARN
> on this and lowering possible_mask to present_mask?
>
>> [ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 
>> cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 
>> cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 
>> cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 
>> cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 
>> cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 
>> cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 
>> cpu_present_mask=0-3 cpu_active_mask=0-3
>>
>> T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
>> 80
>>
>> /proc/sched_debug:
>>
>> cfs_rq[0]:/system.slice
>>   ...
>>   .tg_load_avg   : 323584
>>   ...
>>
>> 80 * 1024 * 4 (non-existent cpu4-cpu7) = 327680 (with a little bit of decay,
>> this could be this extra load on the system.slice tg)
>>
>> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
>> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
>> of system.slice tg is 0 after startup.
>
> Right, so the reason for using present_mask is that it avoids having to
> deal with hotplug, also all the per-cpu memory is allocated and present
> for !online CPUs anyway, so might as well set it up properly anyway.
>
> (You might want to start booting your laptop with "possible_cpus=4" to
> save some memory FWIW.)
>
> But yes, we have a bug here too... /me ponders
>
> So aside from funny BIOSes, this should also show up when creating
> cgroups when you have offlined a few CPUs, which is far more common I'd
> think.

The problem is also that the load of the tg->se[cpu] that represents
the tg->cfs_rq[cpu] is initialized to 1024 in:
alloc_fair_sched_group
 for_each_possible_cpu(i) {
 init_entity_runnable_average(se);
sa->load_avg = scale_load_down(se->load.weight);

Initializing sa->load_avg to 1024 for a newly created task makes
sense, as we don't know yet what its real load will be, but I'm not sure
that we have to do the same for an se that represents a task group. This
load should be initialized to 0, and it will increase when tasks are
moved/attached into the task group.
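
For context, the initialization in question looks roughly like this in the 4.8-era init_entity_runnable_average() (a trimmed, from-memory sketch); scale_load_down(se->load.weight) is 1024 (NICE_0_LOAD) for a freshly allocated group entity, and this runs once per *possible* CPU at group creation:

void init_entity_runnable_average(struct sched_entity *se)
{
        struct sched_avg *sa = &se->avg;

        sa->last_update_time = 0;
        /* give period_contrib almost a full period so it is updated on enqueue */
        sa->period_contrib = 1023;
        sa->load_avg = scale_load_down(se->load.weight); /* 1024 for a new group se */
        sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
        /* util_avg/util_sum start at zero; utilization is set up later */
        sa->util_avg = 0;
        sa->util_sum = 0;
}

As discussed elsewhere in the thread, this is why creating a child group bumps the parent's tg_load_avg by 1024 for every possible-but-not-present CPU, and why that contribution never decays: those cfs_rqs are never enqueued and never visited by update_blocked_averages().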

>
> On IRC you mentioned that adding list_add_leaf_cfs_rq() to
> online_fair_sched_group() cures this, this would actually match with
> unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid
> a few instructions on the enqueue path, so that's all good.
>
> I'm just not immediately seeing how that cures things. The only relevant
> user of the leaf_cfs_rq list seems to be update_blocked_averages() which
> is called from the balance code (idle_balance() and
> rebalance_domains()). But neither should call that for offline (or
> !present) CPUs.
>
>
> Humm..


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-18 Thread Peter Zijlstra
On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:
> 
> Something looks weird related to the use of for_each_possible_cpu(i) in
> online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).
> 
> In case I print out cpu id and the cpu masks inside the 
> for_each_possible_cpu(i)
> I get:
> 
> [ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3

OK, you have a buggy BIOS :-) It enumerates too many CPU slots. There is
no reason to have 4 empty CPU slots on a machine that cannot do physical
hotplug.

This also explains why it doesn't show on many machines, most machines
will not have this and possible_mask == present_mask == online_mask for
most all cases.

x86 folk, can we detect the lack of physical hotplug capability and FW_WARN
on this and lowering possible_mask to present_mask?

> [ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> 
> T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
> 80
> 
> /proc/sched_debug:
> 
> cfs_rq[0]:/system.slice
>   ...
>   .tg_load_avg   : 323584
>   ...
> 
> 80 * 1024 * 4 (non-existent cpu4-cpu7) = 327680 (with a little bit of decay,
> this could be this extra load on the system.slice tg)
> 
> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
> of system.slice tg is 0 after startup.

Right, so the reason for using present_mask is that it avoids having to
deal with hotplug, also all the per-cpu memory is allocated and present
for !online CPUs anyway, so might as well set it up properly anyway.

(You might want to start booting your laptop with "possible_cpus=4" to
save some memory FWIW.)

But yes, we have a bug here too... /me ponders

So aside from funny BIOSes, this should also show up when creating
cgroups when you have offlined a few CPUs, which is far more common I'd
think.

On IRC you mentioned that adding list_add_leaf_cfs_rq() to
online_fair_sched_group() cures this, this would actually match with
unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid
a few instructions on the enqueue path, so that's all good.

I'm just not immediately seeing how that cures things. The only relevant
user of the leaf_cfs_rq list seems to be update_blocked_averages() which
is called from the balance code (idle_balance() and
rebalance_domains()). But neither should call that for offline (or
!present) CPUs.


Humm..


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-18 Thread Vincent Guittot
On 18 October 2016 at 00:52, Dietmar Eggemann  wrote:
> On 10/17/2016 02:54 PM, Vincent Guittot wrote:
>> On 17 October 2016 at 15:19, Peter Zijlstra  wrote:
>>> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:
>
> [...]
>
 BTW, I guess we can reach .tg_load_avg up to ~30-40 on such a system
 initially because systemd will create all ~100 services (and therefore the
 corresponding 2. level tg's) at once. In my previous example, there was 500ms
 between the creation of 2 tg's so there was a lot of decaying going on in between.
>>>
>>> Cute... on current kernels that translates to simply removing the call
 to update_tg_load_avg(), let's see if we can figure out what goes
>>> sideways first though, because it _should_ decay back out. And if that
>>
>> Yes, reaching ~30-40 is not an issue in itself; the problem is
>> that load_avg has decayed but it has not been reflected in
>> tg->load_avg in the buggy case
>>
>>> can fail here, I'm not seeing why that wouldn't fail elsewhere either.
>>>
>>> I'll see if I can reproduce this with a script creating heaps of cgroups
>>> in a hurry, I have a total lack of system-disease on all my machines.
>

Hi Dietmar,

>
> Something looks weird related to the use of for_each_possible_cpu(i) in
> online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).
>
> In case I print out cpu id and the cpu masks inside the 
> for_each_possible_cpu(i)
> I get:
>
> [ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 
> cpu_present_mask=0-3 cpu_active_mask=0-3
>

Thanks to your description above, I have been able to reproduce the
issue on my ARM platform.
The key point is to have cpu_possible_mask different from
cpu_present_mask in order to reproduce the problem; when
cpu_present_mask equals cpu_possible_mask, I can't reproduce it.
I created a 1st-level task group, tg-l1. Then, each time I create a
new task group inside tg-l1, tg-l1.tg_load_avg increases by
1024 * (number of CPUs that are possible but not present), as you
described below.
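
(A back-of-envelope model of the T430 numbers quoted below -- illustrative
only; the constants are taken from your report, not measured here:)

/* Models how the undecayed initial contributions from possible-but-not-
 * present CPUs add up on the parent tg. */
#include <stdio.h>

int main(void)
{
        int child_groups  = 80;         /* directories under system.slice */
        int missing_cpus  = 4;          /* possible but not present: cpu4-cpu7 */
        int init_load_avg = 1024;       /* initial se->avg.load_avg per cpu */

        printf("stuck tg_load_avg ~= %d\n",
               child_groups * missing_cpus * init_load_avg);   /* 327680 */
        return 0;
}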

Thanks
Vincent

> T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
> 80
>
> /proc/sched_debug:
>
> cfs_rq[0]:/system.slice
>   ...
>   .tg_load_avg   : 323584
>   ...
>
> 80 * 1024 * 4 (non-existent cpu4-cpu7) = 327680 (with a little bit of decay,
> this could be the extra load on the system.slice tg)
>
> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
> of system.slice tg is 0 after startup.


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-17 Thread Dietmar Eggemann
On 10/17/2016 02:54 PM, Vincent Guittot wrote:
> On 17 October 2016 at 15:19, Peter Zijlstra  wrote:
>> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:

[...]

>>> BTW, I guess we can reach .tg_load_avg up to ~30-40 on such a system
>>> initially because systemd will create all ~100 services (and therefore the
>>> corresponding 2. level tg's) at once. In my previous example, there was 
>>> 500ms
>>> between the creation of 2 tg's so there was a lot of decaying going on in 
>>> between.
>>
>> Cute... on current kernels that translates to simply removing the call
>> to update_tg_load_avg(), lets see if we can figure out what goes
>> sideways first though, because it _should_ decay back out. And if that
> 
> yes, Reaching ~30-40 is not an issue in itself, the problem is
> that load_avg has decayed but it has not been reflected in
> tg->load_avg in the buggy case
> 
>> can fail here, I'm not seeing why that wouldn't fail elsewhere either.
>>
>> I'll see if I can reproduce this with a script creating heaps of cgroups
>> in a hurry, I have a total lack of system-disease on all my machines.


Something looks weird related to the use of for_each_possible_cpu(i) in
online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).

In case I print out cpu id and the cpu masks inside the for_each_possible_cpu(i)
I get:

[ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 
cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 
cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 
cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 
cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 
cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 
cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 
cpu_present_mask=0-3 cpu_active_mask=0-3
[ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 
cpu_present_mask=0-3 cpu_active_mask=0-3

T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
80

/proc/sched_debug:

cfs_rq[0]:/system.slice
  ...
  .tg_load_avg   : 323584
  ...

80 * 1024 * 4 (non-existent cpu4-cpu7) = 327680 (with a little bit of decay,
this could be the extra load on the system.slice tg)

Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
of system.slice tg is 0 after startup.


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-17 Thread Vincent Guittot
On 17 October 2016 at 15:19, Peter Zijlstra  wrote:
> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:
>
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 8b03fb5..8926685 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -2902,7 +2902,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg 
>> > *sa,
>> >   */
>> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>> >  {
>> > -   long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>> > +   unsigned long load_avg = READ_ONCE(cfs_rq->avg.load_avg);
>> > +   long delta = load_avg - cfs_rq->tg_load_avg_contrib;
>> >
>> > /*
>> >  * No need to update load_avg for root_task_group as it is not used.
>> > @@ -2912,7 +2913,7 @@ static inline void update_tg_load_avg(struct cfs_rq 
>> > *cfs_rq, int force)
>> >
>> > if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>> > atomic_long_add(delta, &cfs_rq->tg->load_avg);
>> > -   cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>> > +   cfs_rq->tg_load_avg_contrib = load_avg;
>> > }
>> >  }
>>
>> Tested it on an Ubuntu 16.10 Server (on top of the default 4.8.0-22-generic
>> kernel) on a Lenovo T430 and it didn't help.
>
> Right, I don't think that race exists, we update cfs_rq->avg.load_avg
> with rq->lock held and have it held here, so it cannot change under us.

Yes, I agree. It was just to be sure that an unexpected race condition
doesn't happen.

>
> This might do with a few lockdep_assert_held() instances to clarify this
> though.
>
>> What seems to cure it is to get rid of this snippet (part of the commit
>> mentioned earlier in this thread: 3d30544f0212):
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 039de34f1521..16c692049fbf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -726,7 +726,6 @@ void post_init_entity_util_avg(struct sched_entity *se)
>> struct sched_avg *sa = &se->avg;
>> long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
>> u64 now = cfs_rq_clock_task(cfs_rq);
>> -   int tg_update;
>>
>> if (cap > 0) {
>> if (cfs_rq->avg.util_avg != 0) {
>> @@ -759,10 +758,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
>> }
>> }
>>
>> -   tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);
>> +   update_cfs_rq_load_avg(now, cfs_rq, false);
>> attach_entity_load_avg(cfs_rq, se);
>> -   if (tg_update)
>> -   update_tg_load_avg(cfs_rq, false);
>>  }
>>
>>  #else /* !CONFIG_SMP */
>>
>> BTW, I guess we can reach .tg_load_avg up to ~30-40 on such a system
>> initially because systemd will create all ~100 services (and therefore the
>> corresponding 2. level tg's) at once. In my previous example, there was 500ms
>> between the creation of 2 tg's so there was a lot of decaying going on in 
>> between.
>
> Cute... on current kernels that translates to simply removing the call
> to update_tg_load_avg(), lets see if we can figure out what goes
> sideways first though, because it _should_ decay back out. And if that

Yes, reaching ~30-40 is not an issue in itself; the problem is
that load_avg has decayed but the decay has not been reflected in
tg->load_avg in the buggy case.

> can fail here, I'm not seeing why that wouldn't fail elsewhere either.
>
> I'll see if I can reproduce this with a script creating heaps of cgroups
> in a hurry, I have a total lack of system-disease on all my machines.
>
>
> /me goes prod..


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-17 Thread Peter Zijlstra
On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8b03fb5..8926685 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2902,7 +2902,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg 
> > *sa,
> >   */
> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
> >  {
> > -   long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > +   unsigned long load_avg = READ_ONCE(cfs_rq->avg.load_avg);
> > +   long delta = load_avg - cfs_rq->tg_load_avg_contrib;
> >  
> > /*
> >  * No need to update load_avg for root_task_group as it is not used.
> > @@ -2912,7 +2913,7 @@ static inline void update_tg_load_avg(struct cfs_rq 
> > *cfs_rq, int force)
> >  
> > if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > -   cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> > +   cfs_rq->tg_load_avg_contrib = load_avg;
> > }
> >  }
> 
> Tested it on an Ubuntu 16.10 Server (on top of the default 4.8.0-22-generic
> kernel) on a Lenovo T430 and it didn't help.

Right, I don't think that race exists, we update cfs_rq->avg.load_avg
with rq->lock held and have it held here, so it cannot change under us.

This might do with a few lockdep_assert_held() instances to clarify this
though.
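
(Just to illustrate the idea -- a sketch, assuming rq_of() as the usual
fair.c helper mapping a cfs_rq to its rq:)

@@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
 	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 
+	/* document the rq->lock serialization this update relies on */
+	lockdep_assert_held(&rq_of(cfs_rq)->lock);
+
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
 	 */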

> What seems to cure it is to get rid of this snippet (part of the commit
> mentioned earlier in this thread: 3d30544f0212):
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 039de34f1521..16c692049fbf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -726,7 +726,6 @@ void post_init_entity_util_avg(struct sched_entity *se)
> struct sched_avg *sa = &se->avg;
> long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
> u64 now = cfs_rq_clock_task(cfs_rq);
> -   int tg_update;
>  
> if (cap > 0) {
> if (cfs_rq->avg.util_avg != 0) {
> @@ -759,10 +758,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
> }
> }
>  
> -   tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);
> +   update_cfs_rq_load_avg(now, cfs_rq, false);
> attach_entity_load_avg(cfs_rq, se);
> -   if (tg_update)
> -   update_tg_load_avg(cfs_rq, false);
>  }
>  
>  #else /* !CONFIG_SMP */
> 
> BTW, I guess we can reach .tg_load_avg up to ~30-40 on such a system
> initially because systemd will create all ~100 services (and therefore the
> corresponding 2. level tg's) at once. In my previous example, there was 500ms
> between the creation of 2 tg's so there was a lot of decaying going on in 
> between.

Cute... on current kernels that translates to simply removing the call
to update_tg_load_avg(). Let's see if we can figure out what goes
sideways first though, because it _should_ decay back out. And if that
can fail here, I'm not seeing why it wouldn't fail elsewhere either.

I'll see if I can reproduce this with a script creating heaps of cgroups
in a hurry; I have a total lack of systemd-disease on all my machines.


/me goes prod..


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-17 Thread Dietmar Eggemann
Hi Vincent,

On 17/10/16 10:09, Vincent Guittot wrote:
> On Friday 14 Oct 2016 at 12:04:02 (-0400), Joseph Salisbury wrote:
>> On 10/14/2016 11:18 AM, Vincent Guittot wrote:
>>> On Friday 14 Oct 2016 at 14:10:07 (+0100), Dietmar Eggemann wrote:
 On 14/10/16 09:24, Vincent Guittot wrote:

[...]

> Could you try the patch below on top of the faulty kernel ?
> 
> ---
>  kernel/sched/fair.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..8926685 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2902,7 +2902,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg 
> *sa,
>   */
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  {
> - long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> + unsigned long load_avg = READ_ONCE(cfs_rq->avg.load_avg);
> + long delta = load_avg - cfs_rq->tg_load_avg_contrib;
>  
>   /*
>* No need to update load_avg for root_task_group as it is not used.
> @@ -2912,7 +2913,7 @@ static inline void update_tg_load_avg(struct cfs_rq 
> *cfs_rq, int force)
>  
>   if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>   atomic_long_add(delta, &cfs_rq->tg->load_avg);
> - cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> + cfs_rq->tg_load_avg_contrib = load_avg;
>   }
>  }

Tested it on an Ubuntu 16.10 Server (on top of the default 4.8.0-22-generic
kernel) on a Lenovo T430 and it didn't help.

What seems to cure it is to get rid of this snippet (part of the commit
mentioned earlier in this thread: 3d30544f0212):

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 039de34f1521..16c692049fbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -726,7 +726,6 @@ void post_init_entity_util_avg(struct sched_entity *se)
struct sched_avg *sa = &se->avg;
long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
u64 now = cfs_rq_clock_task(cfs_rq);
-   int tg_update;
 
if (cap > 0) {
if (cfs_rq->avg.util_avg != 0) {
@@ -759,10 +758,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
}
}
 
-   tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);
+   update_cfs_rq_load_avg(now, cfs_rq, false);
attach_entity_load_avg(cfs_rq, se);
-   if (tg_update)
-   update_tg_load_avg(cfs_rq, false);
 }
 
 #else /* !CONFIG_SMP */

BTW, I guess we can reach .tg_load_avg up to ~30-40 on such a system
initially, because systemd will create all ~100 services (and therefore the
corresponding 2nd-level tg's) at once. In my previous example, there was 500ms
between the creation of two tg's, so there was a lot of decaying going on in
between.


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-17 Thread Vincent Guittot
On Friday 14 Oct 2016 at 12:04:02 (-0400), Joseph Salisbury wrote:
> On 10/14/2016 11:18 AM, Vincent Guittot wrote:
> > On Friday 14 Oct 2016 at 14:10:07 (+0100), Dietmar Eggemann wrote:
> >> On 14/10/16 09:24, Vincent Guittot wrote:
> >>> On 13 October 2016 at 23:34, Vincent Guittot  
> >>> wrote:
>  On 13 October 2016 at 20:49, Dietmar Eggemann  
>  wrote:
> > On 13/10/16 17:48, Vincent Guittot wrote:
> >> On 13 October 2016 at 17:52, Joseph Salisbury
> >>  wrote:
> >>> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
>  Hi,
> 
>  On 12 October 2016 at 18:21, Joseph Salisbury
>   wrote:
> > On 10/12/2016 08:20 AM, Vincent Guittot wrote:
> >> On 8 October 2016 at 13:49, Mike Galbraith  wrote:
> >>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>  On 8 October 2016 at 10:39, Ingo Molnar  wrote:
> > * Peter Zijlstra  wrote:
> >
> >> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury 
> >> wrote:
> >> [...]
> >>
> > When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, 
> > the tg_x->load_avg
> > becomes > 6*1024 before any tasks ran in it.
>  This is normal as se->avg.load_avg is initialized to
>  scale_load_down(se->load.weight) and this se->avg.load_avg will be
>  added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq
> >> Yeah, you right, even when I've created 50 second level groups,
> >> tg_x->load_avg is ~6800.
> >>
> >> Could it have something to do with the fact that .se->load.weight = 2
> >> for all these task groups? on a 64bit system?
> > I don't think so, the problem really comes from tg->load_avg = 381697
> > but sum of cfs_rq[cpu]->tg_load_avg_contrib = 1013 which is << tg->load_avg
> > and cfs_rq[cpu]->tg_load_avg_contrib == cfs_rq[cpu]->avg.load_avg so we 
> > can't 
> > expect any negative delta to remove this large value
> >
> >> In case we call  __update_load_avg(..., se->on_rq *
> >> scale_load_down(se->load.weight), ...) we pass a weight argument of 0
> >> for these se's.
> >>
> >> Does not happen with:
> >>
> >> -   if (shares < MIN_SHARES)
> >> -   shares = MIN_SHARES;
> >> +   if (shares < scale_load(MIN_SHARES))
> >> +   shares = scale_load(MIN_SHARES);
> >>
> >> in  calc_cfs_shares().
> >>
> >> [...]
> >>
> >>
> Adding Omer to CC list, as he is able to reproduce this bug.

Could you try the patch below on top of the faulty kernel ?

---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b03fb5..8926685 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2902,7 +2902,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
-   long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+   unsigned long load_avg = READ_ONCE(cfs_rq->avg.load_avg);
+   long delta = load_avg - cfs_rq->tg_load_avg_contrib;
 
/*
 * No need to update load_avg for root_task_group as it is not used.
@@ -2912,7 +2913,7 @@ static inline void update_tg_load_avg(struct cfs_rq 
*cfs_rq, int force)
 
if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
atomic_long_add(delta, &cfs_rq->tg->load_avg);
-   cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+   cfs_rq->tg_load_avg_contrib = load_avg;
}
 }
 
-- 
2.7.4


> 


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-14 Thread Joseph Salisbury
On 10/14/2016 11:18 AM, Vincent Guittot wrote:
> On Friday 14 Oct 2016 at 14:10:07 (+0100), Dietmar Eggemann wrote:
>> On 14/10/16 09:24, Vincent Guittot wrote:
>>> On 13 October 2016 at 23:34, Vincent Guittot  
>>> wrote:
 On 13 October 2016 at 20:49, Dietmar Eggemann  
 wrote:
> On 13/10/16 17:48, Vincent Guittot wrote:
>> On 13 October 2016 at 17:52, Joseph Salisbury
>>  wrote:
>>> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
 Hi,

 On 12 October 2016 at 18:21, Joseph Salisbury
  wrote:
> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>> On 8 October 2016 at 13:49, Mike Galbraith  wrote:
>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
 On 8 October 2016 at 10:39, Ingo Molnar  wrote:
> * Peter Zijlstra  wrote:
>
>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>> [...]
>>
> When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the 
> tg_x->load_avg
> becomes > 6*1024 before any tasks ran in it.
 This is normal as se->avg.load_avg is initialized to
 scale_load_down(se->load.weight) and this se->avg.load_avg will be
 added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq
>> Yeah, you right, even when I've created 50 second level groups,
>> tg_x->load_avg is ~6800.
>>
>> Could it have something to do with the fact that .se->load.weight = 2
>> for all these task groups? on a 64bit system?
> I don't think so, the problem really comes from tg->load_avg = 381697
> but sum of cfs_rq[cpu]->tg_load_avg_contrib = 1013 which is << tg->load_avg
> and cfs_rq[cpu]->tg_load_avg_contrib == cfs_rq[cpu]->avg.load_avg so we can't 
> expect any negative delta to remove this large value
>
>> In case we call  __update_load_avg(..., se->on_rq *
>> scale_load_down(se->load.weight), ...) we pass a weight argument of 0
>> for these se's.
>>
>> Does not happen with:
>>
>> -   if (shares < MIN_SHARES)
>> -   shares = MIN_SHARES;
>> +   if (shares < scale_load(MIN_SHARES))
>> +   shares = scale_load(MIN_SHARES);
>>
>> in  calc_cfs_shares().
>>
>> [...]
>>
>>
Adding Omer to CC list, as he is able to reproduce this bug.



Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-14 Thread Vincent Guittot
On Friday 14 Oct 2016 at 14:10:07 (+0100), Dietmar Eggemann wrote:
> On 14/10/16 09:24, Vincent Guittot wrote:
> > On 13 October 2016 at 23:34, Vincent Guittot  
> > wrote:
> >> On 13 October 2016 at 20:49, Dietmar Eggemann  
> >> wrote:
> >>> On 13/10/16 17:48, Vincent Guittot wrote:
>  On 13 October 2016 at 17:52, Joseph Salisbury
>   wrote:
> > On 10/13/2016 06:58 AM, Vincent Guittot wrote:
> >> Hi,
> >>
> >> On 12 October 2016 at 18:21, Joseph Salisbury
> >>  wrote:
> >>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>  On 8 October 2016 at 13:49, Mike Galbraith  wrote:
> > On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
> >> On 8 October 2016 at 10:39, Ingo Molnar  wrote:
> >>> * Peter Zijlstra  wrote:
> >>>
>  On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
> 
> [...]
> 
> >>> When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the 
> >>> tg_x->load_avg
> >>> becomes > 6*1024 before any tasks ran in it.
> >>
> >> This is normal as se->avg.load_avg is initialized to
> >> scale_load_down(se->load.weight) and this se->avg.load_avg will be
> >> added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq
> 
> Yeah, you right, even when I've created 50 second level groups,
> tg_x->load_avg is ~6800.
> 
> Could it have something to do with the fact that .se->load.weight = 2
> for all these task groups? on a 64bit system?

I don't think so. The problem really comes from tg->load_avg = 381697
while the sum of cfs_rq[cpu]->tg_load_avg_contrib = 1013, which is
<< tg->load_avg; and cfs_rq[cpu]->tg_load_avg_contrib == cfs_rq[cpu]->avg.load_avg,
so we can't expect any negative delta to remove this large value.

> 
> In case we call  __update_load_avg(..., se->on_rq *
> scale_load_down(se->load.weight), ...) we pass a weight argument of 0
> for these se's.
> 
> Does not happen with:
> 
> -   if (shares < MIN_SHARES)
> -   shares = MIN_SHARES;
> +   if (shares < scale_load(MIN_SHARES))
> +   shares = scale_load(MIN_SHARES);
> 
> in  calc_cfs_shares().
> 
> [...]
> 
> 


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-14 Thread Dietmar Eggemann
On 14/10/16 09:24, Vincent Guittot wrote:
> On 13 October 2016 at 23:34, Vincent Guittot  
> wrote:
>> On 13 October 2016 at 20:49, Dietmar Eggemann  
>> wrote:
>>> On 13/10/16 17:48, Vincent Guittot wrote:
 On 13 October 2016 at 17:52, Joseph Salisbury
  wrote:
> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
>> Hi,
>>
>> On 12 October 2016 at 18:21, Joseph Salisbury
>>  wrote:
>>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
 On 8 October 2016 at 13:49, Mike Galbraith  wrote:
> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>> On 8 October 2016 at 10:39, Ingo Molnar  wrote:
>>> * Peter Zijlstra  wrote:
>>>
 On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:

[...]

>>> When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the 
>>> tg_x->load_avg
>>> becomes > 6*1024 before any tasks ran in it.
>>
>> This is normal as se->avg.load_avg is initialized to
>> scale_load_down(se->load.weight) and this se->avg.load_avg will be
>> added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq

Yeah, you're right; even when I've created 50 second-level groups,
tg_x->load_avg is ~6800.

Could it have something to do with the fact that .se->load.weight = 2
for all these task groups on a 64-bit system?

When we call __update_load_avg(..., se->on_rq *
scale_load_down(se->load.weight), ...), we pass a weight argument of 0
for these se's.

This does not happen with:

-   if (shares < MIN_SHARES)
-   shares = MIN_SHARES;
+   if (shares < scale_load(MIN_SHARES))
+   shares = scale_load(MIN_SHARES);

in  calc_cfs_shares().
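
(For context, a standalone illustration of why the clamp placement matters --
this is not kernel code; the macros below are meant to mirror the 64-bit
"increased load resolution" definitions in kernel/sched/sched.h and the
MIN_SHARES value in fair.c, so treat the exact constants as assumptions:)

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
#define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
#define MIN_SHARES		2UL

int main(void)
{
	unsigned long shares = 2;	/* roughly what a nearly idle group gets */

	/* current clamp: compares against the unscaled 2, so no effect */
	unsigned long w_old = shares < MIN_SHARES ? MIN_SHARES : shares;
	/* proposed clamp: compares against scale_load(2) == 2048 */
	unsigned long w_new = shares < scale_load(MIN_SHARES) ?
			      scale_load(MIN_SHARES) : shares;

	/* prints 2 -> 0 (weight collapses) and 2048 -> 2 (weight survives) */
	printf("old clamp: weight=%lu scale_load_down=%lu\n", w_old, scale_load_down(w_old));
	printf("new clamp: weight=%lu scale_load_down=%lu\n", w_new, scale_load_down(w_new));
	return 0;
}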

[...]




Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-14 Thread Vincent Guittot
On 13 October 2016 at 23:34, Vincent Guittot  wrote:
> On 13 October 2016 at 20:49, Dietmar Eggemann  
> wrote:
>> On 13/10/16 17:48, Vincent Guittot wrote:
>>> On 13 October 2016 at 17:52, Joseph Salisbury
>>>  wrote:
 On 10/13/2016 06:58 AM, Vincent Guittot wrote:
> Hi,
>
> On 12 October 2016 at 18:21, Joseph Salisbury
>  wrote:
>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>>> On 8 October 2016 at 13:49, Mike Galbraith  wrote:
 On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
> On 8 October 2016 at 10:39, Ingo Molnar  wrote:
>> * Peter Zijlstra  wrote:
>>
>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
 Hello Peter,

 A kernel bug report was opened against Ubuntu [0].  After a
 kernel
 bisect, it was found that reverting the following commit
 resolved this bug:

 commit 3d30544f02120b884bba2a9466c87dba980e3be5
 Author: Peter Zijlstra 
 Date:   Tue Jun 21 14:27:50 2016 +0200

 sched/fair: Apply more PELT fixes
> This patch only speeds up the update of task group load in order to
> reflect the new load balance but It should not change the final value
> and as a result the final behavior. I will try to reproduce it in my
> target later today
 FWIW, I tried and failed w/wo autogroup on 4.8 and master.
>>> Me too
>>>
>>> Is it possible to get some dump of  /proc/sched_debug while the problem 
>>> occurs ?
>>>
>>> Vincent
>>>
 -Mike
>> The output from /proc/shed_debug can be seen here:
>> http://paste.ubuntu.com/23312351/
> I have looked at the dump and there is something very odd for
> system.slice task group where the display manager is running.
> system.slice->tg_load_avg is around 381697 but  tg_load_avg is
> normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
> whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
> case. We can have some differences because the dump of
> /proc/shed_debug is not atomic and some changes can happen but nothing
> like this difference.
>
> The main effect of this quite high value is that the weight/prio of
> the sched_entity that represents system.slice in root cfs_rq is very
> low (lower than task with the smallest nice prio) so the system.slice
> task group will not get the CPU quite often compared to the user.slice
> task group: less than 1% for the system.slice where lightDM and xorg
> are running compared 99% for the user.slice where the stress tasks are
> running. This is confirmed by the se->avg.util_avg value of the task
> groups which reflect how much time each task group is effectively
> running on a CPU:
> system.slice[CPU3].se->avg.util_avg = 8 whereas
> user.slice[CPU3].se->avg.util_avg = 991
>
> This difference of weight/priority explains why the system becomes
> unresponsive. For now, I can't explain is why
> system.slice->tg_load_avg = 381697 whereas is should be around 1013
> and how the patch can generate this situation.
>
> Is it possible to have a dump of /proc/sched_debug before starting
> stress command ? to check if the problem is there from the beginning
> but not seen because not overloaded. Or if it the problem comes when
> user starts to load the system
 Here is the dump before stress is started:
 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
>>>
>>> This one is ok.
>>> The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
>>> #12~lp1627108Commit3d30544Reverted
>>> so this is without the culprit commit
>>>

 Here it is after:
 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy

>>>
>>> This one has the exact same odds values for system.slice->tg_load_avg
>>> than the 1st dump that you sent yesterday
>>> The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
>>> So this dump has been done with a different kernel than for the dump above.
>>> As I can't find any stress task in the dump, i tend to believe that
>>> the dump has been done before starting the stress tasks and not after
>>> starting them. Can you confirm ?
>>>
>>> If i'm right, it mean that the problem was already there before
>>> starting stress tasks.
>>
>> Could it be a problem I'm also seeing on my ARM64 Juno (6 logical cpus) w/o 
>> systemd
>> and w/o autogroup (tip/sched/core 447976ef4fd0):
>>
>> When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the 
>> tg_x->load_avg
>> becomes > 6*1024 before any tasks ran in it.
>
> This is normal as se->avg.load_avg is initialized to
> scale_load_down(se->load.weight) and this se->avg.load_avg

Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-13 Thread Vincent Guittot
On 13 October 2016 at 20:49, Dietmar Eggemann  wrote:
> On 13/10/16 17:48, Vincent Guittot wrote:
>> On 13 October 2016 at 17:52, Joseph Salisbury
>>  wrote:
>>> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
 Hi,

 On 12 October 2016 at 18:21, Joseph Salisbury
  wrote:
> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>> On 8 October 2016 at 13:49, Mike Galbraith  wrote:
>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
 On 8 October 2016 at 10:39, Ingo Molnar  wrote:
> * Peter Zijlstra  wrote:
>
>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>>> Hello Peter,
>>>
>>> A kernel bug report was opened against Ubuntu [0].  After a
>>> kernel
>>> bisect, it was found that reverting the following commit
>>> resolved this bug:
>>>
>>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>>> Author: Peter Zijlstra 
>>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>>
>>> sched/fair: Apply more PELT fixes
 This patch only speeds up the update of task group load in order to
 reflect the new load balance but It should not change the final value
 and as a result the final behavior. I will try to reproduce it in my
 target later today
>>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
>> Me too
>>
>> Is it possible to get some dump of  /proc/sched_debug while the problem 
>> occurs ?
>>
>> Vincent
>>
>>> -Mike
> The output from /proc/shed_debug can be seen here:
> http://paste.ubuntu.com/23312351/
 I have looked at the dump and there is something very odd for
 system.slice task group where the display manager is running.
 system.slice->tg_load_avg is around 381697 but  tg_load_avg is
 normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
 whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
 case. We can have some differences because the dump of
 /proc/shed_debug is not atomic and some changes can happen but nothing
 like this difference.

 The main effect of this quite high value is that the weight/prio of
 the sched_entity that represents system.slice in root cfs_rq is very
 low (lower than task with the smallest nice prio) so the system.slice
 task group will not get the CPU quite often compared to the user.slice
 task group: less than 1% for the system.slice where lightDM and xorg
 are running compared 99% for the user.slice where the stress tasks are
 running. This is confirmed by the se->avg.util_avg value of the task
 groups which reflect how much time each task group is effectively
 running on a CPU:
 system.slice[CPU3].se->avg.util_avg = 8 whereas
 user.slice[CPU3].se->avg.util_avg = 991

 This difference of weight/priority explains why the system becomes
 unresponsive. For now, I can't explain is why
 system.slice->tg_load_avg = 381697 whereas is should be around 1013
 and how the patch can generate this situation.

 Is it possible to have a dump of /proc/sched_debug before starting
 stress command ? to check if the problem is there from the beginning
 but not seen because not overloaded. Or if it the problem comes when
 user starts to load the system
>>> Here is the dump before stress is started:
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
>>
>> This one is ok.
>> The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
>> #12~lp1627108Commit3d30544Reverted
>> so this is without the culprit commit
>>
>>>
>>> Here it is after:
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
>>>
>>
>> This one has the exact same odds values for system.slice->tg_load_avg
>> than the 1st dump that you sent yesterday
>> The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
>> So this dump has been done with a different kernel than for the dump above.
>> As I can't find any stress task in the dump, i tend to believe that
>> the dump has been done before starting the stress tasks and not after
>> starting them. Can you confirm ?
>>
>> If i'm right, it mean that the problem was already there before
>> starting stress tasks.
>
> Could it be a problem I'm also seeing on my ARM64 Juno (6 logical cpus) w/o 
> systemd
> and w/o autogroup (tip/sched/core 447976ef4fd0):
>
> When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group, the 
> tg_x->load_avg
> becomes > 6*1024 before any tasks ran in it.

This is normal as se->avg.load_avg is initialized to
scale_load_down(se->load.weight) and this se->avg.load_avg will be
added to tg_x[cpu]->cfs_rq->avg.load_avg when attached to the cfs_rq

>
> tg_x   : 0x800975800d80
> tg_y_1 : 0x800975800c00
> tg_y_2 : 0x80097543

Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-13 Thread Dietmar Eggemann
On 13/10/16 17:48, Vincent Guittot wrote:
> On 13 October 2016 at 17:52, Joseph Salisbury
>  wrote:
>> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
>>> Hi,
>>>
>>> On 12 October 2016 at 18:21, Joseph Salisbury
>>>  wrote:
 On 10/12/2016 08:20 AM, Vincent Guittot wrote:
> On 8 October 2016 at 13:49, Mike Galbraith  wrote:
>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>> On 8 October 2016 at 10:39, Ingo Molnar  wrote:
 * Peter Zijlstra  wrote:

> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>> Hello Peter,
>>
>> A kernel bug report was opened against Ubuntu [0].  After a
>> kernel
>> bisect, it was found that reverting the following commit
>> resolved this bug:
>>
>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>> Author: Peter Zijlstra 
>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>
>> sched/fair: Apply more PELT fixes
>>> This patch only speeds up the update of task group load in order to
>>> reflect the new load balance but It should not change the final value
>>> and as a result the final behavior. I will try to reproduce it in my
>>> target later today
>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
> Me too
>
> Is it possible to get some dump of  /proc/sched_debug while the problem 
> occurs ?
>
> Vincent
>
>> -Mike
 The output from /proc/shed_debug can be seen here:
 http://paste.ubuntu.com/23312351/
>>> I have looked at the dump and there is something very odd for
>>> system.slice task group where the display manager is running.
>>> system.slice->tg_load_avg is around 381697 but  tg_load_avg is
>>> normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
>>> whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
>>> case. We can have some differences because the dump of
>>> /proc/shed_debug is not atomic and some changes can happen but nothing
>>> like this difference.
>>>
>>> The main effect of this quite high value is that the weight/prio of
>>> the sched_entity that represents system.slice in root cfs_rq is very
>>> low (lower than task with the smallest nice prio) so the system.slice
>>> task group will not get the CPU quite often compared to the user.slice
>>> task group: less than 1% for the system.slice where lightDM and xorg
>>> are running compared 99% for the user.slice where the stress tasks are
>>> running. This is confirmed by the se->avg.util_avg value of the task
>>> groups which reflect how much time each task group is effectively
>>> running on a CPU:
>>> system.slice[CPU3].se->avg.util_avg = 8 whereas
>>> user.slice[CPU3].se->avg.util_avg = 991
>>>
>>> This difference of weight/priority explains why the system becomes
>>> unresponsive. For now, I can't explain is why
>>> system.slice->tg_load_avg = 381697 whereas is should be around 1013
>>> and how the patch can generate this situation.
>>>
>>> Is it possible to have a dump of /proc/sched_debug before starting
>>> stress command ? to check if the problem is there from the beginning
>>> but not seen because not overloaded. Or if it the problem comes when
>>> user starts to load the system
>> Here is the dump before stress is started:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy
> 
> This one is ok.
> The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
> #12~lp1627108Commit3d30544Reverted
> so this is without the culprit commit
> 
>>
>> Here it is after:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
>>
> 
> This one has the exact same odds values for system.slice->tg_load_avg
> than the 1st dump that you sent yesterday
> The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu
> So this dump has been done with a different kernel than for the dump above.
> As I can't find any stress task in the dump, i tend to believe that
> the dump has been done before starting the stress tasks and not after
> starting them. Can you confirm ?
> 
> If i'm right, it mean that the problem was already there before
> starting stress tasks.

Could it be a problem I'm also seeing on my ARM64 Juno (6 logical cpus) w/o
systemd and w/o autogroup (tip/sched/core 447976ef4fd0):

When I create a tg_root/tg_x/tg_y_1 and a tg_root/tg_x/tg_y_2 group,
tg_x->load_avg becomes > 6*1024 before any task has run in it.

tg_x   : 0x800975800d80
tg_y_1 : 0x800975800c00
tg_y_2 : 0x80097543d200 

 mkdir-2177 [002] 117.235241: bprint: sched_online_group: tg=0x800975800d80 
tg->parent=0x08fd0300
 mkdir-2177 [002] 117.235244: bprint: online_fair_sched_group: 
tg=0x800975800d80 cpu=0
 mkdir-2177 [002] 117.235247: bprint: online_fair_sched_group: 
tg=0x800975800d80 cpu=1
 mkdir-2177 [002] 117.235249: bprint: online_fair_sched_

Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-13 Thread Vincent Guittot
On 13 October 2016 at 17:52, Joseph Salisbury
 wrote:
> On 10/13/2016 06:58 AM, Vincent Guittot wrote:
>> Hi,
>>
>> On 12 October 2016 at 18:21, Joseph Salisbury
>>  wrote:
>>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
 On 8 October 2016 at 13:49, Mike Galbraith  wrote:
> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>> On 8 October 2016 at 10:39, Ingo Molnar  wrote:
>>> * Peter Zijlstra  wrote:
>>>
 On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
> Hello Peter,
>
> A kernel bug report was opened against Ubuntu [0].  After a
> kernel
> bisect, it was found that reverting the following commit
> resolved this bug:
>
> commit 3d30544f02120b884bba2a9466c87dba980e3be5
> Author: Peter Zijlstra 
> Date:   Tue Jun 21 14:27:50 2016 +0200
>
> sched/fair: Apply more PELT fixes
>> This patch only speeds up the update of task group load in order to
>> reflect the new load balance but It should not change the final value
>> and as a result the final behavior. I will try to reproduce it in my
>> target later today
> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
 Me too

 Is it possible to get some dump of  /proc/sched_debug while the problem 
 occurs ?

 Vincent

> -Mike
>>> The output from /proc/shed_debug can be seen here:
>>> http://paste.ubuntu.com/23312351/
>> I have looked at the dump and there is something very odd for
>> system.slice task group where the display manager is running.
>> system.slice->tg_load_avg is around 381697 but  tg_load_avg is
>> normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
>> whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
>> case. We can have some differences because the dump of
>> /proc/shed_debug is not atomic and some changes can happen but nothing
>> like this difference.
>>
>> The main effect of this quite high value is that the weight/prio of
>> the sched_entity that represents system.slice in root cfs_rq is very
>> low (lower than task with the smallest nice prio) so the system.slice
>> task group will not get the CPU quite often compared to the user.slice
>> task group: less than 1% for the system.slice where lightDM and xorg
>> are running compared 99% for the user.slice where the stress tasks are
>> running. This is confirmed by the se->avg.util_avg value of the task
>> groups which reflect how much time each task group is effectively
>> running on a CPU:
>> system.slice[CPU3].se->avg.util_avg = 8 whereas
>> user.slice[CPU3].se->avg.util_avg = 991
>>
>> This difference of weight/priority explains why the system becomes
>> unresponsive. For now, I can't explain is why
>> system.slice->tg_load_avg = 381697 whereas is should be around 1013
>> and how the patch can generate this situation.
>>
>> Is it possible to have a dump of /proc/sched_debug before starting
>> stress command ? to check if the problem is there from the beginning
>> but not seen because not overloaded. Or if it the problem comes when
>> user starts to load the system
> Here is the dump before stress is started:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy

This one is ok.
The dump indicates Sched Debug Version: v0.11, 4.8.0-11-generic
#12~lp1627108Commit3d30544Reverted
so this is without the culprit commit

>
> Here it is after:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy
>

This one has the exact same odd values for system.slice->tg_load_avg
as the 1st dump that you sent yesterday.
The dump indicates Sched Debug Version: v0.11, 4.8.0-22-generic #24-Ubuntu,
so this dump has been done with a different kernel than the dump above.
As I can't find any stress task in the dump, I tend to believe that
the dump has been taken before starting the stress tasks and not after
starting them. Can you confirm?

If I'm right, it means that the problem was already there before
starting the stress tasks.


>
>>
>> Thanks,
>>
>>> Ingo, the latest scheduler bits also still exhibit the bug:
>>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>>>
>>>
>


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-13 Thread Joseph Salisbury
On 10/13/2016 06:58 AM, Vincent Guittot wrote:
> Hi,
>
> On 12 October 2016 at 18:21, Joseph Salisbury
>  wrote:
>> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>>> On 8 October 2016 at 13:49, Mike Galbraith  wrote:
 On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
> On 8 October 2016 at 10:39, Ingo Molnar  wrote:
>> * Peter Zijlstra  wrote:
>>
>>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
 Hello Peter,

 A kernel bug report was opened against Ubuntu [0].  After a
 kernel
 bisect, it was found that reverting the following commit
 resolved this bug:

 commit 3d30544f02120b884bba2a9466c87dba980e3be5
 Author: Peter Zijlstra 
 Date:   Tue Jun 21 14:27:50 2016 +0200

 sched/fair: Apply more PELT fixes
> This patch only speeds up the update of task group load in order to
> reflect the new load balance but It should not change the final value
> and as a result the final behavior. I will try to reproduce it in my
> target later today
 FWIW, I tried and failed w/wo autogroup on 4.8 and master.
>>> Me too
>>>
>>> Is it possible to get some dump of  /proc/sched_debug while the problem 
>>> occurs ?
>>>
>>> Vincent
>>>
 -Mike
>> The output from /proc/shed_debug can be seen here:
>> http://paste.ubuntu.com/23312351/
> I have looked at the dump and there is something very odd for
> system.slice task group where the display manager is running.
> system.slice->tg_load_avg is around 381697 but  tg_load_avg is
> normally equal to Sum of system.slice[cpu]->tg_load_avg_contrib
> whereas Sum of system.slice[cpu]->tg_load_avg_contrib = 1013 in our
> case. We can have some differences because the dump of
> /proc/shed_debug is not atomic and some changes can happen but nothing
> like this difference.
>
> The main effect of this quite high value is that the weight/prio of
> the sched_entity that represents system.slice in root cfs_rq is very
> low (lower than task with the smallest nice prio) so the system.slice
> task group will not get the CPU quite often compared to the user.slice
> task group: less than 1% for the system.slice where lightDM and xorg
> are running compared 99% for the user.slice where the stress tasks are
> running. This is confirmed by the se->avg.util_avg value of the task
> groups which reflect how much time each task group is effectively
> running on a CPU:
> system.slice[CPU3].se->avg.util_avg = 8 whereas
> user.slice[CPU3].se->avg.util_avg = 991
>
> This difference of weight/priority explains why the system becomes
> unresponsive. For now, I can't explain is why
> system.slice->tg_load_avg = 381697 whereas is should be around 1013
> and how the patch can generate this situation.
>
> Is it possible to have a dump of /proc/sched_debug before starting
> stress command ? to check if the problem is there from the beginning
> but not seen because not overloaded. Or if it the problem comes when
> user starts to load the system
Here is the dump before stress is started:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760437/+files/dump_nonbuggy

Here it is after:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1627108/+attachment/4760436/+files/dump_buggy


>
> Thanks,
>
>> Ingo, the latest scheduler bits also still exhibit the bug:
>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>>
>>



Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-13 Thread Vincent Guittot
Hi,

On 12 October 2016 at 18:21, Joseph Salisbury
 wrote:
> On 10/12/2016 08:20 AM, Vincent Guittot wrote:
>> On 8 October 2016 at 13:49, Mike Galbraith  wrote:
>>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
 On 8 October 2016 at 10:39, Ingo Molnar  wrote:
> * Peter Zijlstra  wrote:
>
>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>>> Hello Peter,
>>>
>>> A kernel bug report was opened against Ubuntu [0].  After a
>>> kernel
>>> bisect, it was found that reverting the following commit
>>> resolved this bug:
>>>
>>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>>> Author: Peter Zijlstra 
>>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>>
>>> sched/fair: Apply more PELT fixes
 This patch only speeds up the update of task group load in order to
 reflect the new load balance but It should not change the final value
 and as a result the final behavior. I will try to reproduce it in my
 target later today
>>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
>> Me too
>>
>> Is it possible to get some dump of  /proc/sched_debug while the problem 
>> occurs ?
>>
>> Vincent
>>
>>> -Mike
>
> The output from /proc/shed_debug can be seen here:
> http://paste.ubuntu.com/23312351/

I have looked at the dump and there is something very odd for the
system.slice task group, where the display manager is running.
system.slice->tg_load_avg is around 381697, but tg_load_avg is
normally equal to the sum of system.slice[cpu]->tg_load_avg_contrib,
whereas the sum of system.slice[cpu]->tg_load_avg_contrib is only 1013
in our case. We can have some differences because the dump of
/proc/sched_debug is not atomic and some changes can happen in between,
but nothing like this difference.
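
(To make that invariant easy to spot, a hypothetical debug helper -- not
something in the tree, just a sketch of the comparison described above:)

static void check_tg_load_avg(struct task_group *tg)
{
	long tg_load = atomic_long_read(&tg->load_avg);
	long contrib_sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		contrib_sum += tg->cfs_rq[cpu]->tg_load_avg_contrib;

	/* the two can drift a bit between reads, so only flag large gaps */
	if (abs(tg_load - contrib_sum) > contrib_sum / 2)
		printk("tg_load_avg mismatch: tg->load_avg=%ld sum(contrib)=%ld\n",
		       tg_load, contrib_sum);
}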

The main effect of this quite high value is that the weight/prio of
the sched_entity that represents system.slice in the root cfs_rq is very
low (lower than that of a task with the lowest nice prio), so the
system.slice task group gets the CPU far less often than the user.slice
task group: less than 1% for system.slice, where lightDM and Xorg
are running, compared to 99% for user.slice, where the stress tasks are
running. This is confirmed by the se->avg.util_avg value of the task
groups, which reflects how much time each task group is effectively
running on a CPU:
system.slice[CPU3].se->avg.util_avg = 8 whereas
user.slice[CPU3].se->avg.util_avg = 991

This difference in weight/priority explains why the system becomes
unresponsive. What I can't explain for now is why
system.slice->tg_load_avg = 381697 whereas it should be around 1013,
and how the patch can generate this situation.

Is it possible to have a dump of /proc/sched_debug before starting
the stress command? That would show whether the problem is there from the
beginning but not noticed because the system is not overloaded, or whether
the problem only appears when the user starts to load the system.

Thanks,

>
> Ingo, the latest scheduler bits also still exhibit the bug:
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>
>


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-12 Thread Joseph Salisbury
On 10/12/2016 08:20 AM, Vincent Guittot wrote:
> On 8 October 2016 at 13:49, Mike Galbraith  wrote:
>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>> On 8 October 2016 at 10:39, Ingo Molnar  wrote:
 * Peter Zijlstra  wrote:

> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>> Hello Peter,
>>
>> A kernel bug report was opened against Ubuntu [0].  After a
>> kernel
>> bisect, it was found that reverting the following commit
>> resolved this bug:
>>
>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>> Author: Peter Zijlstra 
>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>
>> sched/fair: Apply more PELT fixes
>>> This patch only speeds up the update of task group load in order to
>>> reflect the new load balance but It should not change the final value
>>> and as a result the final behavior. I will try to reproduce it in my
>>> target later today
>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
> Me too
>
> Is it possible to get some dump of  /proc/sched_debug while the problem 
> occurs ?
>
> Vincent
>
>> -Mike

The output from /proc/sched_debug can be seen here:
http://paste.ubuntu.com/23312351/

Ingo, the latest scheduler bits also still exhibit the bug:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git




Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-12 Thread Joseph Salisbury
On 10/12/2016 08:20 AM, Vincent Guittot wrote:
> On 8 October 2016 at 13:49, Mike Galbraith  wrote:
>> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>>> On 8 October 2016 at 10:39, Ingo Molnar  wrote:
 * Peter Zijlstra  wrote:

> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>> Hello Peter,
>>
>> A kernel bug report was opened against Ubuntu [0].  After a
>> kernel
>> bisect, it was found that reverting the following commit
>> resolved this bug:
>>
>> commit 3d30544f02120b884bba2a9466c87dba980e3be5
>> Author: Peter Zijlstra 
>> Date:   Tue Jun 21 14:27:50 2016 +0200
>>
>> sched/fair: Apply more PELT fixes
>>> This patch only speeds up the update of task group load in order to
>>> reflect the new load balance but It should not change the final value
>>> and as a result the final behavior. I will try to reproduce it in my
>>> target later today
>> FWIW, I tried and failed w/wo autogroup on 4.8 and master.
> Me too
>
> Is it possible to get some dump of  /proc/sched_debug while the problem 
> occurs ?
Yes, I requested that data from the bug reporter.  I also built a test
kernel and requested testing of the repo Ingo pointed me at.  We should
have an update shortly.

Joe


>
> Vincent
>
>> -Mike



Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-12 Thread Vincent Guittot
On 8 October 2016 at 13:49, Mike Galbraith  wrote:
> On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
>> On 8 October 2016 at 10:39, Ingo Molnar  wrote:
>> >
>> > * Peter Zijlstra  wrote:
>> >
>> > > On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>> > > > Hello Peter,
>> > > >
>> > > > A kernel bug report was opened against Ubuntu [0].  After a
>> > > > kernel
>> > > > bisect, it was found that reverting the following commit
>> > > > resolved this bug:
>> > > >
>> > > > commit 3d30544f02120b884bba2a9466c87dba980e3be5
>> > > > Author: Peter Zijlstra 
>> > > > Date:   Tue Jun 21 14:27:50 2016 +0200
>> > > >
>> > > > sched/fair: Apply more PELT fixes
>>
>> This patch only speeds up the update of task group load in order to
>> reflect the new load balance but It should not change the final value
>> and as a result the final behavior. I will try to reproduce it in my
>> target later today
>
> FWIW, I tried and failed w/wo autogroup on 4.8 and master.

Me too

Is it possible to get a dump of /proc/sched_debug while the problem occurs?

Vincent

> -Mike


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-08 Thread Mike Galbraith
On Sat, 2016-10-08 at 13:37 +0200, Vincent Guittot wrote:
> On 8 October 2016 at 10:39, Ingo Molnar  wrote:
> > 
> > * Peter Zijlstra  wrote:
> > 
> > > On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
> > > > Hello Peter,
> > > > 
> > > > A kernel bug report was opened against Ubuntu [0].  After a
> > > > kernel
> > > > bisect, it was found that reverting the following commit
> > > > resolved this bug:
> > > > 
> > > > commit 3d30544f02120b884bba2a9466c87dba980e3be5
> > > > Author: Peter Zijlstra 
> > > > Date:   Tue Jun 21 14:27:50 2016 +0200
> > > > 
> > > > sched/fair: Apply more PELT fixes
> 
> This patch only speeds up the update of task group load in order to
> reflect the new load balance but It should not change the final value
> and as a result the final behavior. I will try to reproduce it in my
> target later today

FWIW, I tried and failed w/wo autogroup on 4.8 and master.

-Mike


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-08 Thread Vincent Guittot
On 8 October 2016 at 10:39, Ingo Molnar  wrote:
>
> * Peter Zijlstra  wrote:
>
>> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
>> > Hello Peter,
>> >
>> > A kernel bug report was opened against Ubuntu [0].  After a kernel
>> > bisect, it was found that reverting the following commit resolved this bug:
>> >
>> > commit 3d30544f02120b884bba2a9466c87dba980e3be5
>> > Author: Peter Zijlstra 
>> > Date:   Tue Jun 21 14:27:50 2016 +0200
>> >
>> > sched/fair: Apply more PELT fixes

This patch only speeds up the update of the task group load in order to
reflect the new load balance, but it should not change the final value
and, as a result, the final behavior. I will try to reproduce it on my
target later today.

>>
>> That commit doesn't revert cleanly, did you take out more?
>
> Note that it reverts cleanly from v4.8 - while it does to revert from current
> upstream that did more changes in that area.
>
> I suspect Josheph tested a v4.8-ish kernel.
>
>> I'll try and have a look asap, but I'm traveling next week so it might a
>> tad slower than normal.
>>
>> If you could provide a /proc/sched_debug dump while the thing is running
>> that'd might be useful.
>
> Also, running the latest scheduler bits would be useful:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>
> ... just on the off chance that another change in this area fixed the bug, 
> plus to
> make it easier to send test patches.
>
> Upstream merge commit af79ad2b1f33 should also be pretty safe to try - it's 
> v4.8
> with the v4.9 scheduler bits applied.
>
> Thanks,
>
> Ingo


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-08 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
> > Hello Peter,
> > 
> > A kernel bug report was opened against Ubuntu [0].  After a kernel
> > bisect, it was found that reverting the following commit resolved this bug:
> > 
> > commit 3d30544f02120b884bba2a9466c87dba980e3be5
> > Author: Peter Zijlstra 
> > Date:   Tue Jun 21 14:27:50 2016 +0200
> > 
> > sched/fair: Apply more PELT fixes
> 
> That commit doesn't revert cleanly, did you take out more?

Note that it reverts cleanly from v4.8 - while it does not revert cleanly
from current upstream, which did more changes in that area.

I suspect Joseph tested a v4.8-ish kernel.

> I'll try and have a look asap, but I'm traveling next week so it might a
> tad slower than normal.
> 
> If you could provide a /proc/sched_debug dump while the thing is running
> that'd might be useful.

Also, running the latest scheduler bits would be useful:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

... just on the off chance that another change in this area fixed the bug,
plus to make it easier to send test patches.

Upstream merge commit af79ad2b1f33 should also be pretty safe to try - it's
v4.8 with the v4.9 scheduler bits applied.

Thanks,

Ingo


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-08 Thread Peter Zijlstra
On Fri, Oct 07, 2016 at 03:38:23PM -0400, Joseph Salisbury wrote:
> Hello Peter,
> 
> A kernel bug report was opened against Ubuntu [0].  After a kernel
> bisect, it was found that reverting the following commit resolved this bug:
> 
> commit 3d30544f02120b884bba2a9466c87dba980e3be5
> Author: Peter Zijlstra 
> Date:   Tue Jun 21 14:27:50 2016 +0200
> 
> sched/fair: Apply more PELT fixes

That commit doesn't revert cleanly; did you take out more?

> The regression was introduced as of v4.8-rc1.  The bug can be reproduced
> on an X1 Carbon with the following:
> stress -c $your_total_cpu_cores
> 
> I was hoping to get your feedback, since you are the patch author.  Do
> you think gathering any additional data will help diagnose this issue,
> or would it be best to submit a revert request? 

I'll try and have a look asap, but I'm traveling next week so it might be a
tad slower than normal.

If you could provide a /proc/sched_debug dump while the thing is running,
that'd be useful.
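
Something along these lines should do it (an untested sketch, assuming the
stress tool from the bug report and coreutils' nproc; adjust the duration to
taste):

	# reproduce the load and snapshot sched_debug while it is running
	stress -c $(nproc) &
	sleep 60
	cat /proc/sched_debug > sched_debug.txt
	kill %1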

Thanks!


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-07 Thread Linus Torvalds
On Fri, Oct 7, 2016 at 1:22 PM, Joseph Salisbury
 wrote:
>
> Yes, CONFIG_SCHED_AUTOGROUP is enabled in the Ubuntu kernel.  However,
> that config was also enabled in the Ubuntu 4.4 kernels without seeing
> this issue.  I can try disabling the config in the 4.8-based kernel and
> see if that changes things.

No, that wouldn't make any sense. I just wanted to know that the
option was enabled, because that option really *should* help buffer
other processes from one session that is a CPU hog.

So something is seriously wrong in that situation if other things get
very choppy. Of course, the fact that it apparently happens only on one
particular machine makes it hard to figure out what
triggers it. Maybe some unlucky combination of cpufreq and thermal
throttling by the hardware, coupled with the scheduler change.

Peter?

 Linus


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-07 Thread Joseph Salisbury
On 10/07/2016 03:57 PM, Linus Torvalds wrote:
> On Fri, Oct 7, 2016 at 12:38 PM, Joseph Salisbury
>  wrote:
>> A kernel bug report was opened against Ubuntu [0].  After a kernel
>> bisect, it was found that reverting the following commit resolved this bug:
> Hmm. Interesting, and it sounds like we should revert that unless
> somebody figures out *why* following the rules wrt cfs_rq updates causes
> problems. But I also wonder what the Ubuntu kernel config is. Does
> Ubuntu enable CONFIG_SCHED_AUTOGROUP=y, for example? Because
> regardless of any other scheduler issues, autogrouping *should* mean
> that when you run some CPU hogger in one session, that should still
> balance all CPU time with other sessions..
>
> I'm not seeing anything odd on my xps13, which should have a similar
> CPU to the X1 Carbon.
>
> Linus
Hi Linus,

Yes, CONFIG_SCHED_AUTOGROUP is enabled in the Ubuntu kernel.  However,
that config was also enabled in the Ubuntu 4.4 kernels without seeing
this issue.  I can try disabling the config in the 4.8-based kernel and
see if that changes things.
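
If it is easier than rebuilding, autogrouping can apparently also be toggled
at run time when CONFIG_SCHED_AUTOGROUP=y (a sketch, assuming the standard
sysctl is exposed on that kernel):

	# 0 disables autogrouping, 1 re-enables it
	echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled
	echo 1 | sudo tee /proc/sys/kernel/sched_autogroup_enabled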

Thanks,

Joe


Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-07 Thread Linus Torvalds
On Fri, Oct 7, 2016 at 12:38 PM, Joseph Salisbury
 wrote:
>
> A kernel bug report was opened against Ubuntu [0].  After a kernel
> bisect, it was found that reverting the following commit resolved this bug:

Hmm. Interesting, and it sounds like we should revert that unless
somebody figures out *why* following the rules wrt cfs_rq updates causes
problems. But I also wonder what the Ubuntu kernel config is. Does
Ubuntu enable CONFIG_SCHED_AUTOGROUP=y, for example? Because
regardless of any other scheduler issues, autogrouping *should* mean
that when you run some CPU hogger in one session, that should still
balance all CPU time with other sessions..

I'm not seeing anything odd on my xps13, which should have a similar
CPU to the X1 Carbon.

   Linus