Hi,

On 26/09/18 09:13, Vincent Guittot wrote:
> On Tue, 25 Sep 2018 at 19:38, Valentin Schneider
> <valentin.schnei...@arm.com> wrote:
>>
>> When load_balance() fails to move some load because of task affinity,
>> we end up increasing sd->balance_interval to delay the next periodic
>> balance in the hopes that next time we look, that annoying pinned
>> task(s) will be gone.
> 
> It's not about hope but to minimize useless periodic load balance as
> there is no possibility to balance anything
> 
>>
>> However, idle_balance() pays no attention to sd->balance_interval, yet
>> it will still lead to an increase in balance_interval in case of
>> pinned tasks.
> 
> I would say that this makes sense as there is no way to pull the
> pinned task on this rq so running periodic load balance (busy or idle)
> is a waste of time  and resources
> 

Increasing the delay for the next periodic balance when there's pinned tasks
makes sense. The issue I have here is that the balance_interval can potentially
be doubled at every failed idle_balance() without waiting for a sufficient
amount of time.

When going through periodic balance, load_balance() is gated by:

    time_after_eq(jiffies, sd->last_balance + interval)

So if we previously ran load_balance() and failed to pull anything because 
of task affinity, we'll wait for at least the new interval before attempting
another load_balance().

On the other hand, idle_balance() will call load_balance() regardless of
balance_interval values. So if the task composition allows it, we can repeatedly
fail to load_balance() and thus double balance_interval at every idle_balance()
"tick". So you can go from 8ms to 512ms in a few successive idle_balance()s,
whereas periodic balance would have to wait for at least the sum of all
intermediate balance_intervals leading to 512 (8, 16, 32, etc).

>>
>> If we're going through several newidle balances (e.g. we have a
>> periodic task), this can lead to a huge increase of the
> 
> This only happen when overload is set which means that there are tasks
> on another CPU but this one can't pull one of them.
> 
> Can you give us details about the use case that you care about ?
> 

It's the same as I presented last week - devlib (some python target 
communication
library I use) has some phase where it spawns at lot of tasks at once to do
some setup (busybox, shutils, bash...). Some of those tasks are pinned to a
particular CPU, and that can lead to failed load_balance() - and to make things
worse, there's a lot of idle_balance() in there.

Eventually when I start running my actual workload a few ~100ms later, it's
impacted by that balance_interval increase.

Admittedly that's a specific use-case, but I don't think this quick increase
is something that was intended.

> Also, If the interval reaches a value that becomes a problem for you
> why don't you reduce the max_interval with sysfs ? This max interval
> can be reach in several other occasions
> 

True, although I saw that as an unwanted behaviour rather than a tuning problem.

>> balance_interval in a very small amount of time.
>>
>> To prevent that, don't increase the balance interval when going
>> through a newidle balance.
>>
>> Signed-off-by: Valentin Schneider <valentin.schnei...@arm.com>
>> ---
>>  kernel/sched/fair.c | 19 ++++++++++++++-----
>>  1 file changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6bd142d..620910d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8782,13 +8782,22 @@ static int load_balance(int this_cpu, struct rq 
>> *this_rq,
>>         sd->nr_balance_failed = 0;
>>
>>  out_one_pinned:
>> +       ld_moved = 0;
>> +
>> +       /*
>> +        * idle_balance() disregards balance intervals, so we could 
>> repeatedly
>> +        * reach this code, which would lead to balance_interval 
>> skyrocketting
>> +        * in a short amount of time. Skip the balance_interval increase 
>> logic
>> +        * to avoid that.
>> +        */
>> +       if (env.idle == CPU_NEWLY_IDLE)
>> +               goto out;
>> +
>>         /* tune up the balancing interval */
>> -       if (((env.flags & LBF_ALL_PINNED) &&
>> -                       sd->balance_interval < MAX_PINNED_INTERVAL) ||
>> -                       (sd->balance_interval < sd->max_interval))
>> +       if ((env.flags & LBF_ALL_PINNED &&
>> +            sd->balance_interval < MAX_PINNED_INTERVAL) ||
>> +           (sd->balance_interval < sd->max_interval))
> 
> This looks like a code cleanup that is not related to the subject
> 

Absolutely, I figured I could get away with it but if that's preferred I can
split that more clearly

>>                 sd->balance_interval *= 2;
>> -
>> -       ld_moved = 0;
>>  out:
>>         return ld_moved;
>>  }
>> --
>> 2.7.4
>>

Reply via email to