Hi,

On 26/09/18 09:13, Vincent Guittot wrote:
> On Tue, 25 Sep 2018 at 19:38, Valentin Schneider
> <valentin.schnei...@arm.com> wrote:
>>
>> When load_balance() fails to move some load because of task affinity,
>> we end up increasing sd->balance_interval to delay the next periodic
>> balance in the hopes that next time we look, that annoying pinned
>> task(s) will be gone.
>
> It's not about hope but to minimize useless periodic load balance as
> there is no possibility to balance anything
>
>>
>> However, idle_balance() pays no attention to sd->balance_interval, yet
>> it will still lead to an increase in balance_interval in case of
>> pinned tasks.
>
> I would say that this makes sense as there is no way to pull the
> pinned task on this rq so running periodic load balance (busy or idle)
> is a waste of time and resources
>

Increasing the delay for the next periodic balance when there are pinned
tasks makes sense. The issue I have here is that the balance_interval can
potentially be doubled at every failed idle_balance() without waiting for
a sufficient amount of time.

When going through periodic balance, load_balance() is gated by:

  time_after_eq(jiffies, sd->last_balance + interval)

So if we previously ran load_balance() and failed to pull anything because
of task affinity, we'll wait for at least the new interval before
attempting another load_balance().

On the other hand, idle_balance() will call load_balance() regardless of
balance_interval values. So if the task composition allows it, we can
repeatedly fail load_balance() and thus double balance_interval at every
idle_balance() "tick".

So you can go from 8ms to 512ms in a few successive idle_balance()s,
whereas periodic balance would have to wait for at least the sum of all
intermediate balance_intervals leading up to 512 (8, 16, 32, etc). There's
a small illustrative sketch of this at the end of this mail.

>>
>> If we're going through several newidle balances (e.g. we have a
>> periodic task), this can lead to a huge increase of the
>
> This only happen when overload is set which means that there are tasks
> on another CPU but this one can't pull one of them.
>
> Can you give us details about the use case that you care about ?
>

It's the same as I presented last week - devlib (some python target
communication library I use) has a phase where it spawns a lot of tasks
at once to do some setup (busybox, shutils, bash...). Some of those tasks
are pinned to a particular CPU, and that can lead to failed
load_balance() - and to make things worse, there's a lot of
idle_balance() in there.

Eventually, when I start running my actual workload a few ~100ms later,
it's impacted by that balance_interval increase. Admittedly that's a
specific use-case, but I don't think this quick increase is something
that was intended.

> Also, If the interval reaches a value that becomes a problem for you
> why don't you reduce the max_interval with sysfs ? This max interval
> can be reach in several other occasions
>

True, although I saw that as an unwanted behaviour rather than a tuning
problem.

>> balance_interval in a very small amount of time.
>>
>> To prevent that, don't increase the balance interval when going
>> through a newidle balance.
>>
>> Signed-off-by: Valentin Schneider <valentin.schnei...@arm.com>
>> ---
>>  kernel/sched/fair.c | 19 ++++++++++++++-----
>>  1 file changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6bd142d..620910d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8782,13 +8782,22 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>                 sd->nr_balance_failed = 0;
>>
>>  out_one_pinned:
>> +       ld_moved = 0;
>> +
>> +       /*
>> +        * idle_balance() disregards balance intervals, so we could repeatedly
>> +        * reach this code, which would lead to balance_interval skyrocketting
>> +        * in a short amount of time. Skip the balance_interval increase logic
>> +        * to avoid that.
>> +        */
>> +       if (env.idle == CPU_NEWLY_IDLE)
>> +               goto out;
>> +
>>         /* tune up the balancing interval */
>> -       if (((env.flags & LBF_ALL_PINNED) &&
>> -                       sd->balance_interval < MAX_PINNED_INTERVAL) ||
>> -                       (sd->balance_interval < sd->max_interval))
>> +       if ((env.flags & LBF_ALL_PINNED &&
>> +            sd->balance_interval < MAX_PINNED_INTERVAL) ||
>> +           (sd->balance_interval < sd->max_interval))
>
> This looks like a code cleanup that is not related to the subject
>

Absolutely, I figured I could get away with it, but if that's preferred
I can split that out more clearly.

>>                 sd->balance_interval *= 2;
>> -
>> -       ld_moved = 0;
>>  out:
>>         return ld_moved;
>>  }
>> --
>> 2.7.4
>>
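
For what it's worth, here is a purely illustrative userspace sketch of the
interval growth described above - not kernel code, and the 8ms starting
interval and 512ms cap are just assumed example values:

#include <stdio.h>

/*
 * Illustrative only: counts how many failed balances it takes to double
 * balance_interval up to the cap, and how long periodic balance has to
 * wait for the same growth. 8ms/512ms are assumed example values, not
 * something read from the kernel.
 */
int main(void)
{
	unsigned int interval = 8;		/* assumed starting balance_interval (ms) */
	const unsigned int max_interval = 512;	/* assumed max_interval (ms) */
	unsigned int periodic_wait = 0;
	int failed_balances = 0;

	while (interval < max_interval) {
		/*
		 * Periodic balance only re-runs load_balance() once
		 * 'interval' ms have elapsed (the time_after_eq() gate),
		 * so each doubling costs at least that much time.
		 */
		periodic_wait += interval;
		interval *= 2;
		failed_balances++;
	}

	printf("failed balances to reach %ums: %d\n", max_interval, failed_balances);
	printf("periodic balance: at least %ums elapsed\n", periodic_wait);
	printf("newidle balance: %d back-to-back idle_balance() calls\n", failed_balances);

	return 0;
}

Running it gives 6 doublings: periodic balance needs at least 504ms
(8+16+32+64+128+256) of elapsed intervals to reach 512ms, whereas newidle
balance only needs 6 back-to-back failed idle_balance() calls, which can
happen in a tiny fraction of that time.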