A long-tail load balance cost is observed on the newly idle path. It is
caused by a race window between the first nr_running check of the busiest
runqueue and the nr_running recheck in detach_tasks(): before the busiest
runqueue is locked, its tasks can be pulled by other CPUs, so nr_running
drops to 1. detach_tasks() then breaks out with the LBF_ALL_PINNED flag
still set, which triggers a load_balance() redo at the same sched_domain
level. To find the new busiest sched_group and CPU, the redo recomputes
and updates the various load statistics, and that recomputation is what
eventually produces the long-tail load balance cost.
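A simplified interleaving of the race window (CPU numbers and the exact
call chain are illustrative, not taken from a trace):

	CPU0 (newly idle)                      CPU1
	load_balance()
	  find_busiest_queue()
	    /* busiest rq seen with nr_running > 1 */
	                                       pulls tasks from the busiest
	                                       rq; its nr_running drops to 1
	  rq_lock_irqsave(busiest)
	  detach_tasks()
	    /* recheck: src_rq->nr_running <= 1,
	     * break out with LBF_ALL_PINNED still set
	     */
	  goto redo
	    /* recompute/update load statistics to find
	     * the new busiest sched_group and CPU
	     */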
This patch introduces a variable (sched_nr_lb_redo) to limit the number of
load balance redos. Combined with sysctl_sched_nr_migrate, it reduces the
maximum load balance cost from 100+ us to 70+ us, measured on a 4-socket
x86 system with 192 logical CPUs.

Cc: Andi Kleen <a...@linux.intel.com>
Cc: Tim Chen <tim.c.c...@linux.intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruv...@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Signed-off-by: Aubrey Li <aubrey...@linux.intel.com>
---
 kernel/sched/fair.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae7ceba..b59f371 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7407,6 +7407,8 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+	unsigned int		redo_cnt;
+	unsigned int		redo_max;
 
 	enum fbq_type		fbq_type;
 	enum migration_type	migration_type;
@@ -9525,6 +9527,7 @@ static int should_we_balance(struct lb_env *env)
 	return group_balance_cpu(sg) == env->dst_cpu;
 }
 
+static const unsigned int sched_nr_lb_redo = 1;
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -9547,6 +9550,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.dst_grpmask    = sched_group_span(sd->groups),
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
+		.redo_max	= sched_nr_lb_redo,
 		.cpus		= cpus,
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
@@ -9682,7 +9686,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		 * destination group that is receiving any migrated
 		 * load.
 		 */
-		if (!cpumask_subset(cpus, env.dst_grpmask)) {
+		if (!cpumask_subset(cpus, env.dst_grpmask) &&
+		    ++env.redo_cnt < env.redo_max) {
 			env.loop = 0;
 			env.loop_break = sched_nr_migrate_break;
 			goto redo;
-- 
2.7.4