Hi,

We're seeing a livelock where two CPUs both loop with interrupts disabled in pick_next_task_fair() / idle_balance() and continuously pull all tasks from each other. This has been observed on a 3.18 kernel, but the relevant code paths appear to be the same in current mainline.
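For reference, the path both CPUs are stuck in looks roughly like this (heavily simplified pseudocode based on our reading of 3.18's kernel/sched/fair.c; the function and field names are the real ones, but the rest is paraphrased and elides a lot of detail):

  /*
   * Simplified sketch of the newidle path, not the actual kernel code.
   * idle_balance() really iterates over the sched domains and calls
   * load_balance() -> detach_tasks()/attach_tasks() for each of them.
   */
  static struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev)
  {
  again:
          if (!rq->cfs.h_nr_running) {
                  /*
                   * idle_balance() drops rq->lock while it pulls tasks
                   * from the busiest rq, then retakes it.
                   */
                  if (idle_balance(rq) > 0)
                          goto again;     /* pulled something, retry the pick */
                  return NULL;            /* nothing to run, go idle */
          }
          /* ... normal CFS pick ... */
  }

As long as each CPU keeps pulling back the tasks the other one just took, neither of them ever leaves this loop, and interrupts stay disabled the whole time.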
The problem situation appears to be this:

cpu0                                cpu1

pick_next_task
 take rq0->lock
 pick_next_task_fair
  running=0
  idle_balance()
   drop rq0->lock
                                    wakeup A,B on cpu0

                                    pick_next_task
                                     take rq1->lock
                                     pick_next_task_fair
                                      running=0
                                      idle_balance()
                                       drop rq1->lock
                                       load_balance()
                                        busiest=rq0
                                        take rq0->lock
                                        detach A,B
                                        drop rq0->lock
                                        take rq1->lock
                                        attach A,B
                                        pulled_task = 2
                                        drop rq1->lock
   load_balance()
    busiest=rq1
    take rq1->lock
    detach A,B
    drop rq1->lock
    take rq0->lock
    attach A,B
    pulled_task = 2
    drop rq0->lock
                                      running=0
                                      idle_balance()
                                       busiest=rq0, pull A,B, etc.
  running=0
  load_balance()
   busiest=rq1, pull A,B, etc.

And this goes on, with interrupts disabled on both CPUs, for at least 100 ms (which is when a hardware watchdog hits).

The conditions needed, apart from the right timing, are:

 - cgroups.  One of the tasks, say A, needs to be in a CPU cgroup.  When the problem occurs, A's ->se has zero load_avg_contrib and task_h_load(A) is zero.  However, A's se->parent->parent has a (relatively) high load_avg_contrib, so cpu0's cfs_rq has a relatively high runnable_load_avg.  find_busiest_group() therefore detects an imbalance, and detach_tasks() detaches all tasks.

 - PREEMPT=n.  Otherwise, the code under #ifdef CONFIG_PREEMPT in detach_tasks() would ensure that we only ever pull a maximum of one task during idle balancing.

 - cpu0 and cpu1 are threads on the same core (cpus_share_cache() == true).  Otherwise, cpu1 would not be able to wake up tasks on cpu0 while cpu0 has interrupts disabled, since an IPI would be required.  Turning off the TTWU_QUEUE feature (which is on by default) would have the same effect.

I see two simple ways to prevent the livelock.  One is to just remove the #ifdef:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0..74c94dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5937,7 +5937,6 @@ static int detach_tasks(struct lb_env *env)
 		detached++;
 		env->imbalance -= load;
 
-#ifdef CONFIG_PREEMPT
 		/*
 		 * NEWIDLE balancing is a source of latency, so preemptible
 		 * kernels will stop after the first task is detached to minimize
@@ -5945,7 +5944,6 @@ static int detach_tasks(struct lb_env *env)
 		 */
 		if (env->idle == CPU_NEWLY_IDLE)
 			break;
-#endif
 
 		/*
 		 * We only want to steal up to the prescribed amount of

Or this should work too:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0..13358cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7487,6 +7487,8 @@ static int idle_balance(struct rq *this_rq)
 	 */
 	if (this_rq->cfs.h_nr_running && !pulled_task)
 		pulled_task = 1;
+	else if (!this_rq->cfs.h_nr_running && pulled_task)
+		pulled_task = 0;
 
 out:
 	/* Move the next balance forward */

But it seems that the real problem is that detach_tasks() can remove all tasks when cgroups are involved, as described above?

Thanks.

/Rabin
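P.S. To illustrate the "detach_tasks() removes all tasks" point: the loop in 3.18 looks roughly like this (simplified from memory, with several checks elided, so don't take it as a verbatim quote):

  /* Simplified sketch of the detach_tasks() loop, not verbatim kernel code. */
  while (!list_empty(tasks)) {
          p = list_first_entry(tasks, struct task_struct, se.group_node);

          load = task_h_load(p);

          /* With load == 0 this check never skips the task... */
          if ((load / 2) > env->imbalance)
                  goto next;

          detach_task(p, env);
          list_add(&p->se.group_node, &env->tasks);
          detached++;

          /* ...and detaching it makes no progress on the imbalance... */
          env->imbalance -= load;

  #ifdef CONFIG_PREEMPT
          /* ...and with PREEMPT=n this early exit is compiled out... */
          if (env->idle == CPU_NEWLY_IDLE)
                  break;
  #endif

          /* ...so this stop condition is never reached either. */
          if (env->imbalance <= 0)
                  break;
  next:
          list_move_tail(&p->se.group_node, tasks);
  }

So with env->imbalance inflated by the group se's load while task_h_load() for the tasks themselves is zero (or at least much smaller), nothing stops the loop before the source rq has been emptied.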