On Thu, Jul 02, 2015 at 09:05:39AM +0800, Yuyang Du wrote:
> Hi Mike,
>
> On Thu, Jul 02, 2015 at 10:05:47AM +0200, Mike Galbraith wrote:
> > On Thu, 2015-07-02 at 07:25 +0800, Yuyang Du wrote:
> >
> > > That being said, it is also obvious to prevent the livelock from
> > > happening:
> > > idle pulling until the source rq's nr_running is 1, becuase otherwise we
> > > just avoid idleness by making another idleness.
> >
> > Yeah, but that's just the symptom, not the disease. Better for the idle
> > balance symptom may actually be to only pull one when idle balancing.
> > After all, the immediate goal is to find something better to do than
> > idle, not to achieve continual perfect (is the enemy of good) balance.
> >
> Symptom? :)
>
> You mean "pull one and stop, can't be greedy"? Right, but still need to
> assure you don't make another idle CPU (meaning until nr_running == 1), which
> is the cure to disease.
>
> I am ok with at most "pull one", but probably we stick to the load_balance()
> by pulling an fair amount, assuming load_balance() magically computes the
> right imbalance, otherwise you may have to do multiple "pull one"s.

Talking about the disease and looking at the debug data that Rabin has
provided, I think the problem is due to the way blocked load is handled
(or not handled) in calculate_imbalance().

We have three entities in the root cfs_rq on cpu1:

1. Task entity pid 7, load_avg_contrib = 5.
2. Task entity pid 30, load_avg_contrib = 10.
3. Group entity, load_avg_contrib = 118, but contains task entity pid
   413 further down the hierarchy with task_h_load() = 0. The 118 comes
   from the blocked load contribution in the system.slice task group.

calculate_imbalance() figures out the average loads are:

cpu0: load/capacity = 0*1024/1024 = 0
cpu1: load/capacity = (5 + 10 + 118)*1024/1024 = 133

domain: load/capacity = (0 + 133)*1024/(2*1024) = 66

env->imbalance = 66

Rabin reported env->imbalance = 60 after pulling the rcu task with
load_avg_contrib = 5. It doesn't match my numbers exactly, but it's pretty
close ;-)

detach_tasks() will attempt to pull 66 based on the tasks' task_h_load(),
but the task_h_load() sum is only 5 + 10 + 0 and hence detach_tasks() will
empty the src_rq.

IOW, since task groups include blocked load in the load_avg_contrib (see
__update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
imbalance includes blocked load and hence env->imbalance >=
sum(task_h_load(p)) for all tasks p on the rq. This leads to
detach_tasks() emptying the rq completely in the reported scenario where
blocked load > runnable load.

Whether emptying the src_rq is the right thing to do depends on your
point of view. Does balanced load (runnable+blocked) take priority over
keeping cpus busy or not? For idle_balance() it seems intuitively correct
to not empty the rq and hence you could consider env->imbalance to be
too big.

I think we will see more of this kind of problem if we include
weighted_cpuload() as well. Parts of the imbalance calculation code are
quite old and could use some attention first.

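To make the arithmetic above concrete, here is a minimal user-space sketch
in C. It is not kernel code: struct toy_rq, toy_rq_load(), toy_domain_avg()
and toy_detach_tasks() are hypothetical names made up for illustration, the
blocked load is modelled as a single number added to the rq load, and the
pull loop is far simpler than the real detach_tasks(). The keep_last flag
models the "stop at nr_running == 1" idea quoted at the top of this mail.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CAPACITY_SCALE 1024UL
#define TOY_MAX_TASKS 8

/* Hypothetical, simplified stand-in for a runqueue (not struct rq). */
struct toy_rq {
	unsigned long task_h_load[TOY_MAX_TASKS]; /* per-task hierarchical load */
	size_t nr_running;                        /* number of runnable tasks */
	unsigned long blocked_load;               /* blocked load seen via the group entity */
	unsigned long capacity;                   /* cpu capacity, 1024 == one full cpu */
};

/* Load used for balancing: runnable task load plus blocked load. */
static unsigned long toy_rq_load(const struct toy_rq *rq)
{
	unsigned long sum = rq->blocked_load;

	for (size_t i = 0; i < rq->nr_running; i++)
		sum += rq->task_h_load[i];
	return sum;
}

/* Domain average: total load * scale / total capacity (integer division). */
static unsigned long toy_domain_avg(const struct toy_rq *rqs, size_t nr)
{
	unsigned long load = 0, cap = 0;

	assert(nr > 0);
	for (size_t i = 0; i < nr; i++) {
		load += toy_rq_load(&rqs[i]);
		cap += rqs[i].capacity;
	}
	assert(cap > 0);
	return load * TOY_CAPACITY_SCALE / cap;
}

/*
 * Toy pull loop: take tasks from the end of the array while imbalance
 * remains and the next task's load fits into it. With keep_last set,
 * stop as soon as only one task is left on src. Returns tasks pulled.
 */
static size_t toy_detach_tasks(struct toy_rq *src, unsigned long imbalance,
			       int keep_last)
{
	size_t pulled = 0;

	while (src->nr_running > 0 && imbalance > 0) {
		unsigned long load;

		if (keep_last && src->nr_running <= 1)
			break;
		load = src->task_h_load[src->nr_running - 1];
		if (load > imbalance)
			break;
		imbalance -= load;
		src->nr_running--;
		pulled++;
	}
	return pulled;
}
```

Fed with the cpu1 numbers (task loads 5, 10 and 0, blocked load 118,
capacity 1024) and an idle cpu0, toy_domain_avg() returns 66 with integer
division. Passing that as the imbalance, the loop pulls all three tasks with
keep_last = 0, because the runnable sum of 15 never uses up the imbalance,
and leaves one task behind with keep_last = 1.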
A short-term fix could be what Yuyang proposes: stop pulling tasks when
there is only one left in detach_tasks(). It won't affect active load
balance, where we may want to migrate the last task, as active load
balance doesn't use detach_tasks().

Morten