Hi Mike, Thank you very much for such a clear and comprehensive explanation. So when I put together the problem and the proposed solution pieces in the current scheduler scalability,the following was what I found:
1. select_idle_sibling() is needed as an agent to correctly find the right cpu for wake up tasks to go to."Correctly" would be to find an idle cpu at the lowest cost possible. 2."Cost could be lowered" either by optimizing the order of searching for an idle cpu or restricting the search to a few cpus alone. 3. The former has the problem that it would not prevent bouncing tasks all over the domain sharing an L3 cache,which could potentially affect the fast moving tasks. 4. The latter has the problem that it is not aggressive enough in finding an idle cpu. This is some tangled problem,but I think the solution at best could be smoothed to a a flowchart. STEP1 STEP2 STEP3 _____________________ | | |See if the idle buddy|No _________________ Yes ________________ |is free at all sched |---->| Do we search the|----> |Optimized search| |domains | |sched domains | |________________| |_____________________| |for an idle cpu | | |Yes |_________________| \|/ \|/ |No: saturated Return target cpu Return \|/ system cpu buddy Return prev_cpu I dont know how well we can do STEP2. That seems to be the biggest challenge.STEP 1 is a reverted commit 37407ea7, "sched: Improve scalability via 'CPU buddies', which withstand random perturbations" for reasons which can hopefully be overcome by this flowchart. I have tried to tackle STEP3.STEP 3 will not prevent bouncing but a good STEP2 could tell us if it is worth the bounce. STEP3 Patch is given below: ***********************START PATCH************************************** sched:Reduce the overhead of select_idle_sibling From: Preeti U Murthy <pre...@linux.vnet.ibm.com> The traversal of the sched domains to find the idlest cpu is a costly operation currently in select_idle_sibling() due to large L3 caches. So the traversal better be worth the effort. In the current approach,in select_idle_sibling(),the sched domains are searched top to bottom starting at the last level cache domain,and the search is for the idlest *group*. It is less likely that you will find a completely idle group at a higher level sched domain compared to a lower level sched domain.This will make the first few iterations fruitless most of the times. And it is less likely to find an idle *group* compared to an idle *cpu*. This patch aims at finding an idle cpu.And since the iterations are from bottom to top,stopping at the last level cache domain,it tries to see if the task can be scheduled on a core which shares the l2 cache with the waking cpu/prev cpu first, before probing the cpus at the higher level domain,trying to make the cost of migration lower than the current approach. Also it tries to return an idle cpu as soon as it finds one,without querying the status of the other cpus in the sched group.This could potentially be much faster unless a worse case such as not finding an idle cpu at every domain occurs. --- kernel/sched/fair.c | 33 +++++++++++++++++++++------------ 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b29cdbf..6123bca 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3310,7 +3310,7 @@ static int select_idle_sibling(struct task_struct *p, int target) { int cpu = smp_processor_id(); int prev_cpu = task_cpu(p); - struct sched_domain *sd; + struct sched_domain *sd, *tmp; struct sched_group *sg; int i; @@ -3332,24 +3332,33 @@ static int select_idle_sibling(struct task_struct *p, int target) * Otherwise, iterate the domains and find an elegible idle cpu. */ sd = rcu_dereference(per_cpu(sd_llc, target)); - for_each_lower_domain(sd) { - sg = sd->groups; + /* + * Iterate the domains bottom up,to cash in on the shared lower level + * cache advantages. + * Since this iteration is costly,return any idle cpu and dont wait + * for a completely idle group. + */ + for_each_domain(target, tmp) { + sg = tmp->groups; do { if (!cpumask_intersects(sched_group_cpus(sg), tsk_cpus_allowed(p))) goto next; - for_each_cpu(i, sched_group_cpus(sg)) { - if (!idle_cpu(i)) - goto next; + if (idle_cpu(i)) { + target = i; + goto done; + } } +next: sg = sg->next; - target = cpumask_first_and(sched_group_cpus(sg), - tsk_cpus_allowed(p)); - goto done; -next: - sg = sg->next; - } while (sg != sd->groups); + } while (sg != tmp->groups); + + /* if you have covered the llc domain then break out,it is + * not worth migrating on any higher level domain + */ + if (tmp == sd) + break; } done: return target; Regards Preeti U Murthy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/