[RFC PATCH 01/13] sched: Prevent movement of short-running tasks during load balancing
Prevent sched groups with low load, as tracked by PJT's metrics, from being candidates for the load-balance routine. The threshold is chosen to be 1024 + 15% * 1024 = 1178. Using PJT's metrics it has been observed that even when three 10% tasks are running, the load sometimes does not exceed this threshold. The call should be taken on whether the tasks can afford to be throttled. This is why an additional metric has been included, which can determine how long we can tolerate tasks not being moved even if the load is low.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 kernel/sched/fair.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dbddcf6..e02dad4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4188,6 +4188,7 @@ struct sd_lb_stats {
  */
 struct sg_lb_stats {
 	unsigned long avg_load; /* Avg load across the CPUs of the group */
+	u64 avg_cfs_runnable_load; /* Equivalent of avg_load but calculated using PJT's metric */
 	unsigned long group_load; /* Total load over the CPUs of the group */
 	unsigned long sum_nr_running; /* Nr tasks running in the group */
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
@@ -4504,6 +4505,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	unsigned long load, max_cpu_load, min_cpu_load;
 	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
+	u64 group_load = 0; /* computed using PJT's metric */
 	int i;

 	if (local_group)
@@ -4548,6 +4550,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (idle_cpu(i))
 			sgs->idle_cpus++;

+		group_load += cpu_rq(i)->cfs.runnable_load_avg;
 		update_sg_numa_stats(sgs, rq);
 	}

@@ -4572,6 +4575,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;

 	/*
+	 * Check if the sched group has not crossed the threshold.
+	 *
+	 * Also check if the sched group, although being within the threshold,
+	 * is not queueing too many tasks. If yes to both, then make it an
+	 * invalid candidate for load balancing.
+	 *
+	 * The below condition is included as a tunable to meet performance
+	 * and power needs.
+	 */
+	sgs->avg_cfs_runnable_load = (group_load * SCHED_POWER_SCALE) / group->sgp->power;
+	if (sgs->avg_cfs_runnable_load <= 1178 && sgs->sum_nr_running <= 2)
+		sgs->avg_cfs_runnable_load = 0;
+
 	/*
 	 * Consider the group unbalanced when the imbalance is larger
 	 * than the average weight of a task.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 00/13] sched: Integrating Per-entity-load-tracking with the core scheduler
Hi Peter,

Thank you very much for your feedback.

On 10/25/2012 09:26 PM, Peter Zijlstra wrote:
> OK, so I tried reading a few patches and I'm completely failing.. maybe
> its late and my brain stopped working, but it simply doesn't make any
> sense. Most changelogs and comments aren't really helping either. At
> best they mention what you're doing, not why and how. This means I get
> to basically duplicate your entire thought pattern and I might as well
> do the patches myself.
>
> I also don't see the 'big' picture of what you're doing, you start out
> by some weird avoid short running task movement.. why is that a good
> start? I would have expected a series like this to replace rq->cpu_load /
> rq->load with a saner metric and go from there.. instead it looks like
> its going about at things completely backwards. Fixing small details
> instead of the big things. Do explain..

Let me see if I get what you are saying here right. You want to replace, for example, cfs_rq->load.weight with a saner metric because it does not consider the run time of the sched entities queued on it, merely their priorities. If this is right, then in this patchset I believe cfs_rq->runnable_load_avg would be that right metric, because it considers the run time of the sched entities queued on it.

So why didn't I replace? I added cfs_rq->runnable_load_avg as an additional metric instead of replacing the older metric. I let the old metric be as a dead metric and used the newer metric as an alternative, so if this alternate metric does not do us good, we have the old metric to fall back on.

> At best they mention what you're doing, not why and how.

So the above explains *what* I am doing.

*How* am I doing it: everywhere the scheduler needs to make a decision on:

a. find_busiest_group/find_idlest_group/update_sg_lb_stats: use the sum of cfs_rq->runnable_load_avg to decide this instead of the sum of cfs_rq->load.weight.
b. find_busiest_queue/find_idlest_queue: use cfs_rq->runnable_load_avg to decide this instead of cfs_rq->load.weight.
c. move_tasks: use se->avg.load_avg_contrib to decide the weight of each task instead of se->load.weight, as the former reflects the runtime of the sched entity and hence its actual load.

This is what my patches 3-13 do. Merely *replace*.

*Why* am I doing it: feed the load balancer with a more realistic metric to do load balancing and observe the consequences.

> you start out by some weird avoid short running task movement.
> why is that a good start?

The short running tasks are not really burdening you, they have very little run time, so why move them? Throughout the concept of load balancing the focus is on *balancing the existing load* between the sched groups, but not really on evaluating the *absolute load* of any given sched group.

Why is this *the start*? This is like a round of elimination before the actual interview round. Try to have only those sched groups as candidates for load balancing that are sufficiently loaded [Patch 1]. This *sufficiently loaded* is decided by the new metric explained in the *How* above.

> I also don't see the 'big' picture of what you're doing

Patch 1: explained above. *End result: good candidates for lb.*

Patch 2:

  10%  10%  10%          100%
  -------------          ----
    SCHED_GP1          SCHED_GP2

  Before Patch               After Patch
  ------------               -----------
  SCHED_GP1 load: 3072       SCHED_GP1: 512
  SCHED_GP2 load: 1024       SCHED_GP2: 1024
  SCHED_GP1 is busiest       SCHED_GP2 is busiest

But Patch 2 re-decides between GP1 and GP2 to check if the latency of tasks is getting affected although there is less load on GP1. If yes, it is a better *busy* group. *End result: better candidates for lb.*

Rest of the patches: now that we have our busy sched group, let us load balance with the aid of the new metric. *End result: hopefully a more sensible movement of loads.*

This is how I build the picture.
Regards
Preeti U Murthy
Re: [RFC PATCH 00/13] sched: Integrating Per-entity-load-tracking with the core scheduler
On 10/26/2012 05:59 PM, Peter Zijlstra wrote:
> On Thu, 2012-10-25 at 23:42 +0530, Preeti U Murthy wrote:
>
> firstly, cfs_rq is the wrong place for a per-cpu load measure, secondly
> why add another load field instead of fixing the one we have?

Hmm.., rq->load.weight is the place.

>> So why didn't I replace? I added cfs_rq->runnable_load_avg as an
>> additional metric instead of replacing the older metric. I let the old
>> metric be as a dead metric and used the newer metric as an alternative,
>> so if this alternate metric does not do us good, we have the old metric
>> to fall back on.
>
> That's wrong.. either it works and we can apply the patches or it
> doesn't and we won't. What you did makes it very hard to see you
> preserve the current balancer -- which afaict you don't, you change the
> balancer with the very first patch.

You are right on this, Peter.

> Why not update weighted_cpuload(), update_idle_cpu_load() and
> update_cpu_load_active() to use another metric and go from there. If you
> do that the whole balancer will auto-magically use the new weight
> measure. Once you have that running, you can look at modifying it.

Hmm... Correct.

>> a. find_busiest_group/find_idlest_group/update_sg_lb_stats: use the sum
>> of cfs_rq->runnable_load_avg to decide this instead of the sum of
>> cfs_rq->load.weight.
>
> But the first patches are about adding new balancing conditions, that's
> a complete fail..

>> b. find_busiest_queue/find_idlest_queue: use cfs_rq->runnable_load_avg
>> to decide this instead of cfs_rq->load.weight.
>
> See, if you would have changed the input function (weighted_cpuload),
> you wouldn't have needed to touch any of this.

I see your point.

>> c. move_tasks: use se->avg.load_avg_contrib to decide the weight of
>> each task instead of se->load.weight, as the former reflects the
>> runtime of the sched entity and hence its actual load.
>
> The changelog in that patch (7) is completely devoid of any useful
> information.

>> This is what my patches 3-13 do. Merely *replace*.
>>
>> *Why* am I doing it: feed the load balancer with a more realistic
>> metric to do load balancing and observe the consequences.
>
> I know why you're doing the entire thing, but you fail to motivate each
> individual change. A changelog should read something like: current code
> does blah, this has a problem when blah, we can improve this doing blah
> because blah.

Ah! I get it.

>> you start out by some weird avoid short running task movement.
>> why is that a good start?
>> The short running tasks are not really burdening you, they have very
>> little run time, so why move them?
>
> The most important part is why this would be a good start to the series,
> it is not. The patch is also dubious at best; short running might be all
> you have, your definition of 'short' is also iffy.

My definition of 'short' was bursty loads. What I observed from using the new metric to study the effective load calculation was: when there are around 2-3 such bursty loads, the effective load stays below the threshold that I have stated, and I thought this would be a good enough excuse to let the loads stay on the cpu. Bursty being a load that sleeps for 9 ms every 10 ms for a duration of 10 s (a 10% load) in my experiments.

>> Throughout the concept of load balancing the focus is on *balancing the
>> existing load* between the sched groups, but not really on evaluating
>> the *absolute load* of any given sched group.
>
> Which is why you're going to change the metrics.. the balancer really
> only cares about making load equal, flipping the metric of the load
> doesn't change anything fundamental.

Ok.

>> Why is this *the start*? This is like a round of elimination before the
>> actual interview round. Try to have only those sched groups as
>> candidates for load balancing that are sufficiently loaded [Patch 1].
>> This *sufficiently loaded* is decided by the new metric explained in
>> the *How* above.
>
> No, this is absolutely wrong.
> So a sane series would introduce maybe two functions: cpu_load() and
> task_load() and use those where we now use rq->load.weight and
> p->se.load.weight for load balancing purposes. Implement these functions
> using those two expressions. So effectively this patch is a NOP.
>
> Secondly, switch these two functions over to the per-task based
> averages. Tada! all done. The load balancer will then try and equalize
> effective load instead of instant load.
>
> It will do the 3x10% vs 100% thing correctly with just those two
> patches. Simply because it will report a lower cpu-load for the 3x10%
> case than it will for the 100% case, no need to go fudge about in the
> load-balance internals.
>
> Once you've got this correctly done, you can go change balancing to
> better utilize the new metric, like use the effective load instead of
> nr_running against the capacity and things like that. But for every such
> change you want to be very careful and run all the benchmarks you can
> find -- in fact you want to do that after the 2nd patch too.

I agree with this entire suggestion. Perhaps I was hesitant
Re: [RFC PATCH 00/13] sched: Integrating Per-entity-load-tracking with the core scheduler
On 10/26/2012 06:37 PM, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijls...@chello.nl> wrote:
>
> [...]
>
>> So a sane series would introduce maybe two functions: cpu_load() and
>> task_load() and use those where we now use rq->load.weight and
>> p->se.load.weight for load balancing purposes. Implement these
>> functions using those two expressions. So effectively this patch is a
>> NOP.
>>
>> Secondly, switch these two functions over to the per-task based
>> averages. Tada! all done. The load balancer will then try and equalize
>> effective load instead of instant load.
>>
>> It will do the 3x10% vs 100% thing correctly with just those two
>> patches. Simply because it will report a lower cpu-load for the 3x10%
>> case than it will for the 100% case, no need to go fudge about in the
>> load-balance internals.
>>
>> Once you've got this correctly done, you can go change balancing to
>> better utilize the new metric, like use the effective load instead of
>> nr_running against the capacity and things like that. But for every
>> such change you want to be very careful and run all the benchmarks you
>> can find -- in fact you want to do that after the 2nd patch too.
>
> If anyone posted that simple two-patch series that switches over to the
> new load metrics I'd be happy to test the performance of those. Having
> two parallel load metrics is really not something that we should
> tolerate for too long.
>
> Thanks,
>
> 	Ingo

Right, Ingo. I will incorporate this approach and post out very soon.

Thank you

Regards
Preeti
Re: [RFC PATCH v3 5/6] sched: pack the idle load balance
Hi Vincent,

On 04/05/2013 04:38 PM, Vincent Guittot wrote:
> Peter,
>
> After some thoughts about your comments, I can update the buddy cpu
> during ILB or periodic LB to a new idle core and extend the packing
> mechanism. Does this additional mechanism sound better for you?

If the primary goal of this patchset is to pack small tasks into fewer power domains, then why not see if the power aware scheduler patchset by Alex does the same for you? The reasons being:

a. The power aware scheduler also checks if a task is small enough to be packed on a cpu which has just enough capacity to take on that task (leader cpu). This cpu belongs to a scheduler group which is nearly full (group_leader), so we end up packing tasks.
b. The overhead of assigning a buddy cpu gets eliminated because the best cpu for packing is decided during wake up.
c. This is a scalable solution, because if the leader cpu is busy, then any other idle cpu from that group_leader is chosen. Eventually you end up packing anyway.

The reason that I am suggesting this is that we could unify the power awareness of the scheduler under one umbrella, and I believe that the current power aware scheduler patchset is flexible enough to do this and that we must cash in on it.

Thanks

Regards
Preeti U Murthy

> Vincent
>
> On 26 March 2013 15:42, Peter Zijlstra <pet...@infradead.org> wrote:
>> On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
>>
>> But ha! here's your NO_HZ link.. but does the above DTRT and ensure
>> that the ILB is a little core when possible?
>
> The loop looks for an idle CPU as close as possible to the buddy CPU,
> and the buddy CPU is the 1st CPU that has been chosen. So if your buddy
> is a little and there is an idle little, the ILB will be this idle
> little.
> Earlier you wrote:
>
>         | Cluster 0   | Cluster 1   |
>         | CPU0 | CPU1 | CPU2 | CPU3 |
>   ------+------+------+------+------+
>   buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> So extrapolating that to a 4+4 big-little you'd get something like:
>
>         |   little A9   ||    big A15    |
>         | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
>   ------+---+---+---+---++---+---+---+---+
>   buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>
> Right? So supposing the current ILB is 6, we'll only check 4, not 0-3,
> even though there might be a perfectly idle cpu in there.
>
> Also, your scheme fails to pack when cpus 0,4 are filled, even when
> there's idle cores around. If we'd use the ILB as packing cpu, we would
> simply select a next pack target once the old one fills up.
Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()
Hi Joonsoo,

On 04/04/2013 06:12 AM, Joonsoo Kim wrote:
> Hello, Preeti.
>
> So, how about extending a sched_period with rq->nr_running, instead of
> cfs_rq->nr_running? It is my quick thought and I think that we can
> ensure to run at least once in this extended sched_period.

Yeah, this seems to be correct. This would ensure sched_min_granularity as well. So then, in the scenario where there are 2 task groups in a runqueue with 10 tasks each, when we calculate the sched_slice of any task, __sched_period() would return 4 * 20 = 80 ms, and the sched_slice of each task would be 80 / 20 = 4 ms.

But what about the sched_slice of each task group? How would that be calculated then? Let us take the above example and walk through this problem. This would probably help us spot the issues involved.

> And, do we leave a problem if we cannot guarantee the at-least-once
> property?

This would depend on the results of the benchmarks with the changes. I am unable to comment on this off the top of my head.

Thank you

Regards
Preeti U Murthy
Re: [patch v7 20/21] sched: don't do power balance on share cpu power domain
Hi Alex,

On 04/04/2013 07:31 AM, Alex Shi wrote:
> Packing tasks among such domain can't save power, just performance
> losing. So no power balance on them.

As far as my understanding goes, the powersave policy is the one that tries to pack tasks onto a SIBLING domain (a domain where SD_SHARE_CPUPOWER is set). The balance policy does not do that, meaning it does not pack on the domain that shares CPU power, but packs across all other domains. So the change you are making below results in nothing but the default behaviour of the balance policy.

Correct me if I am wrong, but my point is: it looks to me that the powersave policy is introduced in this patchset, and with the below patch its characteristic behaviour of packing onto domains sharing cpu power is removed, thus making it default to the balance policy. Now there are two policies which behave the same way: balance and powersave.

> Signed-off-by: Alex Shi alex@intel.com
> ---
>  kernel/sched/fair.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 047a1b3..3a0284b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3503,7 +3503,7 @@ static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
>
>  	policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
>  	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
> -		if (wakeup)
> +		if (wakeup && !(sd->flags & SD_SHARE_CPUPOWER))
>  			new_cpu = find_leader_cpu(sds->group_leader,
>  							p, cpu, policy);
>  		/* for fork balancing and a little busy task */
> @@ -4410,8 +4410,9 @@ static unsigned long task_h_load(struct task_struct *p)
>  static inline void init_sd_lb_power_stats(struct lb_env *env,
>  		struct sd_lb_stats *sds)
>  {
> -	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
> -			env->idle == CPU_NOT_IDLE) {
> +	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE
> +			|| env->sd->flags & SD_SHARE_CPUPOWER
> +			|| env->idle == CPU_NOT_IDLE) {
>  		env->flags &= ~LBF_POWER_BAL;
>  		env->flags |= LBF_PERF_BAL;
>  		return;

Regards
Preeti U Murthy
Re: [patch v7 20/21] sched: don't do power balance on share cpu power domain
Hi Alex,

I am sorry, I overlooked the changes you have made to the power scheduling policies. Now you have just two: performance and powersave. Hence you can ignore my below comments. But if you use group->capacity instead of group->weight for the threshold, like you did for the balance policy in version 5 of this patchset, don't you think the below patch can be avoided? group->capacity being the threshold will automatically ensure that you don't pack onto domains that share cpu power.

Regards
Preeti U Murthy

On 04/08/2013 08:47 AM, Preeti U Murthy wrote:
> Hi Alex,
>
> On 04/04/2013 07:31 AM, Alex Shi wrote:
>> Packing tasks among such domain can't save power, just performance
>> losing. So no power balance on them.
>
> As far as my understanding goes, the powersave policy is the one that
> tries to pack tasks onto a SIBLING domain (a domain where
> SD_SHARE_CPUPOWER is set). The balance policy does not do that, meaning
> it does not pack on the domain that shares CPU power, but packs across
> all other domains. So the change you are making below results in
> nothing but the default behaviour of the balance policy.
>
> Correct me if I am wrong, but my point is: it looks to me that the
> powersave policy is introduced in this patchset, and with the below
> patch its characteristic behaviour of packing onto domains sharing cpu
> power is removed, thus making it default to the balance policy. Now
> there are two policies which behave the same way: balance and
> powersave.
>
>> Signed-off-by: Alex Shi alex@intel.com
>> ---
>>  kernel/sched/fair.c | 7 ++++---
>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 047a1b3..3a0284b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3503,7 +3503,7 @@ static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
>>
>>  	policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
>>  	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
>> -		if (wakeup)
>> +		if (wakeup && !(sd->flags & SD_SHARE_CPUPOWER))
>>  			new_cpu = find_leader_cpu(sds->group_leader,
>>  							p, cpu, policy);
>>  		/* for fork balancing and a little busy task */
>> @@ -4410,8 +4410,9 @@ static unsigned long task_h_load(struct task_struct *p)
>>  static inline void init_sd_lb_power_stats(struct lb_env *env,
>>  		struct sd_lb_stats *sds)
>>  {
>> -	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
>> -			env->idle == CPU_NOT_IDLE) {
>> +	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE
>> +			|| env->sd->flags & SD_SHARE_CPUPOWER
>> +			|| env->idle == CPU_NOT_IDLE) {
>>  		env->flags &= ~LBF_POWER_BAL;
>>  		env->flags |= LBF_PERF_BAL;
>>  		return;
>
> Regards
> Preeti U Murthy
Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()
Hi Joonsoo,

On 03/28/2013 01:28 PM, Joonsoo Kim wrote:
> Following-up upper se in sched_slice() should not be done, because
> sched_slice() is used for checking that resched is needed within *this*
> cfs_rq, and there is one problem related to this in the current
> implementation.
>
> The problem is that if we follow-up upper se in sched_slice(), it is
> possible that we get an ideal slice which is lower than
> sysctl_sched_min_granularity.
>
> For example, assume that we have 4 tgs which are attached to the root
> tg with the same share, and each one has 20 runnable tasks on cpu0,
> respectively. In this case, __sched_period() returns
> sysctl_sched_min_granularity * 20 and then goes into the loop. At the
> first iteration, we compute a portion of slice for this task on this
> cfs_rq, so we get a slice, sysctl_sched_min_granularity. Afterward, we
> enter the second iteration and get a slice which is a quarter of
> sysctl_sched_min_granularity, because there are 4 tgs with the same
> share in that cfs_rq.
>
> Ensuring a slice larger than min_granularity is important for
> performance, and there is no lower bound on this except the timer tick,
> so we should fix it not to consider the upper se when calculating
> sched_slice.

I am assuming sysctl_sched_latency = 20 ms and sysctl_sched_min_granularity = 4 ms. In that case, with your patch, the sum of sched_slice(runnable_task) on each tg is 40 ms = sched_min_granularity * 10, while the parent tg has a sched_slice of sysctl_sched_latency / 4 = (20 / 4) = 5 ms. Ideally the children's cpu share must add up to the parent's share.

Thank you

Regards
Preeti U Murthy
Re: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency
Hi Joonsoo,

On 03/28/2013 01:28 PM, Joonsoo Kim wrote:
> sched_slice() computes the ideal runtime slice. If there are many tasks
> in the cfs_rq, the period for this cfs_rq is extended to guarantee that
> each task has a time slice of at least sched_min_granularity. And then
> each task gets a portion of this period. If there is a task which has a
> much larger load weight than the others, its portion of the period can
> exceed sysctl_sched_latency by far.

Correct. But that does not matter: the length of the scheduling latency period is determined by the return value of __sched_period(), not the value of sysctl_sched_latency. You would not extend the period if you wanted all tasks to have a slice within sysctl_sched_latency, right? So since the length of the scheduling latency period is dynamic, depending on the number of processes running, sysctl_sched_latency, which is the default latency period length, is not messed with, but is only used as a base to determine the actual scheduling period.

> For example, you can simply imagine one task with nice -20 and 9 tasks
> with nice 0 on one cfs_rq. In this case, the load weight sum for this
> cfs_rq is 88761 + 9 * 1024 = 97977. So the portion of the slice for the
> task with nice -20 is sysctl_sched_min_granularity * 10 * (88761 /
> 97977), that is, approximately, sysctl_sched_min_granularity * 9. This
> effect can be much larger if there are more tasks with nice 0.

Yeah, so __sched_period() says that within 40 ms all tasks need to be scheduled at least once, and the highest priority task gets nearly 36 ms of it, while the rest is distributed among the others.

> So we should limit this possible weird situation.
>
> Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e232421..6ceffbc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  	}
>  	slice = calc_delta_mine(slice, se->load.weight, load);
>
> +	if (unlikely(slice > sysctl_sched_latency))
> +		slice = sysctl_sched_latency;
> +
>  	return slice;
>  }

Then in this case the highest priority thread would get 20 ms (sysctl_sched_latency), and the rest would get sysctl_sched_min_granularity * 10 * (1024 / 97977), which would be 0.4 ms. Then all tasks would get scheduled at least once within 20 ms + (0.4 * 9) ms = 23.7 ms, while your scheduling latency period was extended to 40 ms just so that each of these tasks doesn't have its sched_slice shrunk due to the large number of tasks.

Regards
Preeti U Murthy
Re: [patch v5 14/15] sched: power aware load balance
Hi Alex,

On 03/25/2013 10:22 AM, Alex Shi wrote:
>>> the value get from decay_load():
>>> sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum, ...)
>>> in decay_load it is possible to be set zero.
>>
>> Yes, you are right, it is possible to be set to 0, but after a very
>> long time, to be more precise, nearly 2 seconds. If you look at
>> decay_load(), only if the period between the last update and now has
>> crossed (32 * 63) will the runnable_avg_sum become 0; else it will
>> simply decay. This means that for nearly 2 seconds, consolidation of
>> loads may not be possible even after the runqueues have finished
>> executing the tasks running on them.
>
> Look into decay_load(): since LOAD_AVG_MAX is about 47742, after
> 16 * 32 ms the maximum avg sum will decay to zero. 2^16 = 65536.
> Yes, compared to the accumulate time of 345 ms, the decay is not
> symmetric and not precise; it seems it has space to be tuned well. But
> it is acceptable now.

The exact experiment that I performed was running ebizzy with just two threads. My setup was 2 sockets, 2 cores each, 4 threads each core, so a 16-logical-cpu machine. When I begin running ebizzy with the balance policy, the 2 threads of ebizzy are found one on each socket, while I would expect them to be on the same socket. All other cpus, except the ones running the ebizzy threads, are idle and not running anything on either socket. I am not running any other processes.

> did you try the simplest benchmark: while true; do :; done

Yeah, I tried out this while true; do :; done benchmark on a vm which ran on a 2-socket, 2-cores-per-socket, 2-threads-per-core emulation. I ran two instances of this loop with the balance policy on, and it was found that there was one instance running on each socket, rather than both instances getting consolidated on one socket. But when I apply the change where we do not consider rq->util if it has no nr_running on the rq, the two instances of the above benchmark get consolidated onto one socket.
I am writing the v6 version, which includes rt_util etc.; you may test it after I send it out. :) Sure, will do so. Regards Preeti U Murthy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v5 14/15] sched: power aware load balance
On 03/29/2013 07:09 PM, Alex Shi wrote: On 03/29/2013 08:42 PM, Preeti U Murthy wrote: did you try the simplest benchmark: while true; do :; done? Yeah, I tried out this while true; do :; done benchmark on a vm which ran Thanks a lot for trying! What do you mean by 'vm'? Virtual machine? Yes. on a 2 socket, 2 cores per socket and 2 threads per core emulation. I ran two instances of this loop with the balance policy on, and it was found that there was one instance running on each socket, rather than both instances getting consolidated onto one socket. But when I apply the change where we do not consider rq->util if it has no nr_running on the rq, the two instances of the above benchmark get consolidated onto one socket. I don't know much about virtual machines; I guess unstable VCPU-to-CPU pinning causes rq->util to stay large? Did you try to pin the VCPUs to physical CPUs? No, I hadn't done any VCPU-to-CPU pinning, but then why did the situation drastically alter to consolidate the load when the rq->util of the runqueues with 0 tasks on them was not considered as part of sgs->group_utils? I still give the rq->util weight even when nr_running is 0, because some transitory tasks may be activated on the cpu but just missed at the balancing point. I am just wondering whether forgetting rq->util when nr_running == 0 is the real root cause, if your finding is just on a VM and without fixed VCPU-to-CPU pinning. I find the same situation on a physical machine too, on a 2 socket, 4 core machine as well. In fact, using trace_printks in the load balancing part, I could find that the reason the load was not getting consolidated onto a socket was that the rq->util of a runqueue with no processes on it had not decayed to 0, which is why it would consider the socket as overloaded and would rule out power aware balancing. All this was on a physical machine.
Regards Preeti U Murthy
Re: [patch v5 14/15] sched: power aware load balance
Hi, On 03/30/2013 07:34 PM, Alex Shi wrote: On 03/30/2013 07:25 PM, Preeti U Murthy wrote: I still give the rq->util weight even when nr_running is 0, because some transitory tasks may be activated on the cpu but just missed at the balancing point. I am just wondering whether forgetting rq->util when nr_running == 0 is the real root cause, if your finding is just on a VM and without fixed VCPU-to-CPU pinning. I find the same situation on a physical machine too, on a 2 socket, 4 core machine as well. In fact, using trace_printks in the load balancing part, I could find that the reason the load was not getting consolidated onto a socket was that the rq->util of a runqueue with no processes on it had not decayed to 0, which is why it would consider the socket as overloaded and would rule out power aware balancing. All this was on a physical machine. Considering this situation, we may stop accounting the rq->util when nr_running is zero. Tasks will be a bit more compact; but anyway, that's the powersaving policy. True, the tasks will be packed a bit more compactly, but we can expect the behaviour of your patchset, *defaulting to performance policy when overloaded*, to come to the rescue of such a situation. Regards Preeti U Murthy
Re: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency
Hi Joonsoo, On 04/01/2013 10:39 AM, Joonsoo Kim wrote: Hello Preeti. So we should limit this possible weird situation. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e232421..6ceffbc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se) } slice = calc_delta_mine(slice, se->load.weight, load); + if (unlikely(slice > sysctl_sched_latency)) + slice = sysctl_sched_latency; Then in this case the highest priority thread would get 20ms (sysctl_sched_latency), and the rest would get sysctl_sched_min_granularity * 10 * (1024/97977), which would be 0.4ms. Then all tasks would get scheduled at least once within 20ms + (0.4*9) ms = 23.7ms, while your scheduling latency period was extended to 40ms, just so that each of these tasks doesn't have its sched_slice shrunk due to the large number of tasks. I don't know if I understand your question correctly; I will do my best to answer your comment. :) With this patch, I just limit the maximum slice at one time. Scheduling is controlled through the vruntime. So, in this case, the task with nice -20 will be scheduled twice: 20 + (0.4 * 9) + 20 = 43.9 ms, and after 43.9 ms this process is repeated. So I can tell you that the scheduling period is preserved as before. If we give a long period to a task in one go, it can cause a latency problem. So IMHO, limiting this is meaningful. Thank you very much for the explanation. Just one question: what is the reason behind choosing sysctl_sched_latency as the upper bound here? Regards Preeti U Murthy
Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()
Hi Joonsoo, On 04/01/2013 09:38 AM, Joonsoo Kim wrote: Hello, Preeti. Ideally the children's cpu share must add up to the parent's share. I don't think so. We should schedule out the parent tg if 5ms is over. As we do so, we can fairly distribute the time slice to every tg within a short term. If we add the children's cpu share up to the parent's, the parent tg may have a large time slice, so it cannot be preempted easily. There may be a latency problem if there are many tgs. In the case where the #running < sched_nr_latency, the children's sched_slices add up to the parent's. Consider an rq with two tgs, each with 3 tasks. Each of these tasks has a sched_slice of [(sysctl_sched_latency / 3) / 2] as of the present implementation. The sum of the above sched_slices of all tasks of a tg will lead to the sched_slice of its parent: sysctl_sched_latency / 2. This breaks when the nr_running on each tg > sched_nr_latency. However I don't know if this is a good thing or a bad thing. Regards Preeti U Murthy
Re: [patch v6 12/21] sched: add power aware scheduling in fork/exec/wake
Hi Alex, The below patch has an issue in select_task_rq_fair(), which I have mentioned in the fix for it below. On 03/30/2013 08:04 PM, Alex Shi wrote: This patch adds power aware scheduling in fork/exec/wake. It tries to select a cpu from the busiest group that still has utilization. That will save power, since it leaves more groups idle in the system. The trade off is adding a power aware statistics collection in group seeking. But since the collection just happens under the power scheduling eligible condition, the worst case of hackbench testing just drops about 2% with the powersaving policy. No clear change for the performance policy. The main function in this patch is get_cpu_for_power_policy(), which will try to get the idlest cpu from the busiest group that still has utilization, if the system is using a power aware policy and has such a group. Signed-off-by: Alex Shi alex@intel.com --- kernel/sched/fair.c | 109 +--- 1 file changed, 103 insertions(+), 6 deletions(-) +/* + * select_task_rq_fair: balance the current task (running on cpu) in domains * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and * SD_BALANCE_EXEC. * - * Balance, ie. select the least loaded group. - * * Returns the target CPU number, or the same CPU if no balancing is needed. * * preempt must be disabled.
*/ static int -select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) +select_task_rq_fair(struct task_struct *p, int sd_flag, int flags) { struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); int prev_cpu = task_cpu(p); int new_cpu = cpu; int want_affine = 0; - int sync = wake_flags & WF_SYNC; + int sync = flags & WF_SYNC; + struct sd_lb_stats sds; if (p->nr_cpus_allowed == 1) return prev_cpu; @@ -3412,11 +3500,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) break; } - if (tmp->flags & sd_flag) + if (tmp->flags & sd_flag) { sd = tmp; + + new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds); + if (new_cpu != -1) + goto unlock; + } } if (affine_sd) { + new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds); + if (new_cpu != -1) + goto unlock; + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) prev_cpu = cpu; sched: Fix power aware scheduling in fork/wake/exec From: Preeti U Murthy pre...@linux.vnet.ibm.com Problem: select_task_rq_fair() returns a target CPU / waking CPU if no balancing is required. However, with the current power aware scheduling in this path, an invalid CPU might be returned. If get_cpu_for_power_policy() fails to find a new_cpu for the forked task, then there is a possibility that the new_cpu could continue to be -1 till the end of select_task_rq_fair(), if the search for a new cpu ahead in this function also fails. Since this scenario is unexpected by the callers of select_task_rq_fair(), this needs to be fixed. Fix: Do not intermix the variables meant to reflect the target CPU of the power save and performance policies. If the target CPU of powersave is successfully found, return it. Else allow the performance policy to take a call on the target CPU.
The above scenario was caught when a kernel crash resulted in a bad data access interrupt, during a kernbench run on a 2 socket, 16 core machine, with each core having SMT-4. --- kernel/sched/fair.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 56b96a7..54d4400 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3532,6 +3532,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags) int cpu = smp_processor_id(); int prev_cpu = task_cpu(p); int new_cpu = cpu; + int new_cpu_for_power = -1; int want_affine = 0; int sync = flags & WF_SYNC; struct sd_lb_stats sds; @@ -3563,16 +3564,16 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags) if (tmp->flags & sd_flag) { sd = tmp; - new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds, + new_cpu_for_power = get_cpu_for_power_policy(sd, cpu, p, &sds, flags & SD_BALANCE_FORK); - if (new_cpu != -1) + if (new_cpu_for_power != -1) goto unlock; } } if (affine_sd) { - new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0
Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()
Hi Joonsoo, On 04/02/2013 07:55 AM, Joonsoo Kim wrote: Hello, Preeti. On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote: Hi Joonsoo, On 04/01/2013 09:38 AM, Joonsoo Kim wrote: Hello, Preeti. Ideally the children's cpu share must add up to the parent's share. I don't think so. We should schedule out the parent tg if 5ms is over. As we do so, we can fairly distribute the time slice to every tg within a short term. If we add the children's cpu share up to the parent's, the parent tg may have a large time slice, so it cannot be preempted easily. There may be a latency problem if there are many tgs. In the case where the #running < sched_nr_latency, the children's sched_slices add up to the parent's. Consider an rq with two tgs, each with 3 tasks. Each of these tasks has a sched_slice of [(sysctl_sched_latency / 3) / 2] as of the present implementation. The sum of the above sched_slices of all tasks of a tg will lead to the sched_slice of its parent: sysctl_sched_latency / 2. This breaks when the nr_running on each tg > sched_nr_latency. However I don't know if this is a good thing or a bad thing. Ah.. Now I get your point. Yes, you are right and it may be a good thing. With that property, all tasks in the system can be scheduled at least once in sysctl_sched_latency. sysctl_sched_latency is a system-wide configuration, so my patch may be wrong. With my patch, all tasks in the system cannot be scheduled at least once in sysctl_sched_latency. Instead, it schedules all tasks in the cfs_rq at least once in sysctl_sched_latency if there are no other tgs. Exactly. You have got all the above points right. I think that it is a real problem that sysctl_sched_min_granularity is not guaranteed for each task. Instead of this patch, how about considering a lower bound? if (slice < sysctl_sched_min_granularity) slice = sysctl_sched_min_granularity; Consider the below scenario. A runqueue has two task groups, each with 10 tasks.
With the current implementation, each of these tasks gets a sched_slice of 2ms. Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks (all tasks of both the task groups) will get the chance to run. But what is the scheduling period in this scenario? Is it 40ms (the extended sysctl_sched_latency), which is the scheduling period for each of the runqueues with 10 tasks in it? Or is it 80ms, which is the total of the scheduling periods of each of the runqueues with 10 tasks? Either way all tasks seem to get scheduled at least once within the scheduling period, so we appear to be safe, although the sched_slice < sched_min_granularity. With your above lower bound of sysctl_sched_min_granularity, each task of each tg gets 4ms as its sched_slice. So in a matter of (10*4) + (10*4) = 80ms, all tasks get to run. With the above question being put forth here as well, we don't appear to be safe if the scheduling period is considered to be 40ms; otherwise it appears fine to me, because it ensures the sched_slice is at least sched_min_granularity no matter what. Thank you Regards Preeti U Murthy
Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()
Hi Joonsoo, I think that it is a real problem that sysctl_sched_min_granularity is not guaranteed for each task. Instead of this patch, how about considering a lower bound? if (slice < sysctl_sched_min_granularity) slice = sysctl_sched_min_granularity; Consider the below scenario. A runqueue has two task groups, each with 10 tasks. With the current implementation, each of these tasks gets a sched_slice of 2ms. Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks (all tasks of both the task groups) will get the chance to run. But what is the scheduling period in this scenario? Is it 40ms (the extended sysctl_sched_latency), which is the scheduling period for each of the runqueues with 10 tasks in it? Or is it 80ms, which is the total of the scheduling periods of each of the runqueues with 10 tasks? Either way all tasks seem to get scheduled at least once within the scheduling period, so we appear to be safe, although the sched_slice < sched_min_granularity. With your above lower bound of sysctl_sched_min_granularity, each task of each tg gets 4ms as its sched_slice. So in a matter of (10*4) + (10*4) = 80ms, all tasks get to run. With the above question being put forth here as well, we don't appear to be safe if the scheduling period is considered to be 40ms; otherwise it appears fine to me, because it ensures the sched_slice is at least sched_min_granularity no matter what. So, you mean that we should guarantee to schedule each task at least once in sysctl_sched_latency? I would rather say all tasks get to run at least once in a sched_period, which could be just sysctl_sched_latency or more, depending on the number of tasks, in the current implementation. But this is not guaranteed in the current code; this is why I made this patch. Please refer to the patch description. Ok, take the example of a runqueue with 2 task groups, each with 10 tasks, same as your previous example. Can you explain how your patch ensures that all 20 tasks get to run at least once in a sched_period?
Regards Preeti U Murthy
Re: [patch v5 14/15] sched: power aware load balance
Hi Alex, Please note one point below. On 02/18/2013 10:37 AM, Alex Shi wrote: This patch enables the power aware consideration in load balance. As mentioned in the power aware scheduler proposal, power aware scheduling has 2 assumptions: 1, race to idle is helpful for power saving 2, less active sched_groups will reduce power consumption The first assumption makes the performance policy take over scheduling when any scheduler group is busy. The second assumption makes power aware scheduling try to pack disperse tasks into fewer groups. The enabling logic in summary: 1, Collect power aware scheduler statistics during performance load balance statistics collection. 2, If the balance cpu is eligible for power load balance, just do it and forget performance load balance. If the domain is suitable for power balance, but the cpu is inappropriate (idle or full), stop both power/performance balance in this domain. If using performance balance or any group is busy, do performance balance. The above logic is mainly implemented in update_sd_lb_power_stats(). It decides if a domain is suitable for power aware scheduling. If so, it will fill the dst group and source group accordingly. This patch reuses some of Suresh's power saving load balance code. A test can show the effect of the different policies: for ((i = 0; i < I; i++)) ; do while true; do :; done & done On my SNB laptop with 4 cores * HT, the data is in Watts:

         powersaving  balance  performance
i = 2        40          54        54
i = 4        57          64*       68
i = 8        68          68        68

Note: when i = 4 with the balance policy, the power may vary in 57~68 Watts, since the HT capacity and core capacity are both 1. On an SNB EP machine with 2 sockets * 8 cores * HT:

         powersaving  balance  performance
i = 4       190         201       238
i = 8       205         241       268
i = 16      271         348       376

If the system has few continuously running tasks, using the power policy can get a performance/power gain.
Like the sysbench fileio randrw test with 16 threads on the SNB EP box. Signed-off-by: Alex Shi alex@intel.com --- kernel/sched/fair.c | 129 ++-- 1 file changed, 126 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ffdf35d..3b1e9a6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4650,6 +4753,12 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->group_load += load; sgs->sum_nr_running += nr_running; sgs->sum_weighted_load += weighted_cpuload(i); + + /* accumulate the maximum potential util */ + if (!nr_running) + nr_running = 1; + sgs->group_utils += rq->util * nr_running; You may have observed the following, but I thought it would be best to bring this to notice. The above will lead to situations where sched groups never fill to their full capacity. This is explained with an example below. Say the topology is two cores with hyperthreading, two logical threads each. If we choose the powersaving policy and run two workloads at full utilization, the load will get distributed one on each core; they will not get consolidated on a single core. The reason being, the eligibility condition sgs->group_utils + FULL_UTIL <= threshold_util in update_sd_lb_power_stats will fail. The situation goes thus:

   w1           w2
 t1   t2      t3   t4
------------ ------------
   core1        core2

Above, t = thread (logical cpu), w = workload. Neither core will be able to pull the task from the other to consolidate the load, because the rq->util of t2 and t4, on which no process is running, continues to show some number even though it degrades with time, and sgs->group_utils accounts for them. Therefore, for core1 and core2, sgs->group_utils will be slightly above 100 and the above condition will fail, thus failing them as candidates for group_leader, since threshold_util will be 200. This phenomenon is seen for the balance policy and wider topologies as well. I think we would be better off without accounting the rq->util of the cpus which do not have any processes running on them in sgs->group_utils. What do you think?
Regards Preeti U Murthy
Re: [patch v5 14/15] sched: power aware load balance
Hi Alex, On 03/21/2013 01:13 PM, Alex Shi wrote: On 03/20/2013 12:57 PM, Preeti U Murthy wrote: Neither core will be able to pull the task from the other to consolidate the load, because the rq->util of t2 and t4, on which no process is running, continues to show some number even though it degrades with time, and sgs->group_utils accounts for them. Therefore, for core1 and core2, sgs->group_utils will be slightly above 100 and the above condition will fail, thus failing them as candidates for group_leader, since threshold_util will be 200. Thanks for the note, Preeti! Did you find some real issue on some platform? In theory, a totally idle cpu has a zero rq->util at least after 3xxms, and in fact I find the code works fine on my machines. Yes, I did find this behaviour on a 2 socket, 8 core machine very consistently. rq->util cannot go to 0 after it has begun accumulating load, right? Say a load was running on a runqueue which had its rq->util at 100%. After the load finishes, the runqueue goes idle. On every scheduler tick, its utilisation decays, but it can never become 0. rq->util = rq->avg.runnable_avg_sum / rq->avg.runnable_avg_period This ratio will come close to 0, but will never become 0 once it has picked up a value. So if a sched_group consists of two runqueues, one having utilisation 100 and running 1 load, and the other having utilisation 0.001 but running no load, then in update_sd_lb_power_stats() the below condition sgs->group_utils + FULL_UTIL <= threshold_util will turn out to be (100.001 + 100 > 200), and hence the group will fail to act as the group leader and take on more tasks. Regards Preeti U Murthy
Re: [patch v5 14/15] sched: power aware load balance
On 03/21/2013 02:57 PM, Alex Shi wrote: On 03/21/2013 04:41 PM, Preeti U Murthy wrote: Yes, I did find this behaviour on a 2 socket, 8 core machine very consistently. rq->util cannot go to 0 after it has begun accumulating load, right? Say a load was running on a runqueue which had its rq->util at 100%. After the load finishes, the runqueue goes idle. On every scheduler tick, its utilisation decays, but it can never become 0. rq->util = rq->avg.runnable_avg_sum / rq->avg.runnable_avg_period Did you close all background system services? In theory the rq->avg.runnable_avg_sum should be zero if there has been no task for a bit long; otherwise there are some bugs in the kernel. Could you explain why rq->avg.runnable_avg_sum should be zero? What if some kernel thread ran on this runqueue and is now finished? Its utilisation would be, say, x. How would that ever drop to 0, even if nothing ran on it later? Regards Preeti U Murthy Could you check the value under /proc/sched_debug?
Re: [patch v5 14/15] sched: power aware load balance
Hi, On 03/22/2013 07:00 AM, Alex Shi wrote: On 03/21/2013 06:27 PM, Preeti U Murthy wrote: did you close all background system services? In theory the rq->avg.runnable_avg_sum should be zero if there has been no task for a bit long, otherwise there are some bugs in the kernel. Could you explain why rq->avg.runnable_avg_sum should be zero? What if some kernel thread ran on this runqueue and is now finished? Its utilisation would be, say, x. How would that ever drop to 0, even if nothing ran on it later? The value is obtained from decay_load(): sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum, ...); in decay_load() it is possible for it to be set to zero. Yes, you are right, it is possible for it to be set to 0, but only after a very long time, to be more precise, nearly 2 seconds. If you look at decay_load(), only if the period between the last update and now has crossed (32*63) ms will the runnable_avg_sum become 0; else it will simply decay. This means that for nearly 2 seconds, consolidation of loads may not be possible even after the runqueues have finished executing the tasks running on them. The exact experiment that I performed was running ebizzy, with just two threads. My setup was 2 sockets, 2 cores each, 4 threads per core, so a 16 logical cpu machine. When I begin running ebizzy with the balance policy, the 2 threads of ebizzy are found one on each socket, while I would expect them to be on the same socket. All other cpus, except the ones running the ebizzy threads, are idle and not running anything on either socket. I am not running any other processes. You could run a similar experiment and let me know if you see otherwise. I am at a loss to understand why else such a spreading of load would occur, if not for the rq->util not becoming 0 quickly when it is not running anything. I have used trace_printks to keep track of the runqueue util of those runqueues not running anything after maybe some initial load, and it does not become 0 till the end of the run.
Regards Preeti U Murthy

And /proc/sched_debug also confirms this:

  .tg_runnable_contrib      : 0
  .tg->runnable_avg         : 50
  .avg->runnable_avg_sum    : 0
  .avg->runnable_avg_period : 47507
Re: [RFC PATCH v3 0/6] sched: packing small tasks
Hi Vincent, The power aware scheduler patchset that has been released by Alex recently [https://lkml.org/lkml/2013/2/18/4] also has the idea of packing small tasks. In that patchset, the waking of small tasks is done on a leader cpu, which of course varies dynamically. https://lkml.org/lkml/2013/2/18/6 I will refer henceforth to the above patchset as Patch A and your current patchset in this mail as Patch B. The difference in Patch A is that it is packing small tasks onto a leader cpu, i.e. a cpu which has just enough capacity to take onto itself a light weight task, essentially belonging to a group which has a favourable grp_power, with the leader cpu having a suitable rq->util. In Patch B, you are choosing a buddy cpu, belonging to a group with the least cpu power, and while deciding to pack tasks onto the buddy cpu, this again boils down to verifying if the grp_power is favourable and the buddy cpu has a suitable rq->util. In my opinion, this goes to say that both patches are using similar metrics to decide if a cpu is capable of handling the light weight task. The difference lies in how far the search for the appropriate cpu can proceed. While Patch A continues to search at all levels of sched domains till the last to find the appropriate cpu, Patch B queries only the buddy CPU. The point I am trying to make is that, considering the above similarities and differences, is it possible for us to see if the ideas that both these patches bring in can be merged into one, since both of them have the common goal of packing small tasks? Thanks Regards Preeti U Murthy On 03/22/2013 05:55 PM, Vincent Guittot wrote: Hi, This patchset takes advantage of the new per-task load tracking that is available in the kernel for packing small tasks into as few CPUs/Clusters/Cores as possible. The main goal of packing small tasks is to reduce the power consumption in low load use cases by minimizing the number of power domains that are enabled.
The packing is done in 2 steps: The 1st step looks for the best place to pack tasks in a system according to its topology, and it defines a pack buddy CPU for each CPU if there is one available. We define the best CPU during the build of the sched_domain instead of evaluating it at runtime because it can be difficult to define a stable buddy CPU in a low CPU load situation. The policy for defining a buddy CPU is that we pack at all levels inside a node where a group of CPUs can be power gated independently from the others. For describing this capability, a new flag has been introduced, SD_SHARE_POWERDOMAIN, that is used to indicate whether the groups of CPUs of a scheduling domain are sharing their power state. By default, this flag has been set in all sched_domains in order to keep the current behavior of the scheduler unchanged, and only the ARM platform clears the SD_SHARE_POWERDOMAIN flag for the MC and CPU levels. In a 2nd step, the scheduler checks the load average of a task which wakes up as well as the load average of the buddy CPU, and it can decide to migrate the light tasks onto a not busy buddy. This check is done during the wake up because small tasks tend to wake up between periodic load balances and asynchronously to each other, which prevents the default mechanism from catching and migrating them efficiently. A light task is defined by a runnable_avg_sum that is less than 20% of the runnable_avg_period. In fact, the former condition encloses 2 ones: the average CPU load of the task must be less than 20%, and the task must have been runnable less than 10ms when it woke up last time, in order to be electable for the packing migration. So, a task that runs 1 ms every 5ms will be considered a small task, but a task that runs 50 ms with a period of 500ms will not. Then, the busyness of the buddy CPU depends on the load average of the rq and the number of running tasks.
A CPU with a load average greater than 50% will be considered a busy CPU whatever the number of running tasks is, and this threshold will be reduced by the number of running tasks in order to not increase too much the wake up latency of a task. When the buddy CPU is busy, the scheduler falls back to the default CFS policy.

Changes since V2:
- Migrate only a task that wakes up
- Change the light task threshold to 20%
- Change the loaded CPU threshold to not pull tasks if the current number of running tasks is null but the load average is already greater than 50%
- Fix the algorithm for selecting the buddy CPU

Changes since V1:
Patch 2/6
- Change the flag name which was not clear. The new name is SD_SHARE_POWERDOMAIN.
- Create an architecture dependent function to tune the sched_domain flags
Patch 3/6
- Fix issues in the algorithm that looks for the best buddy CPU
- Use pr_debug instead of pr_info
- Fix for uniprocessor
Patch 4/6
- Remove the use of usage_avg_sum which has not been merged
Patch 5/6
- Change the way the coherency
Re: [RFC PATCH v3 3/6] sched: pack small tasks
Hi Peter, On 03/26/2013 06:07 PM, Peter Zijlstra wrote: On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote: +static bool is_light_task(struct task_struct *p) +{ + /* A light task runs less than 20% in average */ + return ((p->se.avg.runnable_avg_sum * 5) < (p->se.avg.runnable_avg_period)); +} OK, so we have a 'problem' here: we initialize runnable_avg_* to 0, but we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into this as well. PJT, do you have any sane solution for this? I forgot what the result of the last discussion was -- was there any? The conclusion after the last discussion between PJT and Alex was that the load contribution of a fresh task be set to full during __sched_fork(). task->se.avg.load_avg_contrib = task->se.load.weight during __sched_fork() is reflected in the latest power aware scheduler patchset by Alex. Thanks Regards Preeti U Murthy
Re: [RFC PATCH v3 3/6] sched: pack small tasks
Hi,

On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with a lot of
+	 * running tasks.
+	 */
+	return (rq->avg.runnable_avg_sum >
+			(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}

Why does the comment talk about load but we don't see it in the equation. Also, why does nr_running matter at all? I thought we'd simply bother with utilization, if fully utilized we're done etc..

Peter, let's say the run-queue has 50% utilization and is running 2 tasks, and we wish to find out if it is busy. We would compare this metric with the cpu power, which let's say is 100: rq->util * 100 < cpu_of(rq)->power. In the above scenario would we declare the cpu _not_ busy? Or would we do the following: (rq->util * 100) * #nr_running > cpu_of(rq)->power, and conclude that it is just enough _busy_ to not take on more processes?

@Vincent: Yes, the comment above needs to be fixed. A busy buddy is a CPU with *high rq utilization*, as far as the equation goes.

Regards
Preeti U Murthy
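The busy-buddy test being discussed can also be sketched in userspace. The threshold is 1/(nr_running + 2) of the period, which is the "50% threshold reduced by the number of running tasks" rule from the cover letter; names here are illustrative mirrors of the kernel fields, not the kernel's API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the rq fields is_buddy_busy() reads. */
struct rq_sketch {
	uint32_t runnable_avg_sum;
	uint32_t runnable_avg_period;
	unsigned int nr_running;
};

/* Busy when rq utilization exceeds 1/(nr_running + 2) of the period:
 * 50% with an empty rq, ~33% with one task queued, and so on. The more
 * tasks are already queued, the lower the bar for refusing new ones. */
bool is_buddy_busy_sketch(const struct rq_sketch *rq)
{
	return rq->runnable_avg_sum >
	       rq->runnable_avg_period / (rq->nr_running + 2);
}
```

This makes Peter's question concrete: nr_running enters only through the denominator of the threshold, not through the load itself.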
Re: [RFC v2 PATCH 2.1] sched: Use Per-Entity-Load-Tracking metric for load balancing
sched: Use Per-Entity-Load-Tracking metric for load balancing

From: Preeti U Murthy pre...@linux.vnet.ibm.com

Currently the load balancer weighs a task based upon its priority, and this weight consequently gets added up to the weight of the run queue that it is on. It is this weight of the runqueue that sums up to a sched group's load, which is used to decide the busiest or the idlest group and the runqueue thereof.

The Per-Entity-Load-Tracking metric however measures how long a task has been runnable over the duration of its lifetime. This gives us a hint of the amount of CPU time that the task can demand. This metric takes care of the task priority as well. Therefore, apart from the priority of a task we also have an idea of the live behavior of the task. This seems to be a more realistic metric to use to compute task weight, which adds up to the run queue weight and the weight of the sched group. Consequently they can be used for load balancing.

The semantics of load balancing is left untouched. The two functions load_balance() and select_task_rq_fair() perform the task of load balancing. These two paths have been browsed through in this patch to make the necessary changes.

weighted_cpuload() and task_h_load() provide the run queue weight and the weight of the task respectively. They have been modified to provide the Per-Entity-Load-Tracking metric as relevant for each. The rest of the modifications had to be made to suit these two changes.
The Completely Fair Scheduler class is the only sched_class which contributes to the run queue load. Therefore rq->load.weight == cfs_rq->load.weight when the cfs_rq is the root cfs_rq (rq->cfs) of the hierarchy. When replacing this with the Per-Entity-Load-Tracking metric, cfs_rq->runnable_load_avg needs to be used, as this is the right reflection of the run queue load when the cfs_rq is the root cfs_rq (rq->cfs) of the hierarchy. This metric reflects the percentage uptime of the tasks that are queued on it and hence that contribute to the load. Thus cfs_rq->runnable_load_avg replaces the metric earlier used in weighted_cpuload().

The task load is aptly captured by se.avg.load_avg_contrib, which captures the runnable time vs the alive time of the task against its priority. This metric replaces the earlier metric used in task_h_load().

The consequent changes appear as data type changes for the helper variables; they abound in number, because cfs_rq->runnable_load_avg needs to be big enough to capture the tasks' load often and accurately.

The following patch does not consider CONFIG_FAIR_GROUP_SCHED and CONFIG_SCHED_NUMA. This is done so as to evaluate this approach starting from the simplest scenario. Earlier discussions can be found in the link below.

Link: https://lkml.org/lkml/2012/10/25/162
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
I apologise about having overlooked this one change in the patchset. This needs to be applied on top of patch 2 of this patchset. The experiment results that have been posted in reply to this thread were obtained after having applied this patch.
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8f3a29..19094eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4362,7 +4362,7 @@ struct sd_lb_stats {
  * sg_lb_stats - stats of a sched_group required for load_balancing
  */
 struct sg_lb_stats {
-	unsigned long avg_load; /* Avg load across the CPUs of the group */
+	u64 avg_load; /* Avg load across the CPUs of the group */
 	u64 group_load; /* Total load over the CPUs of the group */
 	unsigned long sum_nr_running; /* Nr tasks running in the group */
 	u64 sum_weighted_load; /* Weighted load of group's tasks */
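The replacement described in the changelog above boils down to scaling the task's priority-derived weight by the fraction of its lifetime it was runnable. A rough userspace model of what se.avg.load_avg_contrib represents (a sketch of the idea only; the real kernel metric is a decayed geometric series, and these names are not the kernel's API):

```c
#include <stdint.h>

/* Approximate load_avg_contrib: the task's priority-derived weight
 * scaled by its runnable fraction. A period of 0 models the fresh-task
 * case discussed earlier in the thread: assume fully loaded. */
uint64_t load_contrib_sketch(uint64_t weight, uint32_t runnable_sum,
			     uint32_t period)
{
	if (period == 0)
		return weight;	/* fresh task: assume fully loaded */
	return weight * runnable_sum / period;
}
```

Summing this quantity over the tasks queued on a runqueue gives the cfs_rq->runnable_load_avg flavor of load that the patch feeds into weighted_cpuload().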
Re: [RFC v2 PATCH 0/2] sched: Integrating Per-entity-load-tracking with the core scheduler
 * of time. This is done to create both long running and short running tasks
 * on the cpu.
 *
 * Multiple threads are created of each instance. The threads request for a
 * memory chunk, write into it and then free it. This is done throughout the
 * period of the run.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; version 2 of the License.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
 * USA
 */

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <pthread.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <malloc.h>

/* Variable entities */
static unsigned int seconds;
static unsigned int threads;
static unsigned int mem_chunk_size;
static unsigned int sleep_at;
static unsigned int sleep_interval;

/* Fixed entities */
typedef size_t mem_slot_t;	/* 8 bytes */
static unsigned int slot_size = sizeof(mem_slot_t);

/* Other parameters */
static volatile int start;
static time_t start_time;
static unsigned int records_read;
pthread_mutex_t records_count_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned int write_to_mem(void)
{
	int i, j;
	mem_slot_t *scratch_pad, *temp;
	mem_slot_t *end;

	mem_chunk_size = slot_size * 256;
	sleep_at = 0;		/* sleep for every 2800 records */
	sleep_interval = 9000;	/* sleep for 9 ms */

	for (i = 0; start == 1; i++) {
		/* ask for a memory chunk */
		scratch_pad = (mem_slot_t *)malloc(mem_chunk_size);
		if (scratch_pad == NULL) {
			fprintf(stderr, "Could not allocate memory\n");
			exit(1);
		}
		end = scratch_pad + (mem_chunk_size / slot_size);
		/* write into this chunk */
		for (temp = scratch_pad, j = 0; temp < end; temp++, j++)
			*temp = (mem_slot_t)j;
		/* Free this chunk */
		free(scratch_pad);
		/* Decide the duty cycle; currently 10 ms */
		if (sleep_at && !(i % sleep_at))
			usleep(sleep_interval);
	}
	return (i);
}

static void *thread_run(void *arg)
{
	unsigned int records_local;

	/* Wait for the start signal */
	while (start == 0)
		;
	records_local = write_to_mem();
	pthread_mutex_lock(&records_count_lock);
	records_read += records_local;
	pthread_mutex_unlock(&records_count_lock);
	return NULL;
}

static void start_threads(void)
{
	double diff_time;
	unsigned int i;
	int err;

	threads = 8;
	seconds = 10;
	pthread_t thread_array[threads];

	for (i = 0; i < threads; i++) {
		err = pthread_create(&thread_array[i], NULL, thread_run, NULL);
		if (err) {
			fprintf(stderr, "Error creating thread %d\n", i);
			exit(1);
		}
	}
	start_time = time(NULL);
	start = 1;
	sleep(seconds);
	start = 0;
	diff_time = difftime(time(NULL), start_time);
	for (i = 0; i < threads; i++) {
		err = pthread_join(thread_array[i], NULL);
		if (err) {
			fprintf(stderr, "Error joining thread %d\n", i);
			exit(1);
		}
	}
	printf("%u records/s\n",
	       (unsigned int)(((double)records_read) / diff_time));
}

int main(void)
{
	start_threads();
	return 0;
}

END WORKLOAD

Regards
Preeti U Murthy
Re: [patch v5 02/15] sched: set initial load avg of new forked task
Hi Alex,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..cae5134 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 	 * migration we use a negative decay count to track the remote decays
 	 * accumulated while sleeping.
+	 *
+	 * When enqueueing a new forked task, the se->avg.decay_count == 0, so
+	 * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
+	 * value: se->load.weight.

I disagree with the comment. update_entity_load_avg() gets called for all forked tasks via enqueue_task_fair->update_entity_load_avg() during the second iteration. But __update_entity_load_avg() in update_entity_load_avg()...

When it goes through enqueue_task_fair->update_entity_load_avg() during the second iteration, the se is changed. That is a different se.

Correct, Alex, sorry I overlooked this.

Regards
Preeti U Murthy
Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
Hi Alex,

Thank you very much for running the below benchmark on blocked_load+runnable_load :) Just a few queries.

How did you do the wake up balancing? Did you iterate over the L3 package looking for an idle cpu? Or did you just query the L2 package for an idle cpu? I think when you are using blocked_load+runnable_load it would be better if we just query the L2 package, as Vincent had pointed out, because the fundamental idea behind using blocked_load+runnable_load is to keep a steady state across cpus, unless we could reap the advantage of moving the blocked load to a sibling core when it wakes up.

And the drop of performance is relative to what?
1. Your v3 patchset with runnable_load_avg in weighted_cpuload().
2. Your v3 patchset with runnable_load_avg+blocked_load_avg in weighted_cpuload().
Are the above two what you are comparing? And in the above two versions have you included your [PATCH] sched: use instant load weight in burst regular load balance?

On 01/20/2013 09:22 PM, Alex Shi wrote:
The blocked load of a cluster will be high if the blocked tasks have run recently. The contribution of a blocked task will be divided by 2 each 32ms, so it means that a high blocked load will be made of recently running tasks, and the long sleeping tasks will not influence the load balancing. The load balance period is between 1 tick (10ms for idle load balance on ARM) and up to 256 ms (for busy load balance), so a high blocked load should imply some tasks that have run recently; otherwise your blocked load will be small and will not have a large influence on your load balance.

Just tried using cfs's runnable_load_avg + blocked_load_avg in weighted_cpuload() with my v3 patchset; aim9 shared workfile testing shows the performance dropped 70% more on the NHM EP machine. :(

Oops, the performance is still worse than just counting runnable_load_avg. But the drop is not so big: it dropped 30%, not 70%.
Thank you

Regards
Preeti U Murthy
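The decay rate Alex quotes above (a blocked task's contribution is divided by 2 every 32 ms) is easy to model. A sketch counting only full 32 ms windows; the kernel additionally scales the partial window, which this deliberately omits:

```c
#include <stdint.h>

/* Halve the blocked contribution once per elapsed 32 ms window.
 * After 64 windows the value has underflowed to 0 for any input. */
uint64_t decay_blocked_load_sketch(uint64_t load, unsigned int ms)
{
	unsigned int periods = ms / 32;

	if (periods >= 64)	/* fully decayed */
		return 0;
	return load >> periods;
}
```

After ten windows (320 ms) the remaining contribution is load/1024, under 0.1% of the original, which is why long sleepers barely influence the balance while recently-run tasks dominate the blocked load.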
Re: [PATCH] tty: Only wakeup the line discipline idle queue when queue is active
On 01/18/2013 09:15 PM, Oleg Nesterov wrote:
On 01/17, Preeti U Murthy wrote:
On 01/16/2013 05:32 PM, Ivo Sieben wrote:
I don't have a problem that there is a context switch to the high priority process: it has a higher priority, so it probably is more important. My problem is that even when the waitqueue is empty, the high priority thread has a risk of blocking on the spinlock needlessly (causing context switches to the low priority task and back to the high priority task)

Fair enough, Ivo. I think you should go ahead with merging the waitqueue_active() + wake_up() logic into the wake_up() variants.

This is not easy. We can't simply change the wake_up*() helpers or modify __wake_up().

Hmm. I need to confess that I don't really know what goes into a change such as this. Since there are a lot of waitqueue_active() + wake_up() calls, I was wondering why we have a separate helper like waitqueue_active() at all, if we could do what it does in wake_up*(). But you guys can decide this best.

I can't understand why you dislike Ivo's simple patch. There are a lot of if (waitqueue_active()) wake_up() examples. Even if we add the new helpers (personally I don't think this makes sense), we can do this later. Why should we delay this fix?

Personally I was concerned about how this could cause a scheduler overhead. There does not seem to be much of a problem here. Ivo's patch adding a waitqueue_active() check for his specific problem would also do well, unless there is a dire requirement for a clean up, which I am unable to evaluate.

Oleg.

Thank you

Regards
Preeti U Murthy
Re: [PATCH] tty: Only wakeup the line discipline idle queue when queue is active
Hi Ivo,

Can you explain how this problem could create a scheduler overhead? I am a little confused, because as far as I know the scheduler does not come into the picture in the wake up path, right? select_task_rq() in try_to_wake_up() is where the scheduler comes in, and this is after the task wakes up.

On 01/03/2013 03:19 PM, Ivo Sieben wrote:
Oleg, Peter, Ingo, Andi, Preeti,

2013/1/2 Jiri Slaby jsl...@suse.cz:
On 01/02/2013 04:21 PM, Ivo Sieben wrote:
I don't understand your responses: do you suggest to implement this "if active" behavior in:
* A new wake_up function called wake_up_if_active() that is part of the waitqueue layer?

Sounds good.
-- js suse labs

I want to ask you 'scheduler' people for your opinion: maybe you remember my previous patch where I suggested an extra 'waitqueue empty' check before entering the critical section of the wakeup() function (if you do not remember, see https://lkml.org/lkml/2012/10/25/159). Finally Oleg responded that a lot of callers do if (waitqueue_active(q)) wake_up(...); which made my patch pointless and adds a memory barrier. I then decided to also implement the 'waitqueue_active' approach for my problem. But now I get a review comment by Jiri that he would like to hide this 'if active' behavior in a wake_up_if_active() kind of function. I think he is right that implementing this check in the wakeup function would clean things up, right? I would like to have your opinion on the following two suggestions:
- We still can do the original patch on the wake_up() that I suggested. I then can do an additional code cleanup patch that removes the double 'waitqueue_active' call (a quick grep found about 150 of these waitqueue active calls) in several places in the code. I think this is a good move.
- Or - as an alternative - I could add extra _if_active() versions of all wake_up() functions, that implement this extra test.

Why add 'extra' _if_active() versions? Why not optimize this within the existing wake_up() functions?
Regards,
Ivo

Regards
Preeti U Murthy
Re: [PATCH] tty: Only wakeup the line discipline idle queue when queue is active
Hi Ivo,

On 01/16/2013 02:46 PM, Ivo Sieben wrote:
Hi Preeti,

2013/1/16 Preeti U Murthy pre...@linux.vnet.ibm.com:
Hi Ivo,
Can you explain how this problem could create a scheduler overhead? I am a little confused, because as far as I know the scheduler does not come into the picture in the wake up path, right? select_task_rq() in try_to_wake_up() is where the scheduler comes in, and this is after the task wakes up.

Everytime the line discipline is dereferenced, the wakeup function is called. The wakeup() function contains a critical section protected by spinlocks. On a PREEMPT_RT system, a normal spinlock behaves just like a mutex: scheduling is not disabled and it is still possible that a new process with a higher RT priority is scheduled in. When a new - higher priority - process is scheduled in just when put_ldisc() is in the critical section of the wakeup function, the higher priority process (that uses the same TTY instance) will finally also dereference the line discipline and try to wake up the same waitqueue. This causes the high priority process to block on the same spinlock. Priority inheritance will resolve this blocked situation by a context switch to the lower priority process, running until that process leaves the critical section, and a context switch back to the higher priority process. This is unnecessary, since the waitqueue was empty after all (during normal operation the waitqueue is empty most of the time). This unnecessary context switch from/to the high priority process is what I mean by scheduler overhead (maybe not a good name for it, sorry for the confusion). Does this make sense to you?
Yes. Thank you very much for the explanation :) But I don't see how the context switching goes away with your patch. With your patch, when the higher priority thread comes in while the lower priority thread is running in the critical section, it will see the wait queue empty and continue its execution without now wanting to enter the critical section. So this means it will preempt the lower priority thread because it is not waiting on a lock anyway. There is a context switch here, right? I don't see any problem in scheduling due to this, but I do think your patch is essential.

The entire logic of waitqueue_active() + wake_up() could be integrated into wake_up(). I don't understand why we need a separate function to check the emptiness of the wait queue. But as Oleg pointed out, we must identify the places for this optimization.

Regards
Preeti U Murthy
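The idiom being debated in this thread can be modeled in userspace: guard the lock with an unlocked emptiness check so a waker never contends on the lock for a wakeup that would be a no-op. All names here are illustrative; this mirrors the shape of the waitqueue_active() + wake_up() pattern, not the kernel's implementation:

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy waitqueue: a lock, a waiter count, and a wakeup counter. */
struct waitq_sketch {
	pthread_mutex_t lock;
	int nr_waiters;
	int wakeups_issued;
};

/* Racy unlocked peek, like waitqueue_active(); the race is benign as
 * long as the waiter publishes its presence with a memory barrier
 * before it sleeps. */
bool waitq_active_sketch(struct waitq_sketch *q)
{
	return q->nr_waiters > 0;
}

/* Only enter the critical section when there is someone to wake, so an
 * empty queue never costs the caller a potential block on the lock. */
void wake_up_sketch(struct waitq_sketch *q)
{
	if (!waitq_active_sketch(q))
		return;
	pthread_mutex_lock(&q->lock);
	q->wakeups_issued++;
	pthread_mutex_unlock(&q->lock);
}
```

This is exactly the trade-off discussed above: the high-priority waker still gets preempted normally, but it never blocks on the spinlock for an empty queue, so the priority-inheritance round trip is avoided.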
Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
Hi Alex,

On 01/16/2013 07:38 PM, Alex Shi wrote:
On 01/08/2013 04:41 PM, Preeti U Murthy wrote:
Hi Mike,

Thank you very much for such a clear and comprehensive explanation. So when I put together the problem and the proposed solution pieces in the current scheduler scalability, the following was what I found:

1. select_idle_sibling() is needed as an agent to correctly find the right cpu for wake up tasks to go to. "Correctly" would be to find an idle cpu at the lowest cost possible.
2. Cost could be lowered either by optimizing the order of searching for an idle cpu or restricting the search to a few cpus alone.
3. The former has the problem that it would not prevent bouncing tasks all over the domain sharing an L3 cache, which could potentially affect the fast moving tasks.
4. The latter has the problem that it is not aggressive enough in finding an idle cpu.

This is some tangled problem, but I think the solution at best could be smoothed into a flowchart:

STEP1: See if the idle buddy is free at all sched domains. If yes, return the buddy cpu.
STEP2: If not, decide whether we search the sched domains for an idle cpu at all. If no (saturated system), return prev_cpu.
STEP3: If yes, do an optimized search and return the target cpu.

I re-wrote the patch as follows. hackbench/aim9 doesn't show a clean performance change. Actually we can get some profit; it also will be very slight. :) BTW, it still needs another patch before applying this. Just to show the logic.

===
From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
From: Alex Shi alex@intel.com
Date: Fri, 11 Jan 2013 16:49:03 +0800
Subject: [PATCH 3/7] sched: select_idle_sibling optimization

Current logic in this function will insist on waking up the task in a totally idle group, otherwise it would rather go back to the previous cpu.

As Namhyung pointed out this could be the waking cpu as well.
The new logic will try to wake up the task on any idle cpu in the same cpu socket (in the same sd_llc), while an idle cpu in the smaller domain has higher priority.

Here is where the problem of large sockets comes in. select_idle_sibling() has its main weakness here. No doubt this patch could improve performance over the current logic, but it will still retain the major drawback of searching for an idle cpu in a large socket in the worst case.

It should have some help on burst wake up benchmarks like aim7.

Original-patch-by: Preeti U Murthy pre...@linux.vnet.ibm.com
Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 40 +++-
 1 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e116215..fa40e49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p)
+static int select_idle_sibling(struct task_struct *p,
+		struct sched_domain *affine_sd, int sync)

As Namhyung pointed out, where is affine_sd being used?

 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i;

 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
-	for_each_lower_domain(sd) {
+	for_each_domain(prev_cpu, sd) {

Why is prev_cpu being used? Ideally it should be the target cpu (waking/prev) depending on what wake_affine() decides.
 		sg = sd->groups;
 		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-					tsk_cpus_allowed(p)))
-				goto next;
-
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
-
-			prev_cpu = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
+			int nr_busy
Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
Hi Namhyung,

I re-wrote the patch as follows. hackbench/aim9 doesn't show a clean performance change. Actually we can get some profit; it also will be very slight. :) BTW, it still needs another patch before applying this. Just to show the logic.

===
From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
From: Alex Shi alex@intel.com
Date: Fri, 11 Jan 2013 16:49:03 +0800
Subject: [PATCH 3/7] sched: select_idle_sibling optimization

Current logic in this function will insist on waking up the task in a totally idle group, otherwise it would rather go back to the previous cpu.

Or the current cpu, depending on the result of wake_affine(), right?

The new logic will try to wake up the task on any idle cpu in the same cpu socket (in the same sd_llc), while an idle cpu in the smaller domain has higher priority.

But what about the SMT domain? The previous approach also descended till the SMT domain. Here we start from the SMT domain. You could check with /proc/schedstat as to which are the different domains the cpu is a part of, and the SMT domain happens to be domain0. As far as I know, for_each_lower_domain will descend till domain0.

I mean it seems that the code prefers running a task on an idle cpu which is a sibling thread in the same core rather than running it on an idle cpu in another idle core. I guess we didn't do that before.

It should have some help on burst wake up benchmarks like aim7.

Original-patch-by: Preeti U Murthy pre...@linux.vnet.ibm.com
Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 40 +++-
 1 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e116215..fa40e49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p)
+static int select_idle_sibling(struct task_struct *p,
+		struct sched_domain *affine_sd, int sync)

Where are these arguments used?

 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i;

 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
-	for_each_lower_domain(sd) {
+	for_each_domain(prev_cpu, sd) {

Always start from the prev_cpu?

 		sg = sd->groups;
 		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-					tsk_cpus_allowed(p)))
-				goto next;
-
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
-
-			prev_cpu = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
+			int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
+			int i;
+
+			/* no idle cpu in the group */
+			if (nr_busy == sg->group_weight)
+				continue;

Maybe we can skip the local group since it's a bottom-up search, so we know there's no idle cpu in the lower domain from the prior iteration.

We could have done this for the current logic because it checks for an *idle* group. The local group would definitely fail this test. But here we need to check the local group also because we are looking for an idle cpu.

Regards
Preeti U Murthy
Re: [PATCH] tty: Only wakeup the line discipline idle queue when queue is active
On 01/16/2013 05:32 PM, Ivo Sieben wrote:
2013/1/16 Preeti U Murthy pre...@linux.vnet.ibm.com:
Yes. Thank you very much for the explanation :) But I don't see how the context switching goes away with your patch. With your patch, when the higher priority thread comes in while the lower priority thread is running in the critical section, it will see the wait queue empty and continue its execution without now wanting to enter the critical section. So this means it will preempt the lower priority thread because it is not waiting on a lock anyway. There is a context switch here, right? I don't see any problem in scheduling due to this, but I do think your patch is essential.

I don't have a problem that there is a context switch to the high priority process: it has a higher priority, so it probably is more important. My problem is that even when the waitqueue is empty, the high priority thread has a risk of blocking on the spinlock needlessly (causing context switches to the low priority task and back to the high priority task).

Fair enough, Ivo. I think you should go ahead with merging the waitqueue_active() + wake_up() logic into the wake_up() variants.

Regards
Preeti U Murthy
Re: [patch v4 08/18] Revert sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
Hi everyone,

On 02/13/2013 09:15 PM, Paul Turner wrote:
On Wed, Feb 13, 2013 at 7:23 AM, Alex Shi alex@intel.com wrote:
On 02/12/2013 06:27 PM, Peter Zijlstra wrote:
On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then we can use runnable load variables.

It would be nice if we could quantify the performance hit of doing so. Haven't yet looked at later patches to see if we remove anything to off-set this.

In our rough testing, no much clear performance changes.

I'd personally like this to go with a series that actually does something with it. There's been a few proposals floating around on _how_ to do this; but the challenge is in getting it stable enough that all of the wake-up balancing does not totally perforate your stability gains into the noise. select_idle_sibling really is your nemesis here. It's a small enough patch that it can go at the head of any such series (and indeed, it was originally structured to make such a patch rather explicit.)
--
Thanks Alex

Paul, what exactly do you mean by select_idle_sibling() being our nemesis here? What we observed through our experiments was that:

1. With the per entity load tracking (runnable_load_avg) in load balancing, the load is distributed appropriately across the cpus.
2. However when a task sleeps and wakes up, select_idle_sibling() searches for the idlest group top to bottom. If a suitable candidate is not found, it wakes up the task on the prev_cpu/waker_cpu. This would increase the runqueue size and load of the prev_cpu/waker_cpu respectively.
3. The load balancer would then come to the rescue and redistribute the load.

As a consequence, *the primary observation was that there is no performance degradation with the integration of per entity load tracking into the load balancer, but there was a good increase in the number of migrations*. This, as I see it, is due to points 2 and 3 above. Is this what you call the nemesis?
OR select_idle_sibling() does a top to bottom search of the chosen domain for an idlest group and is very likely to spread the waking task to a far off group in the case of underutilized systems. This would prove costly for the software buddies in finding each other, due to the time taken for the search and the possible spreading of the software buddy tasks. Is this what you call the nemesis?

Another approach to removing the above two nemeses, if they are so, would be to use blocked_load+runnable_load for balancing. But when waking up a task, use select_idle_sibling() only to search the L2 cache domains for an idlest group. If unsuccessful, return the prev_cpu, which has already accounted for the task in the blocked_load; hence this move would not increase its load. Would you recommend going in this direction?

Thank you

Regards
Preeti U Murthy
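The alternative proposed in the last paragraph can be sketched in a few lines (an assumption about how it might look, not an implementation; all names are hypothetical): restrict the wake-up search to the cpus sharing the L2 cache with prev_cpu, and otherwise fall back to prev_cpu, whose blocked load already accounts for the task.

```c
#include <stdbool.h>

/* Pick the first idle cpu among those sharing prev_cpu's L2 cache;
 * if none is idle, keep the task on prev_cpu so its blocked-load
 * accounting there stays accurate. cpu_idle[] is an idleness map,
 * l2_cpus[] the ids of the cpus in prev_cpu's L2 domain. */
int select_cpu_l2_sketch(const bool *cpu_idle, const int *l2_cpus,
			 int l2_len, int prev_cpu)
{
	for (int i = 0; i < l2_len; i++)
		if (cpu_idle[l2_cpus[i]])
			return l2_cpus[i];
	return prev_cpu;
}
```

The design point is the bounded search: cost no longer grows with socket size, at the price of sometimes missing an idle cpu elsewhere in the L3 domain.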
Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake
	for_each_cpu(i, sched_group_cpus(group)) {
-		/* Bias balancing toward cpus of our domain */
-		if (local_group)
-			load = source_load(i, load_idx);
-		else
-			load = target_load(i, load_idx);
+		load = weighted_cpuload(i);
 		avg_load += load;
 	}
@@ -3227,20 +3218,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		avg_load = (avg_load * SCHED_POWER_SCALE);
 		do_div(avg_load, group->sgp->power);
-		if (local_group) {
-			this_load = avg_load;
-			this_group = group;
-		} else if (avg_load < min_load) {
+		if (avg_load < min_load) {
 			min_load = avg_load;
 			idlest = group;
 		}
 	} while (group = group->next, group != sd->groups);
-	if (this_group && idlest != this_group) {
-		/* Bias towards our group again */
-		if (!idlest || 100*this_load < imbalance*min_load)
-			return this_group;
-	}
 	return idlest;
 }
@@ -3248,7 +3231,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 /*
  * find_idlest_cpu - find the idlest cpu among the cpus in group.
  */
 static int
-find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+find_idlest_cpu(struct sched_group *group, struct task_struct *p)
 {
 	u64 load, min_load = ~0ULL;
 	int idlest = -1;
@@ -3258,7 +3241,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
 		load = weighted_cpuload(i);
-		if (load < min_load || (load == min_load && i == this_cpu)) {
+		if (load < min_load) {
 			min_load = load;
 			idlest = i;
 		}
@@ -3325,6 +3308,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
+	struct sched_group *group;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
@@ -3367,8 +3351,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}
 	while (sd) {
-		int load_idx = sd->forkexec_idx;
-		struct sched_group *group;
 		int weight;
 		if (!(sd->flags & sd_flag)) {
@@ -3376,15 +3358,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			continue;
 		}
-		if (sd_flag & SD_BALANCE_WAKE)
-			load_idx = sd->wake_idx;
-
-		group = find_idlest_group(sd, p, cpu, load_idx);
-		if (!group) {
-			goto unlock;
-		}
+		group = find_idlest_group(sd, p, cpu);
-		new_cpu = find_idlest_cpu(group, p, cpu);
+		new_cpu = group_first_cpu(group);
 		/* Now try balancing at a lower domain level of new_cpu */
 		cpu = new_cpu;
@@ -3398,6 +3374,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}
 	/* while loop will break here if sd == NULL */
+	new_cpu = find_idlest_cpu(group, p);
 unlock:
 	rcu_read_unlock();
@@ -4280,15 +4257,15 @@ static inline void update_sd_lb_power_stats(struct lb_env *env,
 	if (sgs->sum_nr_running + 1 > sgs->group_capacity)
 		return;
 	if (sgs->group_util > sds->leader_util ||
-	    sgs->group_util == sds->leader_util &&
-	    group_first_cpu(group) > group_first_cpu(sds->group_leader)) {
+	    (sgs->group_util == sds->leader_util &&
+	     group_first_cpu(group) > group_first_cpu(sds->group_leader))) {
 		sds->group_leader = group;
 		sds->leader_util = sgs->group_util;
 	}
 	/* Calculate the group which is almost idle */
 	if (sgs->group_util < sds->min_util ||
-	    sgs->group_util == sds->min_util &&
-	    group_first_cpu(group) > group_first_cpu(sds->group_leader)) {
+	    (sgs->group_util == sds->min_util &&
+	     group_first_cpu(group) > group_first_cpu(sds->group_leader))) {
 		sds->group_min = group;
 		sds->min_util = sgs->group_util;
 		sds->min_load_per_task = sgs->sum_weighted_load;

Regards Preeti U Murthy
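The restructured path in the patch above boils down to a two-phase selection: pick the group with the lowest power-scaled average load, then the least-loaded CPU inside it. Here is a minimal userspace sketch of that logic; struct group, the flat load arrays, and both function names are simplified stand-ins for the kernel's sched_group/runqueue data, not the real API.

```c
#include <limits.h>

#define SCHED_POWER_SCALE 1024UL

struct group {
    const unsigned long *cpu_load; /* per-cpu weighted load */
    int nr_cpus;
    unsigned long power;           /* group's cpu power */
};

/* Phase 1: the idlest group is the one with the lowest
 * power-scaled average load, mirroring avg_load * SCHED_POWER_SCALE /
 * group->sgp->power in the patch. */
static int pick_idlest_group(const struct group *groups, int nr_groups)
{
    unsigned long min_load = ULONG_MAX;
    int idlest = 0, g, i;

    for (g = 0; g < nr_groups; g++) {
        unsigned long avg_load = 0;
        for (i = 0; i < groups[g].nr_cpus; i++)
            avg_load += groups[g].cpu_load[i];
        avg_load = avg_load * SCHED_POWER_SCALE / groups[g].power;
        if (avg_load < min_load) {
            min_load = avg_load;
            idlest = g;
        }
    }
    return idlest;
}

/* Phase 2: least-loaded CPU within the chosen group (index within
 * the group, for simplicity). */
static int pick_idlest_cpu(const struct group *grp)
{
    unsigned long min_load = ULONG_MAX;
    int idlest = 0, i;

    for (i = 0; i < grp->nr_cpus; i++) {
        if (grp->cpu_load[i] < min_load) {
            min_load = grp->cpu_load[i];
            idlest = i;
        }
    }
    return idlest;
}
```

Note the patch defers phase 2 until after the domain walk (using group_first_cpu() as a proxy while descending), which is exactly what moving the find_idlest_cpu() call below the while loop accomplishes.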
Re: [patch v4 07/18] sched: set initial load avg of new forked task
Hi everyone, On 02/19/2013 05:04 PM, Paul Turner wrote: On Fri, Feb 15, 2013 at 2:07 AM, Alex Shi alex@intel.com wrote:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1dff78a..9d1c193 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1557,8 +1557,8 @@ static void __sched_fork(struct task_struct *p)
  * load-balance).
  */
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
-	p->se.avg.runnable_avg_period = 0;
-	p->se.avg.runnable_avg_sum = 0;
+	p->se.avg.runnable_avg_period = 1024;
+	p->se.avg.runnable_avg_sum = 1024;

It can't work. avg.decay_count needs to be set to 0 before enqueue_entity_load_avg(); then update_entity_load_avg() can't be called, so runnable_avg_period/sum are unusable. Well, we _could_ also use a negative decay_count here and treat it like a migration; but the larger problem is the visibility of p->on_rq, which gates whether we account the time as runnable and occurs after activate_task(), so that's out. Even if we get a chance to call __update_entity_runnable_avg(), avg.last_runnable_update needs to be set before that; usually it needs to be set to 'now', which makes __update_entity_runnable_avg() return 0, so update_entity_load_avg() still cannot reach __update_entity_load_avg_contrib(). If we embed a special new-task load initialization into many functions, that is too hard for a future reader. This is my concern about making this a special case with the introduction of the ENQUEUE_NEWTASK flag; enqueue jumps through enough hoops as it is. I still don't see why we can't resolve this at init time in __sched_fork(); your patch above just moves an explicit initialization of load_avg_contrib into the enqueue path. Adding a call to __update_task_entity_contrib() to the previous alternate suggestion would similarly seem to resolve this? We could do this (adding a call to __update_task_entity_contrib()), but cfs_rq->runnable_load_avg gets updated only if the task is on the runqueue.
But in the forked task's case the on_rq flag is not yet set. Something like the below:
---
 kernel/sched/fair.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8691b0d..841e156 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1451,14 +1451,20 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
-		return;
-
+	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) {
+		if (!(flags & ENQUEUE_NEWTASK))
+			return;
+	}
 	contrib_delta = __update_entity_load_avg_contrib(se);
 	if (!update_cfs_rq)
 		return;
+	/* But the cfs_rq->runnable_load_avg does not get updated in case of
+	 * a forked task, because se->on_rq = 0, although we update the
+	 * task's load_avg_contrib above in
+	 * __update_entity_load_avg_contrib().
+	 */
 	if (se->on_rq)
 		cfs_rq->runnable_load_avg += contrib_delta;
 	else
@@ -1538,12 +1544,6 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
 		update_entity_load_avg(se, 0);
 	}
-	/*
-	 * set the initial load avg of a new task same as its load
-	 * in order to avoid burst fork making a few cpus too heavy
-	 */
-	if (flags & ENQUEUE_NEWTASK)
-		se->avg.load_avg_contrib = se->load.weight;
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
 	/* we force update consideration on load-balancer moves */

Thanks Regards Preeti U Murthy
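The on_rq gating discussed above is the crux: a change in a task's load_avg_contrib is credited to the runqueue's runnable sum only when the entity is on the runqueue, and a freshly forked task has on_rq == 0 at this point. A few lines of userspace C make the branch explicit; the struct, the function name, and the simplified else-branch (the kernel instead adjusts the blocked sums via subtract_blocked_load_contrib()) are illustrative assumptions.

```c
struct cfs_rq_sim {
    long runnable_load_avg; /* sum of contribs of on_rq entities */
    long blocked_load_avg;  /* simplified stand-in for the blocked sums */
};

/* Credit a contrib delta to the runnable sum only when the entity is
 * on the runqueue; a forked task (on_rq == 0) never reaches
 * runnable_load_avg through this path. */
static void account_contrib_delta(struct cfs_rq_sim *cfs_rq,
                                  long contrib_delta, int on_rq)
{
    if (on_rq)
        cfs_rq->runnable_load_avg += contrib_delta;
    else
        cfs_rq->blocked_load_avg += contrib_delta;
}
```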
Re: [patch v5 06/15] sched: log the cpu utilization at rq
Hi,

 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -481,6 +484,7 @@ struct rq {
 #endif
 	struct sched_avg avg;
+	unsigned int util;
 };

 static inline int cpu_of(struct rq *rq)

You don't actually compute the rq utilization, you only compute the utilization as per the fair class, so if there's significant RT activity it'll think the cpu is under-utilized, which I think will result in the wrong thing. Correct me if I am wrong, but isn't the current load balancer also disregarding real-time tasks when calculating the domain/group/cpu-level load? What I mean is, if the answer to the above question is yes, then can we safely assume that further optimizations to the load balancer, like the power-aware scheduler and the use of per-entity load tracking, can be done without considering real-time tasks? Regards Preeti U Murthy
Re: [patch v5 06/15] sched: log the cpu utilization at rq
Hi everyone, On 02/18/2013 10:37 AM, Alex Shi wrote: The cpu's utilization is a measure of how busy the cpu is: util = cpu_rq(cpu)->avg.runnable_avg_sum / cpu_rq(cpu)->avg.runnable_avg_period; Why not cfs_rq->runnable_load_avg? I am concerned with what is the right metric to use here. Refer to this discussion: https://lkml.org/lkml/2012/10/29/448 Regards Preeti U Murthy
Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
Hi Alex, On 02/24/2013 02:57 PM, Alex Shi wrote: On 02/22/2013 04:54 PM, Peter Zijlstra wrote: On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote: The name is a secondary issue, first you need to explain why you think nr_running is a useful metric at all. You can have a high nr_running and a low utilization (a burst of wakeups, each waking a process that'll instantly go to sleep again), or low nr_running and high utilization (a single cpu-bound process). That is true in periodic balance. But at fork/exec/wake time, the incoming processes usually need to do something before sleeping again. You'd be surprised, there's a fair number of workloads that have negligible runtime on wakeup. Will appreciate it if you'd like to introduce some workloads. :) BTW, do you have some idea how to handle them? Actually, if tasks are transitory like that, it is also hard to catch them in balance; with 'cyclictest -t 100' on my 4-LCPU laptop, vmstat can catch only 1 or 2 tasks every second. I use nr_running to measure how busy the group is, for 3 reasons: 1) the current performance policy doesn't use utilization either. We were planning to fix that now that it's available. I had tried, but failed on the aim9 benchmark. As a result I gave up on using utilization in performance balance. Some trying and talking in these threads: https://lkml.org/lkml/2013/1/6/96 https://lkml.org/lkml/2013/1/22/662 2) the power policy doesn't care about load weight. Then it's broken, it should very much still care about weight. Here the power policy just uses nr_running as the criterion to check whether a group is eligible for power-aware balance; when doing the balancing, load weight is still the key judgment. 3) I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc.; some benchmark results look clearly bad when using utilization. If my memory is right, hackbench/aim7 both looked bad. I had tried many ways to engage utilization in this balance, like using utilization only, or utilization * nr_running, etc.,
but still could not find a way to recover the loss. But with nr_running, the performance doesn't seem to lose much with the power policy. You're failing to explain why utilization performs badly and you don't explain why nr_running is better; that things work simply isn't good enough. Um, let me try to explain again: the utilisation needs much time to accumulate itself (345ms). Whether with or without load weight, many bursting tasks just give a minimum weight to the carrier CPU in the first few ms. So it is too easy to make an incorrect distribution here and need migration at a later periodic balance. I don't understand why forked tasks take time to accumulate their load. I understand this if it were a woken-up task. The first time the forked task gets a chance to update the load itself, it needs to reflect full utilization: in __update_entity_runnable_avg(), both runnable_avg_period and runnable_avg_sum get equally incremented for a forked task, since it is runnable. Hence where is the chance for the load to get incremented in steps? In sleeping tasks, since runnable_avg_sum progresses much more slowly than runnable_avg_period, these tasks take much time to accumulate the load when they wake up. That makes sense, of course. But how does this happen for forked tasks? Regards Preeti U Murthy
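The claim above — that runnable_avg_sum and runnable_avg_period move in lockstep for a task that is runnable from the instant it forks — can be checked numerically with a toy model. This is a floating-point approximation of the kernel's fixed-point scheme (per-ms decay y with y^32 = 1/2); forked_task_ratio is an illustrative name, not a kernel function.

```c
/* Model of __update_entity_runnable_avg for a task runnable since fork:
 * each 1 ms step decays both sums by y and adds 1024 to both, so the
 * ratio sum/period -- the task's apparent utilization -- is 100% from
 * the very first update, with no ramp-up. */
static double forked_task_ratio(int ms_since_fork)
{
    const double y = 0.9785720621; /* ~ 0.5^(1/32), per-ms decay */
    double sum = 0.0, period = 0.0;
    int i;

    for (i = 0; i < ms_since_fork; i++) {
        sum = sum * y + 1024.0;    /* runnable the whole window */
        period = period * y + 1024.0;
    }
    return period ? sum / period : 0.0;
}
```

A sleeping task differs only in that the `+ 1024.0` on `sum` is skipped during sleep windows, which is exactly why its ratio ramps up slowly after a wakeup while the forked task's does not.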
Re: [patch v5 02/15] sched: set initial load avg of new forked task
Hi Alex, On 02/20/2013 11:50 AM, Alex Shi wrote: On 02/18/2013 01:07 PM, Alex Shi wrote: A new task has no runnable sum at its first runnable time, so its runnable load is zero. That makes burst-forking balancing just select a few idle cpus to assign tasks to, if we engage runnable load in balancing. Set the initial load avg of a new forked task to its load weight to resolve this issue. Patch answering PJT's update here; it merges the 1st and 2nd patches into one. The other patches in the series don't need to change.
=
From 89b56f2e5a323a0cb91c98be15c94d34e8904098 Mon Sep 17 00:00:00 2001
From: Alex Shi alex@intel.com
Date: Mon, 3 Dec 2012 17:30:39 +0800
Subject: [PATCH 01/14] sched: set initial value of runnable avg for new forked task

We need to initialize se.avg.{decay_count, load_avg_contrib} for a new forked task. Otherwise random values of the above variables cause a mess when doing new task enqueue:

    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg

and make fork balancing imbalance due to incorrect load_avg_contrib. Set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to resolve such issues.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/core.c | 3 +++
 kernel/sched/fair.c | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..1452e14 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,7 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
+	p->se.avg.decay_count = 0;
 #endif
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
@@ -1646,6 +1647,8 @@ void sched_fork(struct task_struct *p)
 		p->sched_reset_on_fork = 0;
 	}

I think the following comment will help here:

	/* All forked tasks are assumed to have full utilization to begin with */
+	p->se.avg.load_avg_contrib = p->se.load.weight;
+
 	if (!rt_prio(p->prio))
 		p->sched_class = &fair_sched_class;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..cae5134 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
  * We track migrations using entity decay_count <= 0, on a wake-up
  * migration we use a negative decay count to track the remote decays
  * accumulated while sleeping.
+ *
+ * When enqueueing a new forked task, se->avg.decay_count == 0, so
+ * we bypass update_entity_load_avg() and use the avg.load_avg_contrib
+ * initial value: se->load.weight.

I disagree with the comment. update_entity_load_avg() gets called for all forked tasks: enqueue_task_fair()->update_entity_load_avg() during the second iteration. But __update_entity_load_avg_contrib() in update_entity_load_avg(), where the actual load update happens, does not get called. This is because, as below, the last_runnable_update of the forked task is nearly equal to the clock_task of the runqueue; hence 1ms has probably not passed for the load to get updated, which is why neither the load of the task nor the load of the runqueue gets updated when the task forks. Also note that the reason we bypass update_entity_load_avg() below is not that our decay_count == 0; it is that forked tasks have nothing to update. Only woken-up tasks and migrated wake-ups have load updates to do. Forked tasks were just created; they have no load to update, only load to create. This, I feel, is rightly done in sched_fork() by this patch. So ideally I don't think we should have any comment here; it does not sound relevant.

 */
 	if (unlikely(se->avg.decay_count <= 0)) {
 		se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;

Regards Preeti U Murthy
Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
Hi, On 02/24/2013 02:57 PM, Alex Shi wrote: On 02/22/2013 04:54 PM, Peter Zijlstra wrote: On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote: The name is a secondary issue, first you need to explain why you think nr_running is a useful metric at all. You can have a high nr_running and a low utilization (a burst of wakeups, each waking a process that'll instantly go to sleep again), or low nr_running and high utilization (a single cpu-bound process). That is true in periodic balance. But at fork/exec/wake time, the incoming processes usually need to do something before sleeping again. You'd be surprised, there's a fair number of workloads that have negligible runtime on wakeup. Will appreciate it if you'd like to introduce some workloads. :) BTW, do you have some idea how to handle them? Actually, if tasks are transitory like that, it is also hard to catch them in balance; with 'cyclictest -t 100' on my 4-LCPU laptop, vmstat can catch only 1 or 2 tasks every second. I use nr_running to measure how busy the group is, for 3 reasons: 1) the current performance policy doesn't use utilization either. We were planning to fix that now that it's available. I had tried, but failed on the aim9 benchmark. As a result I gave up on using utilization in performance balance. Some trying and talking in these threads: https://lkml.org/lkml/2013/1/6/96 https://lkml.org/lkml/2013/1/22/662 2) the power policy doesn't care about load weight. Then it's broken, it should very much still care about weight. Here the power policy just uses nr_running as the criterion to check whether a group is eligible for power-aware balance; when doing the balancing, load weight is still the key judgment. 3) I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc.; some benchmark results look clearly bad when using utilization. If my memory is right, hackbench/aim7 both looked bad. I had tried many ways to engage utilization in this balance, like using utilization only, or utilization * nr_running, etc., but still could not find a way to recover the loss.
But with nr_running, the performance doesn't seem to lose much with the power policy. You're failing to explain why utilization performs badly and you don't explain why nr_running is better; that things work simply isn't good enough. Um, let me try to explain again: the utilisation needs much time to accumulate itself (345ms). Whether with or without load weight, many bursting tasks just give a minimum weight to the carrier CPU in the first few ms. So it is too easy to make an incorrect distribution here and need migration at a later periodic balance.

Why can't this be attacked in *either* of the following ways:

1. Attack the problem at the source, by ensuring that the utilisation is accumulated faster by making the update window smaller.
2. Balance on nr_running only if you detect burst wakeups. Alex, you had released a patch earlier which could detect this, right? Instead of balancing on nr_running all the time, why not balance on it only when burst wakeups are detected. By doing so, you ensure that nr_running as a metric for load balancing is used when it is right to do so, and the reason for using it also gets well documented.

Regards Preeti U Murthy
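Suggestion 2 above can be sketched as a small decision helper: fall back to nr_running-based placement only when nr_running grew faster than some burst threshold over the last couple of ticks, since bursty tasks have not yet accumulated utilization. The threshold (the domain's span weight, echoing the "beyond sd->span_weight in 2 ticks" idea from the earlier thread), the function names, and the plain integer inputs are all illustrative assumptions, not kernel interfaces.

```c
/* A burst is declared when nr_running grew by more than the domain's
 * span weight within the observation window (e.g. 2 ticks). */
static int burst_detected(int nr_running_now, int nr_running_2_ticks_ago,
                          int span_weight)
{
    return nr_running_now - nr_running_2_ticks_ago > span_weight;
}

/* Pick the balancing metric: nr_running during a burst (utilization is
 * not trustworthy yet), the accumulated utilization otherwise. */
static unsigned long balance_metric(unsigned long utilization,
                                    int nr_running_now,
                                    int nr_running_2_ticks_ago,
                                    int span_weight)
{
    if (burst_detected(nr_running_now, nr_running_2_ticks_ago, span_weight))
        return (unsigned long)nr_running_now;
    return utilization;
}
```

The point of the split is documentation as much as behavior: the nr_running path is explicitly justified by the burst condition rather than used unconditionally.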
Re: weakness of runnable load tracking?
Hi Alex, Hi Paul & Ingo: In a short word, the issue is this: burst forking/waking tasks have had no time to accumulate their load contribution, so their runnable load is taken as zero. On performing certain experiments on the way PJT's metric calculates the load, I observed a few things. Based on these observations, let me see if I can address the issue of why PJT's metric calculates the load of bursty tasks as 0. When we speak about a burst-waking task (I will not go into forking here), we should also speak about its duty cycle: it burst-wakes for 1ms out of a 10ms duty cycle, or burst-wakes for 9s out of a 10s duty cycle, both being 10% tasks wrt their duty cycles. Let us see how the load is calculated by PJT's metric in each of these cases.

      __
     |  |
   __|  |____________
     A  B
     |1ms|
     <------ 10ms ------>

         Example 1

When the task wakes up at A, it is not yet runnable, and an update of the task load takes place. Its runtime so far is 0, and its existence time is 10ms. Hence the load is 0/10 * 1024. Since a scheduler tick happens at B (a scheduler tick happens every 1ms, 10ms or 4ms; let us assume 1ms), an update of the load takes place. PJT's metric divides the time elapsed into 1ms windows. There is just one 1ms window, hence the runtime is 1ms and the load is 1ms/10ms * 1024. *If the time elapsed between A and B were < 1ms, then PJT's metric would not capture it.* Under those circumstances the load remains 0/10ms * 1024 = 0. This is the situation you are pointing out. Let us assume that this cycle continues throughout the lifetime of the load; then the load remains at 0. The question is whether tasks which run for periods < 1ms are OK to term as 0 workloads. If that is fine, then what PJT's metric is doing is right; maybe we should ignore such workloads because they hardly contribute to the load. Otherwise we will need to reduce the window of load update to below 1ms to capture such loads.
Just for some additional info, so that we know what happens to different kinds of loads with PJT's metric, consider the below situation:

      ________________
     |                |
   __|                |__
     A                B
     <------ 1s ------>
   <-------- 10s -------->

         Example 2

Here at A the task wakes, just like in Example 1, and the load is termed 0. Between A and B, if we consider the load to get updated at every scheduler tick, then the load slowly increases from 0 to 1024 at B. It is 1024 here, although this is also a 10% task, whereas in Example 1 the load is 102.4 for a 10% task. So what is fishy? In my opinion, PJT's metric gives tasks some time to prove their activeness after they wake up. In Example 2 the task has stayed awake too long (1s), irrespective of what % of the total run time that is; therefore the metric calculates the load to be big enough to balance. In the example that you quoted, the tasks may not have run long enough to consider them candidates for load balancing. So, essentially, what PJT's metric is doing is characterising a task by the amount it has run so far. That makes select_task_rq make a wrong decision on which group is idlest. There are still 3 kinds of solutions that can help with this issue: a) set a non-zero minimum value for the long-time-sleeping task, but that seems unfair to other tasks that just slept a short while; b) just use the runnable load contrib in load balance, still using nr_running to judge the idlest group in select_task_rq_fair(), but that may cause a few more migrations in future load balancing; c) consider both runnable load and nr_running in the group: if, in the searched domain, the nr_running number increased by a certain amount, like double the domain span, in a certain time, we consider it a burst of forking/waking, and then just use nr_running as the idlest-group criterion. IMHO, I like the 3rd one a bit more. As for the time window for judging whether a burst happened: since we calculate the runnable avg at every tick, if the increase in nr_running goes beyond sd->span_weight in 2 ticks, it means a burst is happening.
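Example 1's arithmetic can be checked with a toy model of the per-ms decaying sums: a task running 1ms out of every 10ms converges to a load in the neighborhood of 10% of 1024 (roughly 90-100, depending on where in the cycle you sample), consistent with the 1ms/10ms figure in the text. This is a simplified floating-point model, not the kernel's fixed-point code; the decay constant approximates y with y^32 = 1/2, and the function name is illustrative.

```c
/* Steady-state load of a periodic task under a PJT-style decaying
 * average: per 1 ms window, decay both sums by y and credit runnable
 * time only during the first run_ms of each period_ms cycle. */
static double pjt_duty_cycle_load(int run_ms, int period_ms, int cycles)
{
    const double y = 0.9785720621; /* ~ 0.5^(1/32), per-ms decay */
    double sum = 0.0, period = 0.0;
    int c, ms;

    for (c = 0; c < cycles; c++) {
        for (ms = 0; ms < period_ms; ms++) {
            sum = sum * y + (ms < run_ms ? 1024.0 : 0.0);
            period = period * y + 1024.0;
        }
    }
    return 1024.0 * sum / period; /* sampled at the end of a period */
}
```

The same model also reproduces Example 2's endpoint: a task that is runnable through its whole window saturates at the full 1024, since sum tracks period exactly.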
What's your opinion on this? Any comments are appreciated! So PJT's metric rightly seems to be capturing the load of these bursty tasks, but you are right in pointing out that when too many such loads queue up on the cpu, this metric will consider the load on the cpu as 0, which might not be such a good idea. It is true that we need to bring in nr_running somewhere. Let me now go through your suggestions on where to include nr_running and get back on this. I had planned on including nr_running while selecting the busy group in update_sd_lb_stats(), but select_task_rq_fair() is yet another place to do this, that's right. Good that this issue was brought up :) Regards! Alex Regards Preeti U Murthy
Re: [PATCH 01/18] sched: select_task_rq_fair clean up
Hi Alex, On 12/10/2012 01:52 PM, Alex Shi wrote: It is impossible to miss a task's allowed cpu in an eligible group. The one thing I am concerned with here is whether there is a possibility of the task changing its tsk_cpus_allowed() while this code is running, i.e. find_idlest_group() finds an idle group, then the tsk_cpus_allowed() for the task changes, perhaps by the user himself, and might not include the cpus in the idle group. After this, find_idlest_cpu() is called. I mean a race condition, in short. Then we might not have an eligible cpu in that group, right? And since find_idlest_group() only returns a different group which excludes the old cpu, it is also impossible to find a new cpu the same as the old cpu. This I agree with.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59e072b..df99456 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3150,11 +3150,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		}
 		new_cpu = find_idlest_cpu(group, p, cpu);
-		if (new_cpu == -1 || new_cpu == cpu) {
-			/* Now try balancing at a lower domain level of cpu */
-			sd = sd->child;
-			continue;
-		}
 		/* Now try balancing at a lower domain level of new_cpu */
 		cpu = new_cpu;

Regards Preeti U Murthy
Re: [PATCH 02/18] sched: fix find_idlest_group mess logical
Hi Alex, On 12/10/2012 01:52 PM, Alex Shi wrote: There are 4 situations in the function: 1) no task-allowed group, so min_load = ULONG_MAX, this_load = 0, idlest = NULL; 2) only the local group task-allowed, so min_load = ULONG_MAX, this_load assigned, idlest = NULL; 3) only a non-local group task-allowed, so min_load assigned, this_load = 0, idlest != NULL; 4) the local group plus another group task-allowed, so min_load assigned, this_load assigned, idlest != NULL. The current logic will return NULL in the first 3 scenarios, and still return NULL if the idlest group is heavier than the local group in the 4th situation. Actually, I thought groups in situations 2 and 3 are also eligible to host the task, and in the 4th situation I agree to bias toward the local group. So, this patch.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df99456..b40bc2b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2953,6 +2953,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int load_idx)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
+	struct sched_group *this_group = NULL;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
@@ -2987,14 +2988,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		if (local_group) {
 			this_load = avg_load;
-		} else if (avg_load < min_load) {
+			this_group = group;
+		}
+		if (avg_load < min_load) {
 			min_load = avg_load;
 			idlest = group;
 		}
 	} while (group = group->next, group != sd->groups);
-	if (!idlest || 100*this_load < imbalance*min_load)
-		return NULL;
+	if (this_group && idlest != this_group)
+		/* Bias toward our group again */
+		if (100*this_load < imbalance*min_load)
+			idlest = this_group;

If the idlest group is heavier than this_group (or, to put it better, if the difference in the loads of the local group and the idlest group is less than a threshold, meaning there is no point moving the load from the local group), you return NULL. That immediately means this_group is chosen as the candidate group for the task to run on; one does not have to explicitly return that. Let me explain: if find_idlest_group() returns NULL to mark your case 4, it means there is no group idler than the group to which this_cpu belongs, at that level of the sched domain. Which is fair enough. So now the question is, under such a circumstance, which is the idlest group so far? It is the group containing this_cpu, i.e. this_group. After this, sd->child is chosen, which is nothing but this_group (the sd hierarchy moves towards the cpu it belongs to). Again here the idlest-group search begins.

+	return idlest;
 }

Regards Preeti U Murthy
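The control flow Preeti describes — NULL from find_idlest_group() is not a failure but an implicit vote for the local group, since the caller then descends into sd->child — can be modeled in userspace C. The struct, group indices, and function name are simplified stand-ins for the kernel's sched_group walk, and the imbalance parameter plays the role of 100 + (sd->imbalance_pct - 100)/2.

```c
#include <limits.h>

struct domain_sim {
    const unsigned long *group_load; /* avg_load per group */
    int nr_groups;
    int local_group;                 /* index of this_cpu's group */
};

/* Returns the chosen group index.  The final fallback mirrors the
 * kernel's "return NULL" path: if no remote group is sufficiently
 * idler than ours, the local group wins implicitly. */
static int choose_group(const struct domain_sim *sd, int imbalance)
{
    unsigned long min_load = ULONG_MAX, this_load = 0;
    int idlest = -1, g;

    for (g = 0; g < sd->nr_groups; g++) {
        if (g == sd->local_group)
            this_load = sd->group_load[g];
        else if (sd->group_load[g] < min_load) {
            min_load = sd->group_load[g];
            idlest = g;
        }
    }
    /* No remote group beats ours by the imbalance margin:
     * stay local, i.e. descend into sd->child. */
    if (idlest < 0 || 100 * this_load < (unsigned long)imbalance * min_load)
        return sd->local_group;
    return idlest;
}
```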
Re: [PATCH 02/18] sched: fix find_idlest_group mess logical
Hi Alex, On 12/11/2012 10:59 AM, Alex Shi wrote: On 12/11/2012 01:08 PM, Preeti U Murthy wrote: Hi Alex, On 12/10/2012 01:52 PM, Alex Shi wrote: There are 4 situations in the function: 1) no task-allowed group, so min_load = ULONG_MAX, this_load = 0, idlest = NULL; 2) only the local group task-allowed, so min_load = ULONG_MAX, this_load assigned, idlest = NULL; 3) only a non-local group task-allowed, so min_load assigned, this_load = 0, idlest != NULL; 4) the local group plus another group task-allowed, so min_load assigned, this_load assigned, idlest != NULL. The current logic will return NULL in the first 3 scenarios, and still return NULL if the idlest group is heavier than the local group in the 4th situation. Actually, I thought groups in situations 2 and 3 are also eligible to host the task, and in the 4th situation I agree to bias toward the local group. So, this patch.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df99456..b40bc2b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2953,6 +2953,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int load_idx)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
+	struct sched_group *this_group = NULL;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
@@ -2987,14 +2988,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		if (local_group) {
 			this_load = avg_load;
-		} else if (avg_load < min_load) {
+			this_group = group;
+		}
+		if (avg_load < min_load) {
 			min_load = avg_load;
 			idlest = group;
 		}
 	} while (group = group->next, group != sd->groups);
-	if (!idlest || 100*this_load < imbalance*min_load)
-		return NULL;
+	if (this_group && idlest != this_group)
+		/* Bias toward our group again */
+		if (100*this_load < imbalance*min_load)
+			idlest = this_group;

If the idlest group is heavier than this_group (or, to put it better, if the difference in the loads of the local group and the idlest group is less than a threshold, meaning there is no point moving the load from the local group), you return NULL. That immediately means this_group is chosen as the candidate group for the task to run on; one does not have to explicitly return that. In situation 4, this_group is not NULL. True. The return value of find_idlest_group() indicates that there is no group idler than the local group (the group to which the cpu belongs); it does not indicate that there is no host group for the task. If this is the case, select_task_rq_fair() falls back to the group (sd->child) to which the cpu chosen in the previous iteration belongs. This is nothing but this_group in the current iteration. Regards Preeti U Murthy
Re: [PATCH 01/18] sched: select_task_rq_fair clean up
On 12/11/2012 10:58 AM, Alex Shi wrote: On 12/11/2012 12:23 PM, Preeti U Murthy wrote: Hi Alex, On 12/10/2012 01:52 PM, Alex Shi wrote:

It is impossible to miss a task-allowed cpu in an eligible group.

The one thing I am concerned with here is if there is a possibility of the task changing its tsk_cpus_allowed() while this code is running, i.e. find_idlest_group() finds an idle group, then the tsk_cpus_allowed() for the task changes, perhaps by the user himself, which might not include the cpus in the idle group. After this, find_idlest_cpu() is called. I mean a race condition, in short. Then we might not have an eligible cpu in that group, right?

Your worry makes sense, but the code handles the situation: in select_task_rq(), it will check the allowed cpus again. If the answer is no, it will fall back to the old cpu. And since find_idlest_group() only returns a different group which excludes the old cpu, it is also impossible to find a new cpu same as the old cpu.

I doubt this will work correctly. Consider the following situation: the sched domain walk begins with the sd that encloses both socket1 and socket2.

cpu0 cpu1 | cpu2 cpu3
----------|----------
 socket1  |  socket2

old cpu = cpu1

Iteration 1:
1. find_idlest_group() returns socket2 to be idlest.
2. the task changes tsk_allowed_cpus to {0,1}
3. find_idlest_cpu() returns cpu2

* without your patch
1. the condition after find_idlest_cpu() returns -1, and sd->child is chosen, which happens to be socket1
2. in the next iteration, find_idlest_group() and find_idlest_cpu() will probably choose cpu0, which happens to be idler than cpu1 and is in tsk_allowed_cpus.

* with your patch
1. the condition after find_idlest_cpu() does not exist, therefore a sched domain to which cpu2 belongs is chosen; this is socket2 (under the for_each_domain() loop).
2. in the next iteration, find_idlest_group() returns NULL, because there is no cpu which intersects with tsk_allowed_cpus.
3. in select_task_rq(), the fallback cpu is chosen even though an idle cpu existed.
So my concern is that though select_task_rq() checks tsk_cpus_allowed(), you might end up choosing a different path of sched domains compared to without this patch, as shown above. In short, without the if (new_cpu == -1) condition we might get misled into doing unnecessary iterations over the wrong sched domains in select_task_rq_fair(). (Think about situations when not all of the cpus of socket2 are disallowed by the task; then there will be more iterations on the wrong path of sched domains before exit, compared to what is shown above.)

Regards
Preeti U Murthy
[PATCH] sched: Explicit division calls on 64-bit integers
Certain gcc toolchains convert a division on a 64-bit dividend into an __aeabi_uldivmod call, which does an unnecessary 64-bit by 64-bit divide even though the divisor is 32-bit. This 64-by-64-bit division is not implemented in the kernel for reasons of efficiency, which results in undefined reference errors at link time. Hence perform the division on 64-bit dividends using the do_div() function.

The use case below is the integration of the Per-entity-Load-Tracking metric with the load balancer, where cfs_rq->runnable_load_avg, a 64-bit unsigned integer, is used as the base metric for load balancing.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 kernel/sched/fair.c | 51 +++
 1 file changed, 31 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8f3a29..7cd3096 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2982,9 +2982,13 @@ static u64 cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+	u64 cfs_avg_load_per_task;

-	if (nr_running)
-		return rq->cfs.runnable_load_avg / nr_running;
+	if (nr_running) {
+		cfs_avg_load_per_task = rq->cfs.runnable_load_avg;
+		do_div(cfs_avg_load_per_task, nr_running);
+		return cfs_avg_load_per_task;
+	}

 	return 0;
 }
@@ -3249,7 +3253,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	}

 	/* Adjust by relative CPU power of the group */
-	avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power;
+	avg_load = (avg_load * SCHED_POWER_SCALE);
+	do_div(avg_load, group->sgp->power);

 	if (local_group) {
 		this_load = avg_load;
@@ -4756,7 +4761,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	}

 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
+	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE);
+	do_div(sgs->avg_load, group->sgp->power);

 	/*
 	 * Consider the group unbalanced when the imbalance is larger
@@ -4767,8 +4773,10 @@ static
inline void update_sg_lb_stats(struct lb_env *env,
 	 * normalized nr_running number somewhere that negates
 	 * the hierarchy?
 	 */
-	if (sgs->sum_nr_running)
-		avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
+	if (sgs->sum_nr_running) {
+		avg_load_per_task = sgs->sum_weighted_load;
+		do_div(avg_load_per_task, sgs->sum_nr_running);
+	}

 	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
 	    (max_nr_running - min_nr_running) > 1)
@@ -4953,7 +4961,7 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	u64 scaled_busy_load_per_task;

 	if (sds->this_nr_running) {
-		sds->this_load_per_task /= sds->this_nr_running;
+		do_div(sds->this_load_per_task, sds->this_nr_running);
 		if (sds->busiest_load_per_task > sds->this_load_per_task)
 			imbn = 1;
@@ -4964,7 +4972,7 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	scaled_busy_load_per_task =
 		sds->busiest_load_per_task * SCHED_POWER_SCALE;
-	scaled_busy_load_per_task /= sds->busiest->sgp->power;
+	do_div(scaled_busy_load_per_task, sds->busiest->sgp->power);

 	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
 	    (scaled_busy_load_per_task * imbn)) {
@@ -4985,20 +4993,21 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	pwr_now /= SCHED_POWER_SCALE;

 	/* Amount of load we'd subtract */
-	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-		sds->busiest->sgp->power;
+	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE);
+	do_div(tmp, sds->busiest->sgp->power);
 	if (sds->max_load > tmp)
 		pwr_move += sds->busiest->sgp->power *
 			min(sds->busiest_load_per_task, sds->max_load - tmp);

 	/* Amount of load we'd add */
 	if (sds->max_load * sds->busiest->sgp->power <
-	    sds->busiest_load_per_task * SCHED_POWER_SCALE)
-		tmp = (sds->max_load * sds->busiest->sgp->power) /
-			sds->this->sgp->power;
-	else
-		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-			sds->this->sgp->power;
+	    sds->busiest_load_per_task * SCHED_POWER_SCALE) {
+		tmp = (sds->max_load * sds->busiest->sgp->power);
+		do_div(tmp, sds->this->sgp->power);
+	}
+	else {
+		tmp = (sds->busiest_load_per_task
Re: [RFC PATCH 0/5] enable runnable load avg in load balance
Hi everyone,

On 11/27/2012 12:33 AM, Benjamin Segall wrote: So, I've been trying out using the runnable averages for load balance in a few ways, but haven't actually gotten any improvement on the benchmarks I've run. I'll post my patches once I have the numbers down, but it's generally been about half a percent to 1% worse on the tests I've tried. The basic idea is to use (cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg) (which should be equivalent to doing load_avg_contrib on the rq) for cfs_rqs and possibly the rq, and p->se.load.weight * p->se.avg.runnable_avg_sum / period for tasks.

Why should cfs_rq->blocked_load_avg be included to calculate the load on the rq? Blocked entities do not contribute to the active load of the cpu, right? When a task goes to sleep its load is removed from cfs_rq->load.weight as well, in account_entity_dequeue(). This means the load balancer considers a sleeping entity as *not* contributing to the active runqueue load. So shouldn't the new metric consider cfs_rq->runnable_load_avg alone?

I have not yet tried including wake_affine, so this has just involved h_load (task_load_down and task_h_load), as that makes everything (besides wake_affine) be based on either the new averages or the rq->cpu_load averages.
Yeah, I have been trying to evaluate the performance as well, but with cfs_rq->runnable_load_avg as the rq load contribution and the task load, same as mentioned above. I have not completed my experiments, but I would expect some significant performance difference due to the scenario below:

                  Task3(10% task)
Task1(100% task)  Task4(10% task)
Task2(100% task)  Task5(10% task)
----------------  ---------------  ---------------
      CPU1             CPU2             CPU3

When cpu3 triggers load balancing:

CASE1: without PJT's metric the following loads will be perceived: CPU1: 2048, CPU2: 3042. Therefore CPU2 might be relieved of one task to result in:

Task1(100% task)  Task4(10% task)
Task2(100% task)  Task5(10% task)  Task3(10% task)
----------------  ---------------  ---------------
      CPU1             CPU2             CPU3

CASE2: with PJT's metric the following loads will be perceived: CPU1: 2048, CPU2: 1022. Therefore CPU1 might be relieved of one task to result in:

                  Task3(10% task)
                  Task4(10% task)
Task2(100% task)  Task5(10% task)  Task1(100% task)
----------------  ---------------  ---------------
      CPU1             CPU2             CPU3

The differences between the above two scenarios include:
1. Reduced latency for Task1 in CASE2, which is the right task to be moved in the above scenario.
2. Even though in the former case CPU2 is relieved of one task, it is of no use if Task3 is going to sleep most of the time. This might result in more load balancing on behalf of cpu3.

What do you guys think?

Thank you
Regards
Preeti U Murthy
Re: [RFC PATCH 0/5] enable runnable load avg in load balance
Hi,

On 11/27/2012 11:44 AM, Alex Shi wrote: On 11/27/2012 11:08 AM, Preeti U Murthy wrote: Hi everyone, On 11/27/2012 12:33 AM, Benjamin Segall wrote:

So, I've been trying out using the runnable averages for load balance in a few ways, but haven't actually gotten any improvement on the benchmarks I've run. I'll post my patches once I have the numbers down, but it's generally been about half a percent to 1% worse on the tests I've tried. The basic idea is to use (cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg) (which should be equivalent to doing load_avg_contrib on the rq) for cfs_rqs and possibly the rq, and p->se.load.weight * p->se.avg.runnable_avg_sum / period for tasks.

Why should cfs_rq->blocked_load_avg be included to calculate the load on the rq? Blocked entities do not contribute to the active load of the cpu, right? When a task goes to sleep its load is removed from cfs_rq->load.weight as well, in account_entity_dequeue(). This means the load balancer considers a sleeping entity as *not* contributing to the active runqueue load. So shouldn't the new metric consider cfs_rq->runnable_load_avg alone?

I have not yet tried including wake_affine, so this has just involved h_load (task_load_down and task_h_load), as that makes everything (besides wake_affine) be based on either the new averages or the rq->cpu_load averages.
Yeah, I have been trying to evaluate the performance as well, but with cfs_rq->runnable_load_avg as the rq load contribution and the task load, same as mentioned above. I have not completed my experiments, but I would expect some significant performance difference due to the scenario below:

                  Task3(10% task)
Task1(100% task)  Task4(10% task)
Task2(100% task)  Task5(10% task)
----------------  ---------------  ---------------
      CPU1             CPU2             CPU3

When cpu3 triggers load balancing:

CASE1: without PJT's metric the following loads will be perceived: CPU1: 2048, CPU2: 3042. Therefore CPU2 might be relieved of one task to result in:

Task1(100% task)  Task4(10% task)
Task2(100% task)  Task5(10% task)  Task3(10% task)
----------------  ---------------  ---------------
      CPU1             CPU2             CPU3

CASE2: with PJT's metric the following loads will be perceived: CPU1: 2048, CPU2: 1022. Therefore CPU1 might be relieved of one task to result in:

                  Task3(10% task)
                  Task4(10% task)
Task2(100% task)  Task5(10% task)  Task1(100% task)
----------------  ---------------  ---------------
      CPU1             CPU2             CPU3

The differences between the above two scenarios include:
1. Reduced latency for Task1 in CASE2, which is the right task to be moved in the above scenario.
2. Even though in the former case CPU2 is relieved of one task, it is of no use if Task3 is going to sleep most of the time. This might result in more load balancing on behalf of cpu3.

What do you guys think?

It looks fine. Just a question on CASE1. Usually cpu2 with three 10%-load tasks will show nr_running == 0 about 70% of the time. So, how do you make rq->nr_running = 3 always? I guess in most cases load balancing will pull task1 or task2 to cpu2 or cpu3, not the result of CASE1.

That's right, Alex. Most of the time the nr_running on CPU2 will be shown to be 0, or perhaps 1 or 2. But whether you use PJT's metric or not, the load balancer in such circumstances will behave the same, as you have rightly pointed out: pull task1/2 to cpu2/3.
But the issue usually arises when all three wake up at the same time on cpu2, portraying wrongly that the load is 3042 if PJT's metric is not used. This could lead to load balancing one of these short-running tasks, as shown by CASE1. This is the situation where, in my opinion, PJT's metric could make a difference.

Regards
Preeti U Murthy
sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
a performance improvement. Please do let me know your suggestions. This will greatly help take the right steps from here on in achieving the correct integration.

Thank you
Regards
Preeti U Murthy
Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
Hi Mike,

Thank you very much for your feedback. Considering your suggestions, I have posted a proposed solution to prevent select_idle_sibling() from becoming a disadvantage to normal load balancing, and rather have it aid it.

**This patch is *without* the enablement of the per-entity load tracking metric.** This is with an intention to correct the existing select_idle_sibling() mess before going ahead.

---BEGIN PATCH

Subject: [PATCH] sched: Merge select_idle_sibling with the behaviour of SD_BALANCE_WAKE

The function of select_idle_sibling() is to place the woken-up task in the vicinity of the waking cpu or on the previous cpu, depending on what wake_affine() says, this placement being only in an idle group. If an idle group is not found, the fallback cpu is either the waking cpu or the previous cpu accordingly.

This results in the runqueue of the waking cpu or the previous cpu getting overloaded when the system is committed, which is a latency hit to these tasks.

What is required is that the newly woken-up tasks be placed close to the wake-up cpu or the previous cpu, whichever is best, to avoid a latency hit and cache coldness respectively. This is achieved with wake_affine() deciding which cache domain the task should be placed on.

Once this is decided, instead of searching for a completely idle group, let us search for the idlest group. This will anyway return a completely idle group if it exists, and its mechanism will fall back to what select_idle_sibling() was doing. But if this fails, find_idlest_group() continues the search for a relatively more idle group.

The argument could be that we wish to avoid migration of the newly woken-up task to any other group unless it is completely idle. But in this case, to begin with we choose a sched domain within which a migration could be less harmful. We enable the SD_BALANCE_WAKE flag on the SMT and MC domains to co-operate with the same.
This patch is based on the tip tree without enabling the per-entity load tracking. This is with an intention to clear up the select_idle_sibling() mess before introducing the metric.
---
 include/linux/topology.h |  4 ++-
 kernel/sched/fair.c      | 61 +-
 2 files changed, 9 insertions(+), 56 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..eeb309e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -95,7 +95,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
@@ -127,7 +127,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29cdbf..c33eda7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3303,58 +3303,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }

-/*
- * Try and locate an idle CPU in the sched_domain.
- */
-static int select_idle_sibling(struct task_struct *p, int target)
-{
-	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
-	struct sched_domain *sd;
-	struct sched_group *sg;
-	int i;
-
-	/*
-	 * If the task is going to be woken-up on this cpu and if it is
-	 * already idle, then it is the right target.
-	 */
-	if (target == cpu && idle_cpu(cpu))
-		return cpu;
-
-	/*
-	 * If the task is going to be woken-up on the cpu where it previously
-	 * ran and if it is currently idle, then it the right target.
-	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
-		return prev_cpu;
-
-	/*
-	 * Otherwise, iterate the domains and find an elegible idle cpu.
-	 */
-	sd = rcu_dereference(per_cpu(sd_llc, target));
-
Re: [PATCH 07/18] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
Hi Alex,

On 12/10/2012 01:52 PM, Alex Shi wrote:

They are the base values in load balance; update them with the rq runnable load average, then the load balancer will consider the runnable load avg naturally.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/core.c | 4 ++--
 kernel/sched/fair.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 96fa5f1..0ecb907 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2487,7 +2487,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 void update_idle_cpu_load(struct rq *this_rq)
 {
 	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
-	unsigned long load = this_rq->load.weight;
+	unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
 	unsigned long pending_updates;

 	/*
@@ -2537,7 +2537,7 @@ static void update_cpu_load_active(struct rq *this_rq)
 	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 	 */
 	this_rq->last_load_update_tick = jiffies;
-	__update_cpu_load(this_rq, this_rq->load.weight, 1);
+	__update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);

 	calc_load_account_active(this_rq);
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61c8d24..6d893a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2680,7 +2680,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
-	return cpu_rq(cpu)->load.weight;
+	return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;

I was wondering why you have typecast cfs.runnable_load_avg to unsigned long. Have you looked into why it was declared as u64 in the first place?
 }

 /*
@@ -2727,7 +2727,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);

 	if (nr_running)
-		return rq->load.weight / nr_running;
+		return rq->cfs.runnable_load_avg / nr_running;

rq->cfs.runnable_load_avg is of type u64; you will need to typecast it here also, right? How does this division work, given that the return type is unsigned long?

 	return 0;
 }

Regards
Preeti U Murthy
Re: [PATCH 08/18] sched: consider runnable load average in move_tasks
Hi Alex,

On 12/10/2012 01:52 PM, Alex Shi wrote:

Besides using the runnable load average in the background, move_tasks() is also a key function in load balance. We need to consider the runnable load average in it in order to make an apples-to-apples load comparison.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 11 ++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d893a6..bbb069c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3741,6 +3741,15 @@ static unsigned long task_h_load(struct task_struct *p);

 static const unsigned int sched_nr_migrate_break = 32;

+static unsigned long task_h_load_avg(struct task_struct *p)
+{
+	u32 period = p->se.avg.runnable_avg_period;
+	if (!period)
+		return 0;
+
+	return task_h_load(p) * p->se.avg.runnable_avg_sum / period;

This might result in an overflow, considering you are multiplying two 32-bit integers. Below is how this is handled in __update_task_entity_contrib() in kernel/sched/fair.c:

	u32 contrib;
	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
	contrib /= (se->avg.runnable_avg_period + 1);
	se->avg.load_avg_contrib = scale_load(contrib);

Also, why can't p->se.avg.load_avg_contrib be used directly as the return value of task_h_load_avg()? It is already updated in update_task_entity_contrib() and update_group_entity_contrib().

+}
+
 /*
  * move_tasks tries to move up to imbalance weighted load from busiest to
  * this_rq, as part of a balancing operation within domain sd.
@@ -3776,7 +3785,7 @@ static int move_tasks(struct lb_env *env)
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 			goto next;

-		load = task_h_load(p);
+		load = task_h_load_avg(p);

 		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
 			goto next;

Regards
Preeti U Murthy
Re: [PATCH 01/18] sched: select_task_rq_fair clean up
On 12/11/2012 05:23 PM, Alex Shi wrote: On 12/11/2012 02:30 PM, Preeti U Murthy wrote: On 12/11/2012 10:58 AM, Alex Shi wrote: On 12/11/2012 12:23 PM, Preeti U Murthy wrote: Hi Alex, On 12/10/2012 01:52 PM, Alex Shi wrote:

It is impossible to miss a task-allowed cpu in an eligible group.

The one thing I am concerned with here is if there is a possibility of the task changing its tsk_cpus_allowed() while this code is running, i.e. find_idlest_group() finds an idle group, then the tsk_cpus_allowed() for the task changes, perhaps by the user himself, which might not include the cpus in the idle group. After this, find_idlest_cpu() is called. I mean a race condition, in short. Then we might not have an eligible cpu in that group, right?

Your worry makes sense, but the code handles the situation: in select_task_rq(), it will check the allowed cpus again. If the answer is no, it will fall back to the old cpu. And since find_idlest_group() only returns a different group which excludes the old cpu, it is also impossible to find a new cpu same as the old cpu.

I doubt this will work correctly. Consider the following situation: the sched domain walk begins with the sd that encloses both socket1 and socket2.

cpu0 cpu1 | cpu2 cpu3
----------|----------
 socket1  |  socket2

old cpu = cpu1

Iteration 1:
1. find_idlest_group() returns socket2 to be idlest.
2. the task changes tsk_allowed_cpus to {0,1}
3. find_idlest_cpu() returns cpu2

* without your patch
1. the condition after find_idlest_cpu() returns -1, and sd->child is chosen, which happens to be socket1
2. in the next iteration, find_idlest_group() and find_idlest_cpu() will probably choose cpu0, which happens to be idler than cpu1 and is in tsk_allowed_cpus.

Thanks for the question, Preeti! :) Yes, with more iterations you have a better chance of getting a task-allowed cpu in select_task_rq_fair(). But how often does that situation happen, and how much gain do you get here? With the LCPU count increasing, many, many iterations cause a scalability issue; that is what the simplified forking patchset is for.
And that is why there is a 10% performance gain on hackbench process/thread. And if you insist on not wanting to miss your chance in strf(), the current iterations are still not enough: how do you know the idlest cpu is still the idlest after this function finishes? How do you ensure the allowed cpus won't be changed again? A quick snapshot is enough for balancing here; we still have periodic balancing.

Hmm, ok, let me look at this more closely.

* with your patch
1. the condition after find_idlest_cpu() does not exist, therefore a sched domain to which cpu2 belongs is chosen; this is socket2 (under the for_each_domain() loop).
2. in the next iteration, find_idlest_group() returns NULL, because there is no cpu which intersects with tsk_allowed_cpus.
3. in select_task_rq(), the fallback cpu is chosen even though an idle cpu existed.

So my concern is that though select_task_rq() checks tsk_cpus_allowed(), you might end up choosing a different path of sched domains compared to without this patch, as shown above. In short, without the if (new_cpu == -1) condition we might get misled into doing unnecessary iterations over the wrong sched domains in select_task_rq_fair(). (Think about situations when not all the cpus of socket2 are disallowed by the task; then there will be more iterations on the wrong path of sched domains before exit, compared to what is shown above.)

After reading the first 4 patches, I believe you will find the patchset is trying to reduce iterations, not increase them.

Right, sorry about not noticing this.

Regards
Preeti U Murthy
[RFC v2 PATCH 0/2] sched: Integrating Per-entity-load-tracking with the core scheduler
This approach aims to retain the current behavior of the load balancer, with the change being only in the metric consumed during load balancing, and without unnecessary introduction of new variables. This RFC has been posted to evaluate its design; to see if this is the right way to go about introducing the per-entity-load-tracking metric for its consumers, in this specific case the load balancer. Once the design has been approved of, I can go on to testing it.

The patchset has been based off tip master, HEAD at commit 8a1d31c703d, Subject: Merge branch 'x86/urgent'.

Grateful to Peter Zijlstra and Ingo Molnar for their valuable feedback on v1 of the RFC, which was the foundation for this version.

PATCH[1/2] aims at enabling the usage of Per-Entity-Load-Tracking for load balancing.
PATCH[2/2] The crux of the patchset lies here.

---

Preeti U Murthy (2):
      sched: Revert temporary FAIR_GROUP_SCHED dependency for load-tracking
      sched: Use Per-Entity-Load-Tracking metric for load balancing

 include/linux/sched.h |  9 +-
 kernel/sched/core.c   | 19 +---
 kernel/sched/fair.c   | 76 +
 kernel/sched/sched.h  |  9 +-
 4 files changed, 43 insertions(+), 70 deletions(-)

--
Preeti U Murthy
[RFC v2 PATCH 1/2] sched: Revert temporary FAIR_GROUP_SCHED dependency for load-tracking
Now that we need the per-entity load tracking for load balancing, trivially revert the patch which introduced the FAIR_GROUP_SCHED dependence for load tracking.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 include/linux/sched.h |  7 +--
 kernel/sched/core.c   |  7 +--
 kernel/sched/fair.c   | 12 +---
 kernel/sched/sched.h  |  7 ---
 4 files changed, 3 insertions(+), 30 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 03be150..087dd20 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1169,12 +1169,7 @@ struct sched_entity {
 	/* rq owned by this entity/group: */
 	struct cfs_rq		*my_q;
 #endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#if defined(CONFIG_SMP)
 	/* Per-entity load-tracking */
 	struct sched_avg	avg;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2e077c..24d8b9b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1526,12 +1526,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#if defined(CONFIG_SMP)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2cebc81..a9cdc8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1149,8 +1149,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#if defined(CONFIG_SMP)
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
@@ -3503,12 +3502,6 @@ unlock:
 }

 /*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3531,7 +3524,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
 	}
 }
-#endif
 #endif /* CONFIG_SMP */

 static unsigned long
@@ -6416,9 +6408,7 @@ const struct sched_class fair_sched_class = {

 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	.migrate_task_rq	= migrate_task_rq_fair,
-#endif
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 508e77e..bfd004a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -226,12 +226,6 @@ struct cfs_rq {
 #endif

 #ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
 	 * CFS Load tracking
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -241,7 +235,6 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */

 /* These always depend on CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
[RFC v2 PATCH 2/2] sched: Use Per-Entity-Load-Tracking metric for load balancing
Currently the load balancer weighs a task based upon its priority, and this weight consequently gets added up to the weight of the run queue that it is on. It is this weight of the runqueue that sums up to a sched group's load, which is used to decide the busiest or the idlest group and the runqueue thereof.

The Per-Entity-Load-Tracking metric however measures how long a task has been runnable over the duration of its lifetime. This gives us a hint of the amount of CPU time that the task can demand. This metric takes care of the task priority as well. Therefore, apart from the priority of a task, we also have an idea of the live behavior of the task. This seems to be a more realistic metric to use to compute the task weight which adds up to the run queue weight and the weight of the sched group. Consequently they can be used for load balancing.

The semantics of load balancing are left untouched. The two functions load_balance() and select_task_rq_fair() perform the task of load balancing. These two paths have been browsed through in this patch to make the necessary changes. weighted_cpuload() and task_h_load() provide the run queue weight and the weight of the task respectively. They have been modified to provide the Per-Entity-Load-Tracking metric as relevant for each. The rest of the modifications had to be made to suit these two changes.

The Completely Fair Scheduler class is the only sched_class which contributes to the run queue load. Therefore rq->load.weight == cfs_rq->load.weight when the cfs_rq is the root cfs_rq (rq->cfs) of the hierarchy. When replacing this with the Per-Entity-Load-Tracking metric, cfs_rq->runnable_load_avg needs to be used, as this is the right reflection of the run queue load when the cfs_rq is the root cfs_rq (rq->cfs) of the hierarchy. This metric reflects the percentage uptime of the tasks that are queued on it and hence that contribute to the load. Thus cfs_rq->runnable_load_avg replaces the metric earlier used in weighted_cpuload().
The task load is aptly captured by se.avg.load_avg_contrib, which captures the runnable time vs the alive time of the task against its priority. This metric replaces the earlier metric used in task_h_load().

The consequent changes appear as data type changes for the helper variables; they abound in number, because cfs_rq->runnable_load_avg needs to be big enough to capture the tasks' load often and accurately.

The following patch does not consider CONFIG_FAIR_GROUP_SCHED and CONFIG_SCHED_NUMA. This is done so as to evaluate this approach starting from the simplest scenario. Earlier discussions can be found in the link below.

Link: https://lkml.org/lkml/2012/10/25/162
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 include/linux/sched.h |    2 +-
 kernel/sched/core.c   |   12 +-
 kernel/sched/fair.c   |   64 +-
 kernel/sched/sched.h  |    2 +-
 4 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 087dd20..302756e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -924,7 +924,7 @@ struct sched_domain {
 	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
-	unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
+	u64 lb_imbalance[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24d8b9b..4dea057 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2415,8 +2415,8 @@ static const unsigned char
  * would be when CPU is idle and so we just decay the old load without
  * adding any new load.
  */
-static unsigned long
-decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
+static u64
+decay_load_missed(u64 load, unsigned long missed_updates, int idx)
 {
 	int j = 0;
@@ -2444,7 +2444,7 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
  * scheduler tick (TICK_NSEC). With tickless idle this will not be called
  * every tick. We fix it up based on jiffies.
  */
-static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
+static void __update_cpu_load(struct rq *this_rq, u64 this_load,
 			      unsigned long pending_updates)
 {
 	int i, scale;
@@ -2454,7 +2454,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 	/* Update our load: */
 	this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */
 	for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
-		unsigned long old_load, new_load;
+		u64 old_load, new_load;

 		/* scale is effectively 1 << i now, and >> i divides by scale */
@@ -2496,7 +2496,7 @@ static void __update_cpu_load(struct rq
Re: [RFC v2 PATCH 2/2] sched: Use Per-Entity-Load-Tracking metric for load balancing
Hi Vincent,

Thank you for your review.

On 11/15/2012 11:43 PM, Vincent Guittot wrote:

Hi Preeti,

On 15 November 2012 17:54, Preeti U Murthy pre...@linux.vnet.ibm.com wrote:

Currently the load balancer weighs a task based upon its priority, and this weight consequently gets added up to the weight of the run queue that it is on. It is this weight of the runqueue that sums up to a sched group's load, which is used to decide the busiest or the idlest group and the runqueue thereof.

The Per-Entity-Load-Tracking metric however measures how long a task has been runnable over the duration of its lifetime. This gives us a hint of the amount of CPU time that the task can demand. This metric takes care of the task priority as well. Therefore, apart from the priority of a task, we also have an idea of the live behavior of the task. This seems to be a more realistic metric to use to compute the task weight which adds up to the run queue weight and the weight of the sched group. Consequently they can be used for load balancing.

The semantics of load balancing are left untouched. The two functions load_balance() and select_task_rq_fair() perform the task of load balancing. These two paths have been browsed through in this patch to make the necessary changes. weighted_cpuload() and task_h_load() provide the run queue weight and the weight of the task respectively. They have been modified to provide the Per-Entity-Load-Tracking metric as relevant for each. The rest of the modifications had to be made to suit these two changes.

The Completely Fair Scheduler class is the only sched_class which contributes to the run queue load. Therefore rq->load.weight == cfs_rq->load.weight when the cfs_rq is the root cfs_rq (rq->cfs) of the hierarchy. When replacing this with the Per-Entity-Load-Tracking metric, cfs_rq->runnable_load_avg needs to be used, as this is the right reflection of the run queue load when the cfs_rq is the root cfs_rq (rq->cfs) of the hierarchy. This metric reflects the percentage uptime of the tasks that are queued on it and hence that contribute to the load. Thus cfs_rq->runnable_load_avg replaces the metric earlier used in weighted_cpuload().

The task load is aptly captured by se.avg.load_avg_contrib, which captures the runnable time vs the alive time of the task against its priority. This metric replaces the earlier metric used in task_h_load().

The consequent changes appear as data type changes for the helper variables; they abound in number, because cfs_rq->runnable_load_avg needs to be big enough to capture the tasks' load often and accurately.

You are now using cfs_rq->runnable_load_avg instead of cfs_rq->load.weight for the calculation of cpu_load, but cfs_rq->runnable_load_avg is smaller than or equal to the cfs_rq->load.weight value. This implies that the new value is smaller than or equal to the old statistic, so you should be able to keep the same variable width for the computation of cpu_load.

Right. But cfs_rq->runnable_load_avg is a 64-bit unsigned integer as per the per-entity load-tracking patchset. I could not figure out why this is the case although, as you mention, its value will not exceed cfs_rq->load.weight. In order to retain the data type of cfs_rq->runnable_load_avg as it is, these changes had to be made to suit it. It would be good if someone would clarify why it is a 64-bit integer; it will save a lot of trouble if we could consider this the same length as cfs_rq->load.weight. Ben, Paul? Can you clarify this point?
The following patch does not consider CONFIG_FAIR_GROUP_SCHED and CONFIG_SCHED_NUMA. This is done so as to evaluate this approach starting from the simplest scenario. Earlier discussions can be found in the link below.

Link: https://lkml.org/lkml/2012/10/25/162
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 include/linux/sched.h |    2 +-
 kernel/sched/core.c   |   12 +-
 kernel/sched/fair.c   |   64 +-
 kernel/sched/sched.h  |    2 +-
 4 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 087dd20..302756e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -924,7 +924,7 @@ struct sched_domain {
 	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
-	unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
+	u64 lb_imbalance[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24d8b9b..4dea057 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2415,8 +2415,8 @@ static const unsigned char
  * would be when CPU is idle and so we just decay the old load without
  * adding any new load.
  */
-static unsigned long
Re: [RFC PATCH 4/5] sched: consider runnable load average in wake_affine and move_tasks
Hi Alex,

On 11/17/2012 06:34 PM, Alex Shi wrote:

Besides using the runnable load average in the background, wake_affine and move_tasks are also the key functions in load balance. We need to consider the runnable load average in them in order to have an apples-to-apples load comparison in load balance.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 16 ++++++++++------
 1 files changed, 10 insertions(+), 6 deletions(-)

@@ -4229,7 +4233,7 @@ static int move_tasks(struct lb_env *env)
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 			goto next;

-		load = task_h_load(p);
+		load = task_h_load(p) * p->se.avg.load_avg_contrib;

Shouldn't the above be just load = p->se.avg.load_avg_contrib? This metric has already considered p->se.load.weight. task_h_load(p) returns the same.

 		if (sched_feat(LB_MIN) && load < 16 && !env->failed)
 			goto next;

Regards
Preeti U Murthy
Re: [RFC PATCH 0/5] enable runnable load avg in load balance
Hi Alex,

On 11/17/2012 06:34 PM, Alex Shi wrote:

This patchset tries to consider the runnable load avg when doing the cpu load comparison in load balance. I had seen Preeti's enabling before this patch was finished, but I still think considering the runnable load avg on the rq may be a more natural way. BTW, I am thinking: are 2 decays for cpu_load too complicated? One for the runnable time, another for CPU_LOAD_IDX. I think I missed the decay reason for CPU_LOAD_IDX. Could anyone do me the favor of giving some hints on this?

The decay happening for CPU_LOAD_IDX is *more coarse grained* than the decay that __update_entity_runnable_avg() is performing. While __update_cpu_load() decays the rq->load.weight *for every jiffy* (~4ms) passed so far without an update of the load, __update_entity_runnable_avg() decays the rq->load.weight *for every 1ms* when called from update_rq_runnable_avg().

Before the introduction of PJT's series, __update_cpu_load() seems to have been the only place where the decay of the older rq load was happening (so as to give the older load less importance in its relevance), but with the introduction of PJT's series, since the older rq load gets decayed in __update_entity_runnable_avg() in a more fine grained fashion, perhaps you are right: when CPU_LOAD_IDX gets updated, we don't need to decay the load once again here.

Best Regards!
Alex

[RFC PATCH 1/5] sched: get rq runnable load average for load balance
[RFC PATCH 2/5] sched: update rq runnable load average in time
[RFC PATCH 3/5] sched: using runnable load avg in cpu_load and
[RFC PATCH 4/5] sched: consider runnable load average in wake_affine
[RFC PATCH 5/5] sched: revert 'Introduce temporary FAIR_GROUP_SCHED

Regards
Preeti U Murthy
Re: [REPOST-v2] sched: Prevent wakeup to enter critical section needlessly
Hi Ivo,

On 11/19/2012 01:00 PM, Ivo Sieben wrote:

Check the waitqueue task list to be non-empty before entering the critical section. This prevents locking the spin lock needlessly in case the queue was empty, and therefore also prevents scheduling overhead on a PREEMPT_RT system.

Signed-off-by: Ivo Sieben meltedpiano...@gmail.com
---
a second repost of this patch v2: Can anyone respond? Did I apply the memory barrier correctly?

v2:
- We don't need the careful list empty check; a normal list empty check is sufficient: if you miss an update, it is just as if it happened a little later.
- Because of memory ordering problems we can observe an un-updated list administration. This can cause wait_event-like code to miss an event. Adding a memory barrier before checking the list to be empty will guarantee we evaluate a 100% updated list administration.

 kernel/sched/core.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..168a9b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3090,9 +3090,22 @@ void __wake_up(wait_queue_head_t *q, unsigned int mode,
 {
 	unsigned long flags;

-	spin_lock_irqsave(&q->lock, flags);
-	__wake_up_common(q, mode, nr_exclusive, 0, key);
-	spin_unlock_irqrestore(&q->lock, flags);
+	/*
+	 * We check for list emptiness outside the lock. This prevents the wake
+	 * up to enter the critical section needlessly when the task list is
+	 * empty.
+	 *
+	 * Placed a full memory barrier before checking list emptiness to make
+	 * 100% sure this function sees an up-to-date list administration.
+	 * Note that other code that manipulates the list uses a spin_lock and
+	 * therefore doesn't need additional memory barriers.
+	 */
+	smp_mb();
+	if (!list_empty(&q->task_list)) {
+		spin_lock_irqsave(&q->lock, flags);
+		__wake_up_common(q, mode, nr_exclusive, 0, key);
+		spin_unlock_irqrestore(&q->lock, flags);
+	}
 }
 EXPORT_SYMBOL(__wake_up);

Looks good to me.
Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com

Regards
Preeti U Murthy
Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
Hi Morten, Alex,

On 01/09/2013 11:51 PM, Morten Rasmussen wrote:

On Sat, Jan 05, 2013 at 08:37:34AM +0000, Alex Shi wrote:

I guess the search for a cpu from the bottom up in the domain tree comes from commit 3dbd5342074a1e (sched: multilevel sbe sbf); the purpose is balancing over tasks on all level domains. This balancing costs much if there are many domains/groups in a large system. And forcibly spreading tasks among different domains may cause performance issues due to bad locality. If we remove this code, we will get quick fork/exec/wake, plus better balancing among the whole system, which also reduces migrations in future load balancing. This patch increases the performance of hackbench by 10+% on my 4 socket NHM and SNB machines.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 20 +-------------------
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecfbf8e..895a3f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		goto unlock;
 	}

-	while (sd) {
+	if (sd) {
 		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
-		int weight;
-
-		if (!(sd->flags & sd_flag)) {
-			sd = sd->child;
-			continue;
-		}

 		if (sd_flag & SD_BALANCE_WAKE)
 			load_idx = sd->wake_idx;
@@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			goto unlock;

 		new_cpu = find_idlest_cpu(group, p, cpu);
-
-		/* Now try balancing at a lower domain level of new_cpu */
-		cpu = new_cpu;
-		weight = sd->span_weight;
-		sd = NULL;
-		for_each_domain(cpu, tmp) {
-			if (weight <= tmp->span_weight)
-				break;
-			if (tmp->flags & sd_flag)
-				sd = tmp;
-		}
-		/* while loop will break here if sd == NULL */

I agree that this should be a major optimization. I just can't figure out why the existing recursive search for an idle cpu switches to the new cpu near the end and then starts a search for an idle cpu in the new cpu's domain. Is this to handle some exotic sched domain configurations?
If so, they probably wouldn't work with your optimizations.

Let me explain my understanding of why the recursive search is the way it is.

              ________ sd0 ________
             |                     |
         ___sd1___             ___sd2___
        |         |           |         |
       sgx       sgy         sga       sgb

What the current recursive search is doing is this (assuming we start with sd0, the top level sched domain whose flags are rightly set): we find that sd1 is the idlest group, and a cpux1 in sgx is the idlest cpu.

We could have ideally stopped the search here. But the problem with this is that there is a possibility that sgx is more loaded than sgy, meaning the cpus in sgx are heavily imbalanced; say there are two cpus, cpux1 and cpux2, in sgx, where cpux2 is heavily loaded and cpux1 has recently gotten idle and load balancing has not come to its rescue yet. According to the search above, cpux1 is idle, but it is *not the right candidate for scheduling the forked task; it is the right candidate for relieving the load from cpux2* due to cache locality etc.

Therefore in the next recursive search we go one step inside sd1, the chosen idlest group candidate, which also happens to be the *next level sched domain for cpux1, the chosen idle cpu*. It then returns sgy as the idlest, perhaps, if the situation there happens to be better than what I have described for sgx, and an appropriate cpu there is chosen.

So in short, a bird's eye view of a large sched domain to choose the cpu would be very short sighted; we could end up creating imbalances within lower level sched domains. To avoid this, the recursive search plays safe and chooses the best idle group after viewing the large sched domain in detail. Therefore even I feel that this patch should be implemented only after thorough tests.

Morten

Regards
Preeti U Murthy
Re: [PATCH v3 02/22] sched: select_task_rq_fair clean up
On 01/05/2013 02:07 PM, Alex Shi wrote:

It is impossible to miss a task-allowed cpu in an eligible group. And since find_idlest_group only returns a different group, which excludes the old cpu, it is also impossible to find a new cpu that is the same as the old cpu.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..6d3a95d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3378,11 +3378,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		}

 		new_cpu = find_idlest_cpu(group, p, cpu);
-		if (new_cpu == -1 || new_cpu == cpu) {
-			/* Now try balancing at a lower domain level of cpu */
-			sd = sd->child;
-			continue;
-		}

 		/* Now try balancing at a lower domain level of new_cpu */
 		cpu = new_cpu;

Reviewed-by: Preeti U Murthy
Re: [PATCH v3 03/22] sched: fix find_idlest_group mess logical
On 01/05/2013 02:07 PM, Alex Shi wrote:

There are 4 situations in the function:
1, no task-allowed group; so min_load = ULONG_MAX, this_load = 0, idlest = NULL
2, only the local group task-allowed; so min_load = ULONG_MAX, this_load assigned, idlest = NULL
3, only a non-local task group allowed; so min_load assigned, this_load = 0, idlest != NULL
4, local group + another group are task-allowed; so min_load assigned, this_load assigned, idlest != NULL

The current logic will return NULL in the first 3 kinds of scenarios, and will still return NULL if the idlest group is heavier than the local group in the 4th situation. Actually, I thought groups in situations 2 and 3 are also eligible to host the task. And in the 4th situation, I agree to bias toward the local group. Hence this patch.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d3a95d..3c7b09a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3181,6 +3181,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int load_idx)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
+	struct sched_group *this_group = NULL;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;

@@ -3215,14 +3216,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,

 		if (local_group) {
 			this_load = avg_load;
-		} else if (avg_load < min_load) {
+			this_group = group;
+		}
+		if (avg_load < min_load) {
 			min_load = avg_load;
 			idlest = group;
 		}
 	} while (group = group->next, group != sd->groups);

-	if (!idlest || 100*this_load < imbalance*min_load)
-		return NULL;
+	if (this_group && idlest != this_group)
+		/* Bias toward our group again */
+		if (100*this_load < imbalance*min_load)
+			idlest = this_group;
+
 	return idlest;
 }

Reviewed-by: Preeti U Murthy
Re: [PATCH v3 04/22] sched: don't need go to smaller sched domain
On 01/05/2013 02:07 PM, Alex Shi wrote:

If the parent sched domain finds no cpu allowed for the task, neither will its child. So, go out to save the useless checking.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c7b09a..ecfbf8e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3378,10 +3378,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			load_idx = sd->wake_idx;

 		group = find_idlest_group(sd, p, cpu, load_idx);
-		if (!group) {
-			sd = sd->child;
-			continue;
-		}
+		if (!group)
+			goto unlock;

 		new_cpu = find_idlest_cpu(group, p, cpu);

Reviewed-by: Preeti U Murthy
Re: [PATCH v3 07/22] sched: set initial load avg of new forked task
On 01/05/2013 02:07 PM, Alex Shi wrote:

A new task has no runnable sum at its first runnable time, which makes burst forking select only a few idle cpus to put tasks on. Set the initial load avg of a new forked task to its load weight to resolve this issue.

Signed-off-by: Alex Shi alex@intel.com
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 2 +-
 kernel/sched/fair.c   | 11 +++++++++--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 206bb08..fb7aab5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1069,6 +1069,7 @@ struct sched_domain;
 #else
 #define ENQUEUE_WAKING		0
 #endif
+#define ENQUEUE_NEWTASK		8

 #define DEQUEUE_SLEEP		1

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66c1718..66ce1f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1705,7 +1705,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif

 	rq = __task_rq_lock(p);
-	activate_task(rq, p, 0);
+	activate_task(rq, p, ENQUEUE_NEWTASK);
 	p->on_rq = 1;
 	trace_sched_wakeup_new(p, true);
 	check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 895a3f4..5c545e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1503,8 +1503,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 					   struct sched_entity *se,
-					   int wakeup)
+					   int flags)
 {
+	int wakeup = flags & ENQUEUE_WAKEUP;
 	/*
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 	 * migration we use a negative decay count to track the remote decays
@@ -1538,6 +1539,12 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 		update_entity_load_avg(se, 0);
 	}

+	/*
+	 * set the initial load avg of a new task the same as its load
+	 * in order to avoid a burst fork making a few cpus too heavy
+	 */
+	if (flags & ENQUEUE_NEWTASK)
+		se->avg.load_avg_contrib = se->load.weight;
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
@@ -1701,7 +1708,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
+	enqueue_entity_load_avg(cfs_rq, se, flags);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);

I had seen in my experiments that forked tasks with an initial load of 0 would adversely affect the runqueue lengths. Since the load of these tasks takes some time to pick up, the cpus on which the forked tasks are scheduled can be candidates for dst_cpu many times, and the runqueue lengths increase considerably. This patch solves the issue by making the forked tasks contribute actively to the runqueue load.

Reviewed-by: Preeti U Murthy
Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
Hi Mike,

Thank you very much for your inputs. Just a few thoughts, so that we are clear about the problems so far in scheduler scalability and about the direction we ought to move in to correct them:

1. During fork or exec, the scheduler goes through find_idlest_group() and find_idlest_cpu() in select_task_rq_fair() by iterating through all domains. Why then was a similar approach not followed for wake up balancing? What was so different about wake ups (except that the woken up task had to remain close to the prev/waking cpu) that we had to introduce select_idle_sibling() in the first place?

2. To the best of my knowledge, the concept of the buddy cpu was introduced in select_idle_sibling() so as to avoid the entire package traversal and restrict it to the buddy cpus alone. But even during fork or exec, we iterate through all the sched domains, as I have mentioned above. Why did the buddy cpu solution not come to the rescue here as well?

3. So the correct problem stands at: avoid iterating through the entire package at the cost of less aggression in finding the idle cpu, or iterate through the package with the intention of finding the idlest cpu. To the best of my understanding, the former is your approach, or commit 37407ea7; the latter is what I tried to do. But as you have rightly pointed out, my approach will have scaling issues. In this light, what does your best_combined patch (below) look like? Do you introduce a cut-off value on the loads to decide which approach to take?

Meanwhile I will also try to run tbench and a few other benchmarks to find out why the results are as below. Will update you very soon on this.

Thank you

Regards
Preeti U Murthy

On 01/06/2013 10:02 PM, Mike Galbraith wrote:

On Sat, 2013-01-05 at 09:13 +0100, Mike Galbraith wrote:

I still have a 2.6-rt problem I need to find time to squabble with, but maybe I'll soonish see if what you did plus what I did combined works out on that 4x10 core box where current is _so_ unbelievably horrible.
Heck, it can't get any worse, and the restricted wake balance alone kinda sorta worked. Actually, I flunked copy/paste 101. Below (preeti) shows the real deal.

tbench, 3 runs, 30 secs/run
revert = 37407ea7 reverted

clients                   1        5       10       20       40       80
3.6.0.virgin          27.83   139.50  1488.76  4172.93  6983.71  8301.73
                      29.23   139.98  1500.22  4162.92  6907.16  8231.13
                      30.00   141.43  1500.09  3975.50  6847.24  7983.98

3.6.0+revert         281.08  1404.76  2802.44  5019.49  7080.97  8592.80
                     282.38  1375.70  2747.23  4823.95  7052.15  8508.45
                     270.69  1375.53  2736.29  5243.05  7058.75  8806.72

3.6.0+preeti          26.43   126.62  1027.23  3350.06  7004.22  7561.83
                      26.67   128.66   922.57  3341.73  7045.05  7662.18
                      25.54   129.20  1015.02  3337.60  6591.32  7634.33

3.6.0+best_combined  280.48  1382.07  2730.27  4786.20  6477.28  7980.07
                     276.88  1392.50  2708.23  4741.25  6590.99  7992.11
                     278.92  1368.55  2735.49  4614.99  6573.38  7921.75

3.0.51-0.7.9-default 286.44  1415.37  2794.41  5284.39  7282.57 13670.80

Something is either wrong with 3.6 itself, or the config I'm using, as max throughput is nowhere near where it should be (see default). On the bright side, integrating the two does show some promise.

-Mike
Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
Hi everyone,

On 01/07/2013 12:01 AM, Linus Torvalds wrote:

On Sat, Jan 5, 2013 at 11:54 PM, Alex Shi alex@intel.com wrote:

I just looked into the aim9 benchmark. In this case it forks 2000 tasks; after all tasks are ready, aim9 gives a signal, then all tasks burst into wakeup and run until all are finished. Since each of the tasks finishes very quickly, an imbalanced empty cpu may go to sleep till regular balancing gives it some new tasks. That causes the performance drop and more idle entering.

Sounds like for AIM (and possibly for other really bursty loads), we might want to do some load-balancing at wakeup time by *just* looking at the number of running tasks, rather than at the load average. Hmm?

During wake ups, the load average is not even queried, is it? wake_affine() is called to see in the affinity of which cpu (prev/waking) the task should go. But after that, select_idle_sibling() simply sees if there is an idle cpu to offload the task to. It looks like only in the periodic load balancing can we correct this scenario as of now, as pointed out below.

The load average is fundamentally always going to run behind a bit, and while you want to use it for long-term balancing, short-term you might want to do just a "if we have a huge amount of runnable processes, do a load balancing *now*". Where "huge amount" should probably be relative to the long-term load balancing (ie comparing the number of runnable processes on this CPU right *now* with the load average over the last second or so would show a clear spike, and a reason for quick action).

Linus

Earlier I had posted a patch to address this:

https://lkml.org/lkml/2012/10/25/156

update_sd_pick_busiest() checks whether a sched group has too many running tasks to be offloaded.
--START_PATCH--

The scenario which led to this patch is shown below: Consider Task1 and Task2 to be long running tasks and Tasks 3,4,5,6 to be short running tasks.

            Task3
            Task4
 Task1      Task5
 Task2      Task6
 ------     ------
 SCHED_GRP1 SCHED_GRP2

Normal load calculator would qualify SCHED_GRP2 as the candidate for sd->busiest due to the following loads that it calculates:

 SCHED_GRP1: 2048
 SCHED_GRP2: 4096

The load calculator using PJT's metric would probably qualify SCHED_GRP1 as the candidate for sd->busiest due to the following loads that it calculates:

 SCHED_GRP1: 3200
 SCHED_GRP2: 1156

This patch aims to strike a balance between the loads of the group and the number of tasks running on the group to decide the busiest group in the sched_domain. This means we will need to use PJT's metrics but with an additional constraint.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 kernel/sched/fair.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e02dad4..aafa3c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -165,7 +165,8 @@ void sched_init_granularity(void)
 #else
 # define WMULT_CONST	(1UL << 32)
 #endif
-
+#define NR_THRESHOLD 2
+#define LOAD_THRESHOLD 1
 #define WMULT_SHIFT	32
 
 /*
@@ -4169,6 +4170,7 @@ struct sd_lb_stats {
 	/* Statistics of the busiest group */
 	unsigned int  busiest_idle_cpus;
 	unsigned long max_load;
+	u64 max_sg_load; /* Equivalent of max_load but calculated using pjt's metric */
 	unsigned long busiest_load_per_task;
 	unsigned long busiest_nr_running;
 	unsigned long busiest_group_capacity;
@@ -4628,8 +4630,24 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 				   struct sched_group *sg,
 				   struct sg_lb_stats *sgs)
 {
-	if (sgs->avg_load <= sds->max_load)
-		return false;
+	/* Use PJT's metrics to qualify a sched_group as busy
+	 *
+	 * But a low load sched group may be queueing up many tasks.
+	 * So before dismissing a sched group with lesser load, ensure
+	 * that the number of processes on it is checked if it is
+	 * not too less loaded than the max load so far.
+	 *
+	 * But as of now as LOAD_THRESHOLD is 1, this check is a nop.
+	 * But we could vary LOAD_THRESHOLD suitably to bring in this check.
+	 */
+	if (sgs->avg_cfs_runnable_load <= sds->max_sg_load) {
+		if (sgs->avg_cfs_runnable_load > LOAD_THRESHOLD * sds->max_sg_load) {
+			if (sgs->sum_nr_running <= (NR_THRESHOLD + sds->busiest_nr_running))
+				return false;
+		} else {
+			return false;
+		}
+	}
 
 	if (sgs->sum_nr_running > sgs->group_capacity)
 		return true;
@@ -4708,6 +4726,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->this_idle_cpus
Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
On 01/07/2013 09:18 PM, Vincent Guittot wrote:
On 2 January 2013 05:22, Preeti U Murthy pre...@linux.vnet.ibm.com wrote:

Hi everyone,

I have been looking at how different workloads react when the per entity load tracking metric is integrated into the load balancer, and what the possible reasons for it are. I had posted the integration patch earlier: https://lkml.org/lkml/2012/11/15/391

Essentially what I am doing is:
1. I have disabled CONFIG_FAIR_GROUP_SCHED to make the analysis simple.
2. I have replaced cfs_rq->load.weight in weighted_cpuload() with cfs.runnable_load_avg, the active load tracking metric.
3. I have replaced se.load.weight in task_h_load() with se.load.avg.contrib, the per entity load tracking metric.
4. The load balancer will end up using these metrics.

After conducting experiments on several workloads I found that the performance of the workloads with the above integration would neither improve nor deteriorate, and this observation was consistent. Ideally the performance should have improved, considering that the metric does better tracking of load.

Let me explain with a simple example as to why we should ideally see a performance improvement: consider one 80% task and two 40% tasks.

With integration:
-----------------

        40%
 80%    40%
 cpu1   cpu2

The above will be the scenario when the tasks fork initially. And this is a perfectly balanced system, hence no more load balancing, and proper distribution of loads on the cpus.

Without integration:
--------------------

 40%                     40%
 80%    40%        80%   40%
 cpu1   cpu2  OR   cpu1  cpu2

Because the view is that all the tasks have the same load, the load balancer could ping-pong tasks between these two situations.

When I performed this experiment, I did not see an improvement in the performance in the former case. On further observation I found that the following was actually happening.
With integration:

        40%                -            40%
 80%    40%          80%   40%          80%   40%
 cpu1   cpu2         cpu1  cpu2         cpu1  cpu2

 Initially           40% task sleeps    40% task wakes up and
                                        select_idle_sibling()
                                        decides to wake it up
                                        on cpu1

This makes load balance trigger movement of the 40% task from cpu1 back to cpu2. Hence the stability that the load balancer was trying to achieve is gone. Hence the culprit boils down to select_idle_sibling(). How is it the culprit, and how is it hindering the performance of the workloads?

*What is the way ahead with the per entity load tracking metric in the load balancer then?*

In replies to a post by Paul in https://lkml.org/lkml/2012/12/6/105, he mentions the following:

It is my intuition that the greatest carnage here is actually caused by wake-up load-balancing getting in the way of periodic in establishing a steady state. I suspect more mileage would result from reducing the interference wake-up load-balancing has with steady state. The whole point of using blocked load is so that you can converge on a steady state where you don't NEED to move tasks. What disrupts this is we naturally prefer idle cpus on wake-up balance to reduce wake-up latency. I think the better answer is making these two processes, load balancing and select_idle_sibling(), more co-operative.

I had not realised how this would happen until I saw it happening in the above experiment.

Based on what Paul explained above, let us use the runnable load + the blocked load for calculating the load on a cfs runqueue, rather than just the runnable load (which is what I am doing now), and see its consequence.

 Initially            40% task sleeps
        40%                 -
 80%    40%           80%   40%
 cpu1   cpu2          cpu1  cpu2

So initially the load on cpu1 is say 80, and on cpu2 also it is 80. Now when the 40% task sleeps, the total load on cpu2 = runnable load + blocked load, which is still 80.

As a consequence, firstly, during periodic load balancing the load is not moved from cpu1 to cpu2 when the 40% task sleeps (it sees the load on cpu2 as 80 and not as 40).
Hence the above scenario remains the same. On wake up, what happens?

Here comes the point of making both load balancing and wake up balance (select_idle_sibling) co-operative. How about we always schedule the woken up task on the prev_cpu? This seems more sensible considering that load balancing considers blocked load as being a part of the load of cpu2.

Hi Preeti,

I'm not sure that we want such a steady state at the cores level, because we take advantage of migrating wake up tasks between cores that share their cache, as Matthew demonstrated. But I agree that reaching such a steady state at the cluster and CPU level is interesting. IMHO, you're right that taking the blocked load into consideration should minimize task migration between clusters, but it should not prevent fast task
Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
Hi Mike,

Thank you very much for such a clear and comprehensive explanation. So when I put together the problem and the proposed solution pieces in the current scheduler scalability, the following was what I found:

1. select_idle_sibling() is needed as an agent to correctly find the right cpu for wake up tasks to go to. "Correctly" would be to find an idle cpu at the lowest cost possible.
2. Cost could be lowered either by optimizing the order of searching for an idle cpu, or by restricting the search to a few cpus alone.
3. The former has the problem that it would not prevent bouncing tasks all over the domain sharing an L3 cache, which could potentially affect the fast moving tasks.
4. The latter has the problem that it is not aggressive enough in finding an idle cpu.

This is some tangled problem, but I think the solution at best could be smoothed into a flowchart:

 STEP1                       STEP2                   STEP3
 _____________________       ___________________    __________________
| See if the idle     |  No | Do we search the  |Yes| Optimized search |
| buddy is free at    |---->| sched domains     |-->| of the sched     |
| all sched domains   |     | for an idle cpu?  |   | domains for an   |
|_____________________|     |___________________|   | idle cpu         |
        | Yes                    | No: saturated    |__________________|
        v                        v    system                 |
 Return buddy cpu           Return prev_cpu                  v
                                                    Return target cpu

I don't know how well we can do STEP2; that seems to be the biggest challenge. STEP1 is the reverted commit 37407ea7, "sched: Improve scalability via 'CPU buddies', which withstand random perturbations", whose issues can hopefully be overcome by this flowchart. I have tried to tackle STEP3. STEP3 will not prevent bouncing, but a good STEP2 could tell us if it is worth the bounce. The STEP3 patch is given below:

***START PATCH***

sched: Reduce the overhead of select_idle_sibling

From: Preeti U Murthy pre...@linux.vnet.ibm.com

The traversal of the sched domains to find the idlest cpu is currently a costly operation in select_idle_sibling() due to large L3 caches. So the traversal had better be worth the effort.
In the current approach, in select_idle_sibling(), the sched domains are searched top to bottom, starting at the last level cache domain, and the search is for the idlest *group*. It is less likely that you will find a completely idle group at a higher level sched domain compared to a lower level sched domain. This will make the first few iterations fruitless most of the time. And it is less likely to find an idle *group* than an idle *cpu*.

This patch aims at finding an idle cpu. And since the iterations are from bottom to top, stopping at the last level cache domain, it tries to see if the task can be scheduled on a core which shares the l2 cache with the waking cpu/prev cpu first, before probing the cpus at the higher level domain, trying to make the cost of migration lower than the current approach. Also it tries to return an idle cpu as soon as it finds one, without querying the status of the other cpus in the sched group. This could potentially be much faster, unless a worst case such as not finding an idle cpu at every domain occurs.
---
 kernel/sched/fair.c | 33 +
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29cdbf..6123bca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3310,7 +3310,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
-	struct sched_domain *sd;
+	struct sched_domain *sd, *tmp;
 	struct sched_group *sg;
 	int i;
 
@@ -3332,24 +3332,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	sd = rcu_dereference(per_cpu(sd_llc, target));
-	for_each_lower_domain(sd) {
-		sg = sd->groups;
+	/*
+	 * Iterate the domains bottom up, to cash in on the shared lower level
+	 * cache advantages.
+	 * Since this iteration is costly, return any idle cpu and dont wait
+	 * for a completely idle group.
+	 */
+	for_each_domain(target, tmp) {
+		sg = tmp->groups;
 		do {
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;
 
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
+			if (idle_cpu(i)) {
+				target = i
Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
Here comes the point of making both load balancing and wake up balance (select_idle_sibling) co-operative. How about we always schedule the woken up task on the prev_cpu? This seems more sensible considering that load balancing considers blocked load as being a part of the load of cpu2.

Hi Preeti,

I'm not sure that we want such a steady state at the cores level, because we take advantage of migrating wake up tasks between cores that share their cache, as Matthew demonstrated. But I agree that reaching such a steady state at the cluster and CPU level is interesting. IMHO, you're right that taking the blocked load into consideration should minimize task migration between clusters, but it should not prevent fast task migration between cores that share their cache.

True, Vincent. But I think the one disadvantage, even at the cpu or cluster level, is that when we consider blocked load, we might prevent any more tasks from being scheduled on that cpu during periodic load balance if the blocked load is too much. This is very poor cpu utilization.

The blocked load of a cluster will be high if the blocked tasks have run recently. The contribution of a blocked task is divided by 2 every 32ms, so a high blocked load will be made up of recently running tasks, and long sleeping tasks will not influence the load balancing. The load balance period is between 1 tick (10ms for idle load balance on ARM) and up to 256ms (for busy load balance), so a high blocked load should imply some tasks that have run recently; otherwise your blocked load will be small and will not have a large influence on your load balance.

Makes a lot of sense. Also, we can consider steady states if the waking tasks have a specific waking pattern. I am not sure if we can risk hoping that the blocked task would wake up soon, or would wake up at time 'x' and utilize that cpu.

Ok, so you don't consider using blocked load in load balancing any more?
Hmm.. This has got me thinking. I thought the existing select_idle_sibling() problem of bouncing tasks all over the l3 package, and taking time to find an idle buddy, could be solved in isolation with PJT's metric. But that does not seem to be the case, considering the suggestions by you and Mike.

Currently there are so many approaches proposed to improve the scheduler that it is confusing as to how and which pieces fit well. Let me lay them down. Please do help me put them together.

Jigsaw Piece 1: Use PJT's metric in load balancing, and blocked load + runnable load as part of cpu load while load balancing.
Jigsaw Piece 2: select_idle_sibling() choosing the cpu to wake up tasks on.
Jigsaw Piece 3: The 'cpu buddy' concept to prevent bouncing of tasks.

Considering both yours and Mike's suggestions, what do you guys think of the following puzzle and solution?

*Puzzle*: Waking up tasks should not take too much time to find a cpu to run on, should not keep bouncing on too many cpus all over the package, and should try as much as possible not to create too much of an imbalance in the load distribution of the cpus.

*Solution:*

Place Jigsaw Piece 1 first: Use PJT's metric and blocked load + runnable load as part of cpu load while load balancing. (As time passes the blocked load becomes less significant on that cpu, hence load balancing will go on as usual.)

Place Jigsaw Piece 2 next: When tasks wake up, **use select_idle_sibling() to see only if you can migrate tasks between cores that share their cache**, IOW see if the cpu at the lowest level sched domain is idle. If it is, then schedule on it, and migrate_task_rq_fair() will remove the load from the prev_cpu; if it is not idle, then return the prev_cpu, which had already considered the blocked load as part of its overall load. Hence very little imbalance will be created.
*Possible End Picture*

Waking up tasks will not take time to find a cpu, since we are probing the cpus at only one sched domain level. The bouncing of tasks will be restricted to the core level. An imbalance will not be created, as the blocked load is also considered while load balancing.

*Concerns*

1. Is the wake up load balancing in this solution less aggressive, so as to harm throughput significantly?
2. Do we need Jigsaw Piece 3 at all?

Please do let me know what you all think. Thank you very much for your suggestions.

regards,
Vincent

Regards
Preeti U Murthy
Re: [PATCH] sched: Get rid of unnecessary checks from select_idle_sibling
On 01/09/2013 12:20 PM, Namhyung Kim wrote:
From: Namhyung Kim namhyung@lge.com

AFAICS the @target cpu of select_idle_sibling() is always either prev_cpu or this_cpu. So there is no need to check it again, and the conditionals can be consolidated.

Cc: Mike Galbraith efa...@gmx.de
Cc: Preeti U Murthy pre...@linux.vnet.ibm.com
Cc: Vincent Guittot vincent.guit...@linaro.org
Cc: Alex Shi alex@intel.com
Signed-off-by: Namhyung Kim namhy...@kernel.org
---
 kernel/sched/fair.c | 17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea8707234a..af665814c216 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3254,25 +3254,16 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
  */
 static int select_idle_sibling(struct task_struct *p, int target)
 {
-	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i;
 
 	/*
-	 * If the task is going to be woken-up on this cpu and if it is
-	 * already idle, then it is the right target.
-	 */
-	if (target == cpu && idle_cpu(cpu))
-		return cpu;
-
-	/*
-	 * If the task is going to be woken-up on the cpu where it previously
-	 * ran and if it is currently idle, then it the right target.
+	 * If the task is going to be woken-up on this cpu or the cpu where it
+	 * previously ran and it is already idle, then it is the right target.
 	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
-		return prev_cpu;
+	if (idle_cpu(target))
+		return target;
 
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.

If NUMA_TTWU_BIAS or NUMA_TTWU_TO is true (it is false by default), cpu/prev_cpu can be changed to be a random node_cpu (of the node that 'this_cpu' is on). In that case, even if the node cpu is idle, it would not be a viable target, it looks like. Maybe that is why select_idle_sibling() checks whether the target is prev_cpu/this cpu.
Regards
Preeti U Murthy
Re: [PATCH] sched: Get rid of unnecessary checks from select_idle_sibling
On 01/10/2013 11:19 AM, Namhyung Kim wrote:
Hi Preeti,

On Wed, 09 Jan 2013 13:51:00 +0530, Preeti U. Murthy wrote:
On 01/09/2013 12:20 PM, Namhyung Kim wrote:
From: Namhyung Kim namhyung@lge.com

AFAICS the @target cpu of select_idle_sibling() is always either prev_cpu or this_cpu. So there is no need to check it again, and the conditionals can be consolidated.

[snip]

If NUMA_TTWU_BIAS or NUMA_TTWU_TO is true (it is false by

I can't find those bits in the code. I've checked v3.8-rc2, next-20130110, tip/master and tip/numa/core but there's nothing like the above. Which tree are you saying?

default), cpu/prev_cpu can be changed to be a random node_cpu (of the node that 'this_cpu' is on). In that case, even if the node cpu is idle, it would not be a viable target, it looks like. Maybe that is why select_idle_sibling() checks whether the target is prev_cpu/this cpu.

Looking into tip/numa/core, I can see that there's code added for CONFIG_NUMA_BALANCING. But still, it seems nothing changed on the path from select_task_rq_fair() to select_idle_sibling() - i.e. if select_idle_sibling() is called, the target would be either prev_cpu or this cpu. Am I missing something?

Hi Namhyung,

Sorry about this. I did a git pull on tip/master. You are right, this piece of code is not there any more, so it is not an issue. You can go ahead and add my Reviewed-by. I don't find any potential problems now.

For your information however, the code piece which was earlier there is given below. It aimed to pull the tasks towards the node of the waking cpu if NUMA_TTWU_BIAS is enabled:

	if (sched_feat_numa(NUMA_TTWU_BIAS) && node != -1) {
		/*
		 * For fork,exec find the idlest cpu in the home-node.
		 */
		if (sd_flag & (SD_BALANCE_FORK|SD_BALANCE_EXEC)) {
			int node_cpu = numa_select_node_cpu(p, node);

			if (node_cpu < 0)
				goto find_sd;

			new_cpu = cpu = node_cpu;
			sd = per_cpu(sd_node, cpu);
			goto pick_idlest;
		}

		/*
		 * For wake, pretend we were running in the home-node.
		 */
		if (cpu_to_node(prev_cpu) != node) {
			int node_cpu = numa_select_node_cpu(p, node);

			if (node_cpu < 0)
				goto find_sd;

			if (sched_feat_numa(NUMA_TTWU_TO))
				cpu = node_cpu;
			else
				prev_cpu = node_cpu;
		}
	}

Thanks,
Namhyung

I apologise once again for the wrong feedback. Thank you

Regards
Preeti U Murthy
Re: [PATCH v2 2/3] sched: factor out code to should_we_balance()
On 08/02/2013 04:02 PM, Peter Zijlstra wrote:
On Fri, Aug 02, 2013 at 02:56:14PM +0530, Preeti U Murthy wrote:

You need to iterate over all the groups of the sched domain env->sd, and not just the first group of env->sd like you are doing above. This is to

I don't think so. IIRC, env->sd->groups always means the local group, so we don't need to find our group by iterating over all the groups.

Take a look at update_sd_lb_stats(). That should clarify this. There is an exclusive local_group check there. sd->groups points to the first group in the list of groups under this sd.

Take a look at: 88b8dac0a

Ah ok! Thanks for this pointer. Apologies for having overlooked the fact that sd->groups always points to the group to which the balance_cpu belongs, and that subsequent dst_cpus for retries of load balancing also belong to the same group as the balance_cpu. This patch thus looks fine to me.

Regards
Preeti U Murthy
Re: [PATCH v2 3/3] sched: clean-up struct sd_lb_stat
Hi Joonsoo,

On 08/05/2013 01:02 PM, Joonsoo Kim wrote:

+	if (busiest->group_imb) {
+		busiest->sum_weighted_load =
+			min(busiest->sum_weighted_load, sds->sd_avg_load);

Right here we get confused as to why the total load is being compared against load per task (although you are changing it to load per task above).

Yes, you are right. I will add load_per_task to struct sg_lb_stats.

@@ -4771,12 +4763,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * Be careful of negative numbers as they'll appear as very large values
 	 * with unsigned longs.
 	 */
-	max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
+	max_pull = min(busiest->avg_load - sds->sd_avg_load,
+		       load_above_capacity);

This is ok, but readability wise max_load is much better. max_load signifies the maximum load per task on a group in this sd, and avg_load signifies the total load per task across the sd. You are checking if there is an imbalance in the total load in the sd and trying to even it out across the sd. Here "busiest" does not convey this immediately.

You may be confused. max_load doesn't mean the maximum load per task. It means the busiest group's load per cpu power. And here max doesn't mean the maximum load in this sd; rather, it is the load of the busiest sg. IMO, busiest->avg_load conveys the proper meaning.

Yes, sorry about that. It is load per cpu power. My point was that max_load helped readability. Look at these two lines in your patch:

-	max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
+	max_pull = min(busiest->avg_load - sds->sd_avg_load,
+		       load_above_capacity);

The deleted line says that there is an imbalance in the load/cpu in some part of the sd (max_load), which is above the average load of the sd, and which we are trying to even out. Something like the below diagram for the sd load:

                 max_load
                 _/\_
 _______________/    \_______________ avg_load

I believe the line you have added makes it harder to understand this. Anyway, this is a minor issue, you can ignore it.
Regards
Preeti U Murthy
[RFC PATCH 5/5] cpuidle/ppc: Add longnap state to the idle states on powernv
This patch hooks into the existing broadcast framework, with the support that this patchset introduces for ppc, and the cpuidle driver backend for powernv (posted out recently by Deepthi Dharwar), to add the sleep state as one of the deep idle states in which the decrementer is switched off. However, in this patch we only emulate sleep by going into a state which does a nap with the decrementer interrupts disabled, termed "longnap". This keeps the focus of this series on the timer broadcast framework for ppc, which is required as a first step to enable sleep on ppc.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 arch/powerpc/platforms/powernv/processor_idle.c | 48 +++
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/processor_idle.c b/arch/powerpc/platforms/powernv/processor_idle.c
index f43ad91a..9aca502 100644
--- a/arch/powerpc/platforms/powernv/processor_idle.c
+++ b/arch/powerpc/platforms/powernv/processor_idle.c
@@ -9,16 +9,18 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
+#include <linux/clockchips.h>
 
 #include <asm/machdep.h>
 #include <asm/runlatch.h>
+#include <asm/time.h>
 
 struct cpuidle_driver powernv_idle_driver = {
 	.name =		"powernv_idle",
 	.owner =	THIS_MODULE,
 };
 
-#define MAX_IDLE_STATE_COUNT	2
+#define MAX_IDLE_STATE_COUNT	3
 
 static int max_idle_state = MAX_IDLE_STATE_COUNT - 1;
 static struct cpuidle_device __percpu *powernv_cpuidle_devices;
@@ -54,6 +56,43 @@ static int nap_loop(struct cpuidle_device *dev,
 	return index;
 }
 
+/* Emulate sleep, with long nap.
+ * During sleep, the core does not receive decrementer interrupts.
+ * Emulate sleep using long nap with decrementer interrupts disabled.
+ * This is an initial prototype to test the timer offload framework for ppc.
+ * We will eventually introduce the sleep state once the timer offload framework
+ * for ppc is stable.
+ */
+static int longnap_loop(struct cpuidle_device *dev,
+			struct cpuidle_driver *drv,
+			int index)
+{
+	int cpu = dev->cpu;
+
+	unsigned long lpcr = mfspr(SPRN_LPCR);
+
+	lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
+
+	/* exit powersave upon external interrupt, but not decrementer
+	 * interrupt. Emulate sleep.
+	 */
+	lpcr |= LPCR_PECE0;
+
+	if (cpu != bc_cpu) {
+		mtspr(SPRN_LPCR, lpcr);
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+		power7_nap();
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+	} else {
+		/* Wakeup on a decrementer interrupt, Do a nap */
+		lpcr |= LPCR_PECE1;
+		mtspr(SPRN_LPCR, lpcr);
+		power7_nap();
+	}
+
+	return index;
+}
+
 /*
  * States for dedicated partition case.
  */
@@ -72,6 +111,13 @@ static struct cpuidle_state powernv_states[MAX_IDLE_STATE_COUNT] = {
 		.exit_latency = 10,
 		.target_residency = 100,
 		.enter = &nap_loop },
+	{ /* LongNap */
+		.name = "LongNap",
+		.desc = "LongNap",
+		.flags = CPUIDLE_FLAG_TIME_VALID,
+		.exit_latency = 10,
+		.target_residency = 100,
+		.enter = &longnap_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,
[RFC PATCH 0/5] cpuidle/ppc: Timer offload framework to support deep idle states
On PowerPC, when CPUs enter deep idle states, their local timers are switched off. The responsibility of waking them up at their next timer event needs to be handed over to an external device. On PowerPC, we do not have an external device equivalent to the HPET, which is what is used for this on architectures like x86. Instead, we assign the local timer of one of the CPUs to do this job.

This patchset is an attempt to make use of the existing timer broadcast framework in the kernel to meet the above requirement, except that the tick broadcast device is the local timer of the boot CPU.

This patch series is ported on top of 3.11-rc1 + the cpuidle driver backend for powernv posted by Deepthi Dharwar recently.

The current design and implementation supports the ONESHOT tick mode but does not yet support the PERIODIC tick mode. This patchset is tested with NOHZ_FULL off.

Patch[1/5], Patch[2/5]: optimize the broadcast mechanism on ppc.
Patch[3/5]: introduces the core of the timer offload framework on powerpc.
Patch[4/5]: the cpu doing the broadcast should not go into tickless idle.
Patch[5/5]: adds a deep idle state to the cpuidle state table on powernv.

Patch[5/5] is the patch that ultimately makes use of the timer offload framework that Patch[1/5] to Patch[4/5] build.
---

Preeti U Murthy (3):
      cpuidle/ppc: Add timer offload framework to support deep idle states
      cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
      cpuidle/ppc: Add longnap state to the idle states on powernv

Srivatsa S.
Bhat (2):
      powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)
      powerpc: Implement broadcast timer interrupt as an IPI message

 arch/powerpc/include/asm/smp.h                  |  3 +
 arch/powerpc/include/asm/time.h                 |  3 +
 arch/powerpc/kernel/smp.c                       | 23 ++-
 arch/powerpc/kernel/time.c                      | 84 +++++++++++
 arch/powerpc/platforms/cell/interrupt.c         |  2 -
 arch/powerpc/platforms/powernv/Kconfig          |  1
 arch/powerpc/platforms/powernv/processor_idle.c | 48 +++++
 arch/powerpc/platforms/ps3/smp.c                |  2 -
 kernel/time/tick-sched.c                        |  7 ++
 9 files changed, 161 insertions(+), 12 deletions(-)

--
Signature
[RFC PATCH 1/5] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map to a common implementation - generic_smp_call_function_single_interrupt(). So, we can consolidate them and save one of the IPI message slots, (which are precious, since only 4 of those slots are available). So, implement the functionality of PPC_MSG_CALL_FUNC using PPC_MSG_CALL_FUNC_SINGLE itself and release its IPI message slot, so that it can be used for something else in the future, if desired. Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/kernel/smp.c | 12 +--- arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 4 files changed, 8 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index ffbaabe..51bf017 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu); * * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up * in /proc/interrupts will be wrong!!! 
--Troy */ -#define PPC_MSG_CALL_FUNCTION 0 +#define PPC_MSG_UNUSED 0 #define PPC_MSG_RESCHEDULE 1 #define PPC_MSG_CALL_FUNC_SINGLE 2 #define PPC_MSG_DEBUGGER_BREAK 3 diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 38b0ba6..bc41e9f 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -111,9 +111,9 @@ int smp_generic_kick_cpu(int nr) } #endif /* CONFIG_PPC64 */ -static irqreturn_t call_function_action(int irq, void *data) +static irqreturn_t unused_action(int irq, void *data) { - generic_smp_call_function_interrupt(); + /* This slot is unused and hence available for use, if needed */ return IRQ_HANDLED; } @@ -144,14 +144,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) } static irq_handler_t smp_ipi_action[] = { - [PPC_MSG_CALL_FUNCTION] = call_function_action, + [PPC_MSG_UNUSED] = unused_action, /* Slot available for future use */ [PPC_MSG_RESCHEDULE] = reschedule_action, [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { - [PPC_MSG_CALL_FUNCTION] = ipi call function, + [PPC_MSG_UNUSED] = ipi unused, [PPC_MSG_RESCHEDULE] = ipi reschedule, [PPC_MSG_CALL_FUNC_SINGLE] = ipi call function single, [PPC_MSG_DEBUGGER_BREAK] = ipi debugger, @@ -221,8 +221,6 @@ irqreturn_t smp_ipi_demux(void) all = xchg(info-messages, 0); #ifdef __BIG_ENDIAN - if (all (1 (24 - 8 * PPC_MSG_CALL_FUNCTION))) - generic_smp_call_function_interrupt(); if (all (1 (24 - 8 * PPC_MSG_RESCHEDULE))) scheduler_ipi(); if (all (1 (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE))) @@ -265,7 +263,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask) unsigned int cpu; for_each_cpu(cpu, mask) - do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); + do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); } #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 2d42f3b..28166e4 
100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -213,7 +213,7 @@ static void iic_request_ipi(int msg) void iic_request_IPIs(void) { - iic_request_ipi(PPC_MSG_CALL_FUNCTION); + iic_request_ipi(PPC_MSG_UNUSED); iic_request_ipi(PPC_MSG_RESCHEDULE); iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE); iic_request_ipi(PPC_MSG_DEBUGGER_BREAK); diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c index 4b35166..488f069 100644 --- a/arch/powerpc/platforms/ps3/smp.c +++ b/arch/powerpc/platforms/ps3/smp.c @@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void) * to index needs to be setup. */ - BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION != 0); + BUILD_BUG_ON(PPC_MSG_UNUSED != 0); BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1); BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2); BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 3/5] cpuidle/ppc: Add timer offload framework to support deep idle states
On ppc, in deep idle states, the lapic of the cpus gets switched off. Hence make use of the broadcast framework to wake up cpus in sleep state, except that on ppc, we do not have an external device such as the HPET, but we use the lapic of a cpu itself as the broadcast device. Instantiate two different clock event devices, one representing the lapic and another representing the broadcast device for each cpu. Such a cpu is forbidden to enter the deep idle state. The cpu which hosts the broadcast device will be referred to as the broadcast cpu in the changelogs of this patchset for convenience. For now, only the boot cpu's broadcast device gets registered as a clock event device along with the lapic. Hence this is the broadcast cpu. On the broadcast cpu, on each timer interrupt, apart from the regular lapic event handler, the broadcast handler is also called. We avoid the overhead of programming the lapic for a broadcast event specifically. The reason is to prevent multiple cpus from sending IPIs to program the lapic of the broadcast cpu for their next local event each time they go into a deep idle state. Apart from this there is no change in the way broadcast is handled today. On a broadcast ipi the event handler for a timer interrupt is called on the cpu in deep idle state to handle the local events. The current design and implementation of the timer offload framework supports the ONESHOT tick mode but not the PERIODIC mode. Signed-off-by: Preeti U.
Murthy pre...@linux.vnet.ibm.com --- arch/powerpc/include/asm/time.h | 3 + arch/powerpc/kernel/smp.c | 4 +- arch/powerpc/kernel/time.c | 79 arch/powerpc/platforms/powernv/Kconfig | 1 4 files changed, 84 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index c1f2676..936be0d 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -24,14 +24,17 @@ extern unsigned long tb_ticks_per_jiffy; extern unsigned long tb_ticks_per_usec; extern unsigned long tb_ticks_per_sec; extern struct clock_event_device decrementer_clockevent; +extern struct clock_event_device broadcast_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); extern void GregorianDay(struct rtc_time *tm); +extern void decrementer_timer_interrupt(void); extern void generic_calibrate_decr(void); extern void set_dec_cpu6(unsigned int val); +extern int bc_cpu; /* Some sane defaults: 125 MHz timebase, 1GHz processor */ extern unsigned long ppc_proc_freq; diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 6a68ca4..d3b7014 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -114,7 +114,7 @@ int smp_generic_kick_cpu(int nr) static irqreturn_t timer_action(int irq, void *data) { - timer_interrupt(); + decrementer_timer_interrupt(); return IRQ_HANDLED; } @@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void) #ifdef __BIG_ENDIAN if (all & (1 << (24 - 8 * PPC_MSG_TIMER))) - timer_interrupt(); + decrementer_timer_interrupt(); if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE))) scheduler_ipi(); if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE))) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 65ab9e9..8ed0fb3 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -42,6 +42,7 @@ #include <linux/timex.h> #include <linux/kernel_stat.h> #include <linux/time.h> +#include <linux/timer.h> #include <linux/init.h> #include <linux/profile.h> #include <linux/cpu.h> @@
-97,8 +98,11 @@ static struct clocksource clocksource_timebase = { static int decrementer_set_next_event(unsigned long evt, struct clock_event_device *dev); +static int broadcast_set_next_event(unsigned long evt, + struct clock_event_device *dev); static void decrementer_set_mode(enum clock_event_mode mode, struct clock_event_device *dev); +static void decrementer_timer_broadcast(const struct cpumask *mask); struct clock_event_device decrementer_clockevent = { .name = "decrementer", @@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = { .irq = 0, .set_next_event = decrementer_set_next_event, .set_mode = decrementer_set_mode, - .features = CLOCK_EVT_FEAT_ONESHOT, + .broadcast = decrementer_timer_broadcast, + .features = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT, }; EXPORT_SYMBOL(decrementer_clockevent); +struct clock_event_device broadcast_clockevent = { + .name = "broadcast", + .rating = 200, + .irq = 0, + .set_next_event = broadcast_set_next_event
[RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
In the current design of the timer offload framework, the broadcast cpu should *not* go into tickless idle, so as to avoid missed wakeups on CPUs in deep idle states. Since we prevent the CPUs entering deep idle states from programming the lapic of the broadcast cpu for their respective next local events, for reasons mentioned in PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up during each of its timer interrupts, which are programmed for its local events. With tickless idle, the broadcast CPU might not get a timer interrupt till after many ticks, which can result in missed wakeups on CPUs in deep idle states. By disabling tickless idle, in the worst case the tick_sched hrtimer will trigger a timer interrupt every period to check for broadcast. However, the current setup of tickless idle does not let us choose tickless behaviour on individual cpus. NOHZ_MODE_INACTIVE, which disables tickless idle, is a system-wide setting. Hence resort to an arch-specific call to check if a cpu can go into tickless idle.
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- arch/powerpc/kernel/time.c | 5 + kernel/time/tick-sched.c | 7 +++ 2 files changed, 12 insertions(+) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 8ed0fb3..68a636f 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -862,6 +862,11 @@ static void decrementer_timer_broadcast(const struct cpumask *mask) arch_send_tick_broadcast(mask); } +int arch_can_stop_idle_tick(int cpu) +{ + return cpu != bc_cpu; +} + static void register_decrementer_clockevent(int cpu) { struct clock_event_device *dec = per_cpu(decrementers, cpu); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 6960172..e9ffa84 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -700,8 +700,15 @@ static void tick_nohz_full_stop_tick(struct tick_sched *ts) #endif } +int __weak arch_can_stop_idle_tick(int cpu) +{ + return 1; +} + static bool can_stop_idle_tick(int cpu, struct tick_sched *ts) { + if (!arch_can_stop_idle_tick(cpu)) + return false; /* * If this cpu is offline and it is the one which updates * jiffies, then give up the assignment and let it be taken by
[RFC PATCH 2/5] powerpc: Implement broadcast timer interrupt as an IPI message
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com For scalability and performance reasons, we want the broadcast timer interrupts to be handled as efficiently as possible. Fixed IPI messages are one of the most efficient mechanisms available - they are faster than the smp_call_function mechanism because the IPI handlers are fixed and hence they don't involve costly operations such as adding IPI handlers to the target CPU's function queue, acquiring locks for synchronization etc. Luckily we have an unused IPI message slot, so use that to implement broadcast timer interrupts efficiently. Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- arch/powerpc/include/asm/smp.h |3 ++- arch/powerpc/kernel/smp.c | 19 +++ arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 4 files changed, 19 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 51bf017..d877b69 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu); * * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up * in /proc/interrupts will be wrong!!! --Troy */ -#define PPC_MSG_UNUSED 0 +#define PPC_MSG_TIMER 0 #define PPC_MSG_RESCHEDULE 1 #define PPC_MSG_CALL_FUNC_SINGLE 2 #define PPC_MSG_DEBUGGER_BREAK 3 @@ -190,6 +190,7 @@ extern struct smp_ops_t *smp_ops; extern void arch_send_call_function_single_ipi(int cpu); extern void arch_send_call_function_ipi_mask(const struct cpumask *mask); +extern void arch_send_tick_broadcast(const struct cpumask *mask); /* Definitions relative to the secondary CPU spin loop * and entry point. 
Not all of them exist on both 32 and diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index bc41e9f..6a68ca4 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -35,6 +35,7 @@ #include <asm/ptrace.h> #include <linux/atomic.h> #include <asm/irq.h> +#include <asm/hw_irq.h> #include <asm/page.h> #include <asm/pgtable.h> #include <asm/prom.h> @@ -111,9 +112,9 @@ int smp_generic_kick_cpu(int nr) } #endif /* CONFIG_PPC64 */ -static irqreturn_t unused_action(int irq, void *data) +static irqreturn_t timer_action(int irq, void *data) { - /* This slot is unused and hence available for use, if needed */ + timer_interrupt(); return IRQ_HANDLED; } @@ -144,14 +145,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) } static irq_handler_t smp_ipi_action[] = { - [PPC_MSG_UNUSED] = unused_action, /* Slot available for future use */ + [PPC_MSG_TIMER] = timer_action, [PPC_MSG_RESCHEDULE] = reschedule_action, [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { - [PPC_MSG_UNUSED] = "ipi unused", + [PPC_MSG_TIMER] = "ipi timer", [PPC_MSG_RESCHEDULE] = "ipi reschedule", [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", @@ -221,6 +222,8 @@ irqreturn_t smp_ipi_demux(void) all = xchg(&info->messages, 0); #ifdef __BIG_ENDIAN + if (all & (1 << (24 - 8 * PPC_MSG_TIMER))) + timer_interrupt(); if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE))) scheduler_ipi(); if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE))) @@ -266,6 +269,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask) do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); } +void arch_send_tick_broadcast(const struct cpumask *mask) +{ + unsigned int cpu; + + for_each_cpu(cpu, mask) + do_message_pass(cpu, PPC_MSG_TIMER); +} + #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) void smp_send_debugger_break(void) { diff --git a/arch/powerpc/platforms/cell/interrupt.c
b/arch/powerpc/platforms/cell/interrupt.c index 28166e4..1359113 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -213,7 +213,7 @@ static void iic_request_ipi(int msg) void iic_request_IPIs(void) { - iic_request_ipi(PPC_MSG_UNUSED); + iic_request_ipi(PPC_MSG_TIMER); iic_request_ipi(PPC_MSG_RESCHEDULE); iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE); iic_request_ipi(PPC_MSG_DEBUGGER_BREAK); diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c index 488f069..5cb742a 100644 --- a/arch/powerpc/platforms/ps3/smp.c +++ b/arch/powerpc/platforms/ps3/smp.c @@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void) * to index needs to be setup. */ - BUILD_BUG_ON(PPC_MSG_UNUSED
Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
Hi Frederic, On 07/25/2013 07:00 PM, Frederic Weisbecker wrote: On Thu, Jul 25, 2013 at 02:33:02PM +0530, Preeti U Murthy wrote: In the current design of timer offload framework, the broadcast cpu should *not* go into tickless idle so as to avoid missed wakeups on CPUs in deep idle states. Since we prevent the CPUs entering deep idle states from programming the lapic of the broadcast cpu for their respective next local events for reasons mentioned in PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up during each of its timer interrupt programmed to its local events. With tickless idle, the broadcast CPU might not get a timer interrupt till after many ticks which can result in missed wakeups on CPUs in deep idle states. By disabling tickless idle, worst case, the tick_sched hrtimer will trigger a timer interrupt every period to check for broadcast. However the current setup of tickless idle does not let us make the choice of tickless on individual cpus. NOHZ_MODE_INACTIVE which disables tickless idle, is a system wide setting. Hence resort to an arch specific call to check if a cpu can go into tickless idle. Hi Preeti, I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle mode. I read in the previous patch that's because in dynticks idle mode the broadcast CPU deactivates its lapic so it doesn't receive the IPI. But may be I misunderstood. Anyway that's not good for powersaving. Let me elaborate. The CPUs in deep idle states have their lapics deactivated. This means the next timer event which would typically have been taken care of by a lapic firing at the appropriate moment does not get taken care of in deep idle states, due to the lapic being switched off. Hence such CPUs offload their next timer event to the broadcast CPU, which should *not* enter deep idle states. The broadcast CPU has the responsibility of waking the CPUs in deep idle states. *The lapic of a broadcast CPU is active always*. 
Say CPUX wants the broadcast CPU to wake it up at timeX. Since we cannot program the lapic of a remote CPU, CPUX will need to send an IPI to the broadcast CPU, asking it to program its lapic to fire at timeX so as to wake up CPUX. *With multiple CPUs, the overhead of sending IPIs could result in performance bottlenecks and may not scale well.* Hence the workaround is that the broadcast CPU, on each of its timer interrupts, checks if the next timer event of any CPU in a deep idle state has expired, which can very well be found from dev->next_event of that CPU. For example, the timeX mentioned above may have expired. If so, the broadcast handler is called to send an IPI to the idling CPU to wake it up. *If the broadcast CPU is in tickless idle, its timer interrupt could be many ticks away. It could miss waking up a CPU in deep idle*, if its wakeup is much before this timer interrupt of the broadcast CPU. But without tickless idle, at least at each period we are assured of a timer interrupt, at which time broadcast handling is done as stated in the previous paragraph and we will not miss wakeups of CPUs in deep idle states. Yeah, it is true that not allowing the broadcast CPU to enter tickless idle is bad for power savings, but for the use case that we are aiming at in this patch series, the current approach seems to be the best, with minimal trade-offs in performance, power savings and scalability, and no change in the broadcast framework that exists today in the kernel. Also when an arch wants to prevent a CPU from entering dynticks idle mode, it typically uses arch_needs_cpu(). May be that could fit for you as well? Oh ok, thanks :) I will look into this and get back on whether we can use it. Regards Preeti U Murthy
Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
Hi Frederic, On 07/25/2013 07:00 PM, Frederic Weisbecker wrote: On Thu, Jul 25, 2013 at 02:33:02PM +0530, Preeti U Murthy wrote: In the current design of timer offload framework, the broadcast cpu should *not* go into tickless idle so as to avoid missed wakeups on CPUs in deep idle states. Since we prevent the CPUs entering deep idle states from programming the lapic of the broadcast cpu for their respective next local events for reasons mentioned in PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up during each of its timer interrupt programmed to its local events. With tickless idle, the broadcast CPU might not get a timer interrupt till after many ticks which can result in missed wakeups on CPUs in deep idle states. By disabling tickless idle, worst case, the tick_sched hrtimer will trigger a timer interrupt every period to check for broadcast. However the current setup of tickless idle does not let us make the choice of tickless on individual cpus. NOHZ_MODE_INACTIVE which disables tickless idle, is a system wide setting. Hence resort to an arch specific call to check if a cpu can go into tickless idle. Hi Preeti, I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle mode. I read in the previous patch that's because in dynticks idle mode the broadcast CPU deactivates its lapic so it doesn't receive the IPI. But may be I misunderstood. Anyway that's not good for powersaving. Also when an arch wants to prevent a CPU from entering dynticks idle mode, it typically use arch_needs_cpu(). May be that could fit for you as well? Yes this will suit our requirement perfectly. I will note down this change for the next version of this patchset. 
Thank you very much for pointing this out :) Regards Preeti U Murthy
Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
Hi Paul, On 07/26/2013 08:49 AM, Paul Mackerras wrote: On Fri, Jul 26, 2013 at 08:09:23AM +0530, Preeti U Murthy wrote: Hi Frederic, On 07/25/2013 07:00 PM, Frederic Weisbecker wrote: Hi Preeti, I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle mode. I read in the previous patch that's because in dynticks idle mode the broadcast CPU deactivates its lapic so it doesn't receive the IPI. But may be I misunderstood. Anyway that's not good for powersaving. Let me elaborate. The CPUs in deep idle states have their lapics deactivated. This means the next timer event which would typically have been taken care of by a lapic firing at the appropriate moment does not get taken care of in deep idle states, due to the lapic being switched off. I really don't think it's helpful to use the term lapic in connection with Power systems. There is nothing that is called a lapic in a Power machine. The nearest equivalent of the LAPIC on x86 machines is the ICP, the interrupt-controller presentation element, of which there is one per CPU thread. However, I don't believe the ICP gets disabled in deep sleep modes. What does get disabled is the decrementer, which is a register that normally counts down (at 512MHz) and generates an exception when it is negative. The decrementer *is* part of the CPU core, unlike the ICP. That's why we can still get IPIs but not timer interrupts. Please reword your patch description to not use the term lapic, which is not defined in the Power context and is therefore just causing confusion. Noted. Thank you :) I will probably send out a fresh patchset with the appropriate changelog to avoid this confusion? Paul. Regards Preeti U Murthy
Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
Hi Frederic, I apologise for the confusion. As Paul pointed out maybe the usage of the term lapic is causing a large amount of confusion. So please see the clarification below. Maybe it will help answer your question. On 07/26/2013 08:09 AM, Preeti U Murthy wrote: Hi Frederic, On 07/25/2013 07:00 PM, Frederic Weisbecker wrote: On Thu, Jul 25, 2013 at 02:33:02PM +0530, Preeti U Murthy wrote: In the current design of timer offload framework, the broadcast cpu should *not* go into tickless idle so as to avoid missed wakeups on CPUs in deep idle states. Since we prevent the CPUs entering deep idle states from programming the lapic of the broadcast cpu for their respective next local events for reasons mentioned in PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up during each of its timer interrupt programmed to its local events. With tickless idle, the broadcast CPU might not get a timer interrupt till after many ticks which can result in missed wakeups on CPUs in deep idle states. By disabling tickless idle, worst case, the tick_sched hrtimer will trigger a timer interrupt every period to check for broadcast. However the current setup of tickless idle does not let us make the choice of tickless on individual cpus. NOHZ_MODE_INACTIVE which disables tickless idle, is a system wide setting. Hence resort to an arch specific call to check if a cpu can go into tickless idle. Hi Preeti, I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle mode. I read in the previous patch that's because in dynticks idle mode the broadcast CPU deactivates its lapic so it doesn't receive the IPI. But may be I misunderstood. Anyway that's not good for powersaving. Firstly, when CPUs enter deep idle states, their local clock event devices get switched off. In the case of powerpc, local clock event device is the decrementer. 
Hence such CPUs *do not get timer interrupts* but are still *capable of taking IPIs.* So we need to ensure that some other CPU, in this case the broadcast CPU, makes note of when the timer interrupt of the CPU in such deep idle states is to trigger and at that moment issue an IPI to that CPU. *The broadcast CPU however should have its decrementer active always*, meaning it is disallowed from entering deep idle states, where the decrementer switches off, precisely because the other idling CPUs bank on it for the above mentioned reason. *The lapic of a broadcast CPU is active always*. Say CPUX, wants the broadcast CPU to wake it up at timeX. Since we cannot program the lapic of a remote CPU, CPUX will need to send an IPI to the broadcast CPU, asking it to program its lapic to fire at timeX so as to wake up CPUX. *With multiple CPUs the overhead of sending IPI, could result in performance bottlenecks and may not scale well.* Rewording the above. The decrementer of the broadcast CPU is active always. Since we cannot program the clock event device of a remote CPU, CPUX will need to send an IPI to the broadcast CPU, (which the broadcast CPU is very well capable of receiving), asking it to program its decrementer to fire at timeX so as to wake up CPUX *With multiple CPUs the overhead of sending IPI, could result in performance bottlenecks and may not scale well.* Hence the workaround is that the broadcast CPU on each of its timer interrupt checks if any of the next timer event of a CPU in deep idle state has expired, which can very well be found from dev-next_event of that CPU. For example the timeX that has been mentioned above has expired. If so the broadcast handler is called to send an IPI to the idling CPU to wake it up. *If the broadcast CPU, is in tickless idle, its timer interrupt could be many ticks away. It could miss waking up a CPU in deep idle*, if its wakeup is much before this timer interrupt of the broadcast CPU. 
But without tickless idle, at least at each period we are assured of a timer interrupt, at which time broadcast handling is done as stated in the previous paragraph and we will not miss wakeups of CPUs in deep idle states. Yeah, it is true that not allowing the broadcast CPU to enter tickless idle is bad for power savings, but for the use case that we are aiming at in this patch series, the current approach seems to be the best, with minimal trade-offs in performance, power savings and scalability, and no change in the broadcast framework that exists today in the kernel. Regards Preeti U Murthy
[Resend RFC PATCH 0/5] cpuidle/ppc: Timer offload framework to support deep idle states
On PowerPC, when CPUs enter deep idle states, their local timers are switched off. The responsibility of waking them up at their next timer event needs to be handed over to an external device. On PowerPC, we do not have an external device equivalent to the HPET that architectures like x86 currently use for this purpose. Instead we assign the local timer of one of the CPUs to do this job. This patchset is an attempt to make use of the existing timer broadcast framework in the kernel to meet the above requirement, except that the tick broadcast device is the local timer of the boot CPU. This patch series is ported on top of 3.11-rc1 + the cpuidle driver backend for powernv posted by Deepthi Dharwar recently. The current design and implementation supports the ONESHOT tick mode. It does not yet support the PERIODIC tick mode. This patch is tested with NOHZ_FULL off. Patch[1/5], Patch[2/5]: Optimize the broadcast mechanism on ppc. Patch[3/5]: Introduces the core of the timer offload framework on powerpc. Patch[4/5]: The cpu doing the broadcast should not go into tickless idle. Patch[5/5]: Add a deep idle state to the cpuidle state table on powernv. Patch[5/5] is the patch that ultimately makes use of the timer offload framework that the patches Patch[1/5] to Patch[4/5] build. This patch series is being resent to clarify certain ambiguity in the patch descriptions from the previous post. Discussion around this: https://lkml.org/lkml/2013/7/25/754 --- Preeti U Murthy (3): cpuidle/ppc: Add timer offload framework to support deep idle states cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints cpuidle/ppc: Add longnap state to the idle states on powernv Srivatsa S.
Bhat (2): powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC) powerpc: Implement broadcast timer interrupt as an IPI message arch/powerpc/include/asm/smp.h | 3 + arch/powerpc/include/asm/time.h | 3 + arch/powerpc/kernel/smp.c | 23 -- arch/powerpc/kernel/time.c | 86 +++ arch/powerpc/platforms/cell/interrupt.c | 2 - arch/powerpc/platforms/powernv/Kconfig | 1 arch/powerpc/platforms/powernv/processor_idle.c | 48 + arch/powerpc/platforms/ps3/smp.c | 2 - kernel/time/tick-sched.c | 7 ++ 9 files changed, 163 insertions(+), 12 deletions(-)
[Resend RFC PATCH 2/5] powerpc: Implement broadcast timer interrupt as an IPI message
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com

For scalability and performance reasons, we want the broadcast timer
interrupts to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster than
the smp_call_function mechanism because the IPI handlers are fixed and
hence they don't involve costly operations such as adding IPI handlers to
the target CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
broadcast timer interrupts efficiently.

Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/smp.h          |  3 ++-
 arch/powerpc/kernel/smp.c               | 19 +++
 arch/powerpc/platforms/cell/interrupt.c |  2 +-
 arch/powerpc/platforms/ps3/smp.c        |  2 +-
 4 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 51bf017..d877b69 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu);
  * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
  * in /proc/interrupts will be wrong!!! --Troy
  */
-#define PPC_MSG_UNUSED		0
+#define PPC_MSG_TIMER		0
 #define PPC_MSG_RESCHEDULE	1
 #define PPC_MSG_CALL_FUNC_SINGLE 2
 #define PPC_MSG_DEBUGGER_BREAK	3
@@ -190,6 +190,7 @@ extern struct smp_ops_t *smp_ops;
 
 extern void arch_send_call_function_single_ipi(int cpu);
 extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
+extern void arch_send_tick_broadcast(const struct cpumask *mask);
 
 /* Definitions relative to the secondary CPU spin loop
  * and entry point. Not all of them exist on both 32 and
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index bc41e9f..6a68ca4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
 #include <asm/ptrace.h>
 #include <linux/atomic.h>
 #include <asm/irq.h>
+#include <asm/hw_irq.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/prom.h>
@@ -111,9 +112,9 @@ int smp_generic_kick_cpu(int nr)
 }
 #endif /* CONFIG_PPC64 */
 
-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t timer_action(int irq, void *data)
 {
-	/* This slot is unused and hence available for use, if needed */
+	timer_interrupt();
 	return IRQ_HANDLED;
 }
 
@@ -144,14 +145,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 }
 
 static irq_handler_t smp_ipi_action[] = {
-	[PPC_MSG_UNUSED] = unused_action, /* Slot available for future use */
+	[PPC_MSG_TIMER] = timer_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
 	[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
-	[PPC_MSG_UNUSED] = "ipi unused",
+	[PPC_MSG_TIMER] = "ipi timer",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
 	[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
@@ -221,6 +222,8 @@ irqreturn_t smp_ipi_demux(void)
 	all = xchg(&info->messages, 0);
 
 #ifdef __BIG_ENDIAN
+	if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
+		timer_interrupt();
 	if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
 		scheduler_ipi();
 	if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
@@ -266,6 +269,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 		do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
 }
 
+void arch_send_tick_broadcast(const struct cpumask *mask)
+{
+	unsigned int cpu;
+
+	for_each_cpu(cpu, mask)
+		do_message_pass(cpu, PPC_MSG_TIMER);
+}
+
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 void smp_send_debugger_break(void)
 {
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 28166e4..1359113 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -213,7 +213,7 @@ static void iic_request_ipi(int msg)
 
 void iic_request_IPIs(void)
 {
-	iic_request_ipi(PPC_MSG_UNUSED);
+	iic_request_ipi(PPC_MSG_TIMER);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
 	iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 488f069..5cb742a 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void)
 	 * to index needs to be setup.
 	 */
-	BUILD_BUG_ON(PPC_MSG_UNUSED != 0);
[Resend RFC PATCH 3/5] cpuidle/ppc: Add timer offload framework to support deep idle states
On ppc, in deep idle states, the local clock event device of the CPUs gets switched off. On PowerPC, the local clock event device is called the decrementer.

We make use of the broadcast framework to issue interrupts to CPUs in deep idle states on their timer events, except that on ppc we do not have an external device such as the HPET; instead we use the decrementer of one of the CPUs itself as the broadcast device.

We instantiate two different clock event devices per CPU: one representing the decrementer and another representing the broadcast device. The CPU which registers its broadcast device is responsible for issuing timer interrupts to CPUs in deep idle states, and is referred to as the broadcast CPU in the changelogs of this patchset for convenience. Such a CPU is not allowed to enter deep idle states, where the decrementer is switched off.

For now, only the boot CPU's broadcast device gets registered as a clock event device along with the decrementer; hence the boot CPU is the broadcast CPU. On the broadcast CPU, on each timer interrupt, the broadcast handler is called in addition to the regular local timer event handler.

We deliberately avoid programming the decrementer specifically for a broadcast event, for performance and scalability reasons. Say cpuX goes to a deep idle state: it would have to ask the broadcast CPU to reprogram its (the broadcast CPU's) decrementer for cpuX's next local timer event, and it could do so only by sending an IPI to the broadcast CPU. With many more CPUs going into deep idle, this model of sending an IPI each time would become a performance bottleneck and would not scale well.

Apart from this, there is no change in the way broadcast is handled today. On a broadcast IPI, the event handler for a timer interrupt is called on the CPU in deep idle state to handle its local events.

The current design and implementation of the timer offload framework supports the ONESHOT tick mode but not the PERIODIC mode.
Signed-off-by: Preeti U. Murthy pre...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/time.h        |  3 +
 arch/powerpc/kernel/smp.c              |  4 +-
 arch/powerpc/kernel/time.c             | 81 ++++++++++
 arch/powerpc/platforms/powernv/Kconfig |  1 
 4 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..936be0d 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -24,14 +24,17 @@ extern unsigned long tb_ticks_per_jiffy;
 extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
 extern struct clock_event_device decrementer_clockevent;
+extern struct clock_event_device broadcast_clockevent;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void decrementer_timer_interrupt(void);
 
 extern void generic_calibrate_decr(void);
 extern void set_dec_cpu6(unsigned int val);
+extern int bc_cpu;
 
 /* Some sane defaults: 125 MHz timebase, 1GHz processor */
 extern unsigned long ppc_proc_freq;
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6a68ca4..d3b7014 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -114,7 +114,7 @@ int smp_generic_kick_cpu(int nr)
 
 static irqreturn_t timer_action(int irq, void *data)
 {
-	timer_interrupt();
+	decrementer_timer_interrupt();
 	return IRQ_HANDLED;
 }
 
@@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)
 
 #ifdef __BIG_ENDIAN
 	if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
-		timer_interrupt();
+		decrementer_timer_interrupt();
 	if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
 		scheduler_ipi();
 	if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 65ab9e9..7e858e1 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -42,6 +42,7 @@
 #include <linux/timex.h>
 #include <linux/kernel_stat.h>
 #include <linux/time.h>
+#include <linux/timer.h>
 #include <linux/init.h>
 #include <linux/profile.h>
 #include <linux/cpu.h>
@@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {
 
 static int decrementer_set_next_event(unsigned long evt,
				      struct clock_event_device *dev);
+static int broadcast_set_next_event(unsigned long evt,
+				    struct clock_event_device *dev);
 static void decrementer_set_mode(enum clock_event_mode mode,
				 struct clock_event_device *dev);
+static void decrementer_timer_broadcast(const struct cpumask *mask);
 
 struct clock_event_device decrementer_clockevent = {
	.name		= "decrementer",
@@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
	.irq
[Resend RFC PATCH 5/5] cpuidle/ppc: Add longnap state to the idle states on powernv
This patch hooks into the existing broadcast framework, with the support that this patchset introduces for ppc, and the cpuidle driver backend for powernv (posted out recently by Deepthi Dharwar) to add sleep state as one of the deep idle states, in which the decrementer is switched off. However in this patch we only emulate sleep by going into a state which does a nap with the decrementer interrupts disabled, termed longnap. This keeps the focus of this series on the timer broadcast framework for ppc, which is required as a first step to enable sleep on ppc.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 arch/powerpc/platforms/powernv/processor_idle.c | 48 +++++++++++++++++++++++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/processor_idle.c b/arch/powerpc/platforms/powernv/processor_idle.c
index f43ad91a..9aca502 100644
--- a/arch/powerpc/platforms/powernv/processor_idle.c
+++ b/arch/powerpc/platforms/powernv/processor_idle.c
@@ -9,16 +9,18 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
+#include <linux/clockchips.h>
 
 #include <asm/machdep.h>
 #include <asm/runlatch.h>
+#include <asm/time.h>
 
 struct cpuidle_driver powernv_idle_driver = {
	.name =		"powernv_idle",
	.owner =	THIS_MODULE,
 };
 
-#define MAX_IDLE_STATE_COUNT	2
+#define MAX_IDLE_STATE_COUNT	3
 
 static int max_idle_state = MAX_IDLE_STATE_COUNT - 1;
 static struct cpuidle_device __percpu *powernv_cpuidle_devices;
@@ -54,6 +56,43 @@ static int nap_loop(struct cpuidle_device *dev,
	return index;
 }
 
+/* Emulate sleep, with long nap.
+ * During sleep, the core does not receive decrementer interrupts.
+ * Emulate sleep using long nap with decrementer interrupts disabled.
+ * This is an initial prototype to test the timer offload framework for ppc.
+ * We will eventually introduce the sleep state once the timer offload
+ * framework for ppc is stable.
+ */
+static int longnap_loop(struct cpuidle_device *dev,
+			struct cpuidle_driver *drv,
+			int index)
+{
+	int cpu = dev->cpu;
+
+	unsigned long lpcr = mfspr(SPRN_LPCR);
+
+	lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
+
+	/* exit powersave upon external interrupt, but not decrementer
+	 * interrupt. Emulate sleep.
+	 */
+	lpcr |= LPCR_PECE0;
+
+	if (cpu != bc_cpu) {
+		mtspr(SPRN_LPCR, lpcr);
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+		power7_nap();
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+	} else {
+		/* Wakeup on a decrementer interrupt, do a nap */
+		lpcr |= LPCR_PECE1;
+		mtspr(SPRN_LPCR, lpcr);
+		power7_nap();
+	}
+
+	return index;
+}
+
 /*
  * States for dedicated partition case.
  */
@@ -72,6 +111,13 @@ static struct cpuidle_state powernv_states[MAX_IDLE_STATE_COUNT] = {
		.exit_latency = 10,
		.target_residency = 100,
		.enter = nap_loop },
+	{ /* LongNap */
+		.name = "LongNap",
+		.desc = "LongNap",
+		.flags = CPUIDLE_FLAG_TIME_VALID,
+		.exit_latency = 10,
+		.target_residency = 100,
+		.enter = longnap_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,
--
[Resend RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
In the current design of the timer offload framework, the broadcast CPU should *not* go into tickless idle, so as to avoid missed wakeups on CPUs in deep idle states.

Since we prevent the CPUs entering deep idle states from programming the decrementer of the broadcast CPU for their respective next local events, for the reasons mentioned in PATCH[3/5], the broadcast CPU checks on each of its timer interrupts (which are programmed for its own local events) whether there are any CPUs to be woken up. With tickless idle, the broadcast CPU might not have a timer interrupt pending till many ticks later, which can result in missed wakeups on CPUs in deep idle states. By disabling tickless idle, in the worst case the tick_sched hrtimer will trigger a timer interrupt every period.

However, the current setup of tickless idle does not let us make the choice of tickless on individual CPUs: NOHZ_MODE_INACTIVE, which disables tickless idle, is a system-wide setting. Hence we resort to an arch-specific call to check if a CPU can go into tickless idle.
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/time.c |  5 +++++
 kernel/time/tick-sched.c   |  7 +++++++
 2 files changed, 12 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 7e858e1..916c32f 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -864,6 +864,11 @@ static void decrementer_timer_broadcast(const struct cpumask *mask)
	arch_send_tick_broadcast(mask);
 }
 
+int arch_can_stop_idle_tick(int cpu)
+{
+	return cpu != bc_cpu;
+}
+
 static void register_decrementer_clockevent(int cpu)
 {
	struct clock_event_device *dec = &per_cpu(decrementers, cpu);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6960172..e9ffa84 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -700,8 +700,15 @@ static void tick_nohz_full_stop_tick(struct tick_sched *ts)
 #endif
 }
 
+int __weak arch_can_stop_idle_tick(int cpu)
+{
+	return 1;
+}
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
+	if (!arch_can_stop_idle_tick(cpu))
+		return false;
	/*
	 * If this cpu is offline and it is the one which updates
	 * jiffies, then give up the assignment and let it be taken by
--
Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
Hi Ben,

On 07/27/2013 12:00 PM, Benjamin Herrenschmidt wrote:
> On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote:
>> *The lapic of a broadcast CPU is active always*. Say CPUX wants the
>> broadcast CPU to wake it up at timeX. Since we cannot program the lapic
>> of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
>> asking it to program its lapic to fire at timeX so as to wake up CPUX.
>> *With multiple CPUs the overhead of sending IPIs could result in
>> performance bottlenecks and may not scale well.*
>>
>> Hence the workaround is that the broadcast CPU, on each of its timer
>> interrupts, checks if the next timer event of any CPU in deep idle state
>> has expired, which can very well be found from dev->next_event of that
>> CPU. For example, the timeX mentioned above has expired. If so, the
>> broadcast handler is called to send an IPI to the idling CPU to wake it
>> up.
>>
>> *If the broadcast CPU is in tickless idle, its timer interrupt could be
>> many ticks away. It could miss waking up a CPU in deep idle*, if that
>> wakeup is due much before this timer interrupt of the broadcast CPU. But
>> without tickless idle, at least at each period we are assured of a timer
>> interrupt, at which time broadcast handling is done as stated in the
>> previous paragraph, and we will not miss wakeups of CPUs in deep idle
>> states.
>
> But that means a great loss of power saving on the broadcast CPU when the
> machine is basically completely idle. We might be able to come up with
> something better.
>
> (Note: I do not know the timer offload code if it exists already, I'm
> describing how things could happen out of the blue without any knowledge
> of pre-existing framework here)
>
> We can know when the broadcast CPU expects to wake up next. When a CPU
> goes to a deep sleep state, it can then:
>
>  - Indicate to the broadcast CPU when it intends to be woken up by
>    queuing itself into an ordered queue (ordered by target wakeup time).
>    (OPTIMISATION: Play with the locality of that: have one queue (and one
>    broadcast CPU) per chip or per node instead of a global one to limit
>    cache bouncing).
>
>  - Check if that happens before the broadcast CPU's intended wake time
>    (we need statistics to see how often that happens), and in that case
>    send an IPI to wake it up now. When the broadcast CPU goes to sleep,
>    it limits its sleep time to the min of its intended sleep time and the
>    new sleeper time. (OPTIMISATION: Dynamically choose a broadcast CPU
>    based on closest expiry?)
>
>  - We can probably limit spurious wakeups a *LOT* by aligning that target
>    time to a global jiffy boundary, meaning that several CPUs going to
>    idle are likely to be choosing the same. Or maybe better, an adaptive
>    alignment by essentially getting more coarse grained as we go further
>    in the future.
>
>  - When the broadcast CPU goes to sleep, it can play the same game of
>    alignment.
>
> I don't like the concept of a dedicated broadcast CPU however. I'd rather
> have a general queue (or per node) of sleepers needing a wakeup and
> more/less dynamically pick a waker to be the last man standing, but it
> does make things a bit more tricky with the tickless scheduler
> (non-idle).
>
> Still, I wonder if we could just have some algorithm to actually pick
> wakers more dynamically based on whoever has the closest next wakeup
> planned, that sort of thing. A fixed broadcaster will create an imbalance
> in power/thermal within the chip in addition to needing to be moved
> around on hotplug etc...

Thank you for having listed out the above suggestions. Below, I will bring
out some ideas about how the concerns that you have raised can be
addressed, in increasing order of priority.

- To begin with, I think we can have the following model to have the
responsibility of the broadcast CPU float around certain CPUs, i.e. not
have a dedicated broadcast CPU. I will refer to the broadcast CPU as the
bc_cpu henceforth for convenience.

1. The first CPU that intends to enter deep sleep state will be the
   bc_cpu.

2. Every other CPU that intends to enter deep idle state will enter itself
   into a mask, say the bc_mask (which is already being done today), after
   it checks that a bc_cpu has been assigned.

3. The bc_cpu should not enter tickless idle, until step 5a holds true.

4. So on every timer interrupt, which occurs at least every period, it
   checks the bc_mask to see if any CPUs need to be woken up.

5. The bc_cpu should not enter tickless idle *until* it is de-nominated as
   the bc_cpu. The de-nomination occurs when:
   a. In one of its timer interrupts, it does broadcast handling and finds
      that there are no CPUs to be woken up.

6. So if 5a holds, then there is no bc_cpu anymore until a CPU decides to
   enter deep idle state again, in which case steps 1 to 5 repeat.

- We could optimize this further, to allow the bc_cpu to enter tickless
idle even while it is nominated as one. This can be the next step, if we
can get the above to work stably. You have already brought out this point
Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
Hi,

On 07/29/2013 10:58 AM, Vaidyanathan Srinivasan wrote:
> * Preeti U Murthy pre...@linux.vnet.ibm.com [2013-07-27 13:20:37]:
>
> [... full quote of the previous mail in this thread snipped ...]
Re: [PATCH 06/10] sched, fair: Make group power more consitent
On 08/19/2013 09:31 PM, Peter Zijlstra wrote:

Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com
Re: [PATCH 07/10] sched, fair: Optimize find_busiest_queue()
Hi Peter,

On 08/19/2013 09:31 PM, Peter Zijlstra wrote:

In the load balancing code, it looks to me as though cpumask_copy(cpus, cpu_active_mask) is not updating env.cpus at all before calling find_busiest_group(). Am I missing something? Should not the cpumask_copy() below come before we set the env.cpus field?

static int load_balance(int this_cpu, struct rq *this_rq,
			struct sched_domain *sd, enum cpu_idle_type idle,
			int *balance)
{
	int ld_moved, cur_ld_moved, active_balance = 0;
	struct sched_group *group;
	struct rq *busiest;
	unsigned long flags;
	struct cpumask *cpus = __get_cpu_var(load_balance_mask);

	struct lb_env env = {
		.sd		= sd,
		.dst_cpu	= this_cpu,
		.dst_rq		= this_rq,
		.dst_grpmask	= sched_group_cpus(sd->groups),
		.idle		= idle,
		.loop_break	= sched_nr_migrate_break,
		.cpus		= cpus,
	};

	/*
	 * For NEWLY_IDLE load_balancing, we don't need to consider
	 * other cpus in our group
	 */
	if (idle == CPU_NEWLY_IDLE)
		env.dst_grpmask = NULL;

	cpumask_copy(cpus, cpu_active_mask);

	schedstat_inc(sd, lb_count[idle]);

redo:
	group = find_busiest_group(&env, balance);

Regards
Preeti U Murthy
Re: [PATCH 07/10] sched, fair: Optimize find_busiest_queue()
On 08/23/2013 03:33 PM, Peter Zijlstra wrote:
> On Fri, Aug 23, 2013 at 01:41:55PM +0530, Preeti U Murthy wrote:
>> In the load balancing code, it looks to me as though
>> cpumask_copy(cpus, cpu_active_mask) is not updating env.cpus at all
>> before calling find_busiest_group(). Am I missing something? Should not
>> the cpumask_copy() below come before we set the env.cpus field?
>
> Nah, its all pointers.

Yikes! Of course :P

Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com
Re: [PATCH 5/6] sched, fair: Make group power more consitent
Hi Peter,

On 08/16/2013 03:42 PM, Peter Zijlstra wrote:

I have a few comments and clarifications to seek.

1. How are you ensuring from this patch that sgs->group_power does not change over the course of load balancing? The only path to update_group_power(), where sg->sgp->power gets updated, is from update_sg_lb_stats(), and you are updating sgs->group_power in update_sg_lb_stats(). Any change to group->sgp->power will get reflected in sgs->group_power as well, right?

2. This point is aside from your patch. In the current implementation, each time the cpu power gets updated in update_cpu_power(), should not the power of the sched_groups comprising that cpu also get updated? Why wait till load balancing is done at the sched_domain level of that group to update its group power?

Regards
Preeti U Murthy