Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On 01/07/2014 08:59 PM, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 12:55:18PM +, Morten Rasmussen wrote:
>> My understanding is that should_we_balance() decides which cpu is
>> eligible for doing the load balancing for a given domain (and the
>> domains above). That is, only one cpu in a group is allowed to load
>> balance between the local group and other groups. That cpu would
>> therefore be responsible for pulling enough load that the groups are
>> balanced even if it means temporarily overloading itself. The other cpus
>> in the group will take care of load balancing the extra load within the
>> local group later.

Thanks to both of you for the comments and explanations! :)
I know this patch's change is arguable and my attempt isn't tuned well yet,
but I believe the direction is correct. :) Let me explain this patch a bit more.

First, cpu_load already contains load history, so repeatedly decaying it and
then consuming that history again makes little sense. On top of that, the old
source_load()/target_load() picks either the history load or the current load
simply by taking the min/max, which has never had a good explanation either.

Second, the bias is already accounted for in source_load()/target_load(), yet
imbalance_pct is applied again as a final check when finding the idlest/busiest
group. That is redundant: if we take the source/target bias into account, we
should not apply imbalance_pct a second time.

Last, imbalance_pct gets over-applied as cpu core counts increase. Take
find_busiest_group() with a two-group domain where each group has 8 cores:
the target group is biased by 8 * (imbalance_pct - 100) = 8 * (125 - 100) = 200.
Since each cpu is biased by 0.25 times its load, the 8 cpus together bias the
comparison by 2 times the average cpu load between the groups, which is too
much. With only 2 cores per group (the common case when this code was
introduced), the bias is just 2 * 25 / 100 = 0.5 times the average cpu load.

So this patchset removes the cpu_load array to avoid the repeated history
decay, reorganizes the imbalance_pct usage to avoid redundant balance bias,
and reduces the bias between cpu groups -- maybe it isn't tuned well yet. :)

>
> Correct.
>
>> I may have missed something, but I don't understand the reason for the
>> performance improvements that you are reporting. I see better numbers
>> for a few benchmarks, but I still don't understand why the code makes
>> sense after the cleanup. If we don't understand why it works, we cannot
>> be sure that it doesn't harm other benchmarks. There is always a chance
>> that we miss something but, IMHO, not having any idea to begin with
>> increases the chances for problems later significantly. So why not get
>> to the bottom of the problem of cleaning up cpu_load?
>>
>> Have you done more extensive benchmarking? Have you seen any regressions
>> in other benchmarks?
>
> I only remember hackbench numbers, and that generally fares well with a
> more aggressive balancer: since it has no actual work to speak of, the
> migration penalty is very low, and because there's a metric ton of tasks
> the aggressive leveling makes for more coherent 'throughput'.

I just tested hackbench on ARM. With more test runs plus a rebase to 3.13-rc6,
the variation increased and the benefit became unclear; anyway, still no
regression found in either the perf-stat cpu-migration count or the real
execution time.

The 0day performance testing covers kbuild, hackbench, aim7, dbench, tbench,
sysbench, netperf, etc. No regression was found there. The 0day performance
testing also caught a reduction in cpu migrations on a KVM guest:
https://lkml.org/lkml/2013/12/21/135

And another benchmark got a benefit on the old patchset, as the results from
the 0day performance testing show: https://lkml.org/lkml/2013/12/4/102

Hi Alex,

We observed a 150% performance gain with the vm-scalability/300s-mmap-pread-seq
testcase with this patch applied. Here is a list of changes we got so far:

testbox : brickland
testcase: vm-scalability/300s-mmap-pread-seq

    f1b6442c7dd12802e622      d70495ef86f397816d73
         (parent commit)             (this commit)

         26393249.80     +150.9%      66223933.60   vm-scalability.throughput

              225.12      -49.9%           112.75   time.elapsed_time
            36333.40      -90.7%          3392.20   vmstat.system.cs
                2.40     +375.0%            11.40   vmstat.cpu.id
          3770081.60      -97.7%         87673.40   time.major_page_faults
          3975276.20      -97.0%        117409.60   time.voluntary_context_switches
                3.05     +301.7%            12.24   iostat.cpu.idle
            21118.41      -70.3%          6277.19   time.system_time
               18.40     +130.4%            42.40   vmstat.cpu.us
               77.00      -41.3%            45.20   vmstat.cpu.sy
            47459.60      -31.3%         32592.20   vmstat.system.in
            82435.40      -12.1%         72443.60   time.involuntary_context_switches
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On Tue, Jan 07, 2014 at 03:16:32PM +, Morten Rasmussen wrote:
> From a load perspective wouldn't it be better to pick the least loaded
> cpu in the group? It is not cheap to implement, but in theory it should
> give less balancing within the group later and less unfairness until it
> happens.

I tried that; see 04f733b4afac5dc93ae9b0a8703c60b87def491e for why it
doesn't work.

> Rotating the cpu is probably good enough for most cases and certainly
> easier to implement.

Indeed.

> The bias continues after the first round of load balance by the other
> cpus?

The cost, yes. Even when perfectly balanced, we still get to iterate the
entire machine computing s[gd]_lb_stats to find out we're good and don't
need to move tasks about.

> Pulling everything to one cpu is not ideal from a performance point of
> view. You lose some available cpu cycles until the balance settles.
> However, it is not easy to do better and maintain scalability at the
> same time.

Right, it's part of the cost we pay for scaling better.

And rotating this cost around a bit would alleviate the obvious bias.
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On Tue, Jan 07, 2014 at 01:15:23PM +, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 01:59:30PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2014 at 12:55:18PM +, Morten Rasmussen wrote:
> > > My understanding is that should_we_balance() decides which cpu is
> > > eligible for doing the load balancing for a given domain (and the
> > > domains above). That is, only one cpu in a group is allowed to load
> > > balance between the local group and other groups. That cpu would
> > > therefore be responsible for pulling enough load that the groups are
> > > balanced even if it means temporarily overloading itself. The other cpus
> > > in the group will take care of load balancing the extra load within the
> > > local group later.
> >
> > Correct.
>
> On that; one of the things I wanted to (and previously did attempt but
> failed) is trying to rotate this cpu. Currently it's always the first cpu
> (of the group) and that gives a noticeable bias.
>
> If we could slowly rotate the cpu that does this, that would alleviate
> both the load and cost bias.

From a load perspective wouldn't it be better to pick the least loaded
cpu in the group? It is not cheap to implement, but in theory it should
give less balancing within the group later and less unfairness until it
happens.

Rotating the cpu is probably good enough for most cases and certainly
easier to implement.

>
> One thing I was thinking of is keeping a global counter maybe:
> 'x := jiffies >> n'
> might be good enough and using the 'x % nr_cpus_in_group'-th cpu
> instead.
>
> Then again, these are micro issues and not a lot of people complain
> about this.

The bias continues after the first round of load balance by the other
cpus?

Pulling everything to one cpu is not ideal from a performance point of
view. You lose some available cpu cycles until the balance settles.
However, it is not easy to do better and maintain scalability at the
same time.
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On Tue, Jan 07, 2014 at 02:32:07PM +0100, Vincent Guittot wrote:
> On 7 January 2014 14:15, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2014 at 01:59:30PM +0100, Peter Zijlstra wrote:
> >> On Tue, Jan 07, 2014 at 12:55:18PM +, Morten Rasmussen wrote:
> >> > My understanding is that should_we_balance() decides which cpu is
> >> > eligible for doing the load balancing for a given domain (and the
> >> > domains above). That is, only one cpu in a group is allowed to load
> >> > balance between the local group and other groups. That cpu would
> >> > therefore be responsible for pulling enough load that the groups are
> >> > balanced even if it means temporarily overloading itself. The other cpus
> >> > in the group will take care of load balancing the extra load within the
> >> > local group later.
> >>
> >> Correct.
> >
> > On that; one of the things I wanted to (and previously did attempt but
> > failed) is trying to rotate this cpu. Currently it's always the first cpu
> > (of the group) and that gives a noticeable bias.
>
> Isn't the current policy (the 1st idle cpu gets priority) a good enough
> way to rotate the cpus? Do you need the rotation for the loaded use case
> too?

Yeah, it's for the fully loaded case. And like I said, there haven't been
many complaints about this.

The 'problem' is that it's always the same cpu that does the most expensive
full-machine balance, and always that cpu that is the one that gains extra
load to redistribute in the group. So it's penalized twice.

Like I said, a really minor issue. Just something I thought I'd mention.
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On 7 January 2014 14:15, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 01:59:30PM +0100, Peter Zijlstra wrote:
>> On Tue, Jan 07, 2014 at 12:55:18PM +, Morten Rasmussen wrote:
>> > My understanding is that should_we_balance() decides which cpu is
>> > eligible for doing the load balancing for a given domain (and the
>> > domains above). That is, only one cpu in a group is allowed to load
>> > balance between the local group and other groups. That cpu would
>> > therefore be responsible for pulling enough load that the groups are
>> > balanced even if it means temporarily overloading itself. The other cpus
>> > in the group will take care of load balancing the extra load within the
>> > local group later.
>>
>> Correct.
>
> On that; one of the things I wanted to (and previously did attempt but
> failed) is trying to rotate this cpu. Currently it's always the first cpu
> (of the group) and that gives a noticeable bias.

Isn't the current policy (the 1st idle cpu gets priority) a good enough
way to rotate the cpus? Do you need the rotation for the loaded use case
too?

> If we could slowly rotate the cpu that does this, that would alleviate
> both the load and cost bias.
>
> One thing I was thinking of is keeping a global counter maybe:
> 'x := jiffies >> n'
> might be good enough and using the 'x % nr_cpus_in_group'-th cpu
> instead.
>
> Then again, these are micro issues and not a lot of people complain
> about this.
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On Tue, Jan 07, 2014 at 01:59:30PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 12:55:18PM +, Morten Rasmussen wrote:
> > My understanding is that should_we_balance() decides which cpu is
> > eligible for doing the load balancing for a given domain (and the
> > domains above). That is, only one cpu in a group is allowed to load
> > balance between the local group and other groups. That cpu would
> > therefore be responsible for pulling enough load that the groups are
> > balanced even if it means temporarily overloading itself. The other cpus
> > in the group will take care of load balancing the extra load within the
> > local group later.
>
> Correct.

On that; one of the things I wanted to (and previously did attempt but
failed) is trying to rotate this cpu. Currently it's always the first cpu
(of the group) and that gives a noticeable bias.

If we could slowly rotate the cpu that does this, that would alleviate
both the load and cost bias.

One thing I was thinking of is keeping a global counter maybe:
'x := jiffies >> n'
might be good enough and using the 'x % nr_cpus_in_group'-th cpu
instead.

Then again, these are micro issues and not a lot of people complain
about this.
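To make the counter idea above concrete, here is a minimal sketch (hypothetical code, not from any posted patch; the helper name and the shift value are made up) of how the balancing cpu could be rotated instead of always being the first cpu of the group:

/*
 * Sketch only: pick a slowly rotating cpu out of the group, so the cost
 * of the full-machine balance and the temporary extra load are not
 * always paid by the same (first) cpu.
 */
static int rotated_balance_cpu(struct sched_group *sg)
{
	struct cpumask *mask = sched_group_cpus(sg);
	unsigned int nr = cpumask_weight(mask);
	/* 'x := jiffies >> n'; n = 10 is a guess, roughly 1s at HZ=1000 */
	unsigned int x = jiffies >> 10;
	unsigned int target = x % nr;	/* the 'x % nr_cpus_in_group'-th cpu */
	int cpu;

	for_each_cpu(cpu, mask) {
		if (!target--)
			break;
	}
	return cpu;
}

In the fully loaded case, should_we_balance() could then compare env->dst_cpu against this rotating cpu rather than against group_balance_cpu(sg), which is what keeps the cost pinned to the first cpu of the group today.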
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On Tue, Jan 07, 2014 at 12:55:18PM +, Morten Rasmussen wrote:
> My understanding is that should_we_balance() decides which cpu is
> eligible for doing the load balancing for a given domain (and the
> domains above). That is, only one cpu in a group is allowed to load
> balance between the local group and other groups. That cpu would
> therefore be responsible for pulling enough load that the groups are
> balanced even if it means temporarily overloading itself. The other cpus
> in the group will take care of load balancing the extra load within the
> local group later.

Correct.

> I may have missed something, but I don't understand the reason for the
> performance improvements that you are reporting. I see better numbers
> for a few benchmarks, but I still don't understand why the code makes
> sense after the cleanup. If we don't understand why it works, we cannot
> be sure that it doesn't harm other benchmarks. There is always a chance
> that we miss something but, IMHO, not having any idea to begin with
> increases the chances for problems later significantly. So why not get
> to the bottom of the problem of cleaning up cpu_load?
>
> Have you done more extensive benchmarking? Have you seen any regressions
> in other benchmarks?

I only remember hackbench numbers, and that generally fares well with a
more aggressive balancer: since it has no actual work to speak of, the
migration penalty is very low, and because there's a metric ton of tasks
the aggressive leveling makes for more coherent 'throughput'.
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On Mon, Jan 06, 2014 at 01:35:39PM +, Alex Shi wrote: > On 01/03/2014 12:04 AM, Morten Rasmussen wrote: > > On Wed, Dec 25, 2013 at 02:58:26PM +, Alex Shi wrote: > >> > From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001 > From: Alex Shi > Date: Sat, 23 Nov 2013 23:18:09 +0800 > Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving > > Task migration happens when target just a bit less then source cpu load. > To reduce such situation happens, aggravate the target cpu load with > sd->imbalance_pct/100 in wake_affine. > > In find_idlest/busiest_group, change the aggravate to local cpu only > from old group aggravation. > > on my pandaboard ES. > > latest kernel 527d1511310a89+ whole patchset > hackbench -T -g 10 -f 40 > 23.25" 21.99" > 23.16" 21.20" > 24.24" 21.89" > hackbench -p -g 10 -f 40 > 26.52" 21.46" > 23.89" 22.96" > 25.65" 22.73" > hackbench -P -g 10 -f 40 > 20.14" 19.72" > 19.96" 19.10" > 21.76" 20.03" > > Signed-off-by: Alex Shi > --- > kernel/sched/fair.c | 35 --- > 1 file changed, 16 insertions(+), 19 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index bccdd89..3623ba4 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct > task_struct *p, int nid) > > static unsigned long weighted_cpuload(const int cpu); > static unsigned long source_load(int cpu); > -static unsigned long target_load(int cpu); > +static unsigned long target_load(int cpu, int imbalance_pct); > static unsigned long power_of(int cpu); > static long effective_load(struct task_group *tg, int cpu, long wl, > long wg); > > @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu) > * Return a high guess at the load of a migration-target cpu weighted > * according to the scheduling class and "nice" value. 
> */ > -static unsigned long target_load(int cpu) > +static unsigned long target_load(int cpu, int imbalance_pct) > { > struct rq *rq = cpu_rq(cpu); > unsigned long total = weighted_cpuload(cpu); > > + /* > + * without cpu_load decay, in most of time cpu_load is same as total > + * so we need to make target a bit heavier to reduce task migration > + */ > + total = total * imbalance_pct / 100; > + > if (!sched_feat(LB_BIAS)) > return total; > > @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, > struct task_struct *p, int sync) > this_cpu = smp_processor_id(); > prev_cpu = task_cpu(p); > load = source_load(prev_cpu); > - this_load = target_load(this_cpu); > + this_load = target_load(this_cpu, 100); > > /* > * If sync wakeup then subtract the (maximum possible) > @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, > struct task_struct *p, int sync) > > if (balanced || > (this_load <= load && > - this_load + target_load(prev_cpu) <= tl_per_task)) { > + this_load + target_load(prev_cpu, 100) <= tl_per_task)) { > /* > * This domain has SD_WAKE_AFFINE and > * p is cache cold in this domain, and > @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct > task_struct *p, int this_cpu) > { > struct sched_group *idlest = NULL, *group = sd->groups; > unsigned long min_load = ULONG_MAX, this_load = 0; > - int imbalance = 100 + (sd->imbalance_pct-100)/2; > > do { > unsigned long load, avg_load; > @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, > struct task_struct *p, int this_cpu) > > for_each_cpu(i, sched_group_cpus(group)) { > /* Bias balancing toward cpus of our domain */ > - if (local_group) > + if (i == this_cpu) > >>> > >>> What is the motivation for changing the local_group load calculation? > >>> Now the load contributions of all cpus in the local group, except > >>> this_cpu, will contribute more as their contribution (this_load) is > >>> determined using target_load() instead. > >> > >> This part code 147cbb4bbe99, written i
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On 01/03/2014 12:04 AM, Morten Rasmussen wrote: > On Wed, Dec 25, 2013 at 02:58:26PM +, Alex Shi wrote: >> From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001 From: Alex Shi Date: Sat, 23 Nov 2013 23:18:09 +0800 Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving Task migration happens when target just a bit less then source cpu load. To reduce such situation happens, aggravate the target cpu load with sd->imbalance_pct/100 in wake_affine. In find_idlest/busiest_group, change the aggravate to local cpu only from old group aggravation. on my pandaboard ES. latest kernel 527d1511310a89+ whole patchset hackbench -T -g 10 -f 40 23.25" 21.99" 23.16" 21.20" 24.24" 21.89" hackbench -p -g 10 -f 40 26.52" 21.46" 23.89" 22.96" 25.65" 22.73" hackbench -P -g 10 -f 40 20.14" 19.72" 19.96" 19.10" 21.76" 20.03" Signed-off-by: Alex Shi --- kernel/sched/fair.c | 35 --- 1 file changed, 16 insertions(+), 19 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bccdd89..3623ba4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid) static unsigned long weighted_cpuload(const int cpu); static unsigned long source_load(int cpu); -static unsigned long target_load(int cpu); +static unsigned long target_load(int cpu, int imbalance_pct); static unsigned long power_of(int cpu); static long effective_load(struct task_group *tg, int cpu, long wl, long wg); @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu) * Return a high guess at the load of a migration-target cpu weighted * according to the scheduling class and "nice" value. */ -static unsigned long target_load(int cpu) +static unsigned long target_load(int cpu, int imbalance_pct) { struct rq *rq = cpu_rq(cpu); unsigned long total = weighted_cpuload(cpu); + /* + * without cpu_load decay, in most of time cpu_load is same as total + * so we need to make target a bit heavier to reduce task migration + */ + total = total * imbalance_pct / 100; + if (!sched_feat(LB_BIAS)) return total; @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); load = source_load(prev_cpu); - this_load = target_load(this_cpu); + this_load = target_load(this_cpu, 100); /* * If sync wakeup then subtract the (maximum possible) @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) if (balanced || (this_load <= load && - this_load + target_load(prev_cpu) <= tl_per_task)) { + this_load + target_load(prev_cpu, 100) <= tl_per_task)) { /* * This domain has SD_WAKE_AFFINE and * p is cache cold in this domain, and @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) { struct sched_group *idlest = NULL, *group = sd->groups; unsigned long min_load = ULONG_MAX, this_load = 0; - int imbalance = 100 + (sd->imbalance_pct-100)/2; do { unsigned long load, avg_load; @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) for_each_cpu(i, sched_group_cpus(group)) { /* Bias balancing toward cpus of our domain */ - if (local_group) + if (i == this_cpu) >>> >>> What is the motivation for changing the local_group load calculation? 
>>> Now the load contributions of all cpus in the local group, except >>> this_cpu, will contribute more as their contribution (this_load) is >>> determined using target_load() instead. >> >> This part code 147cbb4bbe99, written in 2005 for x86, at that time, only >> 2 cores(guess no HT at that time) in cpu socket. With the cores number > > NUMA support was already present. I guess that means support for systems > with significantly more than two cpus. Thanks a lot for comments, Morten! the
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On Wed, Dec 25, 2013 at 02:58:26PM +, Alex Shi wrote: > > >> From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001 > >> From: Alex Shi > >> Date: Sat, 23 Nov 2013 23:18:09 +0800 > >> Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving > >> > >> Task migration happens when target just a bit less then source cpu load. > >> To reduce such situation happens, aggravate the target cpu load with > >> sd->imbalance_pct/100 in wake_affine. > >> > >> In find_idlest/busiest_group, change the aggravate to local cpu only > >> from old group aggravation. > >> > >> on my pandaboard ES. > >> > >>latest kernel 527d1511310a89+ whole patchset > >> hackbench -T -g 10 -f 40 > >>23.25" 21.99" > >>23.16" 21.20" > >>24.24" 21.89" > >> hackbench -p -g 10 -f 40 > >>26.52" 21.46" > >>23.89" 22.96" > >>25.65" 22.73" > >> hackbench -P -g 10 -f 40 > >>20.14" 19.72" > >>19.96" 19.10" > >>21.76" 20.03" > >> > >> Signed-off-by: Alex Shi > >> --- > >> kernel/sched/fair.c | 35 --- > >> 1 file changed, 16 insertions(+), 19 deletions(-) > >> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >> index bccdd89..3623ba4 100644 > >> --- a/kernel/sched/fair.c > >> +++ b/kernel/sched/fair.c > >> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct > >> task_struct *p, int nid) > >> > >> static unsigned long weighted_cpuload(const int cpu); > >> static unsigned long source_load(int cpu); > >> -static unsigned long target_load(int cpu); > >> +static unsigned long target_load(int cpu, int imbalance_pct); > >> static unsigned long power_of(int cpu); > >> static long effective_load(struct task_group *tg, int cpu, long wl, long > >> wg); > >> > >> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu) > >> * Return a high guess at the load of a migration-target cpu weighted > >> * according to the scheduling class and "nice" value. 
> >> */ > >> -static unsigned long target_load(int cpu) > >> +static unsigned long target_load(int cpu, int imbalance_pct) > >> { > >>struct rq *rq = cpu_rq(cpu); > >>unsigned long total = weighted_cpuload(cpu); > >> > >> + /* > >> + * without cpu_load decay, in most of time cpu_load is same as total > >> + * so we need to make target a bit heavier to reduce task migration > >> + */ > >> + total = total * imbalance_pct / 100; > >> + > >>if (!sched_feat(LB_BIAS)) > >>return total; > >> > >> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, > >> struct task_struct *p, int sync) > >>this_cpu = smp_processor_id(); > >>prev_cpu = task_cpu(p); > >>load = source_load(prev_cpu); > >> - this_load = target_load(this_cpu); > >> + this_load = target_load(this_cpu, 100); > >> > >>/* > >> * If sync wakeup then subtract the (maximum possible) > >> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, > >> struct task_struct *p, int sync) > >> > >>if (balanced || > >>(this_load <= load && > >> - this_load + target_load(prev_cpu) <= tl_per_task)) { > >> + this_load + target_load(prev_cpu, 100) <= tl_per_task)) { > >>/* > >> * This domain has SD_WAKE_AFFINE and > >> * p is cache cold in this domain, and > >> @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct > >> task_struct *p, int this_cpu) > >> { > >>struct sched_group *idlest = NULL, *group = sd->groups; > >>unsigned long min_load = ULONG_MAX, this_load = 0; > >> - int imbalance = 100 + (sd->imbalance_pct-100)/2; > >> > >>do { > >>unsigned long load, avg_load; > >> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct > >> task_struct *p, int this_cpu) > >> > >>for_each_cpu(i, sched_group_cpus(group)) { > >>/* Bias balancing toward cpus of our domain */ > >> - if (local_group) > >> + if (i == this_cpu) > > > > What is the motivation for changing the local_group load calculation? > > Now the load contributions of all cpus in the local group, except > > this_cpu, will contribute more as their contribution (this_load) is > > determined using target_load() instead. > > This part code 147cbb4bbe99, written in 2005 for x86, at that time, only > 2 cores(guess no HT at that time) in cpu socket. With the cores number NUMA support was already present. I guess that means support for systems with significantly more than two cpus. > increasing trend, the sched_group become large and large, to give whole > group this bias valu
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
>> From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001 >> From: Alex Shi >> Date: Sat, 23 Nov 2013 23:18:09 +0800 >> Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving >> >> Task migration happens when target just a bit less then source cpu load. >> To reduce such situation happens, aggravate the target cpu load with >> sd->imbalance_pct/100 in wake_affine. >> >> In find_idlest/busiest_group, change the aggravate to local cpu only >> from old group aggravation. >> >> on my pandaboard ES. >> >> latest kernel 527d1511310a89+ whole patchset >> hackbench -T -g 10 -f 40 >> 23.25" 21.99" >> 23.16" 21.20" >> 24.24" 21.89" >> hackbench -p -g 10 -f 40 >> 26.52" 21.46" >> 23.89" 22.96" >> 25.65" 22.73" >> hackbench -P -g 10 -f 40 >> 20.14" 19.72" >> 19.96" 19.10" >> 21.76" 20.03" >> >> Signed-off-by: Alex Shi >> --- >> kernel/sched/fair.c | 35 --- >> 1 file changed, 16 insertions(+), 19 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index bccdd89..3623ba4 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct >> task_struct *p, int nid) >> >> static unsigned long weighted_cpuload(const int cpu); >> static unsigned long source_load(int cpu); >> -static unsigned long target_load(int cpu); >> +static unsigned long target_load(int cpu, int imbalance_pct); >> static unsigned long power_of(int cpu); >> static long effective_load(struct task_group *tg, int cpu, long wl, long >> wg); >> >> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu) >> * Return a high guess at the load of a migration-target cpu weighted >> * according to the scheduling class and "nice" value. >> */ >> -static unsigned long target_load(int cpu) >> +static unsigned long target_load(int cpu, int imbalance_pct) >> { >> struct rq *rq = cpu_rq(cpu); >> unsigned long total = weighted_cpuload(cpu); >> >> +/* >> + * without cpu_load decay, in most of time cpu_load is same as total >> + * so we need to make target a bit heavier to reduce task migration >> + */ >> +total = total * imbalance_pct / 100; >> + >> if (!sched_feat(LB_BIAS)) >> return total; >> >> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct >> task_struct *p, int sync) >> this_cpu = smp_processor_id(); >> prev_cpu = task_cpu(p); >> load = source_load(prev_cpu); >> -this_load = target_load(this_cpu); >> +this_load = target_load(this_cpu, 100); >> >> /* >> * If sync wakeup then subtract the (maximum possible) >> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct >> task_struct *p, int sync) >> >> if (balanced || >> (this_load <= load && >> - this_load + target_load(prev_cpu) <= tl_per_task)) { >> + this_load + target_load(prev_cpu, 100) <= tl_per_task)) { >> /* >> * This domain has SD_WAKE_AFFINE and >> * p is cache cold in this domain, and >> @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct >> task_struct *p, int this_cpu) >> { >> struct sched_group *idlest = NULL, *group = sd->groups; >> unsigned long min_load = ULONG_MAX, this_load = 0; >> -int imbalance = 100 + (sd->imbalance_pct-100)/2; >> >> do { >> unsigned long load, avg_load; >> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct >> task_struct *p, int this_cpu) >> >> for_each_cpu(i, sched_group_cpus(group)) { >> /* Bias balancing toward cpus of our domain */ >> -if (local_group) >> +if (i == this_cpu) > > What is the motivation for changing the local_group 
load calculation?
> Now the load contributions of all cpus in the local group, except
> this_cpu, will contribute more as their contribution (this_load) is
> determined using target_load() instead.

This part of the code, 147cbb4bbe99, was written in 2005 for x86; at that
time there were only 2 cores (and probably no HT) per cpu socket. With the
trend of increasing core counts, sched_groups are becoming larger and
larger, so giving this bias value to the whole group no longer makes sense.
It therefore looks reasonable to bias just this cpu.

>
> If I'm not mistaken, that will lead to more frequent load balancing as
> the local_group bias has been reduced. That is the opposite of your
> intentions based on your comment in target_load().
>
>> load =
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On 12/20/2013 07:19 PM, Morten Rasmussen wrote: >> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct >> task_struct *p, int this_cpu) >> > >> >for_each_cpu(i, sched_group_cpus(group)) { >> >/* Bias balancing toward cpus of our domain */ >> > - if (local_group) >> > + if (i == this_cpu) > What is the motivation for changing the local_group load calculation? > Now the load contributions of all cpus in the local group, except > this_cpu, will contribute more as their contribution (this_load) is > determined using target_load() instead. > > If I'm not mistaken, that will lead to more frequent load balancing as > the local_group bias has been reduced. That is the opposite of your > intentions based on your comment in target_load(). Good catch. will reconsider this again. :) > >> >load = source_load(i); >> >else >> > - load = target_load(i); >> > + load = target_load(i, sd->imbalance_pct); > You scale by sd->imbalance_pct instead of 100+(sd->imbalance_pct-100)/2 > that you removed above. sd->imbalance_pct may have been arbitrarily > chosen in the past, but changing it may affect behavior. > -- Thanks Alex -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On Thu, Dec 19, 2013 at 01:34:08PM +, Alex Shi wrote: > On 12/17/2013 11:38 PM, Peter Zijlstra wrote: > > On Tue, Dec 17, 2013 at 02:10:12PM +, Morten Rasmussen wrote: > >>> @@ -4135,7 +4141,7 @@ find_idlest_group(struct sched_domain *sd, struct > >>> task_struct *p, int this_cpu) > >>> if (local_group) > >>> load = source_load(i); > >>> else > >>> - load = target_load(i); > >>> + load = target_load(i, sd->imbalance_pct); > >> > >> Don't you apply imbalance_pct twice here? Later on in > >> find_idlest_group() you have: > >> > >>if (!idlest || 100*this_load < imbalance*min_load) > >>return NULL; > >> > >> where min_load comes from target_load(). > > > > Yes! exactly! this doesn't make any sense. > > Thanks a lot for review and comments! > > I changed the patch to following shape. and push it under Fengguang's testing > system monitor. Any testing are appreciated! > > BTW, Seems lots of changes in scheduler come from kinds of > scenarios/benchmarks > experience. But I still like to take any theoretical comments/suggestions. > > -- > Thanks > Alex > > === > > From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001 > From: Alex Shi > Date: Sat, 23 Nov 2013 23:18:09 +0800 > Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving > > Task migration happens when target just a bit less then source cpu load. > To reduce such situation happens, aggravate the target cpu load with > sd->imbalance_pct/100 in wake_affine. > > In find_idlest/busiest_group, change the aggravate to local cpu only > from old group aggravation. > > on my pandaboard ES. > > latest kernel 527d1511310a89+ whole patchset > hackbench -T -g 10 -f 40 > 23.25" 21.99" > 23.16" 21.20" > 24.24" 21.89" > hackbench -p -g 10 -f 40 > 26.52" 21.46" > 23.89" 22.96" > 25.65" 22.73" > hackbench -P -g 10 -f 40 > 20.14" 19.72" > 19.96" 19.10" > 21.76" 20.03" > > Signed-off-by: Alex Shi > --- > kernel/sched/fair.c | 35 --- > 1 file changed, 16 insertions(+), 19 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index bccdd89..3623ba4 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct > task_struct *p, int nid) > > static unsigned long weighted_cpuload(const int cpu); > static unsigned long source_load(int cpu); > -static unsigned long target_load(int cpu); > +static unsigned long target_load(int cpu, int imbalance_pct); > static unsigned long power_of(int cpu); > static long effective_load(struct task_group *tg, int cpu, long wl, long wg); > > @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu) > * Return a high guess at the load of a migration-target cpu weighted > * according to the scheduling class and "nice" value. 
> */ > -static unsigned long target_load(int cpu) > +static unsigned long target_load(int cpu, int imbalance_pct) > { > struct rq *rq = cpu_rq(cpu); > unsigned long total = weighted_cpuload(cpu); > > + /* > + * without cpu_load decay, in most of time cpu_load is same as total > + * so we need to make target a bit heavier to reduce task migration > + */ > + total = total * imbalance_pct / 100; > + > if (!sched_feat(LB_BIAS)) > return total; > > @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct > task_struct *p, int sync) > this_cpu = smp_processor_id(); > prev_cpu = task_cpu(p); > load = source_load(prev_cpu); > - this_load = target_load(this_cpu); > + this_load = target_load(this_cpu, 100); > > /* >* If sync wakeup then subtract the (maximum possible) > @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct > task_struct *p, int sync) > > if (balanced || > (this_load <= load && > - this_load + target_load(prev_cpu) <= tl_per_task)) { > + this_load + target_load(prev_cpu, 100) <= tl_per_task)) { > /* >* This domain has SD_WAKE_AFFINE and >* p is cache cold in this domain, and > @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct > task_struct *p, int this_cpu) > { > struct sched_group *idlest = NULL, *group = sd->groups; > unsigned long min_load = ULONG_MAX, this_load = 0; > - int imbalance = 100 + (sd->imbalance_pct-100)/2; > > do { > unsi
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On 12/17/2013 11:38 PM, Peter Zijlstra wrote: > On Tue, Dec 17, 2013 at 02:10:12PM +, Morten Rasmussen wrote: >>> @@ -4135,7 +4141,7 @@ find_idlest_group(struct sched_domain *sd, struct >>> task_struct *p, int this_cpu) >>> if (local_group) >>> load = source_load(i); >>> else >>> - load = target_load(i); >>> + load = target_load(i, sd->imbalance_pct); >> >> Don't you apply imbalance_pct twice here? Later on in >> find_idlest_group() you have: >> >> if (!idlest || 100*this_load < imbalance*min_load) >> return NULL; >> >> where min_load comes from target_load(). > > Yes! exactly! this doesn't make any sense. Thanks a lot for review and comments! I changed the patch to following shape. and push it under Fengguang's testing system monitor. Any testing are appreciated! BTW, Seems lots of changes in scheduler come from kinds of scenarios/benchmarks experience. But I still like to take any theoretical comments/suggestions. -- Thanks Alex === >From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001 From: Alex Shi Date: Sat, 23 Nov 2013 23:18:09 +0800 Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving Task migration happens when target just a bit less then source cpu load. To reduce such situation happens, aggravate the target cpu load with sd->imbalance_pct/100 in wake_affine. In find_idlest/busiest_group, change the aggravate to local cpu only from old group aggravation. on my pandaboard ES. latest kernel 527d1511310a89+ whole patchset hackbench -T -g 10 -f 40 23.25" 21.99" 23.16" 21.20" 24.24" 21.89" hackbench -p -g 10 -f 40 26.52" 21.46" 23.89" 22.96" 25.65" 22.73" hackbench -P -g 10 -f 40 20.14" 19.72" 19.96" 19.10" 21.76" 20.03" Signed-off-by: Alex Shi --- kernel/sched/fair.c | 35 --- 1 file changed, 16 insertions(+), 19 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bccdd89..3623ba4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid) static unsigned long weighted_cpuload(const int cpu); static unsigned long source_load(int cpu); -static unsigned long target_load(int cpu); +static unsigned long target_load(int cpu, int imbalance_pct); static unsigned long power_of(int cpu); static long effective_load(struct task_group *tg, int cpu, long wl, long wg); @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu) * Return a high guess at the load of a migration-target cpu weighted * according to the scheduling class and "nice" value. 
*/ -static unsigned long target_load(int cpu) +static unsigned long target_load(int cpu, int imbalance_pct) { struct rq *rq = cpu_rq(cpu); unsigned long total = weighted_cpuload(cpu); + /* +* without cpu_load decay, in most of time cpu_load is same as total +* so we need to make target a bit heavier to reduce task migration +*/ + total = total * imbalance_pct / 100; + if (!sched_feat(LB_BIAS)) return total; @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); load = source_load(prev_cpu); - this_load = target_load(this_cpu); + this_load = target_load(this_cpu, 100); /* * If sync wakeup then subtract the (maximum possible) @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) if (balanced || (this_load <= load && -this_load + target_load(prev_cpu) <= tl_per_task)) { +this_load + target_load(prev_cpu, 100) <= tl_per_task)) { /* * This domain has SD_WAKE_AFFINE and * p is cache cold in this domain, and @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) { struct sched_group *idlest = NULL, *group = sd->groups; unsigned long min_load = ULONG_MAX, this_load = 0; - int imbalance = 100 + (sd->imbalance_pct-100)/2; do { unsigned long load, avg_load; @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) for_each_cpu(i, sched_group_cpus(group)) { /* Bias bala
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On Tue, Dec 17, 2013 at 02:10:12PM +, Morten Rasmussen wrote: > > @@ -4135,7 +4141,7 @@ find_idlest_group(struct sched_domain *sd, struct > > task_struct *p, int this_cpu) > > if (local_group) > > load = source_load(i); > > else > > - load = target_load(i); > > + load = target_load(i, sd->imbalance_pct); > > Don't you apply imbalance_pct twice here? Later on in > find_idlest_group() you have: > > if (!idlest || 100*this_load < imbalance*min_load) > return NULL; > > where min_load comes from target_load(). Yes! exactly! this doesn't make any sense. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On Tue, Dec 03, 2013 at 09:05:56AM +, Alex Shi wrote: > Task migration happens when target just a bit less then source cpu load. > To reduce such situation happens, aggravate the target cpu load with > sd->imbalance_pct/100. > > This patch removes the hackbench thread regression on Daniel's > Intel Core2 server. > > a5d6e63 +patch1~3 +patch1~4 > hackbench -T -s 4096 -l 1000 -g 10 -f 40 > 27.914" 38.694" 28.587" > 28.390" 38.341" 29.513" > 28.048" 38.626" 28.706" > > Signed-off-by: Alex Shi > --- > kernel/sched/fair.c | 18 -- > 1 file changed, 12 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index bccdd89..c49b7ba 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct > task_struct *p, int nid) > > static unsigned long weighted_cpuload(const int cpu); > static unsigned long source_load(int cpu); > -static unsigned long target_load(int cpu); > +static unsigned long target_load(int cpu, int imbalance_pct); > static unsigned long power_of(int cpu); > static long effective_load(struct task_group *tg, int cpu, long wl, long wg); > > @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu) > * Return a high guess at the load of a migration-target cpu weighted > * according to the scheduling class and "nice" value. > */ > -static unsigned long target_load(int cpu) > +static unsigned long target_load(int cpu, int imbalance_pct) > { > struct rq *rq = cpu_rq(cpu); > unsigned long total = weighted_cpuload(cpu); > > + /* > + * without cpu_load decay, in most of time cpu_load is same as total > + * so we need to make target a bit heavier to reduce task migration > + */ > + total = total * imbalance_pct / 100; > + > if (!sched_feat(LB_BIAS)) > return total; > > @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct > task_struct *p, int sync) > this_cpu = smp_processor_id(); > prev_cpu = task_cpu(p); > load = source_load(prev_cpu); > - this_load = target_load(this_cpu); > + this_load = target_load(this_cpu, 100); > > /* >* If sync wakeup then subtract the (maximum possible) > @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct > task_struct *p, int sync) > > if (balanced || > (this_load <= load && > - this_load + target_load(prev_cpu) <= tl_per_task)) { > + this_load + target_load(prev_cpu, 100) <= tl_per_task)) { > /* >* This domain has SD_WAKE_AFFINE and >* p is cache cold in this domain, and > @@ -4135,7 +4141,7 @@ find_idlest_group(struct sched_domain *sd, struct > task_struct *p, int this_cpu) > if (local_group) > load = source_load(i); > else > - load = target_load(i); > + load = target_load(i, sd->imbalance_pct); Don't you apply imbalance_pct twice here? Later on in find_idlest_group() you have: if (!idlest || 100*this_load < imbalance*min_load) return NULL; where min_load comes from target_load(). > > avg_load += load; > } > @@ -5478,7 +5484,7 @@ static inline void update_sg_lb_stats(struct lb_env > *env, > > /* Bias balancing toward cpus of our domain */ > if (local_group) > - load = target_load(i); > + load = target_load(i, env->sd->imbalance_pct); You probably have the same problem here. Morten -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
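To spell out the double application Morten points at (illustration only, assuming the common imbalance_pct = 125 at this domain level), the find_idlest_group() path in this version of the patch ends up biasing the candidate groups twice:

	int imbalance = 100 + (sd->imbalance_pct - 100) / 2;	/* = 112 with imbalance_pct = 125 */
	...
	load = target_load(i, sd->imbalance_pct);	/* each remote cpu's load already scaled by 125/100 */
	...
	if (!idlest || 100*this_load < imbalance*min_load)	/* min_load gets scaled again by 112/100 */
		return NULL;

So a remote group is only chosen once the local load exceeds roughly 1.12 * 1.25 ~= 1.4 times the remote group's real average load, rather than only the ~1.12 factor that the unpatched check applies.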
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
> We observed a 150% performance gain with the vm-scalability/300s-mmap-pread-seq
> testcase with this patch applied. Here is a list of changes we got so far:
>
> testbox : brickland

I found some explanation of brickland on the wiki: a high-end server
platform based on the Ivy Bridge-EX processor.

> testcase: vm-scalability/300s-mmap-pread-seq

https://github.com/aristeu/vm-scalability

Thanks a lot for testing! :)
Do you have data for the base upstream commit?

>
>     f1b6442c7dd12802e622      d70495ef86f397816d73
>          (parent commit)             (this commit)
>
>          26393249.80     +150.9%      66223933.60   vm-scalability.throughput
>
>               225.12      -49.9%           112.75   time.elapsed_time
>             36333.40      -90.7%          3392.20   vmstat.system.cs
>                 2.40     +375.0%            11.40   vmstat.cpu.id
>           3770081.60      -97.7%         87673.40   time.major_page_faults
>           3975276.20      -97.0%        117409.60   time.voluntary_context_switches
>                 3.05     +301.7%            12.24   iostat.cpu.idle
>             21118.41      -70.3%          6277.19   time.system_time
>                18.40     +130.4%            42.40   vmstat.cpu.us
>                77.00      -41.3%            45.20   vmstat.cpu.sy
>             47459.60      -31.3%         32592.20   vmstat.system.in
>             82435.40      -12.1%         72443.60   time.involuntary_context_switches
>              5128.13      +14.0%          5848.30   time.user_time
>             11656.20       -7.8%         10745.60   time.percent_of_cpu_this_job_got
>        1069997484.80       +0.3%    1073679919.00   time.minor_page_faults
>

-- 
Thanks
    Alex
Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
On Tue, Dec 03, 2013 at 05:05:56PM +0800, Alex Shi wrote: > Task migration happens when target just a bit less then source cpu load. > To reduce such situation happens, aggravate the target cpu load with > sd->imbalance_pct/100. > > This patch removes the hackbench thread regression on Daniel's > Intel Core2 server. > > a5d6e63 +patch1~3 +patch1~4 > hackbench -T -s 4096 -l 1000 -g 10 -f 40 > 27.914" 38.694" 28.587" > 28.390" 38.341" 29.513" > 28.048" 38.626" 28.706" > > Signed-off-by: Alex Shi Hi Alex, We obsevered 150% performance gain with vm-scalability/300s-mmap-pread-seq testcase with this patch applied. Here is a list of changes we got so far: testbox : brickland testcase: vm-scalability/300s-mmap-pread-seq f1b6442c7dd12802e622 d70495ef86f397816d73 (parent commit)(this commit) 26393249.80 +150.9% 66223933.60 vm-scalability.throughput 225.12 -49.9% 112.75 time.elapsed_time 36333.40 -90.7% 3392.20 vmstat.system.cs 2.40 +375.0%11.40 vmstat.cpu.id 3770081.60 -97.7% 87673.40 time.major_page_faults 3975276.20 -97.0%117409.60 time.voluntary_context_switches 3.05 +301.7%12.24 iostat.cpu.idle 21118.41 -70.3% 6277.19 time.system_time 18.40 +130.4%42.40 vmstat.cpu.us 77.00 -41.3%45.20 vmstat.cpu.sy 47459.60 -31.3% 32592.20 vmstat.system.in 82435.40 -12.1% 72443.60 time.involuntary_context_switches 5128.13 +14.0% 5848.30 time.user_time 11656.20-7.8% 10745.60 time.percent_of_cpu_this_job_got 1069997484.80+0.3% 1073679919.00 time.minor_page_faults --yliu > --- > kernel/sched/fair.c | 18 -- > 1 file changed, 12 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index bccdd89..c49b7ba 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct > task_struct *p, int nid) > > static unsigned long weighted_cpuload(const int cpu); > static unsigned long source_load(int cpu); > -static unsigned long target_load(int cpu); > +static unsigned long target_load(int cpu, int imbalance_pct); > static unsigned long power_of(int cpu); > static long effective_load(struct task_group *tg, int cpu, long wl, long wg); > > @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu) > * Return a high guess at the load of a migration-target cpu weighted > * according to the scheduling class and "nice" value. 
> */ > -static unsigned long target_load(int cpu) > +static unsigned long target_load(int cpu, int imbalance_pct) > { > struct rq *rq = cpu_rq(cpu); > unsigned long total = weighted_cpuload(cpu); > > + /* > + * without cpu_load decay, in most of time cpu_load is same as total > + * so we need to make target a bit heavier to reduce task migration > + */ > + total = total * imbalance_pct / 100; > + > if (!sched_feat(LB_BIAS)) > return total; > > @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct > task_struct *p, int sync) > this_cpu = smp_processor_id(); > prev_cpu = task_cpu(p); > load = source_load(prev_cpu); > - this_load = target_load(this_cpu); > + this_load = target_load(this_cpu, 100); > > /* >* If sync wakeup then subtract the (maximum possible) > @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct > task_struct *p, int sync) > > if (balanced || > (this_load <= load && > - this_load + target_load(prev_cpu) <= tl_per_task)) { > + this_load + target_load(prev_cpu, 100) <= tl_per_task)) { > /* >* This domain has SD_WAKE_AFFINE and >* p is cache cold in this domain, and > @@ -4135,7 +4141,7 @@ find_idlest_group(struct sched_domain *sd, struct > task_struct *p, int this_cpu) > if (local_group) > load = source_load(i); > else > - load = target_load(i); > + load = target_load(i, sd->imbalance_pct); > > avg_load += load; > } > @@ -5478,7 +5484,7 @@ static inline void update_sg_lb_stats(struct lb_env > *env, > > /* Bias balancing toward cpus of our domain */ > if (local_group) > - load = target_l