[RFC PATCH 01/13] sched:Prevent movement of short running tasks during load balancing

2012-10-25 Thread Preeti U Murthy
Prevent sched groups with low load, as tracked by PJT's per-entity
load-tracking metric, from being candidates for the load balance routine.
The threshold is chosen to be 1024 + 15%*1024 (i.e. about 1178). But using
PJT's metric it has been observed that even when three 10% tasks are
running, the load sometimes does not exceed this threshold. The call should
be taken only if the tasks can afford to be throttled.

This is why an additional metric has been included, which can determine how
long we can tolerate tasks not being moved even if the load is low.

Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
---
 kernel/sched/fair.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dbddcf6..e02dad4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4188,6 +4188,7 @@ struct sd_lb_stats {
  */
 struct sg_lb_stats {
unsigned long avg_load; /* Avg load across the CPUs of the group */
+   u64 avg_cfs_runnable_load; /* Equivalent of avg_load but calculated using PJT's metric */
unsigned long group_load; /* Total load over the CPUs of the group */
unsigned long sum_nr_running; /* Nr tasks running in the group */
unsigned long sum_weighted_load; /* Weighted load of group's tasks */
@@ -4504,6 +4505,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
unsigned long load, max_cpu_load, min_cpu_load;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
unsigned long avg_load_per_task = 0;
+   u64 group_load = 0; /* computed using PJT's metric */
int i;
 
if (local_group)
@@ -4548,6 +4550,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (idle_cpu(i))
sgs->idle_cpus++;

+   group_load += cpu_rq(i)->cfs.runnable_load_avg;
update_sg_numa_stats(sgs, rq);
}
 
@@ -4572,6 +4575,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
 
/*
+* Check if the sched group has not crossed the threshold.
+*
+* Also check if the sched_group, although being within the threshold, is
+* not queueing too many tasks. If yes to both, then make it an
+* invalid candidate for load balancing.
+*
+* The below condition is included as a tunable to meet performance and
+* power needs.
+*/
+   sgs->avg_cfs_runnable_load = (group_load * SCHED_POWER_SCALE) / group->sgp->power;
+   if (sgs->avg_cfs_runnable_load <= 1178 && sgs->sum_nr_running <= 2)
+   sgs->avg_cfs_runnable_load = 0;
+
+   /*
 * Consider the group unbalanced when the imbalance is larger
 * than the average weight of a task.
 *




Re: [RFC PATCH 00/13] sched: Integrating Per-entity-load-tracking with the core scheduler

2012-10-25 Thread Preeti U Murthy
Hi Peter,
Thank you very much for your feedback. Please ignore the previous post; I
am extremely sorry about the word wrap issues with it.

On 10/25/2012 09:26 PM, Peter Zijlstra wrote:
 OK, so I tried reading a few patches and I'm completely failing.. maybe
 its late and my brain stopped working, but it simply doesn't make any
 sense.
 
 Most changelogs and comments aren't really helping either. At best they
 mention what you're doing, not why and how. This means I get to
 basically duplicate your entire thought pattern and I might as well do
 the patches myself.
 
 I also don't see the 'big' picture of what you're doing, you start out
 by some weird avoid short running task movement.. why is that a good
 start?
 
 I would have expected a series like this to replace rq->cpu_load /
 rq->load with a saner metric and go from there.. instead it looks like
 its going about at things completely backwards. Fixing small details
 instead of the big things.
 
 Do explain..
 
Let me see if I get what you are saying here right. You want to replace,
for example, cfs_rq->load.weight with a saner metric because it does not
consider the run time of the sched entities queued on it, merely their
priorities. If this is right, then in this patchset I believe
cfs_rq->runnable_load_avg would be that right metric, because it
considers the run time of the sched entities queued on it.

So why didn't I replace it? I added cfs_rq->runnable_load_avg as an
additional metric instead of replacing the older metric. I let the old
metric be as a dead metric and used the newer metric as an
alternative. So if this alternate metric does not do us good, we have the
old metric to fall back on.

 At best they mention what you're doing, not why and how.
So the above explains *what* I am doing.

*How* am I doing it: Everywhere the scheduler needs to make a decision on:

 a. find_busiest_group/find_idlest_group/update_sg_lb_stats: use the sum of
cfs_rq->runnable_load_avg to decide this instead of the sum of
cfs_rq->load.weight.

 b. find_busiest_queue/find_idlest_queue: use cfs_rq->runnable_load_avg
to decide this instead of cfs_rq->load.weight.

 c. move_tasks: use se->avg.load_avg_contrib to decide the weight of
each task instead of se->load.weight, as the former reflects the runtime
of the sched entity and hence its actual load.

This is what my patches 3-13 do. Merely *replace*.

*Why am I doing it*: Feed the load balancer with a more realistic metric
to do load balancing and observe the consequences.

 you start out by some weird avoid short running task movement.
 why is that a good start?

The short running tasks are not really burdening you; they have very
little run time, so why move them?
Throughout the concept of load balancing the focus is on *balancing the
*existing* load* between the sched groups, but not really on evaluating the
*absolute load* of any given sched group.

Why is this *the start*? This is like a round of elimination before the
actual interview round: try to have only those sched groups as
candidates for load balancing that are sufficiently loaded [Patch1].
This *sufficiently loaded* is decided by the new metric explained in the
*How* above.

 I also don't see the 'big' picture of what you're doing

Patch1: is explained above. *End result: good candidates for lb.*
Patch2:

   10%
   10%
   10%                 100%
 -----------         -----------
  SCHED_GP1           SCHED_GP2

 Before Patch                After Patch
 ------------                -----------
 SCHED_GP1 load: 3072        SCHED_GP1 load: 512
 SCHED_GP2 load: 1024        SCHED_GP2 load: 1024

 SCHED_GP1 is busiest        SCHED_GP2 is busiest

But Patch2 re-decides between GP1 and GP2 to check if the latency of
tasks is getting affected although there is less load on GP1. If yes, it
is a better *busy* group.

*End Result: Better candidates for lb*

Rest of the patches: now that we have our busy sched group, let us load
balance with the aid of the new metric.
*End Result: Hopefully a more sensible movement of loads*
This is how I build the picture.

Regards
Preeti U Murthy



Re: [RFC PATCH 00/13] sched: Integrating Per-entity-load-tracking with the core scheduler

2012-10-26 Thread Preeti U Murthy
On 10/26/2012 05:59 PM, Peter Zijlstra wrote:
 On Thu, 2012-10-25 at 23:42 +0530, Preeti U Murthy wrote:

 firstly, cfs_rq is the wrong place for a per-cpu load measure, secondly
 why add another load field instead of fixing the one we have? 
Hmm.., rq->load.weight is the place.

 So why didnt I replace? I added cfs_rq->runnable_load_avg as an
 additional metric instead of replacing the older metric.I let the old
 metric be as a dead metric and used the newer metric as an
 alternative.so if this alternate metric does not do us good we have the
 old metric to fall back on.
 
 That's wrong.. either it works and we can apply the patches or it
 doesn't and we won't. What you did makes it very hard to see you
 preserve the current balancer -- which afaict you don't, you change the
 balancer with the very first patch.
You are right on this Peter.
 
 Why not update weighted_cpuload(), update_idle_cpu_load() and
 update_cpu_load_active() to use another metric and go from there. If you
 do that the whole balancer will auto-magically use the new weight
 measure.
 
 Once you have that running, you can look at modifying it.
Hmm...Correct.
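(For illustration, a minimal sketch of the direction suggested above, i.e.
switching the balancer's input function over to the per-entity-load-tracking
metric. The body below is an assumption based on the fields discussed in
this thread, not a posted patch.)

static unsigned long weighted_cpuload(const int cpu)
{
	/* was: return cpu_rq(cpu)->load.weight; */
	return cpu_rq(cpu)->cfs.runnable_load_avg;
}

With something like this in place, every balancer path that already calls
weighted_cpuload() picks up the new metric without further changes.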

  a.find_busiest_group/find_idlest_group/update_sg_lb_stats:use sum of
 cfs_rq->runnable_load_avg to decide this instead of sum of
 cfs_rq->load.weight.
 
 But the first patches are about adding new balancing conditions, that's
 a complete fail..
 
  b.find_busiest_queue/find_idlest_queue: use cfs_rq->runnable_load_avg
 to decide this instead of cfs_rq->load.weight
 
 See, if you would have changed the input function (weighted_cpuload),
 you wouldn't have needed to touch any of this.
 
I see your point.
  c.move_tasks: use se->avg.load_avg_contrib to decide the weight of
 each task instead of se->load.weight as the former reflects the runtime
 of the sched entity and hence its actual load.
 
 The changelog in that patch (7) is completely devoid of any useful
 information.
 
 This is what my patches3-13 do.Merely *Replace*.

 *Why am I doing it*: Feed the load balancer with a more realistic metric
 to do load balancing and observe the consequences.
 
 I know why you're doing the entire thing, but you fail to motivate each
 individual change. A changelog should read something like:
 
   current code does blah, this has a problem when blah, we can improve
 this doing blah because blah.
 
Ah! I get it.

 you start out by some weird avoid short running task movement.
 why is that a good start?

 The short running tasks are not really burdening you,they have very
 little run time,so why move them?
 
 The most important part is why this would be a good start to the series,
 it is not.
 
 The patch is also dubious at best; short running might be all you have,
 your definition of 'short' is also iffy.
My definition of 'short' was bursty loads. What I observed from using the
new metric to study the effective load calculation was that when there are
around 2-3 such bursty loads, the effective load stays below the
threshold that I have stated, and I thought this would be a good enough
excuse to let the loads stay on the cpu.

Bursty being a load that sleeps for 9ms every 10ms for a duration of
10s (a 10% load) in my experiments.
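(For illustration, a userspace sketch of such a bursty 10% load: busy for
roughly 1ms, asleep for 9ms, repeated for about 10s. This is an assumption
about the shape of the test load, not the actual harness that was used.)

#include <time.h>

static void burn_ms(long ms)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000L +
		 (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
}

int main(void)
{
	struct timespec rest = { 0, 9 * 1000000L };	/* sleep 9ms */
	int i;

	for (i = 0; i < 1000; i++) {	/* 1000 * 10ms = ~10s */
		burn_ms(1);		/* ~10% duty cycle */
		nanosleep(&rest, NULL);
	}
	return 0;
}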
 
 Throughout the concept of load balancing the focus is on *balancing the
 *existing* load* between the sched groups.But not really evaluating the
 *absolute load* of any given sched group.
 
 Which is why you're going to change the metrics.. the balancer really
 only cares about making load equal, flipping the metric of the load
 doesn't change anything fundamental.
Ok.
 
 Why is this *the start*? This is like a round of elimination before the
 actual interview round  Try to have only those sched groups as
 candidates for load balancing that are sufficiently loaded.[Patch1]
 This *sufficiently loaded* is decided by the new metric explained in the
 *How* above.
 
 No, this is absolutely wrong.
 
 
 So a sane series would introduce maybe two functions: cpu_load() and
 task_load() and use those where we now use rq->load.weight and
 p->se.load.weight for load balancing purposes. Implement these functions
 using those two expression. So effectively this patch is a NOP.
 
 Secondly, switch these two functions over to the per-task based
 averages.
 
 Tada! all done. The load balancer will then try and equalize effective
 load instead of instant load.
 
 It will do the 3x10% vs 100% thing correctly with just those two
 patches. Simply because it will report a lower cpu-load for the 3x10%
 case than it will for the 100% case, no need to go fudge about in the
 load-balance internals.
 
 Once you've got this correctly done, you can go change balancing to
 better utilize the new metric, like use the effective load instead of
 nr_running against the capacity and things like that. But for every such
 change you want to be very careful and run all the benchmarks you can
 find -- in fact you want to do that after the 2nd patch too.
 
I agree with this entire suggestion. Perhaps I was hesitant
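(A minimal sketch of the two wrappers Peter describes above. The names are
his suggestion; the bodies here are assumptions tied to the
per-entity-load-tracking fields discussed in this thread.)

/* Step 1 of such a series would define these as plain wrappers returning
 * rq->load.weight and p->se.load.weight (effectively a NOP); step 2 then
 * flips them over to the per-entity averages, as sketched here. */
static inline unsigned long cpu_load(int cpu)
{
	return cpu_rq(cpu)->cfs.runnable_load_avg;
}

static inline unsigned long task_load(struct task_struct *p)
{
	return p->se.avg.load_avg_contrib;
}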

Re: [RFC PATCH 00/13] sched: Integrating Per-entity-load-tracking with the core scheduler

2012-10-26 Thread Preeti U Murthy
On 10/26/2012 06:37 PM, Ingo Molnar wrote:
 
 * Peter Zijlstra a.p.zijls...@chello.nl wrote:
 
 [...]

 So a sane series would introduce maybe two functions: 
 cpu_load() and task_load() and use those where we now use 
 rq->load.weight and p->se.load.weight for load balancing 
 purposes. Implement these functions using those two 
 expression. So effectively this patch is a NOP.

 Secondly, switch these two functions over to the per-task 
 based averages.

 Tada! all done. The load balancer will then try and equalize 
 effective load instead of instant load.

 It will do the 3x10% vs 100% thing correctly with just those 
 two patches. Simply because it will report a lower cpu-load 
 for the 3x10% case than it will for the 100% case, no need to 
 go fudge about in the load-balance internals.

 Once you've got this correctly done, you can go change 
 balancing to better utilize the new metric, like use the 
 effective load instead of nr_running against the capacity and 
 things like that. But for every such change you want to be 
 very careful and run all the benchmarks you can find -- in 
 fact you want to do that after the 2nd patch too.
 
 If anyone posted that simple two-patch series that switches over 
 to the new load metrics I'd be happy to test the performance of 
 those.
 
 Having two parallel load metrics is really not something that we 
 should tolerate for too long.
 
 Thanks,
 
   Ingo
 
Right Ingo. I will incorporate this approach and post out very soon.

Thank you

Regards
Preeti



Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

2013-04-21 Thread Preeti U Murthy
Hi Vincent,

On 04/05/2013 04:38 PM, Vincent Guittot wrote:
 Peter,
 
 After some thoughts about your comments, I can update the buddy cpu
 during ILB or periodic LB to a new idle core and extend the packing
 mechanism. Does this additional mechanism sound better for you?

If the primary goal of this patchset is to pack small tasks in fewer
power domains, then why not see if the power aware scheduler patchset by
Alex does the same for you? The reasons being:

a. The power aware scheduler also checks if a task is small enough to be
packed on a cpu which has just enough capacity to take on that
task (leader cpu). This cpu belongs to a scheduler group which is nearly
full (group_leader), so we end up packing tasks.

b. The overhead of assigning a buddy cpu gets eliminated because the best
cpu for packing is decided during wake up.

c. This is a scalable solution because if the leader cpu is busy, then any
other idle cpu from that group_leader is chosen. Eventually you end up
packing anyway.

The reason that I am suggesting this is that we could unify the power
awareness of the scheduler under one umbrella. And I believe that the
current power aware scheduler patchset is flexible enough to do this and
that we must cash in on it.

Thanks

Regards
Preeti U Murthy
 
 Vincent
 
 On 26 March 2013 15:42, Peter Zijlstra pet...@infradead.org wrote:
 On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
 But ha! here's your NO_HZ link.. but does the above DTRT and ensure
 that the ILB is a little core when possible?

 The loop looks for an idle CPU as close as possible to the buddy CPU
 and the buddy CPU is the 1st CPU has been chosen. So if your buddy is
 a little and there is an idle little, the ILB will be this idle
 little.

 Earlier you wrote:

   | Cluster 0   | Cluster 1   |
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

 So extrapolating that to a 4+4 big-little you'd get something like:

   |   little  A9  ||   big A15 |
   | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
 --+---+---+---+---++---+---+---+---+
 buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |

 Right?

 So supposing the current ILB is 6, we'll only check 4, not 0-3, even
 though there might be a perfectly idle cpu in there.

 Also, your scheme fails to pack when cpus 0,4 are filled, even when
 there's idle cores around.

 If we'd use the ILB as packing cpu, we would simply select a next pack
 target once the old one fills up.

 



Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-04 Thread Preeti U Murthy
Hi Joonsoo,

On 04/04/2013 06:12 AM, Joonsoo Kim wrote:
 Hello, Preeti.

 
 So, how about extending a sched_period with rq->nr_running, instead of
 cfs_rq->nr_running? It is my quick thought and I think that we can ensure
 to run atleast once in this extending sched_period.

Yeah this seems to be correct. This would ensure sched_min_granularity
also. So then in the scenario where there are 2 tgs in a runqueue with
10 tasks each, when we calculate the sched_slice of any task, the
__sched_period() would return 4*20 = 80ms.

The sched_slice of each of the tasks would be 80/20 = 4ms. But what about
the sched_slice of each task group? How would that be calculated then?
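(For reference, a sketch of __sched_period() roughly as it stands in
kernel/sched/fair.c at this point; with 20 runnable tasks and a 4ms
minimum granularity it returns 4ms * 20 = 80ms, as used above.)

static u64 __sched_period(unsigned long nr_running)
{
	u64 period = sysctl_sched_latency;
	unsigned long nr_latency = sched_nr_latency;

	/* extend the period so every task still gets min_granularity */
	if (unlikely(nr_running > nr_latency)) {
		period = sysctl_sched_min_granularity;
		period *= nr_running;
	}
	return period;
}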

Let us take the above example and walk through this problem. This would
probably help us spot the issues involved with this.

 And, do we leave a problem if we cannot guaranteed atleast once property?

This would depend on the results of the benchmarks with the changes. I am
unable to comment on this off the top of my head.

Thank you

Regards
Preeti U Murthy






Re: [patch v7 20/21] sched: don't do power balance on share cpu power domain

2013-04-07 Thread Preeti U Murthy
Hi Alex,

On 04/04/2013 07:31 AM, Alex Shi wrote:
 Packing tasks among such domain can't save power, just performance
 losing. So no power balance on them.

As far as my understanding goes, the powersave policy is the one that tries
to pack tasks onto a SIBLING domain (a domain where SD_SHARE_CPUPOWER is
set). The balance policy does not do that, meaning it does not pack on the
domain that shares CPU power, but packs across all other domains. So the
change you are making below results in nothing but the default behaviour
of the balance policy.

Correct me if I am wrong, but my point is that, as it looks to me, the
powersave policy is introduced in this patchset, and with the below patch
its characteristic behaviour of packing onto domains sharing cpu power is
removed, thus making it default to the balance policy. Now there are two
policies which behave the same way: balance and powersave.

 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 047a1b3..3a0284b 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3503,7 +3503,7 @@ static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
 
   policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
   if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
 - if (wakeup)
 + if (wakeup && !(sd->flags & SD_SHARE_CPUPOWER))
   new_cpu = find_leader_cpu(sds->group_leader,
   p, cpu, policy);
   /* for fork balancing and a little busy task */
 @@ -4410,8 +4410,9 @@ static unsigned long task_h_load(struct task_struct *p)
  static inline void init_sd_lb_power_stats(struct lb_env *env,
   struct sd_lb_stats *sds)
  {
 - if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
 - env->idle == CPU_NOT_IDLE) {
 + if (sched_balance_policy == SCHED_POLICY_PERFORMANCE
 + || env->sd->flags & SD_SHARE_CPUPOWER
 + || env->idle == CPU_NOT_IDLE) {
   env->flags &= ~LBF_POWER_BAL;
   env->flags |= LBF_PERF_BAL;
   return;
 

Regards
Preeti U Murthy



Re: [patch v7 20/21] sched: don't do power balance on share cpu power domain

2013-04-07 Thread Preeti U Murthy
Hi Alex,

I am sorry I overlooked the changes you have made to the power
scheduling policies.Now you have just two : performance and powersave.

Hence you can ignore my below comments. But if you use group->capacity
instead of group->weight for the threshold, like you did for the balance
policy in your version 5 of this patchset, don't you think the below patch
can be avoided? group->capacity being the threshold will automatically
ensure that you don't pack onto domains that share cpu power.

Regards
Preeti U Murthy

On 04/08/2013 08:47 AM, Preeti U Murthy wrote:
 Hi Alex,
 
 On 04/04/2013 07:31 AM, Alex Shi wrote:
 Packing tasks among such domain can't save power, just performance
 losing. So no power balance on them.
 
 As far as my understanding goes, powersave policy is the one that tries
 to pack tasks onto a SIBLING domain( domain where SD_SHARE_CPUPOWER is
 set).balance policy does not do that,meaning it does not pack on the
 domain that shares CPU power,but packs across all other domains.So the
 change you are making below results in nothing but the default behaviour
 of balance policy.
 
 Correct me if I am wrong but my point is,looks to me,that the powersave
 policy is introduced in this patchset,and with the below patch its
 characteristic behaviour of packing onto domains sharing cpu power is
 removed,thus making it default to balance policy.Now there are two
 policies which behave the same way:balance and powersave.
 

 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 047a1b3..3a0284b 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3503,7 +3503,7 @@ static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,

  policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
  if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
 -if (wakeup)
 +if (wakeup && !(sd->flags & SD_SHARE_CPUPOWER))
  new_cpu = find_leader_cpu(sds->group_leader,
  p, cpu, policy);
  /* for fork balancing and a little busy task */
 @@ -4410,8 +4410,9 @@ static unsigned long task_h_load(struct task_struct *p)
  static inline void init_sd_lb_power_stats(struct lb_env *env,
  struct sd_lb_stats *sds)
  {
 -if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
 -env->idle == CPU_NOT_IDLE) {
 +if (sched_balance_policy == SCHED_POLICY_PERFORMANCE
 +|| env->sd->flags & SD_SHARE_CPUPOWER
 +|| env->idle == CPU_NOT_IDLE) {
  env->flags &= ~LBF_POWER_BAL;
  env->flags |= LBF_PERF_BAL;
  return;

 
 Regards
 Preeti U Murthy
 



Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-03-29 Thread Preeti U Murthy
Hi Joonsoo,

On 03/28/2013 01:28 PM, Joonsoo Kim wrote:
 Following-up upper se in sched_slice() should not be done,
 because sched_slice() is used for checking that resched is needed
 whithin *this* cfs_rq and there is one problem related to this
 in current implementation.
 
 The problem is that if we follow-up upper se in sched_slice(), it is
 possible that we get a ideal slice which is lower than
 sysctl_sched_min_granularity.
 
 For example, we assume that we have 4 tg which is attached to root tg
 with same share and each one have 20 runnable tasks on cpu0, respectivly.
 In this case, __sched_period() return sysctl_sched_min_granularity * 20
 and then go into loop. At first iteration, we compute a portion of slice
 for this task on this cfs_rq, so get a slice, sysctl_sched_min_granularity.
 Afterward, we enter second iteration and get a slice which is a quarter of
 sysctl_sched_min_granularity, because there is 4 tgs with same share
 in that cfs_rq.
 
 Ensuring slice larger than min_granularity is important for performance
 and there is no lower bound about this, except timer tick, we should
 fix it not to consider upper se when calculating sched_slice.
 

I am assuming the sysctl_sched_latency = 20ms and
sysctl_sched_min_granularity = 4ms.
In that case:

With your patch, the sum of the sched_slice(runnable_task) on each tg is
40ms = sched_min_granularity * 10, while the parent tg has a
sched_slice of sysctl_sched_latency / 4 = (20 / 4) = 5ms.

Ideally the children's cpu share must add up to the parent's share.
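(For context, a sketch of sched_slice() roughly as it stands in fair.c at
this point; the for_each_sched_entity() walk is the "upper se"
consideration that this patch proposes to drop.)

static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

	/* walk up the hierarchy, scaling the slice at every level */
	for_each_sched_entity(se) {
		struct load_weight *load;
		struct load_weight lw;

		cfs_rq = cfs_rq_of(se);
		load = &cfs_rq->load;

		if (unlikely(!se->on_rq)) {
			lw = cfs_rq->load;
			update_load_add(&lw, se->load.weight);
			load = &lw;
		}
		slice = calc_delta_mine(slice, se->load.weight, load);
	}
	return slice;
}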


Thank you

Regards
Preeti U Murthy



Re: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency

2013-03-29 Thread Preeti U Murthy
Hi Joonsoo

On 03/28/2013 01:28 PM, Joonsoo Kim wrote:
 sched_slice() compute ideal runtime slice. If there are many tasks
 in cfs_rq, period for this cfs_rq is extended to guarantee that each task
 has time slice at least, sched_min_granularity. And then each task get
 a portion of this period for it. If there is a task which have much larger
 load weight than others, a portion of period can exceed far more than
 sysctl_sched_latency.

Correct. But that does not matter; the length of the scheduling latency
period is determined by the return value of __sched_period(), not the
value of sysctl_sched_latency. You would not extend the period if you
wanted all tasks to have a slice within sysctl_sched_latency, right?

So since the length of the scheduling latency period is dynamic,
depending on the number of processes running, the sysctl_sched_latency,
which is the default latency period length, is not messed with, but is
only used as a base to determine the actual scheduling period.

 
 For exampple, you can simply imagine that one task with nice -20 and
 9 tasks with nice 0 on one cfs_rq. In this case, load weight sum for
 this cfs_rq is 88761 + 9 * 1024, 97977. So a portion of slice for the
 task with nice -20 is sysctl_sched_min_granularity * 10 * (88761 / 97977),
 that is, approximately, sysctl_sched_min_granularity * 9. This aspect
 can be much larger if there is more tasks with nice 0.

Yeah, so __sched_period() says that within 40ms all tasks need to be
scheduled at least once, and the highest priority task gets nearly 36ms of
it (40ms * 88761/97977), while the rest is distributed among the others.

 
 So we should limit this possible weird situation.
 
 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index e232421..6ceffbc 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
   }
   slice = calc_delta_mine(slice, se->load.weight, load);
 
 + if (unlikely(slice > sysctl_sched_latency))
 + slice = sysctl_sched_latency;

Then in this case the highest priority thread would get
20ms (sysctl_sched_latency), and the rest would get
sysctl_sched_min_granularity * 10 * (1024/97977), which would be 0.4ms.
Then all tasks would get scheduled at least once within 20ms + (0.4*9) ms
= 23.7ms, while your scheduling latency period was extended to 40ms, just
so that each of these tasks don't have their sched_slices shrunk due to
the large number of tasks.

 +
   return slice;
  }
 

Regards
Preeti U Murthy



Re: [patch v5 14/15] sched: power aware load balance

2013-03-29 Thread Preeti U Murthy
Hi Alex,

On 03/25/2013 10:22 AM, Alex Shi wrote:
 On 03/22/2013 01:14 PM, Preeti U Murthy wrote:

 the value get from decay_load():
  sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
 in decay_load it is possible to be set zero.
 Yes you are right, it is possible to be set to 0, but after a very long
 time, to be more precise, nearly 2 seconds. If you look at decay_load(),
 if the period between last update and now has crossed (32*63),only then
 will the runnable_avg_sum become 0, else it will simply decay.

 This means that for nearly 2seconds,consolidation of loads may not be
 possible even after the runqueues have finished executing tasks running
 on them.
 
 Look into the decay_load(), since the LOAD_AVG_MAX is about 47742, so
 after 16 * 32ms, the maximum avg sum will be decay to zero. 2^16 = 65536
 
 Yes, compare to accumulate time 345ms, the decay is not symmetry, and
 not precise, seems it has space to tune well. But it is acceptable now.
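(A quick numeric illustration of the figures quoted above; this is a
simplified userspace approximation, not the kernel's decay_load(), which
uses a precomputed table of y^n with y^32 = 1/2 and cuts off to 0 after
32 * 63 periods, i.e. about 2 seconds.)

#include <stdio.h>
#include <stdint.h>

#define LOAD_AVG_PERIOD	32
#define LOAD_AVG_MAX	47742	/* value quoted in this thread */

static uint64_t decay_sketch(uint64_t val, unsigned int periods)
{
	if (periods > LOAD_AVG_PERIOD * 63)	/* hard cutoff, ~2s */
		return 0;
	/* halve once per 32 periods; the sub-period factor is ignored here */
	return val >> (periods / LOAD_AVG_PERIOD);
}

int main(void)
{
	/* 47742 < 2^16, so 16 halvings (16 * 32ms) already reach zero */
	printf("%llu\n",
	       (unsigned long long)decay_sketch(LOAD_AVG_MAX, 16 * 32));
	return 0;
}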

 The exact experiment that I performed was running ebizzy, with just two
 threads. My setup was 2 socket,2 cores each,4 threads each core. So a 16
 logical cpu machine.When I begin running ebizzy with balance policy, the
 2 threads of ebizzy are found one on each socket, while I would expect
 them to be on the same socket. All other cpus, except the ones running
 ebizzy threads are idle and not running anything on either socket.
 I am not running any other processes.
 
 did you try the simplest benchmark: while true; do :; done

Yeah I tried out this while true; do :; done benchmark on a vm which ran
on 2 socket, 2 cores each socket and 2 threads each core emulation.
I ran two instances of this loop with balance policy on, and it was
found that there was one instance running on each socket, rather than
both instances getting consolidated on one socket.

But when I apply the change where we do not consider rq->util if it has
no nr_running on the rq, the two instances of the above benchmark get
consolidated onto one socket.


 I am writing the v6 version which include rt_util etc. you may test on
 it after I send out. :)

Sure will do so.

Regards
Preeti U Murthy



Re: [patch v5 14/15] sched: power aware load balance

2013-03-30 Thread Preeti U Murthy
On 03/29/2013 07:09 PM, Alex Shi wrote:
 On 03/29/2013 08:42 PM, Preeti U Murthy wrote:
 did you try the simplest benchmark: while true; do :; done
 Yeah I tried out this while true; do :; done benchmark on a vm which ran
 
 Thanks a lot for trying!
 
 What's do you mean 'vm'? Virtual machine?

Yes.

 
 on 2 socket, 2 cores each socket and 2 threads each core emulation.
 I ran two instances of this loop with balance policy on, and it was
 found that there was one instance running on each socket, rather than
 both instances getting consolidated on one socket.

 But when I apply the change where we do not consider rq->util if it has
 no nr_running on the rq,the two instances of the above benchmark get
 consolidated onto one socket.


 
 I don't know much of virtual machine, guess the unstable VCPU to CPU pin
 cause rq->util keep large? Did you try to pin VCPU to physical CPU?

No, I hadn't done any vcpu to cpu pinning, but why did the situation
drastically alter to consolidate the load when the rq->util for the
runqueues with 0 tasks on them was not considered as part of sgs->utils?

 
 I still give the rq->util weight even the nr_running is 0, because some
 transitory tasks may actived on the cpu, but just missed on balancing point.
 
 I just wondering that forgetting rq->util when nr_running = 0 is the
 real root cause if your finding is just on VM and without fixed VCPU to
 CPU pin.

I find the same situation on a physical machine too, on a 2 socket, 4
core machine as well. In fact, using trace_printks in the load balancing
part, I could find that the reason that the load was not getting
consolidated onto a socket was that the rq->util of a run-queue with
no processes on it had not decayed to 0, which is why it would consider
the socket as overloaded and would rule out power aware balancing. All
this was on a physical machine.
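(For illustration, a hypothetical probe of the kind described above, not
the actual instrumentation that was used: dump each runqueue's residual
util while update_sg_lb_stats() walks the group. The variable names follow
that function's context in Alex's patchset.)

for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
	struct rq *rq = cpu_rq(i);

	/* stale non-zero util on idle CPUs shows up here */
	trace_printk("cpu%d nr_running=%u util=%lu\n",
		     i, rq->nr_running, (unsigned long)rq->util);
}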

Regards
Preeti U Murthy


 
 



Re: [patch v5 14/15] sched: power aware load balance

2013-03-30 Thread Preeti U Murthy
Hi,

On 03/30/2013 07:34 PM, Alex Shi wrote:
 On 03/30/2013 07:25 PM, Preeti U Murthy wrote:
 I still give the rq->util weight even the nr_running is 0, because some
 transitory tasks may actived on the cpu, but just missed on balancing 
 point.

 I just wondering that forgetting rq->util when nr_running = 0 is the
 real root cause if your finding is just on VM and without fixed VCPU to
 CPU pin.
 I find the same situation on a physical machine too. On a 2 socket, 4
 core machine as well. In fact, using trace_printks in the load balancing
 part, I could find that the reason that the load was not getting
 consolidated onto a socket was because the rq->util of a run-queue with
 no processes on it, had not decayed to 0, which is why it would consider
 the socket as overloaded and would  rule out power aware balancing.All
 this was on a physical machine.
 
 Consider of this situation, we may stop account the rq->util when
 nr_running is zero. Tasks will be a bit more compact. but anyway, that's
 powersaving policy.
 
True, the tasks will be packed a bit more compactly, but we can expect
the behaviour of your patchset *defaulting to performance policy when
overloaded*, to come to the rescue of such a situation.

Regards
Preeti U Murthy



Re: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency

2013-04-01 Thread Preeti U Murthy
Hi Joonsoo,

On 04/01/2013 10:39 AM, Joonsoo Kim wrote:
 Hello Preeti.
 So we should limit this possible weird situation.

 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index e232421..6ceffbc 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
  }
  slice = calc_delta_mine(slice, se->load.weight, load);

 +   if (unlikely(slice > sysctl_sched_latency))
 +   slice = sysctl_sched_latency;

 Then in this case the highest priority thread would get
 20ms(sysctl_sched_latency), and the rest would get
 sysctl_sched_min_granularity * 10 * (1024/97977) which would be 0.4ms.
 Then all tasks would get scheduled ateast once within 20ms + (0.4*9) ms
 = 23.7ms, while your scheduling latency period was extended to 40ms,just
 so that each of these tasks don't have their sched_slices shrunk due to
 large number of tasks.
 
 I don't know I understand your question correctly.
 I will do my best to answer your comment. :)
 
 With this patch, I just limit maximum slice at one time. Scheduling is
 controlled through the vruntime. So, in this case, the task with nice -20
 will be scheduled twice.
 
 20 + (0.4 * 9) + 20 = 43.9 ms
 
 And after 43.9 ms, this process is repeated.
 
 So I can tell you that scheduling period is preserved as before.
 
 If we give a long period to a task at one go, it can cause
 a latency problem. So IMHO, limiting this is meaningful.

Thank you very much for the explanation. Just one question. What is the
reason behind you choosing sysctl_sched_latency as the upper bound here?

Regards
Preeti U Murthy



Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-01 Thread Preeti U Murthy
Hi Joonsoo,

On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
 Hello, Preeti.
 


 Ideally the children's cpu share must add upto the parent's share.

 
 I don't think so.
 
 We should schedule out the parent tg if 5ms is over. As we do so, we can
 fairly distribute time slice to every tg within short term. If we add
 the children's cpu share upto the parent's, the parent tg may have
 large time slice, so it cannot be preempted easily. There may be a latency
 problem if there are many tgs.

In the case where the #running < sched_nr_latency, the children's
sched_slices add up to the parent's.

Take a rq with two tgs, each with 3 tasks.

Each of these tasks has a sched_slice of
[(sysctl_sched_latency / 3) / 2] as of the present implementation.

The sum of the above sched_slice over all tasks of a tg will lead to the
sched_slice of its parent: sysctl_sched_latency / 2. (With
sysctl_sched_latency = 20ms, each task gets (20/3)/2 = ~3.33ms, and
3 * 3.33ms = 10ms = 20/2 ms.)

This breaks when the nr_running on each tg > sched_nr_latency. However I
don't know if this is a good thing or a bad thing.

Regards
Preeti U Murthy



Re: [patch v6 12/21] sched: add power aware scheduling in fork/exec/wake

2013-04-01 Thread Preeti U Murthy
Hi Alex,

The below patch has an issue in the select_task_rq_fair(), which I have 
mentioned in 
the fix for it below.

On 03/30/2013 08:04 PM, Alex Shi wrote:
 This patch add power aware scheduling in fork/exec/wake. It try to
 select cpu from the busiest while still has utilization group. That's
 will save power since it leaves more groups idle in system.
 
 The trade off is adding a power aware statistics collection in group
 seeking. But since the collection just happened in power scheduling
 eligible condition, the worst case of hackbench testing just drops
 about 2% with powersaving policy. No clear change for performance
 policy.
 
 The main function in this patch is get_cpu_for_power_policy(), that
 will try to get a idlest cpu from the busiest while still has
 utilization group, if the system is using power aware policy and
 has such group.
 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c | 109 +---
  1 file changed, 103 insertions(+), 6 deletions(-)
+/*
 + * select_task_rq_fair: balance the current task (running on cpu) in domains
   * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
   * SD_BALANCE_EXEC.
   *
 - * Balance, ie. select the least loaded group.
 - *
   * Returns the target CPU number, or the same CPU if no balancing is needed.
   *
   * preempt must be disabled.
   */
  static int
 -select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 +select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
  {
   struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
   int cpu = smp_processor_id();
   int prev_cpu = task_cpu(p);
   int new_cpu = cpu;
   int want_affine = 0;
 - int sync = wake_flags & WF_SYNC;
 + int sync = flags & WF_SYNC;
 + struct sd_lb_stats sds;
 
   if (p->nr_cpus_allowed == 1)
   return prev_cpu;
 @@ -3412,11 +3500,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
   break;
   }
 
 - if (tmp->flags & sd_flag)
 + if (tmp->flags & sd_flag) {
   sd = tmp;
 +
 + new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
 + if (new_cpu != -1)
 + goto unlock;
 + }
   }
 
   if (affine_sd) {
 + new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
 + if (new_cpu != -1)
 + goto unlock;
 +
   if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
   prev_cpu = cpu;
 


sched: Fix power aware scheduling in fork/wake/exec

From: Preeti U Murthy pre...@linux.vnet.ibm.com

Problem:

select_task_rq_fair() returns a target CPU/ waking CPU if no balancing is
required. However with the current power aware scheduling in this path, an
invalid CPU might be returned.

If get_cpu_for_power_policy() fails to find a new_cpu for the forked task,
then there is a possibility that the new_cpu could continue to be -1, till
the end of the select_task_rq_fair() if the search for a new cpu ahead in
this function also fails. Since this scenario is unexpected by
the callers of select_task_rq_fair(), this needs to be fixed.

Fix:

Do not intermix the variables meant to reflect the target CPU of power save
and performance policies. If the target CPU of powersave is successful in being
found, return it. Else allow the performance policy to take a call on the
target CPU.

The above scenario was caught when a kernel crash resulted with a bad data
access interrupt, during a kernbench run on a 2 socket, 16 core machine,
with each core having SMT-4.
---
 kernel/sched/fair.c |   13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56b96a7..54d4400 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3532,6 +3532,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
int new_cpu = cpu;
+   int new_cpu_for_power = -1;
int want_affine = 0;
int sync = flags & WF_SYNC;
struct sd_lb_stats sds;
@@ -3563,16 +3564,16 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
if (tmp->flags & sd_flag) {
sd = tmp;
 
-   new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+   new_cpu_for_power = get_cpu_for_power_policy(sd, cpu, p, &sds,
flags & SD_BALANCE_FORK);
-   if (new_cpu != -1)
+   if (new_cpu_for_power != -1)
goto unlock;
}
}
 
if (affine_sd) {
-   new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, sds, 0

Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-01 Thread Preeti U Murthy
Hi Joonsoo,

On 04/02/2013 07:55 AM, Joonsoo Kim wrote:
 Hello, Preeti.
 
 On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote:
 Hi Joonsoo,

 On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
 Hello, Preeti.



 Ideally the children's cpu share must add upto the parent's share.


 I don't think so.

 We should schedule out the parent tg if 5ms is over. As we do so, we can
 fairly distribute time slice to every tg within short term. If we add
 the children's cpu share upto the parent's, the parent tg may have
 large time slice, so it cannot be preempted easily. There may be a latency
 problem if there are many tgs.

 In the case where the #running < sched_nr_latency, the children's
 sched_slices add up to the parent's.

 A rq with two tgs,each with 3 tasks.

 Each of these tasks have a sched slice of
 [(sysctl_sched_latency / 3) / 2] as of the present implementation.

 The sum of the above sched_slice of all tasks of a tg will lead to the
 sched_slice of its parent: sysctl_sched_latency / 2

 This breaks when the nr_running on each tg > sched_nr_latency. However I
 don't know if this is a good thing or a bad thing.
 
 Ah.. Now I get your point. Yes, you are right and it may be good thing.
 With that property, all tasks in the system can be scheduled at least once
 in sysctl_sched_latency. sysctl_sched_latency is system-wide configuration,
 so my patch may be wrong. With my patch, all tasks in the system cannot be
 scheduled at least once in sysctl_sched_latency. Instead, it schedule
 all tasks in cfs_rq at least once in sysctl_sched_latency if there is
 no other tgs.

Exactly. You have got all the above points right.

 
 I think that it is real problem that sysctl_sched_min_granularity is not
 guaranteed for each task.
 Instead of this patch, how about considering low bound?
 
 if (slice < sysctl_sched_min_granularity)
   slice = sysctl_sched_min_granularity;

Consider the below scenario.

A runqueue has two task groups, each with 10 tasks.

With the current implementation, each of these tasks gets a sched_slice of
2ms. Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks (all tasks
of both the task groups) will get the chance to run.

But what is the scheduling period in this scenario? Is it 40ms (extended
sysctl_sched_latency), which is the scheduling period for each of the
runqueues with 10 tasks in it?
Or is it 80ms, which is the total of the scheduling periods of each of
the run queues with 10 tasks? Either way all tasks seem to get scheduled
at least once within the scheduling period. So we appear to be safe,
although the sched_slice < sched_min_granularity.

With your above lower bound of sysctl_sched_min_granularity, each task
of each tg gets 4ms as its sched_slice. So in a matter of
(10*4) + (10*4) = 80ms, all tasks get to run. With the above question
being put forth here as well, we don't appear to be safe if the
scheduling_period is considered to be 40ms; otherwise it appears fine to
me, because it ensures the sched_slice is at least sched_min_granularity
no matter what.


Thank you

Regards
Preeti U Murthy



Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-02 Thread Preeti U Murthy
Hi Joonsoo,


 I think that it is real problem that sysctl_sched_min_granularity is not
 guaranteed for each task.
 Instead of this patch, how about considering low bound?

 if (slice < sysctl_sched_min_granularity)
 slice = sysctl_sched_min_granularity;

 Consider the below scenario.

 A runqueue has two task groups,each with 10 tasks.

 With the current implementation,each of these tasks get a sched_slice of
 2ms.Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks( all tasks
 of both the task groups) will get the chance to run.

 But what is the scheduling period in this scenario? Is it 40ms (extended
 sysctl_sched_latency), which is the scheduling period for each of the
 runqueues with 10 tasks in it?
 Or is it 80ms which is the total of the scheduling periods of each of
 the run queues with 10 tasks.Either way all tasks seem to get scheduled
 atleast once within the scheduling period.So we appear to be safe.
 Although the sched_slice < sched_min_granularity.

 With your above lower bound of sysctl_sched_min_granularity, each task
 of each tg gets 4ms as its sched_slice.So in a matter of
 (10*4) + (10*4) = 80ms,all tasks get to run. With the above question
 being put forth here as well, we don't appear to be safe if the
 scheduling_period is considered to be 40ms, otherwise it appears fine to
 me, because it ensures the sched_slice is atleast sched_min_granularity
 no matter what.
 
 So, you mean that we should guarantee to schedule each task atleast once
 in sysctl_sched_latency? 

I would rather say all tasks get to run at least once in a sched_period,
which could be just sysctl_sched_latency or more, depending on the number
of tasks, in the current implementation.

 But this is not guaranteed in current code,
 this is why I made this patch. Please refer a patch description.

Ok, take the example of a runqueue with 2 task groups, each with 10
tasks, same as your previous example. Can you explain how your patch
ensures that all 20 tasks get to run at least once in a sched_period?

Regards
Preeti U Murthy



Re: [patch v5 14/15] sched: power aware load balance

2013-03-19 Thread Preeti U Murthy
Hi Alex,

Please note one point below.

On 02/18/2013 10:37 AM, Alex Shi wrote:
 This patch enabled the power aware consideration in load balance.
 
 As mentioned in the power aware scheduler proposal, Power aware
 scheduling has 2 assumptions:
 1, race to idle is helpful for power saving
 2, less active sched_groups will reduce power consumption
 
 The first assumption make performance policy take over scheduling when
 any scheduler group is busy.
 The second assumption make power aware scheduling try to pack disperse
 tasks into fewer groups.
 
 The enabling logical summary here:
 1, Collect power aware scheduler statistics during performance load
 balance statistics collection.
 2, If the balance cpu is eligible for power load balance, just do it
 and forget performance load balance. If the domain is suitable for
 power balance, but the cpu is inappropriate(idle or full), stop both
 power/performance balance in this domain. If using performance balance
 or any group is busy, do performance balance.
 
 Above logical is mainly implemented in update_sd_lb_power_stats(). It
 decides if a domain is suitable for power aware scheduling. If so,
 it will fill the dst group and source group accordingly.
 
 This patch reuse some of Suresh's power saving load balance code.
 
 A test can show the effort on different policy:
 for ((i = 0; i < I; i++)) ; do while true; do :; done & done
 
 On my SNB laptop with 4core* HT: the data is Watts
         powersaving   balance   performance
 i = 2   40            54        54
 i = 4   57            64*       68
 i = 8   68            68        68
 
 Note:
 When i = 4 with balance policy, the power may change in 57~68Watt,
 since the HT capacity and core capacity are both 1.
 
 on SNB EP machine with 2 sockets * 8 cores * HT:
          powersaving   balance   performance
 i = 4    190           201       238
 i = 8    205           241       268
 i = 16   271           348       376
 
 If system has few continued tasks, use power policy can get
 the performance/power gain. Like sysbench fileio randrw test with 16
 thread on the SNB EP box,
 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c | 129 ++--
  1 file changed, 126 insertions(+), 3 deletions(-)
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index ffdf35d..3b1e9a6 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -4650,6 +4753,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
   sgs->group_load += load;
   sgs->sum_nr_running += nr_running;
   sgs->sum_weighted_load += weighted_cpuload(i);
 +
 + /* accumulate the maximum potential util */
 + if (!nr_running)
 + nr_running = 1;
 + sgs->group_utils += rq->util * nr_running;

You may have observed the following, but I thought it would be best to
bring this to notice.

The above will lead to situations where sched groups never fill to their
full capacity. This is explained with an example below:

Say the topology is two cores with hyper threading of two logical
threads each. If we choose the powersaving policy and run two workloads at
full utilization, the load will get distributed one on each core; they
will not get consolidated on a single core.
The reason being, the condition:
 if (sgs->group_utils + FULL_UTIL < threshold_util) in
update_sd_lb_power_stats() will fail.

The situation goes thus:

   w1           w2
   t1    t2     t3    t4
   -----------  -----------
      core1        core2

Above: t = thread (logical cpu), w = workload


Neither core will be able to pull the task from the other to consolidate
the load because the rq->util of t2 and t4, on which no process is
running, continue to show some number even though they degrade with time
and sgs->utils accounts for them. Therefore,
for core1 and core2, the sgs->utils will be slightly above 100 and the
above condition will fail, thus failing them as candidates for
group_leader, since threshold_util will be 200.

This phenomenon is seen for the balance policy and wider topology as well.
I think we would be better off without accounting the rq->utils of the
cpus which do not have any processes running on them for sgs->utils.
What do you think?
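(A sketch of the alternative suggested above; this is an assumption, not a
posted patch: only accumulate rq->util for CPUs that actually have runnable
tasks, so a decaying-but-idle CPU does not inflate sgs->group_utils.)

/* replaces the "if (!nr_running) nr_running = 1;" accumulation quoted above */
if (nr_running)
	sgs->group_utils += rq->util * nr_running;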


Regards
Preeti U Murthy



Re: [patch v5 14/15] sched: power aware load balance

2013-03-21 Thread Preeti U Murthy
Hi Alex,

On 03/21/2013 01:13 PM, Alex Shi wrote:
 On 03/20/2013 12:57 PM, Preeti U Murthy wrote:
 Neither core will be able to pull the task from the other to consolidate
 the load because the rq-util of t2 and t4, on which no process is
 running, continue to show some number even though they degrade with time
 and sgs-utils accounts for them. Therefore,
 for core1 and core2, the sgs-utils will be slightly above 100 and the
 above condition will fail, thus failing them as candidates for
 group_leader,since threshold_util will be 200.
 
 Thanks for note, Preeti!
 
 Did you find some real issue in some platform?
 In theory, a totally idle cpu has a zero rq->util at least after 3xx ms,
 and in fact, I find the code works fine on my machines.
 

Yes, I did find this behaviour on a 2 socket, 8 core machine very
consistently.

rq->util cannot go to 0 after it has begun accumulating load, right?

Say a load was running on a runqueue whose rq->util was at
100%. After the load finishes, the runqueue goes idle. On every
scheduler tick its utilisation decays, but it can never become 0.

rq->util = rq->avg.runnable_avg_sum / rq->avg.runnable_avg_period

This ratio will come close to 0, but will never become 0 once it has
picked up a value. So if a sched_group consists of two run queues, one
having utilisation 100 and running 1 load, and the other having utilisation
0.001 but running no load, then in update_sd_lb_power_stats() the condition

sgs->group_utils + FULL_UTIL < threshold_util   will turn out to be

(100.001 + 100 < 200) and hence the group will fail to act as the group
leader, i.e. to take on more tasks onto itself.


Regards
Preeti U Murthy



Re: [patch v5 14/15] sched: power aware load balance

2013-03-21 Thread Preeti U Murthy
On 03/21/2013 02:57 PM, Alex Shi wrote:
 On 03/21/2013 04:41 PM, Preeti U Murthy wrote:

 Yes, I did find this behaviour on a 2 socket, 8 core machine very
 consistently.

 rq-util cannot go to 0, after it has begun accumulating load right?

 Say a load was running on a runqueue which had its rq-util to be at
 100%. After the load finishes, the runqueue goes idle. For every
 scheduler tick, its utilisation decays. But can never become 0.

 rq-util = rq-avg.runnable_avg_sum/rq-avg.runnable_avg_period
 
 
 did you close all of background system services?
 In theory the rq-avg.runnable_avg_sum should be zero if there is no
 task a bit long, otherwise there are some bugs in kernel.

Could you explain why rq->avg.runnable_avg_sum should be zero? What if
some kernel thread ran on this run queue and is now finished? Its
utilisation would be, say, x. How would that ever drop to 0, even if nothing
ran on it later?

Regards
Preeti U Murthy

 Could you
 check the value under /proc/sched_debug?
 
 



Re: [patch v5 14/15] sched: power aware load balance

2013-03-21 Thread Preeti U Murthy
Hi,

On 03/22/2013 07:00 AM, Alex Shi wrote:
 On 03/21/2013 06:27 PM, Preeti U Murthy wrote:
 did you close all of background system services?
 In theory the rq-avg.runnable_avg_sum should be zero if there is no
 task a bit long, otherwise there are some bugs in kernel.
 Could you explain why rq-avg.runnable_avg_sum should be zero? What if
 some kernel thread ran on this run queue and is now finished? Its
 utilisation would be say x.How would that ever drop to 0,even if nothing
 ran on it later?
 
 the value get from decay_load():
  sa-runnable_avg_sum = decay_load(sa-runnable_avg_sum,
 in decay_load it is possible to be set zero.

Yes, you are right, it is possible for it to be set to 0, but only after a very
long time, to be more precise nearly 2 seconds. If you look at decay_load(),
only if the period between the last update and now has crossed (32*63)
will the runnable_avg_sum become 0; otherwise it will simply decay.

This means that for nearly 2 seconds, consolidation of loads may not be
possible even after the runqueues have finished executing the tasks running
on them.
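For reference, a small stand-alone model (user space, not kernel code) of the
decay in question: the per-period decay factor y satisfies y^32 = 1/2, and the
explicit zeroing in decay_load() only applies once more than 32*63 periods have
elapsed in one span. The kernel does this in fixed point, so an integer sum can
truncate to zero somewhat earlier, but the shape of the curve is the same;
47742 below is roughly the saturated value of runnable_avg_sum.

/* build: cc decay.c -lm */
#include <stdio.h>
#include <math.h>

#define LOAD_AVG_PERIOD	32			/* periods per half-life */
#define ZERO_AFTER	(LOAD_AVG_PERIOD * 63)	/* decay_load() returns 0 beyond this */

int main(void)
{
	double y = pow(0.5, 1.0 / LOAD_AVG_PERIOD);	/* ~0.97857 per ~1 ms period */
	double sum = 47742;				/* roughly a saturated runnable_avg_sum */
	int n;

	for (n = 1; n <= ZERO_AFTER; n++) {
		sum *= y;
		if (n % 512 == 0)
			printf("after %4d idle periods (~%4d ms): sum ~ %8.3f\n",
			       n, n, sum);
	}
	printf("decay_load() zeroes the sum only for spans > %d periods (~%.1f s)\n",
	       ZERO_AFTER, ZERO_AFTER / 1000.0);
	return 0;
}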

The exact experiment that I performed was running ebizzy with just two
threads. My setup was 2 sockets, 2 cores each, 4 threads per core, so a 16
logical cpu machine. When I begin running ebizzy with the balance policy, the
2 threads of ebizzy are found one on each socket, while I would expect
them to be on the same socket. All other cpus, except the ones running the
ebizzy threads, are idle and not running anything on either socket.
I am not running any other processes.

You could run a similar experiment and let me know if you see otherwise.
I am at a loss to understand why else such a spreading of load would
occur, if not for the rq->util not becoming 0 quickly when the runqueue is not
running anything. I have used trace_printks to keep track of the runqueue
util of those runqueues not running anything after maybe some initial
load, and it does not become 0 till the end of the run.

Regards
Preeti U Murthy


 
 and /proc/sched_debug also approve this:
 
   .tg_runnable_contrib   : 0
   .tg-runnable_avg  : 50
   .avg-runnable_avg_sum : 0
   .avg-runnable_avg_period  : 47507
 
 



Re: [RFC PATCH v3 0/6] sched: packing small tasks

2013-03-23 Thread Preeti U Murthy
Hi Vincent,

The power aware scheduler patchset that has been released by Alex
recently [https://lkml.org/lkml/2013/2/18/4] also has the idea of
packing small tasks. In that patchset, the waking of small tasks is
done on a leader cpu, which of course varies dynamically.
https://lkml.org/lkml/2013/2/18/6

I will refer henceforth to the above patchset as Patch A and your
current patchset in this mail as Patch B.

The difference in Patch A is that it is packing small tasks onto a
leader cpu, i.e. a cpu which has just enough capacity to take a
lightweight task onto itself: essentially one belonging to a group which has
a favourable grp_power, with the leader cpu having a suitable rq->util.

In Patch B, you are choosing a buddy cpu belonging to a group with the
least cpu power, and deciding whether to pack tasks onto the buddy cpu
again boils down to verifying that the
grp_power is favourable and the buddy cpu has a suitable rq->util.

In my opinion, this goes to say that both the patches are using similar
metrics to decide if a cpu is capable of handling the light weight task.
The difference lies in how far the search for the appropriate cpu can
proceed. While Patch A continues to search at all levels of sched
domains till the last to find the appropriate cpu, Patch B queries only
the buddy CPU.

The point I am trying to make is: considering the above
similarities and differences, is it possible for us to see if the ideas
that both these patches bring in can be merged into one, since both of
them have the common goal of packing small tasks?

Thanks

Regards
Preeti U Murthy

On 03/22/2013 05:55 PM, Vincent Guittot wrote:
 Hi,
 
 This patchset takes advantage of the new per-task load tracking that is
 available in the kernel for packing the small tasks in as few as possible
 CPU/Cluster/Core. The main goal of packing small tasks is to reduce the power
 consumption in the low load use cases by minimizing the number of power domain
 that are enabled. The packing is done in 2 steps:
 
 The 1st step looks for the best place to pack tasks in a system according to
 its topology and it defines a pack buddy CPU for each CPU if there is one
 available. We define the best CPU during the build of the sched_domain instead
 of evaluating it at runtime because it can be difficult to define a stable
 buddy CPU in a low CPU load situation. The policy for defining a buddy CPU is
 that we pack at all levels inside a node where a group of CPU can be power
 gated independently from others. For describing this capability, a new flag
 has been introduced SD_SHARE_POWERDOMAIN that is used to indicate whether the
 groups of CPUs of a scheduling domain are sharing their power state. By
 default, this flag has been set in all sched_domain in order to keep unchanged
 the current behavior of the scheduler and only ARM platform clears the
 SD_SHARE_POWERDOMAIN flag for MC and CPU level.
 
 In a 2nd step, the scheduler checks the load average of a task which wakes up
 as well as the load average of the buddy CPU and it can decide to migrate the
 light tasks on a not busy buddy. This check is done during the wake up because
 small tasks tend to wake up between periodic load balance and asynchronously
 to each other which prevents the default mechanism to catch and migrate them
 efficiently. A light task is defined by a runnable_avg_sum that is less than
 20% of the runnable_avg_period. In fact, the former condition encloses 2 ones:
 The average CPU load of the task must be less than 20% and the task must have
 been runnable less than 10ms when it woke up last time in order to be
 electable for the packing migration. So, a task than runs 1 ms each 5ms will
 be considered as a small task but a task that runs 50 ms with a period of
 500ms, will not.
 Then, the business of the buddy CPU depends of the load average for the rq and
 the number of running tasks. A CPU with a load average greater than 50% will
 be considered as busy CPU whatever the number of running tasks is and this
 threshold will be reduced by the number of running tasks in order to not
 increase too much the wake up latency of a task. When the buddy CPU is busy,
 the scheduler falls back to default CFS policy.
 
 Change since V2:
 
  - Migrate only a task that wakes up
  - Change the light tasks threshold to 20%
  - Change the loaded CPU threshold to not pull tasks if the current number of
running tasks is null but the load average is already greater than 50%
  - Fix the algorithm for selecting the buddy CPU.
 
 Change since V1:
 
 Patch 2/6
  - Change the flag name which was not clear. The new name is
SD_SHARE_POWERDOMAIN.
  - Create an architecture dependent function to tune the sched_domain flags
 Patch 3/6
  - Fix issues in the algorithm that looks for the best buddy CPU
  - Use pr_debug instead of pr_info
  - Fix for uniprocessor
 Patch 4/6
  - Remove the use of usage_avg_sum which has not been merged
 Patch 5/6
  - Change the way the coherency

Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Preeti U Murthy
Hi Peter,

On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
 On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
 +static bool is_light_task(struct task_struct *p)
 +{
 +   /* A light task runs less than 20% in average */
 +   return ((p->se.avg.runnable_avg_sum * 5) <
 +   (p->se.avg.runnable_avg_period));
 +}
 
 OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
 we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
 this as well.
 
 PJT, do you have any sane solution for this, I forgot what the result
 of the last discussion was -- was there any?

The conclusion after last discussion between PJT and Alex was that the
load contribution of a fresh task be set to full during __sched_fork().

task->se.avg.load_avg_contrib = task->se.load.weight during
__sched_fork() is reflected in the latest power aware scheduler patchset
by Alex.

Thanks

Regards
Preeti U Murthy
 



Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-27 Thread Preeti U Murthy
Hi,

On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
 On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
 +static bool is_buddy_busy(int cpu)
 +{
 +   struct rq *rq = cpu_rq(cpu);
 +
 +   /*
 +* A busy buddy is a CPU with a high load or a small load with
 a lot of
 +* running tasks.
 +*/
 +   return (rq->avg.runnable_avg_sum >
 +   (rq->avg.runnable_avg_period / (rq->nr_running
 + 2)));
 +}
 
 Why does the comment talk about load but we don't see it in the
 equation. Also, why does nr_running matter at all? I thought we'd
 simply bother with utilization, if fully utilized we're done etc..
 

Peter, let's say the run queue has 50% utilization and is running 2
tasks, and we wish to find out if it is busy. We would compare this
metric with the cpu power, which let's say is 100:

rq->util * 100 < cpu_of(rq)->power

In the above scenario would we declare the cpu _not_ busy? Or would we do
the following:

(rq->util * 100) * nr_running < cpu_of(rq)->power and conclude that it
is just enough _busy_ to not take on more processes?
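To spell the two alternatives out with the numbers above, a stand-alone
illustration (not kernel code):

#include <stdio.h>

int main(void)
{
	int util_pct = 50;	/* rq->util, expressed as a percentage */
	int nr_running = 2;
	int power = 100;	/* cpu_of(rq)->power */

	/* Check 1: utilization alone - the CPU looks _not_ busy */
	printf("util only   : %3d < %d -> %s\n", util_pct, power,
	       util_pct < power ? "not busy" : "busy");

	/* Check 2: utilization scaled by nr_running - the CPU looks busy */
	printf("util*nr_run : %3d < %d -> %s\n", util_pct * nr_running, power,
	       util_pct * nr_running < power ? "not busy" : "busy");
	return 0;
}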


@Vincent: Yes the comment above needs to be fixed. A busy buddy is a CPU
with *high rq utilization*, as far as the equation goes.

Regards
Preeti U Murthy



Re: [RFC v2 PATCH 2.1] sched: Use Per-Entity-Load-Tracking metric for load balancing

2012-12-03 Thread Preeti U Murthy

sched: Use Per-Entity-Load-Tracking metric for load balancing

From: Preeti U Murthy pre...@linux.vnet.ibm.com

Currently the load balancer weighs a task based upon its priority, and this
weight consequently gets added to the weight of the run queue that it is
on. It is this weight of the runqueue that sums up to a sched group's load,
which is used to decide the busiest or the idlest group and the runqueue
thereof.

The Per-Entity-Load-Tracking metric however measures how long a task has
been runnable over the duration of its lifetime. This gives us a hint of
the amount of CPU time that the task can demand. This metric takes care of the
task priority as well. Therefore, apart from the priority of a task we also
have an idea of the live behavior of the task. This seems to be a more
realistic metric to use to compute the task weight that adds up to the run queue
weight and the weight of the sched group. Consequently they can be used for
load balancing.

The semantics of load balancing are left untouched. The two functions
load_balance() and select_task_rq_fair() perform the task of load
balancing. These two paths have been gone through in this patch to make
the necessary changes.

weighted_cpuload() and task_h_load() provide the run queue weight and the
weight of the task respectively. They have been modified to provide the
Per-Entity-Load-Tracking metric as relevant for each.
The rest of the modifications had to be made to suit these two changes.

The Completely Fair Scheduler class is the only sched_class which contributes to
the run queue load. Therefore rq->load.weight == cfs_rq->load.weight when
the cfs_rq is the root cfs_rq (rq->cfs) of the hierarchy. When replacing this
with the Per-Entity-Load-Tracking metric, cfs_rq->runnable_load_avg needs to be
used, as this is the right reflection of the run queue load when
the cfs_rq is the root cfs_rq (rq->cfs) of the hierarchy. This metric reflects
the percentage uptime of the tasks that are queued on it and hence
contribute to the load. Thus cfs_rq->runnable_load_avg replaces the metric
earlier used in weighted_cpuload().

The task load is aptly captured by se.avg.load_avg_contrib, which captures the
runnable time vs the alive time of the task against its priority. This metric
replaces the earlier metric used in task_h_load().
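A sketch of the two replacements described above (illustrative only; the
actual patch also carries the accompanying data type changes and plumbing):

/* Run-queue load: the PELT-tracked runnable load of the root cfs_rq */
static u64 weighted_cpuload(const int cpu)
{
	return cpu_rq(cpu)->cfs.runnable_load_avg;
}

/* Task load: the task's tracked load contribution, which already folds
 * in both its priority and its runnable history */
static u64 task_h_load(struct task_struct *p)
{
	return p->se.avg.load_avg_contrib;
}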

The consequent changes appear as data type changes for the helper variables;
they abound in number, because cfs_rq->runnable_load_avg needs to be big enough
to capture the tasks' load often and accurately.

The following patch does not consider CONFIG_FAIR_GROUP_SCHED and
CONFIG_SCHED_NUMA. This is done so as to evaluate this approach starting from the
simplest scenario. Earlier discussions can be found in the link below.

Link: https://lkml.org/lkml/2012/10/25/162
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
I apologise for having overlooked this one change in the patchset. This needs
to be applied on top of patch 2 of this patchset. The experiment results that
have been posted in reply to this thread were obtained after having applied
this patch.

 kernel/sched/fair.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8f3a29..19094eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4362,7 +4362,7 @@ struct sd_lb_stats {
  * sg_lb_stats - stats of a sched_group required for load_balancing
  */
 struct sg_lb_stats {
-   unsigned long avg_load; /*Avg load across the CPUs of the group */
+   u64 avg_load; /*Avg load across the CPUs of the group */
u64 group_load; /* Total load over the CPUs of the group */
unsigned long sum_nr_running; /* Nr tasks running in the group */
u64 sum_weighted_load; /* Weighted load of group's tasks */



Re: [RFC v2 PATCH 0/2] sched: Integrating Per-entity-load-tracking with the core scheduler

2012-12-04 Thread Preeti U Murthy
 * of time. This is done to create both long running and short running tasks
 * on the cpu.
 *
 * Multiple threads are created of each instance. The threads request a
 * memory chunk, write into it and then free it. This is done throughout the
 * period of the run.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; version 2 of the License.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307
 * USA
 */

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <pthread.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <malloc.h>

/* Variable entities */
static unsigned int seconds;
static unsigned int threads;
static unsigned int mem_chunk_size;
static unsigned int sleep_at;
static unsigned int sleep_interval;


/* Fixed entities */
typedef size_t mem_slot_t;/* 8 bytes */
static unsigned int slot_size = sizeof(mem_slot_t);

/* Other parameters */
static volatile int start;
static time_t start_time;
static unsigned int records_read;
pthread_mutex_t records_count_lock = PTHREAD_MUTEX_INITIALIZER;


static unsigned int write_to_mem(void)
{
int i, j;
mem_slot_t *scratch_pad, *temp;
mem_chunk_size = slot_size * 256;
mem_slot_t *end;
sleep_at = 0; /* sleep for every 2800 records */
sleep_interval = 9000; /* sleep for 9 ms */

for (i=0; start == 1; i++)
{
/* ask for a memory chunk */
scratch_pad = (mem_slot_t *)malloc(mem_chunk_size);
if (scratch_pad == NULL) {
            fprintf(stderr, "Could not allocate memory\n");
exit(1);
}
end = scratch_pad + (mem_chunk_size / slot_size);
/* write into this chunk */
        for (temp = scratch_pad, j=0; temp < end; temp++, j++)
*temp = (mem_slot_t)j;

/* Free this chunk */
free(scratch_pad);

/* Decide the duty cycle;currently 10 ms */
        if (sleep_at && !(i % sleep_at))
usleep(sleep_interval);

}
return (i);
}

static void *
thread_run(void *arg)
{

unsigned int records_local;

/* Wait for the start signal */

while (start == 0);

records_local = write_to_mem();

    pthread_mutex_lock(&records_count_lock);
    records_read += records_local;
    pthread_mutex_unlock(&records_count_lock);

return NULL;
}

static void start_threads()
{
double diff_time;
unsigned int i;
int err;
threads = 8;
seconds = 10;

pthread_t thread_array[threads];
    for (i = 0; i < threads; i++) {
        err = pthread_create(&thread_array[i], NULL, thread_run, NULL);
        if (err) {
            fprintf(stderr, "Error creating thread %d\n", i);
exit(1);
}
}
start_time = time(NULL);
start = 1;
sleep(seconds);
start = 0;
diff_time = difftime(time(NULL), start_time);

    for (i = 0; i < threads; i++) {
        err = pthread_join(thread_array[i], NULL);
        if (err) {
            fprintf(stderr, "Error joining thread %d\n", i);
exit(1);
}
}
    printf("%u records/s\n",
        (unsigned int) (((double) records_read)/diff_time));

}
int main()
{
start_threads();
return 0;
}

END WORKLOAD
Regards
Preeti U Murthy




Re: [patch v5 02/15] sched: set initial load avg of new forked task

2013-02-27 Thread Preeti U Murthy
Hi Alex,

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 81fa536..cae5134 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct 
 cfs_rq *cfs_rq,
  * We track migrations using entity decay_count = 0, on a wake-up
  * migration we use a negative decay count to track the remote decays
  * accumulated while sleeping.
 +*
 +* When enqueue a new forked task, the se-avg.decay_count == 0, so
 +* we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
 +* value: se-load.weight.

 I disagree with the comment.update_entity_load_avg() gets called for all
 forked tasks.
 enqueue_task_fair-update_entity_load_avg() during the second
 iteration.But __update_entity_load_avg() in update_entity_load_avg()

 
 When goes 'enqueue_task_fair-update_entity_load_avg()' during the
 second iteration. the se is changed.
 That is different se.
 
 
Correct, Alex, sorry I overlooked this.

Regards
Preeti U Murthy



Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer

2013-01-20 Thread Preeti U Murthy
Hi Alex,
Thank you very much for running the below benchmark on
blocked_load+runnable_load:) Just a few queries.

How did you do the wake up balancing? Did you iterate over the L3
package looking for an idle cpu? Or did you just query the L2 package
for an idle cpu?

I think when you are using blocked_load+runnable_load it would be better
if we just query the L2 package as Vincent had pointed out because the
fundamental behind using blocked_load+runnable_load is to keep a steady
state across cpus unless we could reap the advantage of moving the
blocked load to a sibling core when it wakes up.

And the drop of performance is relative to what?
1.Your v3 patchset with runnable_load_avg in weighted_cpu_load().
2.Your v3 patchset with runnable_load_avg+blocked_load_avg in
weighted_cpu_load().

Are the above two what you are comparing? And in the above two versions
have you included your [PATCH] sched: use instant load weight in burst
regular load balance?

On 01/20/2013 09:22 PM, Alex Shi wrote:
 The blocked load of a cluster will be high if the blocked tasks have
 run recently. The contribution of a blocked task will be divided by 2
 each 32ms, so it means that a high blocked load will be made of recent
 running tasks and the long sleeping tasks will not influence the load
 balancing.
 The load balance period is between 1 tick (10ms for idle load balance
 on ARM) and up to 256 ms (for busy load balance) so a high blocked
 load should imply some tasks that have run recently otherwise your
 blocked load will be small and will not have a large influence on your
 load balance

 Just tried using cfs's runnable_load_avg + blocked_load_avg in
 weighted_cpuload() with my v3 patchset, aim9 shared workfile testing
 show the performance dropped 70% more on the NHM EP machine. :(

 
 Ops, the performance is still worse than just count runnable_load_avg.
 But dropping is not so big, it dropped 30%, not 70%.
 

Thank you

Regards
Preeti U Murthy



Re: [PATCH] tty: Only wakeup the line discipline idle queue when queue is active

2013-01-20 Thread Preeti U Murthy
On 01/18/2013 09:15 PM, Oleg Nesterov wrote:
 On 01/17, Preeti U Murthy wrote:

 On 01/16/2013 05:32 PM, Ivo Sieben wrote:

 I don't have a problem that there is a context switch to the high
 priority process: it has a higher priority, so it probably is more
 important.
 My problem is that even when the waitqueue is empty, the high priority
 thread has a risk to block on the spinlock needlessly (causing context
 switches to low priority task and back to the high priority task)

 Fair enough Ivo.I think you should go ahead with merging the
 waitqueue_active()
   wake_up()
 logic into the wake_up() variants.
 
 This is not easy. We can't simply change wake_up*() helpers or modify
 __wake_up().

Hmm. I need to confess that I don't really know what goes into a change
such as this. Since there are a lot of waitqueue_active()+wake_up()
calls, I was wondering why we have a separate piece of logic like
waitqueue_active() at all, if we could do what it does in wake_up*(). But you
guys can decide this best.
 
 I can't understand why do you dislike Ivo's simple patch. There are
 a lot of if (waitqueue_active) wake_up examples. Even if we add the
 new helpers (personally I don't think this makes sense) , we can do
 this later. Why should we delay this fix?

Personally I was concerned about how this could cause scheduler
overhead. There does not seem to be much of a problem here. Ivo's patch
adding a waitqueue_active() check for his specific problem would also do
well, unless there is a dire requirement for a cleanup, which I am unable
to evaluate.

 
 Oleg.
 

Thank you

Regards
Preeti U Murthy



Re: [PATCH] tty: Only wakeup the line discipline idle queue when queue is active

2013-01-16 Thread Preeti U Murthy
Hi Ivo,
Can you explain how this problem could create scheduler overhead?
I am a little confused, because as far as I know the scheduler does not come
into the picture in the wake up path, right? select_task_rq() in
try_to_wake_up() is where the scheduler comes in, and this is after the
task wakes up.

On 01/03/2013 03:19 PM, Ivo Sieben wrote:
 Oleg, Peter, Ingo, Andi  Preeti,
 
 2013/1/2 Jiri Slaby jsl...@suse.cz:
 On 01/02/2013 04:21 PM, Ivo Sieben wrote:
 I don't understand your responses: do you suggest to implement this
 if active behavior in:
 * A new wake_up function called wake_up_if_active() that is part of
 the waitqueue layer?

 Sounds good.

 --
 js
 suse labs
 
 I want to ask you 'scheduler' people for your opinion:
 
 Maybe you remember my previous patch where I suggested an extra
 'waitqueue empty' check before entering the critical section of the
 wakeup() function (If you do not remember see
 https://lkml.org/lkml/2012/10/25/159)
 
 Finally Oleg responded that a lot of callers do
 
   if (waitqueue_active(q))
   wake_up(...);
 
 what made my patch pointless and adds a memory barrier. I then decided
 to also implement the 'waitqueue_active' approach for my problem.
 
 But now I get a review comment by Jiri that he would like to hide this
 'if active behavior' in a wake_up_if_active() kind of function. I
 think he is right that implementing this check in the wakeup function
 would clean things up, right?
 
 I would like to have your opinion on the following two suggestions:
 - We still can do the original patch on the wake_up() that I
 suggested. I then can do an additional code cleanup patch that removes
 the double 'waitqueue_active' call (a quick grep found about 150 of
 these waitqueue active calls) on several places in the code.

I think this is a good move.

 - Or - as an alternative - I could add extra _if_active() versions of
 all wake_up() functions, that implement this extra test.

Why add 'extra' if_active versions? Why not optimize this within the
existing wake_up() functions?
 
 Regards,
 Ivo

Regards
Preeti U Murthy
 



Re: [PATCH] tty: Only wakeup the line discipline idle queue when queue is active

2013-01-16 Thread Preeti U Murthy
Hi Ivo,

On 01/16/2013 02:46 PM, Ivo Sieben wrote:
 Hi Preeti,
 
 2013/1/16 Preeti U Murthy pre...@linux.vnet.ibm.com:
 Hi Ivo,
 Can you explain how this problem could create a scheduler overhead?
 I am a little confused, because as far as i know,scheduler does not come
 in the picture of the wake up path right? select_task_rq() in
 try_to_wake_up() is where the scheduler comes in,and this is after the
 task wakes up.

 
 Everytime the line discipline is dereferenced, the wakeup function is
 called. The wakeup() function contains a critical section protected by
 spinlocks. On a PREEMPT_RT system, a normal spinlock behaves just
 like a mutex: scheduling is not disabled and it is still possible that
 a new process on a higher RT priority is scheduled in. When a new -
 higher priority - process is scheduled in just when the put_ldisc() is
 in the critical section of the wakeup function, the higher priority
 process (that uses the same TTY instance) will finally also
 dereference the line discipline and try to wakeup the same waitqueue.
 This causes the high priority process to  block on the same spinlock.
 Priority inheritance will solve this blocked situation by a context
 switch to the lower priority process, run until that process leaves
 the critical section, and a context switch back to the higher priority
 process. This is unnecessary since the waitqueue was empty after all
 (during normal operation the waitqueue is empty most of the time).
 This unnecessary context switch from/to the high priority process is
 what a mean with scheduler overhead (maybe not a good name for it,
 sorry for the confusion).
 
 Does this makes sense to you?

Yes. Thank you very much for the explanation :) But I don't see how the
context switching goes away with your patch. With your patch, when the
higher priority thread comes in while the lower priority thread is
running in the critical section, it will see the wait queue empty and
continue its execution without now wanting to enter the critical
section. So this means it will preempt the lower priority thread because
it is not waiting on a lock anyway. There is a context switch here, right?
I don't see any problem in scheduling due to this, but I do think your
patch is essential.

The entire logic of
waitqueue_active()
  wake_up()

could be integrated into wake_up(). I don't understand why we need a
separate function to check the emptiness of the wait queue. But as Oleg
pointed out, we must identify the places for this optimization.
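As a sketch of what such an integration could look like (the helper name is
hypothetical, and whether the barrier is needed depends on how the caller
orders its wakeup condition against the waiter queueing itself):

static inline void wake_up_if_active(wait_queue_head_t *q)
{
	/* make the waiter's queueing visible before we test for it */
	smp_mb();
	if (waitqueue_active(q))
		wake_up(q);
}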

Regards
Preeti U Murthy



Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer

2013-01-17 Thread Preeti U Murthy
Hi Alex,

On 01/16/2013 07:38 PM, Alex Shi wrote:
 On 01/08/2013 04:41 PM, Preeti U Murthy wrote:
 Hi Mike,

 Thank you very much for such a clear and comprehensive explanation.
 So when I put together the problem and the proposed solution pieces in the
 current scheduler scalability, the following was what I found:

 1. select_idle_sibling() is needed as an agent to correctly find the right
 cpu for wake-up tasks to go to. Correctly would mean finding an idle cpu at
 the lowest cost possible.
 2. Cost could be lowered either by optimizing the order of searching for an
 idle cpu or by restricting the search to a few cpus alone.
 3. The former has the problem that it would not prevent bouncing tasks all
 over the domain sharing an L3 cache, which could potentially affect the
 fast moving tasks.
 4. The latter has the problem that it is not aggressive enough in finding an
 idle cpu.

 This is some tangled problem, but I think the solution at best could be
 smoothed into a flowchart.

 STEP 1: See if the idle buddy is free at all sched domains.
 If yes, return the buddy cpu; if not, go to STEP 2.
 STEP 2: Decide whether to search the sched domains for an idle cpu.
 If not (saturated system), return prev_cpu; if yes, go to STEP 3.
 STEP 3: Do the optimized search and return the target cpu.

 
 
 
 I re-written the patch as following. hackbench/aim9 doest show clean 
 performance change.
 Actually we can get some profit. it also will be very slight. :) 
 BTW, it still need another patch before apply this. Just to show the logical.
 
 ===
 From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
 From: Alex Shi alex@intel.com
 Date: Fri, 11 Jan 2013 16:49:03 +0800
 Subject: [PATCH 3/7] sched: select_idle_sibling optimization
 
 Current logical in this function will insist to wake up the task in a
 totally idle group, otherwise it would rather back to previous cpu.

As Namhyung pointed out this could be the waking cpu as well.
 
 The new logical will try to wake up the task on any idle cpu in the same
 cpu socket (in same sd_llc), while idle cpu in the smaller domain has
 higher priority.

Here is where the problem of large sockets comes in; select_idle_sibling()
has its main weakness here. No doubt this patch could improve
performance over the current logic, but it will still retain the major
drawback of searching for an idle cpu across a large socket in the worst case.
 
 It should has some help on burst wake up benchmarks like aim7.
 
 Original-patch-by: Preeti U Murthy pre...@linux.vnet.ibm.com
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c |   40 +++-
  1 files changed, 19 insertions(+), 21 deletions(-)
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index e116215..fa40e49 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct 
 task_struct *p, int this_cpu)
  /*
   * Try and locate an idle CPU in the sched_domain.
   */
 -static int select_idle_sibling(struct task_struct *p)
 +static int select_idle_sibling(struct task_struct *p,
 + struct sched_domain *affine_sd, int sync)

As Namhyung pointed out where is affine_sd being used?
  {
   int cpu = smp_processor_id();
   int prev_cpu = task_cpu(p);
   struct sched_domain *sd;
   struct sched_group *sg;
 - int i;
 
   /*
* If the task is going to be woken-up on this cpu and if it is
 @@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
   /*
* Otherwise, iterate the domains and find an elegible idle cpu.
*/
 - sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
 - for_each_lower_domain(sd) {
 + for_each_domain(prev_cpu, sd) {

Why is prev_cpu being used? Ideally it should be the target cpu
(waking/prev) depending on what wake_affine() decides.

   sg = sd-groups;
   do {
 - if (!cpumask_intersects(sched_group_cpus(sg),
 - tsk_cpus_allowed(p)))
 - goto next;
 -
 - for_each_cpu(i, sched_group_cpus(sg)) {
 - if (!idle_cpu(i))
 - goto next;
 - }
 -
 - prev_cpu = cpumask_first_and(sched_group_cpus(sg),
 - tsk_cpus_allowed(p));
 - goto done;
 -next:
 - sg = sg-next;
 - } while (sg != sd-groups);
 + int nr_busy

Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer

2013-01-17 Thread Preeti U Murthy
Hi Namhyung,

 I re-written the patch as following. hackbench/aim9 doest show clean 
 performance change.
 Actually we can get some profit. it also will be very slight. :) 
 BTW, it still need another patch before apply this. Just to show the logical.

 ===
 From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
 From: Alex Shi alex@intel.com
 Date: Fri, 11 Jan 2013 16:49:03 +0800
 Subject: [PATCH 3/7] sched: select_idle_sibling optimization

 Current logical in this function will insist to wake up the task in a
 totally idle group, otherwise it would rather back to previous cpu.
 
 Or current cpu depending on result of wake_affine(), right?
 

 The new logical will try to wake up the task on any idle cpu in the same
 cpu socket (in same sd_llc), while idle cpu in the smaller domain has
 higher priority.
 
 But what about SMT domain?

The previous approach also descended till the SMT domain. Here we start
from the SMT domain.

You could check with /proc/schedstat which domains the cpu is a part
of; the SMT domain happens to be domain0. As far
as I know, for_each_lower_domain will descend till domain0.
 
 I mean it seems that the code prefers running a task on a idle cpu which
 is a sibling thread in the same core rather than running it on an idle
 cpu in another idle core.  I guess we didn't do that before.
 

 It should has some help on burst wake up benchmarks like aim7.

 Original-patch-by: Preeti U Murthy pre...@linux.vnet.ibm.com
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c |   40 +++-
  1 files changed, 19 insertions(+), 21 deletions(-)

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index e116215..fa40e49 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct 
 task_struct *p, int this_cpu)
  /*
   * Try and locate an idle CPU in the sched_domain.
   */
 -static int select_idle_sibling(struct task_struct *p)
 +static int select_idle_sibling(struct task_struct *p,
 +struct sched_domain *affine_sd, int sync)
 
 Where are these arguments used?
 
 
  {
  int cpu = smp_processor_id();
  int prev_cpu = task_cpu(p);
  struct sched_domain *sd;
  struct sched_group *sg;
 -int i;
  
  /*
   * If the task is going to be woken-up on this cpu and if it is
 @@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
  /*
   * Otherwise, iterate the domains and find an elegible idle cpu.
   */
 -sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
 -for_each_lower_domain(sd) {
 +for_each_domain(prev_cpu, sd) {
 
 Always start from the prev_cpu?
 
 
  sg = sd-groups;
  do {
 -if (!cpumask_intersects(sched_group_cpus(sg),
 -tsk_cpus_allowed(p)))
 -goto next;
 -
 -for_each_cpu(i, sched_group_cpus(sg)) {
 -if (!idle_cpu(i))
 -goto next;
 -}
 -
 -prev_cpu = cpumask_first_and(sched_group_cpus(sg),
 -tsk_cpus_allowed(p));
 -goto done;
 -next:
 -sg = sg-next;
 -} while (sg != sd-groups);
 +int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
 +int i;
 +
 +/* no idle cpu in the group */
 +if (nr_busy == sg->group_weight)
 +continue;
 
 Maybe we can skip local group since it's a bottom-up search so we know
 there's no idle cpu in the lower domain from the prior iteration.

We could have done this for the current logic because it checks for an
*idle* group; the local group would definitely fail this test. But here we
need to check the local group also, because we are looking for an idle cpu.


Regards
Preeti U Murthy



Re: [PATCH] tty: Only wakeup the line discipline idle queue when queue is active

2013-01-17 Thread Preeti U Murthy
On 01/16/2013 05:32 PM, Ivo Sieben wrote:
 2013/1/16 Preeti U Murthy pre...@linux.vnet.ibm.com:

 Yes.Thank you very much for the explanation :) But I dont see how the
 context switching goes away with your patch.With your patch, when the
 higher priority thread comes in when the lower priority thread is
 running in the critical section,it will see the wait queue empty and
 continue its execution without now wanting to enter the critical
 section.So this means it will preempt the lower priority thread because
 it is not waiting on a lock anyway.There is a context switch here right?
 I dont see any problem in scheduling due to this,but I do think your
 patch is essential.

 
 I don't have a problem that there is a context switch to the high
 priority process: it has a higher priority, so it probably is more
 important.
 My problem is that even when the waitqueue is empty, the high priority
 thread has a risk to block on the spinlock needlessly (causing context
 switches to low priority task and back to the high priority task)
 
Fair enough, Ivo. I think you should go ahead with merging the
waitqueue_active()
  wake_up()
logic into the wake_up() variants.

Regards
Preeti U Murthy



Re: [patch v4 08/18] Revert sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking

2013-02-13 Thread Preeti U Murthy
Hi everyone,

On 02/13/2013 09:15 PM, Paul Turner wrote:
 On Wed, Feb 13, 2013 at 7:23 AM, Alex Shi alex@intel.com wrote:
 On 02/12/2013 06:27 PM, Peter Zijlstra wrote:
 On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
 Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
 we can use runnable load variables.

 It would be nice if we could quantify the performance hit of doing so.
 Haven't yet looked at later patches to see if we remove anything to
 off-set this.


 In our rough testing, no much clear performance changes.

 
 I'd personally like this to go with a series that actually does
 something with it.
 
 There's been a few proposals floating around on _how_ to do this; but
 the challenge is in getting it stable enough that all of the wake-up
 balancing does not totally perforate your stability gains into the
 noise.  select_idle_sibling really is your nemesis here.
 
 It's a small enough patch that it can go at the head of any such
 series (and indeed; it was originally structured to make such a patch
 rather explicit.)
 
 --
 Thanks
 Alex
 

Paul, what exactly do you mean by select_idle_sibling() being our nemesis
here? What we observed through our experiments was that:
1. With the per entity load tracking (runnable_load_avg) in load
balancing, the load is distributed appropriately across the cpus.
2. However, when a task sleeps and wakes up, select_idle_sibling() searches
for the idlest group top to bottom. If a suitable candidate is not
found, it wakes up the task on the prev_cpu/waker_cpu. This would increase
the runqueue size and load of the prev_cpu/waker_cpu respectively.
3. The load balancer would then come to the rescue and redistribute the load.

As a consequence,

*The primary observation was that there is no performance degradation
with the integration of per entity load tracking into the load balancer,
but there was a good increase in the number of migrations*. This, as I
see it, is due to points 2 and 3 above. Is this what you call
the nemesis? Or:

select_idle_sibling() does a top to bottom search of the chosen domain
for an idlest group and is very likely to spread the waking task to a
far off group on underutilized systems. This would prove costly
for the software buddies in finding each other, due to the time taken for
the search and the possible spreading of the software buddy tasks. Is
this what you call the nemesis?

Another approach to remove the above two nemeses, if they are such, would be
to use blocked_load+runnable_load for balancing, but when waking up a
task, use select_idle_sibling() only to search the L2 cache domains for
an idlest group. If unsuccessful, return the prev_cpu, which has already
accounted for the task in its blocked_load; hence this move would not
increase its load. Would you recommend going in this direction?
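As a sketch, the run queue load seen by the balancer would then look something
like the below (the helper name is hypothetical):

static u64 steady_cpuload(const int cpu)
{
	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;

	/* runnable + blocked gives the steady-state view: a task that has
	 * gone to sleep keeps its share accounted on its prev_cpu */
	return cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
}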

Thank you

Regards
Preeti U Murthy



Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake

2013-02-14 Thread Preeti U Murthy
;
 
for_each_cpu(i, sched_group_cpus(group)) {
-   /* Bias balancing toward cpus of our domain */
-   if (local_group)
-   load = source_load(i, load_idx);
-   else
-   load = target_load(i, load_idx);
+   load = weighted_cpuload(i);
 
avg_load += load;
}
@@ -3227,20 +3218,12 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p,
avg_load = (avg_load * SCHED_POWER_SCALE);
do_div(avg_load, group-sgp-power);
 
-   if (local_group) {
-   this_load = avg_load;
-   this_group = group;
-   } else if (avg_load  min_load) {
+   if (avg_load  min_load) {
min_load = avg_load;
idlest = group;
}
} while (group = group-next, group != sd-groups);
 
-   if (this_group  idlest!= this_group) {
-   /* Bias towards our group again */
-   if (!idlest || 100*this_load  imbalance*min_load)
-   return this_group;
-   }
return idlest;
 }
 
@@ -3248,7 +3231,7 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p,
  * find_idlest_cpu - find the idlest cpu among the cpus in group.
  */
 static int
-find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+find_idlest_cpu(struct sched_group *group, struct task_struct *p)
 {
u64 load, min_load = ~0ULL;
int idlest = -1;
@@ -3258,7 +3241,7 @@ find_idlest_cpu(struct sched_group *group, struct 
task_struct *p, int this_cpu)
for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
load = weighted_cpuload(i);
 
-   if (load  min_load || (load == min_load  i == this_cpu)) {
+   if (load  min_load) {
min_load = load;
idlest = i;
}
@@ -3325,6 +3308,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, 
int wake_flags)
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
int new_cpu = cpu;
+   struct sched_group *group;
int want_affine = 0;
int sync = wake_flags  WF_SYNC;
 
@@ -3367,8 +3351,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, 
int wake_flags)
}
 
while (sd) {
-   int load_idx = sd-forkexec_idx;
-   struct sched_group *group;
int weight;
 
if (!(sd-flags  sd_flag)) {
@@ -3376,15 +3358,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, 
int wake_flags)
continue;
}
 
-   if (sd_flag  SD_BALANCE_WAKE)
-   load_idx = sd-wake_idx;
-
-   group = find_idlest_group(sd, p, cpu, load_idx);
-   if (!group) {
-   goto unlock;
-   }
+   group = find_idlest_group(sd, p, cpu);
 
-   new_cpu = find_idlest_cpu(group, p, cpu);
+   new_cpu = group_first_cpu(group);
 
/* Now try balancing at a lower domain level of new_cpu */
cpu = new_cpu;
@@ -3398,6 +3374,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, 
int wake_flags)
}
/* while loop will break here if sd == NULL */
}
+   new_cpu = find_idlest_cpu(group, p);
 unlock:
rcu_read_unlock();
 
@@ -4280,15 +4257,15 @@ static inline void update_sd_lb_power_stats(struct 
lb_env *env,
if (sgs-sum_nr_running + 1   sgs-group_capacity)
return;
if (sgs-group_util  sds-leader_util ||
-   sgs-group_util == sds-leader_util 
-   group_first_cpu(group)  group_first_cpu(sds-group_leader)) {
+   (sgs-group_util == sds-leader_util 
+   group_first_cpu(group)  group_first_cpu(sds-group_leader))) {
sds-group_leader = group;
sds-leader_util = sgs-group_util;
}
/* Calculate the group which is almost idle */
if (sgs-group_util  sds-min_util ||
-   sgs-group_util == sds-min_util 
-   group_first_cpu(group)  group_first_cpu(sds-group_leader)) {
+   (sgs-group_util == sds-min_util 
+   group_first_cpu(group)  group_first_cpu(sds-group_leader))) {
sds-group_min = group;
sds-min_util = sgs-group_util;
sds-min_load_per_task = sgs-sum_weighted_load;

 
 
Regards
Preeti U Murthy



Re: [patch v4 07/18] sched: set initial load avg of new forked task

2013-02-19 Thread Preeti U Murthy
Hi everyone,

On 02/19/2013 05:04 PM, Paul Turner wrote:
 On Fri, Feb 15, 2013 at 2:07 AM, Alex Shi alex@intel.com wrote:

 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 1dff78a..9d1c193 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -1557,8 +1557,8 @@ static void __sched_fork(struct task_struct *p)
   * load-balance).
   */
  #if defined(CONFIG_SMP)  defined(CONFIG_FAIR_GROUP_SCHED)
 -   p-se.avg.runnable_avg_period = 0;
 -   p-se.avg.runnable_avg_sum = 0;
 +   p-se.avg.runnable_avg_period = 1024;
 +   p-se.avg.runnable_avg_sum = 1024;

 It can't work.
 avg.decay_count needs to be set to 0 before enqueue_entity_load_avg(), then
 update_entity_load_avg() can't be called, so, runnable_avg_period/sum
 are unusable.
 
 Well we _could_ also use a negative decay_count here and treat it like
 a migration; but the larger problem is the visibility of p-on_rq;
 which is gates whether we account the time as runnable and occurs
 after activate_task() so that's out.
 

 Even we has chance to call __update_entity_runnable_avg(),
 avg.last_runnable_update needs be set before that, usually, it needs to
 be set as 'now', that cause __update_entity_runnable_avg() function
 return 0, then update_entity_load_avg() still can not reach to
 __update_entity_load_avg_contrib().

 If we embed a simple new task load initialization to many functions,
 that is too hard for future reader.
 
 This is my concern about making this a special case with the
 introduction ENQUEUE_NEWTASK flag; enqueue jumps through enough hoops
 as it is.
 
 I still don't see why we can't resolve this at init time in
 __sched_fork(); your patch above just moves an explicit initialization
 of load_avg_contrib into the enqueue path.  Adding a call to
 __update_task_entity_contrib() to the previous alternate suggestion
 would similarly seem to resolve this?

We could do this (adding a call to __update_task_entity_contrib()), but
cfs_rq->runnable_load_avg gets updated only if the task is on the runqueue,
and in the forked task's case the on_rq flag is not yet set. Something like
the below:

---
 kernel/sched/fair.c |   18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8691b0d..841e156 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1451,14 +1451,20 @@ static inline void update_entity_load_avg(struct 
sched_entity *se,
else
now = cfs_rq_clock_task(group_cfs_rq(se));
 
-   if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
-   return;
-
+   if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) {
+   if (!(flags & ENQUEUE_NEWTASK))
+   return;
+   }
    contrib_delta = __update_entity_load_avg_contrib(se);
 
    if (!update_cfs_rq)
        return;
 
+   /* But the cfs_rq->runnable_load_avg does not get updated in case of
+    * a forked task, because se->on_rq = 0, although we update the
+    * task's load_avg_contrib above in
+    * __update_entity_load_avg_contrib().
+    */
    if (se->on_rq)
        cfs_rq->runnable_load_avg += contrib_delta;
else
@@ -1538,12 +1544,6 @@ static inline void enqueue_entity_load_avg(struct cfs_rq 
*cfs_rq,
        subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
update_entity_load_avg(se, 0);
}
-   /*
-* set the initial load avg of new task same as its load
-* in order to avoid brust fork make few cpu too heavier
-*/
-   if (flags & ENQUEUE_NEWTASK)
-   se->avg.load_avg_contrib = se->load.weight;
 
    cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
/* we force update consideration on load-balancer moves */

Thanks

Regards
Preeti U Murthy



Re: [patch v5 06/15] sched: log the cpu utilization at rq

2013-02-20 Thread Preeti U Murthy
Hi,

  /*
   * This is the main, per-CPU runqueue data structure.
   *
 @@ -481,6 +484,7 @@ struct rq {
  #endif
  
  struct sched_avg avg;
 +unsigned int util;
  };
  
  static inline int cpu_of(struct rq *rq)
 
 You don't actually compute the rq utilization, you only compute the
 utilization as per the fair class, so if there's significant RT activity
 it'll think the cpu is under-utilized, whihc I think will result in the
 wrong thing.

Correct me if I am wrong, but doesn't the current load balancer also
disregard the real time tasks when calculating the domain/group/cpu level
load?

What I mean is, if the answer to the above question is yes, then can we
safely assume that further optimizations to the load balancer, like
the power aware scheduler and the usage of per entity load tracking, can
be done without considering the real time tasks?

Regards
Preeti U Murthy
 



Re: [patch v5 06/15] sched: log the cpu utilization at rq

2013-02-20 Thread Preeti U Murthy
Hi everyone,

On 02/18/2013 10:37 AM, Alex Shi wrote:
 The cpu's utilization is to measure how busy is the cpu.
 util = cpu_rq(cpu)->avg.runnable_avg_sum
 / cpu_rq(cpu)->avg.runnable_avg_period;

Why not cfs_rq->runnable_load_avg? I am concerned with what the right
metric to use here is.
Refer to this discussion: https://lkml.org/lkml/2012/10/29/448

Regards
Preeti U Murthy



Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake

2013-02-24 Thread Preeti U Murthy
Hi Alex,

On 02/24/2013 02:57 PM, Alex Shi wrote:
 On 02/22/2013 04:54 PM, Peter Zijlstra wrote:
 On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote:
 The name is a secondary issue, first you need to explain why you
 think
 nr_running is a useful metric at all.

 You can have a high nr_running and a low utilization (a burst of
 wakeups, each waking a process that'll instantly go to sleep again),
 or
 low nr_running and high utilization (a single process cpu bound
 process).

 It is true in periodic balance. But in fork/exec/waking timing, the
 incoming processes usually need to do something before sleep again.

 You'd be surprised, there's a fair number of workloads that have
 negligible runtime on wakeup.
 
 will appreciate if you like introduce some workload. :)
 BTW, do you has some idea to handle them?
 Actually, if tasks is just like transitory, it is also hard to catch
 them in balance, like 'cyclitest -t 100' on my 4 LCPU laptop, vmstat
 just can catch 1 or 2 tasks very second.

 I use nr_running to measure how the group busy, due to 3 reasons:
 1, the current performance policy doesn't use utilization too.

 We were planning to fix that now that its available.
 
 I had tried, but failed on aim9 benchmark. As a result I give up to use
 utilization in performance balance.
 Some trying and talking in the thread.
 https://lkml.org/lkml/2013/1/6/96
 https://lkml.org/lkml/2013/1/22/662

 2, the power policy don't care load weight.

 Then its broken, it should very much still care about weight.
 
 Here power policy just use nr_running as the criteria to check if it's
 eligible for power aware balance. when do balancing the load weight is
 still the key judgment.
 

 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some
 benchmark results looks clear bad when use utilization. if my memory
 right, the hackbench/aim7 both looks bad. I had tried many ways to
 engage utilization into this balance, like use utilization only, or
 use
 utilization * nr_running etc. but still can not find a way to recover
 the lose. But with nr_running, the performance seems doesn't lose much
 with power policy.

 You're failing to explain why utilization performs bad and you don't
 explain why nr_running is better. That things work simply isn't good
 
 Um, let me try to explain again, The utilisation need much time to
 accumulate itself(345ms). Whenever with or without load weight, many
 bursting tasks just give a minimum weight to the carrier CPU at the
 first few ms. So, it is too easy to do a incorrect distribution here and
 need migration on later periodic balancing.

I don't understand why forked tasks take time to accumulate their
load. I would understand this if it were a woken-up task. The first time
the forked task gets a chance to update the load itself, it needs to
reflect full utilization. In __update_entity_runnable_avg() both
runnable_avg_period and runnable_avg_sum get equally incremented for a
forked task, since it is runnable. Hence where is the chance for the load
to get incremented in steps?

For sleeping tasks, since runnable_avg_sum progresses much more slowly than
runnable_avg_period, these tasks take much time to accumulate the load
when they wake up. This makes sense of course. But how does this happen
for forked tasks?

Regards
Preeti U Murthy



Re: [patch v5 02/15] sched: set initial load avg of new forked task

2013-02-24 Thread Preeti U Murthy
Hi Alex,

On 02/20/2013 11:50 AM, Alex Shi wrote:
 On 02/18/2013 01:07 PM, Alex Shi wrote:
 New task has no runnable sum at its first runnable time, so its
 runnable load is zero. That makes burst forking balancing just select
 few idle cpus to assign tasks if we engage runnable load in balancing.

 Set initial load avg of new forked task as its load weight to resolve
 this issue.

 
 patch answering PJT's update here. that merged the 1st and 2nd patches 
 into one. other patches in serial don't need to change.
 
 =
 From 89b56f2e5a323a0cb91c98be15c94d34e8904098 Mon Sep 17 00:00:00 2001
 From: Alex Shi alex@intel.com
 Date: Mon, 3 Dec 2012 17:30:39 +0800
 Subject: [PATCH 01/14] sched: set initial value of runnable avg for new
  forked task
 
 We need initialize the se.avg.{decay_count, load_avg_contrib} for a
 new forked task.
 Otherwise random values of above variables cause mess when do new task
 enqueue:
 enqueue_task_fair
 enqueue_entity
 enqueue_entity_load_avg
 
 and make forking balancing imbalance since incorrect load_avg_contrib.
 
 set avg.decay_count = 0, and avg.load_avg_contrib = se-load.weight to
 resolve such issues.
 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/core.c | 3 +++
  kernel/sched/fair.c | 4 
  2 files changed, 7 insertions(+)
 
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 26058d0..1452e14 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -1559,6 +1559,7 @@ static void __sched_fork(struct task_struct *p)
  #if defined(CONFIG_SMP)  defined(CONFIG_FAIR_GROUP_SCHED)
   p-se.avg.runnable_avg_period = 0;
   p-se.avg.runnable_avg_sum = 0;
 + p-se.avg.decay_count = 0;
  #endif
  #ifdef CONFIG_SCHEDSTATS
   memset(p-se.statistics, 0, sizeof(p-se.statistics));
 @@ -1646,6 +1647,8 @@ void sched_fork(struct task_struct *p)
   p-sched_reset_on_fork = 0;
   }
 
I think the following comment will help here.
/* All forked tasks are assumed to have full utilization to begin with */
 + p->se.avg.load_avg_contrib = p->se.load.weight;
 +
   if (!rt_prio(p-prio))
   p-sched_class = fair_sched_class;
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 81fa536..cae5134 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct 
 cfs_rq *cfs_rq,
* We track migrations using entity decay_count = 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
 +  *
 +  * When enqueue a new forked task, the se-avg.decay_count == 0, so
 +  * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
 +  * value: se-load.weight.

I disagree with the comment. update_entity_load_avg() gets called for all
forked tasks:
enqueue_task_fair()->update_entity_load_avg() during the second
iteration. But __update_entity_load_avg() in update_entity_load_avg(),
where the actual load update happens, does not get called. This is
because, as below, the last update of the forked task is nearly equal to
the clock_task of the runqueue. Hence probably 1ms has not passed by for
the load to get updated, which is why neither the load of the task nor the
load of the runqueue gets updated when the task forks.

Also note that the reason we bypass update_entity_load_avg() below is
not because our decay_count == 0. It is because the forked tasks have nothing
to update. Only woken-up tasks and migrated wakeups have load updates to
do. Forked tasks just got created; they have no load to update but only
to create. This I feel is rightly done in sched_fork() by this patch.

So ideally I don't think we should have any comment here. It does not
sound relevant.

*/
 if (unlikely(se->avg.decay_count <= 0)) {
 se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
 


Regards
Preeti U Murthy



Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake

2013-02-24 Thread Preeti U Murthy
Hi,

On 02/24/2013 02:57 PM, Alex Shi wrote:
 On 02/22/2013 04:54 PM, Peter Zijlstra wrote:
 On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote:
 The name is a secondary issue, first you need to explain why you
 think
 nr_running is a useful metric at all.

 You can have a high nr_running and a low utilization (a burst of
 wakeups, each waking a process that'll instantly go to sleep again),
 or
 low nr_running and high utilization (a single process cpu bound
 process).

 It is true in periodic balance. But in fork/exec/waking timing, the
 incoming processes usually need to do something before sleep again.

 You'd be surprised, there's a fair number of workloads that have
 negligible runtime on wakeup.
 
 will appreciate if you like introduce some workload. :)
 BTW, do you has some idea to handle them?
 Actually, if tasks is just like transitory, it is also hard to catch
 them in balance, like 'cyclitest -t 100' on my 4 LCPU laptop, vmstat
 just can catch 1 or 2 tasks very second.

 I use nr_running to measure how the group busy, due to 3 reasons:
 1, the current performance policy doesn't use utilization too.

 We were planning to fix that now that its available.
 
 I had tried, but failed on aim9 benchmark. As a result I give up to use
 utilization in performance balance.
 Some trying and talking in the thread.
 https://lkml.org/lkml/2013/1/6/96
 https://lkml.org/lkml/2013/1/22/662

 2, the power policy don't care load weight.

 Then its broken, it should very much still care about weight.
 
 Here power policy just use nr_running as the criteria to check if it's
 eligible for power aware balance. when do balancing the load weight is
 still the key judgment.
 

 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some
 benchmark results looks clear bad when use utilization. if my memory
 right, the hackbench/aim7 both looks bad. I had tried many ways to
 engage utilization into this balance, like use utilization only, or
 use
 utilization * nr_running etc. but still can not find a way to recover
 the lose. But with nr_running, the performance seems doesn't lose much
 with power policy.

 You're failing to explain why utilization performs bad and you don't
 explain why nr_running is better. That things work simply isn't good
 
 Um, let me try to explain again, The utilisation need much time to
 accumulate itself(345ms). Whenever with or without load weight, many
 bursting tasks just give a minimum weight to the carrier CPU at the
 first few ms. So, it is too easy to do a incorrect distribution here and
 need migration on later periodic balancing.

Why can't this be attacked in *either* of the following ways:

1. Attack this problem at the source, by ensuring that the utilisation is
accumulated faster by making the update window smaller.

2. Balance on nr_running only if you detect burst wakeups.
Alex, you had released a patch earlier which could detect this, right?
Instead of balancing on nr_running all the time, why not balance on it
only if burst wakeups are detected? By doing so you ensure that
nr_running as a metric for load balancing is used when it is right to do
so, and the reason to use it also gets well documented; a rough sketch of
what I mean follows below.
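Something along these lines is what I have in mind for (2). This is only a
rough, untested sketch: the struct, the helper name and BURST_TICKS are made
up for illustration, and the threshold mirrors the "nr_running grew by more
than the domain span within a couple of ticks" idea discussed elsewhere in
these threads.

	/* Hypothetical burst detector, not an existing kernel interface. */
	#define BURST_TICKS	2

	struct burst_hint {
		unsigned long	last_nr_running;
		unsigned long	last_check;
	};

	static bool burst_wakeups_detected(struct sched_domain *sd,
					   struct burst_hint *h,
					   unsigned long nr_running)
	{
		bool burst = false;

		if (time_after(jiffies, h->last_check + BURST_TICKS)) {
			/* grew by more than the domain span since last check */
			burst = nr_running > h->last_nr_running + sd->span_weight;
			h->last_nr_running = nr_running;
			h->last_check = jiffies;
		}
		return burst;
	}

Only when such a burst is seen would fork/exec/wake balancing fall back to
nr_running as the criterion; otherwise the runnable load average would be
used.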

Regards
Preeti U Murthy



Re: weakness of runnable load tracking?

2012-12-05 Thread Preeti U Murthy
Hi Alex,
 Hi Paul  Ingo:
 
 In a short word of this issue: burst forking/waking tasks have no time
 accumulate the load contribute, their runnable load are taken as zero.

On performing certain experiments on the way PJT's metric calculates the
load, I observed a few things. Based on these observations, let me see if I
can address the issue of why PJT's metric calculates the load of
bursty tasks as 0.

When we speak about a burst-waking task (I will not go into forking
here), we should also speak about its duty cycle: it burst wakes for 1ms
of a 10ms duty cycle, or for 1s out of a 10s duty cycle, both
being 10% tasks wrt their duty cycles. Let us see how load is calculated by
PJT's metric in each of the above cases.
            ______
           |      |
           |      |
  _________|      |_________
           A      B
            <-1ms->
  <--------- 10ms ---------->

          Example 1

When the task wakes up at A, it is not yet runnable, and an update of the
task load takes place. Its runtime so far is 0, and its existence time is
10ms. Hence the load is 0/10 * 1024. Since a scheduler tick happens at B (a
scheduler tick happens every 1ms, 10ms or 4ms; let us assume 1ms), an
update of the load takes place. PJT's metric divides the time elapsed
into 1ms windows. There is just one 1ms window, and hence the runtime is 1ms
and the load is 1ms/10ms * 1024.

*If the time elapsed between A and B were to be < 1ms, then PJT's metric
will not capture it.*

And under these circumstances the load remains 0/10ms * 1024 = 0. This is the
situation you are pointing out. Let us assume that this cycle continues
throughout the lifetime of the load; then the load remains at 0. The
question is whether such tasks, which run for periods < 1ms, are ok to be
termed as 0 workloads. If that is fine, then what PJT's metric is doing is
right. Maybe we should ignore such workloads because they hardly
contribute to the load. Otherwise we will need to reduce the window of
load update to < 1ms to capture such loads.


Just for some additional info so that we know what happens to different
kinds of loads with PJT's metric,consider the below situation:
 --
|  |
|  |
|  |
A  B
   1s
--
---
  10s

   Example 2

Here at A the task wakes, just like in Example 1, and the load is termed 0.
In between A and B, if we consider the load to get updated at every
scheduler tick, then the load slowly increases from 0 to 1024 at B. It is
1024 here, although this is also a 10% task, whereas in Example 1 the load
is 102.4 for a 10% task. So what is fishy?

In my opinion, PJT's metric gives the tasks some time to prove their
activeness after they wake up. In Example 2 the task has stayed awake too
long (1s), irrespective of what % of the total run time it is. Therefore it
calculates the load to be big enough to balance.

In the example that you have quoted, the tasks may not have run long
enough to consider them as candidates for load balance.

So, essentially what PJT's metric is doing is characterising a task by
the amount it has run so far.
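To make the above characterisation concrete, below is a toy, userspace-only
model of the two sums involved. It is an illustration only: accounting is
done in whole 1ms windows and the ~32ms-half-life decay that the kernel
applies to both sums is ignored.

	#include <stdio.h>

	struct toy_avg {
		unsigned int runnable_avg_sum;		/* ms seen runnable */
		unsigned int runnable_avg_period;	/* ms of existence */
	};

	static void account_window(struct toy_avg *a, int was_runnable)
	{
		a->runnable_avg_period++;
		if (was_runnable)
			a->runnable_avg_sum++;
	}

	/* same shape as __update_task_entity_contrib() */
	static unsigned int contrib(struct toy_avg *a, unsigned int weight)
	{
		return a->runnable_avg_sum * weight / (a->runnable_avg_period + 1);
	}

	int main(void)
	{
		struct toy_avg ex1 = { 0, 0 }, ex2 = { 0, 0 };
		int t;

		/* Example 1: bursts shorter than one 1ms window are never seen
		 * as runnable, so the sum stays 0 however long this goes on. */
		for (t = 0; t < 10000; t++)
			account_window(&ex1, 0);

		/* Example 2: sampled right at B, after a contiguous 1s of
		 * runtime; sum == period, so the contribution is close to the
		 * full weight and the task looks worth balancing. */
		for (t = 0; t < 1000; t++)
			account_window(&ex2, 1);

		printf("example1 ~%u, example2 ~%u\n",
		       contrib(&ex1, 1024), contrib(&ex2, 1024));
		return 0;
	}

This prints roughly 0 for Example 1 and a value close to 1024 for Example 2,
which is exactly the "time to prove their activeness" behaviour described
above.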


 that make select_task_rq do a wrong decision on which group is idlest.
 
 There is still 3 kinds of solution is helpful for this issue.
 
 a, set a unzero minimum value for the long time sleeping task. but it
 seems unfair for other tasks these just sleep a short while.
 
 b, just use runnable load contrib in load balance. Still using
 nr_running to judge idlest group in select_task_rq_fair. but that may
 cause a bit more migrations in future load balance.
 
 c, consider both runnable load and nr_running in the group: like in the
 searching domain, the nr_running number increased a certain number, like
 double of the domain span, in a certain time. we will think it's a burst
 forking/waking happened, then just count the nr_running as the idlest
 group criteria.
 
 IMHO, I like the 3rd one a bit more. as to the certain time to judge if
 a burst happened, since we will calculate the runnable avg at very tick,
 so if increased nr_running is beyond sd-span_weight in 2 ticks, means
 burst happening. What's your opinion of this?
 
 Any comments are appreciated!


So PJT's metric rightly seems to be capturing the load of these bursty
tasks, but you are right in pointing out that when too many such loads
queue up on the cpu, this metric will consider the load on the cpu as
0, which might not be such a good idea.

It is true that we need to bring in nr_running somewhere. Let me now go
through your suggestions on where to include nr_running and get back on
this. I had planned on including nr_running while selecting the busy
group in update_sd_lb_stats(), but select_task_rq_fair() is yet another place
to do this, that's right. Good that this issue was brought up :)

 Regards!
 Alex


 

Regards
Preeti U Murthy


Re: [PATCH 01/18] sched: select_task_rq_fair clean up

2012-12-10 Thread Preeti U Murthy
Hi Alex,

On 12/10/2012 01:52 PM, Alex Shi wrote:
 It is impossible to miss a task allowed cpu in a eligible group.

The one thing I am concerned with here is whether there is a possibility of
the task changing its tsk_cpus_allowed() while this code is running.

I.e. find_idlest_group() finds an idle group, then the tsk_cpus_allowed()
for the task changes, perhaps by the user himself, such that it no longer
includes the cpus in the idle group. After this find_idlest_cpu() is called;
a race condition, in short. Then we might not have an eligible cpu in that
group, right?

 And since find_idlest_group only return a different group which
 excludes old cpu, it's also imporissible to find a new cpu same as old
 cpu.

This I agree with.

 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c |5 -
  1 files changed, 0 insertions(+), 5 deletions(-)
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 59e072b..df99456 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3150,11 +3150,6 @@ select_task_rq_fair(struct task_struct *p, int 
 sd_flag, int wake_flags)
   }
  
   new_cpu = find_idlest_cpu(group, p, cpu);
 - if (new_cpu == -1 || new_cpu == cpu) {
 - /* Now try balancing at a lower domain level of cpu */
 - sd = sd-child;
 - continue;
 - }
  
   /* Now try balancing at a lower domain level of new_cpu */
   cpu = new_cpu;
 
Regards
Preeti U Murthy



Re: [PATCH 02/18] sched: fix find_idlest_group mess logical

2012-12-10 Thread Preeti U Murthy
Hi Alex,

On 12/10/2012 01:52 PM, Alex Shi wrote:
 There is 4 situations in the function:
 1, no task allowed group;
   so min_load = ULONG_MAX, this_load = 0, idlest = NULL
 2, only local group task allowed;
   so min_load = ULONG_MAX, this_load assigned, idlest = NULL
 3, only non-local task group allowed;
   so min_load assigned, this_load = 0, idlest != NULL
 4, local group + another group are task allowed.
   so min_load assigned, this_load assigned, idlest != NULL
 
 Current logical will return NULL in first 3 kinds of scenarios.
 And still return NULL, if idlest group is heavier then the
 local group in the 4th situation.
 
 Actually, I thought groups in situation 2,3 are also eligible to host
 the task. And in 4th situation, agree to bias toward local group.
 So, has this patch.
 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c |   12 +---
  1 files changed, 9 insertions(+), 3 deletions(-)
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index df99456..b40bc2b 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -2953,6 +2953,7 @@ find_idlest_group(struct sched_domain *sd, struct 
 task_struct *p,
 int this_cpu, int load_idx)
  {
   struct sched_group *idlest = NULL, *group = sd-groups;
 + struct sched_group *this_group = NULL;
   unsigned long min_load = ULONG_MAX, this_load = 0;
   int imbalance = 100 + (sd-imbalance_pct-100)/2;
  
 @@ -2987,14 +2988,19 @@ find_idlest_group(struct sched_domain *sd, struct 
 task_struct *p,
  
   if (local_group) {
   this_load = avg_load;
 - } else if (avg_load  min_load) {
 + this_group = group;
 + }
 + if (avg_load  min_load) {
   min_load = avg_load;
   idlest = group;
   }
   } while (group = group-next, group != sd-groups);
  
 -   if (!idlest || 100*this_load < imbalance*min_load)
 -   return NULL;
 +   if (this_group && idlest != this_group)
 +   /* Bias toward our group again */
 +   if (100*this_load < imbalance*min_load)
 +   idlest = this_group;

If the idlest group is heavier than this_group (or, to put it better, if
the difference in the loads of the local group and the idlest group is less
than a threshold, meaning there is no point moving load from the
local group), you return NULL. That immediately means this_group is chosen
as the candidate group for the task to run on; one does not have to
explicitly return that.

Let me explain:
find_idlest_group(): if it returns NULL to mark your case 4, it means there
is no idler group than the group to which this_cpu belongs, at that
level of the sched domain. Which is fair enough.

So now the question is which is the idlest group so far under such a
circumstance. It is the group containing this_cpu, i.e. this_group. After
this, sd->child is chosen, which is nothing but this_group (the sd hierarchy
moves towards the cpu it belongs to). Again here the idlest group search
begins.

 +
   return idlest;
  }
  
 
Regards
Preeti U Murthy



Re: [PATCH 02/18] sched: fix find_idlest_group mess logical

2012-12-10 Thread Preeti U Murthy
Hi Alex,
On 12/11/2012 10:59 AM, Alex Shi wrote:
 On 12/11/2012 01:08 PM, Preeti U Murthy wrote:
 Hi Alex,

 On 12/10/2012 01:52 PM, Alex Shi wrote:
 There is 4 situations in the function:
 1, no task allowed group;
 so min_load = ULONG_MAX, this_load = 0, idlest = NULL
 2, only local group task allowed;
 so min_load = ULONG_MAX, this_load assigned, idlest = NULL
 3, only non-local task group allowed;
 so min_load assigned, this_load = 0, idlest != NULL
 4, local group + another group are task allowed.
 so min_load assigned, this_load assigned, idlest != NULL

 Current logical will return NULL in first 3 kinds of scenarios.
 And still return NULL, if idlest group is heavier then the
 local group in the 4th situation.

 Actually, I thought groups in situation 2,3 are also eligible to host
 the task. And in 4th situation, agree to bias toward local group.
 So, has this patch.

 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c |   12 +---
  1 files changed, 9 insertions(+), 3 deletions(-)

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index df99456..b40bc2b 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -2953,6 +2953,7 @@ find_idlest_group(struct sched_domain *sd, struct 
 task_struct *p,
   int this_cpu, int load_idx)
  {
 struct sched_group *idlest = NULL, *group = sd-groups;
 +   struct sched_group *this_group = NULL;
 unsigned long min_load = ULONG_MAX, this_load = 0;
 int imbalance = 100 + (sd-imbalance_pct-100)/2;
  
 @@ -2987,14 +2988,19 @@ find_idlest_group(struct sched_domain *sd, struct 
 task_struct *p,
  
 if (local_group) {
 this_load = avg_load;
 -   } else if (avg_load  min_load) {
 +   this_group = group;
 +   }
 +   if (avg_load  min_load) {
 min_load = avg_load;
 idlest = group;
 }
 } while (group = group-next, group != sd-groups);
  
 -   if (!idlest || 100*this_load  imbalance*min_load)
 -   return NULL;
 +   if (this_group  idlest != this_group)
 +   /* Bias toward our group again */
 +   if (100*this_load  imbalance*min_load)
 +   idlest = this_group;

 If the idlest group is heavier than this_group(or to put it better if
 the difference in the loads of the local group and idlest group is less
 than a threshold,it means there is no point moving the load from the
 local group) you return NULL,that immediately means this_group is chosen
 as the candidate group for the task to run,one does not have to
 explicitly return that.
 
 In situation 4, this_group is not NULL.

True. The return value of find_idlest_group() indicates that there is no
idle group other than the local group (the group to which the cpu
belongs); it does not indicate that there is no host group for the
task. If this is the case, select_task_rq_fair() falls back to the
group (sd->child) to which the cpu chosen in the previous iteration
belongs. This is nothing but this_group in the current iteration.

Regards
Preeti U Murthy



Re: [PATCH 01/18] sched: select_task_rq_fair clean up

2012-12-10 Thread Preeti U Murthy
On 12/11/2012 10:58 AM, Alex Shi wrote:
 On 12/11/2012 12:23 PM, Preeti U Murthy wrote:
 Hi Alex,

 On 12/10/2012 01:52 PM, Alex Shi wrote:
 It is impossible to miss a task allowed cpu in a eligible group.

 The one thing I am concerned with here is if there is a possibility of
 the task changing its tsk_cpus_allowed() while this code is running.

 i.e find_idlest_group() finds an idle group,then the tsk_cpus_allowed()
 for the task changes,perhaps by the user himself,which might not include
 the cpus in the idle group.After this find_idlest_cpu() is called.I mean
 a race condition in short.Then we might not have an eligible cpu in that
 group right?
 
 your worry make sense, but the code handle the situation, in
 select_task_rq(), it will check the cpu allowed again. if the answer is
 no, it will fallback to old cpu.

 And since find_idlest_group only return a different group which
 excludes old cpu, it's also imporissible to find a new cpu same as old
 cpu.

I doubt this will work correctly. Consider the following situation: the sched
domain traversal begins with an sd that encloses both socket1 and socket2

cpu0 cpu1  | cpu2 cpu3
-----------|-----------
  socket1  |  socket2

old cpu = cpu1

Iteration 1:
1. find_idlest_group() returns socket2 to be the idlest.
2. The task changes tsk_allowed_cpus to 0,1.
3. find_idlest_cpu() returns cpu2.

* without your patch
   1. The condition after find_idlest_cpu() returns -1, and sd->child is
chosen, which happens to be socket1.
   2. In the next iteration, find_idlest_group() and find_idlest_cpu()
will probably choose cpu0, which happens to be idler than cpu1 and which is
in tsk_allowed_cpus.

* with your patch
   1. The condition after find_idlest_cpu() does not exist, therefore
a sched domain to which cpu2 belongs is chosen. This is socket2 (under
the for_each_domain() loop).
   2. In the next iteration, find_idlest_group() returns NULL, because
there is no cpu which intersects with tsk_allowed_cpus.
   3. In select_task_rq(), the fallback cpu is chosen even when an idle cpu
existed.

So my concern is that though select_task_rq() checks
tsk_cpus_allowed(), you might end up choosing a different path of
sched_domains compared to without this patch, as shown above.

In short, without the if (new_cpu == -1) condition we might get misled into
doing unnecessary iterations over the wrong sched domains in
select_task_rq_fair(). (Think about situations when not all the cpus of
socket2 are disallowed by the task; then there will be more iterations in
the wrong path of sched_domains before exit, compared to what is shown
above.)

Regards
Preeti U Murthy




[PATCH] sched: Explicit division calls on 64-bit integers

2012-11-19 Thread Preeti U Murthy
Certain gcc toolchains convert a division on a 64-bit dividend into an
__aeabi_uldivmod call, which does an unnecessary 64-bit by 64-bit divide
although the divisor is 32-bit. This 64-by-64-bit division is not implemented
in the kernel for reasons of efficiency, which results in undefined reference
errors at link time. Hence perform the division on 64-bit dividends
using the do_div() function.
The use case below is the integration of the Per-entity-Load-Tracking
metric with the load balancer, where cfs_rq->runnable_load_avg,
a 64-bit unsigned integer, is used as the base metric for load balancing.
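For reference, a minimal sketch of the do_div() idiom used throughout the
patch (do_div() is declared in asm/div64.h; it divides the 64-bit lvalue in
place by a 32-bit divisor, stores the quotient back into it and returns the
remainder). The helper below is illustrative only, not a hunk of this patch.

	#include <asm/div64.h>

	static u64 avg_of(u64 total_load, unsigned int nr_running)
	{
		if (!nr_running)
			return 0;
		/*
		 * A plain "total_load / nr_running" can be emitted as a call
		 * to __aeabi_uldivmod on some 32-bit toolchains; do_div()
		 * keeps it a 64-by-32 division.
		 */
		do_div(total_load, nr_running);	/* quotient left in total_load */
		return total_load;
	}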

Signed-off-by: Preeti U Murthypre...@linux.vnet.ibm.com
---
 kernel/sched/fair.c |   51 +++
 1 file changed, 31 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8f3a29..7cd3096 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2982,9 +2982,13 @@ static u64 cpu_avg_load_per_task(int cpu)
 {
struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = ACCESS_ONCE(rq-nr_running);
+   u64 cfs_avg_load_per_task;
 
-   if (nr_running)
-   return rq-cfs.runnable_load_avg / nr_running;
+   if (nr_running) {
+   cfs_avg_load_per_task = rq-cfs.runnable_load_avg;
+   do_div(cfs_avg_load_per_task, nr_running);
+   return cfs_avg_load_per_task;
+   }
 
return 0;
 }
@@ -3249,7 +3253,8 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p,
}
 
/* Adjust by relative CPU power of the group */
-   avg_load = (avg_load * SCHED_POWER_SCALE) / group-sgp-power;
+   avg_load = (avg_load * SCHED_POWER_SCALE);
+   do_div(avg_load, group-sgp-power);
 
if (local_group) {
this_load = avg_load;
@@ -4756,7 +4761,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
}
 
/* Adjust by relative CPU power of the group */
-   sgs-avg_load = (sgs-group_load*SCHED_POWER_SCALE) / group-sgp-power;
+   sgs-avg_load = (sgs-group_load*SCHED_POWER_SCALE);
+   do_div(sgs-avg_load, group-sgp-power);
 
/*
 * Consider the group unbalanced when the imbalance is larger
@@ -4767,8 +4773,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 *  normalized nr_running number somewhere that negates
 *  the hierarchy?
 */
-   if (sgs-sum_nr_running)
-   avg_load_per_task = sgs-sum_weighted_load / 
sgs-sum_nr_running;
+   if (sgs-sum_nr_running) {
+   avg_load_per_task = sgs-sum_weighted_load;
+   do_div(avg_load_per_task, sgs-sum_nr_running);
+   }
 
	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
	(max_nr_running - min_nr_running) > 1)
@@ -4953,7 +4961,7 @@ void fix_small_imbalance(struct lb_env *env, struct 
sd_lb_stats *sds)
u64 scaled_busy_load_per_task;
 
if (sds-this_nr_running) {
-   sds-this_load_per_task /= sds-this_nr_running;
+   do_div(sds-this_load_per_task, sds-this_nr_running);
if (sds-busiest_load_per_task 
sds-this_load_per_task)
imbn = 1;
@@ -4964,7 +4972,7 @@ void fix_small_imbalance(struct lb_env *env, struct 
sd_lb_stats *sds)
 
scaled_busy_load_per_task = sds-busiest_load_per_task
 * SCHED_POWER_SCALE;
-   scaled_busy_load_per_task /= sds-busiest-sgp-power;
+   do_div(scaled_busy_load_per_task, sds-busiest-sgp-power);
 
if (sds-max_load - sds-this_load + scaled_busy_load_per_task =
(scaled_busy_load_per_task * imbn)) {
@@ -4985,20 +4993,21 @@ void fix_small_imbalance(struct lb_env *env, struct 
sd_lb_stats *sds)
pwr_now /= SCHED_POWER_SCALE;
 
/* Amount of load we'd subtract */
-   tmp = (sds-busiest_load_per_task * SCHED_POWER_SCALE) /
-   sds-busiest-sgp-power;
+   tmp = (sds-busiest_load_per_task * SCHED_POWER_SCALE);
+   do_div(tmp, sds-busiest-sgp-power);
if (sds-max_load  tmp)
pwr_move += sds-busiest-sgp-power *
min(sds-busiest_load_per_task, sds-max_load - tmp);
 
/* Amount of load we'd add */
if (sds-max_load * sds-busiest-sgp-power 
-   sds-busiest_load_per_task * SCHED_POWER_SCALE)
-   tmp = (sds-max_load * sds-busiest-sgp-power) /
-   sds-this-sgp-power;
-   else
-   tmp = (sds-busiest_load_per_task * SCHED_POWER_SCALE) /
-   sds-this-sgp-power;
+   sds-busiest_load_per_task * SCHED_POWER_SCALE) {
+   tmp = (sds-max_load * sds-busiest-sgp-power);
+   do_div(tmp, sds-this-sgp-power);
+   } else {
+   tmp = (sds-busiest_load_per_task

Re: [RFC PATCH 0/5] enable runnable load avg in load balance

2012-11-26 Thread Preeti U Murthy
Hi everyone,

On 11/27/2012 12:33 AM, Benjamin Segall wrote:
 So, I've been trying out using the runnable averages for load balance in
 a few ways, but haven't actually gotten any improvement on the
 benchmarks I've run. I'll post my patches once I have the numbers down,
 but it's generally been about half a percent to 1% worse on the tests
 I've tried.
 
 The basic idea is to use (cfs_rq->runnable_load_avg +
 cfs_rq->blocked_load_avg) (which should be equivalent to doing
 load_avg_contrib on the rq) for cfs_rqs and possibly the rq, and
 p->se.load.weight * p->se.avg.runnable_avg_sum / period for tasks.

Why should cfs_rq->blocked_load_avg be included to calculate the load
on the rq? Blocked entities do not contribute to the active load of the cpu,
right?

When a task goes to sleep, its load is removed from cfs_rq->load.weight
as well, in account_entity_dequeue(). Which means the load balancer
considers a sleeping entity as *not* contributing to the active runqueue
load. So shouldn't the new metric consider cfs_rq->runnable_load_avg alone?
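To make the distinction explicit, this is what is being compared (sketch
only, using the cfs_rq fields named above):

	/* load of the entities currently queued on the cfs_rq */
	active_load  = cfs_rq->runnable_load_avg;

	/* what is proposed above: additionally count the decayed contribution
	 * of entities that went to sleep while attached to this cfs_rq */
	tracked_load = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;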
 
 I have not yet tried including wake_affine, so this has just involved
 h_load (task_load_down and task_h_load), as that makes everything
 (besides wake_affine) be based on either the new averages or the
 rq-cpu_load averages.
 

Yeah, I have been trying to evaluate the performance as well, but with
cfs_rq->runnable_load_avg as the rq load contribution and the task load
the same as mentioned above. I have not completed my experiments, but I
would expect some significant performance difference due to the below
scenario:
 Task3(10% task)
Task1(100% task) Task4(10% task)
Task2(100% task) Task5(10% task)
---    --
CPU1  CPU2  CPU3

When cpu3 triggers load balancing:

CASE1:
 without PJT's metric the following loads will be perceived:
 CPU1 -> 2048
 CPU2 -> 3042
 Therefore CPU2 might be relieved of one task to result in:


Task1(100% task) Task4(10% task)
Task2(100% task) Task5(10% task)   Task3(10% task)
---    --
CPU1  CPU2  CPU3

CASE2:
  with PJT's metric the following loads will be perceived:
  CPU1 -> 2048
  CPU2 -> 1022
 Therefore CPU1 might be relieved of one task to result in:

 Task3(10% task)
 Task4(10% task)
Task2(100% task) Task5(10% task) Task1(100% task)
---    --
CPU1  CPU2  CPU3


The differences between the above two scenarios include:

1. Reduced latency for Task1 in CASE2, which is the right task to be moved
in the above scenario.

2. Even though in the former case CPU2 is relieved of one task, that is of no
use if Task3 is going to sleep most of the time. This might result in
more load balancing on behalf of cpu3.

What do you guys think?

Thank you

Regards
Preeti U Murthy






Re: [RFC PATCH 0/5] enable runnable load avg in load balance

2012-11-26 Thread Preeti U Murthy
Hi,
On 11/27/2012 11:44 AM, Alex Shi wrote:
 On 11/27/2012 11:08 AM, Preeti U Murthy wrote:
 Hi everyone,

 On 11/27/2012 12:33 AM, Benjamin Segall wrote:
 So, I've been trying out using the runnable averages for load balance in
 a few ways, but haven't actually gotten any improvement on the
 benchmarks I've run. I'll post my patches once I have the numbers down,
 but it's generally been about half a percent to 1% worse on the tests
 I've tried.

 The basic idea is to use (cfs_rq-runnable_load_avg +
 cfs_rq-blocked_load_avg) (which should be equivalent to doing
 load_avg_contrib on the rq) for cfs_rqs and possibly the rq, and
 p-se.load.weight * p-se.avg.runnable_avg_sum / period for tasks.

 Why should cfs_rq-blocked_load_avg be included to calculate the load
 on the rq? They do not contribute to the active load of the cpu right?

 When a task goes to sleep its load is removed from cfs_rq-load.weight
 as well in account_entity_dequeue(). Which means the load balancer
 considers a sleeping entity as *not* contributing to the active runqueue
 load.So shouldn't the new metric consider cfs_rq-runnable_load_avg alone?

 I have not yet tried including wake_affine, so this has just involved
 h_load (task_load_down and task_h_load), as that makes everything
 (besides wake_affine) be based on either the new averages or the
 rq-cpu_load averages.


 Yeah I have been trying to view the performance as well,but with
 cfs_rq-runnable_load_avg as the rq load contribution and the task load,
 same as mentioned above.I have not completed my experiments but I would
 expect some significant performance difference due to the below scenario:

  Task3(10% task)
 Task1(100% task) Task4(10% task)
 Task2(100% task) Task5(10% task)
 ---    --
 CPU1  CPU2  CPU3

 When cpu3 triggers load balancing:

 CASE1:
  without PJT's metric the following loads will be perceived
  CPU1-2048
  CPU2-3042
  Therefore CPU2 might be relieved of one task to result in:


 Task1(100% task) Task4(10% task)
 Task2(100% task) Task5(10% task)   Task3(10% task)
 ---    --
 CPU1  CPU2  CPU3

 CASE2:
   with PJT's metric the following loads will be perceived
   CPU1-2048
   CPU2-1022
  Therefore CPU1 might be relieved of one task to result in:

  Task3(10% task)
  Task4(10% task)
 Task2(100% task) Task5(10% task) Task1(100% task)
 ---    --
 CPU1  CPU2  CPU3


 The differences between the above two scenarios include:

 1.Reduced latency for Task1 in CASE2,which is the right task to be moved
 in the above scenario.

 2.Even though in the former case CPU2 is relieved of one task,its of no
 use if Task3 is going to sleep most of the time.This might result in
 more load balancing on behalf of cpu3.

 What do you guys think?
 
 It looks fine. just a question of CASE 1.
 Usually the cpu2 with 3 10% load task will show nr_running == 0, at 70%
 time. So, how you make rq-nr_running = 3 always?
 
 Guess in most chance load balance with pull task1 or task2 to cpu2 or
 cpu3. not the result of CASE 1.

That's right Alex. Most of the time the nr_running on CPU2 will be shown
to be 0, or perhaps 1 or 2. But whether you use PJT's metric or not, the load
balancer in such circumstances will behave the same, as you have rightly
pointed out: pull task1/2 to cpu2/3.

But the issue usually arises when all three wake up at the same time on
cpu2, wrongly portraying the load as 3042 if PJT's metric is not
used. This could lead to load balancing one of these short-running tasks,
as shown by CASE1. This is the situation where, in my opinion, PJT's metric
could make a difference.

Regards
Preeti U Murthy



sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer

2013-01-01 Thread Preeti U Murthy
 a performance improvement.

Please do let me know your suggestions. This will greatly help in taking the
right steps here on, in achieving the correct integration.

Thank you

Regards
Preeti U Murthy



Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer

2013-01-03 Thread Preeti U Murthy
Hi Mike,

Thank you very much for your feedback. Considering your suggestions, I have
posted a proposed solution to prevent select_idle_sibling() from becoming a
disadvantage to normal load balancing, and rather have it aid it.

**This patch is *without* the enablement of the per-entity load tracking
metric.**

This is with the intention of correcting the existing select_idle_sibling()
mess before going ahead.

--- BEGIN PATCH ---

Subject: [PATCH] sched: Merge select_idle_sibling with the behaviour of 
SD_BALANCE_WAKE

The function of select_idle_sibling() is to place the woken up task in the
vicinity of the waking cpu or on the previous cpu, depending on what
wake_affine() says, this placement being only in an idle group. If an idle
group is not found, the fallback cpu is either the waking cpu or the previous
cpu accordingly.

This results in the runqueue of the waking cpu or the previous cpu getting
overloaded when the system is committed, which is a latency hit to these tasks.

What is required is that the newly woken up tasks be placed close to the wake
up cpu or the previous cpu, whichever is best, to avoid a latency hit and cache
coldness respectively. This is achieved with wake_affine() deciding which
cache domain the task should be placed on.

Once this is decided, instead of searching for a completely idle group, let us
search for the idlest group. This will anyway return a completely idle group
if it exists, and its mechanism will fall back to what select_idle_sibling()
was doing. But if this fails, find_idlest_group() continues the search for a
relatively more idle group.

The argument could be that we wish to avoid migration of the newly woken up
task to any other group unless it is completely idle. But in this case, to
begin with we choose a sched domain within which a migration could be less
harmful. We enable the SD_BALANCE_WAKE flag on the SMT and MC domains to
co-operate with the same.

This patch is based on the tip tree without enabling the per-entity load
tracking. This is with the intention of clearing up the select_idle_sibling()
mess before introducing the metric.
---
 include/linux/topology.h |4 ++-
 kernel/sched/fair.c  |   61 +-
 2 files changed, 9 insertions(+), 56 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..eeb309e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -95,7 +95,7 @@ int arch_update_cpu_topology(void);
| 1*SD_BALANCE_NEWIDLE  \
| 1*SD_BALANCE_EXEC \
| 1*SD_BALANCE_FORK \
-   | 0*SD_BALANCE_WAKE \
+   | 1*SD_BALANCE_WAKE \
| 1*SD_WAKE_AFFINE  \
| 1*SD_SHARE_CPUPOWER   \
| 1*SD_SHARE_PKG_RESOURCES  \
@@ -127,7 +127,7 @@ int arch_update_cpu_topology(void);
| 1*SD_BALANCE_NEWIDLE  \
| 1*SD_BALANCE_EXEC \
| 1*SD_BALANCE_FORK \
-   | 0*SD_BALANCE_WAKE \
+   | 1*SD_BALANCE_WAKE \
| 1*SD_WAKE_AFFINE  \
| 0*SD_SHARE_CPUPOWER   \
| 1*SD_SHARE_PKG_RESOURCES  \
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29cdbf..c33eda7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3303,58 +3303,6 @@ find_idlest_cpu(struct sched_group *group, struct 
task_struct *p, int this_cpu)
return idlest;
 }
 
-/*
- * Try and locate an idle CPU in the sched_domain.
- */
-static int select_idle_sibling(struct task_struct *p, int target)
-{
-   int cpu = smp_processor_id();
-   int prev_cpu = task_cpu(p);
-   struct sched_domain *sd;
-   struct sched_group *sg;
-   int i;
-
-   /*
-* If the task is going to be woken-up on this cpu and if it is
-* already idle, then it is the right target.
-*/
-   if (target == cpu && idle_cpu(cpu))
-   return cpu;
-
-   /*
-* If the task is going to be woken-up on the cpu where it previously
-* ran and if it is currently idle, then it the right target.
-*/
-   if (target == prev_cpu && idle_cpu(prev_cpu))
-   return prev_cpu;
-
-   /*
-* Otherwise, iterate the domains and find an elegible idle cpu.
-*/
-   sd = rcu_dereference(per_cpu(sd_llc, target));
-   

Re: [PATCH 07/18] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

2012-12-11 Thread Preeti U Murthy
Hi Alex,
On 12/10/2012 01:52 PM, Alex Shi wrote:
 They are the base values in load balance, update them with rq runnable
 load average, then the load balance will consider runnable load avg
 naturally.
 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/core.c |4 ++--
  kernel/sched/fair.c |4 ++--
  2 files changed, 4 insertions(+), 4 deletions(-)
 
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 96fa5f1..0ecb907 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -2487,7 +2487,7 @@ static void __update_cpu_load(struct rq *this_rq, 
 unsigned long this_load,
  void update_idle_cpu_load(struct rq *this_rq)
  {
   unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
 - unsigned long load = this_rq-load.weight;
 + unsigned long load = (unsigned long)this_rq-cfs.runnable_load_avg;
   unsigned long pending_updates;
  
   /*
 @@ -2537,7 +2537,7 @@ static void update_cpu_load_active(struct rq *this_rq)
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
   this_rq-last_load_update_tick = jiffies;
 - __update_cpu_load(this_rq, this_rq-load.weight, 1);
 + __update_cpu_load(this_rq, this_rq-cfs.runnable_load_avg, 1);
  
   calc_load_account_active(this_rq);
  }
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 61c8d24..6d893a6 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -2680,7 +2680,7 @@ static void dequeue_task_fair(struct rq *rq, struct 
 task_struct *p, int flags)
  /* Used instead of source_load when we know the type == 0 */
  static unsigned long weighted_cpuload(const int cpu)
  {
  - return cpu_rq(cpu)->load.weight;
  + return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;

I was wondering why you have typecast cfs.runnable_load_avg to
unsigned long. Have you looked into why it was declared as u64 in the
first place?

  }
  
  /*
 @@ -2727,7 +2727,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
   unsigned long nr_running = ACCESS_ONCE(rq-nr_running);
  
   if (nr_running)
  - return rq->load.weight / nr_running;
  + return rq->cfs.runnable_load_avg / nr_running;

rq->cfs.runnable_load_avg is of u64 type. You will need to typecast it here
as well, right? How does this division work, given that the return type is
unsigned long?
  
   return 0;
  }
 

Regards
Preeti U Murthy



Re: [PATCH 08/18] sched: consider runnable load average in move_tasks

2012-12-11 Thread Preeti U Murthy
Hi Alex,
On 12/10/2012 01:52 PM, Alex Shi wrote:
 Except using runnable load average in background, move_tasks is also
 the key functions in load balance. We need consider the runnable load
 average in it in order to the apple to apple load comparison.
 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c |   11 ++-
  1 files changed, 10 insertions(+), 1 deletions(-)
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 6d893a6..bbb069c 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3741,6 +3741,15 @@ static unsigned long task_h_load(struct task_struct 
 *p);
  
  static const unsigned int sched_nr_migrate_break = 32;
  
 +static unsigned long task_h_load_avg(struct task_struct *p)
 +{
 + u32 period = p->se.avg.runnable_avg_period;
 + if (!period)
 + return 0;
 +
 + return task_h_load(p) * p->se.avg.runnable_avg_sum / period;

This might result in an overflow, considering you are multiplying two 32-bit
integers. Below is how this is handled in
__update_task_entity_contrib in kernel/sched/fair.c:

u32 contrib;
/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
contrib /= (se->avg.runnable_avg_period + 1);
se->avg.load_avg_contrib = scale_load(contrib);

Also, why can't p->se.load_avg_contrib be used directly as the return
value for task_h_load_avg(), since this is already updated in
update_task_entity_contrib and update_group_entity_contrib? (A sketch of one
overflow-safe variant follows after the quoted hunk below.)
 +}
 +
  /*
   * move_tasks tries to move up to imbalance weighted load from busiest to
   * this_rq, as part of a balancing operation within domain sd.
 @@ -3776,7 +3785,7 @@ static int move_tasks(struct lb_env *env)
   if (throttled_lb_pair(task_group(p), env-src_cpu, 
 env-dst_cpu))
   goto next;
  
 - load = task_h_load(p);
 + load = task_h_load_avg(p);
  
 if (sched_feat(LB_MIN) && load < 16 &&
 !env->sd->nr_balance_failed)
   goto next;
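To illustrate the overflow concern, below is a sketch of one way the
multiplication could be kept within bounds, in the spirit of the
__update_task_entity_contrib() snippet quoted above (untested, only to show
the widening; it is not a drop-in replacement for the hunk):

	static unsigned long task_h_load_avg(struct task_struct *p)
	{
		u64 load;
		u32 period = p->se.avg.runnable_avg_period;

		if (!period)
			return 0;

		/* widen before multiplying so the u32 * u32 cannot wrap */
		load = (u64)task_h_load(p) * p->se.avg.runnable_avg_sum;
		do_div(load, period);

		return (unsigned long)load;
	}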
 

Regards
Preeti U Murthy



Re: [PATCH 01/18] sched: select_task_rq_fair clean up

2012-12-11 Thread Preeti U Murthy
On 12/11/2012 05:23 PM, Alex Shi wrote:
 On 12/11/2012 02:30 PM, Preeti U Murthy wrote:
 On 12/11/2012 10:58 AM, Alex Shi wrote:
 On 12/11/2012 12:23 PM, Preeti U Murthy wrote:
 Hi Alex,

 On 12/10/2012 01:52 PM, Alex Shi wrote:
 It is impossible to miss a task allowed cpu in a eligible group.

 The one thing I am concerned with here is if there is a possibility of
 the task changing its tsk_cpus_allowed() while this code is running.

 i.e find_idlest_group() finds an idle group,then the tsk_cpus_allowed()
 for the task changes,perhaps by the user himself,which might not include
 the cpus in the idle group.After this find_idlest_cpu() is called.I mean
 a race condition in short.Then we might not have an eligible cpu in that
 group right?

 your worry make sense, but the code handle the situation, in
 select_task_rq(), it will check the cpu allowed again. if the answer is
 no, it will fallback to old cpu.

 And since find_idlest_group only return a different group which
 excludes old cpu, it's also imporissible to find a new cpu same as old
 cpu.

 I doubt this will work correctly.Consider the following situation:sched
 domain begins with sd that encloses both socket1 and socket2

 cpu0 cpu1  | cpu2 cpu3
 ---|-
  socket1   |  socket2

 old cpu = cpu1

 Iteration1:
 1.find_idlest_group() returns socket2 to be idlest.
 2.task changes tsk_allowed_cpus to 0,1
 3.find_idlest_cpu() returns cpu2

 * without your patch
1.the condition after find_idlest_cpu() returns -1,and sd-child is
 chosen which happens to be socket1
2.in the next iteration, find_idlest_group() and find_idlest_cpu()
 will probably choose cpu0 which happens to be idler than cpu1,which is
 in tsk_allowed_cpu.
 
 Thanks for question Preeti! :)
 
 Yes, with more iteration you has more possibility to get task allowed
 cpu in select_task_rq_fair. but how many opportunity the situation
 happened?  how much gain you get here?
 With LCPU increasing many many iterations cause scalability issue. that
 is the simplified forking patchset for. and that why 10% performance
 gain on hackbench process/thread.
 
 and if you insist want not to miss your chance in strf(), the current
 iteration is still not enough. How you know the idlest cpu is still
 idlest after this function finished? how to ensure the allowed cpu won't
 be changed again?
 
 A quick snapshot is enough in balancing here. we still has periodic
 balacning.

Hmm ok, let me look at this more closely.
 

 * with your patch
1.the condition after find_idlest_cpu() does not exist,therefore
 a sched domain to which cpu2 belongs to is chosen.this is socket2.(under
 the for_each_domain() loop).
2.in the next iteration, find_idlest_group() return NULL,because
 there is no cpu which intersects with tsk_allowed_cpus.
3.in select task rq,the fallback cpu is chosen even when an idle cpu
 existed.

 So my concern is though select_task_rq() checks the
 tsk_allowed_cpus(),you might end up choosing a different path of
 sched_domains compared to without this patch as shown above.

 In short without the if(new_cpu==-1) condition we might get misled
 doing unnecessary iterations over the wrong sched domains in
 select_task_rq_fair().(Think about situations when not all the cpus of
 socket2 are disallowed by the task,then there will more iterations in
 
 After read the first 4 patches, believe you will find the patchset is
 trying to reduce iterations, not increase them.

Right, sorry about not noticing this.
 
 the wrong path of sched_domains before exit,compared to what is shown
 above.)

Regards
Preeti U Murthy



[RFC v2 PATCH 0/2] sched: Integrating Per-entity-load-tracking with the core scheduler

2012-11-15 Thread Preeti U Murthy
This approach aims to retain the current behavior of the load balancer, with
the change being only in the metric consumed during load balancing,
without unnecessary introduction of new variables. This RFC has been posted to
evaluate its design: to see if this is the right way to go about introducing
the per-entity-load-tracking metric for its consumers; in this
specific case, the load balancer. Once the design has been approved of, I can
go ahead and test it.

The patch has been based on tip-master: HEAD at commit 8a1d31c703d
Subject: Merge branch 'x86/urgent'

Grateful to Peter Zijlstra and Ingo Molnar for their valuable feedback on v1
of the RFC which was the foundation for this version.

PATCH[1/2] Aims at enabling usage of Per-Entity-Load-Tracking for load balancing
PATCH[2/2] The crux of the patchset lies here.
---

Preeti U Murthy (2):
  sched: Revert temporary FAIR_GROUP_SCHED dependency for load-tracking
  sched: Use Per-Entity-Load-Tracking metric for load balancing


 include/linux/sched.h |9 +-
 kernel/sched/core.c   |   19 +---
 kernel/sched/fair.c   |   76 +
 kernel/sched/sched.h  |9 +-
 4 files changed, 43 insertions(+), 70 deletions(-)

-- 
Preeti U Murthy



[RFC v2 PATCH 1/2] sched: Revert temporary FAIR_GROUP_SCHED dependency for load-tracking

2012-11-15 Thread Preeti U Murthy
Now that we need the per-entity load tracking for load balancing,
trivially revert the patch which introduced the FAIR_GROUP_SCHED
dependence for load tracking.

Signed-off-by: Preeti U Murthypre...@linux.vnet.ibm.com
---
 include/linux/sched.h |7 +--
 kernel/sched/core.c   |7 +--
 kernel/sched/fair.c   |   12 +---
 kernel/sched/sched.h  |7 ---
 4 files changed, 3 insertions(+), 30 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 03be150..087dd20 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1169,12 +1169,7 @@ struct sched_entity {
/* rq owned by this entity/group: */
struct cfs_rq   *my_q;
 #endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP)  defined(CONFIG_FAIR_GROUP_SCHED)
+#if defined(CONFIG_SMP)
/* Per-entity load-tracking */
struct sched_avgavg;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2e077c..24d8b9b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1526,12 +1526,7 @@ static void __sched_fork(struct task_struct *p)
p-se.vruntime  = 0;
INIT_LIST_HEAD(p-se.group_node);
 
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP)  defined(CONFIG_FAIR_GROUP_SCHED)
+#if defined(CONFIG_SMP)
p-se.avg.runnable_avg_period = 0;
p-se.avg.runnable_avg_sum = 0;
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2cebc81..a9cdc8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1149,8 +1149,7 @@ static inline void update_cfs_shares(struct cfs_rq 
*cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP)  defined(CONFIG_FAIR_GROUP_SCHED)
+#if defined(CONFIG_SMP)
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
@@ -3503,12 +3502,6 @@ unlock:
 }
 
 /*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu.  However, the caller only guarantees p-pi_lock is held; no
@@ -3531,7 +3524,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se-avg.load_avg_contrib, cfs_rq-removed_load);
}
 }
-#endif
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -6416,9 +6408,7 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq= migrate_task_rq_fair,
-#endif
.rq_online  = rq_online_fair,
.rq_offline = rq_offline_fair,
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 508e77e..bfd004a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -226,12 +226,6 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
 * CFS Load tracking
 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -241,7 +235,6 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
 /* These always depend on CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_FAIR_GROUP_SCHED
u32 tg_runnable_contrib;



[RFC v2 PATCH 2/2] sched: Use Per-Entity-Load-Tracking metric for load balancing

2012-11-15 Thread Preeti U Murthy
Currently the load balancer weighs a task based upon its priority, and this
weight consequently gets added up to the weight of the run queue that it is
on. It is this weight of the runqueue that sums up to a sched group's load,
which is used to decide the busiest or the idlest group and the runqueue
thereof.

The Per-Entity-Load-Tracking metric however measures how long a task has
been runnable over the duration of its lifetime. This gives us a hint of
the amount of CPU time that the task can demand. This metric takes care of the
task priority as well. Therefore, apart from the priority of a task, we also
have an idea of the live behavior of the task. This seems to be a more
realistic metric to use to compute the task weight, which adds up to the run
queue weight and the weight of the sched group. Consequently they can be used
for load balancing.

The semantics of load balancing are left untouched. The two functions
load_balance() and select_task_rq_fair() perform the task of load
balancing. These two paths have been gone through in this patch to make the
necessary changes.

weighted_cpuload() and task_h_load() provide the run queue weight and the
weight of the task respectively. They have been modified to provide the
Per-Entity-Load-Tracking metric as relevant for each.
The rest of the modifications had to be made to suit these two changes.

Completely Fair Scheduler class is the only sched_class which contributes to
the run queue load.Therefore the rq-load.weight==cfs_rq-load.weight when
the cfs_rq is the root cfs_rq (rq-cfs) of the hierarchy.When replacing this
with Per-Entity-Load-Tracking metric,cfs_rq-runnable_load_avg needs to be
used as this is the right reflection of the run queue load when
the cfs_rq is the root cfs_rq (rq-cfs) of the hierarchy.This metric reflects
the percentage uptime of the tasks that are queued on it and hence that 
contribute
to the load.Thus cfs_rq-runnable_load_avg replaces the metric earlier used in
weighted_cpuload().

The task load is aptly captured by se.avg.load_avg_contrib which captures the
runnable time vs the alive time of the task against its priority.This metric
replaces the earlier metric used in task_h_load().

The consequent changes appear as data type changes for the helper variables;
they abound in number.Because cfs_rq-runnable_load_avg needs to be big enough
to capture the tasks' load often and accurately.

The following patch does not consider CONFIG_FAIR_GROUP_SCHED AND
CONFIG_SCHED_NUMA.This is done so as to evaluate this approach starting from the
simplest scenario.Earlier discussions can be found in the link below.
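
For reference, here is a minimal sketch (not the actual hunks below) of what
the two helpers effectively become with this change; the surrounding kernel
context (cpu_rq(), struct task_struct) is assumed:

/* Sketch only: per-cpu load seen by the load balancer after this change. */
static u64 weighted_cpuload(const int cpu)
{
	/* root cfs_rq load as tracked by the per-entity metric */
	return cpu_rq(cpu)->cfs.runnable_load_avg;
}

/* Sketch only: per-task load seen by the load balancer after this change. */
static u64 task_h_load(struct task_struct *p)
{
	/* runnable time vs alive time, already scaled by the task's weight */
	return p->se.avg.load_avg_contrib;
}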

Link: https://lkml.org/lkml/2012/10/25/162
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 include/linux/sched.h |2 +-
 kernel/sched/core.c   |   12 +
 kernel/sched/fair.c   |   64 +
 kernel/sched/sched.h  |2 +-
 4 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 087dd20..302756e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -924,7 +924,7 @@ struct sched_domain {
unsigned int lb_count[CPU_MAX_IDLE_TYPES];
unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
-   unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
+   u64 lb_imbalance[CPU_MAX_IDLE_TYPES];
unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24d8b9b..4dea057 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2415,8 +2415,8 @@ static const unsigned char
  * would be when CPU is idle and so we just decay the old load without
  * adding any new load.
  */
-static unsigned long
-decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
+static u64
+decay_load_missed(u64 load, unsigned long missed_updates, int idx)
 {
int j = 0;
 
@@ -2444,7 +2444,7 @@ decay_load_missed(unsigned long load, unsigned long 
missed_updates, int idx)
  * scheduler tick (TICK_NSEC). With tickless idle this will not be called
  * every tick. We fix it up based on jiffies.
  */
-static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
+static void __update_cpu_load(struct rq *this_rq, u64 this_load,
  unsigned long pending_updates)
 {
int i, scale;
@@ -2454,7 +2454,7 @@ static void __update_cpu_load(struct rq *this_rq, 
unsigned long this_load,
/* Update our load: */
 this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */
 for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
-   unsigned long old_load, new_load;
+   u64 old_load, new_load;
 
 /* scale is effectively 1 << i now, and >> i divides by scale */
 
@@ -2496,7 +2496,7 @@ static void __update_cpu_load(struct rq

Re: [RFC v2 PATCH 2/2] sched: Use Per-Entity-Load-Tracking metric for load balancing

2012-11-16 Thread Preeti U Murthy
Hi Vincent,
Thank you for your review.

On 11/15/2012 11:43 PM, Vincent Guittot wrote:
 Hi Preeti,
 
 On 15 November 2012 17:54, Preeti U Murthy pre...@linux.vnet.ibm.com wrote:
 [snip]
 The consequent changes appear as data type changes for the helper variables;
 they abound in number, because cfs_rq->runnable_load_avg needs to be big
 enough to capture the tasks' load often and accurately.
 
 You are now using cfs_rq->runnable_load_avg instead of
 cfs_rq->load.weight for the calculation of cpu_load, but
 cfs_rq->runnable_load_avg is smaller than or equal to the cfs_rq->load.weight
 value. This implies that the new value is smaller than or equal to the old
 statistic, so you should be able to keep the same variable width for
 the computation of cpu_load.

Right. But cfs_rq->runnable_load_avg is a 64-bit unsigned integer as per
the per-entity load tracking patchset. I could not figure out why this is
the case, although, as you mention, its value will not exceed
cfs_rq->load.weight. In order to retain the data type of
cfs_rq->runnable_load_avg as it is, these changes had to be made to suit
it. It would be good if someone could clarify why it is a 64-bit
integer; it will save a lot of trouble if we could consider this the same
width as cfs_rq->load.weight. Ben, Paul? Can you clarify this point?
 


Re: [RFC PATCH 4/5] sched: consider runnable load average in wake_affine and move_tasks

2012-11-17 Thread Preeti U Murthy
Hi Alex,

On 11/17/2012 06:34 PM, Alex Shi wrote:
 Besides using the runnable load average in the background, wake_affine and
 move_tasks are also key functions in load balance. We need to consider
 the runnable load average in them in order to have an apples-to-apples load
 comparison in load balance.
 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c |   16 ++--
  1 files changed, 10 insertions(+), 6 deletions(-)
 
 @@ -4229,7 +4233,7 @@ static int move_tasks(struct lb_env *env)
   if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
   goto next;
 
 - load = task_h_load(p);
 + load = task_h_load(p) * p->se.avg.load_avg_contrib;
Shouldn't the above be just load = p->se.avg.load_avg_contrib? This
metric has already considered p->se.load.weight; task_h_load(p) returns
the same.
 
   if (sched_feat(LB_MIN) && load < 16 && !env->failed)
   goto next;
 
Regards
Preeti U Murthy



Re: [RFC PATCH 0/5] enable runnable load avg in load balance

2012-11-17 Thread Preeti U Murthy
Hi Alex,

On 11/17/2012 06:34 PM, Alex Shi wrote:
 This patchset tries to consider the runnable load avg when doing the cpu load
 comparison in load balance.
 
 I had seen Preeti's enabling before this patch was finished, but I still think
 considering the runnable load avg on the rq may be a more natural way.
 
 BTW, I am wondering whether decaying cpu_load twice is too complicated? Once
 for runnable time, another time for CPU_LOAD_IDX. I think I missed the reason
 for the CPU_LOAD_IDX decay. Could anyone do me a favor and give some hints on
 this?

The decay happening for CPU_LOAD_IDX is *more coarse grained* than the
decay that __update_entity_runnable_avg() performs. While
__update_cpu_load() decays the rq->load.weight *for every jiffy* (~4ms)
passed so far without an update of the load,
__update_entity_runnable_avg() decays the rq->load.weight *for every
1ms* when called from update_rq_runnable_avg().

Before the introduction of PJT's series, __update_cpu_load() seems to have
been the only place where decay of the older rq load was happening (so as to
give the older load less importance in its relevance). But with the
introduction of PJT's series, since the older rq load gets decayed in
__update_entity_runnable_avg() in a more fine grained fashion, perhaps
you are right: while CPU_LOAD_IDX gets updated, we don't need to decay
the load once again here.
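
To make the two decay schemes being compared here concrete, a stand-alone
sketch follows (the fixed-point constant is an approximation of the kernel's
y with y^32 = 1/2, not its exact table):

#include <stdint.h>

#define PELT_Y_NUM  979   /* ~0.979, since 0.979^32 is roughly 1/2 */
#define PELT_Y_DEN  1000

/* PELT-style decay: multiply by y once for every elapsed 1ms period. */
static uint64_t pelt_decay(uint64_t load, unsigned int periods_1ms)
{
	while (periods_1ms--)
		load = load * PELT_Y_NUM / PELT_Y_DEN;
	return load;
}

/* cpu_load[i]-style smoothing, applied once per jiffy with scale = 2^i:
 * new = (old * (scale - 1) + sample) / scale
 */
static uint64_t cpu_load_idx_update(uint64_t old, uint64_t sample,
				    unsigned int idx)
{
	uint64_t scale = 1ULL << idx;
	return (old * (scale - 1) + sample) / scale;
}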
 
 Best Regards!
 Alex 
 
 [RFC PATCH 1/5] sched: get rq runnable load average for load balance
 [RFC PATCH 2/5] sched: update rq runnable load average in time
 [RFC PATCH 3/5] sched: using runnable load avg in cpu_load and
 [RFC PATCH 4/5] sched: consider runnable load average in wake_affine
 [RFC PATCH 5/5] sched: revert 'Introduce temporary FAIR_GROUP_SCHED
 
Regards
Preeti U Murthy



Re: [REPOST-v2] sched: Prevent wakeup to enter critical section needlessly

2012-11-19 Thread Preeti U Murthy
Hi Ivo,

On 11/19/2012 01:00 PM, Ivo Sieben wrote:
 Check that the waitqueue task list is non-empty before entering the critical
 section. This prevents locking the spin lock needlessly in case the queue
 was empty, and therefore also prevents scheduling overhead on a PREEMPT_RT
 system.
 
 Signed-off-by: Ivo Sieben meltedpiano...@gmail.com
 ---
 
  a second repost of this patch v2: can anyone respond?
  Did I apply the memory barrier correctly?
 
  v2:
  - We don't need the careful list empty, a normal list empty is sufficient:
    if you miss an update it is just as if it happened a little later.
  - Because of memory ordering problems we can observe an unupdated list
    administration. This can cause wait_event-like code to miss an event.
    Adding a memory barrier before checking the list to be empty will
    guarantee we evaluate a 100% updated list administration.
 
  kernel/sched/core.c |   19 ---
  1 file changed, 16 insertions(+), 3 deletions(-)
 
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 2d8927f..168a9b2 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -3090,9 +3090,22 @@ void __wake_up(wait_queue_head_t *q, unsigned int mode,
  {
   unsigned long flags;
  
 - spin_lock_irqsave(&q->lock, flags);
 - __wake_up_common(q, mode, nr_exclusive, 0, key);
 - spin_unlock_irqrestore(&q->lock, flags);
 + /*
 +  * We check for list emptiness outside the lock. This prevents the wake
 +  * up to enter the critical section needlessly when the task list is
 +  * empty.
 +  *
 +  * Placed a full memory barrier before checking list emptiness to make
 +  * 100% sure this function sees an up-to-date list administration.
 +  * Note that other code that manipulates the list uses a spin_lock and
 +  * therefore doesn't need additional memory barriers.
 +  */
 + smp_mb();
 + if (!list_empty(&q->task_list)) {
 + spin_lock_irqsave(&q->lock, flags);
 + __wake_up_common(q, mode, nr_exclusive, 0, key);
 + spin_unlock_irqrestore(&q->lock, flags);
 + }
  }
  EXPORT_SYMBOL(__wake_up);
  
 
Looks good to me.
Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com

Regards
Preeti U Murthy



Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-10 Thread Preeti U Murthy
Hi Morten,Alex

On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
 On Sat, Jan 05, 2013 at 08:37:34AM +, Alex Shi wrote:
 I guess searching for the cpu from bottom to top in the domain tree comes
 from commit 3dbd5342074a1e (sched: multilevel sbe sbf); the purpose is
 balancing tasks over all domain levels.
 
 This balancing costs much if there are many domains/groups in a large
 system. And forcibly spreading tasks among different domains may cause
 performance issues due to bad locality.
 
 If we remove this code, we will get quick fork/exec/wake, plus better
 balancing among the whole system, which also reduces migrations in future
 load balancing.
 
 This patch increases the performance of hackbench by 10+% on my 4-socket
 NHM and SNB machines.

 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c | 20 +---
  1 file changed, 1 insertion(+), 19 deletions(-)

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index ecfbf8e..895a3f4 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int 
 sd_flag, int wake_flags)
  goto unlock;
  }
  
 -while (sd) {
 +if (sd) {
  int load_idx = sd->forkexec_idx;
  struct sched_group *group;
 -int weight;
 -
 -if (!(sd->flags & sd_flag)) {
 -sd = sd->child;
 -continue;
 -}
  
  if (sd_flag & SD_BALANCE_WAKE)
  load_idx = sd->wake_idx;
 @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int 
 sd_flag, int wake_flags)
  goto unlock;
  
  new_cpu = find_idlest_cpu(group, p, cpu);
 -
 -/* Now try balancing at a lower domain level of new_cpu */
 -cpu = new_cpu;
 -weight = sd->span_weight;
 -sd = NULL;
 -for_each_domain(cpu, tmp) {
 -if (weight <= tmp->span_weight)
 -break;
 -if (tmp->flags & sd_flag)
 -sd = tmp;
 -}
 -/* while loop will break here if sd == NULL */
 
 I agree that this should be a major optimization. I just can't figure
 out why the existing recursive search for an idle cpu switches to the
 new cpu near the end and then starts a search for an idle cpu in the new
 cpu's domain. Is this to handle some exotic sched domain configurations?
 If so, they probably wouldn't work with your optimizations.

Let me explain my understanding of why the recursive search is the way
it is.

 _  sd0
| |
|  ___sd1__   ___sd2__|
| || ||   |
| | sgx| |  sga   |   |
| | sgy| |  sgb   |   |
| || ||   |
|_|

What the current recursive search does is (assuming we start with sd0, the
top level sched domain whose flags are rightly set): we find that sd1 is the
idlest group, and a cpu cpux1 in sgx is the idlest cpu.

We could have ideally stopped the search here. But the problem with this is
that there is a possibility that sgx is more loaded than sgy, meaning the
cpus in sgx are heavily imbalanced; say there are two cpus cpux1 and cpux2
in sgx, where cpux2 is heavily loaded and cpux1 has recently gone idle and
load balancing has not come to its rescue yet. According to the search
above, cpux1 is idle, but it is *not the right candidate for scheduling the
forked task; it is the right candidate for relieving the load from cpux2*
due to cache locality etc.

Therefore in the next recursive search we go one step inside sd1, the chosen
idlest group candidate, which also happens to be the *next level sched
domain for cpux1, the chosen idle cpu*. It then perhaps returns sgy as the
idlest, if the situation there happens to be better than what I have
described for sgx, and an appropriate cpu is chosen there.

So in short, a bird's eye view of a large sched domain to choose the cpu
would be very short sighted; we could end up creating imbalances within
lower level sched domains. To avoid this, the recursive search plays safe
and chooses the best idle group after viewing the large sched domain in
detail.

Therefore even I feel that this patch should be implemented only after
thorough tests.



 Morten

Regards
Preeti U Murthy



Re: [PATCH v3 02/22] sched: select_task_rq_fair clean up

2013-01-10 Thread Preeti U Murthy
On 01/05/2013 02:07 PM, Alex Shi wrote:
 It is impossible to miss a task-allowed cpu in an eligible group.
 
 And since find_idlest_group() only returns a different group, which
 excludes the old cpu, it's also impossible to find a new cpu that is the
 same as the old cpu.
 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c | 5 -
  1 file changed, 5 deletions(-)
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 5eea870..6d3a95d 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3378,11 +3378,6 @@ select_task_rq_fair(struct task_struct *p, int 
 sd_flag, int wake_flags)
   }
 
   new_cpu = find_idlest_cpu(group, p, cpu);
 - if (new_cpu == -1 || new_cpu == cpu) {
 - /* Now try balancing at a lower domain level of cpu */
 - sd = sd->child;
 - continue;
 - }
 
   /* Now try balancing at a lower domain level of new_cpu */
   cpu = new_cpu;
 
Reviewed-by: Preeti U Murthy



Re: [PATCH v3 03/22] sched: fix find_idlest_group mess logical

2013-01-10 Thread Preeti U Murthy
On 01/05/2013 02:07 PM, Alex Shi wrote:
 There are 4 situations in the function:
 1, no task-allowed group;
   so min_load = ULONG_MAX, this_load = 0, idlest = NULL
 2, only the local group is task-allowed;
   so min_load = ULONG_MAX, this_load assigned, idlest = NULL
 3, only a non-local group is task-allowed;
   so min_load assigned, this_load = 0, idlest != NULL
 4, the local group + another group are task-allowed.
   so min_load assigned, this_load assigned, idlest != NULL
 
 The current logic will return NULL in the first 3 kinds of scenarios,
 and still returns NULL if the idlest group is heavier than the
 local group in the 4th situation.
 
 Actually, I think the groups in situations 2 and 3 are also eligible to host
 the task. And in the 4th situation, I agree to bias toward the local group.
 Hence this patch.
 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 6d3a95d..3c7b09a 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3181,6 +3181,7 @@ find_idlest_group(struct sched_domain *sd, struct 
 task_struct *p,
 int this_cpu, int load_idx)
  {
   struct sched_group *idlest = NULL, *group = sd->groups;
 + struct sched_group *this_group = NULL;
   unsigned long min_load = ULONG_MAX, this_load = 0;
   int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
 @@ -3215,14 +3216,19 @@ find_idlest_group(struct sched_domain *sd, struct 
 task_struct *p,
 
   if (local_group) {
   this_load = avg_load;
 - } else if (avg_load < min_load) {
 + this_group = group;
 + }
 + if (avg_load < min_load) {
   min_load = avg_load;
   idlest = group;
   }
   } while (group = group->next, group != sd->groups);
 
 - if (!idlest || 100*this_load < imbalance*min_load)
 - return NULL;
 + if (this_group && idlest != this_group)
 + /* Bias toward our group again */
 + if (100*this_load < imbalance*min_load)
 + idlest = this_group;
 +
   return idlest;
  }
 
Reviewed-by: Preeti U Murthy



Re: [PATCH v3 04/22] sched: don't need go to smaller sched domain

2013-01-10 Thread Preeti U Murthy
On 01/05/2013 02:07 PM, Alex Shi wrote:
 If the parent sched domain has no task-allowed cpu to be found, neither
 will its child. So bail out to save useless checking.
 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 3c7b09a..ecfbf8e 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3378,10 +3378,8 @@ select_task_rq_fair(struct task_struct *p, int 
 sd_flag, int wake_flags)
   load_idx = sd->wake_idx;
 
   group = find_idlest_group(sd, p, cpu, load_idx);
 - if (!group) {
 - sd = sd->child;
 - continue;
 - }
 + if (!group)
 + goto unlock;
 
   new_cpu = find_idlest_cpu(group, p, cpu);
 
Reviewed-by: Preeti U Murthy



Re: [PATCH v3 07/22] sched: set initial load avg of new forked task

2013-01-10 Thread Preeti U Murthy
On 01/05/2013 02:07 PM, Alex Shi wrote:
 A new task has no runnable sum at its first runnable time, which makes
 burst forking select only a few idle cpus to put tasks on.
 Set the initial load avg of a newly forked task to its load weight to
 resolve this issue.
 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  include/linux/sched.h |  1 +
  kernel/sched/core.c   |  2 +-
  kernel/sched/fair.c   | 11 +--
  3 files changed, 11 insertions(+), 3 deletions(-)
 
 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index 206bb08..fb7aab5 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1069,6 +1069,7 @@ struct sched_domain;
  #else
  #define ENQUEUE_WAKING   0
  #endif
 +#define ENQUEUE_NEWTASK  8
 
  #define DEQUEUE_SLEEP1
 
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 66c1718..66ce1f1 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -1705,7 +1705,7 @@ void wake_up_new_task(struct task_struct *p)
  #endif
 
   rq = __task_rq_lock(p);
 - activate_task(rq, p, 0);
 + activate_task(rq, p, ENQUEUE_NEWTASK);
   p-on_rq = 1;
   trace_sched_wakeup_new(p, true);
   check_preempt_curr(rq, p, WF_FORK);
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 895a3f4..5c545e4 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -1503,8 +1503,9 @@ static inline void update_rq_runnable_avg(struct rq 
 *rq, int runnable)
  /* Add the load generated by se into cfs_rq's child load-average */
  static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 struct sched_entity *se,
 -   int wakeup)
 +   int flags)
  {
 + int wakeup = flags & ENQUEUE_WAKEUP;
   /*
 * We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
 @@ -1538,6 +1539,12 @@ static inline void enqueue_entity_load_avg(struct 
 cfs_rq *cfs_rq,
   update_entity_load_avg(se, 0);
   }
 
 + /*
 +  * Set the initial load avg of a new task the same as its load,
 +  * in order to avoid a burst of forks making a few cpus too heavy.
 +  */
 + if (flags & ENQUEUE_NEWTASK)
 + se->avg.load_avg_contrib = se->load.weight;
   cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
   /* we force update consideration on load-balancer moves */
   update_cfs_rq_blocked_load(cfs_rq, !wakeup);
 @@ -1701,7 +1708,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct 
 sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
   update_curr(cfs_rq);
 - enqueue_entity_load_avg(cfs_rq, se, flags  ENQUEUE_WAKEUP);
 + enqueue_entity_load_avg(cfs_rq, se, flags);
   account_entity_enqueue(cfs_rq, se);
   update_cfs_shares(cfs_rq);
 
I had seen in my experiments that forked tasks with an initial load of
0 would adversely affect the runqueue lengths. Since the load of these
tasks takes some time to pick up, the cpus on which the forked tasks are
scheduled could be candidates for dst_cpu many times, and the runqueue
lengths increase considerably.

This patch solves this issue by making the forked tasks contribute
actively to the runqueue load.

Reviewed-by: Preeti U Murthy



Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer

2013-01-06 Thread Preeti U Murthy
Hi Mike,
Thank you very much for your inputs. Just a few thoughts so that we are
clear about the problems so far in scheduler scalability and in what
direction we ought to move to correct them.

1. During fork or exec, the scheduler goes through find_idlest_group()
and find_idlest_cpu() in select_task_rq_fair() by iterating through all
domains. Why then was a similar approach not followed for wake up
balancing? What was so different about wake ups (except that the woken
up task had to remain close to the prev/waking cpu) that we had to
introduce select_idle_sibling() in the first place?

2. To the best of my knowledge, the concept of a buddy cpu was introduced in
select_idle_sibling() so as to avoid the entire package traversal and
restrict the search to the buddy cpus alone. But even during fork or exec,
we iterate through all the sched domains, as I have mentioned above. Why
did the buddy cpu solution not come to the rescue here as well?

3. So the real problem stands at: avoid iterating through the entire
package at the cost of less aggression in finding an idle cpu, or
iterate through the package with the intention of finding the idlest
cpu. To the best of my understanding, the former is your approach (or
commit 37407ea7), the latter is what I tried to do. But as you have rightly
pointed out, my approach will have scaling issues. In this light, how does
your best_combined patch (below) look?
Do you introduce a cut off value on the loads to decide which
approach to take?

Meanwhile I will also try to run tbench and a few other benchmarks to
find out why the results are as below. Will update you very soon on this.

Thank you

Regards
Preeti U Murthy



On 01/06/2013 10:02 PM, Mike Galbraith wrote:
 On Sat, 2013-01-05 at 09:13 +0100, Mike Galbraith wrote:
 
 I still have a 2.6-rt problem I need to find time to squabble with, but
 maybe I'll soonish see if what you did plus what I did combined works
 out on that 4x10 core box where current is _so_ unbelievably horrible.
 Heck, it can't get any worse, and the restricted wake balance alone
 kinda sorta worked.
 
 Actually, I flunked copy/paste 101.  Below (preeti) shows the real deal.
 
 tbench, 3 runs, 30 secs/run
 revert = 37407ea7 reverted
  clients                    1        5       10       20       40       80
  3.6.0.virgin           27.83   139.50  1488.76  4172.93  6983.71  8301.73
                         29.23   139.98  1500.22  4162.92  6907.16  8231.13
                         30.00   141.43  1500.09  3975.50  6847.24  7983.98
 
  3.6.0+revert          281.08  1404.76  2802.44  5019.49  7080.97  8592.80
                        282.38  1375.70  2747.23  4823.95  7052.15  8508.45
                        270.69  1375.53  2736.29  5243.05  7058.75  8806.72
 
  3.6.0+preeti           26.43   126.62  1027.23  3350.06  7004.22  7561.83
                         26.67   128.66   922.57  3341.73  7045.05  7662.18
                         25.54   129.20  1015.02  3337.60  6591.32  7634.33
 
  3.6.0+best_combined   280.48  1382.07  2730.27  4786.20  6477.28  7980.07
                        276.88  1392.50  2708.23  4741.25  6590.99  7992.11
                        278.92  1368.55  2735.49  4614.99  6573.38  7921.75
 
  3.0.51-0.7.9-default  286.44  1415.37  2794.41  5284.39  7282.57 13670.80
 
 Something is either wrong with 3.6 itself, or the config I'm using, as
 max throughput is nowhere near where it should be (see default).  On the
 bright side, integrating the two does show some promise.
 
 -Mike
 



Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

2013-01-06 Thread Preeti U Murthy
Hi everyone,

On 01/07/2013 12:01 AM, Linus Torvalds wrote:
 On Sat, Jan 5, 2013 at 11:54 PM, Alex Shi alex@intel.com wrote:

 I just looked into the aim9 benchmark. In this case it forks 2000 tasks;
 after all tasks are ready, aim9 gives a signal and all tasks burst into
 waking up and run until all are finished.
 Since each of the tasks finishes very quickly, an imbalanced empty cpu
 may go to sleep till regular balancing gives it some new tasks. That
 causes the performance drop and more idle entering.
 
 Sounds like for AIM (and possibly for other really bursty loads), we
 might want to do some load-balancing at wakeup time by *just* looking
 at the number of running tasks, rather than at the load average. Hmm?

During wake ups, the load average is not even queried, is it? wake_affine()
is called to see in the affinity of which cpu (prev/waking) the task should
go. But after that, select_idle_sibling() simply sees if there is an idle
cpu to offload the task to.

Looks like only in periodic load balancing can we correct this scenario as
of now, as pointed out below.

 
 The load average is fundamentally always going to run behind a bit,
 and while you want to use it for long-term balancing, short-term you
 might want to do just a "if we have a huge amount of runnable
 processes, do a load balancing *now*". Where "huge amount" should
 probably be relative to the long-term load balancing (ie comparing the
 number of runnable processes on this CPU right *now* with the load
 average over the last second or so would show a clear spike, and a
 reason for quick action).
 
Linus
 

Earlier I had posted a patch to address this:
https://lkml.org/lkml/2012/10/25/156
update_sd_pick_busiest() checks whether a sched group has too many running
tasks to be offloaded.

--START_PATCH-


The scenario which led to this patch is shown below:
Consider Task1 and Task2 to be long running tasks
and Tasks 3,4,5,6 to be short running tasks.

                    Task3
                    Task4
Task1               Task5
Task2               Task6
----------          ----------
SCHED_GRP1          SCHED_GRP2

The normal load calculator would qualify SCHED_GRP2 as
the candidate for sd->busiest due to the following loads
that it calculates:

SCHED_GRP1: 2048
SCHED_GRP2: 4096

The load calculator using PJT's metric would probably qualify SCHED_GRP1 as
the candidate for sd->busiest due to the following loads that it calculates:

SCHED_GRP1: 3200
SCHED_GRP2: 1156

This patch aims to strike a balance between the loads of the
group and the number of tasks running on the group to decide the
busiest group in the sched_domain.

This means we will need to use the PJT's metrics but with an
additional constraint.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 kernel/sched/fair.c |   25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e02dad4..aafa3c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -165,7 +165,8 @@ void sched_init_granularity(void)
 #else
 # define WMULT_CONST   (1UL << 32)
 #endif
-
+#define NR_THRESHOLD 2
+#define LOAD_THRESHOLD 1
 #define WMULT_SHIFT    32

 /*
@@ -4169,6 +4170,7 @@ struct sd_lb_stats {
/* Statistics of the busiest group */
unsigned int  busiest_idle_cpus;
unsigned long max_load;
+   u64 max_sg_load; /* Equivalent of max_load but calculated using pjt's 
metric*/
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
unsigned long busiest_group_capacity;
@@ -4628,8 +4630,24 @@ static bool update_sd_pick_busiest(struct lb_env *env,
   struct sched_group *sg,
   struct sg_lb_stats *sgs)
 {
-   if (sgs->avg_load <= sds->max_load)
-   return false;
+   /* Use PJT's metrics to qualify a sched_group as busy
+*
+* But a low load sched group may be queueing up many tasks
+* So before dismissing a sched group with lesser load, ensure
+* that the number of processes on it is checked if it is
+* not too less loaded than the max load so far
+*
+* But as of now as LOAD_THRESHOLD is 1, this check is a nop.
+* But we could vary LOAD_THRESHOLD suitably to bring in this check
+*/
+   if (sgs->avg_cfs_runnable_load <= sds->max_sg_load) {
+   if (sgs->avg_cfs_runnable_load > LOAD_THRESHOLD * sds->max_sg_load) {
+   if (sgs->sum_nr_running <= (NR_THRESHOLD + sds->busiest_nr_running))
+   return false;
+   } else {
+   return false;
+   }
+   }

if (sgs->sum_nr_running > sgs->group_capacity)
return true;
@@ -4708,6 +4726,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds-this_idle_cpus

Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer

2013-01-07 Thread Preeti U Murthy
On 01/07/2013 09:18 PM, Vincent Guittot wrote:
 On 2 January 2013 05:22, Preeti U Murthy pre...@linux.vnet.ibm.com wrote:
 Hi everyone,
 I have been looking at how different workloads react when the per entity
 load tracking metric is integrated into the load balancer and what are
 the possible reasons for it.

 I had posted the integration patch earlier:
 https://lkml.org/lkml/2012/11/15/391

 Essentially what I am doing is:
 1.I have disabled CONFIG_FAIR_GROUP_SCHED to make the analysis simple
 2.I have replaced cfs_rq-load.weight in weighted_cpuload() with
 cfs.runnable_load_avg,the active load tracking metric.
 3.I have replaced se.load.weight in task_h_load() with
 se.load.avg.contrib,the per entity load tracking metric.
 4.The load balancer will end up using these metrics.

 After conducting experiments on several workloads, I found that the
 performance of the workloads with the above integration would neither
 improve nor deteriorate. And this observation was consistent.

 Ideally the performance should have improved, considering that the metric
 does better tracking of load.

 Let me explain with a simple example as to why we should ideally see a
 performance improvement. Consider two 80% tasks and one 40% task.

 With integration:
 

40%
 80%40%
 cpu1  cpu2

 The above will be the scenario when the tasks fork initially.And this is
 a perfectly balanced system,hence no more load balancing.And proper
 distribution of loads on the cpu.

 Without integration
 ---

 40%   40%
 80%40% 80%40%
 cpu1   cpu2OR cpu1   cpu2

 Because the view is that all the tasks have the same load, the load
 balancer could ping pong tasks between these two situations.

 When I performed this experiment, I did not see an improvement in
 performance in the former case though. On further observation I found
 that the following was actually happening.

 With integration
 

 Initially 40% task sleeps  40% task wakes up
and select_idle_sibling()
decides to wake it up on cpu1

40%   -   -   40%
 80%40%80%40%   80%  40%
 cpu1  cpu2cpu1   cpu2  cpu1 cpu2


 This makes load balancing trigger movement of the 40% task from cpu1 back to
 cpu2. Hence the stability that the load balancer was trying to achieve is
 gone, and the culprit boils down to select_idle_sibling(). How is it the
 culprit and how is it hindering the performance of the workloads?

 *What is the way ahead with the per entity load tracking metric in the
 load balancer then?*

 In replies to a post by Paul in https://lkml.org/lkml/2012/12/6/105,
 he mentions the following:

 It is my intuition that the greatest carnage here is actually caused
 by wake-up load-balancing getting in the way of periodic in
 establishing a steady state. I suspect more mileage would result from
 reducing the interference wake-up load-balancing has with steady
 state.

 The whole point of using blocked load is so that you can converge on a
 steady state where you don't NEED to move tasks.  What disrupts this is
 we naturally prefer idle cpus on wake-up balance to reduce wake-up
 latency. I think the better answer is making these two processes load
 balancing() and select_idle_sibling() more co-operative.

 I had not realised how this would happen until I saw it happening in the
 above experiment.

 Based on what Paul explained above let us use the runnable load + the
 blocked load for calculating the load on a cfs runqueue rather than just
 the runnable load(which is what i am doing now) and see its consequence.

 Initially:   40% task sleeps

40%
 80%40%   -  80%  40%
 cpu1   cpu2 cpu1  cpu2

 So initially the load on cpu1 is, say, 80 and on cpu2 it is also
 80. Balanced. Now when the 40% task sleeps, the total load on cpu2 =
 runnable load + blocked load, which is still 80.

 As a consequence, firstly, during periodic load balancing the load is not
 moved from cpu1 to cpu2 when the 40% task sleeps (it sees the load on
 cpu2 as 80 and not as 40).
 Hence the above scenario remains the same. On wake up, what happens?

 Here comes the point of making both load balancing and wake up
 balance (select_idle_sibling) co-operative. How about we always schedule
 the woken up task on the prev_cpu? This seems more sensible considering that
 load balancing considers blocked load as being part of the load of cpu2.
 
 Hi Preeti,
 
 I'm not sure that we want such steady state at cores level because we
 take advantage of migrating wake up tasks between cores that share
 their cache as Matthew demonstrated. But I agree that reaching such
 steady state at cluster and CPU level is interesting.
 
 IMHO, you're right that taking the blocked load into consideration
 should minimize tasks migration between cluster but it should no
 prevent fast task

Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer

2013-01-08 Thread Preeti U Murthy
Hi Mike,

Thank you very much for such a clear and comprehensive explanation.
So when I put together the problem and the proposed solution pieces in the 
current
scheduler scalability,the following was what I found:

1. select_idle_sibling() is needed as an agent to correctly find the right cpu 
for wake
   up tasks to go to.Correctly would be to find an idle cpu at the lowest 
cost possible.
2.Cost could be lowered either by optimizing the order of searching for an 
idle cpu or
   restricting the search to a few cpus alone.
3. The former has the problem that it would not prevent bouncing tasks all over 
the domain
   sharing an L3 cache,which could potentially affect the fast moving tasks.
4. The latter has the problem that it is not aggressive enough in finding an 
idle cpu.

This is some tangled problem,but I think the solution at best could be smoothed 
to a a flowchart.

   STEP1: See if the idle buddy is free at all sched domains.
          If yes, return the buddy cpu.
   STEP2: If not, decide whether we search the sched domains for an idle cpu
          at all. If no (the system is saturated), return prev_cpu.
   STEP3: If yes, do an optimized search and return the target cpu.

I don't know how well we can do STEP2. That seems to be the biggest
challenge. STEP1 is the reverted commit 37407ea7, "sched: Improve
scalability via 'CPU buddies', which withstand random perturbations", whose
reasons for revert can hopefully be overcome by this flowchart.

I have tried to tackle STEP3. STEP3 will not prevent bouncing, but a good
STEP2 could tell us whether it is worth the bounce.
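
As a rough pseudo-C rendering of the flowchart above (idle_buddy_of(),
worth_searching_domains() and optimized_idle_cpu_search() are hypothetical
placeholders for STEP1, STEP2 and STEP3 respectively, not existing kernel
functions):

/* Hypothetical placeholders for the three steps of the flowchart. */
extern int idle_buddy_of(int cpu);                                    /* STEP1 */
extern bool worth_searching_domains(int cpu);                         /* STEP2 */
extern int optimized_idle_cpu_search(struct task_struct *p, int cpu); /* STEP3 */

static int select_wakeup_cpu_sketch(struct task_struct *p, int prev_cpu)
{
	int buddy = idle_buddy_of(prev_cpu);

	if (buddy >= 0 && idle_cpu(buddy))
		return buddy;                          /* STEP1: the buddy is free */

	if (!worth_searching_domains(prev_cpu))
		return prev_cpu;                       /* STEP2: saturated system  */

	return optimized_idle_cpu_search(p, prev_cpu); /* STEP3: optimized search */
}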

STEP3 Patch is given below:

***START PATCH**

sched: Reduce the overhead of select_idle_sibling

From: Preeti U Murthy pre...@linux.vnet.ibm.com

The traversal of the sched domains to find the idlest cpu is currently a
costly operation in select_idle_sibling() due to large L3 caches.
So the traversal had better be worth the effort.

In the current approach, in select_idle_sibling(), the sched domains are
searched top to bottom, starting at the last level cache domain, and the
search is for the idlest *group*. It is less likely that you will find a
completely idle group at a higher level sched domain compared to a lower
level sched domain. This will make the first few iterations fruitless most
of the time. And it is less likely to find an idle *group* than an idle
*cpu*.

This patch aims at finding an idle cpu. And since the iterations are from
bottom to top, stopping at the last level cache domain, it tries to see if
the task can be scheduled on a core which shares the L2 cache with the
waking cpu/prev cpu first, before probing the cpus at the higher level
domain, trying to make the cost of migration lower than in the current
approach.

Also it tries to return an idle cpu as soon as it finds one, without querying
the status of the other cpus in the sched group. This could potentially be
much faster, unless a worst case such as not finding an idle cpu at every
domain occurs.
---
 kernel/sched/fair.c |   33 +
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29cdbf..6123bca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3310,7 +3310,7 @@ static int select_idle_sibling(struct task_struct *p, int 
target)
 {
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
-   struct sched_domain *sd;
+   struct sched_domain *sd, *tmp;
struct sched_group *sg;
int i;
 
@@ -3332,24 +3332,33 @@ static int select_idle_sibling(struct task_struct *p, 
int target)
 * Otherwise, iterate the domains and find an elegible idle cpu.
 */
sd = rcu_dereference(per_cpu(sd_llc, target));
-   for_each_lower_domain(sd) {
-   sg = sd->groups;
+   /*
+* Iterate the domains bottom up,to cash in on the shared lower level
+* cache advantages.
+* Since this iteration is costly,return any idle cpu and dont wait
+* for a completely idle group.
+*/
+   for_each_domain(target, tmp) {
+   sg = tmp->groups;
do {
if (!cpumask_intersects(sched_group_cpus(sg),
tsk_cpus_allowed(p)))
goto next;
-
for_each_cpu(i, sched_group_cpus(sg)) {
-   if (!idle_cpu(i))
-   goto next;
+   if (idle_cpu(i)) {
+   target = i

Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer

2013-01-08 Thread Preeti U Murthy
 Here comes the point of making both load balancing and wake up
 balance(select_idle_sibling) co operative. How about we always schedule
 the woken up task on the prev_cpu? This seems more sensible considering
 load balancing considers blocked load as being a part of the load of cpu2.

 Hi Preeti,

 I'm not sure that we want such steady state at cores level because we
 take advantage of migrating wake up tasks between cores that share
 their cache as Matthew demonstrated. But I agree that reaching such
 steady state at cluster and CPU level is interesting.

 IMHO, you're right that taking the blocked load into consideration
 should minimize tasks migration between cluster but it should no
 prevent fast task migration between cores that share their cache

 True, Vincent. But I think the one disadvantage, even at cpu or cluster
 level, is that when we consider blocked load, we might prevent any more
 tasks from being scheduled on that cpu during periodic load balance if
 the blocked load is too much. This results in very poor cpu utilization.
 
 The blocked load of a cluster will be high if the blocked tasks have
 run recently. The contribution of a blocked task will be divided by 2
 each 32ms, so it means that a high blocked load will be made of recent
 running tasks and the long sleeping tasks will not influence the load
 balancing.
 The load balance period is between 1 tick (10ms for idle load balance
 on ARM) and up to 256 ms (for busy load balance) so a high blocked
 load should imply some tasks that have run recently otherwise your
 blocked load will be small and will not have a large influence on your
 load balance

Makes a lot of sense.

 Also we can consider steady states if the waking tasks have a specific
 waking pattern.I am not sure if we can risk hoping that the blocked task
 would wake up soon or would wake up at time 'x' and utilize that cpu.
 
 Ok, so you don't consider to use blocked load in load balancing any more ?

Hmm, this has got me thinking. I thought the existing
select_idle_sibling() problem of bouncing tasks all over the L3 package
and taking time to find an idle buddy could be solved in isolation with
PJT's metric. But that does not seem to be the case, considering the
suggestions from you and Mike.

Currently there are so many approaches proposed to improve the scheduler
that it is confusing as to how and which pieces fit well. Let me lay them
down. Please do help me put them together.

Jigsaw Piece 1: Use PJT's metric in load balancing, and blocked
load + runnable load as part of the cpu load while load balancing.

Jigsaw Piece 2: select_idle_sibling() choosing the cpu to wake up tasks on.

Jigsaw Piece 3: The 'cpu buddy' concept to prevent bouncing of tasks.

Considering both your and Mike's suggestions, what do you guys think of
the following puzzle and solution?

*Puzzle*: Waking up tasks should not take too much time to find a cpu to
run on, should not keep bouncing over too many cpus all over the
package, and should try as much as possible not to create too much of an
imbalance in the load distribution of the cpus.

*Solution:*

Place Jigsaw Piece 1 first: Use PJT's metric and blocked load + runnable
load as part of the cpu load while load balancing.
(As time passes, the blocked load becomes less significant on that
cpu, hence load balancing will go on as usual.)

Place Jigsaw Piece 2 next: When tasks wake up, **use
select_idle_sibling() to see only if you can migrate tasks between cores
that share their cache**,
IOW see if a cpu in the lowest level sched domain is idle. If it is,
then schedule on it and migrate_task_rq_fair() will remove the load from
the prev_cpu; if none is idle, then return prev_cpu, which had already
considered the blocked load as part of its overall load. Hence very
little imbalance will be created.
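
A minimal sketch of what Jigsaw Piece 2 could look like (illustrative only;
wakeup_cpu_sketch() is hypothetical, the caller is assumed to hold
rcu_read_lock(), and only the lowest level sched domain of prev_cpu is
probed):

static int wakeup_cpu_sketch(struct task_struct *p, int prev_cpu)
{
	/* lowest level sched domain of prev_cpu, i.e. cores sharing a cache */
	struct sched_domain *sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
	int i;

	if (sd) {
		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
			if (idle_cpu(i))
				return i;  /* cheap migration within the shared cache */
		}
	}

	/* prev_cpu's load already accounts for the blocked load of this task */
	return prev_cpu;
}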


*Possible End Picture*

Waking up tasks will not take time to find a cpu since we are probing
the cpus at only one sched domain level. The bouncing of tasks will be
restricted to the core level. An imbalance will not be created, as the
blocked load is also considered while load balancing.

*Concerns*

1. Is the wake up load balancing in this solution so much less aggressive
that it harms throughput significantly?
2. Do we need Jigsaw Piece 3 at all?

Please do let me know what you all think. Thank you very much for your
suggestions.

 regards,
 Vincent

Regards
Preeti U Murthy





Re: [PATCH] sched: Get rid of unnecessary checks from select_idle_sibling

2013-01-09 Thread Preeti U Murthy
On 01/09/2013 12:20 PM, Namhyung Kim wrote:
 From: Namhyung Kim namhyung@lge.com
 
 AFAICS @target cpu of select_idle_sibling() is always either prev_cpu
 or this_cpu.  So no need to check it again and the conditionals can be
 consolidated.
 
 Cc: Mike Galbraith efa...@gmx.de
 Cc: Preeti U Murthy pre...@linux.vnet.ibm.com
 Cc: Vincent Guittot vincent.guit...@linaro.org
 Cc: Alex Shi alex@intel.com
 Signed-off-by: Namhyung Kim namhy...@kernel.org
 ---
  kernel/sched/fair.c | 17 -
  1 file changed, 4 insertions(+), 13 deletions(-)
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 5eea8707234a..af665814c216 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3254,25 +3254,16 @@ find_idlest_cpu(struct sched_group *group, struct 
 task_struct *p, int this_cpu)
   */
  static int select_idle_sibling(struct task_struct *p, int target)
  {
 - int cpu = smp_processor_id();
 - int prev_cpu = task_cpu(p);
   struct sched_domain *sd;
   struct sched_group *sg;
   int i;
 
   /*
 -  * If the task is going to be woken-up on this cpu and if it is
 -  * already idle, then it is the right target.
 -  */
 - if (target == cpu  idle_cpu(cpu))
 - return cpu;
 -
 - /*
 -  * If the task is going to be woken-up on the cpu where it previously
 -  * ran and if it is currently idle, then it the right target.
 +  * If the task is going to be woken-up on this cpu or the cpu where it
 +  * previously ran and it is already idle, then it is the right target.
*/
 - if (target == prev_cpu  idle_cpu(prev_cpu))
 - return prev_cpu;
 + if (idle_cpu(target))
 + return target;
 
   /*
* Otherwise, iterate the domains and find an elegible idle cpu.
 
If NUMA_TTWU_BIAS or NUMA_TTWU_TO is true (they are false by
default), cpu/prev_cpu can be changed to a random node_cpu (a cpu on the node
that 'this_cpu' is on). In that case, even if the node cpu is idle, it
would not be a viable target, it looks like. Maybe that is why
select_idle_sibling() checks whether the target is prev_cpu/this cpu.

Regards
Preeti U Murthy



Re: [PATCH] sched: Get rid of unnecessary checks from select_idle_sibling

2013-01-10 Thread Preeti U Murthy
On 01/10/2013 11:19 AM, Namhyung Kim wrote:
 Hi Preeti,
 
 On Wed, 09 Jan 2013 13:51:00 +0530, Preeti U. Murthy wrote:
 On 01/09/2013 12:20 PM, Namhyung Kim wrote:
 From: Namhyung Kim namhyung@lge.com

 AFAICS @target cpu of select_idle_sibling() is always either prev_cpu
 or this_cpu.  So no need to check it again and the conditionals can be
 consolidated.
 [snip]
 If NUMA_TTWU_BIAS or NUMA_TTWU_TO is true(it is false by
 
 I can't find those bits in the code.  I've checked v3.8-rc2,
 next-20130110, tip/master and tip/numa/core but there's nothing like
 above.  Which tree are you saying?
 
 
 default),cpu/prev_cpu can be changed to be a random node_cpu(the node
 that 'this_cpu' is on). In which case even if the node cpu is idle,it
 would not be a viable target,looks like.Maybe that is why
 select_idle_sibling() makes the check if the target is prev_cpu/this cpu.
 
 Looking into tip/numa/core, I can see that there's a code added for
 CONFIG_NUMA_BALANCING.  But still, it seems nothing changed on a path
 from select_task_rq_fair() to select_idle_sibling() - i.e. if the
 select_idle_sibling called, the target would be either prev_cpu or this
 cpu.  Am I missing something?

Hi Namhyung,
Sorry about this. I did a git pull on tip/master. You are right, this
piece of code is not there any more. So not an issue.

You can go ahead and add my Reviewed-by. I don't find any potential
problems now.

For your information however, the code piece which was earlier there is
given below; it aimed to pull the tasks towards the node of the waking
cpu if NUMA_TTWU_BIAS is enabled.

   if (sched_feat_numa(NUMA_TTWU_BIAS)  node != -1) {
   /*
* For fork,exec find the idlest cpu in the home-node.
*/
   if (sd_flag  (SD_BALANCE_FORK|SD_BALANCE_EXEC)) {
   int node_cpu = numa_select_node_cpu(p, node);
   if (node_cpu < 0)
   goto find_sd;

   new_cpu = cpu = node_cpu;
   sd = per_cpu(sd_node, cpu);
   goto pick_idlest;
   }

   /*
* For wake, pretend we were running in the home-node.
*/
   if (cpu_to_node(prev_cpu) != node) {
   int node_cpu = numa_select_node_cpu(p, node);
   if (node_cpu < 0)
   goto find_sd;

   if (sched_feat_numa(NUMA_TTWU_TO))
   cpu = node_cpu;
   else
   prev_cpu = node_cpu;
   }
   }

 
 Thanks,
 Namhyung
 
I apologise once again for the wrong feedback.

Thank you

Regards
Preeti U Murthy



Re: [PATCH v2 2/3] sched: factor out code to should_we_balance()

2013-08-04 Thread Preeti U Murthy
On 08/02/2013 04:02 PM, Peter Zijlstra wrote:
 On Fri, Aug 02, 2013 at 02:56:14PM +0530, Preeti U Murthy wrote:
 You need to iterate over all the groups of the sched domain env->sd and
 not just the first group of env->sd like you are doing above. This is to
 
 I don't think so.
 IIRC, env->sd->groups always means the local group,
 so we don't need to find our group by iterating over all the groups.
 
 Take a look at update_sd_lb_stats(). That should clarify this. There is
 an exclusive local_group check there.
 
 sd->groups points to the first group in the list of groups under this sd.
 
 Take a look at: 88b8dac0a
 
Ah ok! Thanks for this pointer.

Apologies for having overlooked the fact that sd->groups always
points to the group to which the balance_cpu belongs. And subsequent
dst_cpus for retries of load balancing also belong to the same group as
the balance_cpu.

This patch thus looks fine to me.

Regards
Preeti U Murthy



Re: [PATCH v2 3/3] sched: clean-up struct sd_lb_stat

2013-08-06 Thread Preeti U Murthy
Hi Joonsoo,

On 08/05/2013 01:02 PM, Joonsoo Kim wrote:
 +   if (busiest->group_imb) {
 +   busiest->sum_weighted_load =
 +   min(busiest->sum_weighted_load, sds->sd_avg_load);

 Right here we get confused as to why the total load is being compared
 against load per task (although you are changing it to load per task above).
 
 Yes, you are right. I will add load_per_task to struct sg_lb_stats.
 
 
 @@ -4771,12 +4763,13 @@ static inline void calculate_imbalance(struct 
 lb_env *env, struct sd_lb_stats *s
  * Be careful of negative numbers as they'll appear as very large values
  * with unsigned longs.
  */
 -   max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
 +   max_pull = min(busiest->avg_load - sds->sd_avg_load,
 +   load_above_capacity);

 This is ok, but readability wise max_load is much better. max_load
 signifies the maximum load per task on a group in this sd, and avg_load
 signifies the total load per task across the sd. You are checking if
 there is imbalance in the total load in the sd and try to even it out
 across the sd.  Here busiest does not convey this immediately.
 
 You may be confused. max_load doesn't mean the maximum load per task.
 It means the busiest group's load per cpu power. And here max doesn't
 mean the maximum load in this sd, but rather the load of the busiest sg.
 IMO, busiest->avg_load conveys the proper meaning.

Yes sorry about that. It is load per cpu power. My point was that 
max_load helped in readability. Look at the these two lines in your patch.

-   max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
+   max_pull = min(busiest->avg_load - sds->sd_avg_load,
 +   load_above_capacity);

The deleted line says that some part of the sd has a load/cpu (max_load)
above the average load of the sd, and that this is what we are trying to
even out. Something like the below diagram for the sd load.

  max_load
_/\ avg_load.

I believe the line you have added makes it hard to understand this.

Anyway this is a minor issue, you can ignore it.
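
For completeness, here is the calculation in isolation with made-up numbers;
this is not kernel code, it just restates the min() quoted above.

#include <stdio.h>

/* min(busiest->avg_load - sds->sd_avg_load, load_above_capacity),
 * assuming the busiest group is above the sd-wide average. */
int main(void)
{
	unsigned long busiest_avg_load    = 1800;	/* load per unit of cpu power */
	unsigned long sd_avg_load         = 1200;
	unsigned long load_above_capacity =  400;

	unsigned long excess   = busiest_avg_load - sd_avg_load;	/* 600 */
	unsigned long max_pull = excess < load_above_capacity ?
				 excess : load_above_capacity;		/* 400 */

	printf("max_pull = %lu\n", max_pull);
	return 0;
}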

Regards
Preeti U Murthy



[RFC PATCH 5/5] cpuidle/ppc: Add longnap state to the idle states on powernv

2013-07-25 Thread Preeti U Murthy
This patch hooks into the existing broadcast framework with the support that 
this
patchset introduces for ppc, and the cpuidle driver backend
for powernv(posted out recently by Deepthi Dharwar) to add sleep state as
one of the deep idle states, in which the decrementer is switched off.

However in this patch, we only emulate sleep by going into a state which does
a nap with the decrementer interrupts disabled, termed as longnap. This enables
focus on the timer broadcast framework for ppc in this series of patches,
which is required as a first step to enable sleep on ppc.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---

 arch/powerpc/platforms/powernv/processor_idle.c |   48 +++
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/processor_idle.c 
b/arch/powerpc/platforms/powernv/processor_idle.c
index f43ad91a..9aca502 100644
--- a/arch/powerpc/platforms/powernv/processor_idle.c
+++ b/arch/powerpc/platforms/powernv/processor_idle.c
@@ -9,16 +9,18 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
+#include <linux/clockchips.h>
 
 #include <asm/machdep.h>
 #include <asm/runlatch.h>
+#include <asm/time.h>
 
 struct cpuidle_driver powernv_idle_driver = {
.name = "powernv_idle",
.owner = THIS_MODULE,
 };
 
-#define MAX_IDLE_STATE_COUNT   2
+#define MAX_IDLE_STATE_COUNT   3
 
 static int max_idle_state = MAX_IDLE_STATE_COUNT - 1;
 static struct cpuidle_device __percpu *powernv_cpuidle_devices;
@@ -54,6 +56,43 @@ static int nap_loop(struct cpuidle_device *dev,
return index;
 }
 
+/* Emulate sleep, with long nap.
+ * During sleep, the core does not receive decrementer interrupts.
+ * Emulate sleep using long nap with decrementer interrupts disabled.
+ * This is an initial prototype to test the timer offload framework for ppc.
+ * We will eventually introduce the sleep state once the timer offload 
framework
+ * for ppc is stable.
+ */
+static int longnap_loop(struct cpuidle_device *dev,
+   struct cpuidle_driver *drv,
+   int index)
+{
+   int cpu = dev->cpu;
+
+   unsigned long lpcr = mfspr(SPRN_LPCR);
+
+   lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
+
+   /* exit powersave upon external interrupt, but not decrementer
+* interrupt, Emulate sleep.
+*/
+   lpcr |= LPCR_PECE0;
+
+   if (cpu != bc_cpu) {
+   mtspr(SPRN_LPCR, lpcr);
+   clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+   power7_nap();
+   clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+   } else {
+   /* Wakeup on a decrementer interrupt, Do a nap */
+   lpcr |= LPCR_PECE1;
+   mtspr(SPRN_LPCR, lpcr);
+   power7_nap();
+   }
+
+   return index;
+}
+
 /*
  * States for dedicated partition case.
  */
@@ -72,6 +111,13 @@ static struct cpuidle_state 
powernv_states[MAX_IDLE_STATE_COUNT] = {
.exit_latency = 10,
.target_residency = 100,
.enter = nap_loop },
+{ /* LongNap */
+   .name = "LongNap",
+   .desc = "LongNap",
+   .flags = CPUIDLE_FLAG_TIME_VALID,
+   .exit_latency = 10,
+   .target_residency = 100,
+   .enter = longnap_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,



[RFC PATCH 0/5] cpuidle/ppc: Timer offload framework to support deep idle states

2013-07-25 Thread Preeti U Murthy
On PowerPC, when CPUs enter deep idle states, their local timers are
switched off. The responsibility of waking them up at their next timer event,
needs to be handed over to an external device. On PowerPC, we do not have an
external device equivalent to the HPET, which is what architectures like x86
currently use. Instead we assign the local timer of one of the CPUs to do this job.

This patchset is an attempt to make use of the existing timer broadcast
framework in the kernel to meet the above requirement, except that the tick
broadcast device is the local timer of the boot CPU.

This patch series is ported ontop of 3.11-rc1 + the cpuidle driver backend
for powernv posted by Deepthi Dharwar recently. The current design and
implementation supports the ONESHOT tick mode. It does not yet support
the PERIODIC tick mode. This patch is tested with NOHZ_FULL off.
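
As a rough, compile-and-run illustration of the flow this series sets up (the
stub functions below are stand-ins I made up for the real calls in Patch[3/5]
and Patch[5/5]; only the ordering of the steps is the point here):

#include <stdio.h>

static int bc_cpu = 0;	/* the boot CPU hosts the broadcast device */

/* Stand-in for clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_{ENTER,EXIT}, &cpu) */
static void broadcast_notify_stub(const char *what, int cpu)
{
	printf("cpu%d: %s\n", cpu, what);
}

/* Stand-in for power7_nap() with decrementer interrupts disabled */
static void deep_nap_stub(int cpu)
{
	printf("cpu%d: nap, decrementer off\n", cpu);
}

static void deep_idle_entry(int cpu)
{
	if (cpu != bc_cpu) {
		broadcast_notify_stub("BROADCAST_ENTER", cpu);	/* hand over next event */
		deep_nap_stub(cpu);
		broadcast_notify_stub("BROADCAST_EXIT", cpu);	/* take it back on wakeup */
	} else {
		/* The broadcast CPU keeps its decrementer and never goes this deep */
		printf("cpu%d: shallow nap only\n", cpu);
	}
}

int main(void)
{
	deep_idle_entry(2);
	deep_idle_entry(0);
	return 0;
}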

Patch[1/5], Patch[2/5]: optimize the broadcast mechanism on ppc.
Patch[3/5]: Introduces the core of the timer offload framework on powerpc.
Patch[4/5]: The cpu doing the broadcast should not go into tickless idle.
Patch[5/5]: Add a deep idle state to the cpuidle state table on powernv.

Patch[5/5] is the patch that ultimately makes use of the timer offload
framework that the patches Patch[1/5] to Patch[4/5] build.

---

Preeti U Murthy (3):
  cpuidle/ppc: Add timer offload framework to support deep idle states
  cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
  cpuidle/ppc: Add longnap state to the idle states on powernv

Srivatsa S. Bhat (2):
  powerpc: Free up the IPI message slot of ipi call function 
(PPC_MSG_CALL_FUNC)
  powerpc: Implement broadcast timer interrupt as an IPI message


 arch/powerpc/include/asm/smp.h  |3 +
 arch/powerpc/include/asm/time.h |3 +
 arch/powerpc/kernel/smp.c   |   23 --
 arch/powerpc/kernel/time.c  |   84 +++
 arch/powerpc/platforms/cell/interrupt.c |2 -
 arch/powerpc/platforms/powernv/Kconfig  |1 
 arch/powerpc/platforms/powernv/processor_idle.c |   48 +
 arch/powerpc/platforms/ps3/smp.c|2 -
 kernel/time/tick-sched.c|7 ++
 9 files changed, 161 insertions(+), 12 deletions(-)

-- 
Signature



[RFC PATCH 1/5] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)

2013-07-25 Thread Preeti U Murthy
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com

The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE
map to a common implementation - generic_smp_call_function_single_interrupt().
So, we can consolidate them and save one of the IPI message slots, (which are
precious, since only 4 of those slots are available).

So, implement the functionality of PPC_MSG_CALL_FUNC using
PPC_MSG_CALL_FUNC_SINGLE itself and release its IPI message slot, so that it
can be used for something else in the future, if desired.

Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---

 arch/powerpc/include/asm/smp.h  |2 +-
 arch/powerpc/kernel/smp.c   |   12 +---
 arch/powerpc/platforms/cell/interrupt.c |2 +-
 arch/powerpc/platforms/ps3/smp.c|2 +-
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index ffbaabe..51bf017 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu);
  *
  * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
  * in /proc/interrupts will be wrong!!! --Troy */
-#define PPC_MSG_CALL_FUNCTION   0
+#define PPC_MSG_UNUSED 0
 #define PPC_MSG_RESCHEDULE  1
 #define PPC_MSG_CALL_FUNC_SINGLE   2
 #define PPC_MSG_DEBUGGER_BREAK  3
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 38b0ba6..bc41e9f 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -111,9 +111,9 @@ int smp_generic_kick_cpu(int nr)
 }
 #endif /* CONFIG_PPC64 */
 
-static irqreturn_t call_function_action(int irq, void *data)
+static irqreturn_t unused_action(int irq, void *data)
 {
-   generic_smp_call_function_interrupt();
+   /* This slot is unused and hence available for use, if needed */
return IRQ_HANDLED;
 }
 
@@ -144,14 +144,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 }
 
 static irq_handler_t smp_ipi_action[] = {
-   [PPC_MSG_CALL_FUNCTION] =  call_function_action,
+   [PPC_MSG_UNUSED] =  unused_action, /* Slot available for future use */
[PPC_MSG_RESCHEDULE] = reschedule_action,
[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
-   [PPC_MSG_CALL_FUNCTION] =  "ipi call function",
+   [PPC_MSG_UNUSED] =  "ipi unused",
[PPC_MSG_RESCHEDULE] = "ipi reschedule",
[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
@@ -221,8 +221,6 @@ irqreturn_t smp_ipi_demux(void)
all = xchg(&info->messages, 0);
 
 #ifdef __BIG_ENDIAN
-   if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNCTION)))
-   generic_smp_call_function_interrupt();
if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
@@ -265,7 +263,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask 
*mask)
unsigned int cpu;
 
for_each_cpu(cpu, mask)
-   do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
+   do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
 }
 
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
diff --git a/arch/powerpc/platforms/cell/interrupt.c 
b/arch/powerpc/platforms/cell/interrupt.c
index 2d42f3b..28166e4 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -213,7 +213,7 @@ static void iic_request_ipi(int msg)
 
 void iic_request_IPIs(void)
 {
-   iic_request_ipi(PPC_MSG_CALL_FUNCTION);
+   iic_request_ipi(PPC_MSG_UNUSED);
iic_request_ipi(PPC_MSG_RESCHEDULE);
iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 4b35166..488f069 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void)
* to index needs to be setup.
*/
 
-   BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION!= 0);
+   BUILD_BUG_ON(PPC_MSG_UNUSED   != 0);
BUILD_BUG_ON(PPC_MSG_RESCHEDULE   != 1);
BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);



[RFC PATCH 3/5] cpuidle/ppc: Add timer offload framework to support deep idle states

2013-07-25 Thread Preeti U Murthy
On ppc, in deep idle states, the lapic of the cpus gets switched off.
Hence make use of the broadcast framework to wakeup cpus in sleep state,
except that on ppc, we do not have an external device such as HPET, but
we use the lapic of a cpu itself as the broadcast device.

Instantiate two different clock event devices, one representing the
lapic and another representing the broadcast device for each cpu.
Such a cpu is forbidden to enter the deep idle state. The cpu which hosts
the broadcast device will be referred to as the broadcast cpu in the
changelogs of this patchset for convenience.

For now, only the boot cpu's broadcast device gets registered as a clock event
device along with the lapic. Hence this is the broadcast cpu.

On the broadcast cpu, on each timer interrupt, apart from the regular lapic 
event
handler the broadcast handler is also called. We avoid the overhead of
programming the lapic for a broadcast event specifically. The reason is
to prevent multiple cpus from sending IPIs to program the lapic of the broadcast
cpu for their next local event each time they go to deep idle state.

Apart from this there is no change in the way broadcast is handled today. On
a broadcast ipi the event handler for a timer interrupt is called on the cpu
in deep idle state to handle the local events.

The current design and implementation of the timer offload framework supports
the ONESHOT tick mode but not the PERIODIC mode.

Signed-off-by: Preeti U. Murthy pre...@linux.vnet.ibm.com
---

 arch/powerpc/include/asm/time.h|3 +
 arch/powerpc/kernel/smp.c  |4 +-
 arch/powerpc/kernel/time.c |   79 
 arch/powerpc/platforms/powernv/Kconfig |1 
 4 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..936be0d 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -24,14 +24,17 @@ extern unsigned long tb_ticks_per_jiffy;
 extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
 extern struct clock_event_device decrementer_clockevent;
+extern struct clock_event_device broadcast_clockevent;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void decrementer_timer_interrupt(void);
 
 extern void generic_calibrate_decr(void);
 
 extern void set_dec_cpu6(unsigned int val);
+extern int bc_cpu;
 
 /* Some sane defaults: 125 MHz timebase, 1GHz processor */
 extern unsigned long ppc_proc_freq;
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6a68ca4..d3b7014 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -114,7 +114,7 @@ int smp_generic_kick_cpu(int nr)
 
 static irqreturn_t timer_action(int irq, void *data)
 {
-   timer_interrupt();
+   decrementer_timer_interrupt();
return IRQ_HANDLED;
 }
 
@@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)
 
 #ifdef __BIG_ENDIAN
if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
-   timer_interrupt();
+   decrementer_timer_interrupt();
if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 65ab9e9..8ed0fb3 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -42,6 +42,7 @@
 #include <linux/timex.h>
 #include <linux/kernel_stat.h>
 #include <linux/time.h>
+#include <linux/timer.h>
 #include <linux/init.h>
 #include <linux/profile.h>
 #include <linux/cpu.h>
@@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {
 
 static int decrementer_set_next_event(unsigned long evt,
  struct clock_event_device *dev);
+static int broadcast_set_next_event(unsigned long evt,
+ struct clock_event_device *dev);
 static void decrementer_set_mode(enum clock_event_mode mode,
 struct clock_event_device *dev);
+static void decrementer_timer_broadcast(const struct cpumask *mask);
 
 struct clock_event_device decrementer_clockevent = {
.name   = "decrementer",
@@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
.irq= 0,
.set_next_event = decrementer_set_next_event,
.set_mode   = decrementer_set_mode,
-   .features   = CLOCK_EVT_FEAT_ONESHOT,
+   .broadcast  = decrementer_timer_broadcast,
+   .features   = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT,
 };
 EXPORT_SYMBOL(decrementer_clockevent);
 
+struct clock_event_device broadcast_clockevent = {
+   .name   = "broadcast",
+   .rating = 200,
+   .irq= 0,
+   .set_next_event = broadcast_set_next_event

[RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

2013-07-25 Thread Preeti U Murthy
In the current design of timer offload framework, the broadcast cpu should
*not* go into tickless idle so as to avoid missed wakeups on CPUs in deep idle 
states.

Since we prevent the CPUs entering deep idle states from programming the lapic 
of the
broadcast cpu for their respective next local events for reasons mentioned in
PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up during
each of its timer interrupt programmed to its local events.

With tickless idle, the broadcast CPU might not get a timer interrupt till after
many ticks which can result in missed wakeups on CPUs in deep idle states. By
disabling tickless idle, worst case, the tick_sched hrtimer will trigger a
timer interrupt every period to check for broadcast.

However the current setup of tickless idle does not let us make the choice
of tickless on individual cpus. NOHZ_MODE_INACTIVE which disables tickless idle,
is a system wide setting. Hence resort to an arch specific call to check if a 
cpu
can go into tickless idle.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---

 arch/powerpc/kernel/time.c |5 +
 kernel/time/tick-sched.c   |7 +++
 2 files changed, 12 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 8ed0fb3..68a636f 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -862,6 +862,11 @@ static void decrementer_timer_broadcast(const struct 
cpumask *mask)
arch_send_tick_broadcast(mask);
 }
 
+int arch_can_stop_idle_tick(int cpu)
+{
+   return cpu != bc_cpu;
+}
+
 static void register_decrementer_clockevent(int cpu)
 {
struct clock_event_device *dec = per_cpu(decrementers, cpu);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6960172..e9ffa84 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -700,8 +700,15 @@ static void tick_nohz_full_stop_tick(struct tick_sched *ts)
 #endif
 }
 
+int __weak arch_can_stop_idle_tick(int cpu)
+{
+   return 1;
+}
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
+   if (!arch_can_stop_idle_tick(cpu))
+   return false;
/*
 * If this cpu is offline and it is the one which updates
 * jiffies, then give up the assignment and let it be taken by



[RFC PATCH 2/5] powerpc: Implement broadcast timer interrupt as an IPI message

2013-07-25 Thread Preeti U Murthy
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com

For scalability and performance reasons, we want the broadcast timer
interrupts to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster
than the smp_call_function mechanism because the IPI handlers are fixed
and hence they don't involve costly operations such as adding IPI handlers
to the target CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
broadcast timer interrupts efficiently.

Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---

 arch/powerpc/include/asm/smp.h  |3 ++-
 arch/powerpc/kernel/smp.c   |   19 +++
 arch/powerpc/platforms/cell/interrupt.c |2 +-
 arch/powerpc/platforms/ps3/smp.c|2 +-
 4 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 51bf017..d877b69 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu);
  *
  * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
  * in /proc/interrupts will be wrong!!! --Troy */
-#define PPC_MSG_UNUSED 0
+#define PPC_MSG_TIMER  0
 #define PPC_MSG_RESCHEDULE  1
 #define PPC_MSG_CALL_FUNC_SINGLE   2
 #define PPC_MSG_DEBUGGER_BREAK  3
@@ -190,6 +190,7 @@ extern struct smp_ops_t *smp_ops;
 
 extern void arch_send_call_function_single_ipi(int cpu);
 extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
+extern void arch_send_tick_broadcast(const struct cpumask *mask);
 
 /* Definitions relative to the secondary CPU spin loop
  * and entry point. Not all of them exist on both 32 and
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index bc41e9f..6a68ca4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
 #include <asm/ptrace.h>
 #include <linux/atomic.h>
 #include <asm/irq.h>
+#include <asm/hw_irq.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/prom.h>
@@ -111,9 +112,9 @@ int smp_generic_kick_cpu(int nr)
 }
 #endif /* CONFIG_PPC64 */
 
-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t timer_action(int irq, void *data)
 {
-   /* This slot is unused and hence available for use, if needed */
+   timer_interrupt();
return IRQ_HANDLED;
 }
 
@@ -144,14 +145,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 }
 
 static irq_handler_t smp_ipi_action[] = {
-   [PPC_MSG_UNUSED] =  unused_action, /* Slot available for future use */
+   [PPC_MSG_TIMER] =  timer_action,
[PPC_MSG_RESCHEDULE] = reschedule_action,
[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
-   [PPC_MSG_UNUSED] =  "ipi unused",
+   [PPC_MSG_TIMER] =  "ipi timer",
[PPC_MSG_RESCHEDULE] = "ipi reschedule",
[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
@@ -221,6 +222,8 @@ irqreturn_t smp_ipi_demux(void)
all = xchg(&info->messages, 0);
 
 #ifdef __BIG_ENDIAN
+   if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
+   timer_interrupt();
if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
@@ -266,6 +269,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask 
*mask)
do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
 }
 
+void arch_send_tick_broadcast(const struct cpumask *mask)
+{
+   unsigned int cpu;
+
+   for_each_cpu(cpu, mask)
+   do_message_pass(cpu, PPC_MSG_TIMER);
+}
+
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 void smp_send_debugger_break(void)
 {
diff --git a/arch/powerpc/platforms/cell/interrupt.c 
b/arch/powerpc/platforms/cell/interrupt.c
index 28166e4..1359113 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -213,7 +213,7 @@ static void iic_request_ipi(int msg)
 
 void iic_request_IPIs(void)
 {
-   iic_request_ipi(PPC_MSG_UNUSED);
+   iic_request_ipi(PPC_MSG_TIMER);
iic_request_ipi(PPC_MSG_RESCHEDULE);
iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 488f069..5cb742a 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void)
* to index needs to be setup.
*/
 
-   BUILD_BUG_ON(PPC_MSG_UNUSED

Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

2013-07-25 Thread Preeti U Murthy
Hi Frederic,

On 07/25/2013 07:00 PM, Frederic Weisbecker wrote:
 On Thu, Jul 25, 2013 at 02:33:02PM +0530, Preeti U Murthy wrote:
 In the current design of timer offload framework, the broadcast cpu should
 *not* go into tickless idle so as to avoid missed wakeups on CPUs in deep 
 idle states.

 Since we prevent the CPUs entering deep idle states from programming the 
 lapic of the
 broadcast cpu for their respective next local events for reasons mentioned in
 PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up 
 during
 each of its timer interrupt programmed to its local events.

 With tickless idle, the broadcast CPU might not get a timer interrupt till 
 after
 many ticks which can result in missed wakeups on CPUs in deep idle states. By
 disabling tickless idle, worst case, the tick_sched hrtimer will trigger a
 timer interrupt every period to check for broadcast.

 However the current setup of tickless idle does not let us make the choice
 of tickless on individual cpus. NOHZ_MODE_INACTIVE which disables tickless 
 idle,
 is a system wide setting. Hence resort to an arch specific call to check if 
 a cpu
 can go into tickless idle.
 
 Hi Preeti,
 
 I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle 
 mode.
 I read in the previous patch that's because in dynticks idle mode the 
 broadcast
 CPU deactivates its lapic so it doesn't receive the IPI. But may be I 
 misunderstood.
 Anyway that's not good for powersaving.

Let me elaborate. The CPUs in deep idle states have their lapics
deactivated. This means the next timer event which would typically have
been taken care of by a lapic firing at the appropriate moment does not
get taken care of in deep idle states, due to the lapic being switched off.

Hence such CPUs offload their next timer event to the broadcast CPU,
which should *not* enter deep idle states. The broadcast CPU has the
responsibility of waking the CPUs in deep idle states.

*The lapic of a broadcast CPU is active always*. Say CPUX, wants the
broadcast CPU to wake it up at timeX.  Since we cannot program the lapic
of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
asking it to program its lapic to fire at timeX so as to wake up CPUX.
*With multiple CPUs the overhead of sending IPI, could result in
performance bottlenecks and may not scale well.*

Hence the workaround is that the broadcast CPU on each of its timer
interrupt checks if any of the next timer event of a CPU in deep idle
state has expired, which can very well be found from dev-next_event of
that CPU. For example the timeX that has been mentioned above has
expired. If so the broadcast handler is called to send an IPI to the
idling CPU to wake it up.

*If the broadcast CPU, is in tickless idle, its timer interrupt could be
many ticks away. It could miss waking up a CPU in deep idle*, if its
wakeup is much before this timer interrupt of the broadcast CPU. But
without tickless idle, at least at each period we are assured of a timer
interrupt. At which time broadcast handling is done as stated in the
previous paragraph and we will not miss wakeup of CPUs in deep idle states.
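
To make that concrete, here is a tiny compile-and-run model of the check
described above. All names (deep_idle_mask, next_event, send_wakeup_ipi) are
made up for illustration; in the kernel the expiry comes from dev->next_event
and the wakeup is the broadcast IPI.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 8

static bool deep_idle_mask[NR_CPUS];	/* CPUs whose decrementer is off      */
static uint64_t next_event[NR_CPUS];	/* their dev->next_event equivalents  */

static void send_wakeup_ipi(int cpu)	/* models the broadcast IPI           */
{
	printf("IPI -> cpu %d\n", cpu);
}

/* Called from the broadcast CPU's timer interrupt, every tick at worst */
static void broadcast_check(uint64_t now)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (deep_idle_mask[cpu] && next_event[cpu] <= now) {
			deep_idle_mask[cpu] = false;
			send_wakeup_ipi(cpu);
		}
	}
}

int main(void)
{
	deep_idle_mask[3] = true;
	next_event[3] = 100;	/* "timeX" of the example above */

	broadcast_check(50);	/* too early: no IPI            */
	broadcast_check(120);	/* timeX expired: wake cpu 3    */
	return 0;
}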

Yeah it is true that not allowing the broadcast CPU to enter tickless
idle is bad for power savings, but for the use case that we are aiming
at in this patch series, the current approach seems to be the best, with
minimal trade-offs in performance, power savings, scalability and no
change in the broadcast framework that exists today in the kernel.

 
 Also when an arch wants to prevent a CPU from entering dynticks idle mode, it 
 typically
 use arch_needs_cpu(). May be that could fit for you as well?

Oh ok thanks :) I will look into this and get back on if we can use it.

Regards
Preeti U Murthy



Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

2013-07-25 Thread Preeti U Murthy
Hi Frederic,

On 07/25/2013 07:00 PM, Frederic Weisbecker wrote:
 On Thu, Jul 25, 2013 at 02:33:02PM +0530, Preeti U Murthy wrote:
 In the current design of timer offload framework, the broadcast cpu should
 *not* go into tickless idle so as to avoid missed wakeups on CPUs in deep 
 idle states.

 Since we prevent the CPUs entering deep idle states from programming the 
 lapic of the
 broadcast cpu for their respective next local events for reasons mentioned in
 PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up 
 during
 each of its timer interrupt programmed to its local events.

 With tickless idle, the broadcast CPU might not get a timer interrupt till 
 after
 many ticks which can result in missed wakeups on CPUs in deep idle states. By
 disabling tickless idle, worst case, the tick_sched hrtimer will trigger a
 timer interrupt every period to check for broadcast.

 However the current setup of tickless idle does not let us make the choice
 of tickless on individual cpus. NOHZ_MODE_INACTIVE which disables tickless 
 idle,
 is a system wide setting. Hence resort to an arch specific call to check if 
 a cpu
 can go into tickless idle.
 
 Hi Preeti,
 
 I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle 
 mode.
 I read in the previous patch that's because in dynticks idle mode the 
 broadcast
 CPU deactivates its lapic so it doesn't receive the IPI. But may be I 
 misunderstood.
 Anyway that's not good for powersaving.
 
 Also when an arch wants to prevent a CPU from entering dynticks idle mode, it 
 typically
 use arch_needs_cpu(). May be that could fit for you as well?

Yes this will suit our requirement perfectly. I will note down this
change for the next version of this patchset. Thank you very much for
pointing this out :)

Regards
Preeti U Murthy



Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

2013-07-25 Thread Preeti U Murthy
Hi Paul,

On 07/26/2013 08:49 AM, Paul Mackerras wrote:
 On Fri, Jul 26, 2013 at 08:09:23AM +0530, Preeti U Murthy wrote:
 Hi Frederic,

 On 07/25/2013 07:00 PM, Frederic Weisbecker wrote:
 Hi Preeti,

 I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle 
 mode.
 I read in the previous patch that's because in dynticks idle mode the 
 broadcast
 CPU deactivates its lapic so it doesn't receive the IPI. But may be I 
 misunderstood.
 Anyway that's not good for powersaving.

 Let me elaborate. The CPUs in deep idle states have their lapics
 deactivated. This means the next timer event which would typically have
 been taken care of by a lapic firing at the appropriate moment does not
 get taken care of in deep idle states, due to the lapic being switched off.
 
 I really don't think it's helpful to use the term lapic in
 connection with Power systems.  There is nothing that is called a
 lapic in a Power machine.  The nearest equivalent of the LAPIC on
 x86 machines is the ICP, the interrupt-controller presentation
 element, of which there is one per CPU thread.
 
 However, I don't believe the ICP gets disabled in deep sleep modes.
 What does get disabled is the decrementer, which is a register that
 normally counts down (at 512MHz) and generates an exception when it is
 negative.  The decrementer *is* part of the CPU core, unlike the ICP.
 That's why we can still get IPIs but not timer interrupts.
 
 Please reword your patch description to not use the term lapic,
 which is not defined in the Power context and is therefore just
 causing confusion.

Noted. Thank you :) I will probably send out a fresh patchset with the
appropriate changelog to avoid this confusion?
 
 Paul.
 
Regards
Preeti U Murthy



Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

2013-07-25 Thread Preeti U Murthy
Hi Frederic,

I apologise for the confusion. As Paul pointed out maybe the usage of
the term lapic is causing a large amount of confusion. So please see the
clarification below. Maybe it will help answer your question.

On 07/26/2013 08:09 AM, Preeti U Murthy wrote:
 Hi Frederic,
 
 On 07/25/2013 07:00 PM, Frederic Weisbecker wrote:
 On Thu, Jul 25, 2013 at 02:33:02PM +0530, Preeti U Murthy wrote:
 In the current design of timer offload framework, the broadcast cpu should
 *not* go into tickless idle so as to avoid missed wakeups on CPUs in deep 
 idle states.

 Since we prevent the CPUs entering deep idle states from programming the 
 lapic of the
 broadcast cpu for their respective next local events for reasons mentioned 
 in
 PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up 
 during
 each of its timer interrupt programmed to its local events.

 With tickless idle, the broadcast CPU might not get a timer interrupt till 
 after
 many ticks which can result in missed wakeups on CPUs in deep idle states. 
 By
 disabling tickless idle, worst case, the tick_sched hrtimer will trigger a
 timer interrupt every period to check for broadcast.

 However the current setup of tickless idle does not let us make the choice
 of tickless on individual cpus. NOHZ_MODE_INACTIVE which disables tickless 
 idle,
 is a system wide setting. Hence resort to an arch specific call to check if 
 a cpu
 can go into tickless idle.

 Hi Preeti,

 I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle 
 mode.
 I read in the previous patch that's because in dynticks idle mode the 
 broadcast
 CPU deactivates its lapic so it doesn't receive the IPI. But may be I 
 misunderstood.
 Anyway that's not good for powersaving.

Firstly, when CPUs enter deep idle states, their local clock event
devices get switched off. In the case of powerpc, local clock event
device is the decrementer. Hence such CPUs *do not get timer interrupts*
but are still *capable of taking IPIs.*

So we need to ensure that some other CPU, in this case the broadcast
CPU, makes note of when the timer interrupt of the CPU in such deep idle
states is to trigger and at that moment issue an IPI to that CPU.

*The broadcast CPU however should have its decrementer active always*,
meaning it is disallowed from entering deep idle states, where the
decrementer switches off, precisely because the other idling CPUs bank
on it for the above mentioned reason.

 *The lapic of a broadcast CPU is active always*. Say CPUX, wants the
 broadcast CPU to wake it up at timeX.  Since we cannot program the lapic
 of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
 asking it to program its lapic to fire at timeX so as to wake up CPUX.
 *With multiple CPUs the overhead of sending IPI, could result in
 performance bottlenecks and may not scale well.*

Rewording the above. The decrementer of the broadcast CPU is active
always. Since we cannot program the clock event device
of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
(which the broadcast CPU is very well capable of receiving), asking it
to program its decrementer to fire at timeX so as to wake up CPUX
*With multiple CPUs the overhead of sending IPI, could result in
performance bottlenecks and may not scale well.*

 
 Hence the workaround is that the broadcast CPU on each of its timer
 interrupt checks if any of the next timer event of a CPU in deep idle
 state has expired, which can very well be found from dev-next_event of
 that CPU. For example the timeX that has been mentioned above has
 expired. If so the broadcast handler is called to send an IPI to the
 idling CPU to wake it up.
 
 *If the broadcast CPU, is in tickless idle, its timer interrupt could be
 many ticks away. It could miss waking up a CPU in deep idle*, if its
 wakeup is much before this timer interrupt of the broadcast CPU. But
 without tickless idle, at least at each period we are assured of a timer
 interrupt. At which time broadcast handling is done as stated in the
 previous paragraph and we will not miss wakeup of CPUs in deep idle states.
 
 Yeah it is true that not allowing the broadcast CPU to enter tickless
 idle is bad for power savings, but for the use case that we are aiming
 at in this patch series, the current approach seems to be the best, with
 minimal trade-offs in performance, power savings, scalability and no
 change in the broadcast framework that exists today in the kernel.
 

Regards
Preeti U Murthy



[Resend RFC PATCH 0/5] cpuidle/ppc: Timer offload framework to support deep idle states

2013-07-25 Thread Preeti U Murthy
On PowerPC, when CPUs enter deep idle states, their local timers are
switched off. The responsibility of waking them up at their next timer event,
needs to be handed over to an external device. On PowerPC, we do not have an
external device equivalent to the HPET, which is what architectures like x86
currently use. Instead we assign the local timer of one of the CPUs to do this
job.

This patchset is an attempt to make use of the existing timer broadcast
framework in the kernel to meet the above requirement, except that the tick
broadcast device is the local timer of the boot CPU.

This patch series is ported ontop of 3.11-rc1 + the cpuidle driver backend
for powernv posted by Deepthi Dharwar recently. The current design and
implementation supports the ONESHOT tick mode. It does not yet support
the PERIODIC tick mode. This patch is tested with NOHZ_FULL off.

Patch[1/5], Patch[2/5]: optimize the broadcast mechanism on ppc.
Patch[3/5]: Introduces the core of the timer offload framework on powerpc.
Patch[4/5]: The cpu doing the broadcast should not go into tickless idle.
Patch[5/5]: Add a deep idle state to the cpuidle state table on powernv.

Patch[5/5] is the patch that ultimately makes use of the timer offload
framework that the patches Patch[1/5] to Patch[4/5] build.

This patch series is being resent to clarify certain ambiguity in the patch
descriptions from the previous post. Discussion around this:
https://lkml.org/lkml/2013/7/25/754

---

Preeti U Murthy (3):
  cpuidle/ppc: Add timer offload framework to support deep idle states
  cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
  cpuidle/ppc: Add longnap state to the idle states on powernv

Srivatsa S. Bhat (2):
  powerpc: Free up the IPI message slot of ipi call function 
(PPC_MSG_CALL_FUNC)
  powerpc: Implement broadcast timer interrupt as an IPI message


 arch/powerpc/include/asm/smp.h  |3 +
 arch/powerpc/include/asm/time.h |3 +
 arch/powerpc/kernel/smp.c   |   23 --
 arch/powerpc/kernel/time.c  |   86 +++
 arch/powerpc/platforms/cell/interrupt.c |2 -
 arch/powerpc/platforms/powernv/Kconfig  |1 
 arch/powerpc/platforms/powernv/processor_idle.c |   48 +
 arch/powerpc/platforms/ps3/smp.c|2 -
 kernel/time/tick-sched.c|7 ++
 9 files changed, 163 insertions(+), 12 deletions(-)



[Resend RFC PATCH 1/5] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)

2013-07-25 Thread Preeti U Murthy
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com

The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE
map to a common implementation - generic_smp_call_function_single_interrupt().
So, we can consolidate them and save one of the IPI message slots, (which are
precious, since only 4 of those slots are available).

So, implement the functionality of PPC_MSG_CALL_FUNC using
PPC_MSG_CALL_FUNC_SINGLE itself and release its IPI message slot, so that it
can be used for something else in the future, if desired.

Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---

 arch/powerpc/include/asm/smp.h  |2 +-
 arch/powerpc/kernel/smp.c   |   12 +---
 arch/powerpc/platforms/cell/interrupt.c |2 +-
 arch/powerpc/platforms/ps3/smp.c|2 +-
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index ffbaabe..51bf017 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu);
  *
  * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
  * in /proc/interrupts will be wrong!!! --Troy */
-#define PPC_MSG_CALL_FUNCTION   0
+#define PPC_MSG_UNUSED 0
 #define PPC_MSG_RESCHEDULE  1
 #define PPC_MSG_CALL_FUNC_SINGLE   2
 #define PPC_MSG_DEBUGGER_BREAK  3
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 38b0ba6..bc41e9f 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -111,9 +111,9 @@ int smp_generic_kick_cpu(int nr)
 }
 #endif /* CONFIG_PPC64 */
 
-static irqreturn_t call_function_action(int irq, void *data)
+static irqreturn_t unused_action(int irq, void *data)
 {
-   generic_smp_call_function_interrupt();
+   /* This slot is unused and hence available for use, if needed */
return IRQ_HANDLED;
 }
 
@@ -144,14 +144,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 }
 
 static irq_handler_t smp_ipi_action[] = {
-   [PPC_MSG_CALL_FUNCTION] =  call_function_action,
+   [PPC_MSG_UNUSED] =  unused_action, /* Slot available for future use */
[PPC_MSG_RESCHEDULE] = reschedule_action,
[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
-   [PPC_MSG_CALL_FUNCTION] =  "ipi call function",
+   [PPC_MSG_UNUSED] =  "ipi unused",
[PPC_MSG_RESCHEDULE] = "ipi reschedule",
[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
@@ -221,8 +221,6 @@ irqreturn_t smp_ipi_demux(void)
all = xchg(&info->messages, 0);
 
 #ifdef __BIG_ENDIAN
-   if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNCTION)))
-   generic_smp_call_function_interrupt();
if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
@@ -265,7 +263,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask 
*mask)
unsigned int cpu;
 
for_each_cpu(cpu, mask)
-   do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
+   do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
 }
 
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
diff --git a/arch/powerpc/platforms/cell/interrupt.c 
b/arch/powerpc/platforms/cell/interrupt.c
index 2d42f3b..28166e4 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -213,7 +213,7 @@ static void iic_request_ipi(int msg)
 
 void iic_request_IPIs(void)
 {
-   iic_request_ipi(PPC_MSG_CALL_FUNCTION);
+   iic_request_ipi(PPC_MSG_UNUSED);
iic_request_ipi(PPC_MSG_RESCHEDULE);
iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 4b35166..488f069 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void)
* to index needs to be setup.
*/
 
-   BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION!= 0);
+   BUILD_BUG_ON(PPC_MSG_UNUSED   != 0);
BUILD_BUG_ON(PPC_MSG_RESCHEDULE   != 1);
BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);



[Resend RFC PATCH 2/5] powerpc: Implement broadcast timer interrupt as an IPI message

2013-07-25 Thread Preeti U Murthy
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com

For scalability and performance reasons, we want the broadcast timer
interrupts to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster
than the smp_call_function mechanism because the IPI handlers are fixed
and hence they don't involve costly operations such as adding IPI handlers
to the target CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
broadcast timer interrupts efficiently.

Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---

 arch/powerpc/include/asm/smp.h  |3 ++-
 arch/powerpc/kernel/smp.c   |   19 +++
 arch/powerpc/platforms/cell/interrupt.c |2 +-
 arch/powerpc/platforms/ps3/smp.c|2 +-
 4 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 51bf017..d877b69 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu);
  *
  * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
  * in /proc/interrupts will be wrong!!! --Troy */
-#define PPC_MSG_UNUSED 0
+#define PPC_MSG_TIMER  0
 #define PPC_MSG_RESCHEDULE  1
 #define PPC_MSG_CALL_FUNC_SINGLE   2
 #define PPC_MSG_DEBUGGER_BREAK  3
@@ -190,6 +190,7 @@ extern struct smp_ops_t *smp_ops;
 
 extern void arch_send_call_function_single_ipi(int cpu);
 extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
+extern void arch_send_tick_broadcast(const struct cpumask *mask);
 
 /* Definitions relative to the secondary CPU spin loop
  * and entry point. Not all of them exist on both 32 and
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index bc41e9f..6a68ca4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
 #include <asm/ptrace.h>
 #include <linux/atomic.h>
 #include <asm/irq.h>
+#include <asm/hw_irq.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/prom.h>
@@ -111,9 +112,9 @@ int smp_generic_kick_cpu(int nr)
 }
 #endif /* CONFIG_PPC64 */
 
-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t timer_action(int irq, void *data)
 {
-   /* This slot is unused and hence available for use, if needed */
+   timer_interrupt();
return IRQ_HANDLED;
 }
 
@@ -144,14 +145,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 }
 
 static irq_handler_t smp_ipi_action[] = {
-   [PPC_MSG_UNUSED] =  unused_action, /* Slot available for future use */
+   [PPC_MSG_TIMER] =  timer_action,
[PPC_MSG_RESCHEDULE] = reschedule_action,
[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
-   [PPC_MSG_UNUSED] =  "ipi unused",
+   [PPC_MSG_TIMER] =  "ipi timer",
[PPC_MSG_RESCHEDULE] = "ipi reschedule",
[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
@@ -221,6 +222,8 @@ irqreturn_t smp_ipi_demux(void)
all = xchg(&info->messages, 0);
 
 #ifdef __BIG_ENDIAN
+   if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
+   timer_interrupt();
if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
@@ -266,6 +269,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask 
*mask)
do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
 }
 
+void arch_send_tick_broadcast(const struct cpumask *mask)
+{
+   unsigned int cpu;
+
+   for_each_cpu(cpu, mask)
+   do_message_pass(cpu, PPC_MSG_TIMER);
+}
+
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 void smp_send_debugger_break(void)
 {
diff --git a/arch/powerpc/platforms/cell/interrupt.c 
b/arch/powerpc/platforms/cell/interrupt.c
index 28166e4..1359113 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -213,7 +213,7 @@ static void iic_request_ipi(int msg)
 
 void iic_request_IPIs(void)
 {
-   iic_request_ipi(PPC_MSG_UNUSED);
+   iic_request_ipi(PPC_MSG_TIMER);
iic_request_ipi(PPC_MSG_RESCHEDULE);
iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 488f069..5cb742a 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void)
* to index needs to be setup.
*/
 
-   BUILD_BUG_ON(PPC_MSG_UNUSED

[Resend RFC PATCH 3/5] cpuidle/ppc: Add timer offload framework to support deep idle states

2013-07-25 Thread Preeti U Murthy
On ppc, in deep idle states, the local clock event device of CPUs gets
switched off. On PowerPC, the local clock event device is called the
decrementer. Make use of the broadcast framework to issue interrupts to
cpus in deep idle states on their timer events, except that on ppc, we
do not have an external device such as HPET, but we use the decrementer
of one of the CPUs itself as the broadcast device.

Instantiate two different clock event devices, one representing the
decrementer and another representing the broadcast device for each cpu.
The cpu which registers its broadcast device will be responsible for
performing the function of issuing timer interrupts to CPUs in deep idle
states, and is referred to as the broadcast cpu in the changelogs of this
patchset for convenience. Such a CPU is not allowed to enter deep idle
states, where the decrementer is switched off.

For now, only the boot cpu's broadcast device gets registered as a clock event
device along with the decrementer. Hence this is the broadcast cpu.

On the broadcast cpu, on each timer interrupt, apart from the regular local
timer event handler the broadcast handler is also called. We avoid the overhead
of programming the decrementer specifically for a broadcast event, for
performance and scalability reasons. Say cpuX goes to deep idle state. It
has to ask the broadcast CPU to reprogram its(broadcast CPU's) decrementer for
the next local timer event of cpuX. cpuX can do so only by sending an IPI to the
broadcast CPU. With many more cpus going to deep idle, this model of sending
IPIs each time will result in performance bottleneck and may not scale well.

Apart from this there is no change in the way broadcast is handled today. On
a broadcast ipi the event handler for a timer interrupt is called on the cpu
in deep idle state to handle the local events.

The current design and implementation of the timer offload framework supports
the ONESHOT tick mode but not the PERIODIC mode.

Signed-off-by: Preeti U. Murthy pre...@linux.vnet.ibm.com
---

 arch/powerpc/include/asm/time.h|3 +
 arch/powerpc/kernel/smp.c  |4 +-
 arch/powerpc/kernel/time.c |   81 
 arch/powerpc/platforms/powernv/Kconfig |1 
 4 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..936be0d 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -24,14 +24,17 @@ extern unsigned long tb_ticks_per_jiffy;
 extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
 extern struct clock_event_device decrementer_clockevent;
+extern struct clock_event_device broadcast_clockevent;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void decrementer_timer_interrupt(void);
 
 extern void generic_calibrate_decr(void);
 
 extern void set_dec_cpu6(unsigned int val);
+extern int bc_cpu;
 
 /* Some sane defaults: 125 MHz timebase, 1GHz processor */
 extern unsigned long ppc_proc_freq;
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6a68ca4..d3b7014 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -114,7 +114,7 @@ int smp_generic_kick_cpu(int nr)
 
 static irqreturn_t timer_action(int irq, void *data)
 {
-   timer_interrupt();
+   decrementer_timer_interrupt();
return IRQ_HANDLED;
 }
 
@@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)
 
 #ifdef __BIG_ENDIAN
if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
-   timer_interrupt();
+   decrementer_timer_interrupt();
if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 65ab9e9..7e858e1 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -42,6 +42,7 @@
 #include <linux/timex.h>
 #include <linux/kernel_stat.h>
 #include <linux/time.h>
+#include <linux/timer.h>
 #include <linux/init.h>
 #include <linux/profile.h>
 #include <linux/cpu.h>
@@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {
 
 static int decrementer_set_next_event(unsigned long evt,
  struct clock_event_device *dev);
+static int broadcast_set_next_event(unsigned long evt,
+ struct clock_event_device *dev);
 static void decrementer_set_mode(enum clock_event_mode mode,
 struct clock_event_device *dev);
+static void decrementer_timer_broadcast(const struct cpumask *mask);
 
 struct clock_event_device decrementer_clockevent = {
.name   = "decrementer",
@@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
.irq

[Resend RFC PATCH 5/5] cpuidle/ppc: Add longnap state to the idle states on powernv

2013-07-25 Thread Preeti U Murthy
This patch hooks into the existing broadcast framework with the support that 
this
patchset introduces for ppc, and the cpuidle driver backend
for powernv(posted out recently by Deepthi Dharwar) to add sleep state as
one of the deep idle states, in which the decrementer is switched off.

However in this patch, we only emulate sleep by going into a state which does
a nap with the decrementer interrupts disabled, termed as longnap. This enables
focus on the timer broadcast framework for ppc in this series of patches,
which is required as a first step to enable sleep on ppc.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---

 arch/powerpc/platforms/powernv/processor_idle.c |   48 +++
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/processor_idle.c 
b/arch/powerpc/platforms/powernv/processor_idle.c
index f43ad91a..9aca502 100644
--- a/arch/powerpc/platforms/powernv/processor_idle.c
+++ b/arch/powerpc/platforms/powernv/processor_idle.c
@@ -9,16 +9,18 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
+#include <linux/clockchips.h>
 
 #include <asm/machdep.h>
 #include <asm/runlatch.h>
+#include <asm/time.h>
 
 struct cpuidle_driver powernv_idle_driver = {
.name = "powernv_idle",
.owner = THIS_MODULE,
 };
 
-#define MAX_IDLE_STATE_COUNT   2
+#define MAX_IDLE_STATE_COUNT   3
 
 static int max_idle_state = MAX_IDLE_STATE_COUNT - 1;
 static struct cpuidle_device __percpu *powernv_cpuidle_devices;
@@ -54,6 +56,43 @@ static int nap_loop(struct cpuidle_device *dev,
return index;
 }
 
+/* Emulate sleep, with long nap.
+ * During sleep, the core does not receive decrementer interrupts.
+ * Emulate sleep using long nap with decrementer interrupts disabled.
+ * This is an initial prototype to test the timer offload framework for ppc.
+ * We will eventually introduce the sleep state once the timer offload 
framework
+ * for ppc is stable.
+ */
+static int longnap_loop(struct cpuidle_device *dev,
+   struct cpuidle_driver *drv,
+   int index)
+{
+   int cpu = dev->cpu;
+
+   unsigned long lpcr = mfspr(SPRN_LPCR);
+
+   lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
+
+   /* exit powersave upon external interrupt, but not decrementer
+* interrupt, Emulate sleep.
+*/
+   lpcr |= LPCR_PECE0;
+
+   if (cpu != bc_cpu) {
+   mtspr(SPRN_LPCR, lpcr);
+   clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+   power7_nap();
+   clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+   } else {
+   /* Wakeup on a decrementer interrupt, Do a nap */
+   lpcr |= LPCR_PECE1;
+   mtspr(SPRN_LPCR, lpcr);
+   power7_nap();
+   }
+
+   return index;
+}
+
 /*
  * States for dedicated partition case.
  */
@@ -72,6 +111,13 @@ static struct cpuidle_state 
powernv_states[MAX_IDLE_STATE_COUNT] = {
.exit_latency = 10,
.target_residency = 100,
.enter = nap_loop },
+{ /* LongNap */
+   .name = "LongNap",
+   .desc = "LongNap",
+   .flags = CPUIDLE_FLAG_TIME_VALID,
+   .exit_latency = 10,
+   .target_residency = 100,
+   .enter = longnap_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,



[Resend RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

2013-07-26 Thread Preeti U Murthy
In the current design of timer offload framework, the broadcast cpu should
*not* go into tickless idle so as to avoid missed wakeups on CPUs in deep idle 
states.

Since we prevent the CPUs entering deep idle states from programming the
decrementer of the broadcast cpu for their respective next local events for
reasons mentioned in PATCH[3/5], the broadcast CPU checks if there are any
CPUs to be woken up on each of its timer interrupts, which are programmed
for its own local events.

With tickless idle, the broadcast CPU might not have a timer interrupt
pending till after many ticks, which can result in missed wakeups on CPUs
in deep idle states. With tickless idle disabled, the tick_sched hrtimer
guarantees a timer interrupt at least once every period.

However, the current setup of tickless idle does not let us make this choice
per cpu: NOHZ_MODE_INACTIVE, which disables tickless idle, is a system-wide
setting. Hence resort to an arch-specific call to check whether a cpu can go
into tickless idle.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---

 arch/powerpc/kernel/time.c |5 +
 kernel/time/tick-sched.c   |7 +++
 2 files changed, 12 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 7e858e1..916c32f 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -864,6 +864,11 @@ static void decrementer_timer_broadcast(const struct cpumask *mask)
arch_send_tick_broadcast(mask);
 }
 
+int arch_can_stop_idle_tick(int cpu)
+{
+   return cpu != bc_cpu;
+}
+
 static void register_decrementer_clockevent(int cpu)
 {
	struct clock_event_device *dec = &per_cpu(decrementers, cpu);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6960172..e9ffa84 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -700,8 +700,15 @@ static void tick_nohz_full_stop_tick(struct tick_sched *ts)
 #endif
 }
 
+int __weak arch_can_stop_idle_tick(int cpu)
+{
+   return 1;
+}
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
+   if (!arch_can_stop_idle_tick(cpu))
+   return false;
/*
 * If this cpu is offline and it is the one which updates
 * jiffies, then give up the assignment and let it be taken by



Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

2013-07-27 Thread Preeti U Murthy
Hi Ben,

On 07/27/2013 12:00 PM, Benjamin Herrenschmidt wrote:
 On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote:
 *The lapic of a broadcast CPU is active always*. Say CPUX wants the
 broadcast CPU to wake it up at timeX.  Since we cannot program the lapic
 of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
 asking it to program its lapic to fire at timeX so as to wake up CPUX.
 *With multiple CPUs the overhead of sending IPI, could result in
 performance bottlenecks and may not scale well.*

 Hence the workaround is that the broadcast CPU, on each of its timer
 interrupts, checks whether the next timer event of any CPU in deep idle
 state has expired, which can very well be found from dev->next_event of
 that CPU. For example the timeX that has been mentioned above has
 expired. If so the broadcast handler is called to send an IPI to the
 idling CPU to wake it up.

 *If the broadcast CPU is in tickless idle, its timer interrupt could be
 many ticks away. It could miss waking up a CPU in deep idle*, if its
 wakeup is much before this timer interrupt of the broadcast CPU. But
 without tickless idle, at least at each period we are assured of a timer
 interrupt. At which time broadcast handling is done as stated in the
 previous paragraph and we will not miss wakeup of CPUs in deep idle states.
 
 But that means a great loss of power saving on the broadcast CPU when the
 machine is basically completely idle. We might be able to come up with
 something better.
 
 (Note : I do not know the timer offload code if it exists already, I'm
 describing how things could happen out of the blue without any knowledge
 of the pre-existing framework here)
 
 We can know when the broadcast CPU expects to wake up next. When a CPU goes to
 a deep sleep state, it can then
 
  - Indicate to the broadcast CPU when it intends to be woken up by queuing
 itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION:
 Play with the locality of that: have one queue (and one broadcast CPU) per
 chip or per node instead of a global one to limit cache bouncing).
 
  - Check if that happens before the broadcast CPU's intended wake time (we
 need statistics to see how often that happens), and in that case send an IPI
 to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep
 time to the min of its intended sleep time and the new sleeper time.
 (OPTIMISATION: Dynamically chose a broadcast CPU based on closest expiry ?)
 
  - We can probably limit spurious wakeups a *LOT* by aligning that target
 time to a global jiffy boundary, meaning that several CPUs going to idle are
 likely to be choosing the same. Or maybe better, an adaptive alignment by
 essentially getting more coarse grained as we go further in the future
 
  - When the broadcast CPU goes to sleep, it can play the same game of alignment.
 
 I don't like the concept of a dedicated broadcast CPU however. I'd rather
 have a general queue (or per node) of sleepers needing a wakeup and more/less
 dynamically pick a waker to be the last man standing, but it does make things
 a bit more tricky with tickless scheduler (non-idle).
 
 Still, I wonder if we could just have some algorithm to actually pick wakers
 more dynamically based on whoever has the closest next wakeup planned,
 that sort of thing. A fixed broadcaster will create an imbalance in
 power/thermal within the chip in addition to needing to be moved around on
 hotplug etc...
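
To make the ordered-queue idea above concrete, here is a minimal userspace
sketch (the queue, the names and the waker logic below are illustrative
assumptions only, not kernel code and not the actual proposal):

#include <stdio.h>

#define NR_CPUS 8

/* Toy ordered list of sleepers, sorted by intended wakeup time. */
struct sleeper { int cpu; unsigned long wake_at; };

static struct sleeper queue[NR_CPUS];
static int nr_sleepers;

/* A CPU entering deep idle queues itself, keeping the list sorted. */
static void queue_sleeper(int cpu, unsigned long wake_at)
{
	int i = nr_sleepers++;

	while (i > 0 && queue[i - 1].wake_at > wake_at) {
		queue[i] = queue[i - 1];
		i--;
	}
	queue[i] = (struct sleeper){ .cpu = cpu, .wake_at = wake_at };
}

/* Before sleeping, the waker limits its sleep to the earliest queued wakeup. */
static unsigned long waker_next_sleep(unsigned long own_next_event)
{
	if (nr_sleepers && queue[0].wake_at < own_next_event)
		return queue[0].wake_at;
	return own_next_event;
}

/* On its timer interrupt, the waker IPIs every sleeper that is due. */
static void waker_timer_interrupt(unsigned long now)
{
	while (nr_sleepers && queue[0].wake_at <= now) {
		printf("IPI wakeup to cpu %d\n", queue[0].cpu);
		for (int i = 1; i < nr_sleepers; i++)
			queue[i - 1] = queue[i];
		nr_sleepers--;
	}
}

int main(void)
{
	queue_sleeper(2, 40);
	queue_sleeper(5, 20);
	printf("waker sleeps until %lu\n", waker_next_sleep(100));	/* 20 */
	waker_timer_interrupt(20);	/* wakes cpu 5 */
	waker_timer_interrupt(40);	/* wakes cpu 2 */
	return 0;
}

Locking, per-node queues and the jiffy alignment mentioned above are left out
on purpose; they would bolt onto queue_sleeper() and waker_next_sleep().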

Thank you for having listed out the above suggestions. Below, I will
bring out some ideas about how the concerns that you have raised can be
addressed in the increasing order of priority.

- To begin with, I think we can have the following model to let the
responsibility of the broadcast CPU float around certain CPUs, i.e. not
have a dedicated broadcast CPU. I will refer to the broadcast CPU as the
bc_cpu henceforth for convenience.

1. The first CPU that intends to enter deep sleep state will be the bc_cpu.

2. Every other CPU that intends to enter deep idle state will add itself
to a mask, say the bc_mask (which is already being done today), after it
checks that a bc_cpu has been assigned.

3. The bc_cpu should not enter tickless idle, until step 5a holds true.

4. So on every timer interrupt, which occurs at least every period, it
checks the bc_mask to see if any CPUs need to be woken up.

5. The bc_cpu should not enter tickless idle *until* it is de-nominated
as the bc_cpu. The de-nomination occurs when:
  a. In one of its timer interrupts, it does broadcast handling to find
out that there are no CPUs to be woken up.

6. So if 5a holds, then there is no bc_cpu anymore until a CPU decides
to enter deep idle state again, in which case steps 1 to 5 repeat. (A rough
sketch of this flow follows below.)
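
A rough userspace sketch of steps 1 to 6 above (bc_cpu, bc_mask, next_event
and the helpers are illustrative stand-ins only, and locking/races between
CPUs are deliberately ignored):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

static int bc_cpu = -1;				/* -1: no bc_cpu nominated      */
static unsigned long bc_mask;			/* CPUs waiting in deep idle    */
static unsigned long next_event[NR_CPUS];	/* each CPU's next timer expiry */

/* Steps 1 and 2: a CPU about to enter deep idle either becomes the bc_cpu
 * or registers itself in bc_mask so that the bc_cpu will wake it up.      */
static bool enter_deep_idle(int cpu, unsigned long wakeup_time)
{
	if (bc_cpu == -1) {
		bc_cpu = cpu;		/* step 1: first deep idler is the bc_cpu */
		return false;		/* steps 3/5: it must keep its tick       */
	}
	next_event[cpu] = wakeup_time;
	bc_mask |= 1UL << cpu;		/* step 2 */
	return true;			/* may switch off its decrementer         */
}

/* Steps 4 to 6: run from the bc_cpu's periodic timer interrupt. Returns true
 * once the bc_cpu has been de-nominated and may itself go tickless.         */
static bool bc_cpu_timer_interrupt(unsigned long now)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if ((bc_mask & (1UL << cpu)) && next_event[cpu] <= now) {
			bc_mask &= ~(1UL << cpu);
			printf("IPI wakeup to cpu %d\n", cpu);
		}
	}
	if (!bc_mask) {			/* step 5a/6: nobody left to wake up */
		bc_cpu = -1;
		return true;
	}
	return false;
}

int main(void)
{
	enter_deep_idle(0, 0);		/* cpu0 becomes the bc_cpu          */
	enter_deep_idle(1, 30);		/* cpu1 asks to be woken at t=30    */
	for (unsigned long t = 10; t <= 50; t += 10) {
		if (bc_cpu_timer_interrupt(t)) {
			printf("t=%lu: bc_cpu de-nominated, may go tickless\n", t);
			break;
		}
	}
	return 0;
}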


- We could optimize this further, to allow the bc_cpu to enter tickless
idle, even while it is nominated as one. This can be the next step, if
we can get the above to work stably.

You have already brought out this point


Re: [PATCH 06/10] sched, fair: Make group power more consitent

2013-08-22 Thread Preeti U Murthy
On 08/19/2013 09:31 PM, Peter Zijlstra wrote:


Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com



Re: [PATCH 07/10] sched, fair: Optimize find_busiest_queue()

2013-08-23 Thread Preeti U Murthy
Hi Peter,

On 08/19/2013 09:31 PM, Peter Zijlstra wrote:

In the load balancing code, it looks to me that
cpumask_copy(cpus, cpu_active_mask) is not updating env.cpus at all
before find_busiest_group() is called. Am I missing something?

Should the cpumask_copy() below not come before we set the env.cpus parameter?

static int load_balance(int this_cpu, struct rq *this_rq,
struct sched_domain *sd, enum cpu_idle_type idle,
int *balance)
{
int ld_moved, cur_ld_moved, active_balance = 0;
struct sched_group *group;
struct rq *busiest;
unsigned long flags;
struct cpumask *cpus = __get_cpu_var(load_balance_mask);

	struct lb_env env = {
		.sd		= sd,
		.dst_cpu	= this_cpu,
		.dst_rq		= this_rq,
		.dst_grpmask	= sched_group_cpus(sd->groups),
		.idle		= idle,
		.loop_break	= sched_nr_migrate_break,
		.cpus		= cpus,
	};

/*
 * For NEWLY_IDLE load_balancing, we don't need to consider
 * other cpus in our group
 */
if (idle == CPU_NEWLY_IDLE)
env.dst_grpmask = NULL;

cpumask_copy(cpus, cpu_active_mask);

schedstat_inc(sd, lb_count[idle]);
redo:
	group = find_busiest_group(&env, balance);

Regards
Preeti U Murthy



Re: [PATCH 07/10] sched, fair: Optimize find_busiest_queue()

2013-08-23 Thread Preeti U Murthy
On 08/23/2013 03:33 PM, Peter Zijlstra wrote:
 On Fri, Aug 23, 2013 at 01:41:55PM +0530, Preeti U Murthy wrote:
 Hi Peter,

 On 08/19/2013 09:31 PM, Peter Zijlstra wrote:

 In the load balancing code, it looks to me that
 cpumask_copy(cpus, cpu_active_mask) is not updating env.cpus at all
 before find_busiest_group() is called. Am I missing something?

 Should the cpumask_copy() below not come before we set the env.cpus parameter?
 
 Nah, it's all pointers.
 

Yikes! Of course :P
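
For anyone else who trips over the same thing, a tiny standalone sketch (toy
types and names, not the scheduler code) of why the copy order does not
matter: env.cpus stores a pointer to the very same mask, so a later copy done
through the local cpus pointer is visible through env.cpus as well:

#include <stdio.h>
#include <string.h>

struct toy_cpumask { unsigned long bits; };

struct toy_lb_env {
	struct toy_cpumask *cpus;	/* a pointer, not a copy of the mask */
};

int main(void)
{
	struct toy_cpumask storage = { .bits = 0 };
	struct toy_cpumask active  = { .bits = 0xf };
	struct toy_cpumask *cpus = &storage;

	/* env.cpus is set up before the mask is filled in ... */
	struct toy_lb_env env = { .cpus = cpus };

	/* ... but copying through 'cpus' afterwards updates the very object
	 * env.cpus points at, so find_busiest_group(&env, ...) would see
	 * the fresh mask. This mirrors cpumask_copy(cpus, cpu_active_mask). */
	memcpy(cpus, &active, sizeof(active));

	printf("env.cpus->bits = %#lx\n", env.cpus->bits);	/* prints 0xf */
	return 0;
}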

Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com



Re: [PATCH 5/6] sched, fair: Make group power more consitent

2013-08-18 Thread Preeti U Murthy
Hi Peter,

On 08/16/2013 03:42 PM, Peter Zijlstra wrote:

I have a few comments and clarification to seek.

1. How are you ensuring from this patch that sgs->group_power does not
change over the course of load balancing?

The only path to update_group_power(), where sg->sgp->power gets
updated, is from update_sg_lb_stats(). You are updating sgs->group_power
in update_sg_lb_stats(). Any change to group->sgp->power will get
reflected in sgs->group_power as well, right?

2. This point is aside from your patch. In the current implementation,
each time the cpu power gets updated in update_cpu_power(), should not
the power of the sched_groups containing that cpu also get updated?
Why wait till load balancing is done at the sched_domain level of
that group to update its group power?

Regards
Preeti U Murthy


