Re: [RFC 3/6] sched: pack small tasks

2012-11-20 Thread Morten Rasmussen
Hi Vincent,

On Mon, Nov 12, 2012 at 01:51:00PM +, Vincent Guittot wrote:
 On 9 November 2012 18:13, Morten Rasmussen morten.rasmus...@arm.com wrote:
  Hi Vincent,
 
  I have experienced suboptimal buddy selection on a dual cluster setup
  (ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at
  CPU level. This seems to be the correct flag settings for a system with
  only cluster level power gating.
 
  To me it looks like update_packing_domain() is not doing the right
  thing. See inline comments below.
 
 Hi Morten,
 
 Thanks for testing the patches.
 
 It seems that I have over-optimized the loop and removed some use cases.
 
 
  On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:
  During sched_domain creation, we define a pack buddy CPU if available.
 
  On a system that share the powerline at all level, the buddy is set to -1
 
  On a dual clusters / dual cores system which can powergate each core and
  cluster independantly, the buddy configuration will be :
| CPU0 | CPU1 | CPU2 | CPU3 |
  ---
  buddy | CPU0 | CPU0 | CPU0 | CPU2 |
 
  Small tasks tend to slip out of the periodic load balance.
  The best place to choose to migrate them is at their wake up.
 
  Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
  ---
   kernel/sched/core.c  |1 +
   kernel/sched/fair.c  |  109 
  ++
   kernel/sched/sched.h |1 +
   3 files changed, 111 insertions(+)
 
  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
  index dab7908..70cadbe 100644
  --- a/kernel/sched/core.c
  +++ b/kernel/sched/core.c
  @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
  root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
 
  + update_packing_domain(cpu);
update_top_cache_domain(cpu);
   }
 
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index 4f4a4f6..8c9d3ed 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
  @@ -157,6 +157,63 @@ void sched_init_granularity(void)
update_sysctl();
   }
 
  +
  +/*
  + * Save the id of the optimal CPU that should be used to pack small tasks
  + * The value -1 is used when no buddy has been found
  + */
  +DEFINE_PER_CPU(int, sd_pack_buddy);
  +
  +/* Look for the best buddy CPU that can be used to pack small tasks
  + * We make the assumption that it doesn't wort to pack on CPU that share 
  the
  + * same powerline. We looks for the 1st sched_domain without the
  + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the 
  lowest
  + * power per core based on the assumption that their power efficiency is
  + * better */
  +void update_packing_domain(int cpu)
  +{
  + struct sched_domain *sd;
  + int id = -1;
  +
  + sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
  + if (!sd)
  + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
  + else
  + sd = sd->parent;
  sd is the highest level where SD_SHARE_POWERLINE is enabled so the sched
  groups of the parent level would represent the power domains. If I get it
  right, we want to pack inside the cluster first and only let the first cpu
 
 You probably wanted to use sched_group instead of cluster because
 cluster is only a special use case, didn't you ?
 
  of the cluster do packing on another cluster. So all cpus - except the
  first one - in the current sched domain should find its buddy within the
  domain and only the first one should go to the parent sched domain to
  find its buddy.
 
 We don't want to pack in the current sched_domain because it shares
 power domain. We want to pack at the parent level
 

Yes. I think we mean the same thing. The packing takes place at the
parent sched_domain but the sched_group that we are looking at only
contains the cpus of the level below.

 
  I propose the following fix:
 
  -   sd = sd->parent;
  +   if (cpumask_first(sched_domain_span(sd)) == cpu
  +   || !sd->parent)
  +   sd = sd->parent;
 
 We always look for the buddy in the parent level whatever the cpu
 position in the mask is.
 
 
 
  +
  + while (sd) {
  + struct sched_group *sg = sd->groups;
  + struct sched_group *pack = sg;
  + struct sched_group *tmp = sg->next;
  +
  + /* 1st CPU of the sched domain is a good candidate */
  + if (id == -1)
  + id = cpumask_first(sched_domain_span(sd));
 
  There is no guarantee that id is in the sched group pointed to by
  sd-groups, which is implicitly assumed later in the search loop. We
  need to find the sched group that contains id and point sg to that
  instead. I haven't found an elegant way to find that group, but the fix
  below should at least give the right result.
 
  +   /* Find sched group of candidate */
  +   tmp 

Re: [RFC 3/6] sched: pack small tasks

2012-11-20 Thread Vincent Guittot
On 20 November 2012 15:28, Morten Rasmussen morten.rasmus...@arm.com wrote:
 Hi Vincent,

 On Mon, Nov 12, 2012 at 01:51:00PM +, Vincent Guittot wrote:
 On 9 November 2012 18:13, Morten Rasmussen morten.rasmus...@arm.com wrote:
  Hi Vincent,
 
  I have experienced suboptimal buddy selection on a dual cluster setup
  (ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at
  CPU level. This seems to be the correct flag settings for a system with
  only cluster level power gating.
 
  To me it looks like update_packing_domain() is not doing the right
  thing. See inline comments below.

 Hi Morten,

 Thanks for testing the patches.

 It seems that I have over-optimized the loop and removed some use cases.

 
  On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:
  During sched_domain creation, we define a pack buddy CPU if available.
 
  On a system that share the powerline at all level, the buddy is set to -1
 
  On a dual clusters / dual cores system which can powergate each core and
  cluster independantly, the buddy configuration will be :
| CPU0 | CPU1 | CPU2 | CPU3 |
  ---
  buddy | CPU0 | CPU0 | CPU0 | CPU2 |
 
  Small tasks tend to slip out of the periodic load balance.
  The best place to choose to migrate them is at their wake up.
 
  Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
  ---
   kernel/sched/core.c  |1 +
   kernel/sched/fair.c  |  109 
  ++
   kernel/sched/sched.h |1 +
   3 files changed, 111 insertions(+)
 
  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
  index dab7908..70cadbe 100644
  --- a/kernel/sched/core.c
  +++ b/kernel/sched/core.c
  @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
  root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
 
  + update_packing_domain(cpu);
update_top_cache_domain(cpu);
   }
 
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index 4f4a4f6..8c9d3ed 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
  @@ -157,6 +157,63 @@ void sched_init_granularity(void)
update_sysctl();
   }
 
  +
  +/*
  + * Save the id of the optimal CPU that should be used to pack small tasks
  + * The value -1 is used when no buddy has been found
  + */
  +DEFINE_PER_CPU(int, sd_pack_buddy);
  +
  +/* Look for the best buddy CPU that can be used to pack small tasks
  + * We make the assumption that it doesn't wort to pack on CPU that share 
  the
  + * same powerline. We looks for the 1st sched_domain without the
  + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the 
  lowest
  + * power per core based on the assumption that their power efficiency is
  + * better */
  +void update_packing_domain(int cpu)
  +{
  + struct sched_domain *sd;
  + int id = -1;
  +
  + sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
  + if (!sd)
  + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
  + else
  + sd = sd->parent;
  sd is the highest level where SD_SHARE_POWERLINE is enabled so the sched
  groups of the parent level would represent the power domains. If I get it
  right, we want to pack inside the cluster first and only let the first cpu

 You probably wanted to use sched_group instead of cluster because
 cluster is only a special use case, didn't you ?

  of the cluster do packing on another cluster. So all cpus - except the
  first one - in the current sched domain should find its buddy within the
  domain and only the first one should go to the parent sched domain to
  find its buddy.

 We don't want to pack in the current sched_domain because it shares
 power domain. We want to pack at the parent level


 Yes. I think we mean the same thing. The packing takes place at the
 parent sched_domain but the sched_group that we are looking at only
 contains the cpus of the level below.

 
  I propose the following fix:
 
  -   sd = sd->parent;
  +   if (cpumask_first(sched_domain_span(sd)) == cpu
  +   || !sd->parent)
  +   sd = sd->parent;

 We always look for the buddy in the parent level whatever the cpu
 position in the mask is.

 
 
  +
  + while (sd) {
  + struct sched_group *sg = sd->groups;
  + struct sched_group *pack = sg;
  + struct sched_group *tmp = sg->next;
  +
  + /* 1st CPU of the sched domain is a good candidate */
  + if (id == -1)
  + id = cpumask_first(sched_domain_span(sd));
 
  There is no guarantee that id is in the sched group pointed to by
  sd-groups, which is implicitly assumed later in the search loop. We
  need to find the sched group that contains id and point sg to that
  instead. I haven't found an elegant way to find that group, but the fix
  below should at least give the right result.
 
  +   

Re: [RFC 3/6] sched: pack small tasks

2012-11-12 Thread Vincent Guittot
On 2 November 2012 11:53, Santosh Shilimkar santosh.shilim...@ti.com wrote:
 On Monday 29 October 2012 06:42 PM, Vincent Guittot wrote:

 On 24 October 2012 17:20, Santosh Shilimkar santosh.shilim...@ti.com
 wrote:

 Vincent,

 Few comments/questions.


 On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:


 During sched_domain creation, we define a pack buddy CPU if available.

 On a system that share the powerline at all level, the buddy is set to
 -1

 On a dual clusters / dual cores system which can powergate each core and
 cluster independantly, the buddy configuration will be :
 | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |


  ^
 Is that a typo ? Should it be CPU2 instead of
 CPU0 ?


 No it's not a typo.
 The system packs at each scheduling level. It starts to pack in
 cluster because each core can power gate independently so CPU1 tries
 to pack its tasks in CPU0 and CPU3 in CPU2. Then, it packs at CPU
 level so CPU2 tries to pack in the cluster of CPU0 and CPU0 packs in
 itself

 I get it. Though in above example a task may migrate from say
 CPU3->CPU2->CPU0 as part of packing. I was just thinking whether
 moving such task from say CPU3 to CPU0 might be best instead.

We pack in the cluster then at CPU level. Tasks could sometimes
migrate directly to CPU0 but we would miss the case where CPU0 is busy
but CPU2 is not

Vincent
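
To make the two-step behaviour concrete, here is a small illustrative
helper (not part of the series; it only reuses the sd_pack_buddy per-CPU
variable introduced by this patch). It walks the buddy chain of the
example, CPU3 -> CPU2 -> CPU0 -> CPU0, which is the path a small task
would follow over successive wakeups; the series deliberately moves one
hop per wakeup so the busy check can stop the migration at CPU2 when
CPU0 is loaded:

    /* Illustration only: follow the buddy chain to its fixed point */
    static int follow_buddy_chain(int cpu)
    {
            int buddy = per_cpu(sd_pack_buddy, cpu);

            while (buddy != -1 && buddy != cpu) {
                    cpu = buddy;
                    buddy = per_cpu(sd_pack_buddy, cpu);
            }
            return cpu;
    }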




 Small tasks tend to slip out of the periodic load balance.
 The best place to choose to migrate them is at their wake up.

 I have tried this series since I was looking at some of these packing
 bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallary,
 I did see some additional filtering of threads with this series
 but its not making much difference in power. More on this below.


 Can I ask you which configuration you have used ? how many cores and
 cluster ?  Can they be power gated independently ?

 I have been trying with a couple of setups: a dual-core ARM machine and
 a quad-core x86 box with a single package, though most of the mobile
 workload analysis was done on the ARM machine. On both setups
 CPUs can be gated independently.




 Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
 ---
kernel/sched/core.c  |1 +
kernel/sched/fair.c  |  109
 ++
kernel/sched/sched.h |1 +
3 files changed, 111 insertions(+)

 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index dab7908..70cadbe 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct
 root_domain *rd, int cpu)
  rcu_assign_pointer(rq->sd, sd);
  destroy_sched_domains(tmp, cpu);

 +   update_packing_domain(cpu);
  update_top_cache_domain(cpu);
}

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 4f4a4f6..8c9d3ed 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -157,6 +157,63 @@ void sched_init_granularity(void)
  update_sysctl();
}

 +
 +/*
 + * Save the id of the optimal CPU that should be used to pack small
 tasks
 + * The value -1 is used when no buddy has been found
 + */
 +DEFINE_PER_CPU(int, sd_pack_buddy);
 +
 +/* Look for the best buddy CPU that can be used to pack small tasks
 + * We make the assumption that it doesn't wort to pack on CPU that
 share
 the


 s/wort/worth


 yes


 + * same powerline. We looks for the 1st sched_domain without the
 + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the
 lowest
 + * power per core based on the assumption that their power efficiency
 is
 + * better */


 Commenting style..
 /*
   *
   */


 yes

  Can you please expand on why the assumption is right ?
 it doesn't wort to pack on CPU that share the same powerline


 By share the same power-line, I mean that the CPUs can't power off
 independently. So if some CPUs can't power off independently, it's
  worth trying to use most of them to race to idle.

  In that case I suggest we use a different word here. Power line can be
  read as a voltage line or a power domain.
  Maybe SD_SHARE_CPU_POWERDOMAIN ?



  Think about a scenario where you have a quad-core, dual-cluster system

  |Cluster1|  |cluster 2|
 | CPU0 | CPU1 | CPU2 | CPU3 |   | CPU0 | CPU1 | CPU2 | CPU3 |


  Both clusters run from the same voltage rail and have the same PLL
  clocking them. But the clusters have their own power domains
  and all CPUs can power gate themselves to low power states.
  Clusters also have their own level-2 caches.

 In this case, you will still save power if you try to pack
 load on one cluster. No ?


 yes, I need to update the description of SD_SHARE_POWERLINE because
 I'm afraid I was not clear enough. SD_SHARE_POWERLINE includes the
 power gating capability of each core. For your example above, the
 SD_SHARE_POWERLINE flag should be cleared at both MC and CPU level.
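
Restating that as a flag table for the quad-core, dual-cluster example
(illustrative summary only, not code from this series):

    /*
     * Example: two clusters on one voltage rail / PLL, but with
     * per-cluster power domains and per-CPU power gating.
     *
     *   level   CPUs spanned            SD_SHARE_POWERLINE
     *   MC      cores of one cluster    cleared (cores gate independently)
     *   CPU     both clusters           cleared (clusters gate independently)
     *
     * The flag would only stay set at a level whose CPUs cannot be
     * powered off independently, e.g. SMT siblings sharing a core
     * power domain.
     */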

 Thanks for 

Re: [RFC 3/6] sched: pack small tasks

2012-11-12 Thread Vincent Guittot
On 9 November 2012 17:46, Morten Rasmussen morten.rasmus...@arm.com wrote:
 On Fri, Nov 02, 2012 at 10:53:47AM +, Santosh Shilimkar wrote:
 On Monday 29 October 2012 06:42 PM, Vincent Guittot wrote:
  On 24 October 2012 17:20, Santosh Shilimkar santosh.shilim...@ti.com 
  wrote:
  Vincent,
 
  Few comments/questions.
 
 
  On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:
 
  During sched_domain creation, we define a pack buddy CPU if available.
 
  On a system that share the powerline at all level, the buddy is set to -1
 
  On a dual clusters / dual cores system which can powergate each core and
  cluster independantly, the buddy configuration will be :
  | CPU0 | CPU1 | CPU2 | CPU3 |
  ---
  buddy | CPU0 | CPU0 | CPU0 | CPU2 |
 
   ^
  Is that a typo ? Should it be CPU2 instead of
  CPU0 ?
 
  No it's not a typo.
  The system packs at each scheduling level. It starts to pack in
  cluster because each core can power gate independently so CPU1 tries
  to pack its tasks in CPU0 and CPU3 in CPU2. Then, it packs at CPU
  level so CPU2 tries to pack in the cluster of CPU0 and CPU0 packs in
  itself
 
 I get it. Though in above example a task may migrate from say
 CPU3-CPU2-CPU0 as part of packing. I was just thinking whether
 moving such task from say CPU3 to CPU0 might be best instead.

 To me it seems suboptimal to pack the task twice, but the alternative is
 not good either. If you try to move the task directly to CPU0 you may
 miss packing opportunities if CPU0 is already busy, while CPU2 might
 have enough capacity to take it. It would probably be better to check
 the busyness of CPU0 and then back off and try CPU2 if CPU0 is busy. This
 would require a buddy list for each CPU rather than just a single buddy and
 thus might become expensive.


 
  Small tasks tend to slip out of the periodic load balance.
  The best place to choose to migrate them is at their wake up.
 
  I have tried this series since I was looking at some of these packing
  bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallary,
  I did see some additional filtering of threads with this series
  but its not making much difference in power. More on this below.
 
  Can I ask you which configuration you have used ? how many cores and
  cluster ?  Can they be power gated independently ?
 
 I have been trying with couple of setups. Dual Core ARM machine and
 Quad core X86 box with single package thought most of the mobile
 workload analysis I was doing on ARM machine. On both setups
 CPUs can be gated independently.

 
 
  Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
  ---
 kernel/sched/core.c  |1 +
 kernel/sched/fair.c  |  109
  ++
 kernel/sched/sched.h |1 +
 3 files changed, 111 insertions(+)
 
  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
  index dab7908..70cadbe 100644
  --- a/kernel/sched/core.c
  +++ b/kernel/sched/core.c
  @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct
  root_domain *rd, int cpu)
   rcu_assign_pointer(rq->sd, sd);
   destroy_sched_domains(tmp, cpu);
 
  +   update_packing_domain(cpu);
   update_top_cache_domain(cpu);
 }
 
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index 4f4a4f6..8c9d3ed 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
  @@ -157,6 +157,63 @@ void sched_init_granularity(void)
   update_sysctl();
 }
 
  +
  +/*
  + * Save the id of the optimal CPU that should be used to pack small 
  tasks
  + * The value -1 is used when no buddy has been found
  + */
  +DEFINE_PER_CPU(int, sd_pack_buddy);
  +
  +/* Look for the best buddy CPU that can be used to pack small tasks
  + * We make the assumption that it doesn't wort to pack on CPU that share
  the
 
  s/wort/worth
 
  yes
 
 
  + * same powerline. We looks for the 1st sched_domain without the
  + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the
  lowest
  + * power per core based on the assumption that their power efficiency is
  + * better */
 
  Commenting style..
  /*
*
*/
 
 
  yes
 
  Can you please expand the why the assumption is right ?
  it doesn't wort to pack on CPU that share the same powerline
 
  By share the same power-line, I mean that the CPUs can't power off
  independently. So if some CPUs can't power off independently, it's
  worth to try to use most of them to race to idle.
 
 In that case I suggest we use different word here. Power line can be
 treated as voltage line, power domain.
 May be SD_SHARE_CPU_POWERDOMAIN ?


 How about just SD_SHARE_POWERDOMAIN ?

It looks better than SD_SHARE_POWERLINE. I will replace the name


 
  Think about a scenario where you have quad core, ducal cluster system
 
   |Cluster1|  |cluster 2|
  | CPU0 | CPU1 | CPU2 | CPU3 |   | CPU0 | CPU1 | CPU2 | CPU3 |
 
 
  Both clusters 

Re: [RFC 3/6] sched: pack small tasks

2012-11-12 Thread Vincent Guittot
On 9 November 2012 18:13, Morten Rasmussen morten.rasmus...@arm.com wrote:
 Hi Vincent,

 I have experienced suboptimal buddy selection on a dual cluster setup
 (ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at
 CPU level. This seems to be the correct flag settings for a system with
 only cluster level power gating.

 To me it looks like update_packing_domain() is not doing the right
 thing. See inline comments below.

Hi Morten,

Thanks for testing the patches.

It seems that I have over-optimized the loop and removed some use cases.


 On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:
 During sched_domain creation, we define a pack buddy CPU if available.

 On a system that share the powerline at all level, the buddy is set to -1

 On a dual clusters / dual cores system which can powergate each core and
 cluster independantly, the buddy configuration will be :
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

 Small tasks tend to slip out of the periodic load balance.
 The best place to choose to migrate them is at their wake up.

 Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
 ---
  kernel/sched/core.c  |1 +
  kernel/sched/fair.c  |  109 
 ++
  kernel/sched/sched.h |1 +
  3 files changed, 111 insertions(+)

 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index dab7908..70cadbe 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
 root_domain *rd, int cpu)
   rcu_assign_pointer(rq->sd, sd);
   destroy_sched_domains(tmp, cpu);

 + update_packing_domain(cpu);
   update_top_cache_domain(cpu);
  }

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 4f4a4f6..8c9d3ed 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -157,6 +157,63 @@ void sched_init_granularity(void)
   update_sysctl();
  }

 +
 +/*
 + * Save the id of the optimal CPU that should be used to pack small tasks
 + * The value -1 is used when no buddy has been found
 + */
 +DEFINE_PER_CPU(int, sd_pack_buddy);
 +
 +/* Look for the best buddy CPU that can be used to pack small tasks
 + * We make the assumption that it doesn't wort to pack on CPU that share the
 + * same powerline. We looks for the 1st sched_domain without the
 + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the 
 lowest
 + * power per core based on the assumption that their power efficiency is
 + * better */
 +void update_packing_domain(int cpu)
 +{
 + struct sched_domain *sd;
 + int id = -1;
 +
 + sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
 + if (!sd)
 + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
 + else
 + sd = sd->parent;
 sd is the highest level where SD_SHARE_POWERLINE is enabled so the sched
 groups of the parent level would represent the power domains. If I get it
 right, we want to pack inside the cluster first and only let the first cpu

You probably wanted to use sched_group instead of cluster because
cluster is only a special use case, didn't you ?

 of the cluster do packing on another cluster. So all cpus - except the
 first one - in the current sched domain should find its buddy within the
 domain and only the first one should go to the parent sched domain to
 find its buddy.

We don't want to pack in the current sched_domain because it shares
power domain. We want to pack at the parent level


 I propose the following fix:

 -   sd = sd->parent;
 +   if (cpumask_first(sched_domain_span(sd)) == cpu
 +   || !sd->parent)
 +   sd = sd->parent;

We always look for the buddy in the parent level whatever the cpu
position in the mask is.



 +
 + while (sd) {
 + struct sched_group *sg = sd->groups;
 + struct sched_group *pack = sg;
 + struct sched_group *tmp = sg->next;
 +
 + /* 1st CPU of the sched domain is a good candidate */
 + if (id == -1)
 + id = cpumask_first(sched_domain_span(sd));

 There is no guarantee that id is in the sched group pointed to by
 sd-groups, which is implicitly assumed later in the search loop. We
 need to find the sched group that contains id and point sg to that
 instead. I haven't found an elegant way to find that group, but the fix
 below should at least give the right result.

 +   /* Find sched group of candidate */
 +   tmp = sd->groups;
 +   do {
 +   if (cpumask_test_cpu(id, sched_group_cpus(tmp)))
 +   {
 +   sg = tmp;
 +   break;
 +   }
 +   } while (tmp = tmp->next, tmp != sd->groups);
 +
 +   pack = sg;
 +   tmp = sg->next;


I have a new 

Re: [RFC 3/6] sched: pack small tasks

2012-11-09 Thread Morten Rasmussen
On Fri, Nov 02, 2012 at 10:53:47AM +, Santosh Shilimkar wrote:
 On Monday 29 October 2012 06:42 PM, Vincent Guittot wrote:
  On 24 October 2012 17:20, Santosh Shilimkar santosh.shilim...@ti.com 
  wrote:
  Vincent,
 
  Few comments/questions.
 
 
  On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:
 
  During sched_domain creation, we define a pack buddy CPU if available.
 
  On a system that share the powerline at all level, the buddy is set to -1
 
  On a dual clusters / dual cores system which can powergate each core and
  cluster independantly, the buddy configuration will be :
  | CPU0 | CPU1 | CPU2 | CPU3 |
  ---
  buddy | CPU0 | CPU0 | CPU0 | CPU2 |
 
   ^
  Is that a typo ? Should it be CPU2 instead of
  CPU0 ?
 
  No it's not a typo.
  The system packs at each scheduling level. It starts to pack in
  cluster because each core can power gate independently so CPU1 tries
  to pack its tasks in CPU0 and CPU3 in CPU2. Then, it packs at CPU
  level so CPU2 tries to pack in the cluster of CPU0 and CPU0 packs in
  itself
 
 I get it. Though in above example a task may migrate from say
 CPU3-CPU2-CPU0 as part of packing. I was just thinking whether
 moving such task from say CPU3 to CPU0 might be best instead.

To me it seems suboptimal to pack the task twice, but the alternative is
not good either. If you try to move the task directly to CPU0 you may
miss packing opportunities if CPU0 is already busy, while CPU2 might
have enough capacity to take it. It would probably be better to check
the business of CPU0 and then back off and try CPU2 if CP0 is busy. This
would require a buddy list for each CPU rather just a single buddy and
thus might become expensive.
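
A rough sketch of what such a buddy list could look like (hypothetical,
not part of this series: MAX_PACK_BUDDIES, sd_pack_buddies and
find_pack_cpu() are made-up names, while is_buddy_busy() is the helper
from this patch). For CPU3 in the example the list would be
{ CPU0, CPU2 }: try the final target first and fall back to the
in-cluster buddy if it is busy.

    #define MAX_PACK_BUDDIES 2

    /* Hypothetical per-CPU buddy lists, ordered by preference, -1 padded */
    static int sd_pack_buddies[NR_CPUS][MAX_PACK_BUDDIES];

    static int find_pack_cpu(int cpu)
    {
            int i, buddy;

            for (i = 0; i < MAX_PACK_BUDDIES; i++) {
                    buddy = sd_pack_buddies[cpu][i];
                    if (buddy == -1)
                            break;
                    /* take the first buddy that is not overloaded */
                    if (!is_buddy_busy(buddy))
                            return buddy;
            }
            return cpu;     /* every buddy is busy: keep the task local */
    }

The cost is the extra is_buddy_busy() check on every wakeup, which is
the expense mentioned above.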

 
 
  Small tasks tend to slip out of the periodic load balance.
  The best place to choose to migrate them is at their wake up.
 
  I have tried this series since I was looking at some of these packing
  bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallary,
  I did see some additional filtering of threads with this series
  but its not making much difference in power. More on this below.
 
  Can I ask you which configuration you have used ? how many cores and
  cluster ?  Can they be power gated independently ?
 
 I have been trying with couple of setups. Dual Core ARM machine and
 Quad core X86 box with single package thought most of the mobile
 workload analysis I was doing on ARM machine. On both setups
 CPUs can be gated independently.
 
 
 
  Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
  ---
 kernel/sched/core.c  |1 +
 kernel/sched/fair.c  |  109
  ++
 kernel/sched/sched.h |1 +
 3 files changed, 111 insertions(+)
 
  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
  index dab7908..70cadbe 100644
  --- a/kernel/sched/core.c
  +++ b/kernel/sched/core.c
  @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct
  root_domain *rd, int cpu)
   rcu_assign_pointer(rq->sd, sd);
   destroy_sched_domains(tmp, cpu);
 
  +   update_packing_domain(cpu);
   update_top_cache_domain(cpu);
 }
 
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index 4f4a4f6..8c9d3ed 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
  @@ -157,6 +157,63 @@ void sched_init_granularity(void)
   update_sysctl();
 }
 
  +
  +/*
  + * Save the id of the optimal CPU that should be used to pack small tasks
  + * The value -1 is used when no buddy has been found
  + */
  +DEFINE_PER_CPU(int, sd_pack_buddy);
  +
  +/* Look for the best buddy CPU that can be used to pack small tasks
  + * We make the assumption that it doesn't wort to pack on CPU that share
  the
 
  s/wort/worth
 
  yes
 
 
  + * same powerline. We looks for the 1st sched_domain without the
  + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the
  lowest
  + * power per core based on the assumption that their power efficiency is
  + * better */
 
  Commenting style..
  /*
*
*/
 
 
  yes
 
  Can you please expand the why the assumption is right ?
  it doesn't wort to pack on CPU that share the same powerline
 
  By share the same power-line, I mean that the CPUs can't power off
  independently. So if some CPUs can't power off independently, it's
  worth to try to use most of them to race to idle.
 
 In that case I suggest we use different word here. Power line can be
 treated as voltage line, power domain.
 May be SD_SHARE_CPU_POWERDOMAIN ?
 

How about just SD_SHARE_POWERDOMAIN ?

 
  Think about a scenario where you have quad core, ducal cluster system
 
   |Cluster1|  |cluster 2|
  | CPU0 | CPU1 | CPU2 | CPU3 |   | CPU0 | CPU1 | CPU2 | CPU3 |
 
 
  Both clusters run from same voltage rail and have same PLL
  clocking them. But the cluster have their own power domain
  and all CPU's can power gate them-self to 

Re: [RFC 3/6] sched: pack small tasks

2012-11-09 Thread Morten Rasmussen
Hi Vincent,

I have experienced suboptimal buddy selection on a dual cluster setup
(ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at
CPU level. This seems to be the correct flag settings for a system with
only cluster level power gating.

To me it looks like update_packing_domain() is not doing the right
thing. See inline comments below.

On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:
 During sched_domain creation, we define a pack buddy CPU if available.
 
 On a system that share the powerline at all level, the buddy is set to -1
 
 On a dual clusters / dual cores system which can powergate each core and
 cluster independantly, the buddy configuration will be :
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |
 
 Small tasks tend to slip out of the periodic load balance.
 The best place to choose to migrate them is at their wake up.
 
 Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
 ---
  kernel/sched/core.c  |1 +
  kernel/sched/fair.c  |  109 
 ++
  kernel/sched/sched.h |1 +
  3 files changed, 111 insertions(+)
 
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index dab7908..70cadbe 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
 root_domain *rd, int cpu)
  rcu_assign_pointer(rq->sd, sd);
   destroy_sched_domains(tmp, cpu);
  
 + update_packing_domain(cpu);
   update_top_cache_domain(cpu);
  }
  
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 4f4a4f6..8c9d3ed 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -157,6 +157,63 @@ void sched_init_granularity(void)
   update_sysctl();
  }
  
 +
 +/*
 + * Save the id of the optimal CPU that should be used to pack small tasks
 + * The value -1 is used when no buddy has been found
 + */
 +DEFINE_PER_CPU(int, sd_pack_buddy);
 +
 +/* Look for the best buddy CPU that can be used to pack small tasks
 + * We make the assumption that it doesn't wort to pack on CPU that share the
 + * same powerline. We looks for the 1st sched_domain without the
 + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the lowest
 + * power per core based on the assumption that their power efficiency is
 + * better */
 +void update_packing_domain(int cpu)
 +{
 + struct sched_domain *sd;
 + int id = -1;
 +
 + sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
 + if (!sd)
 + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
 + else
 + sd = sd->parent;
sd is the highest level where SD_SHARE_POWERLINE is enabled so the sched
groups of the parent level would represent the power domains. If I get it
right, we want to pack inside the cluster first and only let the first cpu
of the cluster do packing on another cluster. So all cpus - except the
first one - in the current sched domain should find its buddy within the
domain and only the first one should go to the parent sched domain to
find its buddy.

I propose the following fix:

-   sd = sd->parent;
+   if (cpumask_first(sched_domain_span(sd)) == cpu
+   || !sd->parent)
+   sd = sd->parent;


 +
 + while (sd) {
 + struct sched_group *sg = sd-groups;
 + struct sched_group *pack = sg;
 + struct sched_group *tmp = sg-next;
 +
 + /* 1st CPU of the sched domain is a good candidate */
 + if (id == -1)
 + id = cpumask_first(sched_domain_span(sd));

There is no guarantee that id is in the sched group pointed to by
sd-groups, which is implicitly assumed later in the search loop. We
need to find the sched group that contains id and point sg to that
instead. I haven't found an elegant way to find that group, but the fix
below should at least give the right result.

+   /* Find sched group of candidate */
+   tmp = sd->groups;
+   do {
+   if (cpumask_test_cpu(id, sched_group_cpus(tmp)))
+   {
+   sg = tmp;
+   break;
+   }
+   } while (tmp = tmp->next, tmp != sd->groups);
+
+   pack = sg;
+   tmp = sg->next;

Regards,
Morten

 +
 + /* loop the sched groups to find the best one */
 + while (tmp != sg) {
 + if (tmp->sgp->power * sg->group_weight <
 + sg->sgp->power * tmp->group_weight)
 + pack = tmp;
 + tmp = tmp->next;
 + }
 +
 + /* we have found a better group */
 + if (pack != sg)
 + id = cpumask_first(sched_group_cpus(pack));
 +
 + /* Look for another CPU than itself */
 + 

Re: [RFC 3/6] sched: pack small tasks

2012-11-02 Thread Santosh Shilimkar

On Monday 29 October 2012 06:42 PM, Vincent Guittot wrote:

On 24 October 2012 17:20, Santosh Shilimkar santosh.shilim...@ti.com wrote:

Vincent,

Few comments/questions.


On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:


During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and
cluster independantly, the buddy configuration will be :
| CPU0 | CPU1 | CPU2 | CPU3 |
---
buddy | CPU0 | CPU0 | CPU0 | CPU2 |


 ^
Is that a typo ? Should it be CPU2 instead of
CPU0 ?


No it's not a typo.
The system packs at each scheduling level. It starts to pack in
cluster because each core can power gate independently so CPU1 tries
to pack its tasks in CPU0 and CPU3 in CPU2. Then, it packs at CPU
level so CPU2 tries to pack in the cluster of CPU0 and CPU0 packs in
itself


I get it. Though in above example a task may migrate from say
CPU3->CPU2->CPU0 as part of packing. I was just thinking whether
moving such task from say CPU3 to CPU0 might be best instead.




Small tasks tend to slip out of the periodic load balance.
The best place to choose to migrate them is at their wake up.


I have tried this series since I was looking at some of these packing
bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallary,
I did see some additional filtering of threads with this series
but its not making much difference in power. More on this below.


Can I ask you which configuration you have used ? how many cores and
cluster ?  Can they be power gated independently ?


I have been trying with a couple of setups: a dual-core ARM machine and
a quad-core x86 box with a single package, though most of the mobile
workload analysis was done on the ARM machine. On both setups
CPUs can be gated independently.





Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
   kernel/sched/core.c  |1 +
   kernel/sched/fair.c  |  109
++
   kernel/sched/sched.h |1 +
   3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dab7908..70cadbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct
root_domain *rd, int cpu)
 rcu_assign_pointer(rq->sd, sd);
 destroy_sched_domains(tmp, cpu);

+   update_packing_domain(cpu);
 update_top_cache_domain(cpu);
   }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f4a4f6..8c9d3ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -157,6 +157,63 @@ void sched_init_granularity(void)
 update_sysctl();
   }

+
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/* Look for the best buddy CPU that can be used to pack small tasks
+ * We make the assumption that it doesn't wort to pack on CPU that share
the


s/wort/worth


yes




+ * same powerline. We looks for the 1st sched_domain without the
+ * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the
lowest
+ * power per core based on the assumption that their power efficiency is
+ * better */


Commenting style..
/*
  *
  */



yes


Can you please expand the why the assumption is right ?
it doesn't wort to pack on CPU that share the same powerline


By share the same power-line, I mean that the CPUs can't power off
independently. So if some CPUs can't power off independently, it's
worth to try to use most of them to race to idle.


In that case I suggest we use a different word here. Power line can be
read as a voltage line or a power domain.
Maybe SD_SHARE_CPU_POWERDOMAIN ?



Think about a scenario where you have quad core, ducal cluster system

 |Cluster1|  |cluster 2|
| CPU0 | CPU1 | CPU2 | CPU3 |   | CPU0 | CPU1 | CPU2 | CPU3 |


Both clusters run from same voltage rail and have same PLL
clocking them. But the cluster have their own power domain
and all CPU's can power gate them-self to low power states.
Clusters also have their own level2 caches.

In this case, you will still save power if you try to pack
load on one cluster. No ?


yes, I need to update the description of SD_SHARE_POWERLINE because
I'm afraid I was not clear enough. SD_SHARE_POWERLINE includes the
power gating capacity of each core. For your example above, the
SD_SHARE_POWERLINE shoud be cleared at both MC and CPU level.


Thanks for clarification.





+void update_packing_domain(int cpu)
+{
+   struct sched_domain *sd;
+   int id = -1;
+
+   sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
+   if (!sd)
+   sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+   else
+   sd = sd->parent;
+
+   while (sd) {
+ 

Re: [RFC 3/6] sched: pack small tasks

2012-10-29 Thread Vincent Guittot
On 24 October 2012 17:20, Santosh Shilimkar santosh.shilim...@ti.com wrote:
 Vincent,

 Few comments/questions.


 On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

 During sched_domain creation, we define a pack buddy CPU if available.

 On a system that share the powerline at all level, the buddy is set to -1

 On a dual clusters / dual cores system which can powergate each core and
 cluster independantly, the buddy configuration will be :
| CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

 ^
 Is that a typo ? Should it be CPU2 instead of
 CPU0 ?

No it's not a typo.
The system packs at each scheduling level. It starts to pack in
cluster because each core can power gate independently so CPU1 tries
to pack its tasks in CPU0 and CPU3 in CPU2. Then, it packs at CPU
level so CPU2 tries to pack in the cluster of CPU0 and CPU0 packs in
itself



 Small tasks tend to slip out of the periodic load balance.
 The best place to choose to migrate them is at their wake up.

 I have tried this series since I was looking at some of these packing
 bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallary,
 I did see some additional filtering of threads with this series
 but its not making much difference in power. More on this below.

Can I ask you which configuration you have used ? How many cores and
clusters ? Can they be power gated independently ?



 Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
 ---
   kernel/sched/core.c  |1 +
   kernel/sched/fair.c  |  109
 ++
   kernel/sched/sched.h |1 +
   3 files changed, 111 insertions(+)

 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index dab7908..70cadbe 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct
 root_domain *rd, int cpu)
  rcu_assign_pointer(rq->sd, sd);
 destroy_sched_domains(tmp, cpu);

 +   update_packing_domain(cpu);
 update_top_cache_domain(cpu);
   }

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 4f4a4f6..8c9d3ed 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -157,6 +157,63 @@ void sched_init_granularity(void)
 update_sysctl();
   }

 +
 +/*
 + * Save the id of the optimal CPU that should be used to pack small tasks
 + * The value -1 is used when no buddy has been found
 + */
 +DEFINE_PER_CPU(int, sd_pack_buddy);
 +
 +/* Look for the best buddy CPU that can be used to pack small tasks
 + * We make the assumption that it doesn't wort to pack on CPU that share
 the

 s/wort/worth

yes


 + * same powerline. We looks for the 1st sched_domain without the
 + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the
 lowest
 + * power per core based on the assumption that their power efficiency is
 + * better */

 Commenting style..
 /*
  *
  */


yes

 Can you please expand the why the assumption is right ?
 it doesn't wort to pack on CPU that share the same powerline

By share the same power-line, I mean that the CPUs can't power off
independently. So if some CPUs can't power off independently, it's
worth trying to use most of them to race to idle.


 Think about a scenario where you have quad core, ducal cluster system

 |Cluster1|  |cluster 2|
 | CPU0 | CPU1 | CPU2 | CPU3 |   | CPU0 | CPU1 | CPU2 | CPU3 |


 Both clusters run from same voltage rail and have same PLL
 clocking them. But the cluster have their own power domain
 and all CPU's can power gate them-self to low power states.
 Clusters also have their own level2 caches.

 In this case, you will still save power if you try to pack
 load on one cluster. No ?

yes, I need to update the description of SD_SHARE_POWERLINE because
I'm afraid I was not clear enough. SD_SHARE_POWERLINE includes the
power gating capability of each core. For your example above, the
SD_SHARE_POWERLINE flag should be cleared at both MC and CPU level.



 +void update_packing_domain(int cpu)
 +{
 +   struct sched_domain *sd;
 +   int id = -1;
 +
 +   sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
 +   if (!sd)
 +   sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
 +   else
 +   sd = sd->parent;
 +
 +   while (sd) {
 +   struct sched_group *sg = sd->groups;
 +   struct sched_group *pack = sg;
 +   struct sched_group *tmp = sg->next;
 +
 +   /* 1st CPU of the sched domain is a good candidate */
 +   if (id == -1)
 +   id = cpumask_first(sched_domain_span(sd));
 +
 +   /* loop the sched groups to find the best one */
 +   while (tmp != sg) {
 +   if (tmp->sgp->power * sg->group_weight <
 +   sg->sgp->power * tmp->group_weight)
 +   

Re: [RFC 3/6] sched: pack small tasks

2012-10-24 Thread Santosh Shilimkar

Vincent,

Few comments/questions.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and
cluster independantly, the buddy configuration will be :
   | CPU0 | CPU1 | CPU2 | CPU3 |
---
buddy | CPU0 | CPU0 | CPU0 | CPU2 |

^
Is that a typo ? Should it be CPU2 instead of
CPU0 ?


Small tasks tend to slip out of the periodic load balance.
The best place to choose to migrate them is at their wake up.


I have tried this series since I was looking at some of these packing
bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallery,
I did see some additional filtering of threads with this series
but it's not making much difference in power. More on this below.


Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
  kernel/sched/core.c  |1 +
  kernel/sched/fair.c  |  109 ++
  kernel/sched/sched.h |1 +
  3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dab7908..70cadbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);

+   update_packing_domain(cpu);
update_top_cache_domain(cpu);
  }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f4a4f6..8c9d3ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -157,6 +157,63 @@ void sched_init_granularity(void)
update_sysctl();
  }

+
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/* Look for the best buddy CPU that can be used to pack small tasks
+ * We make the assumption that it doesn't wort to pack on CPU that share the

s/wort/worth

+ * same powerline. We looks for the 1st sched_domain without the
+ * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the lowest
+ * power per core based on the assumption that their power efficiency is
+ * better */

Commenting style..
/*
 *
 */

Can you please expand on why the assumption is right ?
it doesn't wort to pack on CPU that share the same powerline

Think about a scenario where you have a quad-core, dual-cluster system

|Cluster1|  |cluster 2|
| CPU0 | CPU1 | CPU2 | CPU3 |   | CPU0 | CPU1 | CPU2 | CPU3 |


Both clusters run from the same voltage rail and have the same PLL
clocking them. But the clusters have their own power domains
and all CPUs can power gate themselves to low power states.
Clusters also have their own level-2 caches.

In this case, you will still save power if you try to pack
load on one cluster. No ?


+void update_packing_domain(int cpu)
+{
+   struct sched_domain *sd;
+   int id = -1;
+
+   sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
+   if (!sd)
+   sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+   else
+   sd = sd->parent;
+
+   while (sd) {
+   struct sched_group *sg = sd->groups;
+   struct sched_group *pack = sg;
+   struct sched_group *tmp = sg->next;
+
+   /* 1st CPU of the sched domain is a good candidate */
+   if (id == -1)
+   id = cpumask_first(sched_domain_span(sd));
+
+   /* loop the sched groups to find the best one */
+   while (tmp != sg) {
+   if (tmp->sgp->power * sg->group_weight <
+   sg->sgp->power * tmp->group_weight)
+   pack = tmp;
+   tmp = tmp->next;
+   }
+
+   /* we have found a better group */
+   if (pack != sg)
+   id = cpumask_first(sched_group_cpus(pack));
+
+   /* Look for another CPU than itself */
+   if ((id != cpu)
+|| ((sd->parent) && !(sd->parent->flags & SD_LOAD_BALANCE)))

Is the condition !(sd->parent->flags & SD_LOAD_BALANCE) for
big.LITTLE kind of system where SD_LOAD_BALANCE may not be used ?


+   break;
+
+   sd = sd->parent;
+   }
+
+   pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);
+   per_cpu(sd_pack_buddy, cpu) = id;
+}
+
  #if BITS_PER_LONG == 32
  # define WMULT_CONST  (~0UL)
  #else
@@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, 
int target)
return target;
  }

+static inline bool is_buddy_busy(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+
+   /*
+* A busy buddy is a CPU with a high load or a small load with a lot of
+   

[RFC 3/6] sched: pack small tasks

2012-10-07 Thread Vincent Guittot
During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and
cluster independantly, the buddy configuration will be :
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance.
The best place to choose to migrate them is at their wake up.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/core.c  |1 +
 kernel/sched/fair.c  |  109 ++
 kernel/sched/sched.h |1 +
 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dab7908..70cadbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
 
+   update_packing_domain(cpu);
update_top_cache_domain(cpu);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f4a4f6..8c9d3ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -157,6 +157,63 @@ void sched_init_granularity(void)
update_sysctl();
 }
 
+
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/* Look for the best buddy CPU that can be used to pack small tasks
+ * We make the assumption that it doesn't wort to pack on CPU that share the
+ * same powerline. We looks for the 1st sched_domain without the
+ * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the lowest
+ * power per core based on the assumption that their power efficiency is
+ * better */
+void update_packing_domain(int cpu)
+{
+   struct sched_domain *sd;
+   int id = -1;
+
+   sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
+   if (!sd)
+   sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+   else
+   sd = sd->parent;
+
+   while (sd) {
+   struct sched_group *sg = sd->groups;
+   struct sched_group *pack = sg;
+   struct sched_group *tmp = sg->next;
+
+   /* 1st CPU of the sched domain is a good candidate */
+   if (id == -1)
+   id = cpumask_first(sched_domain_span(sd));
+
+   /* loop the sched groups to find the best one */
+   while (tmp != sg) {
+   if (tmp->sgp->power * sg->group_weight <
+   sg->sgp->power * tmp->group_weight)
+   pack = tmp;
+   tmp = tmp->next;
+   }
+
+   /* we have found a better group */
+   if (pack != sg)
+   id = cpumask_first(sched_group_cpus(pack));
+
+   /* Look for another CPU than itself */
+   if ((id != cpu)
+|| ((sd->parent) && !(sd->parent->flags & SD_LOAD_BALANCE)))
+   break;
+
+   sd = sd->parent;
+   }
+
+   pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);
+   per_cpu(sd_pack_buddy, cpu) = id;
+}
+
 #if BITS_PER_LONG == 32
 # define WMULT_CONST   (~0UL)
 #else
@@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, 
int target)
return target;
 }
 
+static inline bool is_buddy_busy(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+
+   /*
+* A busy buddy is a CPU with a high load or a small load with a lot of
+* running tasks.
+*/
+   return ((rq->avg.usage_avg_sum << rq->nr_running) >
+   rq->avg.runnable_avg_period);
+}
+
+static inline bool is_light_task(struct task_struct *p)
+{
+   /* A light task runs less than 25% in average */
+   return ((p->se.avg.usage_avg_sum << 2) < p->se.avg.runnable_avg_period);
+}
+
+static int check_pack_buddy(int cpu, struct task_struct *p)
+{
+   int buddy = per_cpu(sd_pack_buddy, cpu);
+
+   /* No pack buddy for this CPU */
+   if (buddy == -1)
+   return false;
+
+   /*
+* If a task is waiting for running on the CPU which is its own buddy,
+* let the default behavior to look for a better CPU if available
+* The threshold has been set to 37.5%
+*/
+   if ((buddy == cpu)
+ && ((p->se.avg.usage_avg_sum << 3) < (p->se.avg.runnable_avg_sum * 5)))
+   return false;
+
+   /* buddy is not an allowed CPU */
+   if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
+   return false;
+
+   /*
+* If the task is a small one and the buddy is not overloaded,
+* we use buddy cpu
+*/
+if (!is_light_task(p) || is_buddy_busy(buddy))
+