Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-18 Thread Alex Shi
On 01/17/2013 01:47 PM, Namhyung Kim wrote:
> On Wed, 16 Jan 2013 14:27:30 +, Morten Rasmussen wrote:
>> On Wed, Jan 16, 2013 at 06:02:21AM +, Alex Shi wrote:
>>> On 01/15/2013 12:09 AM, Morten Rasmussen wrote:
>>>> On Fri, Jan 11, 2013 at 07:08:45AM +, Alex Shi wrote:
>>>>> On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
>>>>> For power consideration scenario, it ask task number less than Lcpu
>>>>> number, don't care the load weight, since whatever the load weight, the
>>>>> task only can burn one LCPU.
>>>>>
>>>>
>>>> True, but you miss the opportunities for power saving when you have many
>>>> light tasks (> LCPU). Currently, the sd_utils < threshold check will go
>>>> for SCHED_POLICY_PERFORMANCE if the number tasks (sd_utils) is greater
>>>> than the domain weight/capacity irrespective of the actual load caused
>>>> by those tasks.
>>>>
>>>> If you used tracked task load weight for sd_utils instead you would be
>>>> able to go for power saving in scenarios with many light tasks as well.
>>>
>>> yes, that's right on power consideration. but for performance consider,
>>> it's better to spread tasks on different LCPU to save CS cost. And if
>>> the cpu usage is nearly full, we don't know if some tasks real want more
>>> cpu time.
>>
>> If the cpu is nearly full according to its tracked load it should not be
>> used for packing more tasks. It is the nearly idle scenario that I am
>> more interested in. If you have lots of task with tracked load <10% then
>> why not pack them. The performance impact should be minimal.

I had tried using runnable utilisation in several ways, including one
similar to what the regular balance does later. But as I discussed with
Mike before, burst wakeups leave no time for the utilisation to
accumulate, so tasks were packed onto a few cpus and then pulled away
again by the regular balance, which made many performance benchmarks
drop a lot.

So I'd rather assume that a brand new task will stay busy. If it does
not, the regular balance can still pull it away later.
>>
>> Furthermore, nr_running is just a snapshot of the current runqueue
>> status. The combination of runnable and blocked load should give a
>> better overall view of the cpu loads.
> 
> I have a feeling that power aware scheduling policy has to deal only
> with the utilization.  Of course it only works under a certain threshold
> and if it's exceeded must be changed to other policy which cares the
> load weight/average.  Just throwing an idea. :)
> 
>>
>>> Even in the power sched policy, we still want to get better performance
>>> if it's possible. :)
>>
>> I agree if it comes for free in terms of power. In my opinion it is
>> acceptable to sacrifice a bit of performance to save power when using a
>> power sched policy as long as the performance regression can be
>> justified by the power savings. It will of course depend on the system
>> and its usage how trade-off power and performance. My point is just that
>> with multiple sched policies (performance, balance and power as you
>> propose) it should be acceptable to focus on power for the power policy
>> and let users that only/mostly care about performance use the balance or
>> performance policy.
> 
> Agreed.
> 

Firstly, I hope the 'balance' policy can be used widely on servers, so
it's better not to hurt performance.

Secondly, 'race to idle' is one of the patchset's assumptions: if we can
finish the tasks earlier, we can save more power.

Last but not least, once the patchset is merged we can do more tuning on
the 'power' policy. :)

>>>> Thanks for clarifying. To the best of my knowledge there are no
>>>> guidelines for how to specify cpu power so it may be a bit dangerous to
>>>> assume that capacity < weight when capacity is based on cpu power.
>>>
>>> Sure. I also just got them from code. and don't know other arch how to
>>> different them.
>>> but currently, seems this cpu power concept works fine.
>>
>> Yes, it seems to work fine for your test platform. I just want to
>> highlight that the assumption you make might not be valid for other
>> architectures. I know that cpu power is not widely used, but that may
>> change with the increasing focus on power aware scheduling.

cpu_power is defined and used in generic code, and I see arm and
powerpc touching it quite a bit in their own arch code.

Anyway, would you like to share which architectures don't fit this
assumption?
> 
> AFAIK on ARM big.LITTLE, a big cpu will have a cpu power more than
> 1024.  I'm sure Morten knows way more than me on this. :)
> 
> Thanks,
> Namhyung
> 


-- 
Thanks
Alex


Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-16 Thread Namhyung Kim
On Wed, 16 Jan 2013 14:27:30 +, Morten Rasmussen wrote:
> On Wed, Jan 16, 2013 at 06:02:21AM +, Alex Shi wrote:
>> On 01/15/2013 12:09 AM, Morten Rasmussen wrote:
>> > On Fri, Jan 11, 2013 at 07:08:45AM +, Alex Shi wrote:
>> >> On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
>> >> For power consideration scenario, it ask task number less than Lcpu
>> >> number, don't care the load weight, since whatever the load weight, the
>> >> task only can burn one LCPU.
>> >>
>> > 
>> > True, but you miss the opportunities for power saving when you have many
>> > light tasks (> LCPU). Currently, the sd_utils < threshold check will go
>> > for SCHED_POLICY_PERFORMANCE if the number tasks (sd_utils) is greater
>> > than the domain weight/capacity irrespective of the actual load caused
>> > by those tasks.
>> > 
>> > If you used tracked task load weight for sd_utils instead you would be
>> > able to go for power saving in scenarios with many light tasks as well.
>> 
>> yes, that's right on power consideration. but for performance consider,
>> it's better to spread tasks on different LCPU to save CS cost. And if
>> the cpu usage is nearly full, we don't know if some tasks real want more
>> cpu time.
>
> If the cpu is nearly full according to its tracked load it should not be
> used for packing more tasks. It is the nearly idle scenario that I am
> more interested in. If you have lots of task with tracked load <10% then
> why not pack them. The performance impact should be minimal.
>
> Furthermore, nr_running is just a snapshot of the current runqueue
> status. The combination of runnable and blocked load should give a
> better overall view of the cpu loads.

I have a feeling that a power aware scheduling policy should deal only
with the utilization.  Of course it only works under a certain threshold,
and once that is exceeded it must switch to another policy which cares
about the load weight/average.  Just throwing out an idea. :)
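
To make that concrete, a rough sketch of the fallback (the helper name is
made up; only sched_policy and the SCHED_POLICY_* constants come from the
posted series):

	/*
	 * Hypothetical helper: rely on utilization alone while the domain
	 * stays under its threshold; once the threshold is exceeded, fall
	 * back to the load-weight based behaviour of the performance policy.
	 */
	static int effective_policy(unsigned int sd_utils, unsigned long threshold)
	{
		if (sched_policy != SCHED_POLICY_PERFORMANCE && sd_utils < threshold)
			return sched_policy;		/* pack on utilization */

		return SCHED_POLICY_PERFORMANCE;	/* care about load weight */
	}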

>
>> Even in the power sched policy, we still want to get better performance
>> if it's possible. :)
>
> I agree if it comes for free in terms of power. In my opinion it is
> acceptable to sacrifice a bit of performance to save power when using a
> power sched policy as long as the performance regression can be
> justified by the power savings. It will of course depend on the system
> and its usage how trade-off power and performance. My point is just that
> with multiple sched policies (performance, balance and power as you
> propose) it should be acceptable to focus on power for the power policy
> and let users that only/mostly care about performance use the balance or
> performance policy.

Agreed.

>
>> > 
>> >>>> +
>> >>>> +   if (sched_policy == SCHED_POLICY_POWERSAVING)
>> >>>> +   threshold = sgs.group_weight;
>> >>>> +   else
>> >>>> +   threshold = sgs.group_capacity;
>> >>>
>> >>> Is group_capacity larger or smaller than group_weight on your platform?
>> >>
>> >> Guess most of your confusing come from the capacity != weight here.
>> >>
>> >> In most of Intel CPU, a cpu core's power(with 2 HT) is usually 1178, it
>> >> just bigger than a normal cpu power - 1024. but the capacity is still 1,
>> >> while the group weight is 2.
>> >>
>> > 
>> > Thanks for clarifying. To the best of my knowledge there are no
>> > guidelines for how to specify cpu power so it may be a bit dangerous to
>> > assume that capacity < weight when capacity is based on cpu power.
>> 
>> Sure. I also just got them from code. and don't know other arch how to
>> different them.
>> but currently, seems this cpu power concept works fine.
>
> Yes, it seems to work fine for your test platform. I just want to
> highlight that the assumption you make might not be valid for other
> architectures. I know that cpu power is not widely used, but that may
> change with the increasing focus on power aware scheduling.

AFAIK on ARM big.LITTLE, a big cpu will have a cpu power of more than
1024.  I'm sure Morten knows way more about this than me. :)

Thanks,
Namhyung


Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-16 Thread Morten Rasmussen
On Wed, Jan 16, 2013 at 06:02:21AM +, Alex Shi wrote:
> On 01/15/2013 12:09 AM, Morten Rasmussen wrote:
> > On Fri, Jan 11, 2013 at 07:08:45AM +, Alex Shi wrote:
> >> On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
> >>> On Sat, Jan 05, 2013 at 08:37:45AM +, Alex Shi wrote:
> >>>> This patch add power aware scheduling in fork/exec/wake. It try to
> >>>> select cpu from the busiest while still has utilization group. That's
> >>>> will save power for other groups.
> >>>>
> >>>> The trade off is adding a power aware statistics collection in group
> >>>> seeking. But since the collection just happened in power scheduling
> >>>> eligible condition, the worst case of hackbench testing just drops
> >>>> about 2% with powersaving/balance policy. No clear change for
> >>>> performance policy.
> >>>>
> >>>> I had tried to use rq load avg utilisation in this balancing, but since
> >>>> the utilisation need much time to accumulate itself. It's unfit for any
> >>>> burst balancing. So I use nr_running as instant rq utilisation.
> >>>
> >>> So you effective use a mix of nr_running (counting tasks) and PJT's
> >>> tracked load for balancing?
> >>
> >> no, just task number here.
> >>>
> >>> The problem of slow reaction time of the tracked load a cpu/rq is an
> >>> interesting one. Would it be possible to use it if you maintained a
> >>> sched group runnable_load_avg similar to cfs_rq->runnable_load_avg where
> >>> load contribution of a tasks is added when a task is enqueued and
> >>> removed again if it migrates to another cpu?
> >>> This way you would know the new load of the sched group/domain instantly
> >>> when you migrate a task there. It might not be precise as the load
> >>> contribution of the task to some extend depends on the load of the cpu
> >>> where it is running. But it would probably be a fair estimate, which is
> >>> quite likely to be better than just counting tasks (nr_running).
> >>
> >> For power consideration scenario, it ask task number less than Lcpu
> >> number, don't care the load weight, since whatever the load weight, the
> >> task only can burn one LCPU.
> >>
> > 
> > True, but you miss the opportunities for power saving when you have many
> > light tasks (> LCPU). Currently, the sd_utils < threshold check will go
> > for SCHED_POLICY_PERFORMANCE if the number tasks (sd_utils) is greater
> > than the domain weight/capacity irrespective of the actual load caused
> > by those tasks.
> > 
> > If you used tracked task load weight for sd_utils instead you would be
> > able to go for power saving in scenarios with many light tasks as well.
> 
> yes, that's right on power consideration. but for performance consider,
> it's better to spread tasks on different LCPU to save CS cost. And if
> the cpu usage is nearly full, we don't know if some tasks real want more
> cpu time.

If the cpu is nearly full according to its tracked load it should not be
used for packing more tasks. It is the nearly idle scenario that I am
more interested in. If you have lots of tasks with tracked load <10% then
why not pack them? The performance impact should be minimal.

Furthermore, nr_running is just a snapshot of the current runqueue
status. The combination of runnable and blocked load should give a
better overall view of the cpu loads.

> Even in the power sched policy, we still want to get better performance
> if it's possible. :)

I agree if it comes for free in terms of power. In my opinion it is
acceptable to sacrifice a bit of performance to save power when using a
power sched policy as long as the performance regression can be
justified by the power savings. It will of course depend on the system
and its usage how to trade off power and performance. My point is just that
with multiple sched policies (performance, balance and power as you
propose) it should be acceptable to focus on power for the power policy
and let users that only/mostly care about performance use the balance or
performance policy.

> > 
> >>>> +
> >>>> +   if (sched_policy == SCHED_POLICY_POWERSAVING)
> >>>> +   threshold = sgs.group_weight;
> >>>> +   else
> >>>> +   threshold = sgs.group_capacity;
> >>>
> >>> Is group_capacity larger or smaller than group_weight on your platform?
> >>
> >> Guess most of your confusing come from the capacity != weight here.
> >>
> >> In most of Intel CPU, a cpu core's power(with 2 HT) is usually 1178, it
> >> just bigger than a normal cpu power - 1024. but the capacity is still 1,
> >> while the group weight is 2.
> >>
> > 
> > Thanks for clarifying. To the best of my knowledge there are no
> > guidelines for how to specify cpu power so it may be a bit dangerous to
> > assume that capacity < weight when capacity is based on cpu power.
> 
> Sure. I also just got them from code. and don't know other arch how to
> different them.
> but currently, seems this cpu power concept works fine.

Yes, it seems to work fine for your test platform. I just want to
highlight that the assumption you make might not be valid for other
architectures. I know that cpu power is not widely used, but that may
change with the increasing focus on power aware scheduling.

Morten


Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-15 Thread Alex Shi
On 01/15/2013 12:09 AM, Morten Rasmussen wrote:
> On Fri, Jan 11, 2013 at 07:08:45AM +, Alex Shi wrote:
>> On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
>>> On Sat, Jan 05, 2013 at 08:37:45AM +, Alex Shi wrote:
>>>> This patch add power aware scheduling in fork/exec/wake. It try to
>>>> select cpu from the busiest while still has utilization group. That's
>>>> will save power for other groups.
>>>>
>>>> The trade off is adding a power aware statistics collection in group
>>>> seeking. But since the collection just happened in power scheduling
>>>> eligible condition, the worst case of hackbench testing just drops
>>>> about 2% with powersaving/balance policy. No clear change for
>>>> performance policy.
>>>>
>>>> I had tried to use rq load avg utilisation in this balancing, but since
>>>> the utilisation need much time to accumulate itself. It's unfit for any
>>>> burst balancing. So I use nr_running as instant rq utilisation.
>>>
>>> So you effective use a mix of nr_running (counting tasks) and PJT's
>>> tracked load for balancing?
>>
>> no, just task number here.
>>>
>>> The problem of slow reaction time of the tracked load a cpu/rq is an
>>> interesting one. Would it be possible to use it if you maintained a
>>> sched group runnable_load_avg similar to cfs_rq->runnable_load_avg where
>>> load contribution of a tasks is added when a task is enqueued and
>>> removed again if it migrates to another cpu?
>>> This way you would know the new load of the sched group/domain instantly
>>> when you migrate a task there. It might not be precise as the load
>>> contribution of the task to some extend depends on the load of the cpu
>>> where it is running. But it would probably be a fair estimate, which is
>>> quite likely to be better than just counting tasks (nr_running).
>>
>> For power consideration scenario, it ask task number less than Lcpu
>> number, don't care the load weight, since whatever the load weight, the
>> task only can burn one LCPU.
>>
> 
> True, but you miss the opportunities for power saving when you have many
> light tasks (> LCPU). Currently, the sd_utils < threshold check will go
> for SCHED_POLICY_PERFORMANCE if the number tasks (sd_utils) is greater
> than the domain weight/capacity irrespective of the actual load caused
> by those tasks.
> 
> If you used tracked task load weight for sd_utils instead you would be
> able to go for power saving in scenarios with many light tasks as well.

Yes, that's right from the power point of view. But for performance,
it's better to spread tasks across different LCPUs to save context
switch cost. And if the cpu usage is nearly full, we don't know whether
some tasks really want more cpu time.
Even in the power sched policy, we still want to get better performance
if it's possible. :)
> 
>>>> +
>>>> +  if (sched_policy == SCHED_POLICY_POWERSAVING)
>>>> +  threshold = sgs.group_weight;
>>>> +  else
>>>> +  threshold = sgs.group_capacity;
>>>
>>> Is group_capacity larger or smaller than group_weight on your platform?
>>
>> Guess most of your confusing come from the capacity != weight here.
>>
>> In most of Intel CPU, a cpu core's power(with 2 HT) is usually 1178, it
>> just bigger than a normal cpu power - 1024. but the capacity is still 1,
>> while the group weight is 2.
>>
> 
> Thanks for clarifying. To the best of my knowledge there are no
> guidelines for how to specify cpu power so it may be a bit dangerous to
> assume that capacity < weight when capacity is based on cpu power.

Sure. I also just took these values from the code, and don't know how
other architectures differentiate them. But currently this cpu power
concept seems to work fine.
> 
> You could have architectures where the cpu power of each LCPU (HT, core,
> cpu, whatever LCPU is on the particular platform) is greater than 1024
> for most LCPUs. In that case, the capacity < weight assumption fails.
> Also, on non-HT systems it is quite likely that you will have capacity =
> weight.

yes.
> 
> Morten
> 
> 


-- 
Thanks Alex


Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-14 Thread Morten Rasmussen
On Fri, Jan 11, 2013 at 07:08:45AM +, Alex Shi wrote:
> On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
> > On Sat, Jan 05, 2013 at 08:37:45AM +, Alex Shi wrote:
> >> This patch add power aware scheduling in fork/exec/wake. It try to
> >> select cpu from the busiest while still has utilization group. That's
> >> will save power for other groups.
> >>
> >> The trade off is adding a power aware statistics collection in group
> >> seeking. But since the collection just happened in power scheduling
> >> eligible condition, the worst case of hackbench testing just drops
> >> about 2% with powersaving/balance policy. No clear change for
> >> performance policy.
> >>
> >> I had tried to use rq load avg utilisation in this balancing, but since
> >> the utilisation need much time to accumulate itself. It's unfit for any
> >> burst balancing. So I use nr_running as instant rq utilisation.
> > 
> > So you effective use a mix of nr_running (counting tasks) and PJT's
> > tracked load for balancing?
> 
> no, just task number here.
> > 
> > The problem of slow reaction time of the tracked load a cpu/rq is an
> > interesting one. Would it be possible to use it if you maintained a
> > sched group runnable_load_avg similar to cfs_rq->runnable_load_avg where
> > load contribution of a tasks is added when a task is enqueued and
> > removed again if it migrates to another cpu?
> > This way you would know the new load of the sched group/domain instantly
> > when you migrate a task there. It might not be precise as the load
> > contribution of the task to some extend depends on the load of the cpu
> > where it is running. But it would probably be a fair estimate, which is
> > quite likely to be better than just counting tasks (nr_running).
> 
> For power consideration scenario, it ask task number less than Lcpu
> number, don't care the load weight, since whatever the load weight, the
> task only can burn one LCPU.
> 

True, but you miss the opportunities for power saving when you have many
light tasks (> LCPU). Currently, the sd_utils < threshold check will go
for SCHED_POLICY_PERFORMANCE if the number of tasks (sd_utils) is greater
than the domain weight/capacity irrespective of the actual load caused
by those tasks.

If you used tracked task load weight for sd_utils instead you would be
able to go for power saving in scenarios with many light tasks as well.
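
As a rough illustration of what I mean (untested sketch; the
runnable_load_avg/blocked_load_avg fields are the ones added by the load
tracking patches, everything else follows the naming of this series),
get_sg_power_stats() could sum the tracked load instead of counting tasks:

	/*
	 * Sketch: sum the tracked (runnable + blocked) load of the group
	 * instead of nr_running. group_utils then holds roughly 1024 per
	 * fully busy nice-0 task, so the sd_utils < threshold comparison
	 * would need to be scaled by NICE_0_LOAD instead of using a plain
	 * task count.
	 */
	static void get_sg_power_stats(struct sched_group *group,
			struct sched_domain *sd, struct sg_lb_stats *sgs)
	{
		int i;

		for_each_cpu(i, sched_group_cpus(group)) {
			struct rq *rq = cpu_rq(i);

			sgs->group_utils += rq->cfs.runnable_load_avg +
					    rq->cfs.blocked_load_avg;
		}

		sgs->group_weight = group->group_weight;
		sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
							SCHED_POWER_SCALE);
	}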

> >> +
> >> +  if (sched_policy == SCHED_POLICY_POWERSAVING)
> >> +  threshold = sgs.group_weight;
> >> +  else
> >> +  threshold = sgs.group_capacity;
> > 
> > Is group_capacity larger or smaller than group_weight on your platform?
> 
> Guess most of your confusing come from the capacity != weight here.
> 
> In most of Intel CPU, a cpu core's power(with 2 HT) is usually 1178, it
> just bigger than a normal cpu power - 1024. but the capacity is still 1,
> while the group weight is 2.
> 

Thanks for clarifying. To the best of my knowledge there are no
guidelines for how to specify cpu power so it may be a bit dangerous to
assume that capacity < weight when capacity is based on cpu power.

You could have architectures where the cpu power of each LCPU (HT, core,
cpu, whatever LCPU is on the particular platform) is greater than 1024
for most LCPUs. In that case, the capacity < weight assumption fails.
Also, on non-HT systems it is quite likely that you will have capacity =
weight.

Morten




Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-14 Thread Alex Shi

>> +/*
>> + * Try to collect the task running number and capacity of the doamin.
>> + */
>> +static void get_sd_power_stats(struct sched_domain *sd,
>> +struct task_struct *p, struct sd_lb_stats *sds)
>> +{
>> +struct sched_group *group;
>> +struct sg_lb_stats sgs;
>> +int sd_min_delta = INT_MAX;
>> +int cpu = task_cpu(p);
>> +
>> +group = sd->groups;
>> +do {
>> +long g_delta;
>> +unsigned long threshold;
>> +
>> +if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
>> +continue;
> 
> Why?
> 
> That means only local group's stat will be accounted for this domain,
> right?  Is it your intension?
> 

Uh, thanks a lot for finding this bug!
It is a mistake; it should be:
+   if (!cpumask_intersects(sched_group_cpus(group),
+   tsk_cpus_allowed(p)))
+   continue;
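
For clarity, in the loop of get_sd_power_stats() the corrected check sits
like this (sketch only, the rest of the loop is unchanged from the patch):

	group = sd->groups;
	do {
		/*
		 * Skip groups that contain no cpu the task is allowed to
		 * run on, rather than every group that doesn't contain the
		 * task's current cpu. 'continue' still advances 'group'
		 * through the comma expression in the while condition.
		 */
		if (!cpumask_intersects(sched_group_cpus(group),
					tsk_cpus_allowed(p)))
			continue;

		/* ... collect this group's statistics as before ... */
	} while (group = group->next, group != sd->groups);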

-- 
Thanks Alex


Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-13 Thread Namhyung Kim
On Sat,  5 Jan 2013 16:37:45 +0800, Alex Shi wrote:
> This patch add power aware scheduling in fork/exec/wake. It try to
> select cpu from the busiest while still has utilization group. That's
> will save power for other groups.
>
> The trade off is adding a power aware statistics collection in group
> seeking. But since the collection just happened in power scheduling
> eligible condition, the worst case of hackbench testing just drops
> about 2% with powersaving/balance policy. No clear change for
> performance policy.
>
> I had tried to use rq load avg utilisation in this balancing, but since
> the utilisation need much time to accumulate itself. It's unfit for any
> burst balancing. So I use nr_running as instant rq utilisation.
>
> Signed-off-by: Alex Shi 
> ---
[snip]
> +/*
> + * Try to collect the task running number and capacity of the doamin.
> + */
> +static void get_sd_power_stats(struct sched_domain *sd,
> + struct task_struct *p, struct sd_lb_stats *sds)
> +{
> + struct sched_group *group;
> + struct sg_lb_stats sgs;
> + int sd_min_delta = INT_MAX;
> + int cpu = task_cpu(p);
> +
> + group = sd->groups;
> + do {
> + long g_delta;
> + unsigned long threshold;
> +
> + if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
> + continue;

Why?

That means only the local group's stats will be accounted for this
domain, right?  Is that your intention?

Thanks,
Namhyung


> +
> + memset(&sgs, 0, sizeof(sgs));
> + get_sg_power_stats(group, sd, &sgs);
> +
> + if (sched_policy == SCHED_POLICY_POWERSAVING)
> + threshold = sgs.group_weight;
> + else
> + threshold = sgs.group_capacity;
> +
> + g_delta = threshold - sgs.group_utils;
> +
> + if (g_delta > 0 && g_delta < sd_min_delta) {
> + sd_min_delta = g_delta;
> + sds->group_leader = group;
> + }
> +
> + sds->sd_utils += sgs.group_utils;
> + sds->total_pwr += group->sgp->power;
> + } while  (group = group->next, group != sd->groups);
> +
> + sds->sd_capacity = DIV_ROUND_CLOSEST(sds->total_pwr,
> + SCHED_POWER_SCALE);
> +}


Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-10 Thread Alex Shi
On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
> On Sat, Jan 05, 2013 at 08:37:45AM +, Alex Shi wrote:
>> This patch add power aware scheduling in fork/exec/wake. It try to
>> select cpu from the busiest while still has utilization group. That's
>> will save power for other groups.
>>
>> The trade off is adding a power aware statistics collection in group
>> seeking. But since the collection just happened in power scheduling
>> eligible condition, the worst case of hackbench testing just drops
>> about 2% with powersaving/balance policy. No clear change for
>> performance policy.
>>
>> I had tried to use rq load avg utilisation in this balancing, but since
>> the utilisation need much time to accumulate itself. It's unfit for any
>> burst balancing. So I use nr_running as instant rq utilisation.
> 
> So you effective use a mix of nr_running (counting tasks) and PJT's
> tracked load for balancing?

no, just task number here.
> 
> The problem of slow reaction time of the tracked load a cpu/rq is an
> interesting one. Would it be possible to use it if you maintained a
> sched group runnable_load_avg similar to cfs_rq->runnable_load_avg where
> load contribution of a tasks is added when a task is enqueued and
> removed again if it migrates to another cpu?
> This way you would know the new load of the sched group/domain instantly
> when you migrate a task there. It might not be precise as the load
> contribution of the task to some extend depends on the load of the cpu
> where it is running. But it would probably be a fair estimate, which is
> quite likely to be better than just counting tasks (nr_running).

For the power saving scenario, it only asks that the task number be
less than the LCPU number, and doesn't care about the load weight, since
whatever the load weight is, a task can only burn one LCPU.

>> +
>> +if (sched_policy == SCHED_POLICY_POWERSAVING)
>> +threshold = sgs.group_weight;
>> +else
>> +threshold = sgs.group_capacity;
> 
> Is group_capacity larger or smaller than group_weight on your platform?

I guess most of your confusion comes from capacity != weight here.

On most Intel CPUs, a cpu core's power (with 2 HT siblings) is usually
1178, just a bit bigger than the default cpu power of 1024, but the
capacity is still 1 while the group weight is 2.
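
As a tiny userspace illustration of that arithmetic (the numbers are just
the ones discussed here; the macros mirror the kernel's DIV_ROUND_CLOSEST()
and SCHED_POWER_SCALE, the same rounding this patch uses to derive
sd_capacity from total_pwr):

	#include <stdio.h>

	#define SCHED_POWER_SCALE	1024UL
	#define DIV_ROUND_CLOSEST(x, d)	(((x) + ((d) / 2)) / (d))

	int main(void)
	{
		/* One HT core: two siblings sharing a group power of ~1178. */
		unsigned long ht_power = 1178, ht_weight = 2;
		/* A hypothetical non-HT group: four cpus at the default 1024. */
		unsigned long cpu_power = 1024, nr_cpus = 4;

		printf("HT core: capacity=%lu weight=%lu\n",
		       DIV_ROUND_CLOSEST(ht_power, SCHED_POWER_SCALE), ht_weight);
		printf("non-HT:  capacity=%lu weight=%lu\n",
		       DIV_ROUND_CLOSEST(cpu_power * nr_cpus, SCHED_POWER_SCALE),
		       nr_cpus);
		return 0;
	}

It prints capacity=1 weight=2 for the HT core and capacity=4 weight=4 for
the non-HT group, so the capacity < weight gap only comes from the SMT
discount.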



Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-10 Thread Morten Rasmussen
On Sat, Jan 05, 2013 at 08:37:45AM +, Alex Shi wrote:
> This patch add power aware scheduling in fork/exec/wake. It try to
> select cpu from the busiest while still has utilization group. That's
> will save power for other groups.
> 
> The trade off is adding a power aware statistics collection in group
> seeking. But since the collection just happened in power scheduling
> eligible condition, the worst case of hackbench testing just drops
> about 2% with powersaving/balance policy. No clear change for
> performance policy.
> 
> I had tried to use rq load avg utilisation in this balancing, but since
> the utilisation need much time to accumulate itself. It's unfit for any
> burst balancing. So I use nr_running as instant rq utilisation.

So you effectively use a mix of nr_running (counting tasks) and PJT's
tracked load for balancing?

The problem of the slow reaction time of the tracked load of a cpu/rq is
an interesting one. Would it be possible to use it if you maintained a
sched group runnable_load_avg similar to cfs_rq->runnable_load_avg, where
the load contribution of a task is added when the task is enqueued and
removed again if it migrates to another cpu?
This way you would know the new load of the sched group/domain instantly
when you migrate a task there. It might not be precise, as the load
contribution of the task to some extent depends on the load of the cpu
where it is running. But it would probably be a fair estimate, which is
quite likely to be better than just counting tasks (nr_running).
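
Something along these lines (pure sketch; the sg_runnable_load_avg field
and both hooks are made-up names, and locking is ignored):

	/*
	 * Per-group tracked load, updated when a task is enqueued in the
	 * group or migrated away from it, so the value is known instantly
	 * at fork/exec/wake time instead of having to be re-accumulated.
	 */
	static void sg_add_task_load(struct sched_group *group,
				     struct task_struct *p)
	{
		group->sgp->sg_runnable_load_avg += p->se.avg.load_avg_contrib;
	}

	static void sg_del_task_load(struct sched_group *group,
				     struct task_struct *p)
	{
		group->sgp->sg_runnable_load_avg -= p->se.avg.load_avg_contrib;
	}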

> 
> Signed-off-by: Alex Shi 
> ---
>  kernel/sched/fair.c | 230 
> 
>  1 file changed, 179 insertions(+), 51 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7bfbd69..8d0d3af 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3323,25 +3323,189 @@ done:
>  }
>  
>  /*
> - * sched_balance_self: balance the current task (running on cpu) in domains
> + * sd_lb_stats - Structure to store the statistics of a sched_domain
> + *   during load balancing.
> + */
> +struct sd_lb_stats {
> + struct sched_group *busiest; /* Busiest group in this sd */
> + struct sched_group *this;  /* Local group in this sd */
> + unsigned long total_load;  /* Total load of all groups in sd */
> + unsigned long total_pwr;   /*   Total power of all groups in sd */
> + unsigned long avg_load;/* Average load across all groups in sd */
> +
> + /** Statistics of this group */
> + unsigned long this_load;
> + unsigned long this_load_per_task;
> + unsigned long this_nr_running;
> + unsigned int  this_has_capacity;
> + unsigned int  this_idle_cpus;
> +
> + /* Statistics of the busiest group */
> + unsigned int  busiest_idle_cpus;
> + unsigned long max_load;
> + unsigned long busiest_load_per_task;
> + unsigned long busiest_nr_running;
> + unsigned long busiest_group_capacity;
> + unsigned int  busiest_has_capacity;
> + unsigned int  busiest_group_weight;
> +
> + int group_imb; /* Is there imbalance in this sd */
> +
> + /* Variables of power aware scheduling */
> + unsigned int  sd_utils; /* sum utilizations of this domain */
> + unsigned long sd_capacity;  /* capacity of this domain */
> + struct sched_group *group_leader; /* Group which relieves group_min */
> + unsigned long min_load_per_task; /* load_per_task in group_min */
> + unsigned int  leader_util;  /* sum utilizations of group_leader */
> + unsigned int  min_util; /* sum utilizations of group_min */
> +};
> +
> +/*
> + * sg_lb_stats - stats of a sched_group required for load_balancing
> + */
> +struct sg_lb_stats {
> + unsigned long avg_load; /*Avg load across the CPUs of the group */
> + unsigned long group_load; /* Total load over the CPUs of the group */
> + unsigned long sum_nr_running; /* Nr tasks running in the group */
> + unsigned long sum_weighted_load; /* Weighted load of group's tasks */
> + unsigned long group_capacity;
> + unsigned long idle_cpus;
> + unsigned long group_weight;
> + int group_imb; /* Is there an imbalance in the group ? */
> + int group_has_capacity; /* Is there extra capacity in the group? */
> + unsigned int group_utils;   /* sum utilizations of group */
> +
> + unsigned long sum_shared_running;   /* 0 on non-NUMA */
> +};
> +
> +static inline int
> +fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
> +
> +/*
> + * Try to collect the task running number and capacity of the group.
> + */
> +static void get_sg_power_stats(struct sched_group *group,
> + struct sched_domain *sd, struct sg_lb_stats *sgs)
> +{
> + int i;
> +
> + for_each_cpu(i, sched_group_cpus(group)) {
> + struct rq *rq = cpu_rq(i);
> +
> + sgs->group_utils += rq->nr_running;

The utilization of the sched group is the number of tasks active on the
runqueues of the cpus in the group.

Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-10 Thread Alex Shi
On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
 On Sat, Jan 05, 2013 at 08:37:45AM +, Alex Shi wrote:
 This patch adds power aware scheduling in fork/exec/wake. It tries to
 select a cpu from the busiest group that still has room for more
 utilization. That will save power for the other groups.

 The trade-off is adding power aware statistics collection during group
 seeking. But since the collection only happens when the power
 scheduling eligibility condition is met, the worst case of hackbench
 testing only drops about 2% with the powersaving/balance policy. There
 is no clear change for the performance policy.

 I had tried to use the rq load avg utilisation in this balancing, but
 since the utilisation needs much time to accumulate itself, it is unfit
 for any burst balancing. So I use nr_running as the instant rq
 utilisation.
 
 So you effectively use a mix of nr_running (counting tasks) and PJT's
 tracked load for balancing?

no, just task number here.
 
 The problem of the slow reaction time of the tracked load of a cpu/rq is
 an interesting one. Would it be possible to use it if you maintained a
 sched group runnable_load_avg similar to cfs_rq->runnable_load_avg, where
 the load contribution of a task is added when the task is enqueued and
 removed again if it migrates to another cpu?
 This way you would know the new load of the sched group/domain instantly
 when you migrate a task there. It might not be precise, as the load
 contribution of the task to some extent depends on the load of the cpu
 where it is running. But it would probably be a fair estimate, which is
 quite likely to be better than just counting tasks (nr_running).

For the power consideration scenario, it requires the task number to be
less than the LCPU number and does not care about the load weight,
since whatever the load weight is, a task can only burn one LCPU.
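
A rough sketch of the check I mean; the enum and function below are
stand-ins with simplified names, not the exact patch code:

enum sched_policy { SCHED_POLICY_PERFORMANCE, SCHED_POLICY_BALANCE,
		    SCHED_POLICY_POWERSAVING };

/* Can this group take one more task under the power aware policies?
 * group_utils is the summed nr_running over the group's cpus. */
static int group_can_take_task(enum sched_policy policy,
			       unsigned int group_utils,
			       unsigned int group_weight,
			       unsigned int group_capacity)
{
	unsigned int threshold;

	/* powersaving packs up to one task per LCPU (the group weight);
	 * balance stops at the group capacity (full cores). */
	if (policy == SCHED_POLICY_POWERSAVING)
		threshold = group_weight;
	else
		threshold = group_capacity;

	return group_utils < threshold;
}

This is only about the number of tasks; the load weight is intentionally
ignored at this point.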

 +
 +	if (sched_policy == SCHED_POLICY_POWERSAVING)
 +		threshold = sgs.group_weight;
 +	else
 +		threshold = sgs.group_capacity;
 
 Is group_capacity larger or smaller than group_weight on your platform?

I guess most of your confusion comes from capacity != weight here.

On most Intel CPUs, a cpu core's power (with 2 HT siblings) is usually
1178, just a bit bigger than the nominal cpu power of 1024, but the
capacity is still 1, while the group weight is 2.



Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-10 Thread Morten Rasmussen
On Sat, Jan 05, 2013 at 08:37:45AM +, Alex Shi wrote:
 This patch adds power aware scheduling in fork/exec/wake. It tries to
 select a cpu from the busiest group that still has room for more
 utilization. That will save power for the other groups.

 The trade-off is adding power aware statistics collection during group
 seeking. But since the collection only happens when the power
 scheduling eligibility condition is met, the worst case of hackbench
 testing only drops about 2% with the powersaving/balance policy. There
 is no clear change for the performance policy.

 I had tried to use the rq load avg utilisation in this balancing, but
 since the utilisation needs much time to accumulate itself, it is unfit
 for any burst balancing. So I use nr_running as the instant rq
 utilisation.

So you effectively use a mix of nr_running (counting tasks) and PJT's
tracked load for balancing?

The problem of the slow reaction time of the tracked load of a cpu/rq is
an interesting one. Would it be possible to use it if you maintained a
sched group runnable_load_avg similar to cfs_rq->runnable_load_avg, where
the load contribution of a task is added when the task is enqueued and
removed again if it migrates to another cpu?
This way you would know the new load of the sched group/domain instantly
when you migrate a task there. It might not be precise, as the load
contribution of the task to some extent depends on the load of the cpu
where it is running. But it would probably be a fair estimate, which is
quite likely to be better than just counting tasks (nr_running).

 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c | 230 
 
  1 file changed, 179 insertions(+), 51 deletions(-)
 
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 7bfbd69..8d0d3af 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3323,25 +3323,189 @@ done:
  }
  
  /*
 - * sched_balance_self: balance the current task (running on cpu) in domains
 + * sd_lb_stats - Structure to store the statistics of a sched_domain
 + *   during load balancing.
 + */
 +struct sd_lb_stats {
 + struct sched_group *busiest; /* Busiest group in this sd */
 + struct sched_group *this;  /* Local group in this sd */
 + unsigned long total_load;  /* Total load of all groups in sd */
 + unsigned long total_pwr;   /*   Total power of all groups in sd */
 + unsigned long avg_load;/* Average load across all groups in sd */
 +
 + /** Statistics of this group */
 + unsigned long this_load;
 + unsigned long this_load_per_task;
 + unsigned long this_nr_running;
 + unsigned int  this_has_capacity;
 + unsigned int  this_idle_cpus;
 +
 + /* Statistics of the busiest group */
 + unsigned int  busiest_idle_cpus;
 + unsigned long max_load;
 + unsigned long busiest_load_per_task;
 + unsigned long busiest_nr_running;
 + unsigned long busiest_group_capacity;
 + unsigned int  busiest_has_capacity;
 + unsigned int  busiest_group_weight;
 +
 + int group_imb; /* Is there imbalance in this sd */
 +
 + /* Variables of power aware scheduling */
 + unsigned int  sd_utils; /* sum utilizations of this domain */
 + unsigned long sd_capacity;  /* capacity of this domain */
 + struct sched_group *group_leader; /* Group which relieves group_min */
 + unsigned long min_load_per_task; /* load_per_task in group_min */
 + unsigned int  leader_util;  /* sum utilizations of group_leader */
 + unsigned int  min_util; /* sum utilizations of group_min */
 +};
 +
 +/*
 + * sg_lb_stats - stats of a sched_group required for load_balancing
 + */
 +struct sg_lb_stats {
 + unsigned long avg_load; /*Avg load across the CPUs of the group */
 + unsigned long group_load; /* Total load over the CPUs of the group */
 + unsigned long sum_nr_running; /* Nr tasks running in the group */
 + unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 + unsigned long group_capacity;
 + unsigned long idle_cpus;
 + unsigned long group_weight;
 + int group_imb; /* Is there an imbalance in the group ? */
 + int group_has_capacity; /* Is there extra capacity in the group? */
 + unsigned int group_utils;   /* sum utilizations of group */
 +
 + unsigned long sum_shared_running;   /* 0 on non-NUMA */
 +};
 +
 +static inline int
 +fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
 +
 +/*
 + * Try to collect the task running number and capacity of the group.
 + */
 +static void get_sg_power_stats(struct sched_group *group,
 + struct sched_domain *sd, struct sg_lb_stats *sgs)
 +{
 + int i;
 +
 + for_each_cpu(i, sched_group_cpus(group)) {
 + struct rq *rq = cpu_rq(i);
 +
 + sgs->group_utils += rq->nr_running;

The utilization of the sched group is the number of tasks active on the
runqueues of the cpus in the group.

 + }
 +
 + sgs->group_capacity = 

[PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-05 Thread Alex Shi
This patch adds power aware scheduling in fork/exec/wake. It tries to
select a cpu from the busiest group that still has room for more
utilization. That will save power for the other groups.

The trade-off is adding power aware statistics collection during group
seeking. But since the collection only happens when the power
scheduling eligibility condition is met, the worst case of hackbench
testing only drops about 2% with the powersaving/balance policy. There
is no clear change for the performance policy.

I had tried to use the rq load avg utilisation in this balancing, but
since the utilisation needs much time to accumulate itself, it is unfit
for any burst balancing. So I use nr_running as the instant rq
utilisation.
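
To give a feel for how slow the accumulation is, here is a rough
standalone illustration (userspace C; it assumes the usual ~32ms
half-life of the per-entity load tracking, and the numbers are
approximate, not measured from this patch):

/* A task that is continuously runnable from time 0 reaches only a
 * fraction 1 - 0.5^(t/32) of its eventual tracked load after t ms. */
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double half_life_ms = 32.0;	/* y^32 = 0.5 decay */
	int t;

	for (t = 4; t <= 128; t *= 2) {
		double frac = 1.0 - pow(0.5, t / half_life_ms);
		/* e.g. ~8% after 4ms, ~50% after 32ms, ~94% after 128ms */
		printf("%3d ms: %2.0f%% of full load\n", t, 100.0 * frac);
	}
	return 0;
}

A freshly forked or woken burst task therefore looks almost weightless
for the first few milliseconds, which is exactly when the fork/exec/wake
balancing has to make its decision.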

Signed-off-by: Alex Shi 
---
 kernel/sched/fair.c | 230 
 1 file changed, 179 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7bfbd69..8d0d3af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3323,25 +3323,189 @@ done:
 }
 
 /*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ * during load balancing.
+ */
+struct sd_lb_stats {
+   struct sched_group *busiest; /* Busiest group in this sd */
+   struct sched_group *this;  /* Local group in this sd */
+   unsigned long total_load;  /* Total load of all groups in sd */
+   unsigned long total_pwr;   /*   Total power of all groups in sd */
+   unsigned long avg_load;/* Average load across all groups in sd */
+
+   /** Statistics of this group */
+   unsigned long this_load;
+   unsigned long this_load_per_task;
+   unsigned long this_nr_running;
+   unsigned int  this_has_capacity;
+   unsigned int  this_idle_cpus;
+
+   /* Statistics of the busiest group */
+   unsigned int  busiest_idle_cpus;
+   unsigned long max_load;
+   unsigned long busiest_load_per_task;
+   unsigned long busiest_nr_running;
+   unsigned long busiest_group_capacity;
+   unsigned int  busiest_has_capacity;
+   unsigned int  busiest_group_weight;
+
+   int group_imb; /* Is there imbalance in this sd */
+
+   /* Variables of power aware scheduling */
+   unsigned int  sd_utils; /* sum utilizations of this domain */
+   unsigned long sd_capacity;  /* capacity of this domain */
+   struct sched_group *group_leader; /* Group which relieves group_min */
+   unsigned long min_load_per_task; /* load_per_task in group_min */
+   unsigned int  leader_util;  /* sum utilizations of group_leader */
+   unsigned int  min_util; /* sum utilizations of group_min */
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ */
+struct sg_lb_stats {
+   unsigned long avg_load; /*Avg load across the CPUs of the group */
+   unsigned long group_load; /* Total load over the CPUs of the group */
+   unsigned long sum_nr_running; /* Nr tasks running in the group */
+   unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+   unsigned long group_capacity;
+   unsigned long idle_cpus;
+   unsigned long group_weight;
+   int group_imb; /* Is there an imbalance in the group ? */
+   int group_has_capacity; /* Is there extra capacity in the group? */
+   unsigned int group_utils;   /* sum utilizations of group */
+
+   unsigned long sum_shared_running;   /* 0 on non-NUMA */
+};
+
+static inline int
+fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
+
+/*
+ * Try to collect the task running number and capacity of the group.
+ */
+static void get_sg_power_stats(struct sched_group *group,
+   struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+   int i;
+
+   for_each_cpu(i, sched_group_cpus(group)) {
+   struct rq *rq = cpu_rq(i);
+
+   sgs->group_utils += rq->nr_running;
+   }
+
+   sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
+   SCHED_POWER_SCALE);
+   if (!sgs->group_capacity)
+   sgs->group_capacity = fix_small_capacity(sd, group);
+   sgs->group_weight = group->group_weight;
+}
+
+/*
+ * Try to collect the task running number and capacity of the domain.
+ */
+static void get_sd_power_stats(struct sched_domain *sd,
+   struct task_struct *p, struct sd_lb_stats *sds)
+{
+   struct sched_group *group;
+   struct sg_lb_stats sgs;
+   int sd_min_delta = INT_MAX;
+   int cpu = task_cpu(p);
+
+   group = sd->groups;
+   do {
+   long g_delta;
+   unsigned long threshold;
+
+   if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
+   continue;
+
+   memset(&sgs, 0, sizeof(sgs));
+   get_sg_power_stats(group, sd, &sgs);
+
+   if (sched_policy == 

[PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake

2013-01-05 Thread Alex Shi
This patch adds power aware scheduling in fork/exec/wake. It tries to
select a cpu from the busiest group that still has room for more
utilization. That will save power for the other groups.

The trade-off is adding power aware statistics collection during group
seeking. But since the collection only happens when the power
scheduling eligibility condition is met, the worst case of hackbench
testing only drops about 2% with the powersaving/balance policy. There
is no clear change for the performance policy.

I had tried to use the rq load avg utilisation in this balancing, but
since the utilisation needs much time to accumulate itself, it is unfit
for any burst balancing. So I use nr_running as the instant rq
utilisation.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 230 
 1 file changed, 179 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7bfbd69..8d0d3af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3323,25 +3323,189 @@ done:
 }
 
 /*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ * during load balancing.
+ */
+struct sd_lb_stats {
+   struct sched_group *busiest; /* Busiest group in this sd */
+   struct sched_group *this;  /* Local group in this sd */
+   unsigned long total_load;  /* Total load of all groups in sd */
+   unsigned long total_pwr;   /*   Total power of all groups in sd */
+   unsigned long avg_load;/* Average load across all groups in sd */
+
+   /** Statistics of this group */
+   unsigned long this_load;
+   unsigned long this_load_per_task;
+   unsigned long this_nr_running;
+   unsigned int  this_has_capacity;
+   unsigned int  this_idle_cpus;
+
+   /* Statistics of the busiest group */
+   unsigned int  busiest_idle_cpus;
+   unsigned long max_load;
+   unsigned long busiest_load_per_task;
+   unsigned long busiest_nr_running;
+   unsigned long busiest_group_capacity;
+   unsigned int  busiest_has_capacity;
+   unsigned int  busiest_group_weight;
+
+   int group_imb; /* Is there imbalance in this sd */
+
+   /* Variables of power aware scheduling */
+   unsigned int  sd_utils; /* sum utilizations of this domain */
+   unsigned long sd_capacity;  /* capacity of this domain */
+   struct sched_group *group_leader; /* Group which relieves group_min */
+   unsigned long min_load_per_task; /* load_per_task in group_min */
+   unsigned int  leader_util;  /* sum utilizations of group_leader */
+   unsigned int  min_util; /* sum utilizations of group_min */
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ */
+struct sg_lb_stats {
+   unsigned long avg_load; /*Avg load across the CPUs of the group */
+   unsigned long group_load; /* Total load over the CPUs of the group */
+   unsigned long sum_nr_running; /* Nr tasks running in the group */
+   unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+   unsigned long group_capacity;
+   unsigned long idle_cpus;
+   unsigned long group_weight;
+   int group_imb; /* Is there an imbalance in the group ? */
+   int group_has_capacity; /* Is there extra capacity in the group? */
+   unsigned int group_utils;   /* sum utilizations of group */
+
+   unsigned long sum_shared_running;   /* 0 on non-NUMA */
+};
+
+static inline int
+fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
+
+/*
+ * Try to collect the task running number and capacity of the group.
+ */
+static void get_sg_power_stats(struct sched_group *group,
+   struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+   int i;
+
+   for_each_cpu(i, sched_group_cpus(group)) {
+   struct rq *rq = cpu_rq(i);
+
+   sgs->group_utils += rq->nr_running;
+   }
+
+   sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
+   SCHED_POWER_SCALE);
+   if (!sgs->group_capacity)
+   sgs->group_capacity = fix_small_capacity(sd, group);
+   sgs->group_weight = group->group_weight;
+}
+
+/*
+ * Try to collect the task running number and capacity of the domain.
+ */
+static void get_sd_power_stats(struct sched_domain *sd,
+   struct task_struct *p, struct sd_lb_stats *sds)
+{
+   struct sched_group *group;
+   struct sg_lb_stats sgs;
+   int sd_min_delta = INT_MAX;
+   int cpu = task_cpu(p);
+
+   group = sd->groups;
+   do {
+   long g_delta;
+   unsigned long threshold;
+
+   if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
+   continue;
+
+   memset(&sgs, 0, sizeof(sgs));
+   get_sg_power_stats(group, sd, &sgs);
+
+   if (sched_policy ==