Re: [RFC PATCH 2/3] sched: power aware load balance,

2012-11-11 Thread Alex Shi
On 11/12/2012 02:49 AM, Preeti Murthy wrote:
> Hi Alex
> I apologise for the delay in replying.

That's all right. I am often busy with other Intel tasks too and have no
time to follow LKML. :)
> 
> On Wed, Nov 7, 2012 at 6:57 PM, Alex Shi  wrote:
>> On 11/07/2012 12:37 PM, Preeti Murthy wrote:
>>> Hi Alex,
>>>
>>> What I am concerned about in this patchset as Peter also
>>> mentioned in the previous discussion of your approach
>>> (https://lkml.org/lkml/2012/8/13/139)
>>> is that:
>>>
>>> 1. Using nr_running of two different sched groups to decide which one
>>> can be group_leader or group_min might not be the right approach,
>>> as this might mislead us into thinking that a group running one task is
>>> less loaded than a group running three tasks, although the former task
>>> is a CPU hog.
>>>
>>> 2. Comparing the number of CPUs with the number of tasks running in a
>>> sched group to decide whether the group is underloaded or overloaded
>>> faces the same issue. The tasks might be short-running, not utilizing
>>> the CPU much.
>>
>> Yes, maybe the number of tasks is not the best indicator. But as a first
>> step, it can prove that the proposal is on the right path and worth
>> pursuing further. Considering that the old power-saving implementation
>> also judged by the number of tasks, and given my testing results, it may
>> still be an option.
> Hmm... I will think about this and get back to you.
>>>
>>> I also feel that before we introduce another side to the scheduler
>>> called 'power aware', why not try and see if the current scheduler
>>> itself can perform better? We have an opportunity in terms of PJT's
>>> patches, which can help the scheduler make more realistic decisions in
>>> load balance. Also, since PJT's metric is a statistical one, I believe
>>> we could vary it to allow the scheduler to do more or less rigorous
>>> power savings.
>>
>> I will study PJT's approach.
>> Actually, the current patch set is also a kind of load-balance
>> modification, right? :)
> It is true that this is a different approach; in fact, we will require
> this approach to do power savings, because PJT's patches introduce a new
> 'metric' and not a new 'approach', in my opinion, for smarter load
> balancing, not power-aware load balancing per se. So your patch is
> surely a step towards power-aware load balancing. I am just worried
> about the metric used in it.
>>>
>>> It is true, however, that this approach will not try to evacuate
>>> nearly idle CPUs over to nearly full CPUs. That is definitely one of
>>> the benefits of your patch in terms of power savings, but I believe
>>> your patch is not making use of the right metric to decide that.
>>
>> If one sched group has just one task, and another group has just one
>> LCPU idle, my patch will definitely pull the task to the nearly full
>> sched group. So I didn't understand what you meant by 'will not try to
>> evacuate nearly idle cpus over to nearly full cpus'.
> No, by 'this approach' I meant the current load balancer integrated with
> PJT's metric. Your approach does 'evacuate' the nearly idle CPUs over to
> the nearly full CPUs.

Oh, a misunderstanding about 'this approach'. :) Anyway, we are all clear
on this now.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 2/3] sched: power aware load balance,

2012-11-11 Thread Preeti Murthy
Hi Alex
I apologise for the delay in replying.

On Wed, Nov 7, 2012 at 6:57 PM, Alex Shi  wrote:
> On 11/07/2012 12:37 PM, Preeti Murthy wrote:
>> Hi Alex,
>>
>> What I am concerned about in this patchset as Peter also
>> mentioned in the previous discussion of your approach
>> (https://lkml.org/lkml/2012/8/13/139)
>> is that:
>>
>> 1. Using nr_running of two different sched groups to decide which one
>> can be group_leader or group_min might not be the right approach,
>> as this might mislead us into thinking that a group running one task is
>> less loaded than a group running three tasks, although the former task
>> is a CPU hog.
>>
>> 2. Comparing the number of CPUs with the number of tasks running in a
>> sched group to decide whether the group is underloaded or overloaded
>> faces the same issue. The tasks might be short-running, not utilizing
>> the CPU much.
>
> Yes, maybe the number of tasks is not the best indicator. But as a first
> step, it can prove that the proposal is on the right path and worth
> pursuing further. Considering that the old power-saving implementation
> also judged by the number of tasks, and given my testing results, it may
> still be an option.
Hmm... I will think about this and get back to you.
>>
>> I also feel that before we introduce another side to the scheduler
>> called 'power aware', why not try and see if the current scheduler
>> itself can perform better? We have an opportunity in terms of PJT's
>> patches, which can help the scheduler make more realistic decisions in
>> load balance. Also, since PJT's metric is a statistical one, I believe
>> we could vary it to allow the scheduler to do more or less rigorous
>> power savings.
>
> I will study PJT's approach.
> Actually, the current patch set is also a kind of load-balance
> modification, right? :)
It is true that this is a different approach; in fact, we will require
this approach to do power savings, because PJT's patches introduce a new
'metric' and not a new 'approach', in my opinion, for smarter load
balancing, not power-aware load balancing per se. So your patch is surely
a step towards power-aware load balancing. I am just worried about the
metric used in it.
>>
>> It is true, however, that this approach will not try to evacuate
>> nearly idle CPUs over to nearly full CPUs. That is definitely one of
>> the benefits of your patch in terms of power savings, but I believe
>> your patch is not making use of the right metric to decide that.
>
> If one sched group has just one task, and another group has just one
> LCPU idle, my patch will definitely pull the task to the nearly full
> sched group. So I didn't understand what you meant by 'will not try to
> evacuate nearly idle cpus over to nearly full cpus'
No, by 'this approach' I meant the current load balancer integrated with
PJT's metric. Your approach does 'evacuate' the nearly idle CPUs over to
the nearly full CPUs.

Regards
Preeti U Murthy


Re: [RFC PATCH 2/3] sched: power aware load balance,

2012-11-07 Thread Alex Shi
On 11/07/2012 12:37 PM, Preeti Murthy wrote:
> Hi Alex,
> 
> What I am concerned about in this patchset as Peter also
> mentioned in the previous discussion of your approach
> (https://lkml.org/lkml/2012/8/13/139)
> is that:
> 
> 1. Using nr_running of two different sched groups to decide which one
> can be group_leader or group_min might not be the right approach,
> as this might mislead us into thinking that a group running one task is
> less loaded than a group running three tasks, although the former task
> is a CPU hog.
> 
> 2. Comparing the number of CPUs with the number of tasks running in a
> sched group to decide whether the group is underloaded or overloaded
> faces the same issue. The tasks might be short-running, not utilizing
> the CPU much.

Yes, maybe the number of tasks is not the best indicator. But as a first
step, it can prove that the proposal is on the right path and worth
pursuing further. Considering that the old power-saving implementation
also judged by the number of tasks, and given my testing results, it may
still be an option.
> 
> I also feel that before we introduce another side to the scheduler
> called 'power aware', why not try and see if the current scheduler
> itself can perform better? We have an opportunity in terms of PJT's
> patches, which can help the scheduler make more realistic decisions in
> load balance. Also, since PJT's metric is a statistical one, I believe
> we could vary it to allow the scheduler to do more or less rigorous
> power savings.

I will study PJT's approach.
Actually, the current patch set is also a kind of load-balance
modification, right? :)
> 
> It is true, however, that this approach will not try to evacuate nearly
> idle CPUs over to nearly full CPUs. That is definitely one of the
> benefits of your patch in terms of power savings, but I believe your
> patch is not making use of the right metric to decide that.

If one sched group has just one task, and another group has just one
LCPU idle, my patch will definitely pull the task to the nearly full
sched group. So I didn't understand what you meant by 'will not try to
evacuate nearly idle cpus over to nearly full cpus'.


> 
> IMHO, the approach towards a power-aware scheduler should take the
> following steps:
> 
> 1. Make use of PJT's per-entity load tracking metric to allow the
> scheduler to make more intelligent decisions in load balancing. Test
> the performance and power-save numbers.
> 
> 2. If the above shows some characteristic change in behaviour over the
> earlier scheduler, it should be either towards power save or towards
> performance. If found positive towards one of them, try varying the
> calculation of the per-entity load to see if it can lean towards the
> other behaviour. If it can, then there you go, you have a knob to
> change between policies right there!
> 
> 3. If you don't get enough power savings with the above approach, then
> add your patchset to evacuate nearly idle groups towards nearly busy
> groups, but use PJT's metric to make the decision.
> 
> What do you think?

Will consider this. Thanks!
> 
> Regards
> Preeti U Murthy
> On Tue, Nov 6, 2012 at 6:39 PM, Alex Shi  wrote:
>> This patch enables power-aware consideration in load balance.
>>
>> As mentioned in the power-aware scheduler proposal, power-aware
>> scheduling has two assumptions:
>> 1, race to idle is helpful for power saving
>> 2, shrinking tasks onto fewer sched_groups will reduce power consumption
>>
>> The first assumption makes the performance policy take over scheduling
>> when the system is busy.
>> The second assumption makes power-aware scheduling try to move
>> dispersed tasks into fewer groups until those groups are full of tasks.
>>
>> This patch reuses a lot of Suresh's power-saving load balance code.
>> Now the general enabling logic is:
>> 1, Collect power-aware scheduler statistics along with the performance
>> load balance statistics collection.
>> 2, If the domain is eligible for power load balance, do it and skip
>> the performance load balance; otherwise do the performance load balance.
>>
>> I have tried this on my 2-socket * 4-core * HT NHM EP machine
>> and on a 2-socket * 8-core * HT SNB EP machine.
>> In the following check, when I is 2/4/8/16, all tasks are
>> consolidated to run on a single core or a single socket.
>>
>> $for ((i=0; i < I; i++)) ; do while true; do : ; done & done
>>
>> Checking the power consumption with a powermeter on the NHM EP:
>>         powersaving  performance
>> I = 2   148w         160w
>> I = 4   175w         181w
>> I = 8   207w         224w
>> I = 16  324w         324w
>>
>> On a SNB laptop (4 cores * HT):
>>         powersaving  performance
>> I = 2   28w          35w
>> I = 4   38w          52w
>> I = 6   44w          54w
>> I = 8   56w          56w
>>
>> On the SNB EP machine, when I = 16, power saved more than 100 Watts.
>>
>> Also tested specjbb2005 with JRockit and kbuild; their peak performance
>> shows no clear change with the powersaving policy on any machine. Only
>> specjbb2005 with OpenJDK has about a 2% drop on the NHM EP machine with
>> powersaving 

Re: [RFC PATCH 2/3] sched: power aware load balance,

2012-11-07 Thread Alex Shi
On 11/07/2012 03:51 AM, Andrew Morton wrote:
> On Tue,  6 Nov 2012 21:09:58 +0800
> Alex Shi  wrote:
> 
>> $for ((i=0; i < I; i++)) ; do while true; do : ; done & done
>>
>> Checking the power consumption with a powermeter on the NHM EP:
>>         powersaving  performance
>> I = 2   148w         160w
>> I = 4   175w         181w
>> I = 8   207w         224w
>> I = 16  324w         324w
>>
>> On a SNB laptop (4 cores * HT):
>>         powersaving  performance
>> I = 2   28w          35w
>> I = 4   38w          52w
>> I = 6   44w          54w
>> I = 8   56w          56w
>>
>> On the SNB EP machine, when I = 16, power saved more than 100 Watts.
> 
> Confused.  According to the above table, at I=16 the EP machine saved 0
> watts.  Typo in the data?

Not a typo. Since the number of LCPUs in the EP machine is 16, with
I = 16 the powersaving policy effectively does not kick in. That is by
design, following the race-to-idle assumption.

The result looks the same with the third patch (for fork/exec/wakeup)
applied. The result is put here because it comes from this patch.

> 
> 
> Also, that's a pretty narrow test - it's doing fork and exec at very
> high frequency and things such as task placement decisions at process
> startup might be affecting the results.  Also, the load will be quite
> kernel-intensive, as opposed to the more typical userspace-intensive
> loads.

Sorry, why do you think it keeps doing fork/exec? It just generates
several 'bash' tasks that burn CPU, without fork/exec.

With I = 8, on my 32-LCPU SNB EP machine, there were no do_fork calls
in 5 seconds:

$ sudo perf stat -e probe:* -a sleep 5
 Performance counter stats for 'sleep 5':
   3 probe:do_execve   [100.00%]
   0 probe:do_fork [100.00%]

And it is not kernel-intensive; it runs almost entirely at user level.

'top' output: 25.0%us vs 0.0%sy

Tasks: 319 total,   9 running, 310 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.0%us,  0.0%sy,  0.0%ni, 74.5%id,  0.4%wa,  0.1%hi,  0.0%si,
0.0%st
...

> So, please run a broader set of tests so we can see the effects?
> 

Really, I have no more ideas for suitable benchmarks.

Just tried kbuild -j 16 on the 32-LCPU SNB EP: power is saved by just
10%, but compile time increases by about 15%.
It seems that when the task count hovers around the powersaving criteria
number, it just causes trouble.




-- 
Thanks
Alex


Re: [RFC PATCH 2/3] sched: power aware load balance,

2012-11-06 Thread Preeti Murthy
Hi Alex,

What I am concerned about in this patchset as Peter also
mentioned in the previous discussion of your approach
(https://lkml.org/lkml/2012/8/13/139)
is that:

1. Using nr_running of two different sched groups to decide which one
can be group_leader or group_min might not be the right approach,
as this might mislead us into thinking that a group running one task is
less loaded than a group running three tasks, although the former task
is a CPU hog.

2. Comparing the number of CPUs with the number of tasks running in a
sched group to decide whether the group is underloaded or overloaded
faces the same issue. The tasks might be short-running, not utilizing
the CPU much.

I also feel that before we introduce another side to the scheduler
called 'power aware', why not try and see if the current scheduler
itself can perform better? We have an opportunity in terms of PJT's
patches, which can help the scheduler make more realistic decisions in
load balance. Also, since PJT's metric is a statistical one, I believe
we could vary it to allow the scheduler to do more or less rigorous
power savings.

It is true, however, that this approach will not try to evacuate nearly
idle CPUs over to nearly full CPUs. That is definitely one of the
benefits of your patch in terms of power savings, but I believe your
patch is not making use of the right metric to decide that.

IMHO, the approach towards a power-aware scheduler should take the
following steps:

1. Make use of PJT's per-entity load tracking metric to allow the
scheduler to make more intelligent decisions in load balancing. Test the
performance and power-save numbers.

2. If the above shows some characteristic change in behaviour over the
earlier scheduler, it should be either towards power save or towards
performance. If found positive towards one of them, try varying the
calculation of the per-entity load to see if it can lean towards the
other behaviour. If it can, then there you go, you have a knob to change
between policies right there!

3. If you don't get enough power savings with the above approach, then
add your patchset to evacuate nearly idle groups towards nearly busy
groups, but use PJT's metric to make the decision.

What do you think?

Regards
Preeti U Murthy
On Tue, Nov 6, 2012 at 6:39 PM, Alex Shi  wrote:
> This patch enables power-aware consideration in load balance.
>
> As mentioned in the power-aware scheduler proposal, power-aware
> scheduling has two assumptions:
> 1, race to idle is helpful for power saving
> 2, shrinking tasks onto fewer sched_groups will reduce power consumption
>
> The first assumption makes the performance policy take over scheduling
> when the system is busy.
> The second assumption makes power-aware scheduling try to move
> dispersed tasks into fewer groups until those groups are full of tasks.
>
> This patch reuses a lot of Suresh's power-saving load balance code.
> Now the general enabling logic is:
> 1, Collect power-aware scheduler statistics along with the performance
> load balance statistics collection.
> 2, If the domain is eligible for power load balance, do it and skip the
> performance load balance; otherwise do the performance load balance.
>
> I have tried this on my 2-socket * 4-core * HT NHM EP machine
> and on a 2-socket * 8-core * HT SNB EP machine.
> In the following check, when I is 2/4/8/16, all tasks are
> consolidated to run on a single core or a single socket.
>
> $for ((i=0; i < I; i++)) ; do while true; do : ; done & done
>
> Checking the power consumption with a powermeter on the NHM EP:
>         powersaving  performance
> I = 2   148w         160w
> I = 4   175w         181w
> I = 8   207w         224w
> I = 16  324w         324w
>
> On a SNB laptop (4 cores * HT):
>         powersaving  performance
> I = 2   28w          35w
> I = 4   38w          52w
> I = 6   44w          54w
> I = 8   56w          56w
>
> On the SNB EP machine, when I = 16, power saved more than 100 Watts.
>
> Also tested specjbb2005 with JRockit and kbuild; their peak performance
> shows no clear change with the powersaving policy on any machine. Only
> specjbb2005 with OpenJDK has about a 2% drop on the NHM EP machine with
> the powersaving policy.
>
> This patch seems a bit long, but it seems hard to split it any smaller.
>
> Signed-off-by: Alex Shi 
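The enabling logic described in the quoted patch — one shared statistics pass, then a power path that packs tasks when the domain qualifies, falling back to the performance balancer otherwise — might look roughly like the following sketch. Every identifier and the eligibility rule here are hypothetical stand-ins, not the actual kernel code:

```python
# Hypothetical sketch of the enabling logic quoted above. Not kernel
# code; names, data shapes, and the eligibility test are made up.

def load_balance(groups, policy):
    stats = collect_stats(groups)              # shared statistics pass
    if policy == "powersaving" and power_eligible(stats):
        return power_balance(stats)            # pack tasks into fewer groups
    return performance_balance(stats)          # spread load evenly

def collect_stats(groups):
    # groups: iterable of (name, nr_running, nr_cpus)
    return [{"name": n, "nr_running": r, "capacity": c}
            for (n, r, c) in groups]

def power_eligible(stats):
    # Eligible when all tasks would fit into fewer groups than are in use.
    total = sum(g["nr_running"] for g in stats)
    used = sum(1 for g in stats if g["nr_running"] > 0)
    needed = -(-total // max(g["capacity"] for g in stats))  # ceiling division
    return needed < used

def power_balance(stats):
    # Pull from the least-loaded busy group into the most-loaded one.
    busy = [g for g in stats if g["nr_running"] > 0]
    src = min(busy, key=lambda g: g["nr_running"])
    dst = max(busy, key=lambda g: g["nr_running"])
    return (src["name"], dst["name"])

def performance_balance(stats):
    # Classic direction: move from the busiest group to the idlest.
    src = max(stats, key=lambda g: g["nr_running"])
    dst = min(stats, key=lambda g: g["nr_running"])
    return (src["name"], dst["name"])

groups = [("grp0", 3, 4), ("grp1", 1, 4)]      # (name, nr_running, cpus)
print(load_balance(groups, "powersaving"))     # prints ('grp1', 'grp0')
```

Note how the two policies pick opposite migration directions from the same statistics, which is the crux of the quoted design: one statistics collection, two decision paths.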
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 2/3] sched: power aware load balance,

2012-11-06 Thread Andrew Morton
On Tue,  6 Nov 2012 21:09:58 +0800
Alex Shi  wrote:

> $for ((i=0; i < I; i++)) ; do while true; do : ; done & done
> 
> Checking the power consuming with a powermeter on the NHM EP.
>         powersaving  performance
> I = 2   148w         160w
> I = 4   175w         181w
> I = 8   207w         224w
> I = 16  324w         324w
> 
> On a SNB laptop (4 cores * HT)
>         powersaving  performance
> I = 2   28w          35w
> I = 4   38w          52w
> I = 6   44w          54w
> I = 8   56w          56w
> 
> On the SNB EP machine, when I = 16, power saved more than 100 Watts.

Confused.  According to the above table, at I=16 the EP machine saved 0
watts.  Typo in the data?


Also, that's a pretty narrow test - it's doing fork and exec at very
high frequency and things such as task placement decisions at process
startup might be affecting the results.  Also, the load will be quite
kernel-intensive, as opposed to the more typical userspace-intensive
loads.

So, please run a broader set of tests so we can see the effects?

