Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-18 Thread Alex Shi
On 01/16/2013 11:08 PM, Morten Rasmussen wrote:
> On Wed, Jan 16, 2013 at 07:32:49AM +, Alex Shi wrote:
>> On 01/15/2013 01:00 AM, Morten Rasmussen wrote:
> Why multiply rq->util by nr_running?
>>>
>>> Let's take an example where rq->util = 50, nr_running = 2, and putil =
>>> 10. In this case the value of putil doesn't really matter as vacancy
>>> would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
>>> However, with rq->util = 50 there should be plenty of spare cpu time to
>>> take another task.
>
> for this example, the util is not full maybe due to it was just wake up,
> it still is possible like to run full time. So, I try to give it the
> large guess load.
>>> I don't see why rq->util should be treated different depending on the
>>> number of tasks causing the load. rq->util = 50 means that the cpu is
>>> busy about 50% of the time no matter how many tasks contibute to that
>>> load.
>>>
>>> If nr_running = 1 instead in my example, you would consider the cpu
>>> vacant if putil = 6, but if nr_running > 1 you would not. Why should the
>>> two scenarios be treated differently?
>>>
>>>
>>> Also, why multiply putil by 8? rq->util must be very close to 0 for
>>> vacancy to be positive if putil is close to 12 (12.5%).
>
> just want to pack small util tasks, since packing is possible to hurt
> performance.
>>> I agree that packing may affect performance. But why don't you reduce
>>> FULL_UTIL instead of multiplying by 8? With current expression you will
>>> not pack a 10% task if rq->util = 20 and nr_running = 1, but you would
>>> pack a 6% task even if rq->util = 50 and the resulting cpu load is much
>>> higher.
>>>
>>
>> Yes, the threshold has no strong theory or experiment support. I had
>> tried cyclitest which Vicent used, the case's load avg is too small to
>> be caught. so just use half of Vicent value as 12.5%. If you has more
>> reasonable value, let me know.
>>
>> As to nr_running engaged as multiple mode. it's base on 2 reasons.
>> 1, load avg/util need 345ms to accumulate as 100%. so, if a tasks is
>> cost full cpu time, it still has 345ms with rq->util < 1.
> 
> I agree that load avg may not be accurate, especially for new tasks. But
> why use it if you don't trust its value anyway?
> 
> The load avg (sum/period) of a new task will reach 100% instantly if the
> task is consuming all the cpu time it can get. An old task can reach 50%
> within 32ms. So you should fairly quickly be able to see if it is a
> light task or not. You may under-estimate its load in the beginning, but
> only for a very short time.

This packing is done at wakeup, so there isn't even 'a very short time' here :)
> 
>> 2, if there are more tasks, like 2 tasks running on one cpu, it's
>> possible to has capacity to burn 200% cpu time, while the biggest
>> rq->util is still 100%.
> 
> If you want to have a better metric for how much cpu time the task on
> the runqueue could potentially use, I would suggest using
> cfs_rq->runnable_load_avg which is the load_avg_contrib sum of all tasks
> on the runqueue. It would give you 200% in your example above.

runnable_load_avg also needs much time to accumulate its value, so it is
not better than util.
> 
> On the other hand, I think rq->util is fine for this purpose. If
> rq->util < 100% you know for sure that cpu is not fully utilized no
> matter how many tasks you have on the runqueue. So as long as rq->util
> is well below 100% (like < 50%) it should be safe to pack more small
> tasks on that cpu even if it has multiple tasks running already.
> 
>>
>> Consider to figure out precise utils is complicate and cost much. I do
>> this simple calculation. It is not very precise, but it is efficient and
>> more bias toward performance.
> 
> It is indeed very biased towards performance. I would prefer more focus
> on saving power in a power scheduling policy :)
> 

Agreed, and I don't refuse to change the criteria for power. :) But
without reliable benchmarks or data, everything is a guess.

-- 
Thanks
Alex
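
For reference, a rough numeric sketch (plain userspace C, not kernel code) of the
ramp-up argued about above, assuming the usual per-entity load tracking decay of
y^32 = 1/2 with 1024 units contributed per 1 ms period: an old, long-idle task
that suddenly runs flat out crosses ~50% after 32 ms and only approaches 100%
after roughly 345 ms, while a brand-new task (whose period also starts from zero)
would show ~100% immediately.

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);   /* per-ms decay factor, y^32 = 1/2 */
	double max = 1024.0 / (1.0 - y);   /* saturated period, ~47742 */
	double sum = 0.0, period = max;    /* old task with a long idle history */
	int ms;

	for (ms = 1; ms <= 400; ms++) {
		sum = sum * y + 1024.0;        /* runnable for the whole ms */
		period = period * y + 1024.0;
		if (ms == 32 || ms == 100 || ms == 345)
			printf("%3d ms: util ~ %3.0f%%\n", ms, 100.0 * sum / period);
	}
	return 0;
}

Built with -lm, it prints roughly 50%, 89% and 100% for the three sample points.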


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-16 Thread Morten Rasmussen
On Wed, Jan 16, 2013 at 07:32:49AM +, Alex Shi wrote:
> On 01/15/2013 01:00 AM, Morten Rasmussen wrote:
> >>> Why multiply rq->util by nr_running?
> >>> > > 
> >>> > > Let's take an example where rq->util = 50, nr_running = 2, and putil =
> >>> > > 10. In this case the value of putil doesn't really matter as vacancy
> >>> > > would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
> >>> > > However, with rq->util = 50 there should be plenty of spare cpu time to
> >>> > > take another task.
> >> > 
> >> > for this example, the util is not full maybe due to it was just wake up,
> >> > it still is possible like to run full time. So, I try to give it the
> >> > large guess load.
> > I don't see why rq->util should be treated different depending on the
> > number of tasks causing the load. rq->util = 50 means that the cpu is
> > busy about 50% of the time no matter how many tasks contibute to that
> > load.
> > 
> > If nr_running = 1 instead in my example, you would consider the cpu
> > vacant if putil = 6, but if nr_running > 1 you would not. Why should the
> > two scenarios be treated differently?
> > 
> >>> > > 
> >>> > > Also, why multiply putil by 8? rq->util must be very close to 0 for
> >>> > > vacancy to be positive if putil is close to 12 (12.5%).
> >> > 
> >> > just want to pack small util tasks, since packing is possible to hurt
> >> > performance.
> > I agree that packing may affect performance. But why don't you reduce
> > FULL_UTIL instead of multiplying by 8? With current expression you will
> > not pack a 10% task if rq->util = 20 and nr_running = 1, but you would
> > pack a 6% task even if rq->util = 50 and the resulting cpu load is much
> > higher.
> > 
> 
> Yes, the threshold has no strong theory or experiment support. I had
> tried cyclitest which Vicent used, the case's load avg is too small to
> be caught. so just use half of Vicent value as 12.5%. If you has more
> reasonable value, let me know.
> 
> As to nr_running engaged as multiple mode. it's base on 2 reasons.
> 1, load avg/util need 345ms to accumulate as 100%. so, if a tasks is
> cost full cpu time, it still has 345ms with rq->util < 1.

I agree that load avg may not be accurate, especially for new tasks. But
why use it if you don't trust its value anyway?

The load avg (sum/period) of a new task will reach 100% instantly if the
task is consuming all the cpu time it can get. An old task can reach 50%
within 32ms. So you should fairly quickly be able to see if it is a
light task or not. You may under-estimate its load in the beginning, but
only for a very short time.

> 2, if there are more tasks, like 2 tasks running on one cpu, it's
> possible to has capacity to burn 200% cpu time, while the biggest
> rq->util is still 100%.

If you want to have a better metric for how much cpu time the task on
the runqueue could potentially use, I would suggest using
cfs_rq->runnable_load_avg which is the load_avg_contrib sum of all tasks
on the runqueue. It would give you 200% in your example above.

On the other hand, I think rq->util is fine for this purpose. If
rq->util < 100% you know for sure that cpu is not fully utilized no
matter how many tasks you have on the runqueue. So as long as rq->util
is well below 100% (like < 50%) it should be safe to pack more small
tasks on that cpu even if it has multiple tasks running already.

> 
> Consider to figure out precise utils is complicate and cost much. I do
> this simple calculation. It is not very precise, but it is efficient and
> more bias toward performance.

It is indeed very biased towards performance. I would prefer more focus
on saving power in a power scheduling policy :)

Morten

> 
> -- 
> Thanks Alex
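
To make the runnable_load_avg suggestion above concrete, here is a sketch of what
a headroom check based on it might look like. This is illustration only, not
submitted code: cpu_has_room_for() is a hypothetical helper, rq->util and
FULL_UTIL are taken from this patch series, cfs_rq->runnable_load_avg is the
3.8-era per-entity load tracking field, and the division by NICE_0_LOAD is only a
crude way to turn the weight-scaled sum back into a rough percent-of-one-cpu
number for nice-0 tasks.

/*
 * Sketch only: headroom check using the sum of the tracked load of all
 * tasks queued on the cpu.  Unlike rq->util, this keeps growing past
 * 100% when several busy tasks share the runqueue.
 */
static inline int cpu_has_room_for(int cpu, unsigned int putil)
{
	struct rq *rq = cpu_rq(cpu);
	/* runnable_load_avg is weight-scaled; normalise to ~percent of a cpu */
	unsigned long queued = rq->cfs.runnable_load_avg * 100 / NICE_0_LOAD;

	return (int)(FULL_UTIL - (queued + putil)) > 0;
}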


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-16 Thread Namhyung Kim
On Wed, 16 Jan 2013 14:11:53 +0800, Alex Shi wrote:
> On 01/14/2013 03:13 PM, Namhyung Kim wrote:
>> On Fri, 11 Jan 2013 11:47:03 +0800, Alex Shi wrote:
>>> Um, change to leader_cpu?
>> 
>> vacantest? ;-)
>
> hard to the ward in google. are you sure it is better than leader_cpu?  :)

Nope.  My English is just bad. :)

But I'm not sure the "leader" is the right one.  Why is it called
"leader" in the first place?

Thanks,
Namhyung


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-15 Thread Alex Shi
On 01/15/2013 01:00 AM, Morten Rasmussen wrote:
>>> Why multiply rq->util by nr_running?
>>> > > 
>>> > > Let's take an example where rq->util = 50, nr_running = 2, and putil =
>>> > > 10. In this case the value of putil doesn't really matter as vacancy
>>> > > would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
>>> > > However, with rq->util = 50 there should be plenty of spare cpu time to
>>> > > take another task.
>> > 
>> > for this example, the util is not full maybe due to it was just wake up,
>> > it still is possible like to run full time. So, I try to give it the
>> > large guess load.
> I don't see why rq->util should be treated different depending on the
> number of tasks causing the load. rq->util = 50 means that the cpu is
> busy about 50% of the time no matter how many tasks contibute to that
> load.
> 
> If nr_running = 1 instead in my example, you would consider the cpu
> vacant if putil = 6, but if nr_running > 1 you would not. Why should the
> two scenarios be treated differently?
> 
>>> > > 
>>> > > Also, why multiply putil by 8? rq->util must be very close to 0 for
>>> > > vacancy to be positive if putil is close to 12 (12.5%).
>> > 
>> > just want to pack small util tasks, since packing is possible to hurt
>> > performance.
> I agree that packing may affect performance. But why don't you reduce
> FULL_UTIL instead of multiplying by 8? With current expression you will
> not pack a 10% task if rq->util = 20 and nr_running = 1, but you would
> pack a 6% task even if rq->util = 50 and the resulting cpu load is much
> higher.
> 

Yes, the threshold has no strong theoretical or experimental support. I had
tried cyclictest, which Vincent used, but that case's load avg is too small to
be caught, so I just used half of Vincent's value, 12.5%. If you have a more
reasonable value, let me know.

As to using nr_running as a multiplier, it's based on 2 reasons.
1, load avg/util needs 345ms to accumulate to 100%; so even if a task is
consuming full cpu time, there are still up to 345ms with rq->util < 100%.
2, if there are more tasks, like 2 tasks running on one cpu, there is
potentially capacity to burn 200% cpu time, while the biggest
rq->util is still 100%.

Considering that figuring out the precise util is complicated and costly, I do
this simple calculation. It is not very precise, but it is efficient and
more biased toward performance.

-- 
Thanks Alex


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-15 Thread Alex Shi
On 01/14/2013 03:13 PM, Namhyung Kim wrote:
> On Fri, 11 Jan 2013 11:47:03 +0800, Alex Shi wrote:
>> On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
>>> On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
 If the wake/exec task is small enough, utils < 12.5%, it will
 has the chance to be packed into a cpu which is busy but still has space to
 handle it.

 Signed-off-by: Alex Shi 
 ---
> [snip]
>>> I may be missing something, but could the expression be something like
>>> the below instead?
>>>
>>> Create a putil < 12.5% check before the loop. There is no reason to
>>> recheck it every iteration. Then:
> 
> Agreed.  Also suggest that the checking local cpu can also be moved
> before the loop so that it can be used without going through the loop if
> it's vacant enough.

Yes, thanks for the suggestion!
> 
>>>
>>> vacancy = FULL_UTIL - (rq->util + putil)
>>>
>>> should be enough?
>>>
 +
 +  /* bias toward local cpu */
 +  if (vacancy > 0 && (i == this_cpu))
 +  return i;
 +
 +  if (vacancy > 0 && vacancy < min_vacancy) {
 +  min_vacancy = vacancy;
 +  idlest = i;
>>>
>>> "idlest" may be a bit misleading here as you actually select busiest cpu
>>> that have enough spare capacity to take the task.
>>
>> Um, change to leader_cpu?
> 
> vacantest? ;-)

Hard to find the word in Google. Are you sure it is better than leader_cpu?  :)
> 
> Thanks,
> Namhyung
> 


-- 
Thanks Alex


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-14 Thread Morten Rasmussen
On Fri, Jan 11, 2013 at 03:47:03AM +, Alex Shi wrote:
> On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
> > On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
> >> If the wake/exec task is small enough, utils < 12.5%, it will
> >> has the chance to be packed into a cpu which is busy but still has space to
> >> handle it.
> >>
> >> Signed-off-by: Alex Shi 
> >> ---
> >>  kernel/sched/fair.c | 51 
> >> +--
> >>  1 file changed, 45 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 8d0d3af..0596e81 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct 
> >> sched_domain *sd,
> >>  }
> >>  
> >>  /*
> >> + * find_leader_cpu - find the busiest but still has enough leisure time 
> >> cpu
> >> + * among the cpus in group.
> >> + */
> >> +static int
> >> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int 
> >> this_cpu)
> >> +{
> >> +  unsigned vacancy, min_vacancy = UINT_MAX;
> > 
> > unsigned int?
> 
> yes
> > 
> >> +  int idlest = -1;
> >> +  int i;
> >> +  /* percentage the task's util */
> >> +  unsigned putil = p->se.avg.runnable_avg_sum * 100
> >> +  / (p->se.avg.runnable_avg_period + 1);
> > 
> > Alternatively you could use se.avg.load_avg_contrib which is the same
> > ratio scaled by the task priority (se->load.weight). In the above
> > expression you don't take priority into account.
> 
> sure. but this seems more directly of meaningful.
> > 
> >> +
> >> +  /* Traverse only the allowed CPUs */
> >> +  for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> >> +  struct rq *rq = cpu_rq(i);
> >> +  int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
> >> +
> >> +  /* only pack task which putil < 12.5% */
> >> +  vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);
> > 
> > I can't follow this expression.
> > 
> > The variables can have the following values:
> > FULL_UTIL  = 99
> > rq->util   = [0..99]
> > nr_running = [1..inf]
> > putil  = [0..99]
> > 
> > Why multiply rq->util by nr_running?
> > 
> > Let's take an example where rq->util = 50, nr_running = 2, and putil =
> > 10. In this case the value of putil doesn't really matter as vacancy
> > would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
> > However, with rq->util = 50 there should be plenty of spare cpu time to
> > take another task.
> 
> for this example, the util is not full maybe due to it was just wake up,
> it still is possible like to run full time. So, I try to give it the
> large guess load.

I don't see why rq->util should be treated differently depending on the
number of tasks causing the load. rq->util = 50 means that the cpu is
busy about 50% of the time no matter how many tasks contribute to that
load.

If nr_running = 1 instead in my example, you would consider the cpu
vacant if putil = 6, but if nr_running > 1 you would not. Why should the
two scenarios be treated differently?

> > 
> > Also, why multiply putil by 8? rq->util must be very close to 0 for
> > vacancy to be positive if putil is close to 12 (12.5%).
> 
> just want to pack small util tasks, since packing is possible to hurt
> performance.

I agree that packing may affect performance. But why don't you reduce
FULL_UTIL instead of multiplying by 8? With the current expression you will
not pack a 10% task if rq->util = 20 and nr_running = 1, but you would
pack a 6% task even if rq->util = 50 and the resulting cpu load is much
higher.

> > 
> > The vacancy variable is declared unsigned, so it will underflow instead
> > of becoming negative. Is this intentional?
> 
> ops, my mistake.
> > 
> > I may be missing something, but could the expression be something like
> > the below instead?
> > 
> > Create a putil < 12.5% check before the loop. There is no reason to
> > recheck it every iteration. Then:
> > 
> > vacancy = FULL_UTIL - (rq->util + putil)
> > 
> > should be enough?
> > 
> >> +
> >> +  /* bias toward local cpu */
> >> +  if (vacancy > 0 && (i == this_cpu))
> >> +  return i;
> >> +
> >> +  if (vacancy > 0 && vacancy < min_vacancy) {
> >> +  min_vacancy = vacancy;
> >> +  idlest = i;
> > 
> > "idlest" may be a bit misleading here as you actually select busiest cpu
> > that have enough spare capacity to take the task.
> 
> Um, change to leader_cpu?

Fine by me.

Morten

> > 
> > Morten
> > 
> 

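The asymmetry described above is easy to check numerically. A small stand-alone C
snippet (not kernel code) evaluating the patch's expression with FULL_UTIL = 99,
as given in this thread, on the three cases discussed:

#include <stdio.h>

/* vacancy as computed in the patch: FULL_UTIL - (rq->util * nr_running + putil * 8) */
static int vacancy(int util, int nr_running, int putil)
{
	return 99 - (util * nr_running + putil * 8);
}

int main(void)
{
	/* 10% task, cpu 20% busy, one task: 99 - (20 + 80) = -1, not packed */
	printf("%d\n", vacancy(20, 1, 10));
	/* 6% task, cpu 50% busy, one task: 99 - (50 + 48) = 1, packed */
	printf("%d\n", vacancy(50, 1, 6));
	/* 10% task, cpu 50% busy, two tasks: 99 - (100 + 80) = -81, not packed */
	printf("%d\n", vacancy(50, 2, 10));
	return 0;
}
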


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-13 Thread Namhyung Kim
On Fri, 11 Jan 2013 11:47:03 +0800, Alex Shi wrote:
> On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
>> On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
>>> If the wake/exec task is small enough, utils < 12.5%, it will
>>> has the chance to be packed into a cpu which is busy but still has space to
>>> handle it.
>>>
>>> Signed-off-by: Alex Shi 
>>> ---
[snip]
>> I may be missing something, but could the expression be something like
>> the below instead?
>> 
>> Create a putil < 12.5% check before the loop. There is no reason to
>> recheck it every iteration. Then:

Agreed.  I'd also suggest that the check of the local cpu be moved
before the loop, so that it can be used without going through the loop if
it's vacant enough.

>> 
>> vacancy = FULL_UTIL - (rq->util + putil)
>> 
>> should be enough?
>> 
>>> +
>>> +   /* bias toward local cpu */
>>> +   if (vacancy > 0 && (i == this_cpu))
>>> +   return i;
>>> +
>>> +   if (vacancy > 0 && vacancy < min_vacancy) {
>>> +   min_vacancy = vacancy;
>>> +   idlest = i;
>> 
>> "idlest" may be a bit misleading here as you actually select busiest cpu
>> that have enough spare capacity to take the task.
>
> Um, change to leader_cpu?

vacantest? ;-)

Thanks,
Namhyung
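
Putting Morten's and Namhyung's suggestions together, a possible shape for
find_leader_cpu() would be something like the sketch below. This is illustration
only, not submitted code: it keeps the rq->util and FULL_UTIL definitions from
the v3 patch, makes vacancy signed so it can actually go negative, checks the
putil threshold once before the loop, and tries the local cpu first.

/* Sketch of the reviewers' suggestions, based on the v3 patch. */
static int
find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
{
	int vacancy, min_vacancy = INT_MAX;	/* signed, may go negative */
	int leader = -1;
	int i;
	/* percentage of cpu time the task uses */
	unsigned int putil = p->se.avg.runnable_avg_sum * 100
			/ (p->se.avg.runnable_avg_period + 1);

	/* only pack tasks below ~12.5%; check once instead of per cpu */
	if (putil * 8 >= FULL_UTIL)
		return -1;

	/* bias toward the local cpu: take it right away if it has room */
	if (cpumask_test_cpu(this_cpu, sched_group_cpus(group)) &&
	    cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
	    FULL_UTIL - (int)(cpu_rq(this_cpu)->util + putil) > 0)
		return this_cpu;

	/* otherwise pick the busiest cpu that still has space for the task */
	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
		vacancy = FULL_UTIL - (int)(cpu_rq(i)->util + putil);

		if (vacancy > 0 && vacancy < min_vacancy) {
			min_vacancy = vacancy;
			leader = i;
		}
	}
	return leader;
}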


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-10 Thread Alex Shi
On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
> On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
>> If the wake/exec task is small enough, utils < 12.5%, it will
>> has the chance to be packed into a cpu which is busy but still has space to
>> handle it.
>>
>> Signed-off-by: Alex Shi 
>> ---
>>  kernel/sched/fair.c | 51 +--
>>  1 file changed, 45 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8d0d3af..0596e81 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct 
>> sched_domain *sd,
>>  }
>>  
>>  /*
>> + * find_leader_cpu - find the busiest but still has enough leisure time cpu
>> + * among the cpus in group.
>> + */
>> +static int
>> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int 
>> this_cpu)
>> +{
>> +unsigned vacancy, min_vacancy = UINT_MAX;
> 
> unsigned int?

yes
> 
>> +int idlest = -1;
>> +int i;
>> +/* percentage the task's util */
>> +unsigned putil = p->se.avg.runnable_avg_sum * 100
>> +/ (p->se.avg.runnable_avg_period + 1);
> 
> Alternatively you could use se.avg.load_avg_contrib which is the same
> ratio scaled by the task priority (se->load.weight). In the above
> expression you don't take priority into account.

Sure, but this seems more directly meaningful.
> 
>> +
>> +/* Traverse only the allowed CPUs */
>> +for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
>> +struct rq *rq = cpu_rq(i);
>> +int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
>> +
>> +/* only pack task which putil < 12.5% */
>> +vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);
> 
> I can't follow this expression.
> 
> The variables can have the following values:
> FULL_UTIL  = 99
> rq->util   = [0..99]
> nr_running = [1..inf]
> putil  = [0..99]
> 
> Why multiply rq->util by nr_running?
> 
> Let's take an example where rq->util = 50, nr_running = 2, and putil =
> 10. In this case the value of putil doesn't really matter as vacancy
> would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
> However, with rq->util = 50 there should be plenty of spare cpu time to
> take another task.

For this example, the util may not be full because the task just woke up,
but it is still possible that it will run full time. So I try to give it a
large guessed load.
> 
> Also, why multiply putil by 8? rq->util must be very close to 0 for
> vacancy to be positive if putil is close to 12 (12.5%).

I just want to pack small-util tasks, since packing can possibly hurt
performance.
> 
> The vacancy variable is declared unsigned, so it will underflow instead
> of becoming negative. Is this intentional?

Oops, my mistake.
> 
> I may be missing something, but could the expression be something like
> the below instead?
> 
> Create a putil < 12.5% check before the loop. There is no reason to
> recheck it every iteration. Then:
> 
> vacancy = FULL_UTIL - (rq->util + putil)
> 
> should be enough?
> 
>> +
>> +/* bias toward local cpu */
>> +if (vacancy > 0 && (i == this_cpu))
>> +return i;
>> +
>> +if (vacancy > 0 && vacancy < min_vacancy) {
>> +min_vacancy = vacancy;
>> +idlest = i;
> 
> "idlest" may be a bit misleading here as you actually select busiest cpu
> that have enough spare capacity to take the task.

Um, change to leader_cpu?
> 
> Morten
> 
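
The "unsigned vacancy" problem Morten mentions is worth spelling out: in C the
subtraction wraps around instead of going negative, so the "vacancy > 0" test
passes for a cpu that has no room at all. A tiny stand-alone demonstration
(plain C, not the patch itself), using the values from the example above:

#include <stdio.h>

int main(void)
{
	/* rq->util = 50, nr_running = 2, putil = 10: "should" be -81 */
	unsigned int vacancy = 99 - (50 * 2 + 10 * 8);

	if (vacancy > 0)	/* true: the value wrapped to a huge number */
		printf("overloaded cpu looks vacant, vacancy = %u\n", vacancy);
	return 0;
}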


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-10 Thread Morten Rasmussen
On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
> If the wake/exec task is small enough, utils < 12.5%, it will
> has the chance to be packed into a cpu which is busy but still has space to
> handle it.
> 
> Signed-off-by: Alex Shi 
> ---
>  kernel/sched/fair.c | 51 +--
>  1 file changed, 45 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8d0d3af..0596e81 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct 
> sched_domain *sd,
>  }
>  
>  /*
> + * find_leader_cpu - find the busiest but still has enough leisure time cpu
> + * among the cpus in group.
> + */
> +static int
> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int 
> this_cpu)
> +{
> + unsigned vacancy, min_vacancy = UINT_MAX;

unsigned int?

> + int idlest = -1;
> + int i;
> + /* percentage the task's util */
> + unsigned putil = p->se.avg.runnable_avg_sum * 100
> + / (p->se.avg.runnable_avg_period + 1);

Alternatively you could use se.avg.load_avg_contrib which is the same
ratio scaled by the task priority (se->load.weight). In the above
expression you don't take priority into account.

> +
> + /* Traverse only the allowed CPUs */
> + for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> + struct rq *rq = cpu_rq(i);
> + int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
> +
> + /* only pack task which putil < 12.5% */
> + vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);

I can't follow this expression.

The variables can have the following values:
FULL_UTIL  = 99
rq->util   = [0..99]
nr_running = [1..inf]
putil  = [0..99]

Why multiply rq->util by nr_running?

Let's take an example where rq->util = 50, nr_running = 2, and putil =
10. In this case the value of putil doesn't really matter as vacancy
would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
However, with rq->util = 50 there should be plenty of spare cpu time to
take another task.

Also, why multiply putil by 8? rq->util must be very close to 0 for
vacancy to be positive if putil is close to 12 (12.5%).

The vacancy variable is declared unsigned, so it will underflow instead
of becoming negative. Is this intentional?

I may be missing something, but could the expression be something like
the below instead?

Create a putil < 12.5% check before the loop. There is no reason to
recheck it every iteration. Then:

vacancy = FULL_UTIL - (rq->util + putil)

should be enough?

> +
> + /* bias toward local cpu */
> + if (vacancy > 0 && (i == this_cpu))
> + return i;
> +
> + if (vacancy > 0 && vacancy < min_vacancy) {
> + min_vacancy = vacancy;
> + idlest = i;

"idlest" may be a bit misleading here as you actually select busiest cpu
that have enough spare capacity to take the task.

Morten

> + }
> + }
> + return idlest;
> +}
> +
> +/*
>   * If power policy is eligible for this domain, and it has task allowed cpu.
>   * we will select CPU from this domain.
>   */
>  static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
> - struct task_struct *p, struct sd_lb_stats *sds)
> + struct task_struct *p, struct sd_lb_stats *sds, int fork)
>  {
>   int policy;
>   int new_cpu = -1;
>  
>   policy = get_sd_sched_policy(sd, cpu, p, sds);
> - if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
> - new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> -
> + if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
> + if (!fork)
> + new_cpu = find_leader_cpu(sds->group_leader, p, cpu);
> + /* for fork balancing and a little busy task */
> + if (new_cpu == -1)
> + new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> + }
>   return new_cpu;
>  }
>  
> @@ -3534,14 +3572,15 @@ select_task_rq_fair(struct task_struct *p, int 
> sd_flag, int flags)
>   if (tmp->flags & sd_flag) {
>   sd = tmp;
>  
> - new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
> + new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
> + flags & SD_BALANCE_FORK);
>   if (new_cpu != -1)
>   goto unlock;
>   }
>   }
>  
>   if (affine_sd) {
> - new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
> + new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
>   if (new_cpu != -1)
>   goto unlock;
>  
> -- 
> 1.7.12
> 

[PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-05 Thread Alex Shi
If the wake/exec task is small enough, utils < 12.5%, it will
have the chance to be packed into a cpu which is busy but still has space to
handle it.

Signed-off-by: Alex Shi 
---
 kernel/sched/fair.c | 51 +--
 1 file changed, 45 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8d0d3af..0596e81 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct 
sched_domain *sd,
 }
 
 /*
+ * find_leader_cpu - find the busiest but still has enough leisure time cpu
+ * among the cpus in group.
+ */
+static int
+find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+{
+   unsigned vacancy, min_vacancy = UINT_MAX;
+   int idlest = -1;
+   int i;
+   /* percentage the task's util */
+   unsigned putil = p->se.avg.runnable_avg_sum * 100
+   / (p->se.avg.runnable_avg_period + 1);
+
+   /* Traverse only the allowed CPUs */
+   for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+   struct rq *rq = cpu_rq(i);
+   int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+
+   /* only pack task which putil < 12.5% */
+   vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);
+
+   /* bias toward local cpu */
+   if (vacancy > 0 && (i == this_cpu))
+   return i;
+
+   if (vacancy > 0 && vacancy < min_vacancy) {
+   min_vacancy = vacancy;
+   idlest = i;
+   }
+   }
+   return idlest;
+}
+
+/*
  * If power policy is eligible for this domain, and it has task allowed cpu.
  * we will select CPU from this domain.
  */
 static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
-   struct task_struct *p, struct sd_lb_stats *sds)
+   struct task_struct *p, struct sd_lb_stats *sds, int fork)
 {
int policy;
int new_cpu = -1;
 
policy = get_sd_sched_policy(sd, cpu, p, sds);
-   if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
-   new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
-
+   if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
+   if (!fork)
+   new_cpu = find_leader_cpu(sds->group_leader, p, cpu);
+   /* for fork balancing and a little busy task */
+   if (new_cpu == -1)
+   new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+   }
return new_cpu;
 }
 
@@ -3534,14 +3572,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, 
int flags)
if (tmp->flags & sd_flag) {
sd = tmp;
 
-   new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+   new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+   flags & SD_BALANCE_FORK);
if (new_cpu != -1)
goto unlock;
}
}
 
if (affine_sd) {
-   new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+   new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
if (new_cpu != -1)
goto unlock;
 
-- 
1.7.12

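As a quick sanity check of the "putil < 12.5%" comment in the code above, the
snippet below (stand-alone C, not kernel code) prints, for the single-task case,
the largest rq->util that still yields a positive vacancy at a given putil, using
the FULL_UTIL = 99 value mentioned in the review. With the putil * 8 term, a 12%
task is only packable onto an almost idle cpu, and anything at or above 13% is
never packed:

#include <stdio.h>

int main(void)
{
	const int FULL_UTIL = 99;
	int putil;

	for (putil = 4; putil <= 13; putil++) {
		/* vacancy = FULL_UTIL - (rq->util * 1 + putil * 8) must be > 0 */
		int max_util = FULL_UTIL - putil * 8 - 1;

		if (max_util < 0)
			printf("putil %2d%%: never packed\n", putil);
		else
			printf("putil %2d%%: packed while rq->util <= %d\n",
			       putil, max_util);
	}
	return 0;
}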

