Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-18 Thread Alex Shi
On 01/16/2013 11:08 PM, Morten Rasmussen wrote:
> On Wed, Jan 16, 2013 at 07:32:49AM +, Alex Shi wrote:
>> On 01/15/2013 01:00 AM, Morten Rasmussen wrote:
> Why multiply rq->util by nr_running?
>>>
>>> Let's take an example where rq->util = 50, nr_running = 2, and putil =
>>> 10. In this case the value of putil doesn't really matter as vacancy
>>> would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
>>> However, with rq->util = 50 there should be plenty of spare cpu time to
>>> take another task.
>
> for this example, the util is not full maybe due to it was just wake up,
> it still is possible like to run full time. So, I try to give it the
> large guess load.
>>> I don't see why rq->util should be treated different depending on the
>>> number of tasks causing the load. rq->util = 50 means that the cpu is
>>> busy about 50% of the time no matter how many tasks contibute to that
>>> load.
>>>
>>> If nr_running = 1 instead in my example, you would consider the cpu
>>> vacant if putil = 6, but if nr_running > 1 you would not. Why should the
>>> two scenarios be treated differently?
>>>
>>>
>>> Also, why multiply putil by 8? rq->util must be very close to 0 for
>>> vacancy to be positive if putil is close to 12 (12.5%).
>
> just want to pack small util tasks, since packing is possible to hurt
> performance.
>>> I agree that packing may affect performance. But why don't you reduce
>>> FULL_UTIL instead of multiplying by 8? With current expression you will
>>> not pack a 10% task if rq->util = 20 and nr_running = 1, but you would
>>> pack a 6% task even if rq->util = 50 and the resulting cpu load is much
>>> higher.
>>>
>>
>> Yes, the threshold has no strong theory or experiment support. I had
>> tried cyclitest which Vicent used, the case's load avg is too small to
>> be caught. so just use half of Vicent value as 12.5%. If you has more
>> reasonable value, let me know.
>>
>> As to nr_running engaged as multiple mode. it's base on 2 reasons.
>> 1, load avg/util need 345ms to accumulate as 100%. so, if a tasks is
>> cost full cpu time, it still has 345ms with rq->util < 1.
> 
> I agree that load avg may not be accurate, especially for new tasks. But
> why use it if you don't trust its value anyway?
> 
> The load avg (sum/period) of a new task will reach 100% instantly if the
> task is consuming all the cpu time it can get. An old task can reach 50%
> within 32ms. So you should fairly quickly be able to see if it is a
> light task or not. You may under-estimate its load in the beginning, but
> only for a very short time.

this packing is done in wakup, even no 'a very short time' here:)
> 
>> 2, if there are more tasks, like 2 tasks running on one cpu, it's
>> possible to has capacity to burn 200% cpu time, while the biggest
>> rq->util is still 100%.
> 
> If you want to have a better metric for how much cpu time the task on
> the runqueue could potentially use, I would suggest using
> cfs_rq->runnable_load_avg which is the load_avg_contrib sum of all tasks
> on the runqueue. It would give you 200% in your example above.

runnable_load_avg also need much time to accumulate its value, not
better than util.
> 
> On the other hand, I think rq->util is fine for this purpose. If
> rq->util < 100% you know for sure that cpu is not fully utilized no
> matter how many tasks you have on the runqueue. So as long as rq->util
> is well below 100% (like < 50%) it should be safe to pack more small
> tasks on that cpu even if it has multiple tasks running already.
> 
>>
>> Consider to figure out precise utils is complicate and cost much. I do
>> this simple calculation. It is not very precise, but it is efficient and
>> more bias toward performance.
> 
> It is indeed very biased towards performance. I would prefer more focus
> on saving power in a power scheduling policy :)
> 

Agree, and I don't refuse to change the criteria for power. :) but
without reliable benchmarks or data, everything is guess.

-- 
Thanks
Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-16 Thread Morten Rasmussen
On Wed, Jan 16, 2013 at 07:32:49AM +, Alex Shi wrote:
> On 01/15/2013 01:00 AM, Morten Rasmussen wrote:
> >>> Why multiply rq->util by nr_running?
> >>> > > 
> >>> > > Let's take an example where rq->util = 50, nr_running = 2, and putil =
> >>> > > 10. In this case the value of putil doesn't really matter as vacancy
> >>> > > would be negative anyway since FULL_UTIL - rq->util * nr_running is 
> >>> > > -1.
> >>> > > However, with rq->util = 50 there should be plenty of spare cpu time 
> >>> > > to
> >>> > > take another task.
> >> > 
> >> > for this example, the util is not full maybe due to it was just wake up,
> >> > it still is possible like to run full time. So, I try to give it the
> >> > large guess load.
> > I don't see why rq->util should be treated different depending on the
> > number of tasks causing the load. rq->util = 50 means that the cpu is
> > busy about 50% of the time no matter how many tasks contibute to that
> > load.
> > 
> > If nr_running = 1 instead in my example, you would consider the cpu
> > vacant if putil = 6, but if nr_running > 1 you would not. Why should the
> > two scenarios be treated differently?
> > 
> >>> > > 
> >>> > > Also, why multiply putil by 8? rq->util must be very close to 0 for
> >>> > > vacancy to be positive if putil is close to 12 (12.5%).
> >> > 
> >> > just want to pack small util tasks, since packing is possible to hurt
> >> > performance.
> > I agree that packing may affect performance. But why don't you reduce
> > FULL_UTIL instead of multiplying by 8? With current expression you will
> > not pack a 10% task if rq->util = 20 and nr_running = 1, but you would
> > pack a 6% task even if rq->util = 50 and the resulting cpu load is much
> > higher.
> > 
> 
> Yes, the threshold has no strong theory or experiment support. I had
> tried cyclitest which Vicent used, the case's load avg is too small to
> be caught. so just use half of Vicent value as 12.5%. If you has more
> reasonable value, let me know.
> 
> As to nr_running engaged as multiple mode. it's base on 2 reasons.
> 1, load avg/util need 345ms to accumulate as 100%. so, if a tasks is
> cost full cpu time, it still has 345ms with rq->util < 1.

I agree that load avg may not be accurate, especially for new tasks. But
why use it if you don't trust its value anyway?

The load avg (sum/period) of a new task will reach 100% instantly if the
task is consuming all the cpu time it can get. An old task can reach 50%
within 32ms. So you should fairly quickly be able to see if it is a
light task or not. You may under-estimate its load in the beginning, but
only for a very short time.

> 2, if there are more tasks, like 2 tasks running on one cpu, it's
> possible to has capacity to burn 200% cpu time, while the biggest
> rq->util is still 100%.

If you want to have a better metric for how much cpu time the task on
the runqueue could potentially use, I would suggest using
cfs_rq->runnable_load_avg which is the load_avg_contrib sum of all tasks
on the runqueue. It would give you 200% in your example above.

On the other hand, I think rq->util is fine for this purpose. If
rq->util < 100% you know for sure that cpu is not fully utilized no
matter how many tasks you have on the runqueue. So as long as rq->util
is well below 100% (like < 50%) it should be safe to pack more small
tasks on that cpu even if it has multiple tasks running already.

> 
> Consider to figure out precise utils is complicate and cost much. I do
> this simple calculation. It is not very precise, but it is efficient and
> more bias toward performance.

It is indeed very biased towards performance. I would prefer more focus
on saving power in a power scheduling policy :)

Morten

> 
> -- 
> Thanks Alex
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-16 Thread Namhyung Kim
On Wed, 16 Jan 2013 14:11:53 +0800, Alex Shi wrote:
> On 01/14/2013 03:13 PM, Namhyung Kim wrote:
>> On Fri, 11 Jan 2013 11:47:03 +0800, Alex Shi wrote:
>>> Um, change to leader_cpu?
>> 
>> vacantest? ;-)
>
> hard to the ward in google. are you sure it is better than leader_cpu?  :)

Nop.  My English is just bad. :)

But I'm not sure the "leader" is the right one.  Why is it called
"leader" in the first place?

Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-15 Thread Alex Shi
On 01/15/2013 01:00 AM, Morten Rasmussen wrote:
>>> Why multiply rq->util by nr_running?
>>> > > 
>>> > > Let's take an example where rq->util = 50, nr_running = 2, and putil =
>>> > > 10. In this case the value of putil doesn't really matter as vacancy
>>> > > would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
>>> > > However, with rq->util = 50 there should be plenty of spare cpu time to
>>> > > take another task.
>> > 
>> > for this example, the util is not full maybe due to it was just wake up,
>> > it still is possible like to run full time. So, I try to give it the
>> > large guess load.
> I don't see why rq->util should be treated different depending on the
> number of tasks causing the load. rq->util = 50 means that the cpu is
> busy about 50% of the time no matter how many tasks contibute to that
> load.
> 
> If nr_running = 1 instead in my example, you would consider the cpu
> vacant if putil = 6, but if nr_running > 1 you would not. Why should the
> two scenarios be treated differently?
> 
>>> > > 
>>> > > Also, why multiply putil by 8? rq->util must be very close to 0 for
>>> > > vacancy to be positive if putil is close to 12 (12.5%).
>> > 
>> > just want to pack small util tasks, since packing is possible to hurt
>> > performance.
> I agree that packing may affect performance. But why don't you reduce
> FULL_UTIL instead of multiplying by 8? With current expression you will
> not pack a 10% task if rq->util = 20 and nr_running = 1, but you would
> pack a 6% task even if rq->util = 50 and the resulting cpu load is much
> higher.
> 

Yes, the threshold has no strong theory or experiment support. I had
tried cyclitest which Vicent used, the case's load avg is too small to
be caught. so just use half of Vicent value as 12.5%. If you has more
reasonable value, let me know.

As to nr_running engaged as multiple mode. it's base on 2 reasons.
1, load avg/util need 345ms to accumulate as 100%. so, if a tasks is
cost full cpu time, it still has 345ms with rq->util < 1.
2, if there are more tasks, like 2 tasks running on one cpu, it's
possible to has capacity to burn 200% cpu time, while the biggest
rq->util is still 100%.

Consider to figure out precise utils is complicate and cost much. I do
this simple calculation. It is not very precise, but it is efficient and
more bias toward performance.

-- 
Thanks Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-15 Thread Alex Shi
On 01/14/2013 03:13 PM, Namhyung Kim wrote:
> On Fri, 11 Jan 2013 11:47:03 +0800, Alex Shi wrote:
>> On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
>>> On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
 If the wake/exec task is small enough, utils < 12.5%, it will
 has the chance to be packed into a cpu which is busy but still has space to
 handle it.

 Signed-off-by: Alex Shi 
 ---
> [snip]
>>> I may be missing something, but could the expression be something like
>>> the below instead?
>>>
>>> Create a putil < 12.5% check before the loop. There is no reason to
>>> recheck it every iteration. Then:
> 
> Agreed.  Also suggest that the checking local cpu can also be moved
> before the loop so that it can be used without going through the loop if
> it's vacant enough.

Yes, thanks for suggestion!
> 
>>>
>>> vacancy = FULL_UTIL - (rq->util + putil)
>>>
>>> should be enough?
>>>
 +
 +  /* bias toward local cpu */
 +  if (vacancy > 0 && (i == this_cpu))
 +  return i;
 +
 +  if (vacancy > 0 && vacancy < min_vacancy) {
 +  min_vacancy = vacancy;
 +  idlest = i;
>>>
>>> "idlest" may be a bit misleading here as you actually select busiest cpu
>>> that have enough spare capacity to take the task.
>>
>> Um, change to leader_cpu?
> 
> vacantest? ;-)

hard to the ward in google. are you sure it is better than leader_cpu?  :)
> 
> Thanks,
> Namhyung
> 


-- 
Thanks Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-14 Thread Morten Rasmussen
On Fri, Jan 11, 2013 at 03:47:03AM +, Alex Shi wrote:
> On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
> > On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
> >> If the wake/exec task is small enough, utils < 12.5%, it will
> >> has the chance to be packed into a cpu which is busy but still has space to
> >> handle it.
> >>
> >> Signed-off-by: Alex Shi 
> >> ---
> >>  kernel/sched/fair.c | 51 
> >> +--
> >>  1 file changed, 45 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 8d0d3af..0596e81 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct 
> >> sched_domain *sd,
> >>  }
> >>  
> >>  /*
> >> + * find_leader_cpu - find the busiest but still has enough leisure time 
> >> cpu
> >> + * among the cpus in group.
> >> + */
> >> +static int
> >> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int 
> >> this_cpu)
> >> +{
> >> +  unsigned vacancy, min_vacancy = UINT_MAX;
> > 
> > unsigned int?
> 
> yes
> > 
> >> +  int idlest = -1;
> >> +  int i;
> >> +  /* percentage the task's util */
> >> +  unsigned putil = p->se.avg.runnable_avg_sum * 100
> >> +  / (p->se.avg.runnable_avg_period + 1);
> > 
> > Alternatively you could use se.avg.load_avg_contrib which is the same
> > ratio scaled by the task priority (se->load.weight). In the above
> > expression you don't take priority into account.
> 
> sure. but this seems more directly of meaningful.
> > 
> >> +
> >> +  /* Traverse only the allowed CPUs */
> >> +  for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> >> +  struct rq *rq = cpu_rq(i);
> >> +  int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
> >> +
> >> +  /* only pack task which putil < 12.5% */
> >> +  vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);
> > 
> > I can't follow this expression.
> > 
> > The variables can have the following values:
> > FULL_UTIL  = 99
> > rq->util   = [0..99]
> > nr_running = [1..inf]
> > putil  = [0..99]
> > 
> > Why multiply rq->util by nr_running?
> > 
> > Let's take an example where rq->util = 50, nr_running = 2, and putil =
> > 10. In this case the value of putil doesn't really matter as vacancy
> > would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
> > However, with rq->util = 50 there should be plenty of spare cpu time to
> > take another task.
> 
> for this example, the util is not full maybe due to it was just wake up,
> it still is possible like to run full time. So, I try to give it the
> large guess load.

I don't see why rq->util should be treated different depending on the
number of tasks causing the load. rq->util = 50 means that the cpu is
busy about 50% of the time no matter how many tasks contibute to that
load.

If nr_running = 1 instead in my example, you would consider the cpu
vacant if putil = 6, but if nr_running > 1 you would not. Why should the
two scenarios be treated differently?

> > 
> > Also, why multiply putil by 8? rq->util must be very close to 0 for
> > vacancy to be positive if putil is close to 12 (12.5%).
> 
> just want to pack small util tasks, since packing is possible to hurt
> performance.

I agree that packing may affect performance. But why don't you reduce
FULL_UTIL instead of multiplying by 8? With current expression you will
not pack a 10% task if rq->util = 20 and nr_running = 1, but you would
pack a 6% task even if rq->util = 50 and the resulting cpu load is much
higher.

> > 
> > The vacancy variable is declared unsigned, so it will underflow instead
> > of becoming negative. Is this intentional?
> 
> ops, my mistake.
> > 
> > I may be missing something, but could the expression be something like
> > the below instead?
> > 
> > Create a putil < 12.5% check before the loop. There is no reason to
> > recheck it every iteration. Then:
> > 
> > vacancy = FULL_UTIL - (rq->util + putil)
> > 
> > should be enough?
> > 
> >> +
> >> +  /* bias toward local cpu */
> >> +  if (vacancy > 0 && (i == this_cpu))
> >> +  return i;
> >> +
> >> +  if (vacancy > 0 && vacancy < min_vacancy) {
> >> +  min_vacancy = vacancy;
> >> +  idlest = i;
> > 
> > "idlest" may be a bit misleading here as you actually select busiest cpu
> > that have enough spare capacity to take the task.
> 
> Um, change to leader_cpu?

Fine by me.

Morten

> > 
> > Morten
> > 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-13 Thread Namhyung Kim
On Fri, 11 Jan 2013 11:47:03 +0800, Alex Shi wrote:
> On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
>> On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
>>> If the wake/exec task is small enough, utils < 12.5%, it will
>>> has the chance to be packed into a cpu which is busy but still has space to
>>> handle it.
>>>
>>> Signed-off-by: Alex Shi 
>>> ---
[snip]
>> I may be missing something, but could the expression be something like
>> the below instead?
>> 
>> Create a putil < 12.5% check before the loop. There is no reason to
>> recheck it every iteration. Then:

Agreed.  Also suggest that the checking local cpu can also be moved
before the loop so that it can be used without going through the loop if
it's vacant enough.

>> 
>> vacancy = FULL_UTIL - (rq->util + putil)
>> 
>> should be enough?
>> 
>>> +
>>> +   /* bias toward local cpu */
>>> +   if (vacancy > 0 && (i == this_cpu))
>>> +   return i;
>>> +
>>> +   if (vacancy > 0 && vacancy < min_vacancy) {
>>> +   min_vacancy = vacancy;
>>> +   idlest = i;
>> 
>> "idlest" may be a bit misleading here as you actually select busiest cpu
>> that have enough spare capacity to take the task.
>
> Um, change to leader_cpu?

vacantest? ;-)

Thanks,
Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-10 Thread Alex Shi
On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
> On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
>> If the wake/exec task is small enough, utils < 12.5%, it will
>> has the chance to be packed into a cpu which is busy but still has space to
>> handle it.
>>
>> Signed-off-by: Alex Shi 
>> ---
>>  kernel/sched/fair.c | 51 +--
>>  1 file changed, 45 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8d0d3af..0596e81 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct 
>> sched_domain *sd,
>>  }
>>  
>>  /*
>> + * find_leader_cpu - find the busiest but still has enough leisure time cpu
>> + * among the cpus in group.
>> + */
>> +static int
>> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int 
>> this_cpu)
>> +{
>> +unsigned vacancy, min_vacancy = UINT_MAX;
> 
> unsigned int?

yes
> 
>> +int idlest = -1;
>> +int i;
>> +/* percentage the task's util */
>> +unsigned putil = p->se.avg.runnable_avg_sum * 100
>> +/ (p->se.avg.runnable_avg_period + 1);
> 
> Alternatively you could use se.avg.load_avg_contrib which is the same
> ratio scaled by the task priority (se->load.weight). In the above
> expression you don't take priority into account.

sure. but this seems more directly of meaningful.
> 
>> +
>> +/* Traverse only the allowed CPUs */
>> +for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
>> +struct rq *rq = cpu_rq(i);
>> +int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
>> +
>> +/* only pack task which putil < 12.5% */
>> +vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);
> 
> I can't follow this expression.
> 
> The variables can have the following values:
> FULL_UTIL  = 99
> rq->util   = [0..99]
> nr_running = [1..inf]
> putil  = [0..99]
> 
> Why multiply rq->util by nr_running?
> 
> Let's take an example where rq->util = 50, nr_running = 2, and putil =
> 10. In this case the value of putil doesn't really matter as vacancy
> would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
> However, with rq->util = 50 there should be plenty of spare cpu time to
> take another task.

for this example, the util is not full maybe due to it was just wake up,
it still is possible like to run full time. So, I try to give it the
large guess load.
> 
> Also, why multiply putil by 8? rq->util must be very close to 0 for
> vacancy to be positive if putil is close to 12 (12.5%).

just want to pack small util tasks, since packing is possible to hurt
performance.
> 
> The vacancy variable is declared unsigned, so it will underflow instead
> of becoming negative. Is this intentional?

ops, my mistake.
> 
> I may be missing something, but could the expression be something like
> the below instead?
> 
> Create a putil < 12.5% check before the loop. There is no reason to
> recheck it every iteration. Then:
> 
> vacancy = FULL_UTIL - (rq->util + putil)
> 
> should be enough?
> 
>> +
>> +/* bias toward local cpu */
>> +if (vacancy > 0 && (i == this_cpu))
>> +return i;
>> +
>> +if (vacancy > 0 && vacancy < min_vacancy) {
>> +min_vacancy = vacancy;
>> +idlest = i;
> 
> "idlest" may be a bit misleading here as you actually select busiest cpu
> that have enough spare capacity to take the task.

Um, change to leader_cpu?
> 
> Morten
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing

2013-01-10 Thread Morten Rasmussen
On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
> If the wake/exec task is small enough, utils < 12.5%, it will
> has the chance to be packed into a cpu which is busy but still has space to
> handle it.
> 
> Signed-off-by: Alex Shi 
> ---
>  kernel/sched/fair.c | 51 +--
>  1 file changed, 45 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8d0d3af..0596e81 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct 
> sched_domain *sd,
>  }
>  
>  /*
> + * find_leader_cpu - find the busiest but still has enough leisure time cpu
> + * among the cpus in group.
> + */
> +static int
> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int 
> this_cpu)
> +{
> + unsigned vacancy, min_vacancy = UINT_MAX;

unsigned int?

> + int idlest = -1;
> + int i;
> + /* percentage the task's util */
> + unsigned putil = p->se.avg.runnable_avg_sum * 100
> + / (p->se.avg.runnable_avg_period + 1);

Alternatively you could use se.avg.load_avg_contrib which is the same
ratio scaled by the task priority (se->load.weight). In the above
expression you don't take priority into account.

> +
> + /* Traverse only the allowed CPUs */
> + for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> + struct rq *rq = cpu_rq(i);
> + int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
> +
> + /* only pack task which putil < 12.5% */
> + vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);

I can't follow this expression.

The variables can have the following values:
FULL_UTIL  = 99
rq->util   = [0..99]
nr_running = [1..inf]
putil  = [0..99]

Why multiply rq->util by nr_running?

Let's take an example where rq->util = 50, nr_running = 2, and putil =
10. In this case the value of putil doesn't really matter as vacancy
would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
However, with rq->util = 50 there should be plenty of spare cpu time to
take another task.

Also, why multiply putil by 8? rq->util must be very close to 0 for
vacancy to be positive if putil is close to 12 (12.5%).

The vacancy variable is declared unsigned, so it will underflow instead
of becoming negative. Is this intentional?

I may be missing something, but could the expression be something like
the below instead?

Create a putil < 12.5% check before the loop. There is no reason to
recheck it every iteration. Then:

vacancy = FULL_UTIL - (rq->util + putil)

should be enough?

> +
> + /* bias toward local cpu */
> + if (vacancy > 0 && (i == this_cpu))
> + return i;
> +
> + if (vacancy > 0 && vacancy < min_vacancy) {
> + min_vacancy = vacancy;
> + idlest = i;

"idlest" may be a bit misleading here as you actually select busiest cpu
that have enough spare capacity to take the task.

Morten

> + }
> + }
> + return idlest;
> +}
> +
> +/*
>   * If power policy is eligible for this domain, and it has task allowed cpu.
>   * we will select CPU from this domain.
>   */
>  static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
> - struct task_struct *p, struct sd_lb_stats *sds)
> + struct task_struct *p, struct sd_lb_stats *sds, int fork)
>  {
>   int policy;
>   int new_cpu = -1;
>  
>   policy = get_sd_sched_policy(sd, cpu, p, sds);
> - if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
> - new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> -
> + if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
> + if (!fork)
> + new_cpu = find_leader_cpu(sds->group_leader, p, cpu);
> + /* for fork balancing and a little busy task */
> + if (new_cpu == -1)
> + new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> + }
>   return new_cpu;
>  }
>  
> @@ -3534,14 +3572,15 @@ select_task_rq_fair(struct task_struct *p, int 
> sd_flag, int flags)
>   if (tmp->flags & sd_flag) {
>   sd = tmp;
>  
> - new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
> + new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
> + flags & SD_BALANCE_FORK);
>   if (new_cpu != -1)
>   goto unlock;
>   }
>   }
>  
>   if (affine_sd) {
> - new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
> + new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
>   if (new_cpu != -1)
>   goto unlock;
>  
> -- 
> 1.7.12
> 
> --
> To unsubscribe from th