Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-21 Thread Vincent Guittot
On 21 December 2012 09:53, Vincent Guittot  wrote:
> On 21 December 2012 06:47, Namhyung Kim  wrote:
>> Hi Vincent,
>>
>> On Thu, Dec 13, 2012 at 11:11:11AM +0100, Vincent Guittot wrote:
>>> On 13 December 2012 03:17, Alex Shi  wrote:
>>> > On 12/12/2012 09:31 PM, Vincent Guittot wrote:
>>> >> +static bool is_buddy_busy(int cpu)
>>> >> +{
>>> >> + struct rq *rq = cpu_rq(cpu);
>>> >> +
>>> >> + /*
>>> >> +  * A busy buddy is a CPU with a high load or a small load with a 
>>> >> lot of
>>> >> +  * running tasks.
>>> >> +  */
>>> >> + return ((rq->avg.runnable_avg_sum << rq->nr_running) >
>>> >
>>> > If nr_running a bit big, rq->avg.runnable_avg_sum << rq->nr_running is
>>> > zero. you will get the wrong decision.
>>>
>>> yes, I'm going to do that like below instead:
>>> return (rq->avg.runnable_avg_sum > (rq->avg.runnable_avg_period >>
>>> rq->nr_running));
>>
>> Doesn't it consider nr_running too much?  It seems current is_buddy_busy
>> returns false on a cpu that has 1 task runs 40% cputime, but returns true
>> on a cpu that has 3 tasks runs 10% cputime each or for 2 tasks of 15%
>> cputime each, right?
>
> Yes it's right.
>>
>> I don't know what is correct, but just guessing that in a cpu's point
>> of view it'd be busier if it has a higher runnable_avg_sum than a
>> higher nr_running IMHO.
>
Sorry, the mail was sent before I finished it.

> The nr_running is used to point out how many tasks are running
> simultaneously and the potential scheduling latency of adding

The nr_running is used to point out how many tasks are running
simultaneously and, as a result, the potential scheduling latency.
I have used the shift instruction because it is quite simple and
efficient, but it may give too much weight to nr_running. I could use a
simple division instead of shifting runnable_avg_sum.
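
For illustration, a minimal sketch of the corrected check quoted above, with
the reasoning spelled out in comments (field names follow the posted patch;
this is a sketch of the idea, not the actual next version):

static bool is_buddy_busy(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/*
	 * Busy when the load exceeds period >> nr_running, i.e. each
	 * additional running task halves the threshold.  Shifting the
	 * period down instead of shifting the sum up avoids the left-shift
	 * overflow that Alex pointed out for large nr_running.
	 *
	 * A division such as "sum * nr_running > period" would weight
	 * nr_running linearly rather than exponentially; the exact scaling
	 * is a tuning choice.
	 */
	return rq->avg.runnable_avg_sum >
		(rq->avg.runnable_avg_period >> rq->nr_running);
}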

>
>>
>>
>>>
>>> >
>>> >> + rq->avg.runnable_avg_period);
>>> >> +}
>>> >> +
>>> >> +static bool is_light_task(struct task_struct *p)
>>> >> +{
>>> >> + /* A light task runs less than 25% in average */
>>> >> + return ((p->se.avg.runnable_avg_sum << 1) <
>>> >> + p->se.avg.runnable_avg_period);
>>> >
>>> > 25% may not suitable for big machine.
>>>
>>> Threshold is always an issue, which threshold should be suitable for
>>> big machine ?
>>>
>>> I'm wondering if i should use the imbalance_pct value for computing
>>> the threshold
>>
>> Anyway, I wonder how 'sum << 1' computes 25%.  Shouldn't it be << 2 ?
>
> The 1st version of the patch was using << 2 but I received a comment
> saying that it was maybe not aggressive enough, so I have updated the
> formula with << 1 but forgot to update the comment. I will align the
> comment and formula in the next version.
> Thanks for pointing this
>
> Vincent
>
>>
>> Thanks,
>> Namhyung


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-21 Thread Vincent Guittot
On 21 December 2012 06:47, Namhyung Kim  wrote:
> Hi Vincent,
>
> On Thu, Dec 13, 2012 at 11:11:11AM +0100, Vincent Guittot wrote:
>> On 13 December 2012 03:17, Alex Shi  wrote:
>> > On 12/12/2012 09:31 PM, Vincent Guittot wrote:
>> >> +static bool is_buddy_busy(int cpu)
>> >> +{
>> >> + struct rq *rq = cpu_rq(cpu);
>> >> +
>> >> + /*
>> >> +  * A busy buddy is a CPU with a high load or a small load with a 
>> >> lot of
>> >> +  * running tasks.
>> >> +  */
>> >> + return ((rq->avg.runnable_avg_sum << rq->nr_running) >
>> >
>> > If nr_running a bit big, rq->avg.runnable_avg_sum << rq->nr_running is
>> > zero. you will get the wrong decision.
>>
>> yes, I'm going to do that like below instead:
>> return (rq->avg.runnable_avg_sum > (rq->avg.runnable_avg_period >>
>> rq->nr_running));
>
> Doesn't it consider nr_running too much?  It seems current is_buddy_busy
> returns false on a cpu that has 1 task runs 40% cputime, but returns true
> on a cpu that has 3 tasks runs 10% cputime each or for 2 tasks of 15%
> cputime each, right?

Yes it's right.
>
> I don't know what is correct, but just guessing that in a cpu's point
> of view it'd be busier if it has a higher runnable_avg_sum than a
> higher nr_running IMHO.

The nr_running is used to point out how many tasks are running
simultaneously and the potential scheduling latency of adding

>
>
>>
>> >
>> >> + rq->avg.runnable_avg_period);
>> >> +}
>> >> +
>> >> +static bool is_light_task(struct task_struct *p)
>> >> +{
>> >> + /* A light task runs less than 25% in average */
>> >> + return ((p->se.avg.runnable_avg_sum << 1) <
>> >> + p->se.avg.runnable_avg_period);
>> >
>> > 25% may not suitable for big machine.
>>
>> Threshold is always an issue, which threshold should be suitable for
>> big machine ?
>>
>> I'm wondering if i should use the imbalance_pct value for computing
>> the threshold
>
> Anyway, I wonder how 'sum << 1' computes 25%.  Shouldn't it be << 2 ?

The 1st version of the patch was using << 2 but I received a comment
saying that it was maybe not aggressive enough, so I have updated the
formula with << 1 but forgot to update the comment. I will align the
comment and formula in the next version.
Thanks for pointing this
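
As a sketch of what aligning the comment and formula could look like if the
<< 1 threshold is kept (this also spells out the arithmetic behind Namhyung's
question):

static bool is_light_task(struct task_struct *p)
{
	/*
	 * (sum << 1) < period is equivalent to sum < period / 2, so a
	 * light task is one that runs less than 50% of the time on
	 * average.  The earlier "<< 2" variant corresponded to the 25%
	 * threshold that the stale comment still describes.
	 */
	return (p->se.avg.runnable_avg_sum << 1) <
		p->se.avg.runnable_avg_period;
}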

Vincent

>
> Thanks,
> Namhyung


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-20 Thread Namhyung Kim
Hi Vincent,

On Thu, Dec 13, 2012 at 11:11:11AM +0100, Vincent Guittot wrote:
> On 13 December 2012 03:17, Alex Shi  wrote:
> > On 12/12/2012 09:31 PM, Vincent Guittot wrote:
> >> +static bool is_buddy_busy(int cpu)
> >> +{
> >> + struct rq *rq = cpu_rq(cpu);
> >> +
> >> + /*
> >> +  * A busy buddy is a CPU with a high load or a small load with a lot 
> >> of
> >> +  * running tasks.
> >> +  */
> >> + return ((rq->avg.runnable_avg_sum << rq->nr_running) >
> >
> > If nr_running a bit big, rq->avg.runnable_avg_sum << rq->nr_running is
> > zero. you will get the wrong decision.
> 
> yes, I'm going to do that like below instead:
> return (rq->avg.runnable_avg_sum > (rq->avg.runnable_avg_period >>
> rq->nr_running));

Doesn't it consider nr_running too much?  It seems current is_buddy_busy
returns false on a cpu that has 1 task runs 40% cputime, but returns true
on a cpu that has 3 tasks runs 10% cputime each or for 2 tasks of 15%
cputime each, right?
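
For reference, checking these numbers against the proposed test
"sum > period >> nr_running", taking the cpu's runnable_avg_sum as roughly
the sum of the tasks' cputime shares: one task at 40% gives 0.40 against a
0.50 threshold, so the cpu is not busy; three tasks at 10% give 0.30 against
0.125, and two tasks at 15% give 0.30 against 0.25, so both of those cpus
count as busy.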

I don't know what is correct, but just guessing that in a cpu's point
of view it'd be busier if it has a higher runnable_avg_sum than a
higher nr_running IMHO.


> 
> >
> >> + rq->avg.runnable_avg_period);
> >> +}
> >> +
> >> +static bool is_light_task(struct task_struct *p)
> >> +{
> >> + /* A light task runs less than 25% in average */
> >> + return ((p->se.avg.runnable_avg_sum << 1) <
> >> + p->se.avg.runnable_avg_period);
> >
> > 25% may not suitable for big machine.
> 
> Threshold is always an issue, which threshold should be suitable for
> big machine ?
> 
> I'm wondering if i should use the imbalance_pct value for computing
> the threshold

Anyway, I wonder how 'sum << 1' computes 25%.  Shouldn't it be << 2 ?

Thanks,
Namhyung


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-18 Thread Alex Shi
On Tue, Dec 18, 2012 at 5:53 PM, Vincent Guittot
 wrote:
> On 17 December 2012 16:24, Alex Shi  wrote:
 The scheme below tries to summaries the idea:

 Socket  | socket 0 | socket 1   | socket 2   | socket 3   |
 LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
 buddy conf0 | 0 | 0| 1  | 16| 2  | 32| 3  | 48|
 buddy conf1 | 0 | 0| 0  | 16| 16 | 32| 32 | 48|
 buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48|

 But, I don't know how this can interact with NUMA load balance and the
 better might be to use conf3.
>>>
>>> I mean conf2 not conf3

>
> Cyclictest is the ultimate small tasks use case which points out all
> weaknesses of a scheduler for such kind of tasks.
> Music playback is a more realistic one and it also shows improvement
>
>> granularity or one tick, thus we really don't need to consider task
>> migration cost. But when the task are not too small, migration is more
>
> For which kind of machine are you stating that hypothesis ?

It seems the biggest argument between us is that you don't want to admit
that 'not too small' tasks exist, and that they will cause more migrations
because of your patch.

>> even so they should run in the same socket for power saving
>> consideration(my power scheduling patch can do this), instead of spread
>> to all sockets.
>
> This is may be good for your scenario and your machine :-)
> Packing small tasks is the best choice for any scenario and machine.

That's clearly wrong. As I have explained many times, your single buddy
CPU cannot pack all the tasks of a big machine, even for just
16 LCPUs, while it is supposed to.

Anyway, you have the right to insist on your design, and I cannot state the
scalability issue any more clearly. I won't judge the patch again.


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-18 Thread Vincent Guittot
On 17 December 2012 16:24, Alex Shi  wrote:
>>> The scheme below tries to summaries the idea:
>>>
>>> Socket  | socket 0 | socket 1   | socket 2   | socket 3   |
>>> LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
>>> buddy conf0 | 0 | 0| 1  | 16| 2  | 32| 3  | 48|
>>> buddy conf1 | 0 | 0| 0  | 16| 16 | 32| 32 | 48|
>>> buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48|
>>>
>>> But, I don't know how this can interact with NUMA load balance and the
>>> better might be to use conf3.
>>
>> I mean conf2 not conf3
>
> So, it has 4 levels 0/16/32/ for socket 3 and 0 level for socket 0, it
> is unbalanced for different socket.

 That the target because we have decided to pack the small tasks in
 socket 0 when we have parsed the topology at boot.
 We don't have to loop into sched_domain or sched_group anymore to find
 the best LCPU when a small tasks wake up.
>>>
>>> iteration on domain and group is a advantage feature for power efficient
>>> requirement, not shortage. If some CPU are already idle before forking,
>>> let another waking CPU check their load/util and then decide which one
>>> is best CPU can reduce late migrations, that save both the performance
>>> and power.
>>
>> In fact, we have already done this job once at boot and we consider
>> that moving small tasks in the buddy CPU is always benefit so we don't
>> need to waste time looping sched_domain and sched_group to compute
>> current capacity of each LCPU for each wake up of each small tasks. We
>> want all small tasks and background activity waking up on the same
>> buddy CPU and let the default behavior of the scheduler choosing the
>> best CPU for heavy tasks or loaded CPUs.
>
> IMHO, the design should be very good for your scenario and your machine,
> but when the code move to general scheduler, we do want it can handle
> more general scenarios. like sometime the 'small task' is not as small
> as tasks in cyclictest which even hardly can run longer than migration

Cyclictest is the ultimate small tasks use case which points out all
weaknesses of a scheduler for such kind of tasks.
Music playback is a more realistic one and it also shows improvement

> granularity or one tick, thus we really don't need to consider task
> migration cost. But when the task are not too small, migration is more

For which kind of machine are you stating that hypothesis ?

> heavier than domain/group walking, that is the common sense in
> fork/exec/waking balance.

I would have said the opposite: the current scheduler limits its
computation of statistics during fork/exec/waking compared to a
periodic load balance because it's too heavy. It's even more true for
wake up if wake affine is possible.

>
>>
>>>
>>> On the contrary, move task walking on each level buddies is not only bad
>>> on performance but also bad on power. Consider the quite big latency of
>>> waking a deep idle CPU. we lose too much..
>>
>> My result have shown different conclusion.
>
> That should be due to your tasks are too small to need consider
> migration cost.
>> In fact, there is much more chance that the buddy will not be in a
>> deep idle as all the small tasks and background activity are already
>> waking on this CPU.
>
> powertop is helpful to tune your system for more idle time. Another
> reason is current kernel just try to spread tasks on more cpu for
> performance consideration. My power scheduling patch should helpful on this.
>>
>>>

>
> And the ground level has just one buddy for 16 LCPUs - 8 cores, that's
> not a good design, consider my previous examples: if there are 4 or 8
> tasks in one socket, you just has 2 choices: spread them into all cores,
> or pack them into one LCPU. Actually, moving them just into 2 or 4 cores
> maybe a better solution. but the design missed this.

 You speak about tasks without any notion of load. This patch only care
 of small tasks and light LCPU load, but it falls back to default
 behavior for other situation. So if there are 4 or 8 small tasks, they
 will migrate to the socket 0 after 1 or up to 3 migration (it depends
 of the conf and the LCPU they come from).
>>>
>>> According to your patch, what your mean 'notion of load' is the
>>> utilization of cpu, not the load weight of tasks, right?
>>
>> Yes but not only. The number of tasks that run simultaneously, is
>> another important input
>>
>>>
>>> Yes, I just talked about tasks numbers, but it naturally extends to the
>>> task utilization on cpu. like 8 tasks with 25% util, that just can full
>>> fill 2 CPUs. but clearly beyond the capacity of the buddy, so you need
>>> to wake up another CPU socket while local socket has some LCPU idle...
>>
>> 8 tasks with a running period of 25ms per 100ms that wake up
>> simultaneously should probably run on 8 different LCPU in order to
>> race to idle
>
> nope, it's a rare probability of 8 tasks wakeuping simultaneously. And

Multimedia is one example of tasks waking up simultaneously

> even so they should run in the same socket for power saving
> consideration(my power scheduling patch can do this), instead of spread
> to all sockets.

Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-17 Thread Alex Shi
>> The scheme below tries to summaries the idea:
>>
>> Socket  | socket 0 | socket 1   | socket 2   | socket 3   |
>> LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
>> buddy conf0 | 0 | 0| 1  | 16| 2  | 32| 3  | 48|
>> buddy conf1 | 0 | 0| 0  | 16| 16 | 32| 32 | 48|
>> buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48|
>>
>> But, I don't know how this can interact with NUMA load balance and the
>> better might be to use conf3.
>
> I mean conf2 not conf3

 So, it has 4 levels 0/16/32/ for socket 3 and 0 level for socket 0, it
 is unbalanced for different socket.
>>>
>>> That the target because we have decided to pack the small tasks in
>>> socket 0 when we have parsed the topology at boot.
>>> We don't have to loop into sched_domain or sched_group anymore to find
>>> the best LCPU when a small tasks wake up.
>>
>> iteration on domain and group is a advantage feature for power efficient
>> requirement, not shortage. If some CPU are already idle before forking,
>> let another waking CPU check their load/util and then decide which one
>> is best CPU can reduce late migrations, that save both the performance
>> and power.
> 
> In fact, we have already done this job once at boot and we consider
> that moving small tasks in the buddy CPU is always benefit so we don't
> need to waste time looping sched_domain and sched_group to compute
> current capacity of each LCPU for each wake up of each small tasks. We
> want all small tasks and background activity waking up on the same
> buddy CPU and let the default behavior of the scheduler choosing the
> best CPU for heavy tasks or loaded CPUs.

IMHO, the design should be very good for your scenario and your machine,
but when the code move to general scheduler, we do want it can handle
more general scenarios. like sometime the 'small task' is not as small
as tasks in cyclictest which even hardly can run longer than migration
granularity or one tick, thus we really don't need to consider task
migration cost. But when the task are not too small, migration is more
heavier than domain/group walking, that is the common sense in
fork/exec/waking balance.

> 
>>
>> On the contrary, move task walking on each level buddies is not only bad
>> on performance but also bad on power. Consider the quite big latency of
>> waking a deep idle CPU. we lose too much..
> 
> My result have shown different conclusion.

That should be due to your tasks are too small to need consider
migration cost.
> In fact, there is much more chance that the buddy will not be in a
> deep idle as all the small tasks and background activity are already
> waking on this CPU.

powertop is helpful to tune your system for more idle time. Another
reason is current kernel just try to spread tasks on more cpu for
performance consideration. My power scheduling patch should helpful on this.
> 
>>
>>>

 And the ground level has just one buddy for 16 LCPUs - 8 cores, that's
 not a good design, consider my previous examples: if there are 4 or 8
 tasks in one socket, you just has 2 choices: spread them into all cores,
 or pack them into one LCPU. Actually, moving them just into 2 or 4 cores
 maybe a better solution. but the design missed this.
>>>
>>> You speak about tasks without any notion of load. This patch only care
>>> of small tasks and light LCPU load, but it falls back to default
>>> behavior for other situation. So if there are 4 or 8 small tasks, they
>>> will migrate to the socket 0 after 1 or up to 3 migration (it depends
>>> of the conf and the LCPU they come from).
>>
>> According to your patch, what your mean 'notion of load' is the
>> utilization of cpu, not the load weight of tasks, right?
> 
> Yes but not only. The number of tasks that run simultaneously, is
> another important input
> 
>>
>> Yes, I just talked about tasks numbers, but it naturally extends to the
>> task utilization on cpu. like 8 tasks with 25% util, that just can full
>> fill 2 CPUs. but clearly beyond the capacity of the buddy, so you need
>> to wake up another CPU socket while local socket has some LCPU idle...
> 
> 8 tasks with a running period of 25ms per 100ms that wake up
> simultaneously should probably run on 8 different LCPU in order to
> race to idle

nope, it's a rare probability of 8 tasks wakeuping simultaneously. And
even so they should run in the same socket for power saving
consideration(my power scheduling patch can do this), instead of spread
to all sockets.
> 
> 
> Regards,
> Vincent
> 
>>>
>>> Then, if too much small tasks wake up simultaneously on the same LCPU,
>>> the default load balance will spread them in the core/cluster/socket
>>>

 Obviously, more and more cores is the trend on any kinds of CPU, the
 buddy system seems hard to catch up this.


>>
>>
>> --
>> Thanks
>> Alex


-- 
Thanks
Alex

Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-17 Thread Vincent Guittot
On 16 December 2012 08:12, Alex Shi  wrote:
> On 12/14/2012 05:33 PM, Vincent Guittot wrote:
>> On 14 December 2012 02:46, Alex Shi  wrote:
>>> On 12/13/2012 11:48 PM, Vincent Guittot wrote:
 On 13 December 2012 15:53, Vincent Guittot  
 wrote:
> On 13 December 2012 15:25, Alex Shi  wrote:
>> On 12/13/2012 06:11 PM, Vincent Guittot wrote:
>>> On 13 December 2012 03:17, Alex Shi  wrote:
 On 12/12/2012 09:31 PM, Vincent Guittot wrote:
> During the creation of sched_domain, we define a pack buddy CPU for 
> each CPU
> when one is available. We want to pack at all levels where a group of 
> CPU can
> be power gated independently from others.
> On a system that can't power gate a group of CPUs independently, the 
> flag is
> set at all sched_domain level and the buddy is set to -1. This is the 
> default
> behavior.
> On a dual clusters / dual cores system which can power gate each core 
> and
> cluster independently, the buddy configuration will be :
>
>   | Cluster 0   | Cluster 1   |
>   | CPU0 | CPU1 | CPU2 | CPU3 |
> ---
> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> Small tasks tend to slip out of the periodic load balance so the best 
> place
> to choose to migrate them is during their wake up. The decision is in 
> O(1) as
> we only check again one buddy CPU

 Just have a little worry about the scalability on a big machine, like 
 on
 a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
 system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
 is different on task distribution decision.
>>>
>>> The buddy CPU should probably not be the same for all 64 LCPU it
>>> depends on where it's worth packing small tasks
>>
>> Do you have further ideas for buddy cpu on such example?
>
> yes, I have several ideas which were not really relevant for small
> system but could be interesting for larger system
>
> We keep the same algorithm in a socket but we could either use another
> LCPU in the targeted socket (conf0) or chain the socket (conf1)
> instead of packing directly in one LCPU
>
> The scheme below tries to summaries the idea:
>
> Socket  | socket 0 | socket 1   | socket 2   | socket 3   |
> LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
> buddy conf0 | 0 | 0| 1  | 16| 2  | 32| 3  | 48|
> buddy conf1 | 0 | 0| 0  | 16| 16 | 32| 32 | 48|
> buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48|
>
> But, I don't know how this can interact with NUMA load balance and the
> better might be to use conf3.

 I mean conf2 not conf3
>>>
>>> So, it has 4 levels 0/16/32/ for socket 3 and 0 level for socket 0, it
>>> is unbalanced for different socket.
>>
>> That the target because we have decided to pack the small tasks in
>> socket 0 when we have parsed the topology at boot.
>> We don't have to loop into sched_domain or sched_group anymore to find
>> the best LCPU when a small tasks wake up.
>
> iteration on domain and group is a advantage feature for power efficient
> requirement, not shortage. If some CPU are already idle before forking,
> let another waking CPU check their load/util and then decide which one
> is best CPU can reduce late migrations, that save both the performance
> and power.

In fact, we have already done this job once at boot and we consider
that moving small tasks in the buddy CPU is always benefit so we don't
need to waste time looping sched_domain and sched_group to compute
current capacity of each LCPU for each wake up of each small tasks. We
want all small tasks and background activity waking up on the same
buddy CPU and let the default behavior of the scheduler choosing the
best CPU for heavy tasks or loaded CPUs.
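
As a purely hypothetical sketch of the O(1) wake-up decision described here:
the buddy CPU is looked up once and only light tasks are packed onto it while
it is not busy. The per-CPU variable name is invented for the example, -1
means no buddy as in the patch description, and is_light_task() and
is_buddy_busy() are the helpers quoted earlier in the thread:

static DEFINE_PER_CPU(int, sd_pack_buddy);	/* chosen once at topology parse */

static int select_packing_cpu(struct task_struct *p, int cpu)
{
	int buddy = per_cpu(sd_pack_buddy, cpu);

	/* No buddy configured for this CPU: keep the default behavior. */
	if (buddy == -1 || buddy == cpu)
		return cpu;

	/* Pack only light tasks, and only while the buddy is not busy. */
	if (is_light_task(p) && !is_buddy_busy(buddy))
		return buddy;

	return cpu;
}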

>
> On the contrary, move task walking on each level buddies is not only bad
> on performance but also bad on power. Consider the quite big latency of
> waking a deep idle CPU. we lose too much..

My result have shown different conclusion.
In fact, there is much more chance that the buddy will not be in a
deep idle as all the small tasks and background activity are already
waking on this CPU.

>
>>
>>>
>>> And the ground level has just one buddy for 16 LCPUs - 8 cores, that's
>>> not a good design, consider my previous examples: if there are 4 or 8
>>> tasks in one socket, you just has 2 choices: spread them into all cores,
>>> or pack them into one LCPU. Actually, moving them just into 2 or 4 cores
>>> maybe a better solution. but the design missed this.
>>
>> You speak about tasks without any notion of load. This patch only care
>> of small tasks and light LCPU load, but it falls back to default
>> behavior for other situation. So if there are 4 or 8 small tasks, they
>> will migrate to the socket 0 after 1 or up to 3 migration (it depends
>> of the conf and the LCPU they come from).
>
> According to your patch, what your mean 'notion of load' is the
> utilization of cpu, not the load weight of tasks, right?

Yes but not only. The number of tasks that run simultaneously, is
another important input

Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-16 Thread Alex Shi

> 
> CPU is a bug that slipped into domain degeneration.  You should have
> SIBLING/MC/NUMA (chasing that down is on todo).

Uh, the SD_PREFER_SIBLING on the cpu domain was restored by me for a
shared-memory benchmark regression. But considering all the situations, I
think the flag is better removed.


From 96bee9a03b2048f2686fbd7de0e2aee458dbd917 Mon Sep 17 00:00:00 2001
From: Alex Shi 
Date: Mon, 17 Dec 2012 09:42:57 +0800
Subject: [PATCH 01/18] sched: remove SD_PREFER_SIBLING flag

The flag was introduced in commit b5d978e0c7e79a. Its purpose seems to
be to fill one node first on a NUMA machine by pulling tasks from other
nodes when the node has capacity.

Its advantage is that when a few tasks share memory, pulling them
together helps locality and so gives a performance gain. The shortcoming
is that it keeps unnecessary task migrations thrashing among different
nodes, which reduces that gain and simply hurts performance if the tasks
share no memory.

With the sched NUMA balancing patches coming, this small advantage is
meaningless to us, so it is better to remove this flag.

Reported-by: Mike Galbraith 
Signed-off-by: Alex Shi 
---
 include/linux/sched.h|  1 -
 include/linux/topology.h |  2 --
 kernel/sched/core.c  |  1 -
 kernel/sched/fair.c  | 19 +--
 4 files changed, 1 insertion(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5dafac3..6dca96c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -836,7 +836,6 @@ enum cpu_idle_type {
 #define SD_SHARE_PKG_RESOURCES 0x0200  /* Domain members share cpu pkg 
resources */
 #define SD_SERIALIZE   0x0400  /* Only a single load balancing 
instance */
 #define SD_ASYM_PACKING0x0800  /* Place busy groups earlier in 
the domain */
-#define SD_PREFER_SIBLING  0x1000  /* Prefer to place tasks in a sibling 
domain */
 #define SD_OVERLAP 0x2000  /* sched_domains of this level overlap 
*/
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..15864d1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -100,7 +100,6 @@ int arch_update_cpu_topology(void);
| 1*SD_SHARE_CPUPOWER   \
| 1*SD_SHARE_PKG_RESOURCES  \
| 0*SD_SERIALIZE\
-   | 0*SD_PREFER_SIBLING   \
| arch_sd_sibling_asym_packing()\
,   \
.last_balance   = jiffies,  \
@@ -162,7 +161,6 @@ int arch_update_cpu_topology(void);
| 0*SD_SHARE_CPUPOWER   \
| 0*SD_SHARE_PKG_RESOURCES  \
| 0*SD_SERIALIZE\
-   | 1*SD_PREFER_SIBLING   \
,   \
.last_balance   = jiffies,  \
.balance_interval   = 1,\
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dae0d2..8ed2784 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6014,7 +6014,6 @@ sd_numa_init(struct sched_domain_topology_level *tl, int 
cpu)
| 0*SD_SHARE_CPUPOWER
| 0*SD_SHARE_PKG_RESOURCES
| 1*SD_SERIALIZE
-   | 0*SD_PREFER_SIBLING
| sd_local_flags(level)
,
.last_balance   = jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59e072b..5d175f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4339,13 +4339,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 static inline void update_sd_lb_stats(struct lb_env *env,
int *balance, struct sd_lb_stats *sds)
 {
-   struct sched_domain *child = env->sd->child;
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats sgs;
-   int load_idx, prefer_sibling = 0;
-
-   if (child && child->flags & SD_PREFER_SIBLING)
-   prefer_sibling = 1;
+   int load_idx;
 
load_idx = get_sd_load_idx(env->sd, env->idle);
 
@@ -4362,19 +4358,6 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds->total_load += sgs.group_load;
sds->total_pwr += sg->sgp->power;
 
-   /*
-* In case the child domain prefers tasks go to siblings
-   

Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-15 Thread Alex Shi
On 12/14/2012 05:33 PM, Vincent Guittot wrote:
> On 14 December 2012 02:46, Alex Shi  wrote:
>> On 12/13/2012 11:48 PM, Vincent Guittot wrote:
>>> On 13 December 2012 15:53, Vincent Guittot  
>>> wrote:
 On 13 December 2012 15:25, Alex Shi  wrote:
> On 12/13/2012 06:11 PM, Vincent Guittot wrote:
>> On 13 December 2012 03:17, Alex Shi  wrote:
>>> On 12/12/2012 09:31 PM, Vincent Guittot wrote:
 During the creation of sched_domain, we define a pack buddy CPU for 
 each CPU
 when one is available. We want to pack at all levels where a group of 
 CPU can
 be power gated independently from others.
 On a system that can't power gate a group of CPUs independently, the 
 flag is
 set at all sched_domain level and the buddy is set to -1. This is the 
 default
 behavior.
 On a dual clusters / dual cores system which can power gate each core 
 and
 cluster independently, the buddy configuration will be :

   | Cluster 0   | Cluster 1   |
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

 Small tasks tend to slip out of the periodic load balance so the best 
 place
 to choose to migrate them is during their wake up. The decision is in 
 O(1) as
 we only check again one buddy CPU
>>>
>>> Just have a little worry about the scalability on a big machine, like on
>>> a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
>>> system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
>>> is different on task distribution decision.
>>
>> The buddy CPU should probably not be the same for all 64 LCPU it
>> depends on where it's worth packing small tasks
>
> Do you have further ideas for buddy cpu on such example?

 yes, I have several ideas which were not really relevant for small
 system but could be interesting for larger system

 We keep the same algorithm in a socket but we could either use another
 LCPU in the targeted socket (conf0) or chain the socket (conf1)
 instead of packing directly in one LCPU

 The scheme below tries to summaries the idea:

 Socket  | socket 0 | socket 1   | socket 2   | socket 3   |
 LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
 buddy conf0 | 0 | 0| 1  | 16| 2  | 32| 3  | 48|
 buddy conf1 | 0 | 0| 0  | 16| 16 | 32| 32 | 48|
 buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48|

 But, I don't know how this can interact with NUMA load balance and the
 better might be to use conf3.
>>>
>>> I mean conf2 not conf3
>>
>> So, it has 4 levels 0/16/32/ for socket 3 and 0 level for socket 0, it
>> is unbalanced for different socket.
> 
> That the target because we have decided to pack the small tasks in
> socket 0 when we have parsed the topology at boot.
> We don't have to loop into sched_domain or sched_group anymore to find
> the best LCPU when a small tasks wake up.

Iterating over domains and groups is an advantage for the power-efficiency
requirement, not a shortcoming. If some CPUs are already idle before forking,
letting the waking CPU check their load/util and then decide which one is the
best CPU can reduce late migrations, and that saves both performance and
power.

On the contrary, walking the buddies at each level to move a task is bad not
only for performance but also for power. Consider the quite big latency of
waking a CPU from deep idle: we lose too much.

> 
>>
>> And the ground level has just one buddy for 16 LCPUs - 8 cores, that's
>> not a good design, consider my previous examples: if there are 4 or 8
>> tasks in one socket, you just has 2 choices: spread them into all cores,
>> or pack them into one LCPU. Actually, moving them just into 2 or 4 cores
>> maybe a better solution. but the design missed this.
> 
> You speak about tasks without any notion of load. This patch only care
> of small tasks and light LCPU load, but it falls back to default
> behavior for other situation. So if there are 4 or 8 small tasks, they
> will migrate to the socket 0 after 1 or up to 3 migration (it depends
> of the conf and the LCPU they come from).

According to your patch, what you mean by 'notion of load' is the
utilization of the cpu, not the load weight of the tasks, right?

Yes, I just talked about task numbers, but it naturally extends to the
task utilization of the cpu: e.g. 8 tasks with 25% util each can fully
fill 2 CPUs, which is clearly beyond the capacity of the buddy, so you
need to wake up another CPU socket while the local socket still has some
LCPUs idle...
> 
> Then, if too much small tasks wake up simultaneously on the same LCPU,
> the default load balance will spread them in the core/cluster/socket
> 
>>
>> Obviously, more and more cores is the trend on any kinds of CPU, the
>> buddy system seems hard to catch up this.


--
Thanks
Alex

Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-14 Thread Mike Galbraith
On Fri, 2012-12-14 at 11:43 +0100, Vincent Guittot wrote: 
> On 14 December 2012 08:45, Mike Galbraith  wrote:
> > On Fri, 2012-12-14 at 14:36 +0800, Alex Shi wrote:
> >> On 12/14/2012 12:45 PM, Mike Galbraith wrote:
> >> >> > Do you have further ideas for buddy cpu on such example?
> >> >>> > >
> >> >>> > > Which kind of sched_domain configuration have you for such system ?
> >> >>> > > and how many sched_domain level have you ?
> >> >> >
> >> >> > it is general X86 domain configuration. with 4 levels,
> >> >> > sibling/core/cpu/numa.
> >> > CPU is a bug that slipped into domain degeneration.  You should have
> >> > SIBLING/MC/NUMA (chasing that down is on todo).
> >>
> >> Maybe.
> >> the CPU/NUMA is different on domain flags, CPU has SD_PREFER_SIBLING.
> >
> > What I noticed during (an unrelated) bisection on a 40 core box was
> > domains going from so..
> >
> > 3.4.0-bisect (virgin)
> > [5.056214] CPU0 attaching sched-domain:
> > [5.065009]  domain 0: span 0,32 level SIBLING
> > [5.075011]   groups: 0 (cpu_power = 589) 32 (cpu_power = 589)
> > [5.088381]   domain 1: span 
> > 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC
> > [5.107669]groups: 0,32 (cpu_power = 1178)  4,36 (cpu_power = 1178)  
> > 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178)
> >  16,48 (cpu_power = 1177) 20,52 (cpu_power = 1178) 
> > 24,56 (cpu_power = 1177) 28,60 (cpu_power = 1177)
> >  64,72 (cpu_power = 1176) 68,76 (cpu_power = 1176)
> > [5.162115]domain 2: span 0-79 level NODE
> > [5.171927] groups: 
> > 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11773)
> >
> > 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11772)
> >
> > 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 
> > 11773)
> >
> > 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 
> > 11770)
> >
> > ..to so, which looks a little bent.  CPU and MC have identical spans, so
> > CPU should have gone away, as it used to do.
> >
> > 3.6.0-bisect (virgin)
> > [3.978338] CPU0 attaching sched-domain:
> > [3.987125]  domain 0: span 0,32 level SIBLING
> > [3.997125]   groups: 0 (cpu_power = 588) 32 (cpu_power = 589)
> > [4.010477]   domain 1: span 
> > 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC
> > [4.029748]groups: 0,32 (cpu_power = 1177)  4,36 (cpu_power = 1177)  
> > 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178)
> >  16,48 (cpu_power = 1178) 20,52 (cpu_power = 1178) 
> > 24,56 (cpu_power = 1178) 28,60 (cpu_power = 1178)
> >  64,72 (cpu_power = 1178) 68,76 (cpu_power = 1177)
> > [4.084143]domain 2: span 
> > 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level CPU
> > [4.103796] groups: 
> > 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777)
> > [4.124373] domain 3: span 0-79 level NUMA
> > [4.134369]  groups: 
> > 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777)
> > 
> > 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11778)
> > 
> > 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74 ,78 (cpu_power = 
> > 11778)
> > 
> > 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 
> > 11780)
> >
> 
> Thanks. that's an interesting example of a numa topology
> 
> For your sched_domain difference,
> On 3.4, SD_PREFER_SIBLING was set for both MC and CPU level thanks to
> sd_balance_for_mc_power and  sd_balance_for_package_power
> On 3.6, SD_PREFER_SIBLING is only set for CPU level and this flag
> difference with MC level prevents the destruction of CPU sched_domain
> during the degeneration
> 
> We may need to set SD_PREFER_SIBLING for MC level

Ah, that explains oddity. (todo--).

Hm, seems changing flags should trigger a rebuild. (todo++,drat).

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-14 Thread Vincent Guittot
On 14 December 2012 08:45, Mike Galbraith  wrote:
> On Fri, 2012-12-14 at 14:36 +0800, Alex Shi wrote:
>> On 12/14/2012 12:45 PM, Mike Galbraith wrote:
>> >> > Do you have further ideas for buddy cpu on such example?
>> >>> > >
>> >>> > > Which kind of sched_domain configuration have you for such system ?
>> >>> > > and how many sched_domain level have you ?
>> >> >
>> >> > it is general X86 domain configuration. with 4 levels,
>> >> > sibling/core/cpu/numa.
>> > CPU is a bug that slipped into domain degeneration.  You should have
>> > SIBLING/MC/NUMA (chasing that down is on todo).
>>
>> Maybe.
>> the CPU/NUMA is different on domain flags, CPU has SD_PREFER_SIBLING.
>
> What I noticed during (an unrelated) bisection on a 40 core box was
> domains going from so..
>
> 3.4.0-bisect (virgin)
> [5.056214] CPU0 attaching sched-domain:
> [5.065009]  domain 0: span 0,32 level SIBLING
> [5.075011]   groups: 0 (cpu_power = 589) 32 (cpu_power = 589)
> [5.088381]   domain 1: span 
> 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC
> [5.107669]groups: 0,32 (cpu_power = 1178)  4,36 (cpu_power = 1178)  
> 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178)
>  16,48 (cpu_power = 1177) 20,52 (cpu_power = 1178) 
> 24,56 (cpu_power = 1177) 28,60 (cpu_power = 1177)
>  64,72 (cpu_power = 1176) 68,76 (cpu_power = 1176)
> [5.162115]domain 2: span 0-79 level NODE
> [5.171927] groups: 
> 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11773)
>
> 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11772)
>
> 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 11773)
>
> 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11770)
>
> ..to so, which looks a little bent.  CPU and MC have identical spans, so
> CPU should have gone away, as it used to do.
>
> 3.6.0-bisect (virgin)
> [3.978338] CPU0 attaching sched-domain:
> [3.987125]  domain 0: span 0,32 level SIBLING
> [3.997125]   groups: 0 (cpu_power = 588) 32 (cpu_power = 589)
> [4.010477]   domain 1: span 
> 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC
> [4.029748]groups: 0,32 (cpu_power = 1177)  4,36 (cpu_power = 1177)  
> 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178)
>  16,48 (cpu_power = 1178) 20,52 (cpu_power = 1178) 
> 24,56 (cpu_power = 1178) 28,60 (cpu_power = 1178)
>  64,72 (cpu_power = 1178) 68,76 (cpu_power = 1177)
> [4.084143]domain 2: span 
> 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level CPU
> [4.103796] groups: 
> 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777)
> [4.124373] domain 3: span 0-79 level NUMA
> [4.134369]  groups: 
> 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777)
> 
> 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11778)
> 
> 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74 ,78 (cpu_power = 11778)
> 
> 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11780)
>

Thanks. that's an interesting example of a numa topology

For your sched_domain difference,
On 3.4, SD_PREFER_SIBLING was set for both MC and CPU level thanks to
sd_balance_for_mc_power and  sd_balance_for_package_power
On 3.6, SD_PREFER_SIBLING is only set for CPU level and this flag
difference with MC level prevents the destruction of CPU sched_domain
during the degeneration

We may need to set SD_PREFER_SIBLING for MC level
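
Roughly, the degeneration check requires that a parent level adds no flag
that its child level lacks; with SD_PREFER_SIBLING set on CPU but not on MC,
the collapse is refused. A simplified sketch of that test (my paraphrase,
not the exact kernel code):

/*
 * Simplified illustration only, not the exact kernel code: a parent
 * sched_domain can be collapsed into its child only if it spans the
 * same CPUs and sets no flag the child does not already set.
 */
static bool parent_can_degenerate(unsigned long child_flags,
				  unsigned long parent_flags,
				  bool same_span)
{
	if (!same_span)
		return false;

	/* a flag set in the parent but not in the child blocks the collapse */
	return !(~child_flags & parent_flags);
}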

Vincent

> -Mike
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-14 Thread Vincent Guittot
On 14 December 2012 02:46, Alex Shi  wrote:
> On 12/13/2012 11:48 PM, Vincent Guittot wrote:
>> On 13 December 2012 15:53, Vincent Guittot  
>> wrote:
>>> On 13 December 2012 15:25, Alex Shi  wrote:
 On 12/13/2012 06:11 PM, Vincent Guittot wrote:
> On 13 December 2012 03:17, Alex Shi  wrote:
>> On 12/12/2012 09:31 PM, Vincent Guittot wrote:
>>> During the creation of sched_domain, we define a pack buddy CPU for 
>>> each CPU
>>> when one is available. We want to pack at all levels where a group of 
>>> CPU can
>>> be power gated independently from others.
>>> On a system that can't power gate a group of CPUs independently, the 
>>> flag is
>>> set at all sched_domain level and the buddy is set to -1. This is the 
>>> default
>>> behavior.
>>> On a dual clusters / dual cores system which can power gate each core 
>>> and
>>> cluster independently, the buddy configuration will be :
>>>
>>>   | Cluster 0   | Cluster 1   |
>>>   | CPU0 | CPU1 | CPU2 | CPU3 |
>>> ---
>>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>>>
>>> Small tasks tend to slip out of the periodic load balance so the best 
>>> place
>>> to choose to migrate them is during their wake up. The decision is in 
>>> O(1) as
>>> we only check again one buddy CPU
>>
>> Just have a little worry about the scalability on a big machine, like on
>> a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
>> system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
>> is different on task distribution decision.
>
> The buddy CPU should probably not be the same for all 64 LCPU it
> depends on where it's worth packing small tasks

 Do you have further ideas for buddy cpu on such example?
>>>
>>> yes, I have several ideas which were not really relevant for small
>>> system but could be interesting for larger system
>>>
>>> We keep the same algorithm in a socket but we could either use another
>>> LCPU in the targeted socket (conf0) or chain the socket (conf1)
>>> instead of packing directly in one LCPU
>>>
>>> The scheme below tries to summaries the idea:
>>>
>>> Socket  | socket 0 | socket 1   | socket 2   | socket 3   |
>>> LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
>>> buddy conf0 | 0 | 0| 1  | 16| 2  | 32| 3  | 48|
>>> buddy conf1 | 0 | 0| 0  | 16| 16 | 32| 32 | 48|
>>> buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48|
>>>
>>> But, I don't know how this can interact with NUMA load balance and the
>>> better might be to use conf3.
>>
>> I mean conf2 not conf3
>
> So, it has 4 levels 0/16/32/ for socket 3 and 0 level for socket 0, it
> is unbalanced for different socket.

That's the target, because we have decided to pack the small tasks in
socket 0 when we parsed the topology at boot.
We don't have to loop over sched_domain or sched_group anymore to find
the best LCPU when a small task wakes up.

>
> And the ground level has just one buddy for 16 LCPUs - 8 cores, that's
> not a good design, consider my previous examples: if there are 4 or 8
> tasks in one socket, you just has 2 choices: spread them into all cores,
> or pack them into one LCPU. Actually, moving them just into 2 or 4 cores
> maybe a better solution. but the design missed this.

You speak about tasks without any notion of load. This patch only cares
about small tasks and a light LCPU load, and it falls back to the default
behavior in any other situation. So if there are 4 or 8 small tasks, they
will migrate to socket 0 after 1 and up to 3 migrations (it depends on
the conf and the LCPU they come from).

Then, if too many small tasks wake up simultaneously on the same LCPU,
the default load balance will spread them across the core/cluster/socket.
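
To illustrate, the wake-up decision described above is only a couple of O(1)
checks; a simplified sketch (not the exact patch code: the function name and
task_is_small() are placeholders for the patch's helpers):

/*
 * Simplified sketch of the wake-up packing decision, illustration only.
 * The buddy was chosen once when the topology was parsed, so no domain
 * or group walk is needed here.
 */
static int select_packing_cpu(struct task_struct *p, int cpu)
{
	int buddy = per_cpu(sd_pack_buddy, cpu);

	/* no buddy on this topology, or we already are the buddy */
	if (buddy == -1 || buddy == cpu)
		return cpu;

	/* pack only a small task on a buddy that is not already busy */
	if (task_is_small(p) && !is_buddy_busy(buddy))
		return buddy;

	/* otherwise keep the default wake-up balancing */
	return cpu;
}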

>
> Obviously, more and more cores is the trend on any kinds of CPU, the
> buddy system seems hard to catch up this.
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Alex Shi
On 12/14/2012 03:45 PM, Mike Galbraith wrote:
> On Fri, 2012-12-14 at 14:36 +0800, Alex Shi wrote: 
>> On 12/14/2012 12:45 PM, Mike Galbraith wrote:
> Do you have further ideas for buddy cpu on such example?
>>>
>>> Which kind of sched_domain configuration have you for such system ?
>>> and how many sched_domain level have you ?
>
> it is general X86 domain configuration. with 4 levels,
> sibling/core/cpu/numa.
>>> CPU is a bug that slipped into domain degeneration.  You should have
>>> SIBLING/MC/NUMA (chasing that down is on todo).
>>
>> Maybe.
>> the CPU/NUMA is different on domain flags, CPU has SD_PREFER_SIBLING.
> 
> What I noticed during (an unrelated) bisection on a 40 core box was
> domains going from so..
> 
> 3.4.0-bisect (virgin)
> [5.056214] CPU0 attaching sched-domain:
> [5.065009]  domain 0: span 0,32 level SIBLING
> [5.075011]   groups: 0 (cpu_power = 589) 32 (cpu_power = 589)
> [5.088381]   domain 1: span 
> 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC
> [5.107669]groups: 0,32 (cpu_power = 1178)  4,36 (cpu_power = 1178)  
> 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178)
>  16,48 (cpu_power = 1177) 20,52 (cpu_power = 1178) 
> 24,56 (cpu_power = 1177) 28,60 (cpu_power = 1177)
>  64,72 (cpu_power = 1176) 68,76 (cpu_power = 1176)
> [5.162115]domain 2: span 0-79 level NODE
> [5.171927] groups: 
> 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11773)
>
> 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11772)
>
> 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 11773)
>
> 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11770)
> 
> ..to so, which looks a little bent.  CPU and MC have identical spans, so
> CPU should have gone away, as it used to do.
> 

better to remove one, and believe you can make it. :)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Mike Galbraith
On Fri, 2012-12-14 at 14:36 +0800, Alex Shi wrote: 
> On 12/14/2012 12:45 PM, Mike Galbraith wrote:
> >> > Do you have further ideas for buddy cpu on such example?
> >>> > > 
> >>> > > Which kind of sched_domain configuration have you for such system ?
> >>> > > and how many sched_domain level have you ?
> >> > 
> >> > it is general X86 domain configuration. with 4 levels,
> >> > sibling/core/cpu/numa.
> > CPU is a bug that slipped into domain degeneration.  You should have
> > SIBLING/MC/NUMA (chasing that down is on todo).
> 
> Maybe.
> the CPU/NUMA is different on domain flags, CPU has SD_PREFER_SIBLING.

What I noticed during (an unrelated) bisection on a 40 core box was
domains going from so..

3.4.0-bisect (virgin)
[5.056214] CPU0 attaching sched-domain:
[5.065009]  domain 0: span 0,32 level SIBLING
[5.075011]   groups: 0 (cpu_power = 589) 32 (cpu_power = 589)
[5.088381]   domain 1: span 
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC
[5.107669]groups: 0,32 (cpu_power = 1178)  4,36 (cpu_power = 1178)  
8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178)
 16,48 (cpu_power = 1177) 20,52 (cpu_power = 1178) 
24,56 (cpu_power = 1177) 28,60 (cpu_power = 1177)
 64,72 (cpu_power = 1176) 68,76 (cpu_power = 1176)
[5.162115]domain 2: span 0-79 level NODE
[5.171927] groups: 
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11773)
   
1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11772)
   
2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 11773)
   
3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11770)

..to so, which looks a little bent.  CPU and MC have identical spans, so
CPU should have gone away, as it used to do.

3.6.0-bisect (virgin)
[3.978338] CPU0 attaching sched-domain:
[3.987125]  domain 0: span 0,32 level SIBLING
[3.997125]   groups: 0 (cpu_power = 588) 32 (cpu_power = 589)
[4.010477]   domain 1: span 
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC
[4.029748]groups: 0,32 (cpu_power = 1177)  4,36 (cpu_power = 1177)  
8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178)
 16,48 (cpu_power = 1178) 20,52 (cpu_power = 1178) 
24,56 (cpu_power = 1178) 28,60 (cpu_power = 1178)
 64,72 (cpu_power = 1178) 68,76 (cpu_power = 1177)
[4.084143]domain 2: span 
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level CPU
[4.103796] groups: 
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777)
[4.124373] domain 3: span 0-79 level NUMA
[4.134369]  groups: 
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777)

1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11778)

2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74 ,78 (cpu_power = 11778)

3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11780)

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Alex Shi
On 12/14/2012 12:45 PM, Mike Galbraith wrote:
>> > Do you have further ideas for buddy cpu on such example?
>>> > > 
>>> > > Which kind of sched_domain configuration have you for such system ?
>>> > > and how many sched_domain level have you ?
>> > 
>> > it is general X86 domain configuration. with 4 levels,
>> > sibling/core/cpu/numa.
> CPU is a bug that slipped into domain degeneration.  You should have
> SIBLING/MC/NUMA (chasing that down is on todo).

Maybe.
The CPU and NUMA levels differ in their domain flags: CPU has SD_PREFER_SIBLING.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Mike Galbraith
On Thu, 2012-12-13 at 22:25 +0800, Alex Shi wrote: 
> On 12/13/2012 06:11 PM, Vincent Guittot wrote:
> > On 13 December 2012 03:17, Alex Shi  wrote:
> >> On 12/12/2012 09:31 PM, Vincent Guittot wrote:
> >>> During the creation of sched_domain, we define a pack buddy CPU for each 
> >>> CPU
> >>> when one is available. We want to pack at all levels where a group of CPU 
> >>> can
> >>> be power gated independently from others.
> >>> On a system that can't power gate a group of CPUs independently, the flag 
> >>> is
> >>> set at all sched_domain level and the buddy is set to -1. This is the 
> >>> default
> >>> behavior.
> >>> On a dual clusters / dual cores system which can power gate each core and
> >>> cluster independently, the buddy configuration will be :
> >>>
> >>>   | Cluster 0   | Cluster 1   |
> >>>   | CPU0 | CPU1 | CPU2 | CPU3 |
> >>> ---
> >>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
> >>>
> >>> Small tasks tend to slip out of the periodic load balance so the best 
> >>> place
> >>> to choose to migrate them is during their wake up. The decision is in 
> >>> O(1) as
> >>> we only check again one buddy CPU
> >>
> >> Just have a little worry about the scalability on a big machine, like on
> >> a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
> >> system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
> >> is different on task distribution decision.
> > 
> > The buddy CPU should probably not be the same for all 64 LCPU it
> > depends on where it's worth packing small tasks
> 
> Do you have further ideas for buddy cpu on such example?
> > 
> > Which kind of sched_domain configuration have you for such system ?
> > and how many sched_domain level have you ?
> 
> it is general X86 domain configuration. with 4 levels,
> sibling/core/cpu/numa.

CPU is a bug that slipped into domain degeneration.  You should have
SIBLING/MC/NUMA (chasing that down is on todo).

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Alex Shi
On 12/13/2012 11:48 PM, Vincent Guittot wrote:
> On 13 December 2012 15:53, Vincent Guittot  wrote:
>> On 13 December 2012 15:25, Alex Shi  wrote:
>>> On 12/13/2012 06:11 PM, Vincent Guittot wrote:
 On 13 December 2012 03:17, Alex Shi  wrote:
> On 12/12/2012 09:31 PM, Vincent Guittot wrote:
>> During the creation of sched_domain, we define a pack buddy CPU for each 
>> CPU
>> when one is available. We want to pack at all levels where a group of 
>> CPU can
>> be power gated independently from others.
>> On a system that can't power gate a group of CPUs independently, the 
>> flag is
>> set at all sched_domain level and the buddy is set to -1. This is the 
>> default
>> behavior.
>> On a dual clusters / dual cores system which can power gate each core and
>> cluster independently, the buddy configuration will be :
>>
>>   | Cluster 0   | Cluster 1   |
>>   | CPU0 | CPU1 | CPU2 | CPU3 |
>> ---
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>>
>> Small tasks tend to slip out of the periodic load balance so the best 
>> place
>> to choose to migrate them is during their wake up. The decision is in 
>> O(1) as
>> we only check again one buddy CPU
>
> Just have a little worry about the scalability on a big machine, like on
> a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
> system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
> is different on task distribution decision.

 The buddy CPU should probably not be the same for all 64 LCPU it
 depends on where it's worth packing small tasks
>>>
>>> Do you have further ideas for buddy cpu on such example?
>>
>> yes, I have several ideas which were not really relevant for small
>> system but could be interesting for larger system
>>
>> We keep the same algorithm in a socket but we could either use another
>> LCPU in the targeted socket (conf0) or chain the socket (conf1)
>> instead of packing directly in one LCPU
>>
>> The scheme below tries to summaries the idea:
>>
>> Socket  | socket 0 | socket 1   | socket 2   | socket 3   |
>> LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
>> buddy conf0 | 0 | 0| 1  | 16| 2  | 32| 3  | 48|
>> buddy conf1 | 0 | 0| 0  | 16| 16 | 32| 32 | 48|
>> buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48|
>>
>> But, I don't know how this can interact with NUMA load balance and the
>> better might be to use conf3.
> 
> I mean conf2 not conf3

So, it has 4 levels 0/16/32/ for socket 3 but only level 0 for socket 0;
it is unbalanced across sockets.

And the ground level has just one buddy for 16 LCPUs - 8 cores, which is
not a good design. Consider my previous examples: if there are 4 or 8
tasks in one socket, you have just 2 choices: spread them across all cores,
or pack them onto one LCPU. Actually, moving them onto just 2 or 4 cores
may be a better solution, but the design misses this.

Obviously, more and more cores is the trend for any kind of CPU, and the
buddy system seems hard pressed to catch up with this.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Vincent Guittot
On 13 December 2012 15:53, Vincent Guittot  wrote:
> On 13 December 2012 15:25, Alex Shi  wrote:
>> On 12/13/2012 06:11 PM, Vincent Guittot wrote:
>>> On 13 December 2012 03:17, Alex Shi  wrote:
 On 12/12/2012 09:31 PM, Vincent Guittot wrote:
> During the creation of sched_domain, we define a pack buddy CPU for each 
> CPU
> when one is available. We want to pack at all levels where a group of CPU 
> can
> be power gated independently from others.
> On a system that can't power gate a group of CPUs independently, the flag 
> is
> set at all sched_domain level and the buddy is set to -1. This is the 
> default
> behavior.
> On a dual clusters / dual cores system which can power gate each core and
> cluster independently, the buddy configuration will be :
>
>   | Cluster 0   | Cluster 1   |
>   | CPU0 | CPU1 | CPU2 | CPU3 |
> ---
> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> Small tasks tend to slip out of the periodic load balance so the best 
> place
> to choose to migrate them is during their wake up. The decision is in 
> O(1) as
> we only check again one buddy CPU

 Just have a little worry about the scalability on a big machine, like on
 a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
 system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
 is different on task distribution decision.
>>>
>>> The buddy CPU should probably not be the same for all 64 LCPU it
>>> depends on where it's worth packing small tasks
>>
>> Do you have further ideas for buddy cpu on such example?
>
> yes, I have several ideas which were not really relevant for small
> system but could be interesting for larger system
>
> We keep the same algorithm in a socket but we could either use another
> LCPU in the targeted socket (conf0) or chain the socket (conf1)
> instead of packing directly in one LCPU
>
> The scheme below tries to summaries the idea:
>
> Socket  | socket 0 | socket 1   | socket 2   | socket 3   |
> LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
> buddy conf0 | 0 | 0| 1  | 16| 2  | 32| 3  | 48|
> buddy conf1 | 0 | 0| 0  | 16| 16 | 32| 32 | 48|
> buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48|
>
> But, I don't know how this can interact with NUMA load balance and the
> better might be to use conf3.

I mean conf2 not conf3

>
>>>
>>> Which kind of sched_domain configuration have you for such system ?
>>> and how many sched_domain level have you ?
>>
>> it is general X86 domain configuration. with 4 levels,
>> sibling/core/cpu/numa.
>>>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Vincent Guittot
On 13 December 2012 15:25, Alex Shi  wrote:
> On 12/13/2012 06:11 PM, Vincent Guittot wrote:
>> On 13 December 2012 03:17, Alex Shi  wrote:
>>> On 12/12/2012 09:31 PM, Vincent Guittot wrote:
 During the creation of sched_domain, we define a pack buddy CPU for each 
 CPU
 when one is available. We want to pack at all levels where a group of CPU 
 can
 be power gated independently from others.
 On a system that can't power gate a group of CPUs independently, the flag 
 is
 set at all sched_domain level and the buddy is set to -1. This is the 
 default
 behavior.
 On a dual clusters / dual cores system which can power gate each core and
 cluster independently, the buddy configuration will be :

   | Cluster 0   | Cluster 1   |
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

 Small tasks tend to slip out of the periodic load balance so the best place
 to choose to migrate them is during their wake up. The decision is in O(1) 
 as
 we only check again one buddy CPU
>>>
>>> Just have a little worry about the scalability on a big machine, like on
>>> a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
>>> system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
>>> is different on task distribution decision.
>>
>> The buddy CPU should probably not be the same for all 64 LCPU it
>> depends on where it's worth packing small tasks
>
> Do you have further ideas for buddy cpu on such example?

yes, I have several ideas which were not really relevant for small
system but could be interesting for larger system

We keep the same algorithm in a socket but we could either use another
LCPU in the targeted socket (conf0) or chain the socket (conf1)
instead of packing directly in one LCPU

The scheme below tries to summaries the idea:

Socket  | socket 0 | socket 1   | socket 2   | socket 3   |
LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
buddy conf0 | 0 | 0| 1  | 16| 2  | 32| 3  | 48|
buddy conf1 | 0 | 0| 0  | 16| 16 | 32| 32 | 48|
buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48|

But, I don't know how this can interact with NUMA load balance and the
better might be to use conf3.
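
To make the table concrete, here is a small user-space sketch (illustration
only, not part of the patch; the constants simply match the 4-socket example
above) that derives the conf1 and conf2 buddy maps:

#include <stdio.h>

#define TOTAL_CPUS	64
#define PER_SOCKET	16

int main(void)
{
	int buddy_conf1[TOTAL_CPUS], buddy_conf2[TOTAL_CPUS];
	int cpu;

	for (cpu = 0; cpu < TOTAL_CPUS; cpu++) {
		/* first LCPU of this socket */
		int first = cpu - (cpu % PER_SOCKET);

		/* conf2: every LCPU packs on the first LCPU of its socket */
		buddy_conf2[cpu] = first;

		/* conf1: same, but the first LCPU of each socket chains to
		 * the first LCPU of the previous socket */
		if (cpu == first && first != 0)
			buddy_conf1[cpu] = first - PER_SOCKET;
		else
			buddy_conf1[cpu] = first;
	}

	for (cpu = 0; cpu < TOTAL_CPUS; cpu += PER_SOCKET)
		printf("LCPU %2d: conf1 buddy %2d, conf2 buddy %2d\n",
		       cpu, buddy_conf1[cpu], buddy_conf2[cpu]);

	return 0;
}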

>>
>> Which kind of sched_domain configuration have you for such system ?
>> and how many sched_domain level have you ?
>
> it is general X86 domain configuration. with 4 levels,
> sibling/core/cpu/numa.
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Alex Shi
On 12/13/2012 06:11 PM, Vincent Guittot wrote:
> On 13 December 2012 03:17, Alex Shi  wrote:
>> On 12/12/2012 09:31 PM, Vincent Guittot wrote:
>>> During the creation of sched_domain, we define a pack buddy CPU for each CPU
>>> when one is available. We want to pack at all levels where a group of CPU 
>>> can
>>> be power gated independently from others.
>>> On a system that can't power gate a group of CPUs independently, the flag is
>>> set at all sched_domain level and the buddy is set to -1. This is the 
>>> default
>>> behavior.
>>> On a dual clusters / dual cores system which can power gate each core and
>>> cluster independently, the buddy configuration will be :
>>>
>>>   | Cluster 0   | Cluster 1   |
>>>   | CPU0 | CPU1 | CPU2 | CPU3 |
>>> ---
>>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>>>
>>> Small tasks tend to slip out of the periodic load balance so the best place
>>> to choose to migrate them is during their wake up. The decision is in O(1) 
>>> as
>>> we only check again one buddy CPU
>>
>> Just have a little worry about the scalability on a big machine, like on
>> a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
>> system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
>> is different on task distribution decision.
> 
> The buddy CPU should probably not be the same for all 64 LCPU it
> depends on where it's worth packing small tasks

Do you have further ideas for buddy cpu on such example?
> 
> Which kind of sched_domain configuration have you for such system ?
> and how many sched_domain level have you ?

it is general X86 domain configuration. with 4 levels,
sibling/core/cpu/numa.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Vincent Guittot
On 13 December 2012 03:17, Alex Shi alex@intel.com wrote:
 On 12/12/2012 09:31 PM, Vincent Guittot wrote:
 During the creation of sched_domain, we define a pack buddy CPU for each CPU
 when one is available. We want to pack at all levels where a group of CPU can
 be power gated independently from others.
 On a system that can't power gate a group of CPUs independently, the flag is
 set at all sched_domain level and the buddy is set to -1. This is the default
 behavior.
 On a dual clusters / dual cores system which can power gate each core and
 cluster independently, the buddy configuration will be :

   | Cluster 0   | Cluster 1   |
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

 Small tasks tend to slip out of the periodic load balance so the best place
 to choose to migrate them is during their wake up. The decision is in O(1) as
 we only check again one buddy CPU

 Just have a little worry about the scalability on a big machine, like on
 a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
 system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
 is different on task distribution decision.

The buddy CPU should probably not be the same for all 64 LCPU it
depends on where it's worth packing small tasks

Which kind of sched_domain configuration have you for such system ?
and how many sched_domain level have you ?



 Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
 ---
  kernel/sched/core.c  |1 +
  kernel/sched/fair.c  |  110 
 ++
  kernel/sched/sched.h |5 +++
  3 files changed, 116 insertions(+)

 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 4f36e9d..3436aad 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -5693,6 +5693,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
 root_domain *rd, int cpu)
   rcu_assign_pointer(rq->sd, sd);
   destroy_sched_domains(tmp, cpu);

 + update_packing_domain(cpu);
   update_domain_cache(cpu);
  }

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 9916d41..fc93d96 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -163,6 +163,73 @@ void sched_init_granularity(void)
   update_sysctl();
  }

 +
 +#ifdef CONFIG_SMP
 +/*
 + * Save the id of the optimal CPU that should be used to pack small tasks
 + * The value -1 is used when no buddy has been found
 + */
 +DEFINE_PER_CPU(int, sd_pack_buddy);
 +
 +/* Look for the best buddy CPU that can be used to pack small tasks
 + * We make the assumption that it is not worth packing on CPUs that share the
 + * same powerline. We look for the 1st sched_domain without the
 + * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
 + * power per core based on the assumption that their power efficiency is
 + * better */
 +void update_packing_domain(int cpu)
 +{
 + struct sched_domain *sd;
 + int id = -1;
 +
 + sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN & SD_LOAD_BALANCE);
 + if (!sd)
 + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
 + else
 + sd = sd->parent;
 +
 + while (sd && (sd->flags && SD_LOAD_BALANCE)) {
 + struct sched_group *sg = sd->groups;
 + struct sched_group *pack = sg;
 + struct sched_group *tmp;
 +
 + /*
 +  * The sched_domain of a CPU points on the local sched_group
 +  * and the 1st CPU of this local group is a good candidate
 +  */
 + id = cpumask_first(sched_group_cpus(pack));
 +
 + /* loop the sched groups to find the best one */
 + for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
 + if (tmp->sgp->power * pack->group_weight >
 + pack->sgp->power * tmp->group_weight)
 + continue;
 +
 + if ((tmp->sgp->power * pack->group_weight ==
 + pack->sgp->power * tmp->group_weight)
 +  && (cpumask_first(sched_group_cpus(tmp)) >= id))
 + continue;
 +
 + /* we have found a better group */
 + pack = tmp;
 +
 + /* Take the 1st CPU of the new group */
 + id = cpumask_first(sched_group_cpus(pack));
 + }
 +
 + /* Look for another CPU than itself */
 + if (id != cpu)
 + break;
 +
 + sd = sd->parent;
 + }
 +
 + pr_debug("CPU%d packing on CPU%d\n", cpu, id);
 + per_cpu(sd_pack_buddy, cpu) = id;
 +}
 +
 +#endif /* CONFIG_SMP */
 +
  #if BITS_PER_LONG == 32
  # define WMULT_CONST (~0UL)
  #else
 @@ -5083,6 +5150,46 @@ static bool numa_allow_migration(struct task_struct 
 *p, int prev_cpu, int new_cp
   return true;
  }

 +static bool is_buddy_busy(int cpu)
 +{
 +
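
The wake-up path that actually consumes sd_pack_buddy is not part of the hunks
quoted above. As a rough, hypothetical sketch only (the helper name and the
exact policy are assumptions for illustration, not taken from the patch), the
per-CPU buddy could be consulted at wake-up along these lines, reusing the
is_light_task() and is_buddy_busy() helpers the patch introduces:

/* Hypothetical illustration only, not the patch's actual selection code:
 * divert a waking task onto its buddy CPU when the task is light and the
 * buddy is not already busy, otherwise keep the normal wake-up target.
 */
static int pick_packing_cpu(struct task_struct *p, int target)
{
	int buddy = per_cpu(sd_pack_buddy, target);

	/* no buddy was found when the domains were built */
	if (buddy == -1 || buddy == target)
		return target;

	if (is_light_task(p) && !is_buddy_busy(buddy))
		return buddy;

	return target;
}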

Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Alex Shi
On 12/13/2012 06:11 PM, Vincent Guittot wrote:
 On 13 December 2012 03:17, Alex Shi alex@intel.com wrote:
 On 12/12/2012 09:31 PM, Vincent Guittot wrote:
 During the creation of sched_domain, we define a pack buddy CPU for each CPU
 when one is available. We want to pack at all levels where a group of CPU 
 can
 be power gated independently from others.
 On a system that can't power gate a group of CPUs independently, the flag is
 set at all sched_domain level and the buddy is set to -1. This is the 
 default
 behavior.
 On a dual clusters / dual cores system which can power gate each core and
 cluster independently, the buddy configuration will be :

   | Cluster 0   | Cluster 1   |
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

 Small tasks tend to slip out of the periodic load balance so the best place
 to choose to migrate them is during their wake up. The decision is in O(1) 
 as
 we only check again one buddy CPU

 Just have a little worry about the scalability on a big machine, like on
 a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
 system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
 is different on task distribution decision.
 
 The buddy CPU should probably not be the same for all 64 LCPU it
 depends on where it's worth packing small tasks

Do you have further ideas for buddy cpu on such example?
 
 Which kind of sched_domain configuration have you for such system ?
 and how many sched_domain level have you ?

it is general X86 domain configuration. with 4 levels,
sibling/core/cpu/numa.
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Vincent Guittot
On 13 December 2012 15:25, Alex Shi alex@intel.com wrote:
 On 12/13/2012 06:11 PM, Vincent Guittot wrote:
 On 13 December 2012 03:17, Alex Shi alex@intel.com wrote:
 On 12/12/2012 09:31 PM, Vincent Guittot wrote:
 During the creation of sched_domain, we define a pack buddy CPU for each 
 CPU
 when one is available. We want to pack at all levels where a group of CPU 
 can
 be power gated independently from others.
 On a system that can't power gate a group of CPUs independently, the flag 
 is
 set at all sched_domain level and the buddy is set to -1. This is the 
 default
 behavior.
 On a dual clusters / dual cores system which can power gate each core and
 cluster independently, the buddy configuration will be :

   | Cluster 0   | Cluster 1   |
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

 Small tasks tend to slip out of the periodic load balance so the best place
 to choose to migrate them is during their wake up. The decision is in O(1) 
 as
 we only check again one buddy CPU

 Just have a little worry about the scalability on a big machine, like on
 a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
 system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
 is different on task distribution decision.

 The buddy CPU should probably not be the same for all 64 LCPU it
 depends on where it's worth packing small tasks

 Do you have further ideas for buddy cpu on such example?

yes, I have several ideas which were not really relevant for small
system but could be interesting for larger system

We keep the same algorithm in a socket but we could either use another
LCPU in the targeted socket (conf0) or chain the socket (conf1)
instead of packing directly in one LCPU

The scheme below tries to summarize the idea:

Socket      | socket 0 | socket 1   | socket 2   | socket 3   |
LCPU        | 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
buddy conf0 | 0 | 0    | 1  | 16    | 2  | 32    | 3  | 48    |
buddy conf1 | 0 | 0    | 0  | 16    | 16 | 32    | 32 | 48    |
buddy conf2 | 0 | 0    | 16 | 16    | 32 | 32    | 48 | 48    |

But, I don't know how this can interact with NUMA load balance and the
better might be to use conf3.
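
To make the three configurations concrete, here is a small, self-contained
sketch that reproduces the buddy rows of the table above for an assumed
4-socket machine with 16 LCPUs per socket; the conf* helpers and the
CPUS_PER_SOCKET/NR_CPUS values are illustration-only assumptions, not code
from the patch.

#include <stdio.h>

#define CPUS_PER_SOCKET	16
#define NR_CPUS		64

/* conf0: pack inside the socket; each socket leader packs on a distinct
 * low LCPU of socket 0 (0, 1, 2, 3) */
static int buddy_conf0(int cpu)
{
	int first = (cpu / CPUS_PER_SOCKET) * CPUS_PER_SOCKET;

	return (cpu == first) ? cpu / CPUS_PER_SOCKET : first;
}

/* conf1: pack inside the socket; socket leaders chain to the previous socket */
static int buddy_conf1(int cpu)
{
	int first = (cpu / CPUS_PER_SOCKET) * CPUS_PER_SOCKET;

	if (cpu != first)
		return first;
	return first ? first - CPUS_PER_SOCKET : 0;
}

/* conf2: purely local packing, every LCPU points at the 1st LCPU of its socket */
static int buddy_conf2(int cpu)
{
	return (cpu / CPUS_PER_SOCKET) * CPUS_PER_SOCKET;
}

int main(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("LCPU%2d: conf0 -> %2d, conf1 -> %2d, conf2 -> %2d\n",
		       cpu, buddy_conf0(cpu), buddy_conf1(cpu), buddy_conf2(cpu));
	return 0;
}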


 Which kind of sched_domain configuration have you for such system ?
 and how many sched_domain level have you ?

 it is general X86 domain configuration. with 4 levels,
 sibling/core/cpu/numa.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Vincent Guittot
On 13 December 2012 15:53, Vincent Guittot vincent.guit...@linaro.org wrote:
 On 13 December 2012 15:25, Alex Shi alex@intel.com wrote:
 On 12/13/2012 06:11 PM, Vincent Guittot wrote:
 On 13 December 2012 03:17, Alex Shi alex@intel.com wrote:
 On 12/12/2012 09:31 PM, Vincent Guittot wrote:
 During the creation of sched_domain, we define a pack buddy CPU for each 
 CPU
 when one is available. We want to pack at all levels where a group of CPU 
 can
 be power gated independently from others.
 On a system that can't power gate a group of CPUs independently, the flag 
 is
 set at all sched_domain level and the buddy is set to -1. This is the 
 default
 behavior.
 On a dual clusters / dual cores system which can power gate each core and
 cluster independently, the buddy configuration will be :

   | Cluster 0   | Cluster 1   |
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

 Small tasks tend to slip out of the periodic load balance so the best 
 place
 to choose to migrate them is during their wake up. The decision is in 
 O(1) as
 we only check again one buddy CPU

 Just have a little worry about the scalability on a big machine, like on
 a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
 system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
 is different on task distribution decision.

 The buddy CPU should probably not be the same for all 64 LCPU it
 depends on where it's worth packing small tasks

 Do you have further ideas for buddy cpu on such example?

 yes, I have several ideas which were not really relevant for small
 system but could be interesting for larger system

 We keep the same algorithm in a socket but we could either use another
 LCPU in the targeted socket (conf0) or chain the socket (conf1)
 instead of packing directly in one LCPU

 The scheme below tries to summaries the idea:

 Socket  | socket 0 | socket 1   | socket 2   | socket 3   |
 LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
 buddy conf0 | 0 | 0| 1  | 16| 2  | 32| 3  | 48|
 buddy conf1 | 0 | 0| 0  | 16| 16 | 32| 32 | 48|
 buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48|

 But, I don't know how this can interact with NUMA load balance and the
 better might be to use conf3.

I mean conf2 not conf3



 Which kind of sched_domain configuration have you for such system ?
 and how many sched_domain level have you ?

 it is general X86 domain configuration. with 4 levels,
 sibling/core/cpu/numa.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Alex Shi
On 12/13/2012 11:48 PM, Vincent Guittot wrote:
 On 13 December 2012 15:53, Vincent Guittot vincent.guit...@linaro.org wrote:
 On 13 December 2012 15:25, Alex Shi alex@intel.com wrote:
 On 12/13/2012 06:11 PM, Vincent Guittot wrote:
 On 13 December 2012 03:17, Alex Shi alex@intel.com wrote:
 On 12/12/2012 09:31 PM, Vincent Guittot wrote:
 During the creation of sched_domain, we define a pack buddy CPU for each 
 CPU
 when one is available. We want to pack at all levels where a group of 
 CPU can
 be power gated independently from others.
 On a system that can't power gate a group of CPUs independently, the 
 flag is
 set at all sched_domain level and the buddy is set to -1. This is the 
 default
 behavior.
 On a dual clusters / dual cores system which can power gate each core and
 cluster independently, the buddy configuration will be :

   | Cluster 0   | Cluster 1   |
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

 Small tasks tend to slip out of the periodic load balance so the best 
 place
 to choose to migrate them is during their wake up. The decision is in 
 O(1) as
 we only check again one buddy CPU

 Just have a little worry about the scalability on a big machine, like on
 a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
 system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
 is different on task distribution decision.

 The buddy CPU should probably not be the same for all 64 LCPU it
 depends on where it's worth packing small tasks

 Do you have further ideas for buddy cpu on such example?

 yes, I have several ideas which were not really relevant for small
 system but could be interesting for larger system

 We keep the same algorithm in a socket but we could either use another
 LCPU in the targeted socket (conf0) or chain the socket (conf1)
 instead of packing directly in one LCPU

 The scheme below tries to summarize the idea:

 Socket  | socket 0 | socket 1   | socket 2   | socket 3   |
 LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
 buddy conf0 | 0 | 0| 1  | 16| 2  | 32| 3  | 48|
 buddy conf1 | 0 | 0| 0  | 16| 16 | 32| 32 | 48|
 buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48|

 But, I don't know how this can interact with NUMA load balance and the
 better might be to use conf3.
 
 I mean conf2 not conf3

So, it has 4 levels 0/16/32/ for socket 3 and 0 levels for socket 0; the
depth is unbalanced across the sockets.

And the ground level has just one buddy for 16 LCPUs - 8 cores, which is
not a good design. Consider my previous example: if there are 4 or 8
tasks in one socket, you have just 2 choices: spread them across all
cores, or pack them onto one LCPU. Actually, moving them onto just 2 or 4
cores may be a better solution, but the design misses this.

Obviously, more and more cores is the trend for every kind of CPU, and
the buddy system seems hard pressed to keep up with this.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Mike Galbraith
On Thu, 2012-12-13 at 22:25 +0800, Alex Shi wrote: 
 On 12/13/2012 06:11 PM, Vincent Guittot wrote:
  On 13 December 2012 03:17, Alex Shi alex@intel.com wrote:
  On 12/12/2012 09:31 PM, Vincent Guittot wrote:
  During the creation of sched_domain, we define a pack buddy CPU for each 
  CPU
  when one is available. We want to pack at all levels where a group of CPU 
  can
  be power gated independently from others.
  On a system that can't power gate a group of CPUs independently, the flag 
  is
  set at all sched_domain level and the buddy is set to -1. This is the 
  default
  behavior.
  On a dual clusters / dual cores system which can power gate each core and
  cluster independently, the buddy configuration will be :
 
| Cluster 0   | Cluster 1   |
| CPU0 | CPU1 | CPU2 | CPU3 |
  ---
  buddy | CPU0 | CPU0 | CPU0 | CPU2 |
 
  Small tasks tend to slip out of the periodic load balance so the best 
  place
  to choose to migrate them is during their wake up. The decision is in 
  O(1) as
  we only check again one buddy CPU
 
  Just have a little worry about the scalability on a big machine, like on
  a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
  system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
  is different on task distribution decision.
  
  The buddy CPU should probably not be the same for all 64 LCPU it
  depends on where it's worth packing small tasks
 
 Do you have further ideas for buddy cpu on such example?
  
  Which kind of sched_domain configuration have you for such system ?
  and how many sched_domain level have you ?
 
 it is general X86 domain configuration. with 4 levels,
 sibling/core/cpu/numa.

CPU is a bug that slipped into domain degeneration.  You should have
SIBLING/MC/NUMA (chasing that down is on todo).

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Alex Shi
On 12/14/2012 12:45 PM, Mike Galbraith wrote:
  Do you have further ideas for buddy cpu on such example?
   
   Which kind of sched_domain configuration have you for such system ?
   and how many sched_domain level have you ?
  
  it is general X86 domain configuration. with 4 levels,
  sibling/core/cpu/numa.
 CPU is a bug that slipped into domain degeneration.  You should have
 SIBLING/MC/NUMA (chasing that down is on todo).

Maybe.
the CPU/NUMA is different on domain flags, CPU has SD_PREFER_SIBLING.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Mike Galbraith
On Fri, 2012-12-14 at 14:36 +0800, Alex Shi wrote: 
 On 12/14/2012 12:45 PM, Mike Galbraith wrote:
   Do you have further ideas for buddy cpu on such example?

Which kind of sched_domain configuration have you for such system ?
and how many sched_domain level have you ?
   
   it is general X86 domain configuration. with 4 levels,
   sibling/core/cpu/numa.
  CPU is a bug that slipped into domain degeneration.  You should have
  SIBLING/MC/NUMA (chasing that down is on todo).
 
 Maybe.
 the CPU/NUMA is different on domain flags, CPU has SD_PREFER_SIBLING.

What I noticed during (an unrelated) bisection on a 40 core box was
domains going from so..

3.4.0-bisect (virgin)
[5.056214] CPU0 attaching sched-domain:
[5.065009]  domain 0: span 0,32 level SIBLING
[5.075011]   groups: 0 (cpu_power = 589) 32 (cpu_power = 589)
[5.088381]   domain 1: span 
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC
[5.107669]groups: 0,32 (cpu_power = 1178)  4,36 (cpu_power = 1178)  
8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178)
 16,48 (cpu_power = 1177) 20,52 (cpu_power = 1178) 
24,56 (cpu_power = 1177) 28,60 (cpu_power = 1177)
 64,72 (cpu_power = 1176) 68,76 (cpu_power = 1176)
[5.162115]domain 2: span 0-79 level NODE
[5.171927] groups: 
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11773)
   
1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11772)
   
2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 11773)
   
3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11770)

..to so, which looks a little bent.  CPU and MC have identical spans, so
CPU should have gone away, as it used to do.

3.6.0-bisect (virgin)
[3.978338] CPU0 attaching sched-domain:
[3.987125]  domain 0: span 0,32 level SIBLING
[3.997125]   groups: 0 (cpu_power = 588) 32 (cpu_power = 589)
[4.010477]   domain 1: span 
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC
[4.029748]groups: 0,32 (cpu_power = 1177)  4,36 (cpu_power = 1177)  
8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178)
 16,48 (cpu_power = 1178) 20,52 (cpu_power = 1178) 
24,56 (cpu_power = 1178) 28,60 (cpu_power = 1178)
 64,72 (cpu_power = 1178) 68,76 (cpu_power = 1177)
[4.084143]domain 2: span 
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level CPU
[4.103796] groups: 
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777)
[4.124373] domain 3: span 0-79 level NUMA
[4.134369]  groups: 
0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777)

1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11778)

2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 11778)

3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11780)

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-13 Thread Alex Shi
On 12/14/2012 03:45 PM, Mike Galbraith wrote:
 On Fri, 2012-12-14 at 14:36 +0800, Alex Shi wrote: 
 On 12/14/2012 12:45 PM, Mike Galbraith wrote:
 Do you have further ideas for buddy cpu on such example?

 Which kind of sched_domain configuration have you for such system ?
 and how many sched_domain level have you ?

 it is general X86 domain configuration. with 4 levels,
 sibling/core/cpu/numa.
 CPU is a bug that slipped into domain degeneration.  You should have
 SIBLING/MC/NUMA (chasing that down is on todo).

 Maybe.
 the CPU/NUMA is different on domain flags, CPU has SD_PREFER_SIBLING.
 
 What I noticed during (an unrelated) bisection on a 40 core box was
 domains going from so..
 
 3.4.0-bisect (virgin)
 [5.056214] CPU0 attaching sched-domain:
 [5.065009]  domain 0: span 0,32 level SIBLING
 [5.075011]   groups: 0 (cpu_power = 589) 32 (cpu_power = 589)
 [5.088381]   domain 1: span 
 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC
 [5.107669]groups: 0,32 (cpu_power = 1178)  4,36 (cpu_power = 1178)  
 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178)
  16,48 (cpu_power = 1177) 20,52 (cpu_power = 1178) 
 24,56 (cpu_power = 1177) 28,60 (cpu_power = 1177)
  64,72 (cpu_power = 1176) 68,76 (cpu_power = 1176)
 [5.162115]domain 2: span 0-79 level NODE
 [5.171927] groups: 
 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11773)

 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11772)

 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 11773)

 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11770)
 
 ..to so, which looks a little bent.  CPU and MC have identical spans, so
 CPU should have gone away, as it used to do.
 

better to remove one, and believe you can make it. :)
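
For reference, the degeneration rule being discussed boils down to something
like the simplified sketch below. It only illustrates the idea; the kernel's
actual sd_degenerate()/sd_parent_degenerate() checks also compare the
balancing flags of parent and child.

/* Simplified illustration: a parent domain that spans exactly the same CPUs
 * as its child and carries only a single group adds no balancing
 * information, so it is expected to be dropped when the domains are
 * attached. That is why an MC and a CPU level with identical spans should
 * collapse into one.
 */
static bool parent_looks_redundant(struct sched_domain *sd)
{
	struct sched_domain *parent = sd->parent;

	if (!parent)
		return false;

	if (!cpumask_equal(sched_domain_span(sd), sched_domain_span(parent)))
		return false;

	/* the group list is circular: a single group points to itself */
	return parent->groups == parent->groups->next;
}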

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-12 Thread Alex Shi
On 12/13/2012 10:17 AM, Alex Shi wrote:
> On 12/12/2012 09:31 PM, Vincent Guittot wrote:
>> During the creation of sched_domain, we define a pack buddy CPU for each CPU
>> when one is available. We want to pack at all levels where a group of CPU can
>> be power gated independently from others.
>> On a system that can't power gate a group of CPUs independently, the flag is
>> set at all sched_domain level and the buddy is set to -1. This is the default
>> behavior.
>> On a dual clusters / dual cores system which can power gate each core and
>> cluster independently, the buddy configuration will be :
>>
>>   | Cluster 0   | Cluster 1   |
>>   | CPU0 | CPU1 | CPU2 | CPU3 |
>> ---
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>>
>> Small tasks tend to slip out of the periodic load balance so the best place
>> to choose to migrate them is during their wake up. The decision is in O(1) as
>> we only check again one buddy CPU
> 
> Just have a little worry about the scalability on a big machine, like on
> a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
> system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
> is different on task distribution decision.

In the big machine example above, a single buddy cpu is not sufficient at
each level. At the 4-socket level, for instance, the tasks may fill just
2 sockets, and using only those 2 sockets is more performance/power
efficient. But a single buddy cpu here needs to spread tasks across all 4
sockets.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-12 Thread Alex Shi
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
> During the creation of sched_domain, we define a pack buddy CPU for each CPU
> when one is available. We want to pack at all levels where a group of CPU can
> be power gated independently from others.
> On a system that can't power gate a group of CPUs independently, the flag is
> set at all sched_domain level and the buddy is set to -1. This is the default
> behavior.
> On a dual clusters / dual cores system which can power gate each core and
> cluster independently, the buddy configuration will be :
> 
>   | Cluster 0   | Cluster 1   |
>   | CPU0 | CPU1 | CPU2 | CPU3 |
> ---
> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
> 
> Small tasks tend to slip out of the periodic load balance so the best place
> to choose to migrate them is during their wake up. The decision is in O(1) as
> we only check again one buddy CPU

Just have a little worry about the scalability on a big machine, like on
a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
is different on task distribution decision.

> 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/core.c  |1 +
>  kernel/sched/fair.c  |  110 
> ++
>  kernel/sched/sched.h |5 +++
>  3 files changed, 116 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4f36e9d..3436aad 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5693,6 +5693,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
> root_domain *rd, int cpu)
>   rcu_assign_pointer(rq->sd, sd);
>   destroy_sched_domains(tmp, cpu);
>  
> + update_packing_domain(cpu);
>   update_domain_cache(cpu);
>  }
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9916d41..fc93d96 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -163,6 +163,73 @@ void sched_init_granularity(void)
>   update_sysctl();
>  }
>  
> +
> +#ifdef CONFIG_SMP
> +/*
> + * Save the id of the optimal CPU that should be used to pack small tasks
> + * The value -1 is used when no buddy has been found
> + */
> +DEFINE_PER_CPU(int, sd_pack_buddy);
> +
> +/* Look for the best buddy CPU that can be used to pack small tasks
> + * We make the assumption that it is not worth packing on CPUs that share the
> + * same powerline. We look for the 1st sched_domain without the
> + * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
> + * power per core based on the assumption that their power efficiency is
> + * better */
> +void update_packing_domain(int cpu)
> +{
> + struct sched_domain *sd;
> + int id = -1;
> +
> + sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN & SD_LOAD_BALANCE);
> + if (!sd)
> + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
> + else
> + sd = sd->parent;
> +
> + while (sd && (sd->flags && SD_LOAD_BALANCE)) {
> + struct sched_group *sg = sd->groups;
> + struct sched_group *pack = sg;
> + struct sched_group *tmp;
> +
> + /*
> +  * The sched_domain of a CPU points on the local sched_group
> +  * and the 1st CPU of this local group is a good candidate
> +  */
> + id = cpumask_first(sched_group_cpus(pack));
> +
> + /* loop the sched groups to find the best one */
> + for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
> + if (tmp->sgp->power * pack->group_weight >
> + pack->sgp->power * tmp->group_weight)
> + continue;
> +
> + if ((tmp->sgp->power * pack->group_weight ==
> + pack->sgp->power * tmp->group_weight)
> +  && (cpumask_first(sched_group_cpus(tmp)) >= id))
> + continue;
> +
> + /* we have found a better group */
> + pack = tmp;
> +
> + /* Take the 1st CPU of the new group */
> + id = cpumask_first(sched_group_cpus(pack));
> + }
> +
> + /* Look for another CPU than itself */
> + if (id != cpu)
> + break;
> +
> + sd = sd->parent;
> + }
> +
> + pr_debug("CPU%d packing on CPU%d\n", cpu, id);
> + per_cpu(sd_pack_buddy, cpu) = id;
> +}
> +
> +#endif /* CONFIG_SMP */
> +
>  #if BITS_PER_LONG == 32
>  # define WMULT_CONST (~0UL)
>  #else
> @@ -5083,6 +5150,46 @@ static bool numa_allow_migration(struct task_struct 
> *p, int prev_cpu, int new_cp
>   return true;
>  }
>  
> +static bool is_buddy_busy(int cpu)
> +{
> + struct rq *rq = cpu_rq(cpu);
> +
> + /*
> +  * A busy buddy is a CPU with a high load or a small load with a lot of
> +  * running tasks.

Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-12 Thread Alex Shi
On 12/12/2012 09:31 PM, Vincent Guittot wrote:
 During the creation of sched_domain, we define a pack buddy CPU for each CPU
 when one is available. We want to pack at all levels where a group of CPU can
 be power gated independently from others.
 On a system that can't power gate a group of CPUs independently, the flag is
 set at all sched_domain level and the buddy is set to -1. This is the default
 behavior.
 On a dual clusters / dual cores system which can power gate each core and
 cluster independently, the buddy configuration will be :
 
   | Cluster 0   | Cluster 1   |
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |
 
 Small tasks tend to slip out of the periodic load balance so the best place
 to choose to migrate them is during their wake up. The decision is in O(1) as
 we only check again one buddy CPU

Just have a little worry about the scalability on a big machine, like on
a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
is different on task distribution decision.

 
 Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
 ---
  kernel/sched/core.c  |1 +
  kernel/sched/fair.c  |  110 
 ++
  kernel/sched/sched.h |5 +++
  3 files changed, 116 insertions(+)
 
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 4f36e9d..3436aad 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -5693,6 +5693,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
 root_domain *rd, int cpu)
   rcu_assign_pointer(rq->sd, sd);
   destroy_sched_domains(tmp, cpu);
  
 + update_packing_domain(cpu);
   update_domain_cache(cpu);
  }
  
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 9916d41..fc93d96 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -163,6 +163,73 @@ void sched_init_granularity(void)
   update_sysctl();
  }
  
 +
 +#ifdef CONFIG_SMP
 +/*
 + * Save the id of the optimal CPU that should be used to pack small tasks
 + * The value -1 is used when no buddy has been found
 + */
 +DEFINE_PER_CPU(int, sd_pack_buddy);
 +
 +/* Look for the best buddy CPU that can be used to pack small tasks
 + * We make the assumption that it is not worth packing on CPUs that share the
 + * same powerline. We look for the 1st sched_domain without the
 + * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
 + * power per core based on the assumption that their power efficiency is
 + * better */
 +void update_packing_domain(int cpu)
 +{
 + struct sched_domain *sd;
 + int id = -1;
 +
 + sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN & SD_LOAD_BALANCE);
 + if (!sd)
 + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
 + else
 + sd = sd->parent;
 +
 + while (sd && (sd->flags && SD_LOAD_BALANCE)) {
 + struct sched_group *sg = sd->groups;
 + struct sched_group *pack = sg;
 + struct sched_group *tmp;
 +
 + /*
 +  * The sched_domain of a CPU points on the local sched_group
 +  * and the 1st CPU of this local group is a good candidate
 +  */
 + id = cpumask_first(sched_group_cpus(pack));
 +
 + /* loop the sched groups to find the best one */
 + for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
 + if (tmp->sgp->power * pack->group_weight >
 + pack->sgp->power * tmp->group_weight)
 + continue;
 +
 + if ((tmp->sgp->power * pack->group_weight ==
 + pack->sgp->power * tmp->group_weight)
 +  && (cpumask_first(sched_group_cpus(tmp)) >= id))
 + continue;
 +
 + /* we have found a better group */
 + pack = tmp;
 +
 + /* Take the 1st CPU of the new group */
 + id = cpumask_first(sched_group_cpus(pack));
 + }
 +
 + /* Look for another CPU than itself */
 + if (id != cpu)
 + break;
 +
 + sd = sd->parent;
 + }
 +
 + pr_debug("CPU%d packing on CPU%d\n", cpu, id);
 + per_cpu(sd_pack_buddy, cpu) = id;
 +}
 +
 +#endif /* CONFIG_SMP */
 +
  #if BITS_PER_LONG == 32
  # define WMULT_CONST (~0UL)
  #else
 @@ -5083,6 +5150,46 @@ static bool numa_allow_migration(struct task_struct 
 *p, int prev_cpu, int new_cp
   return true;
  }
  
 +static bool is_buddy_busy(int cpu)
 +{
 + struct rq *rq = cpu_rq(cpu);
 +
 + /*
 +  * A busy buddy is a CPU with a high load or a small load with a lot of
 +  * running tasks.
 +  */
 + return ((rq->avg.runnable_avg_sum << rq->nr_running) >

If nr_running a bit big, rq->avg.runnable_avg_sum << rq->nr_running 

Re: [RFC PATCH v2 3/6] sched: pack small tasks

2012-12-12 Thread Alex Shi
On 12/13/2012 10:17 AM, Alex Shi wrote:
 On 12/12/2012 09:31 PM, Vincent Guittot wrote:
 During the creation of sched_domain, we define a pack buddy CPU for each CPU
 when one is available. We want to pack at all levels where a group of CPU can
 be power gated independently from others.
 On a system that can't power gate a group of CPUs independently, the flag is
 set at all sched_domain level and the buddy is set to -1. This is the default
 behavior.
 On a dual clusters / dual cores system which can power gate each core and
 cluster independently, the buddy configuration will be :

   | Cluster 0   | Cluster 1   |
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

 Small tasks tend to slip out of the periodic load balance so the best place
 to choose to migrate them is during their wake up. The decision is in O(1) as
 we only check again one buddy CPU
 
 Just have a little worry about the scalability on a big machine, like on
 a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole
 system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That
 is different on task distribution decision.

In the big machine example above, a single buddy cpu is not sufficient at
each level. At the 4-socket level, for instance, the tasks may fill just
2 sockets, and using only those 2 sockets is more performance/power
efficient. But a single buddy cpu here needs to spread tasks across all 4
sockets.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/