Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-04-26 Thread Vincent Guittot
On 26 April 2013 12:30, Peter Zijlstra  wrote:
> On Wed, Mar 27, 2013 at 12:00:40PM +0100, Vincent Guittot wrote:
>> On 27 March 2013 11:21, Preeti U Murthy  wrote:
>> > Hi,
>> >
>> > On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
>> >> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> >>> +static bool is_buddy_busy(int cpu)
>> >>> +{
>> >>> +   struct rq *rq = cpu_rq(cpu);
>> >>> +
>> >>> +   /*
>> >>> +* A busy buddy is a CPU with a high load or a small load with a lot of
>> >>> +* running tasks.
>> >>> +*/
>> >>> +   return (rq->avg.runnable_avg_sum >
>> >>> +   (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>> >>> +}
>> >>
>> >> Why does the comment talk about load but we don't see it in the
>> >> equation. Also, why does nr_running matter at all? I thought we'd
>> >> simply bother with utilization, if fully utilized we're done etc..
>>
>> By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period.
>> In addition, I take into account the number of tasks already in the
>> runqueue in order to define the busyness of a CPU. A CPU with a load
>> of 50% and no task in its runqueue is not busy at this time, and
>> we can migrate tasks to it, but if the CPU already has 2 tasks in its
>> runqueue, a newly woken-up task will have to share the CPU
>> with the other tasks, so we consider that the CPU is already busy and we
>> fall back to the default behavior. The equation considers that a CPU
>> is not busy if
>> 100 * avg.runnable_avg_sum / avg.runnable_avg_period < 100 / (nr_running + 2)
>
> I'm still somewhat confused by all this. So raising nr_running will lower the
> required utilization to be considered busy. Suppose we have 50 tasks running,
> all at 1% utilization (bulk wakeup): we'd consider the cpu busy, even though
> it's picking its nose for half the time.
>
>
> I'm assuming it's meant to limit process latency or so? Why are you concerned
> with that? This seems like an arbitrary power vs performance tweak without
> solid evidence it's needed or even wanted.

Yes, the goal was to limit the wake-up latency. This version only tried
to modify the scheduler behavior when the system was not busy, in order
to pack the small tasks (like background activities) without decreasing
performance, so we were concerned about wakeup latency.

The new version proposes a more aggressive mode that packs all tasks
until the CPUs become full.

Vincent


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-04-26 Thread Preeti U Murthy
Hi Peter,

On 04/26/2013 03:48 PM, Peter Zijlstra wrote:
> On Wed, Mar 27, 2013 at 03:51:51PM +0530, Preeti U Murthy wrote:
>> Hi,
>>
>> On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
>>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
 +static bool is_buddy_busy(int cpu)
 +{
 +   struct rq *rq = cpu_rq(cpu);
 +
 +   /*
 +* A busy buddy is a CPU with a high load or a small load with a lot of
 +* running tasks.
 +*/
 +   return (rq->avg.runnable_avg_sum >
 +   (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
 +}
>>>
>>> Why does the comment talk about load but we don't see it in the
>>> equation. Also, why does nr_running matter at all? I thought we'd
>>> simply bother with utilization, if fully utilized we're done etc..
>>>
>>
>> Peter, let's say the run-queue has 50% utilization and is running 2
>> tasks. And we wish to find out if it is busy. We would compare this
>> metric with the cpu power, which let's say is 100.
>>
>> rq->util * 100 < cpu_of(rq)->power.
>>
>> In the above scenario would we declare the cpu _not_busy? Or would we do
>> the following:
>>
>> (rq->util * 100) * #nr_running <  cpu_of(rq)->power and conclude that it
>> is just enough _busy_ to not take on more processes?
> 
> That is just confused... ->power doesn't have anything to do with a per-cpu
> measure. ->power is an inter-cpu measure of relative compute capacity.

Ok.

> 
> Mixing in nr_running confuses things even more; it doesn't matter how many
> tasks it takes to push utilization up to 100%; once its there the cpu simply
> cannot run more.

True, this is from the perspective of the CPU. But won't the tasks on
this CPU get throttled if you find the utilization of this CPU < 100%
and decide to put more tasks on it?

Regards
Preeti U Murthy




Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-04-26 Thread Peter Zijlstra
On Wed, Mar 27, 2013 at 12:00:40PM +0100, Vincent Guittot wrote:
> On 27 March 2013 11:21, Preeti U Murthy  wrote:
> > Hi,
> >
> > On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
> >> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> >>> +static bool is_buddy_busy(int cpu)
> >>> +{
> >>> +   struct rq *rq = cpu_rq(cpu);
> >>> +
> >>> +   /*
> >>> +* A busy buddy is a CPU with a high load or a small load with a lot of
> >>> +* running tasks.
> >>> +*/
> >>> +   return (rq->avg.runnable_avg_sum >
> >>> +   (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
> >>> +}
> >>
> >> Why does the comment talk about load but we don't see it in the
> >> equation. Also, why does nr_running matter at all? I thought we'd
> >> simply bother with utilization, if fully utilized we're done etc..
> 
> By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period.
> In addition, I take into account the number of tasks already in the
> runqueue in order to define the busyness of a CPU. A CPU with a load
> of 50% and no task in its runqueue is not busy at this time, and
> we can migrate tasks to it, but if the CPU already has 2 tasks in its
> runqueue, a newly woken-up task will have to share the CPU
> with the other tasks, so we consider that the CPU is already busy and we
> fall back to the default behavior. The equation considers that a CPU
> is not busy if
> 100 * avg.runnable_avg_sum / avg.runnable_avg_period < 100 / (nr_running + 2)

I'm still somewhat confused by all this. So raising nr_running will lower the
required utilization to be considered busy. Suppose we have 50 tasks running,
all at 1% utilization (bulk wakeup): we'd consider the cpu busy, even though
it's picking its nose for half the time.


I'm assuming it's meant to limit process latency or so? Why are you concerned
with that? This seems like an arbitrary power vs performance tweak without
solid evidence it's needed or even wanted.
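
Spelling out the arithmetic of that example with the formula above:

   utilization     = 50 tasks * 1%          = 50%
   busy threshold  = 100 / (nr_running + 2) = 100 / 52 ~= 1.9%

so with 50% well above 1.9%, is_buddy_busy() reports the CPU as busy even
though it sits idle for roughly half of the time.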


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-04-26 Thread Peter Zijlstra
On Wed, Mar 27, 2013 at 03:51:51PM +0530, Preeti U Murthy wrote:
> Hi,
> 
> On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
> > On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> >> +static bool is_buddy_busy(int cpu)
> >> +{
> >> +   struct rq *rq = cpu_rq(cpu);
> >> +
> >> +   /*
> >> +* A busy buddy is a CPU with a high load or a small load with a lot of
> >> +* running tasks.
> >> +*/
> >> +   return (rq->avg.runnable_avg_sum >
> >> +   (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
> >> +}
> > 
> > Why does the comment talk about load but we don't see it in the
> > equation. Also, why does nr_running matter at all? I thought we'd
> > simply bother with utilization, if fully utilized we're done etc..
> > 
> 
> Peter, let's say the run-queue has 50% utilization and is running 2
> tasks. And we wish to find out if it is busy. We would compare this
> metric with the cpu power, which let's say is 100.
> 
> rq->util * 100 < cpu_of(rq)->power.
> 
> In the above scenario would we declare the cpu _not_busy? Or would we do
> the following:
> 
> (rq->util * 100) * #nr_running <  cpu_of(rq)->power and conclude that it
> is just enough _busy_ to not take on more processes?

That is just confused... ->power doesn't have anything to do with a per-cpu
measure. ->power is an inter-cpu measure of relative compute capacity.
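
For context, the ->power values here are scaled against SCHED_POWER_SCALE
(1024): a reference CPU carries roughly 1024 and a CPU with, say, half the
compute capacity roughly 512, i.e. they express how much compute a CPU
offers relative to the others rather than how much of one runqueue's time
is in use.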

Mixing in nr_running confuses things even more; it doesn't matter how many
tasks it takes to push utilization up to 100%; once its there the cpu simply
cannot run more.

So colour me properly confused..


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-27 Thread Catalin Marinas
On Wed, Mar 27, 2013 at 05:18:53PM +, Nicolas Pitre wrote:
> On Wed, 27 Mar 2013, Catalin Marinas wrote:
> 
> > So if the above works, the scheduler guys can mandate that little CPUs
> > are always first and for ARM it would be a matter of getting the right
> > CPU topology in the DT (independent of what hw vendors think of CPU
> > topology) and booting Linux on CPU 4 etc.
> 
> Just a note about that: if the scheduler mandates little CPUs first, 
> that should _not_ have any implications on the DT content.  DT is not 
> about encoding Linux specific implementation details.  It is simple 
> enough to tweak the CPU logical map at run time when enumerating CPUs.

You are right, though a simpler way (hack) to tweak the cpu_logical_map
is to change the DT ;).

But the problem is that the kernel doesn't know which CPU is big and
which is little, unless you specify this in some way via the DT. It can
be the cpu nodes order or some other means.

-- 
Catalin


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-27 Thread Nicolas Pitre
On Wed, 27 Mar 2013, Catalin Marinas wrote:

> So if the above works, the scheduler guys can mandate that little CPUs
> are always first and for ARM it would be a matter of getting the right
> CPU topology in the DT (independent of what hw vendors think of CPU
> topology) and booting Linux on CPU 4 etc.

Just a note about that: if the scheduler mandates little CPUs first, 
that should _not_ have any implications on the DT content.  DT is not 
about encoding Linux specific implementation details.  It is simple 
enough to tweak the CPU logical map at run time when enumerating CPUs.


Nicolas


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-27 Thread Nicolas Pitre
On Wed, 27 Mar 2013, Vincent Guittot wrote:

> On 27 March 2013 09:46, Peter Zijlstra  wrote:
> > On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
> >> > Isn't this basically related to picking the NO_HZ cpu; if the system
> >> > isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> >> > the 'special' cpu. If it is symmetric we really don't care which core
> >> > is left 'running' and we can even select a new pack cpu from the idle
> >> > cores once the old one is fully utilized.
> >>
> >> you don't really care much sure, but there's some advantages for sorting 
> >> "all the way left",
> >> e.g. to linux cpu 0.
> >> Some tasks only run there, and interrupts tend to be favored to that cpu 
> >> as well on x86.
> >
> > Right, and I suspect all the big-little nonsense will have the little
> > cores on low numbers as well (is this architected or can a creative
> > licensee screw us over?)
> 
> It's not mandatory to have little cores on low numbers even if it's advised

We can trivially move things around in the logical CPU mapping if that 
simplifies things.  However, the boot CPU might not be CPU0 in that
case, which might be less trivial.


Nicolas


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-27 Thread Vincent Guittot
On 27 March 2013 11:21, Preeti U Murthy  wrote:
> Hi,
>
> On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>>> +static bool is_buddy_busy(int cpu)
>>> +{
>>> +   struct rq *rq = cpu_rq(cpu);
>>> +
>>> +   /*
>>> +* A busy buddy is a CPU with a high load or a small load with a lot of
>>> +* running tasks.
>>> +*/
>>> +   return (rq->avg.runnable_avg_sum >
>>> +   (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>>> +}
>>
>> Why does the comment talk about load but we don't see it in the
>> equation. Also, why does nr_running matter at all? I thought we'd
>> simply bother with utilization, if fully utilized we're done etc..
>>
>
> Peter, let's say the run-queue has 50% utilization and is running 2
> tasks. And we wish to find out if it is busy. We would compare this
> metric with the cpu power, which let's say is 100.
>
> rq->util * 100 < cpu_of(rq)->power.

I don't use cpu_of(rq)->power in the definition of the busyness.

>
> In the above scenario would we declare the cpu _not_busy? Or would we do
> the following:

In the above scenario, the CPU is busy

By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period.
In addition, I take into account the number of tasks already in the
runqueue in order to define the busyness of a CPU. A CPU with a load
of 50% and no task in its runqueue is not busy at this time, and
we can migrate tasks to it, but if the CPU already has 2 tasks in its
runqueue, a newly woken-up task will have to share the CPU
with the other tasks, so we consider that the CPU is already busy and we
fall back to the default behavior. The equation considers that a CPU
is not busy if
100 * avg.runnable_avg_sum / avg.runnable_avg_period < 100 / (nr_running + 2)
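
For reference, the check in is_buddy_busy() above is just the integer form of
this inequality: dividing both sides by 100 and then multiplying by
runnable_avg_period (ignoring integer-division rounding) turns

   100 * runnable_avg_sum / runnable_avg_period < 100 / (nr_running + 2)

into

   runnable_avg_sum < runnable_avg_period / (nr_running + 2)

and the function returns the opposite (busy) side of that comparison.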

>
> (rq->util * 100) * #nr_running <  cpu_of(rq)->power and conclude that it
> is just enough _busy_ to not take on more processes?
>
>
> @Vincent: Yes the comment above needs to be fixed. A busy buddy is a CPU
> with *high rq utilization*, as far as the equation goes.

I can update the comment. Is the comment below clearer?

/*
 * A busy buddy is a CPU with a high average running time, or a small
 * average running time but a lot of running tasks in its runqueue
 * which are already sharing this CPU time.
 */

Vincent

>
> Regards
> Preeti U Murthy
>


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-27 Thread Preeti U Murthy
Hi,

On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> +static bool is_buddy_busy(int cpu)
>> +{
>> +   struct rq *rq = cpu_rq(cpu);
>> +
>> +   /*
>> +* A busy buddy is a CPU with a high load or a small load with a lot of
>> +* running tasks.
>> +*/
>> +   return (rq->avg.runnable_avg_sum >
>> +   (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>> +}
> 
> Why does the comment talk about load but we don't see it in the
> equation. Also, why does nr_running matter at all? I thought we'd
> simply bother with utilization, if fully utilized we're done etc..
> 

Peter, let's say the run-queue has 50% utilization and is running 2
tasks. And we wish to find out if it is busy. We would compare this
metric with the cpu power, which let's say is 100.

rq->util * 100 < cpu_of(rq)->power.

In the above scenario would we declare the cpu _not_busy? Or would we do
the following:

(rq->util * 100) * #nr_running <  cpu_of(rq)->power and conclude that it
is just enough _busy_ to not take on more processes?


@Vincent: Yes the comment above needs to be fixed. A busy buddy is a CPU
with *high rq utilization*, as far as the equation goes.

Regards
Preeti U Murthy



Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-27 Thread Peter Zijlstra
On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
> It's not mandatory to have little cores on low numbers even if it's
> advised

ARGH!



Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-27 Thread Peter Zijlstra
On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
> > Isn't this basically related to picking the NO_HZ cpu; if the system
> > isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> > the 'special' cpu. If it is symmetric we really don't care which core
> > is left 'running' and we can even select a new pack cpu from the idle
> > cores once the old one is fully utilized.
> 
> you don't really care much sure, but there's some advantages for sorting "all 
> the way left",
> e.g. to linux cpu 0.
> Some tasks only run there, and interrupts tend to be favored to that cpu as 
> well on x86.

Right, and I suspect all the big-little nonsense will have the little
cores on low numbers as well (is this architected or can a creative
licensee screw us over?)

So find_new_ilb() already does cpumask_first(), so it has a strong
leftmost preference. We just need to make sure it indeed does the right
thing and doesn't have some unintended side effect.
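
For reference, a rough sketch of the kind of leftmost preference meant here
(a paraphrase of the idea, not the exact kernel source; the function name
below is made up):

/* needs <linux/cpumask.h> and <linux/sched.h> */
static int pick_leftmost_idle_cpu(const struct cpumask *idle_cpus)
{
        /* cpumask_first() scans from CPU 0 upwards, so low numbers win */
        int cpu = cpumask_first(idle_cpus);

        if (cpu < nr_cpu_ids && idle_cpu(cpu))
                return cpu;

        return nr_cpu_ids;      /* no idle CPU available */
}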



Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-27 Thread Vincent Guittot
On 27 March 2013 09:46, Peter Zijlstra  wrote:
> On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
>> > Isn't this basically related to picking the NO_HZ cpu; if the system
>> > isn't fully symmetric with its power gates you want the NO_HZ cpu to be
>> > the 'special' cpu. If it is symmetric we really don't care which core
>> > is left 'running' and we can even select a new pack cpu from the idle
>> > cores once the old one is fully utilized.
>>
>> you don't really care much sure, but there's some advantages for sorting 
>> "all the way left",
>> e.g. to linux cpu 0.
>> Some tasks only run there, and interrupts tend to be favored to that cpu as 
>> well on x86.
>
> Right, and I suspect all the big-little nonsense will have the little
> cores on low numbers as well (is this architected or can a creative
> licensee screw us over?)

It's not mandatory to have little cores on low numbers even if it's advised

>
> So find_new_ilb() already does cpumask_first(), so it has a strong
> leftmost preference. We just need to make sure it indeed does the right
> thing and doesn't have some unintended side effect.
>


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-27 Thread Peter Zijlstra
On Wed, 2013-03-27 at 12:48 +0800, Alex Shi wrote:
> 
> Yes, the newly forked runnable load was set to full utilisation in the V5
> power aware scheduling patchset. PJT, Mike and I all agree on this. PJT
> just discussed how to give the full load to a newly forked task, and we
> reached agreement for my coming V6 power aware scheduling patchset.

Great! That means I can mark the v5 thread as read! :-)



Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Alex Shi
On 03/27/2013 12:33 PM, Preeti U Murthy wrote:
> Hi Peter,
> 
> On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>>> +static bool is_light_task(struct task_struct *p)
>>> +{
>>> +   /* A light task runs less than 20% in average */
>>> +   return ((p->se.avg.runnable_avg_sum  * 5) <
>>> +   (p->se.avg.runnable_avg_period));
>>> +}
>>
>> OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
>> we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
>> this as well.
>>
>> PJT, do you have any sane solution for this, I forgot what the result
>> of the last discussion was -- was there any?
> 
> The conclusion after last discussion between PJT and Alex was that the
> load contribution of a fresh task be set to "full" during "__sched_fork()".
> 
> task->se.avg.load_avg_contrib = task->se.load.weight during
> __sched_fork() is reflected in the latest power aware scheduler patchset
> by Alex.

Yes, the newly forked runnable load was set to full utilisation in the V5
power aware scheduling patchset. PJT, Mike and I all agree on this. PJT
just discussed how to give the full load to a newly forked task, and we
reached agreement for my coming V6 power aware scheduling patchset.

> 
> Thanks
> 
> Regards
> Preeti U Murthy
>>
> 


-- 
Thanks Alex


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Preeti U Murthy
Hi Peter,

On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> +static bool is_light_task(struct task_struct *p)
>> +{
>> +   /* A light task runs less than 20% in average */
>> +   return ((p->se.avg.runnable_avg_sum  * 5) <
>> +   (p->se.avg.runnable_avg_period));
>> +}
> 
> OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
> we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
> this as well.
> 
> PJT, do you have any sane solution for this, I forgot what the result
> of the last discussion was -- was there any?

The conclusion after last discussion between PJT and Alex was that the
load contribution of a fresh task be set to "full" during "__sched_fork()".

task->se.avg.load_avg_contrib = task->se.load.weight during
__sched_fork() is reflected in the latest power aware scheduler patchset
by Alex.
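
A minimal sketch of what that initialization amounts to (the field names are
those of the per-entity load-tracking code of that time; the helper name and
the period value are assumptions, not Alex's actual patch):

/* needs <linux/sched.h> with per-entity load tracking (struct sched_avg) */
static void init_new_task_load(struct task_struct *p)
{
        struct sched_entity *se = &p->se;

        /*
         * Pretend the freshly forked task was runnable for a whole
         * averaging period, so it looks fully loaded instead of idle.
         */
        se->avg.runnable_avg_period = 1024;     /* one tracking segment, assumed */
        se->avg.runnable_avg_sum = se->avg.runnable_avg_period;
        se->avg.load_avg_contrib = se->load.weight;
}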

Thanks

Regards
Preeti U Murthy
> 



Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Arjan van de Ven



> Isn't this basically related to picking the NO_HZ cpu; if the system
> isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> the 'special' cpu. If it is symmetric we really don't care which core
> is left 'running' and we can even select a new pack cpu from the idle
> cores once the old one is fully utilized.


you don't really care much sure, but there's some advantages for sorting "all the 
way left",
e.g. to linux cpu 0.
Some tasks only run there, and interrupts tend to be favored to that cpu as 
well on x86.




Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Vincent Guittot
On 26 March 2013 13:46, Peter Zijlstra  wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> During the creation of sched_domain, we define a pack buddy CPU for
>> each CPU when one is available. We want to pack at all levels where a
>> group of CPU can be power gated independently from others.
>> On a system that can't power gate a group of CPUs independently, the
>> flag is set at all sched_domain level and the buddy is set to -1.
>> This is the default behavior.
>> On a dual clusters / dual cores system which can power gate each core
>> and cluster independently, the buddy configuration will be :
>>
>>       | Cluster 0   | Cluster 1   |
>>       | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> I suppose this is adequate for the 'small' systems you currently have;
> but given that Samsung is already bragging with its 'octo'-core Exynos
> 5 (4+4 big-little thing) does this solution scale?

The packing is only done at MC and CPU level to minimize the number of
transitions.

>
> Isn't this basically related to picking the NO_HZ cpu; if the system
> isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> the 'special' cpu. If it is symmetric we really don't care which core
> is left 'running' and we can even select a new pack cpu from the idle
> cores once the old one is fully utilized.

I agree that on a symmetric system, we don't really care about which
core is selected but we want to use the same one whenever possible to
prevent a ping pong between several cores or groups of cores, which is
power consuming. By forcing a NOHZ cpu, your background activity will
smoothly pack on this CPU and will not be spread on your system.
When a CPU is fully loaded, we are no longer in a low CPU load use case,
and the periodic load balance can handle the situation and select a new
target CPU which is close to the buddy CPU.

>
> Re-using (or integrating) with NO_HZ has the dual advantage that you'll
> make NO_HZ do the right thing for big-little (you typically want a
> little core to be the one staying 'awake'), and once someone makes NO_HZ
> scale, this all gets to scale along with it.
>

I think that you have answered this question in your comment on
patch 5, haven't you?

Vincent


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Vincent Guittot
On 26 March 2013 13:37, Peter Zijlstra  wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> +static bool is_light_task(struct task_struct *p)
>> +{
>> +   /* A light task runs less than 20% in average */
>> +   return ((p->se.avg.runnable_avg_sum  * 5) <
>> +   (p->se.avg.runnable_avg_period));
>> +}
>
> OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
> we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
> this as well.
Hi Peter,

The packing of small tasks is only applied at wake-up and not during fork
or exec, so the runnable_avg_* should have been initialized. As you
mentioned, we assume that a fresh task is fully loaded and let the
default scheduler behavior select a target CPU.
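
To make that wakeup-time decision concrete, a simplified sketch (the per-CPU
buddy table name and the helper are assumptions, not the actual patch code;
is_light_task() and is_buddy_busy() are the helpers quoted earlier in the
thread):

static int choose_pack_cpu(struct task_struct *p, int cpu)
{
        int buddy = per_cpu(sd_pack_buddy, cpu);        /* assumed per-CPU buddy table */

        /* pack only small tasks, and only while the buddy still has room */
        if (buddy != -1 && is_light_task(p) && !is_buddy_busy(buddy))
                return buddy;

        return cpu;     /* otherwise keep the default wakeup behaviour */
}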

Vincent

>
> PJT, do you have any sane solution for this, I forgot what the result
> of the last discussion was -- was there any?
>


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Peter Zijlstra
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> During the creation of sched_domain, we define a pack buddy CPU for
> each CPU when one is available. We want to pack at all levels where a
> group of CPU can be power gated independently from others.
> On a system that can't power gate a group of CPUs independently, the
> flag is set at all sched_domain level and the buddy is set to -1.
> This is the default behavior.
> On a dual clusters / dual cores system which can power gate each core
> and cluster independently, the buddy configuration will be :
>
>       | Cluster 0   | Cluster 1   |
>       | CPU0 | CPU1 | CPU2 | CPU3 |
> -----------------------------------
> buddy | CPU0 | CPU0 | CPU0 | CPU2 |

I suppose this is adequate for the 'small' systems you currently have;
but given that Samsung is already bragging with its 'octo'-core Exynos
5 (4+4 big-little thing) does this solution scale?

Isn't this basically related to picking the NO_HZ cpu; if the system
isn't fully symmetric with its power gates you want the NO_HZ cpu to be
the 'special' cpu. If it is symmetric we really don't care which core
is left 'running' and we can even select a new pack cpu from the idle
cores once the old one is fully utilized.

Re-using (or integrating) with NO_HZ has the dual advantage that you'll
make NO_HZ do the right thing for big-little (you typically want a
little core to be the one staying 'awake'), and once someone makes NO_HZ
scale, this all gets to scale along with it.



Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Peter Zijlstra
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> +static bool is_light_task(struct task_struct *p)
> +{
> +   /* A light task runs less than 20% in average */
> +   return ((p->se.avg.runnable_avg_sum  * 5) <
> +   (p->se.avg.runnable_avg_period));
> +}

OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
this as well.

PJT, do you have any sane solution for this, I forgot what the result
of the last discussion was -- was there any?
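
Spelling out the threshold in that snippet: runnable_avg_sum * 5 <
runnable_avg_period is the same as runnable_avg_sum / runnable_avg_period
< 20%, e.g. with a period of 1000, a sum of 150 (15%) counts as light while
a sum of 300 (30%) does not.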



Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Peter Zijlstra
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> +static bool is_buddy_busy(int cpu)
> +{
> +   struct rq *rq = cpu_rq(cpu);
> +
> +   /*
> +* A busy buddy is a CPU with a high load or a small load with a lot of
> +* running tasks.
> +*/
> +   return (rq->avg.runnable_avg_sum >
> +   (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
> +}

Why does the comment talk about load but we don't see it in the
equation. Also, why does nr_running matter at all? I thought we'd
simply bother with utilization, if fully utilized we're done etc..



Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Peter Zijlstra
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
 +static bool is_buddy_busy(int cpu)
 +{
 +   struct rq *rq = cpu_rq(cpu);
 +
 +   /*
 +* A busy buddy is a CPU with a high load or a small load with
 a lot of
 +* running tasks.
 +*/
 +   return (rq-avg.runnable_avg_sum 
 +   (rq-avg.runnable_avg_period / (rq-nr_running
 + 2)));
 +}

Why does the comment talk about load but we don't see it in the
equation. Also, why does nr_running matter at all? I thought we'd
simply bother with utilization, if fully utilized we're done etc..

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Peter Zijlstra
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
 +static bool is_light_task(struct task_struct *p)
 +{
 +   /* A light task runs less than 20% in average */
 +   return ((p-se.avg.runnable_avg_sum  * 5) 
 +   (p-se.avg.runnable_avg_period));
 +}

OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
this as well.

PJT, do you have any sane solution for this, I forgot what the result
of the last discussion was -- was there any?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Peter Zijlstra
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
 During the creation of sched_domain, we define a pack buddy CPU for
 each CPU
 when one is available. We want to pack at all levels where a group of
 CPU can
 be power gated independently from others.
 On a system that can't power gate a group of CPUs independently, the
 flag is
 set at all sched_domain level and the buddy is set to -1. This is the
 default
 behavior.
 On a dual clusters / dual cores system which can power gate each core
 and
 cluster independently, the buddy configuration will be :
 
   | Cluster 0   | Cluster 1   |
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

I suppose this is adequate for the 'small' systems you currently have;
but given that Samsung is already bragging with its 'octo'-core Exynos
5 (4+4 big-little thing) does this solution scale?

Isn't this basically related to picking the NO_HZ cpu; if the system
isn't fully symmetric with its power gates you want the NO_HZ cpu to be
the 'special' cpu. If it is symmetric we really don't care which core
is left 'running' and we can even select a new pack cpu from the idle
cores once the old one is fully utilized.

Re-using (or integrating) with NO_HZ has the dual advantage that you'll
make NO_HZ do the right thing for big-little (you typically want a
little core to be the one staying 'awake' and once someone makes NO_HZ
scale this all gets to scale along with it.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Vincent Guittot
On 26 March 2013 13:37, Peter Zijlstra pet...@infradead.org wrote:
 On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
 +static bool is_light_task(struct task_struct *p)
 +{
 +   /* A light task runs less than 20% in average */
 +   return ((p-se.avg.runnable_avg_sum  * 5) 
 +   (p-se.avg.runnable_avg_period));
 +}

 OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
 we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
 this as well.
Hi Peter,

The packing of small tasks is only applied at wake up and not during fork
or exec, so the runnable_avg_* should have been initialized. As you
mentioned, we assume that a fresh task is fully loaded and let the
default scheduler behavior select a target CPU.

Vincent
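
To make the wake-up path described above concrete, here is a rough
standalone model combining the two tests (assumed structures and values;
the real check_pack_buddy() is only partially visible in this thread and
may differ):

#include <stdbool.h>
#include <stdio.h>

struct task_model { unsigned int avg_sum, avg_period; };
struct cpu_model  { unsigned int avg_sum, avg_period, nr_running; int buddy; };

static bool light_task(const struct task_model *p)
{
        return (p->avg_sum * 5) < p->avg_period;
}

static bool buddy_busy(const struct cpu_model *c)
{
        return c->avg_sum > c->avg_period / (c->nr_running + 2);
}

/* Returns the CPU to wake the task on, or -1 for "use the default path". */
static int pack_target(const struct task_model *p,
                       const struct cpu_model cpus[], int waking_cpu)
{
        int buddy = cpus[waking_cpu].buddy;

        if (buddy < 0 || !light_task(p) || buddy_busy(&cpus[buddy]))
                return -1;
        return buddy;
}

int main(void)
{
        struct cpu_model cpus[2] = {
                { .avg_sum = 10, .avg_period = 100, .nr_running = 0, .buddy = 0 },
                { .avg_sum = 30, .avg_period = 100, .nr_running = 1, .buddy = 0 },
        };
        struct task_model p = { .avg_sum = 100, .avg_period = 1000 }; /* ~10% */

        printf("pack target: %d\n", pack_target(&p, cpus, 1)); /* -> 0 */
        return 0;
}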


 PJT, do you have any sane solution for this, I forgot what the result
 of the last discussion was -- was there any?



Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Vincent Guittot
On 26 March 2013 13:46, Peter Zijlstra pet...@infradead.org wrote:
 On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
 During the creation of sched_domain, we define a pack buddy CPU for
 each CPU
 when one is available. We want to pack at all levels where a group of
 CPU can
 be power gated independently from others.
 On a system that can't power gate a group of CPUs independently, the
 flag is
 set at all sched_domain level and the buddy is set to -1. This is the
 default
 behavior.
 On a dual clusters / dual cores system which can power gate each core
 and
 cluster independently, the buddy configuration will be :

   | Cluster 0   | Cluster 1   |
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

 I suppose this is adequate for the 'small' systems you currently have;
 but given that Samsung is already bragging with its 'octo'-core Exynos
 5 (4+4 big-little thing) does this solution scale?

The packing is only done at MC and CPU level to minimize the number of
transitions.


 Isn't this basically related to picking the NO_HZ cpu; if the system
 isn't fully symmetric with its power gates you want the NO_HZ cpu to be
 the 'special' cpu. If it is symmetric we really don't care which core
 is left 'running' and we can even select a new pack cpu from the idle
 cores once the old one is fully utilized.

I agree that on a symmetric system we don't really care about which
core is selected, but we want to use the same one whenever possible to
prevent a ping-pong between several cores or groups of cores, which is
power consuming. By forcing a NOHZ cpu, your background activity will
smoothly pack on this CPU and will not be spread across your system.
When a CPU is fully loaded, we no longer fall in the low CPU load use case,
and the periodic load balance can handle the situation by selecting a new
target CPU close to the buddy CPU.


 Re-using (or integrating) with NO_HZ has the dual advantage that you'll
 make NO_HZ do the right thing for big-little (you typically want a
 little core to be the one staying 'awake'), and once someone makes NO_HZ
 scale, this all gets to scale along with it.


I think you have answered this question in your comment on patch 5,
haven't you?

Vincent


Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Arjan van de Ven



Isn't this basically related to picking the NO_HZ cpu; if the system
isn't fully symmetric with its power gates you want the NO_HZ cpu to be
the 'special' cpu. If it is symmetric we really don't care which core
is left 'running' and we can even select a new pack cpu from the idle
cores once the old one is fully utilized.


You don't really care much, sure, but there are some advantages to sorting
all the way left, e.g. to Linux CPU 0.
Some tasks only run there, and interrupts tend to be favored to that CPU as
well on x86.




Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Preeti U Murthy
Hi Peter,

On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
 On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
 +static bool is_light_task(struct task_struct *p)
 +{
 +   /* A light task runs less than 20% in average */
 +   return ((p->se.avg.runnable_avg_sum  * 5) <
 +   (p->se.avg.runnable_avg_period));
 +}
 
 OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
 we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
 this as well.
 
 PJT, do you have any sane solution for this, I forgot what the result
 of the last discussion was -- was there any?

The conclusion after the last discussion between PJT and Alex was that the
load contribution of a fresh task should be set to full during __sched_fork().

Setting task->se.avg.load_avg_contrib = task->se.load.weight during
__sched_fork() is reflected in the latest power aware scheduler patchset
by Alex.

Thanks

Regards
Preeti U Murthy
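
A minimal standalone model of the initialization described above, assuming
the field names from the quoted discussion (the actual hunk lives in Alex's
patchset and may differ):

#include <stdio.h>

/* Models only the fields that matter for this discussion. */
struct sched_avg_model {
        unsigned int runnable_avg_sum;
        unsigned int runnable_avg_period;
        unsigned long load_avg_contrib;
};

struct se_model {
        struct sched_avg_model avg;
        unsigned long load_weight;      /* stands in for se.load.weight */
};

static void model_sched_fork(struct se_model *se)
{
        se->avg.runnable_avg_sum = 0;
        se->avg.runnable_avg_period = 0;
        /* Treat the fresh task as fully loaded until it builds real history. */
        se->avg.load_avg_contrib = se->load_weight;
}

int main(void)
{
        struct se_model se = { .load_weight = 1024 };

        model_sched_fork(&se);
        printf("initial load_avg_contrib = %lu\n", se.avg.load_avg_contrib);
        return 0;
}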
 



Re: [RFC PATCH v3 3/6] sched: pack small tasks

2013-03-26 Thread Alex Shi
On 03/27/2013 12:33 PM, Preeti U Murthy wrote:
 Hi Peter,
 
 On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
 On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
 +static bool is_light_task(struct task_struct *p)
 +{
 +   /* A light task runs less than 20% in average */
 +   return ((p->se.avg.runnable_avg_sum  * 5) <
 +   (p->se.avg.runnable_avg_period));
 +}

 OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
 we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
 this as well.

 PJT, do you have any sane solution for this, I forgot what the result
 of the last discussion was -- was there any?
 
 The conclusion after the last discussion between PJT and Alex was that the
 load contribution of a fresh task should be set to full during __sched_fork().
 
 Setting task->se.avg.load_avg_contrib = task->se.load.weight during
 __sched_fork() is reflected in the latest power aware scheduler patchset
 by Alex.

Yes, the newly forked task's runnable load is set to full utilisation in the
V5 power aware scheduling patchset. PJT, Mike and I all agree on this. PJT
just discussed how to give the full load to a newly forked task, and we
reached agreement for my coming V6 power aware scheduling patchset.

 
 Thanks
 
 Regards
 Preeti U Murthy

 


-- 
Thanks Alex


[RFC PATCH v3 3/6] sched: pack small tasks

2013-03-22 Thread Vincent Guittot
During the creation of sched_domain, we define a pack buddy CPU for each CPU
when one is available. We want to pack at all levels where a group of CPU can
be power gated independently from others.
On a system that can't power gate a group of CPUs independently, the flag is
set at all sched_domain level and the buddy is set to -1. This is the default
behavior.
On a dual clusters / dual cores system which can power gate each core and
cluster independently, the buddy configuration will be :

  | Cluster 0   | Cluster 1   |
  | CPU0 | CPU1 | CPU2 | CPU3 |
---
buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance so the best place
to choose to migrate them is during their wake up. The decision is in O(1) as
we only check against one buddy CPU.

Signed-off-by: Vincent Guittot 
Reviewed-by: Morten Rasmussen 
---
 kernel/sched/core.c  |1 +
 kernel/sched/fair.c  |  115 ++
 kernel/sched/sched.h |5 +++
 3 files changed, 121 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b827e0c..21c35ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5662,6 +5662,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
 
+   update_packing_domain(cpu);
update_top_cache_domain(cpu);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2f726..021c7b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -160,6 +160,76 @@ void sched_init_granularity(void)
update_sysctl();
 }
 
+
+#ifdef CONFIG_SMP
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/*
+ * Look for the best buddy CPU that can be used to pack small tasks
+ * We make the assumption that it is not worth packing on CPUs that share the
+ * same powerline. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
+ * power per core based on the assumption that their power efficiency is
+ * better
+ */
+void update_packing_domain(int cpu)
+{
+   struct sched_domain *sd;
+   int id = -1;
+
+   sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
+   if (!sd)
+   sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+   else
+   sd = sd->parent;
+
+   while (sd && (sd->flags & SD_LOAD_BALANCE)
+   && !(sd->flags & SD_SHARE_POWERDOMAIN)) {
+   struct sched_group *sg = sd->groups;
+   struct sched_group *pack = sg;
+   struct sched_group *tmp;
+
+   /*
+* The sched_domain of a CPU points on the local sched_group
+* and the 1st CPU of this local group is a good candidate
+*/
+   id = cpumask_first(sched_group_cpus(pack));
+
+   /* loop the sched groups to find the best one */
+   for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
+   if (tmp->sgp->power * pack->group_weight >
+   pack->sgp->power * tmp->group_weight)
+   continue;
+
+   if ((tmp->sgp->power * pack->group_weight ==
+   pack->sgp->power * tmp->group_weight)
+&& (cpumask_first(sched_group_cpus(tmp)) >= id))
+   continue;
+
+   /* we have found a better group */
+   pack = tmp;
+
+   /* Take the 1st CPU of the new group */
+   id = cpumask_first(sched_group_cpus(pack));
+   }
+
+   /* Look for another CPU than itself */
+   if (id != cpu)
+   break;
+
+   sd = sd->parent;
+   }
+
+   pr_debug("CPU%d packing on CPU%d\n", cpu, id);
+   per_cpu(sd_pack_buddy, cpu) = id;
+}
+
+#endif /* CONFIG_SMP */
+
 #if BITS_PER_LONG == 32
 # define WMULT_CONST   (~0UL)
 #else
@@ -3291,6 +3361,47 @@ done:
return target;
 }
 
+static bool is_buddy_busy(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+
+   /*
+* A busy buddy is a CPU with a high load or a small load with a lot of
+* running tasks.
+*/
+   return (rq->avg.runnable_avg_sum >
+   (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}
+
+static bool is_light_task(struct task_struct *p)
+{
+   /* A light task runs less than 20% in average */
+   return ((p->se.avg.runnable_avg_sum  * 5) <
+   (p->se.avg.runnable_avg_period));
+}
+
+static int check_pack_buddy(int cpu, struct task_struct *p)
+{
+   int buddy = per_cpu(sd_pack_buddy, 
