Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-15 Thread Alex Shi
On 01/16/2013 01:43 PM, Alex Shi wrote:
> - /* while loop will break here if sd == NULL */
>>>
>>> I agree that this should be a major optimization. I just can't figure
>>> out why the existing recursive search for an idle cpu switches to the
>>> new cpu near the end and then starts a search for an idle cpu in the new
>>> cpu's domain. Is this to handle some exotic sched domain configurations?
>>> If so, they probably wouldn't work with your optimizations.
>>
>> Let me explain my understanding of why the recursive search is the way
>> it is.
>>
>>  _________________________  sd0
>> |                         |
>> |  ___sd1__   ___sd2__    |
>> | |        | |        |   |
>> | | sgx    | |  sga   |   |
>> | | sgy    | |  sgb   |   |
>> | |________| |________|   |
>> |_________________________|
>>
>> What the current recursive search is doing (assuming we start with
>> sd0, the top-level sched domain whose flags are rightly set) is: we find
>> that sd1 is the idlest group, and a cpux1 in sgx is the idlest cpu.
>>
>> We could ideally have stopped the search here. But the problem with this
>> is that there is a possibility that sgx is more loaded than sgy, meaning
>> the cpus in sgx are heavily imbalanced; say there are two cpus cpux1 and
>> cpux2 in sgx, where cpux2 is heavily loaded and cpux1 has recently gotten
>> idle and load balancing has not come to its rescue yet. According to the
>> search above, cpux1 is idle, but it is *not the right candidate for
>> scheduling the forked task; it is the right candidate for relieving the
>> load from cpux2* due to cache locality etc.
> 
> The problem still exists in the current code. It still goes to cpux1,
> then goes on to sgx to seek the idlest group ... idlest cpu, and back to
> cpux1 again. Nothing helps.
> 
> 

To resolve the problem, I tried walking the domains from the top down, but
testing shows aim9/hackbench performance is not good on our SNB EP, and
there is no change on other platforms.
---
@@ -3351,51 +3363,33 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 
 
-	while (sd) {
+	for_each_lower_domain(sd) {
 		int load_idx = sd->forkexec_idx;
-		struct sched_group *group;
-		int weight;
-
-		if (!(sd->flags & sd_flag)) {
-			sd = sd->child;
-			continue;
-		}
+		int local = 0;
 
 		if (sd_flag & SD_BALANCE_WAKE)
 			load_idx = sd->wake_idx;
 
-		group = find_idlest_group(sd, p, cpu, load_idx);
-		if (!group) {
-			sd = sd->child;
-			continue;
-		}
-
-		new_cpu = find_idlest_cpu(group, p, cpu);
-		if (new_cpu == -1 || new_cpu == cpu) {
-			/* Now try balancing at a lower domain level of cpu */
-			sd = sd->child;
+		group = find_idlest_group(sd, p, cpu, load_idx, &local);
+		if (local)
 			continue;
-		}
+		if (!group)
+			goto unlock;
 
-		/* Now try balancing at a lower domain level of new_cpu */
-		cpu = new_cpu;
-		weight = sd->span_weight;
-		sd = NULL;
-		for_each_domain(cpu, tmp) {
-			if (weight <= tmp->span_weight)
-				break;
-			if (tmp->flags & sd_flag)
+		/* go down from non-local group */
+		for_each_domain(group_first_cpu(group), tmp)
+			if (cpumask_equal(sched_domain_span(tmp),
+					  sched_group_cpus(group))) {
 				sd = tmp;
-		}
-		/* while loop will break here if sd == NULL */
+				break;
+			}
 	}
+	if (group)
+		new_cpu = find_idlest_cpu(group, p, cpu);
 unlock:
 	rcu_read_unlock();
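
For readers not familiar with the helper: for_each_lower_domain() is the
downward counterpart of for_each_domain(). A minimal sketch of its
definition, assuming the kernel/sched/sched.h of this era (check the
actual tree):

	/*
	 * Sketch: walk from the given domain down through its children,
	 * one level per loop iteration.
	 */
	#define for_each_lower_domain(sd) for (; sd; sd = sd->child)

With it, the rewritten loop descends one level per iteration; the
cpumask_equal() match above re-anchors sd at the domain whose span is
exactly the chosen non-local group before the walk continues downward.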



-- 
Thanks Alex


Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-15 Thread Alex Shi
-   /* while loop will break here if sd == NULL */
>>
>> I agree that this should be a major optimization. I just can't figure
>> out why the existing recursive search for an idle cpu switches to the
>> new cpu near the end and then starts a search for an idle cpu in the new
>> cpu's domain. Is this to handle some exotic sched domain configurations?
>> If so, they probably wouldn't work with your optimizations.
> 
> Let me explain my understanding of why the recursive search is the way
> it is.
> 
>  _________________________  sd0
> |                         |
> |  ___sd1__   ___sd2__    |
> | |        | |        |   |
> | | sgx    | |  sga   |   |
> | | sgy    | |  sgb   |   |
> | |________| |________|   |
> |_________________________|
> 
> What the current recursive search is doing (assuming we start with
> sd0, the top-level sched domain whose flags are rightly set) is: we find
> that sd1 is the idlest group, and a cpux1 in sgx is the idlest cpu.
> 
> We could ideally have stopped the search here. But the problem with this
> is that there is a possibility that sgx is more loaded than sgy, meaning
> the cpus in sgx are heavily imbalanced; say there are two cpus cpux1 and
> cpux2 in sgx, where cpux2 is heavily loaded and cpux1 has recently gotten
> idle and load balancing has not come to its rescue yet. According to the
> search above, cpux1 is idle, but it is *not the right candidate for
> scheduling the forked task; it is the right candidate for relieving the
> load from cpux2* due to cache locality etc.

The problem still exists in the current code. It still goes to cpux1,
then goes on to sgx to seek the idlest group ... idlest cpu, and back to
cpux1 again. Nothing helps.
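
To make the bounce concrete, here is the old descent traced on the
sd0/sd1/sgx topology from the diagram above (comments only, no new code):

	/*
	 * Old select_task_rq_fair() descent on the sd0/sd1/sgx example:
	 *
	 * 1. At sd0: find_idlest_group() picks sd1's span, and
	 *    find_idlest_cpu() returns cpux1 (idle, but inside the
	 *    overloaded sgx).
	 * 2. Re-anchor: cpu = cpux1; scan cpux1's domains for the widest
	 *    one narrower than sd0 that has sd_flag set -- the sd1 level.
	 * 3. At sd1: find_idlest_group() can pick sgx again (the idle
	 *    cpux1 pulls sgx's average load down), so find_idlest_cpu()
	 *    returns cpux1 once more.
	 *
	 * The extra level of search can therefore re-select the same cpu.
	 */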




Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-15 Thread li guang
On Tue, 2013-01-15 at 10:34 +0800, Alex Shi wrote:
> On 01/14/2013 05:03 PM, li guang wrote:
> >>> > > This corner case may occur after "[PATCH v3 03/22] sched: fix
> >>> > > find_idlest_group mess logical" brought in the local sched_group bias;
> >>> > > assume balancing runs on cpux2.
> >>> > > Ideally, find_idlest_group should find the real idlest (in this
> >>> > > case: sgy); then this patch is reasonable.
> >>> > > 
> >> > 
> >> > Sure, but it seems a bit hard to go down into the idlest group.
> >> > 
> >> > And the old logic really costs too much: on my 2-socket NHM/SNB
> >> > server, hackbench gains 2~5% performance, with no clear change on
> >> > kbuild/aim7 etc.
> > What about removing the local group bias?
> 
> 
> Any theoretical benefit for a non-local group? Usually, bias toward the
> local group has a cache-locality benefit.
> 

But the disadvantage is that it sometimes misses the correct balance.
If, as you said before, it mostly brings better performance, that's
fine; I have no strong statistics for a more profound analysis.


-- 
regards!
li guang




Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-14 Thread Alex Shi
On 01/14/2013 05:03 PM, li guang wrote:
>>> > > This corner case may occur after "[PATCH v3 03/22] sched: fix
>>> > > find_idlest_group mess logical" brought in the local sched_group bias;
>>> > > assume balancing runs on cpux2.
>>> > > Ideally, find_idlest_group should find the real idlest (in this case: sgy);
>>> > > then this patch is reasonable.
>>> > > 
>> > 
>> > Sure, but it seems a bit hard to go down into the idlest group.
>> > 
>> > And the old logic really costs too much: on my 2-socket NHM/SNB
>> > server, hackbench gains 2~5% performance, with no clear change on
>> > kbuild/aim7 etc.
> What about removing the local group bias?


Any theoretical benefit for a non-local group? Usually, bias toward the
local group has a cache-locality benefit.
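
For reference, the bias under discussion is the imbalance_pct check at
the end of mainline find_idlest_group(); roughly (a simplified sketch of
the unpatched fair.c of this era, with this_load, min_load and idlest
computed earlier in the function):

	/*
	 * A NULL return means "stay in the local group": a remote group
	 * must beat the local one by imbalance_pct to win the task.
	 */
	int imbalance = 100 + (sd->imbalance_pct - 100) / 2;

	if (!idlest || 100 * this_load < imbalance * min_load)
		return NULL;	/* keep the task near its current cpu */
	return idlest;

Removing the bias would mean always returning idlest; the NULL return is
what preserves the cache-locality benefit mentioned above.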

-- 
Thanks Alex


Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-14 Thread li guang
On Fri, 2013-01-11 at 22:56 +0800, Alex Shi wrote:
> On 01/11/2013 04:01 PM, li guang wrote:
> > On Fri, 2013-01-11 at 10:26 +0530, Preeti U Murthy wrote:
> >> Hi Morten, Alex
> >>
> >> On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
> >>> On Sat, Jan 05, 2013 at 08:37:34AM +0000, Alex Shi wrote:
>  Guess the bottom-up cpu search in the domain tree comes from
>  commit 3dbd5342074a1e ("sched: multilevel sbe sbf"); the purpose is
>  balancing tasks over all domain levels.
> 
>  This balancing costs a lot if there are many domains/groups in a large
>  system. And forcibly spreading tasks among different domains may cause
>  performance issues due to bad locality.
> 
>  If we remove this code, we get quick fork/exec/wake, plus better
>  balancing across the whole system, which also reduces migrations in
>  future load balancing.
> 
>  This patch increases hackbench performance by 10+% on my 4-socket
>  NHM and SNB machines.
> 
>  Signed-off-by: Alex Shi 
>  ---
>   kernel/sched/fair.c | 20 +---
>   1 file changed, 1 insertion(+), 19 deletions(-)
> 
>  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>  index ecfbf8e..895a3f4 100644
>  --- a/kernel/sched/fair.c
>  +++ b/kernel/sched/fair.c
>  @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>   goto unlock;
>   }
>   
>  -while (sd) {
>  +if (sd) {
>   int load_idx = sd->forkexec_idx;
>   struct sched_group *group;
>  -int weight;
>  -
>  -if (!(sd->flags & sd_flag)) {
>  -sd = sd->child;
>  -continue;
>  -}
>   
>   if (sd_flag & SD_BALANCE_WAKE)
>   load_idx = sd->wake_idx;
>  @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>   goto unlock;
>   
>   new_cpu = find_idlest_cpu(group, p, cpu);
>  -
>  -/* Now try balancing at a lower domain level of new_cpu */
>  -cpu = new_cpu;
>  -weight = sd->span_weight;
>  -sd = NULL;
>  -for_each_domain(cpu, tmp) {
>  -if (weight <= tmp->span_weight)
>  -break;
>  -if (tmp->flags & sd_flag)
>  -sd = tmp;
>  -}
>  -/* while loop will break here if sd == NULL */
> >>>
> >>> I agree that this should be a major optimization. I just can't figure
> >>> out why the existing recursive search for an idle cpu switches to the
> >>> new cpu near the end and then starts a search for an idle cpu in the new
> >>> cpu's domain. Is this to handle some exotic sched domain configurations?
> >>> If so, they probably wouldn't work with your optimizations.
> >>
> >> Let me explain my understanding of why the recursive search is the way
> >> it is.
> >>
> >>  _________________________  sd0
> >> |                         |
> >> |  ___sd1__   ___sd2__    |
> >> | |        | |        |   |
> >> | | sgx    | |  sga   |   |
> >> | | sgy    | |  sgb   |   |
> >> | |________| |________|   |
> >> |_________________________|
> >>
> >> What the current recursive search is doing (assuming we start with
> >> sd0, the top-level sched domain whose flags are rightly set) is: we find
> >> that sd1 is the idlest group, and a cpux1 in sgx is the idlest cpu.
> >>
> >> We could ideally have stopped the search here. But the problem with this
> >> is that there is a possibility that sgx is more loaded than sgy, meaning
> >> the cpus in sgx are heavily imbalanced; say there are two cpus cpux1 and
> >> cpux2 in sgx, where cpux2 is heavily loaded and cpux1 has recently gotten
> >> idle and load balancing has not come to its rescue yet. According to the
> >> search above, cpux1 is idle, but it is *not the right candidate for
> >> scheduling the forked task; it is the right candidate for relieving the
> >> load from cpux2* due to cache locality etc.
> > 
> > This corner case may occur after "[PATCH v3 03/22] sched: fix
> > find_idlest_group mess logical" brought in the local sched_group bias;
> > assume balancing runs on cpux2.
> > Ideally, find_idlest_group should find the real idlest (in this case: sgy);
> > then this patch is reasonable.
> > 
> 
> Sure, but it seems a bit hard to go down into the idlest group.
> 
> And the old logic really costs too much: on my 2-socket NHM/SNB
> server, hackbench gains 2~5% performance, with no clear change on
> kbuild/aim7 etc.

What about removing the local group bias?


-- 
regards!
li guang


Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-14 Thread Alex Shi
On 01/14/2013 04:55 PM, li guang wrote:
>> > > >> > -   /* while loop will break here if sd == NULL */
 > > > I agree that this should be a major optimization. I just can't figure
 > > > out why the existing recursive search for an idle cpu switches to the
 > > > new cpu near the end and then starts a search for an idle cpu in the
 > > > new cpu's domain. Is this to handle some exotic sched domain
 > > > configurations? If so, they probably wouldn't work with your
 > > > optimizations.
>>> > > 
>>> > > I did not find an odd configuration that asks for the old logic.
>>> > > 
>>> > > According to Documentation/scheduler/sched-domains.txt, maybe never:
>>> > > "A domain's span MUST be a superset of it child's span (this restriction
>>> > > could be relaxed if the need arises), and a base domain for CPU i MUST
>>> > > span at least i."  etc. etc.
>> > 
>> > The reason for my suspicion is the SD_OVERLAP flag, which has something
>> > to do with overlapping sched domains. I haven't looked into what it does
>> > or how it works. I'm just wondering if this optimization will affect the
>> > use of that flag.
> Seems it did: SD_OVERLAP will not work after this change,
> though this flag is maybe scarcely used,
> because this optimization assumes every sched domain's span
> is a superset of its child domain's span.
> Isn't it, Alex?
> 

As I understand it, overlap just means some cpu may appear in 2 or more
same-level sub-domains. If so, this change won't miss cpus.

Am I right, Peter?
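
The superset rule from sched-domains.txt that the top-down walk relies
on can be written as a small check; a hypothetical, illustration-only
sketch using only the standard cpumask/sched_domain helpers:

	/*
	 * Illustration of the invariant from
	 * Documentation/scheduler/sched-domains.txt: each domain's span
	 * contains its child's span, so walking downward never loses cpus.
	 */
	static bool sd_spans_nested(struct sched_domain *sd)
	{
		for (; sd && sd->child; sd = sd->child)
			if (!cpumask_subset(sched_domain_span(sd->child),
					    sched_domain_span(sd)))
				return false;
		return true;
	}

If that reading is right, SD_OVERLAP only lets same-level siblings share
cpus; it does not break this parent/child nesting.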

-- 
Thanks Alex


Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-14 Thread li guang
On Fri, 2013-01-11 at 10:07 +0000, Morten Rasmussen wrote:
> On Fri, Jan 11, 2013 at 02:46:31AM +0000, Alex Shi wrote:
> > On 01/10/2013 02:21 AM, Morten Rasmussen wrote:
> > >>  new_cpu = find_idlest_cpu(group, p, cpu);
> > >> > -
> > >> > -  /* Now try balancing at a lower domain level of new_cpu */
> > >> > -  cpu = new_cpu;
> > >> > -  weight = sd->span_weight;
> > >> > -  sd = NULL;
> > >> > -  for_each_domain(cpu, tmp) {
> > >> > -  if (weight <= tmp->span_weight)
> > >> > -  break;
> > >> > -  if (tmp->flags & sd_flag)
> > >> > -  sd = tmp;
> > >> > -  }
> > >> > -  /* while loop will break here if sd == NULL */
> > > I agree that this should be a major optimization. I just can't figure
> > > out why the existing recursive search for an idle cpu switches to the
> > > new cpu near the end and then starts a search for an idle cpu in the new
> > > cpu's domain. Is this to handle some exotic sched domain configurations?
> > > If so, they probably wouldn't work with your optimizations.
> > 
> > I did not find an odd configuration that asks for the old logic.
> > 
> > According to Documentation/scheduler/sched-domains.txt, maybe never:
> > "A domain's span MUST be a superset of it child's span (this restriction
> > could be relaxed if the need arises), and a base domain for CPU i MUST
> > span at least i."  etc. etc.
> 
> The reason for my suspicion is the SD_OVERLAP flag, which has something
> to do with overlapping sched domains. I haven't looked into what it does
> or how it works. I'm just wondering if this optimization will affect the
> use of that flag.

Seems it did: SD_OVERLAP will not work after this change,
though this flag is maybe scarcely used,
because this optimization assumes every sched domain's span
is a superset of its child domain's span.
Isn't it, Alex?

> 
> Morten
> 
> > 
> > 
> > -- 
> > Thanks Alex
> > 
> 

-- 
regards!
li guang




Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-11 Thread Alex Shi
On 01/11/2013 04:01 PM, li guang wrote:
> On Fri, 2013-01-11 at 10:26 +0530, Preeti U Murthy wrote:
>> Hi Morten, Alex
>>
>> On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
>>> On Sat, Jan 05, 2013 at 08:37:34AM +0000, Alex Shi wrote:
 Guess the bottom-up cpu search in the domain tree comes from
 commit 3dbd5342074a1e ("sched: multilevel sbe sbf"); the purpose is
 balancing tasks over all domain levels.

 This balancing costs a lot if there are many domains/groups in a large
 system. And forcibly spreading tasks among different domains may cause
 performance issues due to bad locality.

 If we remove this code, we get quick fork/exec/wake, plus better
 balancing across the whole system, which also reduces migrations in
 future load balancing.

 This patch increases hackbench performance by 10+% on my 4-socket
 NHM and SNB machines.

 Signed-off-by: Alex Shi 
 ---
  kernel/sched/fair.c | 20 +---
  1 file changed, 1 insertion(+), 19 deletions(-)

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index ecfbf8e..895a3f4 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
goto unlock;
}
  
 -  while (sd) {
 +  if (sd) {
int load_idx = sd->forkexec_idx;
struct sched_group *group;
 -  int weight;
 -
 -  if (!(sd->flags & sd_flag)) {
 -  sd = sd->child;
 -  continue;
 -  }
  
if (sd_flag & SD_BALANCE_WAKE)
load_idx = sd->wake_idx;
 @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
goto unlock;
  
new_cpu = find_idlest_cpu(group, p, cpu);
 -
 -  /* Now try balancing at a lower domain level of new_cpu */
 -  cpu = new_cpu;
 -  weight = sd->span_weight;
 -  sd = NULL;
 -  for_each_domain(cpu, tmp) {
 -  if (weight <= tmp->span_weight)
 -  break;
 -  if (tmp->flags & sd_flag)
 -  sd = tmp;
 -  }
 -  /* while loop will break here if sd == NULL */
>>>
>>> I agree that this should be a major optimization. I just can't figure
>>> out why the existing recursive search for an idle cpu switches to the
>>> new cpu near the end and then starts a search for an idle cpu in the new
>>> cpu's domain. Is this to handle some exotic sched domain configurations?
>>> If so, they probably wouldn't work with your optimizations.
>>
>> Let me explain my understanding of why the recursive search is the way
>> it is.
>>
>>  _________________________  sd0
>> |                         |
>> |  ___sd1__   ___sd2__    |
>> | |        | |        |   |
>> | | sgx    | |  sga   |   |
>> | | sgy    | |  sgb   |   |
>> | |________| |________|   |
>> |_________________________|
>>
>> What the current recursive search is doing (assuming we start with
>> sd0, the top-level sched domain whose flags are rightly set) is: we find
>> that sd1 is the idlest group, and a cpux1 in sgx is the idlest cpu.
>>
>> We could ideally have stopped the search here. But the problem with this
>> is that there is a possibility that sgx is more loaded than sgy, meaning
>> the cpus in sgx are heavily imbalanced; say there are two cpus cpux1 and
>> cpux2 in sgx, where cpux2 is heavily loaded and cpux1 has recently gotten
>> idle and load balancing has not come to its rescue yet. According to the
>> search above, cpux1 is idle, but it is *not the right candidate for
>> scheduling the forked task; it is the right candidate for relieving the
>> load from cpux2* due to cache locality etc.
> 
> This corner case may occur after "[PATCH v3 03/22] sched: fix
> find_idlest_group mess logical" brought in the local sched_group bias;
> assume balancing runs on cpux2.
> Ideally, find_idlest_group should find the real idlest (in this case: sgy);
> then this patch is reasonable.
> 

Sure, but it seems a bit hard to go down into the idlest group.

And the old logic really costs too much: on my 2-socket NHM/SNB
server, hackbench gains 2~5% performance, with no clear change on
kbuild/aim7 etc.

-- 
Thanks
Alex


Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-11 Thread Alex Shi
On 01/11/2013 06:07 PM, Morten Rasmussen wrote:
> On Fri, Jan 11, 2013 at 02:46:31AM +0000, Alex Shi wrote:
>> On 01/10/2013 02:21 AM, Morten Rasmussen wrote:
new_cpu = find_idlest_cpu(group, p, cpu);
> -
> - /* Now try balancing at a lower domain level of new_cpu */
> - cpu = new_cpu;
> - weight = sd->span_weight;
> - sd = NULL;
> - for_each_domain(cpu, tmp) {
> - if (weight <= tmp->span_weight)
> - break;
> - if (tmp->flags & sd_flag)
> - sd = tmp;
> - }
> - /* while loop will break here if sd == NULL */
>>> I agree that this should be a major optimization. I just can't figure
>>> out why the existing recursive search for an idle cpu switches to the
>>> new cpu near the end and then starts a search for an idle cpu in the new
>>> cpu's domain. Is this to handle some exotic sched domain configurations?
>>> If so, they probably wouldn't work with your optimizations.
>>
>> I did not find an odd configuration that asks for the old logic.
>>
>> According to Documentation/scheduler/sched-domains.txt, maybe never:
>> "A domain's span MUST be a superset of it child's span (this restriction
>> could be relaxed if the need arises), and a base domain for CPU i MUST
>> span at least i."  etc. etc.
> 
> The reason for my suspicion is the SD_OVERLAP flag, which has something
> to do with overlapping sched domains. I haven't looked into what it does
> or how it works. I'm just wondering if this optimization will affect the
> use of that flag.

I don't know of any machine that has this flag, but if some cpus merely
overlap, rather than standing alone outside any domain, the patch won't
miss an eligible cpu.
> 
> Morten
> 
>>
>>
>> -- 
>> Thanks Alex
>>
> 


-- 
Thanks
Alex


Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-11 Thread Morten Rasmussen
Hi Preeti,

On Fri, Jan 11, 2013 at 04:56:09AM +0000, Preeti U Murthy wrote:
> Hi Morten, Alex
> 
> On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
> > On Sat, Jan 05, 2013 at 08:37:34AM +0000, Alex Shi wrote:
> >> Guess the bottom-up cpu search in the domain tree comes from
> >> commit 3dbd5342074a1e ("sched: multilevel sbe sbf"); the purpose is
> >> balancing tasks over all domain levels.
> >>
> >> This balancing costs a lot if there are many domains/groups in a large
> >> system. And forcibly spreading tasks among different domains may cause
> >> performance issues due to bad locality.
> >>
> >> If we remove this code, we get quick fork/exec/wake, plus better
> >> balancing across the whole system, which also reduces migrations in
> >> future load balancing.
> >>
> >> This patch increases hackbench performance by 10+% on my 4-socket
> >> NHM and SNB machines.
> >>
> >> Signed-off-by: Alex Shi 
> >> ---
> >>  kernel/sched/fair.c | 20 +---
> >>  1 file changed, 1 insertion(+), 19 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index ecfbf8e..895a3f4 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> >>goto unlock;
> >>}
> >>  
> >> -  while (sd) {
> >> +  if (sd) {
> >>int load_idx = sd->forkexec_idx;
> >>struct sched_group *group;
> >> -  int weight;
> >> -
> >> -  if (!(sd->flags & sd_flag)) {
> >> -  sd = sd->child;
> >> -  continue;
> >> -  }
> >>  
> >>if (sd_flag & SD_BALANCE_WAKE)
> >>load_idx = sd->wake_idx;
> >> @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> >>goto unlock;
> >>  
> >>new_cpu = find_idlest_cpu(group, p, cpu);
> >> -
> >> -  /* Now try balancing at a lower domain level of new_cpu */
> >> -  cpu = new_cpu;
> >> -  weight = sd->span_weight;
> >> -  sd = NULL;
> >> -  for_each_domain(cpu, tmp) {
> >> -  if (weight <= tmp->span_weight)
> >> -  break;
> >> -  if (tmp->flags & sd_flag)
> >> -  sd = tmp;
> >> -  }
> >> -  /* while loop will break here if sd == NULL */
> > 
> > I agree that this should be a major optimization. I just can't figure
> > out why the existing recursive search for an idle cpu switches to the
> > new cpu near the end and then starts a search for an idle cpu in the new
> > cpu's domain. Is this to handle some exotic sched domain configurations?
> > If so, they probably wouldn't work with your optimizations.
> 
> Let me explain my understanding of why the recursive search is the way
> it is.
> 
>  _________________________  sd0
> |                         |
> |  ___sd1__   ___sd2__    |
> | |        | |        |   |
> | | sgx    | |  sga   |   |
> | | sgy    | |  sgb   |   |
> | |________| |________|   |
> |_________________________|
> 
> What the current recursive search is doing (assuming we start with
> sd0, the top-level sched domain whose flags are rightly set) is: we find
> that sd1 is the idlest group, and a cpux1 in sgx is the idlest cpu.
> 
> We could ideally have stopped the search here. But the problem with this
> is that there is a possibility that sgx is more loaded than sgy, meaning
> the cpus in sgx are heavily imbalanced; say there are two cpus cpux1 and
> cpux2 in sgx, where cpux2 is heavily loaded and cpux1 has recently gotten
> idle and load balancing has not come to its rescue yet. According to the
> search above, cpux1 is idle, but it is *not the right candidate for
> scheduling the forked task; it is the right candidate for relieving the
> load from cpux2* due to cache locality etc.
> 
> Therefore in the next recursive search we go one step inside sd1, the
> chosen idlest group candidate, which also happens to be the *next-level
> sched domain for cpux1, the chosen idle cpu*. It then perhaps returns
> sgy as the idlest, if the situation there happens to be better than what
> I have described for sgx, and an appropriate cpu there is chosen.
> 
> So in short, a bird's eye view of a large sched domain to choose the cpu
> would be very short-sighted; we could end up creating imbalances within
> lower-level sched domains. To avoid this, the recursive search plays safe
> and chooses the best idle group after viewing the large sched domain in
> detail.

Thanks for your explanation. I see your point that the first search may
end at a high level in the sched domain and pick a cpu in a very
unbalanced group. The extra search will then try to put things right.

This patch set removes the recursive search completely. So the overall
balance policy is changed from trying to achieve equal load across all
groups to always putting tasks on the most idle cpu regardless of the
load of 

Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-11 Thread Morten Rasmussen
On Fri, Jan 11, 2013 at 02:46:31AM +0000, Alex Shi wrote:
> On 01/10/2013 02:21 AM, Morten Rasmussen wrote:
> >>new_cpu = find_idlest_cpu(group, p, cpu);
> >> > -
> >> > -/* Now try balancing at a lower domain level of new_cpu */
> >> > -cpu = new_cpu;
> >> > -weight = sd->span_weight;
> >> > -sd = NULL;
> >> > -for_each_domain(cpu, tmp) {
> >> > -if (weight <= tmp->span_weight)
> >> > -break;
> >> > -if (tmp->flags & sd_flag)
> >> > -sd = tmp;
> >> > -}
> >> > -/* while loop will break here if sd == NULL */
> > I agree that this should be a major optimization. I just can't figure
> > out why the existing recursive search for an idle cpu switches to the
> > new cpu near the end and then starts a search for an idle cpu in the new
> > cpu's domain. Is this to handle some exotic sched domain configurations?
> > If so, they probably wouldn't work with your optimizations.
> 
> I did not find an odd configuration that asks for the old logic.
> 
> According to Documentation/scheduler/sched-domains.txt, maybe never:
> "A domain's span MUST be a superset of it child's span (this restriction
> could be relaxed if the need arises), and a base domain for CPU i MUST
> span at least i."  etc. etc.

The reason for my suspicion is the SD_OVERLAP flag, which has something
to do with overlapping sched domains. I haven't looked into what it does
or how it works. I'm just wondering if this optimization will affect the
use of that flag.

Morten

> 
> 
> -- 
> Thanks Alex
> 



Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-11 Thread li guang
On Fri, 2013-01-11 at 10:26 +0530, Preeti U Murthy wrote:
> Hi Morten, Alex
> 
> On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
> > On Sat, Jan 05, 2013 at 08:37:34AM +0000, Alex Shi wrote:
> >> Guess the bottom-up cpu search in the domain tree comes from
> >> commit 3dbd5342074a1e ("sched: multilevel sbe sbf"); the purpose is
> >> balancing tasks over all domain levels.
> >>
> >> This balancing costs a lot if there are many domains/groups in a large
> >> system. And forcibly spreading tasks among different domains may cause
> >> performance issues due to bad locality.
> >>
> >> If we remove this code, we get quick fork/exec/wake, plus better
> >> balancing across the whole system, which also reduces migrations in
> >> future load balancing.
> >>
> >> This patch increases hackbench performance by 10+% on my 4-socket
> >> NHM and SNB machines.
> >>
> >> Signed-off-by: Alex Shi 
> >> ---
> >>  kernel/sched/fair.c | 20 +---
> >>  1 file changed, 1 insertion(+), 19 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index ecfbf8e..895a3f4 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> >>goto unlock;
> >>}
> >>  
> >> -  while (sd) {
> >> +  if (sd) {
> >>int load_idx = sd->forkexec_idx;
> >>struct sched_group *group;
> >> -  int weight;
> >> -
> >> -  if (!(sd->flags & sd_flag)) {
> >> -  sd = sd->child;
> >> -  continue;
> >> -  }
> >>  
> >>if (sd_flag & SD_BALANCE_WAKE)
> >>load_idx = sd->wake_idx;
> >> @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> >>goto unlock;
> >>  
> >>new_cpu = find_idlest_cpu(group, p, cpu);
> >> -
> >> -  /* Now try balancing at a lower domain level of new_cpu */
> >> -  cpu = new_cpu;
> >> -  weight = sd->span_weight;
> >> -  sd = NULL;
> >> -  for_each_domain(cpu, tmp) {
> >> -  if (weight <= tmp->span_weight)
> >> -  break;
> >> -  if (tmp->flags & sd_flag)
> >> -  sd = tmp;
> >> -  }
> >> -  /* while loop will break here if sd == NULL */
> > 
> > I agree that this should be a major optimization. I just can't figure
> > out why the existing recursive search for an idle cpu switches to the
> > new cpu near the end and then starts a search for an idle cpu in the new
> > cpu's domain. Is this to handle some exotic sched domain configurations?
> > If so, they probably wouldn't work with your optimizations.
> 
> Let me explain my understanding of why the recursive search is the way
> it is.
> 
>  _________________________  sd0
> |                         |
> |  ___sd1__   ___sd2__    |
> | |        | |        |   |
> | | sgx    | |  sga   |   |
> | | sgy    | |  sgb   |   |
> | |________| |________|   |
> |_________________________|
> 
> What the current recursive search is doing (assuming we start with
> sd0, the top-level sched domain whose flags are rightly set) is: we find
> that sd1 is the idlest group, and a cpux1 in sgx is the idlest cpu.
> 
> We could ideally have stopped the search here. But the problem with this
> is that there is a possibility that sgx is more loaded than sgy, meaning
> the cpus in sgx are heavily imbalanced; say there are two cpus cpux1 and
> cpux2 in sgx, where cpux2 is heavily loaded and cpux1 has recently gotten
> idle and load balancing has not come to its rescue yet. According to the
> search above, cpux1 is idle, but it is *not the right candidate for
> scheduling the forked task; it is the right candidate for relieving the
> load from cpux2* due to cache locality etc.

This corner case may occur after "[PATCH v3 03/22] sched: fix
find_idlest_group mess logical" brought in the local sched_group bias;
assume balancing runs on cpux2.
Ideally, find_idlest_group should find the real idlest (in this case: sgy);
then this patch is reasonable.

> 
> Therefore in the next recursive search we go one step inside sd1, the
> chosen idlest group candidate, which also happens to be the *next-level
> sched domain for cpux1, the chosen idle cpu*. It then perhaps returns
> sgy as the idlest, if the situation there happens to be better than what
> I have described for sgx, and an appropriate cpu there is chosen.
> 
> So in short, a bird's eye view of a large sched domain to choose the cpu
> would be very short-sighted; we could end up creating imbalances within
> lower-level sched domains. To avoid this, the recursive search plays safe
> and chooses the best idle group after viewing the large sched domain in
> detail.
> 
> Therefore even I feel that this patch should be implemented after
> thorough tests.
> 
> 
> 
> > Morten
> 
> Regards
> Preeti U Murthy
> 

Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-11 Thread li guang
在 2013-01-11五的 10:26 +0530,Preeti U Murthy写道:
 Hi Morten,Alex
 
 On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
  On Sat, Jan 05, 2013 at 08:37:34AM +, Alex Shi wrote:
  Guess the search cpu from bottom to up in domain tree come from
  commit 3dbd5342074a1e sched: multilevel sbe sbf, the purpose is
  balancing over tasks on all level domains.
 
  This balancing cost much if there has many domain/groups in a large
  system. And force spreading task among different domains may cause
  performance issue due to bad locality.
 
  If we remove this code, we will get quick fork/exec/wake, plus better
  balancing among whole system, that also reduce migrations in future
  load balancing.
 
  This patch increases 10+% performance of hackbench on my 4 sockets
  NHM and SNB machines.
 
  Signed-off-by: Alex Shi alex@intel.com
  ---
   kernel/sched/fair.c | 20 +---
   1 file changed, 1 insertion(+), 19 deletions(-)
 
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index ecfbf8e..895a3f4 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
  @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int 
  sd_flag, int wake_flags)
 goto unlock;
 }
   
  -  while (sd) {
  +  if (sd) {
 int load_idx = sd-forkexec_idx;
 struct sched_group *group;
  -  int weight;
  -
  -  if (!(sd-flags  sd_flag)) {
  -  sd = sd-child;
  -  continue;
  -  }
   
 if (sd_flag  SD_BALANCE_WAKE)
 load_idx = sd-wake_idx;
  @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int 
  sd_flag, int wake_flags)
 goto unlock;
   
 new_cpu = find_idlest_cpu(group, p, cpu);
  -
  -  /* Now try balancing at a lower domain level of new_cpu */
  -  cpu = new_cpu;
  -  weight = sd-span_weight;
  -  sd = NULL;
  -  for_each_domain(cpu, tmp) {
  -  if (weight = tmp-span_weight)
  -  break;
  -  if (tmp-flags  sd_flag)
  -  sd = tmp;
  -  }
  -  /* while loop will break here if sd == NULL */
  
  I agree that this should be a major optimization. I just can't figure
  out why the existing recursive search for an idle cpu switches to the
  new cpu near the end and then starts a search for an idle cpu in the new
  cpu's domain. Is this to handle some exotic sched domain configurations?
  If so, they probably wouldn't work with your optimizations.
 
 Let me explain my understanding of why the recursive search is the way
 it is.
 
  _  sd0
 | |
 |  ___sd1__   ___sd2__|
 | || ||   |
 | | sgx| |  sga   |   |
 | | sgy| |  sgb   |   |
 | || ||   |
 |_|
 
 What the current recursive search is doing is (assuming we start with
 sd0-the top level sched domain whose flags are rightly set). we find
 that sd1 is the idlest group,and a cpux1 in sgx is the idlest cpu.
 
 We could have ideally stopped the search here.But the problem with this
 is that there is a possibility that sgx is more loaded than sgy; meaning
 the cpus in sgx are heavily imbalanced;say there are two cpus cpux1 and
 cpux2 in sgx,where cpux2 is heavily loaded and cpux1 has recently gotten
 idle and load balancing has not come to its rescue yet.According to the
 search above, cpux1 is idle,but is *not the right candidate for
 scheduling forked task,it is the right candidate for relieving the load
 from cpux2* due to cache locality etc.

This corner case may occur after [PATCH v3 03/22] sched: fix
find_idlest_group mess logical brought in the local sched_group bias,
and assume balancing runs on cpux2.
ideally,  find_idlest_group should find the real idlest(this case: sgy),
then, this patch is reasonable.

 
 Therefore in the next recursive search we go one step inside sd1-the
 chosen idlest group candidate,which also happens to be the *next level
 sched domain for cpux1-the chosen idle cpu*. It then returns sgy as the
 idlest perhaps,if the situation happens to be better than what i have
 described for sgx and an appropriate cpu there is chosen.
 
 So in short a bird's eye view of a large sched domain to choose the cpu
 would be very short sighted,we could end up creating imbalances within
 lower level sched domains.To avoid this the recursive search plays safe
 and chooses the best idle group after viewing the large sched domain in
 detail.
 
 Therefore even i feel that this patch should be implemented after
 thorough tests.
 
 
 
  Morten
 
 Regards
 Preeti U Murthy
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/

-- 
regards!
li guang


Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-11 Thread Morten Rasmussen
On Fri, Jan 11, 2013 at 02:46:31AM +, Alex Shi wrote:
 On 01/10/2013 02:21 AM, Morten Rasmussen wrote:
 new_cpu = find_idlest_cpu(group, p, cpu);
   -
   -/* Now try balancing at a lower domain level of new_cpu 
   */
   -cpu = new_cpu;
   -weight = sd-span_weight;
   -sd = NULL;
   -for_each_domain(cpu, tmp) {
   -if (weight = tmp-span_weight)
   -break;
   -if (tmp-flags  sd_flag)
   -sd = tmp;
   -}
   -/* while loop will break here if sd == NULL */
  I agree that this should be a major optimization. I just can't figure
  out why the existing recursive search for an idle cpu switches to the
  new cpu near the end and then starts a search for an idle cpu in the new
  cpu's domain. Is this to handle some exotic sched domain configurations?
  If so, they probably wouldn't work with your optimizations.
 
 I did not find odd configuration that asking for old logical.
 
 According to Documentation/scheduler/sched-domains.txt, Maybe never.
 A domain's span MUST be a superset of it child's span (this restriction
 could be relaxed if the need arises), and a base domain for CPU i MUST
 span at least i.  etc. etc.

The reason for my suspicion is the SD_OVERLAP flag, which has something
to do overlapping sched domains. I haven't looked into what it does or
how it works. I'm just wondering if this optimization will affect the
use of that flag.

Morten

 
 
 -- 
 Thanks Alex
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-11 Thread Morten Rasmussen
Hi Preeti,

On Fri, Jan 11, 2013 at 04:56:09AM +, Preeti U Murthy wrote:
 Hi Morten,Alex
 
 On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
  On Sat, Jan 05, 2013 at 08:37:34AM +, Alex Shi wrote:
  I guess the bottom-up search for a cpu in the domain tree comes from
  commit 3dbd5342074a1e ("sched: multilevel sbe sbf"); the purpose is
  balancing tasks over all domain levels.
 
  This balancing costs a lot if there are many domains/groups in a
  large system, and forcibly spreading tasks among different domains may
  cause performance issues due to bad locality.
 
  If we remove this code, we get quicker fork/exec/wake, plus better
  balancing across the whole system, which also reduces migrations in
  future load balancing.
 
  This patch improves hackbench performance by 10+% on my 4 socket
  NHM and SNB machines.
 
  Signed-off-by: Alex Shi alex@intel.com
  ---
   kernel/sched/fair.c | 20 +---
   1 file changed, 1 insertion(+), 19 deletions(-)
 
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index ecfbf8e..895a3f4 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
  @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int 
  sd_flag, int wake_flags)
 goto unlock;
 }
   
  -	while (sd) {
  +	if (sd) {
  		int load_idx = sd->forkexec_idx;
  		struct sched_group *group;
  -		int weight;
  -
  -		if (!(sd->flags & sd_flag)) {
  -			sd = sd->child;
  -			continue;
  -		}
  
  		if (sd_flag & SD_BALANCE_WAKE)
  			load_idx = sd->wake_idx;
  @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int
  sd_flag, int wake_flags)
  		goto unlock;
  
  		new_cpu = find_idlest_cpu(group, p, cpu);
  -
  -		/* Now try balancing at a lower domain level of new_cpu */
  -		cpu = new_cpu;
  -		weight = sd->span_weight;
  -		sd = NULL;
  -		for_each_domain(cpu, tmp) {
  -			if (weight <= tmp->span_weight)
  -				break;
  -			if (tmp->flags & sd_flag)
  -				sd = tmp;
  -		}
  -		/* while loop will break here if sd == NULL */
  
  I agree that this should be a major optimization. I just can't figure
  out why the existing recursive search for an idle cpu switches to the
  new cpu near the end and then starts a search for an idle cpu in the new
  cpu's domain. Is this to handle some exotic sched domain configurations?
  If so, they probably wouldn't work with your optimizations.
 
 Let me explain my understanding of why the recursive search is the way
 it is.
 
  _  sd0
 | |
 |  ___sd1__   ___sd2__|
 | || ||   |
 | | sgx| |  sga   |   |
 | | sgy| |  sgb   |   |
 | || ||   |
 |_|
 
 What the current recursive search is doing is this (assuming we start
 with sd0, the top level sched domain whose flags are rightly set): we
 find that sd1 is the idlest group, and a cpu, cpux1, in sgx is the
 idlest cpu.
 
 We could ideally have stopped the search here. But the problem with
 this is that there is a possibility that sgx is more loaded than sgy,
 meaning the cpus in sgx are heavily imbalanced; say there are two cpus,
 cpux1 and cpux2, in sgx, where cpux2 is heavily loaded and cpux1 has
 recently gone idle and load balancing has not come to its rescue yet.
 According to the search above, cpux1 is idle, but it is *not the right
 candidate for scheduling the forked task; it is the right candidate for
 relieving the load from cpux2* due to cache locality etc.
 
 Therefore in the next recursive search we go one step inside sd1, the
 chosen idlest group candidate, which also happens to be the *next level
 sched domain for cpux1, the chosen idle cpu*. It then perhaps returns
 sgy as the idlest, if the situation there happens to be better than
 what I have described for sgx, and an appropriate cpu there is chosen.
 
 So in short, a bird's eye view of a large sched domain to choose the
 cpu would be very short sighted; we could end up creating imbalances
 within lower level sched domains. To avoid this, the recursive search
 plays safe and chooses the best idle group after viewing the large
 sched domain in detail.

Thanks for your explanation. I see your point that the first search may
end at a high level in the sched domain hierarchy and pick a cpu in a
very unbalanced group. The extra search will then try to put things
right.

This patch set removes the recursive search completely. So the overall
balance policy changes from trying to achieve equal load across all
groups to always putting tasks on the most idle cpu, regardless of the
load of its group.

I'm not sure if this is a good or bad move. It is quicker.
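
To make the policy change concrete, here is a hedged user-space sketch
of both strategies over the two-level toy hierarchy from Preeti's
diagram. The cpu layout and load numbers are invented, and the two pick
functions only model the idea, not select_task_rq_fair itself:

	#include <stdio.h>

	/*
	 * Toy hierarchy: sd1 = {cpu0..cpu3} with groups sgx = {cpu0,cpu1}
	 * and sgy = {cpu2,cpu3}; sd2 = {cpu4..cpu7}. Loads are arbitrary.
	 */
	static const int load[8] = { 0, 90, 20, 25, 50, 50, 50, 50 };

	/* Sum of loads over cpus [first, first + n). */
	static int span_load(int first, int n)
	{
		int sum = 0;

		for (int c = first; c < first + n; c++)
			sum += load[c];
		return sum;
	}

	/* Least loaded cpu in [first, first + n). */
	static int idlest_cpu(int first, int n)
	{
		int best = first;

		for (int c = first; c < first + n; c++)
			if (load[c] < load[best])
				best = c;
		return best;
	}

	int main(void)
	{
		/* Both policies first pick the idler of sd1 and sd2. */
		int first = span_load(0, 4) <= span_load(4, 4) ? 0 : 4;

		/* New policy: take the idlest cpu of that span and stop. */
		printf("one-level pick: cpu%d\n", idlest_cpu(first, 4));

		/* Old policy: recurse one level, comparing sgx against sgy. */
		if (span_load(first, 2) > span_load(first + 2, 2))
			first += 2;
		printf("recursive pick: cpu%d\n", idlest_cpu(first, 2));
		return 0;
	}

With these numbers the one-level pick lands on cpu0 (the idle cpu inside
the imbalanced sgx), while the recursive pick descends into sgy and
returns cpu2, which is exactly the trade-off described above.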

Regards,
Morten

 
 Therefore even I feel that this patch should be implemented after
 thorough tests.
 
 
 
  Morten
 
 Regards
 Preeti U Murthy
 
 

--

Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-11 Thread Alex Shi
On 01/11/2013 06:07 PM, Morten Rasmussen wrote:
 On Fri, Jan 11, 2013 at 02:46:31AM +, Alex Shi wrote:
 On 01/10/2013 02:21 AM, Morten Rasmussen wrote:
	new_cpu = find_idlest_cpu(group, p, cpu);
 -
 -	/* Now try balancing at a lower domain level of new_cpu */
 -	cpu = new_cpu;
 -	weight = sd->span_weight;
 -	sd = NULL;
 -	for_each_domain(cpu, tmp) {
 -		if (weight <= tmp->span_weight)
 -			break;
 -		if (tmp->flags & sd_flag)
 -			sd = tmp;
 -	}
 -	/* while loop will break here if sd == NULL */
 I agree that this should be a major optimization. I just can't figure
 out why the existing recursive search for an idle cpu switches to the
 new cpu near the end and then starts a search for an idle cpu in the new
 cpu's domain. Is this to handle some exotic sched domain configurations?
 If so, they probably wouldn't work with your optimizations.

 I did not find an odd configuration that asks for the old logic.
 
 According to Documentation/scheduler/sched-domains.txt, maybe never:
 "A domain's span MUST be a superset of it child's span (this restriction
 could be relaxed if the need arises), and a base domain for CPU i MUST
 span at least i."  etc. etc.
 
 The reason for my suspicion is the SD_OVERLAP flag, which has something
 to do with overlapping sched domains. I haven't looked into what it
 does or how it works. I'm just wondering if this optimization will
 affect the use of that flag.

I don't know of any machine that has this flag, but as long as the
overlapping cpus do not stand alone outside every domain, the patch
won't miss an eligible cpu.
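
As a rough illustration of that coverage argument (the masks below are
invented and say nothing about how SD_OVERLAP is actually implemented),
a group-level search still reaches every cpu as long as the union of
the overlapping groups covers the domain's span:

	#include <stdio.h>

	typedef unsigned int cpumask_t;	/* one bit per cpu, toy version */

	int main(void)
	{
		/* Domain spanning cpu0..cpu4, with two overlapping groups:
		 * {cpu0,cpu1,cpu2} and {cpu2,cpu3,cpu4}; cpu2 sits in both. */
		cpumask_t domain_span = 0x1f;
		cpumask_t groups[] = { 0x07, 0x1c };
		cpumask_t covered = 0;

		for (unsigned int i = 0; i < sizeof(groups) / sizeof(groups[0]); i++)
			covered |= groups[i];

		/* No cpu stands outside every group, so none can be missed. */
		printf("all cpus reachable: %s\n",
		       covered == domain_span ? "yes" : "no");
		return 0;
	}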
 
 Morten
 


 -- 
 Thanks Alex

 


-- 
Thanks
Alex
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-11 Thread Alex Shi
On 01/11/2013 04:01 PM, li guang wrote:
 On Fri, 2013-01-11 at 10:26 +0530, Preeti U Murthy wrote:
 Hi Morten,Alex

 On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
 On Sat, Jan 05, 2013 at 08:37:34AM +, Alex Shi wrote:
 I guess the bottom-up search for a cpu in the domain tree comes from
 commit 3dbd5342074a1e ("sched: multilevel sbe sbf"); the purpose is
 balancing tasks over all domain levels.
 
 This balancing costs a lot if there are many domains/groups in a
 large system, and forcibly spreading tasks among different domains may
 cause performance issues due to bad locality.
 
 If we remove this code, we get quicker fork/exec/wake, plus better
 balancing across the whole system, which also reduces migrations in
 future load balancing.
 
 This patch improves hackbench performance by 10+% on my 4 socket
 NHM and SNB machines.

 Signed-off-by: Alex Shi alex@intel.com
 ---
  kernel/sched/fair.c | 20 +---
  1 file changed, 1 insertion(+), 19 deletions(-)

 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index ecfbf8e..895a3f4 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int 
 sd_flag, int wake_flags)
goto unlock;
}
  
 -	while (sd) {
 +	if (sd) {
 		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
 -		int weight;
 -
 -		if (!(sd->flags & sd_flag)) {
 -			sd = sd->child;
 -			continue;
 -		}
 
 		if (sd_flag & SD_BALANCE_WAKE)
 			load_idx = sd->wake_idx;
 @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int
 sd_flag, int wake_flags)
 		goto unlock;
 
 		new_cpu = find_idlest_cpu(group, p, cpu);
 -
 -		/* Now try balancing at a lower domain level of new_cpu */
 -		cpu = new_cpu;
 -		weight = sd->span_weight;
 -		sd = NULL;
 -		for_each_domain(cpu, tmp) {
 -			if (weight <= tmp->span_weight)
 -				break;
 -			if (tmp->flags & sd_flag)
 -				sd = tmp;
 -		}
 -		/* while loop will break here if sd == NULL */

 I agree that this should be a major optimization. I just can't figure
 out why the existing recursive search for an idle cpu switches to the
 new cpu near the end and then starts a search for an idle cpu in the new
 cpu's domain. Is this to handle some exotic sched domain configurations?
 If so, they probably wouldn't work with your optimizations.

 Let me explain my understanding of why the recursive search is the way
 it is.

  _  sd0
 | |
 |  ___sd1__   ___sd2__|
 | || ||   |
 | | sgx| |  sga   |   |
 | | sgy| |  sgb   |   |
 | || ||   |
 |_|

 What the current recursive search is doing is this (assuming we start
 with sd0, the top level sched domain whose flags are rightly set): we
 find that sd1 is the idlest group, and a cpu, cpux1, in sgx is the
 idlest cpu.
 
 We could ideally have stopped the search here. But the problem with
 this is that there is a possibility that sgx is more loaded than sgy,
 meaning the cpus in sgx are heavily imbalanced; say there are two cpus,
 cpux1 and cpux2, in sgx, where cpux2 is heavily loaded and cpux1 has
 recently gone idle and load balancing has not come to its rescue yet.
 According to the search above, cpux1 is idle, but it is *not the right
 candidate for scheduling the forked task; it is the right candidate for
 relieving the load from cpux2* due to cache locality etc.
 
 This corner case may occur after [PATCH v3 03/22] "sched: fix
 find_idlest_group mess logical" brought in the local sched_group bias;
 assume balancing runs on cpux2.
 Ideally, find_idlest_group should find the real idlest group (in this
 case sgy); then this patch is reasonable.
 

Sure, but it seems a bit hard to keep going down into the idlest group.

And the old logic really costs too much: on my 2 socket NHM/SNB
servers, removing it increases hackbench performance by 2~5%, with no
clear change on kbuild/aim7 etc.

-- 
Thanks
Alex
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-10 Thread Preeti U Murthy
Hi Morten,Alex

On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
> On Sat, Jan 05, 2013 at 08:37:34AM +, Alex Shi wrote:
>> I guess the bottom-up search for a cpu in the domain tree comes from
>> commit 3dbd5342074a1e ("sched: multilevel sbe sbf"); the purpose is
>> balancing tasks over all domain levels.
>>
>> This balancing costs a lot if there are many domains/groups in a
>> large system, and forcibly spreading tasks among different domains may
>> cause performance issues due to bad locality.
>>
>> If we remove this code, we get quicker fork/exec/wake, plus better
>> balancing across the whole system, which also reduces migrations in
>> future load balancing.
>>
>> This patch improves hackbench performance by 10+% on my 4 socket
>> NHM and SNB machines.
>>
>> Signed-off-by: Alex Shi 
>> ---
>>  kernel/sched/fair.c | 20 +---
>>  1 file changed, 1 insertion(+), 19 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ecfbf8e..895a3f4 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int 
>> sd_flag, int wake_flags)
>>  goto unlock;
>>  }
>>  
>> -while (sd) {
>> +if (sd) {
>>  int load_idx = sd->forkexec_idx;
>>  struct sched_group *group;
>> -int weight;
>> -
>> -if (!(sd->flags & sd_flag)) {
>> -sd = sd->child;
>> -continue;
>> -}
>>  
>>  if (sd_flag & SD_BALANCE_WAKE)
>>  load_idx = sd->wake_idx;
>> @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int 
>> sd_flag, int wake_flags)
>>  goto unlock;
>>  
>>  new_cpu = find_idlest_cpu(group, p, cpu);
>> -
>> -/* Now try balancing at a lower domain level of new_cpu */
>> -cpu = new_cpu;
>> -weight = sd->span_weight;
>> -sd = NULL;
>> -for_each_domain(cpu, tmp) {
>> -if (weight <= tmp->span_weight)
>> -break;
>> -if (tmp->flags & sd_flag)
>> -sd = tmp;
>> -}
>> -/* while loop will break here if sd == NULL */
> 
> I agree that this should be a major optimization. I just can't figure
> out why the existing recursive search for an idle cpu switches to the
> new cpu near the end and then starts a search for an idle cpu in the new
> cpu's domain. Is this to handle some exotic sched domain configurations?
> If so, they probably wouldn't work with your optimizations.

Let me explain my understanding of why the recursive search is the way
it is.

 _  sd0
| |
|  ___sd1__   ___sd2__|
| || ||   |
| | sgx| |  sga   |   |
| | sgy| |  sgb   |   |
| || ||   |
|_|

What the current recursive search is doing is this (assuming we start
with sd0, the top level sched domain whose flags are rightly set): we
find that sd1 is the idlest group, and a cpu, cpux1, in sgx is the
idlest cpu.

We could ideally have stopped the search here. But the problem with
this is that there is a possibility that sgx is more loaded than sgy,
meaning the cpus in sgx are heavily imbalanced; say there are two cpus,
cpux1 and cpux2, in sgx, where cpux2 is heavily loaded and cpux1 has
recently gone idle and load balancing has not come to its rescue yet.
According to the search above, cpux1 is idle, but it is *not the right
candidate for scheduling the forked task; it is the right candidate for
relieving the load from cpux2* due to cache locality etc.

Therefore in the next recursive search we go one step inside sd1, the
chosen idlest group candidate, which also happens to be the *next level
sched domain for cpux1, the chosen idle cpu*. It then perhaps returns
sgy as the idlest, if the situation there happens to be better than
what I have described for sgx, and an appropriate cpu there is chosen.

So in short, a bird's eye view of a large sched domain to choose the
cpu would be very short sighted; we could end up creating imbalances
within lower level sched domains. To avoid this, the recursive search
plays safe and chooses the best idle group after viewing the large
sched domain in detail.

Therefore even I feel that this patch should be implemented after
thorough tests.



> Morten

Regards
Preeti U Murthy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-10 Thread Alex Shi
On 01/10/2013 02:21 AM, Morten Rasmussen wrote:
>>  new_cpu = find_idlest_cpu(group, p, cpu);
>> > -
>> > -  /* Now try balancing at a lower domain level of new_cpu */
>> > -  cpu = new_cpu;
>> > -  weight = sd->span_weight;
>> > -  sd = NULL;
>> > -  for_each_domain(cpu, tmp) {
>> > -  if (weight <= tmp->span_weight)
>> > -  break;
>> > -  if (tmp->flags & sd_flag)
>> > -  sd = tmp;
>> > -  }
>> > -  /* while loop will break here if sd == NULL */
> I agree that this should be a major optimization. I just can't figure
> out why the existing recursive search for an idle cpu switches to the
> new cpu near the end and then starts a search for an idle cpu in the new
> cpu's domain. Is this to handle some exotic sched domain configurations?
> If so, they probably wouldn't work with your optimizations.

I did not find an odd configuration that asks for the old logic.

According to Documentation/scheduler/sched-domains.txt, maybe never:
"A domain's span MUST be a superset of it child's span (this restriction
could be relaxed if the need arises), and a base domain for CPU i MUST
span at least i."  etc. etc.
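
The superset rule quoted above is easy to state as a check over cpu
masks; a minimal sketch with an invented 4-cpu hierarchy (illustration
only, not the kernel's sched domain code):

	#include <stdio.h>

	typedef unsigned int cpumask_t;	/* one bit per cpu, toy version */

	struct domain {
		const char *name;
		cpumask_t span;
		int parent;		/* index into the array, -1 at the top */
	};

	int main(void)
	{
		/* Top domain over cpu0..cpu3, with two 2-cpu child domains. */
		struct domain d[] = {
			{ "top", 0x0f, -1 },
			{ "mc0", 0x03,  0 },
			{ "mc1", 0x0c,  0 },
		};

		for (int i = 0; i < 3; i++) {
			if (d[i].parent < 0)
				continue;
			/* a child's span must be a subset of its parent's */
			printf("%s: %s\n", d[i].name,
			       (d[i].span & ~d[d[i].parent].span) == 0 ?
			       "ok" : "violates the rule");
		}
		return 0;
	}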


-- 
Thanks Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-09 Thread Morten Rasmussen
On Sat, Jan 05, 2013 at 08:37:34AM +, Alex Shi wrote:
> I guess the bottom-up search for a cpu in the domain tree comes from
> commit 3dbd5342074a1e ("sched: multilevel sbe sbf"); the purpose is
> balancing tasks over all domain levels.
> 
> This balancing costs a lot if there are many domains/groups in a
> large system, and forcibly spreading tasks among different domains may
> cause performance issues due to bad locality.
> 
> If we remove this code, we get quicker fork/exec/wake, plus better
> balancing across the whole system, which also reduces migrations in
> future load balancing.
> 
> This patch improves hackbench performance by 10+% on my 4 socket
> NHM and SNB machines.
> 
> Signed-off-by: Alex Shi 
> ---
>  kernel/sched/fair.c | 20 +---
>  1 file changed, 1 insertion(+), 19 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ecfbf8e..895a3f4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int 
> sd_flag, int wake_flags)
>   goto unlock;
>   }
>  
> - while (sd) {
> + if (sd) {
>   int load_idx = sd->forkexec_idx;
>   struct sched_group *group;
> - int weight;
> -
> - if (!(sd->flags & sd_flag)) {
> - sd = sd->child;
> - continue;
> - }
>  
>   if (sd_flag & SD_BALANCE_WAKE)
>   load_idx = sd->wake_idx;
> @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int 
> sd_flag, int wake_flags)
>   goto unlock;
>  
>   new_cpu = find_idlest_cpu(group, p, cpu);
> -
> - /* Now try balancing at a lower domain level of new_cpu */
> - cpu = new_cpu;
> - weight = sd->span_weight;
> - sd = NULL;
> - for_each_domain(cpu, tmp) {
> - if (weight <= tmp->span_weight)
> - break;
> - if (tmp->flags & sd_flag)
> - sd = tmp;
> - }
> - /* while loop will break here if sd == NULL */

I agree that this should be a major optimization. I just can't figure
out why the existing recursive search for an idle cpu switches to the
new cpu near the end and then starts a search for an idle cpu in the new
cpu's domain. Is this to handle some exotic sched domain configurations?
If so, they probably wouldn't work with your optimizations.

Morten

>   }
>  unlock:
>   rcu_read_unlock();
> -- 
> 1.7.12
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-05 Thread Alex Shi
I guess the bottom-up search for a cpu in the domain tree comes from
commit 3dbd5342074a1e ("sched: multilevel sbe sbf"); the purpose is
balancing tasks over all domain levels.

This balancing costs a lot if there are many domains/groups in a
large system, and forcibly spreading tasks among different domains may
cause performance issues due to bad locality.

If we remove this code, we get quicker fork/exec/wake, plus better
balancing across the whole system, which also reduces migrations in
future load balancing.

This patch improves hackbench performance by 10+% on my 4 socket
NHM and SNB machines.

Signed-off-by: Alex Shi 
---
 kernel/sched/fair.c | 20 +---
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecfbf8e..895a3f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, 
int wake_flags)
goto unlock;
}
 
-   while (sd) {
+   if (sd) {
int load_idx = sd->forkexec_idx;
struct sched_group *group;
-   int weight;
-
-   if (!(sd->flags & sd_flag)) {
-   sd = sd->child;
-   continue;
-   }
 
if (sd_flag & SD_BALANCE_WAKE)
load_idx = sd->wake_idx;
@@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, 
int wake_flags)
goto unlock;
 
new_cpu = find_idlest_cpu(group, p, cpu);
-
-   /* Now try balancing at a lower domain level of new_cpu */
-   cpu = new_cpu;
-   weight = sd->span_weight;
-   sd = NULL;
-   for_each_domain(cpu, tmp) {
-   if (weight <= tmp->span_weight)
-   break;
-   if (tmp->flags & sd_flag)
-   sd = tmp;
-   }
-   /* while loop will break here if sd == NULL */
}
 unlock:
rcu_read_unlock();
-- 
1.7.12

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake

2013-01-05 Thread Alex Shi
Guess the search cpu from bottom to up in domain tree come from
commit 3dbd5342074a1e sched: multilevel sbe sbf, the purpose is
balancing over tasks on all level domains.

This balancing cost much if there has many domain/groups in a large
system. And force spreading task among different domains may cause
performance issue due to bad locality.

If we remove this code, we will get quick fork/exec/wake, plus better
balancing among whole system, that also reduce migrations in future
load balancing.

This patch increases 10+% performance of hackbench on my 4 sockets
NHM and SNB machines.

Signed-off-by: Alex Shi alex@intel.com
---
 kernel/sched/fair.c | 20 +---
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecfbf8e..895a3f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, 
int wake_flags)
goto unlock;
}
 
-   while (sd) {
+   if (sd) {
int load_idx = sd-forkexec_idx;
struct sched_group *group;
-   int weight;
-
-   if (!(sd-flags  sd_flag)) {
-   sd = sd-child;
-   continue;
-   }
 
if (sd_flag  SD_BALANCE_WAKE)
load_idx = sd-wake_idx;
@@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, 
int wake_flags)
goto unlock;
 
new_cpu = find_idlest_cpu(group, p, cpu);
-
-   /* Now try balancing at a lower domain level of new_cpu */
-   cpu = new_cpu;
-   weight = sd-span_weight;
-   sd = NULL;
-   for_each_domain(cpu, tmp) {
-   if (weight = tmp-span_weight)
-   break;
-   if (tmp-flags  sd_flag)
-   sd = tmp;
-   }
-   /* while loop will break here if sd == NULL */
}
 unlock:
rcu_read_unlock();
-- 
1.7.12

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/