Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
On 01/16/2013 11:08 PM, Morten Rasmussen wrote:
> On Wed, Jan 16, 2013 at 07:32:49AM +, Alex Shi wrote:
>> On 01/15/2013 01:00 AM, Morten Rasmussen wrote:
>>>>> Why multiply rq->util by nr_running?
>>>>>
>>>>> Let's take an example where rq->util = 50, nr_running = 2, and putil =
>>>>> 10. In this case the value of putil doesn't really matter as vacancy
>>>>> would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
>>>>> However, with rq->util = 50 there should be plenty of spare cpu time to
>>>>> take another task.
>>>>
>>>> for this example, the util is not full maybe because it was just woken
>>>> up, and it is still possible that it wants to run full time. So I try
>>>> to give it the large guessed load.
>>>
>>> I don't see why rq->util should be treated differently depending on the
>>> number of tasks causing the load. rq->util = 50 means that the cpu is
>>> busy about 50% of the time no matter how many tasks contribute to that
>>> load.
>>>
>>> If nr_running = 1 instead in my example, you would consider the cpu
>>> vacant if putil = 6, but if nr_running > 1 you would not. Why should the
>>> two scenarios be treated differently?
>>>
>>>>> Also, why multiply putil by 8? rq->util must be very close to 0 for
>>>>> vacancy to be positive if putil is close to 12 (12.5%).
>>>>
>>>> I just want to pack small-util tasks, since packing can possibly hurt
>>>> performance.
>>>
>>> I agree that packing may affect performance. But why don't you reduce
>>> FULL_UTIL instead of multiplying by 8? With the current expression you
>>> will not pack a 10% task if rq->util = 20 and nr_running = 1, but you
>>> would pack a 6% task even if rq->util = 50 and the resulting cpu load
>>> is much higher.
>>
>> Yes, the threshold has no strong theory or experiment to support it. I
>> had tried cyclictest, which Vincent used, but that case's load avg is
>> too small to be caught, so I just use half of Vincent's value: 12.5%.
>> If you have a more reasonable value, let me know.
>>
>> As to engaging nr_running as a multiplier, it is based on 2 reasons.
>> 1, load avg/util needs 345ms to accumulate to 100%. So if a task is
>> eating full cpu time, it can still spend 345ms with rq->util < 1.
>
> I agree that load avg may not be accurate, especially for new tasks. But
> why use it if you don't trust its value anyway?
>
> The load avg (sum/period) of a new task will reach 100% instantly if the
> task is consuming all the cpu time it can get. An old task can reach 50%
> within 32ms. So you should fairly quickly be able to see if it is a
> light task or not. You may under-estimate its load in the beginning, but
> only for a very short time.

This packing is done at wakeup, so there is not even 'a very short time'
here. :)

>> 2, if there are more tasks, like 2 tasks running on one cpu, it is
>> possible to have the capacity to burn 200% cpu time, while the biggest
>> rq->util is still 100%.
>
> If you want a better metric for how much cpu time the tasks on the
> runqueue could potentially use, I would suggest using
> cfs_rq->runnable_load_avg, which is the load_avg_contrib sum of all
> tasks on the runqueue. It would give you 200% in your example above.

runnable_load_avg also needs much time to accumulate its value, so it is
not better than util.

> On the other hand, I think rq->util is fine for this purpose. If
> rq->util < 100% you know for sure that the cpu is not fully utilized no
> matter how many tasks you have on the runqueue. So as long as rq->util
> is well below 100% (like < 50%) it should be safe to pack more small
> tasks on that cpu even if it has multiple tasks running already.
>
>> Considering that figuring out precise utils is complicated and costs
>> much, I do this simple calculation. It is not very precise, but it is
>> efficient and biased more toward performance.
>
> It is indeed very biased towards performance. I would prefer more focus
> on saving power in a power scheduling policy :)

Agreed, and I don't refuse to change the criteria for power. :) But
without reliable benchmarks or data, everything is a guess.
--
Thanks
    Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
On Wed, Jan 16, 2013 at 07:32:49AM +, Alex Shi wrote:
> On 01/15/2013 01:00 AM, Morten Rasmussen wrote:
>>>> Why multiply rq->util by nr_running?
>>>>
>>>> Let's take an example where rq->util = 50, nr_running = 2, and putil =
>>>> 10. In this case the value of putil doesn't really matter as vacancy
>>>> would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
>>>> However, with rq->util = 50 there should be plenty of spare cpu time to
>>>> take another task.
>>>
>>> for this example, the util is not full maybe because it was just woken
>>> up, and it is still possible that it wants to run full time. So I try
>>> to give it the large guessed load.
>>
>> I don't see why rq->util should be treated differently depending on the
>> number of tasks causing the load. rq->util = 50 means that the cpu is
>> busy about 50% of the time no matter how many tasks contribute to that
>> load.
>>
>> If nr_running = 1 instead in my example, you would consider the cpu
>> vacant if putil = 6, but if nr_running > 1 you would not. Why should the
>> two scenarios be treated differently?
>>
>>>> Also, why multiply putil by 8? rq->util must be very close to 0 for
>>>> vacancy to be positive if putil is close to 12 (12.5%).
>>>
>>> I just want to pack small-util tasks, since packing can possibly hurt
>>> performance.
>>
>> I agree that packing may affect performance. But why don't you reduce
>> FULL_UTIL instead of multiplying by 8? With the current expression you
>> will not pack a 10% task if rq->util = 20 and nr_running = 1, but you
>> would pack a 6% task even if rq->util = 50 and the resulting cpu load
>> is much higher.
>
> Yes, the threshold has no strong theory or experiment to support it. I
> had tried cyclictest, which Vincent used, but that case's load avg is
> too small to be caught, so I just use half of Vincent's value: 12.5%.
> If you have a more reasonable value, let me know.
> As to engaging nr_running as a multiplier, it is based on 2 reasons.
> 1, load avg/util needs 345ms to accumulate to 100%. So if a task is
> eating full cpu time, it can still spend 345ms with rq->util < 1.

I agree that load avg may not be accurate, especially for new tasks. But
why use it if you don't trust its value anyway?

The load avg (sum/period) of a new task will reach 100% instantly if the
task is consuming all the cpu time it can get. An old task can reach 50%
within 32ms. So you should fairly quickly be able to see if it is a
light task or not. You may under-estimate its load in the beginning, but
only for a very short time.

> 2, if there are more tasks, like 2 tasks running on one cpu, it is
> possible to have the capacity to burn 200% cpu time, while the biggest
> rq->util is still 100%.

If you want a better metric for how much cpu time the tasks on the
runqueue could potentially use, I would suggest using
cfs_rq->runnable_load_avg, which is the load_avg_contrib sum of all tasks
on the runqueue. It would give you 200% in your example above.

On the other hand, I think rq->util is fine for this purpose. If
rq->util < 100% you know for sure that the cpu is not fully utilized no
matter how many tasks you have on the runqueue. So as long as rq->util
is well below 100% (like < 50%) it should be safe to pack more small
tasks on that cpu even if it has multiple tasks running already.

> Considering that figuring out precise utils is complicated and costs
> much, I do this simple calculation. It is not very precise, but it is
> efficient and biased more toward performance.

It is indeed very biased towards performance.
I would prefer more focus on saving power in a power scheduling policy :)

Morten
Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
On Wed, 16 Jan 2013 14:11:53 +0800, Alex Shi wrote:
> On 01/14/2013 03:13 PM, Namhyung Kim wrote:
>> On Fri, 11 Jan 2013 11:47:03 +0800, Alex Shi wrote:
>>> Um, change to leader_cpu?
>>
>> vacantest? ;-)
>
> It's hard to find that word in google. Are you sure it is better than
> leader_cpu? :)

Nope. My English is just bad. :)

But I'm not sure the "leader" is the right one either. Why is it called
"leader" in the first place?

Thanks,
Namhyung
Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
On 01/15/2013 01:00 AM, Morten Rasmussen wrote:
>>> Why multiply rq->util by nr_running?
>>>
>>> Let's take an example where rq->util = 50, nr_running = 2, and putil =
>>> 10. In this case the value of putil doesn't really matter as vacancy
>>> would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
>>> However, with rq->util = 50 there should be plenty of spare cpu time to
>>> take another task.
>>
>> for this example, the util is not full maybe because it was just woken
>> up, and it is still possible that it wants to run full time. So I try
>> to give it the large guessed load.
>
> I don't see why rq->util should be treated differently depending on the
> number of tasks causing the load. rq->util = 50 means that the cpu is
> busy about 50% of the time no matter how many tasks contribute to that
> load.
>
> If nr_running = 1 instead in my example, you would consider the cpu
> vacant if putil = 6, but if nr_running > 1 you would not. Why should the
> two scenarios be treated differently?
>
>>> Also, why multiply putil by 8? rq->util must be very close to 0 for
>>> vacancy to be positive if putil is close to 12 (12.5%).
>>
>> I just want to pack small-util tasks, since packing can possibly hurt
>> performance.
>
> I agree that packing may affect performance. But why don't you reduce
> FULL_UTIL instead of multiplying by 8? With the current expression you
> will not pack a 10% task if rq->util = 20 and nr_running = 1, but you
> would pack a 6% task even if rq->util = 50 and the resulting cpu load
> is much higher.

Yes, the threshold has no strong theory or experiment to support it. I
had tried cyclictest, which Vincent used, but that case's load avg is too
small to be caught, so I just use half of Vincent's value: 12.5%. If you
have a more reasonable value, let me know.

As to engaging nr_running as a multiplier, it is based on 2 reasons.
1, load avg/util needs 345ms to accumulate to 100%. So if a task is
eating full cpu time, it can still spend 345ms with rq->util < 1.
2, if there are more tasks, like 2 tasks running on one cpu, it is
possible to have the capacity to burn 200% cpu time, while the biggest
rq->util is still 100%.

Considering that figuring out precise utils is complicated and costs
much, I do this simple calculation. It is not very precise, but it is
efficient and biased more toward performance.

--
Thanks
    Alex
Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
On 01/14/2013 03:13 PM, Namhyung Kim wrote:
> On Fri, 11 Jan 2013 11:47:03 +0800, Alex Shi wrote:
>> On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
>>> On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
>>>> If the wake/exec task is small enough, utils < 12.5%, it will
>>>> have the chance to be packed into a cpu which is busy but still has
>>>> space to handle it.
>>>>
>>>> Signed-off-by: Alex Shi
>>>> ---
> [snip]
>>> I may be missing something, but could the expression be something like
>>> the below instead?
>>>
>>> Create a putil < 12.5% check before the loop. There is no reason to
>>> recheck it every iteration. Then:
>
> Agreed. Also I suggest that the check of the local cpu can be moved
> before the loop, so that it can be used without going through the loop
> if it's vacant enough.

Yes, thanks for the suggestion!

>>> vacancy = FULL_UTIL - (rq->util + putil)
>>>
>>> should be enough?
>>>
>>>> +
>>>> +		/* bias toward local cpu */
>>>> +		if (vacancy > 0 && (i == this_cpu))
>>>> +			return i;
>>>> +
>>>> +		if (vacancy > 0 && vacancy < min_vacancy) {
>>>> +			min_vacancy = vacancy;
>>>> +			idlest = i;
>>>
>>> "idlest" may be a bit misleading here as you actually select the
>>> busiest cpu that has enough spare capacity to take the task.
>>
>> Um, change to leader_cpu?
>
> vacantest? ;-)

It's hard to find that word in google. Are you sure it is better than
leader_cpu? :)

> Thanks,
> Namhyung

--
Thanks
    Alex
Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
On Fri, Jan 11, 2013 at 03:47:03AM +, Alex Shi wrote:
> On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
>> On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
>>> If the wake/exec task is small enough, utils < 12.5%, it will
>>> have the chance to be packed into a cpu which is busy but still has
>>> space to handle it.
>>>
>>> Signed-off-by: Alex Shi
>>> ---
>>>  kernel/sched/fair.c | 51 +--
>>>  1 file changed, 45 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 8d0d3af..0596e81 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct sched_domain *sd,
>>>  }
>>>
>>>  /*
>>> + * find_leader_cpu - find the busiest but still has enough leisure time cpu
>>> + * among the cpus in group.
>>> + */
>>> +static int
>>> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>> +{
>>> +	unsigned vacancy, min_vacancy = UINT_MAX;
>>
>> unsigned int?
>
> yes
>
>>> +	int idlest = -1;
>>> +	int i;
>>> +	/* percentage the task's util */
>>> +	unsigned putil = p->se.avg.runnable_avg_sum * 100
>>> +		/ (p->se.avg.runnable_avg_period + 1);
>>
>> Alternatively you could use se.avg.load_avg_contrib which is the same
>> ratio scaled by the task priority (se->load.weight). In the above
>> expression you don't take priority into account.
>
> sure. but this seems more directly meaningful.
>
>>> +
>>> +	/* Traverse only the allowed CPUs */
>>> +	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
>>> +		struct rq *rq = cpu_rq(i);
>>> +		int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
>>> +
>>> +		/* only pack task which putil < 12.5% */
>>> +		vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);
>>
>> I can't follow this expression.
>>
>> The variables can have the following values:
>> FULL_UTIL = 99
>> rq->util = [0..99]
>> nr_running = [1..inf]
>> putil = [0..99]
>>
>> Why multiply rq->util by nr_running?
>>
>> Let's take an example where rq->util = 50, nr_running = 2, and putil =
>> 10. In this case the value of putil doesn't really matter as vacancy
>> would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
>> However, with rq->util = 50 there should be plenty of spare cpu time to
>> take another task.
>
> for this example, the util is not full maybe because it was just woken
> up, and it is still possible that it wants to run full time. So I try
> to give it the large guessed load.

I don't see why rq->util should be treated differently depending on the
number of tasks causing the load. rq->util = 50 means that the cpu is
busy about 50% of the time no matter how many tasks contribute to that
load.

If nr_running = 1 instead in my example, you would consider the cpu
vacant if putil = 6, but if nr_running > 1 you would not. Why should the
two scenarios be treated differently?

>> Also, why multiply putil by 8? rq->util must be very close to 0 for
>> vacancy to be positive if putil is close to 12 (12.5%).
>
> I just want to pack small-util tasks, since packing can possibly hurt
> performance.

I agree that packing may affect performance. But why don't you reduce
FULL_UTIL instead of multiplying by 8? With the current expression you
will not pack a 10% task if rq->util = 20 and nr_running = 1, but you
would pack a 6% task even if rq->util = 50 and the resulting cpu load is
much higher.

>> The vacancy variable is declared unsigned, so it will underflow instead
>> of becoming negative. Is this intentional?
>
> oops, my mistake.
>
>> I may be missing something, but could the expression be something like
>> the below instead?
>>
>> Create a putil < 12.5% check before the loop. There is no reason to
>> recheck it every iteration. Then:
>>
>> vacancy = FULL_UTIL - (rq->util + putil)
>>
>> should be enough?
>>
>>> +
>>> +		/* bias toward local cpu */
>>> +		if (vacancy > 0 && (i == this_cpu))
>>> +			return i;
>>> +
>>> +		if (vacancy > 0 && vacancy < min_vacancy) {
>>> +			min_vacancy = vacancy;
>>> +			idlest = i;
>>
>> "idlest" may be a bit misleading here as you actually select the
>> busiest cpu that has enough spare capacity to take the task.
>
> Um, change to leader_cpu?

Fine by me.

Morten
Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
On Fri, 11 Jan 2013 11:47:03 +0800, Alex Shi wrote:
> On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
>> On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
>>> If the wake/exec task is small enough, utils < 12.5%, it will
>>> have the chance to be packed into a cpu which is busy but still has
>>> space to handle it.
>>>
>>> Signed-off-by: Alex Shi
>>> ---
[snip]
>> I may be missing something, but could the expression be something like
>> the below instead?
>>
>> Create a putil < 12.5% check before the loop. There is no reason to
>> recheck it every iteration. Then:

Agreed. Also I suggest that the check of the local cpu can be moved
before the loop, so that it can be used without going through the loop
if it's vacant enough.

>> vacancy = FULL_UTIL - (rq->util + putil)
>>
>> should be enough?
>>
>>> +
>>> +		/* bias toward local cpu */
>>> +		if (vacancy > 0 && (i == this_cpu))
>>> +			return i;
>>> +
>>> +		if (vacancy > 0 && vacancy < min_vacancy) {
>>> +			min_vacancy = vacancy;
>>> +			idlest = i;
>>
>> "idlest" may be a bit misleading here as you actually select the
>> busiest cpu that has enough spare capacity to take the task.
>
> Um, change to leader_cpu?

vacantest? ;-)

Thanks,
Namhyung
Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
> On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
>> If the wake/exec task is small enough, utils < 12.5%, it will
>> have the chance to be packed into a cpu which is busy but still has
>> space to handle it.
>>
>> Signed-off-by: Alex Shi
>> ---
>>  kernel/sched/fair.c | 51 +--
>>  1 file changed, 45 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8d0d3af..0596e81 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct sched_domain *sd,
>>  }
>>
>>  /*
>> + * find_leader_cpu - find the busiest but still has enough leisure time cpu
>> + * among the cpus in group.
>> + */
>> +static int
>> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>> +{
>> +	unsigned vacancy, min_vacancy = UINT_MAX;
>
> unsigned int?

yes

>> +	int idlest = -1;
>> +	int i;
>> +	/* percentage the task's util */
>> +	unsigned putil = p->se.avg.runnable_avg_sum * 100
>> +		/ (p->se.avg.runnable_avg_period + 1);
>
> Alternatively you could use se.avg.load_avg_contrib which is the same
> ratio scaled by the task priority (se->load.weight). In the above
> expression you don't take priority into account.

sure. but this seems more directly meaningful.

>> +
>> +	/* Traverse only the allowed CPUs */
>> +	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
>> +		struct rq *rq = cpu_rq(i);
>> +		int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
>> +
>> +		/* only pack task which putil < 12.5% */
>> +		vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);
>
> I can't follow this expression.
>
> The variables can have the following values:
> FULL_UTIL = 99
> rq->util = [0..99]
> nr_running = [1..inf]
> putil = [0..99]
>
> Why multiply rq->util by nr_running?
>
> Let's take an example where rq->util = 50, nr_running = 2, and putil =
> 10. In this case the value of putil doesn't really matter as vacancy
> would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
> However, with rq->util = 50 there should be plenty of spare cpu time to
> take another task.

for this example, the util is not full maybe because it was just woken
up, and it is still possible that it wants to run full time. So I try to
give it the large guessed load.

> Also, why multiply putil by 8? rq->util must be very close to 0 for
> vacancy to be positive if putil is close to 12 (12.5%).

I just want to pack small-util tasks, since packing can possibly hurt
performance.

> The vacancy variable is declared unsigned, so it will underflow instead
> of becoming negative. Is this intentional?

oops, my mistake.

> I may be missing something, but could the expression be something like
> the below instead?
>
> Create a putil < 12.5% check before the loop. There is no reason to
> recheck it every iteration. Then:
>
> vacancy = FULL_UTIL - (rq->util + putil)
>
> should be enough?
>
>> +
>> +		/* bias toward local cpu */
>> +		if (vacancy > 0 && (i == this_cpu))
>> +			return i;
>> +
>> +		if (vacancy > 0 && vacancy < min_vacancy) {
>> +			min_vacancy = vacancy;
>> +			idlest = i;
>
> "idlest" may be a bit misleading here as you actually select the
> busiest cpu that has enough spare capacity to take the task.

Um, change to leader_cpu?

> Morten
Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
On Sat, Jan 05, 2013 at 08:37:46AM +, Alex Shi wrote:
> If the wake/exec task is small enough, utils < 12.5%, it will
> have the chance to be packed into a cpu which is busy but still has
> space to handle it.
>
> Signed-off-by: Alex Shi
> ---
>  kernel/sched/fair.c | 51 +--
>  1 file changed, 45 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8d0d3af..0596e81 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct sched_domain *sd,
>  }
>
>  /*
> + * find_leader_cpu - find the busiest but still has enough leisure time cpu
> + * among the cpus in group.
> + */
> +static int
> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> +{
> +	unsigned vacancy, min_vacancy = UINT_MAX;

unsigned int?

> +	int idlest = -1;
> +	int i;
> +	/* percentage the task's util */
> +	unsigned putil = p->se.avg.runnable_avg_sum * 100
> +		/ (p->se.avg.runnable_avg_period + 1);

Alternatively you could use se.avg.load_avg_contrib which is the same
ratio scaled by the task priority (se->load.weight). In the above
expression you don't take priority into account.

> +
> +	/* Traverse only the allowed CPUs */
> +	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> +		struct rq *rq = cpu_rq(i);
> +		int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
> +
> +		/* only pack task which putil < 12.5% */
> +		vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);

I can't follow this expression.

The variables can have the following values:
FULL_UTIL = 99
rq->util = [0..99]
nr_running = [1..inf]
putil = [0..99]

Why multiply rq->util by nr_running?

Let's take an example where rq->util = 50, nr_running = 2, and putil =
10. In this case the value of putil doesn't really matter as vacancy
would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
However, with rq->util = 50 there should be plenty of spare cpu time to
take another task.

Also, why multiply putil by 8? rq->util must be very close to 0 for
vacancy to be positive if putil is close to 12 (12.5%).

The vacancy variable is declared unsigned, so it will underflow instead
of becoming negative. Is this intentional?

I may be missing something, but could the expression be something like
the below instead?

Create a putil < 12.5% check before the loop. There is no reason to
recheck it every iteration. Then:

vacancy = FULL_UTIL - (rq->util + putil)

should be enough?

> +
> +		/* bias toward local cpu */
> +		if (vacancy > 0 && (i == this_cpu))
> +			return i;
> +
> +		if (vacancy > 0 && vacancy < min_vacancy) {
> +			min_vacancy = vacancy;
> +			idlest = i;

"idlest" may be a bit misleading here as you actually select the busiest
cpu that has enough spare capacity to take the task.

Morten

> +		}
> +	}
> +	return idlest;
> +}
> +
> +/*
>  * If power policy is eligible for this domain, and it has task allowed cpu.
>  * we will select CPU from this domain.
>  */
>  static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
> -			struct task_struct *p, struct sd_lb_stats *sds)
> +		struct task_struct *p, struct sd_lb_stats *sds, int fork)
>  {
>  	int policy;
>  	int new_cpu = -1;
>
>  	policy = get_sd_sched_policy(sd, cpu, p, sds);
> -	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
> -		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> -
> +	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
> +		if (!fork)
> +			new_cpu = find_leader_cpu(sds->group_leader, p, cpu);
> +		/* for fork balancing and a little busy task */
> +		if (new_cpu == -1)
> +			new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> +	}
>  	return new_cpu;
>  }
>
> @@ -3534,14 +3572,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
>  		if (tmp->flags & sd_flag) {
>  			sd = tmp;
>
> -			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
> +			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
> +					flags & SD_BALANCE_FORK);
>  			if (new_cpu != -1)
>  				goto unlock;
>  		}
>  	}
>
>  	if (affine_sd) {
> -		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
> +		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
>  		if (new_cpu != -1)
>  			goto unlock;
> --
> 1.7.12