Hi Brendan,

On Fri, Aug 11, 2017 at 2:45 AM, Brendan Jackman <brendan.jack...@arm.com> wrote:
> This patch adds a parameter to select_task_rq, sibling_count_hint
> allowing the caller, where it has this information, to inform the
> sched_class the number of tasks that are being woken up as part of
> the same event.
>
> The wake_q mechanism is one case where this information is available.
>
> select_task_rq_fair can then use the information to detect that it
> needs to widen the search space for task placement in order to avoid
> overloading the last-level cache domain's CPUs.
>
>    * * *
>
> The reason I am investigating this change is the following use case
> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
> all repeatedly do X amount of work then
> pthread_barrier_wait (i.e. sleep until the last task finishes its X
> and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
> finish faster, and then those CPUs pull over the tasks that are still
> running:
>
>      v CPU v           ->time->
>
>                   -------------
>   0  (big)        11111  /333
>                   -------------
>   1  (big)        22222   /444|
>                   -------------
>   2  (LITTLE)     333333/
>                   -------------
>   3  (LITTLE)     444444/
>                   -------------
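Just to make sure I'm picturing the workload correctly, I assume it is
roughly the untested sketch below (NTASKS, NITERS and do_work() are
placeholder names of mine, not from your setup):

/*
 * Untested sketch of the use case: one thread per CPU, each doing a
 * fixed chunk of CPU-bound work and then waiting on a barrier.
 * Build with: gcc -O2 -pthread barrier_test.c
 */
#include <stddef.h>
#include <pthread.h>

#define NTASKS	4
#define NITERS	100

static pthread_barrier_t barrier;

/* Fixed amount of CPU-bound work ("X") per iteration. */
static void do_work(void)
{
	volatile unsigned long i;

	for (i = 0; i < 100000000UL; i++)
		;
}

static void *worker(void *unused)
{
	int iter;

	for (iter = 0; iter < NITERS; iter++) {
		do_work();
		/* Sleep until the last task finishes its X. */
		pthread_barrier_wait(&barrier);
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NTASKS];
	int i;

	pthread_barrier_init(&barrier, NULL, NTASKS);
	for (i = 0; i < NTASKS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (i = 0; i < NTASKS; i++)
		pthread_join(threads[i], NULL);
	pthread_barrier_destroy(&barrier);
	return 0;
}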
This is because of the newly idle balancing, right?

> Now when task 4 hits the barrier (at |) and wakes the others up,
> there are 4 tasks with prev_cpu=<big> and 0 tasks with
> prev_cpu=<little>. want_affine therefore means that we'll only look
> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
> on the bigs until the next load balance, something like this:
>
>      v CPU v           ->time->
>
>                   ------------------------
>   0  (big)        11111  /333  31313\33333
>                   ------------------------
>   1  (big)        22222   /444|424\4444444
>                   ------------------------
>   2  (LITTLE)     333333/           \222222
>                   ------------------------
>   3  (LITTLE)     444444/            \1111
>                   ------------------------
>                              ^^^
>                        underutilization

I wonder if this problem could be solved better by preventing that first
newly-idle-balance pull from happening in the first place (based on some
condition), rather than letting it happen and then correcting it later
with another balancing pass?

> So, I'm trying to get want_affine = 0 for these tasks.
>
> I don't _think_ any incarnation of the wakee_flips mechanism can help
> us here because which task is waker and which tasks are wakees
> generally changes with each iteration.
>
> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
> nice property that we know exactly how many tasks are being woken, so
> we can cheat.
>
> It might be a disadvantage that we "widen" _every_ task that's woken in
> an event, while select_idle_sibling would work fine for the first
> sd_llc_size - 1 tasks.
>
> IIUC, if wake_affine() behaves correctly this trick wouldn't be
> necessary on SMP systems, so it might be best guarded by the presence

Actually, wake_affine() doesn't check for balance if the previous and
waking CPUs are within the same shared-cache domain. The difference is
that some time ago it would return true for the shared-cache case, but
as of 4.14-rc1 it returns false:

http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466

Since it would return false for tasks 1 and 2 in the wakeups above, we
would then run select_idle_sibling() on the previous CPU, which is also
within the big cluster, so I don't think it makes a difference in this
case... In fact, what it returns probably doesn't matter (I've sketched
the flow I mean further down, after your TTWU_QUEUE note).

> of SD_ASYM_CPUCAPACITY?
>
>    * * *
>
> Final note..
>
> In order to observe "perfect" behaviour for this use case, I also had
> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
> above we are working through the work queue and have placed tasks 3
> and 2, and are about to place task 1:
>
>      v CPU v           ->time->
>
>                   --------------
>   0  (big)        11111  /333  3
>                   --------------
>   1  (big)        22222   /444|4
>                   --------------
>   2  (LITTLE)     333333/      2
>                   --------------
>   3  (LITTLE)     444444/        <- Task 1 should go here
>                   --------------
>
> If TTWU_QUEUE is enabled, we will not yet have enqueued task
> 2 (having instead sent a reschedule IPI) or attached its load to CPU
> 2. So we are likely to also place task 1 on cpu 2. Disabling
> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
> solving this issue. TTWU_QUEUE is there to minimise rq lock
> contention, and I guess that this contention is less of an issue on
> big.LITTLE systems since they have relatively few CPUs, which
> suggests the trade-off makes sense here.

Yeah. OTOH, once those IPIs are sent, I feel the select path should
probably be smart enough to factor in those pending enqueues; I'm not
sure how hard or how meaningful it would be to implement something like
that, though.
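Coming back to the wake_affine() point above, the flow I have in mind is
roughly the following. This is a simplified paraphrase from memory of the
SD_BALANCE_WAKE path in select_task_rq_fair(), not verbatim v4.14-rc1
code:

	if (affine_sd) {
		sd = NULL;	/* prefer wake_affine over balance flags */
		if (cpu == prev_cpu || wake_affine(affine_sd, p, prev_cpu, sync))
			new_cpu = cpu;	/* pull towards the waker's CPU */
	}

	...

	/*
	 * Whether or not we pulled, the slow path (find_idlest_group) is
	 * skipped and we scan the LLC around prev_cpu/new_cpu, which in
	 * your example contains only the two big CPUs.
	 */
	new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

i.e. whichever way wake_affine() decides, the search never leaves the big
cluster's LLC, which is why I don't think its return value matters here.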
>
> Signed-off-by: Brendan Jackman <brendan.jack...@arm.com>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Josef Bacik <jo...@toxicpanda.com>
> Cc: Joel Fernandes <joe...@google.com>
> Cc: Mike Galbraith <efa...@gmx.de>
> Cc: Matt Fleming <m...@codeblueprint.co.uk>
> ---
>  include/linux/sched/wake_q.h |  2 ++
>  kernel/sched/core.c          | 34 +++++++++++++++++++++++-----------
>  kernel/sched/deadline.c      |  3 ++-
>  kernel/sched/fair.c          | 17 +++++++++++------
>  kernel/sched/idle_task.c     |  3 ++-
>  kernel/sched/rt.c            |  3 ++-
>  kernel/sched/sched.h         |  3 ++-
>  kernel/sched/stop_task.c     |  3 ++-
>  8 files changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
> index d03d8a9047dc..607a888eb35b 100644
> --- a/include/linux/sched/wake_q.h
> +++ b/include/linux/sched/wake_q.h
> @@ -33,6 +33,7 @@
>  struct wake_q_head {
>         struct wake_q_node *first;
>         struct wake_q_node **lastp;
> +       int count;
>  };
>
>  #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
> @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
>  {
>         head->first = WAKE_Q_TAIL;
>         head->lastp = &head->first;
> +       head->count = 0;
>  }
>
>  extern void wake_q_add(struct wake_q_head *head,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0869b20fba81..ddf9257b1467 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>         if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>                 return;
>
> +       head->count++;
> +
>         get_task_struct(task);
>
>         /*
> @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>         head->lastp = &node->next;
>  }
>
> +static int
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
> +              int sibling_count_hint);
> +
>  void wake_up_q(struct wake_q_head *head)
>  {
>         struct wake_q_node *node = head->first;
> @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
>                 task->wake_q.next = NULL;
>
>                 /*
> -                * wake_up_process() implies a wmb() to pair with the queueing
> +                * try_to_wake_up() implies a wmb() to pair with the queueing
>                  * in wake_q_add() so as not to miss wakeups.
>                  */
> -               wake_up_process(task);
> +               try_to_wake_up(task, TASK_NORMAL, 0, head->count);
>                 put_task_struct(task);

Shouldn't you reset head->count after all the tasks have been woken up?
(I've put a small sketch of what I mean further down.)

>         }
>  }
> @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>   * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
>   */
>  static inline
> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
> +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
> +                  int sibling_count_hint)

This variable name seems a bit long to me; how about just sibling_count?

>  {
>         lockdep_assert_held(&p->pi_lock);
>
>         if (p->nr_cpus_allowed > 1)
> -               cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
> +               cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
> +                                                    sibling_count_hint);

Same here.
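Also, to illustrate the head->count reset I asked about above, I was
thinking of something like this on top of your patch (untested, loop
body elided):

	void wake_up_q(struct wake_q_head *head)
	{
		struct wake_q_node *node = head->first;

		while (node != WAKE_Q_TAIL) {
			/*
			 * ... existing loop body, waking each task with
			 * try_to_wake_up(task, TASK_NORMAL, 0, head->count) ...
			 */
		}

		/* Don't let a reused head carry a stale count into the next wakeup. */
		head->count = 0;
	}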
<snip>

>
>  static int
> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
> +                 int sibling_count_hint)
>  {
>         struct task_struct *curr;
>         struct rq *rq;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index eeef1a3086d1..56ae525618e9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1419,7 +1419,8 @@ struct sched_class {
>         void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>
>  #ifdef CONFIG_SMP
> -       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
> +       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
> +                              int subling_count_hint);

s/subling/sibling/

thanks,

- Joel