Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.
> > Please try this and see how it compares with the vanilla v2. I think it's
> > time for a v3 now and we shall be posting it soon after some more
> > testing and benchmarking.
>
> Is there any potential change between pre v3 and v3? I prefer working
> based on v3 so that everyone is on the same page.
>
Makes sense, testing can wait until v3 is posted. I don't expect many
changes from the above, but it is better to test on the posted v3.

Thanks,
Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.
On Mon, May 20, 2019 at 10:04 PM Vineeth Pillai wrote:
>
> > > The following patch improved my test cases.
> > > Welcome any comments.
> > >
> >
> > This is certainly better than violating the point of the core scheduler :)
> >
> > If I'm understanding this right, what will happen in this case is that
> > instead of using the idle process selected by the sibling we do the core
> > scheduling again. This may start with a newidle_balance which might bring
> > over something to run that matches what we want to put on the sibling.
> > If that works then I can see this helping.
> >
> > But I'd be a little concerned that we could end up thrashing. Once we do
> > core scheduling again here we'd force the sibling to resched, and if we
> > got a different result which "helped" him pick idle we'd go around again.

Thrashing means more IPIs, right? That's not what I observed: because the
idle task has less chance to get onto a CPU, rescheduling is reduced
accordingly.

> > I think inherent in the concept of core scheduling (barring a perfectly
> > aligned set of jobs) is some extra idle time on siblings.

Yeah, I understand and agree with this, but 10-15% idle time on an
overloaded system makes me try to figure out how this could happen and
whether we can improve it.

> I was also thinking along the same lines. This change basically always
> tries to avoid idle and thereby constantly interrupts the sibling.
> While this change might benefit a very small subset of workloads, it
> might introduce thrashing more often.

Thrashing is not observed in the overloaded case, but it may happen under a
light or medium load; I need more investigation.

> One other reason you might be seeing a performance improvement is
> because of the bugs that caused both siblings to go idle even though
> there are runnable and compatible threads in the queue. Most of the
> issues are fixed based on all the feedback received in v2.
>
> We have a github repo with the pre v3 changes here:
> https://github.com/digitalocean/linux-coresched/tree/coresched

Okay, thanks. It looks like the core functions pick_next_task() and
pick_task() have a lot of changes against v2. Need more brain power..

> Please try this and see how it compares with the vanilla v2. I think it's
> time for a v3 now and we shall be posting it soon after some more
> testing and benchmarking.

Is there any potential change between pre v3 and v3? I prefer working
based on v3 so that everyone is on the same page.

Thanks,
-Aubrey
Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.
> > The following patch improved my test cases.
> > Welcome any comments.
> >
>
> This is certainly better than violating the point of the core scheduler :)
>
> If I'm understanding this right, what will happen in this case is that
> instead of using the idle process selected by the sibling we do the core
> scheduling again. This may start with a newidle_balance which might bring
> over something to run that matches what we want to put on the sibling.
> If that works then I can see this helping.
>
> But I'd be a little concerned that we could end up thrashing. Once we do
> core scheduling again here we'd force the sibling to resched, and if we
> got a different result which "helped" him pick idle we'd go around again.
>
> I think inherent in the concept of core scheduling (barring a perfectly
> aligned set of jobs) is some extra idle time on siblings.
>
I was also thinking along the same lines. This change basically always
tries to avoid idle and thereby constantly interrupts the sibling.
While this change might benefit a very small subset of workloads, it
might introduce thrashing more often.

One other reason you might be seeing a performance improvement is
because of the bugs that caused both siblings to go idle even though
there are runnable and compatible threads in the queue. Most of the
issues are fixed based on all the feedback received in v2.

We have a github repo with the pre v3 changes here:
https://github.com/digitalocean/linux-coresched/tree/coresched

Please try this and see how it compares with the vanilla v2. I think it's
time for a v3 now and we shall be posting it soon after some more
testing and benchmarking.

Thanks,
Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.
On Sat, May 18, 2019 at 11:37:56PM +0800 Aubrey Li wrote:
> On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai wrote:
> >
> > From: Peter Zijlstra (Intel)
> >
> > Instead of only selecting a local task, select a task for all SMT
> > siblings for every reschedule on the core (irrespective which logical
> > CPU does the reschedule).
> >
> > NOTE: there is still potential for siblings rivalry.
> > NOTE: this is far too complicated; but thus far I've failed to
> >       simplify it further.
> >
> > Signed-off-by: Peter Zijlstra (Intel)
> > ---
> >  kernel/sched/core.c  | 222 ++-
> >  kernel/sched/sched.h |   5 +-
> >  2 files changed, 224 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index e5bdc1c4d8d7..9e6e90c6f9b9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3574,7 +3574,7 @@ static inline void schedule_debug(struct task_struct *prev)
> >   * Pick up the highest-prio task:
> >   */
> >  static inline struct task_struct *
> > -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > +__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  {
> >         const struct sched_class *class;
> >         struct task_struct *p;
> > @@ -3619,6 +3619,220 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >         BUG();
> >  }
> >
> > +#ifdef CONFIG_SCHED_CORE
> > +
> > +static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
> > +{
> > +       if (is_idle_task(a) || is_idle_task(b))
> > +               return true;
> > +
> > +       return a->core_cookie == b->core_cookie;
> > +}
> > +
> > +// XXX fairness/fwd progress conditions
> > +static struct task_struct *
> > +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
> > +{
> > +       struct task_struct *class_pick, *cookie_pick;
> > +       unsigned long cookie = 0UL;
> > +
> > +       /*
> > +        * We must not rely on rq->core->core_cookie here, because we fail to reset
> > +        * rq->core->core_cookie on new picks, such that we can detect if we need
> > +        * to do single vs multi rq task selection.
> > +        */
> > +
> > +       if (max && max->core_cookie) {
> > +               WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
> > +               cookie = max->core_cookie;
> > +       }
> > +
> > +       class_pick = class->pick_task(rq);
> > +       if (!cookie)
> > +               return class_pick;
> > +
> > +       cookie_pick = sched_core_find(rq, cookie);
> > +       if (!class_pick)
> > +               return cookie_pick;
> > +
> > +       /*
> > +        * If class > max && class > cookie, it is the highest priority task on
> > +        * the core (so far) and it must be selected, otherwise we must go with
> > +        * the cookie pick in order to satisfy the constraint.
> > +        */
> > +       if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))
> > +               return class_pick;
> > +
> > +       return cookie_pick;
> > +}
> > +
> > +static struct task_struct *
> > +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > +{
> > +       struct task_struct *next, *max = NULL;
> > +       const struct sched_class *class;
> > +       const struct cpumask *smt_mask;
> > +       int i, j, cpu;
> > +
> > +       if (!sched_core_enabled(rq))
> > +               return __pick_next_task(rq, prev, rf);
> > +
> > +       /*
> > +        * If there were no {en,de}queues since we picked (IOW, the task
> > +        * pointers are all still valid), and we haven't scheduled the last
> > +        * pick yet, do so now.
> > +        */
> > +       if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> > +           rq->core->core_pick_seq != rq->core_sched_seq) {
> > +               WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
> > +
> > +               next = rq->core_pick;
> > +               if (next != prev) {
> > +                       put_prev_task(rq, prev);
> > +                       set_next_task(rq, next);
> > +               }
> > +               return next;
> > +       }
> > +
>
> The following patch improved my test cases.
> Welcome any comments.
>

This is certainly better than violating the point of the core scheduler :)

If I'm understanding this right, what will happen in this case is that
instead of using the idle process selected by the sibling we do the core
scheduling again. This may start with a newidle_balance which might bring
over something to run that matches what we want to put on the sibling.
If that works then I can see this helping.

But I'd be a little concerned that we could end up thrashing. Once we do
core scheduling again here we'd force the sibling to resched, and if we got
a different result which "helped" him
Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.
On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai wrote:
>
> From: Peter Zijlstra (Intel)
>
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).
>
> NOTE: there is still potential for siblings rivalry.
> NOTE: this is far too complicated; but thus far I've failed to
>       simplify it further.
>
> Signed-off-by: Peter Zijlstra (Intel)
> ---
>  kernel/sched/core.c  | 222 ++-
>  kernel/sched/sched.h |   5 +-
>  2 files changed, 224 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e5bdc1c4d8d7..9e6e90c6f9b9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3574,7 +3574,7 @@ static inline void schedule_debug(struct task_struct *prev)
>   * Pick up the highest-prio task:
>   */
>  static inline struct task_struct *
> -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>         const struct sched_class *class;
>         struct task_struct *p;
> @@ -3619,6 +3619,220 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>         BUG();
>  }
>
> +#ifdef CONFIG_SCHED_CORE
> +
> +static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
> +{
> +       if (is_idle_task(a) || is_idle_task(b))
> +               return true;
> +
> +       return a->core_cookie == b->core_cookie;
> +}
> +
> +// XXX fairness/fwd progress conditions
> +static struct task_struct *
> +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
> +{
> +       struct task_struct *class_pick, *cookie_pick;
> +       unsigned long cookie = 0UL;
> +
> +       /*
> +        * We must not rely on rq->core->core_cookie here, because we fail to reset
> +        * rq->core->core_cookie on new picks, such that we can detect if we need
> +        * to do single vs multi rq task selection.
> +        */
> +
> +       if (max && max->core_cookie) {
> +               WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
> +               cookie = max->core_cookie;
> +       }
> +
> +       class_pick = class->pick_task(rq);
> +       if (!cookie)
> +               return class_pick;
> +
> +       cookie_pick = sched_core_find(rq, cookie);
> +       if (!class_pick)
> +               return cookie_pick;
> +
> +       /*
> +        * If class > max && class > cookie, it is the highest priority task on
> +        * the core (so far) and it must be selected, otherwise we must go with
> +        * the cookie pick in order to satisfy the constraint.
> +        */
> +       if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))
> +               return class_pick;
> +
> +       return cookie_pick;
> +}
> +
> +static struct task_struct *
> +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
> +       struct task_struct *next, *max = NULL;
> +       const struct sched_class *class;
> +       const struct cpumask *smt_mask;
> +       int i, j, cpu;
> +
> +       if (!sched_core_enabled(rq))
> +               return __pick_next_task(rq, prev, rf);
> +
> +       /*
> +        * If there were no {en,de}queues since we picked (IOW, the task
> +        * pointers are all still valid), and we haven't scheduled the last
> +        * pick yet, do so now.
> +        */
> +       if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> +           rq->core->core_pick_seq != rq->core_sched_seq) {
> +               WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
> +
> +               next = rq->core_pick;
> +               if (next != prev) {
> +                       put_prev_task(rq, prev);
> +                       set_next_task(rq, next);
> +               }
> +               return next;
> +       }
> +

The following patch improved my test cases.
Welcome any comments.

Thanks,
-Aubrey

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e3162f..86031f4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3685,10 +3685,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	/*
 	 * If there were no {en,de}queues since we picked (IOW, the task
 	 * pointers are all still valid), and we haven't scheduled the last
-	 * pick yet, do so now.
+	 * pick yet, do so now. If the last pick is idle task, we abandon
+	 * last pick and try to pick up task this time.
 	 */
 	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
-	    rq->core->core_pick_seq != rq->core_sched_seq) {
+	    rq->core->core_pick_seq != rq->core_sched_seq &&
+	    !is_idle_task(rq->core_pick)) {
 		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);

 		next =
Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.
On Tue, Apr 23, 2019 at 04:18:18PM +0000, Vineeth Remanan Pillai wrote:
> +// XXX fairness/fwd progress conditions
> +static struct task_struct *
> +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
> +{
> +       struct task_struct *class_pick, *cookie_pick;
> +       unsigned long cookie = 0UL;
> +
> +       /*
> +        * We must not rely on rq->core->core_cookie here, because we fail to reset
> +        * rq->core->core_cookie on new picks, such that we can detect if we need
> +        * to do single vs multi rq task selection.
> +        */
> +
> +       if (max && max->core_cookie) {
> +               WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
> +               cookie = max->core_cookie;
> +       }
> +
> +       class_pick = class->pick_task(rq);
> +       if (!cookie)
> +               return class_pick;
> +
> +       cookie_pick = sched_core_find(rq, cookie);
> +       if (!class_pick)
> +               return cookie_pick;
> +
> +       /*
> +        * If class > max && class > cookie, it is the highest priority task on
> +        * the core (so far) and it must be selected, otherwise we must go with
> +        * the cookie pick in order to satisfy the constraint.
> +        */
> +       if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))

It appears to me the cpu_prio_less(cookie_pick, class_pick) isn't needed.

If cookie_pick is the idle task, then cpu_prio_less(cookie_pick, class_pick)
is always true.

If cookie_pick is not the idle task and has the same sched class as
class_pick, then class_pick is the best candidate to run according to their
sched class. In this case, cpu_prio_less(cookie_pick, class_pick) shouldn't
return false, or it feels like a bug.

If cookie_pick is not the idle task and has a different sched class from
class_pick:
- if cookie_pick's sched class has higher priority than class_pick's sched
  class, then cookie_pick should have been selected in a previous sched
  class iteration; and since its cookie matches max, everything should have
  been finished already;
- if cookie_pick's sched class has lower priority than class_pick's sched
  class, then cpu_prio_less(cookie_pick, class_pick) still returns true.

So it looks like cpu_prio_less(cookie_pick, class_pick) should always return
true and is thus not needed.

> +               return class_pick;
> +
> +       return cookie_pick;
> +}
[RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.
From: Peter Zijlstra (Intel)

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective which logical
CPU does the reschedule).

NOTE: there is still potential for siblings rivalry.
NOTE: this is far too complicated; but thus far I've failed to
      simplify it further.

Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/sched/core.c  | 222 ++-
 kernel/sched/sched.h |   5 +-
 2 files changed, 224 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e5bdc1c4d8d7..9e6e90c6f9b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3574,7 +3574,7 @@ static inline void schedule_debug(struct task_struct *prev)
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
	const struct sched_class *class;
	struct task_struct *p;
@@ -3619,6 +3619,220 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
	BUG();
 }

+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+	if (is_idle_task(a) || is_idle_task(b))
+		return true;
+
+	return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+	struct task_struct *class_pick, *cookie_pick;
+	unsigned long cookie = 0UL;
+
+	/*
+	 * We must not rely on rq->core->core_cookie here, because we fail to reset
+	 * rq->core->core_cookie on new picks, such that we can detect if we need
+	 * to do single vs multi rq task selection.
+	 */
+
+	if (max && max->core_cookie) {
+		WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
+		cookie = max->core_cookie;
+	}
+
+	class_pick = class->pick_task(rq);
+	if (!cookie)
+		return class_pick;
+
+	cookie_pick = sched_core_find(rq, cookie);
+	if (!class_pick)
+		return cookie_pick;
+
+	/*
+	 * If class > max && class > cookie, it is the highest priority task on
+	 * the core (so far) and it must be selected, otherwise we must go with
+	 * the cookie pick in order to satisfy the constraint.
+	 */
+	if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, class_pick))
+		return class_pick;
+
+	return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *next, *max = NULL;
+	const struct sched_class *class;
+	const struct cpumask *smt_mask;
+	int i, j, cpu;
+
+	if (!sched_core_enabled(rq))
+		return __pick_next_task(rq, prev, rf);
+
+	/*
+	 * If there were no {en,de}queues since we picked (IOW, the task
+	 * pointers are all still valid), and we haven't scheduled the last
+	 * pick yet, do so now.
+	 */
+	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+	    rq->core->core_pick_seq != rq->core_sched_seq) {
+		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+		next = rq->core_pick;
+		if (next != prev) {
+			put_prev_task(rq, prev);
+			set_next_task(rq, next);
+		}
+		return next;
+	}
+
+	prev->sched_class->put_prev_task(rq, prev, rf);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
+	cpu = cpu_of(rq);
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+
+	/* reset state */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		rq_i->core_pick = NULL;
+
+		if (i != cpu)
+			update_rq_clock(rq_i);
+	}
+
+	/*
+	 * Try and select tasks for each sibling in descending sched_class
+	 * order.
+	 */
+	for_each_class(class) {
+again:
+		for_each_cpu_wrap(i, smt_mask, cpu) {
+			struct rq *rq_i = cpu_rq(i);
+			struct task_struct *p;
+
+			if (rq_i->core_pick)