On Tue, Apr 09, 2019 at 02:38:55PM -0400, Julien Desfossez wrote:
> We found the source of the major performance regression we discussed
> previously. It turns out there was a pattern where a task (a kworker in this
> case) could be woken up, but the core could still end up idle before that
> task had a chance to run.
> 
> Example sequence: cpu0 and cpu1 are siblings on the same core; task1 and
> task2 are in the same cgroup with the tag enabled (each following line
> happens in increasing time order):
> - task1 running on cpu0, task2 running on cpu1
> - sched_waking(kworker/0, target_cpu=cpu0)
> - task1 scheduled out of cpu0
> - kworker/0 cannot run on cpu0 because task2 is still running on cpu1;
>   cpu0 is idle
> - task2 scheduled out of cpu1

But at this point core_cookie is still set; we don't clear it when the
last task goes away.

> - cpu1 doesn’t select kworker/0 for cpu0, because the optimization path ends
>   task selection early when core_cookie is NULL for both the currently
>   selected task and cpu1’s runqueue.

But at this point core_cookie is still set; we only (re)set it later to
p->core_cookie.

What I suspect happens is that you hit the 'again' clause due to a
higher prio @max on the second sibling. And at that point we've
destroyed core_cookie.
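From memory, the max-filter in that selection loop looks roughly like
the below -- I'm paraphrasing the RFC here, so the helper names
(core_prio_less(), cookie_match()) and the surrounding loop variables
may not match your tree exactly:

	if (!max || core_prio_less(max, p)) {
		struct task_struct *old_max = max;

		/* This overwrites the core-wide cookie... */
		rq->core->core_cookie = p->core_cookie;
		max = p;

		if (old_max && !cookie_match(old_max, p)) {
			for_each_cpu(j, smt_mask) {
				if (j == i)
					continue;
				cpu_rq(j)->core_pick = NULL;
			}
			/* ...and by the time we retry here, the old
			 * cookie is gone. */
			goto again;
		}
	}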

> - cpu1 is idle
> --> both siblings are idle but kworker/0 is still in cpu0's runqueue.
>     cpu0 may stay idle even longer if it enters a deep idle state.
> 
> With the fix below, we make sure to send an IPI to the sibling if it is idle
> but has tasks waiting in its runqueue.
> This fixes the performance issue we were seeing.
> 
> Now here is what we can measure with a disk write-intensive benchmark:
> - no performance impact with enabling core scheduling without any tagged
>   task,
> - 5% overhead if one tagged task is competing with an untagged task,
> - 10% overhead if 2 tasks tagged with a different tag are competing
>   against each other.
> 
> We are starting more scaling tests, but this is very encouraging!
> 
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e1fa10561279..02c862a5e973 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3779,7 +3779,22 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  
>  				trace_printk("unconstrained pick: %s/%d %lx\n",
>  						next->comm, next->pid, next->core_cookie);
> +				rq->core_pick = NULL;
>  
> +				/*
> +				 * If the sibling is idling, we might want to wake it
> +				 * so that it can check for any runnable but blocked
> +				 * tasks due to previous task matching.
> +				 */
> +				for_each_cpu(j, smt_mask) {
> +					struct rq *rq_j = cpu_rq(j);
> +
> +					rq_j->core_pick = NULL;
> +					if (j != cpu && is_idle_task(rq_j->curr) && rq_j->nr_running) {
> +						resched_curr(rq_j);
> +						trace_printk("IPI(%d->%d[%d]) idle preempt\n",
> +							     cpu, j, rq_j->nr_running);
> +					}
> +				}
>  				goto done;
>  			}

I'm thinking there is a more elegant solution hiding in there; possibly
saving/restoring that core_cookie on the again loop should do, but I've
always had the nagging suspicion that whole selection loop could be done
better.
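Something like the below is what I mean -- completely untested, with
'saved_cookie' being a new local captured before the selection loop
starts:

	unsigned long saved_cookie = rq->core->core_cookie;

	...

	if (old_max && !cookie_match(old_max, p)) {
		/*
		 * Put the core-wide cookie back before we wipe the
		 * slate and retry, instead of leaving the new max's
		 * cookie in place.
		 */
		rq->core->core_cookie = saved_cookie;

		for_each_cpu(j, smt_mask) {
			if (j == i)
				continue;
			cpu_rq(j)->core_pick = NULL;
		}
		goto again;
	}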
