Gentle ping...

-- Steve

On Fri, 24 Jun 2016 11:26:13 -0400
Steven Rostedt <rost...@goodmis.org> wrote:

> When a CPU lowers its priority (schedules out a high priority task for a
> lower priority one), a check is made to see if any other CPU has
> overloaded RT tasks (more than one). It checks the rto_mask to determine
> this, and if so, it will request to pull one of those tasks to itself if
> the non-running RT task is of higher priority than the new priority of
> the next task to run on the current CPU.
>
> When we deal with a large number of CPUs, the original pull logic
> suffered from large lock contention on a single CPU run queue, which
> caused a huge latency across all CPUs. This was caused by only one CPU
> having overloaded RT tasks while a bunch of other CPUs were lowering
> their priority. To solve this issue, commit b6366f048e0c ("sched/rt: Use
> IPI to trigger RT task push migration instead of pulling") changed the
> way to request a pull. Instead of grabbing the lock of the overloaded
> CPU's runqueue, it simply sent an IPI to that CPU to do the work.
>
> Although the IPI logic worked very well in removing the large latency
> build-up, it could still suffer a large (although not as large as
> without the IPI) latency due to the work within the IPI. To understand
> this issue, an explanation of the IPI logic is required.
>
> When a CPU lowers its priority, it finds the next set bit in the
> rto_mask starting from its own CPU. That is, if bits 2 and 10 are set in
> the rto_mask, and CPU 8 lowers its priority, it will select CPU 10 to
> send its IPI to. Now, let's say that CPU 0 and CPU 1 lower their
> priority. They will both send their IPIs to CPU 2.
>
> If the IPI of CPU 0 gets to CPU 2 first, it triggers the push logic,
> and if CPU 1 has a lower priority than CPU 0, CPU 2 will push its
> overloaded task to CPU 1 (due to cpupri), even though the IPI came from
> CPU 0. Even though a task was pushed, we need to make sure that there
> are no higher priority tasks still waiting. Thus an IPI is then sent to
> CPU 10 for processing of CPU 0's request (remember, the pushed task went
> to CPU 1).
>
> When the IPI of CPU 1 reaches CPU 2, it will skip the push logic
> (because it no longer has any tasks to push), but it still needs to
> notify other CPUs about this CPU lowering its priority. Thus it sends
> another IPI to CPU 10, because that bit is still set in the rto_mask.
>
> Now CPU 10 has just finished dealing with the IPI of CPU 8, and even
> though it no longer has any RT tasks to push, it just received two more
> IPIs (from CPU 2, to deal with CPU 0 and CPU 1). It too must do work to
> see if it should continue sending an IPI to more rto_mask CPUs. If there
> are no more CPUs to send to, it still needs to "stop" the execution of
> the push request.
>
> Although these IPIs are fast to process, I've traced a single CPU
> dealing with 89 IPIs in a row, on an 80 CPU machine! This was caused by
> an overloaded RT task that had a limited CPU affinity, and most of the
> CPUs sending IPIs to it couldn't do anything with it. And because the
> CPUs were very active and changed their priorities again, they sent out
> duplicates. The latency of handling 89 IPIs was 200us (~2.3us to handle
> each IPI), as each IPI does require taking a spinlock that deals with
> the IPI itself (not a rq lock, and with very little contention).
>
> To solve this, an ipi_count is added to rt_rq that gets incremented when
> an IPI is sent to that runqueue.
> When looking for the next CPU to process, the ipi_count is checked to
> see if that CPU is already processing push requests, and if so, that
> CPU is skipped and the next CPU in the rto_mask is checked.
>
> The IPI code now needs to call push_rt_tasks() instead of just
> push_rt_task(), as it will not be receiving an IPI for each CPU that is
> requesting a pull.
>
> This change removes the duplication of work in the IPI logic, and
> greatly lowers the latency caused by the IPIs.
>
> Signed-off-by: Steven Rostedt <rost...@goodmis.org>
> ---
>  kernel/sched/rt.c    | 14 ++++++++++++--
>  kernel/sched/sched.h |  2 ++
>  2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index d5690b722691..165bcfdbd94b 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -100,6 +100,7 @@ void init_rt_rq(struct rt_rq *rt_rq)
>          rt_rq->push_flags = 0;
>          rt_rq->push_cpu = nr_cpu_ids;
>          raw_spin_lock_init(&rt_rq->push_lock);
> +        atomic_set(&rt_rq->ipi_count, 0);
>          init_irq_work(&rt_rq->push_work, push_irq_work_func);
>  #endif
>  #endif /* CONFIG_SMP */
> @@ -1917,6 +1918,10 @@ static int find_next_push_cpu(struct rq *rq)
>                          break;
>                  next_rq = cpu_rq(cpu);
>
> +                /* If pushing was already started, ignore */
> +                if (atomic_read(&next_rq->rt.ipi_count))
> +                        continue;
> +
>                  /* Make sure the next rq can push to this rq */
>                  if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
>                          break;
> @@ -1955,6 +1960,7 @@ static void tell_cpu_to_push(struct rq *rq)
>                  return;
>
>          rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
> +        atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
>
>          irq_work_queue_on(&rq->rt.push_work, cpu);
>  }
> @@ -1974,11 +1980,12 @@ static void try_to_push_tasks(void *arg)
>
>          rq = cpu_rq(this_cpu);
>          src_rq = rq_of_rt_rq(rt_rq);
> +        WARN_ON_ONCE(!atomic_read(&rq->rt.ipi_count));
>
>  again:
>          if (has_pushable_tasks(rq)) {
>                  raw_spin_lock(&rq->lock);
> -                push_rt_task(rq);
> +                push_rt_tasks(rq);
>                  raw_spin_unlock(&rq->lock);
>          }
>
> @@ -2000,7 +2007,7 @@ again:
>          raw_spin_unlock(&rt_rq->push_lock);
>
>          if (cpu >= nr_cpu_ids)
> -                return;
> +                goto out;
>
>          /*
>           * It is possible that a restart caused this CPU to be
> @@ -2011,7 +2018,10 @@ again:
>                  goto again;
>
>          /* Try the next RT overloaded CPU */
> +        atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
>          irq_work_queue_on(&rt_rq->push_work, cpu);
> +out:
> +        atomic_dec(&rq->rt.ipi_count);
>  }
>
>  static void push_irq_work_func(struct irq_work *work)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index de607e4febd9..b47d580dfa84 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -476,6 +476,8 @@ struct rt_rq {
>          int push_cpu;
>          struct irq_work push_work;
>          raw_spinlock_t push_lock;
> +        /* Used to skip CPUs being processed in the rto_mask */
> +        atomic_t ipi_count;
>  #endif
>  #endif /* CONFIG_SMP */
>          int rt_queued;
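
To make the effect of the ipi_count check a bit more concrete, here is a rough
standalone sketch of the idea. It is plain userspace C, not kernel code: the
CPU numbers, the NR_CPUS value, and the helper names (next_push_cpu,
lower_prio) are made up for illustration, it uses plain ints instead of
atomic_t, and it leaves out the locking, irq_work chaining, and priority
checks of the real code in kernel/sched/rt.c. It only shows how concurrent
requesters now fan out across the rto_mask instead of piling IPIs onto the
same overloaded CPU:

/*
 * Simplified model of the rto_mask walk with the ipi_count skip.
 * Userspace only: no rq locks, no irq_work, no cpupri.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 16

static bool rto_mask[NR_CPUS];  /* CPUs with more than one runnable RT task */
static int ipi_count[NR_CPUS];  /* push IPIs currently queued at each CPU */

/* Walk rto_mask starting after @cpu, skipping CPUs already handling a push. */
static int next_push_cpu(int cpu)
{
        for (int i = 1; i <= NR_CPUS; i++) {
                int next = (cpu + i) % NR_CPUS;

                if (!rto_mask[next])
                        continue;
                if (ipi_count[next])    /* pushing already started, ignore */
                        continue;
                return next;
        }
        return -1;
}

/* A CPU lowered its priority: queue at most one IPI at the chosen target. */
static void lower_prio(int cpu)
{
        int target = next_push_cpu(cpu);

        if (target < 0) {
                printf("CPU %2d: no IPI needed\n", cpu);
                return;
        }
        ipi_count[target]++;
        printf("CPU %2d: IPI queued to CPU %d\n", cpu, target);
}

int main(void)
{
        rto_mask[2] = rto_mask[10] = true;  /* only CPUs 2 and 10 overloaded */

        lower_prio(0);  /* -> IPI to CPU 2 */
        lower_prio(1);  /* CPU 2 already has an IPI -> goes to CPU 10 */
        lower_prio(8);  /* both targets busy -> no duplicate IPI at all */
        return 0;
}

In the real patch the target CPU also decrements its ipi_count once its
irq_work finishes (the new "out:" label) and forwards the IPI along the
chain, so a CPU becomes a valid target again as soon as it is done
processing.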