When a CPU lowers its priority (schedules out a high-priority task for a lower-priority one), it checks whether any other CPU is overloaded with RT tasks (has more than one runnable RT task). It consults the rto_mask to determine this, and if so, it requests to pull one of those tasks to itself, provided the non-running RT task has a higher priority than the new priority of the next task to run on the current CPU.
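For illustration only, here is a minimal userspace sketch of that decision. The struct sim_rq type, the sim_rqs array and the should_pull_from() helper are invented for this example and are not kernel code; as in the kernel's internal prio values, a lower number means a higher RT priority.

/*
 * Userspace sketch (not kernel code) of the pull decision described
 * above: when a CPU lowers its priority, scan the rto_mask and pull
 * only if an overloaded CPU's waiting RT task has a higher priority
 * than what this CPU is about to run.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS		4

struct sim_rq {
	int highest_prio_curr;	/* prio of the task about to run */
	int highest_prio_next;	/* prio of the next (waiting) RT task */
	int nr_rt_running;	/* more than one => overloaded */
};

static struct sim_rq sim_rqs[NR_CPUS];
static unsigned long rto_mask;	/* bit set => that CPU is RT overloaded */

/* Lower numbers mean higher RT priority here, as in the kernel. */
static bool should_pull_from(int src, int this_cpu)
{
	if (!(rto_mask & (1UL << src)))
		return false;
	return sim_rqs[src].highest_prio_next <
	       sim_rqs[this_cpu].highest_prio_curr;
}

int main(void)
{
	int this_cpu = 0;

	/* CPU 2 is overloaded: it has a waiting RT task of prio 10. */
	rto_mask |= 1UL << 2;
	sim_rqs[2].nr_rt_running = 2;
	sim_rqs[2].highest_prio_next = 10;

	/* CPU 0 just lowered its priority; it is about to run prio 50. */
	sim_rqs[0].highest_prio_curr = 50;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu != this_cpu && should_pull_from(cpu, this_cpu))
			printf("CPU %d would pull from CPU %d\n",
			       this_cpu, cpu);
	}
	return 0;
}

The real kernel check is of course done against rq->rt.highest_prio and the rto_mask under the proper locking; the sketch only mirrors the priority comparison.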
When dealing with a large number of CPUs, the original pull logic suffered from heavy lock contention on a single CPU's run queue, which caused huge latency across all CPUs. This happened when only one CPU had overloaded RT tasks and a bunch of other CPUs were lowering their priority. To solve this issue, commit b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling") changed the way a pull is requested. Instead of grabbing the lock of the overloaded CPU's runqueue, it simply sends an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build up, it could still suffer a large latency (although not as large as without the IPI) due to the work done within the IPI. To understand this issue, an explanation of the IPI logic is required.

When a CPU lowers its priority, it finds the next set bit in the rto_mask starting from its own CPU. That is, if bits 2 and 10 are set in the rto_mask and CPU 8 lowers its priority, it will select CPU 10 to send its IPI to. Now, let's say that CPU 0 and CPU 1 lower their priority. They will both send their IPI to CPU 2. If the IPI from CPU 0 reaches CPU 2 first, it triggers the push logic, and if CPU 1 is running at a lower priority than CPU 0, CPU 2 will push its overloaded task to CPU 1 (due to cpupri), even though the IPI came from CPU 0.

Even though a task was pushed, we need to make sure there are no higher-priority tasks still waiting. Thus an IPI is then sent to CPU 10 to continue processing CPU 0's request (remember, the pushed task went to CPU 1). When the IPI from CPU 1 reaches CPU 2, CPU 2 skips the push logic (because it no longer has any tasks to push), but it still needs to notify other CPUs about CPU 1 lowering its priority. Thus it sends another IPI to CPU 10, because that bit is still set in the rto_mask.

Now CPU 10 has just finished dealing with the IPI of CPU 8, and even though it no longer has any RT tasks to push, it just received two more IPIs (from CPU 2, to deal with CPU 0 and CPU 1). It too must do work to see if it should continue sending an IPI to more rto_mask CPUs, and if there are no more CPUs to send to, it still needs to "stop" the execution of the push request.

Although these IPIs are fast to process, I've traced a single CPU dealing with 89 IPIs in a row, on an 80 CPU machine! This was caused by an overloaded RT task with a limited CPU affinity, where most of the CPUs sending IPIs to it couldn't do anything with it. And because those CPUs were very active and changed their priorities again, they sent out duplicates. The latency of handling the 89 IPIs was 200us (~2.3us to handle each IPI), as each IPI requires taking a spinlock that deals with the IPI itself (not a rq lock, and with very little contention).

To solve this, an ipi_count is added to rt_rq that gets incremented when an IPI is sent to that runqueue. When looking for the next CPU to process, the ipi_count is checked to see if that CPU is already processing push requests, and if so, that CPU is skipped and the next CPU in the rto_mask is checked. The IPI code now needs to call push_rt_tasks() instead of just push_rt_task(), as it will no longer receive an IPI for each CPU that is requesting a pull.

This change removes the duplication of work in the IPI logic and greatly lowers the latency caused by the IPIs.
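For illustration only, the following standalone C11 userspace sketch shows the idea behind the ipi_count gate (struct sim_rt_rq, find_next_push_cpu_sim() and send_push_ipi() are invented names, not the kernel implementation): an atomic counter per runqueue is bumped before IPI work is queued on a CPU, checked so that a CPU already walking the rto_mask is skipped, and dropped when that CPU finishes its pass.

/*
 * Userspace sketch of the ipi_count idea: skip CPUs that are already
 * processing push-IPI requests instead of queueing more work on them.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS		8

struct sim_rt_rq {
	atomic_int ipi_count;	/* > 0: this CPU already handles push IPIs */
	bool rt_overloaded;	/* stands in for the rto_mask bit */
};

static struct sim_rt_rq sim_rt_rqs[NR_CPUS];

/* Pick the next overloaded CPU after @start, skipping busy ones. */
static int find_next_push_cpu_sim(int start)
{
	for (int i = 1; i <= NR_CPUS; i++) {
		int cpu = (start + i) % NR_CPUS;

		if (!sim_rt_rqs[cpu].rt_overloaded)
			continue;
		/* Already processing push requests: don't pile on IPIs. */
		if (atomic_load(&sim_rt_rqs[cpu].ipi_count))
			continue;
		return cpu;
	}
	return -1;
}

static void send_push_ipi(int cpu)
{
	atomic_fetch_add(&sim_rt_rqs[cpu].ipi_count, 1);
	printf("IPI queued on CPU %d\n", cpu);
	/* ...the target would push its tasks, maybe forward the chain... */
	atomic_fetch_sub(&sim_rt_rqs[cpu].ipi_count, 1);
}

int main(void)
{
	sim_rt_rqs[2].rt_overloaded = true;

	/* CPU 2 is already busy: a second requester skips it entirely. */
	atomic_store(&sim_rt_rqs[2].ipi_count, 1);
	printf("next push cpu: %d\n", find_next_push_cpu_sim(0));

	/* Once CPU 2 is done, it can be selected again. */
	atomic_store(&sim_rt_rqs[2].ipi_count, 0);
	int cpu = find_next_push_cpu_sim(0);
	if (cpu >= 0)
		send_push_ipi(cpu);
	return 0;
}

In the patch below, the same role is played by rt_rq->ipi_count: incremented in tell_cpu_to_push() and try_to_push_tasks() before queueing the irq work on the next CPU, checked in find_next_push_cpu(), and decremented when the chain on that CPU ends.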
Signed-off-by: Steven Rostedt <rost...@goodmis.org>
---
 kernel/sched/rt.c    | 14 ++++++++++++--
 kernel/sched/sched.h |  2 ++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d5690b722691..165bcfdbd94b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -100,6 +100,7 @@ void init_rt_rq(struct rt_rq *rt_rq)
 	rt_rq->push_flags = 0;
 	rt_rq->push_cpu = nr_cpu_ids;
 	raw_spin_lock_init(&rt_rq->push_lock);
+	atomic_set(&rt_rq->ipi_count, 0);
 	init_irq_work(&rt_rq->push_work, push_irq_work_func);
 #endif
 #endif /* CONFIG_SMP */
@@ -1917,6 +1918,10 @@ static int find_next_push_cpu(struct rq *rq)
 			break;
 		next_rq = cpu_rq(cpu);
 
+		/* If pushing was already started, ignore */
+		if (atomic_read(&next_rq->rt.ipi_count))
+			continue;
+
 		/* Make sure the next rq can push to this rq */
 		if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
 			break;
@@ -1955,6 +1960,7 @@ static void tell_cpu_to_push(struct rq *rq)
 		return;
 
 	rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
+	atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
 
 	irq_work_queue_on(&rq->rt.push_work, cpu);
 }
@@ -1974,11 +1980,12 @@ static void try_to_push_tasks(void *arg)
 	rq = cpu_rq(this_cpu);
 	src_rq = rq_of_rt_rq(rt_rq);
 
+	WARN_ON_ONCE(!atomic_read(&rq->rt.ipi_count));
 again:
 	if (has_pushable_tasks(rq)) {
 		raw_spin_lock(&rq->lock);
-		push_rt_task(rq);
+		push_rt_tasks(rq);
 		raw_spin_unlock(&rq->lock);
 	}
 
@@ -2000,7 +2007,7 @@ again:
 	raw_spin_unlock(&rt_rq->push_lock);
 
 	if (cpu >= nr_cpu_ids)
-		return;
+		goto out;
 
 	/*
 	 * It is possible that a restart caused this CPU to be
@@ -2011,7 +2018,10 @@ again:
 		goto again;
 
 	/* Try the next RT overloaded CPU */
+	atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
 	irq_work_queue_on(&rt_rq->push_work, cpu);
+out:
+	atomic_dec(&rq->rt.ipi_count);
 }
 
 static void push_irq_work_func(struct irq_work *work)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de607e4febd9..b47d580dfa84 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -476,6 +476,8 @@ struct rt_rq {
 	int push_cpu;
 	struct irq_work push_work;
 	raw_spinlock_t push_lock;
+	/* Used to skip CPUs being processed in the rto_mask */
+	atomic_t ipi_count;
 #endif
 #endif /* CONFIG_SMP */
 	int rt_queued;
-- 
1.9.3