On 08/17/2018 11:27 AM, Steve Muckle wrote:
From: John Dias <joaod...@google.com>

When rt_mutex_setprio changes a task's scheduling class to RT,
we're seeing cases where the task's vruntime is not updated
correctly upon return to the fair class.
Specifically, the following is being observed:
- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes
   the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point
   the vruntime_normalized check sees that the task's state is TASK_WAKING,
   which results in skipping the subtraction of the rq's min_vruntime
   from the task's vruntime
- later, when the prio is deboosted and the task is moved back
   to the fair class, the fair rq's min_vruntime is added to
   the task's vruntime, even though it wasn't subtracted earlier.
The immediate result is inflation of the task's vruntime, giving
it lower priority (starving it if there's enough available work).
The longer-term effect is inflation of all vruntimes because the
task's vruntime becomes the rq's min_vruntime when the higher
priority tasks go idle. That leads to a vicious cycle, where
the vruntime inflation repeatedly doubled.

The change here is to detect when vruntime_normalized is being
called when the task is waking but is waking in another class,
and to conclude that this is a case where vruntime has not
been normalized.

Signed-off-by: John Dias <joaod...@google.com>
Signed-off-by: Steve Muckle <smuc...@google.com>
---
  kernel/sched/fair.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..14011d7929d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct 
*p)
         * - A task which has been woken up by try_to_wake_up() and
         *   waiting for actually being woken up by sched_ttwu_pending().
         */
-       if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+       if (!se->sum_exec_runtime ||
+           (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
                return true;
return false;
The normalization of vruntime used to exist in task_waking but it was removed and the normalization was moved into migrate_task_rq_fair. The reasoning being that task_waking_fair was only hit when a task is queued onto a different core and migrate_task_rq_fair should do the same work.

However, we're finding that there's one case which migrate_task_rq_fair doesn't hit: that being the case where rt_mutex_setprio changes a task's scheduling class to RT when its scheduled out. The task never hits migrate_task_rq_fair because it is switched to RT and migrates as an RT task. Because of this we're getting an unbounded addition of min_vruntime when the task is re-attached to the CFS runqueue when it loses the inherited priority. The patch above works because now the kernel specifically checks for this case and normalizes accordingly.

Here's the patch I was talking about: https://lore.kernel.org/patchwork/patch/677689/. In our testing we were seeing vruntimes nearly double every time after rt_mutex_setprio boosts the task to RT.

Signed-off-by: Miguel de Dios <miguelded...@google.com>
Tested-by: Miguel de Dios <miguelded...@google.com>

Reply via email to