Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On Sat, 29 Sep 2018 at 01:36, Dietmar Eggemann wrote: > > On 09/28/2018 06:10 PM, Steve Muckle wrote: > > On 09/27/2018 05:43 PM, Wanpeng Li wrote: > On your CPU4: > scheduler_ipi() > -> sched_ttwu_pending() > -> ttwu_do_activate()=> p->sched_remote_wakeup should be > false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not > -> ttwu_activate() > -> activate_task() > -> enqueue_task() > -> enqueue_task_fair() > -> enqueue_entity() > bool renorm = !(flags & > ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE) > so renorm is false in enqueue_entity(), why you mentioned that the > cfs_rq->min_vruntime is still added to the se->vruntime in > enqueue_task_fair()? > >>> > >>> Maybe this is a misunderstanding on my side but didn't you asked me to > >>> '... Could you point out when the fair rq's min_vruntime is added to the > >>> task's vruntime in your *later* scenario? ...' > >> > >> Yeah, if the calltrace above and my analysis is correct, then the fair > >> rq's min_vruntime will not be added to the task's vruntime in your > >> *later* scenario, which means that your patch is not necessary. > > > > In the scenario I observed, the task is not waking - it is running and > > being deboosted from priority inheritance, transitioning from RT to CFS. > > > > Dietmar and I both were able to reproduce the issue with the testcase I > > posted earlier in this thread. > > Correct, and with the same testcase I got this call stack in this scenario: > > [ 35.588509] CPU: 1 PID: 2926 Comm: fair_task Not tainted > 4.18.0-rc6-00052-g11b7dafa2edb-dirty #5 > [ 35.597217] Hardware name: ARM Juno development board (r0) (DT) > [ 35.603080] Call trace: > [ 35.605509] dump_backtrace+0x0/0x168 > [ 35.609138] show_stack+0x24/0x30 > [ 35.612424] dump_stack+0xac/0xe4 > [ 35.615710] enqueue_task_fair+0xae0/0x11c0 > [ 35.619854] rt_mutex_setprio+0x5a0/0x628 > [ 35.623827] mark_wakeup_next_waiter+0x7c/0xc8 > [ 35.628228] __rt_mutex_futex_unlock+0x30/0x50 > [ 35.632630] do_futex+0x74c/0xb28 > [ 35.635912] sys_futex+0x118/0x198 > [ 35.639280] el0_svc_naked+0x30/0x34 Thanks for pointing out. :) Regards, Wanpeng Li
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 09/28/2018 06:10 PM, Steve Muckle wrote: On 09/27/2018 05:43 PM, Wanpeng Li wrote: On your CPU4: scheduler_ipi() -> sched_ttwu_pending() -> ttwu_do_activate() => p->sched_remote_wakeup should be false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not -> ttwu_activate() -> activate_task() -> enqueue_task() -> enqueue_task_fair() -> enqueue_entity() bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE) so renorm is false in enqueue_entity(), why you mentioned that the cfs_rq->min_vruntime is still added to the se->vruntime in enqueue_task_fair()? Maybe this is a misunderstanding on my side but didn't you asked me to '... Could you point out when the fair rq's min_vruntime is added to the task's vruntime in your *later* scenario? ...' Yeah, if the calltrace above and my analysis is correct, then the fair rq's min_vruntime will not be added to the task's vruntime in your *later* scenario, which means that your patch is not necessary. In the scenario I observed, the task is not waking - it is running and being deboosted from priority inheritance, transitioning from RT to CFS. Dietmar and I both were able to reproduce the issue with the testcase I posted earlier in this thread. Correct, and with the same testcase I got this call stack in this scenario: [ 35.588509] CPU: 1 PID: 2926 Comm: fair_task Not tainted 4.18.0-rc6-00052-g11b7dafa2edb-dirty #5 [ 35.597217] Hardware name: ARM Juno development board (r0) (DT) [ 35.603080] Call trace: [ 35.605509] dump_backtrace+0x0/0x168 [ 35.609138] show_stack+0x24/0x30 [ 35.612424] dump_stack+0xac/0xe4 [ 35.615710] enqueue_task_fair+0xae0/0x11c0 [ 35.619854] rt_mutex_setprio+0x5a0/0x628 [ 35.623827] mark_wakeup_next_waiter+0x7c/0xc8 [ 35.628228] __rt_mutex_futex_unlock+0x30/0x50 [ 35.632630] do_futex+0x74c/0xb28 [ 35.635912] sys_futex+0x118/0x198 [ 35.639280] el0_svc_naked+0x30/0x34 [...]
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 09/28/2018 02:43 AM, Wanpeng Li wrote: On Thu, 27 Sep 2018 at 21:23, Dietmar Eggemann wrote: On 09/27/2018 03:19 AM, Wanpeng Li wrote: On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann wrote: Hi, On 09/26/2018 11:50 AM, Wanpeng Li wrote: Hi Dietmar, On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann wrote: On 08/27/2018 12:14 PM, Peter Zijlstra wrote: On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote: On 08/24/2018 02:47 AM, Peter Zijlstra wrote: On 08/17/2018 11:27 AM, Steve Muckle wrote: [...] - later, when the prio is deboosted and the task is moved back to the fair class, the fair rq's min_vruntime is added to the task's vruntime, even though it wasn't subtracted earlier. Could you point out when the fair rq's min_vruntime is added to the task's vruntime in your *later* scenario? attach_task_cfs_rq will not do that the same reason as detach_task_cfs_rq. fair task's sched_remote_wakeup is false which results in vruntime will not be renormalized in enqueue_entity. The cfs_rq->min_vruntime is still added to the se->vruntime in enqueue_task_fair(). I understand what your patch done, It's not my patch ;-) I just helped to find out under which circumstances this issue can happen. On your CPU4: scheduler_ipi() -> sched_ttwu_pending() -> ttwu_do_activate()=> p->sched_remote_wakeup should be false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not -> ttwu_activate() -> activate_task() -> enqueue_task() -> enqueue_task_fair() -> enqueue_entity() bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE) so renorm is false in enqueue_entity(), why you mentioned that the cfs_rq->min_vruntime is still added to the se->vruntime in enqueue_task_fair()? Maybe this is a misunderstanding on my side but didn't you asked me to '... Could you point out when the fair rq's min_vruntime is added to the task's vruntime in your *later* scenario? ...' Yeah, if the calltrace above and my analysis is correct, then the fair rq's min_vruntime will not be added to the task's vruntime in your *later* scenario, which means that your patch is not necessary. Ah, ok, both, the ENQUEUE_WAKEUP and the ENQUEUE_MIGRATED are not set in the enqueue_entity() call so renorm is 1 (flags is 0xe).
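For context, the renormalisation being discussed here sits at the top of enqueue_entity() in kernel/sched/fair.c; the sketch below is abridged from roughly v4.18 and exact lines may differ between versions. Whenever renorm evaluates true, cfs_rq->min_vruntime is added back onto se->vruntime:

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
        bool curr = cfs_rq->curr == se;

        /*
         * If we're the current task, we must renormalise before calling
         * update_curr().
         */
        if (renorm && curr)
                se->vruntime += cfs_rq->min_vruntime;

        update_curr(cfs_rq);

        /*
         * Otherwise, renormalise after, such that we're placed at the current
         * moment in time, instead of some random moment in the past.
         */
        if (renorm && !curr)
                se->vruntime += cfs_rq->min_vruntime;
        ...
}

In the deboost case the task is being requeued rather than woken, so ENQUEUE_WAKEUP is not among the flags (0xe corresponds to ENQUEUE_RESTORE | ENQUEUE_MOVE | ENQUEUE_NOCLOCK); renorm is therefore true and min_vruntime gets added even though it was never subtracted when the task left the fair class.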
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On Fri, Sep 28, 2018 at 9:10 AM, 'Steve Muckle' via kernel-team wrote: > On 09/27/2018 05:43 PM, Wanpeng Li wrote: On your CPU4: scheduler_ipi() -> sched_ttwu_pending() -> ttwu_do_activate()=> p->sched_remote_wakeup should be false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not -> ttwu_activate() -> activate_task() -> enqueue_task() -> enqueue_task_fair() -> enqueue_entity() bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE) so renorm is false in enqueue_entity(), why you mentioned that the cfs_rq->min_vruntime is still added to the se->vruntime in enqueue_task_fair()? >>> >>> >>> Maybe this is a misunderstanding on my side but didn't you asked me to >>> '... Could you point out when the fair rq's min_vruntime is added to the >>> task's vruntime in your *later* scenario? ...' >> >> >> Yeah, if the calltrace above and my analysis is correct, then the fair >> rq's min_vruntime will not be added to the task's vruntime in your >> *later* scenario, which means that your patch is not necessary. > > > In the scenario I observed, the task is not waking - it is running and being > deboosted from priority inheritance, transitioning from RT to CFS. > > Dietmar and I both were able to reproduce the issue with the testcase I > posted earlier in this thread. Can this issue still show up say if the wake up was not remote? Say the task was locally awakened. In that case do we still need to check the class in vruntime_normalized like John was doing? Just want to make sure we caught all scenarios. - Joel
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On Wed, Sep 26, 2018 at 6:19 PM, Wanpeng Li wrote: > On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann > wrote: >> >> Hi, >> >> On 09/26/2018 11:50 AM, Wanpeng Li wrote: >> > Hi Dietmar, >> > On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann >> > wrote: >> >> >> >> On 08/27/2018 12:14 PM, Peter Zijlstra wrote: >> >>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote: >> On 08/24/2018 02:47 AM, Peter Zijlstra wrote: >> >>> On 08/17/2018 11:27 AM, Steve Muckle wrote: >> >> [...] >> >> - later, when the prio is deboosted and the task is moved back >> to the fair class, the fair rq's min_vruntime is added to >> the task's vruntime, even though it wasn't subtracted earlier. >> > >> > Could you point out when the fair rq's min_vruntime is added to the >> > task's vruntime in your *later* scenario? attach_task_cfs_rq will not >> > do that the same reason as detach_task_cfs_rq. fair task's >> > sched_remote_wakeup is false which results in vruntime will not be >> > renormalized in enqueue_entity. >> >> The cfs_rq->min_vruntime is still added to the se->vruntime in >> enqueue_task_fair(). > > I understand what your patch done, > > On your CPU4: > scheduler_ipi() > -> sched_ttwu_pending() > -> ttwu_do_activate()=> p->sched_remote_wakeup should be > false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not >-> ttwu_activate() > -> activate_task() > -> enqueue_task() > -> enqueue_task_fair() >-> enqueue_entity() > bool renorm = !(flags & > ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE) > so renorm is false in enqueue_entity(), why you mentioned that the > cfs_rq->min_vruntime is still added to the se->vruntime in > enqueue_task_fair()? If I understand John's original patch correctly, the additional vruntime is added when the class of the waking task is changed during priority deboost so that path is different from the one you listed above? Perhaps you should be looking at rt_mutex_setprio? - Joel
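For reference, the deboost path Joel points at: rt_mutex_setprio() switches p->sched_class and then calls check_class_changed(), whose class-change callbacks for CFS are switched_from_fair()/switched_to_fair(), i.e. detach_task_cfs_rq()/attach_task_cfs_rq(). Abridged from roughly v4.18; details may differ between versions:

static inline void check_class_changed(struct rq *rq, struct task_struct *p,
                                       const struct sched_class *prev_class,
                                       int oldprio)
{
        if (prev_class != p->sched_class) {
                if (prev_class->switched_from)
                        prev_class->switched_from(rq, p);

                p->sched_class->switched_to(rq, p);
        } else if (oldprio != p->prio || dl_task(p))
                p->sched_class->prio_changed(rq, p, oldprio);
}

static void switched_from_fair(struct rq *rq, struct task_struct *p)
{
        detach_task_cfs_rq(p);
}

static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
        attach_task_cfs_rq(p);
        ...
}

So the boost (fair -> rt) goes through detach_task_cfs_rq() and the later deboost (rt -> fair) through attach_task_cfs_rq()/enqueue_task_fair(), which is where the asymmetry described in the patch shows up.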
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 09/27/2018 05:43 PM, Wanpeng Li wrote: On your CPU4: scheduler_ipi() -> sched_ttwu_pending() -> ttwu_do_activate()=> p->sched_remote_wakeup should be false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not -> ttwu_activate() -> activate_task() -> enqueue_task() -> enqueue_task_fair() -> enqueue_entity() bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE) so renorm is false in enqueue_entity(), why you mentioned that the cfs_rq->min_vruntime is still added to the se->vruntime in enqueue_task_fair()? Maybe this is a misunderstanding on my side but didn't you asked me to '... Could you point out when the fair rq's min_vruntime is added to the task's vruntime in your *later* scenario? ...' Yeah, if the calltrace above and my analysis is correct, then the fair rq's min_vruntime will not be added to the task's vruntime in your *later* scenario, which means that your patch is not necessary. In the scenario I observed, the task is not waking - it is running and being deboosted from priority inheritance, transitioning from RT to CFS. Dietmar and I both were able to reproduce the issue with the testcase I posted earlier in this thread. thanks, Steve
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On Thu, 27 Sep 2018 at 21:23, Dietmar Eggemann wrote: > > On 09/27/2018 03:19 AM, Wanpeng Li wrote: > > On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann > > wrote: > >> > >> Hi, > >> > >> On 09/26/2018 11:50 AM, Wanpeng Li wrote: > >>> Hi Dietmar, > >>> On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann > >>> wrote: > > On 08/27/2018 12:14 PM, Peter Zijlstra wrote: > > On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote: > >> On 08/24/2018 02:47 AM, Peter Zijlstra wrote: > > On 08/17/2018 11:27 AM, Steve Muckle wrote: > >> > >> [...] > >> > >> - later, when the prio is deboosted and the task is moved back > >>to the fair class, the fair rq's min_vruntime is added to > >>the task's vruntime, even though it wasn't subtracted > >> earlier. > >>> > >>> Could you point out when the fair rq's min_vruntime is added to the > >>> task's vruntime in your *later* scenario? attach_task_cfs_rq will not > >>> do that the same reason as detach_task_cfs_rq. fair task's > >>> sched_remote_wakeup is false which results in vruntime will not be > >>> renormalized in enqueue_entity. > >> > >> The cfs_rq->min_vruntime is still added to the se->vruntime in > >> enqueue_task_fair(). > > > > I understand what your patch done, > > It's not my patch ;-) I just helped to find out under which > circumstances this issue can happen. > > > On your CPU4: > > scheduler_ipi() > > -> sched_ttwu_pending() > >-> ttwu_do_activate()=> p->sched_remote_wakeup should be > > false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not > > -> ttwu_activate() > > -> activate_task() > > -> enqueue_task() > >-> enqueue_task_fair() > > -> enqueue_entity() > > bool renorm = !(flags & > > ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE) > > so renorm is false in enqueue_entity(), why you mentioned that the > > cfs_rq->min_vruntime is still added to the se->vruntime in > > enqueue_task_fair()? > > Maybe this is a misunderstanding on my side but didn't you asked me to > '... Could you point out when the fair rq's min_vruntime is added to the > task's vruntime in your *later* scenario? ...' Yeah, if the calltrace above and my analysis is correct, then the fair rq's min_vruntime will not be added to the task's vruntime in your *later* scenario, which means that your patch is not necessary. Regards, Wanpeng Li
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 09/27/2018 03:19 AM, Wanpeng Li wrote: On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann wrote: Hi, On 09/26/2018 11:50 AM, Wanpeng Li wrote: Hi Dietmar, On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann wrote: On 08/27/2018 12:14 PM, Peter Zijlstra wrote: On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote: On 08/24/2018 02:47 AM, Peter Zijlstra wrote: On 08/17/2018 11:27 AM, Steve Muckle wrote: [...] - later, when the prio is deboosted and the task is moved back to the fair class, the fair rq's min_vruntime is added to the task's vruntime, even though it wasn't subtracted earlier. Could you point out when the fair rq's min_vruntime is added to the task's vruntime in your *later* scenario? attach_task_cfs_rq will not do that the same reason as detach_task_cfs_rq. fair task's sched_remote_wakeup is false which results in vruntime will not be renormalized in enqueue_entity. The cfs_rq->min_vruntime is still added to the se->vruntime in enqueue_task_fair(). I understand what your patch done, It's not my patch ;-) I just helped to find out under which circumstances this issue can happen. On your CPU4: scheduler_ipi() -> sched_ttwu_pending() -> ttwu_do_activate()=> p->sched_remote_wakeup should be false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not -> ttwu_activate() -> activate_task() -> enqueue_task() -> enqueue_task_fair() -> enqueue_entity() bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE) so renorm is false in enqueue_entity(), why you mentioned that the cfs_rq->min_vruntime is still added to the se->vruntime in enqueue_task_fair()? Maybe this is a misunderstanding on my side but didn't you asked me to '... Could you point out when the fair rq's min_vruntime is added to the task's vruntime in your *later* scenario? ...'
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann wrote: > > Hi, > > On 09/26/2018 11:50 AM, Wanpeng Li wrote: > > Hi Dietmar, > > On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann > > wrote: > >> > >> On 08/27/2018 12:14 PM, Peter Zijlstra wrote: > >>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote: > On 08/24/2018 02:47 AM, Peter Zijlstra wrote: > >>> On 08/17/2018 11:27 AM, Steve Muckle wrote: > > [...] > > - later, when the prio is deboosted and the task is moved back > to the fair class, the fair rq's min_vruntime is added to > the task's vruntime, even though it wasn't subtracted earlier. > > > > Could you point out when the fair rq's min_vruntime is added to the > > task's vruntime in your *later* scenario? attach_task_cfs_rq will not > > do that the same reason as detach_task_cfs_rq. fair task's > > sched_remote_wakeup is false which results in vruntime will not be > > renormalized in enqueue_entity. > > The cfs_rq->min_vruntime is still added to the se->vruntime in > enqueue_task_fair(). I understand what your patch done, On your CPU4: scheduler_ipi() -> sched_ttwu_pending() -> ttwu_do_activate()=> p->sched_remote_wakeup should be false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not -> ttwu_activate() -> activate_task() -> enqueue_task() -> enqueue_task_fair() -> enqueue_entity() bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE) so renorm is false in enqueue_entity(), why you mentioned that the cfs_rq->min_vruntime is still added to the se->vruntime in enqueue_task_fair()? Regards, Wanpeng Li
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
Hi, On 09/26/2018 11:50 AM, Wanpeng Li wrote: > Hi Dietmar, > On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann > wrote: >> >> On 08/27/2018 12:14 PM, Peter Zijlstra wrote: >>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote: On 08/24/2018 02:47 AM, Peter Zijlstra wrote: >>> On 08/17/2018 11:27 AM, Steve Muckle wrote: [...] - later, when the prio is deboosted and the task is moved back to the fair class, the fair rq's min_vruntime is added to the task's vruntime, even though it wasn't subtracted earlier. > > Could you point out when the fair rq's min_vruntime is added to the > task's vruntime in your *later* scenario? attach_task_cfs_rq will not > do that the same reason as detach_task_cfs_rq. fair task's > sched_remote_wakeup is false which results in vruntime will not be > renormalized in enqueue_entity. The cfs_rq->min_vruntime is still added to the se->vruntime in enqueue_task_fair(). It's just that without this patch, which adds the '&& p->sched_remote_wakeup' bit to the condition under which vruntime_normalized() returns true, detach_task_cfs_rq() won't go into the 'if (!vruntime_normalized(p))' path and not subtract cfs_rq->min_vruntime from se->vruntime. Since 'task_cpu(p) equal cpu' in try_to_wake_up() for the fair task, WF_MIGRATED is not set and set_task_cpu() -> migrate_task_rq_fair() is not called which could subtract cfs_rq->min_vruntime from se->vruntime as well. My former example with a different set of trace events: fair_task-3580 [004]35.389346: sched_stat_runtime: comm=fair_task pid=3580 runtime=45312 [ns] vruntime=46922871 [ns] <-- se->vruntime=46.922.871 ... rt_task-3579 [000]35.391573: sched_waking: comm=fair_task pid=3580 prio=120 target_cpu=004 ... rt_task-3579 [000]35.391627: sched_pi_setprio: comm=fair_task pid=3580 oldprio=120 newprio=19 ... rt_task-3579 [000]35.391661: bprint: detach_task_cfs_rq: task=fair_task pid=3580 cpu=4 vruntime_normalized=1 rt_task-3579 [000]35.391706: sched_switch: rt_task:3579 [19] D ==> swapper/0:0 [120] -0 [004]35.391834: sched_wakeup: fair_task:3580 [19] success=1 CPU:004 -0 [004]35.391840: sched_switch: swapper/4:0 [120] S ==> fair_task:3580 [19] fair_task-3580 [004]35.391853: sched_pi_setprio: comm=fair_task pid=3580 oldprio=19 newprio=120 ... fair_task-3580 [004]35.391863: bprint: enqueue_task_fair: task=fair_task pid=3580 curr=0 se->vruntime=93845742 cpu=4 cfs_rq->min_vruntime=46922871 ... fair_task-3580 [004]35.391877: sched_waking: comm=rt_task pid=3579 prio=19 target_cpu=000 ... fair_task-3580 [004]35.391885: sched_stat_runtime: comm=fair_task pid=3580 runtime=31250 [ns] vruntime=93876992 [ns] <-- se->vruntime=93.876.992
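Worth spelling out from the trace above: at the post-deboost enqueue, se->vruntime = 93845742 ns = 46922871 + 46922871, i.e. cfs_rq->min_vruntime (46922871 ns) is added to a vruntime from which it was never subtracted, doubling it. The subsequent sched_stat_runtime value of 93876992 ns is simply that inflated vruntime plus the 31250 ns of runtime.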
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
Hi Dietmar, On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann wrote: > > On 08/27/2018 12:14 PM, Peter Zijlstra wrote: > > On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote: > >> On 08/24/2018 02:47 AM, Peter Zijlstra wrote: > > On 08/17/2018 11:27 AM, Steve Muckle wrote: > >>> > >> When rt_mutex_setprio changes a task's scheduling class to RT, > >> we're seeing cases where the task's vruntime is not updated > >> correctly upon return to the fair class. > >>> > >> Specifically, the following is being observed: > >> - task is deactivated while still in the fair class > >> - task is boosted to RT via rt_mutex_setprio, which changes > >> the task to RT and calls check_class_changed. > >> - check_class_changed leads to detach_task_cfs_rq, at which point > >> the vruntime_normalized check sees that the task's state is > >> TASK_WAKING, > >> which results in skipping the subtraction of the rq's min_vruntime > >> from the task's vruntime > >> - later, when the prio is deboosted and the task is moved back > >> to the fair class, the fair rq's min_vruntime is added to > >> the task's vruntime, even though it wasn't subtracted earlier. Could you point out when the fair rq's min_vruntime is added to the task's vruntime in your *later* scenario? attach_task_cfs_rq will not do that the same reason as detach_task_cfs_rq. fair task's sched_remote_wakeup is false which results in vruntime will not be renormalized in enqueue_entity. Regards, Wanpeng Li > >>> > >>> I'm thinking that is an incomplete scenario; where do we get to > >>> TASK_WAKING. > >> > >> Yes there's a missing bit of context here at the beginning that the task to > >> be boosted had already been put into TASK_WAKING. > > > > See, I'm confused... > > > > The only time TASK_WAKING is visible, is if we've done a remote wakeup > > and it's 'stuck' on the remote wake_list. And in that case we've done > > migrate_task_rq_fair() on it. > > > > So by the time either rt_mutex_setprio() or __sched_setscheduler() get > > to calling check_class_changed(), under both pi_lock and rq->lock, the > > vruntime_normalized() thing should be right. > > > > So please detail the exact scenario. Because I'm not seeing it. > > Using Steve's test program (https://lkml.org/lkml/2018/8/24/686) I see the > issue but only if the two tasks (rt_task, fair_task) run on 2 cpus which > don't share LLC (e.g. CPU0 and CPU4 on hikey960). > > So the wakeup goes the TTWU_QUEUE && !share_cache (ttwu_queue_remote) path. > > ... 
> rt_task-3579 [000] 35.391573: sched_waking: comm=fair_task pid=3580 > prio=120 target_cpu=004 > rt_task-3579 [000] 35.391580: bprint: try_to_wake_up: > try_to_wake_up: task=fair_task pid=3580 task_cpu(p)=4 p->on_rq=0 > rt_task-3579 [000] 35.391584: bprint: try_to_wake_up: ttwu_queue: > task=fair_task pid=3580 > rt_task-3579 [000] 35.391588: bprint: try_to_wake_up: > ttwu_queue_remote: task=fair_task pid=3580 > rt_task-3579 [000] 35.391591: bprint: try_to_wake_up: > ttwu_queue_remote: cpu=4 smp_send_reschedule > rt_task-3579 [000] 35.391627: sched_pi_setprio: comm=fair_task pid=3580 > oldprio=120 newprio=19 > rt_task-3579 [000] 35.391635: bprint: rt_mutex_setprio: > task=fair_task pid=3580 prio=120->19 queued=0 running=0 state=0x200 > vruntime=46922871 cpu=4 cfs_rq->min_vruntime=7807420844 > rt_task-3579 [000] 35.391641: bprint: rt_mutex_setprio: p->prio > set: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 > vruntime=46922871 > rt_task-3579 [000] 35.391646: bprint: rt_mutex_setprio: queued > checked: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 > vruntime=46922871 > rt_task-3579 [000] 35.391652: bprint: rt_mutex_setprio: running > checked: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 > vruntime=46922871 > rt_task-3579 [000] 35.391657: bprint: rt_mutex_setprio: > fair_class=0x08da2c80 rt_class=0x08da2d70 > prev_class=0x08da2c80 p->sched_class=0x08da2d70 oldprio=120 > p->prio=19 > rt_task-3579 [000] 35.391661: bprint: detach_task_cfs_rq: > task=fair_task pid=3580 cpu=4 vruntime_normalized=1 > rt_task-3579 [000] 35.391706: sched_switch: rt_task:3579 [19] D ==> > swapper/0:0 [120] > -0 [004] 35.391828: bprint: ttwu_do_activate: > ttwu_do_activate: task=fair_task pid=3580 > -0 [004] 35.391832: bprint: ttwu_do_activate: > ttwu_activate: task=fair_task pid=3580 > -0 [004] 35.391833: bprint: ttwu_do_wakeup: > ttwu_do_wakeup: task=fair_task pid=3580 > -0 [004] 35.391834: sched_wakeup: fair_task:3580 [19] > success=1 CPU:004 > > It doesn't happen on hikey960 when I use two cpus of the same LLC or on my > laptop (i7-4750HQ).
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 09/07/2018 12:58 AM, Vincent Guittot wrote: On Fri, 7 Sep 2018 at 09:16, Juri Lelli wrote: On 06/09/18 16:25, Dietmar Eggemann wrote: Hi Juri, On 08/23/2018 11:54 PM, Juri Lelli wrote: On 23/08/18 18:52, Dietmar Eggemann wrote: Hi, On 08/21/2018 01:54 AM, Miguel de Dios wrote: On 08/17/2018 11:27 AM, Steve Muckle wrote: From: John Dias [...] Adding semaphores is possible but rt-app has no easy way to initialize individual objects, e.g. sem_init(..., value). The only way I see is via the global section, like "pi_enabled". But then, this is true for all objects of this kind (in this case mutexes)? Right, global section should work fine. Why do you think this is a problem/limitation? keep in mind that rt-app still have "ressources" section. This one is optional and almost never used as resources can be created on the fly but it's still there and can be used to initialize resources if needed like semaphore I wasn't aware of that but this will do the job AFAICS. I just have to re-introduce the direct calls to init_foo_resource() (in this case init_sem_resource()) in init_resource_data() and call that instead of init_resource_data() for semaphores listed in the global resources section. Example for a semaphore b_sem with initial value eq. 1: "resources" : { "b_sem" : { "type" : "sem_wait", "value" : 1 } } [...]
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On Fri, 7 Sep 2018 at 09:16, Juri Lelli wrote: > > On 06/09/18 16:25, Dietmar Eggemann wrote: > > Hi Juri, > > > > On 08/23/2018 11:54 PM, Juri Lelli wrote: > > > On 23/08/18 18:52, Dietmar Eggemann wrote: > > > > Hi, > > > > > > > > On 08/21/2018 01:54 AM, Miguel de Dios wrote: > > > > > On 08/17/2018 11:27 AM, Steve Muckle wrote: > > > > > > From: John Dias > > > > [...] > > > > > > > > > > I tried to catch this issue on my Arm64 Juno board using pi_test (and a > > > > slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from > > > > rt-tests but wasn't able to do so. > > > > > > > > # pi_stress --inversions=1 --duration=1 --groups=1 --sched > > > > id=low,policy=cfs > > > > > > > > Starting PI Stress Test > > > > Number of thread groups: 1 > > > > Duration of test run: 1 seconds > > > > Number of inversions per group: 1 > > > > Admin thread SCHED_FIFO priority 4 > > > > 1 groups of 3 threads will be created > > > >High thread SCHED_FIFO priority 3 > > > > Med thread SCHED_FIFO priority 2 > > > > Low thread SCHED_OTHER nice 0 > > > > > > > > # ./pip_stress > > > > > > > > In both cases, the cfs task entering rt_mutex_setprio() is queued, so > > > > dequeue_task_fair()->dequeue_entity(), which subtracts > > > > cfs_rq->min_vruntime > > > > from se->vruntime, is called on it before it gets the rt prio. > > > > > > > > Maybe it requires a very specific use of the pthread library to provoke > > > > this > > > > issue by making sure that the cfs tasks really blocks/sleeps? > > > > > > Maybe one could play with rt-app to recreate such specific use case? > > > > > > https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459 > > > > I played a little bit with rt-app on hikey960 to re-create Steve's test > > program. > > Oh, nice! Thanks for sharing what you have got. > > > Since there is no semaphore support (sem_wait(), sem_post()) I used > > condition variables (wait: pthread_cond_wait() , signal: > > pthread_cond_signal()). It's not really the same since this is stateless but > > sleeps before the signals help to maintain the state in this easy example. > > > > This provokes the vruntime issue e.g. for cpus 0,4 and it doesn't for 0,1: > > > > > > "global": { > > "calibration" : 130, > > "pi_enabled" : true > > }, > > "tasks": { > > "rt_task": { > > "loop" : 100, > > "policy" : "SCHED_FIFO", > > "cpus" : [0], > > > > "lock" : "b_mutex", > > "wait" : { "ref" : "b_cond", "mutex" : "b_mutex" }, > > "unlock" : "b_mutex", > > "sleep" : 3000, > > "lock1" : "a_mutex", > > "signal" : "a_cond", > > "unlock1" : "a_mutex", > > "lock2" : "pi-mutex", > > "unlock2" : "pi-mutex" > > }, > > "cfs_task": { > > "loop" : 100, > > "policy" : "SCHED_OTHER", > > "cpus" : [4], > > > > "lock" : "pi-mutex", > > "sleep" : 3000, > > "lock1" : "b_mutex", > > "signal" : "b_cond", > > "unlock" : "b_mutex", > > "lock2" : "a_mutex", > > "wait" : { "ref" : "a_cond", "mutex" : "a_mutex" }, > > "unlock1" : "a_mutex", > > "unlock2" : "pi-mutex" > > } > > } > > } > > > > Adding semaphores is possible but rt-app has no easy way to initialize > > individual objects, e.g. sem_init(..., value). The only way I see is via the > > global section, like "pi_enabled". But then, this is true for all objects of > > this kind (in this case mutexes)? > > Right, global section should work fine. Why do you think this is a > problem/limitation? keep in mind that rt-app still have "ressources" section. 
This one is optional and almost never used as resources can be created on the fly but it's still there and can be used to initialize resources if needed like semaphore > > > So the following couple of lines extension to rt-app works because both > > semaphores can be initialized to 0: > > > > { > > "global": { > > "calibration" : 130, > > "pi_enabled" : true > > }, > > "tasks": { > > "rt_task": { > > "loop" : 100, > > "policy" : "SCHED_FIFO", > > "cpus" : [0], > > > > "sem_wait" : "b_sem", > > "sleep" : 1000, > > "sem_post" : "a_sem", > > > > "lock" : "pi-mutex", > > "unlock" : "pi-mutex" > > }, > > "cfs_task": { > > "loop" : 100, > > "policy" : "SCHED_OTHER", > > "cpus" : [4], > > > > "lock" : "pi-mutex", > > "sleep" : 1000, > > "sem_post" : "b_sem", > > "sem_wait" : "a_sem", > > "unlock" : "pi-mutex" > > } > > } > > } > > > > Any thoughts on that? I can see something like this as infrastructure to > > create a regression test case based on rt-app and standard ftrace. > > Agree. I guess we s
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 06/09/18 16:25, Dietmar Eggemann wrote: > Hi Juri, > > On 08/23/2018 11:54 PM, Juri Lelli wrote: > > On 23/08/18 18:52, Dietmar Eggemann wrote: > > > Hi, > > > > > > On 08/21/2018 01:54 AM, Miguel de Dios wrote: > > > > On 08/17/2018 11:27 AM, Steve Muckle wrote: > > > > > From: John Dias > > [...] > > > > > > > I tried to catch this issue on my Arm64 Juno board using pi_test (and a > > > slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from > > > rt-tests but wasn't able to do so. > > > > > > # pi_stress --inversions=1 --duration=1 --groups=1 --sched > > > id=low,policy=cfs > > > > > > Starting PI Stress Test > > > Number of thread groups: 1 > > > Duration of test run: 1 seconds > > > Number of inversions per group: 1 > > > Admin thread SCHED_FIFO priority 4 > > > 1 groups of 3 threads will be created > > >High thread SCHED_FIFO priority 3 > > > Med thread SCHED_FIFO priority 2 > > > Low thread SCHED_OTHER nice 0 > > > > > > # ./pip_stress > > > > > > In both cases, the cfs task entering rt_mutex_setprio() is queued, so > > > dequeue_task_fair()->dequeue_entity(), which subtracts > > > cfs_rq->min_vruntime > > > from se->vruntime, is called on it before it gets the rt prio. > > > > > > Maybe it requires a very specific use of the pthread library to provoke > > > this > > > issue by making sure that the cfs tasks really blocks/sleeps? > > > > Maybe one could play with rt-app to recreate such specific use case? > > > > https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459 > > I played a little bit with rt-app on hikey960 to re-create Steve's test > program. Oh, nice! Thanks for sharing what you have got. > Since there is no semaphore support (sem_wait(), sem_post()) I used > condition variables (wait: pthread_cond_wait() , signal: > pthread_cond_signal()). It's not really the same since this is stateless but > sleeps before the signals help to maintain the state in this easy example. > > This provokes the vruntime issue e.g. for cpus 0,4 and it doesn't for 0,1: > > > "global": { > "calibration" : 130, > "pi_enabled" : true > }, > "tasks": { > "rt_task": { > "loop" : 100, > "policy" : "SCHED_FIFO", > "cpus" : [0], > > "lock" : "b_mutex", > "wait" : { "ref" : "b_cond", "mutex" : "b_mutex" }, > "unlock" : "b_mutex", > "sleep" : 3000, > "lock1" : "a_mutex", > "signal" : "a_cond", > "unlock1" : "a_mutex", > "lock2" : "pi-mutex", > "unlock2" : "pi-mutex" > }, > "cfs_task": { > "loop" : 100, > "policy" : "SCHED_OTHER", > "cpus" : [4], > > "lock" : "pi-mutex", > "sleep" : 3000, > "lock1" : "b_mutex", > "signal" : "b_cond", > "unlock" : "b_mutex", > "lock2" : "a_mutex", > "wait" : { "ref" : "a_cond", "mutex" : "a_mutex" }, > "unlock1" : "a_mutex", > "unlock2" : "pi-mutex" > } > } > } > > Adding semaphores is possible but rt-app has no easy way to initialize > individual objects, e.g. sem_init(..., value). The only way I see is via the > global section, like "pi_enabled". But then, this is true for all objects of > this kind (in this case mutexes)? Right, global section should work fine. Why do you think this is a problem/limitation? 
> So the following couple of lines extension to rt-app works because both > semaphores can be initialized to 0: > > { > "global": { > "calibration" : 130, > "pi_enabled" : true > }, > "tasks": { > "rt_task": { > "loop" : 100, > "policy" : "SCHED_FIFO", > "cpus" : [0], > > "sem_wait" : "b_sem", > "sleep" : 1000, > "sem_post" : "a_sem", > > "lock" : "pi-mutex", > "unlock" : "pi-mutex" > }, > "cfs_task": { > "loop" : 100, > "policy" : "SCHED_OTHER", > "cpus" : [4], > > "lock" : "pi-mutex", > "sleep" : 1000, > "sem_post" : "b_sem", > "sem_wait" : "a_sem", > "unlock" : "pi-mutex" > } > } > } > > Any thoughts on that? I can see something like this as infrastructure to > create a regression test case based on rt-app and standard ftrace. Agree. I guess we should add your first example to the repo (you'd be very welcome to create a PR) already and then work to support the second?
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
Hi Juri, On 08/23/2018 11:54 PM, Juri Lelli wrote: On 23/08/18 18:52, Dietmar Eggemann wrote: Hi, On 08/21/2018 01:54 AM, Miguel de Dios wrote: On 08/17/2018 11:27 AM, Steve Muckle wrote: From: John Dias [...] I tried to catch this issue on my Arm64 Juno board using pi_test (and a slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from rt-tests but wasn't able to do so. # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs Starting PI Stress Test Number of thread groups: 1 Duration of test run: 1 seconds Number of inversions per group: 1 Admin thread SCHED_FIFO priority 4 1 groups of 3 threads will be created High thread SCHED_FIFO priority 3 Med thread SCHED_FIFO priority 2 Low thread SCHED_OTHER nice 0 # ./pip_stress In both cases, the cfs task entering rt_mutex_setprio() is queued, so dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime from se->vruntime, is called on it before it gets the rt prio. Maybe it requires a very specific use of the pthread library to provoke this issue by making sure that the cfs tasks really blocks/sleeps? Maybe one could play with rt-app to recreate such specific use case? https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459 I played a little bit with rt-app on hikey960 to re-create Steve's test program. Since there is no semaphore support (sem_wait(), sem_post()) I used condition variables (wait: pthread_cond_wait() , signal: pthread_cond_signal()). It's not really the same since this is stateless but sleeps before the signals help to maintain the state in this easy example. This provokes the vruntime issue e.g. for cpus 0,4 and it doesn't for 0,1: "global": { "calibration" : 130, "pi_enabled" : true }, "tasks": { "rt_task": { "loop" : 100, "policy" : "SCHED_FIFO", "cpus" : [0], "lock" : "b_mutex", "wait" : { "ref" : "b_cond", "mutex" : "b_mutex" }, "unlock" : "b_mutex", "sleep" : 3000, "lock1" : "a_mutex", "signal" : "a_cond", "unlock1" : "a_mutex", "lock2" : "pi-mutex", "unlock2" : "pi-mutex" }, "cfs_task": { "loop" : 100, "policy" : "SCHED_OTHER", "cpus" : [4], "lock" : "pi-mutex", "sleep" : 3000, "lock1" : "b_mutex", "signal" : "b_cond", "unlock" : "b_mutex", "lock2" : "a_mutex", "wait" : { "ref" : "a_cond", "mutex" : "a_mutex" }, "unlock1" : "a_mutex", "unlock2" : "pi-mutex" } } } Adding semaphores is possible but rt-app has no easy way to initialize individual objects, e.g. sem_init(..., value). The only way I see is via the global section, like "pi_enabled". But then, this is true for all objects of this kind (in this case mutexes)? So the following couple of lines extension to rt-app works because both semaphores can be initialized to 0: { "global": { "calibration" : 130, "pi_enabled" : true }, "tasks": { "rt_task": { "loop" : 100, "policy" : "SCHED_FIFO", "cpus" : [0], "sem_wait" : "b_sem", "sleep" : 1000, "sem_post" : "a_sem", "lock" : "pi-mutex", "unlock" : "pi-mutex" }, "cfs_task": { "loop" : 100, "policy" : "SCHED_OTHER", "cpus" : [4], "lock" : "pi-mutex", "sleep" : 1000, "sem_post" : "b_sem", "sem_wait" : "a_sem", "unlock" : "pi-mutex" } } } Any thoughts on that? I can see something like this as infrastructure to create a regression test case based on rt-app and standard ftrace. [...]
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 08/29/2018 08:33 AM, Dietmar Eggemann wrote:
> Yes, this solves the issue for the case I described. Using
> 'p->sched_remote_wakeup' (WF_MIGRATED) looks more elegant than using
> 'p->sched_class == &fair_sched_class'.

It's confirmed that this patch solves the original issue we saw (and my test case passes for me as well). I will send this version out shortly.
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 08/29/2018 12:59 PM, Peter Zijlstra wrote:

On Wed, Aug 29, 2018 at 11:54:58AM +0100, Dietmar Eggemann wrote:

I forgot to mention that since fair_task's cpu affinity is restricted to CPU4, there is no call to set_task_cpu()->migrate_task_rq_fair() since if (task_cpu(p) != cpu) fails.

I think the combination of cpu affinity of the fair_task to CPU4 and the fact that the scheduler runs on CPU1 when waking fair_task (with the two cpus not sharing LLC) while TTWU_QUEUE is enabled is the situation in which this vruntime issue can happen.

Ohhh, D'0h. A remote wakeup that doesn't migrate.

Ah, there is this WF_MIGRATED flag, perfect for the distinction whether a task migrated or not.

That would suggest something like so:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..b3b62cf37fb6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	 * - A task which has been woken up by try_to_wake_up() and
 	 *   waiting for actually being woken up by sched_ttwu_pending().
 	 */
-	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+	if (!se->sum_exec_runtime ||
+	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
 		return true;
 
 	return false;

Yes, this solves the issue for the case I described. Using 'p->sched_remote_wakeup' (WF_MIGRATED) looks more elegant than using 'p->sched_class == &fair_sched_class'.
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On Wed, Aug 29, 2018 at 11:54:58AM +0100, Dietmar Eggemann wrote:
> I forgot to mention that since fair_task's cpu affinity is restricted to
> CPU4, there is no call to set_task_cpu()->migrate_task_rq_fair() since if
> (task_cpu(p) != cpu) fails.
>
> I think the combination of cpu affinity of the fair_task to CPU4 and the
> fact that the scheduler runs on CPU1 when waking fair_task (with the two
> cpus not sharing LLC) while TTWU_QUEUE is enabled is the situation in which
> this vruntime issue can happen.

Ohhh, D'0h. A remote wakeup that doesn't migrate.

That would suggest something like so:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..b3b62cf37fb6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	 * - A task which has been woken up by try_to_wake_up() and
 	 *   waiting for actually being woken up by sched_ttwu_pending().
 	 */
-	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+	if (!se->sum_exec_runtime ||
+	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
 		return true;
 
 	return false;
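For reference, the p->sched_remote_wakeup bit this relies on is recorded on the remote-wakeup path and replayed when the queued wakeup is processed on the target CPU; the two relevant lines, abridged from kernel/sched/core.c around v4.18, are:

        /* ttwu_queue_remote() */
        p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

        /* sched_ttwu_pending(), on the target CPU */
        llist_for_each_entry_safe(p, t, llist, wake_entry)
                ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);

For a remote wakeup that does not migrate (the CPU-pinned fair_task in this thread), sched_remote_wakeup stays false and migrate_task_rq_fair() never ran, so the extended condition correctly reports the vruntime as not normalized.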
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 08/28/2018 03:53 PM, Dietmar Eggemann wrote: On 08/27/2018 12:14 PM, Peter Zijlstra wrote: On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote: On 08/24/2018 02:47 AM, Peter Zijlstra wrote: On 08/17/2018 11:27 AM, Steve Muckle wrote: When rt_mutex_setprio changes a task's scheduling class to RT, we're seeing cases where the task's vruntime is not updated correctly upon return to the fair class. Specifically, the following is being observed: - task is deactivated while still in the fair class - task is boosted to RT via rt_mutex_setprio, which changes the task to RT and calls check_class_changed. - check_class_changed leads to detach_task_cfs_rq, at which point the vruntime_normalized check sees that the task's state is TASK_WAKING, which results in skipping the subtraction of the rq's min_vruntime from the task's vruntime - later, when the prio is deboosted and the task is moved back to the fair class, the fair rq's min_vruntime is added to the task's vruntime, even though it wasn't subtracted earlier. I'm thinking that is an incomplete scenario; where do we get to TASK_WAKING. Yes there's a missing bit of context here at the beginning that the task to be boosted had already been put into TASK_WAKING. See, I'm confused... The only time TASK_WAKING is visible, is if we've done a remote wakeup and it's 'stuck' on the remote wake_list. And in that case we've done migrate_task_rq_fair() on it. So by the time either rt_mutex_setprio() or __sched_setscheduler() get to calling check_class_changed(), under both pi_lock and rq->lock, the vruntime_normalized() thing should be right. So please detail the exact scenario. Because I'm not seeing it. Using Steve's test program (https://lkml.org/lkml/2018/8/24/686) I see the issue but only if the two tasks (rt_task, fair_task) run on 2 cpus which don't share LLC (e.g. CPU0 and CPU4 on hikey960). So the wakeup goes the TTWU_QUEUE && !share_cache (ttwu_queue_remote) path. I forgot to mention that since fair_task's cpu affinity is restricted to CPU4, there is no call to set_task_cpu()->migrate_task_rq_fair() since if (task_cpu(p) != cpu) fails. I think the combination of cpu affinity of the fair_task to CPU4 and the fact that the scheduler runs on CPU1 when waking fair_task (with the two cpus not sharing LLC) while TTWU_QUEUE is enabled is the situation in which this vruntime issue can happen. [...]
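The task_cpu(p) != cpu check referred to above lives in try_to_wake_up() (kernel/sched/core.c, abridged from roughly v4.18); only a wakeup that actually changes the task's CPU sets WF_MIGRATED and reaches migrate_task_rq_fair() via set_task_cpu():

        p->state = TASK_WAKING;
        ...
        cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
        if (task_cpu(p) != cpu) {
                wake_flags |= WF_MIGRATED;
                set_task_cpu(p, cpu);   /* invokes migrate_task_rq_fair() for CFS tasks */
        }
        ...
        ttwu_queue(p, cpu, wake_flags);

With fair_task affined to CPU4 and last run on CPU4, this branch is skipped, so the task sits in TASK_WAKING on the remote wake_list with a vruntime that still includes cfs_rq->min_vruntime when rt_mutex_setprio() comes along.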
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 08/27/2018 12:14 PM, Peter Zijlstra wrote: > On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote: >> On 08/24/2018 02:47 AM, Peter Zijlstra wrote: > On 08/17/2018 11:27 AM, Steve Muckle wrote: >>> >> When rt_mutex_setprio changes a task's scheduling class to RT, >> we're seeing cases where the task's vruntime is not updated >> correctly upon return to the fair class. >>> >> Specifically, the following is being observed: >> - task is deactivated while still in the fair class >> - task is boosted to RT via rt_mutex_setprio, which changes >> the task to RT and calls check_class_changed. >> - check_class_changed leads to detach_task_cfs_rq, at which point >> the vruntime_normalized check sees that the task's state is >> TASK_WAKING, >> which results in skipping the subtraction of the rq's min_vruntime >> from the task's vruntime >> - later, when the prio is deboosted and the task is moved back >> to the fair class, the fair rq's min_vruntime is added to >> the task's vruntime, even though it wasn't subtracted earlier. >>> >>> I'm thinking that is an incomplete scenario; where do we get to >>> TASK_WAKING. >> >> Yes there's a missing bit of context here at the beginning that the task to >> be boosted had already been put into TASK_WAKING. > > See, I'm confused... > > The only time TASK_WAKING is visible, is if we've done a remote wakeup > and it's 'stuck' on the remote wake_list. And in that case we've done > migrate_task_rq_fair() on it. > > So by the time either rt_mutex_setprio() or __sched_setscheduler() get > to calling check_class_changed(), under both pi_lock and rq->lock, the > vruntime_normalized() thing should be right. > > So please detail the exact scenario. Because I'm not seeing it. Using Steve's test program (https://lkml.org/lkml/2018/8/24/686) I see the issue but only if the two tasks (rt_task, fair_task) run on 2 cpus which don't share LLC (e.g. CPU0 and CPU4 on hikey960). So the wakeup goes the TTWU_QUEUE && !share_cache (ttwu_queue_remote) path. ... 
rt_task-3579 [000] 35.391573: sched_waking: comm=fair_task pid=3580 prio=120 target_cpu=004 rt_task-3579 [000] 35.391580: bprint: try_to_wake_up: try_to_wake_up: task=fair_task pid=3580 task_cpu(p)=4 p->on_rq=0 rt_task-3579 [000] 35.391584: bprint: try_to_wake_up: ttwu_queue: task=fair_task pid=3580 rt_task-3579 [000] 35.391588: bprint: try_to_wake_up: ttwu_queue_remote: task=fair_task pid=3580 rt_task-3579 [000] 35.391591: bprint: try_to_wake_up: ttwu_queue_remote: cpu=4 smp_send_reschedule rt_task-3579 [000] 35.391627: sched_pi_setprio: comm=fair_task pid=3580 oldprio=120 newprio=19 rt_task-3579 [000] 35.391635: bprint: rt_mutex_setprio: task=fair_task pid=3580 prio=120->19 queued=0 running=0 state=0x200 vruntime=46922871 cpu=4 cfs_rq->min_vruntime=7807420844 rt_task-3579 [000] 35.391641: bprint: rt_mutex_setprio: p->prio set: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 vruntime=46922871 rt_task-3579 [000] 35.391646: bprint: rt_mutex_setprio: queued checked: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 vruntime=46922871 rt_task-3579 [000] 35.391652: bprint: rt_mutex_setprio: running checked: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 vruntime=46922871 rt_task-3579 [000] 35.391657: bprint: rt_mutex_setprio: fair_class=0x08da2c80 rt_class=0x08da2d70 prev_class=0x08da2c80 p->sched_class=0x08da2d70 oldprio=120 p->prio=19 rt_task-3579 [000] 35.391661: bprint: detach_task_cfs_rq: task=fair_task pid=3580 cpu=4 vruntime_normalized=1 rt_task-3579 [000] 35.391706: sched_switch: rt_task:3579 [19] D ==> swapper/0:0 [120] -0 [004] 35.391828: bprint: ttwu_do_activate: ttwu_do_activate: task=fair_task pid=3580 -0 [004] 35.391832: bprint: ttwu_do_activate: ttwu_activate: task=fair_task pid=3580 -0 [004] 35.391833: bprint: ttwu_do_wakeup: ttwu_do_wakeup: task=fair_task pid=3580 -0 [004] 35.391834: sched_wakeup: fair_task:3580 [19] success=1 CPU:004 It doesn't happen on hikey960 when I use two cpus of the same LLC or on my laptop (i7-4750HQ).
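For reference, the TTWU_QUEUE && !share_cache decision mentioned above is taken in ttwu_queue() (kernel/sched/core.c, abridged from roughly v4.18; the exact body may differ between versions):

static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
{
        struct rq *rq = cpu_rq(cpu);
        struct rq_flags rf;

#if defined(CONFIG_SMP)
        if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
                sched_clock_cpu(cpu); /* Sync clocks across CPUs */
                ttwu_queue_remote(p, cpu, wake_flags);
                return;
        }
#endif

        rq_lock(rq, &rf);
        update_rq_clock(rq);
        ttwu_do_activate(rq, p, wake_flags, &rf);
        rq_unlock(rq, &rf);
}

That is why the problem only reproduces when waker and wakee CPUs do not share LLC: only the ttwu_queue_remote() path parks the task in TASK_WAKING on the target CPU's wake_list, which is the window in which rt_mutex_setprio() can observe it.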
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote: > On 08/24/2018 02:47 AM, Peter Zijlstra wrote: > > > > On 08/17/2018 11:27 AM, Steve Muckle wrote: > > > > > > > When rt_mutex_setprio changes a task's scheduling class to RT, > > > > > we're seeing cases where the task's vruntime is not updated > > > > > correctly upon return to the fair class. > > > > > > > Specifically, the following is being observed: > > > > > - task is deactivated while still in the fair class > > > > > - task is boosted to RT via rt_mutex_setprio, which changes > > > > > the task to RT and calls check_class_changed. > > > > > - check_class_changed leads to detach_task_cfs_rq, at which point > > > > > the vruntime_normalized check sees that the task's state is > > > > > TASK_WAKING, > > > > > which results in skipping the subtraction of the rq's min_vruntime > > > > > from the task's vruntime > > > > > - later, when the prio is deboosted and the task is moved back > > > > > to the fair class, the fair rq's min_vruntime is added to > > > > > the task's vruntime, even though it wasn't subtracted earlier. > > > > I'm thinking that is an incomplete scenario; where do we get to > > TASK_WAKING. > > Yes there's a missing bit of context here at the beginning that the task to > be boosted had already been put into TASK_WAKING. See, I'm confused... The only time TASK_WAKING is visible, is if we've done a remote wakeup and it's 'stuck' on the remote wake_list. And in that case we've done migrate_task_rq_fair() on it. So by the time either rt_mutex_setprio() or __sched_setscheduler() get to calling check_class_changed(), under both pi_lock and rq->lock, the vruntime_normalized() thing should be right. So please detail the exact scenario. Because I'm not seeing it.
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
> On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > When rt_mutex_setprio changes a task's scheduling class to RT,
> > we're seeing cases where the task's vruntime is not updated
> > correctly upon return to the fair class.
> > Specifically, the following is being observed:
> > - task is deactivated while still in the fair class
> > - task is boosted to RT via rt_mutex_setprio, which changes
> >   the task to RT and calls check_class_changed.
> > - check_class_changed leads to detach_task_cfs_rq, at which point
> >   the vruntime_normalized check sees that the task's state is TASK_WAKING,
> >   which results in skipping the subtraction of the rq's min_vruntime
> >   from the task's vruntime
> > - later, when the prio is deboosted and the task is moved back
> >   to the fair class, the fair rq's min_vruntime is added to
> >   the task's vruntime, even though it wasn't subtracted earlier.
>
> I'm thinking that is an incomplete scenario; where do we get to
> TASK_WAKING.

Yes, there's a missing bit of context here at the beginning: the task to be boosted had already been put into TASK_WAKING.
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 08/23/2018 11:54 PM, Juri Lelli wrote: I tried to catch this issue on my Arm64 Juno board using pi_test (and a slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from rt-tests but wasn't able to do so. # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs Starting PI Stress Test Number of thread groups: 1 Duration of test run: 1 seconds Number of inversions per group: 1 Admin thread SCHED_FIFO priority 4 1 groups of 3 threads will be created High thread SCHED_FIFO priority 3 Med thread SCHED_FIFO priority 2 Low thread SCHED_OTHER nice 0 # ./pip_stress In both cases, the cfs task entering rt_mutex_setprio() is queued, so dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime from se->vruntime, is called on it before it gets the rt prio. Maybe it requires a very specific use of the pthread library to provoke this issue by making sure that the cfs tasks really blocks/sleeps? Maybe one could play with rt-app to recreate such specific use case? https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459 This was reproduced for me on tip of mainline by using the program at the end of this mail. It was run in a 2 CPU virtualbox VM. Relevant annotated bits of the trace: low-prio thread vruntime is 752ms pi-vruntime-tes-598 [001] d... 520.572459: sched_stat_runtime: comm=pi-vruntime-tes pid=598 runtime=29953 [ns] vruntime=752888705 [ns] low-prio thread waits on a_sem pi-vruntime-tes-598 [001] d... 520.572465: sched_switch: prev_comm=pi-vruntime-tes prev_pid=598 prev_prio=120 prev_state=D ==> next_comm=swapper/1 next_pid=0 next_prio=120 high prio thread finishes wakeup, then sleeps for 1ms -0 [000] dNh. 520.572483: sched_wakeup: comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000 -0 [000] d... 520.572486: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=pi-vruntime-tes next_pid=597 next_prio=19 pi-vruntime-tes-597 [000] d... 520.572498: sched_switch: prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120 high prio thread wakes up after 1ms sleep, posts a_sem which starts to wake low-prio thread, then tries to grab pi_mutex, which low-prio thread has -0 [000] d.h. 520.573876: sched_waking: comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000 -0 [000] dNh. 520.573879: sched_wakeup: comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000 -0 [000] d... 520.573887: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=pi-vruntime-tes next_pid=597 next_prio=19 pi-vruntime-tes-597 [000] d... 520.573895: sched_waking: comm=pi-vruntime-tes pid=598 prio=120 target_cpu=001 low-prio thread pid 598 gets pi_mutex priority inheritance, this happens while low-prio thread is in waking state pi-vruntime-tes-597 [000] d... 520.573911: sched_pi_setprio: comm=pi-vruntime-tes pid=598 oldprio=120 newprio=19 high-prio thread sleeps on pi_mutex pi-vruntime-tes-597 [000] d... 520.573919: sched_switch: prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120 low-prio thread finishes wakeup -0 [001] dNh. 520.573932: sched_wakeup: comm=pi-vruntime-tes pid=598 prio=19 target_cpu=001 -0 [001] d... 520.573936: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=pi-vruntime-tes next_pid=598 next_prio=19 low-prio thread releases pi-mutex, loses pi boost, high-prio thread wakes for pi-mutex pi-vruntime-tes-598 [001] d... 
520.573946: sched_pi_setprio: comm=pi-vruntime-tes pid=598 oldprio=19 newprio=120 pi-vruntime-tes-598 [001] dN.. 520.573954: sched_waking: comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000 low-prio thread vruntime is 1505ms pi-vruntime-tes-598 [001] dN.. 520.573966: sched_stat_runtime: comm=pi-vruntime-tes pid=598 runtime=20150 [ns] vruntime=1505797560 [ns] The timing is quite sensitive since the task being boosted has to be caught in the TASK_WAKING state. The program: /* * Test case for vruntime management during rtmutex priority inheritance * promotion and demotion. * * build with -lpthread */ #define _GNU_SOURCE #include #include #include #include #include #define ERROR_CHECK(x) \ if (x) \ fprintf(stderr, "Error at line %d", __LINE__); pthread_mutex_t pi_mutex; sem_t a_sem; sem_t b_sem; void *rt_thread_func(void *arg) { int policy; int i = 0; cpu_set_t cpuset; CPU_ZERO(&cpuset); CPU_SET(0, &cpuset); ERROR_CHECK(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset)); while(i++ < 100) { sem_wait(&b_s
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
> > On 08/17/2018 11:27 AM, Steve Muckle wrote: > > > When rt_mutex_setprio changes a task's scheduling class to RT, > > > we're seeing cases where the task's vruntime is not updated > > > correctly upon return to the fair class. > > > Specifically, the following is being observed: > > > - task is deactivated while still in the fair class > > > - task is boosted to RT via rt_mutex_setprio, which changes > > >the task to RT and calls check_class_changed. > > > - check_class_changed leads to detach_task_cfs_rq, at which point > > >the vruntime_normalized check sees that the task's state is > > > TASK_WAKING, > > >which results in skipping the subtraction of the rq's min_vruntime > > >from the task's vruntime > > > - later, when the prio is deboosted and the task is moved back > > >to the fair class, the fair rq's min_vruntime is added to > > >the task's vruntime, even though it wasn't subtracted earlier. I'm thinking that is an incomplete scenario; where do we get to TASK_WAKING.
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On Mon, Aug 20, 2018 at 04:54:25PM -0700, Miguel de Dios wrote: > On 08/17/2018 11:27 AM, Steve Muckle wrote: > > From: John Dias > > > > When rt_mutex_setprio changes a task's scheduling class to RT, > > we're seeing cases where the task's vruntime is not updated > > correctly upon return to the fair class. > > Specifically, the following is being observed: > > - task is deactivated while still in the fair class > > - task is boosted to RT via rt_mutex_setprio, which changes > >the task to RT and calls check_class_changed. > > - check_class_changed leads to detach_task_cfs_rq, at which point > >the vruntime_normalized check sees that the task's state is TASK_WAKING, > >which results in skipping the subtraction of the rq's min_vruntime > >from the task's vruntime > > - later, when the prio is deboosted and the task is moved back > >to the fair class, the fair rq's min_vruntime is added to > >the task's vruntime, even though it wasn't subtracted earlier. > > The immediate result is inflation of the task's vruntime, giving > > it lower priority (starving it if there's enough available work). > > The longer-term effect is inflation of all vruntimes because the > > task's vruntime becomes the rq's min_vruntime when the higher > > priority tasks go idle. That leads to a vicious cycle, where > > the vruntime inflation repeatedly doubled. > > > > The change here is to detect when vruntime_normalized is being > > called when the task is waking but is waking in another class, > > and to conclude that this is a case where vruntime has not > > been normalized. > > > > Signed-off-by: John Dias > > Signed-off-by: Steve Muckle > > --- > > kernel/sched/fair.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index b39fb596f6c1..14011d7929d8 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct > > task_struct *p) > > * - A task which has been woken up by try_to_wake_up() and > > * waiting for actually being woken up by sched_ttwu_pending(). > > */ > > - if (!se->sum_exec_runtime || p->state == TASK_WAKING) > > + if (!se->sum_exec_runtime || > > + (p->state == TASK_WAKING && p->sched_class == &fair_sched_class)) > > return true; > > return false; > The normalization of vruntime used to exist in task_waking but it was > removed and the normalization was moved into migrate_task_rq_fair. The > reasoning being that task_waking_fair was only hit when a task is queued > onto a different core and migrate_task_rq_fair should do the same work. > > However, we're finding that there's one case which migrate_task_rq_fair > doesn't hit: that being the case where rt_mutex_setprio changes a task's > scheduling class to RT when its scheduled out. The task never hits > migrate_task_rq_fair because it is switched to RT and migrates as an RT > task. Because of this we're getting an unbounded addition of min_vruntime > when the task is re-attached to the CFS runqueue when it loses the inherited > priority. The patch above works because now the kernel specifically checks > for this case and normalizes accordingly. > > Here's the patch I was talking about: > https://lore.kernel.org/patchwork/patch/677689/. In our testing we were > seeing vruntimes nearly double every time after rt_mutex_setprio boosts the > task to RT. Bah, patchwork is such shit... how do you get to the previus patch from there? 
Because I think 2/3 is the actual commit that changed things; 3/3 just cleans up a bit. That would be commit: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration"). But I'm still somewhat confused; how would task_waking_fair() have helped if we've already changed to a different class?
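For reference, the asymmetry being described looks roughly like the sketch below. This is an abridged, from-memory rendering of the ~v4.18 detach/attach paths; attach_task_cfs_rq() and the switched_from/switched_to hook names are my recollection of the tree rather than anything quoted in this thread, so treat it as an illustration, not the actual source:

/*
 * Abridged illustration of the paths discussed above (approximate
 * ~v4.18 layout, reconstructed from memory -- not verbatim kernel code).
 *
 * Boost:   rt_mutex_setprio() -> check_class_changed()
 *            -> switched_from_fair() -> detach_task_cfs_rq()
 * Deboost: rt_mutex_setprio() -> check_class_changed()
 *            -> switched_to_fair()   -> attach_task_cfs_rq()
 */

static void detach_task_cfs_rq(struct task_struct *p)
{
	struct sched_entity *se = &p->se;
	struct cfs_rq *cfs_rq = cfs_rq_of(se);

	if (!vruntime_normalized(p)) {
		/*
		 * In the bug scenario this is skipped: p->state == TASK_WAKING
		 * makes vruntime_normalized() return true even though the
		 * task is waking into the RT class ...
		 */
		se->vruntime -= cfs_rq->min_vruntime;
	}
	/* ... detach load tracking etc. (abridged) */
}

static void attach_task_cfs_rq(struct task_struct *p)
{
	struct sched_entity *se = &p->se;
	struct cfs_rq *cfs_rq = cfs_rq_of(se);

	/* ... attach load tracking etc. (abridged) */

	if (!vruntime_normalized(p))
		/*
		 * ... while the matching addition on deboost still happens,
		 * so se->vruntime gains one min_vruntime per PI cycle.
		 */
		se->vruntime += cfs_rq->min_vruntime;
}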
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 23/08/18 18:52, Dietmar Eggemann wrote: > Hi, > > On 08/21/2018 01:54 AM, Miguel de Dios wrote: > > On 08/17/2018 11:27 AM, Steve Muckle wrote: > > > From: John Dias > > > > > > When rt_mutex_setprio changes a task's scheduling class to RT, > > > we're seeing cases where the task's vruntime is not updated > > > correctly upon return to the fair class. > > > Specifically, the following is being observed: > > > - task is deactivated while still in the fair class > > > - task is boosted to RT via rt_mutex_setprio, which changes > > > the task to RT and calls check_class_changed. > > > - check_class_changed leads to detach_task_cfs_rq, at which point > > > the vruntime_normalized check sees that the task's state is > > > TASK_WAKING, > > > which results in skipping the subtraction of the rq's min_vruntime > > > from the task's vruntime > > > - later, when the prio is deboosted and the task is moved back > > > to the fair class, the fair rq's min_vruntime is added to > > > the task's vruntime, even though it wasn't subtracted earlier. > > > The immediate result is inflation of the task's vruntime, giving > > > it lower priority (starving it if there's enough available work). > > > The longer-term effect is inflation of all vruntimes because the > > > task's vruntime becomes the rq's min_vruntime when the higher > > > priority tasks go idle. That leads to a vicious cycle, where > > > the vruntime inflation repeatedly doubled. > > > > > > The change here is to detect when vruntime_normalized is being > > > called when the task is waking but is waking in another class, > > > and to conclude that this is a case where vruntime has not > > > been normalized. > > > > > > Signed-off-by: John Dias > > > Signed-off-by: Steve Muckle > > > --- > > > kernel/sched/fair.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index b39fb596f6c1..14011d7929d8 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct > > > task_struct *p) > > > * - A task which has been woken up by try_to_wake_up() and > > > * waiting for actually being woken up by sched_ttwu_pending(). > > > */ > > > - if (!se->sum_exec_runtime || p->state == TASK_WAKING) > > > + if (!se->sum_exec_runtime || > > > + (p->state == TASK_WAKING && p->sched_class == > > > &fair_sched_class)) > > > return true; > > > return false; > > The normalization of vruntime used to exist in task_waking but it was > > removed and the normalization was moved into migrate_task_rq_fair. The > > reasoning being that task_waking_fair was only hit when a task is queued > > onto a different core and migrate_task_rq_fair should do the same work. > > > > However, we're finding that there's one case which migrate_task_rq_fair > > doesn't hit: that being the case where rt_mutex_setprio changes a task's > > scheduling class to RT when its scheduled out. The task never hits > > migrate_task_rq_fair because it is switched to RT and migrates as an RT > > task. Because of this we're getting an unbounded addition of > > min_vruntime when the task is re-attached to the CFS runqueue when it > > loses the inherited priority. The patch above works because now the > > kernel specifically checks for this case and normalizes accordingly. > > > > Here's the patch I was talking about: > > https://lore.kernel.org/patchwork/patch/677689/. 
In our testing we were > > seeing vruntimes nearly double every time after rt_mutex_setprio boosts > > the task to RT. > > > > Signed-off-by: Miguel de Dios > > Tested-by: Miguel de Dios > > I tried to catch this issue on my Arm64 Juno board using pi_test (and a > slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from > rt-tests but wasn't able to do so. > > # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs > > Starting PI Stress Test > Number of thread groups: 1 > Duration of test run: 1 seconds > Number of inversions per group: 1 > Admin thread SCHED_FIFO priority 4 > 1 groups of 3 threads will be created > High thread SCHED_FIFO priority 3 > Med thread SCHED_FIFO priority 2 > Low thread SCHED_OTHER nice 0 > > # ./pip_stress > > In both cases, the cfs task entering rt_mutex_setprio() is queued, so > dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime > from se->vruntime, is called on it before it gets the rt prio. > > Maybe it requires a very specific use of the pthread library to provoke this > issue by making sure that the cfs task really blocks/sleeps?

Maybe one could play with rt-app to recreate such a specific use case? https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459

Best,
- Juri
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
Hi, On 08/21/2018 01:54 AM, Miguel de Dios wrote: On 08/17/2018 11:27 AM, Steve Muckle wrote: From: John Dias When rt_mutex_setprio changes a task's scheduling class to RT, we're seeing cases where the task's vruntime is not updated correctly upon return to the fair class. Specifically, the following is being observed: - task is deactivated while still in the fair class - task is boosted to RT via rt_mutex_setprio, which changes the task to RT and calls check_class_changed. - check_class_changed leads to detach_task_cfs_rq, at which point the vruntime_normalized check sees that the task's state is TASK_WAKING, which results in skipping the subtraction of the rq's min_vruntime from the task's vruntime - later, when the prio is deboosted and the task is moved back to the fair class, the fair rq's min_vruntime is added to the task's vruntime, even though it wasn't subtracted earlier. The immediate result is inflation of the task's vruntime, giving it lower priority (starving it if there's enough available work). The longer-term effect is inflation of all vruntimes because the task's vruntime becomes the rq's min_vruntime when the higher priority tasks go idle. That leads to a vicious cycle, where the vruntime inflation repeatedly doubled. The change here is to detect when vruntime_normalized is being called when the task is waking but is waking in another class, and to conclude that this is a case where vruntime has not been normalized. Signed-off-by: John Dias Signed-off-by: Steve Muckle --- kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b39fb596f6c1..14011d7929d8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p) * - A task which has been woken up by try_to_wake_up() and * waiting for actually being woken up by sched_ttwu_pending(). */ - if (!se->sum_exec_runtime || p->state == TASK_WAKING) + if (!se->sum_exec_runtime || + (p->state == TASK_WAKING && p->sched_class == &fair_sched_class)) return true; return false; The normalization of vruntime used to exist in task_waking but it was removed and the normalization was moved into migrate_task_rq_fair. The reasoning being that task_waking_fair was only hit when a task is queued onto a different core and migrate_task_rq_fair should do the same work. However, we're finding that there's one case which migrate_task_rq_fair doesn't hit: that being the case where rt_mutex_setprio changes a task's scheduling class to RT when its scheduled out. The task never hits migrate_task_rq_fair because it is switched to RT and migrates as an RT task. Because of this we're getting an unbounded addition of min_vruntime when the task is re-attached to the CFS runqueue when it loses the inherited priority. The patch above works because now the kernel specifically checks for this case and normalizes accordingly. Here's the patch I was talking about: https://lore.kernel.org/patchwork/patch/677689/. In our testing we were seeing vruntimes nearly double every time after rt_mutex_setprio boosts the task to RT. Signed-off-by: Miguel de Dios Tested-by: Miguel de Dios I tried to catch this issue on my Arm64 Juno board using pi_test (and a slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from rt-tests but wasn't able to do so. 
# pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs

Starting PI Stress Test
Number of thread groups: 1
Duration of test run: 1 seconds
Number of inversions per group: 1
Admin thread SCHED_FIFO priority 4
1 groups of 3 threads will be created
High thread SCHED_FIFO priority 3
Med thread SCHED_FIFO priority 2
Low thread SCHED_OTHER nice 0

# ./pip_stress

In both cases, the cfs task entering rt_mutex_setprio() is queued, so dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime from se->vruntime, is called on it before it gets the rt prio.

Maybe it requires a very specific use of the pthread library to provoke this issue by making sure that the cfs task really blocks/sleeps?
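On the pthread question: the lock type that ends up in rt_mutex_setprio() is a priority-inheritance futex, i.e. a pthread mutex initialized with PTHREAD_PRIO_INHERIT. Below is a minimal, hypothetical fragment of that setup; it is not the testcase Steve posted earlier in the thread and is not claimed to reproduce the race, and the usleep(1500) only mirrors the usleep_val from the adapted pip_test mentioned above:

/*
 * Minimal sketch of a PRIO_INHERIT pthread mutex -- the user-space
 * object whose contention drives rt_mutex_setprio() in the kernel.
 * Hypothetical fragment only; not the reproducer from this thread.
 * Build with: gcc -O2 -pthread pi_sketch.c
 */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t pi_lock;

int main(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/* Use the PI protocol so lock/unlock go through the PI-futex paths. */
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&pi_lock, &attr);
	pthread_mutexattr_destroy(&attr);

	/*
	 * A CFS "low" thread would take pi_lock and then really sleep while
	 * holding it, so an RT waiter boosts it while it is deactivated
	 * (the "task is deactivated while still in the fair class" step in
	 * the changelog). Both steps are shown in one thread here purely to
	 * keep the fragment short.
	 */
	pthread_mutex_lock(&pi_lock);
	usleep(1500);			/* mirrors usleep_val = 1500 above */
	pthread_mutex_unlock(&pi_lock);

	pthread_mutex_destroy(&pi_lock);
	return 0;
}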
Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
On 08/17/2018 11:27 AM, Steve Muckle wrote: From: John Dias When rt_mutex_setprio changes a task's scheduling class to RT, we're seeing cases where the task's vruntime is not updated correctly upon return to the fair class. Specifically, the following is being observed: - task is deactivated while still in the fair class - task is boosted to RT via rt_mutex_setprio, which changes the task to RT and calls check_class_changed. - check_class_changed leads to detach_task_cfs_rq, at which point the vruntime_normalized check sees that the task's state is TASK_WAKING, which results in skipping the subtraction of the rq's min_vruntime from the task's vruntime - later, when the prio is deboosted and the task is moved back to the fair class, the fair rq's min_vruntime is added to the task's vruntime, even though it wasn't subtracted earlier. The immediate result is inflation of the task's vruntime, giving it lower priority (starving it if there's enough available work). The longer-term effect is inflation of all vruntimes because the task's vruntime becomes the rq's min_vruntime when the higher priority tasks go idle. That leads to a vicious cycle, where the vruntime inflation repeatedly doubles. The change here is to detect when vruntime_normalized is being called when the task is waking but is waking in another class, and to conclude that this is a case where vruntime has not been normalized. Signed-off-by: John Dias Signed-off-by: Steve Muckle --- kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b39fb596f6c1..14011d7929d8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p) * - A task which has been woken up by try_to_wake_up() and * waiting for actually being woken up by sched_ttwu_pending(). */ - if (!se->sum_exec_runtime || p->state == TASK_WAKING) + if (!se->sum_exec_runtime || + (p->state == TASK_WAKING && p->sched_class == &fair_sched_class)) return true; return false; The normalization of vruntime used to exist in task_waking but it was removed and the normalization was moved into migrate_task_rq_fair. The reasoning being that task_waking_fair was only hit when a task is queued onto a different core and migrate_task_rq_fair should do the same work. However, we're finding that there's one case which migrate_task_rq_fair doesn't hit: that being the case where rt_mutex_setprio changes a task's scheduling class to RT when it's scheduled out. The task never hits migrate_task_rq_fair because it is switched to RT and migrates as an RT task. Because of this we're getting an unbounded addition of min_vruntime when the task is re-attached to the CFS runqueue when it loses the inherited priority. The patch above works because now the kernel specifically checks for this case and normalizes accordingly. Here's the patch I was talking about: https://lore.kernel.org/patchwork/patch/677689/. In our testing we were seeing vruntimes nearly double every time after rt_mutex_setprio boosts the task to RT. Signed-off-by: Miguel de Dios Tested-by: Miguel de Dios
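The "nearly double every time" behaviour follows directly from the bookkeeping described in the changelog. The toy user-space model below is entirely made up for illustration (all names and starting numbers are arbitrary); it only shows how skipping the subtraction on boost while keeping the addition on deboost makes the task's vruntime roughly double per PI cycle once min_vruntime catches up to it:

/*
 * Toy model of the accounting error (plain user-space C, not kernel
 * code; all values are arbitrary). Each loop iteration is one
 * boost/deboost cycle as described in the changelog.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long min_vruntime = 1000000;	/* stands in for cfs_rq->min_vruntime */
	unsigned long long vruntime     = 1005000;	/* stands in for se->vruntime         */

	for (int cycle = 1; cycle <= 5; cycle++) {
		/*
		 * Boost to RT (detach path): the subtraction is skipped
		 * because the TASK_WAKING check wrongly reports the
		 * vruntime as already normalized:
		 *
		 *	vruntime -= min_vruntime;	(not executed)
		 */

		/* Deboost to CFS (attach path): the addition still happens. */
		vruntime += min_vruntime;

		/*
		 * Simplification of "the task's vruntime becomes the rq's
		 * min_vruntime when the higher priority tasks go idle".
		 */
		if (vruntime > min_vruntime)
			min_vruntime = vruntime;

		printf("cycle %d: se->vruntime = %llu\n", cycle, vruntime);
	}
	return 0;
}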
[PATCH] sched/fair: vruntime should normalize when switching from fair
From: John Dias

When rt_mutex_setprio changes a task's scheduling class to RT,
we're seeing cases where the task's vruntime is not updated
correctly upon return to the fair class.

Specifically, the following is being observed:
- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes
  the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point
  the vruntime_normalized check sees that the task's state is TASK_WAKING,
  which results in skipping the subtraction of the rq's min_vruntime
  from the task's vruntime
- later, when the prio is deboosted and the task is moved back
  to the fair class, the fair rq's min_vruntime is added to
  the task's vruntime, even though it wasn't subtracted earlier.

The immediate result is inflation of the task's vruntime, giving
it lower priority (starving it if there's enough available work).
The longer-term effect is inflation of all vruntimes because the
task's vruntime becomes the rq's min_vruntime when the higher
priority tasks go idle. That leads to a vicious cycle, where
the vruntime inflation repeatedly doubles.

The change here is to detect when vruntime_normalized is being
called while the task is waking but is waking in another class,
and to conclude that this is a case where vruntime has not
been normalized.

Signed-off-by: John Dias
Signed-off-by: Steve Muckle
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..14011d7929d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
	 * - A task which has been woken up by try_to_wake_up() and
	 *   waiting for actually being woken up by sched_ttwu_pending().
	 */
-	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+	if (!se->sum_exec_runtime ||
+	    (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
		return true;

	return false;
--
2.18.0.865.gffc8e1a3cd6-goog