On 04/08/2020 01:59, Valentin Schneider wrote:
>
> On 03/08/20 20:22, Thomas Gleixner wrote:
>> Valentin,
>>
>> Valentin Schneider <valentin.schnei...@arm.com> writes:
>>> On 03/08/20 16:13, Thomas Gleixner wrote:
>>>> Vladimir Oltean <olte...@gmail.com> writes:
>>>>>> 1) When irq accounting is disabled, RT throttling kicks in as
>>>>>> expected.
>>>>>>
>>>>>> 2) With irq accounting the RT throttler does not kick in and the RCU
>>>>>> stall/lockups happen.
>>>>> What is this telling us?
>>>>
>>>> It seems that the fine grained irq time accounting affects the runtime
>>>> accounting in some way which I haven't figured out yet.
>>>>
>>>
>>> With IRQ_TIME_ACCOUNTING, rq_clock_task() will always be incremented by a
>>> lesser-or-equal value than when not having the option; you start with the
>>> same delta_exec but slice some for the IRQ accounting, and leave the rest
>>> for the rq_clock_task() (+paravirt).
>>>
>>> IIUC this means that if you spend e.g. 10% of the time in IRQ and 90% of
>>> the time running the stress-ng RT tasks, despite having RT tasks hogging
>>> the entirety of the "available time" it is still only 90% runtime, which is
>>> below the 95% default and the throttling doesn't happen.
>>
>> totaltime = irqtime + tasktime
>>
>> Ignoring irqtime and pretending that totaltime is what the scheduler
>> can control and deal with is naive at best.
>>
>
> Agreed, however AFAICT rt_time is only incremented by rq_clock_task()
> deltas, which don't include IRQ time with IRQ_TIME_ACCOUNTING=y. That would
> then be directly compared to the sysctl runtime.
>
> Adding some prints in sched_rt_runtime_exceeded() and running this test
> case on my Juno, I get:
> # IRQ_TIME_ACCOUNTING=y
> cpu=2 rt_time=713455220 runtime=950000000 rq->avg_irq.util_avg=265
> (rt_time oscillates between [70.1e7, 75.1e7]; avg_irq between [220, 270])
>
> # IRQ_TIME_ACCOUNTING=n
> cpu=2 rt_time=963035300 runtime=949951811
> (rt_time oscillates between [94.1e7, 96.1e7];
>
> Throttling happens for IRQ_TIME_ACCOUNTING=n and doesn't for
> IRQ_TIME_ACCOUNTING=y - clearly the accounted rt_time isn't high enough for
> that to happen, and it does look like what is missing in rt_time (or what
> should be subtracted from the available runtime) is there in the avg_irq.
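
For reference, the check behind those prints boils down to a single comparison. Below is a minimal user-space sketch, assuming throttling starts once the accumulated rt_time exceeds the runtime budget. Both helpers are hypothetical names made up for illustration, not the kernel's sched_rt_runtime_exceeded(), and the inputs are the values quoted above.

```c
#include <stdint.h>

/*
 * Hypothetical user-space model of the comparison discussed above;
 * this is not the kernel's sched_rt_runtime_exceeded(). Returns 1 when
 * the accumulated RT time is over the budget, i.e. throttling would start.
 */
static int rt_runtime_exceeded(uint64_t rt_time_ns, uint64_t runtime_ns)
{
	return rt_time_ns > runtime_ns;
}

/* Share of the budget consumed, in percent (integer, rounded down). */
static unsigned int rt_budget_used_pct(uint64_t rt_time_ns, uint64_t runtime_ns)
{
	return (unsigned int)(rt_time_ns * 100 / runtime_ns);
}
```

With the IRQ_TIME_ACCOUNTING=y sample (rt_time=713455220, runtime=950000000) the check returns 0 at roughly 75% of the budget; with the =n sample (rt_time=963035300, runtime=949951811) it returns 1.
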
I agree that w/ IRQ_TIME_ACCOUNTING=y rt_rq->rt_time isn't high enough in
this testcase.

stress-ng-hrtim-1655 [001] 462.897733: bprint: update_curr_rt: rt_rq->rt_time=416716900 rt_rq->rt_runtime=950000000 rt_b->rt_runtime=950000000

The 5% reservation (1 - sched_rt_runtime_us/sched_rt_period_us) for CFS is
massively eclipsed by irqtime.

It's true that avg_irq tracks 'irq_delta + steal' time but it is meant to
potentially reduce cpu capacity. It's also cpu and frequency invariant (your
CPU2 is a big CPU so no issue here).

Could a rq_clock(rq) derived rt_rq signal be used to compare against
rt_runtime?

BTW, DL already influences rt_rq->rt_time.

[...]
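
To make the rq_clock(rq) question concrete, here is a rough user-space sketch of the two options, using the 10% IRQ example from the quoted text. struct period_model and all three helpers are hypothetical names, not kernel code, and the model assumes the RT tasks consume all non-IRQ time in the period.

```c
#include <stdint.h>

/* Hypothetical model of one accounting period; all times in ns. */
struct period_model {
	uint64_t clock_delta;	/* elapsed time, as an rq_clock()-like clock sees it */
	uint64_t irq_delta;	/* part of it accounted to IRQ (+ steal) */
};

/*
 * Time left for tasks once the IRQ slice is removed, i.e. what an
 * rq_clock_task()-like clock would advance by. Clamped at zero.
 */
static uint64_t task_delta(const struct period_model *p)
{
	if (p->irq_delta >= p->clock_delta)
		return 0;
	return p->clock_delta - p->irq_delta;
}

/* Throttle decision when rt_time is accumulated from the task clock. */
static int throttled_task_clock(const struct period_model *p, uint64_t runtime)
{
	return task_delta(p) > runtime;
}

/* Throttle decision for a signal derived from the full clock (assumption). */
static int throttled_full_clock(const struct period_model *p, uint64_t runtime)
{
	return p->clock_delta > runtime;
}
```

For a 1 s period with 100 ms of IRQ time and a 950 ms budget, the task-clock variant sees 900 ms and never throttles, while the full-clock variant sees the whole second and does. Whether charging IRQ time to the RT budget like this is the right policy is exactly the open question.
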