On 11/1/18 6:55 AM, Juri Lelli wrote:
>> I meant, I am not against the/a fix, i just think that... it is more
>> complicated that it seems.
>>
>> For example: Let's assume that we have a non-rt bad thread A in CPU 0
>> generating IPIs because of static key update, and a good dl thread B
>> in the CPU 1.
>>
>> In this case, the thread B could run less than what was reserved for
>> it, but it was not causing the interrupts. It is not fair to put a
>> penalty in the thread B.
>>
>> The same is valid for a dl thread running in the same CPU that is
>> receiving a lot of network packets to another application, and other
>> legit cases.
>>
>> In the end, if we want to avoid non-rt threads starving, we need to
>> prioritize them some time, but in this case, we return to the DL
>> server for non-rt threads.
>>
>> Thoughts?
>
> And I see your point. :-)
>
> I'd also add (maybe you mentioned this as well) that it seems the same
> could happen with RT throttling safety measure, as we are using
> clock_task there as well to account runtime and throttle stuff.

Yes! The same problem can happen with the rt scheduler as well! I first
saw this problem with the rt throttling mechanism when I was trying to
make it work at microsecond granularity (it is only enforced in the
scheduler tick, so in practice it has ms granularity). After using hr
timers to do the enforcement at microsecond granularity, I was trying to
leave just a few us for the non-rt tasks. But as the IRQ runtime was
higher than these few us, the rt_rq was never throttled. It is the
same/similar behavior we see here.

As we think of rt throttling as "avoiding the rt workload consuming more
than rt_runtime/rt_period", and considering that IRQs are a level of
task with a fixed priority higher than all the real-time related
schedulers, i.e., deadline and rt, we can safely argue that we can
include the IRQ time in the pool of rt workload and account it in the
rt_runtime. The easiest way to do it is to use rq_clock() in the
measurement.

I agree. The point is that the CBS has a dual goal: it prevents a task
from running for more than its runtime (a throttling behavior), but it
is also used as a guarantee of runtime for the case in which the task
behaves and the system is not overloaded. (On a multiprocessor we can
have more load than we can schedule, but that is another story.)

The obvious reasoning here is: OK boy, but the system IS overloaded in
this case, we have an RCU stall! And that is true if you look at the
processor starving RCU. But if the system has more than one CPU, it
could have CPU time available on another CPU. So, we could just move the
dl task from one CPU to another.

Btw, that is another point. We have the AC with the sum of the
utilization of all CPUs, but we do no enforcement of per-CPU
utilization. If one sets a single thread with runtime=deadline=period
(in a system with more than one CPU) and runs it in a busy-loop, we will
eventually have an RCU stall as well (I just did that on my box, and I
got a soft lockup).

I know this is a different problem. But, maybe, there is a general
solution for both issues:

For instance, if the sum of the execution time of all "tasks" with
priority higher than the OTHER class (rt, dl, stop_machine, IRQs, NMIs,
Hypervisor?) on a CPU is higher than rt_runtime in the rt_period, we
need to avoid what is "avoidable" by trying to move rt and dl threads
away from that CPU. Another possibility is to bump the priority of the
OTHER class (and we are back to the DL server).

- Dude, would it not be easier to just change the CBS?

Yeah, but by changing the CBS, we may end up breaking the
algorithms/properties that rely on CBS... like GRUB,
user-space/kernel-space synchronization...

> OTOH, when something like you describe happens, guarantees are probably
> already out of the window and we should just do our best to at least
> keep the system "working"? (maybe only to warn the user that something
> bad has happened)

Btw, don't get me wrong, I am not against changing CBS: I am just trying
to raise other viewpoints to avoid touching the base of the DL
scheduler, and to avoid punishing a thread that behaves well.

Anyway, notifying that dl+rt+IRQ time is higher than the rt_runtime is
another good thing to do as well. We will be notified anyway, either by
RCU or softlockup... but those warnings are side effects. By notifying
that we have an overload of rt or higher workload, we will be pointing
to the cause.

Thoughts?

-- Daniel