Let me reply in parts as I read this.. easy things first :-)

On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> On 10-Apr 09:36, Peter Zijlstra wrote:
> > 4) they have muddled semantics, because while its presented as a task
> >    property, it very much is not.
>
> Actually we always presented it as a task group property, while other
> people suggested it should be a per-task API.
>
> Still, IMO it could make sense also as a per-task API, for example
> considering a specific RT task which we know in our system is
> perfectly fine to always run it below a certain OPP.
>
> Do you think it should be a per-task API?
> Should we focus (at least initially) on providing a per task-group API?

Even for the cgroup interface, I think they should set a per-task
property, not a group property.

> > 3) they have absolutely unacceptable overhead in implementation. Two
> >    more RB tree operations per enqueue/dequeue is just not going to
> >    happen.
>
> This last point is about "implementation details", I'm pretty
> confident that if we find an agreement on the previous point than
> this last will be simple to solve.
>
> Just to be clear, the rb-trees are per CPU and used to track just the
> RUNNABLE tasks on each CPUs and, as I described in the previous
> example, for the OPP biasing to work I think we really need an
> aggregation mechanism.

I know it's runnable, which is exactly what the regular RB tree in fair
already tracks. That gets us 3 RB tree operations per scheduling
operation, which is entirely ridiculous.

And while I disagree with the need to track this at all, see below, it
can be done _much_ cheaper using a double augmented RB-tree, where you
keep the min/max as heaps inside the regular RB-tree.

> Ideally, every time a task is enqueue/dequeued from a CPU we want to
> know what is the currently required capacity clamping. This requires
> to maintain an ordered list of values and rb-trees are the most effective
> solution.
>
> Perhaps, if we accept a certain level of approximation, we can
> potentially reduce the set of tracked values to a finite set (maybe
> corresponding to the EM capacity values) and use just a per-cpu
> ref-counting array.
> Still the min/max aggregation will have a worst case O(N) search
> complexity, where N is the number of finite values we want to use.

So the bigger point is that if the min/max is a per-task property (even
if set through a cgroup interface), the min(max) / max(min) thing is
wrong.

If the min/max were to apply to each individual task's util, you'd end
up with something like:

  Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)]

Where a possible approximation is scaling the aggregate util into that
domain.
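
Roughly, such a scaling could look something like the below; this is
only a sketch with made-up names, in fixed point where
SCHED_CAPACITY_SCALE represents 1:

#define SCHED_CAPACITY_SCALE	1024	/* fixed point "1" */

/*
 * Sketch only, names made up for illustration: map the aggregate util,
 * in [0, SCHED_CAPACITY_SCALE], linearly into the domain
 * [min(1, \Sum min), min(1, \Sum max)].
 */
static unsigned long scale_util_into_domain(unsigned long util,
					    unsigned long sum_min,
					    unsigned long sum_max)
{
	unsigned long lo = sum_min < SCHED_CAPACITY_SCALE ?
				sum_min : SCHED_CAPACITY_SCALE;
	unsigned long hi = sum_max < SCHED_CAPACITY_SCALE ?
				sum_max : SCHED_CAPACITY_SCALE;

	/* util == 0 gives lo, util == SCHED_CAPACITY_SCALE gives hi */
	return lo + util * (hi - lo) / SCHED_CAPACITY_SCALE;
}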
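
As an aside, the finite-set / per-cpu ref-counting array idea quoted
above could be as simple as the following; again purely illustrative
names, with the max lookup being the worst case O(N) search:

#define NR_CLAMP_VALUES	8	/* e.g. one slot per EM capacity value */

struct clamp_array {
	unsigned int refcnt[NR_CLAMP_VALUES]; /* runnable tasks per value */
};

/* enqueue/dequeue only bump a counter for the task's clamp slot */
static void clamp_array_inc(struct clamp_array *ca, int idx) { ca->refcnt[idx]++; }
static void clamp_array_dec(struct clamp_array *ca, int idx) { ca->refcnt[idx]--; }

/* worst case O(NR_CLAMP_VALUES) scan for the highest requested clamp */
static int clamp_array_max(struct clamp_array *ca)
{
	int i;

	for (i = NR_CLAMP_VALUES - 1; i >= 0; i--)
		if (ca->refcnt[i])
			return i;
	return 0;
}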

