2017-01-04 19:00 GMT+01:00, Daniel Bristot de Oliveira <bris...@redhat.com>:
[...]
>>>>> Some tasks start to use more CPU time, while others seem to use less
>>>>> CPU than was reserved for them. See task 14926: it is using only
>>>>> 23.8% of the CPU, which is less than its 10/30 reservation.
>>>>
>>>> What happened here is that some runqueues have an active utilisation
>>>> larger than 0.95. So, GRUB is decreasing the amount of time received
>>>> by the tasks on those runqueues so that they consume less than 95%...
>>>> This is the reason for the effect you noticed below:
>>>
>>> I see. But, AFAIK, Linux's sched deadline measures the load globally,
>>> not locally. So, it is not a problem to have a load > 95% in the
>>> local queue if the global load is < 95%.
>>>
>>> Am I missing something?
>>
>> The version of GRUB reclaiming implemented in my patches tracks a
>> per-runqueue "active utilization", and uses it for reclaiming.
>
> I _think_ that this might be (one of) the source(s) of the problem...

I agree that this can cause some problems, but I am not sure it
justifies the huge difference in utilisations you observed.
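To make the mechanism under discussion concrete, here is a minimal
sketch of per-runqueue GRUB budget depletion, assuming a
dq = -(Uact / Umax) * dt rule with Umax = 0.95; the identifiers
(grub_reclaim, running_bw) and the fixed-point layout are illustrative
assumptions, not the literal code from the patches:

#include <stdint.h>

#define BW_SHIFT	20
#define BW_UNIT		(1ULL << BW_SHIFT)	/* fixed-point 1.0    */
#define U_MAX		((BW_UNIT * 95) / 100)	/* assumed 0.95 cap   */

/*
 * While a task executes for "delta" ns, charge delta * Uact / Umax
 * against its runtime budget.  running_bw is the *local* (per-runqueue)
 * active utilisation, which is exactly the point Daniel is questioning.
 */
static uint64_t grub_reclaim(uint64_t delta, uint64_t running_bw)
{
	/*
	 * Uact < Umax: the budget depletes more slowly, so the task
	 * reclaims spare local bandwidth.  Uact > Umax: the budget
	 * depletes faster, compressing the runqueue down to 95%.
	 */
	return delta * running_bw / U_MAX;
}

Under this assumed rule, three 10/30 tasks on one runqueue give
Uact = 1.0, so each 10 ms budget is exhausted after 9.5 ms of execution
and each task gets about 31.7% of the CPU: the "compressed to 95%"
behaviour described above.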
> Just exercising...
>
> For example, with my taskset, with a hypothetical perfect balance of
> the whole runqueue, one possible scenario is:
>
> CPU      0  1  2  3
> # TASKS  3  3  3  2
>
> In this case, CPUs 0, 1 and 2 are at 100% of local utilization. Thus,
> the current tasks on these CPUs will have their runtime decreased by
> GRUB. Meanwhile, the lucky tasks on CPU 3 would use additional time
> that they "globally" do not have - because the system, globally, has
> a load higher than the 66.6% seen by that local runqueue. Actually,
> part of the time taken from the tasks on CPUs [0-2] is being used by
> the tasks on CPU 3, until the next migration of any task, which will
> change which tasks are the lucky ones... but without any guarantee
> that every task will be the lucky one on every activation, causing
> the problem.
>
> Does it make sense?

Yes; but my impression is that gEDF will migrate tasks so that the
distribution of the reclaimed CPU bandwidth is almost uniform...
Instead, you saw huge differences in the utilisations (and I do not
think that "compressing" the utilisations from 100% to 95% can decrease
the utilisation of a task from 33% to 25% / 26%... :)

I suspect there is something more going on here (it might be a bug in
one of my patches). I am trying to better understand what happened.

> If it does, this leads me to think that only with global tracking of
> utilization will we achieve the correct result... but I may be missing
> something... :-).

Of course, tracking the global active utilisation can be a solution,
but I also want to better understand what is wrong with the current
approach.

Thanks,
Luca
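As a back-of-the-envelope check of Daniel's 3/3/3/2 scenario, the
sketch below applies the same assumed dq = -(Uact / Umax) * dt rule to
each CPU; the numbers are illustrative only:

#include <stdio.h>

int main(void)
{
	const double u = 10.0 / 30.0;	/* each task's reservation, ~33.3% */
	const double u_max = 0.95;	/* assumed per-runqueue GRUB cap   */
	const int tasks_per_cpu[] = { 3, 3, 3, 2 };

	for (int cpu = 0; cpu < 4; cpu++) {
		/* local active utilisation of this runqueue */
		double u_act = tasks_per_cpu[cpu] * u;
		/*
		 * The budget depletes at rate u_act / u_max, so a task
		 * runs runtime * u_max / u_act per period; its share of
		 * the CPU is therefore u * u_max / u_act.
		 */
		double share = u * u_max / u_act;

		printf("CPU %d: %d tasks, Uact = %5.1f%%, share = %4.1f%%\n",
		       cpu, tasks_per_cpu[cpu], 100.0 * u_act,
		       100.0 * share);
	}
	return 0;
}

This prints a per-task share of about 31.7% on CPUs 0-2 and 47.5% on
CPU 3, illustrating the imbalance Daniel describes. Note that no
statically placed task drops below 0.95 * 33.3% ~= 31.7% in this model,
which is consistent with Luca's point that compression alone cannot
explain the observed 23.8%.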