When running update_curr, it is checked whether the current task has missed its deadline (update_deadline). If the deadline has been crossed, the task is set to be rescheduled if there are other tasks available on its cfs_rq. This can cause task starvation in some cgroup configurations.
Assume the following scenario: [ ] rq of CPU | [ ] cfs_rq1 / \ cfs_rq2 [ ] othertask | curr In this case, update_curr is called for all levels of the hierarchy, starting at the leaf. Assume that curr is a kernel task which does not give up the CPU voluntarily, i.e. loops indefinitely unless forced to give up the CPU. Assume further that curr has actually reached its deadline, the expected behavior would be, that the scheduler determines that curr would now start to overconsume and therefore should set the need_resched flag to nudge the current task to allow a reschedule in favor of othertask. To reach the expected behavior, it is not sufficient to check whether other tasks are queued in the cfs_rq, which curr is assigned to. In the configuration shown above, each run queue sees: rq_cpu: 2 tasks cfs_rq1: 2 tasks cfs_rq2: 1 task This means that cfs_rq2 will never reschedule on its own as it sees no other tasks that would be worth rescheduling for. Hence, the call to reschedule relies on the upper levels in the hierarchy. cfs_rq1 sees 1 additional task and could therefore take the desired decision. But, cfs_rq1 takes the additional task into account when computing its own deadline, which means, its deadline lies further in the future. This causes that its deadline is not being reached. Therefore cfs_rq1 does not even check for other potential tasks to be ran. It could now be assumed that cfs_rq1 will reach its deadline at some point. But, after curr has consumed its timeslice, all sched_entities in the cgroup tree get reweighted. This causes the deadlines of all entities in the hierarchy to be shifted further into the future. This means especially, that the deadline of cfs_rq1 also gets pushed into the future, causing it to never reach its deadline. The decision whether a reweight needs to be done depends on the weight of the sched_entity and a recalculated value of the shares the sched_entity shall expect (see calc_group_shares in fair.c). These values are, in the described scenario, always different, hence, reweight is done. The combination of these two circumstances can lead to curr running indefinitely unless interrupted by an external entity. Address this issue by rather checking for the nr_running value of the rq of the CPU itself, i.e. if there is any other task, somewhere on the CPU runqueue, give it a chance to execute. The scenario that made me discover this is being discussed here: https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/ Questions: 1. Is this behavior by design? I couldn't find an explanation why we only check for the local cfs_rq. Do we only want to allow take-overs from the same cgroup hierarchy level in that case? 2. This problem is referred to as the "hog"-problem by Ankur Arora here: https://lore.kernel.org/lkml/20240213055554.1802415-24-ankur.a.ar...@oracle.com/ Might there be a connection, it's the same piece of code in the end. I'd also prefer the boolean passing to avoid calling resched_curr too often. 3. Which are the criteria for which I should expect a reweight of the cgroup hierarchy? It seems unintuitive too me, that the shares value keeps changing every time. Feedback and opinions would be highly appreciated. Signed-off-by: Tobias Huschle <husc...@linux.ibm.com> --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 61c4ef20a2f8..e9733ef7964a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1015,7 +1015,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) /* * The task has consumed its request, reschedule. */ - if (cfs_rq->nr_running > 1) { + if (rq_of(cfs_rq)->nr_running > 1) { resched_curr(rq_of(cfs_rq)); clear_buddies(cfs_rq, se); } -- 2.34.1