On Wed, Jan 30, 2019 at 02:04:10PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 30, 2019 at 06:22:47AM +0100, Vincent Guittot wrote:
> 
> > The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that
> > it will walk down to the root the 1st time a cfs_rq is used, and that we
> > will finish by adding either a cfs_rq without a parent or a cfs_rq whose
> > parent is already on the list. But this is not always true in the presence
> > of throttling. Because a cfs_rq can be throttled even if it has never been
> > used (other CPUs of the cgroup may already have consumed all the
> > bandwidth), we are not guaranteed to walk down to the root and add all
> > cfs_rq to the list.
> > 
> > Ensure that all cfs_rq will be added to the list even if they are throttled.
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e2ff4b6..826fbe5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -352,6 +352,20 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> >  	}
> >  }
> >  
> > +static inline void list_add_branch_cfs_rq(struct sched_entity *se, struct rq *rq)
> > +{
> > +	struct cfs_rq *cfs_rq;
> > +
> > +	for_each_sched_entity(se) {
> > +		cfs_rq = cfs_rq_of(se);
> > +		list_add_leaf_cfs_rq(cfs_rq);
> > +
> > +		/* If parent is already in the list, we can stop */
> > +		if (rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
> > +			break;
> > +	}
> > +}
> > +
> >  /* Iterate through all leaf cfs_rq's on a runqueue: */
> >  #define for_each_leaf_cfs_rq(rq, cfs_rq) \
> >  	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> 
> > @@ -5179,6 +5197,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  
> >  	}
> >  
> > +	/* Ensure that all cfs_rq have been added to the list */
> > +	list_add_branch_cfs_rq(se, rq);
> > +
> >  	hrtick_update(rq);
> >  }
> 
> So I don't much like this; at all. But maybe I misunderstand, this is
> somewhat tricky stuff and I've not looked at it in a while.
> 
> So per normal we do:
> 
>   enqueue_task_fair()
>     for_each_sched_entity() {
>       if (se->on_rq)
>         break;
>       enqueue_entity()
>         list_add_leaf_cfs_rq();
>     }
> 
> This ensures that all parents are already enqueued, right? Because this
> is what enqueues those parents.
> 
> And in this case you add an unconditional second for_each_sched_entity();
> even though it is completely redundant, afaict.
Ah, it doesn't do a second iteration; it continues where the previous two
left off. Still, why isn't this in unthrottle?

> The problem seems to stem from the whole throttled crud; which (also)
> breaks the above enqueue loop on throttle state, and there the parent
> can go missing.
> 
> So why doesn't this live in unthrottle_cfs_rq() ?