On Thu, Sep 17, 2020 at 11:42:11AM +0200, Thomas Gleixner wrote:

> +static inline void update_nr_migratory(struct task_struct *p, long delta)
> +{
> +	if (p->nr_cpus_allowed > 1 && p->sched_class->update_migratory)
> +		p->sched_class->update_migratory(p, delta);
> +}
Right, so as you know, I totally hate this thing :-) It adds a second
(and radically different) way of changing affinity. I'm working on a
version that uses the normal *set_cpus_allowed*() interface.

> +/*
> + * The migrate_disable/enable() fastpath updates only the task's migrate
> + * disable count which is sufficient as long as the task stays on the CPU.
> + *
> + * When a migrate disabled task is scheduled out it can become subject to
> + * load balancing. To prevent this, update task::cpus_ptr to point to the
> + * current CPU's cpumask and set task::nr_cpus_allowed to 1.
> + *
> + * If task::cpus_ptr does not point to task::cpus_mask then the update has
> + * been done already. This check is also used in migrate_enable() as an
> + * indicator to restore task::cpus_ptr to point to task::cpus_mask.
> + */
> +static inline void sched_migration_ctrl(struct task_struct *prev, int cpu)
> +{
> +	if (!prev->migration_ctrl.disable_cnt ||
> +	    prev->cpus_ptr != &prev->cpus_mask)
> +		return;
> +
> +	prev->cpus_ptr = cpumask_of(cpu);
> +	update_nr_migratory(prev, -1);
> +	prev->nr_cpus_allowed = 1;
> +}

So this thing is called from schedule(), with only rq->lock held, and
that violates the locking rules for changing the affinity. I have a
comment that explains how it's broken and why it's sort-of working.

> +void migrate_disable(void)
> +{
> +	unsigned long flags;
> +
> +	if (!current->migration_ctrl.disable_cnt) {
> +		raw_spin_lock_irqsave(&current->pi_lock, flags);
> +		current->migration_ctrl.disable_cnt++;
> +		raw_spin_unlock_irqrestore(&current->pi_lock, flags);
> +	} else {
> +		current->migration_ctrl.disable_cnt++;
> +	}
> +}

That pi_lock seems unfortunate, and it isn't obvious what the point of
it is.

> +void migrate_enable(void)
> +{
> +	struct task_migrate_data *pending;
> +	struct task_struct *p = current;
> +	struct rq_flags rf;
> +	struct rq *rq;
> +
> +	if (WARN_ON_ONCE(p->migration_ctrl.disable_cnt <= 0))
> +		return;
> +
> +	if (p->migration_ctrl.disable_cnt > 1) {
> +		p->migration_ctrl.disable_cnt--;
> +		return;
> +	}
> +
> +	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
> +	p->migration_ctrl.disable_cnt = 0;
> +	pending = p->migration_ctrl.pending;
> +	p->migration_ctrl.pending = NULL;
> +
> +	/*
> +	 * If the task was never scheduled out while in the migrate
> +	 * disabled region and there is no migration request pending,
> +	 * return.
> +	 */
> +	if (!pending && p->cpus_ptr == &p->cpus_mask) {
> +		raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
> +		return;
> +	}
> +
> +	rq = __task_rq_lock(p, &rf);
> +	/* Was it scheduled out while in a migrate disabled region? */
> +	if (p->cpus_ptr != &p->cpus_mask) {
> +		/* Restore the task's CPU mask and update the weight */
> +		p->cpus_ptr = &p->cpus_mask;
> +		p->nr_cpus_allowed = cpumask_weight(&p->cpus_mask);
> +		update_nr_migratory(p, 1);
> +	}
> +
> +	/* If no migration request is pending, no further action required. */
> +	if (!pending) {
> +		task_rq_unlock(rq, p, &rf);
> +		return;
> +	}
> +
> +	/* Migrate self to the requested target */
> +	pending->res = set_cpus_allowed_ptr_locked(p, pending->mask,
> +						   pending->check, rq, &rf);
> +	complete(pending->done);
> +}

So, what I'm missing with all this are the design constraints for this
trainwreck. Because the 'sane' solution was having migrate_disable()
imply cpus_read_lock(). But that didn't fly because we can't have
migrate_disable() / migrate_enable() schedule for raisins.
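To spell that constraint out (illustrative sketch only, not from the patch,
helper names made up; the actual pinning of the task is omitted, only the
locking matters here): cpus_read_lock() is percpu_down_read() on the hotplug
lock and its slowpath blocks, so anything built on it can schedule.

	#include <linux/cpu.h>

	/* Hypothetical "imply cpus_read_lock()" variant, for illustration. */
	static inline void migrate_disable_sketch(void)
	{
		cpus_read_lock();	/* percpu rwsem; slowpath sleeps */
	}

	static inline void migrate_enable_sketch(void)
	{
		cpus_read_unlock();	/* slowpath may also schedule */
	}

Which is exactly what rules it out for the contexts migrate_disable() has to
support.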
And if I'm not mistaken, the above migrate_enable() *does* require
being able to schedule, and our favourite piece of futex:

	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
	spin_unlock(q.lock_ptr);

is broken. Consider that spin_unlock() doing migrate_enable() with a
pending sched_setaffinity().

Let me ponder this more..
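Spelled out (condensed sketch of the ordering, not verbatim kernel code; only
the two locking lines are taken from the futex snippet above):

	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
	/*
	 * On RT this spin_unlock() is the rtmutex-based unlock, which ends
	 * up in migrate_enable().  If a sched_setaffinity() request is
	 * pending, that migrate_enable() has to act on it - per the hunk
	 * above, set_cpus_allowed_ptr_locked() plus complete() - and, as
	 * noted, that can require scheduling, all while wait_lock, a raw
	 * spinlock, is still held with interrupts disabled.
	 */
	spin_unlock(q.lock_ptr);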