The load balancer and the NUMA balancer are not supposed to work on isolcpus.
Currently, when setting cpus_allowed for a task, there are no checks to see
whether the requested cpumask has CPUs from both isolcpus and housekeeping
CPUs.
If the user passes a mix of isolcpus and housekeeping CPUs, the NUMA balancer
can pick an isolcpu to schedule on. With this change, if a combination of
isolcpus and housekeeping CPUs is provided, we restrict the mask to the
housekeeping CPUs only.

For example, on a system with 32 CPUs:

$ grep -o "isolcpus=[,,1-9]*" /proc/cmdline
isolcpus=1,5,9,13
$ grep -i cpus_allowed /proc/$$/status
Cpus_allowed:	ffffdddd
Cpus_allowed_list:	0,2-4,6-8,10-12,14-31

Run "perf bench numa mem --no-data_rand_walk -p 4 -t 8 -G 0 -P 3072 -T 0
-l 50 -c -s 1000", which calls sched_setaffinity() with all CPUs in the
system.

Without patch
-------------
$ for i in $(pgrep -f perf); do grep -i cpus_allowed_list /proc/$i/task/*/status ; done | head -n 10
Cpus_allowed_list:	0,2-4,6-8,10-12,14-31
/proc/2107/task/2107/status:Cpus_allowed_list:	0-31
/proc/2107/task/2196/status:Cpus_allowed_list:	0-31
/proc/2107/task/2197/status:Cpus_allowed_list:	0-31
/proc/2107/task/2198/status:Cpus_allowed_list:	0-31
/proc/2107/task/2199/status:Cpus_allowed_list:	0-31
/proc/2107/task/2200/status:Cpus_allowed_list:	0-31
/proc/2107/task/2201/status:Cpus_allowed_list:	0-31
/proc/2107/task/2202/status:Cpus_allowed_list:	0-31
/proc/2107/task/2203/status:Cpus_allowed_list:	0-31

With patch
----------
$ for i in $(pgrep -f perf); do grep -i cpus_allowed_list /proc/$i/task/*/status ; done | head -n 10
Cpus_allowed_list:	0,2-4,6-8,10-12,14-31
/proc/18591/task/18591/status:Cpus_allowed_list:	0,2-4,6-8,10-12,14-31
/proc/18591/task/18603/status:Cpus_allowed_list:	0,2-4,6-8,10-12,14-31
/proc/18591/task/18604/status:Cpus_allowed_list:	0,2-4,6-8,10-12,14-31
/proc/18591/task/18605/status:Cpus_allowed_list:	0,2-4,6-8,10-12,14-31
/proc/18591/task/18606/status:Cpus_allowed_list:	0,2-4,6-8,10-12,14-31
/proc/18591/task/18607/status:Cpus_allowed_list:	0,2-4,6-8,10-12,14-31
/proc/18591/task/18608/status:Cpus_allowed_list:	0,2-4,6-8,10-12,14-31
/proc/18591/task/18609/status:Cpus_allowed_list:	0,2-4,6-8,10-12,14-31
/proc/18591/task/18610/status:Cpus_allowed_list:	0,2-4,6-8,10-12,14-31

Signed-off-by: Srikar Dronamraju <[email protected]>
---
Changelog v2->v3:
The actual detection is moved to set_cpus_allowed_common() from
sched_setaffinity(). This helps to solve all cases where the task's
cpus_allowed is set.

 kernel/sched/core.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3064e0f..37e62b8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1003,7 +1003,19 @@ static int migration_cpu_stop(void *data)
  */
 void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask)
 {
-	cpumask_copy(&p->cpus_allowed, new_mask);
+	const struct cpumask *hk_mask = housekeeping_cpumask(HK_FLAG_DOMAIN);
+
+	/*
+	 * If the cpumask provided has CPUs that are part of isolated and
+	 * housekeeping_cpumask, then restrict it to just the CPUs that
+	 * are part of the housekeeping_cpumask.
+	 */
+	if (!cpumask_subset(new_mask, hk_mask) &&
+	    cpumask_intersects(new_mask, hk_mask))
+		cpumask_and(&p->cpus_allowed, new_mask, hk_mask);
+	else
+		cpumask_copy(&p->cpus_allowed, new_mask);
+
 	p->nr_cpus_allowed = cpumask_weight(new_mask);
 }
 
-- 
1.8.3.1

