After reading more traces and trying to understand why only untagged
tasks starve when cpu-intensive tasks are running on the same set of
CPUs, we noticed a difference in behavior in 'pick_task'. In the case
where 'core_cookie' is 0, we are supposed to prefer the tagged task
only if its priority is higher, but we were preferring it when the
priorities are equal as well, which causes the starvation. The
comparison in 'pick_task' is biased toward the first parameter of
'prio_less' in case of equality, which in this case was 'class_pick'
instead of 'max': 'prio_less(class_pick, max)' is false on a tie, so
the tagged 'class_pick' kept winning. Reversing the order of the
parameters solves this issue and matches the expected behavior.

So we can get rid of this vruntime_boost concept.

We have tested the fix below and it seems to work well with
tagged/untagged tasks.

Here are our initial test results. When core scheduling is enabled,
each VM (and its associated vhost threads) is in its own cgroup/tag.
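
For reference, and assuming the cpu.tag cgroup file from this patch
series (the cgroup path below is just an example), tagging a VM's
cgroup looks like:

  # put the VM's vcpu and vhost threads in one cpu cgroup, then tag it
  echo 1 > /sys/fs/cgroup/cpu/vm1/cpu.tag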

One 12-vcpu VM running a MySQL TPC-C benchmark (I/O + CPU), with 96
mostly-idle 1-vcpu VMs on each NUMA node (72 logical CPUs total with
SMT on):
+-------------+----------+--------------+------------+--------+
|             | baseline | coresched    | coresched  | nosmt  |
|             | no tag   | VMs tagged   | VMs tagged | no tag |
|             | v5.1.5   | no stall fix | stall fix  |        |
+-------------+----------+--------------+------------+--------+
|average TPS  | 1474     | 1289         | 1264       | 1339   |
|stdev        | 48       | 12           | 17         | 24     |
|overhead     | N/A      | -12%         | -14%       | -9%    |
+-------------+----------+--------------+------------+--------+

Three 12-vcpu VMs running Linpack (cpu-intensive), all pinned to the
same NUMA node (36 logical CPUs with SMT enabled on that NUMA node):
+---------------+----------+--------------+-----------+--------+
|               | baseline | coresched    | coresched | nosmt  |
|               | no tag   | VMs tagged   | VMs tagged| no tag |
|               | v5.1.5   | no stall fix | stall fix |        |
+---------------+----------+--------------+-----------+--------+
|average gflops | 177.9    | 171.3        | 172.7     | 81.9   |
|stdev          | 2.6      | 10.6         | 6.4       | 8.1    |
|overhead       | N/A      | -3.7%        | -2.9%     | -53.9% |
+---------------+----------+--------------+-----------+--------+

This fix can be toggled dynamically with the 'CORESCHED_STALL_FIX'
sched_feature, so it's easy to test before/after (it is disabled by
default).
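
For reference, sched_features can be flipped at runtime through
debugfs in the usual way:

  # enable the fix
  echo CORESCHED_STALL_FIX > /sys/kernel/debug/sched_features
  # disable it again
  echo NO_CORESCHED_STALL_FIX > /sys/kernel/debug/sched_features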

The up-to-date git tree can also be found here in case it's easier to
follow:
https://github.com/digitalocean/linux-coresched/commits/vpillai/coresched-v3-v5.1.5-test

Feedback welcome!

Thanks,

Julien

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6e79421..26fea68 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3668,8 +3668,10 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
                 * If class_pick is tagged, return it only if it has
                 * higher priority than max.
                 */
-               if (max && class_pick->core_cookie &&
-                   prio_less(class_pick, max))
+               bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ?
+                                    max && !prio_less(max, class_pick) :
+                                    max && prio_less(class_pick, max);
+               if (class_pick->core_cookie && max_is_higher)
                        return idle_sched_class.pick_task(rq);
 
                return class_pick;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..332a092 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,9 @@ SCHED_FEAT(WA_BIAS, true)
  * UtilEstimation. Use estimated CPU utilization.
  */
 SCHED_FEAT(UTIL_EST, true)
+
+/*
+ * Prevent task stall due to vruntime comparison limitation across
+ * cpus.
+ */
+SCHED_FEAT(CORESCHED_STALL_FIX, false)
