> 1) Unfairness between the sibling threads
> -----------------------------------------
> One sibling thread could suppress and force idle the
> other sibling thread disproportionately, resulting in
> the force-idled CPU not getting to run and its tasks
> stalling.
> 
> Status:
> i) Aaron has proposed a patchset here, based on using one
> rq as the base reference for vruntime when comparing task
> priority between siblings.
> 
> https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/
> It works well on fairness but has some initialization issues.
> 
> ii) Tim has proposed a patchset here to account for forced
> idle time in the rq's min_vruntime.
> https://lore.kernel.org/lkml/f96350c1-25a9-0564-ff46-6658e96d7...@linux.intel.com/
> It improves over v3 and uses simpler logic than Aaron's
> patch, but does not do as well on fairness.
> 
> iii) Per Peter's suggestion, Tim has proposed yet another
> patch to maintain fairness of forced idle time between CPU
> threads.
> https://lore.kernel.org/lkml/21933a50-f796-3d28-664c-030cb7c98...@linux.intel.com/
> Its performance has yet to be tested.
> 
> 2) Not rescheduling forced idled CPU
> ------------------------------------
> The force-idled CPU does not get a chance to reschedule
> itself and can stall for a long time even though it
> has eligible tasks to run.
> 
> Status:
> i) Aaron proposed a patch that fixes this by checking for
> runnable tasks when the scheduling tick comes in.
> https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/
> 
> ii) Vineeth has patches addressing this issue (and also
> issue 1), based on scheduling a new "forced idle task" when
> a sibling is forced idle, but has yet to post them.

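To make issue 2 concrete: the tick-based fix in 2.i above amounts to
roughly the following (a simplified sketch of our reading of Aaron's
patch; resched_forceidle_tick is an illustrative name, not the posted
code):

/*
 * Sketch, called from the scheduler tick: if this CPU was forced idle
 * by its sibling but now has runnable fair tasks, request a resched so
 * the core-wide pick is redone and one of them gets to run instead of
 * stalling until the sibling reschedules.
 */
static void resched_forceidle_tick(struct rq *rq)
{
        if (rq->core_forceidle && rq->cfs.nr_running)
                resched_curr(rq);
}
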
We have finished writing and debugging the PoC for the coresched_idle
task; here are the results and the code.
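
The gist of the approach, boiled down to a sketch (illustrative only;
forceidle_pick is a made-up name, and the real changes are in the
patches at the end of this mail):

/*
 * Sketch: when the core-wide pick decides a sibling must be forced
 * idle, hand back that rq's dedicated coresched_idle kthread instead
 * of the generic idle task, so the forced-idle slot is filled by a
 * real, schedulable task that the pick logic and accounting can see.
 */
static struct task_struct *forceidle_pick(struct rq *rq)
{
        if (rq->core_idle_task)
                return rq->core_idle_task;

        /* No worker yet (e.g. core sched disabled): normal idle task. */
        return idle_sched_class.pick_task(rq);
}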

These patches apply on top of Aaron's patches:
- sched: Fix incorrect rq tagged as forced idle
- wrapper for cfs_rq->min_vruntime
  https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
- core vruntime comparison
  https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/
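
For reference, the core vruntime comparison those patches implement
works roughly like this (our simplified paraphrase, not Aaron's code
verbatim; his actual patch uses one rq as the base reference):

/*
 * Simplified paraphrase: make fair tasks on sibling runqueues
 * comparable by normalizing each task's vruntime against its own
 * cfs_rq's min_vruntime; the task that is further behind wins.
 */
static inline bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
{
        u64 va = a->se.vruntime - task_cfs_rq(a)->min_vruntime;
        u64 vb = b->se.vruntime - task_cfs_rq(b)->min_vruntime;

        /* Lower normalized vruntime means 'a' should run first. */
        return (s64)(va - vb) < 0;
}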

For the testing, we used the same strategy as described in
https://lore.kernel.org/lkml/20190802153715.GA18075@sinkpad/

No tag
------
Test                            Average     Stdev
Alone                           1306.90     0.94
nosmt                           649.95      1.44
Aaron's full patchset           828.15      32.45
Aaron's first 2 patches         832.12      36.53
Tim's first patchset            852.50      4.11
Tim's second patchset           855.11      9.89
coresched_idle                  985.67      0.83

Sysbench mem untagged, sysbench cpu tagged
------------------------------------------
Test                            Average     Stdev
Alone                           1306.90     0.94
nosmt                           649.95      1.44
Aaron's full patchset           586.06      1.77
Tim's first patchset            852.50      4.11
Tim's second patchset           663.88      44.43
coresched_idle                  653.58      0.49

Sysbench mem tagged, sysbench cpu untagged
------------------------------------------
Test                            Average     Stdev
Alone                           1306.90     0.94
nosmt                           649.95      1.44
Aaron's full patchset           583.77      3.52
Tim's first patchset            564.04      58.05
Tim's second patchset           524.72      55.24
coresched_idle                  653.30      0.81

Both sysbench tagged
--------------------
Test                            Average     Stdev
Alone                           1306.90     0.94
nosmt                           649.95      1.44
Aaron's full patchset           582.15      3.75
Tim's first patchset            679.43      70.07
Tim's second patchset           563.10      34.58
coresched_idle                  653.12      1.68

As we can see from this stress test, with the coresched_idle thread
being a real process, fairness is more consistent (low stdev). Also,
performance remains the same regardless of which combination is
tagged, and is consistently slightly better than nosmt.

Thanks,

Julien

From: vpillai <vpil...@digitalocean.com>
Date: Wed, 4 Sep 2019 17:41:38 +0000
Subject: [RFC PATCH 1/2] coresched_idle thread

---
 kernel/sched/core.c  | 46 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 47 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f7839bf96e8b..fe560739c247 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3639,6 +3639,51 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
        return a->core_cookie == b->core_cookie;
 }
 
+static int coresched_idle_worker(void *data)
+{
+       struct rq *rq = (struct rq *)data;
+
+       /*
+        * Transition to parked state and dequeue from runqueue.
+        * pick_task() will select us if needed without enqueueing.
+        */
+       set_special_state(TASK_PARKED);
+       schedule();
+
+       while (true) {
+               if (kthread_should_stop())
+                       break;
+
+               play_idle(1);
+       }
+
+       return 0;
+}
+
+static void coresched_idle_worker_init(struct rq *rq)
+{
+
+       // XXX core_idle_task needs lock protection?
+       if (!rq->core_idle_task) {
+               rq->core_idle_task = kthread_create_on_cpu(coresched_idle_worker,
+                               (void *)rq, cpu_of(rq), "coresched_idle");
+               if (rq->core_idle_task) {
+                       wake_up_process(rq->core_idle_task);
+               }
+
+       }
+
+       return;
+}
+
+static void coresched_idle_worker_fini(struct rq *rq)
+{
+       if (rq->core_idle_task) {
+               kthread_stop(rq->core_idle_task);
+               rq->core_idle_task = NULL;
+       }
+}
+
 // XXX fairness/fwd progress conditions
 /*
  * Returns
@@ -6774,6 +6819,7 @@ void __init sched_init(void)
                atomic_set(&rq->nr_iowait, 0);
 
 #ifdef CONFIG_SCHED_CORE
+               rq->core_idle_task = NULL;
                rq->core = NULL;
                rq->core_pick = NULL;
                rq->core_enabled = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..c3ae0af55b05 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -965,6 +965,7 @@ struct rq {
        unsigned int            core_sched_seq;
        struct rb_root          core_tree;
        bool                    core_forceidle;
+       struct task_struct      *core_idle_task;
 
        /* shared state */
        unsigned int            core_task_seq;
-- 
2.17.1

From: vpillai <vpil...@digitalocean.com>
Date: Wed, 4 Sep 2019 18:22:55 +0000
Subject: [RFC PATCH 2/2] Use coresched_idle to force idle a sibling

Currently we use the idle thread to force idle a sibling. Let's
use the new coresched_idle thread instead, so that the scheduler
sees a valid task during forced idle.
---
 kernel/sched/core.c | 66 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 56 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe560739c247..e35d69a81adb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -244,23 +244,33 @@ static int __sched_core_stopper(void *data)
 static DEFINE_MUTEX(sched_core_mutex);
 static int sched_core_count;
 
+static void coresched_idle_worker_init(struct rq *rq);
+static void coresched_idle_worker_fini(struct rq *rq);
 static void __sched_core_enable(void)
 {
+       int cpu;
+
        // XXX verify there are no cookie tasks (yet)
 
        static_branch_enable(&__sched_core_enabled);
        stop_machine(__sched_core_stopper, (void *)true, NULL);
 
+       for_each_online_cpu(cpu)
+               coresched_idle_worker_init(cpu_rq(cpu));
        printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
+       int cpu;
+
        // XXX verify there are no cookie tasks (left)
 
        stop_machine(__sched_core_stopper, (void *)false, NULL);
        static_branch_disable(&__sched_core_enabled);
 
+       for_each_online_cpu(cpu)
+               coresched_idle_worker_fini(cpu_rq(cpu));
        printk("core sched disabled\n");
 }
 
@@ -3626,14 +3636,25 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 #ifdef CONFIG_SCHED_CORE
 
+static inline bool is_force_idle_task(struct task_struct *p)
+{
+       BUG_ON(task_rq(p)->core_idle_task == NULL);
+       return task_rq(p)->core_idle_task == p;
+}
+
+static inline bool is_core_idle_task(struct task_struct *p)
+{
+       return is_idle_task(p) || is_force_idle_task(p);
+}
+
 static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
 {
-       return is_idle_task(a) || (a->core_cookie == cookie);
+       return is_core_idle_task(a) || (a->core_cookie == cookie);
 }
 
 static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 {
-       if (is_idle_task(a) || is_idle_task(b))
+       if (is_core_idle_task(a) || is_core_idle_task(b))
                return true;
 
        return a->core_cookie == b->core_cookie;
@@ -3641,8 +3662,6 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 
 static int coresched_idle_worker(void *data)
 {
-       struct rq *rq = (struct rq *)data;
-
        /*
         * Transition to parked state and dequeue from runqueue.
         * pick_task() will select us if needed without enqueueing.
@@ -3666,7 +3685,7 @@ static void coresched_idle_worker_init(struct rq *rq)
        // XXX core_idle_task needs lock protection?
        if (!rq->core_idle_task) {
        rq->core_idle_task = kthread_create_on_cpu(coresched_idle_worker,
-                               (void *)rq, cpu_of(rq), "coresched_idle");
+                               NULL, cpu_of(rq), "coresched_idle");
                if (rq->core_idle_task) {
                        wake_up_process(rq->core_idle_task);
                }
@@ -3684,6 +3703,14 @@ static void coresched_idle_worker_fini(struct rq *rq)
        }
 }
 
+static inline struct task_struct *core_idle_task(struct rq *rq)
+{
+       BUG_ON(rq->core_idle_task == NULL);
+
+       return rq->core_idle_task;
+
+}
+
 // XXX fairness/fwd progress conditions
 /*
  * Returns
@@ -3709,7 +3736,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max
                 */
                if (max && class_pick->core_cookie &&
                    prio_less(class_pick, max))
-                       return idle_sched_class.pick_task(rq);
+                       return core_idle_task(rq);
 
                return class_pick;
        }
@@ -3853,7 +3880,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
                                goto done;
                        }
 
-                       if (!is_idle_task(p))
+                       if (!is_force_idle_task(p))
                                occ++;
 
                        rq_i->core_pick = p;
@@ -3906,7 +3933,6 @@ next_class:;
        rq->core->core_pick_seq = rq->core->core_task_seq;
        next = rq->core_pick;
        rq->core_sched_seq = rq->core->core_pick_seq;
-       trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
        /*
         * Reschedule siblings
@@ -3924,13 +3950,24 @@ next_class:;
 
                WARN_ON_ONCE(!rq_i->core_pick);
 
-               if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
+               if (is_core_idle_task(rq_i->core_pick) && rq_i->nr_running) {
+                       /*
+                        * Matching logic can sometimes select idle_task when
+                        * iterating the sched_classes. If that selection is
+                        * actually a forced idle case, we need to update the
+                        * core_pick to coresched_idle.
+                        */
+                       if (is_idle_task(rq_i->core_pick))
+                               rq_i->core_pick = core_idle_task(rq_i);
                        rq_i->core_forceidle = true;
+               }
 
                rq_i->core_pick->core_occupation = occ;
 
-               if (i == cpu)
+               if (i == cpu) {
+                       next = rq_i->core_pick;
                        continue;
+               }
 
                if (rq_i->curr != rq_i->core_pick) {
                        trace_printk("IPI(%d)\n", i);
@@ -3947,6 +3984,7 @@ next_class:;
                        WARN_ON_ONCE(1);
                }
        }
+       trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
 done:
        set_next_task(rq, next);
@@ -4200,6 +4238,12 @@ static void __sched notrace __schedule(bool preempt)
                 *   is a RELEASE barrier),
                 */
                ++*switch_count;
+#ifdef CONFIG_SCHED_CORE
+               if (next == rq->core_idle_task)
+                       next->state = TASK_RUNNING;
+               else if (prev == rq->core_idle_task)
+                       prev->state = TASK_PARKED;
+#endif
 
                trace_sched_switch(preempt, prev, next);
 
@@ -6479,6 +6523,7 @@ int sched_cpu_activate(unsigned int cpu)
 #ifdef CONFIG_SCHED_CORE
                if (static_branch_unlikely(&__sched_core_enabled)) {
                        rq->core_enabled = true;
+                       coresched_idle_worker_init(rq);
                }
 #endif
        }
@@ -6535,6 +6580,7 @@ int sched_cpu_deactivate(unsigned int cpu)
                struct rq *rq = cpu_rq(cpu);
                if (static_branch_unlikely(&__sched_core_enabled)) {
                        rq->core_enabled = false;
+                       coresched_idle_worker_fini(rq);
                }
 #endif
                static_branch_dec_cpuslocked(&sched_smt_present);
-- 
2.17.1
