Hello, Atish, Peter, all. I have a question about the case where a task's nr_cpus_allowed is 1. In that scenario we never call the class's select_task_rq() callback, so even though a task "p" is placed on an idle CPU, that CPU will not be marked as claimed for wake-up.
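For context, here is a simplified sketch (from memory, 4.14-era) of the core wrapper in kernel/sched/core.c; the class callback, which is where the claim is currently set on the select_idle_sibling() path, is bypassed for pinned tasks:

static inline
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
	/*
	 * Only tasks that may run on more than one CPU go through the
	 * class callback (e.g. select_task_rq_fair(), which reaches
	 * select_idle_sibling()); a pinned task never does.
	 */
	if (p->nr_cpus_allowed > 1)
		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
	else
		cpu = cpumask_any(&p->cpus_allowed);

	/* fallback handling for !allowed/!online CPUs omitted here */
	return cpu;
}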
What do you think about setting per_cpu(claim_wakeup, cpu) = 1; in select_task_rq() instead, and possibly getting rid of it in the other places (this widens the race window a bit)? It would also help improve the ILB decision for NO_HZ idle balance. To be clearer, I mean:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..8111cfa20075 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,7 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
 		     !cpu_online(cpu)))
 		cpu = select_fallback_rq(task_cpu(p), p);
 
+	per_cpu(claim_wakeup, cpu) = 1;
 	return cpu;
 }
 
@@ -3888,6 +3889,7 @@ int task_prio(const struct task_struct *p)
 	return p->prio - MAX_RT_PRIO;
 }
 
+DEFINE_PER_CPU(int, claim_wakeup);
 /**
  * idle_cpu - is a given CPU idle currently?
  * @cpu: the processor in question.
@@ -3909,6 +3911,9 @@ int idle_cpu(int cpu)
 		return 0;
 #endif
 
+	if (per_cpu(claim_wakeup, cpu))
+		return 0;
+
 	return 1;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c09ddf8c832..55ad7fa97e95 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8596,7 +8596,17 @@ static struct {
 
 static inline int find_new_ilb(void)
 {
-	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	cpumask_t cpumask;
+	int ilb;
+
+	cpumask_copy(&cpumask, nohz.idle_cpus_mask);
+
+	/*
+	 * Probably not a good approach in case of many CPUs.
+	 */
+	for_each_cpu(ilb, &cpumask)
+		if (!per_cpu(claim_wakeup, ilb))
+			break;
 
 	if (ilb < nr_cpu_ids && idle_cpu(ilb))
 		return ilb;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d518664cce4f..91cf50cab836 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -29,6 +29,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	put_prev_task(rq, prev);
 	update_idle_core(rq);
+	this_cpu_write(claim_wakeup, 0);
 	schedstat_inc(rq->sched_goidle);
 	return rq->idle;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3b448ba82225..0e0bb7536725 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1055,6 +1055,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(int, claim_wakeup);
 
 struct sched_group_capacity {
 	atomic_t ref;
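As a side note, if the remaining window between reading and writing the flag ever matters, the claim could be made atomic. A rough sketch, not part of the diff above; claim_cpu_for_wakeup() and release_cpu_claim() are invented names:

static DEFINE_PER_CPU(atomic_t, claim_wakeup);

/* Returns true only for the first waker that claims @cpu. */
static inline bool claim_cpu_for_wakeup(int cpu)
{
	return atomic_cmpxchg(per_cpu_ptr(&claim_wakeup, cpu), 0, 1) == 0;
}

/* Would be called from pick_next_task_idle() when the CPU idles again. */
static inline void release_cpu_claim(int cpu)
{
	atomic_set(per_cpu_ptr(&claim_wakeup, cpu), 0);
}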
--
Uladzislau Rezki

On Tue, Nov 21, 2017 at 11:23:18PM -0600, Atish Patra wrote:
> Here are the results of schbench (scheduler latency benchmark) and uperf
> (networking benchmark).
>
> Hardware config: 20 core (40 hyperthreaded cpus) x86 box.
> schbench config: message threads = 2; time = 180s; worker threads = variable
> uperf config: ping-pong test on loopback interface with message size = 8k
>
> Overall, both benchmarks seem to be happiest when the number of threads
> is close to the number of cpus.
>
> ----------------------------------------------------------------------
> schbench Maximum Latency in usec (lower is better)
>
>                Base (4.14)             Base+pcpu
> Num
> Worker      Mean       Stdev        Mean       Stdev      Improvement (%)
> 10         3026.8     4987.12       523        474.35      82.7210255055
> 18        13854.6     1841.61     12945.6      125.19       6.5609977913
> 19        16457       2046.51     12985.4       48.46      21.0949747828
> 20        14995       2368.84     15838       2038.82      -5.621873958
> 25        29952.2      107.72     29673.6      337.57       0.9301487036
> 30        30084         19.768    30096.2        7.782     -0.0405531179
> ----------------------------------------------------------------------
>
> The proposed fix seems to improve the maximum latency for lower numbers
> of threads. It also seems to reduce the variation (lower stdev).
>
> If the number of threads is equal to or higher than the number of cpus,
> latencies are significantly higher because of the nature of the benchmark.
> Results for the higher-thread use cases are presented for completeness,
> but it is difficult to conclude anything from them.
>
> Next, individual percentile results are presented for each use case. The
> proposed fix also improves latency across all percentiles for the
> configuration (19 worker threads) that should saturate the system.
> ----------------------------------------------------------------------
> schbench Latency in usec (lower is better)
>
>                Baseline (4.14)        Base+pcpu
> Num
> Worker      Mean       stdev       Mean       stdev      Improvement (%)
>
> 50th
> 10           64.2       2.039       63.6       1.743       0.934
> 18           57.6       5.388       57         4.939       1.041
> 19           63         4.774       58         4           7.936
> 20           59.6       4.127       60.2       5.153      -1.006
> 25           78.4       0.489       78.2       0.748       0.255
> 30           96.2       0.748       96.4       1.019      -0.207
>
> 75th
> 10           72         3.033       71.6       2.939       0.555
> 18           78         2.097       77.2       2.135       1.025
> 19           81.6       1.2         79.4       0.8         2.696
> 20           81         1.264       80.4       2.332       0.740
> 25          109.6       1.019      110         0          -0.364
> 30          781.4      50.902      731.8      70.6382      6.3475
>
> 90th
> 10           80.4       3.666       80.6       2.576      -0.248
> 18           87.8       1.469       88         1.673      -0.227
> 19           92.8       0.979       90.6       0.489       2.370
> 20           92.6       1.019       92         2           0.647
> 25         8977.6    1277.160     9014.4     467.857      -0.409
> 30         9558.4     334.641     9507.2     320.383       0.5356
>
> 95th
> 10           86.8       3.867       87.6       4.409      -0.921
> 18           95.4       1.496       95.2       2.039       0.209
> 19          102.6       1.624       99         0.894       3.508
> 20          103.2       1.326      102.2       2.481       0.968
> 25        12400        78.383    12406.4      37.318      -0.051
> 30        12336        40.477    12310.4      12.8         0.207
>
> 99th
> 10           99.2       5.418      103.4       6.887      -4.233
> 18          115.2       2.561      114.6       3.611       0.5208
> 19          126.25      4.573      120.4       3.872       4.6336
> 20          145.4       3.09       133         1.41        8.5281
> 25        12988.8      15.676    12924.8      25.6         0.4927
> 30        12988.8      15.676    12956.8      32.633       0.2463
>
> 99.50th
> 10          104.4       5.161      109.8       7.909      -5.172
> 18          127.6       7.391      124.2       4.214       2.6645
> 19         2712.2    4772.883      133.6       5.571      95.074
> 20         3707.8    2831.954     2844.2    4708.345      23.291
> 25        14032      1283.834    13008         0           7.2976
> 30        16550.4     886.382    13840      1218.355      16.376
> ----------------------------------------------------------------------
>
> Results from uperf
> uperf config: loopback ping-pong test with message size = 8k
>
>                Baseline (4.14)        Base+pcpu
> Threads     Mean       stdev       Mean       stdev      Improvement (%)
> 1             9.056      0.02        8.966      0.083      -0.993
> 2            17.664      0.13       17.448      0.303      -1.222
> 4            32.03       0.22       31.972      0.129      -0.181
> 8            58.198      0.31       58.588      0.198       0.670
> 16          101.018      0.67      100.056      0.455      -0.952
> 32          148.1       15.41      164.494      2.312      11.069
> 64          203.66       1.16      203.042      1.348      -0.3073
> 128         197.12       1.04      194.722      1.174      -1.2165
>
> The race-window fix seems to help uperf at 32 threads (closest to the
> number of cpus) as well.
>
> Regards,
> Atish
>
> On 11/04/2017 07:58 PM, Joel Fernandes wrote:
> > Hi Peter,
> >
> > On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <pet...@infradead.org>
> > wrote:
> > > On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
> > > > Currently, multiple tasks can wake up on the same cpu from the
> > > > select_idle_sibling() path in case they wake up simultaneously
> > > > and last ran on the same llc. This happens because an idle cpu
> > > > is not updated until the idle task is scheduled out. Any task waking
> > > > during that period may potentially select that cpu as a wakeup
> > > > candidate.
> > > >
> > > > Introduce a per-cpu variable that is set as soon as a cpu is
> > > > selected for wakeup of any task. This prevents other tasks from
> > > > selecting the same cpu again. Note: this does not close the race
> > > > window but minimizes it to the access of the per-cpu variable. If two
> > > > wakee tasks access the per-cpu variable at the same time, they may
> > > > still select the same cpu. But it narrows the race window
> > > > considerably.
> > > The very most important question: does it actually help? What
> > > benchmarks give what numbers?
> > I collected some numbers with an Android benchmark called Jankbench.
> > Most tests didn't show an improvement or degradation with the patch.
> > However, one of the tests, called "list view", consistently shows an
> > improvement. Particularly striking is the improvement at the mean and
> > 25th percentile.
> >
> > For the list_view test, Jankbench pulls up a list of text and scrolls
> > the list; this exercises the display pipeline in Android to render and
> > display the animation as the scroll happens. For Android, lower frame
> > times are considered quite important as that means we are less likely
> > to drop frames and can give the user a good experience rather than a
> > perceivably poor one.
> >
> > For each frame, Jankbench measures the total time the frame takes and
> > stores it in a DB (the time from when the app starts drawing to when
> > the rendering completes and the frame is submitted for display).
> > Following is the distribution of frame times in ms.
> >
> > count    16304 (@60 fps, 4.5 minutes)
> >
> >          Without patch    With patch
> > mean     5.196633         4.429641  (+14.75%)
> > std      2.030054         2.310025
> > 25%      5.606810         1.991017  (+64.48%)
> > 50%      5.824013         5.716631  (+1.84%)
> > 75%      5.987102         5.932751  (+0.90%)
> > 95%      6.461230         6.301318  (+2.47%)
> > 99%      9.828959         9.697076  (+1.34%)
> >
> > Note that although Android uses energy-aware scheduling patches, I
> > turned those off to bring the test as close to mainline as possible. I
> > also backported Vincent's and Brendan's slow-path fixes to the 4.4
> > kernel that the Pixel 2 uses.
> >
> > Personally I am in favor of this patch considering this test data, but
> > also because in the past, I remember that our teams had to deal with
> > the same race issue and used cpusets to avoid it (although they
> > probably tested with "energy aware" CPU selection kept on).
> >
> > thanks,
> >
> > - Joel
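To make the race itself concrete: idle_cpu() keeps reporting a CPU as idle until the idle task is actually switched out. Roughly the unpatched 4.14 version (sketched from memory) that the diff above extends:

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/* Still running the idle task... */
	if (rq->curr != rq->idle)
		return 0;

	/* ...with nothing queued... */
	if (rq->nr_running)
		return 0;

#ifdef CONFIG_SMP
	/* ...and no pending remote wakeups. */
	if (!llist_empty(&rq->wake_list))
		return 0;
#endif

	/*
	 * Two simultaneous wakers can both get here and both see the CPU
	 * as idle; the per-cpu claim narrows exactly this window.
	 */
	return 1;
}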