Hello, Atish, Peter, all. I have a question about the case where a task's nr_cpus_allowed is 1. In that scenario we never call the class's select_task_rq() callback, so even though a task "p" is placed on an idle CPU, that CPU will not be marked as claimed for wake-up.
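For context, here is a simplified sketch (from memory, 4.14-era) of the core wrapper in kernel/sched/core.c; the class callback, which is where the claim is currently set on the select_idle_sibling() path, is bypassed for pinned tasks:

static inline
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
	/*
	 * Only tasks that may run on more than one CPU go through the
	 * class callback (e.g. select_task_rq_fair(), which reaches
	 * select_idle_sibling()); a pinned task never does.
	 */
	if (p->nr_cpus_allowed > 1)
		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
	else
		cpu = cpumask_any(&p->cpus_allowed);

	/* fallback handling for !allowed/!online CPUs omitted here */
	return cpu;
}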
What do you think about setting per_cpu(claim_wakeup, cpu) = 1; in select_task_rq() instead, and possibly getting rid of it in the other places (this widens the race window a bit)? It would also help improve the ILB decision for NO_HZ idle balance. To be clearer, I mean:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..8111cfa20075 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,7 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
 		     !cpu_online(cpu)))
 		cpu = select_fallback_rq(task_cpu(p), p);
 
+	per_cpu(claim_wakeup, cpu) = 1;
 	return cpu;
 }
 
@@ -3888,6 +3889,7 @@ int task_prio(const struct task_struct *p)
 	return p->prio - MAX_RT_PRIO;
 }
 
+DEFINE_PER_CPU(int, claim_wakeup);
 /**
  * idle_cpu - is a given CPU idle currently?
  * @cpu: the processor in question.
@@ -3909,6 +3911,9 @@ int idle_cpu(int cpu)
 		return 0;
 #endif
 
+	if (per_cpu(claim_wakeup, cpu))
+		return 0;
+
 	return 1;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c09ddf8c832..55ad7fa97e95 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8596,7 +8596,17 @@ static struct {
 
 static inline int find_new_ilb(void)
 {
-	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	cpumask_t cpumask;
+	int ilb;
+
+	cpumask_copy(&cpumask, nohz.idle_cpus_mask);
+
+	/*
+	 * Probably not a good approach in case of many CPUs.
+	 */
+	for_each_cpu(ilb, &cpumask)
+		if (!per_cpu(claim_wakeup, ilb))
+			break;
 
 	if (ilb < nr_cpu_ids && idle_cpu(ilb))
 		return ilb;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d518664cce4f..91cf50cab836 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -29,6 +29,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	put_prev_task(rq, prev);
 	update_idle_core(rq);
+	this_cpu_write(claim_wakeup, 0);
 	schedstat_inc(rq->sched_goidle);
 	return rq->idle;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3b448ba82225..0e0bb7536725 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1055,6 +1055,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(int, claim_wakeup);
 
 struct sched_group_capacity {
 	atomic_t ref;
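As a side note, if the remaining window between reading and writing the flag ever matters, the claim could be made atomic. A rough sketch, not part of the diff above; claim_cpu_for_wakeup() and release_cpu_claim() are invented names:

static DEFINE_PER_CPU(atomic_t, claim_wakeup);

/* Returns true only for the first waker that claims @cpu. */
static inline bool claim_cpu_for_wakeup(int cpu)
{
	return atomic_cmpxchg(per_cpu_ptr(&claim_wakeup, cpu), 0, 1) == 0;
}

/* Would be called from pick_next_task_idle() when the CPU idles again. */
static inline void release_cpu_claim(int cpu)
{
	atomic_set(per_cpu_ptr(&claim_wakeup, cpu), 0);
}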
--
Uladzislau Rezki

On Tue, Nov 21, 2017 at 11:23:18PM -0600, Atish Patra wrote:
> Here are the results of schbench (scheduler latency benchmark) and uperf
> (networking benchmark).
>
> Hardware config: 20 core (40 hyperthreaded cpus) x86 box.
> schbench config: message threads = 2; time = 180s; worker threads = variable
> uperf config: ping-pong test on loopback interface with message size = 8k
>
> Overall, both benchmarks seem to be happiest when the number of threads
> is close to the number of cpus.
>
> ----------------------------------------------------------------------
> schbench Maximum Latency in usec (lower is better)
>
>                Base (4.14)             Base+pcpu
> Num
> Worker      Mean       Stdev        Mean       Stdev      Improvement (%)
> 10         3026.8     4987.12       523        474.35      82.7210255055
> 18        13854.6     1841.61     12945.6      125.19       6.5609977913
> 19        16457       2046.51     12985.4       48.46      21.0949747828
> 20        14995       2368.84     15838       2038.82      -5.621873958
> 25        29952.2      107.72     29673.6      337.57       0.9301487036
> 30        30084         19.768    30096.2        7.782     -0.0405531179
> ----------------------------------------------------------------------
>
> The proposed fix seems to improve the maximum latency for lower numbers
> of threads. It also seems to reduce the variation (lower stdev).
>
> If the number of threads is equal to or higher than the number of cpus,
> latencies are significantly higher because of the nature of the benchmark.
> Results for the higher-thread use cases are presented for completeness,
> but it is difficult to conclude anything from them.
>
> Next, individual percentile results are presented for each use case. The
> proposed fix also improves latency across all percentiles for the
> configuration (19 worker threads) that should saturate the system.
> ----------------------------------------------------------------------
> schbench Latency in usec (lower is better)
>
>                Baseline (4.14)        Base+pcpu
> Num
> Worker      Mean       stdev       Mean       stdev      Improvement (%)
>
> 50th
> 10           64.2       2.039       63.6       1.743       0.934
> 18           57.6       5.388       57         4.939       1.041
> 19           63         4.774       58         4           7.936
> 20           59.6       4.127       60.2       5.153      -1.006
> 25           78.4       0.489       78.2       0.748       0.255
> 30           96.2       0.748       96.4       1.019      -0.207
>
> 75th
> 10           72         3.033       71.6       2.939       0.555
> 18           78         2.097       77.2       2.135       1.025
> 19           81.6       1.2         79.4       0.8         2.696
> 20           81         1.264       80.4       2.332       0.740
> 25          109.6       1.019      110         0          -0.364
> 30          781.4      50.902      731.8      70.6382      6.3475
>
> 90th
> 10           80.4       3.666       80.6       2.576      -0.248
> 18           87.8       1.469       88         1.673      -0.227
> 19           92.8       0.979       90.6       0.489       2.370
> 20           92.6       1.019       92         2           0.647
> 25         8977.6    1277.160     9014.4     467.857      -0.409
> 30         9558.4     334.641     9507.2     320.383       0.5356
>
> 95th
> 10           86.8       3.867       87.6       4.409      -0.921
> 18           95.4       1.496       95.2       2.039       0.209
> 19          102.6       1.624       99         0.894       3.508
> 20          103.2       1.326      102.2       2.481       0.968
> 25        12400        78.383    12406.4      37.318      -0.051
> 30        12336        40.477    12310.4      12.8         0.207
>
> 99th
> 10           99.2       5.418      103.4       6.887      -4.233
> 18          115.2       2.561      114.6       3.611       0.5208
> 19          126.25      4.573      120.4       3.872       4.6336
> 20          145.4       3.09       133         1.41        8.5281
> 25        12988.8      15.676    12924.8      25.6         0.4927
> 30        12988.8      15.676    12956.8      32.633       0.2463
>
> 99.50th
> 10          104.4       5.161      109.8       7.909      -5.172
> 18          127.6       7.391      124.2       4.214       2.6645
> 19         2712.2    4772.883      133.6       5.571      95.074
> 20         3707.8    2831.954     2844.2    4708.345      23.291
> 25        14032      1283.834    13008         0           7.2976
> 30        16550.4     886.382    13840      1218.355      16.376
> ----------------------------------------------------------------------
>
> Results from uperf
> uperf config: loopback ping-pong test with message size = 8k
>
>                Baseline (4.14)        Base+pcpu
> Threads     Mean       stdev       Mean       stdev      Improvement (%)
> 1             9.056      0.02        8.966      0.083      -0.993
> 2            17.664      0.13       17.448      0.303      -1.222
> 4            32.03       0.22       31.972      0.129      -0.181
> 8            58.198      0.31       58.588      0.198       0.670
> 16          101.018      0.67      100.056      0.455      -0.952
> 32          148.1       15.41      164.494      2.312      11.069
> 64          203.66       1.16      203.042      1.348      -0.3073
> 128         197.12       1.04      194.722      1.174      -1.2165
>
> The race-window fix seems to help uperf at 32 threads (closest to the
> number of cpus) as well.
>
> Regards,
> Atish
>
> On 11/04/2017 07:58 PM, Joel Fernandes wrote:
> > Hi Peter,
> >
> > On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <pet...@infradead.org>
> > wrote:
> > > On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
> > > > Currently, multiple tasks can wake up on the same cpu from the
> > > > select_idle_sibling() path in case they wake up simultaneously
> > > > and last ran on the same llc. This happens because an idle cpu
> > > > is not updated until the idle task is scheduled out. Any task waking
> > > > during that period may potentially select that cpu as a wakeup
> > > > candidate.
> > > >
> > > > Introduce a per-cpu variable that is set as soon as a cpu is
> > > > selected for wakeup of any task. This prevents other tasks from
> > > > selecting the same cpu again. Note: this does not close the race
> > > > window but minimizes it to the access of the per-cpu variable. If two
> > > > wakee tasks access the per-cpu variable at the same time, they may
> > > > still select the same cpu. But it narrows the race window
> > > > considerably.
> > > The very most important question: does it actually help? What
> > > benchmarks give what numbers?
> > I collected some numbers with an Android benchmark called Jankbench.
> > Most tests didn't show an improvement or degradation with the patch.
> > However, one of the tests, called "list view", consistently shows an
> > improvement. Particularly striking is the improvement at the mean and
> > 25th percentile.
> >
> > For the list_view test, Jankbench pulls up a list of text and scrolls
> > the list; this exercises the display pipeline in Android to render and
> > display the animation as the scroll happens. For Android, lower frame
> > times are considered quite important as that means we are less likely
> > to drop frames and can give the user a good experience rather than a
> > perceivably poor one.
> >
> > For each frame, Jankbench measures the total time the frame takes and
> > stores it in a DB (the time from when the app starts drawing to when
> > the rendering completes and the frame is submitted for display).
> > Following is the distribution of frame times in ms.
> >
> > count    16304 (@60 fps, 4.5 minutes)
> >
> >          Without patch    With patch
> > mean     5.196633         4.429641  (+14.75%)
> > std      2.030054         2.310025
> > 25%      5.606810         1.991017  (+64.48%)
> > 50%      5.824013         5.716631  (+1.84%)
> > 75%      5.987102         5.932751  (+0.90%)
> > 95%      6.461230         6.301318  (+2.47%)
> > 99%      9.828959         9.697076  (+1.34%)
> >
> > Note that although Android uses energy-aware scheduling patches, I
> > turned those off to bring the test as close to mainline as possible. I
> > also backported Vincent's and Brendan's slow-path fixes to the 4.4
> > kernel that the Pixel 2 uses.
> >
> > Personally I am in favor of this patch considering this test data, but
> > also because in the past, I remember that our teams had to deal with
> > the same race issue and used cpusets to avoid it (although they
> > probably tested with "energy aware" CPU selection kept on).
> >
> > thanks,
> >
> > - Joel
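To make the race itself concrete: idle_cpu() keeps reporting a CPU as idle until the idle task is actually switched out. Roughly the unpatched 4.14 version (sketched from memory) that the diff above extends:

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/* Still running the idle task... */
	if (rq->curr != rq->idle)
		return 0;

	/* ...with nothing queued... */
	if (rq->nr_running)
		return 0;

#ifdef CONFIG_SMP
	/* ...and no pending remote wakeups. */
	if (!llist_empty(&rq->wake_list))
		return 0;
#endif

	/*
	 * Two simultaneous wakers can both get here and both see the CPU
	 * as idle; the per-cpu claim narrows exactly this window.
	 */
	return 1;
}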