On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <j...@joelfernandes.org> wrote:
>
> Hi Vincent,
>
> Thanks for the reply. Please see my replies below:
>
> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > <j...@joelfernandes.org> wrote:
> > >
> > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > that there are a lot of calls to update_blocked_averages(). This causes
> > > the schedule loop to slow down, at times taking up to 500 microseconds
> > > (due to newidle load balance). I have also seen this manifest in the
> > > periodic balancer.
> > >
> > > A closer look shows that the problem is caused by the following
> > > ingredients:
> > > 1. If the system has a lot of inactive cgroups (thanks Dietmar for
> > > suggesting to inspect /proc/sched_debug for this), this can make
> > > __update_blocked_fair() take a long time.
> >
> > Inactive cgroups are removed from the list, so they should not impact
> > the duration.
>
> I meant blocked cgroups. According to this code, a cfs_rq can be partially
> decayed and not have any tasks running on it, yet its load still needs to
> be decayed, correct? That's what I meant by 'inactive'. I can reword it to
> 'blocked'.
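For reference, the walk being discussed is the leaf-cfs_rq loop; a
simplified sketch of mainline __update_blocked_fair() (load propagation to
parent entities and the *done bookkeeping are omitted):

	static void __update_blocked_fair(struct rq *rq, bool *done)
	{
		struct cfs_rq *cfs_rq, *pos;

		/*
		 * Walk every leaf cfs_rq on this runqueue, including ones
		 * belonging to cgroups with no runnable tasks, whose
		 * blocked load still has to be decayed.
		 */
		for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
			if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
				update_tg_load_avg(cfs_rq, 0);

			/*
			 * There can be a lot of idle CPU cgroups.  Don't let
			 * fully decayed cfs_rqs linger on the list.
			 */
			if (cfs_rq_is_decayed(cfs_rq))
				list_del_leaf_cfs_rq(cfs_rq);
		}
	}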
How many blocked cgroups have you got?

>	/*
>	 * There can be a lot of idle CPU cgroups.  Don't let fully
>	 * decayed cfs_rqs linger on the list.
>	 */
>	if (cfs_rq_is_decayed(cfs_rq))
>		list_del_leaf_cfs_rq(cfs_rq);
>
> > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > shared frequency domain configuration to be slower than usual. (the load
> >
> > What do you mean exactly by "it causes schedutil to be slower than usual"?
>
> sugov_next_freq_shared() is O(number of CPUs in a cluster). This system is
> a 6+2 system with 6 CPUs in a cluster. A schedutil shared-policy frequency
> update needs to go through the utilization of the other CPUs in the
> cluster. I believe this could be adding to the problem, but it is not
> really needed to optimize it if we can rate-limit the calls to
> update_blocked_averages to begin with.

Qais mentioned that half of the time is spent in sugov_next_freq_shared().
Are there any frequency changes resulting in this call?

> > > average updates also try to update the frequency in schedutil).
> > >
> > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > code paths to take longer than when running at a high CPU frequency.
> >
> > Low frequency usually means low utilization, so it should not happen
> > that much.
>
> It happens a lot, as can be seen with schbench. It is super easy to
> reproduce.

Happening a lot is not in itself a problem if there is nothing else to do,
so it's not an argument by itself. So why is it a problem for you? You are
mentioning newly idle load balance, so I assume that your root problem is
the scheduling delay generated by the newly idle load balance, which then
calls update_blocked_averages.

Rate-limiting the call to update_blocked_averages() will only reduce the
number of times it happens, but it will not prevent it from happening.

> schedule() can result in a new idle balance, with the CFS pick call
> happening often. Here is a function graph trace. The tracer shows
> update_blocked_averages taking a lot of time.
> sugov:0-2454 [002] 2657.992570: funcgraph_entry:              | load_balance() {
> sugov:0-2454 [002] 2657.992577: funcgraph_entry:              |   update_group_capacity() {
> sugov:0-2454 [002] 2657.992580: funcgraph_entry:    2.656 us  |     __msecs_to_jiffies();
> sugov:0-2454 [002] 2657.992585: funcgraph_entry:    2.447 us  |     _raw_spin_lock_irqsave();
> sugov:0-2454 [002] 2657.992591: funcgraph_entry:    2.552 us  |     _raw_spin_unlock_irqrestore();
> sugov:0-2454 [002] 2657.992595: funcgraph_exit:  + 17.448 us  |   }
> sugov:0-2454 [002] 2657.992597: funcgraph_entry:    1.875 us  |   update_nohz_stats();
> sugov:0-2454 [002] 2657.992601: funcgraph_entry:    1.667 us  |   idle_cpu();
> sugov:0-2454 [002] 2657.992605: funcgraph_entry:              |   update_nohz_stats() {
> sugov:0-2454 [002] 2657.992608: funcgraph_entry:  + 33.333 us |     update_blocked_averages();
> sugov:0-2454 [002] 2657.992643: funcgraph_exit:  + 38.073 us  |   }
> sugov:0-2454 [002] 2657.992645: funcgraph_entry:    1.770 us  |   idle_cpu();
> sugov:0-2454 [002] 2657.992649: funcgraph_entry:              |   update_nohz_stats() {
> sugov:0-2454 [002] 2657.992651: funcgraph_entry:  + 41.823 us |     update_blocked_averages();
> sugov:0-2454 [002] 2657.992694: funcgraph_exit:  + 45.729 us  |   }
> sugov:0-2454 [002] 2657.992696: funcgraph_entry:    1.823 us  |   idle_cpu();
> sugov:0-2454 [002] 2657.992700: funcgraph_entry:              |   update_nohz_stats() {
> sugov:0-2454 [002] 2657.992702: funcgraph_entry:  + 35.312 us |     update_blocked_averages();
> sugov:0-2454 [002] 2657.992740: funcgraph_exit:  + 39.792 us  |   }
> sugov:0-2454 [002] 2657.992742: funcgraph_entry:    1.771 us  |   idle_cpu();
> sugov:0-2454 [002] 2657.992746: funcgraph_entry:              |   update_nohz_stats() {
> sugov:0-2454 [002] 2657.992748: funcgraph_entry:  + 33.438 us |     update_blocked_averages();
> sugov:0-2454 [002] 2657.992783: funcgraph_exit:  + 37.500 us  |   }
> sugov:0-2454 [002] 2657.992785: funcgraph_entry:    1.771 us  |   idle_cpu();
> sugov:0-2454 [002] 2657.992790: funcgraph_entry:              |   update_nohz_stats() {
> sugov:0-2454 [002] 2657.992792: funcgraph_entry:  + 45.521 us |     update_blocked_averages();
> sugov:0-2454 [002] 2657.992839: funcgraph_exit:  + 49.323 us  |   }
> sugov:0-2454 [002] 2657.992842: funcgraph_entry:    1.823 us  |   idle_cpu();
> sugov:0-2454 [002] 2657.992847: funcgraph_entry:              |   update_nohz_stats() {
> sugov:0-2454 [002] 2657.992850: funcgraph_entry:  + 67.187 us |     update_blocked_averages();
> sugov:0-2454 [002] 2657.992919: funcgraph_exit:  + 72.031 us  |   }
> sugov:0-2454 [002] 2657.992921: funcgraph_entry:    2.760 us  |   idle_cpu();
> sugov:0-2454 [002] 2657.992926: funcgraph_entry:              |   update_nohz_stats() {
> sugov:0-2454 [002] 2657.992928: funcgraph_entry:  + 61.146 us |     update_blocked_averages();
> sugov:0-2454 [002] 2657.992992: funcgraph_exit:  + 65.886 us  |   }
> sugov:0-2454 [002] 2657.992994: funcgraph_entry:    1.771 us  |   idle_cpu();
> sugov:0-2454 [002] 2657.992998: funcgraph_exit:  ! 430.209 us | }
> sugov:0-2454 [002] 2657.993006: bprint:  trace_long: wtf: lb: 432916
> sugov:0-2454 [002] 2657.993017: bprint:  trace_long: wtf: newidle_balance: 501458
>
> > > The fix is to simply rate-limit the calls to update_blocked_averages
> > > to 20 times per second. It appears that updating the blocked averages
> > > less often is sufficient. Currently I see about 200 calls per second
> >
> > Would be good to explain why updating less often is sufficient?
>
> I don't know this code that well; intuitively, it seems to me that updating
> blocked averages at such a high rate is pointless. But I defer to your
> expertise on that.
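As an aside, one way to reason about why a lower rate can be sufficient:
PELT decays blocked load geometrically with a 32 ms half-life, and
update_blocked_averages() catches up on all elapsed periods in one call, so
updating less often mainly means staler readings rather than lost decay. A
standalone sketch of the arithmetic (illustrative userspace C, not kernel
code; the kernel does this with fixed-point math in decay_load()):

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		double load = 1024.0;             /* initial blocked load */
		double y = pow(0.5, 1.0 / 32.0);  /* per-ms decay: y^32 == 0.5 */

		/* After 50 ms with no update, one catch-up call applies y^50. */
		printf("after 50 ms: %.1f\n", load * pow(y, 50.0));  /* ~346.7 */
		return 0;
	}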
> Why do you feel an update is needed at least HZ times per second? What
> about a system with HZ=1000 or 300? That seems to be an insane rate of
> updating (not to mention all the complexity of going through the leaf
> cgroup list and doing the frequency updates).

Your trace above is a full update of blocked load, which is a normal
sequence that happens regularly, but you don't show that it happens too
often.

The call to update_blocked_averages is rate-limited by the load-balance
interval for idle and busy load balance, so those 2 cases should not be a
problem.

There is no rate limit in the newly idle case because it is assumed that
the latter is normally run only if it has been estimated that there is
enough time to run it. That's why update_blocked_averages is called all the
time during newly idle load balance; see in update_sd_lb_stats():

	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
		env->flags |= LBF_NOHZ_STATS;

We could consider removing this condition, but I need time to recall all
the side effects of doing so.

IIUC, your real problem is that newidle_balance is running while a task is
about to wake up on the cpu, and we don't abort quickly enough during this
load_balance. So we could also try to abort earlier in the newly idle load
balance case.

> > > sometimes, which seems overkill.
> > >
> > > schbench shows a clear improvement with the change:
> >
> > Have you got more details about your test setup?
> > Which platform?
> > Which kernel?
>
> I mentioned in the commit message it is a v5.4 kernel.

I was not sure if the test results were obtained with this kernel, because
we usually ask for results against mainline to make sure you are not facing
a problem that has already been solved since v5.4 was released.

> The platform is Snapdragon 7c:
> https://www.qualcomm.com/products/snapdragon-7c-compute-platform
>
> > > Without patch:
> > > ~/schbench -t 2 -m 2 -r 5
> > > Latency percentiles (usec) runtime 5 (s) (212 total samples)
> > >         50.0th: 210 (106 samples)
> > >         75.0th: 619 (53 samples)
> > >         90.0th: 665 (32 samples)
> > >         95.0th: 703 (11 samples)
> > >         *99.0th: 12656 (8 samples)
> > >         99.5th: 12784 (1 samples)
> > >         99.9th: 13424 (1 samples)
> > >         min=15, max=13424
> > >
> > > With patch:
> > > ~/schbench -t 2 -m 2 -r 5
> > > Latency percentiles (usec) runtime 5 (s) (214 total samples)
> > >         50.0th: 188 (108 samples)
> > >         75.0th: 238 (53 samples)
> > >         90.0th: 623 (32 samples)
> > >         95.0th: 657 (12 samples)
> > >         *99.0th: 717 (7 samples)
> > >         99.5th: 725 (2 samples)
> > >         99.9th: 725 (0 samples)
> > >
> > > Cc: Paul McKenney <paul...@kernel.org>
> > > Cc: Frederic Weisbecker <fweis...@gmail.com>
> > > Suggested-by: Dietmar Eggeman <dietmar.eggem...@arm.com>
> > > Co-developed-by: Qais Yousef <qais.you...@arm.com>
> > > Signed-off-by: Qais Yousef <qais.you...@arm.com>
> > > Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
> > >
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > >         if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > >                 return false;
> > >
> > > -       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > +       if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> >
> > This
> > condition is there to make sure blocked load is updated at most once
> > per tick, in order to filter the newly idle case; otherwise the rate
> > limiting is already done by the load-balance interval.
>
> What prevents newidle_balance from hitting this code? Are you suggesting
> rate-limiting that as well?
>
> > This hard-coded (HZ/20) really looks like an ugly hack.
>
> Yes, it was not a well-researched number. If you have a suggestion for a
> better fix, let me know.
>
> thanks,
>
>  - Joel
>
> > >                 return true;
> > >
> > >         update_blocked_averages(cpu);
> > > --
> > > 2.30.0.280.ga3ce27912f-goog
> > >
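If the rate limit itself survives review, the hard-coded (HZ/20) could at
least become a named constant. A hypothetical sketch of the hunk (the
constant's name is made up for illustration; it is not from this thread):

	/*
	 * Hypothetical name, for illustration only. HZ/20 jiffies is ~50 ms
	 * regardless of CONFIG_HZ.
	 */
	#define BLOCKED_LOAD_MIN_PERIOD	(HZ / 20)

		if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick +
						   BLOCKED_LOAD_MIN_PERIOD))
			return true;	/* updated recently: skip, assume blocked load remains */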