Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 5 February 2018 at 23:18, Valentin Schneider wrote:
> On 01/30/2018 11:41 AM, Valentin Schneider wrote:
>> [...]
>>> I have studied a way to keep track of how many cpus still have blocked
>>> load, to try to minimize the number of useless ilb kicks, but this adds
>>> more atomic operations, which can impact system throughput under heavy
>>> load with a lot of very small wake ups. That's why I proposed this
>>> simpler solution. But it's probably just a matter of where we want to
>>> "waste" time: either we accept spending a bit more time checking the
>>> state of idle CPUs, or we accept kicking an ilb from time to time for
>>> no good reason.
>>>
>>
>> Agreed. I have the feeling that spending more time doing atomic ops could be
>> worth it - I'll try to test this out and see if it's actually relevant.
>>
>
> I gave this a spin, still using Vincent's patches with the included patch on
> top. Nothing too clever, just seeing how replacing nohz.stats_state with a
> cpumask would go.
>
> I've replaced nohz.stats_state with nohz.stale_cpus_mask. I kept changes
> minimal - there are some places where I think nohz.idle_cpus_mask could be
> substituted with nohz.stale_cpus_mask. Speaking of which, I was about to
> write a comment stating that nohz.stale_cpus_mask should be a subset of
> nohz.idle_cpus_mask, but I realized that's actually not true:
>
> In the current implementation (cpumask or stats_state flag), an idle CPU is
> considered to have blocked load as soon as it goes through
> nohz_balance_enter_idle(), and that flag is cleared when we go through
> _nohz_idle_balance() (and newly idle load balance in my cpumask
> implementation).
> However, we can imagine a scenario where a CPU goes idle, is flagged as
> having blocked load, then wakes up and goes through its periodic balance
> code and updates that load. Yet the flag (or cpumask) won't have been
> updated.
>
> So I think there could be a discussion on whether the flag should be cleared
> in nohz_balance_exit_idle() if we assume periodic balance now takes care of
> this. It could cause issues if we have a weird scenario where a CPU keeps
> alternating between online and idle but never stays online long enough for
> a tick, though.
> Alternatively, we could clear that flag when going through the first
> periodic balance after idling, but then that's a bit weird because we're
> using a nohz structure in a non-nohz context.
>
>
> Anyway, I tried to get some profiling done with the cpumask, but there's
> something wrong with my setup - I would only get nonsense numbers (for both
> baseline & my patch), so I added start/end trace_printks to
> _nohz_idle_balance(). It's ugly, inaccurate and unorthodox, but it still
> gives a rough idea of how the cpumask impacts things.
> I ran 20 iterations of my "nohz update" test case (a task accumulates load,
> goes to sleep, and another always-running task keeps kicking an ILB to decay
> that blocked load) and the time saved by skipping CPUs is in the ballpark of
> 20%. Notebook is at [1].

Even if saving time in the ILB is interesting, we have to check the impact
of the bitmask operations on the wakeup latency of CPUs when the system is
heavily used and generates a lot of sleeps/wakeups on CPUs.

> I'll try to get proper function profiling working for when Vincent posts
> his "v2".
>
> [1]: https://gist.github.com/valschneider/6f203143bee1e149f24c44e9582a9eff
>
> ---
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 01/30/2018 11:41 AM, Valentin Schneider wrote:
> [...]
>> I have studied a way to keep track of how many cpus still have blocked
>> load, to try to minimize the number of useless ilb kicks, but this adds
>> more atomic operations, which can impact the system throughput under heavy
>> load with a lot of very small wake ups. That's why I proposed this
>> simpler solution. But it's probably just a matter of where we want to
>> "waste" time: either we accept spending a bit more time checking the
>> state of idle CPUs, or we accept kicking an ilb from time to time for
>> no good reason.
>>
>
> Agreed. I have the feeling that spending more time doing atomic ops could be
> worth it - I'll try to test this out and see if it's actually relevant.
>

I gave this a spin, still using Vincent's patches with the included patch on
top. Nothing too clever, just seeing how replacing nohz.stats_state with a
cpumask would go.

I've replaced nohz.stats_state with nohz.stale_cpus_mask. I kept changes
minimal - there are some places where I think nohz.idle_cpus_mask could be
substituted with nohz.stale_cpus_mask. Speaking of which, I was about to
write a comment stating that nohz.stale_cpus_mask should be a subset of
nohz.idle_cpus_mask, but I realized that's actually not true:

In the current implementation (cpumask or stats_state flag), an idle CPU is
considered to have blocked load as soon as it goes through
nohz_balance_enter_idle(), and that flag is cleared when we go through
_nohz_idle_balance() (and newly idle load balance in my cpumask
implementation).
However, we can imagine a scenario where a CPU goes idle, is flagged as
having blocked load, then wakes up and goes through its periodic balance
code and updates that load. Yet the flag (or cpumask) won't have been
updated.

So I think there could be a discussion on whether the flag should be cleared
in nohz_balance_exit_idle() if we assume periodic balance now takes care of
this. It could cause issues if we have a weird scenario where a CPU keeps
alternating between online and idle but never stays online long enough for a
tick, though. Alternatively, we could clear that flag when going through the
first periodic balance after idling, but then that's a bit weird because
we're using a nohz structure in a non-nohz context.

Anyway, I tried to get some profiling done with the cpumask, but there's
something wrong with my setup - I would only get nonsense numbers (for both
baseline & my patch), so I added start/end trace_printks to
_nohz_idle_balance(). It's ugly, inaccurate and unorthodox, but it still
gives a rough idea of how the cpumask impacts things.
I ran 20 iterations of my "nohz update" test case (a task accumulates load,
goes to sleep, and another always-running task keeps kicking an ILB to decay
that blocked load) and the time saved by skipping CPUs is in the ballpark of
20%. Notebook is at [1].

I'll try to get proper function profiling working for when Vincent posts
his "v2".

[1]: https://gist.github.com/valschneider/6f203143bee1e149f24c44e9582a9eff

---
 kernel/sched/fair.c | 72 ++++++++++++++++++++++++-----------------------
 1 file changed, 38 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ada92b..8bcf465 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5404,8 +5404,8 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
 
 static struct {
 	cpumask_var_t idle_cpus_mask;
+	cpumask_var_t stale_cpus_mask;
 	atomic_t nr_cpus;
-	atomic_t stats_state;
 	unsigned long next_balance;     /* in jiffy units */
 	unsigned long next_stats;
 } nohz ____cacheline_aligned;
@@ -6968,7 +6968,6 @@ enum fbq_type { regular, remote, all };
 #define LBF_DST_PINNED  0x04
 #define LBF_SOME_PINNED	0x08
 #define LBF_NOHZ_STATS	0x10
-#define LBF_NOHZ_AGAIN	0x20
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -7829,25 +7828,25 @@ group_type group_classify(struct sched_group *group,
 	return group_other;
 }
 
-static bool update_nohz_stats(struct rq *rq)
+static void update_nohz_stats(struct rq *rq)
 {
 #ifdef CONFIG_NO_HZ_COMMON
 	unsigned int cpu = rq->cpu;
 
-	if (!rq->has_blocked_load)
-		return false;
-
-	if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
-		return false;
+	if (!cpumask_test_cpu(cpu, nohz.stale_cpus_mask))
+		return;
 
 	if (!time_after(jiffies, rq->last_blocked_load_update_tick))
-		return true;
+		return;
 
 	update_blocked_averages(cpu);
 
-	return rq->has_blocked_load;
+	if (rq->has_blocked_load)
+		return;
+
+	cpumask_clear_cpu(cpu, nohz.stale_cpus_mask);
#else
-	return false;
+	return;
 #endif
 }
 
@@ -7873,8 +7872,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 1 February 2018 at 19:10, Peter Zijlstra wrote:
> On Wed, Jan 24, 2018 at 09:25:36AM +0100, Vincent Guittot wrote:
>> @@ -8861,7 +8875,14 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  			update_next_balance(sd, &next_balance);
>>  	rcu_read_unlock();
>>
>> -	if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
>> +	/*
>> +	 * Update blocked idle load if it has not been done for a
>> +	 * while. Try to do it locally before entering idle but kick an
>> +	 * ilb if it takes too much time and might delay the next local
>> +	 * wake up
>> +	 */
>> +	if (time_after(jiffies, next) && atomic_read(&nohz.stats_state) &&
>> +	    !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>>  		kick_ilb(NOHZ_STATS_KICK);
>>
>>  	goto out;
>
> This I really dislike. We're here because avg_idle is _really_ low, we
> really should not then call _nohz_idle_balance().

Yes. In fact I was targeting the case where (this_rq->avg_idle >=
sysctl_sched_migration_cost) and the system is not overloaded.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Mon, Jan 29, 2018 at 07:31:07PM +0000, Valentin Schneider wrote:
> But for load updates via _nohz_idle_balance(), we iterate through all of the
> nohz CPUs and unconditionally call update_blocked_averages(). This could be
> avoided by remembering which CPUs have stale load before going idle.
> Initially I thought that was what nohz.stats_state was for, but it isn't.
> With Vincent's patches it's only ever set to either 0 or 1, but we could use
> it as a CPU mask, and use it to skip nohz CPUs that don't have stale load in
> _nohz_idle_balance() (when NOHZ_STATS_KICK).

Yes, you'd need to allocate a second cpumask; worse, you'd need atomic bitops
to set and clear bits there. That all _might_ be worth it... dunno.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Wed, Jan 24, 2018 at 09:25:36AM +0100, Vincent Guittot wrote:
> @@ -8861,7 +8875,14 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
> 			update_next_balance(sd, &next_balance);
> 	rcu_read_unlock();
> 
> -	if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
> +	/*
> +	 * Update blocked idle load if it has not been done for a
> +	 * while. Try to do it locally before entering idle but kick a
> +	 * ilb if it takes too much time and might delay next local
> +	 * wake up
> +	 */
> +	if (time_after(jiffies, next) && atomic_read(&nohz.stats_state) &&
> +	    !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> 		kick_ilb(NOHZ_STATS_KICK);
> 
> 	goto out;

This I really dislike. We're here because avg_idle is _really_ low, we
really should not then call _nohz_idle_balance().
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 1 February 2018 at 17:57, Peter Zijlstra wrote:
> On Wed, Jan 24, 2018 at 09:25:36AM +0100, Vincent Guittot wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 898785d..ed90303 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>>  	return true;
>>  }
>>
>> +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
>> +{
>> +	if (cfs_rq->avg.load_avg)
>> +		return true;
>> +
>> +	if (cfs_rq->avg.util_avg)
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>
>>  static void update_blocked_averages(int cpu)
>> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
>>  		 */
>>  		if (cfs_rq_is_decayed(cfs_rq))
>>  			list_del_leaf_cfs_rq(cfs_rq);
>> -		else
>> +
>> +		/* Don't need periodic decay once load/util_avg are null */
>> +		if (cfs_rq_has_blocked(cfs_rq))
>>  			done = false;
>>  	}
>>
>> @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
>>  	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>>  #ifdef CONFIG_NO_HZ_COMMON
>>  	rq->last_blocked_load_update_tick = jiffies;
>> -	if (cfs_rq_is_decayed(cfs_rq))
>> +	if (cfs_rq_has_blocked(cfs_rq))
>>  		rq->has_blocked_load = 0;
>>  #endif
>>  	rq_unlock_irqrestore(rq, &rf);
>
> OK makes sense; would've been even better as a separate patch :-)

Yes, I will make a separate patch for that.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 1 February 2018 at 17:52, Peter Zijlstra wrote:
> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
>
> Would've probably been easier to read if you'd not included the revert
> of that timer patch...
>
>> @@ -9258,21 +9255,11 @@ void nohz_balance_enter_idle(int cpu)
>>  	set_cpu_sd_state_idle(cpu);
>>
>>  	/*
>> -	 * implies a barrier such that if the stats_state update is observed
>> -	 * the above updates are also visible. Pairs with stuff in
>> -	 * update_sd_lb_stats() and nohz_idle_balance().
>> +	 * Each time a cpu enters idle, we assume that it has blocked load and
>> +	 * enable the periodic update of the load of idle cpus
>>  	 */
>> -	val = atomic_read(&nohz.stats_state);
>> -	do {
>> -		new = val + 2;
>> -		new |= 1;
>> -	} while (!atomic_try_cmpxchg(&nohz.stats_state, &val, new));
>> +	atomic_set(&nohz.stats_state, 1);
>>
>
>> @@ -9422,7 +9408,6 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>  		return false;
>>  	}
>>
>> -	stats_seq = atomic_read(&nohz.stats_state);
>>  	/*
>>  	 * barrier, pairs with nohz_balance_enter_idle(), ensures ...
>>  	 */
>> @@ -9432,6 +9417,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>
>>  	SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>>
>> +	/*
>> +	 * We assume there will be no idle load after this update and clear
>> +	 * the stats state. If a cpu enters idle in the meantime, it will
>> +	 * set the stats state and trigger another update of idle load.
>> +	 * Because a cpu that becomes idle is added to idle_cpus_mask before
>> +	 * setting the stats state, we are sure to not clear the state and not
>> +	 * check the load of an idle cpu.
>> +	 */
>> +	atomic_set(&nohz.stats_state, 0);
>> +
>>  	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
>>  		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>>  			continue;
>> @@ -9441,8 +9436,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>  		 * work being done for other cpus. Next load
>>  		 * balancing owner will pick it up.
>>  		 */
>> -		if (need_resched())
>> +		if (need_resched()) {
>> +			has_blocked_load = true;
>>  			break;
>> +		}
>>
>>  		rq = cpu_rq(balance_cpu);
>>
>> @@ -9477,12 +9474,12 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>  	if (flags & NOHZ_BALANCE_KICK)
>>  		rebalance_domains(this_rq, CPU_IDLE);
>>
>> -	if (has_blocked_load ||
>> -	    !atomic_try_cmpxchg(&nohz.stats_state, &stats_seq, 0)) {
>> -		WRITE_ONCE(nohz.next_stats,
>> -				now + msecs_to_jiffies(LOAD_AVG_PERIOD));
>> -		mod_timer(&nohz.timer, nohz.next_stats);
>> -	}
>> +	WRITE_ONCE(nohz.next_stats,
>> +			now + msecs_to_jiffies(LOAD_AVG_PERIOD));
>> +
>> +	/* There is still blocked load, enable periodic update */
>> +	if (has_blocked_load)
>> +		atomic_set(&nohz.stats_state, 1);
>>
>>  	/*
>>  	 * next_balance will be updated only when there is a need.
>
> After this there is no point in stats_state being atomic anymore. Also
> it needs a better name.

Ok.

> Maybe if I drop the last two patches (and you re-introduce the bits
> from: Subject: sched: Optimize nohz stats, that you do need) this all
> becomes more readable?

Yes, we can do it like that.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Wed, Jan 24, 2018 at 09:25:36AM +0100, Vincent Guittot wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 898785d..ed90303 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  	return true;
>  }
> 
> +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
> +{
> +	if (cfs_rq->avg.load_avg)
> +		return true;
> +
> +	if (cfs_rq->avg.util_avg)
> +		return true;
> +
> +	return false;
> +}
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> 
>  static void update_blocked_averages(int cpu)
> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
>  		 */
>  		if (cfs_rq_is_decayed(cfs_rq))
>  			list_del_leaf_cfs_rq(cfs_rq);
> -		else
> +
> +		/* Don't need periodic decay once load/util_avg are null */
> +		if (cfs_rq_has_blocked(cfs_rq))
>  			done = false;
>  	}
> 
> @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
>  	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>  #ifdef CONFIG_NO_HZ_COMMON
>  	rq->last_blocked_load_update_tick = jiffies;
> -	if (cfs_rq_is_decayed(cfs_rq))
> +	if (cfs_rq_has_blocked(cfs_rq))
>  		rq->has_blocked_load = 0;
>  #endif
>  	rq_unlock_irqrestore(rq, &rf);

OK makes sense; would've been even better as a separate patch :-)
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Thu, Jan 18, 2018 at 10:38:07AM +, Morten Rasmussen wrote: > It seems pointless to have a timer to update PELT if the system is > completely idle, and when it isn't we can piggy back other events to > make the updates happen. Only if we do that update before making decisions based on the values. The thing I was bothered by in the earlier patches was that wakeup would use whatever current value and async kick something to go update. That just seems wrong.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:

Would've probably been easier to read if you'd not included the revert of
that timer patch...

> @@ -9258,21 +9255,11 @@ void nohz_balance_enter_idle(int cpu)
> 	set_cpu_sd_state_idle(cpu);
>
> 	/*
> -	 * implies a barrier such that if the stats_state update is observed
> -	 * the above updates are also visible. Pairs with stuff in
> -	 * update_sd_lb_stats() and nohz_idle_balance().
> +	 * Each time a cpu enter idle, we assume that it has blocked load and
> +	 * enable the periodic update of the load of idle cpus
> 	 */
> -	val = atomic_read(&nohz.stats_state);
> -	do {
> -		new = val + 2;
> -		new |= 1;
> -	} while (!atomic_try_cmpxchg(&nohz.stats_state, &val, new));
> +	atomic_set(&nohz.stats_state, 1);
>
> @@ -9422,7 +9408,6 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> 		return false;
> 	}
>
> -	stats_seq = atomic_read(&nohz.stats_state);
> 	/*
> 	 * barrier, pairs with nohz_balance_enter_idle(), ensures ...
> 	 */
> @@ -9432,6 +9417,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>
> 	SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>
> +	/*
> +	 * We assume there will be no idle load after this update and clear
> +	 * the stats state. If a cpu enters idle in the mean time, it will
> +	 * set the stats state and trig another update of idle load.
> +	 * Because a cpu that becomes idle, is added to idle_cpus_mask before
> +	 * setting the stats state, we are sure to not clear the state and not
> +	 * check the load of an idle cpu.
> +	 */
> +	atomic_set(&nohz.stats_state, 0);
> +
> 	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> 		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> 			continue;
> @@ -9441,8 +9436,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> 		 * work being done for other cpus. Next load
> 		 * balancing owner will pick it up.
> 		 */
> -		if (need_resched())
> +		if (need_resched()) {
> +			has_blocked_load = true;
> 			break;
> +		}
>
> 		rq = cpu_rq(balance_cpu);
>
> @@ -9477,12 +9474,12 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> 	if (flags & NOHZ_BALANCE_KICK)
> 		rebalance_domains(this_rq, CPU_IDLE);
>
> -	if (has_blocked_load ||
> -	    !atomic_try_cmpxchg(&nohz.stats_state, &stats_seq, 0)) {
> -		WRITE_ONCE(nohz.next_stats,
> -			   now + msecs_to_jiffies(LOAD_AVG_PERIOD));
> -		mod_timer(&nohz.timer, nohz.next_stats);
> -	}
> +	WRITE_ONCE(nohz.next_stats,
> +		   now + msecs_to_jiffies(LOAD_AVG_PERIOD));
> +
> +	/* There is still blocked load, enable periodic update */
> +	if (has_blocked_load)
> +		atomic_set(&nohz.stats_state, 1);
>
> 	/*
> 	 * next_balance will be updated only when there is a need.

After this there is no point for stats_state to be atomic anymore. Also
a better name.

Maybe if I drop the last two patches (and you re-introduce the bits
from: Subject: sched: Optimize nohz stats, that you do need) this all
becomes more readable?
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 30 January 2018 at 12:41, Valentin Schneider wrote: > (Resending because I snuck in some HTML... Apologies) > > On 01/30/2018 08:32 AM, Vincent Guittot wrote: >> >> On 29 January 2018 at 20:31, Valentin Schneider >> wrote: >>> >>> Hi Vincent, Peter, >>> >>> I've been running some tests on your patches (Peter's base + the 2 from >>> Vincent). The results themselves are hosted at [1]. >>> The base of those tests is the same: a task ("accumulator") is ran for 5 >>> seconds (arbitrary value) to accumulate some load, then goes to sleep for >>> .5 >>> seconds. >>> >>> I've set up 3 test scenarios: >>> >>> Update by nohz_balance_kick() >>> - >>> Right before the "accumulator" task goes to sleep, a CPU-hogging task >>> (100% >>> utilization) is spawned on another CPU. It won't go idle so the only way >>> to >>> update the blocked load generated by "accumulator" is to kick an ILB >>> (NOHZ_STATS_KICK). >>> >>> The test shows that this is behaving nicely - we keep kicking an ILB >>> every >>> ~36ms (see next test for comments on that) until there is no more blocked >>> load. I did however notice some interesting scenarios: after the load has >>> been fully decayed, a tiny background task can spawn and end in less than >>> a >>> scheduling period. However, it still goes through >>> nohz_balance_enter_idle(), >>> and thus sets nohz.stats_state, which will later cause an ILB kick. >>> >>> This makes me wonder if it's worth kicking ILBs for such tiny load values >>> - >>> perhaps it could be worth having a margin to set rq->has_blocked_load ? >> >> >> So it's difficult to know what will be the load/utilization on the >> cfs_rq once the cpu wakes up. Even if it's for a really short time, >> that's doesn't mean that the load/utilization is small because it can >> be the migration of a big task that just have a very short wakes up >> this time. 
>> That's why I don't make any assumption on the utilization/load value >> when a cpu goes to sleep >> > > Right, hadn't thought about those kind of migrations. > >>> >>> Furthermore, this tiny task will cause the ILB to iterate over all of the >>> idle CPUs, although only one has stale load. For load update via >>> NEWLY_IDLE >>> load_balance() we use: >>> >>> static bool update_nohz_stats(struct rq *rq) >>> { >>> if (!rq->has_blocked_load) >>> return false; >>> [...] >>> } >>> >>> But for load update via _nohz_idle_balance(), we iterate through all of >>> the >>> nohz CPUS and unconditionally call update_blocked_averages(). This could >>> be >>> avoided by remembering which CPUs have stale load before going idle. >>> Initially I thought that was what nohz.stats_state was for, but it isn't. >>> With Vincent's patches it's only ever set to either 0 or 1, but we could >>> use >>> it as a CPU mask, and use it to skip nohz CPUs that don't have stale load >>> in >>> _nohz_idle_balance() (when NOHZ_STATS_KICK). >> >> >> I have studied a way to keep track of how many cpus still have blocked >> load to try to minimize the number of useless ilb kick but this add >> more atomic operations which can impact the system throughput with >> heavy load and lot of very small wake up. that's why i have propose >> this solution which is more simple. But it's probably just a matter of >> where we want to "waste" time. Either we accept to spent a bit more >> time to check the state of idle CPUs or we accept to kick ilb from >> time to time for no good reason. >> > > Agreed. I have the feeling that spending more time doing atomic ops could be > worth it - I'll try to test this out and see if it's actually relevant. > >>> >>> Update by idle_balance() >>> >>> Right before the "accumulator" task goes to sleep, a tiny periodic >>> (period=32ms) task is spawned on another CPU. 
It's expected that it will >>> update the blocked load in idle_balance(), either by running >>> _nohz_idle_balance() locally or kicking an ILB (The overload flag >>> shouldn't >>> be set in this test case, so we shouldn't go through the NEWLY_IDLE >>> load_balance()). >>> >>> This also seems to be working fine, but I'm noticing a delay between load >>> updates that is closer to 64ms than 32ms. After digging into it I found >>> out >>> that the time checks done in idle_balance() and nohz_balancer_kick() are >>> time_after(jiffies, next_stats), but IMHO they should be >>> time_after_eq(jiffies, next_stats) to have 32ms-based updates. This also >>> explains the 36ms periodicity of the updates in the test above. >> >> >> I have use the 32ms as a minimum value between update. We must use the >> time_after() if we want to have at least 32ms between each update. We >> will have a 36ms period if the previous update was triggered by the >> tick (just after in fact) but there will be only 32ms if the last >> update was done during an idle_balance that happens just before the >> tick. With time_after_eq, the update period will be between 28 and 32ms.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
(Resending because I snuck in some HTML... Apologies) On 01/30/2018 08:32 AM, Vincent Guittot wrote: On 29 January 2018 at 20:31, Valentin Schneider wrote: Hi Vincent, Peter, I've been running some tests on your patches (Peter's base + the 2 from Vincent). The results themselves are hosted at [1]. The base of those tests is the same: a task ("accumulator") is ran for 5 seconds (arbitrary value) to accumulate some load, then goes to sleep for .5 seconds. I've set up 3 test scenarios: Update by nohz_balance_kick() - Right before the "accumulator" task goes to sleep, a CPU-hogging task (100% utilization) is spawned on another CPU. It won't go idle so the only way to update the blocked load generated by "accumulator" is to kick an ILB (NOHZ_STATS_KICK). The test shows that this is behaving nicely - we keep kicking an ILB every ~36ms (see next test for comments on that) until there is no more blocked load. I did however notice some interesting scenarios: after the load has been fully decayed, a tiny background task can spawn and end in less than a scheduling period. However, it still goes through nohz_balance_enter_idle(), and thus sets nohz.stats_state, which will later cause an ILB kick. This makes me wonder if it's worth kicking ILBs for such tiny load values - perhaps it could be worth having a margin to set rq->has_blocked_load ? So it's difficult to know what will be the load/utilization on the cfs_rq once the cpu wakes up. Even if it's for a really short time, that's doesn't mean that the load/utilization is small because it can be the migration of a big task that just have a very short wakes up this time. That's why I don't make any assumption on the utilization/load value when a cpu goes to sleep Right, hadn't thought about those kind of migrations. Furthermore, this tiny task will cause the ILB to iterate over all of the idle CPUs, although only one has stale load. 
For load update via NEWLY_IDLE load_balance() we use: static bool update_nohz_stats(struct rq *rq) { if (!rq->has_blocked_load) return false; [...] } But for load update via _nohz_idle_balance(), we iterate through all of the nohz CPUS and unconditionally call update_blocked_averages(). This could be avoided by remembering which CPUs have stale load before going idle. Initially I thought that was what nohz.stats_state was for, but it isn't. With Vincent's patches it's only ever set to either 0 or 1, but we could use it as a CPU mask, and use it to skip nohz CPUs that don't have stale load in _nohz_idle_balance() (when NOHZ_STATS_KICK). I have studied a way to keep track of how many cpus still have blocked load to try to minimize the number of useless ilb kick but this add more atomic operations which can impact the system throughput with heavy load and lot of very small wake up. that's why i have propose this solution which is more simple. But it's probably just a matter of where we want to "waste" time. Either we accept to spent a bit more time to check the state of idle CPUs or we accept to kick ilb from time to time for no good reason. Agreed. I have the feeling that spending more time doing atomic ops could be worth it - I'll try to test this out and see if it's actually relevant. Update by idle_balance() Right before the "accumulator" task goes to sleep, a tiny periodic (period=32ms) task is spawned on another CPU. It's expected that it will update the blocked load in idle_balance(), either by running _nohz_idle_balance() locally or kicking an ILB (The overload flag shouldn't be set in this test case, so we shouldn't go through the NEWLY_IDLE load_balance()). This also seems to be working fine, but I'm noticing a delay between load updates that is closer to 64ms than 32ms. 
After digging into it I found out that the time checks done in idle_balance() and nohz_balancer_kick() are time_after(jiffies, next_stats), but IMHO they should be time_after_eq(jiffies, next_stats) to have 32ms-based updates. This also explains the 36ms periodicity of the updates in the test above. I have use the 32ms as a minimum value between update. We must use the time_after() if we want to have at least 32ms between each update. We will have a 36ms period if the previous update was triggered by the tick (just after in fact) but there will be only 32ms if the last update was done during an idle_balance that happens just before the tick. With time_after_eq, the update period will between 28 and 32ms. Then, I mention a possible optimization by using time_after_eq in the idle_balance() so a newly_idle cpu will have more chance (between 0 and 4ms for hz250) to do the update before a ilb is kicked IIUC with time_after() the update period should be within ]32, 36] ms, but it looks like I'm always on that upper bound in my tests. When evaluating whether we need to kick_ilb() for load updates, we'll always be right after the tick
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 29 January 2018 at 20:31, Valentin Schneider wrote: > Hi Vincent, Peter, > > I've been running some tests on your patches (Peter's base + the 2 from > Vincent). The results themselves are hosted at [1]. > The base of those tests is the same: a task ("accumulator") is ran for 5 > seconds (arbitrary value) to accumulate some load, then goes to sleep for .5 > seconds. > > I've set up 3 test scenarios: > > Update by nohz_balance_kick() > - > Right before the "accumulator" task goes to sleep, a CPU-hogging task (100% > utilization) is spawned on another CPU. It won't go idle so the only way to > update the blocked load generated by "accumulator" is to kick an ILB > (NOHZ_STATS_KICK). > > The test shows that this is behaving nicely - we keep kicking an ILB every > ~36ms (see next test for comments on that) until there is no more blocked > load. I did however notice some interesting scenarios: after the load has > been fully decayed, a tiny background task can spawn and end in less than a > scheduling period. However, it still goes through nohz_balance_enter_idle(), > and thus sets nohz.stats_state, which will later cause an ILB kick. > > This makes me wonder if it's worth kicking ILBs for such tiny load values - > perhaps it could be worth having a margin to set rq->has_blocked_load ? So it's difficult to know what the load/utilization on the cfs_rq will be once the cpu wakes up. Even if it's for a really short time, that doesn't mean that the load/utilization is small, because it can be the migration of a big task that just had a very short wakeup this time. That's why I don't make any assumption on the utilization/load value when a cpu goes to sleep. > > Furthermore, this tiny task will cause the ILB to iterate over all of the > idle CPUs, although only one has stale load. For load update via NEWLY_IDLE > load_balance() we use: > > static bool update_nohz_stats(struct rq *rq) > { > if (!rq->has_blocked_load) > return false; > [...] 
> } > > But for load update via _nohz_idle_balance(), we iterate through all of the > nohz CPUS and unconditionally call update_blocked_averages(). This could be > avoided by remembering which CPUs have stale load before going idle. > Initially I thought that was what nohz.stats_state was for, but it isn't. > With Vincent's patches it's only ever set to either 0 or 1, but we could use > it as a CPU mask, and use it to skip nohz CPUs that don't have stale load in > _nohz_idle_balance() (when NOHZ_STATS_KICK). I have studied a way to keep track of how many cpus still have blocked load, to try to minimize the number of useless ilb kicks, but this adds more atomic operations which can impact the system throughput under heavy load with lots of very small wake-ups. That's why I have proposed this solution, which is simpler. But it's probably just a matter of where we want to "waste" time. Either we accept spending a bit more time checking the state of idle CPUs, or we accept kicking the ilb from time to time for no good reason. > > Update by idle_balance() > > Right before the "accumulator" task goes to sleep, a tiny periodic > (period=32ms) task is spawned on another CPU. It's expected that it will > update the blocked load in idle_balance(), either by running > _nohz_idle_balance() locally or kicking an ILB (The overload flag shouldn't > be set in this test case, so we shouldn't go through the NEWLY_IDLE > load_balance()). > > This also seems to be working fine, but I'm noticing a delay between load > updates that is closer to 64ms than 32ms. After digging into it I found out > that the time checks done in idle_balance() and nohz_balancer_kick() are > time_after(jiffies, next_stats), but IMHO they should be > time_after_eq(jiffies, next_stats) to have 32ms-based updates. This also > explains the 36ms periodicity of the updates in the test above. I have used 32ms as a minimum value between updates. 
We must use time_after() if we want to have at least 32ms between each update. We will have a 36ms period if the previous update was triggered by the tick (just after it, in fact), but there will be only 32ms if the last update was done during an idle_balance that happens just before the tick. With time_after_eq, the update period will be between 28 and 32ms. Then, I mention a possible optimization by using time_after_eq in idle_balance(), so a newly_idle cpu will have more chance (between 0 and 4ms for HZ=250) to do the update before an ilb is kicked.

Thanks,
Vincent

> > > No update (idle system) > --- > Nothing special here, just making sure nothing happens when the system is > fully idle. On a sidenote, that's relatively hard to achieve - I had to > switch over to Juno because my HiKey960 gets interrupts every 16ms. The Juno > still gets woken up every now and then but it's a bit quieter. > > > [1]: https://gist.github.com/valschneider/a8da7bb8e11fb1ec63a419710f56c0a0 > >
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 29 January 2018 at 20:31, Valentin Schneider wrote: > Hi Vincent, Peter, > > I've been running some tests on your patches (Peter's base + the 2 from > Vincent). The results themselves are hosted at [1]. > The base of those tests is the same: a task ("accumulator") is ran for 5 > seconds (arbitrary value) to accumulate some load, then goes to sleep for .5 > seconds. > > I've set up 3 test scenarios: > > Update by nohz_balance_kick() > - > Right before the "accumulator" task goes to sleep, a CPU-hogging task (100% > utilization) is spawned on another CPU. It won't go idle so the only way to > update the blocked load generated by "accumulator" is to kick an ILB > (NOHZ_STATS_KICK). > > The test shows that this is behaving nicely - we keep kicking an ILB every > ~36ms (see next test for comments on that) until there is no more blocked > load. I did however notice some interesting scenarios: after the load has > been fully decayed, a tiny background task can spawn and end in less than a > scheduling period. However, it still goes through nohz_balance_enter_idle(), > and thus sets nohz.stats_state, which will later cause an ILB kick. > > This makes me wonder if it's worth kicking ILBs for such tiny load values - > perhaps it could be worth having a margin to set rq->has_blocked_load ? So it's difficult to know what the load/utilization on the cfs_rq will be once the cpu wakes up. Even if the cpu was idle for only a really short time, that doesn't mean that the load/utilization is small: it can be the migration of a big task that just had a very short wakeup this time. That's why I don't make any assumption on the utilization/load value when a cpu goes to sleep. > > Furthermore, this tiny task will cause the ILB to iterate over all of the > idle CPUs, although only one has stale load. For load update via NEWLY_IDLE > load_balance() we use: > > static bool update_nohz_stats(struct rq *rq) > { > if (!rq->has_blocked_load) > return false; > [...] 
> } > > But for load update via _nohz_idle_balance(), we iterate through all of the > nohz CPUS and unconditionally call update_blocked_averages(). This could be > avoided by remembering which CPUs have stale load before going idle. > Initially I thought that was what nohz.stats_state was for, but it isn't. > With Vincent's patches it's only ever set to either 0 or 1, but we could use > it as a CPU mask, and use it to skip nohz CPUs that don't have stale load in > _nohz_idle_balance() (when NOHZ_STATS_KICK). I have studied a way to keep track of how many cpus still have blocked load, to try to minimize the number of useless ilb kicks, but this adds more atomic operations which can impact system throughput under heavy load with lots of very small wakeups. That's why I have proposed this simpler solution. But it's probably just a matter of where we want to "waste" time. Either we accept to spend a bit more time checking the state of idle CPUs, or we accept to kick an ilb from time to time for no good reason. > > Update by idle_balance() > > Right before the "accumulator" task goes to sleep, a tiny periodic > (period=32ms) task is spawned on another CPU. It's expected that it will > update the blocked load in idle_balance(), either by running > _nohz_idle_balance() locally or kicking an ILB (The overload flag shouldn't > be set in this test case, so we shouldn't go through the NEWLY_IDLE > load_balance()). > > This also seems to be working fine, but I'm noticing a delay between load > updates that is closer to 64ms than 32ms. After digging into it I found out > that the time checks done in idle_balance() and nohz_balancer_kick() are > time_after(jiffies, next_stats), but IMHO they should be > time_after_eq(jiffies, next_stats) to have 32ms-based updates. This also > explains the 36ms periodicity of the updates in the test above. I have used 32ms as a minimum interval between updates. 
We must use time_after() if we want to have at least 32ms between each update. We will have a 36ms period if the previous update was triggered by the tick (just after it, in fact), but there will be only 32ms if the last update was done during an idle_balance that happened just before the tick. With time_after_eq, the update period would be between 28 and 32ms. I also mentioned a possible optimization: using time_after_eq in idle_balance() so a newly_idle cpu will have more chance (between 0 and 4ms for HZ=250) to do the update before an ilb is kicked. Thanks, Vincent > > > No update (idle system) > --- > Nothing special here, just making sure nothing happens when the system is > fully idle. On a sidenote, that's relatively hard to achieve - I had to > switch over to Juno because my HiKey960 gets interrupts every 16ms. The Juno > still gets woken up every now and then but it's a bit quieter. > > > [1]: https://gist.github.com/valschneider/a8da7bb8e11fb1ec63a419710f56c0a0 > >
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 29 January 2018 at 19:43, Dietmar Eggemann wrote: > On 01/24/2018 09:25 AM, Vincent Guittot wrote: >> >> Hi, >> >> On Thursday 18 Jan 2018 at 10:38:07 (+), Morten Rasmussen wrote: >>> >>> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote: On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote: > > > [...] > > Hi Peter, With the patch below on top of your branch, the blocked loads are updated and decayed regularly. The main differences are: - It doesn't use a timer to trig ilb but the tick and when a cpu becomes idle. The main drawback of this solution is that the load is blocked when the system is fully idle with the advantage of not waking up a fully idle system. We have to wait for the next tick or newly idle event for updating blocked load when the system leaves idle stat which can be up to a tick long. If this is too long, we can check for kicking ilb when task wakes up so the blocked load will be updated as soon as the system leaves idle state. The main advantage is that we don't wake up a fully idle system every 32ms to update blocked load that will be not used. - I'm working on one more improvement to use nohz_idle_balance in the newly idle case when the system is not overloaded and (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it exceed this_rq->avg_idle. This will remove some calls to kick_ilb and some wake up of an idle cpus. >>> >>> >>> This sound like what I meant in my other reply :-) >>> >>> It seems pointless to have a timer to update PELT if the system is >>> completely idle, and when it isn't we can piggy back other events to >>> make the updates happen. >> >> >> The patch below implements what has been described above. It calls part of >> nohz_idle_balance when a cpu becomes idle and kick a ilb if it takes too >> much >> time. 
This removes part of ilb that are kicked on an idle cpu for updating >> the blocked load but the ratio really depends on when the tick happens >> compared >> to a cpu becoming idle and the 32ms boundary. I have an additionnal patch >> that >> enables to update the blocked loads when a cpu becomes idle 1 period >> before >> kicking an ilb and there is far less ilb because we give more chance to >> the >> newly idle case (time_after is replaced by time_after_eq in >> idle_balance()). >> >> The patch also uses a function cfs_rq_has_blocked, which only checks the >> util/load_avg, instead of the cfs_rq_is_decayed which check *_sum too. >> This >> reduce significantly the number of update of blocked load. the *_avg will >> be >> fully decayed in around 300~400ms but it's far longer for the *_sum which >> have >> a higher resolution and we can easily reach almost seconds. But only the >> *_avg >> are used to make decision so keeping some blocked *_sum is acceptable. >> >> --- >> kernel/sched/fair.c | 121 >> +++- >> 1 file changed, 92 insertions(+), 29 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 898785d..ed90303 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq >> *cfs_rq) >> return true; >> } >> +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq) >> +{ >> + if (cfs_rq->avg.load_avg) >> + return true; >> + >> + if (cfs_rq->avg.util_avg) >> + return true; >> + >> + return false; >> +} >> + > > > Can we not change cfs_rq_is_decayed() to use avg.foo_avg instead of > avg.foo_sum ? 
I don't think so, because the *_sum are used to keep coherency between cfs_rq and cgroups when tasks migrate and are enqueued/dequeued, so we can't remove a cfs_rq until the *_sum are null, otherwise the cfs_rq and groups will be out of sync > >> #ifdef CONFIG_FAIR_GROUP_SCHED >> static void update_blocked_averages(int cpu) >> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu) >> */ >> if (cfs_rq_is_decayed(cfs_rq)) >> list_del_leaf_cfs_rq(cfs_rq); >> - else >> + >> + /* Don't need periodic decay once load/util_avg are null >> */ >> + if (cfs_rq_has_blocked(cfs_rq)) >> done = false; >> } >> @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int >> cpu) >> update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq); >> #ifdef CONFIG_NO_HZ_COMMON >> rq->last_blocked_load_update_tick = jiffies; >> - if (cfs_rq_is_decayed(cfs_rq)) >> + if (cfs_rq_has_blocked(cfs_rq)) > > > Schouldn't this be !cfs_rq_has_blocked(cfs_rq) ? yes. I
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
Hi Vincent, Peter, I've been running some tests on your patches (Peter's base + the 2 from Vincent). The results themselves are hosted at [1]. The base of those tests is the same: a task ("accumulator") is run for 5 seconds (arbitrary value) to accumulate some load, then goes to sleep for .5 seconds. I've set up 3 test scenarios: Update by nohz_balance_kick() - Right before the "accumulator" task goes to sleep, a CPU-hogging task (100% utilization) is spawned on another CPU. It won't go idle so the only way to update the blocked load generated by "accumulator" is to kick an ILB (NOHZ_STATS_KICK). The test shows that this is behaving nicely - we keep kicking an ILB every ~36ms (see next test for comments on that) until there is no more blocked load. I did however notice some interesting scenarios: after the load has been fully decayed, a tiny background task can spawn and end in less than a scheduling period. However, it still goes through nohz_balance_enter_idle(), and thus sets nohz.stats_state, which will later cause an ILB kick. This makes me wonder if it's worth kicking ILBs for such tiny load values - perhaps it could be worth having a margin to set rq->has_blocked_load ? Furthermore, this tiny task will cause the ILB to iterate over all of the idle CPUs, although only one has stale load. For load update via NEWLY_IDLE load_balance() we use: static bool update_nohz_stats(struct rq *rq) { if (!rq->has_blocked_load) return false; [...] } But for load update via _nohz_idle_balance(), we iterate through all of the nohz CPUs and unconditionally call update_blocked_averages(). This could be avoided by remembering which CPUs have stale load before going idle. Initially I thought that was what nohz.stats_state was for, but it isn't. With Vincent's patches it's only ever set to either 0 or 1, but we could use it as a CPU mask, and use it to skip nohz CPUs that don't have stale load in _nohz_idle_balance() (when NOHZ_STATS_KICK). 
Update by idle_balance() Right before the "accumulator" task goes to sleep, a tiny periodic (period=32ms) task is spawned on another CPU. It's expected that it will update the blocked load in idle_balance(), either by running _nohz_idle_balance() locally or kicking an ILB (The overload flag shouldn't be set in this test case, so we shouldn't go through the NEWLY_IDLE load_balance()). This also seems to be working fine, but I'm noticing a delay between load updates that is closer to 64ms than 32ms. After digging into it I found out that the time checks done in idle_balance() and nohz_balancer_kick() are time_after(jiffies, next_stats), but IMHO they should be time_after_eq(jiffies, next_stats) to have 32ms-based updates. This also explains the 36ms periodicity of the updates in the test above. No update (idle system) --- Nothing special here, just making sure nothing happens when the system is fully idle. On a sidenote, that's relatively hard to achieve - I had to switch over to Juno because my HiKey960 gets interrupts every 16ms. The Juno still gets woken up every now and then but it's a bit quieter. [1]: https://gist.github.com/valschneider/a8da7bb8e11fb1ec63a419710f56c0a0 On 01/24/2018 08:25 AM, Vincent Guittot wrote: Hi, On Thursday 18 Jan 2018 at 10:38:07 (+), Morten Rasmussen wrote: On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote: On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote: Hi Peter, On 22 December 2017 at 21:42, Peter Zijlstra wrote: On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote: Right; but I figured we'd try and do it 'right' and see how horrible it is before we try and do funny things. So now it should have a 32ms tick for up to .5s when the system goes completely idle. No idea how bad that is.. I have tested your branch but the timer doesn't seem to fire correctly because I can still see blocked load in the use case I have run. 
I haven't found the reason yet. Hi Peter, With the patch below on top of your branch, the blocked loads are updated and decayed regularly. The main differences are: - It doesn't use a timer to trigger the ilb but the tick and the moment a cpu becomes idle. The main drawback of this solution is that the load stays blocked when the system is fully idle, with the advantage of not waking up a fully idle system. We have to wait for the next tick or newly idle event to update blocked load when the system leaves the idle state, which can be up to a tick long. If this is too long, we can check for kicking the ilb when a task wakes up, so the blocked load will be updated as soon as the system leaves the idle state. The main advantage is that we don't wake up a fully idle system every 32ms to update blocked load that will not be used. - I'm working on one more improvement to use nohz_idle_balance in the newly idle case when the system is not
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 01/24/2018 09:25 AM, Vincent Guittot wrote: Hi, On Thursday 18 Jan 2018 at 10:38:07 (+), Morten Rasmussen wrote: On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote: On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote: [...] Hi Peter, With the patch below on top of your branch, the blocked loads are updated and decayed regularly. The main differences are: - It doesn't use a timer to trigger the ilb but the tick and the moment a cpu becomes idle. The main drawback of this solution is that the load stays blocked when the system is fully idle, with the advantage of not waking up a fully idle system. We have to wait for the next tick or newly idle event to update blocked load when the system leaves the idle state, which can be up to a tick long. If this is too long, we can check for kicking the ilb when a task wakes up, so the blocked load will be updated as soon as the system leaves the idle state. The main advantage is that we don't wake up a fully idle system every 32ms to update blocked load that will not be used. - I'm working on one more improvement to use nohz_idle_balance in the newly idle case when the system is not overloaded and (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it exceeds this_rq->avg_idle. This will remove some calls to kick_ilb and some wakeups of idle cpus. This sounds like what I meant in my other reply :-) It seems pointless to have a timer to update PELT if the system is completely idle, and when it isn't we can piggy back other events to make the updates happen. The patch below implements what has been described above. It calls part of nohz_idle_balance when a cpu becomes idle and kicks an ilb if it takes too much time. This removes part of the ilbs that are kicked on an idle cpu for updating the blocked load, but the ratio really depends on when the tick happens compared to a cpu becoming idle and the 32ms boundary. 
I have an additional patch that makes it possible to update the blocked loads when a cpu becomes idle 1 period before kicking an ilb, and there are far fewer ilbs because we give more chance to the newly idle case (time_after is replaced by time_after_eq in idle_balance()). The patch also uses a function cfs_rq_has_blocked, which only checks the util/load_avg, instead of cfs_rq_is_decayed, which checks the *_sum too. This significantly reduces the number of blocked load updates. The *_avg will be fully decayed in around 300~400ms but it's far longer for the *_sum, which have a higher resolution and can easily reach almost seconds. But only the *_avg are used to make decisions, so keeping some blocked *_sum is acceptable. --- kernel/sched/fair.c | 121 +++- 1 file changed, 92 insertions(+), 29 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 898785d..ed90303 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) return true; } +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq) +{ + if (cfs_rq->avg.load_avg) + return true; + + if (cfs_rq->avg.util_avg) + return true; + + return false; +} + Can we not change cfs_rq_is_decayed() to use avg.foo_avg instead of avg.foo_sum ? #ifdef CONFIG_FAIR_GROUP_SCHED static void update_blocked_averages(int cpu) @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu) */ if (cfs_rq_is_decayed(cfs_rq)) list_del_leaf_cfs_rq(cfs_rq); - else + + /* Don't need periodic decay once load/util_avg are null */ + if (cfs_rq_has_blocked(cfs_rq)) done = false; } @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu) update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq); #ifdef CONFIG_NO_HZ_COMMON rq->last_blocked_load_update_tick = jiffies; - if (cfs_rq_is_decayed(cfs_rq)) + if (cfs_rq_has_blocked(cfs_rq)) Shouldn't this be !cfs_rq_has_blocked(cfs_rq) ? 
rq->has_blocked_load = 0; #endif rq_unlock_irqrestore(rq, ); [...] @@ -9438,7 +9451,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) */ if (need_resched()) { has_blocked_load = true; - break; + goto abort; + } + + /* +* If the update is done while the CPU becomes idle, we abort +* the update when its cost is higher than the average idle +* time in order to not delay a possible wake up. +*/ + if (idle == CPU_NEWLY_IDLE && this_rq->avg_idle < curr_cost) { +
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
Hi,

Le Thursday 18 Jan 2018 à 10:38:07 (+), Morten Rasmussen a écrit :
> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
> > Le Wednesday 03 Jan 2018 à 10:16:00 (+0100), Vincent Guittot a écrit :
> > > Hi Peter,
> > >
> > > On 22 December 2017 at 21:42, Peter Zijlstra wrote:
> > > > On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
> > > >> Right; but I figured we'd try and do it 'right' and see how horrible it
> > > >> is before we try and do funny things.
> > > >
> > > > So now it should have a 32ms tick for up to .5s when the system goes
> > > > completely idle.
> > > >
> > > > No idea how bad that is..
> > >
> > > I have tested your branch but the timer doesn't seem to fire correctly
> > > because i can still see blocked load in the use case i have run.
> > > I haven't found the reason yet
> >
> > Hi Peter,
> >
> > With the patch below on top of your branch, the blocked loads are updated
> > and decayed regularly. The main differences are:
> > - It doesn't use a timer to trigger the ilb but the tick and the moment a
> >   cpu becomes idle. The main drawback of this solution is that the load
> >   stays blocked when the system is fully idle, with the advantage of not
> >   waking up a fully idle system. We have to wait for the next tick or
> >   newly idle event to update the blocked load when the system leaves the
> >   idle state, which can take up to a tick. If this is too long, we can
> >   check for kicking an ilb when a task wakes up so the blocked load will
> >   be updated as soon as the system leaves the idle state. The main
> >   advantage is that we don't wake up a fully idle system every 32ms to
> >   update blocked load that will not be used.
> > - I'm working on one more improvement to use nohz_idle_balance in the
> >   newly idle case when the system is not overloaded and
> >   (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can
> >   try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as
> >   it exceeds this_rq->avg_idle. This will remove some calls to kick_ilb
> >   and some wake ups of idle cpus.
>
> This sounds like what I meant in my other reply :-)
>
> It seems pointless to have a timer to update PELT if the system is
> completely idle, and when it isn't we can piggy-back other events to
> make the updates happen.

The patch below implements what has been described above. It calls part of
nohz_idle_balance when a cpu becomes idle and kicks an ilb if that takes too
much time. This removes part of the ilbs that are kicked on an idle cpu to
update the blocked load, but the ratio really depends on when the tick happens
compared to a cpu becoming idle and the 32ms boundary.

I have an additional patch that updates the blocked loads when a cpu becomes
idle one period before kicking an ilb, and there are far fewer ilbs because we
give more chance to the newly idle case (time_after is replaced by
time_after_eq in idle_balance()).

The patch also uses a function cfs_rq_has_blocked(), which only checks the
util/load_avg, instead of cfs_rq_is_decayed(), which checks the *_sum too.
This significantly reduces the number of updates of blocked load. The *_avg
will be fully decayed in around 300~400ms, but it takes far longer for the
*_sum, which have a higher resolution, and we can easily reach almost seconds.
But only the *_avg are used to make decisions, so keeping some blocked *_sum
is acceptable.

---
 kernel/sched/fair.c | 121 +++-
 1 file changed, 92 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 898785d..ed90303 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	return true;
 }

+static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
+{
+	if (cfs_rq->avg.load_avg)
+		return true;
+
+	if (cfs_rq->avg.util_avg)
+		return true;
+
+	return false;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED

 static void update_blocked_averages(int cpu)
@@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
 		 */
 		if (cfs_rq_is_decayed(cfs_rq))
 			list_del_leaf_cfs_rq(cfs_rq);
-		else
+
+		/* Don't need periodic decay once load/util_avg are null */
+		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
 	}

@@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
-	if (cfs_rq_is_decayed(cfs_rq))
+	if (cfs_rq_has_blocked(cfs_rq))
 		rq->has_blocked_load = 0;
 #endif
 	rq_unlock_irqrestore(rq, &rf);
@@ -8818,6 +8831,7 @@ update_next_balance(struct sched_domain *sd, unsigned long
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 22 January 2018 at 10:40, Dietmar Eggemann wrote:
> On 01/15/2018 08:26 AM, Vincent Guittot wrote:
>>
>> Le Wednesday 03 Jan 2018 à 10:16:00 (+0100), Vincent Guittot a écrit :
>>>
>>> Hi Peter,
>>>
>>> On 22 December 2017 at 21:42, Peter Zijlstra wrote:
>>>> On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
>>>>>
>>>>> Right; but I figured we'd try and do it 'right' and see how horrible it
>>>>> is before we try and do funny things.
>>>>
>>>> So now it should have a 32ms tick for up to .5s when the system goes
>>>> completely idle.
>>>>
>>>> No idea how bad that is..
>>>
>>> I have tested your branch but the timer doesn't seem to fire correctly
>>> because i can still see blocked load in the use case i have run.
>>> I haven't found the reason yet
>
> Isn't the issue with the timer based implementation that
> rq->has_blocked_load is never set to 1 ?

Yes, that's what I suggested to Peter on IRC.

> Something you changed in your implementation by adding a
> rq->has_blocked_load = 1 into nohz_balance_enter_idle().
>
>> Hi Peter,
>>
>> With the patch below on top of your branch, the blocked loads are updated
>> and decayed regularly. The main differences are:
>> - It doesn't use a timer to trigger the ilb but the tick and the moment a
>>   cpu becomes idle. The main drawback of this solution is that the load
>>   stays blocked when the system is fully idle, with the advantage of not
>>   waking up a fully idle system. We have to wait for the next tick or
>>   newly idle event to update the blocked load when the system leaves the
>>   idle state, which can take up to a tick. If this is too long, we can
>>   check for kicking an ilb when a task wakes up so the blocked load will
>>   be updated as soon as the system leaves the idle state. The main
>>   advantage is that we don't wake up a fully idle system every 32ms to
>>   update blocked load that will not be used.
>> - I'm working on one more improvement to use nohz_idle_balance in the
>>   newly idle case when the system is not overloaded and
>>   (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can
>>   try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as
>>   it exceeds this_rq->avg_idle. This will remove some calls to kick_ilb
>>   and some wake ups of idle cpus.
>>
>> ---
>>  kernel/sched/fair.c | 72 +
>>  1 file changed, 34 insertions(+), 38 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 52114c6..898785d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5386,7 +5386,6 @@ static struct {
>>  	atomic_t stats_state;
>>  	unsigned long next_balance;	/* in jiffy units */
>>  	unsigned long next_stats;
>> -	struct timer_list timer;
>>  } nohz ____cacheline_aligned;
>>  #endif /* CONFIG_NO_HZ_COMMON */
>>
>> @@ -8004,8 +8003,9 @@ static inline void update_sd_lb_stats(struct lb_env
>> *env, struct sd_lb_stats *sd
>>  	prefer_sibling = 1;
>>  #ifdef CONFIG_NO_HZ_COMMON
>> -	if (env->idle == CPU_NEWLY_IDLE)
>> +	if (env->idle == CPU_NEWLY_IDLE && atomic_read(&nohz.stats_state)) {
>>  		env->flags |= LBF_NOHZ_STATS;
>> +	}
>>  #endif
>>  	load_idx = get_sd_load_idx(env->sd, env->idle);
>>
>> @@ -8818,6 +8818,8 @@ update_next_balance(struct sched_domain *sd,
>> unsigned long *next_balance)
>>  		*next_balance = next;
>>  }
>>
>> +static void kick_ilb(unsigned int flags);
>> +
>>  /*
>>   * idle_balance is called by schedule() if this_cpu is about to become
>>   * idle. Attempts to pull tasks from other CPUs.
>>
>> @@ -8852,12 +8854,16 @@ static int idle_balance(struct rq *this_rq, struct
>> rq_flags *rf)
>>  	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>>  	    !this_rq->rd->overload) {
>> +		unsigned long next = READ_ONCE(nohz.next_stats);
>>  		rcu_read_lock();
>>  		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>  		if (sd)
>>  			update_next_balance(sd, &next_balance);
>>  		rcu_read_unlock();
>> +		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
>> +			kick_ilb(NOHZ_STATS_KICK);
>> +
>>  		goto out;
>>  	}
>>
>> @@ -9075,18 +9081,6 @@ static void kick_ilb(unsigned int flags)
>>  	smp_send_reschedule(ilb_cpu);
>>  }
>>
>> -void nohz_balance_timer(struct timer_list *timer)
>> -{
>> -	unsigned long next = READ_ONCE(nohz.next_stats);
>> -
>> -	if (time_before(jiffies, next)) {
>> -		mod_timer(timer, next);
>> -		return;
>> -	}
>> -
>> -	kick_ilb(NOHZ_STATS_KICK);
>> -}
>> -
>>  /*
>>   * Current heuristic for kicking the idle load balancer in the presence
>>   * of an idle cpu in the system.
>>
>> @@ -9122,6
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 01/15/2018 08:26 AM, Vincent Guittot wrote:
> Le Wednesday 03 Jan 2018 à 10:16:00 (+0100), Vincent Guittot a écrit :
> > Hi Peter,
> >
> > On 22 December 2017 at 21:42, Peter Zijlstra wrote:
> > > On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
> > > > Right; but I figured we'd try and do it 'right' and see how horrible it
> > > > is before we try and do funny things.
> > >
> > > So now it should have a 32ms tick for up to .5s when the system goes
> > > completely idle.
> > >
> > > No idea how bad that is..
> >
> > I have tested your branch but the timer doesn't seem to fire correctly
> > because i can still see blocked load in the use case i have run.
> > I haven't found the reason yet

Isn't the issue with the timer based implementation that
rq->has_blocked_load is never set to 1 ?

Something you changed in your implementation by adding a
rq->has_blocked_load = 1 into nohz_balance_enter_idle().

> Hi Peter,
>
> With the patch below on top of your branch, the blocked loads are updated
> and decayed regularly. The main differences are:
> - It doesn't use a timer to trigger the ilb but the tick and the moment a
>   cpu becomes idle. The main drawback of this solution is that the load
>   stays blocked when the system is fully idle, with the advantage of not
>   waking up a fully idle system. We have to wait for the next tick or newly
>   idle event to update the blocked load when the system leaves the idle
>   state, which can take up to a tick. If this is too long, we can check for
>   kicking an ilb when a task wakes up so the blocked load will be updated
>   as soon as the system leaves the idle state. The main advantage is that
>   we don't wake up a fully idle system every 32ms to update blocked load
>   that will not be used.
> - I'm working on one more improvement to use nohz_idle_balance in the newly
>   idle case when the system is not overloaded and
>   (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can
>   try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it
>   exceeds this_rq->avg_idle. This will remove some calls to kick_ilb and
>   some wake ups of idle cpus.
>
> ---
>  kernel/sched/fair.c | 72 +
>  1 file changed, 34 insertions(+), 38 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 52114c6..898785d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5386,7 +5386,6 @@ static struct {
>  	atomic_t stats_state;
>  	unsigned long next_balance;	/* in jiffy units */
>  	unsigned long next_stats;
> -	struct timer_list timer;
>  } nohz ____cacheline_aligned;
>  #endif /* CONFIG_NO_HZ_COMMON */
>
> @@ -8004,8 +8003,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	prefer_sibling = 1;
>  #ifdef CONFIG_NO_HZ_COMMON
> -	if (env->idle == CPU_NEWLY_IDLE)
> +	if (env->idle == CPU_NEWLY_IDLE && atomic_read(&nohz.stats_state)) {
>  		env->flags |= LBF_NOHZ_STATS;
> +	}
>  #endif
>  	load_idx = get_sd_load_idx(env->sd, env->idle);
>
> @@ -8818,6 +8818,8 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>  		*next_balance = next;
>  }
>
> +static void kick_ilb(unsigned int flags);
> +
>  /*
>   * idle_balance is called by schedule() if this_cpu is about to become
>   * idle. Attempts to pull tasks from other CPUs.
>
> @@ -8852,12 +8854,16 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>  	    !this_rq->rd->overload) {
> +		unsigned long next = READ_ONCE(nohz.next_stats);
>  		rcu_read_lock();
>  		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  		if (sd)
>  			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
> +		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
> +			kick_ilb(NOHZ_STATS_KICK);
> +
>  		goto out;
>  	}
>
> @@ -9075,18 +9081,6 @@ static void kick_ilb(unsigned int flags)
>  	smp_send_reschedule(ilb_cpu);
>  }
>
> -void nohz_balance_timer(struct timer_list *timer)
> -{
> -	unsigned long next = READ_ONCE(nohz.next_stats);
> -
> -	if (time_before(jiffies, next)) {
> -		mod_timer(timer, next);
> -		return;
> -	}
> -
> -	kick_ilb(NOHZ_STATS_KICK);
> -}
> -
>  /*
>   * Current heuristic for kicking the idle load balancer in the presence
>   * of an idle cpu in the system.
>
> @@ -9122,6 +9116,9 @@ static void nohz_balancer_kick(struct rq *rq)
>  	if (likely(!atomic_read(&nohz.nr_cpus)))
>  		return;
> +	if (time_after(now, nohz.next_stats) && atomic_read(&nohz.stats_state))
> +		flags = NOHZ_STATS_KICK;
> +
>  	if (time_before(now, nohz.next_balance))
>  		goto out;
>
> @@ -9227,7 +9224,6 @@ static void set_cpu_sd_state_idle(int cpu)
>  void nohz_balance_enter_idle(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> -	unsigned int val, new;
>  	SCHED_WARN_ON(cpu !=
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
> Le Wednesday 03 Jan 2018 à 10:16:00 (+0100), Vincent Guittot a écrit :
> > Hi Peter,
> >
> > On 22 December 2017 at 21:42, Peter Zijlstra wrote:
> > > On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
> > >> Right; but I figured we'd try and do it 'right' and see how horrible it
> > >> is before we try and do funny things.
> > >
> > > So now it should have a 32ms tick for up to .5s when the system goes
> > > completely idle.
> > >
> > > No idea how bad that is..
> >
> > I have tested your branch but the timer doesn't seem to fire correctly
> > because i can still see blocked load in the use case i have run.
> > I haven't found the reason yet
>
> Hi Peter,
>
> With the patch below on top of your branch, the blocked loads are updated
> and decayed regularly. The main differences are:
> - It doesn't use a timer to trigger the ilb but the tick and the moment a
>   cpu becomes idle. The main drawback of this solution is that the load
>   stays blocked when the system is fully idle, with the advantage of not
>   waking up a fully idle system. We have to wait for the next tick or newly
>   idle event to update the blocked load when the system leaves the idle
>   state, which can take up to a tick. If this is too long, we can check for
>   kicking an ilb when a task wakes up so the blocked load will be updated
>   as soon as the system leaves the idle state. The main advantage is that
>   we don't wake up a fully idle system every 32ms to update blocked load
>   that will not be used.
> - I'm working on one more improvement to use nohz_idle_balance in the newly
>   idle case when the system is not overloaded and
>   (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can
>   try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it
>   exceeds this_rq->avg_idle. This will remove some calls to kick_ilb and
>   some wake ups of idle cpus.

This sounds like what I meant in my other reply :-)

It seems pointless to have a timer to update PELT if the system is
completely idle, and when it isn't we can piggy-back other events to
make the updates happen.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Mon, Jan 15, 2018 at 10:43:18AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 02, 2018 at 03:44:57PM +, Morten Rasmussen wrote:
>
> > Vincent already proposed, why can't we just modify Brendan's
> > CPU_NEWLY_IDLE proposal to do a stats update from idle_balance() every
> > 32ms regardless of whether we need to load-balance?
>
> I think that code is there, no?
>
> Subject: sched: Update blocked load from NEWIDLE

The mechanics are there, but I think the problem is that idle_balance()
bails out before we get to it in some cases. If we only have a few small
periodic tasks running, rd->overload won't be set and idle_balance()
returns before doing anything. We would need some sort of check to see
whether a PELT update is due and make sure it happens, even if
idle_balance() has nothing else to do.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Tue, Jan 02, 2018 at 03:44:57PM +0000, Morten Rasmussen wrote: > Vincent already proposed, why can't we just modify Brendan's > CPU_NEWLY_IDLE proposal to do a stats update from idle_balance() every > 32ms regardless of whether we need to load-balance? I think that code is there, no? Subject: sched: Update blocked load from NEWIDLE
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote: > Hi Peter, > > On 22 December 2017 at 21:42, Peter Zijlstra wrote: > > On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote: > >> Right; but I figured we'd try and do it 'right' and see how horrible it > >> is before we try and do funny things. > > > > So now it should have a 32ms tick for up to .5s when the system goes > > completely idle. > > > > No idea how bad that is.. > > I have tested your branch but the timer doesn't seem to fire correctly > because I can still see blocked load in the use case I have run. > I haven't found the reason yet Hi Peter, With the patch below on top of your branch, the blocked loads are updated and decayed regularly. The main differences are: - It doesn't use a timer to trigger the ilb, but the tick and the point where a cpu becomes idle. The main drawback of this solution is that the load stays blocked when the system is fully idle, with the advantage of not waking up a fully idle system. We have to wait for the next tick or newly idle event to update the blocked load when the system leaves the idle state, which can be up to a tick long. If this is too long, we can check for kicking the ilb when a task wakes up, so the blocked load will be updated as soon as the system leaves the idle state. The main advantage is that we don't wake up a fully idle system every 32ms to update blocked load that will not be used. - I'm working on one more improvement: use nohz_idle_balance in the newly idle case when the system is not overloaded and (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it exceeds this_rq->avg_idle. This will remove some calls to kick_ilb and some wake-ups of idle cpus.
--- kernel/sched/fair.c | 72 + 1 file changed, 34 insertions(+), 38 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 52114c6..898785d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5386,7 +5386,6 @@ static struct { atomic_t stats_state; unsigned long next_balance; /* in jiffy units */ unsigned long next_stats; - struct timer_list timer; } nohz ____cacheline_aligned; #endif /* CONFIG_NO_HZ_COMMON */ @@ -8004,8 +8003,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd prefer_sibling = 1; #ifdef CONFIG_NO_HZ_COMMON - if (env->idle == CPU_NEWLY_IDLE) + if (env->idle == CPU_NEWLY_IDLE && atomic_read(&nohz.stats_state)) { env->flags |= LBF_NOHZ_STATS; + } #endif load_idx = get_sd_load_idx(env->sd, env->idle); @@ -8818,6 +8818,8 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance) *next_balance = next; } +static void kick_ilb(unsigned int flags); + /* * idle_balance is called by schedule() if this_cpu is about to become * idle. Attempts to pull tasks from other CPUs. @@ -8852,12 +8854,16 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf) if (this_rq->avg_idle < sysctl_sched_migration_cost || !this_rq->rd->overload) { + unsigned long next = READ_ONCE(nohz.next_stats); rcu_read_lock(); sd = rcu_dereference_check_sched_domain(this_rq->sd); if (sd) update_next_balance(sd, &next_balance); rcu_read_unlock(); + if (time_after(jiffies, next) && atomic_read(&nohz.stats_state)) + kick_ilb(NOHZ_STATS_KICK); + goto out; } @@ -9075,18 +9081,6 @@ static void kick_ilb(unsigned int flags) smp_send_reschedule(ilb_cpu); } -void nohz_balance_timer(struct timer_list *timer) -{ - unsigned long next = READ_ONCE(nohz.next_stats); - - if (time_before(jiffies, next)) { - mod_timer(timer, next); - return; - } - - kick_ilb(NOHZ_STATS_KICK); -} - /* * Current heuristic for kicking the idle load balancer in the presence * of an idle cpu in the system. 
@@ -9122,6 +9116,9 @@ static void nohz_balancer_kick(struct rq *rq) if (likely(!atomic_read(&nohz.nr_cpus))) return; + if (time_after(now, nohz.next_stats) && atomic_read(&nohz.stats_state)) + flags = NOHZ_STATS_KICK; + if (time_before(now, nohz.next_balance)) goto out; @@ -9227,7 +9224,6 @@ static void set_cpu_sd_state_idle(int cpu) void nohz_balance_enter_idle(int cpu) { struct rq *rq = cpu_rq(cpu); - unsigned int val, new; SCHED_WARN_ON(cpu != smp_processor_id()); @@ -9251,6 +9247,7 @@ void nohz_balance_enter_idle(int cpu) return; rq->nohz_tick_stopped = 1; + rq->has_blocked_load = 1; cpumask_set_cpu(cpu, nohz.idle_cpus_mask); atomic_inc(&nohz.nr_cpus);
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
Hi Peter, On 22 December 2017 at 21:42, Peter Zijlstra wrote: > On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote: >> Right; but I figured we'd try and do it 'right' and see how horrible it >> is before we try and do funny things. > > So now it should have a 32ms tick for up to .5s when the system goes > completely idle. > > No idea how bad that is.. I have tested your branch but the timer doesn't seem to fire correctly because I can still see blocked load in the use case I have run. I haven't found the reason yet
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Fri, Dec 22, 2017 at 09:42:47PM +0100, Peter Zijlstra wrote: > On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote: > > Right; but I figured we'd try and do it 'right' and see how horrible it > > is before we try and do funny things. > > So now it should have a 32ms tick for up to .5s when the system goes > completely idle. > > No idea how bad that is.. Does it mean that the 32ms tick will keep going forever if the system doesn't go completely idle? Some tiny background task or a slightly bigger one with a longer period? Do we actually care about stale values if the system is completely idle? Instead of hacking select_task_rq_fair() to kick off a stats update as Vincent already proposed, why can't we just modify Brendan's CPU_NEWLY_IDLE proposal to do a stats update from idle_balance() every 32ms regardless of whether we need to load-balance? This way we should get updates if there is anything running, we don't touch the wake-up path, we don't cause any additional wake-ups, and we don't need a timer. What am I missing?
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote: > Right; but I figured we'd try and do it 'right' and see how horrible it > is before we try and do funny things. So now it should have a 32ms tick for up to .5s when the system goes completely idle. No idea how bad that is..
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Fri, Dec 22, 2017 at 03:32:53PM +0100, Vincent Guittot wrote: > > The only thing I could come up with is running a timer for this :/ That > > would keep the ILB thing running until all load is decayed (have a patch > > for that somewhere). > > IMHO running a timer doesn't sound really great I tend to agree.. > When we have enough activity on the system, the tick and the periodic > load balance will ensure the update of load of all cpus (including the > idle cpus) at the load balance period pace. > But if we don't have enough activity to trig the periodic update > through ilb or because the system is not overloaded or even almost > idle, we don't have these periodic update anymore. > The goal is to do a lazy update of the blocked load to not hurt too > much power consumption of idle CPUs. When a task wakes up and the > blocked idle load have not been updated for a while, we trig the > update of these blocked loads in parallel to the wake up so the data > will be more accurate for the next events. > It's already too late for the current wake up but that's not a big > deal because the wake up path of a light loaded system is mainly > choosing between previous and current cpu and the load_avg_contrib and > the utilization will have been updated for next events. Right; but I figured we'd try and do it 'right' and see how horrible it is before we try and do funny things.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 22 December 2017 at 15:31, Peter Zijlstra wrote: > On Fri, Dec 22, 2017 at 10:12:21AM +0100, Peter Zijlstra wrote: > >> The only thing I could come up with is running a timer for this :/ That >> would keep the ILB thing running until all load is decayed (have a patch >> for that somewhere). > > Implemented that; pushed it out, should all be at: > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/testing > > but given how today is going, it'll eat your nan and set your cat on > fire. Our emails crossed. I'm going to have a look at your branch
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 22 December 2017 at 10:12, Peter Zijlstra wrote: > On Fri, Dec 22, 2017 at 09:29:15AM +0100, Peter Zijlstra wrote: >> On Fri, Dec 22, 2017 at 09:05:45AM +0100, Vincent Guittot wrote: >> > On 22 December 2017 at 08:59, Peter Zijlstra wrote: >> > > On Thu, Dec 21, 2017 at 05:56:32PM +0100, Vincent Guittot wrote: >> > >> In fact, we can't only rely on the tick and newly_idle load balance to >> > >> ensure a period update of the blocked load because they can never >> > >> happen. >> > > >> > > I'm confused, why would the ilb not happen? >> > >> > the ilb will be kick only if tick fires which might not be the case >> > for task that runs less than a tick >> >> Oh, urgh, you're talking about when the entire system is idle. Yes >> indeed. >> >> Lemme have a think, surely we can do something saner there. > > The only thing I could come up with is running a timer for this :/ That > would keep the ILB thing running until all load is decayed (have a patch > for that somewhere). IMHO running a timer doesn't sound really great. When we have enough activity on the system, the tick and the periodic load balance will ensure the update of the load of all cpus (including the idle cpus) at the load-balance period pace. But if we don't have enough activity to trigger the periodic update through the ilb, because the system is not overloaded or even almost idle, we don't have these periodic updates anymore. The goal is to do a lazy update of the blocked load so as not to hurt the power consumption of idle CPUs too much. When a task wakes up and the blocked idle load has not been updated for a while, we trigger the update of these blocked loads in parallel to the wake up, so the data will be more accurate for the next events. It's already too late for the current wake up, but that's not a big deal because the wake-up path of a lightly loaded system is mainly choosing between the previous and current cpu, and the load_avg_contrib and the utilization will have been updated for the next events.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Fri, Dec 22, 2017 at 10:12:21AM +0100, Peter Zijlstra wrote: > The only thing I could come up with is running a timer for this :/ That > would keep the ILB thing running until all load is decayed (have a patch > for that somewhere). Implemented that; pushed it out, should all be at: git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/testing but given how today is going, it'll eat your nan and set your cat on fire.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Fri, Dec 22, 2017 at 09:29:15AM +0100, Peter Zijlstra wrote: > On Fri, Dec 22, 2017 at 09:05:45AM +0100, Vincent Guittot wrote: > > On 22 December 2017 at 08:59, Peter Zijlstra wrote: > > > On Thu, Dec 21, 2017 at 05:56:32PM +0100, Vincent Guittot wrote: > > >> In fact, we can't only rely on the tick and newly_idle load balance to > > >> ensure a period update of the blocked load because they can never > > >> happen. > > > > > > I'm confused, why would the ilb not happen? > > > > the ilb will be kick only if tick fires which might not be the case > > for task that runs less than a tick > > Oh, urgh, you're talking about when the entire system is idle. Yes > indeed. > > Lemme have a think, surely we can do something saner there. The only thing I could come up with is running a timer for this :/ That would keep the ILB thing running until all load is decayed (have a patch for that somewhere).
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Fri, Dec 22, 2017 at 09:05:45AM +0100, Vincent Guittot wrote: > On 22 December 2017 at 08:59, Peter Zijlstra wrote: > > On Thu, Dec 21, 2017 at 05:56:32PM +0100, Vincent Guittot wrote: > >> In fact, we can't only rely on the tick and newly_idle load balance to > >> ensure a period update of the blocked load because they can never > >> happen. > > > > I'm confused, why would the ilb not happen? > > the ilb will be kick only if tick fires which might not be the case > for task that runs less than a tick Oh, urgh, you're talking about when the entire system is idle. Yes indeed. Lemme have a think, surely we can do something saner there.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 22 December 2017 at 08:59, Peter Zijlstra wrote: > On Thu, Dec 21, 2017 at 05:56:32PM +0100, Vincent Guittot wrote: >> In fact, we can't only rely on the tick and newly_idle load balance to >> ensure a period update of the blocked load because they can never >> happen. > > I'm confused, why would the ilb not happen? the ilb will be kicked only if the tick fires, which might not be the case for a task that runs less than a tick
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 22 December 2017 at 08:56, Peter Zijlstra wrote: > On Thu, Dec 21, 2017 at 05:23:27PM +0100, Vincent Guittot wrote: >> Hi Peter, >> >> I think that part of the proposal is missing. >> >> One goal of the patchset was to kick an update of the stats of idle >> cpu when a task wake up on a cpu but the statistic has not been >> updated for a while. >> >> That's why there where a call to nohz_kick_needed in the proposal to >> kick ilb but only for updating blocked load and not a full idle load >> balance >> >> I can't find this call any more in your patchset > > Yeah, I took it out because it didn't make sense to me. If you're waking > to a stale CPU, you're too late. I agree that it's too late for the current wake up, but that's the trade-off against delaying the wake up until all blocked load has been updated.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Thu, Dec 21, 2017 at 05:56:32PM +0100, Vincent Guittot wrote: > In fact, we can't only rely on the tick and newly_idle load balance to > ensure a period update of the blocked load because they can never > happen. I'm confused, why would the ilb not happen?
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On Thu, Dec 21, 2017 at 05:23:27PM +0100, Vincent Guittot wrote: > Hi Peter, > > I think that part of the proposal is missing. > > One goal of the patchset was to kick an update of the stats of idle > cpu when a task wake up on a cpu but the statistic has not been > updated for a while. > > That's why there where a call to nohz_kick_needed in the proposal to > kick ilb but only for updating blocked load and not a full idle load > balance > > I can't find this call any more in your patchset Yeah, I took it out because it didn't make sense to me. If you're waking to a stale CPU, you're too late.
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
On 21 December 2017 at 17:23, Vincent Guittot wrote: > Hi Peter, > > I think that part of the proposal is missing. > > One goal of the patchset was to kick an update of the stats of idle > cpu when a task wake up on a cpu but the statistic has not been > updated for a while. > > That's why there where a call to nohz_kick_needed in the proposal to > kick ilb but only for updating blocked load and not a full idle load sorry, a call to nohz_kick_needed (which becomes nohz_balancer_kick in your patchset) in select_task_rq_fair > balance > > I can't find this call any more in your patchset In fact, we can't rely only on the tick and newly_idle load balance to ensure a periodic update of the blocked load because they can never happen. So we need to find another place to kick for a periodic update, which is when a task wakes up > > On 21 December 2017 at 11:21, Peter Zijlstra wrote: >> >> Suggested-by: Vincent Guittot >> Signed-off-by: Peter Zijlstra (Intel) >> --- >> kernel/sched/core.c |4 +-- >> kernel/sched/fair.c | 52 >> ++- >> kernel/sched/sched.h |4 +++ >> 3 files changed, 41 insertions(+), 19 deletions(-) >> >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -608,7 +608,7 @@ static inline bool got_nohz_idle_kick(vo >> { >> int cpu = smp_processor_id(); >> >> - if (!(atomic_read(nohz_flags(cpu)) & NOHZ_BALANCE_KICK)) >> + if (!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK)) >> return false; >> >> if (idle_cpu(cpu) && !need_resched()) >> @@ -618,7 +618,7 @@ static inline bool got_nohz_idle_kick(vo >> * We can't run Idle Load Balance on this CPU for this time so we >> * cancel it and clear NOHZ_BALANCE_KICK >> */ >> - atomic_andnot(NOHZ_BALANCE_KICK, nohz_flags(cpu)); >> + atomic_andnot(NOHZ_KICK_MASK, nohz_flags(cpu)); >> return false; >> } >> >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -9001,8 +9001,8 @@ static void nohz_balancer_kick(void) >> if (ilb_cpu >= nr_cpu_ids) >> return; >> >> - flags = atomic_fetch_or(NOHZ_BALANCE_KICK, 
nohz_flags(ilb_cpu)); >> - if (flags & NOHZ_BALANCE_KICK) >> + flags = atomic_fetch_or(NOHZ_KICK_MASK, nohz_flags(ilb_cpu)); >> + if (flags & NOHZ_KICK_MASK) >> return; >> /* >> * Use smp_send_reschedule() instead of resched_cpu(). >> @@ -9125,8 +9125,6 @@ static void rebalance_domains(struct rq >> int need_serialize, need_decay = 0; >> u64 max_cost = 0; >> >> - update_blocked_averages(cpu); >> - >> rcu_read_lock(); >> for_each_domain(cpu, sd) { >> /* >> @@ -9221,20 +9219,27 @@ static void rebalance_domains(struct rq >> * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the >> * rebalancing for all the cpus for whom scheduler ticks are stopped. >> */ >> -static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) >> +static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) >> { >> - int this_cpu = this_rq->cpu; >> - struct rq *rq; >> - int balance_cpu; >> /* Earliest time when we have to do rebalance again */ >> unsigned long next_balance = jiffies + 60*HZ; >> int update_next_balance = 0; >> + int this_cpu = this_rq->cpu; >> + unsigned int flags; >> + int balance_cpu; >> + struct rq *rq; >> >> - if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_BALANCE_KICK)) >> - return; >> + if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK)) >> + return false; >> >> - if (idle != CPU_IDLE) >> - goto end; >> + if (idle != CPU_IDLE) { >> + atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu)); >> + return false; >> + } >> + >> + flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu)); >> + >> + SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK); >> >> for_each_cpu(balance_cpu, nohz.idle_cpus_mask) { >> if (balance_cpu == this_cpu || !idle_cpu(balance_cpu)) >> @@ -9262,7 +9267,9 @@ static void nohz_idle_balance(struct rq >> cpu_load_update_idle(rq); >> rq_unlock_irq(rq, ); >> >> - rebalance_domains(rq, CPU_IDLE); >> + update_blocked_averages(rq->cpu); >> + if (flags & NOHZ_BALANCE_KICK) >> + rebalance_domains(rq, 
CPU_IDLE); >> } >> >> if (time_after(next_balance, rq->next_balance)) { >> @@ -9271,6 +9278,10 @@ static void nohz_idle_balance(struct rq >> } >> } >> >> +
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

On 21 December 2017 at 17:23, Vincent Guittot wrote:
> Hi Peter,
>
> I think that part of the proposal is missing.
>
> One goal of the patchset was to kick an update of the stats of idle
> CPUs when a task wakes up on a CPU but the statistics have not been
> updated for a while.
>
> That's why there was a call to nohz_kick_needed in the proposal to
> kick an ilb, but only for updating blocked load and not a full idle load

Sorry, a call to nohz_kick_needed (which becomes nohz_balancer_kick in
your patchset) in select_task_rq_fair.

> balance.
>
> I can't find this call any more in your patchset.

In fact, we can't rely only on the tick and the newly-idle load balance
to ensure a periodic update of the blocked load, because they may never
happen. So we need to find another place to kick a periodic update,
which is when a task wakes up.

> On 21 December 2017 at 11:21, Peter Zijlstra wrote:
>> [...]
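Vincent's point — kick a stats-only ILB from the wakeup path when the blocked load of idle CPUs has gone stale — boils down to a simple staleness check. The sketch below is illustrative only: the names `stats_update_due` and `LOAD_AVG_PERIOD_MS` and the millisecond clock are assumptions, not kernel code; in the scheduler the comparison would be against `jiffies` and a PELT-derived period.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical staleness window; a real implementation would derive
 * this from the PELT decay period, not a made-up constant. */
#define LOAD_AVG_PERIOD_MS 32

/* Return true when the blocked-load statistics of idle CPUs have not
 * been refreshed for more than one period, i.e. this wakeup should
 * kick an ILB with the stats-only flag rather than a full balance. */
static bool stats_update_due(uint64_t now_ms, uint64_t last_update_ms)
{
	return (now_ms - last_update_ms) > LOAD_AVG_PERIOD_MS;
}
```

The cost trade-off discussed later in the thread is exactly about how often this predicate fires for no benefit versus the bookkeeping needed to make it precise.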
[RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

Suggested-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/sched/core.c  |  4 +--
 kernel/sched/fair.c  | 52 ++-
 kernel/sched/sched.h |  4 +++
 3 files changed, 41 insertions(+), 19 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -608,7 +608,7 @@ static inline bool got_nohz_idle_kick(vo
 {
 	int cpu = smp_processor_id();
 
-	if (!(atomic_read(nohz_flags(cpu)) & NOHZ_BALANCE_KICK))
+	if (!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK))
 		return false;
 
 	if (idle_cpu(cpu) && !need_resched())
@@ -618,7 +618,7 @@ static inline bool got_nohz_idle_kick(vo
 	 * We can't run Idle Load Balance on this CPU for this time so we
 	 * cancel it and clear NOHZ_BALANCE_KICK
 	 */
-	atomic_andnot(NOHZ_BALANCE_KICK, nohz_flags(cpu));
+	atomic_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
 	return false;
 }
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9001,8 +9001,8 @@ static void nohz_balancer_kick(void)
 	if (ilb_cpu >= nr_cpu_ids)
 		return;
 
-	flags = atomic_fetch_or(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu));
-	if (flags & NOHZ_BALANCE_KICK)
+	flags = atomic_fetch_or(NOHZ_KICK_MASK, nohz_flags(ilb_cpu));
+	if (flags & NOHZ_KICK_MASK)
 		return;
 	/*
 	 * Use smp_send_reschedule() instead of resched_cpu().
@@ -9125,8 +9125,6 @@ static void rebalance_domains(struct rq
 	int need_serialize, need_decay = 0;
 	u64 max_cost = 0;
 
-	update_blocked_averages(cpu);
-
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
 		/*
@@ -9221,20 +9219,27 @@ static void rebalance_domains(struct rq
  * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 {
-	int this_cpu = this_rq->cpu;
-	struct rq *rq;
-	int balance_cpu;
 	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = jiffies + 60*HZ;
 	int update_next_balance = 0;
+	int this_cpu = this_rq->cpu;
+	unsigned int flags;
+	int balance_cpu;
+	struct rq *rq;
 
-	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_BALANCE_KICK))
-		return;
+	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
+		return false;
 
-	if (idle != CPU_IDLE)
-		goto end;
+	if (idle != CPU_IDLE) {
+		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
+		return false;
+	}
+
+	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
+
+	SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
 
 	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
 		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
@@ -9262,7 +9267,9 @@ static void nohz_idle_balance(struct rq
 		cpu_load_update_idle(rq);
 		rq_unlock_irq(rq, &rf);
 
-		rebalance_domains(rq, CPU_IDLE);
+		update_blocked_averages(rq->cpu);
+		if (flags & NOHZ_BALANCE_KICK)
+			rebalance_domains(rq, CPU_IDLE);
 	}
 
 	if (time_after(next_balance, rq->next_balance)) {
@@ -9271,6 +9278,10 @@ static void nohz_idle_balance(struct rq
 		}
 	}
 
+	update_blocked_averages(this_cpu);
+	if (flags & NOHZ_BALANCE_KICK)
+		rebalance_domains(this_rq, CPU_IDLE);
+
 	/*
 	 * next_balance will be updated only when there is a need.
 	 * When the CPU is attached to null domain for ex, it will not be
@@ -9278,8 +9289,8 @@ static void nohz_idle_balance(struct rq
 	 */
 	if (likely(update_next_balance))
 		nohz.next_balance = next_balance;
-end:
-	atomic_andnot(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
+
+	return true;
 }
 
 /*
@@ -9366,7 +9377,10 @@ static inline bool nohz_kick_needed(stru
 	return kick;
 }
 #else
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+{
+	return false;
+}
 #endif
 
 /*
@@ -9387,7 +9401,11 @@ static __latent_entropy void run_rebalan
	 * load balance only within the local sched_domain hierarchy
	 * and abort nohz_idle_balance altogether if we pull some load.
	 */
-	nohz_idle_balance(this_rq, idle);
+	if (nohz_idle_balance(this_rq, idle))
+		return;
+
+	/* normal load balance */
+	update_blocked_averages(this_rq->cpu);
 	rebalance_domains(this_rq, idle);
 }
---
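The flag protocol the patch relies on — kick with `atomic_fetch_or`, consume with `atomic_fetch_andnot`, so a duplicate kick is detected and a stats-only kick can be told apart from a full balance — can be demonstrated in userspace with C11 atomics. The flag values below are assumptions for illustration: the real `NOHZ_*` definitions live in the `kernel/sched/sched.h` hunk, which is not shown in this mail, and C11 has no `atomic_fetch_andnot`, so the consumer uses `atomic_fetch_and` with the complemented mask.

```c
#include <stdatomic.h>

/* Assumed flag layout; the actual values come from kernel/sched/sched.h. */
#define NOHZ_BALANCE_KICK 0x1u /* full idle load balance requested */
#define NOHZ_STATS_KICK   0x2u /* only update blocked load of idle CPUs */
#define NOHZ_KICK_MASK    (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)

static atomic_uint nohz_flags;

/* Kicker side (cf. nohz_balancer_kick): set the requested flags and
 * report whether we are the first kicker, i.e. whether an IPI to the
 * ILB CPU is still needed. A second caller sees the pending bits and
 * backs off, mirroring the "if (flags & NOHZ_KICK_MASK) return;" test. */
static int kick_ilb(unsigned int flags)
{
	unsigned int prev = atomic_fetch_or(&nohz_flags, flags);

	return (prev & NOHZ_KICK_MASK) == 0; /* 1 = send the IPI */
}

/* ILB side (cf. nohz_idle_balance): atomically grab and clear every
 * pending kick flag; the caller then updates blocked averages always,
 * but runs rebalance_domains() only if NOHZ_BALANCE_KICK was set. */
static unsigned int consume_kick(void)
{
	return atomic_fetch_and(&nohz_flags, ~NOHZ_KICK_MASK) & NOHZ_KICK_MASK;
}
```

Because `consume_kick()` clears both bits in one atomic operation, a stats-only kick that races with a later full-balance kick is still handled in a single ILB pass, which is why the patch can drop the old `end:` label that cleared `NOHZ_BALANCE_KICK` unconditionally.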