Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-02-06 Thread Vincent Guittot
On 5 February 2018 at 23:18, Valentin Schneider wrote:
> On 01/30/2018 11:41 AM, Valentin Schneider wrote:
>> [...]
>>> I have studied a way to keep track of how many cpus still have blocked
>>> load, to try to minimize the number of useless ilb kicks, but this adds
>>> more atomic operations, which can impact system throughput under heavy
>>> load with a lot of very small wake ups. That's why I have proposed this
>>> simpler solution. But it's probably just a matter of where we want to
>>> "waste" time. Either we accept to spend a bit more time to check the
>>> state of idle CPUs, or we accept to kick an ilb from time to time for
>>> no good reason.
>>>
>>
>> Agreed. I have the feeling that spending more time doing atomic ops could be 
>> worth it - I'll try to test this out and see if it's actually relevant.
>>
>
> I gave this a spin, still using Vincent's patches with the included patch on 
> top. Nothing too clever, just seeing how replacing nohz.stats_state with a 
> cpumask would go.
>
> I've replaced nohz.stats_state by nohz.stale_cpus_mask. I kept changes 
> minimal - there are some places where I think nohz.idle_cpus_mask could be 
> substituted by nohz.stale_cpus_mask. Speaking about that, I was about to 
> write a comment regarding the fact that nohz.stale_cpus_mask should be a 
> subset of nohz.idle_cpus_mask, but I realized it's actually not true:
>
> In the current implementation (cpumask or stats_state flag), an idle CPU is 
> defined as having blocked load as soon as it goes through 
> nohz_balance_enter_idle(), and that flag is cleared when we go through 
> _nohz_idle_balance() (and newly idle load balance in my cpumask 
> implementation).
> However we can imagine a scenario where a CPU goes idle, is flagged as having 
> blocked load, then it wakes up and goes through its periodic balance code and 
> updates that load. Yet, the flag (or cpumask) won't have been updated.
> So I think there could be a discussion on whether the flag should be cleared 
> on nohz_balance_exit_idle() if we assume periodic balance now takes care of 
> this. It could cause issues if we have a weird scenario where a CPU keeps 
> going online/idle but never stays online long enough for a tick though.
> Alternatively we could clear that flag when going through the first periodic 
> balance after idling, but then that's a bit weird because we're using a nohz 
> structure in a non-nohz context.
>
>
> Anyway, I tried to get some profiling done with the cpumask but there's
> something wrong with my setup - I would only get nonsense numbers (for both
> baseline & my patch) - so I added start/end trace_printks to
> _nohz_idle_balance(). It's ugly, inaccurate and unorthodox, but it still
> gives a rough idea of how the cpumask impacts stuff.
> I ran 20 iterations of my "nohz update" test case (a task accumulates load, 
> goes to sleep, and another always-running task keeps kicking an ILB to decay 
> that blocked load) and the time saved by skipping CPUs is in the ballpark of 
> 20%. Notebook is at [1].

Even if saving time in the ILB is interesting, we have to check the
impact of the bitmask operations on the wake-up latency of the CPUs when
the system is heavily used and generates a lot of sleeps/wakeups on CPUs.
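
For context, the extra work this refers to sits on the idle entry path;
a minimal sketch (based on the stale_cpus_mask patch further below, not
the code as posted):

void nohz_balance_enter_idle(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (!cpu_active(cpu))
		return;

	rq->has_blocked_load = 1;
	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);	/* atomic set_bit() */
	atomic_inc(&nohz.nr_cpus);
	/*
	 * One more atomic RMW on a shared cacheline per idle entry:
	 * this is the cost being weighed against useless ILB kicks.
	 */
	cpumask_set_cpu(cpu, nohz.stale_cpus_mask);
	set_cpu_sd_state_idle(cpu);
}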


>
> I'll try to get a proper function profiling working for when Vincent posts 
> his "v2".
>
> [1]: https://gist.github.com/valschneider/6f203143bee1e149f24c44e9582a9eff
>
> ---


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-02-05 Thread Valentin Schneider
On 01/30/2018 11:41 AM, Valentin Schneider wrote:
> [...]
>> I have studied a way to keep track of how many cpus still have blocked
>> load, to try to minimize the number of useless ilb kicks, but this adds
>> more atomic operations, which can impact system throughput under heavy
>> load with a lot of very small wake ups. That's why I have proposed this
>> simpler solution. But it's probably just a matter of where we want to
>> "waste" time. Either we accept to spend a bit more time to check the
>> state of idle CPUs, or we accept to kick an ilb from time to time for
>> no good reason.
>>
> 
> Agreed. I have the feeling that spending more time doing atomic ops could be 
> worth it - I'll try to test this out and see if it's actually relevant.
> 

I gave this a spin, still using Vincent's patches with the included patch on 
top. Nothing too clever, just seeing how replacing nohz.stats_state with a 
cpumask would go.

I've replaced nohz.stats_state by nohz.stale_cpus_mask. I kept changes minimal 
- there are some places where I think nohz.idle_cpus_mask could be substituted 
by nohz.stale_cpus_mask. Speaking about that, I was about to write a comment 
regarding the fact that nohz.stale_cpus_mask should be a subset of 
nohz.idle_cpus_mask, but I realized it's actually not true:

In the current implementation (cpumask or stats_state flag), an idle CPU is 
defined as having blocked load as soon as it goes through 
nohz_balance_enter_idle(), and that flag is cleared when we go through 
_nohz_idle_balance() (and newly idle load balance in my cpumask implementation).
However we can imagine a scenario where a CPU goes idle, is flagged as having 
blocked load, then it wakes up and goes through its periodic balance code and 
updates that load. Yet, the flag (or cpumask) won't have been updated.
So I think there could be a discussion on whether the flag should be cleared on 
nohz_balance_exit_idle() if we assume periodic balance now takes care of this. 
It could cause issues if we have a weird scenario where a CPU keeps going 
online/idle but never stays online long enough for a tick though.
Alternatively we could clear that flag when going through the first periodic 
balance after idling, but then that's a bit weird because we're using a nohz 
structure in a non-nohz context.
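
A minimal sketch of the first option (clearing on exit; hypothetical,
not part of the patch below):

void nohz_balance_exit_idle(unsigned int cpu)
{
	if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
		return;

	cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
	atomic_dec(&nohz.nr_cpus);
	/*
	 * Hypothetical: assume the now-busy CPU's periodic balance will
	 * refresh its own blocked load, so stop considering it stale.
	 * Fragile if the CPU re-enters idle before its first tick.
	 */
	cpumask_clear_cpu(cpu, nohz.stale_cpus_mask);
}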


Anyway, I tried to get some profiling done with the cpumask but there's
something wrong with my setup - I would only get nonsense numbers (for both
baseline & my patch) - so I added start/end trace_printks to
_nohz_idle_balance() (see the sketch below). It's ugly, inaccurate and
unorthodox, but it still gives a rough idea of how the cpumask impacts stuff.
I ran 20 iterations of my "nohz update" test case (a task accumulates load, 
goes to sleep, and another always-running task keeps kicking an ILB to decay 
that blocked load) and the time saved by skipping CPUs is in the ballpark of 
20%. Notebook is at [1].
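
The instrumentation was nothing fancier than this kind of thing (a
sketch; the exact format strings don't matter):

static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
			       enum cpu_idle_type idle)
{
	trace_printk("_nohz_idle_balance: start\n");

	/* ... existing body: walk the nohz CPUs, update blocked load ... */

	trace_printk("_nohz_idle_balance: end\n");
	return true;	/* placeholder for the elided body's result */
}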

I'll try to get a proper function profiling working for when Vincent posts his 
"v2".

[1]: https://gist.github.com/valschneider/6f203143bee1e149f24c44e9582a9eff

---
 kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++++++----------------------------------
 1 file changed, 38 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ada92b..8bcf465 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5404,8 +5404,8 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
 
 static struct {
 	cpumask_var_t idle_cpus_mask;
+	cpumask_var_t stale_cpus_mask;
 	atomic_t nr_cpus;
-	atomic_t stats_state;
 	unsigned long next_balance; /* in jiffy units */
 	unsigned long next_stats;
 } nohz ____cacheline_aligned;
@@ -6968,7 +6968,6 @@ enum fbq_type { regular, remote, all };
 #define LBF_DST_PINNED  0x04
 #define LBF_SOME_PINNED	0x08
 #define LBF_NOHZ_STATS	0x10
-#define LBF_NOHZ_AGAIN	0x20
 
 struct lb_env {
 	struct sched_domain *sd;
@@ -7829,25 +7828,25 @@ group_type group_classify(struct sched_group *group,
 	return group_other;
 }
 
-static bool update_nohz_stats(struct rq *rq)
+static void update_nohz_stats(struct rq *rq)
 {
 #ifdef CONFIG_NO_HZ_COMMON
 	unsigned int cpu = rq->cpu;
 
-	if (!rq->has_blocked_load)
-		return false;
-
-	if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
-		return false;
+	if (!cpumask_test_cpu(cpu, nohz.stale_cpus_mask))
+		return;
 
 	if (!time_after(jiffies, rq->last_blocked_load_update_tick))
-		return true;
+		return;
 
 	update_blocked_averages(cpu);
 
-	return rq->has_blocked_load;
+	if (rq->has_blocked_load)
+		return;
+
+	cpumask_clear_cpu(cpu, nohz.stale_cpus_mask);
 #else
-	return false;
+	return;
 #endif
 }
 
@@ -7873,8 +7872,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 

Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-02-01 Thread Vincent Guittot
On 1 February 2018 at 19:10, Peter Zijlstra wrote:
> On Wed, Jan 24, 2018 at 09:25:36AM +0100, Vincent Guittot wrote:
>> @@ -8861,7 +8875,14 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  			update_next_balance(sd, &next_balance);
>>  		rcu_read_unlock();
>>  
>> -		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
>> +		/*
>> +		 * Update blocked idle load if it has not been done for a
>> +		 * while. Try to do it locally before entering idle but kick a
>> +		 * ilb if it takes too much time and might delay next local
>> +		 * wake up
>> +		 */
>> +		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state) &&
>> +		    !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>>  			kick_ilb(NOHZ_STATS_KICK);
>>  
>>  		goto out;
>
> This I really dislike. We're here because avg_idle is _really_ low, we
> really should not then call _nohz_idle_balance().

Yes. In fact I was targeting the case where (this_rq->avg_idle >=
sysctl_sched_migration_cost) and the system is not overloaded.
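
For context, the hunk quoted above sits inside the early bail-out at the
top of idle_balance(), which is taken when avg_idle is low or the system
is not overloaded; a sketch of the surrounding code in that era's fair.c:

	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
	    !this_rq->rd->overload) {
		/*
		 * Too costly to pull anything here: just refresh
		 * next_balance (and, with the patch above, possibly
		 * update the blocked load or kick an ILB) and bail.
		 */
		...
		goto out;
	}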


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-02-01 Thread Peter Zijlstra
On Mon, Jan 29, 2018 at 07:31:07PM +0000, Valentin Schneider wrote:

> But for load update via _nohz_idle_balance(), we iterate through all of the
> nohz CPUS and unconditionally call update_blocked_averages(). This could be
> avoided by remembering which CPUs have stale load before going idle.
> Initially I thought that was what nohz.stats_state was for, but it isn't.
> With Vincent's patches it's only ever set to either 0 or 1, but we could use
> it as a CPU mask, and use it to skip nohz CPUs that don't have stale load in
> _nohz_idle_balance() (when NOHZ_STATS_KICK).

Yes, you'd need to allocate a second cpumask; worse, you'd need atomic
bitops to set and clear bits there.

That all _might_ be worth it... dunno.
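
Concretely, the bitops in question are the cpumask accessors, which wrap
atomic read-modify-writes (from include/linux/cpumask.h; the
double-underscore variants are the non-atomic versions that would not be
safe here):

static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
{
	set_bit(cpumask_check(cpu), cpumask_bits(dstp));	/* atomic RMW */
}

static inline void cpumask_clear_cpu(int cpu, struct cpumask *dstp)
{
	clear_bit(cpumask_check(cpu), cpumask_bits(dstp));	/* atomic RMW */
}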



Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-02-01 Thread Peter Zijlstra
On Wed, Jan 24, 2018 at 09:25:36AM +0100, Vincent Guittot wrote:
> @@ -8861,7 +8875,14 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>  			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
>  
> -		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
> +		/*
> +		 * Update blocked idle load if it has not been done for a
> +		 * while. Try to do it locally before entering idle but kick a
> +		 * ilb if it takes too much time and might delay next local
> +		 * wake up
> +		 */
> +		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state) &&
> +		    !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>  			kick_ilb(NOHZ_STATS_KICK);
>  
>  		goto out;

This I really dislike. We're here because avg_idle is _really_ low, we
really should not then call _nohz_idle_balance().


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-02-01 Thread Vincent Guittot
On 1 February 2018 at 17:57, Peter Zijlstra wrote:
> On Wed, Jan 24, 2018 at 09:25:36AM +0100, Vincent Guittot wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 898785d..ed90303 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>>  	return true;
>>  }
>>
>> +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
>> +{
>> +	if (cfs_rq->avg.load_avg)
>> +		return true;
>> +
>> +	if (cfs_rq->avg.util_avg)
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>
>>  static void update_blocked_averages(int cpu)
>> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
>>  		 */
>>  		if (cfs_rq_is_decayed(cfs_rq))
>>  			list_del_leaf_cfs_rq(cfs_rq);
>> -		else
>> +
>> +		/* Don't need periodic decay once load/util_avg are null */
>> +		if (cfs_rq_has_blocked(cfs_rq))
>>  			done = false;
>>  	}
>>
>> @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
>>  	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>>  #ifdef CONFIG_NO_HZ_COMMON
>>  	rq->last_blocked_load_update_tick = jiffies;
>> -	if (cfs_rq_is_decayed(cfs_rq))
>> +	if (!cfs_rq_has_blocked(cfs_rq))
>>  		rq->has_blocked_load = 0;
>>  #endif
>>  	rq_unlock_irqrestore(rq, &rf);
>
> OK makes sense; would've been even better as a separate patch :-)

Yes, I will make a separate patch for that.


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-02-01 Thread Vincent Guittot
On 1 February 2018 at 17:52, Peter Zijlstra wrote:
> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
>
> Would've probably been easier to read if you'd not included the revert
> of that timer patch...
>
>> @@ -9258,21 +9255,11 @@ void nohz_balance_enter_idle(int cpu)
>>  	set_cpu_sd_state_idle(cpu);
>>
>>  	/*
>> -	 * implies a barrier such that if the stats_state update is observed
>> -	 * the above updates are also visible. Pairs with stuff in
>> -	 * update_sd_lb_stats() and nohz_idle_balance().
>> +	 * Each time a cpu enter idle, we assume that it has blocked load and
>> +	 * enable the periodic update of the load of idle cpus
>>  	 */
>> -	val = atomic_read(&nohz.stats_state);
>> -	do {
>> -		new = val + 2;
>> -		new |= 1;
>> -	} while (!atomic_try_cmpxchg(&nohz.stats_state, &val, new));
>> +	atomic_set(&nohz.stats_state, 1);
>>
>
>> @@ -9422,7 +9408,6 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>  		return false;
>>  	}
>>
>> -	stats_seq = atomic_read(&nohz.stats_state);
>>  	/*
>>  	 * barrier, pairs with nohz_balance_enter_idle(), ensures ...
>>  	 */
>> @@ -9432,6 +9417,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>
>>  	SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>>
>> +	/*
>> +	 * We assume there will be no idle load after this update and clear
>> +	 * the stats state. If a cpu enters idle in the mean time, it will
>> +	 * set the stats state and trig another update of idle load.
>> +	 * Because a cpu that becomes idle, is added to idle_cpus_mask before
>> +	 * setting the stats state, we are sure to not clear the state and not
>> +	 * check the load of an idle cpu.
>> +	 */
>> +	atomic_set(&nohz.stats_state, 0);
>> +
>>  	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
>>  		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>>  			continue;
>> @@ -9441,8 +9436,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>  		 * work being done for other cpus. Next load
>>  		 * balancing owner will pick it up.
>>  		 */
>> -		if (need_resched())
>> +		if (need_resched()) {
>> +			has_blocked_load = true;
>>  			break;
>> +		}
>>
>>  		rq = cpu_rq(balance_cpu);
>>
>> @@ -9477,12 +9474,12 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>  	if (flags & NOHZ_BALANCE_KICK)
>>  		rebalance_domains(this_rq, CPU_IDLE);
>>
>> -	if (has_blocked_load ||
>> -	    !atomic_try_cmpxchg(&nohz.stats_state, &stats_seq, 0)) {
>> -		WRITE_ONCE(nohz.next_stats,
>> -			   now + msecs_to_jiffies(LOAD_AVG_PERIOD));
>> -		mod_timer(&nohz.timer, nohz.next_stats);
>> -	}
>> +	WRITE_ONCE(nohz.next_stats,
>> +		   now + msecs_to_jiffies(LOAD_AVG_PERIOD));
>> +
>> +	/* There is still blocked load, enable periodic update */
>> +	if (has_blocked_load)
>> +		atomic_set(&nohz.stats_state, 1);
>>
>>  	/*
>>  	 * next_balance will be updated only when there is a need.
>
> After this there is no point for stats_state to be atomic anymore. Also
> a better name.

Ok

>
> Maybe if I drop the last two patches (and you re-introduce the bits
> from: Subject: sched: Optimize nohz stats, that you do need) this all
> becomes more readable?

Yes, we can do it like that.


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-02-01 Thread Peter Zijlstra
On Wed, Jan 24, 2018 at 09:25:36AM +0100, Vincent Guittot wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 898785d..ed90303 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  	return true;
>  }
>  
> +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
> +{
> +	if (cfs_rq->avg.load_avg)
> +		return true;
> +
> +	if (cfs_rq->avg.util_avg)
> +		return true;
> +
> +	return false;
> +}
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  
>  static void update_blocked_averages(int cpu)
> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
>  		 */
>  		if (cfs_rq_is_decayed(cfs_rq))
>  			list_del_leaf_cfs_rq(cfs_rq);
> -		else
> +
> +		/* Don't need periodic decay once load/util_avg are null */
> +		if (cfs_rq_has_blocked(cfs_rq))
>  			done = false;
>  	}
>  
> @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
>  	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>  #ifdef CONFIG_NO_HZ_COMMON
>  	rq->last_blocked_load_update_tick = jiffies;
> -	if (cfs_rq_is_decayed(cfs_rq))
> +	if (!cfs_rq_has_blocked(cfs_rq))
>  		rq->has_blocked_load = 0;
>  #endif
>  	rq_unlock_irqrestore(rq, &rf);

OK makes sense; would've been even better as a separate patch :-)


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-02-01 Thread Peter Zijlstra
On Thu, Jan 18, 2018 at 10:38:07AM +0000, Morten Rasmussen wrote:
> It seems pointless to have a timer to update PELT if the system is
> completely idle, and when it isn't we can piggy back other events to
> make the updates happen.

Only if we do that update before making decisions based on the values.
The thing I was bothered by in the earlier patches was that wakeup would
use whatever current value and async kick something to go update.

That just seems wrong.


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-02-01 Thread Peter Zijlstra
On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:

Would've probably been easier to read if you'd not included the revert
of that timer patch...

> @@ -9258,21 +9255,11 @@ void nohz_balance_enter_idle(int cpu)
>  	set_cpu_sd_state_idle(cpu);
>  
>  	/*
> -	 * implies a barrier such that if the stats_state update is observed
> -	 * the above updates are also visible. Pairs with stuff in
> -	 * update_sd_lb_stats() and nohz_idle_balance().
> +	 * Each time a cpu enter idle, we assume that it has blocked load and
> +	 * enable the periodic update of the load of idle cpus
>  	 */
> -	val = atomic_read(&nohz.stats_state);
> -	do {
> -		new = val + 2;
> -		new |= 1;
> -	} while (!atomic_try_cmpxchg(&nohz.stats_state, &val, new));
> +	atomic_set(&nohz.stats_state, 1);
>  

> @@ -9422,7 +9408,6 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  		return false;
>  	}
>  
> -	stats_seq = atomic_read(&nohz.stats_state);
>  	/*
>  	 * barrier, pairs with nohz_balance_enter_idle(), ensures ...
>  	 */
> @@ -9432,6 +9417,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  
>  	SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>  
> +	/*
> +	 * We assume there will be no idle load after this update and clear
> +	 * the stats state. If a cpu enters idle in the mean time, it will
> +	 * set the stats state and trig another update of idle load.
> +	 * Because a cpu that becomes idle, is added to idle_cpus_mask before
> +	 * setting the stats state, we are sure to not clear the state and not
> +	 * check the load of an idle cpu.
> +	 */
> +	atomic_set(&nohz.stats_state, 0);
> +
>  	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
>  		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>  			continue;
> @@ -9441,8 +9436,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  		 * work being done for other cpus. Next load
>  		 * balancing owner will pick it up.
>  		 */
> -		if (need_resched())
> +		if (need_resched()) {
> +			has_blocked_load = true;
>  			break;
> +		}
>  
>  		rq = cpu_rq(balance_cpu);
>  
> @@ -9477,12 +9474,12 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  	if (flags & NOHZ_BALANCE_KICK)
>  		rebalance_domains(this_rq, CPU_IDLE);
>  
> -	if (has_blocked_load ||
> -	    !atomic_try_cmpxchg(&nohz.stats_state, &stats_seq, 0)) {
> -		WRITE_ONCE(nohz.next_stats,
> -			   now + msecs_to_jiffies(LOAD_AVG_PERIOD));
> -		mod_timer(&nohz.timer, nohz.next_stats);
> -	}
> +	WRITE_ONCE(nohz.next_stats,
> +		   now + msecs_to_jiffies(LOAD_AVG_PERIOD));
> +
> +	/* There is still blocked load, enable periodic update */
> +	if (has_blocked_load)
> +		atomic_set(&nohz.stats_state, 1);
>  
>  	/*
>  	 * next_balance will be updated only when there is a need.

After this there is no point for stats_state to be atomic anymore. Also
a better name.
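
With the cmpxchg gone the flag only ever toggles between 0 and 1, so a
plain int with a clearer name would do; one possible shape of the
cleanup (a sketch, name included, not the posted code):

static struct {
	cpumask_var_t idle_cpus_mask;
	atomic_t nr_cpus;
	int has_blocked;		/* idle CPUs may carry blocked load */
	unsigned long next_balance;	/* in jiffy units */
	unsigned long next_stats;
} nohz ____cacheline_aligned;

/* writers and readers then pair through WRITE_ONCE()/READ_ONCE(): */
WRITE_ONCE(nohz.has_blocked, 1);	/* in nohz_balance_enter_idle() */
if (READ_ONCE(nohz.has_blocked))	/* in nohz_balancer_kick() etc. */
	kick_ilb(NOHZ_STATS_KICK);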

Maybe if I drop the last two patches (and you re-introduce the bits
from: Subject: sched: Optimize nohz stats, that you do need) this all
becomes more readable?


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-30 Thread Vincent Guittot
On 30 January 2018 at 12:41, Valentin Schneider wrote:
> (Resending because I snuck in some HTML... Apologies)
>
> On 01/30/2018 08:32 AM, Vincent Guittot wrote:
>>
>> On 29 January 2018 at 20:31, Valentin Schneider wrote:
>>>
>>> Hi Vincent, Peter,
>>>
>>> I've been running some tests on your patches (Peter's base + the 2 from
>>> Vincent). The results themselves are hosted at [1].
>>> The base of those tests is the same: a task ("accumulator") is run for 5
>>> seconds (arbitrary value) to accumulate some load, then goes to sleep
>>> for .5 seconds.
>>>
>>> I've set up 3 test scenarios:
>>>
>>> Update by nohz_balance_kick()
>>> -----------------------------
>>> Right before the "accumulator" task goes to sleep, a CPU-hogging task
>>> (100% utilization) is spawned on another CPU. It won't go idle, so the
>>> only way to update the blocked load generated by "accumulator" is to
>>> kick an ILB (NOHZ_STATS_KICK).
>>>
>>> The test shows that this is behaving nicely - we keep kicking an ILB
>>> every ~36ms (see next test for comments on that) until there is no more
>>> blocked load. I did however notice some interesting scenarios: after
>>> the load has been fully decayed, a tiny background task can spawn and
>>> end in less than a scheduling period. However, it still goes through
>>> nohz_balance_enter_idle(), and thus sets nohz.stats_state, which will
>>> later cause an ILB kick.
>>>
>>> This makes me wonder if it's worth kicking ILBs for such tiny load
>>> values - perhaps it could be worth having a margin to set
>>> rq->has_blocked_load?
>>
>>
>> So it's difficult to know what the load/utilization on the cfs_rq will
>> be once the cpu wakes up. Even if it's for a really short time, that
>> doesn't mean that the load/utilization is small, because it can be the
>> migration of a big task that just has a very short wake up this time.
>> That's why I don't make any assumption on the utilization/load value
>> when a cpu goes to sleep.
>>
>
> Right, hadn't thought about those kind of migrations.
>
>>>
>>> Furthermore, this tiny task will cause the ILB to iterate over all of
>>> the idle CPUs, although only one has stale load. For load update via
>>> NEWLY_IDLE load_balance() we use:
>>>
>>> static bool update_nohz_stats(struct rq *rq)
>>> {
>>> 	if (!rq->has_blocked_load)
>>> 		return false;
>>> 	[...]
>>> }
>>>
>>> But for load update via _nohz_idle_balance(), we iterate through all
>>> of the nohz CPUs and unconditionally call update_blocked_averages().
>>> This could be avoided by remembering which CPUs have stale load before
>>> going idle. Initially I thought that was what nohz.stats_state was
>>> for, but it isn't. With Vincent's patches it's only ever set to either
>>> 0 or 1, but we could use it as a CPU mask, and use it to skip nohz
>>> CPUs that don't have stale load in _nohz_idle_balance() (when
>>> NOHZ_STATS_KICK).
>>
>>
>> I have studied a way to keep track of how many cpus still have blocked
>> load, to try to minimize the number of useless ilb kicks, but this adds
>> more atomic operations, which can impact system throughput under heavy
>> load with a lot of very small wake ups. That's why I have proposed this
>> simpler solution. But it's probably just a matter of where we want to
>> "waste" time. Either we accept to spend a bit more time to check the
>> state of idle CPUs, or we accept to kick an ilb from time to time for
>> no good reason.
>>
>
> Agreed. I have the feeling that spending more time doing atomic ops could be
> worth it - I'll try to test this out and see if it's actually relevant.
>
>>>
>>> Update by idle_balance()
>>> ------------------------
>>> Right before the "accumulator" task goes to sleep, a tiny periodic
>>> (period=32ms) task is spawned on another CPU. It's expected that it
>>> will update the blocked load in idle_balance(), either by running
>>> _nohz_idle_balance() locally or kicking an ILB (the overload flag
>>> shouldn't be set in this test case, so we shouldn't go through the
>>> NEWLY_IDLE load_balance()).
>>>
>>> This also seems to be working fine, but I'm noticing a delay between
>>> load updates that is closer to 64ms than 32ms. After digging into it I
>>> found out that the time checks done in idle_balance() and
>>> nohz_balancer_kick() are time_after(jiffies, next_stats), but IMHO they
>>> should be time_after_eq(jiffies, next_stats) to have 32ms-based
>>> updates. This also explains the 36ms periodicity of the updates in the
>>> test above.
>>
>>
>> I have used 32ms as a minimum value between updates. We must use
>> time_after() if we want to have at least 32ms between each update. We
>> will have a 36ms period if the previous update was triggered by the
>> tick (just after it, in fact), but there will be only 32ms if the last
>> update was done during an idle_balance that happens just before the
>> tick. With time_after_eq, the update period will be between 28 and 32ms.
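
For reference, the two comparisons differ only in whether equality fires
(simplified from include/linux/jiffies.h, with the typecheck() wrapper
dropped):

#define time_after(a, b)	((long)((b) - (a)) < 0)		/* strictly after b */
#define time_after_eq(a, b)	((long)((a) - (b)) >= 0)	/* at or after b */

/*
 * With next_stats = last_update + msecs_to_jiffies(32) and HZ=250
 * (4ms ticks), time_after() can only fire on the tick after next_stats,
 * hence update periods in ]32, 36] ms; time_after_eq() would fire on
 * the 32ms tick itself.
 */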

Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-30 Thread Valentin Schneider

(Resending because I snuck in some HTML... Apologies)

On 01/30/2018 08:32 AM, Vincent Guittot wrote:

On 29 January 2018 at 20:31, Valentin Schneider wrote:

Hi Vincent, Peter,

I've been running some tests on your patches (Peter's base + the 2 from
Vincent). The results themselves are hosted at [1].
The base of those tests is the same: a task ("accumulator") is run for 5
seconds (arbitrary value) to accumulate some load, then goes to sleep for
.5 seconds.

I've set up 3 test scenarios:

Update by nohz_balance_kick()
-----------------------------
Right before the "accumulator" task goes to sleep, a CPU-hogging task (100%
utilization) is spawned on another CPU. It won't go idle so the only way to
update the blocked load generated by "accumulator" is to kick an ILB
(NOHZ_STATS_KICK).

The test shows that this is behaving nicely - we keep kicking an ILB every
~36ms (see next test for comments on that) until there is no more blocked
load. I did however notice some interesting scenarios: after the load has
been fully decayed, a tiny background task can spawn and end in less than a
scheduling period. However, it still goes through nohz_balance_enter_idle(),
and thus sets nohz.stats_state, which will later cause an ILB kick.

This makes me wonder if it's worth kicking ILBs for such tiny load values -
perhaps it could be worth having a margin to set rq->has_blocked_load ?
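
The margin idea would presumably hang off the has-blocked test; a purely
hypothetical sketch (BLOCKED_LOAD_MARGIN is invented here, and see the
objection just below):

/* Hypothetical: treat residual load below some threshold as decayed. */
static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
{
	return cfs_rq->avg.load_avg > BLOCKED_LOAD_MARGIN ||
	       cfs_rq->avg.util_avg > BLOCKED_LOAD_MARGIN;
}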


So it's difficult to know what the load/utilization on the cfs_rq will
be once the cpu wakes up. Even if it's for a really short time, that
doesn't mean that the load/utilization is small, because it can be the
migration of a big task that just has a very short wake up this time.
That's why I don't make any assumption on the utilization/load value
when a cpu goes to sleep.



Right, hadn't thought about those kind of migrations.



Furthermore, this tiny task will cause the ILB to iterate over all of the
idle CPUs, although only one has stale load. For load update via NEWLY_IDLE
load_balance() we use:

static bool update_nohz_stats(struct rq *rq)
{
	if (!rq->has_blocked_load)
		return false;
	[...]
}

But for load update via _nohz_idle_balance(), we iterate through all of the
nohz CPUS and unconditionally call update_blocked_averages(). This could be
avoided by remembering which CPUs have stale load before going idle.
Initially I thought that was what nohz.stats_state was for, but it isn't.
With Vincent's patches it's only ever set to either 0 or 1, but we could use
it as a CPU mask, and use it to skip nohz CPUs that don't have stale load in
_nohz_idle_balance() (when NOHZ_STATS_KICK).


I have studied a way to keep track of how many cpus still have blocked
load, to try to minimize the number of useless ilb kicks, but this adds
more atomic operations, which can impact system throughput under heavy
load with a lot of very small wake ups. That's why I have proposed this
simpler solution. But it's probably just a matter of where we want to
"waste" time. Either we accept to spend a bit more time to check the
state of idle CPUs, or we accept to kick an ilb from time to time for
no good reason.



Agreed. I have the feeling that spending more time doing atomic ops 
could be worth it - I'll try to test this out and see if it's actually 
relevant.




Update by idle_balance()
------------------------
Right before the "accumulator" task goes to sleep, a tiny periodic
(period=32ms) task is spawned on another CPU. It's expected that it will
update the blocked load in idle_balance(), either by running
_nohz_idle_balance() locally or kicking an ILB (The overload flag shouldn't
be set in this test case, so we shouldn't go through the NEWLY_IDLE
load_balance()).

This also seems to be working fine, but I'm noticing a delay between load
updates that is closer to 64ms than 32ms. After digging into it I found out
that the time checks done in idle_balance() and nohz_balancer_kick() are
time_after(jiffies, next_stats), but IMHO they should be
time_after_eq(jiffies, next_stats) to have 32ms-based updates. This also
explains the 36ms periodicity of the updates in the test above.


I have used 32ms as a minimum value between updates. We must use
time_after() if we want to have at least 32ms between each update. We
will have a 36ms period if the previous update was triggered by the
tick (just after it, in fact), but there will be only 32ms if the last
update was done during an idle_balance that happens just before the
tick. With time_after_eq, the update period will be between 28 and
32ms.

Then, I mentioned a possible optimization: using time_after_eq() in
idle_balance(), so a newly idle cpu will have a better chance (between 0
and 4ms for HZ=250) to do the update before an ilb is kicked.



IIUC with time_after() the update period should be within ]32, 36] ms, 
but it looks like I'm always on that upper bound in my tests.


When evaluating whether we need to kick_ilb() for load updates, we'll 
always be right after the tick 

Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-30 Thread Vincent Guittot
On 29 January 2018 at 20:31, Valentin Schneider
 wrote:
> Hi Vincent, Peter,
>
> I've been running some tests on your patches (Peter's base + the 2 from
> Vincent). The results themselves are hosted at [1].
> The base of those tests is the same: a task ("accumulator") is run for 5
> seconds (arbitrary value) to accumulate some load, then goes to sleep for .5
> seconds.
>
> I've set up 3 test scenarios:
>
> Update by nohz_balance_kick()
> -
> Right before the "accumulator" task goes to sleep, a CPU-hogging task (100%
> utilization) is spawned on another CPU. It won't go idle so the only way to
> update the blocked load generated by "accumulator" is to kick an ILB
> (NOHZ_STATS_KICK).
>
> The test shows that this is behaving nicely - we keep kicking an ILB every
> ~36ms (see next test for comments on that) until there is no more blocked
> load. I did however notice some interesting scenarios: after the load has
> been fully decayed, a tiny background task can spawn and end in less than a
> scheduling period. However, it still goes through nohz_balance_enter_idle(),
> and thus sets nohz.stats_state, which will later cause an ILB kick.
>
> This makes me wonder if it's worth kicking ILBs for such tiny load values -
> perhaps it could be worth having a margin to set rq->has_blocked_load ?

So it's difficult to know what the load/utilization on the cfs_rq will be
once the cpu wakes up. Even if the cpu runs for a really short time, that
doesn't mean that the load/utilization is small, because it can be the
migration of a big task that just happens to have a very short wakeup this
time. That's why I don't make any assumption about the utilization/load
value when a cpu goes to sleep.

>
> Furthermore, this tiny task will cause the ILB to iterate over all of the
> idle CPUs, although only one has stale load. For load update via NEWLY_IDLE
> load_balance() we use:
>
> static bool update_nohz_stats(struct rq *rq)
> {
> if (!rq->has_blocked_load)
>  return false;
> [...]
> }
>
> But for load update via _nohz_idle_balance(), we iterate through all of the
> nohz CPUS and unconditionally call update_blocked_averages(). This could be
> avoided by remembering which CPUs have stale load before going idle.
> Initially I thought that was what nohz.stats_state was for, but it isn't.
> With Vincent's patches it's only ever set to either 0 or 1, but we could use
> it as a CPU mask, and use it to skip nohz CPUs that don't have stale load in
> _nohz_idle_balance() (when NOHZ_STATS_KICK).

I have studied a way to keep track of how many cpus still have blocked
load, to try to minimize the number of useless ilb kicks, but this adds more
atomic operations, which can impact the system throughput under heavy load
with lots of very small wakeups. That's why I have proposed this solution,
which is simpler. But it's probably just a matter of where we want to
"waste" time: either we accept spending a bit more time checking the state
of idle CPUs, or we accept kicking an ilb from time to time for no good
reason.

>
> Update by idle_balance()
> 
> Right before the "accumulator" task goes to sleep, a tiny periodic
> (period=32ms) task is spawned on another CPU. It's expected that it will
> update the blocked load in idle_balance(), either by running
> _nohz_idle_balance() locally or kicking an ILB (The overload flag shouldn't
> be set in this test case, so we shouldn't go through the NEWLY_IDLE
> load_balance()).
>
> This also seems to be working fine, but I'm noticing a delay between load
> updates that is closer to 64ms than 32ms. After digging into it I found out
> that the time checks done in idle_balance() and nohz_balancer_kick() are
> time_after(jiffies, next_stats), but IMHO they should be
> time_after_eq(jiffies, next_stats) to have 32ms-based updates. This also
> explains the 36ms periodicity of the updates in the test above.

I use 32ms as a minimum interval between updates. We must use time_after()
if we want at least 32ms between each update. We will have a 36ms period if
the previous update was triggered by the tick (just after it, in fact), but
only 32ms if the last update was done during an idle_balance that happened
just before the tick. With time_after_eq(), the update period would be
between 28 and 32ms.

Then, I mentioned a possible optimization: using time_after_eq() in
idle_balance(), so that a newly idle cpu has a better chance (a window of 0
to 4ms at HZ=250) to do the update before an ilb is kicked.

Thanks,
Vincent

>
>
> No update (idle system)
> ---
> Nothing special here, just making sure nothing happens when the system is
> fully idle. On a sidenote, that's relatively hard to achieve - I had to
> switch over to Juno because my HiKey960 gets interrupts every 16ms. The Juno
> still gets woken up every now and then but it's a bit quieter.
>
>
> [1]: https://gist.github.com/valschneider/a8da7bb8e11fb1ec63a419710f56c0a0
>
>

Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-30 Thread Vincent Guittot
On 29 January 2018 at 19:43, Dietmar Eggemann  wrote:
> On 01/24/2018 09:25 AM, Vincent Guittot wrote:
>>
>> Hi,
>>
>> On Thursday 18 Jan 2018 at 10:38:07 (+), Morten Rasmussen wrote:
>>>
>>> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:

 On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:
>
>
> [...]
>
>

 Hi Peter,

 With the patch below on top of your branch, the blocked loads are
 updated and
 decayed regularly. The main differences are:
 - It doesn't use a timer to trig ilb but the tick and when a cpu becomes
 idle.
The main drawback of this solution is that the load is blocked when
 the
system is fully idle with the advantage of not waking up a fully idle
system. We have to wait for the next tick or newly idle event for
 updating
blocked load when the system leaves idle stat which can be up to a
 tick long.
If this is too long, we can check for kicking ilb when task wakes up
 so the
blocked load will be updated as soon as the system leaves idle state.
The main advantage is that we don't wake up a fully idle system every
 32ms to
update blocked load that will be not used.
 - I'm working on one more improvement to use nohz_idle_balance in the
 newly
idle case when the system is not overloaded and
(this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we
 can try to
use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it
 exceed
this_rq->avg_idle. This will remove some calls to kick_ilb and some
 wake up
of an idle cpus.
>>>
>>>
>>> This sound like what I meant in my other reply :-)
>>>
>>> It seems pointless to have a timer to update PELT if the system is
>>> completely idle, and when it isn't we can piggy back other events to
>>> make the updates happen.
>>
>>
>> The patch below implements what has been described above. It calls part of
>> nohz_idle_balance when a cpu becomes idle and kick a ilb if it takes too
>> much
>> time. This removes part of ilb that are kicked on an idle cpu for updating
>> the blocked load but the ratio really depends on when the tick happens
>> compared
>> to a cpu becoming idle and the 32ms boundary. I have an additionnal patch
>> that
>> enables to update the blocked loads when a cpu becomes idle 1 period
>> before
>> kicking an ilb and there is far less ilb because we give more chance to
>> the
>> newly idle case (time_after is replaced by time_after_eq in
>> idle_balance()).
>>
>> The patch also uses a function cfs_rq_has_blocked, which only checks the
>> util/load_avg, instead of the cfs_rq_is_decayed which check *_sum too.
>> This
>> reduce significantly the number of update of blocked load. the *_avg will
>> be
>> fully decayed in around 300~400ms but it's far longer for the *_sum which
>> have
>> a higher resolution and we can easily reach almost seconds. But only the
>> *_avg
>> are used to make decision so keeping some blocked *_sum is acceptable.
>>
>> ---
>>   kernel/sched/fair.c | 121
>> +++-
>>   1 file changed, 92 insertions(+), 29 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 898785d..ed90303 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq
>> *cfs_rq)
>> return true;
>>   }
>>   +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
>> +{
>> +   if (cfs_rq->avg.load_avg)
>> +   return true;
>> +
>> +   if (cfs_rq->avg.util_avg)
>> +   return true;
>> +
>> +   return false;
>> +}
>> +
>
>
> Can we not change cfs_rq_is_decayed() to use avg.foo_avg instead of
> avg.foo_sum ?

I don't think so, because the *_sum are used to keep coherency between
cfs_rq and cgroups when tasks migrate and are enqueued/dequeued, so we can't
remove them until the *_sum are null; otherwise the cfs_rq and the group
will be out of sync.

>
>>   #ifdef CONFIG_FAIR_GROUP_SCHED
>> static void update_blocked_averages(int cpu)
>> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
>>  */
>> if (cfs_rq_is_decayed(cfs_rq))
>> list_del_leaf_cfs_rq(cfs_rq);
>> -   else
>> +
>> +   /* Don't need periodic decay once load/util_avg are null
>> */
>> +   if (cfs_rq_has_blocked(cfs_rq))
>> done = false;
>> }
>>   @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int
>> cpu)
>> update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>>   #ifdef CONFIG_NO_HZ_COMMON
>> rq->last_blocked_load_update_tick = jiffies;
>> -   if (cfs_rq_is_decayed(cfs_rq))
>> +   if (cfs_rq_has_blocked(cfs_rq))
>
>
> Shouldn't this be !cfs_rq_has_blocked(cfs_rq) ?

Yes. I

Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-29 Thread Valentin Schneider

Hi Vincent, Peter,

I've been running some tests on your patches (Peter's base + the 2 from 
Vincent). The results themselves are hosted at [1].
The base of those tests is the same: a task ("accumulator") is run for 5 
seconds (arbitrary value) to accumulate some load, then goes to sleep 
for .5 seconds.


I've set up 3 test scenarios:

Update by nohz_balance_kick()
-----------------------------
Right before the "accumulator" task goes to sleep, a CPU-hogging task 
(100% utilization) is spawned on another CPU. It won't go idle so the 
only way to update the blocked load generated by "accumulator" is to 
kick an ILB (NOHZ_STATS_KICK).


The test shows that this is behaving nicely - we keep kicking an ILB 
every ~36ms (see next test for comments on that) until there is no more 
blocked load. I did however notice some interesting scenarios: after the 
load has been fully decayed, a tiny background task can spawn and end in 
less than a scheduling period. However, it still goes through 
nohz_balance_enter_idle(), and thus sets nohz.stats_state, which will 
later cause an ILB kick.


This makes me wonder if it's worth kicking ILBs for such tiny load 
values - perhaps it could be worth having a margin to set 
rq->has_blocked_load ?


Furthermore, this tiny task will cause the ILB to iterate over all of 
the idle CPUs, although only one has stale load. For load update via 
NEWLY_IDLE load_balance() we use:


static bool update_nohz_stats(struct rq *rq)
{
	if (!rq->has_blocked_load)
		return false;
	[...]
}

But for load update via _nohz_idle_balance(), we iterate through all of 
the nohz CPUs and unconditionally call update_blocked_averages(). This 
could be avoided by remembering which CPUs have stale load before going 
idle. Initially I thought that was what nohz.stats_state was for, but it 
isn't.
With Vincent's patches it's only ever set to either 0 or 1, but we could 
use it as a CPU mask, and use it to skip nohz CPUs that don't have stale 
load in _nohz_idle_balance() (when NOHZ_STATS_KICK).


Update by idle_balance()

Right before the "accumulator" task goes to sleep, a tiny periodic 
(period=32ms) task is spawned on another CPU. It's expected that it will 
update the blocked load in idle_balance(), either by running 
_nohz_idle_balance() locally or kicking an ILB (The overload flag 
shouldn't be set in this test case, so we shouldn't go through the 
NEWLY_IDLE load_balance()).


This also seems to be working fine, but I'm noticing a delay between 
load updates that is closer to 64ms than 32ms. After digging into it I 
found out that the time checks done in idle_balance() and 
nohz_balancer_kick() are time_after(jiffies, next_stats), but IMHO they 
should be time_after_eq(jiffies, next_stats) to have 32ms-based updates. 
This also explains the 36ms periodicity of the updates in the test above.



No update (idle system)
-----------------------
Nothing special here, just making sure nothing happens when the system 
is fully idle. On a sidenote, that's relatively hard to achieve - I had 
to switch over to Juno because my HiKey960 gets interrupts every 16ms. 
The Juno still gets woken up every now and then but it's a bit quieter.



[1]: https://gist.github.com/valschneider/a8da7bb8e11fb1ec63a419710f56c0a0


On 01/24/2018 08:25 AM, Vincent Guittot wrote:

Hi,

On Thursday 18 Jan 2018 at 10:38:07 (+), Morten Rasmussen wrote:

On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:

On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:

Hi Peter,

On 22 December 2017 at 21:42, Peter Zijlstra  wrote:

On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:

Right; but I figured we'd try and do it 'right' and see how horrible it
is before we try and do funny things.

So now it should have a 32ms tick for up to .5s when the system goes
completely idle.

No idea how bad that is..

I have tested your branch but the timer doesn't seem to fire correctly
because i can still see blocked load in the use case i have run.
I haven't found the reason yet

Hi Peter,

With the patch below on top of your branch, the blocked loads are updated and
decayed regularly. The main differences are:
- It doesn't use a timer to trig ilb but the tick and when a cpu becomes idle.
   The main drawback of this solution is that the load is blocked when the
   system is fully idle with the advantage of not waking up a fully idle
   system. We have to wait for the next tick or newly idle event for updating
   blocked load when the system leaves idle stat which can be up to a tick long.
   If this is too long, we can check for kicking ilb when task wakes up so the
   blocked load will be updated as soon as the system leaves idle state.
   The main advantage is that we don't wake up a fully idle system every 32ms to
   update blocked load that will be not used.
- I'm working on one more improvement to use nohz_idle_balance in the newly
   idle case 

Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-29 Thread Dietmar Eggemann

On 01/24/2018 09:25 AM, Vincent Guittot wrote:

Hi,

On Thursday 18 Jan 2018 at 10:38:07 (+), Morten Rasmussen wrote:

On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:

On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:


[...]



Hi Peter,

With the patch below on top of your branch, the blocked loads are updated and
decayed regularly. The main differences are:
- It doesn't use a timer to trig ilb but the tick and when a cpu becomes idle.
   The main drawback of this solution is that the load is blocked when the
   system is fully idle with the advantage of not waking up a fully idle
   system. We have to wait for the next tick or newly idle event for updating
   blocked load when the system leaves idle stat which can be up to a tick long.
   If this is too long, we can check for kicking ilb when task wakes up so the
   blocked load will be updated as soon as the system leaves idle state.
   The main advantage is that we don't wake up a fully idle system every 32ms to
   update blocked load that will be not used.
- I'm working on one more improvement to use nohz_idle_balance in the newly
   idle case when the system is not overloaded and
   (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can try 
to
   use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it exceed
   this_rq->avg_idle. This will remove some calls to kick_ilb and some wake up
   of an idle cpus.


This sound like what I meant in my other reply :-)

It seems pointless to have a timer to update PELT if the system is
completely idle, and when it isn't we can piggy back other events to
make the updates happen.


The patch below implements what has been described above. It calls part of
nohz_idle_balance() when a cpu becomes idle and kicks an ilb if that takes
too much time. This removes part of the ilbs that are kicked on an idle cpu
to update the blocked load, but the ratio really depends on when the tick
happens relative to a cpu becoming idle and the 32ms boundary. I have an
additional patch that updates the blocked loads when a cpu becomes idle one
period before kicking an ilb; with it there are far fewer ilbs, because we
give more chances to the newly idle case (time_after is replaced by
time_after_eq in idle_balance()).

The patch also uses a function cfs_rq_has_blocked(), which only checks the
util/load_avg, instead of cfs_rq_is_decayed(), which checks the *_sum too.
This significantly reduces the number of blocked-load updates: the *_avg
will be fully decayed in around 300~400ms, but it takes far longer for the
*_sum, which have a higher resolution and can easily take almost seconds.
But only the *_avg are used to make decisions, so keeping some blocked
*_sum is acceptable.

---
  kernel/sched/fair.c | 121 +++-
  1 file changed, 92 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 898785d..ed90303 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq 
*cfs_rq)
return true;
  }
  
+static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)

+{
+   if (cfs_rq->avg.load_avg)
+   return true;
+
+   if (cfs_rq->avg.util_avg)
+   return true;
+
+   return false;
+}
+


Can we not change cfs_rq_is_decayed() to use avg.foo_avg instead of 
avg.foo_sum ?



  #ifdef CONFIG_FAIR_GROUP_SCHED
  
  static void update_blocked_averages(int cpu)

@@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
 */
if (cfs_rq_is_decayed(cfs_rq))
list_del_leaf_cfs_rq(cfs_rq);
-   else
+
+   /* Don't need periodic decay once load/util_avg are null */
+   if (cfs_rq_has_blocked(cfs_rq))
done = false;
}
  
@@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)

update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
  #ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
-   if (cfs_rq_is_decayed(cfs_rq))
+   if (cfs_rq_has_blocked(cfs_rq))


Shouldn't this be !cfs_rq_has_blocked(cfs_rq) ?


rq->has_blocked_load = 0;
  #endif
rq_unlock_irqrestore(rq, &rf);


[...]


@@ -9438,7 +9451,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum 
cpu_idle_type idle)
 */
if (need_resched()) {
has_blocked_load = true;
-   break;
+   goto abort;
+   }
+
+   /*
+* If the update is done while CPU becomes idle, we abort
+* the update when its cost is higher than the average idle
+* time in order to not delay a possible wake up.
+*/
+   if (idle == CPU_NEWLY_IDLE && this_rq->avg_idle < curr_cost) {
+  

Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-24 Thread Vincent Guittot
Hi,

On Thursday 18 Jan 2018 at 10:38:07 (+), Morten Rasmussen wrote:
> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
> > On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:
> > > Hi Peter,
> > > 
> > > On 22 December 2017 at 21:42, Peter Zijlstra  wrote:
> > > > On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
> > > >> Right; but I figured we'd try and do it 'right' and see how horrible it
> > > >> is before we try and do funny things.
> > > >
> > > > So now it should have a 32ms tick for up to .5s when the system goes
> > > > completely idle.
> > > >
> > > > No idea how bad that is..
> > > 
> > > I have tested your branch but the timer doesn't seem to fire correctly
> > > because i can still see blocked load in the use case i have run.
> > > I haven't found the reason yet
> > 
> > Hi Peter,
> > 
> > With the patch below on top of your branch, the blocked loads are updated 
> > and
> > decayed regularly. The main differences are:
> > - It doesn't use a timer to trig ilb but the tick and when a cpu becomes 
> > idle.
> >   The main drawback of this solution is that the load is blocked when the
> >   system is fully idle with the advantage of not waking up a fully idle
> >   system. We have to wait for the next tick or newly idle event for updating
> >   blocked load when the system leaves idle stat which can be up to a tick 
> > long.
> >   If this is too long, we can check for kicking ilb when task wakes up so 
> > the
> >   blocked load will be updated as soon as the system leaves idle state.
> >   The main advantage is that we don't wake up a fully idle system every 
> > 32ms to
> >   update blocked load that will be not used.
> > - I'm working on one more improvement to use nohz_idle_balance in the newly
> >   idle case when the system is not overloaded and 
> >   (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can 
> > try to
> >   use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it exceed
> >   this_rq->avg_idle. This will remove some calls to kick_ilb and some wake 
> > up
> >   of an idle cpus.
> 
> This sound like what I meant in my other reply :-)
> 
> It seems pointless to have a timer to update PELT if the system is
> completely idle, and when it isn't we can piggy back other events to
> make the updates happen.

The patch below implements what has been described above. It calls part of
nohz_idle_balance() when a cpu becomes idle and kicks an ilb if that takes
too much time. This removes part of the ilbs that are kicked on an idle cpu
to update the blocked load, but the ratio really depends on when the tick
happens relative to a cpu becoming idle and the 32ms boundary. I have an
additional patch that updates the blocked loads when a cpu becomes idle one
period before kicking an ilb; with it there are far fewer ilbs, because we
give more chances to the newly idle case (time_after is replaced by
time_after_eq in idle_balance()).

The patch also uses a function cfs_rq_has_blocked(), which only checks the
util/load_avg, instead of cfs_rq_is_decayed(), which checks the *_sum too.
This significantly reduces the number of blocked-load updates: the *_avg
will be fully decayed in around 300~400ms, but it takes far longer for the
*_sum, which have a higher resolution and can easily take almost seconds.
But only the *_avg are used to make decisions, so keeping some blocked
*_sum is acceptable.
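
To put rough numbers on the 300~400ms claim, here is a standalone
back-of-envelope check (not kernel code): PELT decays by y per ms with
y^32 = 1/2, so a maximal *_avg of ~1024 takes about 32 * log2(1024) = 320ms
to drop below its unit resolution, while a *_sum of ~47742 (LOAD_AVG_MAX)
takes about 500ms, and longer still once the sums are weight-scaled:

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* PELT: halves every 32ms */
	int ms_avg = 0, ms_sum = 0;

	for (double v = 1024.0; v >= 1.0; v *= y)	/* maximal *_avg */
		ms_avg++;
	for (double v = 47742.0; v >= 1.0; v *= y)	/* ~LOAD_AVG_MAX */
		ms_sum++;

	/* Prints roughly 321 and 498. */
	printf("*_avg decayed after ~%d ms, *_sum after ~%d ms\n",
	       ms_avg, ms_sum);
	return 0;
}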

---
 kernel/sched/fair.c | 121 +++-
 1 file changed, 92 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 898785d..ed90303 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq 
*cfs_rq)
return true;
 }
 
+static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
+{
+   if (cfs_rq->avg.load_avg)
+   return true;
+
+   if (cfs_rq->avg.util_avg)
+   return true;
+
+   return false;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
 static void update_blocked_averages(int cpu)
@@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
 */
if (cfs_rq_is_decayed(cfs_rq))
list_del_leaf_cfs_rq(cfs_rq);
-   else
+
+   /* Don't need periodic decay once load/util_avg are null */
+   if (cfs_rq_has_blocked(cfs_rq))
done = false;
}
 
@@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
 #ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
-   if (cfs_rq_is_decayed(cfs_rq))
+   if (cfs_rq_has_blocked(cfs_rq))
rq->has_blocked_load = 0;
 #endif
rq_unlock_irqrestore(rq, &rf);
@@ -8818,6 +8831,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)

Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-22 Thread Vincent Guittot
On 22 January 2018 at 10:40, Dietmar Eggemann  wrote:
> On 01/15/2018 08:26 AM, Vincent Guittot wrote:
>>
>> On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:
>>>
>>> Hi Peter,
>>>
>>> On 22 December 2017 at 21:42, Peter Zijlstra 
>>> wrote:

 On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
>
> Right; but I figured we'd try and do it 'right' and see how horrible it
> is before we try and do funny things.


 So now it should have a 32ms tick for up to .5s when the system goes
 completely idle.

 No idea how bad that is..
>>>
>>>
>>> I have tested your branch but the timer doesn't seem to fire correctly
>>> because i can still see blocked load in the use case i have run.
>>> I haven't found the reason yet
>
>
> Isn't the issue with the timer based implementation that
> rq->has_blocked_load is never set to 1 ?

Yes that's what I suggested to Peter on IRC

>
> Something you changed in your implementation by adding a
> rq->has_blocked_load = 1 into nohz_balance_enter_idle().
>
>
>>
>> Hi Peter,
>>
>> With the patch below on top of your branch, the blocked loads are updated
>> and
>> decayed regularly. The main differences are:
>> - It doesn't use a timer to trig ilb but the tick and when a cpu becomes
>> idle.
>>The main drawback of this solution is that the load is blocked when the
>>system is fully idle with the advantage of not waking up a fully idle
>>system. We have to wait for the next tick or newly idle event for
>> updating
>>blocked load when the system leaves idle stat which can be up to a tick
>> long.
>>If this is too long, we can check for kicking ilb when task wakes up so
>> the
>>blocked load will be updated as soon as the system leaves idle state.
>>The main advantage is that we don't wake up a fully idle system every
>> 32ms to
>>update blocked load that will be not used.
>> - I'm working on one more improvement to use nohz_idle_balance in the
>> newly
>>idle case when the system is not overloaded and
>>(this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can
>> try to
>>use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it
>> exceed
>>this_rq->avg_idle. This will remove some calls to kick_ilb and some
>> wake up
>>of an idle cpus.
>>
>> ---
>>   kernel/sched/fair.c | 72
>> +
>>   1 file changed, 34 insertions(+), 38 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 52114c6..898785d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5386,7 +5386,6 @@ static struct {
>> atomic_t stats_state;
>> unsigned long next_balance; /* in jiffy units */
>> unsigned long next_stats;
>> -   struct timer_list timer;
>>   } nohz ____cacheline_aligned;
>> #endif /* CONFIG_NO_HZ_COMMON */
>> @@ -8004,8 +8003,9 @@ static inline void update_sd_lb_stats(struct lb_env
>> *env, struct sd_lb_stats *sd
>> prefer_sibling = 1;
>> #ifdef CONFIG_NO_HZ_COMMON
>> -   if (env->idle == CPU_NEWLY_IDLE)
>> +   if (env->idle == CPU_NEWLY_IDLE && atomic_read(&nohz.stats_state)) {
>> env->flags |= LBF_NOHZ_STATS;
>> +   }
>>   #endif
>> load_idx = get_sd_load_idx(env->sd, env->idle);
>> @@ -8818,6 +8818,8 @@ update_next_balance(struct sched_domain *sd,
>> unsigned long *next_balance)
>> *next_balance = next;
>>   }
>>   +static void kick_ilb(unsigned int flags);
>> +
>>   /*
>>* idle_balance is called by schedule() if this_cpu is about to become
>>* idle. Attempts to pull tasks from other CPUs.
>> @@ -8852,12 +8854,16 @@ static int idle_balance(struct rq *this_rq, struct
>> rq_flags *rf)
>> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>> !this_rq->rd->overload) {
>> +   unsigned long next = READ_ONCE(nohz.next_stats);
>> rcu_read_lock();
>> sd = rcu_dereference_check_sched_domain(this_rq->sd);
>> if (sd)
>> update_next_balance(sd, &next_balance);
>> rcu_read_unlock();
>> +   if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
>> +   kick_ilb(NOHZ_STATS_KICK);
>> +
>> goto out;
>> }
>>   @@ -9075,18 +9081,6 @@ static void kick_ilb(unsigned int flags)
>> smp_send_reschedule(ilb_cpu);
>>   }
>>   -void nohz_balance_timer(struct timer_list *timer)
>> -{
>> -   unsigned long next = READ_ONCE(nohz.next_stats);
>> -
>> -   if (time_before(jiffies, next)) {
>> -   mod_timer(timer, next);
>> -   return;
>> -   }
>> -
>> -   kick_ilb(NOHZ_STATS_KICK);
>> -}
>> -
>>   /*
>>* Current heuristic for kicking the idle load balancer in the presence
>>* of an idle cpu in the system.
>> @@ -9122,6 

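The failure mode under discussion, as a rough sketch (inferred from the
thread, not quoted from Peter's branch): the timer fires and kicks an ILB,
but the per-rq flag that the update loop tests was never set, so every cpu
is skipped and the blocked load is never decayed.

    /* Sketch of the stats update loop in the timer-based implementation. */
    static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
    {
            int cpu;

            for_each_cpu(cpu, nohz.idle_cpus_mask) {
                    struct rq *rq = cpu_rq(cpu);

                    /*
                     * Nothing ever sets rq->has_blocked_load when the cpu
                     * enters nohz idle, so this test never passes and the
                     * blocked load stays around forever -- hence the proposed
                     * rq->has_blocked_load = 1 in nohz_balance_enter_idle().
                     */
                    if (!rq->has_blocked_load)
                            continue;

                    update_blocked_averages(cpu);
            }
    }
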
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-22 Thread Dietmar Eggemann

On 01/15/2018 08:26 AM, Vincent Guittot wrote:

On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:

Hi Peter,

On 22 December 2017 at 21:42, Peter Zijlstra  wrote:

On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:

Right; but I figured we'd try and do it 'right' and see how horrible it
is before we try and do funny things.


So now it should have a 32ms tick for up to .5s when the system goes
completely idle.

No idea how bad that is..


I have tested your branch but the timer doesn't seem to fire correctly
because I can still see blocked load in the use case I have run.
I haven't found the reason yet


Isn't the issue with the timer-based implementation that
rq->has_blocked_load is never set to 1?


Something you changed in your implementation by adding a 
rq->has_blocked_load = 1 into nohz_balance_enter_idle().




Hi Peter,

With the patch below on top of your branch, the blocked loads are updated and
decayed regularly. The main differences are:
- It doesn't use a timer to trigger the ilb but relies on the tick and on a
   cpu becoming idle.
   The main drawback of this solution is that the blocked load is not updated
   when the system is fully idle, with the advantage of not waking up a fully
   idle system. We have to wait for the next tick or newly idle event to
   update the blocked load when the system leaves the idle state, which can
   be up to a tick long.
   If this is too long, we can check for kicking the ilb when a task wakes up
   so the blocked load will be updated as soon as the system leaves the idle
   state.
   The main advantage is that we don't wake up a fully idle system every 32ms
   to update blocked load that will not be used.
- I'm working on one more improvement to use nohz_idle_balance in the newly
   idle case when the system is not overloaded and
   (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can
   try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it
   exceeds this_rq->avg_idle. This will remove some calls to kick_ilb and
   some wake ups of idle cpus.

---
  kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++--------------------------------------
  1 file changed, 34 insertions(+), 38 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52114c6..898785d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5386,7 +5386,6 @@ static struct {
atomic_t stats_state;
unsigned long next_balance; /* in jiffy units */
unsigned long next_stats;
-   struct timer_list timer;
} nohz ____cacheline_aligned;
  
  #endif /* CONFIG_NO_HZ_COMMON */

@@ -8004,8 +8003,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, 
struct sd_lb_stats *sd
prefer_sibling = 1;
  
  #ifdef CONFIG_NO_HZ_COMMON

-   if (env->idle == CPU_NEWLY_IDLE)
+   if (env->idle == CPU_NEWLY_IDLE && atomic_read(&nohz.stats_state)) {
env->flags |= LBF_NOHZ_STATS;
+   }
  #endif
  
  	load_idx = get_sd_load_idx(env->sd, env->idle);

@@ -8818,6 +8818,8 @@ update_next_balance(struct sched_domain *sd, unsigned 
long *next_balance)
*next_balance = next;
  }
  
+static void kick_ilb(unsigned int flags);

+
  /*
   * idle_balance is called by schedule() if this_cpu is about to become
   * idle. Attempts to pull tasks from other CPUs.
@@ -8852,12 +8854,16 @@ static int idle_balance(struct rq *this_rq, struct 
rq_flags *rf)
  
  	if (this_rq->avg_idle < sysctl_sched_migration_cost ||

!this_rq->rd->overload) {
+   unsigned long next = READ_ONCE(nohz.next_stats);
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
update_next_balance(sd, &next_balance);
rcu_read_unlock();
  
+		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))

+   kick_ilb(NOHZ_STATS_KICK);
+
goto out;
}
  
@@ -9075,18 +9081,6 @@ static void kick_ilb(unsigned int flags)

smp_send_reschedule(ilb_cpu);
  }
  
-void nohz_balance_timer(struct timer_list *timer)

-{
-   unsigned long next = READ_ONCE(nohz.next_stats);
-
-   if (time_before(jiffies, next)) {
-   mod_timer(timer, next);
-   return;
-   }
-
-   kick_ilb(NOHZ_STATS_KICK);
-}
-
  /*
   * Current heuristic for kicking the idle load balancer in the presence
   * of an idle cpu in the system.
@@ -9122,6 +9116,9 @@ static void nohz_balancer_kick(struct rq *rq)
if (likely(!atomic_read(&nohz.nr_cpus)))
return;
  
+	if (time_after(now, nohz.next_stats) && atomic_read(&nohz.stats_state))

+   flags = NOHZ_STATS_KICK;
+
if (time_before(now, nohz.next_balance))
goto out;
  
@@ -9227,7 +9224,6 @@ static void set_cpu_sd_state_idle(int cpu)

  void nohz_balance_enter_idle(int cpu)
  {
struct rq *rq = cpu_rq(cpu);
-   unsigned int val, new;
  
  	SCHED_WARN_ON(cpu != smp_processor_id());
  
@@ -9251,6 

Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-18 Thread Morten Rasmussen
On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
> On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:
> > Hi Peter,
> > 
> > On 22 December 2017 at 21:42, Peter Zijlstra  wrote:
> > > On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
> > >> Right; but I figured we'd try and do it 'right' and see how horrible it
> > >> is before we try and do funny things.
> > >
> > > So now it should have a 32ms tick for up to .5s when the system goes
> > > completely idle.
> > >
> > > No idea how bad that is..
> > 
> > I have tested your branch but the timer doesn't seem to fire correctly
> > because I can still see blocked load in the use case I have run.
> > I haven't found the reason yet
> 
> Hi Peter,
> 
> With the patch below on top of your branch, the blocked loads are updated and
> decayed regularly. The main differences are:
> - It doesn't use a timer to trigger the ilb but relies on the tick and on a
>   cpu becoming idle.
>   The main drawback of this solution is that the blocked load is not
>   updated when the system is fully idle, with the advantage of not waking
>   up a fully idle system. We have to wait for the next tick or newly idle
>   event to update the blocked load when the system leaves the idle state,
>   which can be up to a tick long.
>   If this is too long, we can check for kicking the ilb when a task wakes
>   up so the blocked load will be updated as soon as the system leaves the
>   idle state.
>   The main advantage is that we don't wake up a fully idle system every
>   32ms to update blocked load that will not be used.
> - I'm working on one more improvement to use nohz_idle_balance in the newly
>   idle case when the system is not overloaded and
>   (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can
>   try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it
>   exceeds this_rq->avg_idle. This will remove some calls to kick_ilb and
>   some wake ups of idle cpus.

This sounds like what I meant in my other reply :-)

It seems pointless to have a timer to update PELT if the system is
completely idle, and when it isn't we can piggy back other events to
make the updates happen.


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-18 Thread Morten Rasmussen
On Mon, Jan 15, 2018 at 10:43:18AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 02, 2018 at 03:44:57PM +, Morten Rasmussen wrote:
> 
> > Vincent already proposed, why can't we just modify Brendan's
> > CPU_NEWLY_IDLE proposal to do a stats update from idle_balance() every
> > 32ms regardless of whether we need to load-balance?
> 
> I think that code is there, no?
> 
> Subject: sched: Update blocked load from NEWIDLE

The mechanics are there, but I think the problem is that the
idle_balance() bails out before we get to it in some cases. If we only
have a few small periodic tasks running, rd->overload won't be set and
idle_balance() returns before doing anything.

We would need some sort of check to see if a PELT update is due and make
sure it happens, even if idle_balance() has nothing to do.


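A minimal sketch of such a check, assuming it sits in the early-bail path of
idle_balance() (this is essentially the time_after()/kick_ilb() test that
Vincent's patch above adds):

    if (this_rq->avg_idle < sysctl_sched_migration_cost ||
        !this_rq->rd->overload) {
            /*
             * Nothing to pull, but still kick a stats-only ilb if a
             * blocked-load (PELT) update is overdue.
             */
            if (time_after(jiffies, READ_ONCE(nohz.next_stats)) &&
                atomic_read(&nohz.stats_state))
                    kick_ilb(NOHZ_STATS_KICK);

            goto out;
    }
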
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-15 Thread Peter Zijlstra
On Tue, Jan 02, 2018 at 03:44:57PM +, Morten Rasmussen wrote:

> Vincent already proposed, why can't we just modify Brendan's
> CPU_NEWLY_IDLE proposal to do a stats update from idle_balance() every
> 32ms regardless of whether we need to load-balance?

I think that code is there, no?

Subject: sched: Update blocked load from NEWIDLE


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-15 Thread Vincent Guittot
On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:
> Hi Peter,
> 
> On 22 December 2017 at 21:42, Peter Zijlstra  wrote:
> > On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
> >> Right; but I figured we'd try and do it 'right' and see how horrible it
> >> is before we try and do funny things.
> >
> > So now it should have a 32ms tick for up to .5s when the system goes
> > completely idle.
> >
> > No idea how bad that is..
> 
> I have tested your branch but the timer doesn't seem to fire correctly
> because I can still see blocked load in the use case I have run.
> I haven't found the reason yet

Hi Peter,

With the patch below on top of your branch, the blocked loads are updated and
decayed regularly. The main differences are:
- It doesn't use a timer to trigger the ilb but relies on the tick and on a
  cpu becoming idle.
  The main drawback of this solution is that the blocked load is not updated
  when the system is fully idle, with the advantage of not waking up a fully
  idle system. We have to wait for the next tick or newly idle event to
  update the blocked load when the system leaves the idle state, which can
  be up to a tick long.
  If this is too long, we can check for kicking the ilb when a task wakes up
  so the blocked load will be updated as soon as the system leaves the idle
  state.
  The main advantage is that we don't wake up a fully idle system every 32ms
  to update blocked load that will not be used.
- I'm working on one more improvement to use nohz_idle_balance in the newly
  idle case when the system is not overloaded and
  (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can
  try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it
  exceeds this_rq->avg_idle. This will remove some calls to kick_ilb and
  some wake ups of idle cpus. (A rough sketch of this idea follows the
  patch below.)

---
 kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++--------------------------------------
 1 file changed, 34 insertions(+), 38 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52114c6..898785d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5386,7 +5386,6 @@ static struct {
atomic_t stats_state;
unsigned long next_balance; /* in jiffy units */
unsigned long next_stats;
-   struct timer_list timer;
 } nohz ____cacheline_aligned;
 
 #endif /* CONFIG_NO_HZ_COMMON */
@@ -8004,8 +8003,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, 
struct sd_lb_stats *sd
prefer_sibling = 1;
 
 #ifdef CONFIG_NO_HZ_COMMON
-   if (env->idle == CPU_NEWLY_IDLE)
+   if (env->idle == CPU_NEWLY_IDLE && atomic_read(&nohz.stats_state)) {
env->flags |= LBF_NOHZ_STATS;
+   }
 #endif
 
load_idx = get_sd_load_idx(env->sd, env->idle);
@@ -8818,6 +8818,8 @@ update_next_balance(struct sched_domain *sd, unsigned 
long *next_balance)
*next_balance = next;
 }
 
+static void kick_ilb(unsigned int flags);
+
 /*
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
@@ -8852,12 +8854,16 @@ static int idle_balance(struct rq *this_rq, struct 
rq_flags *rf)
 
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
!this_rq->rd->overload) {
+   unsigned long next = READ_ONCE(nohz.next_stats);
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
update_next_balance(sd, &next_balance);
rcu_read_unlock();
 
+   if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
+   kick_ilb(NOHZ_STATS_KICK);
+
goto out;
}
 
@@ -9075,18 +9081,6 @@ static void kick_ilb(unsigned int flags)
smp_send_reschedule(ilb_cpu);
 }
 
-void nohz_balance_timer(struct timer_list *timer)
-{
-   unsigned long next = READ_ONCE(nohz.next_stats);
-
-   if (time_before(jiffies, next)) {
-   mod_timer(timer, next);
-   return;
-   }
-
-   kick_ilb(NOHZ_STATS_KICK);
-}
-
 /*
  * Current heuristic for kicking the idle load balancer in the presence
  * of an idle cpu in the system.
@@ -9122,6 +9116,9 @@ static void nohz_balancer_kick(struct rq *rq)
if (likely(!atomic_read(&nohz.nr_cpus)))
return;
 
+   if (time_after(now, nohz.next_stats) && atomic_read(&nohz.stats_state))
+   flags = NOHZ_STATS_KICK;
+
if (time_before(now, nohz.next_balance))
goto out;
 
@@ -9227,7 +9224,6 @@ static void set_cpu_sd_state_idle(int cpu)
 void nohz_balance_enter_idle(int cpu)
 {
struct rq *rq = cpu_rq(cpu);
-   unsigned int val, new;
 
SCHED_WARN_ON(cpu != smp_processor_id());
 
@@ -9251,6 +9247,7 @@ void nohz_balance_enter_idle(int cpu)
return;
 
rq->nohz_tick_stopped = 1;
+   rq->has_blocked_load = 1;
 
cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
atomic_inc(&nohz.nr_cpus);
@@ -9258,21 +9255,11 
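
A rough sketch of that second improvement, with a hypothetical helper name
(nohz_newidle_stats is not part of the patch above):

    /*
     * Sketch: update blocked load from the newly idle path, but abort as
     * soon as the walk has consumed the idle time we expect to have.
     */
    static void nohz_newidle_stats(struct rq *this_rq)
    {
            u64 start = sched_clock_cpu(cpu_of(this_rq));
            int cpu;

            for_each_cpu(cpu, nohz.idle_cpus_mask) {
                    update_blocked_averages(cpu);

                    /* Abort as soon as the walk exceeds avg_idle. */
                    if (sched_clock_cpu(cpu_of(this_rq)) - start >
                        this_rq->avg_idle)
                            break;
            }
    }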

Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-03 Thread Vincent Guittot
Hi Peter,

On 22 December 2017 at 21:42, Peter Zijlstra  wrote:
> On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
>> Right; but I figured we'd try and do it 'right' and see how horrible it
>> is before we try and do funny things.
>
> So now it should have a 32ms tick for up to .5s when the system goes
> completely idle.
>
> No idea how bad that is..

I have tested your branch but the timer doesn't seem to fire correctly
because I can still see blocked load in the use case I have run.
I haven't found the reason yet


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2018-01-02 Thread Morten Rasmussen
On Fri, Dec 22, 2017 at 09:42:47PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
> > Right; but I figured we'd try and do it 'right' and see how horrible it
> > is before we try and do funny things.
> 
> So now it should have a 32ms tick for up to .5s when the system goes
> completely idle.
> 
> No idea how bad that is..

Does it mean that the 32ms tick will keep going forever if the system
doesn't go completely idle? Some tiny background task or a slightly
bigger one with a longer period?

Do we actually care about stale values if the system is completely idle?

Instead of hacking select_task_rq_fair() to kick off a stats update as
Vincent already proposed, why can't we just modify Brendan's
CPU_NEWLY_IDLE proposal to do a stats update from idle_balance() every
32ms regardless of whether we need to load-balance?

This way we should get updates if there is anything running, we don't
touch the wake-up path, we don't cause any additional wake-ups, and we
don't need a timer. What am I missing?


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-22 Thread Peter Zijlstra
On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
> Right; but I figured we'd try and do it 'right' and see how horrible it
> is before we try and do funny things.

So now it should have a 32ms tick for up to .5s when the system goes
completely idle.

No idea how bad that is..


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-22 Thread Peter Zijlstra
On Fri, Dec 22, 2017 at 03:32:53PM +0100, Vincent Guittot wrote:
> > The only thing I could come up with is running a timer for this :/ That
> > would keep the ILB thing running until all load is decayed (have a patch
> > for that somewhere).
> 
> IMHO running a timer doesn't sound really great

I tend to agree..

> When we have enough activity on the system, the tick and the periodic
> load balance will ensure the update of the load of all cpus (including
> the idle cpus) at the load balance period pace.

> But if we don't have enough activity to trigger the periodic update
> through the ilb, or because the system is not overloaded or even almost
> idle, we don't have these periodic updates anymore.

> The goal is to do a lazy update of the blocked load so as not to hurt
> the power consumption of idle CPUs too much. When a task wakes up and the
> blocked idle load has not been updated for a while, we trigger the
> update of these blocked loads in parallel to the wake up so the data
> will be more accurate for the next events.

> It's already too late for the current wake up but that's not a big
> deal because the wake-up path of a lightly loaded system is mainly
> choosing between the previous and current cpu, and the load_avg_contrib
> and the utilization will have been updated for the next events.

Right; but I figured we'd try and do it 'right' and see how horrible it
is before we try and do funny things.



Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-22 Thread Vincent Guittot
On 22 December 2017 at 15:31, Peter Zijlstra  wrote:
> On Fri, Dec 22, 2017 at 10:12:21AM +0100, Peter Zijlstra wrote:
>
>> The only thing I could come up with is running a timer for this :/ That
>> would keep the ILB thing running until all load is decayed (have a patch
>> for that somewhere).
>
> Implemented that; pushed it out, should all be at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/testing
>
> but given how today is going, it'll eat your nan and set your cat on
> fire.

Our emails crossed. I'm going to have a look at your branch.


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-22 Thread Vincent Guittot
On 22 December 2017 at 10:12, Peter Zijlstra  wrote:
> On Fri, Dec 22, 2017 at 09:29:15AM +0100, Peter Zijlstra wrote:
>> On Fri, Dec 22, 2017 at 09:05:45AM +0100, Vincent Guittot wrote:
>> > On 22 December 2017 at 08:59, Peter Zijlstra  wrote:
>> > > On Thu, Dec 21, 2017 at 05:56:32PM +0100, Vincent Guittot wrote:
>> > >> In fact, we can't only rely on the tick and newly_idle load balance to
>> > >> ensure a period update of the blocked load because they can never
>> > >> happen.
>> > >
>> > > I'm confused, why would the ilb not happen?
>> >
>> > the ilb will be kick only if tick fires which might not be the case
>> > for task that runs less than a tick
>>
>> Oh, urgh, you're talking about when the entire system is idle. Yes
>> indeed.
>>
>> Lemme have a think, surely we can do something saner there.
>
> The only thing I could come up with is running a timer for this :/ That
> would keep the ILB thing running until all load is decayed (have a patch
> for that somewhere).

IMHO running a timer doesn't sound really great.
When we have enough activity on the system, the tick and the periodic
load balance will ensure the update of the load of all cpus (including
the idle cpus) at the load balance period pace. But if we don't have
enough activity to trigger the periodic update through the ilb, or
because the system is not overloaded or even almost idle, we don't have
these periodic updates anymore. The goal is to do a lazy update of the
blocked load so as not to hurt the power consumption of idle CPUs too
much. When a task wakes up and the blocked idle load has not been
updated for a while, we trigger the update of these blocked loads in
parallel to the wake up so the data will be more accurate for the next
events. It's already too late for the current wake up but that's not a
big deal, because the wake-up path of a lightly loaded system is mainly
choosing between the previous and current cpu, and the load_avg_contrib
and the utilization will have been updated for the next events.


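The wake-up hook described here would amount to something like the following
check in the wake-up path (a sketch; the original proposal placed it in
select_task_rq_fair(), and it was later dropped from Peter's branch, see the
messages below):

    /* Lazy blocked-load refresh kicked from the wake-up path. */
    if (time_after(jiffies, READ_ONCE(nohz.next_stats)) &&
        atomic_read(&nohz.stats_state))
            kick_ilb(NOHZ_STATS_KICK);  /* stats update only, no balance */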

Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-22 Thread Peter Zijlstra
On Fri, Dec 22, 2017 at 10:12:21AM +0100, Peter Zijlstra wrote:

> The only thing I could come up with is running a timer for this :/ That
> would keep the ILB thing running until all load is decayed (have a patch
> for that somewhere).

Implemented that; pushed it out, should all be at:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/testing

but given how today is going, it'll eat your nan and set your cat on
fire.


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-22 Thread Peter Zijlstra
On Fri, Dec 22, 2017 at 09:29:15AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 22, 2017 at 09:05:45AM +0100, Vincent Guittot wrote:
> > On 22 December 2017 at 08:59, Peter Zijlstra  wrote:
> > > On Thu, Dec 21, 2017 at 05:56:32PM +0100, Vincent Guittot wrote:
> > >> In fact, we can't only rely on the tick and newly_idle load balance to
> > >> ensure a periodic update of the blocked load because they might never
> > >> happen.
> > >
> > > I'm confused, why would the ilb not happen?
> > 
> > the ilb will be kicked only if the tick fires, which might not be the
> > case for a task that runs less than a tick
> 
> Oh, urgh, you're talking about when the entire system is idle. Yes
> indeed.
> 
> Lemme have a think, surely we can do something saner there.

The only thing I could come up with is running a timer for this :/ That
would keep the ILB thing running until all load is decayed (have a patch
for that somewhere).



Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-22 Thread Peter Zijlstra
On Fri, Dec 22, 2017 at 09:05:45AM +0100, Vincent Guittot wrote:
> On 22 December 2017 at 08:59, Peter Zijlstra  wrote:
> > On Thu, Dec 21, 2017 at 05:56:32PM +0100, Vincent Guittot wrote:
> >> In fact, we can't only rely on the tick and newly_idle load balance to
> >> ensure a periodic update of the blocked load because they might never
> >> happen.
> >
> > I'm confused, why would the ilb not happen?
> 
> the ilb will be kicked only if the tick fires, which might not be the
> case for a task that runs less than a tick

Oh, urgh, you're talking about when the entire system is idle. Yes
indeed.

Lemme have a think, surely we can do something saner there.


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-22 Thread Vincent Guittot
On 22 December 2017 at 08:59, Peter Zijlstra  wrote:
> On Thu, Dec 21, 2017 at 05:56:32PM +0100, Vincent Guittot wrote:
>> In fact, we can't only rely on the tick and newly_idle load balance to
>> ensure a periodic update of the blocked load because they might never
>> happen.
>
> I'm confused, why would the ilb not happen?

the ilb will be kicked only if the tick fires, which might not be the
case for a task that runs less than a tick


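That is because the nohz kick only runs from the scheduler tick path,
roughly as follows (a simplified sketch of the pre-patch flow, not the
patched code):

    /* Called from scheduler_tick() on every tick of a busy cpu. */
    void trigger_load_balance(struct rq *rq)
    {
            if (time_after_eq(jiffies, rq->next_balance))
                    raise_softirq(SCHED_SOFTIRQ);   /* periodic balance */

            nohz_balancer_kick(rq);                 /* may kick an ilb */
    }

    /*
     * No tick anywhere -> neither path runs, so a fully idle system
     * never kicks the ilb and the blocked load is never decayed.
     */
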
Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-22 Thread Vincent Guittot
On 22 December 2017 at 08:56, Peter Zijlstra  wrote:
> On Thu, Dec 21, 2017 at 05:23:27PM +0100, Vincent Guittot wrote:
>> Hi Peter,
>>
>> I think that part of the proposal is missing.
>>
>> One goal of the patchset was to kick an update of the stats of idle
>> cpus when a task wakes up on a cpu but the statistics have not been
>> updated for a while.
>>
>> That's why there was a call to nohz_kick_needed in the proposal to
>> kick the ilb, but only for updating blocked load and not a full idle
>> load balance
>>
>> I can't find this call any more in your patchset
>
> Yeah, I took it out because it didn't make sense to me. If you're waking
> to a stale CPU, you're too late.

I agree that it's too late for the current wake up but that's the
trade-off with delaying the wake up until all the blocked load has been
updated.


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-22 Thread Peter Zijlstra
On Thu, Dec 21, 2017 at 05:56:32PM +0100, Vincent Guittot wrote:
> In fact, we can't only rely on the tick and newly_idle load balance to
> ensure a periodic update of the blocked load because they might never
> happen.

I'm confused, why would the ilb not happen?


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-21 Thread Peter Zijlstra
On Thu, Dec 21, 2017 at 05:23:27PM +0100, Vincent Guittot wrote:
> Hi Peter,
> 
> I think that part of the proposal is missing.
> 
> One goal of the patchset was to kick an update of the stats of idle
> cpus when a task wakes up on a cpu but the statistics have not been
> updated for a while.
> 
> That's why there was a call to nohz_kick_needed in the proposal to
> kick the ilb, but only for updating blocked load and not a full idle
> load balance
> 
> I can't find this call any more in your patchset

Yeah, I took it out because it didn't make sense to me. If you're waking
to a stale CPU, you're too late.



Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-21 Thread Vincent Guittot
On 21 December 2017 at 17:23, Vincent Guittot wrote:
> Hi Peter,
>
> I think that part of the proposal is missing.
>
> One goal of the patchset was to kick an update of the stats of idle
> cpus when a task wakes up on a cpu but the statistics have not been
> updated for a while.
>
> That's why there was a call to nohz_kick_needed in the proposal to
> kick the ilb, but only for updating blocked load and not a full idle load

sorry, a call to nohz_kick_needed (which becomes nohz_balancer_kick in
your patchset) in select_task_rq_fair

> balance
>
> I can't find this call any more in your patchset

In fact, we can't only rely on the tick and newly_idle load balance to
ensure a periodic update of the blocked load because they might never
happen. So we need to find another place to kick for a periodic update,
which is when a task wakes up.


>
> On 21 December 2017 at 11:21, Peter Zijlstra  wrote:
>>
>> Suggested-by: Vincent Guittot 
>> Signed-off-by: Peter Zijlstra (Intel) 
>> ---
>>  kernel/sched/core.c  |    4 +--
>>  kernel/sched/fair.c  |   52 +++++++++++++++++++++++++++++++++++-----------------
>>  kernel/sched/sched.h |    4 +++
>>  3 files changed, 41 insertions(+), 19 deletions(-)
>>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -608,7 +608,7 @@ static inline bool got_nohz_idle_kick(vo
>>  {
>> int cpu = smp_processor_id();
>>
>> -   if (!(atomic_read(nohz_flags(cpu)) & NOHZ_BALANCE_KICK))
>> +   if (!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK))
>> return false;
>>
>> if (idle_cpu(cpu) && !need_resched())
>> @@ -618,7 +618,7 @@ static inline bool got_nohz_idle_kick(vo
>>  * We can't run Idle Load Balance on this CPU for this time so we
>>  * cancel it and clear NOHZ_BALANCE_KICK
>>  */
>> -   atomic_andnot(NOHZ_BALANCE_KICK, nohz_flags(cpu));
>> +   atomic_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
>> return false;
>>  }
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9001,8 +9001,8 @@ static void nohz_balancer_kick(void)
>> if (ilb_cpu >= nr_cpu_ids)
>> return;
>>
>> -   flags = atomic_fetch_or(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu));
>> -   if (flags & NOHZ_BALANCE_KICK)
>> +   flags = atomic_fetch_or(NOHZ_KICK_MASK, nohz_flags(ilb_cpu));
>> +   if (flags & NOHZ_KICK_MASK)
>> return;
>> /*
>>  * Use smp_send_reschedule() instead of resched_cpu().
>> @@ -9125,8 +9125,6 @@ static void rebalance_domains(struct rq
>> int need_serialize, need_decay = 0;
>> u64 max_cost = 0;
>>
>> -   update_blocked_averages(cpu);
>> -
>> rcu_read_lock();
>> for_each_domain(cpu, sd) {
>> /*
>> @@ -9221,20 +9219,27 @@ static void rebalance_domains(struct rq
>>   * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
>>   * rebalancing for all the cpus for whom scheduler ticks are stopped.
>>   */
>> -static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>> +static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>  {
>> -   int this_cpu = this_rq->cpu;
>> -   struct rq *rq;
>> -   int balance_cpu;
>> /* Earliest time when we have to do rebalance again */
>> unsigned long next_balance = jiffies + 60*HZ;
>> int update_next_balance = 0;
>> +   int this_cpu = this_rq->cpu;
>> +   unsigned int flags;
>> +   int balance_cpu;
>> +   struct rq *rq;
>>
>> -   if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_BALANCE_KICK))
>> -   return;
>> +   if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
>> +   return false;
>>
>> -   if (idle != CPU_IDLE)
>> -   goto end;
>> +   if (idle != CPU_IDLE) {
>> +   atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
>> +   return false;
>> +   }
>> +
>> +   flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
>> +
>> +   SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>>
>> for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
>> if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>> @@ -9262,7 +9267,9 @@ static void nohz_idle_balance(struct rq
>> cpu_load_update_idle(rq);
>> rq_unlock_irq(rq, &rf);
>>
>> -   rebalance_domains(rq, CPU_IDLE);
>> +   update_blocked_averages(rq->cpu);
>> +   if (flags & NOHZ_BALANCE_KICK)
>> +   rebalance_domains(rq, CPU_IDLE);
>> }
>>
>> if (time_after(next_balance, rq->next_balance)) {
>> @@ -9271,6 +9278,10 @@ static void nohz_idle_balance(struct rq
>> }
>> }
>>
>> [...]


Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-21 Thread Vincent Guittot
Hi Peter,

I think one part of the proposal is missing.

One goal of the patchset was to kick an update of the stats of idle
CPUs when a task wakes up on a CPU but the statistics have not been
updated for a while.

That's why there was a call to nohz_kick_needed in the proposal to
kick an ILB, but only to update the blocked load and not to run a full
idle load balance.

I can't find this call anymore in your patchset.
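
The kernel/sched/sched.h hunk itself does not survive in this capture;
the diffstat below records only 4 added lines there, presumably the
stats bits on top of the existing NOHZ_BALANCE_KICK. The way the fair.c
hunks use the flags implies a bit layout along these lines (a
reconstruction, not the actual hunk):

#define NOHZ_BALANCE_KICK_BIT	0
#define NOHZ_STATS_KICK_BIT	1

#define NOHZ_BALANCE_KICK	BIT(NOHZ_BALANCE_KICK_BIT)	/* full idle load balance */
#define NOHZ_STATS_KICK		BIT(NOHZ_STATS_KICK_BIT)	/* update blocked load only */

#define NOHZ_KICK_MASK		(NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)

With such a layout, the SCHED_WARN_ON((flags & NOHZ_KICK_MASK) ==
NOHZ_BALANCE_KICK) in the patch fires when a balance kick arrives
without a stats kick, i.e. it checks the invariant that blocked load is
always refreshed whenever a full rebalance is requested.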

On 21 December 2017 at 11:21, Peter Zijlstra  wrote:
>
> Suggested-by: Vincent Guittot 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/sched/core.c  |4 +--
>  kernel/sched/fair.c  |   52 ++-
>  kernel/sched/sched.h |4 +++
>  3 files changed, 41 insertions(+), 19 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -608,7 +608,7 @@ static inline bool got_nohz_idle_kick(vo
>  {
> int cpu = smp_processor_id();
>
> -   if (!(atomic_read(nohz_flags(cpu)) & NOHZ_BALANCE_KICK))
> +   if (!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK))
> return false;
>
> if (idle_cpu(cpu) && !need_resched())
> @@ -618,7 +618,7 @@ static inline bool got_nohz_idle_kick(vo
>  * We can't run Idle Load Balance on this CPU for this time so we
>  * cancel it and clear NOHZ_BALANCE_KICK
>  */
> -   atomic_andnot(NOHZ_BALANCE_KICK, nohz_flags(cpu));
> +   atomic_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
> return false;
>  }
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9001,8 +9001,8 @@ static void nohz_balancer_kick(void)
> if (ilb_cpu >= nr_cpu_ids)
> return;
>
> -   flags = atomic_fetch_or(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu));
> -   if (flags & NOHZ_BALANCE_KICK)
> +   flags = atomic_fetch_or(NOHZ_KICK_MASK, nohz_flags(ilb_cpu));
> +   if (flags & NOHZ_KICK_MASK)
> return;
> /*
>  * Use smp_send_reschedule() instead of resched_cpu().
> @@ -9125,8 +9125,6 @@ static void rebalance_domains(struct rq
> int need_serialize, need_decay = 0;
> u64 max_cost = 0;
>
> -   update_blocked_averages(cpu);
> -
> rcu_read_lock();
> for_each_domain(cpu, sd) {
> /*
> @@ -9221,20 +9219,27 @@ static void rebalance_domains(struct rq
>   * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
>   * rebalancing for all the cpus for whom scheduler ticks are stopped.
>   */
> -static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> +static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  {
> -   int this_cpu = this_rq->cpu;
> -   struct rq *rq;
> -   int balance_cpu;
> /* Earliest time when we have to do rebalance again */
> unsigned long next_balance = jiffies + 60*HZ;
> int update_next_balance = 0;
> +   int this_cpu = this_rq->cpu;
> +   unsigned int flags;
> +   int balance_cpu;
> +   struct rq *rq;
>
> -   if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_BALANCE_KICK))
> -   return;
> +   if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
> +   return false;
>
> -   if (idle != CPU_IDLE)
> -   goto end;
> +   if (idle != CPU_IDLE) {
> +   atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> +   return false;
> +   }
> +
> +   flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> +
> +   SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>
> for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> @@ -9262,7 +9267,9 @@ static void nohz_idle_balance(struct rq
> cpu_load_update_idle(rq);
> rq_unlock_irq(rq, &rf);
>
> -   rebalance_domains(rq, CPU_IDLE);
> +   update_blocked_averages(rq->cpu);
> +   if (flags & NOHZ_BALANCE_KICK)
> +   rebalance_domains(rq, CPU_IDLE);
> }
>
> if (time_after(next_balance, rq->next_balance)) {
> @@ -9271,6 +9278,10 @@ static void nohz_idle_balance(struct rq
> }
> }
>
> +   update_blocked_averages(this_cpu);
> +   if (flags & NOHZ_BALANCE_KICK)
> +   rebalance_domains(this_rq, CPU_IDLE);
> +
> /*
>  * next_balance will be updated only when there is a need.
>  * When the CPU is attached to null domain for ex, it will not be
> @@ -9278,8 +9289,8 @@ static void nohz_idle_balance(struct rq
>  */
> if (likely(update_next_balance))
> nohz.next_balance = next_balance;
> -end:
> -   atomic_andnot(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
> +
> +   return true;
>  }
>
> [...]

[RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

2017-12-21 Thread Peter Zijlstra

Suggested-by: Vincent Guittot 
Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c  |4 +--
 kernel/sched/fair.c  |   52 ++-
 kernel/sched/sched.h |4 +++
 3 files changed, 41 insertions(+), 19 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -608,7 +608,7 @@ static inline bool got_nohz_idle_kick(vo
 {
int cpu = smp_processor_id();
 
-   if (!(atomic_read(nohz_flags(cpu)) & NOHZ_BALANCE_KICK))
+   if (!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK))
return false;
 
if (idle_cpu(cpu) && !need_resched())
@@ -618,7 +618,7 @@ static inline bool got_nohz_idle_kick(vo
 * We can't run Idle Load Balance on this CPU for this time so we
 * cancel it and clear NOHZ_BALANCE_KICK
 */
-   atomic_andnot(NOHZ_BALANCE_KICK, nohz_flags(cpu));
+   atomic_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
return false;
 }
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9001,8 +9001,8 @@ static void nohz_balancer_kick(void)
if (ilb_cpu >= nr_cpu_ids)
return;
 
-   flags = atomic_fetch_or(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu));
-   if (flags & NOHZ_BALANCE_KICK)
+   flags = atomic_fetch_or(NOHZ_KICK_MASK, nohz_flags(ilb_cpu));
+   if (flags & NOHZ_KICK_MASK)
return;
/*
 * Use smp_send_reschedule() instead of resched_cpu().
@@ -9125,8 +9125,6 @@ static void rebalance_domains(struct rq
int need_serialize, need_decay = 0;
u64 max_cost = 0;
 
-   update_blocked_averages(cpu);
-
rcu_read_lock();
for_each_domain(cpu, sd) {
/*
@@ -9221,20 +9219,27 @@ static void rebalance_domains(struct rq
  * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 {
-   int this_cpu = this_rq->cpu;
-   struct rq *rq;
-   int balance_cpu;
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
+   int this_cpu = this_rq->cpu;
+   unsigned int flags;
+   int balance_cpu;
+   struct rq *rq;
 
-   if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_BALANCE_KICK))
-   return;
+   if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
+   return false;
 
-   if (idle != CPU_IDLE)
-   goto end;
+   if (idle != CPU_IDLE) {
+   atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
+   return false;
+   }
+
+   flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
+
+   SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
 
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
@@ -9262,7 +9267,9 @@ static void nohz_idle_balance(struct rq
cpu_load_update_idle(rq);
rq_unlock_irq(rq, &rf);
 
-   rebalance_domains(rq, CPU_IDLE);
+   update_blocked_averages(rq->cpu);
+   if (flags & NOHZ_BALANCE_KICK)
+   rebalance_domains(rq, CPU_IDLE);
}
 
if (time_after(next_balance, rq->next_balance)) {
@@ -9271,6 +9278,10 @@ static void nohz_idle_balance(struct rq
}
}
 
+   update_blocked_averages(this_cpu);
+   if (flags & NOHZ_BALANCE_KICK)
+   rebalance_domains(this_rq, CPU_IDLE);
+
/*
 * next_balance will be updated only when there is a need.
 * When the CPU is attached to null domain for ex, it will not be
@@ -9278,8 +9289,8 @@ static void nohz_idle_balance(struct rq
 */
if (likely(update_next_balance))
nohz.next_balance = next_balance;
-end:
-   atomic_andnot(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
+
+   return true;
 }
 
 /*
@@ -9366,7 +9377,10 @@ static inline bool nohz_kick_needed(stru
return kick;
 }
 #else
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+{
+   return false;
+}
 #endif
 
 /*
@@ -9387,7 +9401,11 @@ static __latent_entropy void run_rebalan
 * load balance only within the local sched_domain hierarchy
 * and abort nohz_idle_balance altogether if we pull some load.
 */
-   nohz_idle_balance(this_rq, idle);
+   if (nohz_idle_balance(this_rq, idle))
+   return;
+
+   /* normal load balance */
+   update_blocked_averages(this_rq->cpu);
rebalance_domains(this_rq, idle);
 }
 
---
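
Two details of the patch are worth spelling out. First, with the bit
layout sketched earlier, the reason got_nohz_idle_kick() now tests
NOHZ_KICK_MASK is that a stats-only kick sets NOHZ_STATS_KICK alone,
which the old NOHZ_BALANCE_KICK test would silently drop:

	flags = NOHZ_STATS_KICK;
	flags & NOHZ_BALANCE_KICK;	/* 0: the old test misses the kick */
	flags & NOHZ_KICK_MASK;		/* non-zero: the new test sees it */

Second, this split is what makes a stats-only kick cheaper than a full
ILB: in nohz_idle_balance() every idle CPU still gets
update_blocked_averages(), but rebalance_domains() runs only when
NOHZ_BALANCE_KICK was part of the kick.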