Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-09 Thread Jacob Pan
On Fri, 6 Nov 2015 15:49:29 -0800
Jacob Pan  wrote:

> > Check the softirq stuff before calling throttle ?  
> 
> yes, played with it but it seems there are other cases causing pending
> softirq in idle in addition to throttle. I still haven't figure it
> out, this problem only shows up in heavy irq, network load. e.g.
> compile kernel over NFS. Debugging.
ok, I added a check for softirq_pending and a retry. seems to work. Now
idle injection will allow softirq and softirqd to run. The caveat is
that during that injection period, if softirqd does not run for the
entire duration, other normal tasks would also run during forced idle.
But just for that period. I guess we have to strike for the right
balance for QoS and overhead. For most workload, pending softirq is
rare so the tasks slip under softirqd are also rare. Will send out V2
soon.

Thanks,

Jacob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-06 Thread Jacob Pan
On Fri, 6 Nov 2015 08:45:10 +0100
Peter Zijlstra  wrote:

> On Thu, Nov 05, 2015 at 03:36:25PM -0800, Jacob Pan wrote:
> 
> > I did some testing with the code below, it shows random
> > [  150.442597] NOHZ: local_softirq_pending 02
> > [  153.032673] NOHZ: local_softirq_pending 202
> > [  153.203785] NOHZ: local_softirq_pending 202
> > [  153.206486] NOHZ: local_softirq_pending 282
> > I recalled that was why i checked for local_softirq_pending in the
> > initial patch, still trying to find out how we can avoid that. These
> > also causes non stop sched ticks in the inner idle loop.
> 
> Check the softirq stuff before calling throttle ?

yes, played with it but it seems there are other cases causing pending
softirq in idle in addition to throttle. I still haven't figure it out,
this problem only shows up in heavy irq, network load. e.g. compile
kernel over NFS. Debugging.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Peter Zijlstra
On Thu, Nov 05, 2015 at 03:36:25PM -0800, Jacob Pan wrote:

> I did some testing with the code below, it shows random
> [  150.442597] NOHZ: local_softirq_pending 02
> [  153.032673] NOHZ: local_softirq_pending 202
> [  153.203785] NOHZ: local_softirq_pending 202
> [  153.206486] NOHZ: local_softirq_pending 282
> I recalled that was why i checked for local_softirq_pending in the
> initial patch, still trying to find out how we can avoid that. These
> also causes non stop sched ticks in the inner idle loop.

Check the softirq stuff before calling throttle ?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Jacob Pan
On Thu, 5 Nov 2015 14:59:52 +0100
Peter Zijlstra  wrote:

> On Tue, Nov 03, 2015 at 02:31:20PM +0100, Peter Zijlstra wrote:
> > > @@ -5136,6 +5148,16 @@ pick_next_task_fair(struct rq *rq, struct
> > > task_struct *prev) struct task_struct *p;
> > >   int new_tasks;
> > >  
> > > +#ifdef CONFIG_CFS_IDLE_INJECT
> > > + if (cfs_rq->force_throttled &&
> > > + !idle_cpu(cpu_of(rq)) &&
> > > + !unlikely(local_softirq_pending())) {
> > > + /* forced idle, pick no task */
> > > + trace_sched_cfs_idle_inject(cpu_of(rq), 1);
> > > + update_curr(cfs_rq);
> > > + return NULL;
> > > + }
> > > +#endif
> > >  again:
> > >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > >   if (!cfs_rq->nr_running)
> > 
> > So this is horrible...
> 
> So this isn't ideal either (I rather liked the previous approach of a
> random task assuming idle, but tglx hated that). This should at least
> not touch extra cachelines in the hot paths, although it does add a
> few extra instructions :/
> 
> Very limited testing didn't show anything horrible.
> 
I did some testing with the code below, it shows random
[  150.442597] NOHZ: local_softirq_pending 02
[  153.032673] NOHZ: local_softirq_pending 202
[  153.203785] NOHZ: local_softirq_pending 202
[  153.206486] NOHZ: local_softirq_pending 282
I recalled that was why i checked for local_softirq_pending in the
initial patch, still trying to find out how we can avoid that. These
also causes non stop sched ticks in the inner idle loop.

> Your throttle would:
> 
>   raw_spin_lock_irqsave(&rq->lock, flags);
>   rq->cfs.forced_idle = true;
>   resched = rq->cfs.runnable;
>   rq->cfs.runnable = false;
>   raw_spin_unlock_irqrestore(&rq->lock, flags);
>   if (resched)
> resched_cpu(cpu_of(rq));
> 
> And your unthrottle:
> 
>   raw_spin_lock_irqsave(&rq->lock, flags);
>   rq->cfs.forced_idle = false;
>   resched = rq->cfs.runnable = !!rq->cfs.nr_running;
>   raw_spin_unlock_irqrestore(&rq->lock, flags);
>   if (resched)
> resched_cpu(cpu_of(rq));
> 
> ---
>  kernel/sched/fair.c  |   13 +
>  kernel/sched/sched.h |1 +
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 824aa9f..1f0c809 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2341,7 +2341,8 @@ account_entity_enqueue(struct cfs_rq *cfs_rq,
> struct sched_entity *se) list_add(&se->group_node, &rq->cfs_tasks);
>   }
>  #endif
> - cfs_rq->nr_running++;
> + if (!cfs_rq->nr_running++ && !cfs_rq->forced_idle)
> + cfs_rq->runnable = true;
>  }
>  
>  static void
> @@ -2354,7 +2355,8 @@ account_entity_dequeue(struct cfs_rq *cfs_rq,
> struct sched_entity *se) account_numa_dequeue(rq_of(cfs_rq),
> task_of(se)); list_del_init(&se->group_node);
>   }
> - cfs_rq->nr_running--;
> + if (!--cfs_rq->nr_running && !cfs_rq->forced_idle)
> + cfs_rq->runnable = false;
>  }
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> @@ -5204,7 +5206,7 @@ pick_next_task_fair(struct rq *rq, struct
> task_struct *prev) 
>  again:
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> - if (!cfs_rq->nr_running)
> + if (!cfs_rq->runnable)
>   goto idle;
>  
>   if (prev->sched_class != &fair_sched_class)
> @@ -5283,7 +5285,7 @@ simple:
>   cfs_rq = &rq->cfs;
>  #endif
>  
> - if (!cfs_rq->nr_running)
> + if (!cfs_rq->runnable)
>   goto idle;
>  
>   put_prev_task(rq, prev);
> @@ -5302,6 +5304,9 @@ simple:
>   return p;
>  
>  idle:
> + if (cfs_rq->forced_idle)
> + return NULL;
> +
>   /*
>* This is OK, because current is on_cpu, which avoids it
> being picked
>* for load-balance and preemption/IRQs are still disabled
> avoiding diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index efd3bfc..33d355d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -347,6 +347,7 @@ struct cfs_bandwidth { };
>  struct cfs_rq {
>   struct load_weight load;
>   unsigned int nr_running, h_nr_running;
> + unsigned int runnable, forced_idle;
>  
>   u64 exec_clock;
>   u64 min_vruntime;

[Jacob Pan]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Jacob Pan
On Thu, 5 Nov 2015 11:27:32 -0800
Jacob Pan  wrote:

> On Thu, 5 Nov 2015 11:09:22 +0100
> Peter Zijlstra  wrote:
> 
> > > Before:
> > > CPU0 __||| ||  |___| || || |_
> > > CPU1 _||| ||  |___| || |___
> > > 
> > > After:
> > > 
> > > CPU0 __||| ||  |___| || || |_
> > > CPU1 __||| ||  |___| || |___
> > > 
> > > The goal is to have overlapping idle time if the load is already
> > > balanced. The energy saving can be significant.  
> > 
> > I can see such a scheme having a fairly big impact on latency, esp.
> > with forced idleness such as this. That's not going to be popular
> > for many workloads.
> agreed, it would be for limited workload. the key is to identify such
> workloads at runtime. I am thinking to use the load average of
> the busiest CPU as reference for consolidation, will not go beyond
> that.
> For the patch I have today and if you play a game like this one
> http://www.agame.com/game/cut-the-rope
sorry, hit the wrong button before finishing the email.
and set duration to 5, and 20% idle, it does not affect user experience
much. It saves ~15% power on my BDW ultrabook. Other unbalanced
workload such as video playback don't benefit, should be avoided.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Jacob Pan
On Thu, 5 Nov 2015 11:09:22 +0100
Peter Zijlstra  wrote:

> > Before:
> > CPU0 __||| ||  |___| || || |_
> > CPU1 _||| ||  |___| || |___
> > 
> > After:
> > 
> > CPU0 __||| ||  |___| || || |_
> > CPU1 __||| ||  |___| || |___
> > 
> > The goal is to have overlapping idle time if the load is already
> > balanced. The energy saving can be significant.  
> 
> I can see such a scheme having a fairly big impact on latency, esp.
> with forced idleness such as this. That's not going to be popular for
> many workloads.
agreed, it would be for limited workload. the key is to identify such
workloads at runtime. I am thinking to use the load average of
the busiest CPU as reference for consolidation, will not go beyond
that.
For the patch I have today and if you play a game like this one
http://www.agame.com/game/cut-the-rope
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Thomas Gleixner
On Thu, 5 Nov 2015, Peter Zijlstra wrote:
> On Thu, Nov 05, 2015 at 07:28:50AM -0800, Arjan van de Ven wrote:
> > well we have this as a driver right now that does not touch hot paths,
> > but it seems you and tglx also hate that approach with a passion
> 
> The current code is/was broken, but when I tried fixing it, tglx
> objected to the entire approach yes...
> 
> Thomas, no arm twisting can convince you to reconsider the fake idle
> task approach?

As long as the thing just calls mwait or whatever twisting might be
successful, but I'm not going to change my mind on something calling
into the idle related routines (timers, rcu, etc..) and violates all
sensible assumptions we make in that code. It's complex enough already
and we really do not need subtle wreckage by half baken and unreviewed
drivers to add more complexity.

Thanks,

tglx



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Thomas Gleixner
On Thu, 5 Nov 2015, Arjan van de Ven wrote:
> On 11/5/2015 6:33 AM, Peter Zijlstra wrote:
> 
> > It just grates at me a bit that we have to touch hot paths for such
> scenarios :/
> 
> well we have this as a driver right now that does not touch hot paths,
> but it seems you and tglx also hate that approach with a passion

Right. It's a horror to deal with tasks which try to impersonate idle
while actually running as highest priority task in the system. We
really want the scheduler to know about it. Yes, we probably have to
pay the price for some extra check in the hot path, but that's way
more sensible than figuring out how 'want to be idle' RT tasks
wreckage the world and some more.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Peter Zijlstra
On Thu, Nov 05, 2015 at 07:28:50AM -0800, Arjan van de Ven wrote:
> well we have this as a driver right now that does not touch hot paths,
> but it seems you and tglx also hate that approach with a passion

The current code is/was broken, but when I tried fixing it, tglx
objected to the entire approach yes...

Thomas, no arm twisting can convince you to reconsider the fake idle
task approach?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Arjan van de Ven

On 11/5/2015 7:32 AM, Jacob Pan wrote:

On Thu, 5 Nov 2015 15:33:32 +0100
Peter Zijlstra  wrote:


On Thu, Nov 05, 2015 at 06:22:58AM -0800, Arjan van de Ven wrote:

On 11/5/2015 2:09 AM, Peter Zijlstra wrote:


I can see such a scheme having a fairly big impact on latency,
esp. with forced idleness such as this. That's not going to be
popular for many workloads.


idle injection is a last ditch effort in thermal management, before
this gets used the hardware already has clamped you to a low
frequency, reduced memory speeds, probably dimmed your screen etc
etc.


Just to clarify, the low frequency here is not necessarily the minimum
frequency. It is usually the Pe (max efficiency).


to translate that from Intelese to English:
The system already is at the lowest frequency that's relatively efficient. To 
go even lower in instant power
consumption (e.g. heat) by even a little bit, a LOT of frequency needs to be 
sacrificed.

Idle injection sucks. But it's more efficient (at the point that it would get 
used) than any other methods,
so it also sucks less than those other methods for the same amount of reduction 
in heat generation.
It only gets used if the system HAS to reduce the heat generation, either 
because it's a mobile device with
little cooling capacity, or because the airconditioning in your big datacenter 
is currently
not able to keep up.




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Jacob Pan
On Thu, 5 Nov 2015 15:33:32 +0100
Peter Zijlstra  wrote:

> On Thu, Nov 05, 2015 at 06:22:58AM -0800, Arjan van de Ven wrote:
> > On 11/5/2015 2:09 AM, Peter Zijlstra wrote:
> > 
> > >I can see such a scheme having a fairly big impact on latency,
> > >esp. with forced idleness such as this. That's not going to be
> > >popular for many workloads.
> > 
> > idle injection is a last ditch effort in thermal management, before
> > this gets used the hardware already has clamped you to a low
> > frequency, reduced memory speeds, probably dimmed your screen etc
> > etc.
> > 
Just to clarify, the low frequency here is not necessarily the minimum
frequency. It is usually the Pe (max efficiency).
> > at this point there are 3 choices
> > 1) Shut off the device
> > 2) do uncoordinated idle injection for 40% of the time
> > 3) do coordinated idle injection for 5% of the time
> > 
> > as much as force injecting idle in a synchronized way sucks, the
> > alternatives are worse.
> 
> OK, it wasn't put that way. I figured it was a way to use less power
> on any workload with idle time on.
> 
> That said; what kind of devices are we talking about here; mobile with
> pittyful heat dissipation? Surely a well designed server or desktop
> class system should never get into this situation in the first place.
Yes, Mobile devices, especially fan-less, are the targets. On one side
we all desire high performance, but it does not come free. The
performance tax might limit the ability to scale at the low end.
e.g. on skylake-y P1 is 1.5GHz, Pe(max efficiency, dynamic) is ~900MHz,
Pmin is 400Mhz.
When thermal limit occurs, there are two options
1. limit freq to Pmin 400Mhz and run 100%
2. let CPU run at ~800Mhz but inject idle at 50%

#2 option provides better performance per watt since it can scale
nearly linearly, i.e. 50% performance at 50% power. For my own limited
testing and this can vary greatly by parts, running at Pmin vs Pe can
lose 30% perf per watt.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Arjan van de Ven

On 11/5/2015 6:33 AM, Peter Zijlstra wrote:

On Thu, Nov 05, 2015 at 06:22:58AM -0800, Arjan van de Ven wrote:

On 11/5/2015 2:09 AM, Peter Zijlstra wrote:


I can see such a scheme having a fairly big impact on latency, esp. with
forced idleness such as this. That's not going to be popular for many
workloads.


idle injection is a last ditch effort in thermal management, before
this gets used the hardware already has clamped you to a low frequency,
reduced memory speeds, probably dimmed your screen etc etc.

at this point there are 3 choices
1) Shut off the device
2) do uncoordinated idle injection for 40% of the time
3) do coordinated idle injection for 5% of the time

as much as force injecting idle in a synchronized way sucks, the alternatives 
are worse.


OK, it wasn't put that way. I figured it was a way to use less power on
any workload with idle time on.


so idle injection (as with pretty much every thermal management feature) is NOT 
a way to save
on battery life. Every known method pretty much ends up sacrificing more in 
terms of performance
than you gain in instant power that over time you end up using more (drain 
battery basically).

idle injection, if synchronized, is one of the more effective ones, e.g. give 
up the least efficiency
compared to, say, unsynchronized or even inserting idle cycles in the CPU 
(T-states)...
not even speaking of just turning the system off.




That said; what kind of devices are we talking about here; mobile with
pittyful heat dissipation? Surely a well designed server or desktop
class system should never get into this situation in the first place.


a well designed server may not, but the datacenter it is in may.
for example if the AC goes out, but also, sometimes the datacenter's peak heat 
dissapation
can exceed the AC capacity (which is outside temperature dependent.. yay global 
warming),
which may require an urgent reduction over a series of machines for the 
duration of the peak load/peak temperature
(usually just inserting a little bit, say 1%, over all servers will do)



It just grates at me a bit that we have to touch hot paths for such

scenarios :/

well we have this as a driver right now that does not touch hot paths,
but it seems you and tglx also hate that approach with a passion




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Peter Zijlstra
On Thu, Nov 05, 2015 at 03:33:32PM +0100, Peter Zijlstra wrote:

> > idle injection is a last ditch effort in thermal management

It just grates at me a bit that we have to touch hot paths for such
scenarios :/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Peter Zijlstra
On Thu, Nov 05, 2015 at 06:22:58AM -0800, Arjan van de Ven wrote:
> On 11/5/2015 2:09 AM, Peter Zijlstra wrote:
> 
> >I can see such a scheme having a fairly big impact on latency, esp. with
> >forced idleness such as this. That's not going to be popular for many
> >workloads.
> 
> idle injection is a last ditch effort in thermal management, before
> this gets used the hardware already has clamped you to a low frequency,
> reduced memory speeds, probably dimmed your screen etc etc.
> 
> at this point there are 3 choices
> 1) Shut off the device
> 2) do uncoordinated idle injection for 40% of the time
> 3) do coordinated idle injection for 5% of the time
> 
> as much as force injecting idle in a synchronized way sucks, the alternatives 
> are worse.

OK, it wasn't put that way. I figured it was a way to use less power on
any workload with idle time on.

That said; what kind of devices are we talking about here; mobile with
pittyful heat dissipation? Surely a well designed server or desktop
class system should never get into this situation in the first place.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Arjan van de Ven

On 11/5/2015 2:09 AM, Peter Zijlstra wrote:


I can see such a scheme having a fairly big impact on latency, esp. with
forced idleness such as this. That's not going to be popular for many
workloads.


idle injection is a last ditch effort in thermal management, before
this gets used the hardware already has clamped you to a low frequency,
reduced memory speeds, probably dimmed your screen etc etc.

at this point there are 3 choices
1) Shut off the device
2) do uncoordinated idle injection for 40% of the time
3) do coordinated idle injection for 5% of the time

as much as force injecting idle in a synchronized way sucks, the alternatives 
are worse.






--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Peter Zijlstra
On Tue, Nov 03, 2015 at 02:31:20PM +0100, Peter Zijlstra wrote:
> > @@ -5136,6 +5148,16 @@ pick_next_task_fair(struct rq *rq, struct 
> > task_struct *prev)
> > struct task_struct *p;
> > int new_tasks;
> >  
> > +#ifdef CONFIG_CFS_IDLE_INJECT
> > +   if (cfs_rq->force_throttled &&
> > +   !idle_cpu(cpu_of(rq)) &&
> > +   !unlikely(local_softirq_pending())) {
> > +   /* forced idle, pick no task */
> > +   trace_sched_cfs_idle_inject(cpu_of(rq), 1);
> > +   update_curr(cfs_rq);
> > +   return NULL;
> > +   }
> > +#endif
> >  again:
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > if (!cfs_rq->nr_running)
> 
> So this is horrible...

So this isn't ideal either (I rather liked the previous approach of a
random task assuming idle, but tglx hated that). This should at least
not touch extra cachelines in the hot paths, although it does add a few
extra instructions :/

Very limited testing didn't show anything horrible.

Your throttle would:

raw_spin_lock_irqsave(&rq->lock, flags);
rq->cfs.forced_idle = true;
resched = rq->cfs.runnable;
rq->cfs.runnable = false;
raw_spin_unlock_irqrestore(&rq->lock, flags);
if (resched)
  resched_cpu(cpu_of(rq));

And your unthrottle:

raw_spin_lock_irqsave(&rq->lock, flags);
rq->cfs.forced_idle = false;
resched = rq->cfs.runnable = !!rq->cfs.nr_running;
raw_spin_unlock_irqrestore(&rq->lock, flags);
if (resched)
  resched_cpu(cpu_of(rq));

---
 kernel/sched/fair.c  |   13 +
 kernel/sched/sched.h |1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 824aa9f..1f0c809 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2341,7 +2341,8 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
list_add(&se->group_node, &rq->cfs_tasks);
}
 #endif
-   cfs_rq->nr_running++;
+   if (!cfs_rq->nr_running++ && !cfs_rq->forced_idle)
+   cfs_rq->runnable = true;
 }
 
 static void
@@ -2354,7 +2355,8 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
account_numa_dequeue(rq_of(cfs_rq), task_of(se));
list_del_init(&se->group_node);
}
-   cfs_rq->nr_running--;
+   if (!--cfs_rq->nr_running && !cfs_rq->forced_idle)
+   cfs_rq->runnable = false;
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -5204,7 +5206,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct 
*prev)
 
 again:
 #ifdef CONFIG_FAIR_GROUP_SCHED
-   if (!cfs_rq->nr_running)
+   if (!cfs_rq->runnable)
goto idle;
 
if (prev->sched_class != &fair_sched_class)
@@ -5283,7 +5285,7 @@ simple:
cfs_rq = &rq->cfs;
 #endif
 
-   if (!cfs_rq->nr_running)
+   if (!cfs_rq->runnable)
goto idle;
 
put_prev_task(rq, prev);
@@ -5302,6 +5304,9 @@ simple:
return p;
 
 idle:
+   if (cfs_rq->forced_idle)
+   return NULL;
+
/*
 * This is OK, because current is on_cpu, which avoids it being picked
 * for load-balance and preemption/IRQs are still disabled avoiding
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efd3bfc..33d355d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -347,6 +347,7 @@ struct cfs_bandwidth { };
 struct cfs_rq {
struct load_weight load;
unsigned int nr_running, h_nr_running;
+   unsigned int runnable, forced_idle;
 
u64 exec_clock;
u64 min_vruntime;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-05 Thread Peter Zijlstra
On Tue, Nov 03, 2015 at 08:45:01AM -0800, Jacob Pan wrote:
> Fair enough, I will give that a try, I guess it would be costly
> and hard to scale if we were to dequeue/enqueue every se for every
> period of injection plus locking. Let me get some data first.

Yeah, don't go dequeue/enqueue everything, the regular throttle code
doesn't do that either.

Throttling the root is more interesting though, becuase you can simply
dequeue the child group representative, whereas you cannot dequeue the
root (for being the root means there is no parent to dequeue from).

> I understand we don't want to sacrifice the hot patch for some code
> almost 'never' run. But I also have follow up plan to use this code for
> consolidating/synchronizing idle during balanced semi-active workload.
> In that case, it may run more often. e.g.
> 
> Before:
> CPU0 __||| ||  |___| || || |_
> CPU1 _||| ||  |___| || |___
> 
> After:
> 
> CPU0 __||| ||  |___| || || |_
> CPU1 __||| ||  |___| || |___
> 
> The goal is to have overlapping idle time if the load is already
> balanced. The energy saving can be significant.

I can see such a scheme having a fairly big impact on latency, esp. with
forced idleness such as this. That's not going to be popular for many
workloads.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-03 Thread Jacob Pan
On Tue, 3 Nov 2015 14:31:20 +0100
Peter Zijlstra  wrote:

> > @@ -5136,6 +5148,16 @@ pick_next_task_fair(struct rq *rq, struct
> > task_struct *prev) struct task_struct *p;
> > int new_tasks;
> >  
> > +#ifdef CONFIG_CFS_IDLE_INJECT
> > +   if (cfs_rq->force_throttled &&
> > +   !idle_cpu(cpu_of(rq)) &&
> > +   !unlikely(local_softirq_pending())) {
> > +   /* forced idle, pick no task */
> > +   trace_sched_cfs_idle_inject(cpu_of(rq), 1);
> > +   update_curr(cfs_rq);
> > +   return NULL;
> > +   }
> > +#endif
> >  again:
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > if (!cfs_rq->nr_running)
> 
> So this is horrible...
> 
> This is a fast path, and you just put at least one cachemiss in it, a
> branch (without hint) and some goofy code (wth are we checking
> softirqs?).
> 
softirq is checked here since it is one of the conditions to stop
sched tick. can_stop_idle_tick(). but we don't have to check here, you
are right.
> How about you frob things such that cfs_rq->nr_running == 0 and we'll
> hit the idle: path, at that point you can test if we're forced idle
> and skip the load-balancing attempt.
> 
> There's probably a fair number of icky cases to deal with if you frob
> cfs_rq->nr_running, like the enqueue path which will add to it. We'll
> have to come up with something to not slow that down either.
> 
> The thing is, both schedule and enqueue are very hot and this is code
> that will 'never' run.
Fair enough, I will give that a try, I guess it would be costly
and hard to scale if we were to dequeue/enqueue every se for every
period of injection plus locking. Let me get some data first.

I understand we don't want to sacrifice the hot patch for some code
almost 'never' run. But I also have follow up plan to use this code for
consolidating/synchronizing idle during balanced semi-active workload.
In that case, it may run more often. e.g.

Before:
CPU0 __||| ||  |___| || || |_
CPU1 _||| ||  |___| || |___

After:

CPU0 __||| ||  |___| || || |_
CPU1 __||| ||  |___| || |___

The goal is to have overlapping idle time if the load is already
balanced. The energy saving can be significant.

Jacob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-03 Thread Jacob Pan
On Tue, 3 Nov 2015 14:31:20 +0100
Peter Zijlstra  wrote:

> > @@ -5136,6 +5148,16 @@ pick_next_task_fair(struct rq *rq, struct
> > task_struct *prev) struct task_struct *p;
> > int new_tasks;
> >  
> > +#ifdef CONFIG_CFS_IDLE_INJECT
> > +   if (cfs_rq->force_throttled &&
> > +   !idle_cpu(cpu_of(rq)) &&
> > +   !unlikely(local_softirq_pending())) {
> > +   /* forced idle, pick no task */
> > +   trace_sched_cfs_idle_inject(cpu_of(rq), 1);
> > +   update_curr(cfs_rq);
> > +   return NULL;
> > +   }
> > +#endif
> >  again:
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > if (!cfs_rq->nr_running)
> 
> So this is horrible...
> 
> This is a fast path, and you just put at least one cachemiss in it, a
> branch (without hint) and some goofy code (wth are we checking
> softirqs?).
> 
softirq is checked here since it is one of the conditions to stop
sched tick. can_stop_idle_tick(). but we don't have to check here, you
are right.
> How about you frob things such that cfs_rq->nr_running == 0 and we'll
> hit the idle: path, at that point you can test if we're forced idle
> and skip the load-balancing attempt.
> 
> There's probably a fair number of icky cases to deal with if you frob
> cfs_rq->nr_running, like the enqueue path which will add to it. We'll
> have to come up with something to not slow that down either.
> 
> The thing is, both schedule and enqueue are very hot and this is code
> that will 'never' run.
Fair enough, I will give that a try. I understand we don't want to
sacrifice the hot patch for some code almost 'never' run. But I also
have follow up plan to use this code for consolidating/synchronizing
idle during balanced semi-active workload. In that case, it may run
more often. e.g.

Before:
CPU0 __||| ||  |___| || || |_
CPU1 _||| ||  |___| || |___

After:

CPU0 __||| ||  |___| || || |_
CPU1 __||| ||  |___| || |___

The goal is to have overlapping idle time if the load is already
balanced. The energy saving can be significant.

Jacob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 3/3] sched: introduce synchronized idle injection

2015-11-03 Thread Peter Zijlstra
> @@ -5136,6 +5148,16 @@ pick_next_task_fair(struct rq *rq, struct task_struct 
> *prev)
>   struct task_struct *p;
>   int new_tasks;
>  
> +#ifdef CONFIG_CFS_IDLE_INJECT
> + if (cfs_rq->force_throttled &&
> + !idle_cpu(cpu_of(rq)) &&
> + !unlikely(local_softirq_pending())) {
> + /* forced idle, pick no task */
> + trace_sched_cfs_idle_inject(cpu_of(rq), 1);
> + update_curr(cfs_rq);
> + return NULL;
> + }
> +#endif
>  again:
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>   if (!cfs_rq->nr_running)

So this is horrible...

This is a fast path, and you just put at least one cachemiss in it, a
branch (without hint) and some goofy code (wth are we checking
softirqs?).

How about you frob things such that cfs_rq->nr_running == 0 and we'll
hit the idle: path, at that point you can test if we're forced idle and
skip the load-balancing attempt.

There's probably a fair number of icky cases to deal with if you frob
cfs_rq->nr_running, like the enqueue path which will add to it. We'll
have to come up with something to not slow that down either.

The thing is, both schedule and enqueue are very hot and this is code
that will 'never' run.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/