Re: [RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states

2013-08-22 Thread Preeti U Murthy
Hi Ben,

On 08/22/2013 08:57 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2013-08-14 at 17:26 +0530, Preeti U Murthy wrote:
> 
>>  static irqreturn_t timer_action(int irq, void *data)
>>  {
>> -timer_interrupt();
>> +decrementer_timer_interrupt();
>>  return IRQ_HANDLED;
>>  }
> 
> I don't completely understand what you are doing here, but ...
> 
>> @@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)
>>  
>>  #ifdef __BIG_ENDIAN
>>  if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
>> -timer_interrupt();
>> +decrementer_timer_interrupt();
> 
> Why call this decrementer_* since it's specifically *not* the
> decrementer ?
>
> Makes more sense to be called broadcast_timer_interrupt() no ?

A broadcast IPI is meant to trigger timer interrupt handling on the
target CPUs. In deep idle states, even though the local timers of CPUs
become non-functional, it should not make a difference to them because
of the broadcast framework's help.

The broadcast framework is meant to hide the fact that the CPUs in deep
idle states require external help to wakeup on their expired timer
events. So the CPUs should wake up to find themselves handling the
timers under such a scenario, although they woke up from an IPI.

This whole idea gets conveyed by naming the handler of the broadcast IPI
to decrementer_timer_interrupt().

That said, ideally it should have been called timer_interrupt(). But
since we already have the timer interrupt handler with the same name,
and also we cannot call it directly for reasons mentioned in the reply
to your review on PATCH 2/6, I named it decrementer_timer_interrupt() to
come close to conveying the idea. This calls only the timer interrupt
handler there.

> 
>>  if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
>>  scheduler_ipi();
>>  if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>> index 65ab9e9..7e858e1 100644
>> --- a/arch/powerpc/kernel/time.c
>> +++ b/arch/powerpc/kernel/time.c
>> @@ -42,6 +42,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {
>>  
>>  static int decrementer_set_next_event(unsigned long evt,
>>struct clock_event_device *dev);
>> +static int broadcast_set_next_event(unsigned long evt,
>> +  struct clock_event_device *dev);
>>  static void decrementer_set_mode(enum clock_event_mode mode,
>>   struct clock_event_device *dev);
>> +static void decrementer_timer_broadcast(const struct cpumask *mask);
>>  
>>  struct clock_event_device decrementer_clockevent = {
>>  .name   = "decrementer",
>> @@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
>>  .irq= 0,
>>  .set_next_event = decrementer_set_next_event,
>>  .set_mode   = decrementer_set_mode,
>> -.features   = CLOCK_EVT_FEAT_ONESHOT,
>> +.broadcast  = decrementer_timer_broadcast,
>> +.features   = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT,
>>  };
>>  EXPORT_SYMBOL(decrementer_clockevent);
>>  
>> +struct clock_event_device broadcast_clockevent = {
>> +.name   = "broadcast",
>> +.rating = 200,
>> +.irq= 0,
>> +.set_next_event = broadcast_set_next_event,
>> +.set_mode   = decrementer_set_mode,
> 
> Same here, why "decrementer" ? This event device is *not* the
> decrementer right ?

You are right. In this case it should have been broadcast_set_mode. This
is because this function is associated with the sender(the broadcast cpu).

> 
> Also, pardon my ignorance, by why do we need a separate
> clock_event_device ? Ie what does that do ? I am not familiar with the
> broadcast scheme and what .broadcast do in the "decrementer" one, so
> you need to provide me at least with better explanations.

The answer in short to why we need an additional clock event device is
that, in this patchset, we try to integrate the support for deep sleep
states in power, with the broadcast framework existent today in the
kernel without any changes to this broadcast framework and trying our
best to adapt to it. Let me now elaborate.

The broadcast framework kicks in if there is a broadcast clock event
device. We can double up the local timer/decrementer of a CPU as the
broadcast clock event device itself. But the broadcast framework
considers a clock event device an eligible candidate for broadcast, if
the feature CLOCK_EVT_FEAT_C3STOP is not associated with it. This
feature says that the clock event device stops in deep idle states. The
broadcast framework expects the clock event device which takes up the
responsibility of doing broadcast to be available and on at all points
in time.
  The decrementer clock event device 

Re: [RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states

2013-08-22 Thread Preeti U Murthy
Hi Ben,

On 08/22/2013 08:57 AM, Benjamin Herrenschmidt wrote:
 On Wed, 2013-08-14 at 17:26 +0530, Preeti U Murthy wrote:
 
  static irqreturn_t timer_action(int irq, void *data)
  {
 -timer_interrupt();
 +decrementer_timer_interrupt();
  return IRQ_HANDLED;
  }
 
 I don't completely understand what you are doing here, but ...
 
 @@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)
  
  #ifdef __BIG_ENDIAN
  if (all  (1  (24 - 8 * PPC_MSG_TIMER)))
 -timer_interrupt();
 +decrementer_timer_interrupt();
 
 Why call this decrementer_* since it's specifically *not* the
 decrementer ?

 Makes more sense to be called broadcast_timer_interrupt() no ?

A broadcast IPI is meant to trigger timer interrupt handling on the
target CPUs. In deep idle states, even though the local timers of CPUs
become non-functional, it should not make a difference to them because
of the broadcast framework's help.

The broadcast framework is meant to hide the fact that the CPUs in deep
idle states require external help to wakeup on their expired timer
events. So the CPUs should wake up to find themselves handling the
timers under such a scenario, although they woke up from an IPI.

This whole idea gets conveyed by naming the handler of the broadcast IPI
to decrementer_timer_interrupt().

That said, ideally it should have been called timer_interrupt(). But
since we already have the timer interrupt handler with the same name,
and also we cannot call it directly for reasons mentioned in the reply
to your review on PATCH 2/6, I named it decrementer_timer_interrupt() to
come close to conveying the idea. This calls only the timer interrupt
handler there.

 
  if (all  (1  (24 - 8 * PPC_MSG_RESCHEDULE)))
  scheduler_ipi();
  if (all  (1  (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
 diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
 index 65ab9e9..7e858e1 100644
 --- a/arch/powerpc/kernel/time.c
 +++ b/arch/powerpc/kernel/time.c
 @@ -42,6 +42,7 @@
  #include linux/timex.h
  #include linux/kernel_stat.h
  #include linux/time.h
 +#include linux/timer.h
  #include linux/init.h
  #include linux/profile.h
  #include linux/cpu.h
 @@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {
  
  static int decrementer_set_next_event(unsigned long evt,
struct clock_event_device *dev);
 +static int broadcast_set_next_event(unsigned long evt,
 +  struct clock_event_device *dev);
  static void decrementer_set_mode(enum clock_event_mode mode,
   struct clock_event_device *dev);
 +static void decrementer_timer_broadcast(const struct cpumask *mask);
  
  struct clock_event_device decrementer_clockevent = {
  .name   = decrementer,
 @@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
  .irq= 0,
  .set_next_event = decrementer_set_next_event,
  .set_mode   = decrementer_set_mode,
 -.features   = CLOCK_EVT_FEAT_ONESHOT,
 +.broadcast  = decrementer_timer_broadcast,
 +.features   = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT,
  };
  EXPORT_SYMBOL(decrementer_clockevent);
  
 +struct clock_event_device broadcast_clockevent = {
 +.name   = broadcast,
 +.rating = 200,
 +.irq= 0,
 +.set_next_event = broadcast_set_next_event,
 +.set_mode   = decrementer_set_mode,
 
 Same here, why decrementer ? This event device is *not* the
 decrementer right ?

You are right. In this case it should have been broadcast_set_mode. This
is because this function is associated with the sender(the broadcast cpu).

 
 Also, pardon my ignorance, by why do we need a separate
 clock_event_device ? Ie what does that do ? I am not familiar with the
 broadcast scheme and what .broadcast do in the decrementer one, so
 you need to provide me at least with better explanations.

The answer in short to why we need an additional clock event device is
that, in this patchset, we try to integrate the support for deep sleep
states in power, with the broadcast framework existent today in the
kernel without any changes to this broadcast framework and trying our
best to adapt to it. Let me now elaborate.

The broadcast framework kicks in if there is a broadcast clock event
device. We can double up the local timer/decrementer of a CPU as the
broadcast clock event device itself. But the broadcast framework
considers a clock event device an eligible candidate for broadcast, if
the feature CLOCK_EVT_FEAT_C3STOP is not associated with it. This
feature says that the clock event device stops in deep idle states. The
broadcast framework expects the clock event device which takes up the
responsibility of doing broadcast to be available and on at all points
in time.
  The decrementer clock event device however is susceptible to
non-availability in deep idle 

Re: [RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states

2013-08-21 Thread Benjamin Herrenschmidt
On Wed, 2013-08-14 at 17:26 +0530, Preeti U Murthy wrote:

>  static irqreturn_t timer_action(int irq, void *data)
>  {
> - timer_interrupt();
> + decrementer_timer_interrupt();
>   return IRQ_HANDLED;
>  }

I don't completely understand what you are doing here, but ...

> @@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)
>  
>  #ifdef __BIG_ENDIAN
>   if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
> - timer_interrupt();
> + decrementer_timer_interrupt();

Why call this decrementer_* since it's specifically *not* the
decrementer ?

Makes more sense to be called broadcast_timer_interrupt() no ?

>   if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
>   scheduler_ipi();
>   if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 65ab9e9..7e858e1 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {
>  
>  static int decrementer_set_next_event(unsigned long evt,
> struct clock_event_device *dev);
> +static int broadcast_set_next_event(unsigned long evt,
> +   struct clock_event_device *dev);
>  static void decrementer_set_mode(enum clock_event_mode mode,
>struct clock_event_device *dev);
> +static void decrementer_timer_broadcast(const struct cpumask *mask);
>  
>  struct clock_event_device decrementer_clockevent = {
>   .name   = "decrementer",
> @@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
>   .irq= 0,
>   .set_next_event = decrementer_set_next_event,
>   .set_mode   = decrementer_set_mode,
> - .features   = CLOCK_EVT_FEAT_ONESHOT,
> + .broadcast  = decrementer_timer_broadcast,
> + .features   = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT,
>  };
>  EXPORT_SYMBOL(decrementer_clockevent);
>  
> +struct clock_event_device broadcast_clockevent = {
> + .name   = "broadcast",
> + .rating = 200,
> + .irq= 0,
> + .set_next_event = broadcast_set_next_event,
> + .set_mode   = decrementer_set_mode,

Same here, why "decrementer" ? This event device is *not* the
decrementer right ?

Also, pardon my ignorance, by why do we need a separate
clock_event_device ? Ie what does that do ? I am not familiar with the
broadcast scheme and what .broadcast do in the "decrementer" one, so
you need to provide me at least with better explanations.

> + .features   = CLOCK_EVT_FEAT_ONESHOT,
> +};
> +EXPORT_SYMBOL(broadcast_clockevent);
> +
>  DEFINE_PER_CPU(u64, decrementers_next_tb);
>  static DEFINE_PER_CPU(struct clock_event_device, decrementers);
> +static DEFINE_PER_CPU(struct clock_event_device, bc_timer);

Do we really need an additional per CPU here ? What does the bc_timer
actually represent ? The sender (broadcaster) or receiver ? In the
latter case, why does it have to differ from the decrementer ?

> +int bc_cpu;

A global with that name ? Exported ? That's gross...

>  #define XSEC_PER_SEC (1024*1024)
>  
>  #ifdef CONFIG_PPC64
> @@ -487,6 +504,8 @@ void timer_interrupt(struct pt_regs * regs)
>   struct pt_regs *old_regs;
>   u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
>   struct clock_event_device *evt = &__get_cpu_var(decrementers);
> + struct clock_event_device *bc_evt = &__get_cpu_var(bc_timer);
> + int cpu = smp_processor_id();
>   u64 now;
>  
>   /* Ensure a positive value is written to the decrementer, or else
> @@ -532,6 +551,10 @@ void timer_interrupt(struct pt_regs * regs)
>   *next_tb = ~(u64)0;
>   if (evt->event_handler)
>   evt->event_handler(evt);
> + if (cpu == bc_cpu && bc_evt->event_handler) {
> + bc_evt->event_handler(bc_evt);
> + }
> +

So here, on a CPU that happens to be "bc_cpu", we call an additional
"event_handler" on every timer interrupt ? What does that handler
actually do ? How does it relate to the "broadcast" field in the
original clock source ?

>   } else {
>   now = *next_tb - now;
>   if (now <= DECREMENTER_MAX)
> @@ -806,6 +829,20 @@ static int decrementer_set_next_event(unsigned long evt,
>   return 0;
>  }
>  
> +/*
> + * We cannot program the decrementer of a remote CPU. Hence CPUs going into
> + * deep idle states need to send IPIs to the broadcast CPU to program its
> + * decrementer for their next local event so as to receive a broadcast IPI
> + * for the same. In order to avoid the overhead of multiple CPUs from sending
> + * IPIs, this function is a nop. Instead the broadcast CPU will handle 

Re: [RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states

2013-08-21 Thread Benjamin Herrenschmidt
On Wed, 2013-08-14 at 17:26 +0530, Preeti U Murthy wrote:

  static irqreturn_t timer_action(int irq, void *data)
  {
 - timer_interrupt();
 + decrementer_timer_interrupt();
   return IRQ_HANDLED;
  }

I don't completely understand what you are doing here, but ...

 @@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)
  
  #ifdef __BIG_ENDIAN
   if (all  (1  (24 - 8 * PPC_MSG_TIMER)))
 - timer_interrupt();
 + decrementer_timer_interrupt();

Why call this decrementer_* since it's specifically *not* the
decrementer ?

Makes more sense to be called broadcast_timer_interrupt() no ?

   if (all  (1  (24 - 8 * PPC_MSG_RESCHEDULE)))
   scheduler_ipi();
   if (all  (1  (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
 diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
 index 65ab9e9..7e858e1 100644
 --- a/arch/powerpc/kernel/time.c
 +++ b/arch/powerpc/kernel/time.c
 @@ -42,6 +42,7 @@
  #include linux/timex.h
  #include linux/kernel_stat.h
  #include linux/time.h
 +#include linux/timer.h
  #include linux/init.h
  #include linux/profile.h
  #include linux/cpu.h
 @@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {
  
  static int decrementer_set_next_event(unsigned long evt,
 struct clock_event_device *dev);
 +static int broadcast_set_next_event(unsigned long evt,
 +   struct clock_event_device *dev);
  static void decrementer_set_mode(enum clock_event_mode mode,
struct clock_event_device *dev);
 +static void decrementer_timer_broadcast(const struct cpumask *mask);
  
  struct clock_event_device decrementer_clockevent = {
   .name   = decrementer,
 @@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
   .irq= 0,
   .set_next_event = decrementer_set_next_event,
   .set_mode   = decrementer_set_mode,
 - .features   = CLOCK_EVT_FEAT_ONESHOT,
 + .broadcast  = decrementer_timer_broadcast,
 + .features   = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT,
  };
  EXPORT_SYMBOL(decrementer_clockevent);
  
 +struct clock_event_device broadcast_clockevent = {
 + .name   = broadcast,
 + .rating = 200,
 + .irq= 0,
 + .set_next_event = broadcast_set_next_event,
 + .set_mode   = decrementer_set_mode,

Same here, why decrementer ? This event device is *not* the
decrementer right ?

Also, pardon my ignorance, by why do we need a separate
clock_event_device ? Ie what does that do ? I am not familiar with the
broadcast scheme and what .broadcast do in the decrementer one, so
you need to provide me at least with better explanations.

 + .features   = CLOCK_EVT_FEAT_ONESHOT,
 +};
 +EXPORT_SYMBOL(broadcast_clockevent);
 +
  DEFINE_PER_CPU(u64, decrementers_next_tb);
  static DEFINE_PER_CPU(struct clock_event_device, decrementers);
 +static DEFINE_PER_CPU(struct clock_event_device, bc_timer);

Do we really need an additional per CPU here ? What does the bc_timer
actually represent ? The sender (broadcaster) or receiver ? In the
latter case, why does it have to differ from the decrementer ?

 +int bc_cpu;

A global with that name ? Exported ? That's gross...

  #define XSEC_PER_SEC (1024*1024)
  
  #ifdef CONFIG_PPC64
 @@ -487,6 +504,8 @@ void timer_interrupt(struct pt_regs * regs)
   struct pt_regs *old_regs;
   u64 *next_tb = __get_cpu_var(decrementers_next_tb);
   struct clock_event_device *evt = __get_cpu_var(decrementers);
 + struct clock_event_device *bc_evt = __get_cpu_var(bc_timer);
 + int cpu = smp_processor_id();
   u64 now;
  
   /* Ensure a positive value is written to the decrementer, or else
 @@ -532,6 +551,10 @@ void timer_interrupt(struct pt_regs * regs)
   *next_tb = ~(u64)0;
   if (evt-event_handler)
   evt-event_handler(evt);
 + if (cpu == bc_cpu  bc_evt-event_handler) {
 + bc_evt-event_handler(bc_evt);
 + }
 +

So here, on a CPU that happens to be bc_cpu, we call an additional
event_handler on every timer interrupt ? What does that handler
actually do ? How does it relate to the broadcast field in the
original clock source ?

   } else {
   now = *next_tb - now;
   if (now = DECREMENTER_MAX)
 @@ -806,6 +829,20 @@ static int decrementer_set_next_event(unsigned long evt,
   return 0;
  }
  
 +/*
 + * We cannot program the decrementer of a remote CPU. Hence CPUs going into
 + * deep idle states need to send IPIs to the broadcast CPU to program its
 + * decrementer for their next local event so as to receive a broadcast IPI
 + * for the same. In order to avoid the overhead of multiple CPUs from sending
 + * IPIs, this function is a nop. Instead the broadcast CPU will handle the
 + * wakeup of CPUs in deep 

[RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states

2013-08-14 Thread Preeti U Murthy
On ppc, in deep idle states, the local clock event device of CPUs gets
switched off. On PowerPC, the local clock event device is called the
decrementer. Make use of the broadcast framework to issue interrupts to
cpus in deep idle states on their timer events, except that on ppc, we
do not have an external device such as HPET, but we use the decrementer
of one of the CPUs itself as the broadcast device.

Instantiate two different clock event devices, one representing the
decrementer and another representing the broadcast device for each cpu.
The cpu which registers its broadcast device will be responsible for
performing the function of issuing timer interrupts to CPUs in deep idle
states, and is referred to as the broadcast cpu/bc_cpu in the changelogs of this
patchset for convenience. Such a CPU is not allowed to enter deep idle
states, where the decrementer is switched off.

For now, only the boot cpu's broadcast device gets registered as a clock event
device along with the decrementer. Hence this is the broadcast cpu.

On the broadcast cpu, on each timer interrupt, apart from the regular local
timer event handler the broadcast handler is also called. We avoid the overhead
of programming the decrementer specifically for a broadcast event. The reason 
is for
performance and scalability reasons. Say cpuX goes to deep idle state. It
has to ask the broadcast CPU to reprogram its(broadcast CPU's) decrementer for
the next local timer event of cpuX. cpuX can do so only by sending an IPI to the
broadcast CPU. With many more cpus going to deep idle, this model of sending
IPIs each time will result in performance bottleneck and may not scale well.

Apart from this there is no change in the way broadcast is handled today. On
a broadcast ipi the event handler for a timer interrupt is called on the cpu
in deep idle state to handle the local events.

The current design and implementation of the timer offload framework supports
the ONESHOT tick mode but not the PERIODIC mode.

Signed-off-by: Preeti U. Murthy 
---

 arch/powerpc/include/asm/time.h|3 +
 arch/powerpc/kernel/smp.c  |4 +-
 arch/powerpc/kernel/time.c |   81 
 arch/powerpc/platforms/powernv/Kconfig |1 
 4 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..936be0d 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -24,14 +24,17 @@ extern unsigned long tb_ticks_per_jiffy;
 extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
 extern struct clock_event_device decrementer_clockevent;
+extern struct clock_event_device broadcast_clockevent;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void decrementer_timer_interrupt(void);
 
 extern void generic_calibrate_decr(void);
 
 extern void set_dec_cpu6(unsigned int val);
+extern int bc_cpu;
 
 /* Some sane defaults: 125 MHz timebase, 1GHz processor */
 extern unsigned long ppc_proc_freq;
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6a68ca4..d3b7014 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -114,7 +114,7 @@ int smp_generic_kick_cpu(int nr)
 
 static irqreturn_t timer_action(int irq, void *data)
 {
-   timer_interrupt();
+   decrementer_timer_interrupt();
return IRQ_HANDLED;
 }
 
@@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)
 
 #ifdef __BIG_ENDIAN
if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
-   timer_interrupt();
+   decrementer_timer_interrupt();
if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 65ab9e9..7e858e1 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {
 
 static int decrementer_set_next_event(unsigned long evt,
  struct clock_event_device *dev);
+static int broadcast_set_next_event(unsigned long evt,
+ struct clock_event_device *dev);
 static void decrementer_set_mode(enum clock_event_mode mode,
 struct clock_event_device *dev);
+static void decrementer_timer_broadcast(const struct cpumask *mask);
 
 struct clock_event_device decrementer_clockevent = {
.name   = "decrementer",
@@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
.irq= 0,
.set_next_event = decrementer_set_next_event,
.set_mode   = 

[RFC V2 PATCH 3/6] cpuidle/ppc: Add timer offload framework to support deep idle states

2013-08-14 Thread Preeti U Murthy
On ppc, in deep idle states, the local clock event device of CPUs gets
switched off. On PowerPC, the local clock event device is called the
decrementer. Make use of the broadcast framework to issue interrupts to
cpus in deep idle states on their timer events, except that on ppc, we
do not have an external device such as HPET, but we use the decrementer
of one of the CPUs itself as the broadcast device.

Instantiate two different clock event devices, one representing the
decrementer and another representing the broadcast device for each cpu.
The cpu which registers its broadcast device will be responsible for
performing the function of issuing timer interrupts to CPUs in deep idle
states, and is referred to as the broadcast cpu/bc_cpu in the changelogs of this
patchset for convenience. Such a CPU is not allowed to enter deep idle
states, where the decrementer is switched off.

For now, only the boot cpu's broadcast device gets registered as a clock event
device along with the decrementer. Hence this is the broadcast cpu.

On the broadcast cpu, on each timer interrupt, apart from the regular local
timer event handler the broadcast handler is also called. We avoid the overhead
of programming the decrementer specifically for a broadcast event. The reason 
is for
performance and scalability reasons. Say cpuX goes to deep idle state. It
has to ask the broadcast CPU to reprogram its(broadcast CPU's) decrementer for
the next local timer event of cpuX. cpuX can do so only by sending an IPI to the
broadcast CPU. With many more cpus going to deep idle, this model of sending
IPIs each time will result in performance bottleneck and may not scale well.

Apart from this there is no change in the way broadcast is handled today. On
a broadcast ipi the event handler for a timer interrupt is called on the cpu
in deep idle state to handle the local events.

The current design and implementation of the timer offload framework supports
the ONESHOT tick mode but not the PERIODIC mode.

Signed-off-by: Preeti U. Murthy pre...@linux.vnet.ibm.com
---

 arch/powerpc/include/asm/time.h|3 +
 arch/powerpc/kernel/smp.c  |4 +-
 arch/powerpc/kernel/time.c |   81 
 arch/powerpc/platforms/powernv/Kconfig |1 
 4 files changed, 86 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..936be0d 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -24,14 +24,17 @@ extern unsigned long tb_ticks_per_jiffy;
 extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
 extern struct clock_event_device decrementer_clockevent;
+extern struct clock_event_device broadcast_clockevent;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void decrementer_timer_interrupt(void);
 
 extern void generic_calibrate_decr(void);
 
 extern void set_dec_cpu6(unsigned int val);
+extern int bc_cpu;
 
 /* Some sane defaults: 125 MHz timebase, 1GHz processor */
 extern unsigned long ppc_proc_freq;
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6a68ca4..d3b7014 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -114,7 +114,7 @@ int smp_generic_kick_cpu(int nr)
 
 static irqreturn_t timer_action(int irq, void *data)
 {
-   timer_interrupt();
+   decrementer_timer_interrupt();
return IRQ_HANDLED;
 }
 
@@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)
 
 #ifdef __BIG_ENDIAN
if (all  (1  (24 - 8 * PPC_MSG_TIMER)))
-   timer_interrupt();
+   decrementer_timer_interrupt();
if (all  (1  (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all  (1  (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 65ab9e9..7e858e1 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -42,6 +42,7 @@
 #include linux/timex.h
 #include linux/kernel_stat.h
 #include linux/time.h
+#include linux/timer.h
 #include linux/init.h
 #include linux/profile.h
 #include linux/cpu.h
@@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {
 
 static int decrementer_set_next_event(unsigned long evt,
  struct clock_event_device *dev);
+static int broadcast_set_next_event(unsigned long evt,
+ struct clock_event_device *dev);
 static void decrementer_set_mode(enum clock_event_mode mode,
 struct clock_event_device *dev);
+static void decrementer_timer_broadcast(const struct cpumask *mask);
 
 struct clock_event_device decrementer_clockevent = {
.name   = decrementer,
@@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
.irq