Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
Hi Nicolas,

Find below the patch that will need to be squashed with this one. This patch
is based on the mainline. Adding Deepthi, the author of the patch which
introduced the powernv cpuidle driver. Deepthi, do you think the below patch
looks right? We do not need to do an explicit local_irq_enable() since we are
in the call path of the cpuidle driver and that explicitly enables irqs on
exit from idle states.

On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
> On Thu, 6 Feb 2014, Preeti U Murthy wrote:
>
>> Hi Daniel,
>>
>> On 02/06/2014 09:55 PM, Daniel Lezcano wrote:
>>> Hi Nico,
>>>
>>>
>>> On 6 February 2014 14:16, Nicolas Pitre wrote:
>>>
>>>> The core idle loop now takes care of it.
>>>>
>>>> Signed-off-by: Nicolas Pitre
>>>> ---
>>>>  arch/powerpc/platforms/powernv/setup.c | 13 +
>>>>  1 file changed, 1 insertion(+), 12 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/platforms/powernv/setup.c
>>>> b/arch/powerpc/platforms/powernv/setup.c
>>>> index 21166f65c9..a932feb290 100644
>>>> --- a/arch/powerpc/platforms/powernv/setup.c
>>>> +++ b/arch/powerpc/platforms/powernv/setup.c
>>>> @@ -26,7 +26,6 @@
>>>>  #include
>>>>  #include
>>>>  #include
>>>> -#include
>>>>
>>>>  #include
>>>>  #include
>>>> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
>>>>  	return 1;
>>>>  }
>>>>
>>>> -void powernv_idle(void)
>>>> -{
>>>> -	/* Hook to cpuidle framework if available, else
>>>> -	 * call on default platform idle code
>>>> -	 */
>>>> -	if (cpuidle_idle_call()) {
>>>> -		power7_idle();
>>>> -	}

 drivers/cpuidle/cpuidle-powernv.c | 4
 1 file changed, 4 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 78fd174..130f081 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -31,11 +31,13 @@ static int snooze_loop(struct cpuidle_device *dev,
 	set_thread_flag(TIF_POLLING_NRFLAG);
 
 	while (!need_resched()) {
+		ppc64_runlatch_off();
 		HMT_low();
 		HMT_very_low();
 	}
 	HMT_medium();
+	ppc64_runlatch_on();
 	clear_thread_flag(TIF_POLLING_NRFLAG);
 	smp_mb();
 	return index;
@@ -45,7 +47,9 @@ static int nap_loop(struct cpuidle_device *dev,
 			struct cpuidle_driver *drv,
 			int index)
 {
+	ppc64_runlatch_off();
 	power7_idle();
+	ppc64_runlatch_on();
 	return index;
 }

Thanks

Regards
Preeti U Murthy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH V4 0/3] time/cpuidle: Support in tick broadcast framework in absence of external clock device
This patchset provides support in the tick broadcast framework for such
architectures, so as to enable their CPUs to get into deep idle. Presently we
are in need of this support on certain implementations of PowerPC, and the
patchset has been tested on the same.

This patchset is based on the idea discussed here:
http://www.kernelhub.org/?p=2=399516

Changes in V4:
1. Cleared the standby CPU from the oneshot mask. As a result PATCH 3/3 was
   simplified.
2. Fixed compile time warnings.

---

Preeti U Murthy (2):
      time: Change the return type of clockevents_notify() to integer
      time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set

Thomas Gleixner (1):
      tick/cpuidle: Initialize hrtimer mode of broadcast

 drivers/cpuidle/cpuidle.c            |  14 +++--
 include/linux/clockchips.h           |  15 -
 kernel/time/Makefile                 |   2 -
 kernel/time/clockevents.c            |   8 ++-
 kernel/time/tick-broadcast-hrtimer.c | 105 ++
 kernel/time/tick-broadcast.c         |  60 ++-
 kernel/time/tick-internal.h          |   6 +-
 7 files changed, 189 insertions(+), 21 deletions(-)
 create mode 100644 kernel/time/tick-broadcast-hrtimer.c
[PATCH V4 1/3] time: Change the return type of clockevents_notify() to integer
The broadcast framework can also be made use of by archs which do not have an
external clock device. In that case, one of the CPUs needs to handle the
broadcasting of wakeup IPIs to the CPUs in deep idle, and its local timer
should therefore remain functional at all times. For such a CPU, the
BROADCAST_ENTER notification has to fail, indicating that its clock device
cannot be shut down. To make way for this support, change the return type of
tick_broadcast_oneshot_control() and hence clockevents_notify() to indicate
such scenarios.

Signed-off-by: Preeti U Murthy
---
 include/linux/clockchips.h   | 6 +++---
 kernel/time/clockevents.c    | 8 +---
 kernel/time/tick-broadcast.c | 6 --
 kernel/time/tick-internal.h  | 6 +++---
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 493aa02..e0c5a6c 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; }
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
-extern void clockevents_notify(unsigned long reason, void *arg);
+extern int clockevents_notify(unsigned long reason, void *arg);
 #else
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 #endif
 
 #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */
@@ -196,7 +196,7 @@ static inline void clockevents_notify(unsigned long reason, void *arg) {}
 
 static inline void clockevents_suspend(void) {}
 static inline void clockevents_resume(void) {}
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 static inline int tick_check_broadcast_expired(void) { return 0; }
 
 #endif
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 086ad60..79b8685 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -524,12 +524,13 @@ void clockevents_resume(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 /**
  * clockevents_notify - notification about relevant events
+ * Returns 0 on success, any other value on error
  */
-void clockevents_notify(unsigned long reason, void *arg)
+int clockevents_notify(unsigned long reason, void *arg)
 {
 	struct clock_event_device *dev, *tmp;
 	unsigned long flags;
-	int cpu;
+	int cpu, ret = 0;
 
 	raw_spin_lock_irqsave(&clockevents_lock, flags);
 
@@ -542,7 +543,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 	case CLOCK_EVT_NOTIFY_BROADCAST_ENTER:
 	case CLOCK_EVT_NOTIFY_BROADCAST_EXIT:
-		tick_broadcast_oneshot_control(reason);
+		ret = tick_broadcast_oneshot_control(reason);
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
@@ -585,6 +586,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 		break;
 	}
 	raw_spin_unlock_irqrestore(&clockevents_lock, flags);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(clockevents_notify);
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 43780ab..ddf2ac2 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -633,14 +633,15 @@ again:
 /*
  * Powerstate information: The system enters/leaves a state, where
  * affected devices might stop
+ * Returns 0 on success, -EBUSY if the cpu is used to broadcast wakeups.
  */
-void tick_broadcast_oneshot_control(unsigned long reason)
+int tick_broadcast_oneshot_control(unsigned long reason)
 {
 	struct clock_event_device *bc, *dev;
 	struct tick_device *td;
 	unsigned long flags;
 	ktime_t now;
-	int cpu;
+	int cpu, ret = 0;
 
 	/*
 	 * Periodic mode does not care about the enter/exit of power
@@ -746,6 +747,7 @@ void tick_broadcast_oneshot_control(unsigned long reason)
 	}
 out:
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
+	return ret;
 }
 
 /*
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 8329669..f0dc03c 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -46,7 +46,7 @@ extern int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *));
 extern void tick_resume_oneshot(void);
 # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
-extern void tick_broadcast_oneshot_control(unsigned long reason);
+extern int tick_broadcast_oneshot_control(unsigned long reason);
 extern void tick_broadcast_switch_to_oneshot(void);
 extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup);
 extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc);
@@ -58,7 +58,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 {
 	BUG();
 }
-static inline void
[PATCH V4 3/3] time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set
Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the
local timers stop. cpuidle_idle_call() currently handles such idle states by
calling into the broadcast framework so as to wake up the CPUs at their next
wakeup event.

With the hrtimer mode of broadcast, the BROADCAST_ENTER call into the
broadcast framework can fail on archs that do not have an external clock
device to handle wakeups, and the CPU in question then has to be made the
standby CPU. This patch handles such cases by failing the call into cpuidle
so that the arch can take some default action. The arch will certainly not
enter a similar idle state, because a failed cpuidle call also implicitly
indicates that the broadcast framework has not registered this CPU to be
woken up. Hence we are safe if we fail the cpuidle call.

In the process, move the functions that trace idle statistics to just before
and after the entry into and exit from idle states respectively. In the
other scenarios where the call to cpuidle fails, we end up not tracing idle
entry and exit, since a decision on an idle state could not be taken.
Similarly, when the call into the broadcast framework fails, we skip tracing
idle statistics because we are in no position to take a decision on an
alternative idle state to enter.
Signed-off-by: Preeti U Murthy
---
 drivers/cpuidle/cpuidle.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a55e68f..8beb0f02 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -140,12 +140,14 @@ int cpuidle_idle_call(void)
 		return 0;
 	}
 
-	trace_cpu_idle_rcuidle(next_state, dev->cpu);
-
 	broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP);
 
-	if (broadcast)
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
+	if (broadcast &&
+	    clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
+		return -EBUSY;
+
+	trace_cpu_idle_rcuidle(next_state, dev->cpu);
 
 	if (cpuidle_state_is_coupled(dev, drv, next_state))
 		entered_state = cpuidle_enter_state_coupled(dev, drv,
@@ -153,11 +155,11 @@ int cpuidle_idle_call(void)
 	else
 		entered_state = cpuidle_enter_state(dev, drv, next_state);
 
+	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
+
 	if (broadcast)
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
 
-	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
-
 	/* give the governor an opportunity to reflect on the outcome */
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev, entered_state);
[PATCH V4 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast
From: Thomas Gleixner

On some architectures, the local timers stop in certain deep CPU idle
states, and an external clock device is used to wake up these CPUs. The
kernel supports these wakeups through the tick broadcast framework by using
the external clock device as the wakeup source. However, not all
implementations of these architectures provide such an external clock
device.

This patch adds support in the broadcast framework for waking up CPUs in
deep idle states on such systems by queuing a hrtimer on one of the CPUs,
which then handles the wakeup of the CPUs in deep idle. It introduces a
pseudo clock device which the archs can register as the
tick_broadcast_device in the absence of a real external clock device. Once
registered, the broadcast framework works as-is for these architectures, as
long as the archs take care of the BROADCAST_ENTER notification failing for
one of the CPUs. That CPU is made the standby CPU which handles the wakeup
of the CPUs in deep idle, and it *must not enter deep idle states*.

The CPU with the earliest wakeup is chosen as this standby CPU. Hence the
standby CPU moves around dynamically, and so does the hrtimer, which is
queued to fire at the next earliest wakeup time. This is consistent with
the case where an external clock device is present: the smp affinity of
that clock device is set to the CPU with the earliest wakeup. This patch
also handles hotplug of the standby CPU by moving the hrtimer to the CPU
handling the CPU_DEAD notification.
Signed-off-by: Preeti U Murthy
[Added Changelog and code to handle reprogramming of hrtimer]
---
 include/linux/clockchips.h           |   9 +++
 kernel/time/Makefile                 |   2 -
 kernel/time/tick-broadcast-hrtimer.c | 105 ++
 kernel/time/tick-broadcast.c         |  54 +
 4 files changed, 166 insertions(+), 4 deletions(-)
 create mode 100644 kernel/time/tick-broadcast-hrtimer.c

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index e0c5a6c..dbe9e14 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -62,6 +62,11 @@ enum clock_event_mode {
 #define CLOCK_EVT_FEAT_DYNIRQ		0x20
 #define CLOCK_EVT_FEAT_PERCPU		0x40
 
+/*
+ * Clockevent device is based on a hrtimer for broadcast
+ */
+#define CLOCK_EVT_FEAT_HRTIMER		0x80
+
 /**
  * struct clock_event_device - clock event device descriptor
  * @event_handler:	Assigned by the framework to be called by the low
@@ -83,6 +88,7 @@ enum clock_event_mode {
  * @name:		ptr to clock event name
  * @rating:		variable to rate clock event devices
  * @irq:		IRQ number (only for non CPU local devices)
+ * @bound_on:		Bound on CPU
  * @cpumask:		cpumask to indicate for which CPUs this device works
  * @list:		list head for the management code
  * @owner:		module reference
@@ -113,6 +119,7 @@ struct clock_event_device {
 	const char		*name;
 	int			rating;
 	int			irq;
+	int			bound_on;
 	const struct cpumask	*cpumask;
 	struct list_head	list;
 	struct module		*owner;
@@ -180,9 +187,11 @@ extern int tick_receive_broadcast(void);
 #endif
 
 #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
+extern void tick_setup_hrtimer_broadcast(void);
 extern int tick_check_broadcast_expired(void);
 #else
 static inline int tick_check_broadcast_expired(void) { return 0; }
+static void tick_setup_hrtimer_broadcast(void) {};
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 9250130..06151ef 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -3,7 +3,7 @@ obj-y += timeconv.o posix-clock.o alarmtimer.o
 
 obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)		+= clockevents.o
 obj-$(CONFIG_GENERIC_CLOCKEVENTS)		+= tick-common.o
-obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o
+obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o tick-broadcast-hrtimer.o
 obj-$(CONFIG_GENERIC_SCHED_CLOCK)		+= sched_clock.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-sched.o
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
new file mode 100644
index 000..af1e119
--- /dev/null
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -0,0 +1,105 @@
+/*
+ * linux/kernel/time/tick-broadcast-hrtimer.c
+ * This file emulates a local clock event device
+ * via a pseudo clock device.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "tick-internal.h"
+
+static stru
Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
Hi Deepthi,

On 02/07/2014 03:15 PM, Deepthi Dharwar wrote:
> Hi Preeti,
>
> Thanks for the patch.
>
> On 02/07/2014 12:31 PM, Preeti U Murthy wrote:
>> Hi Nicolas,
>>
>> Find below the patch that will need to be squashed with this one.
>> This patch is based on the mainline. Adding Deepthi, the author of
>> the patch which introduced the powernv cpuidle driver. Deepthi,
>> do you think the below patch looks right? We do not need to do an
>> explicit local_irq_enable() since we are in the call path of the
>> cpuidle driver and that explicitly enables irqs on exit from
>> idle states.
>
> Yes, we enable irqs explicitly while entering the snooze loop and we
> always have interrupts enabled in the snooze state.
> For the NAP state, we exit with interrupts enabled, so we do not need
> an explicit enable of irqs.
>
>> On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
>>> On Thu, 6 Feb 2014, Preeti U Murthy wrote:
>>>
>>>> Hi Daniel,
>>>>
>>>> On 02/06/2014 09:55 PM, Daniel Lezcano wrote:
>>>>> Hi Nico,
>>>>>
>>>>>
>>>>> On 6 February 2014 14:16, Nicolas Pitre wrote:
>>>>>
>>>>>> The core idle loop now takes care of it.
>>>>>>
>>>>>> Signed-off-by: Nicolas Pitre
>>>>>> ---
>>>>>>  arch/powerpc/platforms/powernv/setup.c | 13 +
>>>>>>  1 file changed, 1 insertion(+), 12 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/platforms/powernv/setup.c
>>>>>> b/arch/powerpc/platforms/powernv/setup.c
>>>>>> index 21166f65c9..a932feb290 100644
>>>>>> --- a/arch/powerpc/platforms/powernv/setup.c
>>>>>> +++ b/arch/powerpc/platforms/powernv/setup.c
>>>>>> @@ -26,7 +26,6 @@
>>>>>>  #include
>>>>>>  #include
>>>>>>  #include
>>>>>> -#include
>>>>>>
>>>>>>  #include
>>>>>>  #include
>>>>>> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
>>>>>>  	return 1;
>>>>>>  }
>>>>>>
>>>>>> -void powernv_idle(void)
>>>>>> -{
>>>>>> -	/* Hook to cpuidle framework if available, else
>>>>>> -	 * call on default platform idle code
>>>>>> -	 */
>>>>>> -	if (cpuidle_idle_call()) {
>>>>>> -		power7_idle();
>>>>>> -	}
>>>>>>
>>
>> drivers/cpuidle/cpuidle-powernv.c | 4
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c
>> b/drivers/cpuidle/cpuidle-powernv.c
>> index 78fd174..130f081 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -31,11 +31,13 @@ static int snooze_loop(struct cpuidle_device *dev,
>>  	set_thread_flag(TIF_POLLING_NRFLAG);
>>
>>  	while (!need_resched()) {
>> +		ppc64_runlatch_off();
>               ^^^
> We could move this before the while() loop.
> It would be ideal to turn off the latch when we enter snooze and
> turn it on when we are about to exit, rather than doing
> it over and over in the while loop.

You are right, this can be moved out of the loop.

Thanks

Regards
Preeti U Murthy
Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
Hi Nicolas,

On 02/07/2014 04:18 PM, Nicolas Pitre wrote:
> On Fri, 7 Feb 2014, Preeti U Murthy wrote:
>
>> Hi Nicolas,
>>
>> On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
>>>
>>> What about creating arch_cpu_idle_enter() and arch_cpu_idle_exit() in
>>> arch/powerpc/kernel/idle.c and calling ppc64_runlatch_off() and
>>> ppc64_runlatch_on() respectively from there instead? Would that work?
>>> That would make the idle consolidation much easier afterwards.
>>
>> I would not suggest doing this. The ppc64_runlatch_*() routines need to
>> be called when we are sure that the cpu is about to enter or has exited
>> an idle state. Moving ppc64_runlatch_off() to arch_cpu_idle_enter(),
>> for instance, is not a good idea because there are places where the cpu
>> can decide not to enter any idle state before the call to
>> cpuidle_idle_call() itself. In that case, communicating prematurely
>> that we are in an idle state would not be a good idea.
>>
>> So it is best to add the ppc64_runlatch_*() calls in the powernv
>> cpuidle driver, IMO. We could however create
>> idle_loop_prologue/epilogue() variants inside it, so that in addition
>> to the runlatch routines we could potentially add more such similar
>> routines that are powernv specific. If there are cases where there is
>> work to be done prior to and post an entry into an idle state common
>> to both pseries and powernv, we will probably put them in
>> arch_cpu_idle_enter/exit(). But the runlatch routines are not suitable
>> to be moved there as far as I can see.
>
> OK.
>
> However, one thing we need to do as much as possible is to remove those
> loops based on need_resched() from idle backend drivers. A somewhat
> common pattern is:
>
> my_idle()
> {
> 	/* interrupts disabled on entry */
> 	while (!need_resched()) {
> 		lowpower_wait_for_interrupts();
> 		local_irq_enable();
> 		/* IRQ serviced from here */
> 		local_irq_disable();
> 	}
> 	local_irq_enable();
> 	/* interrupts enabled on exit */
> }
>
> To be able to keep statistics on the actual idleness of the CPU we'd
> need for all idle backends to always return to generic code on every
> interrupt similar to this:
>
> my_idle()
> {
> 	/* interrupts disabled on entry */
> 	lowpower_wait_for_interrupts();

You can do this for the idle states which do not have a polling nature.
IOW, those idle states are capable of doing what you describe as
"wait_for_interrupts": they do some kind of spinning at the hardware level
with interrupts enabled. A reschedule IPI or any other interrupt will wake
them up to enter the generic idle loop, where they check for the cause of
the interrupt.

But observe the idle state "snooze" on powerpc. The power that this idle
state saves comes from lowering the thread priority of the CPU. After it
lowers the thread priority, it is done; it cannot "wait_for_interrupts",
and it will exit my_idle(). It is then up to the generic idle loop to
increase the thread priority if the need_resched flag is set. Only an
interrupt routine can increase the thread priority; otherwise we need to
do it explicitly. And in such states, which have a polling nature, the cpu
will not receive a reschedule IPI.

That is why in snooze_loop() we poll on need_resched. If it is set, we
raise the priority of the thread using HMT_MEDIUM() and then exit the
my_idle() loop. In case of interrupts, the priority gets increased
automatically. This might not be required for similar idle routines on
other archs, but it is the consequence of applying this idea of a
simplified cpuidle backend driver on powerpc.

I would say you could leave the backend cpuidle drivers alone in this
regard; it could complicate the generic idle loop, IMO, depending on how
the polling states are implemented in each architecture.

> The generic code would be responsible for dealing with need_resched()
> and call back into the backend right away if necessary after updating
> some stats.
>
> Do you see a problem with the runlatch calls happening around each
> interrupt from such a simplified idle backend?

The runlatch calls could be moved outside the loop. They do not need to be
called each time.

Thanks

Regards
Preeti U Murthy

>
>
> Nicolas
>
[PATCH] time/cpuidle:Fixup fallout from hrtimer broadcast mode inclusion
The broadcast timer registration has to be done only when the
GENERIC_CLOCKEVENTS_BROADCAST and TICK_ONESHOT config options are enabled.
Also fix the max_delta_ticks value for the pseudo clock device.

Reported-by: Fengguang Wu
Signed-off-by: Preeti U Murthy
Cc: Thomas Gleixner
Cc: Ingo Molnar
---
 kernel/time/tick-broadcast-hrtimer.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index 5591aaa..bc383ac 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -81,7 +81,7 @@ static struct clock_event_device ce_broadcast_hrtimer = {
 	.min_delta_ns		= 1,
 	.max_delta_ns		= KTIME_MAX,
 	.min_delta_ticks	= 1,
-	.max_delta_ticks	= KTIME_MAX,
+	.max_delta_ticks	= ULONG_MAX,
 	.mult			= 1,
 	.shift			= 0,
 	.cpumask		= cpu_all_mask,
@@ -102,9 +102,11 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t)
 	return HRTIMER_RESTART;
 }
 
+#if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
 void tick_setup_hrtimer_broadcast(void)
 {
 	hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
 	bctimer.function = bc_handler;
 	clockevents_register_device(&ce_broadcast_hrtimer);
 }
+#endif
Re: [PATCH] time/cpuidle:Fixup fallout from hrtimer broadcast mode inclusion
Hi Thomas,

On 02/07/2014 11:27 PM, Thomas Gleixner wrote:
> On Fri, 7 Feb 2014, Preeti U Murthy wrote:
>
>> The broadcast timer registration has to be done only when
>> GENERIC_CLOCKEVENTS_BROADCAST and TICK_ONESHOT config options are enabled.
>
> Then we should compile that file only when those options are
> enabled. Where is the point to compile that code w/o the registration
> function?

Hmm, of course. The delta patch is at the end.

Another concern I have is with regard to the periodic mode of broadcast. We
currently do not support the hrtimer mode of broadcast in periodic mode.
The BROADCAST_ON/OFF calls, which take effect in periodic mode, have not
yet been modified by this patchset to keep one CPU from going into deep
idle, since we expect the deep idle states to never be chosen by the
cpuidle governor in this mode. Do you think we should bother to modify this
piece of code at all?

On the same note, my understanding is that BROADCAST_ON/OFF takes effect
only in periodic mode; in oneshot mode it is a nop. But why do we expect
the CPUs to avail of broadcast in periodic mode when they are not supposed
to be entering deep idle states? Am I missing something here? IOW, what is
the point of the periodic mode of broadcast? Is it for malfunctioning local
clock devices?

The delta patch below fixes the compile time errors. It is based on the
tip/timers/core branch.

time/cpuidle:Fixup fallout from hrtimer broadcast mode inclusion

From: Preeti U Murthy

The hrtimer mode of broadcast is supported only when the
GENERIC_CLOCKEVENTS_BROADCAST and TICK_ONESHOT config options are enabled.
Hence compile in the functions for the hrtimer mode of broadcast only when
these options are selected. Also fix the max_delta_ticks value for the
pseudo clock device.

Reported-by: Fengguang Wu
Signed-off-by: Preeti U Murthy
Cc: Thomas Gleixner
Cc: Ingo Molnar
---
 include/linux/clockchips.h           | 1 +
 kernel/time/Makefile                 | 5 ++++-
 kernel/time/tick-broadcast-hrtimer.c | 2 +-
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 20a7183..2e4cb67 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -207,6 +207,7 @@ static inline void clockevents_resume(void) {}
 static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 static inline int tick_check_broadcast_expired(void) { return 0; }
+static inline void tick_setup_hrtimer_broadcast(void) {};
 
 #endif
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 06151ef..57a413f 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -3,7 +3,10 @@ obj-y += timeconv.o posix-clock.o alarmtimer.o
 
 obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)		+= clockevents.o
 obj-$(CONFIG_GENERIC_CLOCKEVENTS)		+= tick-common.o
-obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o tick-broadcast-hrtimer.o
+ifeq ($(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST),y)
+ obj-y						+= tick-broadcast.o
+ obj-$(CONFIG_TICK_ONESHOT)			+= tick-broadcast-hrtimer.o
+endif
 obj-$(CONFIG_GENERIC_SCHED_CLOCK)		+= sched_clock.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-sched.o
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index 9242527..eb682d5 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -82,7 +82,7 @@ static struct clock_event_device ce_broadcast_hrtimer = {
 	.min_delta_ns		= 1,
 	.max_delta_ns		= KTIME_MAX,
 	.min_delta_ticks	= 1,
-	.max_delta_ticks	= KTIME_MAX,
+	.max_delta_ticks	= ULONG_MAX,
 	.mult			= 1,
 	.shift			= 0,
 	.cpumask		= cpu_all_mask,

Thanks

Regards
Preeti U Murthy

>
>> Also fix max_delta_ticks value for the pseudo clock device.
>>
>> Reported-by: Fengguang Wu
>> Signed-off-by: Preeti U Murthy
>> Cc: Thomas Gleixner
>> Cc: Ingo Molnar
>> ---
>>
>>  kernel/time/tick-broadcast-hrtimer.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/time/tick-broadcast-hrtimer.c
>> b/kernel/time/tick-broadcast-hrtimer.c
>> index 5591aaa..bc383ac 100644
>> --- a/kernel/time/tick-broadcast-hrtimer.c
>> +++ b/kernel/time/tick-broadcast-hrtimer.c
>> @@ -81,7 +81,7 @@ static struct clock_event_device ce_broadcast_hrtimer = {
>>  	.min_delta_ns		= 1,
>>  	.max_delta_ns		= KTIME_MAX,
>>  	.min_delta_ticks	= 1,
>> -	.max_delta_ticks	= KTIME_MAX,
>> +	.max_delta_ticks	= ULONG_MAX,
>>  	.mult
Re: [PATCH] time/cpuidle:Fixup fallout from hrtimer broadcast mode inclusion
Hi David,

I have sent out a revised patch at https://lkml.org/lkml/2014/2/9/2.
Can you let me know if this works for you?

Thanks

Regards
Preeti U Murthy

On 02/09/2014 01:01 PM, David Rientjes wrote:
> On Fri, 7 Feb 2014, Preeti U Murthy wrote:
>
>> The broadcast timer registration has to be done only when
>> GENERIC_CLOCKEVENTS_BROADCAST and TICK_ONESHOT config options are enabled.
>> Also fix max_delta_ticks value for the pseudo clock device.
>>
>> Reported-by: Fengguang Wu
>> Signed-off-by: Preeti U Murthy
>> Cc: Thomas Gleixner
>> Cc: Ingo Molnar
>> ---
>>
>>  kernel/time/tick-broadcast-hrtimer.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/time/tick-broadcast-hrtimer.c
>> b/kernel/time/tick-broadcast-hrtimer.c
>> index 5591aaa..bc383ac 100644
>> --- a/kernel/time/tick-broadcast-hrtimer.c
>> +++ b/kernel/time/tick-broadcast-hrtimer.c
>> @@ -81,7 +81,7 @@ static struct clock_event_device ce_broadcast_hrtimer = {
>>  	.min_delta_ns		= 1,
>>  	.max_delta_ns		= KTIME_MAX,
>>  	.min_delta_ticks	= 1,
>> -	.max_delta_ticks	= KTIME_MAX,
>> +	.max_delta_ticks	= ULONG_MAX,
>>  	.mult			= 1,
>>  	.shift			= 0,
>>  	.cpumask		= cpu_all_mask,
>> @@ -102,9 +102,11 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t)
>>  	return HRTIMER_RESTART;
>>  }
>>
>> +#if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
>>  void tick_setup_hrtimer_broadcast(void)
>>  {
>>  	hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
>>  	bctimer.function = bc_handler;
>>  	clockevents_register_device(&ce_broadcast_hrtimer);
>>  }
>> +#endif
>
> I see a build error in timers/core today:
>
> kernel/time/tick-broadcast-hrtimer.c:101:6: error: redefinition of
> 'tick_setup_hrtimer_broadcast'
> include/linux/clockchips.h:194:20: note: previous definition of
> 'tick_setup_hrtimer_broadcast' was here
>
> and I assume this is the intended fix for that, although it isn't
> mentioned in the changelog.
>
> After it's applied, this is left over:
>
> kernel/time/tick-broadcast-hrtimer.c:91:29: warning: ‘bc_handler’ defined but
> not used [-Wunused-function]
>
[RESEND PATCH 0/3] powerpc: Free up an IPI message slot for tick broadcast IPIs
This patchset is a precursor for enabling deep idle states on powerpc, when the local CPU timers stop. The tick broadcast framework in the Linux Kernel today handles wakeup of such CPUs at their next timer event by using an external clock device. At the expiry of this clock device, IPIs are sent to the CPUs in deep idle states so that they wake up to handle their respective timers. This patchset frees up one of the IPI slots on powerpc to be used to handle the tick broadcast IPI. On certain implementations of powerpc, such an external clock device is absent. The support in the tick broadcast framework to handle wakeup of CPUs from deep idle states on such implementations is currently in the tip tree. https://lkml.org/lkml/2014/2/7/906 https://lkml.org/lkml/2014/2/7/876 https://lkml.org/lkml/2014/2/7/608 With the above support in place, this patchset is next in line to enable deep idle states on powerpc. The patchset carries a RESEND tag since nothing has changed from the previous post except for an added config condition around tick_broadcast(), which handles sending broadcast IPIs, and the update in the cover letter. --- Preeti U Murthy (1): cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines Srivatsa S. Bhat (2): powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message powerpc: Implement tick broadcast IPI as a fixed IPI message arch/powerpc/include/asm/smp.h |2 - arch/powerpc/include/asm/time.h |1 arch/powerpc/kernel/smp.c | 25 ++--- arch/powerpc/kernel/time.c | 86 ++- arch/powerpc/platforms/cell/interrupt.c |2 - arch/powerpc/platforms/ps3/smp.c|2 - 6 files changed, 73 insertions(+), 45 deletions(-) --
[RESEND PATCH 2/3] powerpc: Implement tick broadcast IPI as a fixed IPI message
From: Srivatsa S. Bhat For scalability and performance reasons, we want the tick broadcast IPIs to be handled as efficiently as possible. Fixed IPI messages are one of the most efficient mechanisms available - they are faster than the smp_call_function mechanism because the IPI handlers are fixed and hence they don't involve costly operations such as adding IPI handlers to the target CPU's function queue, acquiring locks for synchronization etc. Luckily we have an unused IPI message slot, so use that to implement tick broadcast IPIs efficiently. Signed-off-by: Srivatsa S. Bhat [Functions renamed to tick_broadcast* and Changelog modified by Preeti U. Murthy] Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/smp.c | 21 + arch/powerpc/kernel/time.c |5 + arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 6 files changed, 26 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 9f7356b..ff51046 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_UNUSED 2 +#define PPC_MSG_TICK_BROADCAST 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index c1f2676..1d428e6 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); extern void GregorianDay(struct rtc_time *tm); +extern void tick_broadcast_ipi_handler(void); extern void generic_calibrate_decr(void); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ee7d76b..e2a4232 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t unused_action(int irq, void *data) +static irqreturn_t tick_broadcast_ipi_action(int irq, void *data) { - /* This slot is unused and hence available for use, if needed */ + tick_broadcast_ipi_handler(); return IRQ_HANDLED; } @@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_UNUSED] = unused_action, + [PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_UNUSED] = "ipi unused", + [PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & 
IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); + if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST)) + tick_broadcast_ipi_handler(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -289,6 +292,16 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask) do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } +#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST +void tick_broadcast(const struct cpumask *mask) +{ + unsigned int cpu; + + for_each_cpu(cpu, mask) + do_message_pass(cpu, PPC_MSG_TICK_BROADCAST); +} +#endif + #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) void smp_send_debugger_break(void) { diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index b3dab20..3ff97db 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -825,6 +825,11 @@ static void decrementer_set_mode(enum clock_event_mode mode, decrementer_set_next_event(DECREMENTER_MAX, dev); } +/* Interrupt handler for the timer broadcast IPI */ +void tick_broadcast_ipi_handler(void) +{ +} + static void register_decrementer_clockevent(int cpu) { struct clock_event_device *dec =
[RESEND PATCH 3/3] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
From: Preeti U Murthy Split timer_interrupt(), which is the local timer interrupt handler on ppc, into routines called during regular interrupt handling and __timer_interrupt(), which takes care of running local timers and collecting time-related stats. This will enable callers interested only in running expired local timers to directly call into __timer_interrupt(). One of the use cases of this is the tick broadcast IPI handling in which the sleeping CPUs need to handle the local timers that have expired. Signed-off-by: Preeti U Murthy --- arch/powerpc/kernel/time.c | 81 +--- 1 file changed, 46 insertions(+), 35 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 3ff97db..df2989b 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -478,6 +478,47 @@ void arch_irq_work_raise(void) #endif /* CONFIG_IRQ_WORK */ +void __timer_interrupt(void) +{ + struct pt_regs *regs = get_irq_regs(); + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + struct clock_event_device *evt = &__get_cpu_var(decrementers); + u64 now; + + trace_timer_interrupt_entry(regs); + + if (test_irq_work_pending()) { + clear_irq_work_pending(); + irq_work_run(); + } + + now = get_tb_or_rtc(); + if (now >= *next_tb) { + *next_tb = ~(u64)0; + if (evt->event_handler) + evt->event_handler(evt); + __get_cpu_var(irq_stat).timer_irqs_event++; + } else { + now = *next_tb - now; + if (now <= DECREMENTER_MAX) + set_dec((int)now); + /* We may have raced with new irq work */ + if (test_irq_work_pending()) + set_dec(1); + __get_cpu_var(irq_stat).timer_irqs_others++; + } + +#ifdef CONFIG_PPC64 + /* collect purr register values often, for accurate calculations */ + if (firmware_has_feature(FW_FEATURE_SPLPAR)) { + struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); + cu->current_tb = mfspr(SPRN_PURR); + } +#endif + + trace_timer_interrupt_exit(regs); +} + /* * timer_interrupt - gets called when the decrementer overflows, * with interrupts disabled.
@@ -486,8 +527,6 @@ void timer_interrupt(struct pt_regs * regs) { struct pt_regs *old_regs; u64 *next_tb = &__get_cpu_var(decrementers_next_tb); - struct clock_event_device *evt = &__get_cpu_var(decrementers); - u64 now; /* Ensure a positive value is written to the decrementer, or else * some CPUs will continue to take decrementer exceptions. @@ -519,39 +558,7 @@ void timer_interrupt(struct pt_regs * regs) old_regs = set_irq_regs(regs); irq_enter(); - trace_timer_interrupt_entry(regs); - - if (test_irq_work_pending()) { - clear_irq_work_pending(); - irq_work_run(); - } - - now = get_tb_or_rtc(); - if (now >= *next_tb) { - *next_tb = ~(u64)0; - if (evt->event_handler) - evt->event_handler(evt); - __get_cpu_var(irq_stat).timer_irqs_event++; - } else { - now = *next_tb - now; - if (now <= DECREMENTER_MAX) - set_dec((int)now); - /* We may have raced with new irq work */ - if (test_irq_work_pending()) - set_dec(1); - __get_cpu_var(irq_stat).timer_irqs_others++; - } - -#ifdef CONFIG_PPC64 - /* collect purr register values often, for accurate calculations */ - if (firmware_has_feature(FW_FEATURE_SPLPAR)) { - struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); - cu->current_tb = mfspr(SPRN_PURR); - } -#endif - - trace_timer_interrupt_exit(regs); - + __timer_interrupt(); irq_exit(); set_irq_regs(old_regs); } @@ -828,6 +835,10 @@ static void decrementer_set_mode(enum clock_event_mode mode, /* Interrupt handler for the timer broadcast IPI */ void tick_broadcast_ipi_handler(void) { + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + + *next_tb = get_tb_or_rtc(); + __timer_interrupt(); } static void register_decrementer_clockevent(int cpu)
[RESEND PATCH 1/3] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
From: Srivatsa S. Bhat The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map to a common implementation - generic_smp_call_function_single_interrupt(). So, we can consolidate them and save one of the IPI message slots, (which are precious on powerpc, since only 4 of those slots are available). So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be used for something else in the future, if desired. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/kernel/smp.c | 12 +--- arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 4 files changed, 8 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 084e080..9f7356b 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_CALL_FUNC_SINGLE 2 +#define PPC_MSG_UNUSED 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ac2621a..ee7d76b 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t call_function_single_action(int irq, void *data) +static irqreturn_t unused_action(int irq, void *data) { - generic_smp_call_function_single_interrupt(); + /* This slot is unused and hence available for use, if needed */ return IRQ_HANDLED; } @@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, + [PPC_MSG_UNUSED] = unused_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", + [PPC_MSG_UNUSED] = "ipi unused", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); - if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE)) - generic_smp_call_function_single_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule); void arch_send_call_function_single_ipi(int cpu) { - do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); + do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } void arch_send_call_function_ipi_mask(const struct cpumask *mask) 
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 2d42f3b..adf3726 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -215,7 +215,7 @@ void iic_request_IPIs(void) { iic_request_ipi(PPC_MSG_CALL_FUNCTION); iic_request_ipi(PPC_MSG_RESCHEDULE); - iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE); + iic_request_ipi(PPC_MSG_UNUSED); iic_request_ipi(PPC_MSG_DEBUGGER_BREAK); } diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c index 4b35166..00d1a7c 100644 --- a/arch/powerpc/platforms/ps3/smp.c +++ b/arch/powerpc/platforms/ps3/smp.c @@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void) BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION!= 0); BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1); - BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2); + BUILD_BUG_ON(PPC_MSG_UNUSED != 2); BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3); for (i = 0; i < MSG_COUNT; i++) { --
Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
Hi Peter, On 02/07/2014 06:11 PM, Peter Zijlstra wrote: > On Fri, Feb 07, 2014 at 05:11:26PM +0530, Preeti U Murthy wrote: >> But observe the idle state "snooze" on powerpc. The power that this idle >> state saves is through the lowering of the thread priority of the CPU. >> After it lowers the thread priority, it is done. It cannot >> "wait_for_interrupts". It will exit my_idle(). It is now upto the >> generic idle loop to increase the thread priority if the need_resched >> flag is set. Only an interrupt routine can increase the thread priority. >> Else we will need to do it explicitly. And in such states which have a >> polling nature, the cpu will not receive a reschedule IPI. >> >> That is why in the snooze_loop() we poll on need_resched. If it is set >> we up the priority of the thread using HMT_MEDIUM() and then exit the >> my_idle() loop. In case of interrupts, the priority gets automatically >> increased. > > You can poll without setting TS_POLLING/TIF_POLLING_NRFLAGS just fine > and get the IPI if that is what you want. > > Depending on how horribly unprovisioned the thread gets at the lowest > priority, that might actually be faster than polling and raising the > prio whenever it does get ran. So I am assuming you mean something like the below: my_idle() { local_irq_enable(); /* Remove the setting of the polling flag */ HMT_low(); return index; } And then exit into the generic idle loop. But the issue I see here is that the TS_POLLING/TIF_POLLING_NRFLAGS gets set immediately. So, if on testing need_resched() immediately after this returns that the TIF_NEED_RESCHED flag is set, the thread will exit at low priority right? We could raise the priority of the thread in arch_cpu_idle_exit() soon after setting the polling flag but that would mean for cases where the TIF_NEED_RESCHED flag is not set we unnecessarily raise the priority of the thread. 
Thanks Regards Preeti U Murthy
Re: [RFC] sched: CPU topology try
Hi Vincent, On 12/18/2013 06:43 PM, Vincent Guittot wrote: > This patch applies on top of the two patches [1][2] that have been proposed by > Peter for creating a new way to initialize sched_domain. It includes some > minor > compilation fixes and a trial of using this new method on an ARM platform. > [1] https://lkml.org/lkml/2013/11/5/239 > [2] https://lkml.org/lkml/2013/11/5/449 > > Based on the results of these tests, my feeling about this new way to init the > sched_domain is a bit mixed. > > The good point is that I have been able to create the same sched_domain > topologies as before and even more complex ones (where a subset of the cores > in a cluster share their powergating capabilities). I have described various > topology results below. > > I use a system that is made of a dual cluster of quad cores with > hyperthreading > for my examples. > > If one cluster (0-7) can powergate its cores independently but not the other > cluster (8-15) we have the following topology, which is equal to what I had > previously: > > CPU0: > domain 0: span 0-1 level: SMT > flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN > groups: 0 1 > domain 1: span 0-7 level: MC > flags: SD_SHARE_PKG_RESOURCES > groups: 0-1 2-3 4-5 6-7 > domain 2: span 0-15 level: CPU > flags: > groups: 0-7 8-15 > > CPU8 > domain 0: span 8-9 level: SMT > flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN > groups: 8 9 > domain 1: span 8-15 level: MC > flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN > groups: 8-9 10-11 12-13 14-15 > domain 2: span 0-15 level CPU > flags: > groups: 8-15 0-7 > > We can even describe some more complex topologies if a subset (2-7) of the > cluster can't powergate independently: > > CPU0: > domain 0: span 0-1 level: SMT > flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN > groups: 0 1 > domain 1: span 0-7 level: MC > flags: SD_SHARE_PKG_RESOURCES > groups: 0-1 2-7 > domain 2: span 0-15 level: CPU >
flags: > groups: 0-7 8-15 > > CPU2: > domain 0: span 2-3 level: SMT > flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN > groups: 0 1 > domain 1: span 2-7 level: MC > flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN > groups: 2-7 4-5 6-7 > domain 2: span 0-7 level: MC > flags: SD_SHARE_PKG_RESOURCES > groups: 2-7 0-1 > domain 3: span 0-15 level: CPU > flags: > groups: 0-7 8-15 > > In this case, we have an additional sched_domain MC level for this subset > (2-7) > of cores so we can trigger some load balance in this subset before doing that > on the complete cluster (which is the last level of cache in my example) > > We can add more levels that will describe other dependency/independency like > the frequency scaling dependency and as a result the final sched_domain > topology will have additional levels (if they have not been removed during > the degenerate sequence) The design looks good to me. In my opinion, information like P-states and C-states dependency can be kept separate from the topology levels; it might get too complicated unless the information is tightly coupled to the topology. > > My concern is about the configuration of the table that is used to create the > sched_domain. Some levels are "duplicated" with different flags configuration I do not feel this is a problem since the levels are not duplicated, rather they have different properties within them which is best represented by flags like you have introduced in this patch. > which makes the table not easily readable and we must also take care of the > order because parents have to gather all cpus of their children. So we must > choose which capabilities will be a subset of the other one. The order is The sched domain levels which have SD_SHARE_POWERDOMAIN set are expected to have cpus which are a subset of the cpus that this domain would have included had this flag not been set.
In addition to this, every higher domain, irrespective of SD_SHARE_POWERDOMAIN being set, will include all cpus of the lower domains. As far as I see, this patch does not change these assumptions. Hence I am unable to imagine a scenario where the parent might not include all cpus of its child domains. Do you have such a scenario in mind which can arise due to this patch? Thanks Regards Preeti U Murthy
[PATCH V2] time/cpuidle: Support in tick broadcast framework for archs without external clock device
On some architectures, in certain CPU deep idle states the local timers stop. An external clock device is used to wakeup these CPUs. The kernel support for the wakeup of these CPUs is provided by the tick broadcast framework by using the external clock device as the wakeup source. However not all implementations of architectures provide such an external clock device such as some PowerPC ones. This patch includes support in the broadcast framework to handle the wakeup of the CPUs in deep idle states on such systems by queuing a hrtimer on one of the CPUs, meant to handle the wakeup of CPUs in deep idle states. This CPU is identified as the bc_cpu. Each time the hrtimer expires, it is reprogrammed for the next wakeup of the CPUs in deep idle state after handling broadcast. However when a CPU is about to enter deep idle state with its wakeup time earlier than the time at which the hrtimer is currently programmed, it *becomes the new bc_cpu* and restarts the hrtimer on itself. This way the job of doing broadcast is handed around to the CPUs that ask for the earliest wakeup just before entering deep idle state. This is consistent with what happens in cases where an external clock device is present. The smp affinity of this clock device is set to the CPU with the earliest wakeup. The important point here is that the bc_cpu cannot enter deep idle state since it has a hrtimer queued to wakeup the other CPUs in deep idle. Hence it cannot have its local timer stopped. Therefore for such a CPU, the BROADCAST_ENTER notification has to fail implying that it cannot enter deep idle state. On architectures where an external clock device is present, all CPUs can enter deep idle. During hotplug of the bc_cpu, the job of doing a broadcast is assigned to the first cpu in the broadcast mask. This newly nominated bc_cpu is woken up by an IPI so as to queue the above mentioned hrtimer on itself. 
Changes from V1:https://lkml.org/lkml/2013/12/12/687 If idle states exist when the local timers of CPUs stop and there is no external clock device to handle their wakeups the kernel switches the tick mode to periodic so as to prevent the CPUs from entering such idle states altogether. Therefore include an additional check consistent with this patch, where if an external clock device does not exist, queue a hrtimer to handle wakeups. If this also fails, only then switch the tick mode to periodic. Signed-off-by: Preeti U Murthy --- include/linux/clockchips.h |4 - kernel/time/clockevents.c|8 +- kernel/time/tick-broadcast.c | 180 ++ kernel/time/tick-internal.h |8 +- 4 files changed, 173 insertions(+), 27 deletions(-) diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index 493aa02..bbda37b 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; } #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS -extern void clockevents_notify(unsigned long reason, void *arg); +extern int clockevents_notify(unsigned long reason, void *arg); #else -static inline void clockevents_notify(unsigned long reason, void *arg) {} +static inline int clockevents_notify(unsigned long reason, void *arg) {} #endif #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */ diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 086ad60..bbbd671 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -525,11 +525,11 @@ void clockevents_resume(void) /** * clockevents_notify - notification about relevant events */ -void clockevents_notify(unsigned long reason, void *arg) +int clockevents_notify(unsigned long reason, void *arg) { struct clock_event_device *dev, *tmp; unsigned long flags; - int cpu; + int cpu, ret = 0; raw_spin_lock_irqsave(_lock, flags); @@ -542,11 +542,12 @@ void clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_BROADCAST_ENTER: case 
CLOCK_EVT_NOTIFY_BROADCAST_EXIT: - tick_broadcast_oneshot_control(reason); + ret = tick_broadcast_oneshot_control(reason); break; case CLOCK_EVT_NOTIFY_CPU_DYING: tick_handover_do_timer(arg); + tick_handover_bc_cpu(arg); break; case CLOCK_EVT_NOTIFY_SUSPEND: @@ -585,6 +586,7 @@ void clockevents_notify(unsigned long reason, void *arg) break; } raw_spin_unlock_irqrestore(_lock, flags); + return ret; } EXPORT_SYMBOL_GPL(clockevents_notify); diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 9532690..1755984 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "tick-internal.h" @@ -35,6 +36,11 @@ static cpumask_var_t tmpmask; static DEFINE_RA
Re: [PATCH 1/2] tick: broadcast: Deny per-cpu clockevents from being broadcast sources
Hi Soren, On 09/13/2013 03:50 PM, Preeti Murthy wrote: > Hi, > > So the patch that Daniel points out http://lwn.net/Articles/566270/ , > enables broadcast functionality > without using an external global clock device. It uses one of the per cpu > clock devices to enable the broadcast functionality. > > The way it achieves this is by creating a pseudo clock device and > associating it with one of the cpus clock device and > by having a hrtimer queued on the same cpu. This pseudo clock device acts > as the broadcast device, and the > per cpu clock device that it is associated with acts as the broadcast > source. > > The disadvantages that Soren mentions in having a per cpu clock device as > the broadcast source can be overcome > by following the approach proposed in this patch n the way described below: > > 1. What if the cpu, whose clock device is the broadcast source goes offline? > > The solution that the above patch proposes is associate the pseudo clock > device with another cpu and move the hrtimer > whose function is explained in the next point to another cpu. The broadcast > functionality continues to remain active transparently. > > 2. The cpu that requires broadcast functionality is different from the cpu > whose clock device is the broadcast source. > So how will the former cpu program/control the clock device of the latter > cpu? > > The above patch queues a hrtimer on the cpu whose clock device is the > broadcast source, which expires at > max(tick_broadcast_period, dev->next_event), where tick_broadcast_period > is what we define and dev is the > pseudo device whose next event is set by the broadcast framework. > > On expiry of this hrtimer, do broadcast handling and reprogram the hrtimer > with same as above, > max(tick_broadcast_period, dev->next_event). > > This ensures that a cpu that requires broadcast function to be activated > need not program the broadcast source, > which also happens to be a per cpu clock device. 
The hrtimer queued on the > cpu whose clock device is the > broadcast source takes care of when to do broadcast handling. > tick_broadcast_period ensures that we do > not miss wakeups. This is introduced to overcome the constraint of a cpu > not being able to program the clock > device of another cpu. > > Soren, do let me know if the above approach described in the patch has not > addressed any of the challenges > that you see with having a per cpu clock device as the broadcast source. > > Regards > Preeti U Murthy > > > On Fri, Sep 13, 2013 at 1:55 PM, Daniel Lezcano > wrote: > >> On 09/12/2013 10:30 PM, Thomas Gleixner wrote: >>> On Thu, 12 Sep 2013, Soren Brinkmann wrote: >>>> From: Stephen Boyd >>>> >>>> On most ARM systems the per-cpu clockevents are truly per-cpu in >>>> the sense that they can't be controlled on any other CPU besides >>>> the CPU that they interrupt. If one of these clockevents were to >>>> become a broadcast source we will run into a lot of trouble >>>> because the broadcast source is enabled on the first CPU to go >>>> into deep idle (if that CPU suffers from FEAT_C3_STOP) and that >>>> could be a different CPU than what the clockevent is interrupting >>>> (or even worse the CPU that the clockevent interrupts could be >>>> offline). >>>> >>>> Theoretically it's possible to support per-cpu clockevents as the >>>> broadcast source but so far we haven't needed this and supporting >>>> it is rather complicated. Let's just deny the possibility for now >>>> until this becomes a reality (let's hope it never does!). >>> >>> Well, we can't do it this way. There are globally accessible clock >>> event devices which deliver only to cpu0. So the mask check might be >>> causing failure here. >>> >>> Just add a feature flag CLOCK_EVT_FEAT_PERCPU to the clock event >>> device and check for it. >> >> It sounds probably more understandable than dealing with the cpumasks. 
>> >> I am wondering if this is semantically opposed to >> http://lwn.net/Articles/566270/ ? >> >> [PATCH V3 0/6] cpuidle/ppc: Enable broadcast support for deep idle states >> >> -- Daniel So the point I am trying to make is that the fix that you have proposed on this thread is valid. It is difficult to ensure that a per cpu clock device doubles up as the broadcast source without significant code changes to the current broadcast code and the timer code. But the patch [PATCH V3 0/6] cpuidle/ppc: Enable broadcast support for deep idle states, attempts to overcome t
Re: [PATCH 1/2] tick: broadcast: Deny per-cpu clockevents from being broadcast sources
Hi Soren, On 09/13/2013 09:53 PM, Sören Brinkmann wrote: > Hi Preeti, > Thanks for the explanation but now I'm a little confused. That's a lot of > details and I'm lacking the in depth knowledge to fully understand > everything. > > Is it correct to say, that your patch series enables per cpu devices to > be the broadcast device - for PPC? Not really. We have a pseudo clock device, which is registered as the broadcast device. This clock device has all the features of an external clock device that the broadcast framework expects from a broadcast device, like !CLOCK_FEAT_C3STOP & !FEAT_PERCPU that you introduce in your patch. It is as though we trick the broadcast framework into believing that we have an external device, while in reality the pseudo device is just a dummy. So if this is a pseudo device, which gets registered as the broadcast device, how do we program it to handle broadcast events? That is where the per cpu device steps in. It serves as the clock source to this pseudo device. Meaning we program the per cpu device for the next broadcast event using a hrtimer framework that we introduce, which calls pseudo_dev->event_handler on expiry. This is nothing but the broadcast handler. Therefore we are able to manage broadcast without having to have an explicit clock device for the purpose. > And that would mean, that even though you have a per cpu device, you'd > deliberately not set the FEAT_PERCPU flag, because on PPC a per cpu > timer is a valid broadcast device? No, we would set the FEAT_PERCPU for the per cpu device on PPC. As I mentioned above, this is not going to be registered as the broadcast device. We would however not set this flag for the pseudo device, which we register as the broadcast device. > > Assuming that is not going into an utterly wrong direction: How would we > close on this one? AFAIK, ARM does not have this capability and I guess > it won't be added. So, should I go forward with the fix proposed by > Thomas?
Should we rename the FEAT_PERCPU flag to something else, given > that PPC may use per cpu devices for broadcasting and the sole usage of > that flag is to prevent such a device from becoming the broadcast device? You can go ahead with this fix because as explained above, when we register a broadcast device we use a pseudo device which has the features that the broadcast framework approves. The per cpu device does not register itself with the broadcast framework. It merely programs itself for the next broadcast event. Hence this fix will not hinder the broadcast support on PPC. > > Thanks, > Sören > > Regards Preeti U Murthy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
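To make the pseudo-device trick above concrete, here is a rough user-space model (all names are illustrative, not the kernel's): the registered broadcast device is a dummy struct whose event_handler is the broadcast framework's handler, and a per-cpu hrtimer expiry, rather than real external hardware, is what invokes it.

```c
/* Hypothetical model of the pseudo broadcast device described above.
 * Nothing here is kernel code; it only mirrors the shape of the trick. */
struct mock_clock_event_device {
    const char *name;
    void (*event_handler)(struct mock_clock_event_device *dev);
    int handled;    /* counts broadcast expiries seen by the handler */
};

/* Stands in for the broadcast framework's handler, which would send
 * IPIs to the cpus in deep idle whose timers have expired. */
static void mock_broadcast_handler(struct mock_clock_event_device *dev)
{
    dev->handled++;
}

/* What the per-cpu hrtimer callback would do on expiry: invoke the
 * pseudo device's handler as though an external device had fired. */
static void mock_hrtimer_expiry(struct mock_clock_event_device *bc_dev)
{
    if (bc_dev->event_handler)
        bc_dev->event_handler(bc_dev);
}

static int run_mock_broadcast(int expiries)
{
    struct mock_clock_event_device bc = {
        .name = "pseudo-broadcast",
        .event_handler = mock_broadcast_handler,
        .handled = 0,
    };
    for (int i = 0; i < expiries; i++)
        mock_hrtimer_expiry(&bc);
    return bc.handled;
}
```

The point of the sketch is only that the framework never inspects what drives event_handler; any timer that calls it at the right moments is indistinguishable from an external clock device.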
Re: [RFC][PATCH 0/7] Power-aware scheduling v2
Hi, On 10/14/2013 07:02 PM, Peter Zijlstra wrote: > On Fri, Oct 11, 2013 at 06:19:10PM +0100, Morten Rasmussen wrote: >> Hi, >> >> I have revised the previous power scheduler proposal[1] trying to address as >> many of the comments as possible. The overall idea was discussed at LPC[2,3]. >> The revised design has removed the power scheduler and replaced it with a >> high >> level power driver interface. An interface that allows the scheduler to query >> the power driver for information and provide hints to guide power management >> decisions in the power driver. >> >> The power driver is going to be a unified platform power driver that can >> replace cpufreq and cpuidle drivers. Generic power policies will be optional >> helper functions called from the power driver. Platforms may choose to >> implement their own policies as part of their power driver. >> >> This RFC series prototypes a part of the power driver interface (cpu capacity >> hints) and shows how they can be used from the scheduler. More extensive use >> of >> the power driver hints and queries is left for later. The focus for now is >> the >> power driver interface. The patch series includes a power driver/cpufreq >> governor that can use existing cpufreq drivers as backend. It has been tested >> (not thoroughly) on ARM TC2. The cpufreq governor power driver implementation >> is rather horrible, but it illustrates how the power driver interface can be >> used. Native power drivers is on the todo list. >> >> The power driver interface is still missing quite a few calls to handle: >> Idle, >> adding extra information to the sched_domain hierarchy to guide scheduling >> decisions (packing), and possibly scaling of tracked load to compensate for >> frequency changes and asymmetric systems (big.LITTLE). >> >> This set is based on 3.11. I have done ARM TC2 testing based on linux-linaro >> 2013.08[4] to get cpufreq support for TC2. > > What I'm missing is a general overview of why what and how. 
I agree that the "why" needs to be mentioned very clearly since the patchset revolves around it. As far as I understand, we need a single controller for deciding the power efficiency of the kernel, which is exposed to all the user policies and the frequency+idle states stats of the CPU to begin with. These stats are being supplied by the power driver. Having these details and decision making in multiple places like we do today in cpuidle, cpufreq and the scheduler will probably cause problems. For example, when the power efficiency of the kernel goes wrong we have trouble pointing out the reason behind it. Where did the problem arise from among the above three power policy decision makers? This is a maintainability concern. Another reason is that the power saving decisions made by, say, cpuidle may not complement the power saving decisions made by cpufreq. This can lead to inconsistent results across different workloads. Thus, by having a single policy maker for power savings, we are hoping to solve the primary concerns of consistent behaviour from the kernel in terms of power efficiency and improved maintainability. > > In particular; how does this proposal lead to power savings. Is there a > mathematical model that supports this framework? Something where if you > give it a task set with global utilisation < 1 (ie. there's idle time), > it results in less power used. AFAIK, this patchset is an attempt to achieve consistency in the power efficiency of the kernel across workloads with the existing algorithms, in addition to a cleanup involving integration of the power policy making in one place as explained above. In an attempt to do so, *maybe* better power numbers can be obtained or at least the default power efficiency of the kernel will show up. However adding new patchsets like packing small tasks, heterogeneous scheduling, power aware scheduling etc. *should* then yield good and consistent power savings since they now stand on top of an integrated stable power driver.
Regards Preeti U Murthy > > Also, how does this proposal deal with cpufreq's fundamental broken > approach to SMP? Afaict nothing considers the effect of one cpu upon > another -- something which isn't true at all. > > In fact, I don't see anything except a random bunch of hooks without an > over-all picture of how to get less power used. >
[PATCH V3 0/6] cpuidle/ppc: Enable broadcast support for deep idle states
On PowerPC, when CPUs enter deep idle states, their local timers get switched off. An external clock device needs to be programmed to wake them up at their next timer event. On PowerPC, we do not have an external device equivalent to HPET, which is currently used on architectures like x86 under the same scenario. Instead we assign the local timer of one of the CPUs to do this job. This patchset is an attempt to hook onto the existing timer broadcast framework in the kernel by using the local timer of one of the CPUs to do the job of the external clock device. On expiry of this device, the broadcast framework today has the infrastructure to send ipis to all such CPUs whose local timers have expired. Hence the term "broadcast" and the ipi sent is called the broadcast ipi. This patch series is ported on top of 3.11-rc7 + the cpuidle driver backend for power posted by Deepthi Dharwar recently. http://comments.gmane.org/gmane.linux.ports.ppc.embedded/63556

Changes in V3:
1. Fix the way in which a broadcast ipi is handled on the idling cpus. Timer handling on a broadcast ipi is now done without missing out any timer stats generation.
2. Fix a bug in the programming of the hrtimer meant to do broadcast. Program it to trigger at the earlier of a "broadcast period" and the next wakeup event. By introducing the "broadcast period" as the maximum period after which the broadcast hrtimer can fire, we ensure that we do not miss wakeups in corner cases.
3. On hotplug of a broadcast cpu, trigger the hrtimer meant to do broadcast to fire immediately on the new broadcast cpu. This will ensure we do not miss doing a broadcast pending in the nearest future.
4. Change the type of allocation from GFP_KERNEL to GFP_NOWAIT while initializing bc_hrtimer, since we are in an atomic context and cannot sleep.
5. Use the broadcast ipi to wake up the newly nominated broadcast cpu on hotplug of the old, instead of smp_call_function_single().
This is because interrupts are disabled at this point and we should not be using smp_call_function_single() or its children in this context to send an ipi.
6. Move GENERIC_CLOCKEVENTS_BROADCAST to arch/powerpc/Kconfig.
7. Fix coding style issues.

Changes in V2: https://lkml.org/lkml/2013/8/14/239
1. Dynamically pick a broadcast CPU, instead of having a dedicated one.
2. Remove the constraint of having to disable tickless idle on the broadcast CPU by queueing an hrtimer dedicated to do broadcast.

V1 posting: https://lkml.org/lkml/2013/7/25/740.

The patchset has been tested for stability in idle and during multi threaded ebizzy runs. Many thanks to Ben H, Frederic Weisbecker, Li Yang, Srivatsa S. Bhat and Vaidyanathan Srinivasan for all their comments and suggestions so far. --- Preeti U Murthy (4): cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines cpuidle/ppc: Add basic infrastructure to support the broadcast framework on ppc cpuidle/ppc: Introduce the deep idle state in which the local timers stop cpuidle/ppc: Nominate new broadcast cpu on hotplug of the old Srivatsa S. Bhat (2): powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC) powerpc: Implement broadcast timer interrupt as an IPI message arch/powerpc/Kconfig|1 arch/powerpc/include/asm/smp.h |3 - arch/powerpc/include/asm/time.h |4 + arch/powerpc/kernel/smp.c | 23 +++- arch/powerpc/kernel/time.c | 143 -- arch/powerpc/platforms/cell/interrupt.c |2 arch/powerpc/platforms/ps3/smp.c|2 drivers/cpuidle/cpuidle-ibm-power.c | 172 +++ scripts/kconfig/streamline_config.pl|0 9 files changed, 307 insertions(+), 43 deletions(-) mode change 100644 => 100755 scripts/kconfig/streamline_config.pl
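Change (2) in the V3 changelog amounts to arming the broadcast hrtimer for the earlier of the next pending wakeup and one broadcast period from now, so a mis-tracked wakeup can be late by at most one broadcast period. A minimal sketch of that decision, with made-up names and abstract tick units:

```c
#include <stdint.h>

/* Illustrative helper, not kernel code: compute the next expiry of the
 * broadcast hrtimer. The timer never sleeps longer than one broadcast
 * period, but fires earlier if a cpu in deep idle has a nearer wakeup. */
static uint64_t next_bc_expiry(uint64_t now, uint64_t bc_period,
                               uint64_t next_wakeup)
{
    uint64_t cap = now + bc_period;   /* upper bound on the sleep */
    return next_wakeup < cap ? next_wakeup : cap;
}
```

With this policy, even if the bookkeeping of pending wakeups goes wrong in a corner case, the periodic cap guarantees the bc_cpu re-checks within one broadcast period.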
[PATCH V3 1/6] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)
From: Srivatsa S. Bhat The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map to a common implementation - generic_smp_call_function_single_interrupt(). So, we can consolidate them and save one of the IPI message slots, (which are precious, since only 4 of those slots are available). So, implement the functionality of PPC_MSG_CALL_FUNC using PPC_MSG_CALL_FUNC_SINGLE itself and release its IPI message slot, so that it can be used for something else in the future, if desired. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Preeti U Murthy --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/kernel/smp.c | 12 +--- arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 4 files changed, 8 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 48cfc85..a632b6e 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu); * * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up * in /proc/interrupts will be wrong!!! 
--Troy */ -#define PPC_MSG_CALL_FUNCTION 0 +#define PPC_MSG_UNUSED 0 #define PPC_MSG_RESCHEDULE 1 #define PPC_MSG_CALL_FUNC_SINGLE 2 #define PPC_MSG_DEBUGGER_BREAK 3 diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 38b0ba6..bc41e9f 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -111,9 +111,9 @@ int smp_generic_kick_cpu(int nr) } #endif /* CONFIG_PPC64 */ -static irqreturn_t call_function_action(int irq, void *data) +static irqreturn_t unused_action(int irq, void *data) { - generic_smp_call_function_interrupt(); + /* This slot is unused and hence available for use, if needed */ return IRQ_HANDLED; } @@ -144,14 +144,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) } static irq_handler_t smp_ipi_action[] = { - [PPC_MSG_CALL_FUNCTION] = call_function_action, + [PPC_MSG_UNUSED] = unused_action, /* Slot available for future use */ [PPC_MSG_RESCHEDULE] = reschedule_action, [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { - [PPC_MSG_CALL_FUNCTION] = "ipi call function", + [PPC_MSG_UNUSED] = "ipi unused", [PPC_MSG_RESCHEDULE] = "ipi reschedule", [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", @@ -221,8 +221,6 @@ irqreturn_t smp_ipi_demux(void) all = xchg(>messages, 0); #ifdef __BIG_ENDIAN - if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNCTION))) - generic_smp_call_function_interrupt(); if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE))) scheduler_ipi(); if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE))) @@ -265,7 +263,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask) unsigned int cpu; for_each_cpu(cpu, mask) - do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); + do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); } #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 
2d42f3b..28166e4 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -213,7 +213,7 @@ static void iic_request_ipi(int msg) void iic_request_IPIs(void) { - iic_request_ipi(PPC_MSG_CALL_FUNCTION); + iic_request_ipi(PPC_MSG_UNUSED); iic_request_ipi(PPC_MSG_RESCHEDULE); iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE); iic_request_ipi(PPC_MSG_DEBUGGER_BREAK); diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c index 4b35166..488f069 100644 --- a/arch/powerpc/platforms/ps3/smp.c +++ b/arch/powerpc/platforms/ps3/smp.c @@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void) * to index needs to be setup. */ - BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION!= 0); + BUILD_BUG_ON(PPC_MSG_UNUSED != 0); BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1); BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2); BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3);
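For reference, the demux check seen in the patch above, all & (1 << (24 - 8 * PPC_MSG_*)), works because each of the four IPI message slots owns one byte of a 32-bit word, with slot 0 in the most significant byte on big-endian. A small stand-alone model of that encoding (constants renamed to avoid clashing with the kernel's):

```c
#include <stdint.h>

/* Mirrors the four IPI message slots; only four exist, which is why
 * freeing one up in the patch above is valuable. */
#define MSG_UNUSED            0
#define MSG_RESCHEDULE        1
#define MSG_CALL_FUNC_SINGLE  2
#define MSG_DEBUGGER_BREAK    3

/* Bit tested by the big-endian demux path: slot 0 lives in the most
 * significant byte of the 32-bit message word. */
static uint32_t msg_bit(int msg)
{
    return 1u << (24 - 8 * msg);
}

/* What smp_ipi_demux() effectively asks for each slot after the
 * xchg() that atomically fetches and clears the pending word. */
static int message_pending(uint32_t all, int msg)
{
    return (all & msg_bit(msg)) != 0;
}
```

So a word of 0x01000000 means slot 0 fired, 0x00010000 means a reschedule, and combinations encode several pending messages fetched in one atomic exchange.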
[PATCH V3 3/6] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
On PowerPC, when CPUs enter deep idle states, their local timers get switched off. The local timer is called the decrementer. An external clock device needs to be programmed to wake them up at their next timer event. On PowerPC, we do not have an external device equivalent to HPET, which is currently used on architectures like x86 under the same scenario. Instead we assign the local timer of one of the CPUs to do this job. On expiry of this timer, the broadcast framework today has the infrastructure to send ipis to all such CPUs whose local timers have expired. When such an ipi is received, the cpus in deep idle should handle their expired timers. It should be as though they were woken up from a timer interrupt itself. Hence this external ipi serves as an emulated timer interrupt for the cpus in deep idle. Therefore ideally on ppc, these cpus should call timer_interrupt(), which is the interrupt handler for a decrementer interrupt. But timer_interrupt() also contains routines which are usually performed in an interrupt handler. These are not required to be done in this scenario as the external interrupt handler takes care of them. Therefore split up timer_interrupt() into routines performed during regular interrupt handling and __timer_interrupt(), which takes care of running local timers and collecting time related stats. Now on a broadcast ipi, call __timer_interrupt().
Signed-off-by: Preeti U Murthy --- arch/powerpc/kernel/time.c | 69 1 file changed, 37 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 0dfa0c5..eb48291 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -478,6 +478,42 @@ void arch_irq_work_raise(void) #endif /* CONFIG_IRQ_WORK */ +static void __timer_interrupt(void) +{ + struct pt_regs *regs = get_irq_regs(); + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + struct clock_event_device *evt = &__get_cpu_var(decrementers); + u64 now; + + __get_cpu_var(irq_stat).timer_irqs++; + trace_timer_interrupt_entry(regs); + + if (test_irq_work_pending()) { + clear_irq_work_pending(); + irq_work_run(); + } + + now = get_tb_or_rtc(); + if (now >= *next_tb) { + *next_tb = ~(u64)0; + if (evt->event_handler) + evt->event_handler(evt); + } else { + now = *next_tb - now; + if (now <= DECREMENTER_MAX) + set_dec((int)now); + } + +#ifdef CONFIG_PPC64 + /* collect purr register values often, for accurate calculations */ + if (firmware_has_feature(FW_FEATURE_SPLPAR)) { + struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); + cu->current_tb = mfspr(SPRN_PURR); + } +#endif + trace_timer_interrupt_exit(regs); +} + /* * timer_interrupt - gets called when the decrementer overflows, * with interrupts disabled. @@ -486,8 +522,6 @@ void timer_interrupt(struct pt_regs * regs) { struct pt_regs *old_regs; u64 *next_tb = &__get_cpu_var(decrementers_next_tb); - struct clock_event_device *evt = &__get_cpu_var(decrementers); - u64 now; /* Ensure a positive value is written to the decrementer, or else * some CPUs will continue to take decrementer exceptions. 
@@ -510,8 +544,6 @@ void timer_interrupt(struct pt_regs * regs) */ may_hard_irq_enable(); - __get_cpu_var(irq_stat).timer_irqs++; - #if defined(CONFIG_PPC32) && defined(CONFIG_PMAC) if (atomic_read(_n_lost_interrupts) != 0) do_IRQ(regs); @@ -520,34 +552,7 @@ void timer_interrupt(struct pt_regs * regs) old_regs = set_irq_regs(regs); irq_enter(); - trace_timer_interrupt_entry(regs); - - if (test_irq_work_pending()) { - clear_irq_work_pending(); - irq_work_run(); - } - - now = get_tb_or_rtc(); - if (now >= *next_tb) { - *next_tb = ~(u64)0; - if (evt->event_handler) - evt->event_handler(evt); - } else { - now = *next_tb - now; - if (now <= DECREMENTER_MAX) - set_dec((int)now); - } - -#ifdef CONFIG_PPC64 - /* collect purr register values often, for accurate calculations */ - if (firmware_has_feature(FW_FEATURE_SPLPAR)) { - struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); - cu->current_tb = mfspr(SPRN_PURR); - } -#endif - - trace_timer_interrupt_exit(regs); - + __timer_interrupt(); irq_exit(); set_irq_regs(old_regs); }
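The fire-or-rearm decision that moves into __timer_interrupt() in the patch above can be modelled in user space roughly as follows (names, types and the return convention are inventions of this sketch, not the kernel's):

```c
#include <stdint.h>

/* Stand-in for DECREMENTER_MAX: the largest positive value the 32-bit
 * decrementer can be programmed with. */
#define MOCK_DECREMENTER_MAX 0x7fffffffu

/* Model of __timer_interrupt()'s core: if the timebase has reached the
 * recorded next event, report that the event handler should run
 * (return 1) and mark no further event programmed; otherwise re-arm
 * the decrementer with the remaining ticks, clamped to the maximum. */
static int timer_check(uint64_t now, uint64_t *next_tb, uint32_t *dec)
{
    if (now >= *next_tb) {
        *next_tb = ~(uint64_t)0;   /* no further event programmed */
        return 1;                  /* caller runs evt->event_handler */
    }
    uint64_t delta = *next_tb - now;
    if (delta <= MOCK_DECREMENTER_MAX)
        *dec = (uint32_t)delta;    /* re-arm for the remaining ticks */
    return 0;
}
```

A cpu woken by the broadcast ipi takes the first branch, which is exactly why the real patch resets decrementers_next_tb to "now" before calling __timer_interrupt() on that path.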
[PATCH V3 2/6] powerpc: Implement broadcast timer interrupt as an IPI message
From: Srivatsa S. Bhat For scalability and performance reasons, we want the broadcast IPIs to be handled as efficiently as possible. Fixed IPI messages are one of the most efficient mechanisms available - they are faster than the smp_call_function mechanism because the IPI handlers are fixed and hence they don't involve costly operations such as adding IPI handlers to the target CPU's function queue, acquiring locks for synchronization etc. Luckily we have an unused IPI message slot, so use that to implement broadcast timer interrupts efficiently. Signed-off-by: Srivatsa S. Bhat [Changelog modified by pre...@linux.vnet.ibm.com] Signed-off-by: Preeti U Murthy --- arch/powerpc/include/asm/smp.h |3 ++- arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/smp.c | 19 +++ arch/powerpc/kernel/time.c |4 arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- scripts/kconfig/streamline_config.pl|0 7 files changed, 24 insertions(+), 7 deletions(-) mode change 100644 => 100755 scripts/kconfig/streamline_config.pl diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index a632b6e..22f6d63 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu); * * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up * in /proc/interrupts will be wrong!!! --Troy */ -#define PPC_MSG_UNUSED 0 +#define PPC_MSG_TIMER 0 #define PPC_MSG_RESCHEDULE 1 #define PPC_MSG_CALL_FUNC_SINGLE 2 #define PPC_MSG_DEBUGGER_BREAK 3 @@ -194,6 +194,7 @@ extern struct smp_ops_t *smp_ops; extern void arch_send_call_function_single_ipi(int cpu); extern void arch_send_call_function_ipi_mask(const struct cpumask *mask); +extern void arch_send_tick_broadcast(const struct cpumask *mask); /* Definitions relative to the secondary CPU spin loop * and entry point. 
Not all of them exist on both 32 and diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index c1f2676..4e35282 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); extern void GregorianDay(struct rtc_time *tm); +extern void decrementer_timer_interrupt(void); extern void generic_calibrate_decr(void); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index bc41e9f..d3b7014 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -111,9 +112,9 @@ int smp_generic_kick_cpu(int nr) } #endif /* CONFIG_PPC64 */ -static irqreturn_t unused_action(int irq, void *data) +static irqreturn_t timer_action(int irq, void *data) { - /* This slot is unused and hence available for use, if needed */ + decrementer_timer_interrupt(); return IRQ_HANDLED; } @@ -144,14 +145,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) } static irq_handler_t smp_ipi_action[] = { - [PPC_MSG_UNUSED] = unused_action, /* Slot available for future use */ + [PPC_MSG_TIMER] = timer_action, [PPC_MSG_RESCHEDULE] = reschedule_action, [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { - [PPC_MSG_UNUSED] = "ipi unused", + [PPC_MSG_TIMER] = "ipi timer", [PPC_MSG_RESCHEDULE] = "ipi reschedule", [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", @@ -221,6 +222,8 @@ irqreturn_t smp_ipi_demux(void) all = xchg(>messages, 0); #ifdef __BIG_ENDIAN + if (all & (1 << (24 - 8 * PPC_MSG_TIMER))) + decrementer_timer_interrupt(); if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE))) scheduler_ipi(); if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE))) @@ -266,6 +269,14 @@ void 
arch_send_call_function_ipi_mask(const struct cpumask *mask) do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); } +void arch_send_tick_broadcast(const struct cpumask *mask) +{ + unsigned int cpu; + + for_each_cpu(cpu, mask) + do_message_pass(cpu, PPC_MSG_TIMER); +} + #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) void smp_send_debugger_break(void) { diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 65ab9e9..0dfa0c5 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -813,6 +813,10 @@ static void decrem
[PATCH V3 4/6] cpuidle/ppc: Add basic infrastructure to support the broadcast framework on ppc
The broadcast framework in the kernel expects an external clock device which will continue functioning in deep idle states also. This ability is specified by the "non-existence" of the feature C3STOP. This is the device that it relies upon to wake up cpus in deep idle states whose local timers/clock devices get switched off in deep idle states. On ppc we do not have such an external device. Therefore we introduce a pseudo clock device, which has the features of this external clock device, called the broadcast_clockevent. Having such a device qualifies the cpus to enter and exit deep idle states from the point of view of the broadcast framework, because there is an external device to wake them up. Specifically the broadcast framework uses this device's event handler and next_event members in its functioning. On ppc we use this device as the gateway into the broadcast framework and *not* as a timer. An explicit timer infrastructure will be developed in the following patches to keep track of when to wake up cpus in deep idle. Since this device is a pseudo device, it can be safely assumed to work for all cpus. Therefore its cpumask is set to cpu_possible_mask. Also due to the same reason, the set_next_event() routine associated with this device is a nop. The broadcast framework relies on a broadcast functionality being made available in the .broadcast member of the local clock devices on all cpus. This function is called upon by the broadcast framework on one of the nominated cpus, to send ipis to all the cpus in deep idle at their expired timer events. This patch also initializes the .broadcast member of the decrementer whose job is to send the broadcast ipis. When cpus inform the broadcast framework that they are entering deep idle, their local timers are put in shutdown mode. On ppc, this means setting decrementers_next_tb and programming the decrementer to DECREMENTER_MAX.
On being woken up by the broadcast ipi, these cpus call __timer_interrupt(), which runs the local timers only if decrementer_next_tb has expired. Therefore on being woken up from the broadcast ipi, set the decrementers_next_tb to now before calling __timer_interrupt(). Signed-off-by: Preeti U Murthy --- arch/powerpc/Kconfig|1 + arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/time.c | 69 ++- 3 files changed, 70 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index dbd9d3c..550fc04 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -130,6 +130,7 @@ config PPC select GENERIC_CMOS_UPDATE select GENERIC_TIME_VSYSCALL_OLD select GENERIC_CLOCKEVENTS + select GENERIC_CLOCKEVENTS_BROADCAST select GENERIC_STRNCPY_FROM_USER select GENERIC_STRNLEN_USER select HAVE_MOD_ARCH_SPECIFIC diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index 4e35282..264dc96 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -24,6 +24,7 @@ extern unsigned long tb_ticks_per_jiffy; extern unsigned long tb_ticks_per_usec; extern unsigned long tb_ticks_per_sec; extern struct clock_event_device decrementer_clockevent; +extern struct clock_event_device broadcast_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index eb48291..bda78bb 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -42,6 +42,7 @@ #include #include #include +#include #include #include #include @@ -97,8 +98,13 @@ static struct clocksource clocksource_timebase = { static int decrementer_set_next_event(unsigned long evt, struct clock_event_device *dev); +static int broadcast_set_next_event(unsigned long evt, + struct clock_event_device *dev); +static void broadcast_set_mode(enum clock_event_mode mode, +struct clock_event_device *dev); static void decrementer_set_mode(enum clock_event_mode mode, 
struct clock_event_device *dev); +static void decrementer_timer_broadcast(const struct cpumask *mask); struct clock_event_device decrementer_clockevent = { .name = "decrementer", @@ -106,12 +112,24 @@ struct clock_event_device decrementer_clockevent = { .irq= 0, .set_next_event = decrementer_set_next_event, .set_mode = decrementer_set_mode, - .features = CLOCK_EVT_FEAT_ONESHOT, + .broadcast = decrementer_timer_broadcast, + .features = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT, }; EXPORT_SYMBOL(decrementer_clockevent); +struct clock_event_device broadcast_clockevent = { + .name = "broadcast", + .rating = 20
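The feature change visible in the hunk above is the crux: the decrementer now advertises CLOCK_EVT_FEAT_C3STOP (it stops in deep idle), while the pseudo broadcast_clockevent deliberately does not, which is what lets the framework accept it as a broadcast source. A toy checker, using the kernel's flag values but a deliberately simplified rule:

```c
/* Flag values mirror the kernel's clockevent feature bits; the checker
 * itself is a simplification for this sketch, not the real framework. */
#define FEAT_ONESHOT 0x000002u
#define FEAT_C3STOP  0x000008u

/* A device whose clock stops in deep idle (C3STOP) cannot be trusted
 * to wake anyone up, so it is rejected as a broadcast source. */
static int can_be_broadcast_source(unsigned int features)
{
    return !(features & FEAT_C3STOP);
}
```

Under this rule the per-cpu decrementer (ONESHOT | C3STOP) is disqualified, while the dummy broadcast device (ONESHOT only) passes, even though it is the decrementer-driven hrtimer that does the actual work.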
[PATCH V3 5/6] cpuidle/ppc: Introduce the deep idle state in which the local timers stop
Now that we have the basic infrastructure set up to make use of the broadcast framework, introduce the deep idle state in which cpus need to avail the functionality provided by this infrastructure to wake them up at their expired timer events. On ppc this deep idle state is called sleep. In this patch however, we introduce longnap, which emulates the sleep state by disabling timer interrupts. This is until such time that sleep support is made available in the kernel. Since on ppc we do not have an external device that can wake up cpus in deep idle, the local timer of one of the cpus needs to be nominated to do this job. This cpu is called the broadcast cpu/bc_cpu. Only if the bc_cpu is nominated will the remaining cpus be allowed to enter deep idle state after notifying the broadcast framework about their next timer event. The bc_cpu is not allowed to enter deep idle state. The first cpu that enters longnap is made the bc_cpu. It queues an hrtimer onto itself which expires after a broadcast period. The job of this hrtimer is to call into the broadcast framework[1] using the pseudo clock device that we have initialized, in which the cpus whose wakeup times have expired are sent an ipi. On each expiry of the hrtimer, it is programmed to the earlier of the next pending timer event of the cpus in deep idle and the broadcast period, so as to not miss any wakeups. The broadcast period is nothing but the max duration until which the bc_cpu need not concern itself with checking for expired timer events on cpus in deep idle. The broadcast period is set to a jiffy in this patch for debug purposes. Ideally it needn't be smaller than the target_residency of the deep idle state. But having a dedicated bc_cpu would mean overloading just one cpu with the broadcast work, which could hinder its performance apart from leading to thermal imbalance on the chip. Therefore unassign the bc_cpu when there are no more cpus in deep idle to be woken up.
The bc_cpu is left unassigned until such a time that a cpu enters longnap to be nominated as the bc_cpu and the above cycle repeats. Protect the region of nomination,de-nomination and check for existence of broadcast cpu with a lock to ensure synchronization between them. [1] tick_handle_oneshot_broadcast() or tick_handle_periodic_broadcast(). Signed-off-by: Preeti U Murthy --- arch/powerpc/include/asm/time.h |1 arch/powerpc/kernel/time.c |2 drivers/cpuidle/cpuidle-ibm-power.c | 150 +++ 3 files changed, 152 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index 264dc96..38341fa 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -25,6 +25,7 @@ extern unsigned long tb_ticks_per_usec; extern unsigned long tb_ticks_per_sec; extern struct clock_event_device decrementer_clockevent; extern struct clock_event_device broadcast_clockevent; +extern struct clock_event_device bc_timer; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index bda78bb..44a76de 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -129,7 +129,7 @@ EXPORT_SYMBOL(broadcast_clockevent); DEFINE_PER_CPU(u64, decrementers_next_tb); static DEFINE_PER_CPU(struct clock_event_device, decrementers); -static struct clock_event_device bc_timer; +struct clock_event_device bc_timer; #define XSEC_PER_SEC (1024*1024) diff --git a/drivers/cpuidle/cpuidle-ibm-power.c b/drivers/cpuidle/cpuidle-ibm-power.c index f8905c3..ae47a0a 100644 --- a/drivers/cpuidle/cpuidle-ibm-power.c +++ b/drivers/cpuidle/cpuidle-ibm-power.c @@ -12,12 +12,19 @@ #include #include #include +#include +#include +#include +#include +#include +#include #include #include #include #include #include +#include #include struct cpuidle_driver power_idle_driver = { @@ -28,6 +35,26 @@ struct cpuidle_driver power_idle_driver = { static int max_idle_state; 
static struct cpuidle_state *cpuidle_state_table; +static int bc_cpu = -1; +static struct hrtimer *bc_hrtimer; +static int bc_hrtimer_initialized = 0; + +/* + * Bits to indicate if a cpu can enter deep idle where local timer gets + * switched off. + * BROADCAST_CPU_PRESENT : Enter deep idle since bc_cpu is assigned + * BROADCAST_CPU_SELF : Do not enter deep idle since you are bc_cpu + * BROADCAST_CPU_ABSENT : Do not enter deep idle since there is no bc_cpu, + *hence nominate yourself as bc_cpu + * BROADCAST_CPU_ERROR : Do not enter deep idle since there is no bc_cpu + *and the broadcast hrtimer could not be initialized. + */ +enum broadcast_cpu_status { + BROADCAST_CPU_PRESENT, + BROADCAST_CPU_SELF, + BROADCAST_CPU_ERROR, +}; + static inline void idle_loop_prolog(unsigned long *in_purr) { *in_purr = mfspr
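The nomination logic described above (the first cpu into longnap becomes the bc_cpu; the bc_cpu itself may not enter deep idle; others may only if a bc_cpu exists) can be sketched as a small decision function. The real code does this under a lock and also handles the hrtimer-initialization error case; the names here follow the patch's enum but the function itself is hypothetical:

```c
/* Illustrative subset of the patch's broadcast_cpu_status values. */
enum bc_status { BC_PRESENT, BC_SELF, BC_ABSENT };

/* Decision a cpu entering longnap would make. bc_cpu is -1 when no
 * broadcast cpu is assigned (the real code guards this with a lock). */
static enum bc_status check_and_nominate(int *bc_cpu, int cpu)
{
    if (*bc_cpu == -1) {
        *bc_cpu = cpu;      /* nominate self; stay out of deep idle */
        return BC_ABSENT;
    }
    if (*bc_cpu == cpu)
        return BC_SELF;     /* the bc_cpu itself may not sleep deeply */
    return BC_PRESENT;      /* safe to enter deep idle */
}
```

Only the BC_PRESENT outcome permits entering longnap after registering the next timer event with the broadcast framework; the other two fall back to a shallower idle state.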
[PATCH V3 6/6] cpuidle/ppc: Nominate new broadcast cpu on hotplug of the old
On hotplug of the broadcast cpu, cancel the hrtimer queued to do broadcast and nominate a new broadcast cpu to be the first cpu in the broadcast mask which includes all the cpus that have notified the broadcast framework about entering deep idle state. Since the new broadcast cpu is one of the cpus in deep idle, send an ipi to wake it up to continue the duty of broadcast. The new broadcast cpu needs to find out if it woke up to resume broadcast. If so it needs to restart the broadcast hrtimer on itself. Its possible that the old broadcast cpu was hotplugged out when the broadcast hrtimer was about to fire on it. Therefore the newly nominated broadcast cpu should set the broadcast hrtimer on itself to expire immediately so as to not miss wakeups under such scenarios. Signed-off-by: Preeti U Murthy --- arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/time.c |1 + drivers/cpuidle/cpuidle-ibm-power.c | 22 ++ 3 files changed, 24 insertions(+) diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index 38341fa..3bc0205 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -31,6 +31,7 @@ struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); extern void GregorianDay(struct rtc_time *tm); extern void decrementer_timer_interrupt(void); +extern void broadcast_irq_entry(void); extern void generic_calibrate_decr(void); diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 44a76de..0ac2e11 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -853,6 +853,7 @@ void decrementer_timer_interrupt(void) { u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + broadcast_irq_entry(); *next_tb = get_tb_or_rtc(); __timer_interrupt(); } diff --git a/drivers/cpuidle/cpuidle-ibm-power.c b/drivers/cpuidle/cpuidle-ibm-power.c index ae47a0a..580ea04 100644 --- a/drivers/cpuidle/cpuidle-ibm-power.c +++ b/drivers/cpuidle/cpuidle-ibm-power.c @@ -282,6 +282,12 @@ static int 
longnap_loop(struct cpuidle_device *dev, return index; } +void broadcast_irq_entry(void) +{ + if (smp_processor_id() == bc_cpu) + hrtimer_start(bc_hrtimer, ns_to_ktime(0), HRTIMER_MODE_REL_PINNED); +} + /* * States for dedicated partition case. */ @@ -360,6 +366,7 @@ static int power_cpuidle_add_cpu_notifier(struct notifier_block *n, unsigned long action, void *hcpu) { int hotcpu = (unsigned long)hcpu; + unsigned long flags; struct cpuidle_device *dev = per_cpu(cpuidle_devices, hotcpu); @@ -372,6 +379,21 @@ static int power_cpuidle_add_cpu_notifier(struct notifier_block *n, cpuidle_resume_and_unlock(); break; + case CPU_DYING: + case CPU_DYING_FROZEN: + spin_lock_irqsave(_idle_lock, flags); + if (hotcpu == bc_cpu) { + bc_cpu = -1; + hrtimer_cancel(bc_hrtimer); + if (!cpumask_empty(tick_get_broadcast_oneshot_mask())) { + bc_cpu = cpumask_first( + tick_get_broadcast_oneshot_mask()); + arch_send_tick_broadcast(cpumask_of(bc_cpu)); + } + } + spin_unlock_irqrestore(_idle_lock, flags); + break; + case CPU_DEAD: case CPU_DEAD_FROZEN: cpuidle_pause_and_lock(); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
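The hand-off described in the changelog above — cancel the hrtimer on the dying broadcast cpu, nominate the first cpu in the broadcast mask, and arm the timer on the new cpu to expire immediately so no wakeup is missed — can be sketched as a small userspace model. All names and data structures below are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static int bc_cpu = -1;                    /* current broadcast cpu, -1 if none */
static bool timer_armed[NR_CPUS];          /* stands in for the bc hrtimer */
static bool in_broadcast_mask[NR_CPUS];    /* cpus in deep idle awaiting wakeup */

static void hotplug_broadcast_cpu(int dying)
{
	if (dying != bc_cpu)
		return;
	timer_armed[bc_cpu] = false;       /* hrtimer_cancel() analogue */
	bc_cpu = -1;
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		/* nominate the first cpu in the mask, skipping the dying one */
		if (in_broadcast_mask[cpu] && cpu != dying) {
			bc_cpu = cpu;
			/* arm to expire immediately, so a wakeup the old
			 * bc_cpu was about to deliver is not lost */
			timer_armed[cpu] = true;
			break;
		}
	}
}
```

The model only captures the nomination ordering; the real patch additionally sends an IPI to wake the nominee out of deep idle.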
Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
Hi Kamalesh, On 10/22/2013 08:05 PM, Kamalesh Babulal wrote: > * Vaidyanathan Srinivasan [2013-10-21 17:14:42]: > >> for_each_domain(cpu, sd) { >> -struct sched_group *sg = sd->groups; >> -struct sched_group_power *sgp = sg->sgp; >> -int nr_busy = atomic_read(>nr_busy_cpus); >> - >> -if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) >> -goto need_kick_unlock; >> +struct sched_domain *sd_parent = sd->parent; >> +struct sched_group *sg; >> +struct sched_group_power *sgp; >> +int nr_busy; >> + >> +if (sd_parent) { >> +sg = sd_parent->groups; >> +sgp = sg->sgp; >> +nr_busy = atomic_read(>nr_busy_cpus); >> + >> +if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) >> +goto need_kick_unlock; >> +} >> >> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight >> && (cpumask_first_and(nohz.idle_cpus_mask, > > CC'ing Suresh Siddha and Vincent Guittot > > Please correct me, If my understanding of idle balancing is wrong. > With proposed approach will not idle load balancer kick in, even if > there are busy cpus across groups or if there are 2 busy cpus which > are spread across sockets. Yes load balancing will happen on busy cpus periodically. Wrt idle balancing there are two points here. One, when a CPU is just about to go idle, it will enter idle_balance(), and trigger load balancing with itself being the destination CPU to begin with. It will load balance at every level of the sched domain that it belongs to. If it manages to pull tasks, good, else it will enter an idle state. nohz_idle_balancing is triggered by a busy cpu at every tick if it has more than one task in its runqueue or if it belongs to a group that shares the package resources and has more than one cpu busy. By "nohz_idle_balance triggered", it means the busy cpu will send an ipi to the ilb_cpu to do load balancing on the behalf of the idle cpus in the nohz mask. 
So to answer your question wrt this patch: if there is one busy cpu with, say, 2 tasks in one socket and another busy cpu with 1 task on another socket, the former busy cpu can kick nohz_idle_balance since it has more than one task in its runqueue. An idle cpu in either socket could be woken up to balance tasks with it. The usual idle load balancer that runs on a CPU about to become idle could pull from either cpu, depending on whichever is busier, as it begins to load balance across all levels of the sched domain hierarchy that it belongs to.

>
> Consider 2 socket machine with 4 processors each (MC and NUMA domains).
> If the machine is partial loaded such that cpus 0,4,5,6,7 are busy, then too
> nohz balancing is triggered because with this approach
> (NUMA)->groups->sgp->nr_busy_cpus is taken in account for nohz kick, while
> iterating over MC domain.

For the example that you mention, you will have a CPU domain and a NUMA domain. When the sockets are NUMA nodes, each socket will belong to a CPU domain. If the sockets are non-NUMA nodes, then the domain encompassing both the nodes will be a CPU domain, possibly with each socket being an MC domain.

>
> Isn't idle load balancer not suppose kick in, even in the case of two busy
> cpu's in a dual-core single socket system

nohz_idle_balancing is a special case. It is triggered only when the conditions mentioned in nohz_kick_needed() are true, whereas a CPU just about to go idle will trigger load balancing without any pre-conditions. In a single socket machine, there will be a CPU domain encompassing the socket, and the MC domain will encompass a core. The nohz idle load balancer will kick in if both the threads in the core have tasks running on them. This is fair enough, because the threads share the resources of the core.

Regards
Preeti U Murthy

> Thanks,
> Kamalesh.
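The trigger conditions for nohz idle balancing discussed above — a busy cpu with more than one runnable task, or a cpu in a package-resource-sharing domain that has more than one busy cpu — can be sketched as a small standalone model. The struct and function names below are illustrative, not the kernel's; the real nohz_kick_needed() also handles asymmetric packing, which this omits:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified userspace model of the nohz kick decision. */
struct cpu_model {
	int nr_running;          /* tasks on this CPU's runqueue */
	bool shares_pkg;         /* in a domain with SD_SHARE_PKG_RESOURCES-like flag */
	int domain_busy_cpus;    /* busy CPUs in that domain */
};

static bool nohz_kick_needed_model(const struct cpu_model *c)
{
	/* more than one task queued here: an idle cpu could take one */
	if (c->nr_running > 1)
		return true;
	/* sharing package resources with another busy cpu: spreading helps */
	if (c->shares_pkg && c->domain_busy_cpus > 1)
		return true;
	return false;
}
```

When the model returns true, the kernel sends an IPI to the ilb cpu to balance on behalf of the idle cpus in the nohz mask.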
Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
Hi Peter, On 10/23/2013 03:41 AM, Peter Zijlstra wrote: > On Mon, Oct 21, 2013 at 05:14:42PM +0530, Vaidyanathan Srinivasan wrote: >> kernel/sched/fair.c | 19 +-- >> 1 file changed, 13 insertions(+), 6 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 7c70201..12f0eab 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, >> int cpu) >> >> rcu_read_lock(); >> for_each_domain(cpu, sd) { >> +struct sched_domain *sd_parent = sd->parent; >> +struct sched_group *sg; >> +struct sched_group_power *sgp; >> +int nr_busy; >> + >> +if (sd_parent) { >> +sg = sd_parent->groups; >> +sgp = sg->sgp; >> +nr_busy = atomic_read(>nr_busy_cpus); >> + >> +if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) >> +goto need_kick_unlock; >> +} >> >> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight >> && (cpumask_first_and(nohz.idle_cpus_mask, >> > > Almost I'd say; what happens on !sd_parent && SD_ASYM_PACKING ? You are right, sorry about this. The idea was to correct the nr_busy computation before the patch that would remove its usage in the second patch. But that would mean the condition nr_busy != sg->group_weight would be invalid with this patch. The second patch needs to go first to avoid this confusion. > > Also, this made me look at the nr_busy stuff again, and somehow that > entire thing makes me a little sad. > > Can't we do something like the below and cut that nr_busy sd iteration > short? We can surely cut the nr_busy sd iteration but not like what is done with this patch. You stop the nr_busy computation at the sched domain that has the flag SD_SHARE_PKG_RESOURCES set. But nohz_kick_needed() would want to know the nr_busy for one level above this. Consider a core. Assume it is the highest domain with this flag set. The nr_busy of its groups, which are logical threads are set to 1/0 each. 
But nohz_kick_needed() would like to know the sum of the nr_busy parameters of all the groups, i.e. of the threads in a core, before it decides if it can kick nohz_idle balancing. The individual groups' nr_busy values are of no relevance here.

That's why the above patch tries to get sd->parent->groups->sgp->nr_busy_cpus. This translates rightly to the core's busy cpus in this example. But the below patch stops before updating this parameter at the sd->parent level, where sd is the highest level sched domain with the SD_SHARE_PKG_RESOURCES flag set.

We can get around all this confusion if we move the nr_busy parameter into the sched_domain structure rather than the sched_group_power structure. The only place nr_busy is used, nohz_kick_needed(), queries it to know the total number of busy cpus at a sched domain level which has SD_SHARE_PKG_RESOURCES set, not at a sched group level.

So why not move nr_busy to struct sched_domain and have the below patch just update this parameter for one sched domain, sd_busy? This would avoid iterating through all the levels of sched domains and should resolve the scalability issue. We also would not need to reach sd->parent to get the nr_busy parameter for the sake of nohz_kick_needed(). What do you think?
Regards Preeti U Murthy > > This nohz stuff really needs to be re-thought and made more scalable -- > its a royal pain :/ > > > kernel/sched/core.c | 4 > kernel/sched/fair.c | 21 +++-- > kernel/sched/sched.h | 5 ++--- > 3 files changed, 21 insertions(+), 9 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index c06b8d3..89db8dc 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc); > DEFINE_PER_CPU(int, sd_llc_size); > DEFINE_PER_CPU(int, sd_llc_id); > DEFINE_PER_CPU(struct sched_domain *, sd_numa); > +DEFINE_PER_CPU(struct sched_domain *, sd_busy); > > static void update_top_cache_domain(int cpu) > { > @@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu) > > sd = lowest_flag_domain(cpu, SD_NUMA); > rcu_assign_pointer(per_cpu(sd_numa, cpu), sd); > + > + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING); > + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd); > } > >
Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
On 10/23/2013 09:30 AM, Preeti U Murthy wrote: > Hi Peter, > > On 10/23/2013 03:41 AM, Peter Zijlstra wrote: >> On Mon, Oct 21, 2013 at 05:14:42PM +0530, Vaidyanathan Srinivasan wrote: >>> kernel/sched/fair.c | 19 +-- >>> 1 file changed, 13 insertions(+), 6 deletions(-) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index 7c70201..12f0eab 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, >>> int cpu) >>> >>> rcu_read_lock(); >>> for_each_domain(cpu, sd) { >>> + struct sched_domain *sd_parent = sd->parent; >>> + struct sched_group *sg; >>> + struct sched_group_power *sgp; >>> + int nr_busy; >>> + >>> + if (sd_parent) { >>> + sg = sd_parent->groups; >>> + sgp = sg->sgp; >>> + nr_busy = atomic_read(>nr_busy_cpus); >>> + >>> + if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) >>> + goto need_kick_unlock; >>> + } >>> >>> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight >>> && (cpumask_first_and(nohz.idle_cpus_mask, >>> >> >> Almost I'd say; what happens on !sd_parent && SD_ASYM_PACKING ? > > You are right, sorry about this. The idea was to correct the nr_busy > computation before the patch that would remove its usage in the second > patch. But that would mean the condition nr_busy != sg->group_weight > would be invalid with this patch. The second patch needs to go first to > avoid this confusion. > >> >> Also, this made me look at the nr_busy stuff again, and somehow that >> entire thing makes me a little sad. >> >> Can't we do something like the below and cut that nr_busy sd iteration >> short? > > We can surely cut the nr_busy sd iteration but not like what is done > with this patch. You stop the nr_busy computation at the sched domain > that has the flag SD_SHARE_PKG_RESOURCES set. But nohz_kick_needed() > would want to know the nr_busy for one level above this. >Consider a core. Assume it is the highest domain with this flag set. 
> The nr_busy of its groups, which are logical threads are set to 1/0
> each. But nohz_kick_needed() would like to know the sum of the nr_busy
> parameter of all the groups, i.e. the threads in a core before it
> decides if it can kick nohz_idle balancing. The information about the
> individual group's nr_busy is of no relevance here.
>
> Thats why the above patch tries to get the
> sd->parent->groups->sgp->nr_busy_cpus. This will translate rightly to
> the core's busy cpus in this example. But the below patch stops before
> updating this parameter at the sd->parent level, where sd is the highest
> level sched domain with the SD_SHARE_PKG_RESOURCES flag set.
>
> But we can get around all this confusion if we can move the nr_busy
> parameter to be included in the sched_domain structure rather than the
> sched_groups_power structure. Anyway the only place where nr_busy is
> used, that is at nohz_kick_needed(), is done to know the total number of
> busy cpus at a sched domain level which has the SD_SHARE_PKG_RESOURCES
> set and not at a sched group level.
>
> So why not move nr_busy to struct sched_domain and having the below
> patch which just updates this parameter for the sched domain, sd_busy ?

Oh this can't be done :( Domain structures are per cpu!

Regards
Preeti U Murthy
Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
Hi Peter,

On 10/23/2013 03:41 AM, Peter Zijlstra wrote:
> This nohz stuff really needs to be re-thought and made more scalable --
> its a royal pain :/

Why not do something like the below instead? It does the following.

1. It introduces sd_busy just like your suggested patch, except that it points to the parent of the highest level sched domain which has SD_SHARE_PKG_RESOURCES set, and initializes it in update_top_cache_domain(). This is the sched domain that is relevant in nohz_kick_needed().

2. set_cpu_sd_state_busy(), set_cpu_sd_state_idle() and nohz_kick_needed() query and update *only* this sched domain (sd_busy) for nr_busy_cpus. They are the only users of this parameter.

3. While we are at it, we might as well change the nohz_idle parameter to be updated at the sd_busy domain level alone and not the base domain level of a CPU. This will unify the concept of busy cpus at just one level of sched domain. There is no need to iterate through all levels of sched domains of a cpu to update nr_busy_cpus, since it is irrelevant at every level except sd_busy.

4. It decouples asymmetric load balancing from the nr_busy parameter, which PATCH 2/3 anyway does. sd_busy is therefore irrelevant for asymmetric load balancing.
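The bookkeeping change proposed above — one cached level whose counter is updated on each busy/idle transition, instead of walking every domain level — can be sketched as a toy userspace model. The names and the single flat counter are illustrative; the kernel version keeps one such counter per cached sd_busy domain and guards it with RCU:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static int nr_busy_cpus;            /* counter at the cached sd_busy level */
static bool cpu_is_busy[NR_CPUS];   /* per-CPU state, like sd->nohz_idle */

static void set_cpu_state_busy(int cpu)
{
	if (cpu_is_busy[cpu])
		return;                  /* already accounted, as the nohz_idle guard does */
	cpu_is_busy[cpu] = true;
	nr_busy_cpus++;                  /* one update, not one per domain level */
}

static void set_cpu_state_idle(int cpu)
{
	if (!cpu_is_busy[cpu])
		return;
	cpu_is_busy[cpu] = false;
	nr_busy_cpus--;
}
```

nohz_kick_needed() then needs only a single read of the counter rather than a walk over the domain hierarchy.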
Regards Preeti U Murthy START_PATCH--- sched: Fix nohz_kick_needed() --- kernel/sched/core.c |4 kernel/sched/fair.c | 40 ++-- kernel/sched/sched.h |1 + 3 files changed, 27 insertions(+), 18 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c06b8d3..c1dd11c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc); DEFINE_PER_CPU(int, sd_llc_size); DEFINE_PER_CPU(int, sd_llc_id); DEFINE_PER_CPU(struct sched_domain *, sd_numa); +DEFINE_PER_CPU(struct sched_domain *, sd_busy); static void update_top_cache_domain(int cpu) { @@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu) sd = lowest_flag_domain(cpu, SD_NUMA); rcu_assign_pointer(per_cpu(sd_numa, cpu), sd); + + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES)->parent; + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd); } /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 813dd61..71e6f14 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6515,16 +6515,16 @@ static inline void nohz_balance_exit_idle(int cpu) static inline void set_cpu_sd_state_busy(void) { struct sched_domain *sd; + int cpu = smp_processor_id(); rcu_read_lock(); - sd = rcu_dereference_check_sched_domain(this_rq()->sd); + sd = per_cpu(sd_busy, cpu); if (!sd || !sd->nohz_idle) goto unlock; sd->nohz_idle = 0; - for (; sd; sd = sd->parent) - atomic_inc(>groups->sgp->nr_busy_cpus); + atomic_inc(>groups->sgp->nr_busy_cpus); unlock: rcu_read_unlock(); } @@ -6532,16 +6532,16 @@ unlock: void set_cpu_sd_state_idle(void) { struct sched_domain *sd; + int cpu = smp_processor_id(); rcu_read_lock(); - sd = rcu_dereference_check_sched_domain(this_rq()->sd); + sd = per_cpu(sd_busy, cpu); if (!sd || sd->nohz_idle) goto unlock; sd->nohz_idle = 1; - for (; sd; sd = sd->parent) - atomic_dec(>groups->sgp->nr_busy_cpus); + atomic_dec(>groups->sgp->nr_busy_cpus); unlock: rcu_read_unlock(); } @@ -6748,6 +6748,9 @@ static inline int 
nohz_kick_needed(struct rq *rq, int cpu) { unsigned long now = jiffies; struct sched_domain *sd; + struct sched_group *sg; + struct sched_group_power *sgp; + int nr_busy; if (unlikely(idle_cpu(cpu))) return 0; @@ -6773,22 +6776,23 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu) goto need_kick; rcu_read_lock(); - for_each_domain(cpu, sd) { - struct sched_group *sg = sd->groups; - struct sched_group_power *sgp = sg->sgp; - int nr_busy = atomic_read(>nr_busy_cpus); + sd = per_cpu(sd_busy, cpu); - if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) - goto need_kick_unlock; + if (sd) { + sg = sd->groups; + sgp = sg->sgp; + nr_busy = atomic_read(>nr_busy_cpus); - if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight - && (cpumask_first_and(nohz.idle_cpus_mask, - sched_domain_span(sd)) < cpu)) + if (nr_busy > 1) goto need_kick_unlock; - - if (!
Re: [PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources
Hi Peter, On 10/23/2013 03:53 AM, Peter Zijlstra wrote: > On Mon, Oct 21, 2013 at 05:15:02PM +0530, Vaidyanathan Srinivasan wrote: >> kernel/sched/fair.c | 18 ++ >> 1 file changed, 18 insertions(+) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 828ed97..bbcd96b 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq >> *this_rq, >> { >> int ld_moved, cur_ld_moved, active_balance = 0; >> struct sched_group *group; >> +struct sched_domain *child; >> +int share_pkg_res = 0; >> struct rq *busiest; >> unsigned long flags; >> struct cpumask *cpus = __get_cpu_var(load_balance_mask); >> @@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq >> *this_rq, >> >> schedstat_inc(sd, lb_count[idle]); >> >> +child = sd->child; >> +if (child && child->flags & SD_SHARE_PKG_RESOURCES) >> +share_pkg_res = 1; >> + >> redo: >> if (!should_we_balance()) { >> *continue_balancing = 0; >> @@ -5202,6 +5208,7 @@ redo: >> goto out_balanced; >> } >> >> +redo_grp: >> busiest = find_busiest_queue(, group); >> if (!busiest) { >> schedstat_inc(sd, lb_nobusyq[idle]); >> @@ -5292,6 +5299,11 @@ more_balance: >> if (!cpumask_empty(cpus)) { >> env.loop = 0; >> env.loop_break = sched_nr_migrate_break; >> +if (share_pkg_res && >> +cpumask_intersects(cpus, >> +to_cpumask(group->cpumask))) > > sched_group_cpus() > >> +goto redo_grp; >> + >> goto redo; >> } >> goto out_balanced; >> @@ -5318,9 +5330,15 @@ more_balance: >> */ >> if (!cpumask_test_cpu(this_cpu, >> tsk_cpus_allowed(busiest->curr))) { >> +cpumask_clear_cpu(cpu_of(busiest), cpus); >> raw_spin_unlock_irqrestore(>lock, >> flags); >> env.flags |= LBF_ALL_PINNED; >> +if (share_pkg_res && >> + cpumask_intersects(cpus, >> +to_cpumask(group->cpumask))) >> +goto redo_grp; >> + >> goto out_one_pinned; >> } > > Man this retry logic is getting annoying.. isn't there anything saner we > can do? Let me give this a thought and get back. 
Regards
Preeti U Murthy
Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
Hi Vincent,

I have addressed your comments and below is the fresh patch. This patch applies on top of PATCH 2/3 posted in this thread.

Regards
Preeti U Murthy

sched: Remove unnecessary iterations over sched domains to update/query nr_busy_cpus

From: Preeti U Murthy

The nr_busy_cpus parameter is used by nohz_kick_needed() to find out the number of busy cpus in a sched domain which has the SD_SHARE_PKG_RESOURCES flag set. Therefore, instead of updating nr_busy_cpus at every level of sched domain, where it is irrelevant, we can update this parameter only at the parent domain of the sd which has this flag set. Introduce a per-cpu parameter sd_busy which represents this parent domain.

In nohz_kick_needed() we directly query the nr_busy_cpus parameter associated with the groups of sd_busy. By associating sd_busy with the parent of the highest domain which has the SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains which could have this flag set, and trigger nohz_idle_balancing if any of those levels have more than one busy cpu.

sd_busy is irrelevant for asymmetric load balancing.

While we are at it, we might as well change the nohz_idle parameter to be updated at the sd_busy domain level alone, and not the base domain level of a CPU. This will unify the concept of busy cpus at just the one level of sched domain where it is currently used.
Signed-off-by: Preeti U Murthy --- kernel/sched/core.c |5 + kernel/sched/fair.c | 38 -- kernel/sched/sched.h |1 + 3 files changed, 26 insertions(+), 18 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c06b8d3..c540392 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc); DEFINE_PER_CPU(int, sd_llc_size); DEFINE_PER_CPU(int, sd_llc_id); DEFINE_PER_CPU(struct sched_domain *, sd_numa); +DEFINE_PER_CPU(struct sched_domain *, sd_busy); static void update_top_cache_domain(int cpu) { @@ -5290,6 +5291,10 @@ static void update_top_cache_domain(int cpu) sd = lowest_flag_domain(cpu, SD_NUMA); rcu_assign_pointer(per_cpu(sd_numa, cpu), sd); + + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); + if (sd) + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd->parent); } /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e9c9549..f66cfd9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6515,16 +6515,16 @@ static inline void nohz_balance_exit_idle(int cpu) static inline void set_cpu_sd_state_busy(void) { struct sched_domain *sd; + int cpu = smp_processor_id(); rcu_read_lock(); - sd = rcu_dereference_check_sched_domain(this_rq()->sd); + sd = rcu_dereference(per_cpu(sd_busy, cpu)); if (!sd || !sd->nohz_idle) goto unlock; sd->nohz_idle = 0; - for (; sd; sd = sd->parent) - atomic_inc(>groups->sgp->nr_busy_cpus); + atomic_inc(>groups->sgp->nr_busy_cpus); unlock: rcu_read_unlock(); } @@ -6532,16 +6532,16 @@ unlock: void set_cpu_sd_state_idle(void) { struct sched_domain *sd; + int cpu = smp_processor_id(); rcu_read_lock(); - sd = rcu_dereference_check_sched_domain(this_rq()->sd); + sd = rcu_dereference(per_cpu(sd_busy, cpu)); if (!sd || sd->nohz_idle) goto unlock; sd->nohz_idle = 1; - for (; sd; sd = sd->parent) - atomic_dec(>groups->sgp->nr_busy_cpus); + atomic_dec(>groups->sgp->nr_busy_cpus); unlock: rcu_read_unlock(); } @@ -6748,6 +6748,8 @@ static inline int 
nohz_kick_needed(struct rq *rq, int cpu) { unsigned long now = jiffies; struct sched_domain *sd; + struct sched_group_power *sgp; + int nr_busy; if (unlikely(idle_cpu(cpu))) return 0; @@ -6773,22 +6775,22 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu) goto need_kick; rcu_read_lock(); - for_each_domain(cpu, sd) { - struct sched_group *sg = sd->groups; - struct sched_group_power *sgp = sg->sgp; - int nr_busy = atomic_read(>nr_busy_cpus); + sd = rcu_dereference(per_cpu(sd_busy, cpu)); - if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) - goto need_kick_unlock; + if (sd) { + sgp = sd->groups->sgp; + nr_busy = atomic_read(>nr_busy_cpus); - if (sd->flags & SD_ASYM_PACKING - && (cpumask_first_and(nohz.idle_cpus_mask, - sched_domain_span(sd)) < cpu)) + if (nr_busy > 1) goto need_kick_unlock; - - if (!(sd->flags & (SD_SHARE_PKG_
Re: [PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources
Hi Peter, On 10/23/2013 03:53 AM, Peter Zijlstra wrote: > On Mon, Oct 21, 2013 at 05:15:02PM +0530, Vaidyanathan Srinivasan wrote: >> kernel/sched/fair.c | 18 ++ >> 1 file changed, 18 insertions(+) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 828ed97..bbcd96b 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq >> *this_rq, >> { >> int ld_moved, cur_ld_moved, active_balance = 0; >> struct sched_group *group; >> +struct sched_domain *child; >> +int share_pkg_res = 0; >> struct rq *busiest; >> unsigned long flags; >> struct cpumask *cpus = __get_cpu_var(load_balance_mask); >> @@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq >> *this_rq, >> >> schedstat_inc(sd, lb_count[idle]); >> >> +child = sd->child; >> +if (child && child->flags & SD_SHARE_PKG_RESOURCES) >> +share_pkg_res = 1; >> + >> redo: >> if (!should_we_balance()) { >> *continue_balancing = 0; >> @@ -5202,6 +5208,7 @@ redo: >> goto out_balanced; >> } >> >> +redo_grp: >> busiest = find_busiest_queue(, group); >> if (!busiest) { >> schedstat_inc(sd, lb_nobusyq[idle]); >> @@ -5292,6 +5299,11 @@ more_balance: >> if (!cpumask_empty(cpus)) { >> env.loop = 0; >> env.loop_break = sched_nr_migrate_break; >> +if (share_pkg_res && >> +cpumask_intersects(cpus, >> +to_cpumask(group->cpumask))) > > sched_group_cpus() > >> +goto redo_grp; >> + >> goto redo; >> } >> goto out_balanced; >> @@ -5318,9 +5330,15 @@ more_balance: >> */ >> if (!cpumask_test_cpu(this_cpu, >> tsk_cpus_allowed(busiest->curr))) { >> +cpumask_clear_cpu(cpu_of(busiest), cpus); >> raw_spin_unlock_irqrestore(>lock, >> flags); >> env.flags |= LBF_ALL_PINNED; >> +if (share_pkg_res && >> +cpumask_intersects(cpus, >> +to_cpumask(group->cpumask))) >> +goto redo_grp; >> + >> goto out_one_pinned; >> } > > Man this retry logic is getting annoying.. isn't there anything saner we > can do? 
Maybe we can do this just at the SIBLINGS level? Having the hyperthreads busy due to the scenario described in the changelog is bad for performance.

Regards
Preeti U Murthy

> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2] sched: Limit idle_balance()
Hi Jason,

I ran the ebizzy and kernbench benchmarks on 3.11-rc1 + your "V1 patch" on a 1-socket, 16-core powerpc machine. I thought I would let you know the results before I try your V2.

Ebizzy: 30 second runs. The table below shows the improvement in the number of records completed. I have not spent enough time on the patch to explain such a big improvement.

Number_of_threads   %improvement_with_patch
 4                  41.86%
 8                   9.8%
12                  34.77%
16                  28.37%

On kernbench there was no significant change in the observations.

I will try patch V2 and let you know the results.

Regards
Preeti U Murthy
Re: [RFC PATCH v2] sched: Limit idle_balance()
Hi Jason,

With V2 of your patch, here are the results for the ebizzy run on 3.11-rc1 + patch on a 1-socket, 16-core powerpc machine. Each ebizzy run was for 30 seconds.

Number_of_threads   %improvement_with_patch
 4                   8.63
 8                   1.29
12                   9.98
16                  20.46

Let me know if you want me to profile any of these runs for specific statistics.

Regards
Preeti U Murthy

On 07/20/2013 12:58 AM, Jason Low wrote:
> On Fri, 2013-07-19 at 16:54 +0530, Preeti U Murthy wrote:
>> Hi Json,
>>
>> I ran ebizzy and kernbench benchmarks on your 3.11-rc1 + your "V1
>> patch" on a 1 socket, 16 core powerpc machine. I thought I would let you
>> know the results before I try your V2.
>>
>> Ebizzy: 30 seconds run. The table below shows the improvement in the
>> number of records completed. I have not spent enough time on the patch
>> to explain such a big improvement.
>>
>> Number_of_threads %improvement_with_patch
>> 4 41.86%
>> 8 9.8%
>> 12 34.77%
>> 16 28.37%
>>
>> While on kernbench there was no significant change in the observation.
>>
>> I will try patch V2 and let you know the results.
>
> Great to see those improvements so far. Thank you for testing this.
>
> Jason
Re: power-efficient scheduling design
Hi, On 05/31/2013 04:22 PM, Ingo Molnar wrote: > PeterZ and me tried to point out the design requirements previously, but > it still does not appear to be clear enough to people, so let me spell it > out again, in a hopefully clearer fashion. > > The scheduler has valuable power saving information available: > > - when a CPU is busy: about how long the current task expects to run > > - when a CPU is idle: how long the current CPU expects _not_ to run > > - topology: it knows how the CPUs and caches interrelate and already >optimizes based on that > > - various high level and low level load averages and other metrics about >the recent past that show how busy a particular CPU is, how busy the >whole system is, and what the runtime properties of individual tasks is >(how often it sleeps, etc.) > > so the scheduler is in an _ideal_ position to do a judgement call about > the near future and estimate how deep an idle state a CPU core should > enter into and what frequency it should run at. I don't think the problem lies in the fact that scheduler is not making these decisions about which idle state the CPU should enter or which frequency the CPU should run at. IIUC, I think the problem lies in the part where although the *cpuidle and cpufrequency governors are co-operating with the scheduler, the scheduler is not doing the same.* Let me elaborate with respect to cpuidle subsystem. When the scheduler chooses the CPUs to run tasks on, it leaves certain other CPUs idle. The cpuidle governor then evaluates, among other things, the load average of the CPUs, before deciding to put it into an ideal idle state. With the PJT's metric, an idle CPU's load average degrades over time and cpuidle governor will perhaps decide to put such CPUs to deep idle states. But the problem surfaces when scheduler gets to choose a CPU to run new/woken up tasks on. 
It chooses the *idlest_cpu* to run the task on without considering how deep an idle state that CPU is in, if at all it is in an idle state. It would end up waking a deep sleeping CPU, which will *hinder power savings*. I think here is where we need to focus. Currently, there is no *two way co-operation between the scheduler and cpuidle/cpufreq* subsystems, which makes no sense. In the above case, for instance, the scheduler prompts the cpuidle governor to put a CPU into an idle state and then comes back to hamper that move.

> The scheduler is also at a high enough level to host a "I want maximum
> performance, power does not matter to me" user policy override switch and
> similar user policy details.
>
> No ifs and whens about that.
>
> Today the power saving landscape is fragmented and sad: we just randomly
> interface scheduler task packing changes with some idle policy (and
> cpufreq policy), which might or might not combine correctly.

I would repeat here that today we interface cpuidle/cpufreq policies with the scheduler but not the other way around. They do their bit when a cpu is busy/idle. However the scheduler does not see that somebody else is taking instructions from it and comes back to give different instructions! Therefore I think, among other things, this is one fundamental issue that we need to resolve in the steps towards better power savings through the scheduler.

Regards
Preeti U Murthy
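The two-way co-operation argued for here could be pictured with a toy model: a wake-up path that breaks ties between equally idle CPUs by preferring the one in the shallowest idle state, instead of blindly picking the "idlest" CPU. This is a hypothetical Python sketch; the names (`Cpu`, `idle_depth`, `pick_wakeup_cpu`) are illustrative assumptions, not kernel code.

```python
from dataclasses import dataclass

@dataclass
class Cpu:
    cpu_id: int
    load: int        # current runqueue load (0 = idle)
    idle_depth: int  # 0 = running/polling, higher = deeper C-state

def pick_wakeup_cpu(cpus):
    # Prefer the lowest load; among equally loaded CPUs, prefer the
    # shallowest idle state, i.e. the cheapest CPU to wake up.
    return min(cpus, key=lambda c: (c.load, c.idle_depth))

# CPU 0 and CPU 1 are both idle, but CPU 0 sleeps deeper; an idle-depth
# aware wake-up would leave it undisturbed and pick CPU 1.
cpus = [Cpu(0, 0, 3), Cpu(1, 0, 1), Cpu(2, 5, 0)]
best = pick_wakeup_cpu(cpus)
```

With a plain "idlest CPU" policy both CPU 0 and CPU 1 look equivalent; the extra tie-breaker is exactly the kind of feedback from cpuidle into the scheduler being argued for above.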
Re: power-efficient scheduling design
tion as you prefer a complete and > potentially huge patch set over incremental patch sets? > > It would be good to have even a high level agreement on the path forward > where the expectation first and foremost is to take advantage of the > schedulers ideal position to drive the power management while > simplifying the power management code. > > Thanks, > Morten > Regards Preeti U Murthy
Re: power-efficient scheduling design
uler and cpufreq/cpuidle I agree with this. This is what I have been emphasizing: if we feel that the cpufreq/cpuidle subsystems are suboptimal in terms of the information that they use to make their decisions, let us improve them. But this will not yield us any improvement if the scheduler does not have enough information. And IMHO, the next fundamental information that the scheduler needs should come from cpufreq and cpuidle. Then we should move on to supplying the scheduler with information from the power domain topology, thermal factors and user policies. This does not need a re-write of the scheduler; it needs a good interface between the scheduler and the rest of the ecosystem. This ecosystem includes the cpuidle and cpufreq subsystems, and they are already in place. Let's use them.

or (b) come up
> with a unified load-balancing/cpufreq/cpuidle implementation as per
> Ingo's request. The latter is harder but, with a good design, has
> potentially a lot more benefits.
>
> A possible implementation for (a) is to let the scheduler focus on
> performance load-balancing but control the balance ratio from a
> cpufreq governor (via things like arch_scale_freq_power() or something
> new). CPUfreq would not be concerned just with individual CPU
> load/frequency but also making a decision on how tasks are balanced
> between CPUs based on the overall load (e.g. four CPUs are enough for
> the current load, I can shut the other four off by telling the
> scheduler not to use them).
>
> As for Ingo's preferred solution (b), a proposal forward could be to
> factor the load balancing out of kernel/sched/fair.c and provide an
> abstract interface (like load_class?) for easier extending or
> different policies (e.g. small task packing).

Let me elaborate on the patches that have been posted so far on the power awareness of the scheduler. When we say *power aware scheduler* what exactly do we want it to do?
In my opinion, we want it to *avoid touching idle cpus*, so as to keep them in that state longer, and *keep more power domains idle*, so as to yield power savings with them turned off. The patches released so far are striving to do the latter. Correct me if I am wrong on this. Also feel free to point out any other expectation from the power aware scheduler if I am missing any.

If I have got Ingo's point right, the issue with them is that they are not taking a holistic approach to meet the said goal. Keeping more power domains idle (by packing tasks) would sound much better if the scheduler has taken all aspects of doing such a thing into account, like:

1. How idle are the cpus on the domain that it is packing?
2. Can they go to turbo mode? Because if they do, then we can't pack tasks. We would need certain cpus in that domain idle.
3. Are the domains in which we pack tasks power gated?
4. Will there be a significant performance drop by packing? Meaning, do the tasks share cpu resources? If they do, there will be severe contention.

The approach I suggest therefore would be to get the scheduler well in sync with the ecosystem; then the patches posted so far will achieve their goals more easily and with very few regressions, because they are well-informed decisions.

Regards
Preeti U Murthy

> Best regards.
>
> --
> Catalin
Re: power-efficient scheduling design
Hi Rafael, On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote: > On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote: >> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote: >>> On 06/07/2013 08:21 PM, Catalin Marinas wrote: >>>> I think you are missing Ingo's point. It's not about the scheduler >>>> complying with decisions made by various governors in the kernel >>>> (which may or may not have enough information) but rather the >>>> scheduler being in a better position for making such decisions. >>> >>> My mail pointed out that I disagree with this design ("the scheduler >>> being in a better position for making such decisions"). >>> I think it should be a 2 way co-operation. I have elaborated below. > > I agree with that. > >>>> Take the cpuidle example, it uses the load average of the CPUs, >>>> however this load average is currently controlled by the scheduler >>>> (load balance). Rather than using a load average that degrades over >>>> time and gradually putting the CPU into deeper sleep states, the >>>> scheduler could predict more accurately that a run-queue won't have >>>> any work over the next x ms and ask for a deeper sleep state from the >>>> beginning. >>> >>> How will the scheduler know that there will not be work in the near >>> future? How will the scheduler ask for a deeper sleep state? >>> >>> My answer to the above two questions are, the scheduler cannot know how >>> much work will come up. All it knows is the current load of the >>> runqueues and the nature of the task (thanks to the PJT's metric). It >>> can then match the task load to the cpu capacity and schedule the tasks >>> on the appropriate cpus. >> >> The scheduler can decide to load a single CPU or cluster and let the >> others idle. If the total CPU load can fit into a smaller number of CPUs >> it could as well tell cpuidle to go into deeper state from the >> beginning as it moved all the tasks elsewhere. > > So why can't it do that today? What's the problem? 
The reason that scheduler does not do it today is due to the prefer_sibling logic. The tasks within a core get distributed across cores if they are more than 1, since the cpu power of a core is not high enough to handle more than one task. However at a socket level/ MC level (cluster at a low level), there can be as many tasks as there are cores because the socket has enough CPU capacity to handle them. But the prefer_sibling logic moves tasks across socket/MC level domains even when load<=domain_capacity. I think the reason why the prefer_sibling logic was introduced, is that scheduler looks at spreading tasks across all the resources it has. It believes keeping tasks within a cluster/socket level domain would mean tasks are being throttled by having access to only the cluster/socket level resources. Which is why it spreads. The prefer_sibling logic is nothing but a flag set at domain level to communicate to the scheduler that load should be spread across the groups of this domain. In the above example across sockets/clusters. But I think it is time we take another look at the prefer_sibling logic and decide on its worthiness. > >> Regarding future work, neither cpuidle nor the scheduler know this but >> the scheduler would make a better prediction, for example by tracking >> task periodicity. > > Well, basically, two pieces of information are needed to make target idle > state selections: (1) when the CPU (core or package) is going to be used > next time and (2) how much latency for going back to the non-idle state > can be tolerated. While the scheduler knows (1) to some extent (arguably, > it generally cannot predict when hardware interrupts are going to occur), > I'm not really sure about (2). > >>> As a consequence, it leaves certain cpus idle. The load of these cpus >>> degrade. It is via this load that the scheduler asks for a deeper sleep >>> state. Right here we have scheduler talking to the cpuidle governor. 
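The prefer_sibling behaviour described here can be reduced to a few lines. The sketch below is a hedged toy model, not the actual fair.c balancing code; the function name and numbers are assumptions for illustration only.

```python
# Toy model of the SD_PREFER_SIBLING effect: with the flag set at a domain
# level, load is spread across that domain's groups even when one group's
# capacity could hold the whole load; with the flag clear, load is
# consolidated into as few groups as possible.

def groups_used(total_load, group_capacity, num_groups, prefer_sibling):
    if prefer_sibling:
        # Spread: use as many groups as there are runnable units to place.
        return min(total_load, num_groups)
    # Consolidate: fill one group before touching the next (ceiling division).
    return -(-total_load // group_capacity)

# 4 tasks of unit load on 2 sockets of capacity 8 each:
spread = groups_used(4, 8, 2, prefer_sibling=True)   # both sockets woken
packed = groups_used(4, 8, 2, prefer_sibling=False)  # one socket suffices
```

The point of the mail above is precisely the `spread` case: tasks cross the socket boundary even though `load <= domain_capacity`, keeping more power domains awake than strictly necessary.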
>> >> So we agree that the scheduler _tells_ the cpuidle governor when to go >> idle (but not how deep). > > It does indicate to cpuidle how deep it can go, however, by providing it with > the information about when the CPU is going to be used next time (from the > scheduler's perspective). > >> IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the >> cpuidle does not get enough information from the scheduler (arguably this >> could be fixed) > > OK, so what information is missing in your opinion? > >> and (2) the scheduler does
Re: power-efficient scheduling design
Hi Catalin, On 06/08/2013 04:58 PM, Catalin Marinas wrote: > On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote: >> On 06/07/2013 08:21 PM, Catalin Marinas wrote: >>> I think you are missing Ingo's point. It's not about the scheduler >>> complying with decisions made by various governors in the kernel >>> (which may or may not have enough information) but rather the >>> scheduler being in a better position for making such decisions. >> >> My mail pointed out that I disagree with this design ("the scheduler >> being in a better position for making such decisions"). >> I think it should be a 2 way co-operation. I have elaborated below. >> >>> Take the cpuidle example, it uses the load average of the CPUs, >>> however this load average is currently controlled by the scheduler >>> (load balance). Rather than using a load average that degrades over >>> time and gradually putting the CPU into deeper sleep states, the >>> scheduler could predict more accurately that a run-queue won't have >>> any work over the next x ms and ask for a deeper sleep state from the >>> beginning. >> >> How will the scheduler know that there will not be work in the near >> future? How will the scheduler ask for a deeper sleep state? >> >> My answer to the above two questions are, the scheduler cannot know how >> much work will come up. All it knows is the current load of the >> runqueues and the nature of the task (thanks to the PJT's metric). It >> can then match the task load to the cpu capacity and schedule the tasks >> on the appropriate cpus. > > The scheduler can decide to load a single CPU or cluster and let the > others idle. If the total CPU load can fit into a smaller number of CPUs > it could as well tell cpuidle to go into deeper state from the > beginning as it moved all the tasks elsewhere. This currently does not happen. I have elaborated in the response to Rafael's mail. Sorry I should have put you on the 'To' list, missed that. 
Do take a look at that mail, since many of the replies to your current mail are in it.

What do you mean "from the beginning"? As soon as those cpus go idle, cpuidle will kick in anyway. If you are saying that the scheduler should tell cpuidle that "this cpu can go into deep sleep state x, since I am not going to use it for the next y seconds", that is not possible. Firstly, because the scheduler can't "predict" this 'y' parameter. Secondly, because hardware could change the idle state availability or details dynamically, as Rafael pointed out, and hence this 'x' is best not told by the scheduler, but queried by the cpuidle governor itself.

> Regarding future work, neither cpuidle nor the scheduler know this but
> the scheduler would make a better prediction, for example by tracking
> task periodicity.

This prediction that you mention is already exported by the scheduler to cpuidle. load_avg does precisely that: it tracks history and predicts the future based on it. load_avg, tracked periodically by the scheduler, is already seen by the cpuidle governor.

>> As a consequence, it leaves certain cpus idle. The load of these cpus
>> degrade. It is via this load that the scheduler asks for a deeper sleep
>> state. Right here we have scheduler talking to the cpuidle governor.
>
> So we agree that the scheduler _tells_ the cpuidle governor when to go
> idle (but not how deep). IOW, the scheduler drives the cpuidle
> decisions. Two problems: (1) the cpuidle does not get enough information
> from the scheduler (arguably this could be fixed) and (2) the scheduler
> does not have any information about the idle states (power gating etc.)
> to make any informed decision on which/when CPUs should go idle.
>
> As you said, it is a non-optimal one-way communication but the solution
> is not feedback loop from cpuidle into scheduler. It's like the
It's like the
> scheduler managed by chance to get the CPU into a deeper sleep state and
> now you'd like the scheduler to get feedback from cpuidle and not
> disturb that CPU anymore. That's the closed loop I disagree with. Could
> the scheduler not make this informed decision before - it has this total
> load, let's get this CPU into deeper sleep state?

Let's say the scheduler does make an informed decision beforehand: let's get this cpu into an idle state. Then what? Say the load begins to increase on the system. The scheduler has to wake up cpus. Which cpus are best to wake up? Who tells the scheduler this? One, the power gating information, which is yet to be exported to the scheduler, can tell it this to an extent. As far as I can see, the next one to guide the scheduler here is cpuidle, isn't it?

>> I don't see what the problem is
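The load_avg mechanism referred to above (PJT's per-entity load tracking) can be sketched in a simplified form: each ~1ms period contributes 1 if the entity was runnable, and history decays geometrically with a factor y chosen so that y**32 == 0.5. This is a floating-point model for illustration, not the kernel's fixed-point implementation.

```python
# Simplified per-entity load tracking: runnable_avg_sum / runnable_avg_period
# is the utilization estimate that cpuidle consults via load_avg.

Y = 0.5 ** (1 / 32)  # decay factor per 1ms period; half-life of 32 periods

def update_avg(avg_sum, avg_period, runnable, periods=1):
    for _ in range(periods):
        avg_sum = avg_sum * Y + (1 if runnable else 0)
        avg_period = avg_period * Y + 1
    return avg_sum, avg_period

# A task runnable for 100ms, then asleep for 100ms:
s, p = update_avg(0.0, 0.0, runnable=True, periods=100)
busy_ratio = s / p            # stays at 1 while continuously runnable
s, p = update_avg(s, p, runnable=False, periods=100)
idle_ratio = s / p            # decays toward 0 while sleeping
```

The decaying `idle_ratio` is exactly the "load average that degrades over time" the thread keeps referring to: the longer a CPU's entities stay idle, the stronger the hint that a deep idle state is safe.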
Re: power-efficient scheduling design
Hi David,

On 06/07/2013 11:06 PM, David Lang wrote:
> On Fri, 7 Jun 2013, Preeti U Murthy wrote:
>> Hi Catalin,
>>
>> On 06/07/2013 08:21 PM, Catalin Marinas wrote:
>>> Take the cpuidle example, it uses the load average of the CPUs,
>>> however this load average is currently controlled by the scheduler
>>> (load balance). Rather than using a load average that degrades over
>>> time and gradually putting the CPU into deeper sleep states, the
>>> scheduler could predict more accurately that a run-queue won't have
>>> any work over the next x ms and ask for a deeper sleep state from the
>>> beginning.
>>
>> How will the scheduler know that there will not be work in the near
>> future? How will the scheduler ask for a deeper sleep state?
>>
>> My answer to the above two questions are, the scheduler cannot know how
>> much work will come up. All it knows is the current load of the
>> runqueues and the nature of the task (thanks to the PJT's metric). It
>> can then match the task load to the cpu capacity and schedule the tasks
>> on the appropriate cpus.
>
> how will the cpuidle governor know what will come up in the future?
>
> the scheduler knows more than the current load on the runqueues, it
> tracks some information about the past behavior of the process that it
> uses for its decisions. This is information that cpuidle doesn't have.

This is incorrect. The scheduler knows the possible future load on a cpu due to past behavior, that's right, and so does cpuidle today. It queries the load average for the predicted idle time and compares this with the exit latencies of the idle states.

>> I don't see what the problem is with the cpuidle governor waiting for
>> the load to degrade before putting that cpu to sleep. In my opinion,
>> putting a cpu to deeper sleep states should happen gradually.
>
> remember that it takes power and time to wake up a cpu to put it in a
> deeper sleep state.

Correct. I apologise: saying that it happens gradually is not entirely right.
The cpuidle governor can decide on the state the cpu is best put into directly, without going through the shallow idle states. It also takes care to rectify any incorrect prediction. So there is no suboptimal exit-enter-exit-enter behaviour.

>>> Of course, you could export more scheduler information to cpuidle,
>>> various hooks (task wakeup etc.) but then we have another framework,
>>> cpufreq. It also decides the CPU parameters (frequency) based on the
>>> load controlled by the scheduler. Can cpufreq decide whether it's
>>> better to keep the CPU at higher frequency so that it gets to idle
>>> quicker and therefore deeper sleep states? I don't think it has enough
>>> information because there are at least three deciding factors
>>> (cpufreq, cpuidle and scheduler's load balancing) which are not
>>> unified.
>>
>> Why not? When the cpu load is high, the cpu frequency governor knows it
>> has to boost the frequency of that CPU. The task gets over quickly, the
>> CPU goes idle. Then the cpuidle governor kicks in to put the CPU to a
>> deeper sleep state gradually.
>>
>> Meanwhile the scheduler should ensure that the tasks are retained on
>> that CPU, whose frequency is boosted, and should not load balance it, so
>> that they can get over quickly. This I think is what is missing. Again
>> this comes down to the scheduler taking feedback from the CPU frequency
>> governors, which is not currently happening.
>
> how should the scheduler know that the cpufreq governor decided to boost
> the speed of one CPU to handle an important process as opposed to
> handling multiple smaller processes?

This has been elaborated in my response to Rafael's mail. The scheduler decides to call the cpu frequency governor when it sees fit. Then the cpu frequency governor boosts the frequency of that cpu. cpu_power will now match the task load, so the scheduler will not move the task away from that cpu, since the load does not exceed the cpu capacity. This is how the scheduler knows.
> the communication between the two is starting to sound really messy

Not really. More is elaborated in responses to Catalin and Rafael's mails.

Regards
Preeti U Murthy

> David Lang
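The "queries the load average for predicted idle time and compares this with exit latencies" step mentioned above is the heart of a cpuidle governor. A minimal sketch, with a made-up state table (real platforms export their own states with their own latencies):

```python
# Pick the deepest idle state whose exit latency fits the caller's latency
# limit and whose target residency fits the predicted idle time.
# (name, exit_latency_us, target_residency_us) -- illustrative values only.
STATES = [
    ("snooze", 0,    1),
    ("nap",    10,   100),
    ("deep",   1000, 10000),
]

def select_state(predicted_idle_us, latency_limit_us):
    chosen = STATES[0][0]  # shallowest state is always safe
    for name, exit_lat, residency in STATES:
        if exit_lat <= latency_limit_us and residency <= predicted_idle_us:
            chosen = name  # states are ordered, so the deepest fit wins
    return chosen

# Long predicted idle -> deep state; short predicted idle -> stay shallow.
long_idle = select_state(50_000, 10**6)
short_idle = select_state(50, 10**6)
```

Note how both inputs matter: a tight latency limit (e.g. from a PM QoS request) vetoes a deep state even when the predicted idle time is long.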
Re: [patch v7 0/21] sched: power aware scheduling
Hi Alex,

On 05/20/2013 06:31 AM, Alex Shi wrote:
>>>>>> Which are the workloads where 'powersaving' mode hurts workload
>>>>>> performance measurably?
>>
>> I ran ebizzy on a 2 socket, 16 core, SMT 4 Power machine.
>
> Is this a 2 * 16 * 4 LCPUs PowerPC machine?

This is a 2 * 8 * 4 LCPUs PowerPC machine.

>> The power efficiency drops significantly with the powersaving policy of
>> this patch, over the power efficiency of the scheduler without this patch.
>>
>> The below parameters are measured relative to the default scheduler
>> behaviour.
>>
>> A: Drop in power efficiency with the patch+powersaving policy
>> B: Drop in performance with the patch+powersaving policy
>> C: Decrease in power consumption with the patch+powersaving policy
>>
>> NumThreads    A     B     C
>> ---------------------------
>>  2           33%   36%   4%
>>  4           31%   33%   3%
>>  8           28%   30%   3%
>> 16           31%   33%   4%
>>
>> Each of the above runs is for 30s.
>>
>> On investigating socket utilization, I found that only 1 socket was being
>> used during all the above threaded runs. As can be guessed, this is due
>> to the group_weight being considered for the threshold metric.
>> This stacks up tasks on a core and further on a socket, thus throttling
>> them, as observed by Mike below.
>>
>> I therefore think we must switch to group_capacity as the metric for the
>> threshold and use only (rq->utils * nr_running) for the group_utils
>> calculation during non-bursty wakeup scenarios.
>> This way we are comparing right: the utilization of the runqueue by the
>> fair tasks and the cpu capacity available for them after being consumed
>> by the rt tasks.
>>
>> After I made the above modification, all the above three parameters came
>> to be nearly null. However, I am observing the load balancing of the
>> scheduler with the patch and powersavings policy enabled. It is behaving
>> very close to the default scheduler (spreading tasks across sockets).
>> That also explains why there is no performance drop or gain with the
>> patch+powersavings policy enabled.
I will look into this observation and >> revert. > > Thanks a lot for the great testings! > Seem tasks per SMT cpu isn't power efficient. > And I got the similar result last week. I tested the fspin testing(do > endless calculation, in linux-next tree.). when I bind task per SMT cpu, > the power efficiency really dropped with most every threads number. but > when bind task per core, it has better power efficiency on all threads. > Beside to move task depend on group_capacity, another choice is balance > task according cpu_power. I did the transfer in code. but need to go > through a internal open source process before public them. What do you mean by *another* choice is balance task according to cpu_power? group_capacity is based on cpu_power. Also, your balance policy in v6 was doing the same right? It was rightly comparing rq->utils * nr_running against cpu_power. Why not simply switch to that code for power policy load balancing? >>>>> Well, it'll lose throughput any time there's parallel execution >>>>> potential but it's serialized instead.. using average will inevitably >>>>> stack tasks sometimes, but that's its goal. Hackbench shows it. >>>> >>>> (but that consolidation can be a winner too, and I bet a nickle it would >>>> be for a socket sized pgbench run) >>> >>> (belay that, was thinking of keeping all tasks on a single node, but >>> it'll likely stack the whole thing on a CPU or two, if so, it'll hurt) >> >> At this point, I would like to raise one issue. >> *Is the goal of the power aware scheduler improving power efficiency of >> the scheduler or a compromise on the power efficiency but definitely a >> decrease in power consumption, since it is the user who has decided to >> prioritise lower power consumption over performance* ? >> > > It could be one of reason for this feather, but I could like to > make it has better efficiency, like packing tasks according to cpu_power > not current group_weight. 
Yes, we could try the patch using group_capacity and observe the results for power efficiency, before we decide to compromise on power efficiency for a decrease in power consumption.

Regards
Preeti U Murthy
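For reference, the A/B/C columns quoted earlier in this thread are arithmetically related: power efficiency is performance per watt, so a fractional performance drop B together with a fractional power drop C implies an efficiency drop of roughly 1 - (1 - B)/(1 - C). A small sketch of that arithmetic (illustrative only, not measurement code):

```python
def efficiency_drop(perf_drop, power_drop):
    """All arguments and the result are fractions, e.g. 0.36 for 36%.

    efficiency = perf / power, so the new efficiency relative to baseline
    is (1 - perf_drop) / (1 - power_drop); the drop is one minus that.
    """
    return 1 - (1 - perf_drop) / (1 - power_drop)

# The 2-thread row of the quoted table: B = 36% perf drop, C = 4% power drop
a = efficiency_drop(0.36, 0.04)   # close to the reported A = 33%
```

Checking a couple of rows this way confirms the table is internally consistent: a large performance drop bought only a small power saving, which is exactly the efficiency regression being reported.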
Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
Hi Alex, On 02/24/2013 02:57 PM, Alex Shi wrote: > On 02/22/2013 04:54 PM, Peter Zijlstra wrote: >> On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote: >>>> The name is a secondary issue, first you need to explain why you >>> think >>>> nr_running is a useful metric at all. >>>> >>>> You can have a high nr_running and a low utilization (a burst of >>>> wakeups, each waking a process that'll instantly go to sleep again), >>> or >>>> low nr_running and high utilization (a single process cpu bound >>>> process). >>> >>> It is true in periodic balance. But in fork/exec/waking timing, the >>> incoming processes usually need to do something before sleep again. >> >> You'd be surprised, there's a fair number of workloads that have >> negligible runtime on wakeup. > > will appreciate if you like introduce some workload. :) > BTW, do you has some idea to handle them? > Actually, if tasks is just like transitory, it is also hard to catch > them in balance, like 'cyclitest -t 100' on my 4 LCPU laptop, vmstat > just can catch 1 or 2 tasks very second. >> >>> I use nr_running to measure how the group busy, due to 3 reasons: >>> 1, the current performance policy doesn't use utilization too. >> >> We were planning to fix that now that its available. > > I had tried, but failed on aim9 benchmark. As a result I give up to use > utilization in performance balance. > Some trying and talking in the thread. > https://lkml.org/lkml/2013/1/6/96 > https://lkml.org/lkml/2013/1/22/662 >> >>> 2, the power policy don't care load weight. >> >> Then its broken, it should very much still care about weight. > > Here power policy just use nr_running as the criteria to check if it's > eligible for power aware balance. when do balancing the load weight is > still the key judgment. > >> >>> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some >>> benchmark results looks clear bad when use utilization. if my memory >>> right, the hackbench/aim7 both looks bad. 
I had tried many ways to
>>> engage utilization into this balance, like use utilization only, or
>>> use utilization * nr_running etc. but still can not find a way to recover
>>> the lose. But with nr_running, the performance seems doesn't lose much
>>> with power policy.
>>
>> You're failing to explain why utilization performs bad and you don't
>> explain why nr_running is better. That things work simply isn't good
>
> Um, let me try to explain again, The utilisation need much time to
> accumulate itself(345ms). Whenever with or without load weight, many
> bursting tasks just give a minimum weight to the carrier CPU at the
> first few ms. So, it is too easy to do a incorrect distribution here and
> need migration on later periodic balancing.

I don't understand why forked tasks are taking time to accumulate the load. I understand this if it were to be a woken up task. The first time the forked task gets a chance to update the load itself, it needs to reflect full utilization. In __update_entity_runnable_avg both runnable_avg_period and runnable_avg_sum get equally incremented for a forked task since it is runnable. Hence where is the chance for the load to get incremented in steps?

In sleeping tasks, since runnable_avg_sum progresses much slower than runnable_avg_period, these tasks take much time to accumulate the load when they wake up. This makes sense of course. But how does this happen for forked tasks?

Regards
Preeti U Murthy
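The point about forked versus woken tasks can be made concrete with a simplified model of __update_entity_runnable_avg(): a runnable entity adds the elapsed time to both runnable_avg_sum and runnable_avg_period, so a task that has been runnable since fork always shows sum/period == 1 (full utilization), while a task that slept first accumulated period without sum and needs time to catch up. Decay is omitted here for clarity; it scales both terms alike and does not change the ratio argument.

```python
def track(history):
    """history: sequence of booleans, True = runnable in that period.

    Simplified (decay-free) model of runnable_avg_sum / runnable_avg_period:
    both counters advance each period, but only runnable periods feed the sum.
    """
    avg_sum = avg_period = 0
    for runnable in history:
        avg_sum += 1 if runnable else 0
        avg_period += 1
    return avg_sum, avg_period

forked_sum, forked_period = track([True] * 10)             # runnable since fork
slept_sum, slept_period = track([False] * 8 + [True] * 2)  # woke up recently
```

The forked task's ratio is 1 from its very first update, which is the crux of the question above: the "slow accumulation" effect only applies to tasks with sleep history.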
Re: [patch v5 02/15] sched: set initial load avg of new forked task
Hi Alex,

On 02/20/2013 11:50 AM, Alex Shi wrote:
> On 02/18/2013 01:07 PM, Alex Shi wrote:
>> New task has no runnable sum at its first runnable time, so its
>> runnable load is zero. That makes burst forking balancing just select
>> few idle cpus to assign tasks if we engage runnable load in balancing.
>>
>> Set initial load avg of new forked task as its load weight to resolve
>> this issue.
>
> patch answering PJT's update here. that merged the 1st and 2nd patches
> into one. other patches in serial don't need to change.
>
> =
> From 89b56f2e5a323a0cb91c98be15c94d34e8904098 Mon Sep 17 00:00:00 2001
> From: Alex Shi
> Date: Mon, 3 Dec 2012 17:30:39 +0800
> Subject: [PATCH 01/14] sched: set initial value of runnable avg for new
> forked task
>
> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
>
> and make forking balancing imbalance since incorrect load_avg_contrib.
>
> set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to
> resolve such issues.
>
> Signed-off-by: Alex Shi
> ---
> kernel/sched/core.c | 3 +++
> kernel/sched/fair.c | 4 ++++
> 2 files changed, 7 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26058d0..1452e14 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1559,6 +1559,7 @@ static void __sched_fork(struct task_struct *p)
> #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
> 	p->se.avg.runnable_avg_period = 0;
> 	p->se.avg.runnable_avg_sum = 0;
> +	p->se.avg.decay_count = 0;
> #endif
> #ifdef CONFIG_SCHEDSTATS
> 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> @@ -1646,6 +1647,8 @@ void sched_fork(struct task_struct *p)
> 		p->sched_reset_on_fork = 0;
> 	}

I think the following comment will help here.
/* All forked tasks are assumed to have full utilization to begin with */
> +	p->se.avg.load_avg_contrib = p->se.load.weight;
> +
> 	if (!rt_prio(p->prio))
> 		p->sched_class = &fair_sched_class;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 81fa536..cae5134 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct
> cfs_rq *cfs_rq,
> 	 * We track migrations using entity decay_count <= 0, on a wake-up
> 	 * migration we use a negative decay count to track the remote decays
> 	 * accumulated while sleeping.
> + *
> + * When enqueue a new forked task, the se->avg.decay_count == 0, so
> + * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
> + * value: se->load.weight.

I disagree with the comment. update_entity_load_avg() gets called for all forked tasks: enqueue_task_fair->update_entity_load_avg() during the second iteration. But __update_entity_load_avg() in update_entity_load_avg(), where the actual load update happens, does not get called. This is because, as below, the last_update of the forked task is nearly equal to the clock task of the runqueue. Hence probably 1ms has not passed by for the load to get updated. Which is why neither the load of the task nor the load of the runqueue gets updated when the task forks.

Also note that the reason we bypass update_entity_load_avg() below is not because our decay_count=0. It's because the forked tasks have nothing to update. Only woken up tasks and migrated wake-ups have load updates to do. Forked tasks just got created; they have no load to "update" but only to "create". This I feel is rightly done in sched_fork by this patch.

So ideally I don't think we should have any comment here. It does not sound relevant.
>*/ > if (unlikely(se->avg.decay_count <= 0)) { > se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task; > Regards Preeti U Murthy
Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
Hi, On 02/24/2013 02:57 PM, Alex Shi wrote: > On 02/22/2013 04:54 PM, Peter Zijlstra wrote: >> On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote: >>>> The name is a secondary issue, first you need to explain why you >>> think >>>> nr_running is a useful metric at all. >>>> >>>> You can have a high nr_running and a low utilization (a burst of >>>> wakeups, each waking a process that'll instantly go to sleep again), >>> or >>>> low nr_running and high utilization (a single process cpu bound >>>> process). >>> >>> It is true in periodic balance. But in fork/exec/waking timing, the >>> incoming processes usually need to do something before sleep again. >> >> You'd be surprised, there's a fair number of workloads that have >> negligible runtime on wakeup. > > will appreciate if you like introduce some workload. :) > BTW, do you has some idea to handle them? > Actually, if tasks is just like transitory, it is also hard to catch > them in balance, like 'cyclitest -t 100' on my 4 LCPU laptop, vmstat > just can catch 1 or 2 tasks very second. >> >>> I use nr_running to measure how the group busy, due to 3 reasons: >>> 1, the current performance policy doesn't use utilization too. >> >> We were planning to fix that now that its available. > > I had tried, but failed on aim9 benchmark. As a result I give up to use > utilization in performance balance. > Some trying and talking in the thread. > https://lkml.org/lkml/2013/1/6/96 > https://lkml.org/lkml/2013/1/22/662 >> >>> 2, the power policy don't care load weight. >> >> Then its broken, it should very much still care about weight. > > Here power policy just use nr_running as the criteria to check if it's > eligible for power aware balance. when do balancing the load weight is > still the key judgment. > >> >>> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some >>> benchmark results looks clear bad when use utilization. if my memory >>> right, the hackbench/aim7 both looks bad. 
I had tried many ways to >>> engage utilization into this balance, like use utilization only, or >>> use >>> utilization * nr_running etc. but still can not find a way to recover >>> the lose. But with nr_running, the performance seems doesn't lose much >>> with power policy. >> >> You're failing to explain why utilization performs bad and you don't >> explain why nr_running is better. That things work simply isn't good > > Um, let me try to explain again, The utilisation need much time to > accumulate itself(345ms). Whenever with or without load weight, many > bursting tasks just give a minimum weight to the carrier CPU at the > first few ms. So, it is too easy to do a incorrect distribution here and > need migration on later periodic balancing. Why can't this be attacked in *either* of the following ways: 1. Attack this problem at the source, by ensuring that the utilisation is accumulated faster by making the update window smaller. 2. Balance on nr_running only if you detect burst wakeups. Alex, you had released a patch earlier which could detect this, right? Instead of balancing on nr_running all the time, why not balance on it only if burst wakeups are detected? By doing so you ensure that nr_running as a metric for load balancing is used only when it is right to do so, and the reason for using it also gets well documented. Regards Preeti U Murthy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3 3/6] sched: pack small tasks
Hi Peter, On 04/26/2013 03:48 PM, Peter Zijlstra wrote: > On Wed, Mar 27, 2013 at 03:51:51PM +0530, Preeti U Murthy wrote: >> Hi, >> >> On 03/26/2013 05:56 PM, Peter Zijlstra wrote: >>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote: >>>> +static bool is_buddy_busy(int cpu) >>>> +{ >>>> + struct rq *rq = cpu_rq(cpu); >>>> + >>>> + /* >>>> +* A busy buddy is a CPU with a high load or a small load with >>>> a lot of >>>> +* running tasks. >>>> +*/ >>>> + return (rq->avg.runnable_avg_sum > >>>> + (rq->avg.runnable_avg_period / (rq->nr_running >>>> + 2))); >>>> +} >>> >>> Why does the comment talk about load but we don't see it in the >>> equation. Also, why does nr_running matter at all? I thought we'd >>> simply bother with utilization, if fully utilized we're done etc.. >>> >> >> Peter, lets say the run-queue has 50% utilization and is running 2 >> tasks. And we wish to find out if it is busy. We would compare this >> metric with the cpu power, which lets say is 100. >> >> rq->util * 100 < cpu_of(rq)->power. >> >> In the above scenario would we declare the cpu _not_busy? Or would we do >> the following: >> >> (rq->util * 100) * #nr_running < cpu_of(rq)->power and conclude that it >> is just enough _busy_ to not take on more processes? > > That is just confused... ->power doesn't have anything to do with a per-cpu > measure. ->power is a inter-cpu measure of relative compute capacity. Ok. > > Mixing in nr_running confuses things even more; it doesn't matter how many > tasks it takes to push utilization up to 100%; once its there the cpu simply > cannot run more. True, this is from the perspective of the CPU. But will not the tasks on this CPU get throttled if you find the utilization of this CPU < 100% and decide to put more tasks on it?
Regards Preeti U Murthy
Re: [patch v4 07/18] sched: set initial load avg of new forked task
Hi everyone, On 02/19/2013 05:04 PM, Paul Turner wrote: > On Fri, Feb 15, 2013 at 2:07 AM, Alex Shi wrote: >> >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>> index 1dff78a..9d1c193 100644 >>> --- a/kernel/sched/core.c >>> +++ b/kernel/sched/core.c >>> @@ -1557,8 +1557,8 @@ static void __sched_fork(struct task_struct *p) >>> * load-balance). >>> */ >>> #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED) >>> - p->se.avg.runnable_avg_period = 0; >>> - p->se.avg.runnable_avg_sum = 0; >>> + p->se.avg.runnable_avg_period = 1024; >>> + p->se.avg.runnable_avg_sum = 1024; >> >> It can't work. >> avg.decay_count needs to be set to 0 before enqueue_entity_load_avg(), then >> update_entity_load_avg() can't be called, so, runnable_avg_period/sum >> are unusable. > > Well we _could_ also use a negative decay_count here and treat it like > a migration; but the larger problem is the visibility of p->on_rq; > which is gates whether we account the time as runnable and occurs > after activate_task() so that's out. > >> >> Even we has chance to call __update_entity_runnable_avg(), >> avg.last_runnable_update needs be set before that, usually, it needs to >> be set as 'now', that cause __update_entity_runnable_avg() function >> return 0, then update_entity_load_avg() still can not reach to >> __update_entity_load_avg_contrib(). >> >> If we embed a simple new task load initialization to many functions, >> that is too hard for future reader. > > This is my concern about making this a special case with the > introduction ENQUEUE_NEWTASK flag; enqueue jumps through enough hoops > as it is. > > I still don't see why we can't resolve this at init time in > __sched_fork(); your patch above just moves an explicit initialization > of load_avg_contrib into the enqueue path. Adding a call to > __update_task_entity_contrib() to the previous alternate suggestion > would similarly seem to resolve this? 
We could do this (Adding a call to __update_task_entity_contrib()), but the cfs_rq->runnable_load_avg gets updated only if the task is on the runqueue. But in the forked task's case the on_rq flag is not yet set. Something like the below: --- kernel/sched/fair.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8691b0d..841e156 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1451,14 +1451,20 @@ static inline void update_entity_load_avg(struct sched_entity *se, else now = cfs_rq_clock_task(group_cfs_rq(se)); - if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) - return; - + if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) { + if (!(flags & ENQUEUE_NEWTASK)) + return; + } contrib_delta = __update_entity_load_avg_contrib(se); if (!update_cfs_rq) return; + /* But the cfs_rq->runnable_load_avg does not get updated in case of +* a forked task, because the se->on_rq = 0, although we update the +* task's load_avg_contrib above in +* __update_entity_load_avg_contrib(). +*/ if (se->on_rq) cfs_rq->runnable_load_avg += contrib_delta; else @@ -1538,12 +1544,6 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq, subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib); update_entity_load_avg(se, 0); } - /* -* set the initial load avg of new task same as its load -* in order to avoid brust fork make few cpu too heavier -*/ - if (flags & ENQUEUE_NEWTASK) - se->avg.load_avg_contrib = se->load.weight; cfs_rq->runnable_load_avg += se->avg.load_avg_contrib; /* we force update consideration on load-balancer moves */ Thanks Regards Preeti U Murthy
Re: [patch v5 06/15] sched: log the cpu utilization at rq
Hi, >> /* >> * This is the main, per-CPU runqueue data structure. >> * >> @@ -481,6 +484,7 @@ struct rq { >> #endif >> >> struct sched_avg avg; >> +unsigned int util; >> }; >> >> static inline int cpu_of(struct rq *rq) > > You don't actually compute the rq utilization, you only compute the > utilization as per the fair class, so if there's significant RT activity > it'll think the cpu is under-utilized, which I think will result in the > wrong thing. Correct me if I am wrong, but isn't the current load balancer also disregarding the real time tasks to calculate the domain/group/cpu level load too? What I mean is, if the answer to the above question is yes, then can we safely assume that the further optimizations to the load balancer like the power aware scheduler and the usage of per entity load tracking can be done without considering the real time tasks? Regards Preeti U Murthy
Re: [patch v5 06/15] sched: log the cpu utilization at rq
Hi everyone, On 02/18/2013 10:37 AM, Alex Shi wrote: > The cpu's utilization is to measure how busy is the cpu. > util = cpu_rq(cpu)->avg.runnable_avg_sum > / cpu_rq(cpu)->avg.runnable_avg_period; Why not cfs_rq->runnable_load_avg? I am concerned with what is the right metric to use here. Refer to this discussion: https://lkml.org/lkml/2012/10/29/448 Regards Preeti U Murthy
[PATCH] cpuidle/menu: Fail cpuidle_idle_call() if no idle state is acceptable
On PowerPC, in a particular test scenario, all the cpu idle states were disabled. In spite of this, it was observed that the idle state count of the shallowest idle state, snooze, was increasing. This is because the governor returns the idle state index as 0 even in scenarios when no idle state can be chosen. These scenarios could be when the latency requirement is 0 or, as mentioned above, when the user wants to disable certain cpu idle states at runtime. In the latter case, it's possible that no cpu idle state is valid because the suitable states were disabled and the rest did not match the menu governor criteria to be chosen as the next idle state. This patch adds the code to indicate that a valid cpu idle state could not be chosen by the menu governor and reports back to the arch so that it can take some default action. Signed-off-by: Preeti U Murthy --- drivers/cpuidle/cpuidle.c | 6 +- drivers/cpuidle/governors/menu.c | 7 --- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..5bf06bb 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -131,8 +131,9 @@ int cpuidle_idle_call(void) /* ask the governor for the next state */ next_state = cpuidle_curr_governor->select(drv, dev); + + dev->last_residency = 0; if (need_resched()) { - dev->last_residency = 0; /* give the governor an opportunity to reflect on the outcome */ if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, next_state); @@ -140,6 +141,9 @@ int cpuidle_idle_call(void) return 0; } + if (next_state < 0) + return -EINVAL; + trace_cpu_idle_rcuidle(next_state, dev->cpu); broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP); diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index cf7f2f0..6921543 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -283,6 +283,7 @@ again: * menu_select - selects the next idle state to
enter * @drv: cpuidle driver containing state data * @dev: the CPU + * Returns -1 when no idle state is suitable */ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) { @@ -292,17 +293,17 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) int multiplier; struct timespec t; - if (data->needs_update) { + if (data->last_state_idx >= 0 && data->needs_update) { menu_update(drv, dev); data->needs_update = 0; } - data->last_state_idx = 0; + data->last_state_idx = -1; data->exit_us = 0; /* Special case when user has set very strict latency requirement */ if (unlikely(latency_req == 0)) - return 0; + return data->last_state_idx; /* determine the expected residency time, round up */ t = ktime_to_timespec(tick_nohz_get_sleep_length());
Re: [PATCH] cpuidle/menu: Fail cpuidle_idle_call() if no idle state is acceptable
Hi Srivatsa, On 01/14/2014 12:30 PM, Srivatsa S. Bhat wrote: > On 01/14/2014 11:35 AM, Preeti U Murthy wrote: >> On PowerPC, in a particular test scenario, all the cpu idle states were >> disabled. >> Inspite of this it was observed that the idle state count of the shallowest >> idle state, snooze, was increasing. >> >> This is because the governor returns the idle state index as 0 even in >> scenarios when no idle state can be chosen. These scenarios could be when the >> latency requirement is 0 or as mentioned above when the user wants to disable >> certain cpu idle states at runtime. In the latter case, its possible that no >> cpu idle state is valid because the suitable states were disabled >> and the rest did not match the menu governor criteria to be chosen as the >> next idle state. >> >> This patch adds the code to indicate that a valid cpu idle state could not be >> chosen by the menu governor and reports back to arch so that it can take some >> default action. >> > > That sounds fair enough. However, the "default" action of pseries idle loop > (pseries_lpar_idle()) surprises me. It enters Cede, which is _deeper_ than > doing > a snooze! IOW, a user might "disable" cpuidle or set the > PM_QOS_CPU_DMA_LATENCY > to 0 hoping to prevent the CPUs from going to deep idle states, but then the > machine would still end up going to Cede, even though that wont get reflected > in the idle state counts. IMHO that scenario needs some thought as well... Yes I did see this, but since the patch intends to only communicate whether the cpuidle governor was successful in choosing an idle state on its part, I wished to address the default action of pseries idle loop separately. You are right we will need to understand the patch which introduced this action. I will take a look at it. 
> >> Signed-off-by: Preeti U Murthy >> --- >> >> drivers/cpuidle/cpuidle.c|6 +- >> drivers/cpuidle/governors/menu.c |7 --- >> 2 files changed, 9 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c >> index a55e68f..5bf06bb 100644 >> --- a/drivers/cpuidle/cpuidle.c >> +++ b/drivers/cpuidle/cpuidle.c >> @@ -131,8 +131,9 @@ int cpuidle_idle_call(void) >> >> /* ask the governor for the next state */ >> next_state = cpuidle_curr_governor->select(drv, dev); >> + >> +dev->last_residency = 0; >> if (need_resched()) { >> -dev->last_residency = 0; >> /* give the governor an opportunity to reflect on the outcome */ >> if (cpuidle_curr_governor->reflect) >> cpuidle_curr_governor->reflect(dev, next_state); > > The comments on top of the .reflect() routines of the governors say that the > second parameter is the index of the actual state entered. But after this > patch, > next_state can be negative, indicating an invalid index. So those comments > need > to be updated accordingly. Right, I will take care of the comment in the next post. > >> @@ -140,6 +141,9 @@ int cpuidle_idle_call(void) >> return 0; >> } >> >> +if (next_state < 0) >> +return -EINVAL; > > The exit path above (due to need_resched) returns with irqs enabled, but the > new > one you are adding (next_state < 0) returns with irqs disabled. This is > correct, > because in the latter case, "idle" is still in progress and the arch will > choose > a default handler to execute (unlike the former case where "idle" is over and > hence its time to enable interrupts). Correct. > > IMHO it would be good to add comments around this code to explain this subtle > difference. We can never be too careful with these things... ;-) Ok, will do so. 
> >> + >> trace_cpu_idle_rcuidle(next_state, dev->cpu); >> >> broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP); >> diff --git a/drivers/cpuidle/governors/menu.c >> b/drivers/cpuidle/governors/menu.c >> index cf7f2f0..6921543 100644 >> --- a/drivers/cpuidle/governors/menu.c >> +++ b/drivers/cpuidle/governors/menu.c >> @@ -283,6 +283,7 @@ again: >> * menu_select - selects the next idle state to enter >> * @drv: cpuidle driver containing state data >> * @dev: the CPU >> + * Returns -1 when no idle state is suitable >> */ >> static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device >> *dev) >> { >> @@ -292,17 +293,17 @@ static int menu_select(struct cpuidle_d
Re: [PATCH] cpuidle/menu: Fail cpuidle_idle_call() if no idle state is acceptable
On 01/14/2014 01:07 PM, Srivatsa S. Bhat wrote: > On 01/14/2014 12:30 PM, Srivatsa S. Bhat wrote: >> On 01/14/2014 11:35 AM, Preeti U Murthy wrote: >>> On PowerPC, in a particular test scenario, all the cpu idle states were >>> disabled. >>> Inspite of this it was observed that the idle state count of the shallowest >>> idle state, snooze, was increasing. >>> >>> This is because the governor returns the idle state index as 0 even in >>> scenarios when no idle state can be chosen. These scenarios could be when >>> the >>> latency requirement is 0 or as mentioned above when the user wants to >>> disable >>> certain cpu idle states at runtime. In the latter case, its possible that no >>> cpu idle state is valid because the suitable states were disabled >>> and the rest did not match the menu governor criteria to be chosen as the >>> next idle state. >>> >>> This patch adds the code to indicate that a valid cpu idle state could not >>> be >>> chosen by the menu governor and reports back to arch so that it can take >>> some >>> default action. >>> >> >> That sounds fair enough. However, the "default" action of pseries idle loop >> (pseries_lpar_idle()) surprises me. It enters Cede, which is _deeper_ than >> doing >> a snooze! IOW, a user might "disable" cpuidle or set the >> PM_QOS_CPU_DMA_LATENCY >> to 0 hoping to prevent the CPUs from going to deep idle states, but then the >> machine would still end up going to Cede, even though that wont get reflected >> in the idle state counts. IMHO that scenario needs some thought as well... >> > > I checked the git history and found that the default idle was changed (on > purpose) > to cede the processor, in order to speed up booting.. Hmm.. 
> > commit 363edbe2614aa90df706c0f19ccfa2a6c06af0be > Author: Vaidyanathan Srinivasan > Date: Fri Sep 6 00:25:06 2013 +0530 > > powerpc: Default arch idle could cede processor on pseries This issue is not powerpc-specific, as I observed on digging a bit into the default idle routines of the common archs. The way the archs perceive the call to the cpuidle framework today is that if it fails, it means that the cpuidle backend driver failed to *function* for some reason (as mentioned in the above commit: either because the cpuidle driver is not registered or because it does not work on some specific platforms), and that therefore the archs should decide on an idle state themselves. They therefore end up choosing a convenient idle state, which could very well be one of the idle states in the cpuidle state table. The archs do not see a failed call to the cpuidle driver as "the cpuidle driver says no idle state can be entered now because there are strict latency requirements or the idle states are disabled". IOW, the call to the cpuidle driver is currently based on whether the cpuidle driver exists rather than whether it agrees to enter any of the idle states. This patch brings in the need for the archs to incorporate an additional check: did cpuidle_idle_call() fail because it did not find it wise to enter any of the idle states? In that case they should simply exit without taking any *default action*. Need to give this some thought and reconsider the patch. Regards Preeti U Murthy > > > Regards, > Srivatsa S. Bhat >
[PATCH V5 0/8] cpuidle/ppc: Enable deep idle states on PowerNV
inated broadcast cpu on hotplug of the old instead of smp_call_function_single(). This is because we are interrupt disabled at this point and should not be using smp_call_function_single or its children in this context to send an ipi. 6. Move GENERIC_CLOCKEVENTS_BROADCAST to arch/powerpc/Kconfig. 7. Fix coding style issues. Changes in V2: - https://lkml.org/lkml/2013/8/14/239 1. Dynamically pick a broadcast CPU, instead of having a dedicated one. 2. Remove the constraint of having to disable tickless idle on the broadcast CPU by queueing a hrtimer dedicated to do broadcast. V1 posting: https://lkml.org/lkml/2013/7/25/740. 1. Added the infrastructure to wakeup CPUs in deep idle states in which the local timers stop. --- Preeti U Murthy (4): cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines time/cpuidle: Support in tick broadcast framework in the absence of external clock device cpuidle/powernv: Add "Fast-Sleep" CPU idle state cpuidle/powernv: Parse device tree to setup idle states Srivatsa S. 
Bhat (2): powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message powerpc: Implement tick broadcast IPI as a fixed IPI message Vaidyanathan Srinivasan (2): powernv/cpuidle: Add context management for Fast Sleep powermgt: Add OPAL call to resync timebase on wakeup arch/powerpc/Kconfig |2 arch/powerpc/include/asm/opal.h|2 arch/powerpc/include/asm/processor.h |1 arch/powerpc/include/asm/smp.h |2 arch/powerpc/include/asm/time.h|1 arch/powerpc/kernel/exceptions-64s.S | 10 + arch/powerpc/kernel/idle_power7.S | 90 +-- arch/powerpc/kernel/smp.c | 23 ++- arch/powerpc/kernel/time.c | 80 ++ arch/powerpc/platforms/cell/interrupt.c|2 arch/powerpc/platforms/powernv/opal-wrappers.S |1 arch/powerpc/platforms/ps3/smp.c |2 drivers/cpuidle/cpuidle-powernv.c | 106 - include/linux/clockchips.h |4 - kernel/time/clockevents.c |9 + kernel/time/tick-broadcast.c | 192 ++-- kernel/time/tick-internal.h|8 + 17 files changed, 434 insertions(+), 101 deletions(-)
[PATCH V5 1/8] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
From: Srivatsa S. Bhat The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map to a common implementation - generic_smp_call_function_single_interrupt(). So, we can consolidate them and save one of the IPI message slots, (which are precious on powerpc, since only 4 of those slots are available). So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be used for something else in the future, if desired. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/kernel/smp.c | 12 +--- arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 4 files changed, 8 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 084e080..9f7356b 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_CALL_FUNC_SINGLE 2 +#define PPC_MSG_UNUSED 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index a3b64f3..c2bd8d6 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t call_function_single_action(int irq, void *data) +static irqreturn_t unused_action(int irq, void *data) { - generic_smp_call_function_single_interrupt(); + /* This slot is unused and hence available for use, if needed */ return IRQ_HANDLED; } @@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, + [PPC_MSG_UNUSED] = unused_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", + [PPC_MSG_UNUSED] = "ipi unused", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); - if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE)) - generic_smp_call_function_single_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule); void arch_send_call_function_single_ipi(int cpu) { - do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); + do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } void arch_send_call_function_ipi_mask(const struct cpumask *mask) 
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 2d42f3b..adf3726 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -215,7 +215,7 @@ void iic_request_IPIs(void) { iic_request_ipi(PPC_MSG_CALL_FUNCTION); iic_request_ipi(PPC_MSG_RESCHEDULE); - iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE); + iic_request_ipi(PPC_MSG_UNUSED); iic_request_ipi(PPC_MSG_DEBUGGER_BREAK); } diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c index 4b35166..00d1a7c 100644 --- a/arch/powerpc/platforms/ps3/smp.c +++ b/arch/powerpc/platforms/ps3/smp.c @@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void) BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION != 0); BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1); - BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2); + BUILD_BUG_ON(PPC_MSG_UNUSED != 2); BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3); for (i = 0; i < MSG_COUNT; i++) {
[PATCH V5 2/8] powerpc: Implement tick broadcast IPI as a fixed IPI message
From: Srivatsa S. Bhat For scalability and performance reasons, we want the tick broadcast IPIs to be handled as efficiently as possible. Fixed IPI messages are one of the most efficient mechanisms available - they are faster than the smp_call_function mechanism because the IPI handlers are fixed and hence they don't involve costly operations such as adding IPI handlers to the target CPU's function queue, acquiring locks for synchronization etc. Luckily we have an unused IPI message slot, so use that to implement tick broadcast IPIs efficiently. Signed-off-by: Srivatsa S. Bhat [Functions renamed to tick_broadcast* and Changelog modified by Preeti U. Murthy] Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/smp.c | 19 +++ arch/powerpc/kernel/time.c |5 + arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 6 files changed, 24 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 9f7356b..ff51046 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_UNUSED 2 +#define PPC_MSG_TICK_BROADCAST 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index c1f2676..1d428e6 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); extern void GregorianDay(struct rtc_time *tm); +extern void tick_broadcast_ipi_handler(void); extern void generic_calibrate_decr(void); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index c2bd8d6..c77c6d7 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t unused_action(int irq, void *data) +static irqreturn_t tick_broadcast_ipi_action(int irq, void *data) { - /* This slot is unused and hence available for use, if needed */ + tick_broadcast_ipi_handler(); return IRQ_HANDLED; } @@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_UNUSED] = unused_action, + [PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_UNUSED] = "ipi unused", + [PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & 
IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); + if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST)) + tick_broadcast_ipi_handler(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -289,6 +292,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask) do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } +void tick_broadcast(const struct cpumask *mask) +{ + unsigned int cpu; + + for_each_cpu(cpu, mask) + do_message_pass(cpu, PPC_MSG_TICK_BROADCAST); +} + #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) void smp_send_debugger_break(void) { diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index b3b1441..42269c7 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -813,6 +813,11 @@ static void decrementer_set_mode(enum clock_event_mode mode, decrementer_set_next_event(DECREMENTER_MAX, dev); } +/* Interrupt handler for the timer broadcast IPI */ +void tick_broadcast_ipi_handler(void) +{ +} + static void register_decrementer_clockevent(int cpu) { struct clock_event_device *dec = &per_cpu(decrementers, cpu); diff --git a/arch/powerpc/plat
[PATCH V5 3/8] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
Split timer_interrupt(), which is the local timer interrupt handler on ppc, into routines called during regular interrupt handling and __timer_interrupt(), which takes care of running local timers and collecting time related stats. This will enable callers interested only in running expired local timers to directly call into __timer_interrupt(). One of the use cases of this is the tick broadcast IPI handling in which the sleeping CPUs need to handle the local timers that have expired. Signed-off-by: Preeti U Murthy --- arch/powerpc/kernel/time.c | 73 +--- 1 file changed, 41 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 42269c7..42cb603 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -478,6 +478,42 @@ void arch_irq_work_raise(void) #endif /* CONFIG_IRQ_WORK */ +static void __timer_interrupt(void) +{ + struct pt_regs *regs = get_irq_regs(); + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + struct clock_event_device *evt = &__get_cpu_var(decrementers); + u64 now; + + __get_cpu_var(irq_stat).timer_irqs++; + trace_timer_interrupt_entry(regs); + + if (test_irq_work_pending()) { + clear_irq_work_pending(); + irq_work_run(); + } + + now = get_tb_or_rtc(); + if (now >= *next_tb) { + *next_tb = ~(u64)0; + if (evt->event_handler) + evt->event_handler(evt); + } else { + now = *next_tb - now; + if (now <= DECREMENTER_MAX) + set_dec((int)now); + } + +#ifdef CONFIG_PPC64 + /* collect purr register values often, for accurate calculations */ + if (firmware_has_feature(FW_FEATURE_SPLPAR)) { + struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); + cu->current_tb = mfspr(SPRN_PURR); + } +#endif + trace_timer_interrupt_exit(regs); +} + /* * timer_interrupt - gets called when the decrementer overflows, * with interrupts disabled.
@@ -486,8 +522,6 @@ void timer_interrupt(struct pt_regs * regs) { struct pt_regs *old_regs; u64 *next_tb = &__get_cpu_var(decrementers_next_tb); - struct clock_event_device *evt = &__get_cpu_var(decrementers); - u64 now; /* Ensure a positive value is written to the decrementer, or else * some CPUs will continue to take decrementer exceptions. @@ -510,8 +544,6 @@ void timer_interrupt(struct pt_regs * regs) */ may_hard_irq_enable(); - __get_cpu_var(irq_stat).timer_irqs++; - #if defined(CONFIG_PPC32) && defined(CONFIG_PMAC) if (atomic_read(_n_lost_interrupts) != 0) do_IRQ(regs); @@ -520,34 +552,7 @@ void timer_interrupt(struct pt_regs * regs) old_regs = set_irq_regs(regs); irq_enter(); - trace_timer_interrupt_entry(regs); - - if (test_irq_work_pending()) { - clear_irq_work_pending(); - irq_work_run(); - } - - now = get_tb_or_rtc(); - if (now >= *next_tb) { - *next_tb = ~(u64)0; - if (evt->event_handler) - evt->event_handler(evt); - } else { - now = *next_tb - now; - if (now <= DECREMENTER_MAX) - set_dec((int)now); - } - -#ifdef CONFIG_PPC64 - /* collect purr register values often, for accurate calculations */ - if (firmware_has_feature(FW_FEATURE_SPLPAR)) { - struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); - cu->current_tb = mfspr(SPRN_PURR); - } -#endif - - trace_timer_interrupt_exit(regs); - + __timer_interrupt(); irq_exit(); set_irq_regs(old_regs); } @@ -816,6 +821,10 @@ static void decrementer_set_mode(enum clock_event_mode mode, /* Interrupt handler for the timer broadcast IPI */ void tick_broadcast_ipi_handler(void) { + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + + *next_tb = get_tb_or_rtc(); + __timer_interrupt(); } static void register_decrementer_clockevent(int cpu) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V5 4/8] powernv/cpuidle: Add context management for Fast Sleep
From: Vaidyanathan Srinivasan Before adding Fast-Sleep into the cpuidle framework, some low level support needs to be added to enable it. This includes saving and restoring of certain registers at entry and exit time of this state respectively just like we do in the NAP idle state. Signed-off-by: Vaidyanathan Srinivasan [Changelog modified by Preeti U. Murthy ] Signed-off-by: Preeti U. Murthy --- arch/powerpc/include/asm/processor.h |1 + arch/powerpc/kernel/exceptions-64s.S | 10 - arch/powerpc/kernel/idle_power7.S| 63 -- 3 files changed, 53 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index 027fefd..22e547a 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -444,6 +444,7 @@ enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF}; extern int powersave_nap; /* set if nap mode can be used in idle loop */ extern void power7_nap(void); +extern void power7_sleep(void); extern void flush_instruction_cache(void); extern void hard_reset_now(void); extern void poweroff_now(void); diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 9f905e4..b8139fb 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -121,9 +121,10 @@ BEGIN_FTR_SECTION cmpwi cr1,r13,2 /* Total loss of HV state is fatal, we could try to use the * PIR to locate a PACA, then use an emergency stack etc... -* but for now, let's just stay stuck here +* OPAL v3 based powernv platforms have new idle states +* which fall in this catagory. */ - bgt cr1,. 
+ bgt cr1,8f GET_PACA(r13) #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE @@ -141,6 +142,11 @@ BEGIN_FTR_SECTION beq cr1,2f b .power7_wakeup_noloss 2: b .power7_wakeup_loss + + /* Fast Sleep wakeup on PowerNV */ +8: GET_PACA(r13) + b .power7_wakeup_loss + 9: END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) #endif /* CONFIG_PPC_P7_NAP */ diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S index 847e40e..e4bbca2 100644 --- a/arch/powerpc/kernel/idle_power7.S +++ b/arch/powerpc/kernel/idle_power7.S @@ -20,17 +20,27 @@ #undef DEBUG - .text +/* Idle state entry routines */ -_GLOBAL(power7_idle) - /* Now check if user or arch enabled NAP mode */ - LOAD_REG_ADDRBASE(r3,powersave_nap) - lwz r4,ADDROFF(powersave_nap)(r3) - cmpwi 0,r4,0 - beqlr - /* fall through */ +#defineIDLE_STATE_ENTER_SEQ(IDLE_INST) \ + /* Magic NAP/SLEEP/WINKLE mode enter sequence */\ + std r0,0(r1); \ + ptesync;\ + ld r0,0(r1); \ +1: cmp cr0,r0,r0; \ + bne 1b; \ + IDLE_INST; \ + b . -_GLOBAL(power7_nap) + .text + +/* + * Pass requested state in r3: + * 0 - nap + * 1 - sleep + */ +_GLOBAL(power7_powersave_common) + /* Use r3 to pass state nap/sleep/winkle */ /* NAP is a state loss, we create a regs frame on the * stack, fill it up with the state we care about and * stick a pointer to it in PACAR1. We really only @@ -79,8 +89,8 @@ _GLOBAL(power7_nap) /* Continue saving state */ SAVE_GPR(2, r1) SAVE_NVGPRS(r1) - mfcrr3 - std r3,_CCR(r1) + mfcrr4 + std r4,_CCR(r1) std r9,_MSR(r1) std r1,PACAR1(r13) @@ -89,15 +99,30 @@ _GLOBAL(power7_nap) li r4,KVM_HWTHREAD_IN_NAP stb r4,HSTATE_HWTHREAD_STATE(r13) #endif + cmpwi cr0,r3,1 + beq 2f + IDLE_STATE_ENTER_SEQ(PPC_NAP) + /* No return */ +2: IDLE_STATE_ENTER_SEQ(PPC_SLEEP) + /* No return */ - /* Magic NAP mode enter sequence */ - std r0,0(r1) - ptesync - ld r0,0(r1) -1: cmp cr0,r0,r0 - bne 1b - PPC_NAP - b . 
+_GLOBAL(power7_idle) + /* Now check if user or arch enabled NAP mode */ + LOAD_REG_ADDRBASE(r3,powersave_nap) + lwz r4,ADDROFF(powersave_nap)(r3) + cmpwi 0,r4,0 + beqlr + /* fall through */ + +_GLOBAL(power7_nap) + li r3,0 + b power7_powersave_common + /* No return */ + +_GLOBAL(power7_sleep) + li r3,1 + b power7_powersave_common + /* No return */ _GLOBAL(power7_wakeup_loss) ld r1,PACAR1(r13) -- To unsubscribe from this list: send the line
[PATCH V5 6/8] time/cpuidle: Support in tick broadcast framework in the absence of external clock device
On some architectures, the local timers stop in certain deep CPU idle states, and an external clock device is used to wake up these CPUs. The kernel supports such wakeups through the tick broadcast framework, which uses the external clock device as the wakeup source. However, not all architectures provide such an external clock device; some PowerPC implementations, for instance, lack one. This patch adds support in the broadcast framework to handle the wakeup of CPUs in deep idle states on such systems by queuing an hrtimer on one of the CPUs, identified as the bc_cpu. Each time the hrtimer expires, it handles the broadcast and is reprogrammed for the next wakeup of the CPUs in deep idle. However, when a CPU is about to enter deep idle with a wakeup time earlier than the time at which the hrtimer is currently programmed, it *becomes the new bc_cpu* and restarts the hrtimer on itself. This way the job of doing broadcast is handed around to the CPU that asks for the earliest wakeup just before entering deep idle. This is consistent with what happens when an external clock device is present: the smp affinity of that device is set to the CPU with the earliest wakeup. The important point here is that the bc_cpu cannot enter deep idle, since it has an hrtimer queued to wake up the other CPUs in deep idle and hence cannot have its local timer stopped. Therefore, for such a CPU, the BROADCAST_ENTER notification has to fail, implying that it cannot enter deep idle state. On architectures where an external clock device is present, all CPUs can enter deep idle. When the bc_cpu is hotplugged out, the job of doing broadcast is assigned to the first cpu in the broadcast mask. This newly nominated bc_cpu is woken up by an IPI so that it can queue the above-mentioned hrtimer on itself.
Signed-off-by: Preeti U Murthy --- include/linux/clockchips.h |4 - kernel/time/clockevents.c|9 +- kernel/time/tick-broadcast.c | 192 ++ kernel/time/tick-internal.h |8 +- 4 files changed, 186 insertions(+), 27 deletions(-) diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index 493aa02..bbda37b 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; } #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS -extern void clockevents_notify(unsigned long reason, void *arg); +extern int clockevents_notify(unsigned long reason, void *arg); #else -static inline void clockevents_notify(unsigned long reason, void *arg) {} +static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; } #endif #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */ diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 086ad60..d61404e 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -524,12 +524,13 @@ void clockevents_resume(void) #ifdef CONFIG_GENERIC_CLOCKEVENTS /** * clockevents_notify - notification about relevant events + * Returns non-zero on error.
*/ -void clockevents_notify(unsigned long reason, void *arg) +int clockevents_notify(unsigned long reason, void *arg) { struct clock_event_device *dev, *tmp; unsigned long flags; - int cpu; + int cpu, ret = 0; raw_spin_lock_irqsave(&clockevents_lock, flags); @@ -542,11 +543,12 @@ void clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_BROADCAST_ENTER: case CLOCK_EVT_NOTIFY_BROADCAST_EXIT: - tick_broadcast_oneshot_control(reason); + ret = tick_broadcast_oneshot_control(reason); break; case CLOCK_EVT_NOTIFY_CPU_DYING: tick_handover_do_timer(arg); + tick_handover_broadcast_cpu(arg); break; case CLOCK_EVT_NOTIFY_SUSPEND: @@ -585,6 +587,7 @@ void clockevents_notify(unsigned long reason, void *arg) break; } raw_spin_unlock_irqrestore(&clockevents_lock, flags); + return ret; } EXPORT_SYMBOL_GPL(clockevents_notify); diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 9532690..1c23912 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "tick-internal.h" @@ -35,6 +36,15 @@ static cpumask_var_t tmpmask; static DEFINE_RAW_SPINLOCK(tick_broadcast_lock); static int tick_broadcast_force; +/* + * Helper variables for handling broadcast in the absence of a + * tick_broadcast_device. + */ +static struct hrtimer *bc_hrtimer; +static int bc_cpu = -1; +static ktime_t bc_next_wakeup; +static int hrtimer_initialized = 0; + #ifdef CONFIG_TICK_ONESHOT static void tick_broadcast_clear_oneshot(int cpu); #else @@ -528,6 +538,20 @@ static int tick_broadcast
[PATCH V5 5/8] powermgt: Add OPAL call to resync timebase on wakeup
From: Vaidyanathan Srinivasan During "Fast-sleep" and deeper power savings state, decrementer and timebase could be stopped making it out of sync with rest of the cores in the system. Add a firmware call to request platform to resync timebase using low level platform methods. Signed-off-by: Vaidyanathan Srinivasan Signed-off-by: Preeti U. Murthy --- arch/powerpc/include/asm/opal.h|2 ++ arch/powerpc/kernel/exceptions-64s.S |2 +- arch/powerpc/kernel/idle_power7.S | 27 arch/powerpc/platforms/powernv/opal-wrappers.S |1 + 4 files changed, 31 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 033c06b..a662d06 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -132,6 +132,7 @@ extern int opal_enter_rtas(struct rtas_args *args, #define OPAL_FLASH_VALIDATE76 #define OPAL_FLASH_MANAGE 77 #define OPAL_FLASH_UPDATE 78 +#define OPAL_RESYNC_TIMEBASE 79 #ifndef __ASSEMBLY__ @@ -763,6 +764,7 @@ extern void opal_flash_init(void); extern int opal_machine_check(struct pt_regs *regs); extern void opal_shutdown(void); +extern int opal_resync_timebase(void); extern void opal_lpc_init(void); diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index b8139fb..91e6417 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -145,7 +145,7 @@ BEGIN_FTR_SECTION /* Fast Sleep wakeup on PowerNV */ 8: GET_PACA(r13) - b .power7_wakeup_loss + b .power7_wakeup_tb_loss 9: END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S index e4bbca2..34c71e8 100644 --- a/arch/powerpc/kernel/idle_power7.S +++ b/arch/powerpc/kernel/idle_power7.S @@ -17,6 +17,7 @@ #include #include #include +#include #undef DEBUG @@ -124,6 +125,32 @@ _GLOBAL(power7_sleep) b power7_powersave_common /* No return */ +_GLOBAL(power7_wakeup_tb_loss) + ld r2,PACATOC(r13); + ld 
r1,PACAR1(r13) + + /* Time base re-sync */ + li r0,OPAL_RESYNC_TIMEBASE + LOAD_REG_ADDR(r11,opal); + ld r12,8(r11); + ld r2,0(r11); + mtctr r12 + bctrl + + /* TODO: Check r3 for failure */ + + REST_NVGPRS(r1) + REST_GPR(2, r1) + ld r3,_CCR(r1) + ld r4,_MSR(r1) + ld r5,_NIP(r1) + addir1,r1,INT_FRAME_SIZE + mtcrr3 + mfspr r3,SPRN_SRR1/* Return SRR1 */ + mtspr SPRN_SRR1,r4 + mtspr SPRN_SRR0,r5 + rfid + _GLOBAL(power7_wakeup_loss) ld r1,PACAR1(r13) REST_NVGPRS(r1) diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S index e780650..ddfe95a 100644 --- a/arch/powerpc/platforms/powernv/opal-wrappers.S +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S @@ -126,3 +126,4 @@ OPAL_CALL(opal_return_cpu, OPAL_RETURN_CPU); OPAL_CALL(opal_validate_flash, OPAL_FLASH_VALIDATE); OPAL_CALL(opal_manage_flash, OPAL_FLASH_MANAGE); OPAL_CALL(opal_update_flash, OPAL_FLASH_UPDATE); +OPAL_CALL(opal_resync_timebase,OPAL_RESYNC_TIMEBASE); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V5 7/8] cpuidle/powernv: Add "Fast-Sleep" CPU idle state
Fast sleep is one of the deep idle states on Power8 in which local timers of CPUs stop. On PowerPC we do not have an external clock device which can handle wakeup of such CPUs. Now that we have the support in the tick broadcast framework for archs that do not sport such a device and the low level support for fast sleep, enable it in the cpuidle framework on PowerNV. Signed-off-by: Preeti U Murthy --- arch/powerpc/Kconfig |2 ++ arch/powerpc/kernel/time.c|2 +- drivers/cpuidle/cpuidle-powernv.c | 39 + 3 files changed, 42 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index b44b52c..cafa788 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -129,6 +129,8 @@ config PPC select GENERIC_CMOS_UPDATE select GENERIC_TIME_VSYSCALL_OLD select GENERIC_CLOCKEVENTS + select GENERIC_CLOCKEVENTS_BROADCAST + select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select GENERIC_STRNCPY_FROM_USER select GENERIC_STRNLEN_USER select HAVE_MOD_ARCH_SPECIFIC diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 42cb603..d9efd93 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -106,7 +106,7 @@ struct clock_event_device decrementer_clockevent = { .irq= 0, .set_next_event = decrementer_set_next_event, .set_mode = decrementer_set_mode, - .features = CLOCK_EVT_FEAT_ONESHOT, + .features = CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_C3STOP, }; EXPORT_SYMBOL(decrementer_clockevent); diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index 78fd174..e3aa62f 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include @@ -49,6 +50,37 @@ static int nap_loop(struct cpuidle_device *dev, return index; } +static int fastsleep_loop(struct cpuidle_device *dev, + struct cpuidle_driver *drv, + int index) +{ + int cpu = dev->cpu; + unsigned long old_lpcr = mfspr(SPRN_LPCR); + 
unsigned long new_lpcr; + + new_lpcr = old_lpcr; + new_lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */ + + /* exit powersave upon external interrupt, but not decrementer +* interrupt. Emulate sleep. +*/ + new_lpcr |= LPCR_PECE0; + + if (clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu)) { + new_lpcr |= LPCR_PECE1; + mtspr(SPRN_LPCR, new_lpcr); + power7_nap(); + } else { + mtspr(SPRN_LPCR, new_lpcr); + power7_sleep(); + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); + } + + mtspr(SPRN_LPCR, old_lpcr); + + return index; +} + /* * States for dedicated partition case. */ @@ -67,6 +99,13 @@ static struct cpuidle_state powernv_states[] = { .exit_latency = 10, .target_residency = 100, .enter = &nap_loop }, +{ /* Fastsleep */ + .name = "fastsleep", + .desc = "fastsleep", + .flags = CPUIDLE_FLAG_TIME_VALID, + .exit_latency = 10, + .target_residency = 100, + .enter = &fastsleep_loop }, }; static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V5 8/8] cpuidle/powernv: Parse device tree to setup idle states
Add deep idle states such as nap and fast sleep to the cpuidle state table only if they are discovered from the device tree during cpuidle initialization. Signed-off-by: Preeti U Murthy --- drivers/cpuidle/cpuidle-powernv.c | 81 + 1 file changed, 64 insertions(+), 17 deletions(-) diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index e3aa62f..b01987d 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -12,10 +12,17 @@ #include #include #include +#include #include #include +/* Flags and constants used in PowerNV platform */ + +#define MAX_POWERNV_IDLE_STATES8 +#define IDLE_USE_INST_NAP 0x0001 /* Use nap instruction */ +#define IDLE_USE_INST_SLEEP0x0002 /* Use sleep instruction */ + struct cpuidle_driver powernv_idle_driver = { .name = "powernv_idle", .owner= THIS_MODULE, @@ -84,7 +91,7 @@ static int fastsleep_loop(struct cpuidle_device *dev, /* * States for dedicated partition case. */ -static struct cpuidle_state powernv_states[] = { +static struct cpuidle_state powernv_states[MAX_POWERNV_IDLE_STATES] = { { /* Snooze */ .name = "snooze", .desc = "snooze", @@ -92,20 +99,6 @@ static struct cpuidle_state powernv_states[] = { .exit_latency = 0, .target_residency = 0, .enter = _loop }, - { /* NAP */ - .name = "NAP", - .desc = "NAP", - .flags = CPUIDLE_FLAG_TIME_VALID, - .exit_latency = 10, - .target_residency = 100, - .enter = _loop }, -{ /* Fastsleep */ - .name = "fastsleep", - .desc = "fastsleep", - .flags = CPUIDLE_FLAG_TIME_VALID, - .exit_latency = 10, - .target_residency = 100, - .enter = _loop }, }; static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n, @@ -166,19 +159,73 @@ static int powernv_cpuidle_driver_init(void) return 0; } +static int powernv_add_idle_states(void) +{ + struct device_node *power_mgt; + struct property *prop; + int nr_idle_states = 1; /* Snooze */ + int dt_idle_states; + u32 *flags; + int i; + + /* Currently we have snooze statically defined */ + + 
power_mgt = of_find_node_by_path("/ibm,opal/power-mgt"); + if (!power_mgt) { + pr_warn("opal: PowerMgmt Node not found\n"); + return nr_idle_states; + } + + prop = of_find_property(power_mgt, "ibm,cpu-idle-state-flags", NULL); + if (!prop) { + pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n"); + return nr_idle_states; + } + + dt_idle_states = prop->length / sizeof(u32); + flags = (u32 *) prop->value; + + for (i = 0; i < dt_idle_states; i++) { + + if (flags[i] & IDLE_USE_INST_NAP) { + /* Add NAP state */ + strcpy(powernv_states[nr_idle_states].name, "Nap"); + strcpy(powernv_states[nr_idle_states].desc, "Nap"); + powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID; + powernv_states[nr_idle_states].exit_latency = 10; + powernv_states[nr_idle_states].target_residency = 100; + powernv_states[nr_idle_states].enter = _loop; + nr_idle_states++; + } + + if (flags[i] & IDLE_USE_INST_SLEEP) { + /* Add FASTSLEEP state */ + strcpy(powernv_states[nr_idle_states].name, "FastSleep"); + strcpy(powernv_states[nr_idle_states].desc, "FastSleep"); + powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID; + powernv_states[nr_idle_states].exit_latency = 300; + powernv_states[nr_idle_states].target_residency = 100; + powernv_states[nr_idle_states].enter = _loop; + nr_idle_states++; + } + } + + return nr_idle_states; +} + /* * powernv_idle_probe() * Choose state table for shared versus dedicated partition */ static int powernv_idle_probe(void) { - if (cpuidle_disable != IDLE_NO_OVERRIDE) return -ENODEV; if (firmware_has_feature(FW_FEATURE_OPALv3)) { cpuidle_state_table = powernv_states; - max_idle_state = ARRAY_SIZE(powernv_states); + /* Device tree can indicate more idle states */ + max_idle_stat
[RESEND PATCH V5 0/8] cpuidle/ppc: Enable deep idle states on PowerNV
"broadcast period", and the next wakeup event. By introducing the "broadcast period" as the maximum period after which the broadcast hrtimer can fire, we ensure that we do not miss wakeups in corner cases. 3. On hotplug of a broadcast cpu, trigger the hrtimer meant to do broadcast to fire immediately on the new broadcast cpu. This will ensure we do not miss doing a broadcast pending in the nearest future. 4. Change the type of allocation from GFP_KERNEL to GFP_NOWAIT while initializing bc_hrtimer since we are in an atomic context and cannot sleep. 5. Use the broadcast ipi to wakeup the newly nominated broadcast cpu on hotplug of the old instead of smp_call_function_single(). This is because we are interrupt disabled at this point and should not be using smp_call_function_single or its children in this context to send an ipi. 6. Move GENERIC_CLOCKEVENTS_BROADCAST to arch/powerpc/Kconfig. 7. Fix coding style issues. Changes in V2: https://lkml.org/lkml/2013/8/14/239 1. Dynamically pick a broadcast CPU, instead of having a dedicated one. 2. Remove the constraint of having to disable tickless idle on the broadcast CPU by queueing a hrtimer dedicated to do broadcast. V1 posting: https://lkml.org/lkml/2013/7/25/740. 1. Added the infrastructure to wakeup CPUs in deep idle states in which the local timers stop. --- Preeti U Murthy (5): cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines powermgt: Add OPAL call to resync timebase on wakeup time/cpuidle: Support in tick broadcast framework in the absence of external clock device cpuidle/powernv: Add "Fast-Sleep" CPU idle state cpuidle/powernv: Parse device tree to setup idle states Srivatsa S.
Bhat (2): powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message powerpc: Implement tick broadcast IPI as a fixed IPI message Vaidyanathan Srinivasan (1): powernv/cpuidle: Add context management for Fast Sleep arch/powerpc/Kconfig |2 arch/powerpc/include/asm/opal.h|2 arch/powerpc/include/asm/processor.h |1 arch/powerpc/include/asm/smp.h |2 arch/powerpc/include/asm/time.h|1 arch/powerpc/kernel/exceptions-64s.S | 10 + arch/powerpc/kernel/idle_power7.S | 90 +-- arch/powerpc/kernel/smp.c | 23 ++- arch/powerpc/kernel/time.c | 88 +++ arch/powerpc/platforms/cell/interrupt.c|2 arch/powerpc/platforms/powernv/opal-wrappers.S |1 arch/powerpc/platforms/ps3/smp.c |2 drivers/cpuidle/cpuidle-powernv.c | 109 -- include/linux/clockchips.h |4 - kernel/time/clockevents.c |9 + kernel/time/tick-broadcast.c | 192 ++-- kernel/time/tick-internal.h|8 + 17 files changed, 442 insertions(+), 104 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RESEND PATCH V5 3/8] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
Split timer_interrupt(), which is the local timer interrupt handler on ppc, into routines called during regular interrupt handling and __timer_interrupt(), which takes care of running local timers and collecting time-related stats. This will enable callers interested only in running expired local timers to directly call into __timer_interrupt(). One of the use cases of this is the tick broadcast IPI handling in which the sleeping CPUs need to handle the local timers that have expired. Signed-off-by: Preeti U Murthy --- arch/powerpc/kernel/time.c | 81 +--- 1 file changed, 46 insertions(+), 35 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 3ff97db..df2989b 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -478,6 +478,47 @@ void arch_irq_work_raise(void) #endif /* CONFIG_IRQ_WORK */ +void __timer_interrupt(void) +{ + struct pt_regs *regs = get_irq_regs(); + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + struct clock_event_device *evt = &__get_cpu_var(decrementers); + u64 now; + + trace_timer_interrupt_entry(regs); + + if (test_irq_work_pending()) { + clear_irq_work_pending(); + irq_work_run(); + } + + now = get_tb_or_rtc(); + if (now >= *next_tb) { + *next_tb = ~(u64)0; + if (evt->event_handler) + evt->event_handler(evt); + __get_cpu_var(irq_stat).timer_irqs_event++; + } else { + now = *next_tb - now; + if (now <= DECREMENTER_MAX) + set_dec((int)now); + /* We may have raced with new irq work */ + if (test_irq_work_pending()) + set_dec(1); + __get_cpu_var(irq_stat).timer_irqs_others++; + } + +#ifdef CONFIG_PPC64 + /* collect purr register values often, for accurate calculations */ + if (firmware_has_feature(FW_FEATURE_SPLPAR)) { + struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); + cu->current_tb = mfspr(SPRN_PURR); + } +#endif + + trace_timer_interrupt_exit(regs); +} + /* * timer_interrupt - gets called when the decrementer overflows, * with interrupts disabled.
@@ -486,8 +527,6 @@ void timer_interrupt(struct pt_regs * regs) { struct pt_regs *old_regs; u64 *next_tb = &__get_cpu_var(decrementers_next_tb); - struct clock_event_device *evt = &__get_cpu_var(decrementers); - u64 now; /* Ensure a positive value is written to the decrementer, or else * some CPUs will continue to take decrementer exceptions. @@ -519,39 +558,7 @@ void timer_interrupt(struct pt_regs * regs) old_regs = set_irq_regs(regs); irq_enter(); - trace_timer_interrupt_entry(regs); - - if (test_irq_work_pending()) { - clear_irq_work_pending(); - irq_work_run(); - } - - now = get_tb_or_rtc(); - if (now >= *next_tb) { - *next_tb = ~(u64)0; - if (evt->event_handler) - evt->event_handler(evt); - __get_cpu_var(irq_stat).timer_irqs_event++; - } else { - now = *next_tb - now; - if (now <= DECREMENTER_MAX) - set_dec((int)now); - /* We may have raced with new irq work */ - if (test_irq_work_pending()) - set_dec(1); - __get_cpu_var(irq_stat).timer_irqs_others++; - } - -#ifdef CONFIG_PPC64 - /* collect purr register values often, for accurate calculations */ - if (firmware_has_feature(FW_FEATURE_SPLPAR)) { - struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); - cu->current_tb = mfspr(SPRN_PURR); - } -#endif - - trace_timer_interrupt_exit(regs); - + __timer_interrupt(); irq_exit(); set_irq_regs(old_regs); } @@ -828,6 +835,10 @@ static void decrementer_set_mode(enum clock_event_mode mode, /* Interrupt handler for the timer broadcast IPI */ void tick_broadcast_ipi_handler(void) { + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + + *next_tb = get_tb_or_rtc(); + __timer_interrupt(); } static void register_decrementer_clockevent(int cpu) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RESEND PATCH V5 1/8] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
From: Srivatsa S. Bhat The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map to a common implementation - generic_smp_call_function_single_interrupt(). So, we can consolidate them and save one of the IPI message slots, (which are precious on powerpc, since only 4 of those slots are available). So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be used for something else in the future, if desired. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/kernel/smp.c | 12 +--- arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 4 files changed, 8 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 084e080..9f7356b 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_CALL_FUNC_SINGLE 2 +#define PPC_MSG_UNUSED 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ac2621a..ee7d76b 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t call_function_single_action(int irq, void *data) +static irqreturn_t unused_action(int irq, void *data) { - generic_smp_call_function_single_interrupt(); + /* This slot is unused and hence available for use, if needed */ return IRQ_HANDLED; } @@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, + [PPC_MSG_UNUSED] = unused_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", + [PPC_MSG_UNUSED] = "ipi unused", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); - if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE)) - generic_smp_call_function_single_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule); void arch_send_call_function_single_ipi(int cpu) { - do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); + do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } void arch_send_call_function_ipi_mask(const struct cpumask *mask) 
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 2d42f3b..adf3726 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -215,7 +215,7 @@ void iic_request_IPIs(void) { iic_request_ipi(PPC_MSG_CALL_FUNCTION); iic_request_ipi(PPC_MSG_RESCHEDULE); - iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE); + iic_request_ipi(PPC_MSG_UNUSED); iic_request_ipi(PPC_MSG_DEBUGGER_BREAK); } diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c index 4b35166..00d1a7c 100644 --- a/arch/powerpc/platforms/ps3/smp.c +++ b/arch/powerpc/platforms/ps3/smp.c @@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void) BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION!= 0); BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1); - BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2); + BUILD_BUG_ON(PPC_MSG_UNUSED != 2); BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3); for (i = 0; i < MSG_COUNT; i++) { -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RESEND PATCH V5 2/8] powerpc: Implement tick broadcast IPI as a fixed IPI message
From: Srivatsa S. Bhat

For scalability and performance reasons, we want the tick broadcast IPIs
to be handled as efficiently as possible. Fixed IPI messages are one of
the most efficient mechanisms available - they are faster than the
smp_call_function mechanism because the IPI handlers are fixed and hence
they don't involve costly operations such as adding IPI handlers to the
target CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
tick broadcast IPIs efficiently.

Signed-off-by: Srivatsa S. Bhat
[Functions renamed to tick_broadcast* and Changelog modified by
Preeti U. Murthy]
Signed-off-by: Preeti U. Murthy
Acked-by: Geoff Levand [For the PS3 part]
---
 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/include/asm/time.h         |    1 +
 arch/powerpc/kernel/smp.c               |   19 +++
 arch/powerpc/kernel/time.c              |    5 +
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 6 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 9f7356b..ff51046 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu);
  * in /proc/interrupts will be wrong!!!
 --Troy
 */
 #define PPC_MSG_CALL_FUNCTION	0
 #define PPC_MSG_RESCHEDULE	1
-#define PPC_MSG_UNUSED	2
+#define PPC_MSG_TICK_BROADCAST	2
 #define PPC_MSG_DEBUGGER_BREAK	3
 
 /* for irq controllers that have dedicated ipis per message (4) */
 
diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..1d428e6 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent;
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void tick_broadcast_ipi_handler(void);
 
 extern void generic_calibrate_decr(void);
 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ee7d76b..6f06f05 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t tick_broadcast_ipi_action(int irq, void *data)
 {
-	/* This slot is unused and hence available for use, if needed */
+	tick_broadcast_ipi_handler();
 	return IRQ_HANDLED;
 }
 
@@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
 	[PPC_MSG_CALL_FUNCTION] = call_function_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
-	[PPC_MSG_UNUSED] = unused_action,
+	[PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
 	[PPC_MSG_CALL_FUNCTION] = "ipi call function",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
-	[PPC_MSG_UNUSED] = "ipi unused",
+	[PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
 };
 
@@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void)
 			generic_smp_call_function_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
 			scheduler_ipi();
+		if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST))
+			tick_broadcast_ipi_handler();
 		if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK))
 			debug_ipi_action(0, NULL);
 	} while (info->messages);
@@ -289,6 +292,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 		do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
 }
 
+void tick_broadcast(const struct cpumask *mask)
+{
+	unsigned int cpu;
+
+	for_each_cpu(cpu, mask)
+		do_message_pass(cpu, PPC_MSG_TICK_BROADCAST);
+}
+
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 void smp_send_debugger_break(void)
 {
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index b3dab20..3ff97db 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -825,6 +825,11 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 		decrementer_set_next_event(DECREMENTER_MAX, dev);
 }
 
+/* Interrupt handler for the timer broadcast IPI */
+void tick_broadcast_ipi_handler(void)
+{
+}
+
 static void register_decrementer_clockevent(int cpu)
 {
 	struct clock_event_device *dec = &per_cpu(decrementers, cpu);
diff --git a/arch/powerpc/plat
[RESEND PATCH V5 7/8] cpuidle/powernv: Add "Fast-Sleep" CPU idle state
Fast sleep is one of the deep idle states on Power8, in which the local
timers of the CPUs stop. On PowerPC we do not have an external clock
device which can handle the wakeup of such CPUs. Now that we have the
support in the tick broadcast framework for archs that do not sport such
a device, and the low level support for fast sleep, enable it in the
cpuidle framework on PowerNV.

Signed-off-by: Preeti U Murthy
---
 arch/powerpc/Kconfig              |    2 ++
 arch/powerpc/kernel/time.c        |    2 +-
 drivers/cpuidle/cpuidle-powernv.c |   42 +
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index fa39517..ec91584 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -129,6 +129,8 @@ config PPC
 	select GENERIC_CMOS_UPDATE
 	select GENERIC_TIME_VSYSCALL_OLD
 	select GENERIC_CLOCKEVENTS
+	select GENERIC_CLOCKEVENTS_BROADCAST
+	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select HAVE_MOD_ARCH_SPECIFIC
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index df2989b..95fa5ce 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -106,7 +106,7 @@ struct clock_event_device decrementer_clockevent = {
 	.irq		= 0,
 	.set_next_event	= decrementer_set_next_event,
 	.set_mode	= decrementer_set_mode,
-	.features	= CLOCK_EVT_FEAT_ONESHOT,
+	.features	= CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_C3STOP,
 };
 EXPORT_SYMBOL(decrementer_clockevent);
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 78fd174..90f0c2b 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -49,6 +50,40 @@ static int nap_loop(struct cpuidle_device *dev,
 	return index;
 }
 
+static int fastsleep_loop(struct cpuidle_device *dev,
+				struct cpuidle_driver *drv,
+				int index)
+{
+	int cpu = dev->cpu;
+	unsigned long old_lpcr = mfspr(SPRN_LPCR);
+	unsigned long new_lpcr;
+
+	if (unlikely(system_state < SYSTEM_RUNNING))
+		return index;
+
+	new_lpcr = old_lpcr;
+	new_lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
+
+	/* exit powersave upon external interrupt, but not decrementer
+	 * interrupt. Emulate sleep.
+	 */
+	new_lpcr |= LPCR_PECE0;
+
+	if (clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu)) {
+		new_lpcr |= LPCR_PECE1;
+		mtspr(SPRN_LPCR, new_lpcr);
+		power7_nap();
+	} else {
+		mtspr(SPRN_LPCR, new_lpcr);
+		power7_sleep();
+	}
+	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+
+	mtspr(SPRN_LPCR, old_lpcr);
+
+	return index;
+}
+
 /*
  * States for dedicated partition case.
  */
@@ -67,6 +102,13 @@ static struct cpuidle_state powernv_states[] = {
 		.exit_latency = 10,
 		.target_residency = 100,
 		.enter = &nap_loop },
+	{ /* Fastsleep */
+		.name = "fastsleep",
+		.desc = "fastsleep",
+		.flags = CPUIDLE_FLAG_TIME_VALID,
+		.exit_latency = 10,
+		.target_residency = 100,
+		.enter = &fastsleep_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,
[RESEND PATCH V5 4/8] powernv/cpuidle: Add context management for Fast Sleep
From: Vaidyanathan Srinivasan

Before adding Fast-Sleep into the cpuidle framework, some low level
support needs to be added to enable it. This includes saving and
restoring of certain registers at entry and exit time of this state,
respectively, just like we do in the NAP idle state.

Signed-off-by: Vaidyanathan Srinivasan
[Changelog modified by Preeti U. Murthy]
Signed-off-by: Preeti U. Murthy
---
 arch/powerpc/include/asm/processor.h |    1 +
 arch/powerpc/kernel/exceptions-64s.S |   10 -
 arch/powerpc/kernel/idle_power7.S    |   63 --
 3 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index b62de43..d660dc3 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -450,6 +450,7 @@ enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF};
 
 extern int powersave_nap;	/* set if nap mode can be used in idle loop */
 extern void power7_nap(void);
+extern void power7_sleep(void);
 extern void flush_instruction_cache(void);
 extern void hard_reset_now(void);
 extern void poweroff_now(void);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 38d5073..b01a9cb 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -121,9 +121,10 @@ BEGIN_FTR_SECTION
 	cmpwi	cr1,r13,2
 	/* Total loss of HV state is fatal, we could try to use the
 	 * PIR to locate a PACA, then use an emergency stack etc...
-	 * but for now, let's just stay stuck here
+	 * OPAL v3 based powernv platforms have new idle states
+	 * which fall in this catagory.
 	 */
-	bgt	cr1,.
+	bgt	cr1,8f
 	GET_PACA(r13)
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
@@ -141,6 +142,11 @@ BEGIN_FTR_SECTION
 	beq	cr1,2f
 	b	.power7_wakeup_noloss
 2:	b	.power7_wakeup_loss
+
+	/* Fast Sleep wakeup on PowerNV */
+8:	GET_PACA(r13)
+	b	.power7_wakeup_loss
+
 9:
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 #endif /* CONFIG_PPC_P7_NAP */
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 3fdef0f..14f78be 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -20,17 +20,27 @@
 
 #undef DEBUG
 
-	.text
+/* Idle state entry routines */
 
-_GLOBAL(power7_idle)
-	/* Now check if user or arch enabled NAP mode */
-	LOAD_REG_ADDRBASE(r3,powersave_nap)
-	lwz	r4,ADDROFF(powersave_nap)(r3)
-	cmpwi	0,r4,0
-	beqlr
-	/* fall through */
+#define	IDLE_STATE_ENTER_SEQ(IDLE_INST)	\
+	/* Magic NAP/SLEEP/WINKLE mode enter sequence */	\
+	std	r0,0(r1);	\
+	ptesync;	\
+	ld	r0,0(r1);	\
+1:	cmp	cr0,r0,r0;	\
+	bne	1b;	\
+	IDLE_INST;	\
+	b	.
 
-_GLOBAL(power7_nap)
+	.text
+
+/*
+ * Pass requested state in r3:
+ *	0 - nap
+ *	1 - sleep
+ */
+_GLOBAL(power7_powersave_common)
+	/* Use r3 to pass state nap/sleep/winkle */
 	/* NAP is a state loss, we create a regs frame on the
 	 * stack, fill it up with the state we care about and
 	 * stick a pointer to it in PACAR1. We really only
@@ -79,8 +89,8 @@ _GLOBAL(power7_nap)
 	/* Continue saving state */
 	SAVE_GPR(2, r1)
 	SAVE_NVGPRS(r1)
-	mfcr	r3
-	std	r3,_CCR(r1)
+	mfcr	r4
+	std	r4,_CCR(r1)
 	std	r9,_MSR(r1)
 	std	r1,PACAR1(r13)
 
@@ -90,15 +100,30 @@ _GLOBAL(power7_enter_nap_mode)
 	li	r4,KVM_HWTHREAD_IN_NAP
 	stb	r4,HSTATE_HWTHREAD_STATE(r13)
 #endif
+	cmpwi	cr0,r3,1
+	beq	2f
+	IDLE_STATE_ENTER_SEQ(PPC_NAP)
+	/* No return */
+2:	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+	/* No return */
 
-	/* Magic NAP mode enter sequence */
-	std	r0,0(r1)
-	ptesync
-	ld	r0,0(r1)
-1:	cmp	cr0,r0,r0
-	bne	1b
-	PPC_NAP
-	b	.
+_GLOBAL(power7_idle)
+	/* Now check if user or arch enabled NAP mode */
+	LOAD_REG_ADDRBASE(r3,powersave_nap)
+	lwz	r4,ADDROFF(powersave_nap)(r3)
+	cmpwi	0,r4,0
+	beqlr
+	/* fall through */
+
+_GLOBAL(power7_nap)
+	li	r3,0
+	b	power7_powersave_common
+	/* No return */
+
+_GLOBAL(power7_sleep)
+	li	r3,1
+	b	power7_powersave_common
+	/* No return */
 
 _GLOBAL(power7_wakeup_loss)
 	ld	r1,PACAR1(r13)
[RESEND PATCH V5 5/8] powermgt: Add OPAL call to resync timebase on wakeup
From: Vaidyanathan Srinivasan

During "Fast-sleep" and deeper power savings states, the decrementer and
timebase could be stopped, making them out of sync with the rest of the
cores in the system. Add a firmware call to request the platform to
resync the timebase using low level platform methods.

Signed-off-by: Vaidyanathan Srinivasan
Signed-off-by: Preeti U. Murthy
---
 arch/powerpc/include/asm/opal.h                |    2 ++
 arch/powerpc/kernel/exceptions-64s.S           |    2 +-
 arch/powerpc/kernel/idle_power7.S              |   27
 arch/powerpc/platforms/powernv/opal-wrappers.S |    1 +
 4 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 9a87b44..8c4829f 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -154,6 +154,7 @@ extern int opal_enter_rtas(struct rtas_args *args,
 #define OPAL_FLASH_VALIDATE		76
 #define OPAL_FLASH_MANAGE		77
 #define OPAL_FLASH_UPDATE		78
+#define OPAL_RESYNC_TIMEBASE		79
 #define OPAL_GET_MSG			85
 #define OPAL_CHECK_ASYNC_COMPLETION	86
@@ -863,6 +864,7 @@ extern void opal_flash_init(void);
 extern int opal_machine_check(struct pt_regs *regs);
 
 extern void opal_shutdown(void);
+extern int opal_resync_timebase(void);
 
 extern void opal_lpc_init(void);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index b01a9cb..9533d7a 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -145,7 +145,7 @@ BEGIN_FTR_SECTION
 
 	/* Fast Sleep wakeup on PowerNV */
 8:	GET_PACA(r13)
-	b	.power7_wakeup_loss
+	b	.power7_wakeup_tb_loss
 
 9:
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 14f78be..c3ab869 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 
 #undef DEBUG
 
@@ -125,6 +126,32 @@ _GLOBAL(power7_sleep)
 	b	power7_powersave_common
 	/* No return */
 
+_GLOBAL(power7_wakeup_tb_loss)
+	ld	r2,PACATOC(r13);
+	ld	r1,PACAR1(r13)
+
+	/* Time base re-sync */
+	li	r0,OPAL_RESYNC_TIMEBASE
+	LOAD_REG_ADDR(r11,opal);
+	ld	r12,8(r11);
+	ld	r2,0(r11);
+	mtctr	r12
+	bctrl
+
+	/* TODO: Check r3 for failure */
+
+	REST_NVGPRS(r1)
+	REST_GPR(2, r1)
+	ld	r3,_CCR(r1)
+	ld	r4,_MSR(r1)
+	ld	r5,_NIP(r1)
+	addi	r1,r1,INT_FRAME_SIZE
+	mtcr	r3
+	mfspr	r3,SPRN_SRR1		/* Return SRR1 */
+	mtspr	SPRN_SRR1,r4
+	mtspr	SPRN_SRR0,r5
+	rfid
+
 _GLOBAL(power7_wakeup_loss)
 	ld	r1,PACAR1(r13)
 	REST_NVGPRS(r1)
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 719aa5c..a11a87c 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -126,5 +126,6 @@ OPAL_CALL(opal_return_cpu,	OPAL_RETURN_CPU);
 OPAL_CALL(opal_validate_flash,	OPAL_FLASH_VALIDATE);
 OPAL_CALL(opal_manage_flash,	OPAL_FLASH_MANAGE);
 OPAL_CALL(opal_update_flash,	OPAL_FLASH_UPDATE);
+OPAL_CALL(opal_resync_timebase,	OPAL_RESYNC_TIMEBASE);
 OPAL_CALL(opal_get_msg,		OPAL_GET_MSG);
 OPAL_CALL(opal_check_completion,	OPAL_CHECK_ASYNC_COMPLETION);
[RESEND PATCH V5 6/8] time/cpuidle: Support in tick broadcast framework in the absence of external clock device
On some architectures, in certain CPU deep idle states the local timers
stop. An external clock device is used to wake up these CPUs. The kernel
support for the wakeup of these CPUs is provided by the tick broadcast
framework, by using the external clock device as the wakeup source.

However, not all architecture implementations provide such an external
clock device; some PowerPC ones do not. This patch adds support in the
broadcast framework to handle the wakeup of CPUs in deep idle states on
such systems by queuing a hrtimer on one of the CPUs, which is then
responsible for waking the CPUs in deep idle. This CPU is identified as
the bc_cpu. Each time the hrtimer expires, it is reprogrammed for the
next wakeup of the CPUs in deep idle state, after handling broadcast.
However, when a CPU is about to enter deep idle state with its wakeup
time earlier than the time at which the hrtimer is currently programmed,
it *becomes the new bc_cpu* and restarts the hrtimer on itself. This way
the job of doing broadcast is handed around to the CPUs that ask for the
earliest wakeup just before entering deep idle state. This is consistent
with what happens when an external clock device is present: the smp
affinity of that clock device is set to the CPU with the earliest wakeup.

The important point here is that the bc_cpu cannot enter deep idle
state, since it has a hrtimer queued to wake up the other CPUs in deep
idle; hence it cannot have its local timer stopped. Therefore, for such
a CPU, the BROADCAST_ENTER notification has to fail, implying that it
cannot enter deep idle state. On architectures where an external clock
device is present, all CPUs can enter deep idle.

During hotplug of the bc_cpu, the job of doing a broadcast is assigned
to the first cpu in the broadcast mask. This newly nominated bc_cpu is
woken up by an IPI so as to queue the above mentioned hrtimer on it.
Signed-off-by: Preeti U Murthy
---
 include/linux/clockchips.h   |    4 -
 kernel/time/clockevents.c    |    9 +-
 kernel/time/tick-broadcast.c |  192 ++
 kernel/time/tick-internal.h  |    8 +-
 4 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 493aa02..bbda37b 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; }
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
-extern void clockevents_notify(unsigned long reason, void *arg);
+extern int clockevents_notify(unsigned long reason, void *arg);
 #else
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) {}
 #endif
 
 #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */
 
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 086ad60..d61404e 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -524,12 +524,13 @@ void clockevents_resume(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 /**
  * clockevents_notify - notification about relevant events
+ * Returns non zero on error.
 */
-void clockevents_notify(unsigned long reason, void *arg)
+int clockevents_notify(unsigned long reason, void *arg)
 {
 	struct clock_event_device *dev, *tmp;
 	unsigned long flags;
-	int cpu;
+	int cpu, ret = 0;
 
 	raw_spin_lock_irqsave(&clockevents_lock, flags);
 
@@ -542,11 +543,12 @@ void clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_BROADCAST_ENTER:
 	case CLOCK_EVT_NOTIFY_BROADCAST_EXIT:
-		tick_broadcast_oneshot_control(reason);
+		ret = tick_broadcast_oneshot_control(reason);
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_handover_broadcast_cpu(arg);
 		break;
 
 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -585,6 +587,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 		break;
 	}
 	raw_spin_unlock_irqrestore(&clockevents_lock, flags);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(clockevents_notify);
 
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 9532690..1c23912 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 
 #include "tick-internal.h"
 
@@ -35,6 +36,15 @@ static cpumask_var_t tmpmask;
 static DEFINE_RAW_SPINLOCK(tick_broadcast_lock);
 static int tick_broadcast_force;
 
+/*
+ * Helper variables for handling broadcast in the absence of a
+ * tick_broadcast_device.
+ */
+static struct hrtimer *bc_hrtimer;
+static int bc_cpu = -1;
+static ktime_t bc_next_wakeup;
+static int hrtimer_initialized = 0;
+
 #ifdef CONFIG_TICK_ONESHOT
 static void tick_broadcast_clear_oneshot(int cpu);
 #else
@@ -528,6 +538,20 @@ static int tick_broadcast
[RESEND PATCH V5 8/8] cpuidle/powernv: Parse device tree to setup idle states
Add deep idle states such as nap and fast sleep to the cpuidle state
table only if they are discovered from the device tree during cpuidle
initialization.

Signed-off-by: Preeti U Murthy
---
 drivers/cpuidle/cpuidle-powernv.c |   81 +
 1 file changed, 64 insertions(+), 17 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 90f0c2b..b3face5 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -12,10 +12,17 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
 
+/* Flags and constants used in PowerNV platform */
+
+#define MAX_POWERNV_IDLE_STATES	8
+#define IDLE_USE_INST_NAP	0x0001 /* Use nap instruction */
+#define IDLE_USE_INST_SLEEP	0x0002 /* Use sleep instruction */
+
 struct cpuidle_driver powernv_idle_driver = {
 	.name  = "powernv_idle",
 	.owner = THIS_MODULE,
@@ -87,7 +94,7 @@ static int fastsleep_loop(struct cpuidle_device *dev,
 /*
  * States for dedicated partition case.
  */
-static struct cpuidle_state powernv_states[] = {
+static struct cpuidle_state powernv_states[MAX_POWERNV_IDLE_STATES] = {
 	{ /* Snooze */
 		.name = "snooze",
 		.desc = "snooze",
@@ -95,20 +102,6 @@ static struct cpuidle_state powernv_states[] = {
 		.exit_latency = 0,
 		.target_residency = 0,
 		.enter = &snooze_loop },
-	{ /* NAP */
-		.name = "NAP",
-		.desc = "NAP",
-		.flags = CPUIDLE_FLAG_TIME_VALID,
-		.exit_latency = 10,
-		.target_residency = 100,
-		.enter = &nap_loop },
-	{ /* Fastsleep */
-		.name = "fastsleep",
-		.desc = "fastsleep",
-		.flags = CPUIDLE_FLAG_TIME_VALID,
-		.exit_latency = 10,
-		.target_residency = 100,
-		.enter = &fastsleep_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,
@@ -169,19 +162,73 @@ static int powernv_cpuidle_driver_init(void)
 	return 0;
 }
 
+static int powernv_add_idle_states(void)
+{
+	struct device_node *power_mgt;
+	struct property *prop;
+	int nr_idle_states = 1; /* Snooze */
+	int dt_idle_states;
+	u32 *flags;
+	int i;
+
+	/* Currently we have snooze statically defined */
+
+	power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
+	if (!power_mgt) {
+		pr_warn("opal: PowerMgmt Node not found\n");
+		return nr_idle_states;
+	}
+
+	prop = of_find_property(power_mgt, "ibm,cpu-idle-state-flags", NULL);
+	if (!prop) {
+		pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n");
+		return nr_idle_states;
+	}
+
+	dt_idle_states = prop->length / sizeof(u32);
+	flags = (u32 *) prop->value;
+
+	for (i = 0; i < dt_idle_states; i++) {
+
+		if (flags[i] & IDLE_USE_INST_NAP) {
+			/* Add NAP state */
+			strcpy(powernv_states[nr_idle_states].name, "Nap");
+			strcpy(powernv_states[nr_idle_states].desc, "Nap");
+			powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID;
+			powernv_states[nr_idle_states].exit_latency = 10;
+			powernv_states[nr_idle_states].target_residency = 100;
+			powernv_states[nr_idle_states].enter = &nap_loop;
+			nr_idle_states++;
+		}
+
+		if (flags[i] & IDLE_USE_INST_SLEEP) {
+			/* Add FASTSLEEP state */
+			strcpy(powernv_states[nr_idle_states].name, "FastSleep");
+			strcpy(powernv_states[nr_idle_states].desc, "FastSleep");
+			powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID;
+			powernv_states[nr_idle_states].exit_latency = 300;
+			powernv_states[nr_idle_states].target_residency = 100;
+			powernv_states[nr_idle_states].enter = &fastsleep_loop;
+			nr_idle_states++;
+		}
+	}
+
+	return nr_idle_states;
+}
+
 /*
  * powernv_idle_probe()
  * Choose state table for shared versus dedicated partition
 */
 static int powernv_idle_probe(void)
 {
-
 	if (cpuidle_disable != IDLE_NO_OVERRIDE)
 		return -ENODEV;
 
 	if (firmware_has_feature(FW_FEATURE_OPALv3)) {
 		cpuidle_state_table = powernv_states;
-		max_idle_state = ARRAY_SIZE(powernv_states);
+		/* Device tree can indicate more idle states */
+		max_idle_stat
Re: [PATCH V5 6/8] time/cpuidle: Support in tick broadcast framework in the absence of external clock device
Hi Thomas,

Thank you very much for the review.

On 01/22/2014 06:57 PM, Thomas Gleixner wrote:
> On Wed, 15 Jan 2014, Preeti U Murthy wrote:
>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>> index 086ad60..d61404e 100644
>> --- a/kernel/time/clockevents.c
>> +++ b/kernel/time/clockevents.c
>> @@ -524,12 +524,13 @@ void clockevents_resume(void)
>>  #ifdef CONFIG_GENERIC_CLOCKEVENTS
>>  /**
>>   * clockevents_notify - notification about relevant events
>> + * Returns non zero on error.
>>   */
>> -void clockevents_notify(unsigned long reason, void *arg)
>> +int clockevents_notify(unsigned long reason, void *arg)
>>  {
>
> The interface change of clockevents_notify wants to be a separate
> patch.
>
>> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
>> index 9532690..1c23912 100644
>> --- a/kernel/time/tick-broadcast.c
>> +++ b/kernel/time/tick-broadcast.c
>> @@ -20,6 +20,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  #include "tick-internal.h"
>>
>> @@ -35,6 +36,15 @@ static cpumask_var_t tmpmask;
>>  static DEFINE_RAW_SPINLOCK(tick_broadcast_lock);
>>  static int tick_broadcast_force;
>>
>> +/*
>> + * Helper variables for handling broadcast in the absence of a
>> + * tick_broadcast_device.
>> + */
>> +static struct hrtimer *bc_hrtimer;
>> +static int bc_cpu = -1;
>> +static ktime_t bc_next_wakeup;
>
> Why do you need another variable to store the expiry time? The
> broadcast code already knows it and the hrtimer expiry value gives you
> the same information for free.

The reason was that functions like tick_handle_oneshot_broadcast() and
tick_broadcast_switch_to_oneshot() were using
tick_broadcast_device.evtdev->next_event to set/get the next wakeups.
But since this patchset introduced an explicit hrtimer for archs which
do not have such a device, I wanted these functions to use a generic
parameter to set/get the next wakeups, without having to know about the
existence of this hrtimer, if at all.
And to program the hrtimer or tick broadcast device, whichever was
present, only when the next event was to be set. But with your concept
patch below, we will not be required to do this.

>> +static int hrtimer_initialized = 0;
>
> What's the point of this hrtimer_initialized dance? Why not simply
> making the hrtimer static and avoid that all together. Also adding the
> initialization into tick_broadcast_oneshot_available() is
> braindamaged. Why not adding this to tick_broadcast_init() which is
> the proper place to do?

Right, I agree. This hrtimer initialization should have been in
tick_broadcast_init(), and a simple static declaration would have done
the job.

> Aside of that you are making this hrtimer mode unconditional, which
> might break existing systems which are not aware of the hrtimer
> implications.
>
> What you really want is a pseudo clock event device which has the
> proper functions for handling the timer and you can register it from
> your architecture code. The broadcast core code needs a few tweaks to
> avoid the shutdown of the cpu local clock event device, but aside of
> that the whole thing just falls into place. So architectures can use
> this if they want and are sure that their low level idle code knows
> about the deep idle preventing return value of
> clockevents_notify(). Once that works you can register the hrtimer
> based broadcast device and a real hardware broadcast device with a
> higher rating. It just works.

I now completely see your point. This will surely break on archs which
are not using the return value of the BROADCAST_ENTER notification. I am
not even giving them a choice about using the hrtimer mode of the
broadcast framework, and yet am expecting them to take action on the
failed return of BROADCAST_ENTER. I missed that critical point.

I went through the below patch and am able to see how you are solving
this problem.

> Find an incomplete and nonfunctional concept patch below. It should be
> simple to make it work for real.
Thank you very much for the valuable review. The below patch makes your
points very clear. Let me try this out.

Regards
Preeti U Murthy

> Thanks,
>
> 	tglx
>
> Index: linux-2.6/include/linux/clockchips.h
> ===
> --- linux-2.6.orig/include/linux/clockchips.h
> +++ linux-2.6/include/linux/clockchips.h
> @@ -62,6 +62,11 @@ enum clock_event_mode {
>  #define CLOCK_EVT_FEAT_DYNIRQ	0x20
>  #define CLOCK_EVT_FEAT_PERCPU	0x40
>
> +/*
> + * Clockevent device is based on a hrtimer for broa
Re: [PATCH V2] cpuidle/governors: Fix logic in selection of idle states
Hi Daniel,

Thank you for the review.

On 01/22/2014 01:59 PM, Daniel Lezcano wrote:
> On 01/17/2014 05:33 AM, Preeti U Murthy wrote:
>>
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index a55e68f..831b664 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -131,8 +131,9 @@ int cpuidle_idle_call(void)
>>
>>      /* ask the governor for the next state */
>>      next_state = cpuidle_curr_governor->select(drv, dev);
>> +
>> +    dev->last_residency = 0;
>>      if (need_resched()) {
>> -        dev->last_residency = 0;
>
> Why do you need to do this change ?

So as to keep the last_residency consistent with the case that this
patch addresses: where no idle state could be selected due to strict
latency requirements or disabled states, and hence the cpu exits without
entering idle. Else it would contain the stale value from the previous
idle state entry.

But come to think of it, dev->last_residency is not used when the last
entered idle state index is -1. So I have reverted this change as well
in the revised patch below, along with mentioning the reason in the last
paragraph of the changelog.

>>      /* give the governor an opportunity to reflect on the
>>       * outcome */
>>      if (cpuidle_curr_governor->reflect)
>>          cpuidle_curr_governor->reflect(dev, next_state);
>> @@ -140,6 +141,18 @@ int cpuidle_idle_call(void)
>>      return 0;
>>  }
>>
>> +    /* Unlike in the need_resched() case, we return here because the
>> +     * governor did not find a suitable idle state. However idle is
>> +     * still in progress as we are not asked to reschedule. Hence we
>> +     * return without enabling interrupts.
>
> That will lead to a WARN.
>
>> +     * NOTE: The return code should still be success, since the
>> +     * verdict of this call is "do not enter any idle state" and not
>> +     * a failed call due to errors.
>> +     */
>> +    if (next_state < 0)
>> +        return 0;
>> +
>
> Returning from here breaks the symmetry of the trace.
I have addressed the above concerns in the patch found below. Does the
rest of the patch look sound?

Regards
Preeti U Murthy

--
cpuidle/governors: Fix logic in selection of idle states

From: Preeti U Murthy

The cpuidle governors today are not handling scenarios where no idle
state can be chosen. Such scenarios could arise if the user has disabled
all the idle states at runtime, or the latency requirement from the cpus
is very strict.

The menu governor returns the 0th index of the idle state table when no
other idle state is suitable. It does so even when the idle state
corresponding to this index is disabled, or the latency requirement is
strict and the exit_latency of the lowest idle state is also not
acceptable. Hence this patch fixes this logic in the menu governor by
defaulting to an idle state index of -1 unless any other state is
suitable.

The ladder governor needs a few more fixes in addition to those required
in the menu governor. When the ladder governor decides to demote the
idle state of a CPU, it does not check if the lower idle states are
enabled. Add this logic, in addition to the logic where it chooses an
index of -1 if it can neither promote nor demote the idle state of a cpu,
nor choose the current idle state.

cpuidle_idle_call() will return if the governor decides upon not
entering any idle state. However it cannot return an error code, because
all archs have the logic today that if the call to cpuidle_idle_call()
fails, it means that the cpuidle driver failed to *function*; for
instance due to errors during registration. As a result they end up
deciding upon a default idle state on their own, which could very well
be a deep idle state. This is incorrect in cases where no idle state is
suitable. Besides, for the scenario that this patch is addressing, the
call actually succeeds; it is just that no idle state is thought to be
suitable by the governors. Under such a circumstance, return a success
code without entering any idle state.
The consequence of this patch additionally on the menu governor is that as long as a valid idle state cannot be chosen, the cpuidle statistics that this governor uses to predict the next idle state remain untouched from the last valid idle state. This is because an idle state is not even being predicted in this path, hence there is no point correcting the prediction either. Signed-off-by: Preeti U Murthy Changes from V1:https://lkml.org/lkml/2014/1/14/26 1. Change the return code to success from -EINVAL due to the reason mentioned in the c
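The selection rule described in the changelog — default to -1 and only settle on a state that is both enabled and within the latency limit — can be distilled into a small userspace sketch. This is not the kernel's code; `struct idle_state`, `select_state` and the constants are illustrative stand-ins for the governor logic:

```c
#include <assert.h>

#define NO_STATE (-1) /* the governor's "no suitable state" verdict */

struct idle_state {
	int disabled;
	int exit_latency; /* worst-case wakeup latency, in us */
};

/*
 * Sketch of the fixed selection: start from -1 and only pick an index
 * whose state is enabled and whose exit latency meets the requirement.
 */
static int select_state(const struct idle_state *s, int n, int latency_req)
{
	int i, idx = NO_STATE;

	for (i = 0; i < n; i++) {
		if (s[i].disabled || s[i].exit_latency > latency_req)
			continue;
		idx = i; /* deepest acceptable state so far */
	}
	return idx;
}
```

With all states disabled or too slow, the caller sees -1 and simply stays out of idle instead of blindly entering state 0.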
Re: [PATCH V5 6/8] time/cpuidle: Support in tick broadcast framework in the absence of external clock device
Hi Thomas, The below patch works pretty much as is. I tried this out with deep idle states on our system. Looking through the code and analysing corner cases also did not bring out any issues to me. I will send out a patch V2 of this. Regards Preeti U Murthy On 01/22/2014 06:57 PM, Thomas Gleixner wrote: > On Wed, 15 Jan 2014, Preeti U Murthy wrote: >> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c >> index 086ad60..d61404e 100644 >> --- a/kernel/time/clockevents.c >> +++ b/kernel/time/clockevents.c >> @@ -524,12 +524,13 @@ void clockevents_resume(void) >> #ifdef CONFIG_GENERIC_CLOCKEVENTS >> /** >> * clockevents_notify - notification about relevant events >> + * Returns non zero on error. >> */ >> -void clockevents_notify(unsigned long reason, void *arg) >> +int clockevents_notify(unsigned long reason, void *arg) >> { > > The interface change of clockevents_notify wants to be a separate > patch. > >> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c >> index 9532690..1c23912 100644 >> --- a/kernel/time/tick-broadcast.c >> +++ b/kernel/time/tick-broadcast.c >> @@ -20,6 +20,7 @@ >> #include >> #include >> #include >> +#include >> >> #include "tick-internal.h" >> >> @@ -35,6 +36,15 @@ static cpumask_var_t tmpmask; >> static DEFINE_RAW_SPINLOCK(tick_broadcast_lock); >> static int tick_broadcast_force; >> >> +/* >> + * Helper variables for handling broadcast in the absence of a >> + * tick_broadcast_device. >> + * */ >> +static struct hrtimer *bc_hrtimer; >> +static int bc_cpu = -1; >> +static ktime_t bc_next_wakeup; > > Why do you need another variable to store the expiry time? The > broadcast code already knows it and the hrtimer expiry value gives you > the same information for free. > >> +static int hrtimer_initialized = 0; > > What's the point of this hrtimer_initialized dance? Why not simply > making the hrtimer static and avoid that all together. 
Also adding the > initialization into tick_broadcast_oneshot_available() is > braindamaged. Why not adding this to tick_broadcast_init() which is > the proper place to do? > > Aside of that you are making this hrtimer mode unconditional, which > might break existing systems which are not aware of the hrtimer > implications. > > What you really want is a pseudo clock event device which has the > proper functions for handling the timer and you can register it from > your architecture code. The broadcast core code needs a few tweaks to > avoid the shutdown of the cpu local clock event device, but aside of > that the whole thing just falls into place. So architectures can use > this if they want and are sure that their low level idle code knows > about the deep idle preventing return value of > clockevents_notify(). Once that works you can register the hrtimer > based broadcast device and a real hardware broadcast device with a > higher rating. It just works. > > Find an incomplete and nonfunctional concept patch below. It should be > simple to make it work for real. 
> > Thanks, > > tglx > > Index: linux-2.6/include/linux/clockchips.h > === > --- linux-2.6.orig/include/linux/clockchips.h > +++ linux-2.6/include/linux/clockchips.h > @@ -62,6 +62,11 @@ enum clock_event_mode { > #define CLOCK_EVT_FEAT_DYNIRQ0x20 > #define CLOCK_EVT_FEAT_PERCPU0x40 > > +/* > + * Clockevent device is based on a hrtimer for broadcast > + */ > +#define CLOCK_EVT_FEAT_HRTIMER 0x80 > + > /** > * struct clock_event_device - clock event device descriptor > * @event_handler: Assigned by the framework to be called by the low > @@ -83,6 +88,7 @@ enum clock_event_mode { > * @name:ptr to clock event name > * @rating: variable to rate clock event devices > * @irq: IRQ number (only for non CPU local devices) > + * @bound_on:Bound on CPU > * @cpumask: cpumask to indicate for which CPUs this device works > * @list:list head for the management code > * @owner: module reference > @@ -113,6 +119,7 @@ struct clock_event_device { > const char *name; > int rating; > int irq; > + int bound_on; > const struct cpumask*cpumask; > struct list_headlist; > struct module *owner; > Index: linux-2.6/kernel/time/tick-broadcast-hrtimer.c
[PATCH V2 0/2] time/cpuidle: Support in tick broadcast framework in absence of external clock device
This earlier version of this patchset can be found here: https://lkml.org/lkml/2013/12/12/687. This version has been based on the discussion in http://www.kernelhub.org/?p=2=399516. This patchset provides the hooks that the architectures without an external clock device and deep idle states where the local timers stop can make use of. Presently we are in need of this support on certain implementations of PowerPC. This patchset has been used on PowerPC for testing with --- Preeti U Murthy (1): time: Change the return type of clockevents_notify() to integer Thomas Gleixner (1): tick/cpuidle: Initialize hrtimer mode of broadcast include/linux/clockchips.h | 15 - kernel/time/Makefile |2 - kernel/time/clockevents.c|8 ++- kernel/time/tick-broadcast-hrtimer.c | 102 ++ kernel/time/tick-broadcast.c | 51 - kernel/time/tick-internal.h |6 +- 6 files changed, 171 insertions(+), 13 deletions(-) create mode 100644 kernel/time/tick-broadcast-hrtimer.c -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 2/2] tick/cpuidle: Initialize hrtimer mode of broadcast
From: Thomas Gleixner On some architectures, in certain CPU deep idle states the local timers stop. An external clock device is used to wakeup these CPUs. The kernel support for the wakeup of these CPUs is provided by the tick broadcast framework by using the external clock device as the wakeup source. However not all implementations of architectures provide such an external clock device. This patch includes support in the broadcast framework to handle the wakeup of the CPUs in deep idle states on such systems by queuing a hrtimer on one of the CPUs, which is meant to handle the wakeup of CPUs in deep idle states. This patchset introduces a pseudo clock device which can be registered by the archs as tick_broadcast_device in the absence of a real external clock device. Once registered, the broadcast framework will work as is for these architectures as long as the archs take care of the BROADCAST_ENTER notification failing for one of the CPUs. This CPU is made the stand by CPU to handle wakeup of the CPUs in deep idle and it *must not enter deep idle states*. The CPU with the earliest wakeup is chosen to be this CPU. Hence this way the stand by CPU dynamically moves around and so does the hrtimer which is queued to trigger at the next earliest wakeup time. This is consistent with the case where an external clock device is present. The smp affinity of this clock device is set to the CPU with the earliest wakeup. This patchset handles the hotplug of the stand by CPU as well by moving the hrtimer on to the CPU handling the CPU_DEAD notification. 
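The "standby CPU moves to whichever CPU has the earliest wakeup" rule above can be modeled in a few lines of plain C. This is a toy model, not kernel code; `pick_standby_cpu` and the array of wakeup times are made-up names:

```c
#include <assert.h>
#include <stdint.h>

/*
 * The CPU whose next wakeup comes first stays out of deep idle and
 * arms the broadcast hrtimer; everyone else may enter deep idle and
 * be woken by its IPI.
 */
static int pick_standby_cpu(const int64_t *next_wakeup, int ncpus)
{
	int cpu, standby = 0;

	for (cpu = 1; cpu < ncpus; cpu++)
		if (next_wakeup[cpu] < next_wakeup[standby])
			standby = cpu;
	return standby;
}
```

Re-evaluating this on every idle entry is what makes the standby role (and the hrtimer) migrate dynamically, mirroring how a real external clock device's affinity would be reprogrammed.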
Signed-off-by: Preeti U Murthy [Added Changelog and code to handle reprogramming of hrtimer] --- include/linux/clockchips.h |9 +++ kernel/time/Makefile |2 - kernel/time/tick-broadcast-hrtimer.c | 102 ++ kernel/time/tick-broadcast.c | 45 +++ 4 files changed, 156 insertions(+), 2 deletions(-) create mode 100644 kernel/time/tick-broadcast-hrtimer.c diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index ac81b56..2293025 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -62,6 +62,11 @@ enum clock_event_mode { #define CLOCK_EVT_FEAT_DYNIRQ 0x20 #define CLOCK_EVT_FEAT_PERCPU 0x40 +/* + * Clockevent device is based on a hrtimer for broadcast + */ +#define CLOCK_EVT_FEAT_HRTIMER 0x80 + /** * struct clock_event_device - clock event device descriptor * @event_handler: Assigned by the framework to be called by the low @@ -83,6 +88,7 @@ enum clock_event_mode { * @name: ptr to clock event name * @rating:variable to rate clock event devices * @irq: IRQ number (only for non CPU local devices) + * @bound_on: Bound on CPU * @cpumask: cpumask to indicate for which CPUs this device works * @list: list head for the management code * @owner: module reference @@ -113,6 +119,7 @@ struct clock_event_device { const char *name; int rating; int irq; + int bound_on; const struct cpumask*cpumask; struct list_headlist; struct module *owner; @@ -180,9 +187,11 @@ extern int tick_receive_broadcast(void); #endif #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT) +extern void tick_setup_hrtimer_broadcast(void); extern int tick_check_broadcast_expired(void); #else static inline int tick_check_broadcast_expired(void) { return 0; } +static void tick_setup_hrtimer_broadcast(void) {}; #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS diff --git a/kernel/time/Makefile b/kernel/time/Makefile index 9250130..06151ef 100644 --- a/kernel/time/Makefile +++ b/kernel/time/Makefile @@ -3,7 +3,7 @@ obj-y += timeconv.o posix-clock.o alarmtimer.o 
obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)+= clockevents.o obj-$(CONFIG_GENERIC_CLOCKEVENTS) += tick-common.o -obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)+= tick-broadcast.o +obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)+= tick-broadcast.o tick-broadcast-hrtimer.o obj-$(CONFIG_GENERIC_SCHED_CLOCK) += sched_clock.o obj-$(CONFIG_TICK_ONESHOT) += tick-oneshot.o obj-$(CONFIG_TICK_ONESHOT) += tick-sched.o diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c new file mode 100644 index 000..23f4925 --- /dev/null +++ b/kernel/time/tick-broadcast-hrtimer.c @@ -0,0 +1,102 @@ +/* + * linux/kernel/time/tick-broadcast-hrtimer.c + * This file emulates a local clock event device + * via a pseudo clock device. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "tick-internal.h" + +static stru
[PATCH V2 1/2] time: Change the return type of clockevents_notify() to integer
The broadcast framework can potentially be made use of by archs which do not have an external clock device as well. Then, it is required that one of the CPUs need to handle the broadcasting of wakeup IPIs to the CPUs in deep idle. As a result its local timers should remain functional all the time. For such a CPU, the BROADCAST_ENTER notification has to fail indicating that its clock device cannot be shutdown. To make way for this support, change the return type of tick_broadcast_oneshot_control() and hence clockevents_notify() to indicate such scenarios. Signed-off-by: Preeti U Murthy --- include/linux/clockchips.h |6 +++--- kernel/time/clockevents.c|8 +--- kernel/time/tick-broadcast.c |6 -- kernel/time/tick-internal.h |6 +++--- 4 files changed, 15 insertions(+), 11 deletions(-) diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index 493aa02..ac81b56 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; } #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS -extern void clockevents_notify(unsigned long reason, void *arg); +extern int clockevents_notify(unsigned long reason, void *arg); #else -static inline void clockevents_notify(unsigned long reason, void *arg) {} +static inline int clockevents_notify(unsigned long reason, void *arg) {} #endif #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */ @@ -196,7 +196,7 @@ static inline void clockevents_notify(unsigned long reason, void *arg) {} static inline void clockevents_suspend(void) {} static inline void clockevents_resume(void) {} -static inline void clockevents_notify(unsigned long reason, void *arg) {} +static inline int clockevents_notify(unsigned long reason, void *arg) {} static inline int tick_check_broadcast_expired(void) { return 0; } #endif diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 086ad60..79b8685 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ 
-524,12 +524,13 @@ void clockevents_resume(void) #ifdef CONFIG_GENERIC_CLOCKEVENTS /** * clockevents_notify - notification about relevant events + * Returns 0 on success, any other value on error */ -void clockevents_notify(unsigned long reason, void *arg) +int clockevents_notify(unsigned long reason, void *arg) { struct clock_event_device *dev, *tmp; unsigned long flags; - int cpu; + int cpu, ret = 0; raw_spin_lock_irqsave(&clockevents_lock, flags); @@ -542,7 +543,7 @@ void clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_BROADCAST_ENTER: case CLOCK_EVT_NOTIFY_BROADCAST_EXIT: - tick_broadcast_oneshot_control(reason); + ret = tick_broadcast_oneshot_control(reason); break; case CLOCK_EVT_NOTIFY_CPU_DYING: @@ -585,6 +586,7 @@ void clockevents_notify(unsigned long reason, void *arg) break; } raw_spin_unlock_irqrestore(&clockevents_lock, flags); + return ret; } EXPORT_SYMBOL_GPL(clockevents_notify); diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 9532690..be00692 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -633,14 +633,15 @@ again: /* * Powerstate information: The system enters/leaves a state, where * affected devices might stop + * Returns 0 on success, -EBUSY if the cpu is used to broadcast wakeups.
*/ -void tick_broadcast_oneshot_control(unsigned long reason) +int tick_broadcast_oneshot_control(unsigned long reason) { struct clock_event_device *bc, *dev; struct tick_device *td; unsigned long flags; ktime_t now; - int cpu; + int cpu, ret = 0; /* * Periodic mode does not care about the enter/exit of power @@ -746,6 +747,7 @@ void tick_broadcast_oneshot_control(unsigned long reason) } out: raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags); + return ret; } /* diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h index 18e71f7..164465c 100644 --- a/kernel/time/tick-internal.h +++ b/kernel/time/tick-internal.h @@ -46,7 +46,7 @@ extern int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *)); extern void tick_resume_oneshot(void); # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc); -extern void tick_broadcast_oneshot_control(unsigned long reason); +extern int tick_broadcast_oneshot_control(unsigned long reason); extern void tick_broadcast_switch_to_oneshot(void); extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup); extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc); @@ -58,7 +58,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc) { BUG(); } -static inline void tick_broadcast_oneshot_control
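The caller-side contract this return value enables can be sketched in plain C. This is a userspace model only; `bc_cpu`, `broadcast_enter` and `choose_idle_state` are illustrative names, and `BC_EBUSY` stands in for the kernel's -EBUSY:

```c
#include <assert.h>

#define BC_EBUSY 16 /* stand-in for the kernel's EBUSY */

static int bc_cpu = -1; /* CPU currently acting as the broadcast source */

/*
 * Model of the BROADCAST_ENTER notification: refuse to shut down the
 * local clock event device of the CPU that handles broadcast wakeups.
 */
static int broadcast_enter(int cpu)
{
	return cpu == bc_cpu ? -BC_EBUSY : 0;
}

/*
 * Idle entry path: fall back to a shallow state when ENTER is refused,
 * so the standby CPU keeps a functional local timer.
 */
static int choose_idle_state(int cpu, int deep, int shallow)
{
	return broadcast_enter(cpu) ? shallow : deep;
}
```

This is exactly why the low-level idle code must be "aware of the deep idle preventing return value" mentioned earlier in the thread: ignoring it would let the broadcast CPU itself lose its timer.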
Re: [PATCH V2] cpuidle/governors: Fix logic in selection of idle states
On 01/24/2014 02:38 PM, Daniel Lezcano wrote: > On 01/23/2014 12:15 PM, Preeti U Murthy wrote: >> Hi Daniel, >> >> Thank you for the review. >> >> On 01/22/2014 01:59 PM, Daniel Lezcano wrote: >>> On 01/17/2014 05:33 AM, Preeti U Murthy wrote: >>>> >>>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c >>>> index a55e68f..831b664 100644 >>>> --- a/drivers/cpuidle/cpuidle.c >>>> +++ b/drivers/cpuidle/cpuidle.c >>>> @@ -131,8 +131,9 @@ int cpuidle_idle_call(void) >>>> >>>>/* ask the governor for the next state */ >>>>next_state = cpuidle_curr_governor->select(drv, dev); >>>> + >>>> +dev->last_residency = 0; >>>>if (need_resched()) { >>>> -dev->last_residency = 0; >>> >>> Why do you need to do this change ? ^ >> >> So as to keep the last_residency consistent with the case that this patch >> addresses: where no idle state could be selected due to strict latency >> requirements or disabled states and hence the cpu exits without entering >> idle. Else it would contain the stale value from the previous idle state >> entry. >> >> But coming to think of it dev->last_residency is not used when the last >> entered idle state index is -1. >> >> So I have reverted this change as well in the revised patch below along >> with mentioning the reason in the last paragraph of the changelog. >> >>> >>>>/* give the governor an opportunity to reflect on the >>>> outcome */ >>>>if (cpuidle_curr_governor->reflect) >>>>cpuidle_curr_governor->reflect(dev, next_state); >>>> @@ -140,6 +141,18 @@ int cpuidle_idle_call(void) >>>>return 0; >>>>} >>>> >>>> +/* Unlike in the need_resched() case, we return here because the >>>> + * governor did not find a suitable idle state. However idle is >>>> still >>>> + * in progress as we are not asked to reschedule. Hence we return >>>> + * without enabling interrupts. >>> >>> That will lead to a WARN. 
>>> >>>> + * NOTE: The return code should still be success, since the >>>> verdict of this >>>> + * call is "do not enter any idle state" and not a failed call >>>> due to >>>> + * errors. >>>> + */ >>>> +if (next_state < 0) >>>> +return 0; >>>> + >>> >>> Returning from here breaks the symmetry of the trace. >> >> I have addressed the above concerns in the patch found below. >> Does the rest of the patch look sound? >> >> Regards >> Preeti U Murthy >> >> -- >> >> cpuidle/governors: Fix logic in selection of idle states >> >> From: Preeti U Murthy >> >> The cpuidle governors today are not handling scenarios where no idle >> state >> can be chosen. Such scenarios coud arise if the user has disabled all the >> idle states at runtime or the latency requirement from the cpus is >> very strict. >> >> The menu governor returns 0th index of the idle state table when no other >> idle state is suitable. This is even when the idle state corresponding >> to this >> index is disabled or the latency requirement is strict and the >> exit_latency >> of the lowest idle state is also not acceptable. Hence this patch >> fixes this logic in the menu governor by defaulting to an idle state >> index >> of -1 unless any other state is suitable. >> >> The ladder governor needs a few more fixes in addition to that >> required in the >> menu governor. When the ladder governor decides to demote the idle >> state of a >> CPU, it does not check if the lower idle states are enabled. Add this >> logic >> in addition to the logic where it chooses an index of -1 if it can >> neither >> promote or demote the idle state of a cpu nor can it choose the >> current idle >> state. >> >> The cpuidle_idle_call() will return back if the governor decides upon not >> entering any idle state. However it cannot return an error code >> because all >> archs have the logic today that if the call to cpuidle_idle_call() >> fails, it >> means that the cpuidle driver failed to *function*; for instance due to >>
Re: [PATCH 6/9] PPC: remove redundant cpuidle_idle_call()
Hi Nicolas, On 01/27/2014 11:38 AM, Nicolas Pitre wrote: > The core idle loop now takes care of it. However a few things need > checking: > > - Invocation of cpuidle_idle_call() in pseries_lpar_idle() happened > through arch_cpu_idle() and was therefore always preceded by a call > to ppc64_runlatch_off(). To preserve this property now that > cpuidle_idle_call() is invoked directly from core code, a call to > ppc64_runlatch_off() has been added to idle_loop_prolog() in > platforms/pseries/processor_idle.c. > > - Similarly, cpuidle_idle_call() was followed by ppc64_runlatch_on() > so a call to the latter has been added to idle_loop_epilog(). > > - And since arch_cpu_idle() always made sure to re-enable IRQs if they > were not enabled, this is now > done in idle_loop_epilog() as well. > > The above was made in order to keep the execution flow close to the > original. I don't know if that was strictly necessary. Someone well > acquainted with the platform details might find some room for possible > optimizations. > > Signed-off-by: Nicolas Pitre > --- > arch/powerpc/platforms/pseries/processor_idle.c | 5 > arch/powerpc/platforms/pseries/setup.c | 34 > ++--- > 2 files changed, 19 insertions(+), 20 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/processor_idle.c > b/arch/powerpc/platforms/pseries/processor_idle.c > index a166e38bd6..72ddfe3d2f 100644 > --- a/arch/powerpc/platforms/pseries/processor_idle.c > +++ b/arch/powerpc/platforms/pseries/processor_idle.c > @@ -33,6 +33,7 @@ static struct cpuidle_state *cpuidle_state_table; > > static inline void idle_loop_prolog(unsigned long *in_purr) > { > + ppc64_runlatch_off(); > *in_purr = mfspr(SPRN_PURR); > /* >* Indicate to the HV that we are idle. 
Now would be > @@ -49,6 +50,10 @@ static inline void idle_loop_epilog(unsigned long in_purr) > wait_cycles += mfspr(SPRN_PURR) - in_purr; > get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); > get_lppaca()->idle = 0; > + > + if (irqs_disabled()) > + local_irq_enable(); > + ppc64_runlatch_on(); > } > > static int snooze_loop(struct cpuidle_device *dev, > diff --git a/arch/powerpc/platforms/pseries/setup.c > b/arch/powerpc/platforms/pseries/setup.c > index c1f1908587..7604c19d54 100644 > --- a/arch/powerpc/platforms/pseries/setup.c > +++ b/arch/powerpc/platforms/pseries/setup.c > @@ -39,7 +39,6 @@ > #include > #include > #include > -#include > #include > #include > > @@ -356,29 +355,24 @@ early_initcall(alloc_dispatch_log_kmem_cache); > > static void pseries_lpar_idle(void) > { > - /* This would call on the cpuidle framework, and the back-end pseries > - * driver to go to idle states > + /* > + * Default handler to go into low thread priority and possibly > + * low power mode by cedeing processor to hypervisor >*/ > - if (cpuidle_idle_call()) { > - /* On error, execute default handler > - * to go into low thread priority and possibly > - * low power mode by cedeing processor to hypervisor > - */ > > - /* Indicate to hypervisor that we are idle. */ > - get_lppaca()->idle = 1; > + /* Indicate to hypervisor that we are idle. */ > + get_lppaca()->idle = 1; > > - /* > - * Yield the processor to the hypervisor. We return if > - * an external interrupt occurs (which are driven prior > - * to returning here) or if a prod occurs from another > - * processor. When returning here, external interrupts > - * are enabled. > - */ > - cede_processor(); > + /* > + * Yield the processor to the hypervisor. We return if > + * an external interrupt occurs (which are driven prior > + * to returning here) or if a prod occurs from another > + * processor. When returning here, external interrupts > + * are enabled. 
> + */ > + cede_processor(); > > - get_lppaca()->idle = 0; > - } > + get_lppaca()->idle = 0; > } > > /* > Reviewed-by: Preeti U Murthy The consequence of this is that other Power platforms like PowerNV will need to invoke ppc64_runlatch_off() and ppc64_runlatch_on() in each of their idle routines, since idle_loop_prolog() and idle_loop_epilog() are not invoked by them; but we will take care of this. Regards Preeti U Murthy
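The PowerNV follow-up promised here amounts to bracketing the snooze loop with runlatch updates. A userspace distillation of that shape (the runlatch SPR bit is reduced to a flag, need_resched() to a countdown — all names illustrative, not the kernel's):

```c
#include <assert.h>

static int runlatch = 1; /* models the PPC64 runlatch SPR bit */

/*
 * Model of snooze_loop(): drop the runlatch while polling so the
 * hardware knows the thread is idle, and restore it on exit.
 */
static int snooze_loop(int iterations)
{
	int polled = 0;

	runlatch = 0;           /* ppc64_runlatch_off() */
	while (iterations--)    /* while (!need_resched()) */
		polled++;
	runlatch = 1;           /* ppc64_runlatch_on() */
	return polled;
}
```

The invariant worth testing is simply that the latch is back on when the loop exits, no matter how long the poll ran.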
Re: a LLC sched domain bug for panda board?
Hi Alex, Vincent, On 02/04/2014 02:10 AM, Vincent Guittot wrote: > Yes, it's probably worth enabling by default for all ARM arch. > > Vincent > > On 02/04/2014 12:28 AM, Vincent Guittot wrote: >> On 3 February 2014 17:27, Vincent Guittot > wrote: >>> Have you checked that CONFIG_SCHED_LC is set ? >> >> sorry it's CONFIG_SCHED_MC > > Thanks for reminder! no it wasn't set. Does it means > arch/arm/configs/omap2plus_defconfig need add this config? Hmm..ok let me think this aloud. So it looks like the SMT, MC and NUMA sched domains are optional depending on the architecture. They are config dependent. These domains could potentially exist on the processor layout, but if the respective CONFIG options are not set, the scheduler could very well ignore these levels. What this means is that although the architecture could populate the cpu_sibling_mask and cpu_coregroup_mask, the scheduler is not mandated to schedule across the SMT and MC levels of the topology. It's just the CPU sched domain which is guaranteed to be present no matter what. This is indeed interesting to note :) Thanks Alex for bringing up this point :) On PowerPC, the SCHED_MC option can never be set. It's not even optional. On x86 it is on by default, and on ARM it looks like it's off by default. Thanks, Regards Preeti U Murthy
>>>> /proc/sys/kernel/sched_domain/cpu1/domain0/name:CPU >>>> >>>> -- >>>> Thanks >>>> Alex > > -- > Thanks > Alex
[PATCH V3] cpuidle/governors: Fix logic in selection of idle states
The cpuidle governors today are not handling scenarios where no idle state can be chosen. Such scenarios could arise if the user has disabled all the idle states at runtime or the latency requirement from the cpus is very strict. The menu governor returns the 0th index of the idle state table when no other idle state is suitable. This is so even when the idle state corresponding to this index is disabled or the latency requirement is strict and the exit_latency of the lowest idle state is also not acceptable. Hence this patch fixes this logic in the menu governor by defaulting to an idle state index of -1 unless any other state is suitable. The ladder governor needs a few more fixes in addition to those required in the menu governor. When the ladder governor decides to demote the idle state of a CPU, it does not check if the lower idle states are enabled. Add this logic in addition to the logic where it chooses an index of -1 if it can neither promote nor demote the idle state of a cpu, nor choose the current idle state. The cpuidle_idle_call() will simply return if the governor decides upon not entering any idle state. However it cannot return an error code because all archs have the logic today that if the call to cpuidle_idle_call() fails, it means that the cpuidle driver failed to *function*; for instance due to errors during registration. As a result they end up deciding upon a default idle state on their own, which could very well be a deep idle state. This is incorrect in cases where no idle state is suitable. Besides, for the scenario that this patch is addressing, the call actually succeeds. It's just that no idle state is thought to be suitable by the governors. Under such a circumstance, return a success code without entering any idle state.
The consequence of this patch, on the menu governor is that as long as a valid idle state cannot be chosen, the cpuidle statistics that this governor uses to predict the next idle state remain untouched from the last valid idle state. This is because an idle state is not even being predicted in this path, hence there is no point correcting the prediction either. Signed-off-by: Preeti U Murthy Changes from V1:https://lkml.org/lkml/2014/1/14/26 1. Change the return code to success from -EINVAL due to the reason mentioned in the changelog. 2. Add logic that the patch is addressing in the ladder governor as well. 3. Added relevant comments and removed redundant logic as suggested in the above thread. Changes from V2:lkml.org/lkml/2014/1/16/617 1. Enable interrupts when exiting from cpuidle_idle_call() in the case when no idle state was deemed suitable by the governor. --- drivers/cpuidle/cpuidle.c |2 - drivers/cpuidle/governors/ladder.c | 101 ++-- drivers/cpuidle/governors/menu.c |7 +- 3 files changed, 78 insertions(+), 32 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..89abdfc 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -131,7 +131,7 @@ int cpuidle_idle_call(void) /* ask the governor for the next state */ next_state = cpuidle_curr_governor->select(drv, dev); - if (need_resched()) { + if (need_resched() || (next_state < 0)) { dev->last_residency = 0; /* give the governor an opportunity to reflect on the outcome */ if (cpuidle_curr_governor->reflect) diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c index 9f08e8c..7e93aaa 100644 --- a/drivers/cpuidle/governors/ladder.c +++ b/drivers/cpuidle/governors/ladder.c @@ -58,6 +58,36 @@ static inline void ladder_do_selection(struct ladder_device *ldev, ldev->last_state_idx = new_idx; } +static int can_promote(struct ladder_device *ldev, int last_idx, + int last_residency) +{ + struct ladder_device_state *last_state; + 
+ last_state = &ldev->states[last_idx]; + if (last_residency > last_state->threshold.promotion_time) { + last_state->stats.promotion_count++; + last_state->stats.demotion_count = 0; + if (last_state->stats.promotion_count >= last_state->threshold.promotion_count) + return 1; + } + return 0; +} + +static int can_demote(struct ladder_device *ldev, int last_idx, + int last_residency) +{ + struct ladder_device_state *last_state; + + last_state = &ldev->states[last_idx]; + if (last_residency < last_state->threshold.demotion_time) { + last_state->stats.demotion_count++; + last_state->stats.promotion_count = 0; + if (last_state->stats.demotion_count >= last_state->threshold.demotion_count) + return 1; + } + return 0; +} + /** * ladder_select_state - selects the
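The promotion check in the patch boils down to counting consecutive residencies that exceed a threshold. A compilable distillation (field names follow the patch; everything else is simplified and illustrative — `can_demote` would be the mirror image with the comparisons reversed):

```c
#include <assert.h>

struct ladder_state {
	int promotion_time;         /* residency threshold, in us */
	int demotion_time;
	int promotion_count_thresh; /* consecutive hits needed to promote */
	int promotion_count;        /* running tallies */
	int demotion_count;
};

/* Returns 1 once enough consecutive residencies exceeded the threshold. */
static int can_promote(struct ladder_state *s, int last_residency)
{
	if (last_residency > s->promotion_time) {
		s->promotion_count++;
		s->demotion_count = 0; /* a long stay resets the demote tally */
		if (s->promotion_count >= s->promotion_count_thresh)
			return 1;
	}
	return 0;
}
```

Note how one long residency is never enough on its own; the ladder only climbs after a streak, which is what makes it resistant to one-off outliers.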
Re: [PATCH V2 1/2] time: Change the return type of clockevents_notify() to integer
On 02/04/2014 03:31 PM, Thomas Gleixner wrote: > On Fri, 24 Jan 2014, Preeti U Murthy wrote: >> -extern void tick_broadcast_oneshot_control(unsigned long reason); >> +extern int tick_broadcast_oneshot_control(unsigned long reason); > >> -static inline void tick_broadcast_oneshot_control(unsigned long reason) { } >> +static inline int tick_broadcast_oneshot_control(unsigned long reason) { } > >> -static inline void tick_broadcast_oneshot_control(unsigned long reason) { } >> +static inline int tick_broadcast_oneshot_control(unsigned long reason) { } > > The inline stubs need to return 0. Oh right! Apologies! Thanks. > > Thanks, > > tglx > Regards Preeti U Murthy
Re: [PATCH V2 2/2] tick/cpuidle: Initialize hrtimer mode of broadcast
Hi Thomas, On 02/04/2014 03:48 PM, Thomas Gleixner wrote: >> +++ b/kernel/time/tick-broadcast-hrtimer.c >> +/* >> + * This is called from the guts of the broadcast code when the cpu >> + * which is about to enter idle has the earliest broadcast timer event. >> + */ >> +static int bc_set_next(ktime_t expires, struct clock_event_device *bc) >> +{ >> +ktime_t now, interval; >> +/* >> + * We try to cancel the timer first. If the callback is on >> + * flight on some other cpu then we let it handle it. If we >> + * were able to cancel the timer nothing can rearm it as we >> + * own broadcast_lock. >> + * >> + * However if we are called from the hrtimer interrupt handler >> + * itself, reprogram it. >> + */ >> +if (hrtimer_try_to_cancel(&bctimer) >= 0) { >> +hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED); >> +/* Bind the "device" to the cpu */ >> +bc->bound_on = smp_processor_id(); >> +} else if (bc->bound_on == smp_processor_id()) { > > This part really wants a proper comment. It took me a while to figure > out why this is correct and what the call chain is. How about: "However we can also be called from the event handler of ce_broadcast_hrtimer when bctimer expires. We cannot therefore restart the timer since it is on flight on the same CPU. But due to the same reason we can reset it." ? > > >> +now = ktime_get(); >> +interval = ktime_sub(expires, now); >> +hrtimer_forward_now(&bctimer, interval); > > We are in the event handler called from bc_handler() and expires is > absolute time. So what's wrong with calling > hrtimer_set_expires(&bctimer, expires)? You are right. There are so many interfaces doing nearly the same thing :( I overlooked that hrtimer_forward() and its variants were being used when the interval was pre-calculated and stored away. And hrtimer_set_expires() would be used when we knew the absolute expiry. And it looks safe to call it here too. 
> >> +static enum hrtimer_restart bc_handler(struct hrtimer *t) >> +{ >> +ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer); >> +return HRTIMER_RESTART; > > We probably want to check whether the timer needs to be restarted at > all. > > if (ce_broadcast_hrtimer.next_event.tv64 == KTIME_MAX) > return HRTIMER_NORESTART; > > return HRTIMER_RESTART; True, this additional check would be useful. Do you want me to send out the next version with the above corrections, including the patch added to this thread where we handle archs setting the CPUIDLE_FLAG_TIMER_STOP flag? > > Hmm? > > Thanks, > > tglx Thanks Regards Preeti U Murthy > ___ > Linuxppc-dev mailing list > linuxppc-...@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev >
Re: [PATCH V3] cpuidle/governors: Fix logic in selection of idle states
Hi Arjan, On 02/04/2014 08:22 PM, Arjan van de Ven wrote: > On 2/4/2014 12:35 AM, Preeti U Murthy wrote: >> The cpuidle governors today are not handling scenarios where no idle >> state >> can be chosen. Such scenarios could arise if the user has disabled all the >> idle states at runtime or the latency requirement from the cpus is >> very strict. >> >> The menu governor returns 0th index of the idle state table when no other >> idle state is suitable. This is even when the idle state corresponding >> to this >> index is disabled or the latency requirement is strict and the >> exit_latency >> of the lowest idle state is also not acceptable. Hence this patch >> fixes this logic in the menu governor by defaulting to an idle state >> index >> of -1 unless any other state is suitable. > > state 0 is defined as polling, and polling ALWAYS should be ok Hmm.. you are right. This is convincing. There is no need for this patch. Thanks Regards Preeti U Murthy
[PATCH V3 1/3] time: Change the return type of clockevents_notify() to integer
The broadcast framework can potentially be made use of by archs which do not have an external clock device as well. Then, it is required that one of the CPUs need to handle the broadcasting of wakeup IPIs to the CPUs in deep idle. As a result its local timers should remain functional all the time. For such a CPU, the BROADCAST_ENTER notification has to fail indicating that its clock device cannot be shutdown. To make way for this support, change the return type of tick_broadcast_oneshot_control() and hence clockevents_notify() to indicate such scenarios. Signed-off-by: Preeti U Murthy --- include/linux/clockchips.h |6 +++--- kernel/time/clockevents.c|8 +--- kernel/time/tick-broadcast.c |6 -- kernel/time/tick-internal.h |6 +++--- 4 files changed, 15 insertions(+), 11 deletions(-) diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index 493aa02..e0c5a6c 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; } #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS -extern void clockevents_notify(unsigned long reason, void *arg); +extern int clockevents_notify(unsigned long reason, void *arg); #else -static inline void clockevents_notify(unsigned long reason, void *arg) {} +static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; } #endif #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */ @@ -196,7 +196,7 @@ static inline void clockevents_notify(unsigned long reason, void *arg) {} static inline void clockevents_suspend(void) {} static inline void clockevents_resume(void) {} -static inline void clockevents_notify(unsigned long reason, void *arg) {} +static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; } static inline int tick_check_broadcast_expired(void) { return 0; } #endif diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 086ad60..79b8685 100644 --- a/kernel/time/clockevents.c +++ 
b/kernel/time/clockevents.c @@ -524,12 +524,13 @@ void clockevents_resume(void) #ifdef CONFIG_GENERIC_CLOCKEVENTS /** * clockevents_notify - notification about relevant events + * Returns 0 on success, any other value on error */ -void clockevents_notify(unsigned long reason, void *arg) +int clockevents_notify(unsigned long reason, void *arg) { struct clock_event_device *dev, *tmp; unsigned long flags; - int cpu; + int cpu, ret = 0; raw_spin_lock_irqsave(&clockevents_lock, flags); @@ -542,7 +543,7 @@ void clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_BROADCAST_ENTER: case CLOCK_EVT_NOTIFY_BROADCAST_EXIT: - tick_broadcast_oneshot_control(reason); + ret = tick_broadcast_oneshot_control(reason); break; case CLOCK_EVT_NOTIFY_CPU_DYING: @@ -585,6 +586,7 @@ void clockevents_notify(unsigned long reason, void *arg) break; } raw_spin_unlock_irqrestore(&clockevents_lock, flags); + return ret; } EXPORT_SYMBOL_GPL(clockevents_notify); diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 43780ab..ddf2ac2 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -633,14 +633,15 @@ again: /* * Powerstate information: The system enters/leaves a state, where * affected devices might stop + * Returns 0 on success, -EBUSY if the cpu is used to broadcast wakeups. 
*/ -void tick_broadcast_oneshot_control(unsigned long reason) +int tick_broadcast_oneshot_control(unsigned long reason) { struct clock_event_device *bc, *dev; struct tick_device *td; unsigned long flags; ktime_t now; - int cpu; + int cpu, ret = 0; /* * Periodic mode does not care about the enter/exit of power @@ -746,6 +747,7 @@ void tick_broadcast_oneshot_control(unsigned long reason) } out: raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags); + return ret; } /* diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h index 8329669..f0dc03c 100644 --- a/kernel/time/tick-internal.h +++ b/kernel/time/tick-internal.h @@ -46,7 +46,7 @@ extern int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *)); extern void tick_resume_oneshot(void); # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc); -extern void tick_broadcast_oneshot_control(unsigned long reason); +extern int tick_broadcast_oneshot_control(unsigned long reason); extern void tick_broadcast_switch_to_oneshot(void); extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup); extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc); @@ -58,7 +58,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc) { BUG(); } -static inline void
[PATCH V3 0/3] time/cpuidle: Support in tick broadcast framework in absence of external clock device
On some architectures, the local timers of CPUs stop in deep idle states. They will need to depend on an external clock device to wake them up. However certain implementations of archs do not have an external clock device. This patchset provides support in the tick broadcast framework for such architectures so as to enable the CPUs to get into deep idle. Presently we are in need of this support on certain implementations of PowerPC. This patchset has thus been tested on the same. V1: https://lkml.org/lkml/2013/12/12/687. V2: https://lkml.org/lkml/2014/1/24/28 Changes in V3: 1. Modified comments and code around programming of the broadcast hrtimer. --- Preeti U Murthy (2): time: Change the return type of clockevents_notify() to integer time/cpuidle: Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set Thomas Gleixner (1): tick/cpuidle: Initialize hrtimer mode of broadcast drivers/cpuidle/cpuidle.c| 38 +++- include/linux/clockchips.h | 15 - kernel/time/Makefile |2 - kernel/time/clockevents.c|8 ++- kernel/time/tick-broadcast-hrtimer.c | 105 ++ kernel/time/tick-broadcast.c | 51 - kernel/time/tick-internal.h |6 +- 7 files changed, 197 insertions(+), 28 deletions(-) create mode 100644 kernel/time/tick-broadcast-hrtimer.c
[PATCH V3 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast
From: Thomas Gleixner On some architectures, in certain CPU deep idle states the local timers stop. An external clock device is used to wakeup these CPUs. The kernel support for the wakeup of these CPUs is provided by the tick broadcast framework by using the external clock device as the wakeup source. However not all implementations of architectures provide such an external clock device. This patch includes support in the broadcast framework to handle the wakeup of the CPUs in deep idle states on such systems by queuing a hrtimer on one of the CPUs, which is meant to handle the wakeup of CPUs in deep idle states. This patchset introduces a pseudo clock device which can be registered by the archs as tick_broadcast_device in the absence of a real external clock device. Once registered, the broadcast framework will work as is for these architectures as long as the archs take care of the BROADCAST_ENTER notification failing for one of the CPUs. This CPU is made the stand by CPU to handle wakeup of the CPUs in deep idle and it *must not enter deep idle states*. The CPU with the earliest wakeup is chosen to be this CPU. Hence this way the stand by CPU dynamically moves around and so does the hrtimer which is queued to trigger at the next earliest wakeup time. This is consistent with the case where an external clock device is present. The smp affinity of this clock device is set to the CPU with the earliest wakeup. This patchset handles the hotplug of the stand by CPU as well by moving the hrtimer on to the CPU handling the CPU_DEAD notification. 
Signed-off-by: Preeti U Murthy [Added Changelog and code to handle reprogramming of hrtimer] --- include/linux/clockchips.h |9 +++ kernel/time/Makefile |2 - kernel/time/tick-broadcast-hrtimer.c | 105 ++ kernel/time/tick-broadcast.c | 45 ++- 4 files changed, 159 insertions(+), 2 deletions(-) create mode 100644 kernel/time/tick-broadcast-hrtimer.c diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index e0c5a6c..dbe9e14 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -62,6 +62,11 @@ enum clock_event_mode { #define CLOCK_EVT_FEAT_DYNIRQ 0x20 #define CLOCK_EVT_FEAT_PERCPU 0x40 +/* + * Clockevent device is based on a hrtimer for broadcast + */ +#define CLOCK_EVT_FEAT_HRTIMER 0x80 + /** * struct clock_event_device - clock event device descriptor * @event_handler: Assigned by the framework to be called by the low @@ -83,6 +88,7 @@ enum clock_event_mode { * @name: ptr to clock event name * @rating:variable to rate clock event devices * @irq: IRQ number (only for non CPU local devices) + * @bound_on: Bound on CPU * @cpumask: cpumask to indicate for which CPUs this device works * @list: list head for the management code * @owner: module reference @@ -113,6 +119,7 @@ struct clock_event_device { const char *name; int rating; int irq; + int bound_on; const struct cpumask*cpumask; struct list_headlist; struct module *owner; @@ -180,9 +187,11 @@ extern int tick_receive_broadcast(void); #endif #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT) +extern void tick_setup_hrtimer_broadcast(void); extern int tick_check_broadcast_expired(void); #else static inline int tick_check_broadcast_expired(void) { return 0; } +static inline void tick_setup_hrtimer_broadcast(void) { } #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS diff --git a/kernel/time/Makefile b/kernel/time/Makefile index 9250130..06151ef 100644 --- a/kernel/time/Makefile +++ b/kernel/time/Makefile @@ -3,7 +3,7 @@ obj-y += timeconv.o posix-clock.o alarmtimer.o 
obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)+= clockevents.o obj-$(CONFIG_GENERIC_CLOCKEVENTS) += tick-common.o -obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)+= tick-broadcast.o +obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)+= tick-broadcast.o tick-broadcast-hrtimer.o obj-$(CONFIG_GENERIC_SCHED_CLOCK) += sched_clock.o obj-$(CONFIG_TICK_ONESHOT) += tick-oneshot.o obj-$(CONFIG_TICK_ONESHOT) += tick-sched.o diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c new file mode 100644 index 000..af1e119 --- /dev/null +++ b/kernel/time/tick-broadcast-hrtimer.c @@ -0,0 +1,105 @@ +/* + * linux/kernel/time/tick-broadcast-hrtimer.c + * This file emulates a local clock event device + * via a pseudo clock device. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "tick-internal.h" + +static stru
[PATCH V3 3/3] time/cpuidle: Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set
Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the local timers stop. The cpuidle_idle_call() currently handles such idle states by calling into the broadcast framework so as to wakeup CPUs at their next wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call into the broadcast framework can fail for archs that do not have an external clock device to handle wakeups and the CPU in question has to thus be made the stand by CPU. This patch handles such cases by failing the call into cpuidle so that the arch can take some default action. The arch will certainly not enter a similar idle state because a failed cpuidle call will also implicitly indicate that the broadcast framework has not registered this CPU to be woken up. Hence we are safe if we fail the cpuidle call. In the process move the functions that trace idle statistics just before and after the entry and exit into idle states respectively. In other scenarios where the call to cpuidle fails, we end up not tracing idle entry and exit since a decision on an idle state could not be taken. Similarly when the call to broadcast framework fails, we skip tracing idle statistics because we are in no further position to take a decision on an alternative idle state to enter into. 
Signed-off-by: Preeti U Murthy --- drivers/cpuidle/cpuidle.c | 38 +++--- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..8f42033 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -117,15 +117,19 @@ int cpuidle_idle_call(void) { struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices); struct cpuidle_driver *drv; - int next_state, entered_state; - bool broadcast; + int next_state, entered_state, ret = 0; + bool broadcast = false; - if (off || !initialized) - return -ENODEV; + if (off || !initialized) { + ret = -ENODEV; + goto out; + } /* check if the device is ready */ - if (!dev || !dev->enabled) - return -EBUSY; + if (!dev || !dev->enabled) { + ret = -EBUSY; + goto out; + } drv = cpuidle_get_cpu_driver(dev); @@ -137,15 +141,18 @@ int cpuidle_idle_call(void) if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, next_state); local_irq_enable(); - return 0; + goto out; } - trace_cpu_idle_rcuidle(next_state, dev->cpu); - broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP); - if (broadcast) - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); + if (broadcast) { + ret = clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); + if (ret) + goto out; + } + + trace_cpu_idle_rcuidle(next_state, dev->cpu); if (cpuidle_state_is_coupled(dev, drv, next_state)) entered_state = cpuidle_enter_state_coupled(dev, drv, @@ -153,16 +160,17 @@ int cpuidle_idle_call(void) else entered_state = cpuidle_enter_state(dev, drv, next_state); - if (broadcast) - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); - trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); /* give the governor an opportunity to reflect on the outcome */ if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, entered_state); - return 0; +out: if (broadcast) + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); + + + return ret; } /** 
Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
On 02/06/2014 07:46 PM, Nicolas Pitre wrote: > The core idle loop now takes care of it. > > Signed-off-by: Nicolas Pitre > --- > arch/powerpc/platforms/powernv/setup.c | 13 + > 1 file changed, 1 insertion(+), 12 deletions(-) > > diff --git a/arch/powerpc/platforms/powernv/setup.c > b/arch/powerpc/platforms/powernv/setup.c > index 21166f65c9..a932feb290 100644 > --- a/arch/powerpc/platforms/powernv/setup.c > +++ b/arch/powerpc/platforms/powernv/setup.c > @@ -26,7 +26,6 @@ > #include > #include > #include > -#include > > #include > #include > @@ -217,16 +216,6 @@ static int __init pnv_probe(void) > return 1; > } > > -void powernv_idle(void) > -{ > - /* Hook to cpuidle framework if available, else > - * call on default platform idle code > - */ > - if (cpuidle_idle_call()) { > - power7_idle(); > - } > -} > - > define_machine(powernv) { > .name = "PowerNV", > .probe = pnv_probe, > @@ -236,7 +225,7 @@ define_machine(powernv) { > .show_cpuinfo = pnv_show_cpuinfo, > .progress = pnv_progress, > .machine_shutdown = pnv_shutdown, > - .power_save = powernv_idle, > + .power_save = power7_idle, > .calibrate_decr = generic_calibrate_decr, > #ifdef CONFIG_KEXEC > .kexec_cpu_down = pnv_kexec_cpu_down, > Reviewed-by: Preeti U Murthy
Re: [PATCH 2/2] ARM64: powernv: remove redundant cpuidle_idle_call()
Hi Nicolas, powernv in the subject of the patch? Regards Preeti U Murthy On 02/06/2014 07:46 PM, Nicolas Pitre wrote: > The core idle loop now takes care of it. > > Signed-off-by: Nicolas Pitre > --- > arch/arm64/kernel/process.c | 7 ++- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c > index 1c0a9be2ff..9cce0098f4 100644 > --- a/arch/arm64/kernel/process.c > +++ b/arch/arm64/kernel/process.c > @@ -33,7 +33,6 @@ > #include > #include > #include > -#include > #include > #include > #include > @@ -94,10 +93,8 @@ void arch_cpu_idle(void) >* This should do all the clock switching and wait for interrupt >* tricks >*/ > - if (cpuidle_idle_call()) { > - cpu_do_idle(); > - local_irq_enable(); > - } > + cpu_do_idle(); > + local_irq_enable(); > } > > #ifdef CONFIG_HOTPLUG_CPU >
Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
Hi Daniel, On 02/06/2014 09:55 PM, Daniel Lezcano wrote: > Hi Nico, > > > On 6 February 2014 14:16, Nicolas Pitre wrote: > >> The core idle loop now takes care of it. >> >> Signed-off-by: Nicolas Pitre >> --- >> arch/powerpc/platforms/powernv/setup.c | 13 + >> 1 file changed, 1 insertion(+), 12 deletions(-) >> >> diff --git a/arch/powerpc/platforms/powernv/setup.c >> b/arch/powerpc/platforms/powernv/setup.c >> index 21166f65c9..a932feb290 100644 >> --- a/arch/powerpc/platforms/powernv/setup.c >> +++ b/arch/powerpc/platforms/powernv/setup.c >> @@ -26,7 +26,6 @@ >> #include >> #include >> #include >> -#include >> >> #include >> #include >> @@ -217,16 +216,6 @@ static int __init pnv_probe(void) >> return 1; >> } >> >> -void powernv_idle(void) >> -{ >> - /* Hook to cpuidle framework if available, else >> -* call on default platform idle code >> -*/ >> - if (cpuidle_idle_call()) { >> - power7_idle(); >> - } >> > > The cpuidle_idle_call is called from arch_cpu_idle in > arch/powerpc/kernel/idle.c between a ppc64_runlatch_off|on section. > Shouldn't the cpuidle-powernv driver call these functions when entering > idle ? Yes they should, I will send out a patch that does that on top of this. There have been cpuidle driver cleanups for powernv and pseries in this merge window. While no change would be required in the pseries cpuidle driver as a result of Nicolas's cleanup, we would need to add the ppc64_runlatch_on and off functions before and after the entry into the powernv idle states. 
Thanks Regards Preeti U Murthy > > -- Daniel > > >> -} >> - >> define_machine(powernv) { >> .name = "PowerNV", >> .probe = pnv_probe, >> @@ -236,7 +225,7 @@ define_machine(powernv) { >> .show_cpuinfo = pnv_show_cpuinfo, >> .progress = pnv_progress, >> .machine_shutdown = pnv_shutdown, >> - .power_save = powernv_idle, >> + .power_save = power7_idle, >> .calibrate_decr = generic_calibrate_decr, >> #ifdef CONFIG_KEXEC >> .kexec_cpu_down = pnv_kexec_cpu_down, >> -- >> 1.8.4.108.g55ea5f6 >> >> >
Re: [PATCH V3 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast
Hi Thomas, On 02/06/2014 09:33 PM, Thomas Gleixner wrote: > On Thu, 6 Feb 2014, Preeti U Murthy wrote: > > Compiler warnings are not so important, right? > > kernel/time/tick-broadcast.c: In function ‘tick_broadcast_oneshot_control’: > kernel/time/tick-broadcast.c:700:3: warning: ‘return’ with no value, in > function returning non-void [-Wreturn-type] > kernel/time/tick-broadcast.c:711:3: warning: ‘return’ with no value, in > function returning non-void [-Wreturn-type] My apologies for this; I will make sure it does not repeat. On compilation I did not receive any warnings even with the additional compile-time flags. I compiled it on powerpc. Let me look into why the warnings did not show up. Nevertheless I should have taken care of this even by simply looking at the code. > >> +/* >> + * If the current CPU owns the hrtimer broadcast >> + * mechanism, it cannot go deep idle. >> + */ >> +ret = broadcast_needs_cpu(bc, cpu); > > So we leave the CPU in the broadcast mask, just to force another call > to the notify code right away to remove it again. Wouldn't it be more > clever to clear the flag right away? That would make the changes to > the cpuidle code simpler. Delta patch below. You are right. > > Thanks, > > tglx > --- > > --- tip.orig/kernel/time/tick-broadcast.c > +++ tip/kernel/time/tick-broadcast.c > @@ -697,7 +697,7 @@ int tick_broadcast_oneshot_control(unsig >* states >*/ > if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC) > - return; > + return 0; > > /* >* We are called with preemtion disabled from the depth of the > @@ -708,7 +708,7 @@ int tick_broadcast_oneshot_control(unsig > dev = td->evtdev; > > if (!(dev->features & CLOCK_EVT_FEAT_C3STOP)) > - return; > + return 0; > > bc = tick_broadcast_device.evtdev; > > @@ -731,9 +731,14 @@ int tick_broadcast_oneshot_control(unsig > } > /* >* If the current CPU owns the hrtimer broadcast > - * mechanism, it cannot go deep idle. 
> + * mechanism, it cannot go deep idle and we remove the > + * CPU from the broadcast mask. We don't have to go > + * through the EXIT path as the local timer is not > + * shutdown. >*/ > ret = broadcast_needs_cpu(bc, cpu); > + if (ret) > + cpumask_clear_cpu(cpu, tick_broadcast_oneshot_mask); > } else { > if (cpumask_test_and_clear_cpu(cpu, > tick_broadcast_oneshot_mask)) { > clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT); > > The cpuidle patch then is below. The trace_cpu_idle_rcuidle() functions have been moved around so that the broadcast CPU does not trace any idle event and that the symmetry between the trace functions and the call to the broadcast framework is maintained. Wow, it does become very simple :) time/cpuidle: Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set From: Preeti U Murthy Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the local timers stop. The cpuidle_idle_call() currently handles such idle states by calling into the broadcast framework so as to wakeup CPUs at their next wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call into the broadcast framework can fail for archs that do not have an external clock device to handle wakeups and the CPU in question has to thus be made the stand by CPU. This patch handles such cases by failing the call into cpuidle so that the arch can take some default action. The arch will certainly not enter a similar idle state because a failed cpuidle call will also implicitly indicate that the broadcast framework has not registered this CPU to be woken up. Hence we are safe if we fail the cpuidle call. In the process move the functions that trace idle statistics just before and after the entry and exit into idle states respectively. In other scenarios where the call to cpuidle fails, we end up not tracing idle entry and exit since a decision on an idle state could not be taken. 
Similarly when the call to broadcast framework fails, we skip tracing idle statistics because we are in no further position to take a decision on an alternative idle state to enter into. Signed-off-by: Preeti U Murthy --- drivers/cpuidle/cpuidle.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..8beb0f02 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@
Re: [PATCH V5 0/8] cpuidle/ppc: Enable deep idle states on PowerNV
Hi Paul, On 01/15/2014 08:59 PM, Paul Gortmaker wrote: > On 14-01-15 03:07 AM, Preeti U Murthy wrote: > > [...] > >> >> This patchset is based on mainline commit-id:8ae516aa8b8161254d3, and the > > I figured I'd give this a quick sanity build test for a few > configs, but v3.13-rc1-141-g8ae516aa8b81 seems too old; Ben's > ppc next branch is at v3.13-rc1-160-gfac515db4520 and it fails: > > --- > $ git am ppc-idle > Applying: powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message > Applying: powerpc: Implement tick broadcast IPI as a fixed IPI message > Applying: cpuidle/ppc: Split timer_interrupt() into timer handling and > interrupt handling routines > error: patch failed: arch/powerpc/kernel/time.c:510 > error: arch/powerpc/kernel/time.c: patch does not apply > Patch failed at 0003 cpuidle/ppc: Split timer_interrupt() into timer handling > and interrupt handling routines > The copy of the patch that failed is found in: >/home/paul/git/linux-head/.git/rebase-apply/patch > When you have resolved this problem, run "git am --continue". > If you prefer to skip this patch, run "git am --skip" instead. > To restore the original branch and stop patching, run "git am --abort". > $ dry-run > patching file arch/powerpc/kernel/time.c > Hunk #3 FAILED at 544. > Hunk #4 FAILED at 554. > Hunk #5 succeeded at 862 (offset 12 lines). > 2 out of 5 hunks FAILED -- saving rejects to file > arch/powerpc/kernel/time.c.rej > > > It appears to conflict with: > > commit 0215f7d8c53fb192cd4491ede0ece5cca6b5db57 > Author: Benjamin Herrenschmidt > Date: Tue Jan 14 17:11:39 2014 +1100 > > powerpc: Fix races with irq_work > > Thanks for the build test. I will base it on the mainline at the latest commit as well as on Ben's tree and send out this patchset. Regards Preeti U Murthy > Paul. > -- > >> cpuidle driver for powernv posted by Deepthi Dharwar: >> https://lkml.org/lkml/2014/1/14/172 >> >> >> Changes in V5: >> - >> The primary change in this version is in Patch[6/8]. 
>> As per the discussions in V4 posting of this patchset, it was decided to >> refine handling the wakeup of CPUs in fast-sleep by doing the following: >> >> 1. In V4, a polling mechanism was used by the CPU handling broadcast to >> find out the time of next wakeup of the CPUs in deep idle states. V5 avoids >> polling by a way described under PATCH[6/8] in this patchset. >> >> 2. The mechanism of broadcast handling of CPUs in deep idle in the absence >> of an >> external wakeup device should be generic and not arch specific code. Hence >> in this >> version this functionality has been integrated into the tick broadcast >> framework in >> the kernel unlike before where it was handled in powerpc specific code. >> >> 3. It was suggested that the "broadcast cpu" can be the time keeping cpu >> itself. However this has challenges of its own: >> >> a. The time keeping cpu need not exist when all cpus are idle. Hence there >> are phases in time when time keeping cpu is absent. But for the use case that >> this patchset is trying to address we rely on the presence of a broadcast cpu >> all the time. >> >> b. The nomination and un-assignment of the time keeping cpu is not protected >> by a lock today and need not be as well since such is its use case in the >> kernel. However we would need locks if we double up the time keeping cpu as >> the >> broadcast cpu. >> >> Hence the broadcast cpu is independent of the time-keeping cpu. However >> PATCH[6/8] >> proposes a simpler solution to pick a broadcast cpu in this version. >> >> >> >> Changes in V4: >> - >> https://lkml.org/lkml/2013/11/29/97 >> >> 1. Add Fast Sleep CPU idle state on PowerNV. >> >> 2. Add the required context management for Fast Sleep and the call to OPAL >> to synchronize time base after wakeup from fast sleep. >> >> 4. Add parsing of CPU idle states from the device tree to populate the >> cpuidle >> state table. >> >> 5. Rename ambiguous functions in the code around waking up of CPUs from fast >> sleep. 
>> >> 6. Fixed a bug in re-programming of the hrtimer that is queued to wakeup the >> CPUs in fast sleep and modified Changelogs. >> >> 7. Added the ARCH_HAS_TICK_BROADCAST option. This signifies that we have a >> arch specific function to perform broadcast. >> >> >> Changes in V3: >> - >> http://thread.gmane.org/gmane.linux.po
[PATCH V2] cpuidle/governors: Fix logic in selection of idle states
The cpuidle governors today are not handling scenarios where no idle state can be chosen. Such scenarios could arise if the user has disabled all the idle states at runtime or the latency requirement from the cpus is very strict. The menu governor returns 0th index of the idle state table when no other idle state is suitable. This is even when the idle state corresponding to this index is disabled or the latency requirement is strict and the exit_latency of the lowest idle state is also not acceptable. Hence this patch fixes this logic in the menu governor by defaulting to an idle state index of -1 unless any other state is suitable. The ladder governor needs a few more fixes in addition to that required in the menu governor. When the ladder governor decides to demote the idle state of a CPU, it does not check if the lower idle states are enabled. Add this logic in addition to the logic where it chooses an index of -1 if it can neither promote nor demote the idle state of a cpu nor can it choose the current idle state. The cpuidle_idle_call() will return back if the governor decides upon not entering any idle state. However it cannot return an error code because all archs have the logic today that if the call to cpuidle_idle_call() fails, it means that the cpuidle driver failed to *function*; for instance due to errors during registration. As a result they end up deciding upon a default idle state on their own, which could very well be a deep idle state. This is incorrect in cases where no idle state is suitable. Besides, for the scenario that this patch is addressing, the call actually succeeds. It's just that no idle state is thought to be suitable by the governors. Under such a circumstance return success code without entering any idle state. Signed-off-by: Preeti U Murthy Changes from V1: https://lkml.org/lkml/2014/1/14/26 1. Change the return code to success from -EINVAL due to the reason mentioned in the changelog. 2. 
Add logic that the patch is addressing in the ladder governor as well. 3. Added relevant comments and removed redundant logic as suggested in the above thread. --- drivers/cpuidle/cpuidle.c | 15 +- drivers/cpuidle/governors/ladder.c | 98 ++-- drivers/cpuidle/governors/menu.c |7 +-- 3 files changed, 89 insertions(+), 31 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..831b664 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -131,8 +131,9 @@ int cpuidle_idle_call(void) /* ask the governor for the next state */ next_state = cpuidle_curr_governor->select(drv, dev); + + dev->last_residency = 0; if (need_resched()) { - dev->last_residency = 0; /* give the governor an opportunity to reflect on the outcome */ if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, next_state); @@ -140,6 +141,18 @@ int cpuidle_idle_call(void) return 0; } + /* Unlike in the need_resched() case, we return here because the +* governor did not find a suitable idle state. However idle is still +* in progress as we are not asked to reschedule. Hence we return +* without enabling interrupts. +* +* NOTE: The return code should still be success, since the verdict of this +* call is "do not enter any idle state" and not a failed call due to +* errors. 
+*/ + if (next_state < 0) + return 0; + trace_cpu_idle_rcuidle(next_state, dev->cpu); broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP); diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c index 9f08e8c..f495f57 100644 --- a/drivers/cpuidle/governors/ladder.c +++ b/drivers/cpuidle/governors/ladder.c @@ -58,6 +58,36 @@ static inline void ladder_do_selection(struct ladder_device *ldev, ldev->last_state_idx = new_idx; } +static int can_promote(struct ladder_device *ldev, int last_idx, + int last_residency) +{ + struct ladder_device_state *last_state; + + last_state = &ldev->states[last_idx]; + if (last_residency > last_state->threshold.promotion_time) { + last_state->stats.promotion_count++; + last_state->stats.demotion_count = 0; + if (last_state->stats.promotion_count >= last_state->threshold.promotion_count) + return 1; + } + return 0; +} + +static int can_demote(struct ladder_device *ldev, int last_idx, + int last_residency) +{ + struct ladder_device_state *last_state; + + last_state = &ldev->states[last_idx]; + if (last_residency < last_state->threshold.demotion_time) { + last_state->
Re: [PATCH V2 0/2] time/cpuidle: Support in tick broadcast framework in absence of external clock device
Hi Thomas, I realized that the below patch is also required for this patchset. This patch apart, I noticed one corner case which we will need to handle: the BROADCAST_ON notification in periodic mode (it is a nop in oneshot mode). We will need to fail the BROADCAST_ON notification too in this case if the CPU in question has been made the standby CPU. Thanks Regards Preeti U Murthy - time/cpuidle: Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set From: Preeti U Murthy Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the local timers stop. The cpuidle_idle_call() currently handles such idle states by calling into the broadcast framework so as to wake up CPUs at their next wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call into the broadcast framework can fail for archs that do not have an external clock device to handle wakeups, and the CPU in question thus has to be made the standby CPU. This patch handles such cases by failing the call into cpuidle so that the arch can take some default action. The arch will certainly not enter a similar idle state, because a failed cpuidle call also implicitly indicates that the broadcast framework has not registered this CPU to be woken up. Hence we are safe if we fail the cpuidle call. In the process, move the calls that trace idle statistics to just before and after the entry into and exit from idle states, respectively. In other scenarios where the call to cpuidle fails, we end up not tracing idle entry and exit, since a decision on an idle state could not be taken. Similarly, when the call to the broadcast framework fails, we skip tracing idle statistics, because we are in no position to take a decision on an alternative idle state to enter.
Signed-off-by: Preeti U Murthy --- drivers/cpuidle/cpuidle.c | 38 +++--- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..8f42033 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -117,15 +117,19 @@ int cpuidle_idle_call(void) { struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices); struct cpuidle_driver *drv; - int next_state, entered_state; - bool broadcast; + int next_state, entered_state, ret = 0; + bool broadcast = false; - if (off || !initialized) - return -ENODEV; + if (off || !initialized) { + ret = -ENODEV; + goto out; + } /* check if the device is ready */ - if (!dev || !dev->enabled) - return -EBUSY; + if (!dev || !dev->enabled) { + ret = -EBUSY; + goto out; + } drv = cpuidle_get_cpu_driver(dev); @@ -137,15 +141,18 @@ int cpuidle_idle_call(void) if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, next_state); local_irq_enable(); - return 0; + goto out; } - trace_cpu_idle_rcuidle(next_state, dev->cpu); - broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP); - if (broadcast) - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); + if (broadcast) { + ret = clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); + if (ret) + goto out; + } + + trace_cpu_idle_rcuidle(next_state, dev->cpu); if (cpuidle_state_is_coupled(dev, drv, next_state)) entered_state = cpuidle_enter_state_coupled(dev, drv, @@ -153,16 +160,17 @@ int cpuidle_idle_call(void) else entered_state = cpuidle_enter_state(dev, drv, next_state); - if (broadcast) - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); - trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); /* give the governor an opportunity to reflect on the outcome */ if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, entered_state); - return 0; +out: if (broadcast) + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); + + + return ret; } /** 
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2] cpuidle/governors: Fix logic in selection of idle states
Hi Daniel, On 01/28/2014 02:16 PM, Daniel Lezcano wrote: > On 01/24/2014 11:21 AM, Preeti U Murthy wrote: >> On 01/24/2014 02:38 PM, Daniel Lezcano wrote: >>> On 01/23/2014 12:15 PM, Preeti U Murthy wrote: >>>> Hi Daniel, >>>> >>>> Thank you for the review. > > [ ... ] > >>>> --- >>>>drivers/cpuidle/cpuidle.c | 15 + >>>>drivers/cpuidle/governors/ladder.c | 101 >>>> ++-- >>>>drivers/cpuidle/governors/menu.c |7 +- >>>>3 files changed, 90 insertions(+), 33 deletions(-) >>>> >>>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c >>>> index a55e68f..19d17e8 100644 >>>> --- a/drivers/cpuidle/cpuidle.c >>>> +++ b/drivers/cpuidle/cpuidle.c >>>> @@ -131,8 +131,9 @@ int cpuidle_idle_call(void) >>>> >>>>/* ask the governor for the next state */ >>>>next_state = cpuidle_curr_governor->select(drv, dev); >>>> + >>>> +dev->last_residency = 0; >>>>if (need_resched()) { >>> >>> What about if (need_resched() || next_state < 0) ? >> >> Hmm.. I feel we need to distinguish between the need_resched() scenario >> and the scenario when no idle state was suitable through the trace >> points at-least. > > Well, I don't think so as soon as we don't care about the return value > of cpuidle_idle_call in both cases. > > The traces are following a specific format. That is if the state is -1 > (PWR_EVENT_EXIT), it means exiting the current idle state. > > The idlestat tool [1] is using this traces to open - close transitions. > > IMO, if the cpu is not entering idle, it should just exit without any > idle traces. Yes I see your point here. > > This portion of code is a bit confusing because it is introduced by the > menu governor updates post-poned when entering the next idle state (not > exiting the current idle state with good reasons). I am sorry but I don't understand this part. Which is the portion of the code you refer to here? Also can you please elaborate on the above statement? 
Thanks Regards Preeti U Murthy > > -- Daniel > > [1] http://git.linaro.org/power/idlestat.git > >> This could help while debugging when we could find situations where >> there are no tasks to run, yet the cpu is not entering any idle state. >> The traces could help clearly point that no idle state was thought >> suitable by the governor. Of course there are many other means to find >> this out, but this seems rather straightforward. Hence having the >> condition next_state < 0 between trace_cpu_idle*() would be apt IMHO. >> >> Regards >> Preeti U Murthy >> >>> >>>> -dev->last_residency = 0; >>>>/* give the governor an opportunity to reflect on the >>>> outcome */ >>>>if (cpuidle_curr_governor->reflect) >>>>cpuidle_curr_governor->reflect(dev, next_state); >>>> @@ -141,6 +142,16 @@ int cpuidle_idle_call(void) >>>>} >>>> >>>>trace_cpu_idle_rcuidle(next_state, dev->cpu); >>>> +/* >>>> + * NOTE: The return code should still be success, since the >>>> verdict of >>>> + * this call is "do not enter any idle state". It is not a failed >>>> call >>>> + * due to errors. >>>> + */ >>>> +if (next_state < 0) { >>>> +entered_state = next_state; >>>> +local_irq_enable(); >>>> +goto out; >>>> +} >>>> >>>>broadcast = !!(drv->states[next_state].flags & >>>> CPUIDLE_FLAG_TIMER_STOP); >>>> >>>> @@ -156,7 +167,7 @@ int cpuidle_idle_call(void) >>>>if (broadcast) >>>>clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, >>>> &dev->cpu); >>>> >>>> -trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); >>>> +out:trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); >>>> >>>>/* give the governor an opportunity to reflect on the outcome */ >>>>if (cpuidle_curr_governor->reflect) >>
Re: [PATCH v2 1/6] idle: move the cpuidle entry point to the generic idle loop
Hi Nicolas, On 01/30/2014 02:01 AM, Nicolas Pitre wrote: > On Wed, 29 Jan 2014, Nicolas Pitre wrote: > >> In order to integrate cpuidle with the scheduler, we must have a better >> proximity in the core code with what cpuidle is doing and not delegate >> such interaction to arch code. >> >> Architectures implementing arch_cpu_idle() should simply enter >> a cheap idle mode in the absence of a proper cpuidle driver. >> >> Signed-off-by: Nicolas Pitre >> Acked-by: Daniel Lezcano > > As mentioned in my reply to Olof's comment on patch #5/6, here's a new > version of this patch adding the safety local_irq_enable() to the core > code. > > - >8 > > From: Nicolas Pitre > Subject: idle: move the cpuidle entry point to the generic idle loop > > In order to integrate cpuidle with the scheduler, we must have a better > proximity in the core code with what cpuidle is doing and not delegate > such interaction to arch code. > > Architectures implementing arch_cpu_idle() should simply enter > a cheap idle mode in the absence of a proper cpuidle driver. > > In both cases i.e. whether it is a cpuidle driver or the default > arch_cpu_idle(), the calling convention expects IRQs to be disabled > on entry and enabled on exit. There is a warning in place already but > let's add a forced IRQ enable here as well. This will allow for > removing the forced IRQ enable some implementations do locally and Why would this patch allow for removing the forced IRQ enables that are being done on some archs in arch_cpu_idle()? Isn't this patch expecting the default arch_cpu_idle() to have re-enabled the interrupts after exiting from the default idle state? It's supposed to only catch faulty cpuidle drivers that haven't enabled IRQs on exit from idle state but are expected to have done so, isn't it? Thanks Regards Preeti U Murthy > allowing for the warning to trig. 
> > Signed-off-by: Nicolas Pitre > > diff --git a/kernel/cpu/idle.c b/kernel/cpu/idle.c > index 988573a9a3..14ca43430a 100644 > --- a/kernel/cpu/idle.c > +++ b/kernel/cpu/idle.c > @@ -3,6 +3,7 @@ > */ > #include > #include > +#include > #include > #include > #include > @@ -95,8 +96,10 @@ static void cpu_idle_loop(void) > if (!current_clr_polling_and_test()) { > stop_critical_timings(); > rcu_idle_enter(); > - arch_cpu_idle(); > - WARN_ON_ONCE(irqs_disabled()); > + if (cpuidle_idle_call()) > + arch_cpu_idle(); > + if (WARN_ON_ONCE(irqs_disabled())) > + local_irq_enable(); > rcu_idle_exit(); > start_critical_timings(); > } else {
Re: [PATCH v2 1/6] idle: move the cpuidle entry point to the generic idle loop
Hi Nicolas, On 01/30/2014 10:58 AM, Nicolas Pitre wrote: > On Thu, 30 Jan 2014, Preeti U Murthy wrote: > >> Hi Nicolas, >> >> On 01/30/2014 02:01 AM, Nicolas Pitre wrote: >>> On Wed, 29 Jan 2014, Nicolas Pitre wrote: >>> >>>> In order to integrate cpuidle with the scheduler, we must have a better >>>> proximity in the core code with what cpuidle is doing and not delegate >>>> such interaction to arch code. >>>> >>>> Architectures implementing arch_cpu_idle() should simply enter >>>> a cheap idle mode in the absence of a proper cpuidle driver. >>>> >>>> Signed-off-by: Nicolas Pitre >>>> Acked-by: Daniel Lezcano >>> >>> As mentioned in my reply to Olof's comment on patch #5/6, here's a new >>> version of this patch adding the safety local_irq_enable() to the core >>> code. >>> >>> - >8 >>> >>> From: Nicolas Pitre >>> Subject: idle: move the cpuidle entry point to the generic idle loop >>> >>> In order to integrate cpuidle with the scheduler, we must have a better >>> proximity in the core code with what cpuidle is doing and not delegate >>> such interaction to arch code. >>> >>> Architectures implementing arch_cpu_idle() should simply enter >>> a cheap idle mode in the absence of a proper cpuidle driver. >>> >>> In both cases i.e. whether it is a cpuidle driver or the default >>> arch_cpu_idle(), the calling convention expects IRQs to be disabled >>> on entry and enabled on exit. There is a warning in place already but >>> let's add a forced IRQ enable here as well. This will allow for >>> removing the forced IRQ enable some implementations do locally and >> >> Why would this patch allow for removing the forced IRQ enable that are >> being done on some archs in arch_cpu_idle()? Isn't this patch expecting >> the default arch_cpu_idle() to have re-enabled the interrupts after >> exiting from the default idle state? Its supposed to only catch faulty >> cpuidle drivers that haven't enabled IRQs on exit from idle state but >> are expected to have done so, isn't it? 
> > Exact. However x86 currently does this: > > if (cpuidle_idle_call()) > x86_idle(); > else > local_irq_enable(); > > So whenever cpuidle_idle_call() is successful then IRQs are > unconditionally enabled whether or not the underlying cpuidle driver has > properly done it or not. And the reason is that some of the x86 cpuidle > do fail to enable IRQs before returning. > > So the idea is to get rid of this unconditional IRQ enabling and let the > core issue a warning instead (as well as enabling IRQs to allow the > system to run). Oh ok, thank you for clarifying this:) Regards Preeti U Murthy > > > Nicolas >
[PATCH 1/3] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
From: Srivatsa S. Bhat The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map to a common implementation - generic_smp_call_function_single_interrupt(). So, we can consolidate them and save one of the IPI message slots, (which are precious on powerpc, since only 4 of those slots are available). So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be used for something else in the future, if desired. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/kernel/smp.c | 12 +--- arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 4 files changed, 8 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 084e080..9f7356b 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_CALL_FUNC_SINGLE 2 +#define PPC_MSG_UNUSED 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ac2621a..ee7d76b 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t call_function_single_action(int irq, void *data) +static irqreturn_t unused_action(int irq, void *data) { - generic_smp_call_function_single_interrupt(); + /* This slot is unused and hence available for use, if needed */ return IRQ_HANDLED; } @@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, + [PPC_MSG_UNUSED] = unused_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", + [PPC_MSG_UNUSED] = "ipi unused", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); - if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE)) - generic_smp_call_function_single_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule); void arch_send_call_function_single_ipi(int cpu) { - do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); + do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } void arch_send_call_function_ipi_mask(const struct cpumask *mask) 
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 2d42f3b..adf3726 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -215,7 +215,7 @@ void iic_request_IPIs(void) { iic_request_ipi(PPC_MSG_CALL_FUNCTION); iic_request_ipi(PPC_MSG_RESCHEDULE); - iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE); + iic_request_ipi(PPC_MSG_UNUSED); iic_request_ipi(PPC_MSG_DEBUGGER_BREAK); } diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c index 4b35166..00d1a7c 100644 --- a/arch/powerpc/platforms/ps3/smp.c +++ b/arch/powerpc/platforms/ps3/smp.c @@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void) BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION!= 0); BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1); - BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2); + BUILD_BUG_ON(PPC_MSG_UNUSED != 2); BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3); for (i = 0; i < MSG_COUNT; i++) {
[PATCH 2/3] powerpc: Implement tick broadcast IPI as a fixed IPI message
From: Srivatsa S. Bhat For scalability and performance reasons, we want the tick broadcast IPIs to be handled as efficiently as possible. Fixed IPI messages are one of the most efficient mechanisms available - they are faster than the smp_call_function mechanism because the IPI handlers are fixed and hence they don't involve costly operations such as adding IPI handlers to the target CPU's function queue, acquiring locks for synchronization etc. Luckily we have an unused IPI message slot, so use that to implement tick broadcast IPIs efficiently. Signed-off-by: Srivatsa S. Bhat [Functions renamed to tick_broadcast* and Changelog modified by Preeti U. Murthy] Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/smp.c | 19 +++ arch/powerpc/kernel/time.c |5 + arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 6 files changed, 24 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 9f7356b..ff51046 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_UNUSED 2 +#define PPC_MSG_TICK_BROADCAST 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index c1f2676..1d428e6 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); extern void GregorianDay(struct rtc_time *tm); +extern void tick_broadcast_ipi_handler(void); extern void generic_calibrate_decr(void); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ee7d76b..6f06f05 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t unused_action(int irq, void *data) +static irqreturn_t tick_broadcast_ipi_action(int irq, void *data) { - /* This slot is unused and hence available for use, if needed */ + tick_broadcast_ipi_handler(); return IRQ_HANDLED; } @@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_UNUSED] = unused_action, + [PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_UNUSED] = "ipi unused", + [PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & 
IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); + if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST)) + tick_broadcast_ipi_handler(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -289,6 +292,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask) do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } +void tick_broadcast(const struct cpumask *mask) +{ + unsigned int cpu; + + for_each_cpu(cpu, mask) + do_message_pass(cpu, PPC_MSG_TICK_BROADCAST); +} + #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) void smp_send_debugger_break(void) { diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index b3dab20..3ff97db 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -825,6 +825,11 @@ static void decrementer_set_mode(enum clock_event_mode mode, decrementer_set_next_event(DECREMENTER_MAX, dev); } +/* Interrupt handler for the timer broadcast IPI */ +void tick_broadcast_ipi_handler(void) +{ +} + static void register_decrementer_clockevent(int cpu) { struct clock_event_device *dec = &per_cpu(decrementers, cpu); diff --git a/arch/powerpc/plat