Re: [PATCH V3] tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
On 01/21/2015 05:16 PM, Thomas Gleixner wrote:
> On Tue, 20 Jan 2015, Preeti U Murthy wrote:
>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>> index 5544990..f3907c9 100644
>> --- a/kernel/time/clockevents.c
>> +++ b/kernel/time/clockevents.c
>> @@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)
>>
>>  	case CLOCK_EVT_NOTIFY_CPU_DYING:
>>  		tick_handover_do_timer(arg);
>> +		tick_shutdown_broadcast_oneshot(arg);
>>  		break;
>>
>>  	case CLOCK_EVT_NOTIFY_SUSPEND:
>> @@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
>>  		break;
>>
>>  	case CLOCK_EVT_NOTIFY_CPU_DEAD:
>> -		tick_shutdown_broadcast_oneshot(arg);
>>  		tick_shutdown_broadcast(arg);
>>  		tick_shutdown(arg);
>>  		/*
>> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
>> index 066f0ec..f983983 100644
>> --- a/kernel/time/tick-broadcast.c
>> +++ b/kernel/time/tick-broadcast.c
>> @@ -675,8 +675,11 @@ static void broadcast_move_bc(int deadcpu)
>>
>>  	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
>>  		return;
>> -	/* This moves the broadcast assignment to this cpu */
>> -	clockevents_program_event(bc, bc->next_event, 1);
>> +	/* Since a cpu with the earliest wakeup is nominated as the
>> +	 * standby cpu, the next cpu to invoke BROADCAST_ENTER
>> +	 * will now automatically take up the duty of broadcasting.
>> +	 */
>> +	bc->next_event.tv64 = KTIME_MAX;
>
> So that relies on the fact, that cpu_down() currently forces ALL cpus
> into stop_machine(). Of course this is not in any way obvious and any
> change to this will cause even more hard to debug issues.

Hmm.. true, this is a concern.

> And to be honest, the clever 'set next_event to KTIME_MAX' is even
> more nonobvious because it's only relevant for your hrtimer based
> broadcasting magic. Any real broadcast device does not care about this
> at all.

bc->next_event is set to KTIME_MAX only if CLOCK_EVT_FEATURE_HRTIMER is set.
It does not affect the usual broadcast logic.
> This whole random notifier driven hotplug business is just a
> trainwreck. I'm still trying to convert this to a well documented
> state machine, so I rather prefer to make this an explicit take over
> rather than a completely undocumented 'works today' mechanism.
>
> What about the patch below?
>
> Thanks,
>
> 	tglx
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 5d220234b3ca..7a9b1ae4a945 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -16,6 +16,7 @@
>  #include <linux/bug.h>
>  #include <linux/kthread.h>
>  #include <linux/stop_machine.h>
> +#include <linux/clockchips.h>
>  #include <linux/mutex.h>
>  #include <linux/gfp.h>
>  #include <linux/suspend.h>
> @@ -421,6 +422,12 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
>  	while (!idle_cpu(cpu))
>  		cpu_relax();
>
> +	/*
> +	 * Before waiting for the cpu to enter DEAD state, take over
> +	 * any tick related duties
> +	 */
> +	clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DEAD, &cpu);
> +
>  	/* This actually kills the CPU. */
>  	__cpu_die(cpu);
>
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 37e50aadd471..3c1bfd0f7074 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1721,11 +1721,8 @@ static int hrtimer_cpu_notify(struct notifier_block *self,
>  		break;
>  	case CPU_DEAD:
>  	case CPU_DEAD_FROZEN:
> -	{
> -		clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DEAD, &scpu);
>  		migrate_hrtimers(scpu);
>  		break;
> -	}
>  #endif
>
>  	default:

How about when the cpu that is going offline receives a timer interrupt just
before setting its state to CPU_DEAD? That is still possible, right, given
that its clock devices may not have been shut down and it is capable of
receiving interrupts for a short duration. Even with the above patch, is the
following scenario possible?
	CPU0						CPU1

t0	Receives timer interrupt
t1	Sees that there are hrtimers to be
	serviced (hrtimers are not yet migrated)
t2	Calls hrtimer_interrupt()
t3	tick_program_event()				CPU_DEAD notifiers run;
							CPU0's td->evtdev = NULL
t4	clockevents_program_event()
	dereferences the NULL tick device pointer

So my concern is that since the CLOCK_EVT_NOTIFY_CPU_DEAD callback handles
shutting down of devices besides moving tick related duties, its functions
may race with the hotplug cpu still handling tick events.

We do check on powerpc if the timer interrupt has arrived on an offline cpu,
but that is to avoid an entirely different scenario and not one like the
above. I would not expect the arch to check if a timer interrupt arrived on
an offline cpu. A timer interrupt may be serviced as long as the tick device
is alive.

Regards
Preeti U Murthy
[PATCH V2] cpuidle: Add missing checks to the exit condition of cpu_idle_poll()
cpu_idle_poll() is entered when either cpu_idle_force_poll is set or
tick_check_broadcast_expired() returns true. The exit condition from
cpu_idle_poll() is tif_need_resched().

However this does not take into account scenarios where cpu_idle_force_poll
changes or tick_check_broadcast_expired() returns false, without setting the
resched flag. So a cpu will be caught in cpu_idle_poll() needlessly, thereby
wasting power.

Add an explicit check on cpu_idle_force_poll and
tick_check_broadcast_expired() to the exit condition of cpu_idle_poll() to
avoid this.

Signed-off-by: Preeti U Murthy
---
Changes from V1: Modified the Changelog

 kernel/sched/idle.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c47fce7..aaf1c1d 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -47,7 +47,8 @@ static inline int cpu_idle_poll(void)
 	rcu_idle_enter();
 	trace_cpu_idle_rcuidle(0, smp_processor_id());
 	local_irq_enable();
-	while (!tif_need_resched())
+	while (!tif_need_resched() &&
+	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
 		cpu_relax();
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
 	rcu_idle_exit();
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] idle/tick-broadcast: Exit cpu idle poll loop when cleared from tick_broadcast_force_mask
On 01/21/2015 03:26 PM, Thomas Gleixner wrote:
> On Tue, 20 Jan 2015, Preeti U Murthy wrote:
>> On 01/20/2015 04:51 PM, Thomas Gleixner wrote:
>>> On Mon, 19 Jan 2015, Preeti U Murthy wrote:
>>>> An idle cpu enters cpu_idle_poll() if it is set in the
>>>> tick_broadcast_force_mask. This is so that it does not incur the
>>>> overhead of entering idle states when it is expected to be woken up
>>>> anytime then through a broadcast IPI. The condition that forces an
>>>> exit out of the idle polling is the check on setting of the
>>>> TIF_NEED_RESCHED flag for the idle thread.
>>>>
>>>> When the broadcast IPI does arrive, it is not guarenteed that the
>>>> handler sets the TIF_NEED_RESCHED flag. Hence although the cpu is
>>>> cleared in the tick_broadcast_force_mask, it continues to loop in
>>>> cpu_idle_poll unnecessarily wasting power. Hence exit the idle poll
>>>> loop if the tick_broadcast_force_mask gets cleared and enter idle
>>>> states.
>>>>
>>>> Of course if the cpu has entered cpu_idle_poll() on being asked to
>>>> poll explicitly, it continues to poll till it is asked to reschedule.
>>>>
>>>> Signed-off-by: Preeti U Murthy
>>>> ---
>>>>
>>>>  kernel/sched/idle.c |    3 ++-
>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>>>> index c47fce7..aaf1c1d 100644
>>>> --- a/kernel/sched/idle.c
>>>> +++ b/kernel/sched/idle.c
>>>> @@ -47,7 +47,8 @@ static inline int cpu_idle_poll(void)
>>>>  	rcu_idle_enter();
>>>>  	trace_cpu_idle_rcuidle(0, smp_processor_id());
>>>>  	local_irq_enable();
>>>> -	while (!tif_need_resched())
>>>> +	while (!tif_need_resched() &&
>>>> +	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
>>>
>>> You explain the tick_check_broadcast_expired() bit, but what about the
>>> cpu_idle_force_poll part?
>>
>> The last few lines which say "Of course if the cpu has entered
>> cpu_idle_poll() on being asked to poll explicitly, it continues to poll
>> till it is asked to reschedule" explains the cpu_idle_force_poll part.
>
> Well, I read it more than once and did not figure it out.
>
> The paragraph describes some behaviour. Now I know it's the behaviour
> before the patch. So maybe something like this:
>
>   cpu_idle_poll() is entered when cpu_idle_force_poll is set or
>   tick_check_broadcast_expired() returns true. The exit condition from
>   cpu_idle_poll() is tif_need_resched().
>
>   But this does not take into account that cpu_idle_force_poll and
>   tick_check_broadcast_expired() can change without setting the
>   resched flag. So a cpu can be caught in cpu_idle_poll() needlessly
>   and thereby wasting power.
>
>   Add an explicit check for cpu_idle_force_poll and
>   tick_check_broadcast_expired() to the exit condition of
>   cpu_idle_poll() to avoid this.
>
> This explains the technical issue without confusing people with IPIs
> and other completely irrelevant information. Hmm?

Yep, much simpler, thanks! I will send out the next version with this
changelog.

Regards
Preeti U Murthy

> Thanks,
>
> 	tglx
> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: cpuidle/powernv: Read target_residency value of idle states from DT if available
On 01/20/2015 11:15 AM, Michael Ellerman wrote:
> On Mon, 2015-19-01 at 11:32:51 UTC, Preeti U Murthy wrote:
>> The device tree now exposes the residency values for different idle
>> states. Read these values instead of calculating residency from the
>> latency values. The values exposed in the DT are validated for optimal
>> power efficiency. However to maintain compatibility with the older
>> firmware code which does not expose residency values, use default
>> values as a fallback mechanism. While at it, clump the common parts of
>> device tree parsing into one chunk.
>>
>> Signed-off-by: Preeti U Murthy
>> ---
>>
>>  drivers/cpuidle/cpuidle-powernv.c |   39 +++++++++++++++++++++++++--------------
>>  1 file changed, 25 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>> index 2726663..9e732e1 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -162,7 +162,8 @@ static int powernv_add_idle_states(void)
>>  	int dt_idle_states;
>>  	const __be32 *idle_state_flags;
>>  	const __be32 *idle_state_latency;
>> -	u32 len_flags, flags, latency_ns;
>> +	const __be32 *idle_state_residency;
>> +	u32 len_flags, flags, latency_ns, residency_ns;
>>  	int i;
>>
>>  	/* Currently we have snooze statically defined */
>> @@ -186,14 +187,21 @@ static int powernv_add_idle_states(void)
>>  		return nr_idle_states;
>>  	}
>>
>> +	idle_state_residency = of_get_property(power_mgt,
>> +			"ibm,cpu-idle-state-residency-ns", NULL);
>> +	if (!idle_state_residency) {
>> +		pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-residency-ns\n");
>> +		pr_warn("Falling back to default values\n");
>> +	}
>
> This would be better done with something like:
>
>   rc = of_read_property_u32(power_mgt, "ibm,cpu-idle-state-residency-ns",
>   			    &residency_ns);
>   if (!rc) {
>   	pr_info("cpuidle-powernv: missing ibm,cpu-idle-state-residency-ns\n");
>   	residency_ns = 30;
>   }

This looks like a better API, but the default residency values are different
for each idle
state. So perhaps a patch like the below?

--Start Patch--

From: Preeti U Murthy
---
 drivers/cpuidle/cpuidle-powernv.c |   54 +++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 21 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index aedec09..2031560 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -159,9 +160,9 @@ static int powernv_add_idle_states(void)
 	int nr_idle_states = 1; /* Snooze */
 	int dt_idle_states;
 	const __be32 *idle_state_flags;
-	const __be32 *idle_state_latency;
-	u32 len_flags, flags, latency_ns;
-	int i;
+	u32 len_flags, flags;
+	u32 *latency_ns, *residency_ns;
+	int i, rc;

 	/* Currently we have snooze statically defined */
@@ -177,34 +178,39 @@ static int powernv_add_idle_states(void)
 		return nr_idle_states;
 	}

-	idle_state_latency = of_get_property(power_mgt,
-			"ibm,cpu-idle-state-latencies-ns", NULL);
-	if (!idle_state_latency) {
+	dt_idle_states = len_flags / sizeof(u32);
+
+	latency_ns = kzalloc(sizeof(*latency_ns) * dt_idle_states, GFP_KERNEL);
+	rc = of_property_read_u32(power_mgt,
+			"ibm,cpu-idle-state-latencies-ns", latency_ns);
+	if (rc) {
 		pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-latencies-ns\n");
 		return nr_idle_states;
 	}

-	dt_idle_states = len_flags / sizeof(u32);
+	residency_ns = kzalloc(sizeof(*residency_ns) * dt_idle_states, GFP_KERNEL);
+	rc = of_property_read_u32(power_mgt,
+			"ibm,cpu-idle-state-residency-ns", residency_ns);
+	if (rc) {
+		pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-residency-ns\n");
+		pr_warn("Falling back to default values\n");
+	}
+
 	for (i = 0; i < dt_idle_states; i++) {
 		flags = be32_to_cpu(idle_state_flags[i]);
-
-		/* Cpuidle accepts exit_latency in us and we estimate
-		 * target residency to be 10x exit_latency
+		/*
+
Re: [PATCH] idle/tick-broadcast: Exit cpu idle poll loop when cleared from tick_broadcast_force_mask
On 01/20/2015 04:51 PM, Thomas Gleixner wrote:
> On Mon, 19 Jan 2015, Preeti U Murthy wrote:
>> An idle cpu enters cpu_idle_poll() if it is set in the
>> tick_broadcast_force_mask. This is so that it does not incur the
>> overhead of entering idle states when it is expected to be woken up
>> anytime then through a broadcast IPI. The condition that forces an
>> exit out of the idle polling is the check on setting of the
>> TIF_NEED_RESCHED flag for the idle thread.
>>
>> When the broadcast IPI does arrive, it is not guarenteed that the
>> handler sets the TIF_NEED_RESCHED flag. Hence although the cpu is
>> cleared in the tick_broadcast_force_mask, it continues to loop in
>> cpu_idle_poll unnecessarily wasting power. Hence exit the idle poll
>> loop if the tick_broadcast_force_mask gets cleared and enter idle
>> states.
>>
>> Of course if the cpu has entered cpu_idle_poll() on being asked to
>> poll explicitly, it continues to poll till it is asked to reschedule.
>>
>> Signed-off-by: Preeti U Murthy
>> ---
>>
>>  kernel/sched/idle.c |    3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index c47fce7..aaf1c1d 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -47,7 +47,8 @@ static inline int cpu_idle_poll(void)
>>  	rcu_idle_enter();
>>  	trace_cpu_idle_rcuidle(0, smp_processor_id());
>>  	local_irq_enable();
>> -	while (!tif_need_resched())
>> +	while (!tif_need_resched() &&
>> +	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
>
> You explain the tick_check_broadcast_expired() bit, but what about the
> cpu_idle_force_poll part?

The last few lines which say "Of course if the cpu has entered
cpu_idle_poll() on being asked to poll explicitly, it continues to poll
till it is asked to reschedule" explains the cpu_idle_force_poll part.
Perhaps I should s/poll explicitly/do cpu_idle_force_poll ?
Regards
Preeti U Murthy

> Thanks,
>
> 	tglx
[PATCH V3] tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
Today if the cpu handling broadcasting of wakeups goes offline, the job of
broadcasting is handed over to another cpu in the CPU_DEAD phase. The
CPU_DEAD notifiers are run only after the offline cpu sets its state as
CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while
waiting for this transition by queuing a timer. This is fatal because if the
cpu on which this kthread was running has no other work queued on it, it can
re-enter deep idle state, since it sees that a broadcast cpu still exists.
However the broadcast wakeup will never come since the cpu which was handling
it is offline, and the cpu on which the kthread doing the hotplug operation
was running never wakes up to see this because it's in deep idle state.

Fix this by setting the broadcast timer to a max value so as to force the
cpus entering deep idle states henceforth to freshly nominate the broadcast
cpu. More importantly this has to be done in the CPU_DYING phase so that it
is visible to all cpus right after exiting stop_machine, which is when they
can re-enter idle. This ensures that handover of the broadcast duty falls in
place on offline of the broadcast cpu, without having to do it explicitly.

It fixes the bug reported here:
http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html

Signed-off-by: Preeti U Murthy
---
Changes from previous versions:
1. Modification to the changelog
2. Clarified the comments

 kernel/time/clockevents.c    |    2 +-
 kernel/time/tick-broadcast.c |    7 +++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 5544990..f3907c9 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)

 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_shutdown_broadcast_oneshot(arg);
 		break;

 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
 		break;

 	case CLOCK_EVT_NOTIFY_CPU_DEAD:
-		tick_shutdown_broadcast_oneshot(arg);
 		tick_shutdown_broadcast(arg);
 		tick_shutdown(arg);
 		/*
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 066f0ec..f983983 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -675,8 +675,11 @@ static void broadcast_move_bc(int deadcpu)

 	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
 		return;
-	/* This moves the broadcast assignment to this cpu */
-	clockevents_program_event(bc, bc->next_event, 1);
+	/* Since a cpu with the earliest wakeup is nominated as the
+	 * standby cpu, the next cpu to invoke BROADCAST_ENTER
+	 * will now automatically take up the duty of broadcasting.
+	 */
+	bc->next_event.tv64 = KTIME_MAX;
 }

 /*
[PATCH V2] tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
Today if the cpu handling broadcasting of wakeups goes offline, the job of
broadcasting is handed over to another cpu in the CPU_DEAD phase. The
CPU_DEAD notifiers are run only after the offline cpu sets its state as
CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while
waiting for this transition by queuing a timer. This is fatal because if the
cpu on which this kthread was running has no other work queued on it, it can
re-enter deep idle state, since it sees that a broadcast cpu still exists.
However the broadcast wakeup will never come since the cpu which was handling
it is offline, and the cpu on which the kthread doing the hotplug operation
was running never wakes up to see this because it's in deep idle state.

Fix this by setting the broadcast timer to a max value so as to force the
cpus entering deep idle states henceforth to freshly nominate the broadcast
cpu. More importantly this has to be done in the CPU_DYING phase so that it
is visible to all cpus right after exiting stop_machine, which is when they
can re-enter idle. This ensures that handover of the broadcast duty falls in
place on offline of the broadcast cpu, without having to do it explicitly.

It fixes the bug reported here:
http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html

Signed-off-by: Preeti U Murthy
---
Changes from V1: https://lkml.org/lkml/2015/1/19/168
1. Modified the Changelog

 kernel/time/clockevents.c    |    2 +-
 kernel/time/tick-broadcast.c |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 5544990..f3907c9 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)

 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_shutdown_broadcast_oneshot(arg);
 		break;

 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
 		break;

 	case CLOCK_EVT_NOTIFY_CPU_DEAD:
-		tick_shutdown_broadcast_oneshot(arg);
 		tick_shutdown_broadcast(arg);
 		tick_shutdown(arg);
 		/*
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 066f0ec..e9c1d9b 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -675,8 +675,8 @@ static void broadcast_move_bc(int deadcpu)

 	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
 		return;
-	/* This moves the broadcast assignment to this cpu */
-	clockevents_program_event(bc, bc->next_event, 1);
+	/* This allows fresh nomination of broadcast cpu */
+	bc->next_event.tv64 = KTIME_MAX;
 }

 /*
[PATCH V2] tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
Today if the cpu handling broadcasting of wakeups goes offline, the job of broadcasting is handed over to another cpu in the CPU_DEAD phase. The CPU_DEAD notifiers are run only after the offline cpu sets its state as CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while waiting for this transition by queuing a timer. This is fatal because if the cpu on which this kthread was running has no other work queued on it, it can re-enter deep idle state, since it sees that a broadcast cpu still exists. However the broadcast wakeup will never come since the cpu which was handling it is offline, and the cpu on which the kthread doing the hotplug operation was running never wakes up to see this because its in deep idle state. Fix this by setting the broadcast timer to a max value so as to force the cpus entering deep idle states henceforth to freshly nominate the broadcast cpu. More importantly this has to be done in the CPU_DYING phase so that it is visible to all cpus right after exiting stop_machine, which is when they can re-enter idle. This ensures that handover of the broadcast duty falls in place on offline of the broadcast cpu, without having to do it explicitly. It fixes the bug reported here: http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- Changes from V1: https://lkml.org/lkml/2015/1/19/168 1. 
Modified the Changelog kernel/time/clockevents.c|2 +- kernel/time/tick-broadcast.c |4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 5544990..f3907c9 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_CPU_DYING: tick_handover_do_timer(arg); + tick_shutdown_broadcast_oneshot(arg); break; case CLOCK_EVT_NOTIFY_SUSPEND: @@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg) break; case CLOCK_EVT_NOTIFY_CPU_DEAD: - tick_shutdown_broadcast_oneshot(arg); tick_shutdown_broadcast(arg); tick_shutdown(arg); /* diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 066f0ec..e9c1d9b 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -675,8 +675,8 @@ static void broadcast_move_bc(int deadcpu) if (!bc || !broadcast_needs_cpu(bc, deadcpu)) return; - /* This moves the broadcast assignment to this cpu */ - clockevents_program_event(bc, bc->next_event, 1); + /* This allows fresh nomination of broadcast cpu */ + bc->next_event.tv64 = KTIME_MAX; } /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V3] tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
Today if the cpu handling broadcasting of wakeups goes offline, the job of broadcasting is handed over to another cpu in the CPU_DEAD phase. The CPU_DEAD notifiers are run only after the offline cpu sets its state as CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while waiting for this transition by queuing a timer. This is fatal because if the cpu on which this kthread was running has no other work queued on it, it can re-enter deep idle state, since it sees that a broadcast cpu still exists. However the broadcast wakeup will never come since the cpu which was handling it is offline, and the cpu on which the kthread doing the hotplug operation was running never wakes up to see this because it's in deep idle state. Fix this by setting the broadcast timer to a max value so as to force the cpus entering deep idle states henceforth to freshly nominate the broadcast cpu. More importantly this has to be done in the CPU_DYING phase so that it is visible to all cpus right after exiting stop_machine, which is when they can re-enter idle. This ensures that handover of the broadcast duty falls in place on offline of the broadcast cpu, without having to do it explicitly. It fixes the bug reported here: http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- Changes from previous versions: 1. Modification to the changelog 2.
Clarified the comments kernel/time/clockevents.c|2 +- kernel/time/tick-broadcast.c |7 +-- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 5544990..f3907c9 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_CPU_DYING: tick_handover_do_timer(arg); + tick_shutdown_broadcast_oneshot(arg); break; case CLOCK_EVT_NOTIFY_SUSPEND: @@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg) break; case CLOCK_EVT_NOTIFY_CPU_DEAD: - tick_shutdown_broadcast_oneshot(arg); tick_shutdown_broadcast(arg); tick_shutdown(arg); /* diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 066f0ec..f983983 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -675,8 +675,11 @@ static void broadcast_move_bc(int deadcpu) if (!bc || !broadcast_needs_cpu(bc, deadcpu)) return; - /* This moves the broadcast assignment to this cpu */ - clockevents_program_event(bc, bc->next_event, 1); + /* Since a cpu with the earliest wakeup is nominated as the + * standby cpu, the next cpu to invoke BROADCAST_ENTER + * will now automatically take up the duty of broadcasting. + */ + bc->next_event.tv64 = KTIME_MAX; } /*
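The handover scheme the V3 patch relies on can be sketched in miniature: once the outgoing broadcast cpu publishes next_event = KTIME_MAX, the next cpu to enter deep idle sees no earlier pending event and nominates itself. The variables and function names below are illustrative stand-ins, not the kernel's actual API:

```c
#include <stdint.h>

#define KTIME_MAX INT64_MAX

/* Illustrative stand-ins for the broadcast bookkeeping, not kernel code. */
static int64_t bc_next_event = KTIME_MAX; /* broadcast device's next event */
static int broadcast_cpu = -1;            /* cpu currently owning the duty */

/* CPU_DYING path: the outgoing owner publishes "no event pending". */
void shutdown_broadcast_owner(void)
{
	bc_next_event = KTIME_MAX;
	broadcast_cpu = -1;
}

/* BROADCAST_ENTER path: a cpu about to enter deep idle registers its
 * wakeup; the cpu with the earliest wakeup becomes the broadcast cpu.
 * KTIME_MAX compares greater than any real wakeup, so the first cpu to
 * enter after the handover always takes over the duty. */
int broadcast_enter(int cpu, int64_t wakeup)
{
	if (wakeup < bc_next_event) {
		bc_next_event = wakeup;
		broadcast_cpu = cpu;
	}
	return broadcast_cpu;
}
```

With this model, after shutdown_broadcast_owner() the first entrant becomes the broadcast cpu, and only a cpu with a strictly earlier wakeup later displaces it, which is the "fresh nomination" the one-line KTIME_MAX store achieves.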
Re: cpuidle/powernv: Read target_residency value of idle states from DT if available
On 01/20/2015 11:15 AM, Michael Ellerman wrote: On Mon, 2015-19-01 at 11:32:51 UTC, Preeti U Murthy wrote: The device tree now exposes the residency values for different idle states. Read these values instead of calculating residency from the latency values. The values exposed in the DT are validated for optimal power efficiency. However to maintain compatibility with the older firmware code which does not expose residency values, use default values as a fallback mechanism. While at it, clump the common parts of device tree parsing into one chunk. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- drivers/cpuidle/cpuidle-powernv.c | 39 - 1 file changed, 25 insertions(+), 14 deletions(-) diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index 2726663..9e732e1 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -162,7 +162,8 @@ static int powernv_add_idle_states(void) int dt_idle_states; const __be32 *idle_state_flags; const __be32 *idle_state_latency; -u32 len_flags, flags, latency_ns; +const __be32 *idle_state_residency; +u32 len_flags, flags, latency_ns, residency_ns; int i; /* Currently we have snooze statically defined */ @@ -186,14 +187,21 @@ static int powernv_add_idle_states(void) return nr_idle_states; } +idle_state_residency = of_get_property(power_mgt, +"ibm,cpu-idle-state-residency-ns", NULL); +if (!idle_state_residency) { +pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-residency-ns\n"); +pr_warn("Falling back to default values\n"); } This would be better done with something like: rc = of_read_property_u32(power_mgt, "ibm,cpu-idle-state-residency-ns", &residency_ns); if (!rc) { pr_info("cpuidle-powernv: missing ibm,cpu-idle-state-residency-ns\n"); residency_ns = 30; } This looks like a better API, but the default residency values are different for each idle state. So perhaps a patch like the below ?
--Start Patch-- From: Preeti U Murthy pre...@linux.vnet.ibm.com --- drivers/cpuidle/cpuidle-powernv.c | 54 +++-- 1 file changed, 33 insertions(+), 21 deletions(-) diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index aedec09..2031560 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -13,6 +13,7 @@ #include <linux/notifier.h> #include <linux/clockchips.h> #include <linux/of.h> +#include <linux/slab.h> #include <asm/machdep.h> #include <asm/firmware.h> @@ -159,9 +160,9 @@ static int powernv_add_idle_states(void) int nr_idle_states = 1; /* Snooze */ int dt_idle_states; const __be32 *idle_state_flags; - const __be32 *idle_state_latency; - u32 len_flags, flags, latency_ns; - int i; + u32 len_flags, flags; + u32 *latency_ns, *residency_ns; + int i, rc; /* Currently we have snooze statically defined */ @@ -177,34 +178,39 @@ static int powernv_add_idle_states(void) return nr_idle_states; } - idle_state_latency = of_get_property(power_mgt, - "ibm,cpu-idle-state-latencies-ns", NULL); - if (!idle_state_latency) { + dt_idle_states = len_flags / sizeof(u32); + + latency_ns = kzalloc(sizeof(*latency_ns) * dt_idle_states, GFP_KERNEL); + rc = of_property_read_u32(power_mgt, + "ibm,cpu-idle-state-latencies-ns", latency_ns); + if (rc) { pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-latencies-ns\n"); return nr_idle_states; } - dt_idle_states = len_flags / sizeof(u32); + residency_ns = kzalloc(sizeof(*residency_ns) * dt_idle_states, GFP_KERNEL); + rc = of_property_read_u32(power_mgt, + "ibm,cpu-idle-state-residency-ns", residency_ns); + if (rc) { + pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-residency-ns\n"); + pr_warn("Falling back to default values\n"); + } + for (i = 0; i < dt_idle_states; i++) { flags = be32_to_cpu(idle_state_flags[i]); - - /* Cpuidle accepts exit_latency in us and we estimate -* target residency to be 10x exit_latency + /* +* Cpuidle accepts exit_latency and target_residency in us.
+* Use default target_residency values if f/w does not expose it. */ - latency_ns = be32_to_cpu(idle_state_latency[i]); if (flags & OPAL_PM_NAP_ENABLED) { /* Add NAP state */ strcpy(powernv_states[nr_idle_states].name
Re: [PATCH] idle/tick-broadcast: Exit cpu idle poll loop when cleared from tick_broadcast_force_mask
On 01/20/2015 04:51 PM, Thomas Gleixner wrote: On Mon, 19 Jan 2015, Preeti U Murthy wrote: An idle cpu enters cpu_idle_poll() if it is set in the tick_broadcast_force_mask. This is so that it does not incur the overhead of entering idle states when it is expected to be woken up anytime then through a broadcast IPI. The condition that forces an exit out of the idle polling is the check on setting of the TIF_NEED_RESCHED flag for the idle thread. When the broadcast IPI does arrive, it is not guaranteed that the handler sets the TIF_NEED_RESCHED flag. Hence although the cpu is cleared in the tick_broadcast_force_mask, it continues to loop in cpu_idle_poll unnecessarily wasting power. Hence exit the idle poll loop if the tick_broadcast_force_mask gets cleared and enter idle states. Of course if the cpu has entered cpu_idle_poll() on being asked to poll explicitly, it continues to poll till it is asked to reschedule. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- kernel/sched/idle.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index c47fce7..aaf1c1d 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -47,7 +47,8 @@ static inline int cpu_idle_poll(void) rcu_idle_enter(); trace_cpu_idle_rcuidle(0, smp_processor_id()); local_irq_enable(); -while (!tif_need_resched()) +while (!tif_need_resched() && +(cpu_idle_force_poll || tick_check_broadcast_expired())) You explain the tick_check_broadcast_expired() bit, but what about the cpu_idle_force_poll part? The last few lines which say "Of course if the cpu has entered cpu_idle_poll() on being asked to poll explicitly, it continues to poll till it is asked to reschedule" explain the cpu_idle_force_poll part. Perhaps I should s/poll explicitly/do cpu_idle_force_poll ?
Regards Preeti U Murthy
Re: tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
On 01/20/2015 11:39 AM, Michael Ellerman wrote: > On Mon, 2015-19-01 at 10:26:48 UTC, Preeti U Murthy wrote: >> Today if a cpu handling broadcasting of wakeups goes offline, it hands over > > It's *the* cpu handling broadcasting of wakeups right? ie. there's only ever > one at a time. Right, thanks for pointing this. > >> the job of broadcasting to another cpu in the CPU_DEAD phase. > > I think that would be clearer as "to another cpu, when the cpu going offline > reaches the CPU_DEAD state." > > Otherwise it can read as "another cpu (which is) in the CPU_DEAD phase", which > is not what you mean - I think. Yes I will word this better. > >> The CPU_DEAD notifiers are run only after the offline cpu sets its state as >> CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while > > The kthread which is running on a different cpu from either of the first two > cpus you've mentioned. It could be running on any cpu other than the offline one. The next line clarifies this - "This is fatal if the cpu on which this kthread was running". I also say "Meanwhile, the kthread doing the offline" above so as to clarify that the offline cpu has nothing to do with this kthread. > >> waiting for this transition by queuing a timer. This is fatal because if the >> cpu on which this kthread was running has no other work queued on it, it can >> re-enter deep idle state, since it sees that a broadcast cpu still exists. >> However the broadcast wakeup will never come since the cpu which was handling >> it is offline, and this cpu never wakes up to see this because its in deep >> idle state. > > Which cpu is "this cpu"? I think you mean the one running the kthread which is > doing the offline, but it's not 100% clear. Right, I will make this correction as well, its ambiguous. > >> Fix this by setting the broadcast timer to a max value so as to force the >> cpus >> entering deep idle states henceforth to freshly nominate the broadcast cpu. 
>> More >> importantly this has to be done in the CPU_DYING phase so that it is visible >> to >> all cpus right after exiting stop_machine, which is when they can re-enter >> idle. >> This ensures that handover of the broadcast duty falls in place on offline, >> without >> having to do it explicitly. > > OK, I don't know the code well enough to say if that's the right fix. > > You say: > > + /* This allows fresh nomination of broadcast cpu */ > + bc->next_event.tv64 = KTIME_MAX; > > Is that all it does? I see that check in several places in the code. This change does not affect those parts of the tick broadcast code, which do not depend on the hrtimer broadcast framework. And for those parts that do depend on this framework, this plays a critical role in handing over the broadcast duty. > > > I assume we're expecting Thomas to merge this? Yes, Thomas can you please take this into the next 3.19 rc release ? I will send out a new patch with the modified changelog. If you find this acceptable I will port it to the relevant stable kernels. > > If so it's probably worth mentioning that it fixes a bug we are seeing on I will mention this in the changelog by pointing to the bug report. Regards Preeti U Murthy > machines in the wild. So it'd be nice if it went into 3.19 and/or gets sent to > stable. > > cheers >
[PATCH] cpuidle/powernv: Read target_residency value of idle states from DT if available
The device tree now exposes the residency values for different idle states. Read these values instead of calculating residency from the latency values. The values exposed in the DT are validated for optimal power efficiency. However to maintain compatibility with the older firmware code which does not expose residency values, use default values as a fallback mechanism. While at it, clump the common parts of device tree parsing into one chunk. Signed-off-by: Preeti U Murthy --- drivers/cpuidle/cpuidle-powernv.c | 39 - 1 file changed, 25 insertions(+), 14 deletions(-) diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index 2726663..9e732e1 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -162,7 +162,8 @@ static int powernv_add_idle_states(void) int dt_idle_states; const __be32 *idle_state_flags; const __be32 *idle_state_latency; - u32 len_flags, flags, latency_ns; + const __be32 *idle_state_residency; + u32 len_flags, flags, latency_ns, residency_ns; int i; /* Currently we have snooze statically defined */ @@ -186,14 +187,21 @@ static int powernv_add_idle_states(void) return nr_idle_states; } + idle_state_residency = of_get_property(power_mgt, + "ibm,cpu-idle-state-residency-ns", NULL); + if (!idle_state_residency) { + pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-residency-ns\n"); + pr_warn("Falling back to default values\n"); + } + dt_idle_states = len_flags / sizeof(u32); for (i = 0; i < dt_idle_states; i++) { flags = be32_to_cpu(idle_state_flags[i]); - - /* Cpuidle accepts exit_latency in us and we estimate -* target residency to be 10x exit_latency + /* +* Cpuidle accepts exit_latency and target_residency in us. +* Use default target_residency values if f/w does not expose it. 
*/ latency_ns = be32_to_cpu(idle_state_latency[i]); if (flags & OPAL_PM_NAP_ENABLED) { @@ -201,12 +209,8 @@ static int powernv_add_idle_states(void) strcpy(powernv_states[nr_idle_states].name, "Nap"); strcpy(powernv_states[nr_idle_states].desc, "Nap"); powernv_states[nr_idle_states].flags = 0; - powernv_states[nr_idle_states].exit_latency = - ((unsigned int)latency_ns) / 1000; - powernv_states[nr_idle_states].target_residency = - ((unsigned int)latency_ns / 100); + powernv_states[nr_idle_states].target_residency = 100; powernv_states[nr_idle_states].enter = &nap_loop; - nr_idle_states++; } if (flags & OPAL_PM_SLEEP_ENABLED || @@ -215,13 +219,20 @@ static int powernv_add_idle_states(void) strcpy(powernv_states[nr_idle_states].name, "FastSleep"); strcpy(powernv_states[nr_idle_states].desc, "FastSleep"); powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIMER_STOP; - powernv_states[nr_idle_states].exit_latency = - ((unsigned int)latency_ns) / 1000; - powernv_states[nr_idle_states].target_residency = - ((unsigned int)latency_ns / 100); + powernv_states[nr_idle_states].target_residency = 30; powernv_states[nr_idle_states].enter = &fastsleep_loop; - nr_idle_states++; } + + powernv_states[nr_idle_states].exit_latency = + ((unsigned int)latency_ns) / 1000; + + if (idle_state_residency) { + residency_ns = be32_to_cpu(idle_state_residency[i]); + powernv_states[nr_idle_states].target_residency = + ((unsigned int)residency_ns) / 1000; + } + + nr_idle_states++; } return nr_idle_states;
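The fallback this patch implements boils down to: use the firmware-provided residency when the device-tree property exists, otherwise keep a per-state default. A stand-alone sketch of that decision, with a plain pointer standing in for the of_get_property() lookup and all names hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for of_get_property(): NULL models an older
 * firmware that does not expose ibm,cpu-idle-state-residency-ns. */
const uint32_t *dt_residency_ns;

/* Residency in us for a given idle state: the DT value when present
 * (converted from ns), otherwise the state's built-in default. */
uint32_t state_residency_us(size_t state, uint32_t default_us)
{
	if (dt_residency_ns)
		return dt_residency_ns[state] / 1000;
	return default_us;
}
```

The per-state default argument is what makes a single shared helper workable here, which is the point raised above about Nap and FastSleep needing different fallback values (100 us and 30 us in the patch).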
[PATCH] tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
Today if a cpu handling broadcasting of wakeups goes offline, it hands over the job of broadcasting to another cpu in the CPU_DEAD phase. The CPU_DEAD notifiers are run only after the offline cpu sets its state as CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while waiting for this transition by queuing a timer. This is fatal because if the cpu on which this kthread was running has no other work queued on it, it can re-enter deep idle state, since it sees that a broadcast cpu still exists. However the broadcast wakeup will never come since the cpu which was handling it is offline, and this cpu never wakes up to see this because it's in deep idle state. Fix this by setting the broadcast timer to a max value so as to force the cpus entering deep idle states henceforth to freshly nominate the broadcast cpu. More importantly this has to be done in the CPU_DYING phase so that it is visible to all cpus right after exiting stop_machine, which is when they can re-enter idle. This ensures that handover of the broadcast duty falls in place on offline, without having to do it explicitly.
Signed-off-by: Preeti U Murthy --- kernel/time/clockevents.c|2 +- kernel/time/tick-broadcast.c |4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 5544990..f3907c9 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_CPU_DYING: tick_handover_do_timer(arg); + tick_shutdown_broadcast_oneshot(arg); break; case CLOCK_EVT_NOTIFY_SUSPEND: @@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg) break; case CLOCK_EVT_NOTIFY_CPU_DEAD: - tick_shutdown_broadcast_oneshot(arg); tick_shutdown_broadcast(arg); tick_shutdown(arg); /* diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 066f0ec..e9c1d9b 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -675,8 +675,8 @@ static void broadcast_move_bc(int deadcpu) if (!bc || !broadcast_needs_cpu(bc, deadcpu)) return; - /* This moves the broadcast assignment to this cpu */ - clockevents_program_event(bc, bc->next_event, 1); + /* This allows fresh nomination of broadcast cpu */ + bc->next_event.tv64 = KTIME_MAX; } /*
Re: tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
On 01/20/2015 11:39 AM, Michael Ellerman wrote: On Mon, 2015-19-01 at 10:26:48 UTC, Preeti U Murthy wrote: Today if a cpu handling broadcasting of wakeups goes offline, it hands over It's *the* cpu handling broadcasting of wakeups right? ie. there's only ever one at a time. Right, thanks for pointing this. the job of broadcasting to another cpu in the CPU_DEAD phase. I think that would be clearer as to another cpu, when the cpu going offline reaches the CPU_DEAD state. Otherwise it can read as another cpu (which is) in the CPU_DEAD phase, which is not what you mean - I think. Yes I will word this better. The CPU_DEAD notifiers are run only after the offline cpu sets its state as CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while The kthread which is running on a different cpu from either of the first two cpus you've mentioned. It could be running on any cpu other than the offline one. The next line clarifies this - This is fatal if the cpu on which this kthread was running. I also say Meanwhile, the kthread doing the offline above so as to clarify that the offline cpu has nothing to do with this kthread. waiting for this transition by queuing a timer. This is fatal because if the cpu on which this kthread was running has no other work queued on it, it can re-enter deep idle state, since it sees that a broadcast cpu still exists. However the broadcast wakeup will never come since the cpu which was handling it is offline, and this cpu never wakes up to see this because its in deep idle state. Which cpu is this cpu? I think you mean the one running the kthread which is doing the offline, but it's not 100% clear. Right, I will make this correction as well, its ambiguous. Fix this by setting the broadcast timer to a max value so as to force the cpus entering deep idle states henceforth to freshly nominate the broadcast cpu. 
More importantly this has to be done in the CPU_DYING phase so that it is visible to all cpus right after exiting stop_machine, which is when they can re-enter idle. This ensures that handover of the broadcast duty falls in place on offline, without having to do it explicitly. OK, I don't know the code well enough to say if that's the right fix. You say: + /* This allows fresh nomination of broadcast cpu */ + bc-next_event.tv64 = KTIME_MAX; Is that all it does? I see that check in several places in the code. This change does not affect those parts of the tick broadcast code, which do not depend on the hrtimer broadcast framework. And for those parts that do depend on this framework, this plays a critical role in handing over the broadcast duty. I assume we're expecting Thomas to merge this? Yes, Thomas can you please take this into the next 3.19 rc release ? I will send out a new patch with the modified changelog. If you find this acceptable I will port it to the relevant stable kernels. If so it's probably worth mentioning that it fixes a bug we are seeing on I will mention this in the changelog by pointing to the bug report. Regards Preeti U Murthy machines in the wild. So it'd be nice if it went into 3.19 and/or gets sent to stable. cheers -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] idle/tick-broadcast: Exit cpu idle poll loop when cleared from tick_broadcast_force_mask
An idle cpu enters cpu_idle_poll() if it is set in the tick_broadcast_force_mask. This is so that it does not incur the overhead of entering idle states when it is expected to be woken up anytime then through a broadcast IPI. The condition that forces an exit out of the idle polling is the check on setting of the TIF_NEED_RESCHED flag for the idle thread. When the broadcast IPI does arrive, it is not guaranteed that the handler sets the TIF_NEED_RESCHED flag. Hence although the cpu is cleared in the tick_broadcast_force_mask, it continues to loop in cpu_idle_poll unnecessarily wasting power. Hence exit the idle poll loop if the tick_broadcast_force_mask gets cleared and enter idle states. Of course if the cpu has entered cpu_idle_poll() on being asked to poll explicitly, it continues to poll till it is asked to reschedule. Signed-off-by: Preeti U Murthy --- kernel/sched/idle.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index c47fce7..aaf1c1d 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -47,7 +47,8 @@ static inline int cpu_idle_poll(void) rcu_idle_enter(); trace_cpu_idle_rcuidle(0, smp_processor_id()); local_irq_enable(); - while (!tif_need_resched()) + while (!tif_need_resched() && + (cpu_idle_force_poll || tick_check_broadcast_expired())) cpu_relax(); trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id()); rcu_idle_exit();
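The exit condition this patch introduces can be captured as a small predicate: polling continues only while no reschedule is requested and there is still a reason to poll, i.e. either an explicit poll request or a pending broadcast wakeup. The flag variables below merely stand in for the kernel's tif_need_resched(), cpu_idle_force_poll and tick_check_broadcast_expired():

```c
#include <stdbool.h>

/* Stand-ins for the kernel predicates; illustrative only. */
bool need_resched;
bool force_poll;
bool broadcast_pending;

/* Mirrors the patched loop condition in cpu_idle_poll(): keep spinning
 * in cpu_relax() only while no resched is pending AND there is still a
 * reason to poll. */
bool keep_polling(void)
{
	return !need_resched && (force_poll || broadcast_pending);
}
```

Note the asymmetry the changelog describes: clearing broadcast_pending alone now ends the loop (the cpu can enter real idle states), while force_poll keeps the cpu polling until a reschedule is requested.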
Re: [PATCH] tick/powerclamp: Remove tick_nohz_idle abuse
Hi Jacob, On 12/23/2014 08:27 AM, Jacob Pan wrote: > On Sat, 20 Dec 2014 07:01:12 +0530 > Preeti U Murthy wrote: > >> On 12/20/2014 01:26 AM, Thomas Gleixner wrote: >>> On Fri, 19 Dec 2014, Jacob Pan wrote: >>> >>>> On Thu, 18 Dec 2014 22:12:57 +0100 (CET) >>>> Thomas Gleixner wrote: >>>> >>>>> On Thu, 18 Dec 2014, Jacob Pan wrote: >>>>>> OK I agree, also as I mentioned earlier, Peter already has a >>>>>> patch for consolidated idle loop and remove >>>>>> tick_nohz_idle_enter/exit call from powerclamp driver. I have >>>>>> been working on a few tweaks to maintain the functionality and >>>>>> efficiency with the consolidated idle loop. We can apply the >>>>>> patches on top of yours. >>>>> >>>>> No. This is equally wrong as I pointed out before. The 'unified' >>>>> idle loop is still fake and just pretending to be idle. >>>>> >>>> In terms of efficiency, the consolidated idle loop will allow >>>> turning off sched tick during idle injection period. If we just >>>> take out the tick_nohz_idle_xxx call, the effectiveness of >>>> powerclamp is going down significantly. I am not arguing the >>>> design but from fixing regression perspective or short term >>>> solution. >>> >>> There is no perspective. Period. >>> >>> Its violates every rightful assumption of the nohz_IDLE_* code and >>> just ever worked by chance. There is so much subtle wreckage lurking >>> there that the only sane solution is to forbid it. End of story. >>> >>> Thanks, >>> >>> tglx >>> >> Hi Jacob, >> >> Like Thomas pointed out, we can design a sane solution for powerclamp. >> Idle injection is nothing but throttling of runqueue. If the runqueue >> is throttled, no fair tasks will be selected and the natural choice >> in the absence of tasks from any other sched class is the idle task. >> >> The idle loop will automatically be called and the nohz state will >> also fall in place. The cpu is really idle now: the runqueue has no >> tasks and the task running on the cpu is the idle thread. 
The >> throttled tasks are on a separate list. >> >> When the period of idle injection is over, we unthrottle the runqueue. >> All this being taken care of by a non-deferrable timer. This design >> ensures that the intention of powerclamp is not hampered while at the >> same time maintaining a sane state for nohz; you will get the >> efficiency you want. >> >> Of course there may be corner cases and challenges around >> synchronization of package idle, which I am sure we can work around >> with a better design such as the above. I am working on that patchset >> and will post out in a day. You can take a look and let us know the >> pieces we are missing. >> >> I find that implementing the above design is not too hard. >> > Hi Preeti, > Yeah, it seems to be a good approach. looking forward to work with you > on this. Timer may scale better for larger systems. One question, will > timer irq gets unpredictable delays if run by ksoftirqd? I am sorry I could not respond earlier; I was on vacation as well. Yes, we may have a problem here. Let alone synchronization between cpus in performing clamping, there are two other functionality issues that I see. 1. Since periodic timers get executed in the softirq context, scheduler_tick() would already have run by then. i.e. hrtimer_interrupt() |__tick_sched_handle() |__scheduler_tick() |__raise_softirq(TIMER_SOFTIRQ) ksoftirqd runs on local_bh_enable()-->powerclamp_timer handler runs Although runqueues are throttled in the powerclamp timer handler, it has to wait till the next scheduler tick to select an idle task to run. A precious 4-10ms depending on the config would have passed by then. 2. For the same reason as 1, when the ksoftirqd has to run on the cpus during the tick after the one in which throttling is enabled, cpus are unavailable to run the daemon because they are throttled. However there is no other way to unthrottle the runqueues now except by running the ksoftirqd; a chicken-and-egg problem.
I think both the above problems could be solved by using hrtimers instead of periodic timers to perform clamping/unclamping, since hrtimers are serviced in the interrupt context. But we cannot initialize/start/modify hrtimers on a remote cpu. We will end up using IPIs for handling the hrtimers during start/end of powerclamp or modification of the idle duration of clamping, which is not a tempting option either. So I am currently stuck at this point. I would be glad to have some suggestions. Thanks Regards Preeti U Murthy > BTW, I may not be able to respond quickly during the holidays. If > things workout, it may benefit ACPI PAD driver as well. > > > Thanks, > > Jacob >> Regards >> Preeti U Murthy >> > > [Jacob Pan]
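The ordering problem in point 1 can be made concrete with a toy userspace model (all names here are illustrative, not kernel APIs): within one tick, scheduler_tick() runs before the softirq timer handlers, so a throttle requested from a periodic-timer handler at tick N is only observed by the scheduler at tick N + 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-cpu state for the simulation: whether the runqueue is
 * throttled, and the first tick at which idle was actually picked. */
struct sim {
	bool throttled;
	int idle_since_tick;	/* -1 while still running tasks */
};

static void run_tick(struct sim *s, int tick, bool request_throttle)
{
	/* scheduler_tick(): picks the idle task only if the runqueue
	 * was already throttled when the tick arrived. */
	if (s->throttled && s->idle_since_tick < 0)
		s->idle_since_tick = tick;
	/* softirq timer handlers run afterwards; this is where a
	 * periodic powerclamp timer would set the throttle. */
	if (request_throttle)
		s->throttled = true;
}

/* Request throttling at tick `request_at`; return the tick at which
 * the cpu first goes idle. */
static int first_idle_tick(int request_at)
{
	struct sim s = { false, -1 };
	for (int t = 0; t < 10; t++)
		run_tick(&s, t, t == request_at);
	return s.idle_since_tick;
}
```

The one-tick lag shown here is exactly the "precious 4-10ms depending on the config" mentioned above; hrtimers avoid it because they are serviced in hard interrupt context, before the tick handler returns.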
[RFC PATCH] drivers/intel_powerclamp : Redesign implementation of idle injection
The powerclamp driver injects idle periods to stay within the thermal constraints. The driver does a fake idle by spawning per-cpu threads that call the mwait instruction. This behavior of fake idle can confuse the other kernel subsystems. For instance, it calls into the nohz tick handlers, which are meant to be called only by the idle thread. It sets the state of the tick in each cpu to idle and stops the tick, when there are tasks on the runqueue. As a result the callers of idle_cpu()/ tick_nohz_tick_stopped() see different states of the cpu; while the former thinks that the cpu is busy, the latter thinks that it is idle. The outcome may be inconsistency in the scheduler/nohz states which can lead to serious consequences. One of them was reported on this thread: https://lkml.org/lkml/2014/12/11/365. Thomas posted out a patch to disable the powerclamp driver from calling into the tick nohz code which has taken care of the above regression for the moment. However, the powerclamp driver as a result will not be able to inject idle periods due to the presence of periodic ticks. With the current design of fake idle, we cannot move towards a better solution. https://lkml.org/lkml/2014/12/18/169 This patch aims at removing the concept of fake idle and instead makes the cpus truly idle by throttling the runqueues during the idle injection periods. The situation is in fact very similar to throttling cfs_rqs when they exceed their bandwidths. The exception here is that the root cfs_rq is throttled in the presence of available bandwidth, due to the need to force idle. The runqueues automatically get unthrottled at all other times and a call to reschedule is made. Per-cpu timers are queued to expire after every active/idle period of the powerclamp driver, and decide whether to throttle or unthrottle the runqueues. This way, the tick nohz state and the scheduler states of the cpus are sane during the idle injection periods because they naturally fall in place.
Another point to note is that scheduler_tick() is called after the timers are serviced. This means that soon after the runqueue is throttled by the powerclamp timer, the idle task will get scheduled in. IOW, the request of the powerclamp timer to inject idle is honored almost immediately. The same holds for unthrottling as well. This patch is compile-tested only. The goal is to get a consensus on the design first. There are some points that need to be given thought as far as I can see: 1. The idle task chosen during the idle injection period needs to enter a specific mwait state. The cpuidle governors cannot choose any idle state like they do generally for idle tasks. 2. We throttle the runqueue when the bandwidth is available. We do not touch any parameters around the cfs_rq bandwidth in this patch. It is supposed to bypass the bandwidth checks entirely. But will there be any serious consequence as a result? 3. There can be cases when the runqueues are throttled when bandwidths are exceeded too. What are the consequences of the powerclamp driver running in parallel? 4. The idle periods should stay synchronized across all cpus because the change in the design is to start timers to expire immediately instead of waking up kthreads. The only concern is cpu hotplug operations when the cpu coming online can go out of sync with the idle periods. But that situation exists today as well. Signed-off-by: Preeti U Murthy --- Missed Ccing LKML on the previous mail.
drivers/thermal/intel_powerclamp.c | 227 ++-- include/linux/sched.h |2 kernel/sched/fair.c| 29 - 3 files changed, 117 insertions(+), 141 deletions(-) diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c index 95cb7fc..a6fd453 100644 --- a/drivers/thermal/intel_powerclamp.c +++ b/drivers/thermal/intel_powerclamp.c @@ -87,7 +87,7 @@ static unsigned int control_cpu; /* The cpu assigned to collect stat and update static bool clamping; -static struct task_struct * __percpu *powerclamp_thread; +DEFINE_PER_CPU(struct timer_list, powerclamp_timer); static struct thermal_cooling_device *cooling_dev; static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu * clamping thread @@ -256,11 +256,6 @@ static u64 pkg_state_counter(void) return count; } -static void noop_timer(unsigned long foo) -{ - /* empty... just the fact that we get the interrupt wakes us up */ -} - static unsigned int get_compensation(int ratio) { unsigned int comp = 0; @@ -362,102 +357,72 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio, return set_target_ratio + guard <= current_ratio; } -static int clamp_thread(void *arg) +static void powerclamp_timer_fn(unsigned long data) { - int cpunr = (unsigned long)arg; - DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0); - static co
Re: [PATCH] tick/powerclamp: Remove tick_nohz_idle abuse
On 12/20/2014 01:26 AM, Thomas Gleixner wrote: > On Fri, 19 Dec 2014, Jacob Pan wrote: > >> On Thu, 18 Dec 2014 22:12:57 +0100 (CET) >> Thomas Gleixner wrote: >> >>> On Thu, 18 Dec 2014, Jacob Pan wrote: >>>> OK I agree, also as I mentioned earlier, Peter already has a patch >>>> for consolidated idle loop and remove tick_nohz_idle_enter/exit >>>> call from powerclamp driver. I have been working on a few tweaks to >>>> maintain the functionality and efficiency with the consolidated >>>> idle loop. We can apply the patches on top of yours. >>> >>> No. This is equally wrong as I pointed out before. The 'unified' idle >>> loop is still fake and just pretending to be idle. >>> >> In terms of efficiency, the consolidated idle loop will allow turning >> off sched tick during idle injection period. If we just take out the >> tick_nohz_idle_xxx call, the effectiveness of powerclamp is going down >> significantly. I am not arguing the design but from fixing regression >> perspective or short term solution. > > There is no perspective. Period. > > Its violates every rightful assumption of the nohz_IDLE_* code and > just ever worked by chance. There is so much subtle wreckage lurking > there that the only sane solution is to forbid it. End of story. > > Thanks, > > tglx > Hi Jacob, Like Thomas pointed out, we can design a sane solution for powerclamp. Idle injection is nothing but throttling of the runqueue. If the runqueue is throttled, no fair tasks will be selected and the natural choice in the absence of tasks from any other sched class is the idle task. The idle loop will automatically be called and the nohz state will also fall in place. The cpu is really idle now: the runqueue has no tasks and the task running on the cpu is the idle thread. The throttled tasks are on a separate list. When the period of idle injection is over, we unthrottle the runqueue. All this being taken care of by a non-deferrable timer.
This design ensures that the intention of powerclamp is not hampered while at the same time maintaining a sane state for nohz; you will get the efficiency you want. Of course there may be corner cases and challenges around synchronization of package idle, which I am sure we can work around with a better design such as the above. I am working on that patchset and will post out in a day. You can take a look and let us know the pieces we are missing. I find that implementing the above design is not too hard. Regards Preeti U Murthy
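The selection logic implied by this design can be sketched in a few lines of userspace C (hypothetical names, not the scheduler's actual interfaces): a throttled runqueue offers no fair tasks, so the idle task becomes the natural pick even while fair tasks sit on the separate throttled list.

```c
#include <assert.h>
#include <stdbool.h>

enum pick { PICK_FAIR, PICK_IDLE };

/* Toy model of task selection under idle injection: if the runqueue
 * is throttled, fair tasks are invisible to the picker (they live on
 * a throttled list), so with no other sched classes present the
 * idle task is chosen. */
static enum pick pick_next(int nr_fair_tasks, bool throttled)
{
	if (throttled || nr_fair_tasks == 0)
		return PICK_IDLE;
	return PICK_FAIR;
}
```

This is what makes the nohz state sane by construction: the cpu that picks idle here really is idle, so the idle thread itself drives tick_nohz_idle_enter/exit as the core code expects.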
Re: [PATCH] tick/powerclamp: Remove tick_nohz_idle abuse
Hi Thomas, On 12/18/2014 04:21 PM, Thomas Gleixner wrote: > commit 4dbd27711cd9 "tick: export nohz tick idle symbols for module > use" was merged via the thermal tree without an explicit ack from the > relevant maintainers. > > The exports are abused by the intel powerclamp driver which implements > a fake idle state from a sched FIFO task. This causes all kinds of > wreckage in the NOHZ core code which rightfully assumes that > tick_nohz_idle_enter/exit() are only called from the idle task itself. > > Recent changes in the NOHZ core lead to a failure of the powerclamp > driver and now people try to hack completely broken and backwards > workarounds into the NOHZ core code. This is completely unacceptable. > > The real solution is to fix the powerclamp driver by rewriting it with > a sane concept, but that's beyond the scope of this. > > So the only solution for now is to remove the calls into the core NOHZ > code from the powerclamp trainwreck along with the exports. > > Fixes: d6d71ee4a14a "PM: Introduce Intel PowerClamp Driver" > Signed-off-by: Thomas Gleixner > --- > diff --git a/drivers/thermal/intel_powerclamp.c > b/drivers/thermal/intel_powerclamp.c > index b46c706e1cac..e98b4249187c 100644 > --- a/drivers/thermal/intel_powerclamp.c > +++ b/drivers/thermal/intel_powerclamp.c > @@ -435,7 +435,6 @@ static int clamp_thread(void *arg) >* allowed. thus jiffies are updated properly. 
>*/ > preempt_disable(); > - tick_nohz_idle_enter(); > /* mwait until target jiffies is reached */ > while (time_before(jiffies, target_jiffies)) { > unsigned long ecx = 1; > @@ -451,7 +450,6 @@ static int clamp_thread(void *arg) > start_critical_timings(); > atomic_inc(&idle_wakeup_counter); > } > - tick_nohz_idle_exit(); > preempt_enable(); > } > del_timer_sync(&wakeup_timer); > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index 4d54b7540585..1363d58f07e9 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -847,7 +847,6 @@ void tick_nohz_idle_enter(void) > > local_irq_enable(); > } > -EXPORT_SYMBOL_GPL(tick_nohz_idle_enter); > > /** > * tick_nohz_irq_exit - update next tick event from interrupt exit > @@ -974,7 +973,6 @@ void tick_nohz_idle_exit(void) > > local_irq_enable(); > } > -EXPORT_SYMBOL_GPL(tick_nohz_idle_exit); > > static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now) > { > Ok the solution looks apt to me. Let me see if I can come up with a sane solution for powerclamp based on the suggestions that you gave in the previous thread. I was thinking of the below steps towards its implementation. The idea is based on the throttling mechanism that you had suggested. 1. Queue a deferrable periodic timer whose handler checks if idle needs to be injected. If so, it sets rq->need_throttle for the cpu. If it's already in the fake idle period, it clears rq->need_throttle and sets need_resched. 2. pick_next_task_fair() checks rq->need_throttle and dequeues all tasks in the rq if this is set and puts them on a throttled list. This mechanism is similar to throttling cfs rq today. This function hence fails to return a task, and if no task from any other sched class exists, the idle task is picked. Peter, thoughts? 3. So we are now in the idle injected period. The scheduler state is sane because the cpu is idle, rq->nr_running = 0, rq->curr = rq->idle.
The nohz state is sane, because ts->inidle = 1 and tick_stopped may or may not be 1 and they are set by an idle task. 4. When need_resched is set again, the idle task of course unsets inidle and restarts the tick. In the following scheduler tick, pick_next_task_fair() sees that rq->need_throttle is cleared, enqueues back the tasks and returns one of them to run. Of course there may be several points that I have missed. But how does the approach appear? If it looks sane enough, the cases which do not obviously fall in place can be worked upon. Regards Preeti U Murthy
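Step 1 above amounts to a two-phase toggle driven by the periodic timer. A minimal userspace model of that handler (struct and field names are hypothetical, chosen to mirror the rq->need_throttle idea, not actual kernel fields):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-cpu runqueue state for the idle-injection toggle. */
struct rq_model {
	bool need_throttle;	/* cpu should enter fake-idle period */
	bool need_resched;	/* scheduler should re-pick a task   */
};

/* Model of the periodic timer handler in step 1: each expiry flips
 * the runqueue between the throttled (idle injection) phase and the
 * unthrottled phase; leaving the idle phase also kicks the scheduler
 * so the throttled tasks get re-enqueued on the next tick. */
static void powerclamp_timer_tick(struct rq_model *rq)
{
	if (!rq->need_throttle) {
		rq->need_throttle = true;	/* start idle period */
	} else {
		rq->need_throttle = false;	/* idle period over */
		rq->need_resched = true;	/* re-pick tasks    */
	}
}
```

The real handler would additionally decide *whether* idle needs injecting at all (based on the target idle ratio) rather than unconditionally alternating, but the state transitions are the ones sketched here.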
Re: [PATCH] tick/powerclamp: Remove tick_nohz_idle abuse
Hi Thomas,

On 12/18/2014 04:21 PM, Thomas Gleixner wrote:
> commit 4dbd27711cd9 "tick: export nohz tick idle symbols for module use"
> was merged via the thermal tree without an explicit ack from the relevant
> maintainers.
>
> The exports are abused by the intel powerclamp driver which implements a
> fake idle state from a sched FIFO task. This causes all kinds of wreckage
> in the NOHZ core code which rightfully assumes that
> tick_nohz_idle_enter/exit() are only called from the idle task itself.
>
> Recent changes in the NOHZ core lead to a failure of the powerclamp
> driver and now people try to hack completely broken and backwards
> workarounds into the NOHZ core code. This is completely unacceptable.
>
> The real solution is to fix the powerclamp driver by rewriting it with a
> sane concept, but that's beyond the scope of this.
>
> So the only solution for now is to remove the calls into the core NOHZ
> code from the powerclamp trainwreck along with the exports.
>
> Fixes: d6d71ee4a14a "PM: Introduce Intel PowerClamp Driver"
> Signed-off-by: Thomas Gleixner <t...@linutronix.de>
> ---
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index b46c706e1cac..e98b4249187c 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -435,7 +435,6 @@ static int clamp_thread(void *arg)
>  	 * allowed. thus jiffies are updated properly.
>  	 */
>  	preempt_disable();
> -	tick_nohz_idle_enter();
>  	/* mwait until target jiffies is reached */
>  	while (time_before(jiffies, target_jiffies)) {
>  		unsigned long ecx = 1;
> @@ -451,7 +450,6 @@ static int clamp_thread(void *arg)
>  		start_critical_timings();
>  		atomic_inc(&idle_wakeup_counter);
>  	}
> -	tick_nohz_idle_exit();
>  	preempt_enable();
>  	}
>  	del_timer_sync(&wakeup_timer);
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 4d54b7540585..1363d58f07e9 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -847,7 +847,6 @@ void tick_nohz_idle_enter(void)
>  	local_irq_enable();
>  }
> -EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
>
>  /**
>   * tick_nohz_irq_exit - update next tick event from interrupt exit
> @@ -974,7 +973,6 @@ void tick_nohz_idle_exit(void)
>  	local_irq_enable();
>  }
> -EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
>
>  static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
>  {

Ok, the solution looks apt to me. Let me see if I can come up with a sane solution for powerclamp based on the suggestions that you gave in the previous thread. I was thinking of the below steps towards its implementation. The idea is based on the throttling mechanism that you had suggested.

1. Queue a deferrable periodic timer whose handler checks if idle needs to be injected. If so, it sets rq->need_throttle for the cpu. If the cpu is already in the fake idle period, it clears rq->need_throttle and sets need_resched.

2. pick_next_task_fair() checks rq->need_throttle and, if it is set, dequeues all tasks in the rq and puts them on a throttled list. This mechanism is similar to the throttling of a cfs rq today. The function hence fails to return a task, and if no task from any other sched class exists, the idle task is picked. Peter, thoughts?

3. So we are now in the idle injected period. The scheduler state is sane because the cpu is idle: rq->nr_running == 0 and rq->curr == rq->idle. The nohz state is sane, because ts->inidle = 1, tick_stopped may or may not be 1, and both are set by an idle task.

4. When need_resched is set again, the idle task of course unsets inidle and restarts the tick. In the following scheduler tick, pick_next_task_fair() sees that rq->need_throttle is cleared, enqueues the tasks back and returns one of them to run.

Of course there may be several points that I have missed. But how does the approach appear? If it looks sane enough, the cases which do not obviously fall in place can be worked upon.

Regards
Preeti U Murthy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection
On 12/16/2014 10:23 AM, Viresh Kumar wrote: > + Peter from Jacob's mail .. > > On 16 December 2014 at 05:14, Frederic Weisbecker wrote: >> So to summarize: I see it enqueues a timer then it loops on that timer >> expiration. >> On that loop we stop the CPU and we expect the timer to fire and wake the >> thread up. >> But if the delayed tick fires too early, before the timer actually >> expires, then we go to sleep again because we haven't yet reached the delay, >> but since tick_nohz_irq_exit() is only called on idle tasks and not for >> threads >> like powerclamp, the tick isn't rescheduled to handle the remaining timer >> slice >> so we sleep forever. > > Perfect !! > >> Hence if we really want to stop the tick when we mimic idle from powerclamp >> driver, >> we must call tick_nohz_irq_exit() on irq exit to do it correctly. >> >> It happened to work by accident before the commit because we were >> rescheduling the >> tick from itself without tick_nohz_irq_exit() to cancel anything. And that >> restored >> the periodic behaviour necessary to complete the delay. >> >> So the above change is rather a hack than a solution. >> >> We have several choices: >> >> 1) Revert the commit. But this has to be a temporary solution really. >> Powerclamp has >> to be fixed and handle tick_nohz_irq_exit(). >> >> 2) Remove powerclamp tick stop until somebody fixes it to handle nohz >> properly. >> >> 2) Fix it directly. But I believe there is a release that is going to miss >> the fix >> and suffer the regression. Does the regression matter for anybody? Is >> powerclamp >> meant for anything else than testing (I have no idea what it's used for)? >> >> So to fix powerclamp to handle nohz correctly, tick_nohz_irq_exit() must be >> called >> for both idle tasks and powerclamp kthreads. Checking ts->inidle can be a >> good way to match As far as I can see, the primary purpose of tick_nohz_irq_enter()/exit() paths was to take care of *tick stopped* cases. 
Before handling interrupts we would want jiffies to be updated, which is done in tick_nohz_irq_enter(). And after handling interrupts we need to verify if any timers/RCU callbacks were queued during an interrupt. Both of these will fail to be handled properly *only when the tick is stopped*, right? If so, the checks which permit entry into these functions should at a minimum include a check on ts->tick_stopped. The rest of the checks around whether the CPU is idle/need_resched() may be needed in conjunction, but will not be complete without checking if ticks are stopped. I see that tick_nohz_irq_enter() has a check on ts->tick_stopped, which is correct, but tick_nohz_irq_exit() does not. Adding this check to tick_nohz_irq_exit() will not only solve this regression but is probably a fix to a long-standing bug in the conditional check in tick_nohz_irq_exit(). >> both. That means we might need to use a reduced part of idle_cpu() to avoid >> redundant checks. >> tick_irq_enter() must be called as well for powerclamp, in case it's the >> only CPU running, it >> has to fixup the timekeeping alone. > > Yeah, you can call my fix a Hacky one. I agree.. > > But I don't know if calling tick_nohz_irq_exit() from these threads wouldn't > be hacky as well. And of course my knowledge wouldn't be adequate here to > judge that :) > > It looked a bit odd to me. Why should we call irq-exit from the threads > working > with idle? And that's not what we do even for the real-idle loop as well .. > > Also from Jacob's referenced patch, we would be moving to a consolidated > idle-loop for both real idle and threads like powerclamp, and this part may be > tricky then.. > > Untested, but what about something like this? 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 5918d227730f..5e4bfc367735 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -321,7 +321,7 @@ asmlinkage __visible void do_softirq(void)
>  void irq_enter(void)
>  {
>  	rcu_irq_enter();
> -	if (is_idle_task(current) && !in_interrupt()) {
> +	if (tick_idle_active() && !in_interrupt()) {
>  		/*
>  		 * Prevent raise_softirq from needlessly waking up ksoftirqd
>  		 * here, as softirq will be serviced on return from interrupt.
> @@ -363,7 +363,7 @@ static inline void tick_irq_exit(void)
>  	int cpu = smp_processor_id();
>
>  	/* Make sure that timer wheel updates are propagated */
> -	if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
> +	if ((tick_idle_active() && !need_resched()) || tick_nohz_full_cpu(cpu)) {

For reasons mentioned above, this check alone will not do. It may solve this regression, but the check is incomplete. IMO it should simply be:

	if (tick_nohz_tick_stopped() || tick_nohz_full_cpu())

Regards
Preeti U Murthy
Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection
On 12/15/2014 03:02 PM, Viresh Kumar wrote: > On 15 December 2014 at 12:55, Preeti U Murthy > wrote: >> Hi Viresh, >> >> Let me explain why I think this is happening. >> >> 1. tick_nohz_irq_enter/exit() both get called *only if the cpu is idle* >> and receives an interrupt. > > Bang on target. Yeah that's the part we missed while writing this patch :) > >> 2. Commit 2a16fc93d2c9568e1, cancels programming of tick_sched timer >> in its handler, assuming that tick_nohz_irq_exit() will take care of >> programming the clock event device appropriately, and hence it would >> requeue or cancel the tick_sched timer. > > Correct. > >> 3. But the intel_powerclamp driver injects an idle period only. >> *The CPU however is not idle*. It has work on its runqueue and the >> rq->curr != idle. This means that *tick_nohz_irq_enter()/exit() will not >> get called on any interrupt*. > > Still good.. > >> 4. As a consequence, when we get a hrtimer interrupt during the period >> that the powerclamp driver is mimicking idle, the exit path of the >> interrupt never calls tick_nohz_irq_exit(). Hence the tick_sched timer >> that would have got removed due to the above commit will not get >> enqueued back on for any pending timers that there might be. Besides >> this, *jiffies never gets updated*. > > Jiffies can be updated by any CPU and there is something called a control > cpu with powerclamp driver. BUT we may have got interrupted before the > powerclamp timer expired and so we are stuck in the > > while (time_before(jiffies, target_jiffies)) > > loop for ever. > >> Hope the above explanation makes sense. > > Mostly good. Thanks for helping out. > > Now, what's the right solution going forward ? > > - Revert the offending commit .. > - Or still try to avoid reprogramming if we can .. 
> This is what I could come up with to still avoid reprogramming of tick:
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index cc0a5b6f741b..49f4278f69e2 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1100,7 +1100,7 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
>  	tick_sched_handle(ts, regs);
>
>  	/* No need to reprogram if we are in idle or full dynticks mode */
> -	if (unlikely(ts->tick_stopped))
> +	if (unlikely(ts->tick_stopped && (is_idle_task(current) || !ts->inidle)))
>  		return HRTIMER_NORESTART;
>
>  	hrtimer_forward(timer, now, tick_period);

Looks good to me. You can add my Reviewed-by to the above patch.

> Above change checks why we have stopped the tick..
> - The cpu has gone idle (really): is_idle_task(current)
> - The cpu isn't in idle mode, i.e. it's in nohz-full mode: !ts->inidle
>
> This fixed the issues with powerclamp in my case.
>
> @Fengguang: Can you please check if this fixes it for you as well?
>
> @Thomas: Please let me know if you want me to send this fix
> or you want to revert the original commit itself.

Regards
Preeti U Murthy

> Thanks.
>
> --
> Viresh
Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection
Hi Viresh,

Let me explain why I think this is happening.

1. tick_nohz_irq_enter/exit() both get called *only if the cpu is idle* and receives an interrupt.

2. Commit 2a16fc93d2c9568e1 cancels programming of the tick_sched timer in its handler, assuming that tick_nohz_irq_exit() will take care of programming the clock event device appropriately, and hence it would requeue or cancel the tick_sched timer.

3. But the intel_powerclamp driver injects an idle period only. *The CPU however is not idle*. It has work on its runqueue and rq->curr != idle. This means that *tick_nohz_irq_enter()/exit() will not get called on any interrupt*.

4. As a consequence, when we get a hrtimer interrupt during the period that the powerclamp driver is mimicking idle, the exit path of the interrupt never calls tick_nohz_irq_exit(). Hence the tick_sched timer that would have got removed due to the above commit will not get enqueued back for any pending timers that there might be. Besides this, *jiffies never gets updated*.

5. If you look at the code of the powerclamp driver, clamp_thread() loops on jiffies getting updated. It continues to do so with preemption disabled and no tick_sched timer to force a scheduler tick to update the jiffies. Since this happens on cpus in a package, all of them get soft locked up.

Hope the above explanation makes sense.

Regards
Preeti U Murthy

On 12/12/2014 05:27 PM, Viresh Kumar wrote:
> Cc'ing Thomas as well..
>
> On 12 December 2014 at 01:12, Fengguang Wu wrote:
>> Hi Viresh,
>>
>> We noticed the below lockup regression on commit 2a16fc93d2c ("nohz:
>> Avoid tick's double reprogramming in highres mode").
>>
>> testbox/testcase/testparams: ivb42/idle-inject/60s-200%-10cp
>>
>> b5e995e671d8e4d7 2a16fc93d2c9568e16d45db77c
>> --
>>    fail:runs    %reproduction    fail:runs
>>        |             |               |
>>       :5           100%             1:1    last_state.is_incomplete_run
>>       :5           100%             1:1    last_state.running
>>
>> testbox/testcase/testparams: lkp-sb03/idle-inject/60s-200%-10cp
>>
>> b5e995e671d8e4d7 2a16fc93d2c9568e16d45db77c
>> --
>>       :7           100%             1:1    last_state.is_incomplete_run
>>       :7           100%             1:1    last_state.running
>>
>> Where test box ivb42 is Ivy Bridge-EP and lkp-sb03 is Sandy Bridge-EP.
>>
>> To reproduce:
>>
>>   apt-get install ruby ruby-oj
>>   git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
>>   cd lkp-tests
>>   bin/setup-local job.yaml # the job file attached in this email
>>   bin/run-local job.yaml
>>
>> Basically what the test case does is to
>>
>> - find a Sandy Bridge or newer machine
>> - look for a cooling device with type "intel_powerclamp"
>> - set cur_state to 10
>> - run any CPU extensive workload
>>
>> Then expect soft lockup. It's very reproducible.
>
> Thanks Fengguang. Yes I am able to reproduce it, but don't know yet what
> went wrong..
>
> --
> viresh
Re: [Query] Spurious interrupts from clockevent device on X86 Ivybridge
On 12/11/2014 10:26 AM, Santosh Shukla wrote:
> On 11 December 2014 at 10:14, Preeti U Murthy wrote:
>> On 12/10/2014 06:22 PM, Viresh Kumar wrote:
>>> On 10 December 2014 at 18:03, Preeti U Murthy wrote:
>>>
>>>> Right. We get an interrupt when nobody had asked for it to be delivered
>>>> or had asked for it to be delivered and later canceled the request. It
>>>> is most often in the latter situation, that there can be race
>>>> conditions. If these race conditions are not taken care of, they can
>>>> result in spurious interrupts.
>>>
>>> But the delta time will be very small then, right ?
>>
>> I was talking of the case where we get an interrupt from the clockevent
>> device but don't find the hrtimer to service, and not really of an anomaly
>> in timekeeping.
>> For instance, one of the issues that we had seen earlier wherein we
>> cancel the tick-sched-timer before going tickless, but since we had
>> programmed the clock event device to fire, we get a spurious interrupt.
>
> I verified this case before reporting; in my case tick_sched_timer does
> get cancelled before the expire duration but then the clk_evt_device gets
> reprogrammed for the next time node in the list. __remove_hrtimer() takes
> care of that.

Right. The scenario I described happens in the Low Resolution Mode. You are right, this does not happen in the High Resolution Mode.

Regards
Preeti U Murthy
Re: [Query] Spurious interrupts from clockevent device on X86 Ivybridge
On 12/10/2014 06:22 PM, Viresh Kumar wrote:
> On 10 December 2014 at 18:03, Preeti U Murthy wrote:
>
>> Right. We get an interrupt when nobody had asked for it to be delivered
>> or had asked for it to be delivered and later canceled the request. It
>> is most often in the latter situation, that there can be race
>> conditions. If these race conditions are not taken care of, they can
>> result in spurious interrupts.
>
> But the delta time will be very small then, right ?

I was talking of the case where we get an interrupt from the clockevent device but don't find the hrtimer to service, and not really of an anomaly in timekeeping. For instance, one of the issues that we had seen earlier wherein we cancel the tick-sched-timer before going tickless, but since we had programmed the clock event device to fire, we get a spurious interrupt.

>> Since the difference is 1us and not a noticeably high value, it is most
>> probably because during hrtimer handling, we traverse all queued timers
>> and call their handlers, till we find timers whose expiry is in the
>> future. I would not be surprised if we overshoot the expiry of the
>> timers at the end of the list by a microsecond by the time we call their
>> handlers.
>
> Looks like you misunderstood what he wrote. He isn't saying that we
> serviced the timers/hrtimers sometime after we should have.
>
> What he is saying is: we got the clockevent device's interrupt at the
> time we requested but hrtimer_update_base() returned a time
> lesser than what it *should* have. And that results in a spurious
> interrupt.. We enqueue again for 1 us and service the timer then.

Oh ok, I see. I understand this better now after reading Thomas's explanation.

Regards
Preeti U Murthy
Re: [Query] Spurious interrupts from clockevent device on X86 Ivybridge
Hi Santosh,

On 12/10/2014 11:00 AM, Santosh Shukla wrote:
> Hi Guys,
>
> I have been chasing spurious clockevent interrupts (in HIGHRES mode) on my
> x86 IvyBridge server and am stuck on some blockers (mostly due to lack of
> knowledge) and need some help..
>
> What does spurious mean ?
> I take it as the case when a core is unnecessarily interrupted by the
> clockevent device while we didn't want it to be.

Right. We get an interrupt when nobody had asked for it to be delivered or had asked for it to be delivered and later canceled the request. It is most often in the latter situation that there can be race conditions. If these race conditions are not taken care of, they can result in spurious interrupts.

> I tried to catch them with the help of a few new trace points in
> hrtimer_interrupt() when it didn't service any pending hrtimers..
>
> The causes I could find:
> - Clockevent device's counter overflow: When we need to program a clockevt
> device for times larger than it can handle, we program it for max time and
> then reprogram again on expiration. Because of hardware counter limitations
> we have to do it this way and this isn't really a spurious interrupt as we
> programmed it that way.

Yes, this is a possible cause for spurious interrupts. On PowerPC we check for such a situation specifically before we begin to handle timers.

> - Hardware timer firing before its expiration: Yes, it's hard to digest but
> from my tests it looks like this does happen. Will circulate another mail
> once I am sure about it.

Yep, we will need more information if at all this is happening.

> - Difference between time returned by hrtimer_update_base() and clockevt
> device's notion of time: Wanted to discuss this here..
>
> An example of what I am witnessing in my tests:
>
> clockevt device programmed to fire after 199 ms and it fires exactly after
> that time (verified with ftrace time-diff and tsc diff [tsc: timestamp
> counter instructions for x86]).
> But when hrtimer_interrupt() tries to find the hrtimer for which the event
> occurred, it finds the value returned by hrtimer_update_base() being lesser
> than the softexpires of the timer for which the event was programmed. And
> the diff is approximately 1 us most of the times I see the spurious
> interrupt.
>
> -0 [010] d... 68.961171: clk_evt_program: delta:199933679 expires=6907100 /* programmed for 199 ms */
> -0 [010] d... 68.961171: lapic_next_deadline: cur_tsc:365799488705427 new_tsc:365799907536051 delta:199443 usec /* sets new_tsc */
> -0 [010] d.h. 69.161183: local_timer_entry: vector=239
> -0 [010] d.h. 69.161185: local_apic_timer_interrupt: cur_tsc:365799907547595 delta:199448 usec /* Interrupt arrived after exactly 199 ms */
> -0 [010] d.h. 69.161187: hrtimer_interrupt: function=tick_sched_timer expires=6907100 softexpires=6907100 basenow:69070997234 /* But Softexpires > basenow ?? */
> -0 [010] d.h. 69.161188: clk_evt_program: delta:1000 expires=6907100 /* new expiry value (1 us) */
> -0 [010] d.h. 69.161189: lapic_next_deadline: cur_tsc:365799907556179 new_tsc:365799907558259 delta:0 usec /* sets new tsc value */
> -0 [010] d.h. 69.161189: local_timer_exit: vector=239

Since the difference is 1 us and not a noticeably high value, it is most probably because during hrtimer handling, we traverse all queued timers and call their handlers, till we find timers whose expiry is in the future. I would not be surprised if we overshoot the expiry of the timers at the end of the list by a microsecond by the time we call their handlers.

Regards
Preeti U Murthy

> Tested on: Ivybridge V2, 12 core X86 server, though it occurs on ARM as well
> (but not that frequent).
>
> Regards,
> Santosh
Re: [Query] Spurious interrupts from clockevent device on X86 Ivybridge
On 12/10/2014 06:22 PM, Viresh Kumar wrote: > On 10 December 2014 at 18:03, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: >> Right. We get an interrupt when nobody had asked for it to be delivered or had asked for it to be delivered and later canceled the request. It is most often in the latter situation that there can be race conditions. If these race conditions are not taken care of, they can result in spurious interrupts. > But the delta time will be very small then, right ? I was talking of the case where we get an interrupt from the clockevent device but don't find the hrtimer to service, and not really of an anomaly in timekeeping. For instance, one of the issues that we had seen earlier was where we cancel the tick-sched-timer before going tickless, but since we had programmed the clock event device to fire, we get a spurious interrupt. >> Since the difference is 1us and not a noticeably high value, it is most probably because during hrtimer handling, we traverse all queued timers and call their handlers, till we find timers whose expiry is in the future. I would not be surprised if we overshoot the expiry of the timers at the end of the list by a microsecond by the time we call their handlers. > Looks like you misunderstood what he wrote. He isn't saying that we serviced the timers/hrtimers sometime after we should have. What he is saying is: we got the clockevent device's interrupt at the time we requested, but hrtimer_update_base() returned a time lesser than what it *should* have. And that results in a spurious interrupt. We enqueue again for 1 us and service the timer then. Oh ok, I see. I understand this better now after reading Thomas's explanation. Regards Preeti U Murthy
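The 1 us re-enqueue Viresh describes can be pictured with a small helper. This is purely an illustrative sketch, not the hrtimer code itself: if the base time has not yet reached the timer's softexpires, the handler is not run and the event is reprogrammed for the remaining delta, clamped to a minimum programmable delta.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: returns 0 when the timer should be serviced now,
 * otherwise the delta (ns) to reprogram the clockevent device for,
 * clamped up to min_delta_ns (the 1 us re-arm seen in the trace). */
static int64_t rearm_delta_ns(int64_t basenow, int64_t softexpires,
                              int64_t min_delta_ns)
{
    int64_t d = softexpires - basenow;
    if (d <= 0)
        return 0;                              /* expired: run the handler */
    return d < min_delta_ns ? min_delta_ns : d; /* not yet: re-arm */
}
```

With basenow a hair behind softexpires, the result is a short re-arm rather than a serviced timer, which is exactly the spurious-looking interrupt pair in the trace above.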
Re: [PATCH 1/2] clockevents: introduce ->set_dev_mode() which can return error
On 12/10/2014 03:33 AM, Kevin Hilman wrote: > From: Viresh Kumar > > Currently, the ->set_mode() method of a clockevent device is not > allowed to fail, so it has no return value. In order to add new > clockevent modes, and allow the setting of those modes to fail, we > need the clockevent core to be able to detect when setting a mode > fails. > > To allow detection of mode setting failures, introduce a new method > ->set_dev_mode() which can return an error if the given mode is not > supported or fails. > > Since all current modes are still not allowed to fail, the core code > simply WARNs if any modes fail. Future patches that add new mode > support will add proper error recovery based on failure conditions. > > Signed-off-by: Viresh Kumar > Reviewed-by: Kevin Hilman > [khilman: rework changelogs, minor formatting/checkpatch cleanups] > Signed-off-by: Kevin Hilman > --- > include/linux/clockchips.h | 5 - > kernel/time/clockevents.c | 21 ++--- > kernel/time/timer_list.c | 5 - > 3 files changed, 26 insertions(+), 5 deletions(-) > > diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h > index 2e4cb67f6e56..d969659cf688 100644 > --- a/include/linux/clockchips.h > +++ b/include/linux/clockchips.h > @@ -81,7 +81,8 @@ enum clock_event_mode { > * @mode:operating mode assigned by the management code > * @features:features > * @retries: number of forced programming retries > - * @set_mode:set mode function > + * @set_dev_mode:set dev mode function > + * @set_mode:set mode function (deprecated, use set_dev_mode > instead) > * @broadcast: function to broadcast events > * @min_delta_ticks: minimum delta value in ticks stored for reconfiguration > * @max_delta_ticks: maximum delta value in ticks stored for reconfiguration > @@ -109,6 +110,8 @@ struct clock_event_device { > unsigned long retries; > > void(*broadcast)(const struct cpumask *mask); > + int (*set_dev_mode)(enum clock_event_mode mode, > + struct clock_event_device *); > void(*set_mode)(enum 
clock_event_mode mode, > struct clock_event_device *); > void(*suspend)(struct clock_event_device *); > diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c > index 55449909f114..f7614041240e 100644 > --- a/kernel/time/clockevents.c > +++ b/kernel/time/clockevents.c > @@ -105,7 +105,16 @@ void clockevents_set_mode(struct clock_event_device *dev, >enum clock_event_mode mode) > { > if (dev->mode != mode) { > - dev->set_mode(mode, dev); > + if (dev->set_dev_mode) { > + int ret = dev->set_dev_mode(mode, dev); > + > + /* Currently available modes shouldn't fail */ > + WARN_ONCE(ret, "Requested mode: %d, error: %d\n", > + mode, ret); > + } else { > + dev->set_mode(mode, dev); > + } > + > dev->mode = mode; > > /* > @@ -448,8 +457,14 @@ int __clockevents_update_freq(struct clock_event_device > *dev, u32 freq) > if (dev->mode == CLOCK_EVT_MODE_ONESHOT) > return clockevents_program_event(dev, dev->next_event, false); > > - if (dev->mode == CLOCK_EVT_MODE_PERIODIC) > - dev->set_mode(CLOCK_EVT_MODE_PERIODIC, dev); > + if (dev->mode == CLOCK_EVT_MODE_PERIODIC) { > + /* Shouldn't fail while switching to PERIODIC MODE */ > + if (dev->set_dev_mode) > + WARN_ON_ONCE(dev->set_dev_mode(CLOCK_EVT_MODE_PERIODIC, > +dev)); > + else > + dev->set_mode(CLOCK_EVT_MODE_PERIODIC, dev); > + } > > return 0; > } > diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c > index 61ed862cdd37..957a04951ef0 100644 > --- a/kernel/time/timer_list.c > +++ b/kernel/time/timer_list.c > @@ -229,7 +229,10 @@ print_tickdevice(struct seq_file *m, struct tick_device > *td, int cpu) > SEQ_printf(m, "\n"); > > SEQ_printf(m, " set_mode: "); > - print_name_offset(m, dev->set_mode); > + if (dev->set_dev_mode) > + print_name_offset(m, dev->set_dev_mode); > + else > + print_name_offset(m, dev->set_mode); > SEQ_printf(m, "\n"); > > SEQ_printf(m, " event_handler: "); > Reviewed-by: Preeti U Murthy
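As a usage sketch of the new callback, a driver whose hardware lacks periodic mode could return an error from ->set_dev_mode() instead of silently ignoring the request. The driver name, the stand-in type definitions, and the choice of error code below are all hypothetical; only the callback signature follows the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Minimal stand-ins so the sketch is self-contained; the real
 * definitions live in <linux/clockchips.h>. */
enum clock_event_mode {
    CLOCK_EVT_MODE_UNUSED,
    CLOCK_EVT_MODE_SHUTDOWN,
    CLOCK_EVT_MODE_PERIODIC,
    CLOCK_EVT_MODE_ONESHOT,
};
struct clock_event_device;

/* Hypothetical driver callback: oneshot and shutdown succeed,
 * periodic mode is reported as unsupported instead of being
 * silently dropped, which the old void ->set_mode() could not do. */
static int foo_set_dev_mode(enum clock_event_mode mode,
                            struct clock_event_device *dev)
{
    (void)dev;
    switch (mode) {
    case CLOCK_EVT_MODE_PERIODIC:
        return -EINVAL;         /* hardware cannot do periodic */
    default:
        return 0;
    }
}
```

Until callers learn to handle failures, the core's WARN_ONCE in clockevents_set_mode() would flag such a return; later patches are expected to add real error recovery.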
[PATCH V3] tick-broadcast: Register for hrtimer based broadcast as the fallback broadcast mode
Commit 5d1638acb9f6 ('tick: Introduce hrtimer based broadcast') added a hrtimer based broadcast mode for those platforms in which local timers stop when CPUs enter deep idle states. The commit expected the platforms to register for this mode explicitly when they lacked a better external device to wake up CPUs in deep idle. Given that more platforms are beginning to use this mode, we can avoid the call to set it up on every platform that requires it, by registering for the hrtimer based broadcast mode in the core code if no better broadcast device is available. This commit also helps detect cases where the platform fails to register for a broadcast device but invokes the help of one when entering deep idle states. Currently we do not handle this situation at all and call the broadcast clock device without checking for its existence. This patch will handle such buggy cases properly. While at it, give a name to this mode of broadcast which was missing all along. Signed-off-by: Preeti U Murthy Tested-by: Mark Rutland --- Changes from V1: https://lkml.org/lkml/2014/12/5/261 1.Moved registering the hrtimer based broadcast from timekeeping code to an early_initcall. Changes from V2: https://lkml.org/lkml/2014/12/8/57 1.Added the 'name' param to hrtimer broadcast mode and removed the prototype to setup this mode since there are no external callers of it. 
arch/arm64/kernel/time.c |2 -- arch/powerpc/kernel/time.c |1 - include/linux/clockchips.h |3 --- kernel/time/tick-broadcast-hrtimer.c |5 - 4 files changed, 4 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/time.c b/arch/arm64/kernel/time.c index 1a7125c..47baaa8 100644 --- a/arch/arm64/kernel/time.c +++ b/arch/arm64/kernel/time.c @@ -70,8 +70,6 @@ void __init time_init(void) of_clk_init(NULL); clocksource_of_init(); - tick_setup_hrtimer_broadcast(); - arch_timer_rate = arch_timer_get_rate(); if (!arch_timer_rate) panic("Unable to initialise architected timer.\n"); diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 7505599..51433a8 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -942,7 +942,6 @@ void __init time_init(void) clocksource_init(); init_decrementer_clockevent(); - tick_setup_hrtimer_broadcast(); } diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index 2e4cb67..c362143 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -187,11 +187,9 @@ extern int tick_receive_broadcast(void); #endif #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT) -extern void tick_setup_hrtimer_broadcast(void); extern int tick_check_broadcast_expired(void); #else static inline int tick_check_broadcast_expired(void) { return 0; } -static inline void tick_setup_hrtimer_broadcast(void) {}; #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS @@ -207,7 +205,6 @@ static inline void clockevents_resume(void) {} static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; } static inline int tick_check_broadcast_expired(void) { return 0; } -static inline void tick_setup_hrtimer_broadcast(void) {}; #endif diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c index eb682d5..1f6bc6a 100644 --- a/kernel/time/tick-broadcast-hrtimer.c +++ b/kernel/time/tick-broadcast-hrtimer.c @@ -72,6 +72,7 @@ static int 
bc_set_next(ktime_t expires, struct clock_event_device *bc) } static struct clock_event_device ce_broadcast_hrtimer = { + .name = "broadcast_hrtimer", .set_mode = bc_set_mode, .set_next_ktime = bc_set_next, .features = CLOCK_EVT_FEAT_ONESHOT | @@ -98,9 +99,11 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t) return HRTIMER_RESTART; } -void tick_setup_hrtimer_broadcast(void) +static int __init tick_setup_hrtimer_broadcast(void) { hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); bctimer.function = bc_handler; clockevents_register_device(&ce_broadcast_hrtimer); + return 0; } +early_initcall(tick_setup_hrtimer_broadcast);
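The reason an early_initcall suffices as a fallback is that the tick core picks the broadcast device by rating: the hrtimer pseudo-device registers with the default (lowest) rating, so any real broadcast device registered later displaces it. The helper below is a simplified illustration of that selection idea, not the kernel's actual tick_install_broadcast_device().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a clockevent device, keeping only the
 * fields relevant to the selection sketch. */
struct bc_dev {
    const char *name;
    int rating;         /* higher is better; hrtimer fallback stays at 0 */
};

/* Keep the current broadcast device unless the newly registered one
 * has a strictly higher rating. */
static struct bc_dev *pick_broadcast(struct bc_dev *cur, struct bc_dev *new_dev)
{
    if (cur == NULL || new_dev->rating > cur->rating)
        return new_dev;
    return cur;
}
```

Under this scheme the order of registration does not matter much: the hrtimer device registered at early_initcall time only survives as the broadcast device on platforms that never register anything better.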
Re: [PATCH V2] tick-broadcast: Register for hrtimer based broadcast as the fallback broadcast mode
On 12/08/2014 04:18 PM, Mark Rutland wrote: > Hi Preeti, > > On Mon, Dec 08, 2014 at 06:55:43AM +0000, Preeti U Murthy wrote: >> Commit 5d1638acb9f6 ('tick: Introduce hrtimer based broadcast') added a >> hrtimer based broadcast mode for those platforms in which local timers stop >> when CPUs enter deep idle states. The commit expected the platforms to >> register for this mode explicitly when they lacked a better external device >> to wake up CPUs in deep idle. Given that more platforms are beginning to use >> this mode, we can avoid the call to set it up on every platform that requires >> it, by registering for the hrtimer based broadcast mode in the core code if >> no better broadcast device is available. >> >> This commit also helps detect cases where the platform fails to register for >> a broadcast device but invokes the help of one when entering deep idle >> states. >> Currently we do not handle this situation at all and call the broadcast clock >> device without checking for its existence. This patch will handle such buggy >> cases properly. >> >> Signed-off-by: Preeti U Murthy > > I've just given this a go on an arm64 platform (Juno) without any > system-wide clock_event_devices registered, and everything works well > with CPUs entering and exiting idle states where the cpu-local timers > lose state. So: > > Tested-by: Mark Rutland Thanks! > > One minor thing I noticed when testing was that > /sys/devices/system/clockevents/broadcast/name contained "(null)", > because we never set the name field on the clock_event_device. It's > always been that way, but now might be a good time to change that to > something like "broadcast_hrtimer". You mean /sys/devices/system/clockevents/broadcast/current_device right? > > [...] 
> >> diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h >> index 2e4cb67..91754b0 100644 >> --- a/include/linux/clockchips.h >> +++ b/include/linux/clockchips.h >> @@ -187,11 +187,11 @@ extern int tick_receive_broadcast(void); >> #endif >> >> #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && >> defined(CONFIG_TICK_ONESHOT) >> -extern void tick_setup_hrtimer_broadcast(void); >> +extern int __init tick_setup_hrtimer_broadcast(void); >> extern int tick_check_broadcast_expired(void); >> #else >> static inline int tick_check_broadcast_expired(void) { return 0; } >> -static inline void tick_setup_hrtimer_broadcast(void) {}; >> +static inline int __init tick_setup_hrtimer_broadcast(void) { return 0; } >> #endif >> >> #ifdef CONFIG_GENERIC_CLOCKEVENTS >> @@ -207,7 +207,7 @@ static inline void clockevents_resume(void) {} >> >> static inline int clockevents_notify(unsigned long reason, void *arg) { >> return 0; } >> static inline int tick_check_broadcast_expired(void) { return 0; } >> -static inline void tick_setup_hrtimer_broadcast(void) {}; >> +static inline int __init tick_setup_hrtimer_broadcast(void) { return 0; } > > With the initcall moved to the driver we have no external users of > tick_setup_hrtimer_broadcast, so I think we can remove the prototype > entirely from clockchips.h... > >> #endif >> >> diff --git a/kernel/time/tick-broadcast-hrtimer.c >> b/kernel/time/tick-broadcast-hrtimer.c >> index eb682d5..5c35995 100644 >> --- a/kernel/time/tick-broadcast-hrtimer.c >> +++ b/kernel/time/tick-broadcast-hrtimer.c >> @@ -98,9 +98,11 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t) >> return HRTIMER_RESTART; >> } >> >> -void tick_setup_hrtimer_broadcast(void) >> +int __init tick_setup_hrtimer_broadcast(void) > > ...and make it static here. Yep will do. Sorry I overlooked this. 
> >> { >> hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); >> bctimer.function = bc_handler; >> clockevents_register_device(&ce_broadcast_hrtimer); >> +return 0; >> } >> +early_initcall(tick_setup_hrtimer_broadcast); > > Otherwise this looks good to me, thanks for putting this together! Thanks a lot for the review! Will send out the patch with the above corrections. Regards Preeti U Murthy > > Mark. > ___ > Linuxppc-dev mailing list > linuxppc-...@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev >
[PATCH V2] tick-broadcast: Register for hrtimer based broadcast as the fallback broadcast mode
Commit 5d1638acb9f6 ('tick: Introduce hrtimer based broadcast') added a hrtimer based broadcast mode for those platforms in which local timers stop when CPUs enter deep idle states. The commit expected the platforms to register for this mode explicitly when they lacked a better external device to wake up CPUs in deep idle. Given that more platforms are beginning to use this mode, we can avoid the call to set it up on every platform that requires it, by registering for the hrtimer based broadcast mode in the core code if no better broadcast device is available. This commit also helps detect cases where the platform fails to register for a broadcast device but invokes the help of one when entering deep idle states. Currently we do not handle this situation at all and call the broadcast clock device without checking for its existence. This patch will handle such buggy cases properly. Signed-off-by: Preeti U Murthy --- Changes from V1: https://lkml.org/lkml/2014/12/5/261 1.Moved registering the hrtimer based broadcast from timekeeping code to an early_initcall. 
 arch/arm64/kernel/time.c             |    2 --
 arch/powerpc/kernel/time.c           |    1 -
 include/linux/clockchips.h           |    6 +++---
 kernel/time/tick-broadcast-hrtimer.c |    4 +++-
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/time.c b/arch/arm64/kernel/time.c
index 1a7125c..47baaa8 100644
--- a/arch/arm64/kernel/time.c
+++ b/arch/arm64/kernel/time.c
@@ -70,8 +70,6 @@ void __init time_init(void)
 	of_clk_init(NULL);
 	clocksource_of_init();
 
-	tick_setup_hrtimer_broadcast();
-
 	arch_timer_rate = arch_timer_get_rate();
 	if (!arch_timer_rate)
 		panic("Unable to initialise architected timer.\n");
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 7505599..51433a8 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -942,7 +942,6 @@ void __init time_init(void)
 	clocksource_init();
 
 	init_decrementer_clockevent();
-	tick_setup_hrtimer_broadcast();
 }
diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 2e4cb67..91754b0 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -187,11 +187,11 @@ extern int tick_receive_broadcast(void);
 #endif
 
 #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
-extern void tick_setup_hrtimer_broadcast(void);
+extern int __init tick_setup_hrtimer_broadcast(void);
 extern int tick_check_broadcast_expired(void);
 #else
 static inline int tick_check_broadcast_expired(void) { return 0; }
-static inline void tick_setup_hrtimer_broadcast(void) {};
+static inline int __init tick_setup_hrtimer_broadcast(void) { return 0; }
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
@@ -207,7 +207,7 @@ static inline void clockevents_resume(void) {}
 static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 static inline int tick_check_broadcast_expired(void) { return 0; }
-static inline void tick_setup_hrtimer_broadcast(void) {};
+static inline int __init tick_setup_hrtimer_broadcast(void) { return 0; }
 #endif
 
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index eb682d5..5c35995 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -98,9 +98,11 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t)
 	return HRTIMER_RESTART;
 }
 
-void tick_setup_hrtimer_broadcast(void)
+int __init tick_setup_hrtimer_broadcast(void)
 {
 	hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
 	bctimer.function = bc_handler;
 	clockevents_register_device(&ce_broadcast_hrtimer);
+	return 0;
 }
+early_initcall(tick_setup_hrtimer_broadcast);
Re: [PATCH] tick-broadcast: Register for hrtimer based broadcast as the default broadcast mode
On 12/05/2014 07:09 PM, Mark Rutland wrote:
> Hi Preeti,
>
> Moving this out of the architecture code looks good to me!
>
> I have a couple of minor comments below.
>>
>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index ec1791f..6044a51 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -1016,6 +1016,10 @@ void __init timekeeping_init(void)
>> 		boot.tv_sec = 0;
>> 		boot.tv_nsec = 0;
>> 	}
>> +	/* Register for hrtimer based broadcast as the default timekeeping
>> +	 * mode in deep idle states.
>> +	 */
>
> Nit: for code style this should have a newline after the '/*' (and we
> should probably have a newline before that anyway).
>
>> +	tick_setup_hrtimer_broadcast();
>
> We register the generic dummy timer via an early_initcall, which keeps
> all the logic in the dummy timer driver. Are we able to do the same of
> the broadcast hrtimer? Or is there some ordering constraint we need to
> meet?

Yes this is doable. There are no ordering constraints. Placing it in an early_initcall() will enable it to run before initializing SMP and cpuidle, which is perfect (we do not have any per-cpu initializations and we do not want cpus to begin entering idle states before this hrtimer is initialized). It also runs after the hrtimer/clockevents infrastructure has been initialized, which is good since we need them.

The broadcast hrtimer will thus only get registered as a broadcast device if no other clock device has managed to register itself as one.

Let me send out a V2.

Thanks

Regards
Preeti U Murthy
[PATCH] tick-broadcast: Register for hrtimer based broadcast as the default broadcast mode
Commit 5d1638acb9f62fa7 added a hrtimer based broadcast mode for those platforms in which local timers stop when CPUs enter deep idle states. The commit expected the platforms to register for this mode explicitly when they lacked a better external device to wake up CPUs in deep idle.

Given that more platforms are beginning to use this mode, we can avoid the call to set it up on every platform that requires it, by registering for the hrtimer based broadcast mode in the core code before clock devices begin to get initialized. So if there exists a better broadcast device, it will override the hrtimer based one; else there is a backup mechanism when a wakeup device is required.

This commit also helps detect cases where the platform fails to register for a broadcast device but invokes the help of one when entering deep idle states. Currently we do not handle this situation at all and invoke the help of the broadcast clock device without checking for its existence. Registering a default broadcast mode will handle such buggy cases properly.
Signed-off-by: Preeti U Murthy
---
 arch/arm64/kernel/time.c   |    2 --
 arch/powerpc/kernel/time.c |    1 -
 kernel/time/timekeeping.c  |    4 ++++
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/time.c b/arch/arm64/kernel/time.c
index 1a7125c..47baaa8 100644
--- a/arch/arm64/kernel/time.c
+++ b/arch/arm64/kernel/time.c
@@ -70,8 +70,6 @@ void __init time_init(void)
 	of_clk_init(NULL);
 	clocksource_of_init();
 
-	tick_setup_hrtimer_broadcast();
-
 	arch_timer_rate = arch_timer_get_rate();
 	if (!arch_timer_rate)
 		panic("Unable to initialise architected timer.\n");
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 7505599..51433a8 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -942,7 +942,6 @@ void __init time_init(void)
 	clocksource_init();
 
 	init_decrementer_clockevent();
-	tick_setup_hrtimer_broadcast();
 }
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index ec1791f..6044a51 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1016,6 +1016,10 @@ void __init timekeeping_init(void)
 		boot.tv_sec = 0;
 		boot.tv_nsec = 0;
 	}
+	/* Register for hrtimer based broadcast as the default timekeeping
+	 * mode in deep idle states.
+	 */
+	tick_setup_hrtimer_broadcast();
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
 	write_seqcount_begin(&tk_core.seq);
Re: [PATCH] arm: ls1021a: setup hrtimer based tick broadcast
On 12/05/2014 03:57 PM, Mark Rutland wrote:
> Hi,
>
> On Fri, Dec 05, 2014 at 09:07:29AM +, Jingchang Lu wrote:
>> This adds setup of hrtimer based tick broadcast to utilize
>> the hrtimer in SMP without other broadcast tick devices.
>>
>> Signed-off-by: Jingchang Lu
>
> This shouldn't live in board code.
>
> Similarly to what happens with the generic dummy timer on all arches
> that select it (and the broadcast hrtimer on all arches that use it so
> far), we should just unconditionally register the broadcast hrtimer.
>
> A real broadcast-capable clock event device will take preference if
> present, so we don't need to add board-specific code to only register
> this in certain conditions.

Hmm.. this makes sense to me. Let me send out a patch to do this.

Regards
Preeti U Murthy

> Mark.
>
>> ---
>>  arch/arm/mach-imx/mach-ls1021a.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/arch/arm/mach-imx/mach-ls1021a.c b/arch/arm/mach-imx/mach-ls1021a.c
>> index b89c858..4d074cf 100644
>> --- a/arch/arm/mach-imx/mach-ls1021a.c
>> +++ b/arch/arm/mach-imx/mach-ls1021a.c
>> @@ -7,6 +7,10 @@
>>   * (at your option) any later version.
>>   */
>>
>> +#include <linux/clockchips.h>
>> +#include <linux/clk-provider.h>
>> +#include <linux/clocksource.h>
>> +
>>  #include <asm/mach/arch.h>
>>
>>  #include "common.h"
>> @@ -16,7 +20,15 @@ static const char * const ls1021a_dt_compat[] __initconst = {
>>  	NULL,
>>  };
>>
>> +static void __init ls1021a_init_time(void)
>> +{
>> +	of_clk_init(NULL);
>> +	clocksource_of_init();
>> +	tick_setup_hrtimer_broadcast();
>> +}
>> +
>>  DT_MACHINE_START(LS1021A, "Freescale LS1021A")
>>  	.smp	= smp_ops(ls1021a_smp_ops),
>> +	.init_time = ls1021a_init_time,
>>  	.dt_compat = ls1021a_dt_compat,
>>  MACHINE_END
>> --
>> 1.8.0
>>
>>
>> ___
>> linux-arm-kernel mailing list
>> linux-arm-ker...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>>
>
Re: [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement
On 10/22/2014 11:37 AM, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as function of task queueing. The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.
>
> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.
>
> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?
> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the
> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.
>
> Also discussed at LPC14 is the fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.

I believe load balancing would be the right place to evaluate the frequency at which CPUs must run. find_busiest_group() is already iterating through all the CPUs and calculating the load on them. So this information is readily available, and all that remains is to find which of the CPUs in the group have their load > some_threshold, and queue a kthread on each such cpu to scale its frequency, while the current cpu continues with its load balancing.

There is another positive I see in evaluating cpu frequency in load balancing. The frequency at which load balancing is run is already optimized for scalability. One of the factors considered is that if any sibling cpu has carried out load balancing in the recent past, the current cpu defers doing the same. This means it is naturally ensured that only one cpu in the power domain takes care of frequency scaling each time, and there is no need for explicit synchronization between the policy cpus to do this.

Regards
Preeti U Murthy
Re: [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management
Hi,

I ran hackbench to evaluate this patchset and found good improvements in the results. I modified hackbench to take in a 'loops' parameter along with num_groups, which ensures that the test runs long enough to observe and debug issues. The idea was to find out how latency sensitive workloads can get affected by modifications in cpuidle heuristics, since it is easy to measure the impact on these workloads.

The experiment was conducted on a Power8 system with 1 socket and 6 cores on it. The first experiment was carried out by pinning hackbench to the first thread in each core while the rest of the smt threads were idle; this ensures the core entered deep idle states more often. Below are the results:

num_grps   %improvement with patchset
   3            3.6
   6           10.6
  12            5.0
  24            5.0

The second experiment was carried out by allowing hackbench to run on the smt threads of two cores, and the % improvement with the patchset was in the range of 4-7%.

I ran the experiments on the vanilla kernel. This means the performance improvement is primarily due to avoiding having to do a timebase sync by every thread in the core. The power numbers have very little variation between the runs with and without the patchset.

Thanks

Regards
Preeti U Murthy

On 11/25/2014 04:47 PM, Shreyas B. Prabhu wrote:
> Deep idle states like sleep and winkle are per core idle states. A core
> enters these states only when all the threads enter either the particular
> idle state or a deeper one. There are tasks like fastsleep hardware bug
> workaround and hypervisor core state save which have to be done only by
> the last thread of the core entering deep idle state and similarly tasks
> like timebase resync, hypervisor core register restore that have to be
> done only by the first thread waking up from these states.
>
> The current idle state management does not have a way to distinguish the
> first/last thread of the core waking/entering idle states. Tasks like
> timebase resync are done for all the threads.
> This not only is suboptimal,
> but can cause functionality issues when subcores are involved.
>
> Winkle is a deeper idle state compared to fastsleep. In this state the power
> supply to the chiplet, i.e. core, private L2 and private L3, is turned off.
> This results in a total hypervisor state loss. This patch set adds support
> for winkle and provides a way to track the idle states of the threads of the
> core and use it for idle state management of idle states sleep and winkle.
>
> Changes in v2:
> --
> - Using PNV_THREAD_NAP/SLEEP defines while calling power7_powersave_common
> - Comment changes based on review
> - Rebased on top of 3.18-rc6
>
> Cc: Benjamin Herrenschmidt
> Cc: Paul Mackerras
> Cc: Michael Ellerman
> Cc: Rafael J. Wysocki
> Cc: linux...@vger.kernel.org
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: Vaidyanathan Srinivasan
> Cc: Preeti U Murthy
>
> Paul Mackerras (1):
>   powerpc: powernv: Switch off MMU before entering nap/sleep/rvwinkle
>     mode
>
> Preeti U. Murthy (1):
>   powerpc/powernv: Enable Offline CPUs to enter deep idle states
>
> Shreyas B. Prabhu (2):
>   powernv: cpuidle: Redesign idle states management
>   powernv: powerpc: Add winkle support for offline cpus
>
>  arch/powerpc/include/asm/cpuidle.h             | 14 ++
>  arch/powerpc/include/asm/opal.h                | 13 +
>  arch/powerpc/include/asm/paca.h                |  6 +
>  arch/powerpc/include/asm/ppc-opcode.h          |  2 +
>  arch/powerpc/include/asm/processor.h           |  1 +
>  arch/powerpc/include/asm/reg.h                 |  4 +
>  arch/powerpc/kernel/asm-offsets.c              |  6 +
>  arch/powerpc/kernel/cpu_setup_power.S          |  4 +
>  arch/powerpc/kernel/exceptions-64s.S           | 30 ++-
>  arch/powerpc/kernel/idle_power7.S              | 332 +
>  arch/powerpc/platforms/powernv/opal-wrappers.S | 39 +++
>  arch/powerpc/platforms/powernv/powernv.h       |  2 +
>  arch/powerpc/platforms/powernv/setup.c         | 160
>  arch/powerpc/platforms/powernv/smp.c           | 10 +-
>  arch/powerpc/platforms/powernv/subcore.c       | 34 +++
>  arch/powerpc/platforms/powernv/subcore.h       |  1 +
>  drivers/cpuidle/cpuidle-powernv.c              | 10 +-
>  17 files changed, 608 insertions(+), 60 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/cpuidle.h
Re: [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management
Hi, I ran hackbench to evaluate this patchset and found good improvements in the results. I modified hackbench to take in a 'loops' parameter along with num_groups which ensures that the test runs long enough to observe and debug issues. The idea was to find out how latency sensitive workloads can get affected by modification in cpuidle heuristics since it is easy to measure the impact on these workloads. The experiment was conducted on a Power8 system with 1 socket and 6 cores on it. The first experiment was carried out by pinning hackbench to the first thread in each core while the rest of the smt threads were idle and below are the results. This would ensure the core entered deep idle states more often. num_grps %improvement with patchset 33.6 6 10.6 12 5.0 24 5.0 The second experiment was carried out by allowing hackbench to run on the smt threads of two cores and % improvement with the patchset was in range of 4-7%. I ran the experiments on the vanilla kernel. This means the performance improvements is primarily due to avoiding having to do a timebase sync by every thread in the core. The power numbers have very little variation between the runs with and without the patchset. Thanks Regards Preeti U Murthy On 11/25/2014 04:47 PM, Shreyas B. Prabhu wrote: Deep idle states like sleep and winkle are per core idle states. A core enters these states only when all the threads enter either the particular idle state or a deeper one. There are tasks like fastsleep hardware bug workaround and hypervisor core state save which have to be done only by the last thread of the core entering deep idle state and similarly tasks like timebase resync, hypervisor core register restore that have to be done only by the first thread waking up from these states. The current idle state management does not have a way to distinguish the first/last thread of the core waking/entering idle states. Tasks like timebase resync are done for all the threads. 
This is not only is suboptimal, but can cause functionality issues when subcores are involved. Winkle is deeper idle state compared to fastsleep. In this state the power supply to the chiplet, i.e core, private L2 and private L3 is turned off. This results in a total hypervisor state loss. This patch set adds support for winkle and provides a way to track the idle states of the threads of the core and use it for idle state management of idle states sleep and winkle. Changes in v2: -- -Using PNV_THREAD_NAP/SLEEP defines while calling power7_powersave_common -Comment changes based on review -Rebased on top of 3.18-rc6 Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Paul Mackerras pau...@samba.org Cc: Michael Ellerman m...@ellerman.id.au Cc: Rafael J. Wysocki r...@rjwysocki.net Cc: linux...@vger.kernel.org Cc: linuxppc-...@lists.ozlabs.org Cc: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Cc: Preeti U Murthy pre...@linux.vnet.ibm.com Paul Mackerras (1): powerpc: powernv: Switch off MMU before entering nap/sleep/rvwinkle mode Preeti U. Murthy (1): powerpc/powernv: Enable Offline CPUs to enter deep idle states Shreyas B. 
Prabhu (2):
  powernv: cpuidle: Redesign idle states management
  powernv: powerpc: Add winkle support for offline cpus

 arch/powerpc/include/asm/cpuidle.h | 14 ++
 arch/powerpc/include/asm/opal.h| 13 +
 arch/powerpc/include/asm/paca.h| 6 +
 arch/powerpc/include/asm/ppc-opcode.h | 2 +
 arch/powerpc/include/asm/processor.h | 1 +
 arch/powerpc/include/asm/reg.h | 4 +
 arch/powerpc/kernel/asm-offsets.c | 6 +
 arch/powerpc/kernel/cpu_setup_power.S | 4 +
 arch/powerpc/kernel/exceptions-64s.S | 30 ++-
 arch/powerpc/kernel/idle_power7.S | 332 +
 arch/powerpc/platforms/powernv/opal-wrappers.S | 39 +++
 arch/powerpc/platforms/powernv/powernv.h | 2 +
 arch/powerpc/platforms/powernv/setup.c | 160
 arch/powerpc/platforms/powernv/smp.c | 10 +-
 arch/powerpc/platforms/powernv/subcore.c | 34 +++
 arch/powerpc/platforms/powernv/subcore.h | 1 +
 drivers/cpuidle/cpuidle-powernv.c | 10 +-
 17 files changed, 608 insertions(+), 60 deletions(-)
 create mode 100644 arch/powerpc/include/asm/cpuidle.h

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] clockevents: introduce ->set_dev_mode() which can return error
This looks good to me. On 11/20/2014 12:09 AM, Kevin Hilman wrote: > From: Viresh Kumar > > Currently, the ->set_mode() method of a clockevent device is not > allowed to fail, so it has no return value. In order to add new > clockevent modes, and allow the setting of those modes to fail, we > need the clockevent core to be able to detect when setting a mode > fails. > > To allow detection of mode setting failures, introduce a new method > ->set_dev_mode() which can return an error if the given mode is not > supported or fails. > > Since all current modes are still not allowed to fail, the core code > simply WARNs if any modes fail. Future patches that add new mode > support will add proper error recovery based on failure conditions. > > Signed-off-by: Viresh Kumar > Reviewed-by: Kevin Hilman > [khilman: rework changelogs, minor formatting/checkpatch cleanups] > Signed-off-by: Kevin Hilman > --- > include/linux/clockchips.h | 5 - > kernel/time/clockevents.c | 21 ++--- > kernel/time/timer_list.c | 5 - > 3 files changed, 26 insertions(+), 5 deletions(-) > > diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h > index 2e4cb67f6e56..d969659cf688 100644 > --- a/include/linux/clockchips.h > +++ b/include/linux/clockchips.h > @@ -81,7 +81,8 @@ enum clock_event_mode { > * @mode:operating mode assigned by the management code > * @features:features > * @retries: number of forced programming retries > - * @set_mode:set mode function > + * @set_dev_mode:set dev mode function > + * @set_mode:set mode function (deprecated, use set_dev_mode > instead) > * @broadcast: function to broadcast events > * @min_delta_ticks: minimum delta value in ticks stored for reconfiguration > * @max_delta_ticks: maximum delta value in ticks stored for reconfiguration > @@ -109,6 +110,8 @@ struct clock_event_device { > unsigned long retries; > > void(*broadcast)(const struct cpumask *mask); > + int (*set_dev_mode)(enum clock_event_mode mode, > + struct clock_event_device *); > 
void(*set_mode)(enum clock_event_mode mode, > struct clock_event_device *); > void(*suspend)(struct clock_event_device *); > diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c > index 9c94c19f1305..1d5ce3ce9228 100644 > --- a/kernel/time/clockevents.c > +++ b/kernel/time/clockevents.c > @@ -105,7 +105,16 @@ void clockevents_set_mode(struct clock_event_device *dev, >enum clock_event_mode mode) > { > if (dev->mode != mode) { > - dev->set_mode(mode, dev); > + if (dev->set_dev_mode) { > + int ret = dev->set_dev_mode(mode, dev); > + > + /* Currently available modes shouldn't fail */ > + WARN_ONCE(ret, "Requested mode: %d, error: %d\n", > + mode, ret); > + } else { > + dev->set_mode(mode, dev); > + } > + > dev->mode = mode; > > /* > @@ -448,8 +457,14 @@ int __clockevents_update_freq(struct clock_event_device > *dev, u32 freq) > if (dev->mode == CLOCK_EVT_MODE_ONESHOT) > return clockevents_program_event(dev, dev->next_event, false); > > - if (dev->mode == CLOCK_EVT_MODE_PERIODIC) > - dev->set_mode(CLOCK_EVT_MODE_PERIODIC, dev); > + if (dev->mode == CLOCK_EVT_MODE_PERIODIC) { > + /* Shouldn't fail while switching to PERIODIC MODE */ > + if (dev->set_dev_mode) > + WARN_ON_ONCE(dev->set_dev_mode(CLOCK_EVT_MODE_PERIODIC, > +dev)); > + else > + dev->set_mode(CLOCK_EVT_MODE_PERIODIC, dev); > + } > > return 0; > } > diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c > index 61ed862cdd37..957a04951ef0 100644 > --- a/kernel/time/timer_list.c > +++ b/kernel/time/timer_list.c > @@ -229,7 +229,10 @@ print_tickdevice(struct seq_file *m, struct tick_device > *td, int cpu) > SEQ_printf(m, "\n"); > > SEQ_printf(m, " set_mode: "); > - print_name_offset(m, dev->set_mode); > + if (dev->set_dev_mode) > + print_name_offset(m, dev->set_dev_mode); > + else > + print_name_offset(m, dev->set_mode); > SEQ_printf(m, "\n"); > > SEQ_printf(m, " event_handler: "); > Reviewed-by: Preeti U Murthy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in 
the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cpuidle/powernv: Re-enable fastsleep at boot time
Hi Joel,

On 11/19/2014 08:04 AM, Joel Stanley wrote: > Hey Preeti, > > On Tue, Nov 18, 2014 at 5:26 PM, Preeti U Murthy > wrote: >> Commit dcb18694 "Fix ipi on palmeto" disabled fastsleep at boot time. > > I couldn't find this commit in any tree; upstream, mpe's next, nor powerkvm. Oh yes, you are right. I must have been looking at the wrong git branch. I verified upstream now; the commit dcb18694 is not present. So we are good. Thanks for pointing this out :) Sorry for the noise! > > I remember testing this as a workaround for the palmetto, but I don't > recall sending it upstream. > > Cheers, > > Joel >

Regards
Preeti U Murthy
[PATCH] cpuidle/powernv: Re-enable fastsleep at boot time
Commit dcb18694 "Fix ipi on palmeto" disabled fastsleep at boot time. Revert this commit since we no longer need this fix. However we can continue to use the powersave_nap parameter to control entry into deep idle states beyond snooze.

Signed-off-by: Preeti U. Murthy
---
 drivers/cpuidle/cpuidle-powernv.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index b57681d..b430e76 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -69,8 +69,9 @@ static int fastsleep_loop(struct cpuidle_device *dev,
 	unsigned long old_lpcr = mfspr(SPRN_LPCR);
 	unsigned long new_lpcr;

-	if (powersave_nap < 2)
-		return;
+	if (!(powersave_nap > 0))
+		return index;
+
 	if (unlikely(system_state < SYSTEM_RUNNING))
 		return index;
--
Re: [PATCH 3/4] powernv: cpuidle: Redesign idle states management
^^ in the changelog, you do not need to elaborate it here. > + * core_idle_state - First 8 bits track the idle state of each thread > + * of the core. The 8th bit is the lock bit. Initially all thread bits > + * are set. They are cleared when the thread enters deep idle state > + * like sleep and winkle. Initially the lock bit is cleared. you can simply have the comment about the bits of core_idle_state without having to mention when they are cleared etc. > + * The lock bit has 2 purposes > + * a. While the first thread is restoring core state, it prevents > + * other threads in the core from switching to process context. > + * b. While the last thread in the core is saving the core state, it > + * prevents a different thread from waking up. The above two points are useful. As far as I can see, besides explaining the bits of the core_idle_state structure and the purpose of the lock bit, the rest of the comment is redundant. A git-blame will let people know why all this is needed. The comment section should not be used up for this purpose IMO.

Regards
Preeti U Murthy
Re: [PATCH V3 1/6] sched: idle: Add a weak arch_cpu_idle_poll function
On 11/10/2014 08:47 PM, Peter Zijlstra wrote: > On Mon, Nov 10, 2014 at 07:50:22PM +0530, Preeti U Murthy wrote: >> Hi Peter, >> >> On 11/10/2014 05:59 PM, Peter Zijlstra wrote: >>> On Fri, Nov 07, 2014 at 03:31:22PM +0100, Daniel Lezcano wrote: >>>> The poll function is called when a timer expired or if we force to poll >>>> when >>>> the cpu_idle_force_poll option is set. >>>> >>>> The poll function does: >>>> >>>>local_irq_enable(); >>>>while (!tif_need_resched()) >>>>cpu_relax(); >>>> >>>> This default poll function suits for the x86 arch because its rep; nop; >>>> hardware power optimization. But on other archs, this optimization does not >>>> exists and we are not saving power. The arch specific bits may want to >>>> optimize this loop by adding their own optimization. >>> >>> This doesn't make sense to me; should an arch not either implement an >>> actual idle driver or implement cpu_relax() properly, why allow for a >>> third weird option? >>> >> >> The previous version of this patch simply invoked cpu_idle_loop() for >> cases where latency_req was 0. This would have changed the behavior >> on PowerPC wherein earlier the 0th idle index was returned which is also >> a polling loop but differs from cpu_idle_loop() in two ways: >> >> a. It polls at a relatively lower power state than cpu_relax(). >> b. We set certain registers to indicate that the cpu is idle. > > So I'm confused; the current code runs the generic cpu_relax idle poll > loop for the broadcast case. I suppose you want to retain this because > not doing your a-b above will indeed give you a lower latency. > > Therefore one could argue that latency_req==0 should indeed use this, > and your a-b idle state should be latency_req==1 or higher. > > Thus yes it changes behaviour, but I think it actually fixes something. > You cannot have a latency_req==0 state which has higher latency than the > actual polling loop, as you appear to have. Yes you are right. This is fixing the current behavior. 
So we should be good to call cpuidle_idle_loop() when latency_req=0. > >> Hence for all such cases wherein the cpu is required to poll while idle >> (only for cases such as force poll, broadcast ipi to arrive soon and >> latency_req = 0), we should be able to call into cpuidle_idle_loop() >> only if the cpuidle driver's 0th idle state has an exit_latency > 0. >> (The 0th idle state is expected to be a polling loop with >> exit_latency = 0). >> >> If otherwise, it would mean the driver has an optimized polling loop >> when idle. But instead of adding in the logic of checking the >> exit_latency, we thought it would be simpler to call into an arch >> defined polling idle loop under the above circumstances. If that is no >> better we could fall back to cpuidle_idle_loop(). > > That still doesn't make sense to me; suppose the implementation of this > special poll state differs on different uarchs for the same arch, then > we'll end up with another registration and selection interface parallel > to cpuidle. Yeah this will only get complicated. I was trying to see how to keep the current behavior unchanged but looks like it is not worth it. Thanks Regards Preeti U Murthy > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V3 1/6] sched: idle: Add a weak arch_cpu_idle_poll function
Hi Peter, On 11/10/2014 05:59 PM, Peter Zijlstra wrote: > On Fri, Nov 07, 2014 at 03:31:22PM +0100, Daniel Lezcano wrote: >> The poll function is called when a timer expired or if we force to poll when >> the cpu_idle_force_poll option is set. >> >> The poll function does: >> >>local_irq_enable(); >>while (!tif_need_resched()) >>cpu_relax(); >> >> This default poll function suits for the x86 arch because its rep; nop; >> hardware power optimization. But on other archs, this optimization does not >> exists and we are not saving power. The arch specific bits may want to >> optimize this loop by adding their own optimization. > > This doesn't make sense to me; should an arch not either implement an > actual idle driver or implement cpu_relax() properly, why allow for a > third weird option? > The previous version of this patch simply invoked cpu_idle_loop() for cases where latency_req was 0. This would have changed the behavior on PowerPC wherein earlier the 0th idle index was returned which is also a polling loop but differs from cpu_idle_loop() in two ways: a. It polls at a relatively lower power state than cpu_relax(). b. We set certain registers to indicate that the cpu is idle. Hence for all such cases wherein the cpu is required to poll while idle (only for cases such as force poll, broadcast ipi to arrive soon and latency_req = 0), we should be able to call into cpuidle_idle_loop() only if the cpuidle driver's 0th idle state has an exit_latency > 0. (The 0th idle state is expected to be a polling loop with exit_latency = 0). If otherwise, it would mean the driver has an optimized polling loop when idle. But instead of adding in the logic of checking the exit_latency, we thought it would be simpler to call into an arch defined polling idle loop under the above circumstances. If that is no better we could fall back to cpuidle_idle_loop(). 
Regards
Preeti U Murthy
Re: [PATCH] cpuidle/powernv: Fix return value of idle index in fastsleep
Hi Ben, Mikey,

Can you please take a look at this patch? I feel the commit dcb1869 was just a plug-in solution. Shouldn't we take a look at this again? Ideally we should not need this check. With the cpuidle redesign on PowerNV going in, we can enable fastsleep at bootup itself and without checks on any debug parameters such as powersave_nap. We will then only need to check for powersave_nap == 0 and return only if that is the case. This check is still required since the user can disable all deep idle states by setting powersave_nap to 0.

Regards
Preeti U Murthy

On 10/27/2014 06:56 PM, Preeti U Murthy wrote: > Commit dcb18694 : Fix ipi on Palmeto added a workaround to disable > going into fastsleep on Palmeto boards which was reported to fail > to boot when fastsleep was enabled. However it missed returning > an idle index. Fix this. > > There is probably no harm in returning the index of fastsleep > although the cpu did not enter this state. The governor will notice > the difference in the residency time in the idle state and the > target residency of the idle state and prevent cpus from entering > fastsleep from that point on. Hence the usage and time statistics > logged for fastsleep will not be out of sync with reality except > for the first entry. > > Signed-off-by: Preeti U Murthy > --- > Do we still need this workaround? Or can we get rid of the check > on powersave_nap altogether?
> --- > > drivers/cpuidle/cpuidle-powernv.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/cpuidle/cpuidle-powernv.c > b/drivers/cpuidle/cpuidle-powernv.c > index fa79392..c18da24 100644 > --- a/drivers/cpuidle/cpuidle-powernv.c > +++ b/drivers/cpuidle/cpuidle-powernv.c > @@ -70,7 +70,7 @@ static int fastsleep_loop(struct cpuidle_device *dev, > unsigned long new_lpcr; > > if (powersave_nap < 2) > - return; > + return index; > if (unlikely(system_state < SYSTEM_RUNNING)) > return index; > > > ___ > Linuxppc-dev mailing list > linuxppc-...@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev >
Re: [PATCH V3 3/6] sched: idle: Get the next timer event and pass it the cpuidle framework
On 11/07/2014 08:01 PM, Daniel Lezcano wrote: > Following the logic of the previous patch, retrieve from the idle task the > expected timer sleep duration and pass it to the cpuidle framework. > > Take the opportunity to remove the unused headers in the menu.c file. > > This patch does not change the current behavior. > > Signed-off-by: Daniel Lezcano > Acked-by: Nicolas Pitre > Reviewed-by: Len Brown > --- This patch looks good to me as well. Reviewed-by: Preeti U. Murthy
Re: [PATCH V3 4/6] cpuidle: idle: menu: Don't reflect when a state selection failed
On 11/07/2014 08:01 PM, Daniel Lezcano wrote:
> In the current code, the check to reflect or not the outcoming state is done
> against the idle state which has been chosen and its value.
>
> Instead of doing a check in each of the reflect functions, just don't call
> reflect if something went wrong in the idle path.
>
> Signed-off-by: Daniel Lezcano
> Acked-by: Nicolas Pitre
> ---
>  drivers/cpuidle/governors/ladder.c | 3 +--
>  drivers/cpuidle/governors/menu.c   | 3 +--
>  kernel/sched/idle.c                | 3 ++-
>  3 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
> index 51c9ccd..86e4d49 100644
> --- a/drivers/cpuidle/governors/ladder.c
> +++ b/drivers/cpuidle/governors/ladder.c
> @@ -165,8 +165,7 @@ static int ladder_enable_device(struct cpuidle_driver *drv,
>  static void ladder_reflect(struct cpuidle_device *dev, int index)
>  {
>  	struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
> -	if (index > 0)
> -		ldev->last_state_idx = index;
> +	ldev->last_state_idx = index;
>  }
>
>  static struct cpuidle_governor ladder_governor = {
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 91b3000..2e4a315 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -366,8 +366,7 @@ static void menu_reflect(struct cpuidle_device *dev, int index)
>  {
>  	struct menu_device *data = this_cpu_ptr(&menu_devices);
>  	data->last_state_idx = index;
> -	if (index >= 0)
> -		data->needs_update = 1;
> +	data->needs_update = 1;
>  }
>
>  /**
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 0a7a1d1..0ae1cc8 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -173,7 +173,8 @@ use_default:
>  	/*
>  	 * Give the governor an opportunity to reflect on the outcome
>  	 */
> -	cpuidle_reflect(dev, entered_state);
> +	if (entered_state >= 0)
> +		cpuidle_reflect(dev, entered_state);
>
>  exit_idle:
>  	__current_set_polling();
Reviewed-by: Preeti U. Murthy
Re: [PATCH V3 2/6] sched: idle: cpuidle: Check the latency req before idle
On 11/07/2014 08:01 PM, Daniel Lezcano wrote:
> When the pmqos latency requirement is set to zero that means "poll in all
> the cases".
>
> That is correctly implemented on x86 but not on the other archs.
>
> As the code is written, if the latency request is zero, the governor will
> return zero, corresponding, for x86, to the poll function, but for the
> other archs to the default idle function. For example, on ARM this is
> wait-for-interrupt with a latency of '1', so violating the constraint.
>
> In order to fix that, do the latency requirement check *before* calling the
> cpuidle framework in order to jump to the poll function without entering
> cpuidle. That has several benefits:
>
> 1. It clarifies and unifies the code
> 2. It fixes x86 vs other archs behavior
> 3. Factors out the call to the same function
> 4. Prevents entering the cpuidle framework with its expensive cost in
>    calculation
>
> As the latency_req is needed in all the cases, change the select API to take
> the latency_req as parameter in case it is not equal to zero.
>
> As a positive side effect, it introduces the latency constraint specified
> externally, so one more step to the cpuidle/scheduler integration.
>
> Signed-off-by: Daniel Lezcano
> Acked-by: Nicolas Pitre
> Acked-by: Peter Zijlstra (Intel)
> Reviewed-by: Len Brown
> ---

Reviewed-by: Preeti U Murthy

Regards
Preeti U Murthy
Re: [PATCH V3 1/6] sched: idle: Add a weak arch_cpu_idle_poll function
On 11/07/2014 08:01 PM, Daniel Lezcano wrote:
> The poll function is called when a timer expired or if we force to poll when
> the cpu_idle_force_poll option is set.
>
> The poll function does:
>
>    local_irq_enable();
>    while (!tif_need_resched())
>        cpu_relax();
>
> This default poll function suits the x86 arch because of its "rep; nop;"
> hardware power optimization. But on other archs, this optimization does not
> exist and we are not saving power. The arch specific bits may want to
> optimize this loop by adding their own optimization.
>
> Give an opportunity to the different platforms to specify their own polling
> loop by adding a weak cpu_idle_poll_loop function.
>
> Signed-off-by: Daniel Lezcano
> ---
>  kernel/sched/idle.c | 29 +
>  1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index c47fce7..ea65bbae 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -42,18 +42,6 @@ static int __init cpu_idle_nopoll_setup(char *__unused)
>  __setup("hlt", cpu_idle_nopoll_setup);
>  #endif
>
> -static inline int cpu_idle_poll(void)
> -{
> -	rcu_idle_enter();
> -	trace_cpu_idle_rcuidle(0, smp_processor_id());
> -	local_irq_enable();
> -	while (!tif_need_resched())
> -		cpu_relax();
> -	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
> -	rcu_idle_exit();
> -	return 1;
> -}
> -
>  /* Weak implementations for optional arch specific functions */
>  void __weak arch_cpu_idle_prepare(void) { }
>  void __weak arch_cpu_idle_enter(void) { }
> @@ -65,6 +53,23 @@ void __weak arch_cpu_idle(void)
>  	local_irq_enable();
>  }
>
> +void __weak arch_cpu_idle_poll(void)
> +{
> +	local_irq_enable();
> +	while (!tif_need_resched())
> +		cpu_relax();
> +}
> +
> +static inline int cpu_idle_poll(void)
> +{
> +	rcu_idle_enter();
> +	trace_cpu_idle_rcuidle(0, smp_processor_id());
> +	arch_cpu_idle_poll();
> +	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
> +	rcu_idle_exit();
> +	return 1;
> +}
> +
>  /**
>   * cpuidle_idle_call - the main idle function
>   *

Thanks Daniel!

Reviewed-by: Preeti U. Murthy
Re: [PATCH V2 1/5] sched: idle: cpuidle: Check the latency req before idle
On 11/06/2014 07:12 PM, Daniel Lezcano wrote:
> Preeti,
>
> I am wondering if we aren't going to a false debate.
>
> If the latency_req is 0, we should just poll and not enter any idle
> state even if one has zero exit latency. With a zero latency req, we
> want full reactivity on the system, not enter an idle state with all the
> computation in the menu governor, no?
>
> I agree this patch changes the behavior on PowerPC, but only if the
> latency_req is set to zero. I don't think we are worried about power
> saving when setting this value.
>
> Couldn't the patch be accepted as it is for the sake of consistency on all
> the platforms, and then we optimize cleanly for the special latency zero
> case?

Alright Daniel, you can go ahead. I was thinking this patch through and now realize that, as you point out, the logic will only get complicated with all the additional hacks.

But would it be possible to add the weak arch_cpu_idle_loop() call for the cases where the latency requirement is 0, as you had suggested earlier? This would ensure the polling logic does not break on PowerPC and we don't even bother the governor. I will add the function in the core PowerPC code. If the arch does not define this function, it will fall back to cpu_idle_loop(). Fair enough?

Regards
Preeti U Murthy
Re: [PATCH V2 1/5] sched: idle: cpuidle: Check the latency req before idle
On 11/06/2014 05:57 PM, Daniel Lezcano wrote:
> On 11/06/2014 05:08 AM, Preeti U Murthy wrote:
>> On 11/05/2014 07:58 PM, Daniel Lezcano wrote:
>>> On 10/29/2014 03:01 AM, Preeti U Murthy wrote:
>>>> On 10/29/2014 12:29 AM, Daniel Lezcano wrote:
>>>>> On 10/28/2014 04:51 AM, Preeti Murthy wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> On Thu, Oct 23, 2014 at 2:31 PM, Daniel Lezcano wrote:
>>>>>>> When the pmqos latency requirement is set to zero that means "poll
>>>>>>> in all the cases".
>>>>>>>
>>>>>>> That is correctly implemented on x86 but not on the other archs.
>>>>>>>
>>>>>>> As the code is written, if the latency request is zero, the governor
>>>>>>> will return zero, corresponding, for x86, to the poll function, but
>>>>>>> for the other archs to the default idle function. For example, on
>>>>>>> ARM this is wait-for-interrupt with a latency of '1', so violating
>>>>>>> the constraint.
>>>>>>
>>>>>> This is not true actually. On PowerPC the idle state 0 has an
>>>>>> exit_latency of 0.
>>>>>>
>>>>>>> In order to fix that, do the latency requirement check *before*
>>>>>>> calling the cpuidle framework in order to jump to the poll function
>>>>>>> without entering cpuidle. That has several benefits:
>>>>>>
>>>>>> Doing so actually hurts on PowerPC, because the idle loop defined for
>>>>>> idle state 0 is different from what cpu_relax() does in
>>>>>> cpu_idle_loop(). The spinning is more power efficient in the former
>>>>>> case. Moreover we also set certain register values which indicate an
>>>>>> idle cpu; the ppc_runlatch bits do precisely this. These register
>>>>>> values are being read by some user space tools, so we will end up
>>>>>> breaking them with this patch.
>>>>>>
>>>>>> My suggestion is: by all means keep the latency requirement check in
>>>>>> kernel/sched/idle.c as you do in this patch, but before jumping to
>>>>>> cpu_idle_loop verify that idle state 0 has an exit_latency > 0 in
>>>>>> addition to your check on latency_req == 0. If not, fall through to
>>>>>> the regular path of calling into the cpuidle driver. The scheduler
>>>>>> can query the cpuidle_driver structure anyway.
>>>>>>
>>>>>> What do you think?
>>>>>
>>>>> Thanks for reviewing the patch and spotting this.
>>>>>
>>>>> Wouldn't it make sense to create:
>>>>>
>>>>>     void __weak cpu_idle_poll(void) ?
>>>>>
>>>>> and override it with your specific poll function?
>>>>
>>>> No, this would become ugly as far as I can see. A weak function has to
>>>> be defined under arch/* code. We will either need to duplicate the
>>>> idle loop that we already have in the drivers or point the weak
>>>> function to the first idle state defined by our driver, neither of
>>>> which is desirable (calling into the driver from arch code is ugly).
>>>> Another reason I don't like the idea of a weak function is that if you
>>>> miss looking at a specific driver that has an idle loop with features
>>>> similar to powerpc's, you will have to spot it yourself and include
>>>> the arch specific cpu_idle_poll() for it.
>>>
>>> Yes, I agree this is a fair point. But actually I don't see the
>>> interest of having the poll loop in the cpuidle driver.
>>
>> We can't do that simply because the idle poll loop has arch specific
>> bits on powerpc.
>
> I am not sure.
>
> Could you describe what is the difference between the arch_cpu_idle
> function in arch/powerpc/kernel/idle.c and the 0th PowerPC idle state?

arch_cpu_idle() is the arch specific idle routine; it goes into a deeper idle state. I am guessing you meant to ask the difference between the PowerPC 0th idle state and the polling logic in cpu_idle_poll(). The 0th idle state is also a polling loop; additionally it sets a couple of registers to indicate idleness.

> Is it kind of duplicate? And for polling, do you really want to use
> while (...) cpu_relax(); as it is x86 specific, instead of the powerpc's
> arch_idle? Today, if latency_req == 0, it returns the 0th idle state, so
> polling. If we jump to arch_cpu_idle_poll, the result will be the same
> for all architectures.

So you propose creating a weak arch_cpu_idle_poll()? Ok, if it is going to make the cleanup easier, go ahead. I can add arch_cpu_idle_poll() in the core code on powerpc.

>>> These cleanups are preparing the removal of the
>>> CPUIDLE_DRIVER_STATE_START macro which leads to a lot of mess in the
>>> cpuidle code.
>>
>> How is the suggestion to check the exit_latency of idle state 0 when
>> latency_req == 0 going to hinder this removal?
>
> It sounds a bit hackish. I prefer to sort out the current situation.
>
> And by the way, what is the reasoning behind having a target_residency /
> exit_latency equal to zero for an idle state?

It's a polling idle state, hence the exit_latency is 0.

Regards
Preeti U Murthy
Re: [PATCH V2 1/5] sched: idle: cpuidle: Check the latency req before idle
On 11/06/2014 07:12 PM, Daniel Lezcano wrote: Preeti, I am wondering if we aren't going to a false debate. If the latency_req is 0, we should just poll and not enter in any idle state even if one has zero exit latency. With a zero latency req, we want full reactivity on the system, not enter an idle state with all the computation in the menu governor, no ? I agree this patch changes the behavior on PowerPC, but only if the latency_req is set to zero. I don't think we are worried about power saving when setting this value. Couldn't the patch accepted as it is for the sake of consistency on all the platform and then we optimize cleanly for the special latency zero case ? Alright Daniel, you can go ahead. I was thinking this patch through and now realize that, like you point out the logic will only get complicated with all the additional hack. But would it be possible to add the weak arch_cpu_idle_loop() call for the cases where latency requirement is 0 like you had suggested earlier ? This would ensure the polling logic does not break on PowerPC and we don't bother the governor even. I will add the function in the core PowerPC code. If arch does not define this function it will fall back to cpu_idle_loop(). Fair enough? Regards Preeti U Murthy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 1/5] sched: idle: cpuidle: Check the latency req before idle
On 11/05/2014 07:58 PM, Daniel Lezcano wrote: > On 10/29/2014 03:01 AM, Preeti U Murthy wrote: >> On 10/29/2014 12:29 AM, Daniel Lezcano wrote: >>> On 10/28/2014 04:51 AM, Preeti Murthy wrote: >>>> Hi Daniel, >>>> >>>> On Thu, Oct 23, 2014 at 2:31 PM, Daniel Lezcano >>>> wrote: >>>>> When the pmqos latency requirement is set to zero that means "poll in >>>>> all the >>>>> cases". >>>>> >>>>> That is correctly implemented on x86 but not on the other archs. >>>>> >>>>> As how is written the code, if the latency request is zero, the >>>>> governor will >>>>> return zero, so corresponding, for x86, to the poll function, but for >>>>> the >>>>> others arch the default idle function. For example, on ARM this is >>>>> wait-for- >>>>> interrupt with a latency of '1', so violating the constraint. >>>> >>>> This is not true actually. On PowerPC the idle state 0 has an >>>> exit_latency of 0. >>>> >>>>> >>>>> In order to fix that, do the latency requirement check *before* >>>>> calling the >>>>> cpuidle framework in order to jump to the poll function without >>>>> entering >>>>> cpuidle. That has several benefits: >>>> >>>> Doing so actually hurts on PowerPC. Because the idle loop defined for >>>> idle state 0 is different from what cpu_relax() does in >>>> cpu_idle_loop(). >>>> The spinning is more power efficient in the former case. Moreover we >>>> also set >>>> certain register values which indicate an idle cpu. The ppc_runlatch >>>> bits >>>> do precisely this. These register values are being read by some user >>>> space >>>> tools. So we will end up breaking them with this patch >>>> >>>> My suggestion is very well keep the latency requirement check in >>>> kernel/sched/idle.c >>>> like your doing in this patch. But before jumping to cpu_idle_loop >>>> verify if the >>>> idle state 0 has an exit_latency > 0 in addition to your check on the >>>> latency_req == 0. >>>> If not, you can fall through to the regular path of calling into the >>>> cpuidle driver. 
>>>> The scheduler can query the cpuidle_driver structure anyway. >>>> >>>> What do you think? >>> >>> Thanks for reviewing the patch and spotting this. >>> >>> Wouldn't make sense to create: >>> >>> void __weak_cpu_idle_poll(void) ? >>> >>> and override it with your specific poll function ? >>> >> >> No this would become ugly as far as I can see. A weak function has to be >> defined under arch/* code. We will either need to duplicate the idle >> loop that we already have in the drivers or point the weak function to >> the first idle state defined by our driver. Both of which is not >> desirable (calling into the driver from arch code is ugly). Another >> reason why I don't like the idea of a weak function is that if you have >> missed looking at a specific driver and they have an idle loop with >> features similar to on powerpc, you will have to spot it yourself and >> include the arch specific cpu_idle_poll() for them. > > Yes, I agree this is a fair point. But actually I don't see the interest > of having the poll loop in the cpuidle driver. These cleanups are We can't do that simply because the idle poll loop has arch specific bits on powerpc. > preparing the removal of the CPUIDLE_DRIVER_STATE_START macro which > leads to a lot of mess in the cpuidle code. How is the suggestion to check the exit_latency of idle state 0 when latency_req == 0 going to hinder this removal? > > With the removal of this macro, we should be able to move the select > loop from the menu governor and use it everywhere else. Furthermore, > this state which is flagged with TIME_VALID, isn't because the local > interrupt are enabled so we are measuring the interrupt time processing. > Beside that the idle loop for x86 is mostly not used. > > So the idea would be to extract those idle loop from the drivers and use > them directly when: > 1. the idle selection fails (use the poll loop under certain > circumstances we have to redefine) This behavior will not change as per my suggestion. > 2. 
when the latency req is zero

It's only here that I suggested you also verify state 0's exit_latency, for the reason that the arch may have a more optimized idle poll loop, which we cannot override with the generic cpuidle poll loop.

Regards
Preeti U Murthy

> That will result in cleaner code in cpuidle and in the governor.
>
> Do you agree with that?

>> But by having a check on the exit_latency, you are claiming that since
>> the driver's 0th idle state is no better than the generic idle loop in
>> cases of 0 latency req, we are better off calling the latter, which
>> looks reasonable. That way you don't have to bother about worsening the
>> idle loop behavior on any other driver.
Re: [PATCH V2 1/5] sched: idle: cpuidle: Check the latency req before idle
On 11/05/2014 07:58 PM, Daniel Lezcano wrote:
> On 10/29/2014 03:01 AM, Preeti U Murthy wrote:
>> On 10/29/2014 12:29 AM, Daniel Lezcano wrote:
>>> On 10/28/2014 04:51 AM, Preeti Murthy wrote:
>>>> Hi Daniel,
>>>>
>>>> On Thu, Oct 23, 2014 at 2:31 PM, Daniel Lezcano
>>>> <daniel.lezc...@linaro.org> wrote:
>>>>> When the pmqos latency requirement is set to zero that means "poll
>>>>> in all the cases".
>>>>>
>>>>> That is correctly implemented on x86 but not on the other archs.
>>>>>
>>>>> As how is written the code, if the latency request is zero, the
>>>>> governor will return zero, so corresponding, for x86, to the poll
>>>>> function, but for the others arch the default idle function. For
>>>>> example, on ARM this is wait-for-interrupt with a latency of '1',
>>>>> so violating the constraint.
>>>> This is not true actually. On PowerPC the idle state 0 has an
>>>> exit_latency of 0.
>>>>> In order to fix that, do the latency requirement check *before*
>>>>> calling the cpuidle framework in order to jump to the poll function
>>>>> without entering cpuidle. That has several benefits:
>>>> Doing so actually hurts on PowerPC, because the idle loop defined for
>>>> idle state 0 is different from what cpu_relax() does in
>>>> cpu_idle_loop(). The spinning is more power efficient in the former
>>>> case. Moreover we also set certain register values which indicate an
>>>> idle cpu. The ppc_runlatch bits do precisely this. These register
>>>> values are being read by some user space tools, so we will end up
>>>> breaking them with this patch.
>>>>
>>>> My suggestion is to keep the latency requirement check in
>>>> kernel/sched/idle.c like you are doing in this patch, but before
>>>> jumping to cpu_idle_loop, verify if the idle state 0 has an
>>>> exit_latency > 0 in addition to your check on latency_req == 0. If
>>>> not, you can fall through to the regular path of calling into the
>>>> cpuidle driver. The scheduler can query the cpuidle_driver structure
>>>> anyway.
>>>>
>>>> What do you think?
>>> Thanks for reviewing the patch and spotting this.
>>>
>>> Wouldn't it make sense to create:
>>>
>>> void __weak cpu_idle_poll(void) ?
>>>
>>> and override it with your specific poll function ?
>> No, this would become ugly as far as I can see. A weak function has to
>> be defined under arch/* code. We will either need to duplicate the
>> idle loop that we already have in the drivers, or point the weak
>> function to the first idle state defined by our driver. Neither of
>> these is desirable (calling into the driver from arch code is ugly).
>>
>> Another reason why I don't like the idea of a weak function is that if
>> you have missed looking at a specific driver, and it has an idle loop
>> with features similar to the one on powerpc, you will have to spot it
>> yourself and include the arch specific cpu_idle_poll() for it.
> Yes, I agree this is a fair point. But actually I don't see the
> interest of having the poll loop in the cpuidle driver.

We can't do that simply because the idle poll loop has arch specific
bits on powerpc.

> These cleanups are preparing the removal of the
> CPUIDLE_DRIVER_STATE_START macro which leads to a lot of mess in the
> cpuidle code.

How is the suggestion to check the exit_latency of idle state 0 when
latency_req == 0 going to hinder this removal?

> With the removal of this macro, we should be able to move the select
> loop from the menu governor and use it everywhere else. Furthermore,
> this state, which is flagged with TIME_VALID, isn't, because the local
> interrupts are enabled, so we are measuring the interrupt processing
> time. Besides that, the idle loop for x86 is mostly not used.
>
> So the idea would be to extract those idle loops from the drivers and
> use them directly when:
>
> 1. the idle selection fails (use the poll loop under certain
> circumstances we have to redefine)

This behavior will not change as per my suggestion.

> 2. when the latency req is zero

It is only here that I suggested you also verify state 0's exit_latency,
for the reason that the arch may have a more optimized idle poll loop,
which we cannot override with the generic cpuidle poll loop.

Regards
Preeti U Murthy

> That will result in a cleaner code in cpuidle and in the governor.
>
> Do you agree with that ?
>
>> But by having a check on the exit_latency, you are claiming that since
>> the driver's 0th idle state is no better than the generic idle loop in
>> cases of 0 latency req, we are better off calling the latter, which
>> looks reasonable. That way you don't have to bother about worsening
>> the idle loop behavior on any other driver.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 1/5] sched: idle: cpuidle: Check the latency req before idle
On 10/29/2014 12:29 AM, Daniel Lezcano wrote: > On 10/28/2014 04:51 AM, Preeti Murthy wrote: >> Hi Daniel, >> >> On Thu, Oct 23, 2014 at 2:31 PM, Daniel Lezcano >> wrote: >>> When the pmqos latency requirement is set to zero that means "poll in >>> all the >>> cases". >>> >>> That is correctly implemented on x86 but not on the other archs. >>> >>> As how is written the code, if the latency request is zero, the >>> governor will >>> return zero, so corresponding, for x86, to the poll function, but for >>> the >>> others arch the default idle function. For example, on ARM this is >>> wait-for- >>> interrupt with a latency of '1', so violating the constraint. >> >> This is not true actually. On PowerPC the idle state 0 has an >> exit_latency of 0. >> >>> >>> In order to fix that, do the latency requirement check *before* >>> calling the >>> cpuidle framework in order to jump to the poll function without entering >>> cpuidle. That has several benefits: >> >> Doing so actually hurts on PowerPC. Because the idle loop defined for >> idle state 0 is different from what cpu_relax() does in cpu_idle_loop(). >> The spinning is more power efficient in the former case. Moreover we >> also set >> certain register values which indicate an idle cpu. The ppc_runlatch bits >> do precisely this. These register values are being read by some user >> space >> tools. So we will end up breaking them with this patch >> >> My suggestion is very well keep the latency requirement check in >> kernel/sched/idle.c >> like your doing in this patch. But before jumping to cpu_idle_loop >> verify if the >> idle state 0 has an exit_latency > 0 in addition to your check on the >> latency_req == 0. >> If not, you can fall through to the regular path of calling into the >> cpuidle driver. >> The scheduler can query the cpuidle_driver structure anyway. >> >> What do you think? > > Thanks for reviewing the patch and spotting this. > > Wouldn't make sense to create: > > void __weak_cpu_idle_poll(void) ? 
> and override it with your specific poll function ?

No, this would become ugly as far as I can see. A weak function has to be
defined under arch/* code. We will either need to duplicate the idle loop
that we already have in the drivers, or point the weak function to the
first idle state defined by our driver. Neither of these is desirable
(calling into the driver from arch code is ugly).

Another reason why I don't like the idea of a weak function is that if
you have missed looking at a specific driver, and it has an idle loop
with features similar to the one on powerpc, you will have to spot it
yourself and include the arch specific cpu_idle_poll() for it.

But by having a check on the exit_latency, you are claiming that since
the driver's 0th idle state is no better than the generic idle loop in
cases of a 0 latency req, we are better off calling the latter, which
looks reasonable. That way you don't have to bother about worsening the
idle loop behavior on any other driver.

Regards
Preeti U Murthy
Re: [PATCH V2 3/5] cpuidle: idle: menu: Don't reflect when a state selection failed
On 10/28/2014 11:58 PM, Daniel Lezcano wrote: > On 10/28/2014 08:01 AM, Preeti Murthy wrote: >> On Thu, Oct 23, 2014 at 2:31 PM, Daniel Lezcano >> wrote: >>> In the current code, the check to reflect or not the outcoming state >>> is done >>> against the idle state which has been chosen and its value. >>> >>> Instead of doing a check in each of the reflect functions, just don't >>> call reflect >>> if something went wrong in the idle path. >>> >>> Signed-off-by: Daniel Lezcano >>> Acked-by: Nicolas Pitre >>> --- >>> drivers/cpuidle/governors/ladder.c | 3 +-- >>> drivers/cpuidle/governors/menu.c | 4 +--- >>> kernel/sched/idle.c| 3 ++- >>> 3 files changed, 4 insertions(+), 6 deletions(-) >>> >>> diff --git a/drivers/cpuidle/governors/ladder.c >>> b/drivers/cpuidle/governors/ladder.c >>> index fb396d6..c0b36a8 100644 >>> --- a/drivers/cpuidle/governors/ladder.c >>> +++ b/drivers/cpuidle/governors/ladder.c >>> @@ -165,8 +165,7 @@ static int ladder_enable_device(struct >>> cpuidle_driver *drv, >>> static void ladder_reflect(struct cpuidle_device *dev, int index) >>> { >>> struct ladder_device *ldev = &__get_cpu_var(ladder_devices); >>> - if (index > 0) >>> - ldev->last_state_idx = index; >>> + ldev->last_state_idx = index; >>> } >>> >>> static struct cpuidle_governor ladder_governor = { >>> diff --git a/drivers/cpuidle/governors/menu.c >>> b/drivers/cpuidle/governors/menu.c >>> index a17515f..3907301 100644 >>> --- a/drivers/cpuidle/governors/menu.c >>> +++ b/drivers/cpuidle/governors/menu.c >>> @@ -365,9 +365,7 @@ static int menu_select(struct cpuidle_driver >>> *drv, struct cpuidle_device *dev, >>> static void menu_reflect(struct cpuidle_device *dev, int index) >>> { >>> struct menu_device *data = &__get_cpu_var(menu_devices); >>> - data->last_state_idx = index; >>> - if (index >= 0) >>> - data->needs_update = 1; >>> + data->needs_update = 1; >> >> Why is the last_state_idx not getting updated ? > > Oups, right. This is missing. > > Thanks for pointing this out. 
> By the way, I don't think a back end driver is changing the selected
> state currently, and I am not sure this is desirable, since we want to
> trust the state we are going to (as a best effort). So if the 'enter'
> function does not change the index, that means the last_state_idx does
> not have to be changed, since it has been assigned in the 'select'
> function.

Hmm, right. So you might want to remove the last_state_idx update in
ladder_reflect() also?

Regards
Preeti U Murthy
[PATCH] cpuidle/powernv: Fix return value of idle index in fastsleep
Commit dcb18694 ("Fix ipi on Palmeto") added a workaround to disable
going into fastsleep on Palmeto boards, which were reported to fail to
boot when fastsleep was enabled. However it missed returning an idle
index. Fix this.

There is probably no harm in returning the index of fastsleep although
the cpu did not enter this state. The governor will notice the
difference between the residency time in the idle state and the target
residency of the idle state, and prevent cpus from entering fastsleep
from that point on. Hence the usage and time statistics logged for
fastsleep will not be out of sync with reality, except for the first
entry.

Signed-off-by: Preeti U Murthy
---
Do we still need this workaround? Or can we get rid of the check on
powersave_nap altogether?
---
 drivers/cpuidle/cpuidle-powernv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index fa79392..c18da24 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -70,7 +70,7 @@ static int fastsleep_loop(struct cpuidle_device *dev,
 	unsigned long new_lpcr;

 	if (powersave_nap < 2)
-		return;
+		return index;

 	if (unlikely(system_state < SYSTEM_RUNNING))
 		return index;
Re: [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement
Hi Mike, On 10/22/2014 11:37 AM, Mike Turquette wrote: > {en,de}queue_task_fair are updated to track which cpus will have changed > utilization values as function of task queueing. The affected cpus are > passed on to arch_eval_cpu_freq for further machine-specific processing > based on a selectable policy. > > arch_scale_cpu_freq is called from run_rebalance_domains as a way to > kick off the scaling process (via wake_up_process), so as to prevent > re-entering the {en,de}queue code. > > All of the call sites in this patch are up for discussion. Does it make > sense to track which cpus have updated statistics in enqueue_fair_task? > I chose this because I wanted to gather statistics for all cpus affected > in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the Can you explain how pstate selection can get affected by the presence of task groups? We are after all concerned with the cpu load. So when we enqueue/dequeue a task, we update the cpu load and pass it on for cpu pstate scaling. How does this change if we have task groups? I know that this issue was brought up during LPC, but I have yet not managed to gain clarity here. > next version of this patch will focus on the simpler case of not using > scheduler cgroups, which should remove a good chunk of this code, > including the cpumask stuff. > > Also discussed at LPC14 is that fact that load_balance is a very > interesting place to do this as frequency can be considered in concert > with task placement. Please put forth any ideas on a sensible way to do > this. > > Is run_rebalance_domains a logical place to change cpu frequency? What > other call sites make sense? > > Even for platforms that can target a cpu frequency without sleeping > (x86, some ARM platforms with PM microcontrollers) it is currently > necessary to always kick the frequency target work out into a kthread. > This is because of the rw_sem usage in the cpufreq core which might > sleep. 
Replacing that lock type is probably a good idea.

> Not-signed-off-by: Mike Turquette
> ---
>  kernel/sched/fair.c | 39 +++
>  1 file changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1af6f6d..3619f63 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
> +	struct cpumask update_cpus;
> +
> +	cpumask_clear(&update_cpus);
>
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
> @@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>  		update_cfs_shares(cfs_rq);
>  		update_entity_load_avg(se, 1);
> +		/* track cpus that need to be re-evaluated */
> +		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);

All the cfs_rqs that you iterate through here will belong to the same
rq/cpu, right?

Regards
Preeti U Murthy
Re: [PATCH v2 2/3] powerpc/kvm/book3s_hv: Enable CPUs to run guest after waking up from fast-sleep
On 10/07/2014 10:41 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2014-10-01 at 13:15 +0530, Shreyas B. Prabhu wrote:
>>
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
>> index 050f79a..c64f3cc0 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -100,25 +100,8 @@ system_reset_pSeries:
>>  	SET_SCRATCH0(r13)
>>  #ifdef CONFIG_PPC_P7_NAP
>>  BEGIN_FTR_SECTION
>> -	/* Running native on arch 2.06 or later, check if we are
>> -	 * waking up from nap. We only handle no state loss and
>> -	 * supervisor state loss. We do -not- handle hypervisor
>> -	 * state loss at this time.
>> -	 */
>> -	mfspr	r13,SPRN_SRR1
>> -	rlwinm.	r13,r13,47-31,30,31
>> -	beq	9f
>>
>> -	/* waking up from powersave (nap) state */
>> -	cmpwi	cr1,r13,2
>> -	/* Total loss of HV state is fatal, we could try to use the
>> -	 * PIR to locate a PACA, then use an emergency stack etc...
>> -	 * OPAL v3 based powernv platforms have new idle states
>> -	 * which fall in this catagory.
>> -	 */
>> -	bgt	cr1,8f
>>  	GET_PACA(r13)
>> -
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>>  	li	r0,KVM_HWTHREAD_IN_KERNEL
>>  	stb	r0,HSTATE_HWTHREAD_STATE(r13)
>> @@ -131,13 +114,27 @@ BEGIN_FTR_SECTION
>>  1:
>>  #endif
>
> So you moved the state loss check to after the KVM check ? Was this
> reviewed by Paul ? Is that ok ? (Does this match what we have in
> PowerKVM ?). Is it possible that we end up calling kvm_start_guest
> after a HV state loss, or do we know for sure that this won't happen
> for a reason or another ? If that's the case, then that reason needs
> to be clearly documented here in a comment.

This won't happen, because the first thread in the core which comes out
of an idle state that has a state loss will not enter into KVM, since
HSTATE_HWTHREAD_STATE is not yet set. It continues on to restore the
lost state. This thread sets HSTATE_HWTHREAD_STATE and wakes up the
remaining threads in the core.
These sibling threads enter KVM directly, not needing to restore the
lost state, since the first thread has restored it anyway. So we are
safe. We will certainly add a comment there.

Thanks

Regards
Preeti U Murthy
Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant
On 10/08/2014 03:48 PM, Peter Zijlstra wrote: > On Wed, Oct 08, 2014 at 03:07:51PM +0530, Preeti U Murthy wrote: > >>> I still completely hate all that.. It basically makes cpusets useless, >>> they no longer guarantee anything, it makes then an optional placement >>> hint instead. >> >> Why do you say they don't guarantee anything? We ensure that we always >> run on the cpus in our cpuset which are online. We do not run in any >> arbitrary cpuset. We also do not wait unreasonably on an offline cpu to >> come back. What we are doing is ensuring that if the resources that we >> began with are available we use them. Why is this not a logical thing to >> expect? > > Because if you randomly hotplug cpus your tasks can randomly flip > between sets. > > Therefore there is no strict containment and no guarantees. > >>> You also break long standing behaviour. >> >> Which is? As I understand cpusets are meant to ensure a dedicated set of >> resources to some tasks. We cannot scheduler the tasks anywhere outside >> this set as long as they are available. And when they are not, currently >> we move them to their parents, > > No currently we hard break affinity and never look back. That move to > parent and back crap is all new fangled stuff, and broken because of the > above argument. > >> but you had also suggested killing the >> task. Maybe this can be debated. But what behavior are we changing by >> ensuring that we retain our original configuration at all times? > > See above, by pretending hotplug is a sane operation you loose all > guarantees. Ok I see the point. The kernel must not be bothered about keeping cpusets and hotplug operations consistent when both of these are user initiated actions specifying affinity with the former and breaking the same with the latter. > >>> Also, power is insane if it needs/uses hotplug for operational crap >>> like that. >> >> SMT 8 on Power8 can help/hinder workloads. Hence we dynamically switch >> the modes at runtime. 
> That's just a horrible piece of crap hack and you deserve any and all
> pain you get from doing it.
>
> Randomly removing/adding cpus like that is horrible and makes a mockery
> of all the affinity interfaces we have.

We observed this on the Ubuntu kernel, in which systemd explicitly
mounts cgroup controllers under a child cgroup identified by the user
pid. Since we had not observed this additional cgroup being added under
the hood, it came as a surprise to us that the cgroup/cpuset handling in
the kernel should indeed kick in.

At best, we expect hotplug to be handled well if the users have not
explicitly configured cpusets, hence implicitly specifying that task
affinity is for all online cpus. This is indeed the case today, so that
is good.

However, what remains to be answered is this: the V2 of the cgroup
design - the default hierarchy - tracks hotplug operations for children
cgroups as well. Tejun, Li, will not the concerns that Peter raised
above hold for the default hierarchy as well?

Regards
Preeti U Murthy
Re: [PATCH v2 2/3] powerpc/kvm/book3s_hv: Enable CPUs to run guest after waking up from fast-sleep
On 10/07/2014 10:41 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2014-10-01 at 13:15 +0530, Shreyas B. Prabhu wrote:
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
>> index 050f79a..c64f3cc0 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -100,25 +100,8 @@ system_reset_pSeries:
>>  	SET_SCRATCH0(r13)
>>  #ifdef CONFIG_PPC_P7_NAP
>>  BEGIN_FTR_SECTION
>> -	/* Running native on arch 2.06 or later, check if we are
>> -	 * waking up from nap. We only handle no state loss and
>> -	 * supervisor state loss. We do -not- handle hypervisor
>> -	 * state loss at this time.
>> -	 */
>> -	mfspr	r13,SPRN_SRR1
>> -	rlwinm.	r13,r13,47-31,30,31
>> -	beq	9f
>> -	/* waking up from powersave (nap) state */
>> -	cmpwi	cr1,r13,2
>> -	/* Total loss of HV state is fatal, we could try to use the
>> -	 * PIR to locate a PACA, then use an emergency stack etc...
>> -	 * OPAL v3 based powernv platforms have new idle states
>> -	 * which fall in this catagory.
>> -	 */
>> -	bgt	cr1,8f
>>  	GET_PACA(r13)
>> -
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>>  	li	r0,KVM_HWTHREAD_IN_KERNEL
>>  	stb	r0,HSTATE_HWTHREAD_STATE(r13)
>> @@ -131,13 +114,27 @@ BEGIN_FTR_SECTION
>>  1:
>>  #endif
>
> So you moved the state loss check to after the KVM check ? Was this
> reviewed by Paul ? Is that ok ? (Does this match what we have in
> PowerKVM ?). Is it possible that we end up calling kvm_start_guest
> after a HV state loss or do we know for sure that this won't happen
> for a reason or another ? If that's the case, then that reason needs
> to be clearly documented here in a comment.

This won't happen, because the first thread in the core which comes out
of an idle state involving state loss will not enter KVM, since
HSTATE_HWTHREAD_STATE is not yet set; it continues on to restore the
lost state. This thread then sets HSTATE_HWTHREAD_STATE and wakes up
the remaining threads in the core. These sibling threads enter KVM
directly, without needing to restore the lost state, since the first
thread has already restored it. So we are safe.
We will certainly add a comment there.

Thanks

Regards
Preeti U Murthy