Re: [PATCH V3] tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
On 01/21/2015 05:16 PM, Thomas Gleixner wrote:
> On Tue, 20 Jan 2015, Preeti U Murthy wrote:
>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>> index 5544990..f3907c9 100644
>> --- a/kernel/time/clockevents.c
>> +++ b/kernel/time/clockevents.c
>> @@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)
>>
>>  	case CLOCK_EVT_NOTIFY_CPU_DYING:
>>  		tick_handover_do_timer(arg);
>> +		tick_shutdown_broadcast_oneshot(arg);
>>  		break;
>>
>>  	case CLOCK_EVT_NOTIFY_SUSPEND:
>> @@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
>>  		break;
>>
>>  	case CLOCK_EVT_NOTIFY_CPU_DEAD:
>> -		tick_shutdown_broadcast_oneshot(arg);
>>  		tick_shutdown_broadcast(arg);
>>  		tick_shutdown(arg);
>>  		/*
>> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
>> index 066f0ec..f983983 100644
>> --- a/kernel/time/tick-broadcast.c
>> +++ b/kernel/time/tick-broadcast.c
>> @@ -675,8 +675,11 @@ static void broadcast_move_bc(int deadcpu)
>>
>>  	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
>>  		return;
>> -	/* This moves the broadcast assignment to this cpu */
>> -	clockevents_program_event(bc, bc->next_event, 1);
>> +	/* Since a cpu with the earliest wakeup is nominated as the
>> +	 * standby cpu, the next cpu to invoke BROADCAST_ENTER
>> +	 * will now automatically take up the duty of broadcasting.
>> +	 */
>> +	bc->next_event.tv64 = KTIME_MAX;
>
> So that relies on the fact, that cpu_down() currently forces ALL cpus
> into stop_machine(). Of course this is not in any way obvious and any
> change to this will cause even more hard to debug issues.

Hmm.. true, this is a concern.

> And to be honest, the clever 'set next_event to KTIME_MAX' is even
> more nonobvious because it's only relevant for your hrtimer based
> broadcasting magic. Any real broadcast device does not care about this
> at all.

bc->next_event is set to KTIME_MAX only if CLOCK_EVT_FEATURE_HRTIMER is set.
It does not affect the usual broadcast logic.
> This whole random notifier driven hotplug business is just a
> trainwreck. I'm still trying to convert this to a well documented
> state machine, so I rather prefer to make this an explicit take over
> rather than a completely undocumented 'works today' mechanism.
>
> What about the patch below?
>
> Thanks,
>
> 	tglx
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 5d220234b3ca..7a9b1ae4a945 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -16,6 +16,7 @@
>  #include <linux/bug.h>
>  #include <linux/kthread.h>
>  #include <linux/stop_machine.h>
> +#include <linux/clockchips.h>
>  #include <linux/mutex.h>
>  #include <linux/gfp.h>
>  #include <linux/suspend.h>
> @@ -421,6 +422,12 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
>  	while (!idle_cpu(cpu))
>  		cpu_relax();
>
> +	/*
> +	 * Before waiting for the cpu to enter DEAD state, take over
> +	 * any tick related duties
> +	 */
> +	clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DEAD, &cpu);
> +
>  	/* This actually kills the CPU. */
>  	__cpu_die(cpu);
>
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 37e50aadd471..3c1bfd0f7074 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1721,11 +1721,8 @@ static int hrtimer_cpu_notify(struct notifier_block *self,
>  		break;
>  	case CPU_DEAD:
>  	case CPU_DEAD_FROZEN:
> -	{
> -		clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DEAD, &scpu);
>  		migrate_hrtimers(scpu);
>  		break;
> -	}
>  #endif
>
>  	default:

How about when the cpu that is going offline receives a timer interrupt just
before setting its state to CPU_DEAD? That is still possible, right, given
that its clock devices may not have been shut down and it is capable of
receiving interrupts for a short duration. Even with the above patch, is the
following scenario possible?
	CPU0						CPU1

t0	Receives timer interrupt
t1	Sees that there are hrtimers to be
	serviced (hrtimers are not yet migrated)
t2	Calls hrtimer_interrupt()
t3	tick_program_event()				CPU_DEAD notifiers run;
							CPU0's td->evtdev = NULL
t4	clockevents_program_event()
	dereferences the NULL tick device pointer

So my concern is that since the CLOCK_EVT_NOTIFY_CPU_DEAD callback handles
shutting down of devices besides moving tick related duties, its functions
may race with the hotplug cpu still handling tick events.

We do check on powerpc if the timer interrupt has arrived on an offline cpu,
but that is to avoid an entirely different scenario and not one like the
above. I would not expect the arch to check if a timer interrupt arrived on
an offline cpu. A timer interrupt may be serviced as long as the tick device
is alive.

Regards
Preeti U Murthy
[PATCH V2] cpuidle: Add missing checks to the exit condition of cpu_idle_poll()
cpu_idle_poll() is entered when either cpu_idle_force_poll is set or
tick_check_broadcast_expired() returns true. The exit condition from
cpu_idle_poll() is tif_need_resched().

However this does not take into account scenarios where cpu_idle_force_poll
changes or tick_check_broadcast_expired() returns false, without setting the
resched flag. So a cpu will be caught in cpu_idle_poll() needlessly, thereby
wasting power.

Add an explicit check on cpu_idle_force_poll and
tick_check_broadcast_expired() to the exit condition of cpu_idle_poll() to
avoid this.

Signed-off-by: Preeti U Murthy
---
Changes from V1: Modified the Changelog

 kernel/sched/idle.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c47fce7..aaf1c1d 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -47,7 +47,8 @@ static inline int cpu_idle_poll(void)
 	rcu_idle_enter();
 	trace_cpu_idle_rcuidle(0, smp_processor_id());
 	local_irq_enable();
-	while (!tif_need_resched())
+	while (!tif_need_resched() &&
+	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
 		cpu_relax();
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
 	rcu_idle_exit();
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] idle/tick-broadcast: Exit cpu idle poll loop when cleared from tick_broadcast_force_mask
On 01/21/2015 03:26 PM, Thomas Gleixner wrote:
> On Tue, 20 Jan 2015, Preeti U Murthy wrote:
>> On 01/20/2015 04:51 PM, Thomas Gleixner wrote:
>>> On Mon, 19 Jan 2015, Preeti U Murthy wrote:
>>>> An idle cpu enters cpu_idle_poll() if it is set in the
>>>> tick_broadcast_force_mask. This is so that it does not incur the
>>>> overhead of entering idle states when it is expected to be woken up
>>>> anytime then through a broadcast IPI. The condition that forces an
>>>> exit out of the idle polling is the check on setting of the
>>>> TIF_NEED_RESCHED flag for the idle thread.
>>>>
>>>> When the broadcast IPI does arrive, it is not guarenteed that the
>>>> handler sets the TIF_NEED_RESCHED flag. Hence although the cpu is
>>>> cleared in the tick_broadcast_force_mask, it continues to loop in
>>>> cpu_idle_poll unnecessarily wasting power. Hence exit the idle poll
>>>> loop if the tick_broadcast_force_mask gets cleared and enter idle
>>>> states.
>>>>
>>>> Of course if the cpu has entered cpu_idle_poll() on being asked to
>>>> poll explicitly, it continues to poll till it is asked to reschedule.
>>>>
>>>> Signed-off-by: Preeti U Murthy
>>>> ---
>>>>
>>>>  kernel/sched/idle.c |    3 ++-
>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>>>> index c47fce7..aaf1c1d 100644
>>>> --- a/kernel/sched/idle.c
>>>> +++ b/kernel/sched/idle.c
>>>> @@ -47,7 +47,8 @@ static inline int cpu_idle_poll(void)
>>>>  	rcu_idle_enter();
>>>>  	trace_cpu_idle_rcuidle(0, smp_processor_id());
>>>>  	local_irq_enable();
>>>> -	while (!tif_need_resched())
>>>> +	while (!tif_need_resched() &&
>>>> +	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
>>>
>>> You explain the tick_check_broadcast_expired() bit, but what about the
>>> cpu_idle_force_poll part?
>>
>> The last few lines which say "Of course if the cpu has entered
>> cpu_idle_poll() on being asked to poll explicitly, it continues to poll
>> till it is asked to reschedule" explains the cpu_idle_force_poll part.
>
> Well, I read it more than once and did not figure it out.
>
> The paragraph describes some behaviour. Now I know it's the behaviour
> before the patch. So maybe something like this:
>
>   cpu_idle_poll() is entered when cpu_idle_force_poll is set or
>   tick_check_broadcast_expired() returns true. The exit condition from
>   cpu_idle_poll() is tif_need_resched().
>
>   But this does not take into account that cpu_idle_force_poll and
>   tick_check_broadcast_expired() can change without setting the
>   resched flag. So a cpu can be caught in cpu_idle_poll() needlessly
>   and thereby wasting power.
>
>   Add an explicit check for cpu_idle_force_poll and
>   tick_check_broadcast_expired() to the exit condition of
>   cpu_idle_poll() to avoid this.
>
> This explains the technical issue without confusing people with IPIs
> and other completely irrelevant information. Hmm?

Yep, much simpler, thanks! I will send out the next version with this
changelog.

Regards
Preeti U Murthy

> Thanks,
>
> 	tglx
> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: cpuidle/powernv: Read target_residency value of idle states from DT if available
On 01/20/2015 11:15 AM, Michael Ellerman wrote:
> On Mon, 2015-19-01 at 11:32:51 UTC, Preeti U Murthy wrote:
>> The device tree now exposes the residency values for different idle
>> states. Read these values instead of calculating residency from the
>> latency values. The values exposed in the DT are validated for optimal
>> power efficiency. However to maintain compatibility with the older
>> firmware code which does not expose residency values, use default
>> values as a fallback mechanism. While at it, clump the common parts of
>> device tree parsing into one chunk.
>>
>> Signed-off-by: Preeti U Murthy
>> ---
>>
>>  drivers/cpuidle/cpuidle-powernv.c |   39 +++++++++++++++++++++++++--------------
>>  1 file changed, 25 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>> index 2726663..9e732e1 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -162,7 +162,8 @@ static int powernv_add_idle_states(void)
>>  	int dt_idle_states;
>>  	const __be32 *idle_state_flags;
>>  	const __be32 *idle_state_latency;
>> -	u32 len_flags, flags, latency_ns;
>> +	const __be32 *idle_state_residency;
>> +	u32 len_flags, flags, latency_ns, residency_ns;
>>  	int i;
>>
>>  	/* Currently we have snooze statically defined */
>> @@ -186,14 +187,21 @@ static int powernv_add_idle_states(void)
>>  		return nr_idle_states;
>>  	}
>>
>> +	idle_state_residency = of_get_property(power_mgt,
>> +			"ibm,cpu-idle-state-residency-ns", NULL);
>> +	if (!idle_state_residency) {
>> +		pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-residency-ns\n");
>> +		pr_warn("Falling back to default values\n");
>> +	}
>
> This would be better done with something like:
>
>   rc = of_read_property_u32(power_mgt, "ibm,cpu-idle-state-residency-ns",
>   			    &residency_ns);
>   if (!rc) {
>   	pr_info("cpuidle-powernv: missing ibm,cpu-idle-state-residency-ns\n");
>   	residency_ns = 30;
>   }

This looks like a better API, but the default residency values are different
for each idle
state. So perhaps a patch like the below?

--Start Patch--

From: Preeti U Murthy
---
 drivers/cpuidle/cpuidle-powernv.c |   54 +++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 21 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index aedec09..2031560 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -159,9 +160,9 @@ static int powernv_add_idle_states(void)
 	int nr_idle_states = 1; /* Snooze */
 	int dt_idle_states;
 	const __be32 *idle_state_flags;
-	const __be32 *idle_state_latency;
-	u32 len_flags, flags, latency_ns;
-	int i;
+	u32 len_flags, flags;
+	u32 *latency_ns, *residency_ns;
+	int i, rc;

 	/* Currently we have snooze statically defined */
@@ -177,34 +178,39 @@ static int powernv_add_idle_states(void)
 		return nr_idle_states;
 	}

-	idle_state_latency = of_get_property(power_mgt,
-			"ibm,cpu-idle-state-latencies-ns", NULL);
-	if (!idle_state_latency) {
+	dt_idle_states = len_flags / sizeof(u32);
+
+	latency_ns = kzalloc(sizeof(*latency_ns) * dt_idle_states, GFP_KERNEL);
+	rc = of_property_read_u32(power_mgt,
+			"ibm,cpu-idle-state-latencies-ns", latency_ns);
+	if (rc) {
 		pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-latencies-ns\n");
 		return nr_idle_states;
 	}

-	dt_idle_states = len_flags / sizeof(u32);
+	residency_ns = kzalloc(sizeof(*residency_ns) * dt_idle_states, GFP_KERNEL);
+	rc = of_property_read_u32(power_mgt,
+			"ibm,cpu-idle-state-residency-ns", residency_ns);
+	if (rc) {
+		pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-residency-ns\n");
+		pr_warn("Falling back to default values\n");
+	}
+
 	for (i = 0; i < dt_idle_states; i++) {
 		flags = be32_to_cpu(idle_state_flags[i]);
-
-		/* Cpuidle accepts exit_latency in us and we estimate
-		 * target residency to be 10x exit_latency
+		/*
+
Re: [PATCH] idle/tick-broadcast: Exit cpu idle poll loop when cleared from tick_broadcast_force_mask
On 01/20/2015 04:51 PM, Thomas Gleixner wrote:
> On Mon, 19 Jan 2015, Preeti U Murthy wrote:
>> An idle cpu enters cpu_idle_poll() if it is set in the
>> tick_broadcast_force_mask. This is so that it does not incur the
>> overhead of entering idle states when it is expected to be woken up
>> anytime then through a broadcast IPI. The condition that forces an
>> exit out of the idle polling is the check on setting of the
>> TIF_NEED_RESCHED flag for the idle thread.
>>
>> When the broadcast IPI does arrive, it is not guarenteed that the
>> handler sets the TIF_NEED_RESCHED flag. Hence although the cpu is
>> cleared in the tick_broadcast_force_mask, it continues to loop in
>> cpu_idle_poll unnecessarily wasting power. Hence exit the idle poll
>> loop if the tick_broadcast_force_mask gets cleared and enter idle
>> states.
>>
>> Of course if the cpu has entered cpu_idle_poll() on being asked to
>> poll explicitly, it continues to poll till it is asked to reschedule.
>>
>> Signed-off-by: Preeti U Murthy
>> ---
>>
>>  kernel/sched/idle.c |    3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index c47fce7..aaf1c1d 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -47,7 +47,8 @@ static inline int cpu_idle_poll(void)
>>  	rcu_idle_enter();
>>  	trace_cpu_idle_rcuidle(0, smp_processor_id());
>>  	local_irq_enable();
>> -	while (!tif_need_resched())
>> +	while (!tif_need_resched() &&
>> +	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
>
> You explain the tick_check_broadcast_expired() bit, but what about the
> cpu_idle_force_poll part?

The last few lines which say "Of course if the cpu has entered
cpu_idle_poll() on being asked to poll explicitly, it continues to poll
till it is asked to reschedule" explains the cpu_idle_force_poll part.
Perhaps I should s/poll explicitly/do cpu_idle_force_poll ?
Regards
Preeti U Murthy

> Thanks,
>
> 	tglx
[PATCH V3] tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
Today if the cpu handling broadcasting of wakeups goes offline, the job of
broadcasting is handed over to another cpu in the CPU_DEAD phase. The
CPU_DEAD notifiers are run only after the offline cpu sets its state as
CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while
waiting for this transition by queuing a timer. This is fatal because if the
cpu on which this kthread was running has no other work queued on it, it can
re-enter deep idle state, since it sees that a broadcast cpu still exists.
However the broadcast wakeup will never come since the cpu which was handling
it is offline, and the cpu on which the kthread doing the hotplug operation
was running never wakes up to see this because it's in deep idle state.

Fix this by setting the broadcast timer to a max value so as to force the
cpus entering deep idle states henceforth to freshly nominate the broadcast
cpu. More importantly this has to be done in the CPU_DYING phase so that it
is visible to all cpus right after exiting stop_machine, which is when they
can re-enter idle. This ensures that handover of the broadcast duty falls in
place on offline of the broadcast cpu, without having to do it explicitly.

It fixes the bug reported here:
http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html

Signed-off-by: Preeti U Murthy
---
Changes from previous versions:
1. Modification to the changelog
2. Clarified the comments

 kernel/time/clockevents.c    |    2 +-
 kernel/time/tick-broadcast.c |    7 +++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 5544990..f3907c9 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)

 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_shutdown_broadcast_oneshot(arg);
 		break;

 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
 		break;

 	case CLOCK_EVT_NOTIFY_CPU_DEAD:
-		tick_shutdown_broadcast_oneshot(arg);
 		tick_shutdown_broadcast(arg);
 		tick_shutdown(arg);
 		/*
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 066f0ec..f983983 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -675,8 +675,11 @@ static void broadcast_move_bc(int deadcpu)

 	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
 		return;
-	/* This moves the broadcast assignment to this cpu */
-	clockevents_program_event(bc, bc->next_event, 1);
+	/* Since a cpu with the earliest wakeup is nominated as the
+	 * standby cpu, the next cpu to invoke BROADCAST_ENTER
+	 * will now automatically take up the duty of broadcasting.
+	 */
+	bc->next_event.tv64 = KTIME_MAX;
 }

 /*
[PATCH V2] tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
Today if the cpu handling broadcasting of wakeups goes offline, the job of
broadcasting is handed over to another cpu in the CPU_DEAD phase. The
CPU_DEAD notifiers are run only after the offline cpu sets its state as
CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while
waiting for this transition by queuing a timer. This is fatal because if the
cpu on which this kthread was running has no other work queued on it, it can
re-enter deep idle state, since it sees that a broadcast cpu still exists.
However the broadcast wakeup will never come since the cpu which was handling
it is offline, and the cpu on which the kthread doing the hotplug operation
was running never wakes up to see this because it's in deep idle state.

Fix this by setting the broadcast timer to a max value so as to force the
cpus entering deep idle states henceforth to freshly nominate the broadcast
cpu. More importantly this has to be done in the CPU_DYING phase so that it
is visible to all cpus right after exiting stop_machine, which is when they
can re-enter idle. This ensures that handover of the broadcast duty falls in
place on offline of the broadcast cpu, without having to do it explicitly.

It fixes the bug reported here:
http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html

Signed-off-by: Preeti U Murthy
---
Changes from V1: https://lkml.org/lkml/2015/1/19/168
1. Modified the Changelog

 kernel/time/clockevents.c    |    2 +-
 kernel/time/tick-broadcast.c |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 5544990..f3907c9 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)

 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_shutdown_broadcast_oneshot(arg);
 		break;

 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
 		break;

 	case CLOCK_EVT_NOTIFY_CPU_DEAD:
-		tick_shutdown_broadcast_oneshot(arg);
 		tick_shutdown_broadcast(arg);
 		tick_shutdown(arg);
 		/*
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 066f0ec..e9c1d9b 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -675,8 +675,8 @@ static void broadcast_move_bc(int deadcpu)

 	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
 		return;
-	/* This moves the broadcast assignment to this cpu */
-	clockevents_program_event(bc, bc->next_event, 1);
+	/* This allows fresh nomination of broadcast cpu */
+	bc->next_event.tv64 = KTIME_MAX;
 }

 /*
[PATCH V2] tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
Today if the cpu handling broadcasting of wakeups goes offline, the job of broadcasting is handed over to another cpu in the CPU_DEAD phase. The CPU_DEAD notifiers are run only after the offline cpu sets its state as CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while waiting for this transition by queuing a timer. This is fatal because if the cpu on which this kthread was running has no other work queued on it, it can re-enter deep idle state, since it sees that a broadcast cpu still exists. However the broadcast wakeup will never come since the cpu which was handling it is offline, and the cpu on which the kthread doing the hotplug operation was running never wakes up to see this because its in deep idle state. Fix this by setting the broadcast timer to a max value so as to force the cpus entering deep idle states henceforth to freshly nominate the broadcast cpu. More importantly this has to be done in the CPU_DYING phase so that it is visible to all cpus right after exiting stop_machine, which is when they can re-enter idle. This ensures that handover of the broadcast duty falls in place on offline of the broadcast cpu, without having to do it explicitly. It fixes the bug reported here: http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- Changes from V1: https://lkml.org/lkml/2015/1/19/168 1. 
Modified the Changelog kernel/time/clockevents.c|2 +- kernel/time/tick-broadcast.c |4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 5544990..f3907c9 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_CPU_DYING: tick_handover_do_timer(arg); + tick_shutdown_broadcast_oneshot(arg); break; case CLOCK_EVT_NOTIFY_SUSPEND: @@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg) break; case CLOCK_EVT_NOTIFY_CPU_DEAD: - tick_shutdown_broadcast_oneshot(arg); tick_shutdown_broadcast(arg); tick_shutdown(arg); /* diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 066f0ec..e9c1d9b 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -675,8 +675,8 @@ static void broadcast_move_bc(int deadcpu) if (!bc || !broadcast_needs_cpu(bc, deadcpu)) return; - /* This moves the broadcast assignment to this cpu */ - clockevents_program_event(bc, bc->next_event, 1); + /* This allows fresh nomination of broadcast cpu */ + bc->next_event.tv64 = KTIME_MAX; } /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V3] tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
Today if the cpu handling broadcasting of wakeups goes offline, the job of broadcasting is handed over to another cpu in the CPU_DEAD phase. The CPU_DEAD notifiers are run only after the offline cpu sets its state as CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while waiting for this transition by queuing a timer. This is fatal because if the cpu on which this kthread was running has no other work queued on it, it can re-enter deep idle state, since it sees that a broadcast cpu still exists. However the broadcast wakeup will never come since the cpu which was handling it is offline, and the cpu on which the kthread doing the hotplug operation was running never wakes up to see this because it's in deep idle state. Fix this by setting the broadcast timer to a max value so as to force the cpus entering deep idle states henceforth to freshly nominate the broadcast cpu. More importantly this has to be done in the CPU_DYING phase so that it is visible to all cpus right after exiting stop_machine, which is when they can re-enter idle. This ensures that handover of the broadcast duty falls in place on offline of the broadcast cpu, without having to do it explicitly. It fixes the bug reported here: http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- Changes from previous versions: 1. Modification to the changelog 2.
Clarified the comments kernel/time/clockevents.c|2 +- kernel/time/tick-broadcast.c |7 +-- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 5544990..f3907c9 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_CPU_DYING: tick_handover_do_timer(arg); + tick_shutdown_broadcast_oneshot(arg); break; case CLOCK_EVT_NOTIFY_SUSPEND: @@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg) break; case CLOCK_EVT_NOTIFY_CPU_DEAD: - tick_shutdown_broadcast_oneshot(arg); tick_shutdown_broadcast(arg); tick_shutdown(arg); /* diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 066f0ec..f983983 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -675,8 +675,11 @@ static void broadcast_move_bc(int deadcpu) if (!bc || !broadcast_needs_cpu(bc, deadcpu)) return; - /* This moves the broadcast assignment to this cpu */ - clockevents_program_event(bc, bc->next_event, 1); + /* Since a cpu with the earliest wakeup is nominated as the + * standby cpu, the next cpu to invoke BROADCAST_ENTER + * will now automatically take up the duty of broadcasting. + */ + bc->next_event.tv64 = KTIME_MAX; } /*
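The handover scheme the V3 patch relies on can be sketched in miniature: once the outgoing broadcast cpu publishes next_event = KTIME_MAX, the next cpu to enter deep idle sees no earlier pending event and nominates itself. The variables and function names below are illustrative stand-ins, not the kernel's actual API:

```c
#include <stdint.h>

#define KTIME_MAX INT64_MAX

/* Illustrative stand-ins for the broadcast bookkeeping, not kernel code. */
static int64_t bc_next_event = KTIME_MAX; /* broadcast device's next event */
static int broadcast_cpu = -1;            /* cpu currently owning the duty */

/* CPU_DYING path: the outgoing owner publishes "no event pending". */
void shutdown_broadcast_owner(void)
{
	bc_next_event = KTIME_MAX;
	broadcast_cpu = -1;
}

/* BROADCAST_ENTER path: a cpu about to enter deep idle registers its
 * wakeup; the cpu with the earliest wakeup becomes the broadcast cpu.
 * KTIME_MAX compares greater than any real wakeup, so the first cpu to
 * enter after the handover always takes over the duty. */
int broadcast_enter(int cpu, int64_t wakeup)
{
	if (wakeup < bc_next_event) {
		bc_next_event = wakeup;
		broadcast_cpu = cpu;
	}
	return broadcast_cpu;
}
```

With this model, after shutdown_broadcast_owner() the first entrant becomes the broadcast cpu, and only a cpu with a strictly earlier wakeup later displaces it, which is the "fresh nomination" the one-line KTIME_MAX store achieves.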
Re: cpuidle/powernv: Read target_residency value of idle states from DT if available
On 01/20/2015 11:15 AM, Michael Ellerman wrote: On Mon, 2015-19-01 at 11:32:51 UTC, Preeti U Murthy wrote: The device tree now exposes the residency values for different idle states. Read these values instead of calculating residency from the latency values. The values exposed in the DT are validated for optimal power efficiency. However to maintain compatibility with the older firmware code which does not expose residency values, use default values as a fallback mechanism. While at it, clump the common parts of device tree parsing into one chunk. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- drivers/cpuidle/cpuidle-powernv.c | 39 - 1 file changed, 25 insertions(+), 14 deletions(-) diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index 2726663..9e732e1 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -162,7 +162,8 @@ static int powernv_add_idle_states(void) int dt_idle_states; const __be32 *idle_state_flags; const __be32 *idle_state_latency; -u32 len_flags, flags, latency_ns; +const __be32 *idle_state_residency; +u32 len_flags, flags, latency_ns, residency_ns; int i; /* Currently we have snooze statically defined */ @@ -186,14 +187,21 @@ static int powernv_add_idle_states(void) return nr_idle_states; } +idle_state_residency = of_get_property(power_mgt, +"ibm,cpu-idle-state-residency-ns", NULL); +if (!idle_state_residency) { +pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-residency-ns\n"); +pr_warn("Falling back to default values\n"); } This would be better done with something like: rc = of_read_property_u32(power_mgt, "ibm,cpu-idle-state-residency-ns", &residency_ns); if (!rc) { pr_info("cpuidle-powernv: missing ibm,cpu-idle-state-residency-ns\n"); residency_ns = 30; } This looks like a better API, but the default residency values are different for each idle state. So perhaps a patch like the below ?
--Start Patch-- From: Preeti U Murthy pre...@linux.vnet.ibm.com --- drivers/cpuidle/cpuidle-powernv.c | 54 +++-- 1 file changed, 33 insertions(+), 21 deletions(-) diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index aedec09..2031560 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -13,6 +13,7 @@ #include <linux/notifier.h> #include <linux/clockchips.h> #include <linux/of.h> +#include <linux/slab.h> #include <asm/machdep.h> #include <asm/firmware.h> @@ -159,9 +160,9 @@ static int powernv_add_idle_states(void) int nr_idle_states = 1; /* Snooze */ int dt_idle_states; const __be32 *idle_state_flags; - const __be32 *idle_state_latency; - u32 len_flags, flags, latency_ns; - int i; + u32 len_flags, flags; + u32 *latency_ns, *residency_ns; + int i, rc; /* Currently we have snooze statically defined */ @@ -177,34 +178,39 @@ static int powernv_add_idle_states(void) return nr_idle_states; } - idle_state_latency = of_get_property(power_mgt, - "ibm,cpu-idle-state-latencies-ns", NULL); - if (!idle_state_latency) { + dt_idle_states = len_flags / sizeof(u32); + + latency_ns = kzalloc(sizeof(*latency_ns) * dt_idle_states, GFP_KERNEL); + rc = of_property_read_u32(power_mgt, + "ibm,cpu-idle-state-latencies-ns", latency_ns); + if (rc) { pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-latencies-ns\n"); return nr_idle_states; } - dt_idle_states = len_flags / sizeof(u32); + residency_ns = kzalloc(sizeof(*residency_ns) * dt_idle_states, GFP_KERNEL); + rc = of_property_read_u32(power_mgt, + "ibm,cpu-idle-state-residency-ns", residency_ns); + if (rc) { + pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-residency-ns\n"); + pr_warn("Falling back to default values\n"); + } + for (i = 0; i < dt_idle_states; i++) { flags = be32_to_cpu(idle_state_flags[i]); - - /* Cpuidle accepts exit_latency in us and we estimate -* target residency to be 10x exit_latency + /* +* Cpuidle accepts exit_latency and target_residency in us.
+* Use default target_residency values if f/w does not expose it. */ - latency_ns = be32_to_cpu(idle_state_latency[i]); if (flags & OPAL_PM_NAP_ENABLED) { /* Add NAP state */ strcpy(powernv_states[nr_idle_states].name
Re: [PATCH] idle/tick-broadcast: Exit cpu idle poll loop when cleared from tick_broadcast_force_mask
On 01/20/2015 04:51 PM, Thomas Gleixner wrote: On Mon, 19 Jan 2015, Preeti U Murthy wrote: An idle cpu enters cpu_idle_poll() if it is set in the tick_broadcast_force_mask. This is so that it does not incur the overhead of entering idle states when it is expected to be woken up anytime then through a broadcast IPI. The condition that forces an exit out of the idle polling is the check on setting of the TIF_NEED_RESCHED flag for the idle thread. When the broadcast IPI does arrive, it is not guaranteed that the handler sets the TIF_NEED_RESCHED flag. Hence although the cpu is cleared in the tick_broadcast_force_mask, it continues to loop in cpu_idle_poll unnecessarily wasting power. Hence exit the idle poll loop if the tick_broadcast_force_mask gets cleared and enter idle states. Of course if the cpu has entered cpu_idle_poll() on being asked to poll explicitly, it continues to poll till it is asked to reschedule. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- kernel/sched/idle.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index c47fce7..aaf1c1d 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -47,7 +47,8 @@ static inline int cpu_idle_poll(void) rcu_idle_enter(); trace_cpu_idle_rcuidle(0, smp_processor_id()); local_irq_enable(); -while (!tif_need_resched()) +while (!tif_need_resched() && +(cpu_idle_force_poll || tick_check_broadcast_expired())) You explain the tick_check_broadcast_expired() bit, but what about the cpu_idle_force_poll part? The last few lines which say "Of course if the cpu has entered cpu_idle_poll() on being asked to poll explicitly, it continues to poll till it is asked to reschedule" explain the cpu_idle_force_poll part. Perhaps I should s/poll explicitly/do cpu_idle_force_poll ?
Regards Preeti U Murthy
Re: tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
On 01/20/2015 11:39 AM, Michael Ellerman wrote: > On Mon, 2015-19-01 at 10:26:48 UTC, Preeti U Murthy wrote: >> Today if a cpu handling broadcasting of wakeups goes offline, it hands over > > It's *the* cpu handling broadcasting of wakeups right? ie. there's only ever > one at a time. Right, thanks for pointing this. > >> the job of broadcasting to another cpu in the CPU_DEAD phase. > > I think that would be clearer as "to another cpu, when the cpu going offline > reaches the CPU_DEAD state." > > Otherwise it can read as "another cpu (which is) in the CPU_DEAD phase", which > is not what you mean - I think. Yes I will word this better. > >> The CPU_DEAD notifiers are run only after the offline cpu sets its state as >> CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while > > The kthread which is running on a different cpu from either of the first two > cpus you've mentioned. It could be running on any cpu other than the offline one. The next line clarifies this - "This is fatal if the cpu on which this kthread was running". I also say "Meanwhile, the kthread doing the offline" above so as to clarify that the offline cpu has nothing to do with this kthread. > >> waiting for this transition by queuing a timer. This is fatal because if the >> cpu on which this kthread was running has no other work queued on it, it can >> re-enter deep idle state, since it sees that a broadcast cpu still exists. >> However the broadcast wakeup will never come since the cpu which was handling >> it is offline, and this cpu never wakes up to see this because its in deep >> idle state. > > Which cpu is "this cpu"? I think you mean the one running the kthread which is > doing the offline, but it's not 100% clear. Right, I will make this correction as well, its ambiguous. > >> Fix this by setting the broadcast timer to a max value so as to force the >> cpus >> entering deep idle states henceforth to freshly nominate the broadcast cpu. 
>> More >> importantly this has to be done in the CPU_DYING phase so that it is visible >> to >> all cpus right after exiting stop_machine, which is when they can re-enter >> idle. >> This ensures that handover of the broadcast duty falls in place on offline, >> without >> having to do it explicitly. > > OK, I don't know the code well enough to say if that's the right fix. > > You say: > > + /* This allows fresh nomination of broadcast cpu */ > + bc->next_event.tv64 = KTIME_MAX; > > Is that all it does? I see that check in several places in the code. This change does not affect those parts of the tick broadcast code, which do not depend on the hrtimer broadcast framework. And for those parts that do depend on this framework, this plays a critical role in handing over the broadcast duty. > > > I assume we're expecting Thomas to merge this? Yes, Thomas can you please take this into the next 3.19 rc release ? I will send out a new patch with the modified changelog. If you find this acceptable I will port it to the relevant stable kernels. > > If so it's probably worth mentioning that it fixes a bug we are seeing on I will mention this in the changelog by pointing to the bug report. Regards Preeti U Murthy > machines in the wild. So it'd be nice if it went into 3.19 and/or gets sent to > stable. > > cheers >
[PATCH] cpuidle/powernv: Read target_residency value of idle states from DT if available
The device tree now exposes the residency values for different idle states. Read these values instead of calculating residency from the latency values. The values exposed in the DT are validated for optimal power efficiency. However to maintain compatibility with the older firmware code which does not expose residency values, use default values as a fallback mechanism. While at it, clump the common parts of device tree parsing into one chunk. Signed-off-by: Preeti U Murthy --- drivers/cpuidle/cpuidle-powernv.c | 39 - 1 file changed, 25 insertions(+), 14 deletions(-) diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index 2726663..9e732e1 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -162,7 +162,8 @@ static int powernv_add_idle_states(void) int dt_idle_states; const __be32 *idle_state_flags; const __be32 *idle_state_latency; - u32 len_flags, flags, latency_ns; + const __be32 *idle_state_residency; + u32 len_flags, flags, latency_ns, residency_ns; int i; /* Currently we have snooze statically defined */ @@ -186,14 +187,21 @@ static int powernv_add_idle_states(void) return nr_idle_states; } + idle_state_residency = of_get_property(power_mgt, + "ibm,cpu-idle-state-residency-ns", NULL); + if (!idle_state_residency) { + pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-residency-ns\n"); + pr_warn("Falling back to default values\n"); + } + dt_idle_states = len_flags / sizeof(u32); for (i = 0; i < dt_idle_states; i++) { flags = be32_to_cpu(idle_state_flags[i]); - - /* Cpuidle accepts exit_latency in us and we estimate -* target residency to be 10x exit_latency + /* +* Cpuidle accepts exit_latency and target_residency in us. +* Use default target_residency values if f/w does not expose it. 
*/ latency_ns = be32_to_cpu(idle_state_latency[i]); if (flags & OPAL_PM_NAP_ENABLED) { @@ -201,12 +209,8 @@ static int powernv_add_idle_states(void) strcpy(powernv_states[nr_idle_states].name, "Nap"); strcpy(powernv_states[nr_idle_states].desc, "Nap"); powernv_states[nr_idle_states].flags = 0; - powernv_states[nr_idle_states].exit_latency = - ((unsigned int)latency_ns) / 1000; - powernv_states[nr_idle_states].target_residency = - ((unsigned int)latency_ns / 100); + powernv_states[nr_idle_states].target_residency = 100; powernv_states[nr_idle_states].enter = &nap_loop; - nr_idle_states++; } if (flags & OPAL_PM_SLEEP_ENABLED || @@ -215,13 +219,20 @@ static int powernv_add_idle_states(void) strcpy(powernv_states[nr_idle_states].name, "FastSleep"); strcpy(powernv_states[nr_idle_states].desc, "FastSleep"); powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIMER_STOP; - powernv_states[nr_idle_states].exit_latency = - ((unsigned int)latency_ns) / 1000; - powernv_states[nr_idle_states].target_residency = - ((unsigned int)latency_ns / 100); + powernv_states[nr_idle_states].target_residency = 30; powernv_states[nr_idle_states].enter = &fastsleep_loop; - nr_idle_states++; } + + powernv_states[nr_idle_states].exit_latency = + ((unsigned int)latency_ns) / 1000; + + if (idle_state_residency) { + residency_ns = be32_to_cpu(idle_state_residency[i]); + powernv_states[nr_idle_states].target_residency = + ((unsigned int)residency_ns) / 1000; + } + + nr_idle_states++; } return nr_idle_states;
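The fallback this patch implements boils down to: use the firmware-provided residency when the device-tree property exists, otherwise keep a per-state default. A stand-alone sketch of that decision, with a plain pointer standing in for the of_get_property() lookup and all names hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for of_get_property(): NULL models an older
 * firmware that does not expose ibm,cpu-idle-state-residency-ns. */
const uint32_t *dt_residency_ns;

/* Residency in us for a given idle state: the DT value when present
 * (converted from ns), otherwise the state's built-in default. */
uint32_t state_residency_us(size_t state, uint32_t default_us)
{
	if (dt_residency_ns)
		return dt_residency_ns[state] / 1000;
	return default_us;
}
```

The per-state default argument is what makes a single shared helper workable here, which is the point raised above about Nap and FastSleep needing different fallback values (100 us and 30 us in the patch).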
[PATCH] tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
Today if a cpu handling broadcasting of wakeups goes offline, it hands over the job of broadcasting to another cpu in the CPU_DEAD phase. The CPU_DEAD notifiers are run only after the offline cpu sets its state as CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while waiting for this transition by queuing a timer. This is fatal because if the cpu on which this kthread was running has no other work queued on it, it can re-enter deep idle state, since it sees that a broadcast cpu still exists. However the broadcast wakeup will never come since the cpu which was handling it is offline, and this cpu never wakes up to see this because it's in deep idle state. Fix this by setting the broadcast timer to a max value so as to force the cpus entering deep idle states henceforth to freshly nominate the broadcast cpu. More importantly this has to be done in the CPU_DYING phase so that it is visible to all cpus right after exiting stop_machine, which is when they can re-enter idle. This ensures that handover of the broadcast duty falls in place on offline, without having to do it explicitly.
Signed-off-by: Preeti U Murthy --- kernel/time/clockevents.c|2 +- kernel/time/tick-broadcast.c |4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 5544990..f3907c9 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_CPU_DYING: tick_handover_do_timer(arg); + tick_shutdown_broadcast_oneshot(arg); break; case CLOCK_EVT_NOTIFY_SUSPEND: @@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg) break; case CLOCK_EVT_NOTIFY_CPU_DEAD: - tick_shutdown_broadcast_oneshot(arg); tick_shutdown_broadcast(arg); tick_shutdown(arg); /* diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 066f0ec..e9c1d9b 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -675,8 +675,8 @@ static void broadcast_move_bc(int deadcpu) if (!bc || !broadcast_needs_cpu(bc, deadcpu)) return; - /* This moves the broadcast assignment to this cpu */ - clockevents_program_event(bc, bc->next_event, 1); + /* This allows fresh nomination of broadcast cpu */ + bc->next_event.tv64 = KTIME_MAX; } /*
Re: tick/broadcast: Make movement of broadcast hrtimer robust against hotplug
On 01/20/2015 11:39 AM, Michael Ellerman wrote: On Mon, 2015-19-01 at 10:26:48 UTC, Preeti U Murthy wrote: Today if a cpu handling broadcasting of wakeups goes offline, it hands over It's *the* cpu handling broadcasting of wakeups right? ie. there's only ever one at a time. Right, thanks for pointing this. the job of broadcasting to another cpu in the CPU_DEAD phase. I think that would be clearer as to another cpu, when the cpu going offline reaches the CPU_DEAD state. Otherwise it can read as another cpu (which is) in the CPU_DEAD phase, which is not what you mean - I think. Yes I will word this better. The CPU_DEAD notifiers are run only after the offline cpu sets its state as CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while The kthread which is running on a different cpu from either of the first two cpus you've mentioned. It could be running on any cpu other than the offline one. The next line clarifies this - This is fatal if the cpu on which this kthread was running. I also say Meanwhile, the kthread doing the offline above so as to clarify that the offline cpu has nothing to do with this kthread. waiting for this transition by queuing a timer. This is fatal because if the cpu on which this kthread was running has no other work queued on it, it can re-enter deep idle state, since it sees that a broadcast cpu still exists. However the broadcast wakeup will never come since the cpu which was handling it is offline, and this cpu never wakes up to see this because its in deep idle state. Which cpu is this cpu? I think you mean the one running the kthread which is doing the offline, but it's not 100% clear. Right, I will make this correction as well, its ambiguous. Fix this by setting the broadcast timer to a max value so as to force the cpus entering deep idle states henceforth to freshly nominate the broadcast cpu. 
More importantly this has to be done in the CPU_DYING phase so that it is visible to all cpus right after exiting stop_machine, which is when they can re-enter idle. This ensures that handover of the broadcast duty falls in place on offline, without having to do it explicitly. OK, I don't know the code well enough to say if that's the right fix. You say: + /* This allows fresh nomination of broadcast cpu */ + bc-next_event.tv64 = KTIME_MAX; Is that all it does? I see that check in several places in the code. This change does not affect those parts of the tick broadcast code, which do not depend on the hrtimer broadcast framework. And for those parts that do depend on this framework, this plays a critical role in handing over the broadcast duty. I assume we're expecting Thomas to merge this? Yes, Thomas can you please take this into the next 3.19 rc release ? I will send out a new patch with the modified changelog. If you find this acceptable I will port it to the relevant stable kernels. If so it's probably worth mentioning that it fixes a bug we are seeing on I will mention this in the changelog by pointing to the bug report. Regards Preeti U Murthy machines in the wild. So it'd be nice if it went into 3.19 and/or gets sent to stable. cheers -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] idle/tick-broadcast: Exit cpu idle poll loop when cleared from tick_broadcast_force_mask
An idle cpu enters cpu_idle_poll() if it is set in the tick_broadcast_force_mask. This is so that it does not incur the overhead of entering idle states when it is expected to be woken up anytime then through a broadcast IPI. The condition that forces an exit out of the idle polling is the check on setting of the TIF_NEED_RESCHED flag for the idle thread. When the broadcast IPI does arrive, it is not guaranteed that the handler sets the TIF_NEED_RESCHED flag. Hence although the cpu is cleared in the tick_broadcast_force_mask, it continues to loop in cpu_idle_poll unnecessarily wasting power. Hence exit the idle poll loop if the tick_broadcast_force_mask gets cleared and enter idle states. Of course if the cpu has entered cpu_idle_poll() on being asked to poll explicitly, it continues to poll till it is asked to reschedule. Signed-off-by: Preeti U Murthy --- kernel/sched/idle.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index c47fce7..aaf1c1d 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -47,7 +47,8 @@ static inline int cpu_idle_poll(void) rcu_idle_enter(); trace_cpu_idle_rcuidle(0, smp_processor_id()); local_irq_enable(); - while (!tif_need_resched()) + while (!tif_need_resched() && + (cpu_idle_force_poll || tick_check_broadcast_expired())) cpu_relax(); trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id()); rcu_idle_exit();
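The exit condition this patch introduces can be captured as a small predicate: polling continues only while no reschedule is requested and there is still a reason to poll, i.e. either an explicit poll request or a pending broadcast wakeup. The flag variables below merely stand in for the kernel's tif_need_resched(), cpu_idle_force_poll and tick_check_broadcast_expired():

```c
#include <stdbool.h>

/* Stand-ins for the kernel predicates; illustrative only. */
bool need_resched;
bool force_poll;
bool broadcast_pending;

/* Mirrors the patched loop condition in cpu_idle_poll(): keep spinning
 * in cpu_relax() only while no resched is pending AND there is still a
 * reason to poll. */
bool keep_polling(void)
{
	return !need_resched && (force_poll || broadcast_pending);
}
```

Note the asymmetry the changelog describes: clearing broadcast_pending alone now ends the loop (the cpu can enter real idle states), while force_poll keeps the cpu polling until a reschedule is requested.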
Re: [PATCH] tick/powerclamp: Remove tick_nohz_idle abuse
Hi Jacob, On 12/23/2014 08:27 AM, Jacob Pan wrote: > On Sat, 20 Dec 2014 07:01:12 +0530 > Preeti U Murthy wrote: > >> On 12/20/2014 01:26 AM, Thomas Gleixner wrote: >>> On Fri, 19 Dec 2014, Jacob Pan wrote: >>> >>>> On Thu, 18 Dec 2014 22:12:57 +0100 (CET) >>>> Thomas Gleixner wrote: >>>> >>>>> On Thu, 18 Dec 2014, Jacob Pan wrote: >>>>>> OK I agree, also as I mentioned earlier, Peter already has a >>>>>> patch for consolidated idle loop and remove >>>>>> tick_nohz_idle_enter/exit call from powerclamp driver. I have >>>>>> been working on a few tweaks to maintain the functionality and >>>>>> efficiency with the consolidated idle loop. We can apply the >>>>>> patches on top of yours. >>>>> >>>>> No. This is equally wrong as I pointed out before. The 'unified' >>>>> idle loop is still fake and just pretending to be idle. >>>>> >>>> In terms of efficiency, the consolidated idle loop will allow >>>> turning off sched tick during idle injection period. If we just >>>> take out the tick_nohz_idle_xxx call, the effectiveness of >>>> powerclamp is going down significantly. I am not arguing the >>>> design but from fixing regression perspective or short term >>>> solution. >>> >>> There is no perspective. Period. >>> >>> Its violates every rightful assumption of the nohz_IDLE_* code and >>> just ever worked by chance. There is so much subtle wreckage lurking >>> there that the only sane solution is to forbid it. End of story. >>> >>> Thanks, >>> >>> tglx >>> >> Hi Jacob, >> >> Like Thomas pointed out, we can design a sane solution for powerclamp. >> Idle injection is nothing but throttling of runqueue. If the runqueue >> is throttled, no fair tasks will be selected and the natural choice >> in the absence of tasks from any other sched class is the idle task. >> >> The idle loop will automatically be called and the nohz state will >> also fall in place. The cpu is really idle now: the runqueue has no >> tasks and the task running on the cpu is the idle thread. 
The >> throttled tasks are on a separate list. >> >> When the period of idle injection is over, we unthrottle the runqueue. >> All this being taken care of by a non-deferrable timer. This design >> ensures that the intention of powerclamp is not hampered while at the >> same time maintaining a sane state for nohz; you will get the >> efficiency you want. >> >> Of course there may be corner cases and challenges around >> synchronization of package idle, which I am sure we can work around >> with a better design such as the above. I am working on that patchset >> and will post out in a day. You can take a look and let us know the >> pieces we are missing. >> >> I find that implementing the above design is not too hard. >> > Hi Preeti, > Yeah, it seems to be a good approach. looking forward to work with you > on this. Timer may scale better for larger systems. One question, will > timer irq gets unpredictable delays if run by ksoftirqd? I am sorry I could not respond earlier; I was on vacation as well. Yes, we may have a problem here. Let alone synchronization between cpus in performing clamping, there are two other functionality issues that I see. 1. Since periodic timers get executed in the softirq context, scheduler_tick() would already have run by then. i.e. hrtimer_interrupt() |__tick_sched_handle() |__scheduler_tick() |__raise_softirq(TIMER_SOFTIRQ) ksoftirqd runs on local_bh_enable()-->powerclamp_timer handler runs Although runqueues are throttled in the powerclamp timer handler, it has to wait till the next scheduler tick to select an idle task to run. A precious 4-10ms depending on the config would have passed by then. 2. For the same reason as 1, when the ksoftirqd has to run on the cpus during the tick after the one in which throttling is enabled, cpus are unavailable to run the daemon because they are throttled. However there is no other way to unthrottle the runqueues now except by running the ksoftirqd; a chicken-and-egg problem.
I think both the above problems could be solved by using hrtimers instead of periodic timers to perform clamping/unclamping, since hrtimers are serviced in the interrupt context. But we cannot initialize/start/modify hrtimers on a remote cpu. We will end up using IPIs for handling the hrtimers during start/end of powerclamp or modification of the idle duration of clamping, which is not a tempting option either. So I am currently stuck at this point. I would be glad to have some suggestions. Thanks Regards Preeti U Murthy > BTW, I may not be able to respond quickly during the holidays. If > things workout, it may benefit ACPI PAD driver as well. > > > Thanks, > > Jacob >> Regards >> Preeti U Murthy >> > > [Jacob Pan]
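The ordering problem in point 1 can be made concrete with a toy userspace model (all names here are illustrative, not kernel APIs): within one tick, scheduler_tick() runs before the softirq timer handlers, so a throttle requested from a periodic-timer handler at tick N is only observed by the scheduler at tick N + 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-cpu state for the simulation: whether the runqueue is
 * throttled, and the first tick at which idle was actually picked. */
struct sim {
	bool throttled;
	int idle_since_tick;	/* -1 while still running tasks */
};

static void run_tick(struct sim *s, int tick, bool request_throttle)
{
	/* scheduler_tick(): picks the idle task only if the runqueue
	 * was already throttled when the tick arrived. */
	if (s->throttled && s->idle_since_tick < 0)
		s->idle_since_tick = tick;
	/* softirq timer handlers run afterwards; this is where a
	 * periodic powerclamp timer would set the throttle. */
	if (request_throttle)
		s->throttled = true;
}

/* Request throttling at tick `request_at`; return the tick at which
 * the cpu first goes idle. */
static int first_idle_tick(int request_at)
{
	struct sim s = { false, -1 };
	for (int t = 0; t < 10; t++)
		run_tick(&s, t, t == request_at);
	return s.idle_since_tick;
}
```

The one-tick lag shown here is exactly the "precious 4-10ms depending on the config" mentioned above; hrtimers avoid it because they are serviced in hard interrupt context, before the tick handler returns.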
[RFC PATCH] drivers/intel_powerclamp : Redesign implementation of idle injection
The powerclamp driver injects idle periods to stay within the thermal constraints. The driver does a fake idle by spawning per-cpu threads that call the mwait instruction. This behavior of fake idle can confuse the other kernel subsystems. For instance, it calls into the nohz tick handlers, which are meant to be called only by the idle thread. It sets the state of the tick in each cpu to idle and stops the tick, when there are tasks on the runqueue. As a result the callers of idle_cpu()/ tick_nohz_tick_stopped() see different states of the cpu; while the former thinks that the cpu is busy, the latter thinks that it is idle. The outcome may be inconsistency in the scheduler/nohz states which can lead to serious consequences. One of them was reported on this thread: https://lkml.org/lkml/2014/12/11/365. Thomas posted out a patch to disable the powerclamp driver from calling into the tick nohz code which has taken care of the above regression for the moment. However, the powerclamp driver as a result will not be able to inject idle periods due to the presence of periodic ticks. With the current design of fake idle, we cannot move towards a better solution. https://lkml.org/lkml/2014/12/18/169 This patch aims at removing the concept of fake idle and instead makes the cpus truly idle by throttling the runqueues during the idle injection periods. The situation is in fact very similar to throttling cfs_rqs when they exceed their bandwidths. The exception here is that the root cfs_rq is throttled in the presence of available bandwidth, due to the need to force idle. The runqueues automatically get unthrottled at all other times and a call to reschedule is made. Per-cpu timers are queued to expire after every active/idle period of the powerclamp driver, and decide whether to throttle or unthrottle the runqueues. This way, the tick nohz state and the scheduler states of the cpus are sane during the idle injection periods because they naturally fall in place.
Another point to note is that scheduler_tick() is called after the timers are serviced. This means that soon after the runqueue is throttled by the powerclamp timer, the idle task will get scheduled in. IOW, the request of the powerclamp timer to inject idle is honored almost immediately. The same holds for unthrottling as well. This patch is compile-tested only. The goal is to get a consensus on the design first. There are some points that need to be given thought as far as I can see: 1. The idle task chosen during the idle injection period needs to enter a specific mwait state. The cpuidle governors cannot choose any idle state like they do generally for idle tasks. 2. We throttle the runqueue when the bandwidth is available. We do not touch any parameters around the cfs_rq bandwidth in this patch. It is supposed to bypass the bandwidth checks entirely. But will there be any serious consequence as a result? 3. There can be cases when the runqueues are throttled when bandwidths are exceeded too. What are the consequences of the powerclamp driver running in parallel? 4. The idle periods should stay synchronized across all cpus because the change in the design is to start timers to expire immediately instead of waking up kthreads. The only concern is cpu hotplug operations when the cpu coming online can go out of sync with the idle periods. But that situation exists today as well. Signed-off-by: Preeti U Murthy --- Missed Ccing LKML on the previous mail.
drivers/thermal/intel_powerclamp.c | 227 ++-- include/linux/sched.h |2 kernel/sched/fair.c| 29 - 3 files changed, 117 insertions(+), 141 deletions(-) diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c index 95cb7fc..a6fd453 100644 --- a/drivers/thermal/intel_powerclamp.c +++ b/drivers/thermal/intel_powerclamp.c @@ -87,7 +87,7 @@ static unsigned int control_cpu; /* The cpu assigned to collect stat and update static bool clamping; -static struct task_struct * __percpu *powerclamp_thread; +DEFINE_PER_CPU(struct timer_list, powerclamp_timer); static struct thermal_cooling_device *cooling_dev; static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu * clamping thread @@ -256,11 +256,6 @@ static u64 pkg_state_counter(void) return count; } -static void noop_timer(unsigned long foo) -{ - /* empty... just the fact that we get the interrupt wakes us up */ -} - static unsigned int get_compensation(int ratio) { unsigned int comp = 0; @@ -362,102 +357,72 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio, return set_target_ratio + guard <= current_ratio; } -static int clamp_thread(void *arg) +static void powerclamp_timer_fn(unsigned long data) { - int cpunr = (unsigned long)arg; - DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0); - static co
Re: [PATCH] tick/powerclamp: Remove tick_nohz_idle abuse
On 12/20/2014 01:26 AM, Thomas Gleixner wrote: > On Fri, 19 Dec 2014, Jacob Pan wrote: > >> On Thu, 18 Dec 2014 22:12:57 +0100 (CET) >> Thomas Gleixner wrote: >> >>> On Thu, 18 Dec 2014, Jacob Pan wrote: >>>> OK I agree, also as I mentioned earlier, Peter already has a patch >>>> for consolidated idle loop and remove tick_nohz_idle_enter/exit >>>> call from powerclamp driver. I have been working on a few tweaks to >>>> maintain the functionality and efficiency with the consolidated >>>> idle loop. We can apply the patches on top of yours. >>> >>> No. This is equally wrong as I pointed out before. The 'unified' idle >>> loop is still fake and just pretending to be idle. >>> >> In terms of efficiency, the consolidated idle loop will allow turning >> off sched tick during idle injection period. If we just take out the >> tick_nohz_idle_xxx call, the effectiveness of powerclamp is going down >> significantly. I am not arguing the design but from fixing regression >> perspective or short term solution. > > There is no perspective. Period. > > Its violates every rightful assumption of the nohz_IDLE_* code and > just ever worked by chance. There is so much subtle wreckage lurking > there that the only sane solution is to forbid it. End of story. > > Thanks, > > tglx > Hi Jacob, Like Thomas pointed out, we can design a sane solution for powerclamp. Idle injection is nothing but throttling of the runqueue. If the runqueue is throttled, no fair tasks will be selected and the natural choice in the absence of tasks from any other sched class is the idle task. The idle loop will automatically be called and the nohz state will also fall in place. The cpu is really idle now: the runqueue has no tasks and the task running on the cpu is the idle thread. The throttled tasks are on a separate list. When the period of idle injection is over, we unthrottle the runqueue. All this being taken care of by a non-deferrable timer.
This design ensures that the intention of powerclamp is not hampered while at the same time maintaining a sane state for nohz; you will get the efficiency you want. Of course there may be corner cases and challenges around synchronization of package idle, which I am sure we can work around with a better design such as the above. I am working on that patchset and will post out in a day. You can take a look and let us know the pieces we are missing. I find that implementing the above design is not too hard. Regards Preeti U Murthy
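The selection logic implied by this design can be sketched in a few lines of userspace C (hypothetical names, not the scheduler's actual interfaces): a throttled runqueue offers no fair tasks, so the idle task becomes the natural pick even while fair tasks sit on the separate throttled list.

```c
#include <assert.h>
#include <stdbool.h>

enum pick { PICK_FAIR, PICK_IDLE };

/* Toy model of task selection under idle injection: if the runqueue
 * is throttled, fair tasks are invisible to the picker (they live on
 * a throttled list), so with no other sched classes present the
 * idle task is chosen. */
static enum pick pick_next(int nr_fair_tasks, bool throttled)
{
	if (throttled || nr_fair_tasks == 0)
		return PICK_IDLE;
	return PICK_FAIR;
}
```

This is what makes the nohz state sane by construction: the cpu that picks idle here really is idle, so the idle thread itself drives tick_nohz_idle_enter/exit as the core code expects.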
Re: [PATCH] tick/powerclamp: Remove tick_nohz_idle abuse
Hi Thomas, On 12/18/2014 04:21 PM, Thomas Gleixner wrote: > commit 4dbd27711cd9 "tick: export nohz tick idle symbols for module > use" was merged via the thermal tree without an explicit ack from the > relevant maintainers. > > The exports are abused by the intel powerclamp driver which implements > a fake idle state from a sched FIFO task. This causes all kinds of > wreckage in the NOHZ core code which rightfully assumes that > tick_nohz_idle_enter/exit() are only called from the idle task itself. > > Recent changes in the NOHZ core lead to a failure of the powerclamp > driver and now people try to hack completely broken and backwards > workarounds into the NOHZ core code. This is completely unacceptable. > > The real solution is to fix the powerclamp driver by rewriting it with > a sane concept, but that's beyond the scope of this. > > So the only solution for now is to remove the calls into the core NOHZ > code from the powerclamp trainwreck along with the exports. > > Fixes: d6d71ee4a14a "PM: Introduce Intel PowerClamp Driver" > Signed-off-by: Thomas Gleixner > --- > diff --git a/drivers/thermal/intel_powerclamp.c > b/drivers/thermal/intel_powerclamp.c > index b46c706e1cac..e98b4249187c 100644 > --- a/drivers/thermal/intel_powerclamp.c > +++ b/drivers/thermal/intel_powerclamp.c > @@ -435,7 +435,6 @@ static int clamp_thread(void *arg) >* allowed. thus jiffies are updated properly. 
>*/ > preempt_disable(); > - tick_nohz_idle_enter(); > /* mwait until target jiffies is reached */ > while (time_before(jiffies, target_jiffies)) { > unsigned long ecx = 1; > @@ -451,7 +450,6 @@ static int clamp_thread(void *arg) > start_critical_timings(); > atomic_inc(&idle_wakeup_counter); > } > - tick_nohz_idle_exit(); > preempt_enable(); > } > del_timer_sync(&wakeup_timer); > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index 4d54b7540585..1363d58f07e9 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -847,7 +847,6 @@ void tick_nohz_idle_enter(void) > > local_irq_enable(); > } > -EXPORT_SYMBOL_GPL(tick_nohz_idle_enter); > > /** > * tick_nohz_irq_exit - update next tick event from interrupt exit > @@ -974,7 +973,6 @@ void tick_nohz_idle_exit(void) > > local_irq_enable(); > } > -EXPORT_SYMBOL_GPL(tick_nohz_idle_exit); > > static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now) > { > Ok the solution looks apt to me. Let me see if I can come up with a sane solution for powerclamp based on the suggestions that you gave in the previous thread. I was thinking of the below steps towards its implementation. The idea is based on the throttling mechanism that you had suggested. 1. Queue a deferrable periodic timer whose handler checks if idle needs to be injected. If so, it sets rq->need_throttle for the cpu. If it's already in the fake idle period, it clears rq->need_throttle and sets need_resched. 2. pick_next_task_fair() checks rq->need_throttle and dequeues all tasks in the rq if this is set and puts them on a throttled list. This mechanism is similar to throttling cfs rq today. This function hence fails to return a task, and if no task from any other sched class exists, the idle task is picked. Peter, thoughts? 3. So we are now in the idle injected period. The scheduler state is sane because the cpu is idle, rq->nr_running = 0, rq->curr = rq->idle.
The nohz state is sane, because ts->inidle = 1 and tick_stopped may or may not be 1 and they are set by an idle task. 4. When need_resched is set again, the idle task of course unsets inidle and restarts the tick. In the following scheduler tick, pick_next_task_fair() sees that rq->need_throttle is cleared, enqueues back the tasks and returns one of them to run. Of course there may be several points that I have missed. But how does the approach appear? If it looks sane enough, the cases which do not obviously fall in place can be worked upon. Regards Preeti U Murthy
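Step 1 above amounts to a two-phase toggle driven by the periodic timer. A minimal userspace model of that handler (struct and field names are hypothetical, chosen to mirror the rq->need_throttle idea, not actual kernel fields):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-cpu runqueue state for the idle-injection toggle. */
struct rq_model {
	bool need_throttle;	/* cpu should enter fake-idle period */
	bool need_resched;	/* scheduler should re-pick a task   */
};

/* Model of the periodic timer handler in step 1: each expiry flips
 * the runqueue between the throttled (idle injection) phase and the
 * unthrottled phase; leaving the idle phase also kicks the scheduler
 * so the throttled tasks get re-enqueued on the next tick. */
static void powerclamp_timer_tick(struct rq_model *rq)
{
	if (!rq->need_throttle) {
		rq->need_throttle = true;	/* start idle period */
	} else {
		rq->need_throttle = false;	/* idle period over */
		rq->need_resched = true;	/* re-pick tasks    */
	}
}
```

The real handler would additionally decide *whether* idle needs injecting at all (based on the target idle ratio) rather than unconditionally alternating, but the state transitions are the ones sketched here.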
Re: [PATCH] tick/powerclamp: Remove tick_nohz_idle abuse
Hi Thomas,

On 12/18/2014 04:21 PM, Thomas Gleixner wrote:
> commit 4dbd27711cd9 "tick: export nohz tick idle symbols for module use"
> was merged via the thermal tree without an explicit ack from the relevant
> maintainers.
>
> The exports are abused by the intel powerclamp driver which implements a
> fake idle state from a sched FIFO task. This causes all kinds of wreckage
> in the NOHZ core code which rightfully assumes that
> tick_nohz_idle_enter/exit() are only called from the idle task itself.
>
> Recent changes in the NOHZ core lead to a failure of the powerclamp
> driver and now people try to hack completely broken and backwards
> workarounds into the NOHZ core code. This is completely unacceptable.
>
> The real solution is to fix the powerclamp driver by rewriting it with a
> sane concept, but that's beyond the scope of this.
>
> So the only solution for now is to remove the calls into the core NOHZ
> code from the powerclamp trainwreck along with the exports.
>
> Fixes: d6d71ee4a14a "PM: Introduce Intel PowerClamp Driver"
> Signed-off-by: Thomas Gleixner <t...@linutronix.de>
> ---
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index b46c706e1cac..e98b4249187c 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -435,7 +435,6 @@ static int clamp_thread(void *arg)
>  	 * allowed. thus jiffies are updated properly.
>  	 */
>  	preempt_disable();
> -	tick_nohz_idle_enter();
>  	/* mwait until target jiffies is reached */
>  	while (time_before(jiffies, target_jiffies)) {
>  		unsigned long ecx = 1;
> @@ -451,7 +450,6 @@ static int clamp_thread(void *arg)
>  		start_critical_timings();
>  		atomic_inc(&idle_wakeup_counter);
>  	}
> -	tick_nohz_idle_exit();
>  	preempt_enable();
>  	}
>  	del_timer_sync(&wakeup_timer);
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 4d54b7540585..1363d58f07e9 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -847,7 +847,6 @@ void tick_nohz_idle_enter(void)
>  	local_irq_enable();
>  }
> -EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
>
>  /**
>   * tick_nohz_irq_exit - update next tick event from interrupt exit
> @@ -974,7 +973,6 @@ void tick_nohz_idle_exit(void)
>  	local_irq_enable();
>  }
> -EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
>
>  static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
>  {

Ok, the solution looks apt to me. Let me see if I can come up with a sane solution for powerclamp based on the suggestions that you gave in the previous thread. I was thinking of the below steps towards its implementation. The idea is based on the throttling mechanism that you had suggested.

1. Queue a deferrable periodic timer whose handler checks if idle needs to be injected. If so, it sets rq->need_throttle for the cpu. If the cpu is already in the fake idle period, it clears rq->need_throttle and sets need_resched.

2. pick_next_task_fair() checks rq->need_throttle and, if it is set, dequeues all tasks in the rq and puts them on a throttled list. This mechanism is similar to the throttling of a cfs rq today. The function hence fails to return a task, and if no task from any other sched class exists, the idle task is picked. Peter, thoughts?

3. So we are now in the idle injected period. The scheduler state is sane because the cpu is idle: rq->nr_running == 0 and rq->curr == rq->idle. The nohz state is sane, because ts->inidle = 1, tick_stopped may or may not be 1, and both are set by an idle task.

4. When need_resched is set again, the idle task of course unsets inidle and restarts the tick. In the following scheduler tick, pick_next_task_fair() sees that rq->need_throttle is cleared, enqueues the tasks back and returns one of them to run.

Of course there may be several points that I have missed. But how does the approach appear? If it looks sane enough, the cases which do not obviously fall in place can be worked upon.

Regards
Preeti U Murthy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection
On 12/16/2014 10:23 AM, Viresh Kumar wrote: > + Peter from Jacob's mail .. > > On 16 December 2014 at 05:14, Frederic Weisbecker wrote: >> So to summarize: I see it enqueues a timer then it loops on that timer >> expiration. >> On that loop we stop the CPU and we expect the timer to fire and wake the >> thread up. >> But if the delayed tick fires too early, before the timer actually >> expires, then we go to sleep again because we haven't yet reached the delay, >> but since tick_nohz_irq_exit() is only called on idle tasks and not for >> threads >> like powerclamp, the tick isn't rescheduled to handle the remaining timer >> slice >> so we sleep forever. > > Perfect !! > >> Hence if we really want to stop the tick when we mimic idle from powerclamp >> driver, >> we must call tick_nohz_irq_exit() on irq exit to do it correctly. >> >> It happened to work by accident before the commit because we were >> rescheduling the >> tick from itself without tick_nohz_irq_exit() to cancel anything. And that >> restored >> the periodic behaviour necessary to complete the delay. >> >> So the above change is rather a hack than a solution. >> >> We have several choices: >> >> 1) Revert the commit. But this has to be a temporary solution really. >> Powerclamp has >> to be fixed and handle tick_nohz_irq_exit(). >> >> 2) Remove powerclamp tick stop until somebody fixes it to handle nohz >> properly. >> >> 2) Fix it directly. But I believe there is a release that is going to miss >> the fix >> and suffer the regression. Does the regression matter for anybody? Is >> powerclamp >> meant for anything else than testing (I have no idea what it's used for)? >> >> So to fix powerclamp to handle nohz correctly, tick_nohz_irq_exit() must be >> called >> for both idle tasks and powerclamp kthreads. Checking ts->inidle can be a >> good way to match As far as I can see, the primary purpose of tick_nohz_irq_enter()/exit() paths was to take care of *tick stopped* cases. 
Before handling interrupts we would want jiffies to be updated, which is done in tick_nohz_irq_enter(). And after handling interrupts we need to verify if any timers/RCU callbacks were queued during an interrupt. Both of these will fail to be handled properly *only when the tick is stopped*, right? If so, the checks which permit entry into these functions should at a minimum include a check on ts->tick_stopped. The rest of the checks around whether the CPU is idle/need_resched() may be needed in conjunction, but will not be complete without checking if ticks are stopped. I see that tick_nohz_irq_enter() has a check on ts->tick_stopped, which is correct, but tick_nohz_irq_exit() does not. Adding this check to tick_nohz_irq_exit() will not only solve this regression but is probably a fix to a long-standing bug in the conditional check in tick_nohz_irq_exit(). >> both. That means we might need to use a reduced part of idle_cpu() to avoid >> redundant checks. >> tick_irq_enter() must be called as well for powerclamp, in case it's the >> only CPU running, it >> has to fixup the timekeeping alone. > > Yeah, you can call my fix a Hacky one. I agree.. > > But I don't know if calling tick_nohz_irq_exit() from these threads wouldn't > be hacky as well. And of course my knowledge wouldn't be adequate here to > judge that :) > > It looked a bit odd to me. Why should we call irq-exit from the threads > working > with idle? And that's not what we do even for the real-idle loop as well .. > > Also from Jacob's referenced patch, we would be moving to a consolidated > idle-loop for both real idle and threads like powerclamp, and this part may be > tricky then.. > > Untested, but what about something like this? 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 5918d227730f..5e4bfc367735 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -321,7 +321,7 @@ asmlinkage __visible void do_softirq(void)
>  void irq_enter(void)
>  {
>  	rcu_irq_enter();
> -	if (is_idle_task(current) && !in_interrupt()) {
> +	if (tick_idle_active() && !in_interrupt()) {
>  		/*
>  		 * Prevent raise_softirq from needlessly waking up ksoftirqd
>  		 * here, as softirq will be serviced on return from interrupt.
> @@ -363,7 +363,7 @@ static inline void tick_irq_exit(void)
>  	int cpu = smp_processor_id();
>
>  	/* Make sure that timer wheel updates are propagated */
> -	if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
> +	if ((tick_idle_active() && !need_resched()) || tick_nohz_full_cpu(cpu)) {

For reasons mentioned above, this check alone will not do. It may solve this regression, but the check is incomplete. IMO it should simply be:

	if (tick_nohz_tick_stopped() || tick_nohz_full_cpu())

Regards
Preeti U Murthy
Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection
On 12/15/2014 03:02 PM, Viresh Kumar wrote: > On 15 December 2014 at 12:55, Preeti U Murthy > wrote: >> Hi Viresh, >> >> Let me explain why I think this is happening. >> >> 1. tick_nohz_irq_enter/exit() both get called *only if the cpu is idle* >> and receives an interrupt. > > Bang on target. Yeah that's the part we missed while writing this patch :) > >> 2. Commit 2a16fc93d2c9568e1, cancels programming of tick_sched timer >> in its handler, assuming that tick_nohz_irq_exit() will take care of >> programming the clock event device appropriately, and hence it would >> requeue or cancel the tick_sched timer. > > Correct. > >> 3. But the intel_powerclamp driver injects an idle period only. >> *The CPU however is not idle*. It has work on its runqueue and the >> rq->curr != idle. This means that *tick_nohz_irq_enter()/exit() will not >> get called on any interrupt*. > > Still good.. > >> 4. As a consequence, when we get a hrtimer interrupt during the period >> that the powerclamp driver is mimicking idle, the exit path of the >> interrupt never calls tick_nohz_irq_exit(). Hence the tick_sched timer >> that would have got removed due to the above commit will not get >> enqueued back on for any pending timers that there might be. Besides >> this, *jiffies never gets updated*. > > Jiffies can be updated by any CPU and there is something called a control > cpu with powerclamp driver. BUT we may have got interrupted before the > powerclamp timer expired and so we are stuck in the > > while (time_before(jiffies, target_jiffies)) > > loop for ever. > >> Hope the above explanation makes sense. > > Mostly good. Thanks for helping out. > > Now, what's the right solution going forward ? > > - Revert the offending commit .. > - Or still try to avoid reprogramming if we can .. 
> This is what I could come up with to still avoid reprogramming of tick:
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index cc0a5b6f741b..49f4278f69e2 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1100,7 +1100,7 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
>  	tick_sched_handle(ts, regs);
>
>  	/* No need to reprogram if we are in idle or full dynticks mode */
> -	if (unlikely(ts->tick_stopped))
> +	if (unlikely(ts->tick_stopped && (is_idle_task(current) || !ts->inidle)))
>  		return HRTIMER_NORESTART;
>
>  	hrtimer_forward(timer, now, tick_period);

Looks good to me. You can add my Reviewed-by to the above patch.

> Above change checks why we have stopped the tick..
> - The cpu has gone idle (really): is_idle_task(current)
> - The cpu isn't in idle mode, i.e. it's in nohz-full mode: !ts->inidle
>
> This fixed the issues with powerclamp in my case.
>
> @Fengguang: Can you please check if this fixes it for you as well?
>
> @Thomas: Please let me know if you want me to send this fix
> or you want to revert the original commit itself.

Regards
Preeti U Murthy

> Thanks.
>
> --
> Viresh
Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection
Hi Viresh,

Let me explain why I think this is happening.

1. tick_nohz_irq_enter/exit() both get called *only if the cpu is idle* and receives an interrupt.

2. Commit 2a16fc93d2c9568e1 cancels programming of the tick_sched timer in its handler, assuming that tick_nohz_irq_exit() will take care of programming the clock event device appropriately, and hence it would requeue or cancel the tick_sched timer.

3. But the intel_powerclamp driver injects an idle period only. *The CPU however is not idle*. It has work on its runqueue and rq->curr != idle. This means that *tick_nohz_irq_enter()/exit() will not get called on any interrupt*.

4. As a consequence, when we get a hrtimer interrupt during the period that the powerclamp driver is mimicking idle, the exit path of the interrupt never calls tick_nohz_irq_exit(). Hence the tick_sched timer that would have got removed due to the above commit will not get enqueued back for any pending timers that there might be. Besides this, *jiffies never gets updated*.

5. If you look at the code of the powerclamp driver, clamp_thread() loops on jiffies getting updated. It continues to do so with preemption disabled and no tick_sched timer to force a scheduler tick to update the jiffies. Since this happens on cpus in a package, all of them get soft locked up.

Hope the above explanation makes sense.

Regards
Preeti U Murthy

On 12/12/2014 05:27 PM, Viresh Kumar wrote:
> Cc'ing Thomas as well..
>
> On 12 December 2014 at 01:12, Fengguang Wu wrote:
>> Hi Viresh,
>>
>> We noticed the below lockup regression on commit 2a16fc93d2c ("nohz:
>> Avoid tick's double reprogramming in highres mode").
>>
>> testbox/testcase/testparams: ivb42/idle-inject/60s-200%-10cp
>>
>> b5e995e671d8e4d7 2a16fc93d2c9568e16d45db77c
>> --
>>    fail:runs    %reproduction    fail:runs
>>        |             |               |
>>       :5           100%             1:1    last_state.is_incomplete_run
>>       :5           100%             1:1    last_state.running
>>
>> testbox/testcase/testparams: lkp-sb03/idle-inject/60s-200%-10cp
>>
>> b5e995e671d8e4d7 2a16fc93d2c9568e16d45db77c
>> --
>>       :7           100%             1:1    last_state.is_incomplete_run
>>       :7           100%             1:1    last_state.running
>>
>> Where test box ivb42 is Ivy Bridge-EP and lkp-sb03 is Sandy Bridge-EP.
>>
>> To reproduce:
>>
>>   apt-get install ruby ruby-oj
>>   git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
>>   cd lkp-tests
>>   bin/setup-local job.yaml # the job file attached in this email
>>   bin/run-local job.yaml
>>
>> Basically what the test case does is to
>>
>> - find a Sandy Bridge or newer machine
>> - look for a cooling device with type "intel_powerclamp"
>> - set cur_state to 10
>> - run any CPU extensive workload
>>
>> Then expect soft lockup. It's very reproducible.
>
> Thanks Fengguang. Yes I am able to reproduce it, but don't know yet what
> went wrong..
>
> --
> viresh
Re: [Query] Spurious interrupts from clockevent device on X86 Ivybridge
On 12/11/2014 10:26 AM, Santosh Shukla wrote:
> On 11 December 2014 at 10:14, Preeti U Murthy wrote:
>> On 12/10/2014 06:22 PM, Viresh Kumar wrote:
>>> On 10 December 2014 at 18:03, Preeti U Murthy wrote:
>>>
>>>> Right. We get an interrupt when nobody had asked for it to be delivered
>>>> or had asked for it to be delivered and later canceled the request. It
>>>> is most often in the latter situation, that there can be race
>>>> conditions. If these race conditions are not taken care of, they can
>>>> result in spurious interrupts.
>>>
>>> But the delta time will be very small then, right ?
>>
>> I was talking of the case where we get an interrupt from the clockevent
>> device but don't find the hrtimer to service, and not really of an anomaly
>> in timekeeping.
>> For instance, one of the issues that we had seen earlier wherein we
>> cancel the tick-sched-timer before going tickless, but since we had
>> programmed the clock event device to fire, we get a spurious interrupt.
>
> I verified this case before reporting; in my case tick_sched_timer does
> get cancelled before the expire duration but then the clk_evt_device gets
> reprogrammed for the next time node in the list. __remove_hrtimer() takes
> care of that.

Right. The scenario I described happens in the Low Resolution Mode. You are right, this does not happen in the High Resolution Mode.

Regards
Preeti U Murthy
Re: [Query] Spurious interrupts from clockevent device on X86 Ivybridge
On 12/10/2014 06:22 PM, Viresh Kumar wrote:
> On 10 December 2014 at 18:03, Preeti U Murthy wrote:
>
>> Right. We get an interrupt when nobody had asked for it to be delivered
>> or had asked for it to be delivered and later canceled the request. It
>> is most often in the latter situation, that there can be race
>> conditions. If these race conditions are not taken care of, they can
>> result in spurious interrupts.
>
> But the delta time will be very small then, right ?

I was talking of the case where we get an interrupt from the clockevent device but don't find the hrtimer to service, and not really of an anomaly in timekeeping. For instance, one of the issues that we had seen earlier wherein we cancel the tick-sched-timer before going tickless, but since we had programmed the clock event device to fire, we get a spurious interrupt.

>> Since the difference is 1us and not a noticeably high value, it is most
>> probably because during hrtimer handling, we traverse all queued timers
>> and call their handlers, till we find timers whose expiry is in the
>> future. I would not be surprised if we overshoot the expiry of the
>> timers at the end of the list by a microsecond by the time we call their
>> handlers.
>
> Looks like you misunderstood what he wrote. He isn't saying that we
> serviced the timers/hrtimers sometime after we should have.
>
> What he is saying is: we got the clockevent device's interrupt at the
> time we requested but hrtimer_update_base() returned a time
> lesser than what it *should* have. And that results in a spurious
> interrupt.. We enqueue again for 1 us and service the timer then.

Oh ok, I see. I understand this better now after reading Thomas's explanation.

Regards
Preeti U Murthy
Re: [Query] Spurious interrupts from clockevent device on X86 Ivybridge
Hi Santosh,

On 12/10/2014 11:00 AM, Santosh Shukla wrote:
> Hi Guys,
>
> I have been chasing spurious clockevent interrupts (in HIGHRES mode) on my
> x86 IvyBridge server and am stuck on some blockers (mostly due to lack of
> knowledge) and need some help..
>
> What does spurious mean ?
> I take it as the case when a core is unnecessarily interrupted by the
> clockevent device while we didn't want it to be.

Right. We get an interrupt when nobody had asked for it to be delivered or had asked for it to be delivered and later canceled the request. It is most often in the latter situation that there can be race conditions. If these race conditions are not taken care of, they can result in spurious interrupts.

> I tried to catch them with the help of a few new trace points in
> hrtimer_interrupt() when it didn't service any pending hrtimers..
>
> The causes I could find:
> - Clockevent device's counter overflow: When we need to program a clockevt
> device for times larger than it can handle, we program it for max time and
> then reprogram again on expiration. Because of hardware counter limitations
> we have to do it this way and this isn't really a spurious interrupt as we
> programmed it that way.

Yes, this is a possible cause for spurious interrupts. On PowerPC we check for such a situation specifically before we begin to handle timers.

> - Hardware timer firing before its expiration: Yes, it's hard to digest but
> from my tests it looks like this does happen. Will circulate another mail
> once I am sure about it.

Yep, we will need more information if at all this is happening.

> - Difference between time returned by hrtimer_update_base() and clockevt
> device's notion of time: Wanted to discuss this here..
>
> An example of what I am witnessing in my tests:
>
> clockevt device programmed to fire after 199 ms and it fires exactly after
> that time (verified with ftrace time-diff and tsc diff [tsc: timestamp
> counter instructions for x86]).
> But when hrtimer_interrupt() tries to find the hrtimer for which the event
> occurred, it finds the value returned by hrtimer_update_base() being lesser
> than the softexpires of the timer for which the event was programmed. And
> the diff is approximately 1 us most of the times I see the spurious
> interrupt.
>
> -0 [010] d... 68.961171: clk_evt_program: delta:199933679 expires=6907100 /* programmed for 199 ms */
> -0 [010] d... 68.961171: lapic_next_deadline: cur_tsc:365799488705427 new_tsc:365799907536051 delta:199443 usec /* sets new_tsc */
> -0 [010] d.h. 69.161183: local_timer_entry: vector=239
> -0 [010] d.h. 69.161185: local_apic_timer_interrupt: cur_tsc:365799907547595 delta:199448 usec /* Interrupt arrived after exactly 199 ms */
> -0 [010] d.h. 69.161187: hrtimer_interrupt: function=tick_sched_timer expires=6907100 softexpires=6907100 basenow:69070997234 /* But Softexpires > basenow ?? */
> -0 [010] d.h. 69.161188: clk_evt_program: delta:1000 expires=6907100 /* new expiry value (1 us) */
> -0 [010] d.h. 69.161189: lapic_next_deadline: cur_tsc:365799907556179 new_tsc:365799907558259 delta:0 usec /* sets new tsc value */
> -0 [010] d.h. 69.161189: local_timer_exit: vector=239

Since the difference is 1 us and not a noticeably high value, it is most probably because during hrtimer handling, we traverse all queued timers and call their handlers, till we find timers whose expiry is in the future. I would not be surprised if we overshoot the expiry of the timers at the end of the list by a microsecond by the time we call their handlers.

Regards
Preeti U Murthy

> Tested on: Ivybridge V2, 12 core X86 server, though it occurs on ARM as well
> (but not that frequent).
>
> Regards,
> Santosh
Re: [Query] Spurious interrupts from clockevent device on X86 Ivybridge
On 12/10/2014 06:22 PM, Viresh Kumar wrote: > On 10 December 2014 at 18:03, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: >> Right. We get an interrupt when nobody had asked for it to be delivered or had asked for it to be delivered and later canceled the request. It is most often in the latter situation that there can be race conditions. If these race conditions are not taken care of, they can result in spurious interrupts. > But the delta time will be very small then, right ? I was talking of the case where we get an interrupt from the clockevent device but don't find the hrtimer to service, and not really of an anomaly in timekeeping. For instance, one of the issues that we had seen earlier was where we cancel the tick-sched-timer before going tickless, but since we had programmed the clock event device to fire, we get a spurious interrupt. >> Since the difference is 1us and not a noticeably high value, it is most probably because during hrtimer handling, we traverse all queued timers and call their handlers, till we find timers whose expiry is in the future. I would not be surprised if we overshoot the expiry of the timers at the end of the list by a microsecond by the time we call their handlers. > Looks like you misunderstood what he wrote. He isn't saying that we serviced the timers/hrtimers sometime after we should have. What he is saying is: we got the clockevent device's interrupt at the time we requested, but hrtimer_update_base() returned a time lesser than what it *should* have. And that results in a spurious interrupt. We enqueue again for 1 us and service the timer then. Oh ok, I see. I understand this better now after reading Thomas's explanation. Regards Preeti U Murthy
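The 1 us re-enqueue Viresh describes can be pictured with a small helper. This is purely an illustrative sketch, not the hrtimer code itself: if the base time has not yet reached the timer's softexpires, the handler is not run and the event is reprogrammed for the remaining delta, clamped to a minimum programmable delta.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: returns 0 when the timer should be serviced now,
 * otherwise the delta (ns) to reprogram the clockevent device for,
 * clamped up to min_delta_ns (the 1 us re-arm seen in the trace). */
static int64_t rearm_delta_ns(int64_t basenow, int64_t softexpires,
                              int64_t min_delta_ns)
{
    int64_t d = softexpires - basenow;
    if (d <= 0)
        return 0;                              /* expired: run the handler */
    return d < min_delta_ns ? min_delta_ns : d; /* not yet: re-arm */
}
```

With basenow a hair behind softexpires, the result is a short re-arm rather than a serviced timer, which is exactly the spurious-looking interrupt pair in the trace above.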
Re: [PATCH 1/2] clockevents: introduce ->set_dev_mode() which can return error
On 12/10/2014 03:33 AM, Kevin Hilman wrote: > From: Viresh Kumar > > Currently, the ->set_mode() method of a clockevent device is not > allowed to fail, so it has no return value. In order to add new > clockevent modes, and allow the setting of those modes to fail, we > need the clockevent core to be able to detect when setting a mode > fails. > > To allow detection of mode setting failures, introduce a new method > ->set_dev_mode() which can return an error if the given mode is not > supported or fails. > > Since all current modes are still not allowed to fail, the core code > simply WARNs if any modes fail. Future patches that add new mode > support will add proper error recovery based on failure conditions. > > Signed-off-by: Viresh Kumar > Reviewed-by: Kevin Hilman > [khilman: rework changelogs, minor formatting/checkpatch cleanups] > Signed-off-by: Kevin Hilman > --- > include/linux/clockchips.h | 5 - > kernel/time/clockevents.c | 21 ++--- > kernel/time/timer_list.c | 5 - > 3 files changed, 26 insertions(+), 5 deletions(-) > > diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h > index 2e4cb67f6e56..d969659cf688 100644 > --- a/include/linux/clockchips.h > +++ b/include/linux/clockchips.h > @@ -81,7 +81,8 @@ enum clock_event_mode { > * @mode:operating mode assigned by the management code > * @features:features > * @retries: number of forced programming retries > - * @set_mode:set mode function > + * @set_dev_mode:set dev mode function > + * @set_mode:set mode function (deprecated, use set_dev_mode > instead) > * @broadcast: function to broadcast events > * @min_delta_ticks: minimum delta value in ticks stored for reconfiguration > * @max_delta_ticks: maximum delta value in ticks stored for reconfiguration > @@ -109,6 +110,8 @@ struct clock_event_device { > unsigned long retries; > > void(*broadcast)(const struct cpumask *mask); > + int (*set_dev_mode)(enum clock_event_mode mode, > + struct clock_event_device *); > void(*set_mode)(enum 
clock_event_mode mode, > struct clock_event_device *); > void(*suspend)(struct clock_event_device *); > diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c > index 55449909f114..f7614041240e 100644 > --- a/kernel/time/clockevents.c > +++ b/kernel/time/clockevents.c > @@ -105,7 +105,16 @@ void clockevents_set_mode(struct clock_event_device *dev, >enum clock_event_mode mode) > { > if (dev->mode != mode) { > - dev->set_mode(mode, dev); > + if (dev->set_dev_mode) { > + int ret = dev->set_dev_mode(mode, dev); > + > + /* Currently available modes shouldn't fail */ > + WARN_ONCE(ret, "Requested mode: %d, error: %d\n", > + mode, ret); > + } else { > + dev->set_mode(mode, dev); > + } > + > dev->mode = mode; > > /* > @@ -448,8 +457,14 @@ int __clockevents_update_freq(struct clock_event_device > *dev, u32 freq) > if (dev->mode == CLOCK_EVT_MODE_ONESHOT) > return clockevents_program_event(dev, dev->next_event, false); > > - if (dev->mode == CLOCK_EVT_MODE_PERIODIC) > - dev->set_mode(CLOCK_EVT_MODE_PERIODIC, dev); > + if (dev->mode == CLOCK_EVT_MODE_PERIODIC) { > + /* Shouldn't fail while switching to PERIODIC MODE */ > + if (dev->set_dev_mode) > + WARN_ON_ONCE(dev->set_dev_mode(CLOCK_EVT_MODE_PERIODIC, > +dev)); > + else > + dev->set_mode(CLOCK_EVT_MODE_PERIODIC, dev); > + } > > return 0; > } > diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c > index 61ed862cdd37..957a04951ef0 100644 > --- a/kernel/time/timer_list.c > +++ b/kernel/time/timer_list.c > @@ -229,7 +229,10 @@ print_tickdevice(struct seq_file *m, struct tick_device > *td, int cpu) > SEQ_printf(m, "\n"); > > SEQ_printf(m, " set_mode: "); > - print_name_offset(m, dev->set_mode); > + if (dev->set_dev_mode) > + print_name_offset(m, dev->set_dev_mode); > + else > + print_name_offset(m, dev->set_mode); > SEQ_printf(m, "\n"); > > SEQ_printf(m, " event_handler: "); > Reviewed-by: Preeti U Murthy
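As a usage sketch of the new callback, a driver whose hardware lacks periodic mode could return an error from ->set_dev_mode() instead of silently ignoring the request. The driver name, the stand-in type definitions, and the choice of error code below are all hypothetical; only the callback signature follows the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Minimal stand-ins so the sketch is self-contained; the real
 * definitions live in <linux/clockchips.h>. */
enum clock_event_mode {
    CLOCK_EVT_MODE_UNUSED,
    CLOCK_EVT_MODE_SHUTDOWN,
    CLOCK_EVT_MODE_PERIODIC,
    CLOCK_EVT_MODE_ONESHOT,
};
struct clock_event_device;

/* Hypothetical driver callback: oneshot and shutdown succeed,
 * periodic mode is reported as unsupported instead of being
 * silently dropped, which the old void ->set_mode() could not do. */
static int foo_set_dev_mode(enum clock_event_mode mode,
                            struct clock_event_device *dev)
{
    (void)dev;
    switch (mode) {
    case CLOCK_EVT_MODE_PERIODIC:
        return -EINVAL;         /* hardware cannot do periodic */
    default:
        return 0;
    }
}
```

Until callers learn to handle failures, the core's WARN_ONCE in clockevents_set_mode() would flag such a return; later patches are expected to add real error recovery.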
[PATCH V3] tick-broadcast: Register for hrtimer based broadcast as the fallback broadcast mode
Commit 5d1638acb9f6 ('tick: Introduce hrtimer based broadcast') added a hrtimer based broadcast mode for those platforms in which local timers stop when CPUs enter deep idle states. The commit expected the platforms to register for this mode explicitly when they lacked a better external device to wake up CPUs in deep idle. Given that more platforms are beginning to use this mode, we can avoid the call to set it up on every platform that requires it, by registering for the hrtimer based broadcast mode in the core code if no better broadcast device is available. This commit also helps detect cases where the platform fails to register for a broadcast device but invokes the help of one when entering deep idle states. Currently we do not handle this situation at all and call the broadcast clock device without checking for its existence. This patch will handle such buggy cases properly. While at it, give a name to this mode of broadcast which was missing all along. Signed-off-by: Preeti U Murthy Tested-by: Mark Rutland --- Changes from V1: https://lkml.org/lkml/2014/12/5/261 1.Moved registering the hrtimer based broadcast from timekeeping code to an early_initcall. Changes from V2: https://lkml.org/lkml/2014/12/8/57 1.Added the 'name' param to hrtimer broadcast mode and removed the prototype to setup this mode since there are no external callers of it. 
arch/arm64/kernel/time.c |2 -- arch/powerpc/kernel/time.c |1 - include/linux/clockchips.h |3 --- kernel/time/tick-broadcast-hrtimer.c |5 - 4 files changed, 4 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/time.c b/arch/arm64/kernel/time.c index 1a7125c..47baaa8 100644 --- a/arch/arm64/kernel/time.c +++ b/arch/arm64/kernel/time.c @@ -70,8 +70,6 @@ void __init time_init(void) of_clk_init(NULL); clocksource_of_init(); - tick_setup_hrtimer_broadcast(); - arch_timer_rate = arch_timer_get_rate(); if (!arch_timer_rate) panic("Unable to initialise architected timer.\n"); diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 7505599..51433a8 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -942,7 +942,6 @@ void __init time_init(void) clocksource_init(); init_decrementer_clockevent(); - tick_setup_hrtimer_broadcast(); } diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index 2e4cb67..c362143 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -187,11 +187,9 @@ extern int tick_receive_broadcast(void); #endif #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT) -extern void tick_setup_hrtimer_broadcast(void); extern int tick_check_broadcast_expired(void); #else static inline int tick_check_broadcast_expired(void) { return 0; } -static inline void tick_setup_hrtimer_broadcast(void) {}; #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS @@ -207,7 +205,6 @@ static inline void clockevents_resume(void) {} static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; } static inline int tick_check_broadcast_expired(void) { return 0; } -static inline void tick_setup_hrtimer_broadcast(void) {}; #endif diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c index eb682d5..1f6bc6a 100644 --- a/kernel/time/tick-broadcast-hrtimer.c +++ b/kernel/time/tick-broadcast-hrtimer.c @@ -72,6 +72,7 @@ static int 
bc_set_next(ktime_t expires, struct clock_event_device *bc) } static struct clock_event_device ce_broadcast_hrtimer = { + .name = "broadcast_hrtimer", .set_mode = bc_set_mode, .set_next_ktime = bc_set_next, .features = CLOCK_EVT_FEAT_ONESHOT | @@ -98,9 +99,11 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t) return HRTIMER_RESTART; } -void tick_setup_hrtimer_broadcast(void) +static int __init tick_setup_hrtimer_broadcast(void) { hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); bctimer.function = bc_handler; clockevents_register_device(&ce_broadcast_hrtimer); + return 0; } +early_initcall(tick_setup_hrtimer_broadcast);
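The reason an early_initcall suffices as a fallback is that the tick core picks the broadcast device by rating: the hrtimer pseudo-device registers with the default (lowest) rating, so any real broadcast device registered later displaces it. The helper below is a simplified illustration of that selection idea, not the kernel's actual tick_install_broadcast_device().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a clockevent device, keeping only the
 * fields relevant to the selection sketch. */
struct bc_dev {
    const char *name;
    int rating;         /* higher is better; hrtimer fallback stays at 0 */
};

/* Keep the current broadcast device unless the newly registered one
 * has a strictly higher rating. */
static struct bc_dev *pick_broadcast(struct bc_dev *cur, struct bc_dev *new_dev)
{
    if (cur == NULL || new_dev->rating > cur->rating)
        return new_dev;
    return cur;
}
```

Under this scheme the order of registration does not matter much: the hrtimer device registered at early_initcall time only survives as the broadcast device on platforms that never register anything better.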
Re: [PATCH V2] tick-broadcast: Register for hrtimer based broadcast as the fallback broadcast mode
On 12/08/2014 04:18 PM, Mark Rutland wrote: > Hi Preeti, > > On Mon, Dec 08, 2014 at 06:55:43AM +0000, Preeti U Murthy wrote: >> Commit 5d1638acb9f6 ('tick: Introduce hrtimer based broadcast') added a >> hrtimer based broadcast mode for those platforms in which local timers stop >> when CPUs enter deep idle states. The commit expected the platforms to >> register for this mode explicitly when they lacked a better external device >> to wake up CPUs in deep idle. Given that more platforms are beginning to use >> this mode, we can avoid the call to set it up on every platform that requires >> it, by registering for the hrtimer based broadcast mode in the core code if >> no better broadcast device is available. >> >> This commit also helps detect cases where the platform fails to register for >> a broadcast device but invokes the help of one when entering deep idle >> states. >> Currently we do not handle this situation at all and call the broadcast clock >> device without checking for its existence. This patch will handle such buggy >> cases properly. >> >> Signed-off-by: Preeti U Murthy > > I've just given this a go on an arm64 platform (Juno) without any > system-wide clock_event_devices registered, and everything works well > with CPUs entering and exiting idle states where the cpu-local timers > lose state. So: > > Tested-by: Mark Rutland Thanks! > > One minor thing I noticed when testing was that > /sys/devices/system/clockevents/broadcast/name contained "(null)", > because we never set the name field on the clock_event_device. It's > always been that way, but now might be a good time to change that to > something like "broadcast_hrtimer". You mean /sys/devices/system/clockevents/broadcast/current_device right? > > [...] 
> >> diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h >> index 2e4cb67..91754b0 100644 >> --- a/include/linux/clockchips.h >> +++ b/include/linux/clockchips.h >> @@ -187,11 +187,11 @@ extern int tick_receive_broadcast(void); >> #endif >> >> #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && >> defined(CONFIG_TICK_ONESHOT) >> -extern void tick_setup_hrtimer_broadcast(void); >> +extern int __init tick_setup_hrtimer_broadcast(void); >> extern int tick_check_broadcast_expired(void); >> #else >> static inline int tick_check_broadcast_expired(void) { return 0; } >> -static inline void tick_setup_hrtimer_broadcast(void) {}; >> +static inline int __init tick_setup_hrtimer_broadcast(void) { return 0; } >> #endif >> >> #ifdef CONFIG_GENERIC_CLOCKEVENTS >> @@ -207,7 +207,7 @@ static inline void clockevents_resume(void) {} >> >> static inline int clockevents_notify(unsigned long reason, void *arg) { >> return 0; } >> static inline int tick_check_broadcast_expired(void) { return 0; } >> -static inline void tick_setup_hrtimer_broadcast(void) {}; >> +static inline int __init tick_setup_hrtimer_broadcast(void) { return 0; } > > With the initcall moved to the driver we have no external users of > tick_setup_hrtimer_broadcast, so I think we can remove the prototype > entirely from clockchips.h... > >> #endif >> >> diff --git a/kernel/time/tick-broadcast-hrtimer.c >> b/kernel/time/tick-broadcast-hrtimer.c >> index eb682d5..5c35995 100644 >> --- a/kernel/time/tick-broadcast-hrtimer.c >> +++ b/kernel/time/tick-broadcast-hrtimer.c >> @@ -98,9 +98,11 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t) >> return HRTIMER_RESTART; >> } >> >> -void tick_setup_hrtimer_broadcast(void) >> +int __init tick_setup_hrtimer_broadcast(void) > > ...and make it static here. Yep will do. Sorry I overlooked this. 
> >> { >> hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); >> bctimer.function = bc_handler; >> clockevents_register_device(&ce_broadcast_hrtimer); >> +return 0; >> } >> +early_initcall(tick_setup_hrtimer_broadcast); > > Otherwise this looks good to me, thanks for putting this together! Thanks a lot for the review! Will send out the patch with the above corrections. Regards Preeti U Murthy > > Mark. > ___ > Linuxppc-dev mailing list > linuxppc-...@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev >
[PATCH V2] tick-broadcast: Register for hrtimer based broadcast as the fallback broadcast mode
Commit 5d1638acb9f6 ('tick: Introduce hrtimer based broadcast') added a hrtimer based broadcast mode for those platforms in which local timers stop when CPUs enter deep idle states. The commit expected the platforms to register for this mode explicitly when they lacked a better external device to wake up CPUs in deep idle. Given that more platforms are beginning to use this mode, we can avoid the call to set it up on every platform that requires it, by registering for the hrtimer based broadcast mode in the core code if no better broadcast device is available. This commit also helps detect cases where the platform fails to register for a broadcast device but invokes the help of one when entering deep idle states. Currently we do not handle this situation at all and call the broadcast clock device without checking for its existence. This patch will handle such buggy cases properly. Signed-off-by: Preeti U Murthy --- Changes from V1: https://lkml.org/lkml/2014/12/5/261 1.Moved registering the hrtimer based broadcast from timekeeping code to an early_initcall. 
 arch/arm64/kernel/time.c             |    2 --
 arch/powerpc/kernel/time.c           |    1 -
 include/linux/clockchips.h           |    6 +++---
 kernel/time/tick-broadcast-hrtimer.c |    4 +++-
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/time.c b/arch/arm64/kernel/time.c
index 1a7125c..47baaa8 100644
--- a/arch/arm64/kernel/time.c
+++ b/arch/arm64/kernel/time.c
@@ -70,8 +70,6 @@ void __init time_init(void)
 	of_clk_init(NULL);
 	clocksource_of_init();
 
-	tick_setup_hrtimer_broadcast();
-
 	arch_timer_rate = arch_timer_get_rate();
 	if (!arch_timer_rate)
 		panic("Unable to initialise architected timer.\n");
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 7505599..51433a8 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -942,7 +942,6 @@ void __init time_init(void)
 	clocksource_init();
 
 	init_decrementer_clockevent();
-	tick_setup_hrtimer_broadcast();
 }
diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 2e4cb67..91754b0 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -187,11 +187,11 @@ extern int tick_receive_broadcast(void);
 #endif
 
 #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
-extern void tick_setup_hrtimer_broadcast(void);
+extern int __init tick_setup_hrtimer_broadcast(void);
 extern int tick_check_broadcast_expired(void);
 #else
 static inline int tick_check_broadcast_expired(void) { return 0; }
-static inline void tick_setup_hrtimer_broadcast(void) {};
+static inline int __init tick_setup_hrtimer_broadcast(void) { return 0; }
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
@@ -207,7 +207,7 @@ static inline void clockevents_resume(void) {}
 static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 static inline int tick_check_broadcast_expired(void) { return 0; }
-static inline void tick_setup_hrtimer_broadcast(void) {};
+static inline int __init tick_setup_hrtimer_broadcast(void) { return 0; }
 #endif
 
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index eb682d5..5c35995 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -98,9 +98,11 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t)
 	return HRTIMER_RESTART;
 }
 
-void tick_setup_hrtimer_broadcast(void)
+int __init tick_setup_hrtimer_broadcast(void)
 {
 	hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
 	bctimer.function = bc_handler;
 	clockevents_register_device(&ce_broadcast_hrtimer);
+	return 0;
 }
+early_initcall(tick_setup_hrtimer_broadcast);
Re: [PATCH] tick-broadcast: Register for hrtimer based broadcast as the default broadcast mode
On 12/05/2014 07:09 PM, Mark Rutland wrote:
> Hi Preeti,
>
> Moving this out of the architecture code looks good to me!
>
> I have a couple of minor comments below.
>>
>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index ec1791f..6044a51 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -1016,6 +1016,10 @@ void __init timekeeping_init(void)
>> 		boot.tv_sec = 0;
>> 		boot.tv_nsec = 0;
>> 	}
>> +	/* Register for hrtimer based broadcast as the default timekeeping
>> +	 * mode in deep idle states.
>> +	 */
>
> Nit: for code style this should have a newline after the '/*' (and we
> should probably have a newline before that anyway).
>
>> +	tick_setup_hrtimer_broadcast();
>
> We register the generic dummy timer via an early_initcall, which keeps
> all the logic in the dummy timer driver. Are we able to do the same of
> the broadcast hrtimer? Or is there some ordering constraint we need to
> meet?

Yes this is doable. There are no ordering constraints. Placing it in an early_initcall() will enable it to run before initializing SMP and cpuidle, which is perfect (we do not have any per-cpu initializations and we do not want cpus to begin entering idle states before this hrtimer is initialized). It also runs after the hrtimer/clockevents infrastructure has been initialized, which is good since we need them.

The broadcast hrtimer will thus only get registered as a broadcast device if no other clock device has managed to register itself as one.

Let me send out a V2.

Thanks

Regards
Preeti U Murthy
[PATCH] tick-broadcast: Register for hrtimer based broadcast as the default broadcast mode
Commit 5d1638acb9f62fa7 added a hrtimer based broadcast mode for those platforms in which local timers stop when CPUs enter deep idle states. The commit expected the platforms to register for this mode explicitly when they lacked a better external device to wake up CPUs in deep idle.

Given that more platforms are beginning to use this mode, we can avoid the call to set it up on every platform that requires it, by registering for the hrtimer based broadcast mode in the core code before clock devices begin to get initialized. So if there exists a better broadcast device, it will override the hrtimer based one; else there is a backup mechanism when a wakeup device is required.

This commit also helps detect cases where the platform fails to register for a broadcast device but invokes the help of one when entering deep idle states. Currently we do not handle this situation at all and invoke the help of the broadcast clock device without checking for its existence. Registering a default broadcast mode will handle such buggy cases properly.
Signed-off-by: Preeti U Murthy
---
 arch/arm64/kernel/time.c   |    2 --
 arch/powerpc/kernel/time.c |    1 -
 kernel/time/timekeeping.c  |    4 ++++
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/time.c b/arch/arm64/kernel/time.c
index 1a7125c..47baaa8 100644
--- a/arch/arm64/kernel/time.c
+++ b/arch/arm64/kernel/time.c
@@ -70,8 +70,6 @@ void __init time_init(void)
 	of_clk_init(NULL);
 	clocksource_of_init();
 
-	tick_setup_hrtimer_broadcast();
-
 	arch_timer_rate = arch_timer_get_rate();
 	if (!arch_timer_rate)
 		panic("Unable to initialise architected timer.\n");
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 7505599..51433a8 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -942,7 +942,6 @@ void __init time_init(void)
 	clocksource_init();
 
 	init_decrementer_clockevent();
-	tick_setup_hrtimer_broadcast();
 }
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index ec1791f..6044a51 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1016,6 +1016,10 @@ void __init timekeeping_init(void)
 		boot.tv_sec = 0;
 		boot.tv_nsec = 0;
 	}
+	/* Register for hrtimer based broadcast as the default timekeeping
+	 * mode in deep idle states.
+	 */
+	tick_setup_hrtimer_broadcast();
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
 	write_seqcount_begin(&tk_core.seq);
Re: [PATCH] arm: ls1021a: setup hrtimer based tick broadcast
On 12/05/2014 03:57 PM, Mark Rutland wrote:
> Hi,
>
> On Fri, Dec 05, 2014 at 09:07:29AM +, Jingchang Lu wrote:
>> This adds setup of hrtimer based tick broadcast to utilize
>> the hrtimer in SMP without other broadcast tick devices.
>>
>> Signed-off-by: Jingchang Lu
>
> This shouldn't live in board code.
>
> Similarly to what happens with the generic dummy timer on all arches
> that select it (and the broadcast hrtimer on all arches that use it so
> far), we should just unconditionally register the broadcast hrtimer.
>
> A real broadcast-capable clock event device will take preference if
> present, so we don't need to add board-specific code to only register
> this in certain conditions.

Hmm.. this makes sense to me. Let me send out a patch to do this.

Regards
Preeti U Murthy

> Mark.
>
>> ---
>>  arch/arm/mach-imx/mach-ls1021a.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/arch/arm/mach-imx/mach-ls1021a.c b/arch/arm/mach-imx/mach-ls1021a.c
>> index b89c858..4d074cf 100644
>> --- a/arch/arm/mach-imx/mach-ls1021a.c
>> +++ b/arch/arm/mach-imx/mach-ls1021a.c
>> @@ -7,6 +7,10 @@
>>   * (at your option) any later version.
>>   */
>>
>> +#include <linux/clockchips.h>
>> +#include <linux/clk-provider.h>
>> +#include <linux/clocksource.h>
>> +
>>  #include <asm/mach/arch.h>
>>
>>  #include "common.h"
>> @@ -16,7 +20,15 @@ static const char * const ls1021a_dt_compat[] __initconst = {
>>  	NULL,
>>  };
>>
>> +static void __init ls1021a_init_time(void)
>> +{
>> +	of_clk_init(NULL);
>> +	clocksource_of_init();
>> +	tick_setup_hrtimer_broadcast();
>> +}
>> +
>>  DT_MACHINE_START(LS1021A, "Freescale LS1021A")
>>  	.smp	= smp_ops(ls1021a_smp_ops),
>> +	.init_time = ls1021a_init_time,
>>  	.dt_compat = ls1021a_dt_compat,
>>  MACHINE_END
>> --
>> 1.8.0
>>
>>
>> ___
>> linux-arm-kernel mailing list
>> linux-arm-ker...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>>
>
Re: [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement
On 10/22/2014 11:37 AM, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as function of task queueing. The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.
>
> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.
>
> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?
> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the
> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.
>
> Also discussed at LPC14 is the fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.

I believe load balancing would be the right place to evaluate the frequency at which CPUs must run. find_busiest_group() is already iterating through all the CPUs and calculating the load on them. So this information is readily available, and all that remains is to find which of the CPUs in the group have their load > some_threshold, and queue a kthread on each such cpu to scale its frequency, while the current cpu continues with its load balancing.

There is another positive I see in evaluating cpu frequency in load balancing. The frequency at which load balancing is run is already optimized for scalability. One of the factors considered is that if any sibling cpu has carried out load balancing in the recent past, the current cpu defers doing the same. This means it is naturally ensured that only one cpu in the power domain takes care of frequency scaling each time, and there is no need for explicit synchronization between the policy cpus to do this.

Regards
Preeti U Murthy
Re: [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management
Hi,

I ran hackbench to evaluate this patchset and found good improvements in the results. I modified hackbench to take in a 'loops' parameter along with num_groups, which ensures that the test runs long enough to observe and debug issues. The idea was to find out how latency sensitive workloads can get affected by modifications in cpuidle heuristics, since it is easy to measure the impact on these workloads.

The experiment was conducted on a Power8 system with 1 socket and 6 cores on it. The first experiment was carried out by pinning hackbench to the first thread in each core while the rest of the smt threads were idle; this ensures the core entered deep idle states more often. Below are the results:

num_grps   %improvement with patchset
   3            3.6
   6           10.6
  12            5.0
  24            5.0

The second experiment was carried out by allowing hackbench to run on the smt threads of two cores, and the % improvement with the patchset was in the range of 4-7%.

I ran the experiments on the vanilla kernel. This means the performance improvement is primarily due to avoiding having to do a timebase sync by every thread in the core. The power numbers have very little variation between the runs with and without the patchset.

Thanks

Regards
Preeti U Murthy

On 11/25/2014 04:47 PM, Shreyas B. Prabhu wrote:
> Deep idle states like sleep and winkle are per core idle states. A core
> enters these states only when all the threads enter either the particular
> idle state or a deeper one. There are tasks like fastsleep hardware bug
> workaround and hypervisor core state save which have to be done only by
> the last thread of the core entering deep idle state and similarly tasks
> like timebase resync, hypervisor core register restore that have to be
> done only by the first thread waking up from these states.
>
> The current idle state management does not have a way to distinguish the
> first/last thread of the core waking/entering idle states. Tasks like
> timebase resync are done for all the threads.
> This not only is suboptimal,
> but can cause functionality issues when subcores are involved.
>
> Winkle is a deeper idle state compared to fastsleep. In this state the power
> supply to the chiplet, i.e. core, private L2 and private L3, is turned off.
> This results in a total hypervisor state loss. This patch set adds support
> for winkle and provides a way to track the idle states of the threads of the
> core and use it for idle state management of idle states sleep and winkle.
>
> Changes in v2:
> --
> - Using PNV_THREAD_NAP/SLEEP defines while calling power7_powersave_common
> - Comment changes based on review
> - Rebased on top of 3.18-rc6
>
> Cc: Benjamin Herrenschmidt
> Cc: Paul Mackerras
> Cc: Michael Ellerman
> Cc: Rafael J. Wysocki
> Cc: linux...@vger.kernel.org
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: Vaidyanathan Srinivasan
> Cc: Preeti U Murthy
>
> Paul Mackerras (1):
>   powerpc: powernv: Switch off MMU before entering nap/sleep/rvwinkle
>     mode
>
> Preeti U. Murthy (1):
>   powerpc/powernv: Enable Offline CPUs to enter deep idle states
>
> Shreyas B. Prabhu (2):
>   powernv: cpuidle: Redesign idle states management
>   powernv: powerpc: Add winkle support for offline cpus
>
>  arch/powerpc/include/asm/cpuidle.h             | 14 ++
>  arch/powerpc/include/asm/opal.h                | 13 +
>  arch/powerpc/include/asm/paca.h                |  6 +
>  arch/powerpc/include/asm/ppc-opcode.h          |  2 +
>  arch/powerpc/include/asm/processor.h           |  1 +
>  arch/powerpc/include/asm/reg.h                 |  4 +
>  arch/powerpc/kernel/asm-offsets.c              |  6 +
>  arch/powerpc/kernel/cpu_setup_power.S          |  4 +
>  arch/powerpc/kernel/exceptions-64s.S           | 30 ++-
>  arch/powerpc/kernel/idle_power7.S              | 332 +
>  arch/powerpc/platforms/powernv/opal-wrappers.S | 39 +++
>  arch/powerpc/platforms/powernv/powernv.h       |  2 +
>  arch/powerpc/platforms/powernv/setup.c         | 160
>  arch/powerpc/platforms/powernv/smp.c           | 10 +-
>  arch/powerpc/platforms/powernv/subcore.c       | 34 +++
>  arch/powerpc/platforms/powernv/subcore.h       |  1 +
>  drivers/cpuidle/cpuidle-powernv.c              | 10 +-
>  17 files changed, 608 insertions(+), 60 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/cpuidle.h
Re: [PATCH v2 0/4] powernv: cpuidle: Redesign idle states management
Hi, I ran hackbench to evaluate this patchset and found good improvements in the results. I modified hackbench to take in a 'loops' parameter along with num_groups which ensures that the test runs long enough to observe and debug issues. The idea was to find out how latency sensitive workloads can get affected by modification in cpuidle heuristics since it is easy to measure the impact on these workloads. The experiment was conducted on a Power8 system with 1 socket and 6 cores on it. The first experiment was carried out by pinning hackbench to the first thread in each core while the rest of the smt threads were idle and below are the results. This would ensure the core entered deep idle states more often. num_grps %improvement with patchset 33.6 6 10.6 12 5.0 24 5.0 The second experiment was carried out by allowing hackbench to run on the smt threads of two cores and % improvement with the patchset was in range of 4-7%. I ran the experiments on the vanilla kernel. This means the performance improvements is primarily due to avoiding having to do a timebase sync by every thread in the core. The power numbers have very little variation between the runs with and without the patchset. Thanks Regards Preeti U Murthy On 11/25/2014 04:47 PM, Shreyas B. Prabhu wrote: Deep idle states like sleep and winkle are per core idle states. A core enters these states only when all the threads enter either the particular idle state or a deeper one. There are tasks like fastsleep hardware bug workaround and hypervisor core state save which have to be done only by the last thread of the core entering deep idle state and similarly tasks like timebase resync, hypervisor core register restore that have to be done only by the first thread waking up from these states. The current idle state management does not have a way to distinguish the first/last thread of the core waking/entering idle states. Tasks like timebase resync are done for all the threads. 
This is not only is suboptimal, but can cause functionality issues when subcores are involved. Winkle is deeper idle state compared to fastsleep. In this state the power supply to the chiplet, i.e core, private L2 and private L3 is turned off. This results in a total hypervisor state loss. This patch set adds support for winkle and provides a way to track the idle states of the threads of the core and use it for idle state management of idle states sleep and winkle. Changes in v2: -- -Using PNV_THREAD_NAP/SLEEP defines while calling power7_powersave_common -Comment changes based on review -Rebased on top of 3.18-rc6 Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Paul Mackerras pau...@samba.org Cc: Michael Ellerman m...@ellerman.id.au Cc: Rafael J. Wysocki r...@rjwysocki.net Cc: linux...@vger.kernel.org Cc: linuxppc-...@lists.ozlabs.org Cc: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Cc: Preeti U Murthy pre...@linux.vnet.ibm.com Paul Mackerras (1): powerpc: powernv: Switch off MMU before entering nap/sleep/rvwinkle mode Preeti U. Murthy (1): powerpc/powernv: Enable Offline CPUs to enter deep idle states Shreyas B. 
Prabhu (2):
  powernv: cpuidle: Redesign idle states management
  powernv: powerpc: Add winkle support for offline cpus

 arch/powerpc/include/asm/cpuidle.h | 14 ++
 arch/powerpc/include/asm/opal.h| 13 +
 arch/powerpc/include/asm/paca.h| 6 +
 arch/powerpc/include/asm/ppc-opcode.h | 2 +
 arch/powerpc/include/asm/processor.h | 1 +
 arch/powerpc/include/asm/reg.h | 4 +
 arch/powerpc/kernel/asm-offsets.c | 6 +
 arch/powerpc/kernel/cpu_setup_power.S | 4 +
 arch/powerpc/kernel/exceptions-64s.S | 30 ++-
 arch/powerpc/kernel/idle_power7.S | 332 +
 arch/powerpc/platforms/powernv/opal-wrappers.S | 39 +++
 arch/powerpc/platforms/powernv/powernv.h | 2 +
 arch/powerpc/platforms/powernv/setup.c | 160
 arch/powerpc/platforms/powernv/smp.c | 10 +-
 arch/powerpc/platforms/powernv/subcore.c | 34 +++
 arch/powerpc/platforms/powernv/subcore.h | 1 +
 drivers/cpuidle/cpuidle-powernv.c | 10 +-
 17 files changed, 608 insertions(+), 60 deletions(-)
 create mode 100644 arch/powerpc/include/asm/cpuidle.h

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] clockevents: introduce ->set_dev_mode() which can return error
This looks good to me. On 11/20/2014 12:09 AM, Kevin Hilman wrote: > From: Viresh Kumar > > Currently, the ->set_mode() method of a clockevent device is not > allowed to fail, so it has no return value. In order to add new > clockevent modes, and allow the setting of those modes to fail, we > need the clockevent core to be able to detect when setting a mode > fails. > > To allow detection of mode setting failures, introduce a new method > ->set_dev_mode() which can return an error if the given mode is not > supported or fails. > > Since all current modes are still not allowed to fail, the core code > simply WARNs if any modes fail. Future patches that add new mode > support will add proper error recovery based on failure conditions. > > Signed-off-by: Viresh Kumar > Reviewed-by: Kevin Hilman > [khilman: rework changelogs, minor formatting/checkpatch cleanups] > Signed-off-by: Kevin Hilman > --- > include/linux/clockchips.h | 5 - > kernel/time/clockevents.c | 21 ++--- > kernel/time/timer_list.c | 5 - > 3 files changed, 26 insertions(+), 5 deletions(-) > > diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h > index 2e4cb67f6e56..d969659cf688 100644 > --- a/include/linux/clockchips.h > +++ b/include/linux/clockchips.h > @@ -81,7 +81,8 @@ enum clock_event_mode { > * @mode:operating mode assigned by the management code > * @features:features > * @retries: number of forced programming retries > - * @set_mode:set mode function > + * @set_dev_mode:set dev mode function > + * @set_mode:set mode function (deprecated, use set_dev_mode > instead) > * @broadcast: function to broadcast events > * @min_delta_ticks: minimum delta value in ticks stored for reconfiguration > * @max_delta_ticks: maximum delta value in ticks stored for reconfiguration > @@ -109,6 +110,8 @@ struct clock_event_device { > unsigned long retries; > > void(*broadcast)(const struct cpumask *mask); > + int (*set_dev_mode)(enum clock_event_mode mode, > + struct clock_event_device *); > 
void(*set_mode)(enum clock_event_mode mode, > struct clock_event_device *); > void(*suspend)(struct clock_event_device *); > diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c > index 9c94c19f1305..1d5ce3ce9228 100644 > --- a/kernel/time/clockevents.c > +++ b/kernel/time/clockevents.c > @@ -105,7 +105,16 @@ void clockevents_set_mode(struct clock_event_device *dev, >enum clock_event_mode mode) > { > if (dev->mode != mode) { > - dev->set_mode(mode, dev); > + if (dev->set_dev_mode) { > + int ret = dev->set_dev_mode(mode, dev); > + > + /* Currently available modes shouldn't fail */ > + WARN_ONCE(ret, "Requested mode: %d, error: %d\n", > + mode, ret); > + } else { > + dev->set_mode(mode, dev); > + } > + > dev->mode = mode; > > /* > @@ -448,8 +457,14 @@ int __clockevents_update_freq(struct clock_event_device > *dev, u32 freq) > if (dev->mode == CLOCK_EVT_MODE_ONESHOT) > return clockevents_program_event(dev, dev->next_event, false); > > - if (dev->mode == CLOCK_EVT_MODE_PERIODIC) > - dev->set_mode(CLOCK_EVT_MODE_PERIODIC, dev); > + if (dev->mode == CLOCK_EVT_MODE_PERIODIC) { > + /* Shouldn't fail while switching to PERIODIC MODE */ > + if (dev->set_dev_mode) > + WARN_ON_ONCE(dev->set_dev_mode(CLOCK_EVT_MODE_PERIODIC, > +dev)); > + else > + dev->set_mode(CLOCK_EVT_MODE_PERIODIC, dev); > + } > > return 0; > } > diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c > index 61ed862cdd37..957a04951ef0 100644 > --- a/kernel/time/timer_list.c > +++ b/kernel/time/timer_list.c > @@ -229,7 +229,10 @@ print_tickdevice(struct seq_file *m, struct tick_device > *td, int cpu) > SEQ_printf(m, "\n"); > > SEQ_printf(m, " set_mode: "); > - print_name_offset(m, dev->set_mode); > + if (dev->set_dev_mode) > + print_name_offset(m, dev->set_dev_mode); > + else > + print_name_offset(m, dev->set_mode); > SEQ_printf(m, "\n"); > > SEQ_printf(m, " event_handler: "); > Reviewed-by: Preeti U Murthy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in 
the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cpuidle/powernv: Re-enable fastsleep at boot time
Hi Joel,

On 11/19/2014 08:04 AM, Joel Stanley wrote: > Hey Preeti, > > On Tue, Nov 18, 2014 at 5:26 PM, Preeti U Murthy > wrote: >> Commit dcb18694 "Fix ipi on palmeto" disabled fastsleep at boot time. > > I couldn't find this commit in any tree; upstream, mpe's next, nor powerkvm. Oh yes, you are right. I must have been looking at the wrong git branch. I verified upstream now; the commit dcb18694 is not present. So we are good. Thanks for pointing this out :) Sorry for the noise! > > I remember testing this as a workaround for the palmetto, but I don't > recall sending it upstream. > > Cheers, > > Joel >

Regards
Preeti U Murthy
[PATCH] cpuidle/powernv: Re-enable fastsleep at boot time
Commit dcb18694 "Fix ipi on palmeto" disabled fastsleep at boot time. Revert this commit since we no longer need this fix. However we can continue to use the powersave_nap parameter to control entry into deep idle states beyond snooze.

Signed-off-by: Preeti U. Murthy
---
 drivers/cpuidle/cpuidle-powernv.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index b57681d..b430e76 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -69,8 +69,9 @@ static int fastsleep_loop(struct cpuidle_device *dev,
 	unsigned long old_lpcr = mfspr(SPRN_LPCR);
 	unsigned long new_lpcr;

-	if (powersave_nap < 2)
-		return;
+	if (!(powersave_nap > 0))
+		return index;
+
 	if (unlikely(system_state < SYSTEM_RUNNING))
 		return index;
--
Re: [PATCH 3/4] powernv: cpuidle: Redesign idle states management
^^ in the changelog, you do not need to elaborate it here. > + * core_idle_state - First 8 bits track the idle state of each thread > + * of the core. The 8th bit is the lock bit. Initially all thread bits > + * are set. They are cleared when the thread enters deep idle state > + * like sleep and winkle. Initially the lock bit is cleared. you can simply have the comment about the bits of core_idle_state without having to mention when they are cleared etc. > + * The lock bit has 2 purposes > + * a. While the first thread is restoring core state, it prevents > + * other threads in the core from switching to process context. > + * b. While the last thread in the core is saving the core state, it > + * prevents a different thread from waking up. The above two points are useful. As far as I can see, besides explaining the bits of the core_idle_state structure and the purpose of the lock bit, the rest of the comment is redundant. A git-blame will let people know why all this is needed. The comment section should not be used up for this purpose IMO.

Regards
Preeti U Murthy
Re: [PATCH V3 1/6] sched: idle: Add a weak arch_cpu_idle_poll function
On 11/10/2014 08:47 PM, Peter Zijlstra wrote: > On Mon, Nov 10, 2014 at 07:50:22PM +0530, Preeti U Murthy wrote: >> Hi Peter, >> >> On 11/10/2014 05:59 PM, Peter Zijlstra wrote: >>> On Fri, Nov 07, 2014 at 03:31:22PM +0100, Daniel Lezcano wrote: >>>> The poll function is called when a timer expired or if we force to poll >>>> when >>>> the cpu_idle_force_poll option is set. >>>> >>>> The poll function does: >>>> >>>>local_irq_enable(); >>>>while (!tif_need_resched()) >>>>cpu_relax(); >>>> >>>> This default poll function suits for the x86 arch because its rep; nop; >>>> hardware power optimization. But on other archs, this optimization does not >>>> exists and we are not saving power. The arch specific bits may want to >>>> optimize this loop by adding their own optimization. >>> >>> This doesn't make sense to me; should an arch not either implement an >>> actual idle driver or implement cpu_relax() properly, why allow for a >>> third weird option? >>> >> >> The previous version of this patch simply invoked cpu_idle_loop() for >> cases where latency_req was 0. This would have changed the behavior >> on PowerPC wherein earlier the 0th idle index was returned which is also >> a polling loop but differs from cpu_idle_loop() in two ways: >> >> a. It polls at a relatively lower power state than cpu_relax(). >> b. We set certain registers to indicate that the cpu is idle. > > So I'm confused; the current code runs the generic cpu_relax idle poll > loop for the broadcast case. I suppose you want to retain this because > not doing your a-b above will indeed give you a lower latency. > > Therefore one could argue that latency_req==0 should indeed use this, > and your a-b idle state should be latency_req==1 or higher. > > Thus yes it changes behaviour, but I think it actually fixes something. > You cannot have a latency_req==0 state which has higher latency than the > actual polling loop, as you appear to have. Yes you are right. This is fixing the current behavior. 
So we should be good to call cpuidle_idle_loop() when latency_req=0. > >> Hence for all such cases wherein the cpu is required to poll while idle >> (only for cases such as force poll, broadcast ipi to arrive soon and >> latency_req = 0), we should be able to call into cpuidle_idle_loop() >> only if the cpuidle driver's 0th idle state has an exit_latency > 0. >> (The 0th idle state is expected to be a polling loop with >> exit_latency = 0). >> >> If otherwise, it would mean the driver has an optimized polling loop >> when idle. But instead of adding in the logic of checking the >> exit_latency, we thought it would be simpler to call into an arch >> defined polling idle loop under the above circumstances. If that is no >> better we could fall back to cpuidle_idle_loop(). > > That still doesn't make sense to me; suppose the implementation of this > special poll state differs on different uarchs for the same arch, then > we'll end up with another registration and selection interface parallel > to cpuidle. Yeah this will only get complicated. I was trying to see how to keep the current behavior unchanged but looks like it is not worth it. Thanks Regards Preeti U Murthy > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V3 1/6] sched: idle: Add a weak arch_cpu_idle_poll function
Hi Peter, On 11/10/2014 05:59 PM, Peter Zijlstra wrote: > On Fri, Nov 07, 2014 at 03:31:22PM +0100, Daniel Lezcano wrote: >> The poll function is called when a timer expired or if we force to poll when >> the cpu_idle_force_poll option is set. >> >> The poll function does: >> >>local_irq_enable(); >>while (!tif_need_resched()) >>cpu_relax(); >> >> This default poll function suits for the x86 arch because its rep; nop; >> hardware power optimization. But on other archs, this optimization does not >> exists and we are not saving power. The arch specific bits may want to >> optimize this loop by adding their own optimization. > > This doesn't make sense to me; should an arch not either implement an > actual idle driver or implement cpu_relax() properly, why allow for a > third weird option? > The previous version of this patch simply invoked cpu_idle_loop() for cases where latency_req was 0. This would have changed the behavior on PowerPC wherein earlier the 0th idle index was returned which is also a polling loop but differs from cpu_idle_loop() in two ways: a. It polls at a relatively lower power state than cpu_relax(). b. We set certain registers to indicate that the cpu is idle. Hence for all such cases wherein the cpu is required to poll while idle (only for cases such as force poll, broadcast ipi to arrive soon and latency_req = 0), we should be able to call into cpuidle_idle_loop() only if the cpuidle driver's 0th idle state has an exit_latency > 0. (The 0th idle state is expected to be a polling loop with exit_latency = 0). If otherwise, it would mean the driver has an optimized polling loop when idle. But instead of adding in the logic of checking the exit_latency, we thought it would be simpler to call into an arch defined polling idle loop under the above circumstances. If that is no better we could fall back to cpuidle_idle_loop(). 
Regards
Preeti U Murthy
Re: [PATCH] cpuidle/powernv: Fix return value of idle index in fastsleep
Hi Ben, Mikey,

Can you please take a look at this patch? I feel the commit dcb1869 was just a plug-in solution. Shouldn't we take a look at this again? Ideally we should not need this check. With the cpuidle redesign on PowerNV going in, we can enable fastsleep at bootup itself and without checks on any debug parameters such as powersave_nap. We will then only need to check for powersave_nap == 0 and return only if that is the case. This check is still required since the user can disable all deep idle states by setting powersave_nap to 0.

Regards
Preeti U Murthy

On 10/27/2014 06:56 PM, Preeti U Murthy wrote: > Commit dcb18694 : Fix ipi on Palmeto added a workaround to disable > going into fastsleep on Palmeto boards which was reported to fail > to boot when fastsleep was enabled. However it missed returning > an idle index. Fix this. > > There is probably no harm in returning the index of fastsleep > although the cpu did not enter this state. The governor will notice > the difference in the residency time in the idle state and the > target residency of the idle state and prevent cpus from entering > fastsleep from that point on. Hence the usage and time statistics > logged for fastsleep will not be out of sync with reality except > for the first entry. > > Signed-off-by: Preeti U Murthy > --- > Do we still need this workaround? Or can we get rid of the check > on powersave_nap altogether?
> --- > > drivers/cpuidle/cpuidle-powernv.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/cpuidle/cpuidle-powernv.c > b/drivers/cpuidle/cpuidle-powernv.c > index fa79392..c18da24 100644 > --- a/drivers/cpuidle/cpuidle-powernv.c > +++ b/drivers/cpuidle/cpuidle-powernv.c > @@ -70,7 +70,7 @@ static int fastsleep_loop(struct cpuidle_device *dev, > unsigned long new_lpcr; > > if (powersave_nap < 2) > - return; > + return index; > if (unlikely(system_state < SYSTEM_RUNNING)) > return index; > > > ___ > Linuxppc-dev mailing list > linuxppc-...@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev >
Re: [PATCH V3 3/6] sched: idle: Get the next timer event and pass it the cpuidle framework
On 11/07/2014 08:01 PM, Daniel Lezcano wrote: > Following the logic of the previous patch, retrieve from the idle task the > expected timer sleep duration and pass it to the cpuidle framework. > > Take the opportunity to remove the unused headers in the menu.c file. > > This patch does not change the current behavior. > > Signed-off-by: Daniel Lezcano > Acked-by: Nicolas Pitre > Reviewed-by: Len Brown > --- This patch looks good to me as well. Reviewed-by: Preeti U. Murthy
Re: [PATCH V3 4/6] cpuidle: idle: menu: Don't reflect when a state selection failed
On 11/07/2014 08:01 PM, Daniel Lezcano wrote:
> In the current code, the check to reflect or not the outcoming state is done
> against the idle state which has been chosen and its value.
>
> Instead of doing a check in each of the reflect functions, just don't call
> reflect if something went wrong in the idle path.
>
> Signed-off-by: Daniel Lezcano
> Acked-by: Nicolas Pitre
> ---
>  drivers/cpuidle/governors/ladder.c | 3 +--
>  drivers/cpuidle/governors/menu.c   | 3 +--
>  kernel/sched/idle.c                | 3 ++-
>  3 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
> index 51c9ccd..86e4d49 100644
> --- a/drivers/cpuidle/governors/ladder.c
> +++ b/drivers/cpuidle/governors/ladder.c
> @@ -165,8 +165,7 @@ static int ladder_enable_device(struct cpuidle_driver *drv,
>  static void ladder_reflect(struct cpuidle_device *dev, int index)
>  {
>  	struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
> -	if (index > 0)
> -		ldev->last_state_idx = index;
> +	ldev->last_state_idx = index;
>  }
>
>  static struct cpuidle_governor ladder_governor = {
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 91b3000..2e4a315 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -366,8 +366,7 @@ static void menu_reflect(struct cpuidle_device *dev, int index)
>  {
>  	struct menu_device *data = this_cpu_ptr(&menu_devices);
>  	data->last_state_idx = index;
> -	if (index >= 0)
> -		data->needs_update = 1;
> +	data->needs_update = 1;
>  }
>
>  /**
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 0a7a1d1..0ae1cc8 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -173,7 +173,8 @@ use_default:
>  	/*
>  	 * Give the governor an opportunity to reflect on the outcome
>  	 */
> -	cpuidle_reflect(dev, entered_state);
> +	if (entered_state >= 0)
> +		cpuidle_reflect(dev, entered_state);
>
>  exit_idle:
>  	__current_set_polling();
Reviewed-by: Preeti U. Murthy
Re: [PATCH V3 2/6] sched: idle: cpuidle: Check the latency req before idle
On 11/07/2014 08:01 PM, Daniel Lezcano wrote:
> When the pmqos latency requirement is set to zero that means "poll in all
> the cases".
>
> That is correctly implemented on x86 but not on the other archs.
>
> As the code is written, if the latency request is zero, the governor will
> return zero, corresponding, for x86, to the poll function, but for the
> other archs to the default idle function. For example, on ARM this is
> wait-for-interrupt with a latency of '1', so violating the constraint.
>
> In order to fix that, do the latency requirement check *before* calling the
> cpuidle framework in order to jump to the poll function without entering
> cpuidle. That has several benefits:
>
> 1. It clarifies and unifies the code
> 2. It fixes x86 vs other archs behavior
> 3. Factors out the call to the same function
> 4. Prevents entering the cpuidle framework with its expensive cost in
>    calculation
>
> As the latency_req is needed in all the cases, change the select API to take
> the latency_req as parameter in case it is not equal to zero.
>
> As a positive side effect, it introduces the latency constraint specified
> externally, so one more step to the cpuidle/scheduler integration.
>
> Signed-off-by: Daniel Lezcano
> Acked-by: Nicolas Pitre
> Acked-by: Peter Zijlstra (Intel)
> Reviewed-by: Len Brown
> ---

Reviewed-by: Preeti U Murthy

Regards
Preeti U Murthy
Re: [PATCH V3 1/6] sched: idle: Add a weak arch_cpu_idle_poll function
On 11/07/2014 08:01 PM, Daniel Lezcano wrote:
> The poll function is called when a timer expired or if we force to poll when
> the cpu_idle_force_poll option is set.
>
> The poll function does:
>
>    local_irq_enable();
>    while (!tif_need_resched())
>        cpu_relax();
>
> This default poll function suits the x86 arch because of its "rep; nop;"
> hardware power optimization. But on other archs, this optimization does not
> exist and we are not saving power. The arch specific bits may want to
> optimize this loop by adding their own optimization.
>
> Give an opportunity to the different platforms to specify their own polling
> loop by adding a weak cpu_idle_poll_loop function.
>
> Signed-off-by: Daniel Lezcano
> ---
>  kernel/sched/idle.c | 29 +
>  1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index c47fce7..ea65bbae 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -42,18 +42,6 @@ static int __init cpu_idle_nopoll_setup(char *__unused)
>  __setup("hlt", cpu_idle_nopoll_setup);
>  #endif
>
> -static inline int cpu_idle_poll(void)
> -{
> -	rcu_idle_enter();
> -	trace_cpu_idle_rcuidle(0, smp_processor_id());
> -	local_irq_enable();
> -	while (!tif_need_resched())
> -		cpu_relax();
> -	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
> -	rcu_idle_exit();
> -	return 1;
> -}
> -
>  /* Weak implementations for optional arch specific functions */
>  void __weak arch_cpu_idle_prepare(void) { }
>  void __weak arch_cpu_idle_enter(void) { }
> @@ -65,6 +53,23 @@ void __weak arch_cpu_idle(void)
>  	local_irq_enable();
>  }
>
> +void __weak arch_cpu_idle_poll(void)
> +{
> +	local_irq_enable();
> +	while (!tif_need_resched())
> +		cpu_relax();
> +}
> +
> +static inline int cpu_idle_poll(void)
> +{
> +	rcu_idle_enter();
> +	trace_cpu_idle_rcuidle(0, smp_processor_id());
> +	arch_cpu_idle_poll();
> +	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
> +	rcu_idle_exit();
> +	return 1;
> +}
> +
>  /**
>   * cpuidle_idle_call - the main idle function
>   *

Thanks Daniel!

Reviewed-by: Preeti U. Murthy
Re: [PATCH V2 1/5] sched: idle: cpuidle: Check the latency req before idle
On 11/06/2014 07:12 PM, Daniel Lezcano wrote:
> Preeti,
>
> I am wondering if we aren't going to a false debate.
>
> If the latency_req is 0, we should just poll and not enter any idle
> state even if one has zero exit latency. With a zero latency req, we
> want full reactivity on the system, not enter an idle state with all the
> computation in the menu governor, no?
>
> I agree this patch changes the behavior on PowerPC, but only if the
> latency_req is set to zero. I don't think we are worried about power
> saving when setting this value.
>
> Couldn't the patch be accepted as it is for the sake of consistency on all
> the platforms, and then we optimize cleanly for the special latency zero
> case?

Alright Daniel, you can go ahead. I was thinking this patch through and now realize that, as you point out, the logic will only get complicated with all the additional hacks.

But would it be possible to add the weak arch_cpu_idle_loop() call for the cases where the latency requirement is 0, as you had suggested earlier? This would ensure the polling logic does not break on PowerPC and we don't even bother the governor. I will add the function in the core PowerPC code. If the arch does not define this function, it will fall back to cpu_idle_loop(). Fair enough?

Regards
Preeti U Murthy
Re: [PATCH V2 1/5] sched: idle: cpuidle: Check the latency req before idle
On 11/06/2014 05:57 PM, Daniel Lezcano wrote:
> On 11/06/2014 05:08 AM, Preeti U Murthy wrote:
>> On 11/05/2014 07:58 PM, Daniel Lezcano wrote:
>>> On 10/29/2014 03:01 AM, Preeti U Murthy wrote:
>>>> On 10/29/2014 12:29 AM, Daniel Lezcano wrote:
>>>>> On 10/28/2014 04:51 AM, Preeti Murthy wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> On Thu, Oct 23, 2014 at 2:31 PM, Daniel Lezcano wrote:
>>>>>>> When the pmqos latency requirement is set to zero that means "poll
>>>>>>> in all the cases".
>>>>>>>
>>>>>>> That is correctly implemented on x86 but not on the other archs.
>>>>>>>
>>>>>>> As the code is written, if the latency request is zero, the governor
>>>>>>> will return zero, corresponding, for x86, to the poll function, but
>>>>>>> for the other archs to the default idle function. For example, on
>>>>>>> ARM this is wait-for-interrupt with a latency of '1', so violating
>>>>>>> the constraint.
>>>>>>
>>>>>> This is not true actually. On PowerPC the idle state 0 has an
>>>>>> exit_latency of 0.
>>>>>>
>>>>>>> In order to fix that, do the latency requirement check *before*
>>>>>>> calling the cpuidle framework in order to jump to the poll function
>>>>>>> without entering cpuidle. That has several benefits:
>>>>>>
>>>>>> Doing so actually hurts on PowerPC, because the idle loop defined for
>>>>>> idle state 0 is different from what cpu_relax() does in
>>>>>> cpu_idle_loop(). The spinning is more power efficient in the former
>>>>>> case. Moreover we also set certain register values which indicate an
>>>>>> idle cpu; the ppc_runlatch bits do precisely this. These register
>>>>>> values are being read by some user space tools, so we will end up
>>>>>> breaking them with this patch.
>>>>>>
>>>>>> My suggestion is: by all means keep the latency requirement check in
>>>>>> kernel/sched/idle.c as you do in this patch, but before jumping to
>>>>>> cpu_idle_loop verify that idle state 0 has an exit_latency > 0 in
>>>>>> addition to your check on latency_req == 0. If not, fall through to
>>>>>> the regular path of calling into the cpuidle driver. The scheduler
>>>>>> can query the cpuidle_driver structure anyway.
>>>>>>
>>>>>> What do you think?
>>>>>
>>>>> Thanks for reviewing the patch and spotting this.
>>>>>
>>>>> Wouldn't it make sense to create:
>>>>>
>>>>>     void __weak cpu_idle_poll(void) ?
>>>>>
>>>>> and override it with your specific poll function?
>>>>
>>>> No, this would become ugly as far as I can see. A weak function has to
>>>> be defined under arch/* code. We will either need to duplicate the
>>>> idle loop that we already have in the drivers or point the weak
>>>> function to the first idle state defined by our driver, neither of
>>>> which is desirable (calling into the driver from arch code is ugly).
>>>> Another reason I don't like the idea of a weak function is that if you
>>>> miss looking at a specific driver that has an idle loop with features
>>>> similar to powerpc's, you will have to spot it yourself and include
>>>> the arch specific cpu_idle_poll() for it.
>>>
>>> Yes, I agree this is a fair point. But actually I don't see the
>>> interest of having the poll loop in the cpuidle driver.
>>
>> We can't do that simply because the idle poll loop has arch specific
>> bits on powerpc.
>
> I am not sure.
>
> Could you describe what is the difference between the arch_cpu_idle
> function in arch/powerpc/kernel/idle.c and the 0th PowerPC idle state?

arch_cpu_idle() is the arch specific idle routine; it goes into a deeper idle state. I am guessing you meant to ask the difference between the PowerPC 0th idle state and the polling logic in cpu_idle_poll(). The 0th idle state is also a polling loop; additionally it sets a couple of registers to indicate idleness.

> Is it kind of duplicate? And for polling, do you really want to use
> while (...) cpu_relax(); as it is x86 specific, instead of the powerpc's
> arch_idle? Today, if latency_req == 0, it returns the 0th idle state, so
> polling. If we jump to arch_cpu_idle_poll, the result will be the same
> for all architectures.

So you propose creating a weak arch_cpu_idle_poll()? Ok, if it is going to make the cleanup easier, go ahead. I can add arch_cpu_idle_poll() in the core code on powerpc.

>>> These cleanups are preparing the removal of the
>>> CPUIDLE_DRIVER_STATE_START macro which leads to a lot of mess in the
>>> cpuidle code.
>>
>> How is the suggestion to check the exit_latency of idle state 0 when
>> latency_req == 0 going to hinder this removal?
>
> It sounds a bit hackish. I prefer to sort out the current situation.
>
> And by the way, what is the reasoning behind having a target_residency /
> exit_latency equal to zero for an idle state?

It's a polling idle state, hence the exit_latency is 0.

Regards
Preeti U Murthy
Re: [PATCH V2 1/5] sched: idle: cpuidle: Check the latency req before idle
On 11/06/2014 07:12 PM, Daniel Lezcano wrote: Preeti, I am wondering if we aren't going to a false debate. If the latency_req is 0, we should just poll and not enter in any idle state even if one has zero exit latency. With a zero latency req, we want full reactivity on the system, not enter an idle state with all the computation in the menu governor, no ? I agree this patch changes the behavior on PowerPC, but only if the latency_req is set to zero. I don't think we are worried about power saving when setting this value. Couldn't the patch accepted as it is for the sake of consistency on all the platform and then we optimize cleanly for the special latency zero case ? Alright Daniel, you can go ahead. I was thinking this patch through and now realize that, like you point out the logic will only get complicated with all the additional hack. But would it be possible to add the weak arch_cpu_idle_loop() call for the cases where latency requirement is 0 like you had suggested earlier ? This would ensure the polling logic does not break on PowerPC and we don't bother the governor even. I will add the function in the core PowerPC code. If arch does not define this function it will fall back to cpu_idle_loop(). Fair enough? Regards Preeti U Murthy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 1/5] sched: idle: cpuidle: Check the latency req before idle
On 11/05/2014 07:58 PM, Daniel Lezcano wrote: > On 10/29/2014 03:01 AM, Preeti U Murthy wrote: >> On 10/29/2014 12:29 AM, Daniel Lezcano wrote: >>> On 10/28/2014 04:51 AM, Preeti Murthy wrote: >>>> Hi Daniel, >>>> >>>> On Thu, Oct 23, 2014 at 2:31 PM, Daniel Lezcano >>>> wrote: >>>>> When the pmqos latency requirement is set to zero that means "poll in >>>>> all the >>>>> cases". >>>>> >>>>> That is correctly implemented on x86 but not on the other archs. >>>>> >>>>> As how is written the code, if the latency request is zero, the >>>>> governor will >>>>> return zero, so corresponding, for x86, to the poll function, but for >>>>> the >>>>> others arch the default idle function. For example, on ARM this is >>>>> wait-for- >>>>> interrupt with a latency of '1', so violating the constraint. >>>> >>>> This is not true actually. On PowerPC the idle state 0 has an >>>> exit_latency of 0. >>>> >>>>> >>>>> In order to fix that, do the latency requirement check *before* >>>>> calling the >>>>> cpuidle framework in order to jump to the poll function without >>>>> entering >>>>> cpuidle. That has several benefits: >>>> >>>> Doing so actually hurts on PowerPC. Because the idle loop defined for >>>> idle state 0 is different from what cpu_relax() does in >>>> cpu_idle_loop(). >>>> The spinning is more power efficient in the former case. Moreover we >>>> also set >>>> certain register values which indicate an idle cpu. The ppc_runlatch >>>> bits >>>> do precisely this. These register values are being read by some user >>>> space >>>> tools. So we will end up breaking them with this patch >>>> >>>> My suggestion is very well keep the latency requirement check in >>>> kernel/sched/idle.c >>>> like your doing in this patch. But before jumping to cpu_idle_loop >>>> verify if the >>>> idle state 0 has an exit_latency > 0 in addition to your check on the >>>> latency_req == 0. >>>> If not, you can fall through to the regular path of calling into the >>>> cpuidle driver. 
>>>> The scheduler can query the cpuidle_driver structure anyway. >>>> >>>> What do you think? >>> >>> Thanks for reviewing the patch and spotting this. >>> >>> Wouldn't make sense to create: >>> >>> void __weak_cpu_idle_poll(void) ? >>> >>> and override it with your specific poll function ? >>> >> >> No this would become ugly as far as I can see. A weak function has to be >> defined under arch/* code. We will either need to duplicate the idle >> loop that we already have in the drivers or point the weak function to >> the first idle state defined by our driver. Both of which is not >> desirable (calling into the driver from arch code is ugly). Another >> reason why I don't like the idea of a weak function is that if you have >> missed looking at a specific driver and they have an idle loop with >> features similar to on powerpc, you will have to spot it yourself and >> include the arch specific cpu_idle_poll() for them. > > Yes, I agree this is a fair point. But actually I don't see the interest > of having the poll loop in the cpuidle driver. These cleanups are We can't do that simply because the idle poll loop has arch specific bits on powerpc. > preparing the removal of the CPUIDLE_DRIVER_STATE_START macro which > leads to a lot of mess in the cpuidle code. How is the suggestion to check the exit_latency of idle state 0 when latency_req == 0 going to hinder this removal? > > With the removal of this macro, we should be able to move the select > loop from the menu governor and use it everywhere else. Furthermore, > this state which is flagged with TIME_VALID, isn't because the local > interrupt are enabled so we are measuring the interrupt time processing. > Beside that the idle loop for x86 is mostly not used. > > So the idea would be to extract those idle loop from the drivers and use > them directly when: > 1. the idle selection fails (use the poll loop under certain > circumstances we have to redefine) This behavior will not change as per my suggestion. > 2. 
when the latency req is zero

It's only here that I suggested you also verify state 0's exit_latency, for the reason that the arch may have a more optimized idle poll loop, which we cannot override with the generic cpuidle poll loop.

Regards
Preeti U Murthy

> That will result in cleaner code in cpuidle and in the governor.
>
> Do you agree with that?

>> But by having a check on the exit_latency, you are claiming that since
>> the driver's 0th idle state is no better than the generic idle loop in
>> cases of 0 latency req, we are better off calling the latter, which
>> looks reasonable. That way you don't have to bother about worsening the
>> idle loop behavior on any other driver.
Re: [PATCH V2 1/5] sched: idle: cpuidle: Check the latency req before idle
On 11/05/2014 07:58 PM, Daniel Lezcano wrote:
> On 10/29/2014 03:01 AM, Preeti U Murthy wrote:
>> On 10/29/2014 12:29 AM, Daniel Lezcano wrote:
>>> On 10/28/2014 04:51 AM, Preeti Murthy wrote:
>>>> Hi Daniel,
>>>>
>>>> On Thu, Oct 23, 2014 at 2:31 PM, Daniel Lezcano
>>>> <daniel.lezc...@linaro.org> wrote:
>>>>> When the pmqos latency requirement is set to zero that means "poll
>>>>> in all the cases".
>>>>>
>>>>> That is correctly implemented on x86 but not on the other archs.
>>>>>
>>>>> As how is written the code, if the latency request is zero, the
>>>>> governor will return zero, so corresponding, for x86, to the poll
>>>>> function, but for the others arch the default idle function. For
>>>>> example, on ARM this is wait-for-interrupt with a latency of '1',
>>>>> so violating the constraint.
>>>> This is not true actually. On PowerPC the idle state 0 has an
>>>> exit_latency of 0.
>>>>> In order to fix that, do the latency requirement check *before*
>>>>> calling the cpuidle framework in order to jump to the poll function
>>>>> without entering cpuidle. That has several benefits:
>>>> Doing so actually hurts on PowerPC, because the idle loop defined for
>>>> idle state 0 is different from what cpu_relax() does in
>>>> cpu_idle_loop(). The spinning is more power efficient in the former
>>>> case. Moreover we also set certain register values which indicate an
>>>> idle cpu. The ppc_runlatch bits do precisely this. These register
>>>> values are being read by some user space tools, so we will end up
>>>> breaking them with this patch.
>>>>
>>>> My suggestion is to keep the latency requirement check in
>>>> kernel/sched/idle.c like you are doing in this patch, but before
>>>> jumping to cpu_idle_loop, verify if the idle state 0 has an
>>>> exit_latency > 0 in addition to your check on latency_req == 0. If
>>>> not, you can fall through to the regular path of calling into the
>>>> cpuidle driver. The scheduler can query the cpuidle_driver structure
>>>> anyway.
>>>>
>>>> What do you think?
>>> Thanks for reviewing the patch and spotting this.
>>>
>>> Wouldn't it make sense to create:
>>>
>>> void __weak cpu_idle_poll(void) ?
>>>
>>> and override it with your specific poll function ?
>> No, this would become ugly as far as I can see. A weak function has to
>> be defined under arch/* code. We will either need to duplicate the
>> idle loop that we already have in the drivers, or point the weak
>> function to the first idle state defined by our driver. Neither of
>> these is desirable (calling into the driver from arch code is ugly).
>>
>> Another reason why I don't like the idea of a weak function is that if
>> you have missed looking at a specific driver, and it has an idle loop
>> with features similar to the one on powerpc, you will have to spot it
>> yourself and include the arch specific cpu_idle_poll() for it.
> Yes, I agree this is a fair point. But actually I don't see the
> interest of having the poll loop in the cpuidle driver.

We can't do that simply because the idle poll loop has arch specific
bits on powerpc.

> These cleanups are preparing the removal of the
> CPUIDLE_DRIVER_STATE_START macro which leads to a lot of mess in the
> cpuidle code.

How is the suggestion to check the exit_latency of idle state 0 when
latency_req == 0 going to hinder this removal?

> With the removal of this macro, we should be able to move the select
> loop from the menu governor and use it everywhere else. Furthermore,
> this state, which is flagged with TIME_VALID, isn't, because the local
> interrupts are enabled, so we are measuring the interrupt processing
> time. Besides that, the idle loop for x86 is mostly not used.
>
> So the idea would be to extract those idle loops from the drivers and
> use them directly when:
>
> 1. the idle selection fails (use the poll loop under certain
> circumstances we have to redefine)

This behavior will not change as per my suggestion.

> 2. when the latency req is zero

It is only here that I suggested you also verify state 0's exit_latency,
for the reason that the arch may have a more optimized idle poll loop,
which we cannot override with the generic cpuidle poll loop.

Regards
Preeti U Murthy

> That will result in a cleaner code in cpuidle and in the governor.
>
> Do you agree with that ?
>
>> But by having a check on the exit_latency, you are claiming that since
>> the driver's 0th idle state is no better than the generic idle loop in
>> cases of 0 latency req, we are better off calling the latter, which
>> looks reasonable. That way you don't have to bother about worsening
>> the idle loop behavior on any other driver.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 1/5] sched: idle: cpuidle: Check the latency req before idle
On 10/29/2014 12:29 AM, Daniel Lezcano wrote: > On 10/28/2014 04:51 AM, Preeti Murthy wrote: >> Hi Daniel, >> >> On Thu, Oct 23, 2014 at 2:31 PM, Daniel Lezcano >> wrote: >>> When the pmqos latency requirement is set to zero that means "poll in >>> all the >>> cases". >>> >>> That is correctly implemented on x86 but not on the other archs. >>> >>> As how is written the code, if the latency request is zero, the >>> governor will >>> return zero, so corresponding, for x86, to the poll function, but for >>> the >>> others arch the default idle function. For example, on ARM this is >>> wait-for- >>> interrupt with a latency of '1', so violating the constraint. >> >> This is not true actually. On PowerPC the idle state 0 has an >> exit_latency of 0. >> >>> >>> In order to fix that, do the latency requirement check *before* >>> calling the >>> cpuidle framework in order to jump to the poll function without entering >>> cpuidle. That has several benefits: >> >> Doing so actually hurts on PowerPC. Because the idle loop defined for >> idle state 0 is different from what cpu_relax() does in cpu_idle_loop(). >> The spinning is more power efficient in the former case. Moreover we >> also set >> certain register values which indicate an idle cpu. The ppc_runlatch bits >> do precisely this. These register values are being read by some user >> space >> tools. So we will end up breaking them with this patch >> >> My suggestion is very well keep the latency requirement check in >> kernel/sched/idle.c >> like your doing in this patch. But before jumping to cpu_idle_loop >> verify if the >> idle state 0 has an exit_latency > 0 in addition to your check on the >> latency_req == 0. >> If not, you can fall through to the regular path of calling into the >> cpuidle driver. >> The scheduler can query the cpuidle_driver structure anyway. >> >> What do you think? > > Thanks for reviewing the patch and spotting this. > > Wouldn't make sense to create: > > void __weak_cpu_idle_poll(void) ? 
> and override it with your specific poll function ?

No, this would become ugly as far as I can see. A weak function has to be
defined under arch/* code. We will either need to duplicate the idle loop
that we already have in the drivers, or point the weak function to the
first idle state defined by our driver. Neither of these is desirable
(calling into the driver from arch code is ugly).

Another reason why I don't like the idea of a weak function is that if
you have missed looking at a specific driver, and it has an idle loop
with features similar to the one on powerpc, you will have to spot it
yourself and include the arch specific cpu_idle_poll() for it.

But by having a check on the exit_latency, you are claiming that since
the driver's 0th idle state is no better than the generic idle loop in
cases of a 0 latency req, we are better off calling the latter, which
looks reasonable. That way you don't have to bother about worsening the
idle loop behavior on any other driver.

Regards
Preeti U Murthy
Re: [PATCH V2 3/5] cpuidle: idle: menu: Don't reflect when a state selection failed
On 10/28/2014 11:58 PM, Daniel Lezcano wrote: > On 10/28/2014 08:01 AM, Preeti Murthy wrote: >> On Thu, Oct 23, 2014 at 2:31 PM, Daniel Lezcano >> wrote: >>> In the current code, the check to reflect or not the outcoming state >>> is done >>> against the idle state which has been chosen and its value. >>> >>> Instead of doing a check in each of the reflect functions, just don't >>> call reflect >>> if something went wrong in the idle path. >>> >>> Signed-off-by: Daniel Lezcano >>> Acked-by: Nicolas Pitre >>> --- >>> drivers/cpuidle/governors/ladder.c | 3 +-- >>> drivers/cpuidle/governors/menu.c | 4 +--- >>> kernel/sched/idle.c| 3 ++- >>> 3 files changed, 4 insertions(+), 6 deletions(-) >>> >>> diff --git a/drivers/cpuidle/governors/ladder.c >>> b/drivers/cpuidle/governors/ladder.c >>> index fb396d6..c0b36a8 100644 >>> --- a/drivers/cpuidle/governors/ladder.c >>> +++ b/drivers/cpuidle/governors/ladder.c >>> @@ -165,8 +165,7 @@ static int ladder_enable_device(struct >>> cpuidle_driver *drv, >>> static void ladder_reflect(struct cpuidle_device *dev, int index) >>> { >>> struct ladder_device *ldev = &__get_cpu_var(ladder_devices); >>> - if (index > 0) >>> - ldev->last_state_idx = index; >>> + ldev->last_state_idx = index; >>> } >>> >>> static struct cpuidle_governor ladder_governor = { >>> diff --git a/drivers/cpuidle/governors/menu.c >>> b/drivers/cpuidle/governors/menu.c >>> index a17515f..3907301 100644 >>> --- a/drivers/cpuidle/governors/menu.c >>> +++ b/drivers/cpuidle/governors/menu.c >>> @@ -365,9 +365,7 @@ static int menu_select(struct cpuidle_driver >>> *drv, struct cpuidle_device *dev, >>> static void menu_reflect(struct cpuidle_device *dev, int index) >>> { >>> struct menu_device *data = &__get_cpu_var(menu_devices); >>> - data->last_state_idx = index; >>> - if (index >= 0) >>> - data->needs_update = 1; >>> + data->needs_update = 1; >> >> Why is the last_state_idx not getting updated ? > > Oups, right. This is missing. > > Thanks for pointing this out. 
> By the way, I don't think a back end driver is changing the selected
> state currently, and I am not sure this is desirable, since we want to
> trust the state we are going to (as a best effort). So if the 'enter'
> function does not change the index, that means the last_state_idx does
> not have to be changed, since it has been assigned in the 'select'
> function.

Hmm, right. So you might want to remove the last_state_idx update in
ladder_reflect() also?

Regards
Preeti U Murthy
[PATCH] cpuidle/powernv: Fix return value of idle index in fastsleep
Commit dcb18694 ("Fix ipi on Palmeto") added a workaround to disable
going into fastsleep on Palmeto boards, which were reported to fail to
boot when fastsleep was enabled. However it missed returning an idle
index. Fix this.

There is probably no harm in returning the index of fastsleep although
the cpu did not enter this state. The governor will notice the
difference between the residency time in the idle state and the target
residency of the idle state, and prevent cpus from entering fastsleep
from that point on. Hence the usage and time statistics logged for
fastsleep will not be out of sync with reality, except for the first
entry.

Signed-off-by: Preeti U Murthy
---
Do we still need this workaround? Or can we get rid of the check on
powersave_nap altogether?
---
 drivers/cpuidle/cpuidle-powernv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index fa79392..c18da24 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -70,7 +70,7 @@ static int fastsleep_loop(struct cpuidle_device *dev,
 	unsigned long new_lpcr;

 	if (powersave_nap < 2)
-		return;
+		return index;

 	if (unlikely(system_state < SYSTEM_RUNNING))
 		return index;
Re: [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement
Hi Mike, On 10/22/2014 11:37 AM, Mike Turquette wrote: > {en,de}queue_task_fair are updated to track which cpus will have changed > utilization values as function of task queueing. The affected cpus are > passed on to arch_eval_cpu_freq for further machine-specific processing > based on a selectable policy. > > arch_scale_cpu_freq is called from run_rebalance_domains as a way to > kick off the scaling process (via wake_up_process), so as to prevent > re-entering the {en,de}queue code. > > All of the call sites in this patch are up for discussion. Does it make > sense to track which cpus have updated statistics in enqueue_fair_task? > I chose this because I wanted to gather statistics for all cpus affected > in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the Can you explain how pstate selection can get affected by the presence of task groups? We are after all concerned with the cpu load. So when we enqueue/dequeue a task, we update the cpu load and pass it on for cpu pstate scaling. How does this change if we have task groups? I know that this issue was brought up during LPC, but I have yet not managed to gain clarity here. > next version of this patch will focus on the simpler case of not using > scheduler cgroups, which should remove a good chunk of this code, > including the cpumask stuff. > > Also discussed at LPC14 is that fact that load_balance is a very > interesting place to do this as frequency can be considered in concert > with task placement. Please put forth any ideas on a sensible way to do > this. > > Is run_rebalance_domains a logical place to change cpu frequency? What > other call sites make sense? > > Even for platforms that can target a cpu frequency without sleeping > (x86, some ARM platforms with PM microcontrollers) it is currently > necessary to always kick the frequency target work out into a kthread. > This is because of the rw_sem usage in the cpufreq core which might > sleep. 
Replacing that lock type is probably a good idea.

> Not-signed-off-by: Mike Turquette
> ---
>  kernel/sched/fair.c | 39 +++
>  1 file changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1af6f6d..3619f63 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
> +	struct cpumask update_cpus;
> +
> +	cpumask_clear(&update_cpus);
>
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
> @@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>  		update_cfs_shares(cfs_rq);
>  		update_entity_load_avg(se, 1);
> +		/* track cpus that need to be re-evaluated */
> +		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);

All the cfs_rqs that you iterate through here will belong to the same
rq/cpu, right?

Regards
Preeti U Murthy
Re: [PATCH v2 2/3] powerpc/kvm/book3s_hv: Enable CPUs to run guest after waking up from fast-sleep
On 10/07/2014 10:41 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2014-10-01 at 13:15 +0530, Shreyas B. Prabhu wrote:
>>
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
>> index 050f79a..c64f3cc0 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -100,25 +100,8 @@ system_reset_pSeries:
>>  	SET_SCRATCH0(r13)
>>  #ifdef CONFIG_PPC_P7_NAP
>>  BEGIN_FTR_SECTION
>> -	/* Running native on arch 2.06 or later, check if we are
>> -	 * waking up from nap. We only handle no state loss and
>> -	 * supervisor state loss. We do -not- handle hypervisor
>> -	 * state loss at this time.
>> -	 */
>> -	mfspr	r13,SPRN_SRR1
>> -	rlwinm.	r13,r13,47-31,30,31
>> -	beq	9f
>>
>> -	/* waking up from powersave (nap) state */
>> -	cmpwi	cr1,r13,2
>> -	/* Total loss of HV state is fatal, we could try to use the
>> -	 * PIR to locate a PACA, then use an emergency stack etc...
>> -	 * OPAL v3 based powernv platforms have new idle states
>> -	 * which fall in this catagory.
>> -	 */
>> -	bgt	cr1,8f
>>  	GET_PACA(r13)
>> -
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>>  	li	r0,KVM_HWTHREAD_IN_KERNEL
>>  	stb	r0,HSTATE_HWTHREAD_STATE(r13)
>> @@ -131,13 +114,27 @@ BEGIN_FTR_SECTION
>>  1:
>>  #endif
>
> So you moved the state loss check to after the KVM check ? Was this
> reviewed by Paul ? Is that ok ? (Does this match what we have in
> PowerKVM ?). Is it possible that we end up calling kvm_start_guest
> after a HV state loss, or do we know for sure that this won't happen
> for a reason or another ? If that's the case, then that reason needs
> to be clearly documented here in a comment.

This won't happen, because the first thread in the core which comes out
of an idle state that has a state loss will not enter into KVM, since
HSTATE_HWTHREAD_STATE is not yet set. It continues on to restore the
lost state. This thread sets HSTATE_HWTHREAD_STATE and wakes up the
remaining threads in the core.
These sibling threads enter KVM directly, not needing to restore the
lost state, since the first thread has restored it anyway. So we are
safe. We will certainly add a comment there.

Thanks

Regards
Preeti U Murthy
Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant
On 10/08/2014 03:48 PM, Peter Zijlstra wrote: > On Wed, Oct 08, 2014 at 03:07:51PM +0530, Preeti U Murthy wrote: > >>> I still completely hate all that.. It basically makes cpusets useless, >>> they no longer guarantee anything, it makes then an optional placement >>> hint instead. >> >> Why do you say they don't guarantee anything? We ensure that we always >> run on the cpus in our cpuset which are online. We do not run in any >> arbitrary cpuset. We also do not wait unreasonably on an offline cpu to >> come back. What we are doing is ensuring that if the resources that we >> began with are available we use them. Why is this not a logical thing to >> expect? > > Because if you randomly hotplug cpus your tasks can randomly flip > between sets. > > Therefore there is no strict containment and no guarantees. > >>> You also break long standing behaviour. >> >> Which is? As I understand cpusets are meant to ensure a dedicated set of >> resources to some tasks. We cannot scheduler the tasks anywhere outside >> this set as long as they are available. And when they are not, currently >> we move them to their parents, > > No currently we hard break affinity and never look back. That move to > parent and back crap is all new fangled stuff, and broken because of the > above argument. > >> but you had also suggested killing the >> task. Maybe this can be debated. But what behavior are we changing by >> ensuring that we retain our original configuration at all times? > > See above, by pretending hotplug is a sane operation you loose all > guarantees. Ok I see the point. The kernel must not be bothered about keeping cpusets and hotplug operations consistent when both of these are user initiated actions specifying affinity with the former and breaking the same with the latter. > >>> Also, power is insane if it needs/uses hotplug for operational crap >>> like that. >> >> SMT 8 on Power8 can help/hinder workloads. Hence we dynamically switch >> the modes at runtime. 
> That's just a horrible piece of crap hack and you deserve any and all
> pain you get from doing it.
>
> Randomly removing/adding cpus like that is horrible and makes a mockery
> of all the affinity interfaces we have.

We observed this on the Ubuntu kernel, in which systemd explicitly
mounts cgroup controllers under a child cgroup identified by the user
pid. Since we had not observed this additional cgroup being added under
the hood, it came as a surprise to us that the cgroup/cpuset handling in
the kernel should indeed kick in.

At best, we expect hotplug to be handled well if the users have not
explicitly configured cpusets, hence implicitly specifying that task
affinity is for all online cpus. This is indeed the case today, so that
is good.

However, what remains to be answered is this: the V2 of the cgroup
design - the default hierarchy - tracks hotplug operations for children
cgroups as well. Tejun, Li, will not the concerns that Peter raised
above hold for the default hierarchy as well?

Regards
Preeti U Murthy
Re: [PATCH v2 2/3] powerpc/kvm/book3s_hv: Enable CPUs to run guest after waking up from fast-sleep
On 10/07/2014 10:41 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2014-10-01 at 13:15 +0530, Shreyas B. Prabhu wrote:
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
>> index 050f79a..c64f3cc0 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -100,25 +100,8 @@ system_reset_pSeries:
>>  	SET_SCRATCH0(r13)
>>  #ifdef CONFIG_PPC_P7_NAP
>>  BEGIN_FTR_SECTION
>> -	/* Running native on arch 2.06 or later, check if we are
>> -	 * waking up from nap. We only handle no state loss and
>> -	 * supervisor state loss. We do -not- handle hypervisor
>> -	 * state loss at this time.
>> -	 */
>> -	mfspr	r13,SPRN_SRR1
>> -	rlwinm.	r13,r13,47-31,30,31
>> -	beq	9f
>> -	/* waking up from powersave (nap) state */
>> -	cmpwi	cr1,r13,2
>> -	/* Total loss of HV state is fatal, we could try to use the
>> -	 * PIR to locate a PACA, then use an emergency stack etc...
>> -	 * OPAL v3 based powernv platforms have new idle states
>> -	 * which fall in this catagory.
>> -	 */
>> -	bgt	cr1,8f
>>  	GET_PACA(r13)
>> -
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>>  	li	r0,KVM_HWTHREAD_IN_KERNEL
>>  	stb	r0,HSTATE_HWTHREAD_STATE(r13)
>> @@ -131,13 +114,27 @@ BEGIN_FTR_SECTION
>>  1:
>>  #endif
>
> So you moved the state loss check to after the KVM check ? Was this
> reviewed by Paul ? Is that ok ? (Does this match what we have in
> PowerKVM ?). Is it possible that we end up calling kvm_start_guest
> after a HV state loss or do we know for sure that this won't happen
> for a reason or another ? If that's the case, then that reason needs
> to be clearly documented here in a comment.

This won't happen, because the first thread in the core which comes out
of an idle state involving state loss will not enter KVM, since
HSTATE_HWTHREAD_STATE is not yet set; it continues on to restore the
lost state. This thread then sets HSTATE_HWTHREAD_STATE and wakes up
the remaining threads in the core. These sibling threads enter KVM
directly, without needing to restore the lost state, since the first
thread has already restored it. So we are safe.
We will certainly add a comment there.

Thanks

Regards
Preeti U Murthy