Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
Hi Nicolas,

Find below the patch that will need to be squashed with this one. This patch
is based on the mainline. Adding Deepthi, the author of the patch which
introduced the powernv cpuidle driver. Deepthi, do you think the below patch
looks right? We do not need to do an explicit local_irq_enable() since we are
in the call path of the cpuidle driver and that explicitly enables irqs on
exit from idle states.

On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
> On Thu, 6 Feb 2014, Preeti U Murthy wrote:
>
>> Hi Daniel,
>>
>> On 02/06/2014 09:55 PM, Daniel Lezcano wrote:
>>> Hi Nico,
>>>
>>>
>>> On 6 February 2014 14:16, Nicolas Pitre wrote:
>>>
>>>> The core idle loop now takes care of it.
>>>>
>>>> Signed-off-by: Nicolas Pitre
>>>> ---
>>>>  arch/powerpc/platforms/powernv/setup.c | 13 +
>>>>  1 file changed, 1 insertion(+), 12 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/platforms/powernv/setup.c
>>>> b/arch/powerpc/platforms/powernv/setup.c
>>>> index 21166f65c9..a932feb290 100644
>>>> --- a/arch/powerpc/platforms/powernv/setup.c
>>>> +++ b/arch/powerpc/platforms/powernv/setup.c
>>>> @@ -26,7 +26,6 @@
>>>>  #include
>>>>  #include
>>>>  #include
>>>> -#include
>>>>
>>>>  #include
>>>>  #include
>>>> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
>>>>  	return 1;
>>>>  }
>>>>
>>>> -void powernv_idle(void)
>>>> -{
>>>> -	/* Hook to cpuidle framework if available, else
>>>> -	 * call on default platform idle code
>>>> -	 */
>>>> -	if (cpuidle_idle_call()) {
>>>> -		power7_idle();
>>>> -	}

 drivers/cpuidle/cpuidle-powernv.c | 4
 1 file changed, 4 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 78fd174..130f081 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -31,11 +31,13 @@ static int snooze_loop(struct cpuidle_device *dev,
 	set_thread_flag(TIF_POLLING_NRFLAG);
 
 	while (!need_resched()) {
+		ppc64_runlatch_off();
 		HMT_low();
 		HMT_very_low();
 	}
 	HMT_medium();
+	ppc64_runlatch_on();
 	clear_thread_flag(TIF_POLLING_NRFLAG);
 	smp_mb();
 	return index;
@@ -45,7 +47,9 @@ static int nap_loop(struct cpuidle_device *dev,
 			struct cpuidle_driver *drv,
 			int index)
 {
+	ppc64_runlatch_off();
 	power7_idle();
+	ppc64_runlatch_on();
 	return index;
 }

Thanks

Regards
Preeti U Murthy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH V4 0/3] time/cpuidle: Support in tick broadcast framework in absence of external clock device
This patchset provides support in the tick broadcast framework for such
architectures, so as to enable their CPUs to get into deep idle. Presently we
are in need of this support on certain implementations of PowerPC, and the
patchset has been tested on the same.

This patchset is based on the idea discussed here:
http://www.kernelhub.org/?p=2=399516

Changes in V4:
1. Cleared the standby CPU from the oneshot mask. As a result PATCH 3/3 was
   simplified.
2. Fixed compile time warnings.

---

Preeti U Murthy (2):
      time: Change the return type of clockevents_notify() to integer
      time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set

Thomas Gleixner (1):
      tick/cpuidle: Initialize hrtimer mode of broadcast

 drivers/cpuidle/cpuidle.c            |  14 +++--
 include/linux/clockchips.h           |  15 -
 kernel/time/Makefile                 |   2 -
 kernel/time/clockevents.c            |   8 ++-
 kernel/time/tick-broadcast-hrtimer.c | 105 ++
 kernel/time/tick-broadcast.c         |  60 ++-
 kernel/time/tick-internal.h          |   6 +-
 7 files changed, 189 insertions(+), 21 deletions(-)
 create mode 100644 kernel/time/tick-broadcast-hrtimer.c
[PATCH V4 1/3] time: Change the return type of clockevents_notify() to integer
The broadcast framework can also be made use of by archs which do not have an
external clock device. In that case, one of the CPUs needs to handle the
broadcasting of wakeup IPIs to the CPUs in deep idle, and its local timer
should therefore remain functional at all times. For such a CPU, the
BROADCAST_ENTER notification has to fail, indicating that its clock device
cannot be shut down. To make way for this support, change the return type of
tick_broadcast_oneshot_control() and hence clockevents_notify() to indicate
such scenarios.

Signed-off-by: Preeti U Murthy
---
 include/linux/clockchips.h   | 6 +++---
 kernel/time/clockevents.c    | 8 +---
 kernel/time/tick-broadcast.c | 6 --
 kernel/time/tick-internal.h  | 6 +++---
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 493aa02..e0c5a6c 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; }
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
-extern void clockevents_notify(unsigned long reason, void *arg);
+extern int clockevents_notify(unsigned long reason, void *arg);
 #else
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 #endif
 
 #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */
@@ -196,7 +196,7 @@ static inline void clockevents_notify(unsigned long reason, void *arg) {}
 
 static inline void clockevents_suspend(void) {}
 static inline void clockevents_resume(void) {}
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 static inline int tick_check_broadcast_expired(void) { return 0; }
 
 #endif
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 086ad60..79b8685 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -524,12 +524,13 @@ void clockevents_resume(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 /**
  * clockevents_notify - notification about relevant events
+ * Returns 0 on success, any other value on error
  */
-void clockevents_notify(unsigned long reason, void *arg)
+int clockevents_notify(unsigned long reason, void *arg)
 {
 	struct clock_event_device *dev, *tmp;
 	unsigned long flags;
-	int cpu;
+	int cpu, ret = 0;
 
 	raw_spin_lock_irqsave(&clockevents_lock, flags);
 
@@ -542,7 +543,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 	case CLOCK_EVT_NOTIFY_BROADCAST_ENTER:
 	case CLOCK_EVT_NOTIFY_BROADCAST_EXIT:
-		tick_broadcast_oneshot_control(reason);
+		ret = tick_broadcast_oneshot_control(reason);
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
@@ -585,6 +586,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 		break;
 	}
 	raw_spin_unlock_irqrestore(&clockevents_lock, flags);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(clockevents_notify);
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 43780ab..ddf2ac2 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -633,14 +633,15 @@ again:
 /*
  * Powerstate information: The system enters/leaves a state, where
  * affected devices might stop
+ * Returns 0 on success, -EBUSY if the cpu is used to broadcast wakeups.
  */
-void tick_broadcast_oneshot_control(unsigned long reason)
+int tick_broadcast_oneshot_control(unsigned long reason)
 {
 	struct clock_event_device *bc, *dev;
 	struct tick_device *td;
 	unsigned long flags;
 	ktime_t now;
-	int cpu;
+	int cpu, ret = 0;
 
 	/*
 	 * Periodic mode does not care about the enter/exit of power
@@ -746,6 +747,7 @@ void tick_broadcast_oneshot_control(unsigned long reason)
 	}
 out:
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
+	return ret;
 }
 
 /*
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 8329669..f0dc03c 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -46,7 +46,7 @@ extern int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *));
 extern void tick_resume_oneshot(void);
 # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
-extern void tick_broadcast_oneshot_control(unsigned long reason);
+extern int tick_broadcast_oneshot_control(unsigned long reason);
 extern void tick_broadcast_switch_to_oneshot(void);
 extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup);
 extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc);
@@ -58,7 +58,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 {
 	BUG();
 }
-static inline void
[PATCH V4 3/3] time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set
Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the
local timers stop. cpuidle_idle_call() currently handles such idle states by
calling into the broadcast framework so as to wake up the CPUs at their next
wakeup event.

With the hrtimer mode of broadcast, the BROADCAST_ENTER call into the
broadcast framework can fail on archs that do not have an external clock
device to handle wakeups, and the CPU in question then has to be made the
standby CPU. This patch handles such cases by failing the call into cpuidle
so that the arch can take some default action. The arch will certainly not
enter a similar idle state, because a failed cpuidle call also implicitly
indicates that the broadcast framework has not registered this CPU to be
woken up. Hence we are safe if we fail the cpuidle call.

In the process, move the functions that trace idle statistics to just before
and after the entry into and exit from idle states respectively. In the
other scenarios where the call to cpuidle fails, we end up not tracing idle
entry and exit, since a decision on an idle state could not be taken.
Similarly, when the call into the broadcast framework fails, we skip tracing
idle statistics because we are in no position to take a decision on an
alternative idle state to enter.
Signed-off-by: Preeti U Murthy
---
 drivers/cpuidle/cpuidle.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a55e68f..8beb0f02 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -140,12 +140,14 @@ int cpuidle_idle_call(void)
 		return 0;
 	}
 
-	trace_cpu_idle_rcuidle(next_state, dev->cpu);
-
 	broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP);
 
-	if (broadcast)
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
+	if (broadcast &&
+	    clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
+		return -EBUSY;
+
+	trace_cpu_idle_rcuidle(next_state, dev->cpu);
 
 	if (cpuidle_state_is_coupled(dev, drv, next_state))
 		entered_state = cpuidle_enter_state_coupled(dev, drv,
@@ -153,11 +155,11 @@ int cpuidle_idle_call(void)
 	else
 		entered_state = cpuidle_enter_state(dev, drv, next_state);
 
+	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
+
 	if (broadcast)
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
 
-	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
-
 	/* give the governor an opportunity to reflect on the outcome */
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev, entered_state);
[PATCH V4 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast
From: Thomas Gleixner

On some architectures, the local timers stop in certain deep CPU idle
states, and an external clock device is used to wake up these CPUs. The
kernel supports these wakeups through the tick broadcast framework by using
the external clock device as the wakeup source. However, not all
implementations of these architectures provide such an external clock
device.

This patch adds support in the broadcast framework for waking up CPUs in
deep idle states on such systems by queuing a hrtimer on one of the CPUs,
which then handles the wakeup of the CPUs in deep idle. It introduces a
pseudo clock device which the archs can register as the
tick_broadcast_device in the absence of a real external clock device. Once
registered, the broadcast framework works as-is for these architectures, as
long as the archs take care of the BROADCAST_ENTER notification failing for
one of the CPUs. That CPU is made the standby CPU which handles the wakeup
of the CPUs in deep idle, and it *must not enter deep idle states*.

The CPU with the earliest wakeup is chosen as this standby CPU. Hence the
standby CPU moves around dynamically, and so does the hrtimer, which is
queued to fire at the next earliest wakeup time. This is consistent with
the case where an external clock device is present: the smp affinity of
that clock device is set to the CPU with the earliest wakeup. This patch
also handles hotplug of the standby CPU by moving the hrtimer to the CPU
handling the CPU_DEAD notification.
Signed-off-by: Preeti U Murthy
[Added Changelog and code to handle reprogramming of hrtimer]
---
 include/linux/clockchips.h           |   9 +++
 kernel/time/Makefile                 |   2 -
 kernel/time/tick-broadcast-hrtimer.c | 105 ++
 kernel/time/tick-broadcast.c         |  54 +
 4 files changed, 166 insertions(+), 4 deletions(-)
 create mode 100644 kernel/time/tick-broadcast-hrtimer.c

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index e0c5a6c..dbe9e14 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -62,6 +62,11 @@ enum clock_event_mode {
 #define CLOCK_EVT_FEAT_DYNIRQ		0x20
 #define CLOCK_EVT_FEAT_PERCPU		0x40
 
+/*
+ * Clockevent device is based on a hrtimer for broadcast
+ */
+#define CLOCK_EVT_FEAT_HRTIMER		0x80
+
 /**
  * struct clock_event_device - clock event device descriptor
  * @event_handler:	Assigned by the framework to be called by the low
@@ -83,6 +88,7 @@ enum clock_event_mode {
  * @name:		ptr to clock event name
  * @rating:		variable to rate clock event devices
  * @irq:		IRQ number (only for non CPU local devices)
+ * @bound_on:		Bound on CPU
  * @cpumask:		cpumask to indicate for which CPUs this device works
  * @list:		list head for the management code
  * @owner:		module reference
@@ -113,6 +119,7 @@ struct clock_event_device {
 	const char		*name;
 	int			rating;
 	int			irq;
+	int			bound_on;
 	const struct cpumask	*cpumask;
 	struct list_head	list;
 	struct module		*owner;
@@ -180,9 +187,11 @@ extern int tick_receive_broadcast(void);
 #endif
 
 #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
+extern void tick_setup_hrtimer_broadcast(void);
 extern int tick_check_broadcast_expired(void);
 #else
 static inline int tick_check_broadcast_expired(void) { return 0; }
+static void tick_setup_hrtimer_broadcast(void) {};
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 9250130..06151ef 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -3,7 +3,7 @@ obj-y += timeconv.o posix-clock.o alarmtimer.o
 
 obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)		+= clockevents.o
 obj-$(CONFIG_GENERIC_CLOCKEVENTS)		+= tick-common.o
-obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o
+obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o tick-broadcast-hrtimer.o
 obj-$(CONFIG_GENERIC_SCHED_CLOCK)		+= sched_clock.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-sched.o
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
new file mode 100644
index 000..af1e119
--- /dev/null
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -0,0 +1,105 @@
+/*
+ * linux/kernel/time/tick-broadcast-hrtimer.c
+ * This file emulates a local clock event device
+ * via a pseudo clock device.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "tick-internal.h"
+
+static stru
Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
Hi Deepthi,

On 02/07/2014 03:15 PM, Deepthi Dharwar wrote:
> Hi Preeti,
>
> Thanks for the patch.
>
> On 02/07/2014 12:31 PM, Preeti U Murthy wrote:
>> Hi Nicolas,
>>
>> Find below the patch that will need to be squashed with this one.
>> This patch is based on the mainline. Adding Deepthi, the author of
>> the patch which introduced the powernv cpuidle driver. Deepthi,
>> do you think the below patch looks right? We do not need to do an
>> explicit local_irq_enable() since we are in the call path of the
>> cpuidle driver and that explicitly enables irqs on exit from
>> idle states.
>
> Yes, we enable irqs explicitly while entering the snooze loop and we
> always have interrupts enabled in the snooze state.
> For the NAP state, we exit with interrupts enabled, so we do not need
> an explicit enable of irqs.
>
>> On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
>>> On Thu, 6 Feb 2014, Preeti U Murthy wrote:
>>>
>>>> Hi Daniel,
>>>>
>>>> On 02/06/2014 09:55 PM, Daniel Lezcano wrote:
>>>>> Hi Nico,
>>>>>
>>>>>
>>>>> On 6 February 2014 14:16, Nicolas Pitre wrote:
>>>>>
>>>>>> The core idle loop now takes care of it.
>>>>>>
>>>>>> Signed-off-by: Nicolas Pitre
>>>>>> ---
>>>>>>  arch/powerpc/platforms/powernv/setup.c | 13 +
>>>>>>  1 file changed, 1 insertion(+), 12 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/platforms/powernv/setup.c
>>>>>> b/arch/powerpc/platforms/powernv/setup.c
>>>>>> index 21166f65c9..a932feb290 100644
>>>>>> --- a/arch/powerpc/platforms/powernv/setup.c
>>>>>> +++ b/arch/powerpc/platforms/powernv/setup.c
>>>>>> @@ -26,7 +26,6 @@
>>>>>>  #include
>>>>>>  #include
>>>>>>  #include
>>>>>> -#include
>>>>>>
>>>>>>  #include
>>>>>>  #include
>>>>>> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
>>>>>>  	return 1;
>>>>>>  }
>>>>>>
>>>>>> -void powernv_idle(void)
>>>>>> -{
>>>>>> -	/* Hook to cpuidle framework if available, else
>>>>>> -	 * call on default platform idle code
>>>>>> -	 */
>>>>>> -	if (cpuidle_idle_call()) {
>>>>>> -		power7_idle();
>>>>>> -	}
>>>>>>
>>
>> drivers/cpuidle/cpuidle-powernv.c | 4
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c
>> b/drivers/cpuidle/cpuidle-powernv.c
>> index 78fd174..130f081 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -31,11 +31,13 @@ static int snooze_loop(struct cpuidle_device *dev,
>>  	set_thread_flag(TIF_POLLING_NRFLAG);
>>
>>  	while (!need_resched()) {
>> +		ppc64_runlatch_off();
>               ^^^
> We could move this before the while() loop.
> It would be ideal to turn off the latch when we enter snooze and
> turn it on when we are about to exit, rather than doing
> it over and over in the while loop.

You are right, this can be moved out of the loop.

Thanks

Regards
Preeti U Murthy
Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
Hi Nicolas,

On 02/07/2014 04:18 PM, Nicolas Pitre wrote:
> On Fri, 7 Feb 2014, Preeti U Murthy wrote:
>
>> Hi Nicolas,
>>
>> On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
>>>
>>> What about creating arch_cpu_idle_enter() and arch_cpu_idle_exit() in
>>> arch/powerpc/kernel/idle.c and calling ppc64_runlatch_off() and
>>> ppc64_runlatch_on() respectively from there instead? Would that work?
>>> That would make the idle consolidation much easier afterwards.
>>
>> I would not suggest doing this. The ppc64_runlatch_*() routines need to
>> be called when we are sure that the cpu is about to enter or has exited
>> an idle state. Moving ppc64_runlatch_off() to arch_cpu_idle_enter(),
>> for instance, is not a good idea because there are places where the cpu
>> can decide not to enter any idle state before the call to
>> cpuidle_idle_call() itself. In that case, communicating prematurely
>> that we are in an idle state would not be a good idea.
>>
>> So it is best to add the ppc64_runlatch_*() calls in the powernv
>> cpuidle driver, IMO. We could however create
>> idle_loop_prologue/epilogue() variants inside it, so that in addition
>> to the runlatch routines we could potentially add more such similar
>> routines that are powernv specific. If there are cases where there is
>> work to be done prior to and post an entry into an idle state common
>> to both pseries and powernv, we will probably put them in
>> arch_cpu_idle_enter/exit(). But the runlatch routines are not suitable
>> to be moved there as far as I can see.
>
> OK.
>
> However, one thing we need to do as much as possible is to remove those
> loops based on need_resched() from idle backend drivers. A somewhat
> common pattern is:
>
> my_idle()
> {
> 	/* interrupts disabled on entry */
> 	while (!need_resched()) {
> 		lowpower_wait_for_interrupts();
> 		local_irq_enable();
> 		/* IRQ serviced from here */
> 		local_irq_disable();
> 	}
> 	local_irq_enable();
> 	/* interrupts enabled on exit */
> }
>
> To be able to keep statistics on the actual idleness of the CPU we'd
> need for all idle backends to always return to generic code on every
> interrupt similar to this:
>
> my_idle()
> {
> 	/* interrupts disabled on entry */
> 	lowpower_wait_for_interrupts();

You can do this for the idle states which do not have a polling nature.
IOW, those idle states are capable of doing what you describe as
"wait_for_interrupts": they do some kind of spinning at the hardware level
with interrupts enabled. A reschedule IPI or any other interrupt will wake
them up to enter the generic idle loop, where they check for the cause of
the interrupt.

But observe the idle state "snooze" on powerpc. The power that this idle
state saves comes from lowering the thread priority of the CPU. After it
lowers the thread priority, it is done; it cannot "wait_for_interrupts",
and it will exit my_idle(). It is then up to the generic idle loop to
increase the thread priority if the need_resched flag is set. Only an
interrupt routine can increase the thread priority; otherwise we need to
do it explicitly. And in such states, which have a polling nature, the cpu
will not receive a reschedule IPI.

That is why in snooze_loop() we poll on need_resched. If it is set, we
raise the priority of the thread using HMT_MEDIUM() and then exit the
my_idle() loop. In case of interrupts, the priority gets increased
automatically. This might not be required for similar idle routines on
other archs, but it is the consequence of applying this idea of a
simplified cpuidle backend driver on powerpc.

I would say you could leave the backend cpuidle drivers alone in this
regard; it could complicate the generic idle loop, IMO, depending on how
the polling states are implemented in each architecture.

> The generic code would be responsible for dealing with need_resched()
> and call back into the backend right away if necessary after updating
> some stats.
>
> Do you see a problem with the runlatch calls happening around each
> interrupt from such a simplified idle backend?

The runlatch calls could be moved outside the loop. They do not need to be
called each time.

Thanks

Regards
Preeti U Murthy

>
>
> Nicolas
>
[PATCH] time/cpuidle:Fixup fallout from hrtimer broadcast mode inclusion
The broadcast timer registration has to be done only when the
GENERIC_CLOCKEVENTS_BROADCAST and TICK_ONESHOT config options are enabled.
Also fix the max_delta_ticks value for the pseudo clock device.

Reported-by: Fengguang Wu
Signed-off-by: Preeti U Murthy
Cc: Thomas Gleixner
Cc: Ingo Molnar
---
 kernel/time/tick-broadcast-hrtimer.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index 5591aaa..bc383ac 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -81,7 +81,7 @@ static struct clock_event_device ce_broadcast_hrtimer = {
 	.min_delta_ns		= 1,
 	.max_delta_ns		= KTIME_MAX,
 	.min_delta_ticks	= 1,
-	.max_delta_ticks	= KTIME_MAX,
+	.max_delta_ticks	= ULONG_MAX,
 	.mult			= 1,
 	.shift			= 0,
 	.cpumask		= cpu_all_mask,
@@ -102,9 +102,11 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t)
 	return HRTIMER_RESTART;
 }
 
+#if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
 void tick_setup_hrtimer_broadcast(void)
 {
 	hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
 	bctimer.function = bc_handler;
 	clockevents_register_device(&ce_broadcast_hrtimer);
 }
+#endif
Re: [PATCH] time/cpuidle:Fixup fallout from hrtimer broadcast mode inclusion
Hi Thomas,

On 02/07/2014 11:27 PM, Thomas Gleixner wrote:
> On Fri, 7 Feb 2014, Preeti U Murthy wrote:
>
>> The broadcast timer registration has to be done only when
>> GENERIC_CLOCKEVENTS_BROADCAST and TICK_ONESHOT config options are enabled.
>
> Then we should compile that file only when those options are
> enabled. Where is the point to compile that code w/o the registration
> function?

Hmm, of course. The delta patch is at the end.

Another concern I have is with regard to the periodic mode of broadcast. We
currently do not support the hrtimer mode of broadcast in periodic mode.
The BROADCAST_ON/OFF calls, which take effect in periodic mode, have not
yet been modified by this patchset to keep one CPU from going into deep
idle, since we expect the deep idle states to never be chosen by the
cpuidle governor in this mode. Do you think we should bother to modify this
piece of code at all?

On the same note, my understanding is that BROADCAST_ON/OFF takes effect
only in periodic mode; in oneshot mode it is a nop. But why do we expect
the CPUs to avail of broadcast in periodic mode when they are not supposed
to be entering deep idle states? Am I missing something here? IOW, what is
the point of the periodic mode of broadcast? Is it for malfunctioning local
clock devices?

The delta patch below fixes the compile time errors. It is based on the
tip/timers/core branch.

time/cpuidle:Fixup fallout from hrtimer broadcast mode inclusion

From: Preeti U Murthy

The hrtimer mode of broadcast is supported only when the
GENERIC_CLOCKEVENTS_BROADCAST and TICK_ONESHOT config options are enabled.
Hence compile in the functions for the hrtimer mode of broadcast only when
these options are selected. Also fix the max_delta_ticks value for the
pseudo clock device.

Reported-by: Fengguang Wu
Signed-off-by: Preeti U Murthy
Cc: Thomas Gleixner
Cc: Ingo Molnar
---
 include/linux/clockchips.h           | 1 +
 kernel/time/Makefile                 | 5 ++++-
 kernel/time/tick-broadcast-hrtimer.c | 2 +-
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 20a7183..2e4cb67 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -207,6 +207,7 @@ static inline void clockevents_resume(void) {}
 static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 static inline int tick_check_broadcast_expired(void) { return 0; }
+static inline void tick_setup_hrtimer_broadcast(void) {};
 
 #endif
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 06151ef..57a413f 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -3,7 +3,10 @@ obj-y += timeconv.o posix-clock.o alarmtimer.o
 
 obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)		+= clockevents.o
 obj-$(CONFIG_GENERIC_CLOCKEVENTS)		+= tick-common.o
-obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o tick-broadcast-hrtimer.o
+ifeq ($(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST),y)
+ obj-y						+= tick-broadcast.o
+ obj-$(CONFIG_TICK_ONESHOT)			+= tick-broadcast-hrtimer.o
+endif
 obj-$(CONFIG_GENERIC_SCHED_CLOCK)		+= sched_clock.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-sched.o
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index 9242527..eb682d5 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -82,7 +82,7 @@ static struct clock_event_device ce_broadcast_hrtimer = {
 	.min_delta_ns		= 1,
 	.max_delta_ns		= KTIME_MAX,
 	.min_delta_ticks	= 1,
-	.max_delta_ticks	= KTIME_MAX,
+	.max_delta_ticks	= ULONG_MAX,
 	.mult			= 1,
 	.shift			= 0,
 	.cpumask		= cpu_all_mask,

Thanks

Regards
Preeti U Murthy

>
>> Also fix max_delta_ticks value for the pseudo clock device.
>>
>> Reported-by: Fengguang Wu
>> Signed-off-by: Preeti U Murthy
>> Cc: Thomas Gleixner
>> Cc: Ingo Molnar
>> ---
>>
>>  kernel/time/tick-broadcast-hrtimer.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/time/tick-broadcast-hrtimer.c
>> b/kernel/time/tick-broadcast-hrtimer.c
>> index 5591aaa..bc383ac 100644
>> --- a/kernel/time/tick-broadcast-hrtimer.c
>> +++ b/kernel/time/tick-broadcast-hrtimer.c
>> @@ -81,7 +81,7 @@ static struct clock_event_device ce_broadcast_hrtimer = {
>>  	.min_delta_ns		= 1,
>>  	.max_delta_ns		= KTIME_MAX,
>>  	.min_delta_ticks	= 1,
>> -	.max_delta_ticks	= KTIME_MAX,
>> +	.max_delta_ticks	= ULONG_MAX,
>>  	.mult
Re: [PATCH] time/cpuidle:Fixup fallout from hrtimer broadcast mode inclusion
Hi David,

I have sent out a revised patch at https://lkml.org/lkml/2014/2/9/2.
Can you let me know if this works for you?

Thanks

Regards
Preeti U Murthy

On 02/09/2014 01:01 PM, David Rientjes wrote:
> On Fri, 7 Feb 2014, Preeti U Murthy wrote:
>
>> The broadcast timer registration has to be done only when
>> GENERIC_CLOCKEVENTS_BROADCAST and TICK_ONESHOT config options are enabled.
>> Also fix max_delta_ticks value for the pseudo clock device.
>>
>> Reported-by: Fengguang Wu
>> Signed-off-by: Preeti U Murthy
>> Cc: Thomas Gleixner
>> Cc: Ingo Molnar
>> ---
>>
>>  kernel/time/tick-broadcast-hrtimer.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/time/tick-broadcast-hrtimer.c
>> b/kernel/time/tick-broadcast-hrtimer.c
>> index 5591aaa..bc383ac 100644
>> --- a/kernel/time/tick-broadcast-hrtimer.c
>> +++ b/kernel/time/tick-broadcast-hrtimer.c
>> @@ -81,7 +81,7 @@ static struct clock_event_device ce_broadcast_hrtimer = {
>>  	.min_delta_ns		= 1,
>>  	.max_delta_ns		= KTIME_MAX,
>>  	.min_delta_ticks	= 1,
>> -	.max_delta_ticks	= KTIME_MAX,
>> +	.max_delta_ticks	= ULONG_MAX,
>>  	.mult			= 1,
>>  	.shift			= 0,
>>  	.cpumask		= cpu_all_mask,
>> @@ -102,9 +102,11 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t)
>>  	return HRTIMER_RESTART;
>>  }
>>
>> +#if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
>>  void tick_setup_hrtimer_broadcast(void)
>>  {
>>  	hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
>>  	bctimer.function = bc_handler;
>>  	clockevents_register_device(&ce_broadcast_hrtimer);
>>  }
>> +#endif
>
> I see a build error in timers/core today:
>
> kernel/time/tick-broadcast-hrtimer.c:101:6: error: redefinition of
> 'tick_setup_hrtimer_broadcast'
> include/linux/clockchips.h:194:20: note: previous definition of
> 'tick_setup_hrtimer_broadcast' was here
>
> and I assume this is the intended fix for that, although it isn't
> mentioned in the changelog.
>
> After it's applied, this is left over:
>
> kernel/time/tick-broadcast-hrtimer.c:91:29: warning: ‘bc_handler’ defined but
> not used [-Wunused-function]
>
[RESEND PATCH 0/3] powerpc: Free up an IPI message slot for tick broadcast IPIs
This patchset is a precursor for enabling deep idle states on powerpc, when the local CPU timers stop. The tick broadcast framework in the Linux Kernel today handles wakeup of such CPUs at their next timer event by using an external clock device. At the expiry of this clock device, IPIs are sent to the CPUs in deep idle states so that they wake up to handle their respective timers. This patchset frees up one of the IPI slots on powerpc to be used to handle the tick broadcast IPI. On certain implementations of powerpc, such an external clock device is absent. The support in the tick broadcast framework to handle wakeup of CPUs from deep idle states on such implementations is currently in the tip tree. https://lkml.org/lkml/2014/2/7/906 https://lkml.org/lkml/2014/2/7/876 https://lkml.org/lkml/2014/2/7/608 With the above support in place, this patchset is next in line to enable deep idle states on powerpc. The patchset carries a RESEND tag since nothing has changed from the previous post except for an added config condition around tick_broadcast(), which handles sending broadcast IPIs, and the update in the cover letter. --- Preeti U Murthy (1): cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines Srivatsa S. Bhat (2): powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message powerpc: Implement tick broadcast IPI as a fixed IPI message arch/powerpc/include/asm/smp.h |2 - arch/powerpc/include/asm/time.h |1 arch/powerpc/kernel/smp.c | 25 ++--- arch/powerpc/kernel/time.c | 86 ++- arch/powerpc/platforms/cell/interrupt.c |2 - arch/powerpc/platforms/ps3/smp.c|2 - 6 files changed, 73 insertions(+), 45 deletions(-) --
[RESEND PATCH 2/3] powerpc: Implement tick broadcast IPI as a fixed IPI message
From: Srivatsa S. Bhat For scalability and performance reasons, we want the tick broadcast IPIs to be handled as efficiently as possible. Fixed IPI messages are one of the most efficient mechanisms available - they are faster than the smp_call_function mechanism because the IPI handlers are fixed and hence they don't involve costly operations such as adding IPI handlers to the target CPU's function queue, acquiring locks for synchronization etc. Luckily we have an unused IPI message slot, so use that to implement tick broadcast IPIs efficiently. Signed-off-by: Srivatsa S. Bhat [Functions renamed to tick_broadcast* and Changelog modified by Preeti U. Murthy] Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/smp.c | 21 + arch/powerpc/kernel/time.c |5 + arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 6 files changed, 26 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 9f7356b..ff51046 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_UNUSED 2 +#define PPC_MSG_TICK_BROADCAST 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index c1f2676..1d428e6 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); extern void GregorianDay(struct rtc_time *tm); +extern void tick_broadcast_ipi_handler(void); extern void generic_calibrate_decr(void); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ee7d76b..e2a4232 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t unused_action(int irq, void *data) +static irqreturn_t tick_broadcast_ipi_action(int irq, void *data) { - /* This slot is unused and hence available for use, if needed */ + tick_broadcast_ipi_handler(); return IRQ_HANDLED; } @@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_UNUSED] = unused_action, + [PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_UNUSED] = "ipi unused", + [PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & 
IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); + if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST)) + tick_broadcast_ipi_handler(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -289,6 +292,16 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask) do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } +#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST +void tick_broadcast(const struct cpumask *mask) +{ + unsigned int cpu; + + for_each_cpu(cpu, mask) + do_message_pass(cpu, PPC_MSG_TICK_BROADCAST); +} +#endif + #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) void smp_send_debugger_break(void) { diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index b3dab20..3ff97db 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -825,6 +825,11 @@ static void decrementer_set_mode(enum clock_event_mode mode, decrementer_set_next_event(DECREMENTER_MAX, dev); } +/* Interrupt handler for the timer broadcast IPI */ +void tick_broadcast_ipi_handler(void) +{ +} + static void register_decrementer_clockevent(int cpu) { struct clock_event_device *dec =
[RESEND PATCH 3/3] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
From: Preeti U Murthy Split timer_interrupt(), which is the local timer interrupt handler on ppc, into routines called during regular interrupt handling and __timer_interrupt(), which takes care of running local timers and collecting time-related stats. This will enable callers interested only in running expired local timers to directly call into __timer_interrupt(). One of the use cases of this is the tick broadcast IPI handling in which the sleeping CPUs need to handle the local timers that have expired. Signed-off-by: Preeti U Murthy --- arch/powerpc/kernel/time.c | 81 +--- 1 file changed, 46 insertions(+), 35 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 3ff97db..df2989b 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -478,6 +478,47 @@ void arch_irq_work_raise(void) #endif /* CONFIG_IRQ_WORK */ +void __timer_interrupt(void) +{ + struct pt_regs *regs = get_irq_regs(); + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + struct clock_event_device *evt = &__get_cpu_var(decrementers); + u64 now; + + trace_timer_interrupt_entry(regs); + + if (test_irq_work_pending()) { + clear_irq_work_pending(); + irq_work_run(); + } + + now = get_tb_or_rtc(); + if (now >= *next_tb) { + *next_tb = ~(u64)0; + if (evt->event_handler) + evt->event_handler(evt); + __get_cpu_var(irq_stat).timer_irqs_event++; + } else { + now = *next_tb - now; + if (now <= DECREMENTER_MAX) + set_dec((int)now); + /* We may have raced with new irq work */ + if (test_irq_work_pending()) + set_dec(1); + __get_cpu_var(irq_stat).timer_irqs_others++; + } + +#ifdef CONFIG_PPC64 + /* collect purr register values often, for accurate calculations */ + if (firmware_has_feature(FW_FEATURE_SPLPAR)) { + struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); + cu->current_tb = mfspr(SPRN_PURR); + } +#endif + + trace_timer_interrupt_exit(regs); +} + /* * timer_interrupt - gets called when the decrementer overflows, * with interrupts disabled.
@@ -486,8 +527,6 @@ void timer_interrupt(struct pt_regs * regs) { struct pt_regs *old_regs; u64 *next_tb = &__get_cpu_var(decrementers_next_tb); - struct clock_event_device *evt = &__get_cpu_var(decrementers); - u64 now; /* Ensure a positive value is written to the decrementer, or else * some CPUs will continue to take decrementer exceptions. @@ -519,39 +558,7 @@ void timer_interrupt(struct pt_regs * regs) old_regs = set_irq_regs(regs); irq_enter(); - trace_timer_interrupt_entry(regs); - - if (test_irq_work_pending()) { - clear_irq_work_pending(); - irq_work_run(); - } - - now = get_tb_or_rtc(); - if (now >= *next_tb) { - *next_tb = ~(u64)0; - if (evt->event_handler) - evt->event_handler(evt); - __get_cpu_var(irq_stat).timer_irqs_event++; - } else { - now = *next_tb - now; - if (now <= DECREMENTER_MAX) - set_dec((int)now); - /* We may have raced with new irq work */ - if (test_irq_work_pending()) - set_dec(1); - __get_cpu_var(irq_stat).timer_irqs_others++; - } - -#ifdef CONFIG_PPC64 - /* collect purr register values often, for accurate calculations */ - if (firmware_has_feature(FW_FEATURE_SPLPAR)) { - struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); - cu->current_tb = mfspr(SPRN_PURR); - } -#endif - - trace_timer_interrupt_exit(regs); - + __timer_interrupt(); irq_exit(); set_irq_regs(old_regs); } @@ -828,6 +835,10 @@ static void decrementer_set_mode(enum clock_event_mode mode, /* Interrupt handler for the timer broadcast IPI */ void tick_broadcast_ipi_handler(void) { + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + + *next_tb = get_tb_or_rtc(); + __timer_interrupt(); } static void register_decrementer_clockevent(int cpu)
[RESEND PATCH 1/3] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
From: Srivatsa S. Bhat The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map to a common implementation - generic_smp_call_function_single_interrupt(). So, we can consolidate them and save one of the IPI message slots, (which are precious on powerpc, since only 4 of those slots are available). So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be used for something else in the future, if desired. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/kernel/smp.c | 12 +--- arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 4 files changed, 8 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 084e080..9f7356b 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_CALL_FUNC_SINGLE 2 +#define PPC_MSG_UNUSED 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ac2621a..ee7d76b 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t call_function_single_action(int irq, void *data) +static irqreturn_t unused_action(int irq, void *data) { - generic_smp_call_function_single_interrupt(); + /* This slot is unused and hence available for use, if needed */ return IRQ_HANDLED; } @@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, + [PPC_MSG_UNUSED] = unused_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", + [PPC_MSG_UNUSED] = "ipi unused", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); - if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE)) - generic_smp_call_function_single_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule); void arch_send_call_function_single_ipi(int cpu) { - do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); + do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } void arch_send_call_function_ipi_mask(const struct cpumask *mask) 
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 2d42f3b..adf3726 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -215,7 +215,7 @@ void iic_request_IPIs(void) { iic_request_ipi(PPC_MSG_CALL_FUNCTION); iic_request_ipi(PPC_MSG_RESCHEDULE); - iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE); + iic_request_ipi(PPC_MSG_UNUSED); iic_request_ipi(PPC_MSG_DEBUGGER_BREAK); } diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c index 4b35166..00d1a7c 100644 --- a/arch/powerpc/platforms/ps3/smp.c +++ b/arch/powerpc/platforms/ps3/smp.c @@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void) BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION!= 0); BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1); - BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2); + BUILD_BUG_ON(PPC_MSG_UNUSED != 2); BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3); for (i = 0; i < MSG_COUNT; i++) { --
Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
Hi Peter, On 02/07/2014 06:11 PM, Peter Zijlstra wrote: > On Fri, Feb 07, 2014 at 05:11:26PM +0530, Preeti U Murthy wrote: >> But observe the idle state "snooze" on powerpc. The power that this idle >> state saves is through the lowering of the thread priority of the CPU. >> After it lowers the thread priority, it is done. It cannot >> "wait_for_interrupts". It will exit my_idle(). It is now upto the >> generic idle loop to increase the thread priority if the need_resched >> flag is set. Only an interrupt routine can increase the thread priority. >> Else we will need to do it explicitly. And in such states which have a >> polling nature, the cpu will not receive a reschedule IPI. >> >> That is why in the snooze_loop() we poll on need_resched. If it is set >> we up the priority of the thread using HMT_MEDIUM() and then exit the >> my_idle() loop. In case of interrupts, the priority gets automatically >> increased. > > You can poll without setting TS_POLLING/TIF_POLLING_NRFLAGS just fine > and get the IPI if that is what you want. > > Depending on how horribly unprovisioned the thread gets at the lowest > priority, that might actually be faster than polling and raising the > prio whenever it does get ran. So I am assuming you mean something like the below: my_idle() { local_irq_enable(); /* Remove the setting of the polling flag */ HMT_low(); return index; } And then exit into the generic idle loop. But the issue I see here is that the TS_POLLING/TIF_POLLING_NRFLAGS gets set immediately. So, if on testing need_resched() immediately after this returns that the TIF_NEED_RESCHED flag is set, the thread will exit at low priority right? We could raise the priority of the thread in arch_cpu_idle_exit() soon after setting the polling flag but that would mean for cases where the TIF_NEED_RESCHED flag is not set we unnecessarily raise the priority of the thread. 
Thanks Regards Preeti U Murthy
Re: [RFC] sched: CPU topology try
Hi Vincent, On 12/18/2013 06:43 PM, Vincent Guittot wrote: > This patch applies on top of the two patches [1][2] that have been proposed by > Peter for creating a new way to initialize sched_domain. It includes some > minor > compilation fixes and a trial of using this new method on an ARM platform. > [1] https://lkml.org/lkml/2013/11/5/239 > [2] https://lkml.org/lkml/2013/11/5/449 > > Based on the results of these tests, my feeling about this new way to init the > sched_domain is a bit mixed. > > The good point is that I have been able to create the same sched_domain > topologies as before and even more complex ones (where a subset of the cores > in a cluster share their powergating capabilities). I have described various > topology results below. > > I use a system that is made of a dual cluster of quad cores with > hyperthreading > for my examples. > > If one cluster (0-7) can powergate its cores independently but not the other > cluster (8-15) we have the following topology, which is equal to what I had > previously: > > CPU0: > domain 0: span 0-1 level: SMT > flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN > groups: 0 1 > domain 1: span 0-7 level: MC > flags: SD_SHARE_PKG_RESOURCES > groups: 0-1 2-3 4-5 6-7 > domain 2: span 0-15 level: CPU > flags: > groups: 0-7 8-15 > > CPU8 > domain 0: span 8-9 level: SMT > flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN > groups: 8 9 > domain 1: span 8-15 level: MC > flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN > groups: 8-9 10-11 12-13 14-15 > domain 2: span 0-15 level CPU > flags: > groups: 8-15 0-7 > > We can even describe some more complex topologies if a subset (2-7) of the > cluster can't powergate independently: > > CPU0: > domain 0: span 0-1 level: SMT > flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN > groups: 0 1 > domain 1: span 0-7 level: MC > flags: SD_SHARE_PKG_RESOURCES > groups: 0-1 2-7 > domain 2: span 0-15 level: CPU >
flags: > groups: 0-7 8-15 > > CPU2: > domain 0: span 2-3 level: SMT > flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN > groups: 0 1 > domain 1: span 2-7 level: MC > flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN > groups: 2-7 4-5 6-7 > domain 2: span 0-7 level: MC > flags: SD_SHARE_PKG_RESOURCES > groups: 2-7 0-1 > domain 3: span 0-15 level: CPU > flags: > groups: 0-7 8-15 > > In this case, we have an additional sched_domain MC level for this subset > (2-7) > of cores so we can trigger some load balance in this subset before doing that > on the complete cluster (which is the last level of cache in my example) > > We can add more levels that will describe other dependency/independency like > the frequency scaling dependency and as a result the final sched_domain > topology will have additional levels (if they have not been removed during > the degenerate sequence) The design looks good to me. In my opinion, information like P-states and C-states dependency can be kept separate from the topology levels; it might get too complicated unless the information is tightly coupled to the topology. > > My concern is about the configuration of the table that is used to create the > sched_domain. Some levels are "duplicated" with different flags configuration I do not feel this is a problem since the levels are not duplicated, rather they have different properties within them which is best represented by flags like you have introduced in this patch. > which makes the table not easily readable and we must also take care of the > order because parents have to gather all cpus of their children. So we must > choose which capabilities will be a subset of the other one. The order is The sched domain levels which have SD_SHARE_POWERDOMAIN set are expected to have cpus which are a subset of the cpus that this domain would have included had this flag not been set.
In addition to this, every higher domain, irrespective of SD_SHARE_POWERDOMAIN being set, will include all cpus of the lower domains. As far as I see, this patch does not change these assumptions. Hence I am unable to imagine a scenario where the parent might not include all cpus of its child domains. Do you have such a scenario in mind which can arise due to this patch? Thanks Regards Preeti U Murthy
[PATCH V2] time/cpuidle: Support in tick broadcast framework for archs without external clock device
On some architectures, in certain CPU deep idle states the local timers stop. An external clock device is used to wakeup these CPUs. The kernel support for the wakeup of these CPUs is provided by the tick broadcast framework by using the external clock device as the wakeup source. However not all implementations of architectures provide such an external clock device such as some PowerPC ones. This patch includes support in the broadcast framework to handle the wakeup of the CPUs in deep idle states on such systems by queuing a hrtimer on one of the CPUs, meant to handle the wakeup of CPUs in deep idle states. This CPU is identified as the bc_cpu. Each time the hrtimer expires, it is reprogrammed for the next wakeup of the CPUs in deep idle state after handling broadcast. However when a CPU is about to enter deep idle state with its wakeup time earlier than the time at which the hrtimer is currently programmed, it *becomes the new bc_cpu* and restarts the hrtimer on itself. This way the job of doing broadcast is handed around to the CPUs that ask for the earliest wakeup just before entering deep idle state. This is consistent with what happens in cases where an external clock device is present. The smp affinity of this clock device is set to the CPU with the earliest wakeup. The important point here is that the bc_cpu cannot enter deep idle state since it has a hrtimer queued to wakeup the other CPUs in deep idle. Hence it cannot have its local timer stopped. Therefore for such a CPU, the BROADCAST_ENTER notification has to fail implying that it cannot enter deep idle state. On architectures where an external clock device is present, all CPUs can enter deep idle. During hotplug of the bc_cpu, the job of doing a broadcast is assigned to the first cpu in the broadcast mask. This newly nominated bc_cpu is woken up by an IPI so as to queue the above mentioned hrtimer on itself. 
Changes from V1:https://lkml.org/lkml/2013/12/12/687 If idle states exist when the local timers of CPUs stop and there is no external clock device to handle their wakeups the kernel switches the tick mode to periodic so as to prevent the CPUs from entering such idle states altogether. Therefore include an additional check consistent with this patch, where if an external clock device does not exist, queue a hrtimer to handle wakeups. If this also fails, only then switch the tick mode to periodic. Signed-off-by: Preeti U Murthy --- include/linux/clockchips.h |4 - kernel/time/clockevents.c|8 +- kernel/time/tick-broadcast.c | 180 ++ kernel/time/tick-internal.h |8 +- 4 files changed, 173 insertions(+), 27 deletions(-) diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index 493aa02..bbda37b 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; } #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS -extern void clockevents_notify(unsigned long reason, void *arg); +extern int clockevents_notify(unsigned long reason, void *arg); #else -static inline void clockevents_notify(unsigned long reason, void *arg) {} +static inline int clockevents_notify(unsigned long reason, void *arg) {} #endif #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */ diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 086ad60..bbbd671 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -525,11 +525,11 @@ void clockevents_resume(void) /** * clockevents_notify - notification about relevant events */ -void clockevents_notify(unsigned long reason, void *arg) +int clockevents_notify(unsigned long reason, void *arg) { struct clock_event_device *dev, *tmp; unsigned long flags; - int cpu; + int cpu, ret = 0; raw_spin_lock_irqsave(_lock, flags); @@ -542,11 +542,12 @@ void clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_BROADCAST_ENTER: case 
CLOCK_EVT_NOTIFY_BROADCAST_EXIT: - tick_broadcast_oneshot_control(reason); + ret = tick_broadcast_oneshot_control(reason); break; case CLOCK_EVT_NOTIFY_CPU_DYING: tick_handover_do_timer(arg); + tick_handover_bc_cpu(arg); break; case CLOCK_EVT_NOTIFY_SUSPEND: @@ -585,6 +586,7 @@ void clockevents_notify(unsigned long reason, void *arg) break; } raw_spin_unlock_irqrestore(_lock, flags); + return ret; } EXPORT_SYMBOL_GPL(clockevents_notify); diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 9532690..1755984 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "tick-internal.h" @@ -35,6 +36,11 @@ static cpumask_var_t tmpmask; static DEFINE_RA
Re: [PATCH 1/2] tick: broadcast: Deny per-cpu clockevents from being broadcast sources
Hi Soren, On 09/13/2013 03:50 PM, Preeti Murthy wrote: > Hi, > > So the patch that Daniel points out http://lwn.net/Articles/566270/ , > enables broadcast functionality > without using an external global clock device. It uses one of the per cpu > clock devices to enable the broadcast functionality. > > The way it achieves this is by creating a pseudo clock device and > associating it with one of the cpus clock device and > by having a hrtimer queued on the same cpu. This pseudo clock device acts > as the broadcast device, and the > per cpu clock device that it is associated with acts as the broadcast > source. > > The disadvantages that Soren mentions in having a per cpu clock device as > the broadcast source can be overcome > by following the approach proposed in this patch n the way described below: > > 1. What if the cpu, whose clock device is the broadcast source goes offline? > > The solution that the above patch proposes is associate the pseudo clock > device with another cpu and move the hrtimer > whose function is explained in the next point to another cpu. The broadcast > functionality continues to remain active transparently. > > 2. The cpu that requires broadcast functionality is different from the cpu > whose clock device is the broadcast source. > So how will the former cpu program/control the clock device of the latter > cpu? > > The above patch queues a hrtimer on the cpu whose clock device is the > broadcast source, which expires at > max(tick_broadcast_period, dev->next_event), where tick_broadcast_period > is what we define and dev is the > pseudo device whose next event is set by the broadcast framework. > > On expiry of this hrtimer, do broadcast handling and reprogram the hrtimer > with same as above, > max(tick_broadcast_period, dev->next_event). > > This ensures that a cpu that requires broadcast function to be activated > need not program the broadcast source, > which also happens to be a per cpu clock device. 
The hrtimer queued on the > cpu whose clock device is the > broadcast source takes care of when to do broadcast handling. > tick_broadcast_period ensures that we do > not miss wakeups. This is introduced to overcome the constraint of a cpu > not being able to program the clock > device of another cpu. > > Soren, do let me know if the above approach described in the patch has not > addressed any of the challenges > that you see with having a per cpu clock device as the broadcast source. > > Regards > Preeti U Murthy > > > On Fri, Sep 13, 2013 at 1:55 PM, Daniel Lezcano > wrote: > >> On 09/12/2013 10:30 PM, Thomas Gleixner wrote: >>> On Thu, 12 Sep 2013, Soren Brinkmann wrote: >>>> From: Stephen Boyd >>>> >>>> On most ARM systems the per-cpu clockevents are truly per-cpu in >>>> the sense that they can't be controlled on any other CPU besides >>>> the CPU that they interrupt. If one of these clockevents were to >>>> become a broadcast source we will run into a lot of trouble >>>> because the broadcast source is enabled on the first CPU to go >>>> into deep idle (if that CPU suffers from FEAT_C3_STOP) and that >>>> could be a different CPU than what the clockevent is interrupting >>>> (or even worse the CPU that the clockevent interrupts could be >>>> offline). >>>> >>>> Theoretically it's possible to support per-cpu clockevents as the >>>> broadcast source but so far we haven't needed this and supporting >>>> it is rather complicated. Let's just deny the possibility for now >>>> until this becomes a reality (let's hope it never does!). >>> >>> Well, we can't do it this way. There are globally accessible clock >>> event devices which deliver only to cpu0. So the mask check might be >>> causing failure here. >>> >>> Just add a feature flag CLOCK_EVT_FEAT_PERCPU to the clock event >>> device and check for it. >> >> It sounds probably more understandable than dealing with the cpumasks. 
>> >> I am wondering if this is semantically opposed to >> http://lwn.net/Articles/566270/ ? >> >> [PATCH V3 0/6] cpuidle/ppc: Enable broadcast support for deep idle states >> >> -- Daniel So the point I am trying to make is that the fix that you have proposed on this thread is valid. It is difficult to ensure that a per cpu clock device doubles up as the broadcast source without significant code changes to the current broadcast code and the timer code. But the patch [PATCH V3 0/6] cpuidle/ppc: Enable broadcast support for deep idle states, attempts to overcome t
Re: [PATCH 1/2] tick: broadcast: Deny per-cpu clockevents from being broadcast sources
Hi Soren, On 09/13/2013 09:53 PM, Sören Brinkmann wrote: > Hi Preeti, > Thanks for the explanation but now I'm a little confused. That's a lot of > details and I'm lacking the in depth knowledge to fully understand > everything. > > Is it correct to say, that your patch series enables per cpu devices to > be the broadcast device - for PPC? Not really. We have a pseudo clock device, which is registered as the broadcast device. This clock device has all the features of an external clock device that the broadcast framework expects from a broadcast device, like !CLOCK_FEAT_C3STOP & !FEAT_PERCPU that you introduce in your patch. It is as though we trick the broadcast framework into believing that we have an external device, while in reality the pseudo device is just a dummy. So if this is a pseudo device, which gets registered as the broadcast device, how do we program it to handle broadcast events? That is where the per cpu device steps in. It serves as the clock source to this pseudo device. Meaning we program the per cpu device for the next broadcast event using a hrtimer framework that we introduce, which calls pseudo_dev->event_handler on expiry. This is nothing but the broadcast handler. Therefore we are able to manage broadcast without having to have an explicit clock device for the purpose. > And that would mean, that even though you have a per cpu device, you'd > deliberately not set the FEAT_PERCPU flag, because on PPC a per cpu > timer is a valid broadcast device? No, we would set the FEAT_PERCPU for the per cpu device on PPC. As I mentioned above, this is not going to be registered as the broadcast device. We would however not set this flag for the pseudo device, which we register as the broadcast device. > > Assuming that is not going into an utterly wrong direction: How would we > close on this one? AFAIK, ARM does not have this capability and I guess > it won't be added. So, should I go forward with the fix proposed by > Thomas?
Should we rename the FEAT_PERCPU flag to something else, given > that PPC may use per cpu devices for broadcasting and the sole usage of > that flag is to prevent such a device from becoming the broadcast device? You can go ahead with this fix because as explained above, when we register a broadcast device we use a pseudo device which has the features that the broadcast framework approves. The per cpu device does not register itself with the broadcast framework. It merely programs itself for the next broadcast event. Hence this fix will not hinder the broadcast support on PPC. > > Thanks, > Sören > > Regards Preeti U Murthy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
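To make the pseudo-device trick above concrete, here is a rough user-space model (all names are illustrative, not the kernel's): the registered broadcast device is a dummy struct whose event_handler is the broadcast framework's handler, and a per-cpu hrtimer expiry, rather than real external hardware, is what invokes it.

```c
/* Hypothetical model of the pseudo broadcast device described above.
 * Nothing here is kernel code; it only mirrors the shape of the trick. */
struct mock_clock_event_device {
    const char *name;
    void (*event_handler)(struct mock_clock_event_device *dev);
    int handled;    /* counts broadcast expiries seen by the handler */
};

/* Stands in for the broadcast framework's handler, which would send
 * IPIs to the cpus in deep idle whose timers have expired. */
static void mock_broadcast_handler(struct mock_clock_event_device *dev)
{
    dev->handled++;
}

/* What the per-cpu hrtimer callback would do on expiry: invoke the
 * pseudo device's handler as though an external device had fired. */
static void mock_hrtimer_expiry(struct mock_clock_event_device *bc_dev)
{
    if (bc_dev->event_handler)
        bc_dev->event_handler(bc_dev);
}

static int run_mock_broadcast(int expiries)
{
    struct mock_clock_event_device bc = {
        .name = "pseudo-broadcast",
        .event_handler = mock_broadcast_handler,
        .handled = 0,
    };
    for (int i = 0; i < expiries; i++)
        mock_hrtimer_expiry(&bc);
    return bc.handled;
}
```

The point of the sketch is only that the framework never inspects what drives event_handler; any timer that calls it at the right moments is indistinguishable from an external clock device.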
Re: [RFC][PATCH 0/7] Power-aware scheduling v2
Hi, On 10/14/2013 07:02 PM, Peter Zijlstra wrote: > On Fri, Oct 11, 2013 at 06:19:10PM +0100, Morten Rasmussen wrote: >> Hi, >> >> I have revised the previous power scheduler proposal[1] trying to address as >> many of the comments as possible. The overall idea was discussed at LPC[2,3]. >> The revised design has removed the power scheduler and replaced it with a >> high >> level power driver interface. An interface that allows the scheduler to query >> the power driver for information and provide hints to guide power management >> decisions in the power driver. >> >> The power driver is going to be a unified platform power driver that can >> replace cpufreq and cpuidle drivers. Generic power policies will be optional >> helper functions called from the power driver. Platforms may choose to >> implement their own policies as part of their power driver. >> >> This RFC series prototypes a part of the power driver interface (cpu capacity >> hints) and shows how they can be used from the scheduler. More extensive use >> of >> the power driver hints and queries is left for later. The focus for now is >> the >> power driver interface. The patch series includes a power driver/cpufreq >> governor that can use existing cpufreq drivers as backend. It has been tested >> (not thoroughly) on ARM TC2. The cpufreq governor power driver implementation >> is rather horrible, but it illustrates how the power driver interface can be >> used. Native power drivers is on the todo list. >> >> The power driver interface is still missing quite a few calls to handle: >> Idle, >> adding extra information to the sched_domain hierarchy to guide scheduling >> decisions (packing), and possibly scaling of tracked load to compensate for >> frequency changes and asymmetric systems (big.LITTLE). >> >> This set is based on 3.11. I have done ARM TC2 testing based on linux-linaro >> 2013.08[4] to get cpufreq support for TC2. > > What I'm missing is a general overview of why what and how. 
I agree that the "why" needs to be mentioned very clearly since the patchset revolves around it. As far as I understand, we need a single controller for deciding the power efficiency of the kernel, which is exposed to all the user policies and the frequency+idle states stats of the CPU to begin with. These stats are being supplied by the power driver. Having these details and decision making in multiple places like we do today in cpuidle, cpufreq and the scheduler will probably cause problems. For example, when the power efficiency of the kernel goes wrong we have trouble pointing out the reason behind it. Where did the problem arise from among the above three power policy decision makers? This is a maintainability concern. Another reason is that the power saving decisions made by, say, cpuidle may not complement the power saving decisions made by cpufreq. This can lead to inconsistent results across different workloads. Thus, by having a single policy maker for power savings, we are hoping to solve the primary concerns of consistent behaviour from the kernel in terms of power efficiency and improved maintainability. > > In particular; how does this proposal lead to power savings. Is there a > mathematical model that supports this framework? Something where if you > give it a task set with global utilisation < 1 (ie. there's idle time), > it results in less power used. AFAIK, this patchset is an attempt to achieve consistency in the power efficiency of the kernel across workloads with the existing algorithms, in addition to a cleanup involving integration of the power policy making in one place as explained above. In an attempt to do so, *maybe* better power numbers can be obtained or at least the default power efficiency of the kernel will show up. However adding new patchsets like packing small tasks, heterogeneous scheduling, power aware scheduling etc. *should* then yield good and consistent power savings since they now stand on top of an integrated stable power driver.
Regards Preeti U Murthy > > Also, how does this proposal deal with cpufreq's fundamental broken > approach to SMP? Afaict nothing considers the effect of one cpu upon > another -- something which isn't true at all. > > In fact, I don't see anything except a random bunch of hooks without an > over-all picture of how to get less power used. >
[PATCH V3 0/6] cpuidle/ppc: Enable broadcast support for deep idle states
On PowerPC, when CPUs enter deep idle states, their local timers get switched off. An external clock device needs to be programmed to wake them up at their next timer event. On PowerPC, we do not have an external device equivalent to HPET, which is currently used on architectures like x86 under the same scenario. Instead we assign the local timer of one of the CPUs to do this job. This patchset is an attempt to hook onto the existing timer broadcast framework in the kernel by using the local timer of one of the CPUs to do the job of the external clock device. On expiry of this device, the broadcast framework today has the infrastructure to send ipis to all such CPUs whose local timers have expired. Hence the term "broadcast" and the ipi sent is called the broadcast ipi. This patch series is ported on top of 3.11-rc7 + the cpuidle driver backend for power posted by Deepthi Dharwar recently. http://comments.gmane.org/gmane.linux.ports.ppc.embedded/63556

Changes in V3:
1. Fix the way in which a broadcast ipi is handled on the idling cpus. Timer handling on a broadcast ipi is now done without missing out any timer stats generation.
2. Fix a bug in the programming of the hrtimer meant to do broadcast. Program it to trigger at the earlier of a "broadcast period" and the next wakeup event. By introducing the "broadcast period" as the maximum period after which the broadcast hrtimer can fire, we ensure that we do not miss wakeups in corner cases.
3. On hotplug of a broadcast cpu, trigger the hrtimer meant to do broadcast to fire immediately on the new broadcast cpu. This will ensure we do not miss doing a broadcast pending in the nearest future.
4. Change the type of allocation from GFP_KERNEL to GFP_NOWAIT while initializing bc_hrtimer, since we are in an atomic context and cannot sleep.
5. Use the broadcast ipi to wake up the newly nominated broadcast cpu on hotplug of the old, instead of smp_call_function_single().
This is because interrupts are disabled at this point and we should not be using smp_call_function_single() or its children in this context to send an ipi.
6. Move GENERIC_CLOCKEVENTS_BROADCAST to arch/powerpc/Kconfig.
7. Fix coding style issues.

Changes in V2: https://lkml.org/lkml/2013/8/14/239
1. Dynamically pick a broadcast CPU, instead of having a dedicated one.
2. Remove the constraint of having to disable tickless idle on the broadcast CPU by queueing an hrtimer dedicated to do broadcast.

V1 posting: https://lkml.org/lkml/2013/7/25/740.

The patchset has been tested for stability in idle and during multi threaded ebizzy runs. Many thanks to Ben H, Frederic Weisbecker, Li Yang, Srivatsa S. Bhat and Vaidyanathan Srinivasan for all their comments and suggestions so far. --- Preeti U Murthy (4): cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines cpuidle/ppc: Add basic infrastructure to support the broadcast framework on ppc cpuidle/ppc: Introduce the deep idle state in which the local timers stop cpuidle/ppc: Nominate new broadcast cpu on hotplug of the old Srivatsa S. Bhat (2): powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC) powerpc: Implement broadcast timer interrupt as an IPI message arch/powerpc/Kconfig|1 arch/powerpc/include/asm/smp.h |3 - arch/powerpc/include/asm/time.h |4 + arch/powerpc/kernel/smp.c | 23 +++- arch/powerpc/kernel/time.c | 143 -- arch/powerpc/platforms/cell/interrupt.c |2 arch/powerpc/platforms/ps3/smp.c|2 drivers/cpuidle/cpuidle-ibm-power.c | 172 +++ scripts/kconfig/streamline_config.pl|0 9 files changed, 307 insertions(+), 43 deletions(-) mode change 100644 => 100755 scripts/kconfig/streamline_config.pl
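Change (2) in the V3 changelog amounts to arming the broadcast hrtimer for the earlier of the next pending wakeup and one broadcast period from now, so a mis-tracked wakeup can be late by at most one broadcast period. A minimal sketch of that decision, with made-up names and abstract tick units:

```c
#include <stdint.h>

/* Illustrative helper, not kernel code: compute the next expiry of the
 * broadcast hrtimer. The timer never sleeps longer than one broadcast
 * period, but fires earlier if a cpu in deep idle has a nearer wakeup. */
static uint64_t next_bc_expiry(uint64_t now, uint64_t bc_period,
                               uint64_t next_wakeup)
{
    uint64_t cap = now + bc_period;   /* upper bound on the sleep */
    return next_wakeup < cap ? next_wakeup : cap;
}
```

With this policy, even if the bookkeeping of pending wakeups goes wrong in a corner case, the periodic cap guarantees the bc_cpu re-checks within one broadcast period.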
[PATCH V3 1/6] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)
From: Srivatsa S. Bhat The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map to a common implementation - generic_smp_call_function_single_interrupt(). So, we can consolidate them and save one of the IPI message slots, (which are precious, since only 4 of those slots are available). So, implement the functionality of PPC_MSG_CALL_FUNC using PPC_MSG_CALL_FUNC_SINGLE itself and release its IPI message slot, so that it can be used for something else in the future, if desired. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Preeti U Murthy --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/kernel/smp.c | 12 +--- arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 4 files changed, 8 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 48cfc85..a632b6e 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu); * * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up * in /proc/interrupts will be wrong!!! 
--Troy */ -#define PPC_MSG_CALL_FUNCTION 0 +#define PPC_MSG_UNUSED 0 #define PPC_MSG_RESCHEDULE 1 #define PPC_MSG_CALL_FUNC_SINGLE 2 #define PPC_MSG_DEBUGGER_BREAK 3 diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 38b0ba6..bc41e9f 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -111,9 +111,9 @@ int smp_generic_kick_cpu(int nr) } #endif /* CONFIG_PPC64 */ -static irqreturn_t call_function_action(int irq, void *data) +static irqreturn_t unused_action(int irq, void *data) { - generic_smp_call_function_interrupt(); + /* This slot is unused and hence available for use, if needed */ return IRQ_HANDLED; } @@ -144,14 +144,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) } static irq_handler_t smp_ipi_action[] = { - [PPC_MSG_CALL_FUNCTION] = call_function_action, + [PPC_MSG_UNUSED] = unused_action, /* Slot available for future use */ [PPC_MSG_RESCHEDULE] = reschedule_action, [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { - [PPC_MSG_CALL_FUNCTION] = "ipi call function", + [PPC_MSG_UNUSED] = "ipi unused", [PPC_MSG_RESCHEDULE] = "ipi reschedule", [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", @@ -221,8 +221,6 @@ irqreturn_t smp_ipi_demux(void) all = xchg(>messages, 0); #ifdef __BIG_ENDIAN - if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNCTION))) - generic_smp_call_function_interrupt(); if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE))) scheduler_ipi(); if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE))) @@ -265,7 +263,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask) unsigned int cpu; for_each_cpu(cpu, mask) - do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); + do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); } #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 
2d42f3b..28166e4 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -213,7 +213,7 @@ static void iic_request_ipi(int msg) void iic_request_IPIs(void) { - iic_request_ipi(PPC_MSG_CALL_FUNCTION); + iic_request_ipi(PPC_MSG_UNUSED); iic_request_ipi(PPC_MSG_RESCHEDULE); iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE); iic_request_ipi(PPC_MSG_DEBUGGER_BREAK); diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c index 4b35166..488f069 100644 --- a/arch/powerpc/platforms/ps3/smp.c +++ b/arch/powerpc/platforms/ps3/smp.c @@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void) * to index needs to be setup. */ - BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION!= 0); + BUILD_BUG_ON(PPC_MSG_UNUSED != 0); BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1); BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2); BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3);
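For reference, the demux check seen in the patch above, all & (1 << (24 - 8 * PPC_MSG_*)), works because each of the four IPI message slots owns one byte of a 32-bit word, with slot 0 in the most significant byte on big-endian. A small stand-alone model of that encoding (constants renamed to avoid clashing with the kernel's):

```c
#include <stdint.h>

/* Mirrors the four IPI message slots; only four exist, which is why
 * freeing one up in the patch above is valuable. */
#define MSG_UNUSED            0
#define MSG_RESCHEDULE        1
#define MSG_CALL_FUNC_SINGLE  2
#define MSG_DEBUGGER_BREAK    3

/* Bit tested by the big-endian demux path: slot 0 lives in the most
 * significant byte of the 32-bit message word. */
static uint32_t msg_bit(int msg)
{
    return 1u << (24 - 8 * msg);
}

/* What smp_ipi_demux() effectively asks for each slot after the
 * xchg() that atomically fetches and clears the pending word. */
static int message_pending(uint32_t all, int msg)
{
    return (all & msg_bit(msg)) != 0;
}
```

So a word of 0x01000000 means slot 0 fired, 0x00010000 means a reschedule, and combinations encode several pending messages fetched in one atomic exchange.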
[PATCH V3 3/6] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
On PowerPC, when CPUs enter deep idle states, their local timers get switched off. The local timer is called the decrementer. An external clock device needs to be programmed to wake them up at their next timer event. On PowerPC, we do not have an external device equivalent to HPET, which is currently used on architectures like x86 under the same scenario. Instead we assign the local timer of one of the CPUs to do this job. On expiry of this timer, the broadcast framework today has the infrastructure to send ipis to all such CPUs whose local timers have expired. When such an ipi is received, the cpus in deep idle should handle their expired timers. It should be as though they were woken up from a timer interrupt itself. Hence this external ipi serves as an emulated timer interrupt for the cpus in deep idle. Therefore ideally on ppc, these cpus should call timer_interrupt(), which is the interrupt handler for a decrementer interrupt. But timer_interrupt() also contains routines which are usually performed in an interrupt handler. These are not required to be done in this scenario as the external interrupt handler takes care of them. Therefore split up timer_interrupt() into routines performed during regular interrupt handling and __timer_interrupt(), which takes care of running local timers and collecting time related stats. Now on a broadcast ipi, call __timer_interrupt().
Signed-off-by: Preeti U Murthy --- arch/powerpc/kernel/time.c | 69 1 file changed, 37 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 0dfa0c5..eb48291 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -478,6 +478,42 @@ void arch_irq_work_raise(void) #endif /* CONFIG_IRQ_WORK */ +static void __timer_interrupt(void) +{ + struct pt_regs *regs = get_irq_regs(); + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + struct clock_event_device *evt = &__get_cpu_var(decrementers); + u64 now; + + __get_cpu_var(irq_stat).timer_irqs++; + trace_timer_interrupt_entry(regs); + + if (test_irq_work_pending()) { + clear_irq_work_pending(); + irq_work_run(); + } + + now = get_tb_or_rtc(); + if (now >= *next_tb) { + *next_tb = ~(u64)0; + if (evt->event_handler) + evt->event_handler(evt); + } else { + now = *next_tb - now; + if (now <= DECREMENTER_MAX) + set_dec((int)now); + } + +#ifdef CONFIG_PPC64 + /* collect purr register values often, for accurate calculations */ + if (firmware_has_feature(FW_FEATURE_SPLPAR)) { + struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); + cu->current_tb = mfspr(SPRN_PURR); + } +#endif + trace_timer_interrupt_exit(regs); +} + /* * timer_interrupt - gets called when the decrementer overflows, * with interrupts disabled. @@ -486,8 +522,6 @@ void timer_interrupt(struct pt_regs * regs) { struct pt_regs *old_regs; u64 *next_tb = &__get_cpu_var(decrementers_next_tb); - struct clock_event_device *evt = &__get_cpu_var(decrementers); - u64 now; /* Ensure a positive value is written to the decrementer, or else * some CPUs will continue to take decrementer exceptions. 
@@ -510,8 +544,6 @@ void timer_interrupt(struct pt_regs * regs) */ may_hard_irq_enable(); - __get_cpu_var(irq_stat).timer_irqs++; - #if defined(CONFIG_PPC32) && defined(CONFIG_PMAC) if (atomic_read(_n_lost_interrupts) != 0) do_IRQ(regs); @@ -520,34 +552,7 @@ void timer_interrupt(struct pt_regs * regs) old_regs = set_irq_regs(regs); irq_enter(); - trace_timer_interrupt_entry(regs); - - if (test_irq_work_pending()) { - clear_irq_work_pending(); - irq_work_run(); - } - - now = get_tb_or_rtc(); - if (now >= *next_tb) { - *next_tb = ~(u64)0; - if (evt->event_handler) - evt->event_handler(evt); - } else { - now = *next_tb - now; - if (now <= DECREMENTER_MAX) - set_dec((int)now); - } - -#ifdef CONFIG_PPC64 - /* collect purr register values often, for accurate calculations */ - if (firmware_has_feature(FW_FEATURE_SPLPAR)) { - struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); - cu->current_tb = mfspr(SPRN_PURR); - } -#endif - - trace_timer_interrupt_exit(regs); - + __timer_interrupt(); irq_exit(); set_irq_regs(old_regs); }
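The fire-or-rearm decision that moves into __timer_interrupt() in the patch above can be modelled in user space roughly as follows (names, types and the return convention are inventions of this sketch, not the kernel's):

```c
#include <stdint.h>

/* Stand-in for DECREMENTER_MAX: the largest positive value the 32-bit
 * decrementer can be programmed with. */
#define MOCK_DECREMENTER_MAX 0x7fffffffu

/* Model of __timer_interrupt()'s core: if the timebase has reached the
 * recorded next event, report that the event handler should run
 * (return 1) and mark no further event programmed; otherwise re-arm
 * the decrementer with the remaining ticks, clamped to the maximum. */
static int timer_check(uint64_t now, uint64_t *next_tb, uint32_t *dec)
{
    if (now >= *next_tb) {
        *next_tb = ~(uint64_t)0;   /* no further event programmed */
        return 1;                  /* caller runs evt->event_handler */
    }
    uint64_t delta = *next_tb - now;
    if (delta <= MOCK_DECREMENTER_MAX)
        *dec = (uint32_t)delta;    /* re-arm for the remaining ticks */
    return 0;
}
```

A cpu woken by the broadcast ipi takes the first branch, which is exactly why the real patch resets decrementers_next_tb to "now" before calling __timer_interrupt() on that path.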
[PATCH V3 2/6] powerpc: Implement broadcast timer interrupt as an IPI message
From: Srivatsa S. Bhat For scalability and performance reasons, we want the broadcast IPIs to be handled as efficiently as possible. Fixed IPI messages are one of the most efficient mechanisms available - they are faster than the smp_call_function mechanism because the IPI handlers are fixed and hence they don't involve costly operations such as adding IPI handlers to the target CPU's function queue, acquiring locks for synchronization etc. Luckily we have an unused IPI message slot, so use that to implement broadcast timer interrupts efficiently. Signed-off-by: Srivatsa S. Bhat [Changelog modified by pre...@linux.vnet.ibm.com] Signed-off-by: Preeti U Murthy --- arch/powerpc/include/asm/smp.h |3 ++- arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/smp.c | 19 +++ arch/powerpc/kernel/time.c |4 arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- scripts/kconfig/streamline_config.pl|0 7 files changed, 24 insertions(+), 7 deletions(-) mode change 100644 => 100755 scripts/kconfig/streamline_config.pl diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index a632b6e..22f6d63 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu); * * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up * in /proc/interrupts will be wrong!!! --Troy */ -#define PPC_MSG_UNUSED 0 +#define PPC_MSG_TIMER 0 #define PPC_MSG_RESCHEDULE 1 #define PPC_MSG_CALL_FUNC_SINGLE 2 #define PPC_MSG_DEBUGGER_BREAK 3 @@ -194,6 +194,7 @@ extern struct smp_ops_t *smp_ops; extern void arch_send_call_function_single_ipi(int cpu); extern void arch_send_call_function_ipi_mask(const struct cpumask *mask); +extern void arch_send_tick_broadcast(const struct cpumask *mask); /* Definitions relative to the secondary CPU spin loop * and entry point. 
Not all of them exist on both 32 and diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index c1f2676..4e35282 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); extern void GregorianDay(struct rtc_time *tm); +extern void decrementer_timer_interrupt(void); extern void generic_calibrate_decr(void); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index bc41e9f..d3b7014 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -111,9 +112,9 @@ int smp_generic_kick_cpu(int nr) } #endif /* CONFIG_PPC64 */ -static irqreturn_t unused_action(int irq, void *data) +static irqreturn_t timer_action(int irq, void *data) { - /* This slot is unused and hence available for use, if needed */ + decrementer_timer_interrupt(); return IRQ_HANDLED; } @@ -144,14 +145,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) } static irq_handler_t smp_ipi_action[] = { - [PPC_MSG_UNUSED] = unused_action, /* Slot available for future use */ + [PPC_MSG_TIMER] = timer_action, [PPC_MSG_RESCHEDULE] = reschedule_action, [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { - [PPC_MSG_UNUSED] = "ipi unused", + [PPC_MSG_TIMER] = "ipi timer", [PPC_MSG_RESCHEDULE] = "ipi reschedule", [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", @@ -221,6 +222,8 @@ irqreturn_t smp_ipi_demux(void) all = xchg(>messages, 0); #ifdef __BIG_ENDIAN + if (all & (1 << (24 - 8 * PPC_MSG_TIMER))) + decrementer_timer_interrupt(); if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE))) scheduler_ipi(); if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE))) @@ -266,6 +269,14 @@ void 
arch_send_call_function_ipi_mask(const struct cpumask *mask) do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); } +void arch_send_tick_broadcast(const struct cpumask *mask) +{ + unsigned int cpu; + + for_each_cpu(cpu, mask) + do_message_pass(cpu, PPC_MSG_TIMER); +} + #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) void smp_send_debugger_break(void) { diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 65ab9e9..0dfa0c5 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -813,6 +813,10 @@ static void decrem
[PATCH V3 4/6] cpuidle/ppc: Add basic infrastructure to support the broadcast framework on ppc
The broadcast framework in the kernel expects an external clock device which will continue functioning in deep idle states also. This ability is specified by the "non-existence" of the feature C3STOP. This is the device that it relies upon to wake up cpus in deep idle states whose local timers/clock devices get switched off in deep idle states. On ppc we do not have such an external device. Therefore we introduce a pseudo clock device, which has the features of this external clock device, called the broadcast_clockevent. Having such a device qualifies the cpus to enter and exit deep idle states from the point of view of the broadcast framework, because there is an external device to wake them up. Specifically the broadcast framework uses this device's event handler and next_event members in its functioning. On ppc we use this device as the gateway into the broadcast framework and *not* as a timer. An explicit timer infrastructure will be developed in the following patches to keep track of when to wake up cpus in deep idle. Since this device is a pseudo device, it can be safely assumed to work for all cpus. Therefore its cpumask is set to cpu_possible_mask. Also due to the same reason, the set_next_event() routine associated with this device is a nop. The broadcast framework relies on a broadcast functionality being made available in the .broadcast member of the local clock devices on all cpus. This function is called upon by the broadcast framework on one of the nominated cpus, to send ipis to all the cpus in deep idle at their expired timer events. This patch also initializes the .broadcast member of the decrementer whose job is to send the broadcast ipis. When cpus inform the broadcast framework that they are entering deep idle, their local timers are put in shutdown mode. On ppc, this means setting decrementers_next_tb and programming the decrementer to DECREMENTER_MAX.
On being woken up by the broadcast ipi, these cpus call __timer_interrupt(), which runs the local timers only if decrementer_next_tb has expired. Therefore on being woken up from the broadcast ipi, set the decrementers_next_tb to now before calling __timer_interrupt(). Signed-off-by: Preeti U Murthy --- arch/powerpc/Kconfig|1 + arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/time.c | 69 ++- 3 files changed, 70 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index dbd9d3c..550fc04 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -130,6 +130,7 @@ config PPC select GENERIC_CMOS_UPDATE select GENERIC_TIME_VSYSCALL_OLD select GENERIC_CLOCKEVENTS + select GENERIC_CLOCKEVENTS_BROADCAST select GENERIC_STRNCPY_FROM_USER select GENERIC_STRNLEN_USER select HAVE_MOD_ARCH_SPECIFIC diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index 4e35282..264dc96 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -24,6 +24,7 @@ extern unsigned long tb_ticks_per_jiffy; extern unsigned long tb_ticks_per_usec; extern unsigned long tb_ticks_per_sec; extern struct clock_event_device decrementer_clockevent; +extern struct clock_event_device broadcast_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index eb48291..bda78bb 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -42,6 +42,7 @@ #include #include #include +#include #include #include #include @@ -97,8 +98,13 @@ static struct clocksource clocksource_timebase = { static int decrementer_set_next_event(unsigned long evt, struct clock_event_device *dev); +static int broadcast_set_next_event(unsigned long evt, + struct clock_event_device *dev); +static void broadcast_set_mode(enum clock_event_mode mode, +struct clock_event_device *dev); static void decrementer_set_mode(enum clock_event_mode mode, 
struct clock_event_device *dev); +static void decrementer_timer_broadcast(const struct cpumask *mask); struct clock_event_device decrementer_clockevent = { .name = "decrementer", @@ -106,12 +112,24 @@ struct clock_event_device decrementer_clockevent = { .irq= 0, .set_next_event = decrementer_set_next_event, .set_mode = decrementer_set_mode, - .features = CLOCK_EVT_FEAT_ONESHOT, + .broadcast = decrementer_timer_broadcast, + .features = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT, }; EXPORT_SYMBOL(decrementer_clockevent); +struct clock_event_device broadcast_clockevent = { + .name = "broadcast", + .rating = 20
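The feature change visible in the hunk above is the crux: the decrementer now advertises CLOCK_EVT_FEAT_C3STOP (it stops in deep idle), while the pseudo broadcast_clockevent deliberately does not, which is what lets the framework accept it as a broadcast source. A toy checker, using the kernel's flag values but a deliberately simplified rule:

```c
/* Flag values mirror the kernel's clockevent feature bits; the checker
 * itself is a simplification for this sketch, not the real framework. */
#define FEAT_ONESHOT 0x000002u
#define FEAT_C3STOP  0x000008u

/* A device whose clock stops in deep idle (C3STOP) cannot be trusted
 * to wake anyone up, so it is rejected as a broadcast source. */
static int can_be_broadcast_source(unsigned int features)
{
    return !(features & FEAT_C3STOP);
}
```

Under this rule the per-cpu decrementer (ONESHOT | C3STOP) is disqualified, while the dummy broadcast device (ONESHOT only) passes, even though it is the decrementer-driven hrtimer that does the actual work.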
[PATCH V3 5/6] cpuidle/ppc: Introduce the deep idle state in which the local timers stop
Now that we have the basic infrastructure set up to make use of the broadcast framework, introduce the deep idle state in which cpus need to avail the functionality provided by this infrastructure to wake them up at their expired timer events. On ppc this deep idle state is called sleep. In this patch however, we introduce longnap, which emulates the sleep state by disabling timer interrupts. This is until such time that sleep support is made available in the kernel. Since on ppc we do not have an external device that can wake up cpus in deep idle, the local timer of one of the cpus needs to be nominated to do this job. This cpu is called the broadcast cpu/bc_cpu. Only if the bc_cpu is nominated will the remaining cpus be allowed to enter deep idle state after notifying the broadcast framework about their next timer event. The bc_cpu is not allowed to enter deep idle state. The first cpu that enters longnap is made the bc_cpu. It queues an hrtimer onto itself which expires after a broadcast period. The job of this hrtimer is to call into the broadcast framework[1] using the pseudo clock device that we have initialized, in which the cpus whose wakeup times have expired are sent an ipi. On each expiry of the hrtimer, it is programmed to the earlier of the next pending timer event of the cpus in deep idle and the broadcast period, so as to not miss any wakeups. The broadcast period is nothing but the max duration until which the bc_cpu need not concern itself with checking for expired timer events on cpus in deep idle. The broadcast period is set to a jiffy in this patch for debug purposes. Ideally it needn't be smaller than the target_residency of the deep idle state. But having a dedicated bc_cpu would mean overloading just one cpu with the broadcast work, which could hinder its performance apart from leading to thermal imbalance on the chip. Therefore unassign the bc_cpu when there are no more cpus in deep idle to be woken up.
The bc_cpu is left unassigned until such a time that a cpu enters longnap to be nominated as the bc_cpu and the above cycle repeats. Protect the region of nomination,de-nomination and check for existence of broadcast cpu with a lock to ensure synchronization between them. [1] tick_handle_oneshot_broadcast() or tick_handle_periodic_broadcast(). Signed-off-by: Preeti U Murthy --- arch/powerpc/include/asm/time.h |1 arch/powerpc/kernel/time.c |2 drivers/cpuidle/cpuidle-ibm-power.c | 150 +++ 3 files changed, 152 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index 264dc96..38341fa 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -25,6 +25,7 @@ extern unsigned long tb_ticks_per_usec; extern unsigned long tb_ticks_per_sec; extern struct clock_event_device decrementer_clockevent; extern struct clock_event_device broadcast_clockevent; +extern struct clock_event_device bc_timer; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index bda78bb..44a76de 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -129,7 +129,7 @@ EXPORT_SYMBOL(broadcast_clockevent); DEFINE_PER_CPU(u64, decrementers_next_tb); static DEFINE_PER_CPU(struct clock_event_device, decrementers); -static struct clock_event_device bc_timer; +struct clock_event_device bc_timer; #define XSEC_PER_SEC (1024*1024) diff --git a/drivers/cpuidle/cpuidle-ibm-power.c b/drivers/cpuidle/cpuidle-ibm-power.c index f8905c3..ae47a0a 100644 --- a/drivers/cpuidle/cpuidle-ibm-power.c +++ b/drivers/cpuidle/cpuidle-ibm-power.c @@ -12,12 +12,19 @@ #include #include #include +#include +#include +#include +#include +#include +#include #include #include #include #include #include +#include #include struct cpuidle_driver power_idle_driver = { @@ -28,6 +35,26 @@ struct cpuidle_driver power_idle_driver = { static int max_idle_state; 
static struct cpuidle_state *cpuidle_state_table; +static int bc_cpu = -1; +static struct hrtimer *bc_hrtimer; +static int bc_hrtimer_initialized = 0; + +/* + * Bits to indicate if a cpu can enter deep idle where local timer gets + * switched off. + * BROADCAST_CPU_PRESENT : Enter deep idle since bc_cpu is assigned + * BROADCAST_CPU_SELF : Do not enter deep idle since you are bc_cpu + * BROADCAST_CPU_ABSENT : Do not enter deep idle since there is no bc_cpu, + *hence nominate yourself as bc_cpu + * BROADCAST_CPU_ERROR : Do not enter deep idle since there is no bc_cpu + *and the broadcast hrtimer could not be initialized. + */ +enum broadcast_cpu_status { + BROADCAST_CPU_PRESENT, + BROADCAST_CPU_SELF, + BROADCAST_CPU_ERROR, +}; + static inline void idle_loop_prolog(unsigned long *in_purr) { *in_purr = mfspr
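The nomination logic described above (the first cpu into longnap becomes the bc_cpu; the bc_cpu itself may not enter deep idle; others may only if a bc_cpu exists) can be sketched as a small decision function. The real code does this under a lock and also handles the hrtimer-initialization error case; the names here follow the patch's enum but the function itself is hypothetical:

```c
/* Illustrative subset of the patch's broadcast_cpu_status values. */
enum bc_status { BC_PRESENT, BC_SELF, BC_ABSENT };

/* Decision a cpu entering longnap would make. bc_cpu is -1 when no
 * broadcast cpu is assigned (the real code guards this with a lock). */
static enum bc_status check_and_nominate(int *bc_cpu, int cpu)
{
    if (*bc_cpu == -1) {
        *bc_cpu = cpu;      /* nominate self; stay out of deep idle */
        return BC_ABSENT;
    }
    if (*bc_cpu == cpu)
        return BC_SELF;     /* the bc_cpu itself may not sleep deeply */
    return BC_PRESENT;      /* safe to enter deep idle */
}
```

Only the BC_PRESENT outcome permits entering longnap after registering the next timer event with the broadcast framework; the other two fall back to a shallower idle state.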
[PATCH V3 6/6] cpuidle/ppc: Nominate new broadcast cpu on hotplug of the old
On hotplug of the broadcast cpu, cancel the hrtimer queued to do broadcast and nominate a new broadcast cpu to be the first cpu in the broadcast mask which includes all the cpus that have notified the broadcast framework about entering deep idle state. Since the new broadcast cpu is one of the cpus in deep idle, send an ipi to wake it up to continue the duty of broadcast. The new broadcast cpu needs to find out if it woke up to resume broadcast. If so it needs to restart the broadcast hrtimer on itself. Its possible that the old broadcast cpu was hotplugged out when the broadcast hrtimer was about to fire on it. Therefore the newly nominated broadcast cpu should set the broadcast hrtimer on itself to expire immediately so as to not miss wakeups under such scenarios. Signed-off-by: Preeti U Murthy --- arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/time.c |1 + drivers/cpuidle/cpuidle-ibm-power.c | 22 ++ 3 files changed, 24 insertions(+) diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index 38341fa..3bc0205 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -31,6 +31,7 @@ struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); extern void GregorianDay(struct rtc_time *tm); extern void decrementer_timer_interrupt(void); +extern void broadcast_irq_entry(void); extern void generic_calibrate_decr(void); diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 44a76de..0ac2e11 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -853,6 +853,7 @@ void decrementer_timer_interrupt(void) { u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + broadcast_irq_entry(); *next_tb = get_tb_or_rtc(); __timer_interrupt(); } diff --git a/drivers/cpuidle/cpuidle-ibm-power.c b/drivers/cpuidle/cpuidle-ibm-power.c index ae47a0a..580ea04 100644 --- a/drivers/cpuidle/cpuidle-ibm-power.c +++ b/drivers/cpuidle/cpuidle-ibm-power.c @@ -282,6 +282,12 @@ static int 
longnap_loop(struct cpuidle_device *dev, return index; } +void broadcast_irq_entry(void) +{ + if (smp_processor_id() == bc_cpu) + hrtimer_start(bc_hrtimer, ns_to_ktime(0), HRTIMER_MODE_REL_PINNED); +} + /* * States for dedicated partition case. */ @@ -360,6 +366,7 @@ static int power_cpuidle_add_cpu_notifier(struct notifier_block *n, unsigned long action, void *hcpu) { int hotcpu = (unsigned long)hcpu; + unsigned long flags; struct cpuidle_device *dev = per_cpu(cpuidle_devices, hotcpu); @@ -372,6 +379,21 @@ static int power_cpuidle_add_cpu_notifier(struct notifier_block *n, cpuidle_resume_and_unlock(); break; + case CPU_DYING: + case CPU_DYING_FROZEN: + spin_lock_irqsave(_idle_lock, flags); + if (hotcpu == bc_cpu) { + bc_cpu = -1; + hrtimer_cancel(bc_hrtimer); + if (!cpumask_empty(tick_get_broadcast_oneshot_mask())) { + bc_cpu = cpumask_first( + tick_get_broadcast_oneshot_mask()); + arch_send_tick_broadcast(cpumask_of(bc_cpu)); + } + } + spin_unlock_irqrestore(_idle_lock, flags); + break; + case CPU_DEAD: case CPU_DEAD_FROZEN: cpuidle_pause_and_lock(); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
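The hand-off described in the changelog above — cancel the hrtimer on the dying broadcast cpu, nominate the first cpu in the broadcast mask, and arm the timer on the new cpu to expire immediately so no wakeup is missed — can be sketched as a small userspace model. All names and data structures below are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static int bc_cpu = -1;                    /* current broadcast cpu, -1 if none */
static bool timer_armed[NR_CPUS];          /* stands in for the bc hrtimer */
static bool in_broadcast_mask[NR_CPUS];    /* cpus in deep idle awaiting wakeup */

static void hotplug_broadcast_cpu(int dying)
{
	if (dying != bc_cpu)
		return;
	timer_armed[bc_cpu] = false;       /* hrtimer_cancel() analogue */
	bc_cpu = -1;
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		/* nominate the first cpu in the mask, skipping the dying one */
		if (in_broadcast_mask[cpu] && cpu != dying) {
			bc_cpu = cpu;
			/* arm to expire immediately, so a wakeup the old
			 * bc_cpu was about to deliver is not lost */
			timer_armed[cpu] = true;
			break;
		}
	}
}
```

The model only captures the nomination ordering; the real patch additionally sends an IPI to wake the nominee out of deep idle.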
Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
Hi Kamalesh, On 10/22/2013 08:05 PM, Kamalesh Babulal wrote: > * Vaidyanathan Srinivasan [2013-10-21 17:14:42]: > >> for_each_domain(cpu, sd) { >> -struct sched_group *sg = sd->groups; >> -struct sched_group_power *sgp = sg->sgp; >> -int nr_busy = atomic_read(>nr_busy_cpus); >> - >> -if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) >> -goto need_kick_unlock; >> +struct sched_domain *sd_parent = sd->parent; >> +struct sched_group *sg; >> +struct sched_group_power *sgp; >> +int nr_busy; >> + >> +if (sd_parent) { >> +sg = sd_parent->groups; >> +sgp = sg->sgp; >> +nr_busy = atomic_read(>nr_busy_cpus); >> + >> +if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) >> +goto need_kick_unlock; >> +} >> >> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight >> && (cpumask_first_and(nohz.idle_cpus_mask, > > CC'ing Suresh Siddha and Vincent Guittot > > Please correct me, If my understanding of idle balancing is wrong. > With proposed approach will not idle load balancer kick in, even if > there are busy cpus across groups or if there are 2 busy cpus which > are spread across sockets. Yes load balancing will happen on busy cpus periodically. Wrt idle balancing there are two points here. One, when a CPU is just about to go idle, it will enter idle_balance(), and trigger load balancing with itself being the destination CPU to begin with. It will load balance at every level of the sched domain that it belongs to. If it manages to pull tasks, good, else it will enter an idle state. nohz_idle_balancing is triggered by a busy cpu at every tick if it has more than one task in its runqueue or if it belongs to a group that shares the package resources and has more than one cpu busy. By "nohz_idle_balance triggered", it means the busy cpu will send an ipi to the ilb_cpu to do load balancing on the behalf of the idle cpus in the nohz mask. 
So to answer your question wrt this patch: if there is one busy cpu with, say, 2 tasks in one socket and another busy cpu with 1 task on another socket, the former busy cpu can kick nohz_idle_balance since it has more than one task in its runqueue. An idle cpu in either socket could be woken up to balance tasks with it. The usual idle load balancer that runs on a CPU about to become idle could pull from either cpu, depending on whichever is busier, as it begins to load balance across all levels of the sched domain hierarchy that it belongs to.

>
> Consider 2 socket machine with 4 processors each (MC and NUMA domains).
> If the machine is partial loaded such that cpus 0,4,5,6,7 are busy, then too
> nohz balancing is triggered because with this approach
> (NUMA)->groups->sgp->nr_busy_cpus is taken in account for nohz kick, while
> iterating over MC domain.

For the example that you mention, you will have a CPU domain and a NUMA domain. When the sockets are NUMA nodes, each socket will belong to a CPU domain. If the sockets are non-NUMA nodes, then the domain encompassing both the nodes will be a CPU domain, possibly with each socket being an MC domain.

>
> Isn't idle load balancer not suppose kick in, even in the case of two busy
> cpu's in a dual-core single socket system

nohz_idle_balancing is a special case. It is triggered only when the conditions mentioned in nohz_kick_needed() are true, whereas a CPU just about to go idle will trigger load balancing without any pre-conditions. In a single socket machine, there will be a CPU domain encompassing the socket, and the MC domain will encompass a core. The nohz idle load balancer will kick in if both the threads in the core have tasks running on them. This is fair enough, because the threads share the resources of the core.

Regards
Preeti U Murthy

> Thanks,
> Kamalesh.
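The trigger conditions for nohz idle balancing discussed above — a busy cpu with more than one runnable task, or a cpu in a package-resource-sharing domain that has more than one busy cpu — can be sketched as a small standalone model. The struct and function names below are illustrative, not the kernel's; the real nohz_kick_needed() also handles asymmetric packing, which this omits:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified userspace model of the nohz kick decision. */
struct cpu_model {
	int nr_running;          /* tasks on this CPU's runqueue */
	bool shares_pkg;         /* in a domain with SD_SHARE_PKG_RESOURCES-like flag */
	int domain_busy_cpus;    /* busy CPUs in that domain */
};

static bool nohz_kick_needed_model(const struct cpu_model *c)
{
	/* more than one task queued here: an idle cpu could take one */
	if (c->nr_running > 1)
		return true;
	/* sharing package resources with another busy cpu: spreading helps */
	if (c->shares_pkg && c->domain_busy_cpus > 1)
		return true;
	return false;
}
```

When the model returns true, the kernel sends an IPI to the ilb cpu to balance on behalf of the idle cpus in the nohz mask.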
Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
Hi Peter, On 10/23/2013 03:41 AM, Peter Zijlstra wrote: > On Mon, Oct 21, 2013 at 05:14:42PM +0530, Vaidyanathan Srinivasan wrote: >> kernel/sched/fair.c | 19 +-- >> 1 file changed, 13 insertions(+), 6 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 7c70201..12f0eab 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, >> int cpu) >> >> rcu_read_lock(); >> for_each_domain(cpu, sd) { >> +struct sched_domain *sd_parent = sd->parent; >> +struct sched_group *sg; >> +struct sched_group_power *sgp; >> +int nr_busy; >> + >> +if (sd_parent) { >> +sg = sd_parent->groups; >> +sgp = sg->sgp; >> +nr_busy = atomic_read(>nr_busy_cpus); >> + >> +if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) >> +goto need_kick_unlock; >> +} >> >> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight >> && (cpumask_first_and(nohz.idle_cpus_mask, >> > > Almost I'd say; what happens on !sd_parent && SD_ASYM_PACKING ? You are right, sorry about this. The idea was to correct the nr_busy computation before the patch that would remove its usage in the second patch. But that would mean the condition nr_busy != sg->group_weight would be invalid with this patch. The second patch needs to go first to avoid this confusion. > > Also, this made me look at the nr_busy stuff again, and somehow that > entire thing makes me a little sad. > > Can't we do something like the below and cut that nr_busy sd iteration > short? We can surely cut the nr_busy sd iteration but not like what is done with this patch. You stop the nr_busy computation at the sched domain that has the flag SD_SHARE_PKG_RESOURCES set. But nohz_kick_needed() would want to know the nr_busy for one level above this. Consider a core. Assume it is the highest domain with this flag set. The nr_busy of its groups, which are logical threads are set to 1/0 each. 
But nohz_kick_needed() would like to know the sum of the nr_busy parameters of all the groups, i.e. of the threads in a core, before it decides if it can kick nohz_idle balancing. The individual groups' nr_busy values are of no relevance here.

That's why the above patch tries to get sd->parent->groups->sgp->nr_busy_cpus. This translates rightly to the core's busy cpus in this example. But the below patch stops before updating this parameter at the sd->parent level, where sd is the highest level sched domain with the SD_SHARE_PKG_RESOURCES flag set.

We can get around all this confusion if we move the nr_busy parameter into the sched_domain structure rather than the sched_group_power structure. The only place nr_busy is used, nohz_kick_needed(), queries it to know the total number of busy cpus at a sched domain level which has SD_SHARE_PKG_RESOURCES set, not at a sched group level.

So why not move nr_busy to struct sched_domain and have the below patch just update this parameter for one sched domain, sd_busy? This would avoid iterating through all the levels of sched domains and should resolve the scalability issue. We also would not need to reach sd->parent to get the nr_busy parameter for the sake of nohz_kick_needed(). What do you think?
Regards Preeti U Murthy > > This nohz stuff really needs to be re-thought and made more scalable -- > its a royal pain :/ > > > kernel/sched/core.c | 4 > kernel/sched/fair.c | 21 +++-- > kernel/sched/sched.h | 5 ++--- > 3 files changed, 21 insertions(+), 9 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index c06b8d3..89db8dc 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc); > DEFINE_PER_CPU(int, sd_llc_size); > DEFINE_PER_CPU(int, sd_llc_id); > DEFINE_PER_CPU(struct sched_domain *, sd_numa); > +DEFINE_PER_CPU(struct sched_domain *, sd_busy); > > static void update_top_cache_domain(int cpu) > { > @@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu) > > sd = lowest_flag_domain(cpu, SD_NUMA); > rcu_assign_pointer(per_cpu(sd_numa, cpu), sd); > + > + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING); > + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd); > } > >
Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
On 10/23/2013 09:30 AM, Preeti U Murthy wrote: > Hi Peter, > > On 10/23/2013 03:41 AM, Peter Zijlstra wrote: >> On Mon, Oct 21, 2013 at 05:14:42PM +0530, Vaidyanathan Srinivasan wrote: >>> kernel/sched/fair.c | 19 +-- >>> 1 file changed, 13 insertions(+), 6 deletions(-) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index 7c70201..12f0eab 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, >>> int cpu) >>> >>> rcu_read_lock(); >>> for_each_domain(cpu, sd) { >>> + struct sched_domain *sd_parent = sd->parent; >>> + struct sched_group *sg; >>> + struct sched_group_power *sgp; >>> + int nr_busy; >>> + >>> + if (sd_parent) { >>> + sg = sd_parent->groups; >>> + sgp = sg->sgp; >>> + nr_busy = atomic_read(>nr_busy_cpus); >>> + >>> + if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) >>> + goto need_kick_unlock; >>> + } >>> >>> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight >>> && (cpumask_first_and(nohz.idle_cpus_mask, >>> >> >> Almost I'd say; what happens on !sd_parent && SD_ASYM_PACKING ? > > You are right, sorry about this. The idea was to correct the nr_busy > computation before the patch that would remove its usage in the second > patch. But that would mean the condition nr_busy != sg->group_weight > would be invalid with this patch. The second patch needs to go first to > avoid this confusion. > >> >> Also, this made me look at the nr_busy stuff again, and somehow that >> entire thing makes me a little sad. >> >> Can't we do something like the below and cut that nr_busy sd iteration >> short? > > We can surely cut the nr_busy sd iteration but not like what is done > with this patch. You stop the nr_busy computation at the sched domain > that has the flag SD_SHARE_PKG_RESOURCES set. But nohz_kick_needed() > would want to know the nr_busy for one level above this. >Consider a core. Assume it is the highest domain with this flag set. 
> The nr_busy of its groups, which are logical threads are set to 1/0
> each. But nohz_kick_needed() would like to know the sum of the nr_busy
> parameter of all the groups, i.e. the threads in a core before it
> decides if it can kick nohz_idle balancing. The information about the
> individual group's nr_busy is of no relevance here.
>
> Thats why the above patch tries to get the
> sd->parent->groups->sgp->nr_busy_cpus. This will translate rightly to
> the core's busy cpus in this example. But the below patch stops before
> updating this parameter at the sd->parent level, where sd is the highest
> level sched domain with the SD_SHARE_PKG_RESOURCES flag set.
>
> But we can get around all this confusion if we can move the nr_busy
> parameter to be included in the sched_domain structure rather than the
> sched_groups_power structure. Anyway the only place where nr_busy is
> used, that is at nohz_kick_needed(), is done to know the total number of
> busy cpus at a sched domain level which has the SD_SHARE_PKG_RESOURCES
> set and not at a sched group level.
>
> So why not move nr_busy to struct sched_domain and having the below
> patch which just updates this parameter for the sched domain, sd_busy ?

Oh this can't be done :( Domain structures are per cpu!

Regards
Preeti U Murthy
Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
Hi Peter,

On 10/23/2013 03:41 AM, Peter Zijlstra wrote:
> This nohz stuff really needs to be re-thought and made more scalable --
> its a royal pain :/

Why not do something like the below instead? It does the following.

1. It introduces sd_busy just like your suggested patch, except that it points to the parent of the highest level sched domain which has SD_SHARE_PKG_RESOURCES set, and initializes it in update_top_cache_domain(). This is the sched domain that is relevant in nohz_kick_needed().

2. set_cpu_sd_state_busy(), set_cpu_sd_state_idle() and nohz_kick_needed() query and update *only* this sched domain (sd_busy) for nr_busy_cpus. They are the only users of this parameter.

3. While we are at it, we might as well change the nohz_idle parameter to be updated at the sd_busy domain level alone and not the base domain level of a CPU. This will unify the concept of busy cpus at just one level of sched domain. There is no need to iterate through all levels of sched domains of a cpu to update nr_busy_cpus, since it is irrelevant at every level except sd_busy.

4. It decouples asymmetric load balancing from the nr_busy parameter, which PATCH 2/3 anyway does. sd_busy is therefore irrelevant for asymmetric load balancing.
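The bookkeeping change proposed above — one cached level whose counter is updated on each busy/idle transition, instead of walking every domain level — can be sketched as a toy userspace model. The names and the single flat counter are illustrative; the kernel version keeps one such counter per cached sd_busy domain and guards it with RCU:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static int nr_busy_cpus;            /* counter at the cached sd_busy level */
static bool cpu_is_busy[NR_CPUS];   /* per-CPU state, like sd->nohz_idle */

static void set_cpu_state_busy(int cpu)
{
	if (cpu_is_busy[cpu])
		return;                  /* already accounted, as the nohz_idle guard does */
	cpu_is_busy[cpu] = true;
	nr_busy_cpus++;                  /* one update, not one per domain level */
}

static void set_cpu_state_idle(int cpu)
{
	if (!cpu_is_busy[cpu])
		return;
	cpu_is_busy[cpu] = false;
	nr_busy_cpus--;
}
```

nohz_kick_needed() then needs only a single read of the counter rather than a walk over the domain hierarchy.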
Regards Preeti U Murthy START_PATCH--- sched: Fix nohz_kick_needed() --- kernel/sched/core.c |4 kernel/sched/fair.c | 40 ++-- kernel/sched/sched.h |1 + 3 files changed, 27 insertions(+), 18 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c06b8d3..c1dd11c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc); DEFINE_PER_CPU(int, sd_llc_size); DEFINE_PER_CPU(int, sd_llc_id); DEFINE_PER_CPU(struct sched_domain *, sd_numa); +DEFINE_PER_CPU(struct sched_domain *, sd_busy); static void update_top_cache_domain(int cpu) { @@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu) sd = lowest_flag_domain(cpu, SD_NUMA); rcu_assign_pointer(per_cpu(sd_numa, cpu), sd); + + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES)->parent; + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd); } /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 813dd61..71e6f14 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6515,16 +6515,16 @@ static inline void nohz_balance_exit_idle(int cpu) static inline void set_cpu_sd_state_busy(void) { struct sched_domain *sd; + int cpu = smp_processor_id(); rcu_read_lock(); - sd = rcu_dereference_check_sched_domain(this_rq()->sd); + sd = per_cpu(sd_busy, cpu); if (!sd || !sd->nohz_idle) goto unlock; sd->nohz_idle = 0; - for (; sd; sd = sd->parent) - atomic_inc(>groups->sgp->nr_busy_cpus); + atomic_inc(>groups->sgp->nr_busy_cpus); unlock: rcu_read_unlock(); } @@ -6532,16 +6532,16 @@ unlock: void set_cpu_sd_state_idle(void) { struct sched_domain *sd; + int cpu = smp_processor_id(); rcu_read_lock(); - sd = rcu_dereference_check_sched_domain(this_rq()->sd); + sd = per_cpu(sd_busy, cpu); if (!sd || sd->nohz_idle) goto unlock; sd->nohz_idle = 1; - for (; sd; sd = sd->parent) - atomic_dec(>groups->sgp->nr_busy_cpus); + atomic_dec(>groups->sgp->nr_busy_cpus); unlock: rcu_read_unlock(); } @@ -6748,6 +6748,9 @@ static inline int 
nohz_kick_needed(struct rq *rq, int cpu) { unsigned long now = jiffies; struct sched_domain *sd; + struct sched_group *sg; + struct sched_group_power *sgp; + int nr_busy; if (unlikely(idle_cpu(cpu))) return 0; @@ -6773,22 +6776,23 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu) goto need_kick; rcu_read_lock(); - for_each_domain(cpu, sd) { - struct sched_group *sg = sd->groups; - struct sched_group_power *sgp = sg->sgp; - int nr_busy = atomic_read(>nr_busy_cpus); + sd = per_cpu(sd_busy, cpu); - if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) - goto need_kick_unlock; + if (sd) { + sg = sd->groups; + sgp = sg->sgp; + nr_busy = atomic_read(>nr_busy_cpus); - if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight - && (cpumask_first_and(nohz.idle_cpus_mask, - sched_domain_span(sd)) < cpu)) + if (nr_busy > 1) goto need_kick_unlock; - - if (!
Re: [PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources
Hi Peter, On 10/23/2013 03:53 AM, Peter Zijlstra wrote: > On Mon, Oct 21, 2013 at 05:15:02PM +0530, Vaidyanathan Srinivasan wrote: >> kernel/sched/fair.c | 18 ++ >> 1 file changed, 18 insertions(+) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 828ed97..bbcd96b 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq >> *this_rq, >> { >> int ld_moved, cur_ld_moved, active_balance = 0; >> struct sched_group *group; >> +struct sched_domain *child; >> +int share_pkg_res = 0; >> struct rq *busiest; >> unsigned long flags; >> struct cpumask *cpus = __get_cpu_var(load_balance_mask); >> @@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq >> *this_rq, >> >> schedstat_inc(sd, lb_count[idle]); >> >> +child = sd->child; >> +if (child && child->flags & SD_SHARE_PKG_RESOURCES) >> +share_pkg_res = 1; >> + >> redo: >> if (!should_we_balance()) { >> *continue_balancing = 0; >> @@ -5202,6 +5208,7 @@ redo: >> goto out_balanced; >> } >> >> +redo_grp: >> busiest = find_busiest_queue(, group); >> if (!busiest) { >> schedstat_inc(sd, lb_nobusyq[idle]); >> @@ -5292,6 +5299,11 @@ more_balance: >> if (!cpumask_empty(cpus)) { >> env.loop = 0; >> env.loop_break = sched_nr_migrate_break; >> +if (share_pkg_res && >> +cpumask_intersects(cpus, >> +to_cpumask(group->cpumask))) > > sched_group_cpus() > >> +goto redo_grp; >> + >> goto redo; >> } >> goto out_balanced; >> @@ -5318,9 +5330,15 @@ more_balance: >> */ >> if (!cpumask_test_cpu(this_cpu, >> tsk_cpus_allowed(busiest->curr))) { >> +cpumask_clear_cpu(cpu_of(busiest), cpus); >> raw_spin_unlock_irqrestore(>lock, >> flags); >> env.flags |= LBF_ALL_PINNED; >> +if (share_pkg_res && >> + cpumask_intersects(cpus, >> +to_cpumask(group->cpumask))) >> +goto redo_grp; >> + >> goto out_one_pinned; >> } > > Man this retry logic is getting annoying.. isn't there anything saner we > can do? Let me give this a thought and get back. 
Regards
Preeti U Murthy
Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
Hi Vincent,

I have addressed your comments and below is the fresh patch. This patch applies on top of PATCH 2/3 posted in this thread.

Regards
Preeti U Murthy

sched: Remove unnecessary iterations over sched domains to update/query nr_busy_cpus

From: Preeti U Murthy

The nr_busy_cpus parameter is used by nohz_kick_needed() to find out the number of busy cpus in a sched domain which has the SD_SHARE_PKG_RESOURCES flag set. Therefore, instead of updating nr_busy_cpus at every level of sched domain, where it is irrelevant, we can update this parameter only at the parent domain of the sd which has this flag set. Introduce a per-cpu parameter sd_busy which represents this parent domain.

In nohz_kick_needed() we directly query the nr_busy_cpus parameter associated with the groups of sd_busy. By associating sd_busy with the parent of the highest domain which has the SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains which could have this flag set, and trigger nohz_idle_balancing if any of those levels have more than one busy cpu.

sd_busy is irrelevant for asymmetric load balancing.

While we are at it, we might as well change the nohz_idle parameter to be updated at the sd_busy domain level alone, and not the base domain level of a CPU. This will unify the concept of busy cpus at just the one level of sched domain where it is currently used.
Signed-off-by: Preeti U Murthy --- kernel/sched/core.c |5 + kernel/sched/fair.c | 38 -- kernel/sched/sched.h |1 + 3 files changed, 26 insertions(+), 18 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c06b8d3..c540392 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc); DEFINE_PER_CPU(int, sd_llc_size); DEFINE_PER_CPU(int, sd_llc_id); DEFINE_PER_CPU(struct sched_domain *, sd_numa); +DEFINE_PER_CPU(struct sched_domain *, sd_busy); static void update_top_cache_domain(int cpu) { @@ -5290,6 +5291,10 @@ static void update_top_cache_domain(int cpu) sd = lowest_flag_domain(cpu, SD_NUMA); rcu_assign_pointer(per_cpu(sd_numa, cpu), sd); + + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); + if (sd) + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd->parent); } /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e9c9549..f66cfd9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6515,16 +6515,16 @@ static inline void nohz_balance_exit_idle(int cpu) static inline void set_cpu_sd_state_busy(void) { struct sched_domain *sd; + int cpu = smp_processor_id(); rcu_read_lock(); - sd = rcu_dereference_check_sched_domain(this_rq()->sd); + sd = rcu_dereference(per_cpu(sd_busy, cpu)); if (!sd || !sd->nohz_idle) goto unlock; sd->nohz_idle = 0; - for (; sd; sd = sd->parent) - atomic_inc(>groups->sgp->nr_busy_cpus); + atomic_inc(>groups->sgp->nr_busy_cpus); unlock: rcu_read_unlock(); } @@ -6532,16 +6532,16 @@ unlock: void set_cpu_sd_state_idle(void) { struct sched_domain *sd; + int cpu = smp_processor_id(); rcu_read_lock(); - sd = rcu_dereference_check_sched_domain(this_rq()->sd); + sd = rcu_dereference(per_cpu(sd_busy, cpu)); if (!sd || sd->nohz_idle) goto unlock; sd->nohz_idle = 1; - for (; sd; sd = sd->parent) - atomic_dec(>groups->sgp->nr_busy_cpus); + atomic_dec(>groups->sgp->nr_busy_cpus); unlock: rcu_read_unlock(); } @@ -6748,6 +6748,8 @@ static inline int 
nohz_kick_needed(struct rq *rq, int cpu) { unsigned long now = jiffies; struct sched_domain *sd; + struct sched_group_power *sgp; + int nr_busy; if (unlikely(idle_cpu(cpu))) return 0; @@ -6773,22 +6775,22 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu) goto need_kick; rcu_read_lock(); - for_each_domain(cpu, sd) { - struct sched_group *sg = sd->groups; - struct sched_group_power *sgp = sg->sgp; - int nr_busy = atomic_read(>nr_busy_cpus); + sd = rcu_dereference(per_cpu(sd_busy, cpu)); - if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) - goto need_kick_unlock; + if (sd) { + sgp = sd->groups->sgp; + nr_busy = atomic_read(>nr_busy_cpus); - if (sd->flags & SD_ASYM_PACKING - && (cpumask_first_and(nohz.idle_cpus_mask, - sched_domain_span(sd)) < cpu)) + if (nr_busy > 1) goto need_kick_unlock; - - if (!(sd->flags & (SD_SHARE_PKG_
Re: [PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources
Hi Peter, On 10/23/2013 03:53 AM, Peter Zijlstra wrote: > On Mon, Oct 21, 2013 at 05:15:02PM +0530, Vaidyanathan Srinivasan wrote: >> kernel/sched/fair.c | 18 ++ >> 1 file changed, 18 insertions(+) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 828ed97..bbcd96b 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq >> *this_rq, >> { >> int ld_moved, cur_ld_moved, active_balance = 0; >> struct sched_group *group; >> +struct sched_domain *child; >> +int share_pkg_res = 0; >> struct rq *busiest; >> unsigned long flags; >> struct cpumask *cpus = __get_cpu_var(load_balance_mask); >> @@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq >> *this_rq, >> >> schedstat_inc(sd, lb_count[idle]); >> >> +child = sd->child; >> +if (child && child->flags & SD_SHARE_PKG_RESOURCES) >> +share_pkg_res = 1; >> + >> redo: >> if (!should_we_balance()) { >> *continue_balancing = 0; >> @@ -5202,6 +5208,7 @@ redo: >> goto out_balanced; >> } >> >> +redo_grp: >> busiest = find_busiest_queue(, group); >> if (!busiest) { >> schedstat_inc(sd, lb_nobusyq[idle]); >> @@ -5292,6 +5299,11 @@ more_balance: >> if (!cpumask_empty(cpus)) { >> env.loop = 0; >> env.loop_break = sched_nr_migrate_break; >> +if (share_pkg_res && >> +cpumask_intersects(cpus, >> +to_cpumask(group->cpumask))) > > sched_group_cpus() > >> +goto redo_grp; >> + >> goto redo; >> } >> goto out_balanced; >> @@ -5318,9 +5330,15 @@ more_balance: >> */ >> if (!cpumask_test_cpu(this_cpu, >> tsk_cpus_allowed(busiest->curr))) { >> +cpumask_clear_cpu(cpu_of(busiest), cpus); >> raw_spin_unlock_irqrestore(>lock, >> flags); >> env.flags |= LBF_ALL_PINNED; >> +if (share_pkg_res && >> +cpumask_intersects(cpus, >> +to_cpumask(group->cpumask))) >> +goto redo_grp; >> + >> goto out_one_pinned; >> } > > Man this retry logic is getting annoying.. isn't there anything saner we > can do? 
Maybe we can do this just at the SIBLINGS level? Having the hyperthreads busy due to the scenario described in the changelog is bad for performance.

Regards
Preeti U Murthy

> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH v2] sched: Limit idle_balance()
Hi Jason,

I ran the ebizzy and kernbench benchmarks on 3.11-rc1 + your "V1 patch" on a 1-socket, 16-core powerpc machine. I thought I would let you know the results before I try your V2.

Ebizzy: 30 second runs. The table below shows the improvement in the number of records completed. I have not spent enough time on the patch to explain such a big improvement.

Number_of_threads   %improvement_with_patch
 4                  41.86%
 8                   9.8%
12                  34.77%
16                  28.37%

On kernbench there was no significant change in the observations.

I will try patch V2 and let you know the results.

Regards
Preeti U Murthy
Re: [RFC PATCH v2] sched: Limit idle_balance()
Hi Jason,

With V2 of your patch, here are the results for the ebizzy run on 3.11-rc1 + patch on a 1-socket, 16-core powerpc machine. Each ebizzy run was for 30 seconds.

Number_of_threads   %improvement_with_patch
 4                   8.63
 8                   1.29
12                   9.98
16                  20.46

Let me know if you want me to profile any of these runs for specific statistics.

Regards
Preeti U Murthy

On 07/20/2013 12:58 AM, Jason Low wrote:
> On Fri, 2013-07-19 at 16:54 +0530, Preeti U Murthy wrote:
>> Hi Json,
>>
>> I ran ebizzy and kernbench benchmarks on your 3.11-rc1 + your "V1
>> patch" on a 1 socket, 16 core powerpc machine. I thought I would let you
>> know the results before I try your V2.
>>
>> Ebizzy: 30 seconds run. The table below shows the improvement in the
>> number of records completed. I have not spent enough time on the patch
>> to explain such a big improvement.
>>
>> Number_of_threads %improvement_with_patch
>> 4 41.86%
>> 8 9.8%
>> 12 34.77%
>> 16 28.37%
>>
>> While on kernbench there was no significant change in the observation.
>>
>> I will try patch V2 and let you know the results.
>
> Great to see those improvements so far. Thank you for testing this.
>
> Jason
Re: power-efficient scheduling design
Hi, On 05/31/2013 04:22 PM, Ingo Molnar wrote: > PeterZ and me tried to point out the design requirements previously, but > it still does not appear to be clear enough to people, so let me spell it > out again, in a hopefully clearer fashion. > > The scheduler has valuable power saving information available: > > - when a CPU is busy: about how long the current task expects to run > > - when a CPU is idle: how long the current CPU expects _not_ to run > > - topology: it knows how the CPUs and caches interrelate and already >optimizes based on that > > - various high level and low level load averages and other metrics about >the recent past that show how busy a particular CPU is, how busy the >whole system is, and what the runtime properties of individual tasks is >(how often it sleeps, etc.) > > so the scheduler is in an _ideal_ position to do a judgement call about > the near future and estimate how deep an idle state a CPU core should > enter into and what frequency it should run at. I don't think the problem lies in the fact that scheduler is not making these decisions about which idle state the CPU should enter or which frequency the CPU should run at. IIUC, I think the problem lies in the part where although the *cpuidle and cpufrequency governors are co-operating with the scheduler, the scheduler is not doing the same.* Let me elaborate with respect to cpuidle subsystem. When the scheduler chooses the CPUs to run tasks on, it leaves certain other CPUs idle. The cpuidle governor then evaluates, among other things, the load average of the CPUs, before deciding to put it into an ideal idle state. With the PJT's metric, an idle CPU's load average degrades over time and cpuidle governor will perhaps decide to put such CPUs to deep idle states. But the problem surfaces when scheduler gets to choose a CPU to run new/woken up tasks on. 
It chooses the *idlest_cpu* to run the task on without considering how deep an idle state that CPU is in, if at all it is in an idle state. It would end up waking a deep sleeping CPU, which will *hinder power savings*. I think here is where we need to focus. Currently, there is no *two way co-operation between the scheduler and cpuidle/cpufreq* subsystems, which makes no sense. In the above case, for instance, the scheduler prompts the cpuidle governor to put a CPU into an idle state and then comes back to hamper that move.

> The scheduler is also at a high enough level to host a "I want maximum
> performance, power does not matter to me" user policy override switch and
> similar user policy details.
>
> No ifs and whens about that.
>
> Today the power saving landscape is fragmented and sad: we just randomly
> interface scheduler task packing changes with some idle policy (and
> cpufreq policy), which might or might not combine correctly.

I would repeat here that today we interface cpuidle/cpufreq policies with the scheduler but not the other way around. They do their bit when a cpu is busy/idle. However the scheduler does not see that somebody else is taking instructions from it and comes back to give different instructions! Therefore I think, among other things, this is one fundamental issue that we need to resolve in the steps towards better power savings through the scheduler.

Regards
Preeti U Murthy
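The two-way co-operation argued for here could be pictured with a toy model: a wake-up path that breaks ties between equally idle CPUs by preferring the one in the shallowest idle state, instead of blindly picking the "idlest" CPU. This is a hypothetical Python sketch; the names (`Cpu`, `idle_depth`, `pick_wakeup_cpu`) are illustrative assumptions, not kernel code.

```python
from dataclasses import dataclass

@dataclass
class Cpu:
    cpu_id: int
    load: int        # current runqueue load (0 = idle)
    idle_depth: int  # 0 = running/polling, higher = deeper C-state

def pick_wakeup_cpu(cpus):
    # Prefer the lowest load; among equally loaded CPUs, prefer the
    # shallowest idle state, i.e. the cheapest CPU to wake up.
    return min(cpus, key=lambda c: (c.load, c.idle_depth))

# CPU 0 and CPU 1 are both idle, but CPU 0 sleeps deeper; an idle-depth
# aware wake-up would leave it undisturbed and pick CPU 1.
cpus = [Cpu(0, 0, 3), Cpu(1, 0, 1), Cpu(2, 5, 0)]
best = pick_wakeup_cpu(cpus)
```

With a plain "idlest CPU" policy both CPU 0 and CPU 1 look equivalent; the extra tie-breaker is exactly the kind of feedback from cpuidle into the scheduler being argued for above.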
Re: power-efficient scheduling design
tion as you prefer a complete and > potentially huge patch set over incremental patch sets? > > It would be good to have even a high level agreement on the path forward > where the expectation first and foremost is to take advantage of the > schedulers ideal position to drive the power management while > simplifying the power management code. > > Thanks, > Morten > Regards Preeti U Murthy
Re: power-efficient scheduling design
uler and cpufreq/cpuidle I agree with this. This is what I have been emphasizing: if we feel that the cpufreq/cpuidle subsystems are suboptimal in terms of the information that they use to make their decisions, let us improve them. But this will not yield us any improvement if the scheduler does not have enough information. And IMHO, the next fundamental information that the scheduler needs should come from cpufreq and cpuidle. Then we should move on to supplying the scheduler with information from the power domain topology, thermal factors and user policies. This does not need a re-write of the scheduler; it needs a good interface between the scheduler and the rest of the ecosystem. This ecosystem includes the cpuidle and cpufreq subsystems, and they are already in place. Let's use them.

or (b) come up
> with a unified load-balancing/cpufreq/cpuidle implementation as per
> Ingo's request. The latter is harder but, with a good design, has
> potentially a lot more benefits.
>
> A possible implementation for (a) is to let the scheduler focus on
> performance load-balancing but control the balance ratio from a
> cpufreq governor (via things like arch_scale_freq_power() or something
> new). CPUfreq would not be concerned just with individual CPU
> load/frequency but also making a decision on how tasks are balanced
> between CPUs based on the overall load (e.g. four CPUs are enough for
> the current load, I can shut the other four off by telling the
> scheduler not to use them).
>
> As for Ingo's preferred solution (b), a proposal forward could be to
> factor the load balancing out of kernel/sched/fair.c and provide an
> abstract interface (like load_class?) for easier extending or
> different policies (e.g. small task packing).

Let me elaborate on the patches that have been posted so far on the power awareness of the scheduler. When we say *power aware scheduler* what exactly do we want it to do?
In my opinion, we want it to *avoid touching idle cpus*, so as to keep them in that state longer, and *keep more power domains idle*, so as to yield power savings with them turned off. The patches released so far are striving to do the latter. Correct me if I am wrong on this. Also feel free to point out any other expectation from the power aware scheduler if I am missing any.

If I have got Ingo's point right, the issue with them is that they are not taking a holistic approach to meet the said goal. Keeping more power domains idle (by packing tasks) would sound much better if the scheduler has taken all aspects of doing such a thing into account, like:

1. How idle are the cpus on the domain that it is packing?
2. Can they go to turbo mode? Because if they do, then we can't pack tasks. We would need certain cpus in that domain idle.
3. Are the domains in which we pack tasks power gated?
4. Will there be a significant performance drop by packing? Meaning, do the tasks share cpu resources? If they do, there will be severe contention.

The approach I suggest therefore would be to get the scheduler well in sync with the ecosystem; then the patches posted so far will achieve their goals more easily and with very few regressions, because they are well-informed decisions.

Regards
Preeti U Murthy

> Best regards.
>
> --
> Catalin
Re: power-efficient scheduling design
Hi Rafael, On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote: > On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote: >> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote: >>> On 06/07/2013 08:21 PM, Catalin Marinas wrote: >>>> I think you are missing Ingo's point. It's not about the scheduler >>>> complying with decisions made by various governors in the kernel >>>> (which may or may not have enough information) but rather the >>>> scheduler being in a better position for making such decisions. >>> >>> My mail pointed out that I disagree with this design ("the scheduler >>> being in a better position for making such decisions"). >>> I think it should be a 2 way co-operation. I have elaborated below. > > I agree with that. > >>>> Take the cpuidle example, it uses the load average of the CPUs, >>>> however this load average is currently controlled by the scheduler >>>> (load balance). Rather than using a load average that degrades over >>>> time and gradually putting the CPU into deeper sleep states, the >>>> scheduler could predict more accurately that a run-queue won't have >>>> any work over the next x ms and ask for a deeper sleep state from the >>>> beginning. >>> >>> How will the scheduler know that there will not be work in the near >>> future? How will the scheduler ask for a deeper sleep state? >>> >>> My answer to the above two questions are, the scheduler cannot know how >>> much work will come up. All it knows is the current load of the >>> runqueues and the nature of the task (thanks to the PJT's metric). It >>> can then match the task load to the cpu capacity and schedule the tasks >>> on the appropriate cpus. >> >> The scheduler can decide to load a single CPU or cluster and let the >> others idle. If the total CPU load can fit into a smaller number of CPUs >> it could as well tell cpuidle to go into deeper state from the >> beginning as it moved all the tasks elsewhere. > > So why can't it do that today? What's the problem? 
The reason that scheduler does not do it today is due to the prefer_sibling logic. The tasks within a core get distributed across cores if they are more than 1, since the cpu power of a core is not high enough to handle more than one task. However at a socket level/ MC level (cluster at a low level), there can be as many tasks as there are cores because the socket has enough CPU capacity to handle them. But the prefer_sibling logic moves tasks across socket/MC level domains even when load<=domain_capacity. I think the reason why the prefer_sibling logic was introduced, is that scheduler looks at spreading tasks across all the resources it has. It believes keeping tasks within a cluster/socket level domain would mean tasks are being throttled by having access to only the cluster/socket level resources. Which is why it spreads. The prefer_sibling logic is nothing but a flag set at domain level to communicate to the scheduler that load should be spread across the groups of this domain. In the above example across sockets/clusters. But I think it is time we take another look at the prefer_sibling logic and decide on its worthiness. > >> Regarding future work, neither cpuidle nor the scheduler know this but >> the scheduler would make a better prediction, for example by tracking >> task periodicity. > > Well, basically, two pieces of information are needed to make target idle > state selections: (1) when the CPU (core or package) is going to be used > next time and (2) how much latency for going back to the non-idle state > can be tolerated. While the scheduler knows (1) to some extent (arguably, > it generally cannot predict when hardware interrupts are going to occur), > I'm not really sure about (2). > >>> As a consequence, it leaves certain cpus idle. The load of these cpus >>> degrade. It is via this load that the scheduler asks for a deeper sleep >>> state. Right here we have scheduler talking to the cpuidle governor. 
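The prefer_sibling behaviour described here can be reduced to a few lines. The sketch below is a hedged toy model, not the actual fair.c balancing code; the function name and numbers are assumptions for illustration only.

```python
# Toy model of the SD_PREFER_SIBLING effect: with the flag set at a domain
# level, load is spread across that domain's groups even when one group's
# capacity could hold the whole load; with the flag clear, load is
# consolidated into as few groups as possible.

def groups_used(total_load, group_capacity, num_groups, prefer_sibling):
    if prefer_sibling:
        # Spread: use as many groups as there are runnable units to place.
        return min(total_load, num_groups)
    # Consolidate: fill one group before touching the next (ceiling division).
    return -(-total_load // group_capacity)

# 4 tasks of unit load on 2 sockets of capacity 8 each:
spread = groups_used(4, 8, 2, prefer_sibling=True)   # both sockets woken
packed = groups_used(4, 8, 2, prefer_sibling=False)  # one socket suffices
```

The point of the mail above is precisely the `spread` case: tasks cross the socket boundary even though `load <= domain_capacity`, keeping more power domains awake than strictly necessary.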
>> >> So we agree that the scheduler _tells_ the cpuidle governor when to go >> idle (but not how deep). > > It does indicate to cpuidle how deep it can go, however, by providing it with > the information about when the CPU is going to be used next time (from the > scheduler's perspective). > >> IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the >> cpuidle does not get enough information from the scheduler (arguably this >> could be fixed) > > OK, so what information is missing in your opinion? > >> and (2) the scheduler does
Re: power-efficient scheduling design
Hi Catalin, On 06/08/2013 04:58 PM, Catalin Marinas wrote: > On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote: >> On 06/07/2013 08:21 PM, Catalin Marinas wrote: >>> I think you are missing Ingo's point. It's not about the scheduler >>> complying with decisions made by various governors in the kernel >>> (which may or may not have enough information) but rather the >>> scheduler being in a better position for making such decisions. >> >> My mail pointed out that I disagree with this design ("the scheduler >> being in a better position for making such decisions"). >> I think it should be a 2 way co-operation. I have elaborated below. >> >>> Take the cpuidle example, it uses the load average of the CPUs, >>> however this load average is currently controlled by the scheduler >>> (load balance). Rather than using a load average that degrades over >>> time and gradually putting the CPU into deeper sleep states, the >>> scheduler could predict more accurately that a run-queue won't have >>> any work over the next x ms and ask for a deeper sleep state from the >>> beginning. >> >> How will the scheduler know that there will not be work in the near >> future? How will the scheduler ask for a deeper sleep state? >> >> My answer to the above two questions are, the scheduler cannot know how >> much work will come up. All it knows is the current load of the >> runqueues and the nature of the task (thanks to the PJT's metric). It >> can then match the task load to the cpu capacity and schedule the tasks >> on the appropriate cpus. > > The scheduler can decide to load a single CPU or cluster and let the > others idle. If the total CPU load can fit into a smaller number of CPUs > it could as well tell cpuidle to go into deeper state from the > beginning as it moved all the tasks elsewhere. This currently does not happen. I have elaborated in the response to Rafael's mail. Sorry I should have put you on the 'To' list, missed that. 
Do take a look at that mail, since many of the replies to your current mail are in it.

What do you mean "from the beginning"? As soon as those cpus go idle, cpuidle will kick in anyway. If you are saying that the scheduler should tell cpuidle that "this cpu can go into deep sleep state x, since I am not going to use it for the next y seconds", that is not possible. Firstly, because the scheduler can't "predict" this 'y' parameter. Secondly, because hardware could change the idle state availability or details dynamically, as Rafael pointed out, and hence this 'x' is best not told by the scheduler, but queried by the cpuidle governor itself.

> Regarding future work, neither cpuidle nor the scheduler know this but
> the scheduler would make a better prediction, for example by tracking
> task periodicity.

This prediction that you mention is already exported by the scheduler to cpuidle. load_avg does precisely that: it tracks history and predicts the future based on it. load_avg, tracked periodically by the scheduler, is already seen by the cpuidle governor.

>> As a consequence, it leaves certain cpus idle. The load of these cpus
>> degrade. It is via this load that the scheduler asks for a deeper sleep
>> state. Right here we have scheduler talking to the cpuidle governor.
>
> So we agree that the scheduler _tells_ the cpuidle governor when to go
> idle (but not how deep). IOW, the scheduler drives the cpuidle
> decisions. Two problems: (1) the cpuidle does not get enough information
> from the scheduler (arguably this could be fixed) and (2) the scheduler
> does not have any information about the idle states (power gating etc.)
> to make any informed decision on which/when CPUs should go idle.
>
> As you said, it is a non-optimal one-way communication but the solution
> is not feedback loop from cpuidle into scheduler. It's like the
It's like the
> scheduler managed by chance to get the CPU into a deeper sleep state and
> now you'd like the scheduler to get feedback from cpuidle and not
> disturb that CPU anymore. That's the closed loop I disagree with. Could
> the scheduler not make this informed decision before - it has this total
> load, let's get this CPU into deeper sleep state?

Let's say the scheduler does make an informed decision beforehand: let's get this cpu into an idle state. Then what? Say the load begins to increase on the system. The scheduler has to wake up cpus. Which cpus are best to wake up? Who tells the scheduler this? One, the power gating information, which is yet to be exported to the scheduler, can tell it this to an extent. As far as I can see, the next one to guide the scheduler here is cpuidle, isn't it?

>> I don't see what the problem is
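The load_avg mechanism referred to above (PJT's per-entity load tracking) can be sketched in a simplified form: each ~1ms period contributes 1 if the entity was runnable, and history decays geometrically with a factor y chosen so that y**32 == 0.5. This is a floating-point model for illustration, not the kernel's fixed-point implementation.

```python
# Simplified per-entity load tracking: runnable_avg_sum / runnable_avg_period
# is the utilization estimate that cpuidle consults via load_avg.

Y = 0.5 ** (1 / 32)  # decay factor per 1ms period; half-life of 32 periods

def update_avg(avg_sum, avg_period, runnable, periods=1):
    for _ in range(periods):
        avg_sum = avg_sum * Y + (1 if runnable else 0)
        avg_period = avg_period * Y + 1
    return avg_sum, avg_period

# A task runnable for 100ms, then asleep for 100ms:
s, p = update_avg(0.0, 0.0, runnable=True, periods=100)
busy_ratio = s / p            # stays at 1 while continuously runnable
s, p = update_avg(s, p, runnable=False, periods=100)
idle_ratio = s / p            # decays toward 0 while sleeping
```

The decaying `idle_ratio` is exactly the "load average that degrades over time" the thread keeps referring to: the longer a CPU's entities stay idle, the stronger the hint that a deep idle state is safe.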
Re: power-efficient scheduling design
Hi David,

On 06/07/2013 11:06 PM, David Lang wrote:
> On Fri, 7 Jun 2013, Preeti U Murthy wrote:
>> Hi Catalin,
>>
>> On 06/07/2013 08:21 PM, Catalin Marinas wrote:
>>> Take the cpuidle example, it uses the load average of the CPUs,
>>> however this load average is currently controlled by the scheduler
>>> (load balance). Rather than using a load average that degrades over
>>> time and gradually putting the CPU into deeper sleep states, the
>>> scheduler could predict more accurately that a run-queue won't have
>>> any work over the next x ms and ask for a deeper sleep state from the
>>> beginning.
>>
>> How will the scheduler know that there will not be work in the near
>> future? How will the scheduler ask for a deeper sleep state?
>>
>> My answer to the above two questions are, the scheduler cannot know how
>> much work will come up. All it knows is the current load of the
>> runqueues and the nature of the task (thanks to the PJT's metric). It
>> can then match the task load to the cpu capacity and schedule the tasks
>> on the appropriate cpus.
>
> how will the cpuidle governor know what will come up in the future?
>
> the scheduler knows more than the current load on the runqueues, it
> tracks some information about the past behavior of the process that it
> uses for its decisions. This is information that cpuidle doesn't have.

This is incorrect. The scheduler knows the possible future load on a cpu due to past behavior, that's right, and so does cpuidle today. It queries the load average for the predicted idle time and compares this with the exit latencies of the idle states.

>> I don't see what the problem is with the cpuidle governor waiting for
>> the load to degrade before putting that cpu to sleep. In my opinion,
>> putting a cpu to deeper sleep states should happen gradually.
>
> remember that it takes power and time to wake up a cpu to put it in a
> deeper sleep state.

Correct. I apologise: saying that it happens gradually is not entirely right.
The cpuidle governor can decide on the state the cpu is best put into directly, without going through the shallow idle states. It also takes care to rectify any incorrect prediction. So there is no suboptimal exit-enter-exit-enter behaviour.

>>> Of course, you could export more scheduler information to cpuidle,
>>> various hooks (task wakeup etc.) but then we have another framework,
>>> cpufreq. It also decides the CPU parameters (frequency) based on the
>>> load controlled by the scheduler. Can cpufreq decide whether it's
>>> better to keep the CPU at higher frequency so that it gets to idle
>>> quicker and therefore deeper sleep states? I don't think it has enough
>>> information because there are at least three deciding factors
>>> (cpufreq, cpuidle and scheduler's load balancing) which are not
>>> unified.
>>
>> Why not? When the cpu load is high, the cpu frequency governor knows it
>> has to boost the frequency of that CPU. The task gets over quickly, the
>> CPU goes idle. Then the cpuidle governor kicks in to put the CPU to a
>> deeper sleep state gradually.
>>
>> Meanwhile the scheduler should ensure that the tasks are retained on
>> that CPU, whose frequency is boosted, and should not load balance it, so
>> that they can get over quickly. This I think is what is missing. Again
>> this comes down to the scheduler taking feedback from the CPU frequency
>> governors, which is not currently happening.
>
> how should the scheduler know that the cpufreq governor decided to boost
> the speed of one CPU to handle an important process as opposed to
> handling multiple smaller processes?

This has been elaborated in my response to Rafael's mail. The scheduler decides to call the cpu frequency governor when it sees fit. Then the cpu frequency governor boosts the frequency of that cpu. cpu_power will now match the task load, so the scheduler will not move the task away from that cpu, since the load does not exceed the cpu capacity. This is how the scheduler knows.
> the communication between the two is starting to sound really messy

Not really. More is elaborated in responses to Catalin and Rafael's mails.

Regards
Preeti U Murthy

> David Lang
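The "queries the load average for predicted idle time and compares this with exit latencies" step mentioned above is the heart of a cpuidle governor. A minimal sketch, with a made-up state table (real platforms export their own states with their own latencies):

```python
# Pick the deepest idle state whose exit latency fits the caller's latency
# limit and whose target residency fits the predicted idle time.
# (name, exit_latency_us, target_residency_us) -- illustrative values only.
STATES = [
    ("snooze", 0,    1),
    ("nap",    10,   100),
    ("deep",   1000, 10000),
]

def select_state(predicted_idle_us, latency_limit_us):
    chosen = STATES[0][0]  # shallowest state is always safe
    for name, exit_lat, residency in STATES:
        if exit_lat <= latency_limit_us and residency <= predicted_idle_us:
            chosen = name  # states are ordered, so the deepest fit wins
    return chosen

# Long predicted idle -> deep state; short predicted idle -> stay shallow.
long_idle = select_state(50_000, 10**6)
short_idle = select_state(50, 10**6)
```

Note how both inputs matter: a tight latency limit (e.g. from a PM QoS request) vetoes a deep state even when the predicted idle time is long.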
Re: [patch v7 0/21] sched: power aware scheduling
Hi Alex,

On 05/20/2013 06:31 AM, Alex Shi wrote:
>>>>>> Which are the workloads where 'powersaving' mode hurts workload
>>>>>> performance measurably?
>>
>> I ran ebizzy on a 2 socket, 16 core, SMT 4 Power machine.
>
> Is this a 2 * 16 * 4 LCPUs PowerPC machine?

This is a 2 * 8 * 4 LCPUs PowerPC machine.

>> The power efficiency drops significantly with the powersaving policy of
>> this patch, over the power efficiency of the scheduler without this patch.
>>
>> The below parameters are measured relative to the default scheduler
>> behaviour.
>>
>> A: Drop in power efficiency with the patch+powersaving policy
>> B: Drop in performance with the patch+powersaving policy
>> C: Decrease in power consumption with the patch+powersaving policy
>>
>> NumThreads    A     B     C
>> ---------------------------
>>  2           33%   36%   4%
>>  4           31%   33%   3%
>>  8           28%   30%   3%
>> 16           31%   33%   4%
>>
>> Each of the above runs is for 30s.
>>
>> On investigating socket utilization, I found that only 1 socket was being
>> used during all the above threaded runs. As can be guessed, this is due
>> to the group_weight being considered for the threshold metric.
>> This stacks up tasks on a core and further on a socket, thus throttling
>> them, as observed by Mike below.
>>
>> I therefore think we must switch to group_capacity as the metric for the
>> threshold and use only (rq->utils * nr_running) for the group_utils
>> calculation during non-bursty wakeup scenarios.
>> This way we are comparing right: the utilization of the runqueue by the
>> fair tasks and the cpu capacity available for them after being consumed
>> by the rt tasks.
>>
>> After I made the above modification, all the above three parameters came
>> to be nearly null. However, I am observing the load balancing of the
>> scheduler with the patch and powersavings policy enabled. It is behaving
>> very close to the default scheduler (spreading tasks across sockets).
>> That also explains why there is no performance drop or gain with the
>> patch+powersavings policy enabled.
I will look into this observation and >> revert. > > Thanks a lot for the great testings! > Seem tasks per SMT cpu isn't power efficient. > And I got the similar result last week. I tested the fspin testing(do > endless calculation, in linux-next tree.). when I bind task per SMT cpu, > the power efficiency really dropped with most every threads number. but > when bind task per core, it has better power efficiency on all threads. > Beside to move task depend on group_capacity, another choice is balance > task according cpu_power. I did the transfer in code. but need to go > through a internal open source process before public them. What do you mean by *another* choice is balance task according to cpu_power? group_capacity is based on cpu_power. Also, your balance policy in v6 was doing the same right? It was rightly comparing rq->utils * nr_running against cpu_power. Why not simply switch to that code for power policy load balancing? >>>>> Well, it'll lose throughput any time there's parallel execution >>>>> potential but it's serialized instead.. using average will inevitably >>>>> stack tasks sometimes, but that's its goal. Hackbench shows it. >>>> >>>> (but that consolidation can be a winner too, and I bet a nickle it would >>>> be for a socket sized pgbench run) >>> >>> (belay that, was thinking of keeping all tasks on a single node, but >>> it'll likely stack the whole thing on a CPU or two, if so, it'll hurt) >> >> At this point, I would like to raise one issue. >> *Is the goal of the power aware scheduler improving power efficiency of >> the scheduler or a compromise on the power efficiency but definitely a >> decrease in power consumption, since it is the user who has decided to >> prioritise lower power consumption over performance* ? >> > > It could be one of reason for this feather, but I could like to > make it has better efficiency, like packing tasks according to cpu_power > not current group_weight. 
Yes, we could try the patch using group_capacity and observe the results for power efficiency, before we decide to compromise on power efficiency for a decrease in power consumption.

Regards
Preeti U Murthy
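For reference, the A/B/C columns quoted earlier in this thread are arithmetically related: power efficiency is performance per watt, so a fractional performance drop B together with a fractional power drop C implies an efficiency drop of roughly 1 - (1 - B)/(1 - C). A small sketch of that arithmetic (illustrative only, not measurement code):

```python
def efficiency_drop(perf_drop, power_drop):
    """All arguments and the result are fractions, e.g. 0.36 for 36%.

    efficiency = perf / power, so the new efficiency relative to baseline
    is (1 - perf_drop) / (1 - power_drop); the drop is one minus that.
    """
    return 1 - (1 - perf_drop) / (1 - power_drop)

# The 2-thread row of the quoted table: B = 36% perf drop, C = 4% power drop
a = efficiency_drop(0.36, 0.04)   # close to the reported A = 33%
```

Checking a couple of rows this way confirms the table is internally consistent: a large performance drop bought only a small power saving, which is exactly the efficiency regression being reported.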
Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
Hi Alex, On 02/24/2013 02:57 PM, Alex Shi wrote: > On 02/22/2013 04:54 PM, Peter Zijlstra wrote: >> On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote: >>>> The name is a secondary issue, first you need to explain why you >>> think >>>> nr_running is a useful metric at all. >>>> >>>> You can have a high nr_running and a low utilization (a burst of >>>> wakeups, each waking a process that'll instantly go to sleep again), >>> or >>>> low nr_running and high utilization (a single process cpu bound >>>> process). >>> >>> It is true in periodic balance. But in fork/exec/waking timing, the >>> incoming processes usually need to do something before sleep again. >> >> You'd be surprised, there's a fair number of workloads that have >> negligible runtime on wakeup. > > will appreciate if you like introduce some workload. :) > BTW, do you has some idea to handle them? > Actually, if tasks is just like transitory, it is also hard to catch > them in balance, like 'cyclitest -t 100' on my 4 LCPU laptop, vmstat > just can catch 1 or 2 tasks very second. >> >>> I use nr_running to measure how the group busy, due to 3 reasons: >>> 1, the current performance policy doesn't use utilization too. >> >> We were planning to fix that now that its available. > > I had tried, but failed on aim9 benchmark. As a result I give up to use > utilization in performance balance. > Some trying and talking in the thread. > https://lkml.org/lkml/2013/1/6/96 > https://lkml.org/lkml/2013/1/22/662 >> >>> 2, the power policy don't care load weight. >> >> Then its broken, it should very much still care about weight. > > Here power policy just use nr_running as the criteria to check if it's > eligible for power aware balance. when do balancing the load weight is > still the key judgment. > >> >>> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some >>> benchmark results looks clear bad when use utilization. if my memory >>> right, the hackbench/aim7 both looks bad. 
I had tried many ways to
>>> engage utilization into this balance, like use utilization only, or
>>> use utilization * nr_running etc. but still can not find a way to recover
>>> the lose. But with nr_running, the performance seems doesn't lose much
>>> with power policy.
>>
>> You're failing to explain why utilization performs bad and you don't
>> explain why nr_running is better. That things work simply isn't good
>
> Um, let me try to explain again, The utilisation need much time to
> accumulate itself(345ms). Whenever with or without load weight, many
> bursting tasks just give a minimum weight to the carrier CPU at the
> first few ms. So, it is too easy to do a incorrect distribution here and
> need migration on later periodic balancing.

I don't understand why forked tasks are taking time to accumulate the load. I understand this if it were to be a woken up task. The first time the forked task gets a chance to update the load itself, it needs to reflect full utilization. In __update_entity_runnable_avg both runnable_avg_period and runnable_avg_sum get equally incremented for a forked task since it is runnable. Hence where is the chance for the load to get incremented in steps?

In sleeping tasks, since runnable_avg_sum progresses much slower than runnable_avg_period, these tasks take much time to accumulate the load when they wake up. This makes sense of course. But how does this happen for forked tasks?

Regards
Preeti U Murthy
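The point about forked versus woken tasks can be made concrete with a simplified model of __update_entity_runnable_avg(): a runnable entity adds the elapsed time to both runnable_avg_sum and runnable_avg_period, so a task that has been runnable since fork always shows sum/period == 1 (full utilization), while a task that slept first accumulated period without sum and needs time to catch up. Decay is omitted here for clarity; it scales both terms alike and does not change the ratio argument.

```python
def track(history):
    """history: sequence of booleans, True = runnable in that period.

    Simplified (decay-free) model of runnable_avg_sum / runnable_avg_period:
    both counters advance each period, but only runnable periods feed the sum.
    """
    avg_sum = avg_period = 0
    for runnable in history:
        avg_sum += 1 if runnable else 0
        avg_period += 1
    return avg_sum, avg_period

forked_sum, forked_period = track([True] * 10)             # runnable since fork
slept_sum, slept_period = track([False] * 8 + [True] * 2)  # woke up recently
```

The forked task's ratio is 1 from its very first update, which is the crux of the question above: the "slow accumulation" effect only applies to tasks with sleep history.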
Re: [patch v5 02/15] sched: set initial load avg of new forked task
Hi Alex,

On 02/20/2013 11:50 AM, Alex Shi wrote:
> On 02/18/2013 01:07 PM, Alex Shi wrote:
>> New task has no runnable sum at its first runnable time, so its
>> runnable load is zero. That makes burst forking balancing just select
>> few idle cpus to assign tasks if we engage runnable load in balancing.
>>
>> Set initial load avg of new forked task as its load weight to resolve
>> this issue.
>
> patch answering PJT's update here. that merged the 1st and 2nd patches
> into one. other patches in serial don't need to change.
>
> =
> From 89b56f2e5a323a0cb91c98be15c94d34e8904098 Mon Sep 17 00:00:00 2001
> From: Alex Shi
> Date: Mon, 3 Dec 2012 17:30:39 +0800
> Subject: [PATCH 01/14] sched: set initial value of runnable avg for new
> forked task
>
> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
>
> and make forking balancing imbalance since incorrect load_avg_contrib.
>
> set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to
> resolve such issues.
>
> Signed-off-by: Alex Shi
> ---
> kernel/sched/core.c | 3 +++
> kernel/sched/fair.c | 4 ++++
> 2 files changed, 7 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26058d0..1452e14 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1559,6 +1559,7 @@ static void __sched_fork(struct task_struct *p)
> #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
> 	p->se.avg.runnable_avg_period = 0;
> 	p->se.avg.runnable_avg_sum = 0;
> +	p->se.avg.decay_count = 0;
> #endif
> #ifdef CONFIG_SCHEDSTATS
> 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> @@ -1646,6 +1647,8 @@ void sched_fork(struct task_struct *p)
> 		p->sched_reset_on_fork = 0;
> 	}

I think the following comment will help here.
/* All forked tasks are assumed to have full utilization to begin with */
> +	p->se.avg.load_avg_contrib = p->se.load.weight;
> +
> 	if (!rt_prio(p->prio))
> 		p->sched_class = &fair_sched_class;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 81fa536..cae5134 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct
> cfs_rq *cfs_rq,
> 	 * We track migrations using entity decay_count <= 0, on a wake-up
> 	 * migration we use a negative decay count to track the remote decays
> 	 * accumulated while sleeping.
> + *
> + * When enqueue a new forked task, the se->avg.decay_count == 0, so
> + * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
> + * value: se->load.weight.

I disagree with the comment. update_entity_load_avg() gets called for all forked tasks: enqueue_task_fair->update_entity_load_avg() during the second iteration. But __update_entity_load_avg() in update_entity_load_avg(), where the actual load update happens, does not get called. This is because, as below, the last_update of the forked task is nearly equal to the clock task of the runqueue. Hence probably 1ms has not passed by for the load to get updated. Which is why neither the load of the task nor the load of the runqueue gets updated when the task forks.

Also note that the reason we bypass update_entity_load_avg() below is not because our decay_count=0. It's because the forked tasks have nothing to update. Only woken up tasks and migrated wake-ups have load updates to do. Forked tasks just got created; they have no load to "update" but only to "create". This I feel is rightly done in sched_fork by this patch.

So ideally I don't think we should have any comment here. It does not sound relevant.
>*/ > if (unlikely(se->avg.decay_count <= 0)) { > se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task; > Regards Preeti U Murthy
Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
Hi, On 02/24/2013 02:57 PM, Alex Shi wrote: > On 02/22/2013 04:54 PM, Peter Zijlstra wrote: >> On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote: >>>> The name is a secondary issue, first you need to explain why you >>> think >>>> nr_running is a useful metric at all. >>>> >>>> You can have a high nr_running and a low utilization (a burst of >>>> wakeups, each waking a process that'll instantly go to sleep again), >>> or >>>> low nr_running and high utilization (a single process cpu bound >>>> process). >>> >>> It is true in periodic balance. But in fork/exec/waking timing, the >>> incoming processes usually need to do something before sleep again. >> >> You'd be surprised, there's a fair number of workloads that have >> negligible runtime on wakeup. > > will appreciate if you like introduce some workload. :) > BTW, do you has some idea to handle them? > Actually, if tasks is just like transitory, it is also hard to catch > them in balance, like 'cyclitest -t 100' on my 4 LCPU laptop, vmstat > just can catch 1 or 2 tasks very second. >> >>> I use nr_running to measure how the group busy, due to 3 reasons: >>> 1, the current performance policy doesn't use utilization too. >> >> We were planning to fix that now that its available. > > I had tried, but failed on aim9 benchmark. As a result I give up to use > utilization in performance balance. > Some trying and talking in the thread. > https://lkml.org/lkml/2013/1/6/96 > https://lkml.org/lkml/2013/1/22/662 >> >>> 2, the power policy don't care load weight. >> >> Then its broken, it should very much still care about weight. > > Here power policy just use nr_running as the criteria to check if it's > eligible for power aware balance. when do balancing the load weight is > still the key judgment. > >> >>> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some >>> benchmark results looks clear bad when use utilization. if my memory >>> right, the hackbench/aim7 both looks bad. 
I had tried many ways to >>> engage utilization into this balance, like use utilization only, or >>> use >>> utilization * nr_running etc. but still can not find a way to recover >>> the lose. But with nr_running, the performance seems doesn't lose much >>> with power policy. >> >> You're failing to explain why utilization performs bad and you don't >> explain why nr_running is better. That things work simply isn't good > > Um, let me try to explain again, The utilisation need much time to > accumulate itself(345ms). Whenever with or without load weight, many > bursting tasks just give a minimum weight to the carrier CPU at the > first few ms. So, it is too easy to do a incorrect distribution here and > need migration on later periodic balancing. Why can't this be attacked in *either* of the following ways: 1. Attack this problem at the source, by ensuring that the utilisation is accumulated faster by making the update window smaller. 2. Balance on nr_running only if you detect burst wakeups. Alex, you had released a patch earlier which could detect this, right? Instead of balancing on nr_running all the time, why not balance on it only if burst wakeups are detected? By doing so you ensure that nr_running as a metric for load balancing is used only when it is right to do so, and the reason for using it also gets well documented. Regards Preeti U Murthy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3 3/6] sched: pack small tasks
Hi Peter, On 04/26/2013 03:48 PM, Peter Zijlstra wrote: > On Wed, Mar 27, 2013 at 03:51:51PM +0530, Preeti U Murthy wrote: >> Hi, >> >> On 03/26/2013 05:56 PM, Peter Zijlstra wrote: >>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote: >>>> +static bool is_buddy_busy(int cpu) >>>> +{ >>>> + struct rq *rq = cpu_rq(cpu); >>>> + >>>> + /* >>>> +* A busy buddy is a CPU with a high load or a small load with >>>> a lot of >>>> +* running tasks. >>>> +*/ >>>> + return (rq->avg.runnable_avg_sum > >>>> + (rq->avg.runnable_avg_period / (rq->nr_running >>>> + 2))); >>>> +} >>> >>> Why does the comment talk about load but we don't see it in the >>> equation. Also, why does nr_running matter at all? I thought we'd >>> simply bother with utilization, if fully utilized we're done etc.. >>> >> >> Peter, lets say the run-queue has 50% utilization and is running 2 >> tasks. And we wish to find out if it is busy. We would compare this >> metric with the cpu power, which lets say is 100. >> >> rq->util * 100 < cpu_of(rq)->power. >> >> In the above scenario would we declare the cpu _not_busy? Or would we do >> the following: >> >> (rq->util * 100) * #nr_running < cpu_of(rq)->power and conclude that it >> is just enough _busy_ to not take on more processes? > > That is just confused... ->power doesn't have anything to do with a per-cpu > measure. ->power is a inter-cpu measure of relative compute capacity. Ok. > > Mixing in nr_running confuses things even more; it doesn't matter how many > tasks it takes to push utilization up to 100%; once its there the cpu simply > cannot run more. True, this is from the perspective of the CPU. But will not the tasks on this CPU get throttled if you find the utilization of this CPU < 100% and decide to put more tasks on it?
Regards Preeti U Murthy
Re: [patch v4 07/18] sched: set initial load avg of new forked task
Hi everyone, On 02/19/2013 05:04 PM, Paul Turner wrote: > On Fri, Feb 15, 2013 at 2:07 AM, Alex Shi wrote: >> >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>> index 1dff78a..9d1c193 100644 >>> --- a/kernel/sched/core.c >>> +++ b/kernel/sched/core.c >>> @@ -1557,8 +1557,8 @@ static void __sched_fork(struct task_struct *p) >>> * load-balance). >>> */ >>> #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED) >>> - p->se.avg.runnable_avg_period = 0; >>> - p->se.avg.runnable_avg_sum = 0; >>> + p->se.avg.runnable_avg_period = 1024; >>> + p->se.avg.runnable_avg_sum = 1024; >> >> It can't work. >> avg.decay_count needs to be set to 0 before enqueue_entity_load_avg(), then >> update_entity_load_avg() can't be called, so, runnable_avg_period/sum >> are unusable. > > Well we _could_ also use a negative decay_count here and treat it like > a migration; but the larger problem is the visibility of p->on_rq; > which is gates whether we account the time as runnable and occurs > after activate_task() so that's out. > >> >> Even we has chance to call __update_entity_runnable_avg(), >> avg.last_runnable_update needs be set before that, usually, it needs to >> be set as 'now', that cause __update_entity_runnable_avg() function >> return 0, then update_entity_load_avg() still can not reach to >> __update_entity_load_avg_contrib(). >> >> If we embed a simple new task load initialization to many functions, >> that is too hard for future reader. > > This is my concern about making this a special case with the > introduction ENQUEUE_NEWTASK flag; enqueue jumps through enough hoops > as it is. > > I still don't see why we can't resolve this at init time in > __sched_fork(); your patch above just moves an explicit initialization > of load_avg_contrib into the enqueue path. Adding a call to > __update_task_entity_contrib() to the previous alternate suggestion > would similarly seem to resolve this? 
We could do this (Adding a call to __update_task_entity_contrib()), but the cfs_rq->runnable_load_avg gets updated only if the task is on the runqueue. But in the forked task's case the on_rq flag is not yet set. Something like the below: --- kernel/sched/fair.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8691b0d..841e156 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1451,14 +1451,20 @@ static inline void update_entity_load_avg(struct sched_entity *se, else now = cfs_rq_clock_task(group_cfs_rq(se)); - if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) - return; - + if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) { + if (!(flags & ENQUEUE_NEWTASK)) + return; + } contrib_delta = __update_entity_load_avg_contrib(se); if (!update_cfs_rq) return; + /* But the cfs_rq->runnable_load_avg does not get updated in case of +* a forked task, because the se->on_rq = 0, although we update the +* task's load_avg_contrib above in +* __update_entity_load_avg_contrib(). +*/ if (se->on_rq) cfs_rq->runnable_load_avg += contrib_delta; else @@ -1538,12 +1544,6 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq, subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib); update_entity_load_avg(se, 0); } - /* -* set the initial load avg of new task same as its load -* in order to avoid brust fork make few cpu too heavier -*/ - if (flags & ENQUEUE_NEWTASK) - se->avg.load_avg_contrib = se->load.weight; cfs_rq->runnable_load_avg += se->avg.load_avg_contrib; /* we force update consideration on load-balancer moves */ Thanks Regards Preeti U Murthy
Re: [patch v5 06/15] sched: log the cpu utilization at rq
Hi, >> /* >> * This is the main, per-CPU runqueue data structure. >> * >> @@ -481,6 +484,7 @@ struct rq { >> #endif >> >> struct sched_avg avg; >> +unsigned int util; >> }; >> >> static inline int cpu_of(struct rq *rq) > > You don't actually compute the rq utilization, you only compute the > utilization as per the fair class, so if there's significant RT activity > it'll think the cpu is under-utilized, which I think will result in the > wrong thing. Correct me if I am wrong, but isn't the current load balancer also disregarding the real time tasks to calculate the domain/group/cpu level load too? What I mean is, if the answer to the above question is yes, then can we safely assume that the further optimizations to the load balancer like the power aware scheduler and the usage of per entity load tracking can be done without considering the real time tasks? Regards Preeti U Murthy
Re: [patch v5 06/15] sched: log the cpu utilization at rq
Hi everyone, On 02/18/2013 10:37 AM, Alex Shi wrote: > The cpu's utilization is to measure how busy is the cpu. > util = cpu_rq(cpu)->avg.runnable_avg_sum > / cpu_rq(cpu)->avg.runnable_avg_period; Why not cfs_rq->runnable_load_avg? I am concerned with what is the right metric to use here. Refer to this discussion: https://lkml.org/lkml/2012/10/29/448 Regards Preeti U Murthy
[PATCH] cpuidle/menu: Fail cpuidle_idle_call() if no idle state is acceptable
On PowerPC, in a particular test scenario, all the cpu idle states were disabled. In spite of this, it was observed that the idle state count of the shallowest idle state, snooze, was increasing. This is because the governor returns the idle state index as 0 even in scenarios when no idle state can be chosen. These scenarios could be when the latency requirement is 0 or, as mentioned above, when the user wants to disable certain cpu idle states at runtime. In the latter case, it's possible that no cpu idle state is valid because the suitable states were disabled and the rest did not match the menu governor criteria to be chosen as the next idle state. This patch adds the code to indicate that a valid cpu idle state could not be chosen by the menu governor and reports back to the arch so that it can take some default action. Signed-off-by: Preeti U Murthy --- drivers/cpuidle/cpuidle.c | 6 +- drivers/cpuidle/governors/menu.c | 7 --- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..5bf06bb 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -131,8 +131,9 @@ int cpuidle_idle_call(void) /* ask the governor for the next state */ next_state = cpuidle_curr_governor->select(drv, dev); + + dev->last_residency = 0; if (need_resched()) { - dev->last_residency = 0; /* give the governor an opportunity to reflect on the outcome */ if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, next_state); @@ -140,6 +141,9 @@ int cpuidle_idle_call(void) return 0; } + if (next_state < 0) + return -EINVAL; + trace_cpu_idle_rcuidle(next_state, dev->cpu); broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP); diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index cf7f2f0..6921543 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -283,6 +283,7 @@ again: * menu_select - selects the next idle state to
enter * @drv: cpuidle driver containing state data * @dev: the CPU + * Returns -1 when no idle state is suitable */ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) { @@ -292,17 +293,17 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) int multiplier; struct timespec t; - if (data->needs_update) { + if (data->last_state_idx >= 0 && data->needs_update) { menu_update(drv, dev); data->needs_update = 0; } - data->last_state_idx = 0; + data->last_state_idx = -1; data->exit_us = 0; /* Special case when user has set very strict latency requirement */ if (unlikely(latency_req == 0)) - return 0; + return data->last_state_idx; /* determine the expected residency time, round up */ t = ktime_to_timespec(tick_nohz_get_sleep_length());
Re: [PATCH] cpuidle/menu: Fail cpuidle_idle_call() if no idle state is acceptable
Hi Srivatsa, On 01/14/2014 12:30 PM, Srivatsa S. Bhat wrote: > On 01/14/2014 11:35 AM, Preeti U Murthy wrote: >> On PowerPC, in a particular test scenario, all the cpu idle states were >> disabled. >> Inspite of this it was observed that the idle state count of the shallowest >> idle state, snooze, was increasing. >> >> This is because the governor returns the idle state index as 0 even in >> scenarios when no idle state can be chosen. These scenarios could be when the >> latency requirement is 0 or as mentioned above when the user wants to disable >> certain cpu idle states at runtime. In the latter case, its possible that no >> cpu idle state is valid because the suitable states were disabled >> and the rest did not match the menu governor criteria to be chosen as the >> next idle state. >> >> This patch adds the code to indicate that a valid cpu idle state could not be >> chosen by the menu governor and reports back to arch so that it can take some >> default action. >> > > That sounds fair enough. However, the "default" action of pseries idle loop > (pseries_lpar_idle()) surprises me. It enters Cede, which is _deeper_ than > doing > a snooze! IOW, a user might "disable" cpuidle or set the > PM_QOS_CPU_DMA_LATENCY > to 0 hoping to prevent the CPUs from going to deep idle states, but then the > machine would still end up going to Cede, even though that wont get reflected > in the idle state counts. IMHO that scenario needs some thought as well... Yes I did see this, but since the patch intends to only communicate whether the cpuidle governor was successful in choosing an idle state on its part, I wished to address the default action of pseries idle loop separately. You are right we will need to understand the patch which introduced this action. I will take a look at it. 
> >> Signed-off-by: Preeti U Murthy >> --- >> >> drivers/cpuidle/cpuidle.c|6 +- >> drivers/cpuidle/governors/menu.c |7 --- >> 2 files changed, 9 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c >> index a55e68f..5bf06bb 100644 >> --- a/drivers/cpuidle/cpuidle.c >> +++ b/drivers/cpuidle/cpuidle.c >> @@ -131,8 +131,9 @@ int cpuidle_idle_call(void) >> >> /* ask the governor for the next state */ >> next_state = cpuidle_curr_governor->select(drv, dev); >> + >> +dev->last_residency = 0; >> if (need_resched()) { >> -dev->last_residency = 0; >> /* give the governor an opportunity to reflect on the outcome */ >> if (cpuidle_curr_governor->reflect) >> cpuidle_curr_governor->reflect(dev, next_state); > > The comments on top of the .reflect() routines of the governors say that the > second parameter is the index of the actual state entered. But after this > patch, > next_state can be negative, indicating an invalid index. So those comments > need > to be updated accordingly. Right, I will take care of the comment in the next post. > >> @@ -140,6 +141,9 @@ int cpuidle_idle_call(void) >> return 0; >> } >> >> +if (next_state < 0) >> +return -EINVAL; > > The exit path above (due to need_resched) returns with irqs enabled, but the > new > one you are adding (next_state < 0) returns with irqs disabled. This is > correct, > because in the latter case, "idle" is still in progress and the arch will > choose > a default handler to execute (unlike the former case where "idle" is over and > hence its time to enable interrupts). Correct. > > IMHO it would be good to add comments around this code to explain this subtle > difference. We can never be too careful with these things... ;-) Ok, will do so. 
> >> + >> trace_cpu_idle_rcuidle(next_state, dev->cpu); >> >> broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP); >> diff --git a/drivers/cpuidle/governors/menu.c >> b/drivers/cpuidle/governors/menu.c >> index cf7f2f0..6921543 100644 >> --- a/drivers/cpuidle/governors/menu.c >> +++ b/drivers/cpuidle/governors/menu.c >> @@ -283,6 +283,7 @@ again: >> * menu_select - selects the next idle state to enter >> * @drv: cpuidle driver containing state data >> * @dev: the CPU >> + * Returns -1 when no idle state is suitable >> */ >> static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device >> *dev) >> { >> @@ -292,17 +293,17 @@ static int menu_select(struct cpuidle_d
Re: [PATCH] cpuidle/menu: Fail cpuidle_idle_call() if no idle state is acceptable
On 01/14/2014 01:07 PM, Srivatsa S. Bhat wrote: > On 01/14/2014 12:30 PM, Srivatsa S. Bhat wrote: >> On 01/14/2014 11:35 AM, Preeti U Murthy wrote: >>> On PowerPC, in a particular test scenario, all the cpu idle states were >>> disabled. >>> Inspite of this it was observed that the idle state count of the shallowest >>> idle state, snooze, was increasing. >>> >>> This is because the governor returns the idle state index as 0 even in >>> scenarios when no idle state can be chosen. These scenarios could be when >>> the >>> latency requirement is 0 or as mentioned above when the user wants to >>> disable >>> certain cpu idle states at runtime. In the latter case, its possible that no >>> cpu idle state is valid because the suitable states were disabled >>> and the rest did not match the menu governor criteria to be chosen as the >>> next idle state. >>> >>> This patch adds the code to indicate that a valid cpu idle state could not >>> be >>> chosen by the menu governor and reports back to arch so that it can take >>> some >>> default action. >>> >> >> That sounds fair enough. However, the "default" action of pseries idle loop >> (pseries_lpar_idle()) surprises me. It enters Cede, which is _deeper_ than >> doing >> a snooze! IOW, a user might "disable" cpuidle or set the >> PM_QOS_CPU_DMA_LATENCY >> to 0 hoping to prevent the CPUs from going to deep idle states, but then the >> machine would still end up going to Cede, even though that wont get reflected >> in the idle state counts. IMHO that scenario needs some thought as well... >> > > I checked the git history and found that the default idle was changed (on > purpose) > to cede the processor, in order to speed up booting.. Hmm.. 
> > commit 363edbe2614aa90df706c0f19ccfa2a6c06af0be > Author: Vaidyanathan Srinivasan > Date: Fri Sep 6 00:25:06 2013 +0530 > > powerpc: Default arch idle could cede processor on pseries This issue is not powerpc-specific, as I observed on digging a bit into the default idle routines of the common archs. The way the archs perceive the call to the cpuidle framework today is that if it fails, it means that the cpuidle backend driver failed to *function* for some reason (as mentioned in the above commit: either because the cpuidle driver is not registered or because it does not work on some specific platforms), and that therefore the archs should decide on an idle state themselves. They therefore end up choosing a convenient idle state, which could very well be one of the idle states in the cpuidle state table. The archs do not see a failed call to the cpuidle driver as "the cpuidle driver says no idle state can be entered now because there are strict latency requirements or the idle states are disabled". IOW, the call to the cpuidle driver is currently based on whether the cpuidle driver exists rather than whether it agrees to enter any of the idle states. This patch brings in the need for the archs to incorporate an additional check: did cpuidle_idle_call() fail because it did not find it wise to enter any of the idle states? In that case they should simply exit without taking any *default action*. Need to give this some thought and reconsider the patch. Regards Preeti U Murthy > > > Regards, > Srivatsa S. Bhat >
[PATCH V5 0/8] cpuidle/ppc: Enable deep idle states on PowerNV
inated broadcast cpu on hotplug of the old instead of smp_call_function_single(). This is because we are interrupt disabled at this point and should not be using smp_call_function_single or its children in this context to send an ipi. 6. Move GENERIC_CLOCKEVENTS_BROADCAST to arch/powerpc/Kconfig. 7. Fix coding style issues. Changes in V2: - https://lkml.org/lkml/2013/8/14/239 1. Dynamically pick a broadcast CPU, instead of having a dedicated one. 2. Remove the constraint of having to disable tickless idle on the broadcast CPU by queueing a hrtimer dedicated to do broadcast. V1 posting: https://lkml.org/lkml/2013/7/25/740. 1. Added the infrastructure to wakeup CPUs in deep idle states in which the local timers stop. --- Preeti U Murthy (4): cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines time/cpuidle: Support in tick broadcast framework in the absence of external clock device cpuidle/powernv: Add "Fast-Sleep" CPU idle state cpuidle/powernv: Parse device tree to setup idle states Srivatsa S. 
Bhat (2): powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message powerpc: Implement tick broadcast IPI as a fixed IPI message Vaidyanathan Srinivasan (2): powernv/cpuidle: Add context management for Fast Sleep powermgt: Add OPAL call to resync timebase on wakeup arch/powerpc/Kconfig |2 arch/powerpc/include/asm/opal.h|2 arch/powerpc/include/asm/processor.h |1 arch/powerpc/include/asm/smp.h |2 arch/powerpc/include/asm/time.h|1 arch/powerpc/kernel/exceptions-64s.S | 10 + arch/powerpc/kernel/idle_power7.S | 90 +-- arch/powerpc/kernel/smp.c | 23 ++- arch/powerpc/kernel/time.c | 80 ++ arch/powerpc/platforms/cell/interrupt.c|2 arch/powerpc/platforms/powernv/opal-wrappers.S |1 arch/powerpc/platforms/ps3/smp.c |2 drivers/cpuidle/cpuidle-powernv.c | 106 - include/linux/clockchips.h |4 - kernel/time/clockevents.c |9 + kernel/time/tick-broadcast.c | 192 ++-- kernel/time/tick-internal.h|8 + 17 files changed, 434 insertions(+), 101 deletions(-)
[PATCH V5 1/8] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
From: Srivatsa S. Bhat The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map to a common implementation - generic_smp_call_function_single_interrupt(). So, we can consolidate them and save one of the IPI message slots, (which are precious on powerpc, since only 4 of those slots are available). So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be used for something else in the future, if desired. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/kernel/smp.c | 12 +--- arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 4 files changed, 8 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 084e080..9f7356b 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_CALL_FUNC_SINGLE 2 +#define PPC_MSG_UNUSED 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index a3b64f3..c2bd8d6 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t call_function_single_action(int irq, void *data) +static irqreturn_t unused_action(int irq, void *data) { - generic_smp_call_function_single_interrupt(); + /* This slot is unused and hence available for use, if needed */ return IRQ_HANDLED; } @@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, + [PPC_MSG_UNUSED] = unused_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", + [PPC_MSG_UNUSED] = "ipi unused", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); - if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE)) - generic_smp_call_function_single_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule); void arch_send_call_function_single_ipi(int cpu) { - do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); + do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } void arch_send_call_function_ipi_mask(const struct cpumask *mask) 
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 2d42f3b..adf3726 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -215,7 +215,7 @@ void iic_request_IPIs(void) { iic_request_ipi(PPC_MSG_CALL_FUNCTION); iic_request_ipi(PPC_MSG_RESCHEDULE); - iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE); + iic_request_ipi(PPC_MSG_UNUSED); iic_request_ipi(PPC_MSG_DEBUGGER_BREAK); } diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c index 4b35166..00d1a7c 100644 --- a/arch/powerpc/platforms/ps3/smp.c +++ b/arch/powerpc/platforms/ps3/smp.c @@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void) BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION != 0); BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1); - BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2); + BUILD_BUG_ON(PPC_MSG_UNUSED != 2); BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3); for (i = 0; i < MSG_COUNT; i++) {
[PATCH V5 2/8] powerpc: Implement tick broadcast IPI as a fixed IPI message
From: Srivatsa S. Bhat For scalability and performance reasons, we want the tick broadcast IPIs to be handled as efficiently as possible. Fixed IPI messages are one of the most efficient mechanisms available - they are faster than the smp_call_function mechanism because the IPI handlers are fixed and hence they don't involve costly operations such as adding IPI handlers to the target CPU's function queue, acquiring locks for synchronization etc. Luckily we have an unused IPI message slot, so use that to implement tick broadcast IPIs efficiently. Signed-off-by: Srivatsa S. Bhat [Functions renamed to tick_broadcast* and Changelog modified by Preeti U. Murthy] Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/smp.c | 19 +++ arch/powerpc/kernel/time.c |5 + arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 6 files changed, 24 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 9f7356b..ff51046 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_UNUSED 2 +#define PPC_MSG_TICK_BROADCAST 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index c1f2676..1d428e6 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); extern void GregorianDay(struct rtc_time *tm); +extern void tick_broadcast_ipi_handler(void); extern void generic_calibrate_decr(void); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index c2bd8d6..c77c6d7 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t unused_action(int irq, void *data) +static irqreturn_t tick_broadcast_ipi_action(int irq, void *data) { - /* This slot is unused and hence available for use, if needed */ + tick_broadcast_ipi_handler(); return IRQ_HANDLED; } @@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_UNUSED] = unused_action, + [PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_UNUSED] = "ipi unused", + [PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & 
IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); + if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST)) + tick_broadcast_ipi_handler(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -289,6 +292,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask) do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } +void tick_broadcast(const struct cpumask *mask) +{ + unsigned int cpu; + + for_each_cpu(cpu, mask) + do_message_pass(cpu, PPC_MSG_TICK_BROADCAST); +} + #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) void smp_send_debugger_break(void) { diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index b3b1441..42269c7 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -813,6 +813,11 @@ static void decrementer_set_mode(enum clock_event_mode mode, decrementer_set_next_event(DECREMENTER_MAX, dev); } +/* Interrupt handler for the timer broadcast IPI */ +void tick_broadcast_ipi_handler(void) +{ +} + static void register_decrementer_clockevent(int cpu) { struct clock_event_device *dec = &per_cpu(decrementers, cpu); diff --git a/arch/powerpc/plat
[PATCH V5 3/8] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
Split timer_interrupt(), which is the local timer interrupt handler on ppc, into routines called during regular interrupt handling and __timer_interrupt(), which takes care of running local timers and collecting time related stats. This will enable callers interested only in running expired local timers to directly call into __timer_interrupt(). One of the use cases of this is the tick broadcast IPI handling in which the sleeping CPUs need to handle the local timers that have expired. Signed-off-by: Preeti U Murthy --- arch/powerpc/kernel/time.c | 73 +--- 1 file changed, 41 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 42269c7..42cb603 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -478,6 +478,42 @@ void arch_irq_work_raise(void) #endif /* CONFIG_IRQ_WORK */ +static void __timer_interrupt(void) +{ + struct pt_regs *regs = get_irq_regs(); + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + struct clock_event_device *evt = &__get_cpu_var(decrementers); + u64 now; + + __get_cpu_var(irq_stat).timer_irqs++; + trace_timer_interrupt_entry(regs); + + if (test_irq_work_pending()) { + clear_irq_work_pending(); + irq_work_run(); + } + + now = get_tb_or_rtc(); + if (now >= *next_tb) { + *next_tb = ~(u64)0; + if (evt->event_handler) + evt->event_handler(evt); + } else { + now = *next_tb - now; + if (now <= DECREMENTER_MAX) + set_dec((int)now); + } + +#ifdef CONFIG_PPC64 + /* collect purr register values often, for accurate calculations */ + if (firmware_has_feature(FW_FEATURE_SPLPAR)) { + struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); + cu->current_tb = mfspr(SPRN_PURR); + } +#endif + trace_timer_interrupt_exit(regs); +} + /* * timer_interrupt - gets called when the decrementer overflows, * with interrupts disabled.
@@ -486,8 +522,6 @@ void timer_interrupt(struct pt_regs * regs) { struct pt_regs *old_regs; u64 *next_tb = &__get_cpu_var(decrementers_next_tb); - struct clock_event_device *evt = &__get_cpu_var(decrementers); - u64 now; /* Ensure a positive value is written to the decrementer, or else * some CPUs will continue to take decrementer exceptions. @@ -510,8 +544,6 @@ void timer_interrupt(struct pt_regs * regs) */ may_hard_irq_enable(); - __get_cpu_var(irq_stat).timer_irqs++; - #if defined(CONFIG_PPC32) && defined(CONFIG_PMAC) if (atomic_read(_n_lost_interrupts) != 0) do_IRQ(regs); @@ -520,34 +552,7 @@ void timer_interrupt(struct pt_regs * regs) old_regs = set_irq_regs(regs); irq_enter(); - trace_timer_interrupt_entry(regs); - - if (test_irq_work_pending()) { - clear_irq_work_pending(); - irq_work_run(); - } - - now = get_tb_or_rtc(); - if (now >= *next_tb) { - *next_tb = ~(u64)0; - if (evt->event_handler) - evt->event_handler(evt); - } else { - now = *next_tb - now; - if (now <= DECREMENTER_MAX) - set_dec((int)now); - } - -#ifdef CONFIG_PPC64 - /* collect purr register values often, for accurate calculations */ - if (firmware_has_feature(FW_FEATURE_SPLPAR)) { - struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); - cu->current_tb = mfspr(SPRN_PURR); - } -#endif - - trace_timer_interrupt_exit(regs); - + __timer_interrupt(); irq_exit(); set_irq_regs(old_regs); } @@ -816,6 +821,10 @@ static void decrementer_set_mode(enum clock_event_mode mode, /* Interrupt handler for the timer broadcast IPI */ void tick_broadcast_ipi_handler(void) { + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + + *next_tb = get_tb_or_rtc(); + __timer_interrupt(); } static void register_decrementer_clockevent(int cpu) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V5 4/8] powernv/cpuidle: Add context management for Fast Sleep
From: Vaidyanathan Srinivasan Before adding Fast-Sleep into the cpuidle framework, some low level support needs to be added to enable it. This includes saving and restoring of certain registers at entry and exit time of this state respectively just like we do in the NAP idle state. Signed-off-by: Vaidyanathan Srinivasan [Changelog modified by Preeti U. Murthy ] Signed-off-by: Preeti U. Murthy --- arch/powerpc/include/asm/processor.h |1 + arch/powerpc/kernel/exceptions-64s.S | 10 - arch/powerpc/kernel/idle_power7.S| 63 -- 3 files changed, 53 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index 027fefd..22e547a 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -444,6 +444,7 @@ enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF}; extern int powersave_nap; /* set if nap mode can be used in idle loop */ extern void power7_nap(void); +extern void power7_sleep(void); extern void flush_instruction_cache(void); extern void hard_reset_now(void); extern void poweroff_now(void); diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 9f905e4..b8139fb 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -121,9 +121,10 @@ BEGIN_FTR_SECTION cmpwi cr1,r13,2 /* Total loss of HV state is fatal, we could try to use the * PIR to locate a PACA, then use an emergency stack etc... -* but for now, let's just stay stuck here +* OPAL v3 based powernv platforms have new idle states +* which fall in this catagory. */ - bgt cr1,. 
+ bgt cr1,8f GET_PACA(r13) #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE @@ -141,6 +142,11 @@ BEGIN_FTR_SECTION beq cr1,2f b .power7_wakeup_noloss 2: b .power7_wakeup_loss + + /* Fast Sleep wakeup on PowerNV */ +8: GET_PACA(r13) + b .power7_wakeup_loss + 9: END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) #endif /* CONFIG_PPC_P7_NAP */ diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S index 847e40e..e4bbca2 100644 --- a/arch/powerpc/kernel/idle_power7.S +++ b/arch/powerpc/kernel/idle_power7.S @@ -20,17 +20,27 @@ #undef DEBUG - .text +/* Idle state entry routines */ -_GLOBAL(power7_idle) - /* Now check if user or arch enabled NAP mode */ - LOAD_REG_ADDRBASE(r3,powersave_nap) - lwz r4,ADDROFF(powersave_nap)(r3) - cmpwi 0,r4,0 - beqlr - /* fall through */ +#defineIDLE_STATE_ENTER_SEQ(IDLE_INST) \ + /* Magic NAP/SLEEP/WINKLE mode enter sequence */\ + std r0,0(r1); \ + ptesync;\ + ld r0,0(r1); \ +1: cmp cr0,r0,r0; \ + bne 1b; \ + IDLE_INST; \ + b . -_GLOBAL(power7_nap) + .text + +/* + * Pass requested state in r3: + * 0 - nap + * 1 - sleep + */ +_GLOBAL(power7_powersave_common) + /* Use r3 to pass state nap/sleep/winkle */ /* NAP is a state loss, we create a regs frame on the * stack, fill it up with the state we care about and * stick a pointer to it in PACAR1. We really only @@ -79,8 +89,8 @@ _GLOBAL(power7_nap) /* Continue saving state */ SAVE_GPR(2, r1) SAVE_NVGPRS(r1) - mfcrr3 - std r3,_CCR(r1) + mfcrr4 + std r4,_CCR(r1) std r9,_MSR(r1) std r1,PACAR1(r13) @@ -89,15 +99,30 @@ _GLOBAL(power7_nap) li r4,KVM_HWTHREAD_IN_NAP stb r4,HSTATE_HWTHREAD_STATE(r13) #endif + cmpwi cr0,r3,1 + beq 2f + IDLE_STATE_ENTER_SEQ(PPC_NAP) + /* No return */ +2: IDLE_STATE_ENTER_SEQ(PPC_SLEEP) + /* No return */ - /* Magic NAP mode enter sequence */ - std r0,0(r1) - ptesync - ld r0,0(r1) -1: cmp cr0,r0,r0 - bne 1b - PPC_NAP - b . 
+_GLOBAL(power7_idle) + /* Now check if user or arch enabled NAP mode */ + LOAD_REG_ADDRBASE(r3,powersave_nap) + lwz r4,ADDROFF(powersave_nap)(r3) + cmpwi 0,r4,0 + beqlr + /* fall through */ + +_GLOBAL(power7_nap) + li r3,0 + b power7_powersave_common + /* No return */ + +_GLOBAL(power7_sleep) + li r3,1 + b power7_powersave_common + /* No return */ _GLOBAL(power7_wakeup_loss) ld r1,PACAR1(r13) -- To unsubscribe from this list: send the line
[PATCH V5 6/8] time/cpuidle: Support in tick broadcast framework in the absence of external clock device
On some architectures, the local timers stop in certain deep CPU idle states, and an external clock device is used to wake up these CPUs. The kernel supports such wakeups through the tick broadcast framework, which uses the external clock device as the wakeup source. However, not all architectures provide such an external clock device; some PowerPC implementations, for instance, lack one. This patch adds support in the broadcast framework to handle the wakeup of CPUs in deep idle states on such systems by queuing an hrtimer on one of the CPUs, identified as the bc_cpu. Each time the hrtimer expires, it handles the broadcast and is reprogrammed for the next wakeup of the CPUs in deep idle. However, when a CPU is about to enter deep idle with a wakeup time earlier than the time at which the hrtimer is currently programmed, it *becomes the new bc_cpu* and restarts the hrtimer on itself. This way the job of doing broadcast is handed around to the CPU that asks for the earliest wakeup just before entering deep idle. This is consistent with what happens when an external clock device is present: the smp affinity of that device is set to the CPU with the earliest wakeup. The important point here is that the bc_cpu cannot enter deep idle, since it has an hrtimer queued to wake up the other CPUs in deep idle and hence cannot have its local timer stopped. Therefore, for such a CPU, the BROADCAST_ENTER notification has to fail, implying that it cannot enter deep idle state. On architectures where an external clock device is present, all CPUs can enter deep idle. When the bc_cpu is hotplugged out, the job of doing broadcast is assigned to the first cpu in the broadcast mask. This newly nominated bc_cpu is woken up by an IPI so that it can queue the above-mentioned hrtimer on itself.
Signed-off-by: Preeti U Murthy --- include/linux/clockchips.h |4 - kernel/time/clockevents.c|9 +- kernel/time/tick-broadcast.c | 192 ++ kernel/time/tick-internal.h |8 +- 4 files changed, 186 insertions(+), 27 deletions(-) diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index 493aa02..bbda37b 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; } #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS -extern void clockevents_notify(unsigned long reason, void *arg); +extern int clockevents_notify(unsigned long reason, void *arg); #else -static inline void clockevents_notify(unsigned long reason, void *arg) {} +static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; } #endif #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */ diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 086ad60..d61404e 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -524,12 +524,13 @@ void clockevents_resume(void) #ifdef CONFIG_GENERIC_CLOCKEVENTS /** * clockevents_notify - notification about relevant events + * Returns non-zero on error.
*/ -void clockevents_notify(unsigned long reason, void *arg) +int clockevents_notify(unsigned long reason, void *arg) { struct clock_event_device *dev, *tmp; unsigned long flags; - int cpu; + int cpu, ret = 0; raw_spin_lock_irqsave(&clockevents_lock, flags); @@ -542,11 +543,12 @@ void clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_BROADCAST_ENTER: case CLOCK_EVT_NOTIFY_BROADCAST_EXIT: - tick_broadcast_oneshot_control(reason); + ret = tick_broadcast_oneshot_control(reason); break; case CLOCK_EVT_NOTIFY_CPU_DYING: tick_handover_do_timer(arg); + tick_handover_broadcast_cpu(arg); break; case CLOCK_EVT_NOTIFY_SUSPEND: @@ -585,6 +587,7 @@ void clockevents_notify(unsigned long reason, void *arg) break; } raw_spin_unlock_irqrestore(&clockevents_lock, flags); + return ret; } EXPORT_SYMBOL_GPL(clockevents_notify); diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 9532690..1c23912 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "tick-internal.h" @@ -35,6 +36,15 @@ static cpumask_var_t tmpmask; static DEFINE_RAW_SPINLOCK(tick_broadcast_lock); static int tick_broadcast_force; +/* + * Helper variables for handling broadcast in the absence of a + * tick_broadcast_device. + */ +static struct hrtimer *bc_hrtimer; +static int bc_cpu = -1; +static ktime_t bc_next_wakeup; +static int hrtimer_initialized = 0; + #ifdef CONFIG_TICK_ONESHOT static void tick_broadcast_clear_oneshot(int cpu); #else @@ -528,6 +538,20 @@ static int tick_broadcast
[PATCH V5 5/8] powermgt: Add OPAL call to resync timebase on wakeup
From: Vaidyanathan Srinivasan During "Fast-sleep" and deeper power savings state, decrementer and timebase could be stopped making it out of sync with rest of the cores in the system. Add a firmware call to request platform to resync timebase using low level platform methods. Signed-off-by: Vaidyanathan Srinivasan Signed-off-by: Preeti U. Murthy --- arch/powerpc/include/asm/opal.h|2 ++ arch/powerpc/kernel/exceptions-64s.S |2 +- arch/powerpc/kernel/idle_power7.S | 27 arch/powerpc/platforms/powernv/opal-wrappers.S |1 + 4 files changed, 31 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 033c06b..a662d06 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -132,6 +132,7 @@ extern int opal_enter_rtas(struct rtas_args *args, #define OPAL_FLASH_VALIDATE76 #define OPAL_FLASH_MANAGE 77 #define OPAL_FLASH_UPDATE 78 +#define OPAL_RESYNC_TIMEBASE 79 #ifndef __ASSEMBLY__ @@ -763,6 +764,7 @@ extern void opal_flash_init(void); extern int opal_machine_check(struct pt_regs *regs); extern void opal_shutdown(void); +extern int opal_resync_timebase(void); extern void opal_lpc_init(void); diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index b8139fb..91e6417 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -145,7 +145,7 @@ BEGIN_FTR_SECTION /* Fast Sleep wakeup on PowerNV */ 8: GET_PACA(r13) - b .power7_wakeup_loss + b .power7_wakeup_tb_loss 9: END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S index e4bbca2..34c71e8 100644 --- a/arch/powerpc/kernel/idle_power7.S +++ b/arch/powerpc/kernel/idle_power7.S @@ -17,6 +17,7 @@ #include #include #include +#include #undef DEBUG @@ -124,6 +125,32 @@ _GLOBAL(power7_sleep) b power7_powersave_common /* No return */ +_GLOBAL(power7_wakeup_tb_loss) + ld r2,PACATOC(r13); + ld 
r1,PACAR1(r13) + + /* Time base re-sync */ + li r0,OPAL_RESYNC_TIMEBASE + LOAD_REG_ADDR(r11,opal); + ld r12,8(r11); + ld r2,0(r11); + mtctr r12 + bctrl + + /* TODO: Check r3 for failure */ + + REST_NVGPRS(r1) + REST_GPR(2, r1) + ld r3,_CCR(r1) + ld r4,_MSR(r1) + ld r5,_NIP(r1) + addir1,r1,INT_FRAME_SIZE + mtcrr3 + mfspr r3,SPRN_SRR1/* Return SRR1 */ + mtspr SPRN_SRR1,r4 + mtspr SPRN_SRR0,r5 + rfid + _GLOBAL(power7_wakeup_loss) ld r1,PACAR1(r13) REST_NVGPRS(r1) diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S index e780650..ddfe95a 100644 --- a/arch/powerpc/platforms/powernv/opal-wrappers.S +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S @@ -126,3 +126,4 @@ OPAL_CALL(opal_return_cpu, OPAL_RETURN_CPU); OPAL_CALL(opal_validate_flash, OPAL_FLASH_VALIDATE); OPAL_CALL(opal_manage_flash, OPAL_FLASH_MANAGE); OPAL_CALL(opal_update_flash, OPAL_FLASH_UPDATE); +OPAL_CALL(opal_resync_timebase,OPAL_RESYNC_TIMEBASE); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V5 7/8] cpuidle/powernv: Add "Fast-Sleep" CPU idle state
Fast sleep is one of the deep idle states on Power8 in which local timers of CPUs stop. On PowerPC we do not have an external clock device which can handle wakeup of such CPUs. Now that we have the support in the tick broadcast framework for archs that do not sport such a device and the low level support for fast sleep, enable it in the cpuidle framework on PowerNV. Signed-off-by: Preeti U Murthy --- arch/powerpc/Kconfig |2 ++ arch/powerpc/kernel/time.c|2 +- drivers/cpuidle/cpuidle-powernv.c | 39 + 3 files changed, 42 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index b44b52c..cafa788 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -129,6 +129,8 @@ config PPC select GENERIC_CMOS_UPDATE select GENERIC_TIME_VSYSCALL_OLD select GENERIC_CLOCKEVENTS + select GENERIC_CLOCKEVENTS_BROADCAST + select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select GENERIC_STRNCPY_FROM_USER select GENERIC_STRNLEN_USER select HAVE_MOD_ARCH_SPECIFIC diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 42cb603..d9efd93 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -106,7 +106,7 @@ struct clock_event_device decrementer_clockevent = { .irq= 0, .set_next_event = decrementer_set_next_event, .set_mode = decrementer_set_mode, - .features = CLOCK_EVT_FEAT_ONESHOT, + .features = CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_C3STOP, }; EXPORT_SYMBOL(decrementer_clockevent); diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index 78fd174..e3aa62f 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include @@ -49,6 +50,37 @@ static int nap_loop(struct cpuidle_device *dev, return index; } +static int fastsleep_loop(struct cpuidle_device *dev, + struct cpuidle_driver *drv, + int index) +{ + int cpu = dev->cpu; + unsigned long old_lpcr = mfspr(SPRN_LPCR); + 
unsigned long new_lpcr; + + new_lpcr = old_lpcr; + new_lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */ + + /* exit powersave upon external interrupt, but not decrementer +* interrupt. Emulate sleep. +*/ + new_lpcr |= LPCR_PECE0; + + if (clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu)) { + new_lpcr |= LPCR_PECE1; + mtspr(SPRN_LPCR, new_lpcr); + power7_nap(); + } else { + mtspr(SPRN_LPCR, new_lpcr); + power7_sleep(); + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu); + } + + mtspr(SPRN_LPCR, old_lpcr); + + return index; +} + /* * States for dedicated partition case. */ @@ -67,6 +99,13 @@ static struct cpuidle_state powernv_states[] = { .exit_latency = 10, .target_residency = 100, .enter = &nap_loop }, +{ /* Fastsleep */ + .name = "fastsleep", + .desc = "fastsleep", + .flags = CPUIDLE_FLAG_TIME_VALID, + .exit_latency = 10, + .target_residency = 100, + .enter = &fastsleep_loop }, }; static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V5 8/8] cpuidle/powernv: Parse device tree to setup idle states
Add deep idle states such as nap and fast sleep to the cpuidle state table only if they are discovered from the device tree during cpuidle initialization. Signed-off-by: Preeti U Murthy --- drivers/cpuidle/cpuidle-powernv.c | 81 + 1 file changed, 64 insertions(+), 17 deletions(-) diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index e3aa62f..b01987d 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -12,10 +12,17 @@ #include #include #include +#include #include #include +/* Flags and constants used in PowerNV platform */ + +#define MAX_POWERNV_IDLE_STATES8 +#define IDLE_USE_INST_NAP 0x0001 /* Use nap instruction */ +#define IDLE_USE_INST_SLEEP0x0002 /* Use sleep instruction */ + struct cpuidle_driver powernv_idle_driver = { .name = "powernv_idle", .owner= THIS_MODULE, @@ -84,7 +91,7 @@ static int fastsleep_loop(struct cpuidle_device *dev, /* * States for dedicated partition case. */ -static struct cpuidle_state powernv_states[] = { +static struct cpuidle_state powernv_states[MAX_POWERNV_IDLE_STATES] = { { /* Snooze */ .name = "snooze", .desc = "snooze", @@ -92,20 +99,6 @@ static struct cpuidle_state powernv_states[] = { .exit_latency = 0, .target_residency = 0, .enter = _loop }, - { /* NAP */ - .name = "NAP", - .desc = "NAP", - .flags = CPUIDLE_FLAG_TIME_VALID, - .exit_latency = 10, - .target_residency = 100, - .enter = _loop }, -{ /* Fastsleep */ - .name = "fastsleep", - .desc = "fastsleep", - .flags = CPUIDLE_FLAG_TIME_VALID, - .exit_latency = 10, - .target_residency = 100, - .enter = _loop }, }; static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n, @@ -166,19 +159,73 @@ static int powernv_cpuidle_driver_init(void) return 0; } +static int powernv_add_idle_states(void) +{ + struct device_node *power_mgt; + struct property *prop; + int nr_idle_states = 1; /* Snooze */ + int dt_idle_states; + u32 *flags; + int i; + + /* Currently we have snooze statically defined */ + + 
power_mgt = of_find_node_by_path("/ibm,opal/power-mgt"); + if (!power_mgt) { + pr_warn("opal: PowerMgmt Node not found\n"); + return nr_idle_states; + } + + prop = of_find_property(power_mgt, "ibm,cpu-idle-state-flags", NULL); + if (!prop) { + pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n"); + return nr_idle_states; + } + + dt_idle_states = prop->length / sizeof(u32); + flags = (u32 *) prop->value; + + for (i = 0; i < dt_idle_states; i++) { + + if (flags[i] & IDLE_USE_INST_NAP) { + /* Add NAP state */ + strcpy(powernv_states[nr_idle_states].name, "Nap"); + strcpy(powernv_states[nr_idle_states].desc, "Nap"); + powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID; + powernv_states[nr_idle_states].exit_latency = 10; + powernv_states[nr_idle_states].target_residency = 100; + powernv_states[nr_idle_states].enter = _loop; + nr_idle_states++; + } + + if (flags[i] & IDLE_USE_INST_SLEEP) { + /* Add FASTSLEEP state */ + strcpy(powernv_states[nr_idle_states].name, "FastSleep"); + strcpy(powernv_states[nr_idle_states].desc, "FastSleep"); + powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID; + powernv_states[nr_idle_states].exit_latency = 300; + powernv_states[nr_idle_states].target_residency = 100; + powernv_states[nr_idle_states].enter = _loop; + nr_idle_states++; + } + } + + return nr_idle_states; +} + /* * powernv_idle_probe() * Choose state table for shared versus dedicated partition */ static int powernv_idle_probe(void) { - if (cpuidle_disable != IDLE_NO_OVERRIDE) return -ENODEV; if (firmware_has_feature(FW_FEATURE_OPALv3)) { cpuidle_state_table = powernv_states; - max_idle_state = ARRAY_SIZE(powernv_states); + /* Device tree can indicate more idle states */ + max_idle_stat
[RESEND PATCH V5 0/8] cpuidle/ppc: Enable deep idle states on PowerNV
"broadcast period", and the next wakeup event. By introducing the "broadcast period" as the maximum period after which the broadcast hrtimer can fire, we ensure that we do not miss wakeups in corner cases. 3. On hotplug of a broadcast cpu, trigger the hrtimer meant to do broadcast to fire immediately on the new broadcast cpu. This will ensure we do not miss doing a broadcast pending in the nearest future. 4. Change the type of allocation from GFP_KERNEL to GFP_NOWAIT while initializing bc_hrtimer since we are in an atomic context and cannot sleep. 5. Use the broadcast ipi to wakeup the newly nominated broadcast cpu on hotplug of the old instead of smp_call_function_single(). This is because we are interrupt disabled at this point and should not be using smp_call_function_single or its children in this context to send an ipi. 6. Move GENERIC_CLOCKEVENTS_BROADCAST to arch/powerpc/Kconfig. 7. Fix coding style issues. Changes in V2: https://lkml.org/lkml/2013/8/14/239 1. Dynamically pick a broadcast CPU, instead of having a dedicated one. 2. Remove the constraint of having to disable tickless idle on the broadcast CPU by queueing a hrtimer dedicated to do broadcast. V1 posting: https://lkml.org/lkml/2013/7/25/740. 1. Added the infrastructure to wakeup CPUs in deep idle states in which the local timers stop. --- Preeti U Murthy (5): cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines powermgt: Add OPAL call to resync timebase on wakeup time/cpuidle: Support in tick broadcast framework in the absence of external clock device cpuidle/powernv: Add "Fast-Sleep" CPU idle state cpuidle/powernv: Parse device tree to setup idle states Srivatsa S.
Bhat (2): powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message powerpc: Implement tick broadcast IPI as a fixed IPI message Vaidyanathan Srinivasan (1): powernv/cpuidle: Add context management for Fast Sleep arch/powerpc/Kconfig |2 arch/powerpc/include/asm/opal.h|2 arch/powerpc/include/asm/processor.h |1 arch/powerpc/include/asm/smp.h |2 arch/powerpc/include/asm/time.h|1 arch/powerpc/kernel/exceptions-64s.S | 10 + arch/powerpc/kernel/idle_power7.S | 90 +-- arch/powerpc/kernel/smp.c | 23 ++- arch/powerpc/kernel/time.c | 88 +++ arch/powerpc/platforms/cell/interrupt.c|2 arch/powerpc/platforms/powernv/opal-wrappers.S |1 arch/powerpc/platforms/ps3/smp.c |2 drivers/cpuidle/cpuidle-powernv.c | 109 -- include/linux/clockchips.h |4 - kernel/time/clockevents.c |9 + kernel/time/tick-broadcast.c | 192 ++-- kernel/time/tick-internal.h|8 + 17 files changed, 442 insertions(+), 104 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RESEND PATCH V5 3/8] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
Split timer_interrupt(), which is the local timer interrupt handler on ppc, into routines called during regular interrupt handling and __timer_interrupt(), which takes care of running local timers and collecting time-related stats. This will enable callers interested only in running expired local timers to directly call into __timer_interrupt(). One of the use cases of this is the tick broadcast IPI handling in which the sleeping CPUs need to handle the local timers that have expired. Signed-off-by: Preeti U Murthy --- arch/powerpc/kernel/time.c | 81 +--- 1 file changed, 46 insertions(+), 35 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 3ff97db..df2989b 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -478,6 +478,47 @@ void arch_irq_work_raise(void) #endif /* CONFIG_IRQ_WORK */ +void __timer_interrupt(void) +{ + struct pt_regs *regs = get_irq_regs(); + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + struct clock_event_device *evt = &__get_cpu_var(decrementers); + u64 now; + + trace_timer_interrupt_entry(regs); + + if (test_irq_work_pending()) { + clear_irq_work_pending(); + irq_work_run(); + } + + now = get_tb_or_rtc(); + if (now >= *next_tb) { + *next_tb = ~(u64)0; + if (evt->event_handler) + evt->event_handler(evt); + __get_cpu_var(irq_stat).timer_irqs_event++; + } else { + now = *next_tb - now; + if (now <= DECREMENTER_MAX) + set_dec((int)now); + /* We may have raced with new irq work */ + if (test_irq_work_pending()) + set_dec(1); + __get_cpu_var(irq_stat).timer_irqs_others++; + } + +#ifdef CONFIG_PPC64 + /* collect purr register values often, for accurate calculations */ + if (firmware_has_feature(FW_FEATURE_SPLPAR)) { + struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); + cu->current_tb = mfspr(SPRN_PURR); + } +#endif + + trace_timer_interrupt_exit(regs); +} + /* * timer_interrupt - gets called when the decrementer overflows, * with interrupts disabled.
@@ -486,8 +527,6 @@ void timer_interrupt(struct pt_regs * regs) { struct pt_regs *old_regs; u64 *next_tb = &__get_cpu_var(decrementers_next_tb); - struct clock_event_device *evt = &__get_cpu_var(decrementers); - u64 now; /* Ensure a positive value is written to the decrementer, or else * some CPUs will continue to take decrementer exceptions. @@ -519,39 +558,7 @@ void timer_interrupt(struct pt_regs * regs) old_regs = set_irq_regs(regs); irq_enter(); - trace_timer_interrupt_entry(regs); - - if (test_irq_work_pending()) { - clear_irq_work_pending(); - irq_work_run(); - } - - now = get_tb_or_rtc(); - if (now >= *next_tb) { - *next_tb = ~(u64)0; - if (evt->event_handler) - evt->event_handler(evt); - __get_cpu_var(irq_stat).timer_irqs_event++; - } else { - now = *next_tb - now; - if (now <= DECREMENTER_MAX) - set_dec((int)now); - /* We may have raced with new irq work */ - if (test_irq_work_pending()) - set_dec(1); - __get_cpu_var(irq_stat).timer_irqs_others++; - } - -#ifdef CONFIG_PPC64 - /* collect purr register values often, for accurate calculations */ - if (firmware_has_feature(FW_FEATURE_SPLPAR)) { - struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array); - cu->current_tb = mfspr(SPRN_PURR); - } -#endif - - trace_timer_interrupt_exit(regs); - + __timer_interrupt(); irq_exit(); set_irq_regs(old_regs); } @@ -828,6 +835,10 @@ static void decrementer_set_mode(enum clock_event_mode mode, /* Interrupt handler for the timer broadcast IPI */ void tick_broadcast_ipi_handler(void) { + u64 *next_tb = &__get_cpu_var(decrementers_next_tb); + + *next_tb = get_tb_or_rtc(); + __timer_interrupt(); } static void register_decrementer_clockevent(int cpu) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RESEND PATCH V5 1/8] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
From: Srivatsa S. Bhat The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map to a common implementation - generic_smp_call_function_single_interrupt(). So, we can consolidate them and save one of the IPI message slots, (which are precious on powerpc, since only 4 of those slots are available). So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be used for something else in the future, if desired. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/kernel/smp.c | 12 +--- arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 4 files changed, 8 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 084e080..9f7356b 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_CALL_FUNC_SINGLE 2 +#define PPC_MSG_UNUSED 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ac2621a..ee7d76b 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t call_function_single_action(int irq, void *data) +static irqreturn_t unused_action(int irq, void *data) { - generic_smp_call_function_single_interrupt(); + /* This slot is unused and hence available for use, if needed */ return IRQ_HANDLED; } @@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, + [PPC_MSG_UNUSED] = unused_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", + [PPC_MSG_UNUSED] = "ipi unused", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); - if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE)) - generic_smp_call_function_single_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule); void arch_send_call_function_single_ipi(int cpu) { - do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); + do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } void arch_send_call_function_ipi_mask(const struct cpumask *mask) 
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 2d42f3b..adf3726 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -215,7 +215,7 @@ void iic_request_IPIs(void) { iic_request_ipi(PPC_MSG_CALL_FUNCTION); iic_request_ipi(PPC_MSG_RESCHEDULE); - iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE); + iic_request_ipi(PPC_MSG_UNUSED); iic_request_ipi(PPC_MSG_DEBUGGER_BREAK); } diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c index 4b35166..00d1a7c 100644 --- a/arch/powerpc/platforms/ps3/smp.c +++ b/arch/powerpc/platforms/ps3/smp.c @@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void) BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION!= 0); BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1); - BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2); + BUILD_BUG_ON(PPC_MSG_UNUSED != 2); BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3); for (i = 0; i < MSG_COUNT; i++) { -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RESEND PATCH V5 2/8] powerpc: Implement tick broadcast IPI as a fixed IPI message
From: Srivatsa S. Bhat

For scalability and performance reasons, we want the tick broadcast IPIs
to be handled as efficiently as possible. Fixed IPI messages are one of
the most efficient mechanisms available - they are faster than the
smp_call_function mechanism because the IPI handlers are fixed and hence
they don't involve costly operations such as adding IPI handlers to the
target CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
tick broadcast IPIs efficiently.

Signed-off-by: Srivatsa S. Bhat
[Functions renamed to tick_broadcast* and Changelog modified by
Preeti U. Murthy]
Signed-off-by: Preeti U. Murthy
Acked-by: Geoff Levand [For the PS3 part]
---
 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/include/asm/time.h         |    1 +
 arch/powerpc/kernel/smp.c               |   19 +++
 arch/powerpc/kernel/time.c              |    5 +
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 6 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 9f7356b..ff51046 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu);
  * in /proc/interrupts will be wrong!!!
 --Troy
 */
 #define PPC_MSG_CALL_FUNCTION	0
 #define PPC_MSG_RESCHEDULE	1
-#define PPC_MSG_UNUSED	2
+#define PPC_MSG_TICK_BROADCAST	2
 #define PPC_MSG_DEBUGGER_BREAK	3
 
 /* for irq controllers that have dedicated ipis per message (4) */
 
diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..1d428e6 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent;
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void tick_broadcast_ipi_handler(void);
 
 extern void generic_calibrate_decr(void);
 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ee7d76b..6f06f05 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t tick_broadcast_ipi_action(int irq, void *data)
 {
-	/* This slot is unused and hence available for use, if needed */
+	tick_broadcast_ipi_handler();
 	return IRQ_HANDLED;
 }
 
@@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
 	[PPC_MSG_CALL_FUNCTION] = call_function_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
-	[PPC_MSG_UNUSED] = unused_action,
+	[PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
 	[PPC_MSG_CALL_FUNCTION] = "ipi call function",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
-	[PPC_MSG_UNUSED] = "ipi unused",
+	[PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
 };
 
@@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void)
 			generic_smp_call_function_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
 			scheduler_ipi();
+		if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST))
+			tick_broadcast_ipi_handler();
 		if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK))
 			debug_ipi_action(0, NULL);
 	} while (info->messages);
@@ -289,6 +292,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 		do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
 }
 
+void tick_broadcast(const struct cpumask *mask)
+{
+	unsigned int cpu;
+
+	for_each_cpu(cpu, mask)
+		do_message_pass(cpu, PPC_MSG_TICK_BROADCAST);
+}
+
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 void smp_send_debugger_break(void)
 {
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index b3dab20..3ff97db 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -825,6 +825,11 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 		decrementer_set_next_event(DECREMENTER_MAX, dev);
 }
 
+/* Interrupt handler for the timer broadcast IPI */
+void tick_broadcast_ipi_handler(void)
+{
+}
+
 static void register_decrementer_clockevent(int cpu)
 {
 	struct clock_event_device *dec = &per_cpu(decrementers, cpu);
diff --git a/arch/powerpc/plat
[RESEND PATCH V5 7/8] cpuidle/powernv: Add "Fast-Sleep" CPU idle state
Fast sleep is one of the deep idle states on Power8, in which the local
timers of the CPUs stop. On PowerPC we do not have an external clock
device which can handle the wakeup of such CPUs. Now that we have the
support in the tick broadcast framework for archs that do not sport such
a device, and the low level support for fast sleep, enable it in the
cpuidle framework on PowerNV.

Signed-off-by: Preeti U Murthy
---
 arch/powerpc/Kconfig              |    2 ++
 arch/powerpc/kernel/time.c        |    2 +-
 drivers/cpuidle/cpuidle-powernv.c |   42 +
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index fa39517..ec91584 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -129,6 +129,8 @@ config PPC
 	select GENERIC_CMOS_UPDATE
 	select GENERIC_TIME_VSYSCALL_OLD
 	select GENERIC_CLOCKEVENTS
+	select GENERIC_CLOCKEVENTS_BROADCAST
+	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select HAVE_MOD_ARCH_SPECIFIC
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index df2989b..95fa5ce 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -106,7 +106,7 @@ struct clock_event_device decrementer_clockevent = {
 	.irq		= 0,
 	.set_next_event	= decrementer_set_next_event,
 	.set_mode	= decrementer_set_mode,
-	.features	= CLOCK_EVT_FEAT_ONESHOT,
+	.features	= CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_C3STOP,
 };
 EXPORT_SYMBOL(decrementer_clockevent);
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 78fd174..90f0c2b 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -49,6 +50,40 @@ static int nap_loop(struct cpuidle_device *dev,
 	return index;
 }
 
+static int fastsleep_loop(struct cpuidle_device *dev,
+				struct cpuidle_driver *drv,
+				int index)
+{
+	int cpu = dev->cpu;
+	unsigned long old_lpcr = mfspr(SPRN_LPCR);
+	unsigned long new_lpcr;
+
+	if (unlikely(system_state < SYSTEM_RUNNING))
+		return index;
+
+	new_lpcr = old_lpcr;
+	new_lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
+
+	/* exit powersave upon external interrupt, but not decrementer
+	 * interrupt. Emulate sleep.
+	 */
+	new_lpcr |= LPCR_PECE0;
+
+	if (clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu)) {
+		new_lpcr |= LPCR_PECE1;
+		mtspr(SPRN_LPCR, new_lpcr);
+		power7_nap();
+	} else {
+		mtspr(SPRN_LPCR, new_lpcr);
+		power7_sleep();
+	}
+	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+
+	mtspr(SPRN_LPCR, old_lpcr);
+
+	return index;
+}
+
 /*
  * States for dedicated partition case.
  */
@@ -67,6 +102,13 @@ static struct cpuidle_state powernv_states[] = {
 		.exit_latency = 10,
 		.target_residency = 100,
 		.enter = &nap_loop },
+	{ /* Fastsleep */
+		.name = "fastsleep",
+		.desc = "fastsleep",
+		.flags = CPUIDLE_FLAG_TIME_VALID,
+		.exit_latency = 10,
+		.target_residency = 100,
+		.enter = &fastsleep_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,
[RESEND PATCH V5 4/8] powernv/cpuidle: Add context management for Fast Sleep
From: Vaidyanathan Srinivasan

Before adding Fast-Sleep into the cpuidle framework, some low level
support needs to be added to enable it. This includes saving and
restoring of certain registers at entry and exit time of this state,
respectively, just like we do in the NAP idle state.

Signed-off-by: Vaidyanathan Srinivasan
[Changelog modified by Preeti U. Murthy]
Signed-off-by: Preeti U. Murthy
---
 arch/powerpc/include/asm/processor.h |    1 +
 arch/powerpc/kernel/exceptions-64s.S |   10 -
 arch/powerpc/kernel/idle_power7.S    |   63 --
 3 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index b62de43..d660dc3 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -450,6 +450,7 @@ enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF};
 
 extern int powersave_nap;	/* set if nap mode can be used in idle loop */
 extern void power7_nap(void);
+extern void power7_sleep(void);
 extern void flush_instruction_cache(void);
 extern void hard_reset_now(void);
 extern void poweroff_now(void);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 38d5073..b01a9cb 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -121,9 +121,10 @@ BEGIN_FTR_SECTION
 	cmpwi	cr1,r13,2
 	/* Total loss of HV state is fatal, we could try to use the
 	 * PIR to locate a PACA, then use an emergency stack etc...
-	 * but for now, let's just stay stuck here
+	 * OPAL v3 based powernv platforms have new idle states
+	 * which fall in this catagory.
 	 */
-	bgt	cr1,.
+	bgt	cr1,8f
 	GET_PACA(r13)
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
@@ -141,6 +142,11 @@ BEGIN_FTR_SECTION
 	beq	cr1,2f
 	b	.power7_wakeup_noloss
 2:	b	.power7_wakeup_loss
+
+	/* Fast Sleep wakeup on PowerNV */
+8:	GET_PACA(r13)
+	b	.power7_wakeup_loss
+
 9:
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 #endif /* CONFIG_PPC_P7_NAP */
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 3fdef0f..14f78be 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -20,17 +20,27 @@
 
 #undef DEBUG
 
-	.text
+/* Idle state entry routines */
 
-_GLOBAL(power7_idle)
-	/* Now check if user or arch enabled NAP mode */
-	LOAD_REG_ADDRBASE(r3,powersave_nap)
-	lwz	r4,ADDROFF(powersave_nap)(r3)
-	cmpwi	0,r4,0
-	beqlr
-	/* fall through */
+#define	IDLE_STATE_ENTER_SEQ(IDLE_INST)	\
+	/* Magic NAP/SLEEP/WINKLE mode enter sequence */	\
+	std	r0,0(r1);	\
+	ptesync;	\
+	ld	r0,0(r1);	\
+1:	cmp	cr0,r0,r0;	\
+	bne	1b;	\
+	IDLE_INST;	\
+	b	.
 
-_GLOBAL(power7_nap)
+	.text
+
+/*
+ * Pass requested state in r3:
+ *	0 - nap
+ *	1 - sleep
+ */
+_GLOBAL(power7_powersave_common)
+	/* Use r3 to pass state nap/sleep/winkle */
 	/* NAP is a state loss, we create a regs frame on the
 	 * stack, fill it up with the state we care about and
 	 * stick a pointer to it in PACAR1. We really only
@@ -79,8 +89,8 @@ _GLOBAL(power7_nap)
 	/* Continue saving state */
 	SAVE_GPR(2, r1)
 	SAVE_NVGPRS(r1)
-	mfcr	r3
-	std	r3,_CCR(r1)
+	mfcr	r4
+	std	r4,_CCR(r1)
 	std	r9,_MSR(r1)
 	std	r1,PACAR1(r13)
 
@@ -90,15 +100,30 @@ _GLOBAL(power7_enter_nap_mode)
 	li	r4,KVM_HWTHREAD_IN_NAP
 	stb	r4,HSTATE_HWTHREAD_STATE(r13)
 #endif
+	cmpwi	cr0,r3,1
+	beq	2f
+	IDLE_STATE_ENTER_SEQ(PPC_NAP)
+	/* No return */
+2:	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+	/* No return */
 
-	/* Magic NAP mode enter sequence */
-	std	r0,0(r1)
-	ptesync
-	ld	r0,0(r1)
-1:	cmp	cr0,r0,r0
-	bne	1b
-	PPC_NAP
-	b	.
+_GLOBAL(power7_idle)
+	/* Now check if user or arch enabled NAP mode */
+	LOAD_REG_ADDRBASE(r3,powersave_nap)
+	lwz	r4,ADDROFF(powersave_nap)(r3)
+	cmpwi	0,r4,0
+	beqlr
+	/* fall through */
+
+_GLOBAL(power7_nap)
+	li	r3,0
+	b	power7_powersave_common
+	/* No return */
+
+_GLOBAL(power7_sleep)
+	li	r3,1
+	b	power7_powersave_common
+	/* No return */
 
 _GLOBAL(power7_wakeup_loss)
 	ld	r1,PACAR1(r13)
[RESEND PATCH V5 5/8] powermgt: Add OPAL call to resync timebase on wakeup
From: Vaidyanathan Srinivasan

During "Fast-sleep" and deeper power savings states, the decrementer and
timebase could be stopped, making them out of sync with the rest of the
cores in the system. Add a firmware call to request the platform to
resync the timebase using low level platform methods.

Signed-off-by: Vaidyanathan Srinivasan
Signed-off-by: Preeti U. Murthy
---
 arch/powerpc/include/asm/opal.h                |    2 ++
 arch/powerpc/kernel/exceptions-64s.S           |    2 +-
 arch/powerpc/kernel/idle_power7.S              |   27
 arch/powerpc/platforms/powernv/opal-wrappers.S |    1 +
 4 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 9a87b44..8c4829f 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -154,6 +154,7 @@ extern int opal_enter_rtas(struct rtas_args *args,
 #define OPAL_FLASH_VALIDATE		76
 #define OPAL_FLASH_MANAGE		77
 #define OPAL_FLASH_UPDATE		78
+#define OPAL_RESYNC_TIMEBASE		79
 #define OPAL_GET_MSG			85
 #define OPAL_CHECK_ASYNC_COMPLETION	86
@@ -863,6 +864,7 @@ extern void opal_flash_init(void);
 extern int opal_machine_check(struct pt_regs *regs);
 
 extern void opal_shutdown(void);
+extern int opal_resync_timebase(void);
 
 extern void opal_lpc_init(void);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index b01a9cb..9533d7a 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -145,7 +145,7 @@ BEGIN_FTR_SECTION
 
 	/* Fast Sleep wakeup on PowerNV */
 8:	GET_PACA(r13)
-	b	.power7_wakeup_loss
+	b	.power7_wakeup_tb_loss
 
 9:
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 14f78be..c3ab869 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 
 #undef DEBUG
 
@@ -125,6 +126,32 @@ _GLOBAL(power7_sleep)
 	b	power7_powersave_common
 	/* No return */
 
+_GLOBAL(power7_wakeup_tb_loss)
+	ld	r2,PACATOC(r13);
+	ld	r1,PACAR1(r13)
+
+	/* Time base re-sync */
+	li	r0,OPAL_RESYNC_TIMEBASE
+	LOAD_REG_ADDR(r11,opal);
+	ld	r12,8(r11);
+	ld	r2,0(r11);
+	mtctr	r12
+	bctrl
+
+	/* TODO: Check r3 for failure */
+
+	REST_NVGPRS(r1)
+	REST_GPR(2, r1)
+	ld	r3,_CCR(r1)
+	ld	r4,_MSR(r1)
+	ld	r5,_NIP(r1)
+	addi	r1,r1,INT_FRAME_SIZE
+	mtcr	r3
+	mfspr	r3,SPRN_SRR1		/* Return SRR1 */
+	mtspr	SPRN_SRR1,r4
+	mtspr	SPRN_SRR0,r5
+	rfid
+
 _GLOBAL(power7_wakeup_loss)
 	ld	r1,PACAR1(r13)
 	REST_NVGPRS(r1)
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 719aa5c..a11a87c 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -126,5 +126,6 @@ OPAL_CALL(opal_return_cpu,	OPAL_RETURN_CPU);
 OPAL_CALL(opal_validate_flash,	OPAL_FLASH_VALIDATE);
 OPAL_CALL(opal_manage_flash,	OPAL_FLASH_MANAGE);
 OPAL_CALL(opal_update_flash,	OPAL_FLASH_UPDATE);
+OPAL_CALL(opal_resync_timebase,	OPAL_RESYNC_TIMEBASE);
 OPAL_CALL(opal_get_msg,		OPAL_GET_MSG);
 OPAL_CALL(opal_check_completion,	OPAL_CHECK_ASYNC_COMPLETION);
[RESEND PATCH V5 6/8] time/cpuidle: Support in tick broadcast framework in the absence of external clock device
On some architectures, in certain CPU deep idle states the local timers
stop. An external clock device is used to wake up these CPUs. The kernel
support for the wakeup of these CPUs is provided by the tick broadcast
framework, by using the external clock device as the wakeup source.

However, not all architecture implementations provide such an external
clock device; some PowerPC ones do not. This patch adds support in the
broadcast framework to handle the wakeup of CPUs in deep idle states on
such systems by queuing a hrtimer on one of the CPUs, which is then
responsible for waking the CPUs in deep idle. This CPU is identified as
the bc_cpu. Each time the hrtimer expires, it is reprogrammed for the
next wakeup of the CPUs in deep idle state, after handling broadcast.
However, when a CPU is about to enter deep idle state with its wakeup
time earlier than the time at which the hrtimer is currently programmed,
it *becomes the new bc_cpu* and restarts the hrtimer on itself. This way
the job of doing broadcast is handed around to the CPUs that ask for the
earliest wakeup just before entering deep idle state. This is consistent
with what happens when an external clock device is present: the smp
affinity of that clock device is set to the CPU with the earliest wakeup.

The important point here is that the bc_cpu cannot enter deep idle
state, since it has a hrtimer queued to wake up the other CPUs in deep
idle; hence it cannot have its local timer stopped. Therefore, for such
a CPU, the BROADCAST_ENTER notification has to fail, implying that it
cannot enter deep idle state. On architectures where an external clock
device is present, all CPUs can enter deep idle.

During hotplug of the bc_cpu, the job of doing a broadcast is assigned
to the first cpu in the broadcast mask. This newly nominated bc_cpu is
woken up by an IPI so as to queue the above mentioned hrtimer on it.
Signed-off-by: Preeti U Murthy
---
 include/linux/clockchips.h   |    4 -
 kernel/time/clockevents.c    |    9 +-
 kernel/time/tick-broadcast.c |  192 ++
 kernel/time/tick-internal.h  |    8 +-
 4 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 493aa02..bbda37b 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; }
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
-extern void clockevents_notify(unsigned long reason, void *arg);
+extern int clockevents_notify(unsigned long reason, void *arg);
 #else
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) {}
 #endif
 
 #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */
 
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 086ad60..d61404e 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -524,12 +524,13 @@ void clockevents_resume(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 /**
  * clockevents_notify - notification about relevant events
+ * Returns non zero on error.
 */
-void clockevents_notify(unsigned long reason, void *arg)
+int clockevents_notify(unsigned long reason, void *arg)
 {
 	struct clock_event_device *dev, *tmp;
 	unsigned long flags;
-	int cpu;
+	int cpu, ret = 0;
 
 	raw_spin_lock_irqsave(&clockevents_lock, flags);
 
@@ -542,11 +543,12 @@ void clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_BROADCAST_ENTER:
 	case CLOCK_EVT_NOTIFY_BROADCAST_EXIT:
-		tick_broadcast_oneshot_control(reason);
+		ret = tick_broadcast_oneshot_control(reason);
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_handover_broadcast_cpu(arg);
 		break;
 
 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -585,6 +587,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 		break;
 	}
 	raw_spin_unlock_irqrestore(&clockevents_lock, flags);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(clockevents_notify);
 
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 9532690..1c23912 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 
 #include "tick-internal.h"
 
@@ -35,6 +36,15 @@ static cpumask_var_t tmpmask;
 static DEFINE_RAW_SPINLOCK(tick_broadcast_lock);
 static int tick_broadcast_force;
 
+/*
+ * Helper variables for handling broadcast in the absence of a
+ * tick_broadcast_device.
+ */
+static struct hrtimer *bc_hrtimer;
+static int bc_cpu = -1;
+static ktime_t bc_next_wakeup;
+static int hrtimer_initialized = 0;
+
 #ifdef CONFIG_TICK_ONESHOT
 static void tick_broadcast_clear_oneshot(int cpu);
 #else
@@ -528,6 +538,20 @@ static int tick_broadcast
[RESEND PATCH V5 8/8] cpuidle/powernv: Parse device tree to setup idle states
Add deep idle states such as nap and fast sleep to the cpuidle state
table only if they are discovered from the device tree during cpuidle
initialization.

Signed-off-by: Preeti U Murthy
---
 drivers/cpuidle/cpuidle-powernv.c |   81 +
 1 file changed, 64 insertions(+), 17 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 90f0c2b..b3face5 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -12,10 +12,17 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
 
+/* Flags and constants used in PowerNV platform */
+
+#define MAX_POWERNV_IDLE_STATES	8
+#define IDLE_USE_INST_NAP	0x0001 /* Use nap instruction */
+#define IDLE_USE_INST_SLEEP	0x0002 /* Use sleep instruction */
+
 struct cpuidle_driver powernv_idle_driver = {
 	.name  = "powernv_idle",
 	.owner = THIS_MODULE,
@@ -87,7 +94,7 @@ static int fastsleep_loop(struct cpuidle_device *dev,
 /*
  * States for dedicated partition case.
  */
-static struct cpuidle_state powernv_states[] = {
+static struct cpuidle_state powernv_states[MAX_POWERNV_IDLE_STATES] = {
 	{ /* Snooze */
 		.name = "snooze",
 		.desc = "snooze",
@@ -95,20 +102,6 @@ static struct cpuidle_state powernv_states[] = {
 		.exit_latency = 0,
 		.target_residency = 0,
 		.enter = &snooze_loop },
-	{ /* NAP */
-		.name = "NAP",
-		.desc = "NAP",
-		.flags = CPUIDLE_FLAG_TIME_VALID,
-		.exit_latency = 10,
-		.target_residency = 100,
-		.enter = &nap_loop },
-	{ /* Fastsleep */
-		.name = "fastsleep",
-		.desc = "fastsleep",
-		.flags = CPUIDLE_FLAG_TIME_VALID,
-		.exit_latency = 10,
-		.target_residency = 100,
-		.enter = &fastsleep_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,
@@ -169,19 +162,73 @@ static int powernv_cpuidle_driver_init(void)
 	return 0;
 }
 
+static int powernv_add_idle_states(void)
+{
+	struct device_node *power_mgt;
+	struct property *prop;
+	int nr_idle_states = 1; /* Snooze */
+	int dt_idle_states;
+	u32 *flags;
+	int i;
+
+	/* Currently we have snooze statically defined */
+
+	power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
+	if (!power_mgt) {
+		pr_warn("opal: PowerMgmt Node not found\n");
+		return nr_idle_states;
+	}
+
+	prop = of_find_property(power_mgt, "ibm,cpu-idle-state-flags", NULL);
+	if (!prop) {
+		pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n");
+		return nr_idle_states;
+	}
+
+	dt_idle_states = prop->length / sizeof(u32);
+	flags = (u32 *) prop->value;
+
+	for (i = 0; i < dt_idle_states; i++) {
+
+		if (flags[i] & IDLE_USE_INST_NAP) {
+			/* Add NAP state */
+			strcpy(powernv_states[nr_idle_states].name, "Nap");
+			strcpy(powernv_states[nr_idle_states].desc, "Nap");
+			powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID;
+			powernv_states[nr_idle_states].exit_latency = 10;
+			powernv_states[nr_idle_states].target_residency = 100;
+			powernv_states[nr_idle_states].enter = &nap_loop;
+			nr_idle_states++;
+		}
+
+		if (flags[i] & IDLE_USE_INST_SLEEP) {
+			/* Add FASTSLEEP state */
+			strcpy(powernv_states[nr_idle_states].name, "FastSleep");
+			strcpy(powernv_states[nr_idle_states].desc, "FastSleep");
+			powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID;
+			powernv_states[nr_idle_states].exit_latency = 300;
+			powernv_states[nr_idle_states].target_residency = 100;
+			powernv_states[nr_idle_states].enter = &fastsleep_loop;
+			nr_idle_states++;
+		}
+	}
+
+	return nr_idle_states;
+}
+
 /*
  * powernv_idle_probe()
  * Choose state table for shared versus dedicated partition
 */
 static int powernv_idle_probe(void)
 {
-
 	if (cpuidle_disable != IDLE_NO_OVERRIDE)
 		return -ENODEV;
 
 	if (firmware_has_feature(FW_FEATURE_OPALv3)) {
 		cpuidle_state_table = powernv_states;
-		max_idle_state = ARRAY_SIZE(powernv_states);
+		/* Device tree can indicate more idle states */
+		max_idle_stat
Re: [PATCH V5 6/8] time/cpuidle: Support in tick broadcast framework in the absence of external clock device
Hi Thomas,

Thank you very much for the review.

On 01/22/2014 06:57 PM, Thomas Gleixner wrote:
> On Wed, 15 Jan 2014, Preeti U Murthy wrote:
>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>> index 086ad60..d61404e 100644
>> --- a/kernel/time/clockevents.c
>> +++ b/kernel/time/clockevents.c
>> @@ -524,12 +524,13 @@ void clockevents_resume(void)
>>  #ifdef CONFIG_GENERIC_CLOCKEVENTS
>>  /**
>>   * clockevents_notify - notification about relevant events
>> + * Returns non zero on error.
>>   */
>> -void clockevents_notify(unsigned long reason, void *arg)
>> +int clockevents_notify(unsigned long reason, void *arg)
>>  {
>
> The interface change of clockevents_notify wants to be a separate
> patch.
>
>> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
>> index 9532690..1c23912 100644
>> --- a/kernel/time/tick-broadcast.c
>> +++ b/kernel/time/tick-broadcast.c
>> @@ -20,6 +20,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  #include "tick-internal.h"
>>
>> @@ -35,6 +36,15 @@ static cpumask_var_t tmpmask;
>>  static DEFINE_RAW_SPINLOCK(tick_broadcast_lock);
>>  static int tick_broadcast_force;
>>
>> +/*
>> + * Helper variables for handling broadcast in the absence of a
>> + * tick_broadcast_device.
>> + */
>> +static struct hrtimer *bc_hrtimer;
>> +static int bc_cpu = -1;
>> +static ktime_t bc_next_wakeup;
>
> Why do you need another variable to store the expiry time? The
> broadcast code already knows it and the hrtimer expiry value gives you
> the same information for free.

The reason was that functions like tick_handle_oneshot_broadcast() and
tick_broadcast_switch_to_oneshot() were using
tick_broadcast_device.evtdev->next_event to set/get the next wakeups.
But since this patchset introduced an explicit hrtimer for archs which
do not have such a device, I wanted these functions to use a generic
parameter to set/get the next wakeups, without having to know about the
existence of this hrtimer, if at all.
And to program the hrtimer or tick broadcast device, whichever was
present, only when the next event was to be set. But with your concept
patch below, we will not be required to do this.

>> +static int hrtimer_initialized = 0;
>
> What's the point of this hrtimer_initialized dance? Why not simply
> making the hrtimer static and avoid that all together. Also adding the
> initialization into tick_broadcast_oneshot_available() is
> braindamaged. Why not adding this to tick_broadcast_init() which is
> the proper place to do?

Right, I agree. This hrtimer initialization should have been in
tick_broadcast_init(), and a simple static declaration would have done
the job.

> Aside of that you are making this hrtimer mode unconditional, which
> might break existing systems which are not aware of the hrtimer
> implications.
>
> What you really want is a pseudo clock event device which has the
> proper functions for handling the timer and you can register it from
> your architecture code. The broadcast core code needs a few tweaks to
> avoid the shutdown of the cpu local clock event device, but aside of
> that the whole thing just falls into place. So architectures can use
> this if they want and are sure that their low level idle code knows
> about the deep idle preventing return value of
> clockevents_notify(). Once that works you can register the hrtimer
> based broadcast device and a real hardware broadcast device with a
> higher rating. It just works.

I now completely see your point. This will surely break on archs which
are not using the return value of the BROADCAST_ENTER notification. I am
not even giving them a choice about using the hrtimer mode of the
broadcast framework, and yet am expecting them to take action on the
failed return of BROADCAST_ENTER. I missed that critical point.

I went through the below patch and am able to see how you are solving
this problem.

> Find an incomplete and nonfunctional concept patch below. It should be
> simple to make it work for real.
Thank you very much for the valuable review. The below patch makes your
points very clear. Let me try this out.

Regards
Preeti U Murthy

> Thanks,
>
> 	tglx
>
> Index: linux-2.6/include/linux/clockchips.h
> ===
> --- linux-2.6.orig/include/linux/clockchips.h
> +++ linux-2.6/include/linux/clockchips.h
> @@ -62,6 +62,11 @@ enum clock_event_mode {
>  #define CLOCK_EVT_FEAT_DYNIRQ	0x20
>  #define CLOCK_EVT_FEAT_PERCPU	0x40
>
> +/*
> + * Clockevent device is based on a hrtimer for broa
Re: [PATCH V2] cpuidle/governors: Fix logic in selection of idle states
Hi Daniel,

Thank you for the review.

On 01/22/2014 01:59 PM, Daniel Lezcano wrote:
> On 01/17/2014 05:33 AM, Preeti U Murthy wrote:
>>
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index a55e68f..831b664 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -131,8 +131,9 @@ int cpuidle_idle_call(void)
>>
>>      /* ask the governor for the next state */
>>      next_state = cpuidle_curr_governor->select(drv, dev);
>> +
>> +    dev->last_residency = 0;
>>      if (need_resched()) {
>> -        dev->last_residency = 0;
>
> Why do you need to do this change ?

So as to keep the last_residency consistent with the case that this
patch addresses: where no idle state could be selected due to strict
latency requirements or disabled states, and hence the cpu exits without
entering idle. Else it would contain the stale value from the previous
idle state entry.

But come to think of it, dev->last_residency is not used when the last
entered idle state index is -1. So I have reverted this change as well
in the revised patch below, along with mentioning the reason in the last
paragraph of the changelog.

>>      /* give the governor an opportunity to reflect on the
>>       * outcome */
>>      if (cpuidle_curr_governor->reflect)
>>          cpuidle_curr_governor->reflect(dev, next_state);
>> @@ -140,6 +141,18 @@ int cpuidle_idle_call(void)
>>      return 0;
>>  }
>>
>> +    /* Unlike in the need_resched() case, we return here because the
>> +     * governor did not find a suitable idle state. However idle is
>> +     * still in progress as we are not asked to reschedule. Hence we
>> +     * return without enabling interrupts.
>
> That will lead to a WARN.
>
>> +     * NOTE: The return code should still be success, since the
>> +     * verdict of this call is "do not enter any idle state" and not
>> +     * a failed call due to errors.
>> +     */
>> +    if (next_state < 0)
>> +        return 0;
>> +
>
> Returning from here breaks the symmetry of the trace.
I have addressed the above concerns in the patch found below. Does the
rest of the patch look sound?

Regards
Preeti U Murthy

--
cpuidle/governors: Fix logic in selection of idle states

From: Preeti U Murthy

The cpuidle governors today are not handling scenarios where no idle
state can be chosen. Such scenarios could arise if the user has disabled
all the idle states at runtime, or the latency requirement from the cpus
is very strict.

The menu governor returns the 0th index of the idle state table when no
other idle state is suitable. It does so even when the idle state
corresponding to this index is disabled, or the latency requirement is
strict and the exit_latency of the lowest idle state is also not
acceptable. Hence this patch fixes this logic in the menu governor by
defaulting to an idle state index of -1 unless any other state is
suitable.

The ladder governor needs a few more fixes in addition to those required
in the menu governor. When the ladder governor decides to demote the
idle state of a CPU, it does not check if the lower idle states are
enabled. Add this logic, in addition to the logic where it chooses an
index of -1 if it can neither promote nor demote the idle state of a cpu,
nor choose the current idle state.

cpuidle_idle_call() will return if the governor decides upon not
entering any idle state. However it cannot return an error code, because
all archs have the logic today that if the call to cpuidle_idle_call()
fails, it means that the cpuidle driver failed to *function*; for
instance due to errors during registration. As a result they end up
deciding upon a default idle state on their own, which could very well
be a deep idle state. This is incorrect in cases where no idle state is
suitable. Besides, for the scenario that this patch is addressing, the
call actually succeeds; it is just that no idle state is thought to be
suitable by the governors. Under such a circumstance, return a success
code without entering any idle state.
The consequence of this patch additionally on the menu governor is that as long as a valid idle state cannot be chosen, the cpuidle statistics that this governor uses to predict the next idle state remain untouched from the last valid idle state. This is because an idle state is not even being predicted in this path, hence there is no point correcting the prediction either. Signed-off-by: Preeti U Murthy Changes from V1:https://lkml.org/lkml/2014/1/14/26 1. Change the return code to success from -EINVAL due to the reason mentioned in the c
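The selection rule described in the changelog — default to -1 and only settle on a state that is both enabled and within the latency limit — can be distilled into a small userspace sketch. This is not the kernel's code; `struct idle_state`, `select_state` and the constants are illustrative stand-ins for the governor logic:

```c
#include <assert.h>

#define NO_STATE (-1) /* the governor's "no suitable state" verdict */

struct idle_state {
	int disabled;
	int exit_latency; /* worst-case wakeup latency, in us */
};

/*
 * Sketch of the fixed selection: start from -1 and only pick an index
 * whose state is enabled and whose exit latency meets the requirement.
 */
static int select_state(const struct idle_state *s, int n, int latency_req)
{
	int i, idx = NO_STATE;

	for (i = 0; i < n; i++) {
		if (s[i].disabled || s[i].exit_latency > latency_req)
			continue;
		idx = i; /* deepest acceptable state so far */
	}
	return idx;
}
```

With all states disabled or too slow, the caller sees -1 and simply stays out of idle instead of blindly entering state 0.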
Re: [PATCH V5 6/8] time/cpuidle: Support in tick broadcast framework in the absence of external clock device
Hi Thomas, The below patch works pretty much as is. I tried this out with deep idle states on our system. Looking through the code and analysing corner cases also did not bring out any issues to me. I will send out a patch V2 of this. Regards Preeti U Murthy On 01/22/2014 06:57 PM, Thomas Gleixner wrote: > On Wed, 15 Jan 2014, Preeti U Murthy wrote: >> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c >> index 086ad60..d61404e 100644 >> --- a/kernel/time/clockevents.c >> +++ b/kernel/time/clockevents.c >> @@ -524,12 +524,13 @@ void clockevents_resume(void) >> #ifdef CONFIG_GENERIC_CLOCKEVENTS >> /** >> * clockevents_notify - notification about relevant events >> + * Returns non zero on error. >> */ >> -void clockevents_notify(unsigned long reason, void *arg) >> +int clockevents_notify(unsigned long reason, void *arg) >> { > > The interface change of clockevents_notify wants to be a separate > patch. > >> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c >> index 9532690..1c23912 100644 >> --- a/kernel/time/tick-broadcast.c >> +++ b/kernel/time/tick-broadcast.c >> @@ -20,6 +20,7 @@ >> #include >> #include >> #include >> +#include >> >> #include "tick-internal.h" >> >> @@ -35,6 +36,15 @@ static cpumask_var_t tmpmask; >> static DEFINE_RAW_SPINLOCK(tick_broadcast_lock); >> static int tick_broadcast_force; >> >> +/* >> + * Helper variables for handling broadcast in the absence of a >> + * tick_broadcast_device. >> + * */ >> +static struct hrtimer *bc_hrtimer; >> +static int bc_cpu = -1; >> +static ktime_t bc_next_wakeup; > > Why do you need another variable to store the expiry time? The > broadcast code already knows it and the hrtimer expiry value gives you > the same information for free. > >> +static int hrtimer_initialized = 0; > > What's the point of this hrtimer_initialized dance? Why not simply > making the hrtimer static and avoid that all together. 
Also adding the > initialization into tick_broadcast_oneshot_available() is > braindamaged. Why not adding this to tick_broadcast_init() which is > the proper place to do? > > Aside of that you are making this hrtimer mode unconditional, which > might break existing systems which are not aware of the hrtimer > implications. > > What you really want is a pseudo clock event device which has the > proper functions for handling the timer and you can register it from > your architecture code. The broadcast core code needs a few tweaks to > avoid the shutdown of the cpu local clock event device, but aside of > that the whole thing just falls into place. So architectures can use > this if they want and are sure that their low level idle code knows > about the deep idle preventing return value of > clockevents_notify(). Once that works you can register the hrtimer > based broadcast device and a real hardware broadcast device with a > higher rating. It just works. > > Find an incomplete and nonfunctional concept patch below. It should be > simple to make it work for real. 
> > Thanks, > > tglx > > Index: linux-2.6/include/linux/clockchips.h > === > --- linux-2.6.orig/include/linux/clockchips.h > +++ linux-2.6/include/linux/clockchips.h > @@ -62,6 +62,11 @@ enum clock_event_mode { > #define CLOCK_EVT_FEAT_DYNIRQ0x20 > #define CLOCK_EVT_FEAT_PERCPU0x40 > > +/* > + * Clockevent device is based on a hrtimer for broadcast > + */ > +#define CLOCK_EVT_FEAT_HRTIMER 0x80 > + > /** > * struct clock_event_device - clock event device descriptor > * @event_handler: Assigned by the framework to be called by the low > @@ -83,6 +88,7 @@ enum clock_event_mode { > * @name:ptr to clock event name > * @rating: variable to rate clock event devices > * @irq: IRQ number (only for non CPU local devices) > + * @bound_on:Bound on CPU > * @cpumask: cpumask to indicate for which CPUs this device works > * @list:list head for the management code > * @owner: module reference > @@ -113,6 +119,7 @@ struct clock_event_device { > const char *name; > int rating; > int irq; > + int bound_on; > const struct cpumask*cpumask; > struct list_headlist; > struct module *owner; > Index: linux-2.6/kernel/time/tick-broadcast-hrtimer.c
[PATCH V2 0/2] time/cpuidle: Support in tick broadcast framework in absence of external clock device
This earlier version of this patchset can be found here: https://lkml.org/lkml/2013/12/12/687. This version has been based on the discussion in http://www.kernelhub.org/?p=2=399516. This patchset provides the hooks that the architectures without an external clock device and deep idle states where the local timers stop can make use of. Presently we are in need of this support on certain implementations of PowerPC. This patchset has been used on PowerPC for testing with --- Preeti U Murthy (1): time: Change the return type of clockevents_notify() to integer Thomas Gleixner (1): tick/cpuidle: Initialize hrtimer mode of broadcast include/linux/clockchips.h | 15 - kernel/time/Makefile |2 - kernel/time/clockevents.c|8 ++- kernel/time/tick-broadcast-hrtimer.c | 102 ++ kernel/time/tick-broadcast.c | 51 - kernel/time/tick-internal.h |6 +- 6 files changed, 171 insertions(+), 13 deletions(-) create mode 100644 kernel/time/tick-broadcast-hrtimer.c -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 2/2] tick/cpuidle: Initialize hrtimer mode of broadcast
From: Thomas Gleixner On some architectures, in certain CPU deep idle states the local timers stop. An external clock device is used to wakeup these CPUs. The kernel support for the wakeup of these CPUs is provided by the tick broadcast framework by using the external clock device as the wakeup source. However not all implementations of architectures provide such an external clock device. This patch includes support in the broadcast framework to handle the wakeup of the CPUs in deep idle states on such systems by queuing a hrtimer on one of the CPUs, which is meant to handle the wakeup of CPUs in deep idle states. This patchset introduces a pseudo clock device which can be registered by the archs as tick_broadcast_device in the absence of a real external clock device. Once registered, the broadcast framework will work as is for these architectures as long as the archs take care of the BROADCAST_ENTER notification failing for one of the CPUs. This CPU is made the stand by CPU to handle wakeup of the CPUs in deep idle and it *must not enter deep idle states*. The CPU with the earliest wakeup is chosen to be this CPU. Hence this way the stand by CPU dynamically moves around and so does the hrtimer which is queued to trigger at the next earliest wakeup time. This is consistent with the case where an external clock device is present. The smp affinity of this clock device is set to the CPU with the earliest wakeup. This patchset handles the hotplug of the stand by CPU as well by moving the hrtimer on to the CPU handling the CPU_DEAD notification. 
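The "standby CPU moves to whichever CPU has the earliest wakeup" rule above can be modeled in a few lines of plain C. This is a toy model, not kernel code; `pick_standby_cpu` and the array of wakeup times are made-up names:

```c
#include <assert.h>
#include <stdint.h>

/*
 * The CPU whose next wakeup comes first stays out of deep idle and
 * arms the broadcast hrtimer; everyone else may enter deep idle and
 * be woken by its IPI.
 */
static int pick_standby_cpu(const int64_t *next_wakeup, int ncpus)
{
	int cpu, standby = 0;

	for (cpu = 1; cpu < ncpus; cpu++)
		if (next_wakeup[cpu] < next_wakeup[standby])
			standby = cpu;
	return standby;
}
```

Re-evaluating this on every idle entry is what makes the standby role (and the hrtimer) migrate dynamically, mirroring how a real external clock device's affinity would be reprogrammed.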
Signed-off-by: Preeti U Murthy [Added Changelog and code to handle reprogramming of hrtimer] --- include/linux/clockchips.h |9 +++ kernel/time/Makefile |2 - kernel/time/tick-broadcast-hrtimer.c | 102 ++ kernel/time/tick-broadcast.c | 45 +++ 4 files changed, 156 insertions(+), 2 deletions(-) create mode 100644 kernel/time/tick-broadcast-hrtimer.c diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index ac81b56..2293025 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -62,6 +62,11 @@ enum clock_event_mode { #define CLOCK_EVT_FEAT_DYNIRQ 0x20 #define CLOCK_EVT_FEAT_PERCPU 0x40 +/* + * Clockevent device is based on a hrtimer for broadcast + */ +#define CLOCK_EVT_FEAT_HRTIMER 0x80 + /** * struct clock_event_device - clock event device descriptor * @event_handler: Assigned by the framework to be called by the low @@ -83,6 +88,7 @@ enum clock_event_mode { * @name: ptr to clock event name * @rating:variable to rate clock event devices * @irq: IRQ number (only for non CPU local devices) + * @bound_on: Bound on CPU * @cpumask: cpumask to indicate for which CPUs this device works * @list: list head for the management code * @owner: module reference @@ -113,6 +119,7 @@ struct clock_event_device { const char *name; int rating; int irq; + int bound_on; const struct cpumask*cpumask; struct list_headlist; struct module *owner; @@ -180,9 +187,11 @@ extern int tick_receive_broadcast(void); #endif #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT) +extern void tick_setup_hrtimer_broadcast(void); extern int tick_check_broadcast_expired(void); #else static inline int tick_check_broadcast_expired(void) { return 0; } +static void tick_setup_hrtimer_broadcast(void) {}; #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS diff --git a/kernel/time/Makefile b/kernel/time/Makefile index 9250130..06151ef 100644 --- a/kernel/time/Makefile +++ b/kernel/time/Makefile @@ -3,7 +3,7 @@ obj-y += timeconv.o posix-clock.o alarmtimer.o 
obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)+= clockevents.o obj-$(CONFIG_GENERIC_CLOCKEVENTS) += tick-common.o -obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)+= tick-broadcast.o +obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)+= tick-broadcast.o tick-broadcast-hrtimer.o obj-$(CONFIG_GENERIC_SCHED_CLOCK) += sched_clock.o obj-$(CONFIG_TICK_ONESHOT) += tick-oneshot.o obj-$(CONFIG_TICK_ONESHOT) += tick-sched.o diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c new file mode 100644 index 000..23f4925 --- /dev/null +++ b/kernel/time/tick-broadcast-hrtimer.c @@ -0,0 +1,102 @@ +/* + * linux/kernel/time/tick-broadcast-hrtimer.c + * This file emulates a local clock event device + * via a pseudo clock device. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "tick-internal.h" + +static stru
[PATCH V2 1/2] time: Change the return type of clockevents_notify() to integer
The broadcast framework can potentially be made use of by archs which do not have an external clock device as well. Then, it is required that one of the CPUs need to handle the broadcasting of wakeup IPIs to the CPUs in deep idle. As a result its local timers should remain functional all the time. For such a CPU, the BROADCAST_ENTER notification has to fail indicating that its clock device cannot be shutdown. To make way for this support, change the return type of tick_broadcast_oneshot_control() and hence clockevents_notify() to indicate such scenarios. Signed-off-by: Preeti U Murthy --- include/linux/clockchips.h |6 +++--- kernel/time/clockevents.c|8 +--- kernel/time/tick-broadcast.c |6 -- kernel/time/tick-internal.h |6 +++--- 4 files changed, 15 insertions(+), 11 deletions(-) diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index 493aa02..ac81b56 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; } #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS -extern void clockevents_notify(unsigned long reason, void *arg); +extern int clockevents_notify(unsigned long reason, void *arg); #else -static inline void clockevents_notify(unsigned long reason, void *arg) {} +static inline int clockevents_notify(unsigned long reason, void *arg) {} #endif #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */ @@ -196,7 +196,7 @@ static inline void clockevents_notify(unsigned long reason, void *arg) {} static inline void clockevents_suspend(void) {} static inline void clockevents_resume(void) {} -static inline void clockevents_notify(unsigned long reason, void *arg) {} +static inline int clockevents_notify(unsigned long reason, void *arg) {} static inline int tick_check_broadcast_expired(void) { return 0; } #endif diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 086ad60..79b8685 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ 
-524,12 +524,13 @@ void clockevents_resume(void) #ifdef CONFIG_GENERIC_CLOCKEVENTS /** * clockevents_notify - notification about relevant events + * Returns 0 on success, any other value on error */ -void clockevents_notify(unsigned long reason, void *arg) +int clockevents_notify(unsigned long reason, void *arg) { struct clock_event_device *dev, *tmp; unsigned long flags; - int cpu; + int cpu, ret = 0; raw_spin_lock_irqsave(&clockevents_lock, flags); @@ -542,7 +543,7 @@ void clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_BROADCAST_ENTER: case CLOCK_EVT_NOTIFY_BROADCAST_EXIT: - tick_broadcast_oneshot_control(reason); + ret = tick_broadcast_oneshot_control(reason); break; case CLOCK_EVT_NOTIFY_CPU_DYING: @@ -585,6 +586,7 @@ void clockevents_notify(unsigned long reason, void *arg) break; } raw_spin_unlock_irqrestore(&clockevents_lock, flags); + return ret; } EXPORT_SYMBOL_GPL(clockevents_notify); diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 9532690..be00692 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -633,14 +633,15 @@ again: /* * Powerstate information: The system enters/leaves a state, where * affected devices might stop + * Returns 0 on success, -EBUSY if the cpu is used to broadcast wakeups.
*/ -void tick_broadcast_oneshot_control(unsigned long reason) +int tick_broadcast_oneshot_control(unsigned long reason) { struct clock_event_device *bc, *dev; struct tick_device *td; unsigned long flags; ktime_t now; - int cpu; + int cpu, ret = 0; /* * Periodic mode does not care about the enter/exit of power @@ -746,6 +747,7 @@ void tick_broadcast_oneshot_control(unsigned long reason) } out: raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags); + return ret; } /* diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h index 18e71f7..164465c 100644 --- a/kernel/time/tick-internal.h +++ b/kernel/time/tick-internal.h @@ -46,7 +46,7 @@ extern int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *)); extern void tick_resume_oneshot(void); # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc); -extern void tick_broadcast_oneshot_control(unsigned long reason); +extern int tick_broadcast_oneshot_control(unsigned long reason); extern void tick_broadcast_switch_to_oneshot(void); extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup); extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc); @@ -58,7 +58,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc) { BUG(); } -static inline void tick_broadcast_oneshot_control
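The caller-side contract this return value enables can be sketched in plain C. This is a userspace model only; `bc_cpu`, `broadcast_enter` and `choose_idle_state` are illustrative names, and `BC_EBUSY` stands in for the kernel's -EBUSY:

```c
#include <assert.h>

#define BC_EBUSY 16 /* stand-in for the kernel's EBUSY */

static int bc_cpu = -1; /* CPU currently acting as the broadcast source */

/*
 * Model of the BROADCAST_ENTER notification: refuse to shut down the
 * local clock event device of the CPU that handles broadcast wakeups.
 */
static int broadcast_enter(int cpu)
{
	return cpu == bc_cpu ? -BC_EBUSY : 0;
}

/*
 * Idle entry path: fall back to a shallow state when ENTER is refused,
 * so the standby CPU keeps a functional local timer.
 */
static int choose_idle_state(int cpu, int deep, int shallow)
{
	return broadcast_enter(cpu) ? shallow : deep;
}
```

This is exactly why the low-level idle code must be "aware of the deep idle preventing return value" mentioned earlier in the thread: ignoring it would let the broadcast CPU itself lose its timer.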
Re: [PATCH V2] cpuidle/governors: Fix logic in selection of idle states
On 01/24/2014 02:38 PM, Daniel Lezcano wrote: > On 01/23/2014 12:15 PM, Preeti U Murthy wrote: >> Hi Daniel, >> >> Thank you for the review. >> >> On 01/22/2014 01:59 PM, Daniel Lezcano wrote: >>> On 01/17/2014 05:33 AM, Preeti U Murthy wrote: >>>> >>>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c >>>> index a55e68f..831b664 100644 >>>> --- a/drivers/cpuidle/cpuidle.c >>>> +++ b/drivers/cpuidle/cpuidle.c >>>> @@ -131,8 +131,9 @@ int cpuidle_idle_call(void) >>>> >>>>/* ask the governor for the next state */ >>>>next_state = cpuidle_curr_governor->select(drv, dev); >>>> + >>>> +dev->last_residency = 0; >>>>if (need_resched()) { >>>> -dev->last_residency = 0; >>> >>> Why do you need to do this change ? ^ >> >> So as to keep the last_residency consistent with the case that this patch >> addresses: where no idle state could be selected due to strict latency >> requirements or disabled states and hence the cpu exits without entering >> idle. Else it would contain the stale value from the previous idle state >> entry. >> >> But coming to think of it dev->last_residency is not used when the last >> entered idle state index is -1. >> >> So I have reverted this change as well in the revised patch below along >> with mentioning the reason in the last paragraph of the changelog. >> >>> >>>>/* give the governor an opportunity to reflect on the >>>> outcome */ >>>>if (cpuidle_curr_governor->reflect) >>>>cpuidle_curr_governor->reflect(dev, next_state); >>>> @@ -140,6 +141,18 @@ int cpuidle_idle_call(void) >>>>return 0; >>>>} >>>> >>>> +/* Unlike in the need_resched() case, we return here because the >>>> + * governor did not find a suitable idle state. However idle is >>>> still >>>> + * in progress as we are not asked to reschedule. Hence we return >>>> + * without enabling interrupts. >>> >>> That will lead to a WARN. 
>>> >>>> + * NOTE: The return code should still be success, since the >>>> verdict of this >>>> + * call is "do not enter any idle state" and not a failed call >>>> due to >>>> + * errors. >>>> + */ >>>> +if (next_state < 0) >>>> +return 0; >>>> + >>> >>> Returning from here breaks the symmetry of the trace. >> >> I have addressed the above concerns in the patch found below. >> Does the rest of the patch look sound? >> >> Regards >> Preeti U Murthy >> >> -- >> >> cpuidle/governors: Fix logic in selection of idle states >> >> From: Preeti U Murthy >> >> The cpuidle governors today are not handling scenarios where no idle >> state >> can be chosen. Such scenarios coud arise if the user has disabled all the >> idle states at runtime or the latency requirement from the cpus is >> very strict. >> >> The menu governor returns 0th index of the idle state table when no other >> idle state is suitable. This is even when the idle state corresponding >> to this >> index is disabled or the latency requirement is strict and the >> exit_latency >> of the lowest idle state is also not acceptable. Hence this patch >> fixes this logic in the menu governor by defaulting to an idle state >> index >> of -1 unless any other state is suitable. >> >> The ladder governor needs a few more fixes in addition to that >> required in the >> menu governor. When the ladder governor decides to demote the idle >> state of a >> CPU, it does not check if the lower idle states are enabled. Add this >> logic >> in addition to the logic where it chooses an index of -1 if it can >> neither >> promote or demote the idle state of a cpu nor can it choose the >> current idle >> state. >> >> The cpuidle_idle_call() will return back if the governor decides upon not >> entering any idle state. However it cannot return an error code >> because all >> archs have the logic today that if the call to cpuidle_idle_call() >> fails, it >> means that the cpuidle driver failed to *function*; for instance due to >>
Re: [PATCH 6/9] PPC: remove redundant cpuidle_idle_call()
Hi Nicolas, On 01/27/2014 11:38 AM, Nicolas Pitre wrote: > The core idle loop now takes care of it. However a few things need > checking: > > - Invocation of cpuidle_idle_call() in pseries_lpar_idle() happened > through arch_cpu_idle() and was therefore always preceded by a call > to ppc64_runlatch_off(). To preserve this property now that > cpuidle_idle_call() is invoked directly from core code, a call to > ppc64_runlatch_off() has been added to idle_loop_prolog() in > platforms/pseries/processor_idle.c. > > - Similarly, cpuidle_idle_call() was followed by ppc64_runlatch_on() > so a call to the latter has been added to idle_loop_epilog(). > > - And since arch_cpu_idle() always made sure to re-enable IRQs if they > were not enabled, this is now > done in idle_loop_epilog() as well. > > The above was made in order to keep the execution flow close to the > original. I don't know if that was strictly necessary. Someone well > acquainted with the platform details might find some room for possible > optimizations. > > Signed-off-by: Nicolas Pitre > --- > arch/powerpc/platforms/pseries/processor_idle.c | 5 > arch/powerpc/platforms/pseries/setup.c | 34 > ++--- > 2 files changed, 19 insertions(+), 20 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/processor_idle.c > b/arch/powerpc/platforms/pseries/processor_idle.c > index a166e38bd6..72ddfe3d2f 100644 > --- a/arch/powerpc/platforms/pseries/processor_idle.c > +++ b/arch/powerpc/platforms/pseries/processor_idle.c > @@ -33,6 +33,7 @@ static struct cpuidle_state *cpuidle_state_table; > > static inline void idle_loop_prolog(unsigned long *in_purr) > { > + ppc64_runlatch_off(); > *in_purr = mfspr(SPRN_PURR); > /* >* Indicate to the HV that we are idle. 
Now would be > @@ -49,6 +50,10 @@ static inline void idle_loop_epilog(unsigned long in_purr) > wait_cycles += mfspr(SPRN_PURR) - in_purr; > get_lppaca()->wait_state_cycles = cpu_to_be64(wait_cycles); > get_lppaca()->idle = 0; > + > + if (irqs_disabled()) > + local_irq_enable(); > + ppc64_runlatch_on(); > } > > static int snooze_loop(struct cpuidle_device *dev, > diff --git a/arch/powerpc/platforms/pseries/setup.c > b/arch/powerpc/platforms/pseries/setup.c > index c1f1908587..7604c19d54 100644 > --- a/arch/powerpc/platforms/pseries/setup.c > +++ b/arch/powerpc/platforms/pseries/setup.c > @@ -39,7 +39,6 @@ > #include > #include > #include > -#include > #include > #include > > @@ -356,29 +355,24 @@ early_initcall(alloc_dispatch_log_kmem_cache); > > static void pseries_lpar_idle(void) > { > - /* This would call on the cpuidle framework, and the back-end pseries > - * driver to go to idle states > + /* > + * Default handler to go into low thread priority and possibly > + * low power mode by cedeing processor to hypervisor >*/ > - if (cpuidle_idle_call()) { > - /* On error, execute default handler > - * to go into low thread priority and possibly > - * low power mode by cedeing processor to hypervisor > - */ > > - /* Indicate to hypervisor that we are idle. */ > - get_lppaca()->idle = 1; > + /* Indicate to hypervisor that we are idle. */ > + get_lppaca()->idle = 1; > > - /* > - * Yield the processor to the hypervisor. We return if > - * an external interrupt occurs (which are driven prior > - * to returning here) or if a prod occurs from another > - * processor. When returning here, external interrupts > - * are enabled. > - */ > - cede_processor(); > + /* > + * Yield the processor to the hypervisor. We return if > + * an external interrupt occurs (which are driven prior > + * to returning here) or if a prod occurs from another > + * processor. When returning here, external interrupts > + * are enabled. 
> + */ > + cede_processor(); > > - get_lppaca()->idle = 0; > - } > + get_lppaca()->idle = 0; > } > > /* > Reviewed-by: Preeti U Murthy The consequence of this is that other Power platforms like PowerNV will need to invoke ppc64_runlatch_off() and ppc64_runlatch_on() in each of their idle routines, since idle_loop_prolog() and idle_loop_epilog() are not invoked by them; but we will take care of this. Regards Preeti U Murthy
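The PowerNV follow-up promised here amounts to bracketing the snooze loop with runlatch updates. A userspace distillation of that shape (the runlatch SPR bit is reduced to a flag, need_resched() to a countdown — all names illustrative, not the kernel's):

```c
#include <assert.h>

static int runlatch = 1; /* models the PPC64 runlatch SPR bit */

/*
 * Model of snooze_loop(): drop the runlatch while polling so the
 * hardware knows the thread is idle, and restore it on exit.
 */
static int snooze_loop(int iterations)
{
	int polled = 0;

	runlatch = 0;           /* ppc64_runlatch_off() */
	while (iterations--)    /* while (!need_resched()) */
		polled++;
	runlatch = 1;           /* ppc64_runlatch_on() */
	return polled;
}
```

The invariant worth testing is simply that the latch is back on when the loop exits, no matter how long the poll ran.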
Re: a LLC sched domain bug for panda board?
Hi Alex, Vincent, On 02/04/2014 02:10 AM, Vincent Guittot wrote: > Yes, it's probably worth enabling by default for all ARM arch. > > Vincent > > On 02/04/2014 12:28 AM, Vincent Guittot wrote: >> On 3 February 2014 17:27, Vincent Guittot > wrote: >>> Have you checked that CONFIG_SCHED_LC is set ? >> >> sorry it's CONFIG_SCHED_MC > > Thanks for reminder! no it wasn't set. Does it means > arch/arm/configs/omap2plus_defconfig need add this config? Hmm..ok let me think this aloud. So it looks like the SMT, MC and NUMA sched domains are optional depending on the architecture. They are config dependent. These domains could potentially exist on the processor layout, but if the respective CONFIG options are not set, the scheduler could very well ignore these levels. What this means is that although the architecture could populate the cpu_sibling_mask and cpu_coregroup_mask, the scheduler is not mandated to schedule across the SMT and MC levels of the topology. It's just the CPU sched domain which is guaranteed to be present no matter what. This is indeed interesting to note :) Thanks Alex for bringing up this point :) On PowerPC, the SCHED_MC option can never be set. It's not even optional. On x86 it is on by default, and on ARM it looks like it's off by default. Thanks, Regards Preeti U Murthy
>>>> /proc/sys/kernel/sched_domain/cpu1/domain0/name:CPU >>>> >>>> -- >>>> Thanks >>>> Alex > > -- > Thanks > Alex
[PATCH V3] cpuidle/governors: Fix logic in selection of idle states
The cpuidle governors today are not handling scenarios where no idle state can be chosen. Such scenarios could arise if the user has disabled all the idle states at runtime or the latency requirement from the cpus is very strict. The menu governor returns the 0th index of the idle state table when no other idle state is suitable. This is so even when the idle state corresponding to this index is disabled or the latency requirement is strict and the exit_latency of the lowest idle state is also not acceptable. Hence this patch fixes this logic in the menu governor by defaulting to an idle state index of -1 unless any other state is suitable. The ladder governor needs a few more fixes in addition to those required in the menu governor. When the ladder governor decides to demote the idle state of a CPU, it does not check if the lower idle states are enabled. Add this logic in addition to the logic where it chooses an index of -1 if it can neither promote nor demote the idle state of a cpu, nor choose the current idle state. The cpuidle_idle_call() will simply return if the governor decides upon not entering any idle state. However it cannot return an error code because all archs have the logic today that if the call to cpuidle_idle_call() fails, it means that the cpuidle driver failed to *function*; for instance due to errors during registration. As a result they end up deciding upon a default idle state on their own, which could very well be a deep idle state. This is incorrect in cases where no idle state is suitable. Besides, for the scenario that this patch is addressing, the call actually succeeds. It's just that no idle state is thought to be suitable by the governors. Under such a circumstance, return a success code without entering any idle state.
The consequence of this patch, on the menu governor is that as long as a valid idle state cannot be chosen, the cpuidle statistics that this governor uses to predict the next idle state remain untouched from the last valid idle state. This is because an idle state is not even being predicted in this path, hence there is no point correcting the prediction either. Signed-off-by: Preeti U Murthy Changes from V1:https://lkml.org/lkml/2014/1/14/26 1. Change the return code to success from -EINVAL due to the reason mentioned in the changelog. 2. Add logic that the patch is addressing in the ladder governor as well. 3. Added relevant comments and removed redundant logic as suggested in the above thread. Changes from V2:lkml.org/lkml/2014/1/16/617 1. Enable interrupts when exiting from cpuidle_idle_call() in the case when no idle state was deemed suitable by the governor. --- drivers/cpuidle/cpuidle.c |2 - drivers/cpuidle/governors/ladder.c | 101 ++-- drivers/cpuidle/governors/menu.c |7 +- 3 files changed, 78 insertions(+), 32 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..89abdfc 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -131,7 +131,7 @@ int cpuidle_idle_call(void) /* ask the governor for the next state */ next_state = cpuidle_curr_governor->select(drv, dev); - if (need_resched()) { + if (need_resched() || (next_state < 0)) { dev->last_residency = 0; /* give the governor an opportunity to reflect on the outcome */ if (cpuidle_curr_governor->reflect) diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c index 9f08e8c..7e93aaa 100644 --- a/drivers/cpuidle/governors/ladder.c +++ b/drivers/cpuidle/governors/ladder.c @@ -58,6 +58,36 @@ static inline void ladder_do_selection(struct ladder_device *ldev, ldev->last_state_idx = new_idx; } +static int can_promote(struct ladder_device *ldev, int last_idx, + int last_residency) +{ + struct ladder_device_state *last_state; + 
+ last_state = &ldev->states[last_idx]; + if (last_residency > last_state->threshold.promotion_time) { + last_state->stats.promotion_count++; + last_state->stats.demotion_count = 0; + if (last_state->stats.promotion_count >= last_state->threshold.promotion_count) + return 1; + } + return 0; +} + +static int can_demote(struct ladder_device *ldev, int last_idx, + int last_residency) +{ + struct ladder_device_state *last_state; + + last_state = &ldev->states[last_idx]; + if (last_residency < last_state->threshold.demotion_time) { + last_state->stats.demotion_count++; + last_state->stats.promotion_count = 0; + if (last_state->stats.demotion_count >= last_state->threshold.demotion_count) + return 1; + } + return 0; +} + /** * ladder_select_state - selects the
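The promotion check in the patch boils down to counting consecutive residencies that exceed a threshold. A compilable distillation (field names follow the patch; everything else is simplified and illustrative — `can_demote` would be the mirror image with the comparisons reversed):

```c
#include <assert.h>

struct ladder_state {
	int promotion_time;         /* residency threshold, in us */
	int demotion_time;
	int promotion_count_thresh; /* consecutive hits needed to promote */
	int promotion_count;        /* running tallies */
	int demotion_count;
};

/* Returns 1 once enough consecutive residencies exceeded the threshold. */
static int can_promote(struct ladder_state *s, int last_residency)
{
	if (last_residency > s->promotion_time) {
		s->promotion_count++;
		s->demotion_count = 0; /* a long stay resets the demote tally */
		if (s->promotion_count >= s->promotion_count_thresh)
			return 1;
	}
	return 0;
}
```

Note how one long residency is never enough on its own; the ladder only climbs after a streak, which is what makes it resistant to one-off outliers.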
Re: [PATCH V2 1/2] time: Change the return type of clockevents_notify() to integer
On 02/04/2014 03:31 PM, Thomas Gleixner wrote: > On Fri, 24 Jan 2014, Preeti U Murthy wrote: >> -extern void tick_broadcast_oneshot_control(unsigned long reason); >> +extern int tick_broadcast_oneshot_control(unsigned long reason); > >> -static inline void tick_broadcast_oneshot_control(unsigned long reason) { } >> +static inline int tick_broadcast_oneshot_control(unsigned long reason) { } > >> -static inline void tick_broadcast_oneshot_control(unsigned long reason) { } >> +static inline int tick_broadcast_oneshot_control(unsigned long reason) { } > > The inline stubs need to return 0. Oh right! Apologies! Thanks. > > Thanks, > > tglx > Regards Preeti U Murthy
Re: [PATCH V2 2/2] tick/cpuidle: Initialize hrtimer mode of broadcast
Hi Thomas, On 02/04/2014 03:48 PM, Thomas Gleixner wrote: >> +++ b/kernel/time/tick-broadcast-hrtimer.c >> +/* >> + * This is called from the guts of the broadcast code when the cpu >> + * which is about to enter idle has the earliest broadcast timer event. >> + */ >> +static int bc_set_next(ktime_t expires, struct clock_event_device *bc) >> +{ >> +ktime_t now, interval; >> +/* >> + * We try to cancel the timer first. If the callback is on >> + * flight on some other cpu then we let it handle it. If we >> + * were able to cancel the timer nothing can rearm it as we >> + * own broadcast_lock. >> + * >> + * However if we are called from the hrtimer interrupt handler >> + * itself, reprogram it. >> + */ >> +if (hrtimer_try_to_cancel(&bctimer) >= 0) { >> +hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED); >> +/* Bind the "device" to the cpu */ >> +bc->bound_on = smp_processor_id(); >> +} else if (bc->bound_on == smp_processor_id()) { > > This part really wants a proper comment. It took me a while to figure > out why this is correct and what the call chain is. How about: "However we can also be called from the event handler of ce_broadcast_hrtimer when bctimer expires. We cannot therefore restart the timer since it is on flight on the same CPU. But due to the same reason we can reset it." ? > > >> +now = ktime_get(); >> +interval = ktime_sub(expires, now); >> +hrtimer_forward_now(&bctimer, interval); > > We are in the event handler called from bc_handler() and expires is > absolute time. So what's wrong with calling > hrtimer_set_expires(&bctimer, expires)? You are right. There are so many interfaces doing nearly the same thing :( I overlooked that hrtimer_forward() and its variants were being used when the interval was pre-calculated and stored away. And hrtimer_set_expires() would be used when we knew the absolute expiry. And it looks safe to call it here too. 
> >> +static enum hrtimer_restart bc_handler(struct hrtimer *t) >> +{ >> +ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer); >> +return HRTIMER_RESTART; > > We probably want to check whether the timer needs to be restarted at > all. > > if (ce_broadcast_hrtimer.next_event.tv64 == KTIME_MAX) > return HRTIMER_NORESTART; > > return HRTIMER_RESTART; True, this additional check would be useful. Do you want me to send out the next version with the above corrections, including the patch added to this thread where we handle archs setting the CPUIDLE_FLAG_TIMER_STOP flag? > > Hmm? > > Thanks, > > tglx Thanks Regards Preeti U Murthy > ___ > Linuxppc-dev mailing list > linuxppc-...@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev >
Re: [PATCH V3] cpuidle/governors: Fix logic in selection of idle states
Hi Arjan, On 02/04/2014 08:22 PM, Arjan van de Ven wrote: > On 2/4/2014 12:35 AM, Preeti U Murthy wrote: >> The cpuidle governors today are not handling scenarios where no idle >> state >> can be chosen. Such scenarios could arise if the user has disabled all the >> idle states at runtime or the latency requirement from the cpus is >> very strict. >> >> The menu governor returns 0th index of the idle state table when no other >> idle state is suitable. This is even when the idle state corresponding >> to this >> index is disabled or the latency requirement is strict and the >> exit_latency >> of the lowest idle state is also not acceptable. Hence this patch >> fixes this logic in the menu governor by defaulting to an idle state >> index >> of -1 unless any other state is suitable. > > state 0 is defined as polling, and polling ALWAYS should be ok Hmm.. you are right. This is convincing. There is no need for this patch. Thanks Regards Preeti U Murthy
[PATCH V3 1/3] time: Change the return type of clockevents_notify() to integer
The broadcast framework can potentially be made use of by archs which do not have an external clock device as well. Then, it is required that one of the CPUs need to handle the broadcasting of wakeup IPIs to the CPUs in deep idle. As a result its local timers should remain functional all the time. For such a CPU, the BROADCAST_ENTER notification has to fail indicating that its clock device cannot be shutdown. To make way for this support, change the return type of tick_broadcast_oneshot_control() and hence clockevents_notify() to indicate such scenarios. Signed-off-by: Preeti U Murthy --- include/linux/clockchips.h |6 +++--- kernel/time/clockevents.c|8 +--- kernel/time/tick-broadcast.c |6 -- kernel/time/tick-internal.h |6 +++--- 4 files changed, 15 insertions(+), 11 deletions(-) diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index 493aa02..e0c5a6c 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; } #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS -extern void clockevents_notify(unsigned long reason, void *arg); +extern int clockevents_notify(unsigned long reason, void *arg); #else -static inline void clockevents_notify(unsigned long reason, void *arg) {} +static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; } #endif #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */ @@ -196,7 +196,7 @@ static inline void clockevents_notify(unsigned long reason, void *arg) {} static inline void clockevents_suspend(void) {} static inline void clockevents_resume(void) {} -static inline void clockevents_notify(unsigned long reason, void *arg) {} +static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; } static inline int tick_check_broadcast_expired(void) { return 0; } #endif diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 086ad60..79b8685 100644 --- a/kernel/time/clockevents.c +++ 
b/kernel/time/clockevents.c @@ -524,12 +524,13 @@ void clockevents_resume(void) #ifdef CONFIG_GENERIC_CLOCKEVENTS /** * clockevents_notify - notification about relevant events + * Returns 0 on success, any other value on error */ -void clockevents_notify(unsigned long reason, void *arg) +int clockevents_notify(unsigned long reason, void *arg) { struct clock_event_device *dev, *tmp; unsigned long flags; - int cpu; + int cpu, ret = 0; raw_spin_lock_irqsave(&clockevents_lock, flags); @@ -542,7 +543,7 @@ void clockevents_notify(unsigned long reason, void *arg) case CLOCK_EVT_NOTIFY_BROADCAST_ENTER: case CLOCK_EVT_NOTIFY_BROADCAST_EXIT: - tick_broadcast_oneshot_control(reason); + ret = tick_broadcast_oneshot_control(reason); break; case CLOCK_EVT_NOTIFY_CPU_DYING: @@ -585,6 +586,7 @@ void clockevents_notify(unsigned long reason, void *arg) break; } raw_spin_unlock_irqrestore(&clockevents_lock, flags); + return ret; } EXPORT_SYMBOL_GPL(clockevents_notify); diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c index 43780ab..ddf2ac2 100644 --- a/kernel/time/tick-broadcast.c +++ b/kernel/time/tick-broadcast.c @@ -633,14 +633,15 @@ again: /* * Powerstate information: The system enters/leaves a state, where * affected devices might stop + * Returns 0 on success, -EBUSY if the cpu is used to broadcast wakeups. 
*/ -void tick_broadcast_oneshot_control(unsigned long reason) +int tick_broadcast_oneshot_control(unsigned long reason) { struct clock_event_device *bc, *dev; struct tick_device *td; unsigned long flags; ktime_t now; - int cpu; + int cpu, ret = 0; /* * Periodic mode does not care about the enter/exit of power @@ -746,6 +747,7 @@ void tick_broadcast_oneshot_control(unsigned long reason) } out: raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags); + return ret; } /* diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h index 8329669..f0dc03c 100644 --- a/kernel/time/tick-internal.h +++ b/kernel/time/tick-internal.h @@ -46,7 +46,7 @@ extern int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *)); extern void tick_resume_oneshot(void); # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc); -extern void tick_broadcast_oneshot_control(unsigned long reason); +extern int tick_broadcast_oneshot_control(unsigned long reason); extern void tick_broadcast_switch_to_oneshot(void); extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup); extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc); @@ -58,7 +58,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc) { BUG(); } -static inline void
[PATCH V3 0/3] time/cpuidle: Support in tick broadcast framework in absence of external clock device
On some architectures, the local timers of CPUs stop in deep idle states. They will need to depend on an external clock device to wake them up. However certain implementations of archs do not have an external clock device. This patchset provides support in the tick broadcast framework for such architectures so as to enable the CPUs to get into deep idle. Presently we are in need of this support on certain implementations of PowerPC. This patchset has thus been tested on the same. V1: https://lkml.org/lkml/2013/12/12/687. V2: https://lkml.org/lkml/2014/1/24/28 Changes in V3: 1. Modified comments and code around programming of the broadcast hrtimer. --- Preeti U Murthy (2): time: Change the return type of clockevents_notify() to integer time/cpuidle: Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set Thomas Gleixner (1): tick/cpuidle: Initialize hrtimer mode of broadcast drivers/cpuidle/cpuidle.c| 38 +++- include/linux/clockchips.h | 15 - kernel/time/Makefile |2 - kernel/time/clockevents.c|8 ++- kernel/time/tick-broadcast-hrtimer.c | 105 ++ kernel/time/tick-broadcast.c | 51 - kernel/time/tick-internal.h |6 +- 7 files changed, 197 insertions(+), 28 deletions(-) create mode 100644 kernel/time/tick-broadcast-hrtimer.c
[PATCH V3 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast
From: Thomas Gleixner On some architectures, in certain CPU deep idle states the local timers stop. An external clock device is used to wakeup these CPUs. The kernel support for the wakeup of these CPUs is provided by the tick broadcast framework by using the external clock device as the wakeup source. However not all implementations of architectures provide such an external clock device. This patch includes support in the broadcast framework to handle the wakeup of the CPUs in deep idle states on such systems by queuing a hrtimer on one of the CPUs, which is meant to handle the wakeup of CPUs in deep idle states. This patchset introduces a pseudo clock device which can be registered by the archs as tick_broadcast_device in the absence of a real external clock device. Once registered, the broadcast framework will work as is for these architectures as long as the archs take care of the BROADCAST_ENTER notification failing for one of the CPUs. This CPU is made the stand by CPU to handle wakeup of the CPUs in deep idle and it *must not enter deep idle states*. The CPU with the earliest wakeup is chosen to be this CPU. Hence this way the stand by CPU dynamically moves around and so does the hrtimer which is queued to trigger at the next earliest wakeup time. This is consistent with the case where an external clock device is present. The smp affinity of this clock device is set to the CPU with the earliest wakeup. This patchset handles the hotplug of the stand by CPU as well by moving the hrtimer on to the CPU handling the CPU_DEAD notification. 
Signed-off-by: Preeti U Murthy [Added Changelog and code to handle reprogramming of hrtimer] --- include/linux/clockchips.h |9 +++ kernel/time/Makefile |2 - kernel/time/tick-broadcast-hrtimer.c | 105 ++ kernel/time/tick-broadcast.c | 45 ++- 4 files changed, 159 insertions(+), 2 deletions(-) create mode 100644 kernel/time/tick-broadcast-hrtimer.c diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index e0c5a6c..dbe9e14 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -62,6 +62,11 @@ enum clock_event_mode { #define CLOCK_EVT_FEAT_DYNIRQ 0x20 #define CLOCK_EVT_FEAT_PERCPU 0x40 +/* + * Clockevent device is based on a hrtimer for broadcast + */ +#define CLOCK_EVT_FEAT_HRTIMER 0x80 + /** * struct clock_event_device - clock event device descriptor * @event_handler: Assigned by the framework to be called by the low @@ -83,6 +88,7 @@ enum clock_event_mode { * @name: ptr to clock event name * @rating:variable to rate clock event devices * @irq: IRQ number (only for non CPU local devices) + * @bound_on: Bound on CPU * @cpumask: cpumask to indicate for which CPUs this device works * @list: list head for the management code * @owner: module reference @@ -113,6 +119,7 @@ struct clock_event_device { const char *name; int rating; int irq; + int bound_on; const struct cpumask*cpumask; struct list_headlist; struct module *owner; @@ -180,9 +187,11 @@ extern int tick_receive_broadcast(void); #endif #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT) +extern void tick_setup_hrtimer_broadcast(void); extern int tick_check_broadcast_expired(void); #else static inline int tick_check_broadcast_expired(void) { return 0; } +static inline void tick_setup_hrtimer_broadcast(void) { } #endif #ifdef CONFIG_GENERIC_CLOCKEVENTS diff --git a/kernel/time/Makefile b/kernel/time/Makefile index 9250130..06151ef 100644 --- a/kernel/time/Makefile +++ b/kernel/time/Makefile @@ -3,7 +3,7 @@ obj-y += timeconv.o posix-clock.o alarmtimer.o 
obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)+= clockevents.o obj-$(CONFIG_GENERIC_CLOCKEVENTS) += tick-common.o -obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)+= tick-broadcast.o +obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)+= tick-broadcast.o tick-broadcast-hrtimer.o obj-$(CONFIG_GENERIC_SCHED_CLOCK) += sched_clock.o obj-$(CONFIG_TICK_ONESHOT) += tick-oneshot.o obj-$(CONFIG_TICK_ONESHOT) += tick-sched.o diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c new file mode 100644 index 000..af1e119 --- /dev/null +++ b/kernel/time/tick-broadcast-hrtimer.c @@ -0,0 +1,105 @@ +/* + * linux/kernel/time/tick-broadcast-hrtimer.c + * This file emulates a local clock event device + * via a pseudo clock device. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "tick-internal.h" + +static stru
[PATCH V3 3/3] time/cpuidle: Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set
Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the local timers stop. The cpuidle_idle_call() currently handles such idle states by calling into the broadcast framework so as to wakeup CPUs at their next wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call into the broadcast framework can fail for archs that do not have an external clock device to handle wakeups and the CPU in question has to thus be made the stand by CPU. This patch handles such cases by failing the call into cpuidle so that the arch can take some default action. The arch will certainly not enter a similar idle state because a failed cpuidle call will also implicitly indicate that the broadcast framework has not registered this CPU to be woken up. Hence we are safe if we fail the cpuidle call. In the process move the functions that trace idle statistics just before and after the entry and exit into idle states respectively. In other scenarios where the call to cpuidle fails, we end up not tracing idle entry and exit since a decision on an idle state could not be taken. Similarly when the call to broadcast framework fails, we skip tracing idle statistics because we are in no further position to take a decision on an alternative idle state to enter into. 
Signed-off-by: Preeti U Murthy --- drivers/cpuidle/cpuidle.c | 38 +++--- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..8f42033 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -117,15 +117,19 @@ int cpuidle_idle_call(void) { struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices); struct cpuidle_driver *drv; - int next_state, entered_state; - bool broadcast; + int next_state, entered_state, ret = 0; + bool broadcast = false; - if (off || !initialized) - return -ENODEV; + if (off || !initialized) { + ret = -ENODEV; + goto out; + } /* check if the device is ready */ - if (!dev || !dev->enabled) - return -EBUSY; + if (!dev || !dev->enabled) { + ret = -EBUSY; + goto out; + } drv = cpuidle_get_cpu_driver(dev); @@ -137,15 +141,18 @@ int cpuidle_idle_call(void) if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, next_state); local_irq_enable(); - return 0; + goto out; } - trace_cpu_idle_rcuidle(next_state, dev->cpu); - broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP); - if (broadcast) - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); + if (broadcast) { + ret = clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); + if (ret) + goto out; + } + + trace_cpu_idle_rcuidle(next_state, dev->cpu); if (cpuidle_state_is_coupled(dev, drv, next_state)) entered_state = cpuidle_enter_state_coupled(dev, drv, @@ -153,16 +160,17 @@ int cpuidle_idle_call(void) else entered_state = cpuidle_enter_state(dev, drv, next_state); - if (broadcast) - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); - trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); /* give the governor an opportunity to reflect on the outcome */ if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, entered_state); - return 0; +out: if (broadcast) + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); + + + return ret; } /** 
Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
On 02/06/2014 07:46 PM, Nicolas Pitre wrote: > The core idle loop now takes care of it. > > Signed-off-by: Nicolas Pitre > --- > arch/powerpc/platforms/powernv/setup.c | 13 + > 1 file changed, 1 insertion(+), 12 deletions(-) > > diff --git a/arch/powerpc/platforms/powernv/setup.c > b/arch/powerpc/platforms/powernv/setup.c > index 21166f65c9..a932feb290 100644 > --- a/arch/powerpc/platforms/powernv/setup.c > +++ b/arch/powerpc/platforms/powernv/setup.c > @@ -26,7 +26,6 @@ > #include > #include > #include > -#include > > #include > #include > @@ -217,16 +216,6 @@ static int __init pnv_probe(void) > return 1; > } > > -void powernv_idle(void) > -{ > - /* Hook to cpuidle framework if available, else > - * call on default platform idle code > - */ > - if (cpuidle_idle_call()) { > - power7_idle(); > - } > -} > - > define_machine(powernv) { > .name = "PowerNV", > .probe = pnv_probe, > @@ -236,7 +225,7 @@ define_machine(powernv) { > .show_cpuinfo = pnv_show_cpuinfo, > .progress = pnv_progress, > .machine_shutdown = pnv_shutdown, > - .power_save = powernv_idle, > + .power_save = power7_idle, > .calibrate_decr = generic_calibrate_decr, > #ifdef CONFIG_KEXEC > .kexec_cpu_down = pnv_kexec_cpu_down, > Reviewed-by: Preeti U Murthy
Re: [PATCH 2/2] ARM64: powernv: remove redundant cpuidle_idle_call()
Hi Nicolas, powernv in the subject of the patch? Regards Preeti U Murthy On 02/06/2014 07:46 PM, Nicolas Pitre wrote: > The core idle loop now takes care of it. > > Signed-off-by: Nicolas Pitre > --- > arch/arm64/kernel/process.c | 7 ++- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c > index 1c0a9be2ff..9cce0098f4 100644 > --- a/arch/arm64/kernel/process.c > +++ b/arch/arm64/kernel/process.c > @@ -33,7 +33,6 @@ > #include > #include > #include > -#include > #include > #include > #include > @@ -94,10 +93,8 @@ void arch_cpu_idle(void) >* This should do all the clock switching and wait for interrupt >* tricks >*/ > - if (cpuidle_idle_call()) { > - cpu_do_idle(); > - local_irq_enable(); > - } > + cpu_do_idle(); > + local_irq_enable(); > } > > #ifdef CONFIG_HOTPLUG_CPU >
Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
Hi Daniel, On 02/06/2014 09:55 PM, Daniel Lezcano wrote: > Hi Nico, > > > On 6 February 2014 14:16, Nicolas Pitre wrote: > >> The core idle loop now takes care of it. >> >> Signed-off-by: Nicolas Pitre >> --- >> arch/powerpc/platforms/powernv/setup.c | 13 + >> 1 file changed, 1 insertion(+), 12 deletions(-) >> >> diff --git a/arch/powerpc/platforms/powernv/setup.c >> b/arch/powerpc/platforms/powernv/setup.c >> index 21166f65c9..a932feb290 100644 >> --- a/arch/powerpc/platforms/powernv/setup.c >> +++ b/arch/powerpc/platforms/powernv/setup.c >> @@ -26,7 +26,6 @@ >> #include >> #include >> #include >> -#include >> >> #include >> #include >> @@ -217,16 +216,6 @@ static int __init pnv_probe(void) >> return 1; >> } >> >> -void powernv_idle(void) >> -{ >> - /* Hook to cpuidle framework if available, else >> -* call on default platform idle code >> -*/ >> - if (cpuidle_idle_call()) { >> - power7_idle(); >> - } >> > > The cpuidle_idle_call is called from arch_cpu_idle in > arch/powerpc/kernel/idle.c between a ppc64_runlatch_off|on section. > Shouldn't the cpuidle-powernv driver call these functions when entering > idle ? Yes they should, I will send out a patch that does that on top of this. There have been cpuidle driver cleanups for powernv and pseries in this merge window. While no change would be required in the pseries cpuidle driver as a result of Nicolas's cleanup, we would need to add the ppc64_runlatch_on and off functions before and after the entry into the powernv idle states. 
Thanks Regards Preeti U Murthy > > -- Daniel > > >> -} >> - >> define_machine(powernv) { >> .name = "PowerNV", >> .probe = pnv_probe, >> @@ -236,7 +225,7 @@ define_machine(powernv) { >> .show_cpuinfo = pnv_show_cpuinfo, >> .progress = pnv_progress, >> .machine_shutdown = pnv_shutdown, >> - .power_save = powernv_idle, >> + .power_save = power7_idle, >> .calibrate_decr = generic_calibrate_decr, >> #ifdef CONFIG_KEXEC >> .kexec_cpu_down = pnv_kexec_cpu_down, >> -- >> 1.8.4.108.g55ea5f6 >> >> >
Re: [PATCH V3 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast
Hi Thomas, On 02/06/2014 09:33 PM, Thomas Gleixner wrote: > On Thu, 6 Feb 2014, Preeti U Murthy wrote: > > Compiler warnings are not so important, right? > > kernel/time/tick-broadcast.c: In function ‘tick_broadcast_oneshot_control’: > kernel/time/tick-broadcast.c:700:3: warning: ‘return’ with no value, in > function returning non-void [-Wreturn-type] > kernel/time/tick-broadcast.c:711:3: warning: ‘return’ with no value, in > function returning non-void [-Wreturn-type] My apologies for this; I will make sure it does not repeat. On compilation I did not receive any warnings even with the additional compile-time flags. I compiled it on powerpc. Let me look into why the warnings did not show up. Nevertheless I should have taken care of this even by simply looking at the code. > >> +/* >> + * If the current CPU owns the hrtimer broadcast >> + * mechanism, it cannot go deep idle. >> + */ >> +ret = broadcast_needs_cpu(bc, cpu); > > So we leave the CPU in the broadcast mask, just to force another call > to the notify code right away to remove it again. Wouldn't it be more > clever to clear the flag right away? That would make the changes to > the cpuidle code simpler. Delta patch below. You are right. > > Thanks, > > tglx > --- > > --- tip.orig/kernel/time/tick-broadcast.c > +++ tip/kernel/time/tick-broadcast.c > @@ -697,7 +697,7 @@ int tick_broadcast_oneshot_control(unsig >* states >*/ > if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC) > - return; > + return 0; > > /* >* We are called with preemtion disabled from the depth of the > @@ -708,7 +708,7 @@ int tick_broadcast_oneshot_control(unsig > dev = td->evtdev; > > if (!(dev->features & CLOCK_EVT_FEAT_C3STOP)) > - return; > + return 0; > > bc = tick_broadcast_device.evtdev; > > @@ -731,9 +731,14 @@ int tick_broadcast_oneshot_control(unsig > } > /* >* If the current CPU owns the hrtimer broadcast > - * mechanism, it cannot go deep idle. 
> + * mechanism, it cannot go deep idle and we remove the > + * CPU from the broadcast mask. We don't have to go > + * through the EXIT path as the local timer is not > + * shutdown. >*/ > ret = broadcast_needs_cpu(bc, cpu); > + if (ret) > + cpumask_clear_cpu(cpu, tick_broadcast_oneshot_mask); > } else { > if (cpumask_test_and_clear_cpu(cpu, > tick_broadcast_oneshot_mask)) { > clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT); > > The cpuidle patch then is below. The trace_cpu_idle_rcuidle() functions have been moved around so that the broadcast CPU does not trace any idle event and that the symmetry between the trace functions and the call to the broadcast framework is maintained. Wow, it does become very simple :) time/cpuidle: Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set From: Preeti U Murthy Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the local timers stop. The cpuidle_idle_call() currently handles such idle states by calling into the broadcast framework so as to wakeup CPUs at their next wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call into the broadcast framework can fail for archs that do not have an external clock device to handle wakeups and the CPU in question has to thus be made the stand by CPU. This patch handles such cases by failing the call into cpuidle so that the arch can take some default action. The arch will certainly not enter a similar idle state because a failed cpuidle call will also implicitly indicate that the broadcast framework has not registered this CPU to be woken up. Hence we are safe if we fail the cpuidle call. In the process move the functions that trace idle statistics just before and after the entry and exit into idle states respectively. In other scenarios where the call to cpuidle fails, we end up not tracing idle entry and exit since a decision on an idle state could not be taken. 
Similarly when the call to broadcast framework fails, we skip tracing idle statistics because we are in no further position to take a decision on an alternative idle state to enter into. Signed-off-by: Preeti U Murthy --- drivers/cpuidle/cpuidle.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..8beb0f02 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@
Re: [PATCH V5 0/8] cpuidle/ppc: Enable deep idle states on PowerNV
Hi Paul, On 01/15/2014 08:59 PM, Paul Gortmaker wrote: > On 14-01-15 03:07 AM, Preeti U Murthy wrote: > > [...] > >> >> This patchset is based on mainline commit-id:8ae516aa8b8161254d3, and the > > I figured I'd give this a quick sanity build test for a few > configs, but v3.13-rc1-141-g8ae516aa8b81 seems too old; Ben's > ppc next branch is at v3.13-rc1-160-gfac515db4520 and it fails: > > --- > $ git am ppc-idle > Applying: powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message > Applying: powerpc: Implement tick broadcast IPI as a fixed IPI message > Applying: cpuidle/ppc: Split timer_interrupt() into timer handling and > interrupt handling routines > error: patch failed: arch/powerpc/kernel/time.c:510 > error: arch/powerpc/kernel/time.c: patch does not apply > Patch failed at 0003 cpuidle/ppc: Split timer_interrupt() into timer handling > and interrupt handling routines > The copy of the patch that failed is found in: >/home/paul/git/linux-head/.git/rebase-apply/patch > When you have resolved this problem, run "git am --continue". > If you prefer to skip this patch, run "git am --skip" instead. > To restore the original branch and stop patching, run "git am --abort". > $ dry-run > patching file arch/powerpc/kernel/time.c > Hunk #3 FAILED at 544. > Hunk #4 FAILED at 554. > Hunk #5 succeeded at 862 (offset 12 lines). > 2 out of 5 hunks FAILED -- saving rejects to file > arch/powerpc/kernel/time.c.rej > > > It appears to conflict with: > > commit 0215f7d8c53fb192cd4491ede0ece5cca6b5db57 > Author: Benjamin Herrenschmidt > Date: Tue Jan 14 17:11:39 2014 +1100 > > powerpc: Fix races with irq_work > > Thanks for the build test. I will base it on the mainline at the latest commit as well as on Ben's tree and send out this patchset. Regards Preeti U Murthy > Paul. > -- > >> cpuidle driver for powernv posted by Deepthi Dharwar: >> https://lkml.org/lkml/2014/1/14/172 >> >> >> Changes in V5: >> - >> The primary change in this version is in Patch[6/8]. 
>> As per the discussions in V4 posting of this patchset, it was decided to >> refine handling the wakeup of CPUs in fast-sleep by doing the following: >> >> 1. In V4, a polling mechanism was used by the CPU handling broadcast to >> find out the time of next wakeup of the CPUs in deep idle states. V5 avoids >> polling by a way described under PATCH[6/8] in this patchset. >> >> 2. The mechanism of broadcast handling of CPUs in deep idle in the absence >> of an >> external wakeup device should be generic and not arch specific code. Hence >> in this >> version this functionality has been integrated into the tick broadcast >> framework in >> the kernel unlike before where it was handled in powerpc specific code. >> >> 3. It was suggested that the "broadcast cpu" can be the time keeping cpu >> itself. However this has challenges of its own: >> >> a. The time keeping cpu need not exist when all cpus are idle. Hence there >> are phases in time when time keeping cpu is absent. But for the use case that >> this patchset is trying to address we rely on the presence of a broadcast cpu >> all the time. >> >> b. The nomination and un-assignment of the time keeping cpu is not protected >> by a lock today and need not be as well since such is its use case in the >> kernel. However we would need locks if we double up the time keeping cpu as >> the >> broadcast cpu. >> >> Hence the broadcast cpu is independent of the time-keeping cpu. However >> PATCH[6/8] >> proposes a simpler solution to pick a broadcast cpu in this version. >> >> >> >> Changes in V4: >> - >> https://lkml.org/lkml/2013/11/29/97 >> >> 1. Add Fast Sleep CPU idle state on PowerNV. >> >> 2. Add the required context management for Fast Sleep and the call to OPAL >> to synchronize time base after wakeup from fast sleep. >> >> 4. Add parsing of CPU idle states from the device tree to populate the >> cpuidle >> state table. >> >> 5. Rename ambiguous functions in the code around waking up of CPUs from fast >> sleep. 
>> >> 6. Fixed a bug in re-programming of the hrtimer that is queued to wakeup the >> CPUs in fast sleep and modified Changelogs. >> >> 7. Added the ARCH_HAS_TICK_BROADCAST option. This signifies that we have a >> arch specific function to perform broadcast. >> >> >> Changes in V3: >> - >> http://thread.gmane.org/gmane.linux.po
[PATCH V2] cpuidle/governors: Fix logic in selection of idle states
The cpuidle governors today are not handling scenarios where no idle state can be chosen. Such scenarios could arise if the user has disabled all the idle states at runtime or the latency requirement from the cpus is very strict. The menu governor returns 0th index of the idle state table when no other idle state is suitable. This is even when the idle state corresponding to this index is disabled or the latency requirement is strict and the exit_latency of the lowest idle state is also not acceptable. Hence this patch fixes this logic in the menu governor by defaulting to an idle state index of -1 unless any other state is suitable. The ladder governor needs a few more fixes in addition to that required in the menu governor. When the ladder governor decides to demote the idle state of a CPU, it does not check if the lower idle states are enabled. Add this logic in addition to the logic where it chooses an index of -1 if it can neither promote nor demote the idle state of a cpu nor can it choose the current idle state. The cpuidle_idle_call() will return back if the governor decides upon not entering any idle state. However it cannot return an error code because all archs have the logic today that if the call to cpuidle_idle_call() fails, it means that the cpuidle driver failed to *function*; for instance due to errors during registration. As a result they end up deciding upon a default idle state on their own, which could very well be a deep idle state. This is incorrect in cases where no idle state is suitable. Besides, for the scenario that this patch is addressing, the call actually succeeds. It's just that no idle state is thought to be suitable by the governors. Under such a circumstance return success code without entering any idle state. Signed-off-by: Preeti U Murthy Changes from V1: https://lkml.org/lkml/2014/1/14/26 1. Change the return code to success from -EINVAL due to the reason mentioned in the changelog. 2. 
Add logic that the patch is addressing in the ladder governor as well. 3. Added relevant comments and removed redundant logic as suggested in the above thread. --- drivers/cpuidle/cpuidle.c | 15 +- drivers/cpuidle/governors/ladder.c | 98 ++-- drivers/cpuidle/governors/menu.c |7 +-- 3 files changed, 89 insertions(+), 31 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..831b664 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -131,8 +131,9 @@ int cpuidle_idle_call(void) /* ask the governor for the next state */ next_state = cpuidle_curr_governor->select(drv, dev); + + dev->last_residency = 0; if (need_resched()) { - dev->last_residency = 0; /* give the governor an opportunity to reflect on the outcome */ if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, next_state); @@ -140,6 +141,18 @@ int cpuidle_idle_call(void) return 0; } + /* Unlike in the need_resched() case, we return here because the +* governor did not find a suitable idle state. However idle is still +* in progress as we are not asked to reschedule. Hence we return +* without enabling interrupts. +* +* NOTE: The return code should still be success, since the verdict of this +* call is "do not enter any idle state" and not a failed call due to +* errors. 
+*/ + if (next_state < 0) + return 0; + trace_cpu_idle_rcuidle(next_state, dev->cpu); broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP); diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c index 9f08e8c..f495f57 100644 --- a/drivers/cpuidle/governors/ladder.c +++ b/drivers/cpuidle/governors/ladder.c @@ -58,6 +58,36 @@ static inline void ladder_do_selection(struct ladder_device *ldev, ldev->last_state_idx = new_idx; } +static int can_promote(struct ladder_device *ldev, int last_idx, + int last_residency) +{ + struct ladder_device_state *last_state; + + last_state = &ldev->states[last_idx]; + if (last_residency > last_state->threshold.promotion_time) { + last_state->stats.promotion_count++; + last_state->stats.demotion_count = 0; + if (last_state->stats.promotion_count >= last_state->threshold.promotion_count) + return 1; + } + return 0; +} + +static int can_demote(struct ladder_device *ldev, int last_idx, + int last_residency) +{ + struct ladder_device_state *last_state; + + last_state = &ldev->states[last_idx]; + if (last_residency < last_state->threshold.demotion_time) { + last_state->
Re: [PATCH V2 0/2] time/cpuidle: Support in tick broadcast framework in absence of external clock device
Hi Thomas, I realized that the below patch is also required for this patchset. This patch apart, I noticed one corner case which we will need to handle: the BROADCAST_ON notification in periodic mode (it is a nop in oneshot mode). We will need to fail the BROADCAST_ON notification too in this case if the CPU in question has been made the standby CPU. Thanks Regards Preeti U Murthy - time/cpuidle: Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set From: Preeti U Murthy Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the local timers stop. The cpuidle_idle_call() currently handles such idle states by calling into the broadcast framework so as to wake up CPUs at their next wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call into the broadcast framework can fail for archs that do not have an external clock device to handle wakeups, and the CPU in question thus has to be made the standby CPU. This patch handles such cases by failing the call into cpuidle so that the arch can take some default action. The arch will certainly not enter a similar idle state, because a failed cpuidle call also implicitly indicates that the broadcast framework has not registered this CPU to be woken up. Hence we are safe if we fail the cpuidle call. In the process, move the calls that trace idle statistics to just before and after the entry into and exit from idle states, respectively. In other scenarios where the call to cpuidle fails, we end up not tracing idle entry and exit, since a decision on an idle state could not be taken. Similarly, when the call to the broadcast framework fails, we skip tracing idle statistics, because we are in no position to take a decision on an alternative idle state to enter.
Signed-off-by: Preeti U Murthy --- drivers/cpuidle/cpuidle.c | 38 +++--- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index a55e68f..8f42033 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -117,15 +117,19 @@ int cpuidle_idle_call(void) { struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices); struct cpuidle_driver *drv; - int next_state, entered_state; - bool broadcast; + int next_state, entered_state, ret = 0; + bool broadcast = false; - if (off || !initialized) - return -ENODEV; + if (off || !initialized) { + ret = -ENODEV; + goto out; + } /* check if the device is ready */ - if (!dev || !dev->enabled) - return -EBUSY; + if (!dev || !dev->enabled) { + ret = -EBUSY; + goto out; + } drv = cpuidle_get_cpu_driver(dev); @@ -137,15 +141,18 @@ int cpuidle_idle_call(void) if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, next_state); local_irq_enable(); - return 0; + goto out; } - trace_cpu_idle_rcuidle(next_state, dev->cpu); - broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP); - if (broadcast) - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); + if (broadcast) { + ret = clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu); + if (ret) + goto out; + } + + trace_cpu_idle_rcuidle(next_state, dev->cpu); if (cpuidle_state_is_coupled(dev, drv, next_state)) entered_state = cpuidle_enter_state_coupled(dev, drv, @@ -153,16 +160,17 @@ int cpuidle_idle_call(void) else entered_state = cpuidle_enter_state(dev, drv, next_state); - if (broadcast) - clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); - trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); /* give the governor an opportunity to reflect on the outcome */ if (cpuidle_curr_governor->reflect) cpuidle_curr_governor->reflect(dev, entered_state); - return 0; +out: if (broadcast) + clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu); + + + return ret; } /** 
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2] cpuidle/governors: Fix logic in selection of idle states
Hi Daniel, On 01/28/2014 02:16 PM, Daniel Lezcano wrote: > On 01/24/2014 11:21 AM, Preeti U Murthy wrote: >> On 01/24/2014 02:38 PM, Daniel Lezcano wrote: >>> On 01/23/2014 12:15 PM, Preeti U Murthy wrote: >>>> Hi Daniel, >>>> >>>> Thank you for the review. > > [ ... ] > >>>> --- >>>>drivers/cpuidle/cpuidle.c | 15 + >>>>drivers/cpuidle/governors/ladder.c | 101 >>>> ++-- >>>>drivers/cpuidle/governors/menu.c |7 +- >>>>3 files changed, 90 insertions(+), 33 deletions(-) >>>> >>>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c >>>> index a55e68f..19d17e8 100644 >>>> --- a/drivers/cpuidle/cpuidle.c >>>> +++ b/drivers/cpuidle/cpuidle.c >>>> @@ -131,8 +131,9 @@ int cpuidle_idle_call(void) >>>> >>>>/* ask the governor for the next state */ >>>>next_state = cpuidle_curr_governor->select(drv, dev); >>>> + >>>> +dev->last_residency = 0; >>>>if (need_resched()) { >>> >>> What about if (need_resched() || next_state < 0) ? >> >> Hmm.. I feel we need to distinguish between the need_resched() scenario >> and the scenario when no idle state was suitable through the trace >> points at-least. > > Well, I don't think so as soon as we don't care about the return value > of cpuidle_idle_call in both cases. > > The traces are following a specific format. That is if the state is -1 > (PWR_EVENT_EXIT), it means exiting the current idle state. > > The idlestat tool [1] is using this traces to open - close transitions. > > IMO, if the cpu is not entering idle, it should just exit without any > idle traces. Yes I see your point here. > > This portion of code is a bit confusing because it is introduced by the > menu governor updates post-poned when entering the next idle state (not > exiting the current idle state with good reasons). I am sorry but I don't understand this part. Which is the portion of the code you refer to here? Also can you please elaborate on the above statement? 
Thanks Regards Preeti U Murthy > > -- Daniel > > [1] http://git.linaro.org/power/idlestat.git > >> This could help while debugging when we could find situations where >> there are no tasks to run, yet the cpu is not entering any idle state. >> The traces could help clearly point that no idle state was thought >> suitable by the governor. Of course there are many other means to find >> this out, but this seems rather straightforward. Hence having the >> condition next_state < 0 between trace_cpu_idle*() would be apt IMHO. >> >> Regards >> Preeti U Murthy >> >>> >>>> -dev->last_residency = 0; >>>>/* give the governor an opportunity to reflect on the >>>> outcome */ >>>>if (cpuidle_curr_governor->reflect) >>>>cpuidle_curr_governor->reflect(dev, next_state); >>>> @@ -141,6 +142,16 @@ int cpuidle_idle_call(void) >>>>} >>>> >>>>trace_cpu_idle_rcuidle(next_state, dev->cpu); >>>> +/* >>>> + * NOTE: The return code should still be success, since the >>>> verdict of >>>> + * this call is "do not enter any idle state". It is not a failed >>>> call >>>> + * due to errors. >>>> + */ >>>> +if (next_state < 0) { >>>> +entered_state = next_state; >>>> +local_irq_enable(); >>>> +goto out; >>>> +} >>>> >>>>broadcast = !!(drv->states[next_state].flags & >>>> CPUIDLE_FLAG_TIMER_STOP); >>>> >>>> @@ -156,7 +167,7 @@ int cpuidle_idle_call(void) >>>>if (broadcast) >>>>clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, >>>> &dev->cpu); >>>> >>>> -trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); >>>> +out:trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); >>>> >>>>/* give the governor an opportunity to reflect on the outcome */ >>>>if (cpuidle_curr_governor->reflect) >>
Re: [PATCH v2 1/6] idle: move the cpuidle entry point to the generic idle loop
Hi Nicolas, On 01/30/2014 02:01 AM, Nicolas Pitre wrote: > On Wed, 29 Jan 2014, Nicolas Pitre wrote: > >> In order to integrate cpuidle with the scheduler, we must have a better >> proximity in the core code with what cpuidle is doing and not delegate >> such interaction to arch code. >> >> Architectures implementing arch_cpu_idle() should simply enter >> a cheap idle mode in the absence of a proper cpuidle driver. >> >> Signed-off-by: Nicolas Pitre >> Acked-by: Daniel Lezcano > > As mentioned in my reply to Olof's comment on patch #5/6, here's a new > version of this patch adding the safety local_irq_enable() to the core > code. > > - >8 > > From: Nicolas Pitre > Subject: idle: move the cpuidle entry point to the generic idle loop > > In order to integrate cpuidle with the scheduler, we must have a better > proximity in the core code with what cpuidle is doing and not delegate > such interaction to arch code. > > Architectures implementing arch_cpu_idle() should simply enter > a cheap idle mode in the absence of a proper cpuidle driver. > > In both cases i.e. whether it is a cpuidle driver or the default > arch_cpu_idle(), the calling convention expects IRQs to be disabled > on entry and enabled on exit. There is a warning in place already but > let's add a forced IRQ enable here as well. This will allow for > removing the forced IRQ enable some implementations do locally and Why would this patch allow for removing the forced IRQ enables that are being done on some archs in arch_cpu_idle()? Isn't this patch expecting the default arch_cpu_idle() to have re-enabled the interrupts after exiting from the default idle state? It's supposed to only catch faulty cpuidle drivers that haven't enabled IRQs on exit from idle state but are expected to have done so, isn't it? Thanks Regards Preeti U Murthy > allowing for the warning to trig. 
> > Signed-off-by: Nicolas Pitre > > diff --git a/kernel/cpu/idle.c b/kernel/cpu/idle.c > index 988573a9a3..14ca43430a 100644 > --- a/kernel/cpu/idle.c > +++ b/kernel/cpu/idle.c > @@ -3,6 +3,7 @@ > */ > #include > #include > +#include > #include > #include > #include > @@ -95,8 +96,10 @@ static void cpu_idle_loop(void) > if (!current_clr_polling_and_test()) { > stop_critical_timings(); > rcu_idle_enter(); > - arch_cpu_idle(); > - WARN_ON_ONCE(irqs_disabled()); > + if (cpuidle_idle_call()) > + arch_cpu_idle(); > + if (WARN_ON_ONCE(irqs_disabled())) > + local_irq_enable(); > rcu_idle_exit(); > start_critical_timings(); > } else {
Re: [PATCH v2 1/6] idle: move the cpuidle entry point to the generic idle loop
Hi Nicolas, On 01/30/2014 10:58 AM, Nicolas Pitre wrote: > On Thu, 30 Jan 2014, Preeti U Murthy wrote: > >> Hi Nicolas, >> >> On 01/30/2014 02:01 AM, Nicolas Pitre wrote: >>> On Wed, 29 Jan 2014, Nicolas Pitre wrote: >>> >>>> In order to integrate cpuidle with the scheduler, we must have a better >>>> proximity in the core code with what cpuidle is doing and not delegate >>>> such interaction to arch code. >>>> >>>> Architectures implementing arch_cpu_idle() should simply enter >>>> a cheap idle mode in the absence of a proper cpuidle driver. >>>> >>>> Signed-off-by: Nicolas Pitre >>>> Acked-by: Daniel Lezcano >>> >>> As mentioned in my reply to Olof's comment on patch #5/6, here's a new >>> version of this patch adding the safety local_irq_enable() to the core >>> code. >>> >>> - >8 >>> >>> From: Nicolas Pitre >>> Subject: idle: move the cpuidle entry point to the generic idle loop >>> >>> In order to integrate cpuidle with the scheduler, we must have a better >>> proximity in the core code with what cpuidle is doing and not delegate >>> such interaction to arch code. >>> >>> Architectures implementing arch_cpu_idle() should simply enter >>> a cheap idle mode in the absence of a proper cpuidle driver. >>> >>> In both cases i.e. whether it is a cpuidle driver or the default >>> arch_cpu_idle(), the calling convention expects IRQs to be disabled >>> on entry and enabled on exit. There is a warning in place already but >>> let's add a forced IRQ enable here as well. This will allow for >>> removing the forced IRQ enable some implementations do locally and >> >> Why would this patch allow for removing the forced IRQ enable that are >> being done on some archs in arch_cpu_idle()? Isn't this patch expecting >> the default arch_cpu_idle() to have re-enabled the interrupts after >> exiting from the default idle state? Its supposed to only catch faulty >> cpuidle drivers that haven't enabled IRQs on exit from idle state but >> are expected to have done so, isn't it? 
> > Exact. However x86 currently does this: > > if (cpuidle_idle_call()) > x86_idle(); > else > local_irq_enable(); > > So whenever cpuidle_idle_call() is successful then IRQs are > unconditionally enabled whether or not the underlying cpuidle driver has > properly done it or not. And the reason is that some of the x86 cpuidle > do fail to enable IRQs before returning. > > So the idea is to get rid of this unconditional IRQ enabling and let the > core issue a warning instead (as well as enabling IRQs to allow the > system to run). Oh ok, thank you for clarifying this:) Regards Preeti U Murthy > > > Nicolas >
[PATCH 1/3] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
From: Srivatsa S. Bhat The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map to a common implementation - generic_smp_call_function_single_interrupt(). So, we can consolidate them and save one of the IPI message slots, (which are precious on powerpc, since only 4 of those slots are available). So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be used for something else in the future, if desired. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/kernel/smp.c | 12 +--- arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 4 files changed, 8 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 084e080..9f7356b 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_CALL_FUNC_SINGLE 2 +#define PPC_MSG_UNUSED 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ac2621a..ee7d76b 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t call_function_single_action(int irq, void *data) +static irqreturn_t unused_action(int irq, void *data) { - generic_smp_call_function_single_interrupt(); + /* This slot is unused and hence available for use, if needed */ return IRQ_HANDLED; } @@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action, + [PPC_MSG_UNUSED] = unused_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single", + [PPC_MSG_UNUSED] = "ipi unused", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); - if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE)) - generic_smp_call_function_single_interrupt(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule); void arch_send_call_function_single_ipi(int cpu) { - do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE); + do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } void arch_send_call_function_ipi_mask(const struct cpumask *mask) 
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 2d42f3b..adf3726 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -215,7 +215,7 @@ void iic_request_IPIs(void) { iic_request_ipi(PPC_MSG_CALL_FUNCTION); iic_request_ipi(PPC_MSG_RESCHEDULE); - iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE); + iic_request_ipi(PPC_MSG_UNUSED); iic_request_ipi(PPC_MSG_DEBUGGER_BREAK); } diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c index 4b35166..00d1a7c 100644 --- a/arch/powerpc/platforms/ps3/smp.c +++ b/arch/powerpc/platforms/ps3/smp.c @@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void) BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION!= 0); BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1); - BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2); + BUILD_BUG_ON(PPC_MSG_UNUSED != 2); BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3); for (i = 0; i < MSG_COUNT; i++) {
[PATCH 2/3] powerpc: Implement tick broadcast IPI as a fixed IPI message
From: Srivatsa S. Bhat For scalability and performance reasons, we want the tick broadcast IPIs to be handled as efficiently as possible. Fixed IPI messages are one of the most efficient mechanisms available - they are faster than the smp_call_function mechanism because the IPI handlers are fixed and hence they don't involve costly operations such as adding IPI handlers to the target CPU's function queue, acquiring locks for synchronization etc. Luckily we have an unused IPI message slot, so use that to implement tick broadcast IPIs efficiently. Signed-off-by: Srivatsa S. Bhat [Functions renamed to tick_broadcast* and Changelog modified by Preeti U. Murthy] Signed-off-by: Preeti U. Murthy Acked-by: Geoff Levand [For the PS3 part] --- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/smp.c | 19 +++ arch/powerpc/kernel/time.c |5 + arch/powerpc/platforms/cell/interrupt.c |2 +- arch/powerpc/platforms/ps3/smp.c|2 +- 6 files changed, 24 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 9f7356b..ff51046 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu); * in /proc/interrupts will be wrong!!! 
--Troy */ #define PPC_MSG_CALL_FUNCTION 0 #define PPC_MSG_RESCHEDULE 1 -#define PPC_MSG_UNUSED 2 +#define PPC_MSG_TICK_BROADCAST 2 #define PPC_MSG_DEBUGGER_BREAK 3 /* for irq controllers that have dedicated ipis per message (4) */ diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index c1f2676..1d428e6 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); extern void GregorianDay(struct rtc_time *tm); +extern void tick_broadcast_ipi_handler(void); extern void generic_calibrate_decr(void); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ee7d76b..6f06f05 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data) return IRQ_HANDLED; } -static irqreturn_t unused_action(int irq, void *data) +static irqreturn_t tick_broadcast_ipi_action(int irq, void *data) { - /* This slot is unused and hence available for use, if needed */ + tick_broadcast_ipi_handler(); return IRQ_HANDLED; } @@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data) static irq_handler_t smp_ipi_action[] = { [PPC_MSG_CALL_FUNCTION] = call_function_action, [PPC_MSG_RESCHEDULE] = reschedule_action, - [PPC_MSG_UNUSED] = unused_action, + [PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action, [PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action, }; const char *smp_ipi_name[] = { [PPC_MSG_CALL_FUNCTION] = "ipi call function", [PPC_MSG_RESCHEDULE] = "ipi reschedule", - [PPC_MSG_UNUSED] = "ipi unused", + [PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast", [PPC_MSG_DEBUGGER_BREAK] = "ipi debugger", }; @@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void) generic_smp_call_function_interrupt(); if (all & 
IPI_MESSAGE(PPC_MSG_RESCHEDULE)) scheduler_ipi(); + if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST)) + tick_broadcast_ipi_handler(); if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK)) debug_ipi_action(0, NULL); } while (info->messages); @@ -289,6 +292,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask) do_message_pass(cpu, PPC_MSG_CALL_FUNCTION); } +void tick_broadcast(const struct cpumask *mask) +{ + unsigned int cpu; + + for_each_cpu(cpu, mask) + do_message_pass(cpu, PPC_MSG_TICK_BROADCAST); +} + #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC) void smp_send_debugger_break(void) { diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index b3dab20..3ff97db 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -825,6 +825,11 @@ static void decrementer_set_mode(enum clock_event_mode mode, decrementer_set_next_event(DECREMENTER_MAX, dev); } +/* Interrupt handler for the timer broadcast IPI */ +void tick_broadcast_ipi_handler(void) +{ +} + static void register_decrementer_clockevent(int cpu) { struct clock_event_device *dec = &per_cpu(decrementers, cpu); diff --git a/arch/powerpc/plat