Re: [PATCH] powerpc: mitigate impact of decrementer reset
On Tue, Nov 18, 2014 at 12:46:56PM +1100, Michael Ellerman wrote:
> On Mon, 2014-11-17 at 11:18 -0800, Paul E. McKenney wrote:
> > On Thu, Nov 13, 2014 at 01:42:12PM +1100, Michael Ellerman wrote:
> > > On Mon, 2014-11-10 at 14:58 -0600, Paul Clarke wrote:
> > > > On 11/10/2014 04:08 AM, Benjamin Herrenschmidt wrote:
> > > > > On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
> > > > >> This patch short-circuits the reset of the decrementer, exiting after
> > > > >> the decrementer reset, but before the housekeeping tasks if the only
> > > > >> need for the interrupt is simply to reset it. After this patch,
> > > > >> the latency spike was measured at about 150 nanoseconds.
> > > > >
> > > > > Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
> > > > > and your patch will probably cause it to be skipped...
> > > >
> > > > You're right.
> > >
> > > Yeah, thanks Ben, that would have been bad.
> > >
> > > So we'll need to come up with a different approach.
> >
> > If I am understanding this correctly, it underscores the need for more
> > bits in the decrementer register. :-/
>
> Yes that is the root cause of the problem :)

Sigh!!! I was hoping! ;-)

Thanx, Paul

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: mitigate impact of decrementer reset
On Mon, 2014-11-17 at 11:18 -0800, Paul E. McKenney wrote:
> On Thu, Nov 13, 2014 at 01:42:12PM +1100, Michael Ellerman wrote:
> > On Mon, 2014-11-10 at 14:58 -0600, Paul Clarke wrote:
> > > On 11/10/2014 04:08 AM, Benjamin Herrenschmidt wrote:
> > > > On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
> > > >> This patch short-circuits the reset of the decrementer, exiting after
> > > >> the decrementer reset, but before the housekeeping tasks if the only
> > > >> need for the interrupt is simply to reset it. After this patch,
> > > >> the latency spike was measured at about 150 nanoseconds.
> > > >
> > > > Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
> > > > and your patch will probably cause it to be skipped...
> > >
> > > You're right.
> >
> > Yeah, thanks Ben, that would have been bad.
> >
> > So we'll need to come up with a different approach.
>
> If I am understanding this correctly, it underscores the need for more
> bits in the decrementer register. :-/

Yes that is the root cause of the problem :)

cheers
Re: [PATCH] powerpc: mitigate impact of decrementer reset
On Thu, Nov 13, 2014 at 01:42:12PM +1100, Michael Ellerman wrote:
> On Mon, 2014-11-10 at 14:58 -0600, Paul Clarke wrote:
> > On 11/10/2014 04:08 AM, Benjamin Herrenschmidt wrote:
> > > On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
> > >> This patch short-circuits the reset of the decrementer, exiting after
> > >> the decrementer reset, but before the housekeeping tasks if the only
> > >> need for the interrupt is simply to reset it. After this patch,
> > >> the latency spike was measured at about 150 nanoseconds.
> > >
> > > Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
> > > and your patch will probably cause it to be skipped...
> >
> > You're right.
>
> Yeah, thanks Ben, that would have been bad.
>
> So we'll need to come up with a different approach.
>
> > I'm confused by the division between timer_interrupt() and
> > __timer_interrupt(). The former is called with interrupts disabled (and
> > enables them), but also calls irq_enter()/irq_exit(). Why are those
> > calls not in __timer_interrupt()? (If they were, the short-circuit
> > logic might be a bit easier to put directly in __timer_interrupt(),
> > which would eliminate any duplicate code.)
> >
> > It looks like __timer_interrupt is only called directly by the broadcast
> > timer IPI handler. (Why is __timer_interrupt not static?) Does this
> > path not need irq_enter/irq_exit?
>
> I think I answered most of this in the other mail I just sent, but let me know
> if not.
>
> And __timer_interrupt() is static, if you have a new enough kernel :)

If I am understanding this correctly, it underscores the need for more
bits in the decrementer register. :-/

Thanx, Paul
Re: powerpc: mitigate impact of decrementer reset
On 11/12/2014 08:39 PM, Michael Ellerman wrote:
> On Wed, 2014-11-05 at 11:06 -0600, Paul Clarke wrote:
>> On 10/07/2014 09:52 PM, Michael Ellerman wrote:
>>> On Tue, 2014-07-10 at 19:13:24 UTC, Paul Clarke wrote:
>>>> This patch short-circuits the reset of the decrementer, exiting after
>>>> the decrementer reset, but before the housekeeping tasks if the only
>>>> need for the interrupt is simply to reset it. After this patch,
>>>> the latency spike was measured at about 150 nanoseconds.
>>>
>>> Thanks for the excellent changelog. But this patch makes me a bit nervous :)
>>>
>>> Do you know where the latency is coming from? Is it primarily the irq work?
>>
>> Yes, it is all under irq_enter (measured at ~10us) and irq_exit (~12us).
>
> Hmm, OK. I actually meant irq_work_run(). AIUI irq_enter/exit() are just state
> tracking, they shouldn't be actually running work.
>
> How are you measuring it?

ftrace function_graph tracer:
--
 127.425212 |              |  .irq_enter() {
 127.425213 |              |    .rcu_irq_enter() {
 127.425213 | + 12.206 us  |      .rcu_eqs_exit_common.isra.41();
 127.425226 | + 12.750 us  |    }
... RCU is a big hitter
 127.425226 |              |    .vtime_common_account_irq_enter() {
 127.425226 |              |      .vtime_account_user() {
 127.425226 |   0.032 us   |        ._raw_spin_lock();
 127.425227 |   0.034 us   |        .get_vtime_delta();
 127.425227 |              |        .account_user_time() {
 127.425228 |   0.030 us   |          .cpuacct_account_field();
 127.425228 |              |          .acct_account_cputime() {
 127.425228 |   0.082 us   |            .__acct_update_integrals();
 127.425229 |   0.562 us   |          }
 127.425229 |   1.500 us   |        }
 127.425229 |   2.954 us   |      }
 127.425230 |   3.434 us   |    }
... but even accounting is not insignificant
 127.425230 | + 17.218 us  |  }
 127.425230 |              |  /* timer_interrupt_entry: [...] */
... nothing to see here, because there's nothing to do except reset the
decrementer
 127.425230 |              |  /* timer_interrupt_exit: [...] */
... (less than 1 us spent doing the "required" work)
 127.425231 |              |  .irq_exit() {
 127.425231 |              |    .vtime_gen_account_irq_exit() {
 127.425231 |   0.036 us   |      ._raw_spin_lock();
 127.425232 |              |      .__vtime_account_system() {
 127.425232 |   0.030 us   |        .get_vtime_delta();
 127.425232 |              |        .account_system_time() {
 127.425233 |   0.030 us   |          .cpuacct_account_field();
 127.425233 |              |          .acct_account_cputime() {
 127.425233 |   0.072 us   |            .__acct_update_integrals();
 127.425234 |   0.564 us   |          }
 127.425234 |   1.546 us   |        }
 127.425234 |   2.528 us   |      }
 127.425235 |   3.700 us   |    }
... significant accounting time
 127.425235 |   0.032 us   |    .idle_cpu();
 127.425235 |              |    .tick_nohz_irq_exit() {
 127.425236 |              |      .can_stop_full_tick() {
 127.425236 |   0.022 us   |        .sched_can_stop_tick();
 127.425236 |   0.020 us   |        .posix_cpu_timers_can_stop_tick()
 127.425237 |   0.970 us   |      }
 127.425237 |   0.082 us   |      .ktime_get();
 127.425238 |              |      .tick_nohz_stop_sched_tick() {
 127.425238 |   0.032 us   |        .timekeeping_max_deferment();
 127.425238 |              |        .get_next_timer_interrupt() {
 127.425239 |   0.038 us   |          ._raw_spin_lock();
 127.425239 |              |          .hrtimer_get_next_event() {
 127.425239 |   0.030 us   |            ._raw_spin_lock_irqsave();
 127.425240 |   0.028 us   |            ._raw_spin_unlock_irqrestore
 127.425240 |   0.984 us   |          }
 127.425241 |   1.936 us   |        }
 127.425241 |   0.032 us   |        .scheduler_tick_max_deferment();
 127.425241 |   3.438 us   |      }
 127.425242 |   5.880 us   |    }
 127.425242 |              |    .rcu_irq_exit() {
 127.425242 |   0.102 us   |      .rcu_eqs_enter_common.isra.40();
 127.425243 |   0.576 us   |    }
 127.425243 | + 12.156 us  |  }

This one was almost 30 us total (17.218 + 12.156 = 29.374 us), just to
reset the decrementer.

>>> If so I'd prefer if we could move the short circuit into __timer_interrupt()
>>> itself. That way we'd still have the trace points usable, and it would
>>> hopefully result in less duplicated logic.
>>
>> But irq_enter and irq_exit are called in timer_interrupt, before
>> __timer_interrupt is called. I don't see how that helps. The time
>> spent in __timer_interrupt is minuscule by comparison.
>
> Right, it won't help if it's irq_enter() that is causing the delay. But I was
> assuming it was irq_work_run().
>
>> Are you suggesting that irq_enter/exit be moved into __timer_interrupt
>> as well? (I'm not sure how that would impact the existing call to
>> __timer_interrupt from tick_broadcast_ipi_handler? An
Re: [PATCH] powerpc: mitigate impact of decrementer reset
On Mon, 2014-11-10 at 14:58 -0600, Paul Clarke wrote:
> On 11/10/2014 04:08 AM, Benjamin Herrenschmidt wrote:
> > On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
> >> This patch short-circuits the reset of the decrementer, exiting after
> >> the decrementer reset, but before the housekeeping tasks if the only
> >> need for the interrupt is simply to reset it. After this patch,
> >> the latency spike was measured at about 150 nanoseconds.
> >
> > Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
> > and your patch will probably cause it to be skipped...
>
> You're right.

Yeah, thanks Ben, that would have been bad.

So we'll need to come up with a different approach.

> I'm confused by the division between timer_interrupt() and
> __timer_interrupt(). The former is called with interrupts disabled (and
> enables them), but also calls irq_enter()/irq_exit(). Why are those
> calls not in __timer_interrupt()? (If they were, the short-circuit
> logic might be a bit easier to put directly in __timer_interrupt(),
> which would eliminate any duplicate code.)
>
> It looks like __timer_interrupt is only called directly by the broadcast
> timer IPI handler. (Why is __timer_interrupt not static?) Does this
> path not need irq_enter/irq_exit?

I think I answered most of this in the other mail I just sent, but let me know
if not.

And __timer_interrupt() is static, if you have a new enough kernel :)

cheers
Re: powerpc: mitigate impact of decrementer reset
On Wed, 2014-11-05 at 11:06 -0600, Paul Clarke wrote:
> Sorry it took me so long to get back to this...
>
> On 10/07/2014 09:52 PM, Michael Ellerman wrote:
> > On Tue, 2014-07-10 at 19:13:24 UTC, Paul Clarke wrote:
> >> This patch short-circuits the reset of the decrementer, exiting after
> >> the decrementer reset, but before the housekeeping tasks if the only
> >> need for the interrupt is simply to reset it. After this patch,
> >> the latency spike was measured at about 150 nanoseconds.
> >
> > Thanks for the excellent changelog. But this patch makes me a bit nervous :)
> >
> > Do you know where the latency is coming from? Is it primarily the irq work?
>
> Yes, it is all under irq_enter (measured at ~10us) and irq_exit (~12us).

Hmm, OK. I actually meant irq_work_run(). AIUI irq_enter/exit() are just state
tracking, they shouldn't be actually running work.

How are you measuring it?

> > If so I'd prefer if we could move the short circuit into __timer_interrupt()
> > itself. That way we'd still have the trace points usable, and it would
> > hopefully result in less duplicated logic.
>
> But irq_enter and irq_exit are called in timer_interrupt, before
> __timer_interrupt is called. I don't see how that helps. The time
> spent in __timer_interrupt is minuscule by comparison.

Right, it won't help if it's irq_enter() that is causing the delay. But I was
assuming it was irq_work_run().

> Are you suggesting that irq_enter/exit be moved into __timer_interrupt
> as well? (I'm not sure how that would impact the existing call to
> __timer_interrupt from tick_broadcast_ipi_handler? And if there is no
> impact, what's the point of separating timer_interrupt and
> __timer_interrupt?)

The point is __timer_interrupt() is called from tick_broadcast_ipi_handler(),
which is called from smp_ipi_demux(), from icp_hv_ipi_action(), from
__do_irq(), which has already done irq_enter() (and will do irq_exit()).

cheers
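The split described in this reply can be sketched as a toy model in C. This is not kernel code: every name ending in `_model` is hypothetical, and the counters exist only so the effect is observable. It shows why the inner body carries no entry/exit bookkeeping of its own: one caller supplies the bookkeeping itself, the other is reached from a path that has already done it.

```c
/* Toy model of the timer_interrupt()/__timer_interrupt() split.
 * All *_model names are hypothetical; this is not kernel code. */

int irq_enter_calls;   /* how many times entry bookkeeping ran */
int timer_work_calls;  /* how many times the timer body ran */

void irq_enter_model(void) { irq_enter_calls++; }
void irq_exit_model(void)  { /* exit bookkeeping would go here */ }

/* Inner body: assumes the caller already did the entry bookkeeping. */
void inner_timer_interrupt_model(void)
{
	timer_work_calls++;
}

/* Decrementer exception path: nothing has called irq_enter yet,
 * so the wrapper brackets the inner body itself. */
void timer_interrupt_model(void)
{
	irq_enter_model();
	inner_timer_interrupt_model();
	irq_exit_model();
}

/* Broadcast IPI path: the generic IRQ code has already done the
 * bookkeeping before the IPI handler reaches the inner body. */
void do_irq_model(void)
{
	irq_enter_model();
	inner_timer_interrupt_model();  /* reached via the IPI handler */
	irq_exit_model();
}
```

In this model each path performs the bookkeeping exactly once per interrupt. Moving the bookkeeping into the inner body would make the IPI path do it twice.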
Re: [PATCH] powerpc: mitigate impact of decrementer reset
On 11/10/2014 04:08 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
>> The POWER ISA defines an always-running decrementer which can be used
>> to schedule interrupts after a certain time interval has elapsed.
>> The decrementer counts down at the same frequency as the Time Base,
>> which is 512 MHz. The maximum value of the decrementer is 0x7fffffff.
>> This works out to a maximum interval of about 4.19 seconds.
>>
>> If a larger interval is desired, the kernel will set the decrementer
>> to its maximum value and reset it after it expires (underflows)
>> a sufficient number of times until the desired interval has elapsed.
>>
>> The negative effect of this is that an unwanted latency spike will
>> impact normal processing at most every 4.19 seconds. On an IBM
>> POWER8-based system, this spike was measured at about 25-30
>> microseconds, much of which was basic, opportunistic housekeeping
>> tasks that could otherwise have waited.
>>
>> This patch short-circuits the reset of the decrementer, exiting after
>> the decrementer reset, but before the housekeeping tasks if the only
>> need for the interrupt is simply to reset it. After this patch,
>> the latency spike was measured at about 150 nanoseconds.
>
> Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
> and your patch will probably cause it to be skipped...

You're right.

I'm confused by the division between timer_interrupt() and
__timer_interrupt(). The former is called with interrupts disabled (and
enables them), but also calls irq_enter()/irq_exit(). Why are those
calls not in __timer_interrupt()? (If they were, the short-circuit
logic might be a bit easier to put directly in __timer_interrupt(),
which would eliminate any duplicate code.)

It looks like __timer_interrupt is only called directly by the broadcast
timer IPI handler. (Why is __timer_interrupt not static?) Does this
path not need irq_enter/irq_exit?

>> Signed-off-by: Paul A. Clarke
>> ---
>>  arch/powerpc/kernel/time.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>> index 368ab37..962a06b 100644
>> --- a/arch/powerpc/kernel/time.c
>> +++ b/arch/powerpc/kernel/time.c
>> @@ -528,6 +528,7 @@ void timer_interrupt(struct pt_regs * regs)
>>  {
>>  	struct pt_regs *old_regs;
>>  	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
>> +	u64 now;
>>
>>  	/* Ensure a positive value is written to the decrementer, or else
>>  	 * some CPUs will continue to take decrementer exceptions.
>> @@ -550,6 +551,18 @@ void timer_interrupt(struct pt_regs * regs)
>>  	 */
>>  	may_hard_irq_enable();
>>
>> +	/* If this is simply the decrementer expiring (underflow) due to
>> +	 * the limited size of the decrementer, and not a set timer,
>> +	 * reset (if needed) and return
>> +	 */
>> +	now = get_tb_or_rtc();
>> +	if (now < *next_tb) {
>> +		now = *next_tb - now;
>> +		if (now <= DECREMENTER_MAX)
>> +			set_dec((int)now);
>> +		__get_cpu_var(irq_stat).timer_irqs_others++;
>> +		return;
>> +	}
>>
>>  #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
>>  	if (atomic_read(&ppc_n_lost_interrupts) != 0)

Regards,
PC
Re: [PATCH] powerpc: mitigate impact of decrementer reset
On Tue, 2014-10-07 at 14:13 -0500, Paul Clarke wrote:
> The POWER ISA defines an always-running decrementer which can be used
> to schedule interrupts after a certain time interval has elapsed.
> The decrementer counts down at the same frequency as the Time Base,
> which is 512 MHz. The maximum value of the decrementer is 0x7fffffff.
> This works out to a maximum interval of about 4.19 seconds.
>
> If a larger interval is desired, the kernel will set the decrementer
> to its maximum value and reset it after it expires (underflows)
> a sufficient number of times until the desired interval has elapsed.
>
> The negative effect of this is that an unwanted latency spike will
> impact normal processing at most every 4.19 seconds. On an IBM
> POWER8-based system, this spike was measured at about 25-30
> microseconds, much of which was basic, opportunistic housekeeping
> tasks that could otherwise have waited.
>
> This patch short-circuits the reset of the decrementer, exiting after
> the decrementer reset, but before the housekeeping tasks if the only
> need for the interrupt is simply to reset it. After this patch,
> the latency spike was measured at about 150 nanoseconds.

Doesn't this break the irq_work stuff ? We trigger it with a set_dec(1);
and your patch will probably cause it to be skipped...

Cheers,
Ben.

> Signed-off-by: Paul A. Clarke
> ---
>  arch/powerpc/kernel/time.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 368ab37..962a06b 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -528,6 +528,7 @@ void timer_interrupt(struct pt_regs * regs)
>  {
>  	struct pt_regs *old_regs;
>  	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
> +	u64 now;
>
>  	/* Ensure a positive value is written to the decrementer, or else
>  	 * some CPUs will continue to take decrementer exceptions.
> @@ -550,6 +551,18 @@ void timer_interrupt(struct pt_regs * regs)
>  	 */
>  	may_hard_irq_enable();
>
> +	/* If this is simply the decrementer expiring (underflow) due to
> +	 * the limited size of the decrementer, and not a set timer,
> +	 * reset (if needed) and return
> +	 */
> +	now = get_tb_or_rtc();
> +	if (now < *next_tb) {
> +		now = *next_tb - now;
> +		if (now <= DECREMENTER_MAX)
> +			set_dec((int)now);
> +		__get_cpu_var(irq_stat).timer_irqs_others++;
> +		return;
> +	}
>
>  #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
>  	if (atomic_read(&ppc_n_lost_interrupts) != 0)
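The concern raised in this reply can be illustrated with a small toy model in C. It is only a sketch under stated assumptions: `struct cpu_model` and both handler functions are hypothetical stand-ins, and the model assumes pending irq_work is signalled by forcing an early decrementer interrupt while the next real timer event (`next_tb`) is still in the future.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not kernel code: all names here are hypothetical. */
struct cpu_model {
	uint64_t now;              /* current timebase value */
	uint64_t next_tb;          /* timebase value of the next real timer event */
	bool     irq_work_pending; /* set by whoever queued irq_work */
	int      irq_work_runs;    /* how often pending work was executed */
};

/* Full handler: always checks for pending irq_work first. */
void full_handler_model(struct cpu_model *c)
{
	if (c->irq_work_pending) {
		c->irq_work_pending = false;
		c->irq_work_runs++;
	}
	/* ... timer event handling would follow here ... */
}

/* Handler with a short circuit placed in front of the full path. */
void short_circuit_handler_model(struct cpu_model *c)
{
	if (c->now < c->next_tb)
		return;  /* early exit: pending irq_work is never looked at */
	full_handler_model(c);
}
```

With `now` below `next_tb` and work pending, the full handler runs the work, while the short-circuit variant returns with the work still pending. That is the skipped case described above.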
Re: powerpc: mitigate impact of decrementer reset
Sorry it took me so long to get back to this...

On 10/07/2014 09:52 PM, Michael Ellerman wrote:
> On Tue, 2014-07-10 at 19:13:24 UTC, Paul Clarke wrote:
>> The POWER ISA defines an always-running decrementer which can be used
>> to schedule interrupts after a certain time interval has elapsed.
>> The decrementer counts down at the same frequency as the Time Base,
>> which is 512 MHz. The maximum value of the decrementer is 0x7fffffff.
>> This works out to a maximum interval of about 4.19 seconds.
>>
>> If a larger interval is desired, the kernel will set the decrementer
>> to its maximum value and reset it after it expires (underflows)
>> a sufficient number of times until the desired interval has elapsed.
>>
>> The negative effect of this is that an unwanted latency spike will
>> impact normal processing at most every 4.19 seconds. On an IBM
>> POWER8-based system, this spike was measured at about 25-30
>> microseconds, much of which was basic, opportunistic housekeeping
>> tasks that could otherwise have waited.
>>
>> This patch short-circuits the reset of the decrementer, exiting after
>> the decrementer reset, but before the housekeeping tasks if the only
>> need for the interrupt is simply to reset it. After this patch,
>> the latency spike was measured at about 150 nanoseconds.
>
> Thanks for the excellent changelog. But this patch makes me a bit nervous :)
>
> Do you know where the latency is coming from? Is it primarily the irq work?

Yes, it is all under irq_enter (measured at ~10us) and irq_exit (~12us).

> If so I'd prefer if we could move the short circuit into __timer_interrupt()
> itself. That way we'd still have the trace points usable, and it would
> hopefully result in less duplicated logic.

But irq_enter and irq_exit are called in timer_interrupt, before
__timer_interrupt is called. I don't see how that helps. The time
spent in __timer_interrupt is minuscule by comparison.

Are you suggesting that irq_enter/exit be moved into __timer_interrupt
as well? (I'm not sure how that would impact the existing call to
__timer_interrupt from tick_broadcast_ipi_handler? And if there is no
impact, what's the point of separating timer_interrupt and
__timer_interrupt?)

Regards,
PC
Re: [PATCH] powerpc: mitigate impact of decrementer reset
On 10/08/2014 12:37 AM, Heinz Wrobel wrote:
> what if your tb wraps during the test?

Per the Power ISA, Time Base is 64 bits, monotonically increasing, and
is writable only in hypervisor state. To my understanding, it is set to
zero at boot (although this is not prescribed).

Also, as noted by others, the logic is roughly duplicated (with some
differences) from the analogous code in __timer_interrupt just above it.

I don't see wrapping as a concern.

PC
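A rough back-of-the-envelope check of the claim above, as a C sketch. It assumes the 512 MHz Time Base frequency quoted in the patch changelog and a counter that starts near zero. The helper name `tb_wrap_years` is made up for this illustration.

```c
#include <stdint.h>

/* Assumed from the changelog: the Time Base ticks at 512 MHz. */
#define TB_FREQ_HZ 512000000ULL

/* Whole years (365-day years) until a 64-bit counter that starts at
 * zero and ticks at TB_FREQ_HZ would wrap. Hypothetical helper. */
uint64_t tb_wrap_years(void)
{
	const uint64_t seconds = UINT64_MAX / TB_FREQ_HZ;

	return seconds / (365ULL * 24 * 60 * 60);
}
```

Under those assumptions the result is on the order of a thousand years, which is consistent with not treating a Time Base wrap as a practical concern.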
Re: powerpc: mitigate impact of decrementer reset
On 10/08/2014 08:22 AM, Michael Ellerman wrote:
> On Tue, 2014-07-10 at 19:13:24 UTC, Paul Clarke wrote:
>> The POWER ISA defines an always-running decrementer which can be used
>> to schedule interrupts after a certain time interval has elapsed.
>> The decrementer counts down at the same frequency as the Time Base,
>> which is 512 MHz. The maximum value of the decrementer is 0x7fffffff.
>> This works out to a maximum interval of about 4.19 seconds.
>>
>> If a larger interval is desired, the kernel will set the decrementer
>> to its maximum value and reset it after it expires (underflows)
>> a sufficient number of times until the desired interval has elapsed.
>>
>> The negative effect of this is that an unwanted latency spike will
>> impact normal processing at most every 4.19 seconds. On an IBM
>> POWER8-based system, this spike was measured at about 25-30
>> microseconds, much of which was basic, opportunistic housekeeping
>> tasks that could otherwise have waited.
>>
>> This patch short-circuits the reset of the decrementer, exiting after
>> the decrementer reset, but before the housekeeping tasks if the only
>> need for the interrupt is simply to reset it. After this patch,
>> the latency spike was measured at about 150 nanoseconds.
>
> Hi Paul,
>
> Thanks for the excellent changelog. But this patch makes me a bit nervous :)
>
> Do you know where the latency is coming from? Is it primarily the irq work?
>
> If so I'd prefer if we could move the short circuit into __timer_interrupt()
> itself. That way we'd still have the trace points usable, and it would
> hopefully result in less duplicated logic.

I agree, this is perhaps the better approach.

Regards
Preeti U Murthy

>
> cheers
RE: [PATCH] powerpc: mitigate impact of decrementer reset
Paul,

what if your tb wraps during the test?

> -----Original Message-----
> From: Linuxppc-dev [mailto:linuxppc-dev-bounces+heinz.wrobel=freescale@lists.ozlabs.org] On Behalf Of Paul Clarke
> Sent: Tuesday, October 07, 2014 21:13
> To: linuxppc-dev@lists.ozlabs.org
> Subject: [PATCH] powerpc: mitigate impact of decrementer reset
>
> The POWER ISA defines an always-running decrementer which can be used
> to schedule interrupts after a certain time interval has elapsed.
> The decrementer counts down at the same frequency as the Time Base,
> which is 512 MHz. The maximum value of the decrementer is 0x7fffffff.
> This works out to a maximum interval of about 4.19 seconds.
>
> If a larger interval is desired, the kernel will set the decrementer
> to its maximum value and reset it after it expires (underflows)
> a sufficient number of times until the desired interval has elapsed.
>
> The negative effect of this is that an unwanted latency spike will
> impact normal processing at most every 4.19 seconds. On an IBM
> POWER8-based system, this spike was measured at about 25-30
> microseconds, much of which was basic, opportunistic housekeeping
> tasks that could otherwise have waited.
>
> This patch short-circuits the reset of the decrementer, exiting after
> the decrementer reset, but before the housekeeping tasks if the only
> need for the interrupt is simply to reset it. After this patch,
> the latency spike was measured at about 150 nanoseconds.
>
> Signed-off-by: Paul A. Clarke
> ---
>  arch/powerpc/kernel/time.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 368ab37..962a06b 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -528,6 +528,7 @@ void timer_interrupt(struct pt_regs * regs)
>  {
>  	struct pt_regs *old_regs;
>  	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
> +	u64 now;
>
>  	/* Ensure a positive value is written to the decrementer, or else
>  	 * some CPUs will continue to take decrementer exceptions.
> @@ -550,6 +551,18 @@ void timer_interrupt(struct pt_regs * regs)
>  	 */
>  	may_hard_irq_enable();
>
> +	/* If this is simply the decrementer expiring (underflow) due to
> +	 * the limited size of the decrementer, and not a set timer,
> +	 * reset (if needed) and return
> +	 */
> +	now = get_tb_or_rtc();
> +	if (now < *next_tb) {

What if "now" and *next_tb are not on the same wrap count? They are both
modulo values AFAICS. Shouldn't this be right here more like a
"if ((*next_tb - now) < 2^63)" style test to check for deltas within the
range instead of absolute values?

> +		now = *next_tb - now;
> +		if (now <= DECREMENTER_MAX)
> +			set_dec((int)now);
> +		__get_cpu_var(irq_stat).timer_irqs_others++;
> +		return;
> +	}
>
>  #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
>  	if (atomic_read(&ppc_n_lost_interrupts) != 0)
> --
> 2.1.2.330.g565301e

BR,

Heinz
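The delta-style test suggested in this message could look roughly like the following C sketch. The function name `tb_before` is hypothetical, and the sketch additionally treats a zero delta as "not before" so that it matches the strict `now < *next_tb` comparison in the patch. It only gives a meaningful answer while the two values are less than 2^63 ticks apart.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-tolerant "is now earlier than next?" for free-running 64-bit
 * counters. Hypothetical helper, valid while the two values are less
 * than 2^63 ticks apart. */
bool tb_before(uint64_t now, uint64_t next)
{
	/* Unsigned subtraction is modulo 2^64, so a small positive
	 * distance survives a wrap between the two samples. */
	const uint64_t delta = next - now;

	return delta != 0 && delta < (1ULL << 63);
}
```

For example, with `now` two ticks below the 64-bit maximum and `next` equal to 3, the plain `now < next` comparison reports false, while `tb_before()` reports true because the modular distance is only 5 ticks.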
Re: powerpc: mitigate impact of decrementer reset
On Tue, 2014-07-10 at 19:13:24 UTC, Paul Clarke wrote:
> The POWER ISA defines an always-running decrementer which can be used
> to schedule interrupts after a certain time interval has elapsed.
> The decrementer counts down at the same frequency as the Time Base,
> which is 512 MHz. The maximum value of the decrementer is 0x7fffffff.
> This works out to a maximum interval of about 4.19 seconds.
>
> If a larger interval is desired, the kernel will set the decrementer
> to its maximum value and reset it after it expires (underflows)
> a sufficient number of times until the desired interval has elapsed.
>
> The negative effect of this is that an unwanted latency spike will
> impact normal processing at most every 4.19 seconds. On an IBM
> POWER8-based system, this spike was measured at about 25-30
> microseconds, much of which was basic, opportunistic housekeeping
> tasks that could otherwise have waited.
>
> This patch short-circuits the reset of the decrementer, exiting after
> the decrementer reset, but before the housekeeping tasks if the only
> need for the interrupt is simply to reset it. After this patch,
> the latency spike was measured at about 150 nanoseconds.

Hi Paul,

Thanks for the excellent changelog. But this patch makes me a bit nervous :)

Do you know where the latency is coming from? Is it primarily the irq work?

If so I'd prefer if we could move the short circuit into __timer_interrupt()
itself. That way we'd still have the trace points usable, and it would
hopefully result in less duplicated logic.

cheers
[PATCH] powerpc: mitigate impact of decrementer reset
The POWER ISA defines an always-running decrementer which can be used
to schedule interrupts after a certain time interval has elapsed.
The decrementer counts down at the same frequency as the Time Base,
which is 512 MHz. The maximum value of the decrementer is 0x7fffffff.
This works out to a maximum interval of about 4.19 seconds.

If a larger interval is desired, the kernel will set the decrementer
to its maximum value and reset it after it expires (underflows)
a sufficient number of times until the desired interval has elapsed.

The negative effect of this is that an unwanted latency spike will
impact normal processing at most every 4.19 seconds. On an IBM
POWER8-based system, this spike was measured at about 25-30
microseconds, much of which was basic, opportunistic housekeeping
tasks that could otherwise have waited.

This patch short-circuits the reset of the decrementer, exiting after
the decrementer reset, but before the housekeeping tasks if the only
need for the interrupt is simply to reset it. After this patch,
the latency spike was measured at about 150 nanoseconds.

Signed-off-by: Paul A. Clarke
---
 arch/powerpc/kernel/time.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 368ab37..962a06b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -528,6 +528,7 @@ void timer_interrupt(struct pt_regs * regs)
 {
 	struct pt_regs *old_regs;
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+	u64 now;

 	/* Ensure a positive value is written to the decrementer, or else
 	 * some CPUs will continue to take decrementer exceptions.
@@ -550,6 +551,18 @@ void timer_interrupt(struct pt_regs * regs)
 	 */
 	may_hard_irq_enable();

+	/* If this is simply the decrementer expiring (underflow) due to
+	 * the limited size of the decrementer, and not a set timer,
+	 * reset (if needed) and return
+	 */
+	now = get_tb_or_rtc();
+	if (now < *next_tb) {
+		now = *next_tb - now;
+		if (now <= DECREMENTER_MAX)
+			set_dec((int)now);
+		__get_cpu_var(irq_stat).timer_irqs_others++;
+		return;
+	}

 #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
 	if (atomic_read(&ppc_n_lost_interrupts) != 0)
--
2.1.2.330.g565301e
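The interval arithmetic in the changelog can be checked with a short C sketch. The two constants are taken from the changelog text (512 MHz Time Base, 31-bit maximum decrementer value). Both helper functions are hypothetical names introduced only for this illustration and are not part of the patch.

```c
#include <stdint.h>

/* Values taken from the changelog: a 512 MHz Time Base and a
 * maximum decrementer value of 0x7fffffff. */
#define TB_FREQ_HZ        512000000ULL
#define DECREMENTER_MAX_V 0x7fffffffULL

/* Longest interval a single decrementer load can cover, in whole
 * milliseconds. Hypothetical helper. */
uint64_t dec_max_interval_ms(void)
{
	return DECREMENTER_MAX_V * 1000ULL / TB_FREQ_HZ;
}

/* Number of decrementer loads needed to cover `ticks` timebase
 * ticks, rounding up. Hypothetical helper. */
uint64_t dec_loads_needed(uint64_t ticks)
{
	return (ticks + DECREMENTER_MAX_V - 1) / DECREMENTER_MAX_V;
}
```

A single load covers about 4194 ms, matching the "about 4.19 seconds" figure. A 10-second interval (5,120,000,000 ticks at the assumed frequency) would need three loads, so the decrementer underflows twice only to be reloaded before the interval actually ends.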