Hi, Thanks for the set, first of all. Reviewing and testing it.
On 01/07/20 03:10, Frederic Weisbecker wrote: > When a timer is enqueued with a negative delta (ie: expiry is below > base->clk), it gets added to the wheel as expiring now (base->clk). > > Yet the value that gets stored in base->next_expiry, while calling > trigger_dyntick_cpu(), is the initial timer->expires value. The > resulting state becomes: > > base->next_expiry < base->clk > > On the next timer enqueue, forward_timer_base() may accidentally > rewind base->clk. As a possible outcome, timers may expire way too > early, the worst case being that the highest wheel levels get spuriously > processed again. > > To prevent from that, make sure that base->next_expiry doesn't get below > base->clk. > > Signed-off-by: Frederic Weisbecker <frede...@kernel.org> > Cc: Peter Zijlstra <pet...@infradead.org> > Cc: Anna-Maria Gleixner <anna-ma...@linutronix.de> > Cc: Juri Lelli <juri.le...@redhat.com> > --- > kernel/time/timer.c | 17 ++++++++++++++--- > 1 file changed, 14 insertions(+), 3 deletions(-) > > diff --git a/kernel/time/timer.c b/kernel/time/timer.c > index 398e6eadb861..9a838d38dbe6 100644 > --- a/kernel/time/timer.c > +++ b/kernel/time/timer.c > @@ -584,7 +584,15 @@ trigger_dyntick_cpu(struct timer_base *base, struct > timer_list *timer) > * Set the next expiry time and kick the CPU so it can reevaluate the > * wheel: > */ > - base->next_expiry = timer->expires; > + if (time_before(timer->expires, base->clk)) { > + /* > + * Prevent from forward_timer_base() moving the base->clk > + * backward > + */ > + base->next_expiry = base->clk; > + } else { > + base->next_expiry = timer->expires; > + } > wake_up_nohz_cpu(base->cpu); > } > > @@ -896,10 +904,13 @@ static inline void forward_timer_base(struct timer_base > *base) > * If the next expiry value is > jiffies, then we fast forward to > * jiffies otherwise we forward to the next expiry value. > */ > - if (time_after(base->next_expiry, jnow)) > + if (time_after(base->next_expiry, jnow)) { > base->clk = jnow; > - else > + } else { > + if (WARN_ON_ONCE(time_before(base->next_expiry, base->clk))) > + return; I actually ported it to latest RT tree (v5.6.17-rt10) w/o conflicts, but hit this one above: ... 000: Kernel command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.6.17-rt10 root=/dev/mapper/rhel_rt--qe--04-root ro crashkernel=auto resume=/dev/mapper/rhel_rt--qe--04-swap rd.lvm.lv=rhel_rt-qe-04/root rd.lvm.lv=rhel_rt-qe-04/swap console=ttyS0,115200 000: mem auto-init: stack:off, heap alloc:off, heap free:off 000: Memory: 2102240K/134089036K available (12292K kernel code, 2449K rwdata, 4332K rodata, 2248K init, 15392K bss, 2278548K reserved, 0K cma-reserved) 000: random: get_random_u64 called from cache_random_seq_create+0x7c/0x140 with crng_init=0 000: SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=24, Nodes=2 000: Kernel/User page tables isolation: enabled 000: ftrace: allocating 37680 entries in 148 pages 000: ftrace: allocated 148 pages with 3 groups 000: rcu: Preemptible hierarchical RCU implementation. 000: rcu: RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=24. 000: rcu: RCU priority boosting: priority 1 delay 500 ms. 000: rcu: RCU_SOFTIRQ processing moved to rcuc kthreads. 000: No expedited grace period (rcu_normal_after_boot). 000: Tasks RCU enabled. 000: rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies. 000: rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=24 000: NR_IRQS: 524544, nr_irqs: 1432, preallocated irqs: 16 000: random: crng done (trusting CPU's manufacturer) 000: Console: colour VGA+ 80x25 000: printk: console [ttyS0] enabled 000: ACPI: Core revision 20200110 000: clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns 000: APIC: Switch to symmetric I/O mode setup 000: DMAR: Host address width 46 000: DMAR: DRHD base: 0x000000fbffc000 flags: 0x0 000: DMAR: dmar0: reg_base_addr fbffc000 ver 1:0 cap d2078c106f0466 ecap f020df 000: DMAR: DRHD base: 0x000000c7ffc000 flags: 0x1 000: DMAR: dmar1: reg_base_addr c7ffc000 ver 1:0 cap d2078c106f0466 ecap f020df 000: DMAR: RMRR base: 0x00000079190000 end: 0x00000079192fff 000: DMAR: RMRR base: 0x000000791f4000 end: 0x000000791f7fff 000: DMAR: RMRR base: 0x000000791de000 end: 0x000000791f3fff 000: DMAR: RMRR base: 0x000000791cb000 end: 0x000000791dbfff 000: DMAR: RMRR base: 0x000000791dc000 end: 0x000000791ddfff 000: DMAR: RMRR base: 0x0000005a661000 end: 0x0000005a6a0fff 000: DMAR-IR: IOAPIC id 10 under DRHD base 0xfbffc000 IOMMU 0 000: DMAR-IR: IOAPIC id 8 under DRHD base 0xc7ffc000 IOMMU 1 000: DMAR-IR: IOAPIC id 9 under DRHD base 0xc7ffc000 IOMMU 1 000: DMAR-IR: HPET id 0 under DRHD base 0xc7ffc000 000: DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping. 000: DMAR-IR: Enabled IRQ remapping in x2apic mode 000: x2apic enabled 000: Switched APIC routing to cluster x2apic. 000: ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1 000: clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x228d095820e, max_idle_ns: 440795295198 ns 000: ------------[ cut here ]------------ 000: WARNING: CPU: 0 PID: 0 at kernel/time/timer.c:897 add_timer_on+0x129/0x140 000: Modules linked in: 000: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.17-rt10 #1 000: Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/25/2017 000: RIP: 0010:add_timer_on+0x129/0x140 000: Code: 24 48 89 ef e8 e8 ca 7d 00 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 1f 48 83 c4 10 5b 5d 41 5c c3 48 89 45 48 eb ca 0f 0b <0f> 0b eb c4 e8 53 94 e9 ff e9 7b ff ff ff e8 64 c4 f6 ff 0f 1f 40 000: RSP: 0000:ffffffffb2c03e80 EFLAGS: 00010083 000: RAX: 00000000fffb6c25 RBX: ffffffffb3886540 RCX: 0000000000000000 000: RDX: 00000000fffb6c20 RSI: ffffffffb2c03e80 RDI: 0000000000000001 000: RBP: ffff92687fa19300 R08: 0000000000000000 R09: 0000000000000001 000: R10: ffffffffb2c5b4e0 R11: 3235393730343420 R12: 0000000000000000 000: R13: ffffffffb2cd42c0 R14: 0000000000000000 R15: ffff92687fa27f40 000: FS: 0000000000000000(0000) GS:ffff92687fa00000(0000) knlGS:0000000000000000 000: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 000: CR2: ffff92610ac01000 CR3: 00000008c920e001 CR4: 00000000000606b0 000: Call Trace: 000: clocksource_select_watchdog+0x144/0x1a0 000: __clocksource_register_scale+0x88/0xf0 000: tsc_init+0x1a1/0x268 000: start_kernel+0x4ae/0x56e 000: secondary_startup_64+0xb6/0xc0 000: ---[ end trace 0000000000000001 ]--- Guess you might be faster to understand what I'm missing. :-) Best, Juri