On Mon 2023-05-01 08:24:46, Douglas Anderson wrote:
> From: Colin Cross <ccr...@android.com>
>
> Implement a hardlockup detector that doesn't need any extra
> arch-specific support code to detect lockups. Instead of using
> something arch-specific we will use the buddy system, where each CPU
> watches out for another one. Specifically, each CPU will use its
> softlockup hrtimer to check that the next CPU is processing hrtimer
> interrupts by verifying that a counter is increasing.
>
> --- /dev/null
> +++ b/kernel/watchdog_buddy_cpu.c
> +int watchdog_nmi_enable(unsigned int cpu)
> +{
> +	/*
> +	 * The new CPU will be marked online before the first hrtimer interrupt
> +	 * runs on it.

It does not need to be the first hrtimer interrupt. The CPU might have
been offlined/onlined repeatedly. The counter might have any value.

> +	 * If another CPU tests for a hardlockup on the new CPU
> +	 * before it has run its first hrtimer, it will get a false positive.
> +	 * Touch the watchdog on the new CPU to delay the first check for at
> +	 * least 3 sampling periods to guarantee one hrtimer has run on the new
> +	 * CPU.
> +	 */
> +	per_cpu(watchdog_touch, cpu) = true;

We should also touch the next_cpu:

	/*
	 * We are going to check the next CPU. Our watchdog_hrtimer
	 * need not be zero if the CPU has already been online earlier.
	 * Touch the watchdog on the next CPU to avoid a false positive
	 * if we try to check it in less than 3 interrupts.
	 */
	next_cpu = watchdog_next_cpu(cpu);
	if (next_cpu < nr_cpu_ids)
		per_cpu(watchdog_touch, next_cpu) = true;

An alternative would be to clear watchdog_hrtimer. But that would
kind of also affect the softlockup detector.

> +	/* Match with smp_rmb() in watchdog_check_hardlockup() */
> +	smp_wmb();
> +	cpumask_set_cpu(cpu, &watchdog_cpus);
> +	return 0;
> +}
> +
> +void watchdog_nmi_disable(unsigned int cpu)
> +{
> +	unsigned int next_cpu = watchdog_next_cpu(cpu);
> +
> +	/*
> +	 * Offlining this CPU will cause the CPU before this one to start
> +	 * checking the one after this one. If this CPU just finished checking
> +	 * the next CPU and updating hrtimer_interrupts_saved, and then the
> +	 * previous CPU checks it within one sample period, it will trigger a
> +	 * false positive. Touch the watchdog on the next CPU to prevent it.
> +	 */
> +	if (next_cpu < nr_cpu_ids)
> +		per_cpu(watchdog_touch, next_cpu) = true;
> +	/* Match with smp_rmb() in watchdog_check_hardlockup() */
> +	smp_wmb();
> +	cpumask_clear_cpu(cpu, &watchdog_cpus);
> +}
> +

Best Regards,
Petr
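
For readers following the thread, the false positive discussed above can be
illustrated with a small userspace model. This is only a sketch under stated
assumptions, not the kernel code: struct cpu_state, next_online_cpu(),
cpu_online(), buddy_check() and false_positive_after_online() are
hypothetical names, and the model ignores memory barriers and all timing
details. It only shows what happens when a newly onlined CPU takes over
checking a buddy whose counter was sampled a moment ago by another CPU.

```c
#include <stdbool.h>
#include <string.h>

#define NR_CPUS 4	/* hypothetical size of the model */

/* Hypothetical per-CPU state, loosely modelled on the quoted patch. */
struct cpu_state {
	bool online;			/* stands in for the watchdog_cpus mask */
	bool touch;			/* stands in for watchdog_touch */
	unsigned long interrupts;	/* hrtimer interrupt counter */
	unsigned long saved;		/* counter value seen by the last check */
};

static struct cpu_state cpus[NR_CPUS];

/* Next online CPU after @cpu, wrapping around; NR_CPUS if there is none. */
static unsigned int next_online_cpu(unsigned int cpu)
{
	for (unsigned int i = 1; i < NR_CPUS; i++) {
		unsigned int candidate = (cpu + i) % NR_CPUS;

		if (cpus[candidate].online)
			return candidate;
	}
	return NR_CPUS;
}

/* Bring @cpu online; optionally touch the CPU it is going to check. */
static void cpu_online(unsigned int cpu, bool touch_next)
{
	cpus[cpu].touch = true;
	if (touch_next) {
		unsigned int next = next_online_cpu(cpu);

		if (next < NR_CPUS)
			cpus[next].touch = true;
	}
	cpus[cpu].online = true;
}

/* Returns true when @checker would report a hard lockup on its buddy. */
static bool buddy_check(unsigned int checker)
{
	unsigned int next = next_online_cpu(checker);

	if (next >= NR_CPUS)
		return false;
	if (cpus[next].touch) {
		/* A touched buddy skips exactly one check. */
		cpus[next].touch = false;
		return false;
	}
	if (cpus[next].interrupts == cpus[next].saved)
		return true;	/* counter did not advance since last check */
	cpus[next].saved = cpus[next].interrupts;
	return false;
}

/* CPU 1 comes online right after CPU 0 has sampled CPU 2's counter. */
static bool false_positive_after_online(bool touch_next)
{
	memset(cpus, 0, sizeof(cpus));
	cpus[0].online = true;
	cpus[2].online = true;
	cpus[2].interrupts = 10;	/* CPU 2 has been running for a while */

	buddy_check(0);			/* CPU 0 records CPU 2's counter */
	cpu_online(1, touch_next);	/* CPU 1 now owns the check of CPU 2 */
	return buddy_check(1);		/* runs before CPU 2's timer fires again */
}
```

In this model, false_positive_after_online(false) reports a lockup on a
healthy CPU 2, while false_positive_after_online(true) skips that first
check because the new CPU touched its buddy when it came online.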