On Mon 2023-05-01 08:24:46, Douglas Anderson wrote:
> From: Colin Cross <ccr...@android.com>
> 
> Implement a hardlockup detector that doesn't need any extra
> arch-specific support code to detect lockups. Instead of using
> something arch-specific we will use the buddy system, where each CPU
> watches out for another one. Specifically, each CPU will use its
> softlockup hrtimer to check that the next CPU is processing hrtimer
> interrupts by verifying that a counter is increasing.
> 
> --- /dev/null
> +++ b/kernel/watchdog_buddy_cpu.c
> +int watchdog_nmi_enable(unsigned int cpu)
> +{
> +     /*
> +      * The new CPU will be marked online before the first hrtimer interrupt
> +      * runs on it.

It does not need to be the first hrtimer interrupt. The CPU might have
been offlined/onlined repeatedly. The counter might have any value.

> +      * If another CPU tests for a hardlockup on the new CPU
> +      * before it has run its first hrtimer, it will get a false positive.
> +      * Touch the watchdog on the new CPU to delay the first check for at
> +      * least 3 sampling periods to guarantee one hrtimer has run on the new
> +      * CPU.
> +      */
> +     per_cpu(watchdog_touch, cpu) = true;

We should also touch next_cpu:

        /*
         * We are going to check the next CPU. Our watchdog_hrtimer
         * need not be zero if the CPU has already been online earlier.
         * Touch the watchdog on the next CPU to avoid a false positive
         * if we try to check it in fewer than 3 interrupts.
         */
        next_cpu = watchdog_next_cpu(cpu);
        if (next_cpu < nr_cpu_ids)
                per_cpu(watchdog_touch, next_cpu) = true;

An alternative would be to clear watchdog_hrtimer, but that would
also affect the softlockup detector.
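
To illustrate the false positive, here is a rough user-space model of
the race. It is only a sketch, not kernel code. The array names mirror
the patch, but NR_CPUS, cpu_watched[], model_next_cpu(), model_check()
and model_enable() are hypothetical. It also assumes that a touch makes
the checker skip exactly one check, which is an assumption about
watchdog_check_hardlockup() since that function is not quoted here.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical size of the model */

/* Per-CPU state modeled as plain arrays; the kernel uses per-CPU variables. */
static unsigned long hrtimer_interrupts[NR_CPUS];	/* bumped by the CPU's own hrtimer */
static unsigned long hrtimer_interrupts_saved[NR_CPUS];	/* last value seen by a checker */
static bool watchdog_touch[NR_CPUS];
static bool cpu_watched[NR_CPUS];	/* stands in for the watchdog_cpus mask */

/* Next watched CPU after @cpu, wrapping around; NR_CPUS if there is no other. */
static unsigned int model_next_cpu(unsigned int cpu)
{
	for (unsigned int i = 1; i < NR_CPUS; i++) {
		unsigned int c = (cpu + i) % NR_CPUS;

		if (cpu_watched[c])
			return c;
	}
	return NR_CPUS;
}

/* One buddy check of @target; returns true when a lockup would be reported. */
static bool model_check(unsigned int target)
{
	unsigned long now;

	assert(target < NR_CPUS);
	now = hrtimer_interrupts[target];

	/* Assumption: a touch is consumed and skips this single check. */
	if (watchdog_touch[target]) {
		watchdog_touch[target] = false;
		return false;
	}

	/* No progress since the last check by whoever checked before us. */
	if (hrtimer_interrupts_saved[target] == now)
		return true;

	hrtimer_interrupts_saved[target] = now;
	return false;
}

/* Bring @cpu online; @touch_next selects the behavior proposed above. */
static void model_enable(unsigned int cpu, bool touch_next)
{
	unsigned int next;

	watchdog_touch[cpu] = true;
	cpu_watched[cpu] = true;

	next = model_next_cpu(cpu);
	if (touch_next && next < NR_CPUS)
		watchdog_touch[next] = true;
}
```

In this model, if CPU 0 has just checked CPU 2 and CPU 1 then comes
online and checks CPU 2 before its next hrtimer interrupt,
model_check(2) reports a lockup unless model_enable() touched CPU 2.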


> +     /* Match with smp_rmb() in watchdog_check_hardlockup() */
> +     smp_wmb();
> +     cpumask_set_cpu(cpu, &watchdog_cpus);
> +     return 0;
> +}
> +
> +void watchdog_nmi_disable(unsigned int cpu)
> +{
> +     unsigned int next_cpu = watchdog_next_cpu(cpu);
> +
> +     /*
> +      * Offlining this CPU will cause the CPU before this one to start
> +      * checking the one after this one. If this CPU just finished checking
> +      * the next CPU and updating hrtimer_interrupts_saved, and then the
> +      * previous CPU checks it within one sample period, it will trigger a
> +      * false positive. Touch the watchdog on the next CPU to prevent it.
> +      */
> +     if (next_cpu < nr_cpu_ids)
> +             per_cpu(watchdog_touch, next_cpu) = true;
> +     /* Match with smp_rmb() in watchdog_check_hardlockup() */
> +     smp_wmb();
> +     cpumask_clear_cpu(cpu, &watchdog_cpus);
> +}
> +

Best Regards,
Petr

