On Thu, Jan 10, 2013 at 6:02 AM, Don Zickus <dzic...@redhat.com> wrote:
> On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
>> Emulate NMIs on systems where they are not available by using timer
>> interrupts on other cpus. Each cpu will use its softlockup hrtimer
>> to check that the next cpu is processing hrtimer interrupts by
>> verifying that a counter is increasing.
>>
>> This patch is useful on systems where the hardlockup detector is not
>> available due to a lack of NMIs, for example most ARM SoCs.
>
> I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
> reserving an IRQ line as 'special' (can not be masked). Not sure if that
> is something worth looking at here (or even possible).
>
>> Without this patch any cpu stuck with interrupts disabled can
>> cause a hardware watchdog reset with no debugging information,
>> but with this patch the kernel can detect the lockup and panic,
>> which can result in useful debugging info.
>
> <SNIP>
>> +#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
>> +static int is_hardlockup_other_cpu(int cpu)
>> +{
>> +	unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
>> +
>> +	if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
>> +		return 1;
>> +
>> +	per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
>> +	return 0;
>
> Will this race with the other cpu you are checking? For example if cpuA
> just updated its hrtimer_interrupts_saved and cpuB goes to check cpuA's
> hrtimer_interrupts_saved, it seems possible that cpuB could falsely assume
> cpuA is stuck?

cpuA doesn't update its own hrtimer_interrupts_saved, cpuB does.
However, there may be a similar race condition during hotplug: if cpuB
updates hrtimer_interrupts_saved for cpuA and then goes offline, cpuC
may check cpuA and see that hrtimer_interrupts_saved ==
hrtimer_interrupts. I think this can be solved by setting
watchdog_nmi_touch for the next cpu when a cpu goes online or offline.

>> +}
>> +
>> +static void watchdog_check_hardlockup_other_cpu(void)
>> +{
>> +	int cpu;
>> +	cpumask_t cpus = watchdog_cpus;
>> +
>> +	/*
>> +	 * Test for hardlockups every 3 samples. The sample period is
>> +	 * watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
>> +	 * watchdog_thresh (over by 20%).
>> +	 */
>> +	if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
>> +		return;
>> +
>> +	/* check for a hardlockup on the next cpu */
>> +	cpu = cpumask_next(smp_processor_id(), &cpus);
>> +	if (cpu >= nr_cpu_ids)
>> +		cpu = cpumask_first(&cpus);
>> +	if (cpu == smp_processor_id())
>> +		return;
>> +
>> +	smp_rmb();
>> +
>> +	if (per_cpu(watchdog_nmi_touch, cpu) == true) {
>> +		per_cpu(watchdog_nmi_touch, cpu) = false;
>> +		return;
>> +	}
>
> Same race here. Usually touch_nmi_watchdog is reserved for those
> functions that plan on disabling interrupts for a while. cpuB could set
> cpuA's watchdog_nmi_touch to false here expecting not to revisit this
> variable for another couple of seconds. While cpuA could read this
> variable milliseconds later after cpuB sets it and falsely assume there is
> a lockup?
>
> Perhaps I am misreading the code?

Again, cpuA won't ever read its own watchdog_nmi_touch variable, only
cpuB will. The only updates cpuA makes for itself are incrementing
hrtimer_interrupts and setting watchdog_nmi_touch to true. Updating
hrtimer_interrupts_saved and setting watchdog_nmi_touch to false are
done by the cpu watching over cpuA, so the only races here occur when a
cpu goes offline and a different cpu starts watching over cpuA.

> If not, I don't have a good idea on how to solve those races off the top of my
> head unfortunately.
>
> Cheers,
> Don