On Thu, Jan 10, 2013 at 9:57 PM, Liu, Chuansheng <[email protected]> wrote:
>
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Colin
>> Cross
>> Sent: Friday, January 11, 2013 1:34 PM
>> To: Liu, Chuansheng
>> Cc: [email protected]; Andrew Morton; Don Zickus; Ingo Molnar;
>> Thomas Gleixner; [email protected]
>> Subject: Re: [PATCH] hardlockup: detect hard lockups without NMIs using
>> secondary cpus
>>
>> On Thu, Jan 10, 2013 at 5:39 PM, Liu, Chuansheng
>> <[email protected]> wrote:
>> >
>> >
>> >> -----Original Message-----
>> >> From: Colin Cross [mailto:[email protected]]
>> >> Sent: Thursday, January 10, 2013 9:58 AM
>> >> To: [email protected]
>> >> Cc: Andrew Morton; Don Zickus; Ingo Molnar; Thomas Gleixner; Liu,
>> >> Chuansheng; [email protected]; Colin Cross
>> >> Subject: [PATCH] hardlockup: detect hard lockups without NMIs using
>> >> secondary cpus
>> >>
>> >> Emulate NMIs on systems where they are not available by using timer
>> >> interrupts on other cpus.  Each cpu will use its softlockup hrtimer
>> >> to check that the next cpu is processing hrtimer interrupts by
>> >> verifying that a counter is increasing.
>> >>
>> >> This patch is useful on systems where the hardlockup detector is not
>> >> available due to a lack of NMIs, for example most ARM SoCs.
>> >> Without this patch any cpu stuck with interrupts disabled can
>> >> cause a hardware watchdog reset with no debugging information,
>> >> but with this patch the kernel can detect the lockup and panic,
>> >> which can result in useful debugging info.
>> >>
>> >> Signed-off-by: Colin Cross <[email protected]>
>> >> +static void watchdog_check_hardlockup_other_cpu(void)
>> >> +{
>> >> +	int cpu;
>> >> +	cpumask_t cpus = watchdog_cpus;
>> >> +
>> >> +	/*
>> >> +	 * Test for hardlockups every 3 samples.  The sample period is
>> >> +	 *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
>> >> +	 *  watchdog_thresh (over by 20%).
>> >> +	 */
>> >> +	if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
>> >> +		return;
>> >> +
> Another feeling is about __this_cpu_read(hrtimer_interrupts) % 3 != 0,
> It will cause the actual timeout value for hard lockup detection is not very
> fix, or even very short.
> Sometimes using 3 samples can detect the lockup case, but sometimes 1 sample.
> Is it the case?

I'm not sure what you mean.  The mod 3 causes every 3rd timer
interrupt (every 12 seconds, assuming watchdog_thresh = 10) to compare
hrtimer_interrupts against hrtimer_interrupts_saved and then update
the saved value.  The sampling should be fixed and very accurate.  It
will cause a panic/warning between 12 and 24 seconds after a cpu stops
processing timer interrupts, depending on the alignment of the
hrtimers between the two cpus.

> And in NMI case, the NMI interrupt is coming at least every watchdog_thresh.

The NMI will fire every 10 seconds instead of 12, meaning the
panic/warning will occur between 10 and 20 seconds after a cpu stops
processing timer interrupts, depending on the alignment of the NMI
with the hrtimer, but otherwise my patch should behave very similarly.
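
To make the 12 to 24 second window concrete, here is a rough userspace
model of the timing described above.  It is only a sketch, not code
from the patch: simulate_detection_latency() and its parameters are
hypothetical names, time is counted in whole seconds, and both timers
are assumed to share the same sample period (watchdog_thresh * 2 / 5)
with a fixed phase offset between the two cpus.

```c
#include <assert.h>

/*
 * Userspace model only, not kernel code.  All names are hypothetical.
 *
 * The checking cpu's timer fires every sample period.  The watched
 * cpu's timer fires with the same period, shifted by `phase` seconds,
 * until the cpu locks up at `lockup_time`.  `lockup_time` is assumed
 * to be later than the first check.
 *
 * Returns the number of seconds between the watched cpu's last timer
 * interrupt and the check that reports the lockup, or -1 if nothing
 * is reported within the simulated time.
 */
static int simulate_detection_latency(int watchdog_thresh, int phase,
				      int lockup_time)
{
	int period = watchdog_thresh * 2 / 5;	/* sample period */
	unsigned long checker_irqs = 0;	/* hrtimer_interrupts, checking cpu */
	unsigned long watched_irqs = 0;	/* hrtimer_interrupts, watched cpu */
	unsigned long saved = 0;	/* models hrtimer_interrupts_saved */
	int last_irq = 0;
	int t;

	assert(period > 0 && phase >= 0 && phase < period);

	for (t = 1; t <= lockup_time + 6 * period; t++) {
		/* The watched cpu handles its timer until it locks up. */
		if (t < lockup_time && t % period == phase) {
			watched_irqs++;
			last_irq = t;
		}

		if (t % period != 0)
			continue;

		/* The checker's timer fires; only every 3rd sample checks. */
		checker_irqs++;
		if (checker_irqs % 3 != 0)
			continue;

		if (watched_irqs == saved)
			return t - last_irq;	/* counter did not move */
		saved = watched_irqs;
	}

	return -1;
}
```

With watchdog_thresh = 10 the sample period is 4 seconds and the check
runs every 12 seconds.  In this model a cpu whose last timer interrupt
lands on the same tick as a check is reported 12 seconds later, and
one whose last interrupt lands just after a check is reported almost
24 seconds later, so the latency depends only on the alignment and
never drops to a single sample.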

