On Tue, Mar 15, 2016 at 11:50 AM, Gratian Crisan <gratian.cri...@ni.com> wrote:
> The clocksource watchdog can falsely trigger and disable the main
> clocksource when the watchdog wraps around.
>
> The reason is that an interrupt storm and/or high priority (FIFO/RR) tasks
> can preempt the timer softirq long enough for the watchdog to wrap around
> if it has a limited number of bits available by comparison to the main
> clocksource. One observed example is on an Intel Baytrail platform where TSC
> is the main clocksource, HPET is disabled due to a hardware bug and acpi_pm
> gets selected as the watchdog clocksource.
>
> Calculate the maximum number of nanoseconds the watchdog clocksource can
> represent without overflow and do not disqualify the main clocksource if
> the delta since the last time we have checked exceeds the measurement
> capabilities of the watchdog clocksource.
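
For reference, the check described in the quoted text can be sketched in
user-space C roughly as follows. This is only an illustration, not the code
from the patch: struct cs_model, make_cs(), wd_max_ns() and
wd_delta_unmeasurable() are hypothetical names, and the acpi_pm parameters
(24-bit counter, 3.579545 MHz) and the shift of 22 are assumptions picked
for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of a clocksource, not the kernel struct. */
struct cs_model {
	uint64_t mask;   /* valid counter bits, e.g. 24 bits for acpi_pm */
	uint32_t mult;   /* cycles -> ns multiplier */
	uint32_t shift;  /* cycles -> ns shift */
};

/* Build a model from a frequency and a counter width (assumed inputs). */
static struct cs_model make_cs(uint64_t freq_hz, unsigned int bits,
			       uint32_t shift)
{
	struct cs_model cs;

	assert(bits >= 1 && bits <= 64);
	cs.mask = (bits == 64) ? ~0ULL : ((1ULL << bits) - 1);
	cs.shift = shift;
	cs.mult = (uint32_t)((1000000000ULL << shift) / freq_hz);
	return cs;
}

/* Longest interval the watchdog can measure before its counter wraps. */
static uint64_t wd_max_ns(const struct cs_model *wd)
{
	return (wd->mask * wd->mult) >> wd->shift;
}

/*
 * Returns 1 when the delta seen by the main clocksource is longer than
 * anything the watchdog could have measured, i.e. the comparison would
 * be skipped rather than used to mark the main clocksource unstable.
 */
static int wd_delta_unmeasurable(const struct cs_model *wd,
				 uint64_t cs_delta_ns)
{
	return cs_delta_ns > wd_max_ns(wd);
}
```

With these assumed parameters the watchdog wraps after roughly 4.7 seconds,
so any check that is delayed longer than that would be skipped.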

Sorry for not getting back to you sooner on this. You managed to send
these both out while I was at a conference and on vacation, and so they
were deep in the mail backlog. :)

So I'm sympathetic to this issue, because I remember seeing similar
problems w/ runaway SCHED_FIFO tasks w/ PREEMPT_RT. However, it's really
difficult to create a solution without opening new cases where bad
clocksources will be mis-identified as good. Your solution seems to
suffer from this as well: measuring the time passed with a known bad
clocksource can easily result in large deltas, which will be ignored if
the watchdog has a short interval.

A previous effort on this was made here, and there's a resulting thread
that didn't come to resolution:
https://lkml.org/lkml/2015/8/17/542

Way back I had tried to come up with an approach where if the time delta
was large, it was divided by the watchdog interval, and then we just
compared the remainder with the current watchdog delta to see if they
matched (closely enough). Unfortunately this didn't work out for me
then, but perhaps it deserves a second try?

thanks
-john
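
A rough user-space sketch of the remainder idea, to make the comparison
concrete. This is an illustration under assumptions, not the code that was
tried back then: deltas_match_mod_wrap() is a hypothetical name, "watchdog
interval" is read here as the watchdog's wrap period in nanoseconds, and
the caller is assumed to pass a watchdog delta that is already smaller than
that period.

```c
#include <stdint.h>

/*
 * Compare a (possibly large) main clocksource delta against a watchdog
 * delta that may have wrapped any number of times.
 *
 *   cs_ns     - delta measured by the main clocksource
 *   wd_ns     - delta measured by the watchdog, assumed < wrap_ns
 *   wrap_ns   - watchdog wrap period, assumed non-zero
 *   thresh_ns - allowed mismatch
 *
 * Returns 1 if the two deltas agree modulo the wrap period.
 */
static int deltas_match_mod_wrap(uint64_t cs_ns, uint64_t wd_ns,
				 uint64_t wrap_ns, uint64_t thresh_ns)
{
	/* Drop the whole wraps the watchdog could not have counted. */
	uint64_t rem = cs_ns % wrap_ns;
	uint64_t diff = rem > wd_ns ? rem - wd_ns : wd_ns - rem;

	/* Distance on a circle: values near 0 and near wrap_ns are close. */
	if (diff > wrap_ns - diff)
		diff = wrap_ns - diff;

	return diff <= thresh_ns;
}
```

One limitation follows directly from the arithmetic: a main clocksource
whose error happens to be close to a whole multiple of the wrap period
would still pass this check, so it narrows the window for
mis-identification rather than closing it.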