On Thu, Apr 7, 2016 at 1:14 AM, Gratian Crisan <gratian.cri...@ni.com> wrote:
> John Stultz writes:
>> So I'm sympathetic to this issue, because I remember seeing similar
>> problems w/ runaway SCHED_FIFO tasks w/ PREEMPT_RT.
>
> Yeah, a runaway rt thread can easily do it. That's just bad design. In
> our case it was a bit more subtle bc. it was a combination of high
> priority interrupts and rt threads that would occasionally stack up to
> delay the timer softirq long enough to cause the watchdog wrap.

So in the last discussion, I believe Thomas and others were skeptical
because we really shouldn't be blocking tasks from running for such a
long time. Rather than trying to turn off the watchdog, they were
suggesting we ensure we don't get into a state where things are
delayed so unexpectedly long.

>> However, it's really difficult to create a solution without opening new
>> cases where bad clocksources will be mis-identified as good (which
>> your solution seems to suffer as well, measuring the time passed with a
>> known bad clocksource can easily result in large deltas, which will be
>> ignored if the watchdog has a short interval).
>
> Fair point. Ultimately you have to trust one of the clocksources. I
> guess I was naive in thinking that the main clocksource can't drift more
> than what the watchdog clocksource can measure within the
> WATCHDOG_INTERVAL. I'm glad I don't have to deal with hardware that
> lobotomized.

Another thought might be to try to add a third longer-running clock
into the mix. Possibly a very rough fallback check against something
like the RTC to see if the interval was really long enough to have the
watchdog wrap.

> Would a simple solution that exposes the config option for the
> clocksource watchdog[1] (and defaults it to on) be an acceptable
> alternative? It will work for us because we test the stability of the
> main clocksource - part of the hardware bring-up.

So there is already the tsc=reliable boot option, which I believe
disables the watchdog. So I'm not sure the build time option makes the
most sense.

>> A previous effort on this was made here, and there's a resulting
>> thread that didn't come to resolution:
>> https://lkml.org/lkml/2015/8/17/542
>
> Sorry I've missed it.
>
>> Way back I had tried to come up with an approach where if the time
>> delta was large, it was divided by the watchdog interval, and then we
>> just compared the remainder with the current watchdog delta to see if
>> they matched (closely enough). Unfortunately this didn't work out for
>> me then, but perhaps it deserves a second try?
>
> I've entertained that idea too but I think I was trying to optimize
> things too early and do everything with the mult/shift math. That first
> attempt failed but I do need to try harder because it would be a better
> general solution.

Yea. I'd much prefer a general solution.

thanks
-john
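
For illustration only, the remainder comparison described in the quoted
text might look roughly like the sketch below. This is plain userspace C,
not kernel code and not the approach either poster actually tried. It
assumes the "interval" to divide by is the watchdog counter's wrap period
expressed in nanoseconds, which is one reading of the description. The
function name, the parameters, and the use of a plain 64-bit modulo
(rather than mult/shift math) are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether the delta measured by the clocksource
 * under test (cs_nsec) agrees with the delta measured by the watchdog
 * (wd_nsec), allowing for the watchdog counter having wrapped an unknown
 * number of times in between.
 *
 * wd_wrap_nsec   - assumed wrap period of the watchdog counter, in ns
 * threshold_nsec - largest disagreement still treated as a match
 */
static bool deltas_match_modulo_wrap(uint64_t cs_nsec, uint64_t wd_nsec,
                                     uint64_t wd_wrap_nsec,
                                     uint64_t threshold_nsec)
{
    uint64_t rem, diff;

    if (wd_wrap_nsec == 0)
        return false;               /* no usable wrap period */

    /* Reduce both deltas into [0, wd_wrap_nsec). */
    rem = cs_nsec % wd_wrap_nsec;
    wd_nsec %= wd_wrap_nsec;

    /* Absolute difference of the two reduced values. */
    diff = rem > wd_nsec ? rem - wd_nsec : wd_nsec - rem;

    /*
     * The reduced values live on a circle, so a pair sitting just either
     * side of the wrap point is close even though diff looks large.
     */
    if (diff > wd_wrap_nsec - diff)
        diff = wd_wrap_nsec - diff;

    return diff <= threshold_nsec;
}
```

With a made-up 4 s wrap period, a 9 s clocksource delta reduces to 1 s
and matches a 1 s watchdog delta. The sketch also shows the weakness
raised earlier in the thread: an error in the clocksource under test
that happens to be close to a whole multiple of the wrap period reduces
to nearly zero and goes undetected.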