On 1/20/16, Thomas Gleixner <t...@linutronix.de> wrote:
> Jeff,
>
> On Wed, 20 Jan 2016, Thomas Gleixner wrote:
>> On Tue, 19 Jan 2016, Jeff Merkey wrote:
>> > Nasty bug but trivial fix for this. What happens here is RAX (nsecs)
>> > gets set to a huge value (RAX = 0x17AE7F57C671EA7D) and passed through
>>
>> And how exactly does that happen?
>>
>> 0x17AE7F57C671EA7D = 1.70644e+18 nsec
>>                    = 1.70644e+09 sec
>>                    = 2.84407e+07 min
>>                    = 474011 hrs
>>                    = 19750.5 days
>>                    = 54.1109 years
>>
>> That's the real issue, not what you are trying to 'fix' in
>> timespec_add_ns()
>
> And that's caused by stopping the whole machine for 20 minutes. It violates
> the assumption of the timekeeping core, that the maximum time which is
> between two updates of the core is < 5-10min. So that insane large number
> is caused by a mult overrun when converting the time delta to nanoseconds.
>
> You can find that limit via:
>
> # dmesg | grep tsc | grep max_idle_ns
> [ 5.242683] clocksource tsc: mask: 0xffffffffffffffff max_cycles:
> 0x21139a22526, max_idle_ns: 440795252169 ns
>
> So on that machine the limit is:
>
> 440795252169 nsec
> 440.795 sec
> 7.34659 min
>
> And before you ask or come up with patches: No, we are not going to add
> anything to the core timekeeping code to work around this limitation simply
> because its going to add overhead to a performance sensitive code path for
> a very limited value.

Given how fragile that code appears to be, this is reasonable.

>
> Keeping a machine stopped for 20 minutes will make a lot of other things
> unhappy, so introducing a 'fix' for that particular issue is just silly.
>

You know, what's needed here is some form of touch function to keep this
system updated while spinning in the debugger. That would solve it. I can
maintain a fix for that locally. I debugged the soft hang in systemd last
night, and I discovered that it's all related to this function returning
bogus time (systemd was doing a system call that eventually made its way
to ktime_get_ts64() and got garbage back). When this wraps it causes all
sorts of bad stuff.

Do you have any suggestions on how a touch function could be coded to keep
this subsystem updated while the debugger is active? There are already a
few of them that I, as well as kgdb and kdb, have to call to get around
some of this.

/* Touch the existing watchdogs while the debugger holds the system. */
void mdb_watchdogs(void)
{
	touch_softlockup_watchdog_sync();
	clocksource_touch_watchdog();

#if defined(CONFIG_TREE_RCU)
	rcu_cpu_stall_reset();
#endif

	touch_nmi_watchdog();

#ifdef CONFIG_HARDLOCKUP_DETECTOR
	touch_hardlockup_watchdog();
#endif
	return;
}

As you can see, there are already quite a few subsystems that manage this
problem of debuggers holding the system in stasis.

Jeff

> Thanks,
>
> tglx
>

Well, that explains it.
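
For anyone following the thread, the mult overrun described in the quoted
explanation can be illustrated in user space. The sketch below is not the
kernel's code. It only mirrors the usual (cycles * mult) >> shift
conversion, and DEMO_MULT, DEMO_SHIFT, DEMO_HZ and the helper names are
hypothetical, chosen to approximate a 2.3 GHz counter rather than the
parameters of the machine above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameters (assumptions, not taken from the thread):
 * mult / 2^shift is roughly 0.435 ns per cycle, i.e. about 2.3 GHz. */
#define DEMO_MULT  7300000U
#define DEMO_SHIFT 24U
#define DEMO_HZ    2298000000ULL

/* Convert a cycle delta to nanoseconds the plain 64-bit way.
 * The multiplication wraps modulo 2^64 once cycles * mult no longer fits. */
static uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * (uint64_t)mult) >> shift;
}

/* Largest cycle delta whose product with mult still fits in 64 bits. */
static uint64_t max_safe_cycles(uint32_t mult)
{
	assert(mult != 0);
	return UINT64_MAX / mult;
}

/* Nonzero if converting this delta with cyc2ns() would wrap. */
static int cyc2ns_overflows(uint64_t cycles, uint32_t mult)
{
	return cycles > max_safe_cycles(mult);
}
```

With these assumed parameters, a 7 minute delta (420 * DEMO_HZ cycles)
stays below max_safe_cycles() and converts correctly, while a 20 minute
delta (1200 * DEMO_HZ cycles) exceeds it, so the product wraps and
cyc2ns() returns a value unrelated to the real elapsed time. The exact
garbage seen in the kernel depends on the real mult/shift and the code
path involved, so this only shows the mechanism.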