Re: [BUG REPORT] ktime_get_ts64 causes Hard Lockup

Jeff Merkey Wed, 20 Jan 2016 08:41:11 -0800

On 1/20/16, Thomas Gleixner <t...@linutronix.de> wrote:
> Jeff,
>
> On Wed, 20 Jan 2016, Thomas Gleixner wrote:
>> On Tue, 19 Jan 2016, Jeff Merkey wrote:
>> > Nasty bug but trivial fix for this.  What happens here is RAX (nsecs)
>> > gets set to a huge value (RAX = 0x17AE7F57C671EA7D) and passed through
>>
>> And how exactly does that happen?
>>
>> 0x17AE7F57C671EA7D = 1.70644e+18  nsec
>>                 = 1.70644e+09  sec
>>                 = 2.84407e+07  min
>>                 = 474011       hrs
>>                 = 19750.5      days
>>                 = 54.1109      years
>>
>> That's the real issue, not what you are trying to 'fix' in
>> timespec_add_ns()
>
> And that's caused by stopping the whole machine for 20 minutes. It violates
> the assumption of the timekeeping core, that the maximum time which is
> between two updates of the core is < 5-10min. So that insane large number is
> caused by a mult overrun when converting the time delta to nanoseconds.
>
> You can find that limit via:
>
> # dmesg | grep tsc | grep max_idle_ns
> [    5.242683] clocksource tsc: mask: 0xffffffffffffffff max_cycles: 0x21139a22526, max_idle_ns: 440795252169 ns
>
> So on that machine the limit is:
>
>    440795252169 nsec
>    440.795    sec
>    7.34659    min
>
> And before you ask or come up with patches: No, we are not going to add
> anything to the core timekeeping code to work around this limitation simply
> because it's going to add overhead to a performance sensitive code path for
> a very limited value.


Given how fragile that code appears to be, this is reasonable.
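
To make the mult overrun concrete for myself, here is a rough user-space
sketch of the multiply-and-shift conversion.  The counter frequency, mult and
shift below are hypothetical values I picked for illustration, not the ones
from the machine in the dmesg line above, and cyc2ns() and max_safe_cycles()
are just names for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values, not read from any real machine: a 2.5 GHz counter
 * with shift = 24, so mult = (10^9 << 24) / 2.5e9, rounded down. */
#define HYP_HZ    2500000000ULL
#define HYP_SHIFT 24u
#define HYP_MULT  6710886u

/* Cycles -> nanoseconds with a plain 64-bit multiply and shift.  The product
 * silently wraps modulo 2^64 once cycles exceeds UINT64_MAX / mult. */
static uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    return (cycles * mult) >> shift;
}

/* Largest cycle delta that still converts without wrapping. */
static uint64_t max_safe_cycles(uint32_t mult)
{
    assert(mult != 0);
    return UINT64_MAX / mult;
}
```

With these made-up numbers the safe window is roughly 18 minutes of cycles.
A 5 minute delta converts to about 300 seconds as expected, while a 20 minute
delta wraps in the multiply and comes back as a value that has nothing to do
with the real elapsed time.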

>
> Keeping a machine stopped for 20 minutes will make a lot of other things
> unhappy, so introducing a 'fix' for that particular issue is just silly.
>

What's needed here is some form of touch function to keep this system
updated while spinning in the debugger.  That would solve it, and I can
maintain a fix for that locally.  I debugged the soft hang in systemd last
night and discovered that it's all related to this function returning bogus
time: systemd was doing a system call that eventually made its way to
ktime_get_ts64 and got garbage returned.  When this wraps, it causes all
sorts of bad stuff.

Do you have any suggestions on how a touch function could be coded to keep
this subsystem updated while the debugger is active?  There are already a few
of them I have to call, as kgdb and kdb do, to get around some of this.

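void mdb_watchdogs(void)
{
    /* Quiet each detector that would otherwise fire while the debugger
     * holds the system stopped. */
    touch_softlockup_watchdog_sync();   /* soft-lockup detector */
    clocksource_touch_watchdog();       /* clocksource watchdog */

#if defined(CONFIG_TREE_RCU)
    rcu_cpu_stall_reset();              /* RCU CPU stall warnings */
#endif

    touch_nmi_watchdog();               /* NMI watchdog */
#ifdef CONFIG_HARDLOCKUP_DETECTOR
    touch_hardlockup_watchdog();        /* hard-lockup detector */
#endif
}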

As you can see, there are already quite a few subsystems that manage this
problem of debuggers holding the system in stasis.
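
For the timekeeping side, the kind of touch I have in mind looks roughly like
the user-space model below.  This is only a sketch of the idea, not kernel
code: struct tk_model, tk_touch() and tk_read_ns() are hypothetical names, and
the frequency, mult and shift are made-up values.  The point is that folding
the elapsed cycles into the base often enough keeps every delta that gets
converted short.

```c
#include <stdint.h>

/* Hypothetical 2.5 GHz counter; mult = (10^9 << 24) / 2.5e9, rounded down. */
#define HYP_HZ    2500000000ULL
#define HYP_SHIFT 24u
#define HYP_MULT  6710886u

/* Toy model of a timekeeper: a nanosecond base plus the cycle count at
 * which that base was last brought up to date. */
struct tk_model {
    uint64_t cycle_last;  /* counter value at the last update */
    uint64_t base_ns;     /* nanoseconds accumulated up to cycle_last */
    uint32_t mult;
    uint32_t shift;
};

/* Fold the cycles elapsed since the last update into base_ns.  Called often
 * enough, the delta converted here stays far below the wrap point. */
static void tk_touch(struct tk_model *tk, uint64_t now_cycles)
{
    uint64_t delta = now_cycles - tk->cycle_last;

    tk->base_ns += (delta * tk->mult) >> tk->shift;
    tk->cycle_last = now_cycles;
}

/* Read the current time: base plus the not yet accumulated delta.  If
 * nobody touched the model for too long, the multiply here wraps. */
static uint64_t tk_read_ns(const struct tk_model *tk, uint64_t now_cycles)
{
    uint64_t delta = now_cycles - tk->cycle_last;

    return tk->base_ns + ((delta * tk->mult) >> tk->shift);
}
```

In this model, calling tk_touch() once a minute across a 20 minute stop keeps
tk_read_ns() within a fraction of a millisecond of the real elapsed time,
while a model that is never touched over the same 20 minutes returns a
wrapped value.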

Jeff

> Thanks,
>
>       tglx
>

Well, that explains it.
