Re: [BUG REPORT] ktime_get_ts64 causes Hard Lockup

Jeff Merkey Thu, 21 Jan 2016 00:19:44 -0800

On 1/21/16, Jeff Merkey <linux....@gmail.com> wrote:
> Ok, here's what I found after several hours of debugging and reviewing
> this subsystem:
>
> This subsystem plays is pretty loose in doing its math on 64 bit
> registers.  I traced through ktime_get_ts64 hundreds of times and
> sampled data running through it and from what I saw, just normal
> operations comes dangerously close to causing the RAX register to
> wrap.   If the delta gets too big it does wrap and I observed it
> happening with the debugger tracing through the code.  It wraps
> because of a sar instruction generated from the inline macros.
>
> The wrap happens in this inline function.
>
> static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
> {
>       cycle_t delta;
>       s64 nsec;
>
>       delta = timekeeping_get_delta(tkr);
>
>       nsec = delta * tkr->mult + tkr->xtime_nsec;
>       nsec >>= tkr->shift;    << wrap caused here
>
>       /* If arch requires, add in get_arch_timeoffset() */
>       return nsec + arch_gettimeoffset();
> }
>
> You only have 64 bits of register and the numbers being calculated
> here are big.   By way of example, I observed the following during
> normal operations:
>
> delta  (RAX)   |     tkr->mult (RDX)
>
> 0x157876            0x65ee27
> 0xf1855               0x65f158
> 0x16cf05             0x65f408
> 303bc3                0x65f154
>
> When this bug occurs different story.
>
> delta  (RAX)    |     tkr->mult (RDX)
>
> 0x243283994b8     0x65233
>
> So it goes like this:
>
>       nsec = delta * tkr->mult + tkr->xtime_nsec;
>         0x243283994b8 * 0x65233
>         imul   rax,rdx = 0xE6A2Ce1f1ea690a8
>
>       nsec >>= tkr->shift;    << wrap caused here
>         sar    rax,cl  =  0xFFFFFFE6BFB3B7C3
>
> the sar instruction doesn't just shift, it backfills the signedness of
> the value, so this instruction is not doing what the C code is asking
> it to do.  I am guessing that somewhere in this mass of macros,
> something may have gotten declared wrong or incomplete (declared
> signed ?).
>
> The assembler output for this section that calls the macro to
> calculate nsecs shows the sar instruction:
>
>       delta = timekeeping_get_delta(tkr);
>
>       nsec = delta * tkr->mult + tkr->xtime_nsec;
>      29b:     48 0f af c2             imul   %rdx,%rax
>      29f:     48 03 05 00 00 00 00    add    0x0(%rip),%rax        # 2a6
> <ktime_get_ts64+0xc6>
>       nsec >>= tkr->shift;
>      2a6:     48 d3 f8                sar    %cl,%rax
>
>
> There is another problem with the tkr->read returning an unchanging,
> unclearable number when this bug occurs for the delta value.  I
> appears for whatever reason the clock has gone to sleep or gone away
> and is no longer updating its counters.
>
> static inline cycle_t timekeeping_get_delta(struct tk_read_base *tkr)
> {
>       cycle_t cycle_now, delta;
>
>       /* read clocksource */
>       cycle_now = tkr->read(tkr->clock); << returns the same value after
> this bug happens
>
>       /* calculate the delta since the last update_wall_time */
>       delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask); <<
> cycle last is also the same value.
>
>       return delta;
> }
>
> This problem appears to have several things happening at once.
> Probably the most concerning is that the assembler output is making
> some assumptions about the SIGNEDNESS of the values being shifted and
> using sar instead of shl instructions.
>
> I am also concerned about the thr->read function returning an
> unchanging value when this problem shows up.
>
> This subsystem plays it fast and loose with its math, and if the clock
> gets delayed or out of sync, it will wrap in the above function and it
> will trigger the Hard Lockup detector if the value is large enough in
> RAX.  The sanity check for CONFIG_DEBUG_TIMEKEEPER does not catch the
> code path where this delta value gets set because the function to
> update the delta is called in more then just in that function that
> checks for an overflow and the wrap case happens underneath it.
>
> I would check how these structs are defined and the vars in them to
> see if somewhere they are declared as signed values to the compiler,
> because that's what it thinks it was given to compile.
>
> I am still debugging the thr->read issue.  I have determined the cause
> of the wrap in the assembler.  As to why the gcc compiler is outputing
> this instruction here is something to be determined.
>
> Jeff
>


s64 = signed 64.    right in the code.  wow.

The iter_div64 functions all assume unsigned values, so this needs to
be patched.  Can't use signed for shifting to a value that will be
interpreted as unsigned by the system math.  Just makes a huge number
that wraps.  This patch is a one-liner after all.

:-)

Jeff

Re: [BUG REPORT] ktime_get_ts64 causes Hard Lockup

Reply via email to