Jeff,

On Thu, 21 Jan 2016, Jeff Merkey wrote:
> static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
> {
>         cycle_t delta;
>         s64 nsec;
>
>         delta = timekeeping_get_delta(tkr);
>
>         nsec = delta * tkr->mult + tkr->xtime_nsec;
>         nsec >>= tkr->shift;   << wrap caused here
>
>         /* If arch requires, add in get_arch_timeoffset() */
>         return nsec + arch_gettimeoffset();
> }
>
> You only have 64 bits of register and the numbers being calculated
> here are big.  By way of example, I observed the following during
> normal operations:
>
> delta (RAX)  |  tkr->mult (RDX)
>
> 0x157876        0x65ee27
> 0xf1855         0x65f158
> 0x16cf05        0x65f408
> 303bc3          0x65f154
>
> When this bug occurs different story.
>
> delta (RAX)  |  tkr->mult (RDX)
>
> 0x243283994b8   0x65233
>
> So it goes like this:
>
> nsec = delta * tkr->mult + tkr->xtime_nsec;
> 0x243283994b8 * 0x65233
> imul rax,rdx = 0xE6A2Ce1f1ea690a8
>
> nsec >>= tkr->shift;   << wrap caused here
> sar rax,cl = 0xFFFFFFE6BFB3B7C3

That SAR is simply wrong here. It must be an SHR and it is at least when I'm
looking at the assembly of my machine.

> the sar instruction doesn't just shift, it backfills the signedness of
> the value, so this instruction is not doing what the C code is asking
> it to do.  I am guessing that somewhere in this mass of macros,
> something may have gotten declared wrong or incomplete (declared
> signed ?).

There is no macro involved.

timekeeping_get_ns
{
	nsec = (delta * tkr->mult + tkr->xtime_nsec) >> tkr->shift;
}

> The assembler output for this section that calls the macro to
> calculate nsecs shows the sar instruction:
>
>         delta = timekeeping_get_delta(tkr);
>
>         nsec = delta * tkr->mult + tkr->xtime_nsec;
>  29b:   48 0f af c2             imul   %rdx,%rax
>  29f:   48 03 05 00 00 00 00    add    0x0(%rip),%rax        # 2a6
> <ktime_get_ts64+0xc6>
>         nsec >>= tkr->shift;
>  2a6:   48 d3 f8                sar    %cl,%rax

And this is fundamentally wrong. Why is the compiler emitting SAR instead of
SHR here?

Here is the assembly output from my kernel:

	nsec = (delta * tkr->mult + tkr->xtime_nsec) >> tkr->shift;
 27e:	48 0f af c5          	imul   %rbp,%rax
 282:	48 01 d8             	add    %rbx,%rax
 285:	48 d3 e8             	shr    %cl,%rax
	} while (read_seqcount_retry(&tk_core.seq, seq));

So the first thing which needs to be figured out is WHY this results in a SAR
on your compiler.

> There is another problem with the tkr->read returning an unchanging,
> unclearable number when this bug occurs for the delta value.  I
> appears for whatever reason the clock has gone to sleep or gone away
> and is no longer updating its counters.
>
> static inline cycle_t timekeeping_get_delta(struct tk_read_base *tkr)
> {
>         cycle_t cycle_now, delta;
>
>         /* read clocksource */
>         cycle_now = tkr->read(tkr->clock);  << returns the same value after
> this bug happens
>
>         /* calculate the delta since the last update_wall_time */
>         delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask); <<
> cycle last is also the same value.
>
>         return delta;
> }

If that value does not change, then the timekeeping update is not running. That
might happen because the timer interrupt is not happening or whatever got
wrecked.

> I would check how these structs are defined and the vars in them to
> see if somewhere they are declared as signed values to the compiler,
> because that's what it thinks it was given to compile.

Sure. Here you go:

	nsec = (delta * tkr->mult + tkr->xtime_nsec) >> tkr->shift;

delta, mult, xtime_nsec and shift are unsigned. The only signed value is nsec.

Does that issue go away if you apply the patch below?

Thanks,

	tglx

8<-----------

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 34b4cedfa80d..d405bcdf9d40 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -301,7 +301,7 @@ static inline u32 arch_gettimeoffset(void) { return 0; }
 static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
 {
 	cycle_t delta;
-	s64 nsec;
+	u64 nsec;
 
 	delta = timekeeping_get_delta(tkr);
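
For reference, the difference between shifting through a signed and an
unsigned 64-bit variable can be reproduced in user space. This is only a
minimal sketch, not kernel code: the helper names shift_signed() and
shift_unsigned() are hypothetical, the input used in the note below is the
imul result quoted in the report (without xtime_nsec added), the shift count
of 24 is an assumption for illustration, and the behaviour of the signed
variant relies on GCC/Clang on x86-64, where right-shifting a negative signed
value is an arithmetic shift (the C standard leaves that
implementation-defined).

```c
#include <stdint.h>

/*
 * Hypothetical helper: shift through a signed 64-bit variable, as in
 * "s64 nsec; nsec >>= shift;". The uint64_t to int64_t conversion wraps
 * on GCC/Clang, and the shift is typically compiled to SAR on x86-64.
 */
static inline int64_t shift_signed(uint64_t sum, uint32_t shift)
{
	int64_t nsec = (int64_t)sum;	/* top bit set => negative value */

	nsec >>= shift;			/* arithmetic shift: sign bit is replicated */
	return nsec;
}

/*
 * Hypothetical helper: shift while the value is still unsigned, as in
 * "u64 nsec; nsec >>= shift;". This is a logical shift (SHR on x86-64).
 */
static inline uint64_t shift_unsigned(uint64_t sum, uint32_t shift)
{
	return sum >> shift;		/* logical shift: zero fill from the top */
}
```

Feeding 0xE6A2CE1F1EA690A8 with a shift of 24 into both helpers shows the
effect: the signed variant comes back negative with the upper 24 bits set,
while the unsigned variant returns the zero-filled value.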