Re: [PATCH v1] watchdog: Use a reference cycle counter to avoid scaling issues

2015-04-24 Thread Thomas Gleixner
On Fri, 24 Apr 2015, Andi Kleen wrote:
> > There are better ways to do that than using heuristics. We have to
> > deal with 3 variants of the reference counter:
> > 
> > 1) Core and Atom: counts bus cycles and we know that frequency already
> >   from the local apic calibration
> > 
> > 2) Nehalem, Westmere: Same as TSC
> > 
> > 3) Sandybridge and later:  XCLK which is 100MHz
> > 
> > No magic calibration, just use the information which we have on our
> > hands already.
> 
> This is a really bad idea. We basically would need to maintain a big
> switch with model numbers, with new cases added for every new CPU.
> Would be a maintenance nightmare.

Nonsense. We have already enough family specific switch cases which
need to be maintained for every new cpu anyway. So they can simply
provide that extra bit of information. It's neither rocket science nor
a nightmare.

We don't need to calibrate stuff which we can access by simpler means.

Thanks,

tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v1] watchdog: Use a reference cycle counter to avoid scaling issues

2015-04-24 Thread Andi Kleen
> There are better ways to do that than using heuristics. We have to
> deal with 3 variants of the reference counter:
> 
> 1) Core and Atom: counts bus cycles and we know that frequency already
> from the local apic calibration
> 
> 2) Nehalem, Westmere: Same as TSC
> 
> 3) Sandybridge and later:  XCLK which is 100MHz
> 
> No magic calibration, just use the information which we have on our
> hands already.

This is a really bad idea. We basically would need to maintain a big
switch with model numbers, with new cases added for every new CPU.
Would be a maintenance nightmare.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v1] watchdog: Use a reference cycle counter to avoid scaling issues

2015-04-24 Thread Thomas Gleixner
On Fri, 24 Apr 2015, Alexander Shishkin wrote:
> Thomas Gleixner  writes:
> 
> > On Fri, 24 Apr 2015, Thomas Gleixner wrote:
> >> On Thu, 23 Apr 2015, Andi Kleen wrote:
> >> > > We can just detect the deviation in the callback itself:
> >> > > 
> >> > >u64 now = ktime_get_mono_fast_ns();
> >> > > 
> >> > >if (now - __this_cpu_read(nmi_timestamp) < period)
> >> > >   return;
> >> > > 
> >> > >__this_cpu_write(nmi_timestamp, now);
> >> > > 
> >> > > It's that simple.
> >> > 
> >> > It's a simple short term hac^wsolution.
> >> 
> >> Yes, and way simpler and less complex for pushing into stable.
> >> 
> >> > But if we had a (hypothetical) system with let's say 10*TSC max you
> >> > may end up with quite a few false ticks, as in unnecessary
> >> > interrupts. With 100*TSC it would be really bad.
> >> 
> >> And hypothetical systems with 100*TSC justify all that?
> >>  
> >> > There were systems in the past that ran TSC at a much slower frequency,
> >> > such as the early AMD Barcelona systems.
> >> > 
> >> > So the problem may eventually come back if not solved properly.
> >> 
> >> There are better ways to do that than using heuristics. We have to
> >> deal with 3 variants of the reference counter:
> >> 
> >> 1) Core and Atom: counts bus cycles and we know that frequency already
> >>  from the local apic calibration
> >> 
> >> 2) Nehalem, Westmere: Same as TSC
> >> 
> >> 3) Sandybridge and later:  XCLK which is 100MHz
> >> 
> >> No magic calibration, just use the information which we have on our
> >> hands already.
> >
> > And aside of that calibration stuff emits a warning on everything
> > except intel, arc and metag. Very useful.
> >
> > This is core code and not intel playground.
> 
> This warning is only ever compiled on intel

Which means that all AMD machines will emit it.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v1] watchdog: Use a reference cycle counter to avoid scaling issues

2015-04-24 Thread Alexander Shishkin
Thomas Gleixner  writes:

> On Fri, 24 Apr 2015, Thomas Gleixner wrote:
>> On Thu, 23 Apr 2015, Andi Kleen wrote:
>> > > We can just detect the deviation in the callback itself:
>> > > 
>> > >u64 now = ktime_get_mono_fast_ns();
>> > > 
>> > >if (now - __this_cpu_read(nmi_timestamp) < period)
>> > > return;
>> > > 
>> > >__this_cpu_write(nmi_timestamp, now);
>> > > 
>> > > It's that simple.
>> > 
>> > It's a simple short term hac^wsolution.
>> 
>> Yes, and way simpler and less complex for pushing into stable.
>> 
>> > But if we had a (hypothetical) system with let's say 10*TSC max you
>> > may end up with quite a few false ticks, as in unnecessary
>> > interrupts. With 100*TSC it would be really bad.
>> 
>> And hypothetical systems with 100*TSC justify all that?
>>  
>> > There were systems in the past that ran TSC at a much slower frequency,
>> > such as the early AMD Barcelona systems.
>> > 
>> > So the problem may eventually come back if not solved properly.
>> 
>> There are better ways to do that than using heuristics. We have to
>> deal with 3 variants of the reference counter:
>> 
>> 1) Core and Atom: counts bus cycles and we know that frequency already
>>from the local apic calibration
>> 
>> 2) Nehalem, Westmere: Same as TSC
>> 
>> 3) Sandybridge and later:  XCLK which is 100MHz
>> 
>> No magic calibration, just use the information which we have on our
>> hands already.
>
> And aside of that calibration stuff emits a warning on everything
> except intel, arc and metag. Very useful.
>
> This is core code and not intel playground.

This warning is only ever compiled on intel and (since last week) on
powerpc.

But sure, it can be fixed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v1] watchdog: Use a reference cycle counter to avoid scaling issues

2015-04-24 Thread Thomas Gleixner
On Fri, 24 Apr 2015, Thomas Gleixner wrote:
> On Thu, 23 Apr 2015, Andi Kleen wrote:
> > > We can just detect the deviation in the callback itself:
> > > 
> > >u64 now = ktime_get_mono_fast_ns();
> > > 
> > >if (now - __this_cpu_read(nmi_timestamp) < period)
> > >  return;
> > > 
> > >__this_cpu_write(nmi_timestamp, now);
> > > 
> > > It's that simple.
> > 
> > It's a simple short term hac^wsolution.
> 
> Yes, and way simpler and less complex for pushing into stable.
> 
> > But if we had a (hypothetical) system with let's say 10*TSC max you
> > may end up with quite a few false ticks, as in unnecessary
> > interrupts. With 100*TSC it would be really bad.
> 
> And hypothetical systems with 100*TSC justify all that?
>  
> > There were systems in the past that ran TSC at a much slower frequency,
> > such as the early AMD Barcelona systems.
> > 
> > So the problem may eventually come back if not solved properly.
> 
> There are better ways to do that than using heuristics. We have to
> deal with 3 variants of the reference counter:
> 
> 1) Core and Atom: counts bus cycles and we know that frequency already
> from the local apic calibration
> 
> 2) Nehalem, Westmere: Same as TSC
> 
> 3) Sandybridge and later:  XCLK which is 100MHz
> 
> No magic calibration, just use the information which we have on our
> hands already.

And aside of that calibration stuff emits a warning on everything
except intel, arc and metag. Very useful.

This is core code and not intel playground.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v1] watchdog: Use a reference cycle counter to avoid scaling issues

2015-04-24 Thread Borislav Petkov
On Thu, Apr 23, 2015 at 05:51:33PM -0700, Andi Kleen wrote:
> There were systems in the past that ran TSC at a much slower
> frequency, such as the early AMD Barcelona systems.

You mean the eval boxes which were never sold as production systems?

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v1] watchdog: Use a reference cycle counter to avoid scaling issues

2015-04-24 Thread Thomas Gleixner
On Thu, 23 Apr 2015, Andi Kleen wrote:
> > We can just detect the deviation in the callback itself:
> > 
> >u64 now = ktime_get_mono_fast_ns();
> > 
> >if (now - __this_cpu_read(nmi_timestamp) < period)
> >return;
> > 
> >__this_cpu_write(nmi_timestamp, now);
> > 
> > It's that simple.
> 
> It's a simple short term hac^wsolution.

Yes, and way simpler and less complex for pushing into stable.

> But if we had a (hypothetical) system with let's say 10*TSC max you
> may end up with quite a few false ticks, as in unnecessary
> interrupts. With 100*TSC it would be really bad.

And hypothetical systems with 100*TSC justify all that?
 
> There were systems in the past that ran TSC at a much slower frequency,
> such as the early AMD Barcelona systems.
> 
> So the problem may eventually come back if not solved properly.

There are better ways to do that than using heuristics. We have to
deal with 3 variants of the reference counter:

1) Core and Atom: counts bus cycles and we know that frequency already
  from the local apic calibration

2) Nehalem, Westmere: Same as TSC

3) Sandybridge and later:  XCLK which is 100MHz

No magic calibration, just use the information which we have on our
hands already.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v1] watchdog: Use a reference cycle counter to avoid scaling issues

2015-04-23 Thread Andi Kleen
> We can just detect the deviation in the callback itself:
> 
>u64 now = ktime_get_mono_fast_ns();
> 
>if (now - __this_cpu_read(nmi_timestamp) < period)
>  return;
> 
>__this_cpu_write(nmi_timestamp, now);
> 
> It's that simple.

It's a simple short term hac^wsolution. But if we had a (hypothetical) system 
with
let's say 10*TSC max you may end up with quite a few false ticks, as in 
unnecessary interrupts. With 100*TSC it would be really bad.

There were systems in the past that ran TSC at a much slower frequency,
such as the early AMD Barcelona systems.

So the problem may eventually come back if not solved properly.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v1] watchdog: Use a reference cycle counter to avoid scaling issues

2015-04-23 Thread Thomas Gleixner
On Thu, 23 Apr 2015, Andi Kleen wrote:
> On Thu, Apr 23, 2015 at 10:01:04PM +0200, Thomas Gleixner wrote:
> > On Thu, 23 Apr 2015, Alexander Shishkin wrote:
> > 
> > > The problem with using cycle counter for NMI watchdog is that its
> > > frequency changes with the corresponding core's frequency. This means
> > > that, in particular, if the core frequency scales up, watchdog NMI will
> > > arrive more frequently than what user requested through watchdog_thresh
> > > and also increasing the probability of setting off the hardlockup 
> > > detector,
> > > because the corresponding hrtimer will keep firing at the same intervals
> > > regardless of the core frequency. And, if the core can turbo to up to 2.5x
> > > its base frequency (and therefore TSC) [1], we'll have the hrtimer and NMI
> > 
> > So you are saying that this M-5Y10 has a non-constant TSC again? You
> > really can't be serious about that.
> 
> The TSC is constant, but the maximum frequency can be >=2.5TSC, so the
> watchdog which uses cycles can have that much error.

That makes sense. I misinterpreted the (therefore TSC) above.

The nmi_watchdog then fires not once per watchdog_tresh seconds, it
fires once per watchdog_tresh * 400ms. So the hrtimer might not have a
chance to increment hrtimer_interrupts and the hardlockup detector
triggers.

But do we really need that whole calibration maze to deal with this?

Definitely not.

We can just detect the deviation in the callback itself:

   u64 now = ktime_get_mono_fast_ns();

   if (now - __this_cpu_read(nmi_timestamp) < period)
   return;

   __this_cpu_write(nmi_timestamp, now);

It's that simple.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v1] watchdog: Use a reference cycle counter to avoid scaling issues

2015-04-23 Thread Andi Kleen
On Thu, Apr 23, 2015 at 10:01:04PM +0200, Thomas Gleixner wrote:
> On Thu, 23 Apr 2015, Alexander Shishkin wrote:
> 
> > The problem with using cycle counter for NMI watchdog is that its
> > frequency changes with the corresponding core's frequency. This means
> > that, in particular, if the core frequency scales up, watchdog NMI will
> > arrive more frequently than what user requested through watchdog_thresh
> > and also increasing the probability of setting off the hardlockup detector,
> > because the corresponding hrtimer will keep firing at the same intervals
> > regardless of the core frequency. And, if the core can turbo to up to 2.5x
> > its base frequency (and therefore TSC) [1], we'll have the hrtimer and NMI
> 
> So you are saying that this M-5Y10 has a non-constant TSC again? You
> really can't be serious about that.

The TSC is constant, but the maximum frequency can be >=2.5TSC, so the
watchdog which uses cycles can have that much error.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v1] watchdog: Use a reference cycle counter to avoid scaling issues

2015-04-23 Thread Thomas Gleixner
On Thu, 23 Apr 2015, Alexander Shishkin wrote:

> The problem with using cycle counter for NMI watchdog is that its
> frequency changes with the corresponding core's frequency. This means
> that, in particular, if the core frequency scales up, watchdog NMI will
> arrive more frequently than what user requested through watchdog_thresh
> and also increasing the probability of setting off the hardlockup detector,
> because the corresponding hrtimer will keep firing at the same intervals
> regardless of the core frequency. And, if the core can turbo to up to 2.5x
> its base frequency (and therefore TSC) [1], we'll have the hrtimer and NMI

So you are saying that this M-5Y10 has a non-constant TSC again? You
really can't be serious about that.

Please provide the output of /proc/cpuinfo

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v1] watchdog: Use a reference cycle counter to avoid scaling issues

2015-04-23 Thread Alexander Shishkin
The problem with using cycle counter for NMI watchdog is that its
frequency changes with the corresponding core's frequency. This means
that, in particular, if the core frequency scales up, watchdog NMI will
arrive more frequently than what user requested through watchdog_thresh
and also increasing the probability of setting off the hardlockup detector,
because the corresponding hrtimer will keep firing at the same intervals
regardless of the core frequency. And, if the core can turbo to up to 2.5x
its base frequency (and therefore TSC) [1], we'll have the hrtimer and NMI
counter firing at the same frequency; anything that disables interrupts at
an unfortunate moment can set off the hardlockup detector then.

The proposed solution is to use reference cycle counter instead, which is
guaranteed to run at the same frequency regardless of the cpu speed. This
will also ensure that the watchdog_thresh timeout value is honored.

One issue with the reference counter is that its frequency is not the same
as that of tsc on older cpu models and therefore it needs calibration.

This patch adds code to calibrate ref-cycles counter using an hrtimer and
a perf counter, which runs in parallel with the rest of kernel init
sequence.

This patch leaves intact the hw_nmi_get_sample_period() arch call which is
still used in case of a fallback to the cycle counter.

[1] 
http://ark.intel.com/products/83610/Intel-Core-M-5Y10-Processor-4M-Cache-up-to-2_00-GHz

Signed-off-by: Alexander Shishkin 
Cc: sta...@vger.kernel.org
---
Changes since the initial version:

 * kept hw_nmi_* intact to avoid build-breaking powerpc; on powerpc, the
   calibration should fail because of the missing REF_CYCLES counter and
   therefore fall back to the old way of doing things;
 * rebased on top of Linus' master.

 kernel/watchdog.c | 134 --
 1 file changed, 131 insertions(+), 3 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 2316f50b07..7e8e031544 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -259,15 +259,129 @@ static int is_softlockup(unsigned long touch_ts)
 }
 
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
-
 static struct perf_event_attr wd_hw_attr = {
.type   = PERF_TYPE_HARDWARE,
-   .config = PERF_COUNT_HW_CPU_CYCLES,
+   .config = PERF_COUNT_HW_REF_CPU_CYCLES,
.size   = sizeof(struct perf_event_attr),
.pinned = 1,
.disabled   = 1,
 };
 
+/*
+ * Reference cycle counter calibration
+ *
+ * In order for nmi watchdog to be bound to the flow of time instead
+ * of cycles, which changes with cpu frequency scaling, we need to use
+ * PERF_COUNT_HW_REF_CPU_CYCLES for it. And for this we need to know its
+ * resolution since it's not necessarily the same as that of TSC.
+ *
+ * Thus, reference clock counter is calibrated using a perf counter and
+ * an hrtimer taking samples of the former's overflow counter. When the
+ * hrtimer callback has enough samples, it kicks off a work, which does
+ * the "math" and kickstarts the NMI watchdog if needed.
+ *
+ * This method of calibration does not give stellar precision, but should
+ * be enough for a watchdog timer. And it runs in parallel to the rest of
+ * the bootup sequence. The main factor contributing to the error in this
+ * calibration method is the nmi handling path leading up to the overflow
+ * handler, that is a greater REF_CAL_SAMPLE_CYCLES value gives better
+ * precision.
+ */
+
+/* PERF_COUNT_HW_REF_CPU_CYCLES timer resolution */
+static unsigned long   wd_hw_ref_khz;
+static local_t wd_hw_ref_cycles;
+static struct hrtimer  wd_hw_cal_timer;
+static struct perf_event *wd_hw_cal_event;
+
+#define REF_CAL_POINTS 9
+static unsigned long ref_cal_point[REF_CAL_POINTS];
+static unsigned int ref_cal_points;
+
+#define REF_CAL_SAMPLE_CYCLES  100
+#define REF_CAL_MS 100
+#define REF_CAL_PERIOD (REF_CAL_MS * 100)
+
+static void wd_hw_cal_overflow(struct perf_event *event,
+  struct perf_sample_data *data,
+  struct pt_regs *regs)
+{
+   event->hw.interrupts = 0;
+   local_inc(&wd_hw_ref_cycles);
+}
+
+static void watchdog_calibration_failed(void)
+{
+   wd_hw_attr.config = PERF_COUNT_HW_CPU_CYCLES;
+   pr_warn("reference counter calibration failed, falling back to cycle 
counter for NMI watchdog\n");
+}
+
+static int watchdog_enable_all_cpus(void);
+
+static void watchdog_finish_ref_calibration(struct work_struct *work)
+{
+   unsigned int point;
+   unsigned long diff;
+
+   hrtimer_cancel(&wd_hw_cal_timer);
+   perf_event_release_kernel(wd_hw_cal_event);
+
+   for (diff = 0, point = 0; point < REF_CAL_POINTS - 1; point++)
+   diff += ref_cal_point[point + 1] - ref_cal_point[point];
+
+   diff /= REF_CAL_POINTS - 1;
+
+   wd_hw_ref_khz = diff * REF_CAL_SAMPLE_CYCLES / REF_CAL_MS;
+   if (!wd_hw