> On Tue, 15 Aug 2017, Liang, Kan wrote:
> > This patch which speed up the hrtimer (https://lkml.org/lkml/2017/6/26/685)
> > is decent to fix the spurious hard lockups.
> > Tested-by: Kan Liang
> >
> > Please consider to merge it into both mainline and stable tree.
>
> Well, it 'fixes' the pr
On Tue, 15 Aug 2017, Liang, Kan wrote:
> This patch which speed up the hrtimer (https://lkml.org/lkml/2017/6/26/685)
> is decent to fix the spurious hard lockups.
> Tested-by: Kan Liang
>
> Please consider to merge it into both mainline and stable tree.
Well, it 'fixes' the problem, but at the s
On Mon, Aug 14, 2017 at 6:16 PM, Liang, Kan wrote:
>
> We have confirmed that the hardlock with "speed up the hrtimer" patch is
> actually another issue.
Good.
However:
> Tim has already proposed a patch to fix it.
> Here is his patch. https://lkml.org/lkml/2017/8/14/1000
Ugh. I hate that patc
> On Mon, Jul 17, 2017 at 01:24:23AM +, Liang, Kan wrote:
> > Hi Don & Thomas,
> >
> > Sorry for the late response. We just finished the tests for all proposed
> > patches.
> >
> > There are three proposed patches so far.
> > Patch 1: The patch as above which speed up the hrtimer.
> > Patch 2:
On Mon, 17 Jul 2017, Liang, Kan wrote:
> > > > > According to our test, only patch 3 works well.
> > > > > The other two patches will hang the system eventually.
> >
> > Hang the system eventually? Does that mean that the system stops working
> > and the watchdog does not catch the problem?
>
> R
> On Mon, 17 Jul 2017, Liang, Kan wrote:
> > > That doesn't make sense. What's the exact test procedure?
> >
> > I don't know the exact test procedure. The test case is from our customer.
> > I only know that the test case makes calls into the x11 libs.
>
> Sigh. This starts to be silly. You tes
On Mon, Jul 17, 2017 at 01:24:23AM +, Liang, Kan wrote:
> Hi Don & Thomas,
>
> Sorry for the late response. We just finished the tests for all proposed
> patches.
>
> There are three proposed patches so far.
> Patch 1: The patch as above which speed up the hrtimer.
> Patch 2: Thomas's first
On Mon, 17 Jul 2017, Liang, Kan wrote:
> > That doesn't make sense. What's the exact test procedure?
>
> I don't know the exact test procedure. The test case is from our customer.
> I only know that the test case makes calls into the x11 libs.
Sigh. This starts to be silly. You test something and
>
> On Mon, 17 Jul 2017, Liang, Kan wrote:
> > There are three proposed patches so far.
> > Patch 1: The patch as above which speed up the hrtimer.
> > Patch 2: Thomas's first proposal.
> > https://patchwork.kernel.org/patch/9803033/
> > https://patchwork.kernel.org/patch/9805903/
> > Patch 3: my
On Mon, 17 Jul 2017, Liang, Kan wrote:
> There are three proposed patches so far.
> Patch 1: The patch as above which speed up the hrtimer.
> Patch 2: Thomas's first proposal.
> https://patchwork.kernel.org/patch/9803033/
> https://patchwork.kernel.org/patch/9805903/
> Patch 3: my original proposal
> On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> > On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > > Hmm, all this work for a temp fix. Kan, how much longer until the
> > > > real fix of having perf count the righ
> > Thomas' patch to modulate the frequency seemed reasonable to me.
> > It made the NMI watchdog depend on accurate ktime, but that's probably ok.
>
> Ok, did Kan finish testing this patch (with the small fix on top)?
Kan doesn't have the specific hardware to test it. We've been waiting
for anot
On Thu, Jun 29, 2017 at 09:12:20AM -0700, Andi Kleen wrote:
> On Thu, Jun 29, 2017 at 11:44:06AM -0400, Don Zickus wrote:
> > On Wed, Jun 28, 2017 at 01:14:04PM -0700, Andi Kleen wrote:
> > > It can be a useful debugging tool for a specific class of bugs:
> > > when kernel software is looping fore
On Thu, Jun 29, 2017 at 11:44:06AM -0400, Don Zickus wrote:
> On Wed, Jun 28, 2017 at 01:14:04PM -0700, Andi Kleen wrote:
> > It can be a useful debugging tool for a specific class of bugs:
> > when kernel software is looping forever.
> >
> > But if that happens does it really matter how many ite
On Wed, Jun 28, 2017 at 01:14:04PM -0700, Andi Kleen wrote:
> It can be a useful debugging tool for a specific class of bugs:
> when kernel software is looping forever.
>
> But if that happens does it really matter how many iterations the
> loop does before it is stopped?
>
> Even the current ti
On Wed, Jun 28, 2017 at 03:00:08PM -0400, Don Zickus wrote:
> On Tue, Jun 27, 2017 at 04:48:22PM -0700, Andi Kleen wrote:
> > > I haven't heard back any test result yet.
> > >
> > > The above patch looks good to me.
> >
> > This needs performance testing. It may slow down performance or latency
On Tue, Jun 27, 2017 at 04:48:22PM -0700, Andi Kleen wrote:
> > I haven't heard back any test result yet.
> >
> > The above patch looks good to me.
>
> This needs performance testing. It may slow down performance or latency
> sensitive workloads.
More motivation to work through the issues with
> I haven't heard back any test result yet.
>
> The above patch looks good to me.
This needs performance testing. It may slow down performance or latency
sensitive workloads.
> Which workaround do you prefer, the above one or the one checking timestamp?
I prefer the earlier patch, it has far
On Tue, Jun 27, 2017 at 08:49:19PM +, Liang, Kan wrote:
>
> > On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> > > On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > > > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > > > Hmm, all this work for a temp fix. Kan, how
> On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> > On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > > Hmm, all this work for a temp fix. Kan, how much longer until the
> > > > real fix of having perf count the right
On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > Hmm, all this work for a temp fix. Kan, how much longer until the real fix
> > > of having perf count the right cyc
On Mon, 26 Jun 2017, Don Zickus wrote:
> On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > Hmm, all this work for a temp fix. Kan, how much longer until the real fix
> > > of having perf count the right cycles?
> >
> > Quite a
On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> On Fri, 23 Jun 2017, Don Zickus wrote:
> > Hmm, all this work for a temp fix. Kan, how much longer until the real fix
> > of having perf count the right cycles?
>
> Quite a while. The approach is wilfully breaking the user space A
On Fri, 23 Jun 2017, Don Zickus wrote:
> Hmm, all this work for a temp fix. Kan, how much longer until the real fix
> of having perf count the right cycles?
Quite a while. The approach is wilfully breaking the user space ABI, which
is not going to happen.
And there is a simpler solution as well,
On Fri, Jun 23, 2017 at 10:01:55AM +0200, Thomas Gleixner wrote:
> On Thu, 22 Jun 2017, Don Zickus wrote:
> > On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> > > On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> > > > We now have more and more systems where the Turbo range is wid
On Thu, 22 Jun 2017, Don Zickus wrote:
> On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> > On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> > > We now have more and more systems where the Turbo range is wide enough
> > > that the NMI watchdog expires faster than the soft watchdo
> Subject: Re: [PATCH V2] kernel/watchdog: fix spurious hard lockups
>
> On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> > On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> > > We now have more and more systems where the Turbo range is wide
> >
On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> > We now have more and more systems where the Turbo range is wide enough
> > that the NMI watchdog expires faster than the soft watchdog timer that
> > updates the interrupt tick the
On Wed, 21 Jun 2017, Thomas Gleixner wrote:
> On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> > We now have more and more systems where the Turbo range is wide enough
> > that the NMI watchdog expires faster than the soft watchdog timer that
> > updates the interrupt tick the NMI watchdog relies
On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> We now have more and more systems where the Turbo range is wide enough
> that the NMI watchdog expires faster than the soft watchdog timer that
> updates the interrupt tick the NMI watchdog relies on.
>
> This problem was originally added by commit
On Wed, 21 Jun 2017, Andi Kleen wrote:
> On Wed, Jun 21, 2017 at 05:12:06PM +0200, Thomas Gleixner wrote:
> > On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> > >
> > > #ifdef CONFIG_HARDLOCKUP_DETECTOR
> > > +/*
> > > + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
> > > +
On 06/21/2017 11:47 AM, Liang, Kan wrote:
>
>
>> On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
>>>
>>> #ifdef CONFIG_HARDLOCKUP_DETECTOR
>>> +/*
>>> + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
>>> + * can tick faster than the measured CPU Frequency due to Turbo mo
> On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> >
> > #ifdef CONFIG_HARDLOCKUP_DETECTOR
> > +/*
> > + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
> > + * can tick faster than the measured CPU Frequency due to Turbo mode.
> > + * That can lead to spurious timeouts.
>
On Wed, Jun 21, 2017 at 05:12:06PM +0200, Thomas Gleixner wrote:
> On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> >
> > #ifdef CONFIG_HARDLOCKUP_DETECTOR
> > +/*
> > + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
> > + * can tick faster than the measured CPU Frequency du
On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
>
> #ifdef CONFIG_HARDLOCKUP_DETECTOR
> +/*
> + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
> + * can tick faster than the measured CPU Frequency due to Turbo mode.
> + * That can lead to spurious timeouts.
> + * To workaroun