Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-09-02 Thread Peter Zijlstra
On Tue, Sep 01, 2015 at 03:35:18PM -0400, Steven Rostedt wrote:
> On Tue, 1 Sep 2015 11:14:17 -0700
> Shaohua Li  wrote:
> 
> > > You think that blocking softirq execution for 42.9 seconds is normal?
> > > Seems we are living in a different universe.
> > 
> > I don't say it's normal. I say it's not hard to trigger.
> > 
> > > > it's just the VM being off the CPU. A softirq can hog the CPU.
> > > 
> 
> Please provide a test case that shows the softirq hogging the cpu for
> over 40 seconds.

Do not overlook the MAX_SOFTIRQ_* logic while you're there trying to
explain this ;-)

A softirq running significantly longer than 2 jiffies and not falling
back to ksoftirq is a plain bug.
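(For readers without kernel/softirq.c at hand, here is a minimal userspace sketch of the budget Peter refers to. The 2ms / 10-restart figures mirror MAX_SOFTIRQ_TIME and MAX_SOFTIRQ_RESTART as assumed mainline defaults of that era, and the helpers below are stand-ins, not kernel code.)

#include <stdbool.h>
#include <stdio.h>

#define MAX_SOFTIRQ_TIME_MS	2	/* stand-in for msecs_to_jiffies(2) */
#define MAX_SOFTIRQ_RESTART	10	/* stand-in for MAX_SOFTIRQ_RESTART */

static int pending_rounds = 25;		/* pretend a pile of softirq work is queued */

static bool softirq_pending(void) { return pending_rounds > 0; }
static void handle_pending(void)  { pending_rounds--; }

int main(void)
{
	int restarts = MAX_SOFTIRQ_RESTART;
	int elapsed_ms = 0;		/* stand-in for the jiffies deadline check */

	while (softirq_pending()) {
		handle_pending();
		elapsed_ms++;		/* assume ~1ms per pass, purely for illustration */

		if (elapsed_ms >= MAX_SOFTIRQ_TIME_MS || --restarts == 0) {
			printf("budget exhausted, deferring %d rounds to ksoftirqd\n",
			       pending_rounds);
			return 0;
		}
	}
	printf("all softirq work handled inline\n");
	return 0;
}

(Once that budget is exhausted, the remaining work is handed to ksoftirqd and runs under the scheduler, which is why a softirq monopolizing a CPU for tens of seconds points at a bug elsewhere.)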


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-09-01 Thread Steven Rostedt
On Tue, 1 Sep 2015 11:14:17 -0700
Shaohua Li  wrote:

> > You think that blocking softirq execution for 42.9 seconds is normal?
> > Seems we are living in a different universe.
> 
> I don't say it's normal. I say it's not hard to trigger.
> 
> > > it's just the VM being off the CPU. A softirq can hog the CPU.
> > 

Please provide a test case that shows the softirq hogging the cpu for
over 40 seconds.

-- Steve


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-09-01 Thread Thomas Gleixner
On Tue, 1 Sep 2015, Shaohua Li wrote:
> On Tue, Sep 01, 2015 at 07:13:40PM +0200, Thomas Gleixner wrote:
> > On Mon, 31 Aug 2015, Shaohua Li wrote:
> > > On Mon, Aug 31, 2015 at 11:47:52PM +0200, Thomas Gleixner wrote:
> > > > On Mon, 31 Aug 2015, Shaohua Li wrote:
> > > > > > The HPET wraps interval is 0xffffffff / 100000000 = 42.9s
> > > > > > 
> > > > > > tsc interval is (0x481250b45b - 0x219e6efb50) / 2200000000 = 75s
> > > > > > 
> > > > > > 32.1 + 42.9 = 75
> > > > > > 
> > > > > > The example shows hpet wraps, while tsc is marked unstable
> > > > > 
> > > > > Thomas & John,
> > > > > Is this data enough to prove TSC unstable issue can be triggered by 
> > > > > HPET
> > > > > wrap? I can resend the patch with the data included.
> > > > 
> > > > Well, it's enough data to prove:
> > > > 
> > > >  - that keeping a VM off the CPU for 75 seconds is insane.
> > > 
> > > It wraps in 42.9s. 42.9s isn't a long time, so it's not hard to block. I don't think
> > 
> > You think that blocking softirq execution for 42.9 seconds is normal?
> > Seems we are living in a different universe.
> 
> I don't say it's normal. I say it's not hard to trigger.

So, because it's not hard to trigger, we cure the symptom and do not
think about the insanity of blocking the watchdog for 42+ or 300+
seconds.
 
> > > it's just the VM being off the CPU. A softirq can hog the CPU.
> > 
> > I still want to see proof of that. There is just handwaving about
> > that, but nobody has provided proper data to back that up.
> 
> I showed you that the TSC ran for 75s while the HPET wrapped. What info do
> you think can prove this?

You prove nothing. You showed me the symptom, but you never showed
real data that a softirq hogs the cpu for 300+ seconds. Still you keep
claiming that.

Neither did you provide a proper explanation WHY your VM test blocked
the watchdog for 75 seconds. No, you merely showed me the numbers. And
just because the numbers explain the symptom, that's no justification
for WHY we should cure the symptom instead of looking at the root cause.

> > > >  - that emulating the HPET with 100MHz shortens the HPET wraparound by
> > > >    a factor of 7 compared to real hardware. With a realistic HPET
> > > >    frequency you have about 300 seconds.
> > > > 
> > > >    Who thought that using 100MHz HPET frequency is a brilliant idea?
> > > 
> > > I'm not a VM expert. My guess is that 100MHz can reduce interrupts. It's
> > > insane for the hypervisor to update the HPET count at 14.3MHz. Switching to
> > > HPET can introduce even higher overhead in a VM, because of the vmexit on
> > > I/O memory access.
> > 
> > Sorry, that does not make any sense at all.
> > 
> > - How does 100Mhz HPET frequency reduce interrupts?
> > 
> > - What's insane about a lower emulated HPET frequency?
> > 
> > - We all know that switching to HPET is more expensive than just
> >   using TSC. That's not the question at all and completely
> >   unrelated to the 100MHz HPET emulation frequency.
> 
> It's meaningless to argue about HPET frequency. The code should not work
> just for a 14.3MHz HPET.

You carefully avoid answering any of my questions, but you expect me to
accept your wild-guess argumentation?
 
> > I'm not pretending anything. I'm merely refusing to accept that change
> > w/o a proper explanation WHY the watchdog fails on physical hardware,
> > i.e. WHY it does not run for more than 300 seconds.
> 
> It's meaningless to argue about virtual vs. physical machines too. Linux
> has to work on both.

That has nothing to do with virt vs. physical. Virtualization is meant
to provide proper hardware emulation. Does Linux work with a buggy
APIC emulation? Not at all, but you expect that it just handles an
insane HPET emulation, right? 
 
> What about the acpi_pm clocksource then? It wraps in about 5s. It's a sane
> system where HPET is disabled and acpi_pm is used as the watchdog. Do you
> still think 5s is long?

Yes, five seconds is long. It's more than 10 billion CPU cycles on a
2GHz machine. If your desktop stalls for 5 seconds you are probably
less enthusiastic.

Again, I'm not against making the watchdog more robust, but I'm
against curing the symptoms. As long as you refuse to provide proper
explanations WHY the watchdog is blocked unduly long, this is going
nowhere.

Thanks,

tglx
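(A quick back-of-the-envelope check of the figures traded in this exchange. The 100MHz emulated HPET, the 2GHz CPU, and the ~300 second claim come from the thread itself; the 14.318180MHz HPET rate and the 24-bit, 3.579545MHz acpi_pm parameters are standard values assumed here for illustration.)

#include <stdio.h>

int main(void)
{
	const double hpet_mask = 0xffffffffu;	/* 32-bit HPET counter */
	const double pm_mask   = 0xffffffu;	/* 24-bit ACPI PM timer */

	printf("HPET wrap @ 100MHz     : %.1fs\n", hpet_mask / 100e6);		/* ~42.9s */
	printf("HPET wrap @ 14.318MHz  : %.1fs\n", hpet_mask / 14.31818e6);	/* ~300s  */
	printf("acpi_pm wrap @ 3.58MHz : %.1fs\n", pm_mask / 3.579545e6);	/* ~4.7s  */
	printf("CPU cycles in 5s @ 2GHz: %.0f\n", 5.0 * 2e9);			/* 10 billion */
	return 0;
}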

Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-09-01 Thread Shaohua Li
On Tue, Sep 01, 2015 at 07:13:40PM +0200, Thomas Gleixner wrote:
> On Mon, 31 Aug 2015, Shaohua Li wrote:
> > On Mon, Aug 31, 2015 at 11:47:52PM +0200, Thomas Gleixner wrote:
> > > On Mon, 31 Aug 2015, Shaohua Li wrote:
> > > > > The HPET wraps interval is 0xffffffff / 100000000 = 42.9s
> > > > > 
> > > > > tsc interval is (0x481250b45b - 0x219e6efb50) / 2200000000 = 75s
> > > > > 
> > > > > 32.1 + 42.9 = 75
> > > > > 
> > > > > The example shows hpet wraps, while tsc is marked unstable
> > > > 
> > > > Thomas & John,
> > > > Is this data enough to prove TSC unstable issue can be triggered by HPET
> > > > wrap? I can resend the patch with the data included.
> > > 
> > > Well, it's enough data to prove:
> > > 
> > >  - that keeping a VM off the CPU for 75 seconds is insane.
> > 
> > It wraps in 42.9s. 42.9s isn't a long time, so it's not hard to block. I don't think
> 
> You think that blocking softirq execution for 42.9 seconds is normal?
> Seems we are living in a different universe.

I don't say it's normal. I say it's not hard to trigger.

> > it's just the VM being off the CPU. A softirq can hog the CPU.
> 
> I still want to see proof of that. There is just handwaving about
> that, but nobody has provided proper data to back that up.

I showed you that the TSC ran for 75s while the HPET wrapped. What info do
you think can prove this?

> > >  - that emulating the HPET with 100MHz shortens the HPET wraparound by
> > >    a factor of 7 compared to real hardware. With a realistic HPET
> > >    frequency you have about 300 seconds.
> > > 
> > >    Who thought that using 100MHz HPET frequency is a brilliant idea?
> > 
> > I'm not a VM expert. My guess is that 100MHz can reduce interrupts. It's
> > insane for the hypervisor to update the HPET count at 14.3MHz. Switching to
> > HPET can introduce even higher overhead in a VM, because of the vmexit on
> > I/O memory access.
> 
> Sorry, that does not make any sense at all.
> 
> - How does 100Mhz HPET frequency reduce interrupts?
> 
> - What's insane about a lower emulated HPET frequency?
> 
> - We all know that switching to HPET is more expensive than just
>   using TSC. That's not the question at all and completely
>   unrelated to the 100MHz HPET emulation frequency.

It's meaningless to argue about HPET frequency. The code should not work
just for a 14.3MHz HPET.

> > > So we should add crappy heuristics to the watchdog just to workaround
> > > virt insanities? I'm not convinced.
> > 
> > This is a real issue which can seriously impact performance. Though the
> > data was collected in a VM, we do see the issue happen on physical
> > machines too.
> 
> And what's the exact reason for this on physical machines? Some magic
> softirq hog again for which you cannot provide proof?
> 
> > The watchdog clocksource clearly shows its limitation here; it deserves
> > an improvement if we can make one.
> 
> The restriction in a sane environment is 300 seconds. And the only
> fallout on physical hardware which we have seen so far is on
> preempt-RT where the softirq can actually be blocked by RT hogs, but
> that's a completely different issue and has nothing to do with the
> situation in mainline.
> 
> > I'm happy to hear from you if there is a better solution, but we
> > shouldn't pretend there is no issue here.
> 
> I'm not pretending anything. I'm merely refusing to accept that change
> w/o a proper explanation WHY the watchdog fails on physical hardware,
> i.e. WHY it does not run for more than 300 seconds.

It's meaningless to argue about virtual vs. physical machines too. Linux
has to work on both.

What about the acpi_pm clocksource then? It wraps in about 5s. It's a sane
system where HPET is disabled and acpi_pm is used as the watchdog. Do you
still think 5s is long?

Thanks,
Shaohua


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-09-01 Thread Thomas Gleixner
On Mon, 31 Aug 2015, Shaohua Li wrote:
> On Mon, Aug 31, 2015 at 11:47:52PM +0200, Thomas Gleixner wrote:
> > On Mon, 31 Aug 2015, Shaohua Li wrote:
> > > > The HPET wraps interval is 0xffffffff / 100000000 = 42.9s
> > > > 
> > > > tsc interval is (0x481250b45b - 0x219e6efb50) / 2200000000 = 75s
> > > > 
> > > > 32.1 + 42.9 = 75
> > > > 
> > > > The example shows hpet wraps, while tsc is marked unstable
> > > 
> > > Thomas & John,
> > > Is this data enough to prove TSC unstable issue can be triggered by HPET
> > > wrap? I can resend the patch with the data included.
> > 
> > Well, it's enough data to prove:
> > 
> >  - that keeping a VM off the CPU for 75 seconds is insane.
> 
> It wraps in 42.9s. 42.9s isn't a long time, so it's not hard to block. I don't think

You think that blocking softirq execution for 42.9 seconds is normal?
Seems we are living in a different universe.

> it's just the VM being off the CPU. A softirq can hog the CPU.

I still want to see proof of that. There is just handwaving about
that, but nobody has provided proper data to back that up.

> >  - that emulating the HPET with 100MHz shortens the HPET wraparound by
> >    a factor of 7 compared to real hardware. With a realistic HPET
> >    frequency you have about 300 seconds.
> > 
> >    Who thought that using 100MHz HPET frequency is a brilliant idea?
> 
> I'm not a VM expert. My guess is that 100MHz can reduce interrupts. It's
> insane for the hypervisor to update the HPET count at 14.3MHz. Switching to
> HPET can introduce even higher overhead in a VM, because of the vmexit on
> I/O memory access.

Sorry, that does not make any sense at all.

- How does 100Mhz HPET frequency reduce interrupts?

- What's insane about a lower emulated HPET frequency?

- We all know that switching to HPET is more expensive than just
  using TSC. That's not the question at all and completely
  unrelated to the 100MHz HPET emulation frequency.

> > So we should add crappy heuristics to the watchdog just to workaround
> > virt insanities? I'm not convinced.
> 
> This is a real issue which can seriously impact performance. Though the
> data was collected in a VM, we do see the issue happen on physical
> machines too.

And what's the exact reason for this on physical machines? Some magic
softirq hog again for which you cannot provide proof?

> The watchdog clocksource clearly shows its limitation here; it deserves
> an improvement if we can make one.

The restriction in a sane environment is 300 seconds. And the only
fallout on physical hardware which we have seen so far is on
preempt-RT where the softirq can actually be blocked by RT hogs, but
that's a completely different issue and has nothing to do with the
situation in mainline.

> I'm happy to hear from you if there is a better solution, but we
> shouldn't pretend there is no issue here.

I'm not pretending anything. I'm merely refusing to accept that change
w/o a proper explanation WHY the watchdog fails on physical hardware,
i.e. WHY it does not run for more than 300 seconds.

Thanks,

tglx


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-31 Thread Shaohua Li
On Mon, Aug 31, 2015 at 11:47:52PM +0200, Thomas Gleixner wrote:
> On Mon, 31 Aug 2015, Shaohua Li wrote:
> > > The HPET wraps interval is 0xffffffff / 100000000 = 42.9s
> > > 
> > > tsc interval is (0x481250b45b - 0x219e6efb50) / 2200000000 = 75s
> > > 
> > > 32.1 + 42.9 = 75
> > > 
> > > The example shows hpet wraps, while tsc is marked unstable
> > 
> > Thomas & John,
> > Is this data enough to prove TSC unstable issue can be triggered by HPET
> > wrap? I can resend the patch with the data included.
> 
> Well, it's enough data to prove:
> 
>  - that keeping a VM off the CPU for 75 seconds is insane.

It wraps in 42.9s. 42.9s isn't a long time, so it's not hard to block. I don't
think it's just the VM being off the CPU. A softirq can hog the CPU.

>  - that emulating the HPET with 100MHz shortens the HPET wraparound by
>    a factor of 7 compared to real hardware. With a realistic HPET
>    frequency you have about 300 seconds.
> 
>    Who thought that using 100MHz HPET frequency is a brilliant idea?

I'm not a VM expert. My guess is that 100MHz can reduce interrupts. It's
insane for the hypervisor to update the HPET count at 14.3MHz. Switching to
HPET can introduce even higher overhead in a VM, because of the vmexit on
I/O memory access.

> So we should add crappy heuristics to the watchdog just to workaround
> virt insanities? I'm not convinced.

This is a real issue which can seriously impact performance. Though the
data was collected in a VM, we do see the issue happen on physical machines
too. The watchdog clocksource clearly shows its limitation here; it deserves
an improvement if we can make one. I'm happy to hear from you if there is a
better solution, but we shouldn't pretend there is no issue here.

Thanks,
Shaohua


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-31 Thread Thomas Gleixner
On Mon, 31 Aug 2015, Shaohua Li wrote:
> > The HPET wraps interval is 0xffffffff / 100000000 = 42.9s
> > 
> > tsc interval is (0x481250b45b - 0x219e6efb50) / 2200000000 = 75s
> > 
> > 32.1 + 42.9 = 75
> > 
> > The example shows hpet wraps, while tsc is marked unstable
> 
> Thomas & John,
> Is this data enough to prove TSC unstable issue can be triggered by HPET
> wrap? I can resend the patch with the data included.

Well, it's enough data to prove:

 - that keeping a VM off the CPU for 75 seconds is insane.

 - that emulating the HPET with 100MHz shortens the HPET wraparound by
   a factor of 7 compared to real hardware. With a realistic HPET
   frequency you have about 300 seconds.

   Who thought that using 100MHz HPET frequency is a brilliant idea?

So we should add crappy heuristics to the watchdog just to work around
virt insanities? I'm not convinced.

Thanks,

tglx


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-31 Thread Shaohua Li
On Wed, Aug 26, 2015 at 10:15:33AM -0700, Shaohua Li wrote:
> On Tue, Aug 18, 2015 at 10:18:09PM +0200, Thomas Gleixner wrote:
> > On Tue, 18 Aug 2015, John Stultz wrote:
> > > On Tue, Aug 18, 2015 at 12:28 PM, Thomas Gleixner  
> > > wrote:
> > > > On Tue, 18 Aug 2015, John Stultz wrote:
> > > >> On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner  
> > > >> wrote:
> > > >> > On Mon, 17 Aug 2015, John Stultz wrote:
> > > >> >> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner 
> > > >> >>  wrote:
> > > >> >> > On Mon, 17 Aug 2015, John Stultz wrote:
> > > >> >> >
> > > >> >> >> From: Shaohua Li 
> > > >> >> >>
> > > >> >> >> >From time to time we saw TSC is marked as unstable in our 
> > > >> >> >> >systems, while
> > > >> >> >
> > > >> >> > Stray '>'
> > > >> >> >
> > > >> >> >> the CPUs declare to have stable TSC. Looking at the clocksource 
> > > >> >> >> unstable
> > > >> >> >> detection, there are two problems:
> > > >> >> >> - watchdog clock source wrap. HPET is the most common watchdog 
> > > >> >> >> clock
> > > >> >> >>   source. It's 32-bit and runs in 14.3Mhz. That means the hpet 
> > > >> >> >> counter
> > > >> >> >>   can wrap in about 5 minutes.
> > > >> >> >> - threshold isn't scaled against interval. The threshold is 
> > > >> >> >> 0.0625s in
> > > >> >> >>   0.5s interval. What if the actual interval is bigger than 0.5s?
> > > >> >> >>
> > > >> >> >> The watchdog runs in a timer bh, so hard/soft irq can defer its 
> > > >> >> >> running.
> > > >> >> >> Heavy network stack softirq can hog a cpu. IPMI driver can 
> > > >> >> >> disable
> > > >> >> >> interrupt for a very long time.
> > > >> >> >
> > > >> >> > And they hold off the timer softirq for more than a second? Don't 
> > > >> >> > you
> > > >> >> > think that's the problem which needs to be fixed?
> > > >> >>
> > > >> >> Though this is an issue I've experienced (and tried unsuccessfully 
> > > >> >> to
> > > >> >> fix in a more complicated way) with the RT kernel, where high 
> > > >> >> priority
> > > >> >> tasks blocked the watchdog long enough that we'd disqualify the TSC.
> > > >> >
> > > >> > Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
> > > >> > due to the fixed threshold being applied?
> > > >>
> > > >> This was years ago, but in my experience, the watchdog false positives
> > > >> were due to HPET wraparounds.
> > > >
> > > > Blocking stuff for 5 minutes is insane 
> > > 
> > > Yea. It was usually due to -RT stress testing, which keept the
> > > machines busy for quite awhile. But again, if you have machines being
> > > maxed out with networking load, etc, even for long amounts of time, we
> > > still want to avoid false positives. Because after the watchdog
> > 
> > The networking softirq does not hog the other softirqs. It has a limit
> > on processing loops and then goes back to let the other softirqs be
> > handled. So no, I doubt that heavy networking can cause this. If it
> > does then we have some other way more serious problems.
> > 
> > I can see the issue with RT stress testing, but not with networking in
> > mainline.
> 
> Ok, the issue is triggered in my KVM guest. I guess it's easier to
> trigger in KVM because the HPET is 100MHz.
> 
> [  135.930067] clocksource: timekeeping watchdog: Marking clocksource 'tsc' as unstable because the skew is too large:
> [  135.930095] clocksource:   'hpet' wd_now: 2bc19ea0 wd_last: 6c4e5570 mask: ffffffff
> [  135.930105] clocksource:   'tsc' cs_now: 481250b45b cs_last: 219e6efb50 mask: ffffffffffffffff
> [  135.938750] clocksource: Switched to clocksource hpet
> 
> The HPET clock is 100MHz and the CPU speed is 2200MHz. KVM is passed the
> correct CPU info, so the guest's cpuinfo shows the TSC as stable.
> 
> hpet interval is ((0x2bc19ea0 - 0x6c4e5570) & 0xffffffff) / 100000000 = 32.1s.
> 
> The HPET wraps interval is 0xffffffff / 100000000 = 42.9s
> 
> tsc interval is (0x481250b45b - 0x219e6efb50) / 2200000000 = 75s
> 
> 32.1 + 42.9 = 75
> 
> The example shows hpet wraps, while tsc is marked unstable

Thomas & John,
Is this data enough to prove TSC unstable issue can be triggered by HPET
wrap? I can resend the patch with the data included.

Thanks,
Shaohua


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-26 Thread Shaohua Li
On Tue, Aug 18, 2015 at 10:18:09PM +0200, Thomas Gleixner wrote:
> On Tue, 18 Aug 2015, John Stultz wrote:
> > On Tue, Aug 18, 2015 at 12:28 PM, Thomas Gleixner  
> > wrote:
> > > On Tue, 18 Aug 2015, John Stultz wrote:
> > >> On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner  
> > >> wrote:
> > >> > On Mon, 17 Aug 2015, John Stultz wrote:
> > >> >> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner  
> > >> >> wrote:
> > >> >> > On Mon, 17 Aug 2015, John Stultz wrote:
> > >> >> >
> > >> >> >> From: Shaohua Li 
> > >> >> >>
> > >> >> >> >From time to time we saw TSC is marked as unstable in our 
> > >> >> >> >systems, while
> > >> >> >
> > >> >> > Stray '>'
> > >> >> >
> > >> >> >> the CPUs declare to have stable TSC. Looking at the clocksource 
> > >> >> >> unstable
> > >> >> >> detection, there are two problems:
> > >> >> >> - watchdog clock source wrap. HPET is the most common watchdog 
> > >> >> >> clock
> > >> >> >>   source. It's 32-bit and runs in 14.3Mhz. That means the hpet 
> > >> >> >> counter
> > >> >> >>   can wrap in about 5 minutes.
> > >> >> >> - threshold isn't scaled against interval. The threshold is 
> > >> >> >> 0.0625s in
> > >> >> >>   0.5s interval. What if the actual interval is bigger than 0.5s?
> > >> >> >>
> > >> >> >> The watchdog runs in a timer bh, so hard/soft irq can defer its 
> > >> >> >> running.
> > >> >> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> > >> >> >> interrupt for a very long time.
> > >> >> >
> > >> >> > And they hold off the timer softirq for more than a second? Don't 
> > >> >> > you
> > >> >> > think that's the problem which needs to be fixed?
> > >> >>
> > >> >> Though this is an issue I've experienced (and tried unsuccessfully to
> > >> >> fix in a more complicated way) with the RT kernel, where high priority
> > >> >> tasks blocked the watchdog long enough that we'd disqualify the TSC.
> > >> >
> > >> > Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
> > >> > due to the fixed threshold being applied?
> > >>
> > >> This was years ago, but in my experience, the watchdog false positives
> > >> were due to HPET wraparounds.
> > >
> > > Blocking stuff for 5 minutes is insane 
> > 
> > Yea. It was usually due to -RT stress testing, which keept the
> > machines busy for quite awhile. But again, if you have machines being
> > maxed out with networking load, etc, even for long amounts of time, we
> > still want to avoid false positives. Because after the watchdog
> 
> The networking softirq does not hog the other softirqs. It has a limit
> on processing loops and then goes back to let the other softirqs be
> handled. So no, I doubt that heavy networking can cause this. If it
> does then we have some other way more serious problems.
> 
> I can see the issue with RT stress testing, but not with networking in
> mainline.

Ok, the issue is triggered in my KVM guest. I guess it's easier to
trigger in KVM because the HPET is 100MHz.

[  135.930067] clocksource: timekeeping watchdog: Marking clocksource 'tsc' as unstable because the skew is too large:
[  135.930095] clocksource:   'hpet' wd_now: 2bc19ea0 wd_last: 6c4e5570 mask: ffffffff
[  135.930105] clocksource:   'tsc' cs_now: 481250b45b cs_last: 219e6efb50 mask: ffffffffffffffff
[  135.938750] clocksource: Switched to clocksource hpet

The HPET clock is 100MHz and the CPU speed is 2200MHz. KVM is passed the
correct CPU info, so the guest's cpuinfo shows the TSC as stable.

hpet interval is ((0x2bc19ea0 - 0x6c4e5570) & 0xffffffff) / 100000000 = 32.1s.

The HPET wraps interval is 0xffffffff / 100000000 = 42.9s

tsc interval is (0x481250b45b - 0x219e6efb50) / 2200000000 = 75s

32.1 + 42.9 = 75

The example shows hpet wraps, while tsc is marked unstable

Thanks,
Shaohua
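(A standalone sketch that reproduces the arithmetic above from the logged counter values. The 100MHz HPET and 2.2GHz TSC rates are the ones stated in the message; the variable names follow the dmesg output.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* counter values from the dmesg snippet above */
	const uint64_t wd_now = 0x2bc19ea0,      wd_last = 0x6c4e5570;		/* HPET, 32-bit */
	const uint64_t cs_now = 0x481250b45bULL, cs_last = 0x219e6efb50ULL;	/* TSC */

	double hpet_s = (double)((wd_now - wd_last) & 0xffffffffu) / 100e6;
	double wrap_s = (double)0xffffffffu / 100e6;
	double tsc_s  = (double)(cs_now - cs_last) / 2.2e9;

	printf("hpet interval : %.1fs\n", hpet_s);		/* ~32.1s */
	printf("hpet wrap     : %.1fs\n", wrap_s);		/* ~42.9s */
	printf("tsc interval  : %.1fs\n", tsc_s);		/* ~75s   */
	printf("hpet + wrap   : %.1fs\n", hpet_s + wrap_s);	/* ~75s: exactly one wrap was missed */
	return 0;
}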


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-18 Thread Thomas Gleixner
On Tue, 18 Aug 2015, John Stultz wrote:
> On Tue, Aug 18, 2015 at 12:28 PM, Thomas Gleixner  wrote:
> > On Tue, 18 Aug 2015, John Stultz wrote:
> >> On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner  
> >> wrote:
> >> > On Mon, 17 Aug 2015, John Stultz wrote:
> >> >> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner  
> >> >> wrote:
> >> >> > On Mon, 17 Aug 2015, John Stultz wrote:
> >> >> >
> >> >> >> From: Shaohua Li 
> >> >> >>
> >> >> >> >From time to time we saw TSC is marked as unstable in our systems, 
> >> >> >> >while
> >> >> >
> >> >> > Stray '>'
> >> >> >
> >> >> >> the CPUs declare to have stable TSC. Looking at the clocksource 
> >> >> >> unstable
> >> >> >> detection, there are two problems:
> >> >> >> - watchdog clock source wrap. HPET is the most common watchdog clock
> >> >> >>   source. It's 32-bit and runs in 14.3Mhz. That means the hpet 
> >> >> >> counter
> >> >> >>   can wrap in about 5 minutes.
> >> >> >> - threshold isn't scaled against interval. The threshold is 0.0625s 
> >> >> >> in
> >> >> >>   0.5s interval. What if the actual interval is bigger than 0.5s?
> >> >> >>
> >> >> >> The watchdog runs in a timer bh, so hard/soft irq can defer its 
> >> >> >> running.
> >> >> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> >> >> >> interrupt for a very long time.
> >> >> >
> >> >> > And they hold off the timer softirq for more than a second? Don't you
> >> >> > think that's the problem which needs to be fixed?
> >> >>
> >> >> Though this is an issue I've experienced (and tried unsuccessfully to
> >> >> fix in a more complicated way) with the RT kernel, where high priority
> >> >> tasks blocked the watchdog long enough that we'd disqualify the TSC.
> >> >
> >> > Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
> >> > due to the fixed threshold being applied?
> >>
> >> This was years ago, but in my experience, the watchdog false positives
> >> were due to HPET wraparounds.
> >
> > Blocking stuff for 5 minutes is insane 
> 
> Yea. It was usually due to -RT stress testing, which keept the
> machines busy for quite awhile. But again, if you have machines being
> maxed out with networking load, etc, even for long amounts of time, we
> still want to avoid false positives. Because after the watchdog

The networking softirq does not hog the other softirqs. It has a limit
on processing loops and then goes back to let the other softirqs be
handled. So no, I doubt that heavy networking can cause this. If it
does then we have other, way more serious problems.

I can see the issue with RT stress testing, but not with networking in
mainline.

Thanks,

tglx


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-18 Thread John Stultz
On Tue, Aug 18, 2015 at 12:28 PM, Thomas Gleixner  wrote:
> On Tue, 18 Aug 2015, John Stultz wrote:
>> On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner  wrote:
>> > On Mon, 17 Aug 2015, John Stultz wrote:
>> >> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner  
>> >> wrote:
>> >> > On Mon, 17 Aug 2015, John Stultz wrote:
>> >> >
>> >> >> From: Shaohua Li 
>> >> >>
>> >> >> >From time to time we saw TSC is marked as unstable in our systems, 
>> >> >> >while
>> >> >
>> >> > Stray '>'
>> >> >
>> >> >> the CPUs declare to have stable TSC. Looking at the clocksource 
>> >> >> unstable
>> >> >> detection, there are two problems:
>> >> >> - watchdog clock source wrap. HPET is the most common watchdog clock
>> >> >>   source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
>> >> >>   can wrap in about 5 minutes.
>> >> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
>> >> >>   0.5s interval. What if the actual interval is bigger than 0.5s?
>> >> >>
>> >> >> The watchdog runs in a timer bh, so hard/soft irq can defer its 
>> >> >> running.
>> >> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
>> >> >> interrupt for a very long time.
>> >> >
>> >> > And they hold off the timer softirq for more than a second? Don't you
>> >> > think that's the problem which needs to be fixed?
>> >>
>> >> Though this is an issue I've experienced (and tried unsuccessfully to
>> >> fix in a more complicated way) with the RT kernel, where high priority
>> >> tasks blocked the watchdog long enough that we'd disqualify the TSC.
>> >
>> > Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
>> > due to the fixed threshold being applied?
>>
>> This was years ago, but in my experience, the watchdog false positives
>> were due to HPET wraparounds.
>
> Blocking stuff for 5 minutes is insane 

Yea. It was usually due to -RT stress testing, which kept the
machines busy for quite a while. But again, if you have machines being
maxed out with networking load, etc, even for long amounts of time, we
still want to avoid false positives. Because after the watchdog
disqualifies the TSC, the only clocksources left wrap around much
sooner, and we're more likely to then actually lose time during the
next load spike.

Cc'ing Clark and Steven to see if it's something they still run into,
and maybe they can help validate the patch.

thanks
-john


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-18 Thread Thomas Gleixner
On Tue, 18 Aug 2015, John Stultz wrote:
> On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner  wrote:
> > On Mon, 17 Aug 2015, John Stultz wrote:
> >> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner  
> >> wrote:
> >> > On Mon, 17 Aug 2015, John Stultz wrote:
> >> >
> >> >> From: Shaohua Li 
> >> >>
> >> >> >From time to time we saw TSC is marked as unstable in our systems, 
> >> >> >while
> >> >
> >> > Stray '>'
> >> >
> >> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
> >> >> detection, there are two problems:
> >> >> - watchdog clock source wrap. HPET is the most common watchdog clock
> >> >>   source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
> >> >>   can wrap in about 5 minutes.
> >> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
> >> >>   0.5s interval. What if the actual interval is bigger than 0.5s?
> >> >>
> >> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> >> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> >> >> interrupt for a very long time.
> >> >
> >> > And they hold off the timer softirq for more than a second? Don't you
> >> > think that's the problem which needs to be fixed?
> >>
> >> Though this is an issue I've experienced (and tried unsuccessfully to
> >> fix in a more complicated way) with the RT kernel, where high priority
> >> tasks blocked the watchdog long enough that we'd disqualify the TSC.
> >
> > Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
> > due to the fixed threshold being applied?
> 
> This was years ago, but in my experience, the watchdog false positives
> were due to HPET wraparounds.

Blocking stuff for 5 minutes is insane 
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-18 Thread John Stultz
On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner  wrote:
> On Mon, 17 Aug 2015, John Stultz wrote:
>> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner  wrote:
>> > On Mon, 17 Aug 2015, John Stultz wrote:
>> >
>> >> From: Shaohua Li 
>> >>
>> >> >From time to time we saw TSC is marked as unstable in our systems, while
>> >
>> > Stray '>'
>> >
>> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
>> >> detection, there are two problems:
>> >> - watchdog clock source wrap. HPET is the most common watchdog clock
>> >>   source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
>> >>   can wrap in about 5 minutes.
>> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
>> >>   0.5s interval. What if the actual interval is bigger than 0.5s?
>> >>
>> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
>> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
>> >> interrupt for a very long time.
>> >
>> > And they hold off the timer softirq for more than a second? Don't you
>> > think that's the problem which needs to be fixed?
>>
>> Though this is an issue I've experienced (and tried unsuccessfully to
>> fix in a more complicated way) with the RT kernel, where high priority
>> tasks blocked the watchdog long enough that we'd disqualify the TSC.
>
> Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
> due to the fixed threshold being applied?

This was years ago, but in my experience, the watchdog false positives
were due to HPET wraparounds.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-18 Thread Thomas Gleixner
On Mon, 17 Aug 2015, Shaohua Li wrote:
> On Mon, Aug 17, 2015 at 03:17:28PM -0700, John Stultz wrote:
> > That said, I agree the "should"s and other vague qualifiers in the
> > commit description you point out should have more specifics to back
> > things up. And I'm fine delaying this (and the follow-on) patch until
> > those details are provided.
> 
> It's not something I guess. We do see the issue from time to time. The
> IPMI driver accesses some IO ports in softirq and hog cpu for a very
> long time, then the watchdog alert.

You still fail to provide proper numbers. 'very long time' does not
qualify as an argument at all.

> The false alert on the other hand has very worse effect. It forces
> to use HPET as clocksource, which has very big performance
> penality. We can't even manually switch back to TSC as current
> interface doesn't allow us to do it, then we can only reboot the
> system. I agree the driver should be fixed, but the watchdog has
> false alert, we definitively should fix it.

I tend to disagree. The watchdog has constraints and the driver is
violating these constraints, so the first thing which wants to be
addressed is the driver itself.

The behaviour of the watchdog in the case of constraint violations is
definitely suboptimal and can lead to false positives. I'm not against
making that more robust, but I'm not accepting your 'watchdog is
broken' argumentation at all.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-18 Thread Thomas Gleixner
On Mon, 17 Aug 2015, John Stultz wrote:
> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner  wrote:
> > On Mon, 17 Aug 2015, John Stultz wrote:
> >
> >> From: Shaohua Li 
> >>
> >> >From time to time we saw TSC is marked as unstable in our systems, while
> >
> > Stray '>'
> >
> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
> >> detection, there are two problems:
> >> - watchdog clock source wrap. HPET is the most common watchdog clock
> >>   source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
> >>   can wrap in about 5 minutes.
> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
> >>   0.5s interval. What if the actual interval is bigger than 0.5s?
> >>
> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> >> interrupt for a very long time.
> >
> > And they hold off the timer softirq for more than a second? Don't you
> > think that's the problem which needs to be fixed?
> 
> Though this is an issue I've experienced (and tried unsuccessfully to
> fix in a more complicated way) with the RT kernel, where high priority
> tasks blocked the watchdog long enough that we'd disqualify the TSC.

Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
due to the fixed threshold being applied?

> > So 'fixing' the watchdog is the wrong approach. Fixing the stuff which
> > prevents the watchdog to run is the proper thing to do.
> 
> I'm not sure here. I feel like these delay-caused false positives
> (I've seen similar reports w/ VMs being stalled) are more common then
> one-off SMI TSC skews.

Yes, they are more common, but the other issues are reality as well.

> There are hard lines in the timekeeping code, where we do say: Don't
> delay us past X or we can't really handle it, but in this case, the
> main clocksource is fine and the limit is being caused by the
> watchdog. So I think some sort of a solution to remove this
> restriction would be good. We don't want to needlessly punish fine
> hardware because our checks for bad hardware add extra restrictions.

No argument here. Though fine hardware has an escape route already to
avoid the watchdog business altogether (tsc=reliable on the command
line).

> That said, I agree the "should"s and other vague qualifiers in the
> commit description you point out should have more specifics to back
> things up. And I'm fine delaying this (and the follow-on) patch until
> those details are provided.

Fair enough.

Thanks,

tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-18 Thread Thomas Gleixner
On Tue, 18 Aug 2015, John Stultz wrote:
> On Tue, Aug 18, 2015 at 12:28 PM, Thomas Gleixner t...@linutronix.de wrote:
> > On Tue, 18 Aug 2015, John Stultz wrote:
> > > On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner t...@linutronix.de wrote:
> > > > On Mon, 17 Aug 2015, John Stultz wrote:
> > > > > On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner t...@linutronix.de wrote:
> > > > > > On Mon, 17 Aug 2015, John Stultz wrote:
> > > > > >
> > > > > > > From: Shaohua Li s...@fb.com
> > > > > > >
> > > > > > > >From time to time we saw TSC is marked as unstable in our systems, while
> > > > > >
> > > > > > Stray '>'
> > > > > >
> > > > > > > the CPUs declare to have stable TSC. Looking at the clocksource unstable
> > > > > > > detection, there are two problems:
> > > > > > > - watchdog clock source wrap. HPET is the most common watchdog clock
> > > > > > >   source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
> > > > > > >   can wrap in about 5 minutes.
> > > > > > > - threshold isn't scaled against interval. The threshold is 0.0625s in
> > > > > > >   0.5s interval. What if the actual interval is bigger than 0.5s?
> > > > > > >
> > > > > > > The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> > > > > > > Heavy network stack softirq can hog a cpu. IPMI driver can disable
> > > > > > > interrupt for a very long time.
> > > > > >
> > > > > > And they hold off the timer softirq for more than a second? Don't you
> > > > > > think that's the problem which needs to be fixed?
> > > > >
> > > > > Though this is an issue I've experienced (and tried unsuccessfully to
> > > > > fix in a more complicated way) with the RT kernel, where high priority
> > > > > tasks blocked the watchdog long enough that we'd disqualify the TSC.
> > > >
> > > > Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
> > > > due to the fixed threshold being applied?
> > >
> > > This was years ago, but in my experience, the watchdog false positives
> > > were due to HPET wraparounds.
> >
> > Blocking stuff for 5 minutes is insane
>
> Yea. It was usually due to -RT stress testing, which kept the
> machines busy for quite a while. But again, if you have machines being
> maxed out with networking load, etc., even for long amounts of time, we
> still want to avoid false positives. Because after the watchdog

The networking softirq does not hog the other softirqs. It has a limit
on processing loops and then goes back to let the other softirqs be
handled. So no, I doubt that heavy networking can cause this. If it
does then we have some other way more serious problems.

I can see the issue with RT stress testing, but not with networking in
mainline.

Thanks,

tglx
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
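
The "limit on processing loops" Thomas refers to is the softirq core's
restart budget: pending softirqs are re-run only a bounded number of times
(and for a bounded time) in one go before the remainder is pushed off to
ksoftirqd. A simplified, self-contained sketch of that idea follows; the
names, the constant and the stub handlers are illustrative assumptions, not
the kernel's actual __do_softirq().

#include <stdbool.h>

#define MAX_RESTART	10	/* restart budget for one invocation (assumed value) */

static unsigned long pending;		/* bitmask of raised softirqs */

static bool softirq_pending(void)	{ return pending != 0; }
static void handle_one_pass(void)	{ pending = 0; /* run each raised handler once */ }
static void maybe_raise_more(void)	{ /* handlers may raise new work here */ }
static void wakeup_ksoftirqd(void)	{ /* defer the rest to the per-cpu thread */ }

static void do_softirq_sketch(void)
{
	int restart = MAX_RESTART;

	do {
		handle_one_pass();
		maybe_raise_more();
	} while (softirq_pending() && --restart);

	/*
	 * Budget used up but work still pending: stop hogging this context
	 * and let ksoftirqd finish, so other softirqs (including the timer
	 * softirq that drives the clocksource watchdog) get to run.
	 */
	if (softirq_pending())
		wakeup_ksoftirqd();
}

int main(void)
{
	pending = 1;
	do_softirq_sketch();
	return 0;
}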


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-17 Thread John Stultz
On Mon, Aug 17, 2015 at 7:57 PM, Shaohua Li  wrote:
> On Mon, Aug 17, 2015 at 03:17:28PM -0700, John Stultz wrote:
>> That said, I agree the "should"s and other vague qualifiers in the
>> commit description you point out should have more specifics to back
>> things up. And I'm fine delaying this (and the follow-on) patch until
>> those details are provided.
>
> It's not something I guess. We do see the issue from time to time. The
> IPMI driver accesses some IO ports in softirq and hog cpu for a very
> long time, then the watchdog alert. The false alert on the other hand
> has very worse effect. It forces to use HPET as clocksource, which has
> very big performance penality. We can't even manually switch back to TSC
> as current interface doesn't allow us to do it, then we can only reboot
> the system. I agree the driver should be fixed, but the watchdog has
> false alert, we definitively should fix it.

I think Thomas is requesting that some of the vague terms be
quantified. Seeing the issue "from time to time" isn't super
informative. When the IPMI driver hogs the cpu "for a very long time",
how long does that actually take? You've provided the HPET
frequency, and the wrapping interval on your hardware. Do these
intervals all line up properly?

I sympathize that the "show-your-work" math problem aspect of this
request might be a little remedial and irritating, especially when the
patch fixes the problem for you. But it's important, so later on when
some bug crops up in nearby code, folks can easily repeat your
calculation and know the problem isn't from your code.

> The 1s interval is arbitary. If you think there is better way to fix the
> issue, please let me know.

I don't think 1s is necessarily arbitrary. Maybe not much conscious
thought was put into it, but clearly .001 sec wasn't chosen, nor
10 minutes, for a reason.

So given the intervals you're seeing the problem with, would maybe a
larger max interval (say 30 seconds) make more or less sense? What
would the tradeoffs be? (i.e., would that exclude clocksources with
faster wraps from being used as watchdogs, with your patches?)

I'm sure a good interval could be chosen with some thought, and the
rationale explained. :)

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
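
One way to make that trade-off concrete: for the proposed guard to be
trustworthy, the chosen max interval has to stay well below the wrap period
of the watchdog clocksource itself, otherwise a wrapped watchdog delta can
still look like a short, valid interval. A small sketch of that comparison
follows; the frequencies, counter widths and the "half the wrap period"
margin are assumptions for illustration, not values from the thread.

#include <stdio.h>
#include <stdint.h>

struct wd_source {
	const char *name;
	unsigned int bits;
	double freq_hz;
};

int main(void)
{
	const struct wd_source sources[] = {
		{ "hpet",    32, 14318180.0 },
		{ "acpi_pm", 24,  3579545.0 },
	};
	const double candidates[] = { 1.0, 30.0 };	/* max interval, seconds */

	for (unsigned int i = 0; i < 2; i++) {
		double wrap = (double)(1ULL << sources[i].bits) / sources[i].freq_hz;

		for (unsigned int j = 0; j < 2; j++) {
			/* demand the max interval sit below half the wrap period */
			int ok = candidates[j] < wrap / 2.0;

			printf("%-8s wrap %6.1fs  max %4.1fs -> %s\n",
			       sources[i].name, wrap, candidates[j],
			       ok ? "usable as watchdog" : "too close to its wrap");
		}
	}
	return 0;
}

With these (assumed) numbers a 1s cap leaves both HPET and the ACPI PM timer
usable, while a 30s cap would effectively rule out the faster-wrapping ACPI
PM timer, which is the exclusion question raised above.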


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-17 Thread Shaohua Li
On Mon, Aug 17, 2015 at 03:17:28PM -0700, John Stultz wrote:
> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner  wrote:
> > On Mon, 17 Aug 2015, John Stultz wrote:
> >
> >> From: Shaohua Li 
> >>
> >> >From time to time we saw TSC is marked as unstable in our systems, while
> >
> > Stray '>'
> >
> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
> >> detection, there are two problems:
> >> - watchdog clock source wrap. HPET is the most common watchdog clock
> >>   source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
> >>   can wrap in about 5 minutes.
> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
> >>   0.5s interval. What if the actual interval is bigger than 0.5s?
> >>
> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> >> interrupt for a very long time.
> >
> > And they hold off the timer softirq for more than a second? Don't you
> > think that's the problem which needs to be fixed?
> 
> Though this is an issue I've experienced (and tried unsuccessfully to
> fix in a more complicated way) with the RT kernel, where high priority
> tasks blocked the watchdog long enough that we'd disqualify the TSC.
> 
> Ideally that sort of high-priority RT busyness would be avoided, but
> its also a pain to have false positive trigger when doing things like
> stress testing.
> 
> 
> >> The first problem is mostly we are suffering I think.
> >
> > So you think that's the root cause and because your patch makes it go
> > away it's not necessary to know for sure, right?
> >
> >> Here is a simple patch to fix the issues. If the waterdog doesn't run
> >
> > waterdog?
> 
> Allergen-free. :)
> 
> 
> >> for a long time, we ignore the detection.
> >
> > What's 'long time'? Please explain the numbers chosen.
> >
> >> This should work for the two
> >
> > Emphasis on 'should'?
> >
> >> problems. For the second one, we probably doen't need to scale if the
> >> interval isn't very long.
> >
> > -ENOPARSE
> >
> >> @@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data);
> >>  static void __clocksource_change_rating(struct clocksource *cs, int 
> >> rating);
> >>
> >>  /*
> >> - * Interval: 0.5sec Threshold: 0.0625s
> >> + * Interval: 0.5sec MaxInterval: 1s Threshold: 0.0625s
> >>   */
> >>  #define WATCHDOG_INTERVAL (HZ >> 1)
> >> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> >>  #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
> >>
> >>  static void clocksource_watchdog_work(struct work_struct *work)
> >> @@ -217,7 +218,9 @@ static void clocksource_watchdog(unsigned long data)
> >>   continue;
> >>
> >>   /* Check the deviation from the watchdog clocksource. */
> >> - if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
> >> + if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) &&
> >> + cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
> >> + wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
> >
> > So that adds a new opportunity for undiscovered wreckage:
> >
> >clocksource_watchdog();
> > <--- SMI skews TSC
> >looong_irq_disabled_region();
> >
> >clocksource_watchdog();  <--- Does not detect skew
> >
> > and it will not detect it later on if that SMI was a one time event.
> >
> > So 'fixing' the watchdog is the wrong approach. Fixing the stuff which
> > prevents the watchdog to run is the proper thing to do.
> 
> I'm not sure here. I feel like these delay-caused false positives
> (I've seen similar reports w/ VMs being stalled) are more common then
> one-off SMI TSC skews.
> 
> There are hard lines in the timekeeping code, where we do say: Don't
> delay us past X or we can't really handle it, but in this case, the
> main clocksource is fine and the limit is being caused by the
> watchdog. So I think some sort of a solution to remove this
> restriction would be good. We don't want to needlessly punish fine
> hardware because our checks for bad hardware add extra restrictions.
> 
> That said, I agree the "should"s and other vague qualifiers in the
> commit description you point out should have more specifics to back
> things up. And I'm fine delaying this (and the follow-on) patch until
> those details are provided.

It's not something I guess. We do see the issue from time to time. The
IPMI driver accesses some IO ports in softirq and hog cpu for a very
long time, then the watchdog alert. The false alert on the other hand
has very worse effect. It forces to use HPET as clocksource, which has
very big performance penality. We can't even manually switch back to TSC
as current interface doesn't allow us to do it, then we can only reboot
the system. I agree the driver should be fixed, but the watchdog has
false alert, we definitively should fix it.

The 1s interval is arbitary. If you think there is better way to fix the
issue, 

Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-17 Thread John Stultz
On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner  wrote:
> On Mon, 17 Aug 2015, John Stultz wrote:
>
>> From: Shaohua Li 
>>
>> >From time to time we saw TSC is marked as unstable in our systems, while
>
> Stray '>'
>
>> the CPUs declare to have stable TSC. Looking at the clocksource unstable
>> detection, there are two problems:
>> - watchdog clock source wrap. HPET is the most common watchdog clock
>>   source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
>>   can wrap in about 5 minutes.
>> - threshold isn't scaled against interval. The threshold is 0.0625s in
>>   0.5s interval. What if the actual interval is bigger than 0.5s?
>>
>> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
>> Heavy network stack softirq can hog a cpu. IPMI driver can disable
>> interrupt for a very long time.
>
> And they hold off the timer softirq for more than a second? Don't you
> think that's the problem which needs to be fixed?

Though this is an issue I've experienced (and tried unsuccessfully to
fix in a more complicated way) with the RT kernel, where high priority
tasks blocked the watchdog long enough that we'd disqualify the TSC.

Ideally that sort of high-priority RT busyness would be avoided, but
its also a pain to have false positive trigger when doing things like
stress testing.


>> The first problem is mostly we are suffering I think.
>
> So you think that's the root cause and because your patch makes it go
> away it's not necessary to know for sure, right?
>
>> Here is a simple patch to fix the issues. If the waterdog doesn't run
>
> waterdog?

Allergen-free. :)


>> for a long time, we ignore the detection.
>
> What's 'long time'? Please explain the numbers chosen.
>
>> This should work for the two
>
> Emphasis on 'should'?
>
>> problems. For the second one, we probably doen't need to scale if the
>> interval isn't very long.
>
> -ENOPARSE
>
>> @@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data);
>>  static void __clocksource_change_rating(struct clocksource *cs, int rating);
>>
>>  /*
>> - * Interval: 0.5sec Threshold: 0.0625s
>> + * Interval: 0.5sec MaxInterval: 1s Threshold: 0.0625s
>>   */
>>  #define WATCHDOG_INTERVAL (HZ >> 1)
>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>>  #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
>>
>>  static void clocksource_watchdog_work(struct work_struct *work)
>> @@ -217,7 +218,9 @@ static void clocksource_watchdog(unsigned long data)
>>   continue;
>>
>>   /* Check the deviation from the watchdog clocksource. */
>> - if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
>> + if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) &&
>> + cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
>> + wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
>
> So that adds a new opportunity for undiscovered wreckage:
>
>clocksource_watchdog();
> <--- SMI skews TSC
>looong_irq_disabled_region();
>
>clocksource_watchdog();  <--- Does not detect skew
>
> and it will not detect it later on if that SMI was a one time event.
>
> So 'fixing' the watchdog is the wrong approach. Fixing the stuff which
> prevents the watchdog to run is the proper thing to do.

I'm not sure here. I feel like these delay-caused false positives
(I've seen similar reports w/ VMs being stalled) are more common then
one-off SMI TSC skews.

There are hard lines in the timekeeping code, where we do say: Don't
delay us past X or we can't really handle it, but in this case, the
main clocksource is fine and the limit is being caused by the
watchdog. So I think some sort of a solution to remove this
restriction would be good. We don't want to needlessly punish fine
hardware because our checks for bad hardware add extra restrictions.

That said, I agree the "should"s and other vague qualifiers in the
commit description you point out should have more specifics to back
things up. And I'm fine delaying this (and the follow-on) patch until
those details are provided.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-17 Thread Thomas Gleixner
On Mon, 17 Aug 2015, John Stultz wrote:

> From: Shaohua Li 
> 
> >From time to time we saw TSC is marked as unstable in our systems, while

Stray '>'

> the CPUs declare to have stable TSC. Looking at the clocksource unstable
> detection, there are two problems:
> - watchdog clock source wrap. HPET is the most common watchdog clock
>   source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
>   can wrap in about 5 minutes.
> - threshold isn't scaled against interval. The threshold is 0.0625s in
>   0.5s interval. What if the actual interval is bigger than 0.5s?
> 
> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> interrupt for a very long time.

And they hold off the timer softirq for more than a second? Don't you
think that's the problem which needs to be fixed?

> The first problem is mostly we are suffering I think.

So you think that's the root cause and because your patch makes it go
away it's not necessary to know for sure, right?

> Here is a simple patch to fix the issues. If the waterdog doesn't run

waterdog?

> for a long time, we ignore the detection. 

What's 'long time'? Please explain the numbers chosen.

> This should work for the two

Emphasis on 'should'? 

> problems. For the second one, we probably doen't need to scale if the
> interval isn't very long.

-ENOPARSE
 
> @@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data);
>  static void __clocksource_change_rating(struct clocksource *cs, int rating);
>  
>  /*
> - * Interval: 0.5sec Threshold: 0.0625s
> + * Interval: 0.5sec MaxInterval: 1s Threshold: 0.0625s
>   */
>  #define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>  #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
>  
>  static void clocksource_watchdog_work(struct work_struct *work)
> @@ -217,7 +218,9 @@ static void clocksource_watchdog(unsigned long data)
>   continue;
>  
>   /* Check the deviation from the watchdog clocksource. */
> - if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
> + if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) &&
> + cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
> + wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {

So that adds a new opportunity for undiscovered wreckage:

   clocksource_watchdog();
    <--- SMI skews TSC
   looong_irq_disabled_region();
   
   clocksource_watchdog();  <--- Does not detect skew

and it will not detect it later on if that SMI was a one time event.

So 'fixing' the watchdog is the wrong approach. Fixing the stuff which
prevents the watchdog to run is the proper thing to do.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 8/9] clocksource: Improve unstable clocksource detection

2015-08-17 Thread John Stultz
From: Shaohua Li 

>From time to time we saw TSC is marked as unstable in our systems, while
the CPUs declare to have stable TSC. Looking at the clocksource unstable
detection, there are two problems:
- watchdog clock source wrap. HPET is the most common watchdog clock
  source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
  can wrap in about 5 minutes.
- threshold isn't scaled against interval. The threshold is 0.0625s in
  0.5s interval. What if the actual interval is bigger than 0.5s?

The watchdog runs in a timer bh, so hard/soft irq can defer its running.
Heavy network stack softirq can hog a cpu. IPMI driver can disable
interrupt for a very long time. The first problem is mostly we are
suffering I think.

Here is a simple patch to fix the issues. If the waterdog doesn't run
for a long time, we ignore the detection. This should work for the two
problems. For the second one, we probably doen't need to scale if the
interval isn't very long.

Cc: Prarit Bhargava 
Cc: Richard Cochran 
Cc: Daniel Lezcano 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Signed-off-by: Shaohua Li 
Signed-off-by: John Stultz 
---
 kernel/time/clocksource.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 841b72f..8417c83 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data);
 static void __clocksource_change_rating(struct clocksource *cs, int rating);
 
 /*
- * Interval: 0.5sec Threshold: 0.0625s
+ * Interval: 0.5sec MaxInterval: 1s Threshold: 0.0625s
  */
 #define WATCHDOG_INTERVAL (HZ >> 1)
+#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
 #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
 
 static void clocksource_watchdog_work(struct work_struct *work)
@@ -217,7 +218,9 @@ static void clocksource_watchdog(unsigned long data)
continue;
 
/* Check the deviation from the watchdog clocksource. */
-   if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
+   if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) &&
+   cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
+   wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
pr_warn("timekeeping watchdog: Marking clocksource '%s' 
as unstable because the skew is too large:\n",
cs->name);
pr_warn("  '%s' wd_now: %llx 
wd_last: %llx mask: %llx\n",
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
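
To see what the hunk above changes in behaviour, here is a small user-space
re-creation of just the modified comparison. It mirrors the patch's logic
but is only a sketch with invented sample intervals, not the kernel code
itself.

#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_SEC			1000000000LL
#define WATCHDOG_THRESHOLD		(NSEC_PER_SEC >> 4)	/* 0.0625s */
#define WATCHDOG_MAX_INTERVAL_NS	(NSEC_PER_SEC)		/* 1s */

static void check(const char *what, int64_t cs_nsec, int64_t wd_nsec)
{
	int64_t delta = cs_nsec > wd_nsec ? cs_nsec - wd_nsec : wd_nsec - cs_nsec;
	int unstable = delta > WATCHDOG_THRESHOLD &&
		       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
		       wd_nsec < WATCHDOG_MAX_INTERVAL_NS;

	printf("%-24s cs=%6.3fs wd=%6.3fs -> %s\n", what,
	       cs_nsec / 1e9, wd_nsec / 1e9,
	       unstable ? "mark unstable" : "ignore");
}

int main(void)
{
	/* normal 0.5s run, tiny skew: ignored before and after the patch */
	check("normal run", 500100000, 500000000);
	/* normal 0.5s run, 0.1s skew: still flagged with the patch applied */
	check("real skew", 600000000, 500000000);
	/* watchdog timer ran 5s late and the watchdog counter wrapped,
	 * so its delta reads short: no longer a false positive */
	check("delayed run, wd wrapped", 5 * NSEC_PER_SEC, 700000000);
	return 0;
}

The third case is the false positive the patch is aimed at: the skew check
is simply skipped whenever either measured interval reaches one second,
i.e. whenever the watchdog run itself was delayed.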

