Re: kvm guest: hrtimer: interrupt too slow

2009-10-10 Thread Michael Tokarev

Michael Tokarev wrote:

Frederic Weisbecker wrote:

[]

Was there swapping going on?

Not as far as I can see, and sar output agrees.


But I can read this from you guest traces:


I missed this one yesterday.  Note it's GUEST traces
indeed.  Higher (read: non-zero) pgp{in,out} and faults
values happens in *guest*, not on host (original question
was if we've swapping in HOST, which'd explain the timer
issues)

[cutting extra all-zero columns]

pgpgin/s pgpgout/s   fault/s  pgfree/s
11:44:47 0.00 32.32907.07   277.78
11:44:4827.59 22.99 44.83   150.57
11:44:49 0.00 33.68 22.11   218.95
[...]
21:46:54 0.00 31.68 16.8390.10
21:46:55 0.00108.00 17.0089.00
21:46:56 9.76482.93   3890.24   439.02
21:46:57 0.00760.00   8627.00  1133.00
21:46:58 0.00 84.85   2612.12   138.38
21:46:59 0.00 16.00 17.0090.00

So it looks like there was some swapping in when the hrtimer (spuriously)
hanged.


One possible guess.  Since the guest hanged for some time, the
higher values there might be a result of accumulated values for
several seconds.


It's not swapping.  Swapping is in a separate table, with columns titled
pswpin/s and pswpout/s -- first table.

On my home machine with no swap at all, 4gig memory and 2gig free,
pgpgin and pgpgout are increasing too.

Also, while in the second case (21:46:56) there's actually some
noticeable activity (page faults at least), in first case that
activity is modest.

Note there's no documentation for /proc/vmstat file :)

/mjt

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-10 Thread Frederic Weisbecker
On Sat, Oct 10, 2009 at 01:18:16PM +0400, Michael Tokarev wrote:
 Michael Tokarev wrote:
 Frederic Weisbecker wrote:
 []
 Was there swapping going on?
 Not as far as I can see, and sar output agrees.

 But I can read this from you guest traces:

 I missed this one yesterday.  Note it's GUEST traces
 indeed.  Higher (read: non-zero) pgp{in,out} and faults
 values happens in *guest*, not on host (original question
 was if we've swapping in HOST, which'd explain the timer
 issues)



Yeah indeed. But still, that's a strange happenstance.



 [cutting extra all-zero columns]
 pgpgin/s pgpgout/s   fault/s  pgfree/s
 11:44:47 0.00 32.32907.07   277.78
 11:44:4827.59 22.99 44.83   150.57
 11:44:49 0.00 33.68 22.11   218.95
 [...]
 21:46:54 0.00 31.68 16.8390.10
 21:46:55 0.00108.00 17.0089.00
 21:46:56 9.76482.93   3890.24   439.02
 21:46:57 0.00760.00   8627.00  1133.00
 21:46:58 0.00 84.85   2612.12   138.38
 21:46:59 0.00 16.00 17.0090.00

 So it looks like there was some swapping in when the hrtimer (spuriously)
 hanged.

 One possible guess.  Since the guest hanged for some time, the
 higher values there might be a result of accumulated values for
 several seconds.



May be yeah. I don't know enough about virtual internals so...

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-10 Thread Marcelo Tosatti
On Sat, Oct 10, 2009 at 02:34:22AM +0400, Michael Tokarev wrote:
 Frederic Weisbecker wrote:
 On Sat, Oct 10, 2009 at 01:22:16AM +0400, Michael Tokarev wrote:
 Marcelo Tosatti wrote:
 [snip]

 Would be useful to collect sar (sar -B -b -u) output every one second
 in both host/guest. You already mentioned load was low, but this should
 give more details.
 Here we go:  http://www.corpit.ru/mjt/hrtimer-interrupt-too-slow/

 Two incindents - cases when hrtimer: interrupt is too slow were
 reported in the guest (with Marcelo's patch so that min_delta is
 increased to 50% only), happened at 11:44:48 and 21:46:56 (as
 shown in guest-dmesg file).  For both, there's `sar -BWbd' output
 for a 2-minute interval (starting one minute before the delay
 and ending one minute after) from both host and guest.

 Was there swapping going on?
 Not as far as I can see, and sar output agrees.

 But I can read this from you guest traces:

 pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s 
 pgscand/s pgsteal/s%vmeff

 11:44:45 0.00 32.32174.75  0.00176.77  0.00  
 0.00  0.00  0.00
 11:44:46 0.00 16.00789.00  0.00323.00  0.00  
 0.00  0.00  0.00
 11:44:47 0.00 32.32907.07  0.00277.78  0.00  
 0.00  0.00  0.00
 11:44:4827.59 22.99 44.83  0.00150.57  0.00  
 0.00  0.00  0.00
 11:44:49 0.00 33.68 22.11  0.00218.95  0.00  
 0.00  0.00  0.00
 11:44:50 0.00101.01 17.17  0.00151.52  0.00  
 0.00  0.00  0.00
 11:44:51 0.00 15.69 16.67  0.00126.47  0.00  
 0.00  0.00  0.00
 [...]
 21:46:52 0.00 40.00 17.00  0.00 82.00  0.00  
 0.00  0.00  0.00
 21:46:53 0.00 31.68 18.81  0.00 94.06  0.00  
 0.00  0.00  0.00
 21:46:54 0.00 31.68 16.83  0.00 90.10  0.00  
 0.00  0.00  0.00
 21:46:55 0.00108.00 17.00  0.00 89.00  0.00  
 0.00  0.00  0.00
 21:46:56 9.76482.93   3890.24  0.00439.02  0.00  
 0.00  0.00  0.00
 21:46:57 0.00760.00   8627.00  0.00   1133.00  0.00  
 0.00  0.00  0.00
 21:46:58 0.00 84.85   2612.12  0.00138.38  0.00  
 0.00  0.00  0.00
 21:46:59 0.00 16.00 17.00  0.00 90.00  0.00  
 0.00  0.00  0.00
 21:47:00 0.00 36.36 17.17  0.00 90.91  0.00  
 0.00  0.00  0.00


 So it looks like there was some swapping in when the hrtimer (spuriously)
 hanged.

 It's not swapping.  Swapping is in a separate table, with columns titled
 pswpin/s and pswpout/s -- first table.

 On my home machine with no swap at all, 4gig memory and 2gig free,
 pgpgin and pgpgout are increasing too.

 Also, while in the second case (21:46:56) there's actually some
 noticeable activity (page faults at least), in first case that
 activity is modest.

Yes. It still could be scheduling though (but you mention the host is
idle, you forgot cpu stats in sar).

It might be a QEMU bug, can you share your qemu-kvm command line?
There's always (guest) disk i/o when the hang happens.

Also please give the latest hrtimer_interrupt patch a try.

Thanks!

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-09 Thread Michael Tokarev

Marcelo Tosatti wrote:
[snip]


Would be useful to collect sar (sar -B -b -u) output every one second
in both host/guest. You already mentioned load was low, but this should
give more details.


Here we go:  http://www.corpit.ru/mjt/hrtimer-interrupt-too-slow/

Two incindents - cases when hrtimer: interrupt is too slow were
reported in the guest (with Marcelo's patch so that min_delta is
increased to 50% only), happened at 11:44:48 and 21:46:56 (as
shown in guest-dmesg file).  For both, there's `sar -BWbd' output
for a 2-minute interval (starting one minute before the delay
and ending one minute after) from both host and guest.


Was there swapping going on?


Not as far as I can see, and sar output agrees.

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-09 Thread Frederic Weisbecker
On Sat, Oct 10, 2009 at 01:22:16AM +0400, Michael Tokarev wrote:
 Marcelo Tosatti wrote:
 [snip]

 Would be useful to collect sar (sar -B -b -u) output every one second
 in both host/guest. You already mentioned load was low, but this should
 give more details.

 Here we go:  http://www.corpit.ru/mjt/hrtimer-interrupt-too-slow/

 Two incindents - cases when hrtimer: interrupt is too slow were
 reported in the guest (with Marcelo's patch so that min_delta is
 increased to 50% only), happened at 11:44:48 and 21:46:56 (as
 shown in guest-dmesg file).  For both, there's `sar -BWbd' output
 for a 2-minute interval (starting one minute before the delay
 and ending one minute after) from both host and guest.

 Was there swapping going on?

 Not as far as I can see, and sar output agrees.


But I can read this from you guest traces:

pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s 
pgscand/s pgsteal/s%vmeff

11:44:45 0.00 32.32174.75  0.00176.77  0.00  
0.00  0.00  0.00
11:44:46 0.00 16.00789.00  0.00323.00  0.00  
0.00  0.00  0.00
11:44:47 0.00 32.32907.07  0.00277.78  0.00  
0.00  0.00  0.00
11:44:4827.59 22.99 44.83  0.00150.57  0.00  
0.00  0.00  0.00
11:44:49 0.00 33.68 22.11  0.00218.95  0.00  
0.00  0.00  0.00
11:44:50 0.00101.01 17.17  0.00151.52  0.00  
0.00  0.00  0.00
11:44:51 0.00 15.69 16.67  0.00126.47  0.00  
0.00  0.00  0.00
[...]
21:46:52 0.00 40.00 17.00  0.00 82.00  0.00  
0.00  0.00  0.00
21:46:53 0.00 31.68 18.81  0.00 94.06  0.00  
0.00  0.00  0.00
21:46:54 0.00 31.68 16.83  0.00 90.10  0.00  
0.00  0.00  0.00
21:46:55 0.00108.00 17.00  0.00 89.00  0.00  
0.00  0.00  0.00
21:46:56 9.76482.93   3890.24  0.00439.02  0.00  
0.00  0.00  0.00
21:46:57 0.00760.00   8627.00  0.00   1133.00  0.00  
0.00  0.00  0.00
21:46:58 0.00 84.85   2612.12  0.00138.38  0.00  
0.00  0.00  0.00
21:46:59 0.00 16.00 17.00  0.00 90.00  0.00  
0.00  0.00  0.00
21:47:00 0.00 36.36 17.17  0.00 90.91  0.00  
0.00  0.00  0.00



So it looks like there was some swapping in when the hrtimer (spuriously)
hanged.

Not sure how to correlate these results with the incident though...



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-09 Thread Michael Tokarev

Frederic Weisbecker wrote:

On Sat, Oct 10, 2009 at 01:22:16AM +0400, Michael Tokarev wrote:

Marcelo Tosatti wrote:
[snip]


Would be useful to collect sar (sar -B -b -u) output every one second
in both host/guest. You already mentioned load was low, but this should
give more details.

Here we go:  http://www.corpit.ru/mjt/hrtimer-interrupt-too-slow/

Two incindents - cases when hrtimer: interrupt is too slow were
reported in the guest (with Marcelo's patch so that min_delta is
increased to 50% only), happened at 11:44:48 and 21:46:56 (as
shown in guest-dmesg file).  For both, there's `sar -BWbd' output
for a 2-minute interval (starting one minute before the delay
and ending one minute after) from both host and guest.


Was there swapping going on?

Not as far as I can see, and sar output agrees.


But I can read this from you guest traces:

pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s 
pgscand/s pgsteal/s%vmeff

11:44:45 0.00 32.32174.75  0.00176.77  0.00  
0.00  0.00  0.00
11:44:46 0.00 16.00789.00  0.00323.00  0.00  
0.00  0.00  0.00
11:44:47 0.00 32.32907.07  0.00277.78  0.00  
0.00  0.00  0.00
11:44:4827.59 22.99 44.83  0.00150.57  0.00  
0.00  0.00  0.00
11:44:49 0.00 33.68 22.11  0.00218.95  0.00  
0.00  0.00  0.00
11:44:50 0.00101.01 17.17  0.00151.52  0.00  
0.00  0.00  0.00
11:44:51 0.00 15.69 16.67  0.00126.47  0.00  
0.00  0.00  0.00
[...]
21:46:52 0.00 40.00 17.00  0.00 82.00  0.00  
0.00  0.00  0.00
21:46:53 0.00 31.68 18.81  0.00 94.06  0.00  
0.00  0.00  0.00
21:46:54 0.00 31.68 16.83  0.00 90.10  0.00  
0.00  0.00  0.00
21:46:55 0.00108.00 17.00  0.00 89.00  0.00  
0.00  0.00  0.00
21:46:56 9.76482.93   3890.24  0.00439.02  0.00  
0.00  0.00  0.00
21:46:57 0.00760.00   8627.00  0.00   1133.00  0.00  
0.00  0.00  0.00
21:46:58 0.00 84.85   2612.12  0.00138.38  0.00  
0.00  0.00  0.00
21:46:59 0.00 16.00 17.00  0.00 90.00  0.00  
0.00  0.00  0.00
21:47:00 0.00 36.36 17.17  0.00 90.91  0.00  
0.00  0.00  0.00


So it looks like there was some swapping in when the hrtimer (spuriously)
hanged.


It's not swapping.  Swapping is in a separate table, with columns titled
pswpin/s and pswpout/s -- first table.

On my home machine with no swap at all, 4gig memory and 2gig free,
pgpgin and pgpgout are increasing too.

Also, while in the second case (21:46:56) there's actually some
noticeable activity (page faults at least), in first case that
activity is modest.

Note there's no documentation for /proc/vmstat file :)

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-08 Thread Michael Tokarev

[skip everything]

Yesterday I was lucky enough to actually watch what's
going on when the delay actually happens.

I run desktop environment on a kvm virtual machine here.
The server is on diskless terminal, and the rest, incl.
the window manager etc, is started from a VM.

And yesterday, during normal system load (nothing extra,
and not idle either, and all the other guests were running
under normal load too), I had a stall of everyhing on this
X session for about 2..3, maybe 5 secounds.

It felt like completely stuck machine. Nothing were moving
on the screen, no reaction to the keyboard etc.

And after several seconds it returned to normal.  With
the familiar message in dmesg -- increasing hrtimer etc,
to the next 50%.  (Without a patch from Marcelo at this
time it shuold increase min_delta to a large number).

To summarize: there's something, well, more interesting
going on here.  In addition to the scheduling issues that
causes timers to be calculated on the wrong CPU etc as
mentioned before, we've some real stalls somewhere, and
that only happens on SMP guests (so far anyway).

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-08 Thread Thomas Gleixner
On Wed, 7 Oct 2009, Marcelo Tosatti wrote:
 On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote:
 What about getting rid of the retry loop, instead? So something
 like:
 
 - run hrtimer callbacks (once)
 - while (tick_program_event(expires))
   expires = ktime_add_ns(expires, dev-min_delta_ns)
 
 This way there's no static tuning involved.

And what does that buy us ? We get an timer interrupt right away, so
it's not that much different from the retry loop. See below.

 Its not clear to me why the loop is there in the first place.

We get a timer interrupt and handle the expired timers and find out
the timer which is going to expire next to reprogram the hardware. Now
when we program that expiry time we find out that the timer is already
expired. So instead of programming the hardware to fire an interrupt
in the very near future which you would do with your loop above we
stay in the interrupt handler and expire the timer and any other by
now expired timers right away.

The hang check is just there to avoid starving (slow) machines. We do
this by spreading the timer interrupts out so that the system can do
something else than expiring timers.

Thanks,

tglx
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-08 Thread Thomas Gleixner
On Thu, 8 Oct 2009, Michael Tokarev wrote:
 Yesterday I was lucky enough to actually watch what's
 going on when the delay actually happens.
 
 I run desktop environment on a kvm virtual machine here.
 The server is on diskless terminal, and the rest, incl.
 the window manager etc, is started from a VM.
 
 And yesterday, during normal system load (nothing extra,
 and not idle either, and all the other guests were running
 under normal load too), I had a stall of everyhing on this
 X session for about 2..3, maybe 5 secounds.
 
 It felt like completely stuck machine. Nothing were moving
 on the screen, no reaction to the keyboard etc.
 
 And after several seconds it returned to normal.  With
 the familiar message in dmesg -- increasing hrtimer etc,
 to the next 50%.  (Without a patch from Marcelo at this
 time it shuold increase min_delta to a large number).
 
 To summarize: there's something, well, more interesting
 going on here.  In addition to the scheduling issues that
 causes timers to be calculated on the wrong CPU etc as

Care to elaborate ?

Thanks,

tglx
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-08 Thread Michael Tokarev

[]

hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns

[]

All that does not make sense anymore in a guest. The hang detection
and warnings, the recalibrations of the min_clock_deltas are completely
wrong in this context.
Not only does it spuriously warn, but the minimum timer is increasing
slowly and the guest progressively suffers from higher and higher
latencies.


Well, it's not slowly, -- that huge jump shown above is typical.
If my calculations are correct, that's about 0.5 sec min_delta.


That's really bad.


*nod* :)


Your patch lowers the immediate impact and makes this illness evolving
smoother by scaling down the recalibration to the min_clock_delta.
This appeases the bug but doesn't solve it. I fear it could be even
worse because it makes it more discreet.


Well, long-term it's not worse still.  New code has a chance to hitting
the same values for min_delta in a long run, but this chance is so small
and the time spent is so long that it can be forgotten about completely.


May be can we instead increase the minimum threshold of loop in the
hrtimer interrupt before considering it as a hang? Hmm, but a too high
number could make this check useless, depending of the number of pending
timers, which is a finite number.

Well, actually I'm not confident anymore in this check. Or actually we
should change it. May be we can rebase it on the time spent on the hrtimer
interrupt (and check it every 10 loops of reprocessing in hrtimer_interrupts).

Would a mimimum threshold of 5 seconds spent in hrtimer_interrupt() be
a reasonable check to perform?
We should probably base our check on such kind of high boundary.
What we want is an ultimate rescue against hard hangs anyway, not
something that can solve the hang source itself. After the min_clock_delta
recalibration, the system will be unstable (eg: high latencies).
So if this must behave as a hammer, let's ensure we really need this hammer,
even if we need to wait for few seconds before it triggers.


By the way, all other cases I've seen this message (hrtimer: interrupt too 
slow..)
triggering, the problems were elsewhere and re-calibrating timer was not a good
idea anyway, because the problem was elsewhere and changing timer didn't solve
it.

Back into the vm issue at hand.  I (almost) understand what's happening in
the discussion above, but I does not see how it is possible to have such a
*huge* delays explained by scheduling on a different CPU etc.  The delays
are measured in *seconds*, not nano- or micro-secs etc.

I can imagine, say, swapping on host that causes the whole guest to be
swapped out for a while during the timer interrupt handling for example.
But it is NOT what's happening here, at least not that I can see it.
Yes host had some swapping:

 pswpin 17535
 pswpout 41602

but it's not massive and I know when exactly it happened - when I was testing
something else.  Right now free(1) reports:

 total   used   free sharedbuffers cached
Mem:   81552808105704  49576  01209136  27440
-/+ buffers/cache:68691281286152
Swap:  8388856 1241128264744

(and f*ng vmstat that, again, does not show swapping activity at all)

So, I think, the problem is somewhere elsewhere.

By the way, I *think* it only happens with kvm_clock, and does not
happen with acpi_pm clocksource.  Is it worth to check?

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-08 Thread Michael Tokarev

Thomas Gleixner wrote:

On Thu, 8 Oct 2009, Michael Tokarev wrote:

Yesterday I was lucky enough to actually watch what's
going on when the delay actually happens.

I run desktop environment on a kvm virtual machine here.
The server is on diskless terminal, and the rest, incl.
the window manager etc, is started from a VM.

And yesterday, during normal system load (nothing extra,
and not idle either, and all the other guests were running
under normal load too), I had a stall of everyhing on this
X session for about 2..3, maybe 5 secounds.

It felt like completely stuck machine. Nothing were moving
on the screen, no reaction to the keyboard etc.

And after several seconds it returned to normal.  With
the familiar message in dmesg -- increasing hrtimer etc,
to the next 50%.  (Without a patch from Marcelo at this
time it shuold increase min_delta to a large number).

To summarize: there's something, well, more interesting
going on here.  In addition to the scheduling issues that
causes timers to be calculated on the wrong CPU etc as


Care to elaborate ?


Such huge delays (in terms of seconds, not ms or ns) - I don't
understand how such delays can be explained by sheduling to the
different cpu etc.  That's what I mean.  I know very little about
all this low-level stuff so I may be completely out of context,
but such explanation does not look right to me, simple as that.
By scheduling mistakes we can get mistakes in range of millisecs,
but not secs.

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-08 Thread Thomas Gleixner
On Thu, 8 Oct 2009, Michael Tokarev wrote:

 Thomas Gleixner wrote:
  On Thu, 8 Oct 2009, Michael Tokarev wrote:
   Yesterday I was lucky enough to actually watch what's
   going on when the delay actually happens.
   
   I run desktop environment on a kvm virtual machine here.
   The server is on diskless terminal, and the rest, incl.
   the window manager etc, is started from a VM.
   
   And yesterday, during normal system load (nothing extra,
   and not idle either, and all the other guests were running
   under normal load too), I had a stall of everyhing on this
   X session for about 2..3, maybe 5 secounds.
   
   It felt like completely stuck machine. Nothing were moving
   on the screen, no reaction to the keyboard etc.
   
   And after several seconds it returned to normal.  With
   the familiar message in dmesg -- increasing hrtimer etc,
   to the next 50%.  (Without a patch from Marcelo at this
   time it shuold increase min_delta to a large number).
   
   To summarize: there's something, well, more interesting
   going on here.  In addition to the scheduling issues that
   causes timers to be calculated on the wrong CPU etc as
  
  Care to elaborate ?
 
 Such huge delays (in terms of seconds, not ms or ns) - I don't
 understand how such delays can be explained by sheduling to the
 different cpu etc.  That's what I mean.  I know very little about
 all this low-level stuff so I may be completely out of context,
 but such explanation does not look right to me, simple as that.
 By scheduling mistakes we can get mistakes in range of millisecs,
 but not secs.

I'm really missing the big picture here. 

What means causes timers to be calculated on the wrong CPU etc ?
And what do you consider a scheduling mistake ?

Thanks,

tglx
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-08 Thread Thomas Gleixner
On Thu, 8 Oct 2009, Michael Tokarev wrote:
 Thomas Gleixner wrote:
  
  I'm really missing the big picture here. 
  What means causes timers to be calculated on the wrong CPU etc ?
  And what do you consider a scheduling mistake ?
 
 From the initial diagnostics by Marcelo:
 
  It seems the way hrtimer_interrupt_hanging calculates min_delta is
  wrong (especially to virtual machines). The guest vcpu can be scheduled
  out during the execution of the hrtimer callbacks (and the callbacks
  themselves can do operations that translate to blocking operations in
  the hypervisor).
 
  So high min_delta values can be calculated if, for example, a single
  hrtimer_interrupt run takes two host time slices to execute, while some
  other higher priority task runs for N slices in between.
 
 From this I conclude that the huge min_delta is due to some other task(s)
 on the host being run while this guest is in hrtimer callback.  But I
 fail to see why that process on the host takes SO MUCH time, to warrant
 resulting min_delta to 0.5s, or to cause delays for 3..5 seconds in
 guest.  It's ok to have delays in range of several extra milliseconds,
 but for *seconds* is too much.
 
 Note again that neither host nor guest are not under high load when
 this jump happens.  Also note that there's no high-priority processes
 running on the host, all are of the same priority level, including
 all the guests.
 
 Note also that so far I only see it on SMP guests, never on UP
 guests.  And only on guests with kvm_clock, not with acpi_pm
 clocksource.
 
 What I'm trying to say is that it looks like there's something
 else wrong here in the guest code.  Huge stalls, huge delays
 while in hrtimer callback (i think it jappens always when such
 delay is happening, it's just noticed by hrtimer code) -- that's
 the root cause of all this, (probably) wrong logic in hrtimer
 calibration just shows the results of something that's wrong
 elsewhere.

Ah, ok. That makes sense. The hrtimer interrupt hang check detects
that the CPU was stolen for whatever reasons. I'm wondering why this
happens several times in a row - it takes at least 4 iterations until
it decides to make the interval larger.

Thanks,

tglx
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-08 Thread Marcelo Tosatti
On Thu, Oct 08, 2009 at 10:05:01AM +0200, Thomas Gleixner wrote:
 On Wed, 7 Oct 2009, Marcelo Tosatti wrote:
  On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote:
  What about getting rid of the retry loop, instead? So something
  like:
  
  - run hrtimer callbacks (once)
  - while (tick_program_event(expires))
  expires = ktime_add_ns(expires, dev-min_delta_ns)
  
  This way there's no static tuning involved.
 
 And what does that buy us ? We get an timer interrupt right away, so
 it's not that much different from the retry loop. See below.
 
  Its not clear to me why the loop is there in the first place.
 
 We get a timer interrupt and handle the expired timers and find out
 the timer which is going to expire next to reprogram the hardware. Now
 when we program that expiry time we find out that the timer is already
 expired. So instead of programming the hardware to fire an interrupt
 in the very near future which you would do with your loop above we
 stay in the interrupt handler and expire the timer and any other by
 now expired timers right away.
 
 The hang check is just there to avoid starving (slow) machines. We do
 this by spreading the timer interrupts out so that the system can do
 something else than expiring timers.

OK, makes sense.

So why not program only the next tick using the heuristic, without 
touching min_delta_ns?



diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index c03f221..4fcb670 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1178,29 +1178,16 @@ static void __run_hrtimer(struct hrtimer *timer)
 
 #ifdef CONFIG_HIGH_RES_TIMERS
 
-static int force_clock_reprogram;
-
 /*
  * After 5 iteration's attempts, we consider that hrtimer_interrupt()
  * is hanging, which could happen with something that slows the interrupt
- * such as the tracing. Then we force the clock reprogramming for each future
- * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
- * threshold that we will overwrite.
+ * such as the tracing. 
  * The next tick event will be scheduled to 3 times we currently spend on
  * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
  * 1/4 of their time to process the hrtimer interrupts. This is enough to
  * let it running without serious starvation.
  */
 
-static inline void
-hrtimer_interrupt_hanging(struct clock_event_device *dev,
-   ktime_t try_time)
-{
-   force_clock_reprogram = 1;
-   dev-min_delta_ns = (unsigned long)try_time.tv64 * 3;
-   printk(KERN_WARNING hrtimer: interrupt too slow, 
-   forcing clock min delta to %lu ns\n, dev-min_delta_ns);
-}
 /*
  * High resolution timer interrupt
  * Called with interrupts disabled
@@ -1219,8 +1206,16 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 
  retry:
/* 5 retries is enough to notice a hang */
-   if (!(++nr_retries % 5))
-   hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
+   if (!(++nr_retries % 5)) {
+   ktime_t try_time = ktime_sub(ktime_get(), now);
+
+   do {
+   for (i = 0; i  3; i++)
+   expires_next = ktime_add(expires_next,try_time);
+   } while (tick_program_event(expires_next, 0));
+
+   return;
+   }
 
now = ktime_get();
 
@@ -1286,7 +1281,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 
/* Reprogramming necessary ? */
if (expires_next.tv64 != KTIME_MAX) {
-   if (tick_program_event(expires_next, force_clock_reprogram))
+   if (tick_program_event(expires_next, 0))
goto retry;
}
 }

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-08 Thread Marcelo Tosatti
On Thu, Oct 08, 2009 at 06:06:39PM +0400, Michael Tokarev wrote:
 Thomas Gleixner wrote:
 On Thu, 8 Oct 2009, Michael Tokarev wrote:

 Thomas Gleixner wrote:
 On Thu, 8 Oct 2009, Michael Tokarev wrote:
 Yesterday I was lucky enough to actually watch what's
 going on when the delay actually happens.

 I run desktop environment on a kvm virtual machine here.
 The server is on diskless terminal, and the rest, incl.
 the window manager etc, is started from a VM.

 And yesterday, during normal system load (nothing extra,
 and not idle either, and all the other guests were running
 under normal load too), I had a stall of everyhing on this
 X session for about 2..3, maybe 5 secounds.

 It felt like completely stuck machine. Nothing were moving
 on the screen, no reaction to the keyboard etc.

 And after several seconds it returned to normal.  With
 the familiar message in dmesg -- increasing hrtimer etc,
 to the next 50%.  (Without a patch from Marcelo at this
 time it shuold increase min_delta to a large number).

 To summarize: there's something, well, more interesting
 going on here.  In addition to the scheduling issues that
 causes timers to be calculated on the wrong CPU etc as
 Care to elaborate ?
 Such huge delays (in terms of seconds, not ms or ns) - I don't
 understand how such delays can be explained by sheduling to the
 different cpu etc.  That's what I mean.  I know very little about
 all this low-level stuff so I may be completely out of context,
 but such explanation does not look right to me, simple as that.
 By scheduling mistakes we can get mistakes in range of millisecs,
 but not secs.

 I'm really missing the big picture here. 

 What means causes timers to be calculated on the wrong CPU etc ?
 And what do you consider a scheduling mistake ?

 From the initial diagnostics by Marcelo:

  It seems the way hrtimer_interrupt_hanging calculates min_delta is
  wrong (especially to virtual machines). The guest vcpu can be scheduled
  out during the execution of the hrtimer callbacks (and the callbacks
  themselves can do operations that translate to blocking operations in
  the hypervisor).
 
  So high min_delta values can be calculated if, for example, a single
  hrtimer_interrupt run takes two host time slices to execute, while some
  other higher priority task runs for N slices in between.

 From this I conclude that the huge min_delta is due to some other task(s)
 on the host being run while this guest is in hrtimer callback.  But I
 fail to see why that process on the host takes SO MUCH time, to warrant
 resulting min_delta to 0.5s, or to cause delays for 3..5 seconds in
 guest.  It's ok to have delays in range of several extra milliseconds,
 but for *seconds* is too much.

 Note again that neither host nor guest are not under high load when
 this jump happens.  Also note that there's no high-priority processes
 running on the host, all are of the same priority level, including
 all the guests.

 Note also that so far I only see it on SMP guests, never on UP
 guests.  And only on guests with kvm_clock, not with acpi_pm
 clocksource.

 What I'm trying to say is that it looks like there's something
 else wrong here in the guest code.  Huge stalls, huge delays
 while in hrtimer callback (i think it jappens always when such
 delay is happening, it's just noticed by hrtimer code) -- that's
 the root cause of all this, (probably) wrong logic in hrtimer
 calibration just shows the results of something that's wrong
 elsewhere.

True.

Would be useful to collect sar (sar -B -b -u) output every one second
in both host/guest. You already mentioned load was low, but this should
give more details.

Was there swapping going on?


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-08 Thread Thomas Gleixner
On Thu, 8 Oct 2009, Marcelo Tosatti wrote:
 On Thu, Oct 08, 2009 at 10:05:01AM +0200, Thomas Gleixner wrote:
  On Wed, 7 Oct 2009, Marcelo Tosatti wrote:
   On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote:
   What about getting rid of the retry loop, instead? So something
   like:
   
   - run hrtimer callbacks (once)
   - while (tick_program_event(expires))
 expires = ktime_add_ns(expires, dev-min_delta_ns)
   
   This way there's no static tuning involved.
  
  And what does that buy us ? We get an timer interrupt right away, so
  it's not that much different from the retry loop. See below.
  
   Its not clear to me why the loop is there in the first place.
  
  We get a timer interrupt and handle the expired timers and find out
  the timer which is going to expire next to reprogram the hardware. Now
  when we program that expiry time we find out that the timer is already
  expired. So instead of programming the hardware to fire an interrupt
  in the very near future which you would do with your loop above we
  stay in the interrupt handler and expire the timer and any other by
  now expired timers right away.
  
  The hang check is just there to avoid starving (slow) machines. We do
  this by spreading the timer interrupts out so that the system can do
  something else than expiring timers.
 
 OK, makes sense.
 
 So why not program only the next tick using the heuristic, without 
 touching min_delta_ns?

That makes a certain amount of sense albeit I really hate that
heuristics crap especially when we know that we run as a guest. 

We better add a function pointer to the clock event device struct
which defaults to force it to be slow for real hardware and can be
overridden by paravirt guests.

Also it's not clear to me why the problem does only happen with
kvm_clock and not with acpi_pm timer emulation (according to the
reporter) and is restricted to SMP guests.

   retry:
   /* 5 retries is enough to notice a hang */
 - if (!(++nr_retries % 5))
 - hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
 + if (!(++nr_retries % 5)) {
 + ktime_t try_time = ktime_sub(ktime_get(), now);
 +
 + do {
 + for (i = 0; i  3; i++)
 + expires_next = ktime_add(expires_next,try_time);
 + } while (tick_program_event(expires_next, 0));

  This needs at least a WARN_ON_ONCE() or some other way (sysfs, proc,
  ...) where we can find out how often this happens.

Thanks,

tglx
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-08 Thread Michael Tokarev

Thomas Gleixner wrote:
[]

Also it's not clear to me why the problem does only happen with
kvm_clock and not with acpi_pm timer emulation (according to the
reporter) and is restricted to SMP guests.


I just reproduced it with acpi_pm.  I explained it already to Marcelo,
the problem is that the issue is difficult to trigger.  I still don't
have any pointers as of how to trigger it, all my attempts so far, to
create network, disk or cpu load, failed.  So the only way is to run
the guest and wait, in a hope it'll be there.  And I restarted the
guinea pig guest today (which happens to be our main office server :),
and voila, after ~4 hours uptime said the same thing about hrtimer.
That's lucky time, since it may run stable for several days...
It just happens (and I mentioned it each time) that I didn't *see* the
issue with acpi_pm.  Now I see it with acpi_pm too.

Speaking of smp -- well, that one is of the same category.  Maybe smp
just makes the issue easier to trigger but it exists with UP guests too,
maybe it's SMP-specific - I don't know.  What I know for sure is that
out of 4 guests here (running on the same host), 2 are SMP and 2 UP,
loaded approximately equally (according to the number of CPUs), and
two SMP guests shows the issue quite easily, while for 2 UP guests
I don't see anything in the logs for last 2 months.

The issue isn't unique to my machines, other people reported it too
in #kvm, including at least one active participant there.  For him,
issues stopped when he switched from SMP to UP guest.  Yet there's
no definite knowlege if the issue is really SMP-specific or not.


  retry:
/* 5 retries is enough to notice a hang */
-   if (!(++nr_retries % 5))
-   hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
+   if (!(++nr_retries % 5)) {
+   ktime_t try_time = ktime_sub(ktime_get(), now);
+
+   do {
+   for (i = 0; i  3; i++)
+   expires_next = ktime_add(expires_next,try_time);
+   } while (tick_program_event(expires_next, 0));


  This needs at least a WARN_ON_ONCE() or some other way (sysfs, proc,
  ...) where we can find out how often this happens.


Definitely.  Or printk_ratelimit.

Before Marcelo come with his first version I was thinking about exposing
that min_delta over procfs to be able to reset it back to a reasonable
value.. ;)

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-07 Thread Frederic Weisbecker
(Adding Thomas in Cc)


On Sat, Oct 03, 2009 at 08:12:05PM -0300, Marcelo Tosatti wrote:
 Michael,
 
 Can you please give the patch below a try please? (without acpi_pm timer 
 or priority adjustments for the guest).
 
 On Tue, Sep 29, 2009 at 05:12:17PM +0400, Michael Tokarev wrote:
  Hello.
 
  I'm having quite an.. unusable system here.
  It's not really a regresssion with 0.11.0,
  it was something similar before, but with
  0.11.0 and/or 2.6.31 it become much worse.
 
  The thing is that after some uptime, kvm
  guest prints something like this:
 
  hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns
 
  after which system (guest) speeed becomes
  very slow.  The above message is from
  2.6.31 guest running wiht 0.11.0  2.6.31
  host.  Before I tried it with 0.10.6 and
  2.6.30 or 2.6.27, and the delta were a
  bit less than that:
 
  hrtimer: interrupt too slow, forcing clock min delta to 15415 ns
  hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns
 
 It seems the way hrtimer_interrupt_hanging calculates min_delta is
 wrong (especially to virtual machines). The guest vcpu can be scheduled
 out during the execution of the hrtimer callbacks (and the callbacks
 themselves can do operations that translate to blocking operations in
 the hypervisor).
 
 So high min_delta values can be calculated if, for example, a single
 hrtimer_interrupt run takes two host time slices to execute, while some
 other higher priority task runs for N slices in between.
 
 Using the hrtimer_interrupt execution time (which can be the worse
 case at any given time), as the min_delta is problematic.
 
 So simply increase min_delta_ns by 50% once every detected failure,
 which will eventually lead to an acceptable threshold (the algorithm
 should scale back to down lower min_delta, to adjust back to wealthier
 times, too).
 
 diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
 index 49da79a..8997978 100644
 --- a/kernel/hrtimer.c
 +++ b/kernel/hrtimer.c
 @@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
  
  #ifdef CONFIG_HIGH_RES_TIMERS
  
 -static int force_clock_reprogram;
 -
  /*
   * After 5 iteration's attempts, we consider that hrtimer_interrupt()
   * is hanging, which could happen with something that slows the interrupt
 - * such as the tracing. Then we force the clock reprogramming for each future
 - * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
 - * threshold that we will overwrite.
 - * The next tick event will be scheduled to 3 times we currently spend on
 - * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
 - * 1/4 of their time to process the hrtimer interrupts. This is enough to
 - * let it running without serious starvation.
 + * such as the tracing, so we increase min_delta_ns.
   */
  
  static inline void
 -hrtimer_interrupt_hanging(struct clock_event_device *dev,
 - ktime_t try_time)
 +hrtimer_interrupt_hanging(struct clock_event_device *dev)
  {
 - force_clock_reprogram = 1;
 - dev-min_delta_ns = (unsigned long)try_time.tv64 * 3;
 - printk(KERN_WARNING hrtimer: interrupt too slow, 
 - forcing clock min delta to %lu ns\n, dev-min_delta_ns);
 + dev-min_delta_ns += dev-min_delta_ns  1;


I haven't thought about the guest that could be scheduled out in
the middle of the timers servicing, making wrong this check based
of the time spent in hrtimer_interrupt().

I guess there is no easy/generic/cheap way to rebase this check
on the _virtual_ time spent in the timers servicing. By virtual,
I mean the time spent in the guest only.

In a non-guest kernel, the old check forces an adaptive rate
sharing:

- we spent n nanosecs to service the batch of timers.
- we are hanging
- we want at least 3/4 of time reserved for non-timer
  servicing in the kernel, this is a minimum prerequisite
  for the system to not starve
- adapt the min_clock_delta against to fit the above constraint

All that does not make sense anymore in a guest. The hang detection
and warnings, the recalibrations of the min_clock_deltas are completely
wrong in this context.
Not only does it spuriously warn, but the minimum timer is increasing
slowly and the guest progressively suffers from higher and higher
latencies.

That's really bad.

Your patch lowers the immediate impact and makes this illness evolving
smoother by scaling down the recalibration to the min_clock_delta.
This appeases the bug but doesn't solve it. I fear it could be even
worse because it makes it more discreet.


May be can we instead increase the minimum threshold of loop in the
hrtimer interrupt before considering it as a hang? Hmm, but a too high
number could make this check useless, depending of the number of pending
timers, which is a finite number.

Well, actually I'm not confident anymore in this check. Or actually we
should change it. May be we can rebase it on the time spent on the hrtimer
interrupt (and check it every 10 

Re: kvm guest: hrtimer: interrupt too slow

2009-10-07 Thread Marcelo Tosatti
On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote:
 (Adding Thomas in Cc)
 
 
 On Sat, Oct 03, 2009 at 08:12:05PM -0300, Marcelo Tosatti wrote:
  Michael,
  
  Can you please give the patch below a try please? (without acpi_pm timer 
  or priority adjustments for the guest).
  
  On Tue, Sep 29, 2009 at 05:12:17PM +0400, Michael Tokarev wrote:
   Hello.
  
   I'm having quite an.. unusable system here.
   It's not really a regresssion with 0.11.0,
   it was something similar before, but with
   0.11.0 and/or 2.6.31 it become much worse.
  
   The thing is that after some uptime, kvm
   guest prints something like this:
  
   hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns
  
   after which system (guest) speeed becomes
   very slow.  The above message is from
   2.6.31 guest running wiht 0.11.0  2.6.31
   host.  Before I tried it with 0.10.6 and
   2.6.30 or 2.6.27, and the delta were a
   bit less than that:
  
   hrtimer: interrupt too slow, forcing clock min delta to 15415 ns
   hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns
  
  It seems the way hrtimer_interrupt_hanging calculates min_delta is
  wrong (especially to virtual machines). The guest vcpu can be scheduled
  out during the execution of the hrtimer callbacks (and the callbacks
  themselves can do operations that translate to blocking operations in
  the hypervisor).
  
  So high min_delta values can be calculated if, for example, a single
  hrtimer_interrupt run takes two host time slices to execute, while some
  other higher priority task runs for N slices in between.
  
  Using the hrtimer_interrupt execution time (which can be the worse
  case at any given time), as the min_delta is problematic.
  
  So simply increase min_delta_ns by 50% once every detected failure,
  which will eventually lead to an acceptable threshold (the algorithm
  should scale back to down lower min_delta, to adjust back to wealthier
  times, too).
  
  diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
  index 49da79a..8997978 100644
  --- a/kernel/hrtimer.c
  +++ b/kernel/hrtimer.c
  @@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
   
   #ifdef CONFIG_HIGH_RES_TIMERS
   
  -static int force_clock_reprogram;
  -
   /*
* After 5 iteration's attempts, we consider that hrtimer_interrupt()
* is hanging, which could happen with something that slows the interrupt
  - * such as the tracing. Then we force the clock reprogramming for each 
  future
  - * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
  - * threshold that we will overwrite.
  - * The next tick event will be scheduled to 3 times we currently spend on
  - * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
  - * 1/4 of their time to process the hrtimer interrupts. This is enough to
  - * let it running without serious starvation.
  + * such as the tracing, so we increase min_delta_ns.
*/
   
   static inline void
  -hrtimer_interrupt_hanging(struct clock_event_device *dev,
  -   ktime_t try_time)
  +hrtimer_interrupt_hanging(struct clock_event_device *dev)
   {
  -   force_clock_reprogram = 1;
  -   dev-min_delta_ns = (unsigned long)try_time.tv64 * 3;
  -   printk(KERN_WARNING hrtimer: interrupt too slow, 
  -   forcing clock min delta to %lu ns\n, dev-min_delta_ns);
  +   dev-min_delta_ns += dev-min_delta_ns  1;
 
 
 I haven't thought about the guest that could be scheduled out in
 the middle of the timers servicing, making wrong this check based
 of the time spent in hrtimer_interrupt().
 
 I guess there is no easy/generic/cheap way to rebase this check
 on the _virtual_ time spent in the timers servicing. By virtual,
 I mean the time spent in the guest only.
 
 In a non-guest kernel, the old check forces an adaptive rate
 sharing:
 
 - we spent n nanosecs to service the batch of timers.
 - we are hanging
 - we want at least 3/4 of time reserved for non-timer
   servicing in the kernel, this is a minimum prerequisite
   for the system to not starve
 - adapt the min_clock_delta against to fit the above constraint
 
 All that does not make sense anymore in a guest. The hang detection
 and warnings, the recalibrations of the min_clock_deltas are completely
 wrong in this context.
 Not only does it spuriously warn, but the minimum timer is increasing
 slowly and the guest progressively suffers from higher and higher
 latencies.
 
 That's really bad.
 
 Your patch lowers the immediate impact and makes this illness evolving
 smoother by scaling down the recalibration to the min_clock_delta.
 This appeases the bug but doesn't solve it. I fear it could be even
 worse because it makes it more discreet.

True.

 May be can we instead increase the minimum threshold of loop in the
 hrtimer interrupt before considering it as a hang? Hmm, but a too high
 number could make this check useless, depending of the number of pending
 timers, which is a finite 

Re: kvm guest: hrtimer: interrupt too slow

2009-10-06 Thread Michael Tokarev

Michael Tokarev wrote:

Marcelo Tosatti wrote:
[]
You should see min_delta_ns increase to a much smaller value, 
hopefully in

the 2000-1 range.


That's what I see:



Oct  4 16:13:09 paltus kernel: [boot]
Oct  5 12:35:51 paltus kernel: hrtimer: interrupt too slow, forcing clock min 
delta to 1500 ns
Oct  5 12:47:29 paltus kernel: hrtimer: interrupt too slow, forcing clock min 
delta to 2250 ns
Oct  5 15:46:41 paltus kernel: hrtimer: interrupt too slow, forcing clock min 
delta to 3375 ns
Oct  5 23:13:08 paltus kernel: hrtimer: interrupt too slow, forcing clock min 
delta to 5062 ns
Oct  6 17:14:38 paltus kernel: hrtimer: interrupt too slow, forcing clock min 
delta to 7593 ns

So it's increasing, slowly, so far.  Hopefully it'll slow down
further.

I want to reboot the host today, so it'll start over.

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-05 Thread Michael Tokarev

Marcelo Tosatti wrote:
[]

You should see min_delta_ns increase to a much smaller value, hopefully in
the 2000-1 range.


That's what I see:

Oct  4 16:13:09 paltus kernel: [boot]
Oct  5 12:35:51 paltus kernel: hrtimer: interrupt too slow, forcing clock min 
delta to 1500 ns
Oct  5 12:47:29 paltus kernel: hrtimer: interrupt too slow, forcing clock min 
delta to 2250 ns

I tried to stress-load the machine yesterday (Oct-4), including massive
network usage (I think it's the network that triggers the issue most
reliable), but wasn't able to trigger it.  Today, at Monday, at normal
system load, it triggered by its own, increasing min_delta to 50%
as in the patch.  The system runs smoothly so far, but it's to be
expected (till it increases min delta to some 10-digit number :).

But I'm unsure still if that's the right approach...  The root
cause seems to be the problem, and this patch just tries to
work around it in a better-than-before way.  But it might be
just me... ;)

I'll continue watching it, since the machine(s) in question are
important for us.

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-05 Thread Avi Kivity

On 09/29/2009 03:58 PM, Michael Tokarev wrote:

Avi Kivity wrote:

On 09/29/2009 03:12 PM, Michael Tokarev wrote:

[]

The thing is that after some uptime, kvm
guest prints something like this:

hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns

[]

What happens if you use hpet or pmtimer as guest clocksource?


For all the guests I have handy, only 2 clocksources
are available: kvm-clock and acpi_pm.  The host itself
has hpet turned off because it itself had issues with
hpet, and after many tries we finally turned it off
(there were several long and painy threads on lkml
about this).

Dunno why pmtimer isn't available.


acpi_pm is pmtimer.



I tried switching to acpi_pm on a running guest (which
is still runnining in very slow mode), but that did not
make any difference - i.e., it did not become fast
again.


Please post host /proc/cpuinfo.


Here's the cpuinfo from host (for last core):

flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext 
fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good 
nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm 
extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs


Has both constant_tsc and nonstop_tsc, wonder why it's making so much 
trouble.



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-04 Thread Michael Tokarev

Marcelo Tosatti wrote:

Michael,

Can you please give the patch below a try please? (without acpi_pm timer 
or priority adjustments for the guest).


Sure.  I'll try it out in a hour or two, while I can experiment freely because
it's weekend.

But I wonder...
[]

hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns


It seems the way hrtimer_interrupt_hanging calculates min_delta is
wrong (especially to virtual machines). The guest vcpu can be scheduled
out during the execution of the hrtimer callbacks (and the callbacks
themselves can do operations that translate to blocking operations in
the hypervisor).

So high min_delta values can be calculated if, for example, a single
hrtimer_interrupt run takes two host time slices to execute, while some
other higher priority task runs for N slices in between.

Using the hrtimer_interrupt execution time (which can be the worse
case at any given time), as the min_delta is problematic.

So simply increase min_delta_ns by 50% once every detected failure,
which will eventually lead to an acceptable threshold (the algorithm
should scale back to down lower min_delta, to adjust back to wealthier
times, too).


..I wonder what should I check for.  I mean, the end result of this patch
is not entirely clear to me, what should it change?  I see that instead
of the now-calculated-after-error (probably very large) min_delta, it's
increased to a smaller value.

So I should be getting more such messages (forcing min_delta to $foo), but
the responsiveness of the guest should stay in more or less acceptable
range (unless it will continue erroring, in which case the responsiveness
will be slowly reduced).

Yes indeed, it's better than current situation, when the guest works fine
initially but out of the sudden it switches to a wild very-slow-to-reply
mode.  But it does not look like a right solution either, even if the
back adjustment (mentioned in the last statement above) will be implemented.
Unless I completely missed the point...

Neverless, the question stands: what I'm looking for now, when the patch is
applied?  I can't measure the responsiveness, especially since the min_delta
gets set to different (large) values each time (I mean current situation
without the patch).

Thanks!

/mjt


diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 49da79a..8997978 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
 
 #ifdef CONFIG_HIGH_RES_TIMERS
 
-static int force_clock_reprogram;

-
 /*
  * After 5 iteration's attempts, we consider that hrtimer_interrupt()
  * is hanging, which could happen with something that slows the interrupt
- * such as the tracing. Then we force the clock reprogramming for each future
- * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
- * threshold that we will overwrite.
- * The next tick event will be scheduled to 3 times we currently spend on
- * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
- * 1/4 of their time to process the hrtimer interrupts. This is enough to
- * let it running without serious starvation.
+ * such as the tracing, so we increase min_delta_ns.
  */
 
 static inline void

-hrtimer_interrupt_hanging(struct clock_event_device *dev,
-   ktime_t try_time)
+hrtimer_interrupt_hanging(struct clock_event_device *dev)
 {
-   force_clock_reprogram = 1;
-   dev-min_delta_ns = (unsigned long)try_time.tv64 * 3;
-   printk(KERN_WARNING hrtimer: interrupt too slow, 
-   forcing clock min delta to %lu ns\n, dev-min_delta_ns);
+   dev-min_delta_ns += dev-min_delta_ns  1;
+   if (printk_ratelimit())
+   printk(KERN_WARNING hrtimer: interrupt too slow, 
+   forcing clock min delta to %lu ns\n,
+   dev-min_delta_ns);
 }
 /*
  * High resolution timer interrupt
@@ -1276,7 +1268,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
  retry:
/* 5 retries is enough to notice a hang */
if (!(++nr_retries % 5))
-   hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
+   hrtimer_interrupt_hanging(dev);
 
 	now = ktime_get();
 
@@ -1342,7 +1334,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 
 	/* Reprogramming necessary ? */

if (expires_next.tv64 != KTIME_MAX) {
-   if (tick_program_event(expires_next, force_clock_reprogram))
+   if (tick_program_event(expires_next, 0))
goto retry;
}
 }


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-10-04 Thread Marcelo Tosatti
On Sun, Oct 04, 2009 at 04:01:02PM +0400, Michael Tokarev wrote:
 Marcelo Tosatti wrote:
 Michael,

 Can you please give the patch below a try please? (without acpi_pm 
 timer or priority adjustments for the guest).

 Sure.  I'll try it out in a hour or two, while I can experiment freely because
 it's weekend.

 But I wonder...
 []
 hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns

 It seems the way hrtimer_interrupt_hanging calculates min_delta is
 wrong (especially to virtual machines). The guest vcpu can be scheduled
 out during the execution of the hrtimer callbacks (and the callbacks
 themselves can do operations that translate to blocking operations in
 the hypervisor).

 So high min_delta values can be calculated if, for example, a single
 hrtimer_interrupt run takes two host time slices to execute, while some
 other higher priority task runs for N slices in between.

 Using the hrtimer_interrupt execution time (which can be the worse
 case at any given time), as the min_delta is problematic.

 So simply increase min_delta_ns by 50% once every detected failure,
 which will eventually lead to an acceptable threshold (the algorithm
 should scale back to down lower min_delta, to adjust back to wealthier
 times, too).

 ..I wonder what should I check for.  I mean, the end result of this patch
 is not entirely clear to me, what should it change?  I see that instead
 of the now-calculated-after-error (probably very large) min_delta, it's
 increased to a smaller value.

 So I should be getting more such messages (forcing min_delta to $foo), but
 the responsiveness of the guest should stay in more or less acceptable
 range (unless it will continue erroring, in which case the responsiveness
 will be slowly reduced).

Right.

 Yes indeed, it's better than current situation, when the guest works fine
 initially but out of the sudden it switches to a wild very-slow-to-reply
 mode.  But it does not look like a right solution either, even if the
 back adjustment (mentioned in the last statement above) will be implemented.
 Unless I completely missed the point...

 Neverless, the question stands: what I'm looking for now, when the patch is
 applied?  I can't measure the responsiveness, especially since the min_delta
 gets set to different (large) values each time (I mean current situation
 without the patch).

You should see min_delta_ns increase to a much smaller value, hopefully in
the 2000-1 range.

min_delta_ns is the minimum delay a high resolution timer can have. You
had it set in the hundreds of milliseconds range.


 Thanks!

 /mjt

 diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
 index 49da79a..8997978 100644
 --- a/kernel/hrtimer.c
 +++ b/kernel/hrtimer.c
 @@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
   #ifdef CONFIG_HIGH_RES_TIMERS
  -static int force_clock_reprogram;
 -
  /*
   * After 5 iteration's attempts, we consider that hrtimer_interrupt()
   * is hanging, which could happen with something that slows the interrupt
 - * such as the tracing. Then we force the clock reprogramming for each 
 future
 - * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
 - * threshold that we will overwrite.
 - * The next tick event will be scheduled to 3 times we currently spend on
 - * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
 - * 1/4 of their time to process the hrtimer interrupts. This is enough to
 - * let it running without serious starvation.
 + * such as the tracing, so we increase min_delta_ns.
   */
   static inline void
 -hrtimer_interrupt_hanging(struct clock_event_device *dev,
 -ktime_t try_time)
 +hrtimer_interrupt_hanging(struct clock_event_device *dev)
  {
 -force_clock_reprogram = 1;
 -dev-min_delta_ns = (unsigned long)try_time.tv64 * 3;
 -printk(KERN_WARNING hrtimer: interrupt too slow, 
 -forcing clock min delta to %lu ns\n, dev-min_delta_ns);
 +dev-min_delta_ns += dev-min_delta_ns  1;
 +if (printk_ratelimit())
 +printk(KERN_WARNING hrtimer: interrupt too slow, 
 +forcing clock min delta to %lu ns\n,
 +dev-min_delta_ns);
  }
  /*
   * High resolution timer interrupt
 @@ -1276,7 +1268,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
   retry:
  /* 5 retries is enough to notice a hang */
  if (!(++nr_retries % 5))
 -hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
 +hrtimer_interrupt_hanging(dev);
  now = ktime_get();
  @@ -1342,7 +1334,7 @@ void hrtimer_interrupt(struct clock_event_device 
 *dev)
  /* Reprogramming necessary ? */
  if (expires_next.tv64 != KTIME_MAX) {
 -if (tick_program_event(expires_next, force_clock_reprogram))
 +if (tick_program_event(expires_next, 0))
  goto retry;
  }
  }
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to 

Re: kvm guest: hrtimer: interrupt too slow

2009-10-03 Thread Marcelo Tosatti
Michael,

Can you please give the patch below a try please? (without acpi_pm timer 
or priority adjustments for the guest).

On Tue, Sep 29, 2009 at 05:12:17PM +0400, Michael Tokarev wrote:
 Hello.

 I'm having quite an.. unusable system here.
 It's not really a regresssion with 0.11.0,
 it was something similar before, but with
 0.11.0 and/or 2.6.31 it become much worse.

 The thing is that after some uptime, kvm
 guest prints something like this:

 hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns

 after which system (guest) speeed becomes
 very slow.  The above message is from
 2.6.31 guest running wiht 0.11.0  2.6.31
 host.  Before I tried it with 0.10.6 and
 2.6.30 or 2.6.27, and the delta were a
 bit less than that:

 hrtimer: interrupt too slow, forcing clock min delta to 15415 ns
 hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns

It seems the way hrtimer_interrupt_hanging calculates min_delta is
wrong (especially to virtual machines). The guest vcpu can be scheduled
out during the execution of the hrtimer callbacks (and the callbacks
themselves can do operations that translate to blocking operations in
the hypervisor).

So high min_delta values can be calculated if, for example, a single
hrtimer_interrupt run takes two host time slices to execute, while some
other higher priority task runs for N slices in between.

Using the hrtimer_interrupt execution time (which can be the worse
case at any given time), as the min_delta is problematic.

So simply increase min_delta_ns by 50% once every detected failure,
which will eventually lead to an acceptable threshold (the algorithm
should scale back to down lower min_delta, to adjust back to wealthier
times, too).

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 49da79a..8997978 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
 
 #ifdef CONFIG_HIGH_RES_TIMERS
 
-static int force_clock_reprogram;
-
 /*
  * After 5 iteration's attempts, we consider that hrtimer_interrupt()
  * is hanging, which could happen with something that slows the interrupt
- * such as the tracing. Then we force the clock reprogramming for each future
- * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
- * threshold that we will overwrite.
- * The next tick event will be scheduled to 3 times we currently spend on
- * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
- * 1/4 of their time to process the hrtimer interrupts. This is enough to
- * let it running without serious starvation.
+ * such as the tracing, so we increase min_delta_ns.
  */
 
 static inline void
-hrtimer_interrupt_hanging(struct clock_event_device *dev,
-   ktime_t try_time)
+hrtimer_interrupt_hanging(struct clock_event_device *dev)
 {
-   force_clock_reprogram = 1;
-   dev-min_delta_ns = (unsigned long)try_time.tv64 * 3;
-   printk(KERN_WARNING hrtimer: interrupt too slow, 
-   forcing clock min delta to %lu ns\n, dev-min_delta_ns);
+   dev-min_delta_ns += dev-min_delta_ns  1;
+   if (printk_ratelimit())
+   printk(KERN_WARNING hrtimer: interrupt too slow, 
+   forcing clock min delta to %lu ns\n,
+   dev-min_delta_ns);
 }
 /*
  * High resolution timer interrupt
@@ -1276,7 +1268,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
  retry:
/* 5 retries is enough to notice a hang */
if (!(++nr_retries % 5))
-   hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
+   hrtimer_interrupt_hanging(dev);
 
now = ktime_get();
 
@@ -1342,7 +1334,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 
/* Reprogramming necessary ? */
if (expires_next.tv64 != KTIME_MAX) {
-   if (tick_program_event(expires_next, force_clock_reprogram))
+   if (tick_program_event(expires_next, 0))
goto retry;
}
 }
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


kvm guest: hrtimer: interrupt too slow

2009-09-29 Thread Michael Tokarev

Hello.

I'm having quite an.. unusable system here.
It's not really a regresssion with 0.11.0,
it was something similar before, but with
0.11.0 and/or 2.6.31 it become much worse.

The thing is that after some uptime, kvm
guest prints something like this:

hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns

after which system (guest) speeed becomes
very slow.  The above message is from
2.6.31 guest running wiht 0.11.0  2.6.31
host.  Before I tried it with 0.10.6 and
2.6.30 or 2.6.27, and the delta were a
bit less than that:

hrtimer: interrupt too slow, forcing clock min delta to 15415 ns
hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns

etc.

Before, guest was just somewhat slow.  But
now, it reached the state when it's almost
unusable.

Especially nice it's visible in xterm
(running on that guest): typing a single
char in xterm results in it being displayed
after 1..2 sec pause.  With applications
like web browser (where all the font
rendering is done on the client and X
server receives bitmap instead of a
character code to draw, it isn't that
bad.

I suspect it's only happening with small
network packets.

The above message (with large min delta)
occured after about 20 hours uptime.  Similar
(give or take) delays were observed previously
as well.  That to say, -- it's not something
you notice immediately.

The problem is quite consistent, that is,
different guests shows it sooner or later.

Right now I'm running 2.6.31-amd64 host,
32bit qemu-kvm-0.11.0, 2.6.31-i686 uniprocessor
guest, virtio networking and block devices,
kvm-clock is enabled.  Similar configuration
were used before (with different versions).
The hardware is Phenom 9750 (4core) running
on Amd780g/sb700 chipset.

Any hints on what to do with all this?

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest: hrtimer: interrupt too slow

2009-09-29 Thread Michael Tokarev

Avi Kivity wrote:

On 09/29/2009 03:12 PM, Michael Tokarev wrote:

[]

The thing is that after some uptime, kvm
guest prints something like this:

hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns

[]

What happens if you use hpet or pmtimer as guest clocksource?


For all the guests I have handy, only 2 clocksources
are available: kvm-clock and acpi_pm.  The host itself
has hpet turned off because it itself had issues with
hpet, and after many tries we finally turned it off
(there were several long and painy threads on lkml
about this).

Dunno why pmtimer isn't available.

I tried switching to acpi_pm on a running guest (which
is still runnining in very slow mode), but that did not
make any difference - i.e., it did not become fast
again.


Please post host /proc/cpuinfo.


Here's the cpuinfo from host (for last core):

processor   : 3
vendor_id   : AuthenticAMD
cpu family  : 16
model   : 2
model name  : AMD Phenom(tm) 9750 Quad-Core Processor
stepping: 3
cpu MHz : 1200.000
cache size  : 512 KB
physical id : 0
siblings: 4
core id : 3
cpu cores   : 4
apicid  : 3
initial apicid  : 3
fpu : yes
fpu_exception   : yes
cpuid level : 5
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb 
rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni 
monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a 
misalignsse 3dnowprefetch osvw ibs
bogomips: 4814.78
TLB size: 1024 4K pages
clflush size: 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

It has cpufreq enabled (ondemand), but turning that off
does not change anything (that was the first thing I
tried, but AFTER the guest become slow).

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html