Re: [PATCH 4/5] watchdog: control hard lockup detection default

2014-08-18 Thread Ulrich Obergfell
>- Original Message -
>From: "Ingo Molnar" 
>To: "Don Zickus" 
>Cc: a...@linux-foundation.org, kvm@vger.kernel.org, pbonz...@redhat.com, 
>mi...@redhat.com, "LKML" , "Ulrich >Obergfell" 
>, "Andrew Jones" 
>Sent: Monday, August 18, 2014 11:16:44 AM
>Subject: Re: [PATCH 4/5] watchdog: control hard lockup detection default
>
>
> * Don Zickus  wrote:
>
>> The running kernel still has the ability to enable/disable at any
>> time with /proc/sys/kernel/nmi_watchdog us usual. However even
>> when the default has been overridden /proc/sys/kernel/nmi_watchdog
>> will initially show '1'. To truly turn it on one must disable/enable
>> it, i.e.
>>   echo 0 > /proc/sys/kernel/nmi_watchdog
>>   echo 1 > /proc/sys/kernel/nmi_watchdog
>
> This looks like a bug, why is this so?
>
> Thanks,
>
>   Ingo


This is because the hard lockup detector and the soft lockup detector are
enabled and disabled at the same time - there isn't a separate 'knob' for
each of them. Both are controlled via the 'watchdog_user_enabled' variable
which is 1 by default.

  lockup_detector_init
if (watchdog_user_enabled)
watchdog_enable_all_cpus
  smpboot_register_percpu_thread(&watchdog_threads)

At boot time, the above code path lauches a 'watchdog/N' thread for each
online CPU. The watchdog_enable() function is executed in the context of
these threads, and this attempts to enable the hard lockup detector and
the soft lockup detector. [Note: Soft lockup detection is implemented in
watchdog_timer_fn().]

  watchdog_enable
hrtimer_init(hrtimer, ...)
hrtimer->function = watchdog_timer_fn

watchdog_nmi_enable
  perf_event_create_kernel_counter(..., watchdog_overflow_callback)

hrtimer_start(hrtimer, ...)

On bare metal systems or in virtual environments where the hypervisor
does not emulate a PMU, watchdog_nmi_enable() can fail to allocate and
enable a PMU counter. This is reported by a console message:

  NMI watchdog: disabled (cpu0): hardware events not enabled

Hence, we can end up with a situation where the soft lockup detector is
enabled and the hard lockup detector is not enabled. However, the output
of 'cat /proc/sys/kernel/nmi_watchdog' is 1 because it merely shows the
state of the 'watchdog_user_enabled' variable.

The above is the behaviour even without the proposed patch. The patch
merely adds the following hunk in watchdog_nmi_enable() to 'fake' a
-ENOENT error return from perf_event_create_kernel_counter().

  +if (!watchdog_hardlockup_detector_is_enabled()) {
  +event = ERR_PTR(-ENOENT);
  +goto handle_err;
  +}

The patch does not break the output of 'cat /proc/sys/kernel/nmi_watchdog'
since the discrepancy between the output and the actual state of the hard
lockup detector is nothing new.


Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] watchdog: control hard lockup detection default

2014-07-25 Thread Ulrich Obergfell
> - Original Message -
> From: "Andrew Jones" 
> To: linux-ker...@vger.kernel.org, kvm@vger.kernel.org
> Cc: uober...@redhat.com, dzic...@redhat.com, pbonz...@redhat.com, 
> a...@linux-foundation.org, mi...@redhat.com
> Sent: Thursday, July 24, 2014 12:13:30 PM
> Subject: [PATCH 2/3] watchdog: control hard lockup detection default

[...]

> The running kernel still has the ability to enable/disable at any
> time with /proc/sys/kernel/nmi_watchdog us usual. However even
> when the default has been overridden /proc/sys/kernel/nmi_watchdog
> will initially show '1'. To truly turn it on one must disable/enable
> it, i.e.
>   echo 0 > /proc/sys/kernel/nmi_watchdog
>   echo 1 > /proc/sys/kernel/nmi_watchdog

[...]

> @@ -626,15 +665,17 @@ int proc_dowatchdog(struct ctl_table *table, int write,
>  * disabled. The 'watchdog_running' variable check in
>  * watchdog_*_all_cpus() function takes care of this.
>  */
> -if (watchdog_user_enabled && watchdog_thresh)
> +if (watchdog_user_enabled && watchdog_thresh) {
> +watchdog_enable_hardlockup_detector(true);
>  err = watchdog_enable_all_cpus(old_thresh != 
> watchdog_thresh);
> -else
> +} else

[...]


I just realized a possible issue in the above part of the patch:

If we would want to give the user the option to override the effect of patch 3/3
via /proc, I think proc_dowatchdog() should enable hard lockup detection _only_
in case of a state transition from 'NOT watchdog_running' to 'watchdog_running'.
  |
   if (watchdog_user_enabled && watchdog_thresh) {| need to add this
   if (!watchdog_running) <---'
   watchdog_enable_hardlockup_detector(true);
   err = watchdog_enable_all_cpus(old_thresh != watchdog_thresh);
   } else
   ...

The additional 'if (!watchdog_running)' would _require_ the user to perform the
sequence of commands

   echo 0 > /proc/sys/kernel/nmi_watchdog
   echo 1 > /proc/sys/kernel/nmi_watchdog

to enable hard lockup detection explicitly.

I think changing the 'watchdog_thresh' while 'watchdog_running' is true should
_not_ enable hard lockup detection as a side-effect, because a user may have a
'sysctl.conf' entry such as

   kernel.watchdog_thresh = ...

or may only want to change the 'watchdog_thresh' on the fly.

I think the following flow of execution could cause such undesired side-effect.

   proc_dowatchdog
 if (watchdog_user_enabled && watchdog_thresh) {

 watchdog_enable_hardlockup_detector
   hardlockup_detector_enabled = true

 watchdog_enable_all_cpus
   if (!watchdog_running) {
   ...
   } else if (sample_period_changed)
  update_timers_all_cpus
for_each_online_cpu
update_timers
  watchdog_nmi_disable
  ...
  watchdog_nmi_enable

watchdog_hardlockup_detector_is_enabled
  return true

enable perf counter for hard lockup detection

Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] watchdog: control hard lockup detection default

2014-07-24 Thread Ulrich Obergfell
>- Original Message -
>From: "Paolo Bonzini" 
>To: "Ulrich Obergfell" 
>Cc: "Andrew Jones" , linux-ker...@vger.kernel.org, 
>kvm@vger.kernel.org, dzic...@redhat.com, a...@linux-foundation.org, 
>>mi...@redhat.com
>Sent: Thursday, July 24, 2014 1:45:47 PM
>Subject: Re: [PATCH 2/3] watchdog: control hard lockup detection default
>
>Il 24/07/2014 13:44, Ulrich Obergfell ha scritto:
>> > But this means that it is not possible to re-enable softlockup detection
>> > only.  I think that should be the effect of echo 0 + echo 1, if
>> > hardlockup detection was disabled by either the command line or patch 3.
>>
>> The idea was to give the user two options to override the effect of patch 
>> 3/3.
>> Either via the kernel command line ('nmi_watchdog=') at boot time or via 
>> /proc
>> ('echo 0' + 'echo 1') when the system is up and running.
>
> I think the kernel command line is enough; another alternative is to
> split the nmi_watchdog /proc entry in two.
>
> Paolo

The current behaviour (without the patch) already allows a user to disable
NMI watchdog at boot time ('nmi_watchdog=0') and enable it explicitly when
the system is up and running ('echo 0' + 'echo 1'). I think it would be
more consistent with this behaviour and more intuitive if we would give
the user the option to override the effect of patch 3/3 via /proc. By
'intuitive' I mean that the user says: 'I _want_ this to be enabled'.

Regards,

Uli

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] watchdog: control hard lockup detection default

2014-07-24 Thread Ulrich Obergfell
>- Original Message -
>From: "Paolo Bonzini" 
>To: "Ulrich Obergfell" 
>Cc: "Andrew Jones" , linux-ker...@vger.kernel.org, 
>kvm@vger.kernel.org, dzic...@redhat.com, a...@linux-foundation.org, 
>>mi...@redhat.com
>Sent: Thursday, July 24, 2014 1:26:40 PM
>Subject: Re: [PATCH 2/3] watchdog: control hard lockup detection default
>
>Il 24/07/2014 13:18, Ulrich Obergfell ha scritto:
>>>> >> The running kernel still has the ability to enable/disable at any
>>>> >> time with /proc/sys/kernel/nmi_watchdog us usual. However even
>>>> >> when the default has been overridden /proc/sys/kernel/nmi_watchdog
>>>> >> will initially show '1'. To truly turn it on one must disable/enable
>>>> >> it, i.e.
>>>> >>   echo 0 > /proc/sys/kernel/nmi_watchdog
>>>> >>   echo 1 > /proc/sys/kernel/nmi_watchdog
>>> >
>>> > Why is it hard to make this show the right value? :)
>>> >
>>> > Paolo
>> 'echo 1 > /proc/sys/kernel/nmi_watchdog' enables both - hard lockup and
>> soft lockup detection. watchdog_enable_all_cpus() starts a 'watchdog/N'
>> thread for each CPU. If the kernel runs on a bare metal system where the
>> processor does not have a PMU, or when perf_event_create_kernel_counter()
>> returns failure to watchdog_nmi_enable(), or when the kernel runs as a
>> guest on a hypervisor that does not emulate a PMU, then the 'watchdog/N'
>> threads are still active for soft lockup detection. Patch 2/3 essentially
>> makes watchdog_nmi_enable() behave in the same way as if -ENOENT would
>> have been returned by perf_event_create_kernel_counter(). This is then
>> reported via a console message.
>> 
>>   NMI watchdog: disabled (cpu0): hardware events not enabled
>> 
>> It's hard say what _is_ 'the right value' (because lockup detection is
>> then enabled 'partially'), regardless of whether patch 2/3 is applied
>> or not.
>
> But this means that it is not possible to re-enable softlockup detection
> only.  I think that should be the effect of echo 0 + echo 1, if
> hardlockup detection was disabled by either the command line or patch 3.
>
> Paolo

The idea was to give the user two options to override the effect of patch 3/3.
Either via the kernel command line ('nmi_watchdog=') at boot time or via /proc
('echo 0' + 'echo 1') when the system is up and running.

Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] watchdog: control hard lockup detection default

2014-07-24 Thread Ulrich Obergfell
> - Original Message -
> From: "Paolo Bonzini" 
> To: "Andrew Jones" , linux-ker...@vger.kernel.org, 
> kvm@vger.kernel.org
> Cc: uober...@redhat.com, dzic...@redhat.com, a...@linux-foundation.org, 
> mi...@redhat.com
> Sent: Thursday, July 24, 2014 12:46:11 PM
> Subject: Re: [PATCH 2/3] watchdog: control hard lockup detection default
>
>Il 24/07/2014 12:13, Andrew Jones ha scritto:
>> 
>> The running kernel still has the ability to enable/disable at any
>> time with /proc/sys/kernel/nmi_watchdog us usual. However even
>> when the default has been overridden /proc/sys/kernel/nmi_watchdog
>> will initially show '1'. To truly turn it on one must disable/enable
>> it, i.e.
>>   echo 0 > /proc/sys/kernel/nmi_watchdog
>>   echo 1 > /proc/sys/kernel/nmi_watchdog
>
> Why is it hard to make this show the right value? :)
>
> Paolo

'echo 1 > /proc/sys/kernel/nmi_watchdog' enables both - hard lockup and
soft lockup detection. watchdog_enable_all_cpus() starts a 'watchdog/N'
thread for each CPU. If the kernel runs on a bare metal system where the
processor does not have a PMU, or when perf_event_create_kernel_counter()
returns failure to watchdog_nmi_enable(), or when the kernel runs as a
guest on a hypervisor that does not emulate a PMU, then the 'watchdog/N'
threads are still active for soft lockup detection. Patch 2/3 essentially
makes watchdog_nmi_enable() behave in the same way as if -ENOENT would
have been returned by perf_event_create_kernel_counter(). This is then
reported via a console message.

  NMI watchdog: disabled (cpu0): hardware events not enabled

It's hard say what _is_ 'the right value' (because lockup detection is
then enabled 'partially'), regardless of whether patch 2/3 is applied
or not.

Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1] KVM: x86: improve the usability of the 'kvm_pio' tracepoint

2014-05-05 Thread Ulrich Obergfell
> From: "Xiao Guangrong" 
> To: "Ulrich Obergfell" , kvm@vger.kernel.org
> Cc: pbonz...@redhat.com
> Sent: Monday, May 5, 2014 9:10:19 AM
> Subject: Re: [PATCH 0/1] KVM: x86: improve the usability of the 'kvm_pio' 
> tracepoint
>
> On 05/02/2014 11:57 PM, Ulrich Obergfell wrote:
>> The current implementation of the 'kvm_pio' tracepoint in 
>> emulator_pio_in_out()
>> only tells us that 'something' has been read from or written to an I/O port. 
>> To
>> improve the usability of the tracepoint, I propose to include the 
>> value/content
>> that has been read or written in the trace output. The proposed patch aims at
>> the more common case where a single 8-bit or 16-bit or 32-bit value has been
>> read or written -- it does not fully cover the case where 'count' is greater
>> than one.
>> 
>> This is an example of what the patch can do (trace of PCI config space 
>> access).
>> 
>>  - on the host
>> 
>># trace-cmd record -e kvm:kvm_pio -f "(port >= 0xcf8) && (port <= 0xcff)"
>>/sys/kernel/debug/tracing/events/kvm/kvm_pio/filter
>>Hit Ctrl^C to stop recording
>> 
>>  - in a Linux guest
>> 
>># dd if=/sys/bus/pci/devices/:00:06.0/config bs=2 count=4 | hexdump
>>4+0 records in
>>4+0 records out
>>8 bytes (8 B) copied, 0.000114056 s, 70.1 kB/s
>>000 1af4 1001 0507 0010
>>008
>> 
>>  - on the host
>> 
>># trace-cmd report
>>...
>>qemu-kvm-23216 [001] 15211.994089: kvm_pio:  pio_write
>>at 0xcf8 size 4 count 1 val 0x80003000 
>>qemu-kvm-23216 [001] 15211.994108: kvm_pio:  pio_read
>>at 0xcfc size 2 count 1 val 0x1af4 
>>qemu-kvm-23216 [001] 15211.994129: kvm_pio:  pio_write
>>at 0xcf8 size 4 count 1 val 0x80003000 
>>qemu-kvm-23216 [001] 15211.994136: kvm_pio:  pio_read
>>at 0xcfe size 2 count 1 val 0x1001 
>>qemu-kvm-23216 [001] 15211.994143: kvm_pio:  pio_write
>>at 0xcf8 size 4 count 1 val 0x80003004 
>>qemu-kvm-23216 [001] 15211.994150: kvm_pio:  pio_read
>>at 0xcfc size 2 count 1 val 0x507 
>>qemu-kvm-23216 [001] 15211.994155: kvm_pio:  pio_write
>>at 0xcf8 size 4 count 1 val 0x80003004 
>>qemu-kvm-23216 [001] 15211.994161: kvm_pio:  pio_read
>>at 0xcfe size 2 count 1 val 0x10 
>> 
>
> Nice.
>
> Could please check "perf kvm stat" to see if "--event=ioport"
> can work after your patch?
>
> Reviewed-by: Xiao Guangrong 

I've run a quick test with a local kernel - built from 3.15.0-rc1 source,
including the proposed patch - in combination with the 'perf' package that
is installed on my test machine. I didn't build a new 'perf' binary from
3.15.0-rc1 source.

The following output of the 'perf kvm stat live --event=ioport -d 10' command
looks reasonable.


17:10:29.036811

Analyze events for all VMs, all VCPUs:

  IO Port AccessSamples  Samples% Time%   Min Time   Max Time   
  Avg time 

   0x177:PIN 3520.00%15.40%1us3us  
1.68us ( +-   8.63% )
   0x376:PIN 3017.14%16.37%1us6us  
2.08us ( +-  17.15% )
  0x170:POUT 15 8.57%18.99%2us9us  
4.83us ( +-  14.34% )
 0xc0ea:POUT 10 5.71% 6.57%2us2us  
2.51us ( +-   5.06% )
  0xc0ea:PIN 10 5.71% 6.21%1us6us  
2.37us ( +-  23.18% )
  0x176:POUT 10 5.71% 6.69%1us3us  
2.55us ( +-   7.59% )
   0x170:PIN  5 2.86% 3.36%2us2us  
2.56us ( +-   1.17% )
   0x171:PIN  5 2.86% 1.47%1us1us  
1.12us ( +-   0.37% )
  0x171:POUT  5 2.86% 3.26%2us2us  
2.49us ( +-   2.25% )
   0x172:PIN  5 2.86% 1.45%1us1us  
1.11us ( +-   0.24% )
  0x172:POUT  5 2.86% 2.67%1us2us  
2.04us ( +-   3.00% )
   0x173:PIN  5 2.86% 1.46%1us1us  
1.11us ( +-   0.29% )
  0x173:POUT  5 2.86% 2.60%1us2us  
1.99us ( +-   2.96% )
   0x174:PIN  5 2.86% 1.45%1us1us  
1.11us ( +-   0.16% )
  0x174:POUT  5

[PATCH 0/1] KVM: x86: improve the usability of the 'kvm_pio' tracepoint

2014-05-02 Thread Ulrich Obergfell
The current implementation of the 'kvm_pio' tracepoint in emulator_pio_in_out()
only tells us that 'something' has been read from or written to an I/O port. To
improve the usability of the tracepoint, I propose to include the value/content
that has been read or written in the trace output. The proposed patch aims at
the more common case where a single 8-bit or 16-bit or 32-bit value has been
read or written -- it does not fully cover the case where 'count' is greater
than one.

This is an example of what the patch can do (trace of PCI config space access).

 - on the host

   # trace-cmd record -e kvm:kvm_pio -f "(port >= 0xcf8) && (port <= 0xcff)"
   /sys/kernel/debug/tracing/events/kvm/kvm_pio/filter
   Hit Ctrl^C to stop recording

 - in a Linux guest

   # dd if=/sys/bus/pci/devices/:00:06.0/config bs=2 count=4 | hexdump
   4+0 records in
   4+0 records out
   8 bytes (8 B) copied, 0.000114056 s, 70.1 kB/s
   000 1af4 1001 0507 0010
   008

 - on the host

   # trace-cmd report
   ...
   qemu-kvm-23216 [001] 15211.994089: kvm_pio:  pio_write
   at 0xcf8 size 4 count 1 val 0x80003000 
   qemu-kvm-23216 [001] 15211.994108: kvm_pio:  pio_read
   at 0xcfc size 2 count 1 val 0x1af4 
   qemu-kvm-23216 [001] 15211.994129: kvm_pio:  pio_write
   at 0xcf8 size 4 count 1 val 0x80003000 
   qemu-kvm-23216 [001] 15211.994136: kvm_pio:  pio_read
   at 0xcfe size 2 count 1 val 0x1001 
   qemu-kvm-23216 [001] 15211.994143: kvm_pio:  pio_write
   at 0xcf8 size 4 count 1 val 0x80003004 
   qemu-kvm-23216 [001] 15211.994150: kvm_pio:  pio_read
   at 0xcfc size 2 count 1 val 0x507 
   qemu-kvm-23216 [001] 15211.994155: kvm_pio:  pio_write
   at 0xcf8 size 4 count 1 val 0x80003004 
   qemu-kvm-23216 [001] 15211.994161: kvm_pio:  pio_read
   at 0xcfe size 2 count 1 val 0x10 


Ulrich Obergfell (1):
  improve the usability of the 'kvm_pio' tracepoint

 arch/x86/kvm/trace.h | 20 
 arch/x86/kvm/x86.c   |  4 ++--
 2 files changed, 18 insertions(+), 6 deletions(-)

-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/1] KVM: x86: improve the usability of the 'kvm_pio' tracepoint

2014-05-02 Thread Ulrich Obergfell
This patch moves the 'kvm_pio' tracepoint to emulator_pio_in_emulated()
and emulator_pio_out_emulated(), and it adds an argument (a pointer to
the 'pio_data'). A single 8-bit or 16-bit or 32-bit data item is fetched
from 'pio_data' (depending on 'size'), and the value is included in the
trace record ('val'). If 'count' is greater than one, this is indicated
by the string "(...)" in the trace output.

Signed-off-by: Ulrich Obergfell 
---
 arch/x86/kvm/trace.h | 20 
 arch/x86/kvm/x86.c   |  4 ++--
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 545245d..33574c9 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -91,16 +91,21 @@ TRACE_EVENT(kvm_hv_hypercall,
 /*
  * Tracepoint for PIO.
  */
+
+#define KVM_PIO_IN   0
+#define KVM_PIO_OUT  1
+
 TRACE_EVENT(kvm_pio,
TP_PROTO(unsigned int rw, unsigned int port, unsigned int size,
-unsigned int count),
-   TP_ARGS(rw, port, size, count),
+unsigned int count, void *data),
+   TP_ARGS(rw, port, size, count, data),
 
TP_STRUCT__entry(
__field(unsigned int,   rw  )
__field(unsigned int,   port)
__field(unsigned int,   size)
__field(unsigned int,   count   )
+   __field(unsigned int,   val )
),
 
TP_fast_assign(
@@ -108,11 +113,18 @@ TRACE_EVENT(kvm_pio,
__entry->port   = port;
__entry->size   = size;
__entry->count  = count;
+   if (size == 1)
+   __entry->val= *(unsigned char *)data;
+   else if (size == 2)
+   __entry->val= *(unsigned short *)data;
+   else
+   __entry->val= *(unsigned int *)data;
),
 
-   TP_printk("pio_%s at 0x%x size %d count %d",
+   TP_printk("pio_%s at 0x%x size %d count %d val 0x%x %s",
  __entry->rw ? "write" : "read",
- __entry->port, __entry->size, __entry->count)
+ __entry->port, __entry->size, __entry->count, __entry->val,
+ __entry->count > 1 ? "(...)" : "")
 );
 
 /*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8b8fc0b..cd79707 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4486,8 +4486,6 @@ static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int 
size,
   unsigned short port, void *val,
   unsigned int count, bool in)
 {
-   trace_kvm_pio(!in, port, size, count);
-
vcpu->arch.pio.port = port;
vcpu->arch.pio.in = in;
vcpu->arch.pio.count  = count;
@@ -4522,6 +4520,7 @@ static int emulator_pio_in_emulated(struct 
x86_emulate_ctxt *ctxt,
if (ret) {
 data_avail:
memcpy(val, vcpu->arch.pio_data, size * count);
+   trace_kvm_pio(KVM_PIO_IN, port, size, count, 
vcpu->arch.pio_data);
vcpu->arch.pio.count = 0;
return 1;
}
@@ -4536,6 +4535,7 @@ static int emulator_pio_out_emulated(struct 
x86_emulate_ctxt *ctxt,
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
 
memcpy(vcpu->arch.pio_data, val, size * count);
+   trace_kvm_pio(KVM_PIO_OUT, port, size, count, vcpu->arch.pio_data);
return emulator_pio_in_out(vcpu, size, port, (void *)val, count, false);
 }
 
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-20 Thread Ulrich Obergfell
Loss of periodic timer interrupts caused by delayed callbacks and by
interrupt coalescing is compensated by gradually injecting additional
interrupts during subsequent timer intervals, starting at a rate of
one additional interrupt per interval. The injection of additional
interrupts is based on a backlog of unaccounted HPET clock periods
(new HPETTimer field 'ticks_not_accounted'). The backlog increases
due to delayed callbacks and coalesced interrupts, and it decreases
if an interrupt was injected successfully. If the backlog increases
while compensation is still in progress, the rate at which additional
interrupts are injected is increased too. A limit is imposed on the
backlog and on the rate.

Injecting additional timer interrupts to compensate lost interrupts
can alleviate long term time drift. However, on a short time scale,
this method can have the side effect of making virtual machine time
intermittently pass slower and faster than real time (depending on
the guest's time keeping algorithm). Compensation is disabled by
default and can be enabled for guests where this behaviour may be
acceptable.

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |  120 +++-
 1 files changed, 118 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 0428290..bc2a21a 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -31,6 +31,7 @@
 #include "hpet_emul.h"
 #include "sysbus.h"
 #include "mc146818rtc.h"
+#include 
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -41,6 +42,9 @@
 
 #define HPET_MSI_SUPPORT0
 
+#define MAX_TICKS_NOT_ACCOUNTED (uint64_t)5 /* 5 sec */
+#define MAX_IRQ_RATE(uint32_t)10
+
 struct HPETState;
 typedef struct HPETTimer {  /* timers */
 uint8_t tn; /*timer number*/
@@ -334,13 +338,68 @@ static const VMStateDescription vmstate_hpet = {
 };
 
 /*
+ * This function resets the driftfix state in the following situations.
+ *
+ * - When the guest o/s changes the 'CFG_ENABLE' bit (overall enable)
+ *   in the General Configuration Register from 0 to 1.
+ *
+ * - When the guest o/s changes the 'TN_ENABLE' bit (timer N interrupt enable)
+ *   in the Timer N Configuration and Capabilities Register from 0 to 1.
+ */
+static void hpet_timer_driftfix_reset(HPETTimer *t)
+{
+if (t->state->driftfix && timer_is_periodic(t)) {
+t->ticks_not_accounted = t->prev_period = t->period;
+t->irq_rate = 1;
+t->divisor = 1;
+}
+}
+
+/*
+ * This function determines whether there is a backlog of ticks for which
+ * no interrupts have been delivered to the guest o/s yet. If the backlog
+ * is equal to or greater than the current period length, then additional
+ * interrupts will be delivered to the guest o/s inside of the subsequent
+ * period interval to compensate missed interrupts.
+ *
+ * 'ticks_not_accounted' increases by 'N * period' when the comparator is
+ * being advanced, and it decreases by 'prev_period' when an interrupt is
+ * delivered to the guest o/s. Normally 'prev_period' is equal to 'period'
+ * and 'N' is 1. 'prev_period' is different from 'period' if a guest o/s
+ * has changed the comparator value during the previous period interval.
+ * 'N' is greater than 1 if the callback was delayed by 'N - 1' periods,
+ * and 'N' is zero while additional interrupts are delivered inside of an
+ * interval.
+ *
+ * This function is called after the comparator has been advanced but before
+ * the interrupt is delivered to the guest o/s. Hence, 'ticks_not_accounted'
+ * is equal to 'prev_period' plus 'period' if there is no backlog.
+ */
+static bool hpet_timer_has_tick_backlog(HPETTimer *t)
+{
+uint64_t backlog = 0;
+
+if (t->ticks_not_accounted >= t->period + t->prev_period) {
+backlog = t->ticks_not_accounted - (t->period + t->prev_period);
+}
+return (backlog >= t->period);
+}
+
+/*
  * timer expiration callback
  */
 static void hpet_timer(void *opaque)
 {
 HPETTimer *t = opaque;
+HPETState *s = t->state;
 uint64_t diff;
-
+int irq_delivered = 0;
+uint32_t period_count = 0;   /* elapsed periods since last callback
+  *   1: normal case
+  * > 1: missed 'period_count - 1' interrupts
+  *  due to delayed callback
+  *   0: callback inside of an interval
+  *  to deliver additional interrupts */
 uint64_t period = t->period;
 uint64_t cur_tick = hpet_get_ticks(t->state);
 
@@ -348,13 +407,48 @@ static void hpet_timer(void *opaque)
 if (t->config & HPET_TN_32BIT) {

[PATCH v5 2/5] hpet 'driftfix': add driftfix property to HPETState and DeviceInfo

2011-05-20 Thread Ulrich Obergfell
driftfix is a 'bit type' property. Compensation of delayed callbacks
and coalesced interrupts can be enabled with the command line option

-global hpet.driftfix=on

driftfix is 'off' (disabled) by default.

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 6ce07bc..7513065 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -72,6 +72,8 @@ typedef struct HPETState {
 uint64_t isr;   /* interrupt status reg */
 uint64_t hpet_counter;  /* main counter */
 uint8_t  hpet_id;   /* instance id */
+
+uint32_t driftfix;
 } HPETState;
 
 static uint32_t hpet_in_legacy_mode(HPETState *s)
@@ -738,6 +740,7 @@ static SysBusDeviceInfo hpet_device_info = {
 .qdev.props = (Property[]) {
 DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS),
 DEFINE_PROP_BIT("msi", HPETState, flags, HPET_MSI_SUPPORT, false),
+DEFINE_PROP_BIT("driftfix", HPETState, driftfix, 0, false),
 DEFINE_PROP_END_OF_LIST(),
 },
 };
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 4/5] hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only)

2011-05-20 Thread Ulrich Obergfell
update_irq() uses a similar method as in 'rtc_td_hack' to detect
coalesced interrupts. The function entry addresses are retrieved
from 'target_get_irq_delivered' and 'target_reset_irq_delivered'.

This change can be replaced if a generic feedback infrastructure to
track coalesced IRQs for periodic, clock providing devices becomes
available.

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |   13 +++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index dba9370..0428290 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -184,11 +184,12 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, 
uint64_t current)
 }
 }
 
-static void update_irq(struct HPETTimer *timer, int set)
+static int update_irq(struct HPETTimer *timer, int set)
 {
 uint64_t mask;
 HPETState *s;
 int route;
+int irq_delivered = 1;
 
 if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) {
 /* if LegacyReplacementRoute bit is set, HPET specification requires
@@ -213,8 +214,16 @@ static void update_irq(struct HPETTimer *timer, int set)
 qemu_irq_raise(s->irqs[route]);
 } else {
 s->isr &= ~mask;
-qemu_irq_pulse(s->irqs[route]);
+if (s->driftfix) {
+target_reset_irq_delivered();
+qemu_irq_raise(s->irqs[route]);
+irq_delivered = target_get_irq_delivered();
+qemu_irq_lower(s->irqs[route]);
+} else {
+qemu_irq_pulse(s->irqs[route]);
+}
 }
+return irq_delivered;
 }
 
 static void hpet_pre_save(void *opaque)
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 0/5] hpet 'driftfix': alleviate time drift with HPET periodic timers

2011-05-20 Thread Ulrich Obergfell
Hi,

This is version 5 of a series of patches that I originally posted in:

http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01989.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01992.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01991.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01990.html

http://article.gmane.org/gmane.comp.emulators.kvm.devel/69325
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69326
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69327
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69328


Changes since version 4:

Added comments to patch part 3 and part 5. No changes in the actual code.


Please review and please comment.

Regards,

Uli


Ulrich Obergfell (5):
  hpet 'driftfix': add hooks required to detect coalesced interrupts
(x86 apic only)
  hpet 'driftfix': add driftfix property to HPETState and DeviceInfo
  hpet 'driftfix': add fields to HPETTimer and VMStateDescription
  hpet 'driftfix': add code in update_irq() to detect coalesced
interrupts (x86 apic only)
  hpet 'driftfix': add code in hpet_timer() to compensate delayed
callbacks and coalesced interrupts

 hw/apic.c |4 ++
 hw/hpet.c |  178 +++--
 hw/pc.h   |   13 +
 vl.c  |   13 +
 4 files changed, 204 insertions(+), 4 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription

2011-05-20 Thread Ulrich Obergfell
The new fields in HPETTimer are covered by a separate VMStateDescription
which is a subsection of 'vmstate_hpet_timer'. They are only migrated if

-global hpet.driftfix=on

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |   42 ++
 1 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 7513065..dba9370 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -55,6 +55,19 @@ typedef struct HPETTimer {  /* timers */
 uint8_t wrap_flag;  /* timer pop will indicate wrap for one-shot 32-bit
  * mode. Next pop will be actual timer expiration.
  */
+/* driftfix state */
+uint64_t prev_period;/* needed when the guest o/s changes the
+  * comparator value */
+uint64_t ticks_not_accounted;/* ticks for which no interrupts have been
+  * delivered to the guest o/s yet */
+uint32_t irq_rate;   /* rate at which interrupts are delivered
+  * to the guest o/s during one period
+  * interval; if rate is greater than 1,
+  * additional interrupts are delivered
+  * to compensate missed interrupts */
+uint32_t divisor;/* needed to determine when the next
+  * timer callback should occur while
+  * rate is greater than 1 */
 } HPETTimer;
 
 typedef struct HPETState {
@@ -246,6 +259,27 @@ static int hpet_post_load(void *opaque, int version_id)
 return 0;
 }
 
+static bool hpet_timer_driftfix_vmstate_needed(void *opaque)
+{
+HPETTimer *t = opaque;
+
+return (t->state->driftfix != 0);
+}
+
+static const VMStateDescription vmstate_hpet_timer_driftfix = {
+.name = "hpet_timer_driftfix",
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT64(prev_period, HPETTimer),
+VMSTATE_UINT64(ticks_not_accounted, HPETTimer),
+VMSTATE_UINT32(irq_rate, HPETTimer),
+VMSTATE_UINT32(divisor, HPETTimer),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static const VMStateDescription vmstate_hpet_timer = {
 .name = "hpet_timer",
 .version_id = 1,
@@ -260,6 +294,14 @@ static const VMStateDescription vmstate_hpet_timer = {
 VMSTATE_UINT8(wrap_flag, HPETTimer),
 VMSTATE_TIMER(qemu_timer, HPETTimer),
 VMSTATE_END_OF_LIST()
+},
+.subsections = (VMStateSubsection []) {
+{
+.vmsd = &vmstate_hpet_timer_driftfix,
+.needed = hpet_timer_driftfix_vmstate_needed,
+}, {
+/* empty */
+}
 }
 };
 
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 1/5] hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)

2011-05-20 Thread Ulrich Obergfell
'target_get_irq_delivered' and 'target_reset_irq_delivered' point
to functions that are called by update_irq() to detect coalesced
interrupts. Initially they point to stub functions which pretend
successful interrupt injection. apic code calls two registration
functions to replace the stubs with apic_get_irq_delivered() and
apic_reset_irq_delivered().

This change can be replaced if a generic feedback infrastructure to
track coalesced IRQs for periodic, clock providing devices becomes
available.

Signed-off-by: Ulrich Obergfell 
---
 hw/apic.c |4 
 hw/pc.h   |   13 +
 vl.c  |   13 +
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index a45b57f..94b1d15 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -17,6 +17,7 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>
  */
 #include "hw.h"
+#include "pc.h"
 #include "apic.h"
 #include "ioapic.h"
 #include "qemu-timer.h"
@@ -1143,6 +1144,9 @@ static SysBusDeviceInfo apic_info = {
 
 static void apic_register_devices(void)
 {
+register_target_get_irq_delivered(apic_get_irq_delivered);
+register_target_reset_irq_delivered(apic_reset_irq_delivered);
+
 sysbus_register_withprop(&apic_info);
 }
 
diff --git a/hw/pc.h b/hw/pc.h
index bc8fcec..7511f28 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -7,6 +7,19 @@
 #include "fdc.h"
 #include "net.h"
 
+extern int (*target_get_irq_delivered)(void);
+extern void (*target_reset_irq_delivered)(void);
+
+static inline void register_target_get_irq_delivered(int (*func)(void))
+{
+target_get_irq_delivered = func;
+}
+
+static inline void register_target_reset_irq_delivered(void (*func)(void))
+{
+target_reset_irq_delivered = func;
+}
+
 /* PC-style peripherals (also used by other machines).  */
 
 /* serial.c */
diff --git a/vl.c b/vl.c
index 73e147f..456e320 100644
--- a/vl.c
+++ b/vl.c
@@ -232,6 +232,19 @@ const char *prom_envs[MAX_PROM_ENVS];
 const char *nvram = NULL;
 int boot_menu;
 
+static int target_get_irq_delivered_stub(void)
+{
+return 1;
+}
+
+static void target_reset_irq_delivered_stub(void)
+{
+return;
+}
+
+int (*target_get_irq_delivered)(void) = target_get_irq_delivered_stub;
+void (*target_reset_irq_delivered)(void) = target_reset_irq_delivered_stub;
+
 typedef struct FWBootEntry FWBootEntry;
 
 struct FWBootEntry {
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v4 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-12 Thread Ulrich Obergfell

Hi Zachary,

1. re.:

>> +static void hpet_timer_driftfix_reset(HPETTimer *t)
>> +{
>> +if (t->state->driftfix&&  timer_is_periodic(t)) {
>> +t->ticks_not_accounted = t->prev_period = t->period;
>>
>
> This is rather confusing.  Clearly, ticks_not_accounted isn't actually
> ticks not accounted, it's a different variable entirely which is based
> on the current period.
>
> If it were actually ticks_not_accounted, it should be reset to zero.


hpet_timer_driftfix_reset() is called at two sites before the periodic
timer is actually started via hpet_set_timer(). An interrupt shall be
delivered at the first callback of hpet_timer(), after t->period amount
of time has passed. The ticks are recorded as 'not accounted' until the
interrupt is delivered to the guest. The following message explains why
t->prev_period is needed in addition to t->period and how it is used.

  http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg00275.html

On entry to hpet_timer(), t->ticks_not_accounted is >= t->prev_period,
and is it increased by t->period while the comparator register is being
advanced. This is also the place where we can detect whether the current
callback was delayed. If the period_count is > 1, the delay was so large
that we missed to inject an interrupt (which needs to be compensated).

 while (hpet_time_after(cur_tick, t->cmp)) {
 t->cmp = (uint32_t)(t->cmp + t->period);
+t->ticks_not_accounted += t->period;
+period_count++;
 }

Please note that period_count can be zero. This happens during callbacks
that inject additional interrupts 'inside of' a period interval in order
to compensate missed and coalesced interrupts.

t->ticks_not_accounted is decreased only if an interrupt was delivered
to the guest. If an interrupt could not be delivered, the ticks that are
represented by that interrupt remain recorded as 'not accounted' (this
also triggers compensation of coalesced interrupts).

+if (irq_delivered) {
+t->ticks_not_accounted -= t->prev_period;
+t->prev_period = t->period;
+} else {
+if (period_count) {
+t->irq_rate++;
+t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE);
+}
+}


2. re.:

>> +if (period_count) {
>> +t->divisor = t->irq_rate;
>> +}
>> +diff /= t->divisor--;
>>
>
> Why subtracting from the divisor?  Shouldn't the divisor and irq_rate
> change in lockstep?


t->irq_rate is the rate at which interrupts are delivered to the guest,
relative to the period length. If t->irq_rate is 1, then one interrupt
shall be injected per period interval. If t->irq_rate is > 1, we are in
'compensation mode' (trying to inject additional interrupts 'inside of'
an interval).

- If t->irq_rate is 2, then two interrupts shall be injected during a
  period interval (one regular and one additional).

- If t->irq_rate is 3, then three interrupts shall be injected during a
  period interval (one regular and two additional).

- etc.

A non-zero period_count marks the start of an interval, at which the
divisor is set to t->irq_rate. Let's take a look at an example where
t->divisor = t->irq_rate = 3.

- The current period starts at t(p), the next period starts at t(p+1).
  We are now at t(p). The first additional interrupt shall be injected
  at a(1), the second at a(2). Hence, the next callback is scheduled
  to occur at a(1) = t(p) + diff / 3.

  t(p) a(1) a(2) t(p+1)
  ++++ time
   <--- diff --->
   < diff / 3 >

- We are now in the callback at a(1). A new diff has been calculated,
  which is equal to the remaining time in the interval from a(1) to
  t(p+1). The second additional interrupt shall be injected at a(2).
  Hence, the next callback is scheduled to occur at a(2) = a(1) + diff / 2.

   a(1) a(2) t(p+1)
   +++ time
<-- diff --->
< diff / 2 >

- We are now in the callback at a(2). A new diff has been calculated,
  which is equal to the remaining time in the interval from a(2) to
  t(p+1). The next callback marks the beginning of period t(p+1).

a(2) t(p+1)
++ time
 <-- diff -->
 < diff / 1 >

At t(p), the divisor is set to irq_rate (= 3). diff is divided by 3
and the divisor is decremented by one. At a(1), the divisor is 2.
diff is divided by 2 and

[PATCH v4 2/5] hpet 'driftfix': add driftfix property to HPETState and DeviceInfo

2011-05-09 Thread Ulrich Obergfell
driftfix is a 'bit type' property. Compensation of delayed callbacks
and coalesced interrupts can be enabled with the command line option

-global hpet.driftfix=on

driftfix is 'off' (disabled) by default.

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 6ce07bc..7513065 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -72,6 +72,8 @@ typedef struct HPETState {
 uint64_t isr;   /* interrupt status reg */
 uint64_t hpet_counter;  /* main counter */
 uint8_t  hpet_id;   /* instance id */
+
+uint32_t driftfix;
 } HPETState;
 
 static uint32_t hpet_in_legacy_mode(HPETState *s)
@@ -738,6 +740,7 @@ static SysBusDeviceInfo hpet_device_info = {
 .qdev.props = (Property[]) {
 DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS),
 DEFINE_PROP_BIT("msi", HPETState, flags, HPET_MSI_SUPPORT, false),
+DEFINE_PROP_BIT("driftfix", HPETState, driftfix, 0, false),
 DEFINE_PROP_END_OF_LIST(),
 },
 };
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-09 Thread Ulrich Obergfell
Loss of periodic timer interrupts caused by delayed callbacks and by
interrupt coalescing is compensated by gradually injecting additional
interrupts during subsequent timer intervals, starting at a rate of
one additional interrupt per interval. The injection of additional
interrupts is based on a backlog of unaccounted HPET clock periods
(new HPETTimer field 'ticks_not_accounted'). The backlog increases
due to delayed callbacks and coalesced interrupts, and it decreases
if an interrupt was injected successfully. If the backlog increases
while compensation is still in progress, the rate at which additional
interrupts are injected is increased too. A limit is imposed on the
backlog and on the rate.

Injecting additional timer interrupts to compensate lost interrupts
can alleviate long term time drift. However, on a short time scale,
this method can have the side effect of making virtual machine time
intermittently pass slower and faster than real time (depending on
the guest's time keeping algorithm). Compensation is disabled by
default and can be enabled for guests where this behaviour may be
acceptable.

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |   70 +++-
 1 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index e57c654..519fc6b 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -31,6 +31,7 @@
 #include "hpet_emul.h"
 #include "sysbus.h"
 #include "mc146818rtc.h"
+#include 
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -41,6 +42,9 @@
 
 #define HPET_MSI_SUPPORT0
 
+#define MAX_TICKS_NOT_ACCOUNTED (uint64_t)5 /* 5 sec */
+#define MAX_IRQ_RATE(uint32_t)10
+
 struct HPETState;
 typedef struct HPETTimer {  /* timers */
 uint8_t tn; /*timer number*/
@@ -324,14 +328,35 @@ static const VMStateDescription vmstate_hpet = {
 }
 };
 
+static void hpet_timer_driftfix_reset(HPETTimer *t)
+{
+if (t->state->driftfix && timer_is_periodic(t)) {
+t->ticks_not_accounted = t->prev_period = t->period;
+t->irq_rate = 1;
+t->divisor = 1;
+}
+}
+
+static bool hpet_timer_has_tick_backlog(HPETTimer *t)
+{
+uint64_t backlog = 0;
+
+if (t->ticks_not_accounted >= t->period + t->prev_period) {
+backlog = t->ticks_not_accounted - (t->period + t->prev_period);
+}
+return (backlog >= t->period);
+}
+
 /*
  * timer expiration callback
  */
 static void hpet_timer(void *opaque)
 {
 HPETTimer *t = opaque;
+HPETState *s = t->state;
 uint64_t diff;
-
+int irq_delivered = 0;
+uint32_t period_count = 0;
 uint64_t period = t->period;
 uint64_t cur_tick = hpet_get_ticks(t->state);
 
@@ -339,13 +364,37 @@ static void hpet_timer(void *opaque)
 if (t->config & HPET_TN_32BIT) {
 while (hpet_time_after(cur_tick, t->cmp)) {
 t->cmp = (uint32_t)(t->cmp + t->period);
+t->ticks_not_accounted += t->period;
+period_count++;
 }
 } else {
 while (hpet_time_after64(cur_tick, t->cmp)) {
 t->cmp += period;
+t->ticks_not_accounted += period;
+period_count++;
 }
 }
 diff = hpet_calculate_diff(t, cur_tick);
+if (s->driftfix) {
+if (t->ticks_not_accounted > MAX_TICKS_NOT_ACCOUNTED) {
+t->ticks_not_accounted = t->period + t->prev_period;
+}
+if (hpet_timer_has_tick_backlog(t)) {
+if (t->irq_rate == 1 || period_count > 1) {
+t->irq_rate++;
+t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE);
+}
+if (t->divisor == 0) {
+assert(period_count);
+}
+if (period_count) {
+t->divisor = t->irq_rate;
+}
+diff /= t->divisor--;
+} else {
+t->irq_rate = 1;
+}
+}
 qemu_mod_timer(t->qemu_timer,
qemu_get_clock_ns(vm_clock) + 
(int64_t)ticks_to_ns(diff));
 } else if (t->config & HPET_TN_32BIT && !timer_is_periodic(t)) {
@@ -356,7 +405,22 @@ static void hpet_timer(void *opaque)
 t->wrap_flag = 0;
 }
 }
-update_irq(t, 1);
+if (s->driftfix && timer_is_periodic(t) && period != 0) {
+if (t->ticks_not_accounted >= t->period + t->prev_period) {
+irq_delivered = update_irq(t, 1);
+if (irq_delivered) {
+t->ticks_not_accounted -= t->prev_period;
+t->prev_period = t->period;
+} else {
+if (p

[PATCH v4 4/5] hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only)

2011-05-09 Thread Ulrich Obergfell
update_irq() uses a similar method as in 'rtc_td_hack' to detect
coalesced interrupts. The function entry addresses are retrieved
from 'target_get_irq_delivered' and 'target_reset_irq_delivered'.

This change can be replaced if a generic feedback infrastructure to
track coalesced IRQs for periodic, clock providing devices becomes
available.

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |   13 +++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 7ab6e62..e57c654 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -175,11 +175,12 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, 
uint64_t current)
 }
 }
 
-static void update_irq(struct HPETTimer *timer, int set)
+static int update_irq(struct HPETTimer *timer, int set)
 {
 uint64_t mask;
 HPETState *s;
 int route;
+int irq_delivered = 1;
 
 if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) {
 /* if LegacyReplacementRoute bit is set, HPET specification requires
@@ -204,8 +205,16 @@ static void update_irq(struct HPETTimer *timer, int set)
 qemu_irq_raise(s->irqs[route]);
 } else {
 s->isr &= ~mask;
-qemu_irq_pulse(s->irqs[route]);
+if (s->driftfix) {
+target_reset_irq_delivered();
+qemu_irq_raise(s->irqs[route]);
+irq_delivered = target_get_irq_delivered();
+qemu_irq_lower(s->irqs[route]);
+} else {
+qemu_irq_pulse(s->irqs[route]);
+}
 }
+return irq_delivered;
 }
 
 static void hpet_pre_save(void *opaque)
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription

2011-05-09 Thread Ulrich Obergfell
The new fields in HPETTimer are covered by a separate VMStateDescription
which is a subsection of 'vmstate_hpet_timer'. They are only migrated if

-global hpet.driftfix=on

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |   33 +
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 7513065..7ab6e62 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -55,6 +55,10 @@ typedef struct HPETTimer {  /* timers */
 uint8_t wrap_flag;  /* timer pop will indicate wrap for one-shot 32-bit
  * mode. Next pop will be actual timer expiration.
  */
+uint64_t prev_period;
+uint64_t ticks_not_accounted;
+uint32_t irq_rate;
+uint32_t divisor;
 } HPETTimer;
 
 typedef struct HPETState {
@@ -246,6 +250,27 @@ static int hpet_post_load(void *opaque, int version_id)
 return 0;
 }
 
+static bool hpet_timer_driftfix_vmstate_needed(void *opaque)
+{
+HPETTimer *t = opaque;
+
+return (t->state->driftfix != 0);
+}
+
+static const VMStateDescription vmstate_hpet_timer_driftfix = {
+.name = "hpet_timer_driftfix",
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT64(prev_period, HPETTimer),
+VMSTATE_UINT64(ticks_not_accounted, HPETTimer),
+VMSTATE_UINT32(irq_rate, HPETTimer),
+VMSTATE_UINT32(divisor, HPETTimer),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static const VMStateDescription vmstate_hpet_timer = {
 .name = "hpet_timer",
 .version_id = 1,
@@ -260,6 +285,14 @@ static const VMStateDescription vmstate_hpet_timer = {
 VMSTATE_UINT8(wrap_flag, HPETTimer),
 VMSTATE_TIMER(qemu_timer, HPETTimer),
 VMSTATE_END_OF_LIST()
+},
+.subsections = (VMStateSubsection []) {
+{
+.vmsd = &vmstate_hpet_timer_driftfix,
+.needed = hpet_timer_driftfix_vmstate_needed,
+}, {
+/* empty */
+}
 }
 };
 
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 1/5] hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)

2011-05-09 Thread Ulrich Obergfell
'target_get_irq_delivered' and 'target_reset_irq_delivered' point
to functions that are called by update_irq() to detect coalesced
interrupts. Initially they point to stub functions which pretend
successful interrupt injection. apic code calls two registration
functions to replace the stubs with apic_get_irq_delivered() and
apic_reset_irq_delivered().

This change can be replaced if a generic feedback infrastructure to
track coalesced IRQs for periodic, clock providing devices becomes
available.

Signed-off-by: Ulrich Obergfell 
---
 hw/apic.c |4 
 hw/pc.h   |   13 +
 vl.c  |   13 +
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index a45b57f..94b1d15 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -17,6 +17,7 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>
  */
 #include "hw.h"
+#include "pc.h"
 #include "apic.h"
 #include "ioapic.h"
 #include "qemu-timer.h"
@@ -1143,6 +1144,9 @@ static SysBusDeviceInfo apic_info = {
 
 static void apic_register_devices(void)
 {
+register_target_get_irq_delivered(apic_get_irq_delivered);
+register_target_reset_irq_delivered(apic_reset_irq_delivered);
+
 sysbus_register_withprop(&apic_info);
 }
 
diff --git a/hw/pc.h b/hw/pc.h
index 1291e2d..79885f4 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -7,6 +7,19 @@
 #include "fdc.h"
 #include "net.h"
 
+extern int (*target_get_irq_delivered)(void);
+extern void (*target_reset_irq_delivered)(void);
+
+static inline void register_target_get_irq_delivered(int (*func)(void))
+{
+target_get_irq_delivered = func;
+}
+
+static inline void register_target_reset_irq_delivered(void (*func)(void))
+{
+target_reset_irq_delivered = func;
+}
+
 /* PC-style peripherals (also used by other machines).  */
 
 /* serial.c */
diff --git a/vl.c b/vl.c
index a143250..a2bbc61 100644
--- a/vl.c
+++ b/vl.c
@@ -233,6 +233,19 @@ const char *prom_envs[MAX_PROM_ENVS];
 const char *nvram = NULL;
 int boot_menu;
 
+static int target_get_irq_delivered_stub(void)
+{
+return 1;
+}
+
+static void target_reset_irq_delivered_stub(void)
+{
+return;
+}
+
+int (*target_get_irq_delivered)(void) = target_get_irq_delivered_stub;
+void (*target_reset_irq_delivered)(void) = target_reset_irq_delivered_stub;
+
 typedef struct FWBootEntry FWBootEntry;
 
 struct FWBootEntry {
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 0/5] hpet 'driftfix': alleviate time drift with HPET periodic timers

2011-05-09 Thread Ulrich Obergfell
Hi,

This is version 4 of a series of patches that I originally posted in:

http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01989.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01992.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01991.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01990.html

http://article.gmane.org/gmane.comp.emulators.kvm.devel/69325
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69326
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69327
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69328


Changes since version 3:

 in patch part 1/5 and part 4/5

 -  Added stub functions for 'target_reset_irq_delivered' and
'target_get_irq_delivered'. Added registration functions
that are used by apic code to replace the stubs.

 -  Removed NULL pointer checks from update_irq().

 in patch part 5/5

 -  A minor modification in hpet_timer_has_tick_backlog().

 -  Renamed the local variable 'irq_count' in hpet_timer() to
'period_count'.

 -  Driftfix-related fields in struct 'HPETTimer' are no longer
being initialized/reset in hpet_reset(). Added the function
hpet_timer_driftfix_reset() which is called when the guest

.  sets the 'CFG_ENABLE' bit (overall enable)
   in the General Configuration Register.

.  sets the 'TN_ENABLE' bit (timer N interrupt enable)
   in the Timer N Configuration and Capabilities Register.


Please review and please comment.

Regards,

Uli


Ulrich Obergfell (5):
  hpet 'driftfix': add hooks required to detect coalesced interrupts
(x86 apic only)
  hpet 'driftfix': add driftfix property to HPETState and DeviceInfo
  hpet 'driftfix': add fields to HPETTimer and VMStateDescription
  hpet 'driftfix': add code in update_irq() to detect coalesced
interrupts (x86 apic only)
  hpet 'driftfix': add code in hpet_timer() to compensate delayed
callbacks and coalesced interrupts

 hw/apic.c |4 ++
 hw/hpet.c |  119 ++--
 hw/pc.h   |   13 +++
 vl.c  |   13 +++
 4 files changed, 145 insertions(+), 4 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-05 Thread Ulrich Obergfell

Hi Marcelo,
 
> Other than that, shouldnt reset accounting variables to init state on
> write to GLOBAL_ENABLE_CFG / writes to main counter?

I'd suggest to initialize/reset the driftfix-related fields in the
'HPETTimer' structure (including the backlog of unaccounted ticks)
in the following situations.


- When the guest o/s sets the 'CFG_ENABLE' bit (overall enable) in
  the General Configuration Register.

  This is the place in hpet_ram_writel():

case HPET_CFG:
...
if (activating_bit(old_val, new_val, HPET_CFG_ENABLE)) {
...
for (i = 0; i < s->num_timers; i++) {
if ((&s->timer[i])->cmp != ~0ULL) {
// initialize / reset fields here
hpet_set_timer(&s->timer[i]);


- When the guest o/s sets the 'TN_ENABLE' bit (timer N interrupt
  enable) in the Timer N Configuration and Capabilities Register.

  This is the place in hpet_ram_writel():

case HPET_TN_CFG:
...
if (activating_bit(old_val, new_val, HPET_TN_ENABLE)) {
// initialize / reset fields here
hpet_set_timer(timer);


This should cover cases such as ...

- first time initialization of HPET & timers during guest o/s boot.
  (includes kexec and reboot)

- if a guest o/s stops and restarts the timer.
  (for example, to change the comparator register value or
   to switch a timer between periodic mode and non-periodic mode)


Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-04 Thread Ulrich Obergfell

Hi Marcelo,
 
> Whats prev_period for, since in practice the period will not change
> between interrupts (OS programs comparator once, or perhaps twice
> during bootup) ?

'prev_period' is needed if a guest o/s changes the comparator period
'on the fly' (without stopping and restarting the timer).


 guest o/s changes period
   |
  ti(n-1)  |ti(n)  ti(n+1)
|  v  |  |
+-+--+

 <--- prev_period ---> <-- period -->


The idea is that each timer interrupt represents a certain quantum
of time (the comparator period). If a guest o/s changes the period
between timer interrupt 'n-1' and timer interrupt 'n', I think the
new value should not take effect before timer interrupt 'n'. Timer
interrupt 'n' still represents the old/previous quantum, and timer
interrupt 'n+1' represents the new quantum.

Hence, the patch decrements 'ticks_not_accounted' by 'prev_period'
and sets 'prev_period' to 'period' when an interrupt was delivered
to the guest o/s.

+irq_delivered = update_irq(t, 1);
+if (irq_delivered) {
+t->ticks_not_accounted -= t->prev_period;
+t->prev_period = t->period;
+} else {

Most of the time 'prev_period' is equal to 'period'. It should only
be different in the scenario shown above.


Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v3 1/5] hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)

2011-04-29 Thread Ulrich Obergfell

> On 2011-04-28 20:51, Blue Swirl wrote:
>> On Thu, Apr 28, 2011 at 5:24 PM, Ulrich Obergfell wrote:
>>> 'target_get_irq_delivered' and 'target_reset_irq_delivered' contain
>>> entry addresses of functions that are utilized by update_irq() to
>>> detect coalesced interrupts. apic code loads these pointers during
>>> initialization.
>>
>> I'm not so happy with this approach, but probably then the i386
>> dependencies can be removed from RTC and it can be compiled only once
>> for all targets.
> 
> This whole series is really the minimalistic approach. The callbacks
> defined here must remain a temporary "shortcut". Just like proper
> abstraction of periodic tick compensation for reuse in other timers has
> to be added later on. And the limitation to edge-triggered legacy HPET
> INTs has to be removed.

Since QEMU doesn't have a generic infrastructure to track interrupt
delivery, I decided to reuse something that is currently available.
The only mechanism that I'm aware of is the one that is utilized by
RTC code ('rtc_td_hack'), i.e. apic_get_irq_delivered() etc.

The changes that would be introduced by part 1/5 and part 4/5 of
this patch series could be replaced if a generic infrastructure
to track interrupt delivery becomes available.

Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 2/5] hpet 'driftfix': add driftfix property to HPETState and DeviceInfo

2011-04-28 Thread Ulrich Obergfell
driftfix is a 'bit type' property. Compensation of delayed callbacks
and coalesced interrupts can be enabled with the command line option

-global hpet.driftfix=on

driftfix is 'off' (disabled) by default.

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 6ce07bc..7513065 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -72,6 +72,8 @@ typedef struct HPETState {
 uint64_t isr;   /* interrupt status reg */
 uint64_t hpet_counter;  /* main counter */
 uint8_t  hpet_id;   /* instance id */
+
+uint32_t driftfix;
 } HPETState;
 
 static uint32_t hpet_in_legacy_mode(HPETState *s)
@@ -738,6 +740,7 @@ static SysBusDeviceInfo hpet_device_info = {
 .qdev.props = (Property[]) {
 DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS),
 DEFINE_PROP_BIT("msi", HPETState, flags, HPET_MSI_SUPPORT, false),
+DEFINE_PROP_BIT("driftfix", HPETState, driftfix, 0, false),
 DEFINE_PROP_END_OF_LIST(),
 },
 };
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 4/5] hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only)

2011-04-28 Thread Ulrich Obergfell
update_irq() uses a similar method as in 'rtc_td_hack' to detect
coalesced interrupts. The function entry addresses are retrieved
from 'target_get_irq_delivered' and 'target_reset_irq_delivered'.

This change can be replaced if a generic feedback infrastructure to
track coalesced IRQs for periodic, clock providing devices becomes
available.

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |   15 +--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 7ab6e62..35466ae 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -31,6 +31,7 @@
 #include "hpet_emul.h"
 #include "sysbus.h"
 #include "mc146818rtc.h"
+#include "sysemu.h"
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -175,11 +176,12 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, 
uint64_t current)
 }
 }
 
-static void update_irq(struct HPETTimer *timer, int set)
+static int update_irq(struct HPETTimer *timer, int set)
 {
 uint64_t mask;
 HPETState *s;
 int route;
+int irq_delivered = 1;
 
 if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) {
 /* if LegacyReplacementRoute bit is set, HPET specification requires
@@ -204,8 +206,17 @@ static void update_irq(struct HPETTimer *timer, int set)
 qemu_irq_raise(s->irqs[route]);
 } else {
 s->isr &= ~mask;
-qemu_irq_pulse(s->irqs[route]);
+if (s->driftfix && target_get_irq_delivered
+&& target_reset_irq_delivered) {
+target_reset_irq_delivered();
+qemu_irq_raise(s->irqs[route]);
+irq_delivered = target_get_irq_delivered();
+qemu_irq_lower(s->irqs[route]);
+} else {
+qemu_irq_pulse(s->irqs[route]);
+}
 }
+return irq_delivered;
 }
 
 static void hpet_pre_save(void *opaque)
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription

2011-04-28 Thread Ulrich Obergfell
The new fields in HPETTimer are covered by a separate VMStateDescription
which is a subsection of 'vmstate_hpet_timer'. They are only migrated if

-global hpet.driftfix=on

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |   33 +
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 7513065..7ab6e62 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -55,6 +55,10 @@ typedef struct HPETTimer {  /* timers */
 uint8_t wrap_flag;  /* timer pop will indicate wrap for one-shot 32-bit
  * mode. Next pop will be actual timer expiration.
  */
+uint64_t prev_period;
+uint64_t ticks_not_accounted;
+uint32_t irq_rate;
+uint32_t divisor;
 } HPETTimer;
 
 typedef struct HPETState {
@@ -246,6 +250,27 @@ static int hpet_post_load(void *opaque, int version_id)
 return 0;
 }
 
+static bool hpet_timer_driftfix_vmstate_needed(void *opaque)
+{
+HPETTimer *t = opaque;
+
+return (t->state->driftfix != 0);
+}
+
+static const VMStateDescription vmstate_hpet_timer_driftfix = {
+.name = "hpet_timer_driftfix",
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT64(prev_period, HPETTimer),
+VMSTATE_UINT64(ticks_not_accounted, HPETTimer),
+VMSTATE_UINT32(irq_rate, HPETTimer),
+VMSTATE_UINT32(divisor, HPETTimer),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static const VMStateDescription vmstate_hpet_timer = {
 .name = "hpet_timer",
 .version_id = 1,
@@ -260,6 +285,14 @@ static const VMStateDescription vmstate_hpet_timer = {
 VMSTATE_UINT8(wrap_flag, HPETTimer),
 VMSTATE_TIMER(qemu_timer, HPETTimer),
 VMSTATE_END_OF_LIST()
+},
+.subsections = (VMStateSubsection []) {
+{
+.vmsd = &vmstate_hpet_timer_driftfix,
+.needed = hpet_timer_driftfix_vmstate_needed,
+}, {
+/* empty */
+}
 }
 };
 
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 0/5] hpet 'driftfix': alleviate time drift with HPET periodic timers

2011-04-28 Thread Ulrich Obergfell
Hi,

This is version 3 of a series of patches that I originally posted in:

http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01989.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01992.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01991.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01990.html

http://article.gmane.org/gmane.comp.emulators.kvm.devel/69325
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69326
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69327
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69328

Changes since version 2:

 -  The vmstate related to 'driftfix' is now in a separate subsection
that can be migrated conditionally.

 -  The new fields of the HPETTimer structure are no longer initialized
in hpet_init(). This is now done in hpet_reset() only.

 -  Compensation of lost timer interrupts is now based on a backlog of
unaccounted HPET clock periods instead of 'irqs_to_inject'. This
eliminates the need to scale 'irqs_to_inject' when a guest o/s
modifies the comparator register value.

Please review and please comment.

Regards,

Uli


Ulrich Obergfell (5):
  hpet 'driftfix': add hooks required to detect coalesced interrupts
(x86 apic only)
  hpet 'driftfix': add driftfix property to HPETState and DeviceInfo
  hpet 'driftfix': add fields to HPETTimer and VMStateDescription
  hpet 'driftfix': add code in update_irq() to detect coalesced
interrupts (x86 apic only)
  hpet 'driftfix': add code in hpet_timer() to compensate delayed
callbacks and coalesced interrupts

 hw/apic.c |4 ++
 hw/hpet.c |  114 ++--
 sysemu.h  |3 ++
 vl.c  |3 ++
 4 files changed, 120 insertions(+), 4 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 1/5] hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)

2011-04-28 Thread Ulrich Obergfell
'target_get_irq_delivered' and 'target_reset_irq_delivered' contain
entry addresses of functions that are utilized by update_irq() to
detect coalesced interrupts. apic code loads these pointers during
initialization.

This change can be replaced if a generic feedback infrastructure to
track coalesced IRQs for periodic, clock providing devices becomes
available.

Signed-off-by: Ulrich Obergfell 
---
 hw/apic.c |4 
 sysemu.h  |3 +++
 vl.c  |3 +++
 3 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index a45b57f..eb0f6d9 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -24,6 +24,7 @@
 #include "sysbus.h"
 #include "trace.h"
 #include "kvm.h"
+#include "sysemu.h"
 
 /* APIC Local Vector Table */
 #define APIC_LVT_TIMER   0
@@ -1143,6 +1144,9 @@ static SysBusDeviceInfo apic_info = {
 
 static void apic_register_devices(void)
 {
+target_get_irq_delivered = apic_get_irq_delivered;
+target_reset_irq_delivered = apic_reset_irq_delivered;
+
 sysbus_register_withprop(&apic_info);
 }
 
diff --git a/sysemu.h b/sysemu.h
index 07d85cd..75b0139 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -98,6 +98,9 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
 int qemu_loadvm_state(QEMUFile *f);
 
+extern int (*target_get_irq_delivered)(void);
+extern void (*target_reset_irq_delivered)(void);
+
 /* SLIRP */
 void do_info_slirp(Monitor *mon);
 
diff --git a/vl.c b/vl.c
index 0c24e07..7bab315 100644
--- a/vl.c
+++ b/vl.c
@@ -233,6 +233,9 @@ const char *prom_envs[MAX_PROM_ENVS];
 const char *nvram = NULL;
 int boot_menu;
 
+int (*target_get_irq_delivered)(void) = 0;
+void (*target_reset_irq_delivered)(void) = 0;
+
 typedef struct FWBootEntry FWBootEntry;
 
 struct FWBootEntry {
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-04-28 Thread Ulrich Obergfell
Loss of periodic timer interrupts caused by delayed callbacks and by
interrupt coalescing is compensated by gradually injecting additional
interrupts during subsequent timer intervals, starting at a rate of
one additional interrupt per interval. The injection of additional
interrupts is based on a backlog of unaccounted HPET clock periods
(new HPETTimer field 'ticks_not_accounted'). The backlog increases
due to delayed callbacks and coalesced interrupts, and it decreases
if an interrupt was injected successfully. If the backlog increases
while compensation is still in progress, the rate at which additional
interrupts are injected is increased too. A limit is imposed on the
backlog and on the rate.

Injecting additional timer interrupts to compensate lost interrupts
can alleviate long term time drift. However, on a short time scale,
this method can have the side effect of making virtual machine time
intermittently pass slower and faster than real time (depending on
the guest's time keeping algorithm). Compensation is disabled by
default and can be enabled for guests where this behaviour may be
acceptable.

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |   63 +++-
 1 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 35466ae..92d5f58 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -32,6 +32,7 @@
 #include "sysbus.h"
 #include "mc146818rtc.h"
 #include "sysemu.h"
+#include 
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -42,6 +43,9 @@
 
 #define HPET_MSI_SUPPORT0
 
+#define MAX_TICKS_NOT_ACCOUNTED (uint64_t)5 /* 5 sec */
+#define MAX_IRQ_RATE(uint32_t)10
+
 struct HPETState;
 typedef struct HPETTimer {  /* timers */
 uint8_t tn; /*timer number*/
@@ -326,28 +330,63 @@ static const VMStateDescription vmstate_hpet = {
 }
 };
 
+static bool hpet_timer_has_tick_backlog(HPETTimer *t)
+{
+uint64_t backlog = t->ticks_not_accounted - (t->period + t->prev_period);
+return (backlog >= t->period);
+}
+
 /*
  * timer expiration callback
  */
 static void hpet_timer(void *opaque)
 {
 HPETTimer *t = opaque;
+HPETState *s = t->state;
 uint64_t diff;
-
+int irq_delivered = 0;
+uint32_t irq_count = 0;
 uint64_t period = t->period;
 uint64_t cur_tick = hpet_get_ticks(t->state);
 
+if (s->driftfix && !t->ticks_not_accounted) {
+t->ticks_not_accounted = t->prev_period = t->period;
+}
 if (timer_is_periodic(t) && period != 0) {
 if (t->config & HPET_TN_32BIT) {
 while (hpet_time_after(cur_tick, t->cmp)) {
 t->cmp = (uint32_t)(t->cmp + t->period);
+t->ticks_not_accounted += t->period;
+irq_count++;
 }
 } else {
 while (hpet_time_after64(cur_tick, t->cmp)) {
 t->cmp += period;
+t->ticks_not_accounted += period;
+irq_count++;
 }
 }
 diff = hpet_calculate_diff(t, cur_tick);
+if (s->driftfix) {
+if (t->ticks_not_accounted > MAX_TICKS_NOT_ACCOUNTED) {
+t->ticks_not_accounted = t->period + t->prev_period;
+}
+if (hpet_timer_has_tick_backlog(t)) {
+if (t->irq_rate == 1 || irq_count > 1) {
+t->irq_rate++;
+t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE);
+}
+if (t->divisor == 0) {
+assert(irq_count);
+}
+if (irq_count) {
+t->divisor = t->irq_rate;
+}
+diff /= t->divisor--;
+} else {
+t->irq_rate = 1;
+}
+}
 qemu_mod_timer(t->qemu_timer,
qemu_get_clock_ns(vm_clock) + 
(int64_t)ticks_to_ns(diff));
 } else if (t->config & HPET_TN_32BIT && !timer_is_periodic(t)) {
@@ -358,7 +397,22 @@ static void hpet_timer(void *opaque)
 t->wrap_flag = 0;
 }
 }
-update_irq(t, 1);
+if (s->driftfix && timer_is_periodic(t) && period != 0) {
+if (t->ticks_not_accounted >= t->period + t->prev_period) {
+irq_delivered = update_irq(t, 1);
+if (irq_delivered) {
+t->ticks_not_accounted -= t->prev_period;
+t->prev_period = t->period;
+} else {
+if (irq_count) {
+t->irq_rate++;
+t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE);
+}
+}
+}
+} else {
+update_irq(t, 1);
+}
 }
 
 static void hpet_set_timer(HPE

Re: [Qemu-devel] [PATCH v2 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription

2011-04-11 Thread Ulrich Obergfell

>> vmstate_hpet_timer = {
>>   VMSTATE_UINT64(fsb, HPETTimer),
>>   VMSTATE_UINT64(period, HPETTimer),
>>   VMSTATE_UINT8(wrap_flag, HPETTimer),
>> + VMSTATE_UINT64_V(saved_period, HPETTimer, 3),
>> + VMSTATE_UINT64_V(ticks_not_accounted, HPETTimer, 3),
>> + VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3),
>> + VMSTATE_UINT32_V(irq_rate, HPETTimer, 3),
>> + VMSTATE_UINT32_V(divisor, HPETTimer, 3),
>
> We ought to be able to use a subsection keyed off of whether any ticks
> are currently accumulated, no?


Anthony,

I'm not sure if I understand your question correctly. Are you suggesting
to migrate the driftfix-related state conditionally / only if there are
any ticks accumulated in 'ticks_not_accounted' and 'irqs_to_inject' ?

The size of the driftfix-related state is 28 bytes per timer and we have
32 timers per HPETState, i.e. 896 additional bytes per HPETState. With a
maximum number of 8 HPET blocks (HPETState), this amounts to 7168 bytes.
Hence, unconditional migration of the driftfix-related state should not
cause significant additional overhead.

Maybe I missed something. Could you please explain which benefit you see
in using a subsection ?


Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v2 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription

2011-04-11 Thread Ulrich Obergfell

>>   typedef struct HPETState {
>> @@ -248,7 +253,7 @@ static int hpet_post_load(void *opaque, int
>> version_id)
>>
>>   static const VMStateDescription vmstate_hpet_timer = {
>>   .name = "hpet_timer",
>> - .version_id = 1,
>> + .version_id = 3,
>
> Why jump from 1 to 3?
> 
>>   .minimum_version_id = 1,
>>   .minimum_version_id_old = 1,
>>   .fields = (VMStateField []) {
>> @@ -258,6 +263,11 @@ static const VMStateDescription
>> vmstate_hpet_timer = {
>>   VMSTATE_UINT64(fsb, HPETTimer),
>>   VMSTATE_UINT64(period, HPETTimer),
>>   VMSTATE_UINT8(wrap_flag, HPETTimer),
>> + VMSTATE_UINT64_V(saved_period, HPETTimer, 3),
>> + VMSTATE_UINT64_V(ticks_not_accounted, HPETTimer, 3),
>> + VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3),
>> + VMSTATE_UINT32_V(irq_rate, HPETTimer, 3),
>> + VMSTATE_UINT32_V(divisor, HPETTimer, 3),


Anthony,

I incremented the version ID of 'vmstate_hpet' from 2 to 3 to make sure
that migrations from a QEMU process that is capable of 'driftfix' to a
QEMU process that is _not_ capable of 'driftfix' will fail. I assigned
version ID 3 to 'vmstate_hpet_timer' and to the new fields in there too
to indicate that adding those fields was the reason why the version ID
of 'vmstate_hpet' was incremented to 3.

As far as the flow of execution in vmstate_load_state() is concerned, I
think it does not matter whether the version ID of 'vmstate_hpet_timer'
and the new fields in there is 2 or 3 (as long as they are consistent).
When the 'while(field->name)' loop in vmstate_load_state() gets to the
following field in 'vmstate_hpet' ...

VMSTATE_STRUCT_VARRAY_UINT8(timer, HPETState, num_timers, 0,
vmstate_hpet_timer, HPETTimer),

... it calls itself recursively ...

if (field->flags & VMS_STRUCT) {
ret = vmstate_load_state(f, field->vmsd, addr, field->vmsd->version_id);

'field->vmsd->version_id' is the version ID of 'vmstate_hpet_timer' [1].
Hence 'vmstate_hpet_timer.version_id' is being checked against itself ...

if (version_id > vmsd->version_id) {
return -EINVAL;
}

... and the version IDs of the new fields are also being checked against
'vmstate_hpet_timer.version_id' ...

if ((field->field_exists &&
 field->field_exists(opaque, version_id)) ||
(!field->field_exists &&
 field->version_id <= version_id)) {

If you want me to change the version ID of 'vmstate_hpet_timer' and the
new fields in there from 3 to 2, I can do that.


Regards,

Uli


[1] Ref.: commit fa3aad24d94a6cf894db52d83f72a399324a17bb
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription

2011-04-08 Thread Ulrich Obergfell

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |   14 --
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 45847ed..c150da5 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -55,6 +55,11 @@ typedef struct HPETTimer {  /* timers */
 uint8_t wrap_flag;  /* timer pop will indicate wrap for one-shot 32-bit
  * mode. Next pop will be actual timer expiration.
  */
+uint64_t saved_period;
+uint64_t ticks_not_accounted;
+uint32_t irqs_to_inject;
+uint32_t irq_rate;
+uint32_t divisor;
 } HPETTimer;
 
 typedef struct HPETState {
@@ -248,7 +253,7 @@ static int hpet_post_load(void *opaque, int version_id)
 
 static const VMStateDescription vmstate_hpet_timer = {
 .name = "hpet_timer",
-.version_id = 1,
+.version_id = 3,
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
 .fields  = (VMStateField []) {
@@ -258,6 +263,11 @@ static const VMStateDescription vmstate_hpet_timer = {
 VMSTATE_UINT64(fsb, HPETTimer),
 VMSTATE_UINT64(period, HPETTimer),
 VMSTATE_UINT8(wrap_flag, HPETTimer),
+VMSTATE_UINT64_V(saved_period, HPETTimer, 3),
+VMSTATE_UINT64_V(ticks_not_accounted, HPETTimer, 3),
+VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3),
+VMSTATE_UINT32_V(irq_rate, HPETTimer, 3),
+VMSTATE_UINT32_V(divisor, HPETTimer, 3),
 VMSTATE_TIMER(qemu_timer, HPETTimer),
 VMSTATE_END_OF_LIST()
 }
@@ -265,7 +275,7 @@ static const VMStateDescription vmstate_hpet_timer = {
 
 static const VMStateDescription vmstate_hpet = {
 .name = "hpet",
-.version_id = 2,
+.version_id = 3,
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
 .pre_save = hpet_pre_save,
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 4/5] hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only)

2011-04-08 Thread Ulrich Obergfell
update_irq() uses a similar method as in 'rtc_td_hack' to detect
coalesced interrupts. The function entry addresses are retrieved
from 'target_get_irq_delivered' and 'target_reset_irq_delivered'.

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |   15 +--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index c150da5..b5625fb 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -31,6 +31,7 @@
 #include "hpet_emul.h"
 #include "sysbus.h"
 #include "mc146818rtc.h"
+#include "sysemu.h"
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -176,11 +177,12 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, 
uint64_t current)
 }
 }
 
-static void update_irq(struct HPETTimer *timer, int set)
+static int update_irq(struct HPETTimer *timer, int set)
 {
 uint64_t mask;
 HPETState *s;
 int route;
+int irq_delivered = 1;
 
 if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) {
 /* if LegacyReplacementRoute bit is set, HPET specification requires
@@ -205,8 +207,17 @@ static void update_irq(struct HPETTimer *timer, int set)
 qemu_irq_raise(s->irqs[route]);
 } else {
 s->isr &= ~mask;
-qemu_irq_pulse(s->irqs[route]);
+if (s->driftfix && target_get_irq_delivered
+&& target_reset_irq_delivered) {
+target_reset_irq_delivered();
+qemu_irq_raise(s->irqs[route]);
+irq_delivered = target_get_irq_delivered();
+qemu_irq_lower(s->irqs[route]);
+} else {
+qemu_irq_pulse(s->irqs[route]);
+}
 }
+return irq_delivered;
 }
 
 static void hpet_pre_save(void *opaque)
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/5] hpet 'driftfix': add driftfix property to HPETState and DeviceInfo

2011-04-08 Thread Ulrich Obergfell
driftfix is a 'bit type' property. Compensation of delayed callbacks
and coalesced interrupts can be enabled with the command line option

-global hpet.driftfix=on

driftfix is 'off' (disabled) by default.

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index e92b775..45847ed 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -72,6 +72,8 @@ typedef struct HPETState {
 uint64_t isr;   /* interrupt status reg */
 uint64_t hpet_counter;  /* main counter */
 uint8_t  hpet_id;   /* instance id */
+
+uint32_t driftfix;
 } HPETState;
 
 static uint32_t hpet_in_legacy_mode(HPETState *s)
@@ -738,6 +740,7 @@ static SysBusDeviceInfo hpet_device_info = {
 .qdev.props = (Property[]) {
 DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS),
 DEFINE_PROP_BIT("msi", HPETState, flags, HPET_MSI_SUPPORT, false),
+DEFINE_PROP_BIT("driftfix", HPETState, driftfix, 0, false),
 DEFINE_PROP_END_OF_LIST(),
 },
 };
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-04-08 Thread Ulrich Obergfell
Loss of periodic timer interrupts caused by delayed callbacks and by
interrupt coalescing is compensated by gradually injecting additional
interrupts during subsequent timer intervals, starting at a rate of
one additional interrupt per interval. If further interrupts are lost
while compensation is in progress, the rate is increased. A limit is
imposed on the rate and on the 'backlog' of lost interrupts that are
to be injected. If a guest o/s modifies the comparator register value
while compensation is in progress, the 'backlog' of lost interrupts
that are to be injected is scaled to the new value.

Injecting additional timer interrupts to compensate lost interrupts
can alleviate long term time drift. However, on a short time scale,
this method can have the side effect of making virtual machine time
intermittently pass faster than real time (depending on the guest's
time keeping algorithm). Compensation is disabled by default and can
be enabled for guests where this behaviour is acceptable.

Signed-off-by: Ulrich Obergfell 
---
 hw/hpet.c |   58 +-
 1 files changed, 57 insertions(+), 1 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index b5625fb..5a25a96 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -42,6 +42,9 @@
 
 #define HPET_MSI_SUPPORT0
 
+#define MAX_IRQS_TO_INJECT (uint32_t)5000
+#define MAX_IRQ_RATE   (uint32_t)10
+
 struct HPETState;
 typedef struct HPETTimer {  /* timers */
 uint8_t tn; /*timer number*/
@@ -309,8 +312,11 @@ static const VMStateDescription vmstate_hpet = {
 static void hpet_timer(void *opaque)
 {
 HPETTimer *t = opaque;
+HPETState *s = t->state;
 uint64_t diff;
 
+int irq_delivered = 0;
+uint32_t irq_count = 0;
 uint64_t period = t->period;
 uint64_t cur_tick = hpet_get_ticks(t->state);
 
@@ -318,13 +324,36 @@ static void hpet_timer(void *opaque)
 if (t->config & HPET_TN_32BIT) {
 while (hpet_time_after(cur_tick, t->cmp)) {
 t->cmp = (uint32_t)(t->cmp + t->period);
+irq_count++;
 }
 } else {
 while (hpet_time_after64(cur_tick, t->cmp)) {
 t->cmp += period;
+irq_count++;
 }
 }
 diff = hpet_calculate_diff(t, cur_tick);
+if (s->driftfix) {
+if (t->saved_period != t->period) {
+uint64_t tmp = (t->irqs_to_inject * t->saved_period)
+  + t->ticks_not_accounted;
+t->irqs_to_inject = tmp / t->period;
+t->ticks_not_accounted = tmp % t->period;
+t->saved_period = t->period;
+}
+t->irqs_to_inject += irq_count;
+t->irqs_to_inject = MIN(t->irqs_to_inject, MAX_IRQS_TO_INJECT);
+if (t->irqs_to_inject > 1) {
+if (irq_count > 1) {
+t->irq_rate++;
+t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE);
+}
+if (irq_count || t->divisor == 0) {
+t->divisor = t->irq_rate;
+}
+diff /= t->divisor--;
+}
+}
 qemu_mod_timer(t->qemu_timer,
qemu_get_clock(vm_clock) + (int64_t)ticks_to_ns(diff));
 } else if (t->config & HPET_TN_32BIT && !timer_is_periodic(t)) {
@@ -335,7 +364,22 @@ static void hpet_timer(void *opaque)
 t->wrap_flag = 0;
 }
 }
-update_irq(t, 1);
+if (s->driftfix && timer_is_periodic(t) && period != 0) {
+if (t->irqs_to_inject) {
+irq_delivered = update_irq(t, 1);
+if (irq_delivered) {
+t->irq_rate = MIN(t->irq_rate, t->irqs_to_inject);
+t->irqs_to_inject--;
+} else {
+if (irq_count) {
+t->irq_rate++;
+t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE);
+}
+}
+}
+} else {
+update_irq(t, 1);
+}
 }
 
 static void hpet_set_timer(HPETTimer *t)
@@ -674,6 +718,12 @@ static void hpet_reset(DeviceState *d)
 timer->config |=  0x0004ULL << 32;
 timer->period = 0ULL;
 timer->wrap_flag = 0;
+
+timer->saved_period = 0;
+timer->ticks_not_accounted = 0;
+timer->irqs_to_inject = 0;
+timer->irq_rate = 1;
+timer->divisor = 1;
 }
 
 s->hpet_counter = 0ULL;
@@ -734,6 +784,12 @@ static int hpet_init(SysBusDevice *dev)
 timer->qemu_timer = qemu_new_timer(vm_clock, hpet_timer, timer);
 timer->tn = i;
 timer->state = s;
+
+timer->saved_period = 0;
+

[PATCH v2 1/5] hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)

2011-04-08 Thread Ulrich Obergfell
'target_get_irq_delivered' and 'target_reset_irq_delivered' contain
entry addresses of functions that are utilized by update_irq() to
detect coalesced interrupts. apic code loads these pointers during
initialization.

Signed-off-by: Ulrich Obergfell 
---
 hw/apic.c |4 
 sysemu.h  |3 +++
 vl.c  |3 +++
 3 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 00907e0..44d8cb3 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -24,6 +24,7 @@
 #include "sysbus.h"
 #include "trace.h"
 #include "kvm.h"
+#include "sysemu.h"
 
 /* APIC Local Vector Table */
 #define APIC_LVT_TIMER   0
@@ -1143,6 +1144,9 @@ static SysBusDeviceInfo apic_info = {
 
 static void apic_register_devices(void)
 {
+target_get_irq_delivered = apic_get_irq_delivered;
+target_reset_irq_delivered = apic_reset_irq_delivered;
+
 sysbus_register_withprop(&apic_info);
 }
 
diff --git a/sysemu.h b/sysemu.h
index 3f7e17e..b5d4f90 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -96,6 +96,9 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
 int qemu_loadvm_state(QEMUFile *f);
 
+extern int (*target_get_irq_delivered)(void);
+extern void (*target_reset_irq_delivered)(void);
+
 /* SLIRP */
 void do_info_slirp(Monitor *mon);
 
diff --git a/vl.c b/vl.c
index 8bcf2ae..548912a 100644
--- a/vl.c
+++ b/vl.c
@@ -231,6 +231,9 @@ const char *prom_envs[MAX_PROM_ENVS];
 const char *nvram = NULL;
 int boot_menu;
 
+int (*target_get_irq_delivered)(void) = 0;
+void (*target_reset_irq_delivered)(void) = 0;
+
 typedef struct FWBootEntry FWBootEntry;
 
 struct FWBootEntry {
-- 
1.6.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/5] hpet 'driftfix': alleviate time drift with HPET periodic timers

2011-04-08 Thread Ulrich Obergfell
Hi,

This is version 2 of a series of patches that I originally posted in:

http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01989.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01992.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01991.html
http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01990.html

http://article.gmane.org/gmane.comp.emulators.kvm.devel/69325
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69326
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69327
http://article.gmane.org/gmane.comp.emulators.kvm.devel/69328

I implemented the proposed changes (remove configurability and ifdefs,
define 'driftfix' as a property of the HPET device) and I've split the
patch into smaller parts (originally 3, now 5). I also included a fix
related to scaling 't->irqs_to_inject' when the guest o/s modifies the
comparator register value.

Please review and please comment.

Regards,

Uli


Ulrich Obergfell (5):
  hpet 'driftfix': add hooks required to detect coalesced interrupts
(x86 apic only)
  hpet 'driftfix': add driftfix property to HPETState and DeviceInfo
  hpet 'driftfix': add fields to HPETTimer and VMStateDescription
  hpet 'driftfix': add code in update_irq() to detect coalesced
interrupts (x86 apic only)
  hpet 'driftfix': add code in hpet_timer() to compensate delayed
callbacks and coalesced interrupts

 hw/apic.c |4 +++
 hw/hpet.c |   90 +---
 sysemu.h  |3 ++
 vl.c  |3 ++
 4 files changed, 95 insertions(+), 5 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] alleviate time drift with HPET periodic timers

2011-03-22 Thread Ulrich Obergfell

>> Part 3 of the patch implements the following options for the
>> 'configure' script.
>>
>> --disable-hpet-driftfix
>> --enable-hpet-driftfix
> 
> I see no benefit in this configurability. Just make the driftfix
> unconditionally available, runtime-disabled by default for now until it
> matured and there is no downside in enabling it all the time.


Many Thanks Jan,

I enclosed the code in '#ifdef CONFIG_HPET_DRIFTFIX ... #endif'
so that it can be easily identified (and removed if the generic API
would be implemented some day). Since the ifdef's are already there
I added the configuration option for convenience. As you don't see
any benefit in this option, I can remove that part of the patch.
However, I'd suggest to keep the ifdef's and do the following:

- Rename to '#ifdef HPET_DRIFTFIX ... #endif' to make it clear that
  this is not controlled via a configuration option.

- Add '#define HPET_DRIFTFIX' to hw/hpet_emul.h.

Do you agree ?


Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] alleviate time drift with HPET periodic timers

2011-03-22 Thread Ulrich Obergfell

>> Part 1 of the patch implements the following QEMU command line option.
>>
>> -hpet [device=none|present][,driftfix=none|slew]
> 
> Just define driftfix as property of the hpet device. That way it can be
> controlled both globally (-global hpet.driftfix=...) and per hpet block
> (once we support instantiating >1 of them).


Many Thanks Jan,

I started investigating code changes. I'm thinking of ...

- adding a new field to the HPETState structure.

uint32_t driftfix;

- adding the property 'driftfix' to the DeviceInfo structure.

DEFINE_PROP_BIT("driftfix", HPETState, driftfix, 0, false)

  Using a single bit so that the option syntax would be, e.g.:

-global hpet.driftfix=on (Default is 'off')

- Replace all 'if (hpet_driftfix ...)' by:

if ((HPETState)s->driftfix ...)


Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] alleviate time drift with HPET periodic timers

2011-03-18 Thread Ulrich Obergfell

Part 3 of the patch implements the following options for the 'configure' script.

--disable-hpet-driftfix
--enable-hpet-driftfix

Signed-off-by: Ulrich Obergfell 


diff -up ./configure.orig3 ./configure
--- ./configure.orig3   2011-02-18 22:48:06.0 +0100
+++ ./configure 2011-03-13 12:43:44.870966181 +0100
@@ -140,6 +140,7 @@ linux_aio=""
 attr=""
 vhost_net=""
 xfs=""
+hpet_driftfix="no"
 
 gprof="no"
 debug_tcg="no"
@@ -749,6 +750,10 @@ for opt do
   ;;
   --enable-rbd) rbd="yes"
   ;;
+  --disable-hpet-driftfix) hpet_driftfix="no"
+  ;;
+  --enable-hpet-driftfix) hpet_driftfix="yes"
+  ;;
   *) echo "ERROR: unknown option $opt"; show_help="yes"
   ;;
   esac
@@ -2602,6 +2607,7 @@ echo "Trace output file $trace_file- Your SDL version is too old - please upgrade to have SDL support"
@@ -2893,6 +2899,10 @@ if test "$rbd" = "yes" ; then
   echo "CONFIG_RBD=y" >> $config_host_mak
 fi
 
+if test "$hpet_driftfix" = "yes" ; then
+  echo "CONFIG_HPET_DRIFTFIX=y" >> $config_host_mak
+fi
+
 # USB host support
 case "$usb" in
 linux)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] alleviate time drift with HPET periodic timers

2011-03-18 Thread Ulrich Obergfell

Part 2 of the patch compensates loss of timer interrupts caused by ...

-  delayed timer callback.
-  interrupt coalescing (x86 only).

Lost timer interrupts are compensated by gradually injecting additional
interrupts during the subsequent timer intervals, starting at a rate of
one additional interrupt per interval. If further interrupts are lost
while compensation is still in progress, the rate is increased. A limit
is imposed on the rate and on the 'backlog' of lost interrupts that are
to be injected.

Signed-off-by: Ulrich Obergfell 


diff -up ./hw/apic.c.orig2 ./hw/apic.c
--- ./hw/apic.c.orig2   2011-02-18 22:48:06.0 +0100
+++ ./hw/apic.c 2011-03-13 12:40:56.110991702 +0100
@@ -24,6 +24,9 @@
 #include "sysbus.h"
 #include "trace.h"
 #include "kvm.h"
+#ifdef CONFIG_HPET_DRIFTFIX
+#include "sysemu.h"
+#endif
 
 /* APIC Local Vector Table */
 #define APIC_LVT_TIMER   0
@@ -1143,6 +1146,10 @@ static SysBusDeviceInfo apic_info = {
 
 static void apic_register_devices(void)
 {
+#ifdef CONFIG_HPET_DRIFTFIX
+qemu_get_irq_delivered = apic_get_irq_delivered;
+qemu_reset_irq_delivered = apic_reset_irq_delivered;
+#endif
 sysbus_register_withprop(&apic_info);
 }
 
diff -up ./hw/hpet.c.orig2 ./hw/hpet.c
--- ./hw/hpet.c.orig2   2011-02-18 22:48:06.0 +0100
+++ ./hw/hpet.c 2011-03-13 12:40:56.111991655 +0100
@@ -31,6 +31,9 @@
 #include "hpet_emul.h"
 #include "sysbus.h"
 #include "mc146818rtc.h"
+#ifdef CONFIG_HPET_DRIFTFIX
+#include "sysemu.h"
+#endif
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -41,6 +44,13 @@
 
 #define HPET_MSI_SUPPORT0
 
+#ifdef CONFIG_HPET_DRIFTFIX
+#define MAX_IRQS_TO_INJECT (uint32_t)5000
+#define MAX_IRQ_RATE   (uint32_t)10
+
+extern int hpet_driftfix;
+#endif
+
 struct HPETState;
 typedef struct HPETTimer {  /* timers */
 uint8_t tn; /*timer number*/
@@ -55,6 +65,12 @@ typedef struct HPETTimer {  /* timers */
 uint8_t wrap_flag;  /* timer pop will indicate wrap for one-shot 32-bit
  * mode. Next pop will be actual timer expiration.
  */
+#ifdef CONFIG_HPET_DRIFTFIX
+uint64_t saved_period;
+uint32_t irqs_to_inject;
+uint32_t irq_rate;
+uint32_t divisor;
+#endif
 } HPETTimer;
 
 typedef struct HPETState {
@@ -169,11 +185,12 @@ static inline uint64_t hpet_calculate_di
 }
 }
 
-static void update_irq(struct HPETTimer *timer, int set)
+static int update_irq(struct HPETTimer *timer, int set)
 {
 uint64_t mask;
 HPETState *s;
 int route;
+int irq_delivered = 1;
 
 if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) {
 /* if LegacyReplacementRoute bit is set, HPET specification requires
@@ -198,8 +215,19 @@ static void update_irq(struct HPETTimer 
 qemu_irq_raise(s->irqs[route]);
 } else {
 s->isr &= ~mask;
+#ifdef CONFIG_HPET_DRIFTFIX
+if (hpet_driftfix && qemu_get_irq_delivered
+  && qemu_reset_irq_delivered) {
+qemu_reset_irq_delivered();
+qemu_irq_raise(s->irqs[route]);
+irq_delivered = qemu_get_irq_delivered();
+qemu_irq_lower(s->irqs[route]);
+}
+else
+#endif
 qemu_irq_pulse(s->irqs[route]);
 }
+return irq_delivered;
 }
 
 static void hpet_pre_save(void *opaque)
@@ -246,7 +274,11 @@ static int hpet_post_load(void *opaque, 
 
 static const VMStateDescription vmstate_hpet_timer = {
 .name = "hpet_timer",
+#ifdef CONFIG_HPET_DRIFTFIX
+.version_id = 3,
+#else
 .version_id = 1,
+#endif
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
 .fields  = (VMStateField []) {
@@ -256,6 +288,12 @@ static const VMStateDescription vmstate_
 VMSTATE_UINT64(fsb, HPETTimer),
 VMSTATE_UINT64(period, HPETTimer),
 VMSTATE_UINT8(wrap_flag, HPETTimer),
+#ifdef CONFIG_HPET_DRIFTFIX
+VMSTATE_UINT64_V(saved_period, HPETTimer, 3),
+VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3),
+VMSTATE_UINT32_V(irq_rate, HPETTimer, 3),
+VMSTATE_UINT32_V(divisor, HPETTimer, 3),
+#endif
 VMSTATE_TIMER(qemu_timer, HPETTimer),
 VMSTATE_END_OF_LIST()
 }
@@ -263,7 +301,11 @@ static const VMStateDescription vmstate_
 
 static const VMStateDescription vmstate_hpet = {
 .name = "hpet",
+#ifdef CONFIG_HPET_DRIFTFIX
+.version_id = 3,
+#else
 .version_id = 2,
+#endif
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
 .pre_save = hpet_pre_save,
@@ -287,7 +329,10 @@ static void hpet_timer(void *opaque)
 {
 HPETTimer *t = opaque;
 uint64_t diff;
-
+#ifdef CONFIG_HPET_DRIFTFIX
+uint32_t irq_count = 0;
+int irq_delivered = 0;
+#endif
 uint64_t period = t->period;
 uint64_t cur_tick = hpet_get_ticks(t->state);
 
@@ -295,1

[PATCH 1/3] alleviate time drift with HPET periodic timers

2011-03-18 Thread Ulrich Obergfell

Part 1 of the patch implements the following QEMU command line option.

-hpet [device=none|present][,driftfix=none|slew]

Signed-off-by: Ulrich Obergfell 


diff -up ./qemu-config.c.orig1 ./qemu-config.c
--- ./qemu-config.c.orig1   2011-02-18 22:48:06.0 +0100
+++ ./qemu-config.c 2011-03-13 12:38:22.813976639 +0100
@@ -261,6 +261,23 @@ static QemuOptsList qemu_rtc_opts = {
 },
 };
 
+#ifdef CONFIG_HPET_DRIFTFIX
+static QemuOptsList qemu_hpet_opts = {
+.name = "hpet",
+.head = QTAILQ_HEAD_INITIALIZER(qemu_hpet_opts.head),
+.desc = {
+{
+.name = "device",
+.type = QEMU_OPT_STRING,
+},{
+.name = "driftfix",
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+#endif
+
 static QemuOptsList qemu_global_opts = {
 .name = "global",
 .head = QTAILQ_HEAD_INITIALIZER(qemu_global_opts.head),
@@ -462,6 +479,9 @@ static QemuOptsList *vm_config_groups[32
 &qemu_netdev_opts,
 &qemu_net_opts,
 &qemu_rtc_opts,
+#ifdef CONFIG_HPET_DRIFTFIX
+&qemu_hpet_opts,
+#endif
 &qemu_global_opts,
 &qemu_mon_opts,
 &qemu_cpudef_opts,
diff -up ./qemu-options.hx.orig1 ./qemu-options.hx
--- ./qemu-options.hx.orig1 2011-02-18 22:48:06.0 +0100
+++ ./qemu-options.hx   2011-03-13 12:38:22.815977096 +0100
@@ -972,6 +972,13 @@ STEXI
 Disable HPET support.
 ETEXI
 
+#ifdef CONFIG_HPET_DRIFTFIX
+DEF("hpet", HAS_ARG, QEMU_OPTION_hpet, \
+"-hpet [device=none|present][,driftfix=none|slew]\n" \
+"disable or enable HPET, disable or enable drift fix for 
periodic timers\n",
+QEMU_ARCH_ALL)
+#endif
+
 DEF("balloon", HAS_ARG, QEMU_OPTION_balloon,
 "-balloon none   disable balloon device\n"
 "-balloon virtio[,addr=str]\n"
diff -up ./vl.c.orig1 ./vl.c
--- ./vl.c.orig12011-02-18 22:48:06.0 +0100
+++ ./vl.c  2011-03-13 12:38:35.167984285 +0100
@@ -203,6 +203,9 @@ CharDriverState *parallel_hds[MAX_PARALL
 CharDriverState *virtcon_hds[MAX_VIRTIO_CONSOLES];
 int win2k_install_hack = 0;
 int rtc_td_hack = 0;
+#ifdef CONFIG_HPET_DRIFTFIX
+int hpet_driftfix = 0;
+#endif
 int usb_enabled = 0;
 int singlestep = 0;
 int smp_cpus = 1;
@@ -431,6 +434,41 @@ static void configure_rtc(QemuOpts *opts
 }
 }
 
+#ifdef CONFIG_HPET_DRIFTFIX
+static void configure_hpet(QemuOpts *opts)
+{
+const char *value, *opt;
+
+opt = "device";
+value = qemu_opt_get(opts, opt);
+if (value) {
+if (!strcmp(value, "present")) {
+no_hpet = 0;
+} else if (!strcmp(value, "none")) {
+no_hpet = 1;
+} else {
+goto error_exit;
+}
+}
+opt = "driftfix";
+value = qemu_opt_get(opts, opt);
+if (value) {
+if (!strcmp(value, "slew")) {
+hpet_driftfix = 1;
+} else if (!strcmp(value, "none")) {
+hpet_driftfix = 0;
+} else {
+goto error_exit;
+}
+}
+return;
+
+error_exit:
+fprintf(stderr, "qemu: -hpet option '%s': value missing or invalid\n", 
opt);
+exit(1);
+}
+#endif
+
 /***/
 /* Bluetooth support */
 static int nb_hcis;
@@ -2644,6 +2682,15 @@ int main(int argc, char **argv, char **e
 case QEMU_OPTION_no_hpet:
 no_hpet = 1;
 break;
+#ifdef CONFIG_HPET_DRIFTFIX
+case QEMU_OPTION_hpet:
+opts = qemu_opts_parse(qemu_find_opts("hpet"), optarg, 0);
+if (!opts) {
+exit(1);
+}
+configure_hpet(opts);
+break;
+#endif
 case QEMU_OPTION_balloon:
 if (balloon_parse(optarg) < 0) {
 fprintf(stderr, "Unknown -balloon argument %s\n", optarg);
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] alleviate time drift with HPET periodic timers

2011-03-18 Thread Ulrich Obergfell

Hi,

By the beginning of February I posted an RFC regarding an approach to
alleviate time drift with HPET periodic timers.

Ref.: http://article.gmane.org/gmane.comp.emulators.kvm.devel/67346
  http://lists.gnu.org/archive/html/qemu-devel/2011-02/msg00327.html

During the discussion it became apparent that a generic API that could
facilitate (re)injection of lost timer interrupts for any timer device
emulation would be preferred over an isolated fix for a specific device.
However, at this stage there appears to be no plan to actually implement
the generic API. Therefore, we may want to reconsider an improved version
of the code that I originally included in the RFC. This improved version
compensates the loss of timer interrupts caused by ...

-  delayed timer callback.
-  interrupt coalescing (x86 only).

Lost timer interrupts are compensated by gradually injecting additional
interrupts during the subsequent timer intervals, starting at a rate of
one additional interrupt per interval. If further interrupts are lost
while compensation is still in progress, the rate is increased. A limit
is imposed on the rate and on the 'backlog' of lost interrupts that are
to be injected. Compensation of lost timer interrupts can be enabled via
a new QEMU command line option [1].

  -hpet [device=none|present][,driftfix=none|slew]

The code is enclosed in '#ifdef CONFIG_HPET_DRIFTFIX ... #endif' and
it is configurable via a new option '--enable-hpet-driftfix' for the
'configure' script. This also makes it easy to identify and to remove
if the generic API would be implemented some day.

I've tested this code primarily by playing a list of example video files
continuously in a Windows Vista(tm) guest. In some cases I observed the
guest time fall 15+ seconds behind KVM host time in one minute without
the patch, whereas with the patch there was no visible difference between
the guest time and KVM host time after several hours of testing [2]. The
'qemu-system-x86_64' test binary image file was built from snapshot ...

http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=commit;h=e700699e9f6f1e163985a8e4cd76d8568d24c152

... using only the '--prefix' and '--enable-hpet-driftfix' options of the
'configure' script (other options default). The series of patches is based
on the aforementioned snapshot and it consists of three parts.

-  Part 1 introduces the new QEMU command line option.
-  Part 2 implements the method to compensate lost timer interrupts.
-  Part 3 introduces the new option for the 'configure' script.

Please review and please comment.


Regards,

Uli Obergfell


[1] Injecting additional timer interrupts to compensate lost interrupts
can alleviate long term time drift. However, on a short time scale,
this method can have the side effect of making virtual machine time
intermittently pass faster than real time (depending on the guest's
time keeping algorithm). Compensation is disabled by default and can
be enabled for guests where this behaviour is acceptable.

[2] Testing required a patch in tight_compress_data(), ui/vnc-enc-tight.c
to avoid segmentation fault / core dump of qemu. However, this needs
to be discussed in a separate post.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: [RFC: 0/2] patch for QEMU HPET periodic timer emulation to alleviate time drift

2011-02-07 Thread Ulrich Obergfell

On 02/03/2011 9:07 PM, Anthony Liguori wrote:
> On 02/03/2011 09:28 AM, Jan Kiszka wrote:
>> On 2011-02-03 14:43, Ulrich Obergfell wrote:
...
>>> This is an RFC through which I would like to get feedback on how the
>>> idea of a patch to compensate those lost interrupts would be received:
>>>
>>> The patch determines the number of lost timer interrupts based on the
>>> number of elapsed comparator periods. Lost interrupts are compensated
>>>
>> That neglects coalescing of the HPET IRQs: If the timer is run regularly
>> but the guest is not able to retrieve the injected IRQs, you should
>> still see drifts with your patches.
>>
> 
> FWIW, this isn't the most common failure scenario. This is only really
> prominent when you have rapid reinject like we do with the in-kernel
> PIT. This generally shouldn't be an issue with gradual reinjection.
> 
>>> by gradually injecting additional interrupts during the subsequent
>>> timer intervals, starting at a rate of one additional interrupt per
>>> interval. If further interrupts are lost while compensation is still
>>> in progress, the rate is increased. The algorithm imposes a limit on
>>> the rate and on the 'backlog' of lost interrupts to be injected.


Anthony, Jan,

many thanks for your feedback. I'm sorry for not responding earlier.

I used the following KVM kernel module trace point in __apic_accept_irq()
to trace coalesced interrupts related to HPET timer 0:

result = !apic_test_and_set_irr(vector, apic);
trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
  trig_mode, vector, !result);

While running the test outlined in my original email, I observed between
5 and 15 of these events occurring during one minute:

 qemu-system-x86-7632  [000]  3755.888909: kvm_apic_accept_irq: apicid 0
 vec 209 (Fixed|edge) (coalesced)

kvm->arch->vioapic.redirtbl[2] indicated that IRQ 2 was routed to vector
number 209.

With a comparator period of 15.6 milliseconds, the coalesced interrupts
apparently contribute less than 250 milliseconds of drift. However, the
total drift that I observed during one minute was up to 15 seconds.

I think the patch could possibly handle coalesced interrupts too:

- In update_irq(), use a similar method as in RTC emulation to detect
  coalesced interrupts. Return a delivery status to hpet_timer().

- In hpet_timer(), decrement t->irqs_to_inject (number of interrupts
  remaining to inject) only if the interrupt was not coalesced.


Regards,

Uli Obergfell
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: [RFC: 0/2] patch for QEMU HPET periodic timer emulation to alleviate time drift

2011-02-04 Thread Ulrich Obergfell

Anthony,

in reply to: 
> My point is that there's really low hanging fruit and while for some
> curious reason I don't actually see this patch, I believe that a patch
> like this probably can help us quite a lot in the short term.

I've sent the patch in two separate emails:

- code part 1 (introduces the qemu command line option)
  http://article.gmane.org/gmane.comp.emulators.kvm.devel/67347

- code part 2 (implements compensation of lost interrupts)
  http://article.gmane.org/gmane.comp.emulators.kvm.devel/67348


Regards,

Uli Obergfell
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC: 2/2] patch for QEMU HPET periodic timer emulation to alleviate time drift (code part 2)

2011-02-03 Thread Ulrich Obergfell

code part 2
---


Implements compensation of lost interrupts for HPET periodic timers.



diff -up ./hw/hpet.c.orig2 ./hw/hpet.c
--- ./hw/hpet.c.orig2   2011-01-21 23:34:47.0 +0100
+++ ./hw/hpet.c 2011-02-01 19:20:24.619247214 +0100
@@ -41,6 +41,13 @@
 
 #define HPET_MSI_SUPPORT0
 
+#ifdef TARGET_I386
+#define MAX_IRQS_TO_INJECT (uint32_t)100
+#define MAX_IRQ_RATE   (uint32_t)10
+
+extern int hpet_driftfix;
+#endif
+
 struct HPETState;
 typedef struct HPETTimer {  /* timers */
 uint8_t tn; /*timer number*/
@@ -55,6 +62,13 @@ typedef struct HPETTimer {  /* timers */
 uint8_t wrap_flag;  /* timer pop will indicate wrap for one-shot 32-bit
  * mode. Next pop will be actual timer expiration.
  */
+#ifdef TARGET_I386
+/* for lost interrupt compensation */
+uint64_t saved_period;
+uint32_t irqs_to_inject;
+uint32_t irq_rate;
+uint32_t divisor;
+#endif
 } HPETTimer;
 
 typedef struct HPETState {
@@ -248,7 +262,11 @@ static int hpet_post_load(void *opaque, 
 
 static const VMStateDescription vmstate_hpet_timer = {
 .name = "hpet_timer",
+#ifdef TARGET_I386
+.version_id = 3,
+#else
 .version_id = 1,
+#endif
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
 .fields  = (VMStateField []) {
@@ -258,6 +276,13 @@ static const VMStateDescription vmstate_
 VMSTATE_UINT64(fsb, HPETTimer),
 VMSTATE_UINT64(period, HPETTimer),
 VMSTATE_UINT8(wrap_flag, HPETTimer),
+#ifdef TARGET_I386
+/* for lost interrupt compensation */
+VMSTATE_UINT64_V(saved_period, HPETTimer, 3),
+VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3),
+VMSTATE_UINT32_V(irq_rate, HPETTimer, 3),
+VMSTATE_UINT32_V(divisor, HPETTimer, 3),
+#endif
 VMSTATE_TIMER(qemu_timer, HPETTimer),
 VMSTATE_END_OF_LIST()
 }
@@ -265,7 +290,11 @@ static const VMStateDescription vmstate_
 
 static const VMStateDescription vmstate_hpet = {
 .name = "hpet",
+#ifdef TARGET_I386
+.version_id = 3,
+#else
 .version_id = 2,
+#endif
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
 .pre_save = hpet_pre_save,
@@ -289,7 +318,9 @@ static void hpet_timer(void *opaque)
 {
 HPETTimer *t = opaque;
 uint64_t diff;
-
+#ifdef TARGET_I386
+uint32_t irq_count = 0;
+#endif
 uint64_t period = t->period;
 uint64_t cur_tick = hpet_get_ticks(t->state);
 
@@ -297,13 +328,88 @@ static void hpet_timer(void *opaque)
 if (t->config & HPET_TN_32BIT) {
 while (hpet_time_after(cur_tick, t->cmp)) {
 t->cmp = (uint32_t)(t->cmp + t->period);
+#ifdef TARGET_I386
+/* count one irq per period */
+++irq_count;
+#endif
 }
 } else {
 while (hpet_time_after64(cur_tick, t->cmp)) {
 t->cmp += period;
+#ifdef TARGET_I386
+/* count one irq per period */
+++irq_count;
+#endif
 }
 }
 diff = hpet_calculate_diff(t, cur_tick);
+#ifdef TARGET_I386
+if (hpet_driftfix) {
+/*
+ * If the period value changed since the previous callback,
+ * scale the number of irqs to inject to the new period value
+ * and save the new period value.
+ */
+if (t->saved_period != t->period) {
+t->irqs_to_inject = (t->irqs_to_inject * t->saved_period)
+   / t->period;
+t->saved_period = t->period;
+}
+/*
+ * Add the irq count of the current callback to the number
+ * of irqs to inject. Make sure the result does not exceed
+ * the limit.
+ */
+t->irqs_to_inject += irq_count;
+if (t->irqs_to_inject > MAX_IRQS_TO_INJECT)
+t->irqs_to_inject = MAX_IRQS_TO_INJECT;
+/*
+ * One irq will be injected during the current callback. If
+ * the number of irqs to inject is greater than 1, there is
+ * is a backlog of 'irqs_to_inject - 1' interrupts that were
+ * lost and need to be compensated.
+ */
+if (t->irqs_to_inject > 1) {
+/*
+ * If the irq count of the current callback is greater
+ * than 1, 'irq_count - 1' interrupts were lost since
+ * the previous callback. Increase the rate at which
+ * additional irqs will be injected to compensate the
+ * lost interrupts. Make sure the rate does not exceed
+ * the limit.
+ */
+if (irq_count > 1) {
+t->irq_rate++;
+if (t->irq_rate > MAX_IRQ_RATE)
+t->irq_rate = MAX_IRQ_RATE;
+}
+  

[RFC: 1/2] patch for QEMU HPET periodic timer emulation to alleviate time drift (code part 1)

2011-02-03 Thread Ulrich Obergfell

code part 1
---


Introduces the '-hpet [device=none|present][,driftfix=none|slew]' option.



diff -up ./qemu-config.c.orig1 ./qemu-config.c
--- ./qemu-config.c.orig1   2011-01-21 23:34:47.0 +0100
+++ ./qemu-config.c 2011-02-01 19:38:26.665250920 +0100
@@ -255,6 +255,21 @@ static QemuOptsList qemu_rtc_opts = {
 },
 };
 
+static QemuOptsList qemu_hpet_opts = {
+.name = "hpet",
+.head = QTAILQ_HEAD_INITIALIZER(qemu_hpet_opts.head),
+.desc = {
+{
+.name = "device",
+.type = QEMU_OPT_STRING,
+},{
+.name = "driftfix",
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
 static QemuOptsList qemu_global_opts = {
 .name = "global",
 .head = QTAILQ_HEAD_INITIALIZER(qemu_global_opts.head),
@@ -456,6 +471,7 @@ static QemuOptsList *vm_config_groups[32
 &qemu_netdev_opts,
 &qemu_net_opts,
 &qemu_rtc_opts,
+&qemu_hpet_opts,
 &qemu_global_opts,
 &qemu_mon_opts,
 &qemu_cpudef_opts,
diff -up ./qemu-options.hx.orig1 ./qemu-options.hx
--- ./qemu-options.hx.orig1 2011-01-21 23:34:47.0 +0100
+++ ./qemu-options.hx   2011-02-01 16:01:26.063249119 +0100
@@ -972,6 +972,11 @@ STEXI
 Disable HPET support.
 ETEXI
 
+DEF("hpet", HAS_ARG, QEMU_OPTION_hpet, \
+"-hpet [device=none|present][,driftfix=none|slew]\n" \
+"disable or enable HPET, disable or enable drift fix for 
periodic timers (x86 only)\n",
+QEMU_ARCH_ALL)
+
 DEF("balloon", HAS_ARG, QEMU_OPTION_balloon,
 "-balloon none   disable balloon device\n"
 "-balloon virtio[,addr=str]\n"
diff -up ./vl.c.orig1 ./vl.c
--- ./vl.c.orig12011-01-21 23:34:47.0 +0100
+++ ./vl.c  2011-02-01 19:39:22.163249205 +0100
@@ -203,6 +203,7 @@ CharDriverState *parallel_hds[MAX_PARALL
 CharDriverState *virtcon_hds[MAX_VIRTIO_CONSOLES];
 int win2k_install_hack = 0;
 int rtc_td_hack = 0;
+int hpet_driftfix = 0;
 int usb_enabled = 0;
 int singlestep = 0;
 int smp_cpus = 1;
@@ -431,6 +432,39 @@ static void configure_rtc(QemuOpts *opts
 }
 }
 
+static void configure_hpet(QemuOpts *opts)
+{
+const char *value, *opt;
+
+opt = "device";
+value = qemu_opt_get(opts, opt);
+if (value) {
+if (!strcmp(value, "present")) {
+no_hpet = 0;
+} else if (!strcmp(value, "none")) {
+no_hpet = 1;
+} else {
+goto error_exit;
+}
+}
+opt = "driftfix";
+value = qemu_opt_get(opts, opt);
+if (value) {
+if (!strcmp(value, "slew")) {
+hpet_driftfix = 1;
+} else if (!strcmp(value, "none")) {
+hpet_driftfix = 0;
+} else {
+goto error_exit;
+}
+}
+return;
+
+error_exit:
+fprintf(stderr, "qemu: -hpet option '%s': value missing or invalid\n", 
opt);
+exit(1);
+}
+
 /***/
 /* Bluetooth support */
 static int nb_hcis;
@@ -2608,6 +2642,13 @@ int main(int argc, char **argv, char **e
 case QEMU_OPTION_no_hpet:
 no_hpet = 1;
 break;
+case QEMU_OPTION_hpet:
+opts = qemu_opts_parse(qemu_find_opts("hpet"), optarg, 0);
+if (!opts) {
+exit(1);
+}
+configure_hpet(opts);
+break;
 case QEMU_OPTION_balloon:
 if (balloon_parse(optarg) < 0) {
 fprintf(stderr, "Unknown -balloon argument %s\n", optarg);
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC: 0/2] patch for QEMU HPET periodic timer emulation to alleviate time drift

2011-02-03 Thread Ulrich Obergfell

Hi,

I am observing severe backward time drift in a MS Windows Vista(tm)
guest running on a Fedora 14 KVM host. I can reproduce the problem
with the following steps:

1. Use 'vncviewer' to connect to the guest's desktop.
2. Click on the menu title bar of a window on the guest's desktop.
3. Move that window around on the guest's desktop.

While I keep on moving the window around for one minute, the guest
time falls up to 15 seconds behind host time.

The problem is caused by delayed callbacks of hpet_timer(). A timer
interrupt is injected into the guest during each callback. However,
interrupts are lost if delays are greater than a comparator period.


This is an RFC through which I would like to get feedback on how the
idea of a patch to compensate those lost interrupts would be received:

The patch determines the number of lost timer interrupts based on the
number of elapsed comparator periods. Lost interrupts are compensated
by gradually injecting additional interrupts during the subsequent
timer intervals, starting at a rate of one additional interrupt per
interval. If further interrupts are lost while compensation is still
in progress, the rate is increased. The algorithm imposes a limit on
the rate and on the 'backlog' of lost interrupts to be injected. The
patch can be enabled via a qemu command line option.

  -hpet [device=none|present][,driftfix=none|slew]

The 'device=none' option is equivalent to the '-no-hpet' option, and
the 'driftfix=slew' option enables the patch (similar to RTC).


The second and third part of this series of email contain the patch:

- Code part 1 introduces the qemu command line option.
- Code part 2 implements compensation of lost interrupts.

Please review and please comment.


With the patch enabled, I'm not observing any drift after one minute
of testing (as outlined above). I'm not using a fine grained method
to check the drift, just looking at the hrs:min:sec of the guest's
and the host's clock. If there is still some drift, it's apparently
less than a second.


Regards,

Uli Obergfell
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-15 Thread Ulrich Obergfell

- "Anthony Liguori"  wrote:

> On 12/14/2010 06:09 AM, Ulrich Obergfell wrote:

[...]

> > Parts 1 thru 4 of this RFC contain experimental source code which
> > I recently used to investigate the performance benefit. In a Linux
> > guest, I was running a program that calls gettimeofday() 'n' times
> > in a loop (the PM Timer register is read during each call). With
> > in-kernel PM Timer, I observed a significant reduction of program
> > execution time.
> >
> 
> I've played with this in the past.  Can you post real numbers, 
> preferably, with a real work load?


Anthony,

I only experimented with a gettimeofday() loop. With this test scenario
I observed that in-kernel PM Timer reduced the program execution time to
roughly half of the execution time that it takes with userspace PM Timer.
Please find some example results below (these results were obtained while
the host was not busy). The relative difference of in-kernel PM Timer
versus userspace PM Timer is high, whereas the absolute difference per
call appears to be low. So, the benefit much depends on how frequently
gettimeofday() is called in a real work load. I don't have any numbers
from a real work load. When I began working on this, I was motivated by
the fact that the Linux kernel itself provides an optimization for the
gettimeofday() call ('vxtime'). So, from this I presumed that there
would be real work loads which would benefit from the optimization of
the gettimeofday() call (otherwise, why would we have 'vxtime' ?).
Of course, 'vxtime' is not related to PM based time keeping. However,
the experimental code shows an approach to optimize gettimeofday() in
KVM virtual machines.


Regards,

Uli


- host:

# grep "model name" /proc/cpuinfo | sort | uniq -c
  8 model name : Intel(R) Core(TM) i7 CPU   Q 740  @ 1.73GHz

# uname -r
2.6.37-rc4


- guest:

# grep "model name" /proc/cpuinfo | sort | uniq -c
  4 model name : QEMU Virtual CPU version 0.13.50


- test program ('gtod.c'):

#include 
#include 

struct timeval tv;

main(int argc, char *argv[])
{
int i = atoi(argv[1]);
while (i-- > 0)
gettimeofday(&tv, NULL);
}


- example results with in-kernel PM Timer:

# for i in 1 2 3
> do
> time ./gtod 2500
> done

real0m44.302s
user0m1.090s
sys 0m43.163s

real0m44.509s
user0m1.100s
sys 0m43.393s

real0m45.290s
user0m1.160s
sys 0m44.123s

# for i in 1000 5000 1
> do
> time ./gtod $i
> done

real0m17.981s
user0m0.810s
sys 0m17.157s

real1m27.253s
user0m1.930s
sys 1m25.307s

real2m51.801s
user0m3.359s
sys 2m48.384s


- example results with userspace PM Timer:

# for i in 1 2 3
> do
> time ./gtod 2500
> done

real1m24.185s
user0m2.000s
sys 1m22.168s

real1m23.508s
user0m1.750s
sys 1m21.738s

real1m24.437s
user0m1.900s
sys 1m22.517s

# for i in 1000 5000 1
> do
> time ./gtod $i
> done

real0m33.479s
user0m0.680s
sys 0m32.785s

real2m50.831s
user0m3.389s
sys 2m47.405s

real5m42.304s
user0m7.319s
sys 5m34.919s
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Ulrich Obergfell

- "Avi Kivity"  wrote:

> On 12/14/2010 03:40 PM, Glauber Costa wrote:
> > >
> > >  What is the motivation for this?  Are there any important guests that
> > >  use the pmtimer?
> > Avi,
> >
> > All older RHEL and Windows, for example, would benefit for this.
> 
> They only benefit from it because we don't provide HPET.  If we did, the 
> guests would use HPET in preference to pmtimer, since HPET is so much
> better than pmtimer (yet still sucks in an absolute sense).
> 
> > >  If anything I'd expect hpet or the Microsoft synthetic timers to be a
> > >  lot more important.
> >
> > True. But also a lot more work.
> > Implementing just the pm timer counter - not the whole of it - in
> > kernel, gives us a lot of gain with not very much effort. Patch is
> > pretty simple, as you can see, and most of it is even code to turn it
> > on/off, etc.
> >
> 
> Partial emulation is not something I like since it causes a fuzzy 
> kernel/user boundary.  In this case, transitioning to userspace when 
> interrupts are enabled doesn't look so hot.  Are you sure all guests 
> that benefit from this don't enable the pmtimer interrupt?  What about
> the transition?  Will we have a time discontinuity when that happens?

Avi,

the idea is to use the '-kvm-pmtmr' option (in code part 4) only
with guests that do not enable the 'timer carry interrupt'. Guests
that need to enable the 'timer carry interrupt' should rather use
the PM Timer emulation in qemu userspace (i.e. they should not be
started with this option). If a guest is accidentally started with
this option, the in-kernel PM Timer (in code part 1) detects if
the guest attempts to enable the 'timer carry interrupt' and falls
back to PM Timer emulation in qemu userspace (in-kernel PM Timer
disables itself automatically). So, this is not a combination of
in-kernel PM Timer register emulation and qemu userspace PM Timer
interrupt emulation.

Regards,

Uli
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 4/4] KVM in-kernel PM Timer implementation (experimental code part 4)

2010-12-14 Thread Ulrich Obergfell

experimental code part 4 (qemu userspace)
-


This code introduces the new qemu command line option '-kvm-pmtmr'.
qemu only creates and configures in-kernel PM Timer if this option
is specified on the command line.



diff -up ./qemu-kvm.c.orig4 ./qemu-kvm.c
--- ./qemu-kvm.c.orig4  2010-12-10 10:50:42.857811776 +0100
+++ ./qemu-kvm.c2010-12-10 11:45:23.783748044 +0100
@@ -54,6 +54,9 @@ int kvm_irqchip = 1;
 int kvm_pit = 1;
 int kvm_pit_reinject = 1;
 int kvm_nested = 0;
+#ifdef KVM_CAP_PMTMR
+int kvm_pmtmr = 0;
+#endif
 
 
 KVMState *kvm_state;
@@ -186,7 +189,7 @@ int kvm_init(int smp_cpus)
 kvm_context->no_irqchip_creation = 0;
 kvm_context->no_pit_creation = 0;
 #ifdef KVM_CAP_PMTMR
-kvm_context->no_pmtmr_creation = 0;
+kvm_context->no_pmtmr_creation = 1;
 #endif
 
 #ifdef KVM_CAP_SET_GUEST_DEBUG
@@ -241,6 +244,11 @@ void kvm_disable_pit_creation(kvm_contex
 }
 
 #ifdef KVM_CAP_PMTMR
+void kvm_enable_pmtmr_creation(kvm_context_t kvm)
+{
+kvm->no_pmtmr_creation = 0;
+}
+
 void (*kvm_arch_pmtmr_handler)(kvm_context_t kvm);
 /*
  * This handler is called by
@@ -1654,6 +1662,11 @@ static int kvm_create_context(void)
 if (!kvm_pit) {
 kvm_disable_pit_creation(kvm_context);
 }
+#ifdef KVM_CAP_PMTMR
+if (kvm_pmtmr) {
+kvm_enable_pmtmr_creation(kvm_context);
+}
+#endif
 if (kvm_create(kvm_context, 0, NULL) < 0) {
 kvm_finalize(kvm_state);
 return -1;
diff -up ./qemu-kvm.h.orig4 ./qemu-kvm.h
--- ./qemu-kvm.h.orig4  2010-12-10 11:26:43.726790319 +0100
+++ ./qemu-kvm.h2010-12-10 11:47:50.074805792 +0100
@@ -124,6 +124,18 @@ void kvm_disable_irqchip_creation(kvm_co
  */
 void kvm_disable_pit_creation(kvm_context_t kvm);
 
+#ifdef KVM_CAP_PMTMR
+/*!
+ * \brief Enable the in-kernel ACPI PM Timer register creation
+ *
+ * In-kernel ACPI PM Timer register is disabled by default.
+ * If in-kernel is to be used, this should be called prior to kvm_create().
+ *
+ *  \param kvm Pointer to the kvm_context
+ */
+void kvm_enable_pmtmr_creation(kvm_context_t kvm);
+#endif
+
 /*!
  * \brief Create new virtual machine
  *
@@ -706,6 +718,9 @@ extern int kvm_irqchip;
 extern int kvm_pit;
 extern int kvm_pit_reinject;
 extern int kvm_nested;
+#ifdef KVM_CAP_PMTMR
+extern int kvm_pmtmr;
+#endif
 extern kvm_context_t kvm_context;
 
 struct ioperm_data {
diff -up ./qemu-options.hx.orig4 ./qemu-options.hx
--- ./qemu-options.hx.orig4 2010-12-02 15:15:20.0 +0100
+++ ./qemu-options.hx   2010-12-06 11:27:57.273648509 +0100
@@ -2330,6 +2330,9 @@ DEF("no-kvm-pit-reinjection", 0, QEMU_OP
 QEMU_ARCH_I386)
 DEF("enable-nesting", 0, QEMU_OPTION_enable_nesting,
 "-enable-nesting enable support for running a VM inside the VM (AMD 
only)\n", QEMU_ARCH_I386)
+DEF("kvm-pmtmr", 0, QEMU_OPTION_kvm_pmtmr,
+"-kvm-pmtmr  enable KVM kernel mode ACPI PM Timer register 
emulation\n",
+QEMU_ARCH_I386)
 DEF("nvram", HAS_ARG, QEMU_OPTION_nvram,
 "-nvram FILE provide ia64 nvram contents\n", QEMU_ARCH_ALL)
 DEF("tdf", 0, QEMU_OPTION_tdf,
diff -up ./vl.c.orig4 ./vl.c
--- ./vl.c.orig42010-12-10 10:34:55.388997058 +0100
+++ ./vl.c  2010-12-10 11:50:20.566810444 +0100
@@ -2474,6 +2474,12 @@ int main(int argc, char **argv, char **e
kvm_nested = 1;
break;
}
+#ifdef KVM_CAP_PMTMR
+   case QEMU_OPTION_kvm_pmtmr: {
+   kvm_pmtmr = 1;
+   break;
+   }
+#endif
 #endif
 case QEMU_OPTION_usb:
 usb_enabled = 1;
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 3/4] KVM in-kernel PM Timer implementation (experimental code part 3)

2010-12-14 Thread Ulrich Obergfell

experimental code part 3 (qemu userspace)
-


This code utlizes the new ioctl commands introduced by code part 2.

The KVM_CREATE_PMTMR ioctl command is simply called once when a virtual
machine is being created. However, calling KVM_CONFIGURE_PMTMR is more
challenging because it involves ...

-  passing the base address of PM I/O port range to code part 1
-  passing the clock offset to code part 1

'timers_state.cpu_clock_offset' gets updated at each vm_start() call.
However, the PM I/O port base address is not available at the first
vm_start() call. So, configuring the in-kernel PM Timer needs to be
postponed until the PIIX4 PCI configuration is initialized. This is
facilitated by the new function kvm_pmtmr_handler() which is called
by vm_start() and by pm_io_space_update().

kvm_pmtmr_handler() calls architecture-specific code thru a function
pointer 'kvm_arch_pmtmr_handler'. kvm_pmtmr_handler() is a 'no-op' if
an architecture does not provide or clears this function pointer. The
architecture-specific code is responsible for configuring the in-kernel
PM Timer.

The experimental code provides kvm_arch_configure_pmtmr_wrapper() in
qemu-kvm-x86.c. kvm_arch_create_pmtmr() sets 'kvm_arch_pmtmr_handler'
to 'kvm_arch_configure_pmtmr_wrapper' after successful completion of
the KVM_CREATE_PMTMR ioctl command.

kvm_arch_configure_pmtmr_wrapper() requires ACPI PM code to provide a
function pointer 'kvm_arch_get_pm_io_base' thru which the PM I/O port
base address can be obtained. kvm_arch_configure_pmtmr_wrapper() is a
'no-op' too if ACPI PM code does not provide or clears this function
pointer. The experimental code provides piix4_get_pm_io_base() in
hw/acpi_piix4.c. pm_io_space_update() sets 'kvm_arch_get_pm_io_base'
to 'piix4_get_pm_io_base'.

Consider two scenarios ...

-  during virtual machine creation and startup

 kvm_arch_create
   kvm_arch_create_pmtmr
 ioctl(KVM_CREATE_PMTMR)
 kvm_arch_pmtmr_handler = kvm_arch_configure_pmtmr_wrapper
  :
 vm_start
   kvm_pmtmr_handler
 kvm_arch_configure_pmtmr_wrapper
   'no-op' because kvm_arch_get_pm_io_base not set yet
  :
 pm_io_space_update
   kvm_arch_get_pm_io_base = piix4_get_pm_io_base
   kvm_pmtmr_handler
 kvm_arch_configure_pmtmr_wrapper
   obtain PM I/O port base thru kvm_arch_get_pm_io_base
   kvm_arch_configure_pmtmr
 ioctl(KVM_CONFIGURE_PMTMR)

-  any other vm_start() call, for example after migration

 vm_start
   kvm_pmtmr_handler
 kvm_arch_configure_pmtmr_wrapper
   obtain PM I/O port base thru kvm_arch_get_pm_io_base
   kvm_arch_configure_pmtmr
 ioctl(KVM_CONFIGURE_PMTMR)



diff -up ./hw/acpi_piix4.c.orig3 ./hw/acpi_piix4.c
--- ./hw/acpi_piix4.c.orig3 2010-12-02 15:15:20.0 +0100
+++ ./hw/acpi_piix4.c   2010-12-10 11:26:53.943753235 +0100
@@ -23,6 +23,7 @@
 #include "acpi.h"
 #include "sysemu.h"
 #include "range.h"
+#include "qemu-kvm.h"
 
 //#define DEBUG
 
@@ -80,6 +81,9 @@ typedef struct PIIX4PMState {
 
 static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s);
 
+/* for cpu hotadd (and in-kernel PM Timer if KVM_CAP_PMTMR is defined) */
+static PIIX4PMState *global_piix4_pm_state;
+
 #define ACPI_ENABLE 0xf1
 #define ACPI_DISABLE 0xf0
 
@@ -250,6 +254,19 @@ static void acpi_dbg_writel(void *opaque
 PIIX4_DPRINTF("ACPI: DBG: 0x%08x\n", val);
 }
 
+#ifdef KVM_CAP_PMTMR
+static uint64_t piix4_get_pm_io_base(void)
+{
+PIIX4PMState *s = global_piix4_pm_state;
+uint32_t pm_io_base;
+
+pm_io_base = le32_to_cpu(*(uint32_t *)(s->dev.config + 0x40));
+pm_io_base &= 0xffc0;
+
+return (uint64_t)pm_io_base;
+}
+#endif
+
 static void pm_io_space_update(PIIX4PMState *s)
 {
 uint32_t pm_io_base;
@@ -262,6 +279,16 @@ static void pm_io_space_update(PIIX4PMSt
 PIIX4_DPRINTF("PM: mapping to 0x%x\n", pm_io_base);
 iorange_init(&s->ioport, &pm_iorange_ops, pm_io_base, 64);
 ioport_register(&s->ioport);
+#ifdef  KVM_CAP_PMTMR
+kvm_arch_get_pm_io_base = piix4_get_pm_io_base;
+/*
+ * The base address of the PM I/O port address range is now known.
+ * The following call is needed to pass the base address to the
+ * in-kernel PM Timer emulation. Note that 'kvm_arch_get_pm_io_base'
+ * must be set _before_ this call.
+ */
+kvm_pmtmr_handler();
+#endif
 }
 }
 
@@ -354,14 +381,12 @@ static void piix4_powerdown(void *opaque
 }
 }
 
-static PIIX4PMState *global_piix4_pm_state; /* cpu hotadd */
-
 static int piix4_pm_initfn(PCIDevice *dev)
 {
 PIIX4PMState *s = DO_UPCAST(PIIX4PMState, dev, dev);
 uint8_t *pci_conf;
 
-/* for cpu hotadd */
+/* for cpu hotadd and in-kernel PM Timer */
 global_piix4_pm_state = s;
 
 pci_conf = s->dev.config;
diff -up ./kvm/include/linux/kvm.h.orig3 ./kvm/include/linux/kvm.h
--- ./kvm/include/

[RFC 2/4] KVM in-kernel PM Timer implementation (experimental code part 2)

2010-12-14 Thread Ulrich Obergfell

experimental code part 2 (KVM kernel)
-


This code introduces two new ioctl commands KVM_CREATE_PMTMR and
KVM_CONFIGURE_PMTMR plus the new capability KVM_CAP_PMTMR to the
ioctl infrastructure of the KVM kernel. This code utilizes some
helper functions introduced by code part 1.



diff -up ./arch/x86/include/asm/kvm.h.orig2 ./arch/x86/include/asm/kvm.h
--- ./arch/x86/include/asm/kvm.h.orig2  2010-12-05 09:35:17.0 +0100
+++ ./arch/x86/include/asm/kvm.h2010-12-10 12:32:47.067686432 +0100
@@ -24,6 +24,7 @@
 #define __KVM_HAVE_DEBUGREGS
 #define __KVM_HAVE_XSAVE
 #define __KVM_HAVE_XCRS
+#define __KVM_HAVE_PMTMR
 
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
diff -up ./arch/x86/kvm/x86.c.orig2 ./arch/x86/kvm/x86.c
--- ./arch/x86/kvm/x86.c.orig2  2010-12-05 09:35:17.0 +0100
+++ ./arch/x86/kvm/x86.c2010-12-10 12:24:58.083739549 +0100
@@ -26,6 +26,9 @@
 #include "tss.h"
 #include "kvm_cache_regs.h"
 #include "x86.h"
+#ifdef KVM_CAP_PMTMR
+#include "pmtmr.h"
+#endif
 
 #include 
 #include 
@@ -1965,6 +1968,9 @@ int kvm_dev_ioctl_check_extension(long e
case KVM_CAP_X86_ROBUST_SINGLESTEP:
case KVM_CAP_XSAVE:
case KVM_CAP_ASYNC_PF:
+#ifdef KVM_CAP_PMTMR
+   case KVM_CAP_PMTMR:
+#endif
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
@@ -3274,6 +3280,7 @@ long kvm_arch_vm_ioctl(struct file *filp
struct kvm_pit_state ps;
struct kvm_pit_state2 ps2;
struct kvm_pit_config pit_config;
+   struct kvm_pmtmr_config pmtmr_config;
} u;
 
switch (ioctl) {
@@ -3541,6 +3548,23 @@ long kvm_arch_vm_ioctl(struct file *filp
r = 0;
break;
}
+#ifdef KVM_CAP_PMTMR
+   case KVM_CREATE_PMTMR: {
+   mutex_lock(&kvm->slots_lock);
+   r = kvm_create_pmtmr(kvm);
+   mutex_unlock(&kvm->slots_lock);
+   break;
+   }
+   case KVM_CONFIGURE_PMTMR: {
+   r = -EFAULT;
+   if (copy_from_user(&u.pmtmr_config, argp,
+  sizeof(struct kvm_pmtmr_config)))
+   goto out;
+
+   r = kvm_configure_pmtmr(kvm, &u.pmtmr_config);
+   break;
+   }
+#endif
 
default:
;
diff -up ./include/linux/kvm.h.orig2 ./include/linux/kvm.h
--- ./include/linux/kvm.h.orig2 2010-12-05 09:35:17.0 +0100
+++ ./include/linux/kvm.h   2010-12-10 12:30:13.677745093 +0100
@@ -140,6 +140,12 @@ struct kvm_pit_config {
__u32 pad[15];
 };
 
+/* for KVM_CONFIGURE_PMTMR */
+struct kvm_pmtmr_config {
+   __u64 pm_io_base;
+   __s64 clock_offset;
+};
+
 #define KVM_PIT_SPEAKER_DUMMY 1
 
 #define KVM_EXIT_UNKNOWN  0
@@ -541,6 +547,9 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
 #define KVM_CAP_ASYNC_PF 59
+#ifdef __KVM_HAVE_PMTMR
+#define KVM_CAP_PMTMR 60
+#endif
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -672,6 +681,8 @@ struct kvm_clock_data {
 #define KVM_XEN_HVM_CONFIG_IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
 #define KVM_SET_CLOCK _IOW(KVMIO,  0x7b, struct kvm_clock_data)
 #define KVM_GET_CLOCK _IOR(KVMIO,  0x7c, struct kvm_clock_data)
+#define KVM_CREATE_PMTMR  _IO(KVMIO,   0x7d)
+#define KVM_CONFIGURE_PMTMR   _IOW(KVMIO,  0x7e, struct kvm_pmtmr_config)
 /* Available with KVM_CAP_PIT_STATE2 */
 #define KVM_GET_PIT2  _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
 #define KVM_SET_PIT2  _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 1/4] KVM in-kernel PM Timer implementation (experimental code part 1)

2010-12-14 Thread Ulrich Obergfell

experimental code part 1 (KVM kernel)
-


This code introduces the actual emulation of the PM Timer register
plus some helper functions to create and configure the in-kernel
PM Timer. The emulation utilizes the 'kvm_io_bus' infrastructure.



diff -up ./arch/x86/include/asm/kvm_host.h.orig1 
./arch/x86/include/asm/kvm_host.h
--- ./arch/x86/include/asm/kvm_host.h.orig1 2010-12-05 09:35:17.0 
+0100
+++ ./arch/x86/include/asm/kvm_host.h   2010-12-10 12:14:29.282686691 +0100
@@ -459,6 +459,10 @@ struct kvm_arch {
/* fields used by HYPER-V emulation */
u64 hv_guest_os_id;
u64 hv_hypercall;
+
+#ifdef KVM_CAP_PMTMR
+   struct kvm_pmtmr *vpmtmr;
+#endif
 };
 
 struct kvm_vm_stat {
diff -up ./arch/x86/kvm/i8254.c.orig1 ./arch/x86/kvm/i8254.c
--- ./arch/x86/kvm/i8254.c.orig12010-12-05 09:35:17.0 +0100
+++ ./arch/x86/kvm/i8254.c  2010-12-10 12:09:36.877729064 +0100
@@ -51,7 +51,7 @@
 #define RW_STATE_WORD1 4
 
 /* Compute with 96 bit intermediate result: (a*b)/c */
-static u64 muldiv64(u64 a, u32 b, u32 c)
+u64 muldiv64(u64 a, u32 b, u32 c)
 {
union {
u64 ll;
diff -up ./arch/x86/kvm/Makefile.orig1 ./arch/x86/kvm/Makefile
--- ./arch/x86/kvm/Makefile.orig1   2010-12-05 09:35:17.0 +0100
+++ ./arch/x86/kvm/Makefile 2010-12-10 12:07:14.379811121 +0100
@@ -12,7 +12,7 @@ kvm-$(CONFIG_IOMMU_API)   += $(addprefix .
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o)
 
 kvm-y  += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
-  i8254.o timer.o
+  i8254.o timer.o pmtmr.o
 kvm-intel-y+= vmx.o
 kvm-amd-y  += svm.o
 
diff -up ./arch/x86/kvm/pmtmr.c.orig1 ./arch/x86/kvm/pmtmr.c
--- ./arch/x86/kvm/pmtmr.c.orig12010-12-10 12:05:39.878691941 +0100
+++ ./arch/x86/kvm/pmtmr.c  2010-12-10 12:06:00.987738524 +0100
@@ -0,0 +1,151 @@
+/*
+ * in-kernel ACPI PM Timer emulation
+ *
+ * Note: 'timer carry interrupt' is not implemented
+ */
+
+#include 
+
+#ifdef KVM_CAP_PMTMR
+
+#include "pmtmr.h"
+
+static int emulate_acpi_reg_pmtmr(struct kvm_pmtmr *pmtmr, void *data, int len)
+{
+   s64 tmp;
+   u32 running_count;
+
+   if (len != 4)
+   return -EOPNOTSUPP;
+
+   tmp = ktime_to_ns(ktime_get()) + pmtmr->clock_offset;
+   running_count = (u32)muldiv64(tmp, KVM_ACPI_PMTMR_FREQ, NSEC_PER_SEC);
+   *(u32 *)data = running_count & KVM_ACPI_PMTMR_MASK;
+
+#ifdef KVM_ACPI_PMTMR_STATS
+   pmtmr->read_count++;
+#endif
+   return 0;
+}
+
+/*
+ * This function returns true for I/O ports in the range from 'PM base'
+ * to 'PM Timer' (this range contains the PM1 Status and the PM1 Enable
+ * registers).
+ */
+static inline int pmtmr_in_range(struct kvm_pmtmr *pmtmr, gpa_t ioport)
+{
+   return ((ioport >= pmtmr->pm_io_base) &&
+   (ioport <= pmtmr->pm_io_base + KVM_ACPI_REG_PMTMR));
+}
+
+static inline struct kvm_pmtmr *dev_to_pmtmr(struct kvm_io_device *dev)
+{
+return container_of(dev, struct kvm_pmtmr, dev);
+}
+
+static int pmtmr_ioport_read(struct kvm_io_device *this,
+gpa_t ioport, int len, void *data)
+{
+   struct kvm_pmtmr *pmtmr = dev_to_pmtmr(this);
+
+   if (!pmtmr_in_range(pmtmr, ioport))
+   return -EOPNOTSUPP;
+
+   switch (ioport - pmtmr->pm_io_base) {
+   case KVM_ACPI_REG_PMTMR:
+   /* emulate PM Timer read if in-kernel emulation is enabled */
+   if (pmtmr->state == KVM_PMTMR_STATE_ENABLED)
+   return(emulate_acpi_reg_pmtmr(pmtmr, data, len));
+
+   /* fall thru */
+   default:
+   /* let qemu userspace handle everything else */
+   return -EOPNOTSUPP;
+   }
+}
+
+static int pmtmr_ioport_write(struct kvm_io_device *this,
+ gpa_t ioport, int len, const void *data)
+{
+   struct kvm_pmtmr *pmtmr = dev_to_pmtmr(this);
+
+   if (!pmtmr_in_range(pmtmr, ioport))
+   return -EOPNOTSUPP;
+
+   switch (ioport - pmtmr->pm_io_base) {
+   case KVM_ACPI_REG_PMTMR:
+   /* ignore PM Timer write */
+   return 0;
+   case KVM_ACPI_REG_PMEN:
+   if (len == 2) {
+   u16 val = *(u16 *)data;
+   /*
+* Fall back to qemu userspace PM Timer emulation if
+* the VM sets the 'timer carry interrupt enable' bit
+* in the PM1 Enable register.
+*/
+   if (val & KVM_ACPI_PMTMR_TMR_EN)
+   /* disable in-kernel PM Timer emulation */
+   pmtmr->state = KVM_PMTMR_STATE_DISABLED;
+   }
+   /* fall thru */
+   default:
+   /* let qemu userspace handle everything 

[RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Ulrich Obergfell

Hi,

This is an RFC through which I would like to get feedback on how the
idea of in-kernel PM Timer would be received.

The current implementation of PM Timer emulation is 'heavy-weight'
because the code resides in qemu userspace. Guest operating systems
that use PM Timer as a clock source (for example, older versions of
Linux that do not have paravirtualized clock) would benefit from an
in-kernel PM Timer emulation.

Parts 1 thru 4 of this RFC contain experimental source code which
I recently used to investigate the performance benefit. In a Linux
guest, I was running a program that calls gettimeofday() 'n' times
in a loop (the PM Timer register is read during each call). With
in-kernel PM Timer, I observed a significant reduction of program
execution time.

The experimental code emulates the PM Timer register in KVM kernel.
All other components of ACPI PM remain in qemu userspace. Also, the
'timer carry interrupt' feature is not implemented in-kernel. If a
guest operating system needs to enable the 'timer carry interrupt',
the code takes care that PM Timer emulation falls back to userspace.
However, I think the design of the code has sufficient flexibility,
so that anyone who would want to add the 'timer carry interrupt'
feature in-kernel could try to do so later on.

Please review and please comment.


Regards,

Uli Obergfell
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html